Module `eremitalpa.influenza`

Functions

def aa_counts_thru_time(df_seq: pandas.core.frame.DataFrame, site: int, ignore='-X') ‑> pandas.core.frame.DataFrame

Make a DataFrame containing monthly counts of the amino acids sampled at a particular sequence site.

Columns in the returned DataFrame are dates, the index is amino acids.

Args

df_seq: Must contain a column "aa" of amino acid sequences and a column "dt" of datetime objects giving the collection dates.
site: Count amino acids at this site. Note, this is 1-indexed.
ignore: Don't include these characters in the counts.
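
The counting described above can be sketched in plain Python. This is not the library's implementation: count_aas_by_month is a hypothetical name, and a list of (sequence, date) pairs stands in for the "aa" and "dt" columns of df_seq.

```python
from collections import Counter, defaultdict
from datetime import date


def count_aas_by_month(records, site, ignore="-X"):
    """Count amino acids at a 1-indexed site, grouped by (year, month).

    records is a hypothetical stand-in for df_seq: an iterable of
    (sequence, collection_date) pairs.
    """
    counts = defaultdict(Counter)
    for seq, dt in records:
        aa = seq[site - 1]  # convert the 1-indexed site to a 0-based index
        if aa not in ignore:
            counts[(dt.year, dt.month)][aa] += 1
    return {month: dict(counter) for month, counter in counts.items()}


records = [
    ("MKT", date(2020, 1, 5)),
    ("MRT", date(2020, 1, 20)),
    ("MKT", date(2020, 2, 1)),
    ("M-T", date(2020, 2, 9)),  # the gap at site 2 is ignored
]
monthly = count_aas_by_month(records, site=2)
```

Here monthly maps each (year, month) to the amino acid counts at site 2, which corresponds to one column of the DataFrame described above.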

def cluster_from_ha(sequence, seq_type='long')

Classify an amino acid sequence as an antigenic cluster by checking whether the sequence's Bjorn 7 sites exactly match the sites known for a cluster.

Args

sequence : str: HA amino acid sequence.
seq_type : str: "long" or "b7". If "long", sequence must contain at least the first 193 positions of HA1. If "b7", sequence should be the b7 positions only.

Raises

ValueError: If the sequence can't be classified.

Returns

(str): The name of the cluster.
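
A minimal sketch of the motif lookup, under stated assumptions. The seven positions are assumed to be HA1 positions 145, 155, 156, 158, 159, 189 and 193, which is consistent with the 193-residue requirement but is not confirmed by this page. The motif table and cluster names are made up, and b7_motif and toy_cluster_from_ha are hypothetical names.

```python
# Assumed 1-indexed HA1 positions of the seven sites (not confirmed here).
B7_POSITIONS = (145, 155, 156, 158, 159, 189, 193)

# Hypothetical motif table: motifs and cluster names are made up.
TOY_MOTIFS = {"KTHKSQS": "toy-cluster-1", "NHKEYKN": "toy-cluster-2"}


def b7_motif(sequence):
    """Extract the seven-residue motif from a long HA1 sequence."""
    if len(sequence) < max(B7_POSITIONS):
        raise ValueError("sequence must cover at least the first 193 positions")
    return "".join(sequence[p - 1] for p in B7_POSITIONS)


def toy_cluster_from_ha(sequence, seq_type="long"):
    """Look up the cluster whose motif exactly matches the sequence's motif."""
    motif = b7_motif(sequence) if seq_type == "long" else sequence
    try:
        return TOY_MOTIFS[motif]
    except KeyError:
        raise ValueError(f"can't classify motif {motif!r}") from None


# Build a toy 200-residue sequence carrying the first motif.
residues = ["A"] * 200
for position, aa in zip(B7_POSITIONS, "KTHKSQS"):
    residues[position - 1] = aa
long_sequence = "".join(residues)
```

With these toy values, toy_cluster_from_ha(long_sequence) and toy_cluster_from_ha("KTHKSQS", seq_type="b7") name the same cluster.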

def cluster_from_ha_2(sequence: str, strict_len: bool = True, max_hd: float = 10)

Classify an amino acid sequence into an antigenic cluster.

First identify clusters that have matching key residues with the sequence. If multiple clusters are found, find the one with the lowest Hamming distance to the sequence. If the resulting Hamming distance is no greater than max_hd (default 10), return the cluster.

Args

sequence (str)
strict_len : bool: See hamming_to_cluster.
max_hd : float: Queries that have matching key residues to a cluster are not classified as that cluster if the Hamming distance to the cluster consensus is > max_hd.

Returns

Cluster
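
The steps above can be sketched as follows. This is an illustration only: the toy clusters, their key residues and consensus sequences are made up, classify is a hypothetical name, and raising ValueError for an unclassifiable query is an assumption about behaviour this page does not describe.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy clusters: names, key residues (1-indexed) and consensus
# sequences are made up for illustration.
TOY_CLUSTERS = {
    "toy-1": {"key_residues": {2: "K"}, "consensus": "MKTAY"},
    "toy-2": {"key_residues": {2: "K", 4: "A"}, "consensus": "MKSAF"},
    "toy-3": {"key_residues": {2: "R"}, "consensus": "MRTAY"},
}


def classify(sequence, max_hd=10):
    # Step 1: keep clusters whose key residues all match the sequence.
    candidates = [
        name
        for name, cluster in TOY_CLUSTERS.items()
        if all(sequence[pos - 1] == aa for pos, aa in cluster["key_residues"].items())
    ]
    if not candidates:
        raise ValueError("no cluster has matching key residues")
    # Step 2: pick the candidate with the lowest Hamming distance.
    distances = {
        name: hamming(sequence, TOY_CLUSTERS[name]["consensus"]) for name in candidates
    }
    best = min(distances, key=distances.get)
    # Step 3: reject the match if it is further away than max_hd.
    if distances[best] > max_hd:
        raise ValueError(f"closest cluster {best!r} is further than {max_hd}")
    return best
```

For example, "MKSAF" matches the key residues of both toy-1 and toy-2, and the Hamming distance settles the choice in favour of toy-2.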

def clusters_with_matching_key_residues(sequence: str) ‑> list[Cluster]

List of H3N2 clusters that have matching key residues.

Args

sequence : str: Amino acid sequence. At least 193 residues long (highest numeric position of a key residue).

def extract_12_nts_around_splice_site(seq: str, donor_loc: int) ‑> str

Start 4 nts downstream of the donor location

    Start of splice donor 'AGGT' signal
    ⌄     Splice site
    |     ⌄
XXXXTTTCAG|GA...
^
Extracts 12 nts from here

def find_ns_splice_acceptor(seq: str, donor_loc=None) ‑> int

Find the 'AG' splice acceptor signal. Returns an int which is the index of the 'A'.

Notes

The acceptor should be more than 350 nts downstream of the splice donor location.

Args

seq (str)
donor_loc : int: Location of the splice donor

def find_ns_splice_donor(seq: str) ‑> int

Find the 'AGGT' splice donor signal. Returns an int which is the index of the 'A'.

Notes

In the NS1 sequences that Gabi has sent, the AGGT is on average at position 38.2.

def find_ns_splice_sites(seq: str) ‑> tuple[int, int]

Look up the splice donor and acceptor locations for an NS1 transcript.
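
A rough sketch of how the two lookups could fit together. It is not the library's code: find_donor and find_acceptor are hypothetical names, taking the first match of each signal is an assumption, and the 350 nt gap is assumed to be measured from the 'A' of the donor signal.

```python
def find_donor(seq):
    """Index of the 'A' in the first 'AGGT' (first match is an assumption)."""
    idx = seq.find("AGGT")
    if idx == -1:
        raise ValueError("no AGGT splice donor signal found")
    return idx


def find_acceptor(seq, donor_loc, min_gap=350):
    """Index of the 'A' in the first 'AG' more than min_gap nts downstream."""
    idx = seq.find("AG", donor_loc + min_gap + 1)
    if idx == -1:
        raise ValueError("no AG splice acceptor signal found")
    return idx


# Toy sequence: donor signal at index 38, acceptor 'AG' at index 442.
toy_ns = "C" * 38 + "AGGT" + "C" * 400 + "AG" + "C" * 10
donor = find_donor(toy_ns)
acceptor = find_acceptor(toy_ns, donor)
```

In the toy sequence the donor 'A' is at index 38 and the acceptor 'A' at index 442, which is 404 nts downstream.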

def findall(sub: str, string: str) ‑> list[int]

Return the indexes of all occurrences of a substring in a string.
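
One way such a search can be written is shown below. Whether the library counts overlapping occurrences is not stated here, so reporting them is an assumption, and find_all_sketch is a hypothetical name.

```python
def find_all_sketch(sub, string):
    """Return the 0-based start index of every occurrence of sub in string."""
    indexes = []
    start = string.find(sub)
    while start != -1:
        indexes.append(start)
        # Advance by one character so overlapping occurrences are also found.
        start = string.find(sub, start + 1)
    return indexes
```

For example, find_all_sketch("AG", "AGGTAG") gives [0, 4].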

def four_aas_around_splice_site(seq: str, donor_loc: int, accept_loc: int) ‑> str

What are the four amino acids either side of the splice site given a sequence, donor location and acceptor location?

def guess_clusters_in_tree(node)

If a node is in a known cluster, and all of its descendants are in the same cluster or an unknown cluster, then update all the descendant nodes to the matching cluster.
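
The rule can be illustrated on a toy tree. The Node class below is a hypothetical minimal stand-in for a dendropy node, None represents an unknown cluster, and recursing into the children when the rule does not apply at a node is an assumption.

```python
class Node:
    """Hypothetical minimal tree node; None means the cluster is unknown."""

    def __init__(self, cluster=None, children=()):
        self.cluster = cluster
        self.children = list(children)

    def descendants(self):
        for child in self.children:
            yield child
            yield from child.descendants()


def guess_clusters(node):
    descendants = list(node.descendants())
    known = node.cluster is not None
    if known and all(d.cluster in (None, node.cluster) for d in descendants):
        # Every descendant is unknown or agrees with this node: fill them in.
        for d in descendants:
            d.cluster = node.cluster
    else:
        # Assumption: otherwise apply the same rule to each child subtree.
        for child in node.children:
            guess_clusters(child)


a = Node("A", [Node(None), Node("A", [Node(None)])])
b = Node("B", [Node(None)])
root = Node(None, [a, b])
guess_clusters(root)
```

After the call, every node below a is labelled "A", the node below b is labelled "B", and the root stays unknown.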

def hamming_to_all_clusters(sequence: str, strict_len: bool = True) ‑> list[float]

The hamming distance from sequence to all known clusters.

Args

sequence (str)
strict_len : bool: See hamming_to_cluster

Returns

2-tuples containing (cluster, hamming distance)

def hamming_to_cluster(sequence: str, cluster: str | ForwardRef('Cluster'), strict_len: bool = True) ‑> float

The hamming distance from sequence to the consensus sequence of a cluster.

Args

sequence (str)
cluster (str or Cluster)
strict_len : bool: Cluster consensus sequences are for HA1 only, and are 328 residues long. If strict_len is True, the sequence is compared as given, so it must already be 328 residues long. If False, the sequence is truncated to 328 residues to match. A sequence shorter than 328 residues raises an error in either case.

Returns

int
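
A minimal sketch of the distance calculation, assuming that strict_len=True requires an exact length match and that strict_len=False truncates longer sequences. hamming_to_consensus is a hypothetical name, and the consensus is passed in directly (using its own length rather than 328) so that the example is self-contained.

```python
def hamming_to_consensus(sequence, consensus, strict_len=True):
    """Count mismatches between a sequence and a consensus sequence."""
    n = len(consensus)
    if len(sequence) < n:
        raise ValueError("sequence is shorter than the consensus")
    if strict_len and len(sequence) != n:
        raise ValueError("sequence length must match the consensus exactly")
    # With strict_len=False, anything beyond the consensus length is dropped.
    return sum(a != b for a, b in zip(sequence[:n], consensus))
```

For example, hamming_to_consensus("MKTAYQQ", "MKSAY", strict_len=False) compares only the first five residues and returns 1.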

def hamming_to_clusters(sequence: str, clusters: Iterable[str | ForwardRef('Cluster')], strict_len: bool = True) ‑> list[float]

The hamming distance from sequence to given clusters.

Args

sequence (str)
clusters (iterable)
strict_len : bool: See hamming_to_cluster

Returns

2-tuples containing (cluster, hamming distance)

def has_different_cluster_descendent(node)

Test if node has a descendent in a cluster different to its own.

Args

node (dendropy Node)

Returns

(bool)

def load_cluster_nt_consensus() ‑> dict[str, str]

Load cluster nt consensus seqs.

def ns1_splice_sites(sequence) ‑> tuple[tuple[int, int]]

Return the start and end positions of the NS1 gene in a given sequence.

Given the sequence, find the splice donor and acceptor sites, and then return the start and end positions of the NS1 gene in terms of 1-based indexing.

Returns

tuple[tuple[int, int]]: A single tuple containing the start and end positions of the NS1 gene.

def ns2_splice_sites(sequence)

Return start and end positions for the two exons of the NS2 gene.

Args

sequence : str: The sequence to extract the exons from.

Returns

tuple[tuple[int, int]]: A tuple of two tuples. The first tuple contains the start and end positions of the first exon, and the second tuple contains the start and end positions of the second exon. The positions are 1-based.

def plot_aa_freq_thru_time(t0: pandas._libs.tslibs.timestamps.Timestamp, t_end: pandas._libs.tslibs.timestamps.Timestamp, df_seq: pandas.core.frame.DataFrame, site: int, proportion=False, ax=None, ignore='X-', blank_xtick_labels=False)

def plot_subs_on_tree(tree, seq_attr, length=30, exclude_leaves=True, find_mutation_offset=0, max_mutations=20, only_these_positions=None, exclude_characters='X', either_side_trunk=True, trunk_attr='_x', **kws)

Annotate a tree with substitutions.

Args

tree (dendropy Tree)
seq_attr : str: Name of the attribute on nodes that contain the sequence.
cluster_change_only : bool: Only plot substitutions on nodes when a cluster has changed.
length : scalar: Length of the line.
exclude_leaves : bool: Don't label substitutions on branches leading to leaf nodes.
find_mutation_offset : int: See ere.find_mutations.
max_mutations : int: Annotate at most this number of mutations.
exclude_characters : str: If a mutation contains a character in this string, don't annotate it.
only_these_positions : iterable: Contains ints. Only show mutations at these positions.
either_side_trunk : bool: Plot labels both sides of the trunk.
trunk_attr : str: _x or _y. Trunk is defined as root to deepest leaf. Deepest leaf is the leaf with maximum trunk_attr.
**kws: Keyword arguments passed to plt.annotate.

def plot_tree_coloured_by_cluster(tree, legend=True, leg_kws={}, unknown_color='black', leaf_kws={}, internal_kws={}, **kws)

Plot a tree with nodes coloured according to cluster.

Args

tree : dendropy Tree: Nodes that have 'cluster' attribute will be coloured.
legend : bool: Add a legend showing the clusters.
leg_kws : dict: Keyword arguments passed to plt.legend.
unknown_color : mpl color: Color if cluster is not known.
**kws: Keyword arguments passed to plot_tree.

def splice(sequence: str, start_ends: tuple[tuple[int, int], ...], translate: bool = True) ‑> str

Splice a sequence.

Args

start_ends: Contains 2-tuples that define the start and end of coding sequences. Values are 1-indexed, and inclusive. E.g. passing (2, 4) for the sequence 'ACTGT' would return 'CTG'.
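
The 1-indexed, inclusive coordinate convention can be sketched as below. splice_sketch is a hypothetical name, and the translation step controlled by translate is left out.

```python
def splice_sketch(sequence, start_ends):
    """Join the regions given by 1-indexed, inclusive (start, end) pairs."""
    # start - 1 converts to a 0-based start; a slice end is exclusive, so the
    # 1-indexed inclusive end can be used unchanged.
    return "".join(sequence[start - 1:end] for start, end in start_ends)
```

For example, splice_sketch("ACTGT", ((2, 4),)) returns "CTG", matching the example above.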

def splice_ns(seq: str, donor_loc: int, accept_loc: int) ‑> str

Splice an NS sequence given splice donor and acceptor locations.

Args

seq (str):
donor_loc : int: Location of the AGGT. The 'AG' remains in the transcript, the 'GT' is lost.
accept_loc : int: Location of the 'AG'. The 'AG' is lost.
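
The rule for which nucleotides are kept can be sketched as follows, assuming both locations are 0-based indexes of the 'A' in each signal. splice_ns_sketch is a hypothetical name, not the library's implementation.

```python
def splice_ns_sketch(seq, donor_loc, accept_loc):
    """Remove the intron between a donor 'AGGT' and an acceptor 'AG'."""
    # Keep everything up to and including the donor 'AG'; the 'GT' is lost.
    upstream = seq[:donor_loc + 2]
    # Resume after the acceptor 'AG', which is also lost.
    downstream = seq[accept_loc + 2:]
    return upstream + downstream


# Toy sequence: donor 'AGGT' at index 2, acceptor 'AG' at index 10.
spliced = splice_ns_sketch("CCAGGTTTTTAGCC", donor_loc=2, accept_loc=10)
```

In the toy sequence the result is "CCAGCC": the donor 'AG' is kept, and everything from the 'GT' through the acceptor 'AG' is removed.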

def translate_segment(sequence: str, segment: str) ‑> dict[str, str]

Translate an influenza A segment. MP, PA, PB1 and NS all have splice variants, so simply translating the ORF of the segment would miss some proteins. This function returns a dict containing the coding sequences that are translated from a particular segment.

Args

sequence : str: The RNA sequence of the segment.
segment : str: The segment to translate. Must be one of 'HA', 'NA', 'NP', 'PB2', 'PA', 'MP' or 'PB1'.