Module eremitalpa.influenza

Functions

def aa_counts_thru_time(df_seq: pandas.core.frame.DataFrame, site: int, ignore='-X') ‑> pandas.core.frame.DataFrame

Make a DataFrame containing monthly counts of the amino acids sampled at a particular sequence site.

Columns in the returned DataFrame are dates, the index is amino acids.

Args

df_seq
Must contain the columns "aa" (amino acid sequences) and "dt" (datetime objects giving collection dates).
site
Count amino acids at this site. Note, this is 1-indexed.
ignore
Don't include these characters in the counts.
def cluster_from_ha(sequence, seq_type='long')

Classify an amino acid sequence into an antigenic cluster by checking whether the sequence's Bjorn 7 sites exactly match the sites known for a cluster.

Args

sequence : str
HA amino acid sequence.
seq_type : str
"long" or "b7". If "long", sequence must contain at least the first 193 positions of HA1. If "b7", sequence should be the b7 positions.

Raises

ValueError
If the sequence can't be classified.

Returns

(str): The name of the cluster.

def cluster_from_ha_2(sequence: str, strict_len: bool = True, max_hd: float = 10)

Classify an amino acid sequence into an antigenic cluster.

First identify clusters whose key residues match the sequence. If multiple clusters are found, find the one with the lowest hamming distance to the sequence. If the resulting hamming distance does not exceed max_hd (10 by default), return the cluster.

Args

sequence (str)
strict_len : bool
See hamming_to_cluster.
max_hd : float
Queries that have matching key residues to a cluster are not classified as that cluster if the hamming distance to the cluster consensus is > max_hd.

Returns

Cluster
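
The selection logic described for this function can be sketched roughly as follows. This is an illustration only, not the module's implementation: the toy cluster table TOY_CLUSTERS, its key residues and consensus strings, and the names hamming and classify are all hypothetical, and a plain ValueError stands in for the module's own exception classes. Ties between equally distant clusters are not handled here.

```python
# Hypothetical toy data: cluster name -> (key residues {1-based position: aa}, consensus).
TOY_CLUSTERS = {
    "A": ({2: "K"}, "MKTAA"),
    "B": ({2: "K", 4: "S"}, "MKTSA"),
    "C": ({2: "N"}, "MNTAA"),
}


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def classify(sequence: str, max_hd: float = 10) -> str:
    """Key-residue filter, then nearest consensus, then a distance cutoff."""
    # 1. Keep only clusters whose key residues all match the query.
    candidates = [
        name
        for name, (key_residues, _) in TOY_CLUSTERS.items()
        if all(sequence[pos - 1] == aa for pos, aa in key_residues.items())
    ]
    if not candidates:
        raise ValueError("no cluster has matching key residues")

    # 2. Among the candidates, pick the one with the lowest hamming distance.
    distances = {name: hamming(sequence, TOY_CLUSTERS[name][1]) for name in candidates}
    best = min(distances, key=distances.get)

    # 3. Refuse to classify if even the best candidate is too far away.
    if distances[best] > max_hd:
        raise ValueError("hamming distance too large")
    return best
```

With this toy table, "MKTSA" matches the key residues of both A and B and is assigned to B, whose consensus is closer.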

def clusters_with_matching_key_residues(sequence: str) ‑> list[Cluster]

List of H3N2 clusters that have matching key residues.

Args

sequence : str
Amino acid sequence. At least 193 residues long (highest numeric position of a key residue).
def extract_12_nts_around_splice_site(seq: str, donor_loc: int) ‑> str

Start 4 nts downstream of the donor location

    Start of splice donor 'AGGT' signal
    ⌄     Splice site
    |     ⌄
XXXXTTTCAG|GA...
^
Extracts 12 nts from here
def find_ns_splice_acceptor(seq: str, donor_loc=None) ‑> int

Find the 'AG' splice acceptor signal. Returns an int which is the index of the 'A'.

Notes

Should be >350 nts downstream of the splice donor location.

Args

seq (str)
donor_loc : int
Location of the splice donor
def find_ns_splice_donor(seq: str) ‑> int

Find the 'AGGT' splice donor signal. Returns an int which is the index of the 'A'.

Notes

In the NS1 sequences Gabi has sent, the AGGT is at position 38.2 on average.

def find_ns_splice_sites(seq: str) ‑> tuple[int, int]

Lookup the splice donor and acceptor locations for an NS1 transcript.

def findall(sub: str, string: str) ‑> list[int]

Return the indexes of all occurrences of sub in string.
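
A minimal sketch of one way to collect such indexes, using only str.find. The name findall_sketch is hypothetical, and reporting overlapping matches is an assumption; the documentation above does not say how overlaps are treated.

```python
def findall_sketch(sub: str, string: str) -> list[int]:
    """Return the start index of every occurrence of sub in string."""
    indexes = []
    start = string.find(sub)
    while start != -1:
        indexes.append(start)
        # Advance by one character so overlapping matches are also reported.
        start = string.find(sub, start + 1)
    return indexes
```

For example, searching for 'AG' in 'AGGTAG' with this sketch gives the indexes 0 and 4.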

def four_aas_around_splice_site(seq: str, donor_loc: int, accept_loc: int) ‑> str

What are the four amino acids either side of the splice site given a sequence, donor location and acceptor location?

def guess_clusters_in_tree(node)

If a node is in a known cluster, and all of its descendants are in the same cluster or an unknown cluster, then update all the descendent nodes to the matching cluster.
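
The rule can be illustrated on a toy tree. This is a sketch under stated assumptions: ToyNode stands in for a dendropy node, an unknown cluster is represented by None, and recursing into the children when the rule does not apply at a node is a guess about how the rule is applied across the tree.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a tree node with a 'cluster' attribute."""
    cluster: Optional[str] = None  # None means the cluster is unknown
    children: list["ToyNode"] = field(default_factory=list)


def descendants(node: ToyNode) -> Iterator[ToyNode]:
    """Yield every node below node (children, grandchildren, ...)."""
    for child in node.children:
        yield child
        yield from descendants(child)


def guess_clusters(node: ToyNode) -> None:
    """Propagate a known cluster down to descendants with unknown clusters."""
    below = list(descendants(node))
    known = node.cluster is not None
    if known and all(d.cluster in (None, node.cluster) for d in below):
        # Every descendant is either unknown or already agrees: fill them in.
        for d in below:
            d.cluster = node.cluster
    else:
        # Assumption: otherwise try again lower down the tree.
        for child in node.children:
            guess_clusters(child)
```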

def hamming_to_all_clusters(sequence: str, strict_len: bool = True) ‑> list[float]

The hamming distance from sequence to all known clusters.

Args

sequence (str)
strict_len : bool
See hamming_to_cluster

Returns

2-tuples containing (cluster, hamming distance)

def hamming_to_cluster(sequence: str,
cluster: str | ForwardRef('Cluster'),
strict_len: bool = True) ‑> float

The hamming distance from sequence to the consensus sequence of a cluster.

Args

sequence (str)
cluster (str or Cluster)
strict_len : bool
Cluster consensus sequences are for HA1 only, and are 328 residues long. If strict_len is True, the sequence must match this length. If False, the sequence is truncated to 328 residues to match. If a sequence is shorter than 328 residues, an error is raised regardless.

Returns int
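
A minimal sketch of the length handling, assuming that strict_len=True requires the query to be exactly as long as the consensus and that strict_len=False truncates longer queries. The function name hamming_to_consensus and the use of ValueError are hypothetical, and the consensus is passed in directly rather than looked up from a cluster.

```python
def hamming_to_consensus(sequence: str, consensus: str, strict_len: bool = True) -> int:
    """Hamming distance from sequence to consensus, with optional truncation."""
    if len(sequence) < len(consensus):
        # Too short: cannot be compared whatever strict_len is.
        raise ValueError("sequence is shorter than the consensus")
    if len(sequence) > len(consensus):
        if strict_len:
            raise ValueError("sequence is longer than the consensus")
        # Not strict: drop the extra residues from the end.
        sequence = sequence[: len(consensus)]
    # Count positions where the residues differ.
    return sum(a != b for a, b in zip(sequence, consensus))
```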

def hamming_to_clusters(sequence: str,
clusters: Iterable[str | ForwardRef('Cluster')],
strict_len: bool = True) ‑> list[float]

The hamming distance from sequence to given clusters.

Args

sequence (str)
clusters (iterable)
strict_len : bool
See hamming_to_cluster

Returns

2-tuples containing (cluster, hamming distance)

def has_different_cluster_descendent(node)

Test if node has a descendent in a cluster different to its own.

Args

node (dendropy Node)

Returns

(bool)

def load_cluster_nt_consensus() ‑> dict[str, str]

Load cluster nt consensus seqs.

def ns1_splice_sites(sequence) ‑> tuple[tuple[int, int]]

Return the start and end positions of the NS1 gene in a given sequence.

Given the sequence, find the splice donor and acceptor sites, and then return the start and end positions of the NS1 gene in terms of 1-based indexing.

Returns

tuple[tuple[int, int]]
A single tuple containing the start and end positions of the NS1 gene.
def ns2_splice_sites(sequence)

Return start and end positions for the two exons of the NS2 gene.

Args

sequence : str
The sequence to extract the exons from.

Returns

tuple[tuple[int, int]]
A tuple of two tuples. The first tuple contains the start and end positions of the first exon, and the second tuple contains the start and end positions of the second exon. The positions are 1-based.

def plot_aa_freq_thru_time(t0: pandas._libs.tslibs.timestamps.Timestamp,
t_end: pandas._libs.tslibs.timestamps.Timestamp,
df_seq: pandas.core.frame.DataFrame,
site: int,
proportion=False,
ax=None,
ignore='X-',
blank_xtick_labels=False)
def plot_subs_on_tree(tree,
seq_attr,
length=30,
exclude_leaves=True,
find_mutation_offset=0,
max_mutations=20,
only_these_positions=None,
exclude_characters='X',
either_side_trunk=True,
trunk_attr='_x',
**kws)

Annotate a tree with substitutions.

Args

tree (dendropy Tree)
seq_attr : str
Name of the attribute on nodes that contain the sequence.
cluster_change_only : bool
Only plot substitutions on nodes when a cluster has changed.
length : scalar
Length of the line.
exclude_leaves : bool
Don't label substitutions on branches leading to leaf nodes.
find_mutation_offset : int
See ere.find_mutations.
max_mutations : int
Annotate at most this number of mutations.
exclude_characters : str
If a mutation contains a character in this string, don't annotate it.
only_these_positions : iterable
Contains ints. Only show mutations at these positions.
either_side_trunk : bool
Plot labels both sides of the trunk.
trunk_attr : str
_x or _y. Trunk is defined as root to deepest leaf. Deepest leaf is the leaf with maximum trunk_attr.
**kws
Keyword arguments passed to plt.annotate.
def plot_tree_coloured_by_cluster(tree,
legend=True,
leg_kws={},
unknown_color='black',
leaf_kws={},
internal_kws={},
**kws)

Plot a tree with nodes coloured according to cluster.

Args

tree : dendropy Tree
Nodes that have 'cluster' attribute will be coloured.
legend : bool
Add a legend showing the clusters.
leg_kws : dict
Keyword arguments passed to plt.legend.
unknown_color : mpl color
Color if cluster is not known.
**kws
Keyword arguments passed to plot_tree.
def splice(sequence: str, start_ends: tuple[tuple[int, int], ...], translate: bool = True) ‑> str

Splice a sequence.

Args

start_ends
Contains 2-tuples that define the start and end of coding sequences. Values are 1-indexed, and inclusive. E.g. passing (2, 4) for the sequence 'ACTGT' would return 'CTG'.
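
The 1-indexed, inclusive convention maps onto Python slicing as sketched below. The name splice_sketch is hypothetical, and the sketch only joins the nucleotide ranges; it leaves out the translate step.

```python
def splice_sketch(sequence: str, start_ends: tuple[tuple[int, int], ...]) -> str:
    """Join the 1-indexed, inclusive (start, end) ranges of sequence."""
    # A 1-indexed inclusive range (start, end) is the slice [start - 1:end].
    return "".join(sequence[start - 1 : end] for start, end in start_ends)


print(splice_sketch("ACTGT", ((2, 4),)))         # CTG
print(splice_sketch("ACTGT", ((1, 2), (4, 5))))  # ACGT
```
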
def splice_ns(seq: str, donor_loc: int, accept_loc: int) ‑> str

Splice an NS sequence given a splice donor and acceptor locations.

Args

seq (str)
donor_loc : int
Location of the AGGT. The 'AG' remains in the transcript, the 'GT' is lost.
accept_loc : int
Location of the 'AG'. The 'AG' is lost.
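
A sketch of the splicing rule described above, assuming both locations are 0-based indexes of the 'A' in their signals. The name splice_ns_sketch and the sanity checks are hypothetical additions for illustration.

```python
def splice_ns_sketch(seq: str, donor_loc: int, accept_loc: int) -> str:
    """Remove the intron between the donor 'AGGT' and the acceptor 'AG'."""
    # Sanity checks (not described in the documentation above).
    if seq[donor_loc : donor_loc + 4] != "AGGT":
        raise ValueError("no 'AGGT' at donor_loc")
    if seq[accept_loc : accept_loc + 2] != "AG":
        raise ValueError("no 'AG' at accept_loc")
    # Keep everything up to and including the donor 'AG'; the 'GT' is lost.
    upstream = seq[: donor_loc + 2]
    # Resume immediately after the acceptor 'AG', which is also lost.
    downstream = seq[accept_loc + 2 :]
    return upstream + downstream
```

For the toy sequence 'CCAGGTTTTTAGCC' with the donor at index 2 and the acceptor at index 10, this sketch gives 'CCAGCC'.
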
def translate_segment(sequence: str, segment: str) ‑> dict[str, str]

Translate an influenza A segment. MP, PA, PB1 and NS all have splice variants, so simply translating the ORF of the segment would miss out proteins. This function returns a dictionary containing the coding sequences that are translated from a particular segment.

Args

sequence : str
The RNA sequence of the segment.
segment : str
The segment to translate. Must be one of 'HA', 'NA', 'NP', 'PB2', 'PA', 'MP' or 'PB1'.

Returns

dict[str, str]
A dictionary mapping the segment name to the translated protein sequence.
def translate_trim_default_ha(nt: str) ‑> str

Take a default HA nucleotide sequence and return an HA1 sequence.

Classes

class Cluster (cluster)

Instance variables

prop aa_sequence

Representative amino acid sequence.

prop b7_motifs
prop color
prop key_residues
prop nt_sequence : str

Representative nucleotide sequence.

prop year : int

Methods

def codon(self, n: int) ‑> str

Codon at amino acid position n. 1-indexed.
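
The 1-indexed lookup amounts to the slice sketched below. The standalone function codon_sketch is hypothetical; it takes the nucleotide sequence as an argument instead of reading it from a Cluster instance.

```python
def codon_sketch(nt_sequence: str, n: int) -> str:
    """Return the codon encoding amino acid position n (1-indexed)."""
    if n < 1:
        raise ValueError("n is 1-indexed and must be >= 1")
    # Amino acid n is encoded by nucleotides 3(n - 1) to 3n - 1 (0-based).
    start = (n - 1) * 3
    codon = nt_sequence[start : start + 3]
    if len(codon) != 3:
        raise ValueError("position n is beyond the end of the sequence")
    return codon


print(codon_sketch("ATGAAACCC", 2))  # AAA
```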

class ClusterTransition (c0: str | Cluster,
c1: str | Cluster)

A cluster transition.

Static methods

def from_tuple(c0c1) ‑> ClusterTransition

Make an instance from a tuple

Instance variables

prop preceding_transitions : Generator[ClusterTransition, None, None]

All preceding cluster transitions

class HammingDistTooLargeError (*args, **kwargs)

Common base class for all non-exit exceptions.

Ancestors

  • builtins.Exception
  • builtins.BaseException
class NHSeason (years: tuple[int, int] | ForwardRef('NHSeason'))

A northern hemisphere flu season.

Static methods

def from_datetime(dt)
class NoMatchingKeyResidues (*args, **kwargs)

Common base class for all non-exit exceptions.

Ancestors

  • builtins.Exception
  • builtins.BaseException
class TiedHammingDistances (*args, **kwargs)

Common base class for all non-exit exceptions.

Ancestors

  • builtins.Exception
  • builtins.BaseException