Module eremitalpa.influenza
Functions
def aa_counts_thru_time(df_seq: pandas.core.frame.DataFrame, site: int, ignore='-X') ‑> pandas.core.frame.DataFrame
-
Make a DataFrame containing monthly counts of the amino acids sampled at a particular sequence site.
Columns in the returned DataFrame are dates, the index is amino acids.
Args
df_seq
- Must contain the columns "aa" (amino acid sequences) and "dt" (datetime objects of collection dates).
site
- Count amino acids at this site. Note, this is 1-indexed.
ignore
- Don't include these characters in the counts.
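The counting described above can be illustrated without pandas. This is a minimal standard-library sketch, not the library's implementation: the helper name `monthly_aa_counts` and the list of (sequence, datetime) pairs standing in for the "aa" and "dt" columns are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def monthly_aa_counts(records, site, ignore="-X"):
    """Count amino acids at a 1-indexed site, grouped by (year, month).

    `records` is an iterable of (sequence, collection_datetime) pairs,
    a hypothetical stand-in for the "aa" and "dt" columns of df_seq.
    """
    counts = defaultdict(Counter)
    for aa_seq, dt in records:
        residue = aa_seq[site - 1]  # site is 1-indexed
        if residue not in ignore:
            counts[(dt.year, dt.month)][residue] += 1
    return {month: dict(c) for month, c in counts.items()}


records = [
    ("MKT", datetime(2020, 1, 5)),
    ("MRT", datetime(2020, 1, 20)),
    ("MKT", datetime(2020, 2, 3)),
    ("MXT", datetime(2020, 2, 9)),  # 'X' at site 2 is ignored
]
counts = monthly_aa_counts(records, site=2)
```

Here `counts` maps each (year, month) to the residues seen at site 2 in that month, which corresponds to one column of the returned DataFrame.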
def cluster_from_ha(sequence, seq_type='long')
-
Classify an amino acid sequence into an antigenic cluster by checking whether the sequence's Bjorn 7 sites exactly match the sites known for a cluster.
Args
sequence
:str
- HA amino acid sequence.
seq_type
:str
- "long" or "b7". If long, sequence must contain at least the first 193 positions of HA1. If b7, sequence should be the b7 positions.
Raises
ValueError
- If the sequence can't be classified.
Returns
(str): The name of the cluster.
def cluster_from_ha_2(sequence: str, strict_len: bool = True, max_hd: float = 10)
-
Classify an amino acid sequence into an antigenic cluster.
First identify clusters whose key residues match the sequence. If multiple clusters are found, pick the one with the lowest hamming distance to the sequence. If that distance does not exceed max_hd (10 by default), return the cluster.
Args
- sequence (str)
strict_len
:bool
- See hamming_to_cluster.
max_hd
:float
- Queries that have matching key residues to a cluster are not classified as a cluster if the hamming distance to the cluster consensus is > max_hd.
Returns
Cluster
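The three steps described for cluster_from_ha_2 (filter by key residues, take the closest consensus, apply a distance threshold) can be sketched as follows. The toy clusters, the `classify` helper and the use of ValueError are all hypothetical; the real consensus sequences, key residues and exception types are not shown in this reference, and tie handling is left out.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy clusters: a consensus plus key residues keyed by 1-indexed position.
TOY_CLUSTERS = {
    "A": {"consensus": "MKTAY", "key_residues": {2: "K"}},
    "B": {"consensus": "MKSAW", "key_residues": {2: "K"}},
    "C": {"consensus": "MRTAY", "key_residues": {2: "R"}},
}


def classify(sequence, max_hd=10):
    # Step 1: keep clusters whose key residues all match the query.
    candidates = [
        name
        for name, cluster in TOY_CLUSTERS.items()
        if all(sequence[pos - 1] == aa for pos, aa in cluster["key_residues"].items())
    ]
    if not candidates:
        raise ValueError("no cluster has matching key residues")
    # Step 2: pick the candidate with the lowest Hamming distance.
    distances = {
        name: hamming(sequence, TOY_CLUSTERS[name]["consensus"]) for name in candidates
    }
    best = min(distances, key=distances.get)
    # Step 3: reject the match if it is further away than max_hd.
    if distances[best] > max_hd:
        raise ValueError("closest cluster is too distant")
    return best
```

With these toy clusters, `classify("MKSAW")` returns "B" because both "A" and "B" share the key residue but "B" is the closer consensus.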
def clusters_with_matching_key_residues(sequence: str) ‑> list[Cluster]
-
List of H3N2 clusters that have matching key residues.
Args
sequence
:str
- Amino acid sequence. At least 193 residues long (highest numeric position of a key residue).
def extract_12_nts_around_splice_site(seq: str, donor_loc: int) ‑> str
-
Start 4 nts downstream of the donor location.
Start of splice donor 'AGGT' signal
⌄         Splice site
|         ⌄
XXXXTTTCAG|GA...
    ^
    Extracts 12 nts from here
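The extraction window can be sketched with plain slicing. This assumes donor_loc is a 0-based index of the first nucleotide of the donor signal, which this reference does not state; the helper name `twelve_nts_after_donor` is hypothetical.

```python
def twelve_nts_after_donor(seq, donor_loc):
    """Return the 12 nts that start 4 nts downstream of donor_loc.

    Assumes donor_loc is a 0-based index (an assumption for this sketch).
    """
    start = donor_loc + 4
    return seq[start:start + 12]


seq = "AGGTTTTCAGGACCTGAAA"
window = twelve_nts_after_donor(seq, donor_loc=0)
```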
def find_ns_splice_acceptor(seq: str, donor_loc=None) ‑> int
-
Find the 'AG' splice acceptor signal. Returns an int which is the index of the 'A'.
Notes
Should be >350 nts downstream of the splice donor location.
Args
- seq (str)
donor_loc
:int
- Location of the splice donor
def find_ns_splice_donor(seq: str) ‑> int
-
Find the 'AGGT' splice donor signal. Returns an int which is the index of the 'A'.
Notes
In the NS1 sequences supplied by Gabi, the AGGT signal is on average at position 38.2.
def find_ns_splice_sites(seq: str) ‑> tuple[int, int]
-
Lookup the splice donor and acceptor locations for an NS1 transcript.
def findall(sub: str, string: str) ‑> list[int]
-
Return the indexes of all occurrences of a substring in a string.
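One way to collect every occurrence is repeated `str.find`. This sketch reports overlapping matches, which is an assumption: whether the library's findall does the same is not stated here. The name `find_all` is hypothetical.

```python
def find_all(sub, string):
    """Indexes of every (possibly overlapping) occurrence of sub in string."""
    indexes = []
    start = string.find(sub)
    while start != -1:
        indexes.append(start)
        # Resume one character later so overlapping matches are found too.
        start = string.find(sub, start + 1)
    return indexes


ag_positions = find_all("AG", "AGGTAGAG")
```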
def four_aas_around_splice_site(seq: str, donor_loc: int, accept_loc: int) ‑> str
-
What are the four amino acids either side of the splice site given a sequence, donor location and acceptor location?
def guess_clusters_in_tree(node)
-
If a node is in a known cluster, and all of its descendants are in the same cluster or an unknown cluster, then update all the descendant nodes to the matching cluster.
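The rule above can be sketched on a toy tree. The `Node` class below is a hypothetical stand-in for a dendropy node, `None` stands for an unknown cluster, and descending into children when the rule does not apply is an assumption about how the real function walks the tree.

```python
class Node:
    """Hypothetical minimal tree node; cluster=None means unknown."""

    def __init__(self, cluster=None, children=()):
        self.cluster = cluster
        self.children = list(children)


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def guess_clusters(node):
    below = list(descendants(node))
    if node.cluster is not None and all(
        d.cluster in (None, node.cluster) for d in below
    ):
        # Every descendant is either unknown or already agrees: fill them in.
        for d in below:
            d.cluster = node.cluster
    else:
        # Otherwise try again further down the tree (an assumption).
        for child in node.children:
            guess_clusters(child)


leaf1, leaf2, leaf3 = Node(), Node("X"), Node("Z")
a = Node("X", [leaf1, leaf2])
b = Node("Y", [leaf3])
root = Node(None, [a, b])
guess_clusters(root)
```

After the call, `leaf1` is assigned to "X", while `leaf3` keeps "Z" because its parent `b` has a descendant in a different cluster.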
def hamming_to_all_clusters(sequence: str, strict_len: bool = True) ‑> list[float]
-
The hamming distance from sequence to all known clusters.
Args
- sequence (str)
strict_len
:bool
- See hamming_to_cluster
Returns
2-tuples containing (cluster, hamming distance)
def hamming_to_cluster(sequence: str,
cluster: str | ForwardRef('Cluster'),
strict_len: bool = True) ‑> float
-
The hamming distance from sequence to the consensus sequence of a cluster.
Args
- sequence (str)
- cluster (str or Cluster)
strict_len
:bool
- Cluster consensus sequences are for HA1 only, and are 328 residues long. If strict_len is True, the sequence is not truncated, so it must already match this length. If False, the sequence is truncated to 328 residues to match. A sequence shorter than 328 residues raises an error in either case.
Returns int
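The interaction between truncation and the length check can be sketched as below. This reflects one reading of strict_len (no truncation when True, so lengths must already agree) and is an assumption rather than the library's confirmed behaviour. A short toy consensus is used in place of the 328-residue HA1 consensus, and `hamming_to_consensus` is a hypothetical name.

```python
def hamming_to_consensus(sequence, consensus, strict_len=True):
    """Hamming distance from sequence to consensus.

    With strict_len=False the query is truncated to the consensus length
    first. Sequences shorter than the consensus always raise.
    """
    if not strict_len:
        sequence = sequence[:len(consensus)]
    if len(sequence) != len(consensus):
        raise ValueError(
            f"expected {len(consensus)} residues, got {len(sequence)}"
        )
    return sum(a != b for a, b in zip(sequence, consensus))


toy_consensus = "MKSAY"  # stands in for a 328-residue HA1 consensus
distance = hamming_to_consensus("MKTAYXX", toy_consensus, strict_len=False)
```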
def hamming_to_clusters(sequence: str,
clusters: Iterable[str | ForwardRef('Cluster')],
strict_len: bool = True) ‑> list[float]
-
The hamming distance from sequence to given clusters.
Args
- sequence (str)
- clusters (iterable)
strict_len
:bool
- See hamming_to_cluster
Returns
2-tuples containing (cluster, hamming distance)
def has_different_cluster_descendent(node)
-
Test if node has a descendent in a cluster different to its own.
Args
node (dendropy Node)
Returns
(bool)
def load_cluster_nt_consensus() ‑> dict[str, str]
-
Load cluster nt consensus seqs.
def ns1_splice_sites(sequence) ‑> tuple[tuple[int, int]]
-
Return the start and end positions of the NS1 gene in a given sequence.
Given the sequence, find the splice donor and acceptor sites, and then return the start and end positions of the NS1 gene in terms of 1-based indexing.
Returns
tuple[tuple[int, int]]
- A single tuple containing the start and end positions of the NS1 gene.
def ns2_splice_sites(sequence)
-
Return start and end positions for the two exons of the NS2 gene.
Args
sequence
:str
- The sequence to extract the exons from.
Returns
tuple[tuple[int, int]]
- A tuple of two tuples. The first tuple contains the start and end positions of the first exon, and the second tuple contains the start and end positions of the second exon. The positions are 1-based.
def plot_aa_freq_thru_time(t0: pandas._libs.tslibs.timestamps.Timestamp,
t_end: pandas._libs.tslibs.timestamps.Timestamp,
df_seq: pandas.core.frame.DataFrame,
site: int,
proportion=False,
ax=None,
ignore='X-',
blank_xtick_labels=False)
def plot_subs_on_tree(tree,
seq_attr,
length=30,
exclude_leaves=True,
find_mutation_offset=0,
max_mutations=20,
only_these_positions=None,
exclude_characters='X',
either_side_trunk=True,
trunk_attr='_x',
**kws)
-
Annotate a tree with substitutions.
Args
- tree (dendropy Tree)
seq_attr
:str
- Name of the attribute on nodes that contain the sequence.
cluster_change_only
:bool
- Only plot substitutions on nodes when a cluster has changed.
length
:scalar
- Length of the line.
exclude_leaves
:bool
- Don't label substitutions on branches leading to leaf nodes.
find_mutation_offset
:int
- See ere.find_mutations.
max_mutations
:int
- Annotate at most this number of mutations.
exclude_characters
:str
- If a mutation contains a character in this string, don't annotate it.
only_these_positions
:iterable
- Contains ints. Only show mutations at these positions.
either_side_trunk
:bool
- Plot labels both sides of the trunk.
trunk_attr
:str
- _x or _y. Trunk is defined as root to deepest leaf. Deepest leaf is the leaf with maximum trunk_attr.
**kws
- Keyword arguments passed to plt.annotate.
def plot_tree_coloured_by_cluster(tree,
legend=True,
leg_kws={},
unknown_color='black',
leaf_kws={},
internal_kws={},
**kws)
-
Plot a tree with nodes coloured according to cluster.
Args
tree
:dendropy Tree
- Nodes that have 'cluster' attribute will be coloured.
legend
:bool
- Add a legend showing the clusters.
leg_kws
:dict
- Keyword arguments passed to plt.legend.
unknown_color
:mpl color
- Color if cluster is not known.
**kws
- Keyword arguments passed to plot_tree.
def splice(sequence: str, start_ends: tuple[tuple[int, int], ...], translate: bool = True) ‑> str
-
Splice a sequence.
Args
start_ends
- Contains 2-tuples that define the start and end of coding sequences. Values are 1-indexed, and inclusive. E.g. passing (2, 4) for the sequence 'ACTGT' would return 'CTG'.
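The 1-indexed, inclusive convention maps onto Python slices as `sequence[start - 1:end]`. The sketch below covers only the splicing step and omits translation; `splice_regions` is a hypothetical name, not the library function.

```python
def splice_regions(sequence, start_ends):
    """Concatenate 1-indexed, inclusive (start, end) regions of sequence."""
    return "".join(sequence[start - 1:end] for start, end in start_ends)


single = splice_regions("ACTGT", ((2, 4),))
joined = splice_regions("ACTGT", ((1, 2), (4, 5)))
```

`single` reproduces the 'CTG' example given above, and `joined` shows two regions being concatenated in order.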
def splice_ns(seq: str, donor_loc: int, accept_loc: int) ‑> str
-
Splice an NS sequence given a splice donor and acceptor locations.
Args
- seq (str):
donor_loc
:int
- Location of the AGGT. The 'AG' remains in the transcript, the 'GT' is lost.
accept_loc
:int
- Location of the 'AG'. The 'AG' is lost.
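The rules for which nucleotides survive can be sketched with two slices. This assumes both locations are 0-based indexes of the 'A' in each signal, which is not confirmed here; `splice_ns_sketch` is a hypothetical name.

```python
def splice_ns_sketch(seq, donor_loc, accept_loc):
    """Join the sequence upstream of the donor 'GT' to the sequence after the acceptor 'AG'.

    Assumes donor_loc and accept_loc are 0-based indexes of the 'A'
    in 'AGGT' and 'AG' respectively (an assumption for this sketch).
    """
    upstream = seq[:donor_loc + 2]      # keeps the donor 'AG', drops 'GT' onwards
    downstream = seq[accept_loc + 2:]   # drops the acceptor 'AG'
    return upstream + downstream


#      donor 'AGGT' at index 2, acceptor 'AG' at index 10
seq = "CCAGGTNNNNAGTTT"
spliced = splice_ns_sketch(seq, donor_loc=2, accept_loc=10)
```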
def translate_segment(sequence: str, segment: str) ‑> dict[str, str]
-
Translate an influenza A segment. MP, PA, PB1 and NS all have splice variants, so simply translating the ORF of the segment would miss proteins. This function returns a dictionary of the protein sequences that are translated from a particular segment.
Args
sequence
:str
- The RNA sequence of the segment.
segment
:str
- The segment to translate. Must be one of 'HA', 'NA', 'NP', 'PB2', 'PA', 'MP' or 'PB1'.
Returns
dict[str, str]
- A dictionary mapping the segment name to the translated protein sequence.
def translate_trim_default_ha(nt: str) ‑> str
-
Take a default HA nucleotide sequence and return an HA1 sequence.
Classes
class Cluster (cluster)
-
Instance variables
prop aa_sequence
-
Representative amino acid sequence.
prop b7_motifs
prop color
prop key_residues
prop nt_sequence : str
-
Representative nucleotide sequence.
prop year : int
Methods
def codon(self, n: int) ‑> str
-
Codon at amino acid position n. 1-indexed.
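For a 1-indexed amino acid position n, the codon occupies nucleotides 3(n - 1) to 3n - 1 in 0-based terms. A minimal standalone sketch follows; `codon_at` is a hypothetical function, not the Cluster method, and it assumes the nucleotide sequence is in frame from its first base.

```python
def codon_at(nt_sequence, n):
    """Codon for 1-indexed amino acid position n of an in-frame nt sequence."""
    if n < 1:
        raise IndexError("amino acid positions are 1-indexed")
    start = (n - 1) * 3
    codon = nt_sequence[start:start + 3]
    if len(codon) != 3:
        raise IndexError(f"position {n} is beyond the end of the sequence")
    return codon


second_codon = codon_at("ATGAAACCC", 2)
```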
class ClusterTransition (c0: str | Cluster,
c1: str | Cluster)
-
A cluster transition.
Static methods
def from_tuple(c0c1) ‑> ClusterTransition
-
Make an instance from a tuple
Instance variables
prop preceding_transitions : Generator[ClusterTransition, None, None]
-
All preceding cluster transitions
class HammingDistTooLargeError (*args, **kwargs)
-
Common base class for all non-exit exceptions.
Ancestors
- builtins.Exception
- builtins.BaseException
class NHSeason (years: tuple[int, int] | ForwardRef('NHSeason'))
-
A northern hemisphere flu season.
Static methods
def from_datetime(dt)
class NoMatchingKeyResidues (*args, **kwargs)
-
Common base class for all non-exit exceptions.
Ancestors
- builtins.Exception
- builtins.BaseException
class TiedHammingDistances (*args, **kwargs)
-
Common base class for all non-exit exceptions.
Ancestors
- builtins.Exception
- builtins.BaseException