Package eremitalpa
Sub-modules
eremitalpa.bio
eremitalpa.eremitalpa
eremitalpa.flu_wider
eremitalpa.influenza
eremitalpa.lib
-
Generic library functions.
eremitalpa.scripts
Functions
def aa_counts_thru_time(df_seq: pandas.core.frame.DataFrame, site: int, ignore='-X') ‑> pandas.core.frame.DataFrame
-
Make a DataFrame containing counts of amino acids that were sampled in months at a particular sequence site.
Columns in the returned DataFrame are dates, the index is amino acids.
Args
df_seq
- Must contain columns "aa" which contains amino acid sequences and "dt" which contains datetime objects of collection dates.
site
- Count amino acids at this site. Note, this is 1-indexed.
ignore
- Don't include these characters in the counts.
def annotate_points(df: pandas.core.frame.DataFrame,
ax: matplotlib.axes._axes.Axes,
n: int = -1,
adjust: bool = True,
**kwds)-
Label (x, y) points on a matplotlib ax.
Args
df
- Pandas DataFrame with 2 columns, (x, y) respectively. Index contains the
- labels.
n
- Label this many points. Default (-1) annotates all points.
ax
- Matplotlib ax
adjust
- Use adjustText.adjust_text to try to prevent label overplotting.
**kwds
- Passed to adjustText.adjustText
def cal_months_diff(date1: pandas._libs.tslibs.timestamps.Timestamp,
date0: pandas._libs.tslibs.timestamps.Timestamp) ‑> int-
Number of calendar months between two dates (date1 - date0)
def cluster_from_ha(sequence, seq_type='long')
-
Classify an amino acid sequence as an antigenic cluster by checking whether the sequences Bjorn 7 sites match exactly sites that are known in a cluster.
Args
sequence
:str
- HA amino acid sequence.
seq_type
:str
- "long" or "b7". If long, sequence must contain at least the fist 193 positions of HA1. If b7, sequence should be the b7 positions.
Raises
ValueError
- If the sequence can't be classified.
Returns
(str): The name of the cluster.
def cluster_from_ha_2(sequence: str, strict_len: bool = True, max_hd: float = 10)
-
Classify an amino acid sequence into an antigenic cluster.
First identify clusters that have matching key residues with the sequence. If multiple clusters are found, find the one with the lowest hamming distance to the sequence. If the resulting hamming distance is less than 10, return the cluster.
Args
- sequence (str)
strict_len
:bool
- See hamming_to_cluster.
hd
:int
- Queries that have matching key residues to a cluster are not classified as a cluster if the hamming distance to the cluster consensus is > hd.
Returns
Cluster
def color_stack(tree: Tree,
values: dict[str, typing.Any],
color_dict: dict[str, str],
default_color: str | None = None,
x: float = 0,
ax: matplotlib.axes._axes.Axes | None = None,
leg_kwds: dict | None = None) ‑> tuple[matplotlib.axes._axes.Axes, matplotlib.legend.Legend]-
A stack of colored patches that can be plotted adjacent to a tree to show how values vary on the tree leaves.
Must have called eremitalpa.compute_layout on the tree in order to know y values for leaves (done anyway by eremitalpa.plot_tree).
Args
tree
- The tree to be plotted next to.
values
- Maps taxon labels to values to be plotted.
color_dict
- Maps values to colors.
default_color
- Color to use for values missing from color_dict.
x
- The x value to plot the stack at.
ax
- Matplotlib ax
def compare_trees(left,
right,
gap=0.1,
x0=0,
connect_kws={},
extend_kws={},
extend_every=10,
left_kws={},
right_kws={},
connect_colors={},
extend_colors={})-
Plot two phylogenies side by side, and join the same taxa in each tree.
Args
- left (dendropy Tree)
- right (dendropy Tree)
gap
:float
- Space between the two trees.
x0
:float
- The x coordinate of the root of the left hand tree.
connect_kws
:dict
- Keywords passed to matplotlib LineCollection. These are used for the lines that connect matching taxa.
extend_kws
:dict
- Keywords passed to matplotlib LineCollection. These are used for lines that connect taxa to the connection lines.
extend_every
:n
- Draw branch extension lines every n leaves.
left_kws
:dict
- Passed to plot_tree for the left tree.
right_kws
:dict
- Passed to plot_tree for the right tree.
connect_colors
:dict
orCallable
- Maps taxon labels to colors. Ignored if 'colors' is used in connect_kws.
extend_colors
:dict
orCallable
- Maps taxon labels to colors. Ignored if 'colors' is used in extend_kws.
Returns
(2-tuple) containing dendropy Trees with _x and _y plot locations on nodes.
def compute_errorbars(trace: arviz.data.inference_data.InferenceData,
varname: str,
hdi_prob: float = 0.95) ‑> numpy.ndarray-
Compute HDI widths for plotting with plt.errorbar.
Args
trace
- E.g. the output from pymc.sample.
varname
- Variable to compute error bars for.
hdi_prob
- Width of the HDI.
Returns
(2, n) array of the lower and upper error bar sizes for passing to plt.errorbar.
def compute_tree_layout(tree: dendropy.datamodel.treemodel._tree.Tree,
has_brlens: bool = True,
copy: bool = False) ‑> dendropy.datamodel.treemodel._tree.Tree-
Compute layout parameters for a tree.
Each node gets _x and _y values. The tree gets _xlim and _ylim values (tuples).
Args
- tree
has_brlens
- Does the tree have branch lengths?
copy
- Make a fresh copy of the tree.
def consensus_seq(seqs: Iterable[str], case_sensitive: bool = True, **kwds) ‑> str
-
Compute the consensus of sequences.
Args
seqs
- Sequences.
case_sensitive
- If False, all seqs are converted to lowercase.
error_without_strict_majority
- Raise an error if a position has a tied most common character. If set to False, a warning is raised and a single value is chosen.
def deepest_leaf(tree, attr='_x')
-
Find the deepest leaf node in the tree.
Args
- tree (dendropy Tree)
attr
:str
- Either _x or _y. Gets node with max attribute.
Returns
dendropy Node
def filter_similar_hd(sequences, n, progress_bar=False, ignore=None)
-
Iterate through sequences excluding those that have a hamming distance of less than n to a sequence already seen. Return the non-excluded sequences.
Args
sequences (iterable of str / Bio.SeqRecord) progress_bar (bool) ignore (set or None)
Returns
list
def find_mutations(*args, **kwargs)
def find_runs(arr: numpy.ndarray) ‑> tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
-
Find runs of consecutive items in an array.
Args
arr
- An array
Returns
3-tuple containing run values, starts and lengths.
def find_substitutions(a, b, offset=0)
-
Find mutations between strings a and b.
Args
a (str) b (str) offset (int)
Raises
ValueError if lengths of a an b differ.
Returns
list
oftuples. tuples are like
- ("N", 145, "K") The number indicates the 1-indexed position of the mutation. The first element is the a character. The last element is the b character.
def get_trunk(tree, attr='_x')
-
Ordered nodes in tree, from deepest leaf to root.
Args
tree (dendropy Tree) attr (str)
Returns
tuple containing dendropy Nodes
def group_sequences_by_character_at_site(seqs: dict[str, str], site: int) ‑> dict[str, str]
-
Group sequences by the character they have at a particular site.
Args
seqs
- Dict of sequence names -> sequence.
site
- 1-based.
Returns
dict containing
char at site
->sequence name
. def grouped_sample(population, n, key=None)
-
Randomly sample a population taking at most n elements from each group.
Args
- population (iterable)
n
:int
- Take at most n samples from each group.
key
:callable
- Function by which to group elements. Default (None).
Returns
list
def guess_clusters_in_tree(node)
-
If a node is in a known cluster, and all of it's descendants are in the same cluster, or an unknown cluster, then, update all the descendent nodes to the matching cluster.
def hamming_dist(a: str,
b: str,
ignore: Iterable[str] = '-X',
case_sensitive: bool = True,
per_site: bool = False) ‑> float-
The hamming distance between a and b.
Args
a
- Sequence.
b
- Sequence.
ignore
- String containing characters to ignore. If there is a mismatch where one string has a character in ignore, this does not contribute to the hamming distance.
per_site
- Divide the hamming distance by the length of a and b, minus the number of sites with ignored characters.
Returns
float
def hamming_dist_lt(a, b, n, ignore=None)
-
Test if hamming distance between a and b is less than n. This is case sensitive and does not check a and b have matching lengths.
Args
a (iterable) b (iterable) n (scalar) ignore (set or None)
Returns
bool
def hamming_to_all_clusters(sequence: str, strict_len: bool = True) ‑> list[float]
-
The hamming distance from sequence to all known clusters.
Args
- sequence (str)
strict_len
:bool
- See hamming_to_cluster
Returns
2-tuples containing (cluster, hamming distance)
def hamming_to_cluster(sequence: str,
cluster: str | ForwardRef('Cluster'),
strict_len: bool = True) ‑> float-
The hamming distance from sequence to the consensus sequence of a cluster.
Args
- sequence (str)
- cluster (str or Cluster)
strict_len
:bool
- Cluster consensus sequences are for HA1 only, and are 328 residues long. If strict_len is True, then don't check whether sequence matches this length. If False, the sequence is truncated to 328 residues to match. If a sequence is less than 328 residues then an error will still be raised.
Returns int
def load_fasta(path: str,
translate_nt: bool = False,
convert_to_upper: bool = False,
start: int = 0) ‑> dict[str, str]-
Load fasta file sequences.
Args
path
- Path to fasta file.
translate_nt
- Translate nucleotide sequences.
convert_to_upper
- Force sequences to be uppercase.
start
- The (0-based) index of the first character of each record to take. This selection is done before any translation. (Default=0).
def log_df_func(f: Callable, *args, **kwargs)
-
Callable should return a DataFrame. Report time taken to call a function, and the shape of the resulting DataFrame.
def node_x_y(nodes: Iterable[dendropy.datamodel.treemodel._node.Node],
jitter_x: float | None = None) ‑> tuple[tuple, tuple]-
x and y coordinates of nodes.
Args
nodes
:Iterable[dp.Node]
- An iterable collection of dp.Node objects.
jitter_x
:Optional[float]
- The amount of jitter to add to x coordinates. X is jittered by a quarter of this value above and below.
Returns
tuple[tuple, tuple]
- A tuple containing two tuples, the first with all x coordinates and the second with all y coordinates.
def node_x_y_from_taxon_label(tree: Tree,
taxon_label: str) ‑> tuple[float, float]-
Find the x and y attributes of a node in a tree from a taxon label.
Args
tree
- Tree
taxon_label
- str
Returns
tuple[float, float]
def pairwise_hamming_dists(collection, ignore='-X', per_site=False)
-
Compute all pairwise hamming distances between items in collection.
Args
collection (iterable)
Returns
list of hamming distances
def plot_aa_freq_thru_time(t0: pandas._libs.tslibs.timestamps.Timestamp,
t_end: pandas._libs.tslibs.timestamps.Timestamp,
df_seq: pandas.core.frame.DataFrame,
site: int,
proportion=False,
ax=None,
ignore='X-',
blank_xtick_labels=False)def plot_amino_acid_colors(ax: matplotlib.axes.Axes = None) ‑> matplotlib.axes.Axes
-
Simple plot to show amino acid colors.
def plot_leaves_with_labels(tree: dendropy.datamodel.treemodel._tree.Tree,
labels: list[str],
ax: matplotlib.axes._axes.Axes = None,
**kws)-
Plot leaves that have taxon labels in labels.
Args
- tree
labels
- Taxon labels to plot.
ax
- Matplotlib ax
**kws
- Passed to plt.scatter
def plot_subs_on_tree(tree: dendropy.datamodel.treemodel._tree.Tree,
sequences: dict[str, str],
exclude_leaves: bool = True,
site_offset: int = 0,
ignore_chars: str = 'X-',
arrow_length: float = 40,
arrow_facecolor: str = 'black',
fontsize: float = 6,
xytext_transform: tuple[float, float] = (1.0, 1.0),
**kwds) ‑> collections.Counter-
Plot substitutions on a tree. This function plots substitutions on the tree by finding substitutions between each node and its parent node. The substitutions are then plotted at the midpoint of the edge between the node and its parent node.
Args
tree
:dendropy.Tree
- The tree to annotate.
sequences
:dict[str, str]
- A mapping of node labels to sequences.
exclude_leaves
:bool
- If True, exclude leaves from getting substitutions.
site_offset
:int
- Value added to substitution sites. E.g. if site '1' is actually at index 16 in the sequences, then pass 16.
ignore_chars
:str
- Substitutions involving characters in this string will not be shown in substitutions.
arrow_length
:float
- The length of the arrow pointing to the mutation.
arrow_facecolor
:str
- The facecolor of the arrow pointing to the mutation.
fontsize
:float
- The fontsize of the text.
- xytext_transform (tuple(float, float)): Multipliers for the xytext offsets.
**kwds
- Other keyword arguments to pass to plt.annotate.
Returns
Counter containing the number of times each substitution appears in the tree.
def plot_tree(tree: dendropy.datamodel.treemodel._tree.Tree,
has_brlens: bool = True,
edge_kws: dict = {'color': 'black', 'linewidth': 0.5, 'clip_on': False, 'capstyle': 'round', 'zorder': 10},
leaf_kws: dict = {'zorder': 15, 'color': 'black', 's': 0, 'marker': 'o', 'edgecolor': 'white', 'lw': 0.1, 'clip_on': False},
internal_kws: dict = {'zorder': 12, 'color': 'black', 's': 0, 'marker': 'o', 'edgecolor': 'white', 'lw': 0.1, 'clip_on': False},
ax: matplotlib.axes._axes.Axes = None,
labels: Iterable[str] | Literal['all'] | None = None,
label_kws: dict = {'horizontalalignment': 'left', 'verticalalignment': 'center', 'fontsize': 8},
compute_layout: bool = True,
fill_dotted_lines: bool = False,
color_leaves_by_site_aa: int | None = None,
color_internal_nodes_by_site_aa: int | None = None,
sequences: dict[str, str] | None = None,
jitter_x: float | str | None = None,
scale_bar: bool | None = True) ‑> matplotlib.axes._axes.Axes-
Plot a dendropy tree object.
Tree nodes are plotted in their current order. So, to ladderize, call tree.ladderize() before plotting.
Args
- tree
has_brlens
- Does the tree have branch lengths? If not, all branch lengths are plotted length 1.
edge_kws
- Keyword arguments for edges, passed to matplotlib.collections.LineCollection
leaf_kws
- Keyword arguments for leafs, passed to ax.scatter. For arguments that can be a vector, the order and length should match tree.leaf_node_iter().
label_kwds
- Passed to plt.text.
internal_kws
- Keyword arguments for internal nodes. Passed to ax.scatter. For arguments that can be a vector, the order and length should match tree.internal_nodes().
ax
- Matplotlib ax.
labels
- Taxon labels to annotate, or "all".
compute_layout
- Compute the layout or not. If the tree nodes already have _x and _y attributes, then just plot.
fill_dotted_lines
- Show dotted lines from leaves to the right hand edge of the tree.
color_leaves_by_site_aa
- Pass an integer to color the leaves by the amino acid at this site
(1-based). This will overwrite the 'c' kwarg in leaf_kws.
sequences
must be passed. color_internal_nodes_by_site_aa
- Same behaviour as color_leaves_by_site_aa but for internal nodes.
sequences
- A mapping of taxon labels and to sequences. Required for
color_leaves_by_site_aa
. jitter_x
- Add a small amount of noise to the x value of the leaves to avoid over plotting. Either pass a float (the amount of noise) or 'auto' to try to automatically calculate a suitable value. 'auto' tries to calculate the fundamental 'unit' of branch length in the tree and then jitters x values by 1/2 of this value in either direction. See estimate_unit_branch_length for more information. Currently, positions of labels are not jittered.
scale_bar
- Show a scale bar at the bottom of the tree.
Returns
tuple containing (Tree, ax). The tree and matplotlib ax. The tree has these additional attributes:
_xlim (tuple) Min and max x value of nodes. _ylim (tuple) Min and max y value of nodes. Each node has these attributes: _x (number) X-coordinate of the nodes layout _y (number) Y-coordinate of the node's layout
def plot_tree_coloured_by_cluster(tree,
legend=True,
leg_kws={},
unknown_color='black',
leaf_kws={},
internal_kws={},
**kws)-
Plot a tree with nodes coloured according to cluster.
Args
tree
:dendropy Tree
- Nodes that have 'cluster' attribute will be coloured.
legend
:bool
- Add a legend showing the clusters.
leg_kws
:dict
- Keyword arguments passed to plt.legend.
unknown_color
:mpl color
- Color if cluster is not known.
**kws
- Keyword arguments passed to plot_tree.
def plot_tree_with_subplots(tree: Tree,
aa_seqs: dict,
site: int,
subplot_taxa_shifts: dict[str, tuple[float, float]],
fun: Callable,
fun_kwds: dict | None = None,
subplot_width: float = 0.2,
subplot_height: float = 0.1,
figsize: tuple[float, float] = (8, 12),
sharex: bool = True,
sharey: bool = True) ‑> None-
Plot a phylogeny tree with subplots for specified taxa.
This function draws a phylogeny based on a given tree and amino acid sequences. It colors leaves (and internal nodes) according to their amino acid at a specified site, and attaches additional subplots at user-defined nodes for further custom visualization.
Args
tree
- eremitalpa.Tree The phylogenetic tree to be plotted.
aa_seqs
- dict A dictionary containing amino acid sequences for each taxon. Keys should match the node names in the tree, and values should be the sequences.
site
- int The site (1-based) to color the tree's leaves and internal nodes.
subplot_taxa_shifts
- dict of str -> tuple of float A mapping from taxon names to tuples (x_shift, y_shift). These values control the position of the subplot axes relative to their respective nodes.
fun
- Callable A callable function to generate each subplot. Must accept the current taxon as the first argument and an axes object as the second argument.
fun_kwds
- dict
A dictionary of additional keyword arguments passed to the subplot function
fun
. subplot_width
- float, optional The width of each subplot in figure coordinates, by default 0.2.
subplot_height
- float, optional The height of each subplot in figure coordinates, by default 0.1.
figsize
- tuple of float, optional The overall size of the figure, by default (8, 12).
sharex
- bool, Have the sub axes share x-axes.
sharey
- bool, Have the sub axes share y-axes.
Returns
None The function draws and displays a matplotlib figure with the main phylogeny and subplots at specified taxa. It does not return any objects.
def prune_nodes_with_labels(tree, labels)
-
Prune nodes from tree that have a taxon label in labels.
Args
tree (dendropy Tree) labels (iterable containing str)
Returns
(dendropy Tree)
def read_iqtree_ancestral_states(state_file,
partition_names: list[str] | None = None,
translate_nt: bool = False) ‑> dict[slice(, dict[str, str], None)] | dict[slice( , , None)] -
Read an ancestral state file generated by IQ-TREE. If the file contains multiple partitions (i.e. a 'Part' column is present), then return a dict of dicts containing sequences accessed by [partition][node]. Otherwise return a dict of sequences accessed by node.
Args
state_file
- Path to .state file generated by iqtree –ancestral
partition_names
- Partitions are numbered from 1 in the .state file. Pass names for each segment (i.e. the order that partition_names appear in the partitions). Only takes effect if multiple partitions are present.
translate_nt
- If ancestral states are nucleotide sequences then translate them.
Returns
dict of dicts that maps [node][partition] -> sequence, or dict that maps node -> sequence.
def read_raxml_ancestral_sequences(tree, node_labelled_tree, ancestral_seqs, leaf_seqs=None)
-
Read a tree and ancestral sequences estimated by RAxML.
RAxML can estimate marginal ancestral sequences for internal nodes on a tree using a call like:
raxmlHPC -f A -t {treeFile} -s {sequenceFile} -m {model} -n {name}
The analysis outputs several files:
- RAxML_nodeLabelledRootedTree.{name} contains a copy of the input tree where all internal nodes have a unique identifier {id}.
- RAxML_marginalAncestralStates.{name} contains the ancestral sequence for each internal node. The format of each line is '{id} {sequence}'
- RAxML_marginalAncestralProbabilities.{name} contains probabilities of each base at each site for each internal node. (Not used by this function.)
Notes
Developed with output from RAxML version 8.2.12.
Args
tree
:str
- Path to original input tree ({treeFile}).
node_labelled_tree
:str
- Path to the tree with node labels. (RAxML_nodeLabelledRootedTree.{name})
ancestral_seqs
:str
- Path to file containing the ancestral sequences. (RAxML_marginalAncestralStates.{name})
leaf_seqs
:str
- (Optional) path to fasta file containing leaf sequences. ({sequenceFile}). If this is provided, also attach sequences to leaf nodes.
Returns
(dendropy Tree) with sequences attached to nodes. Sequences are attached as 'sequence' attributes on Nodes.
def sloppy_translate(sequence)
-
Translate a nucleotide sequence.
Don't check that the sequence length is a multiple of three. If any 'codon' contains any character not in [ACTG] then return X.
Args
sequence
:str
- Lower or upper case.
Returns
(str)
def split_pairs(values: Iterable, separation: float = 1.0) ‑> list
-
If values are repeated, e.g.:
1, 5, 5, 8
Then 'split' them by adding and subtracting half of
separation
(default=1.0 -> 0.5) from each item in the pair:1, 4.5, 5.5, 8
def spread_points(points: Iterable[float],
tol: float = 0.0001,
maxiter: int = 100,
repel: float = 0.5,
attract: float = 0.1) ‑>-
Spread out 1D points. Imagines points are attracted to their starting poisitions (with
attract
force constant) and are repelled from other points with a force proportional to the inverse of their cubed distance (multiplied byrepel
). Iteratively updates poisitions until eithermaxiter
is reached or the sum of the squared differences between positions in successive iterations falls belowtol
. def taxon_in_node_label(label, node)
-
True if a node has a matching taxon label
def taxon_in_node_labels(labels, node)
-
True if node has taxon label in labels, else False
def translate_segment(sequence: str, segment: str) ‑> dict[str, str]
-
Transcribe an influenza A segment. MP, PA, PB1 and NS all have splice variants, so simply transcribing the ORF of the segment would miss out proteins. This function returns a list containing coding sequences that are transcribed from a particular segment.
Args
sequence
:str
- The RNA sequence of the segment.
segment
:str
- The segment to translate. Must be one of 'HA', 'NA', 'NP', 'PB2', 'PA', 'MP' or 'PB1'.
Returns
dict[str, str]
- A dictionary mapping the segment name to the translated protein sequence.
def translate_trim_default_ha(nt: str) ‑> str
-
Take a default HA nucleotide sequence and return an HA1 sequence.
def variable_sites(seq: pandas.core.series.Series,
max_biggest_prop: float = 0.95,
ignore: str = '-X') ‑> Generator[int, None, None]-
Find variable sites among sequences. Returns 1-indexed sites.
Args
seq
- Sequences.
max_biggest_prop
- Don't include sites where a single character has a proportion above this.
ignore
- Characters to exclude when calculating proportions.
def write_fasta(path: str, records: dict[str, str]) ‑> None
-
Write a fasta file.
Args
path
- Path to fasta file to write.
records
- A dict, the keys will become fasta headers, values will be sequences.
Classes
class Cluster (cluster)
-
Instance variables
prop aa_sequence
-
Representative amino acid sequence.
prop b7_motifs
prop color
prop key_residues
prop nt_sequence : str
-
Representative nucleotide sequence.
prop year : int
Methods
def codon(self, n: int) ‑> str
-
Codon at amino acid position n. 1-indexed.
class ClusterTransition (c0: str | Cluster,
c1: str | Cluster)-
A cluster transition.
Static methods
def from_tuple(c0c1) ‑> ClusterTransition
-
Make an instance from a tuple
Instance variables
prop preceding_transitions : Generator[ClusterTransition, None, None]
-
All preceding cluster transitions
class MultipleSequenceAlignment (records, alphabet=None, annotations=None, column_annotations=None)
-
Represents a classical multiple sequence alignment (MSA).
By this we mean a collection of sequences (usually shown as rows) which are all the same length (usually with gap characters for insertions or padding). The data can then be regarded as a matrix of letters, with well defined columns.
You would typically create an MSA by loading an alignment file with the AlignIO module:
>>> from Bio import AlignIO >>> align = AlignIO.read("Clustalw/opuntia.aln", "clustal") >>> print(align) Alignment with 7 rows and 156 columns TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191
In some respects you can treat these objects as lists of SeqRecord objects, each representing a row of the alignment. Iterating over an alignment gives the SeqRecord object for each row:
>>> len(align) 7 >>> for record in align: ... print("%s %i" % (record.id, len(record))) ... gi|6273285|gb|AF191659.1|AF191 156 gi|6273284|gb|AF191658.1|AF191 156 gi|6273287|gb|AF191661.1|AF191 156 gi|6273286|gb|AF191660.1|AF191 156 gi|6273290|gb|AF191664.1|AF191 156 gi|6273289|gb|AF191663.1|AF191 156 gi|6273291|gb|AF191665.1|AF191 156
You can also access individual rows as SeqRecord objects via their index:
>>> print(align[0].id) gi|6273285|gb|AF191659.1|AF191 >>> print(align[-1].id) gi|6273291|gb|AF191665.1|AF191
And extract columns as strings:
>>> print(align[:, 1]) AAAAAAA
Or, take just the first ten columns as a sub-alignment:
>>> print(align[:, :10]) Alignment with 7 rows and 10 columns TATACATTAA gi|6273285|gb|AF191659.1|AF191 TATACATTAA gi|6273284|gb|AF191658.1|AF191 TATACATTAA gi|6273287|gb|AF191661.1|AF191 TATACATAAA gi|6273286|gb|AF191660.1|AF191 TATACATTAA gi|6273290|gb|AF191664.1|AF191 TATACATTAA gi|6273289|gb|AF191663.1|AF191 TATACATTAA gi|6273291|gb|AF191665.1|AF191
Combining this alignment slicing with alignment addition allows you to remove a section of the alignment. For example, taking just the first and last ten columns:
>>> print(align[:, :10] + align[:, -10:]) Alignment with 7 rows and 20 columns TATACATTAAGTGTACCAGA gi|6273285|gb|AF191659.1|AF191 TATACATTAAGTGTACCAGA gi|6273284|gb|AF191658.1|AF191 TATACATTAAGTGTACCAGA gi|6273287|gb|AF191661.1|AF191 TATACATAAAGTGTACCAGA gi|6273286|gb|AF191660.1|AF191 TATACATTAAGTGTACCAGA gi|6273290|gb|AF191664.1|AF191 TATACATTAAGTATACCAGA gi|6273289|gb|AF191663.1|AF191 TATACATTAAGTGTACCAGA gi|6273291|gb|AF191665.1|AF191
Note - This object does NOT attempt to model the kind of alignments used in next generation sequencing with multiple sequencing reads which are much shorter than the alignment, and where there is usually a consensus or reference sequence with special status.
Initialize a new MultipleSeqAlignment object.
Arguments: - records - A list (or iterator) of SeqRecord objects, whose sequences are all the same length. This may be an empty list. - alphabet - For backward compatibility only; its value should always be None. - annotations - Information about the whole alignment (dictionary). - column_annotations - Per column annotation (restricted dictionary). This holds Python sequences (lists, strings, tuples) whose length matches the number of columns. A typical use would be a secondary structure consensus string.
You would normally load a MSA from a file using Bio.AlignIO, but you can do this from a list of SeqRecord objects too:
>>> from Bio.Seq import Seq >>> from Bio.SeqRecord import SeqRecord >>> from Bio.Align import MultipleSeqAlignment >>> a = SeqRecord(Seq("AAAACGT"), id="Alpha") >>> b = SeqRecord(Seq("AAA-CGT"), id="Beta") >>> c = SeqRecord(Seq("AAAAGGT"), id="Gamma") >>> align = MultipleSeqAlignment([a, b, c], ... annotations={"tool": "demo"}, ... column_annotations={"stats": "CCCXCCC"}) >>> print(align) Alignment with 3 rows and 7 columns AAAACGT Alpha AAA-CGT Beta AAAAGGT Gamma >>> align.annotations {'tool': 'demo'} >>> align.column_annotations {'stats': 'CCCXCCC'}
Ancestors
- Bio.Align.MultipleSeqAlignment
Methods
def plot(self,
ax: matplotlib.axes._axes.Axes | None = None,
fontsize: int = 6,
variable_sites_kwds: dict | None = None,
rotate_xtick_labels: bool = False,
sites: Iterable[int] | None = None) ‑> matplotlib.axes._axes.Axes-
Plot variable sites in the alignment.
Args
ax
- Matplotlib ax.
fontsize
- Fontsize of the character labels.
variable_sites_kwds
- Passed to MultipleSequenceAlignment.variable_sites.
rotate_xtick_labels
- Rotate the xtick labels 90 degrees.
sites
- Only plot these sites. (Note: Only variable sites are plotted, so if a site is passed in this argument but it is not variable it will not be displayed.)
def variable_sites(self, min_2nd_most_freq: int = 1) ‑> Generator[Column, None, None]
-
Generator for variable sites in the alignment.
Args
min_2nd_most_freq
- Used to filter out sites that have low variability. For instance if min_2nd_most_freq is 2 a column containing 'AAAAT' should be excluded because the second most frequent character (T) has a frequency of 1.
class NHSeason (years: tuple[int, int] | ForwardRef('NHSeason'))
-
A northern hemisphere flu season.
Static methods
def from_datetime(dt)
class Substitution (*args)
-
Change of a character at a site.
Instantiate using either 1 or three arguments: Substitution("N145K") or Substitution("N", 145, "K")
class TiedCounter (iterable=None, /, **kwds)
-
Dict subclass for counting hashable items. Sometimes called a bag or multiset. Elements are stored as dictionary keys and their counts are stored as dictionary values.
>>> c = Counter('abcdeabcdabcaba') # count elements from a string
>>> c.most_common(3) # three most common elements [('a', 5), ('b', 4), ('c', 3)] >>> sorted(c) # list all unique elements ['a', 'b', 'c', 'd', 'e'] >>> ''.join(sorted(c.elements())) # list elements with repetitions 'aaaaabbbbcccdde' >>> sum(c.values()) # total of all counts 15
>>> c['a'] # count of letter 'a' 5 >>> for elem in 'shazam': # update counts from an iterable ... c[elem] += 1 # by adding 1 to each element's count >>> c['a'] # now there are seven 'a' 7 >>> del c['b'] # remove all 'b' >>> c['b'] # now there are zero 'b' 0
>>> d = Counter('simsalabim') # make another counter >>> c.update(d) # add in the second counter >>> c['a'] # now there are nine 'a' 9
>>> c.clear() # empty the counter >>> c Counter()
Note: If a count is set to zero or reduced to zero, it will remain in the counter until the entry is deleted or the counter is cleared:
>>> c = Counter('aaabbc') >>> c['b'] -= 2 # reduce the count of 'b' by two >>> c.most_common() # 'b' is still in, but its count is zero [('a', 3), ('c', 1), ('b', 0)]
Create a new, empty Counter object. And if given, count elements from an input iterable. Or, initialize the count from another mapping of elements to their counts.
>>> c = Counter() # a new, empty counter >>> c = Counter('gallahad') # a new counter from an iterable >>> c = Counter({'a': 4, 'b': 2}) # a new counter from a mapping >>> c = Counter(a=4, b=2) # a new counter from keyword args
Ancestors
- collections.Counter
- builtins.dict
Methods
def most_common(self, n: int | None = None) ‑> list[tuple[typing.Any, int]]
-
If n=1 and there are more than one item that has the maximum count, return all of them, not just one. If n is not 1, do the same thing as normal Counter.most_common.
class Tree (*args, **kwargs)
-
An arborescence, i.e. a fully-connected directed acyclic graph with all edges directing away from the root and toward the tips. The "root" of the tree is represented by the :attr:
Tree.seed_node
attribute. In unrooted trees, this node is an algorithmic artifact. In rooted trees this node is semantically equivalent to the root.The constructor can optionally construct a |Tree| object by cloning another |Tree| object passed as the first positional argument, or out of a data source if
stream
andschema
keyword arguments are passed with a file-like object and a schema-specification string object values respectively.Parameters
*args : positional argument, optional If given, should be exactly one |Tree| object. The new |Tree| will then be a structural clone of this argument.
**kwargs : keyword arguments, optional The following optional keyword arguments are recognized and handled by this constructor:
<code>label</code> The label or description of the new |Tree| object. <code>taxon\_namespace</code> Specifies the |TaxonNamespace| object to be that the new |Tree| object will reference.
Examples
Tree objects can be instantiated in the following ways::
# /usr/bin/env python try: from StringIO import StringIO except ImportError: from io import StringIO from dendropy import Tree, TaxonNamespace # empty tree t1 = Tree() # Tree objects can be instantiated from an external data source # using the 'get()' factory class method # From a file-like object t2 = Tree.get(file=open('treefile.tre', 'r'), schema="newick", tree_offset=0) # From a path t3 = Tree.get(path='sometrees.nexus', schema="nexus", collection_offset=2, tree_offset=1) # From a string s = "((A,B),(C,D));((A,C),(B,D));" # tree will be '((A,B),(C,D))' t4 = Tree.get(data=s, schema="newick") # tree will be '((A,C),(B,D))' t5 = Tree.get(data=s, schema="newick", tree_offset=1) # passing keywords to underlying tree parser t7 = dendropy.Tree.get( data="((A,B),(C,D));", schema="newick", taxon_namespace=t3.taxon_namespace, suppress_internal_node_taxa=False, preserve_underscores=True) # Tree objects can be written out using the 'write()' method. t1.write(file=open('treefile.tre', 'r'), schema="newick") t1.write(path='treefile.nex', schema="nexus") # Or returned as a string using the 'as_string()' method. s = t1.as_string("nexml") # tree structure deep-copied from another tree t8 = dendropy.Tree(t7) assert t8 is not t7 # Trees are distinct assert t8.symmetric_difference(t7) == 0 # and structure is identical assert t8.taxon_namespace is t7.taxon_namespace # BUT taxa are not cloned. nds3 = [nd for nd in t7.postorder_node_iter()] # Nodes in the two trees nds4 = [nd for nd in t8.postorder_node_iter()] # are distinct objects, for i, n in enumerate(nds3): # and can be manipulated assert nds3[i] is not nds4[i] # independentally. egs3 = [eg for eg in t7.postorder_edge_iter()] # Edges in the two trees egs4 = [eg for eg in t8.postorder_edge_iter()] # are also distinct objects, for i, e in enumerate(egs3): # and can also be manipulated assert egs3[i] is not egs4[i] # independentally. lves7 = t7.leaf_nodes() # Leaf nodes in the two trees lves8 = t8.leaf_nodes() # are also distinct objects, for i, lf in enumerate(lves3): # but order is the same, assert lves7[i] is not lves8[i] # and associated Taxon objects assert lves7[i].taxon is lves8[i].taxon # are the same. # To create deep copy of a tree with a different taxon namespace, # Use 'copy.deepcopy()' t9 = copy.deepcopy(t7) # Or explicitly pass in a new TaxonNamespace instance taxa = TaxonNamespace() t9 = dendropy.Tree(t7, taxon_namespace=taxa) assert t9 is not t7 # As above, the trees are distinct assert t9.symmetric_difference(t7) == 0 # and the structures are identical, assert t9.taxon_namespace is not t7.taxon_namespace # but this time, the taxa *are* different assert t9.taxon_namespace is taxa # as the given TaxonNamespace is used instead. lves3 = t7.leaf_nodes() # Leaf nodes (and, for that matter other nodes lves5 = t9.leaf_nodes() # as well as edges) are also distinct objects for i, lf in enumerate(lves3): # and the order is the same, as above, assert lves7[i] is not lves9[i] # but this time the associated Taxon assert lves7[i].taxon is not lves9[i].taxon # objects are distinct though the taxon assert lves7[i].taxon.label == lves9[i].taxon.label # labels are the same. # to 'switch out' the TaxonNamespace of a tree, replace the reference and # reindex the taxa: t11 = Tree.get(data='((A,B),(C,D));', 'newick') taxa = TaxonNamespace() t11.taxon_namespace = taxa t11.reindex_subcomponent_taxa() # You can also explicitly pass in a seed node: seed = Node(label="root") t12 = Tree(seed_node=seed) assert t12.seed_node is seed
Ancestors
- dendropy.datamodel.treemodel._tree.Tree
- dendropy.datamodel.taxonmodel.TaxonNamespaceAssociated
- dendropy.datamodel.basemodel.Annotable
- dendropy.datamodel.basemodel.Deserializable
- dendropy.datamodel.basemodel.NonMultiReadable
- dendropy.datamodel.basemodel.Serializable
- dendropy.datamodel.basemodel.DataObject
Static methods
def from_disk(path: str,
schema: str = 'newick',
preserve_underscores: bool = True,
outgroup: str | None = None,
msa_path: str | None = None,
get_kwds: dict | None = None,
**kwds) ‑> Tree-
Load a tree from a file.
Args
path
- Path to file containing tree.
schema
- See dendropy.Tree.get
preserve_underscores
- Preserve underscores in taxon labels. (Overwrites 'preserve_underscores' key if passed in get_kwds.)
outgroup
- Name of taxon to use as outgroup.
msa_path
- Path to fasta file containing leaf sequences.
get_kwds
- Passed to dendropy.Tree.get.
kwds
- Passed to add_sequences_to_tree
Instance variables
prop multiple_sequence_alignment
-
Generate an eremitalpa.MultipleSequence alignment object from a tree. Leaf nodes on the tree must have 'sequence' attributes and taxon labels.
Methods
def plot_tree_msa(self,
msa_plot_kwds: dict | None = None,
axes: tuple[matplotlib.axes._axes.Axes, matplotlib.axes._axes.Axes] | None = None) ‑> tuple[matplotlib.axes._axes.Axes, matplotlib.axes._axes.Axes]-
Plot the tree and multiple sequence alignment.