Package `eremitalpa`

Sub-modules

eremitalpa.bio
eremitalpa.eremitalpa
eremitalpa.flu_wider
eremitalpa.influenza
eremitalpa.lib: Generic library functions.
eremitalpa.scripts

Functions

def aa_counts_thru_time(df_seq: pandas.core.frame.DataFrame, site: int, ignore='-X') ‑> pandas.core.frame.DataFrame

Make a DataFrame containing monthly counts of the amino acids sampled at a particular sequence site.

Columns in the returned DataFrame are dates, the index is amino acids.

Args

df_seq: Must contain an "aa" column of amino acid sequences and a "dt" column of collection dates (datetime objects).
site: Count amino acids at this site. Note, this is 1-indexed.
ignore: Don't include these characters in the counts.
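
The counting described above can be sketched in plain Python. This is an illustration only, not the library's pandas implementation: the function name aa_counts_by_month and the (sequence, date) pair input are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date


def aa_counts_by_month(records, site, ignore="-X"):
    """records: iterable of (aa_sequence, collection_date) pairs (hypothetical input)."""
    counts = defaultdict(Counter)  # (year, month) -> Counter of amino acids
    for aa, dt in records:
        residue = aa[site - 1]  # site is 1-indexed
        if residue not in ignore:
            counts[(dt.year, dt.month)][residue] += 1
    return {month: dict(c) for month, c in counts.items()}


records = [
    ("MKT", date(2020, 1, 3)),
    ("MRT", date(2020, 1, 20)),
    ("MKT", date(2020, 2, 1)),
    ("M-T", date(2020, 2, 9)),  # gap at site 2 is ignored
]
monthly = aa_counts_by_month(records, site=2)
```

Here monthly maps each (year, month) to the residue counts at site 2; the real function returns the same information as a DataFrame with dates as columns and amino acids as the index.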

def annotate_points(df: pandas.core.frame.DataFrame, ax: matplotlib.axes._axes.Axes, n: int = -1, adjust: bool = True, **kwds)

Label (x, y) points on a matplotlib ax.

Args

df: Pandas DataFrame with 2 columns, (x, y) respectively. Index contains the labels.
n: Label this many points. Default (-1) annotates all points.
ax: Matplotlib ax
adjust: Use adjustText.adjust_text to try to prevent label overplotting.
**kwds: Passed to adjustText.adjust_text

def cal_months_diff(date1: pandas._libs.tslibs.timestamps.Timestamp, date0: pandas._libs.tslibs.timestamps.Timestamp) ‑> int

Number of calendar months between two dates (date1 - date0)
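
A minimal sketch of one plausible reading of "calendar months", assuming the day of the month is ignored and only year and month are compared. The function name calendar_months_between is hypothetical, and the library may handle edge cases differently.

```python
from datetime import date


def calendar_months_between(date1, date0):
    """Calendar months from date0 to date1, ignoring the day of the month (assumption)."""
    return (date1.year - date0.year) * 12 + (date1.month - date0.month)


diff = calendar_months_between(date(2021, 3, 1), date(2020, 12, 31))
```

Under this reading, 31 December 2020 to 1 March 2021 counts as three calendar months even though only about two months of time have elapsed.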

def cluster_from_ha(sequence, seq_type='long')

Classify an amino acid sequence into an antigenic cluster by checking whether the sequence's Bjorn 7 sites exactly match the sites known for a cluster.

Args

sequence : str: HA amino acid sequence.
seq_type : str: "long" or "b7". If long, sequence must contain at least the first 193 positions of HA1. If b7, sequence should be the b7 positions.

Raises

ValueError: If the sequence can't be classified.

Returns

(str): The name of the cluster.

def cluster_from_ha_2(sequence: str, strict_len: bool = True, max_hd: float = 10)

Classify an amino acid sequence into an antigenic cluster.

First identify clusters whose key residues match the sequence. If multiple clusters match, choose the one with the lowest Hamming distance to the sequence. Return that cluster only if its Hamming distance does not exceed max_hd (default 10).

Args

sequence (str)
strict_len : bool: See hamming_to_cluster.
max_hd : float: Queries that have matching key residues to a cluster are not classified as that cluster if the Hamming distance to the cluster consensus is > max_hd.

Returns

Cluster
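
The two-stage classification described above might look roughly like the following. Everything here is illustrative: TOY_CLUSTERS, its key residues and consensus sequences are invented toy data, and raising ValueError when nothing qualifies is an assumption (the documentation does not say what happens in that case).

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy clusters: key residues (1-based site -> amino acid) and a consensus.
TOY_CLUSTERS = {
    "C1": {"key_residues": {2: "K"}, "consensus": "MKTAG"},
    "C2": {"key_residues": {2: "K", 4: "S"}, "consensus": "MKTSG"},
    "C3": {"key_residues": {2: "R"}, "consensus": "MRTAG"},
}


def classify_sketch(sequence, clusters=TOY_CLUSTERS, max_hd=10):
    # Stage 1: keep clusters whose key residues all match the query.
    candidates = [
        name
        for name, c in clusters.items()
        if all(sequence[site - 1] == aa for site, aa in c["key_residues"].items())
    ]
    if not candidates:
        raise ValueError("no cluster has matching key residues")
    # Stage 2: among the candidates, pick the closest consensus.
    best = min(candidates, key=lambda name: hamming(sequence, clusters[name]["consensus"]))
    if hamming(sequence, clusters[best]["consensus"]) > max_hd:
        raise ValueError("closest cluster is further than max_hd")
    return best


best_cluster = classify_sketch("MKTSA")
```

The query matches the key residues of both C1 and C2, and C2 wins because its consensus differs at one site rather than two.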

def color_stack(tree: Tree, values: dict[str, typing.Any], color_dict: dict[str, str], default_color: str | None = None, x: float = 0, ax: matplotlib.axes._axes.Axes | None = None, leg_kwds: dict | None = None) ‑> tuple[matplotlib.axes._axes.Axes, matplotlib.legend.Legend]

A stack of colored patches that can be plotted adjacent to a tree to show how values vary on the tree leaves.

Must have called eremitalpa.compute_tree_layout on the tree in order to know y values for leaves (done anyway by eremitalpa.plot_tree).

Args

tree: The tree to be plotted next to.
values: Maps taxon labels to values to be plotted.
color_dict: Maps values to colors.
default_color: Color to use for values missing from color_dict.
x: The x value to plot the stack at.
ax: Matplotlib ax

def compare_trees(left, right, gap=0.1, x0=0, connect_kwds={}, extend_kwds={}, extend_every=10, left_kwds={}, right_kwds={}, connect_colors={}, extend_colors={})

Plot two phylogenies side by side, and join the same taxa in each tree.

Args

left (dendropy Tree)
right (dendropy Tree)
gap : float: Space between the two trees.
x0 : float: The x coordinate of the root of the left hand tree.
connect_kwds : dict: Keywords passed to matplotlib LineCollection. These are used for the lines that connect matching taxa.
extend_kwds : dict: Keywords passed to matplotlib LineCollection. These are used for lines that connect taxa to the connection lines.
extend_every : int: Draw branch extension lines every extend_every leaves.
left_kwds : dict: Passed to plot_tree for the left tree.
right_kwds : dict: Passed to plot_tree for the right tree.
connect_colors : dict or Callable: Maps taxon labels to colors. Ignored if 'colors' is used in connect_kwds.
extend_colors : dict or Callable: Maps taxon labels to colors. Ignored if 'colors' is used in extend_kwds.

Returns

(2-tuple) containing dendropy Trees with _x and _y plot locations on nodes.

def compute_errorbars(trace: arviz.data.inference_data.InferenceData, varname: str, hdi_prob: float = 0.95) ‑> numpy.ndarray

Compute HDI widths for plotting with plt.errorbar.

Args

trace: E.g. the output from pymc.sample.
varname: Variable to compute error bars for.
hdi_prob: Width of the HDI.

Returns

(2, n) array of the lower and upper error bar sizes for passing to plt.errorbar.
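
plt.errorbar expects error bar sizes relative to the plotted point, not absolute interval bounds. A minimal sketch of that conversion, assuming the bars are centered on some point estimate such as the posterior mean (the source does not say which center the library uses; errorbar_sizes is a hypothetical helper):

```python
def errorbar_sizes(centers, hdi_low, hdi_high):
    """Turn absolute HDI bounds into [lower sizes, upper sizes] relative to centers."""
    lower = [c - lo for c, lo in zip(centers, hdi_low)]
    upper = [hi - c for c, hi in zip(centers, hdi_high)]
    return [lower, upper]


centers = [1.0, 2.0]
sizes = errorbar_sizes(centers, hdi_low=[0.5, 1.0], hdi_high=[2.0, 2.5])
# sizes has shape (2, n) and could be passed as plt.errorbar(x, centers, yerr=sizes)
```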

def compute_tree_layout(tree: dendropy.datamodel.treemodel._tree.Tree, has_brlens: bool = True, copy: bool = False, round_brlens: int | None = None) ‑> dendropy.datamodel.treemodel._tree.Tree

Computes layout parameters for a tree.

Each node gets _x and _y values. The tree gets _xlim and _ylim values (tuples).

Args

tree : dp.Tree: The tree to lay out.
has_brlens : bool: Whether the tree has branch lengths.
copy : bool: If True, a fresh copy of the tree is made.
round_brlens : int, optional: The number of digits to round branch lengths to. Defaults to None.

Returns

dp.Tree: The tree with layout parameters.

def consensus_seq(seqs: Iterable[str], case_sensitive: bool = True, **kwds) ‑> str

Computes the consensus of a set of sequences.

Args

seqs : Iterable[str]: The sequences to compute the consensus from.
case_sensitive : bool: If False, all sequences are converted to lowercase.
**kwds: Additional keyword arguments passed to _generate_consensus_chars.

Returns

str: The consensus sequence.
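
A column-wise majority vote is one simple way to compute a consensus. The sketch below assumes equal-length sequences and breaks ties by first occurrence; how the library resolves ties (and what _generate_consensus_chars accepts) is not documented here, so treat this as an illustration rather than its behavior.

```python
from collections import Counter


def consensus_sketch(seqs, case_sensitive=True):
    seqs = list(seqs)
    if not case_sensitive:
        seqs = [s.lower() for s in seqs]
    # zip(*seqs) yields one tuple of characters per alignment column
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


consensus = consensus_sketch(["ACGT", "ACGA", "TCGA"])
```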

def deepest_leaf(tree, attr='_x')

Find the deepest leaf node in the tree.

Args

tree (dendropy Tree)
attr : str: Either _x or _y. Gets node with max attribute.

Returns

dendropy Node

def filter_similar_hd(sequences, n, progress_bar=False, ignore=None, case_sensitive=False) ‑> list

Filters sequences based on Hamming distance.

Iterates through sequences, excluding those that have a Hamming distance of less than n to a sequence already seen.

Args

sequences : iterable[str | Bio.SeqRecord]: The sequences to filter.
n : int: The Hamming distance threshold.
progress_bar : bool: Whether to display a progress bar.
ignore : set, optional: Characters to ignore during comparison. Defaults to None.
case_sensitive : bool: Whether the comparison is case-sensitive.

Returns

list: The filtered sequences.
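
The greedy filtering described above can be sketched as follows. It assumes "already seen" means sequences that have been kept so far, and it omits the ignore, case_sensitive and progress_bar options. The names hd_lt and filter_similar_sketch are hypothetical.

```python
def hd_lt(a, b, n):
    """True if a and b differ at fewer than n positions; stops counting early."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches >= n:
                return False
    return mismatches < n


def filter_similar_sketch(sequences, n):
    kept = []
    for seq in sequences:
        # Drop seq if it is within n - 1 mismatches of anything already kept.
        if not any(hd_lt(seq, seen, n) for seen in kept):
            kept.append(seq)
    return kept


filtered = filter_similar_sketch(["AAAA", "AAAT", "AATT", "TTTT"], n=2)
```

Because the filter is greedy, the result depends on input order: the first member of each group of near-identical sequences is the one that survives.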

def find_mutations(*args, **kwargs)

def find_runs(arr: numpy.ndarray) ‑> tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Find runs of consecutive items in an array.

Args

arr: An array

Returns

3-tuple containing run values, starts and lengths.
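
To make the return value concrete, here is a pure-Python sketch of run finding using itertools.groupby. It assumes starts are 0-based indices and returns lists rather than NumPy arrays; find_runs_sketch is a hypothetical name.

```python
from itertools import groupby


def find_runs_sketch(items):
    values, starts, lengths = [], [], []
    position = 0
    for value, group in groupby(items):
        length = sum(1 for _ in group)
        values.append(value)
        starts.append(position)
        lengths.append(length)
        position += length
    return values, starts, lengths


run_values, run_starts, run_lengths = find_runs_sketch([1, 1, 2, 2, 2, 3])
```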

def find_substitutions(a, b, offset=0, ignore='X-')

Find mutations between strings a and b.

Args

a : str: The first string.
b : str: The second string.
offset : int: An offset to be added to the mutation position.
ignore : str: Ignore substitution if these characters are involved.

Raises

ValueError: If lengths of a and b differ.

Returns

tuple[Substitution, …]: A tuple of Substitution objects.
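
A sketch of the comparison, under two assumptions that the source does not confirm: positions are 1-based before offset is added, and a substitution can be represented by the hypothetical Sub namedtuple below (a stand-in for the library's Substitution class, whose fields are not documented here).

```python
from collections import namedtuple

Sub = namedtuple("Sub", ["a", "pos", "b"])  # hypothetical stand-in for Substitution


def find_subs_sketch(a, b, offset=0, ignore="X-"):
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return tuple(
        Sub(x, i + offset, y)
        for i, (x, y) in enumerate(zip(a, b), start=1)
        if x != y and x not in ignore and y not in ignore
    )


subs = find_subs_sketch("MKTA", "MRT-")
```

Only K2R is reported; the A-to-gap difference at the last position is skipped because "-" is in ignore.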

def get_trunk(tree, attr='_x')

Ordered nodes in tree, from deepest leaf to root.

Args

tree (dendropy Tree)
attr (str)

Returns

tuple containing dendropy Nodes

def group_sequences_by_character_at_site(seqs: dict[str, str], site: int) ‑> dict[str, str]

Groups sequences by the character at a specific site.

Args

seqs : dict[str, str]: A dictionary mapping sequence names to sequences.
site : int: The 1-based site to group by.

Returns

dict[str, str]: A dictionary where keys are characters at the given site and values are lists of sequence names.
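
The grouping amounts to a dictionary of lists keyed by the residue at the chosen site. A minimal sketch (group_by_site_sketch is a hypothetical name; the sequences are toy data):

```python
from collections import defaultdict


def group_by_site_sketch(seqs, site):
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq[site - 1]].append(name)  # site is 1-based
    return dict(groups)


groups = group_by_site_sketch({"s1": "MKT", "s2": "MRT", "s3": "MKS"}, site=2)
```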

def grouped_sample(population, n, key=None)

Randomly samples a population, taking at most n elements from each group.

Args

population : iterable: The population to sample from.
n : int: The maximum number of samples to take from each group.
key : callable, optional: A function to group elements by. Defaults to None.

Returns

list: The sampled elements.
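
A sketch of per-group sampling with the standard library. The seed parameter is an addition for reproducibility in this example and is not part of the documented signature; treating each element as its own group when key is None is also an assumption.

```python
import random
from collections import defaultdict


def grouped_sample_sketch(population, n, key=None, seed=None):
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in population:
        groups[key(item) if key is not None else item].append(item)
    sample = []
    for members in groups.values():
        # Take everything from small groups, and n random members from large ones.
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample


strains = ["A/1/2019", "A/2/2019", "A/3/2019", "A/4/2020"]
picked = grouped_sample_sketch(strains, 2, key=lambda s: s.rsplit("/", 1)[1], seed=0)
```

Grouping by year keeps two of the three 2019 strains and the single 2020 strain, which is the kind of down-sampling that stops one heavily sampled group from dominating a dataset.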

def guess_clusters_in_tree(node)

If a node is in a known cluster, and all of its descendants are in the same cluster or an unknown cluster, then update all the descendant nodes to the matching cluster.

def hamming_dist(a: str, b: str, ignore: Iterable[str] = '-X', case_sensitive: bool = True, per_site: bool = False) ‑> float

Computes the Hamming distance between two sequences.

Args

a : str: The first sequence.
b : str: The second sequence.
ignore : Iterable[str]: A string containing characters to ignore. Mismatches involving these characters will not contribute to the Hamming distance.
case_sensitive : bool: If True, the comparison is case-sensitive.
per_site : bool: If True, the Hamming distance is divided by the length of the sequences, excluding ignored sites.

Returns

float: The Hamming distance.
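
The interaction of ignore and per_site is easiest to see in code. This sketch assumes equal-length inputs and returns 0.0 for per_site when every site is ignored; neither edge case is specified above, and hamming_sketch is a hypothetical name.

```python
def hamming_sketch(a, b, ignore="-X", case_sensitive=True, per_site=False):
    if not case_sensitive:
        a, b = a.upper(), b.upper()
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x in ignore or y in ignore:
            continue  # ignored sites count toward neither mismatches nor length
        compared += 1
        mismatches += x != y
    if per_site:
        return mismatches / compared if compared else 0.0
    return float(mismatches)


raw = hamming_sketch("ACGT-A", "ACCTTA")
normalised = hamming_sketch("ACGT-A", "ACCTTA", per_site=True)
```

The gap column is skipped, so the single G/C mismatch is divided by five compared sites rather than six.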

def hamming_dist_lt(a, b, n, ignore=None)

Checks if the Hamming distance between two iterables is less than n.

This is case-sensitive and does not check if a and b have matching lengths.

Args

a : iterable: The first iterable.
b : iterable: The second iterable.
n : scalar: The threshold value.
ignore : set or None: A set of characters to ignore during comparison.

Returns

bool: True if the Hamming distance is less than n, False otherwise.

def hamming_to_all_clusters(sequence: str, strict_len: bool = True) ‑> list[float]

The hamming distance from sequence to all known clusters.

Args

sequence (str)
strict_len : bool: See hamming_to_cluster

Returns

2-tuples containing (cluster, hamming distance)

def hamming_to_cluster(sequence: str, cluster: str | ForwardRef('Cluster'), strict_len: bool = True) ‑> float

The hamming distance from sequence to the consensus sequence of a cluster.

Args

sequence (str)
cluster (str or Cluster)
strict_len : bool: Cluster consensus sequences are for HA1 only, and are 328 residues long. If strict_len is True, the sequence must match this length. If False, the sequence is truncated to 328 residues to match. A sequence shorter than 328 residues still raises an error.

Returns

int

def load_fasta(path: str, translate_nt: bool = False, convert_to_upper: bool = False, start: int = 0) ‑> dict[str, str]

Loads sequences from a FASTA file.

Args

path : str: The path to the FASTA file.
translate_nt : bool: If True, translate nucleotide sequences to amino acids.
convert_to_upper : bool: If True, convert sequences to uppercase.
start : int: The 0-based index of the first character to include from each sequence. This is applied before translation.

Returns

dict[str, str]: A dictionary mapping sequence descriptions to sequences.

def load_fastas(paths: Iterable[str], **kwargs) ‑> dict[str, str]

Loads sequences from multiple FASTA files.

If the same sequence description appears in multiple files, the sequence from the last file is used.

Args

paths : Iterable[str]: An iterable of paths to FASTA files.
**kwargs: Passed to load_fasta().

Returns

dict[str, str]: A dictionary mapping sequence descriptions to sequences.

def log_df_func(f: Callable, *args, **kwargs)

Callable should return a DataFrame. Report time taken to call a function, and the shape of the resulting DataFrame.

def node_x_y(nodes: Iterable[dendropy.datamodel.treemodel._node.Node], jitter_x: float | None = None) ‑> tuple[tuple, tuple]

Gets the x and y coordinates of nodes.

Args

nodes : Iterable[dp.Node]: An iterable of dendropy Node objects.
jitter_x : float, optional: The amount of jitter to add to the x coordinates. X is jittered by a quarter of this value in both directions. Defaults to None.

Returns

tuple[tuple, tuple]: A tuple containing two tuples: one for x coordinates and one for y coordinates.

def node_x_y_from_taxon_label(tree: Tree, taxon_label: str) ‑> tuple[float, float]

Finds the x and y attributes of a node from its taxon label.

Args

tree : Tree: The tree to search in.
taxon_label : str: The taxon label of the node.

Returns

tuple[float, float]: The x and y coordinates of the node.

def pairwise_hamming_dists(sequences: list | tuple | dict[str, str], **kwds) ‑> list[float] | dict[str, dict[str, str]]

All pairwise Hamming distances between items in a collection.

Args

sequences : list | tuple | dict: A collection of sequences.
**kwds: Passed to hamming_dist().

Returns

list[float] or dict[str][str] -> float

def plot_aa_freq_thru_time(t0: pandas._libs.tslibs.timestamps.Timestamp, t_end: pandas._libs.tslibs.timestamps.Timestamp, df_seq: pandas.core.frame.DataFrame, site: int, proportion=False, ax=None, ignore='X-', blank_xtick_labels=False)

def plot_amino_acid_colors(ax: matplotlib.axes.Axes = None) ‑> matplotlib.axes.Axes

Creates a simple plot to display amino acid colors.

Args

ax : matplotlib.axes.Axes, optional: The matplotlib axes to plot on. If None, the current axes are used. Defaults to None.

Returns

matplotlib.axes.Axes: The axes with the plot.

def plot_leaves_with_labels(tree: dendropy.datamodel.treemodel._tree.Tree, labels: list[str], ax: matplotlib.axes._axes.Axes = None, **kwds)

Plots leaves that have taxon labels in a given list.

Args

tree : dp.Tree: The tree to plot.
labels : list[str]: A list of taxon labels to plot.
ax : mp.axes.Axes, optional: The matplotlib axes to plot on. Defaults to None.
**kwds: Additional keyword arguments passed to plt.scatter.

def plot_legend(patch_colors: dict[str, str], ax: matplotlib.axes._axes.Axes | None, **kwds) ‑> matplotlib.legend.Legend

Plot a legend for the given patch colors.

Args

patch_colors: Dictionary mapping label to color.
ax: Matplotlib Axes to plot the legend on. If None, use current Axes.
**kwds: Passed to ax.legend.

Returns

The legend object.

def plot_path_to_taxon(tree: dendropy.datamodel.treemodel._tree.Tree | Tree, taxon_label: str, ax: matplotlib.axes._axes.Axes | None = None, label_taxon: bool = True, label_kwds: dict | None = None, **kwds) ‑> matplotlib.collections.LineCollection

Plots the path from the root to a given taxon.

Args

tree : dp.Tree | Tree: The tree to plot.
taxon_label : str: The taxon label of the node to plot the path to.
ax : mp.axes.Axes, optional: The matplotlib axes to plot on.
label_taxon : bool: If True, label the taxon at the end of the path.
label_kwds : dict, optional: Keyword arguments passed to plt.text.

Returns

mp.collections.LineCollection

def plot_subs_on_tree(tree: dendropy.datamodel.treemodel._tree.Tree, sequences: dict[str, str], exclude_leaves: bool = True, on_path_to_taxon: str | None = None, site_offset: int = 0, ignore_chars: str = 'X-', arrow_length: float = 40, arrow_facecolor: str = 'black', fontsize: float = 6, xytext_transform: tuple[float, float] = (1.0, 1.0), **kwds) ‑> collections.Counter

Plots substitutions on a tree.

This function plots substitutions on the tree by finding substitutions between each node and its parent node. The substitutions are then plotted at the midpoint of the edge between the node and its parent.

Args

tree : dp.Tree: The tree to annotate.
sequences : dict[str, str]: A mapping of node labels to sequences.
exclude_leaves : bool: If True, exclude leaves from substitution plotting.
on_path_to_taxon : str, optional: If provided, only plot substitutions on the path from the root to this taxon. Defaults to None.
site_offset : int: Value to add to substitution site numbers.
ignore_chars : str: Substitutions involving these characters will not be shown.
arrow_length : float: The length of the arrow pointing to the mutation.
arrow_facecolor : str: The face color of the arrow.
fontsize : float: The font size of the text.
xytext_transform : tuple[float, float]: Multipliers for the xytext offsets.
**kwds: Other keyword arguments passed to plt.annotate.

Returns

Counter: A counter of the number of times each substitution appears in the tree.

def plot_tree(tree: dendropy.datamodel.treemodel._tree.Tree | Tree, has_brlens: bool = True, edge_kwds: dict = {'color': 'black', 'linewidth': 0.5, 'clip_on': False, 'capstyle': 'round', 'zorder': 10}, leaf_kwds: dict = {'zorder': 15, 'color': 'black', 's': 0, 'marker': 'o', 'edgecolor': 'white', 'lw': 0.1, 'clip_on': False}, internal_kwds: dict = {'zorder': 12, 'color': 'black', 's': 0, 'marker': 'o', 'edgecolor': 'white', 'lw': 0.1, 'clip_on': False}, ax: matplotlib.axes._axes.Axes = None, labels: Iterable[str] | Literal['all'] | None = None, label_kwds: dict = {'horizontalalignment': 'left', 'verticalalignment': 'center', 'fontsize': 8, 'zorder': 15}, label_x_offset: float = 0.0, compute_layout: bool = True, fill_dotted_lines: bool = False, round_brlens: int | None = None, color_leaves_by_site_aa: int | None = None, hide_aa: str | None = None, color_internal_nodes_by_site_aa: int | None = None, sequences: dict[str, str] | None = None, jitter_x: float | str | None = None, scale_bar: bool | None = True, scale_bar_x_start: float = 0.0) ‑> matplotlib.axes._axes.Axes

Plots a dendropy tree object.

Tree nodes are plotted in their current order. To ladderize, call tree.ladderize() before plotting.

Args

tree : dp.Tree | Tree: The tree to plot.
has_brlens : bool: If False, all branch lengths are plotted as 1.
edge_kwds : dict: Keyword arguments for edges, passed to matplotlib.collections.LineCollection.
leaf_kwds : dict: Keyword arguments for leaves, passed to ax.scatter.
label_kwds : dict: Keyword arguments passed to plt.text.
internal_kwds : dict: Keyword arguments for internal nodes, passed to ax.scatter.
ax : mp.axes.Axes, optional: The matplotlib axes to plot on. Defaults to None.
labels : Iterable[str] | Literal["all"], optional: Taxon labels to annotate, or "all".
label_x_offset : float: Amount to offset leaf labels in the x direction.
compute_layout : bool: If True, compute the layout. If False, assumes the tree nodes already have _x and _y attributes.
fill_dotted_lines : bool: If True, show dotted lines from leaves to the right-hand edge of the tree.
round_brlens : int, optional: The number of decimal places to round branch lengths to. Passed to compute_tree_layout().
color_leaves_by_site_aa : int, optional: Color leaves by the amino acid at this site (1-based). Overwrites 'c' in leaf_kwds. Requires sequences.
hide_aa : str, optional: A string of amino acids to hide when coloring by site.
color_internal_nodes_by_site_aa : int, optional: Same as color_leaves_by_site_aa but for internal nodes.
sequences : dict[str, str], optional: A mapping of taxon labels to sequences. Required for coloring by site.
jitter_x : float | str, optional: Amount of noise to add to the x value of leaves to avoid overplotting. Can be a float or 'auto'.
scale_bar : bool: If True, show a scale bar.
scale_bar_x_start : float: The leftmost x position of the scale bar.

Returns

mp.axes.Axes: The matplotlib axes with the plotted tree. The tree object is returned with added attributes: _xlim, _ylim, and _x, _y on each node.

def plot_tree_coloured_by_cluster(tree, legend=True, leg_kwds={}, unknown_color='black', leaf_kwds={}, internal_kwds={}, **kwds)

Plot a tree with nodes coloured according to cluster.

Args

tree : dendropy Tree: Nodes that have 'cluster' attribute will be coloured.
legend : bool: Add a legend showing the clusters.
leg_kwds : dict: Keyword arguments passed to plt.legend.
unknown_color : mpl color: Color if cluster is not known.
**kwds: Keyword arguments passed to plot_tree.

def plot_tree_interactive(tree: dendropy.datamodel.treemodel._tree.Tree, has_brlens: bool = True, leaf_colors: dict | None = None, default_leaf_color: str = 'black', leaf_sizes: dict | None = None, default_leaf_size: int = 5)

Plots a dendropy tree object interactively using plotly.

Args

tree : dp.Tree: The tree to plot.
has_brlens : bool: If False, all branch lengths are plotted as 1.
leaf_colors : dict, optional: A dictionary mapping taxon labels to colors.
default_leaf_color : str: The default color for taxa not in leaf_colors.
leaf_sizes : dict, optional: A dictionary mapping taxon labels to sizes.
default_leaf_size : int: The default size for taxa not in leaf_sizes.

def plot_tree_with_subplots(tree: Tree, aa_seqs: dict, site: int, subplot_taxa_shifts: dict[str, tuple[float, float]], fun: Callable, fun_kwds: dict | None = None, subplot_width: float = 0.2, subplot_height: float = 0.1, figsize: tuple[float, float] = (8, 12), sharex: bool = True, sharey: bool = True, snap_x: float | None = None, snap_y: float | None = None, arrow_origins: dict[str, tuple[float, float]] | None = None, **kwds) ‑> matplotlib.axes._axes.Axes

Plot a phylogeny tree with subplots for specified taxa.

This function draws a phylogeny based on a given tree and amino acid sequences. It colors leaves (and internal nodes) according to their amino acid at a specified site, and attaches additional subplots at user-defined nodes for further custom visualization.

Args

tree : eremitalpa.Tree: The phylogenetic tree to be plotted.
aa_seqs : dict: A dictionary containing amino acid sequences for each taxon. Keys should match the node names in the tree, and values should be the sequences.
site : int: The site (1-based) to color the tree's leaves and internal nodes.
subplot_taxa_shifts : dict of str -> tuple of float: A mapping from taxon names to tuples (dx, dy). These values control the position of the subplot axes relative to their respective nodes. Uses axes coordinates (i.e. a value of 1 would shift an entire ax worth of distance).
fun : Callable: A callable function to generate each subplot. Must accept the current taxon as the first argument and an axes object as the second argument.
fun_kwds : dict: A dictionary of additional keyword arguments passed to the subplot function fun.
subplot_width : float, optional: The width of each subplot in figure coordinates, by default 0.2.
subplot_height : float, optional: The height of each subplot in figure coordinates, by default 0.1.
figsize : tuple of float, optional: The overall size of the figure, by default (8, 12).
sharex : bool: Have the sub axes share x-axes.
sharey : bool: Have the sub axes share y-axes.
snap_x : float: Snap x position of the subplots to a grid. This argument sets the grid size.
snap_y : float: Snap y position of the subplots to a grid. This argument sets the grid size.
arrow_origins : dict of str -> tuple of float: Pass the axes coordinates of each subplot for where its arrow should originate. By default arrows originate from the center.
**kwds: Passed to plot_tree.

Returns

2-tuple containing: matplotlib.axes.Axes - the main axes. dict [str, matplotlib.axes.Axes] containing sub plots.

def prune_nodes_with_labels(tree, labels)

Prune nodes from tree that have a taxon label in labels.

Args

tree (dendropy Tree)
labels (iterable containing str)

Returns

(dendropy Tree)

def read_iqtree_ancestral_states(state_file, partition_names: list[str] | None = None, translate_nt: bool = False) ‑> dict[str, dict[str, str]] | dict[str, str]

Read an ancestral state file generated by IQ-TREE. If the file contains multiple partitions (i.e. a 'Part' column is present), then return a dict of dicts containing sequences accessed by [partition][node]. Otherwise return a dict of sequences accessed by node.

Args

state_file: Path to .state file generated by iqtree --ancestral
partition_names: Partitions are numbered from 1 in the .state file. Pass names for each segment (i.e. the order that partition_names appear in the partitions). Only takes effect if multiple partitions are present.
translate_nt: If ancestral states are nucleotide sequences then translate them.

Returns

dict of dicts that maps [node][partition] -> sequence, or dict that maps node -> sequence.

def read_raxml_ancestral_sequences(tree, node_labelled_tree, ancestral_seqs, leaf_seqs=None)

Read a tree and ancestral sequences estimated by RAxML.

RAxML can estimate marginal ancestral sequences for internal nodes on a tree using a call like:

raxmlHPC -f A -t {treeFile} -s {sequenceFile} -m {model} -n {name}

The analysis outputs several files:

RAxML_nodeLabelledRootedTree.{name} contains a copy of the input tree where all internal nodes have a unique identifier {id}.
RAxML_marginalAncestralStates.{name} contains the ancestral sequence for each internal node. The format of each line is '{id} {sequence}'
RAxML_marginalAncestralProbabilities.{name} contains probabilities of each base at each site for each internal node. (Not used by this function.)

Notes

Developed with output from RAxML version 8.2.12.

Args

tree : str: Path to original input tree ({treeFile}).
node_labelled_tree : str: Path to the tree with node labels. (RAxML_nodeLabelledRootedTree.{name})
ancestral_seqs : str: Path to file containing the ancestral sequences. (RAxML_marginalAncestralStates.{name})
leaf_seqs : str: (Optional) path to fasta file containing leaf sequences. ({sequenceFile}). If this is provided, also attach sequences to leaf nodes.

Returns

(dendropy Tree) with sequences attached to nodes. Sequences are attached as 'sequence' attributes on Nodes.
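
As a small illustration of the '{id} {sequence}' format used by RAxML_marginalAncestralStates.{name}, the sketch below parses such lines into a dictionary. It covers only this parsing step, not attaching sequences to tree nodes; the function name and the example node ids are hypothetical.

```python
def parse_marginal_ancestral_states(lines):
    """Map internal node id -> ancestral sequence from '{id} {sequence}' lines."""
    states = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node_id, sequence = line.split(maxsplit=1)
        states[node_id] = sequence
    return states


example_lines = ["1 ACGTAC", "2 ACGTAT", ""]
states = parse_marginal_ancestral_states(example_lines)
```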

def sloppy_translate(sequence)

Translate a nucleotide sequence.

Doesn't check that the sequence length is a multiple of three. If any 'codon' contains any character not in [ACTG] then return X.

Args

sequence : str: Lower or upper case.

Returns

str: The translated sequence.
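
A sketch of this kind of permissive translation using the standard genetic code. Two details are assumptions rather than documented behavior: stop codons are written as "*", and a trailing partial codon is dropped.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def sloppy_translate_sketch(sequence):
    sequence = sequence.upper()
    # Step through complete codons only; any leftover 1-2 bases are dropped.
    codons = (sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3))
    # Codons containing anything other than A, C, G or T are not in the table.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


protein = sloppy_translate_sketch("ATGAANTAA")
```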

def split_pairs(values: Iterable, separation: float = 1.0) ‑> list

If values are repeated, e.g.:

1, 5, 5, 8

Then 'split' them by adding and subtracting half of separation (default=1.0 -> 0.5) from each item in the pair:

1, 4.5, 5.5, 8
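
A minimal sketch that reproduces the example above, assuming repeated values are adjacent and occur in pairs; how the library treats values repeated three or more times is not stated. The name split_pairs_sketch is hypothetical.

```python
def split_pairs_sketch(values, separation=1.0):
    values = list(values)
    result = list(values)
    i = 0
    while i < len(values) - 1:
        if values[i] == values[i + 1]:
            # Push the two members of the pair apart by half the separation each.
            result[i] = values[i] - separation / 2
            result[i + 1] = values[i + 1] + separation / 2
            i += 2
        else:
            i += 1
    return result


split = split_pairs_sketch([1, 5, 5, 8])
```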

def spread_points(points: Iterable[float], tol: float = 0.0001, maxiter: int = 100, repel: float = 0.5, attract: float = 0.1) ‑>

Spread out 1D points. Points are modeled as attracted to their starting positions (with force constant attract) and repelled from other points with a force proportional to the inverse of their cubed distance (multiplied by repel). Positions are updated iteratively until either maxiter is reached or the sum of the squared differences between positions in successive iterations falls below tol.

def taxon_in_node_label(label, node)

Checks if a node has a matching taxon label.

Args

label : str: The label to check for.
node : dp.Node: The node to check.

Returns

bool: True if the node's taxon label matches, False otherwise.

def taxon_in_node_labels(labels, node)

Checks if a node's taxon label is in a set of labels.

Args

labels : iterable: A collection of labels to check against.
node : dp.Node: The node to check.

Returns

bool: True if the node's taxon label is in the labels, False otherwise.

def translate_segment(sequence: str, segment: str) ‑> dict[str, str]

Transcribe an influenza A segment. MP, PA, PB1 and NS all have splice variants, so simply transcribing the ORF of the segment would miss out proteins. This function returns a dictionary containing the coding sequences that are transcribed from a particular segment.

Args

sequence : str: The RNA sequence of the segment.
segment : str: The segment to translate. Must be one of 'HA', 'NA', 'NP', 'PB2', 'PA', 'MP' or 'PB1'.

Returns

dict[str, str]: A dictionary mapping the segment name to the translated protein sequence.

def translate_trim_default_ha(nt: str) ‑> str

Take a default HA nucleotide sequence and return an HA1 sequence.

def variable_sites(seq: pandas.core.series.Series, max_biggest_prop: float = 0.95, ignore: str = '-X') ‑> Generator[int, None, None]

Finds variable sites among sequences.

Args

seq : pd.Series: A pandas Series of sequences.
max_biggest_prop : float: The maximum proportion for the most common character at a site for it to be considered variable.
ignore : str: Characters to ignore when calculating proportions.

Yields

int: The 1-indexed position of the next variable site.
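
The site test can be sketched column by column. This version takes any iterable of equal-length strings rather than a pandas Series, and it assumes a site counts as variable when the most common character's proportion is at most max_biggest_prop (whether the library uses "at most" or "strictly less than" is not stated).

```python
from collections import Counter


def variable_sites_sketch(seqs, max_biggest_prop=0.95, ignore="-X"):
    for site, column in enumerate(zip(*seqs), start=1):
        counts = Counter(c for c in column if c not in ignore)
        total = sum(counts.values())
        # Skip columns made up entirely of ignored characters.
        if total and max(counts.values()) / total <= max_biggest_prop:
            yield site


sites = list(variable_sites_sketch(["MKT", "MRT", "MKS", "MK-"], max_biggest_prop=0.8))
```

Site 1 is invariant, site 2 has K in three of four sequences (0.75), and site 3 has T in two of the three non-gap sequences, so sites 2 and 3 are reported.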

def write_fasta(path: str, records: dict[str, str]) ‑> None

Writes sequences to a FASTA file.

Args

path : str: The path to the output FASTA file.
records : dict[str, str]: A dictionary where keys are sequence headers and values are the sequences.

Classes

class Cluster (cluster)

Instance variables

prop aa_sequence: Representative amino acid sequence.
prop b7_motifs
prop color
prop key_residues
prop nt_sequence : str: Representative nucleotide sequence.
prop year : int

Methods

def codon(self, n: int) ‑> str

Codon at amino acid position n. 1-indexed.
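
For an in-frame nucleotide sequence, a 1-indexed codon lookup is a slice of three bases. A minimal sketch, assuming the representative nucleotide sequence starts at the first base of codon 1 (codon_at is a hypothetical free function, not the method itself):

```python
def codon_at(nt_sequence, n):
    """Codon for amino acid position n (1-indexed) of an in-frame sequence."""
    start = (n - 1) * 3
    return nt_sequence[start:start + 3]


second_codon = codon_at("ATGAAGACT", 2)
```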

class ClusterTransition (c0: str | Cluster, c1: str | Cluster)

A cluster transition.

Static methods

def from_tuple(c0c1) ‑> ClusterTransition: Make an instance from a tuple

Instance variables

prop preceding_transitions : Generator[ClusterTransition, None, None]: All preceding cluster transitions.

class MultipleSequenceAlignment (records, alphabet=None, annotations=None, column_annotations=None)

Represents a classical multiple sequence alignment (MSA).

By this we mean a collection of sequences (usually shown as rows) which are all the same length (usually with gap characters for insertions or padding). The data can then be regarded as a matrix of letters, with well defined columns.

You would typically create an MSA by loading an alignment file with the AlignIO module:

>>> from Bio import AlignIO
>>> align = AlignIO.read("Clustalw/opuntia.aln", "clustal")
>>> print(align)
Alignment with 7 rows and 156 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191

In some respects you can treat these objects as lists of SeqRecord objects, each representing a row of the alignment. Iterating over an alignment gives the SeqRecord object for each row:

>>> len(align)
7
>>> for record in align:
...     print("%s %i" % (record.id, len(record)))
...
gi|6273285|gb|AF191659.1|AF191 156
gi|6273284|gb|AF191658.1|AF191 156
gi|6273287|gb|AF191661.1|AF191 156
gi|6273286|gb|AF191660.1|AF191 156
gi|6273290|gb|AF191664.1|AF191 156
gi|6273289|gb|AF191663.1|AF191 156
gi|6273291|gb|AF191665.1|AF191 156

You can also access individual rows as SeqRecord objects via their index:

>>> print(align[0].id)
gi|6273285|gb|AF191659.1|AF191
>>> print(align[-1].id)
gi|6273291|gb|AF191665.1|AF191

And extract columns as strings:

>>> print(align[:, 1])
AAAAAAA

Or, take just the first ten columns as a sub-alignment:

>>> print(align[:, :10])
Alignment with 7 rows and 10 columns
TATACATTAA gi|6273285|gb|AF191659.1|AF191
TATACATTAA gi|6273284|gb|AF191658.1|AF191
TATACATTAA gi|6273287|gb|AF191661.1|AF191
TATACATAAA gi|6273286|gb|AF191660.1|AF191
TATACATTAA gi|6273290|gb|AF191664.1|AF191
TATACATTAA gi|6273289|gb|AF191663.1|AF191
TATACATTAA gi|6273291|gb|AF191665.1|AF191

Combining this alignment slicing with alignment addition allows you to remove a section of the alignment. For example, taking just the first and last ten columns:

>>> print(align[:, :10] + align[:, -10:])
Alignment with 7 rows and 20 columns
TATACATTAAGTGTACCAGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAGTGTACCAGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAGTGTACCAGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAGTGTACCAGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAGTGTACCAGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAGTATACCAGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAGTGTACCAGA gi|6273291|gb|AF191665.1|AF191

Note - This object does NOT attempt to model the kind of alignments used in next generation sequencing with multiple sequencing reads which are much shorter than the alignment, and where there is usually a consensus or reference sequence with special status.

Initialize a new MultipleSeqAlignment object.

Arguments:

- records - A list (or iterator) of SeqRecord objects, whose sequences are all the same length. This may be an empty list.
- alphabet - For backward compatibility only; its value should always be None.
- annotations - Information about the whole alignment (dictionary).
- column_annotations - Per column annotation (restricted dictionary). This holds Python sequences (lists, strings, tuples) whose length matches the number of columns. A typical use would be a secondary structure consensus string.

You would normally load a MSA from a file using Bio.AlignIO, but you can do this from a list of SeqRecord objects too:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Align import MultipleSeqAlignment
>>> a = SeqRecord(Seq("AAAACGT"), id="Alpha")
>>> b = SeqRecord(Seq("AAA-CGT"), id="Beta")
>>> c = SeqRecord(Seq("AAAAGGT"), id="Gamma")
>>> align = MultipleSeqAlignment([a, b, c],
...                              annotations={"tool": "demo"},
...                              column_annotations={"stats": "CCCXCCC"})
>>> print(align)
Alignment with 3 rows and 7 columns
AAAACGT Alpha
AAA-CGT Beta
AAAAGGT Gamma
>>> align.annotations
{'tool': 'demo'}
>>> align.column_annotations
{'stats': 'CCCXCCC'}

Ancestors

Bio.Align.MultipleSeqAlignment

Methods

def plot(self, ax: matplotlib.axes._axes.Axes | None = None, fontsize: int = 6, variable_sites_kwds: dict | None = None, rotate_xtick_labels: bool = False, sites: Iterable[int] | None = None) ‑> matplotlib.axes._axes.Axes

Plot variable sites in the alignment.

Args

ax: Matplotlib ax.
fontsize: Fontsize of the character labels.
variable_sites_kwds: Passed to MultipleSequenceAlignment.variable_sites.
rotate_xtick_labels: Rotate the xtick labels 90 degrees.
sites: Only plot these sites. (Note: Only variable sites are plotted, so if a site is passed in this argument but it is not variable it will not be displayed.)

def variable_sites(self, min_2nd_most_freq: int = 1) ‑> Generator[Column, None, None]

Generator for variable sites in the alignment.

Args

min_2nd_most_freq: Used to filter out sites that have low variability. For instance if min_2nd_most_freq is 2 a column containing 'AAAAT' should be excluded because the second most frequent character (T) has a frequency of 1.
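The filtering rule can be illustrated on plain strings. This is a minimal sketch of the rule as described above, not the library's implementation; `is_variable` is a hypothetical helper that treats each column as a string of characters.

```python
from collections import Counter


def is_variable(column: str, min_2nd_most_freq: int = 1) -> bool:
    """Hypothetical helper: True if the second most frequent character
    occurs at least `min_2nd_most_freq` times in the column."""
    counts = Counter(column).most_common(2)
    if len(counts) < 2:
        return False  # only one distinct character, so the site is constant
    return counts[1][1] >= min_2nd_most_freq


columns = ["AAAAT", "AAATT", "AAAAA"]
# With a threshold of 2, 'AAAAT' is dropped because T occurs only once.
kept = [c for c in columns if is_variable(c, min_2nd_most_freq=2)]
print(kept)
```

With the default threshold of 1, any column containing at least two distinct characters would count as variable under this sketch.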

class NHSeason (years: tuple[int, int] | ForwardRef('NHSeason'))

A northern hemisphere flu season.

Static methods

def from_datetime(dt)

class Substitution (*args)

A change of a character at a site.

Initializes a Substitution object.

Instantiate using either one or three arguments: Substitution("N145K") or Substitution("N", 145, "K").

Args

*args: Either a single string like "N145K" or three arguments ("N", 145, "K").

Raises

ValueError: If the number of arguments is not 1 or 3.
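The two calling conventions could be handled along the following lines. This is only a sketch under assumptions: `parse_substitution` is a hypothetical function, and the regular expression (single letters, `*` or `-` around an integer site) is an assumed format, not taken from the library.

```python
import re


def parse_substitution(*args):
    """Hypothetical stand-in for the argument handling described above.

    Returns a (original, site, new) tuple.
    """
    if len(args) == 1:
        # Assumed format: one character, an integer site, one character.
        match = re.fullmatch(r"([A-Za-z*-])(\d+)([A-Za-z*-])", args[0])
        if match is None:
            raise ValueError(f"cannot parse substitution: {args[0]!r}")
        aa0, site, aa1 = match.groups()
    elif len(args) == 3:
        aa0, site, aa1 = args
    else:
        raise ValueError("expected 1 or 3 arguments")
    return aa0, int(site), aa1


print(parse_substitution("N145K"))
print(parse_substitution("N", 145, "K"))
```

Both calls yield the same tuple, which is the property the one-or-three-argument constructor is meant to provide.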

class TiedCounter (iterable=None, /, **kwds)

A Counter that handles ties in most_common(1).

Create a new, empty Counter object. And if given, count elements from an input iterable. Or, initialize the count from another mapping of elements to their counts.

>>> c = Counter()                           # a new, empty counter
>>> c = Counter('gallahad')                 # a new counter from an iterable
>>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
>>> c = Counter(a=4, b=2)                   # a new counter from keyword args

Ancestors

collections.Counter
builtins.dict

Methods

def most_common(self, n: int | None = None) ‑> list[tuple[typing.Any, int]]

Returns the most common elements.

If n=1 and there is a tie for the most common element, all tied elements are returned. Otherwise, it behaves like Counter.most_common.

Args

n : int, optional: The number of most common elements to return. Defaults to None.

Returns

list[tuple[Any, int]]: A list of the most common elements and their counts.
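The tie-handling behaviour can be sketched with a small Counter subclass. `TiedCounterSketch` is a hypothetical name, and this is one plausible way to get the documented behaviour rather than the library's actual code.

```python
from collections import Counter


class TiedCounterSketch(Counter):
    """Hypothetical re-implementation of the documented tie handling."""

    def most_common(self, n=None):
        if n == 1 and self:
            top = max(self.values())
            # Return every element that shares the highest count.
            return [(k, v) for k, v in self.items() if v == top]
        return super().most_common(n)


tied = TiedCounterSketch("aabbc")
print(tied.most_common(1))  # both 'a' and 'b' have a count of 2
print(TiedCounterSketch("aab").most_common(1))
```

A plain `Counter("aabbc").most_common(1)` would return only one of the two tied elements, which is the case this class is designed to avoid.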

class Tree (*args, **kwargs)

An arborescence, i.e. a fully-connected directed acyclic graph with all edges directing away from the root and toward the tips. The "root" of the tree is represented by the Tree.seed_node attribute. In unrooted trees, this node is an algorithmic artifact. In rooted trees this node is semantically equivalent to the root.

The constructor can optionally construct a Tree object by cloning another Tree object passed as the first positional argument, or from a data source if the stream and schema keyword arguments are passed, with a file-like object and a schema-specification string as their respective values.

Parameters

*args : positional argument, optional: If given, should be exactly one Tree object. The new Tree will then be a structural clone of this argument.

**kwargs : keyword arguments, optional: The following optional keyword arguments are recognized and handled by this constructor:

    label: The label or description of the new Tree object.
    taxon_namespace: Specifies the TaxonNamespace object that the new Tree object will reference.

Examples

Tree objects can be instantiated in the following ways:

#! /usr/bin/env python

import copy
from io import StringIO

import dendropy
from dendropy import Node, TaxonNamespace, Tree

# empty tree
t1 = Tree()

# Tree objects can be instantiated from an external data source
# using the 'get()' factory class method

# From a file-like object
t2 = Tree.get(file=open('treefile.tre', 'r'),
                schema="newick",
                tree_offset=0)

# From a path
t3 = Tree.get(path='sometrees.nexus',
        schema="nexus",
        collection_offset=2,
        tree_offset=1)

# From a string
s = "((A,B),(C,D));((A,C),(B,D));"
# tree will be '((A,B),(C,D))'
t4 = Tree.get(data=s,
        schema="newick")
# tree will be '((A,C),(B,D))'
t5 = Tree.get(data=s,
        schema="newick",
        tree_offset=1)
# passing keywords to underlying tree parser
t7 = dendropy.Tree.get(
        data="((A,B),(C,D));",
        schema="newick",
        taxon_namespace=t3.taxon_namespace,
        suppress_internal_node_taxa=False,
        preserve_underscores=True)

# Tree objects can be written out using the 'write()' method.
t1.write(file=open('treefile.tre', 'w'),
        schema="newick")
t1.write(path='treefile.nex',
        schema="nexus")

# Or returned as a string using the 'as_string()' method.
s = t1.as_string("nexml")

# tree structure deep-copied from another tree
t8 = dendropy.Tree(t7)
assert t8 is not t7                             # Trees are distinct
assert t8.symmetric_difference(t7) == 0         # and structure is identical
assert t8.taxon_namespace is t7.taxon_namespace             # BUT taxa are not cloned.
nds3 = [nd for nd in t7.postorder_node_iter()]  # Nodes in the two trees
nds4 = [nd for nd in t8.postorder_node_iter()]  # are distinct objects,
for i, n in enumerate(nds3):                    # and can be manipulated
    assert nds3[i] is not nds4[i]               # independently.
egs3 = [eg for eg in t7.postorder_edge_iter()]  # Edges in the two trees
egs4 = [eg for eg in t8.postorder_edge_iter()]  # are also distinct objects,
for i, e in enumerate(egs3):                    # and can also be manipulated
    assert egs3[i] is not egs4[i]               # independently.
lves7 = t7.leaf_nodes()                         # Leaf nodes in the two trees
lves8 = t8.leaf_nodes()                         # are also distinct objects,
for i, lf in enumerate(lves7):                  # but order is the same,
    assert lves7[i] is not lves8[i]             # and associated Taxon objects
    assert lves7[i].taxon is lves8[i].taxon     # are the same.

# To create deep copy of a tree with a different taxon namespace,
# Use 'copy.deepcopy()'
t9 = copy.deepcopy(t7)

# Or explicitly pass in a new TaxonNamespace instance
taxa = TaxonNamespace()
t9 = dendropy.Tree(t7, taxon_namespace=taxa)
assert t9 is not t7                             # As above, the trees are distinct
assert t9.symmetric_difference(t7) == 0         # and the structures are identical,
assert t9.taxon_namespace is not t7.taxon_namespace         # but this time, the taxa *are* different
assert t9.taxon_namespace is taxa                     # as the given TaxonNamespace is used instead.
lves7 = t7.leaf_nodes()                         # Leaf nodes (and, for that matter other nodes
lves9 = t9.leaf_nodes()                         # as well as edges) are also distinct objects
for i, lf in enumerate(lves7):                  # and the order is the same, as above,
    assert lves7[i] is not lves9[i]             # but this time the associated Taxon
    assert lves7[i].taxon is not lves9[i].taxon # objects are distinct though the taxon
    assert lves7[i].taxon.label == lves9[i].taxon.label # labels are the same.

# to 'switch out' the TaxonNamespace of a tree, replace the reference and
# reindex the taxa:
t11 = Tree.get(data='((A,B),(C,D));', schema='newick')
taxa = TaxonNamespace()
t11.taxon_namespace = taxa
t11.reindex_subcomponent_taxa()

# You can also explicitly pass in a seed node:
seed = Node(label="root")
t12 = Tree(seed_node=seed)
assert t12.seed_node is seed

Ancestors

dendropy.datamodel.treemodel._tree.Tree
dendropy.datamodel.taxonmodel.TaxonNamespaceAssociated
dendropy.datamodel.basemodel.Annotable
dendropy.datamodel.basemodel.Deserializable
dendropy.datamodel.basemodel.NonMultiReadable
dendropy.datamodel.basemodel.Serializable
dendropy.datamodel.basemodel.DataObject

Static methods

def find_closest_leaf_node(node: dendropy.datamodel.treemodel._node.Node) ‑> dendropy.datamodel.treemodel._node.Node

Find the leaf node that has the shortest path length to the given node.

Args

node: The node of interest.

Returns

The leaf node closest to the node of interest.

def from_disk(path: str, schema: str = 'newick', preserve_underscores: bool = True, outgroup: str | None = None, msa_path: str | None = None, get_kwds: dict | None = None, **kwds) ‑> Tree

Loads a tree from a file.

Args

path : str: Path to the file containing the tree.
schema : str: The schema of the tree file (e.g., "newick"). See dendropy.Tree.get for options.
preserve_underscores : bool: If True, preserve underscores in taxon labels.
outgroup : str, optional: The name of the taxon to use as the outgroup. Defaults to None.
msa_path : str, optional: Path to a FASTA file containing leaf sequences. Defaults to None.
get_kwds : dict, optional: Keyword arguments passed to dendropy.Tree.get. Defaults to None.
**kwds: Additional keyword arguments passed to add_sequences_to_tree.

Returns

Tree: The loaded tree object.

Instance variables

prop multiple_sequence_alignment

Generates a MultipleSequenceAlignment object from the tree.

Leaf nodes on the tree must have 'sequence' attributes and taxon labels.

Returns

MultipleSequenceAlignment: The generated alignment object.

Methods

def clade_bbox(self, taxon_labels: list[str]) ‑> dict[str, float]

Calculates the bounding box of a clade.

The bounding box is determined by finding the most recent common ancestor (MRCA) of the specified taxa and then calculating the minimum and maximum x and y coordinates of its child nodes.

Args

taxon_labels : list[str]: A list of taxon labels that define the clade.

Returns

dict[str, float]: A dictionary containing the coordinates of the bounding box with keys: 'min_x', 'max_x', 'min_y', 'max_y'.
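The shape of the returned dictionary can be illustrated without a real tree. In this sketch the coordinates are invented and `bbox` is a hypothetical helper; in the library the coordinates would come from the plotted tree layout, and whether the MRCA itself contributes is not stated here.

```python
# Hypothetical layout: label -> (x, y) plotting coordinates of the nodes
# below the clade's MRCA. These values are made up for illustration.
clade_coords = {"A": (0.5, 1.0), "B": (0.6, 2.0), "n1": (0.3, 1.5)}


def bbox(coords):
    """Return the extreme x and y values using the documented key names."""
    xs = [x for x, _ in coords.values()]
    ys = [y for _, y in coords.values()]
    return {"min_x": min(xs), "max_x": max(xs), "min_y": min(ys), "max_y": max(ys)}


box = bbox(clade_coords)
print(box)
```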

def distance_between(self, node1: dendropy.datamodel.treemodel._node.Node, node2: dendropy.datamodel.treemodel._node.Node) ‑> float

Calculates the distance between two nodes.

The distance is calculated as the sum of the branch lengths along the path connecting the two nodes.

Args

node1 : dp.Node: The first node.
node2 : dp.Node: The second node.

Returns

float: The distance between the two nodes.
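The path-length calculation can be sketched on a toy tree stored as a dictionary. Everything here is an assumption made for illustration: the `parents` mapping, the node names and the helper functions are hypothetical and do not use DendroPy or reflect the library's implementation.

```python
# Toy tree: node -> (parent, branch length to parent). Hypothetical data.
parents = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("root", 3.0),
    "root": (None, 0.0),
}


def path_to_root(node):
    """Yield the node and each of its ancestors up to the root."""
    while node is not None:
        yield node
        node = parents[node][0]


def distance_between(node1, node2):
    """Sum branch lengths from each node up to their most recent common ancestor."""
    ancestors2 = set(path_to_root(node2))
    # The first ancestor of node1 that is also an ancestor of node2 is the MRCA.
    mrca = next(n for n in path_to_root(node1) if n in ancestors2)
    total = 0.0
    for node in (node1, node2):
        while node != mrca:
            parent, length = parents[node]
            total += length
            node = parent
    return total


print(distance_between("A", "B"))
print(distance_between("A", "C"))
```

The same walk towards the root underlies finding the MRCA of two internal nodes, since neither step depends on the nodes being leaves.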

def internal_node_mrca(self, node1: dendropy.datamodel.treemodel._node.Node, node2: dendropy.datamodel.treemodel._node.Node) ‑> dendropy.datamodel.treemodel._node.Node

Finds the MRCA of two nodes.

Note

dendropy.Tree.mrca only works on leaf nodes (or nodes with taxon labels).

Args

node1 : dp.Node: The first node.
node2 : dp.Node: The second node.

Returns

dp.Node: The MRCA of the two nodes.

def node_to_root_path(self, taxon: str | dendropy.datamodel.treemodel._node.Node) ‑> Generator[dendropy.datamodel.treemodel._node.Node, None, None]

Nodes from a taxon to the root node.

Args

taxon : str or dendropy.Node: The taxon label of the starting node, or the node object.

Yields

dp.Node: The nodes from the taxon to the root.

def plot_clade_bbox(self, taxon_labels: list[str], ax: matplotlib.axes._axes.Axes | None = None, extend_right: float = 0.0, extend_down: float = 0.0, label: str | None = None, label_kwds: dict | None = None, **kwds)

Plots a rectangle around the bounding box of a clade.

Args

taxon_labels : list[str]: A list of taxon labels that define the clade.
ax : mp.axes.Axes, optional: The matplotlib axes to plot on. Defaults to None.
extend_right : float: Amount to extend the box to the right, in axes coordinates.
extend_down : float: Amount to extend the box down, in axes coordinates.
label : str, optional: A label to apply to the box. Defaults to None.
label_kwds : dict, optional: Keyword arguments passed to matplotlib.axes.Axes.text. Defaults to None.
**kwds: Additional keyword arguments passed to matplotlib.patches.Rectangle.

Returns

matplotlib.patches.Rectangle: The rectangle patch added to the axes.

def plot_tree_msa(self, msa_plot_kwds: dict | None = None, axes: tuple[matplotlib.axes._axes.Axes, matplotlib.axes._axes.Axes] | None = None) ‑> tuple[matplotlib.axes._axes.Axes, matplotlib.axes._axes.Axes]

Plots the tree and multiple sequence alignment.

Args

msa_plot_kwds : dict, optional: Keyword arguments passed to the multiple sequence alignment plot function. Defaults to None.
axes : tuple[mp.axes.Axes, mp.axes.Axes], optional: A tuple of two matplotlib axes to plot on. If None, new axes are created. Defaults to None.

Returns

tuple[mp.axes.Axes, mp.axes.Axes]: The matplotlib axes used for plotting.