Module eremitalpa.bio

Functions

def align_to_reference(reference_seq: str, input_seq: str) ‑> ReferenceAlignment

Aligns an input sequence to a reference sequence.

Returns the aligned input sequence trimmed to the region of the reference.

Args

reference_seq : str
The reference sequence.
input_seq : str
The input sequence to align.

Returns

ReferenceAlignment
A named tuple containing the aligned sequence and a boolean indicating if an internal gap was introduced in the reference.
def consensus_seq(seqs: Iterable[str], case_sensitive: bool = True, **kwds) ‑> str

Computes the consensus of a set of sequences.

Args

seqs : Iterable[str]
The sequences to compute the consensus from.
case_sensitive : bool
If False, all sequences are converted to lowercase.
**kwds
Additional keyword arguments passed to _generate_consensus_chars.

Returns

str
The consensus sequence.
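
As a rough illustration of what a consensus is, the sketch below takes a column-wise majority vote over equal-length sequences. It is not the library's implementation: the name consensus_sketch is hypothetical, the internals of _generate_consensus_chars are not shown here, and ties are resolved by whichever character Counter.most_common happens to list first, which is an assumption of this sketch.

```python
from collections import Counter
from typing import Iterable


def consensus_sketch(seqs: Iterable[str], case_sensitive: bool = True) -> str:
    """Column-wise majority vote (hypothetical helper, not eremitalpa's code)."""
    seqs = list(seqs)
    if not case_sensitive:
        seqs = [s.lower() for s in seqs]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must all have the same length")
    # zip(*seqs) yields one tuple of characters per alignment column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


print(consensus_sketch(["ACGT", "ACGA", "TCGA"]))  # ACGA
```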
def filter_similar_hd(sequences, n, progress_bar=False, ignore=None, case_sensitive=False) ‑> list

Filters sequences based on Hamming distance.

Iterates through sequences, excluding those that have a Hamming distance of less than n to a sequence already seen.

Args

sequences : iterable[str | Bio.SeqRecord]
The sequences to filter.
n : int
The Hamming distance threshold.
progress_bar : bool
Whether to display a progress bar.
ignore : set, optional
Characters to ignore during comparison. Defaults to None.
case_sensitive : bool
Whether the comparison is case-sensitive.

Returns

list
The filtered sequences.
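
The greedy filtering described above can be sketched as follows. This is a minimal sketch under stated assumptions: it handles plain strings only (not Bio.SeqRecord), it compares each candidate against the sequences kept so far, and the names filter_similar_sketch and _mismatches are hypothetical.

```python
def _mismatches(a, b, ignore):
    """Count differing positions, skipping any position with an ignored character."""
    return sum(
        1 for x, y in zip(a, b) if x != y and x not in ignore and y not in ignore
    )


def filter_similar_sketch(sequences, n, ignore=None, case_sensitive=False):
    """Keep a sequence only if it is at least n mismatches from every kept one."""
    ignore = set(ignore or ())
    kept = []       # sequences as they were passed in
    kept_keys = []  # normalized forms used for comparison
    for seq in sequences:
        key = seq if case_sensitive else seq.upper()
        if not any(_mismatches(key, other, ignore) < n for other in kept_keys):
            kept.append(seq)
            kept_keys.append(key)
    return kept


print(filter_similar_sketch(["AAAA", "AAAT", "TTTT", "TTTA"], 2))  # ['AAAA', 'TTTT']
```

Because the filter is greedy, the result depends on input order: the first member of each cluster of similar sequences is the one that survives.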
def find_mutations(*args, **kwargs)
def find_substitutions(a, b, offset=0, ignore='X-')

Find mutations between strings a and b.

Args

a : str
The first string.
b : str
The second string.
offset : int
An offset to be added to the mutation position.
ignore : str
Ignore substitution if these characters are involved.

Raises

ValueError
If lengths of a and b differ.

Returns

tuple[Substitution, …]
A tuple of Substitution objects.
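
The comparison described above can be sketched in plain Python. The sketch returns (lost, position, gained) tuples rather than Substitution objects, and it assumes positions are 1-based before the offset is added; both choices, and the name find_substitutions_sketch, are assumptions rather than details confirmed by this reference.

```python
def find_substitutions_sketch(a, b, offset=0, ignore="X-"):
    """Return (lost, position, gained) tuples for sites where a and b differ."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    found = []
    for position, (x, y) in enumerate(zip(a, b), start=1):  # 1-based: an assumption
        if x != y and x not in ignore and y not in ignore:
            found.append((x, position + offset, y))
    return tuple(found)


print(find_substitutions_sketch("NKT", "NRX"))             # (('K', 2, 'R'),)
print(find_substitutions_sketch("NKT", "NRX", offset=10))  # (('K', 12, 'R'),)
```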
def group_sequences_by_character_at_site(seqs: dict[str, str], site: int) ‑> dict[str, str]

Groups sequences by the character at a specific site.

Args

seqs : dict[str, str]
A dictionary mapping sequence names to sequences.
site : int
The 1-based site to group by.

Returns

dict[str, str]
A dictionary where keys are characters at the given site and values are lists of sequence names.
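
A minimal sketch of this grouping, following the description above (values are lists of sequence names) and the 1-based site convention. The name group_by_site_sketch is hypothetical and the code is not taken from the library.

```python
from collections import defaultdict


def group_by_site_sketch(seqs, site):
    """Map each character seen at a 1-based site to the names that carry it."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq[site - 1]].append(name)  # site is 1-based, Python indexing is 0-based
    return dict(groups)


seqs = {"a": "NKT", "b": "NRT", "c": "DKT"}
print(group_by_site_sketch(seqs, 2))  # {'K': ['a', 'c'], 'R': ['b']}
```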
def grouped_sample(population, n, key=None)

Randomly samples a population, taking at most n elements from each group.

Args

population : iterable
The population to sample from.
n : int
The maximum number of samples to take from each group.
key : callable, optional
A function to group elements by. Defaults to None.

Returns

list
The sampled elements.
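
One way to read the description above is: partition the population by key, then draw up to n items from each partition. The sketch below does that. It assumes key=None means each element is its own group, and it adds an rng parameter for reproducibility; that parameter and the name grouped_sample_sketch are hypothetical and not part of the documented signature.

```python
import random
from collections import defaultdict


def grouped_sample_sketch(population, n, key=None, rng=None):
    """Sample at most n items per group (hypothetical helper)."""
    rng = rng if rng is not None else random.Random()
    if key is None:
        key = lambda item: item  # assumption: ungrouped items group by identity
    groups = defaultdict(list)
    for item in population:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        # Small groups are taken whole; large groups are down-sampled to n.
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample


strains = ["H1a", "H1b", "H1c", "H3a"]
picked = grouped_sample_sketch(strains, 2, key=lambda s: s[:2], rng=random.Random(0))
print(len(picked))  # 3: two of the H1 strains plus the single H3 strain
```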
def hamming_dist(a: str,
b: str,
ignore: Iterable[str] = '-X',
case_sensitive: bool = True,
per_site: bool = False) ‑> float

Computes the Hamming distance between two sequences.

Args

a : str
The first sequence.
b : str
The second sequence.
ignore : Iterable[str]
A string containing characters to ignore. Mismatches involving these characters will not contribute to the Hamming distance.
case_sensitive : bool
If True, the comparison is case-sensitive.
per_site : bool
If True, the Hamming distance is divided by the length of the sequences, excluding ignored sites.

Returns

float
The Hamming distance.
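
To make the roles of ignore and per_site concrete, here is a minimal sketch of a Hamming distance with the same parameters. It is an illustration rather than the library's code: the name hamming_dist_sketch is hypothetical, and raising on unequal lengths and returning 0.0 when no sites are comparable are assumptions of the sketch.

```python
def hamming_dist_sketch(a, b, ignore="-X", case_sensitive=True, per_site=False):
    """Count mismatches between a and b, skipping sites with ignored characters."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")  # assumption
    if not case_sensitive:
        a, b = a.upper(), b.upper()
    ignore = set(ignore)
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x in ignore or y in ignore:
            continue  # ignored sites count toward neither numerator nor denominator
        compared += 1
        if x != y:
            mismatches += 1
    if per_site:
        return mismatches / compared if compared else 0.0  # assumption for empty case
    return float(mismatches)


print(hamming_dist_sketch("ACGT-A", "ACCTTA"))                 # 1.0
print(hamming_dist_sketch("ACGT-A", "ACCTTA", per_site=True))  # 0.2
```

In the second call the gap site is excluded from the denominator, so the single mismatch is divided by five compared sites rather than six.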
def hamming_dist_lt(a, b, n, ignore=None)

Checks if the Hamming distance between two iterables is less than n.

This is case-sensitive and does not check if a and b have matching lengths.

Args

a : iterable
The first iterable.
b : iterable
The second iterable.
n : scalar
The threshold value.
ignore : set or None
A set of characters to ignore during comparison.

Returns

bool
True if the Hamming distance is less than n, False otherwise.
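
A threshold test like this can stop as soon as the running mismatch count reaches n, without computing the full distance. The sketch below shows that early-exit idea; whether the library does the same is an assumption, and the name hamming_dist_lt_sketch is hypothetical. Like the documented function, it does not check lengths: zip simply stops at the shorter input.

```python
def hamming_dist_lt_sketch(a, b, n, ignore=None):
    """Return True if a and b differ at fewer than n compared positions."""
    if n <= 0:
        return False  # a distance is never negative, so it cannot be below n
    ignore = set() if ignore is None else set(ignore)
    mismatches = 0
    for x, y in zip(a, b):
        if x in ignore or y in ignore:
            continue
        if x != y:
            mismatches += 1
            if mismatches >= n:
                return False  # early exit: the threshold has been reached
    return True


print(hamming_dist_lt_sketch("AAAA", "AAAT", 2))  # True
print(hamming_dist_lt_sketch("AAAA", "AATT", 2))  # False
```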
def idx_first_and_last_non_gap(sequence: str) ‑> tuple[int, int]

Finds the indices of the first and last non-gap characters.

If every character is a gap, the behavior is not defined: the all-gap case is not handled explicitly, and the loop logic likely raises an UnboundLocalError.

Args

sequence : str
The input sequence, which may contain gaps ('-').

Returns

tuple[int, int]
A tuple containing the indices of the first and last non-gap characters.
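
For illustration, the sketch below finds the same two indices and handles the all-gap case explicitly by raising ValueError. That error handling is a choice made for the sketch, not the documented behavior, and the name first_last_non_gap_sketch is hypothetical.

```python
def first_last_non_gap_sketch(sequence):
    """Return 0-based indices of the first and last characters that are not '-'."""
    first = next((i for i, c in enumerate(sequence) if c != "-"), None)
    if first is None:
        raise ValueError("sequence contains only gaps")  # sketch-specific choice
    # Scan from the right; a non-gap character is guaranteed to exist here.
    from_end = next(i for i, c in enumerate(reversed(sequence)) if c != "-")
    return first, len(sequence) - 1 - from_end


print(first_last_non_gap_sketch("--ACG-T--"))  # (2, 6)
```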
def load_fasta(path: str,
translate_nt: bool = False,
convert_to_upper: bool = False,
start: int = 0) ‑> dict[str, str]

Loads sequences from a FASTA file.

Args

path : str
The path to the FASTA file.
translate_nt : bool
If True, translate nucleotide sequences to amino acids.
convert_to_upper : bool
If True, convert sequences to uppercase.
start : int
The 0-based index of the first character to include from each sequence. This is applied before translation.

Returns

dict[str, str]
A dictionary mapping sequence descriptions to sequences.
def load_fastas(paths: Iterable[str], **kwargs) ‑> dict[str, str]

Loads sequences from multiple FASTA files.

If the same sequence description appears in multiple files, the sequence from the last file is used.

Args

paths : Iterable[str]
An iterable of paths to FASTA files.
**kwargs
Passed to load_fasta().

Returns

dict[str, str]
A dictionary mapping sequence descriptions to sequences.
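
The "last file wins" rule above is what dict.update gives when files are merged in order. The sketch below demonstrates it with a deliberately small FASTA parser that reads from file-like objects instead of paths. The parser, the names parse_fasta_sketch and load_many_sketch, and the handling of start and convert_to_upper are assumptions made for illustration; translation is omitted.

```python
import io


def parse_fasta_sketch(handle, start=0, convert_to_upper=False):
    """Parse FASTA text from a file-like object into {description: sequence}."""
    records = {}
    description = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if description is not None:
                records[description] = "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records[description] = "".join(chunks)
    # Trim the leading characters, then optionally upper-case.
    return {
        d: (s[start:].upper() if convert_to_upper else s[start:])
        for d, s in records.items()
    }


def load_many_sketch(handles, **kwargs):
    """Merge several parsed files; later files overwrite earlier descriptions."""
    merged = {}
    for handle in handles:
        merged.update(parse_fasta_sketch(handle, **kwargs))
    return merged


first = io.StringIO(">s1\nacgt\nacgt\n>s2\ntttt\n")
second = io.StringIO(">s2\ngggg\n")
print(load_many_sketch([first, second], convert_to_upper=True))
# {'s1': 'ACGTACGT', 's2': 'GGGG'}
```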
def pairwise_hamming_dists(sequences: list | tuple | dict[str, str], **kwds) ‑> list[float] | dict[str, dict[str, str]]

Computes all pairwise Hamming distances between items in a collection.

Args

sequences : list | tuple | dict
A collection of sequences.
**kwds
Passed to hamming_dist().

Returns

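list[float] or dict[str][str] -> float
A list of distances when a list or tuple is passed; when a dict is passed, a nested mapping from sequence name to sequence name to distance.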
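
The two return shapes can be sketched with itertools.combinations. This is an illustration under assumptions: the ordering of the list, the symmetric layout of the nested dictionary, and the absence of self-comparisons are choices of the sketch, and pairwise_sketch and count_mismatches are hypothetical names.

```python
from itertools import combinations


def count_mismatches(a, b):
    """Plain mismatch count, standing in for a fuller Hamming distance."""
    return float(sum(x != y for x, y in zip(a, b)))


def pairwise_sketch(sequences, dist=count_mismatches):
    """Return pairwise distances as a list, or as a nested dict for dict input."""
    if isinstance(sequences, dict):
        result = {name: {} for name in sequences}
        for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2):
            d = dist(seq_a, seq_b)
            result[name_a][name_b] = d  # stored in both directions (assumption)
            result[name_b][name_a] = d
        return result
    return [dist(a, b) for a, b in combinations(sequences, 2)]


print(pairwise_sketch(["AAA", "AAT", "TTT"]))            # [1.0, 3.0, 2.0]
print(pairwise_sketch({"p": "AAA", "q": "AAT"})["p"]["q"])  # 1.0
```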

def plot_amino_acid_colors(ax: matplotlib.axes.Axes = None) ‑> matplotlib.axes.Axes

Creates a simple plot to display amino acid colors.

Args

ax : matplotlib.axes.Axes, optional
The matplotlib axes to plot on. If None, the current axes are used. Defaults to None.

Returns

matplotlib.axes.Axes
The axes with the plot.
def sloppy_translate(sequence)

Translate a nucleotide sequence.

Does not check that the sequence length is a multiple of three. Any 'codon' that contains a character other than A, C, T or G is translated as X.

Args

sequence : str
The nucleotide sequence, in lower or upper case.

Returns

str
The translated sequence.
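
A tolerant translation of this kind can be sketched with a lookup table and a default of X. The sketch uses the standard genetic code, writes stop codons as '*', and also maps a trailing incomplete codon to X; those last two points are assumptions, since this reference does not state them. The names CODON_TABLE and sloppy_translate_sketch are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sloppy_translate_sketch(sequence):
    """Translate codon by codon; anything not in the table becomes 'X'."""
    sequence = sequence.upper()
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X") for i in range(0, len(sequence), 3)
    )


print(sloppy_translate_sketch("atgAAnTAA"))  # MX*
```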
def variable_sites(seq: pandas.core.series.Series,
max_biggest_prop: float = 0.95,
ignore: str = '-X') ‑> Generator[int, None, None]

Finds variable sites among sequences.

Args

seq : pd.Series
A pandas Series of sequences.
max_biggest_prop : float
The maximum proportion for the most common character at a site for it to be considered variable.
ignore : str
Characters to ignore when calculating proportions.

Yields

int
The 1-indexed position of the next variable site.
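
The rule above can be sketched without pandas by counting characters per alignment column. The sketch takes a plain list of equal-length strings instead of a pd.Series, treats a site whose top proportion equals max_biggest_prop as variable, and skips sites where every character is ignored; these are assumptions, and variable_sites_sketch is a hypothetical name.

```python
from collections import Counter


def variable_sites_sketch(seqs, max_biggest_prop=0.95, ignore="-X"):
    """Yield 1-indexed sites whose most common character is not too dominant."""
    for position, column in enumerate(zip(*seqs), start=1):
        counts = Counter(c for c in column if c not in ignore)
        total = sum(counts.values())
        if total == 0:
            continue  # nothing but ignored characters at this site
        if max(counts.values()) / total <= max_biggest_prop:
            yield position


seqs = ["AKT", "AKT", "ART", "A-S"]
print(list(variable_sites_sketch(seqs)))  # [2, 3]
```

At site 2 the gap is dropped before proportions are computed, so K accounts for two of three characters rather than two of four.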
def write_fasta(path: str, records: dict[str, str]) ‑> None

Writes sequences to a FASTA file.

Args

path : str
The path to the output FASTA file.
records : dict[str, str]
A dictionary where keys are sequence headers and values are the sequences.

Classes

class ReferenceAlignment (aligned, internal_gap_in_ref)

ReferenceAlignment(aligned, internal_gap_in_ref)

Ancestors

  • builtins.tuple

Instance variables

var aligned

Alias for field number 0

var internal_gap_in_ref

Alias for field number 1

class Substitution (*args)

A change of a character at a site.

Initializes a Substitution object.

Instantiate using either one or three arguments: Substitution("N145K") or Substitution("N", 145, "K").

Args

*args
Either a single string like "N145K" or three arguments ("N", 145, "K").

Raises

ValueError
If the number of arguments is not 1 or 3.
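
The two calling conventions can be sketched as a small parser that normalizes both forms to the same three values. It returns a plain tuple because the attribute names of Substitution are not given in this reference; the accepted character set in the regular expression and the name parse_substitution_sketch are likewise assumptions.

```python
import re


def parse_substitution_sketch(*args):
    """Normalize ("N145K",) or ("N", 145, "K") to a (lost, site, gained) tuple."""
    if len(args) == 1:
        match = re.fullmatch(r"([A-Za-z*-])(\d+)([A-Za-z*-])", args[0])
        if match is None:
            raise ValueError(f"cannot parse substitution: {args[0]!r}")
        lost, site, gained = match.groups()
        return lost, int(site), gained
    if len(args) == 3:
        lost, site, gained = args
        return lost, int(site), gained
    raise ValueError("expected 1 or 3 arguments")


print(parse_substitution_sketch("N145K"))         # ('N', 145, 'K')
print(parse_substitution_sketch("N", 145, "K"))   # ('N', 145, 'K')
```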
class TiedCounter (iterable=None, /, **kwds)

A Counter that handles ties in most_common(1).

Create a new, empty Counter object. And if given, count elements from an input iterable. Or, initialize the count from another mapping of elements to their counts.

>>> c = Counter()                           # a new, empty counter
>>> c = Counter('gallahad')                 # a new counter from an iterable
>>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
>>> c = Counter(a=4, b=2)                   # a new counter from keyword args

Ancestors

  • collections.Counter
  • builtins.dict

Methods

def most_common(self, n: int | None = None) ‑> list[tuple[typing.Any, int]]

Returns the most common elements.

If n=1 and there is a tie for the most common element, all tied elements are returned. Otherwise, it behaves like Counter.most_common.

Args

n : int, optional
The number of most common elements to return. Defaults to None.

Returns

list[tuple[Any, int]]
A list of the most common elements and their counts.
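
The tie handling described above can be sketched as a Counter subclass that overrides most_common only for n=1. This is an illustration of the documented behavior, not the library's source; the class name TiedCounterSketch is hypothetical.

```python
from collections import Counter


class TiedCounterSketch(Counter):
    """Counter whose most_common(1) returns every element tied for first place."""

    def most_common(self, n=None):
        if n == 1 and self:
            top = max(self.values())
            return [(item, count) for item, count in self.items() if count == top]
        return super().most_common(n)


print(TiedCounterSketch("aabbc").most_common(1))  # [('a', 2), ('b', 2)]
print(TiedCounterSketch("aab").most_common(1))    # [('a', 2)]
```

A caller that needs a single winner should therefore check the length of the returned list rather than index it blindly.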