Module eremitalpa.bio
Functions
def align_to_reference(reference_seq: str, input_seq: str) ‑> ReferenceAlignment-
Aligns an input sequence to a reference sequence.
Returns the aligned input sequence trimmed to the region of the reference.
Args
reference_seq:str- The reference sequence.
input_seq:str- The input sequence to align.
Returns
ReferenceAlignment- A named tuple containing the aligned sequence and a boolean indicating if an internal gap was introduced in the reference.
def consensus_seq(seqs: Iterable[str], case_sensitive: bool = True, **kwds) ‑> str-
Computes the consensus of a set of sequences.
Args
seqs:Iterable[str]- The sequences to compute the consensus from.
case_sensitive:bool- If False, all sequences are converted to lowercase.
**kwds- Additional keyword arguments passed to _generate_consensus_chars.
Returns
str- The consensus sequence.
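As an illustration of what a per-column consensus looks like, here is a minimal sketch. It is not the library's implementation: `consensus_sketch` is a hypothetical name, the sequences are assumed to be aligned and of equal length, and ties are broken arbitrarily by `Counter`, which may differ from what `_generate_consensus_chars` does.

```python
from collections import Counter
from typing import Iterable


def consensus_sketch(seqs: Iterable[str], case_sensitive: bool = True) -> str:
    """Most common character in each column (hypothetical helper)."""
    seqs = list(seqs)
    if not case_sensitive:
        # Mirrors the documented behaviour: fold everything to lowercase.
        seqs = [s.lower() for s in seqs]
    # zip(*seqs) yields one tuple of characters per alignment column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))
```

For example, `consensus_sketch(["ACGT", "ACGA", "ACCA"])` gives `"ACGA"`.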
def filter_similar_hd(sequences, n, progress_bar=False, ignore=None, case_sensitive=False) ‑> list-
Filters sequences based on Hamming distance.
Iterates through sequences, excluding those that have a Hamming distance of less than n to a sequence already seen.
Args
sequences:iterable[str | Bio.SeqRecord]- The sequences to filter.
n:int- The Hamming distance threshold.
progress_bar:bool- Whether to display a progress bar.
ignore:set, optional- Characters to ignore during comparison. Defaults to None.
case_sensitive:bool- Whether the comparison is case-sensitive.
Returns
list- The filtered sequences.
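The greedy filtering described above can be sketched as follows. This is an illustration only: `filter_similar_sketch` is a hypothetical name, it handles plain strings, and it assumes "already seen" means "already kept"; the library may compare against every sequence encountered instead.

```python
def filter_similar_sketch(sequences, n):
    """Keep a sequence only if it is >= n mismatches from every kept one."""
    kept = []
    for seq in sequences:
        # Plain Hamming distance over paired characters (no ignore set here).
        too_close = any(
            sum(x != y for x, y in zip(seq, other)) < n for other in kept
        )
        if not too_close:
            kept.append(seq)
    return kept
```

Because the filter is greedy, the result depends on the order of the input.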
def find_mutations(*args, **kwargs)-
def find_substitutions(a, b, offset=0, ignore='X-')-
Find mutations between strings a and b.
Args
a:str- The first string.
b:str- The second string.
offset:int- An offset to be added to the mutation position.
ignore:str- Ignore substitution if these characters are involved.
Raises
ValueError- If lengths of a and b differ.
Returns
tuple[Substitution, …]- A tuple of Substitution objects.
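A minimal sketch of the comparison, under stated assumptions: `find_substitutions_sketch` is a hypothetical name, positions are assumed to be 1-based before the offset is added, and plain strings such as `"K2R"` stand in for Substitution objects.

```python
def find_substitutions_sketch(a, b, offset=0, ignore="X-"):
    """Report differences between a and b as strings like 'K2R'."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return tuple(
        f"{x}{i + offset}{y}"
        # Assumption: sites are numbered from 1, then shifted by offset.
        for i, (x, y) in enumerate(zip(a, b), start=1)
        if x != y and x not in ignore and y not in ignore
    )
```

An offset is useful when `a` and `b` are fragments of longer sequences and positions should be reported in the full-length numbering.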
def group_sequences_by_character_at_site(seqs: dict[str, str], site: int) ‑> dict[str, str]-
Groups sequences by the character at a specific site.
Args
seqs:dict[str, str]- A dictionary mapping sequence names to sequences.
site:int- The 1-based site to group by.
Returns
dict[str, list[str]]- A dictionary where keys are characters at the given site and values are lists of sequence names.
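The grouping can be pictured with the short sketch below. `group_by_site_sketch` is a hypothetical name and this is not the library's code; it only shows how a 1-based site maps onto Python's 0-based indexing.

```python
from collections import defaultdict


def group_by_site_sketch(seqs, site):
    """Map each character found at a 1-based site to the names carrying it."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        # site is 1-based, so subtract 1 for Python indexing.
        groups[seq[site - 1]].append(name)
    return dict(groups)
```

With `{"a": "NKT", "b": "NRT", "c": "NKS"}` and `site=2`, the result is `{"K": ["a", "c"], "R": ["b"]}`.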
def grouped_sample(population, n, key=None)-
Randomly samples a population, taking at most n elements from each group.
Args
population:iterable- The population to sample from.
n:int- The maximum number of samples to take from each group.
key:callable, optional- A function to group elements by. Defaults to None.
Returns
list- The sampled elements.
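One way to implement this kind of capped, per-group sampling is sketched below. `grouped_sample_sketch` is a hypothetical name, and treating `key=None` as "group by the element itself" is an assumption, not documented behaviour.

```python
import random
from collections import defaultdict


def grouped_sample_sketch(population, n, key=None):
    """Take at most n random elements from each group of the population."""
    groups = defaultdict(list)
    for item in population:
        # Assumption: with no key, each distinct element is its own group.
        groups[key(item) if key else item].append(item)
    sample = []
    for members in groups.values():
        sample.extend(random.sample(members, min(n, len(members))))
    return sample
```

Groups with fewer than `n` members contribute all of their members.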
def hamming_dist(a: str,
b: str,
ignore: Iterable[str] = '-X',
case_sensitive: bool = True,
per_site: bool = False) ‑> float-
Computes the Hamming distance between two sequences.
Args
a:str- The first sequence.
b:str- The second sequence.
ignore:Iterable[str]- A string containing characters to ignore. Mismatches involving these characters will not contribute to the Hamming distance.
case_sensitive:bool- If True, the comparison is case-sensitive.
per_site:bool- If True, the Hamming distance is divided by the length of the sequences, excluding ignored sites.
Returns
float- The Hamming distance.
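The interaction between `ignore` and `per_site` can be sketched as follows. This is an illustration, not the library's implementation: `hamming_sketch` is a hypothetical name, inputs are assumed to have equal length, and the case-insensitive branch simply uppercases everything.

```python
def hamming_sketch(a, b, ignore="-X", case_sensitive=True, per_site=False):
    """Count mismatches, skipping sites where either character is ignored."""
    ignore = set(ignore)
    if not case_sensitive:
        a, b = a.upper(), b.upper()
        ignore = {c.upper() for c in ignore}
    # Only sites where neither character is ignored are compared.
    compared = [(x, y) for x, y in zip(a, b) if x not in ignore and y not in ignore]
    dist = sum(x != y for x, y in compared)
    # per_site divides by the number of compared (non-ignored) sites.
    return dist / len(compared) if per_site else float(dist)
```

For `"AC-T"` against `"ACGA"`, the gap site is skipped, leaving one mismatch over three compared sites.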
def hamming_dist_lt(a, b, n, ignore=None)-
Checks if the Hamming distance between two iterables is less than n.
This is case-sensitive and does not check if a and b have matching lengths.
Args
a:iterable- The first iterable.
b:iterable- The second iterable.
n:scalar- The threshold value.
ignore:set or None- A set of characters to ignore during comparison.
Returns
bool- True if the Hamming distance is less than n, False otherwise.
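A threshold check like this can stop as soon as the mismatch count reaches n, which is the usual reason to prefer it over computing the full distance. The sketch below shows that early exit; `hamming_lt_sketch` is a hypothetical name and the library's code may differ in detail.

```python
def hamming_lt_sketch(a, b, n, ignore=None):
    """Return True if a and b differ at fewer than n non-ignored sites."""
    ignore = ignore or set()
    dist = 0
    if dist >= n:
        # Covers n <= 0: a distance of zero is never less than n.
        return False
    for x, y in zip(a, b):  # zip stops at the shorter input; lengths unchecked
        if x != y and x not in ignore and y not in ignore:
            dist += 1
            if dist >= n:
                return False  # early exit: threshold already reached
    return True
```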
def idx_first_and_last_non_gap(sequence: str) ‑> tuple[int, int]-
Finds the indices of the first and last non-gap characters.
If all characters are gaps, no index is found. This case is not handled explicitly and likely results in an UnboundLocalError.
Args
sequence:str- The input sequence, which may contain gaps ('-').
Returns
tuple[int, int]- A tuple containing the indices of the first and last non-gap characters.
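A minimal sketch of the same idea, assuming 0-based indices. `first_last_non_gap_sketch` is a hypothetical name, and raising a ValueError for an all-gap input is a choice made for this sketch, not the library's behaviour.

```python
def first_last_non_gap_sketch(sequence):
    """Return 0-based indices of the first and last characters that are not '-'."""
    non_gap = [i for i, char in enumerate(sequence) if char != "-"]
    if not non_gap:
        # Sketch-only choice: fail explicitly on an all-gap sequence.
        raise ValueError("sequence contains only gaps")
    return non_gap[0], non_gap[-1]
```

For `"--AC-G--"` this returns `(2, 5)`.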
def load_fasta(path: str,
translate_nt: bool = False,
convert_to_upper: bool = False,
start: int = 0) ‑> dict[str, str]-
Loads sequences from a FASTA file.
Args
path:str- The path to the FASTA file.
translate_nt:bool- If True, translate nucleotide sequences to amino acids.
convert_to_upper:bool- If True, convert sequences to uppercase.
start:int- The 0-based index of the first character to include from each sequence. This is applied before translation.
Returns
dict[str, str]- A dictionary mapping sequence descriptions to sequences.
def load_fastas(paths: Iterable[str], **kwargs) ‑> dict[str, str]-
Loads sequences from multiple FASTA files.
If the same sequence description appears in multiple files, the sequence from the last file is used.
Args
paths:Iterable[str]- An iterable of paths to FASTA files.
**kwargs- Passed to load_fasta().
Returns
dict[str, str]- A dictionary mapping sequence descriptions to sequences.
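The "last file wins" rule is what a plain dictionary update produces. The sketch below shows this on already-loaded records; `merge_records_sketch` is a hypothetical helper and does not read any files.

```python
def merge_records_sketch(per_file_records):
    """Merge per-file {description: sequence} dicts; later files win."""
    merged = {}
    for records in per_file_records:
        # update() overwrites any description already present.
        merged.update(records)
    return merged
```

If two files both contain a record named `seq1`, only the sequence from the later file survives.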
def pairwise_hamming_dists(sequences: list | tuple | dict[str, str], **kwds) ‑> list[float] | dict[str, dict[str, str]]-
All pairwise Hamming distances between items in a collection.
Args
sequences:list | tuple | dict- A collection of sequences.
**kwds- Passed to hamming_dist().
Returns
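list[float] | dict[str, dict[str, float]]- A list of distances if sequences is a list or tuple, or a nested dictionary of distances keyed by sequence name if sequences is a dict.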
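The two return shapes can be sketched as follows. Several details here are assumptions rather than documented behaviour: `pairwise_sketch` is a hypothetical name, the list form follows the order of `itertools.combinations`, and the dict form stores each distance in both directions without self-distances.

```python
from itertools import combinations


def _mismatches(a, b):
    """Plain Hamming distance between two equal-length strings."""
    return float(sum(x != y for x, y in zip(a, b)))


def pairwise_sketch(sequences):
    """All pairwise distances, as a list or a nested dict keyed by name."""
    if isinstance(sequences, dict):
        dists = {name: {} for name in sequences}
        for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2):
            # Assumption: distances are stored symmetrically.
            dists[name_a][name_b] = dists[name_b][name_a] = _mismatches(seq_a, seq_b)
        return dists
    return [_mismatches(a, b) for a, b in combinations(sequences, 2)]
```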
def plot_amino_acid_colors(ax: matplotlib.axes.Axes = None) ‑> matplotlib.axes.Axes-
Creates a simple plot to display amino acid colors.
Args
ax:matplotlib.axes.Axes, optional- The matplotlib axes to plot on. If None, the current axes are used. Defaults to None.
Returns
matplotlib.axes.Axes- The axes with the plot.
def sloppy_translate(sequence)-
Translate a nucleotide sequence.
Doesn't check that the sequence length is a multiple of three. Any 'codon' that contains a character not in [ACTG] is translated as X.
Args
sequence:str- Lower or upper case.
Returns
str- The translated sequence.
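A lenient translation of this kind can be sketched with a lookup table built from the standard genetic code. This is not the library's implementation: `sloppy_translate_sketch` is a hypothetical name, stop codons are assumed to be written as `*`, and a trailing partial codon is assumed to become `X`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sloppy_translate_sketch(sequence):
    """Translate codon by codon; anything not in the table becomes 'X'."""
    sequence = sequence.upper()
    codons = (sequence[i:i + 3] for i in range(0, len(sequence), 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

For example, `"atgnnnTGG"` translates to `"MXW"` because the middle codon contains characters outside ACTG.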
def variable_sites(seq: pandas.core.series.Series,
max_biggest_prop: float = 0.95,
ignore: str = '-X') ‑> Generator[int, None, None]-
Finds variable sites among sequences.
Args
seq:pd.Series- A pandas Series of sequences.
max_biggest_prop:float- The maximum proportion for the most common character at a site for it to be considered variable.
ignore:str- Characters to ignore when calculating proportions.
Yields
int- The 1-indexed position of the next variable site.
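The per-site test can be sketched without pandas by working on a plain list of aligned strings. `variable_sites_sketch` is a hypothetical name, and whether a site whose top proportion equals `max_biggest_prop` exactly counts as variable is an assumption (here it does not).

```python
from collections import Counter


def variable_sites_sketch(seqs, max_biggest_prop=0.95, ignore="-X"):
    """Yield 1-indexed sites where no single character dominates."""
    for site, column in enumerate(zip(*seqs), start=1):
        # Ignored characters are dropped before proportions are computed.
        counts = Counter(char for char in column if char not in ignore)
        total = sum(counts.values())
        if total and counts.most_common(1)[0][1] / total < max_biggest_prop:
            yield site
```

For `["NKT", "NRT", "NKS", "NK-"]`, sites 2 and 3 are yielded: K covers 3 of 4 sequences at site 2, and T covers 2 of the 3 non-gap characters at site 3.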
def write_fasta(path: str, records: dict[str, str]) ‑> None-
Writes sequences to a FASTA file.
Args
path:str- The path to the output FASTA file.
records:dict[str, str]- A dictionary where keys are sequence headers and values are the sequences.
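For orientation, a round trip through the FASTA format can be sketched with plain file handling. Both functions below are hypothetical stand-ins, not the library's code: each sequence is written on a single line, and the reader takes the whole header line after `>` as the description.

```python
def write_fasta_sketch(path, records):
    """Write {header: sequence} as FASTA, one line per sequence."""
    with open(path, "w") as handle:
        for header, seq in records.items():
            handle.write(f">{header}\n{seq}\n")


def read_fasta_sketch(path):
    """Read FASTA into {description: sequence}, joining wrapped lines."""
    records, header = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                records[header] = ""
            elif header is not None:
                records[header] += line
    return records
```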
Classes
class ReferenceAlignment (aligned, internal_gap_in_ref)-
ReferenceAlignment(aligned, internal_gap_in_ref)
Ancestors
- builtins.tuple
Instance variables
var aligned-
Alias for field number 0
var internal_gap_in_ref-
Alias for field number 1
class Substitution (*args)-
A change of a character at a site.
Initializes a Substitution object.
Instantiate using either one or three arguments: Substitution("N145K") or Substitution("N", 145, "K")
Args
*args- Either a single string like "N145K" or three arguments ("N", 145, "K").
Raises
ValueError- If the number of arguments is not 1 or 3.
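The two call styles can be handled with a small amount of parsing, sketched below. `parse_substitution_sketch` is a hypothetical function that returns a plain tuple rather than a Substitution object, and the regular expression (one non-digit, digits, one non-digit) is an assumption about the accepted string format.

```python
import re


def parse_substitution_sketch(*args):
    """Return (original, site, new) from 'N145K' or ('N', 145, 'K')."""
    if len(args) == 1:
        match = re.fullmatch(r"(\D)(\d+)(\D)", args[0])
        if match is None:
            raise ValueError(f"cannot parse substitution: {args[0]!r}")
        original, site, new = match.groups()
        return original, int(site), new
    if len(args) == 3:
        original, site, new = args
        return original, int(site), new
    raise ValueError("expected 1 or 3 arguments")
```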
class TiedCounter (iterable=None, /, **kwds)-
A Counter that handles ties in most_common(1).
Create a new, empty Counter object. And if given, count elements from an input iterable. Or, initialize the count from another mapping of elements to their counts.
>>> c = Counter()                           # a new, empty counter
>>> c = Counter('gallahad')                 # a new counter from an iterable
>>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
>>> c = Counter(a=4, b=2)                   # a new counter from keyword args
Ancestors
- collections.Counter
- builtins.dict
Methods
def most_common(self, n: int | None = None) ‑> list[tuple[typing.Any, int]]-
Returns the most common elements.
If n=1 and there is a tie for the most common element, all tied elements are returned. Otherwise, it behaves like Counter.most_common.
Args
n:int, optional- The number of most common elements to return. Defaults to None.
Returns
list[tuple[Any, int]]- A list of the most common elements and their counts.
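The tie-handling rule can be sketched as a small Counter subclass. `TiedCounterSketch` is a hypothetical name and this is only one way to get the documented behaviour; tied elements are returned in insertion order here, which the library may not guarantee.

```python
from collections import Counter


class TiedCounterSketch(Counter):
    """Counter whose most_common(1) returns every element tied for first."""

    def most_common(self, n=None):
        if n == 1 and self:
            top = max(self.values())
            # All elements sharing the highest count, in insertion order.
            return [(key, count) for key, count in self.items() if count == top]
        return super().most_common(n)
```

With `TiedCounterSketch("aabbc")`, `most_common(1)` returns both `("a", 2)` and `("b", 2)`, whereas a plain Counter would return only one of them.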