Module `eremitalpa.bio`

Functions

def align_to_reference(reference_seq: str, input_seq: str) ‑> ReferenceAlignment

Align an input sequence to a reference. Returns the aligned input sequence trimmed to the region of the reference.

Args

reference_seq (str)
input_seq (str)

Raises

ValueError: If internal gaps are introduced in the reference during alignment.

Returns

ReferenceAlignment tuple containing:
'aligned' - the aligned input sequence
'internal_gap_in_ref' - boolean indicating if a gap was inserted in the reference

def consensus_seq(seqs: Iterable[str], case_sensitive: bool = True, **kwds) ‑> str

Compute the consensus of sequences.

Args

seqs: Sequences.
case_sensitive: If False, all seqs are converted to lowercase.
error_without_strict_majority: Raise an error if a position has a tied most common character. If False, a warning is issued instead and a single value is chosen.
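
The column-wise logic described above can be sketched as follows. This is a minimal standalone illustration, not the library's implementation: the name `consensus_sketch`, the choice of `ValueError`, the default for `error_without_strict_majority`, and the equal-length check are all assumptions.

```python
from collections import Counter
from typing import Iterable
import warnings


def consensus_sketch(
    seqs: Iterable[str],
    case_sensitive: bool = True,
    error_without_strict_majority: bool = True,  # assumed default
) -> str:
    """Return the most common character at each position (illustrative only)."""
    seqs = list(seqs)
    if not case_sensitive:
        seqs = [s.lower() for s in seqs]
    # Assumption: all sequences must share one length.
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")

    consensus = []
    for i, column in enumerate(zip(*seqs)):
        ranked = Counter(column).most_common()
        top_char, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        if tied:
            msg = f"tied most common character at position {i + 1}"
            if error_without_strict_majority:
                raise ValueError(msg)  # assumed exception type
            warnings.warn(msg)
        consensus.append(top_char)
    return "".join(consensus)
```

With three sequences "ACGT", "ACGA" and "ACGT", the sketch returns "ACGT" because T wins the last column two to one.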

def filter_similar_hd(sequences, n, progress_bar=False, ignore=None)

Iterate through sequences excluding those that have a hamming distance of less than n to a sequence already seen. Return the non-excluded sequences.

Args

sequences (iterable of str / Bio.SeqRecord)
progress_bar (bool)
ignore (set or None)

Returns

list
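
A greedy filter of this kind might look like the sketch below. It handles plain strings only and assumes each sequence is compared against the sequences retained so far; `filter_similar_sketch` is a hypothetical name and the code is not taken from the library.

```python
def filter_similar_sketch(sequences, n, ignore=None):
    """Keep a sequence only if it is at least n mismatches from every kept one."""
    ignored = set() if ignore is None else set(ignore)

    def distance(a, b):
        # Count mismatches, skipping sites where either character is ignored.
        return sum(
            1
            for x, y in zip(a, b)
            if x != y and x not in ignored and y not in ignored
        )

    kept = []
    for seq in sequences:
        if all(distance(seq, other) >= n for other in kept):
            kept.append(seq)
    return kept
```

Note that the result depends on input order: the first member of any cluster of similar sequences is the one that survives.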

def find_mutations(*args, **kwargs)

def find_substitutions(a, b, offset=0)

Find mutations between strings a and b.

Args

a (str)
b (str)
offset (int)

Raises

ValueError: If the lengths of a and b differ.

Returns

list of tuples. Each tuple looks like ("N", 145, "K"). The number is the 1-indexed position of the mutation. The first element is the character in a. The last element is the character in b.
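
The return format can be illustrated with a short standalone sketch. It assumes `offset` is simply added to the 1-indexed position, which the documentation above does not state; `find_substitutions_sketch` is a hypothetical name.

```python
def find_substitutions_sketch(a, b, offset=0):
    """List (a_char, position, b_char) for every site where a and b differ."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return [
        (x, i + offset, y)  # assumption: offset shifts the reported position
        for i, (x, y) in enumerate(zip(a, b), start=1)
        if x != y
    ]
```

Under that assumption, comparing "AN" with "AK" using offset=143 yields [("N", 145, "K")].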

def group_sequences_by_character_at_site(seqs: dict[str, str], site: int) ‑> dict[str, str]

Group sequences by the character they have at a particular site.

Args

seqs: Dict of sequence names -> sequence.
site: 1-based.

Returns

dict containing char at site -> sequence name.

def grouped_sample(population, n, key=None)

Randomly sample a population taking at most n elements from each group.

Args

population (iterable)
n : int: Take at most n samples from each group.
key : callable: Function by which to group elements. Default (None).

Returns

list
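
One way to implement this kind of sampling is sketched below. It assumes that `key=None` groups elements by their own value, and it adds a hypothetical `rng` parameter for reproducibility that is not part of the documented signature.

```python
import random
from collections import defaultdict


def grouped_sample_sketch(population, n, key=None, rng=None):
    """Sample at most n elements from each group of population."""
    rng = rng if rng is not None else random  # hypothetical parameter
    groups = defaultdict(list)
    for item in population:
        # Assumption: with no key, an element's own value defines its group.
        groups[item if key is None else key(item)].append(item)

    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample
```

Groups smaller than n contribute all of their members, so the result length is the sum of min(n, group size) over all groups.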

def hamming_dist(a: str, b: str, ignore: Iterable[str] = '-X', case_sensitive: bool = True, per_site: bool = False) ‑> float

The hamming distance between a and b.

Args

a: Sequence.
b: Sequence.
ignore: String containing characters to ignore. If there is a mismatch where one string has a character in ignore, this does not contribute to the hamming distance.
per_site: Divide the hamming distance by the length of a and b, minus the number of sites with ignored characters.

Returns

float
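
The effect of `ignore` and `per_site` can be seen in the following standalone sketch. It omits `case_sensitive`, and the length check is an assumption; it is meant to illustrate the documented arithmetic rather than reproduce the library's code.

```python
def hamming_dist_sketch(a, b, ignore="-X", per_site=False):
    """Hamming distance that skips sites with an ignored character."""
    if len(a) != len(b):  # assumption: unequal lengths are an error
        raise ValueError("a and b must have the same length")
    ignored = set(ignore)
    # Sites where neither sequence has an ignored character.
    compared = [
        (x, y) for x, y in zip(a, b) if x not in ignored and y not in ignored
    ]
    dist = sum(x != y for x, y in compared)
    if per_site:
        # Denominator: length minus the number of ignored sites.
        # Raises ZeroDivisionError if every site is ignored.
        return dist / len(compared)
    return float(dist)
```

For "ACGT" against "AC-A" the gap site is skipped, leaving one mismatch over three compared sites: 1.0 in absolute terms, or 1/3 per site.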

def hamming_dist_lt(a, b, n, ignore=None)

Test if hamming distance between a and b is less than n. This is case sensitive and does not check a and b have matching lengths.

Args

a (iterable)
b (iterable)
n (scalar)
ignore (set or None)

Returns

bool
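
A threshold test like this can stop as soon as the answer is known. The sketch below shows that idea; whether the library exits early in the same way is an assumption, and `hamming_dist_lt_sketch` is a hypothetical name.

```python
def hamming_dist_lt_sketch(a, b, n, ignore=None):
    """Return True if a and b differ at fewer than n compared sites."""
    ignored = set() if ignore is None else set(ignore)
    dist = 0
    # zip stops at the shorter input, so lengths are never checked.
    for x, y in zip(a, b):
        if x != y and x not in ignored and y not in ignored:
            dist += 1
            if dist >= n:
                return False  # threshold reached, no need to look further
    return dist < n
```

Because of the `zip` truncation, comparing "AAAA" with "AA" only examines the first two sites.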

def idx_first_and_last_non_gap(sequence: str) ‑> tuple[int, int]

Returns the indices of the first and last non-gap ('-') characters in a sequence. If all characters are gaps, returns None for both indices.

Args

sequence : str: The input sequence containing gaps ('-').

Returns

tuple: A tuple (first_non_gap_index, last_non_gap_index).
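
As a quick illustration of the behaviour described above, here is a minimal sketch that assumes the returned indices are 0-based, which the documentation does not state explicitly.

```python
def idx_first_and_last_non_gap_sketch(sequence):
    """Return (first, last) indices of non-gap characters, or (None, None)."""
    non_gap = [i for i, char in enumerate(sequence) if char != "-"]
    if not non_gap:
        return None, None
    return non_gap[0], non_gap[-1]
```

For "--AC-G--" the sketch returns (2, 5): internal gaps do not affect the result.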

def load_fasta(path: str, translate_nt: bool = False, convert_to_upper: bool = False, start: int = 0) ‑> dict[str, str]

Load fasta file sequences.

Args

path: Path to fasta file.
translate_nt: Translate nucleotide sequences.
convert_to_upper: Force sequences to be uppercase.
start: The (0-based) index of the first character of each record to take. This selection is done before any translation. (Default=0).

def pairwise_hamming_dists(collection, ignore='-X', per_site=False)

Compute all pairwise hamming distances between items in collection.

Args

collection (iterable)

Returns

list of hamming distances
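
All unordered pairs can be enumerated with `itertools.combinations`, as in this sketch. The order of the returned distances is an assumption, `per_site` is left out, and the code is not the library's own.

```python
from itertools import combinations


def pairwise_hamming_sketch(collection, ignore="-X"):
    """Hamming distance for every unordered pair, in combinations() order."""
    ignored = set(ignore)

    def distance(a, b):
        return sum(
            1
            for x, y in zip(a, b)
            if x != y and x not in ignored and y not in ignored
        )

    return [distance(a, b) for a, b in combinations(collection, 2)]
```

A collection of k items produces k * (k - 1) / 2 distances.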

def plot_amino_acid_colors(ax: matplotlib.axes.Axes = None) ‑> matplotlib.axes.Axes

Simple plot to show amino acid colors.

def sloppy_translate(sequence)

Translate a nucleotide sequence.

Don't check that the sequence length is a multiple of three. If any 'codon' contains any character not in [ACTG] then return X.

Args

sequence : str: Lower or upper case.

Returns

(str)
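
A permissive translation along these lines can be sketched with the standard genetic code. This standalone version makes two assumptions not stated above: stop codons are written as "*", and a trailing partial codon is translated as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sloppy_translate_sketch(sequence):
    """Translate codon by codon, emitting X for anything unrecognised."""
    sequence = sequence.upper()
    codons = (sequence[i:i + 3] for i in range(0, len(sequence), 3))
    # Codons with non-ACGT characters, or fewer than three bases, become X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

For example, "atgNNNtaa" translates to "MX*" in this sketch.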

def variable_sites(seq: pandas.core.series.Series, max_biggest_prop: float = 0.95, ignore: str = '-X') ‑> Generator[int, None, None]

Find variable sites among sequences. Returns 1-indexed sites.

Args

seq: Sequences.
max_biggest_prop: Don't include sites where a single character has a proportion above this.
ignore: Characters to exclude when calculating proportions.
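
The proportion test can be sketched without pandas by iterating over alignment columns. This version accepts any iterable of equal-length strings and skips sites where every character is ignored; both choices are assumptions rather than documented behaviour.

```python
from collections import Counter


def variable_sites_sketch(seqs, max_biggest_prop=0.95, ignore="-X"):
    """Yield 1-indexed sites whose most common character is not too dominant."""
    ignored = set(ignore)
    for site, column in enumerate(zip(*seqs), start=1):
        counts = Counter(char for char in column if char not in ignored)
        total = sum(counts.values())
        if total == 0:
            continue  # assumption: sites with only ignored characters are skipped
        if max(counts.values()) / total <= max_biggest_prop:
            yield site
```

Because ignored characters are dropped before proportions are computed, a column such as G, G, -, G counts as fully conserved.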

def write_fasta(path: str, records: dict[str, str]) ‑> None

Write a fasta file.

Args

path: Path to fasta file to write.
records: A dict, the keys will become fasta headers, values will be sequences.

Classes

class ReferenceAlignment (aligned, internal_gap_in_ref)

ReferenceAlignment(aligned, internal_gap_in_ref)

Ancestors

builtins.tuple

Instance variables

var aligned: Alias for field number 0
var internal_gap_in_ref: Alias for field number 1

class Substitution (*args)

Change of a character at a site.

Instantiate using either one or three arguments: Substitution("N145K") or Substitution("N", 145, "K")
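
Supporting both call styles might be done as in the sketch below. The class name `SubstitutionSketch`, the attribute names `a`, `site` and `b`, and the regular expression are all hypothetical; the real class may store and validate its fields differently.

```python
import re


class SubstitutionSketch:
    """Illustrative stand-in accepting "N145K" or ("N", 145, "K")."""

    def __init__(self, *args):
        if len(args) == 1:
            # One non-digit, one or more digits, one non-digit.
            match = re.fullmatch(r"(\D)(\d+)(\D)", args[0])
            if match is None:
                raise ValueError(f"cannot parse substitution: {args[0]!r}")
            a, site, b = match.groups()
        elif len(args) == 3:
            a, site, b = args
        else:
            raise ValueError("expected one or three arguments")
        self.a, self.site, self.b = a, int(site), b

    def __str__(self):
        return f"{self.a}{self.site}{self.b}"
```

Both SubstitutionSketch("N145K") and SubstitutionSketch("N", 145, "K") then render as "N145K".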

class TiedCounter (iterable=None, /, **kwds)

Dict subclass for counting hashable items. Sometimes called a bag or multiset. Elements are stored as dictionary keys and their counts are stored as dictionary values.

>>> c = Counter('abcdeabcdabcaba')  # count elements from a string

>>> c.most_common(3)                # three most common elements
[('a', 5), ('b', 4), ('c', 3)]
>>> sorted(c)                       # list all unique elements
['a', 'b', 'c', 'd', 'e']
>>> ''.join(sorted(c.elements()))   # list elements with repetitions
'aaaaabbbbcccdde'
>>> sum(c.values())                 # total of all counts
15

>>> c['a']                          # count of letter 'a'
5
>>> for elem in 'shazam':           # update counts from an iterable
...     c[elem] += 1                # by adding 1 to each element's count
>>> c['a']                          # now there are seven 'a'
7
>>> del c['b']                      # remove all 'b'
>>> c['b']                          # now there are zero 'b'
0

>>> d = Counter('simsalabim')       # make another counter
>>> c.update(d)                     # add in the second counter
>>> c['a']                          # now there are nine 'a'
9

>>> c.clear()                       # empty the counter
>>> c
Counter()

Note: If a count is set to zero or reduced to zero, it will remain in the counter until the entry is deleted or the counter is cleared:

>>> c = Counter('aaabbc')
>>> c['b'] -= 2                     # reduce the count of 'b' by two
>>> c.most_common()                 # 'b' is still in, but its count is zero
[('a', 3), ('c', 1), ('b', 0)]

Create a new, empty Counter object. And if given, count elements from an input iterable. Or, initialize the count from another mapping of elements to their counts.

>>> c = Counter()                           # a new, empty counter
>>> c = Counter('gallahad')                 # a new counter from an iterable
>>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
>>> c = Counter(a=4, b=2)                   # a new counter from keyword args

Ancestors

collections.Counter
builtins.dict

Methods

def most_common(self, n: int | None = None) ‑> list[tuple[typing.Any, int]]

If n=1 and more than one item has the maximum count, return all of them, not just one. If n is not 1, behave like the normal Counter.most_common.
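
The tie-aware behaviour can be sketched as a small `collections.Counter` subclass. `TiedCounterSketch` is a hypothetical name, and the ordering of tied items (insertion order here) is an assumption.

```python
from collections import Counter


class TiedCounterSketch(Counter):
    """Counter whose most_common(1) returns every item tied for first place."""

    def most_common(self, n=None):
        if n == 1 and self:
            top = max(self.values())
            return [(item, count) for item, count in self.items() if count == top]
        # Any other n falls back to the standard Counter behaviour.
        return super().most_common(n)
```

For the string "aabbc", most_common(1) returns [("a", 2), ("b", 2)] in this sketch, whereas a plain Counter would return only [("a", 2)].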