Module eremitalpa.bio

Functions

def align_to_reference(reference_seq: str, input_seq: str) ‑> ReferenceAlignment

Align an input sequence to a reference. Returns the aligned input sequence trimmed to the region of the reference.

Args

reference_seq (str)
Reference sequence.
input_seq (str)
Sequence to align to the reference.

Raises

ValueError
If internal gaps are introduced in the reference during alignment.

Returns

ReferenceAlignment tuple containing:
'aligned' - the aligned input sequence
'internal_gap_in_ref' - boolean indicating if a gap was inserted in the reference

def consensus_seq(seqs: Iterable[str], case_sensitive: bool = True, **kwds) ‑> str

Compute the consensus of sequences.

Args

seqs
Sequences.
case_sensitive
If False, all seqs are converted to lowercase.
error_without_strict_majority
Raise an error if a position has a tied most common character. If set to False, a warning is raised and a single value is chosen.
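
The per-position counting described above can be sketched as follows. This is a minimal illustration rather than the eremitalpa implementation: the name `consensus_sketch` is hypothetical, and raising ValueError on any tied position is an assumption made to keep the example short.

```python
from collections import Counter


def consensus_sketch(seqs, case_sensitive=True):
    """Hypothetical stand-in for a per-column consensus; not the library code."""
    seqs = list(seqs)
    if not case_sensitive:
        seqs = [s.lower() for s in seqs]
    consensus = []
    for column in zip(*seqs):  # zip stops at the shortest sequence
        ranked = Counter(column).most_common()
        # Assumption: a tied top count is always treated as an error here.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            raise ValueError(f"tied most common character in column {column}")
        consensus.append(ranked[0][0])
    return "".join(consensus)


print(consensus_sketch(["ACGT", "ACGA", "ACCA"]))
```

With `case_sensitive=False` the sketch lowercases every sequence first, so the consensus it returns is lowercase as well.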

def filter_similar_hd(sequences, n, progress_bar=False, ignore=None)

Iterate through sequences excluding those that have a hamming distance of less than n to a sequence already seen. Return the non-excluded sequences.

Args

sequences (iterable of str / Bio.SeqRecord)
progress_bar (bool)
ignore (set or None)

Returns

list
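
The greedy filtering described above might look like the sketch below. It assumes plain strings of equal length and leaves out the `ignore` and `progress_bar` options; `filter_similar_sketch` is a hypothetical name, not part of the module.

```python
def filter_similar_sketch(sequences, n):
    """Keep a sequence only if it is at least n mismatches from every kept one."""
    kept = []
    for seq in sequences:
        too_close = any(
            sum(x != y for x, y in zip(seq, other)) < n for other in kept
        )
        if not too_close:
            kept.append(seq)
    return kept


print(filter_similar_sketch(["AAAA", "AAAT", "TTTT", "AATT"], 2))
```

Because each sequence is compared only with sequences already kept, the result depends on input order.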

def find_mutations(*args, **kwargs)

def find_substitutions(a, b, offset=0)

Find mutations between strings a and b.

Args

a (str)
b (str)
offset (int)

Raises

ValueError
If lengths of a and b differ.

Returns

list of tuples
Each tuple looks like ("N", 145, "K"). The number is the 1-indexed position of the mutation, the first element is the character in a, and the last element is the character in b.
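
A minimal sketch of the comparison described above. It assumes that `offset` is simply added to the 1-indexed position, which the documentation does not state; `find_substitutions_sketch` is a hypothetical name.

```python
def find_substitutions_sketch(a, b, offset=0):
    """Return (a_char, position, b_char) for every mismatching site."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return [
        (x, i + offset, y)  # assumption: offset shifts the reported position
        for i, (x, y) in enumerate(zip(a, b), start=1)
        if x != y
    ]


print(find_substitutions_sketch("NKT", "KKT", offset=144))
```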

def group_sequences_by_character_at_site(seqs: dict[str, str], site: int) ‑> dict[str, str]

Group sequences by the character they have at a particular site.

Args

seqs
Dict of sequence names -> sequence.
site
1-based.

Returns

dict containing char at site -> sequence name.

def grouped_sample(population, n, key=None)

Randomly sample a population taking at most n elements from each group.

Args

population (iterable)
n : int
Take at most n samples from each group.
key : callable
Function by which to group elements. Default: None.

Returns

list
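
One way to sample per group is sketched below. The behaviour for `key=None` (grouping elements by their own value) and the extra `rng` parameter are assumptions added for the illustration; neither is documented above.

```python
import random
from collections import defaultdict


def grouped_sample_sketch(population, n, key=None, rng=None):
    """Take at most n randomly chosen elements from each group."""
    rng = rng or random.Random()  # hypothetical parameter, for reproducibility
    groups = defaultdict(list)
    for item in population:
        # Assumption: without a key, each distinct value is its own group.
        groups[item if key is None else key(item)].append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample


picked = grouped_sample_sketch(
    ["a1", "a2", "a3", "b1"], 2, key=lambda s: s[0], rng=random.Random(0)
)
print(picked)
```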

def hamming_dist(a: str,
b: str,
ignore: Iterable[str] = '-X',
case_sensitive: bool = True,
per_site: bool = False) ‑> float

The hamming distance between a and b.

Args

a
Sequence.
b
Sequence.
ignore
String containing characters to ignore. If there is a mismatch where one string has a character in ignore, this does not contribute to the hamming distance.
per_site
Divide the hamming distance by the length of a and b, minus the number of sites with ignored characters.

Returns

float
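
The effect of `ignore` and `per_site` can be illustrated with a short sketch. It assumes a and b have equal length and omits `case_sensitive`; `hamming_dist_sketch` is a hypothetical name, not the library function.

```python
def hamming_dist_sketch(a, b, ignore="-X", per_site=False):
    """Count mismatches, skipping sites where either string has an ignored character."""
    compared = [(x, y) for x, y in zip(a, b) if x not in ignore and y not in ignore]
    mismatches = sum(x != y for x, y in compared)
    if per_site:
        # Denominator: sites without ignored characters (zero if all are ignored).
        return mismatches / len(compared)
    return float(mismatches)


print(hamming_dist_sketch("AC-T", "ACGA"))
print(hamming_dist_sketch("AC-T", "ACGA", per_site=True))
```

In the example, the third site is skipped because a has a gap there, leaving one mismatch over three compared sites.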

def hamming_dist_lt(a, b, n, ignore=None)

Test if the hamming distance between a and b is less than n. This is case sensitive and does not check that a and b have matching lengths.

Args

a (iterable)
b (iterable)
n (scalar)
ignore (set or None)

Returns

bool
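
A threshold test like this can stop as soon as n mismatches have been seen. The sketch below shows that idea; whether the library stops early in the same way is an assumption, and `hamming_dist_lt_sketch` is a hypothetical name.

```python
def hamming_dist_lt_sketch(a, b, n, ignore=None):
    """Return True if a and b differ at fewer than n non-ignored sites."""
    ignore = ignore or set()
    mismatches = 0
    for x, y in zip(a, b):  # zip does not check that lengths match
        if x in ignore or y in ignore:
            continue
        if x != y:
            mismatches += 1
            if mismatches >= n:
                return False  # the distance can no longer be below n
    return mismatches < n


print(hamming_dist_lt_sketch("AAAA", "AAAT", 2))
print(hamming_dist_lt_sketch("AAAA", "AATT", 2))
```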

def idx_first_and_last_non_gap(sequence: str) ‑> tuple[int, int]

Returns the indices of the first and last non-gap ('-') characters in a sequence. If all characters are gaps, returns None for both indices.

Args

sequence : str
The input sequence containing gaps ('-').

Returns

tuple
A tuple (first_non_gap_index, last_non_gap_index).
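
A minimal sketch of locating the non-gap region. It assumes the indices are 0-based and that the last index is inclusive, neither of which is stated above; `first_last_non_gap_sketch` is a hypothetical name.

```python
def first_last_non_gap_sketch(sequence):
    """Return 0-based indices of the first and last characters that are not '-'."""
    without_leading = sequence.lstrip("-")
    if not without_leading:  # empty, or nothing but gaps
        return None, None
    first = len(sequence) - len(without_leading)
    last = len(sequence.rstrip("-")) - 1
    return first, last


print(first_last_non_gap_sketch("--AC-G--"))
```

Slicing with `sequence[first:last + 1]` then yields the region between the flanking gaps, internal gaps included.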

def load_fasta(path: str,
translate_nt: bool = False,
convert_to_upper: bool = False,
start: int = 0) ‑> dict[str, str]

Load fasta file sequences.

Args

path
Path to fasta file.
translate_nt
Translate nucleotide sequences.
convert_to_upper
Force sequences to be uppercase.
start
The (0-based) index of the first character of each record to take. This selection is done before any translation. (Default=0).

def pairwise_hamming_dists(collection, ignore='-X', per_site=False)

Compute all pairwise hamming distances between items in collection.

Args

collection (iterable)

Returns

list of hamming distances
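
All pairwise distances can be produced with `itertools.combinations`, as in the sketch below. It assumes each unordered pair is visited once, in the order `combinations` yields them, and it leaves out `ignore` and `per_site`; the ordering used by the library is not documented here.

```python
from itertools import combinations


def pairwise_sketch(collection):
    """Hamming distance for every unordered pair of items (hypothetical helper)."""
    items = list(collection)
    return [
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(items, 2)
    ]


print(pairwise_sketch(["AA", "AT", "TT"]))
```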

def plot_amino_acid_colors(ax: matplotlib.axes.Axes = None) ‑> matplotlib.axes.Axes

Simple plot to show amino acid colors.

def sloppy_translate(sequence)

Translate a nucleotide sequence.

Don't check that the sequence length is a multiple of three. If any 'codon' contains any character not in [ACTG] then return X.

Args

sequence : str
Lower or upper case.

Returns

(str)
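
The tolerant translation described above could be sketched like this, using the standard genetic code. Emitting "X" for a trailing partial codon is an assumption (the documentation only says the length is not checked), and `sloppy_translate_sketch` is a hypothetical name.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sloppy_translate_sketch(sequence):
    """Translate codon by codon; anything not in the table becomes 'X'."""
    sequence = sequence.upper()
    codons = (sequence[i:i + 3] for i in range(0, len(sequence), 3))
    # Assumption: an incomplete final codon is also reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(sloppy_translate_sketch("atgaaaNNNtaa"))
```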

def variable_sites(seq: pandas.core.series.Series,
max_biggest_prop: float = 0.95,
ignore: str = '-X') ‑> Generator[int, None, None]

Find variable sites among sequences. Returns 1-indexed sites.

Args

seq
Sequences.
max_biggest_prop
Don't include sites where a single character has a proportion above this.
ignore
Characters to exclude when calculating proportions.
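
The proportion test can be illustrated without pandas. The sketch below takes any iterable of equal-length strings instead of a Series and skips sites that contain only ignored characters; both choices are assumptions, and `variable_sites_sketch` is a hypothetical name.

```python
from collections import Counter


def variable_sites_sketch(seqs, max_biggest_prop=0.95, ignore="-X"):
    """Yield 1-indexed sites where no single character exceeds max_biggest_prop."""
    for site, column in enumerate(zip(*seqs), start=1):
        counts = Counter(c for c in column if c not in ignore)
        total = sum(counts.values())
        if total == 0:
            continue  # assumption: sites with only ignored characters are skipped
        if max(counts.values()) / total <= max_biggest_prop:
            yield site


print(list(variable_sites_sketch(["ACG", "ACT", "AC-", "AGT"])))
```

In the example, site 1 is invariant, while sites 2 and 3 have most common characters at proportions 0.75 and about 0.67.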

def write_fasta(path: str, records: dict[str, str]) ‑> None

Write a fasta file.

Args

path
Path to fasta file to write.
records
A dict, the keys will become fasta headers, values will be sequences.
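
The dict-based round trip implied by load_fasta and write_fasta can be sketched with plain file handling. The helpers below are hypothetical stand-ins, not the library functions: they skip translation and write each sequence on a single line, which is an assumption about layout.

```python
import os
import tempfile


def write_fasta_sketch(path, records):
    """Write each key as a '>' header followed by its sequence."""
    with open(path, "w") as handle:
        for header, seq in records.items():
            handle.write(f">{header}\n{seq}\n")


def read_fasta_sketch(path, convert_to_upper=False, start=0):
    """Read headers and sequences into a dict, then slice from `start`."""
    records, header = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                records[header] = ""
            elif header is not None:
                records[header] += line  # join multi-line sequences
    return {
        h: (s[start:].upper() if convert_to_upper else s[start:])
        for h, s in records.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    write_fasta_sketch(path, {"seq1": "atgaaa", "seq2": "atgccc"})
    loaded = read_fasta_sketch(path, convert_to_upper=True, start=3)

print(loaded)
```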

Classes

class ReferenceAlignment (aligned, internal_gap_in_ref)

ReferenceAlignment(aligned, internal_gap_in_ref)

Ancestors

  • builtins.tuple

Instance variables

var aligned

Alias for field number 0

var internal_gap_in_ref

Alias for field number 1

class Substitution (*args)

Change of a character at a site.

Instantiate using either one or three arguments: Substitution("N145K") or Substitution("N", 145, "K")
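
Accepting both call forms might be handled as in the sketch below. The class name `SubstitutionSketch` and the attribute names `lost`, `site` and `gained` are hypothetical; the attributes of the real class are not listed here.

```python
import re


class SubstitutionSketch:
    """Hypothetical illustration of a one-or-three argument constructor."""

    def __init__(self, *args):
        if len(args) == 1:
            # Single string form, e.g. "N145K".
            match = re.fullmatch(r"(\D)(\d+)(\D)", args[0])
            if match is None:
                raise ValueError(f"cannot parse substitution: {args[0]!r}")
            lost, site, gained = match.groups()
        elif len(args) == 3:
            lost, site, gained = args
        else:
            raise ValueError("expected one or three arguments")
        self.lost, self.site, self.gained = lost, int(site), gained


sub = SubstitutionSketch("N145K")
print(sub.lost, sub.site, sub.gained)
```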

class TiedCounter (iterable=None, /, **kwds)

Dict subclass for counting hashable items. Sometimes called a bag or multiset. Elements are stored as dictionary keys and their counts are stored as dictionary values.

>>> c = Counter('abcdeabcdabcaba')  # count elements from a string
>>> c.most_common(3)                # three most common elements
[('a', 5), ('b', 4), ('c', 3)]
>>> sorted(c)                       # list all unique elements
['a', 'b', 'c', 'd', 'e']
>>> ''.join(sorted(c.elements()))   # list elements with repetitions
'aaaaabbbbcccdde'
>>> sum(c.values())                 # total of all counts
15
>>> c['a']                          # count of letter 'a'
5
>>> for elem in 'shazam':           # update counts from an iterable
...     c[elem] += 1                # by adding 1 to each element's count
>>> c['a']                          # now there are seven 'a'
7
>>> del c['b']                      # remove all 'b'
>>> c['b']                          # now there are zero 'b'
0
>>> d = Counter('simsalabim')       # make another counter
>>> c.update(d)                     # add in the second counter
>>> c['a']                          # now there are nine 'a'
9
>>> c.clear()                       # empty the counter
>>> c
Counter()

Note: If a count is set to zero or reduced to zero, it will remain in the counter until the entry is deleted or the counter is cleared:

>>> c = Counter('aaabbc')
>>> c['b'] -= 2                     # reduce the count of 'b' by two
>>> c.most_common()                 # 'b' is still in, but its count is zero
[('a', 3), ('c', 1), ('b', 0)]

Create a new, empty Counter object. And if given, count elements from an input iterable. Or, initialize the count from another mapping of elements to their counts.

>>> c = Counter()                           # a new, empty counter
>>> c = Counter('gallahad')                 # a new counter from an iterable
>>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
>>> c = Counter(a=4, b=2)                   # a new counter from keyword args

Ancestors

  • collections.Counter
  • builtins.dict

Methods

def most_common(self, n: int | None = None) ‑> list[tuple[typing.Any, int]]

If n=1 and more than one item has the maximum count, return all of them, not just one. If n is not 1, behave like the normal Counter.most_common.
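
The tie-aware behaviour can be sketched by subclassing collections.Counter. This is an illustration under the description above, not the library source; `TiedCounterSketch` is a hypothetical name.

```python
from collections import Counter


class TiedCounterSketch(Counter):
    """Counter whose most_common(1) returns every item tied for the top count."""

    def most_common(self, n=None):
        if n == 1 and self:
            top = max(self.values())
            return [(item, count) for item, count in self.items() if count == top]
        return super().most_common(n)


print(Counter("aabbc").most_common(1))
print(TiedCounterSketch("aabbc").most_common(1))
```

A plain Counter reports only one of the tied items for `n=1`, whereas the sketch reports both 'a' and 'b'.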