Module eremitalpa.bio
Functions
def align_to_reference(reference_seq: str, input_seq: str) ‑> ReferenceAlignment
-
Align an input sequence to a reference. Returns the aligned input sequence trimmed to the region of the reference.
Args
reference_seq (str)
input_seq (str)
Raises
ValueError
- If internal gaps are introduced in the reference during alignment.
Returns
ReferenceAlignment tuple containing:
'aligned' - the aligned input sequence
'internal_gap_in_ref' - boolean indicating if a gap was inserted in the reference
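The trimming step can be illustrated on a pair of sequences that have already been aligned. This is only a sketch: the alignment itself is not shown, the names `ReferenceAlignmentSketch` and `trim_to_reference_sketch` are hypothetical, and reporting an internal gap through the flag (rather than raising) is an assumption.

```python
from typing import NamedTuple


class ReferenceAlignmentSketch(NamedTuple):
    """Hypothetical stand-in for ReferenceAlignment."""
    aligned: str
    internal_gap_in_ref: bool


def trim_to_reference_sketch(aligned_ref: str, aligned_input: str) -> ReferenceAlignmentSketch:
    """Trim an already-aligned input to the columns spanned by the reference."""
    # Columns where the reference has a real character (assumes at least one).
    non_gap = [i for i, char in enumerate(aligned_ref) if char != "-"]
    first, last = non_gap[0], non_gap[-1]
    # A gap between the first and last reference character is "internal".
    internal_gap = "-" in aligned_ref[first:last + 1]
    return ReferenceAlignmentSketch(aligned_input[first:last + 1], internal_gap)
```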
def consensus_seq(seqs: Iterable[str], case_sensitive: bool = True, **kwds) ‑> str
-
Compute the consensus of sequences.
Args
seqs
- Sequences.
case_sensitive
- If False, all seqs are converted to lowercase.
error_without_strict_majority
- Raise an error if a position has a tied most common character. If set to False, a warning is raised and a single value is chosen.
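A column-wise consensus with tie handling might look like the following sketch. It is not the module's implementation: `consensus_sketch` is a hypothetical name, and the default of `error_without_strict_majority=True` is an assumption.

```python
import warnings
from collections import Counter


def consensus_sketch(seqs, case_sensitive=True, error_without_strict_majority=True):
    """Most common character at each site of equal-length sequences."""
    seqs = list(seqs)
    if not case_sensitive:
        seqs = [seq.lower() for seq in seqs]
    consensus = []
    # zip(*seqs) yields one column (site) at a time.
    for column in zip(*seqs):
        (top, top_count), *rest = Counter(column).most_common()
        if rest and rest[0][1] == top_count:
            message = f"tied most common character in column {column}"
            if error_without_strict_majority:
                raise ValueError(message)
            warnings.warn(message)
        consensus.append(top)
    return "".join(consensus)
```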
def filter_similar_hd(sequences, n, progress_bar=False, ignore=None)
-
Iterate through sequences excluding those that have a hamming distance of less than n to a sequence already seen. Return the non-excluded sequences.
Args
sequences (iterable of str / Bio.SeqRecord)
progress_bar (bool)
ignore (set or None)
Returns
list
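The filtering described above is a greedy pass over the input. The sketch below assumes plain strings and compares each candidate only against sequences that have already been kept; whether the module also compares against excluded sequences is not stated. `filter_similar_sketch` is a hypothetical name.

```python
def filter_similar_sketch(sequences, n, ignore=None):
    """Keep sequences that are at least n mismatches from every kept sequence."""
    ignore = set() if ignore is None else set(ignore)

    def dist_lt(a, b):
        # Count mismatches, stopping as soon as the threshold is reached.
        dist = 0
        for x, y in zip(a, b):
            if x != y and x not in ignore and y not in ignore:
                dist += 1
                if dist >= n:
                    return False
        return dist < n

    kept = []
    for seq in sequences:
        if not any(dist_lt(seq, seen) for seen in kept):
            kept.append(seq)
    return kept
```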
def find_mutations(*args, **kwargs)
def find_substitutions(a, b, offset=0)
-
Find mutations between strings a and b.
Args
a (str)
b (str)
offset (int)
Raises
ValueError if lengths of a and b differ.
Returns
list of tuples
- Tuples are like ("N", 145, "K"). The number indicates the 1-indexed position of the mutation. The first element is the character in a. The last element is the character in b.
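As a rough illustration of the return format, the sketch below walks both strings in parallel. `find_substitutions_sketch` is a hypothetical name, and adding `offset` to the 1-indexed position is an assumption about how that argument is applied.

```python
def find_substitutions_sketch(a, b, offset=0):
    """List (a_char, position, b_char) for every site where a and b differ."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return [
        (x, position + offset, y)
        for position, (x, y) in enumerate(zip(a, b), start=1)  # 1-indexed
        if x != y
    ]
```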
def group_sequences_by_character_at_site(seqs: dict[str, str], site: int) ‑> dict[str, str]
-
Group sequences by the character they have at a particular site.
Args
seqs
- Dict of sequence names -> sequence.
site
- 1-based.
Returns
dict containing char at site -> sequence name.
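A grouping of this kind could be sketched as below. Note the assumption: the signature gives `dict[str, str]`, but because several sequences can share a character this sketch collects names into lists. `group_by_site_sketch` is a hypothetical name.

```python
from collections import defaultdict


def group_by_site_sketch(seqs, site):
    """Map each character found at a 1-based site to the names that carry it."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq[site - 1]].append(name)  # convert 1-based site to 0-based index
    return dict(groups)
```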
def grouped_sample(population, n, key=None)
-
Randomly sample a population taking at most n elements from each group.
Args
population (iterable)
n (int)
- Take at most n samples from each group.
key (callable)
- Function by which to group elements. Default (None).
Returns
list
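One way to picture the sampling is to bucket the population first and then sample each bucket. In this sketch, grouping elements by themselves when `key` is None is an assumption, and the `rng` parameter and the name `grouped_sample_sketch` are hypothetical additions for illustration.

```python
import random
from collections import defaultdict


def grouped_sample_sketch(population, n, key=None, rng=random):
    """Take at most n random elements from each group of the population."""
    groups = defaultdict(list)
    for item in population:
        groups[item if key is None else key(item)].append(item)
    sample = []
    for members in groups.values():
        # Small groups are taken whole; larger groups are subsampled.
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample
```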
def hamming_dist(a: str,
b: str,
ignore: Iterable[str] = '-X',
case_sensitive: bool = True,
per_site: bool = False) ‑> float
-
The hamming distance between a and b.
Args
a
- Sequence.
b
- Sequence.
ignore
- String containing characters to ignore. If there is a mismatch where one string has a character in ignore, this does not contribute to the hamming distance.
per_site
- Divide the hamming distance by the length of a and b, minus the number of sites with ignored characters.
Returns
float
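The interaction between `ignore` and `per_site` can be sketched as follows. This is an illustration under assumptions, not the module's code: `hamming_dist_sketch` is a hypothetical name, the length check is an assumption, and upper-casing both inputs and `ignore` when `case_sensitive` is False is one possible reading.

```python
def hamming_dist_sketch(a, b, ignore="-X", case_sensitive=True, per_site=False):
    """Count mismatching sites, skipping sites with an ignored character."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    if not case_sensitive:
        a, b, ignore = a.upper(), b.upper(), "".join(ignore).upper()
    ignore = set(ignore)
    mismatches = 0
    compared = 0
    for x, y in zip(a, b):
        if x in ignore or y in ignore:
            continue  # this site contributes to neither count
        compared += 1
        mismatches += x != y
    if per_site:
        # Raises ZeroDivisionError if every site was ignored.
        return mismatches / compared
    return float(mismatches)
```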
def hamming_dist_lt(a, b, n, ignore=None)
-
Test if hamming distance between a and b is less than n. This is case sensitive and does not check a and b have matching lengths.
Args
a (iterable)
b (iterable)
n (scalar)
ignore (set or None)
Returns
bool
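A threshold test like this can stop counting as soon as the answer is known. The sketch below shows that early exit; `hamming_dist_lt_sketch` is a hypothetical name and the early exit itself is an assumption about why a separate function exists.

```python
def hamming_dist_lt_sketch(a, b, n, ignore=None):
    """Return True if a and b differ at fewer than n sites."""
    ignore = set() if ignore is None else set(ignore)
    dist = 0
    # zip stops at the shorter input, so lengths are not checked.
    for x, y in zip(a, b):
        if x != y and x not in ignore and y not in ignore:
            dist += 1
            if dist >= n:
                return False  # threshold reached, no need to look further
    return dist < n
```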
def idx_first_and_last_non_gap(sequence: str) ‑> tuple[int, int]
-
Returns the indices of the first and last non-gap ('-') characters in a sequence. If all characters are gaps, returns None for both indices.
Args
sequence (str)
- The input sequence containing gaps ('-').
Returns
tuple
- A tuple (first_non_gap_index, last_non_gap_index).
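The behaviour described above, including the all-gap case, can be sketched in a few lines. `idx_first_and_last_non_gap_sketch` is a hypothetical name for illustration.

```python
def idx_first_and_last_non_gap_sketch(sequence):
    """Return 0-based indices of the first and last characters that are not '-'."""
    non_gap = [i for i, char in enumerate(sequence) if char != "-"]
    if not non_gap:
        return None, None  # every character is a gap (or the string is empty)
    return non_gap[0], non_gap[-1]
```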
def load_fasta(path: str,
translate_nt: bool = False,
convert_to_upper: bool = False,
start: int = 0) ‑> dict[str, str]
-
Load fasta file sequences.
Args
path
- Path to fasta file.
translate_nt
- Translate nucleotide sequences.
convert_to_upper
- Force sequences to be uppercase.
start
- The (0-based) index of the first character of each record to take. This selection is done before any translation. (Default=0).
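To show how `start` and `convert_to_upper` could combine, here is a minimal standard-library parser. It is a sketch only: `load_fasta_sketch` is a hypothetical name, translation is omitted, and using the whole header line (minus '>') as the key is an assumption.

```python
import os
import tempfile


def load_fasta_sketch(path, convert_to_upper=False, start=0):
    """Read a fasta file into a dict of header -> sequence."""
    chunks = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                chunks[header] = []
            elif header is not None:
                chunks[header].append(line)  # sequences may span several lines
    records = {}
    for header, parts in chunks.items():
        seq = "".join(parts)[start:]  # slice before any other processing
        records[header] = seq.upper() if convert_to_upper else seq
    return records


# Usage on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w") as handle:
        handle.write(">seq1\nacgt\nac\n>seq2\nTTGA\n")
    loaded = load_fasta_sketch(demo_path, convert_to_upper=True, start=1)
```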
def pairwise_hamming_dists(collection, ignore='-X', per_site=False)
-
Compute all pairwise hamming distances between items in collection.
Args
collection (iterable)
Returns
list of hamming distances
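All unordered pairs can be enumerated with `itertools.combinations`. In the sketch below the order of the returned distances follows that enumeration, which is an assumption; `pairwise_hamming_dists_sketch` is a hypothetical name and `per_site` is omitted.

```python
from itertools import combinations


def pairwise_hamming_dists_sketch(collection, ignore="-X"):
    """Hamming distance for every unordered pair of items."""
    ignore = set(ignore)

    def dist(a, b):
        return sum(
            x != y and x not in ignore and y not in ignore
            for x, y in zip(a, b)
        )

    return [dist(a, b) for a, b in combinations(list(collection), 2)]
```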
def plot_amino_acid_colors(ax: matplotlib.axes.Axes = None) ‑> matplotlib.axes.Axes
-
Simple plot to show amino acid colors.
def sloppy_translate(sequence)
-
Translate a nucleotide sequence.
Don't check that the sequence length is a multiple of three. If any 'codon' contains any character not in [ACTG] then return X for that codon.
Args
sequence (str)
- Lower or upper case.
Returns
(str)
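A permissive translation along these lines can be sketched with the standard genetic code. Assumptions in this sketch: stop codons are written as '*', a trailing incomplete codon becomes 'X', and `sloppy_translate_sketch` is a hypothetical name.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sloppy_translate_sketch(sequence):
    """Translate nucleotides, writing X for any codon that is not pure ACTG."""
    sequence = sequence.upper()
    codons = (sequence[i:i + 3] for i in range(0, len(sequence), 3))
    # Unknown or incomplete codons are missing from the table and map to X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```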
def variable_sites(seq: pandas.core.series.Series,
max_biggest_prop: float = 0.95,
ignore: str = '-X') ‑> Generator[int, None, None]
-
Find variable sites among sequences. Returns 1-indexed sites.
Args
seq
- Sequences.
max_biggest_prop
- Don't include sites where a single character has a proportion above this.
ignore
- Characters to exclude when calculating proportions.
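The proportion test can be sketched per column as below. This sketch takes a plain iterable of equal-length strings rather than a pandas Series, which is an assumption made to keep it dependency-free; `variable_sites_sketch` is a hypothetical name.

```python
from collections import Counter


def variable_sites_sketch(seqs, max_biggest_prop=0.95, ignore="-X"):
    """Yield 1-indexed sites where no character exceeds max_biggest_prop."""
    ignore = set(ignore)
    for site, column in enumerate(zip(*seqs), start=1):
        counts = Counter(char for char in column if char not in ignore)
        total = sum(counts.values())
        # Skip sites made up entirely of ignored characters.
        if total and max(counts.values()) / total <= max_biggest_prop:
            yield site
```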
def write_fasta(path: str, records: dict[str, str]) ‑> None
-
Write a fasta file.
Args
path
- Path to fasta file to write.
records
- A dict, the keys will become fasta headers, values will be sequences.
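Writing records in this shape needs only a loop over the dict. The sketch below writes each sequence on a single line, which is an assumption (line wrapping is not described); `write_fasta_sketch` is a hypothetical name.

```python
import os
import tempfile


def write_fasta_sketch(path, records):
    """Write header -> sequence pairs as fasta, one sequence per line."""
    with open(path, "w") as handle:
        for header, seq in records.items():
            handle.write(f">{header}\n{seq}\n")


# Usage on a throwaway file, then read the text back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "out.fasta")
    write_fasta_sketch(demo_path, {"a": "ACGT", "b": "TTGA"})
    with open(demo_path) as handle:
        written = handle.read()
```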
Classes
class ReferenceAlignment (aligned, internal_gap_in_ref)
-
ReferenceAlignment(aligned, internal_gap_in_ref)
Ancestors
- builtins.tuple
Instance variables
var aligned
-
Alias for field number 0
var internal_gap_in_ref
-
Alias for field number 1
class Substitution (*args)
-
Change of a character at a site.
Instantiate using either one or three arguments: Substitution("N145K") or Substitution("N", 145, "K")
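Accepting both call styles might be handled as in this sketch. The class name `SubstitutionSketch` and the attribute names `aa0`, `site` and `aa1` are hypothetical; the real class may store and validate its fields differently.

```python
import re


class SubstitutionSketch:
    """Change of a character at a site, built from "N145K" or ("N", 145, "K")."""

    def __init__(self, *args):
        if len(args) == 1:
            # One non-digit, one or more digits, one non-digit.
            match = re.fullmatch(r"(\D)(\d+)(\D)", args[0])
            if match is None:
                raise ValueError(f"cannot parse substitution: {args[0]!r}")
            aa0, site, aa1 = match.groups()
        elif len(args) == 3:
            aa0, site, aa1 = args
        else:
            raise TypeError("expected one or three arguments")
        self.aa0, self.site, self.aa1 = aa0, int(site), aa1

    def __str__(self):
        return f"{self.aa0}{self.site}{self.aa1}"
```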
class TiedCounter (iterable=None, /, **kwds)
-
Dict subclass for counting hashable items. Sometimes called a bag or multiset. Elements are stored as dictionary keys and their counts are stored as dictionary values.
>>> c = Counter('abcdeabcdabcaba') # count elements from a string
>>> c.most_common(3)                # three most common elements
[('a', 5), ('b', 4), ('c', 3)]
>>> sorted(c)                       # list all unique elements
['a', 'b', 'c', 'd', 'e']
>>> ''.join(sorted(c.elements()))   # list elements with repetitions
'aaaaabbbbcccdde'
>>> sum(c.values())                 # total of all counts
15
>>> c['a']                          # count of letter 'a'
5
>>> for elem in 'shazam':           # update counts from an iterable
...     c[elem] += 1                # by adding 1 to each element's count
>>> c['a']                          # now there are seven 'a'
7
>>> del c['b']                      # remove all 'b'
>>> c['b']                          # now there are zero 'b'
0
>>> d = Counter('simsalabim')       # make another counter
>>> c.update(d)                     # add in the second counter
>>> c['a']                          # now there are nine 'a'
9
>>> c.clear()                       # empty the counter
>>> c
Counter()
Note: If a count is set to zero or reduced to zero, it will remain in the counter until the entry is deleted or the counter is cleared:
>>> c = Counter('aaabbc')
>>> c['b'] -= 2                     # reduce the count of 'b' by two
>>> c.most_common()                 # 'b' is still in, but its count is zero
[('a', 3), ('c', 1), ('b', 0)]
Create a new, empty Counter object. And if given, count elements from an input iterable. Or, initialize the count from another mapping of elements to their counts.
>>> c = Counter()                   # a new, empty counter
>>> c = Counter('gallahad')         # a new counter from an iterable
>>> c = Counter({'a': 4, 'b': 2})   # a new counter from a mapping
>>> c = Counter(a=4, b=2)           # a new counter from keyword args
Ancestors
- collections.Counter
- builtins.dict
Methods
def most_common(self, n: int | None = None) ‑> list[tuple[typing.Any, int]]
-
If n=1 and more than one item has the maximum count, return all of them, not just one. If n is not 1, behave like the normal Counter.most_common.
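The tie-aware behaviour can be sketched as a small `Counter` subclass. `TiedCounterSketch` is a hypothetical name, and returning tied items in insertion order is an assumption.

```python
from collections import Counter


class TiedCounterSketch(Counter):
    """Counter whose most_common(1) returns every item tied for the top count."""

    def most_common(self, n=None):
        if n == 1 and self:
            top = max(self.values())
            return [(item, count) for item, count in self.items() if count == top]
        # Any other n falls back to the standard behaviour.
        return super().most_common(n)
```

For example, `TiedCounterSketch("aabbc").most_common(1)` gives both `('a', 2)` and `('b', 2)`, whereas a plain `Counter` would return only one of them.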