python-biopython  1.60
Bio.Seq Namespace Reference

Classes

class  Seq
class  UnknownSeq
class  MutableSeq

Functions

def _maketrans
def transcribe
def back_transcribe
def _translate_str
def translate
def reverse_complement
def _test

Variables

string __docformat__ = "epytext en"
tuple _dna_complement_table = _maketrans(ambiguous_dna_complement)
tuple _rna_complement_table = _maketrans(ambiguous_rna_complement)

Function Documentation

def Bio.Seq._maketrans (   complement_mapping) [private]
Makes a python string translation table (PRIVATE).

Arguments:
 - complement_mapping - a dictionary such as ambiguous_dna_complement
   and ambiguous_rna_complement from Data.IUPACData.

Returns a translation table (a string of length 256) for use with the
python string's translate method to use in a (reverse) complement.

Compatible with lower case and upper case sequences.

For internal use only.

Definition at line 25 of file Seq.py.

def _maketrans(complement_mapping):
    """Makes a python string translation table (PRIVATE).

    Arguments:
     - complement_mapping - a dictionary such as ambiguous_dna_complement
       and ambiguous_rna_complement from Data.IUPACData.

    Returns a translation table (a string of length 256) for use with the
    python string's translate method to use in a (reverse) complement.

    Compatible with lower case and upper case sequences.

    For internal use only.
    """
    before = ''.join(complement_mapping.keys())
    after  = ''.join(complement_mapping.values())
    before = before + before.lower()
    after  = after + after.lower()
    if sys.version_info[0] == 3:
        return str.maketrans(before, after)
    else:
        return string.maketrans(before, after)
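
As a rough illustration of the technique, the Python 3 sketch below builds a comparable table with str.maketrans. The names toy_complement and make_complement_table are hypothetical and not part of Bio.Seq, and the mapping is a reduced stand-in for ambiguous_dna_complement that covers only the four unambiguous bases plus N.

```python
# Hypothetical, reduced mapping; the real ambiguous_dna_complement in
# Bio.Data.IUPACData also covers the IUPAC ambiguity codes.
toy_complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def make_complement_table(complement_mapping):
    """Build a translation table covering upper and lower case letters."""
    before = "".join(complement_mapping.keys())
    after = "".join(complement_mapping.values())
    # Append lower-case variants so mixed-case input is handled too.
    before += before.lower()
    after += after.lower()
    return str.maketrans(before, after)


table = make_complement_table(toy_complement)
complemented = "ACGTNacgtn".translate(table)
with_gap = "AC-GT".translate(table)
```

Characters that are absent from the table, such as the gap character "-", pass through str.translate unchanged.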

def Bio.Seq._test ( ) [private]
Run the Bio.Seq module's doctests (PRIVATE).

Definition at line 2124 of file Seq.py.

def _test():
    """Run the Bio.Seq module's doctests (PRIVATE)."""
    if sys.version_info[0:2] == (3,1):
        print "Not running Bio.Seq doctest on Python 3.1"
        print "See http://bugs.python.org/issue7490"
    else:
        print "Running doctests..."
        import doctest
        doctest.testmod(optionflags=doctest.IGNORE_EXCEPTION_DETAIL)
        print "Done"

def Bio.Seq._translate_str (   sequence,
  table,
  stop_symbol = "*",
  to_stop = False,
  cds = False,
  pos_stop = "X" 
) [private]
Helper function to translate a nucleotide string (PRIVATE).

Arguments:
 - sequence    - a string
 - table       - a CodonTable object (NOT a table name or id number)
 - stop_symbol - a single character string, what to use for terminators.
 - to_stop     - boolean, should translation terminate at the first
                 in frame stop codon?  If there is no in-frame stop codon
                 then translation continues to the end.
 - pos_stop    - a single character string for a possible stop codon
                 (e.g. TAN or NNN)
 - cds - Boolean, indicates this is a complete CDS.  If True, this
         checks the sequence starts with a valid alternative start
         codon (which will be translated as methionine, M), that the
         sequence length is a multiple of three, and that there is a
         single in frame stop codon at the end (this will be excluded
         from the protein sequence, regardless of the to_stop option).
         If these tests fail, an exception is raised.

Returns a string.

e.g.

>>> from Bio.Data import CodonTable
>>> table = CodonTable.ambiguous_dna_by_id[1]
>>> _translate_str("AAA", table)
'K'
>>> _translate_str("TAR", table)
'*'
>>> _translate_str("TAN", table)
'X'
>>> _translate_str("TAN", table, pos_stop="@")
'@'
>>> _translate_str("TA?", table)
Traceback (most recent call last):
   ...
TranslationError: Codon 'TA?' is invalid
>>> _translate_str("ATGCCCTAG", table, cds=True)
'MP'
>>> _translate_str("AAACCCTAG", table, cds=True)
Traceback (most recent call last):
   ...
TranslationError: First codon 'AAA' is not a start codon
>>> _translate_str("ATGCCCTAGCCCTAG", table, cds=True)
Traceback (most recent call last):
   ...
TranslationError: Extra in frame stop codon found.

Definition at line 1910 of file Seq.py.

def _translate_str(sequence, table, stop_symbol="*", to_stop=False,
                   cds=False, pos_stop="X"):
    """Helper function to translate a nucleotide string (PRIVATE).

    Arguments:
     - sequence    - a string
     - table       - a CodonTable object (NOT a table name or id number)
     - stop_symbol - a single character string, what to use for terminators.
     - to_stop     - boolean, should translation terminate at the first
                     in frame stop codon?  If there is no in-frame stop codon
                     then translation continues to the end.
     - pos_stop    - a single character string for a possible stop codon
                     (e.g. TAN or NNN)
     - cds - Boolean, indicates this is a complete CDS.  If True, this
             checks the sequence starts with a valid alternative start
             codon (which will be translated as methionine, M), that the
             sequence length is a multiple of three, and that there is a
             single in frame stop codon at the end (this will be excluded
             from the protein sequence, regardless of the to_stop option).
             If these tests fail, an exception is raised.

    Returns a string.

    e.g.

    >>> from Bio.Data import CodonTable
    >>> table = CodonTable.ambiguous_dna_by_id[1]
    >>> _translate_str("AAA", table)
    'K'
    >>> _translate_str("TAR", table)
    '*'
    >>> _translate_str("TAN", table)
    'X'
    >>> _translate_str("TAN", table, pos_stop="@")
    '@'
    >>> _translate_str("TA?", table)
    Traceback (most recent call last):
       ...
    TranslationError: Codon 'TA?' is invalid
    >>> _translate_str("ATGCCCTAG", table, cds=True)
    'MP'
    >>> _translate_str("AAACCCTAG", table, cds=True)
    Traceback (most recent call last):
       ...
    TranslationError: First codon 'AAA' is not a start codon
    >>> _translate_str("ATGCCCTAGCCCTAG", table, cds=True)
    Traceback (most recent call last):
       ...
    TranslationError: Extra in frame stop codon found.
    """
    sequence = sequence.upper()
    amino_acids = []
    forward_table = table.forward_table
    stop_codons = table.stop_codons
    if table.nucleotide_alphabet.letters is not None:
        valid_letters = set(table.nucleotide_alphabet.letters.upper())
    else:
        #Assume the worst case, ambiguous DNA or RNA:
        valid_letters = set(IUPAC.ambiguous_dna.letters.upper() + \
                            IUPAC.ambiguous_rna.letters.upper())
    if cds:
        if str(sequence[:3]).upper() not in table.start_codons:
            raise CodonTable.TranslationError(\
                "First codon '%s' is not a start codon" % sequence[:3])
        if len(sequence) % 3 != 0:
            raise CodonTable.TranslationError(\
                "Sequence length %i is not a multiple of three" % len(sequence))
        if str(sequence[-3:]).upper() not in stop_codons:
            raise CodonTable.TranslationError(\
                "Final codon '%s' is not a stop codon" % sequence[-3:])
        #Don't translate the stop symbol, and manually translate the M
        sequence = sequence[3:-3]
        amino_acids = ["M"]
    n = len(sequence)
    for i in xrange(0,n-n%3,3):
        codon = sequence[i:i+3]
        try:
            amino_acids.append(forward_table[codon])
        except (KeyError, CodonTable.TranslationError):
            #Todo? Treat "---" as a special case (gapped translation)
            if codon in table.stop_codons:
                if cds:
                    raise CodonTable.TranslationError(\
                        "Extra in frame stop codon found.")
                if to_stop:
                    break
                amino_acids.append(stop_symbol)
            elif valid_letters.issuperset(set(codon)):
                #Possible stop codon (e.g. NNN or TAN)
                amino_acids.append(pos_stop)
            else:
                raise CodonTable.TranslationError(\
                    "Codon '%s' is invalid" % codon)
    return "".join(amino_acids)
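
The codon-by-codon loop can be sketched in Python 3 without Biopython as follows. This is a simplified illustration only: TOY_FORWARD, TOY_STOPS and toy_translate are hypothetical names, the table holds just a handful of standard-code codons, and the sketch omits the cds checks and the pos_stop handling of ambiguous codons.

```python
# Hypothetical miniature codon table (standard-code values for a few codons).
TOY_FORWARD = {"ATG": "M", "CCC": "P", "AAA": "K", "GGG": "G"}
TOY_STOPS = {"TAA", "TAG", "TGA"}


def toy_translate(sequence, stop_symbol="*", to_stop=False):
    """Translate complete codons; a trailing partial codon is ignored."""
    sequence = sequence.upper()
    amino_acids = []
    n = len(sequence)
    # Step through the sequence three letters at a time.
    for i in range(0, n - n % 3, 3):
        codon = sequence[i:i + 3]
        if codon in TOY_FORWARD:
            amino_acids.append(TOY_FORWARD[codon])
        elif codon in TOY_STOPS:
            if to_stop:
                break  # stop before appending the stop symbol
            amino_acids.append(stop_symbol)
        else:
            raise ValueError("Codon %r is invalid" % codon)
    return "".join(amino_acids)


full = toy_translate("ATGCCCTAGAAA")
truncated = toy_translate("ATGCCCTAGAAA", to_stop=True)
```

With to_stop=True the loop exits at the first in-frame stop codon, so the stop symbol never reaches the result.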

def Bio.Seq.back_transcribe (   rna)
Back-transcribes an RNA sequence into DNA.

If given a string, returns a new string object.

Given a Seq or MutableSeq, returns a new Seq object with a DNA alphabet.

Trying to back-transcribe a protein or DNA sequence raises an exception.

e.g.

>>> back_transcribe("ACUGN")
'ACTGN'

Definition at line 1888 of file Seq.py.

def back_transcribe(rna):
    """Back-transcribes an RNA sequence into DNA.

    If given a string, returns a new string object.

    Given a Seq or MutableSeq, returns a new Seq object with a DNA alphabet.

    Trying to back-transcribe a protein or DNA sequence raises an exception.

    e.g.

    >>> back_transcribe("ACUGN")
    'ACTGN'
    """
    if isinstance(rna, Seq):
        return rna.back_transcribe()
    elif isinstance(rna, MutableSeq):
        return rna.toseq().back_transcribe()
    else:
        return rna.replace('U','T').replace('u','t')
    

def Bio.Seq.reverse_complement (   sequence)
Returns the reverse complement sequence of a nucleotide string.

If given a string, returns a new string object.
Given a Seq or a MutableSeq, returns a new Seq object with the same alphabet.

Supports unambiguous and ambiguous nucleotide sequences.

e.g.

>>> reverse_complement("ACTG-NH")
'DN-CAGT'

Definition at line 2090 of file Seq.py.

def reverse_complement(sequence):
    """Returns the reverse complement sequence of a nucleotide string.

    If given a string, returns a new string object.
    Given a Seq or a MutableSeq, returns a new Seq object with the same alphabet.

    Supports unambiguous and ambiguous nucleotide sequences.

    e.g.

    >>> reverse_complement("ACTG-NH")
    'DN-CAGT'
    """
    if isinstance(sequence, Seq):
        #Return a Seq
        return sequence.reverse_complement()
    elif isinstance(sequence, MutableSeq):
        #Return a Seq
        #Don't use the MutableSeq reverse_complement method as it is 'in place'.
        return sequence.toseq().reverse_complement()

    #Assume its a string.
    #In order to avoid some code duplication, the old code would turn the string
    #into a Seq, use the reverse_complement method, and convert back to a string.
    #This worked, but is over five times slower on short sequences!
    if ('U' in sequence or 'u' in sequence) \
    and ('T' in sequence or 't' in sequence):
        raise ValueError("Mixed RNA/DNA found")
    elif 'U' in sequence or 'u' in sequence:
        ttable = _rna_complement_table
    else:
        ttable = _dna_complement_table
    return sequence.translate(ttable)[::-1]
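
The string branch can be approximated in Python 3 as shown below. This is a sketch under stated assumptions: DNA_TABLE, RNA_TABLE and toy_reverse_complement are hypothetical names, and the tables cover only the unambiguous bases, whereas the library's tables also handle IUPAC ambiguity codes such as N, H and D.

```python
# Reduced complement tables for illustration only.
DNA_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")
RNA_TABLE = str.maketrans("ACGUacgu", "UGCAugca")


def toy_reverse_complement(sequence):
    """Reverse complement a plain string, choosing DNA or RNA rules."""
    has_u = "U" in sequence or "u" in sequence
    has_t = "T" in sequence or "t" in sequence
    if has_u and has_t:
        raise ValueError("Mixed RNA/DNA found")
    table = RNA_TABLE if has_u else DNA_TABLE
    # Complement each letter, then reverse the result with a slice.
    return sequence.translate(table)[::-1]


dna_rc = toy_reverse_complement("ACTG-A")
rna_rc = toy_reverse_complement("ACUG")
```

A sequence with neither T nor U falls through to the DNA table, which matches the default branch in the listing.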

def Bio.Seq.transcribe (   dna)
Transcribes a DNA sequence into RNA.

If given a string, returns a new string object.

Given a Seq or MutableSeq, returns a new Seq object with an RNA alphabet.

Trying to transcribe a protein or RNA sequence raises an exception.

e.g.

>>> transcribe("ACTGN")
'ACUGN'

Definition at line 1867 of file Seq.py.

def transcribe(dna):
    """Transcribes a DNA sequence into RNA.

    If given a string, returns a new string object.

    Given a Seq or MutableSeq, returns a new Seq object with an RNA alphabet.

    Trying to transcribe a protein or RNA sequence raises an exception.

    e.g.

    >>> transcribe("ACTGN")
    'ACUGN'
    """
    if isinstance(dna, Seq):
        return dna.transcribe()
    elif isinstance(dna, MutableSeq):
        return dna.toseq().transcribe()
    else:
        return dna.replace('T','U').replace('t','u')
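
For plain strings, transcription and back-transcription reduce to a case-preserving letter swap. The Python 3 sketch below mirrors only that string branch; toy_transcribe and toy_back_transcribe are hypothetical names, and, judging from the listing above, the string branch performs no alphabet validation, so the exceptions mentioned in the description would come from the Seq methods rather than from this path.

```python
def toy_transcribe(dna):
    """String-only sketch: swap T for U, preserving case."""
    return dna.replace("T", "U").replace("t", "u")


def toy_back_transcribe(rna):
    """String-only sketch: swap U for T, preserving case."""
    return rna.replace("U", "T").replace("u", "t")


rna = toy_transcribe("ACTGn")
round_trip = toy_back_transcribe(rna)
```

Because the two swaps are inverses, a round trip returns the original DNA string, provided the input contained no U to begin with.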

def Bio.Seq.translate (   sequence,
  table = "Standard",
  stop_symbol = "*",
  to_stop = False,
  cds = False 
)
Translate a nucleotide sequence into amino acids.

If given a string, returns a new string object. Given a Seq or
MutableSeq, returns a Seq object with a protein alphabet.

Arguments:
 - table - Which codon table to use?  This can be either a name (string),
           an NCBI identifier (integer), or a CodonTable object (useful
           for non-standard genetic codes).  Defaults to the "Standard"
           table.
 - stop_symbol - Single character string, what to use for any
                 terminators, defaults to the asterisk, "*".
 - to_stop - Boolean, defaults to False meaning do a full
             translation continuing on past any stop codons
             (translated as the specified stop_symbol).  If
             True, translation is terminated at the first in
             frame stop codon (and the stop_symbol is not
             appended to the returned protein sequence).
 - cds - Boolean, indicates this is a complete CDS.  If True, this
             checks the sequence starts with a valid alternative start
             codon (which will be translated as methionine, M), that the
             sequence length is a multiple of three, and that there is a
             single in frame stop codon at the end (this will be excluded
             from the protein sequence, regardless of the to_stop option).
             If these tests fail, an exception is raised.

A simple string example using the default (standard) genetic code:

>>> coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
>>> translate(coding_dna)
'VAIVMGR*KGAR*'
>>> translate(coding_dna, stop_symbol="@")
'VAIVMGR@KGAR@'
>>> translate(coding_dna, to_stop=True)
'VAIVMGR'
 
Now using NCBI table 2, where TGA is not a stop codon:

>>> translate(coding_dna, table=2)
'VAIVMGRWKGAR*'
>>> translate(coding_dna, table=2, to_stop=True)
'VAIVMGRWKGAR'

In fact this example uses an alternative start codon valid under NCBI table 2,
GTG, which means this example is a complete valid CDS which when translated
should really start with methionine (not valine):

>>> translate(coding_dna, table=2, cds=True)
'MAIVMGRWKGAR'

Note that if the sequence has no in-frame stop codon, then the to_stop
argument has no effect:

>>> coding_dna2 = "GTGGCCATTGTAATGGGCCGC"
>>> translate(coding_dna2)
'VAIVMGR'
>>> translate(coding_dna2, to_stop=True)
'VAIVMGR'

NOTE - Ambiguous codons like "TAN" or "NNN" could be an amino acid
or a stop codon.  These are translated as "X".  Any invalid codon
(e.g. "TA?" or "T-A") will throw a TranslationError.

NOTE - Does NOT support gapped sequences.

It will however translate either DNA or RNA.
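
The checks performed when cds is True can be sketched roughly as follows in Python 3. The names TOY_STARTS, TOY_STOPS and check_cds are hypothetical, and the codon sets are an illustrative assumption loosely modelled on NCBI table 2 rather than data read from Bio.Data.CodonTable. The sketch raises ValueError where the library raises TranslationError.

```python
# Assumed start and stop codon sets, for illustration only.
TOY_STARTS = {"ATT", "ATC", "ATA", "ATG", "GTG"}
TOY_STOPS = {"TAA", "TAG", "AGA", "AGG"}


def check_cds(sequence):
    """Return the inner codons of a complete CDS, or raise ValueError."""
    sequence = sequence.upper()
    if sequence[:3] not in TOY_STARTS:
        raise ValueError("First codon %r is not a start codon" % sequence[:3])
    if len(sequence) % 3 != 0:
        raise ValueError("Sequence length %i is not a multiple of three"
                         % len(sequence))
    if sequence[-3:] not in TOY_STOPS:
        raise ValueError("Final codon %r is not a stop codon" % sequence[-3:])
    # Drop the start and stop codons, then split the rest into codons.
    inner = sequence[3:-3]
    codons = [inner[i:i + 3] for i in range(0, len(inner), 3)]
    if any(codon in TOY_STOPS for codon in codons):
        raise ValueError("Extra in frame stop codon found.")
    return codons


inner_codons = check_cds("GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
```

The start codon is translated separately as M and the final stop codon is dropped, so only the inner codons remain to be looked up in the forward table.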

Definition at line 2004 of file Seq.py.

def translate(sequence, table="Standard", stop_symbol="*", to_stop=False,
              cds=False):
    """Translate a nucleotide sequence into amino acids.

    If given a string, returns a new string object. Given a Seq or
    MutableSeq, returns a Seq object with a protein alphabet.

    Arguments:
     - table - Which codon table to use?  This can be either a name (string),
               an NCBI identifier (integer), or a CodonTable object (useful
               for non-standard genetic codes).  Defaults to the "Standard"
               table.
     - stop_symbol - Single character string, what to use for any
                     terminators, defaults to the asterisk, "*".
     - to_stop - Boolean, defaults to False meaning do a full
                 translation continuing on past any stop codons
                 (translated as the specified stop_symbol).  If
                 True, translation is terminated at the first in
                 frame stop codon (and the stop_symbol is not
                 appended to the returned protein sequence).
     - cds - Boolean, indicates this is a complete CDS.  If True, this
                 checks the sequence starts with a valid alternative start
                 codon (which will be translated as methionine, M), that the
                 sequence length is a multiple of three, and that there is a
                 single in frame stop codon at the end (this will be excluded
                 from the protein sequence, regardless of the to_stop option).
                 If these tests fail, an exception is raised.

    A simple string example using the default (standard) genetic code:

    >>> coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
    >>> translate(coding_dna)
    'VAIVMGR*KGAR*'
    >>> translate(coding_dna, stop_symbol="@")
    'VAIVMGR@KGAR@'
    >>> translate(coding_dna, to_stop=True)
    'VAIVMGR'

    Now using NCBI table 2, where TGA is not a stop codon:

    >>> translate(coding_dna, table=2)
    'VAIVMGRWKGAR*'
    >>> translate(coding_dna, table=2, to_stop=True)
    'VAIVMGRWKGAR'

    In fact this example uses an alternative start codon valid under NCBI table 2,
    GTG, which means this example is a complete valid CDS which when translated
    should really start with methionine (not valine):

    >>> translate(coding_dna, table=2, cds=True)
    'MAIVMGRWKGAR'

    Note that if the sequence has no in-frame stop codon, then the to_stop
    argument has no effect:

    >>> coding_dna2 = "GTGGCCATTGTAATGGGCCGC"
    >>> translate(coding_dna2)
    'VAIVMGR'
    >>> translate(coding_dna2, to_stop=True)
    'VAIVMGR'

    NOTE - Ambiguous codons like "TAN" or "NNN" could be an amino acid
    or a stop codon.  These are translated as "X".  Any invalid codon
    (e.g. "TA?" or "T-A") will throw a TranslationError.

    NOTE - Does NOT support gapped sequences.

    It will however translate either DNA or RNA.
    """
    if isinstance(sequence, Seq):
        return sequence.translate(table, stop_symbol, to_stop, cds)
    elif isinstance(sequence, MutableSeq):
        #Return a Seq object
        return sequence.toseq().translate(table, stop_symbol, to_stop, cds)
    else:
        #Assume its a string, return a string
        try:
            codon_table = CodonTable.ambiguous_generic_by_id[int(table)]
        except ValueError:
            codon_table = CodonTable.ambiguous_generic_by_name[table]
        except (AttributeError, TypeError):
            if isinstance(table, CodonTable.CodonTable):
                codon_table = table
            else:
                raise ValueError('Bad table argument')
        return _translate_str(sequence, codon_table, stop_symbol, to_stop, cds)
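
The way the table argument is resolved for plain strings (NCBI id, then name, then table object) can be sketched in Python 3 as below. ToyCodonTable, TABLES_BY_ID, TABLES_BY_NAME and resolve_table are hypothetical stand-ins for the CodonTable lookups, and the sketch checks for a table object up front instead of inside the exception handler as the listing does.

```python
class ToyCodonTable:
    """Hypothetical stand-in for a CodonTable object."""

    def __init__(self, ncbi_id, name):
        self.ncbi_id = ncbi_id
        self.name = name


_STANDARD = ToyCodonTable(1, "Standard")
_VERT_MITO = ToyCodonTable(2, "Vertebrate Mitochondrial")
TABLES_BY_ID = {t.ncbi_id: t for t in (_STANDARD, _VERT_MITO)}
TABLES_BY_NAME = {t.name: t for t in (_STANDARD, _VERT_MITO)}


def resolve_table(table="Standard"):
    """Accept an NCBI id, a table name, or a table object."""
    if isinstance(table, ToyCodonTable):
        return table
    try:
        # Integers and numeric strings are treated as NCBI identifiers.
        return TABLES_BY_ID[int(table)]
    except ValueError:
        # int("Standard") fails, so treat the argument as a table name.
        return TABLES_BY_NAME[table]
    except TypeError:
        raise ValueError("Bad table argument")


by_id = resolve_table(2)
by_name = resolve_table("Standard")
```

As in the listing, an unknown id or name surfaces as a KeyError from the dictionary lookup rather than being converted to another exception.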
      

Variable Documentation

string Bio.Seq.__docformat__ = "epytext en"

Definition at line 14 of file Seq.py.

tuple Bio.Seq._dna_complement_table = _maketrans(ambiguous_dna_complement)

Definition at line 48 of file Seq.py.

tuple Bio.Seq._rna_complement_table = _maketrans(ambiguous_rna_complement)

Definition at line 49 of file Seq.py.