python-biopython  1.60
BioSQL.BioSeq.DBSeq Class Reference

Public Member Functions

def data
def __repr__
def __str__
def __hash__
def __cmp__
def __len__
def __getitem__
def __add__
def __radd__
def tostring
def tomutable
def count
def __contains__
def find
def rfind
def startswith
def endswith
def split
def rsplit
def strip
def lstrip
def rstrip
def upper
def lower
def complement
def reverse_complement
def transcribe
def back_transcribe
def translate
def ungap

Public Attributes

 alphabet

Detailed Description

Definition at line 24 of file BioSeq.py.


Member Function Documentation

def Bio.Seq.Seq.__add__ (   self,
  other 
) [inherited]
Add another sequence or string to this sequence.

If adding a string to a Seq, the alphabet is preserved:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> Seq("MELKI", generic_protein) + "LV"
Seq('MELKILV', ProteinAlphabet())

When adding two Seq (like) objects, the alphabets are important.
Consider this example:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna
>>> unamb_dna_seq = Seq("ACGT", unambiguous_dna)
>>> ambig_dna_seq = Seq("ACRGT", ambiguous_dna)
>>> unamb_dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> ambig_dna_seq
Seq('ACRGT', IUPACAmbiguousDNA())

If we add the ambiguous and unambiguous IUPAC DNA alphabets, we get
the more general ambiguous IUPAC DNA alphabet:

>>> unamb_dna_seq + ambig_dna_seq
Seq('ACGTACRGT', IUPACAmbiguousDNA())

However, if the default generic alphabet is included, the result is
a generic alphabet:

>>> Seq("") + ambig_dna_seq
Seq('ACRGT', Alphabet())

You can't add RNA and DNA sequences:

>>> from Bio.Alphabet import generic_dna, generic_rna
>>> Seq("ACGT", generic_dna) + Seq("ACGU", generic_rna)
Traceback (most recent call last):
   ...
TypeError: Incompatible alphabets DNAAlphabet() and RNAAlphabet()

You can't add nucleotide and protein sequences:

>>> from Bio.Alphabet import generic_dna, generic_protein
>>> Seq("ACGT", generic_dna) + Seq("MELKI", generic_protein)
Traceback (most recent call last):
   ...
TypeError: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 216 of file Seq.py.

00216 
00217     def __add__(self, other):
00218         """Add another sequence or string to this sequence.
00219 
00220         If adding a string to a Seq, the alphabet is preserved:
00221 
00222         >>> from Bio.Seq import Seq
00223         >>> from Bio.Alphabet import generic_protein
00224         >>> Seq("MELKI", generic_protein) + "LV"
00225         Seq('MELKILV', ProteinAlphabet())
00226 
00227         When adding two Seq (like) objects, the alphabets are important.
00228         Consider this example:
00229 
00230         >>> from Bio.Seq import Seq
00231         >>> from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna
00232         >>> unamb_dna_seq = Seq("ACGT", unambiguous_dna)
00233         >>> ambig_dna_seq = Seq("ACRGT", ambiguous_dna)
00234         >>> unamb_dna_seq
00235         Seq('ACGT', IUPACUnambiguousDNA())
00236         >>> ambig_dna_seq
00237         Seq('ACRGT', IUPACAmbiguousDNA())
00238 
00239         If we add the ambiguous and unambiguous IUPAC DNA alphabets, we get
00240         the more general ambiguous IUPAC DNA alphabet:
00241         
00242         >>> unamb_dna_seq + ambig_dna_seq
00243         Seq('ACGTACRGT', IUPACAmbiguousDNA())
00244 
00245         However, if the default generic alphabet is included, the result is
00246         a generic alphabet:
00247 
00248         >>> Seq("") + ambig_dna_seq
00249         Seq('ACRGT', Alphabet())
00250 
00251         You can't add RNA and DNA sequences:
00252         
00253         >>> from Bio.Alphabet import generic_dna, generic_rna
00254         >>> Seq("ACGT", generic_dna) + Seq("ACGU", generic_rna)
00255         Traceback (most recent call last):
00256            ...
00257         TypeError: Incompatible alphabets DNAAlphabet() and RNAAlphabet()
00258 
00259         You can't add nucleotide and protein sequences:
00260 
00261         >>> from Bio.Alphabet import generic_dna, generic_protein
00262         >>> Seq("ACGT", generic_dna) + Seq("MELKI", generic_protein)
00263         Traceback (most recent call last):
00264            ...
00265         TypeError: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
00266         """
00267         if hasattr(other, "alphabet"):
00268             #other should be a Seq or a MutableSeq
00269             if not Alphabet._check_type_compatible([self.alphabet,
00270                                                     other.alphabet]):
00271                 raise TypeError("Incompatible alphabets %s and %s" \
00272                                 % (repr(self.alphabet), repr(other.alphabet)))
00273             #They should be the same sequence type (or one of them is generic)
00274             a = Alphabet._consensus_alphabet([self.alphabet, other.alphabet])
00275             return self.__class__(str(self) + str(other), a)
00276         elif isinstance(other, basestring):
00277             #other is a plain string - use the current alphabet
00278             return self.__class__(str(self) + other, self.alphabet)
00279         from Bio.SeqRecord import SeqRecord #Lazy to avoid circular imports
00280         if isinstance(other, SeqRecord):
00281             #Get the SeqRecord's __radd__ to handle this
00282             return NotImplemented
00283         else :
00284             raise TypeError

def Bio.Seq.Seq.__cmp__ (   self,
  other 
) [inherited]
Compare the sequence to another sequence or a string (README).

Historically comparing Seq objects has done Python object comparison.
After considerable discussion (keeping in mind constraints of the
Python language, hashes and dictionary support) a future release of
Biopython will change this to use simple string comparison. The plan is
that comparing incompatible alphabets (e.g. DNA to RNA) will trigger a
warning.

This version of Biopython still does Python object comparison, but with
a warning about this future change. During this transition period,
please just do explicit comparisons:

>>> seq1 = Seq("ACGT")
>>> seq2 = Seq("ACGT")
>>> id(seq1) == id(seq2)
False
>>> str(seq1) == str(seq2)
True

Note - This method indirectly supports ==, < , etc.
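The difference between the two explicit comparisons can be illustrated without Biopython. The sketch below uses a hypothetical stand-in class, IdentitySeq (not part of Biopython), which compares and hashes by object identity in the way described above. It is written for Python 3 and only illustrates why str(seq1) == str(seq2) is the test to use when equal letters are what matters.

```python
class IdentitySeq(object):
    """Hypothetical stand-in: compares and hashes by object identity."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data

    def __eq__(self, other):
        # Two objects are "equal" only if they are the same object.
        return self is other

    def __hash__(self):
        return id(self)


seq1 = IdentitySeq("ACGT")
seq2 = IdentitySeq("ACGT")

# Explicit identity test versus explicit string test.
same_object = id(seq1) == id(seq2)
same_letters = str(seq1) == str(seq2)

# Identity hashing means equal letters still give two dictionary keys.
keyed_by_object = {seq1: "first", seq2: "second"}

# Keying on the string collapses them into a single entry.
keyed_by_string = {str(seq1): "first"}
keyed_by_string[str(seq2)] = "second"
```

Keying dictionaries on str(my_seq) sidesteps the identity-based hash while the transition described above is in progress.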

Definition at line 166 of file Seq.py.

00166 
00167     def __cmp__(self, other):
00168         """Compare the sequence to another sequence or a string (README).
00169 
00170         Historically comparing Seq objects has done Python object comparison.
00171         After considerable discussion (keeping in mind constraints of the
00172         Python language, hashes and dictionary support) a future release of
00173         Biopython will change this to use simple string comparison. The plan is
00174         that comparing incompatible alphabets (e.g. DNA to RNA) will trigger a
00175         warning.
00176 
00177         This version of Biopython still does Python object comparison, but with
00178         a warning about this future change. During this transition period,
00179         please just do explicit comparisons:
00180 
00181         >>> seq1 = Seq("ACGT")
00182         >>> seq2 = Seq("ACGT")
00183         >>> id(seq1) == id(seq2)
00184         False
00185         >>> str(seq1) == str(seq2)
00186         True
00187 
00188         Note - This method indirectly supports ==, < , etc.
00189         """
00190         if hasattr(other, "alphabet"):
00191             #other should be a Seq or a MutableSeq
00192             import warnings
00193             warnings.warn("In future comparing Seq objects will use string "
00194                           "comparison (not object comparison). Incompatible "
00195                           "alphabets will trigger a warning (not an exception). "
00196                           "In the interim please use id(seq1)==id(seq2) or "
00197                           "str(seq1)==str(seq2) to make your code explicit "
00198                           "and to avoid this warning.", FutureWarning)
00199         return cmp(id(self), id(other))

def Bio.Seq.Seq.__contains__ (   self,
  char 
) [inherited]
Implements the 'in' keyword, like a python string.

e.g.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_rna, generic_protein
>>> my_dna = Seq("ATATGAAATTTGAAAA", generic_dna)
>>> "AAA" in my_dna
True
>>> Seq("AAA") in my_dna
True
>>> Seq("AAA", generic_dna) in my_dna
True

Like other Seq methods, this will raise a type error if another Seq
(or Seq like) object with an incompatible alphabet is used:

>>> Seq("AAA", generic_rna) in my_dna
Traceback (most recent call last):
   ...
TypeError: Incompatable alphabets DNAAlphabet() and RNAAlphabet()
>>> Seq("AAA", generic_protein) in my_dna
Traceback (most recent call last):
   ...
TypeError: Incompatable alphabets DNAAlphabet() and ProteinAlphabet()

Definition at line 406 of file Seq.py.

00406 
00407     def __contains__(self, char):
00408         """Implements the 'in' keyword, like a python string.
00409 
00410         e.g.
00411 
00412         >>> from Bio.Seq import Seq
00413         >>> from Bio.Alphabet import generic_dna, generic_rna, generic_protein
00414         >>> my_dna = Seq("ATATGAAATTTGAAAA", generic_dna)
00415         >>> "AAA" in my_dna
00416         True
00417         >>> Seq("AAA") in my_dna
00418         True
00419         >>> Seq("AAA", generic_dna) in my_dna
00420         True
00421 
00422         Like other Seq methods, this will raise a type error if another Seq
00423         (or Seq like) object with an incompatible alphabet is used:
00424 
00425         >>> Seq("AAA", generic_rna) in my_dna
00426         Traceback (most recent call last):
00427            ...
00428         TypeError: Incompatable alphabets DNAAlphabet() and RNAAlphabet()
00429         >>> Seq("AAA", generic_protein) in my_dna
00430         Traceback (most recent call last):
00431            ...
00432         TypeError: Incompatable alphabets DNAAlphabet() and ProteinAlphabet()
00433         """
00434         #If it has one, check the alphabet:
00435         sub_str = self._get_seq_str_and_check_alphabet(char)
00436         return sub_str in str(self)

def Bio.Seq.Seq.__getitem__ (   self,
  index 
) [inherited]
Returns a subsequence or a single letter; use my_seq[index].
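The two return types can be sketched with a hypothetical stand-in class, TinySeq (not part of Biopython), that follows the same rule as the listing for this method: an integer index gives a plain string, anything else (for example a slice) gives a new sequence object with the same alphabet. The alphabet here is just a label, which is an assumption made to keep the sketch self-contained.

```python
class TinySeq(object):
    """Hypothetical stand-in for a sequence with an alphabet label."""

    def __init__(self, data, alphabet="generic"):
        self._data = data
        self.alphabet = alphabet

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        if isinstance(index, int):
            # A single letter comes back as a plain string.
            return self._data[index]
        # A slice comes back as a new object that keeps the alphabet.
        return TinySeq(self._data[index], self.alphabet)


my_seq = TinySeq("ATGGCC", "dna")
letter = my_seq[0]
codon = my_seq[0:3]
last = my_seq[-1]
```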

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 204 of file Seq.py.

00204 
00205     def __getitem__(self, index) :                 # Seq API requirement
00206         """Returns a subsequence or single letter, use my_seq[index]."""
00207         #Note since Python 2.0, __getslice__ is deprecated
00208         #and __getitem__ is used instead.
00209         #See http://docs.python.org/ref/sequence-methods.html
00210         if isinstance(index, int):
00211             #Return a single letter as a string
00212             return self._data[index]
00213         else:
00214             #Return the (sub)sequence as another Seq object
00215             return Seq(self._data[index], self.alphabet)

def Bio.Seq.Seq.__hash__ (   self) [inherited]
Hash for comparison.

See the __cmp__ documentation - we plan to change this!

Definition at line 159 of file Seq.py.

00159 
00160     def __hash__(self):
00161         """Hash for comparison.
00162 
00163         See the __cmp__ documentation - we plan to change this!
00164         """
00165         return id(self) #Currently use object identity for equality testing
    
def Bio.Seq.Seq.__len__ (   self) [inherited]
Returns the length of the sequence, use len(my_seq).

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 200 of file Seq.py.

00200 
00201     def __len__(self):
00202         """Returns the length of the sequence, use len(my_seq)."""
00203         return len(self._data)       # Seq API requirement

def Bio.Seq.Seq.__radd__ (   self,
  other 
) [inherited]
Adding a sequence on the left.

If adding a string to a Seq, the alphabet is preserved:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> "LV" + Seq("MELKI", generic_protein)
Seq('LVMELKI', ProteinAlphabet())

Adding two Seq (like) objects is handled via the __add__ method.
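Python only calls __radd__ when the left operand cannot handle the addition itself. The sketch below shows that dispatch with a hypothetical stand-in class, AddableSeq (not part of Biopython). It handles plain strings only and uses a simple label in place of a real alphabet object, so it illustrates the operator protocol rather than the alphabet checks.

```python
class AddableSeq(object):
    """Hypothetical stand-in; only plain string operands are handled."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet

    def __add__(self, other):
        if isinstance(other, str):
            return AddableSeq(self.data + other, self.alphabet)
        return NotImplemented

    def __radd__(self, other):
        # Reached for "string" + seq, because str.__add__ gives up first.
        if isinstance(other, str):
            return AddableSeq(other + self.data, self.alphabet)
        return NotImplemented


protein = AddableSeq("MELKI", "protein")
right = protein + "LV"   # uses __add__
left = "LV" + protein    # uses __radd__

# An unsupported operand makes both methods decline, so Python raises.
try:
    1 + protein
    unsupported_raises = False
except TypeError:
    unsupported_raises = True
```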

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 285 of file Seq.py.

00285 
00286     def __radd__(self, other):
00287         """Adding a sequence on the left.
00288 
00289         If adding a string to a Seq, the alphabet is preserved:
00290 
00291         >>> from Bio.Seq import Seq
00292         >>> from Bio.Alphabet import generic_protein
00293         >>> "LV" + Seq("MELKI", generic_protein)
00294         Seq('LVMELKI', ProteinAlphabet())
00295 
00296         Adding two Seq (like) objects is handled via the __add__ method.
00297         """
00298         if hasattr(other, "alphabet"):
00299             #other should be a Seq or a MutableSeq
00300             if not Alphabet._check_type_compatible([self.alphabet,
00301                                                     other.alphabet]):
00302                 raise TypeError("Incompatable alphabets %s and %s" \
00303                                 % (repr(self.alphabet), repr(other.alphabet)))
00304             #They should be the same sequence type (or one of them is generic)
00305             a = Alphabet._consensus_alphabet([self.alphabet, other.alphabet])
00306             return self.__class__(str(other) + str(self), a)
00307         elif isinstance(other, basestring):
00308             #other is a plain string - use the current alphabet
00309             return self.__class__(other + str(self), self.alphabet)
00310         else:
00311             raise TypeError

def Bio.Seq.Seq.__repr__ (   self) [inherited]
Returns a (truncated) representation of the sequence for debugging.
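The truncation rule used for long sequences (the first 54 letters, three dots, then the last 3 letters) can be sketched as a plain function. The name truncated_repr and its arguments are hypothetical; the alphabet is passed in as an already formatted string to keep the sketch free of Biopython imports.

```python
def truncated_repr(class_name, data, alphabet_repr):
    """Format data like the repr described above (illustrative only)."""
    if len(data) > 60:
        # 54 leading letters + "..." + 3 trailing letters = 60 characters,
        # so a stop codon at the very end stays visible.
        return "%s('%s...%s', %s)" % (class_name, data[:54], data[-3:],
                                      alphabet_repr)
    return "%s(%r, %s)" % (class_name, data, alphabet_repr)


short_repr = truncated_repr("Seq", "ACGT", "DNAAlphabet()")
long_data = "ATG" + "C" * 60 + "TAA"
long_repr = truncated_repr("Seq", long_data, "DNAAlphabet()")
```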

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 136 of file Seq.py.

00136 
00137     def __repr__(self):
00138         """Returns a (truncated) representation of the sequence for debugging."""
00139         if len(self) > 60:
00140             #Shows the last three letters as it is often useful to see if there
00141             #is a stop codon at the end of a sequence.
00142             #Note total length is 54+3+3=60
00143             return "%s('%s...%s', %s)" % (self.__class__.__name__,
00144                                    str(self)[:54], str(self)[-3:],
00145                                    repr(self.alphabet))
00146         else:
00147             return "%s(%s, %s)" % (self.__class__.__name__,
00148                                   repr(self._data),
                                   repr(self.alphabet))

def Bio.Seq.Seq.__str__ (   self) [inherited]
Returns the full sequence as a python string, use str(my_seq).

Note that Biopython 1.44 and earlier would give a truncated
version of repr(my_seq) for str(my_seq).  If you are writing code
which needs to be backwards compatible with old Biopython, you
should continue to use my_seq.tostring() rather than str(my_seq).

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 149 of file Seq.py.

00149 
00150     def __str__(self):
00151         """Returns the full sequence as a python string, use str(my_seq).
00152 
00153         Note that Biopython 1.44 and earlier would give a truncated
00154         version of repr(my_seq) for str(my_seq).  If you are writing code
00155         which needs to be backwards compatible with old Biopython, you
00156         should continue to use my_seq.tostring() rather than str(my_seq).
00157         """
00158         return self._data

def Bio.Seq.Seq.back_transcribe (   self) [inherited]
Returns the DNA sequence from an RNA sequence. New Seq object.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG",
...                     IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

Trying to back-transcribe a protein or DNA sequence raises an
exception:

>>> my_protein = Seq("MAIVMGR", IUPAC.protein)
>>> my_protein.back_transcribe()
Traceback (most recent call last):
   ...
ValueError: Proteins cannot be back transcribed!
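At the string level, back-transcription amounts to replacing U with T while preserving case. The sketch below shows just that step on a plain Python string; the function name back_transcribe_str is hypothetical, and the alphabet checks that raise ValueError for protein or DNA input are deliberately left out.

```python
def back_transcribe_str(rna):
    """Replace U/u with T/t in a plain string (no alphabet checks)."""
    return rna.replace("U", "T").replace("u", "t")


messenger_rna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
coding_dna = back_transcribe_str(messenger_rna)
mixed_case = back_transcribe_str("augGCCauu")
```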

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 840 of file Seq.py.

00840 
00841     def back_transcribe(self):
00842         """Returns the DNA sequence from an RNA sequence. New Seq object.
00843 
00844         >>> from Bio.Seq import Seq
00845         >>> from Bio.Alphabet import IUPAC
00846         >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG",
00847         ...                     IUPAC.unambiguous_rna)
00848         >>> messenger_rna
00849         Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
00850         >>> messenger_rna.back_transcribe()
00851         Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
00852 
00853         Trying to back-transcribe a protein or DNA sequence raises an
00854         exception:
00855 
00856         >>> my_protein = Seq("MAIVMGR", IUPAC.protein)
00857         >>> my_protein.back_transcribe()
00858         Traceback (most recent call last):
00859            ...
00860         ValueError: Proteins cannot be back transcribed!
00861         """
00862         base = Alphabet._get_base_alphabet(self.alphabet)
00863         if isinstance(base, Alphabet.ProteinAlphabet):
00864             raise ValueError("Proteins cannot be back transcribed!")
00865         if isinstance(base, Alphabet.DNAAlphabet):
00866             raise ValueError("DNA cannot be back transcribed!")
00867 
00868         if self.alphabet==IUPAC.unambiguous_rna:
00869             alphabet = IUPAC.unambiguous_dna
00870         elif self.alphabet==IUPAC.ambiguous_rna:
00871             alphabet = IUPAC.ambiguous_dna
00872         else:
00873             alphabet = Alphabet.generic_dna
00874         return Seq(str(self).replace("U", "T").replace("u", "t"), alphabet)

def Bio.Seq.Seq.complement (   self) [inherited]
Returns the complement sequence. New Seq object.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_dna = Seq("CCCCCGATAG", IUPAC.unambiguous_dna)
>>> my_dna
Seq('CCCCCGATAG', IUPACUnambiguousDNA())
>>> my_dna.complement()
Seq('GGGGGCTATC', IUPACUnambiguousDNA())

You can of course use mixed case sequences:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("CCCCCgatA-GD", generic_dna)
>>> my_dna
Seq('CCCCCgatA-GD', DNAAlphabet())
>>> my_dna.complement()
Seq('GGGGGctaT-CH', DNAAlphabet())

Note in the above example, ambiguous character D denotes
G, A or T so its complement is H (for C, T or A).

Trying to complement a protein sequence raises an exception.

>>> my_protein = Seq("MAIVMGR", IUPAC.protein)
>>> my_protein.complement()
Traceback (most recent call last):
   ...
ValueError: Proteins do not have complements!
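The complement is computed with a character translation table. A minimal sketch of that technique on a plain Python 3 string follows; the mapping uses the standard IUPAC ambiguous DNA complements, and the names DNA_COMPLEMENT and make_complement_table are hypothetical rather than Biopython's own.

```python
# Standard IUPAC DNA complements, including ambiguity codes.
DNA_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S",
    "Y": "R", "K": "M", "V": "B", "H": "D",
    "D": "H", "B": "V", "N": "N",
}


def make_complement_table(mapping):
    """Build a str.translate table covering upper and lower case."""
    letters = "".join(mapping.keys())
    partners = "".join(mapping.values())
    return str.maketrans(letters + letters.lower(),
                         partners + partners.lower())


table = make_complement_table(DNA_COMPLEMENT)

# Characters missing from the table, such as the gap "-", pass through.
result = "CCCCCgatA-GD".translate(table)
```

A single translate call avoids a Python-level loop over the letters, which is what makes this approach fast on long sequences.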

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 720 of file Seq.py.

00720 
00721     def complement(self):
00722         """Returns the complement sequence. New Seq object.
00723 
00724         >>> from Bio.Seq import Seq
00725         >>> from Bio.Alphabet import IUPAC
00726         >>> my_dna = Seq("CCCCCGATAG", IUPAC.unambiguous_dna)
00727         >>> my_dna
00728         Seq('CCCCCGATAG', IUPACUnambiguousDNA())
00729         >>> my_dna.complement()
00730         Seq('GGGGGCTATC', IUPACUnambiguousDNA())
00731 
00732         You can of course use mixed case sequences:
00733 
00734         >>> from Bio.Seq import Seq
00735         >>> from Bio.Alphabet import generic_dna
00736         >>> my_dna = Seq("CCCCCgatA-GD", generic_dna)
00737         >>> my_dna
00738         Seq('CCCCCgatA-GD', DNAAlphabet())
00739         >>> my_dna.complement()
00740         Seq('GGGGGctaT-CH', DNAAlphabet())
00741 
00742         Note in the above example, ambiguous character D denotes
00743         G, A or T so its complement is H (for C, T or A).
00744         
00745         Trying to complement a protein sequence raises an exception.
00746 
00747         >>> my_protein = Seq("MAIVMGR", IUPAC.protein)
00748         >>> my_protein.complement()
00749         Traceback (most recent call last):
00750            ...
00751         ValueError: Proteins do not have complements!
00752         """
00753         base = Alphabet._get_base_alphabet(self.alphabet)
00754         if isinstance(base, Alphabet.ProteinAlphabet):
00755             raise ValueError("Proteins do not have complements!")
00756         if isinstance(base, Alphabet.DNAAlphabet):
00757             ttable = _dna_complement_table
00758         elif isinstance(base, Alphabet.RNAAlphabet):
00759             ttable = _rna_complement_table
00760         elif ('U' in self._data or 'u' in self._data) \
00761         and ('T' in self._data or 't' in self._data):
00762             #TODO - Handle this cleanly?
00763             raise ValueError("Mixed RNA/DNA found")
00764         elif 'U' in self._data or 'u' in self._data:
00765             ttable = _rna_complement_table
00766         else:
00767             ttable = _dna_complement_table
00768         #Much faster on really long sequences than the previous loop based one.
00769         #thx to Michael Palmer, University of Waterloo
00770         return Seq(str(self).translate(ttable), self.alphabet)

def Bio.Seq.Seq.count (   self,
  sub,
  start = 0,
  end = sys.maxint 
) [inherited]
Non-overlapping count method, like that of a python string.

This behaves like the python string method of the same name,
which does a non-overlapping count!

Returns an integer, the number of occurrences of substring
argument sub in the (sub)sequence given by [start:end].
Optional arguments start and end are interpreted as in slice
notation.
    
Arguments:
 - sub - a string or another Seq object to look for
 - start - optional integer, slice start
 - end - optional integer, slice end

e.g.

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AAAATGA")
>>> print my_seq.count("A")
5
>>> print my_seq.count("ATG")
1
>>> print my_seq.count(Seq("AT"))
1
>>> print my_seq.count("AT", 2, -1)
1

HOWEVER, please note because python strings and Seq objects (and
MutableSeq objects) do a non-overlapping search, this may not give
the answer you expect:

>>> "AAAA".count("AA")
2
>>> print Seq("AAAA").count("AA")
2

An overlapping search would give the answer as three!
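Where overlapping matches do need to be counted, one option is to step through the sequence with find. The helper below is a minimal sketch on plain strings (pass str(my_seq) for a Seq); the name count_overlapping is hypothetical and is not a Biopython method in this release.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    total = 0
    index = text.find(sub)
    while index != -1:
        total += 1
        # Restart one position later so overlapping matches are seen.
        index = text.find(sub, index + 1)
    return total


non_overlapping = "AAAA".count("AA")
overlapping = count_overlapping("AAAA", "AA")
```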

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 362 of file Seq.py.

00362 
00363     def count(self, sub, start=0, end=sys.maxint):
00364         """Non-overlapping count method, like that of a python string.
00365 
00366         This behaves like the python string method of the same name,
00367         which does a non-overlapping count!
00368 
00369         Returns an integer, the number of occurrences of substring
00370         argument sub in the (sub)sequence given by [start:end].
00371         Optional arguments start and end are interpreted as in slice
00372         notation.
00373     
00374         Arguments:
00375          - sub - a string or another Seq object to look for
00376          - start - optional integer, slice start
00377          - end - optional integer, slice end
00378 
00379         e.g.
00380 
00381         >>> from Bio.Seq import Seq
00382         >>> my_seq = Seq("AAAATGA")
00383         >>> print my_seq.count("A")
00384         5
00385         >>> print my_seq.count("ATG")
00386         1
00387         >>> print my_seq.count(Seq("AT"))
00388         1
00389         >>> print my_seq.count("AT", 2, -1)
00390         1
00391 
00392         HOWEVER, please note because python strings and Seq objects (and
00393         MutableSeq objects) do a non-overlapping search, this may not give
00394         the answer you expect:
00395 
00396         >>> "AAAA".count("AA")
00397         2
00398         >>> print Seq("AAAA").count("AA")
00399         2
00400 
00401         An overlapping search would give the answer as three!
00402         """
00403         #If it has one, check the alphabet:
00404         sub_str = self._get_seq_str_and_check_alphabet(sub)
00405         return str(self).count(sub_str, start, end)

def Bio.Seq.Seq.data (   self) [inherited]
Sequence as a string (DEPRECATED).

This is a read only property provided for backwards compatibility with
older versions of Biopython (as is the tostring() method). We now
encourage you to use str(my_seq) instead of my_seq.data or the method
my_seq.tostring().

In recent releases of Biopython it was possible to change a Seq object
by updating its data property, but this triggered a deprecation warning.
Now the data property is read only, since Seq objects are meant to be
immutable:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_seq = Seq("ACGT", generic_dna)
>>> str(my_seq) == my_seq.tostring() == "ACGT"
True
>>> my_seq.data = "AAAA"
Traceback (most recent call last):
   ...
AttributeError: can't set attribute
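The read-only behaviour comes from defining a property with a getter but no setter. The sketch below shows that mechanism with a hypothetical stand-in class, ReadOnlyData (not part of Biopython), and leaves out the deprecation warning that the real property issues.

```python
class ReadOnlyData(object):
    """Hypothetical stand-in with a read-only data property."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data

    @property
    def data(self):
        # Getter only: no setter is defined, so assignment fails.
        return str(self)


my_seq = ReadOnlyData("ACGT")
value = my_seq.data

try:
    my_seq.data = "AAAA"
    assignment_failed = False
except AttributeError:
    assignment_failed = True
```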

Definition at line 106 of file Seq.py.

00106 
00107     def data(self) :
00108         """Sequence as a string (DEPRECATED).
00109 
00110         This is a read only property provided for backwards compatibility with
00111         older versions of Biopython (as is the tostring() method). We now
00112         encourage you to use str(my_seq) instead of my_seq.data or the method
00113         my_seq.tostring().
00114 
00115         In recent releases of Biopython it was possible to change a Seq object
00116         by updating its data property, but this triggered a deprecation warning.
00117         Now the data property is read only, since Seq objects are meant to be
00118         immutable:
00119 
00120         >>> from Bio.Seq import Seq
00121         >>> from Bio.Alphabet import generic_dna
00122         >>> my_seq = Seq("ACGT", generic_dna)
00123         >>> str(my_seq) == my_seq.tostring() == "ACGT"
00124         True
00125         >>> my_seq.data = "AAAA"
00126         Traceback (most recent call last):
00127            ...
00128         AttributeError: can't set attribute
00129         """
00130         import warnings
00131         import Bio
00132         warnings.warn("Accessing the .data attribute is deprecated. Please "
00133                       "use str(my_seq) or my_seq.tostring() instead of "
00134                       "my_seq.data.", Bio.BiopythonDeprecationWarning)
00135         return str(self)

def Bio.Seq.Seq.endswith (   self,
  suffix,
  start = 0,
  end = sys.maxint 
) [inherited]
Does the Seq end with the given suffix?  Returns True/False.

This behaves like the python string method of the same name.

Return True if the sequence ends with the specified suffix
(a string or another Seq object), False otherwise.
With optional start, test sequence beginning at that position.
With optional end, stop comparing sequence at that position.
suffix can also be a tuple of strings to try.  e.g.

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_rna.endswith("UUG")
True
>>> my_rna.endswith("AUG")
False
>>> my_rna.endswith("AUG", 0, 18)
True
>>> my_rna.endswith(("UCC","UCA","UUG"))
True
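The tuple form is handled by trying each suffix in turn and stopping at the first match. A minimal sketch of that loop on a plain string, using the same sequence as the example above, is given here; the helper name endswith_any is hypothetical, and the alphabet checks are omitted.

```python
def endswith_any(text, suffixes, start=0, end=None):
    """Return True if text[start:end] ends with any of the suffixes."""
    if end is None:
        end = len(text)
    for suffix in suffixes:
        if text.endswith(suffix, start, end):
            return True
    return False


my_rna = "GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG"
any_match = endswith_any(my_rna, ("UCC", "UCA", "UUG"))
no_match = endswith_any(my_rna, ("UCC", "UCA"))
windowed = endswith_any(my_rna, ("AUG",), 0, 18)
```

Python strings accept a tuple of suffixes directly, so my_rna.endswith(("UCC", "UCA", "UUG")) gives the same answer as the loop.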

Definition at line 526 of file Seq.py.

00526 
00527     def endswith(self, suffix, start=0, end=sys.maxint):
00528         """Does the Seq end with the given suffix?  Returns True/False.
00529 
00530         This behaves like the python string method of the same name.
00531 
00532         Return True if the sequence ends with the specified suffix
00533         (a string or another Seq object), False otherwise.
00534         With optional start, test sequence beginning at that position.
00535         With optional end, stop comparing sequence at that position.
00536         suffix can also be a tuple of strings to try.  e.g.
00537 
00538         >>> from Bio.Seq import Seq
00539         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00540         >>> my_rna.endswith("UUG")
00541         True
00542         >>> my_rna.endswith("AUG")
00543         False
00544         >>> my_rna.endswith("AUG", 0, 18)
00545         True
00546         >>> my_rna.endswith(("UCC","UCA","UUG"))
00547         True
00548         """        
00549         #If it has one, check the alphabet:
00550         if isinstance(suffix, tuple):
00551             #TODO - Once we drop support for Python 2.4, instead of this
00552             #loop offload to the string method (requires Python 2.5+).
00553             #Check all the alphabets first...
00554             suffix_strings = [self._get_seq_str_and_check_alphabet(p) \
00555                               for p in suffix]
00556             for suffix_str in suffix_strings:
00557                 if str(self).endswith(suffix_str, start, end):
00558                     return True
00559             return False
00560         else:
00561             suffix_str = self._get_seq_str_and_check_alphabet(suffix)
00562             return str(self).endswith(suffix_str, start, end)
00563 

def Bio.Seq.Seq.find (   self,
  sub,
  start = 0,
  end = sys.maxint 
) [inherited]
Find method, like that of a python string.

This behaves like the python string method of the same name.

Returns an integer, the index of the first occurrence of substring
argument sub in the (sub)sequence given by [start:end].

Arguments:
 - sub - a string or another Seq object to look for
 - start - optional integer, slice start
 - end - optional integer, slice end

Returns -1 if the subsequence is NOT found.

e.g. Locating the first typical start codon, AUG, in an RNA sequence:

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_rna.find("AUG")
3
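The start and end arguments and the -1 return value can be seen on a plain Python string, which behaves the same way as described above. This short sketch reuses the sequence from the example; the variable names are illustrative only.

```python
my_rna = "GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG"

# First AUG in the whole sequence.
first = my_rna.find("AUG")

# Next AUG, searching from just after the first hit.
second = my_rna.find("AUG", first + 1)

# No further AUG beyond that point, so find returns -1.
third = my_rna.find("AUG", second + 1)

# The end argument is exclusive: [0:5] is "GUCAU", which has no AUG.
windowed = my_rna.find("AUG", 0, 5)
```

Because -1 is also a valid index in Python (the last letter), the result should be checked before it is used for slicing.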

Definition at line 437 of file Seq.py.

00437 
00438     def find(self, sub, start=0, end=sys.maxint):
00439         """Find method, like that of a python string.
00440 
00441         This behaves like the python string method of the same name.
00442 
00443         Returns an integer, the index of the first occurrence of substring
00444         argument sub in the (sub)sequence given by [start:end].
00445 
00446         Arguments:
00447          - sub - a string or another Seq object to look for
00448          - start - optional integer, slice start
00449          - end - optional integer, slice end
00450 
00451         Returns -1 if the subsequence is NOT found.
00452         
00453         e.g. Locating the first typical start codon, AUG, in an RNA sequence:
00454 
00455         >>> from Bio.Seq import Seq
00456         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00457         >>> my_rna.find("AUG")
00458         3
00459         """
00460         #If it has one, check the alphabet:
00461         sub_str = self._get_seq_str_and_check_alphabet(sub)
00462         return str(self).find(sub_str, start, end)

def Bio.Seq.Seq.lower (   self) [inherited]
Returns a lower case copy of the sequence.

This will adjust the alphabet if required. Note that the IUPAC alphabets
are upper case only, and thus a generic alphabet must be substituted.

>>> from Bio.Alphabet import Gapped, generic_dna
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> my_seq = Seq("CGGTACGCTTATGTCACGTAG*AAAAAA", Gapped(IUPAC.unambiguous_dna, "*"))
>>> my_seq
Seq('CGGTACGCTTATGTCACGTAG*AAAAAA', Gapped(IUPACUnambiguousDNA(), '*'))
>>> my_seq.lower()
Seq('cggtacgcttatgtcacgtag*aaaaaa', Gapped(DNAAlphabet(), '*'))

See also the upper method.

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 701 of file Seq.py.

00701 
00702     def lower(self):
00703         """Returns a lower case copy of the sequence.
00704 
00705         This will adjust the alphabet if required. Note that the IUPAC alphabets
00706         are upper case only, and thus a generic alphabet must be substituted.
00707 
00708         >>> from Bio.Alphabet import Gapped, generic_dna
00709         >>> from Bio.Alphabet import IUPAC
00710         >>> from Bio.Seq import Seq
00711         >>> my_seq = Seq("CGGTACGCTTATGTCACGTAG*AAAAAA", Gapped(IUPAC.unambiguous_dna, "*"))
00712         >>> my_seq
00713         Seq('CGGTACGCTTATGTCACGTAG*AAAAAA', Gapped(IUPACUnambiguousDNA(), '*'))
00714         >>> my_seq.lower()
00715         Seq('cggtacgcttatgtcacgtag*aaaaaa', Gapped(DNAAlphabet(), '*'))
00716 
00717         See also the upper method.
00718         """
00719         return Seq(str(self).lower(), self.alphabet._lower())

def Bio.Seq.Seq.lstrip (   self,
  chars = None 
) [inherited]
Returns a new Seq object with leading (left) end stripped.

This behaves like the python string method of the same name.

Optional argument chars defines which characters to remove.  If
omitted or None (default) then as for the python string method,
this defaults to removing any white space.

e.g. print my_seq.lstrip("-")

See also the strip and rstrip methods.
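
As with the string method, chars is treated as a set of characters to remove, not as a literal prefix. A short sketch on plain strings, assuming the Seq method keeps the same characters that str.lstrip keeps (the variable names are illustrative only):

```python
aligned = "--.-ATG-CC--"

# Only "-" is in the set, so stripping stops at the first "." it meets.
left_trimmed = aligned.lstrip("-")

# Both "-" and "." are in the set, so any leading mix of the two is removed.
set_trimmed = aligned.lstrip("-.")
```

Characters after the first non-matching position are never touched, which is why the internal and trailing gaps survive in both results.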

Definition at line 642 of file Seq.py.

00642 
00643     def lstrip(self, chars=None):
00644         """Returns a new Seq object with leading (left) end stripped.
00645 
00646         This behaves like the python string method of the same name.
00647 
00648         Optional argument chars defines which characters to remove.  If
00649         omitted or None (default) then as for the python string method,
00650         this defaults to removing any white space.
00651         
00652         e.g. print my_seq.lstrip("-")
00653 
00654         See also the strip and rstrip methods.
00655         """
00656         #If it has one, check the alphabet:
00657         strip_str = self._get_seq_str_and_check_alphabet(chars)
00658         return Seq(str(self).lstrip(strip_str), self.alphabet)

def Bio.Seq.Seq.reverse_complement (   self) [inherited]
Returns the reverse complement sequence. New Seq object.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_dna = Seq("CCCCCGATAGNR", IUPAC.ambiguous_dna)
>>> my_dna
Seq('CCCCCGATAGNR', IUPACAmbiguousDNA())
>>> my_dna.reverse_complement()
Seq('YNCTATCGGGGG', IUPACAmbiguousDNA())

Note in the above example, since R = G or A, its complement
is Y (which denotes C or T).

You can of course use mixed case sequences:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("CCCCCgatA-G", generic_dna)
>>> my_dna
Seq('CCCCCgatA-G', DNAAlphabet())
>>> my_dna.reverse_complement()
Seq('C-TatcGGGGG', DNAAlphabet())

Trying to complement a protein sequence raises an exception:

>>> my_protein = Seq("MAIVMGR", IUPAC.protein)
>>> my_protein.reverse_complement()
Traceback (most recent call last):
   ...
ValueError: Proteins do not have complements!
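
The complement-then-reverse idea can be sketched without Biopython using a translation table built from the IUPAC nucleotide ambiguity codes. This is an illustration under stated assumptions, not the library's implementation: the function name reverse_complement_sketch is hypothetical, it works on plain DNA strings, and it performs no alphabet check (so it would not raise the ValueError shown above for proteins).

```python
# IUPAC DNA codes and their complements, position by position.
# e.g. R (A or G) pairs with Y (T or C); S, W and N are their own complements.
_CODES = "ACGTMRWSYKVHDBN"
_COMPLEMENTS = "TGCAKYWSRMBDHVN"

# Cover both cases so mixed case input keeps its case in the output.
_COMPLEMENT_TABLE = str.maketrans(
    _CODES + _CODES.lower(),
    _COMPLEMENTS + _COMPLEMENTS.lower(),
)


def reverse_complement_sketch(dna):
    """Complement each letter, then reverse with a -1 slice step."""
    # Characters missing from the table (such as the gap "-") pass through unchanged.
    return dna.translate(_COMPLEMENT_TABLE)[::-1]


ambiguous = reverse_complement_sketch("CCCCCGATAGNR")
mixed_case = reverse_complement_sketch("CCCCCgatA-G")
```

Both calls reproduce the sequences shown in the examples above.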

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 771 of file Seq.py.

00771 
00772     def reverse_complement(self):
00773         """Returns the reverse complement sequence. New Seq object.
00774 
00775         >>> from Bio.Seq import Seq
00776         >>> from Bio.Alphabet import IUPAC
00777         >>> my_dna = Seq("CCCCCGATAGNR", IUPAC.ambiguous_dna)
00778         >>> my_dna
00779         Seq('CCCCCGATAGNR', IUPACAmbiguousDNA())
00780         >>> my_dna.reverse_complement()
00781         Seq('YNCTATCGGGGG', IUPACAmbiguousDNA())
00782 
00783         Note in the above example, since R = G or A, its complement
00784         is Y (which denotes C or T).
00785 
00786         You can of course use mixed case sequences:
00787 
00788         >>> from Bio.Seq import Seq
00789         >>> from Bio.Alphabet import generic_dna
00790         >>> my_dna = Seq("CCCCCgatA-G", generic_dna)
00791         >>> my_dna
00792         Seq('CCCCCgatA-G', DNAAlphabet())
00793         >>> my_dna.reverse_complement()
00794         Seq('C-TatcGGGGG', DNAAlphabet())
00795 
00796         Trying to complement a protein sequence raises an exception:
00797 
00798         >>> my_protein = Seq("MAIVMGR", IUPAC.protein)
00799         >>> my_protein.reverse_complement()
00800         Traceback (most recent call last):
00801            ...
00802         ValueError: Proteins do not have complements!
00803         """
00804         #Use -1 stride/step to reverse the complement
00805         return self.complement()[::-1]

def Bio.Seq.Seq.rfind (   self,
  sub,
  start = 0,
  end = sys.maxint 
) [inherited]
Find from right method, like that of a python string.

This behaves like the python string method of the same name.

Returns an integer, the index of the last (right most) occurrence of
substring argument sub in the (sub)sequence given by [start:end].

Arguments:
 - sub - a string or another Seq object to look for
 - start - optional integer, slice start
 - end - optional integer, slice end

Returns -1 if the subsequence is NOT found.

e.g. Locating the last typical start codon, AUG, in an RNA sequence:

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_rna.rfind("AUG")
15
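
The end argument can be used to look for an earlier match once the last one is known. A small sketch on a plain string, assuming Seq.rfind honours start and end exactly as str.rfind does:

```python
my_rna = "GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG"

last = my_rna.rfind("AUG")

# A match must fit entirely inside [start:end]. Ending the slice two
# characters into the last AUG excludes it, so the previous one is found.
before_last = my_rna.rfind("AUG", 0, last + 2)
```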

Definition at line 463 of file Seq.py.

00463 
00464     def rfind(self, sub, start=0, end=sys.maxint):
00465         """Find from right method, like that of a python string.
00466 
00467         This behaves like the python string method of the same name.
00468 
00469         Returns an integer, the index of the last (right most) occurrence of
00470         substring argument sub in the (sub)sequence given by [start:end].
00471 
00472         Arguments:
00473          - sub - a string or another Seq object to look for
00474          - start - optional integer, slice start
00475          - end - optional integer, slice end
00476 
00477         Returns -1 if the subsequence is NOT found.
00478 
00479         e.g. Locating the last typical start codon, AUG, in an RNA sequence:
00480 
00481         >>> from Bio.Seq import Seq
00482         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00483         >>> my_rna.rfind("AUG")
00484         15
00485         """
00486         #If it has one, check the alphabet:
00487         sub_str = self._get_seq_str_and_check_alphabet(sub)
00488         return str(self).rfind(sub_str, start, end)

def Bio.Seq.Seq.rsplit (   self,
  sep = None,
  maxsplit = -1 
) [inherited]
Right split method, like that of a python string.

This behaves like the python string method of the same name.

Return a list of the 'words' in the string (as Seq objects),
using sep as the delimiter string.  If maxsplit is given, at
most maxsplit splits are done COUNTING FROM THE RIGHT.
If maxsplit is omitted, all splits are made.

Following the python string method, sep will by default be any
white space (tabs, spaces, newlines) but this is unlikely to
apply to biological sequences.

e.g. print my_seq.rsplit("*",1)

See also the split method.
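
The difference between split and rsplit only shows when maxsplit limits the number of cuts. A brief sketch on a plain protein string (the Seq method is documented above as returning the same pieces wrapped as Seq objects):

```python
protein = "VMAIVMGR*KGAR*L"

from_left = protein.split("*", 1)    # one cut, at the first "*"
from_right = protein.rsplit("*", 1)  # one cut, at the last "*"
all_parts = protein.rsplit("*")      # no limit: same pieces as split("*")
```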

Definition at line 602 of file Seq.py.

00602 
00603     def rsplit(self, sep=None, maxsplit=-1):
00604         """Right split method, like that of a python string.
00605 
00606         This behaves like the python string method of the same name.
00607 
00608         Return a list of the 'words' in the string (as Seq objects),
00609         using sep as the delimiter string.  If maxsplit is given, at
00610         most maxsplit splits are done COUNTING FROM THE RIGHT.
00611         If maxsplit is omitted, all splits are made.
00612 
00613         Following the python string method, sep will by default be any
00614         white space (tabs, spaces, newlines) but this is unlikely to
00615         apply to biological sequences.
00616         
00617         e.g. print my_seq.rsplit("*",1)
00618 
00619         See also the split method.
00620         """
00621         #If it has one, check the alphabet:
00622         sep_str = self._get_seq_str_and_check_alphabet(sep)
00623         return [Seq(part, self.alphabet) \
00624                 for part in str(self).rsplit(sep_str, maxsplit)]

def Bio.Seq.Seq.rstrip (   self,
  chars = None 
) [inherited]
Returns a new Seq object with trailing (right) end stripped.

This behaves like the python string method of the same name.

Optional argument chars defines which characters to remove.  If
omitted or None (default) then as for the python string method,
this defaults to removing any white space.

e.g. Removing a nucleotide sequence's polyadenylation (poly-A tail):

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> my_seq = Seq("CGGTACGCTTATGTCACGTAGAAAAAA", IUPAC.unambiguous_dna)
>>> my_seq
Seq('CGGTACGCTTATGTCACGTAGAAAAAA', IUPACUnambiguousDNA())
>>> my_seq.rstrip("A")
Seq('CGGTACGCTTATGTCACGTAG', IUPACUnambiguousDNA())

See also the strip and lstrip methods.

Definition at line 659 of file Seq.py.

00659 
00660     def rstrip(self, chars=None):
00661         """Returns a new Seq object with trailing (right) end stripped.
00662 
00663         This behaves like the python string method of the same name.
00664 
00665         Optional argument chars defines which characters to remove.  If
00666         omitted or None (default) then as for the python string method,
00667         this defaults to removing any white space.
00668         
00669         e.g. Removing a nucleotide sequence's polyadenylation (poly-A tail):
00670 
00671         >>> from Bio.Alphabet import IUPAC
00672         >>> from Bio.Seq import Seq
00673         >>> my_seq = Seq("CGGTACGCTTATGTCACGTAGAAAAAA", IUPAC.unambiguous_dna)
00674         >>> my_seq
00675         Seq('CGGTACGCTTATGTCACGTAGAAAAAA', IUPACUnambiguousDNA())
00676         >>> my_seq.rstrip("A")
00677         Seq('CGGTACGCTTATGTCACGTAG', IUPACUnambiguousDNA())
00678 
00679         See also the strip and lstrip methods.
00680         """
00681         #If it has one, check the alphabet:
00682         strip_str = self._get_seq_str_and_check_alphabet(chars)
00683         return Seq(str(self).rstrip(strip_str), self.alphabet)

def Bio.Seq.Seq.split (   self,
  sep = None,
  maxsplit = -1 
) [inherited]
Split method, like that of a python string.

This behaves like the python string method of the same name.

Return a list of the 'words' in the string (as Seq objects),
using sep as the delimiter string.  If maxsplit is given, at
most maxsplit splits are done.  If maxsplit is omitted, all
splits are made.

Following the python string method, sep will by default be any
white space (tabs, spaces, newlines) but this is unlikely to
apply to biological sequences.

e.g.

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_aa = my_rna.translate()
>>> my_aa
Seq('VMAIVMGR*KGAR*L', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> my_aa.split("*")
[Seq('VMAIVMGR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KGAR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('L', HasStopCodon(ExtendedIUPACProtein(), '*'))]
>>> my_aa.split("*",1)
[Seq('VMAIVMGR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KGAR*L', HasStopCodon(ExtendedIUPACProtein(), '*'))]

See also the rsplit method:

>>> my_aa.rsplit("*",1)
[Seq('VMAIVMGR*KGAR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('L', HasStopCodon(ExtendedIUPACProtein(), '*'))]

Definition at line 564 of file Seq.py.

00564 
00565     def split(self, sep=None, maxsplit=-1):
00566         """Split method, like that of a python string.
00567 
00568         This behaves like the python string method of the same name.
00569 
00570         Return a list of the 'words' in the string (as Seq objects),
00571         using sep as the delimiter string.  If maxsplit is given, at
00572         most maxsplit splits are done.  If maxsplit is omitted, all
00573         splits are made.
00574 
00575         Following the python string method, sep will by default be any
00576         white space (tabs, spaces, newlines) but this is unlikely to
00577         apply to biological sequences.
00578         
00579         e.g.
00580 
00581         >>> from Bio.Seq import Seq
00582         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00583         >>> my_aa = my_rna.translate()
00584         >>> my_aa
00585         Seq('VMAIVMGR*KGAR*L', HasStopCodon(ExtendedIUPACProtein(), '*'))
00586         >>> my_aa.split("*")
00587         [Seq('VMAIVMGR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KGAR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('L', HasStopCodon(ExtendedIUPACProtein(), '*'))]
00588         >>> my_aa.split("*",1)
00589         [Seq('VMAIVMGR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KGAR*L', HasStopCodon(ExtendedIUPACProtein(), '*'))]
00590 
00591         See also the rsplit method:
00592 
00593         >>> my_aa.rsplit("*",1)
00594         [Seq('VMAIVMGR*KGAR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('L', HasStopCodon(ExtendedIUPACProtein(), '*'))]
00595         """
00596         #If it has one, check the alphabet:
00597         sep_str = self._get_seq_str_and_check_alphabet(sep)
00598         #TODO - If the sep is the defined stop symbol, or gap char,
00599         #should we adjust the alphabet?
00600         return [Seq(part, self.alphabet) \
00601                 for part in str(self).split(sep_str, maxsplit)]

def Bio.Seq.Seq.startswith (   self,
  prefix,
  start = 0,
  end = sys.maxint 
) [inherited]
Does the Seq start with the given prefix?  Returns True/False.

This behaves like the python string method of the same name.

Return True if the sequence starts with the specified prefix
(a string or another Seq object), False otherwise.
With optional start, test sequence beginning at that position.
With optional end, stop comparing sequence at that position.
prefix can also be a tuple of strings to try.  e.g.

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_rna.startswith("GUC")
True
>>> my_rna.startswith("AUG")
False
>>> my_rna.startswith("AUG", 3)
True
>>> my_rna.startswith(("UCC","UCA","UCG"),1)
True

Definition at line 489 of file Seq.py.

00489 
00490     def startswith(self, prefix, start=0, end=sys.maxint):
00491         """Does the Seq start with the given prefix?  Returns True/False.
00492 
00493         This behaves like the python string method of the same name.
00494 
00495         Return True if the sequence starts with the specified prefix
00496         (a string or another Seq object), False otherwise.
00497         With optional start, test sequence beginning at that position.
00498         With optional end, stop comparing sequence at that position.
00499         prefix can also be a tuple of strings to try.  e.g.
00500         
00501         >>> from Bio.Seq import Seq
00502         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00503         >>> my_rna.startswith("GUC")
00504         True
00505         >>> my_rna.startswith("AUG")
00506         False
00507         >>> my_rna.startswith("AUG", 3)
00508         True
00509         >>> my_rna.startswith(("UCC","UCA","UCG"),1)
00510         True
00511         """
00512         #If it has one, check the alphabet:
00513         if isinstance(prefix, tuple):
00514             #TODO - Once we drop support for Python 2.4, instead of this
00515             #loop offload to the string method (requires Python 2.5+).
00516             #Check all the alphabets first...
00517             prefix_strings = [self._get_seq_str_and_check_alphabet(p) \
00518                               for p in prefix]
00519             for prefix_str in prefix_strings:
00520                 if str(self).startswith(prefix_str, start, end):
00521                     return True
00522             return False
00523         else:
00524             prefix_str = self._get_seq_str_and_check_alphabet(prefix)
00525             return str(self).startswith(prefix_str, start, end)

def Bio.Seq.Seq.strip (   self,
  chars = None 
) [inherited]
Returns a new Seq object with leading and trailing ends stripped.

This behaves like the python string method of the same name.

Optional argument chars defines which characters to remove.  If
omitted or None (default) then as for the python string method,
this defaults to removing any white space.

e.g. print my_seq.strip("-")

See also the lstrip and rstrip methods.
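
Note that strip only trims the two ends; gap characters inside the sequence are kept. The sketch below shows this on a plain string and contrasts it with removing every gap, which on a Seq object is the job of the ungap method listed for this class. The use of str.replace here is only a stand-in for that method.

```python
gapped = "--ATA--TGA-"

# Leading and trailing "-" are removed, the internal "--" remains.
trimmed = gapped.strip("-")

# Removing every gap needs a different operation (a plain-string stand-in for ungap).
no_gaps = gapped.replace("-", "")
```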

Definition at line 625 of file Seq.py.

00625 
00626     def strip(self, chars=None):
00627         """Returns a new Seq object with leading and trailing ends stripped.
00628 
00629         This behaves like the python string method of the same name.
00630 
00631         Optional argument chars defines which characters to remove.  If
00632         omitted or None (default) then as for the python string method,
00633         this defaults to removing any white space.
00634         
00635         e.g. print my_seq.strip("-")
00636 
00637         See also the lstrip and rstrip methods.
00638         """
00639         #If it has one, check the alphabet:
00640         strip_str = self._get_seq_str_and_check_alphabet(chars)
00641         return Seq(str(self).strip(strip_str), self.alphabet)

def Bio.Seq.Seq.tomutable (   self) [inherited]
Returns the full sequence as a MutableSeq object.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("MKQHKAMIVALIVICITAVVAAL",
...              IUPAC.protein)
>>> my_seq
Seq('MKQHKAMIVALIVICITAVVAAL', IUPACProtein())
>>> my_seq.tomutable()
MutableSeq('MKQHKAMIVALIVICITAVVAAL', IUPACProtein())

Note that the alphabet is preserved.

Definition at line 325 of file Seq.py.

00325 
00326     def tomutable(self):   # Needed?  Or use a function?
00327         """Returns the full sequence as a MutableSeq object.
00328 
00329         >>> from Bio.Seq import Seq
00330         >>> from Bio.Alphabet import IUPAC
00331         >>> my_seq = Seq("MKQHKAMIVALIVICITAVVAAL",
00332         ...              IUPAC.protein)
00333         >>> my_seq
00334         Seq('MKQHKAMIVALIVICITAVVAAL', IUPACProtein())
00335         >>> my_seq.tomutable()
00336         MutableSeq('MKQHKAMIVALIVICITAVVAAL', IUPACProtein())
00337 
00338         Note that the alphabet is preserved.
00339         """
00340         return MutableSeq(str(self), self.alphabet)

def Bio.Seq.Seq.tostring (   self) [inherited]
Returns the full sequence as a python string (semi-obsolete).

Although not formally deprecated, you are now encouraged to use
str(my_seq) instead of my_seq.tostring().

Definition at line 312 of file Seq.py.

00312 
00313     def tostring(self):                            # Seq API requirement
00314         """Returns the full sequence as a python string (semi-obsolete).
00315 
00316         Although not formally deprecated, you are now encouraged to use
00317         str(my_seq) instead of my_seq.tostring()."""
00318         #TODO - Fix all places elsewhere in Biopython using this method,
00319         #then start deprecation process?
00320         #import warnings
00321         #warnings.warn("This method is obsolete; please use str(my_seq) "
00322         #              "instead of my_seq.tostring().",
00323         #              PendingDeprecationWarning)
00324         return str(self)
    

def Bio.Seq.Seq.transcribe (   self) [inherited]
Returns the RNA sequence from a DNA sequence. New Seq object.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
...                  IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

Trying to transcribe a protein or RNA sequence raises an exception:

>>> my_protein = Seq("MAIVMGR", IUPAC.protein)
>>> my_protein.transcribe()
Traceback (most recent call last):
   ...
ValueError: Proteins cannot be transcribed!
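
At the string level, transcription of the coding strand amounts to replacing thymine with uracil in both cases. A minimal sketch on plain strings; the name transcribe_sketch is hypothetical, and unlike the Seq method it performs no alphabet check, so it will not reject protein or RNA input.

```python
def transcribe_sketch(dna):
    """Replace T with U (and t with u), leaving every other character alone."""
    return dna.replace("T", "U").replace("t", "u")


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
messenger_rna = transcribe_sketch(coding_dna)
mixed_case = transcribe_sketch("atgGCCatt")
```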

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 806 of file Seq.py.

00806 
00807     def transcribe(self):
00808         """Returns the RNA sequence from a DNA sequence. New Seq object.
00809 
00810         >>> from Bio.Seq import Seq
00811         >>> from Bio.Alphabet import IUPAC
00812         >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
00813         ...                  IUPAC.unambiguous_dna)
00814         >>> coding_dna
00815         Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
00816         >>> coding_dna.transcribe()
00817         Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
00818 
00819         Trying to transcribe a protein or RNA sequence raises an exception:
00820 
00821         >>> my_protein = Seq("MAIVMGR", IUPAC.protein)
00822         >>> my_protein.transcribe()
00823         Traceback (most recent call last):
00824            ...
00825         ValueError: Proteins cannot be transcribed!
00826         """
00827         base = Alphabet._get_base_alphabet(self.alphabet)
00828         if isinstance(base, Alphabet.ProteinAlphabet):
00829             raise ValueError("Proteins cannot be transcribed!")
00830         if isinstance(base, Alphabet.RNAAlphabet):
00831             raise ValueError("RNA cannot be transcribed!")
00832 
00833         if self.alphabet==IUPAC.unambiguous_dna:
00834             alphabet = IUPAC.unambiguous_rna
00835         elif self.alphabet==IUPAC.ambiguous_dna:
00836             alphabet = IUPAC.ambiguous_rna
00837         else:
00838             alphabet = Alphabet.generic_rna
00839         return Seq(str(self).replace('T','U').replace('t','u'), alphabet)
    

def Bio.Seq.Seq.translate (   self,
  table = "Standard",
  stop_symbol = "*",
  to_stop = False,
  cds = False 
) [inherited]
Turns a nucleotide sequence into a protein sequence. New Seq object.

This method will translate DNA or RNA sequences, and those with a
nucleotide or generic alphabet.  Trying to translate a protein
sequence raises an exception.

Arguments:
 - table - Which codon table to use?  This can be either a name
           (string), an NCBI identifier (integer), or a CodonTable
           object (useful for non-standard genetic codes).  This
           defaults to the "Standard" table.
 - stop_symbol - Single character string, what to use for terminators.
                 This defaults to the asterisk, "*".
 - to_stop - Boolean, defaults to False meaning do a full translation
             continuing on past any stop codons (translated as the
             specified stop_symbol).  If True, translation is
             terminated at the first in frame stop codon (and the
             stop_symbol is not appended to the returned protein
             sequence).
 - cds - Boolean, indicates this is a complete CDS.  If True,
         this checks the sequence starts with a valid alternative start
         codon (which will be translated as methionine, M), that the
         sequence length is a multiple of three, and that there is a
         single in frame stop codon at the end (this will be excluded
         from the protein sequence, regardless of the to_stop option).
         If these tests fail, an exception is raised.

e.g. Using the standard table:

>>> coding_dna = Seq("GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna.translate()
Seq('VAIVMGR*KGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> coding_dna.translate(stop_symbol="@")
Seq('VAIVMGR@KGAR@', HasStopCodon(ExtendedIUPACProtein(), '@'))
>>> coding_dna.translate(to_stop=True)
Seq('VAIVMGR', ExtendedIUPACProtein())

Now using NCBI table 2, where TGA is not a stop codon:

>>> coding_dna.translate(table=2)
Seq('VAIVMGRWKGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('VAIVMGRWKGAR', ExtendedIUPACProtein())

In fact, GTG is an alternative start codon under NCBI table 2, meaning
this sequence could be a complete CDS:

>>> coding_dna.translate(table=2, cds=True)
Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())

It isn't a valid CDS under NCBI table 1, due to both the start codon and
also the in frame stop codons:

>>> coding_dna.translate(table=1, cds=True)
Traceback (most recent call last):
    ...
TranslationError: First codon 'GTG' is not a start codon

If the sequence has no in-frame stop codon, then the to_stop argument
has no effect:

>>> coding_dna2 = Seq("TTGGCCATTGTAATGGGCCGC")
>>> coding_dna2.translate()
Seq('LAIVMGR', ExtendedIUPACProtein())
>>> coding_dna2.translate(to_stop=True)
Seq('LAIVMGR', ExtendedIUPACProtein())

NOTE - Ambiguous codons like "TAN" or "NNN" could be an amino acid
or a stop codon.  These are translated as "X".  Any invalid codon
(e.g. "TA?" or "T-A") will throw a TranslationError.

NOTE - Does NOT support gapped sequences.

NOTE - This does NOT behave like the python string's translate
method.  For that use str(my_seq).translate(...) instead.
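
To make the stop_symbol and to_stop options concrete, here is a self-contained sketch of codon-by-codon translation with the standard genetic code (NCBI table 1). It is an illustration under stated assumptions rather than the library's code: the names STANDARD_TABLE and translate_sketch are hypothetical, only unambiguous upper case DNA is handled, the cds option and other codon tables are not covered, and a trailing partial codon is simply ignored.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering: TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(dna, stop_symbol="*", to_stop=False):
    """Translate unambiguous upper case DNA with the standard table."""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored here.
    for i in range(0, len(dna) - 2, 3):
        amino_acid = STANDARD_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            if to_stop:
                break  # stop before the terminator, which is not appended
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
full = translate_sketch(coding_dna)
custom_stop = translate_sketch(coding_dna, stop_symbol="@")
truncated = translate_sketch(coding_dna, to_stop=True)
```

The three results correspond to the first three standard-table examples above, minus the alphabet that the Seq method attaches to its return value.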

Definition at line 876 of file Seq.py.

00876     def translate(self, table="Standard", stop_symbol="*", to_stop=False,
00877                   cds=False):
00878         """Turns a nucleotide sequence into a protein sequence. New Seq object.
00879 
00880         This method will translate DNA or RNA sequences, and those with a
00881         nucleotide or generic alphabet.  Trying to translate a protein
00882         sequence raises an exception.
00883 
00884         Arguments:
00885          - table - Which codon table to use?  This can be either a name
00886                    (string), an NCBI identifier (integer), or a CodonTable
00887                    object (useful for non-standard genetic codes).  This
00888                    defaults to the "Standard" table.
00889          - stop_symbol - Single character string, what to use for terminators.
00890                          This defaults to the asterisk, "*".
00891          - to_stop - Boolean, defaults to False meaning do a full translation
00892                      continuing on past any stop codons (translated as the
00893                      specified stop_symbol).  If True, translation is
00894                      terminated at the first in frame stop codon (and the
00895                      stop_symbol is not appended to the returned protein
00896                      sequence).
00897          - cds - Boolean, indicates this is a complete CDS.  If True,
00898                  this checks the sequence starts with a valid alternative start
00899                  codon (which will be translated as methionine, M), that the
00900                  sequence length is a multiple of three, and that there is a
00901                  single in frame stop codon at the end (this will be excluded
00902                  from the protein sequence, regardless of the to_stop option).
00903                  If these tests fail, an exception is raised.
00904         
00905         e.g. Using the standard table:
00906 
00907         >>> coding_dna = Seq("GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
00908         >>> coding_dna.translate()
00909         Seq('VAIVMGR*KGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
00910         >>> coding_dna.translate(stop_symbol="@")
00911         Seq('VAIVMGR@KGAR@', HasStopCodon(ExtendedIUPACProtein(), '@'))
00912         >>> coding_dna.translate(to_stop=True)
00913         Seq('VAIVMGR', ExtendedIUPACProtein())
00914 
00915         Now using NCBI table 2, where TGA is not a stop codon:
00916 
00917         >>> coding_dna.translate(table=2)
00918         Seq('VAIVMGRWKGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
00919         >>> coding_dna.translate(table=2, to_stop=True)
00920         Seq('VAIVMGRWKGAR', ExtendedIUPACProtein())
00921 
00922         In fact, GTG is an alternative start codon under NCBI table 2, meaning
00923         this sequence could be a complete CDS:
00924 
00925         >>> coding_dna.translate(table=2, cds=True)
00926         Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())
00927 
00928         It isn't a valid CDS under NCBI table 1, due to both the start codon and
00929         also the in frame stop codons:
00930         
00931         >>> coding_dna.translate(table=1, cds=True)
00932         Traceback (most recent call last):
00933             ...
00934         TranslationError: First codon 'GTG' is not a start codon
00935 
00936         If the sequence has no in-frame stop codon, then the to_stop argument
00937         has no effect:
00938 
00939         >>> coding_dna2 = Seq("TTGGCCATTGTAATGGGCCGC")
00940         >>> coding_dna2.translate()
00941         Seq('LAIVMGR', ExtendedIUPACProtein())
00942         >>> coding_dna2.translate(to_stop=True)
00943         Seq('LAIVMGR', ExtendedIUPACProtein())
00944 
00945         NOTE - Ambiguous codons like "TAN" or "NNN" could be an amino acid
00946         or a stop codon.  These are translated as "X".  Any invalid codon
00947         (e.g. "TA?" or "T-A") will throw a TranslationError.
00948 
00949         NOTE - Does NOT support gapped sequences.
00950 
00951         NOTE - This does NOT behave like the python string's translate
00952         method.  For that use str(my_seq).translate(...) instead.
00953         """
00954         if isinstance(table, str) and len(table)==256:
00955             raise ValueError("The Seq object translate method DOES NOT take " \
00956                              + "a 256 character string mapping table like " \
                             + "the python string object's translate method. " \
                             + "Use str(my_seq).translate(...) instead.")
        if isinstance(Alphabet._get_base_alphabet(self.alphabet),
                      Alphabet.ProteinAlphabet):
            raise ValueError("Proteins cannot be translated!")
        try:
            table_id = int(table)
        except ValueError:
            # Assume it's a table name
            if self.alphabet == IUPAC.unambiguous_dna:
                # Will use standard IUPAC protein alphabet, no need for X
                codon_table = CodonTable.unambiguous_dna_by_name[table]
            elif self.alphabet == IUPAC.unambiguous_rna:
                # Will use standard IUPAC protein alphabet, no need for X
                codon_table = CodonTable.unambiguous_rna_by_name[table]
            else:
                # This will use the extended IUPAC protein alphabet with X etc.
                # The same table can be used for RNA or DNA (we use this for
                # translating strings).
                codon_table = CodonTable.ambiguous_generic_by_name[table]
        except (AttributeError, TypeError):
            # Assume it's a CodonTable object
            if isinstance(table, CodonTable.CodonTable):
                codon_table = table
            else:
                raise ValueError('Bad table argument')
        else:
            # Assume it's a table ID
            if self.alphabet == IUPAC.unambiguous_dna:
                # Will use standard IUPAC protein alphabet, no need for X
                codon_table = CodonTable.unambiguous_dna_by_id[table_id]
            elif self.alphabet == IUPAC.unambiguous_rna:
                # Will use standard IUPAC protein alphabet, no need for X
                codon_table = CodonTable.unambiguous_rna_by_id[table_id]
            else:
                # This will use the extended IUPAC protein alphabet with X etc.
                # The same table can be used for RNA or DNA (we use this for
                # translating strings).
                codon_table = CodonTable.ambiguous_generic_by_id[table_id]
        protein = _translate_str(str(self), codon_table,
                                 stop_symbol, to_stop, cds)
        if stop_symbol in protein:
            alphabet = Alphabet.HasStopCodon(codon_table.protein_alphabet,
                                             stop_symbol = stop_symbol)
        else:
            alphabet = codon_table.protein_alphabet
        return Seq(protein, alphabet)
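
The table lookup in the listing above relies on Python's try/except/else:
int(table) succeeds for a numeric ID, raises ValueError for a name, and
typically raises TypeError for an arbitrary object. The standalone sketch
below mimics that dispatch with plain dictionaries. ToyCodonTable,
TABLES_BY_ID, TABLES_BY_NAME and resolve_table are hypothetical stand-ins
for illustration only, not part of Biopython:

```python
class ToyCodonTable(object):
    """Hypothetical stand-in for a codon table object."""

    def __init__(self, table_id, name):
        self.id = table_id
        self.name = name


# Hypothetical registries; the real listing chooses between several such
# dictionaries depending on the sequence's alphabet.
STANDARD = ToyCodonTable(1, "Standard")
VERT_MITO = ToyCodonTable(2, "Vertebrate Mitochondrial")
TABLES_BY_ID = {t.id: t for t in (STANDARD, VERT_MITO)}
TABLES_BY_NAME = {t.name: t for t in (STANDARD, VERT_MITO)}


def resolve_table(table):
    """Accept a table ID, a table name, or a table object."""
    try:
        table_id = int(table)
    except ValueError:
        # A string that is not a number: treat it as a table name.
        return TABLES_BY_NAME[table]
    except (AttributeError, TypeError):
        # int() cannot handle the object at all: accept it only if it
        # already is a table.
        if isinstance(table, ToyCodonTable):
            return table
        raise ValueError("Bad table argument")
    else:
        # int() succeeded, so this is a numeric ID (1 or "1").
        return TABLES_BY_ID[table_id]


print(resolve_table(2).name)
print(resolve_table("Standard").id)
print(resolve_table(STANDARD) is STANDARD)
```

In this sketch an unknown name or ID surfaces as a KeyError from the
dictionary lookup, while an object of the wrong type gives ValueError.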

def Bio.Seq.Seq.ungap(self, gap=None) [inherited]
Return a copy of the sequence without the gap character(s).

The gap character can be specified in two ways - either as an explicit
argument, or via the sequence's alphabet. For example:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("-ATA--TGAAAT-TTGAAAA", generic_dna)
>>> my_dna
Seq('-ATA--TGAAAT-TTGAAAA', DNAAlphabet())
>>> my_dna.ungap("-")
Seq('ATATGAAATTTGAAAA', DNAAlphabet())

If the gap character is not given as an argument, it will be taken from
the sequence's alphabet (if defined). Notice that the returned sequence's
alphabet is adjusted since it no longer requires a gapped alphabet:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC, Gapped, HasStopCodon
>>> my_pro = Seq("MVVLE=AD*", HasStopCodon(Gapped(IUPAC.protein, "=")))
>>> my_pro
Seq('MVVLE=AD*', HasStopCodon(Gapped(IUPACProtein(), '='), '*'))
>>> my_pro.ungap()
Seq('MVVLEAD*', HasStopCodon(IUPACProtein(), '*'))

Or, with a simpler gapped DNA example:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC, Gapped
>>> my_seq = Seq("CGGGTAG=AAAAAA", Gapped(IUPAC.unambiguous_dna, "="))
>>> my_seq
Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '='))
>>> my_seq.ungap()
Seq('CGGGTAGAAAAAA', IUPACUnambiguousDNA())

Supplying the gap character as an argument is redundant when the alphabet
already declares one, but it is allowed as long as the two agree:

>>> my_seq
Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '='))
>>> my_seq.ungap("=")
Seq('CGGGTAGAAAAAA', IUPACUnambiguousDNA())

However, if the gap character given as the argument disagrees with that
declared in the alphabet, an exception is raised:

>>> my_seq
Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '='))
>>> my_seq.ungap("-")
Traceback (most recent call last):
   ...
ValueError: Gap '-' does not match '=' from alphabet

Finally, if a gap character is not supplied, and the alphabet does not
define one, an exception is raised:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("ATA--TGAAAT-TTGAAAA", generic_dna)
>>> my_dna
Seq('ATA--TGAAAT-TTGAAAA', DNAAlphabet())
>>> my_dna.ungap()
Traceback (most recent call last):
   ...
ValueError: Gap character not given and not defined in alphabet
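
The rules illustrated by the examples above can be summarised in a small
standalone sketch that works on plain strings. resolve_gap and strip_gaps
are hypothetical helper names, and alphabet_gap stands in for the gap
character declared by the alphabet (None when the alphabet declares none):

```python
def resolve_gap(gap, alphabet_gap):
    """Return the gap character to strip, or raise ValueError.

    gap          -- the explicit argument, or None if omitted
    alphabet_gap -- the alphabet's gap character, or None if it has none
    """
    if alphabet_gap is not None:
        if not gap:
            # No argument given: fall back to the alphabet's gap character.
            return alphabet_gap
        if gap != alphabet_gap:
            raise ValueError("Gap %r does not match %r from alphabet"
                             % (gap, alphabet_gap))
        # Redundant but consistent argument.
        return gap
    if not gap:
        raise ValueError("Gap character not given and not defined in alphabet")
    return gap


def strip_gaps(text, gap=None, alphabet_gap=None):
    """Remove the resolved gap character from a plain string."""
    return text.replace(resolve_gap(gap, alphabet_gap), "")


print(strip_gaps("-ATA--TGAAAT-TTGAAAA", gap="-"))
print(strip_gaps("CGGGTAG=AAAAAA", alphabet_gap="="))
```

The sketch leaves out the alphabet adjustment: the real method also
returns an ungapped alphabet when the original alphabet was gapped.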

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 1004 of file Seq.py.

    def ungap(self, gap=None):
        """Return a copy of the sequence without the gap character(s)."""
        if hasattr(self.alphabet, "gap_char"):
            if not gap:
                gap = self.alphabet.gap_char
            elif gap != self.alphabet.gap_char:
                raise ValueError("Gap %s does not match %s from alphabet"
                                 % (repr(gap), repr(self.alphabet.gap_char)))
            alpha = Alphabet._ungap(self.alphabet)
        elif not gap:
            raise ValueError("Gap character not given and not defined in alphabet")
        else:
            alpha = self.alphabet # modify!
        if len(gap) != 1 or not isinstance(gap, str):
            raise ValueError("Unexpected gap character, %s" % repr(gap))
        return Seq(str(self).replace(gap, ""), alpha)
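
Reading the listing above, the final check appears to evaluate len(gap)
before isinstance(gap, str), so an argument that has no length (an
integer, say) would presumably fail with TypeError from len() instead of
the intended ValueError. A minimal sketch of a type-first check follows;
validate_gap is a hypothetical helper, not a Biopython function:

```python
def validate_gap(gap):
    """Return gap unchanged if it is a single-character string."""
    # Checking the type first means len() is only ever called on strings.
    if not isinstance(gap, str) or len(gap) != 1:
        # The one-element tuple keeps the formatting safe for tuple input.
        raise ValueError("Unexpected gap character, %r" % (gap,))
    return gap


print(validate_gap("-"))
```

With this ordering every invalid argument, whatever its type, is reported
through the same ValueError.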

def Bio.Seq.Seq.upper(self) [inherited]
Returns an upper case copy of the sequence.

>>> from Bio.Alphabet import HasStopCodon, generic_protein
>>> from Bio.Seq import Seq
>>> my_seq = Seq("VHLTPeeK*", HasStopCodon(generic_protein))
>>> my_seq
Seq('VHLTPeeK*', HasStopCodon(ProteinAlphabet(), '*'))
>>> my_seq.lower()
Seq('vhltpeek*', HasStopCodon(ProteinAlphabet(), '*'))
>>> my_seq.upper()
Seq('VHLTPEEK*', HasStopCodon(ProteinAlphabet(), '*'))

The returned sequence's alphabet is adjusted to match if required (it is
obtained from the alphabet's _upper method). See also the lower method.

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 684 of file Seq.py.

    def upper(self):
        """Returns an upper case copy of the sequence."""
        return Seq(str(self).upper(), self.alphabet._upper())
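
The one-line implementation above converts both the sequence string and
its alphabet. How the real alphabet classes implement _upper is not shown
on this page, so the sketch below only illustrates the general idea under
a stated assumption: the alphabet is modelled as a plain string of
permitted letters. ToyAlphabet and ToySeq are hypothetical classes:

```python
class ToyAlphabet(object):
    """Hypothetical alphabet: just a string of permitted letters."""

    def __init__(self, letters):
        self.letters = letters

    def _upper(self):
        # Reuse the same object when it is already upper case, otherwise
        # build an upper case variant.
        if self.letters == self.letters.upper():
            return self
        return ToyAlphabet(self.letters.upper())


class ToySeq(object):
    """Hypothetical sequence: a string plus its alphabet."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet

    def upper(self):
        # A new object is returned; the original is left untouched.
        return ToySeq(self.data.upper(), self.alphabet._upper())


lower_dna = ToyAlphabet("acgt")
seq = ToySeq("acGt", lower_dna)
upper_seq = seq.upper()
print(upper_seq.data, upper_seq.alphabet.letters)
```

Converting the alphabet together with the data keeps the two consistent:
an upper case sequence paired with a lower case letter set would no longer
be valid under its own alphabet.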


Member Data Documentation

Bio.Seq.Seq.alphabet [inherited]

Reimplemented in Bio.Seq.UnknownSeq.

Definition at line 101 of file Seq.py.


The documentation for this class was generated from the following file: