python-biopython  1.60
Bio.Seq.UnknownSeq Class Reference

Public Member Functions

def __init__
def __len__
def __str__
def __repr__
def __add__
def __radd__
def __getitem__
def count
def complement
def reverse_complement
def transcribe
def back_transcribe
def upper
def lower
def translate
def ungap
def data
def __hash__
def __cmp__
def tostring
def tomutable
def __contains__
def find
def rfind
def startswith
def endswith
def split
def rsplit
def strip
def lstrip
def rstrip
def translate

Public Attributes

 alphabet

Private Attributes

 _length
 _character

Detailed Description

A read-only sequence object of known length but unknown contents.

If you have an unknown sequence, you can represent this with a normal
Seq object, for example:

>>> my_seq = Seq("N"*5)
>>> my_seq
Seq('NNNNN', Alphabet())
>>> len(my_seq)
5
>>> print my_seq
NNNNN

However, this is rather wasteful of memory (especially for large
sequences), which is where this class is most useful:

>>> unk_five = UnknownSeq(5)
>>> unk_five
UnknownSeq(5, alphabet = Alphabet(), character = '?')
>>> len(unk_five)
5
>>> print(unk_five)
?????

You can add unknown sequences together, provided their alphabets and
characters are compatible, and get another memory-saving UnknownSeq:

>>> unk_four = UnknownSeq(4)
>>> unk_four
UnknownSeq(4, alphabet = Alphabet(), character = '?')
>>> unk_four + unk_five
UnknownSeq(9, alphabet = Alphabet(), character = '?')

If the alphabet or characters don't match up, the addition gives an
ordinary Seq object:

>>> unk_nnnn = UnknownSeq(4, character = "N")
>>> unk_nnnn
UnknownSeq(4, alphabet = Alphabet(), character = 'N')
>>> unk_nnnn + unk_four
Seq('NNNN????', Alphabet())

Combining with a real Seq gives a new Seq object:

>>> known_seq = Seq("ACGT")
>>> unk_four + known_seq
Seq('????ACGT', Alphabet())
>>> known_seq + unk_four
Seq('ACGT????', Alphabet())
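
The memory saving comes from storing only a length and one placeholder
character, and building the full string on demand. A minimal sketch of
that idea follows. It uses a hypothetical LengthOnlySeq class written for
Python 3, which is an illustration only and not part of Biopython:

```python
class LengthOnlySeq:
    """Hypothetical stand-in for UnknownSeq: a length plus one character."""

    def __init__(self, length, character="?"):
        length = int(length)
        if length < 0:
            raise ValueError("Length must not be negative.")
        if len(character) != 1:
            raise ValueError("character should be a single letter string.")
        self._length = length
        self._character = character

    def __len__(self):
        # No string is built to answer len().
        return self._length

    def __str__(self):
        # The full string only exists while the caller holds on to it.
        return self._character * self._length


compact = LengthOnlySeq(1000000, "N")
print(len(compact))
print(str(LengthOnlySeq(5)))
```

The object itself holds two small attributes however long the sequence
is; only str() pays the cost of materialising every letter.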

Definition at line 1087 of file Seq.py.


Constructor & Destructor Documentation

def Bio.Seq.UnknownSeq.__init__ (   self,
  length,
  alphabet = Alphabet.generic_alphabet,
  character = None 
)
Create a new UnknownSeq object.

If character is omitted, it is determined from the alphabet: "N" for
nucleotides, "X" for proteins, and "?" otherwise.

Definition at line 1138 of file Seq.py.

01138 
01139     def __init__(self, length, alphabet = Alphabet.generic_alphabet, character = None):
01140         """Create a new UnknownSeq object.
01141 
01142         If character is omitted, it is determined from the alphabet, "N" for
01143         nucleotides, "X" for proteins, and "?" otherwise.
01144         """
01145         self._length = int(length)
01146         if self._length < 0:
01147             #TODO - Block zero length UnknownSeq?  You can just use a Seq!
01148             raise ValueError("Length must not be negative.")
01149         self.alphabet = alphabet
01150         if character:
01151             if len(character) != 1:
01152                 raise ValueError("character argument should be a single letter string.")
01153             self._character = character
01154         else:
01155             base = Alphabet._get_base_alphabet(alphabet)
01156             #TODO? Check the case of the letters in the alphabet?
01157             #We may have to use "n" instead of "N" etc.
01158             if isinstance(base, Alphabet.NucleotideAlphabet):
01159                 self._character = "N"
01160             elif isinstance(base, Alphabet.ProteinAlphabet):
01161                 self._character = "X"
01162             else:
01163                 self._character = "?"


Member Function Documentation

def Bio.Seq.UnknownSeq.__add__ (   self,
  other 
)
Add another sequence or string to this sequence.

Adding two UnknownSeq objects returns another UnknownSeq object
provided the character is the same and the alphabets are compatible.

>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import generic_protein
>>> UnknownSeq(10, generic_protein) + UnknownSeq(5, generic_protein)
UnknownSeq(15, alphabet = ProteinAlphabet(), character = 'X')

If the characters differ, an UnknownSeq object cannot be used, so a
Seq object is returned:

>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import generic_protein
>>> UnknownSeq(10, generic_protein) + UnknownSeq(5, generic_protein,
...                                              character="x")
Seq('XXXXXXXXXXxxxxx', ProteinAlphabet())

If adding a string to an UnknownSeq, a new Seq is returned with the
same alphabet:

>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import generic_protein
>>> UnknownSeq(5, generic_protein) + "LV"
Seq('XXXXXLV', ProteinAlphabet())

Reimplemented from Bio.Seq.Seq.

Definition at line 1176 of file Seq.py.

01176 
01177     def __add__(self, other):
01178         """Add another sequence or string to this sequence.
01179 
01180         Adding two UnknownSeq objects returns another UnknownSeq object
01181         provided the character is the same and the alphabets are compatible.
01182 
01183         >>> from Bio.Seq import UnknownSeq
01184         >>> from Bio.Alphabet import generic_protein
01185         >>> UnknownSeq(10, generic_protein) + UnknownSeq(5, generic_protein)
01186         UnknownSeq(15, alphabet = ProteinAlphabet(), character = 'X')
01187 
01188         If the characters differ, an UnknownSeq object cannot be used, so a
01189         Seq object is returned:
01190 
01191         >>> from Bio.Seq import UnknownSeq
01192         >>> from Bio.Alphabet import generic_protein
01193         >>> UnknownSeq(10, generic_protein) + UnknownSeq(5, generic_protein,
01194         ...                                              character="x")
01195         Seq('XXXXXXXXXXxxxxx', ProteinAlphabet())
01196 
01197         If adding a string to an UnknownSeq, a new Seq is returned with the
01198         same alphabet:
01199         
01200         >>> from Bio.Seq import UnknownSeq
01201         >>> from Bio.Alphabet import generic_protein
01202         >>> UnknownSeq(5, generic_protein) + "LV"
01203         Seq('XXXXXLV', ProteinAlphabet())
01204         """
01205         if isinstance(other, UnknownSeq) \
01206         and other._character == self._character:
01207             #TODO - Check the alphabets match
01208             return UnknownSeq(len(self)+len(other),
01209                               self.alphabet, self._character)
01210         #Offload to the base class...
01211         return Seq(str(self), self.alphabet) + other
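
The listing above compares only the placeholder characters (the alphabet
check is still marked TODO). A rough sketch of that decision, using plain
(length, character) tuples as a hypothetical stand-in for UnknownSeq
objects so that it runs without Biopython:

```python
def add_unknown(left, right):
    """Combine two (length, character) pairs.

    Hypothetical helper, not a Biopython function. Returns another
    (length, character) pair when the characters match, otherwise the
    fully expanded string, mirroring the fallback to an ordinary Seq.
    """
    left_len, left_char = left
    right_len, right_char = right
    if left_char == right_char:
        # Still describable by a length and one character.
        return (left_len + right_len, left_char)
    # Mixed characters need every letter spelled out.
    return left_char * left_len + right_char * right_len


print(add_unknown((10, "X"), (5, "X")))
print(add_unknown((10, "X"), (5, "x")))
```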

def Bio.Seq.Seq.__cmp__ (   self,
  other 
) [inherited]
Compare the sequence to another sequence or a string (README).

Historically comparing Seq objects has done Python object comparison.
After considerable discussion (keeping in mind constraints of the
Python language, hashes and dictionary support) a future release of
Biopython will change this to use simple string comparison. The plan is
that comparing incompatible alphabets (e.g. DNA to RNA) will trigger a
warning.

This version of Biopython still does Python object comparison, but with
a warning about this future change. During this transition period,
please just do explicit comparisons:

>>> seq1 = Seq("ACGT")
>>> seq2 = Seq("ACGT")
>>> id(seq1) == id(seq2)
False
>>> str(seq1) == str(seq2)
True

Note - This method indirectly supports ==, < , etc.

Definition at line 166 of file Seq.py.

00166 
00167     def __cmp__(self, other):
00168         """Compare the sequence to another sequence or a string (README).
00169 
00170         Historically comparing Seq objects has done Python object comparison.
00171         After considerable discussion (keeping in mind constraints of the
00172         Python language, hashes and dictionary support) a future release of
00173         Biopython will change this to use simple string comparison. The plan is
00174         that comparing incompatible alphabets (e.g. DNA to RNA) will trigger a
00175         warning.
00176 
00177         This version of Biopython still does Python object comparison, but with
00178         a warning about this future change. During this transition period,
00179         please just do explicit comparisons:
00180 
00181         >>> seq1 = Seq("ACGT")
00182         >>> seq2 = Seq("ACGT")
00183         >>> id(seq1) == id(seq2)
00184         False
00185         >>> str(seq1) == str(seq2)
00186         True
00187 
00188         Note - This method indirectly supports ==, < , etc.
00189         """
00190         if hasattr(other, "alphabet"):
00191             #other should be a Seq or a MutableSeq
00192             import warnings
00193             warnings.warn("In future comparing Seq objects will use string "
00194                           "comparison (not object comparison). Incompatible "
00195                           "alphabets will trigger a warning (not an exception). "
00196                           "In the interim please use id(seq1)==id(seq2) or "
00197                           "str(seq1)==str(seq2) to make your code explicit "
00198                           "and to avoid this warning.", FutureWarning)
00199         return cmp(id(self), id(other))

def Bio.Seq.Seq.__contains__ (   self,
  char 
) [inherited]
Implements the 'in' keyword, like a python string.

e.g.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_rna, generic_protein
>>> my_dna = Seq("ATATGAAATTTGAAAA", generic_dna)
>>> "AAA" in my_dna
True
>>> Seq("AAA") in my_dna
True
>>> Seq("AAA", generic_dna) in my_dna
True

Like other Seq methods, this will raise a type error if another Seq
(or Seq like) object with an incompatible alphabet is used:

>>> Seq("AAA", generic_rna) in my_dna
Traceback (most recent call last):
   ...
TypeError: Incompatable alphabets DNAAlphabet() and RNAAlphabet()
>>> Seq("AAA", generic_protein) in my_dna
Traceback (most recent call last):
   ...
TypeError: Incompatable alphabets DNAAlphabet() and ProteinAlphabet()

Definition at line 406 of file Seq.py.

00406 
00407     def __contains__(self, char):
00408         """Implements the 'in' keyword, like a python string.
00409 
00410         e.g.
00411 
00412         >>> from Bio.Seq import Seq
00413         >>> from Bio.Alphabet import generic_dna, generic_rna, generic_protein
00414         >>> my_dna = Seq("ATATGAAATTTGAAAA", generic_dna)
00415         >>> "AAA" in my_dna
00416         True
00417         >>> Seq("AAA") in my_dna
00418         True
00419         >>> Seq("AAA", generic_dna) in my_dna
00420         True
00421 
00422         Like other Seq methods, this will raise a type error if another Seq
00423         (or Seq like) object with an incompatible alphabet is used:
00424 
00425         >>> Seq("AAA", generic_rna) in my_dna
00426         Traceback (most recent call last):
00427            ...
00428         TypeError: Incompatable alphabets DNAAlphabet() and RNAAlphabet()
00429         >>> Seq("AAA", generic_protein) in my_dna
00430         Traceback (most recent call last):
00431            ...
00432         TypeError: Incompatable alphabets DNAAlphabet() and ProteinAlphabet()
00433         """
00434         #If it has one, check the alphabet:
00435         sub_str = self._get_seq_str_and_check_alphabet(char)
00436         return sub_str in str(self)

def Bio.Seq.UnknownSeq.__getitem__ (   self,
  index 
)
Get a subsequence from the UnknownSeq object.

>>> unk = UnknownSeq(8, character="N")
>>> print unk[:]
NNNNNNNN
>>> print unk[5:3]
<BLANKLINE>
>>> print unk[1:-1]
NNNNNN
>>> print unk[1:-1:2]
NNN

Reimplemented from Bio.Seq.Seq.

Definition at line 1217 of file Seq.py.

01217 
01218     def __getitem__(self, index):
01219         """Get a subsequence from the UnknownSeq object.
01220         
01221         >>> unk = UnknownSeq(8, character="N")
01222         >>> print unk[:]
01223         NNNNNNNN
01224         >>> print unk[5:3]
01225         <BLANKLINE>
01226         >>> print unk[1:-1]
01227         NNNNNN
01228         >>> print unk[1:-1:2]
01229         NNN
01230         """
01231         if isinstance(index, int):
01232             #TODO - Check the bounds without wasting memory
01233             return str(self)[index]
01234         old_length = self._length
01235         step = index.step
01236         if step is None or step == 1:
01237             #This calculates the length you'd get from ("N"*old_length)[index]
01238             start = index.start
01239             end = index.stop
01240             if start is None:
01241                 start = 0
01242             elif start < 0:
01243                 start = max(0, old_length + start)
01244             elif start > old_length:
01245                 start = old_length
01246             if end is None:
01247                 end = old_length
01248             elif end < 0:
01249                 end = max(0, old_length + end)
01250             elif end > old_length:
01251                 end = old_length
01252             new_length = max(0, end-start)
01253         elif step == 0:
01254             raise ValueError("slice step cannot be zero")
01255         else:
01256             #TODO - handle step efficiently
01257             new_length = len(("X"*old_length)[index])
01258         #assert new_length == len(("X"*old_length)[index]), \
01259         #       (index, start, end, step, old_length,
01260         #        new_length, len(("X"*old_length)[index]))
01261         return UnknownSeq(new_length, self.alphabet, self._character)
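
The stepped branch above still builds a temporary string to measure the
result. One possible alternative, sketched here on the assumption of
Python 3 (where len() of a range is computed arithmetically), is to let
slice.indices() do the clipping. The helper name sliced_length is
hypothetical:

```python
def sliced_length(old_length, index):
    """Length of ("N" * old_length)[index] without building the string."""
    # slice.indices() fills in defaults and clips start/stop to the
    # sequence; it raises ValueError for a zero step, as slicing does.
    start, stop, step = index.indices(old_length)
    return len(range(start, stop, step))


# Compare against real string slicing for the slices in the doctest above.
for index in (slice(None), slice(5, 3), slice(1, -1), slice(1, -1, 2)):
    assert sliced_length(8, index) == len(("N" * 8)[index])
```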

def Bio.Seq.Seq.__hash__ (   self) [inherited]
Hash for comparison.

See the __cmp__ documentation - we plan to change this!

Definition at line 159 of file Seq.py.

00159 
00160     def __hash__(self):
00161         """Hash for comparison.
00162 
00163         See the __cmp__ documentation - we plan to change this!
00164         """
00165         return id(self) #Currently use object identity for equality testing
    
def Bio.Seq.UnknownSeq.__len__ (   self)
Returns the stated length of the unknown sequence.

Reimplemented from Bio.Seq.Seq.

Definition at line 1164 of file Seq.py.

01164 
01165     def __len__(self):
01166         """Returns the stated length of the unknown sequence."""
01167         return self._length
    
def Bio.Seq.UnknownSeq.__radd__ (   self,
  other 
)
Adding a sequence on the left.

If adding a string to a Seq, the alphabet is preserved:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> "LV" + Seq("MELKI", generic_protein)
Seq('LVMELKI', ProteinAlphabet())

Adding two Seq (like) objects is handled via the __add__ method.

Reimplemented from Bio.Seq.Seq.

Definition at line 1212 of file Seq.py.

01212 
01213     def __radd__(self, other):
01214         #If other is an UnknownSeq, then __add__ would be called.
01215         #Offload to the base class...
01216         return other + Seq(str(self), self.alphabet)

def Bio.Seq.UnknownSeq.__repr__ (   self)
Returns a representation of the sequence for debugging, giving its
length, alphabet and character.

Reimplemented from Bio.Seq.Seq.

Definition at line 1172 of file Seq.py.

01172 
01173     def __repr__(self):
01174         return "UnknownSeq(%i, alphabet = %s, character = %s)" \
01175                % (self._length, repr(self.alphabet), repr(self._character))

def Bio.Seq.UnknownSeq.__str__ (   self)
Returns the unknown sequence as a full string of the given length.

Reimplemented from Bio.Seq.Seq.

Definition at line 1168 of file Seq.py.

01168 
01169     def __str__(self):
01170         """Returns the unknown sequence as full string of the given length."""
01171         return self._character * self._length

def Bio.Seq.UnknownSeq.back_transcribe (   self)
Returns an unknown DNA sequence from an unknown RNA sequence.

>>> my_rna = UnknownSeq(20, character="N")
>>> my_rna
UnknownSeq(20, alphabet = Alphabet(), character = 'N')
>>> print my_rna
NNNNNNNNNNNNNNNNNNNN
>>> my_dna = my_rna.back_transcribe()
>>> my_dna
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
>>> print my_dna
NNNNNNNNNNNNNNNNNNNN

Reimplemented from Bio.Seq.Seq.

Definition at line 1372 of file Seq.py.

01372 
01373     def back_transcribe(self):
01374         """Returns unknown DNA sequence from an unknown RNA sequence.
01375 
01376         >>> my_rna = UnknownSeq(20, character="N")
01377         >>> my_rna
01378         UnknownSeq(20, alphabet = Alphabet(), character = 'N')
01379         >>> print my_rna
01380         NNNNNNNNNNNNNNNNNNNN
01381         >>> my_dna = my_rna.back_transcribe()
01382         >>> my_dna
01383         UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
01384         >>> print my_dna
01385         NNNNNNNNNNNNNNNNNNNN
01386         """
01387         #Offload the alphabet stuff
01388         s = Seq(self._character, self.alphabet).back_transcribe()
01389         return UnknownSeq(self._length, s.alphabet, self._character)

def Bio.Seq.UnknownSeq.complement (   self)
The complement of an unknown nucleotide equals itself.

>>> my_nuc = UnknownSeq(8)
>>> my_nuc
UnknownSeq(8, alphabet = Alphabet(), character = '?')
>>> print my_nuc
????????
>>> my_nuc.complement()
UnknownSeq(8, alphabet = Alphabet(), character = '?')
>>> print my_nuc.complement()
????????

Reimplemented from Bio.Seq.Seq.

Definition at line 1318 of file Seq.py.

01318 
01319     def complement(self):
01320         """The complement of an unknown nucleotide equals itself.
01321 
01322         >>> my_nuc = UnknownSeq(8)
01323         >>> my_nuc
01324         UnknownSeq(8, alphabet = Alphabet(), character = '?')
01325         >>> print my_nuc
01326         ????????
01327         >>> my_nuc.complement()
01328         UnknownSeq(8, alphabet = Alphabet(), character = '?')
01329         >>> print my_nuc.complement()
01330         ????????
01331         """
01332         if isinstance(Alphabet._get_base_alphabet(self.alphabet),
01333                       Alphabet.ProteinAlphabet):
01334             raise ValueError("Proteins do not have complements!")
01335         return self

def Bio.Seq.UnknownSeq.count (   self,
  sub,
  start = 0,
  end = sys.maxint 
)
Non-overlapping count method, like that of a python string.

This behaves like the python string (and Seq object) method of the
same name, which does a non-overlapping count!

Returns an integer, the number of occurrences of substring
argument sub in the (sub)sequence given by [start:end].
Optional arguments start and end are interpreted as in slice
notation.
    
Arguments:
 - sub - a string or another Seq object to look for
 - start - optional integer, slice start
 - end - optional integer, slice end

>>> "NNNN".count("N")
4
>>> Seq("NNNN").count("N")
4
>>> UnknownSeq(4, character="N").count("N")
4
>>> UnknownSeq(4, character="N").count("A")
0
>>> UnknownSeq(4, character="N").count("AA")
0

HOWEVER, please note that because Python strings and Seq objects (and
MutableSeq objects) do a non-overlapping search, this may not give
the answer you expect:

>>> UnknownSeq(4, character="N").count("NN")
2
>>> UnknownSeq(4, character="N").count("NNN")
1

Reimplemented from Bio.Seq.Seq.

Definition at line 1262 of file Seq.py.

01262 
01263     def count(self, sub, start=0, end=sys.maxint):
01264         """Non-overlapping count method, like that of a python string.
01265 
01266         This behaves like the python string (and Seq object) method of the
01267         same name, which does a non-overlapping count!
01268 
01269         Returns an integer, the number of occurrences of substring
01270         argument sub in the (sub)sequence given by [start:end].
01271         Optional arguments start and end are interpreted as in slice
01272         notation.
01273     
01274         Arguments:
01275          - sub - a string or another Seq object to look for
01276          - start - optional integer, slice start
01277          - end - optional integer, slice end
01278 
01279         >>> "NNNN".count("N")
01280         4
01281         >>> Seq("NNNN").count("N")
01282         4
01283         >>> UnknownSeq(4, character="N").count("N")
01284         4
01285         >>> UnknownSeq(4, character="N").count("A")
01286         0
01287         >>> UnknownSeq(4, character="N").count("AA")
01288         0
01289 
01290         HOWEVER, please note that because Python strings and Seq objects (and
01291         MutableSeq objects) do a non-overlapping search, this may not give
01292         the answer you expect:
01293 
01294         >>> UnknownSeq(4, character="N").count("NN")
01295         2
01296         >>> UnknownSeq(4, character="N").count("NNN")
01297         1
01298         """
01299         sub_str = self._get_seq_str_and_check_alphabet(sub)
01300         if len(sub_str) == 1:
01301             if str(sub_str) == self._character:
01302                 if start==0 and end >= self._length:
01303                     return self._length
01304                 else:
01305                     #This could be done more cleverly...
01306                     return str(self).count(sub_str, start, end)
01307             else:
01308                 return 0
01309         else:
01310             if set(sub_str) == set(self._character):
01311                 if start==0 and end >= self._length:
01312                     return self._length // len(sub_str)
01313                 else:
01314                     #This could be done more cleverly...
01315                     return str(self).count(sub_str, start, end)
01316             else:
01317                 return 0
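
The two branches marked "This could be done more cleverly..." expand the
whole sequence just to count within a slice. For a sequence made of one
repeated character the non-overlapping count has a closed form: the
clipped slice length divided by the length of sub. A sketch under that
assumption, with a hypothetical helper count_repeated (Python 3, end
defaulting to None rather than sys.maxint):

```python
def count_repeated(length, character, sub, start=0, end=None):
    """Non-overlapping count of sub in (character * length)[start:end]."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    if set(sub) != {character}:
        # sub contains some other letter, so it can never match.
        return 0
    # Clip start/end exactly as slice notation would.
    lo, hi, _ = slice(start, end).indices(length)
    return max(0, hi - lo) // len(sub)


print(count_repeated(4, "N", "NN"))
print(count_repeated(4, "N", "NNN"))
```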

def Bio.Seq.Seq.data (   self) [inherited]
Sequence as a string (DEPRECATED).

This is a read only property provided for backwards compatibility with
older versions of Biopython (as is the tostring() method). We now
encourage you to use str(my_seq) instead of my_seq.data or the method
my_seq.tostring().

In recent releases of Biopython it was possible to change a Seq object
by updating its data property, but this triggered a deprecation warning.
Now the data property is read only, since Seq objects are meant to be
immutable:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_seq = Seq("ACGT", generic_dna)
>>> str(my_seq) == my_seq.tostring() == "ACGT"
True
>>> my_seq.data = "AAAA"
Traceback (most recent call last):
   ...
AttributeError: can't set attribute

Definition at line 106 of file Seq.py.

00106 
00107     def data(self) :
00108         """Sequence as a string (DEPRECATED).
00109 
00110         This is a read only property provided for backwards compatibility with
00111         older versions of Biopython (as is the tostring() method). We now
00112         encourage you to use str(my_seq) instead of my_seq.data or the method
00113         my_seq.tostring().
00114 
00115         In recent releases of Biopython it was possible to change a Seq object
00116         by updating its data property, but this triggered a deprecation warning.
00117         Now the data property is read only, since Seq objects are meant to be
00118         immutable:
00119 
00120         >>> from Bio.Seq import Seq
00121         >>> from Bio.Alphabet import generic_dna
00122         >>> my_seq = Seq("ACGT", generic_dna)
00123         >>> str(my_seq) == my_seq.tostring() == "ACGT"
00124         True
00125         >>> my_seq.data = "AAAA"
00126         Traceback (most recent call last):
00127            ...
00128         AttributeError: can't set attribute
00129         """
00130         import warnings
00131         import Bio
00132         warnings.warn("Accessing the .data attribute is deprecated. Please "
00133                       "use str(my_seq) or my_seq.tostring() instead of "
00134                       "my_seq.data.", Bio.BiopythonDeprecationWarning)
00135         return str(self)

def Bio.Seq.Seq.endswith (   self,
  suffix,
  start = 0,
  end = sys.maxint 
) [inherited]
Does the Seq end with the given suffix?  Returns True/False.

This behaves like the python string method of the same name.

Return True if the sequence ends with the specified suffix
(a string or another Seq object), False otherwise.
With optional start, test sequence beginning at that position.
With optional end, stop comparing sequence at that position.
suffix can also be a tuple of strings to try.  e.g.

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_rna.endswith("UUG")
True
>>> my_rna.endswith("AUG")
False
>>> my_rna.endswith("AUG", 0, 18)
True
>>> my_rna.endswith(("UCC","UCA","UUG"))
True

Definition at line 526 of file Seq.py.

00526 
00527     def endswith(self, suffix, start=0, end=sys.maxint):
00528         """Does the Seq end with the given suffix?  Returns True/False.
00529 
00530         This behaves like the python string method of the same name.
00531 
00532         Return True if the sequence ends with the specified suffix
00533         (a string or another Seq object), False otherwise.
00534         With optional start, test sequence beginning at that position.
00535         With optional end, stop comparing sequence at that position.
00536         suffix can also be a tuple of strings to try.  e.g.
00537 
00538         >>> from Bio.Seq import Seq
00539         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00540         >>> my_rna.endswith("UUG")
00541         True
00542         >>> my_rna.endswith("AUG")
00543         False
00544         >>> my_rna.endswith("AUG", 0, 18)
00545         True
00546         >>> my_rna.endswith(("UCC","UCA","UUG"))
00547         True
00548         """        
00549         #If it has one, check the alphabet:
00550         if isinstance(suffix, tuple):
00551             #TODO - Once we drop support for Python 2.4, instead of this
00552             #loop offload to the string method (requires Python 2.5+).
00553             #Check all the alphabets first...
00554             suffix_strings = [self._get_seq_str_and_check_alphabet(p) \
00555                               for p in suffix]
00556             for suffix_str in suffix_strings:
00557                 if str(self).endswith(suffix_str, start, end):
00558                     return True
00559             return False
00560         else:
00561             suffix_str = self._get_seq_str_and_check_alphabet(suffix)
00562             return str(self).endswith(suffix_str, start, end)
00563 
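
As the TODO in the listing notes, Python's own str.endswith accepts a
tuple of suffixes (from Python 2.5), so the loop can be handed to the
string method. A small sketch on plain strings; endswith_any is a
hypothetical helper and the alphabet check is left out:

```python
def endswith_any(text, suffixes, start=0, end=None):
    """True if text[start:end] ends with any of the given suffixes."""
    if end is None:
        end = len(text)
    # str.endswith tries each member of the tuple in turn.
    return text.endswith(tuple(suffixes), start, end)


my_rna = "GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG"
print(endswith_any(my_rna, ("UCC", "UCA", "UUG")))
print(endswith_any(my_rna, ("AUG",), 0, 18))
```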

def Bio.Seq.Seq.find (   self,
  sub,
  start = 0,
  end = sys.maxint 
) [inherited]
Find method, like that of a python string.

This behaves like the python string method of the same name.

Returns an integer, the index of the first occurrence of substring
argument sub in the (sub)sequence given by [start:end].

Arguments:
 - sub - a string or another Seq object to look for
 - start - optional integer, slice start
 - end - optional integer, slice end

Returns -1 if the subsequence is NOT found.

e.g. Locating the first typical start codon, AUG, in an RNA sequence:

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_rna.find("AUG")
3

Definition at line 437 of file Seq.py.

00437 
00438     def find(self, sub, start=0, end=sys.maxint):
00439         """Find method, like that of a python string.
00440 
00441         This behaves like the python string method of the same name.
00442 
00443         Returns an integer, the index of the first occurrence of substring
00444         argument sub in the (sub)sequence given by [start:end].
00445 
00446         Arguments:
00447          - sub - a string or another Seq object to look for
00448          - start - optional integer, slice start
00449          - end - optional integer, slice end
00450 
00451         Returns -1 if the subsequence is NOT found.
00452         
00453         e.g. Locating the first typical start codon, AUG, in an RNA sequence:
00454 
00455         >>> from Bio.Seq import Seq
00456         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00457         >>> my_rna.find("AUG")
00458         3
00459         """
00460         #If it has one, check the alphabet:
00461         sub_str = self._get_seq_str_and_check_alphabet(sub)
00462         return str(self).find(sub_str, start, end)
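
Because find takes a start argument, it can be called repeatedly to
locate every occurrence rather than only the first. A sketch on a plain
Python string (the listing above shows that Seq.find passes straight
through to str.find); find_all is a hypothetical helper, not a Seq
method:

```python
def find_all(text, sub, start=0):
    """Yield each index where sub occurs in text, overlaps included."""
    position = text.find(sub, start)
    while position != -1:
        yield position
        # Resume one letter on, so overlapping hits are reported too.
        position = text.find(sub, position + 1)


my_rna = "GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG"
print(list(find_all(my_rna, "AUG")))
```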

def Bio.Seq.UnknownSeq.lower (   self)
Returns a lower case copy of the sequence.

This will adjust the alphabet if required:

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import UnknownSeq
>>> my_seq = UnknownSeq(20, IUPAC.extended_protein)
>>> my_seq
UnknownSeq(20, alphabet = ExtendedIUPACProtein(), character = 'X')
>>> print my_seq
XXXXXXXXXXXXXXXXXXXX
>>> my_seq.lower()
UnknownSeq(20, alphabet = ProteinAlphabet(), character = 'x')
>>> print my_seq.lower()
xxxxxxxxxxxxxxxxxxxx

See also the upper method.

Reimplemented from Bio.Seq.Seq.

Definition at line 1409 of file Seq.py.

01409 
01410     def lower(self):
01411         """Returns a lower case copy of the sequence.
01412 
01413         This will adjust the alphabet if required:
01414 
01415         >>> from Bio.Alphabet import IUPAC
01416         >>> from Bio.Seq import UnknownSeq
01417         >>> my_seq = UnknownSeq(20, IUPAC.extended_protein)
01418         >>> my_seq
01419         UnknownSeq(20, alphabet = ExtendedIUPACProtein(), character = 'X')
01420         >>> print my_seq
01421         XXXXXXXXXXXXXXXXXXXX
01422         >>> my_seq.lower()
01423         UnknownSeq(20, alphabet = ProteinAlphabet(), character = 'x')
01424         >>> print my_seq.lower()
01425         xxxxxxxxxxxxxxxxxxxx
01426 
01427         See also the upper method.
01428         """
01429         return UnknownSeq(self._length, self.alphabet._lower(), self._character.lower())

def Bio.Seq.Seq.lstrip (   self,
  chars = None 
) [inherited]
Returns a new Seq object with leading (left) end stripped.

This behaves like the python string method of the same name.

Optional argument chars defines which characters to remove.  If
omitted or None (default) then as for the python string method,
this defaults to removing any white space.

e.g. print my_seq.lstrip("-")

See also the strip and rstrip methods.

Definition at line 642 of file Seq.py.

00642 
00643     def lstrip(self, chars=None):
00644         """Returns a new Seq object with leading (left) end stripped.
00645 
00646         This behaves like the python string method of the same name.
00647 
00648         Optional argument chars defines which characters to remove.  If
00649         omitted or None (default) then as for the python string method,
00650         this defaults to removing any white space.
00651         
00652         e.g. print my_seq.lstrip("-")
00653 
00654         See also the strip and rstrip methods.
00655         """
00656         #If it has one, check the alphabet:
00657         strip_str = self._get_seq_str_and_check_alphabet(chars)
00658         return Seq(str(self).lstrip(strip_str), self.alphabet)

def Bio.Seq.UnknownSeq.reverse_complement (   self)
The reverse complement of an unknown nucleotide equals itself.

>>> my_nuc = UnknownSeq(10)
>>> my_nuc
UnknownSeq(10, alphabet = Alphabet(), character = '?')
>>> print my_nuc
??????????
>>> my_nuc.reverse_complement()
UnknownSeq(10, alphabet = Alphabet(), character = '?')
>>> print my_nuc.reverse_complement()
??????????

Reimplemented from Bio.Seq.Seq.

Definition at line 1336 of file Seq.py.

01336 
01337     def reverse_complement(self):
01338         """The reverse complement of an unknown nucleotide equals itself.
01339 
01340         >>> my_nuc = UnknownSeq(10)
01341         >>> my_nuc
01342         UnknownSeq(10, alphabet = Alphabet(), character = '?')
01343         >>> print my_nuc
01344         ??????????
01345         >>> my_nuc.reverse_complement()
01346         UnknownSeq(10, alphabet = Alphabet(), character = '?')
01347         >>> print my_nuc.reverse_complement()
01348         ??????????
01349         """
01350         if isinstance(Alphabet._get_base_alphabet(self.alphabet),
01351                       Alphabet.ProteinAlphabet):
01352             raise ValueError("Proteins do not have complements!")
01353         return self

def Bio.Seq.Seq.rfind (   self,
  sub,
  start = 0,
  end = sys.maxint 
) [inherited]
Find from right method, like that of a python string.

This behaves like the python string method of the same name.

Returns an integer, the index of the last (right most) occurrence of
substring argument sub in the (sub)sequence given by [start:end].

Arguments:
 - sub - a string or another Seq object to look for
 - start - optional integer, slice start
 - end - optional integer, slice end

Returns -1 if the subsequence is NOT found.

e.g. Locating the last typical start codon, AUG, in an RNA sequence:

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_rna.rfind("AUG")
15

Definition at line 463 of file Seq.py.

00463 
00464     def rfind(self, sub, start=0, end=sys.maxint):
00465         """Find from right method, like that of a python string.
00466 
00467         This behaves like the python string method of the same name.
00468 
00469         Returns an integer, the index of the last (right most) occurrence of
00470         substring argument sub in the (sub)sequence given by [start:end].
00471 
00472         Arguments:
00473          - sub - a string or another Seq object to look for
00474          - start - optional integer, slice start
00475          - end - optional integer, slice end
00476 
00477         Returns -1 if the subsequence is NOT found.
00478 
00479         e.g. Locating the last typical start codon, AUG, in an RNA sequence:
00480 
00481         >>> from Bio.Seq import Seq
00482         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00483         >>> my_rna.rfind("AUG")
00484         15
00485         """
00486         #If it has one, check the alphabet:
00487         sub_str = self._get_seq_str_and_check_alphabet(sub)
00488         return str(self).rfind(sub_str, start, end)
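
The start and end arguments follow slice semantics, so a match must fit entirely inside [start:end]. A short sketch with a plain Python string standing in for the Seq object (the same RNA sequence as in the docstring):

```python
# Plain str as a stand-in for the Seq used in the docstring above.
rna = "GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG"

# Unrestricted search finds the right-most AUG.
last_aug = rna.rfind("AUG")

# With end=17 the AUG starting at index 15 no longer fits inside the
# slice, so the earlier occurrence at index 3 is reported instead.
clipped = rna.rfind("AUG", 0, 17)

# Starting the search past the last AUG gives -1 (not found).
missing = rna.rfind("AUG", 16)

print(last_aug)  # 15
print(clipped)   # 3
print(missing)   # -1
```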

def Bio.Seq.Seq.rsplit (   self,
  sep = None,
  maxsplit = -1 
) [inherited]
Right split method, like that of a python string.

This behaves like the python string method of the same name.

Return a list of the 'words' in the string (as Seq objects),
using sep as the delimiter string.  If maxsplit is given, at
most maxsplit splits are done COUNTING FROM THE RIGHT.
If maxsplit is omitted, all splits are made.

Following the python string method, sep will by default be any
white space (tabs, spaces, newlines) but this is unlikely to
apply to biological sequences.

e.g. print my_seq.rsplit("*",1)

See also the split method.

Definition at line 602 of file Seq.py.

00602 
00603     def rsplit(self, sep=None, maxsplit=-1):
00604         """Right split method, like that of a python string.
00605 
00606         This behaves like the python string method of the same name.
00607 
00608         Return a list of the 'words' in the string (as Seq objects),
00609         using sep as the delimiter string.  If maxsplit is given, at
00610         most maxsplit splits are done COUNTING FROM THE RIGHT.
00611         If maxsplit is omitted, all splits are made.
00612 
00613         Following the python string method, sep will by default be any
00614         white space (tabs, spaces, newlines) but this is unlikely to
00615         apply to biological sequences.
00616         
00617         e.g. print my_seq.rsplit("*",1)
00618 
00619         See also the split method.
00620         """
00621         #If it has one, check the alphabet:
00622         sep_str = self._get_seq_str_and_check_alphabet(sep)
00623         return [Seq(part, self.alphabet) \
00624                 for part in str(self).rsplit(sep_str, maxsplit)]
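
The effect of counting splits from the right only shows up when maxsplit is given. A minimal sketch using a plain Python string in place of a Seq (the protein string is the translation shown in the split documentation):

```python
# Plain str as a stand-in for a translated Seq with two stop symbols.
protein = "VMAIVMGR*KGAR*L"

# One split counted from the left keeps the remainder on the right.
from_left = protein.split("*", 1)

# One split counted from the right keeps the remainder on the left.
from_right = protein.rsplit("*", 1)

# Without maxsplit, split and rsplit give the same parts.
all_parts = protein.rsplit("*")

print(from_left)   # ['VMAIVMGR', 'KGAR*L']
print(from_right)  # ['VMAIVMGR*KGAR', 'L']
print(all_parts)   # ['VMAIVMGR', 'KGAR', 'L']
```

The Seq method would return each part as a Seq object with the parent's alphabet rather than as a string.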

def Bio.Seq.Seq.rstrip (   self,
  chars = None 
) [inherited]
Returns a new Seq object with trailing (right) end stripped.

This behaves like the python string method of the same name.

Optional argument chars defines which characters to remove.  If
omitted or None (default) then as for the python string method,
this defaults to removing any white space.

e.g. Removing a nucleotide sequence's polyadenylation (poly-A tail):

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> my_seq = Seq("CGGTACGCTTATGTCACGTAGAAAAAA", IUPAC.unambiguous_dna)
>>> my_seq
Seq('CGGTACGCTTATGTCACGTAGAAAAAA', IUPACUnambiguousDNA())
>>> my_seq.rstrip("A")
Seq('CGGTACGCTTATGTCACGTAG', IUPACUnambiguousDNA())

See also the strip and lstrip methods.

Definition at line 659 of file Seq.py.

00659 
00660     def rstrip(self, chars=None):
00661         """Returns a new Seq object with trailing (right) end stripped.
00662 
00663         This behaves like the python string method of the same name.
00664 
00665         Optional argument chars defines which characters to remove.  If
00666         omitted or None (default) then as for the python string method,
00667         this defaults to removing any white space.
00668         
00669         e.g. Removing a nucleotide sequence's polyadenylation (poly-A tail):
00670 
00671         >>> from Bio.Alphabet import IUPAC
00672         >>> from Bio.Seq import Seq
00673         >>> my_seq = Seq("CGGTACGCTTATGTCACGTAGAAAAAA", IUPAC.unambiguous_dna)
00674         >>> my_seq
00675         Seq('CGGTACGCTTATGTCACGTAGAAAAAA', IUPACUnambiguousDNA())
00676         >>> my_seq.rstrip("A")
00677         Seq('CGGTACGCTTATGTCACGTAG', IUPACUnambiguousDNA())
00678 
00679         See also the strip and lstrip methods.
00680         """
00681         #If it has one, check the alphabet:
00682         strip_str = self._get_seq_str_and_check_alphabet(chars)
00683         return Seq(str(self).rstrip(strip_str), self.alphabet)

def Bio.Seq.Seq.split (   self,
  sep = None,
  maxsplit = -1 
) [inherited]
Split method, like that of a python string.

This behaves like the python string method of the same name.

Return a list of the 'words' in the string (as Seq objects),
using sep as the delimiter string.  If maxsplit is given, at
most maxsplit splits are done.  If maxsplit is omitted, all
splits are made.

Following the python string method, sep will by default be any
white space (tabs, spaces, newlines) but this is unlikely to
apply to biological sequences.

e.g.

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_aa = my_rna.translate()
>>> my_aa
Seq('VMAIVMGR*KGAR*L', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> my_aa.split("*")
[Seq('VMAIVMGR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KGAR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('L', HasStopCodon(ExtendedIUPACProtein(), '*'))]
>>> my_aa.split("*",1)
[Seq('VMAIVMGR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KGAR*L', HasStopCodon(ExtendedIUPACProtein(), '*'))]

See also the rsplit method:

>>> my_aa.rsplit("*",1)
[Seq('VMAIVMGR*KGAR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('L', HasStopCodon(ExtendedIUPACProtein(), '*'))]

Definition at line 564 of file Seq.py.

00564 
00565     def split(self, sep=None, maxsplit=-1):
00566         """Split method, like that of a python string.
00567 
00568         This behaves like the python string method of the same name.
00569 
00570         Return a list of the 'words' in the string (as Seq objects),
00571         using sep as the delimiter string.  If maxsplit is given, at
00572         most maxsplit splits are done.  If maxsplit is omitted, all
00573         splits are made.
00574 
00575         Following the python string method, sep will by default be any
00576         white space (tabs, spaces, newlines) but this is unlikely to
00577         apply to biological sequences.
00578         
00579         e.g.
00580 
00581         >>> from Bio.Seq import Seq
00582         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00583         >>> my_aa = my_rna.translate()
00584         >>> my_aa
00585         Seq('VMAIVMGR*KGAR*L', HasStopCodon(ExtendedIUPACProtein(), '*'))
00586         >>> my_aa.split("*")
00587         [Seq('VMAIVMGR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KGAR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('L', HasStopCodon(ExtendedIUPACProtein(), '*'))]
00588         >>> my_aa.split("*",1)
00589         [Seq('VMAIVMGR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KGAR*L', HasStopCodon(ExtendedIUPACProtein(), '*'))]
00590 
00591         See also the rsplit method:
00592 
00593         >>> my_aa.rsplit("*",1)
00594         [Seq('VMAIVMGR*KGAR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('L', HasStopCodon(ExtendedIUPACProtein(), '*'))]
00595         """
00596         #If it has one, check the alphabet:
00597         sep_str = self._get_seq_str_and_check_alphabet(sep)
00598         #TODO - If the sep is the defined stop symbol, or gap char,
00599         #should we adjust the alphabet?
00600         return [Seq(part, self.alphabet) \
00601                 for part in str(self).split(sep_str, maxsplit)]

def Bio.Seq.Seq.startswith (   self,
  prefix,
  start = 0,
  end = sys.maxint 
) [inherited]
Does the Seq start with the given prefix?  Returns True/False.

This behaves like the python string method of the same name.

Return True if the sequence starts with the specified prefix
(a string or another Seq object), False otherwise.
With optional start, test sequence beginning at that position.
With optional end, stop comparing sequence at that position.
prefix can also be a tuple of strings to try.  e.g.

>>> from Bio.Seq import Seq
>>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
>>> my_rna.startswith("GUC")
True
>>> my_rna.startswith("AUG")
False
>>> my_rna.startswith("AUG", 3)
True
>>> my_rna.startswith(("UCC","UCA","UCG"),1)
True

Definition at line 489 of file Seq.py.

00489 
00490     def startswith(self, prefix, start=0, end=sys.maxint):
00491         """Does the Seq start with the given prefix?  Returns True/False.
00492 
00493         This behaves like the python string method of the same name.
00494 
00495         Return True if the sequence starts with the specified prefix
00496         (a string or another Seq object), False otherwise.
00497         With optional start, test sequence beginning at that position.
00498         With optional end, stop comparing sequence at that position.
00499         prefix can also be a tuple of strings to try.  e.g.
00500         
00501         >>> from Bio.Seq import Seq
00502         >>> my_rna = Seq("GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG")
00503         >>> my_rna.startswith("GUC")
00504         True
00505         >>> my_rna.startswith("AUG")
00506         False
00507         >>> my_rna.startswith("AUG", 3)
00508         True
00509         >>> my_rna.startswith(("UCC","UCA","UCG"),1)
00510         True
00511         """
00512         #If it has one, check the alphabet:
00513         if isinstance(prefix, tuple):
00514             #TODO - Once we drop support for Python 2.4, instead of this
00515             #loop offload to the string method (requires Python 2.5+).
00516             #Check all the alphabets first...
00517             prefix_strings = [self._get_seq_str_and_check_alphabet(p) \
00518                               for p in prefix]
00519             for prefix_str in prefix_strings:
00520                 if str(self).startswith(prefix_str, start, end):
00521                     return True
00522             return False
00523         else:
00524             prefix_str = self._get_seq_str_and_check_alphabet(prefix)
00525             return str(self).startswith(prefix_str, start, end)
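
The tuple branch above loops over the candidate prefixes by hand. As a rough sketch of the equivalence the TODO comment alludes to, using a plain Python string instead of a Seq (startswith_any is a hypothetical helper, not part of Biopython):

```python
# Plain str as a stand-in for the Seq used in the docstring above.
rna = "GUCAUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGUUG"
prefixes = ("UCC", "UCA", "UCG")


def startswith_any(text, candidates, start=0):
    """Hypothetical helper: explicit loop, like the tuple branch above."""
    for candidate in candidates:
        if text.startswith(candidate, start):
            return True
    return False


# Explicit loop over the candidates, testing from position 1.
looped = startswith_any(rna, prefixes, 1)

# From Python 2.5 onwards the string method accepts the tuple directly.
direct = rna.startswith(prefixes, 1)

print(looped)  # True
print(direct)  # True
```

Both forms match here because the sequence reads UCA from position 1.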

def Bio.Seq.Seq.strip (   self,
  chars = None 
) [inherited]
Returns a new Seq object with leading and trailing ends stripped.

This behaves like the python string method of the same name.

Optional argument chars defines which characters to remove.  If
omitted or None (default) then as for the python string method,
this defaults to removing any white space.

e.g. print my_seq.strip("-")

See also the lstrip and rstrip methods.

Definition at line 625 of file Seq.py.

00625 
00626     def strip(self, chars=None):
00627         """Returns a new Seq object with leading and trailing ends stripped.
00628 
00629         This behaves like the python string method of the same name.
00630 
00631         Optional argument chars defines which characters to remove.  If
00632         omitted or None (default) then as for the python string method,
00633         this defaults to removing any white space.
00634         
00635         e.g. print my_seq.strip("-")
00636 
00637         See also the lstrip and rstrip methods.
00638         """
00639         #If it has one, check the alphabet:
00640         strip_str = self._get_seq_str_and_check_alphabet(chars)
00641         return Seq(str(self).strip(strip_str), self.alphabet)
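
To compare the three stripping methods side by side, here is a small sketch with plain Python strings standing in for Seq objects; the gapped sequence is invented for illustration:

```python
# Plain str as a stand-in for a gapped Seq; the sequence is illustrative.
padded = "--ACG-T--"

both_ends = padded.strip("-")    # leading and trailing gaps removed
left_only = padded.lstrip("-")   # only leading gaps removed
right_only = padded.rstrip("-")  # only trailing gaps removed

print(both_ends)   # ACG-T
print(left_only)   # ACG-T--
print(right_only)  # --ACG-T

# With chars left as None, white space is removed instead.
spaced = "  ACGT\n".strip()
print(spaced)  # ACGT
```

Note that the internal gap is untouched in every case; only the ends are stripped.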

def Bio.Seq.Seq.tomutable (   self) [inherited]
Returns the full sequence as a MutableSeq object.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("MKQHKAMIVALIVICITAVVAAL",
...              IUPAC.protein)
>>> my_seq
Seq('MKQHKAMIVALIVICITAVVAAL', IUPACProtein())
>>> my_seq.tomutable()
MutableSeq('MKQHKAMIVALIVICITAVVAAL', IUPACProtein())

Note that the alphabet is preserved.

Definition at line 325 of file Seq.py.

00325 
00326     def tomutable(self):   # Needed?  Or use a function?
00327         """Returns the full sequence as a MutableSeq object.
00328 
00329         >>> from Bio.Seq import Seq
00330         >>> from Bio.Alphabet import IUPAC
00331         >>> my_seq = Seq("MKQHKAMIVALIVICITAVVAAL",
00332         ...              IUPAC.protein)
00333         >>> my_seq
00334         Seq('MKQHKAMIVALIVICITAVVAAL', IUPACProtein())
00335         >>> my_seq.tomutable()
00336         MutableSeq('MKQHKAMIVALIVICITAVVAAL', IUPACProtein())
00337 
00338         Note that the alphabet is preserved.
00339         """
00340         return MutableSeq(str(self), self.alphabet)

def Bio.Seq.Seq.tostring (   self) [inherited]
Returns the full sequence as a python string (semi-obsolete).

Although not formally deprecated, you are now encouraged to use
str(my_seq) instead of my_seq.tostring().

Definition at line 312 of file Seq.py.

00312 
00313     def tostring(self):                            # Seq API requirement
00314         """Returns the full sequence as a python string (semi-obsolete).
00315 
00316         Although not formally deprecated, you are now encouraged to use
00317         str(my_seq) instead of my_seq.tostring()."""
00318         #TODO - Fix all places elsewhere in Biopython using this method,
00319         #then start deprecation process?
00320         #import warnings
00321         #warnings.warn("This method is obsolete; please use str(my_seq) "
00322         #              "instead of my_seq.tostring().",
00323         #              PendingDeprecationWarning)
00324         return str(self)
    

def Bio.Seq.UnknownSeq.transcribe (   self)
Returns unknown RNA sequence from an unknown DNA sequence.

>>> my_dna = UnknownSeq(10, character="N")
>>> my_dna
UnknownSeq(10, alphabet = Alphabet(), character = 'N')
>>> print my_dna
NNNNNNNNNN
>>> my_rna = my_dna.transcribe()
>>> my_rna
UnknownSeq(10, alphabet = RNAAlphabet(), character = 'N')
>>> print my_rna
NNNNNNNNNN

Reimplemented from Bio.Seq.Seq.

Definition at line 1354 of file Seq.py.

01354 
01355     def transcribe(self):
01356         """Returns unknown RNA sequence from an unknown DNA sequence.
01357 
01358         >>> my_dna = UnknownSeq(10, character="N")
01359         >>> my_dna
01360         UnknownSeq(10, alphabet = Alphabet(), character = 'N')
01361         >>> print my_dna
01362         NNNNNNNNNN
01363         >>> my_rna = my_dna.transcribe()
01364         >>> my_rna
01365         UnknownSeq(10, alphabet = RNAAlphabet(), character = 'N')
01366         >>> print my_rna
01367         NNNNNNNNNN
01368         """
01369         #Offload the alphabet stuff
01370         s = Seq(self._character, self.alphabet).transcribe()
01371         return UnknownSeq(self._length, s.alphabet, self._character)

def Bio.Seq.Seq.translate (   self,
  table = "Standard",
  stop_symbol = "*",
  to_stop = False,
  cds = False 
) [inherited]
Turns a nucleotide sequence into a protein sequence. New Seq object.

This method will translate DNA or RNA sequences, and those with a
nucleotide or generic alphabet.  Trying to translate a protein
sequence raises an exception.

Arguments:
 - table - Which codon table to use?  This can be either a name
           (string), an NCBI identifier (integer), or a CodonTable
           object (useful for non-standard genetic codes).  This
           defaults to the "Standard" table.
 - stop_symbol - Single character string, what to use for terminators.
                 This defaults to the asterisk, "*".
 - to_stop - Boolean, defaults to False meaning do a full translation
             continuing on past any stop codons (translated as the
             specified stop_symbol).  If True, translation is
             terminated at the first in frame stop codon (and the
             stop_symbol is not appended to the returned protein
             sequence).
 - cds - Boolean, indicates this is a complete CDS.  If True,
         this checks the sequence starts with a valid alternative start
         codon (which will be translated as methionine, M), that the
         sequence length is a multiple of three, and that there is a
         single in frame stop codon at the end (this will be excluded
         from the protein sequence, regardless of the to_stop option).
         If these tests fail, an exception is raised.

e.g. Using the standard table:

>>> coding_dna = Seq("GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
>>> coding_dna.translate()
Seq('VAIVMGR*KGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> coding_dna.translate(stop_symbol="@")
Seq('VAIVMGR@KGAR@', HasStopCodon(ExtendedIUPACProtein(), '@'))
>>> coding_dna.translate(to_stop=True)
Seq('VAIVMGR', ExtendedIUPACProtein())

Now using NCBI table 2, where TGA is not a stop codon:

>>> coding_dna.translate(table=2)
Seq('VAIVMGRWKGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('VAIVMGRWKGAR', ExtendedIUPACProtein())

In fact, GTG is an alternative start codon under NCBI table 2, meaning
this sequence could be a complete CDS:

>>> coding_dna.translate(table=2, cds=True)
Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())

It isn't a valid CDS under NCBI table 1, due to both the start codon and
also the in frame stop codons:

>>> coding_dna.translate(table=1, cds=True)
Traceback (most recent call last):
    ...
TranslationError: First codon 'GTG' is not a start codon

If the sequence has no in-frame stop codon, then the to_stop argument
has no effect:

>>> coding_dna2 = Seq("TTGGCCATTGTAATGGGCCGC")
>>> coding_dna2.translate()
Seq('LAIVMGR', ExtendedIUPACProtein())
>>> coding_dna2.translate(to_stop=True)
Seq('LAIVMGR', ExtendedIUPACProtein())

NOTE - Ambiguous codons like "TAN" or "NNN" could be an amino acid
or a stop codon.  These are translated as "X".  Any invalid codon
(e.g. "TA?" or "T-A") will throw a TranslationError.

NOTE - Does NOT support gapped sequences.

NOTE - This does NOT behave like the python string's translate
method.  For that use str(my_seq).translate(...) instead.

Definition at line 876 of file Seq.py.

00876     def translate(self, table="Standard", stop_symbol="*", to_stop=False,
00877                   cds=False):
00878         """Turns a nucleotide sequence into a protein sequence. New Seq object.
00879 
00880         This method will translate DNA or RNA sequences, and those with a
00881         nucleotide or generic alphabet.  Trying to translate a protein
00882         sequence raises an exception.
00883 
00884         Arguments:
00885          - table - Which codon table to use?  This can be either a name
00886                    (string), an NCBI identifier (integer), or a CodonTable
00887                    object (useful for non-standard genetic codes).  This
00888                    defaults to the "Standard" table.
00889          - stop_symbol - Single character string, what to use for terminators.
00890                          This defaults to the asterisk, "*".
00891          - to_stop - Boolean, defaults to False meaning do a full translation
00892                      continuing on past any stop codons (translated as the
00893                      specified stop_symbol).  If True, translation is
00894                      terminated at the first in frame stop codon (and the
00895                      stop_symbol is not appended to the returned protein
00896                      sequence).
00897          - cds - Boolean, indicates this is a complete CDS.  If True,
00898                  this checks the sequence starts with a valid alternative start
00899                  codon (which will be translated as methionine, M), that the
00900                  sequence length is a multiple of three, and that there is a
00901                  single in frame stop codon at the end (this will be excluded
00902                  from the protein sequence, regardless of the to_stop option).
00903                  If these tests fail, an exception is raised.
00904         
00905         e.g. Using the standard table:
00906 
00907         >>> coding_dna = Seq("GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
00908         >>> coding_dna.translate()
00909         Seq('VAIVMGR*KGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
00910         >>> coding_dna.translate(stop_symbol="@")
00911         Seq('VAIVMGR@KGAR@', HasStopCodon(ExtendedIUPACProtein(), '@'))
00912         >>> coding_dna.translate(to_stop=True)
00913         Seq('VAIVMGR', ExtendedIUPACProtein())
00914 
00915         Now using NCBI table 2, where TGA is not a stop codon:
00916 
00917         >>> coding_dna.translate(table=2)
00918         Seq('VAIVMGRWKGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
00919         >>> coding_dna.translate(table=2, to_stop=True)
00920         Seq('VAIVMGRWKGAR', ExtendedIUPACProtein())
00921 
00922         In fact, GTG is an alternative start codon under NCBI table 2, meaning
00923         this sequence could be a complete CDS:
00924 
00925         >>> coding_dna.translate(table=2, cds=True)
00926         Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())
00927 
00928         It isn't a valid CDS under NCBI table 1, due to both the start codon and
00929         also the in frame stop codons:
00930         
00931         >>> coding_dna.translate(table=1, cds=True)
00932         Traceback (most recent call last):
00933             ...
00934         TranslationError: First codon 'GTG' is not a start codon
00935 
00936         If the sequence has no in-frame stop codon, then the to_stop argument
00937         has no effect:
00938 
00939         >>> coding_dna2 = Seq("TTGGCCATTGTAATGGGCCGC")
00940         >>> coding_dna2.translate()
00941         Seq('LAIVMGR', ExtendedIUPACProtein())
00942         >>> coding_dna2.translate(to_stop=True)
00943         Seq('LAIVMGR', ExtendedIUPACProtein())
00944 
00945         NOTE - Ambiguous codons like "TAN" or "NNN" could be an amino acid
00946         or a stop codon.  These are translated as "X".  Any invalid codon
00947         (e.g. "TA?" or "T-A") will throw a TranslationError.
00948 
00949         NOTE - Does NOT support gapped sequences.
00950 
00951         NOTE - This does NOT behave like the python string's translate
00952         method.  For that use str(my_seq).translate(...) instead.
00953         """
00954         if isinstance(table, str) and len(table)==256:
00955             raise ValueError("The Seq object translate method DOES NOT take " \
00956                              + "a 256 character string mapping table like " \
00957                              + "the python string object's translate method. " \
00958                              + "Use str(my_seq).translate(...) instead.")
00959         if isinstance(Alphabet._get_base_alphabet(self.alphabet),
00960                       Alphabet.ProteinAlphabet):
00961             raise ValueError("Proteins cannot be translated!")
00962         try:
00963             table_id = int(table)
00964         except ValueError:
00965             #Assume it's a table name
00966             if self.alphabet==IUPAC.unambiguous_dna:
00967                 #Will use standard IUPAC protein alphabet, no need for X
00968                 codon_table = CodonTable.unambiguous_dna_by_name[table]
00969             elif self.alphabet==IUPAC.unambiguous_rna:
00970                 #Will use standard IUPAC protein alphabet, no need for X
00971                 codon_table = CodonTable.unambiguous_rna_by_name[table]
00972             else:
00973                 #This will use the extended IUPAC protein alphabet with X etc.
00974                 #The same table can be used for RNA or DNA (we use this for
00975                 #translating strings).
00976                 codon_table = CodonTable.ambiguous_generic_by_name[table]
00977         except (AttributeError, TypeError):
00978             #Assume it's a CodonTable object
00979             if isinstance(table, CodonTable.CodonTable):
00980                 codon_table = table
00981             else:
00982                 raise ValueError('Bad table argument')
00983         else:
00984             #Assume it's a table ID
00985             if self.alphabet==IUPAC.unambiguous_dna:
00986                 #Will use standard IUPAC protein alphabet, no need for X
00987                 codon_table = CodonTable.unambiguous_dna_by_id[table_id]
00988             elif self.alphabet==IUPAC.unambiguous_rna:
00989                 #Will use standard IUPAC protein alphabet, no need for X
00990                 codon_table = CodonTable.unambiguous_rna_by_id[table_id]
00991             else:
00992                 #This will use the extended IUPAC protein alphabet with X etc.
00993                 #The same table can be used for RNA or DNA (we use this for
00994                 #translating strings).
00995                 codon_table = CodonTable.ambiguous_generic_by_id[table_id]
00996         protein = _translate_str(str(self), codon_table, \
00997                                  stop_symbol, to_stop, cds)
00998         if stop_symbol in protein:
00999             alphabet = Alphabet.HasStopCodon(codon_table.protein_alphabet,
01000                                              stop_symbol = stop_symbol)
01001         else:
01002             alphabet = codon_table.protein_alphabet
01003         return Seq(protein, alphabet)
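
The try/except/else block above decides how the table argument is interpreted by attempting an integer conversion. The sketch below isolates just that decision; classify_table_argument is a hypothetical helper written for illustration and does not exist in Biopython:

```python
def classify_table_argument(table):
    """Hypothetical helper: report how a table argument would be read."""
    try:
        int(table)
    except ValueError:
        # A non-numeric string such as "Standard" is taken as a table name.
        return "name"
    except (AttributeError, TypeError):
        # Not convertible at all; the listing above then insists on a
        # CodonTable instance and raises ValueError for anything else.
        return "object"
    else:
        # Anything int() accepts is taken as an NCBI table identifier.
        return "id"


kinds = [classify_table_argument(t) for t in ("Standard", 2, "2", object())]
print(kinds)  # ['name', 'id', 'id', 'object']
```

One consequence of this design, visible in the sketch, is that a numeric string such as "2" is treated as a table identifier rather than as a name.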

def Bio.Seq.UnknownSeq.translate (   self,
  **kwargs 
)
Translate an unknown nucleotide sequence into an unknown protein.

e.g.

>>> my_seq = UnknownSeq(11, character="N")
>>> print my_seq
NNNNNNNNNNN
>>> my_protein = my_seq.translate()
>>> my_protein
UnknownSeq(3, alphabet = ProteinAlphabet(), character = 'X')
>>> print my_protein
XXX

In comparison, using a normal Seq object:

>>> my_seq = Seq("NNNNNNNNNNN")
>>> print my_seq
NNNNNNNNNNN
>>> my_protein = my_seq.translate()
>>> my_protein
Seq('XXX', ExtendedIUPACProtein())
>>> print my_protein
XXX

Definition at line 1430 of file Seq.py.

01430 
01431     def translate(self, **kwargs):
01432         """Translate an unknown nucleotide sequence into an unknown protein.
01433 
01434         e.g.
01435 
01436         >>> my_seq = UnknownSeq(11, character="N")
01437         >>> print my_seq
01438         NNNNNNNNNNN
01439         >>> my_protein = my_seq.translate()
01440         >>> my_protein
01441         UnknownSeq(3, alphabet = ProteinAlphabet(), character = 'X')
01442         >>> print my_protein
01443         XXX
01444 
01445         In comparison, using a normal Seq object:
01446 
01447         >>> my_seq = Seq("NNNNNNNNNNN")
01448         >>> print my_seq
01449         NNNNNNNNNNN
01450         >>> my_protein = my_seq.translate()
01451         >>> my_protein
01452         Seq('XXX', ExtendedIUPACProtein())
01453         >>> print my_protein
01454         XXX
01455 
01456         """
01457         if isinstance(Alphabet._get_base_alphabet(self.alphabet),
01458                       Alphabet.ProteinAlphabet):
01459             raise ValueError("Proteins cannot be translated!")
01460         return UnknownSeq(self._length//3, Alphabet.generic_protein, "X")
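
The only arithmetic involved is the floor division of the nucleotide length by three, so any trailing partial codon is dropped. A minimal sketch of that step (unknown_protein_length is a hypothetical name used for illustration, not a Biopython function):

```python
def unknown_protein_length(nucleotide_length):
    """Hypothetical helper mirroring self._length//3 in the listing above."""
    return nucleotide_length // 3


# 9, 10 and 11 nucleotides all give 3 complete codons; 12 gives 4.
lengths = [(n, unknown_protein_length(n)) for n in (9, 10, 11, 12)]
print(lengths)  # [(9, 3), (10, 3), (11, 3), (12, 4)]

# Expanding the result on demand, as printing an UnknownSeq would.
expanded = "X" * unknown_protein_length(11)
print(expanded)  # XXX
```

This matches the docstring example, where 11 unknown nucleotides translate to three X characters.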

def Bio.Seq.UnknownSeq.ungap (   self,
  gap = None 
)
Return a copy of the sequence without the gap character(s).

The gap character can be specified in two ways - either as an explicit
argument, or via the sequence's alphabet. For example:

>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import Gapped, generic_dna
>>> my_dna = UnknownSeq(20, Gapped(generic_dna,"-"))
>>> my_dna
UnknownSeq(20, alphabet = Gapped(DNAAlphabet(), '-'), character = 'N')
>>> my_dna.ungap()
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
>>> my_dna.ungap("-")
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')

If the UnknownSeq is using the gap character, then an empty Seq is
returned:

>>> my_gap = UnknownSeq(20, Gapped(generic_dna,"-"), character="-")
>>> my_gap
UnknownSeq(20, alphabet = Gapped(DNAAlphabet(), '-'), character = '-')
>>> my_gap.ungap()
Seq('', DNAAlphabet())
>>> my_gap.ungap("-")
Seq('', DNAAlphabet())

Notice that the returned sequence's alphabet is adjusted to remove any
explicit gap character declaration.

Reimplemented from Bio.Seq.Seq.

Definition at line 1461 of file Seq.py.

01461 
01462     def ungap(self, gap=None):
01463         """Return a copy of the sequence without the gap character(s).
01464 
01465         The gap character can be specified in two ways - either as an explicit
01466         argument, or via the sequence's alphabet. For example:
01467 
01468         >>> from Bio.Seq import UnknownSeq
01469         >>> from Bio.Alphabet import Gapped, generic_dna
01470         >>> my_dna = UnknownSeq(20, Gapped(generic_dna,"-"))
01471         >>> my_dna
01472         UnknownSeq(20, alphabet = Gapped(DNAAlphabet(), '-'), character = 'N')
01473         >>> my_dna.ungap()
01474         UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
01475         >>> my_dna.ungap("-")
01476         UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
01477 
01478         If the UnknownSeq is using the gap character, then an empty Seq is
01479         returned:
01480 
01481         >>> my_gap = UnknownSeq(20, Gapped(generic_dna,"-"), character="-")
01482         >>> my_gap
01483         UnknownSeq(20, alphabet = Gapped(DNAAlphabet(), '-'), character = '-')
01484         >>> my_gap.ungap()
01485         Seq('', DNAAlphabet())
01486         >>> my_gap.ungap("-")
01487         Seq('', DNAAlphabet())
01488 
01489         Notice that the returned sequence's alphabet is adjusted to remove any
01490         explicit gap character declaration.
01491         """
01492         #Offload the alphabet stuff
01493         s = Seq(self._character, self.alphabet).ungap()
01494         if s :
01495             return UnknownSeq(self._length, s.alphabet, self._character)
01496         else :
01497             return Seq("", s.alphabet)
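
Because every position holds the same character, the listing above only needs to ungap a single representative character. The sketch below imitates that idea with plain Python values; ungap_unknown is a hypothetical stand-in that ignores alphabets entirely and returns a (length, character) tuple where the real method returns an UnknownSeq:

```python
def ungap_unknown(length, character, gap):
    """Hypothetical stand-in for the logic above, without alphabets."""
    # Test one representative character instead of the expanded sequence.
    if character.replace(gap, ""):
        # The character is not the gap, so nothing is removed.
        return (length, character)
    # The whole sequence consisted of gaps, so nothing is left.
    return ""


kept = ungap_unknown(20, "N", "-")
emptied = ungap_unknown(20, "-", "-")
print(kept)           # (20, 'N')
print(repr(emptied))  # ''
```

The real method additionally adjusts the returned alphabet to drop the gap declaration, which this sketch leaves out.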

def Bio.Seq.UnknownSeq.upper (   self)
Returns an upper case copy of the sequence.

>>> from Bio.Alphabet import generic_dna
>>> from Bio.Seq import UnknownSeq
>>> my_seq = UnknownSeq(20, generic_dna, character="n")
>>> my_seq
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'n')
>>> print my_seq
nnnnnnnnnnnnnnnnnnnn
>>> my_seq.upper()
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
>>> print my_seq.upper()
NNNNNNNNNNNNNNNNNNNN

This will adjust the alphabet if required. See also the lower method.

Reimplemented from Bio.Seq.Seq.

Definition at line 1390 of file Seq.py.

01390 
01391     def upper(self):
01392         """Returns an upper case copy of the sequence.
01393 
01394         >>> from Bio.Alphabet import generic_dna
01395         >>> from Bio.Seq import UnknownSeq
01396         >>> my_seq = UnknownSeq(20, generic_dna, character="n")
01397         >>> my_seq
01398         UnknownSeq(20, alphabet = DNAAlphabet(), character = 'n')
01399         >>> print my_seq
01400         nnnnnnnnnnnnnnnnnnnn
01401         >>> my_seq.upper()
01402         UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
01403         >>> print my_seq.upper()
01404         NNNNNNNNNNNNNNNNNNNN
01405 
01406         This will adjust the alphabet if required. See also the lower method.
01407         """
01408         return UnknownSeq(self._length, self.alphabet._upper(), self._character.upper())


Member Data Documentation

Definition at line 1152 of file Seq.py.

Definition at line 1144 of file Seq.py.

Reimplemented from Bio.Seq.Seq.

Definition at line 1148 of file Seq.py.


The documentation for this class was generated from the following file: