
python-biopython  1.60
Bio.SeqRecord.SeqRecord Class Reference

Public Member Functions

def __init__
def __getitem__
def __iter__
def __contains__
def __str__
def __repr__
def format
def __format__
def __len__
def __nonzero__
def __add__
def __radd__
def upper
def lower
def reverse_complement

Public Attributes

 id
 name
 description
 dbxrefs
 annotations
 features

Properties

 letter_annotations
 seq

Private Member Functions

def _set_per_letter_annotations
def _set_seq

Private Attributes

 _seq
 _per_letter_annotations

Detailed Description

A SeqRecord object holds a sequence and information about it.

Main attributes:
 - id          - Identifier such as a locus tag (string)
 - seq         - The sequence itself (Seq object or similar)

Additional attributes:
 - name        - Sequence name, e.g. gene name (string)
 - description - Additional text (string)
 - dbxrefs     - List of database cross references (list of strings)
 - features    - Any (sub)features defined (list of SeqFeature objects)
 - annotations - Further information about the whole sequence (dictionary)
                 Most entries are strings, or lists of strings.
 - letter_annotations - Per letter/symbol annotation (restricted
                 dictionary). This holds Python sequences (lists, strings
                 or tuples) whose length matches that of the sequence.
                 A typical use would be to hold a list of integers
                 representing sequencing quality scores, or a string
                 representing the secondary structure.
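
The length restriction on letter_annotations can be illustrated with a small
stand-alone sketch. The class name LengthCheckedDict is hypothetical and this
is not a copy of the library's own _RestrictedDict; it only assumes that
values must support len() and match the length of the parent sequence:

```python
class LengthCheckedDict(dict):
    """Dict that only accepts sequence values of one fixed length.

    Hypothetical stand-in for illustration, not Biopython's own class.
    """

    def __init__(self, length):
        dict.__init__(self)
        self._length = int(length)

    def __setitem__(self, key, value):
        # Values must support len() and match the parent sequence length.
        if not hasattr(value, "__len__") or len(value) != self._length:
            raise TypeError("value must be a sequence of length %i"
                            % self._length)
        dict.__setitem__(self, key, value)

    def update(self, new_dict):
        # Route bulk updates through __setitem__ so they are checked too.
        for key, value in new_dict.items():
            self[key] = value


sequence = "ACGTAC"
per_letter = LengthCheckedDict(len(sequence))
per_letter["phred_quality"] = [30, 31, 32, 33, 34, 35]   # accepted
per_letter["secondary_structure"] = "HHHTTT"             # accepted
try:
    per_letter["too_short"] = [1, 2, 3]                  # wrong length
except TypeError as err:
    rejected = str(err)
```

Checking at assignment time means a mismatched annotation is reported when it
is added, rather than later when the record is sliced or written out.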

You will typically use Bio.SeqIO to read in sequences from files as
SeqRecord objects.  However, you may want to create your own SeqRecord
objects directly (see the __init__ method for further details):

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import IUPAC
>>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
...                         IUPAC.protein),
...                    id="YP_025292.1", name="HokC",
...                    description="toxic membrane protein")
>>> print record
ID: YP_025292.1
Name: HokC
Description: toxic membrane protein
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

If you want to save SeqRecord objects to a sequence file, use Bio.SeqIO
for this.  For the special case where you want the SeqRecord turned into
a string in a particular file format there is a format method which uses
Bio.SeqIO internally:

>>> print record.format("fasta")
>YP_025292.1 toxic membrane protein
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
<BLANKLINE>

You can also do things like slicing a SeqRecord, checking its length, etc.:

>>> len(record)
44
>>> edited = record[:10] + record[11:]
>>> print edited.seq
MKQHKAMIVAIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
>>> print record.seq
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

Definition at line 87 of file SeqRecord.py.


Constructor & Destructor Documentation

def Bio.SeqRecord.SeqRecord.__init__ (   self,
  seq,
  id = "<unknown id>",
  name = "<unknown name>",
  description = "<unknown description>",
  dbxrefs = None,
  features = None,
  annotations = None,
  letter_annotations = None 
)
Create a SeqRecord.

Arguments:
 - seq         - Sequence, required (Seq, MutableSeq or UnknownSeq)
 - id          - Sequence identifier, recommended (string)
 - name        - Sequence name, optional (string)
 - description - Sequence description, optional (string)
 - dbxrefs     - Database cross references, optional (list of strings)
 - features    - Any (sub)features, optional (list of SeqFeature objects)
 - annotations - Dictionary of annotations for the whole sequence
 - letter_annotations - Dictionary of per-letter-annotations, values
                should be strings, list or tuples of the same
                length as the full sequence.

You will typically use Bio.SeqIO to read in sequences from files as
SeqRecord objects.  However, you may want to create your own SeqRecord
objects directly.

Note that while an id is optional, we strongly recommend you supply a
unique id string for each record.  This is especially important
if you wish to write your sequences to a file.

If you don't have the actual sequence, but you do know its length,
then using the UnknownSeq object from Bio.Seq is appropriate.

You can create a 'blank' SeqRecord object, and then populate the
attributes later.  

Definition at line 150 of file SeqRecord.py.

    def __init__(self, seq, id = "<unknown id>", name = "<unknown name>",
                 description = "<unknown description>", dbxrefs = None,
                 features = None, annotations = None,
                 letter_annotations = None):
        """Create a SeqRecord."""
        if id is not None and not isinstance(id, basestring):
            #Lots of existing code uses id=None... this may be a bad idea.
            raise TypeError("id argument should be a string")
        if not isinstance(name, basestring):
            raise TypeError("name argument should be a string")
        if not isinstance(description, basestring):
            raise TypeError("description argument should be a string")
        self._seq = seq
        self.id = id
        self.name = name
        self.description = description

        # database cross references (for the whole sequence)
        if dbxrefs is None:
            dbxrefs = []
        elif not isinstance(dbxrefs, list):
            raise TypeError("dbxrefs argument should be a list (of strings)")
        self.dbxrefs = dbxrefs

        # annotations about the whole sequence
        if annotations is None:
            annotations = {}
        elif not isinstance(annotations, dict):
            raise TypeError("annotations argument should be a dict")
        self.annotations = annotations

        if letter_annotations is None:
            # annotations about each letter in the sequence
            if seq is None:
                #Should we allow this and use a normal unrestricted dict?
                self._per_letter_annotations = _RestrictedDict(length=0)
            else:
                try:
                    self._per_letter_annotations = \
                                              _RestrictedDict(length=len(seq))
                except:
                    raise TypeError("seq argument should be a Seq object or similar")
        else:
            #This will be handled via the property set function, which will
            #turn this into a _RestrictedDict and thus ensure all the values
            #in the dict are the right length
            self.letter_annotations = letter_annotations

        # annotations about parts of the sequence
        if features is None:
            features = []
        elif not isinstance(features, list):
            raise TypeError("features argument should be a list (of SeqFeature objects)")
        self.features = features


Member Function Documentation

def Bio.SeqRecord.SeqRecord.__add__ (   self,
  other 
)
Add another sequence or string to this sequence.

The other sequence can be a SeqRecord object, a Seq object (or
similar, e.g. a MutableSeq) or a plain Python string. If you add
a plain string or a Seq (like) object, the new SeqRecord will simply
have this appended to the existing data. However, any per letter
annotation will be lost:

>>> from Bio import SeqIO
>>> handle = open("Quality/solexa_faked.fastq", "rU")
>>> record = SeqIO.read(handle, "fastq-solexa")
>>> handle.close()
>>> print record.id, record.seq
slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
>>> print record.letter_annotations.keys()
['solexa_quality']

>>> new = record + "ACT"
>>> print new.id, new.seq
slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNNACT
>>> print new.letter_annotations.keys()
[]

The new record will attempt to combine the annotation, but for any
ambiguities (e.g. different names) it defaults to omitting that
annotation.

>>> from Bio import SeqIO
>>> handle = open("GenBank/pBAD30.gb")
>>> plasmid = SeqIO.read(handle, "gb")
>>> handle.close()
>>> print plasmid.id, len(plasmid)
pBAD30 4923

Now let's cut the plasmid into two pieces, and join them back up the
other way round (i.e. shift the starting point on this plasmid, have
a look at the annotated features in the original file to see why this
particular split point might make sense):

>>> left = plasmid[:3765]
>>> right = plasmid[3765:]
>>> new = right + left
>>> print new.id, len(new)
pBAD30 4923
>>> str(new.seq) == str(right.seq + left.seq)
True
>>> len(new.features) == len(left.features) + len(right.features)
True

When we add the left and right SeqRecord objects, their annotation
is all consistent, so it is all conserved in the new SeqRecord:

>>> new.id == left.id == right.id == plasmid.id
True
>>> new.name == left.name == right.name == plasmid.name
True
>>> new.description == plasmid.description
True
>>> new.annotations == left.annotations == right.annotations
True
>>> new.letter_annotations == plasmid.letter_annotations
True
>>> new.dbxrefs == left.dbxrefs == right.dbxrefs
True

However, we should point out that when we sliced the SeqRecord,
any annotations dictionary or dbxrefs list entries were lost.
You can explicitly copy them like this:

>>> new.annotations = plasmid.annotations.copy()
>>> new.dbxrefs = plasmid.dbxrefs[:]
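
The merging rules described above can be sketched without Biopython. The
function below works on plain dicts; the dict layout and the name
merge_record_annotation are hypothetical, chosen only for illustration. It
assumes that a conflicting id falls back to the constructor default, that
annotation entries survive only when both sides agree, and that per-letter
annotation is concatenated for keys both records carry:

```python
def merge_record_annotation(left, right):
    """Combine two dict-based records using the rules described above.

    `left` and `right` are plain dicts with the keys "id", "annotations",
    "dbxrefs" and "letter_annotations" (a hypothetical layout).
    """
    merged = {"id": "<unknown id>", "annotations": {},
              "dbxrefs": list(left["dbxrefs"]), "letter_annotations": {}}
    # Keep the id only if both sides agree on it.
    if left["id"] == right["id"]:
        merged["id"] = left["id"]
    # Take every cross reference once, preserving first-seen order.
    for ref in right["dbxrefs"]:
        if ref not in merged["dbxrefs"]:
            merged["dbxrefs"].append(ref)
    # Keep only annotation entries present with equal values on both sides.
    for key, value in left["annotations"].items():
        if key in right["annotations"] and right["annotations"][key] == value:
            merged["annotations"][key] = value
    # Concatenate per-letter annotation for keys that both sides carry.
    for key, value in left["letter_annotations"].items():
        if key in right["letter_annotations"]:
            merged["letter_annotations"][key] = \
                value + right["letter_annotations"][key]
    return merged


a = {"id": "rec1",
     "annotations": {"organism": "E. coli", "topology": "linear"},
     "dbxrefs": ["DB:1"],
     "letter_annotations": {"phred_quality": [30, 31]}}
b = {"id": "rec2",
     "annotations": {"organism": "E. coli", "topology": "circular"},
     "dbxrefs": ["DB:1", "DB:2"],
     "letter_annotations": {"phred_quality": [32]}}
merged = merge_record_annotation(a, b)
```

Here the ids differ, so the merged id is left at the default, and only the
"organism" entry survives because "topology" disagrees between the two sides.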

Definition at line 731 of file SeqRecord.py.

    def __add__(self, other):
        """Add another sequence or string to this sequence."""
        if not isinstance(other, SeqRecord):
            #Assume it is a string or a Seq.
            #Note can't transfer any per-letter-annotations
            return SeqRecord(self.seq + other,
                             id = self.id, name = self.name,
                             description = self.description,
                             features = self.features[:],
                             annotations = self.annotations.copy(),
                             dbxrefs = self.dbxrefs[:])
        #Adding two SeqRecord objects... must merge annotation.
        answer = SeqRecord(self.seq + other.seq,
                           features = self.features[:],
                           dbxrefs = self.dbxrefs[:])
        #Will take all the features and all the db cross refs,
        l = len(self)
        for f in other.features:
            answer.features.append(f._shift(l))
        del l
        for ref in other.dbxrefs:
            if ref not in answer.dbxrefs:
                answer.dbxrefs.append(ref)
        #Take common id/name/description/annotation
        if self.id == other.id:
            answer.id = self.id
        if self.name == other.name:
            answer.name = self.name
        if self.description == other.description:
            answer.description = self.description
        for k,v in self.annotations.iteritems():
            if k in other.annotations and other.annotations[k] == v:
                answer.annotations[k] = v
        #Can append matching per-letter-annotation
        for k,v in self.letter_annotations.iteritems():
            if k in other.letter_annotations:
                answer.letter_annotations[k] = v + other.letter_annotations[k]
        return answer
        
def Bio.SeqRecord.SeqRecord.__contains__ (   self,
  char 
)
Implements the 'in' keyword, searches the sequence.

e.g.

>>> from Bio import SeqIO
>>> record = SeqIO.read(open("Fasta/sweetpea.nu"), "fasta")
>>> "GAATTC" in record
False
>>> "AAA" in record
True

This essentially acts as a proxy for using "in" on the sequence:

>>> "GAATTC" in record.seq
False
>>> "AAA" in record.seq
True

Note that you can also use Seq objects as the query:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> Seq("AAA") in record
True
>>> Seq("AAA", generic_dna) in record
True

See also the Seq object's __contains__ method.

Definition at line 520 of file SeqRecord.py.

    def __contains__(self, char):
        """Implements the 'in' keyword, searches the sequence."""
        return char in self.seq

def Bio.SeqRecord.SeqRecord.__format__ (   self,
  format_spec 
)
Returns the record as a string in the specified file format.

This method supports the Python format() function added in
Python 2.6/3.0.  The format_spec should be a lower case string
supported by Bio.SeqIO as an output file format.  An empty
format_spec returns str(record) instead. See also the
SeqRecord's format() method.
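
As a rough illustration of how the built-in format() reaches __format__, here
is a minimal sketch with a hypothetical TinyRecord class. It handles only a
hand-written FASTA layout and does not use Bio.SeqIO, so it shows the dispatch
only, not how the library writes files:

```python
class TinyRecord(object):
    """Hypothetical minimal record; only FASTA output is sketched."""

    def __init__(self, seq, id, description):
        self.seq = seq
        self.id = id
        self.description = description

    def __str__(self):
        return "ID: %s\nSeq: %s" % (self.id, self.seq)

    def format(self, format_spec):
        if format_spec != "fasta":
            raise ValueError("Unsupported format %r" % format_spec)
        return ">%s %s\n%s\n" % (self.id, self.description, self.seq)

    def __format__(self, format_spec):
        # An empty spec follows the Python convention of using str().
        if not format_spec:
            return str(self)
        return self.format(format_spec)


rec = TinyRecord("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                 "YP_025292.1", "toxic membrane protein")
fasta_text = format(rec, "fasta")      # built-in format() calls __format__
same_text = "{0:fasta}".format(rec)    # str.format() takes the same route
plain_text = format(rec)               # empty spec falls back to str(rec)
```

The text after the colon in a str.format() replacement field is passed through
as format_spec, which is why a file format name can be used there.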

Definition at line 674 of file SeqRecord.py.

    def __format__(self, format_spec):
        """Returns the record as a string in the specified file format."""
        if not format_spec:
            #Follow python convention and default to using __str__
            return str(self)
        from Bio import SeqIO
        if format_spec in SeqIO._BinaryFormats:
            #Return bytes on Python 3
            try:
                #This is in Python 2.6+, but we need it on Python 3
                from io import BytesIO
                handle = BytesIO()
            except ImportError:
                #Must be on Python 2.5 or older
                from StringIO import StringIO
                handle = StringIO()
        else:
            from StringIO import StringIO
            handle = StringIO()
        SeqIO.write(self, handle, format_spec)
        return handle.getvalue()


def Bio.SeqRecord.SeqRecord.__getitem__ (   self,
  index 
)
Returns a sub-sequence or an individual letter.

Slicing, e.g. my_record[5:10], returns a new SeqRecord for
that sub-sequence with appropriate annotation preserved.  The
name, id and description are kept.

Any per-letter-annotations are sliced to match the requested
sub-sequence.  Unless a stride is used, all those features
which fall fully within the subsequence are included (with
their locations adjusted accordingly).

However, the annotations dictionary and the dbxrefs list are
not used for the new SeqRecord, as in general they may not
apply to the subsequence.  If you want to preserve them, you
must explicitly copy them to the new SeqRecord yourself.

Using an integer index, e.g. my_record[5] is shorthand for
extracting that letter from the sequence, my_record.seq[5].

For example, consider this short protein and its secondary
structure as encoded by the PDB (e.g. H for alpha helices),
plus a simple feature for its histidine self phosphorylation
site:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> from Bio.Alphabet import IUPAC
>>> rec = SeqRecord(Seq("MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLAT"
...                     "EMMSEQDGYLAESINKDIEECNAIIEQFIDYLR",
...                     IUPAC.protein),
...                 id="1JOY", name="EnvZ",
...                 description="Homodimeric domain of EnvZ from E. coli")
>>> rec.letter_annotations["secondary_structure"] = "  S  SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT  "
>>> rec.features.append(SeqFeature(FeatureLocation(20,21),
...                     type = "Site"))

Now let's have a quick look at the full record:

>>> print rec
ID: 1JOY
Name: EnvZ
Description: Homodimeric domain of EnvZ from E. coli
Number of features: 1
Per letter annotation for: secondary_structure
Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR', IUPACProtein())
>>> print rec.letter_annotations["secondary_structure"]
  S  SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT  
>>> print rec.features[0].location
[20:21]

Now let's take a sub sequence, here chosen as the first (fractured)
alpha helix which includes the histidine phosphorylation site:

>>> sub = rec[11:41]
>>> print sub
ID: 1JOY
Name: EnvZ
Description: Homodimeric domain of EnvZ from E. coli
Number of features: 1
Per letter annotation for: secondary_structure
Seq('RTLLMAGVSHDLRTPLTRIRLATEMMSEQD', IUPACProtein())
>>> print sub.letter_annotations["secondary_structure"]
HHHHHTTTHHHHHHHHHHHHHHHHHHHHHH
>>> print sub.features[0].location
[9:10]

You can also of course omit the start or end values, for
example to get the first ten letters only:

>>> print rec[:10]
ID: 1JOY
Name: EnvZ
Description: Homodimeric domain of EnvZ from E. coli
Number of features: 0
Per letter annotation for: secondary_structure
Seq('MAAGVKQLAD', IUPACProtein())

Or for the last ten letters:

>>> print rec[-10:]
ID: 1JOY
Name: EnvZ
Description: Homodimeric domain of EnvZ from E. coli
Number of features: 0
Per letter annotation for: secondary_structure
Seq('IIEQFIDYLR', IUPACProtein())

If you omit both, then you get a copy of the original record (although
lacking the annotations and dbxrefs):

>>> print rec[:]
ID: 1JOY
Name: EnvZ
Description: Homodimeric domain of EnvZ from E. coli
Number of features: 1
Per letter annotation for: secondary_structure
Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR', IUPACProtein())

Finally, indexing with a simple integer is shorthand for pulling out
that letter from the sequence directly:

>>> rec[5]
'K'
>>> rec.seq[5]
'K'
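
The feature handling during slicing can be sketched in plain Python. In the
sketch below, features are simple (start, end) tuples and slice_features is a
hypothetical helper, not part of Biopython. It assumes a feature is kept only
when it lies entirely inside the slice, and that its coordinates are then
shifted by the slice start:

```python
def slice_features(features, index, parent_length):
    """Return the (start, end) features kept by a slice, shifted to fit.

    `features` is a list of (start, end) tuples, a hypothetical stand-in
    for SeqFeature locations.
    """
    start, stop, step = index.indices(parent_length)
    kept = []
    if step == 1:
        for f_start, f_end in features:
            # Keep only features lying entirely inside the slice.
            if start <= f_start and f_end <= stop:
                kept.append((f_start - start, f_end - start))
    # With any other stride no features are carried over.
    return kept


seq = ("MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLAT"
       "EMMSEQDGYLAESINKDIEECNAIIEQFIDYLR")
site = [(20, 21)]                       # the histidine at index 20
helix = slice_features(site, slice(11, 41), len(seq))
head = slice_features(site, slice(None, 10), len(seq))
strided = slice_features(site, slice(None, None, 2), len(seq))

# Per-letter values are sliced with the same index as the sequence.
positions = list(range(len(seq)))
helix_positions = positions[11:41]
```

With these inputs the site at [20:21] becomes [9:10] in the 11:41 slice, is
dropped from the first ten letters, and is dropped whenever a stride is used.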

Definition at line 309 of file SeqRecord.py.

    def __getitem__(self, index):
        """Returns a sub-sequence or an individual letter."""
        if isinstance(index, int):
            #NOTE - The sequence level annotation like the id, name, etc
            #do not really apply to a single character.  However, should
            #we try and expose any per-letter-annotation here?  If so how?
            return self.seq[index]
        elif isinstance(index, slice):
            if self.seq is None:
                raise ValueError("If the sequence is None, we cannot slice it.")
            parent_length = len(self)
            answer = self.__class__(self.seq[index],
                                    id=self.id,
                                    name=self.name,
                                    description=self.description)
            #TODO - The description may no longer apply.
            #It would be safer to change it to something
            #generic like "edited" or the default value.

            #Don't copy the annotation dict and dbxrefs list,
            #they may not apply to a subsequence.
            #answer.annotations = dict(self.annotations.iteritems())
            #answer.dbxrefs = self.dbxrefs[:]
            #TODO - Review this in light of adding SeqRecord objects?

            #TODO - Cope with strides by generating ambiguous locations?
            start, stop, step = index.indices(parent_length)
            if step == 1:
                #Select relevant features, add them with shifted locations
                #assert str(self.seq)[index] == str(self.seq)[start:stop]
                for f in self.features:
                    if f.ref or f.ref_db:
                        #TODO - Implement this (with lots of tests)?
                        import warnings
                        warnings.warn("When slicing SeqRecord objects, any "
                              "SeqFeature referencing other sequences (e.g. "
                              "from segmented GenBank records) are ignored.")
                        continue
                    if start <= f.location.nofuzzy_start \
                    and f.location.nofuzzy_end <= stop:
                        answer.features.append(f._shift(-start))

            #Slice all the values to match the sliced sequence
            #(this should also work with strides, even negative strides):
            for key, value in self.letter_annotations.iteritems():
                answer._per_letter_annotations[key] = value[index]

            return answer
        raise ValueError("Invalid index")

def Bio.SeqRecord.SeqRecord.__iter__ (   self )
Iterate over the letters in the sequence.

For example, using Bio.SeqIO to read in a protein FASTA file:

>>> from Bio import SeqIO
>>> record = SeqIO.read(open("Fasta/loveliesbleeding.pro"),"fasta")
>>> for amino in record:
...     print amino
...     if amino == "L": break
X
A
G
L
>>> print record.seq[3]
L

This is just a shortcut for iterating over the sequence directly:

>>> for amino in record.seq:
...     print amino
...     if amino == "L": break
X
A
G
L
>>> print record.seq[3]
L

Note that this does not facilitate iteration together with any
per-letter-annotation.  However, you can achieve that using the
Python zip function on the record (or its sequence) and the relevant
per-letter-annotation:

>>> from Bio import SeqIO
>>> rec = SeqIO.read(open("Quality/solexa_faked.fastq", "rU"),
...                  "fastq-solexa")
>>> print rec.id, rec.seq
slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
>>> print rec.letter_annotations.keys()
['solexa_quality']
>>> for nuc, qual in zip(rec,rec.letter_annotations["solexa_quality"]):
...     if qual > 35:
...         print nuc, qual
A 40
C 39
G 38
T 37
A 36

You may agree that using zip(rec.seq, ...) is more explicit than using
zip(rec, ...) as shown above.
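
The pairing works because iterating over a record yields one letter at a time, so any Python sequence of equal length can be zipped with it. The sketch below uses a plain string and list as stand-ins for rec.seq and rec.letter_annotations["solexa_quality"] (the values are made up for illustration), and is written to run unchanged on Python 2 and Python 3.

```python
# Stand-ins for rec.seq and rec.letter_annotations["solexa_quality"];
# the values are illustrative, not read from a file.
seq = "ACGTAC"
quals = [40, 39, 38, 37, 36, 35]

# Pair each letter with its score and keep the high-quality positions.
high_quality = [(nuc, qual) for nuc, qual in zip(seq, quals) if qual > 35]

# enumerate() adds the position when the index is also needed.
low_positions = [i for i, qual in enumerate(quals) if qual <= 35]
```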

Definition at line 465 of file SeqRecord.py.

00465 
00466     def __iter__(self):
00467         """Iterate over the letters in the sequence.
00468 
00469         For example, using Bio.SeqIO to read in a protein FASTA file:
00470 
00471         >>> from Bio import SeqIO
00472         >>> record = SeqIO.read(open("Fasta/loveliesbleeding.pro"),"fasta")
00473         >>> for amino in record:
00474         ...     print amino
00475         ...     if amino == "L": break
00476         X
00477         A
00478         G
00479         L
00480         >>> print record.seq[3]
00481         L
00482 
00483         This is just a shortcut for iterating over the sequence directly:
00484 
00485         >>> for amino in record.seq:
00486         ...     print amino
00487         ...     if amino == "L": break
00488         X
00489         A
00490         G
00491         L
00492         >>> print record.seq[3]
00493         L
00494         
00495         Note that this does not facilitate iteration together with any
00496         per-letter-annotation.  However, you can achieve that using the
00497         python zip function on the record (or its sequence) and the relevant
00498         per-letter-annotation:
00499         
00500         >>> from Bio import SeqIO
00501         >>> rec = SeqIO.read(open("Quality/solexa_faked.fastq", "rU"),
00502         ...                  "fastq-solexa")
00503         >>> print rec.id, rec.seq
00504         slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
00505         >>> print rec.letter_annotations.keys()
00506         ['solexa_quality']
00507         >>> for nuc, qual in zip(rec,rec.letter_annotations["solexa_quality"]):
00508         ...     if qual > 35:
00509         ...         print nuc, qual
00510         A 40
00511         C 39
00512         G 38
00513         T 37
00514         A 36
00515 
00516         You may agree that using zip(rec.seq, ...) is more explicit than using
00517         zip(rec, ...) as shown above.
00518         """
00519         return iter(self.seq)

def Bio.SeqRecord.SeqRecord.__len__ (   self )
Returns the length of the sequence.

For example, using Bio.SeqIO to read in a FASTA nucleotide file:

>>> from Bio import SeqIO
>>> record = SeqIO.read(open("Fasta/sweetpea.nu"),"fasta")
>>> len(record)
309
>>> len(record.seq)
309

Definition at line 702 of file SeqRecord.py.

00702 
00703     def __len__(self):
00704         """Returns the length of the sequence.
00705 
00706         For example, using Bio.SeqIO to read in a FASTA nucleotide file:
00707 
00708         >>> from Bio import SeqIO
00709         >>> record = SeqIO.read(open("Fasta/sweetpea.nu"),"fasta")
00710         >>> len(record)
00711         309
00712         >>> len(record.seq)
00713         309
00714         """
00715         return len(self.seq)

def Bio.SeqRecord.SeqRecord.__nonzero__ (   self )
Returns True regardless of the length of the sequence.

This behaviour is for backwards compatibility, since until the
__len__ method was added, a SeqRecord always evaluated as True.

Note that in comparison, a Seq object will evaluate to False if it
has a zero length sequence.

WARNING: The SeqRecord may in future evaluate to False when its
sequence is of zero length (in order to better match the Seq
object behaviour)!
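
Given the warning above, code that relies on a record's truth value may change behaviour in a later release. A more robust pattern is to test explicitly for None or for length. The sketch below uses a hypothetical FakeRecord class that mimics the always-true behaviour described here; it is not Biopython code.

```python
class FakeRecord(object):
    """Hypothetical stand-in that is always true, like SeqRecord here."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    def __nonzero__(self):  # Python 2 truth hook
        return True

    __bool__ = __nonzero__  # Python 3 truth hook


def has_sequence(record):
    """Explicit test that does not depend on the record's truth value."""
    return record is not None and len(record) > 0


empty = FakeRecord("")
```

Here bool(empty) is True even though len(empty) is 0, whereas has_sequence(empty) gives the answer most callers want.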

Definition at line 716 of file SeqRecord.py.

00716 
00717     def __nonzero__(self):
00718         """Returns True regardless of the length of the sequence.
00719 
00720         This behaviour is for backwards compatibility, since until the
00721         __len__ method was added, a SeqRecord always evaluated as True.
00722 
00723         Note that in comparison, a Seq object will evaluate to False if it
00724         has a zero length sequence.
00725 
00726         WARNING: The SeqRecord may in future evaluate to False when its
00727         sequence is of zero length (in order to better match the Seq
00728         object behaviour)!
00729         """
00730         return True

def Bio.SeqRecord.SeqRecord.__radd__ (   self,
  other 
)
Add another sequence or string to this sequence (from the left).

This method handles adding a Seq object (or similar, e.g. MutableSeq)
or a plain Python string (on the left) to a SeqRecord (on the right).
See the __add__ method for more details, but for example:

>>> from Bio import SeqIO
>>> handle = open("Quality/solexa_faked.fastq", "rU")
>>> record = SeqIO.read(handle, "fastq-solexa")
>>> handle.close()
>>> print record.id, record.seq
slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
>>> print record.letter_annotations.keys()
['solexa_quality']

>>> new = "ACT" + record
>>> print new.id, new.seq
slxa_0001_1_0001_01 ACTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
>>> print new.letter_annotations.keys()
[]
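
Python calls __radd__ on the right-hand operand when the left-hand operand does not know how to add it, which is what makes "ACT" + record work. The sketch below shows the mechanism with a hypothetical MiniRecord class (not Biopython code): features, modelled as (start, end) tuples, are shifted by the length of the prefix, and per-letter annotations are dropped because the new letters have none.

```python
class MiniRecord(object):
    """Hypothetical record: a string, (start, end) features, per-letter data."""

    def __init__(self, seq, features=None, letter_annotations=None):
        self.seq = seq
        self.features = features or []
        self.letter_annotations = letter_annotations or {}

    def __radd__(self, other):
        # Called for `other + self` when `other` is, e.g., a plain string.
        offset = len(other)
        shifted = [(start + offset, end + offset) for start, end in self.features]
        # No letter_annotations: the prefix has no per-letter values.
        return MiniRecord(other + self.seq, features=shifted)


record = MiniRecord("ACGTNN", features=[(0, 4)],
                    letter_annotations={"quality": [40, 39, 38, 37, 0, 0]})
new = "ACT" + record
```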

Definition at line 841 of file SeqRecord.py.

00841 
00842     def __radd__(self, other):
00843         """Add another sequence or string to this sequence (from the left).
00844 
00845         This method handles adding a Seq object (or similar, e.g. MutableSeq)
00846         or a plain Python string (on the left) to a SeqRecord (on the right).
00847         See the __add__ method for more details, but for example:
00848 
00849         >>> from Bio import SeqIO
00850         >>> handle = open("Quality/solexa_faked.fastq", "rU")
00851         >>> record = SeqIO.read(handle, "fastq-solexa")
00852         >>> handle.close()
00853         >>> print record.id, record.seq
00854         slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
00855         >>> print record.letter_annotations.keys()
00856         ['solexa_quality']
00857 
00858         >>> new = "ACT" + record
00859         >>> print new.id, new.seq
00860         slxa_0001_1_0001_01 ACTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
00861         >>> print new.letter_annotations.keys()
00862         []
00863         """
00864         if isinstance(other, SeqRecord):
00865             raise RuntimeError("This should have happened via the __add__ of "
00866                                "the other SeqRecord being added!")
00867         #Assume it is a string or a Seq.
00868         #Note can't transfer any per-letter-annotations
00869         offset = len(other)
00870         return SeqRecord(other + self.seq,
00871                          id = self.id, name = self.name,
00872                          description = self.description,
00873                          features = [f._shift(offset) for f in self.features],
00874                          annotations = self.annotations.copy(),
00875                          dbxrefs = self.dbxrefs[:])

def Bio.SeqRecord.SeqRecord.__repr__ (   self )
A concise summary of the record for debugging (string).

The Python built-in function repr works by calling the object's __repr__
method, e.g.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import generic_protein
>>> rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT"
...                    +"GEMKEQTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQ"
...                    +"SGQDRYTTEVVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQ"
...                    +"PQQPQGGNQFSGGAQSRPQQSAPAAPSNEPPMDFDDDIPF",
...                    generic_protein),
...                 id="NP_418483.1", name="b4059",
...                 description="ssDNA-binding protein",
...                 dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])
>>> print repr(rec)
SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])

At the python prompt you can also use this shorthand:

>>> rec
SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])

Note that long sequences are shown truncated. Also note that any
annotations, letter_annotations and features are not shown (as they
would lead to a very long string).
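
The truncated output above keeps the first 54 letters and the last 3, joined by three dots. The sketch below reproduces that layout for a plain string; short_repr is a hypothetical helper, and the 60-letter threshold is an assumption rather than something stated on this page.

```python
def short_repr(sequence, threshold=60, head=54, tail=3):
    """Return a quoted sequence, abbreviated when longer than `threshold`."""
    if len(sequence) > threshold:
        return "'%s...%s'" % (sequence[:head], sequence[-tail:])
    return "'%s'" % sequence


long_seq = "A" * 54 + "C" * 10 + "GGT"
```

Short sequences pass through unchanged, so short_repr("ACGT") simply returns the quoted string.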

Definition at line 606 of file SeqRecord.py.

00606 
00607     def __repr__(self):
00608         """A concise summary of the record for debugging (string).
00609 
00610         The Python built-in function repr works by calling the object's __repr__
00611         method, e.g.
00612 
00613         >>> from Bio.Seq import Seq
00614         >>> from Bio.SeqRecord import SeqRecord
00615         >>> from Bio.Alphabet import generic_protein
00616         >>> rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT"
00617         ...                    +"GEMKEQTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQ"
00618         ...                    +"SGQDRYTTEVVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQ"
00619         ...                    +"PQQPQGGNQFSGGAQSRPQQSAPAAPSNEPPMDFDDDIPF",
00620         ...                    generic_protein),
00621         ...                 id="NP_418483.1", name="b4059",
00622         ...                 description="ssDNA-binding protein",
00623         ...                 dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])
00624         >>> print repr(rec)
00625         SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])
00626 
00627         At the python prompt you can also use this shorthand:
00628 
00629         >>> rec
00630         SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])
00631 
00632         Note that long sequences are shown truncated. Also note that any
00633         annotations, letter_annotations and features are not shown (as they
00634         would lead to a very long string).
00635         """
00636         return self.__class__.__name__ \
00637          + "(seq=%s, id=%s, name=%s, description=%s, dbxrefs=%s)" \
00638          % tuple(map(repr, (self.seq, self.id, self.name,
00639                             self.description, self.dbxrefs)))

def Bio.SeqRecord.SeqRecord.__str__ (   self )
A human readable summary of the record and its annotation (string).

The Python built-in function str works by calling the object's __str__
method, e.g.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import IUPAC
>>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
...                         IUPAC.protein),
...                    id="YP_025292.1", name="HokC",
...                    description="toxic membrane protein, small")
>>> print str(record)
ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

In this example you don't actually need to call str explicitly, as the
print command does this automatically:

>>> print record
ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

Note that long sequences are shown truncated.

Definition at line 553 of file SeqRecord.py.

00553 
00554     def __str__(self):
00555         """A human readable summary of the record and its annotation (string).
00556 
00557         The Python built-in function str works by calling the object's __str__
00558         method, e.g.
00559 
00560         >>> from Bio.Seq import Seq
00561         >>> from Bio.SeqRecord import SeqRecord
00562         >>> from Bio.Alphabet import IUPAC
00563         >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
00564         ...                         IUPAC.protein),
00565         ...                    id="YP_025292.1", name="HokC",
00566         ...                    description="toxic membrane protein, small")
00567         >>> print str(record)
00568         ID: YP_025292.1
00569         Name: HokC
00570         Description: toxic membrane protein, small
00571         Number of features: 0
00572         Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
00573 
00574         In this example you don't actually need to call str explicitly, as the
00575         print command does this automatically:
00576 
00577         >>> print record
00578         ID: YP_025292.1
00579         Name: HokC
00580         Description: toxic membrane protein, small
00581         Number of features: 0
00582         Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
00583 
00584         Note that long sequences are shown truncated.
00585         """
00586         lines = []
00587         if self.id:
00588             lines.append("ID: %s" % self.id)
00589         if self.name:
00590             lines.append("Name: %s" % self.name)
00591         if self.description:
00592             lines.append("Description: %s" % self.description)
00593         if self.dbxrefs:
00594             lines.append("Database cross-references: " \
00595                          + ", ".join(self.dbxrefs))
00596         lines.append("Number of features: %i" % len(self.features))
00597         for a in self.annotations:
00598             lines.append("/%s=%s" % (a, str(self.annotations[a])))
00599         if self.letter_annotations:
00600             lines.append("Per letter annotation for: " \
00601                          + ", ".join(self.letter_annotations.keys()))
00602         #Don't want to include the entire sequence,
00603         #and showing the alphabet is useful:
00604         lines.append(repr(self.seq))
00605         return "\n".join(lines)

def Bio.SeqRecord.SeqRecord._set_per_letter_annotations (   self,
  value 
) [private]

Definition at line 230 of file SeqRecord.py.

00230 
00231     def _set_per_letter_annotations(self, value):
00232         if not isinstance(value, dict):
00233             raise TypeError("The per-letter-annotations should be a "
00234                             "(restricted) dictionary.")
00235         #Turn this into a restricted-dictionary (and check the entries)
00236         try:
00237             self._per_letter_annotations = _RestrictedDict(length=len(self.seq))
00238         except AttributeError:
00239             #e.g. seq is None
00240             self._per_letter_annotations = _RestrictedDict(length=0)
00241         self._per_letter_annotations.update(value)

def Bio.SeqRecord.SeqRecord._set_seq (   self,
  value 
) [private]

Definition at line 293 of file SeqRecord.py.

00293 
00294     def _set_seq(self, value):
00295         #TODO - Add a deprecation warning that the seq should be write only?
00296         if self._per_letter_annotations:
00297             #TODO - Make this a warning? Silently empty the dictionary?
00298             raise ValueError("You must empty the letter annotations first!")
00299         self._seq = value
00300         try:
00301             self._per_letter_annotations = _RestrictedDict(length=len(self.seq))
00302         except AttributeError:
00303             #e.g. seq is None
00304             self._per_letter_annotations = _RestrictedDict(length=0)
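
The two private setters above cooperate: per-letter annotations live in a dictionary that only accepts values of the same length as the sequence, and the sequence cannot be replaced while that dictionary is non-empty. The sketch below illustrates the idea with hypothetical LengthCheckedDict and MiniRecord classes. It is a simplified stand-in for _RestrictedDict and SeqRecord, whose full implementation is not shown on this page.

```python
class LengthCheckedDict(dict):
    """Hypothetical dict that only accepts values of one fixed length."""

    def __init__(self, length):
        dict.__init__(self)
        self._length = int(length)

    def __setitem__(self, key, value):
        if len(value) != self._length:
            raise TypeError("Expected a sequence of length %i" % self._length)
        dict.__setitem__(self, key, value)


class MiniRecord(object):
    """Hypothetical record whose seq setter guards the per-letter data."""

    def __init__(self, seq):
        self._seq = seq
        self.letter_annotations = LengthCheckedDict(len(seq))

    @property
    def seq(self):
        return self._seq

    @seq.setter
    def seq(self, value):
        if self.letter_annotations:
            raise ValueError("You must empty the letter annotations first!")
        self._seq = value
        self.letter_annotations = LengthCheckedDict(len(value))


rec = MiniRecord("ACGT")
rec.letter_annotations["quality"] = [40, 30, 20, 10]
```

To change the sequence of such a record, clear the per-letter annotations first, assign the new sequence, then add annotations of the new length.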

def Bio.SeqRecord.SeqRecord.format (   self,
  format 
)

Definition at line 640 of file SeqRecord.py.

00640 
00641     def format(self, format):
00642         r"""Returns the record as a string in the specified file format.
00643 
00644         The format should be a lower case string supported as an output
00645         format by Bio.SeqIO, which is used to turn the SeqRecord into a
00646         string.  e.g.
00647 
00648         >>> from Bio.Seq import Seq
00649         >>> from Bio.SeqRecord import SeqRecord
00650         >>> from Bio.Alphabet import IUPAC
00651         >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
00652         ...                         IUPAC.protein),
00653         ...                    id="YP_025292.1", name="HokC",
00654         ...                    description="toxic membrane protein")
00655         >>> record.format("fasta")
00656         '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'
00657         >>> print record.format("fasta")
00658         >YP_025292.1 toxic membrane protein
00659         MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
00660         <BLANKLINE>
00661 
00662         The python print command automatically appends a new line, meaning
00663         in this example a blank line is shown.  If you look at the string
00664         representation you can see there is a trailing new line (shown as
00665         slash n) which is important when writing to a file or if
00666         concatenating multiple sequence strings together.
00667 
00668         Note that this method will NOT work on every possible file format
00669         supported by Bio.SeqIO (e.g. some are for multiple sequences only).
00670         """
00671         #See also the __format__ added for Python 2.6 / 3.0, PEP 3101
00672         #See also the Bio.Align.Generic.Alignment class and its format()
00673         return self.__format__(format)
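
The trailing newline matters mostly when several records are joined into one file. The sketch below uses a hypothetical to_fasta helper (not part of Biopython) that builds the same kind of string as record.format("fasta") does in the docstring example, and shows that entries ending in a newline can be concatenated directly. The 60-letter line width is an assumption.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Build a FASTA entry that ends with a newline (hypothetical helper)."""
    lines = [">%s %s" % (identifier, description)]
    # Wrap the sequence; 60 letters per line is assumed here.
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


first = to_fasta("YP_025292.1", "toxic membrane protein",
                 "MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF")
second = to_fasta("example_2", "made-up record", "ACGT")

# Because each entry ends in "\n", plain concatenation gives a valid file body.
combined = first + second
```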

def Bio.SeqRecord.SeqRecord.lower (   self )
Returns a copy of the record with a lower case sequence.

All the annotation is preserved unchanged. e.g.

>>> from Bio import SeqIO
>>> record = SeqIO.read("Fasta/aster.pro", "fasta")
>>> print record.format("fasta")
>gi|3298468|dbj|BAA31520.1| SAMIPF
GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVG
VTNALVFEIVMTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI
<BLANKLINE>
>>> print record.lower().format("fasta")
>gi|3298468|dbj|BAA31520.1| SAMIPF
gghvnpavtfgafvggnitllrgivyiiaqllgstvaclllkfvtndmavgvfslsagvg
vtnalvfeivmtfglvytvyataidpkkgslgtiapiaigfivgani
<BLANKLINE>

To take a more annotation rich example,

>>> from Bio import SeqIO
>>> old = SeqIO.read("EMBL/TRBG361.embl", "embl")
>>> len(old.features)
3
>>> new = old.lower()
>>> len(old.features) == len(new.features)
True
>>> old.annotations["organism"] == new.annotations["organism"]
True
>>> old.dbxrefs == new.dbxrefs
True

Definition at line 911 of file SeqRecord.py.

00911 
00912     def lower(self):
00913         """Returns a copy of the record with a lower case sequence.
00914 
00915         All the annotation is preserved unchanged. e.g.
00916 
00917         >>> from Bio import SeqIO
00918         >>> record = SeqIO.read("Fasta/aster.pro", "fasta")
00919         >>> print record.format("fasta")
00920         >gi|3298468|dbj|BAA31520.1| SAMIPF
00921         GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVG
00922         VTNALVFEIVMTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI
00923         <BLANKLINE>
00924         >>> print record.lower().format("fasta")
00925         >gi|3298468|dbj|BAA31520.1| SAMIPF
00926         gghvnpavtfgafvggnitllrgivyiiaqllgstvaclllkfvtndmavgvfslsagvg
00927         vtnalvfeivmtfglvytvyataidpkkgslgtiapiaigfivgani
00928         <BLANKLINE>
00929 
00930         To take a more annotation rich example,
00931 
00932         >>> from Bio import SeqIO
00933         >>> old = SeqIO.read("EMBL/TRBG361.embl", "embl")
00934         >>> len(old.features)
00935         3
00936         >>> new = old.lower()
00937         >>> len(old.features) == len(new.features)
00938         True
00939         >>> old.annotations["organism"] == new.annotations["organism"]
00940         True
00941         >>> old.dbxrefs == new.dbxrefs
00942         True
00943         """
00944         return SeqRecord(self.seq.lower(),
00945                          id = self.id, name = self.name,
00946                          description = self.description,
00947                          dbxrefs = self.dbxrefs[:],
00948                          features = self.features[:],
00949                          annotations = self.annotations.copy(),
00950                          letter_annotations=self.letter_annotations.copy())
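
Note that the listing above copies the containers with [:] and .copy(), which are shallow copies: the new record gets its own list and dictionary, but the objects inside them are shared with the original. The sketch below demonstrates this with plain Python objects; the Feature class is a hypothetical stand-in for SeqFeature.

```python
class Feature(object):
    """Hypothetical stand-in for a SeqFeature."""

    def __init__(self, label):
        self.label = label


old_features = [Feature("gene"), Feature("CDS")]
old_annotations = {"organism": "Example organism", "keywords": ["a", "b"]}

# Same operations as in lower(): slice copy and dict.copy().
new_features = old_features[:]
new_annotations = old_annotations.copy()

# The containers are independent...
new_features.append(Feature("extra"))
new_annotations["organism"] = "Changed"

# ...but their contents are shared, so in-place changes show up in both.
new_features[0].label = "renamed"
new_annotations["keywords"].append("c")
```

If fully independent annotation is needed, copy.deepcopy from the standard library is one option.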

def Bio.SeqRecord.SeqRecord.reverse_complement (   self,
  id = False,
  name = False,
  description = False,
  features = True,
  annotations = False,
  letter_annotations = True,
  dbxrefs = False 
)
Returns new SeqRecord with reverse complement sequence.

You can specify the returned record's id, name and description as
strings, or True to keep that of the parent, or False for a default.

You can specify the returned record's features with a list of
SeqFeature objects, or True to keep that of the parent, or False to
omit them. The default is to keep the original features (with the
strand and locations adjusted).

You can also specify both the returned record's annotations and
letter_annotations as dictionaries, True to keep that of the parent,
or False to omit them. By default the annotations are omitted,
while the letter annotations are kept (reversed).

To show what happens to the per-letter annotations, consider an
example Solexa variant FASTQ file with a single entry, which we'll
read in as a SeqRecord:

>>> from Bio import SeqIO
>>> handle = open("Quality/solexa_faked.fastq", "rU")
>>> record = SeqIO.read(handle, "fastq-solexa")
>>> handle.close()
>>> print record.id, record.seq
slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
>>> print record.letter_annotations.keys()
['solexa_quality']
>>> print record.letter_annotations["solexa_quality"]
[40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]

Now take the reverse complement,

>>> rc_record = record.reverse_complement(id=record.id+"_rc")
>>> print rc_record.id, rc_record.seq
slxa_0001_1_0001_01_rc NNNNNNACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT

Notice that the per-letter-annotations have also been reversed,
although this may not be appropriate for all cases.

>>> print rc_record.letter_annotations["solexa_quality"]
[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
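
The same bookkeeping can be sketched for a plain string and list. This is an illustration only: reverse_complement_with_quals is a hypothetical helper, and the complement table covers just A, C, G, T and N.

```python
# Minimal complement table (assumption: only unambiguous bases plus N).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement_with_quals(sequence, quals):
    """Reverse-complement a DNA string and reverse its per-letter scores."""
    rc = "".join(COMPLEMENT[base] for base in reversed(sequence))
    return rc, quals[::-1]


seq = "ACGT" * 10 + "N" * 6
quals = list(range(40, -6, -1))  # 40 down to -5, one score per letter
rc_seq, rc_quals = reverse_complement_with_quals(seq, quals)
```

Reversing the scores keeps each value attached to the same physical base, which is usually what is wanted for quality scores but may not suit every kind of per-letter annotation.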

Now for the features, we need a different example. Parsing a GenBank
file is probably the easiest way to get a nice example with features
in it...

>>> from Bio import SeqIO
>>> handle = open("GenBank/pBAD30.gb")
>>> plasmid = SeqIO.read(handle, "gb")
>>> handle.close()
>>> print plasmid.id, len(plasmid)
pBAD30 4923
>>> plasmid.seq
Seq('GCTAGCGGAGTGTATACTGGCTTACTATGTTGGCACTGATGAGGGTGTCAGTGA...ATG', IUPACAmbiguousDNA())
>>> len(plasmid.features)
13

Now, let's take the reverse complement of this whole plasmid:

>>> rc_plasmid = plasmid.reverse_complement(id=plasmid.id+"_rc")
>>> print rc_plasmid.id, len(rc_plasmid)
pBAD30_rc 4923
>>> rc_plasmid.seq
Seq('CATGGGCAAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCA...AGC', IUPACAmbiguousDNA())
>>> len(rc_plasmid.features)
13

Let's compare the first CDS feature - it has gone from being the
second feature (index 1) to the second last feature (index -2), its
strand has changed, and the location switched round.

>>> print plasmid.features[1]
type: CDS
location: [1081:1960](-)
qualifiers: 
    Key: label, Value: ['araC']
    Key: note, Value: ['araC regulator of the arabinose BAD promoter']
    Key: vntifkey, Value: ['4']
<BLANKLINE>
>>> print rc_plasmid.features[-2]
type: CDS
location: [2963:3842](+)
qualifiers: 
    Key: label, Value: ['araC']
    Key: note, Value: ['araC regulator of the arabinose BAD promoter']
    Key: vntifkey, Value: ['4']
<BLANKLINE>

You can check this new location, based on the length of the plasmid:

>>> len(plasmid) - 1081
3842
>>> len(plasmid) - 1960
2963
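
The arithmetic above generalises: on a sequence of length n, a feature at [start:end] maps to [n - end:n - start] on the reverse complement, with the strand sign inverted. A minimal sketch, where flip_location is a hypothetical helper and not the library's _flip method:

```python
def flip_location(start, end, strand, length):
    """Map a half-open [start:end] location onto the reverse complement."""
    return length - end, length - start, -strand


# The araC CDS from the example: [1081:1960](-) on a 4923 bp plasmid.
new_start, new_end, new_strand = flip_location(1081, 1960, -1, 4923)
```

Applying the function twice returns the original location, which is a quick sanity check for this kind of coordinate conversion.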

Note that if the SeqFeature annotation includes any strand specific
information (e.g. base changes for a SNP), this information is not
amended, and would need correction after the reverse complement.

Note trying to reverse complement a protein SeqRecord raises an
exception:

>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_rec = SeqRecord(Seq("MAIVMGR", IUPAC.protein), id="Test")
>>> protein_rec.reverse_complement()
Traceback (most recent call last):
   ...
ValueError: Proteins do not have complements!

Also note you can reverse complement a SeqRecord using a MutableSeq:

>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import generic_dna
>>> rec = SeqRecord(MutableSeq("ACGT", generic_dna), id="Test")
>>> rec.seq[0] = "T"
>>> print rec.id, rec.seq
Test TCGT
>>> rc = rec.reverse_complement(id=True)
>>> print rc.id, rc.seq
Test ACGA

Definition at line 953 of file SeqRecord.py.

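00952     def reverse_complement(self, id=False, name=False, description=False,
00953                            features=True, annotations=False,
00954                            letter_annotations=True, dbxrefs=False):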
00955         """Returns new SeqRecord with reverse complement sequence.
00956 
00957         You can specify the returned record's id, name and description as
00958         strings, or True to keep that of the parent, or False for a default.
00959 
00960         You can specify the returned record's features with a list of
00961         SeqFeature objects, or True to keep that of the parent, or False to
00962         omit them. The default is to keep the original features (with the
00963         strand and locations adjusted).
00964 
00965         You can also specify both the returned record's annotations and
00966         letter_annotations as dictionaries, True to keep that of the parent,
00967         or False to omit them. By default the annotations are omitted,
00968         while the letter annotations are kept (reversed).
00969 
00970         To show what happens to the per-letter annotations, consider an
00971         example Solexa variant FASTQ file with a single entry, which we'll
00972         read in as a SeqRecord:
00973 
00974         >>> from Bio import SeqIO
00975         >>> handle = open("Quality/solexa_faked.fastq", "rU")
00976         >>> record = SeqIO.read(handle, "fastq-solexa")
00977         >>> handle.close()
00978         >>> print record.id, record.seq
00979         slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
00980         >>> print record.letter_annotations.keys()
00981         ['solexa_quality']
00982         >>> print record.letter_annotations["solexa_quality"]
00983         [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
00984 
00985         Now take the reverse complement,
00986 
00987         >>> rc_record = record.reverse_complement(id=record.id+"_rc")
00988         >>> print rc_record.id, rc_record.seq
00989         slxa_0001_1_0001_01_rc NNNNNNACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
00990 
00991         Notice that the per-letter-annotations have also been reversed,
00992         although this may not be appropriate for all cases.
00993 
00994         >>> print rc_record.letter_annotations["solexa_quality"]
00995         [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
00996 
00997         Now for the features, we need a different example. Parsing a GenBank
00998         file is probably the easiest way to get a nice example with features
00999         in it...
01000 
01001         >>> from Bio import SeqIO
01002         >>> handle = open("GenBank/pBAD30.gb")
01003         >>> plasmid = SeqIO.read(handle, "gb")
01004         >>> handle.close()
01005         >>> print plasmid.id, len(plasmid)
01006         pBAD30 4923
01007         >>> plasmid.seq
01008         Seq('GCTAGCGGAGTGTATACTGGCTTACTATGTTGGCACTGATGAGGGTGTCAGTGA...ATG', IUPACAmbiguousDNA())
01009         >>> len(plasmid.features)
01010         13
01011 
01012         Now, let's take the reverse complement of this whole plasmid:
01013 
01014         >>> rc_plasmid = plasmid.reverse_complement(id=plasmid.id+"_rc")
01015         >>> print rc_plasmid.id, len(rc_plasmid)
01016         pBAD30_rc 4923
01017         >>> rc_plasmid.seq
01018         Seq('CATGGGCAAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCA...AGC', IUPACAmbiguousDNA())
01019         >>> len(rc_plasmid.features)
01020         13
01021 
01022         Let's compare the first CDS feature - it has gone from being the
01023         second feature (index 1) to the second last feature (index -2), its
01024         strand has changed, and the location switched round.
01025 
01026         >>> print plasmid.features[1]
01027         type: CDS
01028         location: [1081:1960](-)
01029         qualifiers: 
01030             Key: label, Value: ['araC']
01031             Key: note, Value: ['araC regulator of the arabinose BAD promoter']
01032             Key: vntifkey, Value: ['4']
01033         <BLANKLINE>
01034         >>> print rc_plasmid.features[-2]
01035         type: CDS
01036         location: [2963:3842](+)
01037         qualifiers: 
01038             Key: label, Value: ['araC']
01039             Key: note, Value: ['araC regulator of the arabinose BAD promoter']
01040             Key: vntifkey, Value: ['4']
01041         <BLANKLINE>
01042 
01043         You can check this new location, based on the length of the plasmid:
01044 
01045         >>> len(plasmid) - 1081
01046         3842
01047         >>> len(plasmid) - 1960
01048         2963
01049 
01050         Note that if the SeqFeature annotation includes any strand specific
01051         information (e.g. base changes for a SNP), this information is not
01052         amended, and would need correction after the reverse complement.
01053 
01054         Note trying to reverse complement a protein SeqRecord raises an
01055         exception:
01056 
01057         >>> from Bio.SeqRecord import SeqRecord
01058         >>> from Bio.Seq import Seq
01059         >>> from Bio.Alphabet import IUPAC
01060         >>> protein_rec = SeqRecord(Seq("MAIVMGR", IUPAC.protein), id="Test")
01061         >>> protein_rec.reverse_complement()
01062         Traceback (most recent call last):
01063            ...
01064         ValueError: Proteins do not have complements!
01065 
01066         Also note you can reverse complement a SeqRecord using a MutableSeq:
01067 
01068         >>> from Bio.SeqRecord import SeqRecord
01069         >>> from Bio.Seq import MutableSeq
01070         >>> from Bio.Alphabet import generic_dna
01071         >>> rec = SeqRecord(MutableSeq("ACGT", generic_dna), id="Test")
01072         >>> rec.seq[0] = "T"
01073         >>> print rec.id, rec.seq
01074         Test TCGT
01075         >>> rc = rec.reverse_complement(id=True)
01076         >>> print rc.id, rc.seq
01077         Test ACGA
01078         """
01079         from Bio.Seq import MutableSeq #Lazy to avoid circular imports
01080         if isinstance(self.seq, MutableSeq):
01081             #Currently the MutableSeq reverse complement is in situ
01082             answer = SeqRecord(self.seq.toseq().reverse_complement())
01083         else:
01084             answer = SeqRecord(self.seq.reverse_complement())
01085         if isinstance(id, basestring):
01086             answer.id = id
01087         elif id:
01088             answer.id = self.id
01089         if isinstance(name, basestring):
01090             answer.name = name
01091         elif name:
01092             answer.name = self.name
01093         if isinstance(description, basestring):
01094             answer.description = description
01095         elif description:
01096             answer.description = self.description
01097         if isinstance(dbxrefs, list):
01098             answer.dbxrefs = dbxrefs
01099         elif dbxrefs:
01100             #Copy the old dbxrefs
01101             answer.dbxrefs = self.dbxrefs[:]
01102         if isinstance(features, list):
01103             answer.features = features
01104         elif features:
01105             #Copy the old features, adjusting location and strand
01106             l = len(answer)
01107             answer.features = [f._flip(l) for f in self.features]
01108             #The old list should have been sorted by start location,
01109             #reversing it will leave it sorted by what is now the end position,
01110             #so we need to re-sort in case of overlapping features.
01111             #NOTE - In the common case of gene before CDS (and similar) with
01112             #the exact same locations, this will still maintain gene before CDS
01113             answer.features.sort(key=lambda x : x.location.start.position)
01114         if isinstance(annotations, dict):
01115             answer.annotations = annotations
01116         elif annotations:
01117             #Copy the old annotations
01118             answer.annotations = self.annotations.copy()
01119         if isinstance(letter_annotations, dict):
01120             answer.letter_annotations = letter_annotations
01121         elif letter_annotations:
01122             #Copy the old per letter annotations, reversing them
01123             for key, value in self.letter_annotations.iteritems():
01124                 answer._per_letter_annotations[key] = value[::-1]
01125         return answer
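
Each keyword argument in the listing above is resolved by the same three-way rule: a value of the expected type is used as given, any other true value (typically True) carries the attribute over from the original record, and a false value leaves the new record with its constructor default. The following is a minimal sketch of that rule only. `resolve_option` is a hypothetical helper, not part of Biopython, and the listing additionally copies lists and dictionaries before reusing them. It is written for Python 3.

```python
def resolve_option(option, original, default, expected_type):
    """Pick the value a new record should get for one attribute.

    Hypothetical helper mirroring the if/elif pattern in the listing.
    """
    if isinstance(option, expected_type):
        return option      # explicit replacement value, used as given
    if option:
        return original    # e.g. True: carry over from the old record
    return default         # e.g. False: keep the constructor default


# "<unknown id>" is used here as an example default value.
print(resolve_option("NewID", "Test", "<unknown id>", str))
print(resolve_option(True, "Test", "<unknown id>", str))
print(resolve_option(False, "Test", "<unknown id>", str))
```

One consequence of testing the type first is that an empty string or empty list counts as an explicit value, so it replaces the attribute rather than selecting the default.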

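The feature and per-letter handling in the listing above can be illustrated without Biopython. This is a sketch under stated assumptions, not the library's implementation: `flip_interval` is a hypothetical stand-in for the private `SeqFeature._flip` method, and features are modelled as half-open `(start, end, strand)` tuples on a sequence of known length. It is written for Python 3.

```python
def flip_interval(start, end, strand, length):
    """Map a half-open interval onto the reverse strand of a sequence.

    Hypothetical helper; assumes 0 <= start <= end <= length.
    """
    return (length - end, length - start, -strand)


def reverse_annotations(features, letter_annotations, length):
    """Flip features, re-sort them by start, and reverse per-letter values."""
    flipped = [flip_interval(s, e, strand, length) for (s, e, strand) in features]
    # list.sort is stable, so features with equal starts keep their relative order.
    flipped.sort(key=lambda f: f[0])
    reversed_letters = {key: value[::-1] for key, value in letter_annotations.items()}
    return flipped, reversed_letters


features = [(0, 3, 1), (5, 9, -1)]
quality = {"phred_quality": [10, 20, 30, 40, 40, 40, 30, 20, 10, 5]}
new_features, new_quality = reverse_annotations(features, quality, length=10)
print(new_features)
print(new_quality["phred_quality"])
```

The stable sort is what keeps a gene ahead of a CDS that shares its exact location, as the comment in the listing notes.
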
Returns a copy of the record with an upper case sequence.

All the annotation is preserved unchanged. e.g.

>>> from Bio.Alphabet import generic_dna
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> record = SeqRecord(Seq("acgtACGT", generic_dna), id="Test",
...                    description = "Made up for this example")
>>> record.letter_annotations["phred_quality"] = [1,2,3,4,5,6,7,8]
>>> print record.upper().format("fastq")
@Test Made up for this example
ACGTACGT
+
"#$%&'()
<BLANKLINE>

Naturally, there is a matching lower method:

>>> print record.lower().format("fastq")
@Test Made up for this example
acgtacgt
+
"#$%&'()
<BLANKLINE>
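
The quality line in both outputs above follows from the Sanger FASTQ convention, which encodes each PHRED score q as the ASCII character with code q + 33. A minimal sketch of that arithmetic, written for Python 3 (`encode_phred` is a hypothetical helper, not a Biopython function):

```python
SANGER_OFFSET = 33  # ASCII offset used by Sanger-style FASTQ


def encode_phred(scores):
    """Return the FASTQ quality string for a list of PHRED scores."""
    return "".join(chr(score + SANGER_OFFSET) for score in scores)


# Scores 1..8 map to ASCII codes 34..41.
quality_line = encode_phred([1, 2, 3, 4, 5, 6, 7, 8])
print(quality_line)
```

Because the scores are stored per letter and not derived from the sequence, changing the case of the sequence leaves this line identical.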

Definition at line 876 of file SeqRecord.py.

00876 
00877     def upper(self):
00878         """Returns a copy of the record with an upper case sequence.
00879 
00880         All the annotation is preserved unchanged. e.g.
00881 
00882         >>> from Bio.Alphabet import generic_dna
00883         >>> from Bio.Seq import Seq
00884         >>> from Bio.SeqRecord import SeqRecord
00885         >>> record = SeqRecord(Seq("acgtACGT", generic_dna), id="Test",
00886         ...                    description = "Made up for this example")
00887         >>> record.letter_annotations["phred_quality"] = [1,2,3,4,5,6,7,8]
00888         >>> print record.upper().format("fastq")
00889         @Test Made up for this example
00890         ACGTACGT
00891         +
00892         "#$%&'()
00893         <BLANKLINE>
00894 
00895         Naturally, there is a matching lower method:
00896         
00897         >>> print record.lower().format("fastq")
00898         @Test Made up for this example
00899         acgtacgt
00900         +
00901         "#$%&'()
00902         <BLANKLINE>
00903         """
00904         return SeqRecord(self.seq.upper(),
00905                          id = self.id, name = self.name,
00906                          description = self.description,
00907                          dbxrefs = self.dbxrefs[:],
00908                          features = self.features[:],
00909                          annotations = self.annotations.copy(),
00910                          letter_annotations=self.letter_annotations.copy())
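
The constructor call above copies the containers (`dbxrefs[:]`, `features[:]`, `annotations.copy()`) but not the objects inside them. The following sketch uses plain Python lists and dictionaries, not a SeqRecord, to show what such a shallow copy does and does not isolate. The cross-reference strings are made up for the example. It is written for Python 3.

```python
original_dbxrefs = ["GenBank:ABC123"]            # hypothetical cross-reference
original_annotations = {"keywords": ["example"]}

copied_dbxrefs = original_dbxrefs[:]             # new list, same elements
copied_annotations = original_annotations.copy() # new dict, same values

# Appending to the copied list leaves the original list untouched.
copied_dbxrefs.append("PDB:1XYZ")

# The inner list is shared, so mutating it is visible through both dicts.
copied_annotations["keywords"].append("shared")

print(original_dbxrefs)
print(original_annotations["keywords"])
```

Under this reading, adding or removing entries on the record returned by `upper` should not affect the original, while mutating a shared value in place (such as a list stored in `annotations`) would be visible from both records.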


Member Data Documentation

_per_letter_annotations

Reimplemented in BioSQL.BioSeq.DBSeqRecord.

Definition at line 209 of file SeqRecord.py.

_seq

Reimplemented in BioSQL.BioSeq.DBSeqRecord.

Definition at line 186 of file SeqRecord.py.

annotations

Reimplemented in BioSQL.BioSeq.DBSeqRecord.

Definition at line 203 of file SeqRecord.py.

dbxrefs

Reimplemented in BioSQL.BioSeq.DBSeqRecord.

Definition at line 196 of file SeqRecord.py.

description

Definition at line 189 of file SeqRecord.py.

features

Reimplemented in BioSQL.BioSeq.DBSeqRecord.

Definition at line 227 of file SeqRecord.py.

id

Reimplemented in BioSQL.BioSeq.DBSeqRecord.

Definition at line 187 of file SeqRecord.py.

name

Definition at line 188 of file SeqRecord.py.


Property Documentation

Initial value:
property( \
        fget=lambda self : self._per_letter_annotations,
        fset=_set_per_letter_annotations,
        doc="""Dictionary of per-letter-annotation for the sequence.

        For example, this can hold quality scores used in FASTQ or QUAL files.
        Consider this example using Bio.SeqIO to read in an example Solexa
        variant FASTQ file as a SeqRecord:

        >>> from Bio import SeqIO
        >>> handle = open("Quality/solexa_faked.fastq", "rU")
        >>> record = SeqIO.read(handle, "fastq-solexa")
        >>> handle.close()
        >>> print record.id, record.seq
        slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
        >>> print record.letter_annotations.keys()
        ['solexa_quality']
        >>> print record.letter_annotations["solexa_quality"]
        [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]

        The letter_annotations get sliced automatically if you slice the
        parent SeqRecord, for example taking the last ten bases:

        >>> sub_record = record[-10:]
        >>> print sub_record.id, sub_record.seq
        slxa_0001_1_0001_01 ACGTNNNNNN
        >>> print sub_record.letter_annotations["solexa_quality"]
        [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]

        Any python sequence (i.e. list, tuple or string) can be recorded in
        the SeqRecord's letter_annotations dictionary as long as the length
        matches that of the SeqRecord's sequence.  e.g.

        >>> len(sub_record.letter_annotations)
        1
        >>> sub_record.letter_annotations["dummy"] = "abcdefghij"
        >>> len(sub_record.letter_annotations)
        2

        You can delete entries from the letter_annotations dictionary as usual:

        >>> del sub_record.letter_annotations["solexa_quality"]
        >>> sub_record.letter_annotations
        {'dummy': 'abcdefghij'}

        You can completely clear the dictionary easily as follows:

        >>> sub_record.letter_annotations = {}
        >>> sub_record.letter_annotations
        {}
        """)

Definition at line 241 of file SeqRecord.py.
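
The length rule described in the letter_annotations docstring can be sketched with a small dict subclass exposed through a property. This is an illustration under assumptions, not Biopython's code: `LengthCheckedDict` and `Record` are hypothetical names, and the real class may validate values differently. It is written for Python 3.

```python
class LengthCheckedDict(dict):
    """Dict that only accepts sequence values of one fixed length (hypothetical)."""

    def __init__(self, length):
        super().__init__()
        self._length = int(length)

    def __setitem__(self, key, value):
        if not hasattr(value, "__len__") or len(value) != self._length:
            raise TypeError(
                "expected a sequence of length %d for %r" % (self._length, key))
        super().__setitem__(key, value)


class Record:
    """Minimal stand-in holding a sequence string and per-letter annotations."""

    def __init__(self, seq):
        self.seq = seq
        self._per_letter = LengthCheckedDict(len(seq))

    def _set_letter_annotations(self, value):
        # Rebuild the checked dict so every entry is validated on assignment.
        checked = LengthCheckedDict(len(self.seq))
        for key, item in value.items():
            checked[key] = item
        self._per_letter = checked

    letter_annotations = property(
        fget=lambda self: self._per_letter,
        fset=_set_letter_annotations,
    )


rec = Record("ACGTNNNNNN")
rec.letter_annotations["dummy"] = "abcdefghij"   # length 10, accepted
try:
    rec.letter_annotations["bad"] = [1, 2, 3]    # wrong length, rejected
except TypeError as err:
    print("rejected:", err)
rec.letter_annotations = {}                      # clears all entries
print(len(rec.letter_annotations))
```

Routing assignment through the property setter means that replacing the whole dictionary is checked in the same way as setting a single key.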

Initial value:
property(fget=lambda self : self._seq,
                   fset=_set_seq,
                   doc="The sequence itself, as a Seq or MutableSeq object.")

Reimplemented in BioSQL.BioSeq.DBSeqRecord.

Definition at line 305 of file SeqRecord.py.


The documentation for this class was generated from the following file: