
python-biopython  1.60
Bio.GenBank.Scanner.EmblScanner Class Reference


Public Member Functions

def parse_footer
def set_handle
def find_start
def parse_header
def parse_features
def parse_feature
def feed
def parse
def parse_records
def parse_cds_features

Public Attributes

 line
 debug
 handle

Static Public Attributes

string RECORD_START = "ID   "
int HEADER_WIDTH = 5
list FEATURE_START_MARKERS = ["FH   Key             Location/Qualifiers","FH"]
list FEATURE_END_MARKERS = ["XX"]
int FEATURE_QUALIFIER_INDENT = 21
string FEATURE_QUALIFIER_SPACER = "FT"
list SEQUENCE_HEADERS = ["SQ", "CO"]

Private Member Functions

def _feed_first_line
def _feed_first_line_old
def _feed_first_line_new
def _feed_seq_length
def _feed_header_lines
def _feed_misc_lines

Detailed Description

For extracting chunks of information in EMBL files

Definition at line 532 of file Scanner.py.


Member Function Documentation

def Bio.GenBank.Scanner.EmblScanner._feed_first_line (   self,
  consumer,
  line 
) [private]
Handle the LOCUS/ID line, passing data to the consumer

This should be implemented by the EMBL / GenBank specific subclass

Used by the parse_records() and parse() methods.

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 578 of file Scanner.py.

00578 
00579     def _feed_first_line(self, consumer, line):
00580         assert line[:self.HEADER_WIDTH].rstrip() == "ID"
00581         if line[self.HEADER_WIDTH:].count(";") == 6:
00582             #Looks like the semi colon separated style introduced in 2006
00583             self._feed_first_line_new(consumer, line)
00584         elif line[self.HEADER_WIDTH:].count(";") == 3:
00585             #Looks like the pre 2006 style
00586             self._feed_first_line_old(consumer, line)
00587         else:
00588             raise ValueError('Did not recognise the ID line layout:\n' + line)
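
The layout check above can be tried in isolation. The following is a minimal standalone sketch, written for Python 3 rather than the Python 2 of the listing; `id_line_style` is a hypothetical helper name and not part of Biopython. It assumes only the rule shown in the listing: six semicolons after the five-character header mean the 2006 layout, three mean the older one.

```python
HEADER_WIDTH = 5  # same value as the HEADER_WIDTH class attribute


def id_line_style(line):
    """Classify an EMBL ID line as 'new' (2006 onwards) or 'old'.

    Hypothetical helper for illustration; it mirrors the semicolon
    counting used by _feed_first_line but returns a label instead of
    dispatching to another method.
    """
    if line[:HEADER_WIDTH].rstrip() != "ID":
        raise ValueError("Not an ID line:\n" + line)
    semicolons = line[HEADER_WIDTH:].count(";")
    if semicolons == 6:
        return "new"
    if semicolons == 3:
        return "old"
    raise ValueError("Did not recognise the ID line layout:\n" + line)


print(id_line_style("ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."))
print(id_line_style("ID   SC10H5 standard; DNA; PRO; 4870 BP."))
```

Unlike the listing, this sketch raises ValueError for a non-ID line instead of using assert, so the check still runs when Python is started with optimisations enabled.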


def Bio.GenBank.Scanner.EmblScanner._feed_first_line_new (   self,
  consumer,
  line 
) [private]

Definition at line 612 of file Scanner.py.

00612 
00613     def _feed_first_line_new(self, consumer, line):
00614         #Expects an ID line in the style introduced in 2006, e.g.
00615         #ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
00616         #ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
00617         assert line[:self.HEADER_WIDTH].rstrip() == "ID"
00618         fields = [data.strip() for data in line[self.HEADER_WIDTH:].strip().split(";")]
00619         assert len(fields) == 7
00620         """
00621         The tokens represent:
00622            0. Primary accession number
00623            1. Sequence version number
00624            2. Topology: 'circular' or 'linear'
00625            3. Molecule type (e.g. 'genomic DNA')
00626            4. Data class (e.g. 'STD')
00627            5. Taxonomic division (e.g. 'PRO')
00628            6. Sequence length (e.g. '4639675 BP.')
00629         """
00630 
00631         consumer.locus(fields[0])
00632 
00633         #Call the accession consumer now, to make sure we record
00634         #something as the record.id, in case there is no AC line
00635         consumer.accession(fields[0])
00636 
00637         #TODO - How to deal with the version field?  At the moment the consumer
00638         #will try and use this for the ID which isn't ideal for EMBL files.
00639         version_parts = fields[1].split()
00640         if len(version_parts)==2 \
00641         and version_parts[0]=="SV" \
00642         and version_parts[1].isdigit():
00643             consumer.version_suffix(version_parts[1])
00644 
00645         #Based on how the old GenBank parser worked, merge these two:
00646         consumer.residue_type(" ".join(fields[2:4])) #TODO - Store as two fields?
00647 
00648         #consumer.xxx(fields[4]) #TODO - What should we do with the data class?
00649 
00650         consumer.data_file_division(fields[5])
00651 
00652         self._feed_seq_length(consumer, fields[6])
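
As a rough illustration of the seven-field split described in the listing, here is a standalone Python 3 sketch. The function name `split_new_id_line` and the dictionary keys are assumptions made for this example; the real method passes each value to a consumer object instead of returning a dictionary.

```python
def split_new_id_line(line, header_width=5):
    """Split a 2006-style EMBL ID line into named fields.

    Hypothetical helper: the field order follows the token list in
    _feed_first_line_new, the key names are made up for illustration.
    """
    fields = [part.strip() for part in line[header_width:].strip().split(";")]
    if len(fields) != 7:
        raise ValueError("Expected 7 fields, got %d" % len(fields))
    # The version is only accepted in the form 'SV <digits>'.
    version_parts = fields[1].split()
    if (len(version_parts) == 2 and version_parts[0] == "SV"
            and version_parts[1].isdigit()):
        version = version_parts[1]
    else:
        version = None
    return {
        "accession": fields[0],
        "version": version,
        # Topology and molecule type are merged, as in the listing.
        "residue_type": " ".join(fields[2:4]),
        "data_class": fields[4],
        "division": fields[5],
        "length": fields[6],
    }


print(split_new_id_line("ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."))
```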


def Bio.GenBank.Scanner.EmblScanner._feed_first_line_old (   self,
  consumer,
  line 
) [private]

Definition at line 589 of file Scanner.py.

00589 
00590     def _feed_first_line_old(self, consumer, line):
00591         #Expects an ID line in the style before 2006, e.g.
00592         #ID   SC10H5 standard; DNA; PRO; 4870 BP.
00593         #ID   BSUB9999   standard; circular DNA; PRO; 4214630 BP.
00594         assert line[:self.HEADER_WIDTH].rstrip() == "ID"
00595         fields = [line[self.HEADER_WIDTH:].split(None,1)[0]]
00596         fields.extend(line[self.HEADER_WIDTH:].split(None,1)[1].split(";"))
00597         fields = [entry.strip() for entry in fields]
00598         """
00599         The tokens represent:
00600            0. Primary accession number
00601            (space sep)
00602            1. ??? (e.g. standard)
00603            (semi-colon)
00604            2. Topology and/or Molecule type (e.g. 'circular DNA' or 'DNA')
00605            3. Taxonomic division (e.g. 'PRO')
00606            4. Sequence length (e.g. '4639675 BP.')
00607         """
00608         consumer.locus(fields[0]) #Should we also call the accession consumer?
00609         consumer.residue_type(fields[2])
00610         consumer.data_file_division(fields[3])
00611         self._feed_seq_length(consumer, fields[4])        
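
The pre-2006 layout separates the first field by whitespace and the rest by semicolons. The standalone Python 3 sketch below reproduces just that tokenising step; `split_old_id_line` is a hypothetical name used for illustration only.

```python
def split_old_id_line(line, header_width=5):
    """Tokenise a pre-2006 EMBL ID line into five stripped fields.

    Hypothetical helper mirroring the field handling in
    _feed_first_line_old: the entry name ends at the first run of
    whitespace, the remaining fields are separated by semicolons.
    """
    name, rest = line[header_width:].split(None, 1)
    fields = [name] + rest.split(";")
    return [entry.strip() for entry in fields]


print(split_old_id_line("ID   SC10H5 standard; DNA; PRO; 4870 BP."))
print(split_old_id_line("ID   BSUB9999   standard; circular DNA; PRO; 4214630 BP."))
```

Index 2 holds the topology and/or molecule type, index 3 the taxonomic division and index 4 the length text, matching the indices used in the listing.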


def Bio.GenBank.Scanner.EmblScanner._feed_header_lines (   self,
  consumer,
  lines 
) [private]
Handle the header lines (list of strings), passing data to the consumer

This should be implemented by the EMBL / GenBank specific subclass

Used by the parse_records() and parse() methods.

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 659 of file Scanner.py.

00659 
00660     def _feed_header_lines(self, consumer, lines):
00661         EMBL_INDENT = self.HEADER_WIDTH
00662         EMBL_SPACER = " "  * EMBL_INDENT
00663         consumer_dict = {
00664             'AC' : 'accession',
00665             'SV' : 'version', # SV line removed in June 2006, now part of ID line
00666             'DE' : 'definition',
00667             #'RN' : 'reference_num',
00668             #'RC' : reference comment... TODO
00669             #'RP' : 'reference_bases',
00670             #'RX' : reference cross reference... DOI or Pubmed
00671             'RG' : 'consrtm', #optional consortium
00672             #'RA' : 'authors',
00673             #'RT' : 'title',
00674             'RL' : 'journal',
00675             'OS' : 'organism',
00676             'OC' : 'taxonomy',
00677             #'DR' : data reference
00678             'CC' : 'comment',
00679             #'XX' : splitter
00680         }
00681         #We have to handle the following specially:
00682         #RX (depending on reference type...)
00683         for line in lines:
00684             line_type = line[:EMBL_INDENT].strip()
00685             data = line[EMBL_INDENT:].strip()
00686             if line_type == 'XX':
00687                 pass
00688             elif line_type == 'RN':
00689                 # Reformat reference numbers for the GenBank based consumer
00690                 # e.g. '[1]' becomes '1'
00691                 if data[0] == "[" and data[-1] == "]" : data = data[1:-1]
00692                 consumer.reference_num(data)
00693             elif line_type == 'RP':
00694                 # Reformat reference numbers for the GenBank based consumer
00695                 # e.g. '1-4639675' becomes '(bases 1 to 4639675)'
00696                 # and '160-550, 904-1055' becomes '(bases 160 to 550; 904 to 1055)'
00697                 parts = [bases.replace("-"," to ").strip() for bases in data.split(",")]
00698                 consumer.reference_bases("(bases %s)" % "; ".join(parts))
00699             elif line_type == 'RT':
00700                 #Remove the enclosing quotes and trailing semi colon.
00701                 #Note the title can be split over multiple lines.
00702                 if data.startswith('"'):
00703                     data = data[1:]
00704                 if data.endswith('";'):
00705                     data = data[:-2]
00706                 consumer.title(data)
00707             elif line_type == 'RX':
00708                 # EMBL support three reference types at the moment:
00709                 # - PUBMED    PUBMED bibliographic database (NLM)
00710                 # - DOI       Digital Object Identifier (International DOI Foundation)
00711                 # - AGRICOLA  US National Agriculture Library (NAL) of the US Department
00712                 #             of Agriculture (USDA)
00713                 #
00714                 # Format:
00715                 # RX  resource_identifier; identifier.
00716                 #
00717                 # e.g.
00718                 # RX   DOI; 10.1016/0024-3205(83)90010-3.
00719                 # RX   PUBMED; 264242.
00720                 #
00721                 # Currently our reference object only supports PUBMED and MEDLINE
00722                 # (as these were in GenBank files?).
00723                 key, value = data.split(";",1)
00724                 if value.endswith(".") : value = value[:-1]
00725                 value = value.strip()
00726                 if key == "PUBMED":
00727                     consumer.pubmed_id(value)
00728                 #TODO - Handle other reference types (here and in BioSQL bindings)
00729             elif line_type == 'CC':
00730                 # Have to pass a list of strings for this one (not just a string)
00731                 consumer.comment([data])
00732             elif line_type == 'DR':
00733                 # Database Cross-reference, format:
00734                 # DR   database_identifier; primary_identifier; secondary_identifier.
00735                 #
00736                 # e.g.
00737                 # DR   MGI; 98599; Tcrb-V4.
00738                 #
00739                 # TODO - How should we store any secondary identifier?
00740                 parts = data.rstrip(".").split(";")
00741                 #Turn it into "database_identifier:primary_identifier" to
00742                 #mimic the GenBank parser. e.g. "MGI:98599"
00743                 consumer.dblink("%s:%s" % (parts[0].strip(),
00744                                            parts[1].strip()))
00745             elif line_type == 'RA':
00746                 # Remove trailing ; at end of authors list
00747                 consumer.authors(data.rstrip(";"))
00748             elif line_type == 'PR':
00749                 # Remove trailing ; at end of the project reference
00750                 # In GenBank files this corresponds to the old PROJECT
00751                 # line which is being replaced with the DBLINK line.
00752                 consumer.project(data.rstrip(";"))
00753             elif line_type in consumer_dict:
00754                 #Its a semi-automatic entry!
00755                 getattr(consumer, consumer_dict[line_type])(data)
00756             else:
00757                 if self.debug:
00758                     print "Ignoring EMBL header line:\n%s" % line
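
Three of the reformatting steps in the listing (RP, RX and DR lines) are easy to check on their own. The following standalone Python 3 sketch restates them as plain functions; the function names are hypothetical, and the inputs are the example values quoted in the listing's comments.

```python
def reformat_rp(data):
    """Turn EMBL RP data such as '160-550, 904-1055' into GenBank style."""
    parts = [bases.replace("-", " to ").strip() for bases in data.split(",")]
    return "(bases %s)" % "; ".join(parts)


def split_rx(data):
    """Split RX data into (resource, identifier), dropping the final dot."""
    key, value = data.split(";", 1)
    if value.endswith("."):
        value = value[:-1]
    return key, value.strip()


def reformat_dr(data):
    """Turn DR data into 'database:primary_identifier'.

    Any secondary identifier is dropped, as in the listing.
    """
    parts = data.rstrip(".").split(";")
    return "%s:%s" % (parts[0].strip(), parts[1].strip())


print(reformat_rp("160-550, 904-1055"))
print(split_rx("PUBMED; 264242."))
print(reformat_dr("MGI; 98599; Tcrb-V4."))
```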

def Bio.GenBank.Scanner.EmblScanner._feed_misc_lines (   self,
  consumer,
  lines 
) [private]
Handle any lines between features and sequence (list of strings), passing data to the consumer

This should be implemented by the EMBL / GenBank specific subclass

Used by the parse_records() and parse() methods.

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 759 of file Scanner.py.

00759 
00760     def _feed_misc_lines(self, consumer, lines):
00761         #TODO - Should we do something with the information on the SQ line(s)?
00762         lines.append("")
00763         line_iter = iter(lines)
00764         try:
00765             for line in line_iter:
00766                 if line.startswith("CO   "):
00767                     line = line[5:].strip()
00768                     contig_location = line
00769                     while True:
00770                         line = line_iter.next()
00771                         if not line:
00772                             break
00773                         elif line.startswith("CO   "):
00774                             #Don't need to preserve the whitespace here.
00775                             contig_location += line[5:].strip()
00776                         else:
00777                             raise ValueError('Expected CO (contig) continuation line, got:\n' + line)
00778                     consumer.contig_location(contig_location)
00779             return
00780         except StopIteration:
00781             raise ValueError("Problem in misc lines before sequence")
00782 
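
The CO handling can be summarised as "strip the five-character prefix from each CO line and concatenate the rest". Below is a simplified standalone Python 3 sketch of that idea. `join_contig_lines` is a hypothetical name, the accession-like strings in the example are made up, and the control flow is a simplification: it does not use the empty sentinel line or the StopIteration handling of the listing.

```python
def join_contig_lines(lines):
    """Concatenate the data of consecutive CO lines into one string.

    Simplified, hypothetical variant of _feed_misc_lines: lines seen
    before the first CO line (for example SQ lines) are skipped, and a
    non-empty, non-CO line after a CO line is treated as an error.
    Returns None when there is no CO line at all.
    """
    pieces = []
    for line in lines:
        if line.startswith("CO   "):
            pieces.append(line[5:].strip())
        elif pieces and line:
            raise ValueError(
                "Expected CO (contig) continuation line, got:\n" + line)
    return "".join(pieces) if pieces else None


# Made-up example input, not taken from a real EMBL record.
example = [
    "CO   join(ABC00001.1:1..100,gap(50),",
    "CO   ABC00002.1:1..200)",
]
print(join_contig_lines(example))
```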

def Bio.GenBank.Scanner.EmblScanner._feed_seq_length (   self,
  consumer,
  text 
) [private]

Definition at line 653 of file Scanner.py.

00653 
00654     def _feed_seq_length(self, consumer, text):
00655         length_parts = text.split()
00656         assert len(length_parts) == 2
00657         assert length_parts[1].upper() in ["BP", "BP.", "AA."]
00658         consumer.size(length_parts[0])
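
For illustration, the length check can be written as a small standalone Python 3 function. `parse_seq_length` is a hypothetical name; unlike the listing, which hands the length to the consumer as a string, this sketch converts it to an integer and raises ValueError instead of using assert.

```python
def parse_seq_length(text):
    """Extract the number from length text such as '1859 BP.'.

    Hypothetical helper: accepts the same three unit spellings as
    _feed_seq_length (BP, BP. and AA., compared case-insensitively).
    """
    parts = text.split()
    if len(parts) != 2 or parts[1].upper() not in ("BP", "BP.", "AA."):
        raise ValueError("Unexpected sequence length text: %r" % text)
    return int(parts[0])


print(parse_seq_length("1859 BP."))
```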


def Bio.GenBank.Scanner.InsdcScanner.feed (   self,
  handle,
  consumer,
  do_features = True 
) [inherited]
Feed a set of data into the consumer.

This method is intended for use with the "old" code in Bio.GenBank

Arguments:
handle - A handle with the information to parse.
consumer - The consumer that should be informed of events.
do_features - Boolean, should the features be parsed?
      Skipping the features can be much faster.

Return values:
True  - Parsed a record
False - Did not find a record

Definition at line 367 of file Scanner.py.

00367 
00368     def feed(self, handle, consumer, do_features=True):
00369         """Feed a set of data into the consumer.
00370 
00371         This method is intended for use with the "old" code in Bio.GenBank
00372 
00373         Arguments:
00374         handle - A handle with the information to parse.
00375         consumer - The consumer that should be informed of events.
00376         do_features - Boolean, should the features be parsed?
00377                       Skipping the features can be much faster.
00378 
00379         Return values:
00380         true  - Passed a record
00381         false - Did not find a record
00382         """        
00383         #Should work with both EMBL and GenBank files provided the
00384         #equivalent Bio.GenBank._FeatureConsumer methods are called...
00385         self.set_handle(handle)
00386         if not self.find_start():
00387             #Could not find (another) record
00388             consumer.data=None
00389             return False
00390                        
00391         #We use the above class methods to parse the file into a simplified format.
00392         #The first line, header lines and any misc lines after the features will be
00393         #dealt with by GenBank / EMBL specific derived classes.
00394 
00395         #First line and header:
00396         self._feed_first_line(consumer, self.line)
00397         self._feed_header_lines(consumer, self.parse_header())
00398 
00399         #Features (common to both EMBL and GenBank):
00400         if do_features:
00401             self._feed_feature_table(consumer, self.parse_features(skip=False))
00402         else:
00403             self.parse_features(skip=True) # ignore the data
00404         
00405         #Footer and sequence
00406         misc_lines, sequence_string = self.parse_footer()
00407         self._feed_misc_lines(consumer, misc_lines)
00408 
00409         consumer.sequence(sequence_string)
00410         #Calls to consumer.base_number() do nothing anyway
00411         consumer.record_end("//")
00412 
00413         assert self.line == "//"
00414 
00415         #And we are done
00416         return True
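
The scanner and consumer cooperate through plain method calls: the scanner decides what was found, the consumer decides what to do with it. The standalone Python 3 sketch below shows that pattern with a consumer that merely records the calls it receives. `RecordingConsumer` and `feed_minimal` are hypothetical names, and the event sequence is heavily reduced compared with feed().

```python
class RecordingConsumer:
    """Hypothetical consumer that records every event it receives."""

    def __init__(self):
        self.events = []

    def __getattr__(self, name):
        # Only called for attributes that do not exist, so every
        # unknown method name is treated as an event.
        def record(*args):
            self.events.append((name, args))
        return record


def feed_minimal(consumer, id_line, sequence):
    """Emit a reduced version of the event sequence used by feed()."""
    consumer.locus(id_line[5:].split(";")[0].strip())
    consumer.sequence(sequence)
    consumer.record_end("//")
    return True


consumer = RecordingConsumer()
feed_minimal(consumer,
             "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
             "ACGT")
print(consumer.events)
```

A recording consumer of this kind can be handy for seeing which events a scanner emits before writing a consumer that builds real objects.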


def Bio.GenBank.Scanner.InsdcScanner.find_start (   self) [inherited]
Read lines until the ID/LOCUS line is found, and return that line.

Any preamble (such as the header used by the NCBI on *.seq.gz archives)
will be ignored.

Definition at line 66 of file Scanner.py.

00066 
00067     def find_start(self):
00068         """Read in lines until find the ID/LOCUS line, which is returned.
00069         
00070         Any preamble (such as the header used by the NCBI on *.seq.gz archives)
00071         will be ignored."""
00072         while True:
00073             if self.line:
00074                 line = self.line
00075                 self.line = ""
00076             else:
00077                 line = self.handle.readline()
00078             if not line:
00079                 if self.debug : print "End of file"
00080                 return None
00081             if line[:self.HEADER_WIDTH]==self.RECORD_START:
00082                 if self.debug > 1: print "Found the start of a record:\n" + line
00083                 break
00084             line = line.rstrip()
00085             if line == "//":
00086                 if self.debug > 1: print "Skipping // marking end of last record"
00087             elif line == "":
00088                 if self.debug > 1: print "Skipping blank line before record"
00089             else:
00090                 #Ignore any header before the first ID/LOCUS line.
00091                 if self.debug > 1:
00092                         print "Skipping header line before record:\n" + line
00093         self.line = line
00094         return line
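
The search for a record start comes down to comparing the first HEADER_WIDTH characters of each line with RECORD_START. Here is a simplified standalone Python 3 sketch using an in-memory handle. `find_record_start` is a hypothetical name, and the sketch omits the buffered self.line and the debug output of the listing.

```python
import io


def find_record_start(handle, record_start="ID   ", header_width=5):
    """Return the first line that starts a record, or None at end of file.

    Hypothetical helper: any preamble, blank lines and '//' terminators
    before the record are skipped without comment.
    """
    for line in handle:
        if line[:header_width] == record_start:
            return line
    return None


handle = io.StringIO(
    "Some archive preamble\n"
    "\n"
    "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.\n"
    "XX\n"
)
print(find_record_start(handle))
```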


def Bio.GenBank.Scanner.InsdcScanner.parse (   self,
  handle,
  do_features = True 
) [inherited]
Returns a SeqRecord (with SeqFeatures if do_features=True)

See also the method parse_records() for use on multi-record files.

Definition at line 417 of file Scanner.py.

00417 
00418     def parse(self, handle, do_features=True):
00419         """Returns a SeqRecord (with SeqFeatures if do_features=True)
00420 
00421         See also the method parse_records() for use on multi-record files.
00422         """
00423         from Bio.GenBank import _FeatureConsumer
00424         from Bio.GenBank.utils import FeatureValueCleaner
00425 
00426         consumer = _FeatureConsumer(use_fuzziness = 1, 
00427                     feature_cleaner = FeatureValueCleaner())
00428 
00429         if self.feed(handle, consumer, do_features):
00430             return consumer.data
00431         else:
00432             return None
00433 
    


def Bio.GenBank.Scanner.InsdcScanner.parse_cds_features (   self,
  handle,
  alphabet = generic_protein,
  tags2id = ('protein_id','locus_tag','product')
) [inherited]
Returns an iterator of SeqRecord objects.

Each CDS feature becomes a SeqRecord.

alphabet - Used for any sequence found in a translation field.
tags2id  - Tuple of three strings, the feature keys to use
   for the record id, name and description.

This method is intended for use in Bio.SeqIO

Definition at line 454 of file Scanner.py.

00454     def parse_cds_features(self, handle, alphabet=generic_protein,
00455                            tags2id=('protein_id','locus_tag','product')):
00456         """Returns SeqRecord object iterator
00457 
00458         Each CDS feature becomes a SeqRecord.
00459 
00460         alphabet - Used for any sequence found in a translation field.
00461         tags2id  - Tuple of three strings, the feature keys to use
00462                    for the record id, name and description,
00463 
00464         This method is intended for use in Bio.SeqIO
00465         """
00466         self.set_handle(handle)
00467         while self.find_start():
00468             #Got an EMBL or GenBank record...
00469             self.parse_header() # ignore header lines!
00470             feature_tuples = self.parse_features()
00471             #self.parse_footer() # ignore footer lines!
00472             while True:
00473                 line = self.handle.readline()
00474                 if not line : break
00475                 if line[:2]=="//" : break
00476             self.line = line.rstrip()
00477 
00478             #Now go through those features...
00479             for key, location_string, qualifiers in feature_tuples:
00480                 if key=="CDS":
00481                     #Create SeqRecord
00482                     #================
00483                     #SeqRecord objects cannot be created with annotations, they
00484                     #must be added afterwards.  So create an empty record and
00485                     #then populate it:
00486                     record = SeqRecord(seq=None)
00487                     annotations = record.annotations
00488 
00489                     #Should we add a location object to the annotations?
00490                     #I *think* that only makes sense for SeqFeatures with their
00491                     #sub features...
00492                     annotations['raw_location'] = location_string.replace(' ','')
00493 
00494                     for (qualifier_name, qualifier_data) in qualifiers:
00495                         if qualifier_data is not None \
00496                         and qualifier_data[0]=='"' and qualifier_data[-1]=='"':
00497                             #Remove quotes
00498                             qualifier_data = qualifier_data[1:-1]
00499                         #Append the data to the annotation qualifier...
00500                         if qualifier_name == "translation":
00501                             assert record.seq is None, "Multiple translations!"
00502                             record.seq = Seq(qualifier_data.replace("\n",""), alphabet)
00503                         elif qualifier_name == "db_xref":
00504                             #its a list, possibly empty.  Its safe to extend
00505                             record.dbxrefs.append(qualifier_data)
00506                         else:
00507                             if qualifier_data is not None:
00508                                 qualifier_data = qualifier_data.replace("\n"," ").replace("  "," ")
00509                             try:
00510                                 annotations[qualifier_name] += " " + qualifier_data
00511                             except KeyError:
00512                                 #Not an addition to existing data, its the first bit
00513                                 annotations[qualifier_name]= qualifier_data
00514                         
00515                     #Fill in the ID, Name, Description
00516                     #=================================
00517                     try:
00518                         record.id = annotations[tags2id[0]]
00519                     except KeyError:
00520                         pass
00521                     try:
00522                         record.name = annotations[tags2id[1]]
00523                     except KeyError:
00524                         pass
00525                     try:
00526                         record.description = annotations[tags2id[2]]
00527                     except KeyError:
00528                         pass
00529 
00530                     yield record
00531 
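
The qualifier handling in the listing has two parts: collecting qualifier values into a dictionary, then picking the record id, name and description by key. The standalone Python 3 sketch below shows both steps with plain dictionaries instead of SeqRecord objects. The function names are hypothetical, and the special cases for "translation" and "db_xref" are left out.

```python
def collect_qualifiers(qualifiers):
    """Build an annotations dict from (name, value) qualifier tuples.

    Hypothetical helper: enclosing double quotes are removed, newlines
    become spaces, and repeated names are joined with a space.
    """
    annotations = {}
    for name, data in qualifiers:
        if data is None:
            # Valueless qualifier such as /pseudo.
            annotations.setdefault(name, None)
            continue
        if data[:1] == '"' and data[-1:] == '"':
            data = data[1:-1]
        data = data.replace("\n", " ").replace("  ", " ")
        if annotations.get(name) is not None:
            annotations[name] += " " + data
        else:
            annotations[name] = data
    return annotations


def name_record(annotations, tags2id=("protein_id", "locus_tag", "product")):
    """Pick id, name and description, skipping keys that are missing."""
    chosen = {}
    for attribute, tag in zip(("id", "name", "description"), tags2id):
        if tag in annotations:
            chosen[attribute] = annotations[tag]
    return chosen


qualifiers = [
    ("locus_tag", '"NEQ001"'),
    ("codon_start", "1"),
    ("product", '"hypothetical protein"'),
    ("protein_id", '"NP_963295.1"'),
]
print(name_record(collect_qualifiers(qualifiers)))
```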


def Bio.GenBank.Scanner.InsdcScanner.parse_feature (   self,
  feature_key,
  lines 
) [inherited]
Expects a feature as a list of strings, returns a tuple (key, location, qualifiers)

For example given this GenBank feature:

     CDS             complement(join(490883..490885,1..879))
                     /locus_tag="NEQ001"
                     /note="conserved hypothetical [Methanococcus jannaschii];
                     COG1583:Uncharacterized ACR; IPR001472:Bipartite nuclear
                     localization signal; IPR002743: Protein of unknown
                     function DUF57"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_963295.1"
                     /db_xref="GI:41614797"
                     /db_xref="GeneID:2732620"
                     /translation="MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK
                     EKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTK
                     KFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
                     IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFE
                     EAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGS
                     LNSMGFGFVNTKKNSAR"

For this feature, the caller should pass feature_key="CDS" and the rest of the
data as a list of strings,
lines=["complement(join(490883..490885,1..879))", ..., "LNSMGFGFVNTKKNSAR"],
where the leading spaces and trailing newlines have been removed.

Returns tuple containing: (key as string, location string, qualifiers as list)
as follows for this example:

key = "CDS", string
location = "complement(join(490883..490885,1..879))", string
qualifiers = list of string tuples:

[('locus_tag', '"NEQ001"'),
 ('note', '"conserved hypothetical [Methanococcus jannaschii];\nCOG1583:..."'),
 ('codon_start', '1'),
 ('transl_table', '11'),
 ('product', '"hypothetical protein"'),
 ('protein_id', '"NP_963295.1"'),
 ('db_xref', '"GI:41614797"'),
 ('db_xref', '"GeneID:2732620"'),
 ('translation', '"MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK\nEKYFNFT..."')]

In the above example, the "note" and "translation" were edited for compactness,
and they would contain multiple new line characters (displayed above as \n)

If a qualifier is quoted (in this case, everything except codon_start and
transl_table) then the quotes are NOT removed.

Note that no whitespace is removed.

Definition at line 192 of file Scanner.py.

00192 
00193     def parse_feature(self, feature_key, lines):
00194         """Expects a feature as a list of strings, returns a tuple (key, location, qualifiers)
00195 
00196         For example given this GenBank feature:
00197 
00198              CDS             complement(join(490883..490885,1..879))
00199                              /locus_tag="NEQ001"
00200                              /note="conserved hypothetical [Methanococcus jannaschii];
00201                              COG1583:Uncharacterized ACR; IPR001472:Bipartite nuclear
00202                              localization signal; IPR002743: Protein of unknown
00203                              function DUF57"
00204                              /codon_start=1
00205                              /transl_table=11
00206                              /product="hypothetical protein"
00207                              /protein_id="NP_963295.1"
00208                              /db_xref="GI:41614797"
00209                              /db_xref="GeneID:2732620"
00210                              /translation="MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK
00211                              EKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTK
00212                              KFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
00213                              IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFE
00214                              EAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGS
00215                              LNSMGFGFVNTKKNSAR"
00216 
00217         Then should give input key="CDS" and the rest of the data as a list of strings
00218         lines=["complement(join(490883..490885,1..879))", ..., "LNSMGFGFVNTKKNSAR"]
00219         where the leading spaces and trailing newlines have been removed.
00220 
00221         Returns tuple containing: (key as string, location string, qualifiers as list)
00222         as follows for this example:
00223 
00224         key = "CDS", string
00225         location = "complement(join(490883..490885,1..879))", string
00226         qualifiers = list of string tuples:
00227 
00228         [('locus_tag', '"NEQ001"'),
00229          ('note', '"conserved hypothetical [Methanococcus jannaschii];\nCOG1583:..."'),
00230          ('codon_start', '1'),
00231          ('transl_table', '11'),
00232          ('product', '"hypothetical protein"'),
00233          ('protein_id', '"NP_963295.1"'),
00234          ('db_xref', '"GI:41614797"'),
00235          ('db_xref', '"GeneID:2732620"'),
00236          ('translation', '"MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK\nEKYFNFT..."')]
00237 
00238         In the above example, the "note" and "translation" were edited for compactness,
00239         and they would contain multiple new line characters (displayed above as \n)
00240 
00241         If a qualifier is quoted (in this case, everything except codon_start and
00242         transl_table) then the quotes are NOT removed.
00243 
00244         Note that no whitespace is removed.
00245         """
00246         #Skip any blank lines
00247         iterator = iter(filter(None, lines))
00248         try:
00249             line = iterator.next()
00250 
00251             feature_location = line.strip()
00252             while feature_location[-1:]==",":
00253                 #Multiline location, still more to come!
00254                 line = iterator.next()
00255                 feature_location += line.strip()
00256 
00257             qualifiers=[]
00258 
00259             for i, line in enumerate(iterator):
00260                 # check for extra wrapping of the location closing parentheses
00261                 if i == 0 and line.startswith(")"):
00262                     feature_location += line.strip()
00263                 elif line[0]=="/":
00264                     #New qualifier
00265                     i = line.find("=")
00266                     key = line[1:i] #does not work if i==-1
00267                     value = line[i+1:] #we ignore 'value' if i==-1
00268                     if i==-1:
00269                         #Qualifier with no key, e.g. /pseudo
00270                         key = line[1:]
00271                         qualifiers.append((key,None))
00272                     elif not value:
00273                         #ApE can output /note=
00274                         qualifiers.append((key,""))
00275                     elif value[0]=='"':
00276                         #Quoted...
00277                         if value[-1]!='"' or value!='"':
00278                             #No closing quote on the first line...
00279                             while value[-1] != '"':
00280                                 value += "\n" + iterator.next()
00281                         else:
00282                             #One single line (quoted)
00283                             assert value == '"'
00284                             if self.debug : print "Quoted line %s:%s" % (key, value)
00285                         #DO NOT remove the quotes...
00286                         qualifiers.append((key,value))
00287                     else:
00288                         #Unquoted
00289                         #if debug : print "Unquoted line %s:%s" % (key,value)
00290                         qualifiers.append((key,value))
00291                 else:
00292                     #Unquoted continuation
00293                     assert len(qualifiers) > 0
00294                     assert key==qualifiers[-1][0]
00295                     #if debug : print "Unquoted Cont %s:%s" % (key, line)
00296                     qualifiers[-1] = (key, qualifiers[-1][1] + "\n" + line)
00297             return (feature_key, feature_location, qualifiers)
00298         except StopIteration:
00299             #Bummer
00300             raise ValueError("Problem with '%s' feature:\n%s" \
00301                               % (feature_key, "\n".join(lines)))
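
To make the qualifier rules concrete, here is a simplified standalone Python 3 sketch of the same idea. `parse_qualifier_lines` is a hypothetical name. Two deliberate differences from the listing are assumptions of this sketch: a value consisting of a lone opening quote is treated as continuing on the next line, and the special case for a wrapped closing parenthesis of the location is left out.

```python
def parse_qualifier_lines(lines):
    """Return (location, qualifiers) from stripped feature lines.

    Hypothetical, simplified variant of parse_feature. Quotes are kept
    on quoted values, and multi-line values are joined with newlines.
    """
    iterator = iter([line for line in lines if line])  # skip blank lines
    try:
        location = next(iterator).strip()
        while location.endswith(","):
            # Multi-line location, more to come.
            location += next(iterator).strip()
        qualifiers = []
        for line in iterator:
            if line.startswith("/"):
                key, separator, value = line[1:].partition("=")
                if not separator:
                    qualifiers.append((key, None))  # e.g. /pseudo
                    continue
                if value.startswith('"'):
                    # Read on until a line closes the quoted value.
                    while len(value) == 1 or not value.endswith('"'):
                        value += "\n" + next(iterator)
                qualifiers.append((key, value))
            else:
                # Continuation of an unquoted value.
                if not qualifiers:
                    raise ValueError("Continuation before any qualifier")
                key, value = qualifiers[-1]
                qualifiers[-1] = (key, value + "\n" + line)
    except StopIteration:
        raise ValueError("Problem with feature:\n" + "\n".join(lines))
    return location, qualifiers


lines = [
    "complement(join(490883..490885,1..879))",
    '/locus_tag="NEQ001"',
    '/note="conserved hypothetical;',
    'more text"',
    "/codon_start=1",
    "/pseudo",
]
print(parse_qualifier_lines(lines))
```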

def Bio.GenBank.Scanner.InsdcScanner.parse_features (   self,
  skip = False 
) [inherited]
Return list of tuples for the features (if present)

Each feature is returned as a tuple (key, location, qualifiers)
where key and location are strings (e.g. "CDS" and
"complement(join(490883..490885,1..879))") while qualifiers
is a list of two string tuples (feature qualifier keys and values).

Assumes you have already read to the start of the features table.

Reimplemented in Bio.GenBank.Scanner._ImgtScanner.

Definition at line 126 of file Scanner.py.

00126 
00127     def parse_features(self, skip=False):
00128         """Return list of tuples for the features (if present)
00129 
00130         Each feature is returned as a tuple (key, location, qualifiers)
00131         where key and location are strings (e.g. "CDS" and
00132         "complement(join(490883..490885,1..879))") while qualifiers
00133         is a list of two string tuples (feature qualifier keys and values).
00134 
00135         Assumes you have already read to the start of the features table.
00136         """
00137         if self.line.rstrip() not in self.FEATURE_START_MARKERS:
00138             if self.debug : print "Didn't find any feature table"
00139             return []
00140         
00141         while self.line.rstrip() in self.FEATURE_START_MARKERS:
00142             self.line = self.handle.readline()
00143 
00144         features = []
00145         line = self.line
00146         while True:
00147             if not line:
00148                 raise ValueError("Premature end of line during features table")
00149             if line[:self.HEADER_WIDTH].rstrip() in self.SEQUENCE_HEADERS:
00150                 if self.debug : print "Found start of sequence"
00151                 break
00152             line = line.rstrip()
00153             if line == "//":
00154                 raise ValueError("Premature end of features table, marker '//' found")
00155             if line in self.FEATURE_END_MARKERS:
00156                 if self.debug : print "Found end of features"
00157                 line = self.handle.readline()
00158                 break
00159             if line[2:self.FEATURE_QUALIFIER_INDENT].strip() == "":
00160                 #This is an empty feature line between qualifiers. Empty
00161                 #feature lines within qualifiers are handled below (ignored).
00162                 line = self.handle.readline()
00163                 continue
00164             
00165             if skip:
00166                 line = self.handle.readline()
00167                 while line[:self.FEATURE_QUALIFIER_INDENT] == self.FEATURE_QUALIFIER_SPACER:
00168                     line = self.handle.readline()
00169             else:
00170                 #Build up a list of the lines making up this feature:
00171                 if line[self.FEATURE_QUALIFIER_INDENT]!=" " \
00172                 and " " in line[self.FEATURE_QUALIFIER_INDENT:]:
00173                     #The feature table design enforces a length limit on the feature keys.
00174                     #Some third party files (e.g. IMGT's EMBL like files) solve this by
00175                     #over indenting the location and qualifiers.
00176                     feature_key, line = line[2:].strip().split(None,1)
00177                     feature_lines = [line]
00178                     warnings.warn("Overindented %s feature?" % feature_key)
00179                 else:
00180                     feature_key = line[2:self.FEATURE_QUALIFIER_INDENT].strip()
00181                     feature_lines = [line[self.FEATURE_QUALIFIER_INDENT:]]
00182                 line = self.handle.readline()
00183                 while line[:self.FEATURE_QUALIFIER_INDENT] == self.FEATURE_QUALIFIER_SPACER \
00184                 or line.rstrip() == "" : # cope with blank lines in the midst of a feature
00185                     #Use strip to remove any harmless trailing white space AND leading
00186                     #white space (e.g. out of spec files with too much indentation)
00187                     feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].strip())
00188                     line = self.handle.readline()
00189                 features.append(self.parse_feature(feature_key, feature_lines))
00190         self.line = line
00191         return features
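
The column-based split of a feature's first line, including the over-indented case, might look like the following. This is a sketch under stated assumptions: `split_feature_line` is a hypothetical helper, the indent of 21 is the `FEATURE_QUALIFIER_INDENT` value documented for this class, and `V_SEGMENT_VERY_LONG_KEY` is a made-up key used only to trigger the over-indented branch.

```python
FEATURE_QUALIFIER_INDENT = 21  # value documented for EmblScanner


def split_feature_line(line, indent=FEATURE_QUALIFIER_INDENT):
    """Split the first line of a feature into (key, remainder).

    Hypothetical helper for illustration; not part of Bio.GenBank.Scanner.
    """
    line = line.rstrip()
    tail = line[indent:]
    if tail[:1] != " " and " " in tail:
        # Over-indented line: the key runs past the fixed column, so split
        # on the first run of whitespace after the two-letter line code.
        key, rest = line[2:].strip().split(None, 1)
    else:
        # Standard layout: key before the indent column, location after it.
        key = line[2:indent].strip()
        rest = tail
    return key, rest


# "FT" line code plus padding, then the key padded out to the indent column.
standard = "FT   " + "CDS".ljust(16) + "complement(join(490883..490885,1..879))"
overlong = "FT   " + "V_SEGMENT_VERY_LONG_KEY" + "   1..10"

print(split_feature_line(standard))
print(split_feature_line(overlong))
```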

def Bio.GenBank.Scanner.EmblScanner.parse_footer (   self )
Returns a tuple containing a list of any misc strings, and the sequence

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 543 of file Scanner.py.

00543 
00544     def parse_footer(self):
00545         """returns a tuple containing a list of any misc strings, and the sequence"""
00546         assert self.line[:self.HEADER_WIDTH].rstrip() in self.SEQUENCE_HEADERS, \
00547             "Eh? '%s'" % self.line
00548 
00549         #Note that the SQ line can be split into several lines...
00550         misc_lines = []
00551         while self.line[:self.HEADER_WIDTH].rstrip() in self.SEQUENCE_HEADERS:
00552             misc_lines.append(self.line)
00553             self.line = self.handle.readline()
00554             if not self.line:
00555                 raise ValueError("Premature end of file")
00556             self.line = self.line.rstrip()
00557 
00558         assert self.line[:self.HEADER_WIDTH] == " " * self.HEADER_WIDTH \
00559                or self.line.strip() == '//', repr(self.line)
00560         
00561         seq_lines = []
00562         line = self.line
00563         while True:
00564             if not line:
00565                 raise ValueError("Premature end of file in sequence data")
00566             line = line.strip()
00567             if not line:
00568                 raise ValueError("Blank line in sequence data")
00569             if line=='//':
00570                 break
00571             assert self.line[:self.HEADER_WIDTH] == " " * self.HEADER_WIDTH, \
00572                    repr(self.line)
00573             #Remove trailing number now, remove spaces later
00574             seq_lines.append(line.rsplit(None,1)[0])
00575             line = self.handle.readline()
00576         self.line = line
00577         return (misc_lines, "".join(seq_lines).replace(" ", ""))
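
The sequence-assembly step at the end of the listing (drop the right-hand running count, then remove the spaces between blocks) can be tried in isolation. The sketch below assumes modern Python and uses a hypothetical helper name, `join_sequence_lines`; the sample lines are invented for illustration.

```python
def join_sequence_lines(lines):
    """Concatenate EMBL sequence lines into one string of residues.

    Hypothetical helper for illustration; not part of Bio.GenBank.Scanner.
    """
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            raise ValueError("Blank line in sequence data")
        if line == "//":
            break
        # Drop the trailing running base count, keep the residue blocks.
        chunks.append(line.rsplit(None, 1)[0])
    # Remove the spaces that separate the blocks of ten residues.
    return "".join(chunks).replace(" ", "")


sample = [
    "     acgtacgtac gtacgtacgt        20",
    "     ttttt                        25",
    "//",
]
print(join_sequence_lines(sample))
```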

def Bio.GenBank.Scanner.InsdcScanner.parse_header (   self ) [inherited]
Return list of strings making up the header

New line characters are removed.

Assumes you have just read in the ID/LOCUS line.

Definition at line 95 of file Scanner.py.

00095 
00096     def parse_header(self):
00097         """Return list of strings making up the header
00098 
00099         New line characters are removed.
00100 
00101         Assumes you have just read in the ID/LOCUS line.
00102         """
00103         assert self.line[:self.HEADER_WIDTH]==self.RECORD_START, \
00104                "Not at start of record"
00105         
00106         header_lines = []
00107         while True:
00108             line = self.handle.readline()
00109             if not line:
00110                 raise ValueError("Premature end of line during sequence data")
00111             line = line.rstrip()
00112             if line in self.FEATURE_START_MARKERS:
00113                 if self.debug : print "Found header table"
00114                 break
00115             #if line[:self.HEADER_WIDTH]==self.FEATURE_START_MARKER[:self.HEADER_WIDTH]:
00116             #    if self.debug : print "Found header table (?)"
00117             #    break
00118             if line[:self.HEADER_WIDTH].rstrip() in self.SEQUENCE_HEADERS:
00119                 if self.debug : print "Found start of sequence"
00120                 break
00121             if line == "//":
00122                 raise ValueError("Premature end of sequence data marker '//' found")
00123             header_lines.append(line)
00124         self.line = line
00125         return header_lines
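
The header loop reads until it reaches a feature-table marker or a sequence header line. A minimal standalone sketch of that loop follows, assuming modern Python; `read_header` is a hypothetical function, and for simplicity it recognises only the short `FH` marker rather than the full `FEATURE_START_MARKERS` list.

```python
import io


def read_header(handle, header_width=5, feature_starts=("FH",),
                sequence_headers=("SQ", "CO")):
    """Collect header lines up to the feature table or the sequence.

    Hypothetical helper for illustration; not part of Bio.GenBank.Scanner.
    Returns (header_lines, stop_line).
    """
    header_lines = []
    while True:
        line = handle.readline()
        if not line:
            raise ValueError("Premature end of file while reading header")
        line = line.rstrip()
        if line in feature_starts:
            break  # start of the feature table
        if line[:header_width].rstrip() in sequence_headers:
            break  # no feature table, sequence follows directly
        if line == "//":
            raise ValueError("Record ended before any sequence data")
        header_lines.append(line)
    return header_lines, line


handle = io.StringIO("AC   X56734;\nXX\nDE   Example\nFH\nFT   source\n")
header, stop = read_header(handle)
print(header, stop)
```

Returning the stop line alongside the header plays the role that `self.line` plays in the listing: the caller needs it to decide which section to parse next.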

def Bio.GenBank.Scanner.InsdcScanner.parse_records (   self,
  handle,
  do_features = True 
) [inherited]
Returns a SeqRecord object iterator

Each record (from the ID/LOCUS line to the // line) becomes a SeqRecord

The SeqRecord objects include SeqFeatures if do_features=True

This method is intended for use in Bio.SeqIO

Definition at line 434 of file Scanner.py.

00434 
00435     def parse_records(self, handle, do_features=True):
00436         """Returns a SeqRecord object iterator
00437 
00438         Each record (from the ID/LOCUS line to the // line) becomes a SeqRecord
00439 
00440         The SeqRecord objects include SeqFeatures if do_features=True
00441         
00442         This method is intended for use in Bio.SeqIO
00443         """
00444         #This is a generator function
00445         while True:
00446             record = self.parse(handle, do_features)
00447             if record is None : break
00448             assert record.id is not None
00449             assert record.name != "<unknown name>"
00450             assert record.description != "<unknown description>"
00451             yield record
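
The generator pattern used here (call a single-record parser until it returns None) is shown below with a toy parser. Both `iter_records` and `parse_one` are hypothetical stand-ins written for this sketch; they return plain lists of lines rather than SeqRecord objects.

```python
import io


def parse_one(handle):
    """Toy single-record parser: lines up to '//', or None at end of input."""
    lines = []
    for line in iter(handle.readline, ""):
        line = line.rstrip()
        if line == "//":
            return lines
        lines.append(line)
    return None


def iter_records(handle):
    """Yield records one at a time until the parser reports no more input."""
    while True:
        record = parse_one(handle)
        if record is None:
            break
        yield record


handle = io.StringIO("ID   A\n//\nID   B\n//\n")
for record in iter_records(handle):
    print(record)
```

Because records are yielded lazily, a caller can stop early without reading the rest of a large file.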

def Bio.GenBank.Scanner.InsdcScanner.set_handle (   self,
  handle 
) [inherited]

Definition at line 62 of file Scanner.py.

00062 
00063     def set_handle(self, handle):
00064         self.handle = handle
00065         self.line = ""


Member Data Documentation

Bio.GenBank.Scanner.InsdcScanner.debug [inherited]

Definition at line 59 of file Scanner.py.

list Bio.GenBank.Scanner.EmblScanner.FEATURE_END_MARKERS = ["XX"] [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 538 of file Scanner.py.

int Bio.GenBank.Scanner.EmblScanner.FEATURE_QUALIFIER_INDENT = 21 [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 539 of file Scanner.py.

string Bio.GenBank.Scanner.EmblScanner.FEATURE_QUALIFIER_SPACER = "FT" [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 540 of file Scanner.py.

list Bio.GenBank.Scanner.EmblScanner.FEATURE_START_MARKERS = ["FH Key Location/Qualifiers","FH"] [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Reimplemented in Bio.GenBank.Scanner._ImgtScanner.

Definition at line 537 of file Scanner.py.

Bio.GenBank.Scanner.InsdcScanner.handle [inherited]

Definition at line 63 of file Scanner.py.

int Bio.GenBank.Scanner.EmblScanner.HEADER_WIDTH = 5 [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 536 of file Scanner.py.

Bio.GenBank.Scanner.EmblScanner.line

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Reimplemented in Bio.GenBank.Scanner._ImgtScanner.

Definition at line 552 of file Scanner.py.

string Bio.GenBank.Scanner.EmblScanner.RECORD_START = "ID   " [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 535 of file Scanner.py.

list Bio.GenBank.Scanner.EmblScanner.SEQUENCE_HEADERS = ["SQ", "CO"] [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 541 of file Scanner.py.


The documentation for this class was generated from the following file: