
python-biopython  1.60
Bio.GenBank.Scanner.GenBankScanner Class Reference


Public Member Functions

def parse_footer
def set_handle
def find_start
def parse_header
def parse_features
def parse_feature
def feed
def parse
def parse_records
def parse_cds_features

Public Attributes

 line
 debug
 handle

Static Public Attributes

string RECORD_START = "LOCUS       "
int HEADER_WIDTH = 12
list FEATURE_START_MARKERS = ["FEATURES             Location/Qualifiers","FEATURES"]
list FEATURE_END_MARKERS = []
int FEATURE_QUALIFIER_INDENT = 21
string FEATURE_QUALIFIER_SPACER = " "
list SEQUENCE_HEADERS = ["CONTIG", "ORIGIN", "BASE COUNT", "WGS"]

Private Member Functions

def _feed_first_line
def _feed_header_lines
def _feed_misc_lines

Detailed Description

For extracting chunks of information in GenBank files

Definition at line 877 of file Scanner.py.


Member Function Documentation

def Bio.GenBank.Scanner.GenBankScanner._feed_first_line (   self,
  consumer,
  line 
) [private]
Scan over and parse GenBank LOCUS line (PRIVATE).

This must cope with several variants, primarily the old and new column
based standards from GenBank. Additionally EnsEMBL produces GenBank
files where the LOCUS line is space separated rather than following
the column based layout.

We also try to cope with GenBank-like files with partial LOCUS lines.

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 932 of file Scanner.py.

00932 
00933     def _feed_first_line(self, consumer, line):
00934         """Scan over and parse GenBank LOCUS line (PRIVATE).
00935 
00936         This must cope with several variants, primarily the old and new column
00937         based standards from GenBank. Additionally EnsEMBL produces GenBank
00938         files where the LOCUS line is space separated rather than following
00939         the column based layout.
00940 
00941         We also try to cope with GenBank like files with partial LOCUS lines.
00942         """
00943         #####################################
00944         # LOCUS line                        #
00945         #####################################
00946         GENBANK_INDENT = self.HEADER_WIDTH
00947         GENBANK_SPACER = " "*GENBANK_INDENT
00948         assert line[0:GENBANK_INDENT] == 'LOCUS       ', \
00949                'LOCUS line does not start correctly:\n' + line
00950 
00951         #Have to break up the locus line, and handle the different bits of it.
00952         #There are at least two different versions of the locus line...
00953         if line[29:33] in [' bp ', ' aa ',' rc '] and line[55:62] == '       ':
00954             #Old... note we insist on the 55:62 being empty to avoid trying
00955             #to parse space separated LOCUS lines from Ensembl etc, see below.
00956             #
00957             #    Positions  Contents
00958             #    ---------  --------
00959             #    00:06      LOCUS
00960             #    06:12      spaces
00961             #    12:??      Locus name
00962             #    ??:??      space
00963             #    ??:29      Length of sequence, right-justified
00964             #    29:33      space, bp, space
00965             #    33:41      strand type
00966             #    41:42      space
00967             #    42:51      Blank (implies linear), linear or circular
00968             #    51:52      space
00969             #    52:55      The division code (e.g. BCT, VRL, INV)
00970             #    55:62      space
00971             #    62:73      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
00972             #
00973             #assert line[29:33] in [' bp ', ' aa ',' rc '] , \
00974             #       'LOCUS line does not contain size units at expected position:\n' + line
00975             assert line[41:42] == ' ', \
00976                    'LOCUS line does not contain space at position 42:\n' + line
00977             assert line[42:51].strip() in ['','linear','circular'], \
00978                    'LOCUS line does not contain valid entry (linear, circular, ...):\n' + line
00979             assert line[51:52] == ' ', \
00980                    'LOCUS line does not contain space at position 52:\n' + line
00981             #assert line[55:62] == '       ', \
00982             #      'LOCUS line does not contain spaces from position 56 to 62:\n' + line
00983             if line[62:73].strip():
00984                 assert line[64:65] == '-', \
00985                        'LOCUS line does not contain - at position 65 in date:\n' + line
00986                 assert line[68:69] == '-', \
00987                        'LOCUS line does not contain - at position 69 in date:\n' + line
00988 
00989             name_and_length_str = line[GENBANK_INDENT:29]
00990             while name_and_length_str.find('  ')!=-1:
00991                 name_and_length_str = name_and_length_str.replace('  ',' ')
00992             name_and_length = name_and_length_str.split(' ')
00993             assert len(name_and_length)<=2, \
00994                    'Cannot parse the name and length in the LOCUS line:\n' + line
00995             assert len(name_and_length)!=1, \
00996                    'Name and length collide in the LOCUS line:\n' + line
00997                    #Should be possible to split them based on position, if
00998                    #a clear definition of the standard exists THAT AGREES with
00999                    #existing files.
01000             consumer.locus(name_and_length[0])
01001             consumer.size(name_and_length[1])
01002             #consumer.residue_type(line[33:41].strip())
01003 
01004             if line[33:51].strip() == "" and line[29:33] == ' aa ':
01005                 #Amino acids -> protein (even if there is no residue type given)
01006                 #We want to use a protein alphabet in this case, rather than a
01007                 #generic one. Not sure if this is the best way to achieve this,
01008                 #but it works because the scanner checks for this:
01009                 consumer.residue_type("PROTEIN")
01010             else:
01011                 consumer.residue_type(line[33:51].strip())
01012 
01013             consumer.data_file_division(line[52:55])
01014             if line[62:73].strip():
01015                 consumer.date(line[62:73])
01016         elif line[40:44] in [' bp ', ' aa ',' rc '] \
01017         and line[54:64].strip() in ['','linear','circular']:
01018             #New... linear/circular/big blank test should avoid EnsEMBL style
01019             #LOCUS line being treated like a proper column based LOCUS line.
01020             #
01021             #    Positions  Contents
01022             #    ---------  --------
01023             #    00:06      LOCUS
01024             #    06:12      spaces
01025             #    12:??      Locus name
01026             #    ??:??      space
01027             #    ??:40      Length of sequence, right-justified
01028             #    40:44      space, bp, space
01029             #    44:47      Blank, ss-, ds-, ms-
01030             #    47:54      Blank, DNA, RNA, tRNA, mRNA, uRNA, snRNA, cDNA
01031             #    54:55      space
01032             #    55:63      Blank (implies linear), linear or circular
01033             #    63:64      space
01034             #    64:67      The division code (e.g. BCT, VRL, INV)
01035             #    67:68      space
01036             #    68:79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
01037             #
01038             assert line[40:44] in [' bp ', ' aa ',' rc '] , \
01039                    'LOCUS line does not contain size units at expected position:\n' + line
01040             assert line[44:47] in ['   ', 'ss-', 'ds-', 'ms-'], \
01041                    'LOCUS line does not have valid strand type (Single stranded, ...):\n' + line
01042             assert line[47:54].strip() == "" \
01043             or line[47:54].strip().find('DNA') != -1 \
01044             or line[47:54].strip().find('RNA') != -1, \
01045                    'LOCUS line does not contain valid sequence type (DNA, RNA, ...):\n' + line
01046             assert line[54:55] == ' ', \
01047                    'LOCUS line does not contain space at position 55:\n' + line
01048             assert line[55:63].strip() in ['','linear','circular'], \
01049                    'LOCUS line does not contain valid entry (linear, circular, ...):\n' + line
01050             assert line[63:64] == ' ', \
01051                    'LOCUS line does not contain space at position 64:\n' + line
01052             assert line[67:68] == ' ', \
01053                    'LOCUS line does not contain space at position 68:\n' + line
01054             if line[68:79].strip():
01055                 assert line[70:71] == '-', \
01056                        'LOCUS line does not contain - at position 71 in date:\n' + line
01057                 assert line[74:75] == '-', \
01058                        'LOCUS line does not contain - at position 75 in date:\n' + line
01059 
01060             name_and_length_str = line[GENBANK_INDENT:40]
01061             while name_and_length_str.find('  ')!=-1:
01062                 name_and_length_str = name_and_length_str.replace('  ',' ')
01063             name_and_length = name_and_length_str.split(' ')
01064             assert len(name_and_length)<=2, \
01065                    'Cannot parse the name and length in the LOCUS line:\n' + line
01066             assert len(name_and_length)!=1, \
01067                    'Name and length collide in the LOCUS line:\n' + line
01068                    #Should be possible to split them based on position, if
01069                    #a clear definition of the standard exists THAT AGREES with
01070                    #existing files.
01071             consumer.locus(name_and_length[0])
01072             consumer.size(name_and_length[1])
01073 
01074             if line[44:54].strip() == "" and line[40:44] == ' aa ':
01075                 #Amino acids -> protein (even if there is no residue type given)
01076                 #We want to use a protein alphabet in this case, rather than a
01077                 #generic one. Not sure if this is the best way to achieve this,
01078                 #but it works because the scanner checks for this:
01079                 consumer.residue_type(("PROTEIN " + line[54:63]).strip())
01080             else:
01081                 consumer.residue_type(line[44:63].strip())
01082 
01083             consumer.data_file_division(line[64:67])
01084             if line[68:79].strip():
01085                 consumer.date(line[68:79])
01086         elif line[GENBANK_INDENT:].strip().count(" ")==0 : 
01087             #Truncated LOCUS line, as produced by some EMBOSS tools - see bug 1762
01088             #
01089             #e.g.
01090             #
01091             #    "LOCUS       U00096"
01092             #
01093             #rather than:
01094             #
01095             #    "LOCUS       U00096               4639675 bp    DNA     circular BCT"
01096             #
01097             #    Positions  Contents
01098             #    ---------  --------
01099             #    00:06      LOCUS
01100             #    06:12      spaces
01101             #    12:??      Locus name
01102             if line[GENBANK_INDENT:].strip() != "":
01103                 consumer.locus(line[GENBANK_INDENT:].strip())
01104             else:
01105                 #Must have just "LOCUS       ", is this even legitimate?
01106                 #We should be able to continue parsing... we need real world testcases!
01107                 warnings.warn("Minimal LOCUS line found - is this correct?\n:%r" % line)
01108         elif len(line.split())==7 and line.split()[3] in ["aa","bp"]:
01109             #Cope with EnsEMBL genbank files which use space separation rather
01110             #than the expected column based layout. e.g.
01111             #LOCUS       HG531_PATCH 1000000 bp DNA HTG 18-JUN-2011
01112             #LOCUS       HG531_PATCH 759984 bp DNA HTG 18-JUN-2011
01113             #LOCUS       HG506_HG1000_1_PATCH 814959 bp DNA HTG 18-JUN-2011
01114             #LOCUS       HG506_HG1000_1_PATCH 1219964 bp DNA HTG 18-JUN-2011
01115             #Notice that the 'bp' can occur in the position expected by either
01116             #the old or the new fixed column standards (parsed above).
01117             splitline = line.split()
01118             consumer.locus(splitline[1])
01119             consumer.size(splitline[2])
01120             consumer.residue_type(splitline[4])
01121             consumer.data_file_division(splitline[5])
01122             consumer.date(splitline[6])
01123         elif len(line.split())>=4 and line.split()[3] in ["aa","bp"]:
01124             #Cope with EMBOSS seqret output where it seems the locus id can cause
01125             #the other fields to overflow.  We just IGNORE the other fields!
01126             warnings.warn("Malformed LOCUS line found - is this correct?\n:%r" % line)
01127             consumer.locus(line.split()[1])
01128             consumer.size(line.split()[2])
01129         elif len(line.split())>=4 and line.split()[-1] in ["aa","bp"]:
01130             #Cope with pseudo-GenBank files like this:
01131             #   "LOCUS       RNA5 complete       1718 bp"
01132             #Treat everything between LOCUS and the size as the identifier.
01133             warnings.warn("Malformed LOCUS line found - is this correct?\n:%r" % line)
01134             consumer.locus(line[5:].rsplit(None,2)[0].strip())
01135             consumer.size(line.split()[-2])
01136         else:
01137             raise ValueError('Did not recognise the LOCUS line layout:\n' + line)
01138 
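
The column table for the new-style LOCUS layout can be tried out in isolation. The following is a minimal sketch, not the Biopython implementation: `parse_new_style_locus` is a hypothetical helper, it returns a plain dict instead of calling a consumer, and the date in the sample line is made up. The sample is assembled field by field so that the column offsets are explicit.

```python
def parse_new_style_locus(line):
    """Split a new-style, column-based GenBank LOCUS line into fields.

    Illustrative helper only; it is not part of Biopython.
    """
    if line[40:44] not in (" bp ", " aa ", " rc "):
        raise ValueError("Size units not found at columns 40:44:\n" + line)
    # The locus name and the right-justified length share columns 12:40.
    parts = line[12:40].split()
    if len(parts) != 2:
        raise ValueError("Cannot separate name and length:\n" + line)
    name, length = parts
    return {
        "locus": name,
        "size": length,  # kept as a string, as the scanner passes it on
        "residue_type": line[44:63].strip(),
        "division": line[64:67],
        "date": line[68:79].strip(),
    }


# Build a sample line column by column (the date is invented).
sample = (
    "LOCUS".ljust(12)        # 00:12  keyword plus padding
    + "U00096".ljust(16)     # 12:28  locus name
    + "4639675".rjust(12)    # 28:40  length, right-justified
    + " bp "                 # 40:44  size units
    + "   "                  # 44:47  strand type (blank)
    + "DNA".ljust(7)         # 47:54  molecule type
    + " "                    # 54:55
    + "circular"             # 55:63  topology
    + " "                    # 63:64
    + "BCT"                  # 64:67  division code
    + " "                    # 67:68
    + "01-JAN-2000"          # 68:79  date
)
fields = parse_new_style_locus(sample)
print(fields)
```

As in the listing above, the residue type is taken from columns 44:63 in one slice, so the molecule type and the topology arrive together in a single string.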


def Bio.GenBank.Scanner.GenBankScanner._feed_header_lines (   self,
  consumer,
  lines 
) [private]
Handle the header lines (list of strings), passing data to the consumer.

This should be implemented by the EMBL / GenBank specific subclass

Used by the parse_records() and parse() methods.

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 1139 of file Scanner.py.
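
This method relies on the GenBank continuation convention: a line whose first HEADER_WIDTH (12) columns are blank extends the entry started by the previous keyword line. The sketch below shows that convention on its own. It is written for Python 3, `fold_header_lines` is a hypothetical helper rather than a Biopython function, and the sample header lines are illustrative. Unlike the scanner, it strips each continuation before joining.

```python
HEADER_WIDTH = 12  # same value as GenBankScanner.HEADER_WIDTH


def fold_header_lines(lines):
    """Group header lines into (keyword, data) pairs.

    Illustrative helper only; it is not part of Biopython. A line whose
    first HEADER_WIDTH columns are blank is treated as a continuation
    of the previous keyword line.
    """
    entries = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        keyword = line[:HEADER_WIDTH].strip()
        data = line[HEADER_WIDTH:].strip()
        if keyword or not entries:
            entries.append([keyword, data])
        else:
            # Blank keyword columns: append to the previous entry.
            entries[-1][1] += " " + data
    return [tuple(entry) for entry in entries]


header = [
    "DEFINITION  Escherichia coli K-12 MG1655,",
    "            complete genome.",
    "ACCESSION   U00096",
]
folded = fold_header_lines(header)
print(folded)
```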

01139 
01140     def _feed_header_lines(self, consumer, lines):
01141         #Following dictionary maps GenBank lines to the associated
01142         #consumer methods - the special cases like LOCUS where one
01143         #genbank line triggers several consumer calls have to be
01144         #handled individually.
01145         GENBANK_INDENT = self.HEADER_WIDTH
01146         GENBANK_SPACER = " "*GENBANK_INDENT
01147         consumer_dict = {
01148             'DEFINITION' : 'definition',
01149             'ACCESSION'  : 'accession',
01150             'NID'        : 'nid',
01151             'PID'        : 'pid',
01152             'DBSOURCE'   : 'db_source',
01153             'KEYWORDS'   : 'keywords',
01154             'SEGMENT'    : 'segment',
01155             'SOURCE'     : 'source',
01156             'AUTHORS'    : 'authors',
01157             'CONSRTM'    : 'consrtm',
01158             'PROJECT'    : 'project',
01159             'DBLINK'     : 'dblink',
01160             'TITLE'      : 'title',
01161             'JOURNAL'    : 'journal',
01162             'MEDLINE'    : 'medline_id',
01163             'PUBMED'     : 'pubmed_id',
01164             'REMARK'     : 'remark'}
01165         #We have to handle the following specially:
01166         #LOCUS (locus, size, residue_type, data_file_division and date)
01167         #COMMENT (comment)
01168         #VERSION (version and gi)
01169         #REFERENCE (reference_num and reference_bases)
01170         #ORGANISM (organism and taxonomy)
01171         lines = filter(None,lines)
01172         lines.append("") #helps avoid getting StopIteration all the time
01173         line_iter = iter(lines)
01174         try:
01175             line = line_iter.next()
01176             while True:
01177                 if not line : break
01178                 line_type = line[:GENBANK_INDENT].strip()
01179                 data = line[GENBANK_INDENT:].strip()
01180 
01181                 if line_type == 'VERSION':
01182                     #Need to call consumer.version(), and maybe also consumer.gi() as well.
01183                     #e.g.
01184                     # VERSION     AC007323.5  GI:6587720
01185                     while data.find('  ')!=-1:
01186                         data = data.replace('  ',' ')
01187                     if data.find(' GI:')==-1:
01188                         consumer.version(data)
01189                     else:
01190                         if self.debug : print "Version [" + data.split(' GI:')[0] + "], gi [" + data.split(' GI:')[1] + "]"
01191                         consumer.version(data.split(' GI:')[0])
01192                         consumer.gi(data.split(' GI:')[1])
01193                     #Read in the next line!
01194                     line = line_iter.next()
01195                 elif line_type == 'REFERENCE':
01196                     if self.debug >1 : print "Found reference [" + data + "]"
01197                     #Need to call consumer.reference_num() and consumer.reference_bases()
01198                     #e.g.
01199                     # REFERENCE   1  (bases 1 to 86436)
01200                     #
01201                     #Note that this can be multiline, see Bug 1968, e.g.
01202                     #
01203                     # REFERENCE   42 (bases 1517 to 1696; 3932 to 4112; 17880 to 17975; 21142 to
01204                     #             28259)
01205                     #
01206                     #For such cases we will call the consumer once only.
01207                     data = data.strip()
01208 
01209                     #Read in the next line, and see if it's more of the reference:
01210                     while True:
01211                         line = line_iter.next()
01212                         if line[:GENBANK_INDENT] == GENBANK_SPACER:
01213                             #Add this continuation to the data string
01214                             data += " " + line[GENBANK_INDENT:]
01215                             if self.debug >1 : print "Extended reference text [" + data + "]"
01216                         else:
01217                             #End of the reference, leave this text in the variable "line"
01218                             break
01219 
01220                     #We now have all the reference line(s) stored in a string, data,
01221                     #which we pass to the consumer
01222                     while data.find('  ')!=-1:
01223                         data = data.replace('  ',' ')
01224                     if data.find(' ')==-1:
01225                         if self.debug >2 : print 'Reference number \"' + data + '\"'
01226                         consumer.reference_num(data)
01227                     else:
01228                         if self.debug >2 : print 'Reference number \"' + data[:data.find(' ')] + '\", \"' + data[data.find(' ')+1:] + '\"'
01229                         consumer.reference_num(data[:data.find(' ')])
01230                         consumer.reference_bases(data[data.find(' ')+1:])
01231                 elif line_type == 'ORGANISM':
01232                     #Typically the first line is the organism, and subsequent lines
01233                     #are the taxonomy lineage.  However, given longer and longer
01234                     #species names (as more and more strains and sub strains get
01235                     #sequenced) the organism name can now get wrapped onto multiple
01236                     #lines.  The NCBI say we have to recognise the lineage line by
01237                     #the presence of semi-colon delimited entries.  In the long term,
01238                     #they are considering adding a new keyword (e.g. LINEAGE).
01239                     #See Bug 2591 for details.
01240                     organism_data = data
01241                     lineage_data = ""
01242                     while True:
01243                         line = line_iter.next()
01244                         if line[0:GENBANK_INDENT] == GENBANK_SPACER:
01245                             if lineage_data or ";" in line:
01246                                 lineage_data += " " + line[GENBANK_INDENT:]
01247                             else:
01248                                 organism_data += " " + line[GENBANK_INDENT:].strip()
01249                         else:
01250                             #End of organism and taxonomy
01251                             break
01252                     consumer.organism(organism_data)
01253                     if lineage_data.strip() == "" and self.debug > 1:
01254                         print "Taxonomy line(s) missing or blank"
01255                     consumer.taxonomy(lineage_data.strip())
01256                     del organism_data, lineage_data
01257                 elif line_type == 'COMMENT':
01258                     if self.debug > 1 : print "Found comment"
01259                     #This can be multiline, and should call consumer.comment() once
01260                     #with a list where each entry is a line.
01261                     comment_list=[]
01262                     comment_list.append(data)
01263                     while True:
01264                         line = line_iter.next()
01265                         if line[0:GENBANK_INDENT] == GENBANK_SPACER:
01266                             data = line[GENBANK_INDENT:]
01267                             comment_list.append(data)
01268                             if self.debug > 2 : print "Comment continuation [" + data + "]"
01269                         else:
01270                             #End of the comment
01271                             break
01272                     consumer.comment(comment_list)
01273                     del comment_list
01274                 elif line_type in consumer_dict:
01275                     #It's a semi-automatic entry!
01276                     #Now, this may be a multi line entry...
01277                     while True:
01278                         line = line_iter.next()
01279                         if line[0:GENBANK_INDENT] == GENBANK_SPACER:
01280                             data += ' ' + line[GENBANK_INDENT:]
01281                         else:
01282                             #We now have all the data for this entry:
01283                             getattr(consumer, consumer_dict[line_type])(data)
01284                             #End of continuation - return to top of loop!
01285                             break
01286                 else:
01287                     if self.debug:
01288                         print "Ignoring GenBank header line:\n%s" % line
01289                     #Read in next line
01290                     line = line_iter.next()
01291         except StopIteration:
01292             raise ValueError("Problem in header")
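
The ORGANISM branch above separates a wrapped organism name from the taxonomy lineage by looking for a semicolon: once a continuation line contains ";", it and every later continuation are treated as lineage. A minimal Python 3 sketch of that heuristic follows. `split_organism_block` is a hypothetical helper, it returns a tuple instead of calling `consumer.organism()` and `consumer.taxonomy()`, and the sample lines are illustrative.

```python
HEADER_WIDTH = 12
SPACER = " " * HEADER_WIDTH


def split_organism_block(lines):
    """Return (organism, lineage) for an ORGANISM line and what follows it.

    Illustrative helper only; it is not part of Biopython.
    """
    organism = lines[0][HEADER_WIDTH:].strip()
    lineage = ""
    for line in lines[1:]:
        if line[:HEADER_WIDTH] != SPACER:
            break  # next keyword line: end of organism and taxonomy
        if lineage or ";" in line:
            # First line with a semicolon starts the lineage.
            lineage += " " + line[HEADER_WIDTH:]
        else:
            # Still part of a wrapped organism name.
            organism += " " + line[HEADER_WIDTH:].strip()
    return organism, lineage.strip()


block = [
    "  ORGANISM  Escherichia coli str. K-12",
    "            substr. MG1655",
    "            Bacteria; Proteobacteria;",
    "            Enterobacterales.",
    "REFERENCE   1",
]
organism, lineage = split_organism_block(block)
print(organism)
print(lineage)
```

One consequence of this design is that an organism name which itself contains a semicolon would be misread as the start of the lineage, which is why the comment in the listing mentions a possible dedicated keyword.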
        


def Bio.GenBank.Scanner.GenBankScanner._feed_misc_lines (   self,
  consumer,
  lines 
) [private]
Handle any lines between features and sequence (list of strings), passing data to the consumer

This should be implemented by the EMBL / GenBank specific subclass

Used by the parse_records() and parse() methods.

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 1293 of file Scanner.py.

01293 
01294     def _feed_misc_lines(self, consumer, lines):
01295         #Deals with a few misc lines between the features and the sequence
01296         GENBANK_INDENT = self.HEADER_WIDTH
01297         GENBANK_SPACER = " "*GENBANK_INDENT
01298         lines.append("")
01299         line_iter = iter(lines)
01300         try:
01301             for line in line_iter:
01302                 if line.find('BASE COUNT')==0:
01303                     line = line[10:].strip()
01304                     if line:
01305                         if self.debug : print "base_count = " + line
01306                         consumer.base_count(line)
01307                 if line.find("ORIGIN")==0:
01308                     line = line[6:].strip()
01309                     if line:
01310                         if self.debug : print "origin_name = " + line
01311                         consumer.origin_name(line)
01312                 if line.find("WGS ")==0 :                        
01313                     line = line[3:].strip()
01314                     consumer.wgs(line)
01315                 if line.find("WGS_SCAFLD")==0 :                        
01316                     line = line[10:].strip()
01317                     consumer.add_wgs_scafld(line)
01318                 if line.find("CONTIG")==0:
01319                     line = line[6:].strip()
01320                     contig_location = line
01321                     while True:
01322                         line = line_iter.next()
01323                         if not line:
01324                             break
01325                         elif line[:GENBANK_INDENT]==GENBANK_SPACER:
01326                             #Don't need to preserve the whitespace here.
01327                             contig_location += line[GENBANK_INDENT:].rstrip()
01328                         else:
01329                             raise ValueError('Expected CONTIG continuation line, got:\n' + line)
01330                     consumer.contig_location(contig_location)
01331             return
01332         except StopIteration:
01333             raise ValueError("Problem in misc lines before sequence")
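
The CONTIG handling above joins continuation lines without inserting spaces, because the value is a single location expression. A minimal Python 3 sketch of that step, assuming the lines have already been stripped of newlines: `join_contig_lines` is a hypothetical helper and the accession numbers in the sample are invented.

```python
HEADER_WIDTH = 12
SPACER = " " * HEADER_WIDTH


def join_contig_lines(lines):
    """Join a CONTIG line and its continuations into one location string.

    Illustrative helper only; it is not part of Biopython.
    """
    if not lines or not lines[0].startswith("CONTIG"):
        raise ValueError("Expected a CONTIG line")
    location = lines[0][len("CONTIG"):].strip()
    for line in lines[1:]:
        if not line:
            break  # an empty line ends the block
        if line[:HEADER_WIDTH] != SPACER:
            raise ValueError("Expected CONTIG continuation line, got:\n" + line)
        # Whitespace is not significant here, so concatenate directly.
        location += line[HEADER_WIDTH:].rstrip()
    return location


contig = [
    "CONTIG      join(AB000001.1:1..100,gap(50),",
    "            AB000002.1:1..200)",
    "",
]
location = join_contig_lines(contig)
print(location)
```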
        
def Bio.GenBank.Scanner.InsdcScanner.feed (   self,
  handle,
  consumer,
  do_features = True 
) [inherited]
Feed a set of data into the consumer.

This method is intended for use with the "old" code in Bio.GenBank

Arguments:
handle - A handle with the information to parse.
consumer - The consumer that should be informed of events.
do_features - Boolean, should the features be parsed?
      Skipping the features can be much faster.

Return values:
True  - Passed a record
False - Did not find a record

Definition at line 367 of file Scanner.py.

00367 
00368     def feed(self, handle, consumer, do_features=True):
00369         """Feed a set of data into the consumer.
00370 
00371         This method is intended for use with the "old" code in Bio.GenBank
00372 
00373         Arguments:
00374         handle - A handle with the information to parse.
00375         consumer - The consumer that should be informed of events.
00376         do_features - Boolean, should the features be parsed?
00377                       Skipping the features can be much faster.
00378 
00379         Return values:
00380         True  - Passed a record
00381         False - Did not find a record
00382         """        
00383         #Should work with both EMBL and GenBank files provided the
00384         #equivalent Bio.GenBank._FeatureConsumer methods are called...
00385         self.set_handle(handle)
00386         if not self.find_start():
00387             #Could not find (another) record
00388             consumer.data=None
00389             return False
00390                        
00391         #We use the above class methods to parse the file into a simplified format.
00392         #The first line, header lines and any misc lines after the features will be
00393         #dealt with by GenBank / EMBL specific derived classes.
00394 
00395         #First line and header:
00396         self._feed_first_line(consumer, self.line)
00397         self._feed_header_lines(consumer, self.parse_header())
00398 
00399         #Features (common to both EMBL and GenBank):
00400         if do_features:
00401             self._feed_feature_table(consumer, self.parse_features(skip=False))
00402         else:
00403             self.parse_features(skip=True) # ignore the data
00404         
00405         #Footer and sequence
00406         misc_lines, sequence_string = self.parse_footer()
00407         self._feed_misc_lines(consumer, misc_lines)
00408 
00409         consumer.sequence(sequence_string)
00410         #Calls to consumer.base_number() do nothing anyway
00411         consumer.record_end("//")
00412 
00413         assert self.line == "//"
00414 
00415         #And we are done
00416         return True
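
feed() drives the consumer through a fixed sequence of method calls (events) and keeps no parsed state of its own. To illustrate that design, the sketch below defines a stand-in consumer that simply records every event it receives. `RecordingConsumer` is hypothetical and not part of Biopython. The calls at the end are made by hand and mimic only a few of the events feed() sends; whether such an object works with a particular Biopython version is an assumption that would need checking.

```python
class RecordingConsumer:
    """Stand-in consumer that logs each event name and its arguments.

    Illustrative class only; it is not part of Biopython.
    """

    def __init__(self):
        self.events = []
        self.data = None  # feed() sets this to None when no record is found

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. event methods
        # such as locus(), size(), sequence() or record_end().
        def record(*args):
            self.events.append((name, args))
        return record


consumer = RecordingConsumer()
# Hand-written calls in the order feed() would issue them.
consumer.locus("U00096")
consumer.size("4639675")
consumer.sequence("acgt")
consumer.record_end("//")

for name, args in consumer.events:
    print(name, args)
```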


def Bio.GenBank.Scanner.InsdcScanner.find_start (   self) [inherited]
Read in lines until the ID/LOCUS line is found, which is returned.

Any preamble (such as the header used by the NCBI on *.seq.gz archives)
will be ignored.

Definition at line 66 of file Scanner.py.

00066 
00067     def find_start(self):
00068         """Read in lines until the ID/LOCUS line is found, which is returned.
00069         
00070         Any preamble (such as the header used by the NCBI on *.seq.gz archives)
00071         will be ignored."""
00072         while True:
00073             if self.line:
00074                 line = self.line
00075                 self.line = ""
00076             else:
00077                 line = self.handle.readline()
00078             if not line:
00079                 if self.debug : print "End of file"
00080                 return None
00081             if line[:self.HEADER_WIDTH]==self.RECORD_START:
00082                 if self.debug > 1: print "Found the start of a record:\n" + line
00083                 break
00084             line = line.rstrip()
00085             if line == "//":
00086                 if self.debug > 1: print "Skipping // marking end of last record"
00087             elif line == "":
00088                 if self.debug > 1: print "Skipping blank line before record"
00089             else:
00090                 #Ignore any header before the first ID/LOCUS line.
00091                 if self.debug > 1:
00092                         print "Skipping header line before record:\n" + line
00093         self.line = line
00094         return line
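
The core of find_start() is a loop that discards lines until one begins with the record start marker. A simplified Python 3 sketch of that loop is shown below. `find_record_start` is a hypothetical free function, it omits the debug output and the buffered `self.line` handling of the real method, and the sample text is invented.

```python
import io

RECORD_START = "LOCUS       "  # "LOCUS" padded to the 12-column header width


def find_record_start(handle):
    """Return the first line starting with RECORD_START, or None at EOF.

    Illustrative function only; it is not part of Biopython.
    """
    while True:
        line = handle.readline()
        if not line:
            return None  # end of file, no (further) record
        if line.startswith(RECORD_START):
            return line
        # Anything else (preamble, blank lines, "//") is skipped.


text = (
    "Some archive preamble\n"
    "\n"
    "//\n"
    "LOCUS       U00096\n"
    "DEFINITION  example\n"
)
handle = io.StringIO(text)
first = find_record_start(handle)
print(repr(first))
```

After the call, the handle is positioned on the line following the LOCUS line, so header parsing can continue from there.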


def Bio.GenBank.Scanner.InsdcScanner.parse (   self,
  handle,
  do_features = True 
) [inherited]
Returns a SeqRecord (with SeqFeatures if do_features=True)

See also the method parse_records() for use on multi-record files.

Definition at line 417 of file Scanner.py.

00417 
00418     def parse(self, handle, do_features=True):
00419         """Returns a SeqRecord (with SeqFeatures if do_features=True)
00420 
00421         See also the method parse_records() for use on multi-record files.
00422         """
00423         from Bio.GenBank import _FeatureConsumer
00424         from Bio.GenBank.utils import FeatureValueCleaner
00425 
00426         consumer = _FeatureConsumer(use_fuzziness = 1, 
00427                     feature_cleaner = FeatureValueCleaner())
00428 
00429         if self.feed(handle, consumer, do_features):
00430             return consumer.data
00431         else:
00432             return None
00433 
    


def Bio.GenBank.Scanner.InsdcScanner.parse_cds_features (   self,
  handle,
  alphabet = generic_protein,
  tags2id = ('protein_id','locus_tag','product') 
) [inherited]
Returns SeqRecord object iterator

Each CDS feature becomes a SeqRecord.

alphabet - Used for any sequence found in a translation field.
tags2id  - Tuple of three strings, the feature keys to use
   for the record id, name and description.

This method is intended for use in Bio.SeqIO

Definition at line 454 of file Scanner.py.

00454 
00455                            tags2id=('protein_id','locus_tag','product')):
00456         """Returns SeqRecord object iterator
00457 
00458         Each CDS feature becomes a SeqRecord.
00459 
00460         alphabet - Used for any sequence found in a translation field.
00461         tags2id  - Tuple of three strings, the feature keys to use
00462                    for the record id, name and description,
00463 
00464         This method is intended for use in Bio.SeqIO
00465         """
00466         self.set_handle(handle)
00467         while self.find_start():
00468             #Got an EMBL or GenBank record...
00469             self.parse_header() # ignore header lines!
00470             feature_tuples = self.parse_features()
00471             #self.parse_footer() # ignore footer lines!
00472             while True:
00473                 line = self.handle.readline()
00474                 if not line : break
00475                 if line[:2]=="//" : break
00476             self.line = line.rstrip()
00477 
00478             #Now go through those features...
00479             for key, location_string, qualifiers in feature_tuples:
00480                 if key=="CDS":
00481                     #Create SeqRecord
00482                     #================
00483                     #SeqRecord objects cannot be created with annotations, they
00484                     #must be added afterwards.  So create an empty record and
00485                     #then populate it:
00486                     record = SeqRecord(seq=None)
00487                     annotations = record.annotations
00488 
00489                     #Should we add a location object to the annotations?
00490                     #I *think* that only makes sense for SeqFeatures with their
00491                     #sub features...
00492                     annotations['raw_location'] = location_string.replace(' ','')
00493 
00494                     for (qualifier_name, qualifier_data) in qualifiers:
00495                         if qualifier_data is not None \
00496                         and qualifier_data[0]=='"' and qualifier_data[-1]=='"':
00497                             #Remove quotes
00498                             qualifier_data = qualifier_data[1:-1]
00499                         #Append the data to the annotation qualifier...
00500                         if qualifier_name == "translation":
00501                             assert record.seq is None, "Multiple translations!"
00502                             record.seq = Seq(qualifier_data.replace("\n",""), alphabet)
00503                         elif qualifier_name == "db_xref":
00504                             #It's a list, possibly empty.  It's safe to append
00505                             record.dbxrefs.append(qualifier_data)
00506                         else:
00507                             if qualifier_data is not None:
00508                                 qualifier_data = qualifier_data.replace("\n"," ").replace("  "," ")
00509                             try:
00510                                 annotations[qualifier_name] += " " + qualifier_data
00511                             except KeyError:
00512                                 #Not an addition to existing data, it's the first bit
00513                                 annotations[qualifier_name]= qualifier_data
00514                         
00515                     #Fill in the ID, Name, Description
00516                     #=================================
00517                     try:
00518                         record.id = annotations[tags2id[0]]
00519                     except KeyError:
00520                         pass
00521                     try:
00522                         record.name = annotations[tags2id[1]]
00523                     except KeyError:
00524                         pass
00525                     try:
00526                         record.description = annotations[tags2id[2]]
00527                     except KeyError:
00528                         pass
00529 
00530                     yield record
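
The qualifier handling above can be illustrated without SeqRecord objects. The following is a minimal sketch under stated assumptions: `cds_fields` is a hypothetical helper that returns a plain dictionary, and repeated qualifiers with no value are simply kept as None rather than concatenated.

```python
def cds_fields(qualifiers, tags2id=("protein_id", "locus_tag", "product")):
    """Collect id, name, description, dbxrefs and sequence from CDS qualifiers.

    Hypothetical helper mirroring the loop above; it is not part of Biopython.
    """
    annotations = {}
    dbxrefs = []
    seq = None
    for name, data in qualifiers:
        # Strip one pair of surrounding double quotes, if present.
        if data is not None and len(data) >= 2 and data[0] == '"' and data[-1] == '"':
            data = data[1:-1]
        if name == "translation":
            seq = data.replace("\n", "")
        elif name == "db_xref":
            dbxrefs.append(data)
        elif data is None:
            annotations.setdefault(name, None)  # flag qualifier such as /pseudo
        else:
            data = data.replace("\n", " ")
            previous = annotations.get(name)
            annotations[name] = data if previous is None else previous + " " + data
    rec_id, rec_name, description = (annotations.get(tag) for tag in tags2id)
    return {"id": rec_id, "name": rec_name, "description": description,
            "dbxrefs": dbxrefs, "seq": seq}

fields = cds_fields([
    ("locus_tag", '"NEQ001"'),
    ("codon_start", "1"),
    ("product", '"hypothetical protein"'),
    ("protein_id", '"NP_963295.1"'),
    ("db_xref", '"GI:41614797"'),
    ("db_xref", '"GeneID:2732620"'),
    ("translation", '"MRLL\nELKA"'),
])
```

A missing tag leaves the corresponding field as None here, whereas the method above leaves the SeqRecord default in place.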
00531 

def Bio.GenBank.Scanner.InsdcScanner.parse_feature (   self,
  feature_key,
  lines 
) [inherited]
Expects a feature as a list of strings, returns a tuple (key, location, qualifiers)

For example given this GenBank feature:

     CDS             complement(join(490883..490885,1..879))
                     /locus_tag="NEQ001"
                     /note="conserved hypothetical [Methanococcus jannaschii];
                     COG1583:Uncharacterized ACR; IPR001472:Bipartite nuclear
                     localization signal; IPR002743: Protein of unknown
                     function DUF57"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_963295.1"
                     /db_xref="GI:41614797"
                     /db_xref="GeneID:2732620"
                     /translation="MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK
                     EKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTK
                     KFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
                     IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFE
                     EAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGS
                     LNSMGFGFVNTKKNSAR"

The caller should then pass feature_key="CDS" and the rest of the data as a list of strings,
lines=["complement(join(490883..490885,1..879))", ..., "LNSMGFGFVNTKKNSAR"],
where the leading spaces and trailing newlines have been removed.

Returns tuple containing: (key as string, location string, qualifiers as list)
as follows for this example:

key = "CDS", string
location = "complement(join(490883..490885,1..879))", string
qualifiers = list of string tuples:

[('locus_tag', '"NEQ001"'),
 ('note', '"conserved hypothetical [Methanococcus jannaschii];\nCOG1583:..."'),
 ('codon_start', '1'),
 ('transl_table', '11'),
 ('product', '"hypothetical protein"'),
 ('protein_id', '"NP_963295.1"'),
 ('db_xref', '"GI:41614797"'),
 ('db_xref', '"GeneID:2732620"'),
 ('translation', '"MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK\nEKYFNFT..."')]

In the above example, the "note" and "translation" were edited for compactness,
and they would contain multiple new line characters (displayed above as \n)

If a qualifier is quoted (in this case, everything except codon_start and
transl_table) then the quotes are NOT removed.

Note that no whitespace is removed.

Definition at line 192 of file Scanner.py.

00192 
00193     def parse_feature(self, feature_key, lines):
00194         """Expects a feature as a list of strings, returns a tuple (key, location, qualifiers)
00195 
00196         For example given this GenBank feature:
00197 
00198              CDS             complement(join(490883..490885,1..879))
00199                              /locus_tag="NEQ001"
00200                              /note="conserved hypothetical [Methanococcus jannaschii];
00201                              COG1583:Uncharacterized ACR; IPR001472:Bipartite nuclear
00202                              localization signal; IPR002743: Protein of unknown
00203                              function DUF57"
00204                              /codon_start=1
00205                              /transl_table=11
00206                              /product="hypothetical protein"
00207                              /protein_id="NP_963295.1"
00208                              /db_xref="GI:41614797"
00209                              /db_xref="GeneID:2732620"
00210                              /translation="MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK
00211                              EKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTK
00212                              KFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
00213                              IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFE
00214                              EAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGS
00215                              LNSMGFGFVNTKKNSAR"
00216 
00217         Then should give input key="CDS" and the rest of the data as a list of strings
00218         lines=["complement(join(490883..490885,1..879))", ..., "LNSMGFGFVNTKKNSAR"]
00219         where the leading spaces and trailing newlines have been removed.
00220 
00221         Returns tuple containing: (key as string, location string, qualifiers as list)
00222         as follows for this example:
00223 
00224         key = "CDS", string
00225         location = "complement(join(490883..490885,1..879))", string
00226         qualifiers = list of string tuples:
00227 
00228         [('locus_tag', '"NEQ001"'),
00229          ('note', '"conserved hypothetical [Methanococcus jannaschii];\nCOG1583:..."'),
00230          ('codon_start', '1'),
00231          ('transl_table', '11'),
00232          ('product', '"hypothetical protein"'),
00233          ('protein_id', '"NP_963295.1"'),
00234          ('db_xref', '"GI:41614797"'),
00235          ('db_xref', '"GeneID:2732620"'),
00236          ('translation', '"MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKK\nEKYFNFT..."')]
00237 
00238         In the above example, the "note" and "translation" were edited for compactness,
00239         and they would contain multiple new line characters (displayed above as \n)
00240 
00241         If a qualifier is quoted (in this case, everything except codon_start and
00242         transl_table) then the quotes are NOT removed.
00243 
00244         Note that no whitespace is removed.
00245         """
00246         #Skip any blank lines
00247         iterator = iter(filter(None, lines))
00248         try:
00249             line = iterator.next()
00250 
00251             feature_location = line.strip()
00252             while feature_location[-1:]==",":
00253                 #Multiline location, still more to come!
00254                 line = iterator.next()
00255                 feature_location += line.strip()
00256 
00257             qualifiers=[]
00258 
00259             for i, line in enumerate(iterator):
00260                 # check for extra wrapping of the location closing parentheses
00261                 if i == 0 and line.startswith(")"):
00262                     feature_location += line.strip()
00263                 elif line[0]=="/":
00264                     #New qualifier
00265                     i = line.find("=")
00266                     key = line[1:i] #does not work if i==-1
00267                     value = line[i+1:] #we ignore 'value' if i==-1
00268                     if i==-1:
00269                         #Qualifier with no value, e.g. /pseudo
00270                         key = line[1:]
00271                         qualifiers.append((key,None))
00272                     elif not value:
00273                         #ApE can output /note=
00274                         qualifiers.append((key,""))
00275                     elif value[0]=='"':
00276                         #Quoted...
00277                         if value[-1]!='"' or value!='"':
00278                             #No closing quote on the first line...
00279                             while value[-1] != '"':
00280                                 value += "\n" + iterator.next()
00281                         else:
00282                             #One single line (quoted)
00283                             assert value == '"'
00284                             if self.debug : print "Quoted line %s:%s" % (key, value)
00285                         #DO NOT remove the quotes...
00286                         qualifiers.append((key,value))
00287                     else:
00288                         #Unquoted
00289                         #if debug : print "Unquoted line %s:%s" % (key,value)
00290                         qualifiers.append((key,value))
00291                 else:
00292                     #Unquoted continuation
00293                     assert len(qualifiers) > 0
00294                     assert key==qualifiers[-1][0]
00295                     #if debug : print "Unquoted Cont %s:%s" % (key, line)
00296                     qualifiers[-1] = (key, qualifiers[-1][1] + "\n" + line)
00297             return (feature_key, feature_location, qualifiers)
00298         except StopIteration:
00299             #Bummer
00300             raise ValueError("Problem with '%s' feature:\n%s" \
00301                               % (feature_key, "\n".join(lines)))
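
The way qualifier lines are grouped into (key, value) tuples can be shown with a reduced version of the loop. This is a sketch, not the method itself: `split_qualifiers` is a hypothetical name, the location line is assumed to have been removed already, and any line starting with "/" is treated as a new qualifier (the real code keeps reading until the closing quote).

```python
def split_qualifiers(lines):
    """Group stripped qualifier lines into (key, value) tuples.

    Hypothetical simplification: quotes are kept, continuation lines are
    joined with a newline, and a qualifier without '=' gets the value None.
    Assumes a continuation never follows a qualifier whose value is None.
    """
    qualifiers = []
    for line in lines:
        if line.startswith("/"):
            key, sep, value = line[1:].partition("=")
            qualifiers.append((key, value if sep else None))
        else:
            # Continuation of the previous qualifier's value.
            key, value = qualifiers[-1]
            qualifiers[-1] = (key, value + "\n" + line)
    return qualifiers

pairs = split_qualifiers([
    '/locus_tag="NEQ001"',
    '/note="conserved hypothetical;',
    'COG1583"',
    "/codon_start=1",
    "/pseudo",
    "/note=",
])
```

As in the documented behaviour, the quotes are left in place and the line break inside the note is preserved as a newline character.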

def Bio.GenBank.Scanner.InsdcScanner.parse_features (   self,
  skip = False 
) [inherited]
Return list of tuples for the features (if present)

Each feature is returned as a tuple (key, location, qualifiers)
where key and location are strings (e.g. "CDS" and
"complement(join(490883..490885,1..879))") while qualifiers
is a list of two-string tuples, i.e. (qualifier key, qualifier value) pairs.

Assumes you have already read to the start of the features table.

Reimplemented in Bio.GenBank.Scanner._ImgtScanner.

Definition at line 126 of file Scanner.py.

00126 
00127     def parse_features(self, skip=False):
00128         """Return list of tuples for the features (if present)
00129 
00130         Each feature is returned as a tuple (key, location, qualifiers)
00131         where key and location are strings (e.g. "CDS" and
00132         "complement(join(490883..490885,1..879))") while qualifiers
00133         is a list of two-string tuples (feature qualifier keys and values).
00134 
00135         Assumes you have already read to the start of the features table.
00136         """
00137         if self.line.rstrip() not in self.FEATURE_START_MARKERS:
00138             if self.debug : print "Didn't find any feature table"
00139             return []
00140         
00141         while self.line.rstrip() in self.FEATURE_START_MARKERS:
00142             self.line = self.handle.readline()
00143 
00144         features = []
00145         line = self.line
00146         while True:
00147             if not line:
00148                 raise ValueError("Premature end of line during features table")
00149             if line[:self.HEADER_WIDTH].rstrip() in self.SEQUENCE_HEADERS:
00150                 if self.debug : print "Found start of sequence"
00151                 break
00152             line = line.rstrip()
00153             if line == "//":
00154                 raise ValueError("Premature end of features table, marker '//' found")
00155             if line in self.FEATURE_END_MARKERS:
00156                 if self.debug : print "Found end of features"
00157                 line = self.handle.readline()
00158                 break
00159             if line[2:self.FEATURE_QUALIFIER_INDENT].strip() == "":
00160                 #This is an empty feature line between qualifiers. Empty
00161                 #feature lines within qualifiers are handled below (ignored).
00162                 line = self.handle.readline()
00163                 continue
00164             
00165             if skip:
00166                 line = self.handle.readline()
00167                 while line[:self.FEATURE_QUALIFIER_INDENT] == self.FEATURE_QUALIFIER_SPACER:
00168                     line = self.handle.readline()
00169             else:
00170                 #Build up a list of the lines making up this feature:
00171                 if line[self.FEATURE_QUALIFIER_INDENT]!=" " \
00172                 and " " in line[self.FEATURE_QUALIFIER_INDENT:]:
00173                     #The feature table design enforces a length limit on the feature keys.
00174                     #Some third party files (e.g. IMGT's EMBL like files) solve this by
00175                     #over indenting the location and qualifiers.
00176                     feature_key, line = line[2:].strip().split(None,1)
00177                     feature_lines = [line]
00178                     warnings.warn("Overindented %s feature?" % feature_key)
00179                 else:
00180                     feature_key = line[2:self.FEATURE_QUALIFIER_INDENT].strip()
00181                     feature_lines = [line[self.FEATURE_QUALIFIER_INDENT:]]
00182                 line = self.handle.readline()
00183                 while line[:self.FEATURE_QUALIFIER_INDENT] == self.FEATURE_QUALIFIER_SPACER \
00184                 or line.rstrip() == "" : # cope with blank lines in the midst of a feature
00185                     #Use strip to remove any harmless trailing white space and leading
00186                     #white space (e.g. out of spec files with too much indentation)
00187                     feature_lines.append(line[self.FEATURE_QUALIFIER_INDENT:].strip())
00188                     line = self.handle.readline()
00189                 features.append(self.parse_feature(feature_key, feature_lines))
00190         self.line = line
00191         return features
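
The column-based split used above relies on FEATURE_QUALIFIER_INDENT = 21: the feature key sits in columns 3 to 21 and the location or qualifier text starts at column 22. A minimal sketch of that split follows; `split_feature_line` is a hypothetical helper and does not handle the over-indented case.

```python
FEATURE_QUALIFIER_INDENT = 21  # value listed for GenBankScanner

def split_feature_line(line, indent=FEATURE_QUALIFIER_INDENT):
    """Return (feature_key, rest) for one feature table line.

    An empty key means the line continues the previous feature.
    Hypothetical helper for illustration only.
    """
    key = line[2:indent].strip()
    rest = line[indent:].strip()
    return key, rest

# Build the example lines with explicit padding rather than hand-counted spaces.
first_line = "     CDS".ljust(FEATURE_QUALIFIER_INDENT) + "complement(join(490883..490885,1..879))"
continuation = " " * FEATURE_QUALIFIER_INDENT + '/locus_tag="NEQ001"'

key, location = split_feature_line(first_line)
cont_key, qualifier = split_feature_line(continuation)
```

An empty `cont_key` corresponds to the test `line[2:self.FEATURE_QUALIFIER_INDENT].strip() == ""` in the listing above.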

def Bio.GenBank.Scanner.GenBankScanner.parse_footer (   self )
Returns a tuple containing a list of any misc strings, and the sequence.

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 888 of file Scanner.py.

00888 
00889     def parse_footer(self):
00890         """returns a tuple containing a list of any misc strings, and the sequence"""
00891         assert self.line[:self.HEADER_WIDTH].rstrip() in self.SEQUENCE_HEADERS, \
00892                "Eh? '%s'" % self.line
00893 
00894         misc_lines = []
00895         while self.line[:self.HEADER_WIDTH].rstrip() in self.SEQUENCE_HEADERS \
00896         or self.line[:self.HEADER_WIDTH] == " "*self.HEADER_WIDTH \
00897         or "WGS" == self.line[:3]:
00898             misc_lines.append(self.line.rstrip())
00899             self.line = self.handle.readline()
00900             if not self.line:
00901                 raise ValueError("Premature end of file")
00902             self.line = self.line
00903 
00904         assert self.line[:self.HEADER_WIDTH].rstrip() not in self.SEQUENCE_HEADERS, \
00905                "Eh? '%s'" % self.line
00906 
00907         #Now just consume the sequence lines until reach the // marker
00908         #or a CONTIG line
00909         seq_lines = []
00910         line = self.line
00911         while True:
00912             if not line:
00913                 raise ValueError("Premature end of file in sequence data")
00914             line = line.rstrip()
00915             if not line:
00916                 import warnings
00917                 warnings.warn("Blank line in sequence data")
00918                 line = self.handle.readline()
00919                 continue
00920             if line=='//':
00921                 break
00922             if line.find('CONTIG')==0:
00923                 break
00924             if len(line) > 9 and  line[9:10]!=' ':
00925                 raise ValueError("Sequence line mal-formed, '%s'" % line)
00926             seq_lines.append(line[10:]) #remove spaces later
00927             line = self.handle.readline()
00928 
00929         self.line = line
00930         #Seq("".join(seq_lines), self.alphabet)
00931         return (misc_lines,"".join(seq_lines).replace(" ",""))
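
The sequence-reading loop above drops the first ten characters of each line (the position number and its separating space) and then removes the spaces between blocks. A reduced sketch, assuming the lines are already in a list and using the hypothetical name `join_sequence_lines`:

```python
def join_sequence_lines(lines):
    """Concatenate GenBank-style sequence lines into one string.

    Hypothetical helper: stops at the '//' terminator, skips blank lines,
    and rejects lines whose tenth character is not a space.
    """
    chunks = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # the real method warns about blank lines
        if line == "//":
            break
        if len(line) > 9 and line[9:10] != " ":
            raise ValueError("Sequence line mal-formed, '%s'" % line)
        chunks.append(line[10:])
    return "".join(chunks).replace(" ", "")

# Position numbers are right-justified in the first nine columns.
example = [
    f"{1:>9} gatcctccat atacaacggt",
    f"{21:>9} atctccacct",
    "//",
]
sequence = join_sequence_lines(example)
```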

def Bio.GenBank.Scanner.InsdcScanner.parse_header (   self ) [inherited]
Return list of strings making up the header

New line characters are removed.

Assumes you have just read in the ID/LOCUS line.

Definition at line 95 of file Scanner.py.

00095 
00096     def parse_header(self):
00097         """Return list of strings making up the header
00098 
00099         New line characters are removed.
00100 
00101         Assumes you have just read in the ID/LOCUS line.
00102         """
00103         assert self.line[:self.HEADER_WIDTH]==self.RECORD_START, \
00104                "Not at start of record"
00105         
00106         header_lines = []
00107         while True:
00108             line = self.handle.readline()
00109             if not line:
00110                 raise ValueError("Premature end of line during sequence data")
00111             line = line.rstrip()
00112             if line in self.FEATURE_START_MARKERS:
00113                 if self.debug : print "Found header table"
00114                 break
00115             #if line[:self.HEADER_WIDTH]==self.FEATURE_START_MARKER[:self.HEADER_WIDTH]:
00116             #    if self.debug : print "Found header table (?)"
00117             #    break
00118             if line[:self.HEADER_WIDTH].rstrip() in self.SEQUENCE_HEADERS:
00119                 if self.debug : print "Found start of sequence"
00120                 break
00121             if line == "//":
00122                 raise ValueError("Premature end of sequence data marker '//' found")
00123             header_lines.append(line)
00124         self.line = line
00125         return header_lines

def Bio.GenBank.Scanner.InsdcScanner.parse_records (   self,
  handle,
  do_features = True 
) [inherited]
Returns a SeqRecord object iterator

Each record (from the ID/LOCUS line to the // line) becomes a SeqRecord

The SeqRecord objects include SeqFeatures if do_features=True

This method is intended for use in Bio.SeqIO

Definition at line 434 of file Scanner.py.

00434 
00435     def parse_records(self, handle, do_features=True):
00436         """Returns a SeqRecord object iterator
00437 
00438         Each record (from the ID/LOCUS line to the // line) becomes a SeqRecord
00439 
00440         The SeqRecord objects include SeqFeatures if do_features=True
00441         
00442         This method is intended for use in Bio.SeqIO
00443         """
00444         #This is a generator function
00445         while True:
00446             record = self.parse(handle, do_features)
00447             if record is None : break
00448             assert record.id is not None
00449             assert record.name != "<unknown name>"
00450             assert record.description != "<unknown description>"
00451             yield record
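
The generator pattern above, calling a single-record parser until it returns None, can be sketched with a toy parser. Everything here is an assumption made for illustration: `parse_one` is a hypothetical stand-in for parse() that returns the lines of one record instead of a SeqRecord.

```python
import io

def parse_one(handle):
    """Toy stand-in for parse(): return one record's lines, or None at end of input."""
    lines = []
    for line in iter(handle.readline, ""):
        line = line.rstrip()
        if line == "//":
            return lines  # end-of-record marker reached
        if line:
            lines.append(line)
    return lines or None  # None once the input is exhausted

def parse_records(handle):
    """Yield records one at a time until parse_one() reports end of input."""
    while True:
        record = parse_one(handle)
        if record is None:
            break
        yield record

handle = io.StringIO("LOCUS A\nORIGIN\n//\nLOCUS B\nORIGIN\n//\n")
records = list(parse_records(handle))
```

Because the function is a generator, records are produced lazily, so a multi-record file does not have to be held in memory at once.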

def Bio.GenBank.Scanner.InsdcScanner.set_handle (   self,
  handle 
) [inherited]

Definition at line 62 of file Scanner.py.

00062 
00063     def set_handle(self, handle):
00064         self.handle = handle
00065         self.line = ""


Member Data Documentation

Bio.GenBank.Scanner.InsdcScanner.debug [inherited]

Definition at line 59 of file Scanner.py.

list Bio.GenBank.Scanner.GenBankScanner.FEATURE_END_MARKERS = [] [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 883 of file Scanner.py.

int Bio.GenBank.Scanner.GenBankScanner.FEATURE_QUALIFIER_INDENT = 21 [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 884 of file Scanner.py.

string Bio.GenBank.Scanner.GenBankScanner.FEATURE_QUALIFIER_SPACER = " " [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 885 of file Scanner.py.

list Bio.GenBank.Scanner.GenBankScanner.FEATURE_START_MARKERS = ["FEATURES Location/Qualifiers","FEATURES"] [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 882 of file Scanner.py.

Bio.GenBank.Scanner.InsdcScanner.handle [inherited]

Definition at line 63 of file Scanner.py.

int Bio.GenBank.Scanner.GenBankScanner.HEADER_WIDTH = 12 [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 881 of file Scanner.py.

Bio.GenBank.Scanner.GenBankScanner.line

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 898 of file Scanner.py.

string Bio.GenBank.Scanner.GenBankScanner.RECORD_START = "LOCUS " [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 880 of file Scanner.py.

list Bio.GenBank.Scanner.GenBankScanner.SEQUENCE_HEADERS = ["CONTIG", "ORIGIN", "BASE COUNT", "WGS"] [static]

Reimplemented from Bio.GenBank.Scanner.InsdcScanner.

Definition at line 886 of file Scanner.py.


The documentation for this class was generated from the following file: