Back to index

python-biopython  1.60
Public Member Functions | Static Public Attributes | Private Member Functions
Bio.SeqIO.InsdcIO.GenBankWriter Class Reference
Inheritance diagram for Bio.SeqIO.InsdcIO.GenBankWriter:
Inheritance graph
[legend]
Collaboration diagram for Bio.SeqIO.InsdcIO.GenBankWriter:
Collaboration graph
[legend]

List of all members.

Public Member Functions

def write_record

Static Public Attributes

int HEADER_WIDTH = 12
int QUALIFIER_INDENT = 21
int MAX_WIDTH = 80
string QUALIFIER_INDENT_STR = " "
string QUALIFIER_INDENT_TMP = " %s "

Private Member Functions

def _write_single_line
def _write_multi_line
def _write_multi_entries
def _get_date
def _get_data_division
def _write_the_first_line
def _write_references
def _write_comment
def _write_contig
def _write_sequence

Detailed Description

Definition at line 377 of file InsdcIO.py.


Member Function Documentation

def Bio.SeqIO.InsdcIO.GenBankWriter._get_data_division (   self,
  record 
) [private]

Definition at line 428 of file InsdcIO.py.

00428 
00429     def _get_data_division(self, record):
00430         try:
00431             division = record.annotations["data_file_division"]
00432         except KeyError:
00433             division = "UNK"
00434         if division in ["PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT",
00435                         "VRL", "PHG", "SYN", "UNA", "EST", "PAT", "STS",
00436                         "GSS", "HTG", "HTC", "ENV", "CON"]:
00437             #Good, already GenBank style
00438             #    PRI - primate sequences
00439             #    ROD - rodent sequences
00440             #    MAM - other mammalian sequences
00441             #    VRT - other vertebrate sequences
00442             #    INV - invertebrate sequences
00443             #    PLN - plant, fungal, and algal sequences
00444             #    BCT - bacterial sequences [plus archea]
00445             #    VRL - viral sequences
00446             #    PHG - bacteriophage sequences
00447             #    SYN - synthetic sequences
00448             #    UNA - unannotated sequences
00449             #    EST - EST sequences (expressed sequence tags) 
00450             #    PAT - patent sequences
00451             #    STS - STS sequences (sequence tagged sites) 
00452             #    GSS - GSS sequences (genome survey sequences) 
00453             #    HTG - HTGS sequences (high throughput genomic sequences) 
00454             #    HTC - HTC sequences (high throughput cDNA sequences) 
00455             #    ENV - Environmental sampling sequences
00456             #    CON - Constructed sequences
00457             #
00458             #(plus UNK for unknown)
00459             pass
00460         else:
00461             #See if this is in EMBL style:
00462             #    Division                 Code
00463             #    -----------------        ----
00464             #    Bacteriophage            PHG - common
00465             #    Environmental Sample     ENV - common
00466             #    Fungal                   FUN - map to PLN (plants + fungal)
00467             #    Human                    HUM - map to PRI (primates)
00468             #    Invertebrate             INV - common
00469             #    Other Mammal             MAM - common
00470             #    Other Vertebrate         VRT - common
00471             #    Mus musculus             MUS - map to ROD (rodent)
00472             #    Plant                    PLN - common
00473             #    Prokaryote               PRO - map to BCT (poor name)
00474             #    Other Rodent             ROD - common
00475             #    Synthetic                SYN - common
00476             #    Transgenic               TGN - ??? map to SYN ???
00477             #    Unclassified             UNC - map to UNK
00478             #    Viral                    VRL - common
00479             #
00480             #(plus XXX for submiting which we can map to UNK)
00481             embl_to_gbk = {"FUN":"PLN",
00482                            "HUM":"PRI",
00483                            "MUS":"ROD",
00484                            "PRO":"BCT",
00485                            "UNC":"UNK",
00486                            "XXX":"UNK",
00487                            }
00488             try:
00489                 division = embl_to_gbk[division]
00490             except KeyError:
00491                 division = "UNK"
00492         assert len(division)==3
00493         return division

def Bio.SeqIO.InsdcIO.GenBankWriter._get_date (   self,
  record 
) [private]

Definition at line 408 of file InsdcIO.py.

00408 
00409     def _get_date(self, record) :
00410         default = "01-JAN-1980"
00411         try :
00412             date = record.annotations["date"]
00413         except KeyError :
00414             return default
00415         #Cope with a list of one string:
00416         if isinstance(date, list) and len(date)==1 :
00417             date = date[0]
00418         #TODO - allow a Python date object
00419         if not isinstance(date, basestring) or len(date) != 11 \
00420         or date[2] != "-" or date[6] != "-" \
00421         or not date[:2].isdigit() or not date[7:].isdigit() \
00422         or int(date[:2]) > 31 \
00423         or date[3:6] not in ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
00424                              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"] :
00425             #TODO - Check is a valid date (e.g. not 31 Feb)
00426             return default
00427         return date

def Bio.SeqIO.InsdcIO.GenBankWriter._write_comment (   self,
  record 
) [private]

Definition at line 624 of file InsdcIO.py.

00624 
00625     def _write_comment(self, record):
00626         #This is a bit complicated due to the range of possible
00627         #ways people might have done their annotation...
00628         #Currently the parser uses a single string with newlines.
00629         #A list of lines is also reasonable.
00630         #A single (long) string is perhaps the most natural of all.
00631         #This means we may need to deal with line wrapping.
00632         comment = record.annotations["comment"]
00633         if isinstance(comment, basestring):
00634             lines = comment.split("\n")
00635         elif isinstance(comment, list) or isinstance(comment, tuple):
00636             lines = comment
00637         else:
00638             raise ValueError("Could not understand comment annotation")
00639         self._write_multi_line("COMMENT", lines[0])
00640         for line in lines[1:]:
00641             self._write_multi_line("", line)

Here is the call graph for this function:

def Bio.SeqIO.InsdcIO.GenBankWriter._write_contig (   self,
  record 
) [private]

Definition at line 642 of file InsdcIO.py.

00642 
00643     def _write_contig(self, record):
00644         max_len = self.MAX_WIDTH - self.HEADER_WIDTH
00645         lines = self._split_contig(record, max_len)
00646         self._write_single_line("CONTIG", lines[0])
00647         for text in lines[1:] :
00648             self._write_single_line("", text)

Here is the call graph for this function:

def Bio.SeqIO.InsdcIO.GenBankWriter._write_multi_entries (   self,
  tag,
  text_list 
) [private]

Definition at line 399 of file InsdcIO.py.

00399 
00400     def _write_multi_entries(self, tag, text_list):
00401         #used for DBLINK and any similar later line types.
00402         #If the list of strings is empty, nothing is written.
00403         for i, text in enumerate(text_list):
00404             if i == 0:
00405                 self._write_single_line(tag, text)
00406             else:
00407                 self._write_single_line("", text)

Here is the call graph for this function:

def Bio.SeqIO.InsdcIO.GenBankWriter._write_multi_line (   self,
  tag,
  text 
) [private]

Definition at line 390 of file InsdcIO.py.

00390 
00391     def _write_multi_line(self, tag, text):
00392         "Used in the the 'header' of each GenBank record."""
00393         #TODO - Do the line spliting while preserving white space?
00394         max_len = self.MAX_WIDTH - self.HEADER_WIDTH
00395         lines = self._split_multi_line(text, max_len)
00396         self._write_single_line(tag, lines[0])
00397         for line in lines[1:]:
00398             self._write_single_line("", line)

Here is the caller graph for this function:

def Bio.SeqIO.InsdcIO.GenBankWriter._write_references (   self,
  record 
) [private]

Definition at line 581 of file InsdcIO.py.

00581 
00582     def _write_references(self, record):
00583         number = 0
00584         for ref in record.annotations["references"]:
00585             if not isinstance(ref, SeqFeature.Reference):
00586                 continue
00587             number += 1
00588             data = str(number)
00589             #TODO - support more complex record reference locations?
00590             if ref.location and len(ref.location)==1:
00591                 a = Alphabet._get_base_alphabet(record.seq.alphabet)
00592                 if isinstance(a, Alphabet.ProteinAlphabet):
00593                     units = "residues"
00594                 else:
00595                     units = "bases"
00596                 data += "  (%s %i to %i)" % (units,
00597                                              ref.location[0].nofuzzy_start+1,
00598                                              ref.location[0].nofuzzy_end)
00599             self._write_single_line("REFERENCE", data)
00600             if ref.authors:
00601                 #We store the AUTHORS data as a single string
00602                 self._write_multi_line("  AUTHORS", ref.authors)
00603             if ref.consrtm:
00604                 #We store the consortium as a single string
00605                 self._write_multi_line("  CONSRTM", ref.consrtm)
00606             if ref.title:
00607                 #We store the title as a single string
00608                 self._write_multi_line("  TITLE", ref.title)
00609             if ref.journal:
00610                 #We store this as a single string - holds the journal name,
00611                 #volume, year, and page numbers of the citation
00612                 self._write_multi_line("  JOURNAL", ref.journal)
00613             if ref.medline_id:
00614                 #This line type is obsolete and was removed from the GenBank
00615                 #flatfile format in April 2005. Should we write it?
00616                 #Note this has a two space indent:
00617                 self._write_multi_line("  MEDLINE", ref.medline_id)
00618             if ref.pubmed_id:
00619                 #Note this has a THREE space indent:
00620                 self._write_multi_line("   PUBMED", ref.pubmed_id)
00621             if ref.comment:
00622                 self._write_multi_line("  REMARK", ref.comment)
00623             

Here is the call graph for this function:

def Bio.SeqIO.InsdcIO.GenBankWriter._write_sequence (   self,
  record 
) [private]

Definition at line 649 of file InsdcIO.py.

00649 
00650     def _write_sequence(self, record):
00651         #Loosely based on code from Howard Salis
00652         #TODO - Force lower case?
00653         LETTERS_PER_LINE = 60
00654         SEQUENCE_INDENT = 9
00655 
00656         if isinstance(record.seq, UnknownSeq):
00657             #We have already recorded the length, and there is no need
00658             #to record a long sequence of NNNNNNN...NNN or whatever.
00659             if "contig" in record.annotations:
00660                 self._write_contig(record)
00661             else:
00662                 self.handle.write("ORIGIN\n")
00663             return
00664 
00665         #Catches sequence being None:
00666         data = self._get_seq_string(record).lower()
00667         seq_len = len(data)
00668         self.handle.write("ORIGIN\n")
00669         for line_number in range(0, seq_len, LETTERS_PER_LINE):
00670             self.handle.write(str(line_number+1).rjust(SEQUENCE_INDENT))
00671             for words in range(line_number,
00672                                min(line_number+LETTERS_PER_LINE, seq_len), 10):
00673                 self.handle.write(" %s" % data[words:words+10])
00674             self.handle.write("\n")
        

Here is the call graph for this function:

def Bio.SeqIO.InsdcIO.GenBankWriter._write_single_line (   self,
  tag,
  text 
) [private]

Definition at line 381 of file InsdcIO.py.

00381 
00382     def _write_single_line(self, tag, text):
00383         "Used in the the 'header' of each GenBank record."""
00384         assert len(tag) < self.HEADER_WIDTH
00385         if len(text) > self.MAX_WIDTH - self.HEADER_WIDTH:
00386             import warnings
00387             warnings.warn("Annotation %r too long for %s line" % (text, tag))
00388         self.handle.write("%s%s\n" % (tag.ljust(self.HEADER_WIDTH),
00389                                       text.replace("\n", " ")))

Here is the caller graph for this function:

def Bio.SeqIO.InsdcIO.GenBankWriter._write_the_first_line (   self,
  record 
) [private]
Write the LOCUS line.

Definition at line 494 of file InsdcIO.py.

00494 
00495     def _write_the_first_line(self, record):
00496         """Write the LOCUS line."""
00497         
00498         locus = record.name
00499         if not locus or locus == "<unknown name>":
00500             locus = record.id
00501         if not locus or locus == "<unknown id>":
00502             locus = self._get_annotation_str(record, "accession", just_first=True)
00503         if len(locus) > 16:
00504             raise ValueError("Locus identifier %r is too long" % str(locus))
00505 
00506         if len(record) > 99999999999:
00507             #Currently GenBank only officially support up to 350000, but
00508             #the length field can take eleven digits
00509             raise ValueError("Sequence too long!")
00510 
00511         #Get the base alphabet (underneath any Gapped or StopCodon encoding)
00512         a = Alphabet._get_base_alphabet(record.seq.alphabet)
00513         if not isinstance(a, Alphabet.Alphabet):
00514             raise TypeError("Invalid alphabet")
00515         elif isinstance(a, Alphabet.ProteinAlphabet):
00516             units = "aa"
00517         elif isinstance(a, Alphabet.NucleotideAlphabet):
00518             units = "bp"
00519         else:
00520             #Must be something like NucleotideAlphabet or
00521             #just the generic Alphabet (default for fasta files)
00522             raise ValueError("Need a Nucleotide or Protein alphabet")
00523 
00524         #Get the molecule type
00525         #TODO - record this explicitly in the parser?
00526         if isinstance(a, Alphabet.ProteinAlphabet):
00527             mol_type = ""
00528         elif isinstance(a, Alphabet.DNAAlphabet):
00529             mol_type = "DNA"
00530         elif isinstance(a, Alphabet.RNAAlphabet):
00531             mol_type = "RNA"
00532         else:
00533             #Must be something like NucleotideAlphabet or
00534             #just the generic Alphabet (default for fasta files)
00535             raise ValueError("Need a DNA, RNA or Protein alphabet")
00536         
00537         division = self._get_data_division(record)
00538         
00539         assert len(units) == 2
00540         assert len(division) == 3
00541         #TODO - date
00542         #TODO - mol_type
00543         line = "LOCUS       %s %s %s    %s           %s %s\n" \
00544                      % (locus.ljust(16),
00545                         str(len(record)).rjust(11),
00546                         units,
00547                         mol_type.ljust(6),
00548                         division,
00549                         self._get_date(record))
00550         assert len(line) == 79+1, repr(line) #plus one for new line
00551 
00552         assert line[12:28].rstrip() == locus, \
00553                'LOCUS line does not contain the locus at the expected position:\n' + line
00554         assert line[28:29] == " "
00555         assert line[29:40].lstrip() == str(len(record)), \
00556                'LOCUS line does not contain the length at the expected position:\n' + line
00557 
00558         #Tests copied from Bio.GenBank.Scanner
00559         assert line[40:44] in [' bp ', ' aa '] , \
00560                'LOCUS line does not contain size units at expected position:\n' + line
00561         assert line[44:47] in ['   ', 'ss-', 'ds-', 'ms-'], \
00562                'LOCUS line does not have valid strand type (Single stranded, ...):\n' + line
00563         assert line[47:54].strip() == "" \
00564         or line[47:54].strip().find('DNA') != -1 \
00565         or line[47:54].strip().find('RNA') != -1, \
00566                'LOCUS line does not contain valid sequence type (DNA, RNA, ...):\n' + line
00567         assert line[54:55] == ' ', \
00568                'LOCUS line does not contain space at position 55:\n' + line
00569         assert line[55:63].strip() in ['', 'linear', 'circular'], \
00570                'LOCUS line does not contain valid entry (linear, circular, ...):\n' + line
00571         assert line[63:64] == ' ', \
00572                'LOCUS line does not contain space at position 64:\n' + line
00573         assert line[67:68] == ' ', \
00574                'LOCUS line does not contain space at position 68:\n' + line
00575         assert line[70:71] == '-', \
00576                'LOCUS line does not contain - at position 71 in date:\n' + line
00577         assert line[74:75] == '-', \
00578                'LOCUS line does not contain - at position 75 in date:\n' + line
00579 
00580         self.handle.write(line)

Here is the call graph for this function:

def Bio.SeqIO.InsdcIO.GenBankWriter.write_record (   self,
  record 
)
Write a single record to the output file.

Definition at line 675 of file InsdcIO.py.

00675 
00676     def write_record(self, record):
00677         """Write a single record to the output file."""
00678         handle = self.handle
00679         self._write_the_first_line(record)
00680 
00681         accession = self._get_annotation_str(record, "accession",
00682                                              record.id.split(".", 1)[0],
00683                                              just_first=True)
00684         acc_with_version = accession
00685         if record.id.startswith(accession+"."):
00686             try:
00687                 acc_with_version = "%s.%i" \
00688                                    % (accession,
00689                                       int(record.id.split(".", 1)[1]))
00690             except ValueError:
00691                 pass
00692         gi = self._get_annotation_str(record, "gi", just_first=True)
00693 
00694         descr = record.description
00695         if descr == "<unknown description>" : descr = "."
00696         self._write_multi_line("DEFINITION", descr)
00697         
00698         self._write_single_line("ACCESSION", accession)
00699         if gi != ".":
00700             self._write_single_line("VERSION", "%s  GI:%s" \
00701                                     % (acc_with_version, gi))
00702         else:
00703             self._write_single_line("VERSION", "%s" % (acc_with_version))
00704 
00705         #The NCBI only expect two types of link so far,
00706         #e.g. "Project:28471" and "Trace Assembly Archive:123456"
00707         #TODO - Filter the dbxrefs list to just these?
00708         self._write_multi_entries("DBLINK", record.dbxrefs)
00709 
00710         try:
00711             #List of strings
00712             #Keywords should be given separated with semi colons,
00713             keywords = "; ".join(record.annotations["keywords"])
00714             #with a trailing period:
00715             if not keywords.endswith(".") :
00716                 keywords += "."
00717         except KeyError:
00718             #If no keywords, there should be just a period:
00719             keywords = "."
00720         self._write_multi_line("KEYWORDS", keywords)
00721 
00722         if "segment" in record.annotations:
00723             #Deal with SEGMENT line found only in segmented records,
00724             #e.g. AH000819
00725             segment = record.annotations["segment"]
00726             if isinstance(segment, list):
00727                 assert len(segment)==1, segment
00728                 segment = segment[0]
00729             self._write_single_line("SEGMENT", segment)
00730 
00731         self._write_multi_line("SOURCE", \
00732                                 self._get_annotation_str(record, "source"))
00733         #The ORGANISM line MUST be a single line, as any continuation is the taxonomy
00734         org = self._get_annotation_str(record, "organism")
00735         if len(org) > self.MAX_WIDTH - self.HEADER_WIDTH:
00736             org = org[:self.MAX_WIDTH - self.HEADER_WIDTH-4]+"..."
00737         self._write_single_line("  ORGANISM", org)
00738         try:
00739             #List of strings
00740             #Taxonomy should be given separated with semi colons,
00741             taxonomy = "; ".join(record.annotations["taxonomy"])
00742             #with a trailing period:
00743             if not taxonomy.endswith(".") :
00744                 taxonomy += "."
00745         except KeyError:
00746             taxonomy = "."
00747         self._write_multi_line("", taxonomy)
00748 
00749         if "references" in record.annotations:
00750             self._write_references(record)
00751 
00752         if "comment" in record.annotations:
00753             self._write_comment(record)
00754 
00755         handle.write("FEATURES             Location/Qualifiers\n")
00756         rec_length = len(record)
00757         for feature in record.features:
00758             self._write_feature(feature, rec_length) 
00759         self._write_sequence(record)
00760         handle.write("//\n")

Here is the call graph for this function:

Here is the caller graph for this function:


Member Data Documentation

Definition at line 378 of file InsdcIO.py.

int Bio.SeqIO.InsdcIO._InsdcWriter.MAX_WIDTH = 80 [static, inherited]

Definition at line 236 of file InsdcIO.py.

Reimplemented from Bio.SeqIO.InsdcIO._InsdcWriter.

Definition at line 379 of file InsdcIO.py.

string Bio.SeqIO.InsdcIO._InsdcWriter.QUALIFIER_INDENT_STR = " " [static, inherited]

Reimplemented in Bio.SeqIO.InsdcIO.ImgtWriter, and Bio.SeqIO.InsdcIO.EmblWriter.

Definition at line 238 of file InsdcIO.py.

string Bio.SeqIO.InsdcIO._InsdcWriter.QUALIFIER_INDENT_TMP = " %s " [static, inherited]

Reimplemented in Bio.SeqIO.InsdcIO.ImgtWriter, and Bio.SeqIO.InsdcIO.EmblWriter.

Definition at line 239 of file InsdcIO.py.


The documentation for this class was generated from the following file: