
python-biopython  1.60
Bio.SeqIO.InsdcIO Namespace Reference

Classes

class  _InsdcWriter
class  GenBankWriter
class  EmblWriter
class  ImgtWriter

Functions

def GenBankIterator
def EmblIterator
def ImgtIterator
def GenBankCdsFeatureIterator
def EmblCdsFeatureIterator
def _insdc_feature_position_string
def _insdc_location_string_ignoring_strand_and_subfeatures
def _insdc_feature_location_string
def compare_record
def compare_records
def compare_feature
def compare_features
def check_genbank_writer
def check_embl_writer

Variables

tuple handle = open("../../Tests/GenBank/%s" % filename)
tuple records = list(GenBankIterator(handle))

Function Documentation

def Bio.SeqIO.InsdcIO._insdc_feature_location_string (   feature,
  rec_length 
) [private]
Build a GenBank/EMBL location string from a SeqFeature (PRIVATE).

There is a choice of how to show joins on the reverse complement strand:
GenBank uses "complement(join(1..10,20..100))" while EMBL used to use
"join(complement(20..100),complement(1..10))" instead (but appears to have
now adopted the GenBank convention). Notice that the order of the entries
is reversed! This function therefore uses the first form. In this situation
we expect the parent feature and the two children to all be marked as
strand == -1, and in the order 0:10 then 19:100.

Also need to consider dual-strand examples like these from the Arabidopsis
thaliana chloroplast NC_000932: join(complement(69611..69724),139856..140650)
gene ArthCp047, GeneID:844801 or its CDS (protein NP_051038.1 GI:7525057)
which is further complicated by a splice:
join(complement(69611..69724),139856..140087,140625..140650)

For this mixed strand feature, the parent SeqFeature should have
no strand (either 0 or None) while the child features should have either
strand +1 or -1 as appropriate, and be listed in the order given here.

Definition at line 183 of file InsdcIO.py.

00183 
00184 def _insdc_feature_location_string(feature, rec_length):
00185     """Build a GenBank/EMBL location string from a SeqFeature (PRIVATE).
00186 
00187     There is a choice of how to show joins on the reverse complement strand:
00188     GenBank uses "complement(join(1..10,20..100))" while EMBL used to use
00189     "join(complement(20..100),complement(1..10))" instead (but appears to have
00190     now adopted the GenBank convention). Notice that the order of the entries
00191     is reversed! This function therefore uses the first form. In this situation
00192     we expect the parent feature and the two children to all be marked as
00193     strand == -1, and in the order 0:10 then 19:100.
00194 
00195     Also need to consider dual-strand examples like these from the Arabidopsis
00196     thaliana chloroplast NC_000932: join(complement(69611..69724),139856..140650)
00197     gene ArthCp047, GeneID:844801 or its CDS (protein NP_051038.1 GI:7525057)
00198     which is further complicated by a splice:
00199     join(complement(69611..69724),139856..140087,140625..140650)
00200 
00201     For this mixed strand feature, the parent SeqFeature should have
00202     no strand (either 0 or None) while the child features should have either
00203     strand +1 or -1 as appropriate, and be listed in the order given here.
00204     """
00205 
00206     if not feature.sub_features:
00207         #Non-recursive.
00208         #assert feature.location_operator == "", \
00209         #       "%s has no subfeatures but location_operator %s" \
00210         #       % (repr(feature), feature.location_operator)
00211         location = _insdc_location_string_ignoring_strand_and_subfeatures(feature.location, rec_length)
00212         if feature.strand == -1:
00213             location = "complement(%s)" % location
00214         return location
00215     # As noted above, treat reverse complement strand features carefully:
00216     if feature.strand == -1:
00217         for f in feature.sub_features:
00218             if f.strand != -1:
00219                 raise ValueError("Inconsistent strands: %r for parent, %r for child" \
00220                                  % (feature.strand, f.strand))
00221         return "complement(%s(%s))" \
00222                % (feature.location_operator,
00223                   ",".join(_insdc_location_string_ignoring_strand_and_subfeatures(f.location, rec_length) \
00224                            for f in feature.sub_features))
00225     #if feature.strand == +1:
00226     #    for f in feature.sub_features:
00227     #        assert f.strand == +1
00228     #This covers typical forward strand features, and also an evil mixed strand:
00229     assert feature.location_operator != ""
00230     return  "%s(%s)" % (feature.location_operator,
00231                         ",".join([_insdc_feature_location_string(f, rec_length) \
00232                                   for f in feature.sub_features]))
00233 
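The strand rules described above can be sketched without Biopython. This is a minimal illustration only: `Feature`, `span` and `location_string` are hypothetical stand-ins (not part of Bio.SeqIO.InsdcIO), and fuzzy positions and cross-record references are ignored.

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqFeature: 0-based half-open start/end,
# a strand, a location operator and a list of child features.
Feature = namedtuple("Feature", "start end strand operator children")


def span(f):
    """Format a 0-based half-open span as a 1-based INSDC range."""
    return "%i..%i" % (f.start + 1, f.end)


def location_string(f):
    """Sketch of the reverse-strand and mixed-strand handling."""
    if not f.children:
        text = span(f)
        return "complement(%s)" % text if f.strand == -1 else text
    if f.strand == -1:
        # Reverse-strand parent: every child must agree, and the whole
        # join is wrapped in a single complement(...).
        if any(c.strand != -1 for c in f.children):
            raise ValueError("Inconsistent strands")
        return "complement(%s(%s))" % (
            f.operator, ",".join(span(c) for c in f.children))
    # Forward or mixed strand: each child formats itself.
    return "%s(%s)" % (
        f.operator, ",".join(location_string(c) for c in f.children))


reverse = Feature(0, 100, -1, "join", [
    Feature(0, 10, -1, "", []),
    Feature(19, 100, -1, "", []),
])
mixed = Feature(69610, 140650, None, "join", [
    Feature(69610, 69724, -1, "", []),
    Feature(139855, 140650, +1, "", []),
])
print(location_string(reverse))  # complement(join(1..10,20..100))
print(location_string(mixed))    # join(complement(69611..69724),139856..140650)
```

The mixed-strand parent carries no strand of its own, so each child decides whether it needs its own complement(...) wrapper.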


def Bio.SeqIO.InsdcIO._insdc_feature_position_string (   pos,
  offset = 0 
) [private]
Build a GenBank/EMBL position string (PRIVATE).

Use offset=1 to add one to convert a start position from python counting.

Definition at line 103 of file InsdcIO.py.

00103 
00104 def _insdc_feature_position_string(pos, offset=0):
00105     """Build a GenBank/EMBL position string (PRIVATE).
00106 
00107     Use offset=1 to add one to convert a start position from python counting.
00108     """
00109     if isinstance(pos, SeqFeature.ExactPosition):
00110         return "%i" % (pos.position+offset)
00111     elif isinstance(pos, SeqFeature.WithinPosition):
00112         return "(%i.%i)" % (pos.position + offset,
00113                             pos.position + pos.extension + offset)
00114     elif isinstance(pos, SeqFeature.BetweenPosition):
00115         return "(%i^%i)" % (pos.position + offset,
00116                             pos.position + pos.extension + offset)
00117     elif isinstance(pos, SeqFeature.BeforePosition):
00118         return "<%i" % (pos.position + offset)
00119     elif isinstance(pos, SeqFeature.AfterPosition):
00120         return ">%i" % (pos.position + offset)
00121     elif isinstance(pos, SeqFeature.OneOfPosition):
00122         return "one-of(%s)" \
00123                % ",".join([_insdc_feature_position_string(p,offset) \
00124                            for p in pos.position_choices])
00125     elif isinstance(pos, SeqFeature.AbstractPosition):
00126         raise NotImplementedError("Please report this as a bug in Biopython.")
00127     else:
00128         raise ValueError("Expected a SeqFeature position object.")
00129 
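The effect of the offset argument can be seen in a reduced sketch. The classes below are hypothetical stand-ins for three of the SeqFeature position types, not the Biopython classes themselves; only the exact, before and after cases are covered.

```python
class Position:
    """Hypothetical stand-in for a SeqFeature position object."""

    def __init__(self, position):
        self.position = position


class Exact(Position):
    pass


class Before(Position):
    pass


class After(Position):
    pass


def position_string(pos, offset=0):
    """Format one position; offset=1 converts a Python-style start."""
    if isinstance(pos, Exact):
        return "%i" % (pos.position + offset)
    if isinstance(pos, Before):
        return "<%i" % (pos.position + offset)
    if isinstance(pos, After):
        return ">%i" % (pos.position + offset)
    raise ValueError("Expected a position object.")


# A Python slice [<11:15] becomes the one-based range <12..15:
start, end = Before(11), Exact(15)
print(position_string(start, offset=1) + ".." + position_string(end))  # <12..15
```

Only the start needs the offset, because a half-open Python end index already equals the one-based inclusive end.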


def Bio.SeqIO.InsdcIO._insdc_location_string_ignoring_strand_and_subfeatures (   location,
  rec_length 
) [private]

Definition at line 130 of file InsdcIO.py.

00130 
00131 def _insdc_location_string_ignoring_strand_and_subfeatures(location, rec_length):
00132     if location.ref:
00133         ref = "%s:" % location.ref
00134     else:
00135         ref = ""
00136     assert not location.ref_db
00137     if isinstance(location.start, SeqFeature.ExactPosition) \
00138     and isinstance(location.end, SeqFeature.ExactPosition) \
00139     and location.start.position == location.end.position:
00140         #Special case, for 12:12 return 12^13
00141         #(a zero length slice, meaning the point between two letters)
00142         if location.end.position == rec_length:
00143             #Very special case, for a between position at the end of a
00144             #sequence (used on some circular genomes, Bug 3098) we have
00145             #N:N so return N^1
00146             return "%s%i^1" % (ref, rec_length)
00147         else:
00148             return "%s%i^%i" % (ref, location.end.position,
00149                                 location.end.position+1)
00150     if isinstance(location.start, SeqFeature.ExactPosition) \
00151     and isinstance(location.end, SeqFeature.ExactPosition) \
00152     and location.start.position+1 == location.end.position:
00153         #Special case, for 11:12 return 12 rather than 12..12
00154         #(a length one slice, meaning a single letter)
00155         return "%s%i" % (ref, location.end.position)
00156     elif isinstance(location.start, SeqFeature.UnknownPosition) \
00157     or isinstance(location.end, SeqFeature.UnknownPosition):
00158         #Special case for features from SwissProt/UniProt files
00159         if isinstance(location.start, SeqFeature.UnknownPosition) \
00160         and isinstance(location.end, SeqFeature.UnknownPosition):
00161             #import warnings
00162             #warnings.warn("Feature with unknown location")
00163             #return "?"
00164             raise ValueError("Feature with unknown location")
00165         elif isinstance(location.start, SeqFeature.UnknownPosition):
00166             #Treat the unknown start position as a BeforePosition
00167             return "%s<%i..%s" \
00168                 % (ref,
00169                    location.nofuzzy_end,
00170                    _insdc_feature_position_string(location.end))
00171         else:
00172             #Treat the unknown end position as an AfterPosition
00173             return "%s%s..>%i" \
00174                 % (ref,
00175                    _insdc_feature_position_string(location.start),
00176                    location.nofuzzy_start)
00177     else:
00178         #Typical case, e.g. the slice 11:15 is written as 12..15
00179         return ref \
00180                + _insdc_feature_position_string(location.start, +1) \
00181                + ".." + \
00182                _insdc_feature_position_string(location.end)
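The special cases in the listing above can be exercised with plain integers. This is a simplified sketch that assumes exact positions only; `location_string` here is a hypothetical helper, and the `ref` prefix and unknown-position branches are left out.

```python
def location_string(start, end, rec_length):
    """Format a 0-based half-open slice start:end as INSDC text."""
    if start == end:
        # Zero-length slice: the point between two letters.
        if end == rec_length:
            # Between the last and first letters of a circular sequence.
            return "%i^1" % rec_length
        return "%i^%i" % (end, end + 1)
    if start + 1 == end:
        # Length-one slice: a single letter.
        return "%i" % end
    # Typical case: shift the start by one, keep the end.
    return "%i..%i" % (start + 1, end)


print(location_string(12, 12, 100))    # 12^13
print(location_string(100, 100, 100))  # 100^1
print(location_string(11, 12, 100))    # 12
print(location_string(11, 15, 100))    # 12..15
```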


def Bio.SeqIO.InsdcIO.check_embl_writer (   records )

Definition at line 1137 of file InsdcIO.py.

01137 
01138     def check_embl_writer(records):
01139         handle = StringIO()
01140         try:
01141             EmblWriter(handle).write_file(records)
01142         except ValueError, err:
01143             print err
01144             return
01145         handle.seek(0)
01146 
01147         records2 = list(EmblIterator(handle))
01148         assert compare_records(records, records2)


def Bio.SeqIO.InsdcIO.check_genbank_writer (   records )

Definition at line 1129 of file InsdcIO.py.

01129 
01130     def check_genbank_writer(records):
01131         handle = StringIO()
01132         GenBankWriter(handle).write_file(records)
01133         handle.seek(0)
01134 
01135         records2 = list(GenBankIterator(handle))
01136         assert compare_records(records, records2)
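Both writer checks follow the same round-trip pattern: write to an in-memory handle, rewind, parse again, compare. A minimal sketch of that pattern follows, using a toy tab-separated format. `write_records`, `parse_records` and `check_round_trip` are hypothetical helpers invented for illustration, not Biopython functions.

```python
from io import StringIO


def write_records(records, handle):
    """Toy writer: one 'name<TAB>sequence' line per record."""
    for name, seq in records:
        handle.write("%s\t%s\n" % (name, seq))


def parse_records(handle):
    """Toy parser: the inverse of write_records."""
    for line in handle:
        name, seq = line.rstrip("\n").split("\t")
        yield name, seq


def check_round_trip(records):
    """Write, rewind, re-parse and compare with the input."""
    handle = StringIO()
    write_records(records, handle)
    handle.seek(0)  # rewind so the parser starts at the beginning
    return list(parse_records(handle)) == records


print(check_round_trip([("rec1", "ACGT"), ("rec2", "GGCC")]))  # True
```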


def Bio.SeqIO.InsdcIO.compare_feature (   old,
  new,
  ignore_sub_features = False 
)
Check two SeqFeatures agree.

Definition at line 1085 of file InsdcIO.py.

01085 
01086     def compare_feature(old, new, ignore_sub_features=False):
01087         """Check two SeqFeatures agree."""
01088         if old.type != new.type:
01089             raise ValueError("Type %s versus %s" % (old.type, new.type))
01090         if old.location.nofuzzy_start != new.location.nofuzzy_start \
01091         or old.location.nofuzzy_end != new.location.nofuzzy_end:
01092             raise ValueError("%s versus %s:\n%s\nvs:\n%s" \
01093                              % (old.location, new.location, str(old), str(new)))
01094         if old.strand != new.strand:
01095             raise ValueError("Different strand:\n%s\nvs:\n%s" % (str(old), str(new)))
01096         if old.location.start != new.location.start:
01097             raise ValueError("Start %s versus %s:\n%s\nvs:\n%s" \
01098                              % (old.location.start, new.location.start, str(old), str(new)))
01099         if old.location.end != new.location.end:
01100             raise ValueError("End %s versus %s:\n%s\nvs:\n%s" \
01101                              % (old.location.end, new.location.end, str(old), str(new)))
01102         if not ignore_sub_features:
01103             if len(old.sub_features) != len(new.sub_features):
01104                 raise ValueError("Different sub features")
01105             for a, b in zip(old.sub_features, new.sub_features):
01106                 if not compare_feature(a, b):
01107                     return False
01108         #This only checks key shared qualifiers
01109         #Would a white list be easier?
01110         #for key in ["name", "gene", "translation", "codon_table", "codon_start", "locus_tag"]:
01111         for key in set(old.qualifiers).intersection(new.qualifiers):
01112             if key in ["db_xref", "protein_id", "product", "note"]:
01113                 #EMBL and GenBank files use different references/notes/etc
01114                 continue
01115             if old.qualifiers[key] != new.qualifiers[key]:
01116                 raise ValueError("Qualifier mis-match for %s:\n%s\n%s" \
01117                                  % (key, old.qualifiers[key], new.qualifiers[key]))
01118         return True
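The qualifier check at the end of the listing only looks at keys present in both features and skips a few that legitimately differ between EMBL and GenBank. A sketch with plain dictionaries follows; `compare_qualifiers` and the sample data are hypothetical.

```python
# Keys that may differ between EMBL and GenBank, so they are skipped.
IGNORED = {"db_xref", "protein_id", "product", "note"}


def compare_qualifiers(old, new, ignored=IGNORED):
    """Check that shared, non-ignored qualifiers have equal values."""
    for key in set(old).intersection(new):
        if key in ignored:
            continue
        if old[key] != new[key]:
            raise ValueError("Qualifier mismatch for %s: %r vs %r"
                             % (key, old[key], new[key]))
    return True


genbank = {"gene": ["psbA"], "note": ["from GenBank"], "codon_start": ["1"]}
embl = {"gene": ["psbA"], "note": ["from EMBL"], "locus_tag": ["X"]}
print(compare_qualifiers(genbank, embl))  # True
```

Keys found in only one of the two dictionaries (here `codon_start` and `locus_tag`) are never compared, which is why the check is described as covering only shared qualifiers.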

def Bio.SeqIO.InsdcIO.compare_features (   old_list,
  new_list,
  ignore_sub_features = False 
)
Check two lists of SeqFeatures agree, raises a ValueError if mismatch.

Definition at line 1119 of file InsdcIO.py.

01119 
01120     def compare_features(old_list, new_list, ignore_sub_features=False):
01121         """Check two lists of SeqFeatures agree, raises a ValueError if mismatch."""
01122         if len(old_list) != len(new_list):
01123             raise ValueError("%i vs %i features" % (len(old_list), len(new_list)))
01124         for old, new in zip(old_list, new_list):
01125             #This assumes they are in the same order
01126             if not compare_feature(old, new, ignore_sub_features):
01127                 return False
01128         return True


def Bio.SeqIO.InsdcIO.compare_record (   old,
  new 
)

Definition at line 1052 of file InsdcIO.py.

01052 
01053     def compare_record(old, new):
01054         if old.id != new.id and old.name != new.name:
01055             raise ValueError("'%s' or '%s' vs '%s' or '%s' records" \
01056                              % (old.id, old.name, new.id, new.name))
01057         if len(old.seq) != len(new.seq):
01058             raise ValueError("%i vs %i" % (len(old.seq), len(new.seq)))
01059         if str(old.seq).upper() != str(new.seq).upper():
01060             if len(old.seq) < 200:
01061                 raise ValueError("'%s' vs '%s'" % (old.seq, new.seq))
01062             else:
01063                 raise ValueError("'%s...' vs '%s...'" % (old.seq[:100], new.seq[:100]))
01064         if old.features and new.features:
01065             return compare_features(old.features, new.features)
01066         #Just insist on at least one word in common:
01067         if (old.description or new.description) \
01068         and not set(old.description.split()).intersection(new.description.split()):
01069             raise ValueError("%s versus %s" \
01070                              % (repr(old.description), repr(new.description)))
01071         #TODO - check annotation
01072         if "contig" in old.annotations:
01073             assert old.annotations["contig"] == \
01074                    new.annotations["contig"]
01075         return True
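The description test in the listing is deliberately loose: it only insists on at least one whitespace-separated word in common. A sketch of that rule in isolation, with `descriptions_overlap` as a hypothetical helper that returns a boolean instead of raising:

```python
def descriptions_overlap(old, new):
    """True if both are empty or they share at least one word."""
    if not (old or new):
        return True
    return bool(set(old.split()).intersection(new.split()))


print(descriptions_overlap(
    "Arabidopsis thaliana chloroplast, complete genome",
    "Arabidopsis thaliana chloroplast genomic DNA"))  # True
print(descriptions_overlap("foo", "bar"))             # False
```

Note that in the listing this check is only reached when at least one of the two records has no features, because records with features return early from the feature comparison.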


def Bio.SeqIO.InsdcIO.compare_records (   old_list,
  new_list 
)
Check two lists of SeqRecords agree, raises a ValueError if mismatch.

Definition at line 1076 of file InsdcIO.py.

01076 
01077     def compare_records(old_list, new_list):
01078         """Check two lists of SeqRecords agree, raises a ValueError if mismatch."""
01079         if len(old_list) != len(new_list):
01080             raise ValueError("%i vs %i records" % (len(old_list), len(new_list)))
01081         for old, new in zip(old_list, new_list):
01082             if not compare_record(old, new):
01083                 return False
01084         return True


def Bio.SeqIO.InsdcIO.EmblCdsFeatureIterator (   handle,
  alphabet = Alphabet.generic_protein 
)
Breaks up an EMBL file into SeqRecord objects for each CDS feature.

Every section from the ID line to the terminating // can contain
many CDS features.  These are returned as SeqRecord objects with the
stated amino acid translation sequence (if given).

Definition at line 93 of file InsdcIO.py.

00093 
00094 def EmblCdsFeatureIterator(handle, alphabet=Alphabet.generic_protein):
00095     """Breaks up an EMBL file into SeqRecord objects for each CDS feature.
00096 
00097     Every section from the ID line to the terminating // can contain
00098     many CDS features.  These are returned as SeqRecord objects with the
00099     stated amino acid translation sequence (if given).
00100     """
00101     #This calls a generator function:
00102     return EmblScanner(debug=0).parse_cds_features(handle, alphabet)

def Bio.SeqIO.InsdcIO.EmblIterator (   handle )
Breaks up an EMBL file into SeqRecord objects.

Every section from the ID line to the terminating // becomes
a single SeqRecord with associated annotation and features.

Note that for genomes or chromosomes, there is typically only
one record.

Definition at line 61 of file InsdcIO.py.

00061 
00062 def EmblIterator(handle):
00063     """Breaks up an EMBL file into SeqRecord objects.
00064 
00065     Every section from the ID line to the terminating // becomes
00066     a single SeqRecord with associated annotation and features.
00067     
00068     Note that for genomes or chromosomes, there is typically only
00069     one record."""
00070     #This calls a generator function:
00071     return EmblScanner(debug=0).parse_records(handle)


def Bio.SeqIO.InsdcIO.GenBankCdsFeatureIterator (   handle,
  alphabet = Alphabet.generic_protein 
)
Breaks up a GenBank file into SeqRecord objects for each CDS feature.

Every section from the LOCUS line to the terminating // can contain
many CDS features.  These are returned as SeqRecord objects with the
stated amino acid translation sequence (if given).

Definition at line 83 of file InsdcIO.py.

00083 
00084 def GenBankCdsFeatureIterator(handle, alphabet=Alphabet.generic_protein):
00085     """Breaks up a GenBank file into SeqRecord objects for each CDS feature.
00086 
00087     Every section from the LOCUS line to the terminating // can contain
00088     many CDS features.  These are returned as SeqRecord objects with the
00089     stated amino acid translation sequence (if given).
00090     """
00091     #This calls a generator function:
00092     return GenBankScanner(debug=0).parse_cds_features(handle, alphabet)
    
def Bio.SeqIO.InsdcIO.GenBankIterator (   handle )
Breaks up a GenBank file into SeqRecord objects.

Every section from the LOCUS line to the terminating // becomes
a single SeqRecord with associated annotation and features.

Note that for genomes or chromosomes, there is typically only
one record.

Definition at line 50 of file InsdcIO.py.

00050 
00051 def GenBankIterator(handle):
00052     """Breaks up a GenBank file into SeqRecord objects.
00053 
00054     Every section from the LOCUS line to the terminating // becomes
00055     a single SeqRecord with associated annotation and features.
00056     
00057     Note that for genomes or chromosomes, there is typically only
00058     one record."""
00059     #This calls a generator function:
00060     return GenBankScanner(debug=0).parse_records(handle)
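The "LOCUS line to terminating //" record boundary can be illustrated with a small generator. This is a sketch of the boundary rule only, not how GenBankScanner is implemented; `iter_record_blocks` is a hypothetical helper that yields raw text blocks rather than SeqRecord objects.

```python
from io import StringIO


def iter_record_blocks(handle):
    """Yield the text of each record, from LOCUS to the closing //."""
    block = []
    for line in handle:
        if line.startswith("LOCUS"):
            block = [line]            # start of a new record
        elif block:
            block.append(line)
            if line.startswith("//"):
                yield "".join(block)  # record complete
                block = []


text = (
    "LOCUS       REC1\n"
    "ORIGIN\n"
    "        1 acgt\n"
    "//\n"
    "LOCUS       REC2\n"
    "ORIGIN\n"
    "        1 ggcc\n"
    "//\n"
)
blocks = list(iter_record_blocks(StringIO(text)))
print(len(blocks))  # 2
```

Because the function is a generator, records are produced one at a time as the handle is read, which matches the iterator style of the functions documented here.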


def Bio.SeqIO.InsdcIO.ImgtIterator (   handle )
Breaks up an IMGT file into SeqRecord objects.

Every section from the ID line to the terminating // becomes
a single SeqRecord with associated annotation and features.

Note that for genomes or chromosomes, there is typically only
one record.

Definition at line 72 of file InsdcIO.py.

00072 
00073 def ImgtIterator(handle):
00074     """Breaks up an IMGT file into SeqRecord objects.
00075 
00076     Every section from the ID line to the terminating // becomes
00077     a single SeqRecord with associated annotation and features.
00078     
00079     Note that for genomes or chromosomes, there is typically only
00080     one record."""
00081     #This calls a generator function:
00082     return _ImgtScanner(debug=0).parse_records(handle)


Variable Documentation

tuple Bio.SeqIO.InsdcIO.handle = open("../../Tests/GenBank/%s" % filename)

Definition at line 1154 of file InsdcIO.py.

tuple Bio.SeqIO.InsdcIO.records = list(GenBankIterator(handle))

Definition at line 1155 of file InsdcIO.py.