Back to index

python-biopython  1.60
Functions | Variables
Bio.AlignIO.FastaIO Namespace Reference

Functions

def _extract_alignment_region
def FastaM10Iterator

Variables

 simple_example = \
tuple alignments = list(FastaM10Iterator(StringIO(simple_example)))
string path = "../../Tests/Fasta/"
list files = [f for f in os.listdir(path) if os.path.splitext(f)[-1] == ".m10"]

Function Documentation

def Bio.AlignIO.FastaIO._extract_alignment_region (   alignment_seq_with_flanking,
  annotation 
) [private]
Helper function for the main parsing code (PRIVATE).

To get the actual pairwise alignment sequences, we must first
translate the un-gapped sequence based coordinates into positions
in the gapped sequence (which may have a flanking region shown
using leading - characters).  To date, I have never seen any
trailing flanking region shown in the m10 file, but the
following code should also cope with that.

Note that this code seems to work fine even when the "sq_offset"
entries are prsent as a result of using the -X command line option.

Definition at line 31 of file FastaIO.py.

00031 
00032 def _extract_alignment_region(alignment_seq_with_flanking, annotation):
00033     """Helper function for the main parsing code (PRIVATE).
00034 
00035     To get the actual pairwise alignment sequences, we must first
00036     translate the un-gapped sequence based coordinates into positions
00037     in the gapped sequence (which may have a flanking region shown
00038     using leading - characters).  To date, I have never seen any
00039     trailing flanking region shown in the m10 file, but the
00040     following code should also cope with that.
00041 
00042     Note that this code seems to work fine even when the "sq_offset"
00043     entries are prsent as a result of using the -X command line option.
00044     """
00045     align_stripped = alignment_seq_with_flanking.strip("-")
00046     display_start = int(annotation['al_display_start'])
00047     if int(annotation['al_start']) <= int(annotation['al_stop']):
00048         start = int(annotation['al_start']) \
00049               - display_start
00050         end   = int(annotation['al_stop']) \
00051               - display_start + 1
00052     else:
00053         #FASTA has flipped this sequence...
00054         start = display_start \
00055               - int(annotation['al_start'])
00056         end   = display_start \
00057               - int(annotation['al_stop']) + 1
00058     end += align_stripped.count("-")
00059     assert 0 <= start and start < end and end <= len(align_stripped), \
00060            "Problem with sequence start/stop,\n%s[%i:%i]\n%s" \
00061            % (alignment_seq_with_flanking, start, end, annotation)
00062     return align_stripped[start:end]

Here is the caller graph for this function:

def Bio.AlignIO.FastaIO.FastaM10Iterator (   handle,
  alphabet = single_letter_alphabet 
)
Alignment iterator for the FASTA tool's pairwise alignment output.

This is for reading the pairwise alignments output by Bill Pearson's
FASTA program when called with the -m 10 command line option for machine
readable output.  For more details about the FASTA tools, see the website
http://fasta.bioch.virginia.edu/ and the paper:

     W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

This class is intended to be used via the Bio.AlignIO.parse() function
by specifying the format as "fasta-m10" as shown in the following code:

    from Bio import AlignIO
    handle = ...
    for a in AlignIO.parse(handle, "fasta-m10"):
        assert len(a) == 2, "Should be pairwise!"
        print "Alignment length %i" % a.get_alignment_length()
        for record in a:
            print record.seq, record.name, record.id

Note that this is not a full blown parser for all the information
in the FASTA output - for example, most of the header and all of the
footer is ignored.  Also, the alignments are not batched according to
the input queries.

Also note that there can be up to about 30 letters of flanking region
included in the raw FASTA output as contextual information.  This is NOT
part of the alignment itself, and is not included in the resulting
MultipleSeqAlignment objects returned.

Definition at line 63 of file FastaIO.py.

00063 
00064 def FastaM10Iterator(handle, alphabet = single_letter_alphabet):
00065     """Alignment iterator for the FASTA tool's pairwise alignment output.
00066 
00067     This is for reading the pairwise alignments output by Bill Pearson's
00068     FASTA program when called with the -m 10 command line option for machine
00069     readable output.  For more details about the FASTA tools, see the website
00070     http://fasta.bioch.virginia.edu/ and the paper:
00071 
00072          W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
00073 
00074     This class is intended to be used via the Bio.AlignIO.parse() function
00075     by specifying the format as "fasta-m10" as shown in the following code:
00076 
00077         from Bio import AlignIO
00078         handle = ...
00079         for a in AlignIO.parse(handle, "fasta-m10"):
00080             assert len(a) == 2, "Should be pairwise!"
00081             print "Alignment length %i" % a.get_alignment_length()
00082             for record in a:
00083                 print record.seq, record.name, record.id
00084 
00085     Note that this is not a full blown parser for all the information
00086     in the FASTA output - for example, most of the header and all of the
00087     footer is ignored.  Also, the alignments are not batched according to
00088     the input queries.
00089 
00090     Also note that there can be up to about 30 letters of flanking region
00091     included in the raw FASTA output as contextual information.  This is NOT
00092     part of the alignment itself, and is not included in the resulting
00093     MultipleSeqAlignment objects returned.
00094     """
00095     if alphabet is None:
00096         alphabet = single_letter_alphabet
00097     
00098     state_PREAMBLE = -1
00099     state_NONE = 0
00100     state_QUERY_HEADER = 1
00101     state_ALIGN_HEADER = 2
00102     state_ALIGN_QUERY = 3
00103     state_ALIGN_MATCH = 4
00104     state_ALIGN_CONS = 5
00105 
00106     def build_hsp():
00107         if not query_tags and not match_tags:
00108             raise ValueError("No data for query %r, match %r" \
00109                              % (query_id, match_id))
00110         assert query_tags, query_tags
00111         assert match_tags, match_tags
00112         evalue = align_tags.get("fa_expect", None)
00113         q = "?" #Just for printing len(q) in debug below
00114         m = "?" #Just for printing len(m) in debug below
00115         tool = global_tags.get("tool", "").upper()
00116         try:
00117             q = _extract_alignment_region(query_seq, query_tags)
00118             if tool in ["TFASTX"] and len(match_seq) == len(q):
00119                 m = match_seq
00120                 #Quick hack until I can work out how -, * and / characters
00121                 #and the apparent mix of aa and bp coordindates works.
00122             else:
00123                 m = _extract_alignment_region(match_seq, match_tags)
00124             assert len(q) == len(m)
00125         except AssertionError, err:
00126             print "Darn... amino acids vs nucleotide coordinates?"
00127             print tool
00128             print query_seq
00129             print query_tags
00130             print q, len(q)
00131             print match_seq
00132             print match_tags
00133             print m, len(m)
00134             print handle.name
00135             raise err
00136 
00137         assert alphabet is not None
00138         alignment = MultipleSeqAlignment([], alphabet)
00139 
00140         #TODO - Introduce an annotated alignment class?
00141         #For now, store the annotation a new private property:
00142         alignment._annotations = {}
00143         
00144         #Want to record both the query header tags, and the alignment tags.
00145         for key, value in header_tags.iteritems():
00146             alignment._annotations[key] = value
00147         for key, value in align_tags.iteritems():
00148             alignment._annotations[key] = value
00149         
00150         #Query
00151         #=====
00152         record = SeqRecord(Seq(q, alphabet),
00153                            id = query_id,
00154                            name = "query",
00155                            description = query_descr,
00156                            annotations = {"original_length" : int(query_tags["sq_len"])})
00157         #TODO - handle start/end coordinates properly. Short term hack for now:
00158         record._al_start = int(query_tags["al_start"])
00159         record._al_stop = int(query_tags["al_stop"])
00160         alignment.append(record)
00161 
00162         #TODO - What if a specific alphabet has been requested?
00163         #TODO - Use an IUPAC alphabet?
00164         #TODO - Can FASTA output RNA?
00165         if alphabet == single_letter_alphabet and "sq_type" in query_tags:
00166             if query_tags["sq_type"] == "D":
00167                 record.seq.alphabet = generic_dna
00168             elif query_tags["sq_type"] == "p":
00169                 record.seq.alphabet = generic_protein
00170         if "-" in q:
00171             if not hasattr(record.seq.alphabet,"gap_char"):
00172                 record.seq.alphabet = Gapped(record.seq.alphabet, "-")
00173 
00174         #Match
00175         #=====
00176         record = SeqRecord(Seq(m, alphabet),
00177                            id = match_id,
00178                            name = "match",
00179                            description = match_descr,
00180                            annotations = {"original_length" : int(match_tags["sq_len"])})
00181         #TODO - handle start/end coordinates properly. Short term hack for now:
00182         record._al_start = int(match_tags["al_start"])
00183         record._al_stop = int(match_tags["al_stop"])
00184         alignment.append(record)
00185 
00186         #This is still a very crude way of dealing with the alphabet:
00187         if alphabet == single_letter_alphabet and "sq_type" in match_tags:
00188             if match_tags["sq_type"] == "D":
00189                 record.seq.alphabet = generic_dna
00190             elif match_tags["sq_type"] == "p":
00191                 record.seq.alphabet = generic_protein
00192         if "-" in m:
00193             if not hasattr(record.seq.alphabet,"gap_char"):
00194                 record.seq.alphabet = Gapped(record.seq.alphabet, "-")
00195 
00196         return alignment
00197 
00198     state = state_PREAMBLE
00199     query_id = None
00200     match_id = None
00201     query_descr = ""
00202     match_descr = ""
00203     global_tags = {}
00204     header_tags = {}
00205     align_tags = {}
00206     query_tags = {}
00207     match_tags = {}
00208     query_seq = ""
00209     match_seq = ""
00210     cons_seq = ""
00211     for line in handle:
00212         if ">>>" in line and not line.startswith(">>>"):
00213             if query_id and match_id:
00214                 #This happens on old FASTA output which lacked an end of
00215                 #query >>><<< marker line.
00216                 yield build_hsp()
00217             state = state_NONE
00218             query_descr = line[line.find(">>>")+3:].strip()
00219             query_id = query_descr.split(None,1)[0]
00220             match_id = None
00221             header_tags = {}
00222             align_tags = {}
00223             query_tags = {}
00224             match_tags = {}
00225             query_seq = ""
00226             match_seq = ""
00227             cons_seq = ""
00228         elif line.startswith("!! No "):
00229             #e.g.
00230             #!! No library sequences with E() < 0.5
00231             #or on more recent versions,
00232             #No sequences with E() < 0.05
00233             assert state == state_NONE
00234             assert not header_tags
00235             assert not align_tags
00236             assert not match_tags
00237             assert not query_tags
00238             assert match_id is None
00239             assert not query_seq
00240             assert not match_seq
00241             assert not cons_seq
00242             query_id = None
00243         elif line.strip() in [">>><<<", ">>>///"]:
00244             #End of query, possible end of all queries
00245             if query_id and match_id:
00246                 yield build_hsp()
00247             state = state_NONE
00248             query_id = None
00249             match_id = None
00250             header_tags = {}
00251             align_tags = {}
00252             query_tags = {}
00253             match_tags = {}
00254             query_seq = ""
00255             match_seq = ""
00256             cons_seq = ""
00257         elif line.startswith(">>>"):
00258             #Should be start of a match!
00259             assert query_id is not None
00260             assert line[3:].split(", ",1)[0] == query_id, line
00261             assert match_id is None
00262             assert not header_tags
00263             assert not align_tags
00264             assert not query_tags
00265             assert not match_tags
00266             assert not match_seq
00267             assert not query_seq
00268             assert not cons_seq
00269             state = state_QUERY_HEADER
00270         elif line.startswith(">>"):
00271             #Should now be at start of a match alignment!
00272             if query_id and match_id:
00273                 yield build_hsp()
00274             align_tags = {}
00275             query_tags = {}
00276             match_tags = {}
00277             query_seq = ""
00278             match_seq = ""
00279             cons_seq = ""
00280             match_descr = line[2:].strip()
00281             match_id = match_descr.split(None,1)[0]
00282             state = state_ALIGN_HEADER
00283         elif line.startswith(">--"):
00284             #End of one HSP
00285             assert query_id and match_id, line
00286             yield build_hsp()
00287             #Clean up read for next HSP
00288             #but reuse header_tags
00289             align_tags = {}
00290             query_tags = {}
00291             match_tags = {}
00292             query_seq = ""
00293             match_seq = ""
00294             cons_seq = ""
00295             state = state_ALIGN_HEADER
00296         elif line.startswith(">"):
00297             if state == state_ALIGN_HEADER:
00298                 #Should be start of query alignment seq...
00299                 assert query_id is not None, line
00300                 assert match_id is not None, line
00301                 assert query_id.startswith(line[1:].split(None,1)[0]), line
00302                 state = state_ALIGN_QUERY
00303             elif state == state_ALIGN_QUERY:
00304                 #Should be start of match alignment seq
00305                 assert query_id is not None, line
00306                 assert match_id is not None, line
00307                 assert match_id.startswith(line[1:].split(None,1)[0]), line
00308                 state = state_ALIGN_MATCH
00309             elif state == state_NONE:
00310                 #Can get > as the last line of a histogram
00311                 pass
00312             else:
00313                 assert False, "state %i got %r" % (state, line)
00314         elif line.startswith("; al_cons"):
00315             assert state == state_ALIGN_MATCH, line
00316             state = state_ALIGN_CONS
00317             #Next line(s) should be consensus seq...
00318         elif line.startswith("; "):
00319             if ": " in line:
00320                 key, value = [s.strip() for s in line[2:].split(": ",1)]
00321             else:
00322                 import warnings
00323                 #Seen in lalign36, specifically version 36.3.4 Apr, 2011
00324                 #Fixed in version 36.3.5b Oct, 2011(preload8)
00325                 warnings.warn("Missing colon in line: %r" % line)
00326                 try:
00327                     key, value = [s.strip() for s in line[2:].split(" ",1)]
00328                 except ValueError:
00329                     raise ValueError("Bad line: %r" % line)
00330             if state == state_QUERY_HEADER:
00331                 header_tags[key] = value
00332             elif state == state_ALIGN_HEADER:
00333                 align_tags[key] = value
00334             elif state == state_ALIGN_QUERY:
00335                 query_tags[key] = value
00336             elif state == state_ALIGN_MATCH:
00337                 match_tags[key] = value
00338             else:
00339                 assert False, "Unexpected state %r, %r" % (state, line)
00340         elif state == state_ALIGN_QUERY:
00341             query_seq += line.strip()
00342         elif state == state_ALIGN_MATCH:
00343             match_seq += line.strip()
00344         elif state == state_ALIGN_CONS:
00345             cons_seq += line.strip("\n")
00346         elif state == state_PREAMBLE:
00347             if line.startswith("#"):
00348                 global_tags["command"] = line[1:].strip()
00349             elif line.startswith(" version "):
00350                 global_tags["version"] = line[9:].strip()
00351             elif " compares a " in line:
00352                 global_tags["tool"] = line[:line.find(" compares a ")].strip()
00353             elif " searches a " in line:
00354                 global_tags["tool"] = line[:line.find(" searches a ")].strip()
00355         else:
00356             pass
00357 

Here is the call graph for this function:


Variable Documentation

Definition at line 598 of file FastaIO.py.

list Bio.AlignIO.FastaIO.files = [f for f in os.listdir(path) if os.path.splitext(f)[-1] == ".m10"]

Definition at line 611 of file FastaIO.py.

string Bio.AlignIO.FastaIO.path = "../../Tests/Fasta/"

Definition at line 610 of file FastaIO.py.

Definition at line 362 of file FastaIO.py.