Back to index

python-biopython  1.60
__init__.py
Go to the documentation of this file.
00001 # Copyright 2008-2010 by Peter Cock.  All rights reserved.
00002 # This code is part of the Biopython distribution and governed by its
00003 # license.  Please see the LICENSE file that should have been included
00004 # as part of this package.
00005 
00006 """Multiple sequence alignment input/output as alignment objects.
00007 
00008 The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in
00009 fact the two are connected internally.  Both modules use the same set of file
00010 format names (lower case strings).  From the user's perspective, you can read
00011 in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you
00012 can read in the sequences within these alignmenta using Bio.SeqIO.
00013 
00014 Bio.AlignIO is also documented at U{http://biopython.org/wiki/AlignIO} and by
00015 a whole chapter in our tutorial:
00016  - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html}
00017  - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}
00018 
00019 Input
00020 =====
00021 For the typical special case when your file or handle contains one and only
00022 one alignment, use the function Bio.AlignIO.read().  This takes an input file
00023 handle (or in recent versions of Biopython a filename as a string), format
00024 string and optional number of sequences per alignment.  It will return a single
00025 MultipleSeqAlignment object (or raise an exception if there isn't just one
00026 alignment):
00027 
00028     >>> from Bio import AlignIO
00029     >>> align = AlignIO.read("Phylip/interlaced.phy", "phylip")
00030     >>> print align
00031     SingleLetterAlphabet() alignment with 3 rows and 384 columns
00032     -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI
00033     MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU
00034     ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN
00035 
00036 For the general case, when the handle could contain any number of alignments,
00037 use the function Bio.AlignIO.parse(...) which takes the same arguments, but
00038 returns an iterator giving MultipleSeqAlignment objects (typically used in a
00039 for loop). If you want random access to the alignments by number, turn this
00040 into a list:
00041 
00042     >>> from Bio import AlignIO
00043     >>> alignments = list(AlignIO.parse("Emboss/needle.txt", "emboss"))
00044     >>> print alignments[2]
00045     SingleLetterAlphabet() alignment with 2 rows and 120 columns
00046     -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec
00047     LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver
00048 
00049 Most alignment file formats can be concatenated so as to hold as many
00050 different multiple sequence alignments as possible.  One common example
00051 is the output of the tool seqboot in the PHLYIP suite.  Sometimes there
00052 can be a file header and footer, as seen in the EMBOSS alignment output.
00053 
00054 Output
00055 ======
00056 Use the function Bio.AlignIO.write(...), which takes a complete set of
00057 Alignment objects (either as a list, or an iterator), an output file handle
00058 (or filename in recent versions of Biopython) and of course the file format::
00059 
00060     from Bio import AlignIO
00061     alignments = ...
00062     count = SeqIO.write(alignments, "example.faa", "fasta")
00063 
00064 If using a handle make sure to close it to flush the data to the disk::
00065 
00066     from Bio import AlignIO
00067     alignments = ...
00068     handle = open("example.faa", "w")
00069     count = SeqIO.write(alignments, handle, "fasta")
00070     handle.close()
00071 
00072 In general, you are expected to call this function once (with all your
00073 alignments) and then close the file handle.  However, for file formats
00074 like PHYLIP where multiple alignments are stored sequentially (with no file
00075 header and footer), then multiple calls to the write function should work as
00076 expected when using handles.
00077 
00078 If you are using a filename, the repeated calls to the write functions will
00079 overwrite the existing file each time.
00080 
00081 Conversion
00082 ==========
00083 The Bio.AlignIO.convert(...) function allows an easy interface for simple
00084 alignnment file format conversions. Additionally, it may use file format
00085 specific optimisations so this should be the fastest way too.
00086 
00087 In general however, you can combine the Bio.AlignIO.parse(...) function with
00088 the Bio.AlignIO.write(...) function for sequence file conversion. Using
00089 generator expressions provides a memory efficient way to perform filtering or
00090 other extra operations as part of the process.
00091 
00092 File Formats
00093 ============
00094 When specifying the file format, use lowercase strings.  The same format
00095 names are also used in Bio.SeqIO and include the following:
00096 
00097  - clustal   - Ouput from Clustal W or X, see also the module Bio.Clustalw
00098                which can be used to run the command line tool from Biopython.
00099  - emboss    - EMBOSS tools' "pairs" and "simple" alignment formats.
00100  - fasta     - The generic sequence file format where each record starts with
00101                an identifer line starting with a ">" character, followed by
00102                lines of sequence.
00103  - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA
00104                tools when used with the -m 10 command line option for machine
00105                readable output.
00106  - ig        - The IntelliGenetics file format, apparently the same as the
00107                MASE alignment format.
00108  - nexus     - Output from NEXUS, see also the module Bio.Nexus which can also
00109                read any phylogenetic trees in these files.
00110  - phylip    - Interlaced PHYLIP, as used by the PHLIP tools.
00111  - phylip-sequential - Sequential PHYLIP.
00112  - phylip-relaxed - PHYLIP like format allowing longer names.
00113  - stockholm - A richly annotated alignment file format used by PFAM.
00114 
00115 Note that while Bio.AlignIO can read all the above file formats, it cannot
00116 write to all of them.
00117 
00118 You can also use any file format supported by Bio.SeqIO, such as "fasta" or
00119 "ig" (which are listed above), PROVIDED the sequences in your file are all the
00120 same length.
00121 """
00122 
00123 # For using with statement in Python 2.5 or Jython
00124 from __future__ import with_statement
00125 
00126 __docformat__ = "epytext en" #not just plaintext
00127 
00128 #TODO
00129 # - define policy on reading aligned sequences with gaps in
00130 #   (e.g. - and . characters) including how the alphabet interacts
00131 #
00132 # - Can we build the to_alignment(...) functionality
00133 #   into the generic Alignment class instead?
00134 #
00135 # - How best to handle unique/non unique record.id when writing.
00136 #   For most file formats reading such files is fine; The stockholm
00137 #   parser would fail.
00138 #
00139 # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf)
00140 #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format
00141 
00142 from Bio.Align import MultipleSeqAlignment
00143 from Bio.Align.Generic import Alignment
00144 from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
00145 from Bio.File import as_handle
00146 
00147 import StockholmIO
00148 import ClustalIO
00149 import NexusIO
00150 import PhylipIO
00151 import EmbossIO
00152 import FastaIO
00153 
00154 #Convention for format names is "mainname-subtype" in lower case.
00155 #Please use the same names as BioPerl and EMBOSS where possible.
00156 
00157 _FormatToIterator = {#"fasta" is done via Bio.SeqIO
00158                      "clustal" : ClustalIO.ClustalIterator,
00159                      "emboss" : EmbossIO.EmbossIterator,
00160                      "fasta-m10" : FastaIO.FastaM10Iterator,
00161                      "nexus" : NexusIO.NexusIterator,
00162                      "phylip" : PhylipIO.PhylipIterator,
00163                      "phylip-sequential" : PhylipIO.SequentialPhylipIterator,
00164                      "phylip-relaxed" : PhylipIO.RelaxedPhylipIterator,
00165                      "stockholm" : StockholmIO.StockholmIterator,
00166                      }
00167 
00168 _FormatToWriter = {#"fasta" is done via Bio.SeqIO
00169                    #"emboss" : EmbossIO.EmbossWriter, (unfinished)
00170                    "nexus" : NexusIO.NexusWriter,
00171                    "phylip" : PhylipIO.PhylipWriter,
00172                    "phylip-sequential" : PhylipIO.SequentialPhylipWriter,
00173                    "phylip-relaxed" : PhylipIO.RelaxedPhylipWriter,
00174                    "stockholm" : StockholmIO.StockholmWriter,
00175                    "clustal" : ClustalIO.ClustalWriter,
00176                    }
00177 
00178 def write(alignments, handle, format):
00179     """Write complete set of alignments to a file.
00180 
00181     Arguments:
00182      - alignments - A list (or iterator) of Alignment objects (ideally the
00183                    new MultipleSeqAlignment objects), or (if using Biopython
00184                    1.54 or later) a single alignment object.
00185      - handle    - File handle object to write to, or filename as string
00186                    (note older versions of Biopython only took a handle).
00187      - format    - lower case string describing the file format to write.
00188 
00189     You should close the handle after calling this function.
00190 
00191     Returns the number of alignments written (as an integer).
00192     """
00193     from Bio import SeqIO
00194 
00195     #Try and give helpful error messages:
00196     if not isinstance(format, basestring):
00197         raise TypeError("Need a string for the file format (lower case)")
00198     if not format:
00199         raise ValueError("Format required (lower case string)")
00200     if format != format.lower():
00201         raise ValueError("Format string '%s' should be lower case" % format)
00202 
00203     if isinstance(alignments, Alignment):
00204         #This raised an exception in older version of Biopython
00205         alignments = [alignments]
00206 
00207     with as_handle(handle, 'w') as fp:
00208         #Map the file format to a writer class
00209         if format in _FormatToWriter:
00210             writer_class = _FormatToWriter[format]
00211             count = writer_class(fp).write_file(alignments)
00212         elif format in SeqIO._FormatToWriter:
00213             #Exploit the existing SeqIO parser to the dirty work!
00214             #TODO - Can we make one call to SeqIO.write() and count the alignments?
00215             count = 0
00216             for alignment in alignments:
00217                 if not isinstance(alignment, Alignment):
00218                     raise TypeError(\
00219                         "Expect a list or iterator of Alignment objects.")
00220                 SeqIO.write(alignment, fp, format)
00221                 count += 1
00222         elif format in _FormatToIterator or format in SeqIO._FormatToIterator:
00223             raise ValueError("Reading format '%s' is supported, but not writing" \
00224                              % format)
00225         else:
00226             raise ValueError("Unknown format '%s'" % format)
00227 
00228     assert isinstance(count, int), "Internal error - the underlying %s " \
00229            "writer should have returned the alignment count, not %s" \
00230            % (format, repr(count))
00231 
00232     return count
00233 
00234 #This is a generator function!
00235 def _SeqIO_to_alignment_iterator(handle, format, alphabet=None, seq_count=None):
00236     """Uses Bio.SeqIO to create an MultipleSeqAlignment iterator (PRIVATE).
00237 
00238     Arguments:
00239      - handle    - handle to the file.
00240      - format    - string describing the file format.
00241      - alphabet  - optional Alphabet object, useful when the sequence type
00242                    cannot be automatically inferred from the file itself
00243                    (e.g. fasta, phylip, clustal)
00244      - seq_count - Optional integer, number of sequences expected in each
00245                    alignment.  Recommended for fasta format files.
00246 
00247     If count is omitted (default) then all the sequences in the file are
00248     combined into a single MultipleSeqAlignment.
00249     """
00250     from Bio import SeqIO
00251     assert format in SeqIO._FormatToIterator
00252 
00253     if seq_count:
00254         #Use the count to split the records into batches.
00255         seq_record_iterator = SeqIO.parse(handle, format, alphabet)
00256 
00257         records = []
00258         for record in seq_record_iterator:
00259             records.append(record)
00260             if len(records) == seq_count:
00261                 yield MultipleSeqAlignment(records, alphabet)
00262                 records = []
00263         if len(records) > 0:
00264             raise ValueError("Check seq_count argument, not enough sequences?")
00265     else:
00266         #Must assume that there is a single alignment using all
00267         #the SeqRecord objects:
00268         records = list(SeqIO.parse(handle, format, alphabet))
00269         if records:
00270             yield MultipleSeqAlignment(records, alphabet)
00271     raise StopIteration
00272 
00273 def _force_alphabet(alignment_iterator, alphabet):
00274     """Iterate over alignments, over-riding the alphabet (PRIVATE)."""
00275     #Assume the alphabet argument has been pre-validated
00276     given_base_class = _get_base_alphabet(alphabet).__class__
00277     for align in alignment_iterator:
00278         if not isinstance(_get_base_alphabet(align._alphabet),
00279                           given_base_class):
00280             raise ValueError("Specified alphabet %s clashes with "\
00281                              "that determined from the file, %s" \
00282                              % (repr(alphabet), repr(align._alphabet)))
00283         for record in align:
00284             if not isinstance(_get_base_alphabet(record.seq.alphabet),
00285                               given_base_class):
00286                 raise ValueError("Specified alphabet %s clashes with "\
00287                                  "that determined from the file, %s" \
00288                            % (repr(alphabet), repr(record.seq.alphabet)))
00289             record.seq.alphabet = alphabet
00290         align._alphabet = alphabet
00291         yield align
00292 
00293 def parse(handle, format, seq_count=None, alphabet=None):
00294     """Iterate over an alignment file as MultipleSeqAlignment objects.
00295 
00296     Arguments:
00297      - handle    - handle to the file, or the filename as a string
00298                    (note older verions of Biopython only took a handle).
00299      - format    - string describing the file format.
00300      - alphabet  - optional Alphabet object, useful when the sequence type
00301                    cannot be automatically inferred from the file itself
00302                    (e.g. fasta, phylip, clustal)
00303      - seq_count - Optional integer, number of sequences expected in each
00304                    alignment.  Recommended for fasta format files.
00305 
00306     If you have the file name in a string 'filename', use:
00307 
00308     >>> from Bio import AlignIO
00309     >>> filename = "Emboss/needle.txt"
00310     >>> format = "emboss"
00311     >>> for alignment in AlignIO.parse(filename, format):
00312     ...     print "Alignment of length", alignment.get_alignment_length()
00313     Alignment of length 124
00314     Alignment of length 119
00315     Alignment of length 120
00316     Alignment of length 118
00317     Alignment of length 125
00318 
00319     If you have a string 'data' containing the file contents, use:
00320 
00321     from Bio import AlignIO
00322     from StringIO import StringIO
00323     my_iterator = AlignIO.parse(StringIO(data), format)
00324 
00325     Use the Bio.AlignIO.read() function when you expect a single record only.
00326     """
00327     from Bio import SeqIO
00328 
00329     #Try and give helpful error messages:
00330     if not isinstance(format, basestring):
00331         raise TypeError("Need a string for the file format (lower case)")
00332     if not format:
00333         raise ValueError("Format required (lower case string)")
00334     if format != format.lower():
00335         raise ValueError("Format string '%s' should be lower case" % format)
00336     if alphabet is not None and not (isinstance(alphabet, Alphabet) or \
00337                                      isinstance(alphabet, AlphabetEncoder)):
00338         raise ValueError("Invalid alphabet, %s" % repr(alphabet))
00339     if seq_count is not None and not isinstance(seq_count, int):
00340         raise TypeError("Need integer for seq_count (sequences per alignment)")
00341 
00342     with as_handle(handle, 'rU') as fp:
00343         #Map the file format to a sequence iterator:
00344         if format in _FormatToIterator:
00345             iterator_generator = _FormatToIterator[format]
00346             if alphabet is None :
00347                 i = iterator_generator(fp, seq_count)
00348             else:
00349                 try:
00350                     #Initially assume the optional alphabet argument is supported
00351                     i = iterator_generator(fp, seq_count, alphabet=alphabet)
00352                 except TypeError:
00353                     #It isn't supported.
00354                     i = _force_alphabet(iterator_generator(fp, seq_count),
00355                                         alphabet)
00356 
00357         elif format in SeqIO._FormatToIterator:
00358             #Exploit the existing SeqIO parser to the dirty work!
00359             i = _SeqIO_to_alignment_iterator(fp, format,
00360                                                 alphabet=alphabet,
00361                                                 seq_count=seq_count)
00362         else:
00363             raise ValueError("Unknown format '%s'" % format)
00364 
00365         #This imposes some overhead... wait until we drop Python 2.4 to fix it
00366         for a in i:
00367             yield a
00368 
00369 def read(handle, format, seq_count=None, alphabet=None):
00370     """Turns an alignment file into a single MultipleSeqAlignment object.
00371 
00372     Arguments:
00373      - handle    - handle to the file, or the filename as a string
00374                    (note older verions of Biopython only took a handle).
00375      - format    - string describing the file format.
00376      - alphabet  - optional Alphabet object, useful when the sequence type
00377                    cannot be automatically inferred from the file itself
00378                    (e.g. fasta, phylip, clustal)
00379      - seq_count - Optional integer, number of sequences expected in each
00380                    alignment.  Recommended for fasta format files.
00381 
00382     If the handle contains no alignments, or more than one alignment,
00383     an exception is raised.  For example, using a PFAM/Stockholm file
00384     containing one alignment:
00385 
00386     >>> from Bio import AlignIO
00387     >>> filename = "Clustalw/protein.aln"
00388     >>> format = "clustal"
00389     >>> alignment = AlignIO.read(filename, format)
00390     >>> print "Alignment of length", alignment.get_alignment_length()
00391     Alignment of length 411
00392 
00393     If however you want the first alignment from a file containing
00394     multiple alignments this function would raise an exception.
00395 
00396     >>> from Bio import AlignIO
00397     >>> filename = "Emboss/needle.txt"
00398     >>> format = "emboss"
00399     >>> alignment = AlignIO.read(filename, format)
00400     Traceback (most recent call last):
00401         ...
00402     ValueError: More than one record found in handle
00403 
00404     Instead use:
00405 
00406     >>> from Bio import AlignIO
00407     >>> filename = "Emboss/needle.txt"
00408     >>> format = "emboss"
00409     >>> alignment = AlignIO.parse(filename, format).next()
00410     >>> print "First alignment has length", alignment.get_alignment_length()
00411     First alignment has length 124
00412 
00413     You must use the Bio.AlignIO.parse() function if you want to read multiple
00414     records from the handle.
00415     """
00416     iterator = parse(handle, format, seq_count, alphabet)
00417     try:
00418         first = iterator.next()
00419     except StopIteration:
00420         first = None
00421     if first is None:
00422         raise ValueError("No records found in handle")
00423     try:
00424         second = iterator.next()
00425     except StopIteration:
00426         second = None
00427     if second is not None:
00428         raise ValueError("More than one record found in handle")
00429     if seq_count:
00430         assert len(first)==seq_count
00431     return first
00432 
00433 def convert(in_file, in_format, out_file, out_format, alphabet=None):
00434     """Convert between two alignment files, returns number of alignments.
00435 
00436      - in_file - an input handle or filename
00437      - in_format - input file format, lower case string
00438      - output - an output handle or filename
00439      - out_file - output file format, lower case string
00440      - alphabet - optional alphabet to assume
00441 
00442     NOTE - If you provide an output filename, it will be opened which will
00443     overwrite any existing file without warning. This may happen if even the
00444     conversion is aborted (e.g. an invalid out_format name is given).
00445     """
00446     #TODO - Add optimised versions of important conversions
00447     #For now just off load the work to SeqIO parse/write
00448     with as_handle(in_file, 'rU') as in_handle:
00449         #Don't open the output file until we've checked the input is OK:
00450         alignments = parse(in_handle, in_format, None, alphabet)
00451 
00452         #This will check the arguments and issue error messages,
00453         #after we have opened the file which is a shame.
00454         with as_handle(out_file, 'w') as out_handle:
00455             count = write(alignments, out_handle, out_format)
00456 
00457     return count
00458 
00459 def _test():
00460     """Run the Bio.AlignIO module's doctests.
00461 
00462     This will try and locate the unit tests directory, and run the doctests
00463     from there in order that the relative paths used in the examples work.
00464     """
00465     import doctest
00466     import os
00467     if os.path.isdir(os.path.join("..", "..", "Tests")):
00468         print "Runing doctests..."
00469         cur_dir = os.path.abspath(os.curdir)
00470         os.chdir(os.path.join("..", "..", "Tests"))
00471         doctest.testmod()
00472         os.chdir(cur_dir)
00473         del cur_dir
00474         print "Done"
00475     elif os.path.isdir(os.path.join("Tests", "Fasta")):
00476         print "Runing doctests..."
00477         cur_dir = os.path.abspath(os.curdir)
00478         os.chdir(os.path.join("Tests"))
00479         doctest.testmod()
00480         os.chdir(cur_dir)
00481         del cur_dir
00482         print "Done"
00483 
00484 if __name__ == "__main__":
00485     _test()