Back to index

python-biopython  1.60
QualityIO.py
Go to the documentation of this file.
00001 # Copyright 2009-2010 by Peter Cock.  All rights reserved.
00002 # This code is part of the Biopython distribution and governed by its
00003 # license.  Please see the LICENSE file that should have been included
00004 # as part of this package.
00005 #
00006 # This module is for reading and writing FASTQ and QUAL format files as
00007 # SeqRecord objects, and is expected to be used via the Bio.SeqIO API.
00008 
00009 """Bio.SeqIO support for the FASTQ and QUAL file formats.
00010 
00011 Note that you are expected to use this code via the Bio.SeqIO interface, as
00012 shown below.
00013 
00014 The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
00015 to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
00016 90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files
00017 are used containing the sequence and the quality information separately.
00018 
00019 The PHRED software reads DNA sequencing trace files, calls bases, and
00020 assigns a non-negative quality value to each called base using a logged
00021 transformation of the error probability, Q = -10 log10( Pe ), for example::
00022 
00023     Pe = 1.0,         Q =  0
00024     Pe = 0.1,         Q = 10
00025     Pe = 0.01,        Q = 20
00026     ...
00027     Pe = 0.00000001,  Q = 80
00028     Pe = 0.000000001, Q = 90
00029 
00030 In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.
00031 In the QUAL format these quality values are held as space separated text in
00032 a FASTA like file format.  In the FASTQ format, each quality values is encoded
00033 with a single ASCI character using chr(Q+33), meaning zero maps to the
00034 character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard
00035 the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
00036 quality are then stored in pairs in a FASTA like format.
00037 
00038 Unfortunately there is no official document describing the FASTQ file format,
00039 and worse, several related but different variants exist. For more details,
00040 please read this open access publication::
00041 
00042     The Sanger FASTQ file format for sequences with quality scores, and the
00043     Solexa/Illumina FASTQ variants.
00044     P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby),
00045     M.L.Heuer (BioJava) and P.M. Rice (EMBOSS).
00046     Nucleic Acids Research 2010 38(6):1767-1771
00047     http://dx.doi.org/10.1093/nar/gkp1137
00048 
00049 The good news is that Roche 454 sequencers can output files in the QUAL format,
00050 and sensibly they use PHREP style scores like Sanger.  Converting a pair of
00051 FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL
00052 files from a Roche 454 SFF binary file, use the Roche off instrument command
00053 line tool "sffinfo" with the -q or -qual argument.  You can extract a matching
00054 FASTA file using the -s or -seq argument instead.
00055 
00056 The bad news is that Solexa/Illumina did things differently - they have their
00057 own scoring system AND their own incompatible versions of the FASTQ format.
00058 Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
00059 be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a
00060 reasonable mapping can be achieved between them, and they are approximately
00061 equal for higher quality reads).
00062 
00063 Confusingly early Solexa pipelines produced a FASTQ like file but using their
00064 own score mapping and an ASCII offset of 64. To make things worse, for the
00065 Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the
00066 FASTQ file format, this time using PHRED scores (which is more consistent) but
00067 with an ASCII offset of 64.
00068 
00069 i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ
00070 file format: The original Sanger PHRED standard, and two from Solexa/Illumina.
00071 
00072 The good news is that as of CASAVA version 1.8, Illumina sequencers will
00073 produce FASTQ files using the standard Sanger encoding.
00074 
00075 You are expected to use this module via the Bio.SeqIO functions, with the
00076 following format names:
00077 
00078  - "qual" means simple quality files using PHRED scores (e.g. from Roche 454)
00079  - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII
00080     offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+).
00081     These can potentially hold PHRED scores from 0 to 93.
00082  - "fastq-sanger" is an alias for "fastq".
00083  - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ
00084     files, using Solexa scores with an ASCII offset 64. These can hold Solexa
00085     scores from -5 to 62.
00086  - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using
00087     PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0
00088     to 62.
00089 
00090 We could potentially add support for "qual-solexa" meaning QUAL files which
00091 contain Solexa scores, but thus far there isn't any reason to use such files.
00092 
00093 For example, consider the following short FASTQ file::
00094 
00095     @EAS54_6_R1_2_1_413_324
00096     CCCTTCTTGTCTTCAGCGTTTCTCC
00097     +
00098     ;;3;;;;;;;;;;;;7;;;;;;;88
00099     @EAS54_6_R1_2_1_540_792
00100     TTGGCAGGCCAAGGCCGATGGATCA
00101     +
00102     ;;;;;;;;;;;7;;;;;-;;;3;83
00103     @EAS54_6_R1_2_1_443_348
00104     GTTGCTTCTGGCGTGGGTGGGGGGG
00105     +
00106     ;;;;;;;;;;;9;7;;.7;393333
00107 
00108 This contains three reads of length 25.  From the read length these were
00109 probably originally from an early Solexa/Illumina sequencer but this file
00110 follows the Sanger FASTQ convention (PHRED style qualities with an ASCII
00111 offet of 33).  This means we can parse this file using Bio.SeqIO using
00112 "fastq" as the format name:
00113 
00114     >>> from Bio import SeqIO
00115     >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"):
00116     ...     print record.id, record.seq
00117     EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
00118     EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
00119     EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
00120 
00121 The qualities are held as a list of integers in each record's annotation:
00122 
00123     >>> print record
00124     ID: EAS54_6_R1_2_1_443_348
00125     Name: EAS54_6_R1_2_1_443_348
00126     Description: EAS54_6_R1_2_1_443_348
00127     Number of features: 0
00128     Per letter annotation for: phred_quality
00129     Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
00130     >>> print record.letter_annotations["phred_quality"]
00131     [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
00132 
00133 You can use the SeqRecord format method to show this in the QUAL format:
00134 
00135     >>> print record.format("qual")
00136     >EAS54_6_R1_2_1_443_348
00137     26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
00138     24 18 18 18 18
00139     <BLANKLINE>
00140 
00141 Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"):
00142 
00143     >>> print record.format("fastq")
00144     @EAS54_6_R1_2_1_443_348
00145     GTTGCTTCTGGCGTGGGTGGGGGGG
00146     +
00147     ;;;;;;;;;;;9;7;;.7;393333
00148     <BLANKLINE>
00149 
00150 Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset
00151 of 64):
00152 
00153     >>> print record.format("fastq-illumina")
00154     @EAS54_6_R1_2_1_443_348
00155     GTTGCTTCTGGCGTGGGTGGGGGGG
00156     +
00157     ZZZZZZZZZZZXZVZZMVZRXRRRR
00158     <BLANKLINE>
00159 
00160 You can also get Biopython to convert the scores and show a Solexa style
00161 FASTQ file:
00162 
00163     >>> print record.format("fastq-solexa")
00164     @EAS54_6_R1_2_1_443_348
00165     GTTGCTTCTGGCGTGGGTGGGGGGG
00166     +
00167     ZZZZZZZZZZZXZVZZMVZRXRRRR
00168     <BLANKLINE>
00169 
00170 Notice that this is actually the same output as above using "fastq-illumina"
00171 as the format! The reason for this is all these scores are high enough that
00172 the PHRED and Solexa scores are almost equal. The differences become apparent
00173 for poor quality reads. See the functions solexa_quality_from_phred and
00174 phred_quality_from_solexa for more details.
00175 
00176 If you wanted to trim your sequences (perhaps to remove low quality regions,
00177 or to remove a primer sequence), try slicing the SeqRecord objects.  e.g.
00178 
00179     >>> sub_rec = record[5:15]
00180     >>> print sub_rec
00181     ID: EAS54_6_R1_2_1_443_348
00182     Name: EAS54_6_R1_2_1_443_348
00183     Description: EAS54_6_R1_2_1_443_348
00184     Number of features: 0
00185     Per letter annotation for: phred_quality
00186     Seq('TTCTGGCGTG', SingleLetterAlphabet())
00187     >>> print sub_rec.letter_annotations["phred_quality"]
00188     [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
00189     >>> print sub_rec.format("fastq")
00190     @EAS54_6_R1_2_1_443_348
00191     TTCTGGCGTG
00192     +
00193     ;;;;;;9;7;
00194     <BLANKLINE>
00195     
00196 If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
00197 
00198     >>> from Bio import SeqIO
00199     >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
00200     >>> out_handle = open("Quality/temp.qual", "w")
00201     >>> SeqIO.write(record_iterator, out_handle, "qual")
00202     3
00203     >>> out_handle.close()
00204 
00205 You can of course read in a QUAL file, such as the one we just created:
00206 
00207     >>> from Bio import SeqIO
00208     >>> for record in SeqIO.parse("Quality/temp.qual", "qual"):
00209     ...     print record.id, record.seq
00210     EAS54_6_R1_2_1_413_324 ?????????????????????????
00211     EAS54_6_R1_2_1_540_792 ?????????????????????????
00212     EAS54_6_R1_2_1_443_348 ?????????????????????????
00213 
00214 Notice that QUAL files don't have a proper sequence present!  But the quality
00215 information is there:
00216 
00217     >>> print record
00218     ID: EAS54_6_R1_2_1_443_348
00219     Name: EAS54_6_R1_2_1_443_348
00220     Description: EAS54_6_R1_2_1_443_348
00221     Number of features: 0
00222     Per letter annotation for: phred_quality
00223     UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
00224     >>> print record.letter_annotations["phred_quality"]
00225     [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
00226 
00227 Just to keep things tidy, if you are following this example yourself, you can
00228 delete this temporary file now:
00229 
00230     >>> import os
00231     >>> os.remove("Quality/temp.qual")
00232 
00233 Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
00234 files.  Because the Bio.SeqIO system is designed for reading single files, you
00235 would have to read the two in separately and then combine the data.  However,
00236 since this is such a common thing to want to do, there is a helper iterator
00237 defined in this module that does this for you - PairedFastaQualIterator.
00238 
00239 Alternatively, if you have enough RAM to hold all the records in memory at once,
00240 then a simple dictionary approach would work:
00241 
00242     >>> from Bio import SeqIO
00243     >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta"))
00244     >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"):
00245     ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
00246 
00247 You can then access any record by its key, and get both the sequence and the
00248 quality scores.
00249 
00250     >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq")
00251     @EAS54_6_R1_2_1_540_792
00252     TTGGCAGGCCAAGGCCGATGGATCA
00253     +
00254     ;;;;;;;;;;;7;;;;;-;;;3;83
00255     <BLANKLINE>
00256 
00257 It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
00258 using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values,
00259 "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina"
00260 for the more recent variant), as this cannot be detected reliably
00261 automatically.
00262 
00263 To illustrate this problem, let's consider an artifical example:
00264 
00265     >>> from Bio.Seq import Seq
00266     >>> from Bio.Alphabet import generic_dna
00267     >>> from Bio.SeqRecord import SeqRecord
00268     >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test",
00269     ... description="Made up!")
00270     >>> print test.format("fasta")
00271     >Test Made up!
00272     NACGTACGTA
00273     <BLANKLINE>
00274     >>> print test.format("fastq")
00275     Traceback (most recent call last):
00276      ...
00277     ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).
00278 
00279 We created a sample SeqRecord, and can show it in FASTA format - but for QUAL
00280 or FASTQ format we need to provide some quality scores. These are held as a
00281 list of integers (one for each base) in the letter_annotations dictionary:
00282 
00283     >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40]
00284     >>> print test.format("qual")
00285     >Test Made up!
00286     0 1 2 3 4 5 10 20 30 40
00287     <BLANKLINE>
00288     >>> print test.format("fastq")
00289     @Test Made up!
00290     NACGTACGTA
00291     +
00292     !"#$%&+5?I
00293     <BLANKLINE>
00294 
00295 We can check this FASTQ encoding - the first PHRED quality was zero, and this
00296 mapped to a exclamation mark, while the final score was 40 and this mapped to
00297 the letter "I":
00298 
00299     >>> ord('!') - 33
00300     0
00301     >>> ord('I') - 33
00302     40
00303     >>> [ord(letter)-33 for letter in '!"#$%&+5?I']
00304     [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
00305 
00306 Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED
00307 scores with an offset of 64:
00308 
00309     >>> print test.format("fastq-illumina")
00310     @Test Made up!
00311     NACGTACGTA
00312     +
00313     @ABCDEJT^h
00314     <BLANKLINE>
00315 
00316 And we can check this too - the first PHRED score was zero, and this mapped to
00317 "@", while the final score was 40 and this mapped to "h":
00318 
00319     >>> ord("@") - 64
00320     0
00321     >>> ord("h") - 64
00322     40
00323     >>> [ord(letter)-64 for letter in "@ABCDEJT^h"]
00324     [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
00325 
00326 Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style
00327 FASTQ files look for the same data! Then we have the older Solexa/Illumina
00328 format to consider which encodes Solexa scores instead of PHRED scores.
00329 
00330 First let's see what Biopython says if we convert the PHRED scores into Solexa
00331 scores (rounding to one decimal place):
00332 
00333     >>> for q in [0,1,2,3,4,5,10,20,30,40]:
00334     ...     print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))
00335     PHRED 0 maps to Solexa -5.0
00336     PHRED 1 maps to Solexa -5.0
00337     PHRED 2 maps to Solexa -2.3
00338     PHRED 3 maps to Solexa -0.0
00339     PHRED 4 maps to Solexa 1.8
00340     PHRED 5 maps to Solexa 3.3
00341     PHRED 10 maps to Solexa 9.5
00342     PHRED 20 maps to Solexa 20.0
00343     PHRED 30 maps to Solexa 30.0
00344     PHRED 40 maps to Solexa 40.0
00345 
00346 Now here is the record using the old Solexa style FASTQ file:
00347 
00348     >>> print test.format("fastq-solexa")
00349     @Test Made up!
00350     NACGTACGTA
00351     +
00352     ;;>@BCJT^h
00353     <BLANKLINE>
00354 
00355 Again, this is using an ASCII offset of 64, so we can check the Solexa scores:
00356 
00357     >>> [ord(letter)-64 for letter in ";;>@BCJT^h"]
00358     [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]
00359 
00360 This explains why the last few letters of this FASTQ output matched that using
00361 the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores
00362 are approximately equal.
00363 
00364 """
00365 __docformat__ = "epytext en" #Don't just use plain text in epydoc API pages!
00366 
00367 from Bio.Alphabet import single_letter_alphabet
00368 from Bio.Seq import Seq, UnknownSeq
00369 from Bio.SeqRecord import SeqRecord
00370 from Bio.SeqIO.Interfaces import SequentialSequenceWriter
00371 from math import log
00372 import warnings
00373 from Bio import BiopythonWarning, BiopythonParserWarning
00374 
00375 
00376 # define score offsets. See discussion for differences between Sanger and
00377 # Solexa offsets.
00378 SANGER_SCORE_OFFSET = 33
00379 SOLEXA_SCORE_OFFSET = 64
00380 
00381 def solexa_quality_from_phred(phred_quality):
00382     """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
00383 
00384     PHRED and Solexa quality scores are both log transformations of a
00385     probality of error (high score = low probability of error). This function
00386     takes a PHRED score, transforms it back to a probability of error, and
00387     then re-expresses it as a Solexa score. This assumes the error estimates
00388     are equivalent.
00389 
00390     How does this work exactly? Well the PHRED quality is minus ten times the
00391     base ten logarithm of the probability of error::
00392     
00393      phred_quality = -10*log(error,10)
00394 
00395     Therefore, turning this round::
00396 
00397      error = 10 ** (- phred_quality / 10)
00398     
00399     Now, Solexa qualities use a different log transformation::
00400 
00401      solexa_quality = -10*log(error/(1-error),10)
00402 
00403     After substitution and a little manipulation we get::
00404 
00405       solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)
00406 
00407     However, real Solexa files use a minimum quality of -5. This does have a
00408     good reason - a random a random base call would be correct 25% of the time,
00409     and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
00410     quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
00411     nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.
00412 
00413     Taken literally, this logarithic formula would map a PHRED quality of zero
00414     to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
00415     score of zero means a probability of error of one (i.e. the base call is
00416     definitely wrong), which is worse than random! In practice, a PHRED quality
00417     of zero usually means a default value, or perhaps random - and therefore
00418     mapping it to the minimum Solexa score of -5 is reasonable.
00419 
00420     In conclusion, we follow EMBOSS, and take this logarithmic formula but also
00421     apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
00422     quality of zero to -5.0 as well.
00423 
00424     Note this function will return a floating point number, it is up to you to
00425     round this to the nearest integer if appropriate.  e.g.
00426 
00427     >>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
00428     80.00
00429     >>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
00430     50.00
00431     >>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
00432     19.96
00433     >>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
00434     9.54
00435     >>> print "%0.2f" % round(solexa_quality_from_phred(5),2)
00436     3.35
00437     >>> print "%0.2f" % round(solexa_quality_from_phred(4),2)
00438     1.80
00439     >>> print "%0.2f" % round(solexa_quality_from_phred(3),2)
00440     -0.02
00441     >>> print "%0.2f" % round(solexa_quality_from_phred(2),2)
00442     -2.33
00443     >>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
00444     -5.00
00445     >>> print "%0.2f" % round(solexa_quality_from_phred(0),2)
00446     -5.00
00447 
00448     Notice that for high quality reads PHRED and Solexa scores are numerically
00449     equal. The differences are important for poor quality reads, where PHRED
00450     has a minimum of zero but Solexa scores can be negative.
00451 
00452     Finally, as a special case where None is used for a "missing value", None
00453     is returned:
00454 
00455     >>> print solexa_quality_from_phred(None)
00456     None
00457     """
00458     if phred_quality is None:
00459         #Assume None is used as some kind of NULL or NA value; return None
00460         #e.g. Bio.SeqIO gives Ace contig gaps a quality of None.
00461         return None
00462     elif phred_quality > 0:
00463         #Solexa uses a minimum value of -5, which after rounding matches a
00464         #random nucleotide base call.
00465         return max(-5.0, 10*log(10**(phred_quality/10.0) - 1, 10))
00466     elif phred_quality == 0:
00467         #Special case, map to -5 as discussed in the docstring
00468         return -5.0
00469     else:
00470         raise ValueError("PHRED qualities must be positive (or zero), not %s" \
00471                          % repr(phred_quality))
00472 
00473 def phred_quality_from_solexa(solexa_quality):
00474     """Convert a Solexa quality (which can be negative) to a PHRED quality.
00475 
00476     PHRED and Solexa quality scores are both log transformations of a
00477     probality of error (high score = low probability of error). This function
00478     takes a Solexa score, transforms it back to a probability of error, and
00479     then re-expresses it as a PHRED score. This assumes the error estimates
00480     are equivalent.
00481 
00482     The underlying formulas are given in the documentation for the sister
00483     function solexa_quality_from_phred, in this case the operation is::
00484     
00485      phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)
00486 
00487     This will return a floating point number, it is up to you to round this to
00488     the nearest integer if appropriate.  e.g.
00489 
00490     >>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
00491     80.00
00492     >>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
00493     20.04
00494     >>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
00495     10.41
00496     >>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
00497     3.01
00498     >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2)
00499     1.19
00500 
00501     Note that a solexa_quality less then -5 is not expected, will trigger a
00502     warning, but will still be converted as per the logarithmic mapping
00503     (giving a number between 0 and 1.19 back).
00504 
00505     As a special case where None is used for a "missing value", None is
00506     returned:
00507 
00508     >>> print phred_quality_from_solexa(None)
00509     None
00510     """
00511     if solexa_quality is None:
00512         #Assume None is used as some kind of NULL or NA value; return None
00513         return None
00514     if solexa_quality < -5:
00515         warnings.warn("Solexa quality less than -5 passed, %s" \
00516                       % repr(solexa_quality), BiopythonWarning)
00517     return 10*log(10**(solexa_quality/10.0) + 1, 10)
00518 
00519 def _get_phred_quality(record):
00520     """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
00521 
00522     If there are no PHRED qualities, but there are Solexa qualities, those are
00523     used instead after conversion.
00524     """
00525     try:
00526         return record.letter_annotations["phred_quality"]
00527     except KeyError:
00528         pass
00529     try:
00530         return [phred_quality_from_solexa(q) for \
00531                 q in record.letter_annotations["solexa_quality"]]
00532     except KeyError:
00533         raise ValueError("No suitable quality scores found in "
00534                          "letter_annotations of SeqRecord (id=%s)." \
00535                          % record.id)
00536 
00537 #Only map 0 to 93, we need to give a warning on truncating at 93
00538 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp+SANGER_SCORE_OFFSET))) \
00539                                     for qp in range(0, 93+1))
00540 #Only map -5 to 93, we need to give a warning on truncating at 93
00541 _solexa_to_sanger_quality_str = dict( \
00542     (qs, chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET))) \
00543     for qs in range(-5, 93+1))
00544 def _get_sanger_quality_str(record):
00545     """Returns a Sanger FASTQ encoded quality string (PRIVATE).
00546 
00547     >>> from Bio.Seq import Seq
00548     >>> from Bio.SeqRecord import SeqRecord
00549     >>> r = SeqRecord(Seq("ACGTAN"), id="Test",
00550     ...               letter_annotations = {"phred_quality":[50,40,30,20,10,0]})
00551     >>> _get_sanger_quality_str(r)
00552     'SI?5+!'
00553 
00554     If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO),
00555     the PHRED qualities are integers, this function is able to use a very fast
00556     pre-cached mapping. However, if they are floats which differ slightly, then
00557     it has to do the appropriate rounding - which is slower:
00558 
00559     >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2",
00560     ...      letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]})
00561     >>> _get_sanger_quality_str(r2)
00562     'SI?5+!'
00563 
00564     If your scores include a None value, this raises an exception:
00565 
00566     >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3",
00567     ...               letter_annotations = {"phred_quality":[50,40,30,20,10,None]})
00568     >>> _get_sanger_quality_str(r3)
00569     Traceback (most recent call last):
00570        ...
00571     TypeError: A quality value of None was found
00572 
00573     If (strangely) your record has both PHRED and Solexa scores, then the PHRED
00574     scores are used in preference:
00575 
00576     >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4",
00577     ...               letter_annotations = {"phred_quality":[50,40,30,20,10,0],
00578     ...                                     "solexa_quality":[-5,-4,0,None,0,40]})
00579     >>> _get_sanger_quality_str(r4)
00580     'SI?5+!'
00581 
00582     If there are no PHRED scores, but there are Solexa scores, these are used
00583     instead (after the approriate conversion):
00584 
00585     >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5",
00586     ...      letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]})
00587     >>> _get_sanger_quality_str(r5)
00588     'I?5+$"'
00589 
00590     Again, integer Solexa scores can be looked up in a pre-cached mapping making
00591     this very fast. You can still use approximate floating point scores:
00592 
00593     >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6",
00594     ...      letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]})
00595     >>> _get_sanger_quality_str(r6)
00596     'I?5+$"'
00597 
00598     Notice that due to the limited range of printable ASCII characters, a
00599     PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ
00600     file (using ASCII 126, the tilde). This function will issue a warning
00601     in this situation.
00602     """
00603     #TODO - This functions works and is fast, but it is also ugly
00604     #and there is considerable repetition of code for the other
00605     #two FASTQ variants.
00606     try:
00607         #These take priority (in case both Solexa and PHRED scores found)
00608         qualities = record.letter_annotations["phred_quality"]
00609     except KeyError:
00610         #Fall back on solexa scores...
00611         pass
00612     else:
00613         #Try and use the precomputed mapping:
00614         try:
00615             return "".join([_phred_to_sanger_quality_str[qp] \
00616                             for qp in qualities])
00617         except KeyError:
00618             #Could be a float, or a None in the list, or a high value.
00619             pass
00620         if None in qualities:
00621             raise TypeError("A quality value of None was found")
00622         if max(qualities) >= 93.5:
00623             warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ",
00624                           BiopythonWarning)
00625         #This will apply the truncation at 93, giving max ASCII 126
00626         return "".join([chr(min(126, int(round(qp))+SANGER_SCORE_OFFSET)) \
00627                         for qp in qualities])
00628     #Fall back on the Solexa scores...
00629     try:
00630         qualities = record.letter_annotations["solexa_quality"]
00631     except KeyError:
00632         raise ValueError("No suitable quality scores found in "
00633                          "letter_annotations of SeqRecord (id=%s)." \
00634                          % record.id)
00635     #Try and use the precomputed mapping:
00636     try:
00637         return "".join([_solexa_to_sanger_quality_str[qs] \
00638                         for qs in qualities])
00639     except KeyError:
00640         #Either no PHRED scores, or something odd like a float or None
00641         pass
00642     if None in qualities:
00643         raise TypeError("A quality value of None was found")
00644     #Must do this the slow way, first converting the PHRED scores into
00645     #Solexa scores:
00646     if max(qualities) >= 93.5:
00647         warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ",
00648                       BiopythonWarning)
00649     #This will apply the truncation at 93, giving max ASCII 126
00650     return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET)) \
00651                     for qs in qualities])
00652 
00653 #Only map 0 to 62, we need to give a warning on truncating at 62
00654 assert 62+SOLEXA_SCORE_OFFSET == 126
00655 _phred_to_illumina_quality_str = dict((qp, chr(qp+SOLEXA_SCORE_OFFSET)) \
00656                                       for qp in range(0, 62+1))
00657 #Only map -5 to 62, we need to give a warning on truncating at 62
00658 _solexa_to_illumina_quality_str = dict( \
00659     (qs, chr(int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
00660     for qs in range(-5, 62+1))
00661 def _get_illumina_quality_str(record):
00662     """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE).
00663 
00664     Notice that due to the limited range of printable ASCII characters, a
00665     PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ
00666     file (using ASCII 126, the tilde). This function will issue a warning
00667     in this situation.
00668     """
00669     #TODO - This functions works and is fast, but it is also ugly
00670     #and there is considerable repetition of code for the other
00671     #two FASTQ variants.
00672     try:
00673         #These take priority (in case both Solexa and PHRED scores found)
00674         qualities = record.letter_annotations["phred_quality"]
00675     except KeyError:
00676         #Fall back on solexa scores...
00677         pass
00678     else:
00679         #Try and use the precomputed mapping:
00680         try:
00681             return "".join([_phred_to_illumina_quality_str[qp] \
00682                             for qp in qualities])
00683         except KeyError:
00684             #Could be a float, or a None in the list, or a high value.
00685             pass
00686         if None in qualities:
00687             raise TypeError("A quality value of None was found")
00688         if max(qualities) >= 62.5:
00689             warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ",
00690                           BiopythonWarning)
00691         #This will apply the truncation at 62, giving max ASCII 126
00692         return "".join([chr(min(126, int(round(qp))+SOLEXA_SCORE_OFFSET)) \
00693                         for qp in qualities])
00694     #Fall back on the Solexa scores...
00695     try:
00696         qualities = record.letter_annotations["solexa_quality"]
00697     except KeyError:
00698         raise ValueError("No suitable quality scores found in "
00699                          "letter_annotations of SeqRecord (id=%s)." \
00700                          % record.id)
00701     #Try and use the precomputed mapping:
00702     try:
00703         return "".join([_solexa_to_illumina_quality_str[qs] \
00704                         for qs in qualities])
00705     except KeyError:
00706         #Either no PHRED scores, or something odd like a float or None
00707         pass
00708     if None in qualities:
00709         raise TypeError("A quality value of None was found")
00710     #Must do this the slow way, first converting the PHRED scores into
00711     #Solexa scores:
00712     if max(qualities) >= 62.5:
00713         warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ",
00714                       BiopythonWarning)
00715     #This will apply the truncation at 62, giving max ASCII 126
00716     return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
00717                     for qs in qualities])
00718 
00719 #Only map 0 to 62, we need to give a warning on truncating at 62
00720 assert 62+SOLEXA_SCORE_OFFSET == 126
00721 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs+SOLEXA_SCORE_OFFSET))) \
00722                                      for qs in range(-5, 62+1))
00723 #Only map -5 to 62, we need to give a warning on truncating at 62
00724 _phred_to_solexa_quality_str = dict(\
00725     (qp, chr(min(126, int(round(solexa_quality_from_phred(qp)))+SOLEXA_SCORE_OFFSET))) \
00726     for qp in range(0, 62+1))
00727 def _get_solexa_quality_str(record):
00728     """Returns a Solexa FASTQ encoded quality string (PRIVATE).
00729 
00730     Notice that due to the limited range of printable ASCII characters, a
00731     Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ
00732     file (using ASCII 126, the tilde). This function will issue a warning
00733     in this situation.
00734     """
00735     #TODO - This functions works and is fast, but it is also ugly
00736     #and there is considerable repetition of code for the other
00737     #two FASTQ variants.
00738     try:
00739         #These take priority (in case both Solexa and PHRED scores found)
00740         qualities = record.letter_annotations["solexa_quality"]
00741     except KeyError:
00742         #Fall back on PHRED scores...
00743         pass
00744     else:
00745         #Try and use the precomputed mapping:
00746         try:
00747             return "".join([_solexa_to_solexa_quality_str[qs] \
00748                             for qs in qualities])
00749         except KeyError:
00750             #Could be a float, or a None in the list, or a high value.
00751             pass
00752         if None in qualities:
00753             raise TypeError("A quality value of None was found")
00754         if max(qualities) >= 62.5:
00755             warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ",
00756                           BiopythonWarning)
00757         #This will apply the truncation at 62, giving max ASCII 126
00758         return "".join([chr(min(126, int(round(qs))+SOLEXA_SCORE_OFFSET)) \
00759                         for qs in qualities])
00760     #Fall back on the PHRED scores...
00761     try:
00762         qualities = record.letter_annotations["phred_quality"]
00763     except KeyError:
00764         raise ValueError("No suitable quality scores found in "
00765                          "letter_annotations of SeqRecord (id=%s)." \
00766                          % record.id)
00767     #Try and use the precomputed mapping:
00768     try:
00769         return "".join([_phred_to_solexa_quality_str[qp] \
00770                         for qp in qualities])
00771     except KeyError:
00772         #Either no PHRED scores, or something odd like a float or None
00773         #or too big to be in the cache
00774         pass
00775     if None in qualities:
00776         raise TypeError("A quality value of None was found")
00777     #Must do this the slow way, first converting the PHRED scores into
00778     #Solexa scores:
00779     if max(qualities) >= 62.5:
00780         warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ",
00781                       BiopythonWarning)
00782     return "".join([chr(min(126,
00783                             int(round(solexa_quality_from_phred(qp))) + \
00784                             SOLEXA_SCORE_OFFSET)) \
00785                     for qp in qualities])
00786 
00787 #TODO - Default to nucleotide or even DNA?
00788 def FastqGeneralIterator(handle):
00789     """Iterate over Fastq records as string tuples (not as SeqRecord objects).
00790 
00791     This code does not try to interpret the quality string numerically.  It
00792     just returns tuples of the title, sequence and quality as strings.  For
00793     the sequence and quality, any whitespace (such as new lines) is removed.
00794 
00795     Our SeqRecord based FASTQ iterators call this function internally, and then
00796     turn the strings into a SeqRecord objects, mapping the quality string into
00797     a list of numerical scores.  If you want to do a custom quality mapping,
00798     then you might consider calling this function directly.
00799 
00800     For parsing FASTQ files, the title string from the "@" line at the start
00801     of each record can optionally be omitted on the "+" lines.  If it is
00802     repeated, it must be identical.
00803 
00804     The sequence string and the quality string can optionally be split over
00805     multiple lines, although several sources discourage this.  In comparison,
00806     for the FASTA file format line breaks between 60 and 80 characters are
00807     the norm.
00808 
00809     WARNING - Because the "@" character can appear in the quality string,
00810     this can cause problems as this is also the marker for the start of
00811     a new sequence.  In fact, the "+" sign can also appear as well.  Some
00812     sources recommended having no line breaks in the  quality to avoid this,
00813     but even that is not enough, consider this example::
00814 
00815         @071113_EAS56_0053:1:1:998:236
00816         TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
00817         +071113_EAS56_0053:1:1:998:236
00818         IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
00819         @071113_EAS56_0053:1:1:182:712
00820         ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
00821         +
00822         @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
00823         @071113_EAS56_0053:1:1:153:10
00824         TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
00825         +
00826         IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
00827         @071113_EAS56_0053:1:3:990:501
00828         TGGGAGGTTTTATGTGGA
00829         AAGCAGCAATGTACAAGA
00830         +
00831         IIIIIII.IIIIII1@44
00832         @-7.%<&+/$/%4(++(%
00833 
00834     This is four PHRED encoded FASTQ entries originally from an NCBI source
00835     (given the read length of 36, these are probably Solexa Illumna reads where
00836     the quality has been mapped onto the PHRED values).
00837 
00838     This example has been edited to illustrate some of the nasty things allowed
00839     in the FASTQ format.  Firstly, on the "+" lines most but not all of the
00840     (redundant) identifiers are ommited.  In real files it is likely that all or
00841     none of these extra identifiers will be present.
00842 
00843     Secondly, while the first three sequences have been shown without line
00844     breaks, the last has been split over multiple lines.  In real files any line
00845     breaks are likely to be consistent.
00846 
00847     Thirdly, some of the quality string lines start with an "@" character.  For
00848     the second record this is unavoidable.  However for the fourth sequence this
00849     only happens because its quality string is split over two lines.  A naive
00850     parser could wrongly treat any line starting with an "@" as the beginning of
00851     a new sequence!  This code copes with this possible ambiguity by keeping
00852     track of the length of the sequence which gives the expected length of the
00853     quality string.
00854 
00855     Using this tricky example file as input, this short bit of code demonstrates
00856     what this parsing function would return:
00857 
00858     >>> handle = open("Quality/tricky.fastq", "rU")
00859     >>> for (title, sequence, quality) in FastqGeneralIterator(handle):
00860     ...     print title
00861     ...     print sequence, quality
00862     071113_EAS56_0053:1:1:998:236
00863     TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
00864     071113_EAS56_0053:1:1:182:712
00865     ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
00866     071113_EAS56_0053:1:1:153:10
00867     TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
00868     071113_EAS56_0053:1:3:990:501
00869     TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
00870     >>> handle.close()
00871 
00872     Finally we note that some sources state that the quality string should
00873     start with "!" (which using the PHRED mapping means the first letter always
00874     has a quality score of zero).  This rather restrictive rule is not widely
00875     observed, so is therefore ignored here.  One plus point about this "!" rule
00876     is that (provided there are no line breaks in the quality sequence) it
00877     would prevent the above problem with the "@" character.
00878     """
00879     #We need to call handle.readline() at least four times per record,
00880     #so we'll save a property look up each time:
00881     handle_readline = handle.readline
00882     
00883     #Skip any text before the first record (e.g. blank lines, comments?)
00884     while True:
00885         line = handle_readline()
00886         if not line:
00887             return #Premature end of file, or just empty?
00888         if line[0] == "@":
00889             break
00890         if isinstance(line[0], int):
00891             raise ValueError("Is this handle in binary mode not text mode?")
00892 
00893     while line:
00894         if line[0] != "@":
00895             raise ValueError("Records in Fastq files should start with '@' character")
00896         title_line = line[1:].rstrip()
00897         #Will now be at least one line of quality data - in most FASTQ files
00898         #just one line! We therefore use string concatenation (if needed)
00899         #rather using than the "".join(...) trick just in case it is multiline:
00900         seq_string = handle_readline().rstrip()
00901         #There may now be more sequence lines, or the "+" quality marker line:
00902         while True:
00903             line = handle_readline()
00904             if not line:
00905                 raise ValueError("End of file without quality information.")
00906             if line[0] == "+":
00907                 #The title here is optional, but if present must match!
00908                 second_title = line[1:].rstrip()
00909                 if second_title and second_title != title_line:
00910                     raise ValueError("Sequence and quality captions differ.")
00911                 break
00912             seq_string += line.rstrip() #removes trailing newlines
00913         #This is going to slow things down a little, but assuming
00914         #this isn't allowed we should try and catch it here:
00915         if " " in seq_string or "\t" in seq_string:
00916             raise ValueError("Whitespace is not allowed in the sequence.")
00917         seq_len = len(seq_string)
00918 
00919         #Will now be at least one line of quality data...
00920         quality_string = handle_readline().rstrip()
00921         #There may now be more quality data, or another sequence, or EOF
00922         while True:
00923             line = handle_readline()
00924             if not line : break #end of file
00925             if line[0] == "@":
00926                 #This COULD be the start of a new sequence. However, it MAY just
00927                 #be a line of quality data which starts with a "@" character.  We
00928                 #should be able to check this by looking at the sequence length
00929                 #and the amount of quality data found so far.
00930                 if len(quality_string) >= seq_len:
00931                     #We expect it to be equal if this is the start of a new record.
00932                     #If the quality data is longer, we'll raise an error below.
00933                     break
00934                 #Continue - its just some (more) quality data.
00935             quality_string += line.rstrip()
00936         
00937         if seq_len != len(quality_string):
00938             raise ValueError("Lengths of sequence and quality values differs "
00939                              " for %s (%i and %i)." \
00940                              % (title_line, seq_len, len(quality_string)))
00941 
00942         #Return the record and then continue...
00943         yield (title_line, seq_string, quality_string)
00944     raise StopIteration
00945 
00946         
00947 #This is a generator function!
00948 def FastqPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
00949     """Generator function to iterate over FASTQ records (as SeqRecord objects).
00950 
00951      - handle - input file
00952      - alphabet - optional alphabet
00953      - title2ids - A function that, when given the title line from the FASTQ
00954                    file (without the beginning >), will return the id, name and
00955                    description (in that order) for the record as a tuple of
00956                    strings.  If this is not given, then the entire title line
00957                    will be used as the description, and the first word as the
00958                    id and name.
00959 
00960     Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
00961 
00962     For each sequence in a (Sanger style) FASTQ file there is a matching string
00963     encoding the PHRED qualities (integers between 0 and about 90) using ASCII
00964     values with an offset of 33.
00965 
00966     For example, consider a file containing three short reads::
00967 
00968         @EAS54_6_R1_2_1_413_324
00969         CCCTTCTTGTCTTCAGCGTTTCTCC
00970         +
00971         ;;3;;;;;;;;;;;;7;;;;;;;88
00972         @EAS54_6_R1_2_1_540_792
00973         TTGGCAGGCCAAGGCCGATGGATCA
00974         +
00975         ;;;;;;;;;;;7;;;;;-;;;3;83
00976         @EAS54_6_R1_2_1_443_348
00977         GTTGCTTCTGGCGTGGGTGGGGGGG
00978         +
00979         ;;;;;;;;;;;9;7;;.7;393333
00980 
00981     For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
00982     string encoding the PHRED qualities using a ASCI values with an offset of
00983     33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
00984 
00985     Using this module directly you might run:
00986 
00987     >>> handle = open("Quality/example.fastq", "rU")
00988     >>> for record in FastqPhredIterator(handle):
00989     ...     print record.id, record.seq
00990     EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
00991     EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
00992     EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
00993     >>> handle.close()
00994 
00995     Typically however, you would call this via Bio.SeqIO instead with "fastq"
00996     (or "fastq-sanger") as the format:
00997 
00998     >>> from Bio import SeqIO
00999     >>> handle = open("Quality/example.fastq", "rU")
01000     >>> for record in SeqIO.parse(handle, "fastq"):
01001     ...     print record.id, record.seq
01002     EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
01003     EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
01004     EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
01005     >>> handle.close()
01006 
01007     If you want to look at the qualities, they are record in each record's
01008     per-letter-annotation dictionary as a simple list of integers:
01009 
01010     >>> print record.letter_annotations["phred_quality"]
01011     [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
01012     """
01013     assert SANGER_SCORE_OFFSET == ord("!")
01014     #Originally, I used a list expression for each record:
01015     #
01016     # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string]
01017     #
01018     #Precomputing is faster, perhaps partly by avoiding the subtractions.
01019     q_mapping = dict()
01020     for letter in range(0, 255):
01021         q_mapping[chr(letter)] = letter-SANGER_SCORE_OFFSET
01022     for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
01023         if title2ids:
01024             id, name, descr = title2ids(title_line)
01025         else:
01026             descr = title_line
01027             id   = descr.split()[0]
01028             name = id
01029         record = SeqRecord(Seq(seq_string, alphabet),
01030                            id=id, name=name, description=descr)
01031         qualities = [q_mapping[letter] for letter in quality_string]
01032         if qualities and (min(qualities) < 0 or max(qualities) > 93):
01033             raise ValueError("Invalid character in quality string")
01034         #For speed, will now use a dirty trick to speed up assigning the
01035         #qualities. We do this to bypass the length check imposed by the
01036         #per-letter-annotations restricted dict (as this has already been
01037         #checked by FastqGeneralIterator). This is equivalent to:
01038         #record.letter_annotations["phred_quality"] = qualities
01039         dict.__setitem__(record._per_letter_annotations,
01040                          "phred_quality", qualities)
01041         yield record
01042 
01043 #This is a generator function!
01044 def FastqSolexaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
01045     r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).
01046 
01047     The optional arguments are the same as those for the FastqPhredIterator.
01048 
01049     For each sequence in Solexa/Illumina FASTQ files there is a matching string
01050     encoding the Solexa integer qualities using ASCII values with an offset
01051     of 64.  Solexa scores are scaled differently to PHRED scores, and Biopython
01052     will NOT perform any automatic conversion when loading.
01053 
01054     NOTE - This file format is used by the OLD versions of the Solexa/Illumina
01055     pipeline. See also the FastqIlluminaIterator function for the NEW version.
01056 
01057     For example, consider a file containing these five records::
01058         
01059         @SLXA-B3_649_FC8437_R1_1_1_610_79
01060         GATGTGCAATACCTTTGTAGAGGAA
01061         +SLXA-B3_649_FC8437_R1_1_1_610_79
01062         YYYYYYYYYYYYYYYYYYWYWYYSU
01063         @SLXA-B3_649_FC8437_R1_1_1_397_389
01064         GGTTTGAGAAAGAGAAATGAGATAA
01065         +SLXA-B3_649_FC8437_R1_1_1_397_389
01066         YYYYYYYYYWYYYYWWYYYWYWYWW
01067         @SLXA-B3_649_FC8437_R1_1_1_850_123
01068         GAGGGTGTTGATCATGATGATGGCG
01069         +SLXA-B3_649_FC8437_R1_1_1_850_123
01070         YYYYYYYYYYYYYWYYWYYSYYYSY
01071         @SLXA-B3_649_FC8437_R1_1_1_362_549
01072         GGAAACAAAGTTTTTCTCAACATAG
01073         +SLXA-B3_649_FC8437_R1_1_1_362_549
01074         YYYYYYYYYYYYYYYYYYWWWWYWY
01075         @SLXA-B3_649_FC8437_R1_1_1_183_714
01076         GTATTATTTAATGGCATACACTCAA
01077         +SLXA-B3_649_FC8437_R1_1_1_183_714
01078         YYYYYYYYYYWYYYYWYWWUWWWQQ
01079         
01080     Using this module directly you might run:
01081 
01082     >>> handle = open("Quality/solexa_example.fastq", "rU")
01083     >>> for record in FastqSolexaIterator(handle):
01084     ...     print record.id, record.seq
01085     SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
01086     SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
01087     SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
01088     SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
01089     SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
01090     >>> handle.close()
01091 
01092     Typically however, you would call this via Bio.SeqIO instead with
01093     "fastq-solexa" as the format:
01094 
01095     >>> from Bio import SeqIO
01096     >>> handle = open("Quality/solexa_example.fastq", "rU")
01097     >>> for record in SeqIO.parse(handle, "fastq-solexa"):
01098     ...     print record.id, record.seq
01099     SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
01100     SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
01101     SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
01102     SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
01103     SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
01104     >>> handle.close()
01105 
01106     If you want to look at the qualities, they are recorded in each record's
01107     per-letter-annotation dictionary as a simple list of integers:
01108 
01109     >>> print record.letter_annotations["solexa_quality"]
01110     [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
01111 
01112     These scores aren't very good, but they are high enough that they map
01113     almost exactly onto PHRED scores:
01114 
01115     >>> print "%0.2f" % phred_quality_from_solexa(25)
01116     25.01
01117 
01118     Let's look at faked example read which is even worse, where there are
01119     more noticeable differences between the Solexa and PHRED scores::
01120 
01121          @slxa_0001_1_0001_01
01122          ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
01123          +slxa_0001_1_0001_01
01124          hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
01125 
01126     Again, you would typically use Bio.SeqIO to read this file in (rather than
01127     calling the Bio.SeqIO.QualtityIO module directly).  Most FASTQ files will
01128     contain thousands of reads, so you would normally use Bio.SeqIO.parse()
01129     as shown above.  This example has only as one entry, so instead we can
01130     use the Bio.SeqIO.read() function:
01131 
01132     >>> from Bio import SeqIO
01133     >>> handle = open("Quality/solexa_faked.fastq", "rU")
01134     >>> record = SeqIO.read(handle, "fastq-solexa")
01135     >>> handle.close()
01136     >>> print record.id, record.seq
01137     slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
01138     >>> print record.letter_annotations["solexa_quality"]
01139     [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
01140 
01141     These quality scores are so low that when converted from the Solexa scheme
01142     into PHRED scores they look quite different:
01143 
01144     >>> print "%0.2f" % phred_quality_from_solexa(-1)
01145     2.54
01146     >>> print "%0.2f" % phred_quality_from_solexa(-5)
01147     1.19
01148 
01149     Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
01150     method to output the record(s):
01151 
01152     >>> print record.format("fastq-solexa")
01153     @slxa_0001_1_0001_01
01154     ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
01155     +
01156     hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
01157     <BLANKLINE>
01158     
01159     Note this output is slightly different from the input file as Biopython
01160     has left out the optional repetition of the sequence identifier on the "+"
01161     line.  If you want the to use PHRED scores, use "fastq" or "qual" as the
01162     output format instead, and Biopython will do the conversion for you:
01163 
01164     >>> print record.format("fastq")
01165     @slxa_0001_1_0001_01
01166     ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
01167     +
01168     IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
01169     <BLANKLINE>
01170     
01171     >>> print record.format("qual")
01172     >slxa_0001_1_0001_01
01173     40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
01174     20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
01175     1 1
01176     <BLANKLINE>
01177 
01178     As shown above, the poor quality Solexa reads have been mapped to the
01179     equivalent PHRED score (e.g. -5 to 1 as shown earlier).
01180     """
01181     q_mapping = dict()
01182     for letter in range(0, 255):
01183         q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
01184     for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
01185         if title2ids:
01186             id, name, descr = title_line
01187         else:
01188             descr = title_line
01189             id   = descr.split()[0]
01190             name = id
01191         record = SeqRecord(Seq(seq_string, alphabet),
01192                            id=id, name=name, description=descr)
01193         qualities = [q_mapping[letter] for letter in quality_string]
01194         #DO NOT convert these into PHRED qualities automatically!
01195         if qualities and (min(qualities) < -5 or max(qualities)>62):
01196             raise ValueError("Invalid character in quality string")
01197         #Dirty trick to speed up this line:
01198         #record.letter_annotations["solexa_quality"] = qualities
01199         dict.__setitem__(record._per_letter_annotations,
01200                          "solexa_quality", qualities)
01201         yield record
01202 
01203 #This is a generator function!
01204 def FastqIlluminaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
01205     """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).
01206 
01207     The optional arguments are the same as those for the FastqPhredIterator.
01208 
01209     For each sequence in Illumina 1.3+ FASTQ files there is a matching string
01210     encoding PHRED integer qualities using ASCII values with an offset of 64.
01211 
01212     >>> from Bio import SeqIO
01213     >>> record = SeqIO.read(open("Quality/illumina_faked.fastq"), "fastq-illumina")
01214     >>> print record.id, record.seq
01215     Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
01216     >>> max(record.letter_annotations["phred_quality"])
01217     40
01218     >>> min(record.letter_annotations["phred_quality"])
01219     0
01220 
01221     NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
01222     with an ASCII offset of 64. They are approximately equal but only for high
01223     quality reads. If you have an old Solexa/Illumina file with negative
01224     Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:
01225 
01226     >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina")
01227     Traceback (most recent call last):
01228        ...
01229     ValueError: Invalid character in quality string
01230 
01231     NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.
01232     """
01233     q_mapping = dict()
01234     for letter in range(0, 255):
01235         q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
01236     for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
01237         if title2ids:
01238             id, name, descr = title2ids(title_line)
01239         else:
01240             descr = title_line
01241             id   = descr.split()[0]
01242             name = id
01243         record = SeqRecord(Seq(seq_string, alphabet),
01244                            id=id, name=name, description=descr)
01245         qualities = [q_mapping[letter] for letter in quality_string]
01246         if qualities and (min(qualities) < 0 or max(qualities) > 62):
01247             raise ValueError("Invalid character in quality string")
01248         #Dirty trick to speed up this line:
01249         #record.letter_annotations["phred_quality"] = qualities
01250         dict.__setitem__(record._per_letter_annotations,
01251                          "phred_quality", qualities)
01252         yield record
01253     
01254 def QualPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
01255     """For QUAL files which include PHRED quality scores, but no sequence.
01256 
01257     For example, consider this short QUAL file::
01258 
01259         >EAS54_6_R1_2_1_413_324
01260         26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
01261         26 26 26 23 23
01262         >EAS54_6_R1_2_1_540_792
01263         26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
01264         26 18 26 23 18
01265         >EAS54_6_R1_2_1_443_348
01266         26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
01267         24 18 18 18 18
01268     
01269     Using this module directly you might run:
01270 
01271     >>> handle = open("Quality/example.qual", "rU")
01272     >>> for record in QualPhredIterator(handle):
01273     ...     print record.id, record.seq
01274     EAS54_6_R1_2_1_413_324 ?????????????????????????
01275     EAS54_6_R1_2_1_540_792 ?????????????????????????
01276     EAS54_6_R1_2_1_443_348 ?????????????????????????
01277     >>> handle.close()
01278 
01279     Typically however, you would call this via Bio.SeqIO instead with "qual"
01280     as the format:
01281 
01282     >>> from Bio import SeqIO
01283     >>> handle = open("Quality/example.qual", "rU")
01284     >>> for record in SeqIO.parse(handle, "qual"):
01285     ...     print record.id, record.seq
01286     EAS54_6_R1_2_1_413_324 ?????????????????????????
01287     EAS54_6_R1_2_1_540_792 ?????????????????????????
01288     EAS54_6_R1_2_1_443_348 ?????????????????????????
01289     >>> handle.close()
01290 
01291     Becase QUAL files don't contain the sequence string itself, the seq
01292     property is set to an UnknownSeq object.  As no alphabet was given, this
01293     has defaulted to a generic single letter alphabet and the character "?"
01294     used.
01295 
01296     By specifying a nucleotide alphabet, "N" is used instead:
01297 
01298     >>> from Bio import SeqIO
01299     >>> from Bio.Alphabet import generic_dna
01300     >>> handle = open("Quality/example.qual", "rU")
01301     >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
01302     ...     print record.id, record.seq
01303     EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
01304     EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
01305     EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
01306     >>> handle.close()
01307 
01308     However, the quality scores themselves are available as a list of integers
01309     in each record's per-letter-annotation:
01310 
01311     >>> print record.letter_annotations["phred_quality"]
01312     [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
01313 
01314     You can still slice one of these SeqRecord objects with an UnknownSeq:
01315 
01316     >>> sub_record = record[5:10]
01317     >>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
01318     EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
01319 
01320     As of Biopython 1.59, this parser will accept files with negatives quality
01321     scores but will replace them with the lowest possible PHRED score of zero.
01322     This will trigger a warning, previously it raised a ValueError exception.
01323     """
01324     #Skip any text before the first record (e.g. blank lines, comments)
01325     while True:
01326         line = handle.readline()
01327         if line == "" : return #Premature end of file, or just empty?
01328         if line[0] == ">":
01329             break
01330 
01331     while True:
01332         if line[0] != ">":
01333             raise ValueError("Records in Fasta files should start with '>' character")
01334         if title2ids:
01335             id, name, descr = title2ids(line[1:].rstrip())
01336         else:
01337             descr = line[1:].rstrip()
01338             id   = descr.split()[0]
01339             name = id
01340 
01341         qualities = []
01342         line = handle.readline()
01343         while True:
01344             if not line : break
01345             if line[0] == ">": break
01346             qualities.extend([int(word) for word in line.split()])
01347             line = handle.readline()
01348 
01349         if qualities and min(qualities) < 0:
01350             warnings.warn(("Negative quality score %i found, " + \
01351                            "substituting PHRED zero instead.") \
01352                            % min(qualities), BiopythonParserWarning)
01353             qualities = [max(0,q) for q in qualities]
01354         
01355         #Return the record and then continue...
01356         record = SeqRecord(UnknownSeq(len(qualities), alphabet),
01357                            id = id, name = name, description = descr)
01358         #Dirty trick to speed up this line:
01359         #record.letter_annotations["phred_quality"] = qualities
01360         dict.__setitem__(record._per_letter_annotations,
01361                          "phred_quality", qualities)
01362         yield record
01363 
01364         if not line : return #StopIteration
01365     assert False, "Should not reach this line"
01366 
01367 class FastqPhredWriter(SequentialSequenceWriter):
01368     """Class to write standard FASTQ format files (using PHRED quality scores).
01369 
01370     Although you can use this class directly, you are strongly encouraged
01371     to use the Bio.SeqIO.write() function instead via the format name "fastq"
01372     or the alias "fastq-sanger".  For example, this code reads in a standard
01373     Sanger style FASTQ file (using PHRED scores) and re-saves it as another
01374     Sanger style FASTQ file:
01375 
01376     >>> from Bio import SeqIO
01377     >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
01378     >>> out_handle = open("Quality/temp.fastq", "w")
01379     >>> SeqIO.write(record_iterator, out_handle, "fastq")
01380     3
01381     >>> out_handle.close()
01382 
01383     You might want to do this if the original file included extra line breaks,
01384     which while valid may not be supported by all tools.  The output file from
01385     Biopython will have each sequence on a single line, and each quality
01386     string on a single line (which is considered desirable for maximum
01387     compatibility).
01388 
01389     In this next example, an old style Solexa/Illumina FASTQ file (using Solexa
01390     quality scores) is converted into a standard Sanger style FASTQ file using
01391     PHRED qualities:
01392 
01393     >>> from Bio import SeqIO
01394     >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
01395     >>> out_handle = open("Quality/temp.fastq", "w")
01396     >>> SeqIO.write(record_iterator, out_handle, "fastq")
01397     5
01398     >>> out_handle.close()
01399 
01400     This code is also called if you use the .format("fastq") method of a
01401     SeqRecord, or .format("fastq-sanger") if you prefer that alias.
01402 
01403     Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is
01404     encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
01405     warning is issued.
01406 
01407     P.S. To avoid cluttering up your working directory, you can delete this
01408     temporary file now:
01409 
01410     >>> import os
01411     >>> os.remove("Quality/temp.fastq")
01412     """
01413     assert SANGER_SCORE_OFFSET == ord("!")
01414 
01415     def write_record(self, record):
01416         """Write a single FASTQ record to the file."""
01417         assert self._header_written
01418         assert not self._footer_written
01419         self._record_written = True
01420         #TODO - Is an empty sequence allowed in FASTQ format?
01421         if record.seq is None:
01422             raise ValueError("No sequence for record %s" % record.id)
01423         seq_str = str(record.seq)
01424         qualities_str = _get_sanger_quality_str(record)
01425         if len(qualities_str) != len(seq_str):
01426             raise ValueError("Record %s has sequence length %i but %i quality scores" \
01427                              % (record.id, len(seq_str), len(qualities_str)))
01428 
01429         #FASTQ files can include a description, just like FASTA files
01430         #(at least, this is what the NCBI Short Read Archive does)
01431         id = self.clean(record.id)
01432         description = self.clean(record.description)
01433         if description and description.split(None, 1)[0]==id:
01434             #The description includes the id at the start
01435             title = description
01436         elif description:
01437             title = "%s %s" % (id, description)
01438         else:
01439             title = id
01440         
01441         self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
01442 
01443 class QualPhredWriter(SequentialSequenceWriter):
01444     """Class to write QUAL format files (using PHRED quality scores).
01445 
01446     Although you can use this class directly, you are strongly encouraged
01447     to use the Bio.SeqIO.write() function instead.  For example, this code
01448     reads in a FASTQ file and saves the quality scores into a QUAL file:
01449 
01450     >>> from Bio import SeqIO
01451     >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
01452     >>> out_handle = open("Quality/temp.qual", "w")
01453     >>> SeqIO.write(record_iterator, out_handle, "qual")
01454     3
01455     >>> out_handle.close()
01456 
01457     This code is also called if you use the .format("qual") method of a
01458     SeqRecord.
01459 
01460     P.S. Don't forget to clean up the temp file if you don't need it anymore:
01461 
01462     >>> import os
01463     >>> os.remove("Quality/temp.qual")
01464     """
01465     def __init__(self, handle, wrap=60, record2title=None):
01466         """Create a QUAL writer.
01467 
01468         Arguments:
01469          - handle - Handle to an output file, e.g. as returned
01470                     by open(filename, "w")
01471          - wrap   - Optional line length used to wrap sequence lines.
01472                     Defaults to wrapping the sequence at 60 characters
01473                     Use zero (or None) for no wrapping, giving a single
01474                     long line for the sequence.
01475          - record2title - Optional function to return the text to be
01476                     used for the title line of each record.  By default
01477                     a combination of the record.id and record.description
01478                     is used.  If the record.description starts with the
01479                     record.id, then just the record.description is used.
01480 
01481         The record2title argument is present for consistency with the
01482         Bio.SeqIO.FastaIO writer class.
01483         """
01484         SequentialSequenceWriter.__init__(self, handle)
01485         #self.handle = handle
01486         self.wrap = None
01487         if wrap:
01488             if wrap < 1:
01489                 raise ValueError
01490         self.wrap = wrap
01491         self.record2title = record2title
01492 
01493     def write_record(self, record):
01494         """Write a single QUAL record to the file."""
01495         assert self._header_written
01496         assert not self._footer_written
01497         self._record_written = True
01498 
01499         handle = self.handle
01500         wrap = self.wrap
01501 
01502         if self.record2title:
01503             title = self.clean(self.record2title(record))
01504         else:
01505             id = self.clean(record.id)
01506             description = self.clean(record.description)
01507             if description and description.split(None, 1)[0]==id:
01508                 #The description includes the id at the start
01509                 title = description
01510             elif description:
01511                 title = "%s %s" % (id, description)
01512             else:
01513                 title = id
01514         handle.write(">%s\n" % title)
01515 
01516         qualities = _get_phred_quality(record)
01517         try:
01518             #This rounds to the nearest integer.
01519             #TODO - can we record a float in a qual file?
01520             qualities_strs = [("%i" % round(q, 0)) for q in qualities]
01521         except TypeError, e:
01522             if None in qualities:
01523                 raise TypeError("A quality value of None was found")
01524             else:
01525                 raise e
01526 
01527         if wrap > 5:
01528             #Fast wrapping
01529             data = " ".join(qualities_strs)
01530             while True:
01531                 if len(data) <= wrap:
01532                     self.handle.write(data + "\n")
01533                     break
01534                 else:
01535                     #By construction there must be spaces in the first X chars
01536                     #(unless we have X digit or higher quality scores!)
01537                     i = data.rfind(" ", 0, wrap)
01538                     handle.write(data[:i] + "\n")
01539                     data = data[i+1:]
01540         elif wrap:
01541             #Safe wrapping
01542             while qualities_strs:
01543                 line = qualities_strs.pop(0)
01544                 while qualities_strs \
01545                 and len(line) + 1 + len(qualities_strs[0]) < wrap:
01546                     line += " " + qualities_strs.pop(0)
01547                 handle.write(line + "\n")
01548         else:
01549             #No wrapping
01550             data = " ".join(qualities_strs)
01551             handle.write(data + "\n")
01552 
01553 class FastqSolexaWriter(SequentialSequenceWriter):
01554     r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities).
01555 
01556     This outputs FASTQ files like those from the early Solexa/Illumina
01557     pipeline, using Solexa scores and an ASCII offset of 64. These are
01558     NOT compatible with the standard Sanger style PHRED FASTQ files.
01559 
01560     If your records contain a "solexa_quality" entry under letter_annotations,
01561     this is used, otherwise any "phred_quality" entry will be used after
01562     conversion using the solexa_quality_from_phred function. If neither style
01563     of quality scores are present, an exception is raised.
01564     
01565     Although you can use this class directly, you are strongly encouraged
01566     to use the Bio.SeqIO.write() function instead.  For example, this code
01567     reads in a FASTQ file and re-saves it as another FASTQ file:
01568 
01569     >>> from Bio import SeqIO
01570     >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
01571     >>> out_handle = open("Quality/temp.fastq", "w")
01572     >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa")
01573     5
01574     >>> out_handle.close()
01575 
01576     You might want to do this if the original file included extra line breaks,
01577     which (while valid) may not be supported by all tools.  The output file
01578     from Biopython will have each sequence on a single line, and each quality
01579     string on a single line (which is considered desirable for maximum
01580     compatibility).
01581 
01582     This code is also called if you use the .format("fastq-solexa") method of
01583     a SeqRecord. For example,
01584 
01585     >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
01586     >>> print record.format("fastq-solexa")
01587     @Test PHRED qualities from 40 to 0 inclusive
01588     ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
01589     +
01590     hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
01591     <BLANKLINE>
01592     
01593     Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
01594     encoded as ASCII 126, the tilde.  If your quality scores must be truncated to fit,
01595     a warning is issued.
01596     
01597     P.S. Don't forget to delete the temp file if you don't need it anymore:
01598 
01599     >>> import os
01600     >>> os.remove("Quality/temp.fastq")
01601     """
01602     def write_record(self, record):
01603         """Write a single FASTQ record to the file."""
01604         assert self._header_written
01605         assert not self._footer_written
01606         self._record_written = True
01607 
01608         #TODO - Is an empty sequence allowed in FASTQ format?
01609         if record.seq is None:
01610             raise ValueError("No sequence for record %s" % record.id)
01611         seq_str = str(record.seq)
01612         qualities_str = _get_solexa_quality_str(record)
01613         if len(qualities_str) != len(seq_str):
01614             raise ValueError("Record %s has sequence length %i but %i quality scores" \
01615                              % (record.id, len(seq_str), len(qualities_str)))
01616 
01617         #FASTQ files can include a description, just like FASTA files
01618         #(at least, this is what the NCBI Short Read Archive does)
01619         id = self.clean(record.id)
01620         description = self.clean(record.description)
01621         if description and description.split(None, 1)[0]==id:
01622             #The description includes the id at the start
01623             title = description
01624         elif description:
01625             title = "%s %s" % (id, description)
01626         else:
01627             title = id
01628         
01629         self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
01630 
01631 class FastqIlluminaWriter(SequentialSequenceWriter):
01632     r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores).
01633 
01634     This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
01635     using PHRED scores and an ASCII offset of 64. Note these files are NOT
01636     compatible with the standard Sanger style PHRED FASTQ files which use an
01637     ASCII offset of 32.
01638 
01639     Although you can use this class directly, you are strongly encouraged to
01640     use the Bio.SeqIO.write() function with format name "fastq-illumina"
01641     instead. This code is also called if you use the .format("fastq-illumina")
01642     method of a SeqRecord. For example,
01643 
01644     >>> from Bio import SeqIO
01645     >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
01646     >>> print record.format("fastq-illumina")
01647     @Test PHRED qualities from 40 to 0 inclusive
01648     ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
01649     +
01650     hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
01651     <BLANKLINE>
01652 
01653     Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
01654     encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
01655     warning is issued.
01656     """
01657     def write_record(self, record):
01658         """Write a single FASTQ record to the file."""
01659         assert self._header_written
01660         assert not self._footer_written
01661         self._record_written = True
01662 
01663         #TODO - Is an empty sequence allowed in FASTQ format?
01664         if record.seq is None:
01665             raise ValueError("No sequence for record %s" % record.id)
01666         seq_str = str(record.seq)
01667         qualities_str = _get_illumina_quality_str(record)
01668         if len(qualities_str) != len(seq_str):
01669             raise ValueError("Record %s has sequence length %i but %i quality scores" \
01670                              % (record.id, len(seq_str), len(qualities_str)))
01671 
01672         #FASTQ files can include a description, just like FASTA files
01673         #(at least, this is what the NCBI Short Read Archive does)
01674         id = self.clean(record.id)
01675         description = self.clean(record.description)
01676         if description and description.split(None, 1)[0]==id:
01677             #The description includes the id at the start
01678             title = description
01679         elif description:
01680             title = "%s %s" % (id, description)
01681         else:
01682             title = id
01683         
01684         self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
01685         
01686 def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet = single_letter_alphabet, title2ids = None):
01687     """Iterate over matched FASTA and QUAL files as SeqRecord objects.
01688 
01689     For example, consider this short QUAL file with PHRED quality scores::
01690 
01691         >EAS54_6_R1_2_1_413_324
01692         26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
01693         26 26 26 23 23
01694         >EAS54_6_R1_2_1_540_792
01695         26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
01696         26 18 26 23 18
01697         >EAS54_6_R1_2_1_443_348
01698         26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
01699         24 18 18 18 18
01700     
01701     And a matching FASTA file::
01702 
01703         >EAS54_6_R1_2_1_413_324
01704         CCCTTCTTGTCTTCAGCGTTTCTCC
01705         >EAS54_6_R1_2_1_540_792
01706         TTGGCAGGCCAAGGCCGATGGATCA
01707         >EAS54_6_R1_2_1_443_348
01708         GTTGCTTCTGGCGTGGGTGGGGGGG
01709 
01710     You can parse these separately using Bio.SeqIO with the "qual" and
01711     "fasta" formats, but then you'll get a group of SeqRecord objects with
01712     no sequence, and a matching group with the sequence but not the
01713     qualities.  Because it only deals with one input file handle, Bio.SeqIO
01714     can't be used to read the two files together - but this function can!
01715     For example,
01716     
01717     >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
01718     ...                                    open("Quality/example.qual", "rU"))
01719     >>> for record in rec_iter:
01720     ...     print record.id, record.seq
01721     EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
01722     EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
01723     EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
01724 
01725     As with the FASTQ or QUAL parsers, if you want to look at the qualities,
01726     they are in each record's per-letter-annotation dictionary as a simple
01727     list of integers:
01728 
01729     >>> print record.letter_annotations["phred_quality"]
01730     [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
01731 
01732     If you have access to data as a FASTQ format file, using that directly
01733     would be simpler and more straight forward.  Note that you can easily use
01734     this function to convert paired FASTA and QUAL files into FASTQ files:
01735 
01736     >>> from Bio import SeqIO
01737     >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
01738     ...                                    open("Quality/example.qual", "rU"))
01739     >>> out_handle = open("Quality/temp.fastq", "w")
01740     >>> SeqIO.write(rec_iter, out_handle, "fastq")
01741     3
01742     >>> out_handle.close()
01743 
01744     And don't forget to clean up the temp file if you don't need it anymore:
01745 
01746     >>> import os
01747     >>> os.remove("Quality/temp.fastq")    
01748     """
01749     from Bio.SeqIO.FastaIO import FastaIterator    
01750     fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \
01751                                title2ids=title2ids)
01752     qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \
01753                                   title2ids=title2ids)
01754 
01755     #Using zip(...) would create a list loading everything into memory!
01756     #It would also not catch any extra records found in only one file.
01757     while True:
01758         try:
01759             f_rec = fasta_iter.next()
01760         except StopIteration:
01761             f_rec = None
01762         try:
01763             q_rec = qual_iter.next()
01764         except StopIteration:
01765             q_rec = None
01766         if f_rec is None and q_rec is None:
01767             #End of both files
01768             break
01769         if f_rec is None:
01770             raise ValueError("FASTA file has more entries than the QUAL file.")
01771         if q_rec is None:
01772             raise ValueError("QUAL file has more entries than the FASTA file.")
01773         if f_rec.id != q_rec.id:
01774             raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \
01775                              % (f_rec.id, q_rec.id))
01776         if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]):
01777             raise ValueError("Sequence length and number of quality scores disagree for %s" \
01778                              % f_rec.id)
01779         #Merge the data....
01780         f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"]
01781         yield f_rec
01782     #Done
01783     
01784 
01785 def _test():
01786     """Run the Bio.SeqIO module's doctests.
01787 
01788     This will try and locate the unit tests directory, and run the doctests
01789     from there in order that the relative paths used in the examples work.
01790     """
01791     import doctest
01792     import os
01793     if os.path.isdir(os.path.join("..", "..", "Tests")):
01794         print "Runing doctests..."
01795         cur_dir = os.path.abspath(os.curdir)
01796         os.chdir(os.path.join("..", "..", "Tests"))
01797         assert os.path.isfile("Quality/example.fastq")
01798         assert os.path.isfile("Quality/example.fasta")
01799         assert os.path.isfile("Quality/example.qual")
01800         assert os.path.isfile("Quality/tricky.fastq")
01801         assert os.path.isfile("Quality/solexa_faked.fastq")
01802         doctest.testmod(verbose=0)
01803         os.chdir(cur_dir)
01804         del cur_dir
01805         print "Done"
01806         
01807 if __name__ == "__main__":
01808     _test()
01809