Back to index

python-biopython  1.60
Classes | Functions | Variables
Bio.SeqIO.QualityIO Namespace Reference

Classes

class  FastqPhredWriter
class  QualPhredWriter
class  FastqSolexaWriter
class  FastqIlluminaWriter

Functions

def solexa_quality_from_phred
def phred_quality_from_solexa
def _get_phred_quality
def _get_sanger_quality_str
def _get_illumina_quality_str
def _get_solexa_quality_str
def FastqGeneralIterator
def FastqPhredIterator
def FastqSolexaIterator
def FastqIlluminaIterator
def QualPhredIterator
def PairedFastaQualIterator
def _test

Variables

string __docformat__ = "epytext en"
int SANGER_SCORE_OFFSET = 33
int SOLEXA_SCORE_OFFSET = 64
tuple _phred_to_sanger_quality_str
tuple _solexa_to_sanger_quality_str
tuple _phred_to_illumina_quality_str
tuple _solexa_to_illumina_quality_str
tuple _solexa_to_solexa_quality_str
tuple _phred_to_solexa_quality_str

Function Documentation

Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE).

Notice that due to the limited range of printable ASCII characters, a
PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ
file (using ASCII 126, the tilde). This function will issue a warning
in this situation.

Definition at line 661 of file QualityIO.py.

00661 
00662 def _get_illumina_quality_str(record):
00663     """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE).
00664 
00665     Notice that due to the limited range of printable ASCII characters, a
00666     PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ
00667     file (using ASCII 126, the tilde). This function will issue a warning
00668     in this situation.
00669     """
00670     #TODO - This functions works and is fast, but it is also ugly
00671     #and there is considerable repetition of code for the other
00672     #two FASTQ variants.
00673     try:
00674         #These take priority (in case both Solexa and PHRED scores found)
00675         qualities = record.letter_annotations["phred_quality"]
00676     except KeyError:
00677         #Fall back on solexa scores...
00678         pass
00679     else:
00680         #Try and use the precomputed mapping:
00681         try:
00682             return "".join([_phred_to_illumina_quality_str[qp] \
00683                             for qp in qualities])
00684         except KeyError:
00685             #Could be a float, or a None in the list, or a high value.
00686             pass
00687         if None in qualities:
00688             raise TypeError("A quality value of None was found")
00689         if max(qualities) >= 62.5:
00690             warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ",
00691                           BiopythonWarning)
00692         #This will apply the truncation at 62, giving max ASCII 126
00693         return "".join([chr(min(126, int(round(qp))+SOLEXA_SCORE_OFFSET)) \
00694                         for qp in qualities])
00695     #Fall back on the Solexa scores...
00696     try:
00697         qualities = record.letter_annotations["solexa_quality"]
00698     except KeyError:
00699         raise ValueError("No suitable quality scores found in "
00700                          "letter_annotations of SeqRecord (id=%s)." \
00701                          % record.id)
00702     #Try and use the precomputed mapping:
00703     try:
00704         return "".join([_solexa_to_illumina_quality_str[qs] \
00705                         for qs in qualities])
00706     except KeyError:
00707         #Either no PHRED scores, or something odd like a float or None
00708         pass
00709     if None in qualities:
00710         raise TypeError("A quality value of None was found")
00711     #Must do this the slow way, first converting the PHRED scores into
00712     #Solexa scores:
00713     if max(qualities) >= 62.5:
00714         warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ",
00715                       BiopythonWarning)
00716     #This will apply the truncation at 62, giving max ASCII 126
00717     return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
00718                     for qs in qualities])
00719 
#Only map 0 to 62, we need to give a warning on truncating at 62

Here is the call graph for this function:

def Bio.SeqIO.QualityIO._get_phred_quality (   record) [private]
Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).

If there are no PHRED qualities, but there are Solexa qualities, those are
used instead after conversion.

Definition at line 519 of file QualityIO.py.

00519 
00520 def _get_phred_quality(record):
00521     """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
00522 
00523     If there are no PHRED qualities, but there are Solexa qualities, those are
00524     used instead after conversion.
00525     """
00526     try:
00527         return record.letter_annotations["phred_quality"]
00528     except KeyError:
00529         pass
00530     try:
00531         return [phred_quality_from_solexa(q) for \
00532                 q in record.letter_annotations["solexa_quality"]]
00533     except KeyError:
00534         raise ValueError("No suitable quality scores found in "
00535                          "letter_annotations of SeqRecord (id=%s)." \
00536                          % record.id)
00537 
#Only map 0 to 93, we need to give a warning on truncating at 93

Here is the call graph for this function:

def Bio.SeqIO.QualityIO._get_sanger_quality_str (   record) [private]
Returns a Sanger FASTQ encoded quality string (PRIVATE).

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> r = SeqRecord(Seq("ACGTAN"), id="Test",
...               letter_annotations = {"phred_quality":[50,40,30,20,10,0]})
>>> _get_sanger_quality_str(r)
'SI?5+!'

If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO),
the PHRED qualities are integers, this function is able to use a very fast
pre-cached mapping. However, if they are floats which differ slightly, then
it has to do the appropriate rounding - which is slower:

>>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2",
...      letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]})
>>> _get_sanger_quality_str(r2)
'SI?5+!'

If your scores include a None value, this raises an exception:

>>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3",
...               letter_annotations = {"phred_quality":[50,40,30,20,10,None]})
>>> _get_sanger_quality_str(r3)
Traceback (most recent call last):
   ...
TypeError: A quality value of None was found

If (strangely) your record has both PHRED and Solexa scores, then the PHRED
scores are used in preference:

>>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4",
...               letter_annotations = {"phred_quality":[50,40,30,20,10,0],
...                                     "solexa_quality":[-5,-4,0,None,0,40]})
>>> _get_sanger_quality_str(r4)
'SI?5+!'

If there are no PHRED scores, but there are Solexa scores, these are used
instead (after the approriate conversion):

>>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5",
...      letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]})
>>> _get_sanger_quality_str(r5)
'I?5+$"'

Again, integer Solexa scores can be looked up in a pre-cached mapping making
this very fast. You can still use approximate floating point scores:

>>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6",
...      letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]})
>>> _get_sanger_quality_str(r6)
'I?5+$"'

Notice that due to the limited range of printable ASCII characters, a
PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ
file (using ASCII 126, the tilde). This function will issue a warning
in this situation.

Definition at line 544 of file QualityIO.py.

00544 
00545 def _get_sanger_quality_str(record):
00546     """Returns a Sanger FASTQ encoded quality string (PRIVATE).
00547 
00548     >>> from Bio.Seq import Seq
00549     >>> from Bio.SeqRecord import SeqRecord
00550     >>> r = SeqRecord(Seq("ACGTAN"), id="Test",
00551     ...               letter_annotations = {"phred_quality":[50,40,30,20,10,0]})
00552     >>> _get_sanger_quality_str(r)
00553     'SI?5+!'
00554 
00555     If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO),
00556     the PHRED qualities are integers, this function is able to use a very fast
00557     pre-cached mapping. However, if they are floats which differ slightly, then
00558     it has to do the appropriate rounding - which is slower:
00559 
00560     >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2",
00561     ...      letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]})
00562     >>> _get_sanger_quality_str(r2)
00563     'SI?5+!'
00564 
00565     If your scores include a None value, this raises an exception:
00566 
00567     >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3",
00568     ...               letter_annotations = {"phred_quality":[50,40,30,20,10,None]})
00569     >>> _get_sanger_quality_str(r3)
00570     Traceback (most recent call last):
00571        ...
00572     TypeError: A quality value of None was found
00573 
00574     If (strangely) your record has both PHRED and Solexa scores, then the PHRED
00575     scores are used in preference:
00576 
00577     >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4",
00578     ...               letter_annotations = {"phred_quality":[50,40,30,20,10,0],
00579     ...                                     "solexa_quality":[-5,-4,0,None,0,40]})
00580     >>> _get_sanger_quality_str(r4)
00581     'SI?5+!'
00582 
00583     If there are no PHRED scores, but there are Solexa scores, these are used
00584     instead (after the approriate conversion):
00585 
00586     >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5",
00587     ...      letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]})
00588     >>> _get_sanger_quality_str(r5)
00589     'I?5+$"'
00590 
00591     Again, integer Solexa scores can be looked up in a pre-cached mapping making
00592     this very fast. You can still use approximate floating point scores:
00593 
00594     >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6",
00595     ...      letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]})
00596     >>> _get_sanger_quality_str(r6)
00597     'I?5+$"'
00598 
00599     Notice that due to the limited range of printable ASCII characters, a
00600     PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ
00601     file (using ASCII 126, the tilde). This function will issue a warning
00602     in this situation.
00603     """
00604     #TODO - This functions works and is fast, but it is also ugly
00605     #and there is considerable repetition of code for the other
00606     #two FASTQ variants.
00607     try:
00608         #These take priority (in case both Solexa and PHRED scores found)
00609         qualities = record.letter_annotations["phred_quality"]
00610     except KeyError:
00611         #Fall back on solexa scores...
00612         pass
00613     else:
00614         #Try and use the precomputed mapping:
00615         try:
00616             return "".join([_phred_to_sanger_quality_str[qp] \
00617                             for qp in qualities])
00618         except KeyError:
00619             #Could be a float, or a None in the list, or a high value.
00620             pass
00621         if None in qualities:
00622             raise TypeError("A quality value of None was found")
00623         if max(qualities) >= 93.5:
00624             warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ",
00625                           BiopythonWarning)
00626         #This will apply the truncation at 93, giving max ASCII 126
00627         return "".join([chr(min(126, int(round(qp))+SANGER_SCORE_OFFSET)) \
00628                         for qp in qualities])
00629     #Fall back on the Solexa scores...
00630     try:
00631         qualities = record.letter_annotations["solexa_quality"]
00632     except KeyError:
00633         raise ValueError("No suitable quality scores found in "
00634                          "letter_annotations of SeqRecord (id=%s)." \
00635                          % record.id)
00636     #Try and use the precomputed mapping:
00637     try:
00638         return "".join([_solexa_to_sanger_quality_str[qs] \
00639                         for qs in qualities])
00640     except KeyError:
00641         #Either no PHRED scores, or something odd like a float or None
00642         pass
00643     if None in qualities:
00644         raise TypeError("A quality value of None was found")
00645     #Must do this the slow way, first converting the PHRED scores into
00646     #Solexa scores:
00647     if max(qualities) >= 93.5:
00648         warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ",
00649                       BiopythonWarning)
00650     #This will apply the truncation at 93, giving max ASCII 126
00651     return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET)) \
00652                     for qs in qualities])
00653 
#Only map 0 to 62, we need to give a warning on truncating at 62

Here is the call graph for this function:

def Bio.SeqIO.QualityIO._get_solexa_quality_str (   record) [private]
Returns a Solexa FASTQ encoded quality string (PRIVATE).

Notice that due to the limited range of printable ASCII characters, a
Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ
file (using ASCII 126, the tilde). This function will issue a warning
in this situation.

Definition at line 727 of file QualityIO.py.

00727 
00728 def _get_solexa_quality_str(record):
00729     """Returns a Solexa FASTQ encoded quality string (PRIVATE).
00730 
00731     Notice that due to the limited range of printable ASCII characters, a
00732     Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ
00733     file (using ASCII 126, the tilde). This function will issue a warning
00734     in this situation.
00735     """
00736     #TODO - This functions works and is fast, but it is also ugly
00737     #and there is considerable repetition of code for the other
00738     #two FASTQ variants.
00739     try:
00740         #These take priority (in case both Solexa and PHRED scores found)
00741         qualities = record.letter_annotations["solexa_quality"]
00742     except KeyError:
00743         #Fall back on PHRED scores...
00744         pass
00745     else:
00746         #Try and use the precomputed mapping:
00747         try:
00748             return "".join([_solexa_to_solexa_quality_str[qs] \
00749                             for qs in qualities])
00750         except KeyError:
00751             #Could be a float, or a None in the list, or a high value.
00752             pass
00753         if None in qualities:
00754             raise TypeError("A quality value of None was found")
00755         if max(qualities) >= 62.5:
00756             warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ",
00757                           BiopythonWarning)
00758         #This will apply the truncation at 62, giving max ASCII 126
00759         return "".join([chr(min(126, int(round(qs))+SOLEXA_SCORE_OFFSET)) \
00760                         for qs in qualities])
00761     #Fall back on the PHRED scores...
00762     try:
00763         qualities = record.letter_annotations["phred_quality"]
00764     except KeyError:
00765         raise ValueError("No suitable quality scores found in "
00766                          "letter_annotations of SeqRecord (id=%s)." \
00767                          % record.id)
00768     #Try and use the precomputed mapping:
00769     try:
00770         return "".join([_phred_to_solexa_quality_str[qp] \
00771                         for qp in qualities])
00772     except KeyError:
00773         #Either no PHRED scores, or something odd like a float or None
00774         #or too big to be in the cache
00775         pass
00776     if None in qualities:
00777         raise TypeError("A quality value of None was found")
00778     #Must do this the slow way, first converting the PHRED scores into
00779     #Solexa scores:
00780     if max(qualities) >= 62.5:
00781         warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ",
00782                       BiopythonWarning)
00783     return "".join([chr(min(126,
00784                             int(round(solexa_quality_from_phred(qp))) + \
00785                             SOLEXA_SCORE_OFFSET)) \
00786                     for qp in qualities])
00787 
#TODO - Default to nucleotide or even DNA?

Here is the call graph for this function:

def Bio.SeqIO.QualityIO._test ( ) [private]
Run the Bio.SeqIO module's doctests.

This will try and locate the unit tests directory, and run the doctests
from there in order that the relative paths used in the examples work.

Definition at line 1785 of file QualityIO.py.

01785 
01786 def _test():
01787     """Run the Bio.SeqIO module's doctests.
01788 
01789     This will try and locate the unit tests directory, and run the doctests
01790     from there in order that the relative paths used in the examples work.
01791     """
01792     import doctest
01793     import os
01794     if os.path.isdir(os.path.join("..", "..", "Tests")):
01795         print "Runing doctests..."
01796         cur_dir = os.path.abspath(os.curdir)
01797         os.chdir(os.path.join("..", "..", "Tests"))
01798         assert os.path.isfile("Quality/example.fastq")
01799         assert os.path.isfile("Quality/example.fasta")
01800         assert os.path.isfile("Quality/example.qual")
01801         assert os.path.isfile("Quality/tricky.fastq")
01802         assert os.path.isfile("Quality/solexa_faked.fastq")
01803         doctest.testmod(verbose=0)
01804         os.chdir(cur_dir)
01805         del cur_dir
01806         print "Done"
        
Iterate over Fastq records as string tuples (not as SeqRecord objects).

This code does not try to interpret the quality string numerically.  It
just returns tuples of the title, sequence and quality as strings.  For
the sequence and quality, any whitespace (such as new lines) is removed.

Our SeqRecord based FASTQ iterators call this function internally, and then
turn the strings into a SeqRecord objects, mapping the quality string into
a list of numerical scores.  If you want to do a custom quality mapping,
then you might consider calling this function directly.

For parsing FASTQ files, the title string from the "@" line at the start
of each record can optionally be omitted on the "+" lines.  If it is
repeated, it must be identical.

The sequence string and the quality string can optionally be split over
multiple lines, although several sources discourage this.  In comparison,
for the FASTA file format line breaks between 60 and 80 characters are
the norm.

WARNING - Because the "@" character can appear in the quality string,
this can cause problems as this is also the marker for the start of
a new sequence.  In fact, the "+" sign can also appear as well.  Some
sources recommended having no line breaks in the  quality to avoid this,
but even that is not enough, consider this example::

    @071113_EAS56_0053:1:1:998:236
    TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
    +071113_EAS56_0053:1:1:998:236
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
    @071113_EAS56_0053:1:1:182:712
    ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
    +
    @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
    @071113_EAS56_0053:1:1:153:10
    TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
    +
    IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
    @071113_EAS56_0053:1:3:990:501
    TGGGAGGTTTTATGTGGA
    AAGCAGCAATGTACAAGA
    +
    IIIIIII.IIIIII1@44
    @-7.%<&+/$/%4(++(%

This is four PHRED encoded FASTQ entries originally from an NCBI source
(given the read length of 36, these are probably Solexa Illumna reads where
the quality has been mapped onto the PHRED values).

This example has been edited to illustrate some of the nasty things allowed
in the FASTQ format.  Firstly, on the "+" lines most but not all of the
(redundant) identifiers are ommited.  In real files it is likely that all or
none of these extra identifiers will be present.

Secondly, while the first three sequences have been shown without line
breaks, the last has been split over multiple lines.  In real files any line
breaks are likely to be consistent.

Thirdly, some of the quality string lines start with an "@" character.  For
the second record this is unavoidable.  However for the fourth sequence this
only happens because its quality string is split over two lines.  A naive
parser could wrongly treat any line starting with an "@" as the beginning of
a new sequence!  This code copes with this possible ambiguity by keeping
track of the length of the sequence which gives the expected length of the
quality string.

Using this tricky example file as input, this short bit of code demonstrates
what this parsing function would return:

>>> handle = open("Quality/tricky.fastq", "rU")
>>> for (title, sequence, quality) in FastqGeneralIterator(handle):
...     print title
...     print sequence, quality
071113_EAS56_0053:1:1:998:236
TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
071113_EAS56_0053:1:1:153:10
TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
071113_EAS56_0053:1:3:990:501
TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
>>> handle.close()

Finally we note that some sources state that the quality string should
start with "!" (which using the PHRED mapping means the first letter always
has a quality score of zero).  This rather restrictive rule is not widely
observed, so is therefore ignored here.  One plus point about this "!" rule
is that (provided there are no line breaks in the quality sequence) it
would prevent the above problem with the "@" character.

Definition at line 788 of file QualityIO.py.

00788 
00789 def FastqGeneralIterator(handle):
00790     """Iterate over Fastq records as string tuples (not as SeqRecord objects).
00791 
00792     This code does not try to interpret the quality string numerically.  It
00793     just returns tuples of the title, sequence and quality as strings.  For
00794     the sequence and quality, any whitespace (such as new lines) is removed.
00795 
00796     Our SeqRecord based FASTQ iterators call this function internally, and then
00797     turn the strings into a SeqRecord objects, mapping the quality string into
00798     a list of numerical scores.  If you want to do a custom quality mapping,
00799     then you might consider calling this function directly.
00800 
00801     For parsing FASTQ files, the title string from the "@" line at the start
00802     of each record can optionally be omitted on the "+" lines.  If it is
00803     repeated, it must be identical.
00804 
00805     The sequence string and the quality string can optionally be split over
00806     multiple lines, although several sources discourage this.  In comparison,
00807     for the FASTA file format line breaks between 60 and 80 characters are
00808     the norm.
00809 
00810     WARNING - Because the "@" character can appear in the quality string,
00811     this can cause problems as this is also the marker for the start of
00812     a new sequence.  In fact, the "+" sign can also appear as well.  Some
00813     sources recommended having no line breaks in the  quality to avoid this,
00814     but even that is not enough, consider this example::
00815 
00816         @071113_EAS56_0053:1:1:998:236
00817         TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
00818         +071113_EAS56_0053:1:1:998:236
00819         IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
00820         @071113_EAS56_0053:1:1:182:712
00821         ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
00822         +
00823         @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
00824         @071113_EAS56_0053:1:1:153:10
00825         TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
00826         +
00827         IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
00828         @071113_EAS56_0053:1:3:990:501
00829         TGGGAGGTTTTATGTGGA
00830         AAGCAGCAATGTACAAGA
00831         +
00832         IIIIIII.IIIIII1@44
00833         @-7.%<&+/$/%4(++(%
00834 
00835     This is four PHRED encoded FASTQ entries originally from an NCBI source
00836     (given the read length of 36, these are probably Solexa Illumna reads where
00837     the quality has been mapped onto the PHRED values).
00838 
00839     This example has been edited to illustrate some of the nasty things allowed
00840     in the FASTQ format.  Firstly, on the "+" lines most but not all of the
00841     (redundant) identifiers are ommited.  In real files it is likely that all or
00842     none of these extra identifiers will be present.
00843 
00844     Secondly, while the first three sequences have been shown without line
00845     breaks, the last has been split over multiple lines.  In real files any line
00846     breaks are likely to be consistent.
00847 
00848     Thirdly, some of the quality string lines start with an "@" character.  For
00849     the second record this is unavoidable.  However for the fourth sequence this
00850     only happens because its quality string is split over two lines.  A naive
00851     parser could wrongly treat any line starting with an "@" as the beginning of
00852     a new sequence!  This code copes with this possible ambiguity by keeping
00853     track of the length of the sequence which gives the expected length of the
00854     quality string.
00855 
00856     Using this tricky example file as input, this short bit of code demonstrates
00857     what this parsing function would return:
00858 
00859     >>> handle = open("Quality/tricky.fastq", "rU")
00860     >>> for (title, sequence, quality) in FastqGeneralIterator(handle):
00861     ...     print title
00862     ...     print sequence, quality
00863     071113_EAS56_0053:1:1:998:236
00864     TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
00865     071113_EAS56_0053:1:1:182:712
00866     ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
00867     071113_EAS56_0053:1:1:153:10
00868     TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
00869     071113_EAS56_0053:1:3:990:501
00870     TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
00871     >>> handle.close()
00872 
00873     Finally we note that some sources state that the quality string should
00874     start with "!" (which using the PHRED mapping means the first letter always
00875     has a quality score of zero).  This rather restrictive rule is not widely
00876     observed, so is therefore ignored here.  One plus point about this "!" rule
00877     is that (provided there are no line breaks in the quality sequence) it
00878     would prevent the above problem with the "@" character.
00879     """
00880     #We need to call handle.readline() at least four times per record,
00881     #so we'll save a property look up each time:
00882     handle_readline = handle.readline
00883     
00884     #Skip any text before the first record (e.g. blank lines, comments?)
00885     while True:
00886         line = handle_readline()
00887         if not line:
00888             return #Premature end of file, or just empty?
00889         if line[0] == "@":
00890             break
00891         if isinstance(line[0], int):
00892             raise ValueError("Is this handle in binary mode not text mode?")
00893 
00894     while line:
00895         if line[0] != "@":
00896             raise ValueError("Records in Fastq files should start with '@' character")
00897         title_line = line[1:].rstrip()
00898         #Will now be at least one line of quality data - in most FASTQ files
00899         #just one line! We therefore use string concatenation (if needed)
00900         #rather using than the "".join(...) trick just in case it is multiline:
00901         seq_string = handle_readline().rstrip()
00902         #There may now be more sequence lines, or the "+" quality marker line:
00903         while True:
00904             line = handle_readline()
00905             if not line:
00906                 raise ValueError("End of file without quality information.")
00907             if line[0] == "+":
00908                 #The title here is optional, but if present must match!
00909                 second_title = line[1:].rstrip()
00910                 if second_title and second_title != title_line:
00911                     raise ValueError("Sequence and quality captions differ.")
00912                 break
00913             seq_string += line.rstrip() #removes trailing newlines
00914         #This is going to slow things down a little, but assuming
00915         #this isn't allowed we should try and catch it here:
00916         if " " in seq_string or "\t" in seq_string:
00917             raise ValueError("Whitespace is not allowed in the sequence.")
00918         seq_len = len(seq_string)
00919 
00920         #Will now be at least one line of quality data...
00921         quality_string = handle_readline().rstrip()
00922         #There may now be more quality data, or another sequence, or EOF
00923         while True:
00924             line = handle_readline()
00925             if not line : break #end of file
00926             if line[0] == "@":
00927                 #This COULD be the start of a new sequence. However, it MAY just
00928                 #be a line of quality data which starts with a "@" character.  We
00929                 #should be able to check this by looking at the sequence length
00930                 #and the amount of quality data found so far.
00931                 if len(quality_string) >= seq_len:
00932                     #We expect it to be equal if this is the start of a new record.
00933                     #If the quality data is longer, we'll raise an error below.
00934                     break
00935                 #Continue - its just some (more) quality data.
00936             quality_string += line.rstrip()
00937         
00938         if seq_len != len(quality_string):
00939             raise ValueError("Lengths of sequence and quality values differs "
00940                              " for %s (%i and %i)." \
00941                              % (title_line, seq_len, len(quality_string)))
00942 
00943         #Return the record and then continue...
00944         yield (title_line, seq_string, quality_string)
00945     raise StopIteration
00946 
00947         
#This is a generator function!

Here is the caller graph for this function:

def Bio.SeqIO.QualityIO.FastqIlluminaIterator (   handle,
  alphabet = single_letter_alphabet,
  title2ids = None 
)
Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).

The optional arguments are the same as those for the FastqPhredIterator.

For each sequence in Illumina 1.3+ FASTQ files there is a matching string
encoding PHRED integer qualities using ASCII values with an offset of 64.

>>> from Bio import SeqIO
>>> record = SeqIO.read(open("Quality/illumina_faked.fastq"), "fastq-illumina")
>>> print record.id, record.seq
Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
>>> max(record.letter_annotations["phred_quality"])
40
>>> min(record.letter_annotations["phred_quality"])
0

NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
with an ASCII offset of 64. They are approximately equal but only for high
quality reads. If you have an old Solexa/Illumina file with negative
Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:

>>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina")
Traceback (most recent call last):
   ...
ValueError: Invalid character in quality string

NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.

Definition at line 1204 of file QualityIO.py.

01204 
01205 def FastqIlluminaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
01206     """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).
01207 
01208     The optional arguments are the same as those for the FastqPhredIterator.
01209 
01210     For each sequence in Illumina 1.3+ FASTQ files there is a matching string
01211     encoding PHRED integer qualities using ASCII values with an offset of 64.
01212 
01213     >>> from Bio import SeqIO
01214     >>> record = SeqIO.read(open("Quality/illumina_faked.fastq"), "fastq-illumina")
01215     >>> print record.id, record.seq
01216     Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
01217     >>> max(record.letter_annotations["phred_quality"])
01218     40
01219     >>> min(record.letter_annotations["phred_quality"])
01220     0
01221 
01222     NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
01223     with an ASCII offset of 64. They are approximately equal but only for high
01224     quality reads. If you have an old Solexa/Illumina file with negative
01225     Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:
01226 
01227     >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina")
01228     Traceback (most recent call last):
01229        ...
01230     ValueError: Invalid character in quality string
01231 
01232     NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.
01233     """
01234     q_mapping = dict()
01235     for letter in range(0, 255):
01236         q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
01237     for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
01238         if title2ids:
01239             id, name, descr = title2ids(title_line)
01240         else:
01241             descr = title_line
01242             id   = descr.split()[0]
01243             name = id
01244         record = SeqRecord(Seq(seq_string, alphabet),
01245                            id=id, name=name, description=descr)
01246         qualities = [q_mapping[letter] for letter in quality_string]
01247         if qualities and (min(qualities) < 0 or max(qualities) > 62):
01248             raise ValueError("Invalid character in quality string")
01249         #Dirty trick to speed up this line:
01250         #record.letter_annotations["phred_quality"] = qualities
01251         dict.__setitem__(record._per_letter_annotations,
01252                          "phred_quality", qualities)
01253         yield record
    

Here is the call graph for this function:

def Bio.SeqIO.QualityIO.FastqPhredIterator (   handle,
  alphabet = single_letter_alphabet,
  title2ids = None 
)
Generator function to iterate over FASTQ records (as SeqRecord objects).

 - handle - input file
 - alphabet - optional alphabet
 - title2ids - A function that, when given the title line from the FASTQ
               file (without the beginning >), will return the id, name and
               description (in that order) for the record as a tuple of
               strings.  If this is not given, then the entire title line
               will be used as the description, and the first word as the
               id and name.

Note that use of title2ids matches that of Bio.SeqIO.FastaIO.

For each sequence in a (Sanger style) FASTQ file there is a matching string
encoding the PHRED qualities (integers between 0 and about 90) using ASCII
values with an offset of 33.

For example, consider a file containing three short reads::

    @EAS54_6_R1_2_1_413_324
    CCCTTCTTGTCTTCAGCGTTTCTCC
    +
    ;;3;;;;;;;;;;;;7;;;;;;;88
    @EAS54_6_R1_2_1_540_792
    TTGGCAGGCCAAGGCCGATGGATCA
    +
    ;;;;;;;;;;;7;;;;;-;;;3;83
    @EAS54_6_R1_2_1_443_348
    GTTGCTTCTGGCGTGGGTGGGGGGG
    +
    ;;;;;;;;;;;9;7;;.7;393333

For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
string encoding the PHRED qualities using a ASCI values with an offset of
33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").

Using this module directly you might run:

>>> handle = open("Quality/example.fastq", "rU")
>>> for record in FastqPhredIterator(handle):
...     print record.id, record.seq
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
>>> handle.close()

Typically however, you would call this via Bio.SeqIO instead with "fastq"
(or "fastq-sanger") as the format:

>>> from Bio import SeqIO
>>> handle = open("Quality/example.fastq", "rU")
>>> for record in SeqIO.parse(handle, "fastq"):
...     print record.id, record.seq
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
>>> handle.close()

If you want to look at the qualities, they are record in each record's
per-letter-annotation dictionary as a simple list of integers:

>>> print record.letter_annotations["phred_quality"]
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

Definition at line 948 of file QualityIO.py.

00948 
00949 def FastqPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
00950     """Generator function to iterate over FASTQ records (as SeqRecord objects).
00951 
00952      - handle - input file
00953      - alphabet - optional alphabet
00954      - title2ids - A function that, when given the title line from the FASTQ
00955                    file (without the beginning >), will return the id, name and
00956                    description (in that order) for the record as a tuple of
00957                    strings.  If this is not given, then the entire title line
00958                    will be used as the description, and the first word as the
00959                    id and name.
00960 
00961     Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
00962 
00963     For each sequence in a (Sanger style) FASTQ file there is a matching string
00964     encoding the PHRED qualities (integers between 0 and about 90) using ASCII
00965     values with an offset of 33.
00966 
00967     For example, consider a file containing three short reads::
00968 
00969         @EAS54_6_R1_2_1_413_324
00970         CCCTTCTTGTCTTCAGCGTTTCTCC
00971         +
00972         ;;3;;;;;;;;;;;;7;;;;;;;88
00973         @EAS54_6_R1_2_1_540_792
00974         TTGGCAGGCCAAGGCCGATGGATCA
00975         +
00976         ;;;;;;;;;;;7;;;;;-;;;3;83
00977         @EAS54_6_R1_2_1_443_348
00978         GTTGCTTCTGGCGTGGGTGGGGGGG
00979         +
00980         ;;;;;;;;;;;9;7;;.7;393333
00981 
00982     For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
00983     string encoding the PHRED qualities using a ASCI values with an offset of
00984     33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
00985 
00986     Using this module directly you might run:
00987 
00988     >>> handle = open("Quality/example.fastq", "rU")
00989     >>> for record in FastqPhredIterator(handle):
00990     ...     print record.id, record.seq
00991     EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
00992     EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
00993     EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
00994     >>> handle.close()
00995 
00996     Typically however, you would call this via Bio.SeqIO instead with "fastq"
00997     (or "fastq-sanger") as the format:
00998 
00999     >>> from Bio import SeqIO
01000     >>> handle = open("Quality/example.fastq", "rU")
01001     >>> for record in SeqIO.parse(handle, "fastq"):
01002     ...     print record.id, record.seq
01003     EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
01004     EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
01005     EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
01006     >>> handle.close()
01007 
01008     If you want to look at the qualities, they are record in each record's
01009     per-letter-annotation dictionary as a simple list of integers:
01010 
01011     >>> print record.letter_annotations["phred_quality"]
01012     [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
01013     """
01014     assert SANGER_SCORE_OFFSET == ord("!")
01015     #Originally, I used a list expression for each record:
01016     #
01017     # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string]
01018     #
01019     #Precomputing is faster, perhaps partly by avoiding the subtractions.
01020     q_mapping = dict()
01021     for letter in range(0, 255):
01022         q_mapping[chr(letter)] = letter-SANGER_SCORE_OFFSET
01023     for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
01024         if title2ids:
01025             id, name, descr = title2ids(title_line)
01026         else:
01027             descr = title_line
01028             id   = descr.split()[0]
01029             name = id
01030         record = SeqRecord(Seq(seq_string, alphabet),
01031                            id=id, name=name, description=descr)
01032         qualities = [q_mapping[letter] for letter in quality_string]
01033         if qualities and (min(qualities) < 0 or max(qualities) > 93):
01034             raise ValueError("Invalid character in quality string")
01035         #For speed, will now use a dirty trick to speed up assigning the
01036         #qualities. We do this to bypass the length check imposed by the
01037         #per-letter-annotations restricted dict (as this has already been
01038         #checked by FastqGeneralIterator). This is equivalent to:
01039         #record.letter_annotations["phred_quality"] = qualities
01040         dict.__setitem__(record._per_letter_annotations,
01041                          "phred_quality", qualities)
01042         yield record
01043 
#This is a generator function!

Here is the call graph for this function:

def Bio.SeqIO.QualityIO.FastqSolexaIterator (   handle,
  alphabet = single_letter_alphabet,
  title2ids = None 
)

Definition at line 1044 of file QualityIO.py.

01044 
01045 def FastqSolexaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
01046     r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).
01047 
01048     The optional arguments are the same as those for the FastqPhredIterator.
01049 
01050     For each sequence in Solexa/Illumina FASTQ files there is a matching string
01051     encoding the Solexa integer qualities using ASCII values with an offset
01052     of 64.  Solexa scores are scaled differently to PHRED scores, and Biopython
01053     will NOT perform any automatic conversion when loading.
01054 
01055     NOTE - This file format is used by the OLD versions of the Solexa/Illumina
01056     pipeline. See also the FastqIlluminaIterator function for the NEW version.
01057 
01058     For example, consider a file containing these five records::
01059         
01060         @SLXA-B3_649_FC8437_R1_1_1_610_79
01061         GATGTGCAATACCTTTGTAGAGGAA
01062         +SLXA-B3_649_FC8437_R1_1_1_610_79
01063         YYYYYYYYYYYYYYYYYYWYWYYSU
01064         @SLXA-B3_649_FC8437_R1_1_1_397_389
01065         GGTTTGAGAAAGAGAAATGAGATAA
01066         +SLXA-B3_649_FC8437_R1_1_1_397_389
01067         YYYYYYYYYWYYYYWWYYYWYWYWW
01068         @SLXA-B3_649_FC8437_R1_1_1_850_123
01069         GAGGGTGTTGATCATGATGATGGCG
01070         +SLXA-B3_649_FC8437_R1_1_1_850_123
01071         YYYYYYYYYYYYYWYYWYYSYYYSY
01072         @SLXA-B3_649_FC8437_R1_1_1_362_549
01073         GGAAACAAAGTTTTTCTCAACATAG
01074         +SLXA-B3_649_FC8437_R1_1_1_362_549
01075         YYYYYYYYYYYYYYYYYYWWWWYWY
01076         @SLXA-B3_649_FC8437_R1_1_1_183_714
01077         GTATTATTTAATGGCATACACTCAA
01078         +SLXA-B3_649_FC8437_R1_1_1_183_714
01079         YYYYYYYYYYWYYYYWYWWUWWWQQ
01080         
01081     Using this module directly you might run:
01082 
01083     >>> handle = open("Quality/solexa_example.fastq", "rU")
01084     >>> for record in FastqSolexaIterator(handle):
01085     ...     print record.id, record.seq
01086     SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
01087     SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
01088     SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
01089     SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
01090     SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
01091     >>> handle.close()
01092 
01093     Typically however, you would call this via Bio.SeqIO instead with
01094     "fastq-solexa" as the format:
01095 
01096     >>> from Bio import SeqIO
01097     >>> handle = open("Quality/solexa_example.fastq", "rU")
01098     >>> for record in SeqIO.parse(handle, "fastq-solexa"):
01099     ...     print record.id, record.seq
01100     SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
01101     SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
01102     SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
01103     SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
01104     SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
01105     >>> handle.close()
01106 
01107     If you want to look at the qualities, they are recorded in each record's
01108     per-letter-annotation dictionary as a simple list of integers:
01109 
01110     >>> print record.letter_annotations["solexa_quality"]
01111     [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
01112 
01113     These scores aren't very good, but they are high enough that they map
01114     almost exactly onto PHRED scores:
01115 
01116     >>> print "%0.2f" % phred_quality_from_solexa(25)
01117     25.01
01118 
01119     Let's look at faked example read which is even worse, where there are
01120     more noticeable differences between the Solexa and PHRED scores::
01121 
01122          @slxa_0001_1_0001_01
01123          ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
01124          +slxa_0001_1_0001_01
01125          hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
01126 
01127     Again, you would typically use Bio.SeqIO to read this file in (rather than
01128     calling the Bio.SeqIO.QualtityIO module directly).  Most FASTQ files will
01129     contain thousands of reads, so you would normally use Bio.SeqIO.parse()
01130     as shown above.  This example has only as one entry, so instead we can
01131     use the Bio.SeqIO.read() function:
01132 
01133     >>> from Bio import SeqIO
01134     >>> handle = open("Quality/solexa_faked.fastq", "rU")
01135     >>> record = SeqIO.read(handle, "fastq-solexa")
01136     >>> handle.close()
01137     >>> print record.id, record.seq
01138     slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
01139     >>> print record.letter_annotations["solexa_quality"]
01140     [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
01141 
01142     These quality scores are so low that when converted from the Solexa scheme
01143     into PHRED scores they look quite different:
01144 
01145     >>> print "%0.2f" % phred_quality_from_solexa(-1)
01146     2.54
01147     >>> print "%0.2f" % phred_quality_from_solexa(-5)
01148     1.19
01149 
01150     Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
01151     method to output the record(s):
01152 
01153     >>> print record.format("fastq-solexa")
01154     @slxa_0001_1_0001_01
01155     ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
01156     +
01157     hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
01158     <BLANKLINE>
01159     
01160     Note this output is slightly different from the input file as Biopython
01161     has left out the optional repetition of the sequence identifier on the "+"
01162     line.  If you want the to use PHRED scores, use "fastq" or "qual" as the
01163     output format instead, and Biopython will do the conversion for you:
01164 
01165     >>> print record.format("fastq")
01166     @slxa_0001_1_0001_01
01167     ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
01168     +
01169     IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
01170     <BLANKLINE>
01171     
01172     >>> print record.format("qual")
01173     >slxa_0001_1_0001_01
01174     40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
01175     20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
01176     1 1
01177     <BLANKLINE>
01178 
01179     As shown above, the poor quality Solexa reads have been mapped to the
01180     equivalent PHRED score (e.g. -5 to 1 as shown earlier).
01181     """
01182     q_mapping = dict()
01183     for letter in range(0, 255):
01184         q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
01185     for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
01186         if title2ids:
01187             id, name, descr = title_line
01188         else:
01189             descr = title_line
01190             id   = descr.split()[0]
01191             name = id
01192         record = SeqRecord(Seq(seq_string, alphabet),
01193                            id=id, name=name, description=descr)
01194         qualities = [q_mapping[letter] for letter in quality_string]
01195         #DO NOT convert these into PHRED qualities automatically!
01196         if qualities and (min(qualities) < -5 or max(qualities)>62):
01197             raise ValueError("Invalid character in quality string")
01198         #Dirty trick to speed up this line:
01199         #record.letter_annotations["solexa_quality"] = qualities
01200         dict.__setitem__(record._per_letter_annotations,
01201                          "solexa_quality", qualities)
01202         yield record
01203 
#This is a generator function!

Here is the call graph for this function:

def Bio.SeqIO.QualityIO.PairedFastaQualIterator (   fasta_handle,
  qual_handle,
  alphabet = single_letter_alphabet,
  title2ids = None 
)
Iterate over matched FASTA and QUAL files as SeqRecord objects.

For example, consider this short QUAL file with PHRED quality scores::

    >EAS54_6_R1_2_1_413_324
    26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
    26 26 26 23 23
    >EAS54_6_R1_2_1_540_792
    26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
    26 18 26 23 18
    >EAS54_6_R1_2_1_443_348
    26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
    24 18 18 18 18

And a matching FASTA file::

    >EAS54_6_R1_2_1_413_324
    CCCTTCTTGTCTTCAGCGTTTCTCC
    >EAS54_6_R1_2_1_540_792
    TTGGCAGGCCAAGGCCGATGGATCA
    >EAS54_6_R1_2_1_443_348
    GTTGCTTCTGGCGTGGGTGGGGGGG

You can parse these separately using Bio.SeqIO with the "qual" and
"fasta" formats, but then you'll get a group of SeqRecord objects with
no sequence, and a matching group with the sequence but not the
qualities.  Because it only deals with one input file handle, Bio.SeqIO
can't be used to read the two files together - but this function can!
For example,

>>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
...                                    open("Quality/example.qual", "rU"))
>>> for record in rec_iter:
...     print record.id, record.seq
EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG

As with the FASTQ or QUAL parsers, if you want to look at the qualities,
they are in each record's per-letter-annotation dictionary as a simple
list of integers:

>>> print record.letter_annotations["phred_quality"]
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

If you have access to data as a FASTQ format file, using that directly
would be simpler and more straight forward.  Note that you can easily use
this function to convert paired FASTA and QUAL files into FASTQ files:

>>> from Bio import SeqIO
>>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
...                                    open("Quality/example.qual", "rU"))
>>> out_handle = open("Quality/temp.fastq", "w")
>>> SeqIO.write(rec_iter, out_handle, "fastq")
3
>>> out_handle.close()

And don't forget to clean up the temp file if you don't need it anymore:

>>> import os
>>> os.remove("Quality/temp.fastq")    

Definition at line 1686 of file QualityIO.py.

01686 
01687 def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet = single_letter_alphabet, title2ids = None):
01688     """Iterate over matched FASTA and QUAL files as SeqRecord objects.
01689 
01690     For example, consider this short QUAL file with PHRED quality scores::
01691 
01692         >EAS54_6_R1_2_1_413_324
01693         26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
01694         26 26 26 23 23
01695         >EAS54_6_R1_2_1_540_792
01696         26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
01697         26 18 26 23 18
01698         >EAS54_6_R1_2_1_443_348
01699         26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
01700         24 18 18 18 18
01701     
01702     And a matching FASTA file::
01703 
01704         >EAS54_6_R1_2_1_413_324
01705         CCCTTCTTGTCTTCAGCGTTTCTCC
01706         >EAS54_6_R1_2_1_540_792
01707         TTGGCAGGCCAAGGCCGATGGATCA
01708         >EAS54_6_R1_2_1_443_348
01709         GTTGCTTCTGGCGTGGGTGGGGGGG
01710 
01711     You can parse these separately using Bio.SeqIO with the "qual" and
01712     "fasta" formats, but then you'll get a group of SeqRecord objects with
01713     no sequence, and a matching group with the sequence but not the
01714     qualities.  Because it only deals with one input file handle, Bio.SeqIO
01715     can't be used to read the two files together - but this function can!
01716     For example,
01717     
01718     >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
01719     ...                                    open("Quality/example.qual", "rU"))
01720     >>> for record in rec_iter:
01721     ...     print record.id, record.seq
01722     EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
01723     EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
01724     EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
01725 
01726     As with the FASTQ or QUAL parsers, if you want to look at the qualities,
01727     they are in each record's per-letter-annotation dictionary as a simple
01728     list of integers:
01729 
01730     >>> print record.letter_annotations["phred_quality"]
01731     [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
01732 
01733     If you have access to data as a FASTQ format file, using that directly
01734     would be simpler and more straight forward.  Note that you can easily use
01735     this function to convert paired FASTA and QUAL files into FASTQ files:
01736 
01737     >>> from Bio import SeqIO
01738     >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
01739     ...                                    open("Quality/example.qual", "rU"))
01740     >>> out_handle = open("Quality/temp.fastq", "w")
01741     >>> SeqIO.write(rec_iter, out_handle, "fastq")
01742     3
01743     >>> out_handle.close()
01744 
01745     And don't forget to clean up the temp file if you don't need it anymore:
01746 
01747     >>> import os
01748     >>> os.remove("Quality/temp.fastq")    
01749     """
01750     from Bio.SeqIO.FastaIO import FastaIterator    
01751     fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \
01752                                title2ids=title2ids)
01753     qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \
01754                                   title2ids=title2ids)
01755 
01756     #Using zip(...) would create a list loading everything into memory!
01757     #It would also not catch any extra records found in only one file.
01758     while True:
01759         try:
01760             f_rec = fasta_iter.next()
01761         except StopIteration:
01762             f_rec = None
01763         try:
01764             q_rec = qual_iter.next()
01765         except StopIteration:
01766             q_rec = None
01767         if f_rec is None and q_rec is None:
01768             #End of both files
01769             break
01770         if f_rec is None:
01771             raise ValueError("FASTA file has more entries than the QUAL file.")
01772         if q_rec is None:
01773             raise ValueError("QUAL file has more entries than the FASTA file.")
01774         if f_rec.id != q_rec.id:
01775             raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \
01776                              % (f_rec.id, q_rec.id))
01777         if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]):
01778             raise ValueError("Sequence length and number of quality scores disagree for %s" \
01779                              % f_rec.id)
01780         #Merge the data....
01781         f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"]
01782         yield f_rec
01783     #Done
01784     

Here is the call graph for this function:

Convert a Solexa quality (which can be negative) to a PHRED quality.

PHRED and Solexa quality scores are both log transformations of a
probality of error (high score = low probability of error). This function
takes a Solexa score, transforms it back to a probability of error, and
then re-expresses it as a PHRED score. This assumes the error estimates
are equivalent.

The underlying formulas are given in the documentation for the sister
function solexa_quality_from_phred, in this case the operation is::

 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)

This will return a floating point number, it is up to you to round this to
the nearest integer if appropriate.  e.g.

>>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
80.00
>>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
20.04
>>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
10.41
>>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
3.01
>>> print "%0.2f" % round(phred_quality_from_solexa(-5),2)
1.19

Note that a solexa_quality less then -5 is not expected, will trigger a
warning, but will still be converted as per the logarithmic mapping
(giving a number between 0 and 1.19 back).

As a special case where None is used for a "missing value", None is
returned:

>>> print phred_quality_from_solexa(None)
None

Definition at line 473 of file QualityIO.py.

00473 
00474 def phred_quality_from_solexa(solexa_quality):
00475     """Convert a Solexa quality (which can be negative) to a PHRED quality.
00476 
00477     PHRED and Solexa quality scores are both log transformations of a
00478     probality of error (high score = low probability of error). This function
00479     takes a Solexa score, transforms it back to a probability of error, and
00480     then re-expresses it as a PHRED score. This assumes the error estimates
00481     are equivalent.
00482 
00483     The underlying formulas are given in the documentation for the sister
00484     function solexa_quality_from_phred, in this case the operation is::
00485     
00486      phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)
00487 
00488     This will return a floating point number, it is up to you to round this to
00489     the nearest integer if appropriate.  e.g.
00490 
00491     >>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
00492     80.00
00493     >>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
00494     20.04
00495     >>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
00496     10.41
00497     >>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
00498     3.01
00499     >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2)
00500     1.19
00501 
00502     Note that a solexa_quality less then -5 is not expected, will trigger a
00503     warning, but will still be converted as per the logarithmic mapping
00504     (giving a number between 0 and 1.19 back).
00505 
00506     As a special case where None is used for a "missing value", None is
00507     returned:
00508 
00509     >>> print phred_quality_from_solexa(None)
00510     None
00511     """
00512     if solexa_quality is None:
00513         #Assume None is used as some kind of NULL or NA value; return None
00514         return None
00515     if solexa_quality < -5:
00516         warnings.warn("Solexa quality less than -5 passed, %s" \
00517                       % repr(solexa_quality), BiopythonWarning)
00518     return 10*log(10**(solexa_quality/10.0) + 1, 10)

Here is the caller graph for this function:

def Bio.SeqIO.QualityIO.QualPhredIterator (   handle,
  alphabet = single_letter_alphabet,
  title2ids = None 
)
For QUAL files which include PHRED quality scores, but no sequence.

For example, consider this short QUAL file::

    >EAS54_6_R1_2_1_413_324
    26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
    26 26 26 23 23
    >EAS54_6_R1_2_1_540_792
    26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
    26 18 26 23 18
    >EAS54_6_R1_2_1_443_348
    26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
    24 18 18 18 18

Using this module directly you might run:

>>> handle = open("Quality/example.qual", "rU")
>>> for record in QualPhredIterator(handle):
...     print record.id, record.seq
EAS54_6_R1_2_1_413_324 ?????????????????????????
EAS54_6_R1_2_1_540_792 ?????????????????????????
EAS54_6_R1_2_1_443_348 ?????????????????????????
>>> handle.close()

Typically however, you would call this via Bio.SeqIO instead with "qual"
as the format:

>>> from Bio import SeqIO
>>> handle = open("Quality/example.qual", "rU")
>>> for record in SeqIO.parse(handle, "qual"):
...     print record.id, record.seq
EAS54_6_R1_2_1_413_324 ?????????????????????????
EAS54_6_R1_2_1_540_792 ?????????????????????????
EAS54_6_R1_2_1_443_348 ?????????????????????????
>>> handle.close()

Becase QUAL files don't contain the sequence string itself, the seq
property is set to an UnknownSeq object.  As no alphabet was given, this
has defaulted to a generic single letter alphabet and the character "?"
used.

By specifying a nucleotide alphabet, "N" is used instead:

>>> from Bio import SeqIO
>>> from Bio.Alphabet import generic_dna
>>> handle = open("Quality/example.qual", "rU")
>>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
...     print record.id, record.seq
EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
>>> handle.close()

However, the quality scores themselves are available as a list of integers
in each record's per-letter-annotation:

>>> print record.letter_annotations["phred_quality"]
[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]

You can still slice one of these SeqRecord objects with an UnknownSeq:

>>> sub_record = record[5:10]
>>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]

As of Biopython 1.59, this parser will accept files with negatives quality
scores but will replace them with the lowest possible PHRED score of zero.
This will trigger a warning, previously it raised a ValueError exception.

Definition at line 1254 of file QualityIO.py.

01254 
01255 def QualPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
01256     """For QUAL files which include PHRED quality scores, but no sequence.
01257 
01258     For example, consider this short QUAL file::
01259 
01260         >EAS54_6_R1_2_1_413_324
01261         26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
01262         26 26 26 23 23
01263         >EAS54_6_R1_2_1_540_792
01264         26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
01265         26 18 26 23 18
01266         >EAS54_6_R1_2_1_443_348
01267         26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
01268         24 18 18 18 18
01269     
01270     Using this module directly you might run:
01271 
01272     >>> handle = open("Quality/example.qual", "rU")
01273     >>> for record in QualPhredIterator(handle):
01274     ...     print record.id, record.seq
01275     EAS54_6_R1_2_1_413_324 ?????????????????????????
01276     EAS54_6_R1_2_1_540_792 ?????????????????????????
01277     EAS54_6_R1_2_1_443_348 ?????????????????????????
01278     >>> handle.close()
01279 
01280     Typically however, you would call this via Bio.SeqIO instead with "qual"
01281     as the format:
01282 
01283     >>> from Bio import SeqIO
01284     >>> handle = open("Quality/example.qual", "rU")
01285     >>> for record in SeqIO.parse(handle, "qual"):
01286     ...     print record.id, record.seq
01287     EAS54_6_R1_2_1_413_324 ?????????????????????????
01288     EAS54_6_R1_2_1_540_792 ?????????????????????????
01289     EAS54_6_R1_2_1_443_348 ?????????????????????????
01290     >>> handle.close()
01291 
01292     Becase QUAL files don't contain the sequence string itself, the seq
01293     property is set to an UnknownSeq object.  As no alphabet was given, this
01294     has defaulted to a generic single letter alphabet and the character "?"
01295     used.
01296 
01297     By specifying a nucleotide alphabet, "N" is used instead:
01298 
01299     >>> from Bio import SeqIO
01300     >>> from Bio.Alphabet import generic_dna
01301     >>> handle = open("Quality/example.qual", "rU")
01302     >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
01303     ...     print record.id, record.seq
01304     EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
01305     EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
01306     EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
01307     >>> handle.close()
01308 
01309     However, the quality scores themselves are available as a list of integers
01310     in each record's per-letter-annotation:
01311 
01312     >>> print record.letter_annotations["phred_quality"]
01313     [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
01314 
01315     You can still slice one of these SeqRecord objects with an UnknownSeq:
01316 
01317     >>> sub_record = record[5:10]
01318     >>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
01319     EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
01320 
01321     As of Biopython 1.59, this parser will accept files with negatives quality
01322     scores but will replace them with the lowest possible PHRED score of zero.
01323     This will trigger a warning, previously it raised a ValueError exception.
01324     """
01325     #Skip any text before the first record (e.g. blank lines, comments)
01326     while True:
01327         line = handle.readline()
01328         if line == "" : return #Premature end of file, or just empty?
01329         if line[0] == ">":
01330             break
01331 
01332     while True:
01333         if line[0] != ">":
01334             raise ValueError("Records in Fasta files should start with '>' character")
01335         if title2ids:
01336             id, name, descr = title2ids(line[1:].rstrip())
01337         else:
01338             descr = line[1:].rstrip()
01339             id   = descr.split()[0]
01340             name = id
01341 
01342         qualities = []
01343         line = handle.readline()
01344         while True:
01345             if not line : break
01346             if line[0] == ">": break
01347             qualities.extend([int(word) for word in line.split()])
01348             line = handle.readline()
01349 
01350         if qualities and min(qualities) < 0:
01351             warnings.warn(("Negative quality score %i found, " + \
01352                            "substituting PHRED zero instead.") \
01353                            % min(qualities), BiopythonParserWarning)
01354             qualities = [max(0,q) for q in qualities]
01355         
01356         #Return the record and then continue...
01357         record = SeqRecord(UnknownSeq(len(qualities), alphabet),
01358                            id = id, name = name, description = descr)
01359         #Dirty trick to speed up this line:
01360         #record.letter_annotations["phred_quality"] = qualities
01361         dict.__setitem__(record._per_letter_annotations,
01362                          "phred_quality", qualities)
01363         yield record
01364 
01365         if not line : return #StopIteration
01366     assert False, "Should not reach this line"

Here is the caller graph for this function:

Covert a PHRED quality (range 0 to about 90) to a Solexa quality.

PHRED and Solexa quality scores are both log transformations of a
probality of error (high score = low probability of error). This function
takes a PHRED score, transforms it back to a probability of error, and
then re-expresses it as a Solexa score. This assumes the error estimates
are equivalent.

How does this work exactly? Well the PHRED quality is minus ten times the
base ten logarithm of the probability of error::

 phred_quality = -10*log(error,10)

Therefore, turning this round::

 error = 10 ** (- phred_quality / 10)

Now, Solexa qualities use a different log transformation::

 solexa_quality = -10*log(error/(1-error),10)

After substitution and a little manipulation we get::

  solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)

However, real Solexa files use a minimum quality of -5. This does have a
good reason - a random a random base call would be correct 25% of the time,
and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.

Taken literally, this logarithic formula would map a PHRED quality of zero
to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
score of zero means a probability of error of one (i.e. the base call is
definitely wrong), which is worse than random! In practice, a PHRED quality
of zero usually means a default value, or perhaps random - and therefore
mapping it to the minimum Solexa score of -5 is reasonable.

In conclusion, we follow EMBOSS, and take this logarithmic formula but also
apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
quality of zero to -5.0 as well.

Note this function will return a floating point number, it is up to you to
round this to the nearest integer if appropriate.  e.g.

>>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
80.00
>>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
50.00
>>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
19.96
>>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
9.54
>>> print "%0.2f" % round(solexa_quality_from_phred(5),2)
3.35
>>> print "%0.2f" % round(solexa_quality_from_phred(4),2)
1.80
>>> print "%0.2f" % round(solexa_quality_from_phred(3),2)
-0.02
>>> print "%0.2f" % round(solexa_quality_from_phred(2),2)
-2.33
>>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
-5.00
>>> print "%0.2f" % round(solexa_quality_from_phred(0),2)
-5.00

Notice that for high quality reads PHRED and Solexa scores are numerically
equal. The differences are important for poor quality reads, where PHRED
has a minimum of zero but Solexa scores can be negative.

Finally, as a special case where None is used for a "missing value", None
is returned:

>>> print solexa_quality_from_phred(None)
None

Definition at line 381 of file QualityIO.py.

00381 
00382 def solexa_quality_from_phred(phred_quality):
00383     """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
00384 
00385     PHRED and Solexa quality scores are both log transformations of a
00386     probality of error (high score = low probability of error). This function
00387     takes a PHRED score, transforms it back to a probability of error, and
00388     then re-expresses it as a Solexa score. This assumes the error estimates
00389     are equivalent.
00390 
00391     How does this work exactly? Well the PHRED quality is minus ten times the
00392     base ten logarithm of the probability of error::
00393     
00394      phred_quality = -10*log(error,10)
00395 
00396     Therefore, turning this round::
00397 
00398      error = 10 ** (- phred_quality / 10)
00399     
00400     Now, Solexa qualities use a different log transformation::
00401 
00402      solexa_quality = -10*log(error/(1-error),10)
00403 
00404     After substitution and a little manipulation we get::
00405 
00406       solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)
00407 
00408     However, real Solexa files use a minimum quality of -5. This does have a
00409     good reason - a random a random base call would be correct 25% of the time,
00410     and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
00411     quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
00412     nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.
00413 
00414     Taken literally, this logarithic formula would map a PHRED quality of zero
00415     to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
00416     score of zero means a probability of error of one (i.e. the base call is
00417     definitely wrong), which is worse than random! In practice, a PHRED quality
00418     of zero usually means a default value, or perhaps random - and therefore
00419     mapping it to the minimum Solexa score of -5 is reasonable.
00420 
00421     In conclusion, we follow EMBOSS, and take this logarithmic formula but also
00422     apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
00423     quality of zero to -5.0 as well.
00424 
00425     Note this function will return a floating point number, it is up to you to
00426     round this to the nearest integer if appropriate.  e.g.
00427 
00428     >>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
00429     80.00
00430     >>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
00431     50.00
00432     >>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
00433     19.96
00434     >>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
00435     9.54
00436     >>> print "%0.2f" % round(solexa_quality_from_phred(5),2)
00437     3.35
00438     >>> print "%0.2f" % round(solexa_quality_from_phred(4),2)
00439     1.80
00440     >>> print "%0.2f" % round(solexa_quality_from_phred(3),2)
00441     -0.02
00442     >>> print "%0.2f" % round(solexa_quality_from_phred(2),2)
00443     -2.33
00444     >>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
00445     -5.00
00446     >>> print "%0.2f" % round(solexa_quality_from_phred(0),2)
00447     -5.00
00448 
00449     Notice that for high quality reads PHRED and Solexa scores are numerically
00450     equal. The differences are important for poor quality reads, where PHRED
00451     has a minimum of zero but Solexa scores can be negative.
00452 
00453     Finally, as a special case where None is used for a "missing value", None
00454     is returned:
00455 
00456     >>> print solexa_quality_from_phred(None)
00457     None
00458     """
00459     if phred_quality is None:
00460         #Assume None is used as some kind of NULL or NA value; return None
00461         #e.g. Bio.SeqIO gives Ace contig gaps a quality of None.
00462         return None
00463     elif phred_quality > 0:
00464         #Solexa uses a minimum value of -5, which after rounding matches a
00465         #random nucleotide base call.
00466         return max(-5.0, 10*log(10**(phred_quality/10.0) - 1, 10))
00467     elif phred_quality == 0:
00468         #Special case, map to -5 as discussed in the docstring
00469         return -5.0
00470     else:
00471         raise ValueError("PHRED qualities must be positive (or zero), not %s" \
00472                          % repr(phred_quality))

Here is the caller graph for this function:


Variable Documentation

string Bio.SeqIO.QualityIO.__docformat__ = "epytext en"

Definition at line 365 of file QualityIO.py.

Initial value:
00001 dict((qp, chr(qp+SOLEXA_SCORE_OFFSET)) \
00002                                       for qp in range(0, 62+1))

Definition at line 655 of file QualityIO.py.

Initial value:
00001 dict((qp, chr(min(126, qp+SANGER_SCORE_OFFSET))) \
00002                                     for qp in range(0, 93+1))

Definition at line 538 of file QualityIO.py.

Initial value:
00001 dict(\
00002     (qp, chr(min(126, int(round(solexa_quality_from_phred(qp)))+SOLEXA_SCORE_OFFSET))) \
00003     for qp in range(0, 62+1))

Definition at line 724 of file QualityIO.py.

Initial value:
00001 dict( \
00002     (qs, chr(int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
00003     for qs in range(-5, 62+1))

Definition at line 658 of file QualityIO.py.

Initial value:
00001 dict( \
00002     (qs, chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET))) \
00003     for qs in range(-5, 93+1))

Definition at line 541 of file QualityIO.py.

Initial value:
00001 dict((qs, chr(min(126, qs+SOLEXA_SCORE_OFFSET))) \
00002                                      for qs in range(-5, 62+1))

Definition at line 721 of file QualityIO.py.

Definition at line 378 of file QualityIO.py.

Definition at line 379 of file QualityIO.py.