python-biopython  1.60
bgzf.py
#!/usr/bin/env python
# Copyright 2010-2011 by Peter Cock.
# All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.
r"""Read and write BGZF compressed files (the GZIP variant used in BAM).

The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text
format (SAM), and a compressed binary format (BAM). The latter uses a
modified form of gzip compression called BGZF (Blocked GNU Zip Format),
which can be applied to any file format to provide compression with
efficient random access. BGZF is described together with the SAM/BAM
file format at http://samtools.sourceforge.net/SAM1.pdf

Please read the text below about 'virtual offsets' before using BGZF
files for random access.


Aim of this module
------------------

The Python gzip library can be used to read BGZF files, since for
decompression they are just (specialised) gzip files. What this
module aims to facilitate is random access to BGZF files (using the
'virtual offset' idea), and writing BGZF files (which means using
suitably sized gzip blocks and writing the extra 'BC' field in the
gzip headers). As in the gzip library, the zlib library is used
internally.

In addition to being required for random access to and writing of
BAM files, the BGZF format can also be used on other sequential
data (in the sense of one record after another), such as most of
the sequence data formats supported in Bio.SeqIO (like FASTA,
FASTQ, GenBank, etc) or large MAF alignments.

The Bio.SeqIO indexing functions use this module to support BGZF files.


Technical Introduction to BGZF
------------------------------

The gzip file format allows multiple compressed blocks, each of which
could be a stand alone gzip file. As an interesting bonus, this means
you can use Unix "cat" to combine two gzip files into one by
concatenating them. Also, each block can have one of several compression
levels (including uncompressed, which actually takes up a little bit
more space due to the gzip header).

What the BAM designers realised was that while random access to data
stored in traditional gzip files was slow, breaking the file into
gzip blocks would allow fast random access to each block. To access
a particular piece of the decompressed data, you just need to know
which block it starts in (the offset of the gzip block start), and
how far into the (decompressed) contents of the block you need to
read.

One problem with this is finding the gzip block sizes efficiently.
You can do it with a standard gzip file, but it requires every block
to be decompressed -- and that would be rather slow. Additionally,
typical gzip files may use very large blocks.

All that differs in BGZF is that the compressed size of each gzip block
is limited to 2^16 bytes, and an extra 'BC' field in the gzip header
records this size. Traditional decompression tools can ignore this,
and unzip the file just like any other gzip file.

The point of this is you can look at the first BGZF block, find out
how big it is from this 'BC' header, and thus seek immediately to
the second block, and so on.
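As a standalone sketch (not part of this module), the 'BC' subfield lookup
described above can be done with the struct module; here we parse the fixed
28-byte BGZF EOF marker block, the same constant defined below as _bgzf_eof:

```python
import struct

# The standard 28-byte BGZF end-of-file marker block (a fixed constant).
eof_block = (b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC"
             b"\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00")

assert eof_block[:4] == b"\x1f\x8b\x08\x04"  # gzip magic, FEXTRA flag set
xlen = struct.unpack("<H", eof_block[10:12])[0]  # length of the extra field
assert xlen == 6
assert eof_block[12:14] == b"BC"  # BGZF's extra subfield identifier
# The uint16 payload holds the total block size minus one:
block_size = struct.unpack("<H", eof_block[16:18])[0] + 1
assert block_size == 28  # i.e. the EOF marker block is 28 bytes long
```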

The BAM indexing scheme records read positions using a 64 bit
'virtual offset', comprising coffset<<16|uoffset, where coffset is
the file offset of the BGZF block containing the start of the read
(unsigned integer using up to 64-16 = 48 bits), and uoffset is the
offset within the (decompressed) block (unsigned 16 bit integer).

This limits you to BAM files where the last block starts within the
first 2^48 bytes, or 256 petabytes, and the decompressed size of each
block is at most 2^16 bytes, or 64kb. Note that this matches the BGZF
'BC' field size which limits the compressed size of each block to
2^16 bytes, allowing for BAM files to use BGZF with no gzip
compression (useful for intermediate files in memory to reduce
CPU load).

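The packing above is plain bit arithmetic, mirroring the make_virtual_offset
and split_virtual_offset functions defined below (the example values 55074
and 126 are taken from the GenBank example later in this docstring):

```python
# A BGZF virtual offset packs the compressed block start (48 bits)
# with the offset within the decompressed block (16 bits):
coffset, uoffset = 55074, 126
virtual_offset = (coffset << 16) | uoffset
assert virtual_offset == 3609329790

# Splitting simply reverses the packing:
assert virtual_offset >> 16 == 55074
assert virtual_offset & 0xFFFF == 126
```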

Warning about namespaces
------------------------

It is considered a bad idea to use "from XXX import *" in Python, because
it pollutes the namespace. This is a real issue with Bio.bgzf (and the
standard Python library gzip) because they contain a function called open.
For example, suppose you do this:

>>> from Bio.bgzf import *
>>> print open.__module__
Bio.bgzf

Or,

>>> from gzip import *
>>> print open.__module__
gzip

Notice that the open function has been replaced. You can "fix" this if you
need to by importing the built-in open function:

>>> from __builtin__ import open

However, what we recommend instead is to use the explicit namespace, e.g.

>>> from Bio import bgzf
>>> print bgzf.open.__module__
Bio.bgzf


Example
-------

This is an ordinary GenBank file compressed using BGZF, so it can
be decompressed using gzip,

>>> import gzip
>>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r")
>>> assert 0 == handle.tell()
>>> line = handle.readline()
>>> assert 80 == handle.tell()
>>> line = handle.readline()
>>> assert 143 == handle.tell()
>>> data = handle.read(70000)
>>> assert 70143 == handle.tell()
>>> handle.close()

We can also access the file using the BGZF reader - but pay
attention to the file offsets which will be explained below:

>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r")
>>> assert 0 == handle.tell()
>>> print handle.readline().rstrip()
LOCUS       NC_000932             154478 bp    DNA     circular PLN 15-APR-2009
>>> assert 80 == handle.tell()
>>> print handle.readline().rstrip()
DEFINITION  Arabidopsis thaliana chloroplast, complete genome.
>>> assert 143 == handle.tell()
>>> data = handle.read(70000)
>>> assert 987828735 == handle.tell()
>>> print handle.readline().rstrip()
f="GeneID:844718"
>>> print handle.readline().rstrip()
     CDS             complement(join(84337..84771,85454..85843))
>>> offset = handle.seek(make_virtual_offset(55074, 126))
>>> print handle.readline().rstrip()
    68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat
>>> handle.close()

Notice the handle's offsets look different for a BGZF file. This
brings us to the key point about BGZF, which is the block structure:

>>> handle = open("GenBank/NC_000932.gb.bgz", "rb")
>>> for values in BgzfBlocks(handle):
...     print "Raw start %i, raw length %i; data start %i, data length %i" % values
Raw start 0, raw length 15073; data start 0, data length 65536
Raw start 15073, raw length 17857; data start 65536, data length 65536
Raw start 32930, raw length 22144; data start 131072, data length 65536
Raw start 55074, raw length 22230; data start 196608, data length 65536
Raw start 77304, raw length 14939; data start 262144, data length 43478
Raw start 92243, raw length 28; data start 305622, data length 0
>>> handle.close()

By reading ahead 70,000 bytes we moved into the second BGZF block,
and at that point the BGZF virtual offsets start to look different
from a simple offset into the decompressed data as exposed by the
gzip library.

Using a seek in the decompressed co-ordinates, 65536*3 + 126 is
equivalent to jumping over the first three blocks (each of size
65536 after decompression) and starting at byte 126 of the fourth
block (after decompression). For BGZF, we need to know that block's
raw offset of 55074 and the offset within the block of 126 to get
the BGZF virtual offset.

The catch with BGZF virtual offsets is that while they can be compared
(which offset comes first in the file), you cannot safely subtract
them to get the size of the data between them, nor add/subtract
a relative offset.

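A small standalone sketch of why such arithmetic is unsafe, using the block
sizes from the listing above: two virtual offsets for adjacent bytes of
decompressed data can differ wildly when a block boundary lies between them:

```python
# Last byte of the first block (raw start 0, 65536 bytes decompressed)
# and first byte of the second block (raw start 15073) are one byte
# apart in the decompressed data:
end_of_block_one = (0 << 16) | 65535
start_of_block_two = (15073 << 16) | 0

# Comparison works (the first offset really does come first):
assert end_of_block_one < start_of_block_two
# ...but subtraction does not give the one byte separating them:
assert start_of_block_two - end_of_block_one != 1
```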
Of course you can parse this file with Bio.SeqIO using BgzfReader,
although there isn't any benefit over using gzip.open(...), unless
you want to index BGZF compressed sequence files:

>>> from Bio import SeqIO
>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> print record.id
NC_000932.1

"""
#TODO - Move somewhere else in Bio.* namespace?

import gzip
import zlib
import struct
import __builtin__ #to access the usual open function

from Bio._py3k import _as_bytes, _as_string

#For Python 2 can just use: _bgzf_magic = '\x1f\x8b\x08\x04'
#but need to use bytes on Python 3
_bgzf_magic = _as_bytes("\x1f\x8b\x08\x04")
_bgzf_header = _as_bytes("\x1f\x8b\x08\x04\x00\x00\x00\x00"
                         "\x00\xff\x06\x00\x42\x43\x02\x00")
_bgzf_eof = _as_bytes("\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC" + \
                      "\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00")
_bytes_BC = _as_bytes("BC")
_empty_bytes_string = _as_bytes("")
_bytes_newline = _as_bytes("\n")


00219 
00220 def open(filename, mode="rb"):
00221     """Open a BGZF file for reading, writing or appending."""
00222     if "r" in mode.lower():
00223         return BgzfReader(filename, mode)
00224     elif "w" in mode.lower() or "a" in mode.lower():
00225         return BgzfWriter(filename, mode)
00226     else:
00227         raise ValueError("Bad mode %r" % mode)
00228 
def make_virtual_offset(block_start_offset, within_block_offset):
    """Compute a BGZF virtual offset from block start and within block offsets.

    The BAM indexing scheme records read positions using a 64 bit
    'virtual offset', comprising in C terms:

    block_start_offset<<16 | within_block_offset

    Here block_start_offset is the file offset of the BGZF block
    start (unsigned integer using up to 64-16 = 48 bits), and
    within_block_offset is the offset within the (decompressed)
    block (unsigned 16 bit integer).

    >>> make_virtual_offset(0,0)
    0
    >>> make_virtual_offset(0,1)
    1
    >>> make_virtual_offset(0, 2**16 - 1)
    65535
    >>> make_virtual_offset(0, 2**16)
    Traceback (most recent call last):
    ...
    ValueError: Require 0 <= within_block_offset < 2**16, got 65536

    >>> 65536 == make_virtual_offset(1,0)
    True
    >>> 65537 == make_virtual_offset(1,1)
    True
    >>> 131071 == make_virtual_offset(1, 2**16 - 1)
    True

    >>> 6553600000 == make_virtual_offset(100000,0)
    True
    >>> 6553600001 == make_virtual_offset(100000,1)
    True
    >>> 6553600010 == make_virtual_offset(100000,10)
    True

    >>> make_virtual_offset(2**48,0)
    Traceback (most recent call last):
    ...
    ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656

    """
    if within_block_offset < 0 or within_block_offset >= 65536:
        raise ValueError("Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset)
    if block_start_offset < 0 or block_start_offset >= 281474976710656:
        raise ValueError("Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset)
    return (block_start_offset<<16) | within_block_offset

def split_virtual_offset(virtual_offset):
    """Divides a 64-bit BGZF virtual offset into block start & within block offsets.

    >>> (100000, 0) == split_virtual_offset(6553600000)
    True
    >>> (100000, 10) == split_virtual_offset(6553600010)
    True

    """
    start = virtual_offset>>16
    return start, virtual_offset ^ (start<<16)

def BgzfBlocks(handle):
    """Low level debugging function to inspect BGZF blocks.

    Yields the block start offset (see virtual offsets), the block
    length (add these for the start of the next block), the start
    offset of the decompressed data, and the decompressed length of
    the block's contents (limited to 65536 in BGZF).

    >>> from __builtin__ import open
    >>> handle = open("SamBam/ex1.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print "Raw start %i, raw length %i; data start %i, data length %i" % values
    Raw start 0, raw length 18239; data start 0, data length 65536
    Raw start 18239, raw length 18223; data start 65536, data length 65536
    Raw start 36462, raw length 18017; data start 131072, data length 65536
    Raw start 54479, raw length 17342; data start 196608, data length 65536
    Raw start 71821, raw length 17715; data start 262144, data length 65536
    Raw start 89536, raw length 17728; data start 327680, data length 65536
    Raw start 107264, raw length 17292; data start 393216, data length 63398
    Raw start 124556, raw length 28; data start 456614, data length 0
    >>> handle.close()

    Indirectly we can tell this file came from an old version of
    samtools because all the blocks (except the final one and the
    dummy empty EOF marker block) are 65536 bytes.  Later versions
    avoid splitting a read between two blocks, and give the header
    its own block (useful to speed up replacing the header). You
    can see this in ex1_refresh.bam created using samtools 0.1.18:

    samtools view -b ex1.bam > ex1_refresh.bam

    >>> handle = open("SamBam/ex1_refresh.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print "Raw start %i, raw length %i; data start %i, data length %i" % values
    Raw start 0, raw length 53; data start 0, data length 38
    Raw start 53, raw length 18195; data start 38, data length 65434
    Raw start 18248, raw length 18190; data start 65472, data length 65409
    Raw start 36438, raw length 18004; data start 130881, data length 65483
    Raw start 54442, raw length 17353; data start 196364, data length 65519
    Raw start 71795, raw length 17708; data start 261883, data length 65411
    Raw start 89503, raw length 17709; data start 327294, data length 65466
    Raw start 107212, raw length 17390; data start 392760, data length 63854
    Raw start 124602, raw length 28; data start 456614, data length 0
    >>> handle.close()

    The above example has no embedded SAM header (thus the first block
    is very small at just 38 bytes of decompressed data), while the next
    example does (a larger block of 103 bytes). Notice that the rest of
    the blocks show the same sizes (they contain the same read data):

    >>> handle = open("SamBam/ex1_header.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print "Raw start %i, raw length %i; data start %i, data length %i" % values
    Raw start 0, raw length 104; data start 0, data length 103
    Raw start 104, raw length 18195; data start 103, data length 65434
    Raw start 18299, raw length 18190; data start 65537, data length 65409
    Raw start 36489, raw length 18004; data start 130946, data length 65483
    Raw start 54493, raw length 17353; data start 196429, data length 65519
    Raw start 71846, raw length 17708; data start 261948, data length 65411
    Raw start 89554, raw length 17709; data start 327359, data length 65466
    Raw start 107263, raw length 17390; data start 392825, data length 63854
    Raw start 124653, raw length 28; data start 456679, data length 0
    >>> handle.close()

    """
    data_start = 0
    while True:
        start_offset = handle.tell()
        #This may raise StopIteration which is perfect here
        block_length, data = _load_bgzf_block(handle)
        data_len = len(data)
        yield start_offset, block_length, data_start, data_len
        data_start += data_len



def _load_bgzf_block(handle, text_mode=False):
    """Internal function to load the next BGZF block (PRIVATE)."""
    magic = handle.read(4)
    if not magic:
        #End of file
        raise StopIteration
    if magic != _bgzf_magic:
        raise ValueError(r"A BGZF (e.g. a BAM file) block should start with "
                         r"%r, not %r; handle.tell() now says %r"
                         % (_bgzf_magic, magic, handle.tell()))
    gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = \
        struct.unpack("<LBBH", handle.read(8))

    block_size = None
    x_len = 0
    while x_len < extra_len:
        subfield_id = handle.read(2)
        subfield_len = struct.unpack("<H", handle.read(2))[0] #uint16_t
        subfield_data = handle.read(subfield_len)
        x_len += subfield_len + 4
        if subfield_id == _bytes_BC:
            assert subfield_len == 2, "Wrong BC payload length"
            assert block_size is None, "Two BC subfields?"
            block_size = struct.unpack("<H", subfield_data)[0]+1 #uint16_t
    assert x_len == extra_len, (x_len, extra_len)
    assert block_size is not None, "Missing BC, this isn't a BGZF file!"
    #Now comes the compressed data, CRC, and length of uncompressed data.
    deflate_size = block_size - 1 - extra_len - 19
    d = zlib.decompressobj(-15) #Negative window size means no headers
    data = d.decompress(handle.read(deflate_size)) + d.flush()
    expected_crc = handle.read(4)
    expected_size = struct.unpack("<I", handle.read(4))[0]
    assert expected_size == len(data), \
           "Decompressed to %i, not %i" % (len(data), expected_size)
    #Should cope with a mix of Python platforms...
    crc = zlib.crc32(data)
    if crc < 0:
        crc = struct.pack("<i", crc)
    else:
        crc = struct.pack("<I", crc)
    assert expected_crc == crc, \
           "CRC is %s, not %s" % (crc, expected_crc)
    if text_mode:
        return block_size, _as_string(data)
    else:
        return block_size, data

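The raw deflate handling above (a decompressobj with a negative window size,
meaning no gzip/zlib headers) can be sketched in isolation; this standalone
round trip mirrors how _load_bgzf_block and BgzfWriter below treat each
block's payload:

```python
import zlib

payload = b"BGZF stores each block as a raw deflate stream" * 10

# Compress without gzip/zlib headers (wbits=-15, as samtools does):
c = zlib.compressobj(6, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0)
compressed = c.compress(payload) + c.flush()

# Decompress with a matching negative window size:
d = zlib.decompressobj(-15)
assert d.decompress(compressed) + d.flush() == payload
```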

class BgzfReader(object):
    r"""BGZF reader, acts like a read only handle but seek/tell differ.

    Let's use the BgzfBlocks function to have a peek at the BGZF blocks
    in an example BAM file,

    >>> from __builtin__ import open
    >>> handle = open("SamBam/ex1.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print "Raw start %i, raw length %i; data start %i, data length %i" % values
    Raw start 0, raw length 18239; data start 0, data length 65536
    Raw start 18239, raw length 18223; data start 65536, data length 65536
    Raw start 36462, raw length 18017; data start 131072, data length 65536
    Raw start 54479, raw length 17342; data start 196608, data length 65536
    Raw start 71821, raw length 17715; data start 262144, data length 65536
    Raw start 89536, raw length 17728; data start 327680, data length 65536
    Raw start 107264, raw length 17292; data start 393216, data length 63398
    Raw start 124556, raw length 28; data start 456614, data length 0
    >>> handle.close()

    Now let's see how to use this block information to jump to
    specific parts of the decompressed BAM file:

    >>> handle = BgzfReader("SamBam/ex1.bam", "rb")
    >>> assert 0 == handle.tell()
    >>> magic = handle.read(4)
    >>> assert 4 == handle.tell()

    So far nothing so strange, we got the magic marker used at the
    start of a decompressed BAM file, and the handle position makes
    sense. Now however, let's jump to the end of this block and 4
    bytes into the next block by reading 65536 bytes,

    >>> data = handle.read(65536)
    >>> len(data)
    65536
    >>> assert 1195311108 == handle.tell()

    Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit
    virtual offset, which means:

    >>> split_virtual_offset(1195311108)
    (18239, 4)

    You should spot 18239 as the start of the second BGZF block, while
    the 4 is the offset into this block. See also make_virtual_offset,

    >>> make_virtual_offset(18239, 4)
    1195311108

    Let's jump back to almost the start of the file,

    >>> make_virtual_offset(0, 2)
    2
    >>> handle.seek(2)
    2
    >>> handle.close()

    Note that you can use the max_cache argument to limit the number of
    BGZF blocks cached in memory. The default is 100, and since each
    block can be up to 64kb, the default cache could take up to about
    6MB of RAM. The cache is not important for reading through the file
    in one pass, but is important for improving performance of random
    access.
    """

    def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100):
        #TODO - Assuming we can seek, check for 28 bytes EOF empty block
        #and if missing warn about possible truncation (as in samtools)?
        if max_cache < 1:
            raise ValueError("Use max_cache with a minimum of 1")
        #Must open the BGZF file in binary mode, but we may want to
        #treat the contents as either text or binary (unicode or
        #bytes under Python 3)
        if fileobj:
            assert filename is None
            handle = fileobj
            assert "b" in handle.mode.lower()
        else:
            if "w" in mode.lower() \
            or "a" in mode.lower():
                raise ValueError("Must use read mode (default), not write or append mode")
            handle = __builtin__.open(filename, "rb")
        self._text = "b" not in mode.lower()
        if self._text:
            self._newline = "\n"
        else:
            self._newline = _bytes_newline
        self._handle = handle
        self.max_cache = max_cache
        self._buffers = {}
        self._block_start_offset = None
        self._block_raw_length = None
        self._load_block(handle.tell())

    def _load_block(self, start_offset=None):
        if start_offset is None:
            #If the file is being read sequentially, then _handle.tell()
            #should be pointing at the start of the next block.
            #However, if seek has been used, we can't assume that.
            start_offset = self._block_start_offset + self._block_raw_length
        if start_offset == self._block_start_offset:
            self._within_block_offset = 0
            return
        elif start_offset in self._buffers:
            #Already in cache
            self._buffer, self._block_raw_length = self._buffers[start_offset]
            self._within_block_offset = 0
            self._block_start_offset = start_offset
            return
        #Must hit the disk... first check cache limits,
        while len(self._buffers) >= self.max_cache:
            #TODO - Implement LRU cache removal?
            self._buffers.popitem()
        #Now load the block
        handle = self._handle
        if start_offset is not None:
            handle.seek(start_offset)
        self._block_start_offset = handle.tell()
        try:
            block_size, self._buffer = _load_bgzf_block(handle, self._text)
        except StopIteration:
            #EOF
            block_size = 0
            if self._text:
                self._buffer = ""
            else:
                self._buffer = _empty_bytes_string
        self._within_block_offset = 0
        self._block_raw_length = block_size
        #Finally save the block in our cache,
        self._buffers[self._block_start_offset] = self._buffer, block_size

    def tell(self):
        """Returns a 64-bit unsigned BGZF virtual offset."""
        if 0 < self._within_block_offset == len(self._buffer):
            #Special case where we're right at the end of a (non empty) block.
            #For non-maximal blocks could give two possible virtual offsets,
            #but for a maximal block can't use 65536 as the within block
            #offset. Therefore for consistency, use the next block and a
            #within block offset of zero.
            return (self._block_start_offset + self._block_raw_length) << 16
        else:
            #return make_virtual_offset(self._block_start_offset,
            #                           self._within_block_offset)
            #TODO - Include bounds checking as in make_virtual_offset?
            return (self._block_start_offset<<16) | self._within_block_offset

    def seek(self, virtual_offset):
        """Seek to a 64-bit unsigned BGZF virtual offset."""
        #Do this inline to avoid a function call,
        #start_offset, within_block = split_virtual_offset(virtual_offset)
        start_offset = virtual_offset>>16
        within_block = virtual_offset ^ (start_offset<<16)
        if start_offset != self._block_start_offset:
            #Don't need to load the block if already there
            #(this avoids a function call since _load_block would do nothing)
            self._load_block(start_offset)
            assert start_offset == self._block_start_offset
        if within_block >= len(self._buffer) \
        and not (within_block == 0 and len(self._buffer)==0):
            raise ValueError("Within offset %i but block size only %i" \
                             % (within_block, len(self._buffer)))
        self._within_block_offset = within_block
        #assert virtual_offset == self.tell(), \
        #    "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \
        #    % (virtual_offset, start_offset, within_block,
        #       self.tell(), self._block_start_offset, self._within_block_offset)
        return virtual_offset

    def read(self, size=-1):
        if size < 0:
            raise NotImplementedError("Don't be greedy, that could be massive!")
        elif size == 0:
            if self._text:
                return ""
            else:
                return _empty_bytes_string
        elif self._within_block_offset + size <= len(self._buffer):
            #This may leave us right at the end of a block
            #(lazy loading, don't load the next block unless we have to)
            data = self._buffer[self._within_block_offset:self._within_block_offset + size]
            self._within_block_offset += size
            assert data #Must be at least 1 byte
            return data
        else:
            data = self._buffer[self._within_block_offset:]
            size -= len(data)
            self._load_block() #will reset offsets
            #TODO - Test with corner case of an empty block followed by
            #a non-empty block
            if not self._buffer:
                return data #EOF
            elif size:
                #TODO - Avoid recursion
                return data + self.read(size)
            else:
                #Only needed the end of the last block
                return data

    def readline(self):
        i = self._buffer.find(self._newline, self._within_block_offset)
        #Three cases to consider,
        if i==-1:
            #No newline, need to read in more data
            data = self._buffer[self._within_block_offset:]
            self._load_block() #will reset offsets
            if not self._buffer:
                return data #EOF
            else:
                #TODO - Avoid recursion
                return data + self.readline()
        elif i + 1 == len(self._buffer):
            #Found newline, but right at end of block (SPECIAL)
            data = self._buffer[self._within_block_offset:]
            #Must now load the next block to ensure tell() works
            self._load_block() #will reset offsets
            assert data
            return data
        else:
            #Found newline, not at end of block (easy case, no IO)
            data = self._buffer[self._within_block_offset:i+1]
            self._within_block_offset = i + 1
            #assert data.endswith(self._newline)
            return data

    def next(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    def __iter__(self):
        return self

    def close(self):
        self._handle.close()
        self._buffer = None
        self._block_start_offset = None


class BgzfWriter(object):

    def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6):
        if fileobj:
            assert filename is None
            handle = fileobj
        else:
            if "w" not in mode.lower() \
            and "a" not in mode.lower():
                raise ValueError("Must use write or append mode, not %r" % mode)
            if "a" in mode.lower():
                handle = __builtin__.open(filename, "ab")
            else:
                handle = __builtin__.open(filename, "wb")
        self._text = "b" not in mode.lower()
        self._handle = handle
        self._buffer = _empty_bytes_string
        self.compresslevel = compresslevel

00673     def _write_block(self, block):
00674         #print "Saving %i bytes" % len(block)
00675         start_offset = self._handle.tell()
00676         assert len(block) <= 65536
00677         #Giving a negative window bits means no gzip/zlib headers, -15 used in samtools
00678         c = zlib.compressobj(self.compresslevel,
00679                              zlib.DEFLATED,
00680                              -15,
00681                              zlib.DEF_MEM_LEVEL,
00682                              0)
00683         compressed = c.compress(block) + c.flush()
00684         del c
00685         assert len(compressed) < 65536, "TODO - Didn't compress enough, try less data in this block"
00686         #Mask the CRC so it packs as an unsigned value on all Python platforms
00687         crc = struct.pack("<I", zlib.crc32(block) & 0xffffffffL)
00688         bsize = struct.pack("<H", len(compressed)+25) #BSIZE is total block size minus one
00694         uncompressed_length = struct.pack("<I", len(block))
00695         #Fixed 16 byte header:
00696         # gzip magic (2), compression method (1), flag (1), mod time (4),
00697         # extra flags (1), os (1), extra field length of six (2),
00698         # sub field id BC (2), sub field data length of two (2)
00699         #Then the variable part:
00700         # 2 bytes: total block size minus one (the BC sub field data)
00701         # X bytes: the compressed data
00702         # 8 bytes: crc (4), uncompressed data length (4)
00703         data = _bgzf_header + bsize + compressed + crc + uncompressed_length
00704         self._handle.write(data)
00705 
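The header layout spelled out in the comments above can be exercised standalone. The sketch below packs one BGZF block the same way `_write_block` does; the `_bgzf_header` byte string here is an assumption based on the BGZF description in SAM1.pdf (gzip magic, FEXTRA flag, zeroed mtime, and a six byte extra field carrying the 'BC' sub field), so check it against the module's own constant before relying on it.

```python
import struct
import zlib

# Assumed fixed 16 byte BGZF header (per the SAM1.pdf BGZF description):
# gzip magic + deflate + FEXTRA flag, zero mtime, XFL, OS=unknown,
# extra field length 6, sub field id 'BC', sub field data length 2.
_bgzf_header = (b"\x1f\x8b\x08\x04\x00\x00\x00\x00"
                b"\x00\xff\x06\x00\x42\x43\x02\x00")

def pack_block(block, compresslevel=6):
    """Pack up to 64 KiB of data as a single BGZF block (sketch)."""
    assert len(block) <= 65536
    # Negative window bits -> raw deflate, no gzip/zlib wrapper
    c = zlib.compressobj(compresslevel, zlib.DEFLATED, -15,
                         zlib.DEF_MEM_LEVEL, 0)
    compressed = c.compress(block) + c.flush()
    bsize = struct.pack("<H", len(compressed) + 25)  # total block size minus one
    crc = struct.pack("<I", zlib.crc32(block) & 0xffffffff)
    return (_bgzf_header + bsize + compressed
            + crc + struct.pack("<I", len(block)))
```

Because each block is a complete gzip member, `zlib.decompress(blob, 31)` (or plain `gunzip`) recovers the original bytes.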
00706     def write(self, data):
00707         #TODO - Check bytes vs unicode
00708         data = _as_bytes(data)
00709         #block_size = 2**16 = 65536
00710         data_len = len(data)
00711         if len(self._buffer) + data_len < 65536:
00712             #print "Cached %r" % data
00713             self._buffer += data
00714             return
00715         else:
00716             #print "Got %r, writing out some data..." % data
00717             self._buffer += data
00718             while len(self._buffer) >= 65536:
00719                 self._write_block(self._buffer[:65536])
00720                 self._buffer = self._buffer[65536:]
00721 
00722     def flush(self):
00723         while len(self._buffer) >= 65536:
00724             self._write_block(self._buffer[:65536])
00725             self._buffer = self._buffer[65536:]
00726         self._write_block(self._buffer)
00727         self._buffer = _empty_bytes_string
00728         self._handle.flush()
00729 
00730     def close(self):
00731         """Flush data, write 28 bytes empty BGZF EOF marker, and close the BGZF file."""
00732         if self._buffer:
00733             self.flush()
00734         #samtools looks for a magic EOF marker, just a 28 byte empty BGZF block,
00735         #and if it is missing warns that the BAM file may be truncated. Both
00736         #samtools and bgzip write this block when closing a file, so we should too.
00737         self._handle.write(_bgzf_eof)
00738         self._handle.flush()
00739         self._handle.close()
00740 
00741     def tell(self):
00742         """Returns a BGZF 64-bit virtual offset."""
00743         return make_virtual_offset(self._handle.tell(), len(self._buffer))
00744 
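The tell() method above returns a virtual offset: the start offset of the current BGZF block within the compressed file in the top 48 bits, and the offset within the decompressed block in the low 16 bits. A sketch of the packing arithmetic, intended to be consistent with the `make_virtual_offset` helper defined earlier in this file (the real helper may perform additional range checks):

```python
def make_virtual_offset(block_start_offset, within_block_offset):
    """Pack a (block start, within block) pair into a 64-bit virtual offset."""
    if not (0 <= within_block_offset < 65536):
        raise ValueError("Require 0 <= within_block_offset < 2**16")
    # top 48 bits: start of the BGZF block in the compressed file;
    # low 16 bits: position within the decompressed block (< 64 KiB)
    return (block_start_offset << 16) | within_block_offset

def split_virtual_offset(virtual_offset):
    """Invert make_virtual_offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

Note that virtual offsets are not byte counts into the decompressed stream: they only compare correctly between positions in the same file.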
00745 if __name__ == "__main__":
00746     import sys
00747     if len(sys.argv) > 1:
00748         print "Call this with no arguments and pipe uncompressed data in on stdin"
00749         print "and it will produce BGZF compressed data on stdout. e.g."
00750         print
00751         print "./bgzf.py < example.fastq > example.fastq.bgz"
00752         print
00753         print "The extension convention of *.bgz is to distinguish these from *.gz"
00754         print "used for standard gzipped files without the block structure of BGZF."
00755         print "You can use the standard gunzip command to decompress BGZF files,"
00756         print "if it complains about the extension try something like this:"
00757         print
00758         print "cat example.fastq.bgz | gunzip > example.fastq"
00759         print
00760         print "See also the tool bgzip that comes with samtools"
00761         sys.exit(0)
00762 
00763     sys.stderr.write("Producing BGZF output from stdin...\n")
00764     w = BgzfWriter(fileobj=sys.stdout)
00765     while True:
00766         data = sys.stdin.read(65536)
00767         w.write(data)
00768         if not data:
00769             break
00770     #Calling close will write an empty BGZF block as the EOF marker:
00771     w.close()
00772     sys.stderr.write("BGZF data produced\n")