
python-biopython  1.60
Bio.SeqIO._index._SQLiteManySeqFilesDict Class Reference


Public Member Functions

def __init__
def __repr__
def __contains__
def __len__
def __iter__
def keys
def __getitem__
def get
def get_raw
def close
def __str__
def values
def values
def items
def items
def itervalues
def iteritems
def iterkeys
def __setitem__
def update
def pop
def popitem
def clear
def fromkeys
def copy

Private Attributes

 _con
 _length
 _filenames
 _format
 _proxies
 _max_open
 _index_filename
 _alphabet
 _key_function

Detailed Description

Read-only dictionary interface to many sequential sequence files.

Keeps the keys, file numbers and offsets in an SQLite database. To access
a record by key, reads from the offset in the appropriate file using
Bio.SeqIO for parsing.

There are OS limits on the number of files that can be open at once,
so a pool of open handles is kept. If a record is required from a closed
file, then one of the open handles is closed first.
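
The layout described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the class's actual code: `build_index` and `fetch_raw` are hypothetical helper names, the offset scan handles only simple FASTA text, and a file is opened for each lookup rather than taken from a pool. The table and column names mirror the `file_data` and `offset_data` tables created in `__init__`.

```python
import os
import sqlite3
import tempfile


def build_index(con, filenames):
    """Record (key, file_number, offset, length) for every FASTA record."""
    con.execute("CREATE TABLE file_data (file_number INTEGER, name TEXT);")
    con.execute("CREATE TABLE offset_data (key TEXT, file_number INTEGER, "
                "offset INTEGER, length INTEGER);")
    for file_number, name in enumerate(filenames):
        con.execute("INSERT INTO file_data VALUES (?,?);", (file_number, name))
        with open(name, "rb") as handle:
            data = handle.read()
        # Hypothetical FASTA-only scan: a record starts at each '>' that
        # begins a line and runs to the next one (or to the end of the file).
        starts = [i for i in range(len(data))
                  if data[i:i + 1] == b">" and (i == 0 or data[i - 1:i] == b"\n")]
        for start, end in zip(starts, starts[1:] + [len(data)]):
            key = data[start + 1:end].split(None, 1)[0].decode("ascii")
            con.execute("INSERT INTO offset_data VALUES (?,?,?,?);",
                        (key, file_number, start, end - start))
    con.commit()


def fetch_raw(con, key):
    """Look the key up in SQLite, then seek to the record in its file."""
    row = con.execute("SELECT name, offset, length FROM offset_data "
                      "JOIN file_data USING (file_number) WHERE key=?;",
                      (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    name, offset, length = row
    with open(name, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


# Two small demo files stand in for the "many sequence files".
tmp = tempfile.mkdtemp()
paths = []
for i, text in enumerate([b">a1\nACGT\n>a2\nGG\n", b">b1\nTTTT\n"]):
    path = os.path.join(tmp, "demo%i.fasta" % i)
    with open(path, "wb") as handle:
        handle.write(text)
    paths.append(path)

con = sqlite3.connect(":memory:")
build_index(con, paths)
raw = fetch_raw(con, "b1")
```

Only the keys and offsets live in the database; the sequence data stays in the original files and is read on demand.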

Definition at line 253 of file _index.py.


Constructor & Destructor Documentation

def Bio.SeqIO._index._SQLiteManySeqFilesDict.__init__ (   self,
  index_filename,
  filenames,
  format,
  alphabet,
  key_function,
  max_open = 10 
)

Definition at line 265 of file _index.py.

00265     def __init__(self, index_filename, filenames, format, alphabet,
00266                  key_function, max_open=10):
00267         random_access_proxies = {}
00268         #TODO? - Don't keep filename list in memory (just in DB)?
00269         #Should save a chunk of memory if dealing with 1000s of files.
00270         #Furthermore could compare a generator to the DB on reloading
00271         #(no need to turn it into a list)
00272         if not _sqlite:
00273             #Hack for Python 2.4 (or if Python is compiled without it)
00274             from Bio import MissingPythonDependencyError
00275             raise MissingPythonDependencyError("Requires sqlite3, which is "
00276                                                "included Python 2.5+")
00277         if filenames is not None:
00278             filenames = list(filenames) #In case it was a generator
00279         if os.path.isfile(index_filename):
00280             #Reuse the index.
00281             con = _sqlite.connect(index_filename)
00282             self._con = con
00283             #Check the count...
00284             try:
00285                 count, = con.execute("SELECT value FROM meta_data WHERE key=?;",
00286                                      ("count",)).fetchone()
00287                 self._length = int(count)
00288                 if self._length == -1:
00289                     con.close()
00290                     raise ValueError("Unfinished/partial database")
00291                 count, = con.execute("SELECT COUNT(key) FROM offset_data;").fetchone()
00292                 if self._length <> int(count):
00293                     con.close()
00294                     raise ValueError("Corrupt database? %i entries not %i" \
00295                                      % (int(count), self._length))
00296                 self._format, = con.execute("SELECT value FROM meta_data WHERE key=?;",
00297                                            ("format",)).fetchone()
00298                 if format and format != self._format:
00299                     con.close()
00300                     raise ValueError("Index file says format %s, not %s" \
00301                                      % (self._format, format))
00302                 self._filenames = [row[0] for row in \
00303                                   con.execute("SELECT name FROM file_data "
00304                                               "ORDER BY file_number;").fetchall()]
00305                 if filenames and len(filenames) != len(self._filenames):
00306                     con.close()
00307                     raise ValueError("Index file says %i files, not %i" \
00308                                      % (len(self._filenames), len(filenames)))
00309                 if filenames and filenames != self._filenames:
00310                     con.close()
00311                     raise ValueError("Index file has different filenames")
00312             except _OperationalError, err:
00313                 con.close()
00314                 raise ValueError("Not a Biopython index database? %s" % err)
00315             #Now we have the format (from the DB if not given to us),
00316             try:
00317                 proxy_class = _FormatToRandomAccess[self._format]
00318             except KeyError:
00319                 con.close()
00320                 raise ValueError("Unsupported format '%s'" % self._format)
00321         else:
00322             self._filenames = filenames
00323             self._format = format
00324             if not format or not filenames:
00325                 raise ValueError("Filenames to index and format required")
00326             try:
00327                 proxy_class = _FormatToRandomAccess[format]
00328             except KeyError:
00329                 raise ValueError("Unsupported format '%s'" % format)
00330             #Create the index
00331             con = _sqlite.connect(index_filename)
00332             self._con = con
00333             #print "Creating index"
00334             # Sqlite PRAGMA settings for speed
00335             con.execute("PRAGMA synchronous='OFF'")
00336             con.execute("PRAGMA locking_mode=EXCLUSIVE")
00337             #Don't index the key column until the end (faster)
00338             #con.execute("CREATE TABLE offset_data (key TEXT PRIMARY KEY, "
00339             # "offset INTEGER);")
00340             con.execute("CREATE TABLE meta_data (key TEXT, value TEXT);")
00341             con.execute("INSERT INTO meta_data (key, value) VALUES (?,?);",
00342                         ("count", -1))
00343             con.execute("INSERT INTO meta_data (key, value) VALUES (?,?);",
00344                         ("format", format))
00345             #TODO - Record the alphabet?
00346             #TODO - Record the file size and modified date?
00347             con.execute("CREATE TABLE file_data (file_number INTEGER, name TEXT);")
00348             con.execute("CREATE TABLE offset_data (key TEXT, file_number INTEGER, offset INTEGER, length INTEGER);")
00349             count = 0
00350             for i, filename in enumerate(filenames):
00351                 con.execute("INSERT INTO file_data (file_number, name) VALUES (?,?);",
00352                             (i, filename))
00353                 random_access_proxy = proxy_class(filename, format, alphabet)
00354                 if key_function:
00355                     offset_iter = ((key_function(k),i,o,l) for (k,o,l) in random_access_proxy)
00356                 else:
00357                     offset_iter = ((k,i,o,l) for (k,o,l) in random_access_proxy)
00358                 while True:
00359                     batch = list(itertools.islice(offset_iter, 100))
00360                     if not batch: break
00361                     #print "Inserting batch of %i offsets, %s ... %s" \
00362                     # % (len(batch), batch[0][0], batch[-1][0])
00363                     con.executemany("INSERT INTO offset_data (key,file_number,offset,length) VALUES (?,?,?,?);",
00364                                     batch)
00365                     con.commit()
00366                     count += len(batch)
00367                 if len(random_access_proxies) < max_open:
00368                     random_access_proxies[i] = random_access_proxy
00369                 else:
00370                     random_access_proxy._handle.close()
00371             self._length = count
00372             #print "About to index %i entries" % count
00373             try:
00374                 con.execute("CREATE UNIQUE INDEX IF NOT EXISTS "
00375                             "key_index ON offset_data(key);")
00376             except _IntegrityError, err:
00377                 self._proxies = random_access_proxies
00378                 self.close()
00379                 con.close()
00380                 raise ValueError("Duplicate key? %s" % err)
00381             con.execute("PRAGMA locking_mode=NORMAL")
00382             con.execute("UPDATE meta_data SET value = ? WHERE key = ?;",
00383                         (count, "count"))
00384             con.commit()
00385             #print "Index created"
00386         self._proxies = random_access_proxies
00387         self._max_open = max_open
00388         self._index_filename = index_filename
00389         self._alphabet = alphabet
00390         self._key_function = key_function
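
The index-building branch of the constructor can be condensed into a Python 3 sketch. It is an illustration under assumptions rather than the library's code: `load_offsets` is a hypothetical name and the table is reduced to two columns. It keeps the three ideas visible in the listing: a `count` of -1 marks an unfinished database, rows are inserted in batches with `itertools.islice` and `executemany`, and the unique index is created only after the bulk load, which is also where duplicate keys surface.

```python
import itertools
import sqlite3


def load_offsets(con, rows, batch_size=100):
    """Insert (key, offset) rows in batches, then build the unique index."""
    con.execute("CREATE TABLE meta_data (key TEXT, value TEXT);")
    # -1 marks the database as unfinished until loading completes.
    con.execute("INSERT INTO meta_data (key, value) VALUES (?,?);",
                ("count", -1))
    con.execute("CREATE TABLE offset_data (key TEXT, offset INTEGER);")
    rows = iter(rows)
    count = 0
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            break
        con.executemany("INSERT INTO offset_data (key, offset) VALUES (?,?);",
                        batch)
        con.commit()
        count += len(batch)
    try:
        # Indexing after the bulk load is faster, and it is also the point
        # at which duplicate keys are detected.
        con.execute("CREATE UNIQUE INDEX key_index ON offset_data(key);")
    except sqlite3.IntegrityError as err:
        raise ValueError("Duplicate key? %s" % err)
    con.execute("UPDATE meta_data SET value = ? WHERE key = ?;",
                (count, "count"))
    con.commit()
    return count


good = sqlite3.connect(":memory:")
n_loaded = load_offsets(good, (("id%i" % i, i * 10) for i in range(250)))

bad = sqlite3.connect(":memory:")
try:
    load_offsets(bad, [("dup", 0), ("dup", 5)])
    duplicate_error = None
except ValueError as err:
    duplicate_error = str(err)
```

Because the real count is written only at the very end, a load that fails or is interrupted leaves -1 in `meta_data`, which the reuse branch reports as "Unfinished/partial database".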
    

Member Function Documentation

def Bio.SeqIO._index._SQLiteManySeqFilesDict.__contains__ (   self,
  key 
)

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 396 of file _index.py.

00396 
00397     def __contains__(self, key):
00398         return bool(self._con.execute("SELECT key FROM offset_data WHERE key=?;",
00399                                       (key,)).fetchone())

def Bio.SeqIO._index._SQLiteManySeqFilesDict.__getitem__ (   self,
  key 
)
x.__getitem__(y) <==> x[y]

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 418 of file _index.py.

00418 
00419     def __getitem__(self, key):
00420         """x.__getitem__(y) <==> x[y]"""
00421         #Pass the offset to the proxy
00422         row = self._con.execute("SELECT file_number, offset FROM offset_data WHERE key=?;",
00423                                 (key,)).fetchone()
00424         if not row: raise KeyError
00425         file_number, offset = row
00426         proxies = self._proxies
00427         if file_number in proxies:
00428             record = proxies[file_number].get(offset)
00429         else:
00430             if len(proxies) >= self._max_open:
00431                 #Close an old handle...
00432                 proxies.popitem()[1]._handle.close()
00433             #Open a new handle...
00434             proxy = _FormatToRandomAccess[self._format]( \
00435                         self._filenames[file_number],
00436                         self._format, self._alphabet)
00437             record = proxy.get(offset)
00438             proxies[file_number] = proxy
00439         if self._key_function:
00440             key2 = self._key_function(record.id)
00441         else:
00442             key2 = record.id
00443         if key != key2:
00444             raise ValueError("Key did not match (%s vs %s)" % (key, key2))
00445         return record
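
The handle management in `__getitem__` can be isolated into a small sketch. This is an illustration only: `HandlePool` is a hypothetical class, and it stores plain file objects where the listing stores format-specific proxy objects with a `_handle` attribute.

```python
import os
import tempfile


class HandlePool:
    """Keep at most max_open file handles open, keyed by file number."""

    def __init__(self, filenames, max_open=10):
        self._filenames = filenames
        self._max_open = max_open
        self._handles = {}

    def get(self, file_number):
        handles = self._handles
        if file_number not in handles:
            if len(handles) >= self._max_open:
                # Close one open handle to make room. On Python 3.7+
                # dict.popitem() removes the most recently added entry.
                handles.popitem()[1].close()
            handles[file_number] = open(self._filenames[file_number], "rb")
        return handles[file_number]

    def close(self):
        while self._handles:
            self._handles.popitem()[1].close()


# Three demo files, but only two handles may be open at once.
tmp = tempfile.mkdtemp()
names = []
for i in range(3):
    name = os.path.join(tmp, "part%i.txt" % i)
    with open(name, "wb") as out:
        out.write(b"file %i" % i)
    names.append(name)

pool = HandlePool(names, max_open=2)
contents = []
for i in (0, 1, 2, 0):
    handle = pool.get(i)
    handle.seek(0)
    contents.append(handle.read())
open_count = len(pool._handles)
pool.close()
```

As in the listing, `popitem()` makes no attempt to evict the least recently used handle. If access patterns made that matter, a `collections.OrderedDict` with `move_to_end()` would be one way to get LRU behaviour.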

def Bio.SeqIO._index._SQLiteManySeqFilesDict.__iter__ (   self)
Iterate over the keys.

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 405 of file _index.py.

00405 
00406     def __iter__(self):
00407         """Iterate over the keys."""
00408         for row in self._con.execute("SELECT key FROM offset_data;"):
00409             yield str(row[0])

def Bio.SeqIO._index._SQLiteManySeqFilesDict.__len__ (   self)
How many records are there?

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 400 of file _index.py.

00400 
00401     def __len__(self):
00402         """How many records are there?"""
00403         return self._length
00404         #return self._con.execute("SELECT COUNT(key) FROM offset_data;").fetchone()[0]
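
`__len__` returns the cached `_length` instead of running a `COUNT` query on each call. The cached value is trustworthy because the constructor cross-checks it when an existing index is reopened. A minimal Python 3 sketch of that check follows; `check_count` is a hypothetical helper, and the two tables are reduced to what the check needs.

```python
import sqlite3


def check_count(con):
    """Return the stored record count after checking it against the table."""
    count, = con.execute("SELECT value FROM meta_data WHERE key=?;",
                         ("count",)).fetchone()
    length = int(count)
    if length == -1:
        raise ValueError("Unfinished/partial database")
    actual, = con.execute("SELECT COUNT(key) FROM offset_data;").fetchone()
    if length != int(actual):
        raise ValueError("Corrupt database? %i entries not %i"
                         % (int(actual), length))
    return length


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE meta_data (key TEXT, value TEXT);")
con.execute("CREATE TABLE offset_data (key TEXT, offset INTEGER);")
con.execute("INSERT INTO meta_data VALUES (?,?);", ("count", -1))
con.executemany("INSERT INTO offset_data VALUES (?,?);",
                [("x", 0), ("y", 7)])

# With the -1 sentinel still in place the database counts as unfinished.
try:
    check_count(con)
    partial_message = None
except ValueError as err:
    partial_message = str(err)

# Once the real count is stored, the check passes.
con.execute("UPDATE meta_data SET value = ? WHERE key = ?;", (2, "count"))
stored = check_count(con)
```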

def Bio.SeqIO._index._SQLiteManySeqFilesDict.__repr__ (   self)

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 391 of file _index.py.

00391 
00392     def __repr__(self):
00393         return "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \
00394                % (self._index_filename, self._filenames, self._format,
00395                   self._alphabet, self._key_function)

def Bio.SeqIO._index._IndexedSeqFileDict.__setitem__ (   self,
  key,
  value 
) [inherited]
Would allow setting or replacing records, but not implemented.

Definition at line 220 of file _index.py.

00220 
00221     def __setitem__(self, key, value):
00222         """Would allow setting or replacing records, but not implemented."""
00223         raise NotImplementedError("An indexed a sequence file is read only.")
    

def Bio.SeqIO._index._IndexedSeqFileDict.__str__ (   self) [inherited]

Definition at line 110 of file _index.py.

00110 
00111     def __str__(self):
00112         if self:
00113             return "{%s : SeqRecord(...), ...}" % repr(self.keys()[0])
00114         else:
00115             return "{}"

def Bio.SeqIO._index._IndexedSeqFileDict.clear (   self) [inherited]
Would clear dictionary, but not implemented.

Definition at line 238 of file _index.py.

00238 
00239     def clear(self):
00240         """Would clear dictionary, but not implemented."""
00241         raise NotImplementedError("An indexed a sequence file is read only.")

def Bio.SeqIO._index._SQLiteManySeqFilesDict.close (   self)
Close any open file handles.

Definition at line 495 of file _index.py.

00495 
00496     def close(self):
00497         """Close any open file handles."""
00498         proxies = self._proxies
00499         while proxies:
00500             proxies.popitem()[1]._handle.close()
00501         

def Bio.SeqIO._index._IndexedSeqFileDict.copy (   self) [inherited]
A dictionary method which we don't implement.

Definition at line 247 of file _index.py.

00247 
00248     def copy(self):
00249         """A dictionary method which we don't implement."""
00250         raise NotImplementedError("An indexed a sequence file doesn't "
00251                                   "support this.")
00252 

def Bio.SeqIO._index._IndexedSeqFileDict.fromkeys (   self,
  keys,
  value = None 
) [inherited]
A dictionary method which we don't implement.

Definition at line 242 of file _index.py.

00242 
00243     def fromkeys(self, keys, value=None):
00244         """A dictionary method which we don't implement."""
00245         raise NotImplementedError("An indexed a sequence file doesn't "
00246                                   "support this.")

def Bio.SeqIO._index._SQLiteManySeqFilesDict.get (   self,
  k,
  d = None 
)
D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 446 of file _index.py.

00446 
00447     def get(self, k, d=None):
00448         """D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None."""
00449         try:
00450             return self.__getitem__(k)
00451         except KeyError:
00452             return d

def Bio.SeqIO._index._SQLiteManySeqFilesDict.get_raw (   self,
  key 
)
Similar to the get method, but returns the record as a raw string.

If the key is not found, a KeyError exception is raised.

Note that on Python 3 a bytes string is returned, not a typical
unicode string.

NOTE - This functionality is not supported for every file format.

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 453 of file _index.py.

00453 
00454     def get_raw(self, key):
00455         """Similar to the get method, but returns the record as a raw string.
00456 
00457         If the key is not found, a KeyError exception is raised.
00458 
00459         Note that on Python 3 a bytes string is returned, not a typical
00460         unicode string.
00461 
00462         NOTE - This functionality is not supported for every file format.
00463         """
00464         #Pass the offset to the proxy
00465         row = self._con.execute("SELECT file_number, offset, length FROM offset_data WHERE key=?;",
00466                                 (key,)).fetchone()
00467         if not row: raise KeyError
00468         file_number, offset, length = row
00469         proxies = self._proxies
00470         if file_number in proxies:
00471             if length:
00472                 #Shortcut if we have the length
00473                 h = proxies[file_number]._handle
00474                 h.seek(offset)
00475                 return h.read(length)
00476             else:
00477                 return proxies[file_number].get_raw(offset)
00478         else:
00479             #This code is duplicated from __getitem__ to avoid a function call
00480             if len(proxies) >= self._max_open:
00481                 #Close an old handle...
00482                 proxies.popitem()[1]._handle.close()
00483             #Open a new handle...
00484             proxy = _FormatToRandomAccess[self._format]( \
00485                         self._filenames[file_number],
00486                         self._format, self._alphabet)
00487             proxies[file_number] = proxy
00488             if length:
00489                 #Shortcut if we have the length
00490                 h = proxy._handle
00491                 h.seek(offset)
00492                 return h.read(length)
00493             else:
00494                 return proxy.get_raw(offset)
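
The "shortcut" branch of `get_raw` is a plain seek and read on a binary handle, which is why Python 3 returns bytes. The sketch below illustrates that under assumptions: the module-level `offsets` dictionary stands in for the SQLite table, and this standalone `get_raw` function is a hypothetical simplification of the method.

```python
import os
import tempfile

records = [b">r1 first\r\nACGT\r\n", b">r2 second\r\nGGCC\r\n"]
path = os.path.join(tempfile.mkdtemp(), "raw_demo.fasta")
with open(path, "wb") as out:
    out.write(b"".join(records))

# (offset, length) pairs in bytes, as an index would store them.
offsets = {}
position = 0
for record in records:
    key = record[1:].split(None, 1)[0].decode("ascii")
    offsets[key] = (position, len(record))
    position += len(record)


def get_raw(handle, key):
    """Return the record's bytes exactly as they appear on disk."""
    if key not in offsets:
        raise KeyError(key)
    offset, length = offsets[key]
    handle.seek(offset)
    return handle.read(length)


with open(path, "rb") as handle:
    raw_r2 = get_raw(handle, "r2")
```

Reading in binary mode keeps the stored byte offsets valid and returns the record unchanged, including its original line endings.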

def Bio.SeqIO._index._IndexedSeqFileDict.items (   self) [inherited]
Would be a list of the (key, SeqRecord) tuples, but not implemented.

In general you can be indexing very very large files, with millions
of sequences. Loading all these into memory at once as SeqRecord
objects would (probably) use up all the RAM. Therefore we simply
don't support this dictionary method.

Definition at line 137 of file _index.py.

00137 
00138         def items(self):
00139             """Would be a list of the (key, SeqRecord) tuples, but not implemented.
00140 
00141             In general you can be indexing very very large files, with millions
00142             of sequences. Loading all these into memory at once as SeqRecord
00143             objects would (probably) use up all the RAM. Therefore we simply
00144             don't support this dictionary method.
00145             """
00146             raise NotImplementedError("Due to memory concerns, when indexing a "
00147                                       "sequence file you cannot access all the "
00148                                       "records at once.")

def Bio.SeqIO._index._IndexedSeqFileDict.items (   self) [inherited]
Iterate over the (key, SeqRecord) items.

Definition at line 170 of file _index.py.

00170 
00171         def items(self):
00172             """Iterate over the (key, SeqRecord) items."""
00173             for key in self.__iter__():
00174                 yield key, self.__getitem__(key)
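
The difference between the two `items` variants is that the generator builds one record at a time instead of holding every record in memory. A minimal sketch of that behaviour follows; `LazyRecords` is a hypothetical stand-in in which a dictionary of strings replaces the file offsets and `str.upper()` replaces SeqIO parsing.

```python
class LazyRecords:
    """Read-only mapping that parses a record only when it is requested."""

    def __init__(self, raw_records):
        self._raw = raw_records      # stands in for offsets into a file
        self.parsed = 0              # how many records have been built

    def __iter__(self):
        return iter(self._raw)

    def __getitem__(self, key):
        self.parsed += 1
        return self._raw[key].upper()    # stands in for SeqIO parsing

    def items(self):
        for key in self.__iter__():
            yield key, self.__getitem__(key)


lazy = LazyRecords({"a": "acgt", "b": "ttaa", "c": "ggcc"})
first_key, first_record = next(lazy.items())
parsed_after_one = lazy.parsed
```

Taking a single item from the generator parses a single record, so memory use stays flat however many records are indexed.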

def Bio.SeqIO._index._IndexedSeqFileDict.iteritems (   self) [inherited]
Iterate over the (key, SeqRecord) items.

Definition at line 159 of file _index.py.

00159 
00160         def iteritems(self):
00161             """Iterate over the (key, SeqRecord) items."""
00162             for key in self.__iter__():
00163                 yield key, self.__getitem__(key)
        

def Bio.SeqIO._index._IndexedSeqFileDict.iterkeys (   self) [inherited]
Iterate over the keys.

Definition at line 164 of file _index.py.

00164 
00165         def iterkeys(self):
00166             """Iterate over the keys."""
00167             return self.__iter__()

def Bio.SeqIO._index._IndexedSeqFileDict.itervalues (   self) [inherited]
Iterate over the SeqRecord items.

Definition at line 154 of file _index.py.

00154 
00155         def itervalues(self):
00156             """Iterate over the SeqRecord items."""
00157             for key in self.__iter__():
00158                 yield self.__getitem__(key)

def Bio.SeqIO._index._SQLiteManySeqFilesDict.keys (   self)
Return a list of all the keys (SeqRecord identifiers).

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 413 of file _index.py.

00413 
00414         def keys(self) :
00415             """Return a list of all the keys (SeqRecord identifiers)."""
00416             return [str(row[0]) for row in \
00417                     self._con.execute("SELECT key FROM offset_data;").fetchall()]

def Bio.SeqIO._index._IndexedSeqFileDict.pop (   self,
  key,
  default = None 
) [inherited]
Would remove specified record, but not implemented.

Definition at line 229 of file _index.py.

00229 
00230     def pop(self, key, default=None):
00231         """Would remove specified record, but not implemented."""
00232         raise NotImplementedError("An indexed a sequence file is read only.")
    
def Bio.SeqIO._index._IndexedSeqFileDict.popitem (   self) [inherited]
Would remove and return a SeqRecord, but not implemented.

Definition at line 233 of file _index.py.

00233 
00234     def popitem(self):
00235         """Would remove and return a SeqRecord, but not implemented."""
00236         raise NotImplementedError("An indexed a sequence file is read only.")
00237 
    
def Bio.SeqIO._index._IndexedSeqFileDict.update (   self,
  args,
  kwargs 
) [inherited]
Would allow adding more values, but not implemented.

Definition at line 224 of file _index.py.

00224 
00225     def update(self, *args, **kwargs):
00226         """Would allow adding more values, but not implemented."""
00227         raise NotImplementedError("An indexed a sequence file is read only.")
00228 
    

def Bio.SeqIO._index._IndexedSeqFileDict.values (   self) [inherited]
Would be a list of the SeqRecord objects, but not implemented.

In general you can be indexing very very large files, with millions
of sequences. Loading all these into memory at once as SeqRecord
objects would (probably) use up all the RAM. Therefore we simply
don't support this dictionary method.

Definition at line 125 of file _index.py.

00125 
00126         def values(self):
00127             """Would be a list of the SeqRecord objects, but not implemented.
00128 
00129             In general you can be indexing very very large files, with millions
00130             of sequences. Loading all these into memory at once as SeqRecord
00131             objects would (probably) use up all the RAM. Therefore we simply
00132             don't support this dictionary method.
00133             """
00134             raise NotImplementedError("Due to memory concerns, when indexing a "
00135                                       "sequence file you cannot access all the "
00136                                       "records at once.")

def Bio.SeqIO._index._IndexedSeqFileDict.values (   self) [inherited]
Iterate over the SeqRecord items.

Definition at line 175 of file _index.py.

00175 
00176         def values(self):
00177             """Iterate over the SeqRecord items."""
00178             for key in self.__iter__():
00179                 yield self.__getitem__(key)


Member Data Documentation

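Bio.SeqIO._index._SQLiteManySeqFilesDict._alphabet [private]

Definition at line 388 of file _index.py.

Bio.SeqIO._index._SQLiteManySeqFilesDict._con [private]

Definition at line 281 of file _index.py.

Bio.SeqIO._index._SQLiteManySeqFilesDict._filenames [private]

Definition at line 301 of file _index.py.

Bio.SeqIO._index._SQLiteManySeqFilesDict._format [private]

Definition at line 322 of file _index.py.

Bio.SeqIO._index._SQLiteManySeqFilesDict._index_filename [private]

Definition at line 387 of file _index.py.

Bio.SeqIO._index._SQLiteManySeqFilesDict._key_function [private]

Reimplemented from Bio.SeqIO._index._IndexedSeqFileDict.

Definition at line 389 of file _index.py.

Bio.SeqIO._index._SQLiteManySeqFilesDict._length [private]

Definition at line 286 of file _index.py.

Bio.SeqIO._index._SQLiteManySeqFilesDict._max_open [private]

Definition at line 386 of file _index.py.

Bio.SeqIO._index._SQLiteManySeqFilesDict._proxies [private]

Definition at line 376 of file _index.py.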


The documentation for this class was generated from the following file: