Back to index

python-biopython  1.60
Functions | Variables
Bio.SeqIO.AbiIO Namespace Reference

Functions

def AbiIterator
def _AbiTrimIterator
def _abi_parse_header
def _abi_trim
def _parse_tag_data

Variables

string __docformat__ = "epytext en"
dictionary _EXTRACT
list _SPCTAGS
dictionary _BYTEFMT
string _HEADFMT = '>H4sI2H3I'
string _DIRFMT = '>4sI2H4I'
 elem_code

Function Documentation

def Bio.SeqIO.AbiIO._abi_parse_header (   header,
  handle 
) [private]
Generator that returns directory contents.

Definition at line 171 of file AbiIO.py.

00171 
00172 def _abi_parse_header(header, handle):
00173     """Generator that returns directory contents.
00174     """
00175     # header structure (after ABIF marker):
00176     # file version, tag name, tag number,
00177     # element type code, element size, number of elements
00178     # data size, data offset, handle (not file handle)
00179     head_elem_size = header[4]
00180     head_elem_num = header[5]
00181     head_offset = header[7]
00182     index = 0
00183 
00184     while index < head_elem_num:
00185         start = head_offset + index * head_elem_size
00186         # add directory offset to tuple
00187         # to handle directories with data size <= 4 bytes
00188         handle.seek(start)
00189         dir_entry = struct.unpack(_DIRFMT, \
00190                     handle.read(struct.calcsize(_DIRFMT))) + (start,)
00191         index += 1
00192         # only parse desired dirs
00193         key = _bytes_to_string(dir_entry[0])
00194         key += str(dir_entry[1])
00195         if key in (list(_EXTRACT.keys()) + _SPCTAGS):
00196             tag_name = _bytes_to_string(dir_entry[0])
00197             tag_number = dir_entry[1]
00198             elem_code = dir_entry[2]
00199             elem_num = dir_entry[4]
00200             data_size = dir_entry[5]
00201             data_offset = dir_entry[6]
00202             tag_offset = dir_entry[8]
00203             # if data size <= 4 bytes, data is stored inside tag
00204             # so offset needs to be changed
00205             if data_size <= 4:
00206                 data_offset = tag_offset + 20
00207             handle.seek(data_offset)
00208             data = handle.read(data_size)
00209             yield tag_name, tag_number, \
00210                   _parse_tag_data(elem_code, elem_num, data)
00211 

Here is the call graph for this function:

Here is the caller graph for this function:

def Bio.SeqIO.AbiIO._abi_trim (   seq_record) [private]
Trims the sequence using Richard Mott's modified trimming algorithm.

seq_record - SeqRecord object to be trimmed.

Trimmed bases are determined from their segment score, which is a
cumulative sum of each base's score. Base scores are calculated from
their quality values.

More about the trimming algorithm:
http://www.phrap.org/phredphrap/phred.html
http://www.clcbio.com/manual/genomics/Quality_abif_trimming.html

Definition at line 212 of file AbiIO.py.

00212 
00213 def _abi_trim(seq_record):
00214     """Trims the sequence using Richard Mott's modified trimming algorithm.
00215 
00216     seq_record - SeqRecord object to be trimmed.
00217 
00218     Trimmed bases are determined from their segment score, which is a
00219     cumulative sum of each base's score. Base scores are calculated from
00220     their quality values.
00221 
00222     More about the trimming algorithm:
00223     http://www.phrap.org/phredphrap/phred.html
00224     http://www.clcbio.com/manual/genomics/Quality_abif_trimming.html
00225     """
00226 
00227     start = False   # flag for starting position of trimmed sequence
00228     segment = 20    # minimum sequence length
00229     trim_start = 0  # init start index
00230     cutoff = 0.05   # default cutoff value for calculating base score
00231 
00232     if len(seq_record) <= segment:
00233         return seq_record
00234     else:
00235         # calculate base score
00236         score_list = [cutoff - (10 ** (qual/-10.0)) for qual in
00237                       seq_record.letter_annotations['phred_quality']]
00238 
00239         # calculate cummulative score
00240         # if cummulative value < 0, set it to 0
00241         # first value is set to 0, because of the assumption that
00242         # the first base will always be trimmed out
00243         cummul_score = [0]
00244         for i in range(1, len(score_list)):
00245             score = cummul_score[-1] + score_list[i]
00246             if score < 0:
00247                 cummul_score.append(0)
00248             else:
00249                 cummul_score.append(score)
00250                 if not start:
00251                     # trim_start = value when cummulative score is first > 0
00252                     trim_start = i
00253                     start = True
00254         
00255         # trim_finish = index of highest cummulative score,
00256         # marking the end of sequence segment with highest cummulative score
00257         trim_finish = cummul_score.index(max(cummul_score))
00258                          
00259         return seq_record[trim_start:trim_finish]

Here is the caller graph for this function:

def Bio.SeqIO.AbiIO._AbiTrimIterator (   handle) [private]
Iterator for the Abi file format that yields trimmed SeqRecord objects.

Definition at line 166 of file AbiIO.py.

00166 
00167 def _AbiTrimIterator(handle):
00168     """Iterator for the Abi file format that yields trimmed SeqRecord objects.
00169     """
00170     return AbiIterator(handle, trim=True)

Here is the call graph for this function:

def Bio.SeqIO.AbiIO._parse_tag_data (   elem_code,
  elem_num,
  raw_data 
) [private]
Returns single data value.

elem_code - What kind of data
elem_num - How many data points
raw_data - abi file object from which the tags would be unpacked

Definition at line 260 of file AbiIO.py.

00260 
00261 def _parse_tag_data(elem_code, elem_num, raw_data):
00262     """Returns single data value.
00263     
00264     elem_code - What kind of data
00265     elem_num - How many data points
00266     raw_data - abi file object from which the tags would be unpacked
00267     """
00268     if elem_code in _BYTEFMT:
00269         # because '>1s' unpack differently from '>s'
00270         if elem_num == 1:
00271             num = ''
00272         else:
00273             num = str(elem_num)
00274         fmt = '>' + num + _BYTEFMT[elem_code]
00275 
00276         assert len(raw_data) == struct.calcsize(fmt)
00277         data = struct.unpack(fmt, raw_data)
00278 
00279         # no need to use tuple if len(data) == 1
00280         # also if data is date / time
00281         if elem_code not in [10, 11] and len(data) == 1:
00282             data = data[0]
00283 
00284         # account for different data types
00285         if elem_code == 2:
00286             return _bytes_to_string(data)
00287         elif elem_code == 10:
00288             return str(datetime.date(*data))
00289         elif elem_code == 11:
00290             return str(datetime.time(*data[:3]))
00291         elif elem_code == 13:
00292             return bool(data)
00293         elif elem_code == 18:
00294             return _bytes_to_string(data[1:])
00295         elif self.elem_code == 19:
00296             return _bytes_to_string(data[:-1])
00297         else:
00298             return data
00299     else:
00300         return None

Here is the caller graph for this function:

def Bio.SeqIO.AbiIO.AbiIterator (   handle,
  alphabet = None,
  trim = False 
)
Iterator for the Abi file format.

Definition at line 79 of file AbiIO.py.

00079 
00080 def AbiIterator(handle, alphabet=None, trim=False):
00081     """Iterator for the Abi file format.
00082     """
00083     # raise exception is alphabet is not dna
00084     if alphabet is not None:
00085         if isinstance(Alphabet._get_base_alphabet(alphabet),
00086                       Alphabet.ProteinAlphabet):
00087             raise ValueError("Invalid alphabet, ABI files do not hold proteins.")
00088         if isinstance(Alphabet._get_base_alphabet(alphabet),
00089                       Alphabet.RNAAlphabet):
00090             raise ValueError("Invalid alphabet, ABI files do not hold RNA.")
00091 
00092     # raise exception if handle mode is not 'rb'
00093     if hasattr(handle, 'mode'):
00094         if set('rb') != set(handle.mode.lower()):
00095             raise ValueError("ABI files has to be opened in 'rb' mode.") 
00096 
00097     # check if input file is a valid Abi file
00098     handle.seek(0)
00099     marker = handle.read(4)
00100     if not marker:
00101         # handle empty file gracefully
00102         raise StopIteration
00103     if marker != _as_bytes('ABIF'):
00104         raise IOError('File should start ABIF, not %r' % marker)
00105 
00106     # dirty hack for handling time information
00107     times = {'RUND1': '', 'RUND2': '', 'RUNT1': '', 'RUNT2': '', }
00108 
00109     # initialize annotations
00110     annot = dict(zip(_EXTRACT.values(), [None] * len(_EXTRACT)))
00111 
00112     # parse header and extract data from directories
00113     header = struct.unpack(_HEADFMT, \
00114              handle.read(struct.calcsize(_HEADFMT)))
00115 
00116     for tag_name, tag_number, tag_data in _abi_parse_header(header, handle):
00117         # stop iteration if all desired tags have been extracted
00118         # 4 tags from _EXTRACT + 2 time tags from _SPCTAGS - 3,
00119         # and seq, qual, id
00120         # todo
00121 
00122         key = tag_name + str(tag_number)
00123 
00124         # PBAS2 is base-called sequence
00125         if key == 'PBAS2': 
00126             seq = tag_data
00127             ambigs = 'KYWMRS'
00128             if alphabet is None:
00129                 if set(seq).intersection(ambigs):
00130                     alphabet = ambiguous_dna
00131                 else:
00132                     alphabet = unambiguous_dna
00133         # PCON2 is quality values of base-called sequence
00134         elif key == 'PCON2':
00135             qual = [ord(val) for val in tag_data]
00136         # SMPL1 is sample id entered before sequencing run
00137         elif key == 'SMPL1':
00138             sample_id = tag_data
00139         elif key in times:
00140             times[key] = tag_data
00141         else:
00142             # extract sequence annotation as defined in _EXTRACT          
00143             if key in _EXTRACT:
00144                 annot[_EXTRACT[key]] = tag_data
00145 
00146     # set time annotations
00147     annot['run_start'] = '%s %s' % (times['RUND1'], times['RUNT1'])
00148     annot['run_finish'] = '%s %s' % (times['RUND2'], times['RUNT2'])
00149     
00150     # use the file name as SeqRecord.name if available
00151     try:
00152         file_name = basename(handle.name).replace('.ab1','')
00153     except:
00154         file_name = ""
00155 
00156     record = SeqRecord(Seq(seq, alphabet),
00157                        id=sample_id, name=file_name,
00158                        description='',
00159                        annotations=annot,
00160                        letter_annotations={'phred_quality': qual})
00161                       
00162     if not trim:
00163         yield record
00164     else:
00165         yield _abi_trim(record)

Here is the call graph for this function:

Here is the caller graph for this function:


Variable Documentation

string Bio.SeqIO.AbiIO.__docformat__ = "epytext en"

Definition at line 17 of file AbiIO.py.

Initial value:
00001 {
00002             1: 'b',     # byte
00003             2: 's',     # char
00004             3: 'H',     # word
00005             4: 'h',     # short
00006             5: 'i',     # long
00007             6: '2i',    # rational, legacy unsupported
00008             7: 'f',     # float
00009             8: 'd',     # double
00010             10: 'h2B',  # date
00011             11: '4B',   # time
00012             12: '2i2b', # thumb
00013             13: 'B',    # bool
00014             14: '2h',   # point, legacy unsupported
00015             15: '4h',   # rect, legacy unsupported
00016             16: '2i',   # vPoint, legacy unsupported
00017             17: '4i',   # vRect, legacy unsupported
00018             18: 's',    # pString
00019             19: 's',    # cString
00020             20: '2i',   # tag, legacy unsupported
00021            }

Definition at line 53 of file AbiIO.py.

string Bio.SeqIO.AbiIO._DIRFMT = '>4sI2H4I'

Definition at line 77 of file AbiIO.py.

Initial value:
00001 {
00002             'TUBE1': 'sample_well',
00003             'DySN1': 'dye',
00004             'GTyp1': 'polymer',
00005             'MODL1': 'machine_model',
00006            }

Definition at line 35 of file AbiIO.py.

string Bio.SeqIO.AbiIO._HEADFMT = '>H4sI2H3I'

Definition at line 75 of file AbiIO.py.

Initial value:
00001 [
00002             'PBAS2',    # base-called sequence
00003             'PCON2',    # quality values of base-called sequence
00004             'SMPL1',    # sample id inputted before sequencing run
00005             'RUND1',    # run start date
00006             'RUND2',    # run finish date
00007             'RUNT1',    # run start time
00008             'RUNT2',    # run finish time
00009            ]

Definition at line 43 of file AbiIO.py.

Definition at line 294 of file AbiIO.py.