Back to index

python-biopython  1.60
Public Member Functions | Public Attributes | Private Member Functions
Bio.Align.AlignInfo.SummaryInfo Class Reference

List of all members.

Public Member Functions

def __init__
def dumb_consensus
def gap_consensus
def replacement_dictionary
def pos_specific_score_matrix
def information_content
def get_column

Public Attributes

 alignment
 ic_vector

Private Member Functions

def _guess_consensus_alphabet
def _pair_replacement
def _get_all_letters
def _get_base_replacements
def _get_base_letters
def _get_letter_freqs
def _get_column_info_content

Detailed Description

Calculate summary info about the alignment.

This class should be used to caclculate information summarizing the
results of an alignment. This may either be straight consensus info
or more complicated things.

Definition at line 26 of file AlignInfo.py.


Constructor & Destructor Documentation

def Bio.Align.AlignInfo.SummaryInfo.__init__ (   self,
  alignment 
)
Initialize with the alignment to calculate information on.
   ic_vector attribute. A dictionary. Keys: column numbers. Values:

Definition at line 33 of file AlignInfo.py.

00033 
00034     def __init__(self, alignment):
00035         """Initialize with the alignment to calculate information on.
00036            ic_vector attribute. A dictionary. Keys: column numbers. Values:
00037         """
00038         self.alignment = alignment
00039         self.ic_vector = {}

Here is the caller graph for this function:


Member Function Documentation

Returns a string containing the expected letters in the alignment.

Definition at line 304 of file AlignInfo.py.

00304 
00305     def _get_all_letters(self):
00306         """Returns a string containing the expected letters in the alignment."""
00307         all_letters = self.alignment._alphabet.letters
00308         if all_letters is None \
00309         or (isinstance(self.alignment._alphabet, Alphabet.Gapped) \
00310         and all_letters == self.alignment._alphabet.gap_char):
00311             #We are dealing with a generic alphabet class where the
00312             #letters are not defined!  We must build a list of the
00313             #letters used...
00314             set_letters = set()
00315             for record in self.alignment:
00316                 #Note the built in set does not have a union_update
00317                 #which was provided by the sets module's Set
00318                 set_letters = set_letters.union(record.seq)
00319             list_letters = list(set_letters)
00320             list_letters.sort()
00321             all_letters = "".join(list_letters)
00322         return all_letters

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo._get_base_letters (   self,
  letters 
) [private]
Create a zeroed dictionary with all of the specified letters.

Definition at line 423 of file AlignInfo.py.

00423 
00424     def _get_base_letters(self, letters):
00425         """Create a zeroed dictionary with all of the specified letters.
00426         """
00427         base_info = {}
00428         for letter in letters:
00429             base_info[letter] = 0
00430 
00431         return base_info

Here is the call graph for this function:

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo._get_base_replacements (   self,
  skip_items = [] 
) [private]
Get a zeroed dictonary of all possible letter combinations.

This looks at the type of alphabet and gets the letters for it.
It then creates a dictionary with all possible combinations of these
letters as keys (ie. ('A', 'G')) and sets the values as zero.

Returns:
o The base dictionary created
o A list of alphabet items to skip when filling the dictionary.Right
now the only thing I can imagine in this list is gap characters, but
maybe X's or something else might be useful later. This will also
include any characters that are specified to be skipped.

Definition at line 323 of file AlignInfo.py.

00323 
00324     def _get_base_replacements(self, skip_items = []):
00325         """Get a zeroed dictonary of all possible letter combinations.
00326 
00327         This looks at the type of alphabet and gets the letters for it.
00328         It then creates a dictionary with all possible combinations of these
00329         letters as keys (ie. ('A', 'G')) and sets the values as zero.
00330 
00331         Returns:
00332         o The base dictionary created
00333         o A list of alphabet items to skip when filling the dictionary.Right
00334         now the only thing I can imagine in this list is gap characters, but
00335         maybe X's or something else might be useful later. This will also
00336         include any characters that are specified to be skipped.
00337         """
00338         base_dictionary = {}
00339         all_letters = self._get_all_letters()
00340 
00341         # if we have a gapped alphabet we need to find the gap character
00342         # and drop it out
00343         if isinstance(self.alignment._alphabet, Alphabet.Gapped):
00344             skip_items.append(self.alignment._alphabet.gap_char)
00345             all_letters = all_letters.replace(self.alignment._alphabet.gap_char,'')
00346 
00347         # now create the dictionary
00348         for first_letter in all_letters:
00349             for second_letter in all_letters:
00350                 if (first_letter not in skip_items and
00351                     second_letter not in skip_items):
00352                     base_dictionary[(first_letter, second_letter)] = 0
00353 
00354         return base_dictionary, skip_items
00355 

Here is the call graph for this function:

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo._get_column_info_content (   self,
  obs_freq,
  e_freq_table,
  log_base,
  random_expected 
) [private]
Calculate the information content for a column.

Arguments:
o obs_freq - The frequencies observed for each letter in the column.
o e_freq_table - An optional argument specifying the expected
frequencies for each letter. This is a SubsMat.FreqTable instance.
o log_base - The base of the logathrim to use in calculating the
info content.

Definition at line 555 of file AlignInfo.py.

00555 
00556                                  random_expected):
00557         """Calculate the information content for a column.
00558 
00559         Arguments:
00560         o obs_freq - The frequencies observed for each letter in the column.
00561         o e_freq_table - An optional argument specifying the expected
00562         frequencies for each letter. This is a SubsMat.FreqTable instance.
00563         o log_base - The base of the logathrim to use in calculating the
00564         info content.
00565         """
00566         try:
00567             gap_char = self.alignment._alphabet.gap_char
00568         except AttributeError:
00569             #The alphabet doesn't declare a gap - there could be none
00570             #in the sequence... or just a vague alphabet.
00571             gap_char = "-" #Safe?
00572             
00573         if e_freq_table:
00574             if not isinstance(e_freq_table, FreqTable.FreqTable):
00575                 raise ValueError("e_freq_table should be a FreqTable object")
00576             # check the expected freq information to make sure it is good
00577             for key in obs_freq:
00578                 if (key != gap_char and key not in e_freq_table):
00579                     raise ValueError("Expected frequency letters %s "
00580                                      "do not match observed %s" \
00581                                      % (e_freq_table.keys(),
00582                                         obs_freq.keys() - [gap_char]))
00583         
00584         total_info = 0.0
00585 
00586         for letter in obs_freq:
00587             inner_log = 0.0
00588             # if we have expected frequencies, modify the log value by them
00589             # gap characters do not have expected frequencies, so they
00590             # should just be the observed frequency.
00591             if letter != gap_char:
00592                 if e_freq_table:
00593                     inner_log = obs_freq[letter] / e_freq_table[letter]
00594                 else:
00595                     inner_log = obs_freq[letter] / random_expected
00596             # if the observed frequency is zero, we don't add any info to the
00597             # total information content
00598             if inner_log > 0:
00599                 letter_info = (obs_freq[letter] * 
00600                                math.log(inner_log) / math.log(log_base))
00601                 total_info += letter_info
00602         return total_info 

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo._get_letter_freqs (   self,
  residue_num,
  all_records,
  letters,
  to_ignore 
) [private]
Determine the frequency of specific letters in the alignment.

Arguments:
o residue_num - The number of the column we are getting frequencies
from.
o all_records - All of the SeqRecords in the alignment.
o letters - The letters we are interested in getting the frequency
for.
o to_ignore - Letters we are specifically supposed to ignore.

This will calculate the frequencies of each of the specified letters
in the alignment at the given frequency, and return this as a
dictionary where the keys are the letters and the values are the
frequencies.

Definition at line 510 of file AlignInfo.py.

00510 
00511     def _get_letter_freqs(self, residue_num, all_records, letters, to_ignore):
00512         """Determine the frequency of specific letters in the alignment.
00513 
00514         Arguments:
00515         o residue_num - The number of the column we are getting frequencies
00516         from.
00517         o all_records - All of the SeqRecords in the alignment.
00518         o letters - The letters we are interested in getting the frequency
00519         for.
00520         o to_ignore - Letters we are specifically supposed to ignore.
00521 
00522         This will calculate the frequencies of each of the specified letters
00523         in the alignment at the given frequency, and return this as a
00524         dictionary where the keys are the letters and the values are the
00525         frequencies.
00526         """
00527         freq_info = self._get_base_letters(letters)
00528 
00529         total_count = 0
00530         # collect the count info into the dictionary for all the records
00531         for record in all_records:
00532             try:
00533                 if record.seq[residue_num] not in to_ignore:
00534                     weight = record.annotations.get('weight',1.0)
00535                     freq_info[record.seq[residue_num]] += weight
00536                     total_count += weight
00537             # getting a key error means we've got a problem with the alphabet 
00538             except KeyError:
00539                 raise ValueError("Residue %s not found in alphabet %s"
00540                                  % (record.seq[residue_num],
00541                                     self.alignment._alphabet))
00542 
00543         if total_count == 0:
00544             # This column must be entirely ignored characters
00545             for letter in freq_info:
00546                 assert freq_info[letter] == 0
00547                 #TODO - Map this to NA or NaN?
00548         else:
00549             # now convert the counts into frequencies
00550             for letter in freq_info:
00551                 freq_info[letter] = freq_info[letter] / total_count
00552 
00553         return freq_info
            

Here is the call graph for this function:

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo._guess_consensus_alphabet (   self,
  ambiguous 
) [private]
Pick an (ungapped) alphabet for an alignment consesus sequence.

This just looks at the sequences we have, checks their type, and
returns as appropriate type which seems to make sense with the
sequences we've got.

Definition at line 169 of file AlignInfo.py.

00169 
00170     def _guess_consensus_alphabet(self, ambiguous):
00171         """Pick an (ungapped) alphabet for an alignment consesus sequence.
00172 
00173         This just looks at the sequences we have, checks their type, and
00174         returns as appropriate type which seems to make sense with the
00175         sequences we've got.
00176         """
00177         #Start with the (un-gapped version of) the alignment alphabet
00178         a = Alphabet._get_base_alphabet(self.alignment._alphabet)
00179 
00180         #Now check its compatible with all the rest of the sequences
00181         for record in self.alignment:
00182             #Get the (un-gapped version of) the sequence's alphabet
00183             alt =  Alphabet._get_base_alphabet(record.seq.alphabet)
00184             if not isinstance(alt, a.__class__):
00185                 raise ValueError \
00186                 ("Alignment contains a sequence with an incompatible alphabet.")
00187 
00188         #Check the ambiguous character we are going to use in the consensus
00189         #is in the alphabet's list of valid letters (if defined).
00190         if hasattr(a, "letters") and a.letters is not None \
00191         and ambiguous not in a.letters:
00192             #We'll need to pick a more generic alphabet...
00193             if isinstance(a, IUPAC.IUPACUnambiguousDNA):
00194                 if ambiguous in IUPAC.IUPACUnambiguousDNA().letters:
00195                     a = IUPAC.IUPACUnambiguousDNA()
00196                 else:
00197                     a = Alphabet.generic_dna
00198             elif isinstance(a, IUPAC.IUPACUnambiguousRNA):
00199                 if ambiguous in IUPAC.IUPACUnambiguousRNA().letters:
00200                     a = IUPAC.IUPACUnambiguousRNA()
00201                 else:
00202                     a = Alphabet.generic_rna
00203             elif isinstance(a, IUPAC.IUPACProtein):
00204                 if ambiguous in IUPAC.ExtendedIUPACProtein().letters:
00205                     a = IUPAC.ExtendedIUPACProtein()
00206                 else:
00207                     a = Alphabet.generic_protein
00208             else:
00209                 a = Alphabet.single_letter_alphabet
00210         return a

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo._pair_replacement (   self,
  seq1,
  seq2,
  weight1,
  weight2,
  start_dict,
  ignore_chars 
) [private]
Compare two sequences and generate info on the replacements seen.

Arguments:
o seq1, seq2 - The two sequences to compare.
o weight1, weight2 - The relative weights of seq1 and seq2.
o start_dict - The dictionary containing the starting replacement
info that we will modify.
o ignore_chars - A list of characters to ignore when calculating
replacements (ie. '-').

Returns:
o A replacment dictionary which is modified from initial_dict with
the information from the sequence comparison.

Definition at line 263 of file AlignInfo.py.

00263 
00264                           start_dict, ignore_chars):
00265         """Compare two sequences and generate info on the replacements seen.
00266 
00267         Arguments:
00268         o seq1, seq2 - The two sequences to compare.
00269         o weight1, weight2 - The relative weights of seq1 and seq2.
00270         o start_dict - The dictionary containing the starting replacement
00271         info that we will modify.
00272         o ignore_chars - A list of characters to ignore when calculating
00273         replacements (ie. '-').
00274 
00275         Returns:
00276         o A replacment dictionary which is modified from initial_dict with
00277         the information from the sequence comparison.
00278         """
00279         # loop through each residue in the sequences
00280         for residue_num in range(len(seq1)):
00281             residue1 = seq1[residue_num]
00282             try:
00283                 residue2 = seq2[residue_num]
00284             # if seq2 is shorter, then we just stop looking at replacements
00285             # and return the information
00286             except IndexError:
00287                 return start_dict
00288 
00289             # if the two residues are characters we want to count
00290             if (residue1 not in ignore_chars) and (residue2 not in ignore_chars):
00291                 try:
00292                     # add info about the replacement to the dictionary,
00293                     # modified by the sequence weights
00294                     start_dict[(residue1, residue2)] += weight1 * weight2
00295                                          
00296                 # if we get a key error, then we've got a problem with alphabets
00297                 except KeyError:
00298                     raise ValueError("Residues %s, %s not found in alphabet %s"
00299                                      % (residue1, residue2,
00300                                         self.alignment._alphabet))
00301 
00302         return start_dict
00303 

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo.dumb_consensus (   self,
  threshold = .7,
  ambiguous = "X",
  consensus_alpha = None,
  require_multiple = 0 
)
Output a fast consensus sequence of the alignment.

This doesn't do anything fancy at all. It will just go through the
sequence residue by residue and count up the number of each type
of residue (ie. A or G or T or C for DNA) in all sequences in the
alignment. If the percentage of the most common residue type is
greater then the passed threshold, then we will add that residue type,
otherwise an ambiguous character will be added.

This could be made a lot fancier (ie. to take a substitution matrix
into account), but it just meant for a quick and dirty consensus.

Arguments:
o threshold - The threshold value that is required to add a particular
atom.
o ambiguous - The ambiguous character to be added when the threshold is
not reached.
o consensus_alpha - The alphabet to return for the consensus sequence.
If this is None, then we will try to guess the alphabet.
o require_multiple - If set as 1, this will require that more than
1 sequence be part of an alignment to put it in the consensus (ie.
not just 1 sequence and gaps).

Definition at line 41 of file AlignInfo.py.

00041 
00042                        consensus_alpha = None, require_multiple = 0):
00043         """Output a fast consensus sequence of the alignment.
00044 
00045         This doesn't do anything fancy at all. It will just go through the
00046         sequence residue by residue and count up the number of each type
00047         of residue (ie. A or G or T or C for DNA) in all sequences in the
00048         alignment. If the percentage of the most common residue type is
00049         greater then the passed threshold, then we will add that residue type,
00050         otherwise an ambiguous character will be added.
00051 
00052         This could be made a lot fancier (ie. to take a substitution matrix
00053         into account), but it just meant for a quick and dirty consensus.
00054 
00055         Arguments:
00056         o threshold - The threshold value that is required to add a particular
00057         atom.
00058         o ambiguous - The ambiguous character to be added when the threshold is
00059         not reached.
00060         o consensus_alpha - The alphabet to return for the consensus sequence.
00061         If this is None, then we will try to guess the alphabet.
00062         o require_multiple - If set as 1, this will require that more than
00063         1 sequence be part of an alignment to put it in the consensus (ie.
00064         not just 1 sequence and gaps).
00065         """
00066         # Iddo Friedberg, 1-JUL-2004: changed ambiguous default to "X"
00067         consensus = ''
00068 
00069         # find the length of the consensus we are creating
00070         con_len = self.alignment.get_alignment_length()
00071 
00072         # go through each seq item
00073         for n in range(con_len):
00074             # keep track of the counts of the different atoms we get
00075             atom_dict = {}
00076             num_atoms = 0
00077 
00078             for record in self.alignment._records:
00079                 # make sure we haven't run past the end of any sequences
00080                 # if they are of different lengths
00081                 if n < len(record.seq):
00082                     if record.seq[n] != '-' and record.seq[n] != '.':
00083                         if record.seq[n] not in atom_dict:
00084                             atom_dict[record.seq[n]] = 1
00085                         else:
00086                             atom_dict[record.seq[n]] += 1
00087 
00088                         num_atoms = num_atoms + 1
00089 
00090             max_atoms = []
00091             max_size = 0
00092 
00093             for atom in atom_dict:
00094                 if atom_dict[atom] > max_size:
00095                     max_atoms = [atom]
00096                     max_size = atom_dict[atom]
00097                 elif atom_dict[atom] == max_size:
00098                     max_atoms.append(atom)
00099 
00100             if require_multiple and num_atoms == 1:
00101                 consensus += ambiguous
00102             elif (len(max_atoms) == 1) and ((float(max_size)/float(num_atoms))
00103                                          >= threshold):
00104                 consensus += max_atoms[0]
00105             else:
00106                 consensus += ambiguous
00107 
00108         # we need to guess a consensus alphabet if one isn't specified
00109         if consensus_alpha is None:
00110             consensus_alpha = self._guess_consensus_alphabet(ambiguous)
00111 
00112         return Seq(consensus, consensus_alpha)

Here is the call graph for this function:

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo.gap_consensus (   self,
  threshold = .7,
  ambiguous = "X",
  consensus_alpha = None,
  require_multiple = 0 
)
Same as dumb_consensus(), but allows gap on the output.

Things to do: Let the user define that with only one gap, the result
character in consensus is gap. Let the user select gap character, now
it takes the same is input.

Definition at line 114 of file AlignInfo.py.

00114 
00115                        consensus_alpha = None, require_multiple = 0):
00116         """Same as dumb_consensus(), but allows gap on the output.
00117 
00118         Things to do: Let the user define that with only one gap, the result
00119         character in consensus is gap. Let the user select gap character, now
00120         it takes the same is input.
00121         """
00122         # Iddo Friedberg, 1-JUL-2004: changed ambiguous default to "X"
00123         consensus = ''
00124 
00125         # find the length of the consensus we are creating
00126         con_len = self.alignment.get_alignment_length()
00127 
00128         # go through each seq item
00129         for n in range(con_len):
00130             # keep track of the counts of the different atoms we get
00131             atom_dict = {}
00132             num_atoms = 0
00133 
00134             for record in self.alignment._records:
00135                 # make sure we haven't run past the end of any sequences
00136                 # if they are of different lengths
00137                 if n < len(record.seq):
00138                     if record.seq[n] not in atom_dict:
00139                         atom_dict[record.seq[n]] = 1
00140                     else:
00141                         atom_dict[record.seq[n]] += 1
00142 
00143                     num_atoms += 1
00144 
00145             max_atoms = []
00146             max_size = 0
00147 
00148             for atom in atom_dict:
00149                 if atom_dict[atom] > max_size:
00150                     max_atoms = [atom]
00151                     max_size = atom_dict[atom]
00152                 elif atom_dict[atom] == max_size:
00153                     max_atoms.append(atom)
00154 
00155             if require_multiple and num_atoms == 1:
00156                 consensus += ambiguous
00157             elif (len(max_atoms) == 1) and ((float(max_size)/float(num_atoms))
00158                                          >= threshold):
00159                 consensus += max_atoms[0]
00160             else:
00161                 consensus += ambiguous
00162 
00163         # we need to guess a consensus alphabet if one isn't specified
00164         if consensus_alpha is None:
00165             #TODO - Should we make this into a Gapped alphabet?
00166             consensus_alpha = self._guess_consensus_alphabet(ambiguous)
00167 
00168         return Seq(consensus, consensus_alpha)
          

Here is the call graph for this function:

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo.get_column (   self,
  col 
)

Definition at line 603 of file AlignInfo.py.

00603 
00604     def get_column(self,col):
00605         return self.alignment.get_column(col)

def Bio.Align.AlignInfo.SummaryInfo.information_content (   self,
  start = 0,
  end = None,
  e_freq_table = None,
  log_base = 2,
  chars_to_ignore = [] 
)
Calculate the information content for each residue along an alignment.

Arguments:
o start, end - The starting an ending points to calculate the
information content. These points should be relative to the first
sequence in the alignment, starting at zero (ie. even if the 'real'
first position in the seq is 203 in the initial sequence, for
the info content, we need to use zero). This defaults to the entire
length of the first sequence.
o e_freq_table - A FreqTable object specifying the expected frequencies
for each letter in the alphabet we are using (e.g. {'G' : 0.4,
'C' : 0.4, 'T' : 0.1, 'A' : 0.1}). Gap characters should not be
included, since these should not have expected frequencies.
o log_base - The base of the logathrim to use in calculating the
information content. This defaults to 2 so the info is in bits.
o chars_to_ignore - A listing of characterw which should be ignored
in calculating the info content.

Returns:
o A number representing the info content for the specified region.

Please see the Biopython manual for more information on how information
content is calculated.

Definition at line 435 of file AlignInfo.py.

00435 
00436                             chars_to_ignore = []):
00437         """Calculate the information content for each residue along an alignment.
00438 
00439         Arguments:
00440         o start, end - The starting an ending points to calculate the
00441         information content. These points should be relative to the first
00442         sequence in the alignment, starting at zero (ie. even if the 'real'
00443         first position in the seq is 203 in the initial sequence, for
00444         the info content, we need to use zero). This defaults to the entire
00445         length of the first sequence.
00446         o e_freq_table - A FreqTable object specifying the expected frequencies
00447         for each letter in the alphabet we are using (e.g. {'G' : 0.4,
00448         'C' : 0.4, 'T' : 0.1, 'A' : 0.1}). Gap characters should not be
00449         included, since these should not have expected frequencies.
00450         o log_base - The base of the logathrim to use in calculating the
00451         information content. This defaults to 2 so the info is in bits.
00452         o chars_to_ignore - A listing of characterw which should be ignored
00453         in calculating the info content.
00454 
00455         Returns:
00456         o A number representing the info content for the specified region.
00457 
00458         Please see the Biopython manual for more information on how information
00459         content is calculated.
00460         """
00461         # if no end was specified, then we default to the end of the sequence
00462         if end is None:
00463             end = len(self.alignment._records[0].seq)
00464         
00465         if start < 0 or end > len(self.alignment._records[0].seq):
00466             raise ValueError \
00467                   ("Start (%s) and end (%s) are not in the range %s to %s"
00468                    % (start, end, 0, len(self.alignment._records[0].seq)))
00469         # determine random expected frequencies, if necessary
00470         random_expected = None
00471         if not e_freq_table:
00472             #TODO - What about ambiguous alphabets?
00473             base_alpha = Alphabet._get_base_alphabet(self.alignment._alphabet)
00474             if isinstance(base_alpha, Alphabet.ProteinAlphabet):
00475                 random_expected = Protein20Random
00476             elif isinstance(base_alpha, Alphabet.NucleotideAlphabet):
00477                 random_expected = Nucleotide4Random
00478             else:
00479                 errstr = "Error in alphabet: not Nucleotide or Protein, "
00480                 errstr += "supply expected frequencies"
00481                 raise ValueError(errstr)
00482             del base_alpha
00483         elif not isinstance(e_freq_table, FreqTable.FreqTable):
00484             raise ValueError("e_freq_table should be a FreqTable object")
00485             
00486 
00487         # determine all of the letters we have to deal with
00488         all_letters = self._get_all_letters()
00489         for char in chars_to_ignore:
00490             all_letters = all_letters.replace(char, '')
00491 
00492         info_content = {}
00493         for residue_num in range(start, end):
00494             freq_dict = self._get_letter_freqs(residue_num,
00495                                                self.alignment._records,
00496                                                all_letters, chars_to_ignore)
00497             # print freq_dict,
00498             column_score = self._get_column_info_content(freq_dict,
00499                                                          e_freq_table,
00500                                                          log_base,
00501                                                          random_expected)
00502 
00503             info_content[residue_num] = column_score
00504         # sum up the score
00505         total_info = sum(info_content.itervalues())
00506         # fill in the ic_vector member: holds IC for each column
00507         for i in info_content:
00508             self.ic_vector[i] = info_content[i]
00509         return total_info

Here is the call graph for this function:

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo.pos_specific_score_matrix (   self,
  axis_seq = None,
  chars_to_ignore = [] 
)
Create a position specific score matrix object for the alignment.

This creates a position specific score matrix (pssm) which is an
alternative method to look at a consensus sequence.

Arguments:
o chars_to_ignore - A listing of all characters not to include in
the pssm.  If the alignment alphabet declares a gap character,
then it will be excluded automatically.
o axis_seq - An optional argument specifying the sequence to
put on the axis of the PSSM. This should be a Seq object. If nothing
is specified, the consensus sequence, calculated with default
parameters, will be used.

Returns:
o A PSSM (position specific score matrix) object.

Definition at line 357 of file AlignInfo.py.

00357 
00358                                   chars_to_ignore = []):
00359         """Create a position specific score matrix object for the alignment.
00360 
00361         This creates a position specific score matrix (pssm) which is an
00362         alternative method to look at a consensus sequence.
00363 
00364         Arguments:
00365         o chars_to_ignore - A listing of all characters not to include in
00366         the pssm.  If the alignment alphabet declares a gap character,
00367         then it will be excluded automatically.
00368         o axis_seq - An optional argument specifying the sequence to
00369         put on the axis of the PSSM. This should be a Seq object. If nothing
00370         is specified, the consensus sequence, calculated with default
00371         parameters, will be used.
00372 
00373         Returns:
00374         o A PSSM (position specific score matrix) object.
00375         """
00376         # determine all of the letters we have to deal with
00377         all_letters = self._get_all_letters()
00378         assert all_letters
00379 
00380         if not isinstance(chars_to_ignore, list):
00381             raise TypeError("chars_to_ignore should be a list.")
00382 
00383         # if we have a gap char, add it to stuff to ignore
00384         if isinstance(self.alignment._alphabet, Alphabet.Gapped):
00385             chars_to_ignore.append(self.alignment._alphabet.gap_char)
00386         
00387         for char in chars_to_ignore:
00388             all_letters = all_letters.replace(char, '')
00389 
00390         if axis_seq:
00391             left_seq = axis_seq
00392             assert len(axis_seq) == self.alignment.get_alignment_length()
00393         else:
00394             left_seq = self.dumb_consensus()
00395 
00396         pssm_info = []
00397         # now start looping through all of the sequences and getting info
00398         for residue_num in range(len(left_seq)):
00399             score_dict = self._get_base_letters(all_letters)
00400             for record in self.alignment._records:
00401                 try:
00402                     this_residue = record.seq[residue_num]
00403                 # if we hit an index error we've run out of sequence and
00404                 # should not add new residues
00405                 except IndexError:
00406                     this_residue = None
00407                     
00408                 if this_residue and this_residue not in chars_to_ignore:
00409                     weight = record.annotations.get('weight', 1.0)
00410                     try:
00411                         score_dict[this_residue] += weight
00412                     # if we get a KeyError then we have an alphabet problem
00413                     except KeyError:
00414                         raise ValueError("Residue %s not found in alphabet %s"
00415                                      % (this_residue,
00416                                         self.alignment._alphabet))
00417 
00418             pssm_info.append((left_seq[residue_num],
00419                               score_dict))
00420 
00421 
00422         return PSSM(pssm_info)
                    

Here is the call graph for this function:

Here is the caller graph for this function:

def Bio.Align.AlignInfo.SummaryInfo.replacement_dictionary (   self,
  skip_chars = [] 
)
Generate a replacement dictionary to plug into a substitution matrix

This should look at an alignment, and be able to generate the number
of substitutions of different residues for each other in the
aligned object.

Will then return a dictionary with this information:
{('A', 'C') : 10, ('C', 'A') : 12, ('G', 'C') : 15 ....}

This also treats weighted sequences. The following example shows how
we calculate the replacement dictionary. Given the following
multiple sequence alignments:

GTATC  0.5
AT--C  0.8
CTGTC  1.0

For the first column we have:
('A', 'G') : 0.5 * 0.8 = 0.4
('C', 'G') : 0.5 * 1.0 = 0.5
('A', 'C') : 0.8 * 1.0 = 0.8

We then continue this for all of the columns in the alignment, summing
the information for each substitution in each column, until we end
up with the replacement dictionary.

Arguments:
o skip_chars - A list of characters to skip when creating the dictionary.
For instance, you might have Xs (screened stuff) or Ns, and not want
to include the ambiguity characters in the dictionary.

Definition at line 211 of file AlignInfo.py.

00211 
00212     def replacement_dictionary(self, skip_chars = []):
00213         """Generate a replacement dictionary to plug into a substitution matrix
00214         
00215         This should look at an alignment, and be able to generate the number
00216         of substitutions of different residues for each other in the
00217         aligned object.
00218 
00219         Will then return a dictionary with this information:
00220         {('A', 'C') : 10, ('C', 'A') : 12, ('G', 'C') : 15 ....}
00221 
00222         This also treats weighted sequences. The following example shows how
00223         we calculate the replacement dictionary. Given the following
00224         multiple sequence alignments:
00225 
00226         GTATC  0.5
00227         AT--C  0.8
00228         CTGTC  1.0
00229 
00230         For the first column we have:
00231         ('A', 'G') : 0.5 * 0.8 = 0.4
00232         ('C', 'G') : 0.5 * 1.0 = 0.5
00233         ('A', 'C') : 0.8 * 1.0 = 0.8
00234 
00235         We then continue this for all of the columns in the alignment, summing
00236         the information for each substitution in each column, until we end
00237         up with the replacement dictionary.
00238 
00239         Arguments:
00240         o skip_chars - A list of characters to skip when creating the dictionary.
00241         For instance, you might have Xs (screened stuff) or Ns, and not want
00242         to include the ambiguity characters in the dictionary.
00243         """
00244         # get a starting dictionary based on the alphabet of the alignment
00245         rep_dict, skip_items = self._get_base_replacements(skip_chars)
00246 
00247         # iterate through each record
00248         for rec_num1 in range(len(self.alignment._records)):
00249             # iterate through each record from one beyond the current record
00250             # to the end of the list of records
00251             for rec_num2 in range(rec_num1 + 1, len(self.alignment._records)):
00252                 # for each pair of records, compare the sequences and add
00253                 # the pertinent info to the dictionary
00254                 rep_dict = self._pair_replacement(
00255                     self.alignment._records[rec_num1].seq,
00256                     self.alignment._records[rec_num2].seq,
00257                     self.alignment._records[rec_num1].annotations.get('weight',1.0),
00258                     self.alignment._records[rec_num2].annotations.get('weight',1.0),
00259                     rep_dict, skip_items)
00260 
00261         return rep_dict

Here is the call graph for this function:


Member Data Documentation

Definition at line 37 of file AlignInfo.py.

Definition at line 38 of file AlignInfo.py.


The documentation for this class was generated from the following file: