
python-biopython  1.60
Bio.HMM.Trainer.BaumWelchTrainer Class Reference
Inherits Bio.HMM.Trainer.AbstractTrainer.


Public Member Functions

def __init__
def train
def update_transitions
def update_emissions
def log_likelihood
def estimate_params
def ml_estimator

Detailed Description

Trainer that uses the Baum-Welch algorithm to estimate parameters.

Use this trainer when the state paths of the training sequences are
unknown, and the model parameters must be estimated from the observed
emissions alone.

This uses the Baum-Welch algorithm, first described in
Baum, L.E. 1972. Inequalities 3:1-8.
The implementation is based on the description in 'Biological Sequence
Analysis' by Durbin et al., section 3.3.

The algorithm is guaranteed to converge to a local maximum, but not
necessarily to the global maximum, so use it with care!

Definition at line 142 of file Trainer.py.
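A minimal usage sketch follows. It is illustrative only: the alphabet
classes and training data are hypothetical, and the builder calls
(allow_all_transitions, set_equal_probabilities, get_markov_model) are
taken from the companion Bio.HMM.MarkovModel module as of this release.

from Bio.Alphabet import Alphabet
from Bio.Seq import Seq
from Bio.HMM.MarkovModel import MarkovModelBuilder
from Bio.HMM.Trainer import TrainingSequence, BaumWelchTrainer

# hypothetical two-state, two-letter alphabets
class StateAlphabet(Alphabet):
    letters = ['1', '2']

class EmissionAlphabet(Alphabet):
    letters = ['A', 'B']

# a model with rough initial estimates that training can build from
builder = MarkovModelBuilder(StateAlphabet(), EmissionAlphabet())
builder.allow_all_transitions()
builder.set_equal_probabilities()
model = builder.get_markov_model()

# unknown state paths: pair each emission sequence with an empty path
training_seqs = [TrainingSequence(Seq('ABBA', EmissionAlphabet()),
                                  Seq('', StateAlphabet()))]

def stop_training(log_likelihood_change, num_iterations):
    # stop once the improvement is small, or after enough rounds
    return log_likelihood_change < 0.01 or num_iterations >= 10

trainer = BaumWelchTrainer(model)
trained_model = trainer.train(training_seqs, stop_training)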


Constructor & Destructor Documentation

def Bio.HMM.Trainer.BaumWelchTrainer.__init__(self, markov_model)
Initialize the trainer.

Arguments:

o markov_model -- The model we are going to estimate parameters for.
This should have the parameters with some initial estimates that
we can build from.

Reimplemented from Bio.HMM.Trainer.AbstractTrainer.

Definition at line 157 of file Trainer.py.

    def __init__(self, markov_model):
        """Initialize the trainer.

        Arguments:

        o markov_model -- The model we are going to estimate parameters for.
        This should have the parameters with some initial estimates that
        we can build from.
        """
        AbstractTrainer.__init__(self, markov_model)



Member Function Documentation

def Bio.HMM.Trainer.AbstractTrainer.estimate_params(self, transition_counts, emission_counts) [inherited]
Get maximum likelihood estimates of the transition and emission
probabilities.

Arguments:

o transition_counts -- A dictionary with the total number of counts
of transitions between two states.

o emission_counts -- A dictionary with the total number of counts
of emissions of a particular emission letter by a state letter.

This then returns the maximum likelihood estimators for the
transitions and emissions, estimated by formula 3.18 in
Durbin et al:

a_{kl} = A_{kl} / sum_{l'}(A_{kl'})
e_{k}(b) = E_{k}(b) / sum_{b'}(E_{k}(b'))

Returns:
Transition and emission dictionaries containing the maximum
likelihood estimators.

Definition at line 63 of file Trainer.py.

    def estimate_params(self, transition_counts, emission_counts):
        """Get maximum likelihood estimates of the transition and emission
        probabilities.

        Arguments:

        o transition_counts -- A dictionary with the total number of counts
        of transitions between two states.

        o emission_counts -- A dictionary with the total number of counts
        of emissions of a particular emission letter by a state letter.

        This then returns the maximum likelihood estimators for the
        transitions and emissions, estimated by formula 3.18 in
        Durbin et al:

        a_{kl} = A_{kl} / sum_{l'}(A_{kl'})
        e_{k}(b) = E_{k}(b) / sum_{b'}(E_{k}(b'))

        Returns:
        Transition and emission dictionaries containing the maximum
        likelihood estimators.
        """
        # normalize each set of counts into probability estimates
        ml_transitions = self.ml_estimator(transition_counts)
        ml_emissions = self.ml_estimator(emission_counts)

        return ml_transitions, ml_emissions

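As a worked sketch of formula 3.18, with hypothetical counts for a
two-state model (trainer stands for any trainer instance; the method only
normalizes the dictionaries it is given):

transition_counts = {('1', '1'): 8.0, ('1', '2'): 2.0,
                     ('2', '1'): 3.0, ('2', '2'): 7.0}
emission_counts = {('1', 'A'): 9.0, ('1', 'B'): 1.0,
                   ('2', 'A'): 4.0, ('2', 'B'): 6.0}

ml_transitions, ml_emissions = trainer.estimate_params(transition_counts,
                                                       emission_counts)
# each entry is normalized over the items sharing its first letter, e.g.
# ml_transitions[('1', '1')] == 8.0 / (8.0 + 2.0) == 0.8
# ml_emissions[('2', 'B')] == 6.0 / (4.0 + 6.0) == 0.6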

def Bio.HMM.Trainer.AbstractTrainer.log_likelihood(self, probabilities) [inherited]
Calculate the log likelihood of the training sequences.

Arguments:

o probabilities -- A list of the probabilities of each training
sequence under the current parameters, calculated using the forward
algorithm.

Definition at line 48 of file Trainer.py.

    def log_likelihood(self, probabilities):
        """Calculate the log likelihood of the training sequences.

        Arguments:

        o probabilities -- A list of the probabilities of each training
        sequence under the current parameters, calculated using the forward
        algorithm.
        """
        total_likelihood = 0
        for probability in probabilities:
            total_likelihood += math.log(probability)

        return total_likelihood

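For example, given hypothetical forward-algorithm probabilities, the
result is just the sum of their natural logarithms:

import math

probabilities = [1e-10, 5e-12]  # hypothetical per-sequence probabilities
total = trainer.log_likelihood(probabilities)
assert total == math.log(1e-10) + math.log(5e-12)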

def Bio.HMM.Trainer.AbstractTrainer.ml_estimator(self, counts) [inherited]
Calculate the maximum likelihood estimator.

This can calculate maximum likelihoods for both transitions
and emissions.

Arguments:

o counts -- A dictionary of the counts for each item.

See estimate_params for a description of the formula used for
calculation.

Definition at line 91 of file Trainer.py.

    def ml_estimator(self, counts):
        """Calculate the maximum likelihood estimator.

        This can calculate maximum likelihoods for both transitions
        and emissions.

        Arguments:

        o counts -- A dictionary of the counts for each item.

        See estimate_params for a description of the formula used for
        calculation.
        """
        # get an ordered list of all items
        all_ordered = counts.keys()
        all_ordered.sort()

        ml_estimation = {}

        # the total counts for the current letter we are on
        cur_letter = None
        cur_letter_counts = 0

        for cur_item in all_ordered:
            # if we are on a new letter (ie. the first letter of the tuple)
            if cur_item[0] != cur_letter:
                # set the new letter we are working with
                cur_letter = cur_item[0]

                # count up the total counts for this letter
                cur_letter_counts = counts[cur_item]

                # add counts for all other items with the same first letter
                cur_position = all_ordered.index(cur_item) + 1

                # keep adding while we have the same first letter or until
                # we get to the end of the ordered list
                while (cur_position < len(all_ordered) and
                       all_ordered[cur_position][0] == cur_item[0]):
                    cur_letter_counts += counts[all_ordered[cur_position]]
                    cur_position += 1
            # otherwise we've already got the total counts for this letter

            # now calculate the ml and add it to the estimation
            cur_ml = float(counts[cur_item]) / float(cur_letter_counts)
            ml_estimation[cur_item] = cur_ml

        return ml_estimation

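A small sketch of the grouping behaviour with hypothetical counts; each
item is normalized by the total for its first tuple element:

counts = {('1', '1'): 3.0, ('1', '2'): 1.0, ('2', '2'): 4.0}
estimates = trainer.ml_estimator(counts)
# state '1' totals 4.0 and state '2' totals 4.0, so:
# estimates == {('1', '1'): 0.75, ('1', '2'): 0.25, ('2', '2'): 1.0}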

def Bio.HMM.Trainer.BaumWelchTrainer.train(self, training_seqs, stopping_criteria, dp_method = ScaledDPAlgorithms)
Estimate the parameters using training sequences.

The algorithm for this is taken from Durbin et al. p. 64, which is a
good reference for the details of what is going on.

Arguments:

o training_seqs -- A list of TrainingSequence objects to be used
for estimating the parameters.

o stopping_criteria -- A function that, when passed the change in
log likelihood and the number of iterations performed, indicates
whether we should stop the estimation iterations.

o dp_method -- A class specifying the dynamic programming
implementation we should use to calculate the forward and
backward variables. By default, we use the scaling method.

Definition at line 169 of file Trainer.py.

    def train(self, training_seqs, stopping_criteria,
              dp_method = ScaledDPAlgorithms):
        """Estimate the parameters using training sequences.

        The algorithm for this is taken from Durbin et al. p. 64, which
        is a good reference for the details of what is going on.

        Arguments:

        o training_seqs -- A list of TrainingSequence objects to be used
        for estimating the parameters.

        o stopping_criteria -- A function that, when passed the change in
        log likelihood and the number of iterations performed, indicates
        whether we should stop the estimation iterations.

        o dp_method -- A class specifying the dynamic programming
        implementation we should use to calculate the forward and
        backward variables. By default, we use the scaling method.
        """
        prev_log_likelihood = None
        num_iterations = 1

        while 1:
            transition_count = self._markov_model.get_blank_transitions()
            emission_count = self._markov_model.get_blank_emissions()

            # remember all of the sequence probabilities
            all_probabilities = []

            for training_seq in training_seqs:
                # calculate the forward and backward variables
                DP = dp_method(self._markov_model, training_seq)
                forward_var, seq_prob = DP.forward_algorithm()
                backward_var = DP.backward_algorithm()

                all_probabilities.append(seq_prob)

                # update the counts for transitions and emissions
                transition_count = self.update_transitions(transition_count,
                                                           training_seq,
                                                           forward_var,
                                                           backward_var,
                                                           seq_prob)
                emission_count = self.update_emissions(emission_count,
                                                       training_seq,
                                                       forward_var,
                                                       backward_var,
                                                       seq_prob)

            # update the markov model with the new probabilities
            ml_transitions, ml_emissions = \
                self.estimate_params(transition_count, emission_count)
            self._markov_model.transition_prob = ml_transitions
            self._markov_model.emission_prob = ml_emissions

            cur_log_likelihood = self.log_likelihood(all_probabilities)

            # if we have previously calculated the log likelihood (ie.
            # not the first round), see if we can finish
            if prev_log_likelihood is not None:
                # XXX log likelihoods are negatives -- am I calculating
                # the change properly, or should I use the negatives...
                # I'm not sure at all if this is right.
                log_likelihood_change = abs(abs(cur_log_likelihood) -
                                            abs(prev_log_likelihood))

                # check whether we have completed enough iterations to have
                # a good estimation
                if stopping_criteria(log_likelihood_change, num_iterations):
                    break

            # set up for another round of iterations
            prev_log_likelihood = cur_log_likelihood
            num_iterations += 1

        return self._markov_model

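The stopping_criteria callable and the optional dp_method override are
the only knobs; a minimal sketch (the threshold values are arbitrary, and
LogDPAlgorithms is the log-space alternative from
Bio.HMM.DynamicProgramming as of this release):

from Bio.HMM.DynamicProgramming import LogDPAlgorithms

def stopping_criteria(log_likelihood_change, num_iterations):
    # converged, or too many rounds -- both thresholds are arbitrary
    return log_likelihood_change < 0.01 or num_iterations >= 1000

trained_model = trainer.train(training_seqs, stopping_criteria,
                              dp_method=LogDPAlgorithms)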

def Bio.HMM.Trainer.BaumWelchTrainer.update_emissions(self, emission_counts, training_seq, forward_vars, backward_vars, training_seq_prob)
Add the contribution of a new training sequence to the emissions.

Arguments:

o emission_counts -- A dictionary of the current counts for the
emissions.

o training_seq -- The training sequence we are working with.

o forward_vars -- Probabilities calculated using the forward
algorithm.

o backward_vars -- Probabilities calculated using the backward
algorithm.

o training_seq_prob -- The probability of the current sequence.

This calculates E_{k}(b) (the estimated emission counts for
emission letter b from state k) using formula 3.21 in Durbin et al.

Definition at line 300 of file Trainer.py.

    def update_emissions(self, emission_counts, training_seq,
                         forward_vars, backward_vars, training_seq_prob):
        """Add the contribution of a new training sequence to the emissions.

        Arguments:

        o emission_counts -- A dictionary of the current counts for the
        emissions.

        o training_seq -- The training sequence we are working with.

        o forward_vars -- Probabilities calculated using the forward
        algorithm.

        o backward_vars -- Probabilities calculated using the backward
        algorithm.

        o training_seq_prob -- The probability of the current sequence.

        This calculates E_{k}(b) (the estimated emission counts for
        emission letter b from state k) using formula 3.21 in Durbin et al.
        """
        # loop over the possible combinations of state path letters
        for k in training_seq.states.alphabet.letters:
            # now loop over all of the possible emissions
            for b in training_seq.emissions.alphabet.letters:
                expected_times = 0
                # finally loop over the entire training sequence
                for i in range(len(training_seq.emissions)):
                    # only count the forward and backward probability if the
                    # emission at the position is the same as b
                    if training_seq.emissions[i] == b:
                        # f_{k}(i) b_{k}(i)
                        expected_times += (forward_vars[(k, i)] *
                                           backward_vars[(k, i)])

                # add to E_{k}(b)
                emission_counts[(k, b)] += (float(expected_times) /
                                            training_seq_prob)

        return emission_counts

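The accumulation mirrors formula 3.21 position by position; a standalone
sketch with hypothetical toy numbers, independent of the class:

# hypothetical forward/backward values for state 'k' at positions 0 and 1
forward_vars = {('k', 0): 0.3, ('k', 1): 0.12}
backward_vars = {('k', 0): 0.2, ('k', 1): 1.0}
emissions = ['A', 'B']  # the training sequence's emission letters
seq_prob = 0.12         # P(sequence) from the forward algorithm

# E_k('A'): sum of f_k(i) * b_k(i) over positions emitting 'A', over P(seq)
e_k_A = sum(forward_vars[('k', i)] * backward_vars[('k', i)]
            for i in range(len(emissions))
            if emissions[i] == 'A') / seq_prob
# only position 0 emits 'A', so e_k_A == 0.3 * 0.2 / 0.12 == 0.5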

def Bio.HMM.Trainer.BaumWelchTrainer.update_transitions(self, transition_counts, training_seq, forward_vars, backward_vars, training_seq_prob)
Add the contribution of a new training sequence to the transitions.

Arguments:

o transition_counts -- A dictionary of the current counts for the
transitions.

o training_seq -- The training sequence we are working with.

o forward_vars -- Probabilities calculated using the forward
algorithm.

o backward_vars -- Probabilities calculated using the backward
algorithm.

o training_seq_prob -- The probability of the current sequence.

This calculates A_{kl} (the estimated transition counts from state
k to state l) using formula 3.20 in Durbin et al.

Definition at line 247 of file Trainer.py.

    def update_transitions(self, transition_counts, training_seq,
                           forward_vars, backward_vars, training_seq_prob):
        """Add the contribution of a new training sequence to the transitions.

        Arguments:

        o transition_counts -- A dictionary of the current counts for the
        transitions.

        o training_seq -- The training sequence we are working with.

        o forward_vars -- Probabilities calculated using the forward
        algorithm.

        o backward_vars -- Probabilities calculated using the backward
        algorithm.

        o training_seq_prob -- The probability of the current sequence.

        This calculates A_{kl} (the estimated transition counts from state
        k to state l) using formula 3.20 in Durbin et al.
        """
        # set up the transition and emission probabilities we are using
        transitions = self._markov_model.transition_prob
        emissions = self._markov_model.emission_prob

        # loop over the possible combinations of state path letters
        for k in training_seq.states.alphabet.letters:
            for l in self._markov_model.transitions_from(k):
                estimated_counts = 0
                # now loop over the entire training sequence
                for i in range(len(training_seq.emissions) - 1):
                    # the forward value of k at the current position
                    forward_value = forward_vars[(k, i)]

                    # the backward value of l in the next position
                    backward_value = backward_vars[(l, i + 1)]

                    # the probability of a transition from k to l
                    trans_value = transitions[(k, l)]

                    # the probability of getting the emission at the next pos
                    emm_value = emissions[(l, training_seq.emissions[i + 1])]

                    estimated_counts += (forward_value * trans_value *
                                         emm_value * backward_value)

                # update the transition approximation
                transition_counts[(k, l)] += (float(estimated_counts) /
                                              training_seq_prob)

        return transition_counts

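Formula 3.20 multiplies four quantities at each position; a standalone
sketch of one position's contribution for a single k -> l pair, with
hypothetical values:

# hypothetical values for a single position i and transition k -> l
f_k_i = 0.3       # forward value of k at position i
a_kl = 0.6        # current transition probability k -> l
e_l_next = 0.25   # emission probability of x_{i+1} from state l
b_l_next = 0.4    # backward value of l at position i + 1
seq_prob = 0.09   # P(sequence) from the forward algorithm

# this position's contribution to A_{kl}, scaled by 1 / P(sequence)
contribution = f_k_i * a_kl * e_l_next * b_l_next
A_kl_part = contribution / seq_prob  # == 0.018 / 0.09 == 0.2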


The documentation for this class was generated from the following file: Trainer.py