In this post, I discuss the

**criterion for selecting parameters**of an Hidden Markov Model (HMM) for optimum training. If you still do not properly understand whats an HMM, please refer this article on HMMs. In order to classify data, one HMM is needed per class.

Assume that we want to have a Discrete HMM represented by

**λ( A, B, π)**.

**O**be the set of observations and

**N**to be number of states in a HMM and

**M**be the number of unique symbols.

**P(O|**

**λ) = P = likelihood that an observation matches the given HMM model**.

The following are some of the criteria in selecting parameters in Hidden Markov Models (taken from Bell System Journal):

**Initial Estimates of A and B**: We know that HMM classifier can be trained using Baum-Welch algorithm. For the given set of observations sequences (**O1**,**O2**, ......) in Baum-Welch algorithm, λ is re-estimated iteratively such that**P**is maximized. This depends on the initial parameters of HMM i.e. λ.**Different values of A, B and****π****gives different likelihoods**. Different values of**λ**give different results. There is no one right way to select initial estimates. Hence, they can be of any random values( however, the matrices**A, B and π**should sum to 1 along the rows).**HMM structure and number of states**: Structure of an HMM has an effect on performance of the HMM. For example in speech recognition, a HMM which is modeled from left to right is preferred.**There also is no need for a state to traverse to each and every state in the model**. i.e. Few elements of**A**can be**zero**. If**N**is too large reliable determination of**A**and**B**can become hard. If**N**is large, also**Viterbi Decoding**becomes hard. There is also no right way to choose N as it need not be related to a "observable phenomenon" in the problem.**Multiple Observation sequences:**Multiple observation sequences (from different training examples of same class) need to be used. Mixing training samples from different sources is a good idea.**Constraints on A, B matrices during training**: The matrices**A**,**B**and**π**can take any values (if they follow the constraints such as along each row of a matrix sum of elements needs to be 1). Its found that a HMM where, a state can traverse to all the states is not so flexible. At the same time, a state transition should also not occur "within" the state or "only" to next state. Such restrictions might constrain the level to which the model can be trained. For example in speech models, the transition is usually from left to right, so the matrix elements are zero below the diagonal. If the matrix B is unconstrained and the data set for training is limited, it is likely that b_{j}(k)=0. There are some solutions for this problem in the journal.**Multiple estimates of A, B and averaging**: It is known that different initial values of**λ**is likely to give different likelihoods for a testing set. So, different values of**λ**needs to implemented and averaged (average values of matrix elements of**A**,**B**and**π**). However, it was observed that results of the averaged model was not better than other best solutions computed. So, picking the model that has low memory constraint and gives good results should be picked.