Gary D. Stormo and David Haussler
We consider the problem of parsing a sequence into different classes of subsequences. Two common examples are finding the exons and introns in genomic sequences and identifying the secondary structure domains of protein sequences. In each case there are various types of evidence that are relevant to the classification, but none are completely reliable, so we expect some weighted average of all the evidence to provide improved classifications. For example, in the problem of identifying coding regions in genomic DNA, the combined use of evidence such as codon bias and splice junction patterns can give more reliable predictions than either type of evidence alone. We show three main results: 1. For a given weighting of the evidence a dynamic programming algorithm returns the optimal parse and any number of sub-optimal parses. 2. For a given weighting of the evidence a dynamic programming algorithm determines the probability of the optimal parse and any number of sub-optimal parses under a natural Boltzmann- Gibbs distribution over the set of possible parses. 3. Given a set of sequences with known correct parses, a dynamic programming algorithm allows one to apply gradient descent to obtain the weights that maximize the probability of the correct parses of these sequences.