Optimally Parsing a Sequence into Different Classes Based on Multiple Types of Evidence

Gary D. Stormo and David Haussler

We consider the problem of parsing a sequence into different classes of subsequences. Two common examples are finding the exons and introns in genomic sequences and identifying the secondary structure domains of protein sequences. In each case there are various types of evidence that are relevant to the classification, but none are completely reliable, so we expect some weighted average of all the evidence to provide improved classifications. For example, in the problem of identifying coding regions in genomic DNA, the combined use of evidence such as codon bias and splice junction patterns can give more reliable predictions than either type of evidence alone. We show three main results: 1. For a given weighting of the evidence a dynamic programming algorithm returns the optimal parse and any number of sub-optimal parses. 2. For a given weighting of the evidence a dynamic programming algorithm determines the probability of the optimal parse and any number of sub-optimal parses under a natural Boltzmann- Gibbs distribution over the set of possible parses. 3. Given a set of sequences with known correct parses, a dynamic programming algorithm allows one to apply gradient descent to obtain the weights that maximize the probability of the correct parses of these sequences.


This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.