Steven Salzberg, Xin Chen, John Henderson, and Kenneth Fasman
This study demonstrates the use of decision tree classifiers as the basis for a general gene-finding system. The system uses a dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimality property is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. In this study, the scoring functions were sets of decision trees and rules that were combined to give the probability estimate. Experimental results on a newly collected database of human DNA sequences are encouraging, and some new observations about the structure of classifiers for the gene-finding problem have emerged from this study. We also provide descriptions of a new probability chain model that produces very accurate filters to find donor and acceptor sites.