Mark W. Craven, Richard J. Mural, Loren J. Hauser, and Edward C. Uberbacher
An important open problem in molecular biology is how to use computational methods to understand the structure and function of proteins given only their primary sequences. We describe and evaluate an original machine-learning approach to classifying protein sequences according to their structural folding class. Our work is novel in two respects: we use a set of protein classes that have not previously been used for classifying primary sequences, and we use a unique set of attributes to represent protein sequences to the learners. We evaluate our approach by measuring its ability to correctly classify proteins that were not in its training set. We compare our input representation to a commonly used one, amino acid composition, and show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.
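For readers unfamiliar with the baseline representation, amino acid composition can be sketched as a fixed-length vector giving the fraction of each of the 20 standard residues in a sequence. The Python sketch below is illustrative only and is not taken from the paper: the function name, the residue ordering, and the choice to skip non-standard characters are all assumptions.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list of residue fractions, ordered as AMINO_ACIDS.

    Characters outside the 20 standard residues (e.g. 'X' or gaps) are
    ignored; this handling is an assumption, not the paper's procedure.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No recognizable residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only to show the shape of the output.
vector = amino_acid_composition("AACD")
```

A vector of this kind discards residue order entirely, which is why it serves as a simple point of comparison for richer sequence attributes.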