D. J. States, N. Harris, and L. Hunter
Molecular sequence megaclassitication is a technique for automated protein sequence analysis and annotation. Implementation of the method has been limited by the need to store and randomly access a database of all the sequence pair similarities. More than 80,000 protein sequences are now present in the public _databases, and the pair similarity data table for the full protein sequence database requires over 1 gigabyte of storage. In this paper we present a com- Imtationally efficient representation of groups based on a graph theory approach where sequence clusters are described by a minimal spanning tree of highest scoring similarity pairs. This representation allows a classification of N proteins to be stored in order(N) memory. The use this minimal spanning tree representation simplifies analysis of groups, the description of group characteristics and the manual correction of artifacts resulting from false hits. The new tree representation also introduces new possibilities for artifact generation in sequence classiftcation. Metheds for detecting and removing these artifacts are discussed.