Nomi L. Harris, Lawrence Hunter, David J. States
We report on the development and application of an efficient unsupervised learning procedure for the classification of an unsegmented datastream, given a set of probabilistic binary similarity judgments between regions in the stream. Our method is effective on very large databases, and tolerates the presence of noise in the similarity judgements and in the extents of similar regions. We applied this method to the problem of finding the sequence-level building blocks of proteins. After verifying the effectiveuess of the clusterer by testing it on synthetic protein data with known evolutionary history, we applied the method to a large protein sequence database (a datastream of more than IO^7 elements) and found about 10,000 protein sequence classes. The motifs defined by these classes are of biological interest, and have the potential to supplement or replace the existing manual annotation of protein sequence databases.