Craig G. Nevill-Manning, Komal S. Sethi, Thomas D. Wu, and Douglas L. Brutlag
Discrete motifs that discriminate functional classes of proteins are useful for classifying new sequences, capturing structural constraints, and identifying protein subclasses. Although the space of such motifs can grow exponentially with sequence length and with the number of sequences, we show that in practice it usually does not, and we describe a technique that infers motifs from aligned protein sequences by exhaustively searching this space. Our method generates sequence motifs over a wide range of recall and precision and chooses a representative motif using a score that we derive from both statistical and information-theoretic frameworks. Finally, we show that the selected motifs perform well in practice: they classify unseen sequences with extremely high precision and infer protein subclasses that correspond to known biochemical classes.
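To make the terms "discrete motif", "recall", and "precision" concrete, the following is a minimal illustrative sketch in Python. It is not the exhaustive search or the scoring function described in this paper. It assumes an ungapped alignment, builds a single motif by collecting the residues observed in each column, and measures how well that motif separates two labeled sets. The function names (`column_motif`, `motif_to_regex`, `precision_recall`) and the toy sequences are hypothetical and chosen only for illustration.

```python
import re


def column_motif(aligned):
    """Build one discrete motif from ungapped, equal-length aligned sequences.

    Assumption for this sketch: each motif position is simply the set of
    residues observed in that alignment column.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    return [frozenset(seq[i] for seq in aligned) for i in range(length)]


def motif_to_regex(motif):
    """Render the motif as a regular expression, one element per column."""
    parts = []
    for allowed in motif:
        residues = "".join(sorted(allowed))
        # A single allowed residue is a literal; several form a character class.
        parts.append(residues if len(residues) == 1 else f"[{residues}]")
    return "".join(parts)


def precision_recall(pattern, positives, negatives):
    """Precision and recall of a motif over labeled sequences.

    A sequence counts as matched if the pattern occurs anywhere in it.
    """
    rx = re.compile(pattern)
    tp = sum(1 for seq in positives if rx.search(seq))
    fp = sum(1 for seq in negatives if rx.search(seq))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / len(positives) if positives else 0.0
    return precision, recall


# Toy data, invented for illustration only.
aligned_block = ["CAHC", "CGHC", "CAHC"]
pattern = motif_to_regex(column_motif(aligned_block))  # "C[AG]HC"

family = ["MKCAHCLL", "TTCGHCAA", "PPCSHCQQ"]   # hypothetical class members
others = ["MKCAHCGG", "AAAAAA"]                 # hypothetical non-members
precision, recall = precision_recall(pattern, family, others)
```

On this toy data the motif matches two of the three family members and one of the two non-members, so both precision and recall are 2/3. Widening a column's residue set would typically raise recall at the cost of precision, which is the trade-off the representative-motif score in the abstract is meant to balance.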