Marie-France Sagot, Alain Viari, and Henri Soldano
We present in this paper an algorithm for the multiple comparison of a set of protein sequences. Our approach is that of peptide matching and consists in looking for all the words that occur approximatively in at least q of the sequences in the set, where q is a parameter. Words are compared by using a reference object called a model, that is itself a word over the alphabet of the amino acids, and the comparison between a model and a word is based on w-length words instead of single symbols. This idea is similar to the one used in the Blast program in the case of pairwise comparisons. Two w-length words are considered to be related if an alignment without gaps of the two using a similarity matrix has a score greater than a certain threshold value t. In our case, we say that a k-length word u is an occurrence of a model m of the same length if every w-length subword of u is related to the corresponding subword of m in the sense given above. If a model m has occurrences in at least q of the sequences of the set, m is said to occur in the set. In percentage terms, the value of q may correspond to something as small as 5% of the sequences (search for recurrent words in a set of non homologuous proteins) or as high as 70-100% (establishment of a list of all similar words as a first step in a multiple alignment program). The algorithm presented here is an efficient and exact way of looking for all the models, of a fixed length k or of the greatest possible length kmax, that occur in a set of sequences. It can work with any kind of scoring matrix and an extension of the algorithm allows for the introduction of gaps between a model and its occurrences.