Jun Zhu, Roland Lüthy, and Charles E. Lawrence
Protein sequence databases grow larger each day. A common challenge is to predict the structures or functions of the proteins whose sequences they contain. This is easy when a sequence shares direct similarity with a well-characterized protein. When there is no direct similarity, we have to rely on a third sequence or a model as an intermediate to link the two proteins. We developed a new model-based method, called Bayesian search, as a means of connecting two distantly related proteins. We compared the Bayesian search model with pairwise and multiple sequence comparison methods on structural databases, using structural similarity as the criterion for relationship. The results show that Bayesian search can link more distantly related sequence pairs than the other methods, collectively and consistently over large protein families. At an average of one error per query against the SCOP database PDB40D-B, Bayesian search found 36.5% of related pairs, PSI-BLAST found 32.6%, and the Smith-Waterman method found 25%. Examples are presented to show that the alignments predicted by Bayesian search agree well with structural alignments. False positives found by Bayesian search at low cutoff values are also analyzed.
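The comparison above reports the fraction of related pairs found at an average of one error per query. A minimal sketch of how such a figure could be computed is given below, assuming that hits from all queries are pooled and ranked by score and that the cutoff is lowered until the error budget is used up. The function name `coverage_at_epq`, its parameters, and the toy hit list are illustrative assumptions and are not taken from the paper.

```python
def coverage_at_epq(hits, n_queries, n_true_pairs, epq=1.0):
    """Fraction of truly related pairs found at a given errors-per-query rate.

    hits: iterable of (score, is_related) pooled over all queries, where a
          higher score means a more confident hit (an assumption here).
    n_queries: number of queries that produced the hits.
    n_true_pairs: total number of related pairs in the benchmark.
    epq: allowed average number of false positives per query.
    """
    max_errors = epq * n_queries  # total false positives allowed
    true_found = 0
    errors = 0
    # Walk down the ranked list, i.e. progressively lower the score cutoff.
    for _score, is_related in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_related:
            true_found += 1
        else:
            if errors + 1 > max_errors:
                break  # the next false positive would exceed the budget
            errors += 1
    return true_found / n_true_pairs


# Toy data (made up for illustration): 2 queries, 4 related pairs in total.
toy_hits = [(9.0, True), (8.0, False), (7.0, True),
            (6.0, False), (5.0, False), (4.0, True)]
print(coverage_at_epq(toy_hits, n_queries=2, n_true_pairs=4, epq=1.0))
```

On this toy list, two false positives are allowed at one error per query, so the cutoff stops before the third false hit and two of the four related pairs are counted. Tied scores are not handled specially in this sketch.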