Liisa Holm and Chris Sander
The structures of nearly a thousand sequence-unique proteins represent only 300 different 3D shapes. Is structural resemblance between proteins with little sequence similarity the result of physical convergence to favourable folding patterns, or does it reflect a memory of common evolutionary history? Separating these two processes is important for organizing genome data in terms of protein families and for theoretical approaches to protein structure prediction by fold recognition techniques. Achieving this separation requires a combination of structural, sequence and functional analysis of proteins. For this purpose, we are developing a decision support system that scans heterogeneous sequence- and structure-related protein databases, and collects or calculates characters indicative of common functional constraints. The criteria include sequence homology, analysis of 3D clusters of conserved residues, conservation of active sites, and keyword analysis of biological function. Even without extensive refinement, applying a combination of these criteria to a test set representing all currently known protein structures yields 87% coverage with 7% false positives, compared to 53% coverage with 1D sequence criteria alone. Thus, the semiautomatic prototype system significantly enhances the efficiency of unifying families of functionally related proteins in spite of long evolutionary distances.
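As an illustration only, the following Python sketch shows one way the four criteria could be combined into a single decision and scored for coverage and false positives. It is not the system described above: the record type `PairEvidence`, the combination rule in `combined`, the reading of "false positives" as the fraction of accepted pairs that are unrelated, and the toy data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairEvidence:
    """Evidence for one pair of structurally similar proteins (hypothetical record)."""
    sequence_homology: bool
    conserved_3d_cluster: bool
    conserved_active_site: bool
    shared_function_keyword: bool
    truly_related: bool  # reference label, e.g. from manual curation


def sequence_only(e: PairEvidence) -> bool:
    """Baseline: accept a pair on 1D sequence evidence alone."""
    return e.sequence_homology


def combined(e: PairEvidence) -> bool:
    """Assumed rule: sequence homology, or at least two supporting criteria."""
    supporting = sum(
        [e.conserved_3d_cluster, e.conserved_active_site, e.shared_function_keyword]
    )
    return e.sequence_homology or supporting >= 2


def evaluate(pairs, rule):
    """Return (coverage, false-positive fraction) of a rule on labelled pairs.

    coverage       = accepted related pairs / all related pairs
    false positive = accepted unrelated pairs / all accepted pairs (assumed definition)
    """
    accepted = [p for p in pairs if rule(p)]
    true_pos = sum(p.truly_related for p in accepted)
    total_related = sum(p.truly_related for p in pairs)
    coverage = true_pos / total_related if total_related else 0.0
    false_pos = (len(accepted) - true_pos) / len(accepted) if accepted else 0.0
    return coverage, false_pos


# Invented toy data, not taken from the test set in the text.
TOY_PAIRS = [
    PairEvidence(True, True, True, True, truly_related=True),
    PairEvidence(False, True, True, False, truly_related=True),
    PairEvidence(False, True, False, True, truly_related=True),
    PairEvidence(False, False, False, True, truly_related=True),
    PairEvidence(False, True, True, False, truly_related=False),
    PairEvidence(False, False, False, False, truly_related=False),
]

if __name__ == "__main__":
    for name, rule in [("sequence only", sequence_only), ("combined", combined)]:
        coverage, false_pos = evaluate(TOY_PAIRS, rule)
        print(f"{name}: coverage={coverage:.2f}, false positives={false_pos:.2f}")
```

On this toy set the combined rule recovers related pairs that the sequence-only baseline misses, at the cost of accepting one unrelated pair. This is the kind of trade-off the coverage and false-positive figures summarize.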