Golan Yona, Nathan Linial, Naftali Tishby, and Michal Linial
We investigate the space of all protein sequences. We combine the standard measures of similarity (SW, FASTA, BLAST), to associate with each sequence an exhaustive list of neighboring sequences. These lists induce a (weighted directed) graph whose vertices are the sequences. The weight of an edge connecting two sequences represents their degree of similarity. This graph encodes much of the fundamental properties of the sequence space. We look for clusters of related proteins in this graph. These clusters correspond to strongly connected sets of vertices. Two main ideas underlie our work: i) Interesting homologies among proteins can be deduced by transitivity. ii) Transitivity should be applied restrictively in order to prevent unrelated proteins from clustering together. Our analysis starts from a very conservative classification, based on very significant similarities, that has many classes. Subsequently, classes are merged to include less significant similarities. Merging is performed via a novel two phase algorithm. First, the algorithm identifies groups of possibly related clusters (based on transitivity and strong connectivity) using local considerations, and merges them. Then, a global test is applied to identify nuclei of strong relationships within these groups of clusters, and the classification is refined accordingly. This process takes place at varying thresholds of statistical significance, where at each step the algorithm is applied on the classes of the previous classification, to obtain the next one, at the more permissive threshold. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the space of all protein sequences into well defined groups of proteins. The results show that the automatically induced sets of proteins are closely correlated with natural biological families and super families. The hierarchical organization reveals inner sub-families that make up known families of proteins as well as many interesting relations between protein families. The hierarchical organization proposed may be considered as the first map of the space of all protein sequences. An interactive web site including the results of our analysis has been constructed.