Golan Yona and Michael Levitt, Stanford University
In search of global principles that may explain the organization of the space of all possible proteins, we study all known protein sequences and structures. In this paper we present a global map of the protein space based on our analysis. Our protein space contains all protein sequences in a non-redundant (NR) database, which includes all major sequence databases. Using the PSI-BLAST procedure, we defined 4670 clusters of related sequences in this space. Of these clusters, 1421 are centered on a sequence of known structure. All 4670 clusters were then compared using either a structure metric (when 3D structures are known) or a novel sequence profile metric. These scores were used to define a unified and consistent metric between all clusters. Two schemes were employed to organize these clusters into a meta-organization. The first uses a graph theory method and groups the clusters into a hierarchical organization. This organization extends our ability to predict the structure and function of many proteins beyond what is possible with existing tools for sequence analysis. The second uses a variation on a multidimensional scaling technique to embed the clusters in a low-dimensional real space. This last approach resulted in a projection of the protein space onto a 2D plane that provides us with a bird’s-eye view of the protein space. Based on this map, we suggest a list of possible target sequences with unknown structure that are likely to adopt new, unknown folds.
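
To make the idea of a hierarchical meta-organization concrete, the following minimal sketch groups a handful of clusters given only a pairwise distance matrix. It is an illustration under stated assumptions, not the procedure used in this work: the abstract does not specify the graph theory method, so plain average-linkage agglomerative clustering is substituted here, and the function name `average_linkage`, the cluster labels, and the toy distances are all hypothetical.

```python
from itertools import combinations


def average_linkage(dist, names):
    """Agglomerative clustering with average linkage (a stand-in method).

    dist  -- symmetric matrix (list of lists) of pairwise distances
    names -- one label per row of dist
    Returns the merge history as (members_a, members_b, mean_distance) tuples.
    """
    # Each active group maps an id to the indices of the items it contains.
    members = {i: [i] for i in range(len(names))}

    def mean_dist(a, b):
        # Mean of all pairwise distances between items of group a and group b.
        total = sum(dist[i][j] for i in members[a] for j in members[b])
        return total / (len(members[a]) * len(members[b]))

    merges = []
    next_id = len(names)
    while len(members) > 1:
        # Pick the pair of active groups that are closest on average.
        a, b = min(combinations(sorted(members), 2), key=lambda p: mean_dist(*p))
        merges.append((
            tuple(names[i] for i in members[a]),
            tuple(names[i] for i in members[b]),
            mean_dist(a, b),
        ))
        # Replace the two groups by their union under a fresh id.
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return merges


# Toy distances between four hypothetical clusters (not data from the paper).
names = ["A", "B", "C", "D"]
dist = [
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 8.0, 9.0],
    [6.0, 8.0, 0.0, 2.0],
    [7.0, 9.0, 2.0, 0.0],
]
merges = average_linkage(dist, names)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

In this toy example A and B merge first, then C and D, and the two pairs join last. In the setting described above, the distance matrix would instead hold the unified scores between clusters derived from the structure metric and the sequence profile metric.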