Andrea Califano, Gustavo Stolovitzky, and Yuhai Tu, IBM T. J. Watson Research Center
Several microarray technologies that monitor the level of expression of a large number of genes have recently emerged. Given DNA-microarray data for a set of cells characterized by a given phenotype and for a set of control cells, an important problem is to identify "patterns" of gene expression that can be used to predict cell phenotype. The potential number of such patterns is exponential in the number of genes. In this paper, we propose a solution to this problem based on a supervised learning algorithm, which differs substantially from previous schemes. It couples a complex, non-linear similarity metric, which maximizes the probability of discovering discriminative gene expression patterns, and a pattern discovery algorithm called SPLASH. The latter discovers efficiently and deterministically all statistically significant gene expression patterns in the phenotype set. Statistical significance is evaluated based on the probability of a pattern to occur by chance in the control set. Finally, a greedy set covering algorithm is used to select an optimal subset of statistically significant patterns, which form the basis for a standard likelihood ratio classification scheme. We analyze data from 60 human cancer cell lines using this method, and compare our results with those of other supervised learning schemes. Different phenotypes are studied. These include cancer morphologies (such as melanoma), molecular targets (such as mutations in the p53 gene), and therapeutic targets related to the sensitivity to an anticancer compounds. We also analyze a synthetic data set that shows that this technique is especially well suited for the analysis of sub-phenotype mixtures. For complex phenotypes, such as p53, our method produces an encouragingly low rate of false positives and false negatives and seems to outperform the others. Similar low rates are reported when predicting the efficacy of experimental anticancer compounds. This counts among the first reported studies where drug efficacy has been successfully predicted from large-scale expression data analysis.