C. F. Aliferis, I. Tsamardinos, P. P. Massion, A. Statnikov, N. Fananapazir, and D. Hardin
This research explores machine learning methods for developing computer models that use gene expression data to distinguish between tumor and non-tumor, between metastatic and non-metastatic, and between histological subtypes of lung cancer. A second goal is to identify small sets of gene predictors and to study their properties in terms of stability, size, and relation to lung cancer. We apply four classification algorithms and two gene selection algorithms to a 12,600-oligonucleotide array dataset from 203 patients and normal human subjects. The resulting models exhibit excellent classification performance. Gene selection methods drastically reduce the number of genes necessary for classification. However, the selected genes differ greatly between gene selection methods. A statistical method for characterizing the causal relevance of selected genes is introduced and applied.