Liviu Badea, Doina Tilivea
Although clustering is probably the most frequently used tool for data mining gene expression data, existing clustering approaches face at least one of the following problems in this domain: a huge number of variables (genes) as compared to the number of samples, high noise levels, the inability to naturally deal with overlapping clusters, the instability of the resulting clusters w.r.t. the initialization of the algorithm as well as the difficulty in clustering genes and samples simultaneously. In this paper we show that all of these problems can be elegantly dealt with by using nonnegative matrix factorizations to cluster genes and samples simultaneously while allowing for bicluster overlaps and by employing Positive Tensor Factorization to perform a two-way meta-clustering of the biclusters produced in several different clustering runs (thereby addressing the above-mentioned instability). The application of our approach to a large lung cancer dataset proved computationally tractable and was able to recover the histological classification of the various cancer subtypes represented in the dataset.
Subjects: 1. Applications; 12. Machine Learning and Discovery
Submitted: Oct 4, 2006