Clustering Using Monte Carlo Cross-Validation

Padhraic Smyth

Finding the "right" number of clusters, k, for a data set is a difficult, and often ill-posed, problem. In a probabilistic clustering context, likelihood-ratios, penalized likelihoods, and Bayesian techniques are among the more popular techniques. In this paper a new cross-validated likelihood criterion is investigated for determining cluster structure. A practical clustering algorithm based on Monte Carlo cross-validation (MCCV) is introduced. The algorithm permits the data analyst to judge if there is strong evidence for a particular k, or perhaps weaker evidence over a sub-range of k values. Experimental results with Gaussian mixtures on real and simulated data suggest that MCCV provides genuine insight into cluster structure. v-fold cross-validation appears inferior to the penalized likelihood method (BIC), a Bayesian algorithm (AutoClass v2.0), and the new MCCV algorithm. Overall, MCCV and AutoClass appear the most reliable of the methods. MCCV provides the data-miner with a useful data-driven clustering tool which complements the fully Bayesian approach.


This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.