Usama Fayyad, Cory Reina and P. S. Bradley, Microsoft Research
Iterative refinement clustering algorithms (e.g. K-Means, EM) converge to one of numerous local minima. It is known that they are especially sensitive to initial conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The refined initial starting condition leads to convergence to "better" local minima. The procedure is applicable to a wide class of clustering algorithms for both discrete and continuous data. We demonstrate the application of this method to the Expectation Maximization (EM) clustering algorithm and show that refined initial points indeed lead to improved solutions. Refinement run time is considerably lower than the time required to cluster the full database. The method is scalable and can be coupled with a scalable clustering algorithm to address the large-scale clustering in data mining.