Ron Bekkerman, Hema Raghavan, James Allan, Koji Eguchi.
Document clustering is traditionally tackled from the perspective of grouping documents that are topically similar. However, many other criteria for clustering documents can be considered: for example, documents' genre or the author's mood. We propose an interactive scheme for clustering document collections, based on any criterion of the user's preference. The user holds an active position in the clustering process: first, she chooses the types of features suitable to the underlying task, leading to a task-specific document representation. She can then provide examples of features, if such examples come readily to mind: when clustering by the author's sentiment, for instance, words like `perfect', `mediocre', and `awful' are intuitively good features. The algorithm proceeds iteratively, and the user can fix errors made by the clustering system at the end of each iteration. This interactive clustering method demonstrates excellent results on clustering by sentiment, substantially outperforming an SVM trained on a large amount of labeled data. Even when features are not provided because they are not intuitively obvious to the user (e.g., what would be good features for clustering by genre using part-of-speech trigrams?), our multi-modal clustering method performs significantly better than $k$-means and Latent Dirichlet Allocation (LDA).
Subjects: 12. Machine Learning and Discovery; 13. Natural Language Processing
Submitted: Oct 12, 2006