Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, Andrew McCallum
Author disambiguation is the problem of determining whether records in a publications database refer to the same person. A common supervised machine learning approach is to build a classifier to predict whether a pair of records is coreferent, followed by a clustering step to enforce transitivity. However, this approach ignores powerful evidence obtainable by examining sets (rather than pairs) of records, such as the number of publications or co-authors an author has. In this paper we propose a representation that enables these first-order features over sets of records. We then propose a training algorithm well-suited to this representation that is (1) error-driven in that training examples are generated from incorrect predictions on the training data, and (2) rank-based in that the classifier induces a ranking over candidate predictions. We evaluate our algorithms on three author disambiguation datasets and demonstrate error reductions of up to 60% over the standard binary classification approach.
Subjects: 1.10 Information Retrieval; 12. Machine Learning and Discovery
Submitted: May 15, 2007