Kedar Bellare, Andrew McCallum
Supervised machine learning algorithms for information extraction generally require large amounts of labeled training data. In many cases where labeling data is burdensome, however, an incomplete database relevant to the task at hand may already exist. Records from this database can be used to approximately label text strings that express the same information. For tasks in which text strings do not follow the same format or layout, and may additionally contain extra information, obtaining a complete labeling can be problematic. This paper presents a method for training extractors that fill in the missing labels of a text sequence that has been partially labeled by simple high-precision heuristics. Furthermore, we improve the algorithm by using labeled fields from the database. In experiments with BibTeX records and research paper citation strings, we show a significant improvement in extraction accuracy over a baseline that relies only on the database for training data.
Subjects: 12. Machine Learning and Discovery; 1. Applications
Submitted: May 15, 2007
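
To make the partial-labeling idea in the abstract concrete, the following is a minimal Python sketch of one possible high-precision heuristic: a citation token is labeled with a database field only when that field's value occurs exactly once as a contiguous token span. The function names `tokenize` and `partial_label`, the exact-match rule, and the example record and citation are all assumptions made for illustration; they are not taken from the paper and should not be read as the authors' heuristics.

```python
import re
from typing import Dict, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    # Lowercased alphanumeric tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def partial_label(
    citation: str, record: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Label citation tokens with database field names where the match is
    unambiguous; every other token keeps the label None (unlabeled)."""
    tokens = tokenize(citation)
    labels: List[Optional[str]] = [None] * len(tokens)
    for field, value in record.items():
        span = tokenize(value)
        if not span:
            continue
        n = len(span)
        starts = [
            i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == span
        ]
        # Hypothetical high-precision rule: require exactly one exact match
        # that does not overlap tokens already claimed by another field.
        if len(starts) == 1:
            i = starts[0]
            if all(label is None for label in labels[i:i + n]):
                labels[i:i + n] = [field] * n
    return list(zip(tokens, labels))


# Hypothetical database record and citation string (illustrative only).
record = {"author": "J. Smith", "title": "Learning to Extract", "year": "2007"}
citation = (
    "Smith, J. and Lee, K. (2007). Learning to extract. "
    "In Proc. of Some Conf., pp. 10-20."
)
labeled = partial_label(citation, record)
```

In this example the year and title tokens receive labels, while the author tokens stay unlabeled because the citation writes the name in a different order than the record. Tokens with no database counterpart, such as the venue and page numbers, are also left unlabeled. These gaps are the missing labels that a trained extractor would then need to fill in.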