Terrance Goan, Nels Belson, and Oren Etzioni
The World Wide Web is a treasure trove of information. The Web’s sheer scale makes automatic location and extraction of information appealing. However, much of the information lies buried in documents designed for human consumption, such as home pages or product catalogs. Before software agents can extract nuggets of information from Web documents, they have to be able to recognize that information despite the multitude of formats in which it may appear. In this paper, we take a machine learning approach to the problem. We explain why existing grammar inference techniques face difficulties in this domain, present a new technique, and demonstrate its success on examples drawn from the Web, ranging from CMU Tech Report codes to bus schedules. We show that our algorithm learns target languages found on the Web from significantly fewer examples than previous methods require.