William Cohen and Wei Fan
One bottleneck in implementing a system that intelligently queries the Web is developing "wrappers"--programs that extract data from Web pages. Here we describe a method for learning general, page-independent heuristics for extracting data from HTML documents. The input to our learning system is a set of working wrapper programs, paired with HTML pages they correctly wrap. The output is ageneral procedure for extracting data that works for many formats and many pages. In experiments with a collection of 84 constrained but realistic extraction problems, we demonstrate that 30% of the problems can be handled perfectly by learned extraction heuristics, and around 50% can be handled acceptably. We also demonstrate that learned page-independent extraction heuristics can substantially improve the performance of methods for learning page-specific wrappers.