Ion Muslea, Steve Minton, and Craig Knoblock
Information mediators are systems that provide a unified view of several information sources. Central to any mediator that accesses Web-based sources is a set of wrappers that extract the relevant information from Web pages. In this paper, we present a wrapper-induction algorithm that generates extraction rules for Web-based information sources. We introduce landmark automata, a formalism that describes classes of extraction rules. Our wrapper-induction algorithm, STALKER, generates extraction rules expressed as simple landmark grammars, a class of landmark automata that is more expressive than existing extraction languages. Based on just a few training examples, STALKER learns extraction rules for documents with multiple levels of embedding. The experimental results show that our approach successfully wraps classes of documents that cannot be wrapped by existing techniques.
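
To give a rough intuition for what a landmark-based extraction rule does, the following Python sketch locates a field by consuming a sequence of landmarks in order and then reading up to a closing landmark. It is only an illustration under our own assumptions: the function names `skip_to` and `extract`, the plain-string landmarks, and the sample page are all hypothetical, and the sketch covers neither the full landmark-automata formalism nor the way STALKER learns rules from examples.

```python
from typing import Optional, Sequence


def skip_to(page: str, landmarks: Sequence[str], start: int = 0) -> Optional[int]:
    """Consume the landmarks in order and return the index just past the last one.

    Returns None if any landmark is missing, i.e. the rule does not match.
    (Hypothetical helper, for illustration only.)
    """
    pos = start
    for landmark in landmarks:
        found = page.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos


def extract(page: str, start_rule: Sequence[str], end_landmark: str) -> Optional[str]:
    """Return the text between the end of start_rule and the next end_landmark."""
    begin = skip_to(page, start_rule)
    if begin is None:
        return None
    end = page.find(end_landmark, begin)
    if end == -1:
        return None
    return page[begin:end]


# Invented sample page; not taken from the paper's data.
page = "<p>Name: <b>Yala</b><p>Cuisine: Thai<p>Phone: <i>(800) 555-0100</i>"

name = extract(page, ["Name:", "<b>"], "</b>")
phone = extract(page, ["Phone:", "<i>"], "</i>")
fax = extract(page, ["Fax:"], "<")  # no such landmark, so the rule fails

print(name, phone, fax)
```

The point of the sketch is that a rule is a short sequence of landmarks rather than a fixed character offset, so it keeps working when the text surrounding a field varies from page to page.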