STALKER: Learning Wrappers for Semistructured, Web-based Information Sources

Ion Muslea, Steve Minton, and Craig Knoblock

Information mediators are systems capable of providing a unified view of several information sources. Central to any mediator that accesses Web-based sources is a set of wrappers that extract relevant information from Web pages. In this paper, we present a wrapper-induction algorithm that generates extraction rules for Web-based information sources. We introduce landmark automata, a formalism that describes classes of extraction rules. Our wrapper-induction algorithm, STALKER, generates extraction rules expressed as simple landmark grammars, a class of landmark automata that is more expressive than existing extraction languages. Based on just a few training examples, STALKER learns extraction rules for documents with multiple levels of embedding. The experimental results show that our approach successfully wraps classes of documents that cannot be wrapped by existing techniques.
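
To make the idea of a landmark-based extraction rule concrete, here is a minimal sketch of how such a rule might be applied to a page. It is an illustration only, not the STALKER induction algorithm or the landmark-automaton formalism itself. It assumes that a landmark is a literal string and that a rule is an ordered sequence of landmarks consumed left to right. The names `skip_to` and `extract`, and the sample page, are hypothetical.

```python
from typing import Optional, Sequence


def skip_to(text: str, landmarks: Sequence[str], start: int = 0) -> Optional[int]:
    """Consume each landmark in order, starting at `start`.

    Returns the index just past the last landmark, or None if any
    landmark is missing. (Hypothetical helper; landmarks are assumed
    to be plain literal strings.)
    """
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos


def extract(text: str, start_rule: Sequence[str], end_landmark: str) -> Optional[str]:
    """Return the text between the end of `start_rule` and `end_landmark`."""
    begin = skip_to(text, start_rule)
    if begin is None:
        return None
    end = text.find(end_landmark, begin)
    if end == -1:
        return None
    return text[begin:end]


# Hypothetical semistructured page fragment.
page = "<p>Name: <b>Yala</b><br>Cuisine: <i>Thai</i></p>"

print(extract(page, ["Cuisine:", "<i>"], "</i>"))  # prints "Thai"
```

In this sketch the rules are written by hand. A wrapper-induction system would instead learn the landmark sequences from a few labeled example pages.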
