AI & the World Wide Web

Recognizing Structure in Web Pages Using Similarity Queries

William W. Cohen, AT&T Labs - Research

We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the top-ranked structure is "meaningful" (that is, a structure that was used in a hand-coded "wrapper" for the page) nearly 70% of the time, improving on the 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or to maintain a wrapper as a Web page changes format. In these settings, this measure of performance can be improved to 85%.
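As a rough illustration of the kind of textual similarity the abstract refers to, the sketch below computes cosine similarity between TF-IDF vectors, the standard vector-space measure from information retrieval. It is not WHIRL itself and not the paper's structure-recognition method; the function names (`tfidf_vectors`, `cosine`), the whitespace tokenizer, and the exact weighting formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (term -> weight) per document.

    Assumed weighting: log(1 + tf) * log(N / df), with naive whitespace
    tokenization. Real systems typically also stem and drop stop words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: math.log(1 + tf[t]) * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Normalize to unit length so the dot product equals the cosine.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical text fragments, as might be extracted from Web pages.
fragments = ["AT&T Labs Research", "ATT Labs - Research", "World Wide Web pages"]
vecs = tfidf_vectors(fragments)
print(cosine(vecs[0], vecs[1]))  # shares "labs" and "research": score above 0
print(cosine(vecs[0], vecs[2]))  # no shared terms: score is 0
```

Under this measure, fragments that share rare terms score close to 1, while fragments with no terms in common score 0, so near-matching strings can be ranked without requiring exact equality.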


This page is copyrighted by AAAI. All rights reserved.