Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment

Yanhong Zhai, Bing Liu

This paper is concerned with the problem of structured data extraction from Web pages. The objective of the research is to automatically segment data records in a page, extract data items/fields from these records and store the extracted data in a database. In this paper, we first introduce the extraction problem, and then discuss the main existing approaches and their limitations. After that, we introduce a novel technique (called DEPTA) to automatically perform Web data extraction. The method consists of three steps: (1) identifying data records with similar patterns in a page, (2) aligning and extracting data items from the identified data records and (3) generating tree-based regular expressions to facilitate later extraction from other similar pages. The key innovation is the proposal of a new multiple tree alignment algorithm called partial tree alignment, which was found to be particularly suitable for Web data extraction. It is both highly accurate and efficient. This paper is based on our work published in KDD-03 and WWW-05.

Subjects: 1.10 Information Retrieval

This page is copyrighted by AAAI. All rights reserved. Your use of this site constitutes acceptance of all of AAAI's terms and conditions and privacy policy.