Nicholas Kushmerick, University College Dublin
Recent work on Internet information integration assumes a library of wrappers, which are specialized information extraction procedures. Maintaining wrappers is difficult, because the formatting regularities on which they rely often change. The wrapper verification problem is to determine whether a wrapper is correct. Standard regression testing approaches are inappropriate, because both the formatting regularities and a site’s underlying content may change. We introduce RAPTURE, a fully implemented, domain-independent verification algorithm. RAPTURE uses well-motivated heuristics to compute the similarity between a wrapper’s expected and observed output. Experiments with 27 actual Internet sites show a substantial performance improvement over standard regression testing.
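
To illustrate why exact-match regression testing breaks down when a site's content changes, the following is a minimal sketch of similarity-based verification. It is not the RAPTURE algorithm itself: the feature set (`num_words`, `mean_word_len`, `digit_fraction`), the threshold `max_z`, the floor `min_spread`, and all function names are assumptions chosen for illustration only.

```python
from statistics import mean, stdev

def features(field: str) -> dict[str, float]:
    """Hypothetical surface features of one extracted field."""
    words = field.split()
    return {
        "num_words": float(len(words)),
        "mean_word_len": mean(len(w) for w in words) if words else 0.0,
        "digit_fraction": sum(c.isdigit() for c in field) / len(field) if field else 0.0,
    }

def learn_profile(known_good: list[str]) -> dict[str, tuple[float, float]]:
    """Summarize each feature by (mean, standard deviation) over known-correct output."""
    rows = [features(f) for f in known_good]
    profile = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        profile[name] = (mean(values), stdev(values) if len(values) > 1 else 0.0)
    return profile

def strict_regression_test(expected: list[str], observed: list[str]) -> bool:
    """Classic regression test: passes only if the output is byte-for-byte identical."""
    return expected == observed

def looks_similar(observed: list[str], profile: dict[str, tuple[float, float]],
                  max_z: float = 3.0, min_spread: float = 0.1) -> bool:
    """Accept the output if every average feature stays close to the learned profile."""
    if not observed:
        return False
    rows = [features(f) for f in observed]
    for name, (mu, sigma) in profile.items():
        observed_mean = mean(r[name] for r in rows)
        # min_spread (an assumed floor) avoids a zero tolerance when sigma is 0.
        if abs(observed_mean - mu) > max_z * max(sigma, min_spread):
            return False
    return True

# Output of a (hypothetical) price wrapper when it was known to be correct.
known_good = ["$12.99", "$7.50", "$103.00", "$45.25"]
profile = learn_profile(known_good)

# The site's content changed but the wrapper still works.
new_content = ["$8.75", "$230.10", "$19.99"]
# The site's formatting changed and the wrapper now extracts the wrong text.
broken_output = ["<td>Add to cart</td>", "<td>Click here for details</td>"]

print(strict_regression_test(known_good, new_content))  # exact match rejects valid output
print(looks_similar(new_content, profile))
print(looks_similar(broken_output, profile))
```

In this sketch the exact-match test rejects the new prices even though the wrapper is still correct, while the feature-based check accepts them and rejects the mis-extracted markup. The tolerance values are arbitrary and would need tuning against real wrapper output.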