Current Issues in Markup-Based Knowledge Extraction

Udo Kruschwitz

Extracting content from Web pages can be useful for a number of reasons. Our motivation is to help users search for documents in subdomains of the Web such as company sites and intranets. Unlike online product catalogues, the data sources we are interested in are heterogeneous in nature. A model that reflects the underlying semantic structure of the document collection can be very helpful, but a domain model that can simply be plugged into such a system is difficult to obtain. We have been working on this problem for some time, and this paper reports our ongoing work in the field of markup-based knowledge extraction. Markup is used to identify conceptual information, which enables us to build a simple domain model automatically. Such a model can be used to enhance standard search facilities by engaging the user in a system-initiated dialogue. Another aspect of ongoing research is the improvement of the domain model using ideas adopted from collaborative filtering.
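
As a rough illustration of how markup might be used to identify conceptual information, the following minimal sketch collects the words that occur inside selected HTML elements and keeps those seen in more than one kind of element. The abstract does not specify any of this: the tag list, the two-context threshold, and the names MarkupTermCollector and extract_concepts are assumptions made for the example, and the sketch should not be read as the paper's actual procedure.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical choice of markup contexts; the abstract does not list any.
CONTEXT_TAGS = {"title", "h1", "h2", "h3", "b", "strong", "em", "a"}


class MarkupTermCollector(HTMLParser):
    """Record, for each word, the set of markup contexts it appears in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []               # stack of currently open context tags
        self.contexts = defaultdict(set)  # word -> set of tag names

    def handle_starttag(self, tag, attrs):
        if tag in CONTEXT_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag (and anything nested in it),
        # which tolerates HTML with missing end tags.
        if tag in self.open_tags:
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        for word in data.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word:
                for tag in self.open_tags:
                    self.contexts[word].add(tag)


def extract_concepts(html, min_contexts=2):
    """Return words found in at least `min_contexts` distinct markup contexts.

    The threshold is an assumption for illustration only.
    """
    collector = MarkupTermCollector()
    collector.feed(html)
    collector.close()
    return sorted(
        word for word, tags in collector.contexts.items()
        if len(tags) >= min_contexts
    )


if __name__ == "__main__":
    page = (
        "<html><head><title>Library services</title></head>"
        "<body><h1>Library</h1><p>Opening <b>hours</b> of the library.</p>"
        "</body></html>"
    )
    print(extract_concepts(page))  # prints ['library']
```

In the sample page, "library" appears in both the title and a heading, so it is kept, while "services" and "hours" each occur in only one context and are dropped. Terms selected this way could serve as nodes of a simple domain model, though how the model is actually structured is left to the paper itself.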

