Lobna Karoui, Marie-Aude Aufaure, Nacera Bennacer
In this paper, we focus on the extraction and evaluation of ontological concepts from HTML documents. To improve this process, we propose an unsupervised hierarchical clustering algorithm called Contextual Concept Discovery (CCD), which applies the K-means partitioning algorithm incrementally and is guided by a structural context. Our context exploits the HTML structure and the location of words to select the semantically closest co-occurrents of each word and to improve word weighting. Guided by this context definition, we perform an incremental clustering that refines the context of each word cluster to obtain semantically coherent extracted concepts. The CCD algorithm can run either automatically or interactively with a user. Its last function is to provide complementary support that eases the evaluation task. This functionality is based on a large collection of web documents and on several context definitions deduced from it by applying a linguistic and a documentary analysis. We evaluate our algorithm on HTML documents related to the tourism domain. Our results show how our context-based algorithm improves the conceptual quality and the relevance of the extracted ontological concepts, and how our credibility degree criterion assists domain experts and facilitates the evaluation task.
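The general idea of context-guided incremental clustering can be illustrated with a minimal Python sketch. It is not the CCD algorithm itself: the tag weights in `TAG_WEIGHTS`, the `max_size` stopping rule, the Euclidean distance, the deterministic seeding, and all function names are assumptions chosen for illustration. Words that co-occur in the same HTML block receive a co-occurrence weight that depends on the enclosing tag, and K-means is re-applied to every cluster that is still too large.

```python
from collections import defaultdict
import math

# Hypothetical weights (an assumption, not taken from the paper): words that
# share a title or heading count as closer co-occurrents than words that
# share a paragraph.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "p": 1.0}


def context_vectors(blocks):
    """Map each word to a tag-weighted co-occurrence vector over the vocabulary."""
    vocab = sorted({w for _, text in blocks for w in text.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for tag, text in blocks:
        words = set(text.lower().split())
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for w in words:
            for other in words:
                if other != w:
                    vectors[w][index[other]] += weight
    return vectors


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(words, vectors, k, iterations=20):
    """Plain K-means, seeded deterministically with the first k words."""
    centroids = [list(vectors[w]) for w in words[:k]]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(len(centroids))]
        for w in words:
            best = min(range(len(centroids)),
                       key=lambda c: _dist(vectors[w], centroids[c]))
            clusters[best].append(w)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid when a cluster is empty
                dim = len(centroids[c])
                centroids[c] = [sum(vectors[w][d] for w in members) / len(members)
                                for d in range(dim)]
    return [members for members in clusters if members]


def incremental_clustering(words, vectors, max_size=3, k=2):
    """Re-run K-means on every cluster that is still larger than max_size."""
    if len(words) <= max_size:
        return [words]
    parts = kmeans(words, vectors, k)
    if len(parts) < 2:  # K-means could not split this cluster any further
        return [words]
    result = []
    for part in parts:
        result.extend(incremental_clustering(part, vectors, max_size, k))
    return result


# Toy input: (tag, text) pairs standing in for parsed HTML blocks.
blocks = [
    ("title", "hotel room booking"),
    ("p", "hotel room price"),
    ("h1", "museum ticket tour"),
    ("p", "museum tour guide"),
]
vectors = context_vectors(blocks)
clusters = incremental_clustering(sorted(vectors), vectors, max_size=3)
```

In this toy input, "hotel" and "room" share both a title and a paragraph, so their mutual weight is higher than that of two words sharing only a paragraph. A real pipeline would replace the `(tag, text)` pairs with blocks produced by an HTML parser and would add the evaluation support described above, which this sketch does not cover.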
Subjects: 10. Knowledge Acquisition; 11.2 Ontologies
Submitted: Feb 19, 2007