Roberto Navigli, Università di Roma "La Sapienza"
Linguistic resources are essential for the success of many AI tasks. Building a new lexical resource from scratch or combining heterogeneous resources is not only complex and time-consuming, but can also lead to inconsistent and redundant knowledge. In this paper, we present a novel method for the large-scale semantic enrichment of a computational linguistic resource. To this end, with the aid of a controlled vocabulary, we identified a set of representative concepts, i.e., a restricted but meaningful number of concepts from WordNet, such that each of them can replace any of its descendants in the taxonomic hierarchy without a substantial loss of information in natural language sentences (e.g., restaurant#1 is a representative for bistro#1 or cybercafe#1). We then manually enriched these representative concepts with collocations extracted from a variety of linguistic resources. After this manual step, representative concepts are still related to words rather than to concepts (e.g., for taxi#1: fare, passenger, driver, etc.). The final step is to disambiguate these terms automatically, using a word sense disambiguation (WSD) algorithm named Structural Semantic Interconnections (SSI). SSI is a knowledge-based WSD algorithm that performs particularly well when the words in a context are strongly semantically associated. As a result, the precision of this automatic disambiguation step is very high, to the point that residual disambiguation errors could be tolerated. In any case, since SSI provides semantic patterns to justify its sense choices, manual correction by human annotators would be considerably easier, yielding a significant speed-up in semantic annotation. Furthermore, SSI helps maintain the consistency of the lexical knowledge base.
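The notion of a representative concept can be illustrated with a minimal sketch. It assumes a toy hypernym table and a hand-picked set of representatives; the table `HYPERNYM`, the set `REPRESENTATIVES`, and the function `representative_of` are hypothetical names introduced only for illustration, and the hierarchy is a simplified stand-in for WordNet rather than a copy of it. The sketch only shows how a concept could be mapped to its closest representative ancestor; it does not reproduce the selection procedure or the SSI algorithm.

```python
from typing import Dict, Optional, Set

# Toy hypernym links (child -> parent). Illustrative assumption only:
# a simplified stand-in for the WordNet taxonomy, not its actual content.
HYPERNYM: Dict[str, str] = {
    "bistro#1": "restaurant#1",
    "cybercafe#1": "cafe#1",
    "cafe#1": "restaurant#1",
    "restaurant#1": "building#1",
}

# Hypothetical set of representative concepts.
REPRESENTATIVES: Set[str] = {"restaurant#1", "taxi#1"}


def representative_of(
    concept: str,
    hypernym: Dict[str, str] = HYPERNYM,
    representatives: Set[str] = REPRESENTATIVES,
) -> Optional[str]:
    """Return the closest representative at or above `concept`, or None."""
    seen: Set[str] = set()
    current: Optional[str] = concept
    # Walk up the hierarchy; `seen` guards against accidental cycles.
    while current is not None and current not in seen:
        if current in representatives:
            return current
        seen.add(current)
        current = hypernym.get(current)
    return None


if __name__ == "__main__":
    for c in ("bistro#1", "cybercafe#1", "taxi#1", "building#1"):
        print(c, "->", representative_of(c))
```

Under these assumptions, both bistro#1 and cybercafe#1 map to restaurant#1, a representative maps to itself, and a concept with no representative at or above it yields `None`.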