Rada Mihalcea and Dan I. Moldovan, Southern Methodist University
The unavailability of very large corpora with semantically disambiguated words is a major limitation in text processing research. For example, statistical methods for word sense disambiguation of free text are known to achieve high accuracy results when large corpora are available to develop context rules, to train and test them. This paper presents a novel approach to automatically generate arbitrarily large corpora for word senses. The method is based on (1) the information provided in WordNet, used to formulate queries consisting of synonyms or definitions of word senses, and (2) the information gathered from Internet using existing search engines. The method was tested on 120 word senses and a precision of 91% was observed.