Elizabeth Liddy and Woojin Paik
Within the field of Natural Language Processing, lexical disambiguation remains one of the toughest hurdles to overcome in the development of fully operational systems. As part of a larger document detection system (DR-LINK), we have implemented a computational approximation of word sense disambiguation by combining information from a machine-readable dictionary, local context, and corpus statistics. We use the Subject-Field Codes (SFCs) extracted from a machine-readable dictionary to produce a preliminary, multi-tagged semantic coding of the words in a text. We then apply local heuristics that evaluate the SFCs of ambiguous words to choose among the multiple SFCs. Choices that cannot be made using local heuristics are resolved by statistical evidence, namely an SFC correlation matrix generated by processing a corpus of 977 Wall Street Journal (WSJ) articles containing 442,059 words. The implementation was tested on a sample of 1638 words from the WSJ and selected the correct SFC 89% of the time. The resulting disambiguated SFC frequencies are summed and normalized to produce a weighted semantic vector representation of each text. These SFC vectors provide the basis on which the system automatically classifies texts as the first stage in DR-LINK.
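The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the DR-LINK implementation: the toy lexicon, the SFC labels, the correlation values, and the function names `disambiguate` and `sfc_vector` are all hypothetical. The local heuristic shown (prefer the single candidate SFC that also tags an unambiguous word in the same text) is an assumed stand-in for the heuristics the system actually applies.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> candidate Subject-Field Codes (SFCs).
LEXICON = {
    "bank": ["ECONOMICS", "GEOGRAPHY"],
    "interest": ["ECONOMICS", "PSYCHOLOGY"],
    "loan": ["ECONOMICS"],
    "score": ["MUSIC", "SPORTS"],
}

# Hypothetical corpus-derived correlations between pairs of SFCs.
CORRELATIONS = {
    frozenset(("ECONOMICS", "SPORTS")): 0.2,
    frozenset(("ECONOMICS", "MUSIC")): 0.05,
}


def correlation(a, b, matrix):
    """Look up a symmetric SFC-SFC correlation; unseen pairs score 0."""
    if a == b:
        return 1.0
    return matrix.get(frozenset((a, b)), 0.0)


def disambiguate(words, lexicon, matrix):
    """Assign one SFC to each word that appears in the lexicon."""
    # Multi-tagging: attach every candidate SFC to each known word.
    tagged = [(w, lexicon[w]) for w in words if w in lexicon]

    # Assumed local heuristic: SFCs of unambiguous words act as anchors.
    anchors = Counter(codes[0] for _, codes in tagged if len(codes) == 1)

    chosen, pending = {}, []
    for i, (_, codes) in enumerate(tagged):
        if len(codes) == 1:
            chosen[i] = codes[0]
            continue
        hits = [c for c in codes if anchors[c] > 0]
        if len(hits) == 1:
            chosen[i] = hits[0]   # resolved by local context
        else:
            pending.append(i)     # left for the statistical stage

    # Statistical fallback: pick the candidate that correlates most
    # strongly with the SFCs already chosen in this text.
    context = list(chosen.values())
    for i in pending:
        codes = tagged[i][1]
        chosen[i] = max(
            codes,
            key=lambda c: sum(correlation(c, o, matrix) for o in context),
        )

    return [(tagged[i][0], chosen[i]) for i in range(len(tagged))]


def sfc_vector(assignments):
    """Sum the chosen SFCs and normalize the counts to weights summing to 1."""
    counts = Counter(code for _, code in assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


result = disambiguate(["bank", "interest", "loan", "score"], LEXICON, CORRELATIONS)
print(result)
print(sfc_vector(result))  # {'ECONOMICS': 0.75, 'SPORTS': 0.25}
```

In this toy run, "bank" and "interest" are resolved locally because "loan" anchors ECONOMICS, while "score" has no local evidence and falls through to the correlation lookup. Splitting the work this way mirrors the ordering in the abstract: cheap local evidence is tried first, and corpus statistics are consulted only for the remaining ambiguous words.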