Ellen Riloff, Wendy Lehnert
Text processing for complex domains such as terrorism is complicated by the difficulty of being able to reliably distinguish relevant and irrelevant texts. We have discovered a simple and effective filter, the Relevancy Signatures Algorithm, and demonstrated its performance in the domain of terrorist event descriptions. The Relevancy Signatures Algorithm is based on the natural language processing technique of selective concept extraction, and relies on text representations that reflect predictable patterns of linguistic context. This paper describes text classification experiments conducted in the domain of terrorism using the MUC-3 text corpus. A customized dictionary of about 6,000 words provides the lexical knowledge base needed to discriminate relevant texts, and the CIRCUS sentence analyzer generates relevancy signatures as an effortless side-effect of its normal sentence analysis. Although we suspect that the training base available to us from the MUC-3 corpus may not be large enough to provide optimal training, we were nevertheless able to attain relevancy discriminations for significant levels of recall (ranging from 11% to 47%) with 100% precision in half of our test runs.