Semantic integration, link analysis and other forms of evidence detection often require recognition of multiple occurrences of a single name. However, names frequently occur in orthographic variations resulting from phonetic variations and transcription errors. The computational expense of similarity assessment algorithms usually precludes their application to all pairs of strings. Instead, it is typically necessary to use a high-recall, low-precision index to retrieve a smaller set of candidate matches to which the similarity assessment algorithm is then applied. This paper describes five algorithms for efficient candidate retrieval: Burkhart-Keller trees (BKT); filtered Burkhart-Keller trees (FBKT); partition filtering; ngrams; and Soundex. An empirical evaluation showed that no single algorithm performed best under all circumstances. When the source of name variations was purely orthographic, partition filtering generally performed best. When similarity assessment was based on phonetic similarity and the phonetic model was available, BKT and FBKT performed best. When the pronunciation model was unavailable, Soundex was best for k=0 (homonyms), and partition filtering and BKT were best for k>0. Unfortunately, the high-recall retrieval algorithms were multiple orders of magnitude more costly than the low-recall algorithms.
Subjects: 1.10 Information Retrieval; 13. Natural Language Processing