Alvaro E. Monge, Charles P. Elkan
To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are alternative designations of the same semantic entity. For example the addresses Dept. of Comput. Sci. and Eng., University of California, San Diego, 9500 Gilman Dr. Dept. O114; La Jolla; CA 92093 and UCSD, Computer Science and Engineering Department, CA 92093-0114 do designate the same department. This paper describes three field matching algorithms, and evaluates their performance on real-world datasets. One proposed method is the well-known Smith-Waterman algorithm for comparing DNA and protein sequences. Several applications of field matching in knowledge discovery are described briefly, including WEBFIND, which is a new software tool that discovers scientific papers published on the worldwide web. WEBFIND uses external information sources to guide its search for authors and papers. Like many other worldwide web toold, WEBFIND needs to solve the field matching problem in order to navigate between information sources.