Matthew Michelson, Craig A. Knoblock
Heterogeneous transformations are translations between strings that are not characterized by a single function. E.g., nicknames, abbreviations and synonyms are heterogeneous transformations while edit distances are not. Such transformations are useful for information retrieval, information extraction and text understanding. They are especially useful in record linkage, where we determine whether two records refer to the same entity by examining the similarities between their fields. However, heterogeneous transformations are usually created manually and without assurance they will be useful. This paper presents a data mining approach to discover heterogeneous transformations between two data sets, without labeled training data. In addition to simple transformations, our algorithm finds combinatorial transformations, such as synonyms and abbreviations together. Our experiments demonstrate that we discover many types of specialized transformations, and we show that by exploiting these transformations we can improve record linkage. Our approach makes discovering and exploiting heterogeneous transformations more scalable and robust by lessening the domain and human dependencies.
Subjects: 12. Machine Learning and Discovery; 10. Knowledge Acquisition
Submitted: May 11, 2007