M. Ganesh, Jaideep Srivastava, Travis Richardson
Entity identification (EI) is the identification and integration of all records that represent the same real-world entity, and is an important task in the database integration process. When a common identification mechanism for similar records across heterogeneous databases is not readily available, EI is performed by examining the relationships between attribute values among the records. We propose the use of distances between attribute values as a measure of similarity between the records they represent. Record-matching conditions for EI can then be expressed as constraints on the attribute distances. We show how knowledge discovery techniques can be used to automatically derive these conditions (expressed as decision trees) directly from the data, using a distance-based framework.
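
As a rough illustration of the distance-based idea, the sketch below maps a pair of records to a vector of per-attribute distances and then derives a single matching constraint from labeled pairs. Everything in it is an assumption made for illustration: the attribute names (`name`, `year`), the sample records, the choice of edit distance and absolute difference as distance functions, and the helper `learn_threshold`, which fits only a one-level decision tree (a single threshold) as a stand-in for full decision-tree induction.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def distance_vector(r1: Record, r2: Record) -> Dict[str, int]:
    """Map a record pair to per-attribute distances (hypothetical schema)."""
    return {
        "name": edit_distance(str(r1["name"]).lower(), str(r2["name"]).lower()),
        "year": abs(int(r1["year"]) - int(r2["year"])),
    }


def learn_threshold(pairs: List[Tuple[Dict[str, int], bool]], attr: str) -> int:
    """Choose t so that the rule 'distance[attr] <= t implies match'
    misclassifies the fewest labeled pairs (a one-level decision tree)."""
    candidates = sorted({d[attr] for d, _ in pairs})
    best_t, best_err = candidates[0], len(pairs) + 1
    for t in candidates:
        err = sum((d[attr] <= t) != label for d, label in pairs)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


# Hypothetical records drawn from two databases with no shared key.
a = {"name": "Jon Smith", "year": 1970}
b = {"name": "John Smith", "year": 1970}
c = {"name": "Jane Smythe", "year": 1982}

# Labeled pairs: True means both records denote the same entity.
labeled = [
    (distance_vector(a, b), True),
    (distance_vector(a, c), False),
    (distance_vector(b, c), False),
]

t_name = learn_threshold(labeled, "name")


def is_match(r1: Record, r2: Record) -> bool:
    """Matching condition expressed as a constraint on an attribute distance."""
    return distance_vector(r1, r2)["name"] <= t_name
```

With more attributes and more labeled pairs, the same distance vectors could be passed to a general decision-tree learner, so that the resulting tree encodes a conjunction of distance constraints rather than a single threshold.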