Karen C. Davis, Krishnamoorthy Janakiraman, Ali Minai, and Robert B. Davis
Database integration provides unified access to multiple data sources. It comprises two main activities: schema integration (forming a global view of the data available in the sources) and data integration (transforming source data into a uniform format). This paper focuses on using data mining techniques to automate one aspect of data integration, entity identification. Once all of the transformed source data has been combined into a global database, the same entity may appear as multiple instances with different values for the global attributes, and there is no global identifier to simplify entity identification. We implement decision trees and k-nearest neighbor (k-NN) as classification techniques, and we introduce a preprocessing step that clusters the data using conceptual hierarchies. We conduct a performance study on a small testbed, varying parameters such as training set size and number of unique entities, to examine the tradeoffs between processing speed and accuracy. We find that clustering is a promising technique for improving processing speed, and that decision trees are generally faster than k-NN but, in some cases, less accurate.
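To make the approach concrete, the following is a minimal sketch of clustering by a conceptual hierarchy followed by k-NN classification within the matching cluster. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the restaurant records, the `CUISINE_HIERARCHY` mapping, the mismatch-count distance, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical conceptual hierarchy: maps a specific value to a broader concept.
CUISINE_HIERARCHY = {
    "sushi": "asian", "thai": "asian",
    "pizza": "european", "bistro": "european",
}

# Attributes compared by the distance function (assumed global schema).
FIELDS = ("name", "city", "cuisine")


def cluster_key(record):
    """Generalize the cuisine value one level up the hierarchy."""
    return CUISINE_HIERARCHY.get(record["cuisine"], "other")


def distance(a, b):
    """Number of attributes on which two records disagree."""
    return sum(a[f] != b[f] for f in FIELDS)


def build_clusters(training):
    """Group labeled (record, entity_id) pairs by their generalized concept."""
    clusters = defaultdict(list)
    for record, entity_id in training:
        clusters[cluster_key(record)].append((record, entity_id))
    return clusters


def knn_identify(record, clusters, k=3):
    """Majority vote among the k nearest labeled records in the same cluster."""
    candidates = clusters.get(cluster_key(record), [])
    if not candidates:
        return None  # no labeled records share this concept
    nearest = sorted(candidates, key=lambda pair: distance(record, pair[0]))[:k]
    votes = Counter(entity_id for _, entity_id in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled instances; E1 and E2 are made-up entity identifiers.
TRAINING = [
    ({"name": "Sakura", "city": "Cincinnati", "cuisine": "sushi"}, "E1"),
    ({"name": "Sakura House", "city": "Cincinnati", "cuisine": "sushi"}, "E1"),
    ({"name": "Sakura", "city": "Cincinnati", "cuisine": "thai"}, "E1"),
    ({"name": "Luigi's", "city": "Dayton", "cuisine": "pizza"}, "E2"),
    ({"name": "Luigi's", "city": "Columbus", "cuisine": "pizza"}, "E2"),
]

clusters = build_clusters(TRAINING)
query = {"name": "Sakura", "city": "Cincinnati", "cuisine": "thai"}
print(knn_identify(query, clusters))
```

Because the query is compared only against labeled records in its own cluster, the number of distance computations shrinks as the clusters become finer, which is the intuition behind the speed improvement. The cost is that an instance whose generalized value falls into the wrong cluster cannot be matched to its entity.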