Data Integration Using Data Mining Techniques

Karen C. Davis, Krishnamoorthy Janakiraman, Ali Minai, and Robert B. Davis

Database integration provides integrated access to multiple data sources. It comprises two main activities: schema integration (forming a global view of the data contents available in the sources) and data integration (transforming source data into a uniform format). This paper focuses on using data mining techniques to automate entity identification, one aspect of data integration. Once all the transformed source data are combined into a global database, the same entity may appear in multiple instances with different values for the global attributes, and no global identifier is available to simplify entity identification. We implement decision trees and k-NN as classification techniques, and we introduce a preprocessing step that clusters the data using conceptual hierarchies. We conduct a performance study on a small testbed, varying parameters such as training set size and number of unique entities to study tradeoffs between processing speed and accuracy. We find that clustering is a promising technique for improving processing speed, and that decision trees are generally faster than k-NN but, in some cases, less accurate.
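To make the pipeline concrete, the following is a minimal sketch of entity identification with a clustering step followed by k-NN classification of record pairs. It is not the implementation evaluated in the paper: the attribute names, the toy records, the conceptual hierarchy, the agreement-based pair features, and the Hamming distance are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical conceptual hierarchy: specific cuisine -> broader concept.
HIERARCHY = {"sushi": "asian", "thai": "asian", "pizza": "european"}
ATTRS = ("name", "city", "cuisine")

# Hypothetical labeled pairs: agreement vector over ATTRS -> label.
TRAIN = [
    ((1, 1, 1), "match"), ((1, 1, 0), "match"), ((1, 0, 1), "match"),
    ((0, 1, 1), "non-match"), ((0, 0, 1), "non-match"),
    ((0, 0, 0), "non-match"), ((0, 1, 0), "non-match"),
]

# Hypothetical global database with no global identifier for entities.
RECORDS = [
    {"id": 1, "name": "Sakura", "city": "Cincinnati", "cuisine": "sushi"},
    {"id": 2, "name": "Sakura", "city": "Cincinnati", "cuisine": "asian"},
    {"id": 3, "name": "Thai Garden", "city": "Cincinnati", "cuisine": "thai"},
    {"id": 4, "name": "Roma", "city": "Dayton", "cuisine": "pizza"},
]

def pair_features(a, b, attrs=ATTRS):
    """Return 1 for each attribute on which the two records agree, else 0."""
    return tuple(int(a[x] == b[x]) for x in attrs)

def knn_predict(train, features, k=3):
    """Majority label among the k training pairs nearest in Hamming distance."""
    def distance(example):
        return sum(p != q for p, q in zip(example[0], features))
    nearest = sorted(train, key=distance)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cluster_by_concept(records, attr):
    """Group records by the broader concept of one attribute value."""
    clusters = {}
    for record in records:
        concept = HIERARCHY.get(record[attr], record[attr])
        clusters.setdefault(concept, []).append(record)
    return clusters

def identify(records, train, cluster_attr="cuisine", k=3):
    """Classify record pairs within each cluster; return matching id pairs."""
    matches = []
    for group in cluster_by_concept(records, cluster_attr).values():
        for a, b in combinations(group, 2):
            if knn_predict(train, pair_features(a, b), k) == "match":
                matches.append((a["id"], b["id"]))
    return matches

print(identify(RECORDS, TRAIN))
```

In this sketch, clustering reduces the number of pairs passed to the classifier from six to three, which illustrates why such a preprocessing step can improve processing speed. A decision tree could replace `knn_predict` without changing the rest of the pipeline.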

This page is copyrighted by AAAI. All rights reserved.