AnHai Doan, Pedro Domingos, and Alon Y. Levy
To build a data-integration system, the application designer must specify a mediated schema and supply descriptions of the data sources. A source description contains a source schema that describes the content of the source, and a mapping between the corresponding elements of the source schema and the mediated schema. Manually constructing these mappings is both labor-intensive and error-prone, and has proven to be a major bottleneck in deploying large-scale data-integration systems in practice. In this paper we report on our initial work toward automatically learning mappings between source schemas and the mediated schema. Specifically, we investigate finding one-to-one mappings for the leaf elements of source schemas. We describe LSD, a system that automatically finds such mappings. LSD consults a set of learner modules, each of which looks at the problem from a different perspective, and then combines the predictions of the modules using a meta-learner. The learner modules draw on knowledge from the World-Wide Web, as well as on ideas from machine learning and information retrieval. We report experimental results of applying LSD to five sources in the real-estate domain.
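
The multi-learner architecture described above can be illustrated with a minimal sketch. It is not LSD's implementation: the label set, the two toy learners (`name_learner`, `value_learner`), and the fixed-weight average used as the meta-learner are all hypothetical stand-ins, since the abstract does not specify how the modules or the combination step work. Each learner scores every mediated-schema label for a source leaf element, and the combiner picks the label with the highest weighted score.

```python
# Hypothetical mediated-schema leaf labels (not taken from the paper).
MEDIATED_LABELS = ["address", "price", "phone"]


def normalize(scores):
    """Scale scores to sum to 1; fall back to uniform if all are zero."""
    total = sum(scores.values())
    if total == 0:
        return {label: 1.0 / len(scores) for label in scores}
    return {label: s / total for label, s in scores.items()}


def name_learner(element_name, values):
    """Toy learner: looks only at the source element's name."""
    name = element_name.lower()
    return normalize({label: 1.0 if label in name else 0.0
                      for label in MEDIATED_LABELS})


def value_learner(element_name, values):
    """Toy learner: looks only at the data values, using crude format cues."""
    scores = dict.fromkeys(MEDIATED_LABELS, 0.0)
    for v in values:
        if v.startswith("$"):
            scores["price"] += 1
        elif v.replace("-", "").isdigit():
            scores["phone"] += 1
        else:
            scores["address"] += 1
    return normalize(scores)


def combine(predictions, weights):
    """Assumed meta-learner: a fixed weighted average of learner scores."""
    combined = {
        label: sum(w * p[label] for p, w in zip(predictions, weights))
        for label in MEDIATED_LABELS
    }
    return max(combined, key=combined.get), combined


def match_element(element_name, values, learners, weights):
    """Map one source leaf element to its best-scoring mediated label."""
    predictions = [learner(element_name, values) for learner in learners]
    return combine(predictions, weights)


learners = [name_learner, value_learner]
weights = [0.5, 0.5]  # hand-picked here; a real meta-learner would learn these
label, scores = match_element("contact", ["206-555-0100", "206-555-0199"],
                              learners, weights)
print(label, scores)
```

In this example the name-based learner is uninformative for `contact`, so the value-based learner's evidence decides the match; this is the benefit of combining perspectives. The sketch scores each element independently and does not enforce that the overall mapping is one-to-one.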