Comparable Datasets in Performance Benchmarking

David Steier

A number of tasks require gathering information about a collection of similar objects in order to compare them. When the information needed for these tasks comes from a single database, the amount and type of data retrieved about each object in the collection are likely to be very similar, and the comparison is relatively straightforward. But when information comes from many sources, information gatherers face the problem of producing a common comparable dataset (compset) for each object being compared. This problem is difficult because what should be in a comparable dataset (as we show in this paper) depends on the task for which the information is being gathered, the target collection of objects to report on, and the data available about each object. The purpose of this workshop paper is to highlight the importance of this problem in gathering information from heterogeneous sources, and to present some detail about a case encountered in practice during a performance benchmarking study. Aspects of producing compsets have been studied in the database literature on schema integration for heterogeneous databases [Batini et al., 1986], which shares the concern for semantic comparability at the schematic level. For example, the theory of semantic values developed by Sciore et al. [1994] seems a promising approach to computing comparable datasets because it explicitly represents contextual information for each value. We discuss a number of issues involved in using contextual information in this way.
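To make the idea of a value that carries its own context concrete, the following is a minimal sketch in Python and not the formalism of Sciore et al. The `SemanticValue` class, the `convert` and `to_common_context` helpers, the choice of currency and scale as the context, and the exchange rates are all hypothetical assumptions introduced for illustration only.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD; made-up numbers for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}


@dataclass(frozen=True)
class SemanticValue:
    """A number together with the context needed to interpret it (assumed fields)."""
    amount: float
    currency: str   # e.g. "USD" or "EUR"
    scale: int      # 1 = units, 1000 = thousands


def convert(sv: SemanticValue, currency: str, scale: int) -> SemanticValue:
    """Re-express a value in a target context (currency and scale)."""
    in_usd_units = sv.amount * sv.scale * RATES_TO_USD[sv.currency]
    return SemanticValue(in_usd_units / RATES_TO_USD[currency] / scale, currency, scale)


def to_common_context(a: SemanticValue, b: SemanticValue,
                      currency: str = "USD", scale: int = 1):
    """Bring two values from different sources into one context before comparing."""
    return convert(a, currency, scale), convert(b, currency, scale)


# Revenue figures as two sources might report them (invented data).
source_a = SemanticValue(4, "EUR", 1000)   # 4 thousand EUR
source_b = SemanticValue(5000, "USD", 1)   # 5000 USD
common_a, common_b = to_common_context(source_a, source_b)
```

Under the assumed rate, both figures describe the same amount once they share a context, although their raw numbers (4 and 5000) look unrelated. Which context to convert into is itself a task-dependent choice, which is the kind of issue the paper raises.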

