George H. John, Pat Langley
As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the datamining process, but this is only acceptable if it does not reduce the quality of the mined knowledge. To this end, we introduce the "Probably Close Enough" criterion to describe the desired properties of a sample. Sampling usually refers to the use of static statistical tests to decide whether a sample is sufficiently similar to the large database, in the absence of any knowledge of the tools the data miner intends to use. We discuss dynamic sampling methods, which take into account the mining tool being used and can thus give better samples. We describe dynamic schemes that observe a mining tool’s performance on training samples of increasing size and use these results to determine when a sample is sufficiently large. We evaluate these sampling methods on data from the UC1 repository and conclude that dynamic sampling is preferable.