Peter J. Huber
The three distinct data handling cultures (statistics, data base management and artificial intelligence) finally show signs of convergence. Whether you name their common area "data analysis" or "knowledge discovery," the necessary ingredients for success with ever larger data sets are identical: good data, subject area expertise, access to technical know-how in all three cultures, and a good portion of common sense. Curiously, all three cultures have been trying to avoid common sense and to hide its lack behind a smoke-screen of technical formalism.

Huge data sets usually are not just more of the same; they have to be huge because they are heterogeneous, with more internal structure, such that smaller sets would not do. As a consequence, subsamples and techniques based on them, like the bootstrap, may no longer make sense. The complexity of the data regularly forces the data analyst to fashion simple but problem- and data-specific tools from basic building blocks taken from data base management and numerical mathematics. Scaling up of algorithms is problematic: the computational complexity of many procedures explodes with increasing data size; for example, conventional clustering algorithms become unfeasible. The human ability to inspect a data set, or even only a meaningful part of it, breaks down far below terabyte sizes.

I believe that attempts to circumvent this by "automating" some aspects of exploratory analysis are futile. The available success stories suggest that the real function of data mining and KDD is not machine discovery of interesting structures by itself, but targeted extraction and reduction of data to a size and format suitable for human inspection. By necessity, such pre-processing is ad hoc, data specific, and driven by working hypotheses based on subject matter expertise and on trial and error. Statistical common sense - which traps to avoid, how to handle random and systematic errors, and where to stop - is more important than specific techniques.
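Two back-of-the-envelope calculations make the scaling claims above concrete. This Python sketch is illustrative only, with made-up sizes not taken from the text: the first function shows how easily a simple random subsample misses a rare substructure entirely, and the second shows how the full pairwise-distance matrix behind conventional hierarchical clustering outgrows memory.

```python
# Hedged illustration (not from the text above): two rough calculations
# behind the claims that subsamples and conventional clustering
# break down on huge, heterogeneous data.

def prob_subsample_misses(rare_fraction, sample_size):
    """P(a random subsample, drawn with replacement, contains no row
    from a rare substructure occupying `rare_fraction` of the data)."""
    return (1.0 - rare_fraction) ** sample_size

def distance_matrix_bytes(n, bytes_per_entry=8):
    """Memory for the full n x n pairwise-distance matrix of 8-byte
    floats that conventional hierarchical clustering builds."""
    return n * n * bytes_per_entry

# A structure present in 0.01% of the rows is missed by a 1,000-row
# subsample about 90% of the time:
print(f"{prob_subsample_misses(1e-4, 1_000):.2f}")   # 0.90

# The distance matrix for a million rows already needs about 7 TiB:
print(f"{distance_matrix_bytes(1_000_000) / 2**40:.1f} TiB")
```

The point is not the particular numbers but the shape of the growth: the miss probability depends only on the sample size, not on the size of the full data set, while the clustering cost grows quadratically in it.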
The machine assistance we need for the step from large to huge data sets is thus an integrated computing environment that allows easy improvisation and retooling even with massive data.