Peter J. Huber
The three distinct data handling cultures (statistics, data base management and artificial intelligence) finally show signs of convergence. Whether you name their common area "data analysis" or "knowledge discovery," the necessary ingredients for success with ever larger data sets are identical: good data, subject area expertise, access to technical know-how in all three cultures, and a good portion of common sense. Curiously, all three cultures have been trying to avoid common sense and to hide its lack behind a smoke-screen of technical formalism.

Huge data sets usually are not just more of the same; they have to be huge because they are heterogeneous, with more internal structure, such that smaller sets would not do. As a consequence, subsamples and techniques based on them, like the bootstrap, may no longer make sense. The complexity of the data regularly forces the data analyst to fashion simple but problem- and data-specific tools from basic building blocks taken from data base management and numerical mathematics. Scaling up of algorithms is problematic: the computational complexity of many procedures explodes with increasing data size; for example, conventional clustering algorithms become unfeasible. The human ability to inspect a data set, or even only a meaningful part of it, breaks down far below terabyte sizes.

I believe that attempts to circumvent this by "automating" some aspects of exploratory analysis are futile. The available success stories suggest that the real function of data mining and KDD is not machine discovery of interesting structures by itself, but targeted extraction and reduction of data to a size and format suitable for human inspection. By necessity, such pre-processing is ad hoc, data specific, and driven by working hypotheses based on subject matter expertise and on trial and error. Statistical common sense - which traps to avoid, how to handle random and systematic errors, and where to stop - is more important than specific techniques.
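Two back-of-the-envelope calculations make the scaling claims above concrete. This Python sketch is illustrative only, with made-up sizes not taken from the text: the first function shows how easily a simple random subsample misses a rare substructure entirely, and the second shows how the full pairwise-distance matrix behind conventional hierarchical clustering outgrows memory.

```python
# Hedged illustration (not from the text above): two rough calculations
# behind the claims that subsamples and conventional clustering
# break down on huge, heterogeneous data.

def prob_subsample_misses(rare_fraction, sample_size):
    """P(a random subsample, drawn with replacement, contains no row
    from a rare substructure occupying `rare_fraction` of the data)."""
    return (1.0 - rare_fraction) ** sample_size

def distance_matrix_bytes(n, bytes_per_entry=8):
    """Memory for the full n x n pairwise-distance matrix of 8-byte
    floats that conventional hierarchical clustering builds."""
    return n * n * bytes_per_entry

# A structure present in 0.01% of the rows is missed by a 1,000-row
# subsample about 90% of the time:
print(f"{prob_subsample_misses(1e-4, 1_000):.2f}")   # 0.90

# The distance matrix for a million rows already needs about 7 TiB:
print(f"{distance_matrix_bytes(1_000_000) / 2**40:.1f} TiB")
```

The point is not the particular numbers but the shape of the growth: the miss probability depends only on the sample size, not on the size of the full data set, while the clustering cost grows quadratically in it.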
The machine assistance we need for the step from large to huge data sets is thus an integrated computing environment that allows easy improvisation and retooling even with massive data.