Ronald J. Brachman and Tej Anand
The general idea of discovering knowledge in large amounts of data is both appealing and intuitive. Typically we focus our attention on learning algorithms, which provide the core capability of generalizing from large numbers of small, very specific facts to useful high-level rules; these learning techniques seem to hold the most excitement and perhaps the most substantive scientific content in the knowledge discovery in databases (KDD) enterprise. However, when we engage in real-world discovery tasks, we find that they can be extremely complex, and that induction of rules is only one small part of the overall process. While others have written overviews of "the concept of KDD, and even provided block diagrams for "knowledge discovery systems," no one has begun to identify all of the building blocks in a realistic KDD process. This is what we attempt to do here. Besides bringing into the discussion several parts of the process that have received inadequate attention in the KDD community, a careful elucidation of the steps in a realistic knowledge discovery process can provide a framework for comparison of different technologies and tools that are almost impossible to compare without a clean model.