Data Mining and Discovery (a subtopic of Machine Learning)
Data mining is an AI-powered tool that can discover useful information within a database, which can then be used to improve actions. To appreciate why businesses are so excited about data mining, you need only imagine that a major department store chain is looking for ways to boost sales. They have a large database containing information about customers and the nature of their purchases (with particulars such as identity of items, price, date, and time of sale). Suppose a data mining utility unearthed a pattern in the data which indicated that customers who shopped on Saturday afternoons and who made their initial purchase of the day in the shoe department tended to make, on average, 4 additional purchases from other departments and that the average member of this group spent more per visit than the typical shopper. Can you now envision the sort of advertising campaign that the department store chain might want to embark upon?
Good Places to Start
A golden vein - Computing: Analysis of customer information, better known as “data mining”, is finally delivering on its promises -- and expanding into some promising new areas. The Economist Technology Quarterly (June 10, 2004). "In the old days, knowing your customers was part and parcel of running a business, a natural consequence of living and working in a community. But for today's big firms, it is much more difficult: a big retailer such as Wal-Mart has no chance of knowing every single one of its customers. So the idea of gathering huge amounts of information and analysing it to pick out trends indicative of customers' wants and needs -- data mining -- has long been trumpeted as a way to return to the intimacy of a small-town general store. But for many years, data mining's claims were greatly exaggerated. ... In recent years, however, improvements in both hardware and software, and the rise of the world wide web, have enabled data mining to start delivering on its promises."
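As a rough illustration of the department store scenario in the introduction, the sketch below groups sales into visits and compares one customer segment against all shoppers. Everything in it is an assumption made for illustration: the record layout, the sample data, the names `visits`, `in_segment`, and `summarize`, and the choice to treat 12:00 to 17:59 as "afternoon". A real data mining utility would search for such segments itself; this only checks one that has already been proposed.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sales records: (customer_id, timestamp, department, price).
SALES = [
    ("c1", "2004-06-05 14:05", "shoes", 60.0),
    ("c1", "2004-06-05 14:40", "menswear", 45.0),
    ("c1", "2004-06-05 15:10", "housewares", 30.0),
    ("c2", "2004-06-05 13:20", "shoes", 80.0),
    ("c2", "2004-06-05 13:55", "cosmetics", 25.0),
    ("c3", "2004-06-05 15:30", "toys", 20.0),
    ("c4", "2004-06-08 10:15", "shoes", 55.0),
]


def visits(sales):
    """Group purchases into visits keyed by (customer, date), each in time order."""
    grouped = defaultdict(list)
    for customer, stamp, department, price in sales:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        grouped[(customer, when.date())].append((when, department, price))
    return [sorted(items) for items in grouped.values()]


def in_segment(visit):
    """True if the first purchase of the visit was in shoes on a Saturday afternoon."""
    when, department, _ = visit[0]
    # weekday() == 5 is Saturday; the 12-18 hour window is an assumed definition.
    return when.weekday() == 5 and 12 <= when.hour < 18 and department == "shoes"


def summarize(group):
    """Average additional purchases and average spend per visit (group must be non-empty)."""
    return {
        "visits": len(group),
        "avg_additional_purchases": mean(len(v) - 1 for v in group),
        "avg_spend": mean(sum(price for _, _, price in v) for v in group),
    }


if __name__ == "__main__":
    all_visits = visits(SALES)
    segment = [v for v in all_visits if in_segment(v)]
    print("segment:     ", summarize(segment))
    print("all shoppers:", summarize(all_visits))
```

Comparing the two summaries shows whether the segment really does buy more per visit than the typical shopper, which is the kind of evidence an advertising campaign would rest on.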
Knowledge Discovery in Databases: Tools and Techniques. By Peggy Wright. Crossroads. 1998. "The purpose of this paper is to present the results of a literature survey outlining the state-of-the-art in KDD techniques and tools. The paper is not intended to provide an in-depth introduction to each approach; rather, we intend it to acquaint the reader with some KDD approaches and potential uses."
Eureka! Knowledge Discovery. By Neena Buck. Software Magazine. December 2000/January 2001 cover story. "Knowledge discovery and data mining (KDD) is evolving from an esoteric art and a point solution, to a mainstream technology embedded in a variety of solutions, to help businesses turn information into insight."
Research: From lab to market. By Michael Kanellos. CNET News (June 16, 2004). "Data mining, the ability to find unexpected patterns in accumulated data, was born during a lunch break. At a customer conference in the early 1990s, an executive at British department store chain Marks & Spencer was explaining his database woes to Rakesh Agrawal, an information retrieval specialist at IBM. The store was collecting all sorts of data but didn't know what to do with it. So Agrawal and his team began devising algorithms for asking open-ended queries, eventually authoring a 1993 paper that would become required reading in data-mining science. The report has been cited in more than 650 other studies, making it one of the most widely cited papers of its kind. ... Agrawal, the data-mining pioneer, is today working on a system that will scramble customer data in a way that will allow companies to study buying trends or other patterns while preserving strict privacy."
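The 1993 paper mentioned in the CNET entry above is generally identified as the one that introduced association rule mining over retail "market basket" data, although the quote itself does not name the technique. As a minimal sketch of that idea, and not the paper's own algorithm, the Python below scores rules of the form "customers who buy A also buy B" by support and confidence using brute-force enumeration. The baskets, item names, thresholds, and function names are all invented for illustration.

```python
from itertools import combinations

# Hypothetical market baskets; item names are invented for illustration.
BASKETS = [
    {"shoes", "socks", "belt"},
    {"shoes", "socks"},
    {"shoes", "polish"},
    {"socks", "belt"},
    {"shoes", "socks", "polish"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Enumerate single-item rules lhs -> rhs that clear both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, baskets)
            c = confidence({lhs}, {rhs}, baskets)
            if s >= min_support and c >= min_confidence:
                rules.append((lhs, rhs, round(s, 2), round(c, 2)))
    return rules


if __name__ == "__main__":
    for lhs, rhs, s, c in pair_rules(BASKETS):
        print(f"{lhs} -> {rhs}  support={s}  confidence={c}")
```

Enumerating every pair is only workable for a handful of items; practical miners prune the search by discarding itemsets whose support is already too low.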
From Data Mining to Knowledge Discovery in Databases. By Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. AI Magazine 17(3): Fall 1996, 37-54. "Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field."
Advanced Scout: Data Mining and Knowledge Discovery in NBA Data, a Brief Application Description. By Inderpal Bhandari, et al. Data Mining and Knowledge Discovery 1, 121-125 (1997). Available from CiteSeer. "Advanced Scout is a PC-based data mining application used by National Basketball Association (NBA) coaching staffs to discover interesting patterns in basketball game data. We describe Advanced Scout software from the perspective of data mining and knowledge discovery. This paper highlights the pre-processing of raw data that the program performs, describes the data mining aspects of the software and how the interpretation of patterns supports the process of knowledge discovery. The underlying technique of attribute focusing as the basis of the algorithm is also described. The process of pattern interpretation is facilitated by allowing the user to relate patterns to video tape."
Mining for trends at the help desk. By John Boyd. IBM Think Research (1999). "Ordinary data mining simply looks for keywords, but the text-mining system -- dubbed TAKMI (an abbreviation for Text Analysis and Knowledge Mining but also a Japanese word meaning 'skilled craftsman') -- spots grammatical relationships, as well. Knowing which word is the subject, which the verb, and which the object, TAKMI can categorize calls according to whether they are, say, complaints or questions and according to the product that is causing difficulty."
Also see:
Java Data Mining: Strategy, Standard, and Practice - A Practical Guide for architecture, design, and implementation. By Mark F. Hornick, Erik Marcade, Sunil Venkayala. Published by Morgan Kaufmann, a division of Elsevier (2007). Chapter 1, Overview of Data Mining, is available online via a link in the sidebar. As you'll read in that chapter at 1.2.1: "Data mining goes by several aliases, for example, advanced analytics, predictive analytics, artificial intelligence, and machine learning."
Financial Services data mining example: Identifying risky borrowers. From Salford Systems. "To introduce you to data mining with the CART decision tree software we are going to walk through a real world example drawn from the Financial Services industry. The database is an extract from a group of customers who selected a financial loan product, some of whom went 'BAD'. The information we will make use of comes from standard credit reports provided by all the major credit bureaus...."
Data Mining Glossary. From Two Crows Corporation.
Tutorial Slides on Statistical Data Mining. Authored by Andrew Moore, CMU.
Readings Online
Data Mining and Knowledge Discovery, a Springer Computing Methodologies Journal.
The DBMS Guide to Data Mining Solutions (1998). A collection of articles by Estelle Brand and Rob Gerritsen including: Data Mining and Knowledge Discovery, Predicting Credit Risk, Neural Networks, Naive-Bayes and Nearest Neighbors, and Decision Trees.
Data-Mining. California Computer News (October 27, 2004). "The Andrew W. Mellon Foundation is funding the two-year, nearly $600,000 multi-institutional project, which John Unsworth, dean of Illinois' Graduate School of Library and Information Science (GSLIS), will lead. In his winning project, titled 'Web-based Text-Mining and Visualization for Humanities Digital Libraries,' Unsworth expects to produce software 'for discovering, visualizing and exploring significant patterns across large collections of full-text humanities resources in digital libraries and collections.' ... In traditional 'search-and-retrieval' projects, scholars bring specific queries to collections of text and get back more or less useful answers to those queries, Unsworth said. 'By contrast, the goal of data-mining, including text-mining, is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence and trends.' ... 
With its roots in statistics, artificial intelligence and machine learning, data-mining has been around since the 1990s. ... With data-mining tools, Unsworth said, you first select a body of material that you think is important in some way, next select features of those materials that you similarly think are important, and then 'map the occurrence of those features in the selected materials to see whether patterns emerge. If patterns do emerge, you analyze them and from that analysis emerges -- if you are lucky -- new insights into the materials.' For example, in the planning grant for this project, members of his research team, using the full set of Shakespeare's plays, selected five 'circulation-of characters' features...."
Duo-Mining - Combining Data and Text Mining. By Guy Creese. DMReview.com (September 16, 2004). "As standalone capabilities, the pattern-finding technologies of data mining and text mining have been around for years. However, it is only recently that enterprises have started to use the two in tandem - and have discovered that it is a combination that is worth more than the sum of its parts. First of all, what are data mining and text mining? They are similar in that they both 'mine' large amounts of data, looking for meaningful patterns. However, what they analyze is quite different. ... Collections and recovery departments in banks and credit card companies have used duo-mining to good effect. Using data mining to look at repayment trends, these enterprises have a good idea on who is going to default on a loan, for example. When logs from the collection agents are added to the mix, the understanding gets even better. For example, text mining can understand the difference in intent between, 'I will pay,' 'I won't pay,' 'I paid' and generate a propensity to pay score - which, in turn, can be data mined. 
To take another example, if a customer says, 'I can't pay because a tree fell on my house;' all of a sudden it is clear that it's not a 'bad' delinquency - but rather a sales opportunity for a home loan."
Data Mining. Edmund X. DeJesus' introduction to this collection of three articles from the October 1995 issue of Byte begins with: "There's gold in your data, but you can't see it." The three articles which follow this introduction are: The Data Gold Rush, by Sara Reese Hedberg; A Data Miner's Tools, by Karen Watterson; and Data-Mining Dynamite, by Cheryl D. Krivda.
The Rebirth of Artificial Intelligence. By Lisa DiCarlo. Forbes (May 16, 2000). "Oracle is promoting its Intelligent WebHouse tools. These tools give companies a detailed survey of their Web-surfing customers, determining what sites they have visited before and what their relationship is to that site. This, Howard says, 'enables companies to do a better job cross-selling and up-selling customers. You can [discover] sales programs on other sites and do competitive analysis.'"
The race to computerise biology. The Economist Technology Quarterly (December 12, 2002). "It is in data mining, however, where bioinformatics hopes for its biggest pay-off. First applied in banking, data mining uses a variety of algorithms to sift through storehouses of data in search of 'noisy' patterns and relationships among the different silos of information. The promise for bioinformatics is that public genome data, mixed with proprietary sequence data, clinical data from previous drug efforts and other stores of information, could unearth clues about possible candidates for future drugs."
Data Mining: Exploiting the Hidden Trends in Your Data. By Herb Edelstein. DB2 Online Magazine (Spring 1997). "Essentially, data mining discovers patterns and relationships hidden in your data. 
It's part of a larger process called knowledge discovery; specifically, the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships. The knowledge-discovery process as a whole is essential for successful data mining because it describes the steps you must take to ensure meaningful results."
Knowledge Discovery in Databases: An Overview. By William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus. AI Magazine 13(3): Fall 1992, 57-70. "Definition of Knowledge Discovery: Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Given a set of facts (data) F, a language L, and some measure of certainty C, we define a pattern as a statement S in L that describes relationships among a subset F_S of F with a certainty c, such that S is simpler (in some sense) than the enumeration of all facts in F_S. A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge. The output of a program that monitors the set of facts in a database and produces patterns in this sense is discovered knowledge."
Knowledge-based Scientific Discovery from Geological Databases. By C. Li and G. Biswas. (1995). "It is common knowledge in the oil industry that the typical cost of drilling a new offshore well is in the range of $30-40 million, but the chance of that site being an economic success is 1 in 10. Recent advances in drilling technology and data collection methods have led to oil companies and their ancillaries collecting large amounts of geophysical/geographical data ... Can this vast amount of history from previously explored fields be systematically utilized to evaluate new plays and prospects?"
Software: Text Mining. By Cade Metz. One of PC Magazine's Future Tech - 20 Hot Technologies to Watch (July 1, 2003). "Text-mining software is one of the front-line tools that the government is now using to tease out valuable connections. These specialized search engines can quickly sift through mountains of unstructured text -- anything that's not carefully arranged in a database or spreadsheet -- and pull out the meaningful stuff. They can infer relationships within data that are not stated explicitly."
Machine Learning, Neural and Statistical Classification. D. Michie, D.J. Spiegelhalter, C.C. Taylor (eds). "[This] book (originally published in 1994 by Ellis Horwood) is now out of print. The copyright now resides with the editors who have decided to make the material freely available on the web." Topics addressed include: Classical Statistical Methods, Modern Statistical Techniques, Machine Learning of Rules and Trees, and Neural Networks.
Machine Learning and Data Mining. By Tom M. Mitchell, Center for Automated Learning and Discovery at Carnegie Mellon University. (1999). Communications of the ACM, Vol. 42, No. 11; pages 30-36.
Statistical Data Mining Tutorials - Tutorial Slides by Andrew Moore, professor of Robotics and Computer Science at the School of Computer Science, Carnegie Mellon University. "The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms."
Virtual Prospecting. From oil exploration to neurosurgery, new tools are revealing the secrets hidden in mountains of data. By Otis Port. BusinessWeek Online. (March 23, 2001).
Smart Tools - Companies in health care, finance, and retailing are using artificial-intelligence systems to filter huge amounts of data and identify suspicious transactions. By Otis Port, with Michael Arndt and John Carey. Business Week's 2003 edition of The BusinessWeek50.
Coaxing Meaning Out Of Raw Data. By John W. Verity. Business Week (February 3, 1997). "First developed to help scientists make sense of experimental data, this software has enough smarts to 'see' meaningful patterns and relationships on its own--to see patterns that might otherwise take tens of man-years to find. That's a huge leap beyond conventional computer databases, which are powerful but unimaginative: They must be told precisely what to look for. Data-mining tools can sift through immense collections of customer, marketing, production, and financial data and, using statistical and artificial-intelligence techniques, identify what's worth noting and what's not."
IT Versus Terror - Preventing a terror attack is invaluable. But even invaluable IT projects need realistic business case analysis to succeed. By Ben Worthen. CIO (August 2006). "Data mining is a relatively new field within computer science. In the broadest sense, it combines statistical models, powerful processors, and artificial intelligence to find and retrieve valuable information that might otherwise remain buried inside vast volumes of data. Retailers use it to predict consumer buying patterns, and credit card companies use it to detect fraud. In the aftermath of September 11, the government concluded that data mining could help it prevent future terrorist attacks. Experts say that the government, and in particular the intelligence community, has come to rely heavily on data mining. A 2004 Government Accountability Office report found that federal agencies were actively engaged in or planning 199 data mining projects. ... The government's data mining projects fall into two broad categories: subject-based systems that retrieve data that could help an analyst follow a lead, and pattern-based systems that look for suspicious behaviors across a spread of activities."
Business Intelligence - The Value in Mining Data. By Jonathan Wu. DM Review (February 2002). "Data mining can best be described as a business intelligence (BI) technology that has various techniques to extract comprehensible, hidden and useful information from a population of data. This BI technology makes it possible to discover hidden trends and patterns in large amounts of data. The output of a data mining exercise can take the form of patterns, trends or rules that are implicit in the data. ... The following are examples of practical uses of data mining and the value it provides those who use this technology to mine their data. ... Fraud Detection ... Inventory Logistics ... Defect Analysis ... Focused Hiring."
Related Web Sites
ACM Special Interest Group on Knowledge Discovery in Data and Data Mining.
AI on the Web: Machine Learning. A resource companion to Stuart Russell and Peter Norvig's "Artificial Intelligence: A Modern Approach" with links to reference material, people, research groups, books, companies and much more.
"The Auton Lab, part of Carnegie Mellon University's School of Computer Science, researches new approaches to Statistical Data Mining. ... We are very interested in the underlying computer science, mathematics, statistics and AI of detection and exploitation of patterns in data."
The Data Mine. Maintained by Andy Pryke.
Data Mining Product Features. Profiles of, and links to, many data mining commercial products. From Exclusive Ore Inc.
"The Intelligent Data Understanding (IDU) subproject of NASA's Intelligent Systems Project develops techniques for transforming data into scientific understanding."
KDnuggets. Offers links to reference collections, newsletters, mailing lists, datasets, companies, job openings, competitions, and more . . . including:
Knowledge Discovery Laboratory at the University of Massachusetts Amherst, Department of Computer Science. "KDL investigates how to find useful patterns in large and complex databases. We study the underlying principles of data analysis algorithms, develop innovative techniques for knowledge discovery, and apply those techniques to practical tasks in areas such as fraud detection, scientific data analysis, and web mining."
Machine Learning and Inference (MLI) Laboratory at George Mason University (GMU) "conducts fundamental and experimental research on the development of intelligent systems capable of advanced forms of learning, inference, and knowledge generation, and applies them to real-world problems."
Microsoft's Machine Learning and Applied Statistics (MLAS) group "is focused on learning from data and data mining. By building software that automatically learns from data, we enable applications that (1) do intelligent tasks such as handwriting recognition and natural-language processing, and (2) help human data analysts more easily explore and better understand their data."
National Centre for Text Mining (NaCTeM): "We provide text mining services in response to the requirements of the UK academic community. Our initial focus is on applications in the biological and medical domains, where the major successes in the mining of scientific texts have so far occurred. We also make significant contributions to the text mining research community, both nationally and internationally."
Related AI Topics Pages
More Readings
Toward Automated Discovery in the Biological Sciences. By Bruce G. Buchanan and Gary R. Livingston. AI Magazine 25(1): Spring 2004, 69-84. "Knowledge discovery programs in the biological sciences require flexibility in the use of symbolic data and semantic information. Because of the volume of nonnumeric, as well as numeric, data, the programs must be able to explore a large space of possibly interesting relationships to discover those that are novel and interesting. Thus, the framework for the discovery program must facilitate proposing and selecting the next task to perform and performing the selected tasks. ... Our results demonstrate that both reasons given for performing tasks and estimates of the interestingness of the concepts and hypotheses examined by HAMB contribute to its performance and that the program can discover novel, interesting relationships in biological data."
Artificial Intelligence and Link Analysis - Papers from the 1998 Fall Symposium, ed. David Jensen and Henry Goldberg. Technical Report FS-98-01. American Association for Artificial Intelligence, Menlo Park, California. "Computer-based link analysis is increasingly used in law enforcement investigations, insurance fraud detection, telecommunications network analysis, pharmaceuticals research, epidemiology, and a host of other specialized applications. Link analysis explores associations among large numbers of objects of different types. For example, a law enforcement application might examine familial relationships among suspects and victims, the addresses at which those persons reside, and the telephone numbers that they called during a specified period. The ability of link analysis to represent relationships and associations among objects of different types has proven crucial in assisting human investigators to comprehend complex webs of evidence and draw conclusions that are not apparent from any single piece of information. 
However, there is both a need and opportunity to apply new technologies. Much of the current software for link analysis is little more than a graphical display tool. While visualizing networks has proven useful, some advanced applications of link analysis involve tens of thousands of objects and links as well as a rich array of possible data models. Manual construction and analysis of such networks has proven difficult. In addition, a large number of related techniques in artificial intelligence and several other fields have the potential to assist human reasoning about complex networks of relationships. These techniques draw on work from search, semantic networks, ontological engineering, autonomous agents, inductive logic programming, graph theory, social network analysis, knowledge discovery in databases, entity-relationship modeling, information extraction, information retrieval, and metaphor."
Proceedings of the Knowledge Discovery and Data Mining Conference (KDD). Available from the AAAI Digital Library.
Data Mining Research: Opportunities and Challenges. A Report of three NSF Workshops on Mining Large, Massive, and Distributed Data. By Robert Grossman, Simon Kasif, Reagan Moore, David Rocke, and Jeff Ullman. The January 21, 1998 Draft (8.4.5) is available from Robert Grossman. "Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, rules, and statistically significant structures and events in data. That is, data mining attempts to extract knowledge from data. Data mining differs from traditional statistics in several ways:...." And be sure to read the impressive "Success Stories" in Section 5!

