Heikki Mannila, Hannu Toivonen
In interactive data mining it is advantageous to have condensed representations of data that can be used to efficiently answer different queries. In this paper we show how frequent sets can be used as a condensed representation for answering various types of queries. Given a table r with O/1 values and a threshold o, a frequent set of r is a set X of columns of r such that at least a fraction o of the rows of r have a 1 in all the columns of X. Finding frequent sets is a first step in finding association rules, and there exist several efficient algorithms for finding the frequent sets. We show that frequent sets have wider applications than just finding association rules. We show that using the inclusion-exclusion principle one can obtain approximate confidences of arbitrary boolean rules. We derive bounds for the errors in the confidences, and show that information collected during the computation of frequent sets can also be used to provide individual error bounds for each clause. Experiments show that this method enables one to obtain different forms of rules from data extremely fast. Furthermore, we define a general notion of condensed representations, and show that frequent sets, samples and the data cube can be viewed as instantations of this concept.