Nimrod Megiddo, Ramakrishnan Srikant
Association rule algorithms can produce a very large number of output patterns. This has raised questions of whether the set of discovered rules "overfit" the data because all the patterns that satisfy some constraints are generated (the Bonferroni effect). In other words, the question is whether some of the rules are "false discoveries" that are not statistically significant. We present a novel approach for estimating the number of "false discoveries" at any cutoff level. Empirical evaluation shows that on typical datasets the fraction of rules that may be false discoveries is very small. A bonus of this work is that the statistical significance measures we compute are a good basis for ordering the rules for presentation to users, since they correspond to the statistical "surprise" of the rule. We also show how to compute confidence intervals for the support and confidence of an association rule, enabling the rule to be used predictively on future data.