Petri Kontkanen, Petri Myllymaki, Tomi Silander, Henry Tirri
BAYDA is a software package for flexible data analysis in predictive data mining tasks. The mathematical model underlying the program is a simple Bayesian network, the Naive Bayes classifier. It is well known that the Naive Bayes classifier performs well in predictive data mining tasks compared with approaches that use more complex models. However, the model makes strong independence assumptions that are frequently violated in practice. For this reason, the BAYDA software also provides a feature selection scheme that can be used both for analyzing the problem domain and for improving the prediction accuracy of the models constructed by BAYDA. The scheme is based on a novel Bayesian feature selection criterion introduced in this paper. The suggested criterion is inspired by the Cheeseman-Stutz approximation for computing the marginal likelihood of Bayesian networks with hidden variables. Empirical results on several widely used data sets demonstrate that the automated Bayesian feature selection scheme can dramatically decrease the number of features retained as relevant and lead to substantial improvements in prediction accuracy.
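As an illustration of the underlying model only, the following is a minimal sketch of a discrete Naive Bayes classifier restricted to a chosen subset of features. It is not the BAYDA implementation and does not implement the feature selection criterion proposed in the paper: the feature subset is fixed by hand, and the function names (`train_naive_bayes`, `predict_proba`), the toy data, and the additive smoothing constant `alpha` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, features, alpha=1.0):
    """Collect the counts a discrete Naive Bayes model needs.

    Only the features listed in `features` are used, which is how a
    feature subset (chosen by any criterion) would enter the model.
    `alpha` is an assumed additive smoothing constant.
    """
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, class) -> value -> count
    values_seen = defaultdict(set)       # feature -> distinct observed values
    for row, label in zip(rows, labels):
        for f in features:
            value_counts[(f, label)][row[f]] += 1
            values_seen[f].add(row[f])
    return {
        "class_counts": class_counts,
        "value_counts": value_counts,
        "values_seen": values_seen,
        "features": list(features),
        "alpha": alpha,
    }


def predict_proba(model, row):
    """Return P(class | row), assuming features are independent given the class."""
    alpha = model["alpha"]
    n_total = sum(model["class_counts"].values())
    n_classes = len(model["class_counts"])
    log_scores = {}
    for c, n_c in model["class_counts"].items():
        # Smoothed class prior.
        score = math.log((n_c + alpha) / (n_total + alpha * n_classes))
        # Independence assumption: per-feature log-likelihoods simply add up.
        for f in model["features"]:
            n_values = len(model["values_seen"][f])
            count = model["value_counts"][(f, c)][row[f]]
            score += math.log((count + alpha) / (n_c + alpha * n_values))
        log_scores[c] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


# Hypothetical toy data: "signal" determines the class, "noise" does not.
rows = [
    {"signal": "a", "noise": "x"},
    {"signal": "a", "noise": "y"},
    {"signal": "b", "noise": "x"},
    {"signal": "b", "noise": "y"},
]
labels = ["pos", "pos", "neg", "neg"]

model_all = train_naive_bayes(rows, labels, ["signal", "noise"])
model_selected = train_naive_bayes(rows, labels, ["signal"])

query = {"signal": "a", "noise": "x"}
probs_all = predict_proba(model_all, query)
probs_selected = predict_proba(model_selected, query)
```

In this toy case the uninformative feature leaves the prediction unchanged, so dropping it simplifies the model at no cost in accuracy. On real data, where the independence assumption is violated, removing features can also change the predictions, which is the situation the feature selection scheme is meant to address.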