Kevin J. Cherkauer, Jude W. Shavlik
When machine learning techniques are used for knowledge discovery, output that is comprehensible to a human is as important as predictive accuracy. We introduce a new algorithm, SET-GEN, that improves the comprehensibility of decision trees grown by standard C4.5 without reducing accuracy. It does this by using genetic search to select the set of input features that C4.5 is allowed to use to build its tree. We test SET-GEN on a wide variety of real-world datasets and show that its trees are significantly smaller and reference significantly fewer features than trees grown by C4.5 alone. Statistical significance tests show that, on all ten datasets tested, SET-GEN’s trees are either indistinguishable in accuracy from or more accurate than the original C4.5 trees.
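
The idea of searching over feature subsets with a genetic algorithm can be sketched roughly as follows. This is a generic illustration under stated assumptions, not SET-GEN itself. The selection, crossover, and mutation operators shown here are common textbook choices rather than the ones the algorithm uses. The names `genetic_feature_search` and `toy_fitness` are hypothetical. In a real setting the fitness function would grow a C4.5 tree restricted to the masked features and score it on accuracy and size. Here a toy stand-in is used, which pretends that only features 0 and 3 are informative and charges a small penalty per feature used.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[bool, ...]  # one flag per input feature: True = learner may use it


def genetic_feature_search(
    n_features: int,
    fitness: Callable[[Mask], float],
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Mask:
    """Generic GA over feature subsets (illustrative operators, assumed here)."""
    if n_features < 2:
        raise ValueError("one-point crossover needs at least two features")
    rng = random.Random(seed)
    # Start from random feature subsets.
    population: List[Mask] = [
        tuple(rng.random() < 0.5 for _ in range(n_features))
        for _ in range(pop_size)
    ]
    best = max(population, key=fitness)
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: max(2, pop_size // 2)]  # truncation selection
        children: List[Mask] = [scored[0]]  # elitism: carry the best over
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each flag independently with probability mutation_rate.
            child = tuple(bit != (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)
    return best


def toy_fitness(mask: Mask) -> float:
    """Stand-in for 'train a tree on the masked features and score it'.

    Assumption for illustration only: features 0 and 3 are the informative
    ones, and every feature used costs 0.1 (a crude proxy for tree size).
    """
    relevant = (0, 3)
    hits = sum(1 for i in relevant if mask[i])
    return hits - 0.1 * sum(mask)


best_mask = genetic_feature_search(n_features=8, fitness=toy_fitness)
selected = [i for i, used in enumerate(best_mask) if used]
```

After the search, `selected` holds the indices of the features the learner would be allowed to use. Folding a per-feature penalty into the fitness is one simple way to push the search toward smaller subsets. How accuracy and comprehensibility are actually traded off is a design choice that this sketch does not settle.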