Jason Beck, Maria Garcia, Mingyu Zhong, Michael Georgiopoulos, Georgios C. Anagnostopoulos
In machine learning, decision trees are used extensively to solve classification problems. Designing a decision tree classifier involves two main phases. The first phase grows the tree from a set of data, called training data, quite often to its maximum size. The second phase prunes the tree, producing a smaller tree with better generalization (smaller error on unseen data). One of the most popular decision tree classifiers in the literature is the C4.5 decision tree classifier. In this paper, we introduce an additional phase, called the adjustment phase, inserted between the growing and pruning phases of the C4.5 decision tree classifier. The intent of this adjustment phase is to reduce the C4.5 error rate by adjusting the non-optimal splits created in the growing phase, thus improving generalization (accuracy of the tree on unseen data). In most simulations conducted with the C4.5 decision tree classifier, its parameters, the confidence factor (CF) and the minimum number of split-off cases (MS), are set to 25% and 2, respectively, the default values recommended by Quinlan, the inventor of C4.5. The overall value of this work is that it provides the C4.5 user with a quantitative and qualitative assessment of the benefits of the proposed adjustment phase, as well as the benefits of optimizing the C4.5 parameters CF and MS.
Subjects: 15.6 Decision Trees; 12. Machine Learning and Discovery
Submitted: Feb 24, 2008