Usama M. Fayyad
This paper addresses the problem of deciding which subset of values of a categorical-valued attribute to branch on during decision tree generation. Algorithms such as ID3 and C4 do not address this issue and simply create one branch for each value of the selected attribute. The GID3* algorithm is presented and evaluated. GID3* is a generalized version of Quinlan’s ID3 and C4, and a non-parametric version of the GID3 algorithm presented in an earlier paper. It branches on a subset of the individual values of an attribute and groups the remaining values under a single DEFAULT branch. Empirical tests on both controlled synthetic (randomized) domains and real-world data sets demonstrate that GID3* outperforms ID3 (C4) and GID3 for any parameter setting of the latter. The improvement in tree quality, as measured by number of leaves and estimated error rate, is significant.
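To make the branching scheme concrete, the following minimal Python sketch illustrates only the shape of a split that branches on a chosen subset of attribute values and collects all remaining values under a DEFAULT branch. It is not the GID3* selection procedure, which the abstract does not specify: the subset `branch_values` is supplied by hand, and the function names, the toy data, and the use of entropy-based information gain as the split score are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2

DEFAULT = "DEFAULT"  # label for the catch-all branch (name chosen for illustration)


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def partition(values, labels, branch_values):
    """Group class labels by attribute value.

    Values in branch_values get their own branch; every other value
    is routed to the single DEFAULT branch.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        key = value if value in branch_values else DEFAULT
        groups[key].append(label)
    return dict(groups)


def information_gain(values, labels, branch_values):
    """Entropy reduction obtained by the subset-plus-DEFAULT split."""
    groups = partition(values, labels, branch_values)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data (hypothetical): only the value "red" separates the classes.
values = ["red", "red", "green", "blue", "yellow", "green"]
labels = ["+", "+", "-", "-", "-", "-"]

subset_split = partition(values, labels, {"red"})      # 2 branches
full_split = partition(values, labels, set(values))    # 4 branches, ID3-style
```

On this toy data, branching on every value (as ID3 would) yields four pure leaves, while branching on "red" alone and grouping the rest under DEFAULT yields two pure leaves with the same information gain. This is the kind of reduction in leaf count that a subset-based split can offer.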