David Hsu, Oren Etzioni and Stephen Soderland
Covering algorithms for learning rule sets are biased toward learning concise rule sets from the training data. This bias may not be appropriate for text classification, where documents typically contain a large number of informative features. We present a basic covering algorithm, DAIRY, that learns unordered rule sets, along with two extensions that encourage the rule learner to milk the training data to varying degrees: recycling covered training data, and searching for completely redundant but highly accurate rules. We evaluate these modifications on web page and newsgroup recommendation problems and show that recycling can improve classification accuracy by over 10%. Redundant rule learning yields smaller gains on most datasets and may decrease accuracy on some.
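To make the recycling idea concrete, the following is a minimal illustrative sketch of a weighted covering loop, not DAIRY itself: instead of deleting training examples once a rule covers them, their weights are multiplied by a recycle factor, so later rules can still draw on them. All function names, the Laplace-accuracy scoring, and the `recycle` parameter are assumptions for illustration.

```python
# Hypothetical sketch of a covering algorithm with "recycling".
# Examples are (feature_set, is_positive) pairs; a rule is a frozenset
# of features that covers an example when it is a subset of its features.

def laplace(rule, examples, weights):
    """Weighted Laplace-corrected accuracy of a candidate rule."""
    pos = sum(w for (x, y), w in zip(examples, weights) if rule <= x and y)
    cov = sum(w for (x, _), w in zip(examples, weights) if rule <= x)
    return (pos + 1) / (cov + 2)

def learn_rule(examples, weights, features):
    """Greedily grow one conjunctive rule until no feature improves it."""
    rule = set()
    while True:
        best, best_score = None, laplace(rule, examples, weights)
        for f in features - rule:
            score = laplace(rule | {f}, examples, weights)
            if score > best_score:
                best, best_score = f, score
        if best is None:
            return frozenset(rule)
        rule.add(best)

def covering(examples, features, recycle=0.25, n_rules=3):
    """Learn n_rules rules, down-weighting (not removing) covered data."""
    weights = [1.0] * len(examples)
    rules = []
    for _ in range(n_rules):
        rule = learn_rule(examples, weights, features)
        rules.append(rule)
        # Recycling: covered examples keep a fraction of their weight,
        # so later rules may still exploit the same informative features.
        weights = [w * recycle if rule <= x else w
                   for (x, _), w in zip(examples, weights)]
    return rules
```

Setting `recycle=0` recovers the standard behavior of removing covered examples; values near 1 push the learner toward the redundant-rule extreme described above.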