Chris Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse, Amri Napolitano
Building useful classification models can be a challenging endeavor, especially when training data is imbalanced. Class imbalance presents a problem when traditional classification algorithms are applied, because these algorithms often attempt to build models with the goal of maximizing overall classification accuracy. While such a model may be very accurate, it is often not very useful. Consider the domain of software quality prediction, where the goal is to identify the program modules that are most likely to contain faults. Since these modules make up only a small fraction of the entire project, a highly accurate model may be generated by classifying all examples as not fault prone. Such a model would be useless, since it identifies no fault-prone modules at all. To alleviate the problems associated with class imbalance, several techniques have been proposed. We examine two such techniques: data sampling and boosting. Five data sampling techniques and one commonly used boosting algorithm are applied to five datasets from the software quality prediction domain. Our results suggest that while data sampling can be very effective at improving classification performance when training data is imbalanced, boosting (which has received considerably less attention in research related to mining imbalanced data) usually results in even better performance.
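The point about accuracy can be made concrete with a small worked example. The sketch below is illustrative only: the module counts (1000 modules, 50 of them fault prone) are hypothetical and are not taken from the paper's datasets, and the helper functions `accuracy` and `recall` are written here for the illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive examples that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical project (assumed numbers): 1000 modules, 50 fault prone (label 1).
labels = [1] * 50 + [0] * 950

# A "model" that classifies every module as not fault prone (label 0).
predictions = [0] * len(labels)

majority_accuracy = accuracy(labels, predictions)
fault_recall = recall(labels, predictions)

print(f"accuracy: {majority_accuracy:.2f}")
print(f"recall on fault-prone modules: {fault_recall:.2f}")
```

Under these assumed numbers the trivial model is correct on 950 of 1000 modules, yet it finds none of the 50 fault-prone ones, which is why overall accuracy alone is a poor guide when classes are imbalanced.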
Subjects: 12. Machine Learning and Discovery; 10. Knowledge Acquisition
Submitted: Feb 22, 2008