A Distance-based Over-sampling Method for Learning from Imbalanced Data Sets

Jorge de la Calleja, Olac Fuentes

Many real-world domains present the problem of imbalanced data sets, where examples of one class significantly outnumber examples of other classes. This makes learning difficult, as learning algorithms that optimize accuracy over all training examples tend to classify all examples as belonging to the majority class. We introduce a method that addresses this problem by creating a balanced data set, which can improve the performance of classifiers. Our method over-samples the minority class, using a randomized weighted distance scheme to generate synthetic examples in the neighborhood of each minority example.
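
The abstract does not spell out the weighting scheme, so the following is only a minimal sketch of what distance-weighted over-sampling of a minority class might look like. The function name `oversample_minority`, the neighbor count `k`, the inverse-distance weights, and the linear interpolation step are all assumptions made for illustration and should not be read as the authors' exact procedure.

```python
import math
import random


def oversample_minority(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points near the given minority examples.

    Hypothetical sketch: the neighbor count k, the inverse-distance
    weighting, and the interpolation step are assumptions, not details
    taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i in range(n_new):
        # Cycle through the minority examples so each one is used as a base.
        idx = i % len(minority)
        base = minority[idx]
        # Find the k nearest other minority examples (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != idx]
        others.sort(key=lambda p: math.dist(base, p))
        neighbors = others[:k]
        # Assumed weighting: closer neighbors are more likely to be picked.
        weights = [1.0 / (math.dist(base, p) + 1e-12) for p in neighbors]
        target = rng.choices(neighbors, weights=weights, k=1)[0]
        # Place the new point at a random position between base and target.
        t = rng.random()
        synthetic.append(tuple(b + t * (q - b) for b, q in zip(base, target)))
    return synthetic


# Example: six synthetic points from four minority examples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = oversample_minority(minority, n_new=6, k=2)
```

Under these assumptions each synthetic point lies on a segment between a minority example and one of its nearest minority neighbors, so the new examples stay in the neighborhood of existing ones. Calling the function with `n_new` equal to the gap between majority and minority counts would yield a balanced training set.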

Subjects: 12. Machine Learning and Discovery

Submitted: Feb 11, 2007

This page is copyrighted by AAAI. All rights reserved.