AAAI Publications, The Twenty-Seventh International Flairs Conference

Comparison of Data Sampling Approaches for Imbalanced Bioinformatics Data
David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald, Amri Napolitano

Last modified: 2014-05-03


Class imbalance is a frequent problem in bioinformatics datasets. Unfortunately, the minority class is usually also the class of interest. One way to improve this situation is data sampling. There are a number of different data sampling methods, each with its own strengths and weaknesses, which makes choosing one difficult. In our work we compare three data sampling techniques (Random Undersampling, Random Oversampling, and SMOTE) on six bioinformatics datasets with varying levels of class imbalance. Additionally, we apply two different classifiers to the problem (5-NN and SVM), and use feature selection to reduce our datasets to 25 features prior to applying sampling. Our results show that there is very little difference between the data sampling techniques, although Random Undersampling is most frequently the top-performing data sampling technique for both of our classifiers. We also performed statistical analysis, which confirms that there is no statistically significant difference between the techniques. Therefore, our recommendation is to use Random Undersampling when choosing a data sampling technique, because it is less computationally expensive to implement than SMOTE and it also reduces the size of the dataset, which lowers subsequent computational costs without sacrificing classification performance.
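To make the three techniques concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It rests on several assumptions that the abstract does not state: each class is balanced to a 1:1 ratio, instances are tuples of numeric features, the majority class is at least as large as the minority class, and the SMOTE variant interpolates toward one of k = 5 nearest minority neighbours using squared Euclidean distance. The function names are illustrative only.

```python
import random


def random_undersample(majority, minority, rng):
    """Randomly discard majority instances until both classes are equal in size."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """Randomly duplicate minority instances until both classes are equal in size."""
    needed = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(needed)]
    return list(majority), list(minority) + extra


def smote(majority, minority, rng, k=5):
    """Create synthetic minority instances by interpolating between a minority
    instance and one of its k nearest minority neighbours (k is an assumption)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(len(majority) - len(minority)):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen instance, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return list(majority), list(minority) + synthetic


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: 20 majority and 4 minority instances, 2 features each.
    majority = [(float(i), float(i % 3)) for i in range(20)]
    minority = [(100.0 + i, 50.0) for i in range(4)]
    for sampler in (random_undersample, random_oversample, smote):
        maj, mino = sampler(majority, minority, rng)
        print(sampler.__name__, len(maj), len(mino))
```

The sketch also illustrates the cost argument made above: undersampling only draws a random subset and shrinks the training set, whereas this SMOTE variant must compute neighbour distances for every synthetic instance and enlarges the training set.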
