AAAI Publications, Twenty-Fifth International FLAIRS Conference

Robustness of Threshold-Based Feature Rankers with Data Sampling on Noisy and Imbalanced Data
Ahmad Abu Shanab, Taghi M. Khoshgoftaar, Randall Wald

Last modified: 2012-05-16


Gene selection has become a vital component in the
learning process when using high-dimensional gene
expression data. Although extensive research has been
done towards evaluating the performance of classifiers
trained with the selected features, the stability of
feature ranking techniques has received relatively
little study. This work evaluates the robustness of
eleven threshold-based feature selection techniques,
examining the impact of data sampling and class noise
on the stability of feature selection. To assess the
robustness of feature selection techniques, we use four
groups of gene expression datasets, employ eleven
threshold-based feature rankers, and generate artificial
class noise to better simulate real-world datasets. The
results demonstrate that although no ranker consistently
outperforms the others, MI and Dev show the best
stability on average, while GI and PR show the least
stability on average. Results also show that balancing
datasets through data sampling has, on average, no
positive impact on the stability of the feature ranking
techniques applied to those datasets. In addition,
larger feature subset sizes improve stability, but do
so reliably only for noisy datasets.
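The abstract does not name the stability metric used, but studies of feature-ranker robustness commonly compare the feature subsets selected from perturbed versions of a dataset (e.g., sampled or noise-injected copies) using a pairwise similarity measure. A minimal sketch, assuming Kuncheva's consistency index as the similarity measure (the function names here are illustrative, not from the paper):

```python
from itertools import combinations

def consistency_index(a, b, n_features):
    """Kuncheva's consistency index between two equal-size feature subsets.

    Ranges from roughly -1 to 1; 1 means identical subsets, values near 0
    mean the overlap is no better than chance given the subset size.
    """
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k or not 0 < k < n_features:
        raise ValueError("subsets must be equal-size and smaller than n_features")
    r = len(a & b)                     # observed overlap
    expected = k * k / n_features      # overlap expected by chance
    return (r - expected) / (k - expected)

def average_stability(subsets, n_features):
    """Mean pairwise consistency over subsets chosen from perturbed datasets."""
    pairs = list(combinations(subsets, 2))
    return sum(consistency_index(a, b, n_features) for a, b in pairs) / len(pairs)
```

For example, if a ranker picks the same top-3 genes out of 10 on every perturbed copy, `average_stability` returns 1.0; disjoint picks yield a negative score.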


Keywords: data sampling; imbalanced data; class noise
