Oversampling for Imbalanced Data via Optimal Transport

  • Yuguang Yan South China University of Technology
  • Mingkui Tan South China University of Technology
  • Yanwu Xu Baidu, Inc.
  • Jiezhang Cao South China University of Technology
  • Michael Ng Hong Kong Baptist University
  • Huaqing Min South China University of Technology
  • Qingyao Wu South China University of Technology

Abstract

Data imbalance arises in many real-world applications, especially in medical diagnosis, where normal cases usually far outnumber abnormal ones. One of the most important approaches to alleviating this issue is oversampling, which synthesizes minority class samples to balance the number of samples across classes. However, existing methods barely consider the global geometric information in the distribution of minority class samples, and may therefore incur a distribution mismatch between real and synthetic samples. In this paper, relying on optimal transport (Villani 2008), we propose an oversampling method that exploits the global geometric information of the data so that synthetic samples follow a distribution similar to that of the minority class samples. Moreover, we introduce a novel regularization based on synthetic samples and shift the distribution of minority class samples according to loss information. Experiments on toy and real-world data sets demonstrate the efficacy of the proposed method in terms of multiple metrics.
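
To make the idea of matching synthetic samples to the minority distribution concrete, the following is a minimal illustrative sketch, not the method proposed in the paper. It assumes a generic entropy-regularized optimal transport solver (Sinkhorn iterations) followed by a barycentric projection, and it omits the paper's regularization and loss-based distribution shift entirely. The function names `sinkhorn_plan` and `ot_oversample`, the noise scale, and the parameters `reg` and `n_iters` are all hypothetical choices made for illustration.

```python
import numpy as np


def sinkhorn_plan(a, b, cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between histograms a (rows) and b (columns).

    Generic Sinkhorn iterations; an assumption of this sketch, not the paper's solver.
    """
    K = np.exp(-cost / reg)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # scale columns toward marginal b
        u = a / (K @ v)               # scale rows toward marginal a
    return u[:, None] * K * v[None, :]


def ot_oversample(minority, n_new, reg=0.1, seed=0):
    """Return n_new synthetic points pulled toward the minority distribution."""
    rng = np.random.default_rng(seed)
    # Hypothetical initialization: noisy copies of randomly chosen minority samples.
    idx = rng.integers(0, len(minority), size=n_new)
    synth = minority[idx] + rng.normal(scale=0.5, size=(n_new, minority.shape[1]))

    # Uniform weights on synthetic and real minority samples.
    a = np.full(n_new, 1.0 / n_new)
    b = np.full(len(minority), 1.0 / len(minority))

    # Squared Euclidean cost, rescaled to [0, 1] for numerical stability.
    cost = ((synth[:, None, :] - minority[None, :, :]) ** 2).sum(axis=-1)
    cost = cost / max(cost.max(), 1e-12)

    plan = sinkhorn_plan(a, b, cost, reg)
    # Barycentric projection: each synthetic point becomes a plan-weighted
    # average of the real minority samples it is coupled with.
    return (plan @ minority) / plan.sum(axis=1, keepdims=True)
```

Because the transport plan couples the whole synthetic set with the whole minority set at once, each synthetic point is placed using global information about the minority distribution, rather than only a local neighborhood as in interpolation-based oversampling. One consequence of the barycentric projection used here is that every output lies in the convex hull of the real minority samples.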

Published
2019-07-17
Section
AAAI Technical Track: Machine Learning