Learning from Biased Data Using Mixture Models

A. J. Feelders

Databases sometimes contain a non-random sample from the population of interest. This complicates the use of extracted knowledge for predictive purposes. We consider a specific type of biased data that is of considerable practical interest, namely non-random partially classified data. This type of data typically results when some screening mechanism determines whether the correct class of a particular case is known. In credit scoring, the problem of learning from such a biased sample is called "reject inference," since the class label (e.g., good or bad loan) of rejected loan applications is unknown. We show that maximum likelihood estimation of so-called mixture models is appropriate for this type of data, and discuss an experiment performed on simulated data using mixtures of normal components. The benefits of this approach are shown by comparing its results with those of sample-based discriminant analysis. We also give some directions for extending the analysis to allow for non-normal components and missing attribute values, in order to make it suitable for "real-life" biased data.
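To make the idea concrete, the following is a minimal sketch of maximum likelihood estimation for a mixture of normal components on partially classified data, using the EM algorithm. It is not the experiment described above. The function names `simulate` and `fit_mixture`, the single attribute, the two classes, the component means 0 and 3, and the screening threshold of 1.5 are all illustrative assumptions. The sketch also assumes that the screening mechanism depends only on the observed attribute.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def simulate(n=2000, seed=0, threshold=1.5):
    """Hypothetical biased sample: the class is only recorded when x > threshold."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c = 1 if rng.random() < 0.5 else 0
        x = rng.gauss(3.0 if c == 1 else 0.0, 1.0)
        data.append((x, c if x > threshold else None))  # None: class unknown
    return data


def fit_mixture(data, n_iter=200):
    """EM for a two-component univariate normal mixture.

    data: list of (x, label) pairs; label is 0, 1, or None.
    Returns (weights, means, variances), each a list of length 2.
    Assumes at least one classified case per class.
    """
    xs_all = [x for x, _ in data]
    grand_mean = sum(xs_all) / len(xs_all)
    pooled_var = sum((x - grand_mean) ** 2 for x in xs_all) / len(xs_all)
    # Start from the classified cases for the means, and from the whole
    # sample for the variances (the classified cases are a truncated sample).
    means = []
    for k in (0, 1):
        xs = [x for x, c in data if c == k]
        means.append(sum(xs) / len(xs))
    variances = [pooled_var, pooled_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: known classes give 0/1 responsibilities; unknown classes
        # get posterior class probabilities under the current parameters.
        resp = []
        for x, c in data:
            if c is not None:
                resp.append([1.0 if c == k else 0.0 for k in (0, 1)])
                continue
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0.0 else list(weights))
        # M-step: weighted maximum likelihood updates using all cases.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, (x, _) in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, (x, _) in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsing component
            weights[k] = nk / len(data)
    return weights, means, variances


if __name__ == "__main__":
    sample = simulate()
    classified_0 = [x for x, c in sample if c == 0]
    print("class 0 mean, classified cases only:", sum(classified_0) / len(classified_0))
    print("mixture estimates (weights, means, variances):", fit_mixture(sample))
```

In this sketch, the mean of class 0 computed from the classified cases alone necessarily lies above the screening threshold, because only cases above the threshold are classified. The mixture fit also uses the unclassified cases, so it is not restricted in this way.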

