A. J. Feelders, Soong Chang, G. J. McLachlan
Standard predictive data mining techniques implicitly assume random sampling, but databases seldom contain random samples from the population of interest. This is not surprising, since company databases are maintained primarily to support vital business processes rather than for analysis. The bias present in many databases poses a major threat to the validity of data mining results. We focus on a form of selectivity bias that occurs frequently when data mining is applied to scoring. Our approach is illustrated on a credit database of a large Dutch bank, containing financial data of companies that applied for a loan, together with a class label indicating the repayment behavior of the accepted applicants. For the missing class labels of the rejected applicants, we argue that the missing at random (MAR) case is becoming increasingly important, because many banks now use formal selection models. The classification problem is modeled with mixture distributions, using likelihood-based inference via the EM algorithm. Since the distribution of financial ratios is notably non-normal, the class-conditional densities are themselves modeled as mixtures of normal components. The analysis shows that mixtures of two normal components usually give a satisfactory fit to the within-class empirical distribution of the ratios. The results of our comparative study demonstrate the selectivity bias caused by ignoring the rejects in the learning process.
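The idea of fitting a mixture model by EM when the rejects have no class label can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several simplifying assumptions: a single univariate feature, one normal component per class (rather than a mixture of two normals within each class), and synthetic data in which acceptance is decided by a cutoff on the observed feature alone, so that the labels are missing at random. The function names `normal_pdf` and `em_with_rejects`, the cutoff, and all parameter values are hypothetical.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def em_with_rejects(x_acc, y_acc, x_rej, n_iter=100):
    """EM for a two-class normal mixture; rejects have no class label."""
    x_all = np.concatenate([x_acc, x_rej])
    # Accepted cases keep their observed label as a fixed 0/1 responsibility.
    r_acc = np.column_stack([y_acc == 0, y_acc == 1]).astype(float)
    # Start from the estimates based on accepted cases only.
    prior = r_acc.mean(axis=0)
    mean = np.array([x_acc[y_acc == k].mean() for k in (0, 1)])
    var = np.array([x_acc[y_acc == k].var() for k in (0, 1)])
    loglik = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities of the rejects.
        joint_acc = prior * normal_pdf(x_acc[:, None], mean, var)
        joint_rej = prior * normal_pdf(x_rej[:, None], mean, var)
        r_rej = joint_rej / joint_rej.sum(axis=1, keepdims=True)
        # Observed-data log-likelihood at the current parameters.
        loglik.append(np.log((r_acc * joint_acc).sum(axis=1)).sum()
                      + np.log(joint_rej.sum(axis=1)).sum())
        # M-step: weighted maximum likelihood updates over all applicants.
        r = np.vstack([r_acc, r_rej])
        weight = r.sum(axis=0)
        prior = weight / len(x_all)
        mean = (r * x_all[:, None]).sum(axis=0) / weight
        var = (r * (x_all[:, None] - mean) ** 2).sum(axis=0) / weight
    return prior, mean, var, np.array(loglik)


# Synthetic applicants (assumed values): class 1 = good, class 0 = bad.
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.7).astype(int)
x = rng.normal(loc=np.where(y == 1, 2.0, 0.0), scale=1.0)

# Formal selection rule: acceptance depends on the observed x only (MAR).
accepted = x > 0.5
x_acc, y_acc, x_rej = x[accepted], y[accepted], x[~accepted]

# Class means estimated from accepted cases only, versus EM with rejects.
mean_acc_only = np.array([x_acc[y_acc == k].mean() for k in (0, 1)])
prior_em, mean_em, var_em, loglik = em_with_rejects(x_acc, y_acc, x_rej)
print("accepted only:  ", mean_acc_only)
print("EM with rejects:", mean_em)
```

Because every accepted case in this toy setup lies above the cutoff, the class means estimated from accepted cases alone are pushed upward, and including the rejects through their posterior class probabilities pulls both means back down. How close the EM estimates come to the generating values depends on the sample, the starting point, and how well the normal assumption holds, so the sketch should be read as an illustration of the mechanism rather than a reproduction of the study.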