DyS: A Framework for Mixture Models in Quantification
Quantification is an expanding research topic in the Machine Learning literature. While classification aims to predict the class of each individual observation, quantification aims to estimate the total number of instances that belong to each class. This subtle difference has enabled the development of several algorithms that incur smaller and more consistent errors than simply counting the classes issued by a classifier. Among these quantification methods, one family stands out for its accuracy, simplicity, and ability to operate with imbalanced training samples: Mixture Models (MM). Despite these desirable traits, the influence of internal parameters on the performance of MM, as a class of algorithms, is not yet well understood. In this paper, we generalize MM with a base framework called DyS: Distribution y-Similarity. With this framework, we perform a thorough evaluation of the most critical design decisions of MM models. For instance, we assess 15 dissimilarity functions for comparing histograms, with the number of bins varying from 2 to 110, and, for the first time, connect quantification accuracy to test sample size, with experiments covering 24 public benchmark datasets. We conclude that, when tuned, Topsøe is the histogram distance function that consistently leads to smaller quantification errors and is therefore recommended for general use, notwithstanding the popularity of Hellinger Distance. To free MM models from the dependency on a choice of the number of histogram bins, we introduce two dissimilarity functions that operate directly on observations. We show that SORD, one of these measures, performs only slightly worse than the tuned Topsøe while not requiring the sensitive parameterization of the number of bins.
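To make the mixture-model idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary problem in which a classifier has already produced scores in [0, 1] for known positive training instances, known negative training instances, and an unlabeled test sample. The function names (`topsoe`, `histogram`, `estimate_positive_proportion`), the default of 10 bins, the small smoothing constant `eps`, and the plain grid search over the mixing proportion are all illustrative assumptions chosen for clarity.

```python
import numpy as np


def topsoe(p, q, eps=1e-12):
    """Topsoe distance between two normalized histograms.

    eps is an assumed smoothing constant that keeps empty bins out of log(0).
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    m = p + q
    return float(np.sum(p * np.log(2 * p / m) + q * np.log(2 * q / m)))


def histogram(scores, bins):
    """Normalized histogram of scores assumed to lie in [0, 1]."""
    counts, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()


def estimate_positive_proportion(pos_scores, neg_scores, test_scores,
                                 bins=10, grid=101):
    """Return the mixing proportion alpha whose mixture
    alpha * H_pos + (1 - alpha) * H_neg is closest to the test histogram.

    A plain grid search over alpha is used here for simplicity.
    """
    h_pos = histogram(pos_scores, bins)
    h_neg = histogram(neg_scores, bins)
    h_test = histogram(test_scores, bins)
    alphas = np.linspace(0.0, 1.0, grid)
    dists = [topsoe(a * h_pos + (1.0 - a) * h_neg, h_test) for a in alphas]
    return float(alphas[int(np.argmin(dists))])


if __name__ == "__main__":
    # Synthetic scores (an assumption for illustration, not data from the paper).
    rng = np.random.default_rng(0)
    pos_train = rng.beta(5, 2, size=2000)   # positives tend to score high
    neg_train = rng.beta(2, 5, size=2000)   # negatives tend to score low
    test = np.concatenate([rng.beta(5, 2, size=600),    # 30% positives
                           rng.beta(2, 5, size=1400)])  # 70% negatives
    print(estimate_positive_proportion(pos_train, neg_train, test))
```

In this sketch the number of bins is exactly the parameter whose choice the abstract describes as sensitive; swapping `topsoe` for another histogram dissimilarity leaves the rest of the procedure unchanged.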