AAAI Publications, The Thirty-First International Flairs Conference

Detecting Simpson’s Paradox
Chenguang Xu, Sarah M. Brown, Christan Grant

Last modified: 2018-05-10


Simpson’s paradox is the phenomenon in which the trend of an association in the whole population reverses within the subpopulations defined by a categorical variable. Detecting Simpson’s paradox reveals surprising and interesting patterns in a data set. The paradox is generally discussed in terms of binary variables, and studies that explore it for continuous variables are relatively rare. This paper describes a method for discovering Simpson’s paradox in the trend of a pair of continuous variables. The correlation coefficient is used to indicate the association between the two variables, and categorical variables are used to partition the whole data set into groups. The goal of our algorithm is to find sign reversals between the correlation coefficient measured within a group and the one measured on the entire data set. We show that our approach detects cases in both real and synthetic data sets, and demonstrate that detecting occurrences of Simpson’s paradox can uncover hidden, surprising patterns. This paper also proposes an approach that exploits sampled data for early detection of Simpson’s paradox. Finally, we report the running time of the algorithm under combinations of different conditions.
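
The sign-reversal check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Pearson's correlation coefficient (the abstract does not say which coefficient is used), and the names `pearson`, `detect_reversals`, and the toy rows are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant variable has no defined correlation
    return sxy / sqrt(sxx * syy)


def detect_reversals(rows, x, y, group_by):
    """Return (r_all, {group: r_group}) for every group whose correlation
    sign is opposite to the sign measured on the whole data set."""
    r_all = pearson([r[x] for r in rows], [r[y] for r in rows])
    reversed_groups = {}
    if r_all is None or r_all == 0:
        return r_all, reversed_groups  # no overall trend to reverse

    # Partition the rows by the value of the categorical variable.
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_by]].append(r)

    for key, members in groups.items():
        r_g = pearson([m[x] for m in members], [m[y] for m in members])
        if r_g is not None and r_g * r_all < 0:
            reversed_groups[key] = r_g
    return r_all, reversed_groups


# Hypothetical toy data: y falls with x inside each group,
# but rises with x when the groups are pooled.
rows = [
    {"g": "a", "x": 1, "y": 10}, {"g": "a", "x": 2, "y": 9}, {"g": "a", "x": 3, "y": 8},
    {"g": "b", "x": 6, "y": 15}, {"g": "b", "x": 7, "y": 14}, {"g": "b", "x": 8, "y": 13},
]
r_all, reversed_groups = detect_reversals(rows, "x", "y", "g")
print(r_all, reversed_groups)
```

In this toy data the pooled correlation is positive while both groups have a negative correlation, so both groups are reported. A fuller version would loop over every pair of continuous variables and every categorical variable, and could run the same check on a random sample of the rows first, in the spirit of the early-detection idea mentioned in the abstract.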


Simpson's paradox; correlation coefficient
