Simpson’s paradox

A collection of common dataviz caveats by Data-to-Viz.com




Definition


Simpson’s paradox, or the Yule–Simpson effect, is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox.

Wikipedia

Example


Let’s consider the following scatterplot built from dummy data.

# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)

# Create data: within each group, y decreases as x increases
a <- data.frame(x = rnorm(100), y = rnorm(100)) %>% mutate(y = y - x/2)

# Groups B and C are copies of A shifted up and to the right
b <- a %>% mutate(x = x + 2, y = y + 2)
c <- a %>% mutate(x = x + 4, y = y + 4)

# Stack the three groups and label them
data <- bind_rows(a, b, c) %>%
  mutate(group = rep(c("A", "B", "C"), each = 100))

# Scatterplot of the pooled data, ignoring the groups
ggplot(data, aes(x = x, y = y)) +
  geom_point(size = 2) +
  theme_ipsum()

Here, it seems perfectly reasonable to conclude that x and y are positively correlated. Indeed, the Pearson correlation coefficient is 0.63 (the exact value changes from run to run because the data are drawn at random).
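The coefficient quoted above can be computed with base R's `cor()`. The following is a minimal sketch, not the original chunk: it rebuilds the same dummy data in base R, and the `set.seed(42)` call and the names `group_a`, `group_b`, `group_c` and `pooled_cor` are assumptions added here for reproducibility and readability. The value it prints will not necessarily be 0.63, since that number came from a different random draw.

```r
# Assumption: a fixed seed (not in the original) makes the numbers reproducible
set.seed(42)

# Same construction as the dummy data above, written in base R
group_a <- data.frame(x = rnorm(100), y = rnorm(100))
group_a$y <- group_a$y - group_a$x / 2   # negative slope inside the group
group_b <- transform(group_a, x = x + 2, y = y + 2)
group_c <- transform(group_a, x = x + 4, y = y + 4)

data <- rbind(group_a, group_b, group_c)
data$group <- rep(c("A", "B", "C"), each = 100)

# Pearson correlation on the pooled data, ignoring the groups
pooled_cor <- cor(data$x, data$y, method = "pearson")
print(round(pooled_cor, 2))
```

Whatever the seed, the pooled coefficient should come out clearly positive, because the shifts between groups dominate the variation in both x and y.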


However, let’s check what happens if we consider the groups present in the dataset (3 groups):

# Reuse the `data` object created above so that both plots show the same points,
# this time colouring each point by its group
ggplot(data, aes(x = x, y = y, color = group)) +
  geom_point(size = 3) +
  scale_color_viridis(discrete = TRUE) +
  theme_ipsum()

Here, we see that the positive correlation came from the differences between groups. Within each group taken separately, the correlation coefficient is actually negative.
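The within-group claim can be checked by computing one correlation per group. This is a hedged sketch in base R under the same assumptions as the dummy data (the seed and the names `group_a`, `group_b`, `group_c` and `per_group_cor` are illustrative additions, not part of the original code).

```r
# Assumption: a fixed seed (not in the original) makes the numbers reproducible
set.seed(42)

# Rebuild the dummy data: three groups, each with a negative internal slope
group_a <- data.frame(x = rnorm(100), y = rnorm(100))
group_a$y <- group_a$y - group_a$x / 2
group_b <- transform(group_a, x = x + 2, y = y + 2)
group_c <- transform(group_a, x = x + 4, y = y + 4)

data <- rbind(group_a, group_b, group_c)
data$group <- rep(c("A", "B", "C"), each = 100)

# Split the data by group and compute the Pearson correlation inside each one
per_group_cor <- sapply(split(data, data$group), function(d) cor(d$x, d$y))
print(round(per_group_cor, 2))
```

Because groups B and C are shifted copies of A, and correlation is unaffected by adding a constant, the three coefficients are identical in this dataset. With real data they would typically differ from group to group.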

This is Simpson’s paradox: the trend between two variables reverses when a third variable is taken into account.
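For this particular dummy dataset, the reversal can also be worked out by hand with the law of total covariance. The sketch below uses population values rather than sample estimates, and assumes what the data-generating code implies: a group label $G$ that takes the values A, B, C with equal probability, group means of 0, 2 and 4 for both variables, a within-group variance of 1 for $X$, and $Y = \varepsilon - X/2$ with independent standard normal noise $\varepsilon$.

```latex
\begin{align*}
\operatorname{Cov}(X, Y)
  &= \mathbb{E}\big[\operatorname{Cov}(X, Y \mid G)\big]
   + \operatorname{Cov}\big(\mathbb{E}[X \mid G],\, \mathbb{E}[Y \mid G]\big) \\
  &= -\tfrac{1}{2} + \tfrac{8}{3} = \tfrac{13}{6} \approx 2.17 \\[4pt]
\operatorname{Var}(X) &= 1 + \tfrac{8}{3} = \tfrac{11}{3}, \qquad
\operatorname{Var}(Y) = \tfrac{5}{4} + \tfrac{8}{3} = \tfrac{47}{12} \\[4pt]
\rho(X, Y) &= \frac{13/6}{\sqrt{(11/3)(47/12)}} = \frac{13}{\sqrt{517}} \approx 0.57 \\[4pt]
\rho(X, Y \mid G) &= \frac{-1/2}{\sqrt{1 \cdot 5/4}} \approx -0.45
\end{align*}
```

The within-group term is negative, but the between-group term (the variance of the group means, 8/3) is larger and positive, so the pooled correlation ends up positive. A sample estimate will differ somewhat from these population values because the data are a random draw.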

Impact on dataviz


The impact is strong for data analytics and dataviz. Simpson’s paradox can lead to wrong conclusions based on spurious correlations. Always double-check the potential effect of confounding variables available in your dataset.
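One possible way to run such a check is to fit a linear model with and without the candidate confounder and compare the slope of x. This is a sketch under assumptions rather than a general recipe: it reuses the dummy data from the example, with a fixed seed and the names `pooled_fit`, `adjusted_fit` and `slopes` introduced here for illustration only.

```r
# Assumption: a fixed seed (not in the original) makes the numbers reproducible
set.seed(42)

# Rebuild the dummy data from the example
group_a <- data.frame(x = rnorm(100), y = rnorm(100))
group_a$y <- group_a$y - group_a$x / 2
group_b <- transform(group_a, x = x + 2, y = y + 2)
group_c <- transform(group_a, x = x + 4, y = y + 4)

data <- rbind(group_a, group_b, group_c)
data$group <- rep(c("A", "B", "C"), each = 100)

# Model 1 ignores the groups; model 2 adjusts for them
pooled_fit   <- lm(y ~ x, data = data)
adjusted_fit <- lm(y ~ x + group, data = data)

# Compare the estimated slope of x in both models
slopes <- c(
  pooled   = unname(coef(pooled_fit)["x"]),
  adjusted = unname(coef(adjusted_fit)["x"])
)
print(round(slopes, 2))
```

A slope that changes sign once `group` enters the model, as it does here, is a strong hint that the pooled trend is driven by differences between groups rather than by the relationship inside them.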


A work by Yan Holtz for data-to-viz.com