Definition

Simpson’s paradox, or the Yule–Simpson effect, is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox.

Wikipedia

Example

Let’s consider the following scatterplot built from dummy data.

# Libraries
library(tidyverse)
library(hrbrthemes)
library(babynames)
library(viridis)

# Create data
a <- data.frame( x = rnorm(100), y = rnorm(100)) %>% mutate(y = y-x/2)
b <- a %>% mutate(x=x+2) %>% mutate(y=y+2)
c <- a %>% mutate(x=x+4) %>% mutate(y=y+4)
data <- do.call(rbind, list(a,b,c))
data <- data %>% mutate(group=rep(c("A", "B", "C"), each=100))

ggplot(data, aes(x=x, y=y)) +
  geom_point( size=2) +
  theme_ipsum()

Here, it totally makes sense to state that there is a positive correlation between the X and the Y axis. Actually, the Pearson correlation coefficient is 0.55.

However, let’s check what happens if we consider the groups present in the dataset (3 groups):

# Libraries
library(tidyverse)
library(hrbrthemes)
library(babynames)
library(viridis)

# Create data
a <- data.frame( x = rnorm(100), y = rnorm(100)) %>% mutate(y = y-x/2)
b <- a %>% mutate(x=x+2) %>% mutate(y=y+2)
c <- a %>% mutate(x=x+4) %>% mutate(y=y+4)
data <- do.call(rbind, list(a,b,c))
data <- data %>% mutate(group=rep(c("A", "B", "C"), each=100))

ggplot(data, aes(x=x, y=y, color=group)) +
  geom_point( size=3) +
  scale_color_viridis(discrete=TRUE) +
  theme_ipsum()

Here, we understand that the positive correlation was due to a difference between groups. Actually, the correlation coefficient is even negative if each group is considered separately.

This is the Sympson’s paradox: the trend between two different variables reverses when a third variable is included.

Impact on dataviz

The impact is strong for data analytics and dataviz. The Simpson’s paradox can lead to a wrong conclusions with spurious correlation. Always double check the potential effect of confounding variables available in your dataset.

Going further

Simpson’s Paradox - Correlations and Slopes. By Prof. Eric A. Suess. It gives a nice real life example based on book price and number of pages in books.
The Wikipedia page

Definition

Example

Impact on dataviz

Going further

Dataviz decision tree