Simpson’s paradox, or the Yule–Simpson effect, is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is also known as the reversal paradox or the amalgamation paradox.
Let’s consider the following scatterplot, built from dummy data made of three groups of 100 points each and drawn without any group information.
# Libraries
library(tidyverse)
library(hrbrthemes)
library(babynames)
library(viridis)
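# Create data: three groups with a negative trend inside each group,
# shifted so that the group centers rise together along both axes
set.seed(42) # arbitrary seed, added so the figure is reproducible
a <- data.frame(x = rnorm(100), y = rnorm(100)) %>% mutate(y = y - x/2)
b <- a %>% mutate(x = x + 2, y = y + 2)
c <- a %>% mutate(x = x + 4, y = y + 4)
data <- rbind(a, b, c) %>% mutate(group = rep(c("A", "B", "C"), each = 100))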
# Scatterplot of all points, ignoring the group
ggplot(data, aes(x = x, y = y)) +
  geom_point(size = 2) +
  theme_ipsum()
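A quick numeric check can make the reversal concrete. The sketch below is not part of the original code: it rebuilds the same kind of dummy data in base R and compares the regression slope over all points with the slope inside each group. The seed and the names `g1`, `g2`, `g3`, `pooled`, `overall_slope` and `group_slopes` are assumptions chosen for illustration.

```r
# Minimal sketch in base R; seed and object names are illustrative assumptions
set.seed(1)

# Rebuild the same kind of dummy data: y decreases with x inside a group
g1 <- data.frame(x = rnorm(100), y = rnorm(100))
g1$y <- g1$y - g1$x / 2

# Shift the first group along both axes to create two more groups
g2 <- transform(g1, x = x + 2, y = y + 2)
g3 <- transform(g1, x = x + 4, y = y + 4)

pooled <- rbind(g1, g2, g3)
pooled$group <- rep(c("A", "B", "C"), each = 100)

# Slope of y on x when all points are pooled together
overall_slope <- coef(lm(y ~ x, data = pooled))[["x"]]

# Slope of y on x computed separately inside each group
group_slopes <- sapply(
  split(pooled, pooled$group),
  function(d) coef(lm(y ~ x, data = d))[["x"]]
)

overall_slope  # positive: the pooled trend goes up
group_slopes   # negative in every group: each within-group trend goes down
```

With this construction the pooled slope should come out positive while the three group slopes are negative, which is the reversal described above. The exact values depend on the seed.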