The Boxplot and its pitfalls

A collection of common dataviz caveats by

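A boxplot gives a nice summary of one or more numeric variables. It is composed of several elements: the line dividing the box marks the median; the ends of the box mark the lower (Q1) and upper (Q3) quartiles; the distance between them is the interquartile range (IQR); the whiskers extend to the most extreme observations lying within 1.5 x IQR of the box; and points beyond the whiskers are drawn individually as potential outliers.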

Here is a diagram showing the boxplot anatomy:

Anatomy of a boxplot (image source)
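
To make the anatomy concrete, here is a minimal sketch that computes each element by hand in base R. The sample `x` is hypothetical and chosen only for illustration; the 1.5 x IQR rule is the usual convention, and `quantile()` is used with its default method, so other tools may place the quartiles slightly differently.

```r
# Hypothetical sample, chosen only to illustrate the computation
x <- c(2, 4, 5, 6, 7, 8, 9, 11, 30)

# The box: lower quartile, median, upper quartile
q1  <- quantile(x, 0.25, names = FALSE)  # 5
med <- median(x)                         # 7
q3  <- quantile(x, 0.75, names = FALSE)  # 9
iqr <- q3 - q1                           # 4

# Fences sit 1.5 * IQR beyond the box
lower_fence <- q1 - 1.5 * iqr            # -1
upper_fence <- q3 + 1.5 * iqr            # 15

# Whiskers stop at the most extreme observations inside the fences
lower_whisker <- min(x[x >= lower_fence])  # 2
upper_whisker <- max(x[x <= upper_fence])  # 11

# Anything outside the fences is drawn as a potential outlier
outliers <- x[x < lower_fence | x > upper_fence]  # 30
```

Base R's `boxplot.stats()` reports similar values but builds the box from hinges rather than quantiles, so the two can differ slightly on some samples.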

A boxplot can summarize the distribution of a numeric variable for several groups. The problem is that summarizing also means losing information, and that can be a pitfall. Looking at the boxplot below, it is easy to conclude that group C has higher values than the others. However, we cannot see the underlying distribution of the individual points in each group, or how many observations each group contains.
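
As a rough illustration of what gets lost, the sketch below compares the quartiles a boxplot displays with the group sizes it leaves out. The data frame `toy`, its group labels, and the seed are all hypothetical and serve only this example.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical data: one large group and one very small group
toy <- data.frame(
  group = c(rep("large", 500), rep("small", 20)),
  value = c(rnorm(500, mean = 10, sd = 5), rnorm(20, mean = 25, sd = 4))
)

# Quartiles per group: roughly what the box of a boxplot encodes
tapply(toy$value, toy$group, quantile, probs = c(0.25, 0.5, 0.75))

# Observations per group: invisible on a plain boxplot
table(toy$group)
```

Both groups would get a box of the same visual weight, even though one rests on 25 times fewer observations than the other.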

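# Libraries
library(tidyverse)   # ggplot2 and the %>% pipe
library(hrbrthemes)  # theme_ipsum()
library(viridis)     # scale_fill_viridis()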

# Create a dataset
# Note: "B" labels two samples (means 13 and 18), so that group is bimodal
data <- data.frame(
  name = c(rep("A", 500), rep("B", 500), rep("B", 500), rep("C", 20), rep("D", 100)),
  value = c(rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1), rnorm(20, 25, 4), rnorm(100, 12, 1))
)

# Plot
data %>%
  ggplot( aes(x=name, y=value, fill=name)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE) +
    theme_ipsum() +
    theme(
      plot.title = element_text(size=11)
    ) +
    ggtitle("A somewhat misleading boxplot") +