A boxplot gives a nice summary of one or more numeric variables. A boxplot is composed of several elements:

The line that divides the box into 2 parts represents the median of the data. If the median is 10, it means that there are the same number of data points below and above 10.
The ends of the box shows the upper (Q3) and lower (Q1) quartiles. If the third quartile is 15, it means that 75% of the observation are lower than 15.
The difference between Quartiles 1 and 3 is called the interquartile range (IQR)
The extreme line shows Q3+1.5xIQR to Q1-1.5xIQR (the highest and lowest value excluding outliers).
Dots (or other markers) beyond the extreme line shows potntial outliers.

Here is a diagram showing the boxplot anatomy:

Anatomy of a boxplot (image source)

A boxplot can summarize the distribution of a numeric variable for several groups. The problem is that summarizing also means losing information, and that can be a pitfall. If we consider the boxplot below, it is easy to conclude that group C has a higher value than the others. However, we cannot see the underlying distribution of dots in each group or their number of observations.

# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(plotly)

# create a dataset
data <- data.frame(
  name=c( rep("A",500), rep("B",500), rep("B",500), rep("C",20), rep('D', 100)  ),
  value=c( rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1), rnorm(20, 25, 4), rnorm(100, 12, 1) )
)

# Plot
data %>%
  ggplot( aes(x=name, y=value, fill=name)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("A somewhat misleading boxplot") +
    xlab("")

Let’s see what happens when the boxplot is improved using additional elements.

Adding jitter

If the amount of data you are working with is not too large, adding jitter on top of your boxplot can make the graphic more insightful.

# Plot
data %>%
  ggplot( aes(x=name, y=value, fill=name)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE) +
    geom_jitter(color="grey", size=0.7, alpha=0.5) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("A boxplot with jitter") +
    xlab("")

Here some new patterns appear clearly. Group C has a small sample size compared to the other groups. This is definitely something you want to find out before saying that group C has a higher value than the others. Moreover, it looks like group B has a bimodal distribution: dots are distributed in 2 groups: around y=18 and y=13.

Switching to violin plot

If you have a large sample size, using jitter is not an option anymore since dots will overlap, making the figure uninterpretable. A alternative is the violin plot, which describes the distribution of the data for each group:

# sample size
sample_size = data %>% group_by(name) %>% summarize(num=n())

# Plot
data %>%
  left_join(sample_size) %>%
  mutate(myaxis = paste0(name, "\n", "n=", num)) %>%
  ggplot( aes(x=myaxis, y=value, fill=name)) +
    geom_violin(width=1.4) +
    geom_boxplot(width=0.1, color="grey", alpha=0.2) +
    scale_fill_viridis(discrete = TRUE) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("A boxplot with jitter") +
    xlab("")

Here it is very clear that the groups have different distributions. The bimodal distribution of group B becomes obvious. Violin plots are a powerful way to display information–they are probably under-utilized compared to boxplots.

Adding the sample size

On the previous chart, the sample size of each group is indicated on the x-axis, below the group name. This is a good practice and shows that group C is under-represented. However, it is sometimes better to show the data points themselves. Thus, a good alternative is a half violin plot showing the raw data. This uses code coming from jbburant and David Robinson.

# Code coming from @drob: https://gist.github.com/dgrtwo/eb7750e74997891d7c20#file-geom_flat_violin-r
"%||%" <- function(a, b) {
  if (!is.null(a)) a else b
}

geom_flat_violin <- function(mapping = NULL, data = NULL, stat = "ydensity",
                        position = "dodge", trim = TRUE, scale = "area",
                        show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    data = data,
    mapping = mapping,
    stat = stat,
    geom = GeomFlatViolin,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      trim = trim,
      scale = scale,
      ...
    )
  )
}

#' @rdname ggplot2-ggproto
#' @format NULL
#' @usage NULL
#' @export
GeomFlatViolin <-
  ggproto("GeomFlatViolin", Geom,
          setup_data = function(data, params) {
            data$width <- data$width %||%
              params$width %||% (resolution(data$x, FALSE) * 0.9)

            # ymin, ymax, xmin, and xmax define the bounding rectangle for each group
            data %>%
              group_by(group) %>%
              mutate(ymin = min(y),
                     ymax = max(y),
                     xmin = x,
                     xmax = x + width / 2)

          },

          draw_group = function(data, panel_scales, coord) {
            # Find the points for the line to go all the way around
            data <- transform(data, xminv = x,
                              xmaxv = x + violinwidth * (xmax - x))

            # Make sure it's sorted properly to draw the outline
            newdata <- rbind(plyr::arrange(transform(data, x = xminv), y),
                             plyr::arrange(transform(data, x = xmaxv), -y))

            # Close the polygon: set first and last point the same
            # Needed for coord_polar and such
            newdata <- rbind(newdata, newdata[1,])

            ggplot2:::ggname("geom_flat_violin", GeomPolygon$draw_panel(newdata, panel_scales, coord))
          },

          draw_key = draw_key_polygon,

          default_aes = aes(weight = 1, colour = "grey20", fill = "white", size = 0.5,
                            alpha = NA, linetype = "solid"),

          required_aes = c("x", "y")
)


# Final plot inspired from @jbburant: https://gist.github.com/jbburant/b3bd4961f3f5b03aeb542ed33a8fe062
data %>%
  sample_frac(0.4) %>%
  ggplot(aes(x = name, y = value, fill = name)) + 
    geom_flat_violin(scale = "count", trim = FALSE, width=2) + 
    scale_fill_viridis(discrete = TRUE) +
    stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "pointrange", position = position_nudge(4.9)) + 
    geom_dotplot(binaxis = "y", dotsize = 0.8, stackdir = "down", binwidth = 0.3, position = position_nudge(-0.025)) + 
    theme_ipsum() +
    theme(
      legend.position = "none"
    ) + 
    ylab("value")

Going further

Hintze et al. 1998, Violin Plots: A Box Plot-Density Trace Synergism. PDF
Making boxplots in R and Python
Making violin plots in R and Python

Comments

Any thoughts on this? Found any mistake? Disagree? Please drop me a word on twitter or in the comment section below: