A boxplot gives a nice summary of one or more numeric variables. A boxplot is composed of several elements:
Here is a diagram showing the boxplot anatomy:
A boxplot can summarize the distribution of a numeric variable for
several groups. The problem is that summarizing also means losing
information, and that can be a pitfall. If we consider the boxplot
below, it is easy to conclude that group C
has a higher
value than the others. However, we cannot see the underlying
distribution of dots in each group or their number of observations.
# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(plotly)
# create a dataset
data <- data.frame(
name=c( rep("A",500), rep("B",500), rep("B",500), rep("C",20), rep('D', 100) ),
value=c( rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1), rnorm(20, 25, 4), rnorm(100, 12, 1) )
)
# Plot
data %>%
ggplot( aes(x=name, y=value, fill=name)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("A somewhat misleading boxplot") +
xlab("")
Let’s see what happens when the boxplot is improved using additional elements.
If the amount of data you are working with is not too large, adding
jitter
on top of your boxplot can make the graphic more
insightful.
# Plot
data %>%
ggplot( aes(x=name, y=value, fill=name)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE) +
geom_jitter(color="grey", size=0.7, alpha=0.5) +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("A boxplot with jitter") +
xlab("")
Here some new patterns appear clearly. Group C
has a
small sample size compared to the other groups. This is definitely
something you want to find out before saying that group C
has a higher value than the others. Moreover, it looks like group
B
has a bimodal distribution
: dots are
distributed in 2 groups: around y=18 and y=13.
If you have a large sample size, using jitter
is not an
option anymore since dots will overlap, making the figure
uninterpretable. A alternative is the violin plot,
which describes the distribution of the data for each group:
# sample size
sample_size = data %>% group_by(name) %>% summarize(num=n())
# Plot
data %>%
left_join(sample_size) %>%
mutate(myaxis = paste0(name, "\n", "n=", num)) %>%
ggplot( aes(x=myaxis, y=value, fill=name)) +
geom_violin(width=1.4) +
geom_boxplot(width=0.1, color="grey", alpha=0.2) +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("A violin plot") +
xlab("")
Here it is very clear that the groups have different distributions. The
bimodal distribution of group
B
becomes obvious. Violin
plots are a powerful way to display information–they are probably
under-utilized compared to boxplots.
On the previous chart, the sample size of each group is indicated on
the x-axis, below the group name. This is a good practice and shows that
group C
is under-represented. However, it is sometimes
better to show the data points themselves. Thus, a good alternative is a
half violin plot showing the raw data. This uses code coming from jbburant
and David
Robinson.
# Code coming from @drob: https://gist.github.com/dgrtwo/eb7750e74997891d7c20#file-geom_flat_violin-r
"%||%" <- function(a, b) {
if (!is.null(a)) a else b
}
geom_flat_violin <- function(mapping = NULL, data = NULL, stat = "ydensity",
position = "dodge", trim = TRUE, scale = "area",
show.legend = NA, inherit.aes = TRUE, ...) {
layer(
data = data,
mapping = mapping,
stat = stat,
geom = GeomFlatViolin,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
trim = trim,
scale = scale,
...
)
)
}
#' @rdname ggplot2-ggproto
#' @format NULL
#' @usage NULL
#' @export
GeomFlatViolin <-
ggproto("GeomFlatViolin", Geom,
setup_data = function(data, params) {
data$width <- data$width %||%
params$width %||% (resolution(data$x, FALSE) * 0.9)
# ymin, ymax, xmin, and xmax define the bounding rectangle for each group
data %>%
group_by(group) %>%
mutate(ymin = min(y),
ymax = max(y),
xmin = x,
xmax = x + width / 2)
},
draw_group = function(data, panel_scales, coord) {
# Find the points for the line to go all the way around
data <- transform(data, xminv = x,
xmaxv = x + violinwidth * (xmax - x))
# Make sure it's sorted properly to draw the outline
newdata <- rbind(plyr::arrange(transform(data, x = xminv), y),
plyr::arrange(transform(data, x = xmaxv), -y))
# Close the polygon: set first and last point the same
# Needed for coord_polar and such
newdata <- rbind(newdata, newdata[1,])
ggplot2:::ggname("geom_flat_violin", GeomPolygon$draw_panel(newdata, panel_scales, coord))
},
draw_key = draw_key_polygon,
default_aes = aes(weight = 1, colour = "grey20", fill = "white", size = 0.5,
alpha = NA, linetype = "solid"),
required_aes = c("x", "y")
)
# Final plot inspired from @jbburant: https://gist.github.com/jbburant/b3bd4961f3f5b03aeb542ed33a8fe062
data %>%
sample_frac(0.4) %>%
ggplot(aes(x = name, y = value, fill = name)) +
geom_flat_violin(scale = "count", trim = FALSE, width=2) +
scale_fill_viridis(discrete = TRUE) +
stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "pointrange", position = position_nudge(4.9)) +
geom_dotplot(binaxis = "y", dotsize = 0.8, stackdir = "down", binwidth = 0.3, position = position_nudge(-0.025)) +
theme_ipsum() +
theme(
legend.position = "none"
) +
ylab("value")
Data To Viz is a comprehensive classification of chart types organized by data input format. Get a high-resolution version of our decision tree delivered to your inbox now!
A work by Yan Holtz for data-to-viz.com