The issue with error bars

A collection of common dataviz caveats by Data-to-Viz.com




Error bars give a general idea of how precise a measurement is, or conversely, how far from the reported value the true (error free) value might be. If the value displayed on your barplot is the result of an aggregation (like the mean value of several data points), you may want to display error bars.

# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(patchwork)

# create dummy data
data <- data.frame(
  name=letters[1:5],
  value=sample(seq(4,15),5),
  sd=c(1,0.2,3,2,4)
)

# Plot
ggplot(data) +
    geom_bar( aes(x=name, y=value), stat="identity", fill="#69b3a2", alpha=0.7, width=0.5) +
    geom_errorbar( aes(x=name, ymin=value-sd, ymax=value+sd), width=0.4, colour="black", alpha=0.9, size=1) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("A barplot with error bar") +
    xlab("")

In the graphic above 5 groups are reported. The bar heights represent their mean value. The black error bar gives information on how the individual observations are dispersed around the average. For instance, it appears that measurements in group B are more precise than in group E.

Error bars hide information


The first issue with error bars is that they hide information. Here is a figure from a paper in PLOS Biology. It illustrates that the full data may suggest different conclusions than the summary statistics. The same barplot with error bars (left) can represent several situations. Both groups can have the same kind of distribution (B), one group can have outliers (C), one group can have a bimodal distribution (D), or groups can have unequal sample sizes:


img
Weissgerber et al. 2015


Thus, the same barplot with error bars can in fact tell very different stories, hidden to the reader.

Always show your individual data points if you can #showyourdata

What is an error bar?


The second issue with error bars is that they are used to show different metrics, and it is not always clear which one is being shown. Three different types of values are commonly used for error bars, sometimes giving very different results. Here is an overview of their definitions and how to calculate them on a simple vector in R.

  • Standard Deviation (SD) represents the amount of dispersion of the variable. Calculated as the root square of the variance
sd <- sd(vec)
sd <- sqrt(var(vec))
  • Standard Error (SE) is the standard deviation of the mean of the variable, calculated as the SD divided by the square root of the sample size. By construction, the SE is smaller than the SD. With a very large sample size, the SE tends toward 0.
se = sd(vec) / sqrt(length(vec))
  • A Confidence Interval (CI) is defined so that there is a specified probability that a value lies within it. It is calculated as t * SE where t is the value of the Student’s t-distribution for a specific alpha. Its value is often rounded to 1.96 (its value with a large sample size). If the sample size is huge or the distribution not normal, it is better to calculate the CI using the bootstrap method, however.
alpha=0.05
t=qt((1-alpha)/2 + .5, length(vec)-1)   # tend to 1.96 if sample size is big enough
CI=t*se


Here is an application of these 3 metrics to the famous Iris dataset. It shows the average sepal length of three species of Iris. The variation around the average length is represented using error bars.

# Data
data <- iris %>% select(Species, Sepal.Length)

# Calculates mean, sd, se and ci
my_sum <- data %>%
  group_by(Species) %>%
  summarise(
    n=n(),
    mean=mean(Sepal.Length),
    sd=sd(Sepal.Length)
  ) %>%
  mutate( se=sd/sqrt(n))  %>%
  mutate( ic=se * qt((1-0.05)/2 + .5, n-1))

# Standard deviation
p1 <- ggplot(my_sum) +
  geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) +
  geom_errorbar( aes(x=Species, ymin=mean-sd, ymax=mean+sd), width=0.4, colour="black", alpha=0.9, size=1) +
  ggtitle("standard deviation") +
  theme(
    plot.title = element_text(size=6)
  ) +
  theme_ipsum() +
  xlab("") +
  ylab("Sepal Length")

# Standard Error
p2 <- ggplot(my_sum) +
  geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) +
  geom_errorbar( aes(x=Species, ymin=mean-se, ymax=mean+se),width=0.4, colour="black", alpha=0.9, size=1) +
  ggtitle("standard error") +
  theme(
    plot.title = element_text(size=6)
  ) +
  theme_ipsum() +
  xlab("") +
  ylab("Sepal Length")

# Confidence Interval
p3 <- ggplot(my_sum) +
  geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) +
  geom_errorbar( aes(x=Species, ymin=mean-ic, ymax=mean+ic), width=0.4, colour="black", alpha=0.9, size=1) +
  ggtitle("confidence interval") +
  theme(
    plot.title = element_text(size=6)
  ) +
  theme_ipsum() +
  xlab("") +
  ylab("Sepal Length")

p1 + p2 + p3

It is quite obvious that the 3 metrics report very different visualizations and conclusions.

Always specify which metrics you used for the error bars

Workaround


It is better to avoid error bars as much as you can. Of course it is not possible if you only have summary statistics. But if you know the individual data points, show them. Several workarounds are possible. The boxplot with jitter is a good one for a relatively small amount of data. The violin plot is another possibility if you have a large sample size to display.

data %>%
  ggplot( aes(x=Species, y=Sepal.Length)) +
    geom_boxplot( fill="#69b3a2", notch=TRUE) +
    geom_jitter( size=0.9, color="orange", width=0.1) +
    ggtitle("confidence interval") +
    theme(
      plot.title = element_text(size=6)
    ) +
    theme_ipsum() +
    xlab("") +
    ylab("Sepal Length")

Going further


  • Weissgerber et al. 2015, Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. link
  • Making boxplots in R and Python
  • Making violin plots in R and Python


Dataviz decision tree

Data To Viz is a comprehensive classification of chart types organized by data input format. Get a high-resolution version of our decision tree delivered to your inbox now!


High Resolution Poster
 

A work by Yan Holtz for data-to-viz.com