Error bars give a general idea of how precise a measurement is, or conversely, how far from the reported value the true (error-free) value might be. If the value displayed on your barplot is the result of an aggregation (like the mean of several data points), you may want to display error bars.
# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(patchwork)
# create dummy data
data <- data.frame(
name=letters[1:5],
value=sample(seq(4,15),5),
sd=c(1,0.2,3,2,4)
)
# Plot
ggplot(data) +
geom_bar( aes(x=name, y=value), stat="identity", fill="#69b3a2", alpha=0.7, width=0.5) +
geom_errorbar( aes(x=name, ymin=value-sd, ymax=value+sd), width=0.4, colour="black", alpha=0.9, linewidth=1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("A barplot with error bar") +
xlab("")
In the graphic above, 5 groups are reported. The bar heights represent their mean values. The black error bars show how the individual observations are dispersed around the average. For instance, it appears that measurements in group B are more precise than in group E.
The first issue with error bars is that they hide information. Here is a figure from a paper in PLOS Biology. It illustrates that the full data may suggest different conclusions than the summary statistics. The same barplot with error bars (left) can represent several situations. Both groups can have the same kind of distribution (B), one group can have outliers (C), one group can have a bimodal distribution (D), or groups can have unequal sample sizes.
Thus, the same barplot with error bars can in fact tell very different stories, hidden to the reader.
Always show your individual data points if you can #showyourdata
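To make this point concrete, here is a minimal sketch in base R. It builds three made-up samples (symmetric, with an outlier, bimodal) and rescales each one to the same mean and standard deviation. The data, the seed, the target values (mean 10, sd 2) and the `rescale()` helper are all assumptions made for illustration; they do not come from the PLOS Biology paper.

```r
# Hypothetical data: three samples with very different shapes.
set.seed(42)  # arbitrary seed, only for reproducibility
symmetric <- rnorm(20)
outlier   <- c(rnorm(19), 8)                       # one extreme value
bimodal   <- c(rnorm(10, mean = -3), rnorm(10, mean = 3))

# Hypothetical helper: shift and scale x so that its sample mean is m
# and its sample standard deviation is s. The shape of x is unchanged.
rescale <- function(x, m = 10, s = 2) {
  (x - mean(x)) / sd(x) * s + m
}

groups <- lapply(
  list(symmetric = symmetric, outlier = outlier, bimodal = bimodal),
  rescale
)

# The summary statistics are identical across the three groups...
summary_stats <- t(sapply(groups, function(x) c(mean = mean(x), sd = sd(x))))
print(round(summary_stats, 3))

# ...but the raw values are not: compare the ranges.
print(sapply(groups, range))
```

A barplot of mean and standard deviation would draw these three groups identically, whereas plotting the individual points would reveal the differences at once.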
The second issue with error bars is that they are used to show different metrics, and it is not always clear which one is being shown. Three different types of values are commonly used for error bars, sometimes giving very different results. Here is an overview of their definitions and how to calculate them on a simple vector in R.
# vec is any numeric vector; se is the standard error of its mean
se=sd(vec)/sqrt(length(vec))
alpha=0.05
t=qt((1-alpha)/2 + .5, length(vec)-1) # tends to 1.96 if sample size is big enough
CI=t*se
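As a worked example, the sketch below computes the three metrics side by side on a small numeric vector and cross-checks the confidence interval against base R's `t.test()`. The values in `vec` are made up for illustration, and the names `sd_vec`, `se_vec`, `ci_vec` and `t_crit` are only chosen here for readability.

```r
# Hypothetical sample; any numeric vector works.
vec <- c(1, 3, 5, 9, 38, 7, 2, 4, 9, 19, 19)
n   <- length(vec)

# 1. Standard deviation: spread of the individual observations.
sd_vec <- sd(vec)

# 2. Standard error of the mean: sd divided by sqrt(n).
se_vec <- sd_vec / sqrt(n)

# 3. Half-width of the 95% confidence interval of the mean, based on
#    Student's t distribution with n - 1 degrees of freedom.
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 1)  # same as qt((1-alpha)/2 + .5, n-1)
ci_vec <- t_crit * se_vec

# Cross-check against the interval reported by t.test().
ci_ttest <- as.numeric(t.test(vec, conf.level = 1 - alpha)$conf.int)

print(c(sd = sd_vec, se = se_vec, ci = ci_vec))
print(rbind(manual = mean(vec) + c(-1, 1) * ci_vec, t.test = ci_ttest))
```

For any sample with more than one observation, the standard error is smaller than the standard deviation, and the confidence interval half-width is the standard error stretched by the t quantile. This is why the choice of metric changes the length of the error bars so much.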
Here is an application of these 3 metrics to the famous Iris dataset. It shows the average sepal length of three species of Iris. The variation around the average length is represented using error bars.
# Data
data <- iris %>% select(Species, Sepal.Length)
# Calculates mean, sd, se and ci
my_sum <- data %>%
group_by(Species) %>%
summarise(
n=n(),
mean=mean(Sepal.Length),
sd=sd(Sepal.Length)
) %>%
mutate( se=sd/sqrt(n)) %>%
mutate( ic=se * qt((1-0.05)/2 + .5, n-1))
# Standard deviation
p1 <- ggplot(my_sum) +
geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) +
geom_errorbar( aes(x=Species, ymin=mean-sd, ymax=mean+sd), width=0.4, colour="black", alpha=0.9, linewidth=1) +
ggtitle("standard deviation") +
# apply the complete theme first: placed after theme(), it would overwrite the title size
theme_ipsum() +
theme(
plot.title = element_text(size=6)
) +
xlab("") +
ylab("Sepal Length")
# Standard Error
p2 <- ggplot(my_sum) +
geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) +
geom_errorbar( aes(x=Species, ymin=mean-se, ymax=mean+se), width=0.4, colour="black", alpha=0.9, linewidth=1) +
ggtitle("standard error") +
# apply the complete theme first: placed after theme(), it would overwrite the title size
theme_ipsum() +
theme(
plot.title = element_text(size=6)
) +
xlab("") +
ylab("Sepal Length")
# Confidence Interval
p3 <- ggplot(my_sum) +
geom_bar( aes(x=Species, y=mean), stat="identity", fill="#69b3a2", alpha=0.7, width=0.6) +
geom_errorbar( aes(x=Species, ymin=mean-ic, ymax=mean+ic), width=0.4, colour="black", alpha=0.9, linewidth=1) +
ggtitle("confidence interval") +
# apply the complete theme first: placed after theme(), it would overwrite the title size
theme_ipsum() +
theme(
plot.title = element_text(size=6)
) +
xlab("") +
ylab("Sepal Length")
p1 + p2 + p3
The 3 metrics produce error bars of very different lengths, and can therefore lead to very different conclusions.
Always specify which metric you used for the error bars
It is better to avoid error bars as much as you can. Of course it is not possible if you only have summary statistics. But if you know the individual data points, show them. Several workarounds are possible. The boxplot with jitter is a good one for a relatively small amount of data. The violin plot is another possibility if you have a large sample size to display.
data %>%
ggplot( aes(x=Species, y=Sepal.Length)) +
geom_boxplot( fill="#69b3a2", notch=TRUE) +
geom_jitter( size=0.9, color="orange", width=0.1) +
ggtitle("boxplot with jitter") +
# apply the complete theme first: placed after theme(), it would overwrite the title size
theme_ipsum() +
theme(
plot.title = element_text(size=6)
) +
xlab("") +
ylab("Sepal Length")
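The violin plot mentioned above can be sketched in the same way. This is a minimal example that assumes only `ggplot2` is installed and uses the built-in `iris` data directly; the narrow inner boxplot and the object name `p_violin` are choices made for this illustration.

```r
library(ggplot2)

# Violin plot of sepal length per species, using the built-in iris data.
# The narrow inner boxplot keeps the median and quartiles visible.
p_violin <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(fill = "#69b3a2", alpha = 0.7) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  xlab("") +
  ylab("Sepal Length") +
  ggtitle("violin plot")

print(p_violin)
```

The violin shows the estimated density of each group, so a bimodal or skewed distribution stays visible even when there are too many points to draw individually.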
A work by Yan Holtz for data-to-viz.com