Play with your histogram bin size

A collection of common dataviz caveats by Data-to-Viz.com




A histogram takes as input a numeric variable and cuts it into several bins. The number of observations in each bin is represented by the height of the bar. It is a very common type of graphic and most tools select a bin size value by default.
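To make the idea of a default concrete, here is a minimal base R sketch of three classic rules for picking the number of bins. The `prices` vector is simulated and hypothetical (it is not the Airbnb dataset used below). `nclass.Sturges()`, `nclass.scott()` and `nclass.FD()` ship with R in the grDevices package; base `hist()` uses the Sturges rule by default, while ggplot2's `stat_bin()` falls back to 30 bins when nothing is specified.

```r
set.seed(42)

# Simulated right-skewed "prices" (hypothetical, not the Airbnb data)
prices <- rlnorm(1000, meanlog = 4.5, sdlog = 0.5)

# Number of bins suggested by three common rules
n_sturges <- nclass.Sturges(prices)  # depends only on the sample size
n_scott   <- nclass.scott(prices)    # uses the standard deviation
n_fd      <- nclass.FD(prices)       # uses the interquartile range

print(c(Sturges = n_sturges, Scott = n_scott, FD = n_fd))
```

The rules generally disagree on the same data, which is one more reason not to accept a default without trying alternatives.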

However, the choice of bin size can strongly affect what the chart reveals. Let’s look at the distribution of Airbnb night prices on the French Riviera:

# Libraries
library(tidyverse)
library(hrbrthemes)
library(kableExtra)
options(knitr.table.format = "html")

# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)


data %>%
  filter( price<300 ) %>%
  ggplot( aes(x=price)) +
    stat_bin(breaks=seq(0,300,10), fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Night price distribution of Airbnb apartments") +
    theme_ipsum() +
    theme(
      plot.title = element_text(size=12)
    )

The price ranges between 10 and 300 euros, with most apartments costing between 60 and 150 euros per night. In this chart, prices are cut into 10-euro bins: between 0 and 10 euros a night, between 10 and 20, and so on. These bins are shown on the X-axis. The number of apartments in each bin is then counted and shown on the Y-axis.
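The counting step can be sketched in base R without any plotting. The toy `prices` vector below is hypothetical and only illustrates the mechanics: `hist(..., plot = FALSE)` returns the per-bin counts for a given set of breaks, and the same values produce different counts when the bin width changes.

```r
# Hypothetical toy prices, not taken from the Airbnb dataset
prices <- c(12, 18, 25, 31, 33, 38, 47, 52, 58, 58)

# Same range, two bin widths: 20 euros and 10 euros
breaks_wide   <- seq(0, 60, by = 20)
breaks_narrow <- seq(0, 60, by = 10)

# plot = FALSE returns the binning result instead of drawing it;
# by default each bin includes its upper bound, e.g. (10, 20]
counts_wide   <- hist(prices, breaks = breaks_wide,   plot = FALSE)$counts
counts_narrow <- hist(prices, breaks = breaks_narrow, plot = FALSE)$counts

print(counts_wide)
print(counts_narrow)
```

Every observation lands in exactly one bin, so both vectors sum to the number of prices; only the way the total is split changes.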


Let’s check what happens when prices are split into bins of 3 euros instead of 10:

data %>%
  filter( price<300 ) %>%
  ggplot( aes(x=price)) +
    stat_bin(breaks=seq(0,300,3), fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Night price distribution of Airbnb apartments") +
    theme_ipsum() +
    theme(
      plot.title = element_text(size=12)
    )

There is a huge difference between these two histograms. With narrower bins, a few values turn out to be over-represented in the dataset (like 58, 64, 69, 75, 80…), a pattern that the wider 10-euro bins smoothed away. This is definitely a signal you want to understand when analyzing your data.
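One way to follow up on such spikes is to count how often each exact value occurs. The snippet below is a minimal sketch on a hypothetical `prices` vector standing in for `data$price`; the threshold of 3 occurrences is arbitrary and chosen only for illustration.

```r
# Hypothetical prices standing in for data$price
prices <- c(58, 58, 58, 60, 61, 64, 64, 64, 64, 67, 69, 69, 70, 75, 75, 75)

# Count how often each exact value occurs, most frequent first
freq <- sort(table(prices), decreasing = TRUE)

# Keep the values that occur at least 3 times (arbitrary threshold)
frequent_values <- as.numeric(names(freq)[freq >= 3])

print(freq)
print(frequent_values)
```

On the real dataset, the same `table()` call applied to `data$price` would list the most common prices, which can then be compared with the spikes visible in the fine-grained histogram.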


A work by Yan Holtz for data-to-viz.com