Incorrect warning about clipping after adding wide xlim values to geom_histogram
Brief description of the problem
If I add xlim(), or set limits in scale_x_continuous(), on a geom_histogram() plot and the limits lie outside the range of the data, I see this warning message:
Removed 2 rows containing missing values (geom_bar()).
but in fact nothing has been removed.
True for me on: R version 4.2.2 (2022-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045)
and
R version 4.2.2 Patched (2022-11-10 r83330) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.1 LTS
and
R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045)
All three show ggplot2 at version 3.4.0
Reprex follows.
library(tidyverse)
set.seed(12345)
tibble(x = rnorm(5000) / 10) -> tmpTib
tmpTib %>%
  summarise(min = min(x),
            max = max(x),
            nNA = sum(is.na(x)))
# # A tibble: 1 × 3
#      min   max   nNA
#    <dbl> <dbl> <int>
# 1 -0.388 0.333     0
### so no missing values and range well inside [-1, 1]
ggplot(data = tmpTib,
       aes(x = x)) +
  geom_histogram()
### plots all 5000 points
ggplot(data = tmpTib,
       aes(x = x)) +
  geom_histogram() +
  xlim(-1, 1)
### reports:
# Warning message:
# Removed 2 rows containing missing values (`geom_bar()`).
### same happens using scale_x_continuous(limits = c(-1, 1)):
ggplot(data = tmpTib,
       aes(x = x)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-1, 1))
# Warning message:
# Removed 2 rows containing missing values (`geom_bar()`).
tmpTib %>%
  filter(row_number() < 6) -> tmpTibSmall
tmpTibSmall
# # A tibble: 5 × 1
#         x
#     <dbl>
# 1  0.0586
# 2  0.0709
# 3 -0.0109
# 4 -0.0453
# 5  0.0606
### using small dataset shows that there is actually no removal of data
ggplot(data = tmpTibSmall,
       aes(x = x)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-.07, .085))
ggplot(data = tmpTibSmall,
       aes(x = x)) +
  geom_histogram() +
  xlim(-.07, .085)
sessionInfo()
I hope I'm not being stupid!
Because the scale range is wider than the data range, the histogram gets some empty, 0-count bins at its flanks. If the edges of these flanking bins fall out of bounds, they get censored (set to NA) and the bins are dropped, which triggers the warning you see.
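As a rough illustration of the censoring step, here is a minimal sketch using scales::oob_censor(), the default oob function for continuous scales, next to scales::oob_keep(). The edge values are made up for illustration; they are not the breaks that ggplot2 actually computes.

```r
library(scales)

# Hypothetical bin edges: the outer two lie just outside the limits c(-1, 1)
edges <- c(-1.07, -1, 0, 1, 1.07)

# oob_censor() replaces out-of-bounds values with NA
censored <- oob_censor(edges, range = c(-1, 1))
censored

# oob_keep() returns the values unchanged
kept <- oob_keep(edges, range = c(-1, 1))
kept
```

A bin whose xmin or xmax has been turned into NA this way is what the "Removed ... rows containing missing values" warning counts.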
We can show in the layer data that there is an empty bin at the start and another at the end, each with NA for either xmin or xmax (because those values got censored).
library(ggplot2)
set.seed(12345)
df <- data.frame(x = rnorm(5000) / 10)
p <- ggplot(data = df, aes(x = x)) +
  geom_histogram() +
  xlim(-1, 1)
ld <- layer_data(p)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
head(ld, 2)[, 1:8]
#>   y count          x xmin       xmax density ncount ndensity
#> 1 0     0         NA   NA -1.0000000       0      0        0
#> 2 0     0 -0.9655172   -1 -0.9310345       0      0        0
tail(ld, 2)[, 1:8]
#>    y count         x      xmin      xmax density ncount ndensity
#> 29 0     0 0.8965517 0.8620690 0.9310345       0      0        0
#> 30 0     0 0.9655172 0.9310345        NA       0      0        0
Created on 2022-12-24 by the reprex package (v2.0.1)
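To confirm that only empty bins are affected, one can compare the bin counts with the number of rows in the data. This is a sketch reusing the data from the reprex above; the name `flagged` is mine, and bins = 30 is set explicitly only to silence the `stat_bin()` message (it matches the default).

```r
library(ggplot2)

set.seed(12345)
df <- data.frame(x = rnorm(5000) / 10)

p <- ggplot(data = df, aes(x = x)) +
  geom_histogram(bins = 30) +
  xlim(-1, 1)

ld <- layer_data(p)

# Bins with an NA edge; these are the rows the warning reports as removed
flagged <- ld[is.na(ld$xmin) | is.na(ld$xmax), ]

flagged$count   # counts in the flagged bins
sum(ld$count)   # total count across all bins
nrow(df)        # number of observations
```

If the explanation holds, the flagged bins all have a count of 0 and the total count equals nrow(df), so no observations are lost.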
This is all how it is supposed to work, but what I don't understand (yet) is why the breaks for the bins get calculated outside the bounds of the scale range.
If you want to avoid the warning, you could use scale_x_continuous(limits = c(-1, 1), oob = scales::oob_keep) to keep the out-of-bounds empty bins. If you use coord_cartesian(xlim = c(-1, 1)) instead, the breaks are calculated to fit the data range rather than the scale range.
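Both options could look roughly like this, again using the data from the reprex. The object names are only illustrative, and bins = 30 is set explicitly only to silence the `stat_bin()` message.

```r
library(ggplot2)

set.seed(12345)
df <- data.frame(x = rnorm(5000) / 10)

# Option 1: keep out-of-bounds values instead of censoring them to NA
p_keep <- ggplot(data = df, aes(x = x)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(limits = c(-1, 1), oob = scales::oob_keep)

# Option 2: zoom with the coordinate system and leave the scale limits alone
p_zoom <- ggplot(data = df, aes(x = x)) +
  geom_histogram(bins = 30) +
  coord_cartesian(xlim = c(-1, 1))

ld_keep <- layer_data(p_keep)
ld_zoom <- layer_data(p_zoom)

# Check whether any bin edges are NA in either plot
anyNA(c(ld_keep$xmin, ld_keep$xmax))
anyNA(c(ld_zoom$xmin, ld_zoom$xmax))
```

The two plots are not identical: option 1 still bins over the scale range, while option 2 bins over the data range and only changes the visible window.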
Wow. Brilliant answer: many thanks. I suspect I should have been able to work this out myself; perhaps trying coord_cartesian() would have tipped me off. I hadn't found oob. Now that I am starting to understand what's happening, I share your puzzlement about how the bin limits are set. I can't see how that's a good choice. (I accept there are often good explanations for things in R that I don't understand until I have thought about them a lot!) However, if there is a good reason that's escaping us, I suspect it would be an improvement to have geom_histogram() throw a warning about what's happening and why. (I guess that the warning that is coming out is not coming from within geom_histogram() but from somewhere "deeper" in ggplot().) Fascinating.