Incorrect warning about clipping after adding wide xlim values to geom_histogram

Open cpsyctc2 opened this issue 3 years ago • 2 comments

Brief description of the problem

If I add xlim(), or limits in scale_x_continuous(), to a geom_histogram() plot and set the limits outside the range of the data, I get the warning message "Removed 2 rows containing missing values (`geom_bar()`)." even though no data have actually been removed.

True for me on: R version 4.2.2 (2022-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045)

and

R version 4.2.2 Patched (2022-11-10 r83330) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.1 LTS

and

R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045)

All three show ggplot2 at version 3.4.0

Reprex follows.

library(tidyverse)
set.seed(12345)
tibble(x = rnorm(5000) / 10) -> tmpTib
  
tmpTib %>% 
  summarise(min = min(x),
            max = max(x),
            nNA = sum(is.na(x)))
# # A tibble: 1 × 3
# min   max   nNA
# <dbl> <dbl> <int>
#   1 -0.388 0.333     0
### so no missing values and range well inside [-1, 1]
  
ggplot(data = tmpTib,
       aes(x = x)) + 
  geom_histogram()
### plots all 5000 points

ggplot(data = tmpTib,
       aes(x = x)) + 
  geom_histogram() +
  xlim(-1, 1)
### reports:
# Warning message:
# Removed 2 rows containing missing values (`geom_bar()`). 

### same happens using scale_x_continuous(limits = c(-1, 1)):
ggplot(data = tmpTib,
       aes(x = x)) + 
  geom_histogram() +
  scale_x_continuous(limits = c(-1, 1))
# Warning message:
# Removed 2 rows containing missing values (`geom_bar()`). 

tmpTib %>%
  filter(row_number() < 6) -> tmpTibSmall

tmpTibSmall
# # A tibble: 5 × 1
# x
# <dbl>
# 1  0.0586
# 2  0.0709
# 3 -0.0109
# 4 -0.0453
# 5  0.0606

### using small dataset shows that there is actually no removal of data
ggplot(data = tmpTibSmall,
       aes(x = x)) + 
  geom_histogram() +
  scale_x_continuous(limits = c(-.07, .085)) 

ggplot(data = tmpTibSmall,
       aes(x = x)) + 
  geom_histogram() +
  xlim(-.07, .085)

sessionInfo()

I hope I'm not being stupid!

cpsyctc2 • Dec 24 '22 12:12

Because the scale range is larger than the data range, the histogram gets some empty, 0-count bins at its flanks. If these flanking bins are out of bounds, they get censored and dropped, which triggers the warning you get.
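As a rough illustration of what "censored" means here, the sketch below calls scales::oob_censor() and scales::oob_keep() directly on a few made-up bin edges. The edge values are hypothetical and chosen only so that the outer two fall just outside [-1, 1]; they are not the edges stat_bin() actually computes.

```r
library(scales)

# Hypothetical bin edges: the first and last fall just outside the limits
edges <- c(-1.03, -0.5, 0, 0.5, 1.03)

# Censoring replaces out-of-bounds values with NA
censored <- oob_censor(edges, range = c(-1, 1))

# Keeping returns the values unchanged
kept <- oob_keep(edges, range = c(-1, 1))

print(censored)
print(kept)
```

Rows whose position values end up as NA in this way are the ones the geom later reports as removed.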

The layer data shows that there is an empty bin at the start and another at the end, each with NAs for either xmin or xmax (because those values got censored).

library(ggplot2)
set.seed(12345)
df <- data.frame(x = rnorm(5000) / 10)

p <- ggplot(data = df, aes(x = x)) + 
  geom_histogram() +
  xlim(-1, 1)

ld <- layer_data(p)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
head(ld, 2)[, 1:8]
#>   y count          x xmin       xmax density ncount ndensity
#> 1 0     0         NA   NA -1.0000000       0      0        0
#> 2 0     0 -0.9655172   -1 -0.9310345       0      0        0
tail(ld, 2)[, 1:8]
#>    y count         x      xmin      xmax density ncount ndensity
#> 29 0     0 0.8965517 0.8620690 0.9310345       0      0        0
#> 30 0     0 0.9655172 0.9310345        NA       0      0        0

Created on 2022-12-24 by the reprex package (v2.0.1)

This is all how it is supposed to work, but what I don't understand (yet) is why the breaks for the bins get calculated outside the bounds of the scale range.

If you want to remedy this, you could use scale_x_continuous(limits = c(-1, 1), oob = scales::oob_keep) to keep the out-of-bounds empty bins. Alternatively, if you use coord_cartesian(xlim = c(-1, 1)), the breaks are calculated to fit the data instead of the scale range.
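A minimal sketch of both workarounds, reusing the simulated data from the example above. Setting bins = 30 explicitly is an addition here, made only to silence the `stat_bin()` message; the object names p_keep, p_zoom, ld_keep and ld_zoom are illustrative.

```r
library(ggplot2)
set.seed(12345)
df <- data.frame(x = rnorm(5000) / 10)

# Workaround 1: keep out-of-bounds values instead of censoring them to NA
p_keep <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(limits = c(-1, 1), oob = scales::oob_keep)

# Workaround 2: zoom with the coordinate system and leave the scale alone
p_zoom <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 30) +
  coord_cartesian(xlim = c(-1, 1))

ld_keep <- layer_data(p_keep)
ld_zoom <- layer_data(p_zoom)

# Check for censored bin edges and confirm that every observation is counted
anyNA(ld_keep[, c("xmin", "xmax")])
anyNA(ld_zoom[, c("xmin", "xmax")])
sum(ld_keep$count)
sum(ld_zoom$count)
```

The two plots are not identical: with coord_cartesian() the 30 bins span only the data range, so the bars are narrower than in the version that sets limits on the scale.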

teunbrand • Dec 24 '22 13:12

Wow. Brilliant answer: many thanks. I suspect I should have been able to work this out myself; perhaps trying coord_cartesian() would have tipped me off. I hadn't found oob. Now that I am starting to understand what's happening, I share your puzzlement about how the bin limits are set. I can't see how that's a good choice. (I accept there often are good explanations for things in R that I haven't understood until I have thought about them a lot!) However, if there is a good reason that's escaping us, I suspect it would be an improvement to have geom_histogram() throw a warning that explains what's happening and why. (I guess that the current warning is not coming from within geom_histogram() but from somewhere "deeper" in ggplot().) Fascinating.

cpsyctc2 • Dec 24 '22 14:12