Consider allowing 1-dimensional `[` to drop the `geometry` column
In dplyr, and in many other places in the tidyverse, we program with 1-dimensional calls to `[`, such as `df["x"]`, where we expect that the result has exactly 1 column, named `x`.
In `?dplyr::dplyr_extending`, we discuss how this is one of the invariants required for compatibility with dplyr.
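The invariant is easy to check with a plain data frame (a minimal base R illustration, not sf-specific):

``` r
# With a bare data frame, 1-dimensional `[` returns exactly the
# requested columns and nothing else
df <- data.frame(x = 1:3, y = 4:6)

ncol(df["x"])   # 1
names(df["x"])  # "x"
```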
But sf doesn't do this; it instead retains the geometry column as a "sticky" column:
``` r
library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.0, PROJ 9.1.1; sf_use_s2() is TRUE

nrows <- 10
geometry <- st_sfc(lapply(1:nrows, function(x) st_geometrycollection()))
df <- st_sf(id = 1:nrows, geometry = geometry)

df
#> Simple feature collection with 10 features and 1 field (with 10 geometries empty)
#> Geometry type: GEOMETRYCOLLECTION
#> Dimension:     XY
#> Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS:           NA
#>    id                 geometry
#> 1   1 GEOMETRYCOLLECTION EMPTY
#> 2   2 GEOMETRYCOLLECTION EMPTY
#> 3   3 GEOMETRYCOLLECTION EMPTY
#> 4   4 GEOMETRYCOLLECTION EMPTY
#> 5   5 GEOMETRYCOLLECTION EMPTY
#> 6   6 GEOMETRYCOLLECTION EMPTY
#> 7   7 GEOMETRYCOLLECTION EMPTY
#> 8   8 GEOMETRYCOLLECTION EMPTY
#> 9   9 GEOMETRYCOLLECTION EMPTY
#> 10 10 GEOMETRYCOLLECTION EMPTY

# `geometry` sticks around
df["id"]
#> Simple feature collection with 10 features and 1 field (with 10 geometries empty)
#> Geometry type: GEOMETRYCOLLECTION
#> Dimension:     XY
#> Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS:           NA
#>    id                 geometry
#> 1   1 GEOMETRYCOLLECTION EMPTY
#> 2   2 GEOMETRYCOLLECTION EMPTY
#> 3   3 GEOMETRYCOLLECTION EMPTY
#> 4   4 GEOMETRYCOLLECTION EMPTY
#> 5   5 GEOMETRYCOLLECTION EMPTY
#> 6   6 GEOMETRYCOLLECTION EMPTY
#> 7   7 GEOMETRYCOLLECTION EMPTY
#> 8   8 GEOMETRYCOLLECTION EMPTY
#> 9   9 GEOMETRYCOLLECTION EMPTY
#> 10 10 GEOMETRYCOLLECTION EMPTY
```
This has caused quite a bit of pain in dplyr over the years, and it recently bit me again in hardhat, where I also use `df[cols]` as a way to select columns: https://github.com/tidymodels/hardhat/issues/228.
In dplyr, the algorithms underlying functions like `arrange()` and `distinct()` use `df[i]` to first select the columns to order by or compute the unique values of, so retaining extra columns here can be particularly problematic. I know sf has methods for these two verbs to work around this, but I think those could be avoided entirely if the geometry column wasn't sticky here.
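To illustrate the pattern (a hypothetical sketch, not dplyr's actual implementation), an algorithm like this relies on `df[cols]` returning exactly the requested columns; with a sticky geometry column, `keys` would also contain the geometry, and the duplicate check would compare geometries too:

``` r
# Hypothetical distinct()-style helper that keys uniqueness off a
# column selection made with `df[cols]`
distinct_by <- function(df, cols) {
  keys <- df[cols]                      # assumes ncol(keys) == length(cols)
  df[!duplicated(keys), , drop = FALSE]
}

df <- data.frame(id = c(1, 1, 2), x = c("a", "b", "b"))
distinct_by(df, "id")                   # keeps rows 1 and 3
```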
I think:
- Having `dplyr::select()` retain sticky columns makes for a great user experience
- Having `df[i]` retain sticky columns makes for a painful programming experience
This ^ is our general advice regarding sticky columns, and is how dplyr's grouped data frames work:
``` r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(g = 1, x = 2)
df <- group_by(df, g)

select(df, x)
#> Adding missing grouping variables: `g`
#> # A tibble: 1 × 2
#> # Groups:   g [1]
#>       g     x
#>   <dbl> <dbl>
#> 1     1     2

# Returns a bare tibble, not a grouped data frame
df["x"]
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1     2
```
It is also how tsibble works with its index column.
Would you ever consider allowing `df[i]` to return exactly the columns selected by `i`? If the geometry column isn't selected, then the appropriate behavior would be to return a bare data frame or bare tibble, depending on the underlying data structure.
Spatial data is not general data, and retaining the geometry column is essential if the data are still to be considered spatial; re-ordering rows would be a key case here. So users of spatial data would have to make a positive choice to discard the geometry column first. This is much more sensible than, say, losing track of which census tracts are which. Obliging users to first use dplyr, then use `dplyr::select()` to do what is correct with spatial data, rather than using the general base-R selection by `[`, seems unduly invasive.
An alternative that might also work for us is to implement our own `df_select(x, cols)` in a low-level package that could be used for column selection and would always return a bare data frame. In dplyr we could use that, then call `dplyr::dplyr_reconstruct()` after all the manipulation is done to attempt to restore an sf object at the end, if possible.
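A minimal sketch of what such a helper could look like (hypothetical name and behaviour, not an existing API):

``` r
# Hypothetical low-level selector: always returns a bare data frame
# containing exactly the requested columns, regardless of subclass
df_select <- function(x, cols) {
  out <- as.data.frame(x)
  class(out) <- "data.frame"  # drop any subclass, e.g. "sf"
  out[cols]
}
```

Selecting through this helper sidesteps any custom `[` method, so a later reconstruction step can decide whether the result still qualifies as spatial.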
Perhaps. https://r-spatial.org/book/07-Introsf.html#subsetting is being published next month, and the bookdown version has been live for a long time, so it is too late to even try to adapt. @edzer, you do the heavy lifting, what do you think?
> what do you think?
I can see your point, but I think that the sticky geometry in `[` is the number one property that makes an object of class `c("sf", "data.frame")` different from a data.frame; if you don't want it, simply use data.frames with geometry columns (and discover how hard life can be). It's been there from day one, now about 7 years ago. I also think that a significant part of the user base likes sf because it doesn't tie you to using the tidyverse. Also look carefully at the illustration on https://r-spatial.github.io/sf/.
> Obliging users to first use dplyr, then use `dplyr::select()` to do what is correct with spatial data, rather than using the general base-R selection by `[`, seems unduly invasive.
I think the suggestion is not so much that users should use `dplyr::select()`, but that they should use a different generic than `[` to get sticky columns, so that `[` can preserve important invariants of base R programming. This generic could have been provided by sf.
That said, there are also good reasons for the current design choice, as you mentioned, and the ship has sailed a long time ago. So I'm wondering if sf could use the same approach as data.table to enable and disable the custom `[` interface in unsuspecting namespaces?
> disable the custom `[` interface in unsuspecting namespaces?

Could you point to an example of where and how this is done?
And can users be fully shielded from the unintended effects of the loading of packages by other packages? That is, can they use `[` as before without anything unanticipated occurring?
data.table calls it CEDTA: https://github.com/Rdatatable/data.table/blob/master/R/cedta.R
The gist is something like this (untested, hopefully no typos):
``` r
# Known namespaces that need sticky columns. This is set to the
# packages that are already depending on sf-awareness at the time
# the feature was introduced. New packages should use `make_sf_aware()`.
known_sf_aware <- c(".globalenv", "foo", "bar")

# Register environment as sf-aware. We use a lexical flag instead of
# inspecting the name of the calling namespace and adding it to
# `the$sf_aware` because the flag is more flexible. It makes it
# possible for arbitrary environments to opt into (or disable) the sf
# API like this:
#
# ```
# local({ make_sf_aware(); sf[...] })
# ```
#
# Packages should call it like this in their `.onLoad()` hook (and
# thus before their namespace is sealed):
#
# ```
# make_sf_aware(topenv(environment()))
# ```
#' @export
make_sf_aware <- function(env = parent.frame(), value = TRUE) {
  env$.__sf_aware__. <- value
}

is_sf_aware <- function(env = parent.frame(2)) {
  top <- topenv(env)

  # Check for overrides
  top_name <- env_name(top)
  if (!is.null(top_name) && top_name %in% known_sf_aware) {
    return(TRUE)
  }

  # Now check for the lexical flag. This would need to be rewritten
  # without rlang as a loop over parents that stops at `topenv()`.
  flag <- rlang::env_get(
    env,
    ".__sf_aware__.",
    default = FALSE,
    inherit = TRUE,
    last = top
  )

  stopifnot(
    is.logical(flag) && length(flag) == 1 && !is.na(flag)
  )

  flag
}

env_name <- function(env) {
  ns <- topenv(env)
  if (isNamespace(ns)) {
    return(getNamespaceName(ns))
  }
  if (identical(ns, globalenv())) {
    return(".globalenv")
  }
  NULL
}
```
Then you call `is_sf_aware()` in your methods. If not aware, call the next method and return the result.
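A self-contained toy version of this dispatch pattern (invented class and flag names, not sf's real implementation) shows how a lexical flag can gate the sticky behaviour:

``` r
# Toy "sticky_df" class: `[` keeps a designated sticky column only for
# callers that have opted in via a flag in an enclosing environment
new_sticky_df <- function(df, sticky) {
  structure(df, class = c("sticky_df", "data.frame"), sticky = sticky)
}

# Searches enclosing environments for the flag; a full implementation
# would stop the search at topenv(), as in the sketch above
caller_is_aware <- function(env) {
  isTRUE(get0(".__aware__.", envir = env, inherits = TRUE, ifnotfound = FALSE))
}

`[.sticky_df` <- function(x, i) {
  bare <- x
  class(bare) <- "data.frame"            # plain data.frame semantics
  if (caller_is_aware(parent.frame())) {
    cols <- union(i, attr(x, "sticky"))  # sticky column tags along
    new_sticky_df(bare[cols], attr(x, "sticky"))
  } else {
    bare[i]                              # exactly the requested columns
  }
}

df <- new_sticky_df(data.frame(id = 1:3, geo = letters[1:3]), "geo")
ncol(df["id"])       # 1: caller has not opted in
.__aware__. <- TRUE  # opt the calling environment in
ncol(df["id"])       # 2: the sticky column is retained
```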
> And can users be fully shielded from the unintended effects of the loading of packages by other packages? That is, can they use `[` as before without anything unanticipated occurring?
yup, the effect is entirely lexical because we're looking in the parent frame to determine sf-awareness. It's bounded by the top env, which is either a namespace or the global env.
I would absolutely advise keeping this on the opt-out side, not the opt-in side. As proposed, you appear to require all packages depending on sf in any way to add code (and add it all together) in order to choose the right/legacy behaviour for spatial data, placing a burden on the sf maintainer to keep all revdeps compliant, crucially all new ones. What will happen in existing workflows is unknown, as the sf namespace may be loaded later (Suggests:). This also seems to require importing the rlang namespace, which is not acceptable in general (sf only Suggests: rlang).
> placing a burden on the sf maintainer to keep all revdeps compliant, crucially all new ones.
The burden on the sf maintainer is only at the time of implementing the feature. New revdeps that need sf-awareness would call `make_sf_aware()` as I mentioned in the comment. This setup works well in the data.table community, so I'd recommend at least considering it.
I'd argue that there is a burden either way. sf implements the data frame interface and so ends up being passed to all sorts of packages, including pure modelling and data-reshaping packages. From that point of view it makes sense to make the behaviour opt-in.
We could also make the mechanism a little more complex and read the DESCRIPTION file to detect if sf is in Imports or Depends. If that's the case, this could default to sf-awareness. This would need to be cached to avoid the overhead but it's certainly feasible.
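A rough sketch of that check (a hypothetical helper; the string match on the fields is naive, and the result would need caching):

``` r
# Does `pkg` declare sf in Depends or Imports? Naive word match on the
# installed DESCRIPTION; a real version would parse the fields properly
# and cache the answer per namespace
pkg_depends_on_sf <- function(pkg) {
  desc <- utils::packageDescription(pkg, fields = c("Depends", "Imports"))
  any(grepl("\\bsf\\b", unlist(desc)), na.rm = TRUE)
}

pkg_depends_on_sf("stats")  # FALSE: stats does not depend on sf
```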
> This also seems to require the import of the rlang namespace, which is not acceptable in general (sf only Suggests: rlang).
It really doesn't, cf. the comment above `rlang::env_get()`. I guess I should have taken the time to write the for loop instead of relying on a comment...
Why should sf's downstream consumers need to do anything? I much prefer any consumers that need to remove awareness, to do that themselves locally. In many cases, sf is Suggests:, not Depends: or Imports:. As we try to migrate a user base of > 1000 packages from rgdal and rgeos by mid 2023, we are using sf as Suggests: in sp (June 2023) to access PROJ/GEOS/GDAL. The behaviour of sf has to stay as stable as possible at least until end 2024. Requiring many hundreds of packages to adapt is hard enough without getting them to opt into something which for most ecologists etc. will be incomprehensible.
Is there a path through the behaviour of sf objects when `[..., drop = TRUE]` on the dplyr side, when one is completely sure that the geometry column must be discarded?
It'd be fine to have packages that have sf in Suggests default to sf-awareness too if that's important for backward compatibility. I understand that you don't want to break existing packages, that's a very reasonable position. So in that scheme we'd have:
- Packages that don't depend on sf in any way always use the df interface.
- Packages that depend on sf in any way default to the sf interface. They opt out by calling `sf::make_sf_aware(ns, value = FALSE)`, which I guess would become `sf::make_sf_unaware(ns)`.
I think the main downside of this approach is discoverability. On the upside it should do the right thing in most cases if I'm not missing anything.
> Is there a path through the behaviour of sf objects when `[..., drop = TRUE]` on the dplyr side, when one is completely sure that the geometry column must be discarded?
Sorry I don't understand what you mean. I'll say that in general we avoid drop = TRUE when working with data frames in order to preserve the length invariants. Also this discussion is not about dplyr or tidyr where we're happy to use extra precautions, but about how subclasses of data.frame with special interfaces generally interact with other packages in the ecosystem.
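For reference, the invariant at stake (a minimal base R illustration, not sf-specific): with `drop = TRUE`, a single-column selection no longer returns a data frame at all:

``` r
df <- data.frame(x = 1:3, y = 4:6)

is.data.frame(df[, "x", drop = TRUE])   # FALSE: a bare integer vector
is.data.frame(df[, "x", drop = FALSE])  # TRUE: still a 1-column data frame
```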
It is not just important, it is essential that current behaviour is continued with no user and preferably no downstream maintainer intervention. Any change in behaviour is a show-stopper, especially as we try to migrate hundreds of packages (with often unresponsive maintainers) to use sf or terra from rgeos and rgdal in a very tight time-frame. This is not the right time to introduce insidious breaking changes into the sf ecosystem. Fortunately terra does not touch any tidy packages and is S4, but tidyterra @dieghernan does.
I agree with @lionel- that this is worth exploring, and, when done right, it has the potential to help a lot of users working with sf objects in non-sf-aware packages. Good to see that data.table has been there too. Right now, as @rsbivand mentioned, our focus is on the retiring of rgeos, rgdal and maptools and on helping package developers using those migrate to sf. Unless someone else picks this up, the end of this year will be an optimistic start date for working on this.
started reverse dependency checks...
I hope the auguries were propitious! Is this in a branch that I could install to test ASDAR and SDSR?
Here's a POC check:
``` r
remotes::install_github("edzer/testingsf")
# Using github PAT from envvar GITHUB_PAT. Use `gitcreds::gitcreds_set()` and unset GITHUB_PAT in .Renviron (or elsewhere) if you want to use the more secure git credential store instead.
# Skipping install of 'testingsf' from a github remote, the SHA1 (5c9a2323) has not changed since last install.
# Use `force = TRUE` to force installation

Sys.setenv("ADD_SF_NAMESPACE" = "true")
library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
demo(nc, ask = FALSE, echo = FALSE)

print(attr(nc, ".sf_namespace")) # check for https://github.com/r-spatial/sf/pull/2212
# function ()
# NULL
# <bytecode: 0x5b6178ccb2c0>
# <environment: namespace:sf>

testingsf::report(nc) # number of columns in nc[1]: 1 under branch sf_aware, 2 under branch main
# ncol x[1]: 1
```
The branch can be installed directly:
``` r
remotes::install_github("r-spatial/sf", "sf_aware")
```
ASDAR OK. But ...
SDSR ch 14:
``` r
library(sf)
data(pol_pres15, package = "spDataLarge")

pol_pres15 |> subset(select = c(TERYT, name, types)) |> head()
# Error in .subset2(x, i, exact = exact) :
#   attempt to select less than one element in get1index

o <- subset(pol_pres15, select = c(TERYT, name, types))
head(o)
# Error in .subset2(x, i, exact = exact) :
#   attempt to select less than one element in get1index

str(o)
# Classes 'sf' and 'data.frame':  2495 obs. of  3 variables:
#  $ TERYT: chr  "020101" "020102" "020103" "020104" ...
#  $ name : chr  "BOLESŁAWIEC" "BOLESŁAWIEC" "GROMADKA" "NOWOGRODZIEC" ...
#  $ types: Factor w/ 4 levels "Rural","Urban",..: 2 1 1 3 1 1 2 2 2 2 ...

sf:::is_sf_aware()
# [1] TRUE
```
So this breaks `base::subset()`, doesn't it?
``` r
pol_pres15[1:2495, c(1, 4, 6), drop = FALSE]
# Simple feature collection with 2495 features and 3 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: 171677.6 ymin: 133223.7 xmax: 861882.2 ymax: 774774
# Projected CRS: ETRS89 / Poland CS92
# First 10 features:
#     TERYT                name       types                       geometry
# 1  020101         BOLESŁAWIEC       Urban MULTIPOLYGON (((261089.5 38...
# 2  020102         BOLESŁAWIEC       Rural MULTIPOLYGON (((254150 3837...
# 3  020103            GROMADKA       Rural MULTIPOLYGON (((275346 3846...
# 4  020104        NOWOGRODZIEC Urban/rural MULTIPOLYGON (((251769.8 37...
# 5  020105          OSIECZNICA       Rural MULTIPOLYGON (((263423.9 40...
# 6  020106 WARTA BOLESŁAWIECKA       Rural MULTIPOLYGON (((267030.7 38...
# 7  020201             BIELAWA       Urban MULTIPOLYGON (((332596.9 31...
# 8  020202         DZIERŻONIÓW       Urban MULTIPOLYGON (((334652.9 31...
# 9  020203            PIESZYCE       Urban MULTIPOLYGON (((331038.9 32...
# 10 020204        PIŁAWA GÓRNA       Urban MULTIPOLYGON (((343571.3 31...
```
is OK; the damage is inside `base:::subset.data.frame`.
Fixed now by adding `base`; there might be others, like `utils` and `stats`?
I remain concerned about `model.matrix()` etc., but that should drop geometry, so probably I worry too much. Will check that adding `base` lets ch 14-17 run to completion.
SDSR Ch 14-17 now complete without error. Maybe the revdeps will show whether utils or stats are needed too?
Packages added (for now) to the `known_sf_aware` list because upstream packages assume sf-aware behaviour:

- `stats`, because of `as.dist` in `spatialsample`
- `base`, because of `subset()`
- `utils`, because of `head()`
- `od`, used by `abstr`
- `rsample`, used by `spatialsample`

Methods added to make this work:

- `relocate.sf` (generic in `dplyr`), used by `ARUtools`

Still open issues:

- `gtfs2gps` fails, but would pass if `data.table` is added to the sf-aware package list
I think I've done about five revdep checks on 800+ packages now, just to discover that there are new packages depending on sf arriving on CRAN on a daily basis. There are now three packages that assume sf-aware behaviour of `[.sf` inside packages they depend on, but that do not themselves depend on sf:

- `abstr` assumes `od` is sf-aware
- `spatialsample` assumes `rsample` is sf-aware
- `gtfs2gps` assumes `data.table` is sf-aware
@lionel- how do you suggest we go about this? I'm reluctant to add these secondary dependencies to the sf-aware list; can we also leave that to the packages, say that abstr calls some function that makes `[.sf` in od behave sf-aware? I tried a function that modifies `known_sf_aware` inside sf, using `<<-`, but that seems not to be possible.
Asking the question is answering it; this now allows e.g. package spatialsample to define in `.onLoad()`:

``` r
if (utils::packageVersion("sf") >= "1.0.17")
  sf::sf_make_aware_pkg("rsample")
```

to make sure that `[.sf` has sf semantics when called inside rsample (which does not depend on sf).