`spatial_clustering_cv` retains geometry in folds causing `fit_resamples` to fail

Open brianmsm opened this issue 1 year ago • 2 comments

The problem

When I use `spatial_clustering_cv()` to create spatial resamples, the geometry column is retained within the folds. This causes `fit_resamples()` to fail with an error saying that not all columns of `y` are known outcome types, with `geometry` listed as the unknown column. It's unclear whether `spatial_clustering_cv()` should drop the spatial information in the folds or whether `fit_resamples()` should exclude the geometry column. There might be something I'm missing.
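A quick way to check that the folds still carry the geometry is to pull the analysis set out of one split and inspect it. This is only a minimal sketch: it assumes `rsample::analysis()` returns the rows as stored in the split, and the names `first_analysis`, `is_sf`, and `geom_col` are made up for illustration.

```r
library(sf)
library(spatialsample)

# Same example data and folds as in the reprex
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_folds <- spatial_clustering_cv(nc, v = 5)

# Pull the analysis (training) rows of the first fold
first_analysis <- rsample::analysis(nc_folds$splits[[1]])

# If the folds retain spatial information, this is still an sf object
is_sf <- inherits(first_analysis, "sf")

# Name of the active geometry column, stored by sf as an attribute
# (NULL if the object is not an sf data frame)
geom_col <- attr(first_analysis, "sf_column")

print(is_sf)
print(geom_col)
```

If `is_sf` is `TRUE`, the geometry column travels with every subset of the data, which would be consistent with the error reported by `fit_resamples()`.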

Reproducible example

# Load package
library(dplyr, warn.conflicts = FALSE)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(spatialsample)
library(workflows)
library(parsnip)
library(tune)

# Example data
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

# Making spatial clusters
nc_folds <- spatial_clustering_cv(nc, v = 5)

# Workflow for linear regression
lr_recipe <- workflow() %>%
  add_variables(outcomes = BIR74,
                predictors = AREA) %>%
  add_model(linear_reg(engine = "lm"))

# Tuning parameters: Fail
(spatial_lr <- fit_resamples(lr_recipe, nc_folds))
#> → A | error:   Not all columns of `y` are known outcome types. These columns have unknown types: 'geometry'.
#> There were issues with some computations   A: x1
#> There were issues with some computations   A: x5
#> 
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.
#> # Resampling results
#> # 5-fold spatial cross-validation 
#> # A tibble: 5 × 4
#>   splits          id    .metrics .notes          
#>   <list>          <chr> <list>   <list>          
#> 1 <split [77/23]> Fold1 <NULL>   <tibble [1 × 3]>
#> 2 <split [75/25]> Fold2 <NULL>   <tibble [1 × 3]>
#> 3 <split [79/21]> Fold3 <NULL>   <tibble [1 × 3]>
#> 4 <split [84/16]> Fold4 <NULL>   <tibble [1 × 3]>
#> 5 <split [85/15]> Fold5 <NULL>   <tibble [1 × 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Error(s) x5: Not all columns of `y` are known outcome types. These columns hav...
#> 
#> Run `show_notes(.Last.tune.result)` for more information.

# Best tuning parameters: Fail
collect_metrics(spatial_lr)
#> Error in `estimate_tune_results()`:
#> ! All models failed. Run `show_notes(.Last.tune.result)` for more information.



# Try with st_drop_geometry:
orig_class <- class(nc_folds)

nc_folds <- nc_folds %>% 
  mutate(splits = purrr::map(splits, ~ {
    .x$data <- st_drop_geometry(.x$data)
    .x
  }))

class(nc_folds) <- orig_class

# Tuning parameters
(spatial_lr <- fit_resamples(lr_recipe, nc_folds))
#> # Resampling results
#> # -fold spatial cross-validation 
#> # A tibble: 5 × 4
#>   splits          id    .metrics         .notes          
#>   <list>          <chr> <list>           <list>          
#> 1 <split [77/23]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [75/25]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
#> 3 <split [79/21]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
#> 4 <split [84/16]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
#> 5 <split [85/15]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

# Best tuning parameters 
collect_metrics(spatial_lr)
#> # A tibble: 2 × 6
#>   .metric .estimator     mean     n  std_err .config             
#>   <chr>   <chr>         <dbl> <int>    <dbl> <chr>               
#> 1 rmse    standard   3542.        5 634.     Preprocessor1_Model1
#> 2 rsq     standard      0.178     5   0.0616 Preprocessor1_Model1

Created on 2024-07-19 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14 ucrt)
#>  os       Windows 11 x64 (build 22635)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Spanish_Peru.utf8
#>  ctype    Spanish_Peru.utf8
#>  tz       America/Lima
#>  date     2024-07-19
#>  pandoc   3.1.12.3 @ c:\\Program Files\\Positron\\bin\\pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version    date (UTC) lib source
#>  class           7.3-22     2023-05-03 [2] CRAN (R 4.4.1)
#>  classInt        0.4-10     2023-09-05 [1] CRAN (R 4.4.0)
#>  cli             3.6.3.9000 2024-06-28 [1] Github (r-lib/cli@d9febb5)
#>  codetools       0.2-20     2024-03-31 [2] CRAN (R 4.4.1)
#>  colorspace      2.1-0      2023-01-23 [1] CRAN (R 4.4.0)
#>  data.table      1.15.4     2024-03-30 [1] CRAN (R 4.4.0)
#>  DBI             1.2.3      2024-06-02 [1] CRAN (R 4.4.0)
#>  dials           1.2.1      2024-02-22 [1] CRAN (R 4.4.1)
#>  DiceDesign      1.10       2023-12-07 [1] CRAN (R 4.4.1)
#>  digest          0.6.36     2024-06-23 [1] CRAN (R 4.4.1)
#>  dplyr         * 1.1.4      2023-11-17 [1] CRAN (R 4.4.0)
#>  e1071           1.7-14     2023-12-06 [1] CRAN (R 4.4.0)
#>  evaluate        0.24.0     2024-06-10 [1] CRAN (R 4.4.0)
#>  fansi           1.0.6      2023-12-08 [1] CRAN (R 4.4.0)
#>  fastmap         1.2.0      2024-05-15 [1] CRAN (R 4.4.0)
#>  foreach         1.5.2      2022-02-02 [1] CRAN (R 4.4.0)
#>  fs              1.6.4      2024-04-25 [1] CRAN (R 4.4.0)
#>  furrr           0.3.1      2022-08-15 [1] CRAN (R 4.4.0)
#>  future          1.33.2     2024-03-26 [1] CRAN (R 4.4.0)
#>  future.apply    1.11.2     2024-03-28 [1] CRAN (R 4.4.0)
#>  generics        0.1.3      2022-07-05 [1] CRAN (R 4.4.0)
#>  ggplot2         3.5.1      2024-04-23 [1] CRAN (R 4.4.1)
#>  globals         0.16.3     2024-03-08 [1] CRAN (R 4.4.0)
#>  glue            1.7.0      2024-01-09 [1] CRAN (R 4.4.0)
#>  gower           1.0.1      2022-12-22 [1] CRAN (R 4.4.0)
#>  GPfit           1.0-8      2019-02-08 [1] CRAN (R 4.4.1)
#>  gtable          0.3.5      2024-04-22 [1] CRAN (R 4.4.0)
#>  hardhat         1.4.0      2024-06-02 [1] CRAN (R 4.4.1)
#>  htmltools       0.5.8.1    2024-04-04 [1] CRAN (R 4.4.0)
#>  ipred           0.9-14     2023-03-09 [1] CRAN (R 4.4.1)
#>  iterators       1.0.14     2022-02-05 [1] CRAN (R 4.4.0)
#>  KernSmooth      2.23-24    2024-05-17 [2] CRAN (R 4.4.1)
#>  knitr           1.48       2024-07-07 [1] CRAN (R 4.4.1)
#>  lattice         0.22-6     2024-03-20 [2] CRAN (R 4.4.1)
#>  lava            1.8.0      2024-03-05 [1] CRAN (R 4.4.1)
#>  lhs             1.2.0      2024-06-30 [1] CRAN (R 4.4.1)
#>  lifecycle       1.0.4      2023-11-07 [1] CRAN (R 4.4.0)
#>  listenv         0.9.1      2024-01-29 [1] CRAN (R 4.4.0)
#>  lubridate       1.9.3      2023-09-27 [1] CRAN (R 4.4.0)
#>  magrittr        2.0.3      2022-03-30 [1] CRAN (R 4.4.0)
#>  MASS            7.3-60.2   2024-04-26 [2] CRAN (R 4.4.1)
#>  Matrix          1.7-0      2024-04-26 [2] CRAN (R 4.4.1)
#>  munsell         0.5.1      2024-04-01 [1] CRAN (R 4.4.0)
#>  nnet            7.3-19     2023-05-03 [2] CRAN (R 4.4.1)
#>  parallelly      1.37.1     2024-02-29 [1] CRAN (R 4.4.0)
#>  parsnip       * 1.2.1      2024-03-22 [1] CRAN (R 4.4.1)
#>  pillar          1.9.0      2023-03-22 [1] CRAN (R 4.4.0)
#>  pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.4.0)
#>  prodlim         2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
#>  proxy           0.4-27     2022-06-09 [1] CRAN (R 4.4.0)
#>  purrr           1.0.2      2023-08-10 [1] CRAN (R 4.4.0)
#>  R6              2.5.1      2021-08-19 [1] CRAN (R 4.4.0)
#>  Rcpp            1.0.12     2024-01-09 [1] CRAN (R 4.4.0)
#>  recipes         1.1.0      2024-07-04 [1] CRAN (R 4.4.1)
#>  reprex          2.1.1      2024-07-06 [1] CRAN (R 4.4.1)
#>  rlang           1.1.4.9000 2024-06-28 [1] Github (r-lib/rlang@cebbabf)
#>  rmarkdown       2.27       2024-05-17 [1] CRAN (R 4.4.0)
#>  rpart           4.1.23     2023-12-05 [2] CRAN (R 4.4.1)
#>  rsample         1.2.1      2024-03-25 [1] CRAN (R 4.4.1)
#>  s2              1.1.6      2023-12-19 [1] CRAN (R 4.4.0)
#>  scales          1.3.0      2023-11-28 [1] CRAN (R 4.4.0)
#>  sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.4.0)
#>  sf            * 1.0-16     2024-03-24 [1] CRAN (R 4.4.0)
#>  spatialsample * 0.5.1      2023-11-08 [1] CRAN (R 4.4.1)
#>  survival        3.6-4      2024-04-24 [2] CRAN (R 4.4.1)
#>  tibble          3.2.1      2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyr           1.3.1      2024-01-24 [1] CRAN (R 4.4.0)
#>  tidyselect      1.2.1      2024-03-11 [1] CRAN (R 4.4.0)
#>  timechange      0.3.0      2024-01-18 [1] CRAN (R 4.4.0)
#>  timeDate        4032.109   2023-12-14 [1] CRAN (R 4.4.0)
#>  tune          * 1.2.1      2024-04-18 [1] CRAN (R 4.4.1)
#>  units           0.8-5      2023-11-28 [1] CRAN (R 4.4.0)
#>  utf8            1.2.4      2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs           0.6.5      2023-12-01 [1] CRAN (R 4.4.0)
#>  withr           3.0.0      2024-01-16 [1] CRAN (R 4.4.0)
#>  wk              0.9.2      2024-07-09 [1] CRAN (R 4.4.1)
#>  workflows     * 1.1.4      2024-02-19 [1] CRAN (R 4.4.1)
#>  xfun            0.45       2024-06-16 [1] CRAN (R 4.4.1)
#>  yaml            2.3.9      2024-07-05 [1] CRAN (R 4.4.1)
#>  yardstick       1.3.1      2024-03-21 [1] CRAN (R 4.4.1)
#> 
#>  [1] C:/Users/brian/AppData/Local/R/win-library/4.4
#>  [2] C:/Program Files/R/R-4.4.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

brianmsm, Jul 19 '24 22:07

Try using `add_formula()` instead of `add_variables()` as a workaround.

(Sorry for the brief reply -- I'm traveling at the moment so can't run stuff, but wanted to make sure I could try to help you get unstuck. This is definitely a bug somewhere)

mikemahoney218, Jul 20 '24 14:07

Interesting. If I use `add_formula()` instead, it does work.

# Load package
library(dplyr, warn.conflicts = FALSE)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(spatialsample)
library(workflows)
library(parsnip)
library(tune)

# Example data
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

# Making spatial clusters
nc_folds <- spatial_clustering_cv(nc, v = 5)

# Workflow for linear regression
lr_recipe <- workflow() %>%
  add_formula(BIR74 ~ AREA) %>%
  add_model(linear_reg(engine = "lm"))

# Tuning parameters
(spatial_lr <- fit_resamples(lr_recipe, nc_folds))
#> # Resampling results
#> # 5-fold spatial cross-validation 
#> # A tibble: 5 × 4
#>   splits          id    .metrics         .notes          
#>   <list>          <chr> <list>           <list>          
#> 1 <split [79/21]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [75/25]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
#> 3 <split [77/23]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
#> 4 <split [85/15]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
#> 5 <split [84/16]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

# Best tuning parameters:
collect_metrics(spatial_lr)
#> # A tibble: 2 × 6
#>   .metric .estimator     mean     n  std_err .config             
#>   <chr>   <chr>         <dbl> <int>    <dbl> <chr>               
#> 1 rmse    standard   3542.        5 634.     Preprocessor1_Model1
#> 2 rsq     standard      0.178     5   0.0616 Preprocessor1_Model1

Created on 2024-07-22 with reprex v2.1.1

brianmsm, Jul 22 '24 09:07