Expand grouping variables for bootstrap intervals

Open • topepo opened this issue 2 years ago • 1 comment

For some tune internals, it would be helpful to be able to make intervals for an extended set of columns (as opposed to just terms). See tidymodels/tune#818.

These changes propose treating columns whose names start with a period as additional grouping variables when computing the intervals. We can discuss it, and I can add more unit tests if we're good with this.

Here's an example:

library(tidymodels)

tidymodels_prefer()
theme_set(theme_bw())
options(pillar.advice = FALSE, pillar.min_title_chars = Inf)

# Get regression estimates for each house type
lm_est <- function(split, ...) {
  analysis(split) %>%
    tidyr::nest(.by = c(type)) %>%
    mutate(
      betas = purrr::map(data, ~ lm(log10(price) ~ sqft, data = .x) %>% tidy())
    ) %>%
    rename(.type = type) %>%
    select(.type, betas) %>%
    unnest(cols = betas)
}

set.seed(52156)
house_rs <-
  bootstraps(Sacramento, 1000, apparent = TRUE) %>%
  mutate(results = map(splits, lm_est))

int_pctl(house_rs, results)
#> # A tibble: 6 × 7
#>   term        .type           .lower .estimate   .upper .alpha .method   
#>   <chr>       <fct>            <dbl>     <dbl>    <dbl>  <dbl> <chr>     
#> 1 (Intercept) Condo         4.45     4.59      4.72       0.05 percentile
#> 2 (Intercept) Multi_Family  4.74     5.25      5.71       0.05 percentile
#> 3 (Intercept) Residential   4.93     4.96      4.99       0.05 percentile
#> 4 sqft        Condo         0.000412 0.000520  0.000659   0.05 percentile
#> 5 sqft        Multi_Family -0.000197 0.0000344 0.000277   0.05 percentile
#> 6 sqft        Residential   0.000211 0.000225  0.000240   0.05 percentile
int_t(house_rs, results)
#> # A tibble: 6 × 7
#>   term        .type           .lower .estimate   .upper .alpha .method  
#>   <chr>       <fct>            <dbl>     <dbl>    <dbl>  <dbl> <chr>    
#> 1 (Intercept) Condo         4.47     4.59      4.73       0.05 student-t
#> 2 (Intercept) Multi_Family  4.81     5.25      5.78       0.05 student-t
#> 3 (Intercept) Residential   4.93     4.96      4.99       0.05 student-t
#> 4 sqft        Condo         0.000386 0.000520  0.000621   0.05 student-t
#> 5 sqft        Multi_Family -0.000193 0.0000344 0.000223   0.05 student-t
#> 6 sqft        Residential   0.000210 0.000225  0.000239   0.05 student-t
int_bca(house_rs, results, .fn = lm_est)
#> # A tibble: 6 × 7
#>   term        .type           .lower .estimate   .upper .alpha .method
#>   <chr>       <fct>            <dbl>     <dbl>    <dbl>  <dbl> <chr>  
#> 1 (Intercept) Residential   4.94     4.96      4.99       0.05 BCa    
#> 2 sqft        Residential   0.000210 0.000225  0.000239   0.05 BCa    
#> 3 (Intercept) Condo         4.47     4.59      4.74       0.05 BCa    
#> 4 sqft        Condo         0.000395 0.000520  0.000638   0.05 BCa    
#> 5 (Intercept) Multi_Family  4.64     5.25      5.62       0.05 BCa    
#> 6 sqft        Multi_Family -0.000156 0.0000344 0.000330   0.05 BCa

Created on 2024-01-19 with reprex v2.0.2
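
To make the grouping idea concrete, here is a minimal base R sketch of what "intervals per term and per dot-prefixed column" means. It is an illustration under stated assumptions, not rsample's internal code: the toy data frame `boot_stats`, the helper `pctl()`, and the rule "every column starting with a period is a grouping column" are all hypothetical stand-ins for this example.

```r
# Hypothetical toy data: one row per bootstrap resample, term, and group.
# The values are random noise, used only to show the mechanics.
set.seed(1)
boot_stats <- data.frame(
  id       = rep(sprintf("Bootstrap%03d", 1:200), each = 4),
  term     = rep(c("(Intercept)", "sqft"), times = 400),
  .type    = rep(rep(c("Condo", "Residential"), each = 2), times = 200),
  estimate = rnorm(800)
)

# Assumption for this sketch: any column whose name starts with a period
# is an extra grouping column. A real implementation would also need to
# exclude reserved output names such as .lower or .upper.
group_cols <- grep("^\\.", names(boot_stats), value = TRUE)
keys <- c("term", group_cols)

# Percentile interval for one vector of bootstrap statistics.
pctl <- function(x, alpha = 0.05) {
  q <- quantile(x, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  data.frame(.lower = q[1], .estimate = mean(x), .upper = q[2], .alpha = alpha)
}

# Split on every combination of term and the dot-prefixed columns,
# then compute one interval per combination.
pieces <- split(boot_stats, boot_stats[keys], drop = TRUE)
res <- do.call(rbind, lapply(pieces, function(d) {
  cbind(d[1, keys, drop = FALSE], pctl(d$estimate))
}))
rownames(res) <- NULL
res
```

The result has one row per term/`.type` combination, which mirrors the shape of the `int_pctl()` output in the reprex above.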

topepo • Jan 19 '24 12:01

This is ready for final review.

I've set up the int_pctl() S3 method for tune_results objects to work with the current interval methods in rsample and with this change.

topepo • Jan 24 '24 11:01

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] • Sep 27 '24 01:09