
Fail gracefully when Socrata is down

Open geneorama opened this issue 2 years ago • 5 comments

We received notice from CRAN that RSocrata is failing with some regularity, which causes headaches for the CRAN administrators. We believe this happens because Socrata is occasionally briefly down for maintenance.

In order to "fail gracefully" @nicklucius and I discussed checking the site to see if it's up before running the tests.

For now I plan to use something like attr(curlGetHeaders("https://data.cityofchicago.org/"), "status") to check for 200 at the start of the tests, and abort testing if that fails. Error handling will be necessary because it returns an error if the site is down.

Example error for a non-existent website:

# > attr(curlGetHeaders("https://badurlx1x2x3x4.cityofchicago.org/"), "status")
# Error in curlGetHeaders("https://badurlx1x2x3x4.cityofchicago.org/") : 
#   libcurl error code 6:
#   Could not resolve host: badurlx1x2x3x4.cityofchicago.org

Example error for a timeout:

# > attr(curlGetHeaders("https://dev.socrata.com/docs/endpoints.html"), "status")
# Error in curlGetHeaders("https://dev.socrata.com/docs/endpoints.html") : 
#   libcurl error code 28:
# 	Failed to connect to dev.socrata.com port 443 after 21045 ms: Timed out
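Both failure modes above can be folded into a single TRUE/FALSE check by wrapping the call in tryCatch. A minimal sketch (the helper name site_is_up is hypothetical, not part of RSocrata):

```r
## Sketch: return TRUE only if the site responds with HTTP 200.
## `site_is_up` is a hypothetical helper name, not an RSocrata function.
site_is_up <- function(url) {
  status <- tryCatch(
    attr(curlGetHeaders(url), "status"),
    error = function(e) NA_integer_  # DNS failure, timeout, etc.
  )
  identical(status, 200L)
}

## At the top of the test file:
## if (!site_is_up("https://data.cityofchicago.org/")) return(invisible())
```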

I would also like to take this opportunity to at least try splitting the tests (#103), and this should also take care of #131 which is fail gracefully on write.

I think it would be a mistake to resolve this issue by introducing new error handling into read.socrata and write.socrata. The existing errors are descriptive; for example, if you plug in a bad URL you get a 404. Also, modifying the tests is the least invasive way to make changes. Adding unexpected error handling would be a significant change and could have unintended consequences. For example, I would hate to have a process running that updates a dataset nightly, and have that process fail without the expected error.

geneorama avatar Aug 09 '23 21:08 geneorama

Correction, this wouldn't take care of #131.

This would resolve the CRAN tests related to #131, but it wouldn't resolve the failure when writing to Socrata.

geneorama avatar Aug 09 '23 21:08 geneorama

RSocrata was removed from CRAN because it had been failing tests randomly and creating overhead for the CRAN team. The CRAN maintainers are no longer allowing any tolerance for occasional transgressions, and RSocrata will be removed again after a single false-positive failed test.

The problem is that over the years we’ve written between 50 and 100 tests to address specific issues with specific datasets. The tests work when everything is working normally, but there are ways that the tests can fail under abnormal circumstances. For example, there could be a brief but unplanned outage, planned API maintenance, a dataset might get removed, or there could be changes in datasets that invalidate our tests.

Those external factors are unexpected, unusual, and generally impossible to predict. Also, it's hard to make exceptions for the tests. If we said "when this test fails, ignore it", then what's the purpose of the test? Also, I'm not sure if we can reliably catch outages and timeouts because this isn't a reproducible condition.

Some ideas:

  • Remove all external dependencies from the automated tests. One could argue that the point of CRAN tests is to test package dependencies, and we could still keep tests that run locally to protect against regressions.
  • We could only run tests against the Socrata demo site and work with the vendor to get more examples online.
  • We could test for general API availability and hope for the best (this was my initial approach on this branch).

That's not a comprehensive list of possible approaches, and each has its own considerations.

In my opinion we need a test API where we can reproduce the specific conditions that we are testing if we're ever going to have more reliable interactions with the API.

The tests use several API endpoints, examples include:

  • "read Socrata CSV is compatible with posixify" uses https://soda.demo.socrata.com/resource/4334-bgaj.csv
  • "read Socrata JSON is compatible with posixify (issue 85)" uses https://soda.demo.socrata.com/resource/9szf-fbd4.json
  • "read Socrata JSON that uses ISO 8601 but does not specify subseconds" uses https://data.cityofnewyork.us/resource/qcdj-rwhu.json
  • "Calendar Date Short" uses https://data.cityofchicago.org/resource/y93d-d9e3.csv?$order=debarment_date
  • "Date is not entirely NA if the first record is bad (issue 68)" uses https://data.cityofchicago.org/resource/4h87-zdcp.csv

All of the tests are here: https://github.com/Chicago/RSocrata/blob/master/tests/testthat/test-all.R

It’s unsustainable to rely on this disparate collection of random datasets, because you never know when the conditions will change and invalidate the tests.

geneorama avatar Aug 22 '23 17:08 geneorama

After going back and forth with ChatGPT, I had another idea: a two-step approach.

Within read.socrata add a handler for timeout errors that looks something like this:

read.socrata <- function(url, ...) {
  tryCatch({
    response <- httr::GET(url)
    ## ...plus the rest of the read.socrata code, which builds `result`...
    result
  }, error = function(e) {
    warning(paste("Website non-responsive for URL:", url, "Skipping request."))
    NULL
  })
}

Then write a second wrapper for the tests, perhaps called test.socrata, and skip the tests when they are NULL.
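The test-side wrapper could call testthat::skip() whenever the guarded read returns NULL. A hypothetical sketch of what test.socrata might look like:

```r
## Hypothetical sketch of the test-side wrapper described above.
## Assumes read.socrata returns NULL on an outage (per the handler above).
test.socrata <- function(url, ...) {
  df <- read.socrata(url, ...)
  if (is.null(df)) {
    testthat::skip(paste("Socrata unreachable, skipping:", url))
  }
  df
}

## Inside a test:
## test_that("read Socrata CSV is compatible with posixify", {
##   df <- test.socrata("https://soda.demo.socrata.com/resource/4334-bgaj.csv")
##   expect_s3_class(df, "data.frame")
## })
```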

This is a bit of a work in progress and wouldn't handle cases where datasets change or other unexpected errors occur, but it would at least (hopefully) be foolproof for outages.

Also, I noticed that httr2 is out, and that could provide some mechanisms for more robust handling.
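For instance, httr2 has built-in timeout and retry controls that could replace the hand-rolled tryCatch. A sketch, untested against the Socrata API; the URL and settings are illustrative:

```r
## Sketch: httr2 request pipeline with a timeout and automatic retries.
library(httr2)

resp <- request("https://data.cityofchicago.org/resource/4h87-zdcp.csv") |>
  req_timeout(10) |>            # give up after 10 seconds
  req_retry(max_tries = 3) |>   # retry transient failures
  req_perform()

resp_status(resp)
```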

geneorama avatar Aug 22 '23 21:08 geneorama

For future reference there may be something useful in the mocking functions documented here: https://r-pkgs.org/testing-advanced.html#mocking I'm not sure at a glance.
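With testthat 3.x, local_mocked_bindings() could stub out the HTTP layer so a test runs entirely offline. A hypothetical sketch, assuming the package funnels its network calls through an internal helper (named getResponse here as a guess) and that a recorded response has been saved as a fixture:

```r
## Hypothetical sketch: mock the network call so the test runs offline.
## `getResponse` and the fixture path are assumptions, not verified names.
test_that("posixify handles Socrata CSV dates", {
  local_mocked_bindings(
    getResponse = function(url, ...) {
      readRDS(test_path("fixtures", "4334-bgaj.rds"))
    }
  )
  df <- read.socrata("https://soda.demo.socrata.com/resource/4334-bgaj.csv")
  expect_s3_class(df, "data.frame")
})
```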

geneorama avatar Aug 24 '23 19:08 geneorama