distributed icon indicating copy to clipboard operation
distributed copied to clipboard

Cache conda environment between CI test runs

Open charlesbluca opened this issue 3 years ago β€’ 10 comments

Closes #6853

Follows the steps outlined in the setup-miniconda docs to cache the conda environment between test runs.

Note that (as pointed out by @gjoseph92), since we use unpinned CI dependencies in our environments, it's possible for the cached environment to become outdated - hoping to address this with a triggering keyword that can be added to PRs or commits to skip the caching and install the environment from scratch.

  • [ ] Tests added / passed
  • [ ] Passes pre-commit run --all-files

charlesbluca avatar Aug 08 '22 21:08 charlesbluca

Sweet, thanks @charlesbluca!

How can we test this out? Seems like tests don't actually run when you just modify the GitHub actions yamls. Maybe push a spurious change to something else to trigger tests, wait a bit, then push another change and make sure the second build used the cache?

gjoseph92 avatar Aug 08 '22 21:08 gjoseph92

I don't see anything in the modified workflow implying that it only runs on changes to certain files, so think we might just be blocked up by other running jobs in the Dask org; I'll ping this PR with an empty commit in a couple hours to see if things run then, but my plan was to do what you described, and then push a commit containing [skip-caching] to see if we're able to disable caching as expected πŸ™‚

charlesbluca avatar Aug 08 '22 21:08 charlesbluca

In other PRs, I'm used to seeing jobs at least queued, if not running, as soon as something gets pushed. Like on https://github.com/dask/distributed/pull/6856 right now, I see: image

Maybe try pushing a spurious commit right now? Not sure if anything will change if we wait.

gjoseph92 avatar Aug 08 '22 21:08 gjoseph92

Thanks @jrbourbeau πŸ€¦πŸΌβ€β™‚οΈ didn't realize the failed workflow syntax wouldn't pop up here, will fix now

charlesbluca avatar Aug 08 '22 21:08 charlesbluca

Looks like that got things working πŸŽ‰

charlesbluca avatar Aug 08 '22 22:08 charlesbluca

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

βŸβ€„βŸβ€„β€ˆβŸβ€„15 files  Β±0β€‚β€ƒβŸβ€„βŸβ€„β€ˆβŸβ€„15 suites  Β±0   6h 34m 30s :stopwatch: + 6m 40s βŸβ€„2β€ˆ992 tests Β±0β€‚β€ƒβŸβ€„2β€ˆ902 :heavy_check_mark: Β±0β€‚β€ƒβŸβ€„β€ˆβŸβ€„88 :zzz: Β±0  2 :x: Β±0  22β€ˆ189 runs  Β±0  21β€ˆ137 :heavy_check_mark:  -β€Š2  1β€ˆ050 :zzz: +2  2 :x: Β±0 

For more details on these failures, see this check.

Results for commit 5f85ef7e. ± Comparison against base commit bf537608.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Aug 09 '22 01:08 github-actions[bot]

Weird, the cache didn't work between those two commits, even though they seemed to have the same cache key. I had figured the cache would be saved as soon as the cache task completed, but maybe because the first CI job was interrupted, it wasn't?

gjoseph92 avatar Aug 09 '22 02:08 gjoseph92

It looks like maybe the cache isn't being saved for failed runs?

Unfortunately, due to GitHub Actions running on UTC, the cached environments for the non-macOS runs for https://github.com/dask/distributed/pull/6855/commits/a08262380d8064a66b87991c8f2cd37e1e2a0bc3 are no longer valid and need to be solved again. However, the macOS runs should have created a cached environment for today, but I notice the failed 3.8 runs didn't bother to cache the environment after failing (example).

However, the macOS 3.10 run succeeded and did end up saving the conda environment in the cache, which we can see getting picked up here and here. I notice that the time it takes to restore the cache is variable (in one run it takes >30 seconds, in another it takes nearly 4 minutes).

EDIT:

Some quick digging into this shows that the official caching action only updates the cache if the job it was polled in succeeds.

This is somewhat annoying considering the flaky nature of Distributed's testing, but is reasonable considering we might not want the conda environment to get cached if it would end up blocking CI. There is a fork of the standard caching action that always updates the cache, not sure if this is something we want to consider.

charlesbluca avatar Aug 09 '22 02:08 charlesbluca

Looks like all the OS / python version environments are now cached and we can see the speed up in:

https://github.com/dask/distributed/actions/runs/2822501538

charlesbluca avatar Aug 09 '22 03:08 charlesbluca

Looks like the option to skip caching is working fine πŸ™‚ commits with [skip-caching] in the message should now skip over the caching process and resolve the environment, which should be good in cases where we need to pull in the latest CI deps.

Is there anywhere it would make sense to document this option, or the caching behavior?

charlesbluca avatar Aug 09 '22 15:08 charlesbluca

cc @gjoseph92 @jrbourbeau this should be ready for review now

charlesbluca avatar Aug 22 '22 16:08 charlesbluca