Retry downloading buildpacks
Description
pack fails during create-builder if a buildpack cannot be downloaded because of a temporary issue.
Proposed solution
pack could retry a failed buildpack download a handful of times, so that users can ride out network and service hiccups.
Describe alternatives you've considered
Pack could do nothing. Users would then need to re-run the entire pack command that failed, or run with pull-policy never and docker pull the buildpack images before running pack commands.
Additional context
- [x] This feature should be documented somewhere
We have buildpacks hosted in the public ECR, and it seems flaky.
Status: Downloaded newer image for public.ecr.aws/r2f9u0w4/heroku-maven-buildpack@sha256:7bff54457286a9a36dbcaec77ab041d61c2e1a47741d459c3457e429f6d2f268
ERROR: failed to add buildpacks to builder: fetching image: image public.ecr.aws/r2f9u0w4/heroku-jvm-buildpack@sha256:fd4da69a6f34bce57e7280e290dd741d4059fcf8f5e60c6c6eb66258956dcb3c does not exist on the daemon: not found
Aside: The "does not exist on the daemon" message above is confusing at first. I think it appears because the remote download failed and pack fell back to checking the local Docker daemon.
Hi @jabrown85, @jromero. I started to look at this case, but after analyzing it I believe the issue will be solved by PR #96 in the imgutil repository. That PR adds retry logic for when the image is being downloaded. If the image cannot be downloaded, the message "does not exist on the daemon" is shown for the reason @jabrown85 gave: pack tries to find the image locally after failing to download it remotely.
Today I updated my local copy of the imgutil repository and ran the Fetcher test suite. Some tests fail after applying the code from PR #96.
FAIL: TestFetcher (11.06s)
--- FAIL: TestFetcher/Fetcher (3.30s)
--- FAIL: TestFetcher/Fetcher/#Fetch/daemon_is_true/PullAlways/there_is_a_remote_image/pull_the_image_and_return_the_local_copy (0.09s)
--- FAIL: TestFetcher/Fetcher/#Fetch/daemon_is_true/PullAlways/there_is_a_remote_image/doesn't_log_anything_in_quiet_mode (0.09s)
--- FAIL: TestFetcher/Fetcher/#Fetch/daemon_is_true/PullIfNotPresent/there_is_a_remote_image/there_is_a_local_image/returns_the_local_image (0.09s)
--- FAIL: TestFetcher/Fetcher/#Fetch/daemon_is_true/PullIfNotPresent/there_is_a_remote_image/there_is_no_local_image/returns_the_remote_image (0.09s)
I will take a look at those tests and try to add new tests to reproduce this issue and verify the fix with the code added in the imgutil repository.
Ignore my previous comment. The tests were failing locally because of an issue with Docker Desktop v3.2*; after I downgraded to version 3.1, the tests passed.
Hi @jabrown85,
I created Pull Request #111 in the imgutil repository with some improvements to handle these cases. I think the error you are facing happens because a 401 or 404 status is returned during the call to the ECR registry.
Retries were implemented in imgutil here: https://github.com/buildpacks/imgutil/blob/5c57feb120b3362342d24bc45d59f0d66a44ebcc/remote/new.go#L234-L253
We looked into retrying on 40X errors, but decided that it might unnecessarily impact performance in cases where it could be perfectly valid that the image doesn't exist (e.g., previous image).
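The trade-off described above can be sketched as a small classification function: treat 4xx responses as final (the image may legitimately not exist) and retry only on server-side failures. This is an illustration of the reasoning, not the imgutil implementation; `shouldRetry` is a hypothetical name, and retrying on every 5xx status is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetry is a hypothetical helper (not taken from imgutil).
// 4xx responses are treated as final, because a missing or unauthorized
// image can be a valid outcome (e.g., no previous image exists).
// 5xx responses are assumed to be transient and worth retrying.
func shouldRetry(statusCode int) bool {
	switch {
	case statusCode >= 400 && statusCode < 500:
		return false
	case statusCode >= 500 && statusCode < 600:
		return true
	default:
		return false
	}
}

func main() {
	for _, code := range []int{
		http.StatusUnauthorized,       // 401
		http.StatusNotFound,           // 404
		http.StatusServiceUnavailable, // 503
	} {
		fmt.Printf("%d retry=%v\n", code, shouldRetry(code))
	}
}
```

With this rule, a 401 or 404 from the registry fails fast instead of adding retry delay to every lookup of an image that does not exist.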