Make DockerAppliance image existence work better with private repositories
It seems there are some issues with starting Toil when the Toil Docker image comes from a private registry. In __init__.py, Toil first checks whether the image exists (in the checkDockerImageExists method), but the check does not account for authentication. This results in the following error:
Traceback (most recent call last):
  File "<path>/bin/toil", line 8, in <module>
    sys.exit(main())
  File "<path>/lib/python3.8/site-packages/toil/utils/toilMain.py", line 33, in main
    get_or_die(module, 'main')()
  File "<path>/lib/python3.8/site-packages/toil/utils/toilLaunchCluster.py", line 119, in main
    applianceSelf(forceDockerAppliance=options.forceDockerAppliance)
  File "<path>/lib/python3.8/site-packages/toil/__init__.py", line 201, in applianceSelf
    return checkDockerImageExists(appliance=appliance)
  File "<path>/lib/python3.8/site-packages/toil/__init__.py", line 284, in checkDockerImageExists
    return requestCheckDockerIo(origAppliance=appliance, imageName=imageName, tag=tag)
  File "<path>/lib/python3.8/site-packages/toil/__init__.py", line 435, in requestCheckDockerIo
    raise ApplianceImageNotFound(origAppliance, requests_url, response.status_code)
toil.ApplianceImageNotFound: The docker image that TOIL_APPLIANCE_SELF specifies (<docker_name>) produced a nonfunctional manifest URL (https://registry-1.docker.io/v2/<docker_path>). The HTTP status returned was 401. The specifier is most likely unsupported or malformed. Please supply a docker image with the format: '<websitehost>.io/<repo_path>:<tag>' or '<repo_path>:<tag>' (for official docker.io images). Examples: 'quay.io/ucsc_cgl/toil:latest', 'ubuntu:latest', or 'broadinstitute/genomes-in-the-cloud:2.0.0'.
2022-08-19 18:33:06,639 WARNING Something is wrong, cluster has not started...
This is the case for both private docker.io images and other registries. For docker.io (in requestCheckDockerIo), I see that a token is requested at
https://github.com/DataBiosphere/toil/blob/59ed0422c15ea7acb7eb715837f67fbe88c2a604/src/toil/__init__.py#L427
But that request is made without an auth argument, and the resulting token is then used for the request to https://registry-1.docker.io/v2/..., which returns HTTP 401. I presume this is where the issue is (?).
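To illustrate what I mean, here is a minimal sketch of a token request that optionally carries credentials, assuming Docker Hub's token endpoint accepts HTTP Basic auth for private repositories. The function names and the username/password parameters are hypothetical and are not part of Toil; the sketch uses only the standard library rather than whatever HTTP client Toil uses.

```python
import base64
import json
import urllib.error
import urllib.parse
import urllib.request

AUTH_URL = "https://auth.docker.io/token"
REGISTRY_URL = "https://registry-1.docker.io/v2"
# Manifest media types to accept, so newer image formats are not rejected.
MANIFEST_TYPES = [
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
]


def build_token_request(image_name, username=None, password=None):
    """Build the pull-token request; add Basic auth only if credentials are given."""
    query = urllib.parse.urlencode({
        "service": "registry.docker.io",
        "scope": f"repository:{image_name}:pull",
    })
    request = urllib.request.Request(f"{AUTH_URL}?{query}")
    if username and password:
        credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
        request.add_header("Authorization", f"Basic {credentials}")
    return request


def manifest_url(image_name, tag):
    """URL of the manifest for image_name:tag on Docker Hub."""
    return f"{REGISTRY_URL}/{image_name}/manifests/{tag}"


def docker_hub_image_exists(image_name, tag, username=None, password=None):
    """Return True if the manifest is reachable with the obtained token.

    image_name must be the full repository path (official images need the
    'library/' prefix, e.g. 'library/ubuntu').
    """
    token_request = build_token_request(image_name, username, password)
    with urllib.request.urlopen(token_request, timeout=30) as response:
        token = json.load(response)["token"]
    request = urllib.request.Request(manifest_url(image_name, tag), method="HEAD")
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Accept", ", ".join(MANIFEST_TYPES))
    try:
        with urllib.request.urlopen(request, timeout=30):
            return True
    except urllib.error.HTTPError as error:
        if error.code in (401, 404):
            return False  # not found, or not visible with these credentials
        raise
```

Without credentials this behaves like an anonymous check, which matches the 401 above for a private image; with credentials the token should carry pull access to the private repository.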
--
I do know that TOIL_CUSTOM_INIT_COMMAND and TOIL_CUSTOM_DOCKER_INIT_COMMAND are available as a way to pull from private Docker registries (that was their purpose, as mentioned in #3182, #3183, and #2561). I have one private Docker image that needs to be pulled in for the workflow, and this works with TOIL_CUSTOM_DOCKER_INIT_COMMAND. However, because of this, I also had to create a Toil Docker image that defines this env var, including some login details, which means I'd want to keep the Toil image private as well. TOIL_CUSTOM_INIT_COMMAND does get called before checkDockerImageExists, which lets me run docker login so that docker pull would work. But checkDockerImageExists doesn't seem to respect the authentication that is already set up.
Toil implements its own polling for image existence and doesn't know how to reach into the Docker client's credentials cache and use the Docker client's authenticated session to do it. I'm not sure why we decided not to just use docker to do the polling; it might be because, when launching a remote cluster, we need to be able to do the polling from the user's local machine, which might not have docker installed or logged in.
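As a rough sketch of what "reaching into the credentials cache" could involve, the snippet below reads a login from the Docker client's config file. It assumes the credentials are stored inline under "auths" in ~/.docker/config.json as base64-encoded "username:password"; when a credential helper or credsStore is configured, the entry has no inline "auth" value and this returns None. The function name load_docker_credentials is hypothetical and is not something Toil provides.

```python
import base64
import json
from pathlib import Path


def load_docker_credentials(registry, config_path=None):
    """Return (username, password) stored inline for registry, or None.

    Only handles inline credentials; entries managed by a credential
    helper are not resolved by this sketch.
    """
    path = Path(config_path) if config_path else Path.home() / ".docker" / "config.json"
    try:
        config = json.loads(path.read_text())
    except FileNotFoundError:
        return None  # Docker not installed, or the user never logged in
    encoded = config.get("auths", {}).get(registry, {}).get("auth")
    if not encoded:
        return None  # no login for this registry, or stored elsewhere
    # Split on the first colon only, since passwords may contain colons.
    username, _, password = base64.b64decode(encoded).decode().partition(":")
    return username, password
```

For Docker Hub the key under "auths" is typically "https://index.docker.io/v1/" rather than "docker.io". The None cases also show why this cannot be the only mechanism: the machine running toil launch-cluster may have no Docker config at all.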
To bypass the polling check, it looks like we have a --forceDockerAppliance flag that toil launch-cluster accepts. Perhaps we should mention that flag in the error message that is reported when the check fails.
If you want to use TOIL_CUSTOM_INIT_COMMAND and TOIL_CUSTOM_DOCKER_INIT_COMMAND, I don't think you need a custom Toil docker to hold them. I think you just set them in the environment for toil launch-cluster and for the workflow that gets run on the cluster, from the shell.
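A rough sketch of that suggestion, for a setup that launches everything from one script: put the two variables into the environment of the toil launch-cluster process instead of baking them into an image. The helper names are hypothetical, the init command strings are placeholders supplied by the caller, and any other options that launch-cluster needs (provisioner, node type, and so on) are assumed to be passed through extra_args.

```python
import os
import subprocess


def build_environment(host_init_command, docker_init_command, base_env=None):
    """Copy the environment and add the two Toil init-command variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["TOIL_CUSTOM_INIT_COMMAND"] = host_init_command
    env["TOIL_CUSTOM_DOCKER_INIT_COMMAND"] = docker_init_command
    return env


def build_launch_command(cluster_name, extra_args=(), skip_image_check=True):
    """Assemble the toil launch-cluster command line.

    skip_image_check adds --forceDockerAppliance, the flag mentioned above
    for bypassing the image existence check.
    """
    command = ["toil", "launch-cluster", cluster_name]
    if skip_image_check:
        command.append("--forceDockerAppliance")
    command.extend(extra_args)
    return command


def launch(cluster_name, host_init_command, docker_init_command, extra_args=()):
    """Run toil launch-cluster with the variables set only for that process."""
    subprocess.run(
        build_launch_command(cluster_name, extra_args),
        env=build_environment(host_init_command, docker_init_command),
        check=True,
    )
```

The same build_environment result could be passed when starting the workflow itself, so that the login details live in the submitting script's environment and never in an image.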
Thanks! Since my setup submits each run with a single command, I needed a way to set the environment variables either as part of the Docker image or when running toil-cwl-runner as a single command. I managed to inject the env vars with the -e ENV=VAL arguments to docker exec, so everything is working now.
Side note: I also saw that the Toil documentation lists the argument --setEnv NAME, which will "set an environment variable early on in the worker". It wasn't clear to me what "early on" means: whether it sets the env vars on the worker instance before the Toil Docker container starts, or inside the Toil container on the worker node. In the end, I didn't use it.