
More easily provision Toil clusters with custom-configured images?

Open boyangzhao opened this issue 3 years ago • 9 comments

I'm following the instructions on how to run Toil on AWS, including using toil launch-cluster to start clusters and toil rsync-cluster to sync files onto the cluster. This is working. However, there are quite a few manual steps each time: we need to clone our customized workflows (which can be hundreds of cross-referenced CWL files) onto the leader node, connect to S3 for the inputs (including ones that are directories), and upload the outputs from the leader node back to S3.

Is there an easier way to set up clusters (where all the configuration is saved, the customized workflows are already in place, and the S3 bucket is mounted) and execute the workflows (with the outputs automatically uploaded to the desired S3 location)?

Thanks!


boyangzhao avatar May 08 '22 11:05 boyangzhao

For getting files to/from S3, if this was a Toil workflow, you could use import_file and export_file with S3 URLs.
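For illustration, here is a minimal sketch of that pattern in a Python Toil workflow (the bucket/key names and the to_upper job are placeholders, not anything from your setup):

from toil.common import Toil
from toil.job import Job


def to_upper(job, input_file_id):
    # Read the imported file out of the job store, transform it, and
    # write the result back as a new job store file.
    in_path = job.fileStore.readGlobalFile(input_file_id)
    out_path = job.fileStore.getLocalTempFile()
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.write(src.read().upper())
    return job.fileStore.writeGlobalFile(out_path)


if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    with Toil(options) as toil:
        # Import straight from S3 into the job store...
        input_id = toil.import_file("s3://my-bucket/inputs/test.txt")
        output_id = toil.start(Job.wrapJobFn(to_upper, input_id))
        # ...and export the result back to S3.
        toil.export_file(output_id, "s3://my-bucket/outputs/test.txt")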

I think we have this functionality attached to File and Directory objects that are passed into a CWL workflow, so we want you to be able to use an S3 URL as an input file URL. But it might not actually work unless we make ToilFsAccess support things like listing the contents of S3 prefixes, because if it can't do that and cwltool's base class for it can't do that, I don't know what piece would be doing it.

We also have a --destBucket option in the CWL runner for sending output to a URL that Toil can write to, such as an s3:// URL.
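For example (the bucket and workflow file names here are placeholders):

toil-cwl-runner --destBucket s3://my-output-bucket/results/ workflow.cwl job.yaml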

If you want to save some cluster setup steps, one option is to create and set up the cluster once, and then to use it for multiple workflows. For a Mesos cluster, you are really meant to run only one workflow at a time, especially if the workflow is autoscaling the cluster, but you can run several workflows one after the other. For a Kubernetes cluster (-t kubernetes to toil launch-cluster), you can use the Kubernetes cluster autoscaler by specifying ranges for worker node counts when setting the cluster up. Then it's easy to run several workflows at once on the same cluster.
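For example, a Kubernetes cluster with an autoscaling worker pool might be launched like this (cluster name, zone, key pair, and instance types are all placeholders):

toil launch-cluster my-cluster \
    --clusterType kubernetes \
    --zone us-west-2a \
    --keyPairName my-keypair \
    --leaderNodeType t2.medium \
    --nodeTypes t3a.xlarge \
    --workers 0-10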

You can also take the Toil docker image, build your own image based on it with your code pre-loaded, and put that in the TOIL_APPLIANCE_SELF environment variable when launching the cluster. Then that image will be the container you are in when you toil ssh-cluster, and your code will already be there after cluster startup.
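A minimal sketch of that (image names, tags, and paths are placeholders; base the image on the Toil version you intend to run):

cat > Dockerfile <<'EOF'
FROM quay.io/ucsc_cgl/toil:latest
# Bake the customized workflows into the appliance image.
COPY my-workflows/ /opt/my-workflows/
EOF
docker build -t quay.io/my-org/toil-custom:latest .
docker push quay.io/my-org/toil-custom:latest

TOIL_APPLIANCE_SELF=quay.io/my-org/toil-custom:latest \
    toil launch-cluster my-cluster --zone us-west-2a --keyPairName my-keypair --leaderNodeType t2.medium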

adamnovak avatar May 09 '22 16:05 adamnovak

> However, there are quite a few manual steps each time: we need to clone our customized workflows (which can be hundreds of cross-referenced CWL files) onto the leader node

For CWL, you can pack the workflow and all the documents referenced via $include, $import, and run into a single file. For that, I recommend cwlpack from https://github.com/rabix/sbpack
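For example (the workflow file name is a placeholder):

pip install sbpack
cwlpack my-main-workflow.cwl > packed.cwl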

mr-c avatar May 09 '22 18:05 mr-c

Thank you! I see the path forward to set this up then. And --destBucket is working nicely.

> I think we have this functionality attached to File and Directory objects that are passed into a CWL workflow, so we want you to be able to use an S3 URL as an input file URL. But it might not actually work unless we make ToilFsAccess support things like listing the contents of S3 prefixes, because if it can't do that and cwltool's base class for it can't do that, I don't know what piece would be doing it.

As for referencing s3:// URLs in the YAML files, I'm having issues getting it to recognize that the input comes from an S3 bucket. I've replaced the actual S3 URI below with test/test.cwl; the issue and the logic are the same.

Example YAML file:

txt_file:
  class: File
  path: s3://test/test.cwl

Error message when running toil-cwl-runner:

...
cwltool.errors.WorkflowException:  Reading s3://test/test.cwl
 [Errno 2] No such file or directory: '/tmp/s3://test/test.cwl'

I am running toil-cwl-runner with /tmp as the current working directory on the cluster; somehow that is being added as a prefix. I've tried the S3 HTTPS URL (but there were permission issues), and also tried path vs. location in the YAML (same error). I saw an earlier issue, #2828, but it was not quite related: there the question was whether Directory inputs (vs. File) were allowed on S3 (I presume Directory on S3 works now?).

boyangzhao avatar May 10 '22 18:05 boyangzhao

We don't have this under CI as far as I know, so it's quite possible S3:// URL import for CWL has broken. That's one of the things I didn't exercise in my tests for AGC integration.

A workaround might be to prepare pre-signed HTTPS URLs for the S3 objects you want to import, although this probably cannot work for a CWL Directory.
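For example, with the AWS CLI (bucket and key are placeholders; the URL here expires after 24 hours):

aws s3 presign s3://test/test.cwl --expires-in 86400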

adamnovak avatar May 11 '22 19:05 adamnovak

OK, it looks like according to the spec, that's not actually the right way to reference a URL. path is only for actual on-disk paths. What ought to work is setting location instead:

txt_file:
  class: File
  location: s3://test/test.cwl

adamnovak avatar May 11 '22 19:05 adamnovak

Oh sorry, you said you already tried location. That really should have worked. I'll open a new issue for this.

adamnovak avatar May 11 '22 19:05 adamnovak

Does toil launch-cluster work with private Toil Docker images? Per the earlier suggestion of building my own Toil image and setting the env variable TOIL_APPLIANCE_SELF, this now seems to work with a public Docker registry. But if I try to launch the cluster pointing at a Toil image on my private AWS ECR registry, I get the following error message (id and image name replaced with placeholders):

toil.ApplianceImageNotFound: The docker image that TOIL_APPLIANCE_SELF specifies (<id>.dkr.ecr.us-east-1.amazonaws.com/<docker_name>:latest) produced a nonfunctional manifest URL (https://<id>.dkr.ecr.us-east-1.amazonaws.com/v2/<docker_name>/manifests/latest). The HTTP status returned was 401. The specifier is most likely unsupported or malformed.  Please supply a docker image with the format: '<websitehost>.io/<repo_path>:<tag>' or '<repo_path>:<tag>' (for official docker.io images).  Examples: 'quay.io/ucsc_cgl/toil:latest', 'ubuntu:latest', or 'broadinstitute/genomes-in-the-cloud:2.0.0'.

I saw another issue, https://github.com/DataBiosphere/toil/issues/2166, that was closed, but I think that one was about how to run workflows on the workers when some of the CWL steps require private Docker images, not about the Toil appliance image itself.

boyangzhao avatar May 15 '22 17:05 boyangzhao

@boyangzhao The problems with reading from S3 in CWL (#4094) were fixed in #4113. So with the current mainline Toil that should work now.

If you want to use your own TOIL_APPLIANCE_SELF, Docker on the cluster nodes needs to be able to pull it. I don't think we have machinery in Toil to do a docker login on startup.

We do have a --awsEc2ProfileArn option that toil launch-cluster takes, which can assign a custom IAM Instance Profile to all your cluster nodes. You could create an IAM Instance Profile that has access to your ECR registry, in addition to Toil's required AWS permissions, and use that for your cluster.
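For example (the profile ARN is a placeholder for an instance profile you've created with Toil's required permissions plus ECR read access):

toil launch-cluster my-cluster \
    --awsEc2ProfileArn arn:aws:iam::123456789012:instance-profile/my-toil-profile \
    --zone us-west-2a --keyPairName my-keypair --leaderNodeType t2.medium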

adamnovak avatar May 31 '22 17:05 adamnovak

For some reason, I can't seem to get this fix to work on my example files, and it still keeps showing [Errno 2] No such file or directory: '/data/s3://<dir>/test.cwl'. I think maybe I'm not pulling/building it right. A few things I've tried:

  • pulling the latest published Docker image quay.io/ucsc_cgl/toil:5.7.0a1-979050e0d1ca68b0a1d10bf91143ec135523d850-py3.8
  • git cloning the toil repo and checking out master, issues/4094-fix-s3-cwl, issues/4109-4094-4107-megabranch, or explicitly 30b62dce, and then either 1) building a Docker image with TOIL_DOCKER_REGISTRY=<my username> make docker, or 2) pip-installing with make prepare, make develop extras=[all]

After which, I either run it from the Docker image, e.g. with docker run -it -v $(pwd):/data quay.io/ucsc_cgl/toil:5.7.0a1-979050e0d1ca68b0a1d10bf91143ec135523d850-py3.8 toil-cwl-runner /data/helloworld.cwl /data/helloworld.job.yaml

Or I run toil-cwl-runner directly from the venv. Perhaps it's something silly I'm overlooking?

Here I've replaced some of the values with <> placeholders, but in practice those were filled in as well. Everything runs fine when the file path is local rather than on S3.


helloworld.cwl

class: CommandLineTool
cwlVersion: v1.0
id: helloworld
baseCommand:
  - echo
inputs:
  - id: txt_file
    type: File?
    inputBinding:
      position: 0
      loadContents: true
      valueFrom: $(self.contents)
outputs:
  - id: output
    type: File?
    outputBinding:
      glob: helloworld.txt
label: helloworld
requirements:
  - class: ShellCommandRequirement
  - class: DockerRequirement
    dockerPull: 'alpine:3.7'
  - class: InlineJavascriptRequirement
stdout: helloworld.txt

helloworld.job.yaml

txt_file:
  class: File
  location: s3://<dir>/test.txt

boyangzhao avatar Jun 02 '22 10:06 boyangzhao