AWS jobstore buckets should inherit owner tags from cluster

Open glennhickey opened this issue 4 years ago • 3 comments

I've been running jobs with AWS autoscale for the first time in a bit and it seems when Cactus fails, the jobstore is left in an unusable state. For example, I just aborted a workflow with aws:us-west-2:glennhickey-jobstore-pa3 as the jobstore.

But if I try to run it with --restart I get

cactus-graphmap aws:us-west-2:glennhickey-jobstore-pa3 ./apes.pan.txt s3://vg-k8s/vgamb/users/hickey/apes-pangenome/apes.minigraph.gfa.gz s3://vg-k8s/vgamb/users/hickey/apes-pangenome/apes.pan.paf --realTimeLogging --reference hg38  --base --nodeTypes r4.8xlarge:1.25 --maxNodes 25 --nodeStorage 1000 --batchSystem mesos --provisioner aws --defaultPreemptable  --realTimeLogging --mapCores 32 --outputFasta  s3://vg-k8s/vgamb/users/hickey/apes-pangenome/apes.gfa.fa --delFilter 5000000 --logFile paf.log --restart
/usr/local/lib/python3.8/dist-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (2.3.0)/charset_normalizer (2.0.10) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[2022-02-01T14:27:54+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 0s.
[2022-02-01T14:27:54+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 1s.
[2022-02-01T14:27:55+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 1s.
[2022-02-01T14:27:56+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 4s.
^CTraceback (most recent call last):

and the message loops forever. Same deal if I try

toil clean aws:us-west-2:glennhickey-jobstore-pa3 
/usr/local/lib/python3.8/dist-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.8) or chardet (2.3.0)/charset_normalizer (2.0.10) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[2022-02-01T14:28:46+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 0s.
[2022-02-01T14:28:46+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 1s.
[2022-02-01T14:28:47+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 1s.
[2022-02-01T14:28:48+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 4s.
[2022-02-01T14:28:52+0000] [MainThread] [I] [toil.lib.retry] Got An error occurred (404) when calling the HeadBucket operation: Not Found, trying again in 16s.

This has happened every time my workflow has aborted, for whatever reason. Not sure if it's related to changes in Toil or our AWS environment...
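
For illustration, the loop in the logs above comes from a 404 on HeadBucket being retried with growing delays. Below is a minimal sketch of one way a caller could tell "the bucket is gone" apart from a transient failure and stop after a few attempts. It is not Toil's retry code: `is_missing_bucket`, `should_keep_retrying`, and `max_missing_attempts` are hypothetical names, and the dict layout is assumed to match the `response` attribute of a botocore `ClientError`.

```python
from typing import Any, Mapping


def is_missing_bucket(error_response: Mapping[str, Any]) -> bool:
    """Return True if an S3 error response says the bucket does not exist.

    Assumes the shape of botocore's ClientError.response:
    {"Error": {"Code": ...}, "ResponseMetadata": {"HTTPStatusCode": ...}}.
    """
    code = str(error_response.get("Error", {}).get("Code", ""))
    status = error_response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return code in {"404", "NoSuchBucket"} or status == 404


def should_keep_retrying(
    error_response: Mapping[str, Any],
    attempt: int,
    max_missing_attempts: int = 5,  # hypothetical limit, chosen for the example
) -> bool:
    """Decide whether to retry after a failed HeadBucket call.

    A missing bucket is retried only a bounded number of times, since a
    freshly created bucket can briefly return 404 as well. Any other error
    is left to the caller's normal retry policy.
    """
    if is_missing_bucket(error_response):
        return attempt < max_missing_attempts
    return True
```

With a bound like this, `--restart` or `toil clean` could report that the bucket is missing instead of backing off indefinitely; whether that fits Toil's actual retry design is an open question.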

┆Issue is synchronized with this Jira Story ┆friendlyId: TOIL-1140

glennhickey Feb 01 '22 14:02

Oh boy, I bet this is about me forgetting to set TOIL_OWNER_TAG. If it is, I would like to change this issue to a feature request:

When I create a cluster with toil launch-cluster --owner MYEMAIL, would it be possible to have TOIL_OWNER_TAG set to MYEMAIL by default whenever I open a shell on the cluster?

glennhickey Feb 01 '22 14:02

Sounds like there are two problems here:

  1. Without the bucket, we can't clean the job store and destroy the SimpleDB domain.
  2. It would be nice if the cluster's owner tag became the default TOIL_OWNER_TAG value in the default environment on the cluster (maybe in something mounted as /etc/profile in the appliance container?).
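
The defaulting rule in the second point could be sketched roughly like this. It is only an illustration under stated assumptions, not how Toil implements anything: the `Owner` tag key, the `resolve_owner_tag` helper, and passing the cluster's tags in as a plain dict are all made up for the example. An explicitly set TOIL_OWNER_TAG wins; otherwise the cluster's owner tag becomes the default.

```python
from typing import Mapping, Optional

OWNER_ENV_VAR = "TOIL_OWNER_TAG"
OWNER_TAG_KEY = "Owner"  # assumed name of the tag written by --owner


def resolve_owner_tag(
    environ: Mapping[str, str],
    cluster_tags: Mapping[str, str],
) -> Optional[str]:
    """Pick the owner tag to apply to newly created resources.

    Precedence: a non-empty TOIL_OWNER_TAG in the environment, then the
    cluster's own owner tag, then None (no tag available).
    """
    explicit = environ.get(OWNER_ENV_VAR)
    if explicit:
        return explicit
    return cluster_tags.get(OWNER_TAG_KEY) or None


if __name__ == "__main__":
    # Hypothetical tags for a cluster launched with --owner MYEMAIL.
    tags = {"Owner": "MYEMAIL"}
    print(resolve_owner_tag({}, tags))                            # MYEMAIL
    print(resolve_owner_tag({"TOIL_OWNER_TAG": "someone"}, tags))  # someone
```

Keeping the lookup as a pure function of the environment and the tags leaves open where the tags come from (instance metadata, a file mounted into the appliance, and so on).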

adamnovak Feb 01 '22 15:02

The first problem is tracked by https://github.com/DataBiosphere/toil/issues/3924 I think.

adamnovak Feb 01 '22 15:02