Update documentation for running on AWS spot market nodes to mention the need for marking jobs preemptable
I've been running a workflow on non-preemptible instances and it's been working fine. AWS gave me more spot instances, so I tried to run the workflow on the cheaper instances. I observed that Toil wouldn't deploy jobs on preemptible instances. This is using Toil version 3.11.
I don't believe that the workflow I'm running is the cause of this, but if you want to look into it, the workflow can be found here: https://github.com/benedictpaten/marginPhase/tree/master/toil
Here's a breakdown of what I did and saw:
toil-marginphase run --defaultCores 8 --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes r3.4xlarge:0.3 --nodeStorage 500 --maxNodes 30 --minNodes 3 aws:us-west-2:marginphase-param-testing-3-6
Three r3.4xlarge instances were created, but no work was being done. I logged onto the instances and saw that a toil-worker process was running, but there was no code from my workflow being run. I observed this over multiple attempts, both with new workflows and restarts of existing workflows.
toil-marginphase run --restart --defaultCores 8 --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes m4.4xlarge:0.3 --nodeStorage 500 --maxNodes 30 --minNodes 4 aws:us-west-2:marginphase-param-testing-3-6
I observed the same behavior with a different node type.
toil-marginphase run --restart --defaultCores 8 --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes m4.4xlarge --nodeStorage 500 --maxNodes 30 --minNodes 4 aws:us-west-2:marginphase-param-testing-3-6
When I used non-preemptible instances, the workflow ran as expected.
toil-marginphase run --restart --defaultCores 8 --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes m4.4xlarge:0.3 --nodeStorage 500 --maxNodes 30 --minNodes 4 aws:us-west-2:marginphase-param-testing-3-6
After seeing the workflow function with non-preemptible instances, I killed it and restarted with the above command again. Again, I saw four m4.4xlarge nodes launched, but no work being performed.
toil-marginphase run --restart --defaultCores 8 --maxMemory 60G --maxCores 8 --batchSystem mesos --provisioner aws --nodeTypes m4.4xlarge,r3.4xlarge:0.3 --nodeStorage 500 --maxNodes 20,20 --minNodes 5,5 aws:us-west-2:marginphase-param-testing-3-6
Lastly, I tried running the workflow with a mixture of preemptible and non-preemptible instances, each with a min and max node count. I observed that the minimum number of each node type was created. Work was performed only on the non-preemptible instances, which were eventually scaled up. No work (or autoscaling) was ever done on the preemptible instances.
Can you dump /var/lib/mesos/mesos-master.INFO from the master? It's probably very long, but I think you can attach it.
Oh, yeah, I forgot to ask. Are the jobs marked as preemptable? I think by default jobs are considered non-preemptable. From looking at the logs it looks like the nodes are working correctly but their offers are getting declined (probably because no jobs are marked as preemptable).
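To make the declined-offer behavior described above concrete, here is a toy model in plain Python. It is not Toil's or Mesos's actual scheduling code: the names `ToyJob`, `ToyOffer`, `accepts`, and `schedule` are hypothetical, and the matching rule (a preemptable job may run on either node type, a non-preemptable job only on non-preemptable nodes) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyJob:
    name: str
    # Assumed default, mirroring the comment above: jobs are
    # treated as non-preemptable unless marked otherwise.
    preemptable: bool = False


@dataclass
class ToyOffer:
    node_type: str
    preemptable: bool  # True for a spot (bid) node


def accepts(job: ToyJob, offer: ToyOffer) -> bool:
    """Assumed rule: only preemptable jobs may land on preemptable nodes."""
    return job.preemptable or not offer.preemptable


def schedule(jobs, offers):
    """Place at most one job per offer; return (placements, declined offers)."""
    placed = {}
    declined = []
    for offer in offers:
        match = next(
            (j for j in jobs if j.name not in placed and accepts(j, offer)),
            None,
        )
        if match is None:
            declined.append(offer.node_type)
        else:
            placed[match.name] = offer.node_type
    return placed, declined


if __name__ == "__main__":
    offers = [ToyOffer("r3.4xlarge:0.3", True), ToyOffer("m4.4xlarge", False)]
    # With default (non-preemptable) jobs, the spot offer is declined.
    print(schedule([ToyJob("a"), ToyJob("b")], offers))
    # With jobs marked preemptable, both offers are used.
    print(schedule([ToyJob("a", True), ToyJob("b", True)], offers))
```

Under this model, a workflow whose jobs all keep the default never uses the spot nodes, which matches the idle preemptible instances reported in this issue.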
There actually is a "just use the preemptable nodes" option after all, --defaultPreemptable.
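As a rough sketch of how the commands earlier in this issue could be adjusted, the helper below adds `--defaultPreemptable` to an existing command line. The function name `with_default_preemptable` is hypothetical, and the sketch assumes the flag is a plain on/off switch and that the job store locator is the last argument; check `--help` for your Toil version before relying on either.

```python
import shlex


def with_default_preemptable(command: str) -> list:
    """Return the command's argv with --defaultPreemptable added once.

    Assumes the final argument is the job store locator, so the flag
    is inserted just before it.
    """
    args = shlex.split(command)
    if "--defaultPreemptable" not in args:
        args.insert(len(args) - 1, "--defaultPreemptable")
    return args


if __name__ == "__main__":
    cmd = (
        "toil-marginphase run --defaultCores 8 --maxMemory 60G --maxCores 8 "
        "--batchSystem mesos --provisioner aws --nodeTypes r3.4xlarge:0.3 "
        "--nodeStorage 500 --maxNodes 30 --minNodes 3 "
        "aws:us-west-2:marginphase-param-testing-3-6"
    )
    print(shlex.join(with_default_preemptable(cmd)))
```

The alternative mentioned in this thread is to mark individual jobs as preemptable in the workflow code, which gives finer control than a workflow-wide default.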
Thanks Joel! So I see now that the documentation covers this flag (http://toil.readthedocs.io/en/3.11.0/running/amazon.html "Preemptability" section), but I think it should be more apparent that without this flag (or by specifying in the job) your jobs won't run on the preemptable nodes. I think a change to the documentation may be a good resolution to this.
Also, I know how much momentum drives things like "what's the right default value for a field", but I think that the default should be that a job runs on either node type.