
VMClusters can hang if something fails during scheduler provisioning

Open jacobtomlinson opened this issue 4 years ago • 10 comments

When launching clusters that extend VMCluster, there is a failure mode where the cluster manager can hang on creation, perhaps indefinitely. See #275.

This happens because we submit the Dask scheduler to each cloud provider as a VM with a preconfigured startup script using cloud-init. You can view this script by setting the debug=True flag as a kwarg on the cluster manager; the init script will be printed when the VM is submitted.
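For example, a minimal way to see the script is to pass the flag when constructing the cluster manager (this sketch assumes AWS and EC2Cluster, but any VMCluster subclass accepts the same kwarg):

```python
from dask_cloudprovider.aws import EC2Cluster

# With debug=True the cluster manager prints the rendered cloud-init
# script at the point the scheduler VM is submitted, which is the easiest
# way to inspect exactly what will run on the instance.
cluster = EC2Cluster(n_workers=2, debug=True)
```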

One benefit of launching our work via cloud-init is that we don't actually need to establish any inbound connection such as SSH to the VM in order to provision it. This helps keep the attack surface small and means we do not need to manage secrets in Dask Cloudprovider such as SSH keys.

The VM runs the cloud-init script on startup, which starts Dask. We wait for this to come up, then connect to the Dask comm on port 8786 (or whatever port we assign) and continue working from there. One drawback is that we submit our VM in a fire-and-forget way and then have to wait until the Dask comm port becomes available. If Dask fails to start we will end up waiting forever.

This isn't a pleasant experience for users, as it is unclear whether the VM is just taking a long time to start or has failed. VMs can take a while to start because of capacity constraints on the cloud provider, or because the user selected a large Docker image that takes a long time to download and decompress.

Things we should do to improve this today:

  • Add a timeout with automatic cleanup of the VM. We should avoid leaving orphaned resources when a failure occurs, and raise an exception telling the user that the scheduler timed out while starting up.
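A rough sketch of what that could look like on the cluster-manager side, assuming an async helper that polls the scheduler's comm port and a provider-specific cleanup coroutine (the names here are illustrative, not the actual VMCluster internals):

```python
import asyncio
from typing import Awaitable, Callable

async def wait_for_scheduler(
    address: str,
    cleanup: Callable[[], Awaitable[None]],
    timeout: float = 600,
) -> None:
    """Wait for the Dask comm port to accept connections, tearing the VM down on timeout.

    `address` is e.g. "tcp://203.0.113.10:8786" and `cleanup` stands in for
    whatever provider-specific call deletes the scheduler instance.
    """
    host, port = address.rsplit("://", 1)[-1].rsplit(":", 1)

    async def _poll() -> None:
        while True:
            try:
                _, writer = await asyncio.open_connection(host, int(port))
                writer.close()
                return
            except OSError:
                # VM still booting, image still pulling, or Dask not up yet
                await asyncio.sleep(5)

    try:
        await asyncio.wait_for(_poll(), timeout=timeout)
    except asyncio.TimeoutError:
        await cleanup()  # avoid leaving an orphaned VM behind
        raise RuntimeError(
            f"Scheduler at {address} did not start within {timeout}s; the VM has been cleaned up"
        )
```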

Design changes we could make to improve the debugging workflow:

  1. Instead of provisioning via cloud-init, configure SSH and provision Dask over SSH, allowing failure information to be fed back to the user more easily.
  2. Keep provisioning with cloud-init but make more of an effort to enable SSH to allow for interactive debugging.
  3. Run a log-exporting service on each VM to pass the contents of /var/log/cloud-init-output.log back to the cluster manager and on to the user. Typically, debugging the hanging problem today just requires seeing what is in this log file, so passing it back by default could help.

I have concerns about trying to use SSH. Many orgs will have policies around SSH in terms of how the port is exposed, how keys are managed, whether you must use a bastion, etc. So while I think it is a good idea to enable SSH where possible I'm hesitant to depend on it for provisioning.

Exporting logs may have some drawbacks too. The simplest implementation of this would be a web server which serves the log files we want to read, but we would need to expose an additional port for this and ensure it is protected in some way.
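To make the trade-off concrete, here is a minimal sketch of such a log-exporting service. It is purely hypothetical (not part of Dask Cloudprovider), the port number is an arbitrary assumption, and it illustrates exactly why an extra exposed port would need protecting:

```python
# Hypothetical log-exporting service the cloud-init script could start on
# each VM. It serves /var/log/cloud-init-output.log over HTTP so the
# cluster manager can fetch it when the scheduler fails to come up.
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "/var/log/cloud-init-output.log"

class LogHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            with open(LOG_PATH, "rb") as f:
                body = f.read()
            self.send_response(200)
        except OSError:
            body = b"log not available yet"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Port 9797 is an arbitrary choice for this sketch; it would need to be
    # opened in the VM's security group/firewall and protected in some way.
    HTTPServer(("0.0.0.0", 9797), LogHandler).serve_forever()
```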

My current leaning is to try and expose SSH as an optional feature but go with option 3 to resolve this issue. But feedback and discussion would be welcome.

jacobtomlinson avatar Mar 30 '21 09:03 jacobtomlinson

cc @quasiben it would be good to get your input here

jacobtomlinson avatar Mar 30 '21 09:03 jacobtomlinson

@jacobtomlinson I agree with the security concerns around exposing ports by default, but leaving it optional would let people use SSH access when developing and/or testing while keeping it shut off in production. Option 3 sounds very reasonable.

manuelreyesgomez avatar Mar 31 '21 20:03 manuelreyesgomez

@jacobtomlinson Any update on this? We are running into cluster creation issues due to the hang, and we're not sure why creation fails since we do not have access to the logs.

Nanthini10 avatar Apr 20 '21 15:04 Nanthini10

No update here I'm afraid. Which cluster manager are you using?

jacobtomlinson avatar Apr 20 '21 16:04 jacobtomlinson

AzureVMCluster with a RAPIDS 0.19 image. It worked before, but I think something in the newer version is causing it to fail; I'm unsure what that is.

Do you have a timeline on when we can expect a fix/update for this?

Nanthini10 avatar Apr 20 '21 16:04 Nanthini10

@jacobtomlinson This is breaking functionality we are pushing to Azure enterprise users.

manuelreyesgomez avatar Apr 20 '21 17:04 manuelreyesgomez

@jacobtomlinson Could the ability to request an increase to the VM disk size be added to the constructor?

manuelreyesgomez avatar Apr 20 '21 19:04 manuelreyesgomez

Let's take this discussion offline rather than cluttering this issue with it. This issue is about long-term fixes for not getting feedback from failed clusters; I'll sync with you to figure out why your clusters are hanging and how to debug it.

jacobtomlinson avatar Apr 21 '21 16:04 jacobtomlinson

Is there a guide for debugging this issue? I'm running into the same problem with EC2Cluster on AWS. I haven't done anything except configure the awscli and specify the number of workers, but the cluster is still hanging at the "Waiting for scheduler to run" stage.

I'm new to AWS and I've tried following the EC2 documentation, but it is rather sparse: https://cloudprovider.dask.org/en/latest/aws.html

bEpeGVUgPF avatar May 04 '21 17:05 bEpeGVUgPF

@bEpeGVUgPF your best option today is to SSH to the VM (make sure you set up the SSH config when you create the cluster) and then check the /var/log/cloud-init-output.log file.
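For EC2Cluster that looks roughly like the sketch below. It assumes key_name refers to an existing EC2 key pair; check the docs of your cluster manager for the equivalent SSH option.

```python
from dask_cloudprovider.aws import EC2Cluster

# key_name is assumed to name an existing EC2 key pair so you can SSH to
# the scheduler VM if it hangs; other providers expose similar options.
cluster = EC2Cluster(n_workers=2, key_name="my-keypair", debug=True)

# Once on the VM (e.g. `ssh ubuntu@<scheduler-public-ip>`), inspect
# /var/log/cloud-init-output.log to see where the startup script failed.
```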

jacobtomlinson avatar May 06 '21 13:05 jacobtomlinson