
Cluster timeouts for large images

Open mdagost opened this issue 2 years ago • 0 comments

@ygong1 tagging you because you seem to have done all the CUDA container work. I'm having trouble starting Databricks clusters based on these images because they are so big: the clusters seem to hit timeout errors while downloading the images.

The base cuda-pytorch image appears to be 7 GB on its own, and adding even a few things on top pushes it to about 7.8 GB. At that size, the same image on ECR sometimes starts a cluster successfully and sometimes fails with an error like:

Internal error message: Failed to launch spark container on instance i-xxxx. Exception: Container setup has timed out
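For a rough sense of scale, here is a back-of-the-envelope sketch of the sustained download rate a node would need to pull images of these sizes within a fixed time budget. The 600-second budget is purely a placeholder assumption (the actual container setup timeout is not known here), the function name is made up for illustration, and the calculation ignores layer extraction time and any difference between compressed and on-disk size.

```python
def required_throughput_mb_s(image_size_gb: float, timeout_s: float) -> float:
    """Minimum sustained download rate (MB/s) needed to pull an image
    of image_size_gb (decimal GB) within timeout_s seconds."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    return image_size_gb * 1000 / timeout_s


# ASSUMPTION: hypothetical time budget; the real timeout is not documented here.
ASSUMED_TIMEOUT_S = 600

# Image sizes taken from the report above: base image and base plus extras.
for size_gb in (7.0, 7.8):
    rate = required_throughput_mb_s(size_gb, ASSUMED_TIMEOUT_S)
    print(f"{size_gb} GB in {ASSUMED_TIMEOUT_S} s needs about {rate:.1f} MB/s")
```

If the real budget is close to the time a typical pull takes, small variations in registry or network throughput could explain why the same image sometimes succeeds and sometimes times out, but that is speculation until the actual timeout is known.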

@ygong1 are you all seeing anything like this internally? The timeout for the container download doesn't appear to be adjustable by the user.

mdagost • Oct 31 '23 15:10