Cluster timeouts for large images
@ygong1 tagging you because you seem to have done all the CUDA container work. I'm running into issues starting Databricks clusters based on these images because they are so large: the clusters seem to hit timeout errors while downloading the images.
The base cuda-pytorch image appears to be about 7 GB on its own, and adding even a few packages on top pushes it to around 7.8 GB. At that size, the same image hosted on ECR sometimes starts a cluster successfully and sometimes fails with an error like
Internal error message: Failed to launch spark container on instance i-xxxx. Exception: Container setup has timed out
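For scale, here is a rough back-of-envelope sketch of pull time versus effective download bandwidth. The 7.8 GB figure is from the report above; the bandwidth values and the 5-minute timeout are assumptions for illustration only, since the actual Databricks container-setup timeout is not documented:

```python
# Back-of-envelope: container pull time at various effective bandwidths.
# IMAGE_GB comes from the report above; TIMEOUT_S and the bandwidths
# are hypothetical values chosen for illustration.
IMAGE_GB = 7.8
TIMEOUT_S = 300  # assumed 5-minute setup timeout, not a documented value

for mb_per_s in (200, 100, 50, 25):
    seconds = IMAGE_GB * 1000 / mb_per_s
    verdict = "OK" if seconds < TIMEOUT_S else "may time out"
    print(f"{mb_per_s:>3} MB/s -> {seconds:6.0f} s  ({verdict})")
```

Even under these rough assumptions, a node pulling at 25 MB/s would need over 300 seconds for a 7.8 GB image, which would be consistent with the intermittent failures: instances with fast paths to ECR succeed, slower ones time out.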
@ygong1 are you all seeing anything like this internally? The container-download timeout doesn't appear to be adjustable by the user.