Pravin
Same error on GCP, PyTorch version: 1.13.1
@balintdecsi I get the same issue on GCP. Have you found a workaround or a fix? I've tried different PyTorch environments as well.
@entrpn I only have a TPU quota of 8, so the training fails after 4-5 minutes. I have requested a quota increase to 30, which will take a while. So in...
@entrpn The accelerator count was set to 8 by default, and my account's TPU quota was limited to 8. I tried to change the count to 6 through...
@entrpn I've successfully launched a training job with an A100 by changing the configuration as suggested above, but there was almost no activity in the console or logs; it took almost 25 mins...
@entrpn I see, I somehow missed that detail too. Thank you for pointing that out. I also believe this [line](https://github.com/entrpn/serving-model-cards/blob/cd3cd107c435ef0fa47f352f104e788265842f0e/training-dreambooth/Dockerfile#L3) needs to change. I'm not sure what to change it to...
@entrpn I've followed the instructions above, but the training wouldn't start at all. Please refer to the screenshots below. I've also attached the Dockerfile I used to build, and config to...
Thank you for the suggestion. Following the code, I believe a REST endpoint is being deployed using FastAPI through Docker. But our use case actually involves creating a pure...