Request for a multi-machine setup tutorial
Is your feature request related to a problem? Please describe. I am currently working on a research project based on TFF that uses ResNet for image classification. Due to the model's size and other customizations, training requires a large amount of GPU memory. Specifically, one type of task cannot be trained on a single HPC node with 4 A100 GPUs (160 GB of memory in total).
I have tried the different strategies described in the tutorial, but unfortunately the OOM issue still occurs.
Describe the solution you'd like I hope you can publish a tutorial for multi-machine deployment as soon as possible.
Describe alternatives you've considered Alternatively, could I follow tf.distribute.Strategy to implement a multi-machine setup for the TFF framework myself? Many thanks.
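For reference, in plain TensorFlow a multi-worker tf.distribute setup (e.g. MultiWorkerMirroredStrategy) is configured through the TF_CONFIG environment variable on each machine. A minimal sketch of that cluster spec, with placeholder hostnames and ports (not taken from any TFF tutorial):

```
{
  "cluster": {
    "worker": ["host1.example.com:12345", "host2.example.com:12345"]
  },
  "task": {"type": "worker", "index": 0}
}
```

Each worker sets the same "cluster" section but its own "task" index. I am unsure how (or whether) this maps onto TFF's execution stack, which is part of why a dedicated tutorial would help.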