Request for a multi-machine setup tutorial
Is your feature request related to a problem? Please describe. I am currently working on a research project based on TFF that uses ResNet for image classification. Due to the model's size and other customizations, training requires a large amount of GPU memory. Specifically, one type of task cannot be trained on a single HPC node with 4 A100 GPUs (160 GB of memory in total).
I have tried the different strategies described in the tutorial, but unfortunately the OOM issue still occurs.
Describe the solution you'd like I hope you can publish a tutorial for multi-machine deployment as soon as possible.
Describe alternatives you've considered Alternatively, could I follow tf.distribute.Strategy to implement a multi-machine setup for the TFF framework myself? Many thanks.
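For reference, in plain TensorFlow a multi-worker tf.distribute setup (e.g. MultiWorkerMirroredStrategy) is configured through the TF_CONFIG environment variable on each machine. A minimal sketch of that cluster spec, with placeholder hostnames and ports (not taken from any TFF tutorial):

```
{
  "cluster": {
    "worker": ["host1.example.com:12345", "host2.example.com:12345"]
  },
  "task": {"type": "worker", "index": 0}
}
```

Each worker sets the same "cluster" section but its own "task" index. I am unsure how (or whether) this maps onto TFF's execution stack, which is part of why a dedicated tutorial would help.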