How to add nodes during training?
Hi @insujang ,
Thanks for open-sourcing Oobleck, great work!
From the paper, it seems that the experiments show it supports both adding and removing nodes during training.
I successfully ran Oobleck with node failures (removing nodes), but I couldn't find a way to add nodes dynamically during training. Could you let me know how to make it work?
Thank you!
Lam
Hi @laochonlam ,
All experiments in the paper were done with a Bamboo simulator, by measuring the throughput and reconfiguration overhead of every configuration and combining them. The current code does not include an implementation for adding nodes. This is future work; I think simply running reconfiguration would be enough, but I still need to try it.
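To illustrate what "measuring throughput and overheads in every configuration and combining them" could look like, here is a minimal sketch. It is not the Bamboo simulator or Oobleck code. The names `Segment` and `simulate`, the trace format, and all numbers are hypothetical, and it assumes a reconfiguration simply costs a fixed downtime at the start of each segment where the node count changes (either removal or addition).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of the availability trace (hypothetical format)."""
    num_nodes: int     # nodes available during this segment
    duration_s: float  # length of the segment in seconds


def simulate(trace, throughput, reconfig_overhead_s):
    """Estimate the total samples processed over an availability trace.

    throughput: maps node count -> measured samples/s for that configuration.
    reconfig_overhead_s: maps (old_nodes, new_nodes) -> measured downtime in
        seconds when switching between the two configurations.
    """
    total = 0.0
    prev_nodes = None
    for seg in trace:
        usable = seg.duration_s
        # Charge the reconfiguration downtime whenever the node count changes.
        if prev_nodes is not None and prev_nodes != seg.num_nodes:
            overhead = reconfig_overhead_s[(prev_nodes, seg.num_nodes)]
            usable = max(0.0, usable - overhead)
        total += throughput[seg.num_nodes] * usable
        prev_nodes = seg.num_nodes
    return total


# Made-up measurements: one node is lost and later added back.
trace = [Segment(4, 100.0), Segment(3, 50.0), Segment(4, 100.0)]
throughput = {3: 30.0, 4: 40.0}
overhead = {(4, 3): 10.0, (3, 4): 20.0}

total_samples = simulate(trace, throughput, overhead)
print(total_samples)
```

In a sketch like this, node addition only needs a measured overhead entry for the transition (here `(3, 4)`); nothing in it reflects how the real runtime would perform the reconfiguration.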
Got it, I'll give that a try. Thank you for your prompt response!
Lam
Let me leave this open so that I can work on it later :) You are also welcome to make a PR that adds node addition.