Oobleck
A resilient distributed training framework
When handling failures, if a pipeline does not have enough nodes, Oobleck is supposed to borrow nodes from other pipelines or merge pipelines. The previous implementation had a prototype implementation,...
Hi @insujang, thanks for open-sourcing Oobleck, great work! From the paper, it seems that the experiments show it supports both adding and removing nodes during training. I successfully ran...
Hi @insujang, Thank you for open-sourcing Oobleck—it’s an impressive piece of work! I noticed in the paper that there is a parameter f that controls the fault tolerance threshold. However,...