Question about the fault tolerance threshold (f) and ZeRO-3
Hi @insujang,
Thank you for open-sourcing Oobleck—it’s an impressive piece of work!
I noticed that the paper describes a parameter f that controls the fault tolerance threshold, but I couldn't find it in the codebase. Is there a way to configure this parameter, and is there a default value for f?
Another question: I noticed that ZeRO-3 is being used. In that case, each GPU holds a unique shard of the model parameters and optimizer states (as in standard ZeRO-3), so if one node fails, its ZeRO-3 shard would be lost with it. How can this be recovered? If my understanding is incorrect, please feel free to point it out.
Looking forward to your response. Thanks again for your contributions!
Hi @lhy101! Thank you for your interest in Oobleck.
Re: fault tolerance threshold, please refer to: https://github.com/SymbioticLab/Oobleck/blob/9d4e3b1bb38a0c1ac2f4b56150727f4604a42dcb/oobleck/engine/plugin.py#L47
Re: ZeRO-3: first of all, ZeRO-3 is no longer used after refactoring; traditional 3D parallelism (DP + TP + PP) is used instead, where DP provides redundancy. Second, when ZeRO-3 was used, it was not a replacement for DP but for TP. So ZeRO-3 + PP + DP was used, and the outermost DP provided the redundancy.
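To illustrate the redundancy argument with a minimal sketch (this is not Oobleck's actual rank mapping; I am assuming a simple rank = dp * (PP * TP) + pp * TP + tp layout with DP outermost): every (pp, tp) model shard exists once per DP replica, so losing a node in one replica leaves identical copies in the others.

```python
# Minimal sketch of why outermost DP gives redundancy in a DP x PP x TP layout.
# Assumed (hypothetical) mapping: rank = dp * (PP * TP) + pp * TP + tp.
DP, PP, TP = 2, 2, 4  # 16 GPUs total

def coords(rank):
    dp, rem = divmod(rank, PP * TP)
    pp, tp = divmod(rem, TP)
    return dp, pp, tp

def replicas(rank):
    """Ranks in the other DP replicas that hold the same (pp, tp) shard."""
    _, pp, tp = coords(rank)
    return [d * (PP * TP) + pp * TP + tp for d in range(DP)
            if d * (PP * TP) + pp * TP + tp != rank]

# If rank 5 (dp=0, pp=1, tp=1) is lost, rank 13 (dp=1, pp=1, tp=1)
# still holds an identical copy of that shard.
print(coords(5), replicas(5))  # (0, 1, 1) [13]
```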
Thank you for your explanation, I understand now! I also have a practical question and hope you can take some time to help me with it. My experimental setup consists of 4 machines, each equipped with 8 A800 GPUs (80GB), and the model size is 32B. I am using a configuration with tp=4, so my hostfile looks like this:
30.207.99.20 slots=4 devices=0,1,2,3 port=22
30.207.99.20 slots=4 devices=4,5,6,7 port=22
30.207.99.21 slots=4 devices=0,1,2,3 port=22
30.207.99.21 slots=4 devices=4,5,6,7 port=22
30.207.99.22 slots=4 devices=0,1,2,3 port=22
30.207.99.22 slots=4 devices=4,5,6,7 port=22
30.207.99.23 slots=4 devices=0,1,2,3 port=22
30.207.99.23 slots=4 devices=4,5,6,7 port=22
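Each line describes one agent owning four of a machine's eight GPUs. For reference, here is how I read the fields (a minimal sketch assuming whitespace-separated host plus key=value pairs; this is my interpretation, not Oobleck's parser):

```python
# Hypothetical parser for the hostfile entries above (my reading of the
# format, not Oobleck's actual implementation).
def parse_hostfile_line(line: str) -> dict:
    host, *fields = line.split()
    spec = dict(f.split("=", 1) for f in fields)
    return {
        "host": host,
        "slots": int(spec["slots"]),  # GPUs managed by this agent
        "devices": [int(d) for d in spec["devices"].split(",")],
        "port": int(spec["port"]),    # SSH port
    }

print(parse_hostfile_line("30.207.99.20 slots=4 devices=0,1,2,3 port=22"))
# {'host': '30.207.99.20', 'slots': 4, 'devices': [0, 1, 2, 3], 'port': 22}
```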
With this setup, the training runs successfully, and the generated pipeline templates are as follows:
2024-12-24 11:27:33.385 | DEBUG | oobleck.engine.execution_engine:prepare:151 - Pipeline templates: {2: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages), 3: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages), 4: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages), 5: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages), 6: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages), 7: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 7 stages), 8: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 8 stages)}
Based on my understanding of the log, all supported configurations are as follows:
2024-12-24 11:27:33.385 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:94 - Enumerating all feasible sets of pipeline templates for 8 nodes.
2024-12-24 11:27:33.386 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:121 - Dynamic programming result: [defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 4}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 2}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 2, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 1}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 2}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages): 1}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages): 1}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 8 stages): 1})]
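To sanity-check my reading of the log, I reproduced the enumeration with a small standalone sketch (not Oobleck's implementation; I assume one node per stage and templates for 2-8 stages), and it yields the same seven sets:

```python
# Standalone sketch of the enumeration shown in the log above: every multiset
# of pipeline sizes drawn from the available templates (2..8 stages, one node
# per stage assumed) that sums to the total node count.
def enumerate_sets(num_nodes: int, sizes=range(2, 9)) -> list[dict[int, int]]:
    results = []

    def rec(remaining: int, min_size: int, current: dict[int, int]):
        if remaining == 0:
            results.append(dict(current))
            return
        for s in sizes:
            if s < min_size or s > remaining:
                continue  # keep sizes non-decreasing to avoid duplicate sets
            current[s] = current.get(s, 0) + 1
            rec(remaining - s, s, current)
            current[s] -= 1
            if current[s] == 0:
                del current[s]

    rec(num_nodes, min(sizes), {})
    return results

print(enumerate_sets(8))
# [{2: 4}, {2: 2, 4: 1}, {2: 1, 3: 2}, {2: 1, 6: 1}, {3: 1, 5: 1}, {4: 2}, {8: 1}]
```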
Next, I shut down 30.207.99.23 (i.e., the last two tp=4 groups). In theory, if the default f is 3, reconfiguration should still work. However, I encountered the following error:
File "/jizhicfs/hymiezhao/lhy/Oobleck/examples/run_gpt2.py", line 152, in main model, optimizer, dataloader = engine.reconfigure( File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/execution_engine.py", line 289, in reconfigure model, optimizer, dataloader, _ = self.plugin.reconfigure( File "/jizhicfs/hymiezhao/miniconda3/envs/Oobleck_new/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 228, in reconfigure new_pipelines, new_num_microbatches = self._instantiate_pipelines( File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 125, in _instantiate_pipelines pipelines = [ File "/jizhicfs/hymiezhao/lhy/Oobleck/oobleck/engine/plugin.py", line 126, in <listcomp> pipeline_templates[num_stages] KeyError: 1
So it seems the system is trying to find a pipeline template with only 1 stage, but no such template was generated (perhaps because a single node cannot hold the 32B model?). Additionally, I tried starting the training with only the remaining nodes from the beginning:
30.207.99.20 slots=4 devices=0,1,2,3 port=22
30.207.99.20 slots=4 devices=4,5,6,7 port=22
30.207.99.21 slots=4 devices=0,1,2,3 port=22
30.207.99.21 slots=4 devices=4,5,6,7 port=22
30.207.99.22 slots=4 devices=0,1,2,3 port=22
30.207.99.22 slots=4 devices=4,5,6,7 port=22
2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.execution_engine:prepare:151 - Pipeline templates: {2: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages), 3: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages), 4: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages), 5: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 5 stages), 6: PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages)}
2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:94 - Enumerating all feasible sets of pipeline templates for 6 nodes.
2024-12-24 12:41:39.444 | DEBUG | oobleck.engine.pipeline_instantiator:_enumerate_instantiation_options:121 - Dynamic programming result: [defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 3}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 3 stages): 2}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 2 stages): 1, PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 4 stages): 1}), defaultdict(<class 'int'>, {PipelineTemplate(transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel, 6 stages): 1})]
This worked fine. However, transitioning from the first scenario to the second scenario via reconfiguration seems to cause problems. I would greatly appreciate it if you could provide further clarification or guidance on this issue at your convenience. Thank you once again for your assistance!
The paper and our early version of the code included node borrowing and pipeline merging; in this case, a pipeline with 1 node would have been merged with another to form a 3-node pipeline. The feature was removed during refactoring due to incompatibility with the new framework structure, and the related issue #23 is still open. Sorry for the inconvenience, but for now the feature is not provided. Reconfiguration should still work if the new pipeline configuration is in the initial set of pipeline templates, as in your second experiment.
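For intuition, the merge idea was roughly the following (a hypothetical sketch of the removed behavior, not the original implementation):

```python
# Hypothetical sketch of pipeline merging: pipelines smaller than any
# available template are folded into other pipelines until every size
# has a matching template.
def merge_pipelines(sizes: list[int], template_sizes: set[int]) -> list[int]:
    sizes = sorted(sizes)
    while sizes and sizes[0] not in template_sizes:
        smallest = sizes.pop(0)
        if not sizes:
            raise RuntimeError("no pipeline left to merge into")
        sizes[0] += smallest  # fold the undersized pipeline into the next one
        sizes.sort()
    return sizes

# A 1-node leftover merges with a 2-node pipeline into a 3-node pipeline:
print(merge_pipelines([1, 2, 3], template_sizes={2, 3, 4, 5, 6}))  # [3, 3]
```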
Thanks for clearing that up. I really appreciate your help!