Joel
## Why are these changes needed? This PR fixes the incorrect `CUDA_VISIBLE_DEVICES` value when `placement_group_bundle_index` is specified. ## Related issue number Closes https://github.com/ray-project/ray/issues/29811 ## Checks - [x] I've signed off every commit (by...
@deepakn94 Hi, I'm diving deep into Megatron-LM's implementation. For the DDP wrapper, the current implementation maps each parameter's `main_grad` to the grad buffer. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/grad_buffer.py#L272-L280 And then, in the backward hook, add `grad` to...
## Why are these changes needed? `ClientCallTag` is no longer needed after `ClientCallManager::CreateCall`, so we can remove it and use `ClientCall` as the gRPC `CompletionQueue` tag. By removing `ClientCallTag`, we may...
What does this PR do? --- This PR resumes the dataloader by skipping batches that were already consumed in the last training epoch. For large datasets, the training time for one epoch...
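The skip-on-resume idea described above can be illustrated with a minimal sketch. This is not the PR's actual implementation: `resume_batches`, the `consumed` counter, and the toy `epoch` list are hypothetical names introduced only for illustration, and the sketch assumes the batch order is deterministic across restarts.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def resume_batches(batches: Iterable[T], consumed: int) -> Iterator[T]:
    """Return an iterator over `batches` that skips the first `consumed` items.

    `consumed` is a hypothetical counter that would be restored from a checkpoint.
    """
    if consumed < 0:
        raise ValueError("consumed must be non-negative")
    # islice advances past the first `consumed` batches, then yields the rest.
    return islice(iter(batches), consumed, None)


# Hypothetical epoch of 5 batches, interrupted after 2 batches were consumed.
epoch = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
remaining = list(resume_batches(epoch, consumed=2))
```

Skipping at the iterator level still asks the underlying iterable to produce the skipped batches, so for an expensive data pipeline it may be preferable to skip at the sampler or index level instead.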
For OpenSoraT2V_v1_3-2B/122 model - VAE: no sharding - DiT: ZeRO-2 - EMA: ZeRO-3 with FP32 - T5: ZeRO-3 ``` 10/22/2024 16:33:05 - INFO - __main__ - Load VAE model finish,...
**Describe the bug** Our model has a small parameter with shape `torch.Size([32])`. When ZeRO++ is enabled, it raises the following error: - world_size: 2048 - zero_hpz_partition_size: 16 ``` File "/usr/local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344,...
### What does this PR do? Change rollout to server mode by default; spmd mode will be removed in v0.6.2.
### What does this PR do? As title
### What does this PR do? Refactor RL workers with new model engine