Joel issues

Results: 11 issues of Joel

add perf and benchmark scripts

documentation

enhancement

[Core] Fix incorrect gpu ids if placement group bundle index specified

## Why are these changes needed? This PR fixes the incorrect `CUDA_VISIBLE_DEVICES` value when `placement_group_bundle_index` is specified. ## Related issue number Closes https://github.com/ray-project/ray/issues/29811 ## Checks - [x] I've signed off every commit(by...

bug

triage

@external-author-action-required

core

[QUESTION] For DDP, why map parameter's main_grad to grad buffer instead of grad?

@deepakn94 Hi, I'm diving deep into Megatron-LM's implementation. For the DDP wrapper, the current implementation maps each parameter's `main_grad` to the grad buffer. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/grad_buffer.py#L272-L280 And then in the backward hook, add `grad` to...

stale

[core] Remove grpc ClientCallTag

## Why are these changes needed? `ClientCallTag` is no longer needed after `ClientCallManager::CreateCall`, so we can remove it and use `ClientCall` as the gRPC `CompletionQueue` tag. By removing ClientCallTag, we may...

performance

core

[feat]: support dataloader resume by skip_first_batches

What does this PR do? --- This PR resumes the dataloader by skipping the batches that were already consumed in the last training epoch. For large datasets, the training time for one epoch...
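The idea of resuming by skipping consumed batches can be sketched in a few lines. This is only an illustration, not the PR's code: the helper name `skip_consumed_batches` and the toy `epoch` list are hypothetical, and the sketch assumes the batch order is identical before and after the restart (for example, the same shuffle seed).

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_consumed_batches(batches: Iterable[T], num_consumed: int) -> Iterator[T]:
    """Return an iterator that drops the first `num_consumed` batches.

    Hypothetical helper for illustration; assumes `batches` yields the
    same order it did before the interruption.
    """
    if num_consumed < 0:
        raise ValueError("num_consumed must be non-negative")
    # islice still pulls the skipped items from the underlying iterable,
    # it just does not hand them to the caller.
    return islice(batches, num_consumed, None)


# Toy epoch of 5 batches, interrupted after 3 of them were consumed.
epoch = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
resumed = list(skip_consumed_batches(epoch, 3))
```

Note that skipping this way still iterates over the dropped batches, so with an expensive data pipeline the resume cost grows with the number of batches skipped.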

feat: enable ZeRO-3 sharding for TextEncoder and EMA model

For OpenSoraT2V_v1_3-2B/122 model - VAE: no sharding - DiT: ZeRO-2 - EMA: ZeRO-3 with FP32 - T5: ZeRO-3 ``` 10/22/2024 16:33:05 - INFO - __main__ - Load VAE model finish,...

[BUG] ZeRO++ sharding small parameter raise IndexError

**Describe the bug** Our model has a small parameter with shape `torch.Size([32])`. When ZeRO++ is enabled, it raises the following error: - world_size: 2048 - zero_hpz_partition_size: 16 ``` File "/usr/local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344,...

bug

training
