Joel

Results 11 issues of Joel

documentation
enhancement
P0

## Why are these changes needed? This PR fixes incorrect `CUDA_VISIBLE_DEVICES` when `placement_group_bundle_index` is specified. ## Related issue number Closes https://github.com/ray-project/ray/issues/29811 ## Checks - [x] I've signed off every commit(by...

bug
triage
@external-author-action-required
core

@deepakn94 Hi, I'm diving deep into Megatron-LM's implementation. For the DDP wrapper, the current implementation maps each parameter's `main_grad` to the grad buffer. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/grad_buffer.py#L272-L280 And then in the backward hook, add `grad` to...

stale

## Why are these changes needed? `ClientCallTag` is no longer needed after `ClientCallManager::CreateCall`, so we can remove it and use `ClientCall` as the gRPC CompletionQueue tag. By removing `ClientCallTag`, we may...

P2
performance
core
go

What does this PR do? --- This PR resumes the dataloader by skipping batches that were already consumed in the last training epoch. For large datasets, the training time for one epoch...
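The general idea of resuming by skipping consumed batches can be sketched as below. This is only an illustration under stated assumptions, not the code from the PR: `resume_batches` and `consumed` are hypothetical names, and the batches are plain Python lists standing in for a real dataloader.

```python
from itertools import islice


def resume_batches(batches, consumed):
    """Return an iterator over `batches` that skips the first `consumed` items.

    Hypothetical helper for illustration; `consumed` is assumed to be the
    number of batches already processed before the checkpoint was saved.
    """
    return islice(batches, consumed, None)


# Toy "epoch" of four batches; assume two were consumed before the restart.
batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
remaining = list(resume_batches(batches, consumed=2))
```

Skipping this way still iterates over the consumed batches internally, so on a real dataloader the skipped data would still be loaded unless the skip happens at the sampler level.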

For OpenSoraT2V_v1_3-2B/122 model - VAE: no sharding - DiT: ZeRO-2 - EMA: ZeRO-3 with FP32 - T5: ZeRO-3 ``` 10/22/2024 16:33:05 - INFO - __main__ - Load VAE model finish,...

**Describe the bug** Our model has a small parameter with shape `torch.Size([32])`. When ZeRO++ is enabled, it raises the following error: - world_size: 2048 - zero_hpz_partition_size: 16 ``` File "/usr/local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344,...

bug
training

### What does this PR do? Change rollout to server mode by default; SPMD mode will be removed in v0.6.2.

### What does this PR do? As title

### What does this PR do? Refactor RL workers with the new model engine