upwindflys

Results 7 issues of upwindflys

**Environment:** 1. Framework: PyTorch 2. Framework version: 1.10.2 3. Horovod version: 0.24.2 4. MPI version: 5. CUDA version: 10.2 6. NCCL version: 2.8.4 7. Python version: 3.6 8. Spark / PySpark version: 9. Ray version: 10....

wontfix

### 🐛 Describe the bug I am running the OPT model, but I have no idea how to save the checkpoint. I tried the code ```python model = GeminiDDP(model, device=get_current_device(), placement_policy=PLACEMENT_POLICY, pin_memory=True)...

bug

**Describe the bug** Hello, I'm a novice with DeepSpeed. I used ds_config.json but got the output ``` 'DeepSpeedZeRoOffload' object has no attribute 'backward'``` The file is as follows; can anyone give some...

bug
training

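A question: for full-parameter training of the 33B and 65B models on 32 A100 GPUs, how large is the batch_size, and roughly how long does it take to train on 1 billion tokens?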

**Describe the bug** Hi, I ran DeepSpeedExamples/training/pipeline_parallelism. Running the code on 1 V100 without pipeline parallelism requires approximately 2739M of GPU memory. But running the code on 2 V100s using PipelineModule,...

bug
training

### 🐛 Describe the bug I tried to run the command at this link, https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/opt, but errors occurred. ```python Traceback (most recent call last): File "run_clm.py", line 44, in <module> from...