Sourab Mangrulkar issues

Results: 14 issues of Sourab Mangrulkar

[BUG] DeBERTa has bad performance when using ZERO Stage-3 with continuous warnings "A module has unknown inputs or outputs type"

**Describe the bug** DeBERTa has bad performance when using ZERO Stage-3. stdout shows continuous warnings ```bash [stage3.py:104:_apply_to_tensors_only] A module has unknown inputs or outputs type () and the tensors...

bug

[WIP] DeepSpeed launcher related changes

### What does this PR do? 1. Removing 1 sub-process call for DeepSpeed for `Single Node Multi-GPU setup` and `Multi Node Multi-GPU setup using Standard launcher`. As discussed offline, the...

[BUG] Can't load checkpoint without having shared filesystem in multi-node training when multi-node setup config remains same

**Describe the bug** 1. Background: Use DeepSpeed (with ZeRO-1) for multi-node training and save optimizer states to resume training. 2. `save_checkpoint` only saves the partitioned optimizer state on each machine. If we...

bug

issue with loading pretrained model using DeepSpeed Zero Stage 3

### System Info ```shell - `transformers` version: 4.19.0.dev0 - Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.29 - Python version: 3.8.10 - Huggingface_hub version: 0.5.1 - PyTorch version (GPU?): 1.12.0.dev20220505+cu113 (True) - Tensorflow version (GPU?): not...

DeepSpeed

bug

ipex intel extension for pytorch integration

### What does this PR do? 1. Implements feature request #700

example on converting PEFT+INT8 trained model to ONNX for faster inference

### What does this PR do? 1. Adds an example on converting a PEFT+INT8 trained model to ONNX for faster inference. The example uses the Whisper-large-V2 model.

fix issues to be compatible with latest peft

### What does this PR do? 1. Fixes the issues https://github.com/huggingface/peft/issues/286 and https://github.com/huggingface/peft/issues/317 2. Adds Callback to be used with HF Trainer to make sure intermediate checkpoints are saving only...

Smangrul/fix ckpt save ds fsdp

# What does this PR do? 1. `_save` function saves `tokenizer` and `training_args.bin` in addition to model. 2. This PR rearranges logic for saving model for DS and FSDP such...

[BUG] DeepSpeed ZeRO++ features aren't working

**Describe the bug** DeepSpeed ZeRO++ features aren't working: 1. On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` leads to a forward pass error with `BF16`. Exact issue reported in...

bug

training

[BUG] `zero_quantized_nontrainable_weights=True` when using PEFT+DeepSpeed with Mixed-Precision training using BF16 leads to `float != c10::BFloat16` error

**Describe the bug** `zero_quantized_nontrainable_weights=True` when using PEFT+DeepSpeed with Mixed-Precision training using BF16 leads to `float != c10::BFloat16` error **To Reproduce** Steps to reproduce the behavior: 1. DeepSpeed Config: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/configs/ds_config_z3_lora.json 2....

bug

training