Karthik Suresh
@sgugger Just to make sure my understanding is correct, can we use `deepspeed` support with the `Trainer` API to do model + data parallelism (without setting `device_map`), or do we...
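(For concreteness, here's a minimal sketch of what I have in mind, assuming the DeepSpeed config lives at `ds_config.json` (hypothetical path) and using a placeholder model; the dataset and launch command are omitted.)

```
# Minimal sketch: enabling DeepSpeed through the HF Trainer with no device_map.
# ds_config.json is a hypothetical path to a ZeRO config; the script would be
# launched with `deepspeed` or `torchrun` so every GPU gets its own process.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model, no device_map

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    deepspeed="ds_config.json",  # Trainer hands model + data parallelism to DeepSpeed
)

trainer = Trainer(model=model, args=args)  # train_dataset omitted in this sketch
```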
@griff4692 Thanks for the pointer, I hardcoded the strategy as `strategy = DeepSpeedStrategy(config=ds_config)` [here](https://github.com/Lightning-AI/lit-parrot/blob/af019fdb0b785ac6ec405b9de6c1768fc943f5e2/finetune/lora.py#L64) and it runs! However, there are two issues that I see: 1. Peak GPU memory is ~30...
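(For reference, this is roughly what that hardcoding looks like in isolation; the `ds_config` values below are illustrative placeholders, not the exact settings I used.)

```
# Sketch of the hardcoded strategy, pulled out of finetune/lora.py for clarity.
# The ds_config values here are illustrative placeholders, not my exact settings.
import lightning as L
from lightning.fabric.strategies import DeepSpeedStrategy

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 16,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
}

fabric = L.Fabric(
    devices=8,
    precision="bf16-true",
    strategy=DeepSpeedStrategy(config=ds_config),  # the hardcoded line
)
fabric.launch()
```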
Aah I didn't realize it was hardcoded there, thanks!
@rasbt @aniketmaurya 1. I tried running with 8 A100 (80GB) GPUs with the settings:

```
batch_size = 64
micro_batch_size = 4
lora_r = 8
devices = 8
```

It runs for ~15k...
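For reference, this is how I understand these knobs to combine (my reading of the script, so treat the arithmetic below as an assumption):

```
# My reading of how these settings combine; the names and the per-device vs.
# global interpretation of batch_size are assumptions on my part.
batch_size = 64        # target (effective) batch size
micro_batch_size = 4   # examples per forward/backward pass on each device
devices = 8

# Gradients from this many micro-batches are accumulated before each optimizer step.
gradient_accumulation_iters = batch_size // micro_batch_size  # 16

# If batch_size is per device, the global batch per optimizer step would be:
global_batch = batch_size * devices  # 512
print(gradient_accumulation_iters, global_batch)
```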
@carmocca Based on the PR, this seems to be a fix for the `adapter` method but not `lora`. Can you outline the basic steps to make the same changes for `lora`?
Hey @carmocca I tried your fix and the memory requirement seems to be the same while the iteration time decreases from ~10s to ~7s. Here's my config: ``` max_seq_len =...
@fozziethebeat What are your `micro_batch_size` and `max_seq_len`? Since the sequence length is local to the batch, maybe it finds a batch later in your training that is big enough to...
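(A toy illustration of what I mean, with a made-up `pad_batch` helper: memory is driven by the longest example in each individual batch, so one long example late in training can cause a sudden spike.)

```
# Toy illustration: padding to the longest sequence *within* each batch means
# the padded shape (and hence activation memory) is set by whatever the longest
# example in that particular batch happens to be.
import torch

def pad_batch(sequences, pad_id=0):
    # Pad every sequence up to the longest one in *this* batch.
    max_len = max(len(s) for s in sequences)
    return torch.stack([
        torch.tensor(s + [pad_id] * (max_len - len(s))) for s in sequences
    ])

early_batch = [[1, 2, 3], [4, 5], [6]]
late_batch = [[1, 2, 3], [4, 5], list(range(2048))]  # one unusually long example

print(pad_batch(early_batch).shape)  # torch.Size([3, 3])
print(pad_batch(late_batch).shape)   # torch.Size([3, 2048]) -> far more memory
```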
I am facing the same issue on an 8xA100 machine with `bitsandbytes==0.42.0`. I am using the `paged_adamw_32bit` optimizer. Did any of you find a solution? Please help, thank you 🙏
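(For context, this is roughly how the optimizer is selected on my side; a simplified sketch assuming the HF `Trainer` route, not my exact training script.)

```
# Simplified sketch of the optimizer selection, assuming the HF Trainer's
# `optim` flag is used to pick bitsandbytes' paged 32-bit AdamW; the rest of
# the training setup is omitted.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    optim="paged_adamw_32bit",        # bitsandbytes paged 32-bit AdamW
    per_device_train_batch_size=4,
)
```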