Andcircle

Results: 30 comments of Andcircle

@HamidShojanazeri thanks for your response. cuda 12.2, nccl 2.19.3, torch 2.2.0, transformers 4.37.2, trl 0.7.10, accelerate 0.27.2, bitsandbytes 0.42.0

@HamidShojanazeri With cuda 12.1.1 and torch 2.2.1 I still get exactly the same error at the same step. Any hints or guidance on how to debug this type of situation? I tried adding TORCH_CUDA_SANITIZER=1,...
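One hedged note on the sanitizer: PyTorch's CUDA Sanitizer (CSAN) reads `TORCH_CUDA_SANITIZER` when torch is imported, so the variable must be set before the import (or exported in the shell that launches the job), otherwise it silently has no effect:

```python
import os

# CSAN is controlled by an environment variable checked at torch import
# time, so set it before importing torch.
os.environ["TORCH_CUDA_SANITIZER"] = "1"

# import torch  # only import torch after the variable is set
print(os.environ["TORCH_CUDA_SANITIZER"])  # → 1
```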

@HamidShojanazeri I did add this env var, but didn't get any extra info; I also used the sanitizer =) This is the smallest code snippet with which I can reproduce it: ```import os import wandb import torch from...

@HamidShojanazeri thanks for your reply. This is just a demo snippet; we actually use MP + DDP. FSDP without QLoRA can't save that much memory, since we have relatively long...

@ktlKTL using your package versions, I got a new error: `ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.` Any hints?
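For context, a simplified, hypothetical sketch of the kind of shape guard that produces this message (the quoted wording, including the "this look incorrect" typo, is the library's own error string; `set_weight` below is an illustration, not the real implementation). The point is that the in-memory parameter was left empty, e.g. initialized on the meta device under low-memory or quantized loading, so it cannot receive the full [32000, 4096] embedding from the checkpoint:

```python
def set_weight(current_shape, new_shape):
    """Illustrative shape check: refuse to copy a checkpoint tensor
    into a parameter whose allocated shape does not match."""
    if current_shape != new_shape:
        raise ValueError(
            f"Trying to set a tensor of shape {new_shape} in \"weight\" "
            f"(which has shape {current_shape}), this look incorrect."
        )
    return True

# A meta/empty parameter has shape (0,), hence the mismatch above.
print(set_weight((32000, 4096), (32000, 4096)))  # → True
```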

Hey @tmm1, sorry to bother you. Still facing the same issue: ``` MAX_JOBS=4 pip install -U flash-attn --no-build-isolation Collecting flash-attn Using cached flash_attn-2.1.0.tar.gz (2.2 MB) Preparing metadata (pyproject.toml) ... error error:...

@younesbelkada All the test cases above use device_map="auto", and that also works for me. BUT: if I use device_map={'':torch.cuda.current_device()}, the error shows up again: ``` Traceback (most recent call last):...
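A hedged note on the two mappings: in the `device_map` accepted by `from_pretrained`, the empty-string key addresses the root module, so `{'': index}` pins the entire model to one device, whereas `"auto"` lets accelerate shard it across whatever is visible. A minimal sketch (`make_single_device_map` is a hypothetical helper):

```python
def make_single_device_map(device_index: int) -> dict:
    # The '' key maps the root module, i.e. the whole model, so every
    # submodule is placed on the given device index.
    return {"": device_index}

# Equivalent in spirit to device_map={'': torch.cuda.current_device()}.
print(make_single_device_map(0))  # → {'': 0}
```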

@younesbelkada Even when setting device_map="auto", if there is only 1 GPU I still face the error: ``` Traceback (most recent call last): File "train1.py", line 124, in trainer = SFTTrainer( File "/usr/local/lib/python3.8/dist-packages/trl/trainer/sft_trainer.py",...
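One workaround sketch, assuming the error only appears on single-GPU runs: choose the `device_map` from the visible GPU count before calling `from_pretrained` (`choose_device_map` is a hypothetical helper; in a real script the count would come from `torch.cuda.device_count()`):

```python
def choose_device_map(num_gpus: int):
    """Pin everything to the lone GPU when only one is visible;
    otherwise let accelerate shard the model across devices."""
    if num_gpus <= 1:
        return {"": 0}
    return "auto"

print(choose_device_map(1))  # → {'': 0}
print(choose_device_map(4))  # → auto
```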