bug-fixed
The problem occurs with `Zero3-offload`. The issue seems to lie in the parameter-partitioning part of ZeRO-3: if the model has multiple parallel modules or frozen parameters, the offload...
@tjruwase please try this example (https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune.sh) with zero3-offload. Thanks.
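For anyone trying to reproduce, a minimal DeepSpeed config sketch for ZeRO-3 with offload is below. The exact values (pin_memory, devices) are assumptions for illustration, not the settings from the linked finetune.sh:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param":     { "device": "cpu", "pin_memory": true }
  }
}
```

Passing this as the `--deepspeed` config should exercise the same parameter-partitioning + offload path described above.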
@jomayeri, thanks for the response. The file needed by the script can be downloaded here: https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5/tree/main. Unfortunately, I think it's difficult for me to prepare a more concise...
@tjruwase I have updated my comment; please kindly check it. Thanks.
> @bug-fixed Does the same thing happen when you offload to CPU?

@jomayeri The machine I'm working on has very limited memory and is shared with others, so it is difficult...
Same here. `generate` with `gemma 2 9b` is very slow. Any ideas? Thanks.
Hello, I'm also confused about a related point. In `ssl_default_config.yaml`, the params are `batch_size_per_gpu: 64` and `OFFICIAL_EPOCH_LENGTH: 1250`, while the README says `Run DINOv2 training on 4...
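To make the question concrete, here is a back-of-the-envelope check of what those two config values imply, assuming 4 GPUs as in the README example (the GPU count is my assumption; the other numbers are from `ssl_default_config.yaml`):

```python
# Rough arithmetic for one "official epoch" of DINOv2 training,
# assuming 4 GPUs (from the README example) and the yaml defaults.
batch_size_per_gpu = 64        # from ssl_default_config.yaml
num_gpus = 4                   # assumption: the README's 4-GPU setup
official_epoch_length = 1250   # iterations per "official epoch", from the yaml

global_batch_size = batch_size_per_gpu * num_gpus
samples_per_official_epoch = global_batch_size * official_epoch_length

print(global_batch_size)           # 256 samples per iteration
print(samples_per_official_epoch)  # 320000 samples per "official epoch"
```

So one `OFFICIAL_EPOCH_LENGTH` of 1250 iterations would cover 320k samples at this batch size, which is what I'd like clarified relative to the actual dataset size.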