RuntimeError: CUDA error: device-side assert triggered when running Llama on multiple gpus
I'm getting the following error when using more than one GPU:
python3 -m fastchat.serve.cli --model-name /tmp/cache/vicuna-13b/ --num-gpus 2
I am unsure if this is a problem on my end or if it's something that can be fixed. Can you please confirm if using multiple GPUs is supported by FastChat and if there are any specific requirements that must be met? Thank you.
I'm using 4x V100 (32 GB each), and yes, I've already tried with both 2-GPU and 4-GPU combinations.

What is your PyTorch version? Could you try the latest PyTorch version?
The command works for me on V100
(py310) zhangyu@ps:~/project/FastChat/FastChat/fastchat$ python3 -m fastchat.serve.cli --model-path /home/zhangyu/project/FastChat/FastChat/fastchat/output/vicuna-13b --num-gpus 2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00, 8.83s/it]
USER: hello
ASSISTANT: ../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [734,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [734,0,0], thread: [65,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [734,0,0], thread: [66,0,0] Assertion srcIndex < srcSelectDimSize failed.
[... the same assertion repeats for many more threads in blocks 734, 740, 510, 744, and 737 ...]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/runpy.py:196 in _run_module_as_main │
│ │
│ 193 │ main_globals = sys.modules["__main__"].__dict__ │
│ 194 │ if alter_argv: │
│ 195 │ │ sys.argv[0] = mod_spec.origin │
│ ❱ 196 │ return _run_code(code, main_globals, None, │
│ 197 │ │ │ │ │ "__main__", mod_spec) │
│ 198 │
│ 199 def run_module(mod_name, init_globals=None, │
│ │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/runpy.py:86 in _run_code │
│ │
│ 83 │ │ │ │ │ __loader__ = loader, │
│ 84 │ │ │ │ │ __package__ = pkg_name, │
│ 85 │ │ │ │ │ __spec__ = mod_spec) │
│ ❱ 86 │ exec(code, run_globals) │
│ 87 │ return run_globals │
│ 88 │
│ 89 def _run_module_code(code, init_globals=None, │
│ │
│ /home/zhangyu/project/FastChat/FastChat/fastchat/serve/cli.py:132 in <module> │
│ ... │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/utils/_contextlib.py:35 in generator_context │
│ │
│ 33 │ │ │ # Issuing `None` to a generator fires it up │
│ 34 │ │ │ with ctx_factory(): │
│ ❱ 35 │ │ │ │ response = gen.send(None) │
│ 36 │ │ │ │
│ 37 │ │ │ while True: │
│ 38 │ │ │ │ try: │
│ │
│ /home/zhangyu/project/FastChat/FastChat/fastchat/serve/inference.py:117 in generate_stream │
│ │
│ 114 │ │
│ 115 │ for i in range(max_new_tokens): │
│ 116 │ │ if i == 0: │
│ ❱ 117 │ │ │ out = model( │
│ 118 │ │ │ │ torch.as_tensor([input_ids], device=device), use_cache=True) │
│ 119 │ │ │ logits = out.logits │
│ 120 │ │ │ past_key_values = out.past_key_values │
│ │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 │
│ in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/accelerate/hooks.py:165 in │
│ new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/llama/model │
│ ing_llama.py:687 in forward │
│ │
│ 684 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 685 │ │ │
│ 686 │ │ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) │
│ ❱ 687 │ │ outputs = self.model( │
│ 688 │ │ │ input_ids=input_ids, │
│ 689 │ │ │ attention_mask=attention_mask, │
│ 690 │ │ │ position_ids=position_ids, │
│ │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 │
│ in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/llama/model │
│ ing_llama.py:536 in forward │
│ │
│ 533 │ │ │ attention_mask = torch.ones( │
│ 534 │ │ │ │ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embe │
│ 535 │ │ │ ) │
│ ❱ 536 │ │ attention_mask = self._prepare_decoder_attention_mask( │
│ 537 │ │ │ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_len │
│ 538 │ │ ) │
│ 539 │
│ │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/llama/model │
│ ing_llama.py:464 in _prepare_decoder_attention_mask │
│ │
│ 461 │ │ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] │
│ 462 │ │ combined_attention_mask = None │
│ 463 │ │ if input_shape[-1] > 1: │
│ ❱ 464 │ │ │ combined_attention_mask = _make_causal_mask( │
│ 465 │ │ │ │ input_shape, │
│ 466 │ │ │ │ inputs_embeds.dtype, │
│ 467 │ │ │ │ device=inputs_embeds.device, │
│ │
│ /home/zhangyu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/llama/model │
│ ing_llama.py:49 in _make_causal_mask │
│ │
│ 46 │ Make causal mask used for bi-directional self-attention. │
│ 47 │ """ │
│ 48 │ bsz, tgt_len = input_ids_shape │
│ ❱ 49 │ mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=de │
│ 50 │ mask_cond = torch.arange(mask.size(-1), device=device) │
│ 51 │ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) │
│ 52 │ mask = mask.to(dtype) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
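As a side note, the assertion spam above is the generic symptom of an out-of-range index into an embedding (or any other index_select) table. A minimal sketch that reproduces the same device-side assert in isolation, assuming nothing more than a single CUDA GPU and plain PyTorch:

```python
# Minimal reproduction sketch (assumption: one CUDA GPU, plain PyTorch).
# An id >= the embedding size triggers the same
# "Indexing.cu ... Assertion srcIndex < srcSelectDimSize failed" assert.
import torch

emb = torch.nn.Embedding(32000, 64).cuda()                  # valid ids are 0..31999
bad_ids = torch.tensor([[1, 31999, 32000]], device="cuda")  # 32000 is out of range

out = emb(bad_ids)        # the kernel launch is asynchronous ...
torch.cuda.synchronize()  # ... so the RuntimeError surfaces here (or at the next CUDA call)
```

Because kernel launches are asynchronous, the failure normally surfaces at a later CUDA call, which is why the traceback suggests CUDA_LAUNCH_BLOCKING=1.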
What is your PyTorch version? Could you try the latest PyTorch version?
The command works for me on V100
I also have the same problem. Loading the model is not a problem, but once the conversation starts, an error is reported. I am using 2x RTX 4090 (48 GB total) and Python 3.10.
@starphantom666 did you find a solution? I have the same problem (the same "Compile with TORCH_USE_CUDA_DSA to enable device-side assertions" error). I am using 1x NVIDIA T4.
I also have the same problem. I can use the Vicuna 13B model properly with the --load-8bit option on a single 4090 GPU, but when I use multiple GPUs (--num-gpus 2), this problem occurs. I'm still seeing the same traceback message and I can't figure out why.
I got the same problem on a dual-4090 machine. I tried the same command with two 3090s and it worked well. I guessed it was a driver/CUDA version problem, but then I did some searching and found the following post: https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366
It seems the 4090 does not support peer-to-peer communication between multiple cards at all. I am not 100% sure this is the root cause since I am not an expert in this domain. Can someone double-check it? Thanks.
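One way to double-check the peer-to-peer question from PyTorch itself, assuming at least two GPUs are visible (this is only a sketch, not FastChat code):

```python
# Query P2P (peer access) capability between all visible GPU pairs.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"{torch.cuda.get_device_name(i)} (cuda:{i}) -> cuda:{j}: "
              f"peer access {'available' if ok else 'NOT available'}")
```

If peer access comes back unavailable, that would be consistent with needing NCCL_P2P_DISABLE=1, as tried in the update below.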
=== Updated below ===
I tried setting NCCL_P2P_DISABLE=1 and ran another code for training LoRA with two 4090s. Now it works (it used to be stuck).
But when I try running Vicuna with P2P disabled, it quits and reports another error:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /XX/envs/vicuna-matata/lib/python3.10/runpy.py:196 in _run_module_as_main │
│ │
│ 193 │ main_globals = sys.modules["__main__"].__dict__ │
│ 194 │ if alter_argv: │
│ 195 │ │ sys.argv[0] = mod_spec.origin │
│ ❱ 196 │ return _run_code(code, main_globals, None, │
│ 197 │ │ │ │ │ "__main__", mod_spec) │
│ 198 │
│ 199 def run_module(mod_name, init_globals=None, │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/runpy.py:86 in _run_code │
│ │
│ 83 │ │ │ │ │ __loader__ = loader, │
│ 84 │ │ │ │ │ __package__ = pkg_name, │
│ 85 │ │ │ │ │ __spec__ = mod_spec) │
│ ❱ 86 │ exec(code, run_globals) │
│ 87 │ return run_globals │
│ 88 │
│ 89 def _run_module_code(code, init_globals=None, │
│ │
│ /XX/FastChat/fastchat/serve/cli.py:132 in <module> │
│ │
│ 129 │ │ │ │ │ │ choices=["simple", "rich"], help="Display style.") │
│ 130 │ parser.add_argument("--debug", action="store_true") │
│ 131 │ args = parser.parse_args() │
│ ❱ 132 │ main(args) │
│ 133 │
│ │
│ /XX/FastChat/fastchat/serve/cli.py:108 in main │
│ │
│ 105 │ else: │
│ 106 │ │ raise ValueError(f"Invalid style for console: {args.style}") │
│ 107 │ try: │
│ ❱ 108 │ │ chat_loop(args.model_path, args.device, args.num_gpus, args.max_gpu_memory, │
│ 109 │ │ │ args.load_8bit, args.conv_template, args.temperature, args.max_new_tokens, │
│ 110 │ │ │ chatio, args.debug) │
│ 111 │ except KeyboardInterrupt: │
│ │
│ /XX/FastChat/fastchat/serve/inference.py:223 in chat_loop │
│ │
│ 220 │ │ │
│ 221 │ │ chatio.prompt_for_output(conv.roles[1]) │
│ 222 │ │ output_stream = generate_stream_func(model, tokenizer, params, device) │
│ ❱ 223 │ │ outputs = chatio.stream_output(output_stream, skip_echo_len) │
│ 224 │ │ # NOTE: strip is important to align with the training data. │
│ 225 │ │ conv.messages[-1][-1] = outputs.strip() │
│ 226 │
│ │
│ /XX/FastChat/fastchat/serve/cli.py:69 in stream_output │
│ │
│ 66 │ │ # Create a Live context for updating the console output │
│ 67 │ │ with Live(console=self._console, refresh_per_second=4) as live: │
│ 68 │ │ │ # Read lines from the stream │
│ ❱ 69 │ │ │ for outputs in output_stream: │
│ 70 │ │ │ │ accumulated_text = outputs[skip_echo_len:] │
│ 71 │ │ │ │ if not accumulated_text: │
│ 72 │ │ │ │ │ continue │
│ │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/torch/utils/_contextli │
│ b.py:56 in generator_context │
│ │
│ 53 │ │ │ │ else: │
│ 54 │ │ │ │ │ # Pass the last request to the generator and get its response │
│ 55 │ │ │ │ │ with ctx_factory(): │
│ ❱ 56 │ │ │ │ │ │ response = gen.send(request) │
│ 57 │ │ │
│ 58 │ │ # We let the exceptions raised above by the generator's `.throw` or │
│ 59 │ │ # `.send` methods bubble up to our caller, except for StopIteration │
│ │
│ /XX/FastChat/fastchat/serve/inference.py:122 in generate_stream │
│ │
│ 119 │ │ │ logits = out.logits │
│ 120 │ │ │ past_key_values = out.past_key_values │
│ 121 │ │ else: │
│ ❱ 122 │ │ │ out = model(input_ids=torch.as_tensor([[token]], device=device), │
│ 123 │ │ │ │ │ │ use_cache=True, │
│ 124 │ │ │ │ │ │ past_key_values=past_key_values) │
│ 125 │ │ │ logits = out.logits │
│ │
│ /XX/miniconda3/envs/vicuna-matata/lib/python3.10/site-packages/torch/nn/modules/modul │
│ e.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/accelerate/hooks.py:16 │
│ 5 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/transformers/models/ll │
│ ama/modeling_llama.py:687 in forward │
│ │
│ 684 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 685 │ │ │
│ 686 │ │ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) │
│ ❱ 687 │ │ outputs = self.model( │
│ 688 │ │ │ input_ids=input_ids, │
│ 689 │ │ │ attention_mask=attention_mask, │
│ 690 │ │ │ position_ids=position_ids, │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/torch/nn/modules/modul │
│ e.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/transformers/models/ll │
│ ama/modeling_llama.py:577 in forward │
│ │
│ 574 │ │ │ │ │ None, │
│ 575 │ │ │ │ ) │
│ 576 │ │ │ else: │
│ ❱ 577 │ │ │ │ layer_outputs = decoder_layer( │
│ 578 │ │ │ │ │ hidden_states, │
│ 579 │ │ │ │ │ attention_mask=attention_mask, │
│ 580 │ │ │ │ │ position_ids=position_ids, │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/torch/nn/modules/modul │
│ e.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/accelerate/hooks.py:16 │
│ 5 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/transformers/models/ll │
│ ama/modeling_llama.py:305 in forward │
│ │
│ 302 │ │ # Fully Connected │
│ 303 │ │ residual = hidden_states │
│ 304 │ │ hidden_states = self.post_attention_layernorm(hidden_states) │
│ ❱ 305 │ │ hidden_states = self.mlp(hidden_states) │
│ 306 │ │ hidden_states = residual + hidden_states │
│ 307 │ │ │
│ 308 │ │ outputs = (hidden_states,) │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/torch/nn/modules/modul │
│ e.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/accelerate/hooks.py:16 │
│ 5 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /XX/miniconda3/envs/vicuna-matata/lib/python3.10/site-packages/transformers/models/ll │
│ ama/modeling_llama.py:157 in forward │
│ │
│ 154 │ │ self.act_fn = ACT2FN[hidden_act] │
│ 155 │ │
│ 156 │ def forward(self, x): │
│ ❱ 157 │ │ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) │
│ 158 │
│ 159 │
│ 160 class LlamaAttention(nn.Module): │
│ │
│ /XX/vicuna-matata/lib/python3.10/site-packages/torch/nn/modules/modul │
│ e.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/accelerate/hooks.py:16 │
│ 5 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /XX/envs/vicuna-matata/lib/python3.10/site-packages/torch/nn/modules/linea │
│ r.py:114 in forward │
│ │
│ 111 │ │ │ init.uniform_(self.bias, -bound, bound) │
│ 112 │ │
│ 113 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 114 │ │ return F.linear(input, self.weight, self.bias) │
│ 115 │ │
│ 116 │ def extra_repr(self) -> str: │
│ 117 │ │ return 'in_features={}, out_features={}, bias={}'.format( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb,
&fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
I was able to fix this using the steps below:

- Run the `nvidia-smi` command and check the GPU IDs; in my case they are 0,1,2,3. Also check that MIG is not enabled, otherwise that GPU won't work.
- Add `CUDA_VISIBLE_DEVICES=0,1` before the command, something like:
- `CUDA_VISIBLE_DEVICES=0,1 python3 -m fastchat.serve.cli --model-name /tmp/cache/vicuna-13b/ --num-gpus 2`
- `CUDA_VISIBLE_DEVICES=0,1,2 python3 -m fastchat.serve.cli --model-name /tmp/cache/vicuna-13b/ --num-gpus 3`

Hope it works.
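For completeness, a small sketch (plain PyTorch, not FastChat code) to confirm which devices are actually visible after setting CUDA_VISIBLE_DEVICES; the names and memory sizes printed depend on your machine:

```python
# Confirm device visibility from the Python side, mirroring the nvidia-smi check above.
import torch

print("visible CUDA devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i}  {props.name}  {props.total_memory / 1024**3:.1f} GiB")
```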
Thank you @Sparetavns
Does @Sparetavns 's solution solve others' problems?
In case it helps anybody else: The problem posted by @starphantom666 may be a different problem from the original post (OP) above. A problem consistent with the @starphantom666 report can occur because of the recently changed handling of BOS/EOS tokens in the Hugging Face ("HF") Llama implementation.
- This problem may occur if the conversion of weights to HF format was done with "old" HF code, whereas you are now using the latest HF code.
- The symptom of the problem is that, if you print out the prompt tokenization, the BOS token (`<s>`) is wrongly represented by ID #32000, whereas the embeddings now expect ID 1.
- In that case, if one does `export CUDA_LAUNCH_BLOCKING=1`, one will see an assertion message of the form `... Indexing.cu: ... indexSelectLargeIndex: ... Assertion srcIndex < srcSelectDimSize failed.` from a CUDA kernel launched by the torch embedding function -- presumably because the 32000 is past the end of the embedding table.

A solution is to update both the HF transformers and FastChat repos to the latest versions and re-convert the weights: from the original weights to HF weights, and then from HF weights to Vicuna weights.
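To make the symptom above concrete, here is a minimal sketch of that check: print the prompt tokenization and compare the ids against the configured vocab size. The model path is a placeholder, and this assumes a locally converted HF checkpoint:

```python
# Sketch: detect out-of-range token ids (e.g. a BOS of 32000) before they hit the GPU.
import transformers

model_path = "/path/to/vicuna-13b"  # placeholder for your converted weights
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path, use_fast=False)
config = transformers.AutoConfig.from_pretrained(model_path)

ids = tokenizer("hello").input_ids
print("prompt ids:", ids)                       # current HF Llama tokenizers emit BOS id 1 first
print("bos_token_id:", tokenizer.bos_token_id)
print("config.vocab_size:", config.vocab_size)

bad = [i for i in ids if i >= config.vocab_size]
if bad:
    print("out-of-range ids (these trigger the Indexing.cu assert):", bad)
```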
@sethbruder Great, thanks for the detailed explanations!
Seems like @sethbruder 's solution would solve this problem. Closing. Please re-open if the issue persists.
Hi, I also got this error. The weird thing is, if I run the script from this git repo, the model trains without any problem, but if I train it with my own script, it doesn't work.
Here is my overall script:
import os
from typing import List, Tuple

import mlflow
import torch
import transformers

# MODEL_TOKEN_PAIR, ModelArguments, DataArguments, TrainingArguments,
# parse_args and EarlyStopping are defined elsewhere in my project.


class TrainDataset(torch.utils.data.Dataset):
    """Plain-text dataset: one training example per line of the data file."""

    def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer, max_length: int = 0):
        abs_data_path = os.path.abspath(data_path)
        self.data = self.get_data(abs_data_path)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index: int) -> torch.Tensor:
        # Tokenize one line and build its attention mask.
        items = {'input_ids': self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(self.data[index]))}
        items['attention_mask'] = [1] * len(items['input_ids'])
        if self.max_length:
            items['input_ids'] = items['input_ids'][:self.max_length]
            items = self.tokenizer.pad(items, padding='max_length', max_length=self.max_length)
        return {k: torch.tensor(v, dtype=(torch.long if k != 'attention_mask' else torch.bool))
                for k, v in items.items()}

    def get_data(self, data_path: str) -> List[str]:
        with open(data_path) as f:
            str_data = f.readlines()
        return str_data


def load_pair(model_type, model_path) -> Tuple[transformers.PreTrainedModel, transformers.PreTrainedTokenizer]:
    if model_type not in MODEL_TOKEN_PAIR:
        raise KeyError(f'{model_type} model type is not supported')
    model_cls, token_cls = MODEL_TOKEN_PAIR[model_type]
    tokenizer: transformers.PreTrainedTokenizer = token_cls.from_pretrained(model_path, use_fast=False)
    if tokenizer.pad_token is None:
        # Pick a pad token if the tokenizer does not define one.
        try:
            tokenizer.convert_tokens_to_ids('<pad>')
            pad_token = '<pad>'
        except NotImplementedError:
            pad_token = tokenizer.convert_ids_to_tokens(0)
        tokenizer.add_special_tokens({'pad_token': pad_token})
    model: transformers.PreTrainedModel = model_cls.from_pretrained(model_path)
    return model, tokenizer


def main():
    parser = transformers.HfArgumentParser(
        (ModelArguments, DataArguments, TrainingArguments)
    )
    model_args, data_args, training_args = parse_args(parser)

    mlflow.set_tracking_uri(training_args.mlflow_url)
    os.environ['MLFLOW_EXPERIMENT_NAME'] = training_args.experiment_name
    mlflowCallback = transformers.integrations.MLflowCallback()

    model, tokenizer = load_pair(model_args.model_type, model_args.model_name_or_path)
    train_dataset = TrainDataset(data_args.data_path, tokenizer, training_args.model_max_length)
    data_collator = transformers.DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False
    )

    trainer = transformers.Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        callbacks=[EarlyStopping(training_args.max_epoch_without_progress), mlflowCallback]
    )
    trainer.train()
After checking, the difference between my TrainDataset and SupervisedDataset is that my dataset does not contain a labels key. But that should be handled by the data_collator, and the return value is the same. What could be the problem here?
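Regarding the missing labels key: DataCollatorForLanguageModeling with mlm=False does build labels from input_ids when it collates a batch, so that part should indeed be equivalent. A rough diagnostic sketch (the helper below is hypothetical and just reuses the names from the script above) to confirm what the collator produces and, more importantly given the assert in this thread, that every id fits inside the model's embedding table:

```python
# Hypothetical diagnostic helper, reusing train_dataset / data_collator / model
# from the script above. Checks the collated batch keys and the id range.
def inspect_batch(train_dataset, data_collator, model):
    batch = data_collator([train_dataset[i] for i in range(2)])
    print("batch keys:", list(batch.keys()))   # should include 'labels' when mlm=False

    vocab_rows = model.get_input_embeddings().num_embeddings
    max_id = int(batch["input_ids"].max())
    print(f"max input id: {max_id}, embedding rows: {vocab_rows}")

    if max_id >= vocab_rows:
        # For example, a pad token newly added to the tokenizer without a matching
        # model.resize_token_embeddings(len(tokenizer)) would land here and raise
        # the same device-side assert discussed in this thread.
        print("input ids exceed the embedding table -- this will assert on CUDA")
```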
Seems like @sethbruder 's solution would solve this problem. Closing. Please re-open if the issue persists.
@zhisbug Hi, I have updated the fschat and transformers packages to the latest versions and re-converted the model to Hugging Face format, but the error mentioned before still occurs when running the client on two RTX 4090 GPUs. I don't think this issue has been solved; could you reopen it?
same problem
Hi, I'm also seeing this error. Any update on this? Thank you!
I got the same problem on a dual-4090 machine; the same command works fine with two 3090s. I initially guessed it was a driver/CUDA version issue, but after some searching I found this post: https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366 It suggests the RTX 4090 does not support peer-to-peer (P2P) communication between cards at all. I'm not 100% sure this is the root cause, since I'm not an expert in this area; can someone double-check? Thanks.

=== Updated below ===

I tried setting NCCL_P2P_DISABLE=1 and ran other code for training a LoRA with two 4090s. It now works (it used to hang). But when I run Vicuna with P2P disabled, it quits with another error:
Traceback (most recent call last):
  /XX/envs/vicuna-matata/lib/python3.10/runpy.py:196 in _run_module_as_main
      return _run_code(code, main_globals, None, "__main__", mod_spec)
  /XX/envs/vicuna-matata/lib/python3.10/runpy.py:86 in _run_code
      exec(code, run_globals)
  /XX/FastChat/fastchat/serve/cli.py:132 in <module>
      main(args)
  /XX/FastChat/fastchat/serve/cli.py:108 in main
      chat_loop(args.model_path, args.device, args.num_gpus, args.max_gpu_memory, ...)
  /XX/FastChat/fastchat/serve/inference.py:223 in chat_loop
      outputs = chatio.stream_output(output_stream, skip_echo_len)
  /XX/FastChat/fastchat/serve/cli.py:69 in stream_output
      for outputs in output_stream:
  /XX/envs/vicuna-matata/lib/python3.10/site-packages/torch/utils/_contextlib.py:56 in generator_context
      response = gen.send(request)
  /XX/FastChat/fastchat/serve/inference.py:122 in generate_stream
      out = model(input_ids=torch.as_tensor([[token]], device=device), use_cache=True, past_key_values=past_key_values)
  /XX/envs/vicuna-matata/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:687 in forward
      outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, ...)
  /XX/envs/vicuna-matata/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:577 in forward
      layer_outputs = decoder_layer(hidden_states, attention_mask=attention_mask, position_ids=position_ids, ...)
  /XX/envs/vicuna-matata/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:305 in forward
      hidden_states = self.mlp(hidden_states)
  /XX/miniconda3/envs/vicuna-matata/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:157 in forward
      return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  /XX/envs/vicuna-matata/lib/python3.10/site-packages/torch/nn/modules/linear.py:114 in forward
      return F.linear(input, self.weight, self.bias)
  (intermediate torch/nn/modules/module.py:1501 `_call_impl` and accelerate/hooks.py:165 `new_forward` frames omitted)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
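For anyone who wants to reproduce the P2P-disabled run described above, the environment variable can be set inline when launching the CLI; the model path below is only a placeholder:

NCCL_P2P_DISABLE=1 python3 -m fastchat.serve.cli --model-path /path/to/vicuna-13b --num-gpus 2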
No, loading the model across dual 4090s still doesn't work. As a workaround, I currently run on a single card with the --load-8bit quantization option; the performance degradation should be minimal. Hope this helps.
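For reference, the single-card 8-bit workaround mentioned here can be launched roughly as follows; the model path is again a placeholder, and CUDA_VISIBLE_DEVICES simply pins the process to one GPU:

CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.cli --model-path /path/to/vicuna-13b --load-8bit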
In my case, the problem appears at around 0.97 epochs into training. I suspected a data problem, but the error still occurs when I train on half of the data. Interestingly, when I use only 1,000 samples, the problem disappears.
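If a bad data sample is suspected, one quick sanity check is to verify that no token id in the tokenized training set falls outside the model's vocabulary and that no sequence exceeds the model's context length. The sketch below assumes the tokenized samples are stored as a JSON list with an "input_ids" field; the file name, vocabulary size, and maximum length are placeholders to adjust for your setup:

import json

VOCAB_SIZE = 32000    # assumed tokenizer vocabulary size
MAX_SEQ_LEN = 2048    # assumed model context length

# Hypothetical path to the already-tokenized training data.
with open("tokenized_train.json") as f:
    samples = json.load(f)

issues = 0
for i, sample in enumerate(samples):
    ids = sample["input_ids"]
    if any(t < 0 or t >= VOCAB_SIZE for t in ids):
        print(f"sample {i}: token id outside [0, {VOCAB_SIZE})")
        issues += 1
    if len(ids) > MAX_SEQ_LEN:
        print(f"sample {i}: length {len(ids)} exceeds {MAX_SEQ_LEN}")
        issues += 1

print(f"checked {len(samples)} samples, found {issues} issues")

If every sample passes this check, the failure near the end of the epoch is more likely tied to a particular batch or sequence length than to corrupted token ids.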