aphrodite-engine
[Bug]: loading model with int8 kv cache chokes
Your current environment
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect
CMake version: version 3.27.6
Libc version: glibc-2.35
Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD EPYC Processor
CPU family: 23
Model: 1
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 2
BogoMIPS: 4890.76
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt nrip_save
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1 MiB (32 instances)
L1i cache: 2 MiB (32 instances)
L2 cache: 16 MiB (32 instances)
L3 cache: 64 MiB (8 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.2
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
🐛 Describe the bug
(aphrodite-runtime) [email protected]:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --kv-cache-dtype int8
2024-03-19 19:52:14,449 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 19:52:14,450 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 19:52:14,649 INFO worker.py:1724 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = None
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = int8
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in __init__
self._init_workers_ray(placement_group)
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 268, in _init_workers_ray
self.driver_worker = Worker(
^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
self.kv_quant_params = (self.load_kv_quant_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
kv_quant_params.append(kv_quant_param)
^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
2024-03-19 19:52:19,750 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerAphrodite.init_worker() (pid=26429, ip=172.17.0.2, actor_id=537d7fe532ba3d411a06c1f001000000, repr=<aphrodite.engine.ray_tools.RayWorkerAphrodite object at 0x7f34058b5b50>)
File "/root/aphrodite-engine/aphrodite/engine/ray_tools.py", line 22, in init_worker
self.worker = worker_init_fn()
^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 252, in <lambda>
lambda rank=rank, local_rank=local_rank: Worker(
^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
self.kv_quant_params = (self.load_kv_quant_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
kv_quant_params.append(kv_quant_param)
^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
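The traceback points at a common Python pitfall: a local variable that is only bound inside a conditional branch, then used unconditionally afterwards. Since the config shows `KV Cache Params Path = None`, the branch that reads the scales presumably never runs, and the `append` raises `UnboundLocalError`. A minimal sketch of the pattern, with hypothetical names and values modeled on the traceback (the real `load_kv_quant_params` in `model_runner.py` may differ):

```python
def load_kv_quant_params(path, num_layers):
    """Hypothetical sketch of the failing pattern: per-layer int8 KV cache scales."""
    kv_quant_params = []
    for layer in range(num_layers):
        if path is not None:
            # In the real engine this would be parsed from the params file.
            kv_quant_param = [1.0, 1.0, 1.0, 1.0]
        # Bug: when path is None, kv_quant_param was never assigned.
        kv_quant_params.append(kv_quant_param)
    return kv_quant_params

# Reproduces the reported failure:
try:
    load_kv_quant_params(None, num_layers=2)
except UnboundLocalError as exc:
    print(exc)  # cannot access local variable 'kv_quant_param' ...
```

Either a default assignment before the conditional, or an explicit configuration error when `--kv-cache-dtype int8` is used without a KV cache params path, would turn this crash into a clear message.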
Well, it turns out I didn't have enough VRAM to load the model in 16-bit, so I tried it with `--load-in-4bit`, and the failure is the same. Without the int8 KV cache, the model loads fine:
(aphrodite-runtime) [email protected]:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --load-in-4bit
WARNING: bnb quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-03-19 20:03:18,803 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 20:03:18,804 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 20:03:18,984 INFO worker.py:1724 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 2
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = bnb
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=36344) WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO: Downloading model weights ['*.safetensors']
(RayWorkerAphrodite pid=36344) INFO: Downloading model weights ['*.safetensors']
INFO: Memory allocated for converted model: 9.17 GiB
INFO: Memory reserved for converted model: 9.26 GiB
INFO: Model weights loaded. Memory usage: 9.17 GiB x 2 = 18.34 GiB
With `--kv-cache-dtype fp8_e5m2` and `--load-in-4bit`, it works as well.
Oops, never mind. I didn't read the documentation. Sorry, lol. You might want to put that requirement in boldface or something on the main page where you mention it.