[Bug] Llama3.1 AWQ at TP>1 giving different responses
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
As titled: Llama 3.1 converted to AWQ format gives different responses across runs when served with tp=2, even with temperature=0.0 and top_p=0. At tp=1 the responses are correct and deterministic.
Llama 3 works correctly, so this may be a RoPE issue.
Reproduction
Quantize the model:
lmdeploy lite auto_awq meta-llama/Meta-Llama-3.1-8B-Instruct --work-dir llama3_1_awq
Then run the example shown here with tp=2: https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/w4a16.md
Environment
sys.platform: linux
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9 (built against CUDA 11.8)
- Built with CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.17.2+cu121
LMDeploy: 0.5.2+
transformers: 4.43.3
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.8.2
triton: 2.2.0
NVIDIA Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV12    0-23            N/A
GPU1    NV12     X      0-23            N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Error traceback
No response
Could you try export NCCL_LAUNCH_MODE=GROUP before running the inference example?
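When the example is launched from a Python script rather than a shell, the same setting can be applied through os.environ. This is a minimal sketch, assuming the variable has to be in place before the inference engine (and therefore NCCL) is initialized, so it goes at the very top of the script:

```python
import os

# Set the NCCL launch mode before any library that initializes NCCL is imported.
# Assumption: the engine reads this variable when it first creates its communicators.
os.environ["NCCL_LAUNCH_MODE"] = "GROUP"

print(f"NCCL_LAUNCH_MODE={os.environ['NCCL_LAUNCH_MODE']}")
```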
Same result. I am running on tritonserver. The problem appears as TP increases: for example, Llama 3 8B gives consistent results up to TP=2, but at TP=4 it starts giving different responses.
Can you provide a reproducible demo?
I couldn't reproduce this issue. My test code is as follows:
import time
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
model_path = "/workspace/llama3.1/Meta-Llama-3.1-8B-Instruct-AWQ"
start = time.perf_counter()
backend_config = TurbomindEngineConfig(
    max_batch_size=1,
    cache_max_entry_count=0.5,
)
pipe = pipeline(model_path, backend_config=backend_config, log_level='ERROR')
end = time.perf_counter()
print(f'building pipeline cost: {end - start} s')
prompt = "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\nPlease reason step by step, and put your final answer within \\boxed{}.\n"
gen_config = GenerationConfig(temperature=0.0)
for i in range(10):
    print('-'*50)
    response = pipe(prompt, gen_config=gen_config)
    print(response.text)
@lvhan028 It happens for TP>1.
I couldn't reproduce it.
I tested in the openmmlab/lmdeploy Docker image, in which NCCL_LAUNCH_MODE is GROUP.
@lvhan028 It depends on the prompt. With your prompt it works, but when I change it to "write an essay on open-source", I get different responses across runs, such as the two below:
The Open-Source Movement: A Revolution in Software Development
The open-source movement has been a game-changer in the software development industry, allowing developers to collaborate and create high-quality software without the need for a single proprietary license. Open-source software is free to use, modify, and distribute, and has become a vital part of many industries, including operating systems, web browsers, and even social media platforms.
The concept of open-source dates back to the 1980s, when Richard Stallman, a graduate student at MIT, coined the term "copyleft" to describe the practice of sharing and modifying software code. However, it wasn't until the late 1990s and early 2000s that open-source software started to gain mainstream popularity.
One of the key factors that contributed to the growth of open-source was the rise of the internet and the World Wide Web. With the widespread adoption of the web, developers could now easily share and collaborate on software projects, and the open-source model became a natural fit. Another important factor was the emergence of Linux, a free and open-source operating system that gained popularity in the late 1990s.
Linux, developed by Linus Torvalds and others, was initially created as a hobby project, but it quickly gained traction and became a viable alternative to proprietary operating systems like Windows and MacOS. Linux's open-source model allowed developers to contribute code, fix bugs, and improve the operating system, making it a highly reliable and efficient platform.
The open-source model has several benefits, including cost savings, increased collaboration, and faster development cycles. With open-source software, developers can use and modify existing code without the need for a license, which reduces costs and allows for faster development. Additionally, the open-source model encourages collaboration, as developers can work together to create high-quality software, and the community can benefit from the collective efforts.
Another significant advantage of open-source software is the ability to fix bugs and improve the code. With proprietary software, bugs and issues are often fixed by the original developers, and the code is not made available to the public. In contrast, open-source software allows developers to fix bugs and improve the code, making it a more reliable and efficient platform.
The open-source model has also led to the creation of many successful projects, including Apache, Firefox, and WordPress. These projects have become essential tools for many industries, including web development, and have enabled developers to create high-quality software without the need for a license.
However, the open-source model is not without its challenges
The concept of open-source refers to the practice of making the source code of a program or software available to the public, usually free or at a low cost, and allowing users to modify it, distribute it, and use it as they see fit. This approach has been gaining popularity in recent years, particularly in the field of software development, and has led to the creation of many successful projects, such as Linux, Apache, and Firefox.
The idea of open-source is rooted in the concept of free and open-source software, which emerged in the 1980s. At that time, many software developers, including Richard Stallman, Linus Torvalds, and Eric Raymond, were working on projects that were not only free but also open-source, meaning that the source code was available to the public and could be modified and distributed by anyone.
The term "open-source" was first used in 1998 by Eric Raymond, who wrote an essay titled "The Open Source Definition" in which he defined open-source as "a philosophy and a way of making a difference in the world by creating free and open-source software." This essay was widely read and shared, and it helped to popularize the concept of open-source and its benefits.
One of the main advantages of open-source is that it allows developers to work together and collaborate on projects, which can lead to faster development and better quality of the software. This is because open-source software is often developed by a community of developers who contribute to the project, and the source code is available to the public, which means that anyone can use it, modify it, and distribute it.
Another advantage of open-source is that it allows companies to save money and time by using existing software, which is often developed by a community of developers, rather than having to develop it themselves. This is because open-source software is often available for free or at a low cost, and companies can use it without having to spend a lot of money and time developing it themselves.
In addition, open-source software is often more secure than proprietary software, which is developed by a single company or a small group of developers. This is because open-source software is often developed by a community of developers, which means that many eyes are watching the code, and it is more likely to be secure and reliable.
However, there are also some challenges and limitations of open-source software. One of the main challenges is that it can be difficult to find and fix bugs, which can be a problem because open-source software is often developed by a community of
This is with TP=4.
Any update on this, @lvhan028 @lzhangzz?
Can you share the reproducible code?
import time
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
model_path = "/workspace/llama3.1/Meta-Llama-3.1-8B-Instruct-AWQ"
start = time.perf_counter()
backend_config = TurbomindEngineConfig(
    max_batch_size=1,
    cache_max_entry_count=0.5,
    tp=4,
)
pipe = pipeline(model_path, backend_config=backend_config, log_level='ERROR')
end = time.perf_counter()
print(f'building pipeline cost: {end - start} s')
prompt = "write an article on open-source"
gen_config = GenerationConfig(temperature=0.0)
for i in range(10):
    print('-'*50)
    response = pipe(prompt, gen_config=gen_config)
    print(response.text)
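Comparing ten printed essays by eye is error-prone, so the loop above could be replaced by a small check that tallies distinct outputs. This is only a sketch: count_distinct_outputs, is_deterministic and fake_generate are hypothetical helper names, not part of lmdeploy, and fake_generate is a stand-in so the snippet runs without a GPU.

```python
from collections import Counter
from typing import Callable


def count_distinct_outputs(generate: Callable[[str], str], prompt: str, runs: int = 10) -> Counter:
    """Call `generate` on the same prompt `runs` times and tally each distinct output."""
    return Counter(generate(prompt) for _ in range(runs))


def is_deterministic(generate: Callable[[str], str], prompt: str, runs: int = 10) -> bool:
    """Return True when every run produced exactly the same text."""
    return len(count_distinct_outputs(generate, prompt, runs)) == 1


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always returns the same text."""
    return prompt.upper()


if __name__ == "__main__":
    counts = count_distinct_outputs(fake_generate, "write an article on open-source")
    print(f"distinct outputs: {len(counts)}")
```

With the pipeline from the script above, the callable would be something like lambda p: pipe(p, gen_config=gen_config).text; more than one distinct output at temperature=0.0 reproduces the problem reported here.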
@lvhan028 I no longer see the issue after your PR #2090.
Thanks a lot!