[question] At concurrency 1, measuring RT with 1024-token input and 64-token output, tp=4 yields no significant time benefit over tp=2
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
RT measured with 1024-token input and 64-token output:

| Concurrency | TP2        | TP4       |
| ----------- | ---------- | --------- |
| 1           | 793.57 ms  | 853.66 ms |
| 2           | 1153.99 ms | 820.67 ms |

At concurrency 1, TP4 is even slightly slower than TP2; the benefit of TP4 only shows up at concurrency 2.
Reproduction
Start the server with TP2 or TP4:

```shell
lmdeploy serve api_server /mnt/data/model_hub/Qwen-14B-Chat-Pro/ --backend turbomind --tp 2
lmdeploy serve api_server /mnt/data/model_hub/Qwen-14B-Chat-Pro/ --backend turbomind --tp 4
```

Then run the benchmark at concurrency 1 or 2:

```shell
python3 -u profile_restful_len.py http://127.0.0.1:23333 /mnt/data/model_hub/Qwen-14B-Chat-Pro/ --input_len 1024 --output_len 64 --num_prompts 10 --concurrency 1
python3 -u profile_restful_len.py http://127.0.0.1:23333 /mnt/data/model_hub/Qwen-14B-Chat-Pro/ --input_len 1024 --output_len 64 --num_prompts 10 --concurrency 2
```
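For reference, a minimal sketch of an equivalent single-request (concurrency 1) latency probe, in case `profile_restful_len.py` is unavailable. It assumes the api_server exposes the OpenAI-compatible `/v1/chat/completions` endpoint on port 23333; the model name and prompt construction are placeholders, not taken from `profile_restful_len.py`:

```python
# Minimal latency probe (sketch): measures wall-clock RT of sequential
# requests against lmdeploy's OpenAI-compatible endpoint. The prompt
# below is only roughly 1024 tokens; profile_restful_len.py controls
# token counts exactly.
import time

import requests

URL = "http://127.0.0.1:23333/v1/chat/completions"
payload = {
    "model": "Qwen-14B-Chat-Pro",  # placeholder; query /v1/models for the served name
    "messages": [{"role": "user", "content": "hi " * 1024}],
    "max_tokens": 64,
    "temperature": 0.0,
}

latencies = []
for _ in range(10):  # mirrors --num_prompts 10
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    latencies.append((time.perf_counter() - start) * 1000.0)

print(f"mean RT: {sum(latencies) / len(latencies):.2f} ms")
```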
Environment
```
sys.platform: linux
Python: 3.10.13 (main, Jun 6 2024, 19:28:50) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: PPU-ZW810
CUDA_HOME: /usr/local/PPU_SDK/CUDA_SDK
NVCC: Cuda compilation tools, release 12.2, V12.2.
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.0
PyTorch compiling details: PyTorch built with:
- GCC 11.4
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.2
- Built with CUDA Runtime 12.3
- NVCC architecture flags: -gencode;arch=compute_80,code=sm_80
- CuDNN 8.9.5 (built against CUDA 0.6.4)
- Built with CuDNN 8.6
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.3, CUDNN_VERSION=8.6.0, CXX_COMPILER=/usr/lib/ccache/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=True, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.16.0+704f831
LMDeploy: 0.5.0+
transformers: 4.43.3
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.7.4
triton: 2.2.0
```
Error traceback
No response