Python-side timing is noticeably higher than the model's built-in runtime statistics
Environment
- [FastDeploy version]: fastdeploy-linux-gpu-0.0.0
- [Platform]: Linux x64 (Ubuntu 20.04)
- [Hardware]: NVIDIA GeForce RTX 4060 Ti; conda-forge CUDA 11.7, cuDNN 8.4
- [Language]: Python (3.10)
Performance questions
- The built-in benchmark statistics reported by the FastDeploy model do not match the timing measured at the Python level.
- How can I narrow the gap between the built-in average of 57.6966ms and the Python-measured 118ms?
```python
import time
import fastdeploy as fd
import numpy as np
import statistics

if __name__ == '__main__':
    option = fd.RuntimeOption()
    option.use_gpu(0)
    option.use_trt_backend()
    option.trt_option.enable_fp16 = True
    # Dynamic batch: min/opt shape 1x3x640x640, max 40x3x640x640.
    option.trt_option.set_shape('images', [1, 3, 640, 640],
                                [1, 3, 640, 640], [40, 3, 640, 640])
    option.trt_option.serialize_file = 'weights/yolov8m.engine'
    model = fd.vision.detection.YOLOv8('weights/yolov8m.onnx',
                                       runtime_option=option)

    # 20 random HWC uint8 images per batch_predict call.
    ims = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8)
           for _ in range(20)]

    model.enable_record_time_of_runtime()
    costs = []
    for i in range(500):
        # Skip the first 100 iterations as warmup, matching the
        # built-in statistics.
        if 100 <= i:
            begin = time.perf_counter()
        results = model.batch_predict(ims)
        if 100 <= i:
            costs.append(time.perf_counter() - begin)
    model.print_statis_info_of_runtime()
    print(f'{int(1000 * statistics.mean(costs))}ms')
```
```
$ python benchmark.py
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(719)::CreateTrtEngineFromOnnx Detect serialized TensorRT Engine file in weights/yolov8m.engine, will load it directly.
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(108)::LoadTrtCache Build TensorRT Engine from cache file: weights/yolov8m.engine with shape range information as below,
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(111)::LoadTrtCache Input name: images, shape=[-1, 3, -1, -1], min=[1, 3, 640, 640], max=[40, 3, 640, 640]
[INFO] fastdeploy/runtime/runtime.cc(339)::CreateTrtBackend Runtime initialized with Backend::TRT in Device::GPU.
============= Runtime Statis Info(yolov8) =============
Total iterations: 500
Total time of runtime: 29.7184s.
Warmup iterations: 100
Total time of runtime in warmup step: 6.63981s.
Average time of runtime exclude warmup step: 57.6966ms.
118ms
```
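As a sanity check, the built-in average follows directly from the totals in the log (plain arithmetic, no extra instrumentation):

```python
# (total runtime - warmup runtime) / (500 - 100) timed iterations
print((29.7184 - 6.63981) / (500 - 100) * 1000)  # ~57.6966 ms per call
```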
The model's built-in statistics measure only the inference engine's execution time. The Python-side measurement covers the data pre- and post-processing plus the inference engine.
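To make that concrete, here is simple arithmetic over the two averages reported above (the ~60ms split is derived from those numbers, not separately measured):

```python
engine_ms = 57.6966    # built-in average: TensorRT engine execution only
end_to_end_ms = 118.0  # Python-side average: preprocess + infer + postprocess
overhead_ms = end_to_end_ms - engine_ms
# Roughly 60ms of each batch_predict call (20 images) is spent outside the
# engine -- this is the part that GPU-side preprocessing aims to reduce.
print(f'~{overhead_ms:.1f}ms per call outside the inference engine')
```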
YOLOv8's preprocessor currently does not inherit from ProcessorManager, so CV-CUDA acceleration is not supported.
Once that part of the code is adapted, what is the correct way to replace the default preprocessing with CV-CUDA in Python?
Is it enough to initialize the model and call model.preprocessor.use_cuda(True, 0)?
```python
model = fd.vision.detection.YOLOv8(...)
# (no use_cuda call)                     # CPU preprocessing (default)
# model.preprocessor.use_cuda(False, 0)  # CUDA preprocessing
model.preprocessor.use_cuda(True, 0)     # CV-CUDA preprocessing
```
Is it OK to use it this way?
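For comparison, this is how the switch looks today on a model whose preprocessor already inherits ProcessorManager (a minimal sketch assuming the PaddleDetection PPYOLOE API and hypothetical file paths; an adapted YOLOv8 would presumably follow the same pattern):

```python
import fastdeploy as fd

option = fd.RuntimeOption()
option.use_gpu(0)

# PPYOLOE's preprocessor inherits ProcessorManager, so use_cuda() is available.
model = fd.vision.detection.PPYOLOE('model.pdmodel', 'model.pdiparams',
                                    'infer_cfg.yml', runtime_option=option)

# Default (no call): CPU (OpenCV) preprocessing.
# model.preprocessor.use_cuda(False, 0)  # CUDA preprocessing on GPU 0
model.preprocessor.use_cuda(True, 0)     # CV-CUDA preprocessing on GPU 0
```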