SSD model inference time on Jetson NX is unreasonable
Using the SSD model provided in release 1.6, I installed Paddle on the Jetson and ran an inference test:
```python
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
# place = fluid.CPUPlace()
exe = fluid.Executor(place)
# yapf: disable
if model_dir:
    def if_exist(var):
        return os.path.exists(os.path.join(model_dir, var.name))
    fluid.io.load_vars(exe, model_dir, predicate=if_exist)
# yapf: enable
infer_reader = reader.infer(data_args, image_path)
feeder = fluid.DataFeeder(place=place, feed_list=[image])
data = infer_reader()
# switch network to test mode (i.e. batch norm test mode)
test_program = fluid.default_main_program().clone(for_test=True)
detect_time = []
t0 = time()
nmsed_out_v, = exe.run(test_program,
                       feed=feeder.feed([[data]]),
                       fetch_list=[nmsed_out],
                       return_numpy=False)
detect_time.append((time() - t0) * 1000)
print('detect_time is {} ms'.format(np.average(np.asarray(detect_time))))
```
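As an aside, a single timed run like the one above also measures one-off setup cost (kernel compilation, cuDNN autotuning). A minimal sketch of a warmed-up, averaged measurement; `run_once` is a placeholder for the `exe.run(...)` call:

```python
from time import time

import numpy as np

def benchmark(run_once, warmup=5, repeats=20):
    """Average the latency of a callable after discarding warm-up iterations."""
    for _ in range(warmup):
        run_once()  # first runs pay one-off kernel/autotune setup cost
    times_ms = []
    for _ in range(repeats):
        t0 = time()
        run_once()
        times_ms.append((time() - t0) * 1000)
    return float(np.average(times_ms))

# usage (hypothetical): avg_ms = benchmark(lambda: exe.run(test_program, ...))
```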
I ran inference on both GPU and CPU and measured the time. GPU inference:
W0128 15:40:08.554250 5453 device_context.cc:320] Please NOTE: device: 0, CUDA Capability: 72, Driver API Version: 10.2, Runtime API Version: 10.2
W0128 15:40:08.558723 5453 device_context.cc:328] device: 0, cuDNN Version: 8.0.
detect_time is 1125.3809928894043 ms
CPU inference:
detect_time is 419.86751556396484 ms
As you can see, CPU inference is even faster than GPU, which seems unreasonable.
@Huihuihh Hello, you are currently running inference through the forward pass of the Paddle training framework, which has no specific optimizations for the Jetson GPU, so performance can indeed be poor. You can try the NV Jetson deployment example and call the Paddle-Inference library instead; GPU performance will improve significantly.
It looks like Paddle-Inference does not support ssd_mobilenet_v1 inference?
@Huihuihh Hello, Paddle-Inference does support ssd_mobilenet_v1 inference; the model has been tested on a TX2 and works with batch_size <= 2. You can refer to the inference example (Python) and adapt the main prediction code. However, I notice you are using cuDNN 8, i.e. JetPack 4.4? Inference on that version has a memory-leak issue, which we plan to work around in the next minor release. If you need it urgently, either use cuDNN 7 for now, or bypass the problem as follows:
```python
# disable the fusion passes that cause the memory leak
config.delete_pass("conv_elementwise_add2_act_fuse_pass")
config.delete_pass("conv_elementwise_add_act_fuse_pass")
# enable the TensorRT engine
config.enable_tensorrt_engine()
```
Is this because TensorRT is not supported? When I compiled Paddle on the Jetson NX I specified -DTENSORRT_ROOT=/usr/lib/aarch64-linux-gnu/, but it doesn't seem to have had any effect.
Error Message Summary:
----------------------
InvalidArgumentError: Pass tensorrt_subgraph_pass has not been registered.
[Hint: Expected Has(pass_type) == true, but received Has(pass_type):0 != true:1.] (at /root/Paddle/paddle/fluid/framework/ir/pass.h:211)
@Huihuihh
This is because the current develop branch and the 2.0 release added a new build option WITH_TENSORRT, which is off by default. You need to pass both:
-DWITH_TENSORRT=ON
-DTENSORRT_ROOT=/usr/lib/aarch64-linux-gnu/
参考:https://github.com/PaddlePaddle/Paddle/blob/develop/CMakeLists.txt#L31
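Putting the two flags together, a build invocation might look like the following (a sketch only; any other Jetson-specific cmake options you were already using still apply):

```shell
# from the Paddle build directory; both flags are required for TensorRT support
cmake .. \
  -DWITH_GPU=ON \
  -DWITH_TENSORRT=ON \
  -DTENSORRT_ROOT=/usr/lib/aarch64-linux-gnu/
```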
In addition, we currently provide a prebuilt C++ inference library for JetPack 4.3. If your Jetson runs JetPack 4.3, you can use our prebuilt 2.0-rc1 library. Download link: https://paddle-inference-lib.bj.bcebos.com/2.0.0-rc0-nv-jetson-cuda10-cudnn7.6-trt6/paddle_inference.tgz
Official page: https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html If you need the official 2.0 release, keep an eye on the Paddle website; the C++ library download links will be updated there soon (the build has already been completed).
When running cmake, does this line indicate that TensorRT is enabled?
Current TensorRT header is /usr/include/aarch64-linux-gnu/NvInfer.h. Current TensorRT version is v7
@Huihuihh
Yes. Also, after the build completes you can check the paddle_inference_install_dir/version.txt file; if it contains TensorRT information like the following, TensorRT was linked in successfully:
```
GIT COMMIT ID: xxxxx
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: ON
CUDA version: 10.0
CUDNN version: v7.6
CXX compiler version: 4.8.5
WITH_TENSORRT: ON
TensorRT version: v6
```
TensorRT built successfully, but when I run inference the device powers off by itself. I used:
```python
config.enable_use_gpu(500, 0)
config.delete_pass("conv_elementwise_add2_act_fuse_pass")
config.delete_pass("conv_elementwise_add_act_fuse_pass")
config.enable_tensorrt_engine()
```
When running, the device powers off at this point:
I0223 09:25:06.619712 8257 graph_pattern_detector.cc:101] --- detected 34 subgraphs
--- Running IR pass [tensorrt_subgraph_pass]
I0223 09:25:06.659562 8257 tensorrt_subgraph_pass.cc:126] --- detect a sub-graph with 95 nodes
I0223 09:25:06.735900 8257 tensorrt_subgraph_pass.cc:347] Prepare TRT engine (Optimize model structure, Select OP kernel etc). This process may cost a lot of time.
Please check whether the power supply is stable; the optimization step shown in the log does involve a fairly heavy GPU workload.
The power supply is stable, because inference runs successfully when TensorRT is not used.
The cuda_linux_demo project bundled with Paddle-Inference-Demo can use TensorRT successfully, but something seems to go wrong after switching to the SSD model.
Other scenarios may not push the clock frequency and power draw to their maximum. If possible, try another device, or send us your test code and we will try to reproduce it.
I modified the code from the yolov3 demo.
infer_yolov3.py:
```python
import numpy as np
import argparse
import cv2
from PIL import Image
from time import time
from paddle.fluid.core import AnalysisConfig
from paddle.inference import Config
from paddle.inference import create_predictor

from utils import preprocess, draw_bbox


def init_predictor(args):
    if args.model_dir != "":
        config = Config(args.model_dir)
    else:
        config = Config(args.model_file, args.params_file)

    if args.use_gpu:
        config.enable_use_gpu(1000, 0)
        config.switch_ir_optim()
        config.enable_memory_optim()
        config.enable_tensorrt_engine(workspace_size=1 << 30,
                                      precision_mode=AnalysisConfig.Precision.Float32,
                                      max_batch_size=1,
                                      min_subgraph_size=5,
                                      use_static=False,
                                      use_calib_mode=False)
        config.delete_pass("conv_elementwise_add2_act_fuse_pass")
        config.delete_pass("conv_elementwise_add_act_fuse_pass")
    else:
        # If not using mkldnn, you can set the blas thread num instead.
        # The thread num should not be greater than the number of cores in the CPU.
        config.set_cpu_math_library_num_threads(4)
        config.delete_pass("conv_elementwise_add2_act_fuse_pass")
        config.delete_pass("conv_elementwise_add_act_fuse_pass")
        config.enable_mkldnn()

    predictor = create_predictor(config)
    return predictor


def run(predictor, img):
    # copy img data to input tensors
    input_names = predictor.get_input_names()
    for i, name in enumerate(input_names):
        input_tensor = predictor.get_input_handle(name)
        input_tensor.reshape(img[i].shape)
        input_tensor.copy_from_cpu(img[i].copy())

    # do the inference
    detect_time = []
    t0 = time()
    predictor.run()
    detect_time.append((time() - t0) * 1000)
    print('detect_time is {} ms'.format(np.average(np.asarray(detect_time))))

    results = []
    # get out data from output tensors
    output_names = predictor.get_output_names()
    for i, name in enumerate(output_names):
        output_tensor = predictor.get_output_handle(name)
        output_data = output_tensor.copy_to_cpu()
        results.append(output_data)
    return results


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_file",
        type=str,
        default="",
        help="Model filename. Specify this when your model is a combined model.")
    parser.add_argument(
        "--params_file",
        type=str,
        default="",
        help="Parameter filename. Specify this when your model is a combined model.")
    parser.add_argument(
        "--model_dir",
        type=str,
        default="",
        help="Model dir. If you load a non-combined model, specify the directory of the model.")
    parser.add_argument("--use_gpu", type=int, default=0, help="Whether to use gpu.")
    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()
    img_name = 'dog.jpg'
    save_img_name = 'res.jpg'
    im_size = 300
    pred = init_predictor(args)
    img = cv2.imread(img_name)
    data = preprocess(img, im_size)
    im_shape = np.array([im_size, im_size]).reshape((1, 2)).astype(np.int32)
    result = run(pred, [data, im_shape])
    img = Image.open(img_name).convert('RGB').resize((im_size, im_size))
    draw_bbox(img, result[0], save_name=save_img_name)
```
utils.py:
```python
import cv2
import numpy as np
from PIL import Image, ImageDraw


def resize(img, target_size):
    """resize to target size"""
    if not isinstance(img, np.ndarray):
        raise TypeError('image type is not numpy.')
    im_shape = img.shape
    im_size_min = np.min(im_shape[0:2])
    im_size_max = np.max(im_shape[0:2])
    im_scale_x = float(target_size) / float(im_shape[1])
    im_scale_y = float(target_size) / float(im_shape[0])
    img = cv2.resize(img, None, None, fx=im_scale_x, fy=im_scale_y)
    print("cols is", img.shape[1])
    print("rows is", img.shape[0])
    return img


def normalize(img, mean, std):
    img = img / 255.0
    mean = np.array(mean)[np.newaxis, np.newaxis, :]
    std = np.array(std)[np.newaxis, np.newaxis, :]
    img -= mean
    img /= std
    return img


def preprocess(img, img_size):
    mean = [0.5, 0.5, 0.5]
    std = [0.5, 0.5, 0.5]
    img = resize(img, img_size)
    img = img[:, :, ::-1].astype('float32')  # bgr -> rgb
    img = normalize(img, mean, std)
    img = img.transpose((2, 0, 1))  # hwc -> chw
    return img[np.newaxis, :]


def draw_bbox(img, result, threshold=0.5, save_name='res.jpg'):
    """draw bbox"""
    draw = ImageDraw.Draw(img)
    for res in result:
        cat_id, score, bbox = res[0], res[1], res[2:]
        if score < threshold:
            continue
        xmin, ymin, xmax, ymax = bbox
        xmin = xmin * 300
        ymin = ymin * 300
        xmax = xmax * 300
        ymax = ymax * 300
        draw.line([(xmin, ymin), (xmin, ymax), (xmax, ymax), (xmax, ymin),
                   (xmin, ymin)],
                  width=2,
                  fill=(255, 0, 0))
        print('category id is {}, score is {}, bbox is {}'.format(cat_id, score, bbox))
    img.save(save_name, quality=95)
```
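The normalization in utils.py maps a pixel value p to (p/255 - mean) / std per channel; with mean = std = 0.5 this sends 255 to 1.0 and 0 to -1.0. A quick self-contained check of that math:

```python
import numpy as np

def normalize(img, mean, std):
    # same math as utils.py: scale to [0, 1], then standardize per channel
    img = img / 255.0
    mean = np.array(mean)[np.newaxis, np.newaxis, :]
    std = np.array(std)[np.newaxis, np.newaxis, :]
    img -= mean
    img /= std
    return img

img = np.full((2, 2, 3), 255.0)  # an all-white 2x2 patch, HWC layout
out = normalize(img, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(out[0, 0, 0])  # (255/255 - 0.5) / 0.5 = 1.0
```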
Please provide the model.
Hello, please refer to the Jetson developer guide https://docs.nvidia.com/jetson/l4t/index.html%23page/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide/power_management_jetson_xavier.html%23wwpID0E0WD0HA and try setting the GPU clock frequency lower.
Model: link: https://pan.baidu.com/s/1WQlrywqDSnZ_M8aN2SsARw extraction code: ldcx
Please first lower the clock frequency and the power mode, then test again.
I checked, and the GPU frequency already seems to be at its minimum:
```
root@jetson-desktop:~# jetson_clocks --show
SOC family:tegra194 Machine:NVIDIA Jetson Xavier NX Developer Kit
Online CPUs: 0-1
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu2: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu3: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu4: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu5: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
GPU MinFreq=114750000 MaxFreq=1109250000 CurrentFreq=114750000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=0
Fan: speed=0
NV Power Mode: MODE_15W_2CORE
```
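For reference, the power mode and clocks on a Jetson are switched with nvpmodel and jetson_clocks; the mode ID below is illustrative, since the set of predefined modes is board-specific (check with `nvpmodel -q` or /etc/nvpmodel.conf):

```shell
# query the current power mode
sudo nvpmodel -q
# switch to another predefined power mode (ID is board-specific)
sudo nvpmodel -m 1
# pin clocks to the maximum allowed by the current power mode
sudo jetson_clocks
```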
After lowering the power mode to NV Power Mode: MODE_10W_2CORE, it seems to get past the point where it previously failed,
but now it appears to get stuck further on:
I0223 14:23:17.323590 9258 analysis_predictor.cc:139] Profiler is deactivated, and no profiling report will be generated.
I0223 14:23:17.374783 9258 analysis_predictor.cc:474] TensorRT subgraph engine is enabled
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [adaptive_pool2d_convert_global_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [shuffle_channel_detect_pass]
--- Running IR pass [quant_conv2d_dequant_fuse_pass]
--- Running IR pass [delete_quant_dequant_op_pass]
--- Running IR pass [delete_quant_dequant_filter_op_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [skip_layernorm_fuse_pass]
--- Running IR pass [unsqueeze2_eltwise_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
I0223 14:23:17.707765 9258 graph_pattern_detector.cc:101] --- detected 22 subgraphs
--- Running IR pass [squeeze2_matmul_fuse_pass]
--- Running IR pass [reshape2_matmul_fuse_pass]
--- Running IR pass [flatten2_matmul_fuse_pass]
--- Running IR pass [map_matmul_to_mul_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
I0223 14:23:17.782774 9258 graph_pattern_detector.cc:101] --- detected 34 subgraphs
--- Running IR pass [tensorrt_subgraph_pass]
I0223 14:23:17.822729 9258 tensorrt_subgraph_pass.cc:126] --- detect a sub-graph with 95 nodes
I0223 14:23:17.858089 9258 tensorrt_subgraph_pass.cc:347] Prepare TRT engine (Optimize model structure, Select OP kernel etc). This process may cost a lot of time.
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
I0223 14:23:48.898483 9258 graph_pattern_detector.cc:101] --- detected 2 subgraphs
--- Running analysis [ir_params_sync_among_devices_pass]
I0223 14:23:48.919353 9258 ir_params_sync_among_devices_pass.cc:45] Sync params from CPU to GPU
The value passed to config.enable_use_gpu needs to be set smaller.
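For example, a configuration fragment with a smaller initial pool (the 100 MB figure is illustrative):

```python
# first argument is the initial GPU memory pool size in MB; on a Jetson,
# where CPU and GPU share physical memory, a large pool can starve the system
config.enable_use_gpu(100, 0)
```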
Thanks for the feedback~
@Huihuihh Hello, is the issue resolved now?
Resolved.
@Huihuihh Hello, when you tested with TensorRT enabled, did you get correct prediction results?
Yes, the prediction results are correct.
@OliverLPH Hello, on a Jetson TX2 with the JetPack 4.3 and JetPack 4.4 Paddle-Inference libraries (2.0rc1 / 2.0.1 C++ libraries), I cannot predict correctly with either GPU or TensorRT enabled; inference results are correct only without GPU. Have you run any related tests on your side?
I think I ran into this problem too. I had two Jetsons, a Jetson Nano and an NX. GPU prediction worked on the NX but not on the Nano. Since I had compiled Paddle on the NX and installed the resulting whl directly on the Nano, I suspected a mismatch and that it needed to be compiled on the Nano itself, but even with Paddle compiled on the Nano, GPU prediction still failed.
@tianhechao Hello, GPU and TensorRT failing to produce correct inference results can have the following causes; please check:
- whether the JetPack version installed on your machine matches the JetPack version of the paddle-inference library you downloaded;
- whether the paddle-inference library you downloaded targets the Volta architecture (listed for the TX2) or all architectures.
The following command lists the sm_ symbols in the paddle-inference library's .so; for which architecture each symbol corresponds to, refer to "Matching CUDA arch and CUDA gencode for various NVIDIA architectures":
```shell
cuobjdump --dump-ptx libpaddle_inference.so | grep sm_
```
@OliverLPH Hello, the downloaded library is fine. I also tried the pretrained ResNet50 classification model from the Paddle-TRT example in Paddle-Inference-Demo, and it infers the class correctly. But detection models such as yolov3 do not work (using the pretrained model provided by the yolov3 example in Paddle-Inference-Demo). JetPack 4.4, TX2, inference library downloaded from the official site https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/05_inference_deployment/inference/build_and_install_lib_cn.html; I tried both nv_jetson_cuda10.2_cudnn8_trt7_all (JetPack 4.4/4.5) and nv_jetson_cuda10.2_cudnn8_trt7_tx2 (JetPack 4.4/4.5), and a library I compiled myself gives the same result.