
Unreasonable SSD model inference time on Jetson NX

Open Huihuihh opened this issue 5 years ago • 31 comments

Using the SSD model provided with release 1.6, I installed Paddle on the Jetson and ran an inference test:

    import os
    import numpy as np
    from time import time
    import paddle.fluid as fluid

    # excerpt from the SSD infer script: args, image, model_dir, data_args,
    # image_path, reader and nmsed_out are defined earlier in the script
    place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
    #place = fluid.CPUPlace()
    exe = fluid.Executor(place)
    # yapf: disable
    if model_dir:
        def if_exist(var):
            return os.path.exists(os.path.join(model_dir, var.name))
        fluid.io.load_vars(exe, model_dir, predicate=if_exist)
    # yapf: enable
    infer_reader = reader.infer(data_args, image_path)
    feeder = fluid.DataFeeder(place=place, feed_list=[image])

    data = infer_reader()

    # switch network to test mode (i.e. batch norm test mode)
    test_program = fluid.default_main_program().clone(for_test=True)
    detect_time = []
    t0 = time()
    nmsed_out_v, = exe.run(test_program,
                           feed=feeder.feed([[data]]),
                           fetch_list=[nmsed_out],
                           return_numpy=False)
    detect_time.append((time() - t0) * 1000)
    print('detect_time is {} ms'.format(np.average(np.asarray(detect_time))))

I ran inference with both GPU and CPU and measured the time. On the GPU:

W0128 15:40:08.554250  5453 device_context.cc:320] Please NOTE: device: 0, CUDA Capability: 72, Driver API Version: 10.2, Runtime API Version: 10.2
W0128 15:40:08.558723  5453 device_context.cc:328] device: 0, cuDNN Version: 8.0.
detect_time is 1125.3809928894043 ms

On the CPU:

detect_time is 419.86751556396484 ms

As you can see, CPU inference takes less time than GPU, which seems wrong.
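
One caveat about the measurement itself: the snippet above times a single exe.run call, so the GPU figure also includes one-time startup costs (CUDA context initialization, cuDNN autotuning). A sketch, using the same variables as above, that warms up once and then averages over repeated runs:

    # warm up once so one-time costs are excluded from the measurement
    exe.run(test_program, feed=feeder.feed([[data]]),
            fetch_list=[nmsed_out], return_numpy=False)
    # then average the wall time over several runs
    detect_time = []
    for _ in range(10):
        t0 = time()
        exe.run(test_program, feed=feeder.feed([[data]]),
                fetch_list=[nmsed_out], return_numpy=False)
        detect_time.append((time() - t0) * 1000)
    print('avg detect_time is {} ms'.format(np.average(np.asarray(detect_time))))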

Huihuihh avatar Jan 28 '21 08:01 Huihuihh

@Huihuihh Hello, you are currently running inference through the forward pass of the Paddle training framework. The training framework has no Jetson-GPU-specific optimizations, so performance problems are indeed possible. Please try the NV Jetson inference deployment example and call the Paddle-Inference library instead; GPU performance will improve considerably.

OliverLPH avatar Feb 02 '21 05:02 OliverLPH

It looks like Paddle-Inference doesn't support ssd_mobilenet_v1 inference?

Huihuihh avatar Feb 02 '21 05:02 Huihuihh

@Huihuihh Hello, Paddle-Inference does support ssd_mobilenet_v1 inference; the model has been tested on a TX2 and works with batch_size <= 2. Refer to the inference example (Python) and adapt the main prediction code. However, I notice you are using cuDNN 8, i.e. JetPack 4.4? Inference on that version has a known memory-leak issue, which we plan to work around in the next patch release. If you need it urgently, use cuDNN 7 for now, or apply the following workaround:

# disable the fusion passes that cause the memory leak
config.delete_pass("conv_elementwise_add2_act_fuse_pass")
config.delete_pass("conv_elementwise_add_act_fuse_pass")
# enable the TensorRT engine
config.enable_tensorrt_engine()
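
For context, a minimal sketch of where those calls sit when building a predictor with the Paddle-Inference 2.0 Python API (the model path is a placeholder):

from paddle.inference import Config, create_predictor

config = Config("path/to/ssd_model_dir")  # placeholder: non-combined model directory
config.enable_use_gpu(100, 0)             # 100 MB initial GPU memory pool on device 0
# disable the fusion passes that cause the memory leak
config.delete_pass("conv_elementwise_add2_act_fuse_pass")
config.delete_pass("conv_elementwise_add_act_fuse_pass")
# enable the TensorRT engine
config.enable_tensorrt_engine()
predictor = create_predictor(config)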

OliverLPH avatar Feb 02 '21 06:02 OliverLPH

Is that because TensorRT isn't supported? When I compiled Paddle on the Jetson NX I specified -DTENSORRT_ROOT=/usr/lib/aarch64-linux-gnu/, but it doesn't seem to have taken effect.

Error Message Summary:
----------------------
InvalidArgumentError: Pass tensorrt_subgraph_pass has not been registered.
  [Hint: Expected Has(pass_type) == true, but received Has(pass_type):0 != true:1.] (at /root/Paddle/paddle/fluid/framework/ir/pass.h:211)

Huihuihh avatar Feb 07 '21 05:02 Huihuihh

@Huihuihh That's because the current develop and 2.0 branches added a new build option, WITH_TENSORRT, which is OFF by default. You need to pass both of the following:

-DWITH_TENSORRT=ON
-DTENSORRT_ROOT=/usr/lib/aarch64-linux-gnu/

Reference: https://github.com/PaddlePaddle/Paddle/blob/develop/CMakeLists.txt#L31
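
For reference, a minimal configure invocation carrying both options might look like this (all other build options elided; adjust paths for your environment):

cmake .. -DWITH_GPU=ON \
         -DWITH_TENSORRT=ON \
         -DTENSORRT_ROOT=/usr/lib/aarch64-linux-gnu/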

We also provide a prebuilt C++ inference library for JetPack 4.3. If your Jetson runs JetPack 4.3, you can use our prebuilt 2.0-rc1 library. Download link: https://paddle-inference-lib.bj.bcebos.com/2.0.0-rc0-nv-jetson-cuda10-cudnn7.6-trt6/paddle_inference.tgz

Official page: https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html. If you need the 2.0 stable release, keep an eye on the Paddle website; the C++ library download links will be updated there shortly (the builds are already finished).

OliverLPH avatar Feb 07 '21 06:02 OliverLPH

When running cmake, does this line indicate that TensorRT is enabled?

Current TensorRT header is /usr/include/aarch64-linux-gnu/NvInfer.h. Current TensorRT version is v7

Huihuihh avatar Feb 07 '21 06:02 Huihuihh

When running cmake, does this line indicate that TensorRT is enabled?

Current TensorRT header is /usr/include/aarch64-linux-gnu/NvInfer.h. Current TensorRT version is v7

@Huihuihh Yes. Also, after the build completes you can check the paddle_inference_install_dir/version.txt file; it will contain information like the following, and the presence of the TensorRT entries means TensorRT was linked in successfully:

GIT COMMIT ID: xxxxx
WITH_MKL: ON
WITH_MKLDNN: ON
WITH_GPU: ON
CUDA version: 10.0
CUDNN version: v7.6
CXX compiler version: 4.8.5
WITH_TENSORRT: ON
TensorRT version: v6

OliverLPH avatar Feb 07 '21 07:02 OliverLPH

The TensorRT build completed, but when I run inference the board powers itself off. I used:

config.enable_use_gpu(500, 0)
config.delete_pass("conv_elementwise_add2_act_fuse_pass")
config.delete_pass("conv_elementwise_add_act_fuse_pass")
config.enable_tensorrt_engine()

But during execution, the board loses power at this point:

I0223 09:25:06.619712  8257 graph_pattern_detector.cc:101] ---  detected 34 subgraphs
--- Running IR pass [tensorrt_subgraph_pass]
I0223 09:25:06.659562  8257 tensorrt_subgraph_pass.cc:126] ---  detect a sub-graph with 95 nodes
I0223 09:25:06.735900  8257 tensorrt_subgraph_pass.cc:347] Prepare TRT engine (Optimize model structure, Select OP kernel etc). This process may cost a lot of time.

Huihuihh avatar Feb 23 '21 01:02 Huihuihh

Please check whether the power supply is stable; the optimization step in the log does put a fairly heavy load on the GPU.

shangzhizhou avatar Feb 23 '21 02:02 shangzhizhou

Please check whether the power supply is stable; the optimization step in the log does put a fairly heavy load on the GPU.

The power supply is stable; without TensorRT, inference completes successfully.

Huihuihh avatar Feb 23 '21 02:02 Huihuihh

The cuda_linux_demo project bundled with Paddle-Inference-Demo uses TensorRT successfully, but something seems to go wrong after switching to the SSD model.

Huihuihh avatar Feb 23 '21 02:02 Huihuihh

Other workloads may not have pushed the clocks and power draw to their maximum. If possible, try another device, or post your test code and we will try to reproduce the issue.

shangzhizhou avatar Feb 23 '21 02:02 shangzhizhou

I modified the code from the yolov3 example, infer_yolov3.py:

import numpy as np
import argparse
import cv2
from PIL import Image
from time import time

from paddle.fluid.core import AnalysisConfig
from paddle.inference import Config
from paddle.inference import create_predictor

from utils import preprocess, draw_bbox


def init_predictor(args):
    if args.model_dir != "":  # 'is not' tests identity, not equality; use != for strings
        config = Config(args.model_dir)
    else:
        config = Config(args.model_file, args.params_file)

    if args.use_gpu:
        config.enable_use_gpu(1000, 0)
        config.switch_ir_optim()
        config.enable_memory_optim()
        config.enable_tensorrt_engine(workspace_size=1 << 30,
                                      precision_mode=AnalysisConfig.Precision.Float32,
                                      max_batch_size=1,
                                      min_subgraph_size=5,
                                      use_static=False,
                                      use_calib_mode=False)
        config.delete_pass("conv_elementwise_add2_act_fuse_pass")
        config.delete_pass("conv_elementwise_add_act_fuse_pass")
    else:
        # If not specific mkldnn, you can set the blas thread.
        # The thread num should not be greater than the number of cores in the CPU.
        config.set_cpu_math_library_num_threads(4)
        config.delete_pass("conv_elementwise_add2_act_fuse_pass")
        config.delete_pass("conv_elementwise_add_act_fuse_pass")
        config.enable_mkldnn()

    predictor = create_predictor(config)
    return predictor


def run(predictor, img):
    # copy img data to input tensor
    input_names = predictor.get_input_names()
    for i, name in enumerate(input_names):
        input_tensor = predictor.get_input_handle(name)
        input_tensor.reshape(img[i].shape)
        input_tensor.copy_from_cpu(img[i].copy())

    # do the inference
    detect_time = []
    t0 = time()
    predictor.run()
    detect_time.append((time() - t0) * 1000)
    print('detect_time is {} ms'.format(np.average(np.asarray(detect_time))))

    results = []
    # get out data from output tensor
    output_names = predictor.get_output_names()
    for i, name in enumerate(output_names):
        output_tensor = predictor.get_output_handle(name)
        output_data = output_tensor.copy_to_cpu()
        results.append(output_data)
    return results


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_file",
        type=str,
        default="",
        help="Model filename, Specify this when your model is a combined model."
    )
    parser.add_argument(
        "--params_file",
        type=str,
        default="",
        help=
        "Parameter filename, Specify this when your model is a combined model."
    )
    parser.add_argument(
        "--model_dir",
        type=str,
        default="",
        help=
        "Model dir, If you load a non-combined model, specify the directory of the model."
    )
    parser.add_argument("--use_gpu",
                        type=int,
                        default=0,
                        help="Whether use gpu.")
    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()
    img_name = 'dog.jpg'
    save_img_name = 'res.jpg'
    im_size = 300
    pred = init_predictor(args)
    img = cv2.imread(img_name)
    data = preprocess(img, im_size)
    im_shape = np.array([im_size, im_size]).reshape((1, 2)).astype(np.int32)
    result = run(pred, [data, im_shape])
    img = Image.open(img_name).convert('RGB').resize((im_size, im_size))
    draw_bbox(img, result[0], save_name=save_img_name)
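
For reference, the script above can be invoked like this (the model and parameter file names are placeholders for the exported SSD model; --use_gpu=1 takes the GPU/TensorRT branch in init_predictor):

python infer_yolov3.py --model_file=ssd_model/__model__ \
                       --params_file=ssd_model/__params__ \
                       --use_gpu=1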

utils.py

import cv2
import numpy as np
from PIL import Image, ImageDraw


def resize(img, target_size):
    """resize to target size"""
    if not isinstance(img, np.ndarray):
        raise TypeError('image type is not numpy.')
    im_shape = img.shape
    im_size_min = np.min(im_shape[0:2])
    im_size_max = np.max(im_shape[0:2])
    im_scale_x = float(target_size) / float(im_shape[1])
    im_scale_y = float(target_size) / float(im_shape[0])
    img = cv2.resize(img, None, None, fx=im_scale_x, fy=im_scale_y)
    print("cols is", img.shape[1])
    print("rows is", img.shape[0])
    return img


def normalize(img, mean, std):
    img = img / 255.0
    mean = np.array(mean)[np.newaxis, np.newaxis, :]
    std = np.array(std)[np.newaxis, np.newaxis, :]
    img -= mean
    img /= std
    return img


def preprocess(img, img_size):
    mean = [0.5, 0.5, 0.5]
    std = [0.5, 0.5, 0.5]
    img = resize(img, img_size)
    img = img[:, :, ::-1].astype('float32')  # bgr -> rgb
    img = normalize(img, mean, std)
    img = img.transpose((2, 0, 1))  # hwc -> chw
    return img[np.newaxis, :]


def draw_bbox(img, result, threshold=0.5, save_name='res.jpg'):
    """draw bbox"""
    draw = ImageDraw.Draw(img)
    for res in result:
        cat_id, score, bbox = res[0], res[1], res[2:]
        if score < threshold:
            continue
        xmin, ymin, xmax, ymax = bbox
        # scale normalized box coordinates back to the 300x300 network input size
        xmin = xmin * 300
        ymin = ymin * 300
        xmax = xmax * 300
        ymax = ymax * 300
        draw.line([(xmin, ymin), (xmin, ymax), (xmax, ymax), (xmax, ymin),
                   (xmin, ymin)],
                  width=2,
                  fill=(255, 0, 0))
        print('category id is {}, score is {}, bbox is {}'.format(cat_id, score, bbox))
    img.save(save_name, quality=95)

Huihuihh avatar Feb 23 '21 02:02 Huihuihh

Please share the model.

shangzhizhou avatar Feb 23 '21 03:02 shangzhizhou

Hello, please refer to the Jetson developer guide https://docs.nvidia.com/jetson/l4t/index.html%23page/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide/power_management_jetson_xavier.html%23wwpID0E0WD0HA and try setting the GPU clock frequency lower.
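
For example, the power mode and clocks can be inspected and changed with the standard L4T tools (mode numbers vary by board; see the guide above):

sudo nvpmodel -q            # query the current power mode
sudo nvpmodel -m 1          # switch to a different (e.g. lower-power) mode
sudo jetson_clocks --show   # inspect the current CPU/GPU/EMC clocks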

shangzhizhou avatar Feb 23 '21 03:02 shangzhizhou

Model: link: https://pan.baidu.com/s/1WQlrywqDSnZ_M8aN2SsARw extraction code: ldcx

Huihuihh avatar Feb 23 '21 03:02 Huihuihh

Please first lower the clock frequency and the power mode, then test again.

shangzhizhou avatar Feb 23 '21 04:02 shangzhizhou

I checked, and the GPU frequency already seems to be at its minimum:

root@jetson-desktop:~# jetson_clocks --show
SOC family:tegra194  Machine:NVIDIA Jetson Xavier NX Developer Kit
Online CPUs: 0-1
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu1: Online=1 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu2: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu3: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu4: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
cpu5: Online=0 Governor=schedutil MinFreq=1190400 MaxFreq=1907200 CurrentFreq=1907200 IdleStates: C1=1 c6=1
GPU MinFreq=114750000 MaxFreq=1109250000 CurrentFreq=114750000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=0
Fan: speed=0
NV Power Mode: MODE_15W_2CORE

Huihuihh avatar Feb 23 '21 06:02 Huihuihh

After lowering the power mode, it gets past the point where it previously failed. NV Power Mode: MODE_10W_2CORE

Huihuihh avatar Feb 23 '21 06:02 Huihuihh

But now it seems to hang at a later point:

I0223 14:23:17.323590  9258 analysis_predictor.cc:139] Profiler is deactivated, and no profiling report will be generated.
I0223 14:23:17.374783  9258 analysis_predictor.cc:474] TensorRT subgraph engine is enabled
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [adaptive_pool2d_convert_global_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [shuffle_channel_detect_pass]
--- Running IR pass [quant_conv2d_dequant_fuse_pass]
--- Running IR pass [delete_quant_dequant_op_pass]
--- Running IR pass [delete_quant_dequant_filter_op_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [skip_layernorm_fuse_pass]
--- Running IR pass [unsqueeze2_eltwise_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
I0223 14:23:17.707765  9258 graph_pattern_detector.cc:101] ---  detected 22 subgraphs
--- Running IR pass [squeeze2_matmul_fuse_pass]
--- Running IR pass [reshape2_matmul_fuse_pass]
--- Running IR pass [flatten2_matmul_fuse_pass]
--- Running IR pass [map_matmul_to_mul_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
I0223 14:23:17.782774  9258 graph_pattern_detector.cc:101] ---  detected 34 subgraphs
--- Running IR pass [tensorrt_subgraph_pass]
I0223 14:23:17.822729  9258 tensorrt_subgraph_pass.cc:126] ---  detect a sub-graph with 95 nodes
I0223 14:23:17.858089  9258 tensorrt_subgraph_pass.cc:347] Prepare TRT engine (Optimize model structure, Select OP kernel etc). This process may cost a lot of time.
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
I0223 14:23:48.898483  9258 graph_pattern_detector.cc:101] ---  detected 2 subgraphs
--- Running analysis [ir_params_sync_among_devices_pass]
I0223 14:23:48.919353  9258 ir_params_sync_among_devices_pass.cc:45] Sync params from CPU to GPU

Huihuihh avatar Feb 23 '21 06:02 Huihuihh

The value passed to config.enable_use_gpu (the initial GPU memory pool size, in MB) needs to be set smaller.
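
For example (a hypothetical smaller value; the first argument of enable_use_gpu is the initial GPU memory pool size in MB, the second is the device id):

config.enable_use_gpu(100, 0)  # 100 MB initial pool instead of 1000 MB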

Huihuihh avatar Feb 23 '21 07:02 Huihuihh

Thanks for the feedback!

shangzhizhou avatar Feb 23 '21 08:02 shangzhizhou

@Huihuihh Hi, has the problem been resolved?

wangye707 avatar Feb 23 '21 08:02 wangye707

It has been resolved.

Huihuihh avatar Feb 23 '21 08:02 Huihuihh

@Huihuihh Hi, when you tested with TensorRT enabled, did you get correct prediction results?

tianhechao avatar Mar 17 '21 08:03 tianhechao

@Huihuihh Hi, when you tested with TensorRT enabled, did you get correct prediction results?

Yes, the prediction results are correct.

Huihuihh avatar Mar 17 '21 08:03 Huihuihh

@Huihuihh Hello, Paddle-Inference does support ssd_mobilenet_v1 inference; the model has been tested on a TX2 and works with batch_size <= 2. Refer to the inference example (Python) and adapt the main prediction code. However, I notice you are using cuDNN 8, i.e. JetPack 4.4? Inference on that version has a known memory-leak issue, which we plan to work around in the next patch release. If you need it urgently, use cuDNN 7 for now, or apply the following workaround:

# disable the fusion passes that cause the memory leak
config.delete_pass("conv_elementwise_add2_act_fuse_pass")
config.delete_pass("conv_elementwise_add_act_fuse_pass")
# enable the TensorRT engine
config.enable_tensorrt_engine()

@OliverLPH Hi, on a Jetson TX2 with the JetPack 4.3 and JetPack 4.4 paddle-inference libraries (the 2.0rc1 / 2.0.1 C++ libraries), I cannot get correct predictions with GPU or TensorRT enabled, while inference without the GPU gives normal results. Have you run any related tests on your side?

tianhechao avatar Mar 17 '21 08:03 tianhechao

@Huihuihh Hello, Paddle-Inference does support ssd_mobilenet_v1 inference; the model has been tested on a TX2 and works with batch_size <= 2. Refer to the inference example (Python) and adapt the main prediction code. However, I notice you are using cuDNN 8, i.e. JetPack 4.4? Inference on that version has a known memory-leak issue, which we plan to work around in the next patch release. If you need it urgently, use cuDNN 7 for now, or apply the following workaround:

# disable the fusion passes that cause the memory leak
config.delete_pass("conv_elementwise_add2_act_fuse_pass")
config.delete_pass("conv_elementwise_add_act_fuse_pass")
# enable the TensorRT engine
config.enable_tensorrt_engine()

@OliverLPH Hi, on a Jetson TX2 with the JetPack 4.3 and JetPack 4.4 paddle-inference libraries (the 2.0rc1 / 2.0.1 C++ libraries), I cannot get correct predictions with GPU or TensorRT enabled, while inference without the GPU gives normal results. Have you run any related tests on your side?

I think I have run into the problem you describe. I had two Jetsons at the time, a Jetson Nano and an NX. GPU prediction worked correctly on the NX but not on the Nano. I had compiled Paddle on the NX and installed the resulting whl directly on the Nano, so I suspected some mismatch and that it might need to be compiled on the Nano itself, but a Paddle built on the Nano still had the same GPU prediction problem.

Huihuihh avatar Mar 17 '21 08:03 Huihuihh

@tianhechao Hello, failing GPU and TensorRT inference can have the following causes; please check:

  1. Whether the JetPack version installed on your machine matches the JetPack version of the downloaded paddle-inference library.
  2. Whether the downloaded paddle-inference library is built for the Volta architecture (the variant offered for the TX2), or is an all-architecture build.

The following command lists the sm_ symbols in the libpaddle_inference.so of the library you downloaded; for the mapping between these symbols and architectures, see Matching CUDA arch and CUDA gencode for various NVIDIA architectures:

cuobjdump --dump-ptx libpaddle_inference.so | grep sm_

OliverLPH avatar Mar 17 '21 09:03 OliverLPH

@OliverLPH Hello, there is nothing wrong with the downloaded libraries. I also tried the pretrained ResNet50 classification model from the paddle-trt examples in Paddle-Inference-Demo; it infers the class correctly, but detection models such as yolov3 cannot be used normally (the detection model is the pretrained one supplied with the yolov3 example in Paddle-Inference-Demo). JetPack 4.4 on a TX2, with the inference library downloaded from the official site https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/05_inference_deployment/inference/build_and_install_lib_cn.html. I tried both nv_jetson_cuda10.2_cudnn8_trt7_all (jetpack4.4/4.5) and nv_jetson_cuda10.2_cudnn8_trt7_tx2 (jetpack4.4/4.5), and a library I compiled myself gives the same result.

tianhechao avatar Mar 17 '21 14:03 tianhechao