
Cuda synchronize alternative for profiling

Open aimilefth opened this issue 3 years ago • 8 comments

Greetings,

I am currently using TF-TRT and I want to measure the performance of my models (latency, throughput).

The TensorRT C++ API supports synchronization via the CUDA events API: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-events

On top of that, PyTorch provides torch.cuda.synchronize(): https://pytorch.org/docs/stable/generated/torch.cuda.synchronize.html

However, in the TF-TRT docs I can't find anything similar, which in my opinion is essential for correctly measuring performance metrics.

Have I missed anything or are there plans to integrate such functionality?

Thank you

aimilefth avatar Jul 13 '22 14:07 aimilefth

Hi @aimilefth, you are correct on all counts. This is critical for measuring performance in TensorFlow; however, the APIs do not currently exist in TF (not just TF-TRT). We are in the process of adding such APIs. @DEKHTIARJonathan can add more.

You can also check out the benchmarking scripts for how TF-TRT overcomes this currently.

ncomly-nvidia avatar Jul 18 '22 15:07 ncomly-nvidia

@ncomly-nvidia I was looking at the TensorRT ResNet50 benchmarking example here. The throughput seems exceptionally high, almost 250,000 IPS on the T4, whereas MLPerf reports 39,000 IPS for the A100, which is a better GPU.

Is the use of time.perf_counter() correct here? - just putting it around the inference function?
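
Generalizing, the timing pattern in question looks like this (a sketch with a dummy workload standing in for the inference call; `benchmark` and `fake_infer` are illustrative names, not part of the TF-TRT scripts):

```python
import time

def benchmark(fn, warmup=10, iters=100):
    """Return (mean latency in seconds, throughput in calls/sec) for fn()."""
    for _ in range(warmup):        # warmup excludes one-time setup costs
        fn()
    start = time.perf_counter()    # timer wraps the whole timed loop
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters, iters / elapsed

def fake_infer():
    time.sleep(0.001)  # ~1 ms of simulated work standing in for the model

latency, ips = benchmark(fake_infer)
```

Note that on a GPU this harness only measures what it claims if each timed call actually blocks until the work is done, which is exactly the synchronization question this thread is about.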

slai-natanijel avatar Jan 12 '23 18:01 slai-natanijel

@slai-natanijel what is the input size for MLPerf? TF-TRT uses MNIST (a very small input) for demo purposes. We chose MNIST because it's easy to download and use; the result is clearly not comparable to the performance you would get with an input 10x larger per side (so 100x more pixels).

DEKHTIARJonathan avatar Jan 12 '23 20:01 DEKHTIARJonathan

@DEKHTIARJonathan Ah yes, you are right - MLPerf uses 224x224x3 images. However, when I tested on an A100 with this image size, I get around 700,000 IPS (expected ~30,000 IPS) when I wrap time.perf_counter() around an inference call.

So how do your benchmarking scripts overcome the synchronisation issue currently?

slai-natanijel avatar Jan 13 '23 11:01 slai-natanijel

@slai-natanijel let me guess... Did you call '.numpy()' or resynchronize the GPU after the computation before the final perf_counter() call?

Don't forget that TF executes eagerly and dispatches GPU work asynchronously, which means there is no guarantee the computation is actually finished when 'result = model(data)' returns.
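
The pitfall can be illustrated without a GPU at all. In this sketch (all names are illustrative), a single-worker thread pool stands in for the asynchronous GPU stream: the "model" enqueues work and returns immediately, so naive timing measures only the dispatch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)  # stands in for the GPU stream

def model(x):
    # Dispatch ~50 ms of work asynchronously and return a handle at once,
    # much like an eager TF op that enqueues a GPU kernel and returns.
    return pool.submit(lambda: (time.sleep(0.05), x * 2)[1])

# Naive timing: measures only the dispatch, not the work itself.
start = time.perf_counter()
handle = model(3)
dispatch_time = time.perf_counter() - start

handle.result()  # drain the queue so the next measurement starts clean

# Correct timing: block until the result materializes (the .numpy() analog).
start = time.perf_counter()
result = model(3).result()
synced_time = time.perf_counter() - start
```

Here `dispatch_time` is a few microseconds while `synced_time` reflects the full ~50 ms of work, which is the same gap that produced the implausible IPS numbers above.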

DEKHTIARJonathan avatar Jan 13 '23 17:01 DEKHTIARJonathan

@DEKHTIARJonathan
I tried the following:

start = time.perf_counter()
pred = func(x)['predictions'].numpy()
end = time.perf_counter()

where func(x) is an inference call to TensorRT. I get more reasonable IPS numbers with the above code, although I can't estimate how much overhead is added by the dictionary access and .numpy().

slai-natanijel avatar Jan 16 '23 14:01 slai-natanijel

@slai-natanijel actually it's a very good point ;) And the overhead is a lot... Even worse, it's highly variable due to the nature of memcpyDtoH...

But you're in luck my friend :)

We actually are adding a feature in TensorFlow right now to address this issue: https://github.com/tensorflow/community/pull/434

Now in the meantime, you can use a little bit of TensorFlow dark magic to minimize that overhead:

import tensorflow as tf  # needed for tf.constant below

def force_gpu_resync(func):
    p = tf.constant(0.)  # Create small tensor to force GPU resync

    def wrapper(*args, **kwargs):
        rslt = func(*args, **kwargs)
        (p + 1.).numpy()  # Sync the GPU
        return rslt

    return wrapper

model = ...  # a TF function, Eager Function, TF-TRT converted model, etc.
model = force_gpu_resync(model)

It adds only very minor overhead; until the RFC above is merged, it's the best you can do.

@slai-natanijel may I ask which company you work for? That way we can follow up with you.

DEKHTIARJonathan avatar Jan 16 '23 17:01 DEKHTIARJonathan

Great - I'll be watching the sync API! I tried your code snippet - I think it works fine, although there is no noticeable difference in performance compared to the .numpy() method. I guess if the output tensor were large, we'd see a bigger difference. My email is [email protected]

slai-natanijel avatar Jan 18 '23 19:01 slai-natanijel