
TF Java 0.3.1 shows a performance degradation on GPU compared to v 0.2.0 when loading Hugging Face models

Open wolliq opened this issue 4 years ago • 8 comments

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint 20.1 (Ubuntu 20.04 LTS)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): TF Java 0.3.1 (TF 2.4.1)
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.0 / 8.0.4
  • GPU model and memory: GeForce GTX 1060 computeCapability: 6.1 coreClock: 1.6705GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s

You can collect some of this information using our environment capture script You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior Using the TF Java bindings version 0.3.1 degrades performance by a factor of about 3x on GPU compared to version 0.2.0.

Describe the expected behavior Equal, and ideally better, performance when migrating to newer versions.

Code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. Performance tests are currently ongoing to validate the issue. We'll update with more info asap. https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

wolliq avatar May 30 '21 07:05 wolliq

I've heard stories about cuDNN 8.x slowing things down. Could you try with cuDNN 7.x?

saudet avatar May 30 '21 08:05 saudet

Do you know if it's during tensor creation/access, model inference (and is that model trained in Python or Java), training or some other place? Also is this across multiple different models or just a single one? Finally is this slowdown observed for a single run, or does it persist after the JVM has warmed up the codepath?

Craigacp avatar Jun 01 '21 01:06 Craigacp

Hi @Craigacp @saudet

I can answer some of the questions based on my personal tests:

  • Model inference (I didn't time the rest between the 0.2.x and 0.3.x releases on GPU)
  • I have tested inference with BERT, DistilBERT, and RoBERTa models
  • Since there is sometimes a warm-up, I always run every experiment 5-10 times, depending on how heavy it is. So not only was the total average 2x-3x slower, but each individual run was also 2x-3x slower (the heavier the model, the worse the slowdown). A sketch of this kind of timing harness follows this list.
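
For reference, here is a minimal sketch of such a warm-up-then-measure harness in plain Java. The `Runnable` is assumed to wrap one full inference pass; nothing here is taken from the spark-nlp test code.

```java
import java.util.ArrayList;
import java.util.List;

public class InferenceTimer {

  // Warms up the JVM first, then times each measured run individually so that
  // both the per-run latency and the average can be compared across TF Java
  // 0.2.x and 0.3.x.
  static void time(Runnable inferencePass, int warmupRuns, int measuredRuns) {
    for (int i = 0; i < warmupRuns; i++) {
      inferencePass.run();
    }
    List<Long> millis = new ArrayList<>();
    for (int i = 0; i < measuredRuns; i++) {
      long start = System.nanoTime();
      inferencePass.run();
      millis.add((System.nanoTime() - start) / 1_000_000);
    }
    long total = 0;
    for (long ms : millis) {
      total += ms;
      System.out.println("run: " + ms + " ms");
    }
    System.out.println("average: " + (total / measuredRuns) + " ms over " + measuredRuns + " runs");
  }
}
```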

For the cuDNN versions, I personally tested on Databricks 8.x series runtimes (which come with CUDA 11.x) and on Google Colab. (I'm not sure of the exact cuDNN version on those two platforms, but I'll see if I can find a local GPU server to test different cuDNN versions. TensorFlow 2.5 has an upgraded cuDNN, so I will be testing that once a snapshot is available here.)

maziyarpanahi avatar Jun 02 '21 09:06 maziyarpanahi

@maziyarpanahi , @wolliq , I suppose you work with String tensors, right? Something that has changed drastically between TF Java 0.2.0 and 0.3.1 is how string tensors are allocated by the TensorFlow runtime (since version 2.4.0). In your experiments, does the model inference also include the allocation of your input tensors?

If so, it would be great if you could isolate just the creation of the tensors (without actually running the model) and see if that alone is 3x slower than in 0.2.0.
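
As a reference point, isolating just the string-tensor allocation could look roughly like this with the 0.3.x API (the batch size and content are arbitrary; only the allocation is timed, no session run):

```java
import org.tensorflow.types.TString;

public class StringTensorAllocation {
  public static void main(String[] args) {
    String[] batch = new String[1024];
    java.util.Arrays.fill(batch, "some tokenized input text");

    long start = System.nanoTime();
    for (int i = 0; i < 100; i++) {
      // Rank-1 string tensor; TF >= 2.4 changed how the runtime stores string
      // tensors, which is the change between TF Java 0.2.0 and 0.3.1.
      try (TString t = TString.vectorOf(batch)) {
        // intentionally empty: we only want the allocation/release cost
      }
    }
    System.out.println("100 allocations: " + (System.nanoTime() - start) / 1_000_000 + " ms");
  }
}
```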

karllessard avatar Jun 02 '21 12:06 karllessard

Thanks @karllessard for the reply.

We only use TInt32 tensors for those models (token ids, segment ids, and mask ids). I think the only place we use String tensors is inside the Universal Sentence Encoder. I haven't timed that between 0.2.x and 0.3.x yet.

Any suggestions for TInt32 tensors? If I can find a machine with a compatible GPU, I will do some profiling to see where it spends more time compared to 0.2.x.
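
One way to get that breakdown without a full profiler is to time the TInt32 allocation and the session run separately. Below is a rough sketch against the 0.3.x API; the model path and the input/output names (`input_ids`, `attention_mask`, `token_type_ids`, `logits`) are placeholders, not the actual spark-nlp signatures.

```java
import java.util.List;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;
import org.tensorflow.ndarray.StdArrays;
import org.tensorflow.types.TInt32;

public class Int32TensorTiming {
  public static void main(String[] args) {
    int batch = 8, seqLen = 128;
    int[][] tokenIds = new int[batch][seqLen];   // filled by the tokenizer in real code
    int[][] maskIds  = new int[batch][seqLen];
    int[][] segIds   = new int[batch][seqLen];

    try (SavedModelBundle model = SavedModelBundle.load("/path/to/saved_model", "serve")) {
      // 1) time only the tensor allocation
      long t0 = System.nanoTime();
      TInt32 inputIds = TInt32.tensorOf(StdArrays.ndCopyOf(tokenIds));
      TInt32 attentionMask = TInt32.tensorOf(StdArrays.ndCopyOf(maskIds));
      TInt32 tokenTypeIds = TInt32.tensorOf(StdArrays.ndCopyOf(segIds));
      System.out.println("allocation: " + (System.nanoTime() - t0) / 1_000_000 + " ms");

      // 2) time only the session run
      long t1 = System.nanoTime();
      List<Tensor> outputs = model.session().runner()
          .feed("input_ids", inputIds)
          .feed("attention_mask", attentionMask)
          .feed("token_type_ids", tokenTypeIds)
          .fetch("logits")
          .run();
      System.out.println("session run: " + (System.nanoTime() - t1) / 1_000_000 + " ms");

      outputs.forEach(Tensor::close);
      inputIds.close();
      attentionMask.close();
      tokenTypeIds.close();
    }
  }
}
```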

maziyarpanahi avatar Jun 02 '21 13:06 maziyarpanahi

Well, something else major that has changed is that the tensor memory (of all types) is now automatically mapped in the JVM, while in 0.2.0 this was only done when calling tensor.data().

This mapping should be pretty fast, but I'd still be curious to know whether the latency you observed happens at tensor allocation or at the session run.
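
To illustrate the difference being described, here is a small sketch; the 0.2.0 style is shown only as an approximate comment for comparison.

```java
import org.tensorflow.types.TInt32;

public class DirectMappingExample {
  public static void main(String[] args) {
    // 0.3.x: the tensor's memory is mapped as soon as it is created, and TInt32
    // is itself an NdArray, so elements are read directly on the tensor.
    try (TInt32 t = TInt32.vectorOf(1, 2, 3)) {
      System.out.println(t.getInt(0));
    }

    // 0.2.0 (roughly): the mapping happened lazily, through tensor.data():
    //   Tensor<TInt32> t = TInt32.vectorOf(1, 2, 3);
    //   int first = t.data().getInt(0);
  }
}
```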

karllessard avatar Jun 02 '21 15:06 karllessard

Hi @karllessard

I have some updates:

  • The first session run in 0.3.1 is about 10%-20% slower than in 0.2.x
  • I would also like to correct my earlier statement that everything is 2x-3x slower on GPU in 0.3.1. We just found out that only models coming from Hugging Face perform poorly on GPU compared to the same models coming from TF Hub. (@wolliq let's update the title to something less general)
  • We will keep debugging to see what causes this slowness in HF models exported via saved_model (a sketch of one way to start comparing the two exports follows this list)
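
One low-tech way to start that comparison is to dump an op-type histogram of each SavedModel's graph and diff them; a sketch, with both paths hypothetical:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import org.tensorflow.Graph;
import org.tensorflow.Operation;
import org.tensorflow.SavedModelBundle;

public class CompareExports {

  // Counts how many times each op type appears in the SavedModel's graph, which
  // can reveal extra casts, control-flow, or CPU-only ops in one of the exports.
  static Map<String, Integer> opHistogram(String savedModelDir) {
    Map<String, Integer> counts = new TreeMap<>();
    try (SavedModelBundle bundle = SavedModelBundle.load(savedModelDir, "serve")) {
      Graph graph = bundle.graph();
      Iterator<Operation> ops = graph.operations();
      while (ops.hasNext()) {
        counts.merge(ops.next().type(), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println("HF export:     " + opHistogram("/path/to/hf_bert_saved_model"));
    System.out.println("TF Hub export: " + opHistogram("/path/to/tfhub_bert_saved_model"));
  }
}
```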

Many thanks again

maziyarpanahi avatar Jun 03 '21 06:06 maziyarpanahi

Ok, it looks like the latency is not happening in TF Java but in TensorFlow itself. I've just done a very quick search and found a few latency issues between TF 2.4 (0.3.1) and TF 2.3 (0.2.0) reported by non-Java users as well, like this one:

https://github.com/tensorflow/tensorflow/issues/46515

I'd suggest you take a look to see whether their investigation leads to the same issue you are facing.

Also, we'll soon migrate to TF 2.5 in the current snapshots; I don't know if that will fix it, but it would be worth trying once it's done.

karllessard avatar Jun 03 '21 12:06 karllessard