
TF Serving performance is very slow

Open liumilan opened this issue 3 years ago • 10 comments

I trained a recommendation NN model offline and now serve predictions with TF Serving on a CPU-only online machine. I have allocated 8 cores, but prediction is slow: for more than 0.4% of requests, prediction takes over 100 ms. The request batch size is 100, and the model has 167 one-hot features and 3 fully connected layers. CPU usage is also low, only about 20%. How can I analyze the bottleneck in serving, and is it possible to reduce the proportion of slow requests by adjusting some parameters? I have tried many of the approaches in https://www.tensorflow.org/tfx/serving/performance, but they did not improve performance. I suspect that because I have so many one-hot features, much of the time is spent looking up hashed feature embeddings.
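For reference, a minimal latency probe along the lines of the sketch below can be used to confirm the tail-latency numbers before and after each tuning attempt (the model name recsys, the REST port 8501, and the random integer features are placeholders for whatever the real signature expects):

# Minimal REST latency probe; model name, port, and input layout are placeholders.
import json
import time
import numpy as np
import requests

URL = "http://localhost:8501/v1/models/recsys:predict"
payload = json.dumps({"instances": np.random.randint(0, 1000, size=(100, 167)).tolist()})

latencies = []
for _ in range(500):
    start = time.perf_counter()
    requests.post(URL, data=payload).raise_for_status()
    latencies.append((time.perf_counter() - start) * 1000.0)

latencies.sort()
print("p50: %.1f ms" % latencies[len(latencies) // 2])
print("p99: %.1f ms" % latencies[int(len(latencies) * 0.99)])
print("over 100 ms: %.2f%%" % (100.0 * sum(l > 100 for l in latencies) / len(latencies)))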

liumilan avatar Mar 21 '22 10:03 liumilan

Hi @liumilan

Can you take a look at the workaround proposed in this thread and see if it helps resolve your issue? Also, you can refer to link1 and link2, which discuss similar problems. Thanks!

pindinagesh avatar Mar 22 '22 11:03 pindinagesh

Hi @liumilan

One gotcha I ran into is that the CPU usage of the TF Serving container is rather spiky and does not show up in the 1-minute aggregates (so it uses 100%+ CPU but on average shows < 50% in some cases). I'm not sure of your serving environment, but if it is in Kubernetes I'd recommend plotting CPU throttling to make sure you are not running into that (see this helpful throttling video). Increasing limits will help allow your application to burst into spikes. In addition, you can look into serving your application with more CPU (though that's costly, since you are already only at 20% CPU usage).

Apart from that, you can look into attaching TensorBoard to look at costly operations -- it is fairly easy to set up. I've not found any other parameters that have helped much with this problem, only changing resources and changing batch size.
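As a rough sketch, the TensorBoard profiler client can capture a trace from a running server like this (assuming the server exposes gRPC on localhost:8500 and that traffic is being sent to it while the trace runs; the log directory is a placeholder):

# Capture a short profile from a running TF Serving instance, then inspect the
# costly ops in TensorBoard's Profile tab: tensorboard --logdir /tmp/tfserving_profile
import tensorflow as tf

tf.profiler.experimental.client.trace(
    service_addr="grpc://localhost:8500",  # TF Serving gRPC port (assumed)
    logdir="/tmp/tfserving_profile",       # placeholder scratch directory
    duration_ms=2000,                      # trace for 2 seconds while under load
)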

salliewalecka avatar Mar 22 '22 21:03 salliewalecka

I have attached TensorBoard offline to look at costly operations, and found that it spends a lot of time looking up embedding features. Is it possible to reduce this time? @salliewalecka

liumilan avatar Mar 28 '22 10:03 liumilan

Hey, I don't have any more tips for you if the embedding feature lookup is the bottleneck. Sorry!

salliewalecka avatar Mar 28 '22 16:03 salliewalecka

I don't think it is the same issue. My bottleneck is that embedding lookup costs a lot of time, according to the TensorFlow timeline. @pindinagesh

liumilan avatar Mar 29 '22 02:03 liumilan

timeline-1.txt @pindinagesh Here is my timeline, could you help check it? Just rename it to timeline-1.json and open it in Chrome.

liumilan avatar Mar 29 '22 02:03 liumilan

Who can help check this timeline?

liumilan avatar Apr 05 '22 13:04 liumilan

In fact, other applications have a similar performance issue: #1991

vscv avatar Apr 06 '22 00:04 vscv

I also have the same low-performance issue. I guess it mainly comes from two parts:

  1. It takes time to convert the image into a JSON payload and POST it.
  2. TF Serving itself has latency (requests had already been sent several times in advance as a warm-up).

Therefore, in my POST tests, the remote side (MBP + WiFi) takes 16~20 seconds to print res.json, while the local side takes 5~7 seconds. Also, I observed GPU usage, and it only ran (~70%) for less than a second during the entire POST.

# 1024x1024x3 image to JSON and POST
import json, sys
import numpy as np
import PIL.Image
import requests
image = PIL.Image.open(sys.argv[1])
image_np = np.array(image)  # convert the PIL image to a numpy array for the payload
payload = {"inputs": [image_np.tolist()]}
res = requests.request("POST", "http://2444.333.222.111:8501/v1/models/maskrcnn:predict", data=json.dumps(payload))
print(res.json())

My scenario is recommendation, not CV.

liumilan avatar Apr 06 '22 02:04 liumilan

@pindinagesh @christisg Could you help check the timeline JSON?

liumilan avatar Apr 11 '22 12:04 liumilan

@liumilan,

Can you please compare the time taken to generate predictions using the TensorFlow runtime directly versus TensorFlow Serving? Under the hood, TensorFlow Serving uses the TensorFlow runtime to do the actual inference on your requests, so the average latency of serving a request with TensorFlow Serving is usually at least that of doing inference directly with TensorFlow. That would help us understand whether the real issue is with TensorFlow Serving or with the model. If embedding lookup is your bottleneck, I would suggest redesigning your model with inference latency as a design constraint in mind.
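As a rough sketch of such a comparison (the SavedModel path, single-input assumption, dtype, and random batch below are placeholders, not your actual model), you could time the same batch directly against the SavedModel in the TensorFlow runtime:

# Time direct inference on the SavedModel to separate model cost from serving overhead.
import time
import numpy as np
import tensorflow as tf

model = tf.saved_model.load("/models/recsys/1")  # placeholder path
infer = model.signatures["serving_default"]
input_name = list(infer.structured_input_signature[1].keys())[0]  # assumes a single input
batch = tf.constant(np.random.randint(0, 1000, size=(100, 167)), dtype=tf.int64)  # placeholder batch/dtype

infer(**{input_name: batch})  # warm-up
start = time.perf_counter()
for _ in range(100):
    infer(**{input_name: batch})
print("mean runtime latency: %.1f ms" % ((time.perf_counter() - start) * 1000.0 / 100))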

If the tail latency (the time taken by TensorFlow Serving itself to do inference) turns out to be high, you can try the gRPC API surface, which is slightly more performant. Also, you can experiment with command-line flags (most notably tensorflow_intra_op_parallelism and tensorflow_inter_op_parallelism) to find the right configuration for your specific workload and environment.
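A minimal gRPC client sketch might look as follows (the model name, signature, input name, and dtype are placeholders; the tensorflow-serving-api pip package provides the generated stubs):

# Minimal gRPC Predict call; avoids the JSON encode/decode cost of the REST surface.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "recsys"                      # placeholder model name
request.model_spec.signature_name = "serving_default"
request.inputs["inputs"].CopyFrom(                      # placeholder input name
    tf.make_tensor_proto(np.random.randint(0, 1000, size=(100, 167)), dtype=tf.int64)
)

response = stub.Predict(request, timeout=5.0)
print(list(response.outputs.keys()))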

Thank you!

singhniraj08 avatar Apr 06 '23 05:04 singhniraj08

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Apr 14 '23 01:04 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for the past 7 days.

github-actions[bot] avatar Apr 21 '23 01:04 github-actions[bot]

Are you satisfied with the resolution of your issue?

google-ml-butler[bot] avatar Apr 21 '23 01:04 google-ml-butler[bot]