[Perf] Conv2DBackpropInput is very slow in the BlazePose/hand_detector models on WebGL
hand_detector
| Kernel | Time(ms) | Inputs | Output | GPUPrograms |
|---|---|---|---|---|
| Conv2DBackpropInput | 9.73 | input0: 4D[1,16,16,256], input1: 4D[2,2,128,256] | 1,32,32,128 | UnpackProgram: 0.026083, Conv2DDerInputProgram: 9.706294 |
| Conv2DBackpropInput | 4.93 | input0: 4D[1,8,8,256], input1: 4D[2,2,256,256] | 1,16,16,256 | UnpackProgram: 0.011333, Conv2DDerInputProgram: 4.91648 |
BlazePose with inputSize 256, inputType Tensor
| Kernel | Time(ms) | Inputs | Output | GPUPrograms |
|---|---|---|---|---|
| Conv2DBackpropInput | 20.20 | input0: 4D[1,7,7,1152], input1: 4D[2,2,192,1152] | 1,14,14,192 | UnpackProgram: 0.042083, Conv2DDerInputProgram: 20.159669 |
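As a sanity check on the rows above, the output shapes follow directly from a stride-2, 'same'-padding gradient. The sketch below is a hypothetical helper (the function name is ours, not a tfjs internal) that reproduces the Output column from the Inputs column, assuming stride 2 and NHWC layout:

```javascript
// Hypothetical shape helper, assuming stride-2 'same' padding (NHWC).
// dyShape: [batch, h, w, outChannels] (input0 in the tables above)
// filterShape: [fh, fw, inChannels, outChannels] (input1 in the tables above)
function conv2dBackpropInputShape(dyShape, filterShape, stride) {
  const [batch, h, w] = dyShape;
  const inC = filterShape[2]; // gradient restores the forward conv's input channels
  // With 'same' padding, the gradient upsamples spatial dims by the stride.
  return [batch, h * stride, w * stride, inC];
}
```

For the first hand_detector row, `conv2dBackpropInputShape([1, 16, 16, 256], [2, 2, 128, 256], 2)` gives `[1, 32, 32, 128]`, matching the table.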
fyi @pyu10055 @lina128 @jinjingforever
This is very informative, thank you @qjia7! In the GPUPrograms column, is Conv2DDerInputProgram the packed version and UnpackProgram the unpacked version, both from the WebGL backend?
@lina128 the UnpackProgram is the shader that unpacks the outputs of the previous kernel; the Conv2DDerInputProgram is the shader for Conv2DBackpropInput, which only supports unpacked textures.
FYI, https://github.com/tensorflow/tfjs/pull/6603 shows that using the simulated matmul_vec4 on WebGPU can greatly improve the performance of Conv2DBackpropInput.
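The matmul reformulation behind that PR can be illustrated on a toy single-channel case. This is a hypothetical plain-JS sketch, not the actual WebGPU shader: when the stride equals the filter size (as in the 2x2/stride-2 kernels above), the scattered taps don't overlap, so the gradient is one matmul of the flattened dy against the flattened filter, followed by a block scatter (col2im):

```javascript
// Reference scatter-add form of Conv2DBackpropInput (single channel).
// dy: [h][w], filter: [fh][fw]; output is [h*stride][w*stride].
function deconvNaive(dy, filter, stride) {
  const h = dy.length, w = dy[0].length;
  const fh = filter.length, fw = filter[0].length;
  const out = Array.from({length: h * stride}, () => new Array(w * stride).fill(0));
  for (let i = 0; i < h; i++)
    for (let j = 0; j < w; j++)
      for (let fi = 0; fi < fh; fi++)
        for (let fj = 0; fj < fw; fj++)
          out[i * stride + fi][j * stride + fj] += dy[i][j] * filter[fi][fj];
  return out;
}

// Matmul form: [h*w, 1] x [1, fh*fw], then col2im (block scatter).
// On GPU the matmul is the part that can be vectorized vec4-style.
function deconvAsMatmul(dy, filter, stride) {
  const h = dy.length, w = dy[0].length;
  const fw = filter[0].length;
  const filtFlat = filter.flat();
  const rows = dy.flat().map(v => filtFlat.map(f => v * f)); // the matmul
  const out = Array.from({length: h * stride}, () => new Array(w * stride).fill(0));
  rows.forEach((row, k) => {                                  // col2im scatter
    const i = Math.floor(k / w), j = k % w;
    row.forEach((v, t) => {
      out[i * stride + Math.floor(t / fw)][j * stride + (t % fw)] += v;
    });
  });
  return out;
}
```

Both forms produce identical outputs; the win on GPU comes from the matmul half being vectorizable, which the scatter-add form is not.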
FYI @Linchenn
Thank you Jiajia @qjia7! This is a great inspiration. Let me see if we can reuse this algorithm and, at least, we could have a packed version of Conv2DDerInputProgram to improve the performance.
hand_detector is currently deprecated, while BlazePoseDetector is accelerated with packed Conv2DBackpropInput. However, Conv2DBackpropInput is still the performance bottleneck for BlazePoseDetector.
Hi, @qjia7
Apologies for the delayed response. We're revisiting our older issues and checking whether they have been resolved by now. May I know whether you are still looking for a solution, or has this issue been taken care of?
I see that PR https://github.com/tensorflow/tfjs/pull/7339 got merged, but if I'm not wrong it only fixes this issue partially. Do you want to keep this issue open until it is fixed completely?
If I have missed something here, please let me know. Thank you!
Originally, this was fixed by https://github.com/tensorflow/tfjs/pull/7339. However, a later improvement (https://github.com/tensorflow/tfjs/pull/7386#issuecomment-1436535284) introduced some regressions on Intel devices. I am OK to close this one and file a new issue to track that regression.
Hi, @qjia7
Thank you for the confirmation. If you're okay with closing this issue from your end, please feel free to close it now and submit a new issue to track the other problem. Thank you!
This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for the past 7 days.