PAF Parallel Optimizations & Max Pooling
Parallel Optimizations
Hi, I just pushed some optimizations into my paf_opt branch that parallelize some CPU-intensive loops in the PAF processor (plus a couple of other places). I have also made this CMake-configurable, so you can choose between C++17 parallel algorithms (just for_each), TBB, PPL, or serial/disabled (the default), which looks like this in cmake-gui.
If TBB is chosen, a new file-path entry is added that should point to TBB's CMake directory. If you do choose TBB, I would recommend using a build of OpenCV with TBB enabled; otherwise you may end up with double the number of threads on Windows, because I believe the prebuilt OpenCV binaries for Visual C++ use PPL, so you could end up with one set of worker threads for PPL and another set for TBB.
In terms of performance, these changes give the PAF processor roughly a 3x-4x speed-up on my system (with PPL). For example, with a batch size of 4 (processing 4 independent images) on my setup, the serial version takes about 18ms and the parallelized version about 6ms.
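For reference, the C++17 variant is essentially std::for_each with a parallel execution policy. This is only a minimal sketch of the pattern (the function name and per-channel body are placeholders, not the actual loops in the branch):

#include <algorithm>
#include <execution>
#include <vector>
#include <opencv2/core.hpp>

void process_channels(std::vector<cv::Mat>& channels)
{
    // std::execution::par requires C++17 parallel-algorithm support
    // (and, e.g. with GCC 9, linking against TBB).
    std::for_each(std::execution::par, channels.begin(), channels.end(),
                  [](cv::Mat& channel) {
                      // ... per-channel CPU-intensive work goes here ...
                  });
}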
Unfortunately I cannot parallelize over paf::process at the highest level (and take advantage of nested parallelism) because it is not thread safe. What I mean is that code like this:
for (auto& packet : feature_maps)
{
    parser.process(packet[0], packet[1]);
}
cannot be converted to:
parallel_for_each(feature_maps.begin(), feature_maps.end(), [&parser](auto& packet)
{
    parser.process(packet[0], packet[1]);
});
The main problem is the m_upsample_conf & m_upsample_paf buffers; it would require something like TLS so that there is one instance per worker thread, but of course this increases the memory requirements.
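One possible workaround would be one parser instance per worker thread, e.g. via thread_local. This is only a sketch of the idea; the parser type below is a stub standing in for the real class, and the packet layout is assumed:

#include <algorithm>
#include <array>
#include <execution>
#include <vector>
#include <opencv2/core.hpp>

// Stub standing in for the real parser; the actual class holds the
// m_upsample_conf / m_upsample_paf buffers that make sharing it unsafe.
struct paf_parser_stub
{
    void process(const cv::Mat& conf, const cv::Mat& paf) { /* ... */ }
};

void process_all(std::vector<std::array<cv::Mat, 2>>& feature_maps)
{
    std::for_each(std::execution::par, feature_maps.begin(), feature_maps.end(),
                  [](std::array<cv::Mat, 2>& packet) {
                      // One parser (and thus one set of upsample buffers) per
                      // worker thread, at the cost of extra memory.
                      thread_local paf_parser_stub parser;
                      parser.process(packet[0], packet[1]);
                  });
}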
Max Pooling
The reason I did this, and did it this way, is that the most expensive operation is the CPU-based max pooling, mainly because there are something like 19 channels of images to process.
Initially I tried to optimize the code without parallelization, by reducing the number of boundary checks in the loops of same_max_pool_3x3, but no matter what I did it was never faster than the original code.
I'm not familiar with this problem domain, but I looked into max pooling, and from what I understand of the algorithm, the code does not behave like the standard version: the 3x3 filters overlap and process every pixel, and the pooled images are not smaller versions of the original.
The standard version would seem to be a lot cheaper computationally, but I never tried changing it because, from what I understand, some implementations do use overlapping filters, and there is probably a reason it was done this way that I don't understand.
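For clarity, this is the behaviour I mean: a stride-1, same-size 3x3 max pool, where every output pixel is the max of its 3x3 neighbourhood, rather than a standard strided pool that shrinks the image. A minimal sketch (not the actual same_max_pool_3x3 code, and assuming planar float buffers):

#include <algorithm>
#include <vector>

// Stride-1 "same" 3x3 max pool: the output has the same size as the input,
// the 3x3 windows overlap, and every input pixel is read up to 9 times.
void same_max_pool_3x3_sketch(const std::vector<float>& in, std::vector<float>& out,
                              int height, int width)
{
    out.resize(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float best = in[y * width + x];
            // Clamp the 3x3 window at the image borders
            // (the boundary checks mentioned above).
            for (int dy = std::max(0, y - 1); dy <= std::min(height - 1, y + 1); ++dy)
                for (int dx = std::max(0, x - 1); dx <= std::min(width - 1, x + 1); ++dx)
                    best = std::max(best, in[dy * width + dx]);
            out[y * width + x] = best;
        }
    }
}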
Another option would be to vectorize the loops, but the buffers are not aligned for SIMD registers, the code doesn't seem to be cache friendly, and I don't feel like writing cross-platform SIMD code with fallbacks.
So in the end I went for the more brute-force approach of parallelizing over the number of channels, as sketched below.
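In other words, something along these lines (again just a sketch, reusing the pooling helper sketched above and assuming the ~19 channels are stored as independent planar buffers):

#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Defined in the sketch above.
void same_max_pool_3x3_sketch(const std::vector<float>& in, std::vector<float>& out,
                              int height, int width);

void max_pool_all_channels(std::vector<std::vector<float>>& channels,
                           std::vector<std::vector<float>>& pooled,
                           int height, int width)
{
    pooled.resize(channels.size());
    // Each channel is pooled independently, so the channel loop
    // parallelizes trivially.
    std::vector<std::size_t> indices(channels.size());
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::for_each(std::execution::par, indices.begin(), indices.end(),
                  [&](std::size_t c) {
                      same_max_pool_3x3_sketch(channels[c], pooled[c], height, width);
                  });
}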
I do understand that you are going to provide an alternative to PAF with proposal networks, so I look forward to trying that out when the models are ready.
Lastly, regarding GPU-based max pooling: the option isn't exposed anymore in v2.0, so I manually enabled it to try it out. It is definitely faster than the serial CPU version, but the parallel CPU version is (slightly) faster still, probably because of the host <-> device buffer synchronization overhead.
@korejan A big thanks for your comprehensive issue.
-
About the parallel optimization: I made the optimization you mentioned in the stream API, by simply allocating N PAF objects to do PAF processing at the same time (see: https://github.com/tensorlayer/hyperpose/blob/master/include/hyperpose/stream/stream.hpp#L341). I also used a thread pool to reduce the overhead of thread allocation. All these parallel utilities were implemented via the standard thread library for maximum compatibility (many C++17 compilers still do not support parallel algorithms, and some of them require extra libraries, e.g., TBB with GCC 9). But of course, we can adapt these changes for systems/compilers that support parallel algorithms.
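The idea is roughly the following (just an illustrative sketch with standard threads rather than the actual stream API code; the parser type is a stub and the work distribution is a placeholder):

#include <array>
#include <cstddef>
#include <thread>
#include <vector>
#include <opencv2/core.hpp>

// Stub standing in for hyperpose's PAF parser type.
struct paf_parser_stub
{
    void process(const cv::Mat& conf, const cv::Mat& paf) { /* ... */ }
};

void parallel_paf(std::vector<std::array<cv::Mat, 2>>& feature_maps,
                  std::size_t n_workers)
{
    // One parser per worker, so internal buffers are never shared.
    std::vector<paf_parser_stub> parsers(n_workers);
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < n_workers; ++w) {
        workers.emplace_back([&, w] {
            // Each worker handles a strided slice of the batch.
            for (std::size_t i = w; i < feature_maps.size(); i += n_workers)
                parsers[w].process(feature_maps[i][0], feature_maps[i][1]);
        });
    }
    for (auto& t : workers)
        t.join();
}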
-
You are 100% right that the biggest bottleneck of PAF is the max pooling. The greater the max-pool factor, the slower it is, but also the more precise. We adopted the factor of 4 following the LightWeight-OpenPose paper. Interestingly, pose proposal is much faster than PAF because it has no extra matrix operations on the feature map. The PAF post-processing was implemented by @lgarithm, and I think we still have room for optimization. It was me who forced the CPU version of max-pool, as I found it faster than the GPU version on my computer. @lgarithm once mentioned that the GPU version can be significantly faster on some embedded platforms like NVIDIA Jetson. I am not sure why the GPU max-pool is slower than the CPU version, but we can profile it via nvprof. I don't think manual SIMD optimization is necessary; we can enable such compiler optimizations by passing the corresponding optimization flags.
I understand that PAF processing is the bottleneck for small models and that the PAF implementation in hyperpose should be further optimized. I think we can start by optimizing the NCHW max-pool, and the host<->device transfer speed can be improved via pinned memory.
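For the pinned-memory part, the usual pattern is to allocate the host staging buffer with cudaMallocHost so that cudaMemcpyAsync can perform a true asynchronous DMA transfer. A minimal sketch (buffer names, sizes, and the per-call allocation are placeholders; in practice the pinned buffer would be allocated once and reused):

#include <cstddef>
#include <cuda_runtime.h>

void copy_feature_map_to_host(const float* device_src, std::size_t n_floats,
                              cudaStream_t stream)
{
    float* host_buf = nullptr;
    // Pinned (page-locked) host memory enables async, DMA-driven copies.
    cudaMallocHost(reinterpret_cast<void**>(&host_buf), n_floats * sizeof(float));

    cudaMemcpyAsync(host_buf, device_src, n_floats * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait only for this stream's copy

    // ... run the CPU-side PAF post-processing on host_buf ...

    cudaFreeHost(host_buf);
}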