Performance improvements
Nice work on this! Thanks for making it available.
The performance of the current demo seems to differ substantially from what is quoted in the SLAM++ paper:

> This process typically takes <5ms for 160K PPFs and could also be used in the future to describe new object classes on the fly as they are automatically segmented.
The demo in this repo seems to build the model in ~2 seconds, and object detection then takes 10+ seconds for the chair.
I took a look at the code, and I believe the problem is that it is implemented as if it were running on a regular CPU, using only a very small amount of Thrust. Algorithms with low parallelization run much slower on the GPU than on the CPU; the advantage of the GPU only shows up when the code is parallelized to the extreme. With the current code, I think switching to CPU-only would actually improve performance. That change doesn't seem like it would take much effort, and it may make sense to support both CPU-only and GPU execution at the user's option anyway.
In case you'd like to improve the GPU performance, these may be of use:

- http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#axzz41liQ5klM
- https://www.quora.com/What-is-the-best-way-to-learn-CUDA
The performance may also be better than I'm seeing because I don't have PCL's CUDA module; I'm trying to build and install it now.
Thanks for your review! Yes, this is a very rough implementation and is not really optimized. I do plan to make the GPU parallelization optional, as you suggested.