geometry2 refactor(doTransform): 10-100x faster transform of pointcloud

It seems like generating an Eigen::Vector3f and transforming it for each iteration is very inefficient. In our case, this mod took transform-time of a point-cloud from 90-100ms to 1-2ms.

Verified to give same result as old doTransform using:

sensor_msgs::PointCloud2ConstIterator<float> x_old(cloud1, "x");
sensor_msgs::PointCloud2ConstIterator<float> y_old(cloud1, "y");
sensor_msgs::PointCloud2ConstIterator<float> z_old(cloud1, "z");

sensor_msgs::PointCloud2ConstIterator<float> x_new(cloud2, "x");
sensor_msgs::PointCloud2ConstIterator<float> y_new(cloud2, "y");
sensor_msgs::PointCloud2ConstIterator<float> z_new(cloud2, "z");

std::vector<double> compare_vector;
for (; x_new != x_new.end(); ++x_new, ++y_new, ++z_new, ++x_old, ++y_old, ++z_old) {
compare_vector.push_back(*x_new - *x_old);
compare_vector.push_back(*y_new - *y_old);
compare_vector.push_back(*z_new - *z_old);
}
double max_diff = *max_element(compare_vector.begin(), compare_vector.end());
double min_diff = *min_element(compare_vector.begin(), compare_vector.end());

ROS_INFO("Biggest differences: %f, %f", max_diff, min_diff);

Not sure how/where to put this test-code in a proper test, so I'll leave it here until I get some feedback.

Dec 05 '19 10:12 steinmn

Hi :) Thank you for making this PR. I'm not a maintainer of this repository but I have a suggestion about this PR.

The main problem of the original code is that Eigen::Vector3f is generated and initialized inside the iteration.

It may be possible to speed up the original code by initializing the Eigen::Vector3f outside the iterations:

Eigen::Vector3f point, point_in;
for(; x_in != x_in.end(); ++x_in, ++y_in, ++z_in, ++x_out, ++y_out, ++z_out) {
    point_in.x() = *x_in;
    point_in.y() = *y_in;
    point_in.z() = *z_in;
    point = t * point_in;

Using Eigen::Vector3f could be faster than using std::vector or writing directory because the Eigen supports automatic SIMD (e.g., SSE, AVX) operation.

Dec 10 '19 03:12 yoshipon

Hi Thanks for the feedback :) I did a quick test with your suggestion, and this yielded very similar results to the old code (90-100ms). (used code:)

Eigen::Vector3f point, point_in;
for (; x_in != x_in.end(); ++x_in, ++y_in, ++z_in, ++x_out, ++y_out, ++z_out) {
    point_in.x() = *x_in;
    point_in.y() = *y_in;
    point_in.z() = *z_in;
    point = t * point_in;
    *x_out = point.x();
    *y_out = point.y();
    *z_out = point.z();
}

It seems the "point = t * point_in"-line is the bottleneck, clocking in at ~5000ns, with the other lines inside the loop around 100-200ns each. In comparison, each of the three codelines in the loop in my committed code take around 50-100ns. (Of course, individual time measurements should probably be taken with a grain of salt when the durations are this small, but the orders of magnitude should be reliable)

I'm not familiar enough with parallel computing to say anything about how that might affect the performance of the different approaches, so I'll leave that discussion for someone who unlike me knows what they're talking about.

Dec 10 '19 09:12 steinmn

I'm very sorry for my late response.

Thank you for your benchmark. I didn't know the significant overhead of Eigen.

May 20 '20 09:05 yoshipon