
Timer that can be triggered a finite number of times.

Voldivh opened this issue 3 years ago • 13 comments

This PR addresses #1990. Basically, it gives the timers API the possibility to select the number of times the timer will be triggered, canceling it after that via a variable. This variable (amount_of_callbacks) has a default value of 0, which indicates that the timer will be triggered indefinitely until it is 'manually' canceled, i.e., the same behavior timers have at the moment. For example, initializing the timer with amount_of_callbacks = 1 would create a one-shot timer, which would be triggered once and then canceled.
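As a rough illustration of the proposed semantics, a minimal self-contained model might look like the following. The class and member names here (`CountingTimer`, `remaining_`) are illustrative stand-ins, not the actual rclcpp types:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Toy model of the proposed semantics: amount_of_callbacks == 0 means
// "fire forever"; any other value cancels the timer after that many
// executions of the callback.
class CountingTimer {
public:
  explicit CountingTimer(std::function<void()> callback,
                         std::size_t amount_of_callbacks = 0)
  : callback_(std::move(callback)),
    remaining_(amount_of_callbacks),
    finite_(amount_of_callbacks != 0),
    canceled_(false) {}

  // Called by the executor model each time the timer expires.
  void execute_callback() {
    if (canceled_) { return; }
    callback_();  // the user callback runs before the counter check
    if (finite_ && --remaining_ == 0) { canceled_ = true; }
  }

  bool is_canceled() const { return canceled_; }

private:
  std::function<void()> callback_;
  std::size_t remaining_;
  bool finite_;
  bool canceled_;
};
```

With `amount_of_callbacks = 1` this behaves as a one-shot timer; with the default of 0 it never cancels itself.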

Voldivh avatar Aug 26 '22 14:08 Voldivh

This implementation adds overhead to existing timers. We should make sure that the added functionality does not affect timers that don't want to use it.

alsora avatar Aug 27 '22 10:08 alsora

This implementation adds overhead to existing timers. We should make sure that the added functionality does not affect timers that don't want to use it.

I agree. I thought about doing the implementation in another class; however, that option would imply copying all the code from an existing class just to modify a couple of lines, and would also reduce the discoverability of the feature. What would you recommend in this case so that we can make sure the feature doesn't affect the existing timers?

Voldivh avatar Aug 30 '22 15:08 Voldivh

When adding generality there is always going to be overhead. There are different dimensions of overhead: pure runtime, memory footprint, development effort, and API complexity, all of which are side effects of design decisions.

Core C++ libraries can often avoid runtime overhead by doing template meta-programming and using other advanced features. Another alternative would be to change this API to a factory pattern with generators that leverage different potential backend implementations. I don't think that's something we want to get into.
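For concreteness, one hypothetical way templates could remove the runtime cost is a compile-time repeat policy, so that ordinary (infinite) timers carry no counter or branch logic beyond a trivially inlinable call. This is only a sketch of the technique under assumed names (`PolicyTimer`, `Infinite`, `Finite`), not the rclcpp API:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Policy for the default case: stateless, so the "should I keep going?"
// check trivially inlines to "yes".
struct Infinite {
  bool tick() { return true; }
};

// Policy for the finite case: stores the counter. Assumes n >= 1.
struct Finite {
  std::size_t remaining;
  explicit Finite(std::size_t n) : remaining(n) {}
  bool tick() { return --remaining != 0; }  // false once exhausted
};

template <typename Policy>
class PolicyTimer {
public:
  template <typename... Args>
  explicit PolicyTimer(std::function<void()> cb, Args... args)
  : callback_(std::move(cb)), policy_(args...), canceled_(false) {}

  void execute_callback() {
    if (canceled_) { return; }
    callback_();
    if (!policy_.tick()) { canceled_ = true; }
  }

  bool is_canceled() const { return canceled_; }

private:
  std::function<void()> callback_;
  Policy policy_;
  bool canceled_;
};
```

On typical ABIs the `Infinite` instantiation also avoids the extra counter in the object layout (an empty policy member occupies only padding), though that is an implementation detail rather than a guarantee.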

But if I look at this in terms of the effect on the user: it increases the memory footprint by 2 unsigned integers, and in each timer cycle there is one if statement checking whether an unsigned integer is non-zero. This is very low overhead for having a generic API available. It keeps the API simple and compact, and does not increase the code complexity significantly. If we were to add another class with a parallel implementation, it would likely have a higher memory impact, as both classes would sit in parallel in the shared libraries and then potentially be loaded into memory too, with a total extra footprint much greater than 2 unsigned ints.

And looking a little closer, the user's callback is invoked before the if statement, so the only potential slowdown for the user is that the CPU has to do approximately one extra cycle before recomputing the timeout and enqueuing itself back into the waitset. As such, the user will see no effect on their timing, neither a delayed callback nor a delayed next callback, except in the case where they overran their last callback time buffer; then it will be approximately 1 CPU clock cycle later, assuming the compiler has not already optimized out a check that this code path clearly will not take.

I don't think there's a way to do this more efficiently that is both simple and easy for the user to leverage, follows our established API and design patterns, and avoids adding significant extra lines of code for ongoing maintenance and testing. As such, I would suggest going with the current, quite simple solution.

tfoote avatar Aug 30 '22 20:08 tfoote

@tfoote IMO the ROS 2 design should be rigorous about how new features are added.

This "callback counter" is a convenience utility and, as we discussed in the issue, it's something that users could very easily do from the application code. It looks like several people think that this is a very important feature and, as such, it should be available out of the box. I'm ok with that, but this should not affect users that don't want to use it (this should apply to all features, but in particular to utilities).

This particular PR is adding an overhead that is probably negligible, but making an exception for this PR today, then for another PR tomorrow, and so on, is not good. Besides eventually having a noticeable impact on performance, it's also not a good design practice, as there isn't a clear separation between the C++ core (i.e., a wrapper around rcl) and all the stuff built on top of it.

Performance is one of the main problems of ROS 2, and there are a lot of on-going efforts to make ROS 2 more efficient, but this is possible only if performance is taken in serious consideration at all the layers.

My recommendation is to handle this via a new OneShotTimer class that simply has to store the additional counter and override the execute_callback method.
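A toy sketch of that suggestion, under hypothetical names (there is no `OneShotTimer` in rclcpp as of this discussion): the base timer's hot path stays untouched, and only the derived class pays for the extra bookkeeping by overriding `execute_callback`:

```cpp
#include <cassert>
#include <functional>

// Stand-in for the existing timer: no counter, no extra branch.
class BasicTimer {
public:
  explicit BasicTimer(std::function<void()> cb) : callback_(std::move(cb)) {}
  virtual ~BasicTimer() = default;

  virtual void execute_callback() {
    if (!canceled_) { callback_(); }
  }

  void cancel() { canceled_ = true; }
  bool is_canceled() const { return canceled_; }

protected:
  std::function<void()> callback_;
  bool canceled_ = false;
};

// Only users who opt into the one-shot behavior pay its cost.
class OneShotTimer : public BasicTimer {
public:
  using BasicTimer::BasicTimer;

  void execute_callback() override {
    BasicTimer::execute_callback();
    cancel();  // fire once, then cancel ourselves
  }
};
```

This keeps the existing class untouched at the cost of a second class to maintain, which is exactly the trade-off being debated above.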

alsora avatar Aug 31 '22 11:08 alsora

@tfoote @Voldivh sorry for the short notice, but would you be available to have a discussion on this in the ROS 2 Client Library Working Group? It's today (August 31st) at 8AM Pacific Time: https://meet.google.com/pxe-ohqa-cjm

alsora avatar Aug 31 '22 11:08 alsora

This particular PR is adding an overhead that is probably negligible, but making an exception for this PR today, then for another PR tomorrow, and so on, is not good.

Saying that there might be another PR tomorrow that might slow down performance would be a reason to reject that future PR; it is not really a valid critique of this one.

there are a lot of on-going efforts to make ROS 2 more efficient, but this is possible only if performance is taken in serious consideration at all the layers.

Indeed, have there ever been any metrics at issue, or proposed, which this change will affect adversely? What is a metric separating acceptable from unacceptable performance for this PR? As this is, in the non-zero case, an unchanging variable, I suspect the compiler will likely detect it as such, and this if statement will never even be evaluated.

Are there any known or perceived cases where this will ever become a bottleneck to computation that we will need to optimize? If not, this seems like a case of premature optimization of runtime overhead at the cost of many other aspects of the system.

Forcing things into separate classes and implementations has notable deleterious effects of its own, such as slowing down developer/user velocity and increasing our support burden through the number of lines of code, tests, and documentation to maintain. These are all real costs which will be borne by both us and users in the community, trading one form of overhead for another.

I'm ok with that, but this should not affect users that don't want to use it (this should apply to all features, but in particular to utilities).

This is a great design goal, but it cannot be applied in the abstract on a single metric. All of the costs I've outlined above also affect the entire community.

it's also not a good design practice, as there isn't a clear separation between the C++ core (i.e., a wrapper around rcl) and all the stuff built on top of it.

I think this is one of the fundamental areas where we're looking at this differently. I see the rclcpp API as a user-facing API that should be designed to optimize the development experience of the end user. To the end user there is absolutely no difference whether the code is implemented under the hood using rcl or locally in rclcpp. If, in the implementation, you want to make a distinction as to how it's done, that's great, but it should not be something that the user sees or cares about. If we were to extend the functionality implemented in rcl and change the backend of that feature in rclcpp, an rclcpp API user should not notice or care. rclcpp's public API is an abstraction which should hide all of those underlying implementation details.

tfoote avatar Aug 31 '22 20:08 tfoote

So I was thinking about this a bit last night, and I think there's a better solution that can both keep the default API generic and fully featured and separately maintain a highly optimized path. More concretely, we would have the default Timer with all the features and slightly higher overhead, and separately a LowOverheadTimer (naming TBD, using this for now for clarity) which is highly optimized and has a less user-centric API, so that developers who are worried about runtime overhead can opt into the more optimized one. If we're worried about a single if statement, I suspect there are more optimizations that can be achieved by stripping out more of the layers that already exist.

This will then allow the majority of users, who are not going to be affected by the overhead, to have the familiar API with all the bells and whistles. And the small fraction of users who want or need to optimize things can choose the lower-level interface, which does not need to be as cleanly presented or fully featured, allowing even more runtime optimizations than we already offer.

tfoote avatar Sep 01 '22 18:09 tfoote

I like this proposal @tfoote! This definitely goes in the direction I'm pushing for with the ROS 2 libraries: expose both user-friendly APIs and optimized bare-bones primitives. The latter should also be used internally in the core library.

I really want to apologize if my earlier comments sounded too strong; in hindsight I shouldn't have started this discussion on this very PR, but rather separately. However, I'm glad that we had some good discussions on these very important topics!

One of the features developed by iRobot is the Events Executor. The main reason why this executor is more efficient than the default one is exactly that it skips all the layers, which currently are not zero-overhead wrappers. There are a lot of possible optimizations in the ROS stack; to implement them, it's really fundamental to have a clear design in mind that makes ROS usable both by people who don't care about performance and resource management (and are happy to delegate it to the ROS stack) and by people who, on the other hand, must have full control over the operations to run them on resource-constrained hardware.

alsora avatar Sep 02 '22 11:09 alsora

@tfoote @alsora thanks for sharing your thoughts.

IMO, this feature is basic control for timers, which is acceptable and reasonable. I think the priority of rclcpp is to provide interfaces for user applications; after that, we can review how efficiently we can implement them.

If we're worried about a single if statement, I suspect there are more optimizations that can be achieved by stripping out more of the layers that already exist.

This is so true. But is this something we really want as a design driver? I am just confused, since rclcpp is a user-facing client library. If we were worried about a single if statement, I think we would have a clear complexity threshold on a specific platform as a requirement, and probably would not even use rclcpp. (I think this is one of the drivers of https://github.com/ros2/rclc.)

It will be really difficult to support new features in rclcpp if performance becomes the 1st priority. And what about other client libraries like rclpy? Should the same driver apply there too, as a user client library?

Just a bit confused about where this is going to land.

thanks!

fujitatomoya avatar Sep 02 '22 17:09 fujitatomoya

CC: @ros2/team

fujitatomoya avatar Sep 02 '22 17:09 fujitatomoya

It will be really difficult to support new features in rclcpp if performance becomes the 1st priority.

So given where we are in the lifecycle in ROS 2, my opinion here is that performance doesn't need to be 1st priority, but it definitely needs to be a priority. We've heard a lot of grumbling from the community about how the performance of ROS 2 is worse than ROS 1 in a lot of ways. While we have a lot of knobs that people can use to make the system more performant, the fact of the matter is that out-of-the-box, ROS 2 currently isn't great at this. So any new features that go in need to be weighed against this fact.

That said, for this particular feature, I agree that the cost of an extra branch is likely going to be lost in the noise here. There's an easy way to test; do some performance testing before and after this PR with timers. If we can't measure the overhead (as I suspect), then I would be for this PR going in more-or-less as-is.

clalancette avatar Sep 02 '22 17:09 clalancette

Our analysis showed that one of the major CPU bottlenecks of ROS 2 is caused by how its abstraction layers are implemented. In particular, getting data from the user application to the DDS and vice versa is very expensive (it should be almost negligible; we should just be passing a pointer around). The source of the overhead is not one big, expensive operation (i.e., something that would be easily detected if we had performance measurements in the PR process); rather, it comes from summing together a lot of if/else branches, checks, and other operations that on their own have a negligible impact on performance.

While developing the events executor, we did some tests where we invoked the DDS APIs as close as possible to the application layer. This is obviously bad from an architecture point of view, but it improved the performance a lot.

The "fast path", i.e., the sequence of operations invoked at very high frequency while an application is running, has to be cleared of all operations that aren't required. We all love utilities and wrappers, but they should be implemented on top of efficient primitives, which should also be accessible to users if they need them.

This PR is not a problem per se; merging it will have no impact on performance for 99.9% of users (me included), and its overhead would be hard to measure. However, it's a clear example of how the ROS 2 design does not follow the "you don't pay for what you don't use" principle in its fast path. Users that want a timer that runs forever will have to check that if-condition every time; similarly, users with an executor containing only a rclcpp::Waitable have to go through ~20 if-conditions every time the executor wakes up: https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L826. These are just examples, and unfortunately there are a lot of them.

User-friendly, feature-rich APIs and efficient primitives are in my opinion both equally important requirements for ROS 2, but they should be built one on top of the other rather than be mixed together.

alsora avatar Sep 02 '22 19:09 alsora

After reading the whole discussion and the common ground reached regarding the software design, I ran some simple tests on my local host to check the overhead this PR would introduce. I basically created a timer with a callback and checked whether the timer was being fired on time (according to the period), using std::chrono::high_resolution_clock as the watchdog of the timer. I also kept an eye on the general CPU and memory load of the process with the proposed feature and without it (i.e., rolling). Here are the results I got on my local host:

As for the timer being triggered on time, I only ran it for a few callbacks and all were triggered within the time period:

eloy@567080f97543:~/ros2_rolling_ws$ ros2 run testing_pkg timer 
Callbacks called: 100
Missed Deadlines: 0

As for the workload on the CPU and the memory usage, take a look at the yellow row from the next pictures:

Feature Branch

(screenshot: process CPU/memory usage)

Rolling

(screenshot: process CPU/memory usage)

As expected, the CPU/memory consumption is basically the same in both cases. I know this was a simple test and not an actual benchmark of the feature. However, I believe it shows that the overhead is small enough that it doesn't affect the behavior of the timers.
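The kind of watchdog check described above might be sketched roughly like this. This is a reconstruction under assumptions (names like `run_periodic` and `TimerStats` are made up), not the actual code from the test package:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

struct TimerStats {
  int callbacks = 0;
  int missed_deadlines = 0;
};

// Fire `cb` periodically `iterations` times; count an iteration as a
// missed deadline if the callback started later than its scheduled
// time plus `tolerance`.
template <typename Callback>
TimerStats run_periodic(Callback cb, std::chrono::milliseconds period,
                        int iterations, std::chrono::milliseconds tolerance)
{
  using clock = std::chrono::steady_clock;
  TimerStats stats;
  auto next = clock::now() + period;
  for (int i = 0; i < iterations; ++i) {
    std::this_thread::sleep_until(next);
    if (clock::now() > next + tolerance) {
      ++stats.missed_deadlines;
    }
    cb();
    ++stats.callbacks;
    next += period;  // schedule relative to the plan, not to "now"
  }
  return stats;
}
```

Printing `stats.callbacks` and `stats.missed_deadlines` would produce output of the same shape as the "Callbacks called / Missed Deadlines" report above.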

Voldivh avatar Sep 07 '22 16:09 Voldivh

@Voldivh sorry to be late to get back on this, can you take a look at latest comments?

fujitatomoya avatar Dec 05 '22 04:12 fujitatomoya

@Voldivh sorry to be late to get back on this, can you take a look at latest comments?

Of course! I'll take a look.

Voldivh avatar Dec 13 '22 18:12 Voldivh

@Voldivh could you address the conflicts?

fujitatomoya avatar Jun 05 '23 23:06 fujitatomoya

@Voldivh could you address the conflicts?

Sure thing, will do.

Voldivh avatar Jun 06 '23 22:06 Voldivh

same here, build failed https://build.ros2.org/job/Rpr__rclcpp__ubuntu_jammy_amd64/906/console

10:51:52 /tmp/ws/src/rclcpp/rclcpp/include/rclcpp/subscription.hpp:208:5: error: there are no arguments to ‘TRACETOOLS_TRACEPOINT’ that depend on a template parameter, so a declaration of ‘TRACETOOLS_TRACEPOINT’ must be available [-fpermissive]
10:51:52   208 |     TRACETOOLS_TRACEPOINT(
10:51:52       |     ^~~~~~~~~~~~~~~~~~~~~
10:51:52 gmake[2]: *** [CMakeFiles/rclcpp.dir/build.make:181: CMakeFiles/rclcpp.dir/src/rclcpp/any_executable.cpp.o] Error 1
10:51:52 gmake[1]: *** [CMakeFiles/Makefile2:138: CMakeFiles/rclcpp.dir/all] Error 2

fujitatomoya avatar Jun 08 '23 18:06 fujitatomoya

same here, build failed https://build.ros2.org/job/Rpr__rclcpp__ubuntu_jammy_amd64/906/console

10:51:52 /tmp/ws/src/rclcpp/rclcpp/include/rclcpp/subscription.hpp:208:5: error: there are no arguments to ‘TRACETOOLS_TRACEPOINT’ that depend on a template parameter, so a declaration of ‘TRACETOOLS_TRACEPOINT’ must be available [-fpermissive]
10:51:52   208 |     TRACETOOLS_TRACEPOINT(
10:51:52       |     ^~~~~~~~~~~~~~~~~~~~~
10:51:52 gmake[2]: *** [CMakeFiles/rclcpp.dir/build.make:181: CMakeFiles/rclcpp.dir/src/rclcpp/any_executable.cpp.o] Error 1
10:51:52 gmake[1]: *** [CMakeFiles/Makefile2:138: CMakeFiles/rclcpp.dir/all] Error 2

Same question here with this commit.

Voldivh avatar Jun 08 '23 18:06 Voldivh

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

(Rpr job is supposed to fail.)

fujitatomoya avatar Jun 08 '23 22:06 fujitatomoya

@alsora could you review your comment?

fujitatomoya avatar Jun 08 '23 22:06 fujitatomoya

As we talked about before, this makes more sense if the canceled timer is removed from the wait_set for performance. AFAIK, the canceled timer is NOT removed from the wait_set; those timers are just skipped rather than executing their callbacks.

IMO, this PR is still useful as syntactic sugar for applications.

fujitatomoya avatar Jul 06 '23 01:07 fujitatomoya

As we talked about before, this makes more sense if the canceled timer is removed from the wait_set for performance. AFAIK, the canceled timer is NOT removed from the wait_set; those timers are just skipped rather than executing their callbacks.

IMO, this PR is still useful as syntactic sugar for applications.

I took a closer look at the implementation of the cancel method, and you are right: a canceled timer is not removed from the wait_set; it is skipped after checking its state (canceled or not).

Taking that into account, I do feel the PR would make more sense if it removed the timer from the wait_set. So far, I've only thought of destroying the timer object after the callbacks are completed, but I'd be open to other ideas.

Voldivh avatar Jul 06 '23 14:07 Voldivh

Taking that into account, I do feel the PR would make more sense if it removed the timer from the wait_set. So far, I've only thought of destroying the timer object after the callbacks are completed, but I'd be open to other ideas.

I don't think we should destroy the timer object. In particular, I can imagine that the user may want to retrigger the same timer multiple times, which they should be able to do by calling the reset method. Destroying the object would prevent that.

In order to remove it from the waitset, I think we'd need to do something more clever, where we track inside of the timer object whether we are currently in the waitset. This would be true by default. When we automatically expired via this mechanism, or cancel was called, we would remove ourselves from the waitset and set this to false. The next time that reset was called, we would add ourselves back to the waitset and set this to true.

However, it's even more complicated than that, since at the TimerBase level we have no access to the executor. So we'd somehow have to signal the executor to remove us from the waitset. I'm not quite sure how we would do that.
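The bookkeeping described above could be sketched as follows. All names here (`MiniExecutor`, `TrackedTimer`, `in_waitset_`) are hypothetical stand-ins; as noted, the real change would also need a mechanism for `TimerBase` to signal the executor, which this toy sidesteps by handing the timer a direct reference:

```cpp
#include <cassert>
#include <set>

class TrackedTimer;

// Stand-in for the executor's wait set membership.
class MiniExecutor {
public:
  void add(TrackedTimer * t) { waitset_.insert(t); }
  void remove(TrackedTimer * t) { waitset_.erase(t); }
  bool contains(TrackedTimer * t) { return waitset_.count(t) != 0; }

private:
  std::set<TrackedTimer *> waitset_;
};

// Timer that tracks whether it is currently in the wait set:
// in by default, removed on cancel, re-added on reset.
class TrackedTimer {
public:
  explicit TrackedTimer(MiniExecutor & exec) : exec_(exec) {
    exec_.add(this);
    in_waitset_ = true;
  }

  void cancel() {
    if (in_waitset_) { exec_.remove(this); in_waitset_ = false; }
  }

  void reset() {
    if (!in_waitset_) { exec_.add(this); in_waitset_ = true; }
  }

  bool in_waitset() const { return in_waitset_; }

private:
  MiniExecutor & exec_;
  bool in_waitset_ = false;
};
```

A finite-count timer would call `cancel()` on itself once its counter expired, and a later `reset()` would restore the original behavior, matching the retrigger use case described above.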

clalancette avatar Jul 17 '23 14:07 clalancette

Taking that into account, I do feel that the PR would make more sense if it removes the timer from the wait_set. So far, I've only thought of destroying the timer object after the callbacks are completed, but, would be open to other ideas.

I don't think we should destroy the timer object. In particular, I can imagine that the user may want to retrigger the same timer multiple times, which they should be able to do by calling the reset method. Destroying the object would prevent that.

In order to remove it from the waitset, I think we'd need to do something more clever, where we track inside of the timer object whether we are currently in the waitset. This would be true by default. When we automatically expired via this mechanism, or cancel was called, we would remove ourselves from the waitset and set this to false. The next time that reset was called, we would add ourselves back to the waitset and set this to true.

However, it's even more complicated than that, since at the TimerBase level we have no access to the executor. So we'd somehow have to signal the executor to remove us from the waitset. I'm not quite sure how we would do that.

Ok, I see; as I feared, there is no easy way around it then. I'll take some time to look into it and get back to you.

Voldivh avatar Jul 17 '23 14:07 Voldivh