Pending timeouts are not cancelled after receiving a response message
This is the first issue I'm filing with actor-framework. I'm new to most of its concepts and this might be an error on my end.
The following test program is supposed to chain two actors together. The message flow is supposed to be caf_main -> parent -> child -> parent handler -> caf_main handler. The timeout in the call from the parent to the child seems to hold up the main process from exiting. Setting it to seconds(2) makes the whole process take 2 seconds to exit. Setting it to infinite makes it exit immediately. Is this expected?
I convinced myself through a bunch of aout's that the control flow is as I expect it to be.
#include <chrono>
#include <string>
#include <iostream>
#include <unordered_map>
#include "caf/all.hpp"
using namespace caf;
using namespace std;
using namespace std::chrono;
using ping_atom = atom_constant<atom("ping")>;
using pong_atom = atom_constant<atom("pong")>;
behavior fwd_actor(event_based_actor* self, const actor* child) {
  return {
    [=](ping_atom) -> result<int> {
      auto rp = self->make_response_promise<int>();
      if (child != nullptr) {
        aout(self) << "ping parent -> child" << endl;
        self->request(*child, seconds(2), ping_atom::value).then([=](int i) mutable {
          aout(self) << "parent -> child done, result: " << i << endl;
          rp.deliver(i);
        }, [=](error& e){
          aout(self) << self->system().render(e) << std::endl;
        });
        aout(self) << "parent done" << endl;
      } else {
        aout(self) << "ping in the child" << endl;
        rp.deliver(2);
      }
      return rp;
    },
  };
}
void caf_main(actor_system& system) {
  scoped_actor self(system);
  auto child = system.spawn(fwd_actor, nullptr);
  auto parent = system.spawn(fwd_actor, &child);
  self->request(parent, seconds(4), ping_atom::value).receive(
    [&](int i) {
      aout(self) << "main call done, result: " << i << endl;
    },
    [&](error& e) {
      cout << self->system().render(e) << std::endl;
    });
  aout(self) << "Done" << endl;
}
CAF_MAIN();
Compile with g++ -g --std=c++17 -l caf_core -o chain chain.cpp.
time ./chain
ping parent -> child
parent done
ping in the child
parent -> child done, result: 2
main call done, result: 2
Done
./chain 0.01s user 0.03s system 2% cpu 2.034 total
The timeout in the call from the parent to the child seems to hold up the main process from exiting. (...) Is this expected?
To give a bit of background: this has to do with CAF's garbage collection. The actor system waits for all actors before leaving main. You don't explicitly shut down your actors, but once your chain of messages is over, all actors become unreachable. The exception is the one actor that has a pending timeout, because a message pointing to that actor still exists. After the timeout gets dispatched, no reference to the actor remains, and it gets shut down since it is now unreachable as well.
Using infinite as the timeout causes the actor to not request a timeout at all. That's why no additional reference to the actor gets created and main exits immediately.
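To illustrate the reachability argument above, here is a minimal standalone sketch. It does not use CAF. The names toy_actor, pending_timeout, toy_clock, refs_held_by_clock and unreachable_after_dispatch are all hypothetical, std::shared_ptr stands in for CAF's reference counting, and an empty optional stands in for an infinite timeout.

```cpp
#include <cassert>
#include <chrono>
#include <memory>
#include <optional>
#include <vector>

// Toy stand-in for an actor; not a CAF type.
struct toy_actor {
  int id;
};

// A scheduled timeout message. It owns a strong reference to its target,
// which keeps the target reachable while the timeout is pending.
struct pending_timeout {
  std::shared_ptr<toy_actor> target;
  std::chrono::seconds delay;
};

// Toy clock: holds pending timeouts until they are dispatched.
struct toy_clock {
  std::vector<pending_timeout> queue;

  // An empty optional models an infinite timeout: nothing gets scheduled.
  void schedule(const std::shared_ptr<toy_actor>& target,
                std::optional<std::chrono::seconds> delay) {
    assert(target != nullptr);
    if (delay)
      queue.push_back({target, *delay});
  }

  // Dispatching drops the messages and, with them, the strong references.
  void dispatch_all() { queue.clear(); }
};

// Number of strong references left once the user-side handle is gone.
long refs_held_by_clock(std::optional<std::chrono::seconds> delay) {
  toy_clock clock;
  auto actor = std::make_shared<toy_actor>(toy_actor{1});
  std::weak_ptr<toy_actor> observer = actor;
  clock.schedule(actor, delay);
  actor.reset(); // the message chain is over
  return observer.use_count();
}

// True if the actor is destroyed once the pending timeout was dispatched.
bool unreachable_after_dispatch() {
  toy_clock clock;
  auto actor = std::make_shared<toy_actor>(toy_actor{1});
  std::weak_ptr<toy_actor> observer = actor;
  clock.schedule(actor, std::chrono::seconds(2));
  actor.reset();
  clock.dispatch_all();
  return observer.expired();
}
```

In this model a finite timeout leaves one reference behind until dispatch, while the infinite case leaves none, which mirrors the 2-second delay versus the immediate exit seen in the report.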
In any case, an actor should be able to cancel a timeout after receiving a response message. Actors already cancel all timeouts when calling quit, so this should be straightforward to fix.
From @riemass (https://github.com/actor-framework/actor-framework/pull/1548#discussion_r1321150483):
To explain it in more detail, the problem here isn't that we don't cancel the timeout. (...) After experimenting with the provided test code, in the case of a request-then we cancel the timeout by calling the disposable's dispose member function when handling the response message. The actual disposable is an action that is scheduled to enqueue the error to the mailbox. No other reference is kept at the request site other than a disposable intrusive pointer. The real error here is that when calling dispose, we only mark the action enum as disposed, but since it's an intrusive ptr, the actual object is kept alive until the ptr goes out of scope in the scheduler (i.e. in 10 s). But the state also holds the function type, which can have its own state. In this case the function holds a strong reference, and the actor system blocks the shutdown until that reference gets deleted and the ref count drops to 0. (...)
Copying it here for context and to document what the fix was.
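The mechanism described in the quoted comment can be modeled with a short standalone sketch. This is not CAF's code. toy_actor, toy_action, dispose_flag_only, dispose_and_release and actor_alive_after_dispose are hypothetical names, and std::shared_ptr replaces the intrusive pointer. Destroying the callable on dispose is an assumption about one plausible remedy; the excerpt does not spell out the exact change made in the PR.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Toy stand-in for an actor; not a CAF type.
struct toy_actor {
  bool timed_out = false;
};

enum class action_state { scheduled, disposed };

// Toy action: a state flag plus a callable that may own captured state.
class toy_action {
public:
  explicit toy_action(std::function<void()> fn) : fn_(std::move(fn)) {}

  // Flag-only disposal: the callable and its captures stay alive for as
  // long as anyone still holds the action object.
  void dispose_flag_only() { state_ = action_state::disposed; }

  // Disposal that also destroys the callable, releasing its captures.
  void dispose_and_release() {
    state_ = action_state::disposed;
    fn_ = nullptr;
  }

  // Runs the callable unless the action was disposed.
  void run() {
    if (state_ == action_state::scheduled && fn_)
      fn_();
  }

  bool disposed() const { return state_ == action_state::disposed; }

private:
  action_state state_ = action_state::scheduled;
  std::function<void()> fn_;
};

// True if the actor is still alive after the response arrived and the
// timeout action was disposed, while the scheduler still holds the action.
bool actor_alive_after_dispose(bool release_callable) {
  auto actor = std::make_shared<toy_actor>();
  std::weak_ptr<toy_actor> observer = actor;
  // The timeout action captures a strong reference to the actor.
  auto action = std::make_shared<toy_action>(
    [actor] { actor->timed_out = true; });
  auto held_by_scheduler = action; // stays until the timeout elapses
  actor.reset();                   // no other reference to the actor remains
  if (release_callable)
    action->dispose_and_release();
  else
    action->dispose_flag_only();
  action.reset(); // the request site drops its disposable handle
  assert(held_by_scheduler->disposed());
  return !observer.expired();
}
```

With flag-only disposal the captured reference outlives the cancellation, so the actor stays alive for as long as the scheduler keeps the action. Releasing the callable on dispose lets the actor go away immediately in this model.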