
Why do Observers need to be serializable?

Open mlgiraud opened this issue 2 years ago • 10 comments

Hi, I'm implementing some custom components and am currently stuck on implementing an observer. Why does an Observer need to be serializable, and what are the consequences of serializing or deserializing an observer? I need to keep an Rc to a component that is shared between the executor and this observer.

mlgiraud avatar Nov 06 '23 09:11 mlgiraud

For scaling to multiple cores on slow targets, the observer is serialized and sent to the other nodes, which can then evaluate it themselves. That way there is no need to re-run the target on each node. If you don't need to scale like that, in theory you can ignore serialization and set the nodes to AlwaysUnique.

domenukk avatar Nov 06 '23 10:11 domenukk

So if the nodes are AlwaysUnique, we can still have multiple cores for the fuzzer, but the inputs will need to be reevaluated for each instance, correct? So if an instance discovers a new input, it will have to be reevaluated in a different node, correct?

mlgiraud avatar Nov 06 '23 10:11 mlgiraud

Wouldn't it maybe make sense to have the Observer not be serializable, but instead make it have a serializable state that can be sent across instances? I presume this would make the API a bit cleaner, and it would also become easier to check whether observers may be shared, since we could encode this in the type system via a HasObserverState trait or something? Just some ideas, since I'm still getting familiar with the LibAFL design.

mlgiraud avatar Nov 06 '23 10:11 mlgiraud

So if the nodes are AlwaysUnique, we can still have multiple cores for the fuzzer, but the inputs will need to be reevaluated for each instance, correct? So if an instance discovers a new input, it will have to be reevaluated in a different node, correct?

Yes. But that's not really recommended, of course. I don't see a reason why an observer wouldn't be serializable; maybe you're doing things in the Observer that belong in a Feedback?

Wouldn't it maybe make sense to have the Observer not be serializable, but instead make it have a serializable state that can be sent across instances?

The thing that should be serializable is the observer's state after an execution, so you are roughly describing the way it works now. Where are you stuck exactly?

domenukk avatar Nov 06 '23 21:11 domenukk

So my understanding of an Observer is, that it gathers some kind of information from the target and then provides this to the feedback in its raw form. The Feedback can then do some processing on this data to reduce it to a true or false decision. If this is the case, then the observer needs some kind of access to the target in order to perform the observation.

In case of the map observer this (usually?) happens in the form of a reference to the shared memory map of the target coverage. What i'm now not quite sure about is what happens to this data on serialize. As far as i understand it right now, this data will be converted into an owned vector and then upon deserializing the observer then has no capability to actually observe the target anymore. Is my understanding correct so far?

So the observer is basically frozen upon being sent over the wire, and the other node can use it to do some calculations with the feedback modules, correct?

If this is the case, then i don't have any problems with implementing this, since any internal objects that are not relevant to the observer's state can be ignored during serde. However, i am still convinced (assuming my understanding is correct) that this is not modeled correctly. I would have expected, for example, that the observer emits a kind of "observation" which can either be sent over the wire or be used directly by the feedback modules.

PS: I appreciate the quick replies so far. Thank you!

mlgiraud avatar Nov 07 '23 08:11 mlgiraud

So my understanding of an Observer is, that it gathers some kind of information from the target and then provides this to the feedback in its raw form. The Feedback can then do some processing on this data to reduce it to a true or false decision. If this is the case, then the observer needs some kind of access to the target in order to perform the observation.

Correct

In case of the map observer this (usually?) happens in the form of a reference to the shared memory map of the target coverage. What i'm now not quite sure about is what happens to this data on serialize. As far as i understand it right now, this data will be converted into an owned vector and then upon deserializing the observer then has no capability to actually observe the target anymore. Is my understanding correct so far?

Correct as well :)

So the observer is basically frozen upon being sent over the wire, and the other node can use it to do some calculations with the feedback modules, correct?

Also correct, yes
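The "frozen" behaviour confirmed above can be sketched with the standard library alone. This is only an illustration under stated assumptions: SketchMapObserver, attached, freeze and covered_edges are hypothetical names, not LibAFL's API, and the copy in freeze stands in for what a serialize/deserialize round trip would roughly amount to.

```rust
use std::borrow::Cow;

/// Hypothetical, simplified stand-in for a map observer (not LibAFL's type).
pub struct SketchMapObserver<'a> {
    name: String,
    /// Borrowed while attached to the live coverage map, owned once frozen.
    map: Cow<'a, [u8]>,
}

impl<'a> SketchMapObserver<'a> {
    /// Attach to a live coverage map without copying it.
    pub fn attached(name: &str, live_map: &'a [u8]) -> Self {
        Self {
            name: name.to_string(),
            map: Cow::Borrowed(live_map),
        }
    }

    pub fn name(&self) -> &str {
        &self.name
    }

    pub fn map(&self) -> &[u8] {
        &self.map
    }

    /// True while the observer still points at the live map.
    pub fn is_attached(&self) -> bool {
        matches!(self.map, Cow::Borrowed(_))
    }

    /// Copy the map into an owned vector. The result no longer observes
    /// the target; it only carries the data of this one execution.
    pub fn freeze(&self) -> SketchMapObserver<'static> {
        SketchMapObserver {
            name: self.name.clone(),
            map: Cow::Owned(self.map.to_vec()),
        }
    }
}

/// Feedback-like computation that works on attached and frozen observers alike.
pub fn covered_edges(observer: &SketchMapObserver<'_>) -> usize {
    observer.map().iter().filter(|&&hits| hits != 0).count()
}
```

A receiving node would only ever hold the frozen variant, so a computation like covered_edges still works there, while later writes to the original map are not reflected.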

If this is the case, then i don't have any problems with implementing this, since any internal objects that are not relevant to the observer's state can be ignored during serde. However, i am still convinced (assuming my understanding is correct) that this is not modeled correctly. I would have expected, for example, that the observer emits a kind of "observation" which can either be sent over the wire or be used directly by the feedback modules.

This is true; for example, the map observer content can be used directly or sent over the wire. That we don't always expose the map as a vec is simply an optimization - we don't want to memcopy the coverage map for each execution. Keep in mind that the serialization (commonly) only takes place when a new interesting testcase is found and stored - and shared with other nodes.

PS: I appreciate the quick replies so far. Thank you!

Hope I can help :)

domenukk avatar Nov 08 '23 00:11 domenukk

Alright, i think i now understand all the concepts the way you intended them. But don't you think it would make sense to separate the serialization trait into an Observation object that is returned from the Observer? There don't need to be any copies here i think, since any data that is kept in the Observer can simply be borrowed for as long as the Observation is needed. I.e. the Observer could have a function like fn observation(&self) -> &Observation. The Observation itself can simply contain the data, or a reference to the data which is kept in the Observer. This would be an implementation detail. On serialization we need to copy the data anyway (especially if you only have a borrow in the observer to begin with, like with the MapObserver). I don't think this would incur a significant overhead (maybe one or two pointer derefs?), but it would more clearly reflect what is actually done.

However, i don't have that good of an overview of what changes this would cascade into. What is your opinion on this?

For my purposes it will suffice to just write a custom Serialize implementation that serializes the data behind the Rc and deserializes it as an object that is not shared anymore (since it will not need to interact with the executor, as far as i understood everything). I can live with that :)
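A minimal sketch of that approach, using only the standard library: the to_bytes and from_bytes methods below are stand-ins for a custom serde Serialize/Deserialize implementation, and SharedTrace and TraceObserver are hypothetical names chosen for illustration. The point is that the data behind the Rc is what gets encoded, and decoding produces a fresh Rc that nothing else shares.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical component shared between the executor and the observer.
#[derive(Debug, Default)]
pub struct SharedTrace {
    pub hits: Vec<u32>,
}

/// Hypothetical observer holding an Rc to the shared component.
pub struct TraceObserver {
    trace: Rc<RefCell<SharedTrace>>,
}

impl TraceObserver {
    pub fn new(trace: Rc<RefCell<SharedTrace>>) -> Self {
        Self { trace }
    }

    /// Copy of the currently recorded hits.
    pub fn hits(&self) -> Vec<u32> {
        self.trace.borrow().hits.clone()
    }

    /// True if something else (e.g. an executor) holds the same Rc.
    pub fn is_shared(&self) -> bool {
        Rc::strong_count(&self.trace) > 1
    }

    /// Stand-in for Serialize: encode the data behind the Rc, not the Rc
    /// itself, as little-endian u32 values.
    pub fn to_bytes(&self) -> Vec<u8> {
        self.trace
            .borrow()
            .hits
            .iter()
            .flat_map(|hit| hit.to_le_bytes())
            .collect()
    }

    /// Stand-in for Deserialize: rebuild the data inside a new, unshared Rc.
    /// Trailing bytes that do not form a full u32 are ignored.
    pub fn from_bytes(bytes: &[u8]) -> Self {
        let hits = bytes
            .chunks_exact(4)
            .map(|chunk| u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
            .collect();
        Self::new(Rc::new(RefCell::new(SharedTrace { hits })))
    }
}
```

After a round trip through to_bytes and from_bytes, the restored observer reports the same hits but no longer sees later updates made through the original Rc.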

mlgiraud avatar Nov 08 '23 09:11 mlgiraud

It's a hot code path that executes a few hundred times a second depending on the target, but every single pointer deref counts in general. And I'm not sure it would make it easier for new users, since there are more moving parts to understand. What do you think?

domenukk avatar Nov 08 '23 11:11 domenukk

Well, from my point of view as a "new" user with a lot of fuzzing background, it wasn't exactly clear to me what consequences the serialization has and whether it matters that any references to other components still behave in a meaningful way after deserialization. If we split this into an Observer, which does the observing, and an Observation, which is used for further processing and can be sent to other fuzzers via serialization, then it would be clearer what the responsibilities are. For the case of the MapObserver you could even directly implement the Observation trait on the map and return a reference to the map. This shouldn't differ too much from the current number of derefs you need to do, since you have to deref to the map either way. I think the only option here is to actually implement it and then check the performance. Everything else is just speculation imho.

Side note: I don't even think we would need an extra trait for that, as it can be simply represented as an associated type in the Observer trait. Maybe?
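To make the associated-type idea concrete, here is a rough sketch. Everything in it is hypothetical and not part of LibAFL: ObserverSketch, MapObserverSketch, is_interesting and to_wire are illustrative names, and the performance question raised above is not addressed by it.

```rust
/// Hypothetical trait: the observer does the observing, the associated
/// Observation type is what feedbacks consume or what gets sent elsewhere.
pub trait ObserverSketch {
    type Observation: ?Sized;

    /// Borrow the observation; no copy is made here.
    fn observation(&self) -> &Self::Observation;
}

/// Hypothetical map observer that only borrows the coverage map.
pub struct MapObserverSketch<'a> {
    map: &'a [u8],
}

impl<'a> MapObserverSketch<'a> {
    pub fn new(map: &'a [u8]) -> Self {
        Self { map }
    }
}

impl<'a> ObserverSketch for MapObserverSketch<'a> {
    /// The map itself is the observation, so no wrapper type is needed.
    type Observation = [u8];

    fn observation(&self) -> &[u8] {
        self.map
    }
}

/// Feedback-like function that depends only on the observation.
/// Returns true if any index is hit that was not seen before.
pub fn is_interesting(observation: &[u8], seen: &mut [bool]) -> bool {
    let mut novel = false;
    for (hits, was_seen) in observation.iter().zip(seen.iter_mut()) {
        if *hits != 0 && !*was_seen {
            *was_seen = true;
            novel = true;
        }
    }
    novel
}

/// The copy happens only when the observation has to leave the process.
pub fn to_wire<O>(observer: &O) -> Vec<u8>
where
    O: ObserverSketch<Observation = [u8]>,
{
    observer.observation().to_vec()
}
```

In this sketch a feedback can run on observer.observation() locally or on the bytes received from another node, since both are plain byte slices.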

mlgiraud avatar Nov 08 '23 12:11 mlgiraud

No clue, @andreafioraldi what do you think?

domenukk avatar Nov 08 '23 14:11 domenukk