runtime type identification of messages with protobuf

Open ArcEye opened this issue 7 years ago • 1 comments

Issue by mhaberler Sun Mar 22 16:10:13 2015 Originally opened as https://github.com/machinekit/machinekit/issues/538

(TL;DR warning - a key problem ahead though).

HAL scalars are strongly typed: float, bool, s32, u32, and there is a corresponding type id: HAL_FLOAT, HAL_BIT, HAL_S32, HAL_U32.

Protobuf messages are strongly typed as well. However, Google protobuf as distributed does not create or manage type IDs for message descriptions, only for the fields contained therein (there are some 50+ message types under machinetalk/proto/proto). As messages go over the wire, or are handed from component to component, there needs to be a way to identify their type.

There are two solutions to this problem:

1. use a single message supertype, which may contain all other messages as submessages, and add a type field in this super-message which indicates which parts of the supertype are relevant.
2. use different messages as needed. In this case, recipients need a way to tell apart "which message is coming".

The first approach was the one used so far (and will continue to be used, but not exclusively - see below); the supertype message is the 'Container' (see message.proto), with the required 'type' field to enable recipients to discriminate.

While feasible in principle, there are a number of downsides to this approach:

- it works only if the message is encoded by the protobuf library, forcing the use of protobuf-serialized messages even where this is neither warranted nor desirable; for instance between RT components communicating over ringbuffers, where the serialisation/deserialisation overhead counts directly against the limited thread timing budget.
- the supertype approach incurs a size overhead, and while there are various ways to reduce the size of the deserialized C struct a bit, this comes at a cost: the resulting message descriptions are more cumbersome to use. Even with all sorts of tricks, the Container message still corresponds to a struct of 6+ KB. Many message types used in RT will use only a few percent of this space.

For the following discussion it is important to understand that the .proto definitions are translated into several language bindings (Python, C++ and C). With the latter C binding, provided by nanopb, every message is translated into a corresponding C struct.

An example from emcclass.proto - the following protobuf message description

message PmCartesian {
    optional double x = 10;
    optional double y = 20;
    optional double z = 30;
};

is translated by the nanopb generator into the following struct (emcclass.npb.h):

typedef struct _pb_PmCartesian {
    bool has_x;
    double x;
    bool has_y;
    double y;
    bool has_z;
    double z;
} pb_PmCartesian;

The correspondence is fairly obvious - if x, y or z are present, their double value fields are set, and the corresponding has_x, has_y, and has_z fields are set to true (false if not present). This is the representation used when processing messages in RT, and there is no good reason to serialize this message before handing it off to another component, just to have it deserialized there again; one could just as well pass this struct through a ringbuffer, which would be very fast.
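A minimal sketch of what "just pass the struct" could look like. The `frame_t` type and the two helper functions are hypothetical stand-ins for one ringbuffer record; they are not the rtapi ring API. The approach only holds when producer and consumer share the same struct layout (same machine, same compiler settings).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same layout as the nanopb-generated struct shown above. */
typedef struct _pb_PmCartesian {
    bool has_x;
    double x;
    bool has_y;
    double y;
    bool has_z;
    double z;
} pb_PmCartesian;

/* Hypothetical stand-in for one ringbuffer record (assumption, not rtapi). */
typedef struct {
    unsigned char data[64];
    size_t size;
} frame_t;

/* Producer side: copy the struct bytes as they are, with no encode step.
 * Returns 0 on success, -1 if the message does not fit into the frame. */
int frame_put_struct(frame_t *f, const void *msg, size_t len)
{
    if (len > sizeof f->data)
        return -1;
    memcpy(f->data, msg, len);
    f->size = len;
    return 0;
}

/* Consumer side: copy the bytes back out, with no decode step.
 * Returns 0 on success, -1 if the frame size does not match the struct. */
int frame_get_struct(const frame_t *f, void *msg, size_t len)
{
    if (f->size != len)
        return -1;
    memcpy(msg, f->data, len);
    return 0;
}
```

Note that the size check is the only safety net here: nothing in the frame says which struct it holds, which is exactly the gap the type tag discussed next is meant to close.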

If this accelerated method is used, now we need to pass a 2-member tuple between participants:

- the encoding (nanopb C struct, a protobuf serialized message, or any other encoding deemed worth using in the future)
- if one does not want to incur the runtime overhead of the supertype approach, the message type needs to be passed on.

As outlined above, stock Google protobuf does not support message type IDs independent of fields in a message.

However, recent work on the nanopb generator (https://github.com/mhaberler/nanopb/commits/msgid-option-cleaned, now merged upstream in https://code.google.com/p/nanopb/) permits an option to be set in a message description which can be used as a type identifier. This method is quite elegant and incurs no message overhead per se beyond the transport of the type tag, and it has the huge advantage of having all definitions in one place - the proto files themselves.

I have added such msgid options to all machinetalk message definitions; the above PmCartesian declaration now looks like so: https://github.com/machinekit/machinetalk-protobuf/blob/f9a759a2da38a7bb8e200f84b73566843805eb20/proto/emcclass.proto#L12-L19 .

These options are understood by the nanopb generator, and emitted in the corresponding header files as C defines for use by client code. Also, it is possible to assemble all message definitions, their names, type ids, and other parameters, into a descriptor struct which can be used at the C library level to do arbitrary mappings between type name, type ID, and descriptor. This code is currently found in machinetalk/msgcomponents/pbmsgs.{c,h} (for RT).
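To illustrate the kind of mapping such a descriptor table enables, here is a rough sketch. The struct layout, the function names, the id values and the sizes are all made up for illustration; the actual table in pbmsgs may look quite different.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor entry (assumption, not the real pbmsgs layout). */
typedef struct {
    uint32_t msgid;    /* type id taken from the msgid option */
    const char *name;  /* message name from the .proto file */
    size_t size;       /* size of the corresponding nanopb C struct */
} msg_descriptor_t;

/* Illustrative entries only: ids and sizes are invented. */
static const msg_descriptor_t msg_table[] = {
    { 1001, "PmCartesian", 48 },
    { 1002, "ExampleMsg",  16 },
};

#define MSG_TABLE_LEN (sizeof msg_table / sizeof msg_table[0])

/* Map a type id to its descriptor; returns NULL for an unknown id. */
const msg_descriptor_t *msg_by_id(uint32_t msgid)
{
    for (size_t i = 0; i < MSG_TABLE_LEN; i++)
        if (msg_table[i].msgid == msgid)
            return &msg_table[i];
    return NULL;
}

/* Map a type name to its descriptor; returns NULL for an unknown name. */
const msg_descriptor_t *msg_by_name(const char *name)
{
    for (size_t i = 0; i < MSG_TABLE_LEN; i++)
        if (strcmp(msg_table[i].name, name) == 0)
            return &msg_table[i];
    return NULL;
}
```

A linear scan is enough for a table of some 50 entries; a sorted table with binary search would be an obvious refinement if lookups ever showed up in an RT timing budget.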

I envisage the usage of this new option to be primarily confined to RT where the speed and memory consumption advantage is significant.

The way this feature can be used between RT components, and RT and userland components will look like so:

- as a ringbuffer vehicle, the multiframe ring is used (see rtapi/multiframe.h) - it is similar to the normal record-oriented ring, but transports an uninterpreted uint32 with each message frame. This uint32 will carry the type id as generated above, as well as some bits reserved to describe the encoding (e.g. a protobuf message, a nanopb C struct, or some other future encoding).
- between RT components, nanopb-generated structs will be used as those are faster to process than protobuf messages.
- at the RT/userland boundary, the userland part will be responsible for converting whatever external serialisation is used (protobuf, JSON etc) to and from nanopb C structs. The type/encoding information passed in the multiframe uint32 flag makes this method type safe.
- the type tag also makes it easy for RT components to decide if a message is to be processed, ignored or passed on - without any serialisation overhead.
- the descriptor array generated by nanopb will allow writing a Cython binding for this conversion at the RT/nonRT boundary (both directions), again type-safe.
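The points above can be sketched as follows. The bit layout (top 4 bits for the encoding, lower 28 bits for the type id), the enum values and all function names are assumptions made for this example; the actual split of the multiframe uint32 is not specified here.

```c
#include <stdint.h>

/* Hypothetical encoding codes (assumption). */
enum msg_encoding {
    ENC_PROTOBUF    = 1,  /* protobuf-serialized message */
    ENC_NPB_CSTRUCT = 2   /* raw nanopb C struct */
};

/* Assumed layout: encoding in the top 4 bits, type id in the lower 28. */
#define ENC_SHIFT  28u
#define MSGID_MASK ((UINT32_C(1) << ENC_SHIFT) - 1u)

/* Combine encoding and type id into the per-frame uint32. */
static inline uint32_t flag_pack(enum msg_encoding enc, uint32_t msgid)
{
    return ((uint32_t)enc << ENC_SHIFT) | (msgid & MSGID_MASK);
}

static inline enum msg_encoding flag_encoding(uint32_t flags)
{
    return (enum msg_encoding)(flags >> ENC_SHIFT);
}

static inline uint32_t flag_msgid(uint32_t flags)
{
    return flags & MSGID_MASK;
}

enum action { ACT_PROCESS, ACT_FORWARD, ACT_IGNORE };

/* Decide from the tag alone, without touching the payload:
 * anything that is not a C struct is passed on for conversion,
 * a C struct of the wanted type is processed, the rest is ignored. */
enum action classify(uint32_t flags, uint32_t wanted_msgid)
{
    if (flag_encoding(flags) != ENC_NPB_CSTRUCT)
        return ACT_FORWARD;
    return flag_msgid(flags) == wanted_msgid ? ACT_PROCESS : ACT_IGNORE;
}
```

The policy in `classify` is only one possible choice; the point is that the decision costs a shift, a mask and a compare.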

The foundation for these features is collected in https://github.com/machinekit/machinekit/pull/537, and the corresponding changes in the machinekit/nanopb and machinekit/machinetalk-protobuf repositories.

Aug 04 '18 14:08 ArcEye

Hello, I would like to ask whether anybody has given this some thought or done any research on it since the original post in 2015. (I need to know whether I might run into somebody who will be pissed off.)

(Following text is taken from the perspective of someone trying to enable Machinekit Instance-to-Machinekit Instance communication [or broker mediated] creating one distributed installation.)

I have been looking into the so-called zero-copy, or offset-accessed, serialization libraries as an alternative to Protocol Buffers, mainly Google Flatbuffers and Cap'n Proto. The main advantage of this type of serialization is that the wire format and the in-memory representation are the same. So I have been thinking that with these properties there would be no need for two or more different formats for passing messages, like the aforementioned protobuf-serialized messages or messages consisting of nanopb structs. Given that the data is represented as one array of bytes with a simple specification of containers and what these containers represent, there should generally not be any significant performance hit.

However,

> This is the representation used when processing messages in RT, and there is no good reason to serialize this message before handing it off to another component, just to have it deserialized there again; one could just as well pass this struct through a ringbuffer, which would be very fast.

does anybody have any idea how to measure it? How to measure whether there would be any difference between creating a nanopb struct and a byte array buffer? My naive idea of how to go about this is as follows: given that every RT component takes inputs and produces outputs which are pretty much constant over time, components would create a byte array as a model at initialization, and then in each cycle copy this array into the ringbuffer window buffer, update it with the new values and send it on its way.
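A rough sketch of that template idea, to make it concrete. The buffer layout used here (a 4-byte type id, padding, then three doubles at fixed offsets) is invented for the example and is not the Flatbuffers or Cap'n Proto wire format; all names are hypothetical, and a double is assumed to be 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Made-up message image: type id at offset 0, values at fixed offsets. */
enum { TMPL_SIZE = 32, OFF_X = 8, OFF_Y = 16, OFF_Z = 24 };

typedef struct {
    unsigned char bytes[TMPL_SIZE];
} msg_template_t;

/* Done once at component initialization: build the constant part. */
void template_init(msg_template_t *t, uint32_t type_id)
{
    memset(t->bytes, 0, sizeof t->bytes);
    memcpy(t->bytes, &type_id, sizeof type_id);
}

/* Done every cycle: copy the model into the ring window, then patch
 * the values in place. memcpy avoids unaligned or aliased stores. */
void template_emit(const msg_template_t *t, unsigned char *window,
                   double x, double y, double z)
{
    memcpy(window, t->bytes, TMPL_SIZE);
    memcpy(window + OFF_X, &x, sizeof x);
    memcpy(window + OFF_Y, &y, sizeof y);
    memcpy(window + OFF_Z, &z, sizeof z);
}
```

Timing `template_emit` against filling a nanopb struct in the same loop, over many iterations, would be one way to get a first number for the comparison asked about above.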

> if one does not want to incur the runtime overhead of the supertype approach, the message type needs to be passed on.

I for one consider the current Container approach extremely ugly, and it pretty much breaks the type safety of the message. Both Flatbuffers and Cap'n Proto have message identification as base functionality. It works pretty much like this nanopb functionality, but it can be used across language bindings. In FB it is the file_identifier and in Cap'n Proto it is the unique IDs. The advantage is that you can determine by a simple look at the message whether it is known and whether it is pertinent to your situation. This in itself would be enough to break the Container message into several logical blocks; all that would be needed is a lookup table on the receiver end.
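A sketch of such a receiver-side lookup table for the Flatbuffers case. It relies on the documented layout in which the 4-character file_identifier sits at byte offsets 4 to 7 of a buffer that is not size-prefixed. The identifiers "POSE" and "STAT", the handlers and the `dispatch` function are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef int (*msg_handler_t)(const unsigned char *buf, size_t len);

/* Hypothetical handlers; real ones would read the buffer contents. */
static int handle_pose(const unsigned char *buf, size_t len)
{
    (void)buf; (void)len;
    return 1;
}

static int handle_status(const unsigned char *buf, size_t len)
{
    (void)buf; (void)len;
    return 2;
}

/* Lookup table: 4-character identifier -> handler (identifiers invented). */
static const struct {
    char ident[5];
    msg_handler_t handler;
} routes[] = {
    { "POSE", handle_pose },
    { "STAT", handle_status },
};

/* Route a buffer by its identifier at offsets 4..7.
 * Returns the handler's result, or -1 if the buffer is too short
 * or the identifier is unknown. */
int dispatch(const unsigned char *buf, size_t len)
{
    if (len < 8)
        return -1;
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++)
        if (memcmp(buf + 4, routes[i].ident, 4) == 0)
            return routes[i].handler(buf, len);
    return -1;
}
```

Unknown identifiers are rejected before any field is read, so a receiver never has to understand message types it was not built for.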

The second problem is that ZeroMQ is doing away with multipart messages. There were problems with the ZMQ_CONFLATE option and multipart messages, and the RADIO/DISH sockets (I think) no longer support multipart. (And these are the only ones supporting UDP, which would be nice for this. [Pending research into MTU size, jumbo frames and such.]) So I have been thinking that even the multipart part, or at least the header (sender, receiver) and payload parts, could be solved by the same idea. The whole message would be a buffer (with a unique identification) which would carry another buffer as its payload. What to do with it would be decided based on this unique ID.

> The way this feature can be used between RT components, and RT and userland components will look like so:

The current problem with the Container message is that it needs to know all possible message types. So a developer cannot just publish a .icomp or .comp along with the message definitions it uses, because those messages could then not be sent over the network.

It would also be nice if the IDs had some kind of tree structure, so that there was a reserved space (like 192.168.0.0) for components which are not currently in the main Machinekit-HAL repository but which want to use, for example, MessageBus for communication.
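One way such a tree could look, borrowing the prefix/length idea from IP subnets. The 32-bit width, the 8-bit top-level prefix and the value 0xC0 for out-of-tree components are arbitrary choices for this sketch, not an existing Machinekit convention.

```c
#include <stdbool.h>
#include <stdint.h>

/* An ID range in prefix/length form, like a CIDR block (hypothetical). */
typedef struct {
    uint32_t prefix;  /* leading bits shared by all IDs in the range */
    unsigned bits;    /* number of significant leading bits, 0..32 */
} id_range_t;

/* Assumed reserved space for components outside the main repository. */
static const id_range_t ID_THIRD_PARTY = { UINT32_C(0xC0000000), 8 };

/* True if the leading r.bits bits of id match the range prefix. */
bool id_in_range(uint32_t id, id_range_t r)
{
    uint32_t mask = (r.bits == 0)
        ? 0u
        : (uint32_t)(~UINT32_C(0) << (32u - r.bits));
    return (id & mask) == (r.prefix & mask);
}

/* Convenience check for the reserved out-of-tree space. */
bool id_is_third_party(uint32_t id)
{
    return id_in_range(id, ID_THIRD_PARTY);
}
```

Sub-ranges nest naturally: a vendor could be handed a longer prefix inside the reserved block and allocate its own message IDs below it.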

Apr 26 '19 20:04 cerna