
Serialization format questions

Open · kylecarow opened this issue 7 months ago · 1 comment

This is not really an issue per se, but I'm trying to gain some insight into why the ArrayBase serialization format is the way it is.

With this example 2-D array: array![[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]]

serialization output to JSON looks like this: {"v":1,"dim":[3,3],"data":[0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]}

What I wonder is why the more human-readable (and, depending on the data, smaller) format generated by the alternate ser/de functions defined in serde_ndim is not used instead: [[0.0,1.0,2.0],[3.0,4.0,5.0],[6.0,7.0,8.0]] In fact, it's exactly how a user would supply an array syntactically in code.

I can speculate about a few reasons, but I'd like insight from the ndarray contributors. Is it just performance? I understand that arrays are all 1-D in memory, and that decoding the shape must take some extra processing time. Are there other considerations I'm missing?

I'm also looking for some insight into why the version field v exists.

Thanks! :)

kylecarow avatar Jun 06 '25 21:06 kylecarow

I think I can answer most of this, although I didn't write the serialization code.

On the JSON format: I think this is because our serialization implementation is generic over the Serializer, as most serde implementations are. We don't provide code specific to JSON; we just tell serde how any Serializer should interpret our data: as a linear sequence, with some "metadata" about the shape (and a version number). serde_json is then responsible for turning that into a JSON representation specifically.
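To make the mapping concrete, here's a rough sketch (plain Rust, no serde; the function name is mine, not ndarray's) of how the flat representation pairs shape metadata with a linear data sequence. The real implementation drives a generic Serializer instead of building a string by hand:

```rust
// Hypothetical illustration of the serialized structure: shape metadata,
// a version number, and the elements as one flat sequence. ndarray's real
// code delegates all formatting to serde's Serializer trait.
fn to_flat_json(dim: &[usize], data: &[f64]) -> String {
    let dims: Vec<String> = dim.iter().map(|d| d.to_string()).collect();
    let vals: Vec<String> = data.iter().map(|x| format!("{:.1}", x)).collect();
    format!(
        "{{\"v\":1,\"dim\":[{}],\"data\":[{}]}}",
        dims.join(","),
        vals.join(",")
    )
}

fn main() {
    let json = to_flat_json(&[3, 3], &[0., 1., 2., 3., 4., 5., 6., 7., 8.]);
    println!("{}", json);
    // → {"v":1,"dim":[3,3],"data":[0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]}
}
```

Because the structure is described abstractly (a map with a seq inside), the same Serialize impl works unchanged for JSON, CBOR, bincode, or any other serde backend.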

Yes, I think the cost of "discovering" the shape during deserialization could be significant. In particular, think of it this way: if we know the shape of the data independently, we can grab a block of uninitialized memory of the exact size we need and then iterate once through the data itself to fill it in.
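A minimal sketch of that shape-first path (plain Rust, my own function names, not ndarray's actual deserializer): with the shape in hand, you can allocate the exact buffer once and fill it in a single linear pass.

```rust
// Hedged illustration: shape-first deserialization. With `dim` known up
// front, we allocate exactly rows*cols elements once and copy linearly.
fn from_flat(dim: (usize, usize), flat: &[f64]) -> Vec<f64> {
    let (rows, cols) = dim;
    assert_eq!(rows * cols, flat.len(), "shape must match element count");
    let mut buf = Vec::with_capacity(rows * cols); // exact-size allocation
    buf.extend_from_slice(flat);                   // single linear fill
    buf
}

// Row-major (C-order) indexing into the flat buffer.
fn get(buf: &[f64], cols: usize, i: usize, j: usize) -> f64 {
    buf[i * cols + j]
}

fn main() {
    let a = from_flat((3, 3), &[0., 1., 2., 3., 4., 5., 6., 7., 8.]);
    assert_eq!(get(&a, 3, 1, 2), 5.0); // row 1, col 2
}
```

With the nested `[[...],[...]]` format, by contrast, the deserializer has to walk the nesting to infer the dimensions and check that every row has the same length before (or while) it can size the buffer.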

I think the version field is a sort of humble admission that we're not going to write "the one serializer to rule them all" on the first try. You can imagine a whole slew of optimizations and enhancements: recording whether the data is (approximately) C- or F-order, packing arrays of bool into a smaller representation with one bit per element, etc. Those changes may require a different "version" of the serialization format.
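A tiny sketch of what the `v` field buys you (hypothetical code, not ndarray's): a deserializer can dispatch on the version and cleanly reject layouts it doesn't understand, instead of silently misreading newer data.

```rust
// Hypothetical version gate. Version 1 is the current flat
// { v, dim, data } layout; future layouts (bit-packed bools,
// F-order data, ...) could claim new numbers.
fn check_version(v: u8) -> Result<(), String> {
    match v {
        1 => Ok(()),
        other => Err(format!("unsupported serialization version: {}", other)),
    }
}

fn main() {
    assert!(check_version(1).is_ok());
    assert!(check_version(2).is_err());
}
```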

Lmk if that answers your questions!

akern40 avatar Jun 07 '25 01:06 akern40