bitreader icon indicating copy to clipboard operation
bitreader copied to clipboard

Little endian support

Open leoschwarz opened this issue 8 years ago • 5 comments

One of the main pieces still missing from this crate is support for little endian values. I made a first sketch of what this could look like, but I'm open to any critique (and obviously it would require testing which is completely lacking as of now and probably some performance improvements):

https://github.com/leoschwarz/opensim-networking/blob/20858c6e0c8c0f5de3b0f05778b3e1513d377538/src/layer_data/reader.rs

The main problem I had with the API wasn't the addition of a (generic) parameter specifying endianness to the methods, but also the fact that values can be padded from both the left and right with zeros, before applying the byte reordering. (i.e. say you read 9 bits 1 0010 and want that as a u16 it could either be 0001 0010 or 1001 0000, and when you swap the bytes this makes a difference) That's why I ended up taking two generic arguments for the read_part_u32 method. Obviously the read_full_* and read_part_* methods could be combined but then every call would involve supplying two generic arguments which is really verbose. One alternative is taking only one generic argument, which implements both traits, and then to define types implementing both traits.

One alternative to the calling read_u8 all the time would be to read into the target type directly, skip the byteorder crate, and have our own logic to call swap_bytes on the values if needed. (This would have to check the requested ordering and probably the host ordering too.)

leoschwarz avatar Nov 10 '17 21:11 leoschwarz

Does the OpenSimulator protocol really use that complicated encoding, that the bits have to be shifted, or padded as you phrase it, before doing endianness conversion?

I honestly don't have particularly wide experience in bit-level stuff in general - I ended up making this crate because I was interested in the MPEG-2 format, which seems pretty friendly in the sense that big endian support is enough for pretty much all parts of it I came across.

Anyway, I suppose one reason for strange requirements to extract a value could be that some original implementation has just cast a piece of memory into a struct of u32, u16 and other wider types while assuming a little endian platform, and only then smaller bitfields are extracted with shifts and masking.

irauta avatar Nov 12 '17 19:11 irauta

I'm not sure yet what all the specifics of the "protocol" (honestly it's more of a combination of an actual protocol usually called "LLUDP" and many ressource encodings) entail, but for one decoder I have already encountered 10 bit long (and at times dynamic size) unsigned integers, which tend to be encoded in little endian and not big endian. I cannot really say yet what the exact need is since I haven't been able to perform a successful test of the decoder I'm working on the last couple days. :')

But anyway I'll probably write my own Reader/Decoder helper anyway, if I end up needing even more specific things, maybe it would be better to see what kind of little endian support other people need, if there's any such demand at all.

leoschwarz avatar Nov 12 '17 21:11 leoschwarz

I was thinking about my earlier comment, and the assumption that the endianness comes from byte level of the source data. The following may have some fatal error in it. In a big endian encoding a 14 bit value could span three bytes like this (just a contrived example of a hard case; the numbers are there just to show from which byte each bit comes and x marks an irrelevant bit):

xxxxxxx1 22222222 33333xxx

But if the bytes were in little endian order, the same bits would be laid down on the consecutive bytes like this:

33333xxx 22222222 xxxxxxx1

That is not a problem at all, if the sequence of bytes is cast into a u32 (just imagine an xxxxxxxx as a most significant byte after the three bytes to fill out the entire u32). Then extracting the value would only require a >> 3 and a &. However, it's hard to imagine how the big-endian-by-default BitReader should act when read_u32_little_endian(14) were called - where should the cursor end up after the call?

I don't mean to discourage you by any means, but little-endian reading just started to seem a lot harder to handle.

irauta avatar Nov 13 '17 20:11 irauta

I think what I requested is very application specific, but at the same time the functionality of this crate also kind of is. There really are many points where bit encoding might be done differently.

One thing I'm still thinking about (but it's rather somewhere on my backlog → right now I'm going to use the code that I linked, which was wrong btw but I fixed it in the must ugly way right now) is whether someone could use the type system to generate a reader type with many configuration options, which would hopefully benefit from compile time optimizations. Or the easier option would be to provide the reader with a Config struct instance carrying information about the encoding which would be applied at run time, this would also allow the construction through a builder struct, however it's not zero cost for users of the current implementation.

Configuration like that could allow for parameters like cursor_follow: enum { Bit | Byte } and then it would clear "where the cursor ends up after the call". Or maybe I misunderstood the exact problem you mentioned.

leoschwarz avatar Nov 14 '17 17:11 leoschwarz

I guess the main problem with little-endian data is that there needs to be some higher-level type around than just the slice of bytes and the bits within them - endianness makes only sense for multi-byte types.

After pondering about this, I came up with a potential way to tell to the reader that the upcoming bits come from a type stored in little-endian order. Here's an imaginary use case with lots of comments:

// The "reader" variable is a BitReader, in a usable state
// Just some regular method call for context, this is just BitReader doing what
// it normally does:
let some_value = reader.read_u8(8)?;
   
// To read bits in little-endian order you need to create a "sub-reader",
// that's valid only for a small amount of bits:
let le = reader.read_le_32()?;
// "le" can provide up to 32 bits (the alternatives would be 16 and 64),
// read from a u32 variable stored in little-endian order in the byte slice.
// Also read_le_* takes the reader as a mutable reference, so the main
// reader is not available during its existence

// Reading bits with it looks as expected, it's just that source data is
// interpreted differently
let le_value = le.read_u16(14)?;
let another_le_value = le.read_u8(8)?;
// ...and so on, the little-endian reader provides similar methods as BitReader

// To stop using the sub-reader and gain back control of the main BitReader, you need
// to "close" it:
le.close();
// "close" can't fail, and will always increment the cursor position by
// 32 bits (or 16 or 64, depending on the originally requested amount)

The close method could return a &mut BitReader, to allow chaining:

let le = reader.read_le_32()?;
let x = le.read_u8(num_bits)?;
let y = le.read_u8(more_bits)?;
// ...and more calls to le.read_* here, up to 32 bits, until finally another sub-reader
// is set up to interpret the next 32 bits in little-endian order:
let le = le.close().read_le_32()?;
let z = le.read_u16(bits_never_end)?;

For another perspective, some fn signatures in pseudo-Rust:

impl BitReader {
    // The return types actually should have some lifetime parameter too,
    // or did lifetime elision work here too, can't remember right now.
    fn read_le_16(&mut self) -> LittleEndianReader16;
    fn read_le_32(&mut self) -> LittleEndianReader32;
    fn read_le_64(&mut self) -> LittleEndianReader64;
}

impl LittleEndianReader32 {
    fn read_u8(&mut self, bits: u8) -> BitReaderResult;
    fn read_u16(&mut self, bits: u8) -> BitReaderResult;
    // etc. for all the types

    // Consumes self, advances the BitReader cursor to the end of 32-bit segment
    fn close(self) -> &mut BitReader;
    // Maybe there should be a version that *doesn't* advance the cursor, to allow "peeking"
    // into the upcoming data
    fn cancel(self) -> &mut BitReader;
}

irauta avatar Nov 17 '17 19:11 irauta