Port lib0 encoding/decoding
Background
We want to port encoding/decoding features of lib0 into OctoBase. The existing lib0 implementation is buggy and unsafe to use it in production. And we don't see any optimization for that in their roadmap.
So let's just start porting our own high-performance and production-ready lib0 encoding/decoding features in OctoBase.
### Steps
- [x] basic type/ybinary decoding - [nom](https://docs.rs/nom/latest/nom/) & [byteorder](https://docs.rs/byteorder)
- [x] basic type encoding - [nom](https://docs.rs/nom/latest/nom/) & [byteorder](https://docs.rs/byteorder)
- [ ] https://github.com/toeverything/OctoBase/issues/401
- [ ] https://github.com/toeverything/OctoBase/issues/402
Overview
lib0/encode provides two types of encoding methods:
- write an integer of explicit byte length、write integers to variable-length bytes、write float to bytes as big endian
- composite type
- write a variable-length int for the tag type
- write length if need(string、stringified json etc.)
- write the corresponding type of bytes
So in fact, what we encode and decode is a sequence containing multiple types
Design
I plan to implement a port of the lib0 part using nom for decoding and byteorder for encoding.
Since lib0 only provides codecs for multi-type sequences and does not provide specific structural abstractions, I do not plan to introduce libraries such as serde at this level, but introduce them in the port of yjs.
Testing
At present, y-crdt has implemented a lib0 version. I plan to use it as a result reference to fuzzing, and write a test case comparing the results of js version lib0 for each type of encoding.
cc @Brooooooklyn @zuoxiaodong0815
The following pseudo-code describes how to read a ydoc (read method without ytype) VarXxx means a read_var_xxx function, accept &[u8] and consume part of it, return new &[u8] and parsing results
1. readUpdateV2
a. readClientStructRefs
i. numOfStateUpdates = VarUint
ii. Loop numOfStateUpdates
1) numOfStructs = VarUint
2) client = VarUint
3) clock = VarUint
4) Loop numOfStructs
a) info = Uint8
b) First5Bit = 0b00011111 & info
c) Switch First5Bit
i) Case0 // GC
1. len = VarUint
2. Break
ii) Case10 // Skip
1. len = VarUint
2. Break
iii) Default
1. cantCopyParentInfo = 0b1100000 & info
2. If info & 0b10000000 == 0b10000000 then
– leftId = Id {VarUint, VarUint}
3. End
4. If info & 0b01000000 == 0b01000000 then
– rightId = Id {VarUint, VarUint}
5. End
6. If VarUint = 1 then
– doc.get(VarString)
7. Else
– leftId = Id {VarUint, VarUint}
8. End
9. If info & 0b00100000 == 0b00100000 then
– VarString
10. End
11. readContentByType(First5Bit)
b. readAndApplyDeleteSet
i. numClients = VarUint
ii. Loop numClients
1) client = VarUint
2) numberOfDeletes = VarUint
3) Loop numberOfDeletes
a) clock = VarUint
b) clockLen = VarUint
Ybinary structure analysis
Currently, I have implemented the parser of ybinary(update) in https://github.com/toeverything/OctoBase/pull/389.
There are several known problems, which I will write here for future improvements to the format:
-
format without analytic expressions
In mathematics, a solution that can be expressed by a analytic expression is called a analytic solution, and I use it to describe a file format: if a file format can read any data in a file container with a fixed logic, I call it has analytic expression. MKV is a typical format that has a analytic expression, you can read any data stored in MKV with a fixed read logic.
ybinary, on the other hand, has not analytic expression. If you want to read ybinary, you first need to read multiple variable-length integers, which describe how many times you need to loop through the item, and then there is a lot of variable-length integer read/write logic in the process of reading the item, which means that if any one byte is corrupted in the ybinary, the whole ybinary will no longer be readable, which greatly increases the risk of store ybinary for a long time.
-
can only be read in full
ybinary is stateful: ybinary does not store the complete crdt item state, some state depends on the state in the previously loaded crdt item, which means if you want to read or write a crdt item in ybinary, you have to read the whole ybinary into memory.
-
No file or value level self-checking
As mentioned earlier, ybinary has no analytic expression, which means that any byte corruption in binary will result in the corruption of the whole binary. Since binary does not contain checksum on the value, we cannot know whether ybinary is corrupted and from where until we read it.
-
Complexity of decoding
ybinary uses a lot of variable-length integer encoding to save the values, which consumes a lot of cpu resources when coding and decoding.
in contrast, fixed-length binary numbers can be decoded at high speed on any platform (including js runtime, using dataview)
the variable-length integer encoding used by ybinary contains conditional judgments in the encoding and decoding logic, which makes it several times slower than fixed-length binary numbers even if it is optimized on the native platform.
