bitarray Any interest in adding a table based decode?

Hello. I've been enjoying your module and associated writings/docs! May I 👏 and say 'Well done!'?

I'm an R and Rcpp/c++ kinda guy who had never really looked at Python until just a few weeks ago and immediately had a use for your module and have found it easy to use and instructive but since I was exploring I obviously was profiling a ton while learning and found two things that I thought might be of interest.

First let me present some benchmark numbers...

*** Large Sample Size (Symbols: 12, Max Code Length: 6, In Bits: 132074, Out Symbols: 42858)

Clock type: CPU
Ordered by: ttot, asc

name                                                                    ncall            tsub      ttot      tavg      
DecodeTableTest.py:81 ba_C_iterdecode                                   10000            0.017626  0.098321  0.000010
DecodeTableTest.py:89 ba_C_decode                                       10000            0.022404  0.111367  0.000011
DecodeTableTest.py:139 baTo01_Py_lookupDict                             10000            0.693772  0.715359  0.000072
DecodeTableTest.py:122 baTo01_Py_lookupTbl_builtin_int                  10000            0.863217  0.885142  0.000089
DecodeTableTest.py:177 rawBytes_Py_lookupTbl_intfrombytes_memoryview    10000            2.719999  4.090594  0.000409
DecodeTableTest.py:195 ba_Py_lookupTbl_builtin_int_baMemoryviewTo01     10000            3.397645  4.807507  0.000481
DecodeTableTest.py:155 ba_Py_lookupTbl_builtin_int_shift                10000            3.415027  4.837891  0.000484
DecodeTableTest.py:97 ba_Py_iterdecode                                  10000            0.033506  13.49097  0.001349
DecodeTableTest.py:105 ba_Py_lookupTbl_ba2int                           10000            3.263269  33.08122  0.003308
DecodeTableTest.py:211 ba_Py_lookupTbl_ba2int_baMemoryview              10000            3.923426  33.38520  0.003339

(where ba is bitarray, and baTo01 is the string representation and the table lookups are list lookups of symbol/freq tuples and the memoryview is on the first max code length bits but shifting the whole underlying bitarray)

My first interest was in using a table lookup for the decode instead of a tree so I figured if I compared the Python iterdecode() to the C I could get a feel for the gains to be had going from Python to the underlying C and as you already know its a massive improvement (ba_Py_iterdecode) and I am thinking that even if a table lookup is a fraction of that improvement then it could possibly blaze past the C tree decodes.

I haven't looked into the Python C bindings modules yet and that is my next step (or, preferably, C++ bindings) but if you are interested in it then you know your code and where it'd plug in. I was thinking along the lines of a .decode(decodeTable) method where the decodeTable would be treated in the same fashion as the decode dict / decodeTree setup.

The other thing I wanted to mention, and you can see in the benchmarks, was ba2int() is slow. I see what it is doing with the checks and padding and such but thought perhaps I am missing something or perhaps it is something added for convenience that hasn't had usage enough to decide if it is worth working on to make it faster... 🤷

/home/thell/Projects/HuffmanTestPy/bit2intTesting.py:19 t_bitarray_ba2int_frombytes                   10000            0.020761  0.094871  0.000009
/home/thell/Projects/HuffmanTestPy/bit2intTesting.py:13 t_bitarray_ba2int_fromstrings                 10000            0.013938  0.077259  0.000008
/home/thell/Projects/HuffmanTestPy/bit2intTesting.py:41 t_builtin_int_frombytes                       10000            0.011092  0.016091  0.000002
/home/thell/Projects/HuffmanTestPy/bit2intTesting.py:36 t_builtin_int_fromstrings                     10000            0.006036  0.006036  0.000001

Nov 24 '21 07:11 Thell

Thank you for your interest in bitarray! When you say "table based decode" I assume that this implies a fixed length prefix. The reason I'm using a binary tree structure for decoding is because this allows variable length prefix trees. For large enough bitarrays, a C implementation of table based decoding will almost certainly be faster than the current tree based implementation. How did you implement ba_Py_iterdecode? If I had to do it, I would used a Python dict mapping frozenbitarrays to symbols (basically the reverse of the code dict).

Regarding ba2int(): Yes, there are a lot of checks which are necessary to handle all the different cases that can arise. However, these should be quite fast compared to the actual translation into integers (which is one on the C level using the int.from_bytes() function.

Nov 25 '21 06:11 ilanschnell

Left aligned variable length codes with padding to get the appropriate table indexes while building.

shiftLen = maxCodeLength - currCodeLength
startPos1 = vInt << shiftLen # Left-Align 0 pad
endPos1 = startPos1 | ((1 << shiftLen) - 1) # Left-Align 1 pad

Its seems we both think a table lookup could be a benefit over traversing the tree in some cases so I'll give it a go and we can see later if it is something that might be worth adding in some form or fashion. Why not, eh? 🙂

The ba_Py_iterdecode is yours from the Huffman example but only the decode time. All of the sample data (SD) prep was done outside of the timing so everything is comparable on the decode steps. So

  dt = SD.DecodeTree
  msg_str = ''.join(inBuf[:lenBuf].iterdecode(dt))

-vs-

  dt = SD.DecodeTree
  msg_str = ''.join(inBuf[:lenBuf].decode(dt))

-vs-

  dt = SD.PyDecodeTree
  msg_str = ''.join(baPyHuff.iterdecode(dt, inBuf[:lenBuf]))

where baPyHuff is import bitarrayHuffmanExample as baPyHuff (from the example file).

On the ba2int() I can appreciate the required checks but it is a pretty massive cumulative hit...

From: baTo01_Py_lookupTbl_builtin_int                  10000            0.863217  0.885142
  To: ba_Py_lookupTbl_ba2int                           10000            3.263269  33.08122

for only changing

From: code_index = int(inBuf[inBuf_index:inBuf_index + maxPrefixLen], 2) # inBuf is to01() string
  To: code_index = ba2int(inBuf[inBuf_index:inBuf_index + maxPrefixLen]) # inBuf is bitarray

but I get it... I bet those checks are pretty much required for lots of use cases.

Nov 25 '21 20:11 Thell

Hi again Ilan, So, I cloned your repo yesterday and started poking around while looking into the Python C bindings and such. While I really enjoy working with bitarray on the python side I realize this is not the road I wish to travel. I've been there and done that too many times. And I'm sure you'll agree it usually isn't all that fruitful getting submissions from someone just figuring out the api calls and supported version details and project goals. Working on that and the conversions isn't something I want to do for c. I think that instead of doing that for hopeful inclusion to the module I'll be happier and not take more of your time by doing what the contributing doc eludes to... which is an extension that depends on using bitarray.

There are a few c++ binding projects that hide the nitty gritty api details and I've been interested in diving into rust where it seems the pyo3 project is of the same mindset as R's rcpp project (where the api essentially becomes a non-issue and you can focus on solutions). So, perhaps I'll see where that road goes and contribute an 'example' pull request if the timings support it. :)

Thanks again for such a useful and easy to use module. 👍

Nov 29 '21 17:11 Thell

Ilan, I made a couple of Rust functions to match what the lookup table and lookup dict test functions do and while the rust ones are notably faster than the python versions the C versions are still way ahead of them.

name                                                                    ncall            tsub      ttot      tavg      
HuffmanDecodeTest.py:121 ba_C_decode                                    10000            0.023838  0.105012  0.000011
HuffmanDecodeTest.py:113 ba_C_iterdecode                                10000            0.021914  0.117057  0.000012
HuffmanDecodeTest.py:187 baTo01_Rust_lookupTbl                          10000            0.012419  0.246104  0.000025
HuffmanDecodeTest.py:195 baTo01_Rust_lookupDict                         10000            0.021548  0.481968  0.000048
HuffmanDecodeTest.py:154 baTo01_Py_lookupDict                           10000            0.724718  0.747388  0.000075
HuffmanDecodeTest.py:171 baTo01_Py_lookupTbl_builtin_int                10000            0.862049  0.884257  0.000088
HuffmanDecodeTest.py:224 rawBytes_Py_lookupTbl_intfrombytes_memoryview  10000            2.914654  4.352324  0.000435
HuffmanDecodeTest.py:242 ba_Py_lookupTbl_builtin_int_baMemoryviewTo01   10000            3.581334  5.039527  0.000504
HuffmanDecodeTest.py:203 ba_Py_lookupTbl_builtin_int_shift              10000            3.592218  5.057208  0.000506
HuffmanDecodeTest.py:129 ba_Py_iterdecode                               10000            0.038460  14.13969  0.001414
HuffmanDecodeTest.py:137 ba_Py_lookupTbl_ba2int                         10000            3.217130  34.10895  0.003411
HuffmanDecodeTest.py:258 ba_Py_lookupTbl_ba2int_baMemoryview            10000            4.405727  36.71794  0.003672

I guess the overhead is just too much because the tsub times on both rust functions are fast. I'm willing to be I could speed up the rust functions with a bit more knowledge but it looks like even if I did they'd likely still not outperform the C functions given my sample datasets.

Just an FYI... working with pyo3 was a joy compared to trying to figure out the C conventions. Just a few decorators and it is still very similar to the python code which is pretty cool.

use pyo3::prelude::*;

/// Simple Huffman Table Decode
#[pyfunction]
fn decode(in_buf: &str, lookup_tbl: Vec<(String, usize)>, max_code_len: usize, out_len: usize) -> PyResult<String> {
    let mut out_buf = Vec::with_capacity(out_len);
    let mut pos: usize = 0;
    for _i in 0..out_len {
        let code_index = usize::from_str_radix(&in_buf[pos..(pos + max_code_len)], 2).unwrap();
        pos += lookup_tbl[code_index].1;
        out_buf.push(lookup_tbl[code_index].0.clone());
    }
    let result = out_buf.join("");
    Ok(result)
}
/// A Python module implemented in Rust.
#[pymodule]
fn rs_decode(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(decode, m)?)?;
    Ok(())
}

compared to my test python function

def baTo01_Py_lookupTbl_builtin_int(SD, *args):
  inBuf = SD.ba01EncodedMessage
  lookupTbl = SD.tblSymLenLookup
  maxPrefixLen = SD.maxCodeLength
  outLen = SD.DecodedMessageBytesLen

  outBuf = [None] * outLen
  inBuf_index = 0
  for i in range(outLen):
    code_index = int(inBuf[inBuf_index:inBuf_index + maxPrefixLen], 2)
    entry = lookupTbl[code_index]
    inBuf_index += entry[1]
    outBuf[i] = entry[0]
  result = ''.join(outBuf)
  return result

Dec 01 '21 00:12 Thell