
short multiplication

Open kimstik opened this issue 1 year ago • 17 comments

        for (uint32_t j = 0; j < 8; j++) {
            int32_t in=*activations_idx++;
            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            sum += tmpsum;                                  // sign*in*1
            if (weightChunk & 0x40000000) sum += tmpsum<<3; // sign*in*8
            if (weightChunk & 0x20000000) sum += tmpsum<<2; // sign*in*4
            if (weightChunk & 0x10000000) sum += tmpsum<<1; // sign*in*2
            weightChunk <<= 4;
        }

I have some concerns about the line `sum += tmpsum;`. Is it perhaps redundant? If the weight element is in SDDD format, then the sum of three conditional shifted adds should be sufficient. Here's how I would rewrite the relevant section:

            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            if (weightChunk & 0x10000000) sum += tmpsum; // sign*in*1
            if (weightChunk & 0x20000000) sum += tmpsum<<1; // sign*in*2
            if (weightChunk & 0x40000000) sum += tmpsum<<2; // sign*in*4
            weightChunk <<= 4;

upd: It appears that I overlooked the fact that the value 0 is not used...

kimstik avatar Apr 29 '24 16:04 kimstik

Yes, this is to scale the numbers so that the distance between the first positive and first negative element is the same as the distance between the first and second.

e.g. the two-bit encoding is -3, -1, 1, 3, or equivalently -1.5, -0.5, 0.5, 1.5, with a constant delta.

With two's complement the encoding would be asymmetric: -2, -1, 0, 1. With one's complement without the shift, you would get either a double zero (-1, 0, 0, 1) or a double step around zero (-2, -1, 1, 2).

The encoding is more regular that way and there are no redundant codes. The weights are centered symmetrically around zero for most layers, so you don't want to have a huge gap there.

The impact of this encoding on network capacity and loss is minimal, though. QAT is able to redistribute quantization errors fascinatingly well.

cpldcpu avatar Apr 29 '24 18:04 cpldcpu

Did you notice that Clang generates linear code without jumps, at the cost of a slight size increase? They seem to mention static branch prediction. It would be interesting to find out how expensive conditional branches really are on the QingKe RISC-V2A ...

kimstik avatar Apr 30 '24 16:04 kimstik

I didn't check Clang specifically, but there were quite a few differences in code generation between optimization options. To be honest, I thought the RV32EC code was a bit cumbersome. The only bit that can be checked without masking is the MSB.

In the end, the biggest impact on execution speed was moving the inference code to SRAM, though.

cpldcpu avatar Apr 30 '24 17:04 cpldcpu

Interesting. So, this device has both flash memory and a processor on a single chip? That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.

kimstik avatar May 01 '24 10:05 kimstik

I achieved branchless linear code by processing two activations simultaneously in GCC:

// W packs pairs of 4-bit weights (S1,D12,D11,D10),(S2,D22,D21,D20) into bytes by bit-interleaving: (S1,S2,D10,D20,D11,D21,D12,D22)
static inline int vmac8x4(uint32_t *act, int32_t W) {
    static const int M[4]   = { 0x00000000, 0x00000fff, 0x0fff0000, 0x0fff0fff, };
    static const int ONE[4] = { 0x00000000, 0x00000001, 0x00010000, 0x00010001, };	//FIXME: perhaps we may get rid of this part for sign application
    int acc = 0;
    for (uint32_t j = 0; j < 2; j++) {
        uint32_t in = *act++;
        for (uint32_t k = 0; k < 2; k++) {
            int tmp = (M[(W>>(32-2))&0x3] ^ (in & 0x00ff00ff)) + ONE[(W>>(32-2))&0x3];    //    process bytes 1/3
                        acc += tmp;
            tmp<<=1;    acc += tmp ^ M[(W>>(32-2))&0x3];
            tmp<<=1;    acc += tmp ^ M[(W>>(32-4))&0x3];
            tmp<<=1;    acc += tmp ^ M[(W>>(32-6))&0x3];
            W  <<=8;
            in >>=8;    // repeat for bytes 0/2
        }
    }
    return (acc + (acc>>16))&0xfff;
}

Just a prototype. It needs to be tested.

kimstik avatar May 01 '24 11:05 kimstik

> Interesting. So, this device has both flash memory and a processor on a single chip? That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.

Indeed, see here. https://cpldcpu.wordpress.com/2024/05/01/decapsulating-the-ch32v203-reveals-a-separate-flash-die/

Quite surprised to see this even in a $0.30 MCU. But the Ch32v003 is monolithic.

cpldcpu avatar May 01 '24 12:05 cpldcpu

> I achieved branchless linear code by processing two activations simultaneously in GCC:

Nice! I thought about this, but never got around to actually doing it. I wonder how well RV32EC deals with the lookup tables?

Btw, I don't know whether you saw the FP1.3.0 encoding: https://github.com/cpldcpu/BitNetMCU/blob/e730e598b994379426927259ad8ccf22c7311ce7/BitNetMCU_inference.c#L129

Need to write up something about it. I almost managed to get it to the same accuracy as the 4 bit encoding. It's insane that it works so well.

This may work even better when parallelized.

cpldcpu avatar May 01 '24 12:05 cpldcpu

Nonlinear FP1.3.0 is nice, but running it in parallel doesn't look straightforward yet. It's strange that you didn't give it more attention, as it uses only 15 out of 16 states: there are +0/-0 shifts present.

2-bit parallel multiplication should be very interesting.

kimstik avatar May 01 '24 14:05 kimstik

> It's strange that you didn't give it much attention, as it uses only 15 out of 16 states. There are shifts of +0/-0 present.

For an exponent of zero, the sign encodes +1 and -1, so all states are used. The entropy of the weights is still not optimal, though: after training, not all codes are used by the model in all layers.

cpldcpu avatar May 01 '24 17:05 cpldcpu

8 instructions per iteration of FP1.3.0 is so beautiful:

.L28:
	slli	t1,a4,8
	sll	a2,s1,a2
	srai	a1,t1,28
	lb	s1,2(a3)
	add	a2,a2,s0
	andi	a1,a1,7
	bge	t1,zero,.L29
	neg	s1,s1
.L29:
	slli	s0,a4,12
	sll	a1,s1,a1
	srai	t1,s0,28
	add	a1,a1,a2
	lb	s1,3(a3)
	andi	a2,t1,7
	bge	s0,zero,.L30
	neg	s1,s1
.L30:
	slli	s0,a4,16
	sll	a2,s1,a2
	srai	t1,s0,28
	lb	s1,4(a3)
	add	a2,a2,a1
	andi	t1,t1,7
	bge	s0,zero,.L31
	neg	s1,s1
.L31:

kimstik avatar May 02 '24 14:05 kimstik

I managed to get a 4-in-1 activation convolution working with 5-bit input quantization. However, it provides almost no performance benefit ;((( It still takes around 106 instructions for 8 inputs.

kimstik avatar Jul 29 '24 09:07 kimstik

It should be possible to parallelize 4-bit quants with mul? 2 or 4 at a time?

cpldcpu avatar Aug 02 '24 06:08 cpldcpu

Yes, with mul it should be much simpler. Would you like to switch to a chip with mul?

kimstik avatar Aug 04 '24 11:08 kimstik

The successors of the CH32V003 (V002 etc) will all have mul.

cpldcpu avatar Aug 04 '24 11:08 cpldcpu

Ough! I missed it! Is the V2C mul 1-cycle? Hope the next successor will add Zcmp.

kimstik avatar Aug 04 '24 20:08 kimstik

I think it is 1 cycle. They only added the mul instruction, nothing else.

cpldcpu avatar Aug 06 '24 18:08 cpldcpu

It should be possible to implement SIMD processing, for example 6-bit activations and 2-bit weights with 4 in parallel. The weights could be unpacked into 32 bits with an 8-bit table lookup.

cpldcpu avatar Aug 06 '24 18:08 cpldcpu