acf icon indicating copy to clipboard operation
acf copied to clipboard

Suggestions for improving code performance

Open xjb714 opened this issue 3 months ago • 0 comments

uint64_t
encode_ten_thousands(uint64_t hi, uint64_t lo)
{
    uint64_t merged = hi | (lo << 32);
    /* Truncate division by 100: 10486 / 2**20 ~= 1/100. */
    uint64_t top = ((merged * 10486ULL) >> 20) & ((0x7FULL << 32) | 0x7FULL);
    /* Trailing 2 digits in the 1e4 chunks. */
    uint64_t bot = merged - 100ULL * top;
    uint64_t hundreds;
    uint64_t tens;

    /*
     * We now have 4 radix-100 digits in little-endian order, each
     * in its own 16 bit area.
     */
    hundreds = (bot << 16) + top;

    /* Divide and mod by 10 all 4 radix-100 digits in parallel. */
    tens = (hundreds * 103ULL) >> 10;
    tens &= (0xFULL << 48) | (0xFULL << 32) | (0xFULL << 16) | 0xFULL;
    //tens += (hundreds - 10ULL * tens) << 8;
    tens = (hundreds << 8) - 2559ull * tens;

    return tens;
}

Explanation:

decimal: a*10+b=x
BCD: a+256*b=a+256*(x-a*10)=a+256*x-2560*x=256*x-2559*a=256*x-2559*(x/10)

I haven't conducted performance tests, but the above implementation has fewer instructions and fewer instruction dependencies, so there should be some performance improvement.

xjb714 avatar Nov 04 '25 18:11 xjb714