pubby
Looks like operator+ and operator- were wrong. I've fixed them in bfa3b77 and it should work now. Thanks :+1:
Ha, whoops. That's correct.
I've been testing Q3_0 and found the performance was improved by representing data like this: ``` typedef struct { ggml_fp16_t d; uint16_t hi; // Highest bit, packed. uint32_t lo; //...
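To make the split-bit idea concrete, here is a minimal sketch of storing 3-bit values with the top bit and the low two bits in separate words. It is not the struct from the comment above, which is cut off: the 16-values-per-block size, the meaning of `lo` (assumed here to hold the two low bits of each value), and the names `q3_bits`, `q3_pack` and `q3_get` are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical bit layout for 16 three-bit values (assumed block size). */
typedef struct {
    uint16_t hi; /* bit i = highest bit of value i */
    uint32_t lo; /* bits 2i..2i+1 = low two bits of value i (assumption) */
} q3_bits;

/* Pack 16 values in the range 0..7 into the split layout. */
static q3_bits q3_pack(const uint8_t v[16]) {
    q3_bits b = {0, 0};
    for (int i = 0; i < 16; i++) {
        b.hi |= (uint16_t)(((v[i] >> 2) & 1u) << i);
        b.lo |= (uint32_t)(v[i] & 3u) << (2 * i);
    }
    return b;
}

/* Recover value i (0..15) from the split layout. */
static uint8_t q3_get(q3_bits b, int i) {
    uint8_t hi = (uint8_t)((b.hi >> i) & 1u);
    uint8_t lo = (uint8_t)((b.lo >> (2 * i)) & 3u);
    return (uint8_t)((hi << 2) | lo);
}
```

With a layout like this, the two-bit part can be unpacked with the same shifts and masks as a plain 2-bit format, and the high bits are merged in as a separate step. Whether that is where the measured speedup came from is not something this sketch shows.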
> Looking forward to anyone finding better SIMD optimizations, especially for Q3, which is a pain in the butt... I wrote about this at https://github.com/ggerganov/llama.cpp/issues/456#issuecomment-1507919345, but I think there's a...
Here's an attempt at porting `bytesFromCrumbs` to other architectures. I don't have these systems so they may be incorrect and/or slow. Improvements are more than welcome. ARM NEON: ``` static...
> Horizontal sums This is slightly faster on my alder lake CPU. Dunno if it's faster in general. ``` static inline float horizontalSum(__m256i a) { __m256i b = _mm256_castps_si256(_mm256_movehdup_ps(_mm256_castsi256_ps(a))); __m256i...
What do you think about having two separate arrays, one for qs and one for scales?
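For reference, a rough sketch of what the two layouts could look like. Everything here is an assumption for illustration: the block size `QK`, the 4-bit quants with an offset of 8, the `float` scale (the real code uses `ggml_fp16_t`), and the names `block_interleaved`, `tensor_split` and `dequant_split` are hypothetical, not existing ggml API.

```c
#include <stddef.h>
#include <stdint.h>

#define QK 16 /* values per block (assumed) */

/* Interleaved layout: each block stores its scale next to its quants. */
typedef struct {
    float   d;          /* scale; float only to keep the sketch portable */
    uint8_t qs[QK / 2]; /* 4-bit quants, two per byte */
} block_interleaved;

/* Split layout: scales and quants live in two separate contiguous arrays. */
typedef struct {
    float   *scales;   /* n_blocks entries */
    uint8_t *qs;       /* n_blocks * QK / 2 bytes */
    size_t   n_blocks;
} tensor_split;

/* Dequantize element i from the split layout. */
static float dequant_split(const tensor_split *t, size_t i) {
    size_t  block = i / QK;
    size_t  j     = i % QK;
    uint8_t byte  = t->qs[block * (QK / 2) + j / 2];
    int     q     = (j & 1) ? (byte >> 4) : (byte & 0x0F);
    return t->scales[block] * (float)(q - 8);
}
```

The split form keeps the quant bytes densely packed and naturally aligned for wide loads, at the cost of touching two memory streams per block. Which one is faster would need measuring.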