unbounded comments

Results: 23 comments by unbounded

Continuous layouts for quantization q4_0c

Did some performance testing on M1 and updated the q4_2c branch. Now I see a performance win here as well. q4_2: ``` llama_print_timings: prompt eval time = 592.93 ms /...

Continuous layouts for quantization q4_0c

Some miscellaneous performance observations: I saw no performance difference using 64-byte aligned loads with AVX-512. Prefetching instructions give no benefit at all on M1 processors. On other CPUs they helped...

Variable bit rate quantization

> My guess is that a locally adaptive variable bit rate would require a major change to ggml. The easiest way I can think of would be to vary allocation...