llama.cpp
opencl: fix rms_norm_mul
The rms_norm_mul kernel produces incorrect results when ne00 = 768. This PR changes how the kernel performs the reduction that computes the sum, which seems to fix the issue.
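For reviewers who want a host-side reference to compare against, here is a minimal sketch of one common work-group reduction pattern, emulated in plain C. It is not the kernel code from this PR. The names `row_sum_sq` and `MAX_WG_SIZE`, the power-of-two work-group size, and the strided accumulation followed by a tree reduction are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WG_SIZE 256 /* hypothetical upper bound on the work-group size */

/*
 * Sum of squares of one row of length ne00, emulating a work-group of
 * wg_size work-items on the host. Illustrative only, not the PR's kernel.
 */
static float row_sum_sq(const float *x, size_t ne00, size_t wg_size) {
    float scratch[MAX_WG_SIZE]; /* stands in for __local memory */

    assert(wg_size > 0 && wg_size <= MAX_WG_SIZE);
    assert((wg_size & (wg_size - 1)) == 0); /* tree step assumes a power of two */

    /* Phase 1: work-item lid accumulates x[lid], x[lid + wg_size], ...
     * The loop bound handles any ne00, including values that are not a
     * multiple of wg_size. */
    for (size_t lid = 0; lid < wg_size; ++lid) {
        float acc = 0.0f;
        for (size_t i = lid; i < ne00; i += wg_size) {
            acc += x[i] * x[i];
        }
        scratch[lid] = acc;
    }

    /* Phase 2: tree reduction over the partial sums. In a real kernel a
     * barrier(CLK_LOCAL_MEM_FENCE) would separate the steps. */
    for (size_t stride = wg_size / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid) {
            scratch[lid] += scratch[lid + stride];
        }
    }

    return scratch[0];
}
```

Calling `row_sum_sq(x, 768, 256)` on a known row and comparing against the device output is one way to check the ne00 = 768 case. Floating-point summation order differs between implementations, so a small tolerance is advisable for general inputs.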