Alan Gray
Labelled as DRAFT since this will need some more testing across different models, CUDA versions, etc. before it is merged. See https://github.com/ggerganov/llama.cpp/issues/6763.
Thanks for these tests. I haven't yet optimized or tested for batch sizes greater than one - it might be a good idea for me to only enable CUDA graphs for size...
> `nodes`, `paramsDriver`, and `paramsRuntime` are being used across multiple calls of the function, but their data is only loaded in an earlier call. Should they be static?

Good spot!...
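The issue in the quoted review comment can be illustrated with a minimal sketch (names and logic are illustrative stand-ins, not llama.cpp's actual graph-update code): buffers that are populated on an early call and then reused for in-place patching on later calls must be `static` (or live in a persistent context struct), otherwise they are destroyed when the function returns.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for cudaGraphNode_t / kernel-parameter buffers.
// Declared static so the data extracted on the first call survives into
// later calls, which only patch the existing entries. As plain locals,
// these would be re-created empty on every call.
static std::vector<int> nodes;
static bool params_loaded = false;

int update_graph(int n) {
    if (!params_loaded) {
        // expensive one-time extraction of nodes from the captured graph
        for (int i = 0; i < n; ++i) {
            nodes.push_back(i);
        }
        params_loaded = true;
    }
    // subsequent calls reuse (patch) the previously loaded entries
    return (int) nodes.size();
}
```

With function-local (non-static) vectors, the second call would see empty buffers and either re-do the expensive extraction or patch nothing.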
@JohannesGaessler I think the llama-bench and perplexity issues should now be fixed with the latest commit - can you confirm from your end? Perplexity is slower with CUDA graphs at the moment...
> It does not seem to work on P40 and I cannot get it to compile on ROCm.
> Tried to add ROCm HIP compatibility but it errors.

The P40 issue may be...
@sorasoras thanks for testing. Can you let me know the exact command for which you are seeing a failure, so I can try to reproduce it? I don't have access to...
OK thanks. I've now disabled CUDA graphs for multi-GPU and batch size > 1, which should prevent these crashes and regressions (I can investigate these cases later). I can...
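The gating described here can be sketched as a simple predicate (a hedged illustration; the function and parameter names are assumptions, not llama.cpp's actual code): fall back to the regular, non-graph execution path whenever the configuration is one that hasn't been validated yet.

```cpp
#include <cassert>

// Illustrative sketch of the fallback logic: only use CUDA graphs for
// the single-GPU, batch-size-1 decode path; every other configuration
// takes the regular kernel-launch path until it has been investigated.
bool cuda_graphs_enabled(int n_gpus, int batch_size) {
    if (n_gpus > 1) {
        return false; // multi-GPU path not yet supported with graphs
    }
    if (batch_size > 1) {
        return false; // regressions/crashes observed for batched decode
    }
    return true;
}
```

Keeping the check in one place makes it easy to relax the conditions later as each case is validated.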
I've reproduced the llama-bench regression on Pascal (CC 6) and Volta (CC 7), so I've now added code to disable CUDA graphs for CC
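A compute-capability gate along these lines could look like the sketch below. The comment above is truncated, so the actual cutoff used in the PR is not stated here; `MIN_CC_MAJOR` is an assumed placeholder chosen only so that the Pascal (CC 6.x) and Volta (CC 7.x) devices mentioned above are excluded.

```cpp
#include <cassert>

// MIN_CC_MAJOR is an assumption for illustration, not the PR's value:
// it is set just high enough to exclude the architectures where the
// llama-bench regression was reproduced (Pascal CC 6, Volta CC 7).
const int MIN_CC_MAJOR = 8;

// In real code cc_major would come from the device properties
// (e.g. cudaDeviceProp::major); it is a plain parameter here so the
// sketch compiles without the CUDA toolkit.
bool cuda_graphs_supported(int cc_major) {
    return cc_major >= MIN_CC_MAJOR;
}
```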
> I just noticed that instead of using Github's built-in draft feature you added "DRAFT:" to the title. Please let me know when you think the PR is ready for...
Thanks @slaren and @ggerganov. I'm not an expert on all the different usage possibilities, so I appreciate the guidance, and I'm more than happy to further improve robustness. I've now...