Alan Gray

Results 28 comments of Alan Gray

Labelled as DRAFT since this will need some more testing across different models, CUDA versions, etc. before it is merged. See https://github.com/ggerganov/llama.cpp/issues/6763.

Thanks for these tests. I haven't yet optimized/tested for batch size greater than one - it might be a good idea for me to only enable CUDA graphs for size...

> `nodes`, `paramsDriver`, and `paramsRuntime` are being used across multiple calls of the function but their data is only loaded in an earlier call. Should they be static?

Good spot!...
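The reviewer's point above is about C++ storage duration: a non-static local is destroyed when the function returns, so data loaded in one call is gone by the next. A minimal sketch (with a hypothetical function and variable name, not the PR's actual code) of why `static` fixes this:

```cpp
#include <cassert>
#include <vector>

// Illustrative only: a static local is initialized once and keeps its
// contents across calls, so data loaded during an earlier call (e.g. a
// graph-capture pass) is still available in later calls.
int cached_node_count(bool load) {
    static std::vector<int> nodes;  // persists for the program's lifetime
    if (load) {
        nodes = {1, 2, 3};  // populate on the first ("loading") call
    }
    return static_cast<int>(nodes.size());
}
```

Without `static`, the second call would see an empty vector; with it, `cached_node_count(false)` still reports the data loaded earlier.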

@JohannesGaessler I think the llama-bench and perplexity issues should now be fixed with the latest commit - can you confirm from your end? Perplexity is slower with CUDA graphs ATM...

> It does not seem to work on the P40 and I cannot get it to compile on ROCm
> Tried to add ROCm HIP compatibility but it errors

The P40 issue may be...

@sorasoras thanks for testing. Can you let me know the exact command for which you are seeing a failure, so I can try and reproduce? I don't have access to...

OK, thanks. I've now disabled CUDA graphs for multi-GPU and batch size > 1, which should prevent these crashes and regressions (I can investigate those cases later). I can...

I've reproduced the llama-bench regression on Pascal (CC 6) and Volta (CC 7), so I've now added code to disable CUDA graphs for CC
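The comments above describe gating CUDA graph usage on several conditions: single GPU, batch size of one, and a minimum compute capability (the exact CC threshold is truncated in the comment, so the value below is an illustrative assumption). A hedged sketch of that gating logic, with a hypothetical function name, not the PR's actual code:

```cpp
#include <cassert>

// Illustrative sketch of the gating described in the PR discussion.
// CC_THRESHOLD is an assumption for illustration; the regressions were
// reproduced on Pascal (CC 6) and Volta (CC 7), so those are excluded.
constexpr int CC_THRESHOLD = 8;  // assumed cutoff, not from the PR

bool use_cuda_graph(int n_gpus, int batch_size, int compute_capability) {
    if (n_gpus > 1)     return false;  // multi-GPU: graphs disabled
    if (batch_size > 1) return false;  // batched decode: graphs disabled
    if (compute_capability < CC_THRESHOLD) return false;  // older GPUs regress
    return true;
}
```

The design choice is conservative: fall back to the regular stream-launch path in every case that crashed or regressed, and widen graph coverage later once each case is investigated.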

> I just noticed that instead of using Github's built-in draft feature you added "DRAFT:" to the title. Please let me know when you think the PR is ready for...

Thanks @slaren and @ggerganov. I'm not an expert on all the different usage possibilities, so I appreciate the guidance, and I'm more than happy to further improve robustness. I've now...