xyLu
I have the same two questions when using DeepSpeed inference: (1) It seems that `replace_with_kernel_inject=True` conflicts with `dtype=torch.int8` and causes "CUDA error: an illegal memory access was encountered". (2) With `replace_with_kernel_inject=False`,...
This is a very interesting issue. I already know that I should call `merge_and_unload()` before generation with a LoRA-tuned causal language model such as LLaMA. And my new question...
> for PPL, you just need forward, I don't think you need to call the `generate` function

Thank you. May I explain it like this: in evaluation, the forward() function...
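
To make the quoted point concrete, here is a minimal sketch of how perplexity can be computed from the logits of a single forward pass, with no call to `generate`. It is plain Python rather than the actual model code: `log_softmax` and `perplexity` are hypothetical helper names, and the per-position logits are assumed to be what a causal LM's forward pass would return.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity(logits_per_position, token_ids):
    """Perplexity of token_ids given forward-pass logits (hypothetical helper).

    Assumption: logits_per_position[t] are the logits a causal LM produces
    at position t, i.e. its prediction for token_ids[t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to score a prediction")
    nll = 0.0
    count = 0
    # Shift by one: position t predicts the token at position t + 1.
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        nll -= log_probs[token_ids[t + 1]]
        count += 1
    # Perplexity is exp of the mean negative log-likelihood.
    return math.exp(nll / count)
```

With uniform logits over a 4-token vocabulary, every next token has probability 1/4, so the function returns a perplexity of 4. In a real evaluation the logits would come from one forward call on the full sequence, which is why no token-by-token generation is needed.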