Results 4 comments of xyLu

I have the same two questions when using DeepSpeed inference: (1) It seems that `replace_with_kernel_inject=True` conflicts with `dtype=torch.int8` and causes "CUDA error: an illegal memory access was encountered". (2) With `replace_with_kernel_inject=False`,...

This is a very interesting issue, and I already know that I should call merge_and_unload() before generation with a LoRA-tuned causal language model such as LLaMA. And my new question...

> for PPL, you just need forward, I don't think you need to call the `generate` function Thank you. May I explain it like this: In evaluation, the forward() function...
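To illustrate the quoted point that perplexity only needs a forward pass, here is a minimal sketch in plain Python. It assumes a causal LM whose forward pass returns one row of logits per input position, with row t scoring the token at position t+1. The names `log_softmax`, `perplexity`, `step_logits`, and `token_ids` are hypothetical and not from any library.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def perplexity(step_logits, token_ids):
    """Perplexity from a single forward pass (no generation loop).

    Assumption: step_logits[t] are the logits the model produced at
    position t, i.e. its prediction for token_ids[t + 1]. The last row
    predicts a token we do not have, so it is dropped.
    """
    targets = token_ids[1:]
    rows = step_logits[:-1]
    nll = 0.0
    for logits, target in zip(rows, targets):
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(targets))


# Toy example with a vocabulary of 3 tokens and uniform logits:
# every token is equally likely, so perplexity equals the vocab size.
toy_ids = [0, 2, 1]
toy_logits = [[0.0, 0.0, 0.0] for _ in toy_ids]
uniform_ppl = perplexity(toy_logits, toy_ids)
```

With a real model the logits would come from one call to the model's forward function on the full sequence; `generate` is not involved because no new tokens are sampled.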