Yidan Wang
Is it necessary to fine-tune all parameters during training? Why does the loss explode when I use LoRA to fine-tune Llama 2 7B?
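For context, a minimal LoRA setup with the Hugging Face peft library might look like the sketch below. The target modules and hyperparameters are illustrative assumptions, not the authors' configuration; loss explosions with LoRA are often mitigated by a lower learning rate, gradient clipping, or bf16 instead of fp16.

```python
# Minimal LoRA fine-tuning sketch (hypothetical hyperparameters; tune the
# learning rate and enable gradient clipping if the loss explodes).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,  # bf16 tends to be more stable than fp16
)
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```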
Thank you for your work. I have the following questions to discuss with you: 1. Why does the loss mentioned in Equation 2 of the paper need to sum the...
I would like to inquire whether hash collisions may occur in the scheme proposed in this paper, given that the entire message space is mapped to a smaller space. If so, how can I...
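By the pigeonhole principle, any map from a larger message space into a smaller one must collide somewhere; how quickly collisions become likely can be estimated with the standard birthday bound, sketched below. The hash width and message counts are hypothetical numbers for illustration, not parameters from the paper.

```python
# Rough birthday-bound estimate of collision probability when k messages
# are mapped into a space of N = 2**bits values (illustrative numbers only).
import math

def collision_prob(k: int, bits: int) -> float:
    n = 2 ** bits
    # P(at least one collision) ~= 1 - exp(-k * (k - 1) / (2 * N))
    return 1.0 - math.exp(-k * (k - 1) / (2 * n))

print(collision_prob(k=10_000, bits=32))   # ~0.012: collisions already plausible
print(collision_prob(k=100_000, bits=32))  # ~0.69: collisions near-certain
```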
Why do two programs interfere with each other's speed when I run the LLaDA model on two A100 GPUs? For example, running LLaDA on A100 GPU 0 alone takes...
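One common cause of this kind of slowdown, offered here only as an assumption to check, is that both processes silently contend for the same device (or for shared CPU dataloader resources) rather than each using its own GPU. Pinning each process to a single device before importing torch, as sketched below, helps rule that out.

```python
# Pin this process to one GPU before importing torch, so two concurrent
# runs cannot contend for the same device (set "1" in the second process).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

device = torch.device("cuda:0")      # always index 0 within the visible set
print(torch.cuda.device_count())     # should report 1 if isolation worked
```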
Is `argmax` the only sampling method for the LLaDA model? Why are the model outputs sometimes filled with '\n'? How can I mitigate this effect?
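On the sampling question: greedy `argmax` is not the only option in principle. A generic temperature plus top-k sampling step over per-token logits is sketched below; this is a standard decoding technique written against plain PyTorch tensors, not LLaDA's official API, and the function name is illustrative.

```python
# Generic temperature + top-k sampling over a logits tensor, as an
# alternative to greedy argmax (illustrative, not LLaDA's official API).
import torch

def sample_tokens(logits: torch.Tensor,
                  temperature: float = 0.7,
                  top_k: int = 50) -> torch.Tensor:
    """logits: (..., vocab_size) -> sampled token ids of shape (...)."""
    logits = logits / max(temperature, 1e-5)
    if top_k > 0:
        # Keep only the top-k logits per position; mask the rest to -inf.
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    flat = probs.reshape(-1, probs.shape[-1])        # multinomial wants 2-D
    ids = torch.multinomial(flat, num_samples=1)
    return ids.reshape(probs.shape[:-1])
```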