Question about top-p sampling
Hello, thanks for sharing your code. It is really helpful.
I noticed there is a hyperparameter top-p; the code is here. When we run decode, this hyperparameter is set to -1, so we don't actually use "top-p sampling".
I still wonder what it is for. Did you use it in your experiments? If we do use it, what is an appropriate value? Could you please provide further details, or point me to any relevant literature that would help me understand it better?
Thank you in advance for your assistance.
Hi, we didn't use top-p sampling in our experiments. During sampling, we compute the logits for each token, so you can apply top-p sampling or beam search on top of them. These sampling strategies can be borrowed directly from generation with autoregressive (AR) models, and you're free to try them. Honestly, though, top-p or beam search may not help as much as you might expect. It is still worth trying and investigating carefully.
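For reference, top-p is also known as nucleus sampling (Holtzman et al., 2020): keep the smallest set of highest-probability tokens whose cumulative probability reaches top-p, mask the rest, and sample from what remains. Below is a minimal sketch of that idea in plain Python, not the code from this repository. The names `top_p_filter` and `sample_token` are hypothetical, and treating any value outside (0, 1), such as the default -1, as "disabled" is an assumption based on the description above.

```python
import math
import random


def top_p_filter(logits, top_p):
    """Return a copy of `logits` with tokens outside the nucleus set to -inf.

    Assumption: a top_p outside (0, 1), e.g. -1, means "no filtering".
    """
    if not 0.0 < top_p < 1.0:
        return list(logits)

    # Numerically stable softmax over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least probable and keep them until the
    # probability mass already kept reaches top_p.
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    filtered = [-math.inf] * len(logits)
    mass = 0.0
    for i in order:
        if mass >= top_p:
            break
        filtered[i] = logits[i]
        mass += probs[i]
    return filtered


def sample_token(logits, top_p=-1.0, rng=random):
    """Sample one token index from `logits`, optionally with top-p filtering."""
    filtered = top_p_filter(logits, top_p)
    m = max(filtered)
    # exp(-inf) is 0.0, so masked tokens can never be drawn.
    weights = [math.exp(x - m) for x in filtered]
    return rng.choices(range(len(filtered)), weights=weights, k=1)[0]
```

With token probabilities 0.5, 0.3, 0.15, 0.05 and top_p = 0.7, only the first two tokens stay in the nucleus, since 0.5 alone is below 0.7 and 0.5 + 0.3 reaches it. The most probable token is always kept, so the filtered distribution is never empty.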