Heavenn
@sangmichaelxie It seems that the loss used for optimizing the proxy model in the code is different from the one described in the paper: `loss = (pertoken_loss * curr_domain_weights.detach()).sum()`...
First, thanks for your solid work! I am wondering whether the optimized domain weights are significantly related only to the tokenizer. If I use the same tokenizer and the domain...
# What does this PR do? First of all, thanks for your great work. Here is my personal understanding. If there are any mistakes, feel free to correct me! Regardless...