
Question about full-parameter finetuning

Open dydxdt opened this issue 1 year ago • 2 comments

Thanks for your great work! I have a question about the training arguments: is max_steps=10000 appropriate for full-parameter finetuning?

I use my own training dataset for full-parameter finetuning; it has around 240,000 samples covering three different tasks (caption, OCR, ...). After training with the default settings, the training log shows "epoch: 0.32", meaning only about a third of the training data was used. I then trained with num_train_epochs=5 (the same as Qwen) instead of max_steps, but the 5-epoch model performs worse than the 10000-step model on my caption test set, even though the loss looks normal. Can you give some advice for this situation? Thanks!

10000 steps (corresponding to the red line; ignore the blue line): [screenshot: training loss curve]

~5 epochs: [screenshot: training loss curve]

dydxdt avatar Jun 18 '24 08:06 dydxdt
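For reference, the "epoch: 0.32" figure follows directly from the effective batch size: with max_steps fixed, the fraction of the dataset seen is max_steps divided by steps per epoch. A minimal sketch in plain Python, with hypothetical values for per-device batch size, gradient accumulation, and GPU count (the issue does not state them):

```python
# Rough arithmetic for how far max_steps gets through a dataset.
# The batch-size numbers below are hypothetical; plug in your own.
dataset_size = 240_000          # samples in the finetuning set
per_device_batch_size = 2       # assumed; check your training args
grad_accum_steps = 1            # assumed
num_gpus = 4                    # assumed

effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
steps_per_epoch = dataset_size / effective_batch

max_steps = 10_000
epochs_covered = max_steps / steps_per_epoch
print(f"steps per epoch: {steps_per_epoch:.0f}")          # 30000
print(f"epochs covered by max_steps={max_steps}: {epochs_covered:.2f}")
# With these assumed values: 10000 / 30000 -> ~0.33 epochs,
# consistent with the "epoch: 0.32" seen in the log.
```

Any configuration with an effective batch size of 8 reproduces the reported figure; with a different effective batch size, the arithmetic adjusts accordingly.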

How many GPUs did you use for full-parameter finetuning? I tried both 2 V100s and 4 V100s, and neither worked.

1SingleFeng avatar Jun 20 '24 03:06 1SingleFeng

When I run full-parameter finetuning, the console doesn't print any loss information. Has anyone run into this?

todaydeath avatar Jun 29 '24 09:06 todaydeath
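If the run uses a HuggingFace Transformers Trainer (as the MiniCPM-V finetune scripts do), loss lines only reach the console when step logging is enabled, and under a multi-GPU launch only rank 0 prints. A minimal sketch of the relevant TrainingArguments, with illustrative values:

```python
from transformers import TrainingArguments

# Loss is printed to the console only when step logging is enabled;
# these are the standard Transformers knobs (values are illustrative).
args = TrainingArguments(
    output_dir="output",        # placeholder path
    logging_strategy="steps",   # log every `logging_steps` steps
    logging_steps=10,
    report_to="none",           # or "tensorboard"/"wandb" for loss curves
)
```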

How many GPUs did you use for full-parameter finetuning? I tried both 2 V100s and 4 V100s, and neither worked.

For full-parameter finetuning you probably need 8 V100s.

LDLINGLINGLING avatar Jul 04 '24 09:07 LDLINGLINGLING
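That GPU count is consistent with a back-of-envelope memory estimate for full-parameter finetuning with mixed-precision AdamW. A rough sketch, assuming an ~8B-parameter model (e.g., MiniCPM-Llama3-V 2.5) and ignoring activations and fragmentation:

```python
# Back-of-envelope GPU memory for full finetuning with AdamW in mixed
# precision (fp16 weights/grads + fp32 master weights and optimizer
# moments), activations excluded. All numbers are rough estimates.
params = 8e9                 # ~8B parameters (assumed model size)

bytes_per_param = (
    2        # fp16 weights
    + 2      # fp16 gradients
    + 4      # fp32 master copy of the weights
    + 4 + 4  # fp32 Adam first and second moments
)
total_gb = params * bytes_per_param / 1024**3
print(f"model + optimizer states: ~{total_gb:.0f} GB")  # ~119 GB

# Sharded across GPUs with ZeRO, plus activations and overhead:
v100_mem_gb = 32
for n in (2, 4, 8):
    print(f"{n} x V100 (32 GB): {n * v100_mem_gb} GB total")
# 2 or 4 V100s (64/128 GB) leave little or no headroom for
# activations, while 8 V100s (256 GB) can fit states plus activations.
```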

(quoting the original question and loss-curve screenshots above)

The screenshots you posted only show the training loss, which cannot reflect the real downstream effect, so looking at the loss alone is not meaningful.

LDLINGLINGLING avatar Jul 04 '24 09:07 LDLINGLINGLING
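Concretely, the comparison that matters is a held-out metric per checkpoint, not the training loss. A minimal sketch, assuming captions have already been generated for each checkpoint, using BLEU via sacrebleu as a stand-in for whatever caption metric you normally report (all names and strings below are hypothetical):

```python
import sacrebleu

# Compare checkpoints by a held-out caption metric instead of train
# loss. `predictions_by_ckpt` maps a checkpoint name to its generated
# captions, aligned index-by-index with `references`.
references = ["a dog running on the beach", "a red car parked outside"]
predictions_by_ckpt = {
    "step-10000": ["a dog runs on the beach", "a red car parked outside"],
    "epoch-5":    ["a dog on sand", "red car outside"],
}

for name, preds in predictions_by_ckpt.items():
    bleu = sacrebleu.corpus_bleu(preds, [references])
    print(f"{name}: BLEU = {bleu.score:.1f}")
```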