Zhihao Lin comments

Results: 94 comments of Zhihao Lin

TypeError: scaled_dot_product_attention() got an unexpected keyword argument 'training' in <xtuner.engine.hooks.evaluate_chat_hook.EvaluateChatHook object at 0x7f1ce9adfe20>

@QB-Chen Please update your xtuner library: this error has been fixed by https://github.com/InternLM/xtuner/pull/513

requests version conflict

@JiBingdong The torchattacks library is not a dependency of xtuner. You could uninstall it, or create a new virtual environment dedicated to xtuner.

requests version conflict

That is indeed the problem. It is probably an issue with conda's package management or configuration; could you try reinstalling conda?

Potential boundary error in utils.py

The possibility indeed exists (although image tokens typically do not appear as the last input_id, because of the prompt template). Thank you for your feedback!

RuntimeError: Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.

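@lesjie-wen Hi! Judging from the log, there are two things you can try: 1. Check whether data processing took longer than 30 minutes (the log suggests only ~15 minutes, but it is still worth checking). By default, xtuner forcibly exits once data processing exceeds 30 minutes, to avoid certain unknown errors. You can change this timeout (in minutes) by setting the environment variable `XTUNER_DATASET_TIMEOUT`, for example `XTUNER_DATASET_TIMEOUT=120 xtuner train xxx`. 2. If case 1 does not apply, host memory may have run out (OOM) during data processing. You can monitor how memory usage changes during the data processing stage.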
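For the second suggestion, a minimal sketch of one way to watch host memory while data processing runs. It is not part of xtuner: the function names are hypothetical, and it assumes a Linux machine where `/proc/meminfo` is available.

```python
import time


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(meminfo: dict) -> float:
    """Memory the kernel estimates is available to new workloads, in GiB."""
    return meminfo["MemAvailable"] / (1024 ** 2)


def watch(interval_s: float = 5.0, path: str = "/proc/meminfo") -> None:
    """Print available host memory every `interval_s` seconds (Linux only)."""
    while True:
        with open(path) as f:
            gib = available_gib(parse_meminfo(f.read()))
        print(f"{time.strftime('%H:%M:%S')} available: {gib:.1f} GiB", flush=True)
        time.sleep(interval_s)
```

Calling `watch()` in a second terminal while `xtuner train` is in its data processing stage shows whether available memory keeps shrinking toward zero before the failure; stop it with Ctrl+C.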

Does xtuner have any support for using RAG?

Not at the moment. A course on agent fine-tuning will be released soon; you can follow https://space.bilibili.com/1293512903 and the corresponding documentation https://github.com/InternLM/Tutorial/blob/camp2/xtuner/readme.md#part-3-agent-微调模型函数调用能力

Can i finetune Qwen1.5-72B-Chat-GPTQ-Int4 model, and also convert&merge as a GPTQ-Int4 model directly?

@iFe1er No, we cannot fine-tune the GPTQ models. If you don't have enough memory resources, you should consider using QLoRA, which also applies 4-bit quantization to the LLM.
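To see why 4-bit quantization matters at this scale, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. The 72 billion parameter count is an assumption read off the model name, and the estimate ignores quantization constants, LoRA adapter weights, activations, and optimizer states, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 72 billion parameters, taken from the model name.
NUM_PARAMS = 72e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit weights, as in QLoRA

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f" 4-bit weights: {int4_gb:.0f} GB")
```

Under these assumptions the base weights shrink by a factor of four when stored in 4 bits instead of 16.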

Can i finetune Qwen1.5-72B-Chat-GPTQ-Int4 model, and also convert&merge as a GPTQ-Int4 model directly?

@iFe1er Hi! Sorry for the delayed response. > that the GPU usages is not halfed even i am using two gpus. Yes, because the training is data parallel and each...

grad_norm: nan appears when reproducing the official tutorial

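Could you try training for a few hundred more iterations? With some models and datasets, grad_norm can indeed be NaN for a while, but it usually returns to normal after a few hundred iterations.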
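One way to apply this rule of thumb is to check whether NaN values still occur after an initial window. The sketch below is only an illustration: it assumes you have already collected the per-iteration grad_norm values into a list (how you extract them from your logs is up to you), and the helper names and the default window of 500 iterations are hypothetical.

```python
import math


def last_nan_iter(grad_norms):
    """Return the index of the last NaN grad_norm, or None if there is none."""
    last = None
    for i, value in enumerate(grad_norms):
        if math.isnan(value):
            last = i
    return last


def nan_persists(grad_norms, warmup_iters=500):
    """True if a NaN grad_norm still occurs at or after `warmup_iters`."""
    last = last_nan_iter(grad_norms)
    return last is not None and last >= warmup_iters


# Toy history: NaN for the first three iterations, finite afterwards.
history = [float("nan")] * 3 + [1.5, 0.8]
print(last_nan_iter(history))
print(nan_persists(history, warmup_iters=3))
```

If `nan_persists` is still true well past the first few hundred iterations, the NaN is less likely to be a transient start-up effect and the data or configuration deserves a closer look.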

grad_norm: nan appears when reproducing the official tutorial

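@kaisersama112 You could train on an open-source dataset such as alpaca and check whether a similar problem occurs. If training on the open-source dataset is normal, the dataset you constructed may have some problems. https://github.com/InternLM/xtuner/blob/main/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_alpaca_e3.py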