young_chao
When I run the example job with `python ./examples/pipeline/demo/pipeline-mini-demo.py`, the corresponding logs appear in the fateflow logs folder, but no logs are generated in the fateboard folder. Opening 127.0.0.1:8080 in the browser does not work; there is no response.
### Describe the bug When I run the example text_to_image.py, I get the problem shown in the logs. I'm pretty sure I have it configured and running as described in the readme.md...
### 🐛 Describe the bug ### Environment OS: ubuntu 20.04, GPU: 4 x A10, python==3.9.0, torch==1.13.1-cu116, colossalai==0.2.5 ### Command `python train_reward_model.py --pretrain "bigscience/bloom-560m" --lora_rank 16`
### 🐛 Describe the bug Although [ht-zhou](https://github.com/ht-zhou) said that the LoRA problem has been fixed, according to the latest code and experimental tests of ColossalAI, LoRA still does not support...
According to batch_size_table.md, from 144 = 48 x 3 (the 144 comes from batch_size_table.md and the 48 x 3 from bench_suite.py), I gather that in FlexGen the effective batch size is the product of num-gpu-batches and gpu-batch-size....
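To make the arithmetic concrete, here is a minimal sketch of how I currently read that relationship; the flag names `--gpu-batch-size` and `--num-gpu-batches` and the exact values are my assumption based on bench_suite.py, not something confirmed in the docs:

```python
# My understanding (assumption): FlexGen's effective batch size per iteration
# is the per-GPU micro-batch (--gpu-batch-size) times the number of
# micro-batches pipelined per iteration (--num-gpu-batches).
gpu_batch_size = 48      # value I see in bench_suite.py (assumed)
num_gpu_batches = 3      # value I see in bench_suite.py (assumed)

effective_batch_size = gpu_batch_size * num_gpu_batches
assert effective_batch_size == 144  # matches the 144 in batch_size_table.md
print(effective_batch_size)
```

If this reading is wrong, a pointer to where the 144 in the table actually comes from would be appreciated.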
root@b787722dc2e1:/workspace/workfile/Projects/chatllama# python artifacts/main.py artifacts/config/config.yaml --type ACTOR
Current device used: cuda
local_rank: -1
world_size: -1
Traceback (most recent call last):
  File "/workspace/workfile/Projects/chatllama/artifacts/main.py", line 50, in
    actor_trainer = ActorTrainer(config.actor)
  File "/usr/local/lib/python3.9/site-packages/chatllama/rlhf/actor.py", line...
In the July post on the official WeChat public account blog, there is a description of a Chinese code-generation evaluation set: the current benchmark sets for code-generation tasks are all in English, so to objectively evaluate the model's ability to generate code from Chinese instructions, code-related questions were collected from Stack Overflow and other Chinese Q&A communities. These questions cover multiple programming languages including Python, C++, Java, PHP, and SQL, and span code bug fixing, code generation, code Q&A, and code explanation, 75 questions in total. A case-by-case human evaluation was performed on this set: for code-generation questions, passing is judged by whether the code runs correctly; for code Q&A and explanation answers, passing is judged by correctness and comprehensiveness. However, this evaluation set does not seem to be open-sourced? If it is, please provide a link. I could not find it on Hugging Face or GitHub.
**Describe the bug** When I load the model from checkpoint and continue training, it always hangs during the validation process. It should be noted that it only hangs when the...
I have tried many benchmarks, including implementing my own, and I think BestAnswer Evaluation is the most reliable recently for evaluating long context capability of LLM, especially in terms of...
When training with other frameworks (Megatron-LM, DeepSpeed), the batch size that already accounts for gradient accumulation is usually treated as the true per-step batch size, and the number of training steps is derived from it. Judging from the step count displayed during xtuner training, xtuner's step logic clearly does not work this way. This causes a problem: when I run multiple runs with models of different sizes at the same time, their gradient-accumulation values differ, so the step counts of the runs cannot be aligned, which makes viewing them on wandb a poor experience. Is there a setting that guarantees my step count is the step count after accounting for gradient accumulation? See the sketch below for the accounting I have in mind.
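For clarity, a small sketch of the step accounting I am asking for; all numbers are hypothetical and this is only my assumption of the convention Megatron-LM / DeepSpeed follow (one optimizer step counts all accumulation micro-batches as a single training step):

```python
# Hypothetical numbers for illustration only.
num_samples = 100_000          # dataset size
micro_batch_size = 4           # per-device batch size
grad_accum_steps = 8           # gradient accumulation
world_size = 8                 # number of GPUs

# Effective (global) batch size per optimizer step.
global_batch_size = micro_batch_size * grad_accum_steps * world_size   # 256

# Steps per epoch counted per optimizer step, not per micro-batch,
# so runs with different grad_accum_steps but the same global batch size align.
steps_per_epoch = num_samples // global_batch_size                      # 390
print(global_batch_size, steps_per_epoch)
```

With this convention, two runs that use different accumulation values but the same global batch size would log the same step count on wandb, which is what I would like xtuner to support.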