YunHao Yang comments

Results 8 comments of


                                            YunHao Yang

Sharing training log of 7B model on A6000 x 4

I use the same command on 4 x 48g a6000, I got OOM error

CUDA out of memory error. How to calculate or what am I doing wrong?

The memory of 2070 is too small, you need about 12g of memory

在使用多图像数据微调kimi-vl时训练卡死

@Kuangdd01 Thank for your reply，this is the yaml I used： ``` ### model model_name_or_path: /mnt/workspace/yangyunhao/Kimi-VL-A3B-Instruct trust_remote_code: true ### method stage: sft do_train: true finetuning_type: full freeze_vision_tower: true freeze_multi_modal_projector: true freeze_language_model:...

在使用多图像数据微调kimi-vl时训练卡死

Thanks, I'll try that

在使用多图像数据微调kimi-vl时训练卡死

> > Sorry for the late reply, I have reproduced this issue. It is a common issue when using dsz3 for a moe model, for example, [deepspeedai/DeepSpeed#5066](https://github.com/deepspeedai/DeepSpeed/issues/5066). > > To...

在使用多图像数据微调kimi-vl时训练卡死

@Kuangdd01 This is the DeepseekV3MoE code I modified, I'm not sure if it is correct： ```python class DeepseekV3MoE(nn.Module): """ A mixed expert module containing shared experts. """ def __init__(self, config):...

在使用多图像数据微调kimi-vl时训练卡死

@Kuangdd01 I found that in the batch of data that caused the stuck, there were differences in image_grid_hws on different ranks. Could this be the problem? ===== DEBUG: Input Keys...

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1580, OpType=ALLREDUCE, NumelIn=466119168, NumelOut=466119168, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.

是不是使用mllm-demo的数据了，使用这个数据集微调kimi-vl时会出现GPU利用率100%而且卡死的情况，不知道是不是对多图像数据的支持有问题，再删除掉数据里的多余图像后就正常了