YunHao Yang

8 comments by YunHao Yang

I used the same command on 4 × 48 GB A6000 GPUs and got an OOM error.

The 2070's memory is too small; you need about 12 GB of GPU memory.

@Kuangdd01 Thank you for your reply. This is the YAML I used:

```yaml
### model
model_name_or_path: /mnt/workspace/yangyunhao/Kimi-VL-A3B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: ...
```

> > Sorry for the late reply. I have reproduced this issue. It is a common issue when using DeepSpeed ZeRO-3 (dsz3) with an MoE model; see, for example, [deepspeedai/DeepSpeed#5066](https://github.com/deepspeedai/DeepSpeed/issues/5066).
> >
> > To...
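A commonly cited explanation for ZeRO-3 hangs on MoE layers (an assumption here; the quoted reply is truncated before giving its reasoning) is that ZeRO-3 all-gathers each expert's sharded parameters on demand, and a collective only completes when every rank issues the same calls in the same order. With sparse routing, ranks that send tokens to different experts issue different gather sequences and block each other. The toy simulation below is not DeepSpeed code; `gather_sequence` and `collectives_match` are hypothetical names, and the "run every expert" flag stands in for the usual workaround of visiting all experts even when some receive no tokens.

```python
from typing import List, Set


def gather_sequence(routed_experts: Set[int], num_experts: int, run_all_experts: bool) -> List[str]:
    """Ordered list of (simulated) all-gather calls one rank would issue."""
    calls = []
    for expert_id in range(num_experts):
        # Sparse routing skips experts that got no tokens on this rank;
        # the workaround visits every expert, possibly with an empty batch.
        if run_all_experts or expert_id in routed_experts:
            calls.append(f"all_gather(expert_{expert_id})")
    return calls


def collectives_match(per_rank_routing: List[Set[int]], num_experts: int, run_all_experts: bool) -> bool:
    """True if every rank would issue an identical sequence of collectives."""
    sequences = [gather_sequence(r, num_experts, run_all_experts) for r in per_rank_routing]
    return all(seq == sequences[0] for seq in sequences)


if __name__ == "__main__":
    # Hypothetical routing: rank 0 uses experts {0, 2}, rank 1 uses {1, 2}.
    routing = [{0, 2}, {1, 2}]
    print(collectives_match(routing, num_experts=4, run_all_experts=False))  # False: ranks would block
    print(collectives_match(routing, num_experts=4, run_all_experts=True))   # True
```

DeepSpeed also offers `deepspeed.utils.set_z3_leaf_modules`, which marks a module class (such as an MoE block) as a ZeRO-3 leaf so its parameters are fetched as a unit; whether that is the fix intended for Kimi-VL's `DeepseekV3MoE` is not confirmed in this thread.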

@Kuangdd01 This is the DeepseekV3MoE code I modified; I'm not sure whether it is correct:

```python
from torch import nn


class DeepseekV3MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config):
        ...
```

@Kuangdd01 I found that in the batch that caused the hang, `image_grid_hws` differed across ranks. Could this be the problem?

===== DEBUG: Input Keys...
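Some per-rank difference in `image_grid_hws` is expected in data-parallel training, since each rank sees different samples, so a more useful check may be whether ranks disagree on how many images they process at the hanging step. The hypothetical helpers below summarize per-rank grids; they assume each rank's `image_grid_hws` has been converted to a list of `(h, w)` pairs and collected on one process, for example with `torch.distributed.all_gather_object(gathered, local_grids)` where `gathered = [None] * world_size`.

```python
from typing import List, Sequence, Tuple

Grid = Tuple[int, int]  # one (h, w) entry per image, as assumed above


def summarize_ranks(per_rank_grids: List[Sequence[Grid]]) -> List[dict]:
    """One summary row per rank: image count and total patches (sum of h * w)."""
    return [
        {"rank": rank, "num_images": len(grids), "total_patches": sum(h * w for h, w in grids)}
        for rank, grids in enumerate(per_rank_grids)
    ]


def image_counts_diverge(per_rank_grids: List[Sequence[Grid]]) -> bool:
    """True if ranks disagree on how many images their batches contain."""
    return len({len(grids) for grids in per_rank_grids}) > 1


if __name__ == "__main__":
    # Hypothetical grids: rank 0 has one image, rank 1 has two.
    gathered = [[(28, 28)], [(28, 28), (14, 42)]]
    for row in summarize_ranks(gathered):
        print(row)
    print(image_counts_diverge(gathered))
```

Logging these summaries for a hanging step next to a healthy one is one way to test the hypothesis; the helpers only report differences and do not establish that they cause the hang.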

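Are you using the mllm-demo data? When fine-tuning Kimi-VL on this dataset, GPU utilization goes to 100% and training hangs. I'm not sure whether multi-image data support is the problem, but after I removed the extra images from the data, training ran normally.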
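One way to reproduce that workaround without editing the original file is to drop multi-image samples, as in the sketch below. It assumes the LLaMA-Factory sharegpt-style layout used by `mllm_demo.json`, where each sample has `messages` and `images` keys; `keep_single_image` and `filter_file` are hypothetical names. Dropping whole samples, rather than deleting individual images, avoids leaving `<image>` placeholders in `messages` that no longer match the `images` list.

```python
import json
from typing import List


def keep_single_image(samples: List[dict]) -> List[dict]:
    """Keep only samples with at most one image (text-only samples are kept)."""
    return [s for s in samples if len(s.get("images", [])) <= 1]


def filter_file(src: str, dst: str) -> int:
    """Write the single-image subset of `src` to `dst`; return how many samples were kept."""
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    filtered = keep_single_image(data)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(filtered, f, ensure_ascii=False, indent=2)
    return len(filtered)
```

For example, `filter_file("data/mllm_demo.json", "data/mllm_demo_single.json")` (hypothetical paths) produces a new file, which would still need its own entry in `dataset_info.json` before it can be used for training.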