Muchen Li

7 comments by Muchen Li

I ran into those too; there seems to be a bug in DDAR.

Happens to me as well; there seems to be no trivial solution, because the package's CUDA dependency is pretty outdated.

I ran into a similar issue with a deepseek-math-7b model on an A100 80GB GPU; I cannot get things to work even with per_device_train_batch_size = 1 and gradient_accumulation_steps = 1. I suspect there...
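For reference, this is roughly the kind of launch configuration I mean; a minimal sketch assuming the standard HuggingFace Trainer, where the model id, output dir, and the memory-saving flags (gradient_checkpointing, bf16) are my own placeholders rather than the exact settings from the original script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Hypothetical minimal setup; the model id and output dir are placeholders,
# not the actual script from this repo.
model_name = "deepseek-ai/deepseek-math-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

args = TrainingArguments(
    output_dir="./out",                 # placeholder
    per_device_train_batch_size=1,      # smallest possible per-device batch
    gradient_accumulation_steps=1,      # no accumulation, to isolate the OOM
    gradient_checkpointing=True,        # trade compute for activation memory
    bf16=True,                          # half-precision forward/backward
    logging_steps=1,
)

# trainer = Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer)
# trainer.train()
```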

Hi, thanks so much for the reply! I'm pretty sure all the memory was eaten up by this single process. I did a very detailed verification and saw that the forward process...
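Roughly how I checked where the memory goes; a minimal sketch using PyTorch's built-in memory stats (report_gpu_memory is just a helper name I'm making up here, not part of any library):

```python
import torch

def report_gpu_memory(tag: str) -> None:
    # Print current and peak allocated GPU memory in GiB for the default device.
    alloc = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated: {alloc:.2f} GiB, peak: {peak:.2f} GiB")

# Usage around a single training step (model/batch come from your own loop):
# torch.cuda.reset_peak_memory_stats()
# report_gpu_memory("before forward")
# outputs = model(**batch)
# report_gpu_memory("after forward")
# outputs.loss.backward()
# report_gpu_memory("after backward")
```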

Actually, I tried zephyr-7b-sft-full with the original settings on my data; I was able to get training going with a per-device batch size of 8, but not with deepseek some...
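(For context, the effective global batch size is per-device batch × gradient accumulation steps × number of GPUs; a quick sketch, where the GPU count is an assumption about the setup, not something stated above:)

```python
# Effective global batch size; num_gpus is a placeholder, adjust to your setup.
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
num_gpus = 1
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 8
```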

Thanks for the pointer, I'll take a look.