Ramesh Arvind

3 comments by Ramesh Arvind

I'm not sure if you had the same issue, but when I tried to resume a DeepSpeed run, it would try to load the right checkpoint but fail to find...

You can try following the [suggestion](https://github.com/huggingface/transformers/issues/17258#issuecomment-1128905263) of adding more memory via a swap file. We faced a similar issue before, and that resolved it.

Running into the same issue for user-defined KTO datasets as well.