Corey James Levinson
YuriCat, can you add an arg to the train function that evaluates against a different agent? Like I want to evaluate against my model from 20 epochs ago to...
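Something like the helper below is what I have in mind; this is just an illustrative sketch, not actual HandyRL API (the helper name and `snapshot_path` are made up):

```python
import copy

import torch


def make_frozen_opponent(model, snapshot_path):
    # Load a frozen snapshot (e.g. the checkpoint from 20 epochs ago)
    # so the current model can be evaluated against it. Hypothetical
    # helper, not part of the real HandyRL train() signature.
    opponent = copy.deepcopy(model)
    opponent.load_state_dict(torch.load(snapshot_path, map_location="cpu"))
    opponent.eval()
    return opponent
```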
@mcarilli I just watched a video that says you can use FusedAdam, FusedSGD, etc. for a faster optimizer when using amp. How do we use this in native PyTorch 1.6...
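For context, this is roughly how a fused optimizer slots into the native amp loop; the model and data here are dummies, the apex import is optional, and `fused=True` on `torch.optim.AdamW` only exists in much newer PyTorch than 1.6:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(512, 10).cuda()              # dummy model/data
data = torch.randn(32, 512, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

# PyTorch 1.6 era: apex's fused optimizer, if apex is installed:
# from apex.optimizers import FusedAdam
# optimizer = FusedAdam(model.parameters(), lr=1e-3)

# Much newer PyTorch (~1.13+): core AdamW has a fused CUDA path built in.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)

scaler = GradScaler()
with autocast():
    loss = torch.nn.functional.cross_entropy(model(data), target)
scaler.scale(loss).backward()
scaler.step(optimizer)   # unscales grads, then calls optimizer.step()
scaler.update()
optimizer.zero_grad()
```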
Judging by the newness of this issue, maybe it's because bitsandbytes does not support CUDA 12.3? What if you downgrade CUDA?
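A quick sanity check before downgrading, just to confirm which CUDA build PyTorch is actually using (not a bitsandbytes-specific diagnostic):

```python
import torch

# Report the CUDA version PyTorch was built with, to judge whether a
# bitsandbytes / CUDA 12.3 mismatch is plausible before downgrading.
print(torch.version.cuda)        # e.g. "12.1"
print(torch.cuda.is_available())
```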
I have the same issue, “Removed shared tensor”. Transformers 4.35.2, using DeepSpeed on 1 GPU. Following the comments here, I disabled DeepSpeed and now it is saving correctly. I imagine...
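For anyone who wants the exact toggle, the workaround was simply not handing a DeepSpeed config to the Trainer; a minimal sketch, where `out` and `ds_config.json` are placeholder names:

```python
from transformers import TrainingArguments

use_deepspeed = False  # workaround: drop DeepSpeed so the Trainer saves normally

args = TrainingArguments(
    output_dir="out",  # placeholder
    deepspeed="ds_config.json" if use_deepspeed else None,  # placeholder config path
)
```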
Agree. I got the same issue when I just ran it on my 8-GPU instance with DeepSpeed. I even downgraded to 4.35.0 and still have the same issue. Basically, my...
By the way, in case it matters, I am using DeepSpeed ZeRO stage 0, but the Trainer only began to use fp16 and gradient checkpointing and so on when I...
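For reference, a sketch of how ZeRO stage 0 plus fp16 and gradient checkpointing can be passed through the HF Trainer; the config dict follows DeepSpeed's schema, `"auto"` lets the Trainer fill values in, and `out` is a placeholder:

```python
from transformers import TrainingArguments

# ZeRO stage 0 = no partitioning, basically DDP run through DeepSpeed's engine.
ds_config = {
    "zero_optimization": {"stage": 0},
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
}

args = TrainingArguments(
    output_dir="out",            # placeholder
    fp16=True,
    gradient_checkpointing=True,
    deepspeed=ds_config,
)
```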
Conversation_chain_handler.py L140: could this be changed from a simple log to raising an error? There is so much stuff being printed in the log that the average person would miss the warning.
No, it doesn’t have to do with train vs valid. Just use any CSV file, and in your config.yaml for training, type system="column_that_doesnt_exist". The code will still run, it will...
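To make the suggestion concrete, something along these lines; an illustrative sketch of the log-vs-raise change, not the actual conversation_chain_handler.py code:

```python
def check_configured_column(df_columns, configured_column):
    # Illustrative only, not the real conversation_chain_handler.py code.
    if configured_column not in df_columns:
        # Current behavior (easy to miss in a busy log):
        #     logger.warning("Column %r not found in the data.", configured_column)
        # Suggested behavior: fail loudly so the misconfiguration cannot be missed.
        raise ValueError(f"Column {configured_column!r} not found in the data.")
```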
Example: after epoch 1 it saves checkpoint_ep01.pth; after epoch 2 it saves checkpoint_ep02.pth. When loading the model back in according to the config, by default it will load sorted(glob("checkpoint_ep*"))[-1], aka the...
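In other words, the default load looks roughly like this (a sketch, not the exact repo code):

```python
from glob import glob

import torch

# Take the lexicographically last checkpoint, which with zero-padded names
# (checkpoint_ep01.pth, checkpoint_ep02.pth, ...) is the most recent epoch.
checkpoints = sorted(glob("checkpoint_ep*"))
if checkpoints:
    state_dict = torch.load(checkpoints[-1], map_location="cpu")
```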
> We didn't do that by default as model weights take a ton of disk space.
>
> We could theoretically make it a separate setting to additionally save all...