胡译文
You should install via the HTTP protocol: simply replace "git+git://" with "git+http://".
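For example, a pip install line would change like this (the repository URL below is just a placeholder):

```
pip install "git+git://github.com/example/project.git"    # before
pip install "git+http://github.com/example/project.git"   # after
```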
Hello @xylian86, I was previously using the HF Trainer. Why doesn't the universal checkpoint support the HF Trainer? Is there any way to load the universal checkpoint? Do I...
Hi @krrishdholakia, I found this PR very useful and would love to see it merged. Are there any updates or changes needed? Thanks for your hard work!
How big is the inference speed difference compared with a Transformer-architecture model of the same size?
Thank you for replying! I moved the papers to another group library and everything worked fine.
Here's my deepspeed config json:
```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": ...
```
Another related issue: https://github.com/microsoft/DeepSpeed/issues/5405
Hello @ArthurZucker and @muellerz. I am able to create a pull request to address the issue. I resolved it by deleting all the `rng_state` files, as it had...
We can skip these `rng_state` files and add a warning.
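Something like this might work when resuming (a minimal sketch with a hypothetical helper, not the actual Trainer code path):

```python
import glob
import os
import warnings

import torch


def maybe_load_rng_states(checkpoint_dir: str):
    """Hypothetical helper: restore RNG states only if the files exist.

    If no rng_state*.pth files are found in the checkpoint directory,
    skip restoring them and emit a warning instead of raising.
    """
    rng_files = sorted(glob.glob(os.path.join(checkpoint_dir, "rng_state*.pth")))
    if not rng_files:
        warnings.warn(
            f"No rng_state files found in {checkpoint_dir}; "
            "resuming without restoring RNG state (results may not be bit-exact)."
        )
        return None
    return [torch.load(path, map_location="cpu") for path in rng_files]
```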
I’m currently working on an MoE model and looking to implement expert parallelism. Writing EP/EP+TP/EP+DP from scratch with torch.distributed communication is pretty challenging, especially if I want good training...
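For context on what "from scratch" involves, the core EP communication step is usually an all-to-all over the expert-parallel group; here is a minimal sketch (the shapes, names, and `ep_group` argument are illustrative assumptions, not any library's API):

```python
import torch
import torch.distributed as dist


def dispatch_to_experts(local_tokens: torch.Tensor, ep_group=None) -> torch.Tensor:
    """Exchange equal-sized token shards with every rank in the EP group.

    Assumes an initialized process group and that `local_tokens` is already
    grouped as [ep_world_size, capacity, hidden], where slice i holds the
    tokens this rank routes to the expert hosted on rank i.
    """
    recv = torch.empty_like(local_tokens)
    dist.all_to_all_single(recv, local_tokens, group=ep_group)
    # recv[i] now holds the tokens rank i routed to this rank's local expert.
    return recv
```

A second all-to-all with the roles reversed brings the expert outputs back to the original ranks (the "combine" step).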