Heyang Qin
Hello @lucadiliello. Thank you for reporting this issue to us. Could you share a script or command line that we can use to reproduce it?
I ran into the same issue. After a whole day of trial and error, I finally solved it by disabling IPv6 as suggested here: https://stackoverflow.com/questions/57992691/pip-hangs-on-starting-new-https-connection. However, I have no idea...
One of our recent fixes, https://github.com/microsoft/DeepSpeed/pull/3819, should have fixed this issue. It is not included in the PyPI release yet, so you need to install deepspeed from source to apply...
Hello @Bill-Orz. We have fixed the hanging issue in https://github.com/microsoft/DeepSpeedExamples/pull/636. Please update to the latest DeepSpeedExamples.
Hello @liuaiting. Thank you for reporting this issue to us. One of our recent fixes https://github.com/microsoft/DeepSpeed/pull/3462 may have already fixed this error. Could you update your deepspeed and give it...
@liuaiting Glad to hear the error is fixed. Closing the issue.
Hello @sindhuvahinis @lanking520, thank you for reporting this! With the merge of https://github.com/microsoft/DeepSpeed/pull/2725, the major part of this issue should have been resolved. I tested the models you listed with...
> @HeyangQin we did some tests on 2725 as well and still observing the major issues with INT8. Will share more details and setup

@lanking520 Thank you for the update!...
Hi @lanking520 @sindhuvahinis, Thank you for the information. Previously I only tested checkpoint loading with int8. Now when I test checkpoint saving with int8, I see the same error as...
@tjruwase I reworked the previous PR. This PR checks the GPU count against the world size for all dist tests, which avoids issues like https://github.com/microsoft/DeepSpeed/issues/2733 and https://github.com/microsoft/DeepSpeed/issues/2482 for all the...
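
For readers unfamiliar with this kind of guard, here is a minimal sketch of what checking the GPU count against the world size can look like. It is not the code from the PR: the function name `gpu_shortage_reason` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import Optional


def gpu_shortage_reason(world_size: int, available_gpus: int) -> Optional[str]:
    """Return a skip reason if a test needs more GPUs than are available.

    Hypothetical helper for illustration; not taken from DeepSpeed.
    Returns None when the machine has enough GPUs for the requested world size.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if available_gpus < world_size:
        return (
            f"test requires {world_size} GPUs "
            f"but only {available_gpus} available"
        )
    return None
```

In a real test suite, `available_gpus` could come from `torch.cuda.device_count()`, and a non-None reason could be passed to `pytest.skip(...)` so the test is skipped up front rather than left waiting for ranks that never start.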