Shaowei Su

18 comments by Shaowei Su

I can reproduce the same error with the following setup:
```
Framework: tensorflow.keras
Framework version: 2.5.0
Horovod version: 0.23.0
MPI version: 3.0.0
CUDA version: 11.0
NCCL version: 2.8.4-1+cuda11.2
Python version:...
```

Basically, it cannot load the Spark model info correctly:
```
scala> pred.getSparkModelParam
res28: biz.k11i.xgboost.spark.SparkModelParam = null
```
so the model cannot be reconstructed:
```
xgboostPredictor.getSparkModelParam match { case param:...
```

Hi @tenzen-y, yes, I just tried the `v0.14.0` UI component and it failed with the same error. Are you able to reproduce this on your side?

```
- match:
    prefix: "/flyteidl.service.DataProxyService"
  route:
    cluster: flyteadmin_grpc
```
The data proxy service is missing in the example configmap.

Thanks @loadams for helping on this issue - I did some investigation in the last few weeks and I think the root cause is related to the orchestrator (Ray Air:...

Could you point me to the fixes you mentioned above? Also happy to close this issue since it's not directly related to DeepSpeed.

Hi @loadams, reproducing the failure requires a multi-node, multi-GPU Ray cluster setup (https://github.com/ray-project/ray). Usually the issue occurs with > 16 GPUs (A100 or A10G, this is...

Ran into the exact same error when running DeepSpeed on Ray. Following this thread.

@tjruwase I was able to run DS + stage 3 + fp16 by disabling the `optimizer` section in the DS config, which I found negatively impacts model quality. If...
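
For reference, a minimal sketch of the kind of config the comment above describes: ZeRO stage 3 with fp16 enabled and no `optimizer` section. The batch-size value and the variable names are assumptions for illustration and do not come from the original thread; the actual config used there is not shown.

```python
import json

# Hypothetical DeepSpeed config: ZeRO stage 3 + fp16, with no "optimizer"
# section. With the section absent, the optimizer would have to be supplied
# by the training script instead (an assumption about the setup in the thread).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumed value, for illustration only
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# Serialize so the config could be written to a JSON file for DeepSpeed.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```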

Following this thread; same issue with ZeRO stage 2 + deepspeed==0.9.0. ZeRO stage 3 does not have this issue. (Falling back to deepspeed==0.8.2 fixed the issue.)