xUanBing
How long will it take?
Layer info: TorchModel[5d5e341e]
jep.JepException: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds]
    at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
    at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
    at com.intel.analytics.bigdl.orca.net.TorchModel.updateOutput(TorchModel.scala:131)
    at com.intel.analytics.bigdl.dllib.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:283)
    at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:272)
    at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:263)
    at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:263)
    at com.intel.analytics.bigdl.dllib.utils.ThreadPool$$anonfun$1$$anon$5.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at...
It's just a demo.
Same problem here. Have you finally solved it? deepspeed 0.10.0, error message: AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 64 != 8 * 1 * 1
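For context, this DeepSpeed assertion enforces train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size. A minimal sketch of the arithmetic from the error above (plain Python, variable names illustrative, not DeepSpeed API calls):

```python
# Values reported in the error: 64 != 8 * 1 * 1
train_batch_size = 64
micro_batch_per_gpu = 8
gradient_accumulation_steps = 1
world_size = 1

# The product DeepSpeed computes and compares against train_batch_size
product = micro_batch_per_gpu * gradient_accumulation_steps * world_size
print(product)  # 8, so the assertion fires because 64 != 8

# One way to make the config consistent (assuming world_size really is 1):
# raise gradient_accumulation_steps so the product matches train_batch_size.
gradient_accumulation_steps = train_batch_size // (micro_batch_per_gpu * world_size)
print(micro_batch_per_gpu * gradient_accumulation_steps * world_size)  # 64
```

Alternatively, if you intended multi-GPU training, a world_size of 1 suggests the launcher only saw one process; check how the job was launched before changing batch settings.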
Infoflow group 2853222: what's going on, this group doesn't allow users to join?
The group doesn't allow anyone to join.
Starting it manually works.
1. The ips are configured as follows:
   192.168.12.217:8813
   192.168.12.218:8814
   192.168.12.219:8815
2. Start the graph engine:
   /opt/python38paddle/bin/python3 -m pgl.distributed.launch --ip_config ./toy_data/ip_list.txt --conf ./user_configs/metapath2vec.yaml --shard_num 1000 --server_id 0
   /opt/python38paddle/bin/python3 -m pgl.distributed.launch --ip_config ./toy_data/ip_list.txt --conf ./user_configs/metapath2vec.yaml --shard_num 1000 --server_id 1
   /opt/python38paddle/bin/python3 -m pgl.distributed.launch...
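For reference, the ip_list.txt passed to --ip_config would presumably just list the three endpoints above, one host:port per line. A sketch (the file path and one-per-line format are assumed from the launch commands, not verified against PGL docs):

```shell
# Write the three server endpoints listed above into the file
# that --ip_config points at (./toy_data/ip_list.txt).
mkdir -p ./toy_data
cat > ./toy_data/ip_list.txt <<'EOF'
192.168.12.217:8813
192.168.12.218:8814
192.168.12.219:8815
EOF
```

Each launch command's --server_id would then index into this list (0, 1, 2).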
For distributed training, shouldn't the returned loss be an array? sec/batch: 0.149264 | step: 100 | train_loss: 0.485856. Is this still running single-machine?