xUanBing
How long will it take?
Layer info: TorchModel[5d5e341e]
jep.JepException: java.util.concurrent.TimeoutException: Futures timed out after [100 seconds]
    at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
    at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
    at com.intel.analytics.bigdl.orca.net.TorchModel.updateOutput(TorchModel.scala:131)
    at com.intel.analytics.bigdl.dllib.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:283)
    at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply$mcI$sp(DistriOptimizer.scala:272)
    at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:263)
    at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$$anonfun$4$$anonfun$5$$anonfun$apply$2.apply(DistriOptimizer.scala:263)
    at com.intel.analytics.bigdl.dllib.utils.ThreadPool$$anonfun$1$$anon$5.call(ThreadPool.scala:160)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at...
It's just a demo.
Same problem here. Have you finally solved it? deepspeed 0.10.0, error message: AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 64 != 8 * 1 * 1
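For context, this DeepSpeed assertion enforces train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size. A minimal sketch of the arithmetic from the error above (plain Python, variable names illustrative, not DeepSpeed API calls):

```python
# Values reported in the error: 64 != 8 * 1 * 1
train_batch_size = 64
micro_batch_per_gpu = 8
gradient_accumulation_steps = 1
world_size = 1

# The product DeepSpeed computes and compares against train_batch_size
product = micro_batch_per_gpu * gradient_accumulation_steps * world_size
print(product)  # 8, so the assertion fires because 64 != 8

# One way to make the config consistent (assuming world_size really is 1):
# raise gradient_accumulation_steps so the product matches train_batch_size.
gradient_accumulation_steps = train_batch_size // (micro_batch_per_gpu * world_size)
print(micro_batch_per_gpu * gradient_accumulation_steps * world_size)  # 64
```

Alternatively, if you intended multi-GPU training, a world_size of 1 suggests the launcher only saw one process; check how the job was launched before changing batch settings.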
Infoflow group 2853222: what's going on, this group doesn't allow users to join?
The group doesn't allow anyone to join.
Starting it manually works.
1. The ips are configured as follows:
   192.168.12.217:8813
   192.168.12.218:8814
   192.168.12.219:8815
2. Start the graph engine:
   /opt/python38paddle/bin/python3 -m pgl.distributed.launch --ip_config ./toy_data/ip_list.txt --conf ./user_configs/metapath2vec.yaml --shard_num 1000 --server_id 0
   /opt/python38paddle/bin/python3 -m pgl.distributed.launch --ip_config ./toy_data/ip_list.txt --conf ./user_configs/metapath2vec.yaml --shard_num 1000 --server_id 1
   /opt/python38paddle/bin/python3 -m pgl.distributed.launch...
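For reference, the ip_list.txt passed to --ip_config would presumably just list the three endpoints above, one host:port per line. A sketch (the file path and one-per-line format are assumed from the launch commands, not verified against PGL docs):

```shell
# Write the three server endpoints listed above into the file
# that --ip_config points at (./toy_data/ip_list.txt).
mkdir -p ./toy_data
cat > ./toy_data/ip_list.txt <<'EOF'
192.168.12.217:8813
192.168.12.218:8814
192.168.12.219:8815
EOF
```

Each launch command's --server_id would then index into this list (0, 1, 2).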
For distributed training, shouldn't the returned loss be an array? sec/batch: 0.149264 | step: 100 | train_loss: 0.485856. Is this still running single-machine?