amzno

8 comments by amzno

native environment

Thank you for your answer. I have exported the environment variable NCCL_SOCKET_IFNAME=eth1, but it's not working for me: it still hangs when I use 2 GPUs on one node. if i...
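When debugging a hang like this, it can help to launch the job with the interface setting and NCCL's own logging made explicit, so the chosen interface shows up in the output. The sketch below only builds the environment for a child process; `NCCL_SOCKET_IFNAME` and `NCCL_DEBUG` are real NCCL variables, while the helper name `nccl_debug_env` and the script name `train.py` are hypothetical and not taken from the comment.

```python
import os


def nccl_debug_env(ifname, base_env=None):
    """Return a copy of the environment with NCCL interface and logging set.

    `ifname` is the network interface NCCL should use (eth1 in the comment).
    `base_env` defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_SOCKET_IFNAME"] = ifname
    # INFO makes NCCL print its initialization details at startup.
    env["NCCL_DEBUG"] = "INFO"
    return env


env = nccl_debug_env("eth1")

# Example launch; train.py is a hypothetical script name:
# import subprocess, sys
# subprocess.run([sys.executable, "train.py"], env=env, check=True)
```

Passing the environment to the child process explicitly avoids the case where the variable was exported in a different shell than the one that starts the training job.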

2019-01-17 11:30:31.451159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-01-17 11:30:31.451169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-01-17 11:30:31.451695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) -> physical GPU (device: 0, name: Tesla...

@singhniraj08 But our model is trained with tf.estimator, not Keras. Is it OK to export the model using tf.keras.models.save_model? And our model can be loaded correctly by TF 2.5.2 when the...

@jackonan I have repeated the test for this problem. The procedure is as follows: I use dist_train.py (examples/tf/graphsage/dist_train.py) to test distributed mode with 2 ps and 2 workers....
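For reference, a 2 ps + 2 worker layout can be written down as one role assignment per process. The sketch below emits TensorFlow's conventional `TF_CONFIG`-style JSON for each of the four processes. This is an assumption for illustration: the comment does not say whether dist_train.py reads `TF_CONFIG` or takes its own command-line flags, and the addresses and the helper name `build_cluster` are placeholders.

```python
import json


def build_cluster(ps_hosts, worker_hosts):
    """Return one TF_CONFIG-style JSON string per process in the cluster."""
    cluster = {"ps": list(ps_hosts), "worker": list(worker_hosts)}
    configs = []
    for job, hosts in cluster.items():
        for index in range(len(hosts)):
            # Every process sees the same cluster but a different task entry.
            task = {"type": job, "index": index}
            configs.append(json.dumps({"cluster": cluster, "task": task}))
    return configs


# Placeholder addresses (hypothetical): 2 ps and 2 workers on one machine.
configs = build_cluster(
    ps_hosts=["127.0.0.1:2222", "127.0.0.1:2223"],
    worker_hosts=["127.0.0.1:2224", "127.0.0.1:2225"],
)
for cfg in configs:
    print(cfg)
```

Listing all four assignments side by side makes it easier to check that no two processes share a port or a task index before reproducing the problem.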

@zhyncs I encountered the same issue, and the GPU memory was freed when the server was ready. The output of "python3 -m sglang.check_env" is as follows: python: 3.10.14 (main, Mar 21 2024, 16:24:04)...

> Reason?

On some small models the output can become repetitive, and this also aligns the implementation with the CUDA one.

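> > > Reason?
> > >
> > > On some small models the output can become repetitive, and this also aligns the implementation with the CUDA one.
> >
> > Shouldn't this be investigated thoroughly first?

The prefill stage should not need to read the cache; the cache is only used during decode. So, to align with the CUDA side, the cache-reading logic was removed from prefill.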
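The point about prefill and decode can be illustrated with a toy single-head attention in plain Python. This is a minimal sketch and not the project's kernel: the identity `project` function, the `KVCache` class and its `reads` counter are all hypothetical, and it assumes plain prefill with no prefix caching or chunked prefill (either of those would require reading earlier entries from the cache).

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over lists of key/value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.reads = 0  # how many times attention read from the cache

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def read(self):
        self.reads += 1
        return self.keys, self.values


def project(x):
    # Toy stand-in for the q/k/v projections: identity for all three.
    return x, x, x


def prefill(prompt, cache):
    """Process the whole prompt: write K/V into the cache, never read it."""
    qkv = [project(x) for x in prompt]
    keys = [k for _, k, _ in qkv]
    values = [v for _, _, v in qkv]
    outputs = []
    for i, (q, k, v) in enumerate(qkv):
        cache.append(k, v)
        # Causal attention over the K/V just computed for the prompt.
        outputs.append(attend(q, keys[: i + 1], values[: i + 1]))
    return outputs


def decode(x, cache):
    """Process one new token: append its K/V, then attend over the cache."""
    q, k, v = project(x)
    cache.append(k, v)
    keys, values = cache.read()
    return attend(q, keys, values)


cache = KVCache()
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prefill_out = prefill(prompt, cache)
step_out = decode([0.5, 0.5], cache)
```

In this sketch prefill already holds every key and value it needs, so only decode touches the cache for reading, which is the behavior the reply describes.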