amzno comments

Results: 8 comments of amzno

Can't run multi-GPU

Thank you for your answer. I have exported the environment variable NCCL_SOCKET_IFNAME=eth1, but it is not working for me: it still hangs when I use 2 GPUs on one node. If I...
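One way to rule out the variable simply not reaching the training process is to set it explicitly in the child's environment. The following is only a minimal sketch: `nccl_env` and `run_with_nccl_env` are hypothetical helper names, `eth1` is taken from the comment above, and `NCCL_DEBUG=INFO` is added on the assumption that more verbose NCCL logging would help show where the hang occurs.

```python
import os
import subprocess
import sys


def nccl_env(ifname="eth1", debug="INFO"):
    """Return a copy of the current environment with NCCL settings applied.

    Hypothetical helper: the interface name comes from the comment above,
    the debug level is an assumption.
    """
    env = dict(os.environ)
    env["NCCL_SOCKET_IFNAME"] = ifname  # network interface NCCL should use
    env["NCCL_DEBUG"] = debug           # ask NCCL for more verbose logging
    return env


def run_with_nccl_env(cmd, ifname="eth1"):
    """Launch cmd (a list of arguments) with the NCCL variables set."""
    return subprocess.run(cmd, env=nccl_env(ifname), capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for the real training command: echo the variable back
    # to confirm that the child process actually sees it.
    probe = [sys.executable, "-c", "import os; print(os.environ['NCCL_SOCKET_IFNAME'])"]
    print(run_with_nccl_env(probe).stdout.strip())
```

Replacing the probe command with the actual multi-GPU launch command would run it under the same environment.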

2019-01-17 11:30:31.451159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-01-17 11:30:31.451169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-01-17 11:30:31.451695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) -> physical GPU (device: 0, name: Tesla...

The same SavedModel can be loaded by tf-serving 2.2, but cannot be loaded by tf-serving 2.5.2 when using S3 storage

@singhniraj08 But our model was trained with tf.estimator, not Keras. Is it OK to export the model using tf.keras.models.save_model? And our model can be loaded correctly by tf2.5.2 when the...

Servers hang when inter_thread_num is changed

@jackonan I have repeated the test for this problem. The procedure is as follows, e.g.: I use dist_train.py (examples/tf/graphsage/dist_train.py) to test distributed mode, using 2 ps and 2 workers....

[Bug] backend stuck at Prefill batch

@zhyncs I encountered the same issue, and the GPU memory was freed when the server was ready. The "python3 -m sglang.check_env" info is as follows: python: 3.10.14 (main, Mar 21 2024, 16:24:04)...

fix: [rocm] precision issues caused by the ROPE cache

> Reason? Some small models produce repeated output, and the implementation is also aligned with the CUDA one.

fix: [rocm] precision issues caused by the ROPE cache

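> > > Reason? > > > > > > Some small models produce repeated output, and the implementation is also aligned with the CUDA one. > > Shouldn't the cause be investigated thoroughly first? The prefill stage should not need to read the cache; the cache is only used during decode. So, to align with the CUDA side, the logic that reads the cache during prefill was removed.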
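The split described in the comment above (prefill computes the rotary angles directly, decode reads them from a cache) can be illustrated roughly as follows. This is a plain-Python sketch, not the ROCm kernel from the PR: the names `rope_angles`, `RopeCache`, `prefill` and `decode_step` are hypothetical, and the interleaved pair layout and base of 10000 are assumptions. It does not reproduce the precision issue itself; it only shows the two code paths.

```python
import math


def rope_angles(pos, head_dim, base=10000.0):
    """Compute the (cos, sin) pairs for one position directly."""
    angles = []
    for i in range(head_dim // 2):
        theta = pos * base ** (-2.0 * i / head_dim)
        angles.append((math.cos(theta), math.sin(theta)))
    return angles


def apply_rope(vec, angles):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, (c, s) in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


class RopeCache:
    """Precomputed cos/sin table, read only on the decode path in this sketch."""

    def __init__(self, max_pos, head_dim):
        self.table = [rope_angles(p, head_dim) for p in range(max_pos)]

    def lookup(self, pos):
        return self.table[pos]


def prefill(vectors, head_dim):
    """Prefill: positions 0..n-1 are known up front, so no cache read is needed."""
    return [apply_rope(v, rope_angles(p, head_dim)) for p, v in enumerate(vectors)]


def decode_step(vec, pos, cache):
    """Decode: one new position per step, angles are read from the cache."""
    return apply_rope(vec, cache.lookup(pos))


if __name__ == "__main__":
    head_dim = 4
    vectors = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, -1.0], [0.3, 0.7, -0.2, 0.9]]
    cache = RopeCache(max_pos=8, head_dim=head_dim)
    rotated = prefill(vectors, head_dim)
    # Both paths should agree for the same position and input vector.
    print(rotated[2] == decode_step(vectors[2], 2, cache))
```

In this sketch both paths use the same arithmetic, so they agree; a mismatch between the two in a real kernel (for example a cache stored at lower precision) is the kind of thing the comment suggests avoiding by not reading the cache in prefill.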