Konnase Lee

Results 6 issues of Konnase Lee

Add cudaDeviceReset() to p2pBandwidthLatencyTest to free GPU memory after the test

I created issue #284 about 4 months ago, suggesting that we replace tf.train.Supervisor with tf.train.MonitoredTrainingSession, as the latter will restart the session when facing an OS error (or communication...

judge which task a pod belongs to according to task name instead of task type

bug

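After linking google-hosts.sh to /usr/bin/google-hosts, typing google-hosts in the terminal gives: `google-hosts: command not found`. Adding sudo does not help, and switching to the root user does not find it either. Listing the files under /usr/bin: ![image](https://user-images.githubusercontent.com/18288851/32443310-82f4e404-c339-11e7-9ce8-f77d132ed9b0.png) google-hosts is shown in red; does that mean it is a compressed file? Any guidance would be appreciated!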

Service label is `app: pytorch-operator`, while selector is `name: pytorch-operator`. Deployment spec label and selector are both `name: pytorch-operator`. ![image](https://user-images.githubusercontent.com/18288851/140506943-00a4c917-2d89-4145-8e4c-e34dd548077d.png) In such a case, both the service and deployment have...

kind/bug

1. add tcp store for rendezvous usage: ```c++ auto rank = getenv("RANK"); if (!rank) { rank = "0"; } auto world_size = getenv("WORLD_SIZE"); if (!world_size) { world_size = "1"; }...
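The environment-variable defaults in the truncated snippet above can be sketched as follows. This is only an illustration of the fallback pattern (`RANK` defaults to `"0"`, `WORLD_SIZE` to `"1"`), not the PR's actual code: the helper `env_or_default`, the struct `RendezvousConfig`, and the function `read_rendezvous_config` are hypothetical names, and the TCP store itself is not shown because the original text is cut off before it. The sketch copies the result of `getenv` into a `std::string`, since `getenv` returns `char*` and a string literal cannot be assigned to a non-const `char*` in standard C++.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: return the value of an environment variable,
// or `fallback` when the variable is unset.
std::string env_or_default(const char* name, const std::string& fallback) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : fallback;
}

// Hypothetical container for the two values the snippet reads.
struct RendezvousConfig {
    int rank;
    int world_size;
};

// Read RANK and WORLD_SIZE, using the same defaults as the snippet.
// std::stoi throws std::invalid_argument if a value is not numeric.
RendezvousConfig read_rendezvous_config() {
    return RendezvousConfig{
        std::stoi(env_or_default("RANK", "0")),
        std::stoi(env_or_default("WORLD_SIZE", "1")),
    };
}
```

With neither variable set, `read_rendezvous_config()` describes a single-process run (rank 0, world size 1), which matches the defaults in the snippet.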

CLA Signed