lanyangyang
I followed the [step-by-step-tutorial](https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md) to run distributed training with MXNet and TensorFlow, and both hang. I have 3 nodes; on the first node I run the scheduler and server, and second and...
/kind feature **Describe the solution you'd like** Is it possible to allow users to set more fields in suggestion...
_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._...
Hi, I use k8s-rdma-sriov-dev-plugin in HCA mode and I haven't configured ib0 as the parent netdevice. What does configuring ib0 as the parent netdevice mean? Additionally, my...
[About the Service Account for Driver Pods](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#about-the-service-account-for-driver-pods) Using the service account in the driver pod, you can create a pod with any spec. If no restrictive PodSecurityPolicy or PodSecurityAdmission is in place, an attacker can...
If there are 2 GPUs per node, how should the Worker spec be set in the PyTorchJob: 1 replica with 2 GPUs per pod, or 2 replicas with only 1 GPU per pod? I've...