lanyangyang

Results: 7 issues by lanyangyang

I followed the [step-by-step-tutorial](https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md) to run distributed training with MXNet and TensorFlow, and both hang. I have 3 nodes; on the first node I run the scheduler and server, and second and...

distributed

/kind feature **Describe the solution you'd like** Is it possible to allow users to set more fields in suggestion...

help wanted
priority/p2
kind/feature
lifecycle/frozen

_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._...

lifecycle/stale

_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._...

Hi, I use k8s-rdma-sriov-dev-plugin in HCA mode and I haven't configured ib0 as the parent netdevice. What does configuring ib0 as the parent netdevice mean? Additionally, my...

[About the Service Account for Driver Pods](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#about-the-service-account-for-driver-pods) Using the service account in the driver pod, you can create a pod with any spec. If PodSecurityPolicy or PodSecurityAdmission is not restricted, an attacker can...

lifecycle/stale

If there are 2 GPUs per node, how should the Worker spec be set in the PyTorchJob: 1 replica with 2 GPUs per pod, or 2 replicas with only 1 GPU per pod? I've...