Kai Zhang

Results 16 comments of Kai Zhang

@nowenL basically, we can think of '--gpus=0' as a flag that tells arena explicitly not to attach GPU devices to job containers. However, that might have different semantics from the...

> @wsxiaozhang Thanks for the reply. To answer your questions:
>
> > what's the expected behavior, when you use --gpus=0? do you mean you just want to run...

@yuanbw in this release arena supports multi-user isolation via K8S PodSecurityContext at the namespace level. 1. For the dev scenario: it automatically detects the current host Linux account (uid, gid, supplemental...
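
As a rough illustration of the idea in the comment above, the sketch below maps a Linux account onto the standard Kubernetes PodSecurityContext fields `runAsUser`, `runAsGroup`, and `supplementalGroups`. It is not arena's implementation: the function name `pod_security_context` is hypothetical, and treating the truncated "supplemental..." as the account's supplementary group list is an assumption.

```python
import os


def pod_security_context(uid=None, gid=None, groups=None):
    """Build a PodSecurityContext-style dict from a Linux account.

    Hypothetical helper for illustration only. Values that are not
    passed in are read from the current process (Unix only).
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    groups = os.getgroups() if groups is None else groups
    return {
        "runAsUser": uid,
        "runAsGroup": gid,
        # Assumption: the primary gid is dropped from the supplementary
        # list, and the list is sorted so the output is stable.
        "supplementalGroups": sorted(g for g in set(groups) if g != gid),
    }
```

Calling `pod_security_context()` with no arguments on a Linux host would return the IDs of the account running the script, which could then be placed under `spec.securityContext` in a pod manifest.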

@zchunhai have you tried RDMA HCA mode? It's supported by arena so far. For SRIOV, are you looking for bandwidth isolation, or can you describe your scenario more? thanks

@yuzisun that's great to have KFServing in arena. we just have tf-serving now. Any plan to submit a PR for the integration?

@asdfsx yes, we plan to wrap an arena context that covers not only the cluster environment but also security. pls stay tuned.

> This is caused by a change in driver 510 that lumps the reserved memory into the used category. We are updating DCGM to handle this case and split the...

@flx42 @jiayingz any plan to enhance current device health check?

It seems your device plugin pod hasn't been scheduled onto any node. Could you pls check whether your GPU node has the label "gpushare=true" added?
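
One way to run the label check suggested above is to inspect the JSON that `kubectl get nodes -o json` prints. The sketch below is a minimal example under that assumption; the function name `nodes_with_label` and the sample node names are hypothetical, and only the `items[].metadata.name` and `items[].metadata.labels` fields of the node list are used.

```python
def nodes_with_label(node_list, key="gpushare", value="true"):
    """Return names of nodes whose labels contain key=value.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    """
    matches = []
    for item in node_list.get("items", []):
        metadata = item.get("metadata", {})
        # `labels` may be missing or null on an unlabeled node.
        labels = metadata.get("labels") or {}
        if labels.get(key) == value:
            matches.append(metadata.get("name"))
    return matches


# Hypothetical sample shaped like kubectl's node list output.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-1", "labels": {"gpushare": "true"}}},
        {"metadata": {"name": "cpu-node-1", "labels": {}}},
        {"metadata": {"name": "gpu-node-2"}},
    ]
}
labeled = nodes_with_label(sample)
```

An empty result would suggest that no node carries the label, which would fit the symptom of the device plugin pod not landing on any node.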

@201508876PMH pls refer to https://github.com/rancher/rke/issues/1841 to configure your RKE scheduler. For the hanging gpushare-sched-extender pod, could you pls describe that pod and share more of its logs?