AllenXu93

Results 35 comments of AllenXu93

I think NPD should support different configs for different devices or runtimes. For example, I have both GPU and non-GPU worker nodes in one cluster, or containerd and...

BTW, for GPU, we don't need to install more dependencies; we just add the env vars `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES`, and use the `nvidia-smi` command to check GPU state.
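The `nvidia-smi` check described above could be wrapped roughly as follows. This is a minimal sketch, not NPD's own code: the function names `run_nvidia_smi` and `check_gpu` are hypothetical, and the 0/1/2 status values assume NPD's custom-plugin exit-code convention (OK / NonOK / Unknown). It also assumes the container already has `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES` set so that `nvidia-smi` is available.

```python
import subprocess
from typing import Callable, Tuple

# Assumed status values, following NPD's custom-plugin exit-code convention.
OK, NON_OK, UNKNOWN = 0, 1, 2


def run_nvidia_smi(timeout_s: float = 10.0) -> Tuple[int, str]:
    """Run `nvidia-smi -L` (list GPUs) and return (return code, combined output)."""
    proc = subprocess.run(
        ["nvidia-smi", "-L"],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def check_gpu(runner: Callable[[], Tuple[int, str]] = run_nvidia_smi) -> Tuple[int, str]:
    """Map the nvidia-smi result to a (status, message) pair.

    `runner` is injectable so the mapping can be exercised without a GPU.
    """
    try:
        code, output = runner()
    except FileNotFoundError:
        # The binary is missing, so GPU state cannot be determined.
        return UNKNOWN, "nvidia-smi not found"
    except subprocess.TimeoutExpired:
        # A hung nvidia-smi is treated here as an unhealthy GPU.
        return NON_OK, "nvidia-smi timed out"
    if code != 0:
        return NON_OK, f"nvidia-smi exited with code {code}: {output}"
    first_line = output.splitlines()[0] if output else ""
    return OK, first_line
```

A plugin entry point would print the message and exit with the returned status, for example `status, msg = check_gpu(); print(msg); sys.exit(status)`.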

> yes, accelerators health is an important functionality and would be great to have it in NPD
>
> Need to design it carefully though. There is already some health...

> > We are able to perform health check with port from service healthCheckNodePort, but all cluster nodes in NLB members looks like huge overhead.
>
> This is expected...

> @isugimpy can you show me the output of those keys?
>
> ```
> hmget {harbor_job_service_namespace}:job_stats:{gc_job_id} status
> ttl {harbor_job_service_namespace}:job_stats:{gc_job_id}
> ```

In my Harbor cluster I met the same...

> > @isugimpy can you show me the output of those keys?
> >
> > ```
> > hmget {harbor_job_service_namespace}:job_stats:{gc_job_id} status
> > ttl {harbor_job_service_namespace}:job_stats:{gc_job_id}
> > ```
>
> In...

I think I found the reason. In my case, JobServer marks the job as failed after it runs longer than 1 day (by default): https://github.com/goharbor/harbor/blob/f86f1cebc3a1af8c5c14c0a94d687fff04ebc6eb/src/jobservice/worker/cworker/reaper.go#L173-L182 And then it will mark the...
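To illustrate the behavior described in this comment, here is a minimal sketch of a reaper-style timeout check. It is not Harbor's implementation (the linked `reaper.go` is Go code); the `Job` type, the status strings, and `reap_stale_jobs` are hypothetical names, and the one-day limit is taken only from the comment above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumed default, based on the "1 day (by default)" statement in the comment.
MAX_RUNNING = timedelta(days=1)


@dataclass
class Job:
    job_id: str
    status: str  # illustrative values: "Running", "Failed"
    started_at: datetime


def reap_stale_jobs(
    jobs: List[Job], now: datetime, max_running: timedelta = MAX_RUNNING
) -> List[str]:
    """Mark jobs that have been running longer than max_running as failed.

    Returns the IDs of the jobs that were marked.
    """
    reaped = []
    for job in jobs:
        if job.status == "Running" and now - job.started_at > max_running:
            job.status = "Failed"
            reaped.append(job.job_id)
    return reaped
```

With a check like this, a job that legitimately needs more than a day is marked failed even though its worker may still be doing work, which matches the symptom reported here.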

I think the point is that restarting kubelet should not affect any running workload. I have met a similar scenario: when restarting kubelet, some running pods failed due to an init-container...

> > I think the point is that restart kubelet should not affect any running workload
>
> what do we mean by restart, it is killing the kubelet process...