krishh85
@nikkon-dev Any pointers would be greatly appreciated. Thanks.
@nikkon-dev @bmarchant, gentle ping on this question.
@nvvfedorov Right, the question was specific to MIG instances, such as the metric below (dcgm_fi_prof_gr_engine_active), where there is a non-zero value (which I assume indicates that GPUs are being used and pods...
@nvvfedorov Any update on this? Thanks
@nvvfedorov We also ran a load test to simulate traffic for a period of time (30 mins) and observed that none of the MIG metrics had container_name, pod_name, or pod_namespace info....
@nvvfedorov 1. Ran a script that captures dcgm-exporter metrics from the localhost /metrics endpoint. 2. Set up inference requests against a model served from a host. The host is an A100 GPU...
@nvvfedorov Any update on this? It should be a simple test to see if it works as expected in your tests, and if it does, we can check if this is...
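For anyone trying to reproduce the check described above, here is a minimal sketch of such a script. It is not the script used in this thread. It assumes that MIG samples carry a `GPU_I_ID` label and that pod attribution shows up in `pod`, `namespace`, and `container` labels; both are passed as parameters, so older names like `pod_name` can be substituted. The URL `http://localhost:9400/metrics` is also an assumption and should be adjusted to your deployment.

```python
import re
import urllib.request

# One labeled sample line: name{labels} value [timestamp]
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s+(\S+)')
# One label pair inside the braces: key="value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, raw_value) for each labeled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group(2)))
            yield match.group(1), labels, match.group(3)


def unattributed_mig_samples(text, mig_label="GPU_I_ID",
                             pod_labels=("pod", "namespace", "container")):
    """Return (metric_name, mig_instance_id) for MIG samples lacking pod info.

    The label names are assumptions; pass the ones your exporter emits.
    """
    missing = []
    for name, labels, _value in parse_samples(text):
        if not labels.get(mig_label):
            continue  # not a MIG sample under the assumed labeling
        if not all(labels.get(key) for key in pod_labels):
            missing.append((name, labels[mig_label]))
    return missing


def fetch_metrics(url="http://localhost:9400/metrics"):
    """Fetch the raw metrics text (URL is an assumption, adjust as needed)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Running `unattributed_mig_samples(fetch_metrics())` on the node while the load test is active lists every MIG sample whose pod labels are absent or empty, which makes it easy to attach concrete output to the issue.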
@nvvfedorov Based on the [code](https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L150), it seems like this is disabled for MIG resource names. Can you please confirm, and if so, is there a reason why this is not supported?
@nvvfedorov, added the details. We use the MIXED strategy with 2 MIG slices on 4 GPUs (3g.40gb & 4g.40gb). env: - name: MIG_STRATEGY value: mixed - name: NVIDIA_MIG_MONITOR_DEVICES value: all -...
@nvvfedorov I doubt I will be able to do it as I won't be able to download & run external packages on hosts without several reviews. Since you know the...