luccabb comments


dcgm nvlink metrics not available on dcgm 3.1.3

@dbeer

> What GPU generation are you using?

NVIDIA A100-SXM4-40GB

dcgm nvlink metrics not available on dcgm 3.1.3

> what is the output of nvidia-smi?

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0
...
```

dcgm nvlink metrics not available on dcgm 3.1.3

Per https://github.com/NVIDIA/DCGM/issues/149#issuecomment-1922398817, it's only available on Hopper+ GPUs. Surfacing this in the [dcgm docs](https://docs.nvidia.com/datacenter/dcgm/3.1/dcgm-api/dcgm-api-field-ids.html?fbclid=IwAR2Zb6uoCJiZ2fiQ_F0KM1SzfIb49vUiU5e1bwwf-PywDCZhFESP59RQS0M) would be helpful.

cc: @dbeer @nikkon-dev

GitHub Actions is deprecating `macos-10.5` runner

> Yep, macos-12 runner works just fine.

`macos-12.6.9` and `macos-13` seem to be failing for VirtualBox. It's being tracked at https://github.com/actions/runner-images/issues/8730

Logs API: not possible to use without SDK

> Probably related to https://github.com/open-telemetry/opentelemetry-python/issues/3552

When using the `APILogRecord` I get:

- https://github.com/open-telemetry/opentelemetry-python/issues/4319 on `ConsoleLogExporter`
- https://github.com/open-telemetry/opentelemetry-python/issues/3552 on `OTLPLogExporter`

Logs API: not possible to use without SDK

Probably also part of the same issue: resource and scope attributes set via the SDK are ignored:

```
from opentelemetry._logs import LogRecord, get_logger_provider, set_logger_provider, SeverityNumber
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs...
```

DCGM_FI_DEV_GPU_UTIL abnormal point

@nvvfedorov I'm observing the same issue. It is quite rare: when tracking it down to a single GPU UUID, I noticed one weird sample where DCGM_FI_DEV_GPU_UTIL is higher than 100 for...
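
For anyone trying to spot such samples in their own data, here is a minimal sketch. It assumes the metric has already been exported into plain `(timestamp, gpu_uuid, value)` records; the `UtilSample` type and `find_out_of_range` helper are hypothetical names used for illustration and are not part of DCGM or dcgm-exporter.

```python
from typing import Iterable, List, NamedTuple


class UtilSample(NamedTuple):
    """One exported DCGM_FI_DEV_GPU_UTIL reading (hypothetical record shape)."""
    timestamp: float
    gpu_uuid: str
    value: float


def find_out_of_range(
    samples: Iterable[UtilSample], low: float = 0.0, high: float = 100.0
) -> List[UtilSample]:
    """Return the samples whose value falls outside [low, high]."""
    return [s for s in samples if not (low <= s.value <= high)]


if __name__ == "__main__":
    # Made-up readings for illustration only.
    readings = [
        UtilSample(1.0, "GPU-aaa", 97.0),
        UtilSample(2.0, "GPU-aaa", 1200.0),
        UtilSample(3.0, "GPU-aaa", 98.0),
    ]
    for bad in find_out_of_range(readings):
        print(bad.gpu_uuid, bad.timestamp, bad.value)
```

Grouping the result by `gpu_uuid` afterwards makes it easier to see whether the outliers cluster on one device.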

dcgm-exporter counter value goes down

If this is expected behavior, should we change the type to gauge?

> A gauge is a metric that represents a single numerical value that can arbitrarily go up and...
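
To make the counter-versus-gauge question concrete, here is a small sketch that checks whether a scraped series is consistent with counter semantics (values only increase, or reset to zero on restart). The helper names `decreases` and `looks_like_counter` are hypothetical, and treating a drop to exactly zero as a restart is an assumption of this sketch, not something taken from dcgm-exporter.

```python
from typing import List, Sequence, Tuple


def decreases(values: Sequence[float]) -> List[Tuple[int, float, float]]:
    """Return (index, previous, current) for each sample lower than the one before it."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(values, values[1:]), start=1)
        if cur < prev
    ]


def looks_like_counter(values: Sequence[float]) -> bool:
    """True if every drop in the series is a reset to zero (assumed to be a restart)."""
    return all(cur == 0 for _, _, cur in decreases(values))


if __name__ == "__main__":
    # Made-up series for illustration only.
    series = [10, 12, 15, 13, 16]
    print(decreases(series))
    print(looks_like_counter(series))
```

A series that fails this check behaves like a gauge from the point of view of any consumer that applies rate-style functions to it.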