level-zero-tests

Add new CTS test to validate if unique telemetry is reported by sysman

Open kalyanalle opened this issue 2 months ago • 0 comments

on multi-card systems, for each device (GPU).

Related-To: VLCLJ-2646

Note: the debug prints are left in for testing purposes and will be removed in the final code.

Brief Function Logic Explanations

Data Collection Functions

collectMemoryData()
- Purpose: Collects memory telemetry from a device
- Enumerates memory modules using zesDeviceEnumMemoryModules()
- For each module: gets bandwidth counters (read/write) and memory state (free/used)
- Stores in deviceData.memoryBandwidth and deviceData.memoryStates
- Returns gracefully if no memory modules found

collectPowerData()
- Purpose: Collects power consumption telemetry from a device
- Enumerates power domains using zesDeviceEnumPowerDomains()
- For each domain: gets energy counters using zesPowerGetEnergyCounter()
- Stores in deviceData.powerEnergy
- Returns gracefully if no power domains found

collectTemperatureData()
- Purpose: Collects temperature readings from a device
- Enumerates temperature sensors using zesDeviceEnumTemperatureSensors()
- For each sensor: gets temperature value using zesTemperatureGetState()
- Stores in deviceData.temperatures
- Returns gracefully if no temperature sensors found

collectPciData()
- Purpose: CRITICAL - Collects PCI info and creates unique device ID
- Gets PCI properties using zesDevicePciGetProperties()
- Creates BDF string: "bus:device:function" (e.g., "3:0:0")
- Gets PCI traffic stats using zesDevicePciGetStats()
- Returns false if PCI properties fail (test-critical failure)
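As a rough illustration of the collectPciData() logic described above, here is a minimal sketch of the BDF construction and the "return false on failure" rule. It deliberately avoids the Level Zero headers so it compiles on its own: `PciAddress` is a stand-in for the address fields that zesDevicePciGetProperties() reports, and `DeviceData`, `makeBdfString()`, and the `propertiesOk` flag are hypothetical names, not the code in this change.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for the PCI address reported by zesDevicePciGetProperties()
// (assumption: the real test reads these fields from the Sysman properties).
struct PciAddress {
  uint32_t domain;
  uint32_t bus;
  uint32_t device;
  uint32_t function;
};

// Hypothetical per-device container. The field names powerEnergy and
// temperatures follow the description above; pciBdf and all types are guesses.
struct DeviceData {
  std::string pciBdf;
  std::vector<uint64_t> powerEnergy;
  std::vector<double> temperatures;
};

// Builds the "bus:device:function" identifier, e.g. "3:0:0".
std::string makeBdfString(const PciAddress &addr) {
  return std::to_string(addr.bus) + ":" + std::to_string(addr.device) + ":" +
         std::to_string(addr.function);
}

// Mirrors the rule that a PCI properties failure is test-critical:
// returns false and leaves `out` untouched when the query did not succeed.
bool collectPciData(bool propertiesOk, const PciAddress &addr, DeviceData &out) {
  if (!propertiesOk) {
    return false;
  }
  out.pciBdf = makeBdfString(addr);
  return true;
}
```

One thing this sketch makes visible: the identifier omits the PCI domain, so on a system with more than one PCI domain two distinct devices could in principle produce the same string. Including `domain` in the key may be worth considering.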

Validation Functions

validateUniquePciBdf()
- Purpose: CORE PMT VALIDATION - Ensures no duplicate PCI addresses
- Uses std::set to detect duplicate BDF identifiers
- Returns false if duplicate found → PMT mapping error detected
- Most critical validation - proves each device has unique address

validateMemoryDataIsolation()
- Purpose: Ensures memory counters differ between all device pairs
- Double loop: Compares every device pair (i vs j where j > i)
- Checks memory bandwidth: read/write counters must differ
- Checks memory state: free memory should differ between devices
- EXPECT_FALSE on identical data → detects PMT cross-contamination

validatePowerDataIsolation()
- Purpose: Ensures power readings differ between all device pairs
- Double loop: Compares every device pair
- Checks energy counters: power consumption values must differ
- EXPECT_FALSE on identical energy → detects shared power data

validateTemperatureDataIsolation()
- Purpose: Validates temperature readings are realistic per device
- Double loop: Validates each device's temperature range
- Range check: 0°C < temperature < 150°C per device
- No uniqueness requirement (idle GPUs may have similar temps)
- Ensures PMT thermal interface is accessible

validatePciDataIsolation()
- Purpose: CRITICAL - Validates PCI bus uniqueness and traffic isolation
- EXPECT_NE on PCI bus numbers → different devices must be on different buses
- Compares PCI traffic stats: RX/TX/packet counters must differ
- EXPECT_FALSE on identical stats → detects PMT interface sharing
- Core PMT mapping validation - validates the commit's fix
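The three validation patterns described above (duplicate detection with std::set, pairwise comparison with j > i, and a per-device range check) can be sketched without GoogleTest or the Level Zero headers. This is only an illustration under assumptions: `DeviceSample` and the function names are hypothetical, each device is reduced to a single energy counter and a single temperature, and the helpers return values where the real test would use EXPECT_FALSE / EXPECT_NE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified snapshot of one device's telemetry.
struct DeviceSample {
  std::string pciBdf;      // "bus:device:function" identifier
  uint64_t energyCounter;  // e.g. a value read via zesPowerGetEnergyCounter()
  double temperatureC;     // e.g. a value read via zesTemperatureGetState()
};

// Duplicate detection: std::set::insert reports whether the key was new.
bool hasUniquePciBdf(const std::vector<DeviceSample> &devices) {
  std::set<std::string> seen;
  for (const auto &d : devices) {
    if (!seen.insert(d.pciBdf).second) {
      return false;  // same BDF seen twice
    }
  }
  return true;
}

// Pairwise comparison: visits every unordered pair (i, j) with j > i once
// and counts the pairs whose energy counters are identical.
std::size_t countIdenticalEnergyPairs(const std::vector<DeviceSample> &devices) {
  std::size_t identical = 0;
  for (std::size_t i = 0; i < devices.size(); ++i) {
    for (std::size_t j = i + 1; j < devices.size(); ++j) {
      if (devices[i].energyCounter == devices[j].energyCounter) {
        ++identical;
      }
    }
  }
  return identical;
}

// Range check only (0 < T < 150 degrees C); no uniqueness is required,
// since idle GPUs may legitimately report similar temperatures.
bool temperaturesInRange(const std::vector<DeviceSample> &devices) {
  for (const auto &d : devices) {
    if (!(d.temperatureC > 0.0 && d.temperatureC < 150.0)) {
      return false;
    }
  }
  return true;
}
```

In a GoogleTest case these helpers would typically be wrapped as `EXPECT_TRUE(hasUniquePciBdf(devices))` and `EXPECT_EQ(countIdenticalEnergyPairs(devices), 0u)`. Returning a count rather than asserting inside the loop keeps the sketch independent of the test framework.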

kalyanalle • Nov 26 '25 15:11