linux_peripheral_interfaces
WIP: Add computer_hw package (copied, improved from pr2_computer_monitor)
The change in this PR is the same as https://github.com/ros-drivers/linux_peripheral_interfaces/pull/20 but re-opened from a different branch.
Issue aimed at
- Add some h/w status monitoring
- Likely addresses https://github.com/PR2/pr2_common/issues/286 (generalize some pr2 components).
Changes
- Add `computer_hw` package (renamed `pr2_computer_monitor` that was copied from pr2_robot repo)
- Added a `.launch` to allow downstream to start processes by batch.
Review items
- [x] Depends on https://github.com/plusone-robotics/computer_status_msgs/pull/4
- [x] (Optional) Release .deb packages to allow non-source installation: https://github.com/ros/rosdistro/pull/30818 and https://github.com/ros/rosdistro/pull/30819. These may be needed for CI, so this PR is marked WIP for now.
Test
Dev test done on an Ubuntu 16.04 host with an NVIDIA GeForce GTX 1060:
# roslaunch computer_hw monitor.launch
... logging to /root/.ros/log/1b44b418-1846-11ec-b2b0-c400ad2d8cb0/roslaunch-rabbitdeer-3380.log
Checking log directory for disk usage. This may take awhile.
Press Ctrl-C to interrupt
Done checking log file disk usage. Usage is <1GB.
started roslaunch server http://rabbitdeer:38343/
SUMMARY
========
PARAMETERS
* /rosdistro: kinetic
* /rosversion: 1.12.13
NODES
/
diag_agg (diagnostic_aggregator/aggregator_node)
libsensors_monitor (libsensors_monitor/libsensors_monitor)
nvidia_temperature_monitor (computer_hw/nvidia_temp.py)
auto-starting new master
process[master]: started with pid [3390]
ROS_MASTER_URI=http://localhost:11311
setting /run_id to 1b44b418-1846-11ec-b2b0-c400ad2d8cb0
process[rosout-1]: started with pid [3403]
started core service [/rosout]
process[libsensors_monitor-2]: started with pid [3410]
[ INFO] [1631944994.889260052]: Got system hostname: rabbitdeer
[ INFO] [1631944994.896585316]: Found sensor coretemp-isa-0000 with features: temp1, temp2, temp3, temp4, temp5
[ INFO] [1631944994.896702034]: Found sensor acpitz-virtual-0 with features: temp1, temp2, temp3
[ INFO] [1631944994.896749535]: Found sensor pch_skylake-virtual-0 with features: temp1
process[nvidia_temperature_monitor-3]: started with pid [3421]
[INFO] [1631944995.775560]: card_out:
==============NVSMI LOG==============
Timestamp : Sat Sep 18 06:03:15 2021
Driver Version : 440.64
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce GTX 1060 6GB
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-7f9b4a72-68fe-e2a9-8907-4590704d3431
Minor Number : 0
VBIOS Version : 86.06.45.00.60
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1C0310DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x61633842
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 5 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 6077 MiB
Used : 114 MiB
Free : 5963 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 2 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 51 C
GPU Shutdown Temp : 102 C
GPU Slowdown Temp : 99 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 5.91 W
Power Limit : 120.00 W
Default Power Limit : 120.00 W
Enforced Power Limit : 120.00 W
Min Power Limit : 60.00 W
Max Power Limit : 140.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2012 MHz
SM : 2012 MHz
Memory : 4004 MHz
Video : 1708 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
gpu_stat: header:
seq: 0
stamp:
secs: 0
nsecs: 0
frame_id: ''
product_name: "GeForce GTX 1060 6GB"
pci_device_id: ''
pci_location: ''
display: ''
driver_version: "440.64"
temperature: 51
fan_speed: 23.5619449019
gpu_usage: 0
memory_usage: 2
process[diag_agg-4]: started with pid [3435]
[ERROR] [1631944995.896812050]: No analyzers initialized in AnalyzerGroup /diag_agg/analyzers
[ERROR] [1631944995.896856468]: Analyzer group for diagnostic aggregator failed to initialize!
^C[diag_agg-4] killing on exit
[nvidia_temperature_monitor-3] killing on exit
[libsensors_monitor-2] killing on exit
[INFO] [1631944996.825916]: card_out:
gpu_stat: header:
seq: 0
stamp:
secs: 0
nsecs: 0
frame_id: ''
product_name: ''
pci_device_id: ''
pci_location: ''
display: ''
driver_version: ''
temperature: 0.0
fan_speed: 0.0
gpu_usage: 0.0
memory_usage: 0.0
:
Sample of Diagnostic GUI with GPU monitoring output.
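
A note on the `fan_speed` value in the first `gpu_stat` above: the log reports `Fan Speed : 5 %` but the message carries `23.5619449019`, which looks like a unit conversion rather than a parse error. The sketch below shows arithmetic that reproduces that number, assuming the percentage is scaled against a maximum of 4500 RPM and expressed in rad/s. The constant and the function name are assumptions chosen because they match the logged value; this PR text does not confirm them.

```python
import math

# Assumed maximum fan speed in RPM. Hypothetical constant: it is used here
# only because it reproduces the value seen in the log.
ASSUMED_MAX_FAN_RPM = 4500.0


def fan_percent_to_rads(percent, max_rpm=ASSUMED_MAX_FAN_RPM):
    """Convert a fan duty percentage to an angular speed in rad/s."""
    rpm = max_rpm * percent / 100.0      # 5 % of 4500 RPM is 225 RPM
    return rpm * 2.0 * math.pi / 60.0    # revolutions per minute to rad/s


if __name__ == "__main__":
    # 5 % is the fan speed reported by nvidia-smi in the log above.
    print(fan_percent_to_rads(5))
```

If that assumption holds, reviewers reading the diagnostics should treat `fan_speed` as rad/s, not as a percentage.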

The unit test is failing, and I think this is a bug in the test case itself.
https://github.com/kinu-garage/linux_peripheral_interfaces/actions/runs/4266170296/jobs/7426371093#step:4:467
======================================================================
FAIL: test_parse (parse_test.TestNominalParser)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/target_ws/src/linux_peripheral_interfaces/computer_hw/test/parse_test.py", line 70, in test_parse
self.assert_(gpu_stat.pci_device_id, "No PCI Device ID found")
AssertionError: '' is not true : No PCI Device ID found
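
The assertion is consistent with the first `gpu_stat` in the log above, where `pci_device_id` is `''` even though the `nvidia-smi` text contains `Device Id : 0x1C0310DE`. Whether the fix belongs in the test or in the parser is not settled here. As a minimal sketch (not the package's parser; `find_field` and the `PCI Device ID` label are hypothetical), a label-based lookup returns an empty string whenever the label it searches for differs from what the driver prints:

```python
import re

# Excerpt of the `nvidia-smi -q` text from the log above
# (indentation was already lost in the pasted log).
SAMPLE = """\
Product Name : GeForce GTX 1060 6GB
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1C0310DE
Bus Id : 00000000:01:00.0
"""


def find_field(text, label):
    """Return the value of the first 'label : value' line, or '' if absent."""
    pattern = r"^\s*" + re.escape(label) + r"\s*:\s*(.*?)\s*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else ""


if __name__ == "__main__":
    # Label exactly as printed by driver 440.64 in the log above.
    print(repr(find_field(SAMPLE, "Device Id")))
    # Hypothetical label that does not occur in this output.
    print(repr(find_field(SAMPLE, "PCI Device ID")))
```

Running the lookup against the captured output with the label the parser actually uses would show whether the empty field comes from a label mismatch or from the test fixture.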