linux_peripheral_interfaces

WIP: Add computer_hw package (copied, improved from pr2_computer_monitor)

Open 130s opened this issue 3 years ago • 1 comment

This PR makes the same change as https://github.com/ros-drivers/linux_peripheral_interfaces/pull/20, re-opened from a different branch.

Issue aimed at

  • Add hardware status monitoring
    • Likely addresses https://github.com/PR2/pr2_common/issues/286 (generalizing some PR2 components).

Changes

  • Add computer_hw package (a renamed copy of pr2_computer_monitor from the pr2_robot repo)
  • Add a .launch file so that downstream users can start the monitoring processes together.

Review items

  • [x] Depends on https://github.com/plusone-robotics/computer_status_msgs/pull/4
  • [x] (Optional) Release .deb packages to allow non-source installation: https://github.com/ros/rosdistro/pull/30818 and https://github.com/ros/rosdistro/pull/30819. These may be needed for CI, so this PR is marked WIP for now.

Test

Dev test done on an Ubuntu 16.04 host with an NVIDIA GeForce GTX 1060:
# roslaunch computer_hw monitor.launch
... logging to /root/.ros/log/1b44b418-1846-11ec-b2b0-c400ad2d8cb0/roslaunch-rabbitdeer-3380.log
Checking log directory for disk usage. This may take awhile.
Press Ctrl-C to interrupt
Done checking log file disk usage. Usage is <1GB.

started roslaunch server http://rabbitdeer:38343/

SUMMARY
========

PARAMETERS
 * /rosdistro: kinetic
 * /rosversion: 1.12.13

NODES
  /
    diag_agg (diagnostic_aggregator/aggregator_node)
    libsensors_monitor (libsensors_monitor/libsensors_monitor)
    nvidia_temperature_monitor (computer_hw/nvidia_temp.py)

auto-starting new master
process[master]: started with pid [3390]
ROS_MASTER_URI=http://localhost:11311

setting /run_id to 1b44b418-1846-11ec-b2b0-c400ad2d8cb0
process[rosout-1]: started with pid [3403]
started core service [/rosout]
process[libsensors_monitor-2]: started with pid [3410]
[ INFO] [1631944994.889260052]: Got system hostname: rabbitdeer
[ INFO] [1631944994.896585316]: Found sensor coretemp-isa-0000 with features: temp1, temp2, temp3, temp4, temp5
[ INFO] [1631944994.896702034]: Found sensor acpitz-virtual-0 with features: temp1, temp2, temp3
[ INFO] [1631944994.896749535]: Found sensor pch_skylake-virtual-0 with features: temp1
process[nvidia_temperature_monitor-3]: started with pid [3421]
[INFO] [1631944995.775560]: card_out: 
==============NVSMI LOG==============

Timestamp                           : Sat Sep 18 06:03:15 2021
Driver Version                      : 440.64
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Product Name                    : GeForce GTX 1060 6GB
    Product Brand                   : GeForce
    Display Mode                    : Enabled
    Display Active                  : Enabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-7f9b4a72-68fe-e2a9-8907-4590704d3431
    Minor Number                    : 0
    VBIOS Version                   : 86.06.45.00.60
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1C0310DE
        Bus Id                      : 00000000:01:00.0
        Sub System Id               : 0x61633842
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 5 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 6077 MiB
        Used                        : 114 MiB
        Free                        : 5963 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 5 MiB
        Free                        : 251 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 2 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Temperature
        GPU Current Temp            : 51 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 99 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 5.91 W
        Power Limit                 : 120.00 W
        Default Power Limit         : 120.00 W
        Enforced Power Limit        : 120.00 W
        Min Power Limit             : 60.00 W
        Max Power Limit             : 140.00 W
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2012 MHz
        SM                          : 2012 MHz
        Memory                      : 4004 MHz
        Video                       : 1708 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes


gpu_stat: header: 
  seq: 0
  stamp: 
    secs: 0
    nsecs:         0
  frame_id: ''
product_name: "GeForce GTX 1060 6GB"
pci_device_id: ''
pci_location: ''
display: ''
driver_version: "440.64"
temperature: 51
fan_speed: 23.5619449019
gpu_usage: 0
memory_usage: 2

process[diag_agg-4]: started with pid [3435]
[ERROR] [1631944995.896812050]: No analyzers initialized in AnalyzerGroup /diag_agg/analyzers
[ERROR] [1631944995.896856468]: Analyzer group for diagnostic aggregator failed to initialize!
^C[diag_agg-4] killing on exit
[nvidia_temperature_monitor-3] killing on exit
[libsensors_monitor-2] killing on exit
[INFO] [1631944996.825916]: card_out: 
gpu_stat: header: 
  seq: 0
  stamp: 
    secs: 0
    nsecs:         0
  frame_id: ''
product_name: ''
pci_device_id: ''
pci_location: ''
display: ''
driver_version: ''
temperature: 0.0
fan_speed: 0.0
gpu_usage: 0.0
memory_usage: 0.0
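
A note on the `fan_speed` value in the first `gpu_stat` above: nvidia-smi reports `Fan Speed : 5 %`, while the message shows `23.5619449019`. That number is consistent with scaling the percentage by a maximum of 4500 rpm and converting to rad/s. The sketch below only reproduces that arithmetic. `MAX_FAN_RPM` and the function name are assumptions for illustration and have not been checked against the package source.

```python
import math

# Assumed maximum fan speed in rpm; chosen because it reproduces the logged value.
MAX_FAN_RPM = 4500.0


def fan_percent_to_rads(percent, max_rpm=MAX_FAN_RPM):
    """Convert a fan-speed percentage to an angular velocity in rad/s."""
    rpm = max_rpm * percent / 100.0
    # One revolution is 2*pi radians; one minute is 60 seconds.
    return rpm * 2.0 * math.pi / 60.0


if __name__ == "__main__":
    print(fan_percent_to_rads(5))
```

If this reading is right, consumers of the message should treat `fan_speed` as rad/s rather than as a percentage.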


Sample of Diagnostic GUI with GPU monitoring output.

130s, Feb 24 '22 22:02

The unit test is failing; I think this is a bug in the test case itself.

https://github.com/kinu-garage/linux_peripheral_interfaces/actions/runs/4266170296/jobs/7426371093#step:4:467

  ======================================================================
  FAIL: test_parse (parse_test.TestNominalParser)
  ----------------------------------------------------------------------
  Traceback (most recent call last):
    File "/root/target_ws/src/linux_peripheral_interfaces/computer_hw/test/parse_test.py", line 70, in test_parse
      self.assert_(gpu_stat.pci_device_id, "No PCI Device ID found")
  AssertionError: '' is not true : No PCI Device ID found
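
For context, the nvidia-smi output in the dev-test log does contain a PCI `Device Id` (`0x1C0310DE`), while the parsed `gpu_stat` printed there has an empty `pci_device_id`. Below is a minimal sketch of reading that field from text in this format. `find_value` is a hypothetical helper written for illustration, not the computer_hw parser, and `SAMPLE` is a trimmed excerpt copied from the log.

```python
import re

# Trimmed excerpt copied from the nvidia-smi output in the dev-test log.
SAMPLE = """\
    Product Name                    : GeForce GTX 1060 6GB
    PCI
        Bus                         : 0x01
        Device Id                   : 0x1C0310DE
        Bus Id                      : 00000000:01:00.0
    Fan Speed                       : 5 %
"""


def find_value(text, key):
    """Return the value of the first 'key : value' line, or '' if absent.

    Hypothetical helper for illustration; not the computer_hw parser.
    """
    pattern = r"^\s*" + re.escape(key) + r"\s*:\s*(.*?)\s*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else ""


if __name__ == "__main__":
    print(find_value(SAMPLE, "Device Id"))
    print(find_value(SAMPLE, "Bus Id"))
```

A lookup like this returns an empty string when the key it searches for does not appear in the text, which is the condition the assertion at line 70 of parse_test.py rejects. Comparing the key names the parser expects with those in the test's input data may be a reasonable first check.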

130s, Feb 24 '23 20:02