Facts not being reported by the Microkernel to Hanlon on some hardware
A user recently reported that when he used the Microkernel to discover HP BL460cG8 blade servers that included 10Gb Flex Fabric NICs, there were no system tags assigned to the nodes after they registered with Hanlon. This seems to be associated with facts not being returned to Hanlon from the Microkernel in the node registration process. This issue needs to be explored further to determine the root cause (whether it's a Hanlon issue, a Hanlon-Microkernel issue, or both).
On the plus side, the node seems to be checking in successfully, so this hints at an issue with discovering (and parsing?) the underlying facts on the Hanlon-Microkernel side. This issue is associated with Issue #422 in the Hanlon project.
I think I have a related issue. I received a number of Super Micro machines and only partial tags/facts are being reported.
0:0 ᐅ hanlon node 1a -f attributes [3:38]
Node Attributes:
Name Value
architecture x86_64
bios_release_date 12/18/2015
bios_vendor American Megatrends Inc.
bios_version 2.0
blockdevice_sda_model SMC3108
blockdevice_sda_size 959656755200
blockdevice_sda_vendor AVAGO
blockdevices sda
boardmanufacturer Supermicro
boardproductname X10DRT-P
boardserialnumber ZM152S026018
filesystems ext2,ext3,ext4
fqdn mk0CC47A4BF8D0
gid root
hardwareisa unknown
hardwaremodel x86_64
hostname mk0CC47A4BF8D0
ipaddress 172.18.42.1
ipaddress_docker0 172.17.0.1
ipaddress_docker_sys 172.18.42.1
ipaddress_eth2 10.33.12.25
ipaddress_lo 127.0.0.1
is_virtual false
macaddress 00:00:00:00:00:00
macaddress_docker0 02:42:35:62:CC:DA
macaddress_docker_sys 00:00:00:00:00:00
macaddress_eth0 0C:C4:7A:4B:F8:D0
macaddress_eth1 0C:C4:7A:4B:F8:D1
macaddress_eth2 A0:36:9F:6C:F9:F8
macaddress_eth3 A0:36:9F:6C:F9:FA
macaddress_none DE:EA:AA:7A:C6:4F
manufacturer Supermicro
memorysize 125.88 GB
memorysize_mb 128896.68
mk_hw_bus_description Motherboard
mk_hw_bus_physical_id 0
mk_hw_bus_product X10DRT-P
mk_hw_bus_serial ZM152S026018
mk_hw_bus_vendor Supermicro
mk_hw_bus_version 1.10
mk_hw_fw_capacity 15MiB
mk_hw_fw_date 12/18/2015
mk_hw_fw_description BIOS
mk_hw_fw_physical_id 0
mk_hw_fw_size 64KiB
mk_hw_fw_vendor American Megatrends Inc.
mk_hw_fw_version 2.0
mk_hw_lscpu_Architecture x86_64
mk_hw_lscpu_BogoMIPS 4805.22
mk_hw_lscpu_Byte_Order Little Endian
mk_hw_lscpu_CPU_MHz 1200.656
mk_hw_lscpu_CPU_family 6
mk_hw_lscpu_CPU_op-modes 32-bit, 64-bit
mk_hw_lscpu_L1d_cache 32K
mk_hw_lscpu_L1i_cache 32K
mk_hw_lscpu_L2_cache 256K
mk_hw_lscpu_L3_cache 15360K
mk_hw_lscpu_Model 63
mk_hw_lscpu_Stepping 2
mk_hw_lscpu_Vendor_ID GenuineIntel
mk_hw_lscpu_Virtualization VT-x
mtu_docker0 1500
mtu_docker_sys 1500
mtu_eth0 1500
mtu_eth1 1500
mtu_eth2 1500
mtu_eth3 1500
mtu_lo 65536
mtu_none 1500
netmask 255.255.0.0
netmask_docker0 255.255.0.0
netmask_docker_sys 255.255.0.0
netmask_eth2 255.255.255.0
netmask_lo 255.0.0.0
network_docker0 172.17.0.0
network_docker_sys 172.18.0.0
network_eth2 10.33.12.0
network_lo 127.0.0.0
physicalprocessorcount 2
processorcount 24
productname SYS-2028TP-HC1R
serialnumber E168774X6101021
type Other
virtual physical
Note that some of the microkernel facts are missing, so the system tags do not appear. Specifically, mk_hw_cpu_count, mk_hw_mem_size and mk_hw_nic_count are not defined.
Probably related is the exception that is occurring in hnl_mk_hardware_facter.rb.
E, [2016-03-24T23:16:04.885474 #71] ERROR -- HanlonMicrokernel::HnlMkHardwareFacter#rescue in add_facts_to_map!: /usr/local/lib/ruby/hanlon_microkernel/hnl_mk_hardware_facter.rb:79:in `add_facts_to_map!'
/usr/local/lib/ruby/hanlon_microkernel/hnl_mk_registration_manager.rb:52:in `register_with_server'
/usr/local/lib/ruby/hanlon_microkernel/hnl_mk_registration_manager.rb:42:in `register_node_if_changed'
/usr/local/bin/hnl_mk_control_server.rb:256:in `block in <top (required)>'
/usr/local/bin/hnl_mk_control_server.rb:141:in `loop'
/usr/local/bin/hnl_mk_control_server.rb:141:in `<top (required)>'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:218:in `load'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:218:in `start_load'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:297:in `start'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/controller.rb:56:in `run'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons.rb:144:in `block in run'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/cmdline.rb:88:in `call'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/cmdline.rb:88:in `catch_exceptions'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons.rb:143:in `run'
/usr/local/bin/hnl_mk_controller.rb:24:in `<main>'
E, [2016-03-24T23:17:04.883214 #71] ERROR -- HanlonMicrokernel::HnlMkHardwareFacter#rescue in add_facts_to_map!: /usr/local/lib/ruby/hanlon_microkernel/hnl_mk_hardware_facter.rb:79:in `add_facts_to_map!'
/usr/local/lib/ruby/hanlon_microkernel/hnl_mk_registration_manager.rb:52:in `register_with_server'
/usr/local/lib/ruby/hanlon_microkernel/hnl_mk_registration_manager.rb:42:in `register_node_if_changed'
/usr/local/bin/hnl_mk_control_server.rb:256:in `block in <top (required)>'
/usr/local/bin/hnl_mk_control_server.rb:141:in `loop'
/usr/local/bin/hnl_mk_control_server.rb:141:in `<top (required)>'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:218:in `load'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:218:in `start_load'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/application.rb:297:in `start'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/controller.rb:56:in `run'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons.rb:144:in `block in run'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/cmdline.rb:88:in `call'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons/cmdline.rb:88:in `catch_exceptions'
/usr/lib/ruby/gems/2.2.0/gems/daemons-1.2.3/lib/daemons.rb:143:in `run'
/usr/local/bin/hnl_mk_controller.rb:24:in `<main>'
@hickey I created this doc a while back; maybe if we increase the logging we can figure out where the problem is. https://gist.github.com/jcpowermac/3ed70022ba218ad29ce6
Yes, I was starting to look at that this morning. I am also starting to cut a new microkernel image that prints out values throughout the routine to determine what things look like as the routine is executing.
Not sure how the debug level gets transferred to the microkernel (I figure it has to be statically written into the docker image when image add executes), but the lack of controls when starting up the docker containers is getting frustrating. I have already started to extend the hanlon_docker.sh script to be more 12-factor-ish. I guess I have another setting to add.
@hickey see https://github.com/csc/Hanlon-Microkernel/blob/master/hnl_mk_web_server.rb#L57
Most of the Hanlon configuration settings don't need to be changed, which is the reason why we did not add those options when starting the container. It's easy enough to enter the container, modify the config temporarily for testing/debugging, and restart puma.
I have grabbed a copy of the log for analysis. Here is the gist: https://gist.github.com/hickey/6207183c78ea0903cea1
My guess (from looking at line 79 of the hanlon_microkernel/hnl_mk_hardware_facter.rb file in the Hanlon-Microkernel project) is that the command being exec'ed on line 65 of that file (the sudo lshw -c memory command) isn't returning the sizes of the memory slots on that hardware. In that case the bank_array entry in the hash map constructed from the output of that command is empty, a nil is returned from the hash_map["bank_array"] statement, and the rest of the code on that line attempts to run a select statement on a nil object.
This is the first time I've seen this sort of error. Can you run a sudo lshw -c memory command on that node from the command line in the Microkernel just to be sure?
Cheers,
Tom
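The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not the code from hnl_mk_hardware_facter.rb: the method names and the tiny hash_map are hypothetical, and it assumes the parsed output simply has no top-level "bank_array" key.

```ruby
# Hypothetical, trimmed-down stand-in for the parsed `lshw -c memory` output.
# There is no top-level "bank_array" key, so that lookup returns nil.
hash_map = {
  "memory_array" => [
    { "bank_array" => [{ "slot" => "P1-DIMMA1", "size" => "16GiB" }] }
  ]
}

# Unguarded access: Hash#[] returns nil for a missing key, and nil has no #select.
def populated_banks_unguarded(hash_map)
  hash_map["bank_array"].select { |bank| bank["size"] }
end

# Guarded access: Array(nil) is [], so a missing key yields an empty list.
def populated_banks_guarded(hash_map)
  Array(hash_map["bank_array"]).select { |bank| bank["size"] }
end

begin
  populated_banks_unguarded(hash_map)
rescue NoMethodError => e
  puts "unguarded lookup failed: #{e.class}"
end

puts "guarded lookup returned #{populated_banks_guarded(hash_map).length} banks"
```

The guarded version avoids the exception but still finds no banks, which matches the symptom of facts silently going missing rather than being computed.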
Here is the value of hash_map (prettied up to make it readable) just before the exception:
D, [2016-03-27T03:44:02.369209 #71] DEBUG -- HanlonMicrokernel::HnlMkHardwareFacter#add_facts_to_map!:
hash_map = {
"firmware"=>{
"description"=>"BIOS",
"vendor"=>"American Megatrends Inc.",
"physical_id"=>"0",
"version"=>"2.0",
"date"=>"12/18/2015",
"size"=>"64KiB",
"capacity"=>"15MiB",
"capabilities"=>"pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi"
},
"memory_array"=>[
{
"description"=>"System Memory",
"physical_id"=>"57",
"slot"=>"System board or motherboard",
"bank_array"=>[
{
"description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
"product"=>"M393A2G40DB0-CPB",
"vendor"=>"Samsung",
"physical_id"=>"0",
"serial"=>"3130B89B",
"slot"=>"P1-DIMMA1",
"size"=>"16GiB",
"width"=>"64 bits",
"clock"=>"2133MHz (0.5ns)"
},
{
"description"=>"DIMM Synchronous [empty]",
"product"=>"NO DIMM",
"vendor"=>"NO DIMM",
"physical_id"=>"1",
"serial"=>"NO DIMM",
"slot"=>"P1-DIMMA2"
},
{
"description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
"product"=>"M393A2G40DB0-CPB",
"vendor"=>"Samsung",
"physical_id"=>"2",
"serial"=>"3130B8A6",
"slot"=>"P1-DIMMB1",
"size"=>"16GiB",
"width"=>"64 bits",
"clock"=>"2133MHz (0.5ns)"
},
{
"description"=>"DIMM Synchronous [empty]",
"product"=>"NO DIMM",
"vendor"=>"NO DIMM",
"physical_id"=>"3",
"serial"=>"NO DIMM",
"slot"=>"P1-DIMMB2"
},
{
"description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
"product"=>"M393A2G40DB0-CPB",
"vendor"=>"Samsung",
"physical_id"=>"4",
"serial"=>"3130B89E",
"slot"=>"P1-DIMMC1",
"size"=>"16GiB",
"width"=>"64 bits",
"clock"=>"2133MHz (0.5ns)"
},
{
"description"=>"DIMM Synchronous [empty]",
"product"=>"NO DIMM",
"vendor"=>"NO DIMM",
"physical_id"=>"5",
"serial"=>"NO DIMM",
"slot"=>"P1-DIMMC2"
},
{
"description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
"product"=>"M393A2G40DB0-CPB",
"vendor"=>"Samsung",
"physical_id"=>"6",
"serial"=>"3130B8AB",
"slot"=>"P1-DIMMD1",
"size"=>"16GiB",
"width"=>"64 bits",
"clock"=>"2133MHz (0.5ns)"
},
{
"description"=>"DIMM Synchronous [empty]",
"product"=>"NO DIMM",
"vendor"=>"NO DIMM",
"physical_id"=>"7",
"serial"=>"NO DIMM",
"slot"=>"P1-DIMMD2"
}
]
},
{
"description"=>"System Memory",
"physical_id"=>"60",
"slot"=>"System board or motherboard",
"bank_array"=>[
{
"description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
"product"=>"M393A2G40DB0-CPB",
"vendor"=>"Samsung",
"physical_id"=>"0",
"serial"=>"3130B899",
"slot"=>"P2-DIMME1",
"size"=>"16GiB",
"width"=>"64 bits",
"clock"=>"2133MHz (0.5ns)"
},
{
"description"=>"DIMM Synchronous [empty]",
"product"=>"NO DIMM",
"vendor"=>"NO DIMM",
"physical_id"=>"1",
"serial"=>"NO DIMM",
"slot"=>"P2-DIMME2"
},
{
"description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
"product"=>"M393A2G40DB0-CPB",
"vendor"=>"Samsung",
"physical_id"=>"2",
"serial"=>"3130B8A2",
"slot"=>"P2-DIMMF1",
"size"=>"16GiB",
"width"=>"64 bits",
"clock"=>"2133MHz (0.5ns)"
},
{
"description"=>"DIMM Synchronous [empty]",
"product"=>"NO DIMM",
"vendor"=>"NO DIMM",
"physical_id"=>"3",
"serial"=>"NO DIMM",
"slot"=>"P2-DIMMF2"
},
{
"description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
"product"=>"M393A2G40DB0-CPB",
"vendor"=>"Samsung",
"physical_id"=>"4",
"serial"=>"3130B8A1",
"slot"=>"P2-DIMMG1",
"size"=>"16GiB",
"width"=>"64 bits",
"clock"=>"2133MHz (0.5ns)"
},
{
"description"=>"DIMM Synchronous [empty]",
"product"=>"NO DIMM",
"vendor"=>"NO DIMM",
"physical_id"=>"5",
"serial"=>"NO DIMM",
"slot"=>"P2-DIMMG2"
},
{
"description"=>"DIMM Synchronous 2133 MHz (0.5 ns)",
"product"=>"M393A2G40DB0-CPB",
"vendor"=>"Samsung",
"physical_id"=>"6",
"serial"=>"3130B89D",
"slot"=>"P2-DIMMH1",
"size"=>"16GiB",
"width"=>"64 bits",
"clock"=>"2133MHz (0.5ns)"
},
{
"description"=>"DIMM Synchronous [empty]",
"product"=>"NO DIMM",
"vendor"=>"NO DIMM",
"physical_id"=>"7",
"serial"=>"NO DIMM",
"slot"=>"P2-DIMMH2"
}
]
},
{
"UNCLAIMED"=>true, "physical_id"=>"1"
}
],
"cache_array"=>[
{
"description"=>"L1 cache",
"physical_id"=>"74",
"slot"=>"CPU Internal L1",
"size"=>"384KiB",
"capacity"=>"384KiB",
"capabilities"=>"internal write-back"
},
{
"description"=>"L2 cache",
"physical_id"=>"75",
"slot"=>"CPU Internal L2",
"size"=>"1536KiB",
"capacity"=>"1536KiB",
"capabilities"=>"internal write-back unified"
},
{
"description"=>"L3 cache",
"physical_id"=>"76",
"slot"=>"CPU Internal L3",
"size"=>"15MiB",
"capacity"=>"15MiB",
"capabilities"=>"internal write-back unified"
},
{
"description"=>"L1 cache",
"physical_id"=>"78",
"slot"=>"CPU Internal L1",
"size"=>"384KiB",
"capacity"=>"384KiB",
"capabilities"=>"internal write-back"
},
{
"description"=>"L2 cache",
"physical_id"=>"79",
"slot"=>"CPU Internal L2",
"size"=>"1536KiB",
"capacity"=>"1536KiB",
"capabilities"=>"internal write-back unified"
},
{
"description"=>"L3 cache",
"physical_id"=>"7a",
"slot"=>"CPU Internal L3",
"size"=>"15MiB",
"capacity"=>"15MiB",
"capabilities"=>"internal write-back unified"
}
]
}
So clearly in my output the top-level key is memory_array rather than bank_array (bank_array is nested inside each memory_array entry). OK, that is an easy fix.
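For illustration, walking the nested structure shown in the dump might look like the sketch below. This is not the patch referred to in this thread: the helper names and the shortened sample hash are assumptions, and the size parsing only handles whole-number MiB/GiB/TiB strings like the ones in the dump.

```ruby
# Multipliers for the size suffixes seen in the dump (for example "16GiB").
UNITS_IN_MIB = { "MiB" => 1, "GiB" => 1024, "TiB" => 1024 * 1024 }.freeze

# Convert a size string such as "16GiB" to mebibytes; nil if it cannot be parsed.
def size_to_mib(size)
  match = /\A(\d+)(MiB|GiB|TiB)\z/.match(size.to_s)
  match && match[1].to_i * UNITS_IN_MIB[match[2]]
end

# Walk memory_array -> bank_array, tolerating entries with no banks
# (such as the UNCLAIMED entry) and empty slots that have no "size".
def total_memory_mib(hash_map)
  Array(hash_map["memory_array"]).map do |memory|
    Array(memory["bank_array"]).map { |bank| size_to_mib(bank["size"]) || 0 }.inject(0, :+)
  end.inject(0, :+)
end

# Shortened, hypothetical sample in the same shape as the dump above.
sample = {
  "memory_array" => [
    { "bank_array" => [
      { "slot" => "P1-DIMMA1", "size" => "16GiB" },
      { "slot" => "P1-DIMMA2", "product" => "NO DIMM" }
    ] },
    { "UNCLAIMED" => true, "physical_id" => "1" }
  ]
}

puts total_memory_mib(sample)
```

Using map and inject rather than Array#sum keeps the sketch usable on the Ruby 2.2 runtime that appears in the stack trace.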
I walked through the rest of the sections and executed the commands to look at the output. Everything else looks like it will parse correctly, at least up to the point of trying to gather BMC/IPMI information. I have already created an issue (#31) for this. It will be fairly critical to gather this information as well, so hopefully this can be overcome.
A thought I just had (more of a question): why was the hardware gathering done this way instead of using the regular Facter interface? If all these gathering processes were written as either Facter modules or even using the facts-dot-d interface, then any one of them blowing up would not disturb the other bits of code gathering information. At least then in my case only the memory information would be missing.
It would also be easier to add new code to test and generate facts. The other advantage is that running facter on the command line would yield all the regular facts along with the ones being created by the hanlon code.
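The isolation argument can be sketched independently of Facter. In the example below the gatherer names, the gather_facts helper and the fact values are all made up for illustration; only the fact names mk_hw_cpu_count and mk_hw_nic_count come from this thread.

```ruby
# Run each named gatherer on its own; a failure in one is recorded
# and does not stop the others from contributing facts.
def gather_facts(gatherers)
  facts = {}
  errors = {}
  gatherers.each do |name, gatherer|
    begin
      facts.merge!(gatherer.call)
    rescue StandardError => e
      errors[name] = e.message
    end
  end
  [facts, errors]
end

# Hypothetical gatherers; the memory one fails by calling select on nil.
gatherers = {
  "cpu"    => -> { { "mk_hw_cpu_count" => 2 } },
  "memory" => -> { nil.select { |bank| bank } },
  "nic"    => -> { { "mk_hw_nic_count" => 4 } }
}

facts, errors = gather_facts(gatherers)
puts facts.inspect
puts errors.keys.inspect
```

With this shape, the CPU and NIC facts would still be reported even though the memory gatherer raised.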
It would not be that much trouble to break it apart and make it Facter modules. Although the more I think about it, the more I like adding it as facts-dot-d scripts. One of the principal reasons is that it would make it pretty easy to interface to an external PAAS system through a hook to retrieve information and drop a JSON/YAML/text file in the facts-dot-d directory to add PAAS values as facts. (Yes, the microkernel needs to support an easy way for someone to drop scripts in a directory and have them executed prior to and after registration, maybe even at each checkin.) This would allow Hanlon tags to be created from exposed PAAS information.
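As a rough sketch of what such a hook could produce: Facter treats .json, .yaml and .txt files in a facts.d directory as external facts, and executable scripts there report facts by printing key=value lines. The helper names, fact names and values below are hypothetical, and nothing here is part of the microkernel today.

```ruby
require "json"

# Write a hash of facts as a JSON file into a facts.d-style directory
# and return the path of the file that was written.
def write_external_facts(dir, name, facts)
  path = File.join(dir, "#{name}.json")
  File.write(path, JSON.pretty_generate(facts))
  path
end

# Render the same facts as key=value lines, the format an executable
# external-fact script prints on standard output.
def to_key_value_lines(facts)
  facts.map { |key, value| "#{key}=#{value}" }.join("\n")
end

# Hypothetical values a hook might have fetched from an external system.
paas_facts = { "paas_tenant" => "example", "paas_zone" => "rack12" }
puts to_key_value_lines(paas_facts)
```

A hook could call write_external_facts with a directory such as /etc/facter/facts.d (given suitable permissions), after which those values would show up alongside the regular facts the next time facter runs.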
I can report back that the initial changes I have made to support memory_array and bank_array are working. I did not get any real memory information back, so I will look to solve this before I submit a PR. But I am now seeing the facts for the number of CPUs and NICs. So improvements :-)
While I still have the patch for this issue in my local repo, I would suggest that the PR I just created for issue #32 be used instead. That code base also solves some (if not all) of the issues on this thread.