Container Reset by HA during concurrent execs

Open malikkal opened this issue 7 years ago • 1 comments

VIC Engine - v1.4.1 / vSphere 6.0 u3e

User statement: As a customer of VIC, we would like to have concurrent execs work reliably.

Issue Description: There have been various issues with exec that were improved considerably in 1.4.1; however, this one is a new variant.

Concurrent execs probably made HA think that the cVM had hung, and it was promptly reset.
The fdm logs suggested it was a genuine restart due to a missed vmtools heartbeat.
The issue was resolved in the interim by reducing the number of concurrent execs.

Aug 09 '18 12:08 malikkal

Item of interest while looking at the container tether log:

+ echo 'switching to the new mount'
switching to the new mount
+ systemctl switch-root /mnt/containerfs /.tether/tether
+ echo 'switched to the new mount'
switched to the new mount
+ systemctl poweroff

The code performs a systemctl switch-root, but we still see the commands after it being executed (although the poweroff does not take effect). This did not use to be the case, suggesting that systemctl switch-root behaviour has changed. While it does not seem to be misbehaving currently, we should ensure that no significant code is executed after the switch-root.

Regarding the failure of the cVM:

2018/08/06 20:07:27 error calling Set_Option: request "info-set guestinfo.ip x.x.x.x": "0 Total guest info exceeds maximum allocated space"
2018/08/06 20:07:27 Message: Unable to send a message over the communication channel 0

This looks likely to be related to the extra large number of exec requests that were pushed to the cVM. We do not garbage collect exec records after they complete, for two reasons: exec is primarily present as a debug/diagnostic mechanism and so is not expected to be heavily used in production, and an audit of what exec operations were run against a container is useful. If exec is used heavily, we may need to change the mechanism behind it, given that heavy use is completely opposite to the initial design assumptions, both for VIC and regular docker.
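
One possible direction, sketched in Go under stated assumptions: drop the oldest completed exec records once the serialized total exceeds a budget, while never touching records for execs that are still running. `ExecRecord`, `prune`, and the byte budget are hypothetical names and values for illustration; they are not the tether's actual types, and the real guestinfo limit is not stated in this issue.

```go
package main

import "fmt"

// ExecRecord is a hypothetical stand-in for the per-exec state that is
// persisted for a container; it is not the real VIC type.
type ExecRecord struct {
	ID      string
	Done    bool   // true once the exec has completed
	Payload string // serialized form; its length counts against the budget
}

// prune walks records oldest-first and drops completed ones until the total
// payload size fits within budget. Running records are always kept, so the
// result can still exceed budget if the running execs alone are too large.
func prune(records []ExecRecord, budget int) []ExecRecord {
	total := 0
	for _, r := range records {
		total += len(r.Payload)
	}

	kept := make([]ExecRecord, 0, len(records))
	for _, r := range records {
		if total > budget && r.Done {
			total -= len(r.Payload) // reclaim the space of a completed record
			continue
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	records := []ExecRecord{
		{ID: "a", Done: true, Payload: "aaaa"},
		{ID: "b", Done: false, Payload: "bbbb"},
		{ID: "c", Done: true, Payload: "cccc"},
		{ID: "d", Done: true, Payload: "dddd"},
	}
	// 8 bytes is an arbitrary example budget, not the real guestinfo limit.
	for _, r := range prune(records, 8) {
		fmt.Println(r.ID, r.Done)
	}
}
```

The trade-off is the one noted above: pruning completed records loses part of the audit trail, so a real change might archive them elsewhere before dropping them.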

It's unclear at first glance why the CPU halt has occurred - there is no indication in the tether log of it becoming unresponsive and I do not see explicit mention in the vmware.log.

We should get someone from the hypervisor team to look over the vmware.log and confirm whether there are entries that could lead to a CPU halt. We should also confirm the impact of exhausting guestinfo memory. We should add a unit test to tether for this circumstance, fuzzing this response.

Detailed logs snipped to @hickeng onedrive (VIC/issues/8197)

Aug 26 '18 02:08 hickeng