Jesse Noller
Given what we know about the kernel IO path issues and the bottleneck with the kernel / OS disk itself, I need to re-run / fix the tests here: https://stian.tech/disk-performance-on-aks-part-1/...
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections Azure's LBaaS services re-use connections, etc. - this behavior can trigger "SNAT port exhaustion"; that exhaustion triggers Linux kernel traffic black-holing, which in turn leads...
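One node-side signal related to the black-holing above is conntrack table saturation. A minimal check sketch - the helper name is mine, and it assumes the nf_conntrack module is loaded on the node:

```shell
# Hypothetical helper: conntrack table utilization as a whole percent.
conntrack_pct() {
  local count=$1 max=$2
  echo $(( count * 100 / max ))
}

# On a live node (assumes nf_conntrack is loaded):
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

Note this measures node conntrack fill, not the LB's SNAT port pool itself - it is a correlated symptom, not the exhaustion counter.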
Revise the current TSG(s) with "see X, do Y" instructions for users to identify the failure / disk-latency-triggered issues. Omit all background information.
Validate IO impact on 500-1000 node clusters - the impact should be dramatically worse due to the node count.
Current `tainted-love` deployment is enough to kill a cluster; need to back off to identify the precise break point.
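Backing off to the break point can be a simple bisection over replica counts. A minimal sketch - the helper name is mine, and the loop assumes kubectl is pointed at the test cluster:

```shell
# Midpoint between the last-known-good and first-known-bad replica counts.
next_replicas() {
  echo $(( ($1 + $2) / 2 ))
}

# Example bisection loop (commented; run against the test cluster only):
#   good=0; bad=200
#   while [ $(( bad - good )) -gt 1 ]; do
#     mid=$(next_replicas "$good" "$bad")
#     kubectl scale deployment tainted-love --replicas="$mid"
#     # observe cluster health, then set good=$mid or bad=$mid
#   done
```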
Based on the PLEG / other log lines detected due to disk latency, I need to set up a custom fork of NPD (node-problem-detector) to test new issue detectors.
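For prototyping detectors before touching the NPD fork, grepping kubelet logs for the PLEG health message is a quick start. A sketch - the pattern matches the upstream kubelet "PLEG is not healthy" wording, and the helper name is mine:

```shell
# Count PLEG-health log lines on stdin. The kubelet logs "PLEG is not
# healthy" when relist latency exceeds its threshold - a common symptom
# of the disk latency issues above.
PLEG_PATTERN='PLEG is not healthy'

count_pleg() {
  grep -c "$PLEG_PATTERN"
}

# On a node:
#   journalctl -u kubelet --no-pager | count_pleg
```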
Kernel versions LOVE changing disk / storage characteristics. Perform baseline disk tests using:
- Ubuntu 16.04 / 18.04
- OS disk - default, ephemeral vs larger NAS
- Attached disk...
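To keep the baselines comparable across kernels and disk targets, pinning one identical fio job per target helps. A sketch - the job parameters are my assumptions, not tuned numbers, and the helper name is mine:

```shell
# Build one identical fio random-write job per target directory so the
# OS, ephemeral, and attached disks all run the same workload.
fio_cmd() {
  echo "fio --name=baseline --directory=$1 --rw=randwrite --bs=4k" \
       "--size=1g --numjobs=4 --iodepth=32 --ioengine=libaio" \
       "--direct=1 --time_based --runtime=60 --group_reporting"
}

# Example (run on the node under test): fio_cmd /mnt/attached-disk | sh
fio_cmd /mnt/os-disk
```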
Common failure where users run many watches - default settings may need to be tuned for specific workloads using DaemonSets.
Check the current cap: sysctl fs.inotify.max_user_watches (default: fs.inotify.max_user_watches = 8192)
Raise the cap: sysctl -w fs.inotify.max_user_watches=588576...
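Before shipping a DaemonSet that raises the cap, it helps to see how close nodes are to the limit. A sketch - the helper name is mine:

```shell
# Parse "key = value" sysctl output down to the value.
sysctl_value() {
  awk -F' = ' '{print $2}'
}

# On a node - current cap vs. open inotify *instances* (per-watch counts
# live under /proc/<pid>/fdinfo/<fd>, not counted here):
#   limit=$(sysctl fs.inotify.max_user_watches | sysctl_value)
#   in_use=$(find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l)
```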
Execute and capture failures/baselines of the upstream clusterloader2 tests, including changes needed to run successfully, plus metrics and failure points before and after isolating Docker's IO. Currently all tests fail...
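For the before/after comparison it may help to pin the invocation in a wrapper so each run writes to its own report dir. A sketch assuming the clusterloader2 binary and its common flags (--testconfig, --provider, --kubeconfig, --report-dir); the provider choice and paths are placeholders:

```shell
# Build a clusterloader2 invocation; one report dir per run keeps the
# before / after (Docker IO isolation) results separated.
cl2_cmd() {
  local config=$1 report_dir=$2
  echo "./clusterloader --testconfig=$config --provider=skeleton" \
       "--kubeconfig=$HOME/.kube/config --report-dir=$report_dir"
}

cl2_cmd testing/density/config.yaml /tmp/cl2-before
```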
Document best practices for host/vm/runtime log aggregation to detect system error patterns and faults.
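A starting point for the pattern-detection side of the aggregation work: scan any log stream for the disk/IO fault signatures discussed above. A sketch - the pattern list is mine and not exhaustive:

```shell
# Hypothetical fault-pattern scan over a log stream on stdin. Patterns
# are common kernel/runtime disk-fault signatures, not an exhaustive set.
scan_faults() {
  grep -E -c 'PLEG is not healthy|hung_task_timeout_secs|I/O error|EXT4-fs error'
}

# Examples:
#   journalctl -k | scan_faults          # kernel ring buffer
#   journalctl -u docker | scan_faults   # runtime logs
```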