Jesse Noller

Results: 48 issues by Jesse Noller

Given what we know about the kernel IO path issues and the bottleneck on the kernel / OS disk itself, I need to re-run / fix the tests here: https://stian.tech/disk-performance-on-aks-part-1/...

documentation
test

https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections Azure's *LBaaS services re-use connections, etc. This behavior triggers something called "SNAT port exhaustion"; that exhaustion triggers Linux kernel traffic black-holing, which in turn leads...

Revise the current TSG(s) with "see X, do Y" instructions so users can identify failures and disk-latency-triggered issues. Omit all background information.

documentation

Validate IO impact on 500-1000 node clusters; it should be dramatically worse due to the node count.

test

The current `tainted-love` deployment is enough to kill a cluster; need to back off to identify the precise break point.

code
test

Based on the PLEG / other log lines detected due to disk latency, I need to set up a custom fork of NPD to test new issue detectors.

code

Kernel versions LOVE changing disk / storage characteristics. Perform baseline disk tests using: - Ubuntu 16.04 / 18.04 - OS disk - default, ephemeral vs larger NAS - Attached disk...

test
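
A minimal sketch of the kind of baseline probe the disk-test issue above describes, assuming a plain fsync-per-write loop is an acceptable first approximation of disk latency. The function names `fsync_write_latencies` and `summarize`, the 4096-byte block size, and the write count are illustrative assumptions, not part of the original test plan; a real baseline would more likely use a dedicated benchmarking tool.

```python
import os
import statistics
import tempfile
import time


def fsync_write_latencies(directory, writes=20, block_size=4096):
    """Time `writes` synchronous block writes in `directory` (seconds each)."""
    payload = b"\0" * block_size
    latencies = []
    # The temp file lives on whichever disk backs `directory`.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the kernel to flush the file to the storage device
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Reduce raw samples to a few numbers that can be compared across hosts."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered) * 1000,
        "max_ms": ordered[-1] * 1000,
    }


if __name__ == "__main__":
    # Point this at a directory on the OS disk, then at one on an attached disk.
    print(summarize(fsync_write_latencies(tempfile.gettempdir())))
```

Running the same probe against a directory on each disk type, under each kernel version, gives comparable median and worst-case numbers.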

Common failure for users with many watches: default settings may need to be tuned for specific workloads using daemonsets. Check the cap: `sysctl fs.inotify.max_user_watches` reports `fs.inotify.max_user_watches = 8192`. Reset cap: `sysctl -w fs.inotify.max_user_watches=588576`...

bug
investigation
system failure
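
A small sketch of how the inotify check above could be automated, assuming a Linux host where the value is exposed at the standard procfs path `/proc/sys/fs/inotify/max_user_watches`. The helper names (`parse_sysctl_line`, `needs_raise`, `read_current_cap`) are hypothetical; the 8192 and 588576 values are the ones quoted in the issue text.

```python
from pathlib import Path
from typing import Optional, Tuple

# Standard procfs location backing the fs.inotify.max_user_watches sysctl.
INOTIFY_WATCHES = Path("/proc/sys/fs/inotify/max_user_watches")


def parse_sysctl_line(line: str) -> Tuple[str, int]:
    """Parse a line such as 'fs.inotify.max_user_watches = 8192'."""
    key, _, value = line.partition("=")
    return key.strip(), int(value.strip())


def needs_raise(current: int, target: int) -> bool:
    """Return True when the current cap is below the desired target."""
    return current < target


def read_current_cap(path: Path = INOTIFY_WATCHES) -> Optional[int]:
    """Read the live value; return None if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


if __name__ == "__main__":
    key, default_cap = parse_sysctl_line("fs.inotify.max_user_watches = 8192")
    target = 588576  # cap quoted in the issue text
    print(key, default_cap, "raise needed:", needs_raise(default_cap, target))
    print("live value on this host:", read_current_cap())
```

The sketch only reads and compares; applying the new cap would still be done with `sysctl -w`.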

Execute and capture failures/baselines of the upstream clusterloader2 tests, including changes needed to run successfully, metrics, and failure points before and after isolating Docker's IO. Currently all tests fail...

investigation
test
system failure

Document best practices for host/VM/runtime log aggregation to detect system error patterns and faults.

documentation