Jesse Noller
Given what we know about the kernel IO path issues and the bottleneck with the kernel / OS disk itself, I need to re-run / fix the tests here: https://stian.tech/disk-performance-on-aks-part-1/...
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections Azure's LBaaS services re-use connections, etc. - this behavior can trigger "SNAT port exhaustion"; that exhaustion triggers Linux kernel traffic black-holing, which in turn leads...
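One node-side signal related to the black-holing above is conntrack table saturation. A minimal check sketch - the helper name is mine, and it assumes the nf_conntrack module is loaded on the node:

```shell
# Hypothetical helper: conntrack table utilization as a whole percent.
conntrack_pct() {
  local count=$1 max=$2
  echo $(( count * 100 / max ))
}

# On a live node (assumes nf_conntrack is loaded):
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

Note this measures node conntrack fill, not the LB's SNAT port pool itself - it is a correlated symptom, not the exhaustion counter.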
Revise the current TSG(s) with "see X, do Y" instructions for users to identify the failure / disk-latency-triggered issues. Omit all background information.
Validate IO impact on 500-1000 node clusters - the impact should be dramatically worse due to the node count.
Current `tainted-love` deployment is enough to kill a cluster; need to back off to identify the precise break point.
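Backing off to the break point can be a simple bisection over replica counts. A minimal sketch - the helper name is mine, and the loop assumes kubectl is pointed at the test cluster:

```shell
# Midpoint between the last-known-good and first-known-bad replica counts.
next_replicas() {
  echo $(( ($1 + $2) / 2 ))
}

# Example bisection loop (commented; run against the test cluster only):
#   good=0; bad=200
#   while [ $(( bad - good )) -gt 1 ]; do
#     mid=$(next_replicas "$good" "$bad")
#     kubectl scale deployment tainted-love --replicas="$mid"
#     # observe cluster health, then set good=$mid or bad=$mid
#   done
```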
Based on the PLEG / other log lines detected due to disk latency, I need to set up a custom fork of NPD (node-problem-detector) to test new issue detectors.
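For prototyping detectors before touching the NPD fork, grepping kubelet logs for the PLEG health message is a quick start. A sketch - the pattern matches the upstream kubelet "PLEG is not healthy" wording, and the helper name is mine:

```shell
# Count PLEG-health log lines on stdin. The kubelet logs "PLEG is not
# healthy" when relist latency exceeds its threshold - a common symptom
# of the disk latency issues above.
PLEG_PATTERN='PLEG is not healthy'

count_pleg() {
  grep -c "$PLEG_PATTERN"
}

# On a node:
#   journalctl -u kubelet --no-pager | count_pleg
```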
Kernel versions LOVE changing disk / storage characteristics. Perform baseline disk tests using:
- Ubuntu 16.04 / 18.04
- OS disk - default, ephemeral vs larger NAS
- Attached disk...
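To keep the baselines comparable across kernels and disk targets, pinning one identical fio job per target helps. A sketch - the job parameters are my assumptions, not tuned numbers, and the helper name is mine:

```shell
# Build one identical fio random-write job per target directory so the
# OS, ephemeral, and attached disks all run the same workload.
fio_cmd() {
  echo "fio --name=baseline --directory=$1 --rw=randwrite --bs=4k" \
       "--size=1g --numjobs=4 --iodepth=32 --ioengine=libaio" \
       "--direct=1 --time_based --runtime=60 --group_reporting"
}

# Example (run on the node under test): fio_cmd /mnt/attached-disk | sh
fio_cmd /mnt/os-disk
```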
Common failure where users run many watches - default settings may need to be tuned for specific workloads using DaemonSets.
Check the current cap: sysctl fs.inotify.max_user_watches (default: fs.inotify.max_user_watches = 8192)
Raise the cap: sysctl -w fs.inotify.max_user_watches=588576...
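Before shipping a DaemonSet that raises the cap, it helps to see how close nodes are to the limit. A sketch - the helper name is mine:

```shell
# Parse "key = value" sysctl output down to the value.
sysctl_value() {
  awk -F' = ' '{print $2}'
}

# On a node - current cap vs. open inotify *instances* (per-watch counts
# live under /proc/<pid>/fdinfo/<fd>, not counted here):
#   limit=$(sysctl fs.inotify.max_user_watches | sysctl_value)
#   in_use=$(find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l)
```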
Execute and capture failures/baselines of the upstream clusterloader2 tests, including changes needed to run successfully, plus metrics and failure points before and after isolating Docker's IO. Currently all tests fail...
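For the before/after comparison it may help to pin the invocation in a wrapper so each run writes to its own report dir. A sketch assuming the clusterloader2 binary and its common flags (--testconfig, --provider, --kubeconfig, --report-dir); the provider choice and paths are placeholders:

```shell
# Build a clusterloader2 invocation; one report dir per run keeps the
# before / after (Docker IO isolation) results separated.
cl2_cmd() {
  local config=$1 report_dir=$2
  echo "./clusterloader --testconfig=$config --provider=skeleton" \
       "--kubeconfig=$HOME/.kube/config --report-dir=$report_dir"
}

cl2_cmd testing/density/config.yaml /tmp/cl2-before
```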
Document best practices for host/vm/runtime log aggregation to detect system error patterns and faults.
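A starting point for the pattern-detection side of the aggregation work: scan any log stream for the disk/IO fault signatures discussed above. A sketch - the pattern list is mine and not exhaustive:

```shell
# Hypothetical fault-pattern scan over a log stream on stdin. Patterns
# are common kernel/runtime disk-fault signatures, not an exhaustive set.
scan_faults() {
  grep -E -c 'PLEG is not healthy|hung_task_timeout_secs|I/O error|EXT4-fs error'
}

# Examples:
#   journalctl -k | scan_faults          # kernel ring buffer
#   journalctl -u docker | scan_faults   # runtime logs
```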