foundationdb

Automatically exclude slow machines

Open etschannen opened this issue 7 years ago • 1 comment

If a single process in a database is much slower than the other processes, the entire database will slow down. This means that failures which cause a process to slow down are much worse than failures that kill a process completely.

The goal of this project is to detect when we are limited by a process which is performing much worse than other processes in the cluster, and automatically exclude it.

When we are being held back by a slow process, it is hard to determine what speed the other processes could achieve. Therefore we should have a background process that periodically writes metadata about achieved throughput to the database, so that current performance can be compared against what the cluster has achieved before.

This approach might produce false positives when the workload profile changes, so we limit the background process to excluding at most one machine.
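A minimal sketch of the decision logic described above, written in Python only for illustration. Everything here is an assumption rather than part of FoundationDB: the function name `pick_slow_process`, the per-process latency metric, and the thresholds `slow_factor` and `degraded_fraction` are all hypothetical. The idea is to act only when the cluster is running well below its recorded throughput, only when one process is a clear outlier relative to its peers, and never when an automatic exclusion is already in place.

```python
from statistics import median
from typing import Dict, Optional, Set


def pick_slow_process(
    latencies: Dict[str, float],       # hypothetical per-process metric; higher is slower
    current_throughput: float,         # throughput the cluster achieves right now
    recorded_throughput: float,        # baseline previously written by the background process
    already_excluded: Set[str],        # processes this mechanism has excluded so far
    slow_factor: float = 3.0,          # assumed threshold: outlier if > 3x the peer median
    degraded_fraction: float = 0.5,    # assumed threshold: degraded if < 50% of baseline
    max_exclusions: int = 1,           # the cap from the issue: at most one exclusion
) -> Optional[str]:
    """Return the single process to exclude, or None if no action should be taken."""
    # Cap the damage a false positive can do.
    if len(already_excluded) >= max_exclusions:
        return None

    # Only act when the cluster is well below what it achieved before.
    if current_throughput >= degraded_fraction * recorded_throughput:
        return None

    candidates = {p: v for p, v in latencies.items() if p not in already_excluded}
    if len(candidates) < 3:
        return None  # too few peers to call anything an outlier

    # Compare the worst process against the median of all the others.
    worst = max(candidates, key=lambda p: candidates[p])
    peers = [v for p, v in candidates.items() if p != worst]
    baseline = median(peers)
    if baseline > 0 and candidates[worst] > slow_factor * baseline:
        return worst
    return None


if __name__ == "__main__":
    sample = {"10.0.0.1:4500": 1.0, "10.0.0.2:4500": 1.2,
              "10.0.0.3:4500": 0.9, "10.0.0.4:4500": 10.0}
    print(pick_slow_process(sample, 40.0, 100.0, set()))
```

A real implementation would need to choose the metric and thresholds carefully. In this sketch, an operator or controller could pass the returned address to the existing `exclude` mechanism, but that wiring is not shown.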

etschannen • Jan 09 '19 01:01