Allow OSD nodes to quiesce after deleting a pool
After a pool is deleted, it may take a while for the OSD nodes to remove the objects it contained. This change makes CBT wait until the OSD nodes quiesce, ensuring they are idle before the next test run starts.
Quiescing is done by waiting until the maximum utilization of any disk falls below 3% across a 30-second window, and then waiting until the maximum CPU utilization of any ceph-osd process falls below 3%.
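Roughly, the logic looks like the sketch below. This is a single-node illustration, not the patch's actual code: names like `wait_for_quiesce` are made up, and the real change runs these checks across all OSD nodes rather than locally.

```python
import subprocess
import time


def max_osd_cpu():
    """Largest %CPU of any ceph-osd process on this node (0.0 if none)."""
    try:
        out = subprocess.check_output(
            ["ps", "-o", "pcpu=", "-C", "ceph-osd"], universal_newlines=True)
    except subprocess.CalledProcessError:
        return 0.0  # ps exits non-zero when no ceph-osd process is running
    values = [float(v) for v in out.split()]
    return max(values) if values else 0.0


def max_disk_util(window):
    """Largest %util of any disk, averaged over `window` seconds.

    iostat prints a since-boot report first and then one report averaged
    over the interval; only device lines after the second header count.
    """
    out = subprocess.check_output(
        ["iostat", "-d", "-x", str(window), "2"], universal_newlines=True)
    utils, headers = [], 0
    for line in out.splitlines():
        if line.startswith("Device"):
            headers += 1
        elif headers >= 2 and line.split():
            try:
                utils.append(float(line.split()[-1]))  # %util is the last column
            except ValueError:
                pass
    return max(utils) if utils else 0.0


def wait_for_quiesce(disk_util_max=3.0, window=30, osd_cpu_max=3.0):
    # Each iostat call already spans a full window, so repeat until quiet.
    while max_disk_util(window) >= disk_util_max:
        pass
    # Then poll until every ceph-osd process drops below the CPU threshold.
    while max_osd_cpu() >= osd_cpu_max:
        time.sleep(1)
```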
Closes #117
I'm also using RHEL 7.2. I suspect you saw no output because of the 'z' flag passed to iostat: it suppresses output for idle devices, so the command prints nothing if you happen to run it while all disks are totally idle. The awk code that processes the output knows how to handle that situation.
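For illustration, here is that idle case expressed in Python rather than awk (`parse_max_util` is a hypothetical name, not the code in this patch): with -z, a report covering a completely idle interval contains no device lines at all, and the parser has to treat that as 0% utilization rather than a failure.

```python
def parse_max_util(report):
    """Return the largest %util in one `iostat -d -x -z` report.

    Because of -z, a report for a completely idle interval has no device
    lines; an empty report therefore means 0% utilization, not an error.
    """
    utils = []
    for line in report.splitlines():
        fields = line.split()
        # Skip banner/header lines; device lines end in a numeric %util.
        if not fields or fields[0] in ("Linux", "Device", "Device:"):
            continue
        try:
            utils.append(float(fields[-1]))
        except ValueError:
            continue
    return max(utils) if utils else 0.0


# Totally idle window: -z printed only the header, so this is simply 0%.
assert parse_max_util("Device:   rrqm/s   wrqm/s   %util\n") == 0.0
```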
Regarding CPU usage, I hadn't considered that you might actually want to run CBT while the cluster is scrubbing or in recovery, but the possibility makes sense now that I think about it. As you noted, though, CPU usage will still spike while objects are being deleted, so I don't know how to distinguish acceptable CPU usage (scrubbing) from the activity we're trying to quiesce on (pool deletion).
One idea is to make the thresholds configurable, and include the ability to bypass the quiesce operations entirely. I could add support for the following cluster settings:
cluster:
  quiesce_disk_util_max: 3
  quiesce_disk_window_size: 30
  quiesce_osd_cpu_max: 3
The settings would be optional, and the values listed above would be the defaults. If either "max" setting is < 0 or > 100, the corresponding quiesce operation would be skipped entirely.
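As a rough sketch of how those optional settings might be read (the `quiesce_settings` helper is illustrative, not the patch's actual code; only the key names come from the list above):

```python
DEFAULTS = {
    "quiesce_disk_util_max": 3,
    "quiesce_disk_window_size": 30,
    "quiesce_osd_cpu_max": 3,
}


def quiesce_settings(cluster_config):
    """Merge the optional cluster settings with their defaults and decide
    whether each quiesce step should run at all."""
    cfg = dict(DEFAULTS)
    cfg.update({k: cluster_config[k] for k in DEFAULTS if k in cluster_config})

    # A "max" value below 0 or above 100 disables that quiesce step.
    cfg["check_disk"] = 0 <= cfg["quiesce_disk_util_max"] <= 100
    cfg["check_cpu"] = 0 <= cfg["quiesce_osd_cpu_max"] <= 100
    return cfg


# Example: disable only the CPU check while keeping the disk check.
settings = quiesce_settings({"quiesce_osd_cpu_max": -1})
```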
How does that sound?
I just pushed a new version that makes the quiesce parameters tunable in the config file. The config items are optional, and default to the values I listed above.
I tested this, and it's awesome. I was seeing wild variance between test iterations whenever the first iteration created a ton of objects. When I run the same test with this patch, the variance drops from 20-100% to less than 5%.