Status Json Generation Can Be Very Long

Open jzhou77 opened this issue 3 years ago • 1 comments

clusterGetStatus() runs many steps serially. Even though each step typically has a timeout, the total time is the sum of the per-step times and can be very long, especially when there are faults in the cluster, e.g., when some storage servers are unavailable.

A second problem is that if a new step is added without a proper timeout, the whole process becomes unbounded. This is bad because operational tools typically depend on the output of status json.

To bound the time for status generation, we need to parallelize the steps as much as possible, probably as part of a code refactoring that addresses both of the above problems.
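
A rough sketch of the idea, not the actual clusterGetStatus() code: Python's asyncio stands in for the real implementation here, and every name below (run_step, generate_status, the two demo steps) is hypothetical. Each step goes through one wrapper that enforces a timeout, so a newly added step cannot be unbounded, and all steps run concurrently, so the total time is roughly the slowest step rather than the sum.

```python
import asyncio
import time


async def run_step(name, step, timeout):
    """Run one status step under an enforced timeout; never raises.

    Returns (name, value, error), where error is None on success.
    """
    try:
        value = await asyncio.wait_for(step(), timeout)
        return name, value, None
    except asyncio.TimeoutError:
        return name, None, "timed out"
    except Exception as e:  # a failing step must not sink the whole status
        return name, None, f"failed: {e}"


async def generate_status(steps, timeout):
    """Run all steps concurrently; total time is bounded by about `timeout`.

    `steps` maps a section name to a zero-argument coroutine function.
    """
    results = await asyncio.gather(
        *(run_step(name, step, timeout) for name, step in steps.items())
    )
    status = {"sections": {}, "incomplete": []}
    for name, value, error in results:
        if error is None:
            status["sections"][name] = value
        else:
            status["incomplete"].append({"section": name, "reason": error})
    return status


# Hypothetical steps for illustration only.
async def fast_step():
    await asyncio.sleep(0.05)
    return {"ok": True}


async def slow_step():
    # Simulates a step that hangs, e.g. waiting on an unavailable server.
    await asyncio.sleep(10)
    return {"ok": True}


if __name__ == "__main__":
    start = time.monotonic()
    status = asyncio.run(
        generate_status({"fast": fast_step, "slow": slow_step}, timeout=0.2)
    )
    print(status, f"elapsed={time.monotonic() - start:.2f}s")
```

Because the timeout lives in the wrapper rather than in each step, a step that forgets its own timeout is still cut off, and the section is reported as incomplete instead of delaying the whole document.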

Jul 25 '22 00:07 jzhou77

@sfc-gh-satherton mentioned that status json generation currently has an optimization that reduces memory copies, which cuts the time from a few seconds to less than one second. The refactor should take this optimization into account as well to reduce the total time.

Jul 27 '22 17:07 jzhou77