HDDS-10965. Add Top N Datanode Latency Metrics for Block Operations
What changes were proposed in this pull request?
Add a Datanode dimension to the GetBlock, PutBlock, WriteChunk and ReadChunk latency metrics. Since a cluster can have a very large number of Datanodes, exporting metrics for every Datanode would generate a very large number of metrics, so this PR adds a TopNMetrics type, which only exposes information about the Datanodes with the highest latency.
With these metrics, we should be able to quickly find out whether a Datanode is failing. For read operations, the Datanode can be identified directly from the metrics tag "name"; for write operations, the "name" tag contains the Datanode UUID of the Pipeline's leader, so it needs to be combined with the Pipeline information.
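A minimal sketch of the idea (the class and method names below are hypothetical and not the actual TopNMetrics implementation in this PR): keep a per-Datanode counter of operations whose latency exceeds a configured threshold, and export only the top N entries.

```java
// Hypothetical sketch, NOT the actual TopNMetrics class from this PR.
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

public class TopNLatencySketch {
  private final long thresholdMs;  // e.g. 1 or 10 ms, from configuration
  private final int topN;          // e.g. 5, from configuration
  // Count of operations per Datanode whose latency exceeded the threshold.
  private final ConcurrentHashMap<String, LongAdder> exceedCount =
      new ConcurrentHashMap<>();

  public TopNLatencySketch(long thresholdMs, int topN) {
    this.thresholdMs = thresholdMs;
    this.topN = topN;
  }

  /** Called on the client path after each GetBlock/PutBlock/WriteChunk/ReadChunk. */
  public void record(String datanodeUuid, long latencyMs) {
    if (latencyMs > thresholdMs) {
      exceedCount.computeIfAbsent(datanodeUuid, k -> new LongAdder()).increment();
    }
  }

  /** Only the N Datanodes with the highest exceed-count are exported. */
  public List<Map.Entry<String, Long>> snapshotTopN() {
    return exceedCount.entrySet().stream()
        .map(e -> Map.entry(e.getKey(), e.getValue().sum()))
        .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
        .limit(topN)
        .collect(Collectors.toList());
  }
}
```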
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10965
Example
Configuration
<property>
  <name>ozone.xceiver.client.top.metrics.latency.record.threshold.ms.key</name>
  <value>1, 10</value>
</property>
<property>
  <name>ozone.xceiver.client.top.metrics.latency.record.count.key</name>
  <value>5</value>
</property>
Metrics
[root@VM-7-9-centos ~]$ curl -s http://127.0.0.1:9878/prom | grep -v '#' | grep -i top | grep top
xceiver_client_metrics_put_block_exceed10_ms_count_top_n{top="1",name="2384b4cc-6808-489c-a14e-0781a6cbeb5d",hostname="VM-7-9-centos"} 1
xceiver_client_metrics_put_block_exceed1_ms_count_top_n{top="1",name="2384b4cc-6808-489c-a14e-0781a6cbeb5d",hostname="VM-7-9-centos"} 13
xceiver_client_metrics_put_block_exceed1_ms_count_top_n{top="2",name="99ce0d5a-0c29-49b5-94b5-cae68f8bdc39",hostname="VM-7-9-centos"} 9
xceiver_client_metrics_write_chunk_exceed10_ms_count_top_n{top="1",name="2384b4cc-6808-489c-a14e-0781a6cbeb5d",hostname="VM-7-9-centos"} 1
xceiver_client_metrics_write_chunk_exceed1_ms_count_top_n{top="1",name="2384b4cc-6808-489c-a14e-0781a6cbeb5d",hostname="VM-7-9-centos"} 13
xceiver_client_metrics_write_chunk_exceed1_ms_count_top_n{top="2",name="99ce0d5a-0c29-49b5-94b5-cae68f8bdc39",hostname="VM-7-9-centos"} 9
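In this example, the two configured thresholds (1 ms and 10 ms) produce the exceed1_ms and exceed10_ms counters, and the count setting of 5 limits the export to at most five Datanodes per counter. The top tag appears to rank the Datanodes by exceed count (top="1" has the highest count), and name carries the Datanode UUID, which for write operations is the Pipeline leader's UUID as noted above.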
How was this patch tested?
Existing tests.
@xichen01 thanks for working on the change.
Please resolve the merge conflict. Can you also introduce additional unit tests to cover TopNMetrics?
@xichen01 also, can you please fix the PR title? I don't think this is specific to S3 Gateway.
> @xichen01 thanks for working on the change.
> Please resolve the merge conflict. Can you also introduce additional unit tests to cover TopNMetrics?

OK, I will.

> @xichen01 also, can you please fix the PR title? I don't think this is specific to S3 Gateway.

Updated.

> can you please fix the PR title? I don't think this is specific to S3 Gateway.
> Updated
Thanks a lot, the new one is much better for clarity.
Why do we need to filter the metrics for top 10 here and add a lot of processing in the data path? Grafana can sort and showcase top 10. Why don't we just add the datanode label for each latency measured from the client. This way, we can compare client side view of latency on a per datanode basis even if they are below a certain threshold? It might be over all simpler to capture all the data in the client and let grafana sort through and present what we want instead of adding the threshold logic here.
@kerneltime Thanks for your response.
> Why do we need to filter the metrics for top 10 here and add a lot of processing in the data path? Grafana can sort and showcase top 10. Why don't we just add the datanode label for each latency measured from the client. This way, we can compare client side view of latency on a per datanode basis even if they are below a certain threshold? It might be over all simpler to capture all the data in the client and let grafana sort through and present what we want instead of adding the threshold logic here.

- A cluster may have a lot of Datanodes; if we export metrics for all Datanodes, this will generate a very large number of metrics. Just exporting opsLatencyMetrics for GetBlock, PutBlock, WriteChunk and ReadChunk may generate thousands of metrics, and most of them are probably of no concern.
- TopN metrics only record the number of operations whose latency exceeds a threshold. These values are more useful and make it easier to detect long-tail problems. If only the average value were used, long-tail latency would be averaged out, making it difficult to detect long-tail latency problems from the metrics.
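As a concrete illustration: if 99 requests to a Datanode take 1 ms and one takes 500 ms, the average latency is only about 6 ms and looks healthy, while an exceed-10ms counter immediately surfaces the slow request.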
@kerneltime can you please check if the latest comment answers your question?
/pending conflicts
Thank you very much for the patch. I am closing this PR temporarily as there was no activity recently and it is waiting for response from its author.
It doesn't mean that this PR is not important or ignored: feel free to reopen the PR at any time.
It only means that attention of committers is not required. We prefer to keep the review queue clean. This ensures PRs in need of review are more visible, which results in faster feedback for all PRs.
If you need ANY help to finish this PR, please contact the community on the mailing list or the slack channel.