zookeeper feature：Configurable summarySetBlackList to set monitoring blacklist.

Issue:

The current ZooKeeper (ZK) cluster frequently experiences excessively long Full Garbage Collection (FGC) events (averaging once a week) that last over 10 seconds, leading to occasional, unexpected leader elections in the ZK cluster.
The cluster currently fails to collect mntr monitoring data.

Root Cause:

The ZK cluster uses the direct child nodes under the root directory (/) as Namespaces, At the Namespace level, the following have been implemented:
Monitoring of MIN, CNT, AVG, MAX, and SUM request counts, which can be obtained via mntr monitoring.
monitoring for Namespaces only accounts for additions and not removals, tasks are created directly under the root directory (/), with tasks older than seven days being cleaned daily.
This leads to the following phenomenon: The number of Namespaces stored in ZK nodes (157,376) is significantly greater than the actual number of Namespaces in the ZK cluster (5,856).

Solution: Provide a new version of the ZK server that supports configurable Namespace monitoring.

May 16 '24 03:05 zichen-gan

I cannot understand the intention of this PR clearly. From the description, I guess

it wants to solve the metric issues which is inaccurate or too expensive? which metrics(write_per_namespace,read_per_namespace)?
Add a blacklist can completely solve this problem?

May 21 '24 13:05 maoling

it wants to solve metrics(write_per_namespace,read_per_namespace) which is too expensive.

/ stores task IDs, which will be frequently added and deleted.metrics(write_per_namespace,read_per_namespace) only increase and do not decrease.
we do not require monitoring for write_per_namespace and read_per_namespace Hope to support write_per_namespace, read_per_namespace blacklist。 @maoling

May 22 '24 02:05 zichen-gan

Perhaps it's a more universal and maintainable solution to remove the metric monitoring when the / path node is deleted. Regarding this PR, I will make some further modifications.

May 24 '24 01:05 zichen-gan

https://github.com/apache/zookeeper/pull/2171 @maoling Added a new test MetricSetRemoveTest to help understand this bug, and I found that fixing it using the previous blacklist method was much easier. bug fix:At the time of ZK service delete root path, the audit and control was remove the NameSpace Metric.

Jun 04 '24 07:06 zichen-gan

Yes, I understand now. When you frequently create and delete numerous znodes at the first level, the write_per_namespace and read_per_namespace metrics will indeed expand and consume excessive memory.

A possible workaround is to add an additional layer under the root directory of the tree (e.g., /taskName/taskId-[1,n]).

While deleting the SummarySet key might work, it seems somewhat counterintuitive.

Here are a few alternative suggestions (just my personal opinion):

Limit the size of a SummarySet. For example, when the number of keys exceeds 1000, prevent any further additions.
Implementing a blacklist is a good idea, as we have other metrics that are costly and can be resource-intensive to query at times. 2.1. Place the blacklist in the zoo.cfg file within the metrics section, instead of using a SystemProperty. 2.2. Ensure the blacklist supports all metric types, not just SummarySet.

## Metrics Providers
metricsProvider.exportJvmInfo=true
# Support multiple and prefix matching
metricsProvider.blackList=prep_processor*,sync_processor*

Jun 18 '24 14:06 maoling

I am the ZK cluster provider, so it is difficult for businesses to modify the way they use it.😂

The reason why i don't want to use blacklists is that each business that developer them incorrectly needs to configure a blacklist.So i deleting the SummarySet key to make ZK developer unconsciously.

I'm sorry, I don't understand why this is counterintuitive. @maoling

Jun 19 '24 07:06 zichen-gan