feature: Configurable summarySetBlackList to set a monitoring blacklist.
Issue:
- The current ZooKeeper (ZK) cluster frequently experiences excessively long Full Garbage Collection (FGC) events (averaging once a week) that last over 10 seconds, leading to occasional, unexpected leader elections in the ZK cluster.
- The cluster currently fails to collect mntr monitoring data.
Root Cause:
- The ZK cluster uses the direct child nodes under the root directory (/) as Namespaces. At the Namespace level, the following have been implemented:
- Monitoring of request-count statistics (MIN, CNT, AVG, MAX, and SUM), which can be obtained via the mntr command.
- Namespace monitoring only accounts for additions, never removals. Tasks are created directly under the root directory (/), and tasks older than seven days are cleaned up daily.
- This leads to the following phenomenon: the number of Namespaces retained in the ZK servers' metrics (157,376) is significantly greater than the actual number of Namespaces in the ZK cluster (5,856).
Solution: Provide a new version of the ZK server that supports configurable Namespace monitoring.
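For illustration, here is a minimal, self-contained Java sketch of the growth pattern described above (the class and method names are made up, not ZooKeeper's actual code): a per-namespace summary keyed by the first path component under /, which only ever gains keys and never loses them, so entries for long-deleted task znodes stay in memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class NamespaceSummarySketch {
    // One entry per namespace ever seen; deleting the znode does not remove the entry.
    private final Map<String, LongAdder> perNamespace = new ConcurrentHashMap<>();

    // Record one request against the namespace derived from the request path.
    public void add(String path, long value) {
        String namespace = namespaceOf(path);
        perNamespace.computeIfAbsent(namespace, k -> new LongAdder()).add(value);
    }

    // First path component under "/", e.g. "/task-42/child" -> "task-42".
    private static String namespaceOf(String path) {
        int second = path.indexOf('/', 1);
        return second < 0 ? path.substring(1) : path.substring(1, second);
    }

    public int trackedNamespaces() {
        return perNamespace.size();
    }

    public static void main(String[] args) {
        NamespaceSummarySketch summary = new NamespaceSummarySketch();
        // Daily task znodes are created (and later deleted) directly under "/",
        // but the summary keeps every key it has ever seen.
        for (int day = 0; day < 365; day++) {
            summary.add("/task-" + day, 1);
        }
        System.out.println("namespaces tracked: " + summary.trackedNamespaces()); // 365
    }
}
```

Run against a year of daily task paths, the sketch still reports 365 tracked namespaces even if every corresponding znode has since been deleted.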
I cannot clearly understand the intention of this PR. From the description, I guess:
- Does it want to solve metric issues that are inaccurate or too expensive? Which metrics (write_per_namespace, read_per_namespace)?
- Can adding a blacklist completely solve this problem?
It wants to solve the problem that the metrics (write_per_namespace, read_per_namespace) are too expensive.
- / stores task IDs, which are frequently added and deleted, but the metrics (write_per_namespace, read_per_namespace) only increase and never decrease.
- We do not require monitoring for write_per_namespace and read_per_namespace, so we hope to support a blacklist for write_per_namespace and read_per_namespace. @maoling
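A minimal sketch of the blacklist being asked for here, under assumed names (this is not the PR's actual code): if a summary set's name appears in the configured blacklist, its add() becomes a no-op, so write_per_namespace and read_per_namespace never accumulate keys.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class BlacklistedSummarySet {
    private final boolean disabled;                                  // true if this set's name is blacklisted
    private final Map<String, LongAdder> values = new ConcurrentHashMap<>();

    public BlacklistedSummarySet(String name, Set<String> blacklist) {
        this.disabled = blacklist.contains(name);
    }

    public void add(String key, long value) {
        if (disabled) {
            return;                                                  // blacklisted: record nothing, memory stays flat
        }
        values.computeIfAbsent(key, k -> new LongAdder()).add(value);
    }

    public int size() {
        return values.size();
    }

    public static void main(String[] args) {
        Set<String> blacklist = Set.of("write_per_namespace", "read_per_namespace");
        BlacklistedSummarySet writes = new BlacklistedSummarySet("write_per_namespace", blacklist);
        writes.add("task-1", 1);
        writes.add("task-2", 1);
        System.out.println(writes.size());                           // 0: nothing retained
    }
}
```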
Perhaps a more universal and maintainable solution is to remove the metric entry when the corresponding node under the / path is deleted. I will make some further modifications to this PR.
https://github.com/apache/zookeeper/pull/2171 @maoling I added a new test, MetricSetRemoveTest, to help understand this bug, and I found that fixing it with the previous blacklist approach was much easier. Bug fix: when the ZK server deletes a node under the root path, the corresponding NameSpace metric is removed as well.
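For context, a hedged sketch of the alternative being discussed (the class and method names are illustrative, not the actual code in apache/zookeeper#2171): give the per-namespace summary a remove() that is invoked when the corresponding first-level znode is deleted, so the metric entry disappears together with the namespace it describes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class RemovableSummarySet {
    private final Map<String, LongAdder> values = new ConcurrentHashMap<>();

    public void add(String key, long value) {
        values.computeIfAbsent(key, k -> new LongAdder()).add(value);
    }

    // Hook this from the delete path of first-level znodes so metric entries
    // are removed together with the namespace they describe.
    public void remove(String key) {
        values.remove(key);
    }

    public int size() {
        return values.size();
    }

    public static void main(String[] args) {
        RemovableSummarySet writes = new RemovableSummarySet();
        writes.add("task-1", 1);
        writes.remove("task-1");             // znode "/task-1" deleted -> metric entry gone
        System.out.println(writes.size());   // 0
    }
}
```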
Yes, I understand now. When you frequently create and delete numerous znodes at the first level, the write_per_namespace and read_per_namespace metrics will indeed expand and consume excessive memory.
A possible workaround is to add an additional layer under the root directory of the tree (e.g., /taskName/taskId-[1,n]).
While deleting the SummarySet key might work, it seems somewhat counterintuitive.
Here are a few alternative suggestions (just my personal opinion):
- Limit the size of a SummarySet. For example, when the number of keys exceeds 1000, prevent any further additions.
- Implementing a blacklist is a good idea, as we have other metrics that are costly and can be resource-intensive to query at times.
  - Place the blacklist in the zoo.cfg file within the metrics section, instead of using a SystemProperty.
  - Ensure the blacklist supports all metric types, not just SummarySet.
## Metrics Providers
metricsProvider.exportJvmInfo=true
# Support multiple and prefix matching
metricsProvider.blackList=prep_processor*,sync_processor*
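As a rough illustration of how the suggested blackList property could work, here is a hedged Java sketch that parses a comma-separated list with trailing-'*' prefix matching; the property name comes from the snippet above, but the parsing and matching rules are only assumptions, not an existing ZooKeeper API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MetricBlackList {
    private final List<String> exact;     // entries without a trailing '*'
    private final List<String> prefixes;  // entries like "prep_processor*", stored without the '*'

    public MetricBlackList(String property) {
        List<String> entries = Arrays.stream(property.split(","))
                .map(String::trim)
                .collect(Collectors.toList());
        this.exact = entries.stream()
                .filter(e -> !e.endsWith("*"))
                .collect(Collectors.toList());
        this.prefixes = entries.stream()
                .filter(e -> e.endsWith("*"))
                .map(e -> e.substring(0, e.length() - 1))
                .collect(Collectors.toList());
    }

    // A metric is blacklisted if it matches an exact entry or starts with a '*' prefix entry.
    public boolean isBlacklisted(String metricName) {
        return exact.contains(metricName)
                || prefixes.stream().anyMatch(metricName::startsWith);
    }

    public static void main(String[] args) {
        MetricBlackList bl = new MetricBlackList("prep_processor*,sync_processor*");
        System.out.println(bl.isBlacklisted("prep_processor_queue_time")); // true
        System.out.println(bl.isBlacklisted("write_per_namespace"));       // false
    }
}
```

A metrics provider could consult such a matcher before registering or emitting a metric, which would cover all metric types rather than only SummarySet.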
I am the ZK cluster provider, so it is difficult to ask the business teams to change how they use it. 😂
The reason I don't want to use a blacklist is that every business team that uses the cluster this way would need to configure one. So I delete the SummarySet key instead, which keeps the fix transparent to developers.
I'm sorry, I don't understand why this is counterintuitive. @maoling