zookeeper icon indicating copy to clipboard operation
zookeeper copied to clipboard

feature:Configurable summarySetBlackList to set monitoring blacklist.

Open zichen-gan opened this issue 1 year ago • 6 comments

Issue:

  • The current ZooKeeper (ZK) cluster frequently experiences excessively long Full Garbage Collection (FGC) events (averaging once a week) that last over 10 seconds, leading to occasional, unexpected leader elections in the ZK cluster.
  • The cluster currently fails to collect mntr monitoring data.

Root Cause:

  • The ZK cluster uses the direct child nodes under the root directory (/) as Namespaces, At the Namespace level, the following have been implemented:
  • Monitoring of MIN, CNT, AVG, MAX, and SUM request counts, which can be obtained via mntr monitoring.
  • monitoring for Namespaces only accounts for additions and not removals, tasks are created directly under the root directory (/), with tasks older than seven days being cleaned daily.
  • This leads to the following phenomenon: The number of Namespaces stored in ZK nodes (157,376) is significantly greater than the actual number of Namespaces in the ZK cluster (5,856). image image

Solution: Provide a new version of the ZK server that supports configurable Namespace monitoring.

zichen-gan avatar May 16 '24 03:05 zichen-gan

I cannot understand the intention of this PR clearly. From the description, I guess

  • it wants to solve the metric issues which is inaccurate or too expensive? which metrics(write_per_namespace,read_per_namespace)?
  • Add a blacklist can completely solve this problem?

maoling avatar May 21 '24 13:05 maoling

it wants to solve metrics(write_per_namespace,read_per_namespace) which is too expensive.

  • / stores task IDs, which will be frequently added and deleted.metrics(write_per_namespace,read_per_namespace) only increase and do not decrease.
  • we do not require monitoring for write_per_namespace and read_per_namespace Hope to support write_per_namespace, read_per_namespace blacklist。 @maoling

zichen-gan avatar May 22 '24 02:05 zichen-gan

Perhaps it's a more universal and maintainable solution to remove the metric monitoring when the / path node is deleted. Regarding this PR, I will make some further modifications.

zichen-gan avatar May 24 '24 01:05 zichen-gan

https://github.com/apache/zookeeper/pull/2171 @maoling Added a new test MetricSetRemoveTest to help understand this bug, and I found that fixing it using the previous blacklist method was much easier. bug fix:At the time of ZK service delete root path, the audit and control was remove the NameSpace Metric.

zichen-gan avatar Jun 04 '24 07:06 zichen-gan

Yes, I understand now. When you frequently create and delete numerous znodes at the first level, the write_per_namespace and read_per_namespace metrics will indeed expand and consume excessive memory.

A possible workaround is to add an additional layer under the root directory of the tree (e.g., /taskName/taskId-[1,n]).

While deleting the SummarySet key might work, it seems somewhat counterintuitive.

Here are a few alternative suggestions (just my personal opinion):

  1. Limit the size of a SummarySet. For example, when the number of keys exceeds 1000, prevent any further additions.
  2. Implementing a blacklist is a good idea, as we have other metrics that are costly and can be resource-intensive to query at times. 2.1. Place the blacklist in the zoo.cfg file within the metrics section, instead of using a SystemProperty. 2.2. Ensure the blacklist supports all metric types, not just SummarySet.
## Metrics Providers
metricsProvider.exportJvmInfo=true
# Support multiple and prefix matching
metricsProvider.blackList=prep_processor*,sync_processor*

maoling avatar Jun 18 '24 14:06 maoling

I am the ZK cluster provider, so it is difficult for businesses to modify the way they use it.😂

The reason why i don't want to use blacklists is that each business that developer them incorrectly needs to configure a blacklist.So i deleting the SummarySet key to make ZK developer unconsciously.

I'm sorry, I don't understand why this is counterintuitive. @maoling

zichen-gan avatar Jun 19 '24 07:06 zichen-gan