ZOOKEEPER-2770 ZooKeeper slow operation log
ZooKeeper is a complex distributed application. There are many reasons why any given read or write operation may become slow: a software bug, a protocol problem, a hardware issue with the commit log(s), a network issue. If the problem is constant, it is trivial to determine the cause. However, in order to diagnose intermittent problems we often don't know where, or when, to begin looking. We need some sort of timestamped indication of the problem. Although ZooKeeper is not a datastore, it does persist data and can suffer intermittent performance degradation, so it should consider implementing a 'slow query' log, a feature common to services that persist information on behalf of clients which may be sensitive to latency while waiting for confirmation of successful persistence.
The patch sets a default threshold of 10 seconds per request. If warn.responseTime is configured as -1, slow-request logging is disabled.
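Conceptually, the check the patch adds amounts to something like the following sketch; the class name, property name, and logging are illustrative (based on the warn.responseTime setting mentioned above), not the actual patch code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of the slow-request threshold check.
public class SlowRequestCheck {
    private static final Logger LOG = LoggerFactory.getLogger(SlowRequestCheck.class);

    // Default threshold of 10 seconds; a configured value of -1 disables the warning.
    private static final long DEFAULT_WARN_THRESHOLD_MS = 10_000;
    private final long warnThresholdMs =
            Long.getLong("zookeeper.warn.responseTime", DEFAULT_WARN_THRESHOLD_MS);

    public void checkResponseTime(String requestDescription, long startTimeMs) {
        long elapsedMs = System.currentTimeMillis() - startTimeMs;
        if (warnThresholdMs > -1 && elapsedMs > warnThresholdMs) {
            LOG.warn("Request {} took {}ms, above the {}ms threshold",
                    requestDescription, elapsedMs, warnThresholdMs);
        }
    }
}
```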
General comment: please add more description to the pull request about the motivation for the change and how it impacts performance / improves debugging, etc. Or add that to the JIRA and post a digest on the pull request. That will help others who do not have a lot of knowledge of your use cases understand the motivation / impact of the change.
Here is a good example w.r.t. pull request description: https://github.com/apache/zookeeper/pull/306, for reference.
ZooKeeper is a complex distributed application. There are many reasons why any given read or write operation may become slow: a software bug, a protocol problem, a hardware issue with the commit log(s), a network issue. If the problem is constant, it is trivial to determine the cause. However, in order to diagnose intermittent problems we often don't know where, or when, to begin looking. We need some sort of timestamped indication of the problem. Although ZooKeeper is not a datastore, it does persist data and can suffer intermittent performance degradation, so it should consider implementing a 'slow query' log, a feature common to services that persist information on behalf of clients which may be sensitive to latency while waiting for confirmation of successful persistence.
@hanm FYI
So here is an alternative suggestion.
For each class of operation (GET, PUT, MULTIPUT, DELETE and so on), maintain a data structure (such as t-digest or FloatHistogram; see https://github.com/tdunning/t-digest) that keeps track of the distribution of latencies for that operation.
Log any transactions that are above a settable percentile latency for operations of that class. Default could be 99.99%.
Operations longer than 2x the 99.9%-ile latency should not be entered into the historical distribution.
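In code, that suggestion could look roughly like the sketch below, with one tracker instance per operation class. It assumes the com.tdunning:t-digest library; the createMergingDigest/add/quantile/size calls follow its public API as I understand it, and the rest (class name, warm-up threshold, logging) is illustrative:

```java
import com.tdunning.math.stats.TDigest;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// One instance per operation class (GET, PUT, MULTIPUT, DELETE, ...).
public class PerOpLatencyTracker {
    private static final Logger LOG = LoggerFactory.getLogger(PerOpLatencyTracker.class);

    private static final long MIN_SAMPLES = 1000;  // warm-up before percentiles are trusted
    private final TDigest digest = TDigest.createMergingDigest(100); // compression factor
    private final double warnQuantile;             // e.g. 0.9999 by default

    public PerOpLatencyTracker(double warnQuantile) {
        this.warnQuantile = warnQuantile;
    }

    public synchronized void record(String opType, double latencyMs) {
        if (digest.size() >= MIN_SAMPLES) {
            if (latencyMs > digest.quantile(warnQuantile)) {
                LOG.warn("{} took {} ms, above the {} quantile", opType, latencyMs, warnQuantile);
            }
            // Keep extreme outliers (beyond 2x the 99.9%-ile) out of the history
            // so a few pathological requests don't inflate the threshold.
            if (latencyMs > 2 * digest.quantile(0.999)) {
                return;
            }
        }
        digest.add(latencyMs);
    }
}
```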
From my understanding, the QuorumPeer class would be the candidate class to maintain this data structure. Let me know if there are alternatives.
In the meantime, I will look into the data-structure ideas you provided.
@karanmehta93 Thanks for updating the pull request with more context information. Do you mind putting the context in the pull request description?
Regarding test cases, I think you can just add a new test file under zookeeper/test.
I think we should consolidate the latency check in zks.serverStats().updateLatency. It's odd to have two (or, in the future, even more) types of latency checks scattered around, which creates fragmentation w.r.t. the definition of what request latency means. The existing latency measurement in ServerStats measures the time between a request's creation and its landing at the final request processor; the patch instead measures the end-to-end time of a request from start to finish of processing. I am fine with the end-to-end processing time, though I'd like to double-check with a few folks to make sure the regression risk and impact of this change are limited.
I think ServerStats is a good place to put the DS Ted recommended.
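For illustration, the consolidation being discussed could look roughly like this. The class, fields, and method signature are hypothetical and intentionally simplified, not ZooKeeper's actual ServerStats code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a single ServerStats-style hook that both updates the
// aggregate latency stats and applies the slow-request threshold, so "request
// latency" is defined and checked in exactly one place.
public class LatencyStatsSketch {
    private long totalLatencyMs;
    private long requestCount;
    private long minLatencyMs = Long.MAX_VALUE;
    private long maxLatencyMs;
    private final long warnThresholdMs;
    private final AtomicLong slowRequestCount = new AtomicLong();

    public LatencyStatsSketch(long warnThresholdMs) {
        this.warnThresholdMs = warnThresholdMs;
    }

    public synchronized void updateLatency(long requestCreateTimeMs, long nowMs) {
        long latencyMs = nowMs - requestCreateTimeMs;
        totalLatencyMs += latencyMs;
        requestCount++;
        minLatencyMs = Math.min(minLatencyMs, latencyMs);
        maxLatencyMs = Math.max(maxLatencyMs, latencyMs);
        // The same hook is where a percentile tracker (t-digest) could live later.
        if (warnThresholdMs > -1 && latencyMs > warnThresholdMs) {
            slowRequestCount.incrementAndGet();
        }
    }

    public long getSlowRequestCount() {
        return slowRequestCount.get();
    }
}
```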
I think it's a good idea to scope the JIRA so it's easier to get it reviewed and committed. What this patch is doing is a positive improvement to the operational aspects of ZK, so that can be the scope of this PR. On top of that, future improvements could be what Edward and Ted suggested (JMX, distribution of latencies / histogram, etc.). This work can be tracked by making these sub-tasks under the current JIRA.
@hanm @eribeiro
Thank you for your comments. I have added a rebased patch with a test in ZooKeeperServerMainTest class. Please review whenever convenient.
It's odd to have two (or, in the future, even more) types of latency checks scattered around, which creates fragmentation w.r.t. the definition of what request latency means.
We can decide based on opinions from other people and I can update the patch accordingly. I will create new JIRAs based on @tdunning's suggestions after this one gets approved. I am eager to take them up.
Documentation is still pending. Will add a patch for it soon. I am not sure if I have to modify the ZooKeeperAdmin.xml file or ZooKeeperAdmin.html file. Can you point me to any relevant stuff?
@karanmehta93 hi, mate.
You don't need to touch ZooKeeperAdmin.html. You change ZooKeeperAdmin.xml only, and Apache Forrest generates the HTML and PDF. Also, this XML change should be part of this patch, okay?
Cheers!
@eribeiro Thank you for your comments. Will post an updated patch soon.
A small concern regarding the ServerStats.updateLatency() method: should the updated patch keep both methods for now, with another JIRA filed to look into consolidating them?
And for the documentation, this property is being introduced in version 3.5.4, is that correct?
@hanm
And for the documentation, this property is being introduced in version 3.5.4, is that correct?
Yes, for this specific pull request targeting master. I think at some point after merging this you need to create another pull request targeting the stable branch (branch-3.4), where the version number should be 3.4.11 (hopefully, if this can get in before the next release).
@hanm
I think at some point after merging this you need to create another pull request targeting the stable branch (branch-3.4), where the version number should be 3.4.11 (hopefully, if this can get in before the next release)
I will create a new PR for that branch as well. I am hoping to get this out in the next release if possible.
@hanm @eribeiro @tdunning I have fixed the whitespace issues and the final patch is ready. Please review whenever convenient.
@hanm @eribeiro @tdunning @skamille Please review. Now that I have added rate limiting to logging, can we also turn this on by default?
There are -1s on findbugs and release audits, and test failures reported by Jenkins. Please investigate.
What is this pull request? 45 commits? 95 changed files? Lots of commits without associated issues?
(danger, danger)
@tdunning Sorry about this. I rebased my branch with the current master and pushed my patch along with it. Now it is showing up all those commits as my commits. Let me try to figure out a way to resolve this.
For some reason, Jenkins didn't build it automatically. I triggered a manual build with my latest commit SHA1 and the results are at https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/1123/
All the Findbugs warnings have been resolved, and the single failing test seems to be unrelated to my change (or possibly flaky, since it passes on my local machine).
This commit is a squashed commit based on the earlier inputs provided by everyone. Thank you!
As of now, the default value for requestWarnThresholdMs is 10 seconds, which is too high. Users can set it according to the average latency values they typically see in the server stats. The patch also rate-limits the logging, so if the value is set too low, it will print a message every 60 seconds showing the number of high-latency requests in the past minute. The motivation for this JIRA is specified in the description at https://issues.apache.org/jira/browse/ZOOKEEPER-2770.
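Conceptually, the rate limiting described above boils down to something like this sketch: individual slow requests only increment a counter, and at most one summary line is logged per minute. Class and field names are illustrative, not the actual patch code:

```java
import java.util.concurrent.atomic.AtomicLong;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of rate-limited slow-request logging.
public class RateLimitedSlowRequestLogger {
    private static final Logger LOG =
            LoggerFactory.getLogger(RateLimitedSlowRequestLogger.class);
    private static final long LOG_INTERVAL_MS = 60_000;

    private final long requestWarnThresholdMs;
    private final AtomicLong slowCount = new AtomicLong();
    private volatile long lastLogTimeMs = System.currentTimeMillis();

    public RateLimitedSlowRequestLogger(long requestWarnThresholdMs) {
        this.requestWarnThresholdMs = requestWarnThresholdMs;
    }

    public void onRequestCompleted(long latencyMs) {
        if (requestWarnThresholdMs == -1 || latencyMs <= requestWarnThresholdMs) {
            return;
        }
        long count = slowCount.incrementAndGet();
        long now = System.currentTimeMillis();
        if (now - lastLogTimeMs >= LOG_INTERVAL_MS) {
            synchronized (this) {
                if (now - lastLogTimeMs >= LOG_INTERVAL_MS) {
                    LOG.warn("{} requests exceeded {}ms in the last minute",
                            count, requestWarnThresholdMs);
                    slowCount.set(0);
                    lastLogTimeMs = now;
                }
            }
        }
    }
}
```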
If you can review this PR, it would be great. @hanm @tdunning @eribeiro @skamille
If we can get this JIRA in, the future plan is to work on @tdunning's t-digest and use it to identify high-latency requests based on request type. Thanks!
Ping @hanm @tdunning @eribeiro @skamille for a review if you got some time. Thanks!
I appreciate the work that you are trying to do here, but this still feels like an incomplete approach to a problem that would be worth investing in further. It seems like we all agree that it would be good to have clearer metrics about the performance behavior of the internals of the system. This implements some basic infrastructure for a metric recording framework, coupled to the recording of one metric. Can we at least make it so that the process of "push a metric to a buffer, have a thread that wakes up periodically and flushes information out of that buffer" is usable by multiple parts of the system, instead of coupling it to the one metric of request time? I will say that if others think this is valuable as-is, I am not going to block the change.
Can we at least make it so that the process of "push a metric to a buffer, have a thread that wakes up periodically and flushes information out of that buffer" is usable by multiple parts of the system, instead of coupling it to the one metric of request time?
I am ready to put in the effort to make this framework more generic, if you feel it's worth the time. I would also like to hear others' opinions on this. The inspiration for this JIRA came from HBase, where it has been used to log slow requests at a basic level. I feel that it can be useful to keep track of latency spikes seen in production. @apurtell can suggest use-cases.
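A minimal sketch of the more generic shape being discussed, with hypothetical names throughout: any part of the server pushes named counters into a shared buffer, and a single background thread periodically flushes and resets them, rather than coupling the buffer-and-flush machinery to request time alone.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of a generic "push to a buffer, flush periodically" metric sink.
public class MetricBuffer {
    private static final Logger LOG = LoggerFactory.getLogger(MetricBuffer.class);

    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public MetricBuffer(long flushIntervalSeconds) {
        flusher.scheduleAtFixedRate(this::flush, flushIntervalSeconds,
                flushIntervalSeconds, TimeUnit.SECONDS);
    }

    // e.g. increment("slow.request") from the request pipeline,
    // or increment("fsync.slow") from the transaction log.
    public void increment(String metricName) {
        counters.computeIfAbsent(metricName, k -> new LongAdder()).increment();
    }

    private void flush() {
        for (Map.Entry<String, LongAdder> e : counters.entrySet()) {
            long value = e.getValue().sumThenReset();
            if (value > 0) {
                LOG.info("{} = {} in the last interval", e.getKey(), value);
            }
        }
    }

    public void shutdown() {
        flusher.shutdownNow();
    }
}
```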
Hello everyone, I appreciate your efforts in reviewing this patch. @hanm @tdunning @eribeiro @skamille Is there any possibility that the patch will get merged (with minor changes if required), or shall we mark this JIRA as 'never' and close this PR? Thanks!
reasons why any given read or write operation may become slow: a software bug, a protocol problem, a hardware issue with the commit log(s), a network issue. If the problem is constant, it is trivial to determine the cause. However, in order to diagnose intermittent problems we often don't know where, or when, to begin looking. We need some sort of timestamped indication of the problem
IMO, a slow query log is useful for systems where response times can vary by type of query (long scans, improper indexes, poor query plans, etc.). For the types of issues mentioned, we can expect all operations to be impacted and to fill up the log.
If the intent is to get a timestamped indication of the problem, a metric reporting mechanism would be more useful. If we had rolling interval counts and percentiles (per command?) published either to stat or JMX, that would give good visibility.
Looks like the branch has merge conflicts. Do you mind updating your patch with the latest master? Then I'll review. @karanmehta93
@hanm Thank you! I have rebased the patch. The Jenkins build has been failing for a while and my patch doesn't introduce any new failures. Thanks once again!
Agreed, @zodvik. The scope of this JIRA is limited to a generic slow query log that identifies outliers, if any. For that reason, I am printing the actual request to the log as well. A definite improvement would be to separate by request type, as others have also suggested. If this gets in, I would like to work on the follow-up of using t-digest (https://github.com/tdunning/t-digest) to give detailed insight and avoid the problem of setting a fixed threshold.
@hanm Got a few spare cycles to address my comments?
@karanmehta93 Are you planning to continue working on this feature?
At this point, I won't be able to actively contribute to it, @anmolnar. However, if there is sufficient need from the community, I am happy to take it up. It's been almost 3 years since this originally started and we are still living without it, so I believe there isn't a genuine need for this. Plus, the feature might need a lot more enhancement before it reaches a usable stage.