
As a Caliper user, I would like to access worker metrics, so that I can investigate SDK bottlenecks

Open nklincoln opened this issue 5 years ago • 1 comment

With the introduction of tx-observers (name pending), we are in the perfect position to enable metric collection on the workers themselves.

There are a few implementation routes here, based on the available docs for:

  • Appmetrics (https://github.com/RuntimeTools/appmetrics#readme)
  • AppMetrics dash (https://github.com/RuntimeTools/appmetrics-dash)

The minimal implementation would be "simply" adding the appmetrics require and exposing a port so that the dash can collect and render real-time metrics. The use case here would be "simple debug".
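For illustration, a minimal wiring sketch along these lines might look as follows. This assumes appmetrics-dash is installed as a dependency; the port number and title are illustrative choices, not Caliper defaults:

```javascript
// Hypothetical minimal wiring in a worker's entry point.
// appmetrics-dash starts collection and serves a real-time dashboard
// on the given port, which can then be opened in a browser for debugging.
const dash = require('appmetrics-dash');

dash.monitor({
  port: 3001,                        // illustrative port, not a Caliper default
  title: 'Caliper worker metrics',   // illustrative dashboard title
});
```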

A more advanced implementation would use the appmetrics libraries in conjunction with Prometheus-style data buckets to collect the desired statistics and have them scraped. The significant advantage here is that metrics collected by Prometheus are available for report inclusion.

This does raise a naming discussion: what I have described is not a transaction monitor but a worker monitor. So perhaps we need a new abstract class to encapsulate the difference (with a separate lifecycle).
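To make the suggested abstraction concrete, a worker-side monitor base class with its own lifecycle might be sketched as below. The class and method names are hypothetical, not existing Caliper APIs:

```javascript
// Hypothetical base class: a worker monitor with a lifecycle that is
// independent of the round-based TX monitor lifecycle.
class WorkerMonitor {
  // Called once when the worker process starts, before any round runs.
  async activate() {
    throw new Error('activate() must be implemented by a subclass');
  }

  // Called once when the worker process shuts down.
  async deactivate() {
    throw new Error('deactivate() must be implemented by a subclass');
  }
}

// Illustrative subclass: an appmetrics-backed worker monitor would start
// collection (and expose the dash port) on activate, and stop on deactivate.
class AppmetricsWorkerMonitor extends WorkerMonitor {
  async activate() {
    this.active = true; // e.g. start appmetrics / open the metrics port here
  }

  async deactivate() {
    this.active = false; // e.g. stop collection / close the port here
  }
}
```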

nklincoln avatar Oct 07 '20 17:10 nklincoln

I would throw in the following ideas:

  1. Integrating Appmetrics as a TX monitor is feasible I think. The TX monitors are first activated before the first round, so before the heavy lifting begins. We can expose the metrics endpoint during the first activation, and we only miss the register/assign/init phases, which should be lightweight anyway.
  2. The V8 engine for Node.js has built-in profiling capability that doesn't require further instrumentation. Since Caliper isn't meant to run forever, we can do profiled runs of Caliper workers and inspect the observed data with some other tool. This won't provide real-time metrics, but for development-time optimization, this looks like the standard approach. There's great IDE support for streamlined profiling (especially combined with the next idea).
  3. Workers heavily depend on the manager for orchestration, which might complicate setting up profiling scenarios. We could fake this dependency with a special worker-side messenger that acts as a dummy manager, going through the necessary phases, but in-process. It's kind of like a dev mode for workers (or a manager mock).
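The third idea could be sketched as below: an in-process messenger that walks a worker through its phases without real orchestration, so a single worker can be run standalone (e.g. under `node --prof`). The phase names and the worker interface are illustrative, not Caliper's actual messaging protocol:

```javascript
// Hypothetical "dev mode" messenger that mimics the manager in-process,
// driving a worker through its lifecycle phases without a real manager.
class MockManagerMessenger {
  constructor(worker) {
    this.worker = worker;
    this.log = []; // record of phases driven, useful when debugging
  }

  // Walk the worker through a typical lifecycle so it can be profiled standalone.
  async runPhases() {
    for (const phase of ['register', 'assign', 'init', 'test', 'end']) {
      this.log.push(phase);
      await this.worker.onPhase(phase);
    }
  }
}

// Minimal stand-in worker for demonstration purposes.
const demoWorker = {
  seen: [],
  async onPhase(phase) {
    this.seen.push(phase);
  },
};
```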

As for the naming convention: you are right, the distinction between resource and TX monitors is really a distinction between manager and worker monitors. Manager monitors expose data to the reporter(s), while worker monitors do whatever they want with the data (expose it to the manager, to Prometheus, or to any other stream). Both sides have an implicit/built-in monitor that reports performance metrics.

aklenik avatar Oct 08 '20 13:10 aklenik