feat: add ability to publish metrics as ceramic model instance documents (rebased)
Description
This PR enables publishing metrics to the ceramic network as model instance documents. It uses the same count and observe semantics as ServiceMetrics, with a configurable interval for publishing.
Key Design notes
-
avoids circular dependencies. the observability model-metrics package accepts a ceramic api instance at run time, so never imports any js-ceramic common packages.
-
singleton pattern. because of the instance import and that its instantiated as a singleton, you can import model-metrics from a code path where the ceramic instance is not accessible, and still use it; also no need to pass it around, just import
-
lightweight functions. the
countobserveandrecordErrorfunctions are very lightweight, and only update variables in memory. At the configured interval time, a summary will be published as a model metric.
These designs are in the model-metrics package in the observability repo which is a dependency of this PR. The code is separate from js-ceramic but the npm package is published under @ceramicnetwork/model-metrics
How Has This Been Tested?
Describe the tests that you ran to verify your changes. Provide instructions for reproduction.
- [x ] Tested manually with a local node on dev-unstable, verified that metrics are correctly produced.
To reproduce this test, add the following into daemon.config.json and run your node.
"metrics": {
"metrics-publisher-enabled": true
}
- [ x] Tested Manually with a harness at https://github.com/3box/runbooks-and-oncall/tree/main/MetricsDemo (see readme for instructions)
PR checklist
Before submitting this PR, please make sure:
- [x] I have tagged the relevant reviewers and interested parties
- [ ] I have updated the READMEs of affected packages
- [ ] I have made corresponding changes to the documentation
References:
Please list relevant documentation (e.g. tech specs, articles, related work etc.) relevant to this change, and note if the documentation has been updated.
For easy reference, here is a sample of the metrics currently produced:
{
"streamId": "kjzl6kcym7w8y6ghqfmaqe9m94obsmoekktwdnblu1u7fvhcljvcvnmqhvd4a4m",
"state": {
"type": 3,
"content": {
"ts": "2024-03-22T15:02:19.384Z",
"ceramicNode": {
"id": "CAESIHWY0nyXqZTZO2as2Dli/USgf+TRcYTuljaiVc1rifhV",
"name": "0.0.0.0",
"PeerID": "CAESIHWY0nyXqZTZO2as2Dli/USgf+TRcYTuljaiVc1rifhV",
"IPAddress": "0.0.0.0",
"ipfsVersion": "0.18.1",
"nodeAuthDID": "did:key:z6Mkqot5TB3v7kd7UjezicHrnHvUE7ikN8b763xpE3U9gR6d",
"ceramicVersion": "5.5.0-rc.0"
},
"recentErrors": 0,
"sampleRecentErrors": [],
"totalIndexedModels": 0,
"totalPinnedStreams": 0,
"maxAnchorRequestAgeMS": 706236,
"currentPendingRequests": 0,
"meanAnchorRequestAgeMS": 648052,
"recentCompletedRequests": 3
},
"metadata": {
"controllers": [
"did:key:z6Mkqot5TB3v7kd7UjezicHrnHvUE7ikN8b763xpE3U9gR6d"
],
"model": "kjzl6hvfrbw6c7kjl8p225jzy0mjxqo5xdcdt7eqsjv0519arqfda598zfgpfx9",
"unique": "bwOxAanX3RqNTWMc"
},
"signature": 2,
"anchorStatus": "PENDING",
"log": [
{
"cid": "bagcqcerah4z6t76dsy44xohfhnz2465dth3sgpvsbwxgsarhhc6mbfqouvda",
"type": 0
}
],
"doctype": "MID"
}
}
The code that produces them is in https://github.com/ceramicnetwork/observability/tree/master/packages/model-metrics/src
which is included as @ceramicnetwork/model-metrics
Note that it uses a singleton pattern to avoid passing around a pointer to the ceramic api and metrics instance.
Note that 'totalPinnedStreams' is part of the schema, but is not currently being recorded as we do not have a way to get this out of leveldb afaik.
'0' values are currently also 'no data' values, we could change that to '-1' or something clearer
IPAddress just has the value of 0.0.0.0 currently, which is likely what was given as the arg for the process to bind to, but isn't actually the node's real public IP address. Can we get the node's actual public IP? This may need to be changed to a list since one node can have multiple IPs if it's listening to multiple networks.
Alternatively, we may just want to remove this field entirely. This actually may fall under the category of information that devs wont want to be published publicly
published
I agree collecting IP is generally considered sensitive information. We can use PeerId to uniquely identify how many nodes are on the network etc without collecting which IP they come from.
'0' values are currently also 'no data' values, we could change that to '-1' or something clearer
How about null to mean "no data"? I do think we will probably have semantically meaningful 0 values, so would be good to use something else to mean no data.
published
I agree collecting IP is generally considered sensitive information. We can use PeerId to uniquely identify how many nodes are on the network etc without collecting which IP they come from.
agreed, tho having a public endpoint for the node if there is one could be super valuable to have - this could be optional public endpoint?
Overall I like the design. Its simple and straightforward. I have some questions below.
Also as an alternative approach if we wanted to make this more generic for any kinds of metrics instead of Ceramic specific metrics.
OpenTelemetry defines a specific structure for metric data. See their protobuf definitions. We don't need all the complexity represented in the protobuf but the general design would be to define some models like:
interface MetricEvent @createModel(description: "Required metric event content interface") { ts: DateTime! metrics: [Metric] } type Metric { attributes: [Attribute] name: String value: Float } type Attribute { key: String value: String }The idea here is you can have an arbitrary set of metrics without ever having to change the schema. The downside is its not guaranteed at the schema level which metrics are present. However generally this works well in practice as conventions arise around which metrics are useful/valuable.
To be clear, I think this PR can/should merge as is (after address all feedback) but we should consider this design when we want to generalize this workflow.
ok 100%, I would love to do this as version 2
I did have a generic blob one called GenericMetricEvent here https://github.com/ceramicnetwork/observability/blob/master/packages/model-metrics/src/composites/00-simpleNodeMetrics.graphql#L18 but went with an explicit schema for this round
Captured in https://linear.app/3boxlabs/issue/WS3-568/generalize-metrics-to-protobuf-like-schema
hey @stbrody i think this and the observability change address the two things we agreed to change for this iteration; one question came up i referred to it here - https://github.com/ceramicnetwork/observability/pull/25/files - about whether we should be using the same model instance id across networks, or not.