Support sinking events to Pyroscope
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
I'm trying to get Pyroscope data flowing through Vector (specifically, a vector-to-vector pair: a Vector agent in a source cluster sending to an "aggregator" in the destination cluster). Pyroscope supports 2 methods of ingest from its language-specific SDKs: an HTTP POST API which supports multipart/form-data uploads, and a gRPC Push service.
I have a client of each type—
- a Grafana Alloy agent collecting eBPF-based profiles which are sent using the gRPC Push service, and
- a Node.js app running pyroscope-nodejs which uses the multipart/form-data HTTP upload method.
In this case I don't need Vector to have an internal data model for profiles. I'd be happy if they were treated as Log events, with the contents being mostly an opaque binary payload (a gzipped pprof message, which is itself a protobuf) and a set of label names/values with a certain structure.
Attempted Solutions
With some creative config, I could get both a gRPC source and an HTTP multipart upload source working. I was unable to get either a gRPC sink or an HTTP upload sink working, though, and that is where I'm most blocked.
gRPC source
Somewhat surprisingly, I was able to get Vector to receive the gRPC Push messages by using a type: http source, like below.
Using a proto desc file from running this in the pyroscope repo: protoc -Iapi -o pyroscope_push_v1.desc api/push/v1/push.proto --include_imports
sources:
  pyroscope_grpc_push_raw:
    type: http_server
    address: '0.0.0.0:4050'
    framing:
      method: bytes
    decoding:
      codec: protobuf
      protobuf:
        desc_file: /etc/vector-proto/pyroscope_push_v1.desc
        message_type: "push.v1.PushRequest"
    strict_path: false
HTTP multipart source
I struggled to get this working, but I was eventually able to with some hacks. I could use a separate type: http source (below) which captures the Content-Type header containing the multipart boundary token (i.e. Content-Type: multipart/form-data; boundary=---------abcd1234). I could then write some hacky VRL which does some crude multipart upload parsing and pulls out the binary profile payload (a gzipped protobuf). The main friction is that some of the string manipulation functions in VRL, namely split(), force a lossy UTF-8 conversion under the hood, which corrupts the gzip payload. The workaround makes the multipart upload parser even cruder, but it's at least possible by using find() and slice() instead of split().
sources:
  pyroscope_ingest_raw:
    type: http_server
    address: '0.0.0.0:4051'
    framing:
      method: bytes
    decoding:
      codec: bytes
    strict_path: false
    # capture known query params and headers used by the pyroscope sdk
    query_parameters:
      - from
      - until
      - name
      - spyName
      - sampleRate
      - format
      - units
      - aggregationType
    headers:
      - Content-Type
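To illustrate the find()/slice() approach outside of VRL, here is a minimal Python sketch of the same idea: pull the boundary out of the captured Content-Type header, then locate the first part's payload using only byte searches and slicing, so the gzip bytes are never passed through a string conversion. The form field name, filename, and sample payload are hypothetical, and this is not the VRL I actually run; it only handles the first part and assumes CRLF line endings.

```python
import gzip


def boundary_from_content_type(content_type: str) -> bytes:
    """Extract the boundary token from a multipart/form-data Content-Type header."""
    for param in content_type.split(";")[1:]:
        key, _, value = param.strip().partition("=")
        if key.lower() == "boundary":
            return value.strip('"').encode("ascii")
    raise ValueError("no boundary parameter in Content-Type")


def extract_first_part(body: bytes, boundary: bytes) -> bytes:
    """Return the raw payload of the first part using only find() and slicing."""
    delimiter = b"--" + boundary
    start = body.find(delimiter)
    if start < 0:
        raise ValueError("boundary not found")
    # Part headers end at the first blank line after the delimiter.
    headers_end = body.find(b"\r\n\r\n", start)
    if headers_end < 0:
        raise ValueError("part headers not terminated")
    payload_start = headers_end + 4
    # The payload runs up to the CRLF that precedes the next delimiter.
    payload_end = body.find(b"\r\n" + delimiter, payload_start)
    if payload_end < 0:
        raise ValueError("closing boundary not found")
    return body[payload_start:payload_end]


# Hypothetical upload: a gzipped stand-in for a pprof profile in one form field.
profile = gzip.compress(b"fake pprof bytes")
content_type = "multipart/form-data; boundary=---------abcd1234"
boundary = boundary_from_content_type(content_type)
body = (
    b"--" + boundary + b"\r\n"
    b'Content-Disposition: form-data; name="profile"; filename="profile.pprof"\r\n'
    b"Content-Type: application/octet-stream\r\n\r\n"
    + profile
    + b"\r\n--" + boundary + b"--\r\n"
)

payload = extract_first_part(body, boundary)
roundtrip_ok = gzip.decompress(payload) == b"fake pprof bytes"

# A lossy UTF-8 round trip replaces invalid bytes (the gzip magic byte 0x8b is
# one of them), which is why string-based splitting corrupts the payload.
lossy = profile.decode("utf-8", errors="replace").encode("utf-8")
lossy_differs = lossy != profile
```

The last two lines show the failure mode: once the bytes go through a lossy UTF-8 conversion, the result no longer matches the original gzip stream.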
gRPC sink
I couldn't get a gRPC sink working at all. I can successfully re-encode a gRPC Push message using encode_proto(), but a type: http sink uses HTTP/1.1 and the Pyroscope gRPC server doesn't accept it.
HTTP upload sink
The Pyroscope HTTP Ingest API will accept either a multipart/form-data upload, like the nodejs SDK sends, or just a simple POST with the pprof profile as the request body. However, in both cases, it expects metadata including service name and labels in the form of URL query params, which means those have to be dynamically generated per Log event from Vector's perspective. Vector currently doesn't support dynamic values in the uri: field of the HTTP sink, and there's no way to specify query params separately (like headers:).
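As a rough sketch of what "dynamically generated per Log event" means here, the Python below builds a request URL from one event's fields. The query parameter names are the ones captured by the source config above; the /ingest path, the service{label=value} form of the name parameter, the base URL, and the event field names are assumptions made for illustration, not something confirmed in this issue.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def pyroscope_name(service: str, labels: dict[str, str]) -> str:
    """Format a service name plus labels as service{k=v,...} (assumed convention)."""
    if not labels:
        return service
    pairs = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{service}{{{pairs}}}"


def ingest_url(base: str, event: dict) -> str:
    """Build a per-event ingest URL; the /ingest path is an assumption."""
    params = {
        "name": pyroscope_name(event["service"], event.get("labels", {})),
        "from": event["from"],
        "until": event["until"],
    }
    # Pass through the optional parameters only when the event carries them.
    for key in ("spyName", "sampleRate", "format", "units", "aggregationType"):
        if key in event:
            params[key] = event[key]
    return f"{base.rstrip('/')}/ingest?{urlencode(params)}"


# Hypothetical event, shaped like a Vector log event after VRL processing.
event = {
    "service": "myapp",
    "labels": {"region": "us", "env": "prod"},
    "from": 1700000000,
    "until": 1700000010,
    "format": "pprof",
}
url = ingest_url("http://pyroscope.example:4040/", event)
parsed = parse_qs(urlsplit(url).query)
```

Every value in the query string comes from the event, which is why a static uri: in the HTTP sink is not enough: two events with different labels need two different URLs.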
Proposal
On the source side—
- a decode_multipart_form_data() VRL function would be hugely helpful. It's not a hard blocker, as I was able to roll my own crude parser in VRL, but I'd love to be able to delete that code and use something built into VRL.
- a specific type: pyroscope_grpc source might have been nice, but not a huge deal as a type: http source with a protobuf encoding seems to work
On the sink side—
- Adding a type: grpc sink would be ideal. If Vector had a generic gRPC sink, I could use that for both source types and just restructure the payloads to fit the schema.
- If a gRPC sink can't be added or would take longer, supporting dynamic uri: field and/or query_parameters: field with dynamic values in the HTTP sink would suffice.
References
No response
Version
0.40.0
Thanks for this detailed feature request @ryanartecona !
Given you say that you'd be happy if Vector treated the incoming data as opaque, I'm wondering what you plan to use Vector to do with the data? Are you intending to just "proxy" the requests?
On the source side—
- a decode_multipart_form_data() VRL function would be hugely helpful. It's not a hard blocker, as I was able to roll my own crude parser in VRL, but I'd love to be able to delete that code and use something built into VRL.
This seems like a reasonable addition. I could also see enhancing the http_server source to be able to handle multi-part data as a first-class concept (though I'm not sure exactly what this would look like).
- a specific type: pyroscope_grpc source might have been nice, but not a huge deal as a type: http source with a protobuf encoding seems to work
Agreed. I could see it being useful to add for discoverability, but it seemingly could be a simple wrapper around the http_server source.
On the sink side—
- Adding a type: grpc sink would be ideal. If Vector had a generic gRPC sink, I could use that for both source types and just restructure the payloads to fit the schema.
Agreed. I'm not sure if it is possible to create a dynamic gRPC sink in Rust though. The existing sinks that use gRPC use code generation. It seems like something should be doable using prost_reflect though.
- If a gRPC sink can't be added or would take longer, supporting dynamic uri: field and/or query_parameters: field with dynamic values in the HTTP sink would suffice.
Agreed, these would be useful in their own right. Related issues:
- https://github.com/vectordotdev/vector/issues/201
- https://github.com/vectordotdev/vector/issues/6759
Given you say that you'd be happy if Vector treated the incoming data as opaque, I'm wondering what you plan to use Vector to do with the data? Are you intending to just "proxy" the requests?
Mostly yes. We also have Vector doing some extra things like tag insertion which are convenient to also do in VRL for these profiles.
I could also see enhancing the http_server source to be able to handle multi-part data as a first-class concept (though I'm not sure exactly what this would look like).
Even better! I like it.
Agreed. I'm not sure if it is possible to create a dynamic gRPC sink in Rust though. The existing sinks that use gRPC use code generation. It seems like something should be doable using prost_reflect though.
Ohh, that's unfortunate. I was hoping it would be an easier addition from existing pieces, since I knew the vector source/sink components existed, but I forgot about the gRPC codegen part.
Thanks for linking those other issues. I had seen #201 but not #6759. Upvoted.
Should I file other issues for any of those specific pieces?
Should I file other issues for any of those specific pieces?
I think it'd be reasonable to open separate issues for:
- multipart handling in the http_server source
- a generic grpc sink
This would be great. I'm referencing this issue I created back in January about how I was struggling to get Vector to decode the data that Pyroscope (the .NET profiling agent) was sending via HTTP.
I put that work aside for a while and have now taken it up again. I didn't get as deep into VRL manipulation or hacks as @ryanartecona did, but I would like to join the discussion and the work needed to make this happen.
https://github.com/vectordotdev/vector/issues/19737