Support sinking events to Pyroscope

Open ryanartecona opened this issue 1 year ago • 3 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

I'm trying to get Pyroscope data flowing through Vector (well, a vector-to-vector pair of a Vector agent in a source cluster to an "aggregator" in the destination cluster). Pyroscope supports 2 methods of ingest from its language-specific SDKs, an HTTP POST API which supports multipart/form-data uploads, and a gRPC Push service.

I have a client of each type—

  1. a Grafana Alloy agent collecting ebpf-based profiles which are sent using the gRPC Push service, and
  2. a node.js app running pyroscope-nodejs which uses the multipart/form-data HTTP upload method.

In this case I don't need Vector to have an internal data model for profiles. I'd be happy if they were treated as Log events, with the contents being mostly an opaque binary payload (a gzipped pprof message, which is itself a protobuf) and a set of label names/values with a certain structure.

Attempted Solutions

With some creative config, I could get both a gRPC source and an HTTP multipart upload source working. I was unable to get either a gRPC sink or an HTTP upload sink working, though, which is what I'm blocked the hardest on.

gRPC source

Somewhat surprisingly, I was able to get Vector to receive the gRPC Push messages by using a type: http_server source, like below.

Details

Using a proto desc file from running this in the pyroscope repo: protoc -Iapi -o pyroscope_push_v1.desc api/push/v1/push.proto --include_imports

  sources:
    pyroscope_grpc_push_raw:
      type: http_server
      address: '0.0.0.0:4050'
      framing:
        method: bytes
      decoding:
        codec: protobuf
        protobuf:
          desc_file: /etc/vector-proto/pyroscope_push_v1.desc
          message_type: "push.v1.PushRequest"
      strict_path: false

HTTP multipart source

I struggled to get this working, but I was eventually able to with some hacks. I could use a separate type: http_server source (below) which captures the Content-Type header containing the multipart boundary token (e.g. Content-Type: multipart/form-data; boundary=---------abcd1234). I could then write some hacky VRL which does some crude multipart upload parsing and pulls out the binary profile payload (a gzipped protobuf). The main friction is that some of the string manipulation functions in VRL, namely split(), force a lossy UTF-8 conversion under the hood, which corrupts the gzip payload. The workaround makes the multipart upload parser even cruder, but it's at least possible by using find() and slice() instead of split().

Details

  sources:
    pyroscope_ingest_raw:
      type: http_server
      address: '0.0.0.0:4051'
      framing:
        method: bytes
      decoding:
        codec: bytes
      strict_path: false
      # capture known query params and headers used by the pyroscope sdk
      query_parameters:
        - from
        - until
        - name
        - spyName
        - sampleRate
        - format
        - units
        - aggregationType
      headers:
        - Content-Type
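
To illustrate the find-and-slice approach described above, here is a minimal sketch in Python rather than VRL. It is not the VRL from this issue: the function name `extract_first_part`, the form field name `profile`, and the sample body are all made up for illustration. It only shows the idea of locating the boundary and slicing raw bytes, and why a lossy UTF-8 round trip damages a gzip payload.

```python
import gzip


def extract_first_part(body: bytes, content_type: str) -> bytes:
    """Return the raw payload of the first multipart part, touching bytes only."""
    # Pull the boundary token out of the captured Content-Type header.
    marker = "boundary="
    idx = content_type.find(marker)
    if idx < 0:
        raise ValueError("no boundary in Content-Type header")
    boundary = content_type[idx + len(marker):].split(";")[0].strip().strip('"')
    delimiter = b"--" + boundary.encode("ascii")

    # Find the first delimiter, then the blank line that ends the part headers.
    start = body.find(delimiter)
    if start < 0:
        raise ValueError("boundary not found in body")
    headers_end = body.find(b"\r\n\r\n", start)
    if headers_end < 0:
        raise ValueError("part headers are not terminated")
    payload_start = headers_end + 4

    # The payload runs up to the CRLF that precedes the next delimiter.
    payload_end = body.find(b"\r\n" + delimiter, payload_start)
    if payload_end < 0:
        raise ValueError("closing boundary not found")
    return body[payload_start:payload_end]


# Hypothetical request, shaped like a multipart upload with one binary part.
profile = gzip.compress(b"fake pprof bytes")
content_type = "multipart/form-data; boundary=---------abcd1234"
body = (
    b"-----------abcd1234\r\n"
    b'Content-Disposition: form-data; name="profile"; filename="profile.pprof"\r\n'
    b"Content-Type: application/octet-stream\r\n\r\n"
    + profile
    + b"\r\n-----------abcd1234--\r\n"
)

extracted = extract_first_part(body, content_type)

# A lossy UTF-8 round trip replaces invalid bytes (the gzip magic 0x8b among
# them) with U+FFFD, so the result no longer matches the original payload.
lossy = profile.decode("utf-8", errors="replace").encode("utf-8")
```

The sketch handles only the first part and assumes CRLF line endings; a built-in decoder would need to cover multiple parts and header parsing as well.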

gRPC sink

I couldn't get a gRPC sink working at all. I can successfully re-encode a gRPC Push message using encode_proto(), but a type: http sink uses HTTP/1.1 and the Pyroscope gRPC server doesn't accept it.
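
For context on why a re-encoded protobuf body alone may not be enough: standard gRPC sends each message over HTTP/2 with a 5-byte prefix (a 1-byte compression flag followed by a 4-byte big-endian length). The Python sketch below shows only that framing; the function names are hypothetical, and `message` stands in for whatever encode_proto() produces. It does not address the HTTP/2 transport, the application/grpc content type, or trailers, which a real sink would also need.

```python
import struct
from typing import Tuple


def grpc_frame(message: bytes, compressed: bool = False) -> bytes:
    """Wrap one serialized protobuf message in gRPC's length-prefixed framing."""
    # 1-byte compressed flag + 4-byte big-endian length, then the message bytes.
    return struct.pack(">BI", 1 if compressed else 0, len(message)) + message


def grpc_unframe(frame: bytes) -> Tuple[bool, bytes]:
    """Split a single gRPC frame back into (compressed flag, message bytes)."""
    if len(frame) < 5:
        raise ValueError("frame shorter than the 5-byte prefix")
    flag, length = struct.unpack(">BI", frame[:5])
    message = frame[5:5 + length]
    if len(message) != length:
        raise ValueError("frame is truncated")
    return bool(flag), message


# Placeholder bytes standing in for a serialized push.v1.PushRequest.
framed = grpc_frame(b"abc")
```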

HTTP upload sink

The Pyroscope HTTP Ingest API will accept either a multipart/form-data upload, like the nodejs SDK sends, or just a simple POST with the pprof profile as the request body. However, in both cases, it expects metadata including service name and labels in the form of URL query params, which means those have to be dynamically generated per Log event from Vector's perspective. Vector currently doesn't support dynamic values in the uri: field of the HTTP sink, and there's no way to specify query params separately (like headers:).
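
As a rough sketch of what "dynamically generated per Log event" means here, the Python below builds one ingest URL per event. Several things are assumptions for illustration and should be checked against the Pyroscope docs: the /ingest path, the name{key=value} convention for folding labels into the name parameter, the host pyroscope:4040, and the helper name `ingest_url`. The query parameter names are taken from the source config earlier in this issue.

```python
from urllib.parse import urlencode


def ingest_url(base_url: str, service: str, labels: dict, params: dict) -> str:
    """Build a per-event ingest URL with metadata carried in query params."""
    # Assumed convention: labels are folded into `name` as service{k=v,k2=v2}.
    label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    name = f"{service}{{{label_str}}}" if label_str else service
    # urlencode percent-escapes the braces, commas, and equals signs in `name`.
    query = {"name": name, **params}
    return f"{base_url}/ingest?{urlencode(query)}"


# Hypothetical event metadata; each event could carry different values.
url = ingest_url(
    "http://pyroscope:4040",
    "my-app",
    {"env": "prod"},
    {"from": 1723489680, "until": 1723489690, "format": "pprof"},
)
```

Because the name and time range differ from event to event, a static uri: value in the sink config cannot express this, which is the gap described above.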

Proposal

On the source side—

  • a decode_multipart_form_data() VRL function would be hugely helpful. It's not a hard blocker, as I was able to roll my own crude parser in VRL, but I'd love to be able to delete that code and use something built into VRL.
  • a specific type: pyroscope_grpc source might have been nice, but not a huge deal as a type: http source with a protobuf encoding seems to work

On the sink side—

  • Adding a type: grpc sink would be ideal. If Vector had a generic gRPC sink, I could use that for both source types and just restructure the payloads to fit the schema.
  • If a gRPC sink can't be added or would take longer, supporting dynamic uri: field and/or query_parameters: field with dynamic values in the HTTP sink would suffice.

References

No response

Version

0.40.0

ryanartecona avatar Aug 12 '24 19:08 ryanartecona

Thanks for this detailed feature request @ryanartecona !

Given you say that you'd be happy if Vector treated the incoming data as opaque, I'm wondering what you plan to use Vector to do with the data? Are you intending to just "proxy" the requests?

On the source side—

  • a decode_multipart_form_data() VRL function would be hugely helpful. It's not a hard blocker, as I was able to roll my own crude parser in VRL, but I'd love to be able to delete that code and use something built into VRL.

This seems like a reasonable addition. I could also see enhancing the http_server source to be able to handle multi-part data as a first-class concept (though I'm not sure exactly what this would look like).

  • a specific type: pyroscope_grpc source might have been nice, but not a huge deal as a type: http source with a protobuf encoding seems to work

Agreed. I could see it being useful to add for discoverability, but it seemingly could be a simple wrapper around the http_server source.

On the sink side—

  • Adding a type: grpc sink would be ideal. If Vector had a generic gRPC sink, I could use that for both source types and just restructure the payloads to fit the schema.

Agreed. I'm not sure if it is possible to create a dynamic gRPC sink in Rust though. The existing sinks that use gRPC use code generation. It seems like something should be doable using prost_reflect though.

  • If a gRPC sink can't be added or would take longer, supporting dynamic uri: field and/or query_parameters: field with dynamic values in the HTTP sink would suffice.

Agreed, these would be useful in their own right. Related issues:

  • https://github.com/vectordotdev/vector/issues/201
  • https://github.com/vectordotdev/vector/issues/6759

jszwedko avatar Aug 13 '24 16:08 jszwedko

Given you say that you'd be happy if Vector treated the incoming data as opaque, I'm wondering what you plan to use Vector to do with the data? Are you intending to just "proxy" the requests?

Mostly yes. We also have Vector doing some extra things like tag insertion which are convenient to also do in VRL for these profiles.

I could also see enhancing the http_server source to be able to handle multi-part data as a first-class concept (though I'm not sure exactly what this would look like).

Even better! I like it.

Agreed. I'm not sure if it is possible to create a dynamic gRPC sink in Rust though. The existing sinks that use gRPC use code generation. It seems like something should be doable using prost_reflect though.

Ohh, that's unfortunate. I was hoping it would be an easier addition from existing pieces, since I knew the vector source/sink components existed, but I forgot about the gRPC codegen part.

Thanks for linking those other issues. I had seen #201 but not #6759. Upvoted.

Should I file other issues for any of those specific pieces?

ryanartecona avatar Aug 14 '24 01:08 ryanartecona

Should I file other issues for any of those specific pieces?

I think it'd be reasonable to open separate issues for:

  • multipart handling in the http_server source
  • a generic grpc sink

jszwedko avatar Aug 14 '24 19:08 jszwedko

This would be great. I'm referencing the issue I created back in January (linked below) about how I was struggling to get Vector to decode the data that Pyroscope (the .NET profiling agent) was sending via HTTP.

I put that work aside for a while and have now taken it up again. I didn't get as deep into VRL manipulation or hacks as @ryanartecona did, but I would like to join the discussion and the work needed to make this happen.

https://github.com/vectordotdev/vector/issues/19737

jaraya-mycarrier avatar Dec 03 '24 21:12 jaraya-mycarrier