
Periodic Metric exporter unable to collect data points on Passenger-based servers

Open divyam-rai opened this issue 1 year ago • 13 comments

Description of the bug

Due to Passenger's smart spawning mechanism, which creates worker processes by forking an initial pre-loader process, the thread for the periodic metric exporter is only attached to the pre-loader process and does not share memory with the child processes that handle the actual incoming requests.

As a result, the periodic metric reader never has any data points: the metrics are stored in the memory of the child processes and are not visible to the thread attached to the pre-loader process.

The spawning mechanism, as well as its caveats, are described here: https://www.phusionpassenger.com/library/indepth/ruby/spawn_methods/#the-smart-spawning-method
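The effect is easy to reproduce outside Passenger. A minimal sketch (assuming a POSIX system where `fork` is available): a background thread started before the fork simply does not exist in the child.

```ruby
# Minimal reproduction of the underlying problem: running threads are not
# copied into a forked child, so a background exporter thread started in
# the pre-loader vanishes in every worker process.
reader_thread = Thread.new { loop { sleep 1 } } # stands in for the exporter thread

pid = fork do
  # Only the thread that called fork survives in the child.
  child_threads = Thread.list.size
  exit(child_threads == 1 ? 0 : 1)
end
Process.wait(pid)

puts $?.success?          # child saw exactly one live thread
puts reader_thread.alive? # the parent's copy of the thread keeps running
```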


To resolve this, the periodic metric reader needs to be initialised in each of the forked worker processes spawned by Passenger. We did this as follows, using the events that Passenger exposes, which allowed us to send metrics as required.

if defined?(PhusionPassenger)
  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    if forked
      metric_reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
        exporter: OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
      )

      OpenTelemetry.meter_provider.add_metric_reader(metric_reader)
    end
  end
end

divyam-rai avatar Jan 28 '25 12:01 divyam-rai

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

github-actions[bot] avatar Feb 28 '25 02:02 github-actions[bot]

I've run into the same issue while testing this on one of our applications that uses Puma with forked workers. It should apply to any system that uses a forking web server, since child processes created by fork do not inherit the parent's running threads.
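For context, one generic shape such a patch could take (a sketch with illustrative names, not the actual PR) is to remember the pid at start-up and restart the background thread the first time work happens after a fork:

```ruby
# Sketch of a fork-aware background worker: record the pid when the thread
# starts and, when a later call notices the pid has changed, start a fresh
# thread, because fork did not carry the old one into the child.
class ForkAwarePeriodicWorker
  def initialize(interval: 60, &task)
    @interval = interval
    @task = task
    start
  end

  def tick
    restart_after_fork if forked?
    # ... normal work path ...
  end

  private

  def forked?
    @pid != Process.pid
  end

  def restart_after_fork
    start # the old thread does not exist in this process
  end

  def start
    @pid = Process.pid
    @thread = Thread.new { loop { @task&.call; sleep @interval } }
  end
end
```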

@kaylareopelle, I believe I have a generic enough patch for this. Would you like me to open a PR or is somebody else looking at this?

chrisholmes avatar Mar 04 '25 14:03 chrisholmes

Hi @chrisholmes, thanks for reaching out! A PR with your patch would be great. I'm not aware of anyone looking into this issue yet, and we would appreciate the support.

kaylareopelle avatar Mar 04 '25 17:03 kaylareopelle

@kaylareopelle, here's the PR: https://github.com/open-telemetry/opentelemetry-ruby/pull/1823

chrisholmes avatar Mar 07 '25 12:03 chrisholmes

We've also encountered duplicate metrics with the way forked workers behave. We needed to update the resource attributes to report the worker pid, which avoided the duplicates. An example from the tracer side: https://github.com/open-telemetry/opentelemetry-ruby/issues/1425#issuecomment-1921630036.

karmingc avatar Apr 07 '25 15:04 karmingc

@chrisholmes Our team runs it with Puma (multithreaded) + Grafana, and we've noticed that long-running instances start to report duplicate-timestamp metrics after ~1 hr. Rebooting the app resolves it...

Our implementation with Puma

on_worker_boot do
  metric_reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
    exporter: OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new,
  )

  OpenTelemetry.meter_provider.add_metric_reader(metric_reader)

  resource = OpenTelemetry
    .meter_provider
    .resource
    .merge(OpenTelemetry::SDK::Resources::Resource.process)
    .merge(OpenTelemetry::SDK::Resources::Resource.create({"service.instance.id" => Process.pid}))
  OpenTelemetry.meter_provider.instance_variable_set(:@resource, resource)
end

Is it something your team has encountered?

@kaylareopelle, is multithreading a potential cause of duplicate-timestamp metrics, or does the SDK cover it?

karmingc avatar Apr 15 '25 15:04 karmingc

Tested the approach that triggers the initialization of the periodic metric reader in the child process by modifying metric_stream.rb, as suggested by @chrisholmes.

class MetricStream
  def initialize(...)
    # ...
    @pid = nil
  end

  def collect(start_time, end_time); end

  def update(value, attributes)
    reset_on_fork
    # ...
  end

  def reset_on_fork
    pid = Process.pid
    return if @pid == pid

    @pid = pid

    @meter_provider.metric_readers.each do |reader|
      reader.send(:start) if reader.instance_of?(OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader)
    end
  end

  def aggregate_metric_data(start_time, end_time, aggregation: nil); end

  def find_registered_view; end

  def to_s; end
end

Hi @karmingc, does it always happen after ~1 hr? Do you use delta temporality?

xuan-cao-swi avatar Apr 15 '25 18:04 xuan-cao-swi

hi @xuan-cao-swi, the cadence is probably an approximation, but it seems to hover around that time period.

[screenshot: Grafana graph showing the duplicate-timestamp metrics]

Note that it was the same deployment from 16h to 9h, with a new deploy at ~9:30 and another around 11:30.

We use the cumulative temporality due to Mimir.

karmingc avatar Apr 15 '25 19:04 karmingc

Hi everyone, I want to capture request-failure metrics, and the best way to do that is with a middleware. But just like the approach above, with Puma the middleware cannot be attached in on_worker_boot, because that attempts to modify a frozen array of middlewares. Adding the middleware in the Rails application.rb works, but the metrics are never received by the OTel collector. Any suggestions?
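For illustration, the kind of middleware in question might look like the sketch below. The class name, metric name, and wiring are hypothetical, not from the SDK; the middleware takes any object responding to #add, so it can be registered once in config/application.rb and handed an OTel counter (e.g. one created with meter.create_counter) at boot.

```ruby
# Illustrative Rack middleware that counts failed (5xx) requests. The counter
# is injected rather than created here, so the middleware itself has no
# dependency on when or where the OpenTelemetry SDK was initialised.
class FailureCounterMiddleware
  def initialize(app, counter)
    @app = app
    @counter = counter # e.g. meter.create_counter('http.server.failed_requests')
  end

  def call(env)
    status, headers, body = @app.call(env)
    if status.to_i >= 500
      # Record one failure, tagged with the response status code.
      @counter.add(1, attributes: { 'http.response.status_code' => status })
    end
    [status, headers, body]
  end
end

# Hypothetical wiring in config/application.rb:
#   config.middleware.use FailureCounterMiddleware, my_counter
```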

imagarwalshri avatar Apr 23 '25 09:04 imagarwalshri

Hi folks, #1823 has been merged and released as part of opentelemetry-metrics-sdk version 0.7.1.

Give it a try when you can and let us know if you run into other issues.

kaylareopelle avatar May 29 '25 17:05 kaylareopelle

> Hi folks, #1823 has been merged and released as part of opentelemetry-metrics-sdk version 0.7.1.
>
> Give it a try when you can and let us know if you run into other issues.

@kaylareopelle We removed our manual patch via puma and I can confirm that the change introduced in 0.7.1 works for our team.

karmingc avatar Jun 02 '25 16:06 karmingc

That's great to hear! Thanks @karmingc!

kaylareopelle avatar Jun 03 '25 16:06 kaylareopelle

Is anyone still having trouble with metrics when using Passenger?

wsmoak-maxio avatar Aug 28 '25 14:08 wsmoak-maxio