Periodic Metric exporter unable to collect data points on Passenger-based servers
Description of the bug
Due to Passenger's smart spawning mechanism, which creates worker processes by forking an initial pre-loader process, the thread for the periodic metric exporter is only attached to the pre-loader process. It therefore does not share memory with the child processes that handle the actual incoming requests.
As a result, the periodic metric reader never has any data points: the metrics are stored in the memory of the child processes and are not shared with the thread attached to the pre-loader process.
The spawning mechanism, as well as its caveats, is described here: https://www.phusionpassenger.com/library/indepth/ruby/spawn_methods/#the-smart-spawning-method
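The core problem can be demonstrated in a few lines of plain Ruby, with no OpenTelemetry involved: a thread started before a fork (like the exporter's background thread) survives only as a dead `Thread` object in the child, so any periodic export loop silently stops running there.

```ruby
# A background thread standing in for the PeriodicMetricReader's export loop.
exporter_thread = Thread.new { sleep }

reader, writer = IO.pipe
pid = fork do
  reader.close
  # In the forked child, only the forking thread survives; the pre-fork
  # thread object still exists but is no longer running.
  writer.puts(exporter_thread.alive? ? "alive" : "dead")
  exit!
end
writer.close

puts "parent sees exporter thread: #{exporter_thread.alive? ? 'alive' : 'dead'}" # alive
puts "child sees exporter thread:  #{reader.read.strip}"                         # dead
Process.wait(pid)
```

This is exactly why the reader must be re-initialised (or its thread restarted) inside each worker after the fork.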
To resolve this, the periodic metric reader needs to be initialised for each forked worker process instantiated by Passenger. We did this as follows, based on events that Passenger exposes from its internals, which allowed us to send metrics as required:
```ruby
if defined?(PhusionPassenger)
  PhusionPassenger.on_event(:starting_worker_process) do |forked|
    if forked
      metric_reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
        exporter: OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
      )
      OpenTelemetry.meter_provider.add_metric_reader(metric_reader)
    end
  end
end
```
I've run into the same issue while testing this on one of our applications that uses Puma with forked workers. It should apply to every system that uses a forking web server, since forking kills threads.
@kaylareopelle, I believe I have a generic enough patch for this. Would you like me to open a PR or is somebody else looking at this?
Hi @chrisholmes, thanks for reaching out! A PR with your patch would be great. I'm not aware of anyone looking into this issue yet, and we would appreciate the support.
@kaylareopelle, here's the PR: https://github.com/open-telemetry/opentelemetry-ruby/pull/1823
We've also encountered duplicate metrics with the way forked workers work. We needed to update the resource attributes so that each worker reported its own pid, thus avoiding duplicate metrics. An example from the tracer side: https://github.com/open-telemetry/opentelemetry-ruby/issues/1425#issuecomment-1921630036.
@chrisholmes Our team runs it with Puma (multithreaded) + Grafana and we've noticed that long running instances start to report duplicate timestamp metrics after ~ 1 hr. Rebooting the app suddenly resolves it...
Our implementation with Puma:
```ruby
on_worker_boot do
  metric_reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
    exporter: OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
  )
  OpenTelemetry.meter_provider.add_metric_reader(metric_reader)

  resource = OpenTelemetry
    .meter_provider
    .resource
    .merge(OpenTelemetry::SDK::Resources::Resource.process)
    .merge(OpenTelemetry::SDK::Resources::Resource.create({ "service.instance.id" => Process.pid }))
  OpenTelemetry.meter_provider.instance_variable_set(:@resource, resource)
end
```
Is it something your team has encountered?
@kaylareopelle is multithreading a potential cause of reporting duplicate timestamp metrics, or does the SDK cover it?
Tested the approach that triggers the initialization of the periodic metric reader in the child process by modifying metric_stream.rb, as suggested by @chrisholmes:
```ruby
class MetricStream
  def initialize(...)
    # ...
    @pid = nil
  end

  def collect(start_time, end_time); end

  def update(value, attributes)
    reset_on_fork
    # ...
  end

  def reset_on_fork
    pid = Process.pid
    return if @pid == pid

    @pid = pid
    @meter_provider.metric_readers.each do |reader|
      reader.send(:start) if reader.instance_of?(OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader)
    end
  end

  def aggregate_metric_data(start_time, end_time, aggregation: nil); end

  def find_registered_view; end

  def to_s; end
end
```
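The pid check in `reset_on_fork` generalizes: any component owning a background thread can compare the current `Process.pid` against the pid it recorded earlier and restart its thread after a fork. A minimal standalone sketch of that pattern (`FakeReader`, `record`, and `restarts` are illustrative names, not part of the OpenTelemetry SDK):

```ruby
# Sketch of restart-on-fork via pid comparison; not the SDK's actual API.
class FakeReader
  attr_reader :restarts

  def initialize
    @pid = Process.pid
    @restarts = 0
    start
  end

  # Called on the hot path; cheap when no fork has happened.
  def record(value)
    reset_on_fork
    # ... aggregate the value here ...
  end

  private

  def start
    @thread = Thread.new { loop { sleep 1 } } # stands in for the export loop
  end

  def reset_on_fork
    pid = Process.pid
    return if @pid == pid # same process: nothing to do

    @pid = pid
    @restarts += 1
    start # the old thread did not survive the fork; spawn a fresh one
  end
end
```

Doing the check lazily on the recording path (rather than registering an `at_fork` hook) keeps the pattern portable across Passenger, Puma, or a bare `fork`, at the cost of one pid comparison per measurement.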
Hi @karmingc, does it always happen in ~1 hr? Do you use delta temporality?
hi @xuan-cao-swi, the cadence is probably an approximation, but it seems to hover around that time period.
note that it was the same deployment from 16h - 9h and there was a new deploy at ~9:30. New deploy around 11:30 as well.
We use the cumulative temporality due to Mimir.
Hi everyone,
I want to capture request-failure metrics, and the best way to do that is by adding middleware. But, just like the approach above with Puma, middleware cannot be attached in on_worker_boot, since that attempts to modify a frozen array of middlewares. Adding the middleware in the Rails application.rb works, but the metrics are never received by the OTel collector. Any suggestions?
Hi folks, #1823 has been merged and released as part of opentelemetry-metrics-sdk version 0.7.1.
Give it a try when you can and let us know if you run into other issues.
@kaylareopelle We removed our manual patch via puma and I can confirm that the change introduced in 0.7.1 works for our team.
That's great to hear! Thanks @karmingc!
Is anyone still having trouble with metrics when using Passenger?