fix: Safe-navigate to session flusher

Open · ixti opened this issue 1 year ago · 6 comments

Sentry.session_flusher is nil-able, and under some circumstances (which I still can't figure out) it is nil in specs, causing our spec suite to fail randomly (and recently, pretty much all the time).

Example failure

     Failure/Error: post(path, env:, headers:, xhr: true)
     
     NoMethodError:
       undefined method `add_session' for nil
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/hub.rb:251:in `end_session'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/hub.rb:260:in `with_session_tracking'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry-ruby.rb:411:in `with_session_tracking'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/rack/capture_exceptions.rb:21:in `block in call'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/hub.rb:59:in `with_scope'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry-ruby.rb:391:in `with_scope'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/rack/capture_exceptions.rb:20:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rails_semantic_logger-4.17.0/lib/rails_semantic_logger/rack/logger.rb:45:in `call_app'
     # /home/ixti/.gem/ruby/3.3.4/gems/rails_semantic_logger-4.17.0/lib/rails_semantic_logger/rack/logger.rb:26:in `block in call'
     # /home/ixti/.gem/ruby/3.3.4/bundler/gems/semantic_logger-093c0258b053/lib/semantic_logger/base.rb:190:in `block in tagged'
     # /home/ixti/.gem/ruby/3.3.4/bundler/gems/semantic_logger-093c0258b053/lib/semantic_logger/semantic_logger.rb:395:in `named_tagged'
     # /home/ixti/.gem/ruby/3.3.4/bundler/gems/semantic_logger-093c0258b053/lib/semantic_logger/base.rb:197:in `tagged'
     # /home/ixti/.gem/ruby/3.3.4/gems/rails_semantic_logger-4.17.0/lib/rails_semantic_logger/rack/logger.rb:26:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-3.1.7/lib/rack/runtime.rb:24:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-3.1.7/lib/rack/sendfile.rb:114:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-cors-2.0.2/lib/rack/cors.rb:102:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/railties-7.2.1/lib/rails/engine.rb:535:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-test-2.1.0/lib/rack/test.rb:360:in `process_request'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-test-2.1.0/lib/rack/test.rb:153:in `request'
     # ./spec/support/helpers/protobufable_test_helper.rb:76:in `rpc'
     # ./spec/requests/asgard/auth_controller_spec.rb:220:in `block (4 levels) in <top (required)>'
     # /home/ixti/.gem/ruby/3.3.4/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
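
The failing frame is Sentry::Hub#end_session (hub.rb:251 in the trace), where the finished session is handed to Sentry.session_flusher. The change proposed here amounts to safe-navigating that call; below is a minimal sketch reconstructed from the backtrace, with the method body paraphrased rather than copied from the upstream source:

# Paraphrased sketch of Sentry::Hub#end_session, not the exact 5.19.0 source.
def end_session
  # Session started earlier by with_session_tracking (hub.rb:260 in the trace).
  session = current_scope.session
  return unless session

  session.close

  # Sentry.session_flusher can be nil (it is torn down by Sentry.close and is
  # never created when session tracking is disabled). Safe navigation drops the
  # flush instead of raising "undefined method `add_session' for nil".
  Sentry.session_flusher&.add_session(session)
end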

RSpec Config (Sentry related)

require "sentry/test_helper"

RSpec.configure do |config|
  config.include Sentry::TestHelper
  config.before { setup_sentry_test }
  config.after { teardown_sentry_test }
end

Workaround

As a workaround, we just disabled session tracking:

Sentry.init do |config|
  # ...

  # See: https://github.com/getsentry/sentry-ruby/pull/2337
  config.excluded_exceptions << "Sidekiq::JobRetry::Handled"

  # See: https://github.com/getsentry/sentry-ruby/pull/2396
  config.auto_session_tracking = false

  # ...
end

Resolves: https://github.com/getsentry/sentry-ruby/issues/2378

ixti avatar Sep 02 '24 16:09 ixti

We tried and it still fails:

require "sentry/test_helper"

RSpec.configure do |config|
  config.include Sentry::TestHelper

  config.before do
    setup_sentry_test { _1.auto_session_tracking = true }
  end

  config.after { teardown_sentry_test }
end

ixti avatar Sep 02 '24 16:09 ixti

Thanks for reporting the issue. It does look very puzzling as the only method that could nullify the session flusher is Sentry.close. Can you do a quick search to see if it's called anywhere in the test suite or codebase?

st0012 avatar Sep 15 '24 11:09 st0012
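
For reference, one way to run that search without leaving Ruby (the glob paths are an assumption about a typical Rails layout; plain grep works just as well):

# Throwaway script: list every call site of Sentry.close in the project.
Dir.glob("{app,lib,config,spec}/**/*.rb").each do |path|
  File.foreach(path).with_index(1) do |line, lineno|
    puts "#{path}:#{lineno}: #{line.strip}" if line.include?("Sentry.close")
  end
end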

I am experiencing the same issue in production and hence reported #2378. Currently, the issue appears rather frequently for us, but not on every request:

[Screenshot attached, dated 2024-10-10]

Is there any chance I could help get this merged and released? Without a fix, each occurrence ends up with a 500 Internal Server Error and hides actual errors. I would rather miss session tracking than have those errors.

In our app, we never call Sentry.close ourselves.

MrSerth avatar Oct 10 '24 18:10 MrSerth

Thanks for reporting the issue. It does look very puzzling as the only method that could nullify the session flusher is Sentry.close. Can you do a quick search to see if it's called anywhere in the test suite or codebase?

We don't call Sentry.close anywhere.

ixti avatar Oct 18 '24 01:10 ixti

Could you grab sentry from solnic/make-closing-thread-safe branch and see if the crash is gone?

solnic avatar Oct 18 '24 08:10 solnic
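
For anyone else wanting to test that branch, the Gemfile change looks roughly like this (the glob option is needed because the sentry-ruby gem lives in a subdirectory of the monorepo; treat the exact gemspec path as an assumption and verify it against the repo layout):

# Gemfile
gem "sentry-ruby",
    git: "https://github.com/getsentry/sentry-ruby.git",
    branch: "solnic/make-closing-thread-safe",
    glob: "sentry-ruby/*.gemspec"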

Could you grab sentry from solnic/make-closing-thread-safe branch and see if the crash is gone?

I tried the branch, but since we modified the codebase (where this issue normally occurred) a lot recently, I wasn't able to confirm the fix is working. Sorry for not being able to help out here. Maybe it's still valid to merge this PR using the safe navigation operator?

MrSerth avatar Oct 19 '24 19:10 MrSerth

We also worked around this by disabling session tracking completely. And even if we switch it back on, the bug is sporadic enough that we can't tell whether the patch in solnic/make-closing-thread-safe fixes it or not.

ixti avatar Oct 27 '24 01:10 ixti

Thanks folks. Would it be possible for you to disable your workarounds after we release a new version and see if that helps? The proposed fix here comes with the risk that it will silence actual failures where the session_flusher should really be there. It's hard to tell which option is worse, but given that the fix addresses an edge case by making a change that may mask more common problems (an actual misconfiguration of Sentry), it's probably better not to apply it.

solnic avatar Oct 28 '24 07:10 solnic
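
A middle-ground variant of the sketch shown earlier would be to skip the flush but log loudly when the flusher is missing, so a genuine misconfiguration stays visible; this is a hypothetical sketch, not what this PR implements, and accessor names are assumed from sentry-ruby 5.x:

# Hypothetical alternative to plain safe navigation in Hub#end_session;
# method body is paraphrased.
def end_session
  session = current_scope.session
  return unless session

  session.close

  if (flusher = Sentry.session_flusher)
    flusher.add_session(session)
  else
    # Surface the unexpected state instead of silently dropping the session.
    Sentry.configuration&.logger&.warn("Sentry.session_flusher is nil; dropping session")
  end
end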

As written above, I already tested the branch and partially reverted some of the code changes I considered to have the most impact on this issue.

While I understand your concern, this issue (and especially the one discussed in #2428) may hide actual bugs in our code base from us as Sentry customers -- that is at least what happened to us. We would normally trust Sentry's error reporting, but when this bug occurs, we get no notification of and no insight into actual defects in our software. Hence, from our perspective, it is far less critical to miss some sessions than to miss the actual bug that occurred.

MrSerth avatar Oct 28 '24 08:10 MrSerth

As written above, I already tested the branch and partially reverted some of the code changes I considered to have the most impact on this issue.

Gotcha, thank you!

While I understand your concern, this issue (and especially the one discussed in #2428) may hide actual bugs in our code base from us as Sentry customers -- that is at least what happened to us. We would normally trust Sentry's error reporting, but when this bug occurs, we get no notification of and no insight into actual defects in our software. Hence, from our perspective, it is far less critical to miss some sessions than to miss the actual bug that occurred.

Alright, given that Hub is going away eventually (see https://github.com/getsentry/sentry-ruby/issues/1495#issuecomment-1881400829) we can just rely on the workarounds here and in your PR for the time being. Thanks for all your input and help with this btw!

solnic avatar Oct 28 '24 16:10 solnic

Would it be possible for you to disable your workarounds after we release a new version and see if that helps?

Sure, we will give it a try!

ixti avatar Oct 29 '24 03:10 ixti

Hey guys. We've been running into the NoMethodError (undefined method add_session for nil) for quite some time now, but the errors were intermittent. Recently we were able to reproduce it with one seed and 3 specs. I haven't tried to reproduce it in a blank app/script yet, but the general problem I'm seeing is this: if setup_sentry_test runs before any request spec, then every request spec that follows gets this error; if any request spec runs before the spec that calls setup_sentry_test, we don't get the error at all. That explains why it was intermittent for us. It also tells me that the issue is probably not with threads (cc @solnic), since it always happens when the specs run in the order described above. I have no idea whether there's a connection between when Sentry is initialized and request specs, or whether that connection is unique to our project. As a workaround, this:

before do
  setup_sentry_test { |it| it.auto_session_tracking = false }
end

works, but meh. I should also mention that the issue persists on the master branch. I will try to create a simple reproduction and/or a fix, but I just wanted to provide some info in case it helps anyone 👍

Let me know if I can be of any further assistance here

edit: another bit of info. It seems that if we remove config.enabled_environments = %w[production] from sentry.rb then all is well somehow 🤔

edit2: 2 specs are enough too

Vagab avatar Nov 13 '24 13:11 Vagab
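
For context, the enabled_environments setting mentioned in the edit above lives in the Sentry initializer; here is a minimal sketch of the relevant part (the DSN line and file path are assumptions):

# config/initializers/sentry.rb (simplified)
Sentry.init do |config|
  config.dsn = ENV["SENTRY_DSN"]

  # Removing this line reportedly made the order-dependent NoMethodError
  # disappear in the suite described above.
  config.enabled_environments = %w[production]
end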

@Vagab oh these are great insights, thank you! I'll look into this again

solnic avatar Nov 14 '24 06:11 solnic

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.17%. Comparing base (ceba0da) to head (7bf8c9e). Report is 12 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2396      +/-   ##
==========================================
+ Coverage   98.13%   98.17%   +0.04%     
==========================================
  Files         126      128       +2     
  Lines        4761     4829      +68     
==========================================
+ Hits         4672     4741      +69     
+ Misses         89       88       -1     
Components                      Coverage Δ
sentry-ruby                     98.57% <100.00%> (+0.05%) :arrow_up:
sentry-rails                    97.08% <ø> (+0.02%) :arrow_up:
sentry-sidekiq                  96.96% <ø> (ø)
sentry-resque                   92.85% <ø> (ø)
sentry-delayed_job              95.65% <ø> (ø)
sentry-opentelemetry            99.31% <ø> (ø)

Files with missing lines        Coverage Δ
sentry-ruby/lib/sentry/hub.rb   98.87% <100.00%> (ø)

... and 6 files with indirect coverage changes

codecov[bot] avatar Nov 28 '24 12:11 codecov[bot]