fix: Safe-navigate to session flusher

Open · ixti opened this issue 1 year ago · 6 comments

Sentry.session_flusher is nil-able, and under some circumstances (which I still can't figure out) it is nil in specs, causing our spec suite to fail randomly (and recently, pretty much all the time).

Example failure

     Failure/Error: post(path, env:, headers:, xhr: true)
     
     NoMethodError:
       undefined method `add_session' for nil
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/hub.rb:251:in `end_session'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/hub.rb:260:in `with_session_tracking'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry-ruby.rb:411:in `with_session_tracking'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/rack/capture_exceptions.rb:21:in `block in call'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/hub.rb:59:in `with_scope'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry-ruby.rb:391:in `with_scope'
     # /home/ixti/.gem/ruby/3.3.4/gems/sentry-ruby-5.19.0/lib/sentry/rack/capture_exceptions.rb:20:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rails_semantic_logger-4.17.0/lib/rails_semantic_logger/rack/logger.rb:45:in `call_app'
     # /home/ixti/.gem/ruby/3.3.4/gems/rails_semantic_logger-4.17.0/lib/rails_semantic_logger/rack/logger.rb:26:in `block in call'
     # /home/ixti/.gem/ruby/3.3.4/bundler/gems/semantic_logger-093c0258b053/lib/semantic_logger/base.rb:190:in `block in tagged'
     # /home/ixti/.gem/ruby/3.3.4/bundler/gems/semantic_logger-093c0258b053/lib/semantic_logger/semantic_logger.rb:395:in `named_tagged'
     # /home/ixti/.gem/ruby/3.3.4/bundler/gems/semantic_logger-093c0258b053/lib/semantic_logger/base.rb:197:in `tagged'
     # /home/ixti/.gem/ruby/3.3.4/gems/rails_semantic_logger-4.17.0/lib/rails_semantic_logger/rack/logger.rb:26:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-3.1.7/lib/rack/runtime.rb:24:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-3.1.7/lib/rack/sendfile.rb:114:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-cors-2.0.2/lib/rack/cors.rb:102:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/railties-7.2.1/lib/rails/engine.rb:535:in `call'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-test-2.1.0/lib/rack/test.rb:360:in `process_request'
     # /home/ixti/.gem/ruby/3.3.4/gems/rack-test-2.1.0/lib/rack/test.rb:153:in `request'
     # ./spec/support/helpers/protobufable_test_helper.rb:76:in `rpc'
     # ./spec/requests/asgard/auth_controller_spec.rb:220:in `block (4 levels) in <top (required)>'
     # /home/ixti/.gem/ruby/3.3.4/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
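
The failing frame is Sentry::Hub#end_session (hub.rb:251 in the trace), where the finished session is handed to Sentry.session_flusher. The change proposed here amounts to safe-navigating that call; below is a minimal sketch reconstructed from the backtrace, with the method body paraphrased rather than copied from the upstream source:

# Paraphrased sketch of Sentry::Hub#end_session, not the exact 5.19.0 source.
def end_session
  # Session started earlier by with_session_tracking (hub.rb:260 in the trace).
  session = current_scope.session
  return unless session

  session.close

  # Sentry.session_flusher can be nil (it is torn down by Sentry.close and is
  # never created when session tracking is disabled). Safe navigation drops the
  # flush instead of raising "undefined method `add_session' for nil".
  Sentry.session_flusher&.add_session(session)
end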

RSpec Config (Sentry related)

require "sentry/test_helper"

RSpec.configure do |config|
  config.include Sentry::TestHelper
  config.before { setup_sentry_test }
  config.after { teardown_sentry_test }
end

Workaround

As a workaround, we just disabled session tracking:

Sentry.init do |config|
  # ...

  # See: https://github.com/getsentry/sentry-ruby/pull/2337
  config.excluded_exceptions << "Sidekiq::JobRetry::Handled"

  # See: https://github.com/getsentry/sentry-ruby/pull/2396
  config.auto_session_tracking = false

  # ...
end

Resolves: https://github.com/getsentry/sentry-ruby/issues/2378

ixti avatar Sep 02 '24 16:09 ixti

We tried and it still fails:

require "sentry/test_helper"

RSpec.configure do |config|
  config.include Sentry::TestHelper

  config.before do
    setup_sentry_test { _1.auto_session_tracking = true }
  end

  config.after { teardown_sentry_test }
end

ixti avatar Sep 02 '24 16:09 ixti

Thanks for reporting the issue. It does look very puzzling as the only method that could nullify the session flusher is Sentry.close. Can you do a quick search to see if it's called anywhere in the test suite or codebase?

st0012 avatar Sep 15 '24 11:09 st0012
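
For reference, one way to run that search without leaving Ruby (the glob paths are an assumption about a typical Rails layout; plain grep works just as well):

# Throwaway script: list every call site of Sentry.close in the project.
Dir.glob("{app,lib,config,spec}/**/*.rb").each do |path|
  File.foreach(path).with_index(1) do |line, lineno|
    puts "#{path}:#{lineno}: #{line.strip}" if line.include?("Sentry.close")
  end
end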

I am experiencing the same issue in production and hence reported #2378. Currently, the issue appears rather frequently for us, but not on every request:

[Screenshot attached, dated 2024-10-10]

Is there any chance I could help get this merged and released? Without a fix, each occurrence ends up with a 500 Internal Server Error and hides actual errors. I would rather miss session tracking than have those errors.

In our app, we never call Sentry.close ourselves.

MrSerth avatar Oct 10 '24 18:10 MrSerth

Thanks for reporting the issue. It does look very puzzling as the only method that could nullify the session flusher is Sentry.close. Can you do a quick search to see if it's called anywhere in the test suite or codebase?

We don't call Sentry.close anywhere.

ixti avatar Oct 18 '24 01:10 ixti

Could you grab sentry from solnic/make-closing-thread-safe branch and see if the crash is gone?

solnic avatar Oct 18 '24 08:10 solnic
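
For anyone else wanting to test that branch, the Gemfile change looks roughly like this (the glob option is needed because the sentry-ruby gem lives in a subdirectory of the monorepo; treat the exact gemspec path as an assumption and verify it against the repo layout):

# Gemfile
gem "sentry-ruby",
    git: "https://github.com/getsentry/sentry-ruby.git",
    branch: "solnic/make-closing-thread-safe",
    glob: "sentry-ruby/*.gemspec"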

Could you grab sentry from solnic/make-closing-thread-safe branch and see if the crash is gone?

I tried the branch, but since we modified the codebase (where this issue normally occurred) a lot recently, I wasn't able to confirm the fix is working. Sorry for not being able to help out here. Maybe it's still valid to merge this PR using the safe navigation operator?

MrSerth avatar Oct 19 '24 19:10 MrSerth

We also worked around this by disabling session tracking completely. And even if we switch it back on, the bug is sporadic enough that we can't tell whether the patch in solnic/make-closing-thread-safe fixes it or not.

ixti avatar Oct 27 '24 01:10 ixti

Thanks folks. Would it be possible for you to disable your workarounds after we release a new version and see if that helps? The proposed fix here comes with the risk that it will silence actual failures where the session_flusher should really be there. It's hard to tell which option is worse, but given that the fix addresses an edge case by making a change that may mask more common problems (an actual misconfiguration of Sentry), it's probably better not to apply it.

solnic avatar Oct 28 '24 07:10 solnic
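
A middle-ground variant of the sketch shown earlier would be to skip the flush but log loudly when the flusher is missing, so a genuine misconfiguration stays visible; this is a hypothetical sketch, not what this PR implements, and accessor names are assumed from sentry-ruby 5.x:

# Hypothetical alternative to plain safe navigation in Hub#end_session;
# method body is paraphrased.
def end_session
  session = current_scope.session
  return unless session

  session.close

  if (flusher = Sentry.session_flusher)
    flusher.add_session(session)
  else
    # Surface the unexpected state instead of silently dropping the session.
    Sentry.configuration&.logger&.warn("Sentry.session_flusher is nil; dropping session")
  end
end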

As written above, I already tested the branch and partially reverted some of the code changes I considered to have the most impact on this issue.

While I understand your concern, this issue (and especially the one discussed in #2428) may hide actual bugs in our code base from us as Sentry customers -- that is at least what happened to us. We would normally trust Sentry's error reporting, but when this bug occurs, we get no notification of and no insight into actual defects in our software. Hence, from our perspective, it is far less critical to miss some sessions than to miss the actual bug that occurred.

MrSerth avatar Oct 28 '24 08:10 MrSerth

As written above, I already tested the branch and partially reverted some of the code changes I considered to have the most impact on this issue.

Gotcha, thank you!

While I understand your concern, this issue (and especially the one discussed in #2428) may hide actual bugs in our code base from us as Sentry customers -- that is at least what happened to us. We would normally trust Sentry's error reporting, but when this bug occurs, we get no notification of and no insight into actual defects in our software. Hence, from our perspective, it is far less critical to miss some sessions than to miss the actual bug that occurred.

Alright, given that Hub is going away eventually (see https://github.com/getsentry/sentry-ruby/issues/1495#issuecomment-1881400829) we can just rely on the workarounds here and in your PR for the time being. Thanks for all your input and help with this btw!

solnic avatar Oct 28 '24 16:10 solnic

Would it be possible for you to disable your workarounds after we release a new version and see if that helps?

Sure, we will give it a try!

ixti avatar Oct 29 '24 03:10 ixti

Hey guys. We've been running into the NoMethodError (undefined method add_session for nil) for quite some time now, but the errors were intermittent. Recently we were able to reproduce it with one seed and 3 specs. I haven't tried to reproduce it in a blank app/script yet, but the general problem I'm seeing is this: if setup_sentry_test runs before any request spec, then every request spec that follows gets this error; if any request spec runs before the spec that calls setup_sentry_test, we don't get the error at all. That explains why it was intermittent for us. It also tells me that the issue is probably not with threads (cc @solnic), since it always happens when the specs run in the order described above. I have no idea whether there's a connection between when Sentry is initialized and request specs, or whether that connection is unique to our project. As a workaround, this:

before do
  setup_sentry_test { |it| it.auto_session_tracking = false }
end

works, but meh. I should also mention that the issue persists on the master branch. I will try to create a simple reproduction and/or a fix, but I just wanted to provide some info in case it helps anyone 👍

Let me know if I can be of any further assistance here

edit: another bit of info. It seems that if we remove config.enabled_environments = %w[production] from sentry.rb then all is well somehow 🤔

edit2: 2 specs are enough too

Vagab avatar Nov 13 '24 13:11 Vagab
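
For context, the enabled_environments setting mentioned in the edit above lives in the Sentry initializer; here is a minimal sketch of the relevant part (the DSN line and file path are assumptions):

# config/initializers/sentry.rb (simplified)
Sentry.init do |config|
  config.dsn = ENV["SENTRY_DSN"]

  # Removing this line reportedly made the order-dependent NoMethodError
  # disappear in the suite described above.
  config.enabled_environments = %w[production]
end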

@Vagab oh these are great insights, thank you! I'll look into this again

solnic avatar Nov 14 '24 06:11 solnic

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.17%. Comparing base (ceba0da) to head (7bf8c9e). Report is 12 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2396      +/-   ##
==========================================
+ Coverage   98.13%   98.17%   +0.04%     
==========================================
  Files         126      128       +2     
  Lines        4761     4829      +68     
==========================================
+ Hits         4672     4741      +69     
+ Misses         89       88       -1     
Components                      Coverage Δ
sentry-ruby                     98.57% <100.00%> (+0.05%) :arrow_up:
sentry-rails                    97.08% <ø> (+0.02%) :arrow_up:
sentry-sidekiq                  96.96% <ø> (ø)
sentry-resque                   92.85% <ø> (ø)
sentry-delayed_job              95.65% <ø> (ø)
sentry-opentelemetry            99.31% <ø> (ø)

Files with missing lines        Coverage Δ
sentry-ruby/lib/sentry/hub.rb   98.87% <100.00%> (ø)

... and 6 files with indirect coverage changes

codecov[bot] avatar Nov 28 '24 12:11 codecov[bot]