Unable to find subscription with identifier: {"channel":"GraphqlChannel","channelId":"18bccb1dbd6"}
Describe the bug
This is possibly not a bug in graphql-ruby, and I apologize in advance if that's the case, but I've run out of ideas about what it could be. I'm opening this report in case anybody else has run into it before, or in case it's a common problem that I just can't find much information on.
Basically, we see a lot of these errors in our AppSignal monitoring, but I can't reproduce the problem locally. We don't know whether it happens when we deploy new versions while somebody is using the site (over a now-disconnected websocket?), or what's been going on, really.
Versions
- graphql version: 2.1.1
- rails (or other framework): 7.0.8
- other applicable versions (graphql-batch, etc):
  - graphql-pro (1.24.11)
  - graphql (2.1.1)
  - graphql-metrics (5.0.7)
  - graphql-rails_logger (1.2.4)
  - rubocop-graphql (1.4.0)
Steps to reproduce
I'm sorry, but honestly: no clue. We can't reproduce this locally; it only happens in our production and staging environments (on Amazon Fargate, using ElastiCache Redis).
Expected behavior
Not to throw this error, or to find the subscription
Actual behavior
The "Unable to find subscription with identifier" error above is raised repeatedly and reported to AppSignal.
Exception backtrace:
lib/action_cable/connection/subscriptions.rb:75 find
lib/action_cable/connection/subscriptions.rb:47 remove
lib/action_cable/connection/subscriptions.rb:18 execute_command
lib/action_cable/connection/base.rb:89 dispatch_websocket_message
lib/action_cable/server/worker.rb:59 block in invoke
lib/active_support/callbacks.rb:118 block in run_callbacks
lib/semantic_logger/base.rb:190 block in tagged
lib/semantic_logger/semantic_logger.rb:346 tagged
lib/semantic_logger/base.rb:202 tagged
lib/rails_semantic_logger/extensions/action_cable/tagged_logger_proxy.rb:8 tag
lib/action_cable/server/worker/active_record_connection_management.rb:16 with_database_connections
lib/active_support/callbacks.rb:127 block in run_callbacks
lib/action_cable/engine.rb:71 block (4 levels) in <class:Engine>
lib/active_support/execution_wrapper.rb:92 wrap
lib/action_cable/engine.rb:66 block (3 levels) in <class:Engine>
lib/active_support/callbacks.rb:127 instance_exec
lib/active_support/callbacks.rb:127 block in run_callbacks
lib/active_support/callbacks.rb:138 run_callbacks
lib/action_cable/server/worker.rb:42 work
lib/action_cable/server/worker.rb:58 invoke
lib/action_cable/server/worker.rb:53 block in async_invoke
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:352 run_task
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:343 block (3 levels) in create_worker
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334 loop
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334 block (2 levels) in create_worker
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:333 catch
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:333 block in create_worker
Actually, since this is raised in subscriptions.rb when the identifier can't be found in the @subscriptions array: could this be because we run multiple containers, and the user may be subscribed on another container?
Hi! Sorry for the trouble and thanks for the detailed report. What I expect on re-deploy is for clients to reconnect, re-sending GraphQL subscriptions as needed. A couple of quick thoughts:
- What caused this error to start occurring? Did you just add GraphQL subscriptions to your app, or did something else change that caused these to appear?
- Reviewing the backtrace above, I don't see anything GraphQL-related. Is there any indication that this is GraphQL-specific? (Are there other uses of ActionCable that work properly in your application?)
- You mention that you can't replicate the error locally. If you're using the app in production during a deploy, do you see anything happening in your browser?
- In AppSignal, can you see any trends in the occurrences of this bug? (Maybe browser? I can imagine browsers handling this event in different ways...)
@rmosolgo Right, so to address your questions (and what I think is going on here as well):
- This error has been occurring for a very long time for us, but it doesn't seem to cause any issues. We've had a lot of ideas about what it could be, and none of them have turned out to be it. We haven't really prioritized it because, like I said, it seems pretty harmless.
- You're right, and I realized after posting this that it may not be graphql-related but rather actioncable-related.
- (Re: questions 3 & 4) We're not seeing any browser-type pattern for this; it seems to be a mixed bag.
I think it's that we have this in our GraphQLChannel:
def subscribed
@subscription_ids = []
end
This would start each instance with a blank slate of connections, so if we're scaling out and running 3-4 instances, they're all pushing to the same Redis. One instance may pick up a subscriptions.trigger(...) from Puma on another instance but not have the gid in its @subscription_ids, because the list isn't shared.
Perhaps @subscription_ids should be pointing to redis to maintain a shared list of subscribed clients?
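For context, @subscription_ids gets filled in by the channel's execute method; here's a minimal sketch following the graphql-ruby ActionCable boilerplate (MySchema stands in for our schema):

def execute(data)
  result = MySchema.execute(
    query: data["query"],
    context: { channel: self },
    variables: data["variables"],
    operation_name: data["operationName"]
  )
  payload = { result: result.to_h, more: result.subscription? }
  # Subscriptions put a subscription_id into context; remember it per channel instance
  if result.context[:subscription_id]
    @subscription_ids << result.context[:subscription_id]
  end
  transmit(payload)
end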
My intention is for @subscription_ids to be a per-instance list of subscribers to that instance. So it works like this (a rough schematic follows the list):
- subscriptions.trigger is called in one Ruby process; it calls .broadcast to notify other processes of the trigger: https://github.com/rmosolgo/graphql-ruby/blob/8270480f4264ab31ae84dc225937cbd03e2a4843/lib/graphql/subscriptions/action_cable_subscriptions.rb#L116-L122
- Meanwhile, any ActionCable process that received a subscription should have set up a stream_from for that subscriber: https://github.com/rmosolgo/graphql-ruby/blob/8270480f4264ab31ae84dc225937cbd03e2a4843/lib/graphql/subscriptions/action_cable_subscriptions.rb#L169
- So, when a process receives a notification from a trigger, it runs that stream_from block and generates a payload. Then it broadcasts again, using the specific subscription ID, to update the client: https://github.com/rmosolgo/graphql-ruby/blob/8270480f4264ab31ae84dc225937cbd03e2a4843/lib/graphql/subscriptions/action_cable_subscriptions.rb#L129
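Here's that relay as a rough schematic in plain ActionCable terms. The topic names and the reevaluate_subscription helper are illustrative placeholders, not graphql-ruby's actual internals:

class SketchChannel < ApplicationCable::Channel
  # Illustrative only: graphql-ruby derives its own topic names internally.
  EVENT_TOPIC = "graphql-event:postUpdated"

  # Step 1: subscriptions.trigger, in whichever process it runs,
  # fans out over the pubsub adapter (e.g. Redis).
  def self.fan_out(event)
    ActionCable.server.broadcast(EVENT_TOPIC, event)
  end

  def subscribed
    # Step 2: the process that accepted this subscription registers a handler.
    stream_from(EVENT_TOPIC) do |event|
      # Step 3: re-evaluate the stored subscription against the event
      # (reevaluate_subscription is a hypothetical helper), then push the
      # result to this one client over its subscription-specific topic.
      result = reevaluate_subscription(params[:subscription_id], event)
      ActionCable.server.broadcast("graphql-subscription:#{params[:subscription_id]}", result)
    end
  end
end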
I hope that helps!
@rmosolgo Thank you for taking time out of your day to explain this a bit, appreciate it! :) We still haven't worked this issue out. We realized we weren't running ActionCable standalone in production, and thought this might be solved once we moved ActionCable to its own standalone service, but (I presume) because we also scale that service out, this "issue" still persists.
We don't really know to what extent it's an issue, or if it's an issue at all — e.g., are we losing a certain percentage of websocket pushes? Or are these errors completely harmless?
So, when a process receives a notification from a trigger, it runs that stream_from block and generates a payload. Then it broadcasts again, using the specific subscription ID, to update the client
In this sense, should an ActionCable process that does not have the subscription_id in its list of subscribers simply not pick this up? Do you see a way that, in a multi-ActionCable setup, ActionCable processes would routinely try to handle messages meant for subscriber_ids that they don't have?
@rmosolgo We've moved away from scaling ActionCable horizontally and have instead scaled it vertically, increasing the spec of the machine significantly and running only one machine. We are still seeing these errors every day, and we're quite lost as to where they come from.
We've also reduced our identified_by attributes to a single one, identified_by :_strategy, where _strategy is a custom class managing the JWT token and current user.
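For reference, a minimal sketch of that connection setup (AuthStrategy and the token parameter are placeholders, not our actual class):

module ApplicationCable
  class Connection < ActionCable::Connection::Base
    # A single identifier object wrapping JWT parsing and current-user lookup
    identified_by :_strategy

    def connect
      self._strategy = AuthStrategy.new(request.params[:token])
    end
  end
end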
I'm sorry I don't have anything better to offer on debugging this ... to answer those very old questions:
In this sense, should an ActionCable process that does not have the subscription_id in its list of subscribers simply not pick this up?
Yes, that sounds right to me... I think a process would only call stream_from during write_subscription, which is to say, processes without subscription IDs wouldn't call stream_from.
Do you see a way that, in a multi-ActionCable setup, ActionCable processes would routinely try to handle messages meant for subscriber_ids that they don't have?
How I expect that to work is:
- Channels call subscription.delete(...) as described here: https://github.com/rmosolgo/graphql-ruby/blob/8270480f4264ab31ae84dc225937cbd03e2a4843/lib/graphql/subscriptions/action_cable_subscriptions.rb#L58-L62
- delete_subscription removes the subscription from the process's data: https://github.com/rmosolgo/graphql-ruby/blob/f06879943cd5916bafcdc44b58c7e28d24b2626f/lib/graphql/subscriptions/action_cable_subscriptions.rb#L224-L227
I notice that it sends more: false to the client there in delete ... maybe that's the problem here?
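For reference, the channel-side hook that kicks off that path is the unsubscribed callback from the graphql-ruby docs; a minimal sketch (MySchema is a placeholder):

def unsubscribed
  @subscription_ids.each do |sid|
    # Removes the subscription from this process's data
    # and sends more: false to the client
    MySchema.subscriptions.delete_subscription(sid)
  end
end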
I pushed a branch which adds a rescue to that code, in case you want to try it: https://github.com/rmosolgo/graphql-ruby/pull/5053
Hey, I spent some more time with this and I was able to replicate the error in the test app in this repository. I could replicate this error by clicking "unsubscribe" twice. The first time, it unsubscribed (using ActionCable.Subscription.unsubscribe), but the second time, my client code called that function again and Rails logged the same error described here.
So, I don't think #5053 is related -- this error is between the client and ActionCable, without any GraphQL-Ruby code involved.
In the test app, I solved this problem by adding an _unsubscribed flag to my unsubscribe function:
unsubscribe() {
// Don't re-call `unsubscribe` to avoid Rails logging an error
if (!this._unsubscribed) {
this._unsubscribed = true
this.subscription.unsubscribe()
}
}
Would an approach like that work for you?
Also, reviewing this issue, I don't see which ActionCable client you're using. Are you using one of the ones from graphql-ruby-client? If so, I'd be happy to investigate an improvement there to include a fix like the one above.
I'm going to close this issue since it's gone cold, but if you get back to me about the client code, I'll be happy to re-open it until the client is improved.