Replays will not display the latest records, need to restart Docker for it to work properly

Open fzred opened this issue 2 years ago • 21 comments

Self-Hosted Version

23.8.0

CPU Architecture

x86_64

Docker Version

24.0.5

Docker Compose Version

2.20.2

Steps to Reproduce

Problems appeared less than 24 hours after installation. Replays won't display the latest data, requiring a restart, but after restarting, only the data before the restart can be seen, and the newly added records still won't display. I've tried reinstalling, but the problem persists.

Expected Result

Replays running normally

Actual Result

sentry-self-hosted-clickhouse-1 is not working properly. docker ps shows sentry-self-hosted-clickhouse-1 with STATUS=Restarting (139) Less than a second ago
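For context, a status of Restarting (139) usually means the container's main process was killed by a signal: exit codes above 128 conventionally encode 128 plus the signal number, so 139 would correspond to signal 11 (SIGSEGV). Below is a minimal Python sketch of that decoding. The function name decode_status is hypothetical, and the status format is assumed to match the docker ps output quoted above.

```python
import re
import signal
from typing import Optional, Tuple


def decode_status(status: str) -> Optional[Tuple[int, Optional[str]]]:
    """Pull the exit code out of a `docker ps` STATUS string and, when it
    follows the 128+N convention, name the signal that ended the process."""
    match = re.search(r"\((\d+)\)", status)
    if match is None:
        return None  # status carries no exit code, e.g. "Up 5 minutes"
    code = int(match.group(1))
    sig_name = None
    if code > 128:
        try:
            sig_name = signal.Signals(code - 128).name
        except ValueError:
            pass  # not a signal number known on this platform
    return code, sig_name


print(decode_status("Restarting (139) Less than a second ago"))
```

If the decoded signal is SIGSEGV, the ClickHouse error log is the next place to look, since the crash happens inside the server process rather than in Docker itself.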

volumes/sentry-self-hosted_sentry-clickhouse-log/_data/clickhouse-server.err.log

2023.09.04 02:32:49.273415 [ 1 ] {} <Error> Application: Listen [::]:8123 failed: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = DNS error: EAI: -9 (version 20.3.9.70 (official build)). If it is an IPv6 or IPv4 address and your host has disabled IPv6 or IPv4, then consider to specify not disabled IPv4 or IPv6 address to listen in <listen_host> element of configuration file. Example for disabled IPv6: <listen_host>0.0.0.0</listen_host> . Example for disabled IPv4: <listen_host>::</listen_host>
2023.09.04 02:32:49.273576 [ 1 ] {} <Error> Application: Listen [::]:9000 failed: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = DNS error: EAI: -9 (version 20.3.9.70 (official build)). If it is an IPv6 or IPv4 address and your host has disabled IPv6 or IPv4, then consider to specify not disabled IPv4 or IPv6 address to listen in <listen_host> element of configuration file. Example for disabled IPv6: <listen_host>0.0.0.0</listen_host> . Example for disabled IPv4: <listen_host>::</listen_host>
2023.09.04 02:32:49.273669 [ 1 ] {} <Error> Application: Listen [::]:9009 failed: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = DNS error: EAI: -9 (version 20.3.9.70 (official build)). If it is an IPv6 or IPv4 address and your host has disabled IPv6 or IPv4, then consider to specify not disabled IPv4 or IPv6 address to listen in <listen_host> element of configuration file. Example for disabled IPv6: <listen_host>0.0.0.0</listen_host> . Example for disabled IPv4: <listen_host>::</listen_host>
2023.09.04 02:32:49.273746 [ 1 ] {} <Error> Application: Listen [::]:9004 failed: Poco::Exception. Code: 1000, e.code() = 0, e.displayText() = DNS error: EAI: -9 (version 20.3.9.70 (official build)). If it is an IPv6 or IPv4 address and your host has disabled IPv6 or IPv4, then consider to specify not disabled IPv4 or IPv6 address to listen in <listen_host> element of configuration file. Example for disabled IPv6: <listen_host>0.0.0.0</listen_host> . Example for disabled IPv4: <listen_host>::</listen_host>
2023.09.04 02:33:38.357586 [ 72 ] {} <Warning> Settings: Unknown setting database_atomic_wait_for_drop_and_detach_synchronously, skipping
2023.09.04 02:33:38.358584 [ 72 ] {} <Warning> Settings: Unknown setting database_atomic_wait_for_drop_and_detach_synchronously, skipping
2023.09.04 02:33:38.359498 [ 72 ] {} <Warning> Settings: Unknown setting database_atomic_wait_for_drop_and_detach_synchronously, skipping
2023.09.04 02:33:38.380039 [ 72 ] {a6e69f47-9585-4784-a6c2-bb9722fa23d0} <Error> executeQuery: Code: 60, e.displayText() = DB::Exception: Table default.migrations_local doesn't exist. (version 20.3.9.70 (official build)) (from 172.20.0.6:47290) (in query: SELECT group, migration_id, status FROM migrations_local FINAL WHERE group IN ('system', 'events', 'transactions', 'discover', 'outcomes', 'metrics', 'sessions', 'profiles', 'functions', 'replays', 'generic_metrics', 'search_issues')), Stack trace (when copying this message, always include the lines below):

0. Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x105351b0 in /usr/bin/clickhouse
1. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x8f4172d in /usr/bin/clickhouse
2. DB::Context::getTableImpl(DB::StorageID const&, std::__1::optional<DB::Exception>*) const @ 0xcfe2a24 in /usr/bin/clickhouse
3. DB::Context::getTable(DB::StorageID const&) const @ 0xcfe2bbb in /usr/bin/clickhouse
4. DB::Context::getTable(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) const @ 0xcfe2c7d in /usr/bin/clickhouse
5. DB::JoinedTables::getLeftTableStorage() @ 0xd454892 in /usr/bin/clickhouse
6. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::__1::shared_ptr<DB::IAST> const&, DB::Context const&, std::__1::shared_ptr<DB::IBlockInputStream> const&, std::__1::optional<DB::Pipe>, std::__1::shared_ptr<DB::IStorage> const&, DB::SelectQueryOptions const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) @ 0xd13b6d1 in /usr/bin/clickhouse
7. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::__1::shared_ptr<DB::IAST> const&, DB::Context const&, DB::SelectQueryOptions const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) @ 0xd13c619 in /usr/bin/clickhouse
8. DB::InterpreterSelectWithUnionQuery::InterpreterSelectWithUnionQuery(std::__1::shared_ptr<DB::IAST> const&, DB::Context const&, DB::SelectQueryOptions const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) @ 0xd341686 in /usr/bin/clickhouse
9. DB::InterpreterFactory::get(std::__1::shared_ptr<DB::IAST>&, DB::Context&, DB::QueryProcessingStage::Enum) @ 0xd0909b4 in /usr/bin/clickhouse
10. ? @ 0xd550655 in /usr/bin/clickhouse
11. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::Context&, bool, DB::QueryProcessingStage::Enum, bool, bool) @ 0xd553441 in /usr/bin/clickhouse
12. DB::TCPHandler::runImpl() @ 0x9024489 in /usr/bin/clickhouse
13. DB::TCPHandler::run() @ 0x9025470 in /usr/bin/clickhouse
14. Poco::Net::TCPServerConnection::start() @ 0xe3ac69b in /usr/bin/clickhouse
15. Poco::Net::TCPServerDispatcher::run() @ 0xe3acb1d in /usr/bin/clickhouse
16. Poco::PooledThread::run() @ 0x105c3317 in /usr/bin/clickhouse
17. Poco::ThreadImpl::runnableEntry(void*) @ 0x105bf11c in /usr/bin/clickhouse
18. ? @ 0x105c0abd in /usr/bin/clickhouse
19. start_thread @ 0x76db in /lib/x86_64-linux-gnu/libpthread-2.27.so
20. __clone @ 0x12188f in /lib/x86_64-linux-gnu/libc-2.27.so

Event ID

No response

fzred avatar Sep 04 '23 18:09 fzred

Were you able to get replays working on a previous version of Sentry, or is this the first version where you're seeing this problem?

hubertdeng123 avatar Sep 06 '23 17:09 hubertdeng123

Were you able to get replays working on a previous version of Sentry, or is this the first version where you're seeing this problem?

I have used version 23.7.0, and this problem occurred after about a week. So I upgraded to 23.8.0, but this problem occurred again after 24 hours.

Now the problem is a bit different, even after rebooting, the Replays won't show anything new, it keeps displaying the content from 2 days ago.

fzred avatar Sep 06 '23 17:09 fzred

I think I see this same issue where replays stop appearing after some time - restarting all containers fixes it - after the restart, replays which were previously missing now appear. For example I just noticed my most recent replays are all from 5hrs ago, so I restarted all containers and now I see replays from the past 5 hours have now all appeared. I'm yet to check to see if there's a correlation in the logs though, or to selectively restart any containers to isolate the issue. This has happened 2 or 3 times now on a new install which is about a week old. I did have to wait maybe 5 min for them all to appear, so perhaps things were queued and required some processing

agoddard avatar Sep 07 '23 02:09 agoddard

I think I see this same issue where replays stop appearing after some time - restarting all containers fixes it - after the restart, containers which were previously missing now appear. For example I just noticed my most recent replays are all from 5hrs ago, so I restarted all containers and now I see replays from the past 5 hours have now all appeared. I'm yet to check to see if there's a correlation in the logs though, or to selectively restart any containers to isolate the issue. This has happened 2 or 3 times now on a new install which is about a week old. I did have to wait maybe 5 min for them all to appear, so perhaps things were queued and required some processing

I encountered the situation you mentioned before, but now even after restart, I can't see any new data.

fzred avatar Sep 07 '23 02:09 fzred

I think I see this same issue where replays stop appearing after some time - restarting all containers fixes it - after the restart, containers which were previously missing now appear

What containers were previously missing, did they crash for some reason?

Table default.migrations_local doesn't exist might be useful here. Wondering if there was a snuba migration that wasn't performed?

hubertdeng123 avatar Sep 07 '23 17:09 hubertdeng123

@hubertdeng123 my apologies, I mistyped that - I have updated my comment now. In that sentence I said “containers” were missing when I meant “replays” were missing. Now that I have confirmed for sure that the problem reoccurs after some time and a restart fixes it in my case, I will do further diagnosis before restarting next time

agoddard avatar Sep 07 '23 17:09 agoddard

Will do! Let us know what might be going wrong if it happens again.

hubertdeng123 avatar Sep 07 '23 17:09 hubertdeng123

I don't have a lot to add, except that selectively restarting zookeeper and clickhouse didn't fix it, but restarting all containers fixed it again. This time my most recent replay was from 8hrs ago, and after the restart I now see dozens from the past 8hrs. In terms of the processing pipeline from receiving the replay payload to showing it in the WebUI, which other containers can/should I inspect or selectively restart next time this happens, so I can start to isolate the issue?

agoddard avatar Sep 08 '23 02:09 agoddard

hi @agoddard, if zookeeper/clickhouse don't seem to be the cause I'd recommend looking at the kafka container to see if restarting that container specifically fixes the issue (and also checking the kafka logs for anything when the issue occurs).

bmckerry avatar Sep 11 '23 16:09 bmckerry

@bmckerry when the issue occurs, the most recent Kafka container log (issue occurred around the same time as this log too) is a log cleaner message with additional detail (shown below). Usually the log cleaner messages are just 1 line, but this one is more detailed, and also looks like it deleted quite a lot, though I don't know if this is important. Also printed below is an error message from sentry-self-hosted-snuba-replays-consumer-1 so next time it happens I will try selectively restarting that container.

Restarting the Kafka, clickhouse and zookeeper containers doesn't seem to solve the issue, but it's fixed again by a full docker compose restart. 4 days worth of replays processed in the ~5min following docker-compose restart

[2023-09-11 03:24:12,212] INFO [kafka-log-cleaner-thread-0]: 
	Log cleaner thread 0 cleaned log __consumer_offsets-0 (dirty section = [213401, 213401])
	28.4 MB of log processed in 0.4 seconds (65.9 MB/sec).
	Indexed 28.4 MB in 0.3 seconds (104.4 Mb/sec, 63.1% of total time)
	Buffer utilization: 0.0%
	Cleaned 28.4 MB in 0.2 seconds (178.6 Mb/sec, 36.9% of total time)
	Start size: 28.4 MB (218,698 messages)
	End size: 0.0 MB (58 messages)
	100.0% size reduction (100.0% fewer messages)
 (kafka.log.LogCleaner)

Potentially relevant logs from sentry-self-hosted-snuba-replays-consumer-1:

2023-09-13 18:05:57,384 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 288, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 365, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 149, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 98, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 122, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 330, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 239, in join
    response = self._result.result(timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
2023-09-13 18:05:57,394 Closing <arroyo.backends.kafka.consumer.KafkaConsumer object at 0x7fd4212aea00>...
2023-09-13 18:05:57,394 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-13 18:05:57,394 Partition revocation complete.
2023-09-13 18:05:57,395 Processor terminated
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 260, in consumer
    consumer.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 288, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 365, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 149, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 98, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 122, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 330, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 239, in join
    response = self._result.result(timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"

agoddard avatar Sep 17 '23 08:09 agoddard

Restarting the Kafka, clickhouse and zookeeper containers doesn't seem to solve the issue, but it's fixed again by a full docker compose restart. 4 days worth of replays processed in the ~5min following docker-compose restart

Interesting. What does your CPU/RAM usage look like, out of curiosity?

hubertdeng123 avatar Sep 19 '23 21:09 hubertdeng123

@hubertdeng123 super chill during normal operation. load avg ~0.5, host using 8 of 32GB ram (I'll see if I can check how much docker/sentry is using), but it doesn't seem to be sweating at all. no spikes in ram, IO, load during the approx window when it last failed.

agoddard avatar Sep 19 '23 21:09 agoddard

Got it, seems strange since it seems like a connection error is being thrown. All the containers are up and running, so there aren't any crashes?

hubertdeng123 avatar Sep 20 '23 17:09 hubertdeng123

@hubertdeng123 it's back in the stale replay state, CPU, load, memory are all fine - all containers are up and no crashes. I can leave it in this state for additional troubleshooting

agoddard avatar Sep 23 '23 18:09 agoddard

I've noticed the same issues as @agoddard. Replays show up in the UI fine for a while, then eventually stop. This is on release 23.7.2, but observed the same issue going back to the first self-hosted release that included replays.

Running docker compose logs snuba-replays-consumer, I saw the same http.client.RemoteDisconnected: Remote end closed connection without response @agoddard is seeing.

I found running docker compose restart snuba-replays-consumer would make replays process and start showing up in the UI.
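As a stopgap until the root cause is known, the check-then-restart routine described above could be automated. The following is only a sketch of a workaround, not a fix. The function names, the 10-minute log window, and the idea of matching on the RemoteDisconnected text are all assumptions, and it must be run from the self-hosted directory so that docker compose can find the project.

```python
import subprocess
from typing import List

# Error text copied from the consumer traceback quoted in this thread.
ERROR_MARKER = "http.client.RemoteDisconnected"
# Compose service name as used in the commands above.
SERVICE = "snuba-replays-consumer"


def consumer_looks_stuck(log_text: str) -> bool:
    """Return True if the recent log output contains the disconnect error."""
    return ERROR_MARKER in log_text


def restart_command(service: str = SERVICE) -> List[str]:
    """Build the restart command that was reported to recover replays."""
    return ["docker", "compose", "restart", service]


def check_and_restart(since: str = "10m") -> bool:
    """Read recent consumer logs and restart the service if the error shows up.

    Only a recent window is read, so an error from before the last restart
    does not trigger another one."""
    result = subprocess.run(
        ["docker", "compose", "logs", "--no-color", "--since", since, SERVICE],
        capture_output=True,
        text=True,
        check=False,
    )
    if consumer_looks_stuck(result.stdout + result.stderr):
        subprocess.run(restart_command(), check=True)
        return True
    return False
```

Calling check_and_restart() from a scheduled job whose interval matches the log window would keep the gap in replays short, at the cost of hiding the underlying disconnect.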

fpotter avatar Sep 25 '23 06:09 fpotter

Thanks @fpotter, I just tested restarting snuba-replays-consumer and I can confirm that fixes it for me too. The logs from that container show errors which seem to match the times when I had the issues. I'm not sure how I missed those errors in my prior searches. It looks like it disconnects (from Kafka?) and then never recovers until a container restart. I believe the "shutdown signaled" messages are from when I intentionally restarted the container(s) to recover from the issue.
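To line the errors up against the times replays went stale, the shutdown events can be pulled out of the consumer log by timestamp. This is a rough sketch. It assumes the log line format shown in this comment (date, time with milliseconds, then the message), and shutdown_times is a hypothetical helper name.

```python
import re
from datetime import datetime
from typing import List

# Matches lines such as "2023-09-13 18:05:57,384 Caught exception, shutting down..."
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} (.*)$")
SHUTDOWN_PREFIX = "Caught exception, shutting down"


def shutdown_times(log_text: str) -> List[datetime]:
    """Return the timestamp of every 'Caught exception' line in the log.

    Traceback lines carry no timestamp, so they fail the regex and are skipped."""
    times = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2).startswith(SHUTDOWN_PREFIX):
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times
```

Comparing these timestamps with the date of the newest replay in the UI should show whether each stale period starts at a consumer shutdown.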

2023-09-08 17:42:41,804 Partition revocation complete.
2023-09-08 17:42:42,814 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 41141}
2023-09-08 17:42:47,806 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-08 17:42:47,806 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fd42108ffd0>...
2023-09-08 17:42:47,806 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fd42108ffd0> to exit...
2023-09-08 17:42:47,806 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fd42108ffd0> exited successfully, releasing assignment.
2023-09-08 17:42:47,806 Partition revocation complete.
2023-09-08 17:42:48,819 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 41141}
2023-09-13 18:05:57,384 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 288, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 365, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 149, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 98, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 122, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 330, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 239, in join
    response = self._result.result(timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
2023-09-13 18:05:57,394 Closing <arroyo.backends.kafka.consumer.KafkaConsumer object at 0x7fd4212aea00>...
2023-09-13 18:05:57,394 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-13 18:05:57,394 Partition revocation complete.
2023-09-13 18:05:57,395 Processor terminated
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 260, in consumer
    consumer.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 288, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 365, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 149, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 98, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 122, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 330, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 239, in join
    response = self._result.result(timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
%3|1694936651.474|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#172.18.0.11:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1694936652.474|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#172.18.0.11:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
%3|1694937232.477|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#172.18.0.11:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1694937233.477|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#172.18.0.11:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
%3|1694937236.635|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Failed to resolve 'kafka:9092': Name or service not known (after 157ms in state CONNECT)
2023-09-17 07:53:57,466 Shutdown signalled
%3|1694937238.563|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Connect to ipv4#172.18.0.11:9092 failed: Connection refused (after 86ms in state CONNECT)
2023-09-17 07:54:19,587 Initializing Snuba...
2023-09-17 07:54:39,127 Snuba initialization took 19.55938357487321s
2023-09-17 07:54:40,272 Initializing Snuba...
2023-09-17 07:54:47,690 Snuba initialization took 7.420066382735968s
2023-09-17 07:54:47,702 Consumer Starting
2023-09-17 07:54:47,702 Checking Clickhouse connections
2023-09-17 07:54:47,711 librdkafka log level: 6
2023-09-17 07:55:19,489 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 77317}
2023-09-17 07:55:27,848 Connection pool is full, discarding connection: clickhouse. Connection pool size: 1
2023-09-17 07:55:31,608 Connection pool is full, discarding connection: clickhouse. Connection pool size: 1
2023-09-17 07:55:33,258 Connection pool is full, discarding connection: clickhouse. Connection pool size: 1
2023-09-18 09:21:06,939 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-18 09:21:06,939 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b291f250>...
2023-09-18 09:21:06,940 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b291f250> to exit...
2023-09-18 09:21:06,955 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b291f250> exited successfully, releasing assignment.
2023-09-18 09:21:06,955 Partition revocation complete.
2023-09-18 09:21:08,291 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 104727}
2023-09-18 09:21:12,936 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-18 09:21:12,937 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b94adb80>...
2023-09-18 09:21:12,937 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b94adb80> to exit...
2023-09-18 09:21:12,960 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b94adb80> exited successfully, releasing assignment.
2023-09-18 09:21:12,960 Partition revocation complete.
2023-09-18 09:21:14,293 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 104728}
2023-09-19 08:39:25,005 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-19 08:39:25,005 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b25db550>...
2023-09-19 08:39:25,005 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b25db550> to exit...
2023-09-19 08:39:25,006 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b25db550> exited successfully, releasing assignment.
2023-09-19 08:39:25,006 Partition revocation complete.
2023-09-19 08:39:25,120 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 112729}
2023-09-19 08:39:31,008 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-19 08:39:31,009 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b54b5220>...
2023-09-19 08:39:31,009 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b54b5220> to exit...
2023-09-19 08:39:31,010 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7f31b54b5220> exited successfully, releasing assignment.
2023-09-19 08:39:31,011 Partition revocation complete.
2023-09-19 08:39:31,126 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 112730}
2023-09-21 10:42:15,886 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 288, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 365, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 149, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 98, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 122, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 330, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 239, in join
    response = self._result.result(timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
2023-09-21 10:42:15,904 Closing <arroyo.backends.kafka.consumer.KafkaConsumer object at 0x7f31b2911a30>...
2023-09-21 10:42:15,905 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-21 10:42:15,905 Partition revocation complete.
2023-09-21 10:42:15,907 Processor terminated
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 260, in consumer
    consumer.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 288, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 365, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 149, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 98, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 122, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 330, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 239, in join
    response = self._result.result(timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
2023-09-25 06:08:15,752 Shutdown signalled
2023-09-25 06:08:26,740 Initializing Snuba...

agoddard, Sep 25 '23 06:09

Restarting sentry-self-hosted-snuba-replays-consumer-1 can indeed temporarily fix it.
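
If you want to automate that workaround until the root cause is found, here is a minimal sketch of a watchdog. It assumes that a "Processor terminated" line with no later "Initializing Snuba" line means the consumer has stalled, which is only a guess based on the logs posted in this thread. The helper names (`looks_stalled`, `restart_command`, `fetch_logs`) are made up for illustration and are not part of Sentry or Snuba. It uses only `docker logs --tail` and `docker restart`, and it treats the symptom, not the underlying ClickHouse disconnect.

```python
import subprocess

# Container name as reported in this thread; adjust for your compose project.
CONTAINER = "sentry-self-hosted-snuba-replays-consumer-1"


def looks_stalled(log_text: str) -> bool:
    """Heuristic (an assumption, not an official health check): the consumer
    logged "Processor terminated" and has not re-initialized since."""
    last_terminated = log_text.rfind("Processor terminated")
    last_init = log_text.rfind("Initializing Snuba")
    return last_terminated != -1 and last_terminated > last_init


def restart_command(container: str) -> list:
    """Build the docker CLI invocation that restarts one container."""
    return ["docker", "restart", container]


def fetch_logs(container: str, tail: int = 500) -> str:
    """Return the last `tail` log lines (stdout and stderr combined)."""
    proc = subprocess.run(
        ["docker", "logs", "--tail", str(tail), container],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout + proc.stderr


if __name__ == "__main__":
    # Run this from cron or a systemd timer; it does nothing when healthy.
    if looks_stalled(fetch_logs(CONTAINER)):
        subprocess.run(restart_command(CONTAINER), check=True)
```

Since this only checks the tail of the log, a very chatty consumer could push the marker out of the window; raise `tail` if that happens.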

Self-Hosted Version: 23.9.1. Logs from sentry-self-hosted-snuba-replays-consumer-1:

2023-09-26 02:47:43,844 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052b80>...
2023-09-26 02:47:43,845 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052b80> to exit...
2023-09-26 02:47:43,845 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052b80> exited successfully, releasing assignment.
2023-09-26 02:47:43,845 Partition revocation complete.
2023-09-26 02:47:45,280 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 36642}
2023-09-26 02:47:49,845 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-26 02:47:49,846 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052790>...
2023-09-26 02:47:49,846 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052790> to exit...
2023-09-26 02:47:49,846 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052790> exited successfully, releasing assignment.
2023-09-26 02:47:49,846 Partition revocation complete.
2023-09-26 02:47:50,061 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 36643}
2023-09-26 04:58:29,909 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-26 04:58:29,909 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052dc0>...
2023-09-26 04:58:29,909 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052dc0> to exit...
2023-09-26 04:58:29,918 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052dc0> exited successfully, releasing assignment.
2023-09-26 04:58:29,918 Partition revocation complete.
2023-09-26 04:58:31,223 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 41129}
2023-09-26 04:58:35,910 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-26 04:58:35,910 Closing <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052a60>...
2023-09-26 04:58:35,910 Waiting for <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052a60> to exit...
2023-09-26 04:58:35,917 <arroyo.processing.strategies.guard.StrategyGuard object at 0x7fb2d7052a60> exited successfully, releasing assignment.
2023-09-26 04:58:35,917 Partition revocation complete.
2023-09-26 04:58:36,179 New partitions assigned: {Partition(topic=Topic(name='ingest-replay-events'), index=0): 41134}
2023-09-26 11:11:39,503 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 288, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 368, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 149, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 107, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 122, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 330, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 239, in join
    response = self._result.result(timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
2023-09-26 11:11:39,514 Closing <arroyo.backends.kafka.consumer.KafkaConsumer object at 0x7fb2d70522b0>...
2023-09-26 11:11:39,516 Partitions to revoke: [Partition(topic=Topic(name='ingest-replay-events'), index=0)]
2023-09-26 11:11:39,516 Partition revocation complete.
2023-09-26 11:11:39,520 Processor terminated
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 260, in consumer
    consumer.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 288, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 368, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 101, in poll
    self.__inner_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/guard.py", line 37, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/reduce.py", line 149, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/run_task_in_threads.py", line 107, in poll
    result = future.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/snuba/snuba/consumers/strategy_factory.py", line 122, in flush_batch
    message.payload.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 330, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 160, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 347, in write
    batch.join(timeout=batch_join_timeout)
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 239, in join
    response = self._result.result(timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/stdlib.py", line 126, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

fzred avatar Sep 27 '23 01:09 fzred

In case this is useful for others, my band-aid fix for this has been to auto-restart the snuba-replays-consumer service with a cron job:

*/1 * * * * (cd path/to/your/sentry/checkout && docker compose logs --tail=2 snuba-replays-consumer | grep -F 'Remote end closed connection without response' && docker compose restart snuba-replays-consumer || echo "No need to restart.") 2>&1 | systemd-cat -t replays-restart

You can view the logs from the job with:

journalctl -t replays-restart
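
The same check can also be written as a small script, which may be easier to read and adjust than the one-liner. This is only a sketch under a few assumptions: the `SENTRY_DIR` variable and the `needs_restart` function are names made up for this example, and the service name and error text are taken from the cron line above. One difference from the one-liner is that with `A && B || C` the final `echo` also runs when the restart itself fails, whereas the `if`/`else` here prints "No need to restart." only when the error text is absent.

```shell
#!/bin/sh
# Sketch of the restart check as a script (hypothetical; adapt to your setup).

# needs_restart reads log text on stdin and succeeds (exit 0) only if it
# contains the disconnect error seen in the traceback.
needs_restart() {
    grep -qF 'Remote end closed connection without response'
}

# Assumption: SENTRY_DIR is set to the self-hosted checkout directory.
if [ -n "${SENTRY_DIR:-}" ]; then
    cd "$SENTRY_DIR" || exit 1
    if docker compose logs --tail=2 snuba-replays-consumer 2>&1 | needs_restart; then
        echo "Disconnect detected, restarting snuba-replays-consumer."
        docker compose restart snuba-replays-consumer
    else
        echo "No need to restart."
    fi
else
    echo "SENTRY_DIR is not set; skipping the check." >&2
fi
```

To use it from cron, set `SENTRY_DIR` to the checkout path on the crontab line, call the script in place of the parenthesised command, and keep the `2>&1 | systemd-cat -t replays-restart` part so the output still lands in the journal.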

fpotter avatar Oct 17 '23 00:10 fpotter

Same exact issue happening to me. The cron workaround above helps, but there is definitely something else going on.

2snEM6 avatar Oct 18 '23 13:10 2snEM6

I'm currently running into this on a self-hosted instance running 24.1.0 using the EmberJS SDK. Was a resolution ever discovered?

csprocket777 avatar Jan 30 '24 23:01 csprocket777

We have not found a resolution yet, thanks for your patience.

hubertdeng123 avatar Feb 01 '24 20:02 hubertdeng123