Error during reelection
The single process that processes River jobs generates this error multiple times a day:
{
"_sysTime": "2024-12-16T19:04:31.633026767Z",
"_time": "2024-12-16T19:04:31Z",
"client_id": "magic-api-worker-1_2024_12_12T19_27_07_561681",
"err": {
"error": "error beginning transaction: context deadline exceeded",
"kind": "*fmt.wrapError"
},
"level": "error",
"message": "Elector: Error attempting reelection",
"process": "worker",
"sleep_duration": 1079.326989
}
Jobs are inserted by an insert-only client in another process. The few jobs (at most 2 per day) are still handled correctly. Postgres is not under heavy load and is on version 15.8.0. The driver is pgx v5.7.1. The connection is quite standard (no PgBouncer). River is v0.14.2.
How could I investigate the root cause of this issue?
The code is almost identical to your tutorial:
workers := river.NewWorkers()
river.AddWorker(workers, myWorker)

logger := slog.New(slogzerolog.Option{Level: slog.LevelInfo, Logger: &log.Logger}.NewZerologHandler())

riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
	Logger:  logger,
	Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
	Workers: workers,
})
Thanks
We've had similar reports of this before, but I've always had trouble trying to repro them. I've been running a demo work set for hours and have yet to see it once.
The reelection queries have a context timeout of 5 seconds on them, and it's possible the problem is as simple as that: for some setups, that's just not enough time. Is it possible that you have a high-latency link between your app and DB?
Aside from that, any chance you have a slow query log or some other similar insight enabled in Postgres?
https://www.metisdata.io/blog/how-enabling-slow-query-log-enhances-postgres-observability
If you could look for the reelection queries and see if they're taking a really long time, that'd give us a path to start investigating (i.e. you could then extract one and EXPLAIN ANALYZE it to get more detail).
The election/reelection queries look like this:
-- name: LeaderAttemptElect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
DO NOTHING;
-- name: LeaderAttemptReelect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
DO UPDATE SET
expires_at = now() + @ttl
WHERE
river_leader.leader_id = @leader_id;
Thanks for your help @brandur.
Only the LeaderAttemptReelect query is listed in the top 100 of the query statistics (pg_stat_statements):
- 237149 calls
- 237149 rows
- total time 20.11 sec
- mean time 0.08 ms
- min time 0.05 ms
- max time 13.88 ms
- time stddev 0.05 ms
Its explain plan is:
Insert on river_leader (cost=0.00..0.01 rows=0 width=0) (actual time=0.190..0.190 rows=0 loops=1)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: river_leader_pkey
Conflict Filter: (river_leader.leader_id = 'test'::text)
Rows Removed by Conflict Filter: 1
Tuples Inserted: 0
Conflicting Tuples: 1
-> Result (cost=0.00..0.01 rows=1 width=80) (actual time=0.003..0.003 rows=1 loops=1)
Planning Time: 0.114 ms
Execution Time: 0.236 ms
The plan of LeaderAttemptElect is:
Insert on river_leader (cost=0.00..0.01 rows=0 width=0) (actual time=0.057..0.057 rows=0 loops=1)
Conflict Resolution: NOTHING
Conflict Arbiter Indexes: river_leader_pkey
Tuples Inserted: 0
Conflicting Tuples: 1
-> Result (cost=0.00..0.01 rows=1 width=80) (actual time=0.003..0.003 rows=1 loops=1)
Planning Time: 0.075 ms
Execution Time: 0.085 ms
Everything seems correct to me on that part.
Thanks for the extra detail. Weird ... yeah that looks fine to me. The total time of 20 sec seems concerning, but that's the aggregate sum of all execution times so that's actually fine.
One more question: is it possible that you have a fairly slow link between DB and app? Like, are you hosting DB/app across regions or across providers or something of that nature? That could explain why it works most of the time, but where an upper bound of 5 seconds might not always be enough.
I don't think the link between db and server is the issue. It's fast and reliable (low activity, same datacenter). FYI it's a standard configuration with a "Standard 512M" Postgres on scalingo.com
The weird thing could be the log field "sleep_duration". I don't know the unit, but the values are always between 900 and 2,000.
Note that two new errors occurred today:
{
"_sysTime": "2024-12-19T08:35:39.395206877Z",
"_time": "2024-12-19T08:35:38Z",
"err": {
"error": "unexpected EOF",
"kind": "*errors.errorString"
},
"level": "error",
"message": "Notifier: Error from notification wait",
"process": "worker"
}
{
"_sysTime": "2024-12-19T08:35:39.452771129Z",
"_time": "2024-12-19T08:35:38Z",
"attempt": 0,
"err": {
"error": "unexpected EOF",
"kind": "*errors.errorString"
},
"level": "error",
"message": "Notifier: Error running listener (will attempt reconnect after backoff)",
"process": "worker",
"sleep_duration": 506.329371
}
That seems to indicate that the listener lost its connection, which should be quite a rare occurrence. It probably has the same root cause as whatever's causing the trouble with the elector on your link.
Regarding sleep duration. Here's the code for that:
sleepDuration := serviceutil.ExponentialBackoff(attempt, serviceutil.MaxAttemptsBeforeResetDefault)
n.Logger.ErrorContext(ctx, n.Name+": Error running listener (will attempt reconnect after backoff)",
"attempt", attempt, "err", err, "sleep_duration", sleepDuration)
Slog's behavior is such that if you're using a text handler, it'll use time.Duration's String(), which produces something like 500ms. If you're using a JSON handler, it gets printed as a number in nanoseconds:
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	dur := 500 * time.Millisecond

	loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
	loggerText.Info("with text handler", "sleep_duration", dur)

	loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	loggerJSON.Info("with JSON handler", "sleep_duration", dur)
}
$ go run main.go
time=2024-12-19T21:36:41.948-07:00 level=INFO msg="with text handler" sleep_duration=500ms
{"time":"2024-12-19T21:36:41.949239-07:00","level":"INFO","msg":"with JSONtext handler","sleep_duration":500000000}
You're getting what looks like a milliseconds number as a float, so you must be using some kind of custom slog handler to get that.
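For what it's worth, float milliseconds is what zerolog produces for durations by default, so my guess (and it's only a guess about how slogzerolog routes duration attributes) is that your value ends up going through zerolog's Dur. A minimal zerolog-only illustration:

package main

import (
	"os"
	"time"

	"github.com/rs/zerolog"
)

func main() {
	logger := zerolog.New(os.Stdout)

	// With zerolog's defaults (DurationFieldUnit = time.Millisecond,
	// DurationFieldInteger = false), a time.Duration is rendered as a float
	// number of milliseconds, which matches values like 1079.326989 above.
	logger.Info().Dur("sleep_duration", 1079326989*time.Nanosecond).Msg("example")
	// => {"level":"info","sleep_duration":1079.326989,"message":"example"}
}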
We had a hard time with the same error messages on our side. It turned out to be starvation of the pgx pool connections (because of a crappy worker on our side), leaving River unable to send any query:
time=2025-02-05T14:12:05.277Z level=ERROR msg="Elector: Error attempting reelection" client_id=c2b93b2b3676_2025_02_05T14_07_32_004429 err="error beginning transaction: context deadline exceeded" sleep_duration=905.18934ms
Hope this helps
I get these errors as well. Often in a pre-prod environment where there's next to no activity. I think I have river configured to do some periodic jobs once per day and that's it. Very minimal workload.
Seeing a similar issue with
time=2025-02-18T14:31:58.960Z level=ERROR msg="Elector: Error attempting to elect" client_id=f58e825e1032_2025_02_16T07_47_20 err="error beginning transaction: context deadline exceeded" num_errors=230 sleep_duration=8m45.679068094s
Using supabase ipv4 direct connection (no session pooling)
We had a hard time with the same error messages on our side. It turned out to be starvation of the pgx pool connections (because of a crappy worker on our side), leaving River unable to send any query:
Ditto. I had a similar problem.
The crappy worker in my case was something that used the same pgxpool as river, called .Begin() and never .Commit() or .Rollback() on the pgx.Tx.
I'm surprised no linter caught that on my side! But sharing here in case anyone else has this problem: a great first place to check is whether you're leaking connections.
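In case it helps anyone landing here, this is the shape I've switched to so a forgotten Commit/Rollback can't pin a conn forever (sketch only; the package name, table, and query are made up):

package dbhelpers

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// doWork shows the Begin / deferred Rollback / Commit pattern that avoids
// leaking pool connections: Rollback after a successful Commit is a no-op
// (it returns pgx.ErrTxClosed), so the defer is always safe to leave in place.
func doWork(ctx context.Context, pool *pgxpool.Pool) error {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op once Commit has run

	if _, err := tx.Exec(ctx, "UPDATE widgets SET processed = true"); err != nil { // placeholder query
		return err
	}

	return tx.Commit(ctx)
}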
Actually, that didn't fully resolve my issue. It made it self-recoverable, but it did not fix the underlying problem.
We rely on transactional queueing (and transactional completion!) because we want workers to submit work for other workers (fan out).
We also have an insert middleware that checks the number of jobs that have come in recently and acts as a circuit breaker if there are too many.
This means every worker, for every Work() function, will use 3 pg connections: one from River, I believe, which should have a transaction around the Work() function (I think?); one from the worker itself calling pool.Begin() so it can give a transaction to the InsertTx function; and lastly one in the insert middleware that calls .Begin() in order to then call .ListTx() and potentially .QueuePauseTx().
When the default for pgxpool v5 is 4 connections, we're going to hit a deadlock pretty soon! And it may never happen that the Work() function for one particular job gets all 3 connections it needs.
It would be interesting if the River client in river.ClientFromContextSafely had somehow embedded the transaction that wraps the Work function (reading the code... I actually can't determine if this is a true statement. I thought it was based on reading the docs, but is fetching from river_jobs and then marking as complete actually done in two separate transactions? If so... this idea won't work!)
When the default for pgxpool v5 is 4 connections, we're going to hit a deadlock pretty soon! And it may never happen that the Work() function for one particular job gets all 3 connections it needs.
Hmm, this is true, but I think it'd be fair to say that if you're going to use something like River, using the default max conns of 4 is almost certainly going to be far too low.
Your example is only suggesting the use of a single worker too. River's designed to execute potentially dozens/hundreds/thousands of jobs in parallel, and with a few more jobs you're going to hit your max connections ~immediately.
Generally speaking, any Postgres of non-trivial size will be able to handle a lot more than four connections, so I'd suggest raising your maximum. The good news is that River's minimum connection use is a small fixed number: as you scale out your number of workers, it'll still only use a couple of connections to support the producer/completer.
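As a sketch, raising the ceiling is a one-liner either way; the DSNs and the number 25 here are placeholders, so size them to your workload and to Postgres's max_connections:

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()

	// Option 1: via the DSN's pool_max_conns parameter.
	pool1, err := pgxpool.New(ctx, "postgres://localhost:5432/mydb?pool_max_conns=25") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer pool1.Close()

	// Option 2: via the parsed config.
	cfg, err := pgxpool.ParseConfig("postgres://localhost:5432/mydb") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	cfg.MaxConns = 25

	pool2, err := pgxpool.NewWithConfig(ctx, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer pool2.Close()
}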
It would be interesting if the River client in river.ClientFromContextSafely had somehow embedded the transaction that wraps the Work function (reading the code... I actually can't determine if this is a true statement. I thought it was based on reading the docs, but is fetching from river_jobs and then marking as complete actually done in two separate transactions? If so... this idea won't work!)
Yeah, job fetching and job completion are done in two separate operations outside of a transaction. Especially in a Postgres installation where there's a lot of activity, long running transactions can be a huge problem, and a default of keeping one open around all jobs being worked would be quite dangerous. Not tying them to one particular transaction also means that jobs can be fetched and completed in bulk.
We recently had a similar issue. Our 2 River workers were no longer processing jobs. We had to manually restart the workers to fix the issue. We didn't notice any problems with our database as other services were running fine.
{"level":"ERROR","msg":"Elector: Error attempting to elect","err":"error beginning transaction: context deadline exceeded","num_errors":1,"sleep_duration":1026737572}
{"level":"ERROR","msg":"Elector: Error attempting to elect","err":"error beginning transaction: context deadline exceeded","num_errors":2,"sleep_duration":2093754434}
{"level":"ERROR","msg":"Elector: Error attempting to elect","err":"error beginning transaction: context deadline exceeded","num_errors":3,"sleep_duration":4088133206}
...
{"level":"ERROR","msg":"producer: Queue status update, error updating in database","err":"context deadline exceeded"}
...
{"level":"ERROR","msg":"Elector: Error attempting to elect","err":"error beginning transaction: context deadline exceeded","num_errors":39,"sleep_duration":233456556669}
We are currently using a connection pool with River version v0.11.4 and pgx v5.7.2. Any help would be much appreciated.
@bgentry Any ideas on this one?
I suspect what's happening is that the connections that River's trying to use have become unusable for some reason or other. Normally, you'd rely on one of the Postgres timeouts to trigger and return an error (and, I think, trigger refreshing the connection in the pgx pool), but the context timeouts trigger first and cover up whatever error you might've otherwise gotten. This is similar to cases where context timeouts had been applied in the test suite and produced an undiagnosable error message rather than the stack trace dumps for all active goroutines you'd get using go test -timeout instead.
Maybe you have debugging techniques for these that I don't know about, but I'm tempted to say that we should start unwinding the use of context timeouts, or at least make them configurable so that users who want it would have the option of getting better pgx/Postgres timeout messages instead.
It's definitely tough to debug this based on the available info and without a reproduction. Trying to compile a list of what we do know:
- As far as we can tell, the leader election query itself is always fast, or at least we have no evidence to suggest otherwise. Makes sense, it's a tiny unlogged table that should always be in memory or quick to load, so unless there's a stuck transaction blocking access this query should be ~immediate.
- The error message "error beginning transaction: context deadline exceeded" indicates it was the Begin which failed on this WithTxV call, i.e. the context was either already expired when the operation was started, or the presently-hardcoded 5 second deadlineTimeout was exceeded while simply attempting to execute BEGIN on the database connection.
- Alternatively, the timeout may have occurred during Acquire. In pgx, a conn must be acquired from the pool (already connected) before the actual operation (Begin) is run on it. The context timeout is inclusive of acquire time.
- When pgx encounters a context timeout on Begin, it disposes the connection, which calls through to pgconn.Close. This means the conn will not be returned to the pool, so any subsequent attempts must be on a different or new conn.
- When the underlying puddle Pool.Acquire call is made, the context may be cancelled prior to the new conn being established. However, in that case even though the Acquire will return an error, the underlying resource creation will not be cancelled, so in theory if it were merely slow to establish new conns you would still see some of them get picked up eventually. Unless of course the pool is already full, in which case no new resources would be acquired, and this operation would merely time out without any real attempt being made to acquire a connection.
- When this situation occurs as outlined by @almottier, it appears to occur endlessly until a restart. This means that the sum of Acquire + Begin is taking longer than 5 seconds indefinitely under these circumstances.
- Fetching of jobs also stopped per @almottier, again indicating the client is unable to successfully execute queries periodically.
This all leads me toward a conclusion that your pgxpool is overloaded and/or at capacity and simply unable to consistently (or at all?) acquire a connection within 5 seconds. Potentially you have other operations that are acquiring and never releasing conns on the same pool, or are simply consuming all conns with very slow operations. This seems most likely to me given that a restart fixes it; essentially you are likely leaking pool conns or tying them up somehow.
The simplest way to validate this would be to set up a ConnConfig.Tracer that implements AcquireTracer so that you can see how long your acquire calls are taking. You can also periodically grab stats from your pool so that you can see how long acquires are taking, how many conns are free, etc. Could you please try one of those and report back what you learn?
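A minimal sketch of what that tracer could look like (the acquireLogger type, the one-second threshold, and the slog output are illustrative only, not anything shipped with River or pgx):

package main

import (
	"context"
	"log/slog"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

// acquireLogger logs pool acquires that are slow or that fail outright.
type acquireLogger struct {
	threshold time.Duration
}

type acquireStartKey struct{}

// TraceAcquireStart stashes the acquire's start time in the context.
func (t *acquireLogger) TraceAcquireStart(ctx context.Context, _ *pgxpool.Pool, _ pgxpool.TraceAcquireStartData) context.Context {
	return context.WithValue(ctx, acquireStartKey{}, time.Now())
}

// TraceAcquireEnd reports acquires that exceeded the threshold or errored,
// along with a snapshot of pool stats.
func (t *acquireLogger) TraceAcquireEnd(ctx context.Context, pool *pgxpool.Pool, data pgxpool.TraceAcquireEndData) {
	start, _ := ctx.Value(acquireStartKey{}).(time.Time)
	if elapsed := time.Since(start); data.Err != nil || elapsed > t.threshold {
		stat := pool.Stat()
		slog.Warn("slow or failed pool acquire",
			"elapsed", elapsed,
			"err", data.Err,
			"acquired_conns", stat.AcquiredConns(),
			"total_conns", stat.TotalConns(),
			"max_conns", stat.MaxConns(),
		)
	}
}

// ConnConfig.Tracer is typed as pgx.QueryTracer, so the tracer also needs
// these (here no-op) methods for pgxpool to pick up the AcquireTracer methods.
func (t *acquireLogger) TraceQueryStart(ctx context.Context, _ *pgx.Conn, _ pgx.TraceQueryStartData) context.Context {
	return ctx
}

func (t *acquireLogger) TraceQueryEnd(context.Context, *pgx.Conn, pgx.TraceQueryEndData) {}

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://localhost:5432/mydb") // placeholder DSN
	if err != nil {
		panic(err)
	}
	cfg.ConnConfig.Tracer = &acquireLogger{threshold: time.Second}

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	defer pool.Close()
}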
It's also a good moment to highlight that you don't necessarily need River to use the same conn pool as the rest of your app, though sharing one is probably the right thing to do most of the time. Fundamentally, a running river.Client needs to be able to periodically run queries against the DB pool you give it, and if it isn't able to, something will break.
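If you do want to isolate River that way, it could look something like the following (sketch only; the DSNs, pool sizes, and the NoOpArgs/NoOpWorker types are placeholders for your own setup):

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// NoOpArgs/NoOpWorker are stand-ins for your real job args and worker.
type NoOpArgs struct{}

func (NoOpArgs) Kind() string { return "no_op" }

type NoOpWorker struct {
	river.WorkerDefaults[NoOpArgs]
}

func (w *NoOpWorker) Work(ctx context.Context, job *river.Job[NoOpArgs]) error { return nil }

func main() {
	ctx := context.Background()

	// Pool used by the rest of the application.
	appPool, err := pgxpool.New(ctx, "postgres://localhost:5432/mydb?pool_max_conns=20") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer appPool.Close()

	// Separate, smaller pool reserved for River's own queries, so a leak or
	// spike in application queries can't starve the elector/notifier/producer.
	riverPool, err := pgxpool.New(ctx, "postgres://localhost:5432/mydb?pool_max_conns=10") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer riverPool.Close()

	workers := river.NewWorkers()
	river.AddWorker(workers, &NoOpWorker{})

	riverClient, err := river.NewClient(riverpgxv5.New(riverPool), &river.Config{
		Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
		Workers: workers,
	})
	if err != nil {
		log.Fatal(err)
	}

	if err := riverClient.Start(ctx); err != nil {
		log.Fatal(err)
	}
	defer riverClient.Stop(ctx)
}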
@brandur I'm not sure that any of the above would be helped by removing context timeouts, as we're probably not actually seeing a Postgres error here; in fact, you'd probably see the client waiting even longer for operations that aren't ever going to succeed.
+1 to leaving the timeouts in (although configurable is always nice!)
I would not have been able to root-cause my issue as easily (a small pool shared with River that actually caused a deadlock) if River had just hung trying to acquire a connection (a SIGQUIT + stack traces might have hinted at the problem, but not as clearly as the error did).
Hello, it appears we had this issue because we were using the same database pool for River's internal operations and our business code. We reached the max connections in our business code, which left no connections for River's internal operations. We fixed it by using a separate database pool for River and refactoring some transactions in our code. Thank you for helping us identify this problem.
Hello, it appears we had this issue because we were using the same database pool for River's internal operations and our business code. We reached the max connections in our business code, which left no connections for River's internal operations. We fixed it by using a separate database pool for River and refactoring some transactions in our code. Thank you for helping us identify this problem.
Did the same a few months ago. Been working fine since, even though it's not optimal in theory.
Closing this one out, I think it's pretty likely these issues are typically caused by conn pool contention. Thanks all for confirming! :pray: