Error during reelection
The single process that processes River jobs generates this error multiple times a day:
{
"_sysTime": "2024-12-16T19:04:31.633026767Z",
"_time": "2024-12-16T19:04:31Z",
"client_id": "magic-api-worker-1_2024_12_12T19_27_07_561681",
"err": {
"error": "error beginning transaction: context deadline exceeded",
"kind": "*fmt.wrapError"
},
"level": "error",
"message": "Elector: Error attempting reelection",
"process": "worker",
"sleep_duration": 1079.326989
}
Jobs are inserted by an insert-only client in another process. The few jobs (at most 2 per day) are still handled correctly. Postgres is not under heavy load and is on version 15.8.0. The driver is pgx v5.7.1. The connection is quite standard (no PgBouncer). River is v0.14.2.
How could I investigate the root cause of this issue?
The code is almost identical to your tutorial:
workers := river.NewWorkers()
river.AddWorker(workers, myWorker)

logger := slog.New(slogzerolog.Option{Level: slog.LevelInfo, Logger: &log.Logger}.NewZerologHandler())

riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
	Logger:  logger,
	Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
	Workers: workers,
})
Thanks
We've had similar reports of this before, but I've always had trouble trying to repro them. I've been running a demo work set for hours and have yet to see it once.
The reelection queries have a context timeout of 5 seconds on them, and it's possible the problem is as simple as that: for some setups, that's just not enough time. Is it possible that you have a high-latency link between your app and DB?
Aside from that, any chance you have a slow query log or some other similar insight enabled in Postgres?
https://www.metisdata.io/blog/how-enabling-slow-query-log-enhances-postgres-observability
If you could look for the reelection queries and see if they're taking a really long time, that'd give us a path to start investigating (i.e. you could then extract one and EXPLAIN ANALYZE it to get more detail).
The election/reelection queries look like this:
-- name: LeaderAttemptElect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
DO NOTHING;
-- name: LeaderAttemptReelect :execrows
INSERT INTO river_leader(leader_id, elected_at, expires_at)
VALUES (@leader_id, now(), now() + @ttl::interval)
ON CONFLICT (name)
DO UPDATE SET
expires_at = now() + @ttl
WHERE
river_leader.leader_id = @leader_id;
Thanks for your help @brandur.
Only the LeaderAttemptReelect query is listed in the top 100 of the query statistics (pg_stat_statements):
- 237149 calls
- 237149 rows
- total time 20.11 sec
- mean time 0.08 ms
- min time 0.05 ms
- max time 13.88 ms
- time stddev 0.05 ms
Its explain plan is:
Insert on river_leader (cost=0.00..0.01 rows=0 width=0) (actual time=0.190..0.190 rows=0 loops=1)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: river_leader_pkey
Conflict Filter: (river_leader.leader_id = 'test'::text)
Rows Removed by Conflict Filter: 1
Tuples Inserted: 0
Conflicting Tuples: 1
-> Result (cost=0.00..0.01 rows=1 width=80) (actual time=0.003..0.003 rows=1 loops=1)
Planning Time: 0.114 ms
Execution Time: 0.236 ms
The plan of LeaderAttemptElect is:
Insert on river_leader (cost=0.00..0.01 rows=0 width=0) (actual time=0.057..0.057 rows=0 loops=1)
Conflict Resolution: NOTHING
Conflict Arbiter Indexes: river_leader_pkey
Tuples Inserted: 0
Conflicting Tuples: 1
-> Result (cost=0.00..0.01 rows=1 width=80) (actual time=0.003..0.003 rows=1 loops=1)
Planning Time: 0.075 ms
Execution Time: 0.085 ms
Everything seems correct to me on that part.
Thanks for the extra detail. Weird ... yeah that looks fine to me. The total time of 20 sec seems concerning, but that's the aggregate sum of all execution times so that's actually fine.
One more question: is it possible that you have a fairly slow link between DB and app? Like, are you hosting DB/app across regions or across providers or something of that nature? That could explain why it works most of the time, but where an upper bound of 5 seconds might not always be enough.
I don't think the link between db and server is the issue. It's fast and reliable (low activity, same datacenter). FYI it's a standard configuration with a "Standard 512M" Postgres on scalingo.com
The weird thing could be the log field "sleep_duration". I don't know the unit, but the values are always between 900 and 2,000.
Note that two new errors occurred today:
{
"_sysTime": "2024-12-19T08:35:39.395206877Z",
"_time": "2024-12-19T08:35:38Z",
"err": {
"error": "unexpected EOF",
"kind": "*errors.errorString"
},
"level": "error",
"message": "Notifier: Error from notification wait",
"process": "worker"
}
{
"_sysTime": "2024-12-19T08:35:39.452771129Z",
"_time": "2024-12-19T08:35:38Z",
"attempt": 0,
"err": {
"error": "unexpected EOF",
"kind": "*errors.errorString"
},
"level": "error",
"message": "Notifier: Error running listener (will attempt reconnect after backoff)",
"process": "worker",
"sleep_duration": 506.329371
}
That seems to indicate that the listener lost its connection, which should be quite a rare occurrence. It probably has the same root cause as whatever's causing the trouble with the elector on your link.
Regarding sleep duration. Here's the code for that:
sleepDuration := serviceutil.ExponentialBackoff(attempt, serviceutil.MaxAttemptsBeforeResetDefault)
n.Logger.ErrorContext(ctx, n.Name+": Error running listener (will attempt reconnect after backoff)",
"attempt", attempt, "err", err, "sleep_duration", sleepDuration)
Slog's behavior is such that if you're using a text handler, it'll use time.Duration's String(), which produces something like 500ms. If you're using a JSON handler, it gets printed as a number in nanoseconds:
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	dur := 500 * time.Millisecond

	loggerText := slog.New(slog.NewTextHandler(os.Stdout, nil))
	loggerText.Info("with text handler", "sleep_duration", dur)

	loggerJSON := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	loggerJSON.Info("with JSON handler", "sleep_duration", dur)
}
$ go run main.go
time=2024-12-19T21:36:41.948-07:00 level=INFO msg="with text handler" sleep_duration=500ms
{"time":"2024-12-19T21:36:41.949239-07:00","level":"INFO","msg":"with JSONtext handler","sleep_duration":500000000}
You're getting what looks like a milliseconds number as a float, so you must be using some kind of custom slog handler to get that.
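For what it's worth, float milliseconds is what zerolog produces for durations by default, so my guess (and it's only a guess about how slogzerolog routes duration attributes) is that your value ends up going through zerolog's Dur. A minimal zerolog-only illustration:

package main

import (
	"os"
	"time"

	"github.com/rs/zerolog"
)

func main() {
	logger := zerolog.New(os.Stdout)

	// With zerolog's defaults (DurationFieldUnit = time.Millisecond,
	// DurationFieldInteger = false), a time.Duration is rendered as a float
	// number of milliseconds, which matches values like 1079.326989 above.
	logger.Info().Dur("sleep_duration", 1079326989*time.Nanosecond).Msg("example")
	// => {"level":"info","sleep_duration":1079.326989,"message":"example"}
}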
We had a hard time with the same error messages on our side. It turned out to be starvation of the pgx pool connections (because of a crappy worker on our side), leaving River unable to send any query:
time=2025-02-05T14:12:05.277Z level=ERROR msg="Elector: Error attempting reelection" client_id=c2b93b2b3676_2025_02_05T14_07_32_004429 err="error beginning transaction: context deadline exceeded" sleep_duration=905.18934ms
Hope this helps
I get these errors as well. Often in a pre-prod environment where there's next to no activity. I think I have river configured to do some periodic jobs once per day and that's it. Very minimal workload.
Seeing a similar issue with
time=2025-02-18T14:31:58.960Z level=ERROR msg="Elector: Error attempting to elect" client_id=f58e825e1032_2025_02_16T07_47_20 err="error beginning transaction: context deadline exceeded" num_errors=230 sleep_duration=8m45.679068094s
Using supabase ipv4 direct connection (no session pooling)
We had a hard time with the same error messages on our side. It turned out to be starvation of the pgx pool connections (because of a crappy worker on our side), leaving River unable to send any query:
Ditto. I had a similar problem.
The crappy worker in my case was something that used the same pgxpool as river, called .Begin() and never .Commit() or .Rollback() on the pgx.Tx.
I'm surprised no linter caught that on my side! But sharing here in case anyone else has this problem: a great first place to check is whether you're leaking connections.
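In case it helps anyone landing here, this is the shape I've switched to so a forgotten Commit/Rollback can't pin a conn forever (sketch only; the package name, table, and query are made up):

package dbhelpers

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// doWork shows the Begin / deferred Rollback / Commit pattern that avoids
// leaking pool connections: Rollback after a successful Commit is a no-op
// (it returns pgx.ErrTxClosed), so the defer is always safe to leave in place.
func doWork(ctx context.Context, pool *pgxpool.Pool) error {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op once Commit has run

	if _, err := tx.Exec(ctx, "UPDATE widgets SET processed = true"); err != nil { // placeholder query
		return err
	}

	return tx.Commit(ctx)
}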
Actually, that didn't fully resolve my issue. It made it self-recoverable, but it did not fix the underlying problem.
We rely on transactional queueing (and transactional completion!) because we want workers to submit work for other workers (fan out).
We also have an insert middleware that checks the number of jobs that have come in recently and acts as a circuit breaker if there are too many.
This means every worker, for every Work() function, will use 3 pg connections: one from River, I believe, which should have a transaction around the Work() function (I think?); one from the worker itself calling pool.Begin() so it can give a transaction to the InsertTx function; and lastly one in the insert middleware that calls .Begin() in order to then call .ListTx() and potentially .QueuePauseTx().
When the default for pgxpool v5 is 4 connections, we're going to hit a deadlock pretty soon! And it may never happen that the Work() function for one particular job gets all 3 connections it needs.
It would be interesting if the River client in river.ClientFromContextSafely had somehow embedded the transaction that wraps the Work function (reading the code... I actually can't determine if this is a true statement. I thought it was based on reading the docs, but is fetching from river_jobs and then marking as complete actually done in two separate transactions? If so... this idea won't work!)
When the default for pgxpool v5 is 4 connections, we're going to hit a deadlock pretty soon! And it may never happen that the Work() function for one particular job gets all 3 connections it needs.
Hmm, this is true, but I think it'd be fair to say that if you're going to use something like River, using the default max conns of 4 is almost certainly going to be far too low.
Your example is only suggesting the use of a single worker too. River's designed to execute potentially dozens/hundreds/thousands of jobs in parallel, and with a few more jobs you're going to hit your max connections ~immediately.
Generally speaking, any Postgres of non-trivial size will be able to handle a lot more than four connections, so I'd suggest raising your maximum. The good news is that River's minimum connection use is a small fixed number: as you scale out your number of workers, it'll still only use a couple of connections to support the producer/completer.
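As a sketch, raising the ceiling is a one-liner either way; the DSNs and the number 25 here are placeholders, so size them to your workload and to Postgres's max_connections:

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()

	// Option 1: via the DSN's pool_max_conns parameter.
	pool1, err := pgxpool.New(ctx, "postgres://localhost:5432/mydb?pool_max_conns=25") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer pool1.Close()

	// Option 2: via the parsed config.
	cfg, err := pgxpool.ParseConfig("postgres://localhost:5432/mydb") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	cfg.MaxConns = 25

	pool2, err := pgxpool.NewWithConfig(ctx, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer pool2.Close()
}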
It would be interesting if the River client in river.ClientFromContextSafely had somehow embedded the transaction that wraps the Work function (reading the code... I actually can't determine if this is a true statement. I thought it was based on reading the docs, but is fetching from river_jobs and then marking as complete actually done in two separate transactions? If so... this idea won't work!)
Yeah, job fetching and job completion are done in two separate operations outside of a transaction. Especially in a Postgres installation where there's a lot of activity, long running transactions can be a huge problem, and a default of keeping one open around all jobs being worked would be quite dangerous. Not tying them to one particular transaction also means that jobs can be fetched and completed in bulk.
We recently had a similar issue. Our 2 River workers were no longer processing jobs. We had to manually restart the workers to fix the issue. We didn't notice any problems with our database as other services were running fine.
{"level":"ERROR","msg":"Elector: Error attempting to elect","err":"error beginning transaction: context deadline exceeded","num_errors":1,"sleep_duration":1026737572}
{"level":"ERROR","msg":"Elector: Error attempting to elect","err":"error beginning transaction: context deadline exceeded","num_errors":2,"sleep_duration":2093754434}
{"level":"ERROR","msg":"Elector: Error attempting to elect","err":"error beginning transaction: context deadline exceeded","num_errors":3,"sleep_duration":4088133206}
...
{"level":"ERROR","msg":"producer: Queue status update, error updating in database","err":"context deadline exceeded"}
...
{"level":"ERROR","msg":"Elector: Error attempting to elect","err":"error beginning transaction: context deadline exceeded","num_errors":39,"sleep_duration":233456556669}
We are currently using a connection pool with River version v0.11.4 and pgx v5.7.2. Any help would be much appreciated.
@bgentry Any ideas on this one?
I suspect what's happening is that the connections that River's trying to use have become unusable for some reason or other. Normally, you'd rely on one of the Postgres timeouts to trigger and return an error (and, I think, trigger refreshing the connection in the pgx pool), but the context timeouts trigger first and cover up whatever error you might've otherwise gotten. This is similar to cases where context timeouts had been applied in the test suite and produced an undiagnosable error message rather than the stack trace dumps for all active goroutines you'd get using go test -timeout instead.
Maybe you have debugging techniques for these that I don't know about, but I'm tempted to say that we should start unwinding the use of context timeouts, or at least make them configurable so that users who want it would have the option of getting better pgx/Postgres timeout messages instead.
It's definitely tough to debug this based on the available info and without a reproduction. Trying to compile a list of what we do know:
- As far as we can tell, the leader election query itself is always fast, or at least we have no evidence to suggest otherwise. Makes sense, it's a tiny unlogged table that should always be in memory or quick to load, so unless there's a stuck transaction blocking access this query should be ~immediate.
- The error message "error beginning transaction: context deadline exceeded" indicates it was the Begin which failed on this WithTxV call, i.e. the context was either already expired when the operation was started, or the presently-hardcoded 5 second deadlineTimeout was exceeded while simply attempting to execute BEGIN on the database connection.
- Alternatively, the timeout may have occurred during Acquire. In pgx, a conn must be acquired from the pool (already connected) before the actual operation (Begin) is run on it. The context timeout is inclusive of acquire time.
- When pgx encounters a context timeout on Begin, it disposes the connection, which calls through to pgconn.Close. This means the conn will not be returned to the pool, so any subsequent attempts must be on a different or new conn.
- When the underlying puddle Pool.Acquire call is made, the context may be cancelled prior to the new conn being established. However, in that case even though the Acquire will return an error, the underlying resource creation will not be cancelled, so in theory if it were merely slow to establish new conns you would still see some of them get picked up eventually. Unless of course the pool is already full, in which case no new resources would be acquired, and this operation would merely time out without any real attempt being made to acquire a connection.
- When this situation occurs as outlined by @almottier, it appears to occur endlessly until a restart. This means that the sum of Acquire + Begin is taking longer than 5 seconds indefinitely under these circumstances.
- Fetching of jobs also stopped per @almottier, again indicating the client is unable to successfully execute queries periodically.
This all leads me toward a conclusion that your pgxpool is overloaded and/or at capacity and simply unable to consistently (or at all?) acquire a connection within 5 seconds. Potentially you have other operations that are acquiring and never releasing conns on the same pool, or are simply consuming all conns with very slow operations. This seems most likely to me given that a restart fixes it; essentially you are likely leaking pool conns or tying them up somehow.
The simplest way to validate this would be to set up a ConnConfig.Tracer that implements AcquireTracer so that you can see how long your acquire calls are taking. You can also periodically grab stats from your pool so that you can see how long acquires are taking, how many conns are free, etc. Could you please try one of those and report back what you learn?
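A minimal sketch of what that tracer could look like (the acquireLogger type, the one-second threshold, and the slog output are illustrative only, not anything shipped with River or pgx):

package main

import (
	"context"
	"log/slog"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

// acquireLogger logs pool acquires that are slow or that fail outright.
type acquireLogger struct {
	threshold time.Duration
}

type acquireStartKey struct{}

// TraceAcquireStart stashes the acquire's start time in the context.
func (t *acquireLogger) TraceAcquireStart(ctx context.Context, _ *pgxpool.Pool, _ pgxpool.TraceAcquireStartData) context.Context {
	return context.WithValue(ctx, acquireStartKey{}, time.Now())
}

// TraceAcquireEnd reports acquires that exceeded the threshold or errored,
// along with a snapshot of pool stats.
func (t *acquireLogger) TraceAcquireEnd(ctx context.Context, pool *pgxpool.Pool, data pgxpool.TraceAcquireEndData) {
	start, _ := ctx.Value(acquireStartKey{}).(time.Time)
	if elapsed := time.Since(start); data.Err != nil || elapsed > t.threshold {
		stat := pool.Stat()
		slog.Warn("slow or failed pool acquire",
			"elapsed", elapsed,
			"err", data.Err,
			"acquired_conns", stat.AcquiredConns(),
			"total_conns", stat.TotalConns(),
			"max_conns", stat.MaxConns(),
		)
	}
}

// ConnConfig.Tracer is typed as pgx.QueryTracer, so the tracer also needs
// these (here no-op) methods for pgxpool to pick up the AcquireTracer methods.
func (t *acquireLogger) TraceQueryStart(ctx context.Context, _ *pgx.Conn, _ pgx.TraceQueryStartData) context.Context {
	return ctx
}

func (t *acquireLogger) TraceQueryEnd(context.Context, *pgx.Conn, pgx.TraceQueryEndData) {}

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://localhost:5432/mydb") // placeholder DSN
	if err != nil {
		panic(err)
	}
	cfg.ConnConfig.Tracer = &acquireLogger{threshold: time.Second}

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		panic(err)
	}
	defer pool.Close()
}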
It's also a good moment to highlight that you don't necessarily need River to use the same conn pool as the rest of your app, though sharing one is probably the right thing to do most of the time. Fundamentally, a running river.Client needs to be able to periodically run queries against the DB pool you give it, and if it isn't able to, something will break.
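If you do want to isolate River that way, it could look something like the following (sketch only; the DSNs, pool sizes, and the NoOpArgs/NoOpWorker types are placeholders for your own setup):

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// NoOpArgs/NoOpWorker are stand-ins for your real job args and worker.
type NoOpArgs struct{}

func (NoOpArgs) Kind() string { return "no_op" }

type NoOpWorker struct {
	river.WorkerDefaults[NoOpArgs]
}

func (w *NoOpWorker) Work(ctx context.Context, job *river.Job[NoOpArgs]) error { return nil }

func main() {
	ctx := context.Background()

	// Pool used by the rest of the application.
	appPool, err := pgxpool.New(ctx, "postgres://localhost:5432/mydb?pool_max_conns=20") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer appPool.Close()

	// Separate, smaller pool reserved for River's own queries, so a leak or
	// spike in application queries can't starve the elector/notifier/producer.
	riverPool, err := pgxpool.New(ctx, "postgres://localhost:5432/mydb?pool_max_conns=10") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer riverPool.Close()

	workers := river.NewWorkers()
	river.AddWorker(workers, &NoOpWorker{})

	riverClient, err := river.NewClient(riverpgxv5.New(riverPool), &river.Config{
		Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
		Workers: workers,
	})
	if err != nil {
		log.Fatal(err)
	}

	if err := riverClient.Start(ctx); err != nil {
		log.Fatal(err)
	}
	defer riverClient.Stop(ctx)
}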
@brandur I'm not sure that any of the above would be helped by removing context timeouts, as we're probably not actually seeing a Postgres error here; in fact, you'd probably see the client waiting even longer for operations that aren't ever going to succeed.
+1 to leaving the timeouts in (although configurable is always nice!)
I would not have been able to root-cause my issue as easily (a small pool shared with River that actually caused a deadlock) if River had just hung trying to acquire a connection (a SIGQUIT + stack traces might have hinted at the problem, but not as clearly as the error did).
Hello, it appears we had this issue because we were using the same database pool for River's internal operations and our business code. We reached the max connections in our business code, which left no connections for River's internal operations. We fixed it by using a separate database pool for River and refactoring some transactions in our code. Thank you for helping us identify this problem.
Hello, it appears we had this issue because we were using the same database pool for River's internal operations and our business code. We reached the max connections in our business code, which left no connections for River's internal operations. We fixed it by using a separate database pool for River and refactoring some transactions in our code. Thank you for helping us identify this problem.
Did the same a few months ago. Been working fine since, even though it's not optimal in theory.
Closing this one out, I think it's pretty likely these issues are typically caused by conn pool contention. Thanks all for confirming! :pray: