"Client was closed and is not queryable" when using client from pool in transactions
Ran into a very puzzling bug yesterday when trying to implement database transactions, related to but not the same issue as #1892.
Node version: 12.18.3
Pg version: 8.5.1
OS: Mac & Linux (replicated both locally and on CI)
I was using pool clients to create long-running transactions, and found that when I went to roll back a transaction, the database client had sometimes (but inconsistently) already been released back to the pool. We were using the async/await syntax for clients; example pseudocode below:
async function transaction() {
  const client = await pool.connect();
  try {
    await client.query('BEGIN;');
    await client.query('some insert query');
    const result = await client.query('some query');
    if (!result.isValid) {
      // jump to catch
      throw new Error('invalid result');
    }
    await client.query('COMMIT;');
    return result;
  } catch (err) {
    try {
      // error here
      await client.query('ROLLBACK;');
    } catch (e) {
      console.log('could not rollback: ', e);
    }
    throw err;
  } finally {
    client.release();
  }
}
When running our tests, at higher loads we would get the following error:
could not rollback: Error: Client was closed and is not queryable
The client wasn't being released by the client.release() call, since that always happens after any rollbacks. So I added some logging statements, and it looks like the Pool class (from pg-pool) is logging this.log('remove idle client') right before the error in the rollback happens. We found that setting idleTimeoutMillis: 0 fixed the issue, since it prevented the idle client timeout from ever being created.
The oddest part of this is that the clients didn't need to be idle for longer than the default of 10s for them to be closed - we logged the time at which the client was removed from the pool and the time at which the failed rollback was attempted, and the gap was often less than a second.
Is there some reason why the clients are being closed earlier than the timeout? Perhaps the timeouts from earlier instances of the clients aren't being cleared properly?
Any insight would be greatly appreciated. We'd love to be able to use these transactions in prod, but due to this issue they don't seem to scale very well.
Is that the actual function? If so, then you'd need to change it to await the individual client.query(...) calls to ensure each one completes before the next one starts. If they're not awaited, they get added to an internal per-client queue, which does not mix well with transactions.
You also need to pass a truthy value to client.release(...) so that it evicts the connection from the pool. Otherwise even if the connection is broken, the pool will hand it out again.
See here for a more elaborate example of a transaction wrapper: https://github.com/sehrope/postgres-with-nodejs-code/blob/master/src/00022-transaction-helper-sequential.js#L9-L35
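Roughly, the shape of it is something like this condensed sketch (not the exact code from the link; withTransaction and the work callback are illustrative names):

async function withTransaction(pool, work) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // the caller awaits each client.query(...) inside work, sequentially
    const result = await work(client);
    await client.query('COMMIT');
    client.release();
    return result;
  } catch (err) {
    try {
      await client.query('ROLLBACK');
    } catch (rollbackErr) {
      // ignore; the client gets evicted below regardless
    }
    // passing the error (any truthy value) tells the pool to destroy this
    // client rather than hand it out again
    client.release(err);
    throw err;
  }
}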
Sorry, that's my bad - that's just some pseudocode I threw together this morning; I'll update it so that it's more representative of what we were doing. We are definitely awaiting the result from the client for all queries, and the client is being generated with await pool.connect().
Unless I'm misunderstanding, I don't think that the issue is with passing the truthy value to client.release(), since the client is being disconnected before the rollback is done.
In the example you sent, it looks like the client.release() is awaited. I don't see that in the example on the docs (https://node-postgres.com/features/transactions) - would that potentially lead to the same problems as not awaiting client.query?
No, the client.release(...) function is sync so the await in my example is not needed. It shouldn't change anything either way though.
See if the actual task being done with the client is hanging on to the client after it's supposed to have been returned to the pool. Or, if you think it's related to the idle client cleanup, enable logging on the pool to see if those events are firing in between.
Here are the pool options, for a sanity check (the log option is what I used to log everything).
const poolOptions = {
  max: 3,
  host: databaseParameters.host,
  port: databaseParameters.port,
  user: databaseParameters.user,
  password: databaseParameters.password,
  database: databaseParameters.database,
  log: console.log
};
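Separately, timestamped listeners on the pool's events are another way to correlate removals with checkouts - a debugging sketch, using event names from pg-pool's documented API, nothing specific to our setup:

const { Pool } = require('pg');
const pool = new Pool(poolOptions);

// log when clients are checked out vs. removed, so the two can be lined up
pool.on('acquire', () => console.log(Date.now(), 'client checked out of pool'));
pool.on('remove', () => console.log(Date.now(), 'client removed from pool'));
pool.on('error', (err) => console.log(Date.now(), 'idle client error:', err.message));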
After I enabled logging, I noticed that every time, right before one of our test cases failed due to the error above, it logged:
remove idle client
This was essentially only logged prior to a failing test case; it's not like it's getting logged all over the place.
This should only happen after the default of 10 seconds, but the clients are "checked out" from the pool for at most one second before it fires.
Note: this is after updating the code so that client.release is passed a truthy value whenever the rollback fails, and in general so that it matches the example code you sent along.
Are you able to reproduce the issue with a self-contained script that you can post in its entirety?
I'll try to find the time this weekend to put an MWE together!
Having a similar issue. Replicating it as a self-contained script/repo is proving impossible - the issue only appears for me intermittently when making a lot of queries from a hard-hitting test inside a Docker container with https://sailsjs.com/, so I can't rule out Docker/Sails funny business. Sometimes it fails as much as 80% of the time, but running the same tests on "bare metal" there are no problems.
It appears that the setTimeout in https://github.com/brianc/node-postgres/blob/c99fb2c127ddf8d712500db2c7b9a5491a178655/packages/pg-pool/index.js#L352-L358 is killing a client that has finished its original query and has since been reused for another query, which then fails when the client is killed mid-execution. Looking through the code, though, that does not make sense, and looking at the Postgres logs I see no queries hanging or any taking anywhere near 10,000ms to run.
Ended up on this issue because it's the closest to what I was seeing - queries failing with remove idle client - and I'm leaving the note below for any future explorers who land here:
Setting idleTimeoutMillis: 0 mitigated the issue allowing the test to run.
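For reference, the mitigation is just the one pool option (the rest of this config is placeholder):

const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 3,
  // 0 disables the idle-client auto-disconnect, so the pool never schedules
  // the timeout that was closing clients out from under us
  idleTimeoutMillis: 0,
});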
I'm also seeing this on Postgres calls from Next.js to Supabase. It shows up both with and without PgBouncer / Supavisor. I'm using Kysely, which uses node-postgres's Pool under the hood.
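I haven't confirmed it in this setup yet, but presumably the same idleTimeoutMillis: 0 workaround could be applied to the pg Pool that Kysely wraps - a sketch, with the connection string as a placeholder:

const { Kysely, PostgresDialect } = require('kysely');
const { Pool } = require('pg');

// Kysely's PostgresDialect takes an existing pg Pool, so the workaround
// is applied on the node-postgres side
const db = new Kysely({
  dialect: new PostgresDialect({
    pool: new Pool({
      connectionString: process.env.DATABASE_URL,
      idleTimeoutMillis: 0, // same mitigation as above
    }),
  }),
});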