river icon indicating copy to clipboard operation
river copied to clipboard

Document Auto-vacuum Starvation

Open mikehale opened this issue 2 years ago • 6 comments

I think we implemented this after your time at Heroku. We finally realized that sometimes the 2 PG auto vacuum processes would both end up working on tables that were taking hours to complete. That kept them from vacuuming the que_jobs table, which would lead to poor performance. We ended up adding a clock process task to manually vacuum our high churn tables at regular intervals, and we haven't seen poor job selection performance since.

It might be worthwhile to add something similar to what que has to your docs to mitigate this potential issue for river users.

mikehale avatar Nov 21 '23 17:11 mikehale

Oh hey @mikehale 👋 Thanks for raising this concern. Out of curiosity, how frequently are you manually vacuuming, how long does it take to complete, and are there any related settings you would want to share?

We have an internal maintenance process architecture where we could fairly trivially add a feature to have the leader initiate a vacuum on a specific schedule but would need to be careful to provide the right configs for it and probably not enable by default.

bgentry avatar Nov 21 '23 18:11 bgentry

When I left we were vacuuming once every 5 minutes, and IIRC it took less than a second to complete. We had tuned the auto vacuum settings, but obviously those don't come into play with a manual vacuum. Our main concern was about affecting the performance of the rest of the database, through either a lock or increased I/O load. In practice neither of those manifested as issues, and I believe that is partly because of the consistently low overhead due to the regular and frequent vacumming. The trade-off of potentially higher I/O every 5 minutes vs randomly not having auto-vacuum run for many hours turned out to be a good one for us.

mikehale avatar Nov 21 '23 19:11 mikehale

We got bit by this. Was pretty frustrating to debug.

dyeje avatar May 10 '24 16:05 dyeje

@dyeje can you add any context around what you were seeing, what your workload is like, jobs/sec or day, etc?

bgentry avatar May 10 '24 17:05 bgentry

Postgres major version too please. I would've also expected that many of the B-tree optimizations would've helped slow down degenerate table bloat compared to those old days at Heroku.

brandur avatar May 11 '24 05:05 brandur

It was admittedly an extreme case, but I think it had been over 24 hours that the auto vacuum processes were working on very large tables, that coupled with a very thrashed jobs table caused a noticeable though not fatal degradation in job acquisition performance.

mikehale avatar May 11 '24 05:05 mikehale