Hangfire.PostgreSql

How to safely clean jobqueue table

Open AlexanderSysoev opened this issue 10 months ago • 15 comments

Hi!

We are using Hangfire with PostgreSQL (version 1.20.9). Recently, I’ve noticed that the largest table in the hangfire schema is jobqueue (currently containing 3.5 million records, with a size of nearly 1 GB). Additionally, I’ve observed that a lot of queries are directed towards this particular table.

When executing the SQL query SELECT * FROM hangfire.jobqueue ORDER BY fetchedat ASC, it returns quite old records, and I’m not certain that this data is necessary for my application.

Image

Is there a safe method to clean up this table and maintain its size at a reasonable level?

Image

AlexanderSysoev avatar Jun 18 '25 15:06 AlexanderSysoev

Oh wow, how? I'm curious why the job queue items were not removed at least when the jobs finished. Do you by any chance use the legacy timestamp switch for Npgsql? When removing from the queue, the fetched date has to match together with the id, and Npgsql's fiddling with timestamps can drift the time.
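As an aside, the failure mode described above can be sketched in a few lines. This is a simplified in-memory model, not Hangfire's actual implementation: the storage deletes a fetched row only when both the id and the stored `fetchedat` value match, so any drift in the timestamp on its way through the driver makes the delete silently match nothing.

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for hangfire.jobqueue: id -> fetchedat as stored.
rows = {1: datetime(2025, 6, 18, 12, 0, 0)}

def remove_from_queue(row_id, fetched_at):
    """Delete only when both id and fetchedat match, mimicking the
    DELETE ... WHERE id = @id AND fetchedat = @fetchedAt pattern."""
    if rows.get(row_id) == fetched_at:
        del rows[row_id]
        return 1  # one row affected
    return 0      # no row matched; the queue entry survives

# Exact round-trip: the row is removed as expected.
assert remove_from_queue(1, datetime(2025, 6, 18, 12, 0, 0)) == 1

# Now suppose a timezone conversion drifted the value by +3 hours on the
# way back through the driver: the match fails and the row stays forever.
rows[2] = datetime(2025, 6, 18, 12, 0, 0)
drifted = datetime(2025, 6, 18, 12, 0, 0) + timedelta(hours=3)
assert remove_from_queue(2, drifted) == 0
assert 2 in rows  # orphaned jobqueue row
```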

azygis avatar Jun 18 '25 17:06 azygis

Yes, we are using this Switch, but the screenshot from my message was taken on June 18th and it shows records starting from June 6th. We have been actively using Hangfire for several months now, and I have the impression that records are still being deleted, but for some reason it happens with a significant delay. For example, as of the morning of June 19th, the query SELECT * FROM hangfire.jobqueue ORDER BY fetchedat ASC produces the following result:

Image

AlexanderSysoev avatar Jun 19 '25 06:06 AlexanderSysoev

@azygis are there any other ideas or explanations for why this queue is being processed with such a significant delay?

AlexanderSysoev avatar Jun 24 '25 09:06 AlexanderSysoev

SELECT *
FROM hangfire.jobqueue AS jq
LEFT JOIN hangfire.state AS s ON jq.jobid = s.jobid
ORDER BY jq.fetchedat ASC
LIMIT 20 -- just so we get a smaller data set, even if some job won't get the full "state" result

How about this?

I'm really confused. Based on my brief look, Hangfire can remove rows from the queue when the state changes too, but that still happens in the same execution loop... Could the jobs be getting requeued after failure? Hence I want to verify the state of these jobs too.

Honestly it would be nice to have a dump, but definitely not 1GB in size. Are you able to reproduce this on some other environment, where not that many jobs are happening?

One unlikely (but possible) case is that all these old jobs are actually still unprocessed and waiting for their turn. Could it be that the job is somewhat heavy, and jobs get added to the queue faster or more frequently than they complete?
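That hypothesis is easy to model: whenever the arrival rate exceeds the completion rate, the backlog grows without bound. The numbers below are invented purely for illustration; only the mechanism matters.

```python
# Toy queue-growth model: jobs arrive faster than workers complete them.
arrival_per_min = 120   # jobs enqueued per minute (assumed)
service_per_min = 100   # jobs completed per minute (assumed)
minutes = 60 * 24 * 14  # observation window: two weeks

backlog = 0
for _ in range(minutes):
    backlog += arrival_per_min
    backlog -= min(backlog, service_per_min)  # can't complete more than exist

# Two weeks at a steady 20 job/min deficit leaves ~400k rows behind.
print(backlog)  # 403200
```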

azygis avatar Jun 25 '25 15:06 azygis

@azygis

How about this?

Here is the result of your query. The data is relatively recent, considering that I executed it on June 26th. Image

Could the jobs be requeued after failure?

Yes, we are using AutomaticRetryAttribute for this purpose. This is how we use it:

    [Queue("webhooks")]
    [AutomaticRetry(DelaysInSeconds =
    [
        1 * 60,
        5 * 60,
        10 * 60,
        20 * 60,
        40 * 60,
        1 * 3600,
        2 * 3600,
        4 * 3600,
        8 * 3600,
        16 * 3600,
        32 * 3600
    ],
    OnAttemptsExceeded = AttemptsExceededAction.Delete)]
    public async Task<string> SendWebhookAsync(Guid transactionId, PerformContext? context)
    {
        //send webhook
    }
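One quick sanity check on that retry schedule: a job that fails every attempt stays alive for the sum of the configured delays, which bounds how old a legitimately retrying queue entry can be.

```python
# Total retry window implied by the DelaysInSeconds array above.
delays = [
    1 * 60, 5 * 60, 10 * 60, 20 * 60, 40 * 60,
    1 * 3600, 2 * 3600, 4 * 3600, 8 * 3600, 16 * 3600, 32 * 3600,
]
total_seconds = sum(delays)
print(total_seconds / 3600)  # ~64.3 hours, i.e. about 2.7 days
```

So retries alone explain rows up to roughly three days old, but not the two-week-old entries seen in the screenshots.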

Are you able to reproduce this on some other environment, where not that many jobs are happening?

We have several test environments, but they do not run jobs as frequently as the Production environment, and the jobqueue queues in these environments remain relatively small. Therefore, it seems that the issue is specifically related to the high volume of jobs. I believe it will not be possible to simply replicate this problem in the test environment.

Could it be that the job is somewhat heavy, and the jobs get added to the queue way faster/more frequently than they complete?

We use this job to send webhooks to our clients. Inside the job, there is an HttpClient with a timeout set to 1 minute. If the job fails with an error or the client does not respond with a 200 HTTP status code, we terminate the job with an error, and then the AutomaticRetry attribute logic kicks in to retry the execution. Yes, some of our clients may not respond within 1 minute or may not return a 200 HTTP status code, but their number is relatively small compared to the total number of webhooks. For example, here are the HTTP status codes of webhook responses from Grafana:

Image

The current state of our dashboard is as follows

Image

AlexanderSysoev avatar Jun 26 '25 12:06 AlexanderSysoev

This is interesting. Feels a bit like it has some catching up to do, but even for retry, based on what I see in Hangfire itself, it should remove the job from the queue once it gets rescheduled after failure.

Sorry, I won't have time for this in the coming weeks. I'll have to try to reproduce it, probably similar to your approach: increasing delay times and forcing failures while new jobs get queued.

azygis avatar Jun 28 '25 07:06 azygis

We've encountered the same problem, and we think we've found the root cause. It lies in parameter passing and timezones: https://github.com/hangfire-postgres/Hangfire.PostgreSql/blob/master/src/Hangfire.PostgreSql/PostgreSqlFetchedJob.cs#L81

Our application runs in a container which has the UTC timezone set by default. However, PostgreSQL is installed with the UTC+3 timezone (set in the OS). When Hangfire reads records from the table, it sets the Kind property of DateTime fields to Unspecified, so the value belongs to neither local time nor UTC, and then passes these values on as-is. PostgreSQL interprets them in its default timezone (UTC+3 in our case), so the dates no longer match.

And yes, our application uses the Npgsql.EnableLegacyTimestampBehavior switch set to true. If we remove it, the queue is cleaned up.
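The reinterpretation described above can be reproduced with plain datetimes. A naive value (the Python analogue of `DateTimeKind.Unspecified`) carries no offset, so whichever side assumes a different zone shifts the instant; the zone and timestamps below are illustrative.

```python
from datetime import datetime, timedelta, timezone

# What was originally written: an instant in UTC.
stored_utc = datetime(2025, 7, 11, 12, 0, tzinfo=timezone.utc)

# The driver hands the application a naive value with no offset attached,
# like DateTimeKind.Unspecified in .NET:
naive = stored_utc.replace(tzinfo=None)

# The server then assumes its own zone (UTC+3 here) when the naive value
# comes back in a WHERE clause:
reinterpreted = naive.replace(tzinfo=timezone(timedelta(hours=3)))

# The two instants are 3 hours apart, so an equality check on fetchedat
# no longer matches the stored row.
print((stored_utc - reinterpreted).total_seconds() / 3600)  # 3.0
```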

AOne-T avatar Jul 11 '25 14:07 AOne-T

@AOne-T Thank you for your response. What was the maximum date discrepancy you got in your case? In my situation, it's two weeks, which doesn't seem like a typical timezone-related lag, as far as I can tell.

AlexanderSysoev avatar Jul 14 '25 10:07 AlexanderSysoev

@azygis I am currently in the process of writing an example to diagnose the problem. To extinguish the production fire right now, is it safe to run this script?

DELETE FROM hangfire.jobqueue WHERE "jobid" NOT IN (SELECT "jobid" FROM hangfire.state);
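The script above keeps only queue rows whose job still has state rows, which in-memory looks like the set filter below (table contents invented for illustration). One caveat worth hedging: if jobs are being enqueued concurrently, a just-inserted queue row could briefly lack a visible state row, so running it inside a transaction and checking the affected row count before committing seems prudent.

```python
# In-memory model of the cleanup script's semantics: delete jobqueue rows
# whose jobid no longer exists in hangfire.state. Data is made up.
jobqueue = [
    {"id": 1, "jobid": 100},  # job still tracked in state -> keep
    {"id": 2, "jobid": 200},  # orphan, state rows gone -> delete
]
state_jobids = {100}

deleted = [r for r in jobqueue if r["jobid"] not in state_jobids]
kept = [r for r in jobqueue if r["jobid"] in state_jobids]

print([r["id"] for r in deleted])  # [2]
```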

AlexanderSysoev avatar Jul 17 '25 10:07 AlexanderSysoev

@azygis here is an example: https://github.com/AlexanderSysoev/HangfireJobQueueExample/tree/main. I still cannot reproduce the problem; any ideas are welcome.

AlexanderSysoev avatar Jul 17 '25 15:07 AlexanderSysoev

I also noticed that clicking the link in the screenshot throws an exception.

Image
System.ArgumentException: An item with the same key has already been added. Key: 4064151
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at System.Linq.Enumerable.ToDictionary[TSource,TKey,TElement](List`1 source, Func`2 keySelector, Func`2 elementSelector, IEqualityComparer`1 comparer)
   at Hangfire.PostgreSql.PostgreSqlMonitoringApi.FetchedJobs(IEnumerable`1 jobIds)
   at Hangfire.PostgreSql.PostgreSqlMonitoringApi.FetchedJobs(String queue, Int32 from, Int32 perPage)
   at Hangfire.Dashboard.Pages.FetchedJobsPage.Execute() in C:\projects\hangfire-525\src\Hangfire.Core\Dashboard\Pages\FetchedJobsPage.cshtml:line 19
   at Hangfire.Dashboard.RazorPage.TransformText(String body) in C:\projects\hangfire-525\src\Hangfire.Core\Dashboard\RazorPage.cs:line 179
   at Hangfire.Dashboard.RazorPage.ToString() in C:\projects\hangfire-525\src\Hangfire.Core\Dashboard\RazorPage.cs:line 103
   at Hangfire.Dashboard.RazorPageDispatcher.Dispatch(DashboardContext context) in C:\projects\hangfire-525\src\Hangfire.Core\Dashboard\RazorPageDispatcher.cs:line 38
   at Hangfire.Dashboard.AspNetCoreDashboardMiddleware.Invoke(HttpContext httpContext)
   at Microsoft.AspNetCore.Builder.Extensions.MapMiddleware.InvokeCore(HttpContext context, PathString matchedPath, PathString remainingPath)
   at Swashbuckle.AspNetCore.SwaggerUI.SwaggerUIMiddleware.Invoke(HttpContext httpContext)
   at Swashbuckle.AspNetCore.Swagger.SwaggerMiddleware.Invoke(HttpContext httpContext, ISwaggerProvider swaggerProvider)
   at Wata.Payment.HttpApi.Host.Swagger.SwaggerAuthorizationMiddleware.InvokeAsync(HttpContext context, RequestDelegate next) in /src/Payment/Wata.Payment.HttpApi.Host/Swagger/SwaggerAuthorizationMiddleware.cs:line 25
   at Microsoft.AspNetCore.Builder.UseMiddlewareExtensions.InterfaceMiddlewareBinder.<>c__DisplayClass2_0.<<CreateMiddleware>b__0>d.MoveNext()
   --- End of stack trace from previous location ---
   at Wata.Payment.HttpApi.Host.Auditing.AuditingMiddleware.InvokeAsync(HttpContext context, RequestDelegate next) in /src/Payment/Wata.Payment.HttpApi.Host/Auditing/AuditingMiddleware.cs:line 28
   at Microsoft.AspNetCore.Builder.UseMiddlewareExtensions.InterfaceMiddlewareBinder.<>c__DisplayClass2_0.<<CreateMiddleware>b__0>d.MoveNext()
   --- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.HttpLogging.HttpLoggingMiddleware.InvokeInternal(HttpContext context, HttpLoggingOptions options, HttpLoggingAttribute loggingAttribute, HttpLoggingFields loggingFields)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

AlexanderSysoev avatar Jul 17 '25 15:07 AlexanderSysoev

@AOne-T Thank you for your response. What was the maximum date discrepancy you got in your case? In my situation, it's two weeks, which doesn't seem like a typical timezone-related lag, as far as I can tell.

Sorry for the delay. As far as I remember, it was also about two weeks.

AOne-T avatar Jul 30 '25 19:07 AOne-T