I have a single server running PSU v3 with a SQL backend on MSSQL Server 2019, where the database is in an availability group. With just one job in PSU, my transaction log is growing out of control at around 25 GB per hour. Is this expected behavior? Is there any tuning I can do to MSSQL and/or PSU to trim down the size of the transaction log?
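For context, this is roughly what I have been running to watch the log and to see what is preventing it from truncating; the database name is just a placeholder for our PSU database:

-- Current log size and percentage used for every database.
DBCC SQLPERF(LOGSPACE);

-- What is blocking log truncation for the PSU database
-- (e.g. LOG_BACKUP, ACTIVE_TRANSACTION, or AVAILABILITY_REPLICA when the
-- secondary in the availability group has not caught up yet).
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases
WHERE name = 'UniversalAutomation';  -- placeholder PSU database name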
This is not expected behavior. Another user mentioned this same problem, but they resolved it by shrinking the database log (I believe with DBCC SHRINKFILE), so we never got to the bottom of why it happened.
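If you need to reclaim the space in the meantime, the shrink would have looked roughly like this; the database and logical file names are placeholders, and in an availability group the log will only shrink after it has been truncated, i.e. once the secondary has caught up and a log backup has run:

-- Find the logical name of the log file ('UniversalAutomation' is a placeholder).
USE UniversalAutomation;
SELECT name, type_desc, size * 8 / 1024 AS size_mb
FROM sys.database_files;

-- Shrink the log file back down; the target size (in MB) is an example value.
DBCC SHRINKFILE (N'UniversalAutomation_log', 1024);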
Some more info would be good:
How many jobs are you running per hour?
Are you storing a lot of pipeline data with your jobs?
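If you want to see where the space is actually going, a per-table size breakdown like the one below will usually show whether it is job/pipeline output or the Hangfire tables that are growing (run it in the PSU database; nothing in it is PSU-specific):

-- Reserved space and row counts per table, largest first.
SELECT OBJECT_SCHEMA_NAME(ps.object_id) AS schema_name,
       OBJECT_NAME(ps.object_id)        AS table_name,
       SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_count,
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.objects AS o
  ON o.object_id = ps.object_id
 AND o.is_ms_shipped = 0
GROUP BY ps.object_id
ORDER BY reserved_mb DESC;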
I’m running a 3-node PSU cluster with a SQL Server backend. It’s not super busy, but it grabs all the running processes every minute just to inflate the database.
My job queues are healthy and nothing is backed up.
This is our first PSU v3 server, and I’m still getting it set up. We have one job that runs every 5 minutes; the job is set to discard the pipeline (though there is very little output anyway). We also have two or three dashboards running as I test.
I generally have no jobs in the queue; however, I am seeing MANY GroomService.Groom jobs scheduled, and over 2k failed Groom jobs.
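For what it’s worth, the failed job counts can also be pulled straight from the Hangfire tables with something like this; I’m assuming PSU uses Hangfire’s default HangFire schema, so adjust if yours differs:

-- Jobs by state across the whole Hangfire store.
SELECT StateName, COUNT(*) AS job_count
FROM [HangFire].[Job]
GROUP BY StateName
ORDER BY job_count DESC;

-- Just the groom jobs.
SELECT StateName, COUNT(*) AS groom_job_count
FROM [HangFire].[Job]
WHERE InvocationData LIKE '%GroomService%'
GROUP BY StateName;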
I just emailed a log over, but it looks like a pretty generic error.
2022-09-21 00:03:00.614 -05:00 [ERR] Failed to process the job '23464': an exception occurred.
Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:GroomService.Groom' resource.
at Hangfire.SqlServer.SqlServerDistributedLock.Acquire(IDbConnection connection, String resource, TimeSpan timeout)
at Hangfire.SqlServer.SqlServerConnection.AcquireLock(String resource, TimeSpan timeout)
at Hangfire.SqlServer.SqlServerConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
at Hangfire.DisableConcurrentExecutionAttribute.OnPerforming(PerformingContext filterContext)
at Hangfire.Profiling.ProfilerExtensions.InvokeAction[TInstance](InstanceAction`1 tuple)
at Hangfire.Profiling.SlowLogProfiler.InvokeMeasured[TInstance,TResult](TInstance instance, Func`2 action, String message)
at Hangfire.Profiling.ProfilerExtensions.InvokeMeasured[TInstance](IProfiler profiler, TInstance instance, Action`1 action, String message)
at Hangfire.Server.BackgroundJobPerformer.InvokePerformFilter(IServerFilter filter, PerformingContext preContext, Func`1 continuation)
I’m able to reproduce a similar problem when forcing the groom job to hang. Additional groom jobs will be queued up, wait for the distributed lock, fail to receive the lock and then reschedule.
This is not the correct behavior. The incoming groom jobs should be cancelled and not requeued if they cannot access the lock.
There is still an underlying issue causing the groom job to hang, which may be the root cause here, but I will have to try to reproduce that myself to see if we can get to the bottom of it. I’ll let you know if I need any more information.