Product: PowerShell Universal
I have a single server running PSU v3 with a SQL backend on MSSQL Server 2019, where the database is in an availability group. With just one job in PSU, my transaction log is growing out of control at around 25 GB/hour. Is this expected behavior? Is there any tuning I can do to MSSQL and/or PSU to trim down the size of the transaction log?
This is not expected behavior. Another user mentioned this same problem, but they resolved it by running a command to shrink the database log, and we never got to the bottom of why it happened. I think they used DBCC SHRINKFILE.
Some more info would be good:
- How many jobs are you running per hour?
- Are you storing a lot of pipeline data with your jobs?
- Do you have a lot of stuff backed up in your Hangfire job queues?
I’m running a three-node PSU cluster with a SQL Server backend. It’s not super busy, but it grabs all the running processes every minute just to inflate the database.
My job queues are healthy and nothing is backed up.
My SQL transaction log seems like a generally reasonable size.
This is our first PSU v3 server, and I’m still just getting it set up. We have one job that runs every 5 minutes; the job is set to discard the pipeline (but there is very little output). We also have between two and three dashboards running as I test.
I generally have no jobs in the queue; however, I am seeing MANY GroomService.Groom jobs scheduled, and over 2k failed Groom jobs.
My SQL Server is being backed up every 4 hours, and it looks like the transaction log is maxing out at around 65 GB between backups.
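For a rough sense of scale, the figures above can be turned into an average hourly rate. This is only back-of-the-envelope arithmetic, assuming the log is truncated at each backup and grows roughly linearly in between; the function name is made up for illustration.

```python
def hourly_log_growth_gb(peak_gb, backup_interval_hours):
    """Average log growth per hour, assuming linear growth between backups."""
    return peak_gb / backup_interval_hours

# Figures from the post above: about 65 GB peak, backups every 4 hours.
rate = hourly_log_growth_gb(65, 4)
print(f"{rate:.2f} GB/hour")  # 16.25 GB/hour
```

That is an average over the whole interval, so short bursts could well be higher than this.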
I’m definitely seeing the GroomService.Groom jobs pile up. I suspect that could be contributing to the log file growth.
Can you send me a log file? It seems like either the groom job is stuck or it’s failing and retrying.
You can open a case if you’d like by emailing firstname.lastname@example.org
I just emailed a log over, but it looks like a pretty generic error.
2022-09-21 00:03:00.614 -05:00 [ERR] Failed to process the job '23464': an exception occurred.
Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:GroomService.Groom' resource.
at Hangfire.SqlServer.SqlServerDistributedLock.Acquire(IDbConnection connection, String resource, TimeSpan timeout)
at Hangfire.SqlServer.SqlServerConnection.AcquireLock(String resource, TimeSpan timeout)
at Hangfire.SqlServer.SqlServerConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
at Hangfire.DisableConcurrentExecutionAttribute.OnPerforming(PerformingContext filterContext)
at Hangfire.Profiling.ProfilerExtensions.InvokeAction[TInstance](InstanceAction`1 tuple)
at Hangfire.Profiling.SlowLogProfiler.InvokeMeasured[TInstance,TResult](TInstance instance, Func`2 action, String message)
at Hangfire.Profiling.ProfilerExtensions.InvokeMeasured[TInstance](IProfiler profiler, TInstance instance, Action`1 action, String message)
at Hangfire.Server.BackgroundJobPerformer.InvokePerformFilter(IServerFilter filter, PerformingContext preContext, Func`1 continuation)
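To see how often this error shows up in a log file, a small script along these lines may help. It is a minimal sketch that assumes the log lines look like the excerpt above; the regular expression and the function name are illustrative and not part of PSU or Hangfire.

```python
import re
from collections import Counter

# Pulls the resource name out of a distributed-lock timeout message, e.g.
# "... distributed lock on the 'HangFire:GroomService.Groom' resource."
LOCK_TIMEOUT = re.compile(
    r"DistributedLockTimeoutException: .*distributed lock on the '([^']+)' resource"
)

def count_lock_timeouts(lines):
    """Count lock-timeout exceptions per locked resource name."""
    counts = Counter()
    for line in lines:
        match = LOCK_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the log excerpt in this thread.
sample = [
    "2022-09-21 00:03:00.614 -05:00 [ERR] Failed to process the job '23464': an exception occurred.",
    "Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:GroomService.Groom' resource.",
    "at Hangfire.SqlServer.SqlServerDistributedLock.Acquire(IDbConnection connection, String resource, TimeSpan timeout)",
]
counts = count_lock_timeouts(sample)
print(counts["HangFire:GroomService.Groom"])  # 1
```

To run it against a real log, pass an open file object instead of the sample list, since the function accepts any iterable of lines.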
I’m able to reproduce a similar problem when forcing the groom job to hang. Additional groom jobs will be queued up, wait for the distributed lock, fail to receive the lock and then reschedule.
This is not the correct behavior. The incoming groom jobs should be cancelled and not requeued if they cannot access the lock.
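The intended behavior described above (skip the run if the lock is taken, rather than retrying) can be sketched in a few lines. This is only an illustration of the pattern using an in-process `threading.Lock` as a stand-in for the distributed lock; it is not how PSU or Hangfire implement it, and `run_groom` is a hypothetical name.

```python
import threading

# Stand-in for the distributed lock on the groom resource (assumption:
# a local lock is enough to illustrate the skip-instead-of-retry pattern).
groom_lock = threading.Lock()

def run_groom(job_id, results, timeout=0.1):
    """Run the groom work only if the lock is free; otherwise skip, never requeue."""
    if not groom_lock.acquire(timeout=timeout):
        results[job_id] = "skipped"  # cancelled, not rescheduled
        return
    try:
        results[job_id] = "ran"  # real groom work would happen here
    finally:
        groom_lock.release()

results = {}
groom_lock.acquire()   # simulate a hung groom job holding the lock
run_groom(1, results)  # cannot get the lock, so it is skipped
groom_lock.release()   # the hung job finally lets go
run_groom(2, results)  # lock is free, so this one runs
print(results)  # {1: 'skipped', 2: 'ran'}
```

With this pattern a stuck groom job leaves behind one skipped run per schedule tick instead of an ever-growing pile of retries.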
There is still an underlying issue causing the groom job to hang, and that may be the root cause here, but I will have to try to reproduce it myself to see if we can get to the bottom of it. I’ll let you know if I need some more information.
Do you have any groom jobs running?
There are no groom jobs completing as far as I can tell. I haven’t scheduled any groom jobs, just what is included by default.
Ok. Thanks. I’ll let you know if I need some more info.
+1 with the same error now, 2 jobs running, logs at 50GB and growing
@adam We are also having a similar growth issue, happy to provide anything you need
Eric (we talked yesterday AM)