Groom Service "hung" in 3.3.0?

Prior to 3.3.0, the groom service was crashing every time it ran, so we now have 51 pages of jobs built up.

I have upgraded to 3.3.0, and the service is no longer crashing, but it seems to just “hang”. Check out this screenshot from Hangfire:

Any thoughts on what I can try? Would love to get these old jobs cleaned up! Thanks!

Product: PowerShell Universal
Version: 3.3.0

Sadly there are no errors in the log that I can report. I just see lots of entries like this:

2022-09-14 08:21:10.552 +10:00 [INF] Starting groom job.
2022-09-14 08:21:10.552 +10:00 [INF] Script Test-Psu.ps1 has a max history of 20
2022-09-14 08:21:10.570 +10:00 [INF] Script has 27 jobs to delete.
2022-09-14 08:21:11.271 +10:00 [INF] Starting heartbeat job.
2022-09-14 08:21:13.090 +10:00 [INF] Finished heartbeat job.

It’s like it sees that it has jobs to delete, and then just “gives up” without reporting that the task has finished.
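For reference, this is roughly how I’ve been checking whether a matching “Finished groom job.” line ever shows up (a quick sketch; it assumes the groom job logs a “Finished” entry like the heartbeat job does, and the log path is a placeholder for wherever your PSU log file actually lives):

```powershell
# Pull the most recent groom-related log lines. If the groom job ever completed,
# I'd expect a "Finished groom job." entry alongside the "Starting groom job." ones.
# NOTE: the path below is a placeholder - point it at your actual PSU log file.
Select-String -Path 'C:\ProgramData\PowerShellUniversal\log.txt' -Pattern 'groom job' |
    Select-Object -Last 20
```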

Uh oh. The Hangfire page is throwing an HTTP 500 now. I wonder if all the hung jobs have messed with it somehow.

EDIT: Yeah, I’m seeing this in the log:

2022-09-14 10:25:12.882 +10:00 [ERR] Connection ID "17293822570713326760", Request ID "800024ab-0000-f000-b63f-84710c7967bb": An unhandled exception was thrown by the application.
System.Data.SqlClient.SqlException (0x80131904): Resource ID : 1. The request limit for the database is 90 and has been reached. See 'https://docs.microsoft.com/azure/azure-sql/database/resource-limits-logical-server' for assistance.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
   at Dapper.SqlMapper.ExecuteReaderWithFlagsFallback(IDbCommand cmd, Boolean wasClosed, CommandBehavior behavior)
   at Dapper.SqlMapper.QueryImpl[T](IDbConnection cnn, CommandDefinition command, Type effectiveType)+MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at Hangfire.SqlServer.SqlServerJobQueueMonitoringApi.<>c__DisplayClass6_0.<GetQueues>b__0(DbConnection connection)

It looks like Hangfire has so many concurrent requests open against the Azure SQL database that it’s hitting the tier’s request limit (90, per the error above) and bombing out.
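A quick way to sanity-check that theory is to count how many requests are currently active against the database (a sketch; the server and database names are placeholders, and you’d add whatever authentication parameters you normally use with Invoke-Sqlcmd):

```powershell
# Count active requests against the PSU database to see how close we are to the
# tier's limit of 90 reported in the error above.
$query = @"
SELECT COUNT(*) AS ActiveRequests
FROM sys.dm_exec_requests
WHERE session_id <> @@SPID;   -- ignore this monitoring query itself
"@
Invoke-Sqlcmd -ServerInstance 'your-server.database.windows.net' -Database 'PSU' -Query $query
```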

EDIT 2: I’ve just stopped the app service and manually deleted all the rows from the “Job*” tables to see if there was something in the data causing the issue.

EDIT 3: That seemed to work. I had one deletion fail because of a self-referencing foreign key in the Job table (another job referenced it as its ParentJobId), so I had to delete the rows in the right order. I wonder whether the groom service was hitting a referential-integrity issue and the delete query was just hanging.
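For anyone hitting the same thing, the cleanup boiled down to something like this (a sketch, not a verified script: it assumes the Job table’s key column is named Id, that it lives in the dbo schema, and that any dependent “Job*” tables have been emptied first):

```powershell
# Delete jobs "leaves first" so the self-referencing ParentJobId foreign key is
# never violated. Assumes no cycles in the parent/child relationships.
$cleanup = @"
WHILE EXISTS (SELECT 1 FROM dbo.Job)
BEGIN
    DELETE j
    FROM dbo.Job AS j
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Job AS c WHERE c.ParentJobId = j.Id);
END
"@
# Stop the PSU app service before running this, then:
Invoke-Sqlcmd -ServerInstance 'your-server.database.windows.net' -Database 'PSU' -Query $cleanup
```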

We’ll add some more logging to the groom steps to see what is tripping this up. As for the Hangfire issue, that has been resolved. The problem is that while the hung groom job sat there doing nothing, more and more groom jobs were queued behind it and never timed out.

Moving forward, if a groom job can’t start because one is already running, it will just time out, so they won’t stack up like this.
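As a rough illustration of the pattern (not the actual PSU implementation), it’s essentially a lock with a timeout around the groom work:

```powershell
# If a previous groom run still holds the lock, give up after a short wait instead
# of queuing behind it indefinitely. Mutex name and timeout are illustrative only.
$mutex = New-Object System.Threading.Mutex($false, 'Global\PSUGroomJob')
if (-not $mutex.WaitOne([TimeSpan]::FromSeconds(30))) {
    Write-Warning 'A groom job is already running; skipping this run.'
    $mutex.Dispose()
    return
}
try {
    # ... the actual groom work (deleting old jobs) would happen here ...
}
finally {
    $mutex.ReleaseMutex()
    $mutex.Dispose()
}
```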

Great news!

Also, should I be concerned that Hangfire is showing two entries for each of my schedules? One with the Job Id and one with its description:

Both appeared to execute in Hangfire, but only one “job” appears in PSU. Should I delete one or the other?

Hmmm. I think that may be a problem. We changed how we register jobs in Hangfire so we can be more consistent across nodes. The problem is that the ID wasn’t a great identifier for a schedule, since it could easily change just by reordering the file.

It seems like Hangfire may be holding onto the previous schedules even after restarts. I’m going to open an issue for this one because we may need to consider cleaning up the Hangfire scheduler somehow when using SQL. This wasn’t a problem with LiteDB, since we actually stored the schedules in memory, so restarting the server would just start fresh.

Any recommendations in the meantime? I have a few jobs that send emails/make Teams posts and I don’t want to risk them running twice. Delete the ones with a numeric Id?

Yeah that probably would be the best course of action at the moment.
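If you want to double-check what Hangfire has persisted before deleting anything, a read-only query like this will list the recurring job IDs it knows about (a sketch assuming the default Hangfire.SqlServer storage layout and HangFire schema; server and database names are placeholders):

```powershell
# List the recurring job IDs Hangfire has stored in SQL so the duplicates
# (numeric schedule IDs vs. descriptions) are easy to spot before removing
# them from the Hangfire dashboard.
$query = @"
SELECT [Value] AS RecurringJobId
FROM [HangFire].[Set]
WHERE [Key] = 'recurring-jobs'
ORDER BY [Value];
"@
Invoke-Sqlcmd -ServerInstance 'your-server.database.windows.net' -Database 'PSU' -Query $query
```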
