Uh oh. The Hangfire page is throwing an HTTP/500 now. I wonder if all the hung jobs have messed with it somehow.
EDIT: Yeah, I’m seeing this in the log:
2022-09-14 10:25:12.882 +10:00 [ERR] Connection ID "17293822570713326760", Request ID "800024ab-0000-f000-b63f-84710c7967bb": An unhandled exception was thrown by the application.
System.Data.SqlClient.SqlException (0x80131904): Resource ID : 1. The request limit for the database is 90 and has been reached. See 'https://docs.microsoft.com/azure/azure-sql/database/resource-limits-logical-server' for assistance.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
at System.Data.SqlClient.SqlDataReader.get_MetaData()
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
at Dapper.SqlMapper.ExecuteReaderWithFlagsFallback(IDbCommand cmd, Boolean wasClosed, CommandBehavior behavior)
at Dapper.SqlMapper.QueryImpl[T](IDbConnection cnn, CommandDefinition command, Type effectiveType)+MoveNext()
at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
at Hangfire.SqlServer.SqlServerJobQueueMonitoringApi.<>c__DisplayClass6_0.<GetQueues>b__0(DbConnection connection)
It looks like Hangfire has so many requests open against the Azure SQL database that it has hit the tier's concurrent request limit (90, per the error) and is bombing out.
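For reference, a quick way to confirm how close we are to that limit is to count the in-flight requests with sys.dm_exec_requests. A minimal sketch with Dapper (already in the stack); the connection-string environment variable is just a placeholder, and the DMV needs VIEW DATABASE STATE on Azure SQL:

```csharp
using System;
using System.Data.SqlClient;
using Dapper;

// Placeholder: wherever the Hangfire connection string actually lives.
var connectionString = Environment.GetEnvironmentVariable("HANGFIRE_DB");

using var connection = new SqlConnection(connectionString);

// Azure SQL starts rejecting new work with the error above once the number
// of concurrent requests reaches the tier's limit (90 in this case).
var activeRequests = connection.ExecuteScalar<int>(
    "SELECT COUNT(*) FROM sys.dm_exec_requests;");

Console.WriteLine($"Active requests: {activeRequests}");
```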
EDIT 2: I’ve just stopped the app service and manually deleted all the rows from the “Job*” tables to see if there was something in the data causing the issue.
EDIT 3: That seemed to work. One deletion failed because of a self-reference in the Job table (i.e. another job pointed at it as its ParentJobId), so I had to delete the rows in the right order. I wonder whether the Groom service was hitting that same referential integrity issue and its delete query was just hanging.
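If anyone needs to do the same cleanup, the ordering can be handled with a loop like this. It's only a sketch: the Job/ParentJobId names are taken from the description above and may not match the real schema, and it should only run with the app service stopped:

```csharp
using System;
using System.Data.SqlClient;
using Dapper;

// Placeholder for wherever the connection string actually lives.
using var connection = new SqlConnection(
    Environment.GetEnvironmentVariable("JOBS_DB"));

int deleted;
do
{
    // Only delete jobs that no other job references via ParentJobId, then
    // repeat until nothing is left (or nothing more can be deleted). This
    // sidesteps the self-referential constraint failure mentioned above.
    deleted = connection.Execute(@"
        DELETE FROM Job
        WHERE Id NOT IN (SELECT ParentJobId FROM Job
                         WHERE ParentJobId IS NOT NULL);");
} while (deleted > 0);
```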
We’ll add some more logging to the groom steps to see what is tripping this up. As for the issue with Hangfire, that has been resolved. The root problem is that while the hung groom job sat there doing whatever it was doing, more and more groom jobs kept being queued behind it and never timed out.
Moving forward, if a groom job can’t start because one is already running, it will simply time out so that they don’t stack up like this.
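A minimal sketch of how that could look with Hangfire's built-in DisableConcurrentExecution filter (the GroomJob class and the 30-second timeout are illustrative, not the actual code): if the distributed lock is already held by a running groom job, the new invocation fails once the timeout expires instead of queueing up behind it, and AutomaticRetry with zero attempts stops Hangfire from rescheduling the failed run.

```csharp
using Hangfire;

public class GroomJob
{
    // If another groom job already holds the distributed lock, this invocation
    // fails after 30 seconds rather than piling up behind the running one.
    // Attempts = 0 prevents Hangfire from automatically retrying it.
    [DisableConcurrentExecution(timeoutInSeconds: 30)]
    [AutomaticRetry(Attempts = 0)]
    public void Run()
    {
        // ...the actual grooming work...
    }
}
```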
Hmmm. I think that may be a problem. We changed how we register jobs in Hangfire so we can be more consistent across nodes. The problem is that the ID wasn’t a great identifier for a schedule, since it could easily change just by reordering the file.
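If I'm describing the change right, registration is now keyed on something stable (e.g. a name) rather than the positional numeric id. A rough illustration with Hangfire's recurring job API; the "groom" id, cron expression, GroomJob, and "old-numeric-id" are illustrative, not the actual code:

```csharp
using Hangfire;

// Register (or update) the schedule under a stable id, so reordering the
// schedule file no longer produces a "new" recurring job.
RecurringJob.AddOrUpdate<GroomJob>(
    "groom",
    job => job.Run(),
    Cron.Hourly());

// Schedules that are no longer in the file can be dropped explicitly.
RecurringJob.RemoveIfExists("old-numeric-id");
```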
It seems like Hangfire may be holding onto the previous schedules even after restarts. I’m going to open an issue for this one, because we may need to consider cleaning up the Hangfire scheduler somehow when using SQL. This wasn’t a problem with LiteDB, since we stored the schedules in memory, so restarting the server would just start fresh.
Any recommendations in the meantime? I have a few jobs that send emails/make Teams posts and I don’t want to risk them running twice. Delete the ones with a numeric Id?
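i.e. something along these lines, run once with the servers stopped: list what Hangfire currently has registered and remove only the entries whose id is purely numeric, on the assumption that those are exactly the stale ones.

```csharp
using System;
using Hangfire;
using Hangfire.Storage;

using (var connection = JobStorage.Current.GetConnection())
{
    // GetRecurringJobs() is the extension method in Hangfire.Storage.
    foreach (var recurring in connection.GetRecurringJobs())
    {
        // Assumption: the leftover schedules are the ones with a purely
        // numeric id; the renamed jobs keep their readable ids.
        if (int.TryParse(recurring.Id, out _))
        {
            Console.WriteLine($"Removing stale recurring job '{recurring.Id}'");
            RecurringJob.RemoveIfExists(recurring.Id);
        }
    }
}
```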