Crash in v2.9 on Azure

Came in this morning to find that PSU had crashed. When I visit the Azure website, I’m greeted with:

The specified CGI application encountered an error and the server terminated the process.

The earliest error I can find in the logs (from sometime last Friday night) was this:

2022-04-08 11:01:17 [WARN]  (Hangfire.AutomaticRetryAttribute) Failed to process the job 'ec6b33c7-1ade-458d-95bb-f1c29d35e868': an exception occurred. Retry attempt 1 of 10 will be performed in 00:00:21.
LiteDB.LiteException
Collection '_files' lock timeout when entering in write mode after 00:01:00
   at LiteDB.Engine.LockService.EnterLock(String collectionName)
   at LiteDB.Engine.Snapshot..ctor(LockMode mode, String collectionName, HeaderPage header, UInt32 transactionID, TransactionPages transPages, LockService locker, WalIndexService walIndex, DiskReader reader, Boolean addIfNotExists)
   at LiteDB.Engine.TransactionService.<CreateSnapshot>g__create|43_0(<>c__DisplayClass43_0& )
   at LiteDB.Engine.TransactionService.CreateSnapshot(LockMode mode, String collection, Boolean addIfNotExists)
   at LiteDB.Engine.LiteEngine.<>c__DisplayClass3_0.<Delete>b__0(TransactionService transaction)
   at LiteDB.Engine.LiteEngine.AutoTransaction[T](Func`2 fn)
   at LiteDB.LiteCollection`1.Delete(BsonValue id)
   at LiteDB.LiteStorage`1.Delete(TFileId id)
   at UniversalAutomation.FileStorage`1.Delete(T item) in D:\a\universal\universal\src\UniversalAutomation.LiteDBv5\LiteDBDatabase.cs:line 328
   at UniversalAutomation.GroomService.DeleteJob(Job job) in D:\a\universal\universal\src\Universal.Server\Services\GroomService.cs:line 152
   at UniversalAutomation.GroomService.ServerWideGroom() in D:\a\universal\universal\src\Universal.Server\Services\GroomService.cs:line 134
   at UniversalAutomation.GroomService.Groom() in D:\a\universal\universal\src\Universal.Server\Services\GroomService.cs:line 167

… and this:

2022-04-09 12:02:29 [ERROR] (Hangfire.Processing.BackgroundExecution) Execution RecurringJobScheduler is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04
System.Threading.SynchronizationLockException
Object synchronization method was called from an unsynchronized block of code.
   at Hangfire.MemoryStorage.Utilities.LocalLock..ctor(String resource, TimeSpan timeout)
   at Hangfire.MemoryStorage.MemoryStorageConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func`2 action)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)

Errors like these have repeated sporadically in the log since Friday, so it seems like something is still running, just not enough to give me a UI.
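For anyone else hitting this: the 00:01:00 in the trace matches LiteDB v5's default collection lock timeout, which LiteDB exposes as its TIMEOUT setting (and, if I remember right, newer 5.x builds also accept a Timeout value in the connection string). Whether PSU actually lets you set this on its internal database I haven't confirmed, and the "Data:ConnectionString" key below is my guess, not something from the PSU docs:

```json
{
  // Hypothetical appsettings.json excerpt (.NET config files allow comments).
  // The "Data:ConnectionString" key is an assumption; only the LiteDB
  // connection-string syntax itself comes from LiteDB.
  "Data": {
    "ConnectionString": "Filename=D:\\home\\data\\PowershellUniversal\\database.db;Timeout=00:03:00"
  }
}
```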

I’ll restart the app, but thought I’d better report this. 2.10 has been solid all weekend.

Product: PowerShell Universal
Version: 2.9.x

It was 2.10’s time to crash over the weekend. Came in this morning to a 503 error, and a “process cannot access the file ‘D:\home\data\PowershellUniversal\database-log.db’ because it is being used by another process” error in the log.

Can’t wait for proper SQL Server support in v3, because LiteDB just seems to have all sorts of issues in Azure. :frowning:
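In case it helps: LiteDB v5 has a shared connection mode (`Connection=shared` in the connection string) where each operation opens and closes the data file, so more than one process can use the same database file. The "being used by another process" error smells like two processes (an overlapping restart, maybe?) holding the file in the default exclusive mode. I don't know whether PSU lets you control this on its internal database, but the connection string would look something like:

```
Filename=D:\home\data\PowershellUniversal\database.db;Connection=shared
```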

Another crash over the weekend. Here’s an excerpt from the log:

[09:00:13 WRN] As of "06/04/2022 08:56:42 +00:00", the heartbeat has been running for "00:03:30.3726010" which is longer than "00:00:01". This could be caused by thread pool starvation.
[09:00:33 WRN] The cookie '.AspNetCore.OpenIdConnect.Nonce.CfDJ8Amx_aLuvzZNkG662Ue4WL2lzB_eQd-7vhC3VqurI4qdTyjU6S5D0Xpf8C4XPxYDVTH6PynXjlPvJCAVqVNYVp3C0gYsdqwEI85jM-UwujkMiJ7UbLhiAgA4xhZTwSbgemt8G6409BdKM5LIJ49SFx0PxfgYdIpUhvNy_R99dbDGMFiq2eZNjm59tU9dQdumh1azlQqJtl8mCgHxg4WW6pQpaoLfo2AVL_4Nu9RaPyqQUVbe9tbhGuwcGTCtPNpVoR4zJ02U9cn0s70r9_fLK8c' has set 'SameSite=None' and must also set 'Secure'.
[09:01:31 WRN] The cookie '.AspNetCore.Correlation.Zi8X7qx6GVaRnve-1am0pXG_HN87SC88VWt_LdkRzTY' has set 'SameSite=None' and must also set 'Secure'.
2022-06-04 08:59:27 [ERROR] (Hangfire.Processing.BackgroundExecution) Execution RecurringJobScheduler is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04
System.Threading.SynchronizationLockException
Object synchronization method was called from an unsynchronized block of code.
   at Hangfire.MemoryStorage.Utilities.LocalLock..ctor(String resource, TimeSpan timeout)
   at Hangfire.MemoryStorage.MemoryStorageConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func`2 action)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)
[09:02:22 WRN] As of "06/04/2022 09:01:32 +00:00", the heartbeat has been running for "00:00:49.4371639" which is longer than "00:00:01". This could be caused by thread pool starvation.
[09:05:38 WRN] The cookie '.AspNetCore.OpenIdConnect.Nonce.CfDJ8Amx_aLuvzZNkG662Ue4WL0RsKOF3escY_sAiMDSWn3px4LpGNdFauRojyTaC34JDwV1gKiGDOg5KmhGrv2ZdgxH6ZjX3jy9UwdHGjsZuH0kqizYgMIJ_ELFR4uwm6Ie7ed14m-tft-RRw68ZHoNIhNocy6Nl13xBglaadc6W1dYJ66__D6lbfuGDEXB3xu44l-lnKQDK452SHOaqgA_PAiteKX-FQ5MgVXlABnpuaX1lD_H30BmKm8hAwaDMuv9fb1RarzOs6ltfXACmPHL0k0' has set 'SameSite=None' and must also set 'Secure'.
[09:05:41 WRN] The cookie '.AspNetCore.Correlation.wM5ZSOTAR9LdUERDrYQpPUfQ_oGliKBONcbuJeauaOc' has set 'SameSite=None' and must also set 'Secure'.
2022-06-04 09:07:13 [INFO]  (Hangfire.Processing.BackgroundExecution) Execution RecurringJobScheduler recovered from the Failed state after 00:04:53.4099795 and is in the Running state now
[09:09:46 WRN] As of "06/04/2022 09:03:06 +00:00", the heartbeat has been running for "00:06:40.2391995" which is longer than "00:00:01". This could be caused by thread pool starvation.
[09:09:51 WRN] As of "06/04/2022 09:09:49 +00:00", the heartbeat has been running for "00:00:01.8446367" which is longer than "00:00:01". This could be caused by thread pool starvation.
[09:11:15 WRN] As of "06/04/2022 09:10:54 +00:00", the heartbeat has been running for "00:00:21.5513096" which is longer than "00:00:01". This could be caused by thread pool starvation.

Curiously, the “heartbeat” messages continue right up until just now, so something is still running. But PSU refuses to respond. Refreshing the page gives me “The specified CGI application encountered an error and the server terminated the process.” :frowning:

Really looking forward to v3 and hopefully more stability on Azure.

Hangfire will be using SQL for storage in v3, so hopefully all these weird issues will be resolved. We are super close to release and have published an RTM version if you want to try the SQL support and see whether it's more stable.
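If you want to give it a go, switching v3 to the SQL persistence provider is an appsettings.json change along these lines (the connection string is a placeholder, so point it at your own SQL Server or Azure SQL instance; exact key names may differ slightly in the final release):

```json
{
  // Placeholder values: substitute your own server, database, and credentials.
  "Plugins": [ "UniversalAutomation.Persistence.SQL" ],
  "Data": {
    "ConnectionString": "Server=tcp:myserver.database.windows.net,1433;Database=psu;User Id=psu;Password=<secret>"
  }
}
```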
