Script Jobs stuck on "Canceling"

Hi All,
I’ve got some jobs stuck on Canceling, and cant seem to get them to stop, this is the output of one of them:

Id : 4861
CreatedTime : 9/8/2021 8:08:23 AM
StartTime : 9/8/2021 8:08:23 AM
EndTime : 1/1/0001 12:00:00 AM
Status : Canceling
Output :
Script :
IsScriptDeleted : False
ScriptFullPath : Get-Sessions.ps1
ScriptCommitId :
AppToken : UniversalAutomation.AppToken
Identity : UniversalAutomation.Identity
ParentJob :
ParentLineNumber : 0
Debug : False
ComputerName : localhost
Port : 51464
ProcessId : 6416
RunspaceId : 3
Activity : Preparing modules for first use.
CurrentOperation :
PercentComplete : -1
SecondsRemaining : -1
StatusDescription :
Environment : 5.1.14393.4583
Agent :
Computer : UniversalAutomation.Computer
ErrorAction : SilentlyContinue
Children :
JobOutput :
Parameters :
Notes :
Credential :
ScheduleId : 0
Triggered : False
Trigger :

I’ve restarted the universal service and even the server expecting it to clear out but hasn’t helped. Any suggestions please?

I cant seem to find any powershell processes for them so not sure if they are actually attempting to run anything.
Thanks

It’s not trying to run anything at this point. I’ll open an issue for this though since it’s a bug.

Hi Adam,

Was going to open another topic for this, but saw this was already open.

This has been stuck… It doesn’t appear to have affected anything, I’m currently at 238 jobs. This is also on 2.5.2, I upgraded last night from 2.3.1, hoped the upgrade would clear it out.

I don’t think was ever resolved so I’ve opened an issue and put it in the 2.6 milestone : Jobs stuck in a cancelling state · Issue #702 · ironmansoftware/issues · GitHub

Yeah I just struck this too. A job that started 7am Saturday was still running when I checked this morning (Monday), saying it had been running for 121 years (as the start date/time was not set for some reason).

I cancelled it, and now it’s been spinning on “canceling” for about 10 minutes. Starting the schedule again produced a new job, but that job eventually timed out. It seems like it’s waiting on the previous one to cancel.

Tried restarting the Azure app to no avail. :frowning:

Edit: Confusingly, the job that failed to start because the previous one is still cancelling also says it ran for 121 years. PSU’s got itself in a knot somehow over this one job from Saturday morning.

Not sure why a cancelling job would prevent another job from starting. That might be another thread we should start to figure out why that’s the case. We don’t limit jobs from starting aside from the concurrent limit in Settings. The issue that’s discussed in this thread has been fixed in 2.6 and in that version, when you restart the server, jobs that are marked as cancelling will be marked as failed.

I have been able to run the script since (which is how I found out about the secret variables not persisting) so it must’ve been some funky one-off thing. I’ll ignore this one now until 2.6 comes out! Looking forward to it!

Yeah I’ve seen this twice now since updating to 2.6.2. Visit PSU after leaving it alone for a day or so, and a script that should only take a few seconds to run says it has been running for > 121 years, and won’t cancel. When I look at the associated job, there is output, so it’s like PSU isn’t properly detecting that the script has finished running. Any info I can give you to help debug this one, Adam?

This might help! The logs are full of this for the past few days, over and over:

2021-12-23 00:52:17.682 +00:00 [ERR] Error when dispatching 'OnDisconnectedAsync' on hub.
System.NullReferenceException: Object reference not set to an instance of an object.
   at UniversalDashboard.TerminalHub.OnDisconnectedAsync(Exception exception) in D:\a\universal\universal\src\Universal.Server\Hubs\TerminalHub.cs:line 69
   at Microsoft.AspNetCore.SignalR.Internal.DefaultHubDispatcher`1.OnDisconnectedAsync(HubConnectionContext connection, Exception exception)
   at Microsoft.AspNetCore.SignalR.Internal.DefaultHubDispatcher`1.OnDisconnectedAsync(HubConnectionContext connection, Exception exception)
   at Microsoft.AspNetCore.SignalR.HubConnectionHandler`1.HubOnDisconnectedAsync(HubConnectionContext connection, Exception exception)
2021-12-23 00:52:17.875 +00:00 [ERR] Failed disposing connection VvHn6LGCQt9EZWnQW4MCTA.
System.NullReferenceException: Object reference not set to an instance of an object.
   at UniversalDashboard.TerminalHub.OnDisconnectedAsync(Exception exception) in D:\a\universal\universal\src\Universal.Server\Hubs\TerminalHub.cs:line 69
   at Microsoft.AspNetCore.SignalR.Internal.DefaultHubDispatcher`1.OnDisconnectedAsync(HubConnectionContext connection, Exception exception)
   at Microsoft.AspNetCore.SignalR.Internal.DefaultHubDispatcher`1.OnDisconnectedAsync(HubConnectionContext connection, Exception exception)
   at Microsoft.AspNetCore.SignalR.HubConnectionHandler`1.HubOnDisconnectedAsync(HubConnectionContext connection, Exception exception)
   at Microsoft.AspNetCore.SignalR.HubConnectionHandler`1.RunHubAsync(HubConnectionContext connection)
   at Microsoft.AspNetCore.SignalR.HubConnectionHandler`1.OnConnectedAsync(ConnectionContext connection)
   at Microsoft.AspNetCore.SignalR.HubConnectionHandler`1.OnConnectedAsync(ConnectionContext connection)
   at Microsoft.AspNetCore.Http.Connections.Internal.HttpConnectionContext.ExecuteApplication(ConnectionDelegate connectionDelegate)
   at Microsoft.AspNetCore.Http.Connections.Internal.HttpConnectionContext.WaitOnTasks(Task applicationTask, Task transportTask, Boolean closeGracefully)
   at Microsoft.AspNetCore.Http.Connections.Internal.HttpConnectionContext.DisposeAsync(Boolean closeGracefully)
   at Microsoft.AspNetCore.Http.Connections.Internal.HttpConnectionManager.DisposeAndRemoveAsync(HttpConnectionContext connection, Boolean closeGracefully)

EDIT:

I came back to PSU after a day to find multiple pages of “queued” instances of a job that should only run once a day, and the one I cancelled still in a “cancelling” state, so I restarted the Azure App Service instance.

Was met with “HTTP Error 502.5 - ANCM Out-Of-Process Startup Failure” after the Azure service restarted, and couldn’t get PSU to start at all. So now I’m redeploying, which usually seems to fix it. :frowning:

EDIT 2:

Nope, not coming back up, even after a redeploy. ANCM Out-Of-Process Startup Failure.

EDIT 3:

And we’re up. Got the LiteDB transaction error again that I mentiioned in a different thread. Deleted database-log.db and it started working again immediately.

EDIT 4:

Dang it. My scheduled job worked every day since posting this, and I logged on this morning (first day back to work after the break) and it was still running after “121 years”. LiteDB transaction error again. :frowning:

I’ve deleted the log file and restarted the app. This transaction error happens without anyone using the system, seemingly at random.

Just in case the stack trace helps (it’s different from the previous one):

[ERR] Error in ConcurrentJobFilter
System.Exception: LiteDB ENSURE: transaction must be active to rollback (current state: Disposed)
   at LiteDB.Constants.ENSURE(Boolean conditional, String message)
   at LiteDB.Engine.TransactionService.Rollback()
   at LiteDB.Engine.LiteEngine.AutoTransaction[T](Func`2 fn)
   at LiteDB.LiteCollection`1.Update(T entity)
   at UniversalAutomation.LiteDBTable`1.Update(T item) in D:\a\universal\universal\src\UniversalAutomation.LiteDBv5\LiteDBDatabase.cs:line 515
   at UniversalAutomation.Services.Automation.ConcurrentJobFilter.TryPerforming(PerformingContext filterContext) in D:\a\universal\universal\src\Universal.Server\Services\Automation\ConcurrentJobFilter.cs:line 79
   at UniversalAutomation.Services.Automation.ConcurrentJobFilter.OnPerforming(PerformingContext filterContext) in D:\a\universal\universal\src\Universal.Server\Services\Automation\ConcurrentJobFilter.cs:line 41
1 Like

Man. LiteDB really doesn’t like running in Azure. I’ll open an issue for this. I don’t really understand why it’s attempting to rollback.

As a side note, in v3 we are working on SQL server support so Azure SQL will work. Additionally, cloud hosting is a focus. It will be part of our nightly integration tests and deployments.

Issue: Errors with LiteDB when running in Azure · Issue #825 · ironmansoftware/issues · GitHub

1 Like

Yeah it’s very strange. It’s just a Windows box in the end, and nobody’s touching it, so there’s no reason for LiteDB to get in a tangle the way it is. Hoping that the focus on cloud hosting means we can stick with LiteDB and not have to faff around with Azure SQL, but I’ll do what I can to get the product stable enough to use in production. :slight_smile:

I was just able to reproduce this trying to restart a dashboard. Happens every time.

2022-01-12 17:36:23.774 +00:00 [ERR] Connection id "0HMELTP1GDGJV", Request id "0HMELTP1GDGJV:00000028": An unhandled exception was thrown by the application.
System.Exception: LiteDB ENSURE: transaction must be active to rollback (current state: Disposed)
   at LiteDB.Constants.ENSURE(Boolean conditional, String message)
   at LiteDB.Engine.TransactionService.Rollback()
   at LiteDB.Engine.LiteEngine.AutoTransaction[T](Func`2 fn)
   at LiteDB.LiteCollection`1.Insert(T entity)
   at LiteDB.LiteRepository.Insert[T](T entity, String collectionName)
   at UniversalAutomation.LiteDBTable`1.Insert(T item) in C:\src\universal\src\UniversalAutomation.LiteDBv5\LiteDBDatabase.cs:line 508
   at UniversalAutomation.DashboardController.RestartAsync(Int64 id) in C:\src\universal\src\Universal.Server\Controllers\DashboardController.cs:line 225
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.TaskOfIActionResultExecutor.Execute(IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Awaited|12_0(ControllerActionInvoker invoker, ValueTask`1 actionResultValueTask)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeInnerFilterAsync()
1 Like

Man I’m so glad you’re able to repro this. I’m striking it all the time. Will be happily working away on a script for an hour or two, running it each time I make changes, and then one time out of the blue it’ll just pop up an “error” toast and I’ll not be able to run it again before I stop the app, delete the log file, and start the app again. Super frustrating! I’m definitely not against using LiteDB rather than the overhead of SQL Server, but it’ll be great if you can get to the bottom of what’s causing the error.

I’m getting closer. After looking at the LiteDB source, it looks like LiteDB is throwing an exception internally and then eating it and then throwing the above exception so I’m not exactly sure what the actual error is yet. Once I can dig that out, I should have some more information to go on.

I found issues in the LiteDB repository pertaining to this exact issue:

After the log file reaches about 8 MB for me it starts to fail.

Based on some comments within the issues, I updated my Data__ConnectionString environment variable on my web app to use the shared connection mode. I also delete my database and log file to start fresh.

So far, I haven’t run into the issue and the log file seems to be flushing to the database.

filename=D:\home\Data\PowerShellUniversal\database.db;Connection=shared

It also sounds like other users have downgraded to v4 and then it works. I will try that next but we had other issues with v4 and large amounts of data being written to it (like in jobs).

1 Like

Can confirm that the log file has consistently been around 8MB the last few times I’ve deleted it!

I’ve been running jobs every 30 seconds for the last 3 hours without incident. Seems this this may have done it for me.

1 Like

OK I’ve just given this a try, but I keep getting “HTTP Error 502.5 - ANCM Out-Of-Process Startup Failure” when the app tries to start.

The log file has this error:

Unhandled exception. System.UnauthorizedAccessException: Access to the path 'D:\home\data\PowerShellUniversal\database.db' is denied.
   at System.IO.FileStream.FlushOSBuffer()
   at System.IO.FileStream.Flush(Boolean flushToDisk)
   at LiteDB.StreamExtensions.FlushToDisk(Stream stream)
   at LiteDB.Engine.WalIndexService.CheckpointInternal()
   at LiteDB.Engine.WalIndexService.TryCheckpoint()
   at LiteDB.Engine.LiteEngine.Dispose(Boolean disposing)
   at LiteDB.Engine.LiteEngine.Dispose()
   at LiteDB.SharedEngine.CloseDatabase()
   at LiteDB.SharedEngine.EnsureIndex(String collection, String name, BsonExpression expression, Boolean unique)
   at LiteDB.LiteCollection`1.EnsureIndex(String name, BsonExpression expression, Boolean unique)
   at UniversalAutomation.LiteDBv5Database.SetupTables() in D:\a\universal\universal\src\UniversalAutomation.LiteDBv5\LiteDBDatabase.cs:line 135

I can see that the database.db file has been created anew, but for some reason PSU doesn’t think it has access to it. Did you do anything other than updating the connection string and deleting the database/log files?

For now I’ve removed the ;Connection=shared text but left the filename= prefix in there, and I’m back up. Every instinct tells me that the explicit use of filename= shouldn’t make a difference, but you never know.

FYI I just tried this again now that I’ve updated us to 2.7.1 but it still won’t start if “connection=shared” is in the connection string.

Same stack trace as this issue:

[BUG] System.UnauthorizedAccessException when running from a shared directory · Issue #1893 · mbdavid/LiteDB (github.com)

All i did was update the connection string. I’ll continue to investigate and take a look at my Azure lab today.

1 Like