Jobs running for more than 24 hours seem to get lost?

Hi,

I’ve had an issue for a while. We have a job that is sometimes taking a lot of time (we are working on that), possibly more than a day. The script uses a remote powershell session to run an instance of robocopy on a remote computer.

PSU is run under IIS and process recycling is disabled.

I noticed that when we ran it continously with a one hour delay, another copy of the same script would start after about 24 hours. I cannot see anything in the logs suggesting any issue, and the robocopy instance on the remote computer i still running, it’s just seemingly orphaned.

Suspecting an issue with the continous schedule, I rewrote the script to find any instance of itself already running to prevent it from starting twice and used a simple each hour schedule. But the same thing happens. The runnings job appears to be lost after 24 hours. The status of the job is “failed” without any indication of any explicit error.

What is more suspicous is that all the output produced by the script and visible in PSU up to that point, both on the “Pipeline” and “Output” is gone, the output pane is showing:

This job produced no output

We’ve also updated PSU to the latest version, but the result is the same. I’m pretty sure I’ve successfully run longer jobs before, a difference may be that it has produced output more continously. I haven’t tried that yet.

I haven’t been able to find any indication of an error in any log thus far and am a bit lost on how to continue from here, other than try to produce some kind of output every hour or so to see if that changes anything. If there was some kind of error, or indication that the script was terminated, that would be alright, but the fact that output previously visible in PSU vanishes is a bit fishy to me.

Any other suggestions?

Product: PowerShell Universal
Version: 2.11.1

Regards,
/Fredrik

You may be able to look at the Hangfire dashboard to see if it provides any additional information about what may have happened to the job: https://docs.powershelluniversal.com/development/hangfire

1 Like

Thanks, at least it provides me with some clues on where to look further. I’ve created a couple of test jobs to see if I can recreate with a simpler script. Wouldn’t be surprised if they run just fine and I need to find the issue elsewhere.

I’ve run some tests with simpler scripts running for 30 hours and they work just fine, so no issue with the platform. I’ll have to look further down the line here. I’ll rewrite the script to get some kind of keep-alive back to PSU and see what happens.

Thanks for your input.

/Fredrik

1 Like

I’m going to have to re-address this issue now. I currently have a different job which is “manually” started, after 24 hours the job has restarted itself.

The hangfire-page currently gives me a 500 response, so I can’t get any additional information there. Looking at the job, you can see that it is created 24 hours earlier (and was started), but is now restarted again after 24 hours later.

The first instance of the job is really still executing, it’s subprocess robocopy still running. It’s just lost from a PSU perspective, the output from the run is replaced with the output of the new instance.

Get-UAJob -Id 945985

Id                : 945985
CreatedTime       : 2022-06-02 12:54:12
StartTime         : 2022-06-03 12:54:13
EndTime           : 0001-01-01 00:00:00
Status            : Running
Output            :
Script            :
IsScriptDeleted   : False
ScriptFullPath    : TransferFileInbound.ps1
ScriptCommitId    :
AppToken          : PowerShellUniversal.AppToken
Identity          : PowerShellUniversal.Identity
ParentJob         :
ParentLineNumber  : 0
Debug             : False
ComputerName      : localhost
Port              : 55693
ProcessId         : 24336
MemoryBytes       : 72673963
RunspaceId        : 0
Activity          : Format-Volume -FileSystem NTFS -Confirm:$false -Force
CurrentOperation  :
PercentComplete   : 100
SecondsRemaining  : 0
StatusDescription : 2/2 completed
Environment       : Powershell 7
Agent             :
Computer          : PowerShellUniversal.Computer
ErrorAction       : Stop
Children          :
JobOutput         :
Parameters        :
Notes             :
Credential        :
ScheduleId        : 0
Triggered         : False
Trigger           :
AccessToken       :
IdToken           :
RetryCount        : 0
Tags              :

Getting entries such as these from hangfire in the log when I get the 500 from the diagnostics page:

[13:44:59 ERR] An unhandled exception has occurred while executing the request.
System.ArgumentException: An item with the same key has already been added. Key: stats:succeeded:2022-06-02-12
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at System.Linq.Enumerable.ToDictionary[TSource,TKey,TElement](IEnumerable`1 source, Func`2 keySelector, Func`2 elementSelector, IEqualityComparer`1 comparer)
   at Hangfire.MemoryStorage.MemoryStorageMonitoringApi.GetTimelineStats(List`1 dates, Func`2 formatorAction)
   at Hangfire.MemoryStorage.MemoryStorageMonitoringApi.GetHourlyTimelineStats(String type)
   at Hangfire.MemoryStorage.MemoryStorageMonitoringApi.HourlySucceededJobs()
   at Hangfire.Dashboard.Pages.HomePage.Execute()
   at Hangfire.Dashboard.RazorPage.TransformText(String body)
   at Hangfire.Dashboard.RazorPageDispatcher.Dispatch(DashboardContext context)
   at Hangfire.Dashboard.AspNetCoreDashboardMiddleware.Invoke(HttpContext httpContext)
   at Microsoft.AspNetCore.Builder.Extensions.MapMiddleware.Invoke(HttpContext context)
   at Universal.Server.Middleware.RoutingMiddleware.Invoke(HttpContext httpContext, IPolicyEvaluator policyEvaluator) in D:\a\universal\universal\src\Universal.Server\Middleware\RoutingMiddleware.cs:line 153
   at Swashbuckle.AspNetCore.SwaggerUI.SwaggerUIMiddleware.Invoke(HttpContext httpContext)
   at Swashbuckle.AspNetCore.Swagger.SwaggerMiddleware.Invoke(HttpContext httpContext, ISwaggerProvider swaggerProvider)
   at Universal.Server.Middleware.SwaggerAuthenticationMiddleware.InvokeAsync(HttpContext context, RequestDelegate next) in D:\a\universal\universal\src\Universal.Server\Middleware\SwaggerAuthMiddleware.cs:line 42
   at Microsoft.AspNetCore.Builder.UseMiddlewareExtensions.<>c__DisplayClass6_1.<<UseMiddlewareInterface>b__1>d.MoveNext()
--- End of stack trace from previous location ---
   at AspNetCoreRateLimit.RateLimitMiddleware`1.Invoke(HttpContext context) in D:\a\universal\universal\src\AspNetCoreRateLimit\Middleware\RateLimitMiddleware.cs:line 109
   at Microsoft.AspNetCore.Session.SessionMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Session.SessionMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware.<Invoke>g__Awaited|6_0(ExceptionHandlerMiddleware middleware, HttpContext context, Task task)
[13:44:59 WRN] No exception handler was found, rethrowing original exception.
[13:44:59 ERR] Connection id "0HMHSTAAUOQQI", Request id "0HMHSTAAUOQQI:00000280": An unhandled exception was thrown by the application.
System.ArgumentException: An item with the same key has already been added. Key: stats:succeeded:2022-06-02-12
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at System.Linq.Enumerable.ToDictionary[TSource,TKey,TElement](IEnumerable`1 source, Func`2 keySelector, Func`2 elementSelector, IEqualityComparer`1 comparer)
   at Hangfire.MemoryStorage.MemoryStorageMonitoringApi.GetTimelineStats(List`1 dates, Func`2 formatorAction)
   at Hangfire.MemoryStorage.MemoryStorageMonitoringApi.GetHourlyTimelineStats(String type)
   at Hangfire.MemoryStorage.MemoryStorageMonitoringApi.HourlySucceededJobs()
   at Hangfire.Dashboard.Pages.HomePage.Execute()
   at Hangfire.Dashboard.RazorPage.TransformText(String body)
   at Hangfire.Dashboard.RazorPageDispatcher.Dispatch(DashboardContext context)
   at Hangfire.Dashboard.AspNetCoreDashboardMiddleware.Invoke(HttpContext httpContext)
   at Microsoft.AspNetCore.Builder.Extensions.MapMiddleware.Invoke(HttpContext context)
   at Universal.Server.Middleware.RoutingMiddleware.Invoke(HttpContext httpContext, IPolicyEvaluator policyEvaluator) in D:\a\universal\universal\src\Universal.Server\Middleware\RoutingMiddleware.cs:line 153
   at Swashbuckle.AspNetCore.SwaggerUI.SwaggerUIMiddleware.Invoke(HttpContext httpContext)
   at Swashbuckle.AspNetCore.Swagger.SwaggerMiddleware.Invoke(HttpContext httpContext, ISwaggerProvider swaggerProvider)
   at Universal.Server.Middleware.SwaggerAuthenticationMiddleware.InvokeAsync(HttpContext context, RequestDelegate next) in D:\a\universal\universal\src\Universal.Server\Middleware\SwaggerAuthMiddleware.cs:line 42
   at Microsoft.AspNetCore.Builder.UseMiddlewareExtensions.<>c__DisplayClass6_1.<<UseMiddlewareInterface>b__1>d.MoveNext()
--- End of stack trace from previous location ---
   at AspNetCoreRateLimit.RateLimitMiddleware`1.Invoke(HttpContext context) in D:\a\universal\universal\src\AspNetCoreRateLimit\Middleware\RateLimitMiddleware.cs:line 109
   at Microsoft.AspNetCore.Session.SessionMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Session.SessionMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware.<Invoke>g__Awaited|6_0(ExceptionHandlerMiddleware middleware, HttpContext context, Task task)
   at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware.HandleException(HttpContext context, ExceptionDispatchInfo edi)
   at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware.<Invoke>g__Awaited|6_0(ExceptionHandlerMiddleware middleware, HttpContext context, Task task)
   at Microsoft.AspNetCore.Diagnostics.StatusCodePagesMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.ResponseCompression.ResponseCompressionMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Server.IISIntegration.IISMiddleware.Invoke(HttpContext httpContext)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

There is nothing particular I can find about the job itself in the log, other than it being restarted again after 24 hours (plus 1 second).

[12:54:12 INF] Job 945985 started.
...
[12:54:13 INF] Job 945985 started.

I’m almost definitely sure that jobs have successfully run longer than 24 hours before, and that this is an issue that has been introduced in later versions of PSU, but can’t say for certain when it worked. I see this behavior in at least 2.7.4 and 2.11.1.

Same problem here with last version of PSU 2.11.1.
If I let a job running more than 24 hours, another one is started. First job is lost, but it seems to keep running because all actions of the job are done 2 times. I let a job running 5 days, PSU keep displaying only one job running, but each actions of the job script are done multiple times.

1 Like

Hello @fjollberg and @sandy37, this issue has been captured and reported on the github tracker at One-time jobs that run longer than 24 hours may invoke a 2nd unwanted instance of the job under a new Job ID · Issue #1465 · ironmansoftware/issues · GitHub.

If you would like to see further updates on this please subscribe to that issue. Thank you for coming here to report this.