Is there a known memory leak bug in 4.2.17? Since upgrading from 4.2.16, the Universal.Server process ends up consuming essentially all of the available memory on the server it’s running on, leading to failed health checks, Zabbix alerts about memory utilization, and a PSU web daemon that is largely unresponsive. The only (temporary) fix seems to be bouncing the entire PowerShell Universal service, which only takes care of things for a bit (yesterday I had to restart the service 4 times within a 5-hour window).
Product: PowerShell Universal
Version: 4.2.17
I’m restarting it now for the second time today. I’m creating a dump file to see if I can figure out what’s causing the problem, but I may end up rolling back to 4.2.16, since I don’t recall having this problem prior to upgrading to 4.2.17.
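For anyone who wants to capture the same thing, here is a minimal sketch of grabbing a full memory dump of the PSU process. The procdump and dotnet-dump tools shown are just two options I know of (assumptions on my part, not necessarily what you have installed); Task Manager’s “Create dump file” option works as well.

# Find the PSU server process and capture a full memory dump of it
$psuPid = (Get-Process -Name 'Universal.Server').Id

# Option 1: Sysinternals procdump (-ma = full memory dump)
.\procdump.exe -ma $psuPid C:\Temp\Universal.Server.DMP

# Option 2: dotnet-dump (install once with: dotnet tool install --global dotnet-dump)
dotnet-dump collect -p $psuPid -o C:\Temp\Universal.Server.DMP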
According to the memory dump, it appears that coreclr.dll is what’s causing the issues, unless I’m reading the dump incorrectly.
************* Preparing the environment for Debugger Extensions Gallery repositories **************
ExtensionRepository : Implicit
UseExperimentalFeatureForNugetShare : true
AllowNugetExeUpdate : true
NonInteractiveNuget : true
AllowNugetMSCredentialProviderInstall : true
AllowParallelInitializationOfLocalRepositories : true
EnableRedirectToV8JsProvider : false
-- Configuring repositories
----> Repository : LocalInstalled, Enabled: true
----> Repository : UserExtensions, Enabled: true
>>>>>>>>>>>>> Preparing the environment for Debugger Extensions Gallery repositories completed, duration 0.000 seconds
************* Waiting for Debugger Extensions Gallery to Initialize **************
>>>>>>>>>>>>> Waiting for Debugger Extensions Gallery to Initialize completed, duration 0.031 seconds
----> Repository : UserExtensions, Enabled: true, Packages count: 0
----> Repository : LocalInstalled, Enabled: true, Packages count: 41
Microsoft (R) Windows Debugger Version 10.0.27553.1004 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\Users\Administrator\AppData\Local\Temp\2\Universal.Server.DMP]
User Mini Dump File with Full Memory: Only application data is available
************* Path validation summary **************
Response Time (ms) Location
Deferred srv*
Symbol search path is: srv*
Executable search path is:
Windows 10 Version 20348 MP (4 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Edition build lab: 20348.1.amd64fre.fe_release.210507-1500
Debug session time: Wed Apr 10 11:12:16.000 2024 (UTC - 4:00)
System Uptime: 0 days 1:57:16.861
Process Uptime: 0 days 1:57:09.000
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
......................................
Loading unloaded module list
................................................................
For analysis of this file, run !analyze -v
ntdll!NtWaitForMultipleObjects+0x14:
00007ffb`a543ffa4 c3 ret
0:000> !analyze -v
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
*** WARNING: Unable to verify checksum for Hangfire.Core.dll
*** WARNING: Unable to verify checksum for Hangfire.MemoryStorage.dll
*** WARNING: Unable to verify checksum for host.dll
KEY_VALUES_STRING: 1
Key : Analysis.CPU.mSec
Value: 10093
Key : Analysis.Elapsed.mSec
Value: 11814
Key : Analysis.IO.Other.Mb
Value: 0
Key : Analysis.IO.Read.Mb
Value: 1
Key : Analysis.IO.Write.Mb
Value: 0
Key : Analysis.Init.CPU.mSec
Value: 796
Key : Analysis.Init.Elapsed.mSec
Value: 2495
Key : Analysis.Memory.CommitPeak.Mb
Value: 265
Key : CLR.Engine
Value: CORECLR
Key : CLR.Version
Value: 7.0.222.60605
Key : Failure.Bucket
Value: BREAKPOINT_80000003_coreclr.dll!Thread::DoAppropriateWaitWorker
Key : Failure.Hash
Value: {2dbd2c7d-1f72-d894-343d-110a7c82a3a2}
Key : Failure.Source.FileLine
Value: 3514
Key : Failure.Source.FilePath
Value: D:\a\_work\1\s\src\coreclr\vm\threads.cpp
Key : Failure.Source.SourceServerCommand
Value: raw.githubusercontent.com/dotnet/runtime/d037e070ebe5c83838443f869d5800752b0fcb13/src/coreclr/vm/threads.cpp
Key : Timeline.OS.Boot.DeltaSec
Value: 7036
Key : Timeline.Process.Start.DeltaSec
Value: 7029
Key : WER.OS.Branch
Value: fe_release
Key : WER.OS.Version
Value: 10.0.20348.1
Key : WER.Process.Version
Value: 1.0.0.0
FILE_IN_CAB: Universal.Server.DMP
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 0000000000000000
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 0
FAULTING_THREAD: 00000acc
PROCESS_NAME: Universal.Server.dll
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.
EXCEPTION_CODE_STR: 80000003
STACK_TEXT:
00000021`c297db88 00007ffb`a2b9436c : 00007ffb`3bc35970 00007ffb`9a182461 00000213`47000030 00000021`c297dc60 : ntdll!NtWaitForMultipleObjects+0x14
00000021`c297db90 00007ffb`9a1a5b3c : 00000000`00000000 00000000`00000000 00000000`00000130 00000000`00000000 : KERNELBASE!WaitForMultipleObjectsEx+0xec
00000021`c297de80 00007ffb`9a1a5979 : 00007ffb`00000001 00007ffb`00000001 00000000`00000001 00007ffb`00000000 : coreclr!Thread::DoAppropriateWaitWorker+0x184
00000021`c297df60 00007ffb`9a1a39d8 : 00000000`00000000 00000000`00000001 00000000`00000000 00000213`42879500 : coreclr!Thread::DoAppropriateWait+0x89
00000021`c297dfe0 00007ffb`9a1a37ad : 00000000`00000000 00007ffb`ffffffff 00000213`4c9b6990 00000213`428793d0 : coreclr!SyncBlock::Wait+0x1c8
00000021`c297e100 00007ffb`9767e9f9 : 00000000`ffffffff 00000213`570b69f8 00000000`00000000 00000213`570b68e8 : coreclr!ObjectNative::WaitTimeout+0xcd
00000021`c297e280 00007ffb`9769512d : 00000213`4c9b6880 00000000`ffffffff 00000000`00000000 00000000`00000006 : System_Private_CoreLib!System.Threading.ManualResetEventSlim.Wait+0x189
00000021`c297e320 00007ffb`97694f22 : 00000213`4c9b67a8 00000000`ffffffff 00000000`00000000 00000000`00000006 : System_Private_CoreLib!System.Threading.Tasks.Task.SpinThenBlockingWait+0x9d
00000021`c297e3a0 00007ffb`976e8de2 : 00000213`570b6810 00000000`ffffffff 00000000`00000000 00000000`00000006 : System_Private_CoreLib!System.Threading.Tasks.Task.InternalWaitCore+0x72
00000021`c297e420 00007ffb`94b838b2 : 00000213`570b6810 00000000`00000000 00000000`00000000 00000000`00000006 : System_Private_CoreLib!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification+0x22
00000021`c297e450 00007ffb`3a783b7b : 00000213`570b5ba8 00000000`00000000 00000000`00000000 00000000`00000001 : Microsoft_Extensions_Hosting_Abstractions!Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.Run+0x32
00000021`c297e480 00000213`570b5ba8 : 00000000`00000000 00000000`00000000 00000000`00000001 00000021`c297e480 : Universal_Server!Universal.Server.Program.<>c__DisplayClass3_0.<Main>b__0+0x18b
00000021`c297e488 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00000213`570b5ba8
STACK_COMMAND: ~0s; .ecxr ; kb
FAULTING_SOURCE_LINE: D:\a\_work\1\s\src\coreclr\vm\threads.cpp
FAULTING_SOURCE_FILE: D:\a\_work\1\s\src\coreclr\vm\threads.cpp
FAULTING_SOURCE_LINE_NUMBER: 3514
FAULTING_SOURCE_SRV_COMMAND: https://raw.githubusercontent.com/dotnet/runtime/d037e070ebe5c83838443f869d5800752b0fcb13/src/coreclr/vm/threads.cpp
FAULTING_SOURCE_CODE:
No source found for 'D:\a\_work\1\s\src\coreclr\vm\threads.cpp'
SYMBOL_NAME: coreclr!Thread::DoAppropriateWaitWorker+184
MODULE_NAME: coreclr
IMAGE_NAME: coreclr.dll
FAILURE_BUCKET_ID: BREAKPOINT_80000003_coreclr.dll!Thread::DoAppropriateWaitWorker
OS_VERSION: 10.0.20348.1
BUILDLAB_STR: fe_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
IMAGE_VERSION: 7.0.222.60605
FAILURE_ID_HASH: {2dbd2c7d-1f72-d894-343d-110a7c82a3a2}
Followup: MachineOwner
---------
adam
April 10, 2024, 3:53pm
We changed very little between those versions, so I would imagine it could be happening in v4.2.16 as well. I actually did a video about how to diagnose this stuff. You can get a listing of the object types that are consuming the memory.
If you have access to Visual Studio Enterprise, there is also the Debug Managed Memory feature. Open the dump file in Visual Studio, click Debug Managed Memory, and it will do what WinDbg is doing in the video.
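If you don’t have VS Enterprise handy, a rough equivalent with the free dotnet-dump tool looks like this (the same SOS commands also work in WinDbg after loading SOS; treat this as a sketch, not exact steps for your dump):

# Open an interactive analysis session against the dump
dotnet-dump analyze .\Universal.Server.DMP

# At the analysis prompt, these SOS commands summarize the managed heap:
#   dumpheap -stat              (object counts and total bytes, grouped by type)
#   dumpheap -type <TypeName>   (list instances of a suspicious type)
#   gcroot <address>            (show what is keeping a given instance alive)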
If you want me to take a look at it, feel free to send me a link to where I can download it.
@adam Thanks. I’ll watch that in a bit. I’m compressing the DMP file right now, as it’s currently 9.5 GB, and then I’ll send you a link to grab it.
@adam Here is a link to the DMP file, compressed with 7-Zip: 628.45 MB file on MEGA
The link will expire on 04/17/2024.
adam
April 10, 2024, 5:05pm
There are a lot of PSObjects pinned from job executions. I’ll see if I can reproduce this locally and find out what’s happening.
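For anyone following along, a hedged sketch of how that can be confirmed in the same dotnet-dump analyze / WinDbg+SOS session (the type filter below is simply the one that stood out in this dump):

# Inside the analysis session:
#   gchandles -stat                                               (GC handle counts by type, including Pinned)
#   dumpheap -stat -type System.Management.Automation.PSObject    (how many PSObjects are on the heap)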
Sounds good. Thank you. I added -RunspaceRecycling to each of the environments to see if it’ll help anything, since I saw you mention enabling it for a previous release that had some memory leak issues.
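For anyone else trying this, the change amounts to adding the switch to each entry in environments.ps1; a minimal sketch (the names and paths here are illustrative, not copied from my config):

# environments.ps1 - enable runspace recycling per environment
New-PSUEnvironment -Name 'PowerShell 7' -Path 'pwsh.exe' -RunspaceRecycling
New-PSUEnvironment -Name 'Windows PowerShell 5.1' -Path 'powershell.exe' -RunspaceRecycling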
adam
April 10, 2024, 5:32pm
Can you let me know which environment you are using for jobs (integrated, PS7, WinPS)?
@adam They’re all set to use the Default environment, and that’s set to Integrated. At least, the Scheduled jobs are. There are some scripts set to use PowerShell 7, but those are not being actively used yet.
adam
April 10, 2024, 8:02pm
I’ve been running jobs for the last few hours and see no noticeable change in memory usage. Let me dig around in the dump some more to see if I can see why all this is pinned.
Did it again overnight/this morning.
adam
April 11, 2024, 1:27pm
Can you try running in the non-integrated environment for these jobs? It will start a new PS process per job and any memory will be cleaned up.
So, use Agent, for example?
adam
April 11, 2024, 1:37pm
Yep. That uses the same PS version as Integrated.
Okay. I changed the default to be Agent so I didn’t have to touch each script individually (since we aren’t really using anything other than “Default”). I’ll keep an eye on it through the day to see if it fixed/helped anything.
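(For reference, pinning an individual script instead would have looked roughly like this in scripts.ps1; the script name is just an example, not one of ours:)

# scripts.ps1 - run a single script in its own agent process instead of the integrated runspace
New-PSUScript -Name 'Nightly-Maintenance.ps1' -Environment 'Agent'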
@adam Well, I was hopeful that changing the environment had fixed it, but I got into the office and saw that the memory issue had returned.
I think I found the culprit. There’s a scheduled job that runs a script at 2AM every day. The script is fairly large and takes about an hour to complete if you run it manually. When I killed the agent child process for jobs, I could then see a failed job entry in PSU for that script.
Since I paused the schedule that runs this script, the server hasn’t shown any abnormal memory usage.
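(For context, the schedule in question is just an ordinary daily cron entry; roughly, in schedules.ps1, with an illustrative script name:)

# schedules.ps1 - daily 2AM schedule (cron fields: minute hour day month weekday)
New-PSUSchedule -Script 'Nightly-Maintenance.ps1' -Cron '0 2 * * *'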