REST API Failures

We have developed a suite of API endpoints. We have a small number of servers making scheduled calls to the endpoints. Too often, the API call immediately returns an exception: “The underlying connection was closed. An unexpected error occurred on a send.”

The same API calls work much of the time, but fail too often with this error. Watching the API log shows that the failed calls never reach the endpoint script; they are not present in the endpoint’s log even as failed calls.

We’re using IIS to host the API. What limitations could we be reaching to cause this instant rejection? What additional information can I obtain from… somewhere to determine why the “underlying connection was closed”?

Product: PowerShell Universal
Version: 3.9.15

I would increase the log level on the PSU server to debug or verbose so you can get the HTTP request information out of the logs. With IIS, you will also want to review the IIS logs; there may be something IIS-related causing this.

How frequently are you calling the APIs? And how quickly do they typically return to the client?

I did do a blog post about PSU API performance here: Optimizing API Performance in PowerShell Universal

It’s based on v4 but much of it applies to v3 as well. There is actually less logging in v3 by default so I would expect it to be faster out of the box.

Thanks for your response, Adam. We have Azure’s Application Insights hooked up to PU and it shows we are averaging a bit under 1 call per second over a 24-hour period. About 77000 total. There are about 20 API endpoints total (about half discovery and half collection scripts, the latter called more often naturally). As I said, evidence shows these failures don’t get as far as running the endpoint script.

Our monitoring application makes the Invoke-RestMethod calls. The failures return very quickly; it is not a timeout situation. I’ve taken the exact same I-RM call and run it manually from a PowerShell session 50 times in a row without a single failure. The same call from within the monitoring application can fail over 50% of the time with the dropped-connection exception. I’m aware there are environmental differences between my testing client and the problematic production client; I’ll test further to remove those differences.
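For context, my manual test was essentially this loop (the URI and header below are placeholders, not our real values):

# Repeat the same Invoke-RestMethod call and count how many attempts fail.
$uri      = 'https://lmproxy.example.com/logicmonitor/collection/iisworkerprocess/targetserver'
$headers  = @{ Authorization = 'Bearer <token>' }   # placeholder auth
$failures = 0

foreach ($i in 1..50) {
    try {
        Invoke-RestMethod -Uri $uri -Headers $headers -Method Get | Out-Null
    }
    catch {
        $failures++
        Write-Warning "Call $i failed: $($_.Exception.Message)"
    }
}
"Failures: $failures / 50"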

I’ve struggled to use PU logging because it’s often full of the same single exception (one I’ve asked about here already) and devoid of useful content. But I will change its settings and observe. I’ve checked out your blog already but will give it another look.

Primary mystery is what criteria can cause the instant rejection of the I-RM call from a client to PU? A PowerShell session limit? An AppPool/IIS problem? (I’ll look again at IIS logs, but haven’t found anything in prior efforts.)

FYI hardware resources like RAM and CPU are not close to stressed on PU server.

We have Azure’s Application Insights hooked up to PU and it shows we are averaging a bit under 1 call per second over a 24-hour period.

Oh, yeah. It should handle that no problem.

Primary mystery is what criteria can cause the instant rejection of the I-RM call from a client to PU? A PowerShell session limit?

Based on what you are describing, it kind of seems like it’s not even running PowerShell yet. It could also be that it’s attempting to return a result and the connection is being closed by the client or an intermediary.

An AppPool/IIS problem? (I’ll look again at IIS logs, but haven’t found anything in prior efforts.)

Certainly could be. I’m not 100% sure what to look for in those logs but it seems like a network thing.

I’ve struggled to use PU logging because it’s often full of the same single exception (one I’ve asked about here already) and devoid of useful content. But I will change its settings and observe. I’ve checked out your blog already but will give it another look.

If you end up getting a verbose log, I can skim through it. That said, this is a really generic network error: The underlying connection was closed. An unexpected error occurred on a send.

It could be something like firewalls or some intrusion detection system closing the network connection. I’m not sure if you have anything like that involved. Are you running this all locally or is this going over a VPN\internet? Any proxy servers involved?

Little more detail on architecture:

We’re using LogicMonitor, which launches our PowerShell monitoring scripts from a couple of Windows servers. The monitoring scripts would normally make Get-CimInstance or similar calls to remote servers to obtain monitoring metrics. In this case, those scripts are simply calling API endpoints on our single PU server. The endpoint scripts use Invoke-Command to send small scriptblocks to monitored servers to obtain metrics. That data is returned as JSON up the chain to LogicMonitor for processing and display.
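For illustration, the collection endpoints look roughly like this (the route, names, and scriptblock are simplified, not the exact production code):

# Simplified sketch of a collection endpoint: run a small scriptblock on the target
# server via Invoke-Command and hand the result back as JSON.
New-PSUEndpoint -Url "/logicmonitor/collection/service/:service/:computer" -Method @('GET') -Endpoint {
    param($Service, $Computer)

    $data = Invoke-Command -ComputerName $Computer -ScriptBlock {
        param($svc)
        Get-Service -Name $svc | Select-Object Name, Status
    } -ArgumentList $Service

    $data | ConvertTo-Json -Compress
}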

For many fail cases, the LogicMonitor client-side servers, the PU server, and the monitoring target server are all running within Azure on the same VNET or directly connected VNETs.

I’ll investigate network disruption. But the sporadic, Succeed Succeed Succeed Fail Fail Fail Succeed nature of the calls makes me doubt a consistent network-based interruption is involved.

You got that right about this being a generic issue. Asking ChatGPT about it gives me a flashback to the TV show “The IT Crowd,” with it basically asking me over and over to confirm my TLS settings (the equivalent of “turn it off and on again”).

Prior to turning on DEBUG logging, the log file is full of this exception. I’m wondering if this exception is representing the failed API calls I’m seeing. I’ve turned on DEBUG logging and am still collecting and examining. But I’m curious if this exception which fills the log is relevant:

2023-10-17 10:54:34.497 -04:00 [ERR] An unhandled exception has occurred while executing the request.
Grpc.Core.RpcException: Status(StatusCode="Unknown", Detail="Stream removed", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1697554474.492000000","description":"Error received from peer ipv6:[::1]:49792","file":"…....\src\core\lib\surface\call.cc","file_line":953,"grpc_message":"Stream removed","grpc_status":2}")
   at ProtoBuf.Grpc.Internal.Reshape.UnaryTaskAsyncImpl[TRequest,TResponse](AsyncUnaryCall`1 call, MetadataContext metadata, CancellationToken cancellationToken) in /_/src/protobuf-net.Grpc/Internal/Reshape.cs:line 300
   at Universal.Server.Services.ApiGrpc.ExecuteAsync(ApiRequest request, CancellationToken cancellationToken) in C:\actions-runner\_work\universal\universal\src\Universal.Server\Services\API\ApiGrpc.cs:line 148
   at Universal.Server.Services.ApiService.ExecuteAsync(HttpContext httpContext) in C:\actions-runner\_work\universal\universal\src\Universal.Server\Services\API\ApiService.cs:line 468
   at Universal.Server.Middleware.RoutingMiddleware.Invoke(HttpContext httpContext, IPolicyEvaluator policyEvaluator) in C:\actions-runner\_work\universal\universal\src\Universal.Server\Middleware\RoutingMiddleware.cs:line 138
   at Swashbuckle.AspNetCore.SwaggerUI.SwaggerUIMiddleware.Invoke(HttpContext httpContext)
   at Swashbuckle.AspNetCore.Swagger.SwaggerMiddleware.Invoke(HttpContext httpContext, ISwaggerProvider swaggerProvider)
   at Universal.Server.Middleware.SwaggerAuthenticationMiddleware.InvokeAsync(HttpContext context, RequestDelegate next) in C:\actions-runner\_work\universal\universal\src\Universal.Server\Middleware\SwaggerAuthMiddleware.cs:line 37
   at Microsoft.AspNetCore.Builder.UseMiddlewareExtensions.<>c__DisplayClass6_1.<<UseMiddlewareInterface>b__1>d.MoveNext()
--- End of stack trace from previous location ---
   at AspNetCoreRateLimit.RateLimitMiddleware`1.Invoke(HttpContext context) in C:\actions-runner\_work\universal\universal\src\AspNetCoreRateLimit\Middleware\RateLimitMiddleware.cs:line 109
   at Microsoft.AspNetCore.Session.SessionMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Session.SessionMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware.g__Awaited|6_0(ExceptionHandlerMiddleware middleware, HttpContext context, Task task)

Here’s what looks like the DEBUG logging for one typical API call. I can’t tell whether it’s a successful or failed call.

2023-10-17 11:14:36.498 -04:00 [INF] Request starting HTTP/1.1 GET https://lmproxy.asicentral.com/logicmonitor/collection/iisworkerprocess/e2-webun01-01.asiadp04.asinetwork.com - -
2023-10-17 11:14:36.498 -04:00 [DBG] No response compression available, the Accept-Encoding header is missing or invalid.
2023-10-17 11:14:36.498 -04:00 [DBG] The request path /logicmonitor/collection/iisworkerprocess/e2-webun01-01.asiadp04.asinetwork.com does not match a supported file type
2023-10-17 11:14:36.498 -04:00 [DBG] The request path does not match the path filter
2023-10-17 11:14:36.498 -04:00 [DBG] No candidates found for the request path '/logicmonitor/collection/iisworkerprocess/e2-webun01-01.asiadp04.asinetwork.com'
2023-10-17 11:14:36.499 -04:00 [DBG] Request did not match any endpoints
2023-10-17 11:14:37.817 -04:00 [INF] Request finished HTTP/1.1 GET https://lmproxy.asicentral.com/logicmonitor/collection/iisworkerprocess/e2-webun01-01.asiadp04.asinetwork.com - - - 200 - application/json 1319.0740ms

In the hope that I haven’t lost an expert’s attention, I wanted to ask whether a too-low limit on the PowerShell environment’s runspaces would result in endpoint calls coming back with this error (“underlying connection was closed”).

This seems like the API process is not responding.

2023-10-17 10:54:34.497 -04:00 [ERR] An unhandled exception has occurred while executing the request.
Grpc.Core.RpcException: Status(StatusCode="Unknown", Detail="Stream removed", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1697554474.492000000","description":"Error received from peer ipv6:[::1]:49792","file":"…....\src\core\lib\surface\call.cc","file_line":953,"grpc_message":"Stream removed","grpc_status":2}")

What version of PowerShell are you running your APIs under? It may be that you don’t have enough runspaces but you’re not calling it that frequently. I know you said the RAM\CPU aren’t stressed but what does the API process look like? Is it just idle?

PowerShell for the APIs is running externally with 7.3.6. I’ve tried a default configuration and one with 100 runspaces configured. Both have persistent runspaces configured.

We certainly get a ton of successful API responses. We’re just getting way too many dropped calls. The API’s pwsh process CPU usage is usually quite low, but it’s definitely doing work: typically between 0.5% and 3%.

Currently, our monitoring product gets so many dropped-connection errors from the API calls that the solution is not viable unless we can determine what is happening.

Ok. After doing some more research it looks like we can get some more logging out of that particular error. It also seems like implementing retries in our application code could work around this gRPC error.
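On the client side, wrapping the call in a simple retry may also mask the dropped connections in the meantime. A rough sketch (the function name and defaults are illustrative, not part of PSU):

# Retry Invoke-RestMethod a few times before giving up on a dropped connection.
function Invoke-RestMethodWithRetry {
    param(
        [string]$Uri,
        [hashtable]$Headers = @{},
        [int]$MaxAttempts = 3,
        [int]$DelaySeconds = 2
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return Invoke-RestMethod -Uri $Uri -Headers $Headers -Method Get
        }
        catch {
            if ($attempt -eq $MaxAttempts) { throw }   # out of retries, surface the error
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}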

I’ll see if we can get logging out of the stack without code change but if I need to change anything to get more information, you may need to perform an upgrade to do so. I’ll update this post when I have more info on progress.

EDIT: We had to make a change to enable logging. It will be controlled via the MicrosoftLogLevel in settings.ps1 or the settings \ diagnostics page. Debug level will output all gRPC communication.
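In settings.ps1 that will look something like this (assuming the new setting surfaces as a Set-PSUSetting parameter of the same name; double-check the cmdlet help once you’re on the new build):

# Raise PSU and Microsoft/gRPC logging to Debug. Parameter names are assumed; verify with Get-Help Set-PSUSetting.
Set-PSUSetting -LogLevel 'Debug' -MicrosoftLogLevel 'Debug'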

This will be included in tonight’s nightly build as well as 3.9.20 and 4.1.7

Thank you, Adam.

I did update to the latest 4.x already, so I’ll get the point release running ASAP and start logging.

I have logs captured with the latest version of PU.

I’m happy to provide logs for you to glance through, but first wanted to ask about this frequently logged result that seems like a problem or a false error.

2023-10-23 10:04:08.222 -04:00 [DBG] No candidates found for the request path '/logicmonitor/collection/service/MSSQLSERVER/servername.asinet.local'
2023-10-23 10:04:08.223 -04:00 [DBG] Request did not match any endpoints

I’ve obfuscated the specifics a bit, but this message shows up for a large number of API request endpoints. The endpoint is valid. The last two chunks are endpoint parameters: /logicmonitor/collection/service/$x/$y

Is this type of logging message indicating a failed API call?

While Adam has moved on, I thought I’d provide a status update and attempt again to get some advice.

The API endpoints are responding with a 400 and “Failed to get runspace” about 25% of the time. We make about 250,000 calls a day. Spread out pretty evenly. 75% successful.

I have adjusted the maximum runspace value many times and have not been able to improve the failure rate. The failures are random and occur on any endpoint.

Logging shows no information at all on the failure. It reports 400 response and nothing else. There is no exception to surface from my endpoint script. The runspace failure occurs before endpoint execution.

IIS logs show the 400 failure, “Bad Request”, and “Failed to get runspace”. Failed Request Tracing shows ASPNetCoreModuleV2 returning the error. No additional information.

I have been trying to monitor available runspaces or current runspace utilization but this is not possible with PowerShell AFAIK. So I can’t monitor the condition that leads to a runspace shortage, or whether we are approaching one, or whether we can change the cap (whether MaxRunspaces, in fact, does anything at all).

Looking for thoughts and well-wishes.


Seeing this in the PowerShellCore event log over and over. Curious why these are set to 1, and where New-PSUEnvironment’s MaxRunspaces fits in, if it does at all. My environment creation call is below.

Creating RunspacePool object
InstanceId a0b2a4e5-7e41-455b-86da-d1afb4a0c1f8
MinRunspaces 1
MaxRunspaces 1

New-PSUEnvironment -Name "API_RunSpaces_50" -Version "7.3.8" -Path "C:\Program Files\PowerShell\7\pwsh.exe" -Variables @('*') -MaxRunspaces 50 -RunspaceRecycling -HighPerformanceRunspacePool -Description "Recycling and HighPerformance (50)."

We don’t use runspace pools directly in PSU. We have our own implementation that allows for more customization, and that’s why you’re seeing max\min set to 1.

I would recommend trying to see what is currently executing in the process. The code that is failing internally is this. The max runspace setting changes the semaphore count: if an API call waits more than 5 seconds without a runspace becoming available, you get exactly the error you are seeing.

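                // Wait up to 5 seconds for a free runspace; if none becomes available, the request fails with this exception.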
                if (!_runspaceSemaphore.Wait(5000))
                {
                    throw new Exception("Failed to get runspace.");
                }

I created an example endpoint that can return the current runspace state. It would also be possible to run this directly on the PSU server by using Enter-PSHostProcess and running the Get-Runspace command over remoting.

New-PSUEndpoint -Url "/runspace-info" -Method @('GET') -Endpoint {
    # Return one record per runspace in the API process: its name, whether it is
    # busy, and (if busy) the call stack of whatever it is currently running.
    Get-Runspace | ForEach-Object {
        @{
            Name         = $_.Name
            Availability = $_.RunspaceAvailability.ToString()
            CallStack    = $_.Debugger.GetCallstack()
        }
    }
}

What is returned is this. Each object has the name and runspace availability. If it’s busy, you’ll see a call stack.

CallStack    : {@{ScriptName=; ScriptLineNumber=2; InvocationInfo=; Position=; FunctionName=<ScriptBlock>},
               @{ScriptName=; ScriptLineNumber=1; InvocationInfo=; Position=; FunctionName=<ScriptBlock>}}
Availability : Busy
Name         : Runspace121

CallStack    : {}
Availability : Busy
Name         : Runspace126

$RS.CallStack.InvocationInfo | fl *

Accessing the CallStack.InvocationInfo property will display what’s happening in the runspace. You can see below that MyCommand is set to Start-Sleep 10.

MyCommand             : @{ScriptBlock=
                            Start-Sleep 10
                        ; Definition=
                            Start-Sleep 10
                        ; OutputType=; Name=; CommandType=64; Source=; Version=; Visibility=0; ModuleName=; Module=;
                        RemotingCapability=1; Parameters=; ParameterSets=}
BoundParameters       :
UnboundArguments      : {}
ScriptLineNumber      : 1
OffsetInLine          : 1
HistoryId             : 1
ScriptName            :
Line                  : & $PSUSB
PositionMessage       : At line:1 char:1
                        + & $PSUSB

I would recommend trying to turn off RunspaceRecycling as it will periodically dispose of runspaces after they have been used a number of times. HighPerformanceRunspacePools were deprecated in v3 and the switch doesn’t actually do anything.

Additionally, we use MiniProfiler throughout the Runspace factory so if you switch to the integrated environment you can get timings for your endpoints and allocations in the factory.

            using (MiniProfiler.Current.Step("UDRunspaceFactory.GetRunspace"))
            {
                // ... runspace allocation is timed inside this step ...
            }

If you want priority on things like this in the future, make sure to open a case by emailing support[@]ironmansoftware.com


Thanks for that response. I’m able to use your suggestion directly on the PU server as you described with Enter-PSHostProcess. I couldn’t get it to succeed in an endpoint, likely due to the size of the output or the load. The results, however, don’t seem to give me insight.
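For reference, the steps on the PU server were roughly these, run interactively from a pwsh console (the PID is a placeholder; I picked the pwsh process hosting the API environment):

# Find the API's pwsh process, attach to it, and list its runspaces.
Get-Process -Name pwsh | Select-Object Id, StartTime, CPU

Enter-PSHostProcess -Id 12345    # placeholder PID for the API process
Get-Runspace | Select-Object Name, RunspaceAvailability
Exit-PSHostProcess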

You wrote: “The max runspace setting changes the semaphore count. If an API call waits for more than 5 seconds before a runspace is available, what you are seeing will happen.”

Does the max runspace setting change the 5 seconds? That part isn’t clear to me. What is the effect, on an unburdened PU server, of setting max runspaces higher and higher? Will it lead to fewer failures to “get one”, or to more instances of runspace access failing after 5 seconds?

Yes, I can see that some runspaces are running my endpoint code. But I’m not sure how this helps me. My issue is that around 25% of the time, the calls return the runspace exception. In those cases, the endpoint code is not run, no?

I’m not looking to discover long-running or problematic endpoint scripts, as there are none. I’m looking to configure PU to provide adequate PowerShell resources to cover my API calls, and wondering if there’s a cap, or a negative consequence (other than more memory usage), of a high max runspace value. Changing it hasn’t seemed to impact my failure rate.

FYI: Changed environment creation to:
New-PSUEnvironment -Name "API_Persistent_100" -Version "7.3.8" -Path "C:\Program Files\PowerShell\7\pwsh.exe" -Variables @('*') -PersistentRunspace -MaxRunspaces 100 -Description "Persistent (100)."

On an unburdened environment it will have no effect. If the server is receiving a single API request every second and is able to process that request in that time, it will not allocate any more runspaces. As soon as a second request comes in before the first is processed, it will allocate another runspace to the pool, up to the max runspace setting. The timeout does not change. Once the max number of runspaces has been allocated, it will stop allocating them.

Yes, I can see that some runspaces are running my endpoint code. But I’m not sure how this helps me. My issue is that around 25% of the time, the calls return the runspace exception. In those cases, the endpoint code is not run, no?

In the error case, all the runspaces must be in use. The code is not run because no runspace is currently available to process it. That said, I know you are saying that there are no long running processes or blocked runspaces. I’ll see if we can introduce some sort of additional diagnostics or configuration (e.g. timeout value) to try to see what is happening. I think reporting runspace exhaustion would be ideal here because we could represent that at a server level rather than only at the individual API level.

That said, how long does a typical API call take to return when it doesn’t fail?

Obviously, I cannot reproduce this; our testing typically sees over 1,000 requests per second, while on average you are only sending about 3 per second. I’ll continue to test to see if I can reproduce anything similar, but I haven’t been successful so far.

I’m not looking to discover long-running or problematic endpoint scripts, as there are none. I’m looking to configure PU to provide adequate PowerShell resources to cover my API calls, and wondering if there’s a cap, or a negative consequence (other than more memory usage), of a high max runspace value. Changing it hasn’t seemed to impact my failure rate.

Increasing the number of runspaces available just increases the pool size. Eventually, other limitations on CPU and memory will come into play. I think the code sample I provided would need to be run periodically to capture the current state of the runspace pool to see if it’s actually exhausted.
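Something like this polling loop against the /runspace-info endpoint from the example above could capture that state over time (the base URL, interval, and log path are placeholders):

# Poll the runspace-info endpoint and append one line per sample to a log file.
while ($true) {
    try {
        $runspaces = Invoke-RestMethod -Uri 'https://psu.example.com/runspace-info'
        $busy = @($runspaces | Where-Object Availability -eq 'Busy').Count
        "{0:o}  busy={1}  total={2}" -f (Get-Date), $busy, @($runspaces).Count |
            Add-Content -Path .\runspace-usage.log
    }
    catch {
        "{0:o}  poll failed: {1}" -f (Get-Date), $_.Exception.Message |
            Add-Content -Path .\runspace-usage.log
    }
    Start-Sleep -Seconds 10
}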


Thanks Adam, that clarifies the underlying logic nicely.

Quick response to your questions:
(Brief reminder: all my endpoints do something similar: construct a simple scriptblock, run it on another server via Invoke-Command, and return simple text back.)

Best case, these scripts tend to take a couple of seconds to finish. But in our current situation (contention, etc.) our average call time is now 49 seconds. This average includes client-side timeouts configured at 2 minutes, at which point the client closes the call. I’ve also configured API endpoint timeouts of 90 seconds, but they don’t seem to take effect; I still see 2-minute calls.

Increasing MaxRunspaces to 100 or more tends to increase the number of calls that hit the 2-minute timeout trigger on the client.

I will do further testing with your code sample and perhaps try out the Profiler.

API calls of this duration could very easily result in what you are seeing. One suggestion would be to consider changing how this API works. For example, the client would make a call that starts the data collection jobs using Invoke-PSUScript, and those jobs would then run in the background.

This API call would return the job ID so the client could check the status on a second API endpoint.

Some benefits of this include short API calls, so you don’t overload the stack, and job queueing, so you can limit how many instances of a script are running. The downside would be a more complicated client interaction, which may not be possible depending on what you have interfacing with this.
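A minimal sketch of that pattern, assuming a collection script named CollectMetrics.ps1 (the URLs, parameters, and job cmdlet usage are illustrative; check Invoke-PSUScript and Get-PSUJob in your version’s help):

# Start the collection as a background job and return the job ID to the client.
New-PSUEndpoint -Url "/collect/start/:computer" -Method @('POST') -Endpoint {
    param($Computer)
    $job = Invoke-PSUScript -Script 'CollectMetrics.ps1' -Computer $Computer
    @{ JobId = $job.Id } | ConvertTo-Json
}

# The client polls this with the job ID until the job completes, then reads the output.
New-PSUEndpoint -Url "/collect/status/:id" -Method @('GET') -Endpoint {
    param($Id)
    $job = Get-PSUJob -Id $Id
    $result = @{ Status = $job.Status.ToString() }
    if ($job.Status -eq 'Completed') {
        $result.Output = Get-PSUJobPipelineOutput -Job $job
    }
    $result | ConvertTo-Json -Depth 5
}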