-
Notifications
You must be signed in to change notification settings - Fork 862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Response unmarshalling code can potentially block forever #1870
Comments
This change dotnet/corefx#36516 plumbed through cancellation token support into underlying socket |
Hi @flakey-bit, Good afternoon. Thanks for posting the issue. Please let me know if it would be possible for you to share sample code solution which would help us reproduce/troubleshoot the issue. Thanks, |
Hi @ashishdhingra , I've pushed a repo containing a reproduction here: https://github.com/flakey-bit/aws-sqs-sdk-blocking-repro Client code makes a call to Server code is set up to return the headers but never return the content body. Under this scenario, the client will block indefinitely. This may seem far-fetched, however we've seen this happen in the wild. |
@ashishdhingra let me know if you need any further information. |
@flakey-bit Thanks for sharing the information. I will discuss with development team to get this triaged. |
When you see this hanging behavior is it when your code is running in AWS, outside of AWS or doesn't matter? I would want to avoid coping into a buffer as a last resort as I would like to see the SDK reduce the amount of memory allocations it is doing and that would greatly increase it. |
We saw this with code running in AWS in our production environment. .NET Core 3.1 process running in a Linux container under AWS ECS. Our ECS cluster uses AWS EC2 instances all in the same region as SQS (us-east-1)
Agree in principal 👍 , that was only one possible proposed solution. If you wanted to get fancy, something like Microsoft.IO.RecyclableMemoryStream could help. |
@danielmarbach sorry for the tag, but thought you might be interested as you opened #796 |
No worries. Good to know. RecycableMemoryStream is definitely a solution or using the shared array pool and slicing the buffers accordingly. |
Hi @ashishdhingra AWS, any idea when this issue will be looked into? |
Thanks @flakey-bit for repro. Here is my investigation The behavior of plain HTTP client and .NET SDK for AWS is same, i.e. on calling Looking suggested resolutions, buffering the response in a stream will not work if the server is not writing on the response stream. Read operation on the stream will wait infinitely. Also, as @normj pointed correctly, this is not ideal for memory footprint especially for larger responses. Second option is interesting, Microsoft has introduced an overload on ReadAsStreamAsync (https://docs.microsoft.com/en-us/dotnet/api/system.net.http.httpcontent.readasstreamasync?view=net-5.0#System_Net_Http_HttpContent_ReadAsStreamAsync_System_Threading_CancellationToken_) which can be option for us but it is only available for .NET 5.0 or above. This will allow cancellation during stream read. We have backlog item to see how this will work and given .NET 6.0 (LTS) also around which may also affect the prioritization. That said, I'm curious on the root cause of the problem, in what particular scenario |
I'd say the simplest thing that could possibly work would be to change the implementation to use Presumably the reason we've ended up in this situation is due to code-reuse in the SDK - i.e. the low level code for making HTTP calls in the SDK is used not just by the As far as I can tell, there's no obvious benefit to using So yeah - I'd suggest refactoring so that
We don't have a reproduction against actual AWS SQS unfortunately, however we've seen it occur sporadically in our production environment. Presumably it's due to a NAT / proxy / firewall / gateway or similar piece of infrastructure misbehaving. |
The benefit you will get from From the file history, It has been like this since beginning. Changing From here, I only see a change for .NET 5 or above as a viable solution. |
I'm definitely not suggesting we change to using I see what you're saying about a behaviour change (although I'd argue it's changing the behaviour to what most reasonable people would expect). If you really want to avoid a default behaviour change, you can introduce a configuration option (default public bool WaitForContent { get; set; } If |
That's fair. it can be an opt-in only. |
Awesome @ganeshnj - what are the next steps? |
The backlog item has been updated. Next step: it will be groomed and prioritized. |
I know this is all info that's been mostly captured here already but consider it my way of adding a +1 😄 I tried let the .net team know about this issue (with reproduction even) quite a while ago in dotnet/runtime#34164 (comment) and reported it again in dotnet/runtime#31259 but they were not keen to address the underlying issue. The long and short of it is that:
A repro I created can be found here (though in the 2 years since I reported the issue the open socket connection component seems to have stopped working) https://dotnetfiddle.net/UpFwOD As for fixing the issue... So for fixing it for all framework versions would be to add a cancellation token registration that forcibly closes the stream or discards the http response, which is the most reliably way to tell .net to stop waiting for a response on the socket. Finally, I'd ask that whatever solution is employed isn't just for SQS/SNS. We run into this issue quite frequently with S3, where streaming the response is of course desired. At this point I need to guard all S3 content stream reads in a specialized Timeout stream wrapper that maintains its own timeout and will close the underlying stream should the timeout be exceeded. |
@flakey-bit - we've been thinking on the problem of offering finer grained Http timeout control and there's a number of improvements we want to make. Some of this is slowly making it's way into the SDK (example). But we're still deciding how to fully expose this functionality. One of the key issues is the .net paradaigm of extending methods with a single Because we're still evaluating designs and deciding if we can incorporate them into the existing SDK version (vs needing breaking changes in a vNext), I'm hesitant to implement a one-off custom implementation specifically for the SQS, as that could constrain a broader design. However, to address your immediate need, you can use the code snippet below. This will allow you to effectively set a timeout against the entire public static class SQSClientExtensions
{
public static async Task<ReceiveMessageResponse> ReceiveMessageAsyncFullTimeout(
this IAmazonSQS client,
ReceiveMessageRequest request,
CancellationToken token)
{
var queryTask = client.ReceiveMessageAsync(request, token);
// start a background task that will be interrupted if token is cancelled.
// note: the1 hour timespan is arbitrary, but the value must be less than the token timeout.
var cancelTask = Task.Delay(TimeSpan.FromHours(1), token);
// wait for either task to finish
await Task.WhenAny(cancelTask, queryTask);
// verify we haven't timed out
token.ThrowIfCancellationRequested();
// if we made it this far, then queryTask has completed
return await queryTask;
}
} I verified Modify the client as so https://github.com/flakey-bit/aws-sqs-sdk-blocking-repro/blob/master/client/Program.cs#L41: static async Task Main(string[] args)
{
const string url = "http://localhost:5000/";
var client = new AmazonSQSClient(new AmazonSQSConfig
{
ServiceURL = url,
UseHttp = true,
MaxErrorRetry = 0
});
var request = new ReceiveMessageRequest();
var sw = Stopwatch.StartNew();
// use custom extension method ReceiveMessageAsyncFullTimeout
var token = new CancellationTokenSource(TimeSpan.FromSeconds(5)).Token;
var response = await client.ReceiveMessageAsyncFullTimeout(request, token);
Console.Out.WriteLine($"{sw.Elapsed} Got messages: \n{response.Messages.Count}");
} Thanks, PJ |
Hi @ppittle, We're already using the same technique that you proposed (racing the task against a delay task) as workaround. However a serious limitation (with a long running process) is that it won't actually cancel the request, leaving the HTTP connection open. Anyway, thanks for taking the time to look into this and glad to hear it's being fixed holistically (which was what I was originally asking for anyway). Cheers, |
|
@ppittle I am trying to cope with some similar hangs in the SDK's response handling, and given this issue is now closed, is there an open issue I can follow to keep tabs on the timeout control work you mentioned? In particular, is there a relevant issue about adding a .NET 5+ target and plumbing cancellation into aws-sdk-net/sdk/src/Core/Amazon.Runtime/Internal/Transform/_netstandard/HttpClientResponseData.cs Lines 147 to 153 in 0cd9f02
|
HI @wjrogers - We are tracking this work internally, but we do not have a public GitHub issue to track at this time. |
Hi @ppittle, do you know if there has been progress on this? I'm facing the same issue with S3. The workaround that stops waiting for the task isn't a great fit for my app, which is long-running and shouldn't leak connections. |
Description
We've had a repeated problem in production with a component that processes messages from Amazon SQS whereby it will stall (fail to process messages on the queue). The process never recovers and needs to be recycled manually.
From our application's perspective we make a call to
IAmazonSQS::ReceiveMessageAsync(request, ct)
which never completes (we're using a long-polling interval of 10 seconds).We've examined a memory dump for a process in this failed state. We can see that it has made a HTTP call to AWS (
https://sqs.us-east-1.amazonaws.com/xxx
) and that the headers have come back but for whatever reason the content for the response has not. We don't know the underlying cause behind this yet, current working theories are eitherRegardless of the underlying root cause, the AWS SDK should be made more resilient for this situation.
At time of writing, when making requests the AWS SDK awaits the call
HttpClient::SendAsync
, however due to the use ofHttpCompletionOption.ResponseHeadersRead
that task completes as soon as headers become available, without regard to the content (stream) - seeaws-sdk-net/sdk/src/Core/Amazon.Runtime/Pipeline/HttpHandler/_netstandard/HttpRequestMessageFactory.cs
Line 520 in 475822d
The content stream then gets indirectly passed in to the unmarshalling code (
XmlResponseUnmarshaller
,ReceiveMessageResponseUnmarshaller
and friends).Even though we've come through an
async
entrypoint, the unmarshalling code (indirectly) does a blocking (synchronous) read on the stream. If data is never added to the stream, it will block forever (which is exactly what we've seen).Reproduction Steps
Environment
Resolution
Presumably the use of
HttpCompletionOption.ResponseHeadersRead
is deliberate. Assuming that's the case, I'd suggest one of the following solutions:async
code path, read the entire response stream into a buffer (memory stream) w/Stream.CopyToAsync
w/ aCancellationTokenSource
based on a suitable timeout. From that point, code can be safely synchronousReadTimeout
on theNetworkStream
(not sure if this will be possible)If you'd be happy with either of these solutions, let me know and I'll work on a PR 👋
This is a 🐛 bug-report
The text was updated successfully, but these errors were encountered: