Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Response unmarshalling code can potentially block forever #1870

Closed
flakey-bit opened this issue Jun 10, 2021 · 24 comments
Closed

Response unmarshalling code can potentially block forever #1870

flakey-bit opened this issue Jun 10, 2021 · 24 comments
Labels
feature-request A feature should be added or improved. module/sdk-generated queued wontfix We have determined that we will not resolve the issue.

Comments

@flakey-bit
Copy link

flakey-bit commented Jun 10, 2021

Description

We've had a repeated problem in production with a component that processes messages from Amazon SQS whereby it will stall (fail to process messages on the queue). The process never recovers and needs to be recycled manually.

From our application's perspective we make a call to IAmazonSQS::ReceiveMessageAsync(request, ct) which never completes (we're using a long-polling interval of 10 seconds).

We've examined a memory dump for a process in this failed state. We can see that it has made a HTTP call to AWS (https://sqs.us-east-1.amazonaws.com/xxx) and that the headers have come back but for whatever reason the content for the response has not. We don't know the underlying cause behind this yet, current working theories are either

Regardless of the underlying root cause, the AWS SDK should be made more resilient for this situation.

At time of writing, when making requests the AWS SDK awaits the call HttpClient::SendAsync, however due to the use of HttpCompletionOption.ResponseHeadersRead that task completes as soon as headers become available, without regard to the content (stream) - see

var responseMessage = await _httpClient.SendAsync(_request, HttpCompletionOption.ResponseHeadersRead, cancellationToken)

The content stream then gets indirectly passed in to the unmarshalling code (XmlResponseUnmarshaller, ReceiveMessageResponseUnmarshaller and friends).

Even though we've come through an async entrypoint, the unmarshalling code (indirectly) does a blocking (synchronous) read on the stream. If data is never added to the stream, it will block forever (which is exactly what we've seen).

Reproduction Steps

Consumer task (00007f8e081f4720) i.e. the one reading from SQS is tied up by AWS SDK Unmarshaller (comes directly after HTTP handler in the AWS pipeline). 
The unmarshaller task is currently executing. 

> dumpasync -tasks -addr 00007F8E08A2B3D0 -stacks
         Address               MT     Size      State Description
00007f8e08a2b3d0 00007f8e800e73a8      152         -1 Amazon.Runtime.Internal.Unmarshaller+<InvokeAsync>d__3`1[[System.__Canon, System.Private.CoreLib]]
Continuation chains:
.00007f8e08a2b468 (0) Amazon.SQS.Internal.ValidationResponseHandler+<InvokeAsync>d__1`1[[System.__Canon, System.Private.CoreLib]]
..00007f8e08a2b4f0 (0) Amazon.Runtime.Internal.ErrorHandler+<InvokeAsync>d__5`1[[System.__Canon, System.Private.CoreLib]]
...00007f8e08a2b578 (0) Amazon.Runtime.Internal.CallbackHandler+<InvokeAsync>d__9`1[[System.__Canon, System.Private.CoreLib]]
....00007f8e08a2b600 (0) Amazon.Runtime.Internal.EndpointDiscoveryHandler+<InvokeAsync>d__2`1[[System.__Canon, System.Private.CoreLib]]
.....00007f8e08a2b698 (1) Amazon.Runtime.Internal.CredentialsRetriever+<InvokeAsync>d__7`1[[System.__Canon, System.Private.CoreLib]]
......00007f8e08a2b738 (0) Amazon.Runtime.Internal.RetryHandler+<InvokeAsync>d__10`1[[System.__Canon, System.Private.CoreLib]]
.......00007f8e08a2b7f8 (0) Amazon.Runtime.Internal.CallbackHandler+<InvokeAsync>d__9`1[[System.__Canon, System.Private.CoreLib]]
........00007f8e08a2b880 (0) Amazon.Runtime.Internal.CallbackHandler+<InvokeAsync>d__9`1[[System.__Canon, System.Private.CoreLib]]
.........00007f8e08a2b908 (0) Amazon.Runtime.Internal.ErrorCallbackHandler+<InvokeAsync>d__5`1[[System.__Canon, System.Private.CoreLib]]
..........00007f8e08a2b990 (0) Amazon.Runtime.Internal.MetricsHandler+<InvokeAsync>d__1`1[[System.__Canon, System.Private.CoreLib]]
...........00007f8e081f4720 (0) Acme.Organisation.Processor.Queue.QueueService+<Write>d__6
............00007f8e081f4808 (0) Acme.Organisation.Processor.Queue.QueueService+<>c__DisplayClass5_1+<<ExecuteAsync>b__1>d
.............00007f8e081e2960 System.Threading.Tasks.UnwrapPromise`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib]]
..............00007f8e081f2618 System.Threading.Tasks.Task+WhenAllPromise
...............00007f8e081f2668 (0) Acme.Organisation.QueueService+<ExecuteAsync>d__5
--------------------------------------------------------------------------------

See callstack for thread 11. It's using the stream returned by the http handler.

> clrstack
OS Thread Id: 0x820 (11)
        Child SP               IP Call Site
00007F8DD7FFB810 00007f8ef802d00c [GCFrame: 00007f8dd7ffb810]
00007F8DD7FFB8F0 00007f8ef802d00c [HelperMethodFrame_1OBJ: 00007f8dd7ffb8f0] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
00007F8DD7FFBA20 00007F8E811A3A25 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 626]
00007F8DD7FFBAB0 00007F8E8124865F System.Net.Sockets.SocketAsyncContext.PerformSyncOperation[[System.__Canon, System.Private.CoreLib]](OperationQueue`1<System.__Canon> ByRef, System.__Canon, Int32, Int32) [/_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs @ 1273]
00007F8DD7FFBB20 00007F8E81219D3C System.Net.Sockets.SocketAsyncContext.ReceiveFrom(System.Memory`1<Byte>, System.Net.Sockets.SocketFlags ByRef, Byte[], Int32 ByRef, Int32, Int32 ByRef) [/_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs @ 1465]
00007F8DD7FFBBC0 00007F8E81219A9D System.Net.Sockets.SocketPal.Receive(System.Net.Sockets.SafeSocketHandle, Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, Int32 ByRef) [/_/src/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs @ 983]
00007F8DD7FFBC60 00007F8E81219726 System.Net.Sockets.Socket.Receive(Byte[], Int32, Int32, System.Net.Sockets.SocketFlags, System.Net.Sockets.SocketError ByRef) [/_/src/System.Net.Sockets/src/System/Net/Sockets/Socket.cs @ 1386]
00007F8DD7FFBCB0 00007F8E81219340 System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32) [/_/src/System.Net.Sockets/src/System/Net/Sockets/NetworkStream.cs @ 266]
00007F8DD7FFBD00 00007F8E8116E1B4 System.Net.Security.SslStream.FillBufferAsync[[System.Net.Security.SslStream+SslReadSync, System.Net.Security]](SslReadSync, Int32) [/_/src/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs @ 1444]
00007F8DD7FFBDA0 00007F8E8116EE93 System.Net.Security.SslStream+<ReadAsyncInternal>d__214`1[[System.Net.Security.SslStream+SslReadSync, System.Net.Security]].MoveNext() [/_/src/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs @ 1439]
00007F8DD7FFBF10 00007F8E8116E904 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.Net.Security.SslStream+<ReadAsyncInternal>d__214`1[[System.Net.Security.SslStream+SslReadSync, System.Net.Security]], System.Net.Security]](<ReadAsyncInternal>d__214`1<SslReadSync> ByRef) [/_/src/System.Private.CoreLib/shared/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 1036]
00007F8DD7FFBF50 00007F8E8116E82C System.Net.Security.SslStream.ReadAsyncInternal[[System.Net.Security.SslStream+SslReadSync, System.Net.Security]](SslReadSync, System.Memory`1<Byte>)
00007F8DD7FFC000 00007F8E8116E592 System.Net.Security.SslStream.Read(Byte[], Int32, Int32) [/_/src/System.Net.Security/src/System/Net/Security/SslStream.cs @ 753]
00007F8DD7FFC080 00007F8E8116E394 System.IO.Stream.Read(System.Span`1<Byte>) [/_/src/System.Private.CoreLib/shared/System/IO/Stream.cs @ 756]
00007F8DD7FFC0C0 00007F8E8116E056 System.Net.Http.HttpConnection.Read(System.Span`1<Byte>) [/_/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnection.cs @ 1481]
00007F8DD7FFC130 00007F8E811350CE System.Net.Http.HttpConnection+ContentLengthReadStream.Read(System.Span`1<Byte>) [/_/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/ContentLengthReadStream.cs @ 26]
00007F8DD7FFC170 00007F8E81134F1F System.Net.Http.HttpBaseStream.Read(Byte[], Int32, Int32) [/_/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpBaseStream.cs @ 91]
00007F8DD7FFC1B0 00007F8E81163F12 System.IO.StreamReader.ReadBuffer(System.Span`1<Char>, Boolean ByRef) [/_/src/System.Private.CoreLib/shared/System/IO/StreamReader.cs @ 658]
00007F8DD7FFC210 00007F8E8117F6DE System.IO.StreamReader.ReadSpan(System.Span`1<Char>) [/_/src/System.Private.CoreLib/shared/System/IO/StreamReader.cs @ 360]
00007F8DD7FFC270 00007F8E8117F4EC System.IO.StreamReader.Read(Char[], Int32, Int32) [/_/src/System.Private.CoreLib/shared/System/IO/StreamReader.cs @ 338]
00007F8DD7FFC2A0 00007F8E8117F400 System.Xml.XmlTextReaderImpl.ReadData() [/_/src/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs @ 3358]
00007F8DD7FFC2D0 00007F8E811AD151 System.Xml.XmlTextReaderImpl.InitTextReaderInput(System.String, System.Uri, System.IO.TextReader) [/_/src/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs @ 2940]
00007F8DD7FFC300 00007F8E811AD011 System.Xml.XmlTextReaderImpl.FinishInitTextReader() [/_/src/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs @ 784]
00007F8DD7FFC320 00007F8E811AC5BF System.Xml.XmlTextReaderImpl..ctor(System.IO.TextReader, System.Xml.XmlReaderSettings, System.String, System.Xml.XmlParserContext) [/_/src/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs @ 768]
00007F8DD7FFC360 00007F8E811AC452 System.Xml.XmlReaderSettings.CreateReader(System.IO.TextReader, System.String, System.Xml.XmlParserContext) [/_/src/System.Private.Xml/src/System/Xml/Core/XmlReaderSettings.cs @ 487]
00007F8DD7FFC3A0 00007F8E7FE11302 Amazon.Runtime.Internal.Transform.XmlUnmarshallerContext.Read()
00007F8DD7FFC400 00007F8E7FE1051A Amazon.SQS.Model.Internal.MarshallTransformations.ReceiveMessageResponseUnmarshaller.Unmarshall(Amazon.Runtime.Internal.Transform.XmlUnmarshallerContext)
00007F8DD7FFC440 00007F8E81194CFA Amazon.Runtime.Internal.Transform.XmlResponseUnmarshaller.Unmarshall(Amazon.Runtime.Internal.Transform.UnmarshallerContext)
00007F8DD7FFC470 00007F8E81182889 Amazon.Runtime.Internal.Transform.ResponseUnmarshaller.UnmarshallResponse(Amazon.Runtime.Internal.Transform.UnmarshallerContext)
00007F8DD7FFC490 00007F8E8118251F Amazon.Runtime.Internal.Unmarshaller.UnmarshallResponse(Amazon.Runtime.Internal.Transform.UnmarshallerContext, Amazon.Runtime.IRequestContext)
00007F8DD7FFC4F0 00007F8E81181E93 Amazon.Runtime.Internal.Unmarshaller+<UnmarshallAsync>d__5.MoveNext()
00007F8DD7FFC590 00007F8E81181A24 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Amazon.Runtime.Internal.Unmarshaller+<UnmarshallAsync>d__5, AWSSDK.Core]](<UnmarshallAsync>d__5 ByRef) [/_/src/System.Private.CoreLib/shared/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 1036]
00007F8DD7FFC5D0 00007F8E8118197D Amazon.Runtime.Internal.Unmarshaller.UnmarshallAsync(Amazon.Runtime.IExecutionContext)
00007F8DD7FFC640 00007F8E811521A7 Amazon.Runtime.Internal.Unmarshaller+<InvokeAsync>d__3`1[[System.__Canon, System.Private.CoreLib]].MoveNext()
00007F8DD7FFC6E0 00007F8E7FE1963D System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs @ 201]
00007F8DD7FFC730 00007F8E811817D5 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.__Canon, System.Private.CoreLib],[Amazon.Runtime.Internal.Unmarshaller+<InvokeAsync>d__3`1[[System.__Canon, System.Private.CoreLib]], AWSSDK.Core]].MoveNext(System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Runtime/CompilerServices/AsyncMethodBuilder.cs @ 654]
00007F8DD7FFC770 00007F8E7FE19EE6 System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/TaskContinuation.cs @ 804]

Meanwhile, callstack for "event loop" 

> clrstack
OS Thread Id: 0x21 (10)
        Child SP               IP Call Site
00007F8E07FFC740 00007f8ef7c317ef [InlinedCallFrame: 00007f8e07ffc740] Interop+Sys.WaitForSocketEvents(IntPtr, SocketEvent*, Int32*)
00007F8E07FFC740 00007f8e7f7dc06b [InlinedCallFrame: 00007f8e07ffc740] Interop+Sys.WaitForSocketEvents(IntPtr, SocketEvent*, Int32*)
00007F8E07FFC730 00007F8E7F7DC06B ILStubClass.IL_STUB_PInvoke(IntPtr, SocketEvent*, Int32*)
00007F8E07FFC7C0 00007F8E7FB0D25C System.Net.Sockets.SocketAsyncEngine.EventLoop() [/_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs @ 311]
00007F8E07FFC810 00007F8E7FB0D219 System.Net.Sockets.SocketAsyncEngine+<>c.<.ctor>b__23_0(System.Object) [/_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs @ 285]
00007F8E07FFC820 00007F8E7D874883 System.Threading.Tasks.Task.InnerInvoke() [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2445]
00007F8E07FFC840 00007F8E7D874691 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2378]
00007F8E07FFC8C0 00007F8E7D8744E8 System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2321]
00007F8E07FFC8E0 00007F8E7D88146B System.Threading.Tasks.ThreadPoolTaskScheduler+<>c.<.cctor>b__10_0(System.Object) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/ThreadPoolTaskScheduler.cs @ 36]
00007F8E07FFC8F0 00007F8E7D85544A System.Threading.ThreadHelper.ThreadStart(System.Object) [/_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 81]
00007F8E07FFCC60 00007f8ef7142d5f [GCFrame: 00007f8e07ffcc60]
00007F8E07FFCD30 00007f8ef7142d5f [DebuggerU2MCatchHandlerFrame: 00007f8e07ffcd30]

ManualResetEventSlim shared:

> gcroot 00007f8e08a30058
Thread 21:
    00007F8E07FFC7C0 00007F8E7FB0D25C System.Net.Sockets.SocketAsyncEngine.EventLoop() [/_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs @ 311]
        rbx:
            ->  00007F8E081E02E8 System.Net.Sockets.SocketAsyncEngine
            ->  00007F8E081E0328 System.Collections.Concurrent.ConcurrentDictionary`2[[System.IntPtr, System.Private.CoreLib],[System.Net.Sockets.SocketAsyncContext, System.Net.Sockets]]
            ->  00007F8E08382040 System.Collections.Concurrent.ConcurrentDictionary`2+Tables[[System.IntPtr, System.Private.CoreLib],[System.Net.Sockets.SocketAsyncContext, System.Net.Sockets]]
            ->  00007F8E08381728 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[System.IntPtr, System.Private.CoreLib],[System.Net.Sockets.SocketAsyncContext, System.Net.Sockets]][]
            ->  00007F8E089F0D20 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[System.IntPtr, System.Private.CoreLib],[System.Net.Sockets.SocketAsyncContext, System.Net.Sockets]]
            ->  00007F8E089F0C48 System.Net.Sockets.SocketAsyncContext
            ->  00007F8E08A2FFE8 System.Net.Sockets.SocketAsyncContext+BufferMemoryReceiveOperation
            ->  00007F8E08A30058 System.Threading.ManualResetEventSlim

Thread 820:
    00007F8DD7FFBA20 00007F8E811A3A25 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/ManualResetEventSlim.cs @ 626]
        rbp-70: 00007f8dd7ffba30
            ->  00007F8E08A30058 System.Threading.ManualResetEventSlim

    00007F8DD7FFBAB0 00007F8E8124865F System.Net.Sockets.SocketAsyncContext.PerformSyncOperation[[System.__Canon, System.Private.CoreLib]](OperationQueue`1<System.__Canon> ByRef, System.__Canon, Int32, Int32) [/_/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs @ 1273]
        rbp-50: 00007f8dd7ffbac0
            ->  00007F8E08A30058 System.Threading.ManualResetEventSlim

> dumpheap -type BufferMemoryReceiveOperation
         Address               MT     Size

00007f8e08a2ffe8 00007f8e7fcc2a48      112      => context: 00007f8e089f0c48 (SocketAsyncContext)

> do 00007f8e08a2ffe8
Name:        System.Net.Sockets.SocketAsyncContext+BufferMemoryReceiveOperation
MethodTable: 00007f8e7fcc2a48
EEClass:     00007f8e7fcb59e8
Size:        112(0x70) bytes
File:        /usr/share/dotnet/shared/Microsoft.NETCore.App/3.1.1/System.Net.Sockets.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007f8e7ddba0e8  4000298       28         System.Int32  1 instance                0 _state
00007f8e7fb8a6c8  4000299        8 ...ocketAsyncContext  0 instance 00007f8e089f0c48 AssociatedContext
00007f8e7fb8a000  400029a       10 ...xt+AsyncOperation  0 instance 00007f8e08a2ffe8 Next
00007f8e7d33c798  400029b       18        System.Object  0 instance 00007f8e08a30058 CallbackOrEvent
00007f8e7faacf48  400029c       2c         System.Int32  1 instance                0 ErrorCode
00007f8e7ddc14c0  400029d       20        System.Byte[]  0 instance 0000000000000000 SocketAddress
00007f8e7ddba0e8  400029e       30         System.Int32  1 instance                0 SocketAddressLen
00007f8e7e41b5c8  400029f       38 ...TokenRegistration  1 instance 00007f8e08a30020 CancellationRegistration
00007f8e7faad098  40002af       48         System.Int32  1 instance                0 Flags
00007f8e7faad098  40002b0       4c         System.Int32  1 instance                0 ReceivedFlags
00007f8e7ddba0e8  40002b1       50         System.Int32  1 instance                0 BytesTransferred
00007f8e7e66acb8  40002b2       58 ...Private.CoreLib]]  1 instance 00007f8e08a30040 Buffer      

Environment

Resolution

Presumably the use of HttpCompletionOption.ResponseHeadersRead is deliberate. Assuming that's the case, I'd suggest one of the following solutions:

  • While still in the async code path, read the entire response stream into a buffer (memory stream) w/ Stream.CopyToAsync w/ a CancellationTokenSource based on a suitable timeout. From that point, code can be safely synchronous
  • Find a way to set ReadTimeout on the NetworkStream (not sure if this will be possible)

If you'd be happy with either of these solutions, let me know and I'll work on a PR 👋


This is a 🐛 bug-report

@flakey-bit flakey-bit added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jun 10, 2021
@flakey-bit
Copy link
Author

This change dotnet/corefx#36516 plumbed through cancellation token support into underlying socket

@ashishdhingra
Copy link
Contributor

Hi @flakey-bit,

Good afternoon.

Thanks for posting the issue. Please let me know if it would be possible for you to share sample code solution which would help us reproduce/troubleshoot the issue.

Thanks,
Ashish

@ashishdhingra ashishdhingra added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels Jun 11, 2021
@flakey-bit
Copy link
Author

Hi @ashishdhingra ,

I've pushed a repo containing a reproduction here: https://github.com/flakey-bit/aws-sqs-sdk-blocking-repro

Client code makes a call to ReceiveMessageAsync - the client has been set up w/ a Timeout of 5 seconds and long-polling time (WaitTimeSeconds) of 10 seconds.

Server code is set up to return the headers but never return the content body.

Under this scenario, the client will block indefinitely. This may seem far-fetched, however we've seen this happen in the wild.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jun 14, 2021
@flakey-bit
Copy link
Author

@ashishdhingra let me know if you need any further information.

@ashishdhingra
Copy link
Contributor

@ashishdhingra let me know if you need any further information.

@flakey-bit Thanks for sharing the information. I will discuss with development team to get this triaged.

@normj
Copy link
Member

normj commented Jun 30, 2021

When you see this hanging behavior is it when your code is running in AWS, outside of AWS or doesn't matter?

I would want to avoid coping into a buffer as a last resort as I would like to see the SDK reduce the amount of memory allocations it is doing and that would greatly increase it.

@flakey-bit
Copy link
Author

flakey-bit commented Jun 30, 2021

When you see this hanging behavior is it when your code is running in AWS, outside of AWS or doesn't matter?

We saw this with code running in AWS in our production environment.

.NET Core 3.1 process running in a Linux container under AWS ECS. Our ECS cluster uses AWS EC2 instances all in the same region as SQS (us-east-1)

I would want to avoid coping into a buffer as a last resort as I would like to see the SDK reduce the amount of memory allocations it is doing and that would greatly increase it.

Agree in principal 👍 , that was only one possible proposed solution. If you wanted to get fancy, something like Microsoft.IO.RecyclableMemoryStream could help.

@flakey-bit
Copy link
Author

@danielmarbach sorry for the tag, but thought you might be interested as you opened #796

@danielmarbach
Copy link
Contributor

No worries. Good to know.

RecycableMemoryStream is definitely a solution or using the shared array pool and slicing the buffers accordingly.

@flakey-bit
Copy link
Author

Hi @ashishdhingra AWS, any idea when this issue will be looked into?

@ganeshnj
Copy link
Contributor

ganeshnj commented Sep 16, 2021

Thanks @flakey-bit for repro.

Here is my investigation

The behavior of plain HTTP client and .NET SDK for AWS is same, i.e. on calling SendAsync with HttpCompletionOption.ResponseHeadersRead and server doesn't write on the response stream, client will keep on waiting until it received the promised bytes.

Looking suggested resolutions, buffering the response in a stream will not work if the server is not writing on the response stream. Read operation on the stream will wait infinitely. Also, as @normj pointed correctly, this is not ideal for memory footprint especially for larger responses.

Second option is interesting, Microsoft has introduced an overload on ReadAsStreamAsync (https://docs.microsoft.com/en-us/dotnet/api/system.net.http.httpcontent.readasstreamasync?view=net-5.0#System_Net_Http_HttpContent_ReadAsStreamAsync_System_Threading_CancellationToken_) which can be option for us but it is only available for .NET 5.0 or above. This will allow cancellation during stream read. We have backlog item to see how this will work and given .NET 6.0 (LTS) also around which may also affect the prioritization.

That said, I'm curious on the root cause of the problem, in what particular scenario ReceiveMessageRequest hangs forever? Do you have reproduction specifically communicating with SQS. I'm pretty sure this doesn't happen on every request and I couldn't able to reproduce on my side.

@flakey-bit
Copy link
Author

flakey-bit commented Sep 16, 2021

@ganeshnj

I'd say the simplest thing that could possibly work would be to change the implementation to use HttpCompletionOption.ResponseContentRead rather than HttpCompletionOption.ResponseHeadersRead in the SQS path.

Presumably the reason we've ended up in this situation is due to code-reuse in the SDK - i.e. the low level code for making HTTP calls in the SDK is used not just by the IAmazonSQS implementation but other services too. At a guess, one of those other services (maybe S3, maybe Kafka at a guess) wants a streaming capability, hence the use of HttpCompletionOption.ResponseHeadersRead.

As far as I can tell, there's no obvious benefit to using HttpCompletionOption.ResponseHeadersRead with SQS because we're not streaming stuff.

So yeah - I'd suggest refactoring so that HttpCompletionOption.ResponseContentRead is used under-the-hood for SQS. That probably means making the behaviour configurable in HttpRequestMessageFactory so that only the services that want streaming get it (or services that don't need streaming can opt-out).

That said, I'm curious on the root cause of the problem, in what particular scenario ReceiveMessageRequest hangs forever? Do you have reproduction specifically communicating with SQS. I'm pretty sure this doesn't happen on every request and I couldn't able to reproduce on my side.

We don't have a reproduction against actual AWS SQS unfortunately, however we've seen it occur sporadically in our production environment. Presumably it's due to a NAT / proxy / firewall / gateway or similar piece of infrastructure misbehaving.

@ganeshnj
Copy link
Contributor

The benefit you will get from ResponseHeadersRead to ResponseContentRead is HTTP timeout, the request timeout set in AmazonSQSClient client will be in effect and in case of long wait/hang, SendAsync method will timeout. This will definitely help in this case. However, I'm not sure of the side affects.

From the file history, It has been like this since beginning. Changing HttpCompletionOption to ResponseContentRead will/can be a breaking change because timeout config now considers Content also. Like you mentioned, this is shared code and affects all the services.

From here, I only see a change for .NET 5 or above as a viable solution.

@flakey-bit
Copy link
Author

flakey-bit commented Sep 16, 2021

Changing HttpCompletionOption to ResponseContentRead will/can be a breaking change because timeout config now considers Content also. Like you mentioned, this is shared code and affects all the services.

I'm definitely not suggesting we change to using ResponseContentRead everywhere (for all services), but just for SQS (where ResponseHeadersRead has no benefit). Even though it's shared code, you can make the shared code internally configurable so that SQS can have the behaviour it needs without affecting other usages.

I see what you're saying about a behaviour change (although I'd argue it's changing the behaviour to what most reasonable people would expect).

If you really want to avoid a default behaviour change, you can introduce a configuration option (default false) to AmazonSQSConfig, something like

public bool WaitForContent { get; set; }

If false (default) then we use ResponseHeadersRead as before. Timeout only considers headers
If true then we use ResponseContentRead and Timeout applies to headers + content

@ganeshnj
Copy link
Contributor

That's fair. it can be an opt-in only.

@flakey-bit
Copy link
Author

Awesome @ganeshnj - what are the next steps?

@ganeshnj ganeshnj added the feature-request A feature should be added or improved. label Sep 17, 2021
@ganeshnj
Copy link
Contributor

The backlog item has been updated.

Next step: it will be groomed and prioritized.

@hunanniu hunanniu removed the bug This issue is a bug. label Sep 21, 2021
@Bio2hazard
Copy link

Bio2hazard commented Sep 29, 2021

I know this is all info that's been mostly captured here already but consider it my way of adding a +1 😄

I tried let the .net team know about this issue (with reproduction even) quite a while ago in dotnet/runtime#34164 (comment) and reported it again in dotnet/runtime#31259 but they were not keen to address the underlying issue.

The long and short of it is that:

  • AWS SDK sends everything with HttpCompletionOption.ResponseHeadersRead
  • Any cancellation token passed in only applies to the request that returns the header, not the content
  • When the actual content is read, any cancellation token is either lost or there is no way to pass it in
  • When AWS services have trouble (unknown if they go in a bad state, get updated, replaced or what 🤷) they never explicitly close open connections / send a RST TCP packet to close the socket
  • .net team does not seem interested in adding a system-wide socket timeout, or a means to enforce a keep-alive or any other safety net

A repro I created can be found here (though in the 2 years since I reported the issue the open socket connection component seems to have stopped working) https://dotnetfiddle.net/UpFwOD

As for fixing the issue...
Doing it the right way would be to ensure a cancellation token is passed down all the way to the socket connection. This might be difficult because overloads with cancellation support were only added in .NET 5, and the fix should ideally also help customers on .net core 3.1.

So for fixing it for all framework versions would be to add a cancellation token registration that forcibly closes the stream or discards the http response, which is the most reliably way to tell .net to stop waiting for a response on the socket.

Finally, I'd ask that whatever solution is employed isn't just for SQS/SNS. We run into this issue quite frequently with S3, where streaming the response is of course desired. At this point I need to guard all S3 content stream reads in a specialized Timeout stream wrapper that maintains its own timeout and will close the underlying stream should the timeout be exceeded.

@ppittle
Copy link
Member

ppittle commented Feb 24, 2022

@flakey-bit - we've been thinking on the problem of offering finer grained Http timeout control and there's a number of improvements we want to make. Some of this is slowly making it's way into the SDK (example). But we're still deciding how to fully expose this functionality. One of the key issues is the .net paradaigm of extending methods with a single CancellationToken. While this works well for low level methods, it breaks down for an abstraction over the http request process where there is a desire to set multiple timeouts.

Because we're still evaluating designs and deciding if we can incorporate them into the existing SDK version (vs needing breaking changes in a vNext), I'm hesitant to implement a one-off custom implementation specifically for the SQS, as that could constrain a broader design.

However, to address your immediate need, you can use the code snippet below. This will allow you to effectively set a timeout against the entire client.ReceiveMessageAsync method and prevent your application from hanging if it fails to receive a fully formed message from SQS:

public static class SQSClientExtensions
{
    public static async Task<ReceiveMessageResponse> ReceiveMessageAsyncFullTimeout(
        this IAmazonSQS client, 
        ReceiveMessageRequest request,
        CancellationToken token)
    {
        var queryTask = client.ReceiveMessageAsync(request, token);

        // start a background task that will be interrupted if token is cancelled.
        // note:  the1 hour timespan is arbitrary, but the value must be less than the token timeout.
        var cancelTask = Task.Delay(TimeSpan.FromHours(1), token);
        
        // wait for either task to finish
        await Task.WhenAny(cancelTask, queryTask);

        // verify we haven't timed out
        token.ThrowIfCancellationRequested();

        // if we made it this far, then queryTask has completed
        return await queryTask;
    }
}

I verified ReceiveMessageAsyncFullTimeout works in your sample submission (thanks for that btw!).

Modify the client as so https://github.com/flakey-bit/aws-sqs-sdk-blocking-repro/blob/master/client/Program.cs#L41:

static async Task Main(string[] args)
{
    const string url = "http://localhost:5000/";
	
    var client = new AmazonSQSClient(new AmazonSQSConfig
    {
        ServiceURL = url,    
        UseHttp = true,
        MaxErrorRetry = 0
    });

    var request = new ReceiveMessageRequest();

    var sw = Stopwatch.StartNew();	
	
    // use custom extension method ReceiveMessageAsyncFullTimeout
    var token = new CancellationTokenSource(TimeSpan.FromSeconds(5)).Token;
    var response = await client.ReceiveMessageAsyncFullTimeout(request, token);

    Console.Out.WriteLine($"{sw.Elapsed} Got messages: \n{response.Messages.Count}");
}

Thanks,

PJ

@ppittle ppittle added the wontfix We have determined that we will not resolve the issue. label Feb 24, 2022
@flakey-bit
Copy link
Author

Hi @ppittle,

We're already using the same technique that you proposed (racing the task against a delay task) as workaround. However a serious limitation (with a long running process) is that it won't actually cancel the request, leaving the HTTP connection open.

Anyway, thanks for taking the time to look into this and glad to hear it's being fixed holistically (which was what I was originally asking for anyway).

Cheers,
Eddie

@ppittle ppittle closed this as completed Feb 24, 2022
@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@wjrogers
Copy link

@ppittle I am trying to cope with some similar hangs in the SDK's response handling, and given this issue is now closed, is there an open issue I can follow to keep tabs on the timeout control work you mentioned? In particular, is there a relevant issue about adding a .NET 5+ target and plumbing cancellation into ReadAsStreamAsync()?

public Task<Stream> OpenResponseAsync()
{
if (_disposed)
throw new ObjectDisposedException("HttpWebResponseBody");
return _response.Content.ReadAsStreamAsync();
}

@ppittle
Copy link
Member

ppittle commented May 6, 2022

HI @wjrogers - We are tracking this work internally, but we do not have a public GitHub issue to track at this time.

@wssmith
Copy link

wssmith commented Apr 29, 2024

Hi @ppittle, do you know if there has been progress on this? I'm facing the same issue with S3. The workaround that stops waiting for the task isn't a great fit for my app, which is long-running and shouldn't leak connections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request A feature should be added or improved. module/sdk-generated queued wontfix We have determined that we will not resolve the issue.
Projects
None yet
Development

No branches or pull requests

10 participants