
Concurrency issues around dispatch events and waiting for renders #118

Closed
egil opened this issue May 3, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@egil
Member

egil commented May 3, 2020

The following errors show up in 154 test runs out of about 3000:

  • Htmlizer.cs:line 44, Assert.Debug (20 times)
  • Htmlizer.cs:line 166, Assert.Debug (14 times)
  • Htmlizer.cs:line 59, error InvalidOperationException : We didn't consume any input (33 times)
  • Htmlizer.cs:line 93, error InvalidOperationException : Invalid element frame type 'None' (28 times)
  • The HashSet Observers in RenderEventObservable.cs is updated concurrently. Throws the following error message: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct. (3 times)
  • Additional regular assertion failures, probably just related to the above. Many timeouts.
@egil egil added the bug Something isn't working label May 3, 2020
@egil egil added this to the beta-7 milestone May 3, 2020
egil added a commit that referenced this issue May 3, 2020
@egil egil changed the title from "WaitForState sometimes breaks because of exception in Htmlizer" to "Concurrency issues around dispatch events and waiting for renders" May 3, 2020
@egil
Member Author

egil commented May 5, 2020

Current thinking is that subscribers to render events are still in the process of regenerating their DOM when the next render happens, since they are waiting on one thread while the renderer continues its work on another thread. Thus the OnRender trigger isn't actually synchronous and blocking until the DOMs are updated; it only blocks until the waiting threads are notified, after which they go off and do their own work.

The next thing to attempt is to change the notification system so that subscribers can return a Task when notified, which they complete when they are done doing whatever work they need to do. Looking at the Renderer.cs code, it seems it will await an uncompleted task before continuing with the next render: https://source.dot.net/#Microsoft.AspNetCore.Components/RenderTree/Renderer.cs,514
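
A rough sketch of the idea (IRenderSubscriber and HandleRenderAsync are illustrative names only, not the actual bUnit API, and the real UpdateDisplayAsync also receives an in RenderBatch parameter, omitted here for brevity):

using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch: subscribers return a task they complete when they are
// done regenerating markup/DOM for the current render.
public interface IRenderSubscriber
{
    Task HandleRenderAsync();
}

public class TestRendererSketch
{
    private readonly List<IRenderSubscriber> _subscribers = new List<IRenderSubscriber>();

    // The underlying Renderer awaits the task returned from UpdateDisplayAsync
    // before continuing with the next render, so returning a task that only
    // completes when every subscriber is done effectively blocks further
    // renders until all DOMs have been regenerated.
    protected Task UpdateDisplayAsync()
    {
        var tasks = new List<Task>(_subscribers.Count);
        foreach (var subscriber in _subscribers)
            tasks.Add(subscriber.HandleRenderAsync());
        return Task.WhenAll(tasks);
    }
}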

@egil
Member Author

egil commented May 7, 2020

Pushed a new branch with a semi-complete attempt at a new render event notification system. Needs more testing.

Also, this could enable making tasks returned by e.g. ClickAsync complete only after the new DOM elements have been generated, making the WaitFor* methods less necessary... Further investigation required!
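
If that works out, a test might end up looking roughly like this (purely hypothetical usage sketch; the Counter component, the exact method names and signatures, and whether ClickAsync gets these semantics at all are assumptions at this point):

// Hypothetical: ClickAsync is assumed to return a task that only completes
// after markup/nodes have been regenerated, so no WaitFor* call is needed.
var cut = RenderComponent<Counter>();
await cut.Find("button").ClickAsync(new MouseEventArgs());
cut.Find("p").MarkupMatches("<p>Current count: 1</p>");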

@egil
Member Author

egil commented May 7, 2020

Second update, with a completely new notification pipeline that looks like this:

After a render is complete, but before the ComponentBase.OnAfterRender* methods are called, the following happens (see the sketch after the list):

  1. TestRenderer.UpdateDisplayAsync is invoked; it calls Task RenderedFragment.Handle().
  2. RenderedFragment.Handle() checks whether it or its children are part of the render; if they are, it notifies subscribers to the RenderedFragment.OnAfterRender event.
  3. Then it checks if there are any markup changes in it or its children. If there are, it updates the Markup property and nulls the Nodes property, and then notifies subscribers of RenderedFragment.OnMarkupUpdated.
  4. Then it completes by returning a completed task, which the renderer is awaiting.
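
A rough code sketch of that flow (member names follow the description above; the bodies and private helpers are placeholders, not the actual implementation):

using System;
using System.Threading.Tasks;

public class RenderedFragmentSketch
{
    public event Action? OnAfterRender;
    public event Action? OnMarkupUpdated;

    public string Markup { get; private set; } = string.Empty;

    // Step 1: invoked from TestRenderer.UpdateDisplayAsync after each render batch.
    public Task Handle()
    {
        if (IsAffectedByRender())            // step 2: did this fragment or a child render?
        {
            OnAfterRender?.Invoke();         // step 2: notify render subscribers

            if (HasMarkupChanges())          // step 3: did the markup actually change?
            {
                Markup = RegenerateMarkup(); // update Markup...
                InvalidateNodes();           // ...and null the cached Nodes
                OnMarkupUpdated?.Invoke();   // notify markup subscribers
            }
        }
        return Task.CompletedTask;           // step 4: the renderer awaits this task
    }

    private bool IsAffectedByRender() => true;          // placeholder
    private bool HasMarkupChanges() => true;            // placeholder
    private string RegenerateMarkup() => string.Empty;  // placeholder (real code uses the Htmlizer)
    private void InvalidateNodes() { }                  // placeholder
}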

The core of the problem seems to have been:

  1. The "wait for" checkers were not necessarily notified of changes after the rendered fragments had been notified and had their markup and AngleSharp nodes regenerated, which caused the state predicate or assertion to fail when it didn't have to.
  2. The thread that runs the wait-for checker and the thread that runs the renderer could be running on different CPUs/cores, causing memory coherence problems: the updates to markup and nodes, even if they happened before the wait-for checks, would not be seen by the checker, and the check would still fail.
  3. There was a race condition between nulling out the Nodes property and generating new markup, which could lead to the wait-for checkers getting an older version of the nodes even though the markup was correctly updated (illustrated after this list).
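
A contrived, runnable demo of problem 3 (field names are illustrative, not bUnit's):

using System;
using System.Threading.Tasks;

// Markup and nodes are updated in two separate, unsynchronized steps, so a
// concurrent checker can observe markup and nodes from different renders.
class TornUpdateDemo
{
    private static string _markup = "markup-0";
    private static string _nodes = "nodes-0";

    static void Main()
    {
        // Simulates the render thread: updates markup first, then nodes.
        var writer = Task.Run(() =>
        {
            for (var i = 1; i <= 1_000_000; i++)
            {
                _markup = "markup-" + i;
                _nodes = "nodes-" + i;
            }
        });

        // Simulates a wait-for checker: samples both fields without any locking.
        var torn = 0;
        while (!writer.IsCompleted)
        {
            var markup = _markup;
            var nodes = _nodes;
            if (markup.Substring(7) != nodes.Substring(6))
                torn++; // markup and nodes belong to different renders
        }

        Console.WriteLine($"Observed {torn} torn reads.");
    }
}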

The solution seems to be:

  1. In UpdateDisplayAsync in the test renderer, return a task that does not complete until all the listening rendered fragments have regenerated their markup and nodes, and all their wait-for listeners have executed their logic. The incomplete task appears to make the renderer wait before any OnAfterRender* methods are called.
  2. Use Volatile.Write/Volatile.Read when accessing RenderedFragment.Nodes and RenderedFragment.Markup, to ensure that memory coherence is not a problem.
  3. Synchronize access to RenderedFragment.Nodes and RenderedFragment.Markup via the lock statement, both when updating and reading, to avoid the two getting out of sync, e.g. nodes representing an older version of the current markup (points 2 and 3 are sketched after this list).
  4. Add locking in the wait-for helpers around the checking and timeout logic, to avoid race conditions there, e.g. the timer invoking the OnTimeout method while the OnAfterRender method is running. The lock is around the TaskCompletionSource, to ensure that there is no race to return a value.
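
A minimal sketch of points 2 and 3 (names follow the description above; the real implementation differs in the details):

using System.Threading;

public class RenderedFragmentStateSketch
{
    private readonly object _lock = new object();
    private string _markup = string.Empty;
    private object? _nodes; // stands in for the cached AngleSharp node list

    public string Markup
    {
        get { lock (_lock) { return Volatile.Read(ref _markup); } }
    }

    public object? Nodes
    {
        get { lock (_lock) { return Volatile.Read(ref _nodes); } }
    }

    // Called from the renderer's thread after a render: both fields are updated
    // inside the same lock, so a reader can never observe new markup together
    // with nodes from an older render.
    internal void UpdateMarkup(string newMarkup)
    {
        lock (_lock)
        {
            Volatile.Write(ref _markup, newMarkup);
            Volatile.Write(ref _nodes, null); // nodes are re-parsed lazily on next access
        }
    }
}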

I am not convinced the problem is entirely gone. For example, if something inside a component under test that the user is checking, e.g. a public property, is changed in the same manner that caused problems with the rendered markup, it might suffer from the same issue. I did write a test very similar to the markup ones that used to fail. It didn't fail during an overnight run, so there might be some mitigating factors in play.

@egil
Member Author

egil commented May 7, 2020

This blog post describes the memory coherence problem I've observed: http://benbowen.blog/post/cmmics_iii/
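
A minimal, contrived illustration of the kind of stale read described there (not bUnit code):

using System;
using System.Threading;
using System.Threading.Tasks;

class StaleReadDemo
{
    private static bool _done; // no Volatile/lock: the worker may never see the write

    static void Main()
    {
        var worker = Task.Run(() =>
        {
            // In an optimized build the JIT may hoist the read of _done out of
            // the loop, so this can spin forever even after Main sets the flag.
            while (!_done) { }
            Console.WriteLine("worker observed _done = true");
        });

        Thread.Sleep(100);
        _done = true;

        var finished = worker.Wait(TimeSpan.FromSeconds(1));
        Console.WriteLine(finished
            ? "worker saw the write"
            : "worker did not see the write within 1 second");
    }
}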

egil added a commit that referenced this issue May 8, 2020
…#118

Squashed commit of the following:
commit e8bde9b
Author: Egil Hansen <[email protected]>
Date:   Fri May 8 11:31:40 2020 +0000

    Fix to code doc

commit 833e67d
Author: Egil Hansen <[email protected]>
Date:   Fri May 8 11:29:22 2020 +0000

    Code clean up

commit 5f3b969
Author: Egil Hansen <[email protected]>
Date:   Fri May 8 11:12:13 2020 +0000

    Removed unused using statements and reordered

commit 2c809e1
Author: Egil Hansen <[email protected]>
Date:   Fri May 8 11:10:04 2020 +0000

    Tweaks to code, simpler and general WaitForHelper class

commit fbf0b2c
Author: Egil Hansen <[email protected]>
Date:   Fri May 8 08:50:56 2020 +0000

    Added additional test for testing changes to properties in components

commit 0969f9d
Author: Egil Hansen <[email protected]>
Date:   Thu May 7 17:48:25 2020 +0000

    Removed test context waitfor functionality

commit ff48f3d
Author: Egil Hansen <[email protected]>
Date:   Thu May 7 17:11:01 2020 +0000

    Added time stamp to log, and wrapped in try/catch, to handle scenarios where a message arrives after the test run is over.

commit 46176c3
Author: Egil Hansen <[email protected]>
Date:   Thu May 7 10:53:40 2020 +0000

    Moved to wait helper classes with locking and better protection for race conditions

commit 98bfb08
Author: Egil Hansen <[email protected]>
Date:   Wed May 6 23:52:02 2020 +0000

    Attempt at reworking notification of render events and providing more stable logic around waiting for functionality
@egil egil closed this as completed May 8, 2020
@duracellko
Contributor

@duracellko, you might find this issue interesting.

Is there a branch with failing tests? Are those files already fixing the issue?

Btw, I noticed this in RenderedFragment:

lock (_lock)
{
    return Volatile.Read(ref _markup);
}

If locking is consistent (applied to all access), then Volatile.Read/Write is not necessary.
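
Something like this should then be enough (a minimal sketch, assuming every access really does go through the same lock):

private readonly object _lock = new object();
private string _markup = string.Empty;

public string Markup
{
    // The lock's acquire/release semantics already provide the necessary
    // memory barriers, so no Volatile.Read is needed here.
    get { lock (_lock) { return _markup; } }
}

internal void SetMarkup(string newMarkup)
{
    lock (_lock) { _markup = newMarkup; }
}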

@egil
Member Author

egil commented May 9, 2020

These are basically the offending tests: https://www.github.com/egil/bunit/tree/dev/src%2Fbunit.core.tests%2FExtensions%2FWaitForHelpers%2FRenderedFragmentWaitForHelperExtensionsTest.cs

And yes, the issue seems fixed (code on the dev branch). I did an overnight run on my dual-core laptop where these tests ran in a loop, and no errors occurred, compared to earlier, where one of the tests would fail roughly 1 out of every 20 runs.

I'll try to remove the volatile reads/writes then and just use the locks, and see how it goes. Do you see anything else I can improve?

@egil
Member Author

egil commented May 9, 2020

This is the PowerShell command I use to run it in a loop:

~\Source\bunit\src\bunit.core.tests>
❯ dotnet build
❯ for($i=1;$i -le 1000;$i++) { dotnet test --no-restore --no-build --filter Bunit.Extensions.WaitForHelpers.RenderedFragmentWaitForHelperExtensionsTest -l "console" --nologo >> testlog.log }

@egil
Member Author

egil commented May 9, 2020

Just did a test without volatile, and it works as well. I can't remember exactly what the setup was before I added them, but they did help initially. Maybe I added the lock statement afterwards.
