Skip to content

fix: refine message loop handling to prevent blocking#1255

Open
AlbertXingZhang wants to merge 3 commits intowebview:masterfrom
AlbertXingZhang:master
Open

fix: refine message loop handling to prevent blocking#1255
AlbertXingZhang wants to merge 3 commits intowebview:masterfrom
AlbertXingZhang:master

Conversation

@AlbertXingZhang
Copy link

@AlbertXingZhang AlbertXingZhang commented Mar 4, 2025

GetMessageW sometimes blocks the thread and freezes the program. This changes it to calling non-blocking PeekMessageW with PM_NOREMOVE first and then calling GetMessageW once there is a message available.

@taozuhong
Copy link
Contributor

Pls using PeekMessageA for UTF-8 or PeekMessage for WCHAR and UTF-8

@AlbertXingZhang
Copy link
Author

I tried to follow the convention of this source file. It seems it just uses Windows APIs in Unicode version.

@SteffenL
Copy link
Collaborator

SteffenL commented Mar 4, 2025

Thanks for this PR.

Isn't there a chance that this changed loop might waste a fair bit of CPU cycles?

In any case, if it's not for long, it might be better than potentially blocking for an extensive period of time.

Would be great if this also solves the issue where tests sometimes hang.

The usage of PeekMessageW looks fine to me.

@SteffenL
Copy link
Collaborator

SteffenL commented Mar 4, 2025

Do you a particular environment where you can easily reproduce the issue? Would love to be able to test and verify even though I understand why the fix works. I just haven't been able to reproduce it outside of running CI tests.

@AlbertXingZhang
Copy link
Author

AlbertXingZhang commented Mar 4, 2025

I couldn't produce the issue constantly. But it did happen occasionally. I found it is blocked at GetMessageW if there is no message coming in and the whole GUI is frozen and not responding. In this situation, I think there is no way for any input event message to be captured by GUI any more. So, it might be better to check if there is a message in the message queue by PeekMessageW before GetMessageW. At least, I don't see any hanging after changing it to this way so far. I added a Sleep to reduce CPU usage.

@SteffenL
Copy link
Collaborator

SteffenL commented Mar 4, 2025

After taking a closer look at this I have to admit that I don't quite understand how this can solve the issue. If we're just looping PeekMessageW() until there's a message, then in my head it's nearly doing the same job as GetMessageW() by itself. It's quite interesting to me.

I just remembered that I had thought about this issue before and the solution I made in PR #766 was never merged.

I believe the real solution is to post a message when WebView2 initialization completes/fails instead of just setting a flag. That way there's no doubt about whether there will be another message to process in the loop, and therefore won't get stuck.

If you want to take a look at it then search for usage of PostMessage and app_window_message in PR #766. Otherwise I'll try to extract the relevant code from that PR.

@AlbertXingZhang
Copy link
Author

AlbertXingZhang commented Mar 4, 2025

If there is always at least one message in the message queue, I think you are right. PeekMessageW plus GetMessageW does the same job as GetMessageW. If there happens to be no message in the message queue at the time this code is called, it still checks flag in between any two calls to PeekMessageW but calling GetMessageW once will block the thread and freeze the GUI, as the result, no input event message can be captured by GUI and there is no chance to check flag any more. That's my understanding.

The PR you mentioned might work but it is a different approach. This change just tries to fix the potential freezing issue for the current approach. I think this is a common way to wait for a task for a certain time without blocking GUI.

@SteffenL
Copy link
Collaborator

SteffenL commented Mar 5, 2025

GetMessageW() blocking is normal but it isn't normal for the GUI to freeze because of the blocking behavior of GetMessageW() because it'll just unblock when there are any updates related to the GUI. At least that's what should happen. Then it should dispatch the message using DispatchMessageW() to update the GUI.

In my environment, WebView2 calls our "controller completed" handler during the call to DispatchMessageW(). That's when we change the state of flag so it should be changed by the time DispatchMessageW() returns. Then flag should be checked before calling GetMessageW() (left to right evaluation order and short-circuiting).

It might be worth adding log statements to the existing code and then run CI tests a few times to see when there might be a delay in messages causing GetMessageW() to hang. That's if the problems in CI are the same as yours, which I don't actually know at this time.

On a side note, using the captured &flag in the the "controller completed" handler looks like it could in theory access flag after it has been freed (after embed() returns). I don't believe that's of such great concern here because all of this happens on the same thread before leaving the scope of embed(), but it's something that could be improved later.

Adding PeekMessageW() shouldn't in my mind help here, unless there's maybe a second message loop somewhere that interferes with the library's internal message loop e.g. by taking messages out of the queue? How can we know for sure that the issue is completely eliminated this way, and not just the frequency of it?

Maybe I'm missing something because it's difficult for me to see how adding PeekMessageW() can make a difference.

Oh, and one small change I spotted in your code is that you set got_quit_msg = true when GetMessageW() returns less than 0 (error). The existing code only does it when zero is returned (WM_QUIT received).

@AlbertXingZhang
Copy link
Author

AlbertXingZhang commented Mar 5, 2025

Honestly, I don't know how GUI and its message queue work internally for Windows. Maybe only Microsoft knows. The bottom line for this change is giving a chance to check flag regularly no matter if GUI is frozen by GetMessageW or something else since GetMessageW won't return unless there is a message coming in and mostly message can hardly come in for a frozen GUI. So, the whole program will be stuck here. At least, PeekMessageW can prevent this situation happening.

I saw GetMessageW returns -1 when there is error. I made an update to only take 0 and -1 as break condition.

@SteffenL
Copy link
Collaborator

I've been trying to experiment but sadly haven't gotten much closer to fixing the hangs. I'm however leaning toward believing that this code change can't actually resolve the issue because I don't believe that it has anything to do with the fact that GetMessageW blocks.

If it really does in your case then I can't figure out why, but I would like to understand why before considering a merge. I have also made some observations that I believe justify my doubts.

It's true that using PeekMessageW allows more chances to check flag, but the state of flag doesn't change until we call DispatchMessageW. There's no way for flag to have its state changed at any other point. This means that the blocking is merely shifted from GetMessageW to us looping PeekMessageW until there's a message available.

Things I've tested:

  • Replacing GetMessageW with PeekMessageW doesn't fix the hangs in CI tests. Messages that are part of the WebView2 initialization may still be delayed by as much as a minute or so in CI. In the latest CI build for this PR, you can tell that a test timed out and it should be for the same reason although you can't tell due to lack of logging. In my log-enabled code however, it's always hanging while waiting for messages, even if we use PeekMessageW instead of GetMessageW.

  • The hang occurs in the first instance of an app that uses WebView2 since cold system boot, but not subsequent app instances. Verified this in CI by adding a new minimal "warmup" test that runs before other tests. This one has the issue while the subsequent ones—even the one that used to hang because it was the first test—no longer have the same severe hangs. Verified this in my local development environment simply by launching one of the examples more than once.

  • It's possible that WebView2 may work more slowly (lower priority) while its UI is invisible, but making the widget/window visible before setting up WebView2 didn't resolve the issue.

  • Using the official WebView2Loader.dll appears to make no difference. I've tried versions 1.0.1150.38 and 1.0.3124.44. We can rule out the built-in WebView2 loader implementation as the problem.

  • Bad proxy settings on the system may cause hanging but I'm not sure about the exact behavior. I tried code like this in CI and it doesn't seem to make any difference:

Set-ItemProperty -Path 'HKCU:\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\' -Name 'ProxyEnable' -Value 0
  • Incognito/InPrivate mode made no difference. Tested by using the --incognito argument to the browser as well as by calling ICoreWebView2ControllerOptions::put_IsInPrivateModeEnabled().

  • The following browser arguments made no difference: --msWebView2CancelInitialNavigation --no-first-run

  • Passing of browser argument was verified by adding --auto-open-devtools-for-tabs which opened the dev tools automatically as expected.

  • I have personally only seen such severe delays on GitHub Actions hosted runners. I set up my own fresh but stateful runner instance, but can't replicate the hangs there, and haven't seen it in my regular Windows 10 development environment either.

  • In my development environment, after deleting the user data directory created by WebView2 and then restarting the computer, it takes about a second extra to initialize WebView2. Tested this with my experimental code as well as current code in the master branch (e2bd2e5), and I can't perceive any difference in performance between the two branches on my system.

@SteffenL
Copy link
Collaborator

Thanks for this PR but unfortunately I can't clearly see a way to move forward with this, and will close it.

It would be great if we could determine what the real issue is. The fact that GetMessageW() blocks isn't the real issue so using PeekMessageW() isn't right fix either in my opinion.

@SteffenL SteffenL closed this Mar 24, 2025
@AlbertXingZhang
Copy link
Author

Hi @SteffenL, I'm sorry for being late. I was looking for how Microsoft implement Application.DoEvents. I think we tried to implement the same thing here. Here are my findings:

  1. Application.DoEvents calls to Application.ThreadContext.RunMessageLoop with msoloop.DoEvents at https://github.com/dotnet/winforms/blob/main/src/System.Windows.Forms/System/Windows/Forms/Application.cs#L910.
  2. Application.ThreadContext.RunMessageLoop calls to Application.ThreadContext.RunMessageLoopInner with msoloop.DoEvents at https://github.com/dotnet/winforms/blob/main/src/System.Windows.Forms/System/Windows/Forms/Application.ThreadContext.cs#L694
  3. Application.ThreadContext.RunMessageLoopInner calls to Application.ComponentThreadContext.RunMessageLoop with msoloop.DoEvents and fullModal = false at https://github.com/dotnet/winforms/blob/main/src/System.Windows.Forms/System/Windows/Forms/Application.ThreadContext.cs#L807
  4. Application.ComponentThreadContext.RunMessageLoop calls to Application.ComponentManager.FPushMessageLoop at https://github.com/dotnet/winforms/blob/main/src/System.Windows.Forms/System/Windows/Forms/Application.ComponentThreadContext.cs#L259 since (!fullModal && reason != msoloop.DoEventsModal) is true.
  5. As you can see, inside Application.ComponentManager.FPushMessageLoop, it always calls PeekMessage first at https://github.com/dotnet/winforms/blob/main/src/System.Windows.Forms/System/Windows/Forms/Application.ComponentManager.cs#L231 and then calls GetMessage at https://github.com/dotnet/winforms/blob/main/src/System.Windows.Forms/System/Windows/Forms/Application.ComponentManager.cs#L239 only when PeekMessage returns true.
    If PeekMessage returns false, it never calls GetMessage and just breaks at https://github.com/dotnet/winforms/blob/main/src/System.Windows.Forms/System/Windows/Forms/Application.ComponentManager.cs#L268 since uReason is msoloop.DoEvents is true. So, the caller continues doing its own tasks. For our case, we check flag.

As I mentioned before, I'm not an expert who knows Windows messages and message queues better than Microsoft. I just tried to do something similar to what Microsoft did for .NET in this PR. I hope it can solve the hanging issue. Actually, it does work for my case.

@SteffenL
Copy link
Collaborator

Is there any way you can provide a minimal example where you can reproduce the issue?

Does it occur with the C++ examples that are in this repository?

@AlbertXingZhang
Copy link
Author

Sorry, whenever I took my part from the original big project into a minimal repo, the issue never got reproduced. And the C++ examples in this repo don't have this issue either, at least on my computer.

@SteffenL
Copy link
Collaborator

Is it a pure C++ project or is .NET involved in any way?

@SteffenL SteffenL reopened this Mar 25, 2025
@SteffenL SteffenL added need: help Help from the community is requested os: windows Involves Windows operating systems labels Mar 25, 2025
@AlbertXingZhang
Copy link
Author

The project is modularized. Although my module is written in pure C++, along with other modules written in WinForms and WPF, all of them coexist in the same process. And each module can call each other. So, my C++ dialog is shown from a Windows Form or a WPF window.

@SteffenL
Copy link
Collaborator

I'm still trying to wrap my head around this.

If you're mixing different UI frameworks within the same process, it seems reasonable to assume that you have an existing event/message loop somewhere in your app, possibly with their own special handling as you've shown in your earlier post.

The webview library has a few places where it temporarily spins a loop even if you don't call run(). It's needed to turn asynchronous code into synchronous code, but may have side effects.

If another loop (PeekMessage() or GetMessage()) exists, maybe that could cause the library's GetMessage() to block if messages are "stolen".

In this case, maybe that could be alleviated by using PeekMessage() on our end inside the temporary loops, while keeping GetMessage() inside run().

However, only one message loop is supposed to be running at a time per thread, so it's difficult for me to really imagine why that can help.

With that said, someone who integrates the webview library into an app with an existing UI framework might argue that the webview library is the one who is potentially stealing messages?

If there's any cross-thread messaging and synchronization going on somewhere, maybe that could be a source of problems... Not quite sure.

I'm also not sure if it'll help but:

  • Can you view the call stack at the time GetMessageW() hangs? If so, I'm interested in the call stack within the webview library and whether the freeze always occurs at the same point up the call stack (constructor, destructor, bind(), run(), etc).
  • When it freezes, does it ever unfreeze? What if you let it run for a couple minutes?

How to create a message loop

This page shows how to create a message loop:

https://learn.microsoft.com/en-us/windows/win32/winmsg/using-messages-and-message-queues#creating-a-message-loop

MSG msg;
BOOL bRet;
while( (bRet = GetMessage( &msg, NULL, 0, 0 )) != 0)
{ 
    if (bRet == -1)
    {
        // handle the error and possibly exit
    }
    else
    {
        TranslateMessage(&msg); 
        DispatchMessage(&msg); 
    }
}

Here's another variant of this from Microsoft:

https://learn.microsoft.com/en-us/windows/win32/learnwin32/window-messages#the-message-loop

// Correct.

MSG msg = { };
while (GetMessage(&msg, NULL, 0, 0) > 0)
{
    TranslateMessage(&msg);
    DispatchMessage(&msg);
}

That's pretty much how we do it.

Another piece of documentation:

PeekMessage is similar to the GetMessage function; both check a message queue for a message that matches the filter criteria and then copy the message to an MSG structure. The main difference between the two functions is that GetMessage does not return until a message matching the filter criteria is placed in the queue, whereas PeekMessage returns immediately regardless of whether a message is in the queue.

https://learn.microsoft.com/en-us/windows/win32/winmsg/using-messages-and-message-queues#examining-a-message-queue

@AlbertXingZhang
Copy link
Author

AlbertXingZhang commented Mar 26, 2025

Sorry, it is very difficult to debug into the code. I could only use logs to locate the freezing cause before. When it froze, I remember I waited for a couple minutes but it never recovered.
According to the documentation about Examining a Message Queue you shared, my understanding is using PeekMessage to examine a message queue from outside the thread's message loop when during a lengthy operation. I think it is because the lengthy operation is likely called from the original message loop's DispatchMessage if it runs in the same thread. For our case, it is checking flag periodically. But why does Microsoft use PeekMessage instead of GetMessage here to continue processing messages when the original message loop is blocked? Honestly, I don't know. As I mentioned before, I can only think PeekMessage allows checking flag more frequently than GetMessage. However this hanging problem lets me think there might be something wrong like dead lock in some situations I'm not aware of for calling GetMessage inside DispatchMessage when there is no message in the message queue. Again, it is only my guess.

@SteffenL
Copy link
Collaborator

I guess you had log statements around all the GetMessage() (and maybe around DispatchMessage()) calls within the webview library, and saw that GetMessage() never returned.. And I guess your logs were flushed immediately. If not, maybe you can check again.

I believe DispatchMessage() can also block if a message handler blocks. If a handler deadlocks somehow then the whole loop is going to be blocked.

A call stack would be handy in this situation... What if you were to keep it running in a debugger, then when you notice that it's frozen, you just pause the program in the debugger and check the main thread's call stack?

@AlbertXingZhang
Copy link
Author

Sorry, I couldn't find the old logs about hanging currently. I can keep it running in a debugger but I'm not sure if I can capture the hanging issue on this computer. This issue is a mystery to me. It happened occasionally on some of our previous computers while it never happened on the others.

@citkane
Copy link
Contributor

citkane commented Jun 18, 2025

I have observed similar behaviour with init, that depending on where and how it is called, the script will never reach the frontend if the DOM is not in a particular state. When this happens, the application will appear to hang. f4eb133

This is true for all platforms, but different for MSWebview2, which appears to need to navigate before it starts processing scripts (which I think includes the Weview JS script).
See this possibly related issue.

Seeing that add_init_script is co-located in embed for win32_edge.hh straight after pumping for initialisation, it is a possible that webview is initialised, but hanging on the Webview script injection until some other DOM event occurs.

Also possibly contributing is dispatch_impl, in which we get: "warning: Potential memory leak [clang-analyzer-cplusplus.NewDeleteLeaks]" for PostMessageW. Hard to know without seeing the client application code. Could be many other things, like an eval coming from a child thread, or init being called without navigate being called after it.

@SteffenL The Windows CI tests hanging may also be related to them running in console mode instead of GUI mode.

I have been working to resolve and normalise many higher level issues, and it would be good to remove them from the equation for this issue.

@AlbertXingZhang - would you try out #1309
You can turn on logging and trace logging, which should give a clearer picture.

@SteffenL - CI tests are switched to GUI for Windows in #1312 (e23a91d, 6a3c84d)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

need: help Help from the community is requested os: windows Involves Windows operating systems

Development

Successfully merging this pull request may close these issues.

4 participants