Chrome on Windows performance improvements and the journey of Native Window Occlusion

Thursday, December 9, 2021

Whether you prefer organizing your browser with tab groups, naming your windows, tab search, or another method, you have lots of features that help you get to the tabs you want. In this The Fast and the Curious post, we describe how we use what windows are visible to you to optimize Chrome, leading to 25.8% faster start up and 4.5% fewer crashes.

Background

For several years, to improve the user experience, Chrome has lowered the priority of background tabs[1]. For example, JavaScript is throttled in background tabs, and these tabs don’t render web content. This reduces CPU, GPU and memory usage, which leaves more memory, CPU and GPU for foreground tabs that the user actually sees. However, the logic was limited to tabs that weren't focused in their window, or windows that were minimized or otherwise moved offscreen.

Through experiments, we found that nearly 20% of Chrome windows are completely covered by other windows, i.e., occluded. If these occluded windows were treated like background tabs, our hypothesis was that we would see significant performance benefits. So, around three years ago, we started working on a project to track the occlusion state of each Chrome window in real time, and lower the priority of tabs in occluded windows. We called this project Native Window Occlusion, because we had to know about the location of native, non-Chrome windows on the user’s screen. (The location information is discarded immediately after it is used in the occlusion calculation.)

Calculating Native Window Occlusion

The Windows OS doesn’t provide a direct way to find out if a window is completely covered by other windows, so Chrome has to figure it out on its own. If we only had to worry about other Chrome windows, this would be simple because we know where Chrome windows are, but we have to consider all the non-Chrome windows a user might have open, and know about anything that happens that might change whether Chrome windows are occluded or not.

There are two main pieces to keeping track of which Chrome windows are occluded. The first is the occlusion calculation, which consists of iterating over the open windows on the desktop, in z-order (front to back) and seeing if the windows in front of a Chrome window completely cover it. The second piece is deciding when to do the occlusion calculation.

Calculating Occlusion

In theory, figuring out which windows are occluded is fairly simple. In practice, however, there are lots of complications, such as multi-monitor setups, virtual desktops, non-opaque windows, and even cloaked windows(!). This needs to be done with great care, because if we decide that a window is occluded when in fact it is visible to the user, then the area where the user expects to see web contents will be white. We also don’t want to block the UI thread while doing the occlusion calculation, because that could reduce the responsiveness of Chrome and degrade the user experience. So, we compute occlusion on a separate thread, as follows:

Ignore minimized windows, since they’re not visible.
Mark Chrome windows on a different virtual desktop as occluded.
Compute the virtual screen rectangle, which combines the display monitors. This is the unoccluded screen rectangle.
Iterate over the open windows on the desktop from front to back, ignoring invisible windows, transparent windows, floating windows (windows with style WS_EX_TOOLBAR), cloaked windows, windows on other virtual desktops, non-rectangular windows[2], etc. Ignoring these kinds of windows may cause some occluded windows to be considered visible (false negatives) but importantly it won’t lead to treating visible windows as occluded (false positives). For each window:

Subtract the window's area from the unoccluded screen rectangle.
If the window is a Chrome window, check if its area overlapped with the unoccluded area. If it didn’t, that means the Chrome window is completely covered by previous windows, so it is occluded.

Keep iterating until all Chrome windows are captured.

At this point, any Chrome window that we haven’t marked occluded is visible, and we’re done computing occlusion. Whew! Now we post a task to the UI thread to update the visibility of the Chrome windows.

This is all done without synchronization locks, so the occlusion calculation has minimal effect on the UI thread, e.g., it will not ever block the UI thread and degrade the user experience.

For more detailed implementation information, see the documentation.

Deciding When to Calculate Occlusion

We don’t want to continuously calculate occlusion because it would degrade the performance of Chrome, so we need to know when a window might become visible or occluded. Fortunately, Windows lets you track various system events, like windows moving or getting resized/maximized/minimized. The occlusion-calculation thread tells Windows that it wants to track those events, and when notified of an event, it examines the event to decide whether to do a new occlusion calculation. Because we may get several events in a very short time, we don’t calculate occlusion more than once every 16 milliseconds, which corresponds to the time a single frame is displayed, assuming a frame rate of 60 frames per second (fps).

Some of the events we listen for are windows getting activated or deactivated, windows moving or resizing, the user locking or unlocking the screen, turning off the monitor, etc. We don’t want to calculate occlusion more than necessary, but we don’t want to miss an event that causes a window to become visible, because if we do, the user will see a white area where their web contents should be. It’s a delicate balance[3].

The events we listen for are focused on whether a Chrome window is occluded. For example, moving the mouse generates a lot of events, and cursors generate an event for every blink, so we ignore events that aren’t for window objects. We also ignore events for most popup windows, so that tooltips getting shown doesn’t trigger an occlusion calculation.

The occlusion thread tells Windows that it wants to know about various Windows events. The UI thread tells Windows that it wants to know when there are major state changes, e.g., the monitor is powered off, or the user locks the screen.

Results

This feature was developed behind an experiment to measure its effect and rolled out to 100% of Chrome Windows users in October 2020 as part of the M86 release. Our metrics show significant performance benefits with the feature turned on:

8.5% to 25.8% faster startup
3.1% reduction in GPU memory usage
20.4% fewer renderer frames drawn overall
4.5% fewer clients experiencing renderer crashes
3.0% improvement in first input delay
6.7% improvement in first contentful paint and largest contentful paint

[1] Note that certain tabs are exempt from having their priority lowered, e.g., tabs playing audio or video.

[2] Non-rectangular windows complicate the calculations and were thought to be rare, but it turns out non-rectangular windows are common on Windows 7, due to some quirks of the default Windows 7 theme.

[3] When this was initially launched, we quickly discovered that Citrix users were getting white windows whenever another user locked their screen, due to Windows sending us session changed notifications for sessions that were not the current session. For the details, look here.

Background

Calculating Native Window Occlusion

Calculating Occlusion

Ignore minimized windows, since they’re not visible.
Mark Chrome windows on a different virtual desktop as occluded.
Compute the virtual screen rectangle, which combines the display monitors. This is the unoccluded screen rectangle.
Iterate over the open windows on the desktop from front to back, ignoring invisible windows, transparent windows, floating windows (windows with style WS_EX_TOOLBAR), cloaked windows, windows on other virtual desktops, non-rectangular windows[2], etc. Ignoring these kinds of windows may cause some occluded windows to be considered visible (false negatives) but importantly it won’t lead to treating visible windows as occluded (false positives). For each window:

Subtract the window's area from the unoccluded screen rectangle.
If the window is a Chrome window, check if its area overlapped with the unoccluded area. If it didn’t, that means the Chrome window is completely covered by previous windows, so it is occluded.

Keep iterating until all Chrome windows are captured.

This is all done without synchronization locks, so the occlusion calculation has minimal effect on the UI thread, e.g., it will not ever block the UI thread and degrade the user experience.

For more detailed implementation information, see the documentation.

Deciding When to Calculate Occlusion

Results

8.5% to 25.8% faster startup
3.1% reduction in GPU memory usage
20.4% fewer renderer frames drawn overall
4.5% fewer clients experiencing renderer crashes
3.0% improvement in first input delay
6.7% improvement in first contentful paint and largest contentful paint

A reason for the startup and first-contentful-paint improvements is when Chrome restores two or more full-screen windows when starting up, one of the windows is likely to be occluded. Chrome will now skip much of the work for that window, thus saving resources for the more important foreground window.

Posted by David Bienvenu, Chrome Developer

Data source for all statistics: Real-world data anonymously aggregated from Chrome clients.[1] Note that certain tabs are exempt from having their priority lowered, e.g., tabs playing audio or video.[2] Non-rectangular windows complicate the calculations and were thought to be rare, but it turns out non-rectangular windows are common on Windows 7, due to some quirks of the default Windows 7 theme.[3] When this was initially launched, we quickly discovered that Citrix users were getting white windows whenever another user locked their screen, due to Windows sending us session changed notifications for sessions that were not the current session. For the details, look here.

Faster Chrome - Let The Compiler do the work

Wednesday, December 1, 2021

int foo();
int fiver(int num) {
for(int j = 0; j < 5; j++)
num = num + foo();
return num;
}

The compiler can either compile this as a loop (smaller), or turn it into five additions in a row (faster, but bigger)

You save the cost of checking the end of the loop and incrementing a variable every time through the loop. But in exchange, you now have many repeated calls to foo(). If foo() is called a lot, that is a lot of space.

And while speed matters a lot, we also care about binary size. (Yes, we see your memes!) And that tradeoff - exchanging speed for memory, and vice versa, holds for a lot of compiler optimizations.

So how do you decide if the cost is worth it? One good way is to optimize for speed in areas that are run often, because your speed wins accumulate each time you run a function. You could just guess at what you inline (your compiler can do this, it's called "a heuristic", it's an educated guess), and then measure speed and code size.

The result: Likely faster. Likely larger. Is that good?

Ask any engineer a question like that, and they will answer “It depends”. So, how do you get an answer?

The More You Know… (profiling & PGO)

The best way to make a decision is with data. We collect data based on what gets run a lot, and what gets run a little. We do that for several different scenarios, because our users do lots of different things with Chrome and want them to be fast.

Our goal is collecting performance data in various scenarios, and using that to guide the compiler. There are 3 steps needed:

Instrument for profiling
Run that instrumented executable in various scenarios
Use the resulting performance profile to guide the compiler.

But we can do more (ThinLTO)

That's a good start, but we can do better. Let's look at inlining - the compiler takes the code of a called function and inserts all of it at the callsite.

inline int foo() { return 3; };
int fiver_inline(int num) {
for(int j = 0; j < 5; j++)
num = num + foo();
return num;
}

When the compiler inlines foo(), it turns into

int fiver_inline(int num) {
for(int j = 0; j < 5; j++)
num = num + 3;
return num;
}

Not bad - saves us the function call and all the setup that goes with having a function. But the compiler can in fact even do better - because now all the information is in one place. The compiler can apply that knowledge and deduce that fiver_inline() adds the number three and does so 5 times - and so the entire code is boiled down to

return num + 15;

Which is awesome! But the compiler can only do this if the source code for foo() and the location where it is called are in the same source file - otherwise, the compiler does not know what to inline. (That's the fiver() example). A trivial way around that is to combine all the source files into one giant source file and compile and link that in one go.

There's just one downside - that approach needs to generate all of the machine code for Chrome, all the time. Change one line in one file, compile all of Chrome. And there's a lot of Chrome. It also effectively disables caching build results and so makes remote compilation much less useful. (And we rely a lot on remote compilation & caching so we can quickly build new versions of Chrome.)

So, back to the drawing board. The core insight is that each source file only needs to include a few functions - it doesn't need to see every single other file. All that's needed is "cleverly" mixing the right inline functions into the right source files.

Now we're back to compiling individual source files. Distributed/cached compilation works again, small changes don't cause a full rebuild, and since "ThinLTO" does just inline a few functions, and it is relatively little overhead.

Of course, the question of "which functions should ThinLTO inline?" still needs to be answered. And the answer is still "the ones that are small and called a lot". Hey, we know those already - from the profiles we generated for Profile Guided Optimization (PGO). Talk about lucky coincidences!

But wait, there's more! (Callgraph Sorting)

We've done a lot for inlined function calls. Is there anything we can do to speed up functions that haven't been inlined, too? Turns out there is.

One important factor is that the CPU doesn't fetch data byte by byte, but in chunks. And so, if we could ensure that a chunk of data doesn't just contain the function we need right now, but ideally also the ones that we'll need next, we could ensure that we have to go out and get chunks of data less often.

In other words, we want functions that are called right after the other to live next to each other in memory also ("code locality"). And we already know which functions are called close to each other - because we ran our profiling and stored performance profiles for PGO.

We can then use that information to ensure that the right functions are next to each other when we link.

I.e.

g.c
extern int f1();
extern int f2();
extern int f3();
int g() {
f1();
for(..) {
f3();
}
f1();
f2();
}

could be interpreted as "g() calls f3() a lot - so keep that one really close. f1() is called twice, so… somewhat close. And if we can squeeze in f2, even better". The calling sequence is a "call graph", and so this sorting process is called "call graph sorting".

Just changing the order of functions in memory might not sound like a lot, but it leads to ~3% performance improvement. And to know which functions calls which other ones a lot… yep. You guessed it. Our profiles from the PGO work pay off again.

One more thing.

It turns out that the compiler can make even more use of that profile data for PGO. (Not a surprise - once you know where the slow spots are, exactly, you can do a lot to improve!). To make use of that, and enable further improvements, LLVM has something called the "new pass manager". In a nutshell, it's a new way to run optimizations within LLVM, and it helps a lot with PGO. For much more detail, I'd suggest reading the LLVM blog post.

Turning that on leads to another ~3% performance increase, and ~9MB size reduction.

Why Now?

Good question. One part of that is that PGO & profiling unlock an entire new set of optimizations, as you've seen above. It makes sense to do that all in one go.

The other reason is our toolchain. We used to have a colorful mix of different technologies for compilers and linkers on different platforms.

And since this work requires changes to compilers and linkers, that would mean changing the build - and testing it - across 5 compilers and 4 linkers. But, thankfully, we've simplified our toolchain (Simplicity - another one of the 4S's!). To be able to do this, we worked with the LLVM community to make clang a great Windows compiler, in addition to partnering with the LLVM community to create new ELF (Linux), COFF (Windows), and Mach-O (macOS, iOS) linkers.

And suddenly, it's only a single toolchain to fix. (Almost. LTO for lld on MacOS is being worked on).

Sometimes, the best way to get more speed is not to change the code you wrote, but to change the way you build the software.

Posted by Rachel Blum, Engineering Director, Chrome Desktop

Data source for all statistics: Speedometer 2.0.

Chrome is fast, but there's always room for improvement. Often, that's achieved by carefully crafting the algorithms that make up Chrome. But there's a lot of Chrome, so why not let computers do at least some part of our work? In this installment of The Fast And the Curious, we'll show you several changes in how we build Chrome to achieve a 25.8% higher score on Speedometer on Windows and a 22.0% increase in browser responsiveness.

Why speed?

So why do we care about performance benchmarks? It's not a simple "higher numbers is better" chasing of achievements - performance was so important to Chrome that we embedded in our core principles, the "4Ss" - Speed, Security, Stability, Simplicity. And speed matters because we want a browser that responds quickly. Speed matters so much because we want to build a faster and more responsive browser. And by improving the speed of the browser, there's the additional benefit of maximizing battery use, so you don't have to charge your laptop/devices as often.

Speed? Size? Something Else?

Let's look at a typical optimization.

int foo();

int fiver(int num) {

for(int j = 0; j < 5; j++)

num = num + foo();

return num;

}

The More You Know… (profiling & PGO)

Instrument for profiling
Run that instrumented executable in various scenarios
Use the resulting performance profile to guide the compiler.

But we can do more (ThinLTO)

That's a good start, but we can do better. Let's look at inlining - the compiler takes the code of a called function and inserts all of it at the callsite.

inline int foo() { return 3; };

int fiver_inline(int num) {

for(int j = 0; j < 5; j++)

num = num + foo();

return num;

}

When the compiler inlines foo(), it turns into

int fiver_inline(int num) {

for(int j = 0; j < 5; j++)

num = num + 3;

return num;

}

return num + 15;

But wait, there's more! (Callgraph Sorting)

g.c

extern int f1();

extern int f2();

extern int f3();

int g() {

f1();

for(..) {

f3();

}

f1();

f2();

}

Chromium Blog

Chrome on Windows performance improvements and the journey of Native Window Occlusion

Background

Calculating Native Window Occlusion

Calculating Occlusion

Deciding When to Calculate Occlusion

Results

Faster Chrome - Let The Compiler do the work

The More You Know… (profiling & PGO)

But we can do more (ThinLTO)

But wait, there's more! (Callgraph Sorting)

One more thing.

Why Now?

Labels

Archive

Feed