Unless otherwise noted, changes described below apply to the newest Chrome beta channel release for Android, Chrome OS, Linux, macOS, and Windows. Learn more about the features listed here through the provided links or from the list on ChromeStatus.com. Chrome 91 is beta as of April 22, 2021.

Origin Trials

This version of Chrome introduces the origin trials described below. Origin trials allow you to try new features and give feedback on usability, practicality, and effectiveness to the web standards community. To register for any of the origin trials currently supported in Chrome, including the ones described below, visit the Chrome Origin Trials dashboard. To learn more about origin trials in Chrome, visit the Origin Trials Guide for Web Developers. Microsoft Edge runs its own origin trials separate from Chrome. To learn more, see the Microsoft Edge Origin Trials Developer Console.

New Origin Trials

Declarative Link Capturing for PWAs

The new Web App Manifest member called capture_links controls what happens when the user navigates to a page within the scope of an installed web app. It lets a site automatically open a new PWA window when the user clicks a link into its scope, or keep the app in a single-window mode like many mobile apps. Sign up for the origin trial and learn more on the origin trial dashboard.
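
For illustration, here is a minimal manifest sketch. The member values (none, new-client, existing-client-navigate) come from the Declarative Link Capturing proposal and may change over the course of the trial; the app name is hypothetical:

{
  "name": "Example PWA",
  "start_url": "/",
  "display": "standalone",
  "capture_links": "existing-client-navigate"
}

With existing-client-navigate, clicking an in-scope link navigates an existing app window rather than opening a new browser tab; new-client opens a new window, and none (the default) opts out.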

WebTransport

WebTransport is a protocol framework that enables clients constrained by the Web security model to communicate with a remote server using a secure multiplexed transport.

Currently, Web application developers have two APIs for bidirectional communication with a remote server: WebSockets and RTCDataChannel. WebSockets are TCP-based, and thus have all of the drawbacks of TCP (head-of-line blocking, no support for unreliable data transport) that make it a poor fit for latency-sensitive applications. RTCDataChannel is based on the Stream Control Transmission Protocol (SCTP), which does not have these drawbacks; however, it is designed to be used in a peer-to-peer context, which means it is rarely used in client-server settings. WebTransport provides a client-server API that supports bidirectional transfer of both unreliable and reliable data, using UDP-like datagrams and cancellable streams. WebTransport calls are visible in the Network panel of DevTools and identified as such in the Type column.

For more information, see Experimenting with WebTransport. Sign up for the origin trial and learn more on the origin trial dashboard.
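
As a rough sketch of the API (the URL is hypothetical, and the interface has continued to evolve during the trial, so treat this as illustrative rather than definitive):

const transport = new WebTransport('https://example.com:4999/echo');
await transport.ready;

// Unreliable, UDP-like datagrams.
const writer = transport.datagrams.writable.getWriter();
await writer.write(new Uint8Array([1, 2, 3]));

// A reliable, cancellable bidirectional stream.
const stream = await transport.createBidirectionalStream();
const reader = stream.readable.getReader();
const { value, done } = await reader.read();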

WebXR Plane Detection API

WebXR applications can now retrieve data about planes (flat surfaces) in the user's environment, enabling better user experiences with less processing power. Without this feature, plane detection requires custom computer vision algorithms using data from MediaDevices.getUserMedia(). These solutions usually fall short of the quality and accuracy expected of AR experiences, and they don't support world-scale use. Sign up for the origin trial and learn more on the dashboard.
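
For illustration, a sketch based on the plane-detection explainer; the exact interface exposed during the trial may differ:

const session = await navigator.xr.requestSession('immersive-ar', {
  requiredFeatures: ['plane-detection'],
});

function onXRFrame(time, frame) {
  // detectedPlanes is a set of the flat surfaces found so far.
  for (const plane of frame.detectedPlanes) {
    console.log(plane.orientation,      // 'horizontal' or 'vertical'
                plane.polygon.length);  // vertices outlining the surface
  }
  session.requestAnimationFrame(onXRFrame);
}
session.requestAnimationFrame(onXRFrame);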

Completed Origin Trials

The following features, previously in a Chrome origin trial, are now enabled by default.

WebAssembly SIMD

WebAssembly SIMD exposes hardware SIMD instructions to WebAssembly applications in a platform-independent way. This introduces a new 128-bit type that can represent different types of packed data, and several vector operations that work on packed data. SIMD can boost performance by exploiting data level parallelism and is also useful when compiling native code to WebAssembly. For more information, see the V8 feature explainer for WebAssembly SIMD.

Other features in this release

Align performance API timer resolution to cross-origin isolated capability

Coarsening of performance.now() and related timestamps based on site isolation status is now consistent across platforms. This decreases the resolution on desktop from 5 microseconds to 100 microseconds in non-isolated contexts, and increases the resolution on Android from 100 microseconds to 5 microseconds in cross-origin isolated contexts, where it's safe to do so.
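
You can observe the effective granularity from the DevTools console with a rough sketch like this:

// True when COOP/COEP headers enable cross-origin isolation.
console.log('crossOriginIsolated:', self.crossOriginIsolated);

// Estimate timer granularity: the smallest non-zero delta between readings.
let min = Infinity;
for (let i = 0; i < 100; i++) {
  const a = performance.now();
  let b = performance.now();
  while (b === a) b = performance.now();
  min = Math.min(min, b - a);
}
console.log(`~${(min * 1000).toFixed(1)} microseconds`);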

Clipboard: Read-Only Files Support

On desktop, apps can now read files from the clipboard, though they cannot write files to it; access to clipboard files is read-only.

async function onPaste(e) {
  const file = e.clipboardData.files[0];
  if (!file) return;  // Nothing file-like on the clipboard.
  const contents = await file.text();
  // Use contents here.
}
document.addEventListener('paste', onPaste);


CSS

Custom Counter Styles

The CSS @counter-style rule allows web authors to specify and use custom counter styles in list markers and CSS counters, which helps with internationalization. This change implements the features in CSS Counter Styles Level 3, with a few exceptions.
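
As a small illustration (adapted from common @counter-style examples), the following marks list items with a thumbs-up:

@counter-style thumbs {
  system: cyclic;
  symbols: "👍";
  suffix: " ";
}

ul {
  list-style: thumbs;
}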


Approach

As engineers, our training in optimization is to focus on improving the algorithmic performance of the components we own. The last three years of analyzing Chrome's immensely complex codebase, however, have taught us that the real issue is often cross-cutting: multiple unrelated features' long-tail performance issues sharing the same systemic root cause(s). Applying local expertise and optimization is likely to miss the global optimum. It is necessary to disregard our initial intuition and assume ignorance, forcing us to dig beyond what is immediately apparent and find the underlying root cause by relentlessly exposing what we don't know.

Chasing Invisible Bugs

How do we find bugs that are unforeseen, unreproducible, unowned, and essentially invisible?

First, define a scenario. For this work, we focus on user-visible jank, which we measure in the field to systematically identify moments where Chrome feels slow.

Second, gather highly actionable bug reports from the field. For this we rely on Chrome's BackgroundTracing infrastructure to generate what we call Slow Reports. A subset of Canary users who have opted in to sharing anonymized metrics have circular-buffer tracing enabled for specific scenarios. If a preconfigured threshold on a metric of interest is hit, the trace buffer is captured, anonymized, and uploaded to Google servers.

Such a bug report might look like this:


chrome://tracing view of a two-second jank in AutocompleteController::UpdateResult() on an otherwise healthy machine


We have a culprit! Let’s optimize AutocompleteController? No! We don’t know why yet: keep assuming ignorance!

By augmenting BackgroundTracing with stack sampling, we were able to find a recurring stack under stalled AutoComplete events:

    RegEnumValueW
    RegEnumValueWStub
    base::win::RegistryValueIterator::Read()
    gfx::`anonymous namespace'::CachedFontLinkSettings::GetLinkedFonts
    gfx::internal::LinkedFontsIterator::GetLinkedFonts()
    gfx::internal::LinkedFontsIterator::NextFont(gfx::Font *)
    gfx::GetFallbackFonts(gfx::Font const &)
    gfx::RenderTextHarfBuzz::ShapeRuns(...)
    gfx::RenderTextHarfBuzz::ItemizeAndShapeText(...)
    gfx::RenderTextHarfBuzz::EnsureLayoutRunList()
    gfx::RenderTextHarfBuzz::EnsureLayout()
    gfx::RenderTextHarfBuzz::GetStringSizeF()
    gfx::RenderTextHarfBuzz::GetStringSize()
    OmniboxTextView::CalculatePreferredSize()
    OmniboxTextView::ReapplyStyling()
    OmniboxTextView::SetText(...)
    OmniboxResultView::Invalidate()
    OmniboxResultView::SetMatch(AutocompleteMatch const &)
    OmniboxPopupContentsView::UpdatePopupAppearance()
    OmniboxPopupModel::OnResultChanged()
    OmniboxEditModel::OnCurrentMatchChanged()
    OmniboxController::OnResultChanged(bool)
    AutocompleteController::UpdateResult(bool,bool)
    AutocompleteController::Start(AutocompleteInput const &)
    (...)


Ah ha! Autocomplete is not at fault. Time to optimize GetFallbackFonts()?! But wait… Why is GetFallbackFonts() even called in the first place?

And before we figure that out, how do we know this is the #1 root cause of our overall long-tail performance issue? We’ve only looked at one trace so far after all...

The Measurement Conundrum

The metrics tell us how many users are affected and how bad it is, but they do not highlight the root cause.

Slow Reports tell us what the problem is for a specific user but not how many users are affected. And while we can query our corpus of Slow Report traces, it comes with inherent biases that make it impossible to correlate 1:1 with metrics. For instance, because Chrome only reports the first instance of bad performance per-session and only for users of the Canary/Dev channel, there’s both a startup and a population bias.

This is the measurement conundrum. The more actionability (data) a tool provides, the fewer scenarios it captures and the more bias it incurs. Depth vs. breadth.

Tools that attempt to do both sit somewhere in the middle, where they use aggregation over a large dataset and risk showing aggregate results based on flawed input (e.g. circular buffer tracing having dropped the interesting portion and contributing to a biased aggregate).

Thus we scientifically opted for the least engineering-minded option: open a bunch of Slow Report traces manually. This gave us the most actionability over a top-level issue we’d already quantified.

After opening dozens of traces it turned out that a great majority showed variations of the aforementioned fonts issue. While this didn’t give us a precise #users-affected, it was enough for us to believe it was the main cause of user pain seen in the metrics.


Fallback Fonts

We dug into why GetFallbackFonts() had to be called in the first place. In the example above, the caller is trying to determine the size in pixels of a Unicode string rendered by a given font.

If a substring within it is from a Unicode Block that can’t be rendered by the given font, GetFallbackFont() is used to request the system recommended fallback font for it. If that fails, GetFallbackFonts() is invoked to try all the linked fonts and determine the one that can best render it; that second fallback is much slower.

GetFallbackFont() should never fail, but in practice it's not that simple. The reliable way to do this on Windows is to query DirectWrite; however, DirectWrite was only added in Windows 7, at a time when Chrome still supported Windows XP. The GetFallbackFont() logic was therefore forced to stick to a less reliable heuristic using Uniscribe+GDI in order to work on both versions of the OS. Since things worked most of the time, no one noticed that this could have been cleaned up when Chrome later dropped support for Windows XP. With new tooling to investigate long-tail performance, this unnecessary invocation of GetFallbackFonts() turned out to be the number one cause of jank.

We fixed that, reducing the number of calls to GetFallbackFonts() by 4x.

Still not zero though, and we were still seeing instances of the aforementioned AutoComplete issue in our Slow Reports. Keep digging. DirectWrite's GetFallbackFont() failing was unexpected, but since Slow Reports are anonymized, no user-generated strings can be uploaded, so finding which codepoints were problematic was tricky. We teamed up with our privacy experts to instrument the Unicode Block and Script of text blocks going through HarfBuzz, ensuring no leakage of Personally Identifiable Information.

The Emoji Saga

With this new recording enabled, the next wave of Slow Reports came back. The vast majority of reports indicated that font fallback was failing when DirectWrite was being asked to find a font for a codepoint (Unicode character) in Miscellaneous Symbols and Pictographs. We wrote a local script trying all codepoints in that Unicode Block and quickly found out which ones were problematic: U+1F3FB through U+1F3FF are skin-tone modifiers added in Unicode 8.0 that are meaningful only when paired with another codepoint. For instance, U+1F9D7 (🧗) paired with U+1F3FF is 🧗🏿. No font can render U+1F3FF on its own, and font fallback would correctly error out after scanning all linked fonts when asked to find one. The bug was in the browser-side Unicode segmentation logic, which incorrectly broke such pairs apart and asked DirectWrite to render the two codepoints separately instead of keeping them as a single grapheme.

But wait, doesn’t Chrome support modern Unicode..?! Indeed, it does, in Blink which renders the web content. But the browser-side logic was not updated to support modern emojis (with modifiers) because it didn’t use to draw emojis at all. It’s only when the browser UI (tab strip, bookmark bar, omnibox, etc.) was modernized to support Unicode circa 2018 that the legacy segmentation logic became an (invisible) problem.

On top of that, the caching logic did not cache errors, so trying to render a modifier on its own caused a massive jank, every time, for users with a lot of fonts installed. Ironically, this cache had been added to amortize the cost of this misunderstood bottleneck when Unicode support was first added to browser UI. Diving deeper into the underlying implementation of our fonts logic, rather than stopping at the layer of the fonts APIs, not only fixed a major performance issue but also resulted in a correctness fix for other emojis. For instance, 🏳️‍🌈 is encoded as U+1F3F3 (🏳️) joined to U+1F308 (🌈) with a zero-width joiner (U+200D); before the itemization fix, browser UI would incorrectly render this grapheme as 🏳️🌈.
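
The fix itself lives in the browser process's C++, but correct behavior is easy to demonstrate from JavaScript, whose Intl.Segmenter implements Unicode grapheme segmentation; both sequences below form a single grapheme cluster:

const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const climber = '\u{1F9D7}\u{1F3FF}';                    // 🧗 + skin-tone modifier
const prideFlag = '\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}';  // 🏳️ + ZWJ + 🌈
console.log([...seg.segment(climber)].length);    // 1 grapheme cluster
console.log([...seg.segment(prideFlag)].length);  // 1 grapheme cluster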

And the journey continues...

Our journey keeps going into various components of Chrome but it always follows the same basic playbook: assume ignorance and relentlessly investigate unforeseen, unreproducible, and unowned bugs. And while stack ranking issues is nigh impossible (see: measurement conundrum), fixing the top 5 findings from any given tool and zooming in on the long tail has always addressed the majority of the user pain in practice.

Using this approach, we have reduced user-visible jank by a factor of 10 over the last 2.5 years and improved the long-tail performance of many features caught in the crossfire.


99th percentile of the number of unresponsive 100 ms intervals over a 30-second sample


Posted by Gabriel Charette 🤸🏼 and Etienne Bergeron 🕵🏻, Chrome Software Engineers

Data source for all statistics: Real-world data anonymously aggregated from Chrome clients.

Here's a closer look at memory usage in the browser process on Windows as the M89 release began rolling out in early March.

Background

Chrome is a multi-platform, multi-process, multi-threaded application, serving a wide range of needs, from small embedded WebViews on Android to spacecraft. Performance and memory footprint are of critical importance, requiring tight integration between Chrome and its memory allocator. But heterogeneity across platforms can be prohibitive, with each platform historically using a different allocator: tcmalloc on Linux and Chrome OS, jemalloc or scudo on Android, and the Low Fragmentation Heap (LFH) on Windows.


When we started this project, our goals were to: 1) unify memory allocation across platforms, 2) target the lowest memory footprint without compromising security and performance, and 3) tailor the allocator to optimize Chrome's performance. Thus we decided to use PartitionAlloc, Chromium's cross-platform allocator, to optimize memory usage for client rather than server workloads, and to focus on meaningful end-user activities rather than micro-benchmarks that don't matter in real-world usage.

Allocator Security


PartitionAlloc was designed to support multiple independent partitions, i.e. non-overlapping regions of memory. We use these partitions throughout Blink to thwart some forms of type-confusion attack, for example by ensuring that strings are separated from layout objects. However, this approach only avoids collisions between types that are allocated from different partitions. Furthermore, PartitionAlloc buckets allocations by size, to help avoid type confusion when potentially colliding objects are of dissimilar sizes. These techniques work because PartitionAlloc doesn't re-use address space: once PartitionAlloc dedicates a region of address space to a certain partition and size bucket, it will always belong to that partition and size bucket.


Additionally, PartitionAlloc protects some of its metadata with guard pages (inaccessible ranges) around memory regions. Not all metadata is equal, however: free-list entries are stored within previously allocated regions, and thus surrounded by other allocations. To detect corrupted free-list entries and off-by-one overflows from client code, we encode and shadow them.
Finally, having our own allocator enables advanced security features like MiraclePtr and *Scan.

Architecture Details

Each partition in PartitionAlloc uses a single, central, slab-based allocator to conserve memory, with a minimal per-thread cache in front for scaling to multi-threaded workloads. This simplicity also pays performance dividends: we've extensively profiled and aggressively trimmed the allocator's fast path, improving thread-local storage access and locking, reducing cache-line fetches, and removing branches.

PartitionAlloc pre-reserves slabs of virtual address space. They are gradually backed by physical memory, as allocation requests arrive. Small and medium-sized allocations are grouped in geometrically-spaced, size-segregated buckets, e.g. [241; 256], [257; 288]. Each slab is split into regions (called “slot spans”) that satisfy allocations (“slots”) from only one particular bucket, thereby increasing cache locality while lowering fragmentation. Conversely, larger allocations don’t go through the bucket logic and are fulfilled using the operating system’s primitives directly (mmap() on POSIX systems, and VirtualAlloc() on Windows).

This central allocator is protected by a single per-partition lock. To mitigate the scalability problem arising from contention, we add a small, per-thread cache of small slots in front, yielding a three-tiered architecture:

The first layer (Per-thread cache) holds a small number of slots belonging to the smaller, more commonly used buckets. Because these slots are stored per-thread, they can be allocated without a lock, requiring only a fast thread-local storage lookup and improving cache locality in the process. The per-thread cache has been tailored to satisfy the majority of requests by allocating from and releasing memory to the second layer in batches, amortizing lock acquisition and further improving locality while not trapping excess memory.

The second layer (Slot span free-lists) is invoked upon a per-thread cache miss. For each bucket size, PartitionAlloc knows a slot span with free slots associated with that size, and captures a slot from the free-list of that span. This is still a fast path, but slower than per-thread cache as it requires taking a lock. However, this section is only hit for larger allocations not supported by per-thread cache, or as a batch to fill the per-thread cache.

Finally, if there are no free slots in the bucket, the third layer (Slot span management) either carves out space from a slab for a new slot span, or allocates an entirely new slab from the operating system, which is a slow but very infrequent operation.
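
To make the tiering concrete, here is a deliberately simplified toy model in JavaScript. PartitionAlloc itself is C++ and manages raw address space; every name and constant below is invented for illustration:

const BUCKETS = [16, 32, 64, 128, 256];  // size-segregated bucket limits
const BATCH = 8;                         // slots moved between tiers at once

const threadCache = new Map();           // tier 1: per-thread, lock-free
const centralFreeLists = new Map();      // tier 2: lock-protected in reality

function allocate(size) {
  const bucket = BUCKETS.find((b) => size <= b);
  if (bucket === undefined) {
    return { size };                     // large: goes straight to the OS
  }

  // Tier 1: a per-thread cache hit is the common, lock-free fast path.
  let cached = threadCache.get(bucket);
  if (cached && cached.length) return cached.pop();

  // Tier 2: refill the thread cache in a batch, amortizing one lock acquisition.
  let central = centralFreeLists.get(bucket);
  if (!central || !central.length) {
    // Tier 3: carve a fresh slot span (here, 64 slots) out of a slab.
    central = Array.from({ length: 64 }, () => ({ bucket }));
    centralFreeLists.set(bucket, central);
  }
  cached = central.splice(0, BATCH);
  threadCache.set(bucket, cached);
  return cached.pop();
}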

The overall performance and space-efficiency of the allocator hinges on the many tradeoffs across its layers, such as how much memory to cache per thread, how many buckets to use, and when to reclaim memory. Please refer to PartitionAlloc to learn more about the design.

All in all, we hope you will enjoy the additional memory savings and performance improvements brought by PartitionAlloc, ensuring a safer, leaner, and faster Chrome for users on Earth and in outer space alike. Stay tuned for further improvements, and support of more platforms coming in the near future.

Posted by Benoît Lizé and Bartek Nowierski, Chrome Software Engineers

Data source for all statistics: Real-world data anonymously aggregated from Chrome clients.
*The core metric measures jank (delays in handling user input) every 30 seconds.

Video conferencing took on elevated importance in 2020. I’m not on the Google Meet team but I do work on Chrome, so I fired up my favorite profiler during one of my daily meetings to see if I could find anything useful.

There is a lot going on during video conferencing, spread across multiple processes. With my usual dozens of tabs open there were 37 Chrome processes, with six of them actively participating in the video conference. In addition there were over 200 other processes running (87 copies of svchost.exe, for instance), with four of those involved in video conferencing. You may well wonder why it takes 10 processes to connect two people, so here is a list of the processes and their roles:

  • audiodg.exe - Windows Audio Device Graph Isolation, audio output
  • dwm.exe - Windows Desktop Window Manager, showing video
  • svchost.exe - Windows Camera Frame Server (webcam capture)
  • System - Windows system process, does miscellaneous tasks on behalf of processes
  • chrome.exe - browser process, the master control program
  • chrome.exe - renderer process, the Meet tab
  • chrome.exe - GPU process, in charge of rendering pages
  • chrome.exe - NetworkService utility process, talking to the network
  • chrome.exe - VideoCaptureService, talking to the Windows Camera Frame Server
  • chrome.exe - AudioService, controls audio input and output

These tasks are spread across different processes for security and stability. If one of them crashes it can be restarted without taking everything down. If one of them is compromised due to a security bug then it is isolated from the rest of the system and the damage may be contained.

This all makes good sense, but having this many processes involved makes performance profiling challenging: there is a lot to look through when hunting for potential improvements, made more difficult by the fact that I know little about the Meet architecture.


Analyzing a profile

Video conferencing is CPU intensive - you have to record, compress, transmit, receive, decompress, and display both audio and video. The data below shows CPU samples recorded by Microsoft’s Event Tracing for Windows (ETW). This sampling profiler works by interrupting every running thread about 1,000 times a second and recording a call stack. I used Windows Performance Analyzer (WPA) to display the results. In the screenshot below I am looking at a 10-second period and over 16,000 samples (representing about 16 seconds of CPU time) were recorded across the 10 processes involved in video conferencing:

That’s a lot of samples to look through, but the call stacks are collated so that you can drill down on the busiest stacks. I didn’t find anything in the first Chrome process, but in the second one I did:


It doesn’t look like much, but I recognized immediately that the 124 samples in KiPageFault were worth investigating. Most of the CPU-intensive work in this trace was important and unavoidable work but I had a hunch that these samples represented avoidable work - something that I could fix. And, even though they represented just 0.75% of the samples I suspected that they indicated a somewhat greater cost.

I recognized their importance immediately because this is something that I have seen before. KiPageFault means that the processor touched some memory that had been allocated, but was not currently in the process. This could mean that the pages had been removed from the process to save memory, but in an active process on a machine with lots of memory, that didn’t make sense. What was more likely was that this represented recently allocated memory.

When a program allocates a small amount of memory, the local memory manager (sometimes called the “heap”) will usually have some available that it can give to the program. But if it doesn’t have an appropriate block of memory then it will ask the operating system (OS) for some. If a program allocates a large amount of memory (greater than a MB or so) then the heap will definitely ask for more memory. This is, in itself, a relatively cheap operation. The heap asks the OS for some memory, the OS says “sure”, then the OS makes note of the fact that it promised this memory, and that’s it. The OS does not, at that time, actually give the program any memory. This is the way of the world on Windows, Linux, and Android; it is generally a good thing, but it can be confusing and surprising. If the process never touches the memory then the memory is never added to the process, but if the process does touch the memory then individual pages of zeroed memory are brought into the process. These are called demand-zero page faults because zeroed pages are “faulted” into the process on demand.

In other words, allocating a large block of memory is quite cheap, but doesn’t actually set up the promised memory. Then, when the program tries to use the memory and the CPU discovers that there is no memory at that address it triggers an exception, which wakes up the OS. The OS checks its records and realizes that it did in fact promise to put memory at that address so it then puts some there and restarts the program. This happens so quickly that if you’re not paying attention you will miss it, but it shows up when profiling as samples hitting in KiPageFault.

This bizarre dance happens again for every 4-KiB block in the allocation - 4 KiB is the size of the pages that the CPU and the OS work on.

The cost is small. Across this 10-second period only 124 samples - representing about 124 ms or 0.124 seconds - hit inside of KiPageFault. The total cost of the enclosing CopyImage_SSE4_1 function was about 240 ms, so the page faults accounted for more than half of this function, but barely a quarter of the cost of the OnSample function on line 15.

The total costs of these page faults is modest but they hint at many other costs:
  • If this memory is being allocated repeatedly (presumably every frame) then it must also be freed every frame. On line 26 we can see that the Release function which frees the memory uses another 64 samples.
  • When the pages are freed the operating system has to zero them (for security reasons) so that they are ready to be reused. This is done in the Windows System process - an almost entirely hidden cost. Sure enough when I looked in the System process I saw 138 samples in the MiZeroPageThread. I found that 87% of the KiPageFault samples in the entire system were in the CopyImage_SSE4_1 call so presumably 87% of the 138 samples in the MiZeroPageThread were due to this pattern.


I analyzed these hidden costs of memory allocation in a 2014 blog post. The basic memory architecture of Windows hasn’t changed since then so the hidden costs remain about the same.

In addition to CPU samples my ETW trace contained call stacks for every call to VirtualAlloc. This WPA screenshot shows a 10-second period where the OnSample function does 298 allocations that are each 1.320 MB, roughly 30 per second:


At this point we can estimate that the cost of these repeated allocations is 124 samples (faulting in) plus 64 (freeing) plus about 124 more (the share of the 138 zeroing samples attributable to this pattern) for a total of roughly 312 samples. This gets us up to about 1.9% of the total CPU cost of video conferencing. Fixing this is not going to change the world, but this is a change worth doing.

But wait, there’s more!

We are locking this buffer so that we can look at the contents, but it turns out we don’t actually want the lock call to copy the buffer at all. We just want the lock call to describe the buffer to us so that we can look at it in place. Therefore the entire cost of the MFCopyImage call is waste! That’s another 116 samples. In addition, in the CMF2DMediaBuffer::Unlock call on line 26 there is another call to CMF2DMediaBuffer::ContiguousCopyFrom. That’s because the Unlock call assumes that we might have modified the copy of the buffer, so it copies it back. So the 101 samples there are all waste as well!

If we can examine this buffer without the alloc/copy/copy/free dance then we can save 312 samples plus 116 samples (the rest of the copying cost) plus 101 samples (the copying-it-back cost) for a total saving of 3.2%. This is getting better all the time.

Note that sampled data is only statistically valid, and the actual percentages vary significantly depending on the computer and the exact workload. But, the point remains - it is a non-dramatic but worthwhile change to investigate.

Despite spending years in the video-game business my knowledge of these graphics-buffer locking and unlocking APIs is weak. I ended up relying on the wisdom of my Twitter followers to come to the conclusion that the copying was entirely avoidable, and to get a rough pattern for how it could be fixed. After filing an overly verbose bug I delegated the task of actually fixing it. The fix landed in M85 and was deemed important enough that it was then backported to M84.

You’d have to be paying very close attention to see the difference - spread across a Chrome process and the system process - but I hope that this helped some computers run a bit cooler and last longer on their batteries. And, while this inefficiency was found by profiling Google Meet, the improvement actually benefits any product that uses the webcam inside Chrome (and other Chromium-based browsers).

Verification

After the fix landed I compared two 10-second ETW traces from Chrome Canary before and after the change, each taken with no other programs running except a single Chrome tab running the Google Meet pre-meeting page. In both cases I looked at a 10-second period of time in the profiler. This showed:


CPU time in OnSample:
Before: 458 ms (432 ms of which were in Lock/Unlock/KiPageFault)
After: 27 ms


Allocations:
Before: 30 allocations per second of 1.32 MB each (one per frame, running at 30 fps - a higher framerate would mean more allocations), totaling 396 MB over 10 seconds
After: 0 allocations


CPU time in the System process's MiZeroPageThread:
Before: 36 ms
After: 0 ms

These measurements showed - in three different ways - that the performance problem was fixed. The memory copying in OnSample was gone, the repeated allocations were gone, and the system process was doing less work. Mission accomplished, bug closed.