Planet Igalia

January 11, 2025

Qiuyi Zhang (Joyee)

Executable loading and startup performance on macOS

Recently, I fixed a macOS-specific startup performance regression in Node.js after an extensive investigation. Along the way, I learned a lot about tools on macOS and Node

January 11, 2025 10:29 PM

January 09, 2025

Andy Wingo

ephemerons vs generations in whippet

Happy new year, hackfolk! Today, a note about ephemerons. I thought I was done with them, but it seems they are not done with me. The question at hand is, how do we efficiently and correctly implement ephemerons in a generational collector? Whippet‘s answer turns out to be simple but subtle.

on oracles

The deal is, I want to be able to evaluate different collector constructions and configurations, and for that I need a performance oracle: a known point in performance space-time against which to compare the unknowns. For example, I want to know how a sticky mark-bit approach to generational collection does relative to the conventional state of the art. To do that, I need to build a conventional system to compare against! If I manage to do a good job building the conventional evacuating nursery, it will have similar performance characteristics as other nurseries in other unlike systems, and thus I can use it as a point of comparison, even to systems I haven’t personally run myself.

So I am adapting the parallel copying collector I described last July to have generational support: a copying (evacuating) young space and a copying old space. Ideally then I’ll be able to build a collector with a copying young space (nursery) but a mostly-marking nofl old space.

notes on a copying nursery

A copying nursery has different operational characteristics than a sticky-mark-bit nursery, in a few ways. One is that a sticky mark-bit nursery will promote all survivors at each minor collection, leaving the nursery empty when mutators restart. This has the pathology that objects allocated just before a minor GC aren’t given a chance to “die young”: a sticky-mark-bit GC over-promotes.

Contrast that to a copying nursery, which can decide to promote a survivor or leave it in the young generation. In Whippet the current strategy for the parallel-copying nursery I am working on is to keep freshly allocated objects around for another collection, and only promote them if they are live at the next collection. We can do this with a cheap per-block flag, set if the block has any survivors, which is the case if it was allocated into as part of evacuation during minor GC. This gives objects enough time to die young while not imposing much cost in the way of recording per-object ages.

Recall that during a GC, all inbound edges from outside the graph being traced must be part of the root set. For a minor collection where we just trace the nursery, that root set must include all old-to-new edges, which are maintained in a data structure called the remembered set. Whereas for a sticky-mark-bit collector the remembered set will be empty after each minor GC, for a copying collector this may not be the case. An existing old-to-new remembered edge may be unnecessary, because the target object was promoted; we will clear these old-to-old links at some point. (In practice this is done either in bulk during a major GC, or the next time the remembered set is visited during the root-tracing phase of a minor GC.) Or we could have a new-to-new edge which was not in the remembered set before, but now because the source of the edge was promoted, we must adjoin this old-to-new edge to the remembered set.

To preserve the invariant that all edges into the nursery are part of the roots, we have to pay special attention to this latter kind of edge: we could (should?) remove old-to-promoted edges from the remembered set, but we must add promoted-to-survivor edges. The field tracer has to have specific logic that applies to promoted objects during a minor GC to make the necessary remembered set mutations.

other object kinds

In Whippet, “small” objects (less than 8 kilobytes or so) are allocated into block-structed spaces, and large objects have their own space which is managed differently. Notably, large objects are never moved. There is generational support, but it is currently like the sticky-mark-bit approach: any survivor is promoted. Probably we should change this at some point, at least for collectors that don’t eagerly promote all objects during minor collections.

finalizers?

Finalizers keep their target objects alive until the finalizer is run, which effectively makes each finalizer part of the root set. Ideally we would have a separate finalizer table for young and old objects, but currently Whippet just has one table, which we always fully traverse at the end of a collection. This effectively adds the finalizer table to the remembered set. This is too much work—there is no need to visit finalizers for old objects in a minor GC—but it’s not incorrect.

ephemerons

So what about ephemerons? Recall that an ephemeron is an object E×KV in which there is an edge from E to V if and only if both E and K are live. Implementing this conjunction is surprisingly gnarly; you really want to discover live ephemerons while tracing rather than maintaining a global registry as we do with finalizers. Whippet’s algorithm is derived from what SpiderMonkey does, but extended to be parallel.

The question is, how do we implement ephemeron-reachability while also preserving the invariant that all old-to-new edges are part of the remembered set?

For Whippet, the answer turns out to be simple: an ephemeron E is never older than its K or V, by construction, and we never promote E without also promoting (if necessary) K and V. (Ensuring this second property is somewhat delicate.) In this way you never have an old E and a young K or V, so no edge from an ephemeron need ever go into the remembered set. We still need to run the ephemeron tracing algorithm for any ephemerons discovered as part of a minor collection, but we don’t need to fiddle with the remembered set. Phew!

conclusion

As long all promoted objects are older than all survivors, and that all ephemerons are younger than the objects referred to by their key and value edges, Whippet’s parallel ephemeron tracing algorithm will efficiently and correctly trace ephemeron edges in a generational collector. This applies trivially for a sticky-mark-bit collector, which always promotes and has no survivors, but it also holds for a copying nursery that allows for survivors after a minor GC, as long as all survivors are younger than all promoted objects.

Until next time, happy hacking in 2025!

by Andy Wingo at January 09, 2025 10:15 AM

January 08, 2025

Eric Meyer

CSS Wish List 2025

Back in 2023, I belatedly jumped on the bandwagon of people posting their CSS wish lists for the coming year.  This year I’m doing all that again, less belatedly! (I didn’t do it last year because I couldn’t even.  Get it?)

I started this post by looking at what I wished for a couple of years ago, and a small handful of my wishes came true:

Note that by “came true”, I mean “reached at least Baseline Newly Available”, not “reached Baseline Universal”; that latter status comes over time.  And more :has() isn’t really a feature you can track, but I do see more people sharing cool :has() tricks and techniques these days, so I’ll take that as a positive signal.

A couple more of my 2023 wishes are on the cusp of coming true:

Those are both in the process of rolling out, and look set to reach Baseline Newly Available before the year is done.  I hope.

That leaves the other half of the 2023 list, none of which has seen much movement.  So those will be the basis of this year’s list, with some new additions.

Hanging punctuation

WebKit has been the sole implementor of this very nice typographic touch for almost a decade now.  The lack of any support by Blink and Gecko is now starting to verge on feeling faintly ridiculous.

Margin and line box trimming

Trim off the leading block margin on the first child in an element, or the trailing block margin of the last child, so they don’t stick out of the element and mess with margin collapsing.  Same thing with block margins on the first and last line boxes in an element.  And then, be able to do similar things with the inline margins of elements and line boxes!  All these things could be ours.

Stroked text

We can already fake text stroking with text-shadow and paint-order, at least in SVG.  I’d love to have a text-stroke property that can be applied to HTML, SVG, and MathML text.  And XML text and any text that CSS is able to style.  It should be at least as powerful as SVG stroking, if not more so.

Expanded attr() support

This has seen some movement specification-wise, but last I checked, no implementation promises or immediate plans.  Here’s what I want to be able to do:

td {width: attr(data-size em, 1px));

<td data-size="5">…</td>

The latest Values and Units module describes this, so fingers crossed it starts to gain some momentum.

Exclusions

Yes, I still want CSS Exclusions, a lot.  They would make some layout hacks a lot less hacky, and open the door for really cool new hacks, by letting you just mark an element as creating a flow exclusions for the content of other elements.  Position an image across two columns of text and set it to exclude, and the text of those columns will flow around or past it like it was a float.  This remains one of the big missing pieces of CSS layout, in my view.  Linked flow regions is another.

Masonry layout

This one is a bit stalled because the basic approach still hasn’t been decided.  Is it part of CSS Grid or its own display type?  It’s a tough call.  There are persuasive arguments for both.  I myself keep flip-flopping on which one I prefer.

Designers want this.  Implementors want this.  In some ways, that’s what makes it so difficult to pick the final syntax and approach: because everyone wants this, everyone wants to make the exactly perfect right choices for now, for the future, and for ease of teaching new developers.  That’s very, very hard.

Grid track and gap styles

Yeah, I still want a Grid equivalent of column-rule, except more full-featured and powerful.  Ideally this would be combined with a way to select individual grid tracks, something like:

.gallery {display: grid;}
.gallery:col-track(4) {gap-rule: 2px solid red;}

…in order to just put a gap rule on that particular column.  I say that would be ideal because then I could push for a way to set the gap value for individual tracks, something like:

.gallery {gap: 1em 2em;}
.gallery:row-track(2) {gap: 2em 0.5em;}

…to change the leading and trailing gaps on just that row.

Custom media queries

This was listed as “Media query variables” in 2023.  With these, you could define a breakpoint set like so:

@custom-media --sm (inline-size <= 25rem);
@custom-media --md (25rem < inline-size <= 50rem);
@custom-media --lg (50rem < inline-size);

body {margin-inline: auto;}
@media (--sm) {body {inline-size: auto;}}
@media (--md) {body {inline-size: min(90vw, 40em);}
@media (--lg) {body {inline-size: min(90vw, 55em);}

In other words, you can use custom media queries as much as you want throughout your CSS, but change their definitions in just one place.  It’s CSS variables, but for media queries!  Let’s do it.

Unprefix all the things

Since we decided to abandon vendor prefixing in favor of feature flags, I want to see anything that’s still prefixed get unprefixed, in all browsers.  Keep the support for the prefixed versions, sure, I don’t care, just let us write the property and value names without the prefixes, please and thank you.

Grab bag

I still would like a way to indicate when a shorthand property is meant for logical rather than physical directions, a way to apply a style sheet to a single element, the ability to add or subtract values from a shorthand without having to rewrite the whole thing, and styles that cross resource boudnaries.  They’re all in the 2023 post.

Okay, that’s my list.  What’s yours?


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at January 08, 2025 12:45 PM

January 02, 2025

Orko Garai

What is an Input Method Editor?

I’ve been working on chromium input method editor integration for linux wayland at Igalia over the past several months, and I thought I’d share some insights I’ve gained along the way and some highlights from my work.

This is the first in a series of blog posts about input method editors, or IME in short. Here I will try to explain what an IME really is at a high level before diving deeper into some of the technical details of IME support in linux and chromium in upcoming posts.

January 02, 2025 06:29 PM

December 30, 2024

Igalia WebKit Team

WebKit Igalia Periodical #8

Update on what happened in WebKit in the week from December 23 to December 30.

Community & Events 🤝

Published an article on CSS Anchor Positioning. It discusses the current status of the support across browsers, Igalia's contributions to the WebKit's implementation, and the predictions for the future.

That’s all for this week!

by Unknown at December 30, 2024 04:39 PM

December 27, 2024

Pawel Lampe

Contributing to CSS Anchor Positioning in WebKit.

CSS Anchor Positioning is a novel CSS specification module that allows positioned elements to size and position themselves relative to one or more anchor elements anywhere on the web page. In simpler terms, it is a new web platform API that simplifies advanced relative-positioning scenarios such as tooltips, menus, popups, etc.

CSS Anchor Positioning in practice #

To better understand the true power it brings, let’s consider a non-trivial layout presented in Figure 1:

Non-trivial layout.

In the past, creating a context menu with position: fixed and positioned relative to the button required doing positioning-related calculations manually. The more complex the layout, the more complex the situation. For example, if the table in the above example was in a scrollable container, the position of the context menu would have to be updated manually on every scroll event.

With the CSS Anchor Positioning the solution to the above problem becomes trivial and requires 2 parts:

  • The <button> element must be marked as an anchor element by adding anchor-name: --some-name.
  • The context menu element must position itself using the anchor() function: left: anchor(--some-name right); top: anchor(--some-name bottom).

The above is enough for the web engine to understand that the context menu element’s left and top must be positioned to the anchor element’s right and bottom. With that, the web engine can carry out the job under the hood, so the result is as in Figure 2:

Non-trivial layout with anchor-positioned context menu.

As the above demonstrates, even with a few simple API pieces, it’s now possible to address very complex scenarios in a very elegant fashion from the web developer’s perspective. Moreover, CSS Anchor Positioning offers even more than that. There are numerous articles with great examples such as this MDN article, this css-tricks article, or this chrome blog post, but the long story short is that both positioning and sizing elements relative to anchors are now very simple.

Implementation status across web engines #

The first draft of the specification was published in early 2023, which in the web engines field is not so long time ago. Therefore - as one can imagine - not all the major web engines support it yet. The first (and so far the only) web engine to support CSS Anchor Positioning was Chromium (see the introduction blog post) - thus the information on caniuse.com. However, despite the information visible on the WPT results page, the other web engines are currently implementing it (see the meta bug for Gecko and bug list for WebKit). The lack of progress on the WPT results page is due to the feature not being enabled by default yet in those cases.

Implementation status in WebKit #

From the commits visible publicly, one can deduce that the work on CSS Anchor Positioning in WebKit has been started by Apple early 2024. The implementation was initiated by adding a core part - support for anchor-name, position-anchor, and anchor(). Those 2 properties and function are enough to start using the feature in real-world scenarios as well as more sophisticated WPT tests.

The work on the above had been finished by the end of Q3 2024, and then - in Q4 2024 - the work significantly intensified. A parsing/computing support has been added for numerous properties and functions and moreover, a lot of new functionalities and bug fixes landed afterwards. One could expect some more things to land by the end of the year even if there’s not much time left.

Overall, the implementation is in progress and is far from being done, but can already be tested in many real-world scenarios. This can be done using custom WebKit builds (across various OSes) or using Safari Technology Preview on Mac. The precondition for testing is, however, that the runtime preference called CSSAnchorPositioning is enabled.

My contributions #

Since the CSS Anchor Positioning in WebKit is still work in progress, and since the demand for the set of features this module brings is high, I’ve been privileged to contribute a little to the implementation myself. My work so far has been focused around the parts of API that allow creating menu-like elements becoming visible on demand.

The first challenge with the above was to fix various problems related to toggling visibility status such as:

The obvious first step towards addressing the above was to isolate elegant scenarios to reproduce the above. In the process, I’ve created some test cases, and added them to WPT. With tests in place, I’ve imported them into WebKit’s source tree and proceeded with actual bug fixing. The result was the fix for the above crash, and the fix for the layout being broken. With that in place, the visibility of menu-like elements can be changed without any problems now.

The second challenge was about the missing features allowing automatic alignment to the anchor. In a nutshell, to get the alignment like in the Figure 3:

Non-trivial layout with centered anchor-positioned context menu.

there are 2 possibilities:

At first, I wasn’t aware of the anchor-center and hence I’ve started initial work towards supporting position-area. Once I became aware, however, I’ve switched my focus to implementing anchor-center and left the above for Apple to continue - not to block them. Until now, both the initial and core parts of anchor-center implementation have landed. It means, the basic support is in place.

Despite anchor-center layout tests passing, I’ve already discovered some problems such as:

and I anticipate more problems may appear once the testing intensifies.

To address the above, I’ll be focusing on adding extra WPT coverage along with fixing the problems one by one. The key is to make sure that at the end of the day, all the unexpected problems are covered with WPT test cases. This way, other web engines will also benefit.

The future #

With WebKit’s implementation of CSS Anchor Positioning in its current shape, the work can be very much parallel. Assuming that Apple will keep working on that at the same pace as they did for the past few months, I wouldn’t be surprised if CSS Anchor Positioning would be pretty much done by the end of 2025. If the implementation in Gecko doesn’t stall, I think one can also expect a lot of activity around testing in the WPT. With that, the quality of implementation across the web engines should improve, and eventually (perhaps in 2026?) the CSS Anchor Positioning should reach the state of full interoperability.

December 27, 2024 12:00 AM

December 23, 2024

Igalia WebKit Team

WebKit Igalia Periodical #7

Update on what happened in WebKit in the week from December 16 to December 23.

Cross-Port 🐱

Improved logging in WebDriver service executables, using the same infrastructure as the browser (e.g. journald logging and different levels).

Added support for the first WebDriver-BiDi event in WebKit: monitoring console messages.

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

JavaScriptCore got a fix for a wasm test that was flaky on 32-bits. We also submitted a new PR to fix handling of Air (Air is an intermediate representation) Args with offsets that are not directly addressable in the O0 register allocator.

Graphics 🖼️

Since the switch to Skia we have been closely following upstream changes, and making small contributions when needed. After adding support for OpenType-SVG fonts the build with Clang was broken, and needed a fix to allow building Skia in C++ 23 mode (as we do in WebKit). The Skia update for this week resulting in a fix to avoid SK_NO_SANITIZE("cfi") when using GCC.

Releases 📦️

Stable releases of WebKitGTK 2.46.5 and WPE WebKit 2.46.5 are now available. While they include some minor fixes, these are focused on patches for security issues, and come accompanied with a new security advisory (WSA-2024-0008: GTK, WPE). As usual, it is recommended to stay up to date, and fresh packages have been already making their way to mainstream Linux distributions.

That’s all for this week!

by Unknown at December 23, 2024 04:03 PM

December 21, 2024

Orko Garai

About Me

Excited to get started on my blogging journey!

I’ve been planning to get this going for a while, and finally got around to it during the Christmas break :) .

Here’s a little bit about me…

I started working at Igalia a little over a year ago, after having decided I really wanted to work in open source software, and contributing to the chromium project was a natural choice, having worked on it previously. Igalia as a company doesn’t need any introductions in the open source community, and so I ended up here and have the privilege to be working with some amazing people, and getting to learn a lot.

December 21, 2024 06:19 AM

December 20, 2024

Lucas Fryzek

2024 Graphics Team Contributions at Igalia

2024 has been an exciting year for the Igalia’s Graphics Team. We’ve been making a lot of progress on Turnip, AMD display driver, the Raspberry Pi graphics stack, Vulkan video, and more.

Vulkan Device Generated Commands

Igalia’s Ricardo Garcia has been working hard on adding support for the new VK_EXT_device_generated_commands extension in the Vulkan Conformance Test Suite. He wrote an excellent blog post on the extension and on his work that you can read here. Ricardo also presented the extension at XDC 2024 in Montréal, which he also blogged about. Take a look and see what generating Vulkan commands directly on the GPU looks like!

Raspberry Pi Enhancements & Performance Improvements

Our very own Maíra Canal made a big contribution to improve the graphics performance of Raspberry Pi 4 & 5 devices by introducing support for “Super Pages”. She wrote an excellent and detailed blog post on what Super Pages are, how they improve performance, and comparing performance of different apps and games. You can read all the juicy details here.

She also worked on introducing CPU jobs to the Broadcom GPU kernel driver in Linux. These changes allow user space to implement jobs that get executed on the CPU in sync with the work on the GPU. She wrote a great blog post detailing what CPU jobs allow you to do and how they work that you can read here.

Christian Gmeiner on the Graphics team has also been working on adding Perfetto support to Broadcom GPUs. Perfetto is a performance tracing tool and support for it in Broadcom drivers will allow to developers to gain more insight into bottlenecks of their GPU applications. You can check out his changes to add support in the following MRs: - MR 31575 - MR 32277 - MR 31751

The Raspberry Pi team here at Igalia presented all of their work at XDC 2024 in Montréal. You can see a video below.

Linux Kernel 6.8

A number of Igalians made several contributions to the Linux 6.8 kernel release back in March of this year. Our colleague Maíra wrote a great blog post outlining these contributions that you can read here. To highlight some of these contributions:

  • AMD HDR & Color Management
    • Melissa Wen has been working on improving and implementing HDR support in AMD’s display driver as well as working on color management in the Linux display stack.
  • Async Flip
    • André Almeida implemented support for asynchronous page flip in the atomic DRM modesetting API.
  • V3D 7.1.x Kernel Driver
    • Iago Toral contributed a number of patches upstream to get the Broadcom DRM driver working with the latest Broadcom hardware used in the Raspberry Pi 5.
  • GPU stats for the Raspberry Pi 4/5
    • José María “Chema” Casanova worked on adding GPU stats support to the latest Raspberry Pi hardware.

Turnip Improvements

Dhruv Mark Collins has been very hard at work to try and bring performance parity between Qualcomm’s proprietary driver and the open source Turnip driver. Two of his big contributions to this were improving the 2D buffer to image copies on A7XX devices, and implementing unidirectional Low Resolution Z (LRZ) on A7XX devices. You can see the MR for these changes here and here.

A new member of the Igalia Graphics Team Karmjit Mahil has been working on different parts of the Turnip stack, but one notable improvement he made was to improve fmulz handling for Direct3D 9. You can check out his changes here and read more about them.

Danylo Piliaiev has been hard at work adding support for the latest generation of Adreno GPUs. This included getting support for the A750 working, and then implementing performance improvements to bring it up to parity with other Adreno GPUs in Turnip. All-together the turnip team implemented a number of Vulkan extensions and performance improvements such as:

  • VK_KHR_shader_atomic_int64 - Amber
  • VK_KHR_fragment_shading_rate - Danylo Piliaiev
  • VK_KHR_8bit_storage - Žan Dobersek
  • shaderInt8 feature - Žan Dobersek
  • VK_KHR_shader_subgroup_rotate - Job Noorman
  • VK_EXT_map_memory_placed - Dhruv Mark Collins
  • VK_EXT_legacy_dithering - Karmjit Mahil
  • VK_EXT_depth_clamp_zero_one - Danylo Piliaiev

Display Next Hackfest & Display/KMS Meet-up

Igalia hosted the 2024 version of the Display Next Hackfest. This community event is a way to get Linux display developers together to work on improving the Linux display stack. Our Melissa Wen wrote a blog post about the event and what it was like to organize it. You can read all about it here.

Display Next Hackfest
Display Next Hackfest

Just in-case you thought you couldn’t get enough Linux display stack, Melissa also helped organize a Display/KMS meet-up at XDC 2024. She wrote all about that meet-up and the progress the community made on her blog here.

AMD Display & AMDGPU

Melissa Wen has also been hard at work improving AMDGPU’s display driver. She made a number of changes including improving display debug log to include hardware color capabilities, Migrating EDID handling to EDID common code and various bug fixes such as:

Tvrtko Ursulin, a recent addition to our team, has been working on fixing issues in AMDGPU and some of the Linux kernel’s common code. For example, he worked on fixing bugs in the DRM scheduler around missing locks, optimizing the re-lock cycle on the submit path, and cleaned up the code. On AMDGPU he worked on improving memory usage reporting, fixing out of bounds writes, and micro-optimized ring emissions. For DMA fence he simplified fence merging and resolved a potential memory leak. Lastly, on workqueue he fixed false positive sanity check warnings that AMDGPU & DRM scheduler interactions were triggering. You can see the code for some of changes below: - https://lore.kernel.org/amd-gfx/[email protected]/ - https://lore.kernel.org/amd-gfx/[email protected]/ - https://lore.kernel.org/amd-gfx/[email protected]/ - https://lore.kernel.org/amd-gfx/[email protected]/ - https://lore.kernel.org/amd-gfx/[email protected]/

Vulkan & OpenGL Extensions

  • GL_EXT_texture_offset_non_const
    • Ricardo was busy working on extending OpenGL by adding this extension to GLSL as well as providing an implementation for it in glslang
  • VK_KHR_video_encode_av1 & VK_KHR_video_decode_av1
    • Igalia is listed as a contributor to these extensions and worked very hard to implement CTS support for the extensions.

Etnaviv Improvements

Christian Gmeiner, one of the maintainers of the Etnaviv driver for Vivante GPUs, has been hard at work this year to make a number of big improvements to Etnaviv. This includes using hwdb to detect GPU features, which he wrote about here. Another big improvement was migrating Etnaviv to use isaspec for the GPU isa description, allowing an assembler and disassembler to be generated from XML. This also allowed Etnaviv to reuse some common features in Mesa for assemblers/disassemblers and take advantage of the python code generation features others in the community have been working on. He wrote a detailed blog about it, that you can find here. On the same vein of Etnaviv infrastructure improvements, Christian has also been working on a new shader compiler, written in Rust, called “EBC”. Christian presented this new shader compiler at XDC 2024 this year. You can check out his presentation below.

On the side of new features, Christian landed a big one in Mesa 24.03 for Etnaviv: Multiple Render Target (MRT) support! This allows games and applications to render to multiple render targets (think framebuffers) in a single graphics operations. This feature is heavily used by deferred rendering techniques, and is a requirement for later versions of desktop OpenGL and OpenGL ES 3. Keep an eye on Christian’s blog to see any of his future announcements.

Lavapipe/LLVMpipe, Android & ChromeOS

I had a busy year working on improving Lavapipe/LLVMpipe platform integration. This started with adding support for DMABUF import/export, so that the display handles from Android Window system could be properly imported and mapped. Next came Android window system integration for DRI software rendering backend in EGL, and lastly but most importantly came updating the documentation in Mesa for building Android support. I wrote all about this effort here.

The latter half on the year had me working on improving lavapipe’s integration with ChromeOs, and having Lavapipe work as a host Vulkan driver for Venus. You can see some of the changes I made in virglrenderer here and crosvm here. This work is still ongoing.

What’s Next?

We’re not planning to stop our 2024 momentum, and we’re hopping for 2025 to be a great year for Igalia and the Linux graphics stack! I’m booked to present about Lavapipe at Vulkanised 2025, where Ricardo will also present about Device-Generated Commands. Maíra & Chema will be presenting together at FOSDEM 2025 about improving performance on Raspberry Pi GPUs, and Melissa will also present about kworkflow there. We’ll also be at XDC 2025, networking and presenting about all the work we are doing on the Linux graphics stack. Thanks for following our work this year, and here’s to making 2025 an even better year for Linux graphics!

December 20, 2024 05:00 AM

December 18, 2024

Víctor Jáquez

Igalia Multimedia achievements (2024)

As 2024 draws to a close, it’s a perfect time to reflect on the year’s accomplishments done by the Multimedia team in Igalia. In our consideration, there were three major achievements:

  1. WebRTC’s support in WPE/WebKitGTK with GStreamer.
  2. GES maturity improved for real-life use-cases.
  3. Vulkan Video development and support in GStreamer.

WebRTC support in WPE/WebKitGTK with GStreamer #

WPE and WebKitGTK are WebKit ports maintained by Igalia, the former for embedded devices and the latter for applications with a full-featured Web integration.

WebRTC is a web API that allows real-time communication (RTC) directly between web browser and applications. Examples of these real-time communications are video conferencing, cloud gaming, live-streaming, etc.

Some WebKit ports support libwebrtc, an open-source library that implements the WebRTC specification, developed and maintained by Google. WPE and WebKitGTK originally also supports libwebrtc, but we started to use also GstWebRTC, a set of GStreamer plugins and libraries that implement WebRTC, which adapts perfectly to the multimedia implementation in both ports, also in GStreamer.

This year the fruits of this work have been unlocked by enabling Amazon Luna gaming:

https://www.youtube.com/watch?v=lyO7Hqj1jMs

And also enabling a CAD modeling, server-side rendered service, known as Zoo:

https://www.youtube.com/watch?v=CiuYjSCDsUM

WebKit Multimedia #

WebKit made significant improvements in multimedia handling, addressing various issues to enhance stability and playback quality. Key updates include preventing premature play() calls during seeking, fixing memory leaks. The management of track identifiers was also streamlined by transitioning from string-based to integer-based IDs. Additionally, GStreamer-related race conditions were resolved to prevent hangs during playback state transitions. Memory leaks in WebAudio and event listener management were addressed, along with a focus on memory usage optimizations.

The handling of media buffering and seeking was enhanced with buffering hysteresis for smoother playback. Media Source Extensions (MSE) behavior was refined to improve playback accuracy, such as supporting markEndOfStream() before appendBuffer() and simplifying playback checks. Platform-specific issues were also tackled, including AV1 and Opus support for encrypted media and better detection of audio sinks. And other improvements on multimedia performance and efficiency.

GES maturity improved for real-live use-cases #

GStreamer Editing Services (GES) is a set of GStreamer plugins and a library that allow non-linear video editing. For example, GES is what’s behind of Pitivi, the open source video editor application.

Last year, GES was deployed in web-based video editors, where the actual video processing is done server-side. These projects allowed, in great deal, the enhancement and maturity of the library and plugins.

Tella is a browser-based tool that allow to screen record and webcam, without any extra software. Finished the recording, the user can edit, in the browser, the video and publish it.

https://www.youtube.com/watch?v=uSWqWHBRDWE

Sequence is a complete, browser-based, video editor with collaborative features. GES is used in the backend to render the editing operations.

https://www.youtube.com/watch?v=bXNdDIiG9lE

Vulkan Video development and GStreamer support #

The last but not the least, this year we continue our work with the Vulkan Video ecosystem by working the task subgroup (TSG) enabling H.264/H.265 encoding, and AV1 decoding and encoding.

Early this year we delivered a talk in the Vulkanised about our work, which ranges from the Conformance Test Suite (CTS), Mesa, and GStreamer.

https://www.youtube.com/watch?v=z1HcWrmdwzI

Conclusion #

As we wrap up 2024, it’s clear that the year has been one of significant progress, driven by innovation and collaboration. Here’s to continuing the momentum and making 2025 even better!

December 18, 2024 12:00 AM

December 16, 2024

Igalia WebKit Team

WebKit Igalia Periodical #6

Update on what happened in WebKit in the week from December 9 to December 16.

Cross-Port 🐱

Web Platform 🌐

Shipped the X25519 algorithm of the WebCrypto API for the Mac, GTK+ and WPE ports.

Fixed corner case invoker issue with popover inside invoker, matching the updated spec.

Form controls have long standing interoperability issues and <textarea> is no exception. This patch fixes space being reserved for scrollbars despite overlay scrollbars being enabled. This brings WebKit in line with Firefox's behaviour.

Implemented improvements to the popover API to allow imperative invokers relationships, this brings the JavaScript APIs inline with the declarative popovertarget attribute.

Implemented the CanvasRenderingContext2D letterSpacing/wordSpacing support.

Multimedia 🎥

GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.

Due to on-going work on improving memory usage in WebRTC use-cases, several patches landed in WebKit (1, 2,3) and GStreamer (4). Another related task is under review in libnice.

Several WebCodecs GStreamer backend fixes landed, mostly related with Opus and LPCM decoding support.

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

JavaScriptCore now has Wasm tail call support on ARMv7.

Graphics 🖼️

OpenType color fonts with SVG outlines stopped working with the transition from Cairo to Skia. This was unintentional, and support for this kind of fonts has been re-enabled for Skia.

Building the OpenType-SVG support required building Skia's SVG module, which uses Expat as its XML parser. Packagers will need to add it as a build dependency, or configure the compilation passing -DUSE_SKIA_OPENTYPE_SVG=OFF, which disables the feature.

That’s all for this week!

by Unknown at December 16, 2024 01:33 PM

December 10, 2024

Igalia WebKit Team

WebKit Igalia Periodical #5

Update on what happened in WebKit in the week from December 3 to December 10.

Web Platform 🌐

Improved interoperability somewhat by including font-weight in the CanvasRenderingContext2D.font serialization.

Graphics 🖼️

Support for cross-thread transfer of accelerated ImageBitmap objects landed upstream for the GTK and WPE ports. It improves performance of applications that use worker threads and pass accelerated ImageBitmap objects (with ownership) around.

That’s all for this week!

by Unknown at December 10, 2024 10:05 PM

Jesse Alama

Announcing a polyfill for the TC39 decimal proposal

An­nounc­ing a poly­fill for the TC39 dec­i­mal pro­pos­al

I’m hap­py to an­nounce that the dec­i­mal pro­pos­al—a pro­posed ex­ten­sion of JavaScript to sup­port dec­i­mal num­bers—is now avail­able as an NPM pack­age called pro­pos­al-dec­i­mal!

(Ac­tu­al­ly, it has been avail­able for some time, made avail­able not long af­ter we de­cid­ed to pur­sue IEEE 754 Dec­i­mal128 as a data mod­el for the dec­i­mal pro­pos­al rather than some al­ter­na­tives. The old pack­age was—and still is—avail­able un­der a dif­fer­ent name—dec­i­mal128—but I’ll be sun­set­ting that pack­age in fa­vor of the new one an­nounced here. If you’ve been us­ing dec­i­mal128, you can con­tin­ue to use it, but you’ll prob­a­bly want to switch to pro­pos­al-dec­i­mal.)

To use pro­pos­al-dec­i­mal in your project, in­stall the NPM pack­age. If you’re look­ing to use this code in Node.js or oth­er JS en­gines that sup­port ESM, you'll want to im­port the code like this:

im­port { Dec­i­mal128 } from 'pro­pos­al-dec­i­mal';
con­st x = new Dec­i­mal128("0.1");
// etc.

For use in a brows­er, the file dist/Dec­i­mal128.mjs con­tains the Dec­i­mal128 class and all its in­ter­nal de­pen­den­cies in a sin­gle file. Use it like this:

<script type="mod­ule">
im­port { Dec­i­mal128 } from 'path/to/Dec­i­mal128.mjs';
con­st x = new Dec­i­mal128("0.1");
// keep rock­ing dec­i­mals!
</script>

The in­ten­tion of this poly­fill is to track the spec text for the dec­i­mal pro­pos­al. I can­not rec­om­mend this pack­age for pro­duc­tion use just yet, but it is us­able and I’d love to hear any ex­pe­ri­ence re­ports you may have. We’re aim­ing to be as faith­ful as pos­si­ble to the spec, so we don’t aim to be blaz­ing­ly fast. That said, please do re­port any wild de­vi­a­tions in per­for­mance com­pared to oth­er dec­i­mal li­braries for JS as an is­sue. Any crash­es or in­cor­rect re­sults should like­wise be re­port­ed as an is­sue.

En­joy!

December 10, 2024 05:04 AM

December 09, 2024

Alex Bradbury

Directly committing files to a separate git branch

Suppose you have some files you want to directly commit to a branch in your current git repository, doing so without perturbing your current branch. Why would you want to do that? My current motivating use case is to commit all my draft muxup.com posts to a separate branch so I can get some tracking and backups without needing to add WIP work to the public repo. But I also use essentially the same approach to make a throw-away commit of the current repo state (including any non-staged or non-committed changes) to be pushed to a remote machine for building.

Our goal is to create a commit, so a sensible starting point is to break down what's involved. Referring to Git documentation we can break down the different object types that we need to put together a commit:

  • A commit object contains metadata as well as a reference a tree object.
  • A tree object is essentially a list of blob objects and other tree objects with extra metadata such as name and permissions.
  • A blob object is a chunk of data (e.g. a file).

Although it's possible to build a tree object semi-manually using git hash-object to create blobs and git mktree for trees, fortunately this isn't necessary. Using a throwaway git index file allows us to rely on git to create the tree object for us after indicating the files to be included. The basic approach is:

  • Identify the files you wish to commit.
  • Set the GIT_INDEX_FILE environment variable to a throwaway/temporary name.
  • Add the files to the temporary index (git update-index is the 'plumbing' way of doing this, but git add can work just fine as well).
  • Create a tree object from the index using git write-tree.
  • Create a commit object with an appropriate commit message and parent.
    • You likely want to bail out and avoid creating a commit that changes no files if it turns out the tree is unmodified vs the parent commit.
  • Use git update-ref to update the branch ref to point to the new commit.
  • Remove the temporary index file.

Implementation in Python

Here is how I implemented this in the site generator I use for muxup.com:

def commit_untracked() -> None:
    def exec(*args: Any, **kwargs: Any) -> tuple[str, int]:
        kwargs.setdefault("encoding", "utf-8")
        kwargs.setdefault("capture_output", True)
        kwargs.setdefault("check", True)

        result = subprocess.run(*args, **kwargs)
        return result.stdout.rstrip("\n"), result.returncode

    result, _ = exec(["git", "status", "-uall", "--porcelain", "-z"])
    untracked_files = []
    entries = result.split("\0")
    for entry in entries:
        if entry.startswith("??"):
            untracked_files.append(entry[3:])

    if len(untracked_files) == 0:
        print("No untracked files to commit.")
        return

    bak_branch = "refs/heads/bak"
    show_ref_result, returncode = exec(
        ["git", "show-ref", "--verify", bak_branch], check=False
    )
    if returncode != 0:
        print("Branch {back_branch} doesn't yet exist - it will be created")
        parent_commit = ""
        parent_commit_tree = None
        commit_message = "Initial commit of untracked files"
        extra_write_tree_args = []
    else:
        parent_commit = show_ref_result.split()[0]
        parent_commit_tree, _ = exec(["git", "rev-parse", f"{parent_commit}^{{tree}}"])
        commit_message = "Update untracked files"
        extra_write_tree_args = ["-p", parent_commit]

    # Use a temporary index in order to create a commit. Add any untracked
    # files to the index, create a tree object based on the index state, and
    # finally create a commit using that tree object.
    temp_index = pathlib.Path(".drafts.gitindex.tmp")
    atexit.register(lambda: temp_index.unlink(missing_ok=True))
    git_env = os.environ.copy()
    git_env["GIT_INDEX_FILE"] = str(temp_index)
    nul_terminated_untracked_files = "\0".join(file for file in untracked_files)
    exec(
        ["git", "update-index", "--add", "-z", "--stdin"],
        input=nul_terminated_untracked_files,
        env=git_env,
    )
    tree_sha, _ = exec(["git", "write-tree"], env=git_env)
    if tree_sha == parent_commit_tree:
        print("Untracked files are unchanged vs last commit - nothing to do.")
        return
    commit_sha, _ = exec(
        ["git", "commit-tree", tree_sha] + extra_write_tree_args,
        input=commit_message,
    )
    exec(["git", "update-ref", bak_branch, commit_sha])

    diff_stat, _ = exec(["git", "show", "--stat", "--format=", commit_sha])

    print(f"Backup branch '{bak_branch}' updated successfully.")
    print(f"Created commit {commit_sha} with the following modifications:")
    print(diff_stat)

For my particular use case, creating a commit containing only the untracked files is what I want. I'm happy to lose the ability to precisely recreate the repository state for the combination of tracked and untracked files in return for avoiding noise in the changes for the bak branch that would otherwise be present from changes to tracked files. Using paths separated by NUL via stdin is overkill here, but as it doesn't increase complexity of the code much, I've opted for the most universal approach in case I copy the logic to other projects.


Article changelog
  • 2024-12-09: Initial publication date.

December 09, 2024 11:00 AM

Vivienne Watermeier

How I set up my IDE for the WebKit container SDK

This is based on this previous blog post by Alicia, and I recommend taking a look - many things mentioned in it are still useful here.
Example of symbol documentation in Helix. The screenshot shows the entire window, with a outlined popup below the cursor, showing documentation. Reading a symbol’s documentation in a popup

The most straight-forward option would have been to just install another instance of my IDE inside the container. However, I use NixOS + Home Manager to manage and configure my packages declaratively, so the Ubuntu-based container environment would be a quite frustrating difference:

  1. Package versions will be lagging behind, and sooner or later I will have to deal with differences with configuration, features, or bugs. For example, at time of writing, neovim is packaged in Debian 24.10 at version 0.9.5, while nixpkgs ships 0.10.2. (To be fair, Flathub and Snapcraft would be up-to-date as well, but I have my gripes with those too.)

  2. Either way, I now have a new set of configurations to manage and keep in sync with their canonical versions on the host system.

  3. Any other tools I don’t install in the container, I won’t have access to - for example, for running commands from inside my IDE.

Overall, this will waste time and disk space better used for other things. So, after trying out a few different approaches, a clangd wrapper script that bridges the disconnect between my host system and the container was the first satisfying solution I found.

Conveniently, this fits well with my approach of writing wrappers around wkdev scripts to expose as much functionality as possible to my host system, to avoid manually entering the container - in effect abstracting it out of sight.

Step 1: Wrapper script #

This is roughly the script I currently use. I personally prefer nushell, but I will go into details below so you can write your own version in whatever language you prefer.

The idea is to start clangd inside the container, and use socat to expose its stdin/out to the IDE over TCP. That is to avoid this podman issue I ran into if I tried using stdin.

#!/usr/bin/env -S nu --stdin

def main [
  --name (-n): string = "wkdev-sdk"
  --show-config
] {
  # picking a random port for the connection avoids colliding with itself in case an earlier instance of this script is still around
  let port = random int 2000..5000 
  let workdir = $"/host(pwd)"

  # the container SDK mounts your home directory to `/host/home/...`,
  # so as long as the WebKit checkout is somewhere within your $HOME,
  # mapping paths is as easy as just prepending `/host`
  let mappings_table = ["Source" "WebKitBuild/GTK/Debug" "WebKitBuild/GTK/Release"]
    | each {|path| {host: $"($env.WEBKIT_DIR)/($path)" container: $"/host($env.WEBKIT_DIR)/($path)"}}

  let mappings = $mappings_table
    | each {|it| $"($it.host)=($it.container)" }
    | str join ","

  let podman_args = [
    exec
    --detach
    --user
    1000:100
    $name
  ]

  let clangd_args = [
    $"--path-mappings=($mappings)"
    --header-insertion=never  # clangd has the tendency to insert unnecessary includes, so I prefer to just disable the feature.
    --limit-results=5000      # The default limit for reference search results is too low for WebKit
    --background-index
    --enable-config           # Enable reading .clangd file
    -j 8
  ]

  # Show results of above configuration when called with --show-config, particularly helpful for debugging
  if $show_config {
    {
      port: $port
      work_dir: $workdir
      mappings: $mappings_table
      podman_args: $podman_args
      clangd_args: $clangd_args
    }
  } else {
    # ensure that the container is running
    podman start $name | ignore

    # container side
    ( podman ...$podman_args /usr/bin/env $'--chdir=($workdir)' socat 
      $"tcp-l:($port),fork"
      $"exec:'clangd ($clangd_args | str join (char space))'"
    ) | ignore

    # host side
    nc localhost $port
  }
}

Step 2: IDE setup #

IDE setup is largely the same as it would usually be, aside from pointing the clangd path at our wrapper script instead. I use helix, where I just need to add a .helix/languages.toml to the WebKit checkout directory:

[language-server.clangd]
command = "/path/to/clangd_wrapper"

In VS Code, you need the clangd extension, then you can enter the absolute path to the script under File > Preferences > Settings > Extensions/clangd > Clangd: Path, ideally in the Workspace tab so the setting only applies to WebKit.

Step 3: clangd setup #

Clangd will require two things to be set up at the root of your WebKit checkout:

First, create a compile_commands.json symlink for the build you will use, for example to WebKitBuild/GTK/Debug/compile_commands.json.

Secondly, a .clangd (which is what we needed the --enable-config flag for) at the root of the WebKit checkout:

If:
PathMatch: "(/host/home/vivienne/dev/work/metro/wk/up)?Source/.*\\.h"
PathExclude: "(/host/home/vivienne/dev/work/metro/wk/up)?Source/ThirdParty/.*"

CompileFlags:
Add: [-include, config.h]

I created both files manually, but as of [cmake] Auto-complete via clangd auto-setup, there seem to be new scripts to help with setting up and updating both files. (Thanks Alicia!) I haven’t tried it so far, but I recommend you take a look yourself.

Conclusion #

Example of using the symbol picker in Helix in a WebKit header file - there is a big popup across the entire window, showing results of a fuzzy search on the left, and a snippet of the selected item on the right Searching for a field using the symbol picker

Overall, I’m very satisfied with the results, so far everything is working like I expected it to. Finally having a working language server brought me the usual benefits - I mostly got rid of the manual compile-fix cycles that introduced so much friction and waiting times, and trivial mistakes and typos are much less of a headache now. But the biggest improvement, to me, is Goto definition/references and the symbol picker, making it easier to grasp how things interact. Much better than using grep over and over!

As I was fighting clangd/podman, I also came across some other options that I didn’t try, but might be interesting to look at:

  1. VSCode dev containers
    Probably the most polished option, though it is exclusive to VSCode - from what I understand, the extension isn’t even available to forks for licensing reasons.

  2. Distant
    Its main purpose is to act as a tool for working remotely, but I don’t see why it couldn’t be used with a container. It is still in alpha, and so far only has support in Neovim. I can’t tell how well it would play with LSP, but it might be worth a shot if you already use Neovim.

December 09, 2024 12:00 AM

December 06, 2024

Pablo Saavedra

Building a Rack for Embedded Boards Management, Part III: PoE Management

In previous posts, I detailed the construction of a rack for managing embedded boards: A critical aspect of this setup is the ability to remotely control the power cycling of the boards. To achieve this, I utilized the zyxel-poe-manager, a convenient Python script that manages the Power over Ethernet (PoE) ports on Zyxel GS1900 series […]

by Pablo Saavedra at December 06, 2024 06:24 PM

Stephanie Stimac

You can pay for that: How web browser features get built

As I have spent the last 6 months starting to explore the rather precarious position we've placed web ecosystem in, with regard to how browsers are maintained and funded, I thought I'd dive into another angle: The ways that web platform features get prioritized and built.

I worked on Microsoft Edge, so I have direct experience working on a browser team. My current work is at Igalia, which is an open source consultancy that is hired by companies to work on many things across technology. My team, the web platform team, implements web platform features and APIs, and works on their specifications. Yes. You can pay for browser features to be built and for specifications to be written/updated/continued. We'll talk about both.

Browser vendors #

Browser vendors are the companies that develop, maintain and distribute a web browser. Some browser vendors are also stewards of a whole engine/browser (blink, WebKit and gecko). Google, Apple, Mozilla, Microsoft, Opera, Vivaldi, etc are all browser vendors. Google, Apple and Mozilla are engine stewards.

There are many teams that make up an entire browser team. A browser is so much more than just the web platform too. There is quite a lot of thought and design that goes into even the smallest user experience updates.

General consumer facing features, which typically have a UI component, tend to often get prioritized over more "hidden" web platform features for developers. The general consumer base is larger than the developer base. The goal is more market share (more people using your browser) which helps bring money to the browser vendor.

The web platform team works on browser features and specifications and making sure the implementation matches the spec, so you don't get different behavior in different browsers. But they're also there to enable what we'll refer to as "first party" needs.

First Party Development #

First party refers to groups within the same company as the browser vendor. Microsoft Office/Microsoft 365 is an example of a first party within Microsoft with web platform needs. Subsequently, their needs for the web will get prioritized.

Surface Duo is another example. I spent a lot of time talking about the web platform primitives and design considerations for dual screen devices. Having layout capabilities that adapted to this new form factor was incredibly important so the specification and implementing those features were also prioritized.

In my experience, first party development is typically prioritized above all else as you're enabling/enhancing another product in the company. Especially if those products are money-makers. These are also broader company strategic initiatives and very visible ways to make impact.

Come yearly review time, these things matter for compensation and bonuses. It is all deeply intertwined in company politics. These are things that make the company money and make the business case for having a browser. Bill Gates' The Internet Tidal Wave memo from 1995 even points out how access to the internet through PCs is vital for the business. Enhancing user experience and moving the web forward will be what wins consumers over.

External Companies and Third Party Development #

Another scenario is when an external, or third party, company has needs for the web platform. My experience with this while working at a browser vendor company is more limited. Third party can also mean general web developers. It was much harder to get the needs of the general web development community prioritized when first party often takes priority.

I truthfully can't remember whether it was/is possible for third parties to ask Microsoft to work on enabling certain features. I mean, of course you can ask, but I'm not sure how often an agreement would be made. With a relatively small team working on enabling web platform features, this probably wasn't/isn't a common scenario unless there's some big underlying strategic initiative that would benefit the browser and company. This type of contract could entirely have been outside my sphere of work.

The trouble with being a third party is that it's not as easy to align priorities or business cases. In fact, you might even be a competitor. Regardless, since resources are finite, it's likely that it's difficult to convince vendors to pay attention to your specific needs. At the end of the day, practically speaking, funding is required to advance features, fix bugs, etc.

Consultancies #

Guess what? Yes! You can hire experts to implement web browser features and/or give you that attention and priority! Do you need a new feature implemented or spec'd? A consultancy can help. Do you need bugs that are affecting your organization fixed? A consultancy can help.

If you have a need for a web platform feature, there are consultancies available for hire to help write and edit the specifications, work with standards groups, write web platform tests and get that feature shipped (or ready to be shipped).

I work for Igalia, and you can hire us for many things across many technologies and areas including web platform development.

In fact, we've been pivotal in moving forward a whole lot of things in the Web Platform, including features like CSS Grid, :has, container queries, MathML, classes in JavaScript, scroll snap, list-stlye-type: <string>...the list does go on and on. We work on lots of specifications and implementations for the web platform.

Instead of waiting or relying on a browser vendor to implement the features you need, which could potentially be years or even possibly never, you can hire experts like Igalia to do this work.

Why should I hire a consultancy to build web platform features? #

The most obvious answer is: It works. We've helped a lot of happy customers do amazing things.

Aside from needing features more quickly, hiring a consultancy like Igalia has advantages. We are experts in these processes and the dynamics of working in standards bodies, and our strength comes from not only our technical expertise, but our ability to navigate between the three main browser vendors with web engines to ensure feature design is agreed upon. This is a lot of work and often times it can be slow because there are only a handful of people at browser vendor companies who are responsible for reviewing patches, proposed features, design documents, etc etc.

Let's say you are a customer with a web platform need. You most likely have a backlog of work for your engineering team. There could be a few different scenarios that prevent you from internally prioritizing the web platform need: No one on the engineering team has the technical background for the type of work you need done, someone might have the technical background but not enough time to manage the entire process of spec writing, test writing and implementation, or maybe the team just doesn't have the capacity based on broader company priorities and product roadmaps.

When you hire a consultancy to do this work, then your product engineering team can spend time on the product work and roadmap while we work on the spec, implementation and coordinating among browser vendors. This stuff takes a lot of time, because it's the nature of the work, and it's our area of expertise.

Funding specific features #

There have also been instances where specific features have been funded by the community or donors, primarily driven by a want for better support and not by a business need, even though there most likely are business needs for such feature improvements out there somewhere.

The MathML work Igalia has been doing is an example of that. Igalia also ran an open prioritization experiment where the community collectively selected and funded a feature.

Sometimes there are really vital features the web needs, but for whatever reason, they're not a priority. With that being said, if anyone's interested in helping to advance and improve SVG, drop Igalia an email. We'd love to work on it.

Closing #

I have encountered many people since starting at Igalia earlier this year, who didn't know you could hire someone to build a browser feature, or work on a specification, or fix browser bugs. You can even hire us to work on improving a novel web browser engine (say hello to Servo), because you might need a web browser solution that is more lightweight than the major open source options.

Or maybe you need a browser for your Extended Reality/Virtual Reality device. With 50% of Meta Quest users spending time in the browser, it would be a missed opportunity to not offer the same on your device. This is where we come in with Wolvic. It's designed with browsing in XR in mind, and you don't have to build a browser from the ground up.

There are so many benefits to hiring someone to work on The Web in whatever way you may need. It also means the web platform can advance more quickly (in browser timescales anyway) because more people outside of browser vendors are working on things.

And that's good for the overall health of the web ecosystem.

Note: Thank you to my colleague Brian Kardell for reviewing & editing this post, which had been taking up a lot of space in my mind for a long while.

December 06, 2024 12:00 AM

December 05, 2024

Stephen Chenney

Canvas Text Editing

Editing text in HTML canvas has never been easy. It requires identifying which character is under the hit point in order to place a caret, and it requires computing bounds for a range of text that is selected. The existing implementations of Canvas TextMetrics made these things possible, but not without a lot of Javascript making multiple expensive calls to compute metrics for substrings. Three new additions to the TextMetrics API are intended to support editing use cases in Canvas text. They are in the standards pipeline, and implemented in Chromium-based browsers behind the ExtendedTextMetrics flag:

  • getIndexFromOffset gives the location in a string corresponding to a pixel length along the string. Use it to identify where the caret is in the string, and what the bounds of a selection range are.
  • getSelectionRects returns the rectangles that a browser would use to highlight a range of text. Use it to draw the selection highlight.
  • getActualBoundingBox returns the bounding box for a sub-range of text within a string. Use it if you need to know whether a point lies within a substring, rather than the entire string.

To enable the flag, use --enable-blink-features=ExtendedTextMetrics when launching Chrome from a script or command line, or enable “Experimental Web Platform features” via chrome://flags/#enable-experimental-web-platform-features.

I wrote a basic web app (opens in a new tab) in order to demonstrate the use of these features. It will function in Chrome versions beyond 128.0.6587.0 (Canary at the time of writing) with the above flags set. Some functionality is available in Safari Preview, and it’s growing all the time.

The app allows the editing of a single line of text drawn in an HTML canvas. Here I’ll work through usage of the new features.

In the demo, the first instance of “new Canvas Text Metrics” is considered a link back to this blog page. Canvas Text has no notion of links, and thousands of people have looked at Stack Exchange for a way to insert hyperlinks in canvas text. Part of the problem, assuming you know where the link is in the text, is determining when the link was clicked on. The TextMetrics getActualBoundingBox(start, end) method is intended to simplify the problem by returning the bounding box of a substring of the text, in this case the link.

  onStringChanged() {
text_metrics = context.measureText(string);
link_start_position = string.indexOf(link_text);
if (link_start_position != -1) {
link_end_position = link_start_position + link_text.length;
}
}
...
linkHit(x, y) {
let bound_rect = undefined;
try {
bound_rect = text_metrics.getActualBoundingBox(link_start_position, link_end_position);
} catch (error) {
return false;
}
let relative_x = x - string_x;
let relative_y = y - string_y;
return relative_x >= bound_rect.left && relative_y >= bound_rect.top
&& relative_x < bound_rect.right && relative_y < bound_rect.bottom;
}

The first function finds the link in the string and stores the start and end string offsets. When a click event happens, the second method is called to determine if the hit point was within the link area. The text metrics object is queried for the bounding box of the link’s substring. Note the call is contained within a try...catch block because an exception will be returned if the substring is invalid. The event offset is mapped into the coordinate system of the text (in this case by subtracting the text location) and the resulting point is tested against the rectangle.

In more general situations you may need to use a regular expression to find links, and keep track of a more complex transformation chain to convert event locations into the text string’s coordinate system.

Mapping a Point to a String Index #

A primary concept of any editing application is the caret location because it indicates where typed text will appear, or what will be deleted by backspace, or where an insertion will happen. Mapping a hit point in the canvas into the caret position in the text string is a fundamental editing operation. It is possible to do this with existing methods but it is expensive (you can do a binary search using the width of substrings).

The TextMetrics getIndexFromOffset(offset) method uses existing code in browsers to efficiently map a point to a string position. The underlying functionality is very similar to the document.caretPositionFromPoint(x,y) method, but modified for the canvas situation. The demo code uses it to position the caret and to identify the selection range.

  text_offset = event.offsetX - string_x;
caret_position = text_metrics.getIndexFromOffset(text_offset);

The getIndexFromOffset function takes the horizontal offset, in pixels, measured from the origin of the text (based on the textAlign property of the canvas context). The function finds the character boundary closest to the given offset, then returns the character index to the right for left-to-right text, and to the left for right-to-left text. The offset can be negative to allow characters to the left of the origin to be mapped.

In the figure below, the top string has textDirection = "ltr" and textAlign = "center". The origin for measuring offsets is the center of the string. Green shows the offsets given, while blue shows the indexes returned. The bottom string demonstrates textDirection = "rtl" and textAlign = "start".

An offset past the beginning of the text always returns 0, and past the end returns the string length. Note that the offset is always measured left-to-right, even if the text direction is right-to-left. The “beginning” and “end” of the text string do respect the text direction, so for RTL text the beginning is on the right.

The getIndexFromOffset function may produce very counter-intuitive results when the text string has mixed bidi content, such as a latin substring within an arabic string. As the offset moves along the string the positions will not steadily increase, or decrease, but may jump around at the boundaries of a directional run. Full handling of bidi content requires incorporating bidi level information, particularly for selecting text, and is beyond the scope of this article.

Selection Rectangles #

Selected text is normally indicated by drawing a highlight over the range, but to produce such an effect in canvas requires estimating the rectangle using existing text metrics, and again making multiple queries to text metrics to obtain the left and right extents. The new TextMetrics getSelectionRects(start, end) function returns a list of browser defined selection rectangles for the given subrange of the string. There may be multiple rectangles because the browser returns one for each bidi run; you would need to draw them all to highlight the complete range. The demo assumes a single rectangle because it assumes no mixed-direction strings.

selection_rect = text_metrics.getSelectionRects(selection_range[0], selection_range[1])[0];
...
context.fillStyle = 'yellow';
context.fillRect(selection_rect.x + string_x,
selection_rect.y + string_y,
selection_rect.width,
selection_rect.height)

Like all the new methods, the rectangle returned is in the coordinate system of the string, as defined by the transform, textAlign and textBaseline.

Conclusion #

The new Canvas Text Metrics described here are in the process of standardization. Follow WHATWG Issue #10677 and add your feedback.

Thanks #

The implementation of Canvas Text Features was aided by Igalia S.L. funded by Bloomberg L.P.

December 05, 2024 12:00 AM

December 04, 2024

Tiago Vignatti

Using Chromium for Building Apps (2025)

Instead of using Chromium for browsing the Web, let’s explore how to use it for building applications.

Chromium is open-source and its codebase is organized in components which can be used for many different purposes. For example, Chromium is used for building browsers other than Chrome like Edge, Brave, Vivaldi, among others. You may also be familiarized with V8, the Chromium JavaScript engine that may be used to power scripting on server-side, like Node.js and Deno.

by Author at December 04, 2024 01:36 PM

Jesse Alama

Binary floats can let us down! When close enough isn't enough

How bi­na­ry float­ing points can let us down

If you've played Mo­nop­oly, you'll know abuot the Bank Er­ror in Your Fa­vor card in the Com­mu­ni­ty Chest. Re­mem­ber this?

Card from the game Monopoly: Bank error in your favor!

A bank er­ror in your fa­vor? Sweet! But what if the bank makes an er­ror in its fa­vor? Sure­ly that's just as pos­si­ble, right?

I'm here to tell you that if you're do­ing every­day fi­nan­cial cal­cu­la­tions—noth­ing fan­cy, but in­volv­ing mon­ey that you care about—then you might need to know that us­ing bi­na­ry float­ing point num­bers, then some­thing might be go­ing wrong. Let's see how bi­na­ry float­ing-point num­bers might yield bank er­rors in your fa­vor—or the bank's.

In a won­der­ful pa­per on dec­i­mal float­ing-point num­bers, Mike Col­ishaw gives an ex­am­ple.

Here's how you can re­pro­duce that in JavaScript:

(1.05 * 0.7).to­Pre­ci­sion(2);
# 0.73

Some pro­gram­mers might not be aware of this, but many are. By point­ing this out I'm not try­ing to be a smar­ty­pants who knows some­thing you don't. For me, this ex­am­ple il­lus­trates just how com­mon this sort of er­ror might be.

For pro­gram­mers who are aware of the is­sue, one typ­i­cal ap­proache to deal­ing with it is this: Nev­er work with sub-units of a cur­ren­cy. (Some cur­ren­cies don't have this is­sue. If that's you and your prob­lem do­main, you can kick back and be glad that you don't need to en­gage in the fol­low­ing sorts of headaches.) For in­stance, when work­ing with US dol­lars of eu­ros, this ap­proach man­dates that one nev­er works with eu­ros and cents, but only with cents. In this set­ting, dol­lars ex­ist only as an ab­strac­tion on top of cents. As far as pos­si­ble, cal­cu­la­tions nev­er use floats. But if a float­ing-point num­ber threat­ens to come up, some form of round­ing is used.

An­oth­er aproach for a pro­gram­mer is to del­e­gate fi­nan­cial cal­cu­la­tions to an ex­ter­nal sys­tem, such as a re­la­tion­al data­base, that na­tive­ly sup­ports prop­er dec­i­mal cal­cu­la­tions. One dif­fi­cul­ty is that even if one del­e­gates these cal­cu­la­tions to an ex­ter­nal sys­tem, if one lets a float­ing-point val­ue flow int your pro­gram, even a val­ue that can be trust­ed, it may be­come taint­ed just by be­ing im­port­ed into a lan­guage that doesn't prop­er­ly sup­port dec­i­mals. If, for in­stance, the re­sult of a cal­cu­la­tion done in, say, Post­gres, is ex­act­ly 0.1, and that flows into your JavaScript pro­gram as a num­ber, it's pos­si­ble that you'll be deal­ing with a con­t­a­m­i­nat­ed val­ue. For in­stance:

(0.1).to­Pre­ci­sion(25)
# 0.1000000000000000055511151

This ex­am­ple, ad­mit­ted­ly, re­quires quite a lot of dec­i­mals (19!) be­fore the ugly re­al­i­ty of the sit­u­a­tion rears its head. The re­al­i­ty is that 0.1 does not, and can­not, have an ex­act rep­re­sen­ta­tion in bi­na­ry. The ear­li­er ex­am­ple with the cost of a phone call is there to raise your aware­ness of the pos­si­bil­i­ty that one doesn't need to go 19 dec­i­mal places be­fore one starts to see some weird­ness show­ing up.

There are all sorts of ex­am­ples of this. It's ex­ceed­ing­ly rare for a dec­i­mal num­ber to have an ex­act rep­re­sen­ta­tion in bi­na­ry. Of the num­bers 0.1, 0.2, …, 0.9, only 0.5 can be ex­act­ly rep­re­sent­ed in bi­na­ry.

Next time you look at a bank state­ment, or a bill where some tax is cal­cu­lat­ed, I in­vite you to ask how that was cal­cu­lat­ed. Are they us­ing dec­i­mals, or floats? Is it cor­rect?

I'm work­ing on the dec­i­mal pro­pos­al for TC39 to try to work what it might be like to add prop­er dec­i­mal num­bers to JavaScript. There are a few very in­ter­est­ing de­grees of free­dom in the de­sign space (such as the pre­cise datatype to be used to rep­re­sent these kinds of num­ber), but I'm op­ti­mistic that a rea­son­able path for­ward ex­ists, that con­sen­sus be­tween JS pro­gram­mers and JS en­gine im­ple­men­tors can be found, and even­tu­al­ly im­ple­ment­ed. If you're in­ter­est­ed in these is­sues, check out the README in the pro­pos­al and get in touch!

December 04, 2024 05:27 AM

Getting started with Lean 4, your next programming language

Get­ting start­ed with Lean 4, your next pro­gram­ming lan­guage

I had the plea­sure of learn­ing about Lean 4 with David Chris­tiansen and Joachim Bre­it­ner at their tu­to­r­i­al at BOBKonf 2024. I‘m plan­ning on do­ing a cou­ple of for­mal­iza­tions with Lean and would love to share what I learn as a to­tal new­bie, work­ing on ma­cOS.

Need­ed tools

I‘m on ma­cOS and use Home­brew ex­ten­sive­ly. My sim­ple go-to ap­proach to find­ing new soft­ware is to do brew search lean. This re­vealed lean as well as sur­face elan. Run­ning brew info lean showed me that that pack­age (at the time I write this) in­stalls Lean 3. But I know, out-of-band, that Lean 4 is what I want to work with. Run­ning brew info elan looked bet­ter, but the out­put re­minds me that (1) the in­for­ma­tion is for the elan-init pack­age, not the elan cask, and (2) elan-init con­flicts with both the elan and the afore­men­tioned lean. Yikes! This strikes me as a po­ten­tial prob­lem for the com­mu­ni­ty, be­cause I think Lean 3, though it still works, is pre­sum­ably not where new Lean de­vel­op­ment should be tak­ing place. Per­haps the Home­brew for­mu­la for Lean should be up­dat­ed called lean3, and a new lean4 pack­age should be made avail­able. I‘m not sure. The sit­u­a­tion seems less than ide­al, but in short, I have been suc­cess­ful with the elan-init pack­age.

Af­ter in­stalling elan-init, you‘ll have the elan tool avail­able in your shell. elan is the tool used for main­tain­ing dif­fer­ent ver­sions of Lean, sim­i­lar to nvm in the Node.js world or pyenv.

Set­ting up a blank pack­age

When I did the Lean 4 tu­to­r­i­al at BOB, I worked en­tire­ly with­in VS Code and cre­at­ed a new stand­alone pack­age us­ing some in-ed­i­tor func­tion­al­i­ty. At the com­mand line, I use lake init to man­u­al­ly cre­ate a new Lean pack­age. At first, I made the mis­take of run­ning this com­mand, as­sum­ing it would cre­ate a new di­rec­to­ry for me and set up any con­fig­u­ra­tion and boil­er­plate code there. I was sur­prised to find, in­stead, that lake init sets things up in the cur­rent di­rec­to­ry, in ad­di­tion to cre­at­ing a sub­di­rec­to­ry and pop­u­lat­ing it. Us­ing lake --help, I read about the lake new com­mand, which does what I had in mind. So I might sug­gest us­ing lake new rather than lake init.

What‘s in the new di­rec­to­ry? Do­ing tree foo­bar re­veals

foo­bar
├── Foo­bar
│   └── Ba­sic.lean
├── Foo­bar.lean
├── Main.lean
├── lake­file.lean
└── lean-tool­chain

Tak­ing a look there, I see four .lean files. Here‘s what they con­tain:

Main.lean
im­port «Foo­bar»

def main : IO Unit :=
  IO.print­ln s!"Hel­lo, {hel­lo}!"
Foo­bar.lean
-- This mod­ule serves as the root of the `Foo­bar` li­brary.
-- Im­port mod­ules here that should be built as part of the li­brary.
im­port «Foo­bar».Ba­sic
Foo­bar/Ba­sic.lean
def hel­lo := "world"
lake­file.lean
im­port Lake
open Lake DSL

pack­age «foo­bar» where
  -- add pack­age con­fig­u­ra­tion op­tions here

lean_lib «Foo­bar» where
  -- add li­brary con­fig­u­ra­tion op­tions here

@[de­fault_tar­get]
lean_exe «foo­bar» where
  root := `Main

It looks like there‘s a lit­tle mod­ule struc­ture here, and a ref­er­ence to the iden­ti­fi­er hel­lo, de­fined in Foo­bar/Ba­sic.lean and made avail­able via Foo­bar.lean. I’m not go­ing to touch lake­file.lean for now; as a new­bie, it looks scary enough that I think I’ll just stick to things like Ba­sic.lean.

There‘s also an au­to­mat­i­cal­ly cre­at­ed .git there, not shown in the di­rec­to­ry out­put above.

Now what?

Now that you‘ve got Lean 4 in­stalled and set up a pack­age, you‘re ready to dive in to one of the of­fi­cial tu­to­ri­als. The one I‘m work­ing through is David‘s Func­tion­al Pro­gram­ming in Lean. There‘s all sorts of ad­di­tion­al things to learn, such as all the dif­fer­ent lake com­mands. En­joy!

December 04, 2024 05:19 AM

December 03, 2024

Alex Bradbury

Rootless cross-architecture debootstrap

As usual, let's start by introducing the problem. Suppose you want to produce either a Debian-derived sysroot for cross-compilation, something you can chroot into, or even a full image you can boot with QEMU or on real hardware. Debootstrap can get you started and has minimal external dependencies. If you wish to avoid using sudo, Running debootstrap under fakeroot and fakechroot works if building a rootfs for the same architecture as the current host, but it has problems out of the box for a foreign architecture. These tools are packaged and in the main repositories for at least Debian, Arch, and Fedora, so a solution that works without additional dependencies is advantageous.

I'm presenting my preferred solution / approach in the first subheading and relegating more discussion and background explanation to later on in the article, in order to cater for those who just want something they can try out without wading through lots of text.

Warning: I haven't found fakeroot to be as robust as I would like, even knowing its fundamental limitations with e.g. statically linked binaries. Specifically, a sporadically reproducible case involving installing lots of packages on riscv64 sid resulted in /usr/lib/riscv64-linux-gnu/libcbor.so.0.10.2 being given the directory bit in fakeroot's database (which I haven't yet managed to track down to the point I can file a useful bug report). I'm sharing this post because the approach may still be useful to people, especially if you rely on fakeroot for only the minimum needed to get a bootable image in qemu-system.

Not explored in this article: using newuidmap/newgidmap with appropriate /etc/subuid (see here), though note one-off setup is needed to allow your user to set sufficient UIDs.

My preferred solution

Assuming you have debootstrap and fakeroot installed (sudo pacman -S debootstrap fakeroot will suffice on Arch), and to support transparent emulation of binaries for other architectures you also have user-mode QEMU installed and set to execute via binfmt_misc (sudo pacman -S qemu-user-static qemu-user-static-binfmt on Arch) we proceed to:

  • Do the first stage debootstrap under fakeroot on the host, saving the state (the uid/gid and permissions set after operations like chown/chmod) to a file, and including fakeroot in the list of packages to install for the target.
  • Extract the contents of the fakeroot and libfakeroot .debs into the directory tree created by debootstrap directory (as we need to be able to use it as a pre-requisite of initiating the second-stage debootstrap which extracts and installs all the packages).
  • Make use of user namespaces to chroot into the debootstrapped sysroot without needing root permissions. Then give the illusion of permissions to set arbitrary uid/gid and other permissions on files via fakeroot (loading the environment saved earlier).
  • Run the second stage debootstrap.

Translated into shell commands (and later a script), you can do this by:

SYSROOT_DIR=sysroot-deb-riscv64-sid
TMP_FAKEROOT_ENV=$(mktemp)
fakeroot -s "$TMP_FAKEROOT_ENV" debootstrap \
  --variant=minbase \
  --include=fakeroot,symlinks \
  --arch=riscv64 --foreign \
  sid \
  "$SYSROOT_DIR"
mv "$TMP_FAKEROOT_ENV" "$SYSROOT_DIR/.fakeroot.env"
fakeroot -i "$SYSROOT_DIR/.fakeroot.env" -s "$SYSROOT_DIR/.fakeroot.env" sh <<EOF
ar p "$SYSROOT_DIR"/var/cache/apt/archives/libfakeroot_*.deb 'data.tar.xz' | tar xv -J -C "$SYSROOT_DIR"
ar p "$SYSROOT_DIR"/var/cache/apt/archives/fakeroot_*.deb 'data.tar.xz' | tar xv -J -C "$SYSROOT_DIR"
ln -s fakeroot-sysv "$SYSROOT_DIR/usr/bin/fakeroot"
EOF
cat <<'EOF' > "$SYSROOT_DIR/_enter"
#!/bin/sh
export PATH=/usr/sbin:$PATH
FAKEROOTDONTTRYCHOWN=1 unshare -fpr --mount-proc -R "$(dirname -- "$0")" \
  fakeroot -i .fakeroot.env -s .fakeroot.env "$@"
EOF
chmod +x "$SYSROOT_DIR/_enter"
"$SYSROOT_DIR/_enter" debootstrap/debootstrap --second-stage

You'll note this creates a helper _enter within the root of the rootfs for chrooting into it and executing fakeroot with appropriate arguments.

If you want to use this rootfs as a sysroot for cross-compiling, you'll need to convert any absolute symlinks to relative symlinks so that they resolve properly when being accessed outside of a chroot. We use the symlinks utility installed within the target filesystem for this:

"$SYSROOT_DIR/_enter" symlinks -cr .

I've written a slightly more robust and configurable encapsulation of the above logic in the form of rootless-debootstrap-wrapper which I would recommend using/adapting in preference to the above. Further code examples in the rest of this post use the rootless-debootstrap-wrapper script for convenience.

Further discussion

Depending on how you look at it, fakeroot is either a horrendous hack or a clever use of LD_PRELOAD developed at a time where there weren't lots of options for syscall interposition. As there's been so much development in that area I'd hope there are other alternatives by now, but I didn't see something that's quite so easy to use, well tested for this use case, widely packaged, and up to date.

I've avoided using fakechroot both because I couldn't get it to work reliably in the cross-architecture bootstrap scenario, and also because thinking through how it logically should work in that scenario is fairly complex. Given we're able to use user namespaces to chroot, let's save ourselves the hassle and do that. Except there was a slight hiccup in that chown was failing (running under fakeroot) when chrooted in this way. Thankfully the folks in the buildroot project had run into the same issue and their patch alerted me to the undocumented FAKEROOTDONTTRYCHOWN environment variable. As written up in that commit message, the issue is that under a user namespace with limited uid/gid mappings (in my case, just one), chown returns EINVAL which isn't masked by fakeroot unless this environment variable is set.

There has of course been previous work on rootless debootstrap, notably Johannes Schauer's blog post that takes a slightly different route (by my understanding, including communication between LD_PRELOADed fakeroot on the target and a faked running on the host). A variant of this approach is used in mmdebstrap from the same author.

Limitations

  • See the warning near the top of this article about correctness issues I've encountered in some cases.
  • As the file permissions info stored in fakeroot.env is keyed by the inode, you may lose important permissions information if you copy the rootfs. You should instead tar it under fakeroot, and if extracting in an unprivileged environment again then untar it under fakeroot, creating a new fakeroot.env.
  • The use of unshare requires that unprivileged user namespace support is enabled. I believe this is the case in all common distributions by now, but please check your distro's guidance if not.

Litmus test across many architectures

Just to demonstrate how this working, here is how you can debootstrap all architectures supported by Debian + QEMU (except for mips, where I had issues with qemu) then run a trivial test - compiling and running a hello world:

#!/bin/sh

error() {
  printf "!!!!!!!!!! Error: %s !!!!!!!!!!\n" "$*" >&2
  exit 1
}

# TODO: mips skipped due to QEMU issues.
ARCHES="amd64 arm64 armel armhf i386 ppc64el riscv64 s390x"
mkdir -p "$HOME/debcache"

for arch in $ARCHES; do
  rootless-debootstrap-wrapper \
    --arch=$arch \
    --suite=sid \
    --cache-dir="$HOME/debcache" \
    --target-dir=debootstrap-all-test-$arch \
    --include=build-essential || error "Debootstrap failed for arch $arch"
done

for arch in $ARCHES; do
  rootfs_dir="./debootstrap-all-test-$arch"
  cat <<EOF > "$rootfs_dir/hello.c"
#include <stdio.h>
#include <sys/utsname.h>

int main() {
  struct utsname buffer;
  if (uname(&buffer) != 0) {
      perror("uname");
      return 1;
    }
  printf("Hello from %s\n", buffer.machine);
  return 0;
}
EOF
  ./debootstrap-all-test-$arch/_enter sh -c "gcc hello.c && ./a.out"
done

Executing the above script eventually gives you:

Hello from x86_64
Hello from aarch64
Hello from armv7l
Hello from armv7l
Hello from x86_64
Hello from ppc64le
Hello from riscv64
Hello from s390x

(The repeated "armv7l" is because armel and armhf differ in ABI rather than the architecture as returned by uname).

Making a RISC-V image bootable in QEMU

Here is how to use the tool to build a bootable RISC-V image. First build the rootfs:

TGT=riscv-sid-for-qemu
rootless-debootstrap-wrapper \
  --arch=riscv64 \
  --suite=sid \
  --cache-dir="$HOME/debcache" \
  --target-dir=$TGT \
  --include=linux-image-riscv64,zstd,default-dbus-system-bus || error "Debootstrap failed"
cat - <<EOF > $TGT/etc/resolv.conf
nameserver 1.1.1.1
EOF
"$TGT/_enter" sh -e <<EOF
ln -s /dev/null /etc/udev/rules.d/80-net-setup-link.rules # disable persistent network names
cat - <<INNER_EOF > /etc/systemd/network/10-eth0.network
[Match]
Name=eth0

[Network]
DHCP=yes
INNER_EOF
systemctl enable systemd-networkd
echo root:root | chpasswd
ln -sf /dev/null /etc/systemd/system/[email protected]
EOF

Then produce an ext4 partition and extract the kernel and initrd:

fakeroot -i riscv-sid-for-qemu/.fakeroot.env sh <<EOF
ln -L riscv-sid-for-qemu/vmlinuz kernel
ln -L riscv-sid-for-qemu/initrd.img initrd
fallocate -l 30GiB rootfs.img
mkfs.ext4 -d riscv-sid-for-qemu rootfs.img
EOF

And boot it in qemu:

qemu-system-riscv64 \
  -machine virt \
  -cpu rv64 \
  -smp 4 \
  -m 8G \
  -device virtio-blk-device,drive=hd \
  -drive file=rootfs.img,if=none,id=hd,format=raw \
  -device virtio-net-device,netdev=net \
  -netdev user,id=net,hostfwd=tcp:127.0.0.1:10222-:22 \
  -bios /usr/share/qemu/opensbi-riscv64-generic-fw_dynamic.bin \
  -kernel kernel \
  -initrd initrd \
  -object rng-random,filename=/dev/urandom,id=rng \
  -device virtio-rng-device,rng=rng \
  -nographic \
  -append "rw noquiet root=/dev/vda console=ttyS0"

You can then log in with user root and password root. We haven't installed sshd so far, but the above command line sets up forwarding from port 10222 on the local interface to port 22 on the guest in anticipation of that.


Article changelog
  • 2024-12-03: Initial publication date.

December 03, 2024 12:00 PM

Max Ihlenfeldt

Using Swarming to run Chromium test suites

Learn how to run big Chromium test suites without sacrificing valuable local CPU cycles using this one simple trick!

December 03, 2024 12:00 AM

December 02, 2024

Igalia WebKit Team

WebKit Igalia Periodical #4

Update on what happened in WebKit in the week from November 23 to December 2.

Cross-Port 🐱

The documentation on GTK/WPE port profiling with Sysprof landed upstream.

Support for anchor-center alignment landed upstream for all the WebKit ports. This is a part of cutting-edge CSS spec called CSS Anchor Positioning. To test this feature, the CSSAnchorPositioning runtime preference needs to be enabled.

Web Platform 🌐

WebKit has since a long time offered a non-standard method Document.caretRangeFromPoint() to get the caret range at a certain coordinate, but now offers the same functionality in a standardised way.

We improved the multi touch support on WPE: the touch identifiers are now more reliable when using the Web API Pointer Events. This has been backported to the last stable release 2.46.4

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

On the JSC front, Justin Michaud has fixed a tricky issue in the implementation of Air shuffles (i.e. smartly copying N arbitrary locations to N different arbitrary locations). He also fixed some lowering code that generated invalid B3, as well as the 32-bit version of addI31Ref (part of the GC wasm extension).

Angelos Oikonomopoulos fixed another corner case in the testing of single-precision floating point arguments on 32-bits.

Graphics 🖼️

Support for multi-threaded GPU rendering landed upstream for both GTK/WPE ports. In main branch, GPU accelerated tile rendering was already activated by default—it is still the case, but now it utilizes one extra GPU rendering thread instead of performing the GPU rendering using (and blocking) the main thread.

The number of threads used for CPU multi-threaded rendering was controlled by the WEBKIT_SKIA_PAINTING_THREADS environment variable and has been renamed to WEBKIT_SKIA_CPU_PAINTING_THREADS. Likewise we now support the setting WEBKIT_SKIA_GPU_PAINTING_THREADS (where 0 implies using the main thread, and values in the 1 to 4 range enable threaded GPU rendering) to control the amount of GPU rendering threads used.

Negotiation of buffer formats with Wayland using DMA-BUF feedback was getting the first format that fits with the requirements in the first tranche even when the transparency did not match. Now we honor the transparency if there is a way to do it, even when other tranches than the first one need to be used. This allows the compositor to do direct scanout in more cases.

Releases 📦️

This has been a week filled with releases!

On the stable series, WebKitGTK 2.46.4 and WPE WebKit 2.46.4 include the usual stream of small fixes, a number of multimedia handling improvements focused on around Media Stream, and two important security fixes covered in a new security advisory (WSA‑2024‑0007: GTK, WPE). The covered vulnerabilities are known to be exploited in the wild, and updating is strongly encouraged; fresh packages are already available (or will be soon) in popular Linux distributions.

Also, development releases WebKitGTK 2.47.2 and WPE WebKit 2.47.2 are now available. The main highlights are the multi-threaded GPU rendering, and the added system settings API in WPEPlatform. These development snapshots are often timed around important changes; we greatly appreciate when people put the effort to give them a try, because detecting (and reporting) any issues earlier is a great help that gives us developers more time to polish the code before it reaches a stable version.

Infrastructure 🏗️

Flatpak 1.15.11 was released with a handful of patches related to accessibility. These patches enable WebKit accessibility to work in sandboxed environments. With this release, all the pieces of this puzzle fell in place, and now sandboxed apps that use WebKit are properly accessible and introspectable by screen readers and Braille generators.

Of course, there are further improvements to be made, and lots of fine-tuning to how WebKit handles accessibility of web pages. But this is nonetheless an exciting step, both for accessibility on Linux and also for the platform.

A WPE MiniBrowser runner for the Web-Platform-Tests (WPT) cross-browser test suite was added recently. Please check the documentation on how to use it and remember that there is also a WebKitGTK MiniBrowser runner there also available. Both runners allow to automatically download and use the last nightly universal bundle for running the tests if you pass the flag --install-browser to ./wpt run. Pass also --log-mach=- for increased verbosity. Please note that this only adds the runner for manual testing. We are still working on adding WPE to the automated testing dashboard at wpt.fyi

Justin Michaud submitted a fix for flashing Yocto images to external SD cards.

The WPE WebKit web site now has a separate RSS feed for security advisories. It can be reached at https://wpewebkit.org/security.xml and may be useful for those interested in automated notifications about security fixes.

That’s all for this week!

by Unknown at December 02, 2024 01:06 PM

November 24, 2024

Brian Kardell

Interop and Hard Problems

Interop and Hard Problems

Let's talk about priorities, technical debt and hard problems in the Web Platform...

In many ways, browser engine projects are not that different from most other software projects. The "stewards" still have teams with managers and specializations and budgets and people out on leave, and so on. They have to prioritize and plan. And just like every other project I've ever seen, they face the same kinds of pressures and problems: There are never enough resources for everything, there are always new asks, they always accumulate tech debt, and sometimes there are really hard problems.

What is perhaps special about them is that they are trying to be more than just an isolated program: They're trying to contribute to a standard, interoperable platform. The catch, for us, is that we only reap the benefit when something makes it all the way through all of the team's independent priority gauntlets, get shipped widely, and so on.

This can be kind of painful to experience.

For example, consider the <details> element. It's literally the simplest possible interactive element. Here's how we got it:

  • In June 2011, Chrome shipped <details>.
  • A year later, in July 2012, Safari shipped it.
  • Another year later in July 2013, Opera shipped it.
  • Still another three years later, in September 2016, Firefox shipped it
  • Microsoft never implemented it - we finally got the last implementation when Edge moved to Chromium in 2020.

If you're counting, that's nine years it took to reach shipping in all browsers. The newly defined "Baseline Widely Available" which indicates roughly when something should have reached as close to 100% market-share/deployment as possible would take another three... So, more or less, that happened last year.

And that's just the initial and appealing part. Then there are, of course, bugs discovered, new tests added, and ultimately feature improvements and iterations and so on. Even currently there are newly failing tests that make support for <details> ragged as we've tried to improve things like find-in-page and and add new concepts like invokers.

As time goes by, we're accumulating squares in the feature grid and new "gaps" in it faster than we're filling them. We're accumulating tech debt.

Enter Interop

Interop was originally intended as a way to pay down, that debt: Let's pick some things to prioritize together and turn all of the little red failure squares green. But prioritizing is tricky.

We'll get very roughly around 100 submissions of what we should focus on every year. But Interop is merely allowing browser makers to agree on how to focus and prioritize on some of the same things. The resources themselves are still finite. That means that prioritizing some things inevitably means not prioritizing something else.

And, there are a lot of competing pressures about what to prioritize, and why.

For example: It is super effective for developers if we can focus initial developments together. Imagine if we could have delivered <details> across the board and very high quality in 2011, or 2012.

Focusing together on a few new features has other added benefits too. People are more excited to work on it for one. We also get everyone talking about the same things at the same time, that's helpful - nobody misses the big event. It means use will grow faster, etc. It gives us something a like ECMA annual editions. So, it's a little unsurprising that last year, Interop included areas like CSS nesting, popover, relative color syntax, declarative shadow DOM.

However - at the other end of the spectrum, there are lots of things which are already very ragged. These things are damned hard to prioritize. They're all over the map. They are of obviously different, and debatable kinds of value, to sometimes very different communities. They can also incur different costs on different engines, and so on.

All of this conspires together to create some perennially hard problems. They continue to be needs, sometimes for exceedingly long times.

Perennially Hard Problems

This year, I'm making the case that we need to find a way to prioritize those perennially hard problems which, for whatever reason, we can never seem to prioritize. Perhaps every 5, 7 or 10 years we we should focus on these kinds of projects.

If you've been reading my blog or listening to our podcast, then you're already aware that MathML and SVG are probably the biggest examples of this kind of problem. Both are among the oldest web specifications, having their first versions published about the same time as HTML 4.0 and CSS 2. They were specially integrated with the HTML parser, and are integrated into the HTML Living Standard (MathML, SVG).

Yet both are historically dramatically under-funded and much of the actual work on them have been funded by volunteers and non-steward organizations! 26 years later we're still struggling to find the will to cross some important last miles.

Thus, every year, we have submissions about both for Interop.

The 2024 State of HTML survey found that <svg> was the top content pain point cited by developers, with almost double the pain attributed to “browser support”. <svg> - that is the literal <svg> element, not including the other ways SVGs can be used - is used on over 55% of HTML pages in the HTTP archive data. Only 27 of HTML's roughly 130 elements are more popular. SVG is also used heavily in embedded applications powered by Web engines.

A lot of math content is in more specific sites like arXiv and Wikipedia, which each have millions and millions of equations, or in online education or books. The HttpArchive crawl isn't the best way to measure that since it is focused mainly on public home pages where there's not likely to be a lot of math. However, even in the crawl, we still see thousands of pages do load 2 of the most popular JavaScript libraries which are bridging the gaps instead of rendering native math. This hurts performance and is unique - we don't require JavaScript to render text. We also know that numerous document editing tools like Adobe Indesign and Microsoft Word support MathML. Those are complex applications which require a lot of script already, and lacking good support means that they have to load even more.

Igalia has contributed implementations and improvements with funding from others and ourselves. Every year we have invested a bit ourselves to keep things moving forward. But it moves slowly this way. What we really need are some concerted efforts to push us across those last miles. We'd go a lot farther, a lot faster, together.

If you support the idea of some focus and push on these, please let us know - let vendors know. It might help.

Of course, it might not too. Historically, it's been difficult. What we know works is for someone outside of vendors to do the work - or fund it. Igalia will keep plugging away, but without external funding our own investments only go so far. If your organization would benefit from these, consider financially sponsoring some work. Alternatively, you can also help fund work on MathML directly.

November 24, 2024 05:00 AM

November 22, 2024

Igalia WebKit Team

WebKit Igalia Periodical #3

Update on what happened in WebKit in the week from November 15 to November 22.

Cross-Port 🐱

The getImageData() canvas method has been optimized to avoid an intermediate memory copy. This made fetching pixel data about ten times faster in the embedded hardware and laptops with integrated GPUs used for testing. The improvement is slated for inclusion in the upcoming 2.46.4 stable release.

The WebKit#31458 PR landed today. This adds a mechanism that leverages the damage information to reduce the amount of painting during composition. The biggest gains of that are expected with WPE used on embedded devices.

Running WebKit layout tests using the multi-threaded Skia-CPU mode (WEBKIT_SKIA_ENABLE_CPU_RENDERING=1, WEBKIT_SKIA_PAINTING_THREADS>0) fired assertions in debug/assert-enabled builds. Recording a DisplayList on the main thread and replaying it on a worker thread exposed a thread-safety issue. The Pattern class was not expecting to be dereferenced from a non-main thread. Pattern now inherits from ThreadSafeRefCounted to fix the problem.

Traditionally, we supported multi-threaded tile rendering using Cairo (which is CPU-only), and also using Skia in CPU rendering mode. Skia with GPU accelerated rendering is driven from the main thread and does not support multi-threading. However, there is a non-negligible amount of CPU work to be performed prior to using the GPU for rendering, where it can be beneficial to parallelize that work across multiple cores.

Preparation is ongoing for threaded GPU rendering, by adding GPU synchronization primitives to NativeImage and ImageBuffer for Skia, and making use of the new GPU synchronization primitives during DisplayList recording (on the main thread) and replays—which will happen in a GPU worker thread, once we have added support for that.

Multimedia 🎥

GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.

Canvas to PeerConnection streaming was fixed, there was an issue with video orientation tags handling leading to flipped frames on receiving side.

Screen capture support using PipeWire and GStreamer was fixed, DMA-BUFs are now negotiated with PipeWire, enabling zero-copy rendering to video elements. Screen capture streaming to PeerConnection is still an open issue though.

WPE WebKit 📟

WPE Platform API 🧩

New, modern platform API that supersedes usage of libwpe and WPE backends.

WPEPlatform now supports a Settings API allowing platforms and applications to set options such as fonts or dark mode. This can be tested by launching MiniBrowser and passing an INI-style configuration file with settings:

MiniBrowser --use-wpe-platform-api --config-file=config.ini

WPE Android 🤖

Adaptation of WPE WebKit targeting the Android operating system.

WPE-Android got updated to WebKit 2.46.3. Coming from 2.46.0, it includes a fix for DuckDuckGo results link, better text kerning, a better performing Canvas putImageData()operation, improved selection of H.264 encoding parameters, and more.

As usual, the 0.1.2 release at GitHub contains the downloadable .apk packages. The Maven repository has been updated as well.

Producing WPE-Android releases on GitHub has been automated, and version 0.1.2 has already been made this way, the only manual intervention being the approval of the draft created by the CI setup.

Infrastructure 🏗️

GNOME Web Canary builds are working again, now based on the GNOME Flatpak runtime instead of the soon-to-be-deprecated WebKit Flatpak SDK runtime. To install it run:

flatpak --user install \
  https://nightly.gnome.org/repo/appstream/org.gnome.Epiphany.Canary.flatpakref

This version of GNOME Web leverages nightly WebKitGTK builds from the WebKit Git main development branch.

The WebKitGTK Debian 11 bot has been retired. We officially stopped supporting WebKitGTK on Debian 11 on June 12th (one year after the release of Debian 12), however we have been maintaining WebKitGTK on Debian 11 for a longer time than initially expected. Debian 11 Security support reached end-of-life on August 14th, 2024.

That’s all for this week!

by Unknown at November 22, 2024 10:46 PM

November 19, 2024

Tiago Vignatti

Good fortune 복

Last week, we talked about Chrome, Google’s browser.

We discussed open technologies, cooperativism, and Chromium’s governance with a focus on a transparent future. We talked about the development of browser variants, its use on different platforms, and how we will approach Generative AI in Chrome.

It has also been great to be surrounded by amazing friends in Seoul. Feeling lucky and grateful for the Koreans! 🙏

by Author at November 19, 2024 03:00 PM

Melissa Wen

Display/KMS Meeting at XDC 2024: Detailed Report

XDC 2024 in Montreal was another fantastic gathering for the Linux Graphics community. It was again a great time to immerse in the world of graphics development, engage in stimulating conversations, and learn from inspiring developers.

Many Igalia colleagues and I participated in the conference again, delivering multiple talks about our work on the Linux Graphics stack and also organizing the Display/KMS meeting. This blog post is a detailed report on the Display/KMS meeting held during this XDC edition.

Short on Time?

  1. Catch the lightning talk summarizing the meeting here (you can even speed up 2x):
  1. For a quick written summary, scroll down to the TL;DR section.

TL;DR

This meeting took 3 hours and tackled a variety of topics related to DRM/KMS (Linux/DRM Kernel Modesetting):

  • Sharing Drivers Between V4L2 and KMS: Brainstorming solutions for using a single driver for devices used in both camera capture and display pipelines.
  • Real-Time Scheduling: Addressing issues with non-blocking page flips encountering sigkills under real-time scheduling.
  • HDR/Color Management: Agreement on merging the current proposal, with NVIDIA implementing its special cases on VKMS and adding missing parts on top of Harry Wentland’s (AMD) changes.
  • Display Mux: Collaborative design discussions focusing on compositor control and cross-sync considerations.
  • Better Commit Failure Feedback: Exploring ways to equip compositors with more detailed information for failure analysis.

Bringing together Linux display developers in the XDC 2024

While I didn’t present a talk this year, I co-organized a Display/KMS meeting (with Rodrigo Siqueira of AMD) to build upon the momentum from the 2024 Linux Display Next hackfest. The meeting was attended by around 30 people in person and 4 remote participants.

Speakers: Melissa Wen (Igalia) and Rodrigo Siqueira (AMD)

Link: https://indico.freedesktop.org/event/6/contributions/383/

Topics: Similar to the hackfest, the meeting agenda was built over the first two days of the conference and mixed talks follow-up with new ideas and ongoing community efforts.

The final agenda covered five topics in the scheduled order:

  1. How to share drivers between V4L2 and DRM for bridge-like components (new topic);
  2. Real-time Scheduling (problems encountered after the Display Next hackfest);
  3. HDR/Color Management (ofc);
  4. Display Mux (from Display hackfest and XDC 2024 talk, bringing AMD and NVIDIA together);
  5. (Better) Commit Failure Feedback (continuing the last minute topic of the Display Next hackfest).

Unpacking the Topics

Similar to the hackfest, the meeting agenda evolved over the conference. During the 3 hours of meeting, I coordinated the room and discussion rounds, and Rodrigo Siqueira took notes and also contacted key developers to provide a detailed report of the many topics discussed.

From his notes, let’s dive into the key discussions!

How to share drivers between V4L2 and KMS for bridge-like components.

Led by Laurent Pinchart, we delved into the challenge of creating a unified driver for hardware devices (like scalers) that are used in both camera capture pipelines and display pipelines.

  • Problem Statement: How can we design a single kernel driver to handle devices that serve dual purposes in both V4L2 and DRM subsystems?
  • Potential Solutions:
    1. Multiple Compatible Strings: We could assign different compatible strings to the device tree node based on its usage in either the camera or display pipeline. However, this approach might raise concerns from device tree maintainers as it could be seen as a layer violation.
    2. Separate Abstractions: A single driver could expose the device to both DRM and V4L2 through separate abstractions: drm-bridge for DRM and V4L2 subdev for video. While simple, this approach requires maintaining two different abstractions for the same underlying device.
    3. Unified Kernel Abstraction: We could create a new, unified kernel abstraction that combines the best aspects of drm-bridge and V4L2 subdev. This approach offers a more elegant solution but requires significant design effort and potential migration challenges for existing hardware.

Real-Time Scheduling Challenges

We have discussed real-time scheduling during this year Linux Display Next hackfest and, during the XDC 2024, Jonas Adahl brought up issues uncovered while progressing on this front.

  • Context: Non-blocking page-flips can, on rare occasions, take a long time and, for that reason, get a sigkill if the thread doing the atomic commit is a real-time schedule.
  • Action items:
    • Explore alternative backtraces during the busy wait (e.g., ftrace).
    • Investigate the maximum thread time in busy wait to reproduce issues faced by compositors. Tools like RTKit (mutter) can be used for better control (Michel Dänzer can help with this setup).

HDR/Color Management

This is a well-known topic with ongoing effort on all layers of the Linux Display stack and has been discussed online and in-person in conferences and meetings over the last years.

Here’s a breakdown of the key points raised at this meeting:

  • Talk: Color operations for Linux color pipeline on AMD devices: In the previous day, Alex Hung (AMD) presented the implementation of this API on AMD display driver.
  • NVIDIA Integration: While they agree with the overall proposal, NVIDIA needs to add some missing parts. Importantly, they will implement these on top of Harry Wentland’s (AMD) proposal. Their specific requirements will be implemented on VKMS (Virtual Kernel Mode Setting driver) for further discussion. This VKMS implementation can benefit compositor developers by providing insights into NVIDIA’s specific needs.
  • Other vendors: There is a version of the KMS API applied on Intel color pipeline. Apart from that, other vendors appear to be comfortable with the current proposal but lacks the bandwidth to implement it right now.
  • Upstream Patches: The relevant upstream patches were can be found here. [As humorously notes, this series is eagerly awaiting your “Acked-by” (approval)]
  • Compositor Side: The compositor developers have also made significant progress.
    • KDE has already implemented and validated the API through an experimental implementation in Kwin.
    • Gamescope currently uses a driver-specific implementation but has a draft that utilizes the generic version. However, some work is still required to fully transition away from the driver-specific approach. AP: work on porting gamescope to KMS generic API
    • Weston has also begun exploring implementation, and we might see something from them by the end of the year.
  • Kernel and Testing: The kernel API proposal is well-refined and meets the DRM subsystem requirements. Thanks to Harry Wentland effort, we already have the API attached to two hardware vendors and IGT tests, and, thanks to Xaver Hugl, a compositor implementation in place.

Finally, there was a strong sense of agreement that the current proposal for HDR/Color Management is ready to be merged. In simpler terms, everything seems to be working well on the technical side - all signs point to merging and “shipping” the DRM/KMS plane color management API!

Display Mux

During the meeting, Daniel Dadap led a brainstorming session on the design of the display mux switching sequence, in which the compositor would arm the switch via sysfs, then send a modeset to the outgoing driver, followed by a modeset to the incoming driver.

  • Context:
  • Key Considerations:
    • HPD Handling: There was a general consensus that disabling HPD can be part of the sequence for internal panels and we don’t need to focus on it here.
    • Cross-Sync: Ensuring synchronization between the compositor and the drivers is crucial. The compositor should act as the “drm-master” to coordinate the entire sequence, but how can this be ensured?
    • Future-Proofing: The design should not assume the presence of a mux. In future scenarios, direct sharing over DP might be possible.
  • Action points:
    • Sharing DP AUX: Explore the idea of sharing DP AUX and its implications.
    • Backlight: The backlight definition represents a problem in the mux switch context, so we should explore some of the current specs available for that.

Towards Better Commit Failure Feedback

In the last part of the meeting, Xaver Hugl asked for better commit failure feedback.

  • Problem description: Compositors currently face challenges in collecting detailed information from the kernel about commit failures. This lack of granular data hinders their ability to understand and address the root causes of these failures.

To address this issue, we discussed several potential improvements:

  • Direct Kernel Log Access: One idea is to directly load relevant kernel logs into the compositor. This would provide more detailed information about the failure and potentially aid in debugging.
  • Finer-Grained Failure Reporting: We also explored the possibility of separating atomic failures into more specific categories. Not all failures are critical, and understanding the nature of the failure can help compositors take appropriate action.
  • Enhanced Logging: Currently, the dmesg log doesn’t provide enough information for user-space validation. Raising the log level to capture more detailed information during failures could be a viable solution.

By implementing these improvements, we aim to equip compositors with the necessary tools to better understand and resolve commit failures, leading to a more robust and stable display system.

A Big Thank You!

Huge thanks to Rodrigo Siqueira for these detailed meeting notes. Also, Laurent Pinchart, Jonas Adahl, Daniel Dadap, Xaver Hugl, and Harry Wentland for bringing up interesting topics and leading discussions. Finally, thanks to all the participants who enriched the discussions with their experience, ideas, and inputs, especially Alex Goins, Antonino Maniscalco, Austin Shafer, Daniel Stone, Demi Obenour, Jessica Zhang, Joan Torres, Leo Li, Liviu Dudau, Mario Limonciello, Michel Dänzer, Rob Clark, Simon Ser and Teddy Li.

This collaborative effort will undoubtedly contribute to the continued development of the Linux display stack.

Stay tuned for future updates!

November 19, 2024 01:00 PM

November 18, 2024

Ricardo García

My XDC 2024 talk about VK_EXT_device_generated_commands

Some days ago I wrote about the new VK_EXT_device_generated_commands Vulkan extension that had just been made public. Soon after that, I presented a talk at XDC 2024 with a brief introduction to it. It’s a lightning talk that lasts just about 7 minutes and you can find the embedded video below, as well as the slides and the talk transcription if you prefer written formats.

Truth be told, the topic deserves a longer presentation, for sure. However, when I submitted my talk proposal for XDC I wasn’t sure if the extension was going to be public by the time XDC would take place. This meant I had two options: if I submitted a half-slot talk and the extension was not public, I needed to talk for 15 minutes about some general concepts and a couple of NVIDIA vendor-specific extensions: VK_NV_device_generated_commands and VK_NV_device_generated_commands_compute. That would be awkward so I went with a lighning talk where I could talk about those general concepts and, maybe, talk about some VK_EXT_device_generated_commands specifics if the extension was public, which is exactly what happened.

Fortunately, I will talk again about the extension at Vulkanised 2025. It will be a longer talk and I will cover the topic in more depth. See you in Cambridge in February and, for those not attending, stay tuned because Vulkanised talks are recorded and later uploaded to YouTube. I’ll post the link here and in social media once it’s available.

XDC 2024 recording

* { padding: 0; margin: 0; overflow: hidden; } html, body { height: 100%; } img, span { /* All elements take the whole iframe width and are vertically centered. */ position: absolute; width: 100%; top: 0; bottom: 0; margin: auto; } span { /* This mostly applies to the play button. */ height: 1.5em; text-align: center; font-family: sans-serif; font-size: 500%; color: white; } Video: Device-Generated Commands in Vulkan "> * { padding: 0; margin: 0; overflow: hidden; } html, body { height: 100%; } img, span { /* All elements take the whole iframe width and are vertically centered. */ position: absolute; width: 100%; top: 0; bottom: 0; margin: auto; } span { /* This mostly applies to the play button. */ height: 1.5em; text-align: center; font-family: sans-serif; font-size: 500%; color: white; } Video: Device-Generated Commands in Vulkan " >

Talk slides and transcription

Title slide: Device-Generated Commands in Vulkan (VK_EXT_device_generated_commands)

Hello, I’m Ricardo from Igalia and I’m going to talk about Device-Generated Commands in Vulkan. This is a new extension that was released a couple of weeks ago. I wrote CTS tests for it, I helped with the spec and I worked with some actual heros, some of them present in this room, that managed to get this implemented in a driver.

What are device-generated commands?

Device-Generated Commands is an extension that allows apps to go one step further in GPU-driven rendering because it makes it possible to write commands to a storage buffer from the GPU and later execute the contents of the buffer without needing to go through the CPU to record those commands, like you typically do by calling vkCmd functions working with regular command buffers.

It’s one step ahead of indirect draws and dispatches, and one step behind work graphs.

Naïve CPU-based approach

Getting away from Vulkan momentarily, if you want to store commands in a storage buffer there are many possible ways to do it. A naïve approach we can think of is creating the buffer as you see in the slide. We assign a number to each Vulkan command and store it in the buffer. Then, depending on the command, more or less data follows. For example, lets take the sequence of commands in the slide: (1) push constants followed by (2) dispatch. We can store a token number or command id or however you want to call it to indicate push constants, then we follow with meta-data about the command (which is the section in green color) containing the layout, stage flags, offset and size of the push contants. Finally, depending on the size, we store the push constant values, which is the first chunk of data in blue. For the dispatch it’s similar, only that it doesn’t need metadata because we only want the dispatch dimensions.

But this is not how GPUs work. A GPU would have a very hard time processing this. Also, Vulkan doesn’t work like this either. We want to make it possible to process things in parallel and provide as much information in advance as possible to the driver.

VK_EXT_device_generated_commands

So in Vulkan things are different. The buffer will not contain an arbitrary sequence of commands where you don’t know which one comes next. What we do is to create an Indirect Commands Layout. This is the main concept. The layout is like a template for a short sequence of commands. We create this layout using the tokens and meta-data that we saw colored red and green in the previous slide.

We specify the layout we will use in advance and, in the buffer, we ony store the actual data for each command. The result is that the buffer containing commands (lets call it the DGC buffer) is divided into small chunks, called sequences in the spec, and the buffer can contain many such sequences, but all of them follow the layout we specified in advance.

In the example, we have push constant values of a known size followed by the dispatch dimensions. Push constant values, dispatch. Push constant values, dispatch. Etc.

Restricted Command Selection

The second thing Vulkan does is to severely limit the selection of available commands. You can’t just start render passes or bind descriptor sets or do anything you can do in a regular command buffer. You can only do a few things, and they’re all in this slide. There’s general stuff like push contants, stuff related to graphics like draw commands and binding vertex and index buffers, and stuff to dispatch compute or ray tracing work. That’s it.

Moreover, each layout must have one token that dispatches work (draw, compute, trace rays) but you can only have one and it must be the last one in the layout.

Indirect Execution Sets

Something that’s optional (not every implementation is going to support this) is being able to switch pipelines or shaders on the fly for each sequence.

Summing up, in implementations that allow you to do it, you have to create something new called Indirect Execution Sets, which are groups or arrays of pipelines that are more or less identical in state and, basically, only differ in the shaders they include.

Inside each set, each pipeline gets an index and you can change the pipeline used for each sequence by (1) specifying the Execution Set in advance (2) using an execution set token in the layout, and (3) storing a pipeline index in the DGC buffer as the token data.

Quick How-To

The summary of how to use it would be:

First, create the commands layout and, optionally, create the indirect execution set if you’ll switch pipelines and the driver supports that.

Then, get a rough idea of the maximum number of sequences that you’ll run in a single batch.

With that, create the DGC buffer, query the required preprocess buffer size, which is an auxiliar buffer used by some implementations, and allocate both.

Then, you record the regular command buffer normally and specify the state you’ll use for DGC. This also includes some commands that dispatch work that fills the DGC buffer somehow.

Finally, you dispatch indirect work by calling vkCmdExecuteGeneratedCommandsEXT. Note you need a barrier to synchronize previous writes to the DGC buffer with reads from it.

You can also do explicit preprocessing but I won’t go into detail here.

Closing slide: Thanks for watching!

That’s it. Thank for watching, thanks Valve for funding a big chunk of the work involved in shipping this, and thanks to everyone who contributed!

November 18, 2024 03:55 PM

November 15, 2024

Igalia WebKit Team

WebKit Igalia Periodical #2

Update on what happened in WebKit in the week from November 8 to November 15.

Cross-Port 🐱

Sysprof received a round of improvements to the Marks Waterfall view, the hover tooltip now show the duration of the mark. The Graphics view also received some visual improvements, such as taller graphs and line rendering without cutoffs. Finally, Sysprof collector is now able to handle multiprocess scenarios better.

A new tool for Sysprof was added: sysprof-cat. It takes a capture file, and dumps it in textual form.

This is all in preparation to further profiler integration in WebKit on Linux. A new set of integration points is being prepared for WebKit where it can, for example, report the page FPS and memory usage to Sysprof in the Graphics view.

Sysprof user interface screenshot, showing the updated “Graphics” tab

The JSCOnly port may be built with support for the GLib main loop when configured with cmake -DPORT=JSCOnly -DEVENT_LOOP_TYPE=GLib. This is a seldom used option and the build was broken for months, but it has now been fixed.

This week the team took some time to kickstart improvements to the documentation. One of the goals we have had in mind for long is adding pages to the manual on a number of topics, and in this vein Georges has added an overview page for WebKitGTK and Alex started a page listing some of the available environment variables.

In order to allow sharing selected content between the GTK and WPE ports, Adrian is adding support to setup additional content directories for gi-docgen and to process templates to pick fragments of the source files depending on the port.

Improving what we already have is important, and Lauro has clarified how WebKitWebView::is-controlled-by-automation works.

Infrastructure 🏗️

We lately have been deploying nightly packaging bots, to provide binaries ready to use for different projects.

These bots run once per day and upload different built products that you can check below:

  1. GNOME Web Canary (built products):

    This one is meant to build GNOME Web with the GNOME SDK to produce the Canary builds of Web. Follow the progress at the corresponding Web merge request.

  2. WebKitGTK and WPE WebKit MiniBrowser/WebDriver universal bundles.

    These universal bundles should work on any Linux distribution and are intended for running tests on third-party CI systems without having to build WebKit. They include inside the tarball all the system libraries and resources needed to run WebKit, from libc up to the Mesa graphics drivers without requiring the usage of containers (similar concept to AppImage). Currently these builds are used to for the WPT tests at wpt.fyi, running on the Mozilla TaskCluster CI.

  3. JSC universal bundle (built products.

    Same content as the other universal bundles, but only including the jsc command line program. This is currently used by jsvu to easily allow developers to test the latest version of JavaScriptCore.

That’s all for this week!

by Unknown at November 15, 2024 11:10 PM

Manuel Rego

Servo Revival: 2023-2024

As many of you already know, Igalia took over the maintenance of the Servo project in January 2023. We’ve been working hard on bringing the project back to life again, and this blog post is a summary of our achievements so far.

Some history first #

You can skip this this section if you already know about the Servo project.

Servo is an experimental browser engine created by Mozilla in 2012. From the very beginning, it was developed alongside the Rust language, like a showcase for the new language that Mozilla was developing. Servo has always aimed to be a performant and secure web rendering engine, as those are also main characteristics of the Rust language.

Mozilla was the main force behind Servo’s development for many years, with some other companies like Samsung collaborating too. Some of Servo’s components, like Stylo and WebRender, were adopted and used in Firefox releases by 2017, and continue to be used there today.

In 2020, Mozilla laid off the whole Servo team, and the project moved to the Linux Foundation. Despite some initial interest in the project, by the end of 2020 there was barely any active work on the project, and 2021 and 2022 weren’t any better. At that point, many considered the Servo project to be abandoned.

2023: Igalia takes over maintenance of Servo #

Things changed in 2023, when Igalia got external funding to work on Servo and get the project moving again. We have previous experience working on Servo during the Mozilla years, and we also have a wide experience working on other web rendering engines.

To explore new markets and grow a stable community, the Servo project joined Linux Foundation Europe in September 2023, and is now one of the most active projects there. LF Europe is an umbrella organization that hosts the Servo project, and provides us with opportunities to showcase Servo at several events.

Over the last two years, Igalia has had a team of around five engineers working full-time on the Servo project. We’re taking care of the project maintenance, communication, community governance, and a large portion of development. We have also been looking for new partners and organizations that may be interested in the Servo project, and can devote resources or money towards moving it forward.

Picture of the Servo Project Updates talk at Linux Foundation Europe Member Summit 2024. In the picture you can see a room with a group of people sitting watching the presentation. On the projector there is a slide about Servo usage exmaples showing Tauri, Qt WebView, Blitz and Verso. Manuel Rego is giving the talk at the right of the projector.
Servo Project Updates talk at Linux Foundation Europe Member Summing 2024 (Picture by Linux Foundation)

Achievements #

Two years is not a lot of time for a project like Servo, and we’re very proud of what we’ve achieved in this period. Some highlights, including work from Igalia and the wider Servo community:

  • Maintenance: General project maintenance, tooling, documentation, community, and governance work. Lots of work has been put in upgrading Servo dependencies, specially the big ones (SpiderMonkey, Stylo and WebRender). There have been improvements in Servo’s CI and tooling. We now have a Servo book, the Servo community has been growing, and the Technical Steering Committee (TSC) is running periodic meetings every month.
  • Communications: The project is alive again and we want the world to know it, so we’ve been doing monthly blog posts, weekly updates on social media, plus conference talks and booths at several events. The project is more popular than ever, and our GitHub stars haven’t stopped growing.
  • Layout engine: Servo had two layout engines: its original one, and a newer one started around 2020, which follows specs more closely but was in very early stages. After some analysis, we decided to go with the new layout engine and start to implement features in it like floats, tables, flexbox, fonts, right-to-left, etc. We now pass 1.4 million subtests (1,401,889 at the time of publishing this blog post) in the Web Platform Tests. And of the tests we run today, our overall pass rate has gone from 40.8% in April 2023 to 62.0% (+21.2). 📈
  • New platforms: Servo used to support Linux, macOS, and Windows. That’s still the case, but we’ve also added support for Android and OpenHarmony, helping us reach a wide range of mobile devices. 📱
  • Embedded experiments: We’ve been working with the developers of Tauri, making Servo easier to embed and creating a first prototype of using Servo as web engine in WRY. Our community has also experimented with integrating Servo in other projects, including Blitz, Qt WebView, and Verso.
  • Outreachy: Outreachy is an internship program for getting people from underrepresented groups into our industry via FOSS, and we’ve rejoined the program in 2024. Our intern in the May cohort, @eerii, brought DevTools support back to Servo.

Of course, it’s impossible to summarize all of the work that happened in so much time, with so many people working together to achieve these great results. Big thanks to everyone that has collaborated on the project! 🙏

Screencast of Servo browsing the servo.org website and some Wikipedia articles

Some numbers #

While it’s hard to list all of our achievements, we can take a look at some stats. Numbers are always tricky to draw conclusions from, but we’ve been taking a look at the number of PRs merged in Servo’s main repository since 2018, to understand how the project is faring.

2018 2019 2020 2021 2022 2023 2024
PRs 1,188 986 669 118 65 776 1,771
Contributors 27.33 27.17 14.75 4.92 2.83 11.33 26.33
Contributors ≥ 10 2.58 1.67 1.17 0.08 0.00 1.58 4.67
  • PRs: total numbers of PRs merged. We’re now doing more than we were doing in 2018-2019.
  • Contributors: average number of contributors per month. We’re getting very close to the numbers of 2018-2019, it looks like a good number.
  • Contributors ≥ 10: average number of contributors that have merged more than 10 PRs per month. In this case we already have bigger numbers than in 2018-2019.

As a clarification, these numbers don’t include PRs from bots (dependabot and Servo WPT Sync).

Our participation in Outreachy has also made a marked difference. During each month-long contribution period, we get a huge influx of new people contributing to the project. This year, we participated in both the May and December cohorts, and the contribution periods are very visible in our March and October 2024 stats:

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
PRs 103 97 255 111 71 96 103 188 167 249 170 161
Contributors 19 22 32 27 23 22 30 31 24 37 21 28
Contributors ≥ 10 2 3 9 3 2 2 2 6 5 8 7 7

Anyway overall, we think these numbers ratify that the project is back to life, and the community is growing and engaging with the project. We’re very happy about them.

Donations #

Early this year we set up the Servo Open Collective and GitHub Sponsors, where many people and organizations have since been donating to the project. We decide how to spend this money transparently in the TSC, and so far we’ve used it to cover the project’s infrastructure costs, like self-hosted runners to speed up CI times.

We’re very grateful to everyone donating money to the project, so far adding up to over $24,500 from more than 500 donors. Thank you! 🙏

Picture of the Servo booth at Linux Foundation Europe Member Summit 2024. It's a talbe with a TV on the left showing a video of Servo running on Raspberry Pi, there's also a smaller screen on the right, showing a presentation about Servo. Behind the table there's a Servo banner. Over the table there is a Raspberry Pi, a Rockchip board and an Android phone. There are also stikcers and flyers.
Servo booth at Linux Foundation Europe Member Summing 2024 (Picture by Linux Foundation)

About the future #

Servo has a lot of potential. Several things make it a very special project:

  • Independent: Servo is managed under Linux Foundation Europe with an open governance model, which we believe is very important for the open web ecosystem as a whole.
  • Performant: Servo is the only web rendering engine that uses parallel layout for web content. Though Servo is not yet faster than other web engines that are already production-ready, we’re exploring new ways to render web content that take advantage of modern CPUs with many cores.
  • Secure: Servo is written in Rust, which prevents many security vulnerabilities by design, such as memory safety bugs. Other kinds of unsoundness, such as concurrency bugs, are also easier to manage in Rust.

Of course, the project has some challenges too:

  • Experimental: Servo is still in an experimental phase and is not a production-ready product. We still lack several features that are important for web content.
  • Competition: The dominant web rendering engines have huge teams behind them, working on making them better, faster, and more complete. Catching up with them is a really complex task.
  • Funding: This is a key topic for any open source project, and Servo is no different in this regard. We need organizations from public and private sectors to join efforts with us, and grow a healthy ecosystem around the project.

To finish on a positive note, we also have some key opportunities:

  • Single web apps: Servo may excel at rendering specific web apps in embedded scenarios, where the HTML and CSS features are known in advance.
  • UI frameworks: UI frameworks in the Rust ecosystem could start using Servo as web engine to render their contents.
  • Web engine: Servo could become the default web engine for any Rust application.
  • Web browser: In the future, we could grow into being able to develop a full-featured web browser.

The future of Servo, as with many other open source projects, is still uncertain. We’ll see how things go in the coming years, but we would love to see that the Servo project can keep growing.

Servo is a huge project. To keep it alive and making progress, we need continuous funding on a bigger scale than crowdfunding can generally accomplish. If you’re interested in contributing to the project or sponsoring the development of specific functionality, please contact us at [email protected] or igalia.com/contact.

Let’s hope we can walk this path together and keep working on Servo for many years ahead. 🤞

PS: Special thanks to my colleague Delan Azabani for proofreading and copyediting this blog post. 🙏

November 15, 2024 12:00 AM

November 13, 2024

Igalia Compilers Team

Summary of the October-2024 TC39 plenary

code { word-break: normal; }

In October, many colleagues from Igalia participated in a TC39 meeting organized in Tokyo, Japan by Sony Interactive Entertainment to discuss proposed features for the JavaScript standard alongside delegates from various other organizations.

Let's delve together into some of the most exciting updates!

You can also read the full agenda and the meeting minutes on GitHub.

Day 1 #

Import attributes to Stage 4 igalia logo #

Import attributes, (alongside JSON modules) reached Stage 4. Import attributes allow customizing how modules are imported. For example, in all JavaScript environments you'll be able to natively import JSON files using

import myData from "./data" with { type: "json" };

The proposals reached finally reached Stage 4, after a bumpy path - including being regressed from Stage 3 to Stage 2 in 2023 and needing to change their syntax.

The proposals are already supported in Chrome, Safari, and all the server-side runtimes, with Firefox soon to follow!

Iterator Helpers to stage 4! #

Although we didn't work directly on the Iterator Helpers proposal, we've been eagerly anticipating its completion. It elevates Javascript's standard library iterators up to a level of developer convenience that's comparable to Python's itertools module or the iterators in Rust. Here's a code snippet:

const result = Iterator.from(myArray)
.filter(myPredicate)
.take(50)
.map(myTransformFunc)
.reduce(mySummationFunc);

It's often convenient to think of processing code such as the above in terms of map and filter operations. But often you'd have to iterate through the array multiple times if you wrote it that way using Array's map and filter methods, which is inefficient. Conversely, if you wrote it with a for-of loop, you'd be using break, continue, and possibly state-tracking variables, which is harder to reason about. Iterator helpers give you the best of both worlds.

Normative changes to ECMA-402 igalia logo #

ECMA-402 is the Internationalization API specification, the companion standard to JavaScript's ECMA-262. Our colleagues Ben Allen and Ujjwal Sharma are on the editorial board of ECMA-402 and their responsibilities often include proposing small changes. This round, we received consensus for PRs that:

  • Fix a bug that caused date and time values to sometimes be rendered in the wrong numbering system for some locales.

  • Correctly format currency values when rendered in scientific/engineering notations.

  • Give an explicit ordering for the plural categories returned by Intl.PluralNames.resolvedOptions(). This allows for easier testing of the correctness of this method, and makes it easier for developers to discover what plural categories exist in which languages.

  • Allow use of non-ISO 4217 data in CurrencyDigits AO This is a small change, but one that makes the ECMA-402 specification more closely match both extant localization needs and also web reality. Previously the specification mandated the use of a standard for determining the number of "minor units" displayed when formatting currency values -- think here the number of digits used to display cents when formatting values such as 1.25 USD. The previously mandated data source is useful for some contexts, but not others. This PR allows implementors to use whichever source of data on currency minor units is best suited for that engine -- something that implementators had already been doing.

Road to the source map standard #

TG4, a task group created one year ago within TC39, has been diligently working to standardize and enhance the source maps functionality, with the goal of ensuring a good and consistent developer experience across multiple platforms.

Thanks also to the efforts by our colleagues Nicolò Ribaudo and Asumu Takikawa, TC39 approved the first draft of the new specification (https://tc39.es/source-map/2024/), which can now advance through the Ecma publishing process to become an official standard.

JSSugar / JS0 #

JavaScript keeps evolving and adding many new features over time: this can significantly help JavaScript developers, but it comes with its own problems. Every new feature is something that browsers need to implement and optimize, and which might cause bugs and slowdowns.

Some committee members (mostly representing browsers), initiated a discussion about whether there are possible alternative approaches to evolving the language. The primary example that was presented as a potential path forward is to leverage existing deveoper tools more extensively: many developers already transpile their code, so what if we made it "official"?

We could split the language in two parts, that together compose the ECMAScript standard:

  • one (JS0) would be the current language that is implemented in browsers, which is what we call today "JavaScript"
  • the other (JSSugar), only implemented in tools, similar to how today we use JSX or type annotations.

The discussion is in its very early stages, and the direction it will take is uncertain. It could evolve in many ways and it's possible that TC39 could ultimately decide that actually, the way things work today is fine. There are many voices pushing in opposite directions, and everything is still on the table. Stay tuned for more developments!

Day 2 #

Decimal igalia logo #

Our colleague Jesse Alama presented Decimal, showing the latest iterations on the design and data model for decimal in response to feedback in and between plenaries. The most recent version proposed an "auto-canonicalize" variant of IEEE 754 Decimal128, in which the notion of precision (or "quantum", to use the official term) of Decimal128 values would not be exposed. We received some feedback there, so decimal stays at stage 1. But stay tuned! We're going back to the drawing board and will keep iterating.

Promise.try to stage 4! #

Promise.try() is a new API that allows you to wrap a function--whether async or not--allowing it to be treated it as though it is always asynchronous. It replaces cumbersome workarounds like new Promise((resolve) => resolve(myFunction())). We didn't work on this proposal, but are nonetheless looking forward to using it!

Restricting support for sub-classing built-ins #

Sometimes TC39 makes some mistakes, designing features one way only to later realise that it should have been done differently.

One example of this is the level of support that JavaScript has for defining subclasses of built-in classes, such as

class MyUint8Array extends Uint8Array {}

const myArr = new MyUint8Array([1, 2, 3]);
myArr instanceof MyUint8Array; // true!

// also true, even if we didn't redefine .map to return a MyUint8Array
myArr.map(x => x) instanceof Uint8Array;

It often leads to vulnerabilities in JavaScript engines, and leads to many hard-to-optimize patterns that make everybody pay the cost of this language feature, even for those who don't rely on it.

After features have been shipped for years it's usually too late for TC39 to change them: our highest priority is to "not break the web", meaning that once developers start relying on something it's going to stay there forever.

The discussion focused on whether or not the use cases appeared real, conclusion was that they are for Array and Promise, but we can move forward with the conservative step of removing making only prototype methods of typed arays, Array Buffer, and Shared Array Buffer not look at their this to dynamically construct the corresponding class. The commitee will investigate further, the use cases for RegExp.

Day 3 #

Measure igalia logo #

We're very excited to announce that our colleague Ben Allen presented the Measure proposal, and it has reached Stage 1. Measure proposes an API for handling general-purpose unit conversion between measurement scales and measurement systems. Measure was originally part of the localization-related Smart Units proposal, but was promoted into its own proposal in response to demand for this tool in a wide range of contexts.

Smart units #

Smart Units is a proposal to include an API for localizing measured quantities to locale-appropriate scales and measuring systems. This can be complicated by how the appropriate measuring scale for some usages varies based on the type of thing being measured; for example, many locales use a different measurement scale for the heights of people than they do for other length measurements.

Although much of the action involved in developing this proposal has shifted to the related Measure proposal, in this session we considered what units and usages should be supported.

Other Updates igalia logo #

Our colleague Philip Chimento presented a short update on the progress of getting Temporal into browsers. Here's the representing the test262 conformance as of last plenary! temporal graph

Good news from the AsyncContext champions, including our colleague Andreu Botella: the proposal is almost ready for Stage 2.7! All the semantics relevant to ECMAScript have been finalized, and it's now just waiting on finalizing the integration semantics with the rest of the web APIs.

November 13, 2024 12:00 AM

November 08, 2024

Igalia WebKit Team

WebKit Igalia Periodical #1

Update on what happened in WebKit in the week from November 1 to November 8.

Cross-Port 🐱

The end-to-end latency slightly improved in the GstWebRTC backend, as the latency from capture devices is now properly taken into account.

Georges has proposed a new feature for Linux ports of WebKit: support for a new category of profiling information called “counters”. Counters are useful to track information over time, for example, the FPS of WebKit while showing a web page, or how much memory a web page is consuming during its display. The counters are integrated with Sysprof.

This is another tool that developers and enthusiasts can use to help profile and improve the performance of WebKit on Linux. The FPS counter is added as a proof of concept. This is still under review.

The prefer-hardware WebCodecs option for video decoders is no longer ignored. It is used as a hint to attempt decoding with hardware-accelerated components. If that fails, the decoder falls back to software.

On the JSC ARMv7 front, work on enabling OMG, the highest WebAssembly optimizing JIT tier, is ongoing. Max Rottenkolber has added support for atomics. Justin Michaud has synced up the tail call code with 64-bits and submitted PRs to further sync the 64/32-bit OMG generators. Most importantly, he's been working on an OSR fix (On Stack Replacement, the ability for the VM to tier up to an optimizing tier even in the middle of a loop, which is vital for taking advantage of the optimized code). Angelos Oikonomopoulos has been going over corner cases in the B3 (the intermediate representation used by OMG) tests and submitting numerous fixes.

The minimum required ICU version is now 70.1. This change updates ICU version checked by CMake to reflect a change that had already been done in 284568@main, which rebaselined JavaScriptCore to ICU 70. By updating the version checks the build will fail as early as possible in case the required ICU version is not installed. In addition to ICU, the minimum versions of Harfbuzz and LibXML were updated too. These two libraries depend on ICU.

Philip fixed the --enable-write-console-messages-to-stdout setting so that it works inside AudioWorklet environments; previously it would have been ignored.

The MediaRecorder backend gained WebM support (which requires GStreamer 1.24.9 or newer), and audio bitrate configuration support.

WebKitGTK 🖥️

The GTK port of the MiniBrowser now uses the GtkGraphicsOffload widget when built with a modern GTK4 version. This allows GTK and the compositor to optimize the web view contents, potentially direct scanout it, or maybe put it in a monitor overlay plane as well. This should lead to less power consumption. This is an “invisible” improvement, meaning users won't be able to notice this.

WPE WebKit 📟

The WPE WebKit 2.47.1 development release is now available. This is the first preview release for the upcoming stable series, and includes a few new features like support for the Spiel speech synthesis library, improvements to DMA-BUF usage in WebGL and video decoding, and the WPEPlatform API has gotten some new features and improvements.

As usual, feedback for development releases is welcome, including issue reports on Bugzilla.

WPE Platform API 🧩

New, modern platform API that supersedes usage of libwpe and WPE backends.

Carlos Garcia added basic touch input support to WPEPlatform DRM plug-in.

Community & Events 🤝

Mario published an article based on the talk delivered at the WebKit Contributors meeting on October 22nd, summarizing the work on WebKit done at Igalia in the past twelve months: Igalia and WebKit: status update and plans.

The original slides are also available.

That’s all for this week!

by Unknown at November 08, 2024 11:25 PM

November 03, 2024

Mario Sanchez Prada

Igalia and WebKit: status update and plans (2024)

It’s been more than 2 years since the last time I wrote something here, and in that time a lot of things happened. Among those, one of the main highlights was me moving back to Igalia‘s WebKit team, but this time I moved as part of Igalia’s support infrastructure to help with other types of tasks such as general coordination, team facilitation and project management, among other things.

On top of those things, I’ve been also presenting our work around WebKit in different venues, such as in the Embedded Open Source Summit or in the Embedded Recipes conference, for instance. Of course, that included presenting our work in the WebKit community as part of the WebKit Contributors Meeting, a small and technically focused event that happens every year, normally around the Bay Area (California). That’s often a pretty dense presentation where, over the course of 30-40 minutes, we go through all the main areas that we at Igalia contribute to in WebKit, trying to summarize our main contributions in the previous 12 months. This includes work not just from the WebKit team, but also from other ones such as our Web Platform, Compilers or Multimedia teams.

So far I did that a couple of times only, both last year on October 24rth as well as this year, just a couple of weeks ago in the latest instance of the WebKit Contributors meeting. I believe the session was interesting and informative, but unfortunately it does not get recorded so this time I thought I’d write a blog post to make it more widely accessible to people not attending that event.

This is a long read, so maybe grab a cup of your favorite beverage first…

Igalia and WebKit

So first of all, what is the relationship between Igalia and the WebKit project?

In a nutshell, we are the lead developers and the maintainers of the two Linux-based WebKit ports, known as WebKitGTK and WPE. These ports share a common baseline (e.g. GLib, GStreamer, libsoup) and also some goals (e.g. performance, security), but other than that their purpose is different, with WebKitGTK being aimed at the Linux desktop, while WPE is mainly focused on embedded devices.

This means that, while WebKitGTK is the go-to solution to embed Web content in GTK applications (e.g. GNOME Web/Epiphany, Evolution), and therefore integrates well with that graphical toolkit, WPE does not even provide a graphical toolkit since its main goal is to be able to run well on embedded devices that often don’t even have a lot of memory or processing power, or not even the usual mechanisms for I/O that we are used to in desktop computers. This is why WPE’s architecture is designed with flexibility in mind with a backends-based architecture, why it aims for using as few resources as possible, and why it tries to depend on as few libraries as possible, so you can integrate it virtually in any kind of embedded Linux platform.

Besides that port-specific work, which is what our WebKit and Multimedia teams focus a lot of their effort on, we also contribute at a different level in the port-agnostic parts of WebKit, mostly around the area of Web standards (e.g. contributing to Web specifications and to implement them) and the Javascript engine. This work is carried out by our Web Platform and Compilers team, which tirelessly contribute to the different parts of WebCore and JavaScriptCore that affect not just the WebKitGTK and WPE ports, but also the rest of them to a bigger or smaller degree.

Last but not least, we also devote a considerable amount of our time to other topics such as accessibility, performance, bug fixing, QA... and also to make sure WebKit works well on 32-bit devices, which is an important thing for a lot of WPE users out there.

Who are our users?

At Igalia we distinguish 4 main types of users of the WebKitGTK and WPE ports of WebKit:

Port users: this category would include anyone that writes a product directly against the port’s API, that is, apps such as a desktop Web browser or embedded systems that rely on a fullscreen Web view to render its Web-based content (e.g. digital signage systems).

Platform providers: in this category we would have developers that build frameworks with one of the Linux ports at its core, so that people relying on such frameworks can leverage the power of the Web without having to directly interface with the port’s API. RDK could be a good example of this use case, with WPE at the core of the so-called Thunder plugin (previously known as WPEFramework).

Web developers: of course, Web developers willing to develop and test their applications against our ports need to be considered here too, as they come with a different set of needs that need to be fulfilled, beyond rendering their Web content (e.g. using the Web Inspector).

End users: And finally, the end user is the last piece of the puzzle we need to pay attention to, as that’s what makes all this effort a task worth undertaking, even if most of them most likely don’t need what WebKit is, which is perfectly fine :-)

We like to make this distinction of 4 possible types of users explicit because we think it’s important to understand the complexity of the amount of use cases and the diversity of potential users and customers we need to provide service for, which is behind our decisions and the way we prioritize our work.

Strategic goals

Our main goal is that our product, the WebKit web engine, is useful for more and more people in different situations. Because of this, it is important that the platform is homogeneous and that it can be used reliably with all the engines available nowadays, and this is why compatibility and interoperability is a must, and why we work with the the standards bodies to help with the design and implementation of several Web specifications.

With WPE, it is very important to be able to run the engine in small embedded devices, and that requires good performance and being efficient in multiple hardware architectures, as well as great flexibility for specific hardware, which is why we provided WPE with a backend-based architecture, and reduced dependencies to a minimum.

Then, it is also important that the QA Infrastructure is good enough to keep the releases working and with good quality, which is why I regularly maintain, evolve and keep an eye on the EWS and post-commit bots that keep WebKitGTK and WPE building, running and passing the tens of thousands of tests that we need to check continuously, to ensure we don’t regress (or that we catch issues soon enough, when there’s a problem). Then of course it’s also important to keep doing security releases, making sure that we release stable versions with fixes to the different CVEs reported as soon as possible.

Finally, we also make sure that we keep evolving our tooling as much as possible (see for instance the release of the new SDK earlier this year), as well as improving the documentation for both ports.

Last, all this effort would not be possible if not because we also consider a goal of us to maintain an efficient collaboration with the rest of the WebKit community in different ways, from making sure we re-use and contribute to other ports as much code as possible, to making sure we communicate well in all the forums available (e.g. Slack, mailing list, annual meeting).

Contributions to WebKit in numbers

Well, first of all the usual disclaimer: number of commits is for sure not the best possible metric,  and therefore should be taken with a grain of salt. However, the point here is not to focus too much on the actual numbers but on the more general conclusions that can be extracted from them, and from that point of view I believe it’s interesting to take a look at this data at least once a year.

With that out of the way, it’s interesting to confirm that once again we are still the 2nd biggest contributor to WebKit after Apple, with ~13% of the commits landed in this past 12-month period. More specifically, we landed 2027 patches out of the 15617 ones that took place during the past year, only surpassed by Apple and their 12456 commits. The remaining 1134 patches were landed mostly by Sony, followed by RedHat and several other contributors.

Now, if we remove Apple from the picture, we can observe how this year our contributions represented ~64% of all the non-Apple commits, a figure that grew about ~11% compared to the past year. This confirms once again our commitment to WebKit, a project we started contributing about 14 years ago already, and where we have been systematically being the 2nd top contributor for a while now.

Main areas of work

The 10 main areas we have contributed to in WebKit in the past 12 months are the following ones:

  • Web platform
  • Graphics
  • Multimedia
  • JavaScriptCore
  • New WPE API
  • WebKit on Android
  • Quality assurance
  • Security
  • Tooling
  • Documentation

In the next sections I’ll talk a bit about what we’ve done and what we’re planning to do next for each of them.

Web Platform

content-visibility:auto

This feature allows skipping painting and rendering of off-screen sections, particularly useful to avoid the browser spending time rendering parts in large pages, as content outside of the view doesn’t get rendered until it gets visible.

We completed the implementation and it’s now enabled by default.

Navigation API

This is a new API to manage browser navigation actions and examine history, which we started working on in the past cycle. There’s been a lot of work happening here and, while it’s not finished yet, the current plan is that Apple will continue working on that in the next months.

hasUAVisualTransition

This is an attribute of the NavigateEvent interface, which is meant to be True if the User Agent has performed a visual transition before a navigation event. It was something that we have also finished implementing and is now also enabled by default.

Secure Curves in the Web Cryptography API

In this case, we worked on fixing several Web Interop related issues, as well as on increasing test coverage within the Web Platform Tests (WPT) test suites.

On top of that we also moved the X25519 feature to the “prepare to ship” stage.

Trusted Types

This work is related to reducing DOM-based XSS attacks. Here we finished the implementation and this is now pending to be enabled by default.

MathML

We continued working on the MathML specification by working on the support for padding, border and margin, as well as by increasing the WPT score by ~5%.

The plan for next year is to continue working on core features and improve the interaction with CSS.

Cross-root ARIA

Web components have accessibility-related issues with native Shadow DOM as you cannot reference elements with ARIA attributes across boundaries. We haven’t worked on this in this period, but the plan is to work in the next months on implementing the Reference Target proposal to solve those issues.

Canvas Formatted Text

Canvas has not a solution to add formatted and multi-line text, so we would like to also work on exploring and prototyping the Canvas Place Element proposal in WebKit, which allows better text in canvas and more extended features.

Graphics

Completed migration from Cairo to Skia for the Linux ports

If you have followed the latest developments, you probably already know that the Linux WebKit ports (i.e. WebKitGTK and WPE) have moved from Cairo to Skia for their 2D rendering library, which was a pretty big and important decision taken after a long time trying different approaches and experiments (including developing our own HW-accelerated 2D rendering library!), as well as running several tests and measuring results in different benchmarks.

The results in the end were pretty overwhelming and we decided to give Skia a go, and we are happy to say that, as of today, the migration has been completed: we covered all the use cases in Cairo, achieving feature parity, and we are now working on implementing new features and improvements built on top of Skia (e.g. GPU-based 2D rendering).

On top of that, Skia is now the default backend for WebKitGTK and WPE since 2.46.0, released on September 17th, so if you’re building a recent version of those ports you’ll be already using Skia as their 2D rendering backend. Note that Skia is using its GPU-based backend only on desktop environments, on embedded devices the situation is trickier and for now the default is the CPU-based Skia backend, but we are actively working to narrow the gap and to enable GPU-based rendering also on embedded.

Architecture changes with buffer sharing APIs (DMABuf)

We did a lot of work here, such as a big refactoring of the fencing system to control the access to the buffers, or the continued work towards integrating with Apple’s DisplayLink infrastructure.

On top of that, we also enabled more efficient composition using damaging information, so that we don’t need to pass that much information to the compositor, which would slow the CPU down.

Enablement of the GPUProcess

On this front, we enabled by default the compilation for WebGL rendering using the GPU process, and we are currently working in performance review and enabling it for other types of rendering.

New SVG engine (LBSE: Layer-Based SVG Engine)

If you are not familiar with this, here the idea is to make sure that we reuse the graphics pipeline used for HTML and CSS rendering, and use it also for SVG, instead of having its own pipeline. This means, among other things, that SVG layers will be supported as a 1st-class citizen in the engine, enabling HW-accelerated animations, as well as support for 3D transformations for individual SVG elements.

On this front, on this cycle we added support for the missing features in the LBSE, namely:

  • Implemented support for gradients & patterns (applicable to both fill and stroke)
  • Implemented support for clipping & masking (for all shapes/text)
  • Implemented support for markers
  • Helped review implementation of SVG filters (done by Apple)

Besides all this, we also improved the performance of the new layer-based engine by reducing repaints and re-layouts as much as possible (further optimizations still possible), narrowing the performance gap with the current engine for MotionMark. While we are still not at the same level of performance as the current SVG engine, we are confident that there are several key places where, with the right funding, we should be able to improve the performance to at least match the current engine, and therefore be able to push the new engine through the finish line.

General overhaul of the graphics pipeline, touching different areas (WIP):

On top of everything else commented above, we also worked on a general refactor and simplification of the graphics pipeline. For instance, we have been working on the removal of the Nicosia layer now that we are not planning to have multiple rendering implementations, among other things.

Multimedia

DMABuf-based sink for HW-accelerated video

We merged the DMABuf-based sink for HW-accelerated video in the GL-based GStreamer sink.

WebCodecs backend

We completed the implementation of  audio/video encoding and decoding, and this is now enabled by default in 2.46. As for the next steps, we plan to keep working on the integration of WebCodecs with WebGL and WebAudio.

GStreamer-based WebRTC backends

We continued working on GstWebRTC, bringing it to a point where it can be used in production in some specific use cases, and we will still be working on this in the next months.

Other

Besides the points above, we also added an optional text-to-speech backend based on libspiel to the development branch, and worked on general maintenance around the support for Media Source Extensions (MSE) and Encrypted Media Extensions (EME), which are crucial for the use case of WPE running in set-top-boxes, and is a permanent task we will continue to work on in the next months.

JavaScriptCore

ARMv7/32-bit support:

A lot of work happened around 32-bit support in JavaScriptCore, especially around WebAssembly (WASM): we ported the WASM BBQJIT and ported/enabled concurrent JIT support, and we also completed 80% of the implementation for the OMG optimization level of WASM, which we plan to finish in the next months. If you are unfamiliar with what the OMG and BBQ optimization tiers in WASM are, I’d recommend you to take a look at this article in webkit.org: Assembling WebAssembly.

We also contributed to the JIT-less WASM, which is very useful for embedded systems that can’t support JIT for security or memory related constraints, and also did some work on the In-Place Interpreter (IPInt), which is a new version of the WASM Low-level interpreter (LLInt) that uses less memory and executes WASM bytecode directly without translating it to LLInt bytecode  (and should therefore be faster to execute).

Last, we also contributed most of the implementation for the WASM GC, with the exception of some Kotlin tests.

As for the next few months, we plan to investigate and optimize heap/JIT memory usage in 32-bit, as well as to finish several other improvements on ARMv7 (e.g. IPInt).

New WPE API

The new WPE API is a new API that aims at making it easier to use WPE in embedded devices, by removing the hassle of having to handle several libraries in tandem (i.e. WPEWebKit, libWPE and WPEBackend-FDO, for instance), available from WPE’s releases page, and providing a more modern API in general, better aimed at the most common use cases of WPE.

A lot of effort happened this year along these lines, including the fact that we finally upstreamed and shipped its initial implementation with WPE 2.44, back in the first half of the year. Now, while we recommend users to give it a try and report feedback as much as possible, this new API is still not set in stone, with regular development still ongoing, so if you have the chance to try it out and share your experience, comments are welcome!

Besides shipping its initial implementation, we also added support for external platforms, so that other ones can be loaded beyond the Wayland, DRM and “headless” ones, which are the default platforms already included with WPE itself. This means for instance that a GTK4 platform, or another one for RDK could be easily used with WPE.

Then of course a lot of API additions were included in the new API in the latest months:

  • Screens management APIAPI to handle different screens, ask the display for the list of screens with their device scale factor, refresh rate, geometry…
  • Top level management API: This API allows a greater degree of control, for instance by allowing more than one WebView for the same top level, as well as allowing to retrieve properties such as size, scale or state (i.e. full screen, maximized…).
  • Maximized and minimized windows API: API to maximize/minimize a top level and monitor its state. mainly used by WebDriver.
  • Preferred DMA-BUF formats API: enables asking the platform (compositor or DRM) for the list of preferred formats and their intended use (scanout/rendering).
  • Input methods APIallows platforms to provide an implementation to handle input events (e.g. virtual keyboard, autocompletion, auto correction…).
  • Gestures API: API to handle gestures (e.g. tap, drag).
  • Buffer damaging: WebKit generates information about the areas of the buffer that actually changed and we pass that to DRM or the compositor to optimize painting.
  • Pointer lock API: allows the WebView to lock the pointer so that the movement of the pointing device (e.g. mouse) can be used for a different purpose (e.g. first-person shooters).

Last, we also added support for testing automation, and we can support WebDriver now in the new API.

With all this done so far, the plan now is to complete the new WPE API, with a focus on the Settings API and accessibility support, write API tests and documentation, and then also add an external platform to support GTK4. This is done on a best-effort basis, so there’s no specific release date.

WebKit on Android

This year was also a good year for WebKit on Android, also known as WPE Android, as this is a project that sits on top of WPE and its public API (instead of developing a fully-fledged WebKit port).

In case you’re not familiar with this, the idea here is to provide a WebKit-based alternative to the Chromium-based Web view on Android devices, in a way that leverages HW acceleration when possible and that it integrates natively (and nicely) with the several Android subsystems, and of course with Android’s native mainloop. Note that this is an experimental project for now, so don’t expect production-ready quality quite yet, but hopefully something that can be used to start experimenting with selected use cases.

If you’re adventurous enough, you can already try the APKs yourself from the releases page in GitHub at https://github.com/Igalia/wpe-android/releases.

Anyway, as for the changes that happened in the past 12 months, here is a summary:

  • Updated WPE Android to WPE 2.46 and NDK 27 LTS
  • Added support for WebDriver and included WPT test suites
  • Added support for instrumentation tests, and integrated with the GitHub CI
  • Added support for the remote Web inspector, very useful for debugging
  • Enabled the Skia backend, bringing HW-accelerated 2D rendering to WebKit on Android
  • Implemented prompt delegates, allowing implementing things such as alert dialogs
  • Implemented WPEView client interfaces, allowing responding to things such as HTTP errors
  • Packaged a WPE-based Android WebView in its own library and published in Maven Central. This is a massive improvement as now apps can use WPE Android by simply referencing the library from the gradle files, no need to build everything on their own.
  • Other changes: enabled HTTP/2 support (via the migration to libsoup3), added support for the device scale factor, improved the virtual on-screen keyboard, general bug fixing…

On top of that, we published 3 different blog posts covering different topics, from a general intro to a more deep dive explanation of the internals, and showing some demos. You can check them out in Jani’s blog at https://blogs.igalia.com/jani

As for the future, we’ll focus on stabilization and regular maintenance for now, and then we’d like to work towards achieving production-ready quality for specific cases if possible.

Quality Assurance

On the QA front, we had a busy year but in general we could highlight the following topics.

  • Fixed a lot of API tests failures in the bots that were limiting our test coverage.
  • Fixed lots of assertions-related crashes in the bots, which were slowing down the bots as well as causing other types of issues, such as bots exiting early due too many failures.
  • Enabled assertions in the release bots, which will help prevent crashes in the future, as well as with making our debug bots healthier.
  • Moved all the WebKitGTK and WPE bots to building now with Skia instead of Cairo. This means that all the bots running tests are now using Skia, and there’s only one bot still using Cairo to make sure that the compilation is not broken, but that bot does not run tests.
  • Moved all the WebKitGTK bots to use GTK4 by default. As with the move to Skia, all the WebKit bots running tests now use GTK4 and the only one remaining building with GTK3 does not run tests, it only makes sure we don’t break the GTK3 compilation for now.
  • Working on moving all the bots to use the new SDK. This is still work in progress and will likely be completed during 2025 as it’s needed to implement several changes in the infrastructure that will take some time.
  • General gardening and bot maintenance

In the next months, our main focus would be a revamp of the QA infrastructure to make sure that we can get all the bots (including the debug ones) to a healthier state, finish the migration of all the bots to the new SDK and, ideally, be able to bring back the ready-to-use WPE images that we used to have available in wpewebkit.org.

Security

The current release cadence has been working well, so we continue issuing major releases every 6 months (March, September), and then minor and unstable development releases happening on-demand when needed.

As usual, we kept aligning releases for WebKitGTK and WPE, with both of them happening at the same time (see https://webkitgtk.org/releases and https://wpewebkit.org/release), and then also publishing WebKit Security Advisories (WSA) when necessary, both for WebKitGTK and for WPE.

Last, we also shortened the time before including security fixes in stable releases this year, and we have removed support for libsoup2 from WPE, as that library is no longer maintained.

Tooling & Documentation

On tooling, the main piece of news is that this year we released the initial version of the new SDK,  which is developed on top of OCI-based containers. This new SDK fixes the issues with the current existing approaches based on JHBuild and flatpak, where one of them was great for development but poor for testing and QA, and the other one was great for testing and QA, but not very convenient for development.

This new SDK is regularly maintained and currently runs on Ubuntu 24.04 LTS with GCC 14 & Clang 18. It has been made public on GitHub and announced to the public in May 2024 in Patrick’s blog, and is now the officially recommended way of building WebKitGTK and WPE.

As for documentation, we didn’t do as much as we would have liked here, but we still landed a few contributions in docs.webkit.org, mostly related to WebKitGTK (e.g. Releases and VersioningSecurity UpdatesMultimedia). We plan to do more on this regard in the next months, though, mostly by writing/publishing more documentation and perhaps also some tutorials.

Final thoughts

This has been a fairly long blog post but, as you can see, it’s been quite a year for WebKit here at Igalia, with many exciting changes happening at several fronts, and so there was quite a lot of stuff to comment on here. This said, you can always check the slides of the presentation in the WebKit Contributors Meeting here if you prefer a more concise version of the same content.

In any case, what’s clear it’s that the next months are probably going to be quite interesting as well with all the work that’s already going on in WebKit and its Linux ports, so it’s possible that in 12 months from now I might be writing an equally long essay. We’ll see.

Thanks for reading!

by mario at November 03, 2024 05:20 PM

November 01, 2024

Stephanie Stimac

The `<details>` and `<summary>` elements are getting an upgrade

Form controls are notoriously difficult to style, something the web community has been talking about for years. In 2019, when I was still at Microsoft, I had been working with Greg Whitworth to start evangelizing the work that was being planned for <select>, as well as the Open UI community group that would help bring this plan to life.

There's a lot that has happened in that five years, and more still to come. Most recently I've seen work being done to improve the customizability of the <details> and <summary> elements. More stylable accordions. Exciting!

Why <details> is hard to work with #

The <details> element is a disclosure widget, which is a piece of UI that has a brief summary or heading and a control to expand the UI to show more details.

When you use <details> however, you don't have a lot of control over customizing it (like a lot of HTML Controls). The little triangle to indicate whether it's open or closed is not easily replaced. Styling or customizing <details> just isn't easy, which in turn means developers end up building a custom component for their accordions.

This ends up creating a lot of unnecessary work. Using existing HTML elements means you get all the security, accessibility and performance benefits that have already been baked in. The browser takes care of all that for you. Rebuilding from scratch means you've got to worry about adding all that back in, especially the accessibility bits.

But you can't style it how you want to, so you end up building it from scratch anyway. Rinse. Repeat. A tale as old as time.

Proposed updates #

There's quite a bit being proposed to help make <details> more customizable and interoperable between browsers (because no one likes it when browsers make things behave/display differently!)

A few highlights:

  1. Remove CSS display property restrictions so you can use other display types like flex & grid.
  2. Specify the structure of the shadow tree more clearly (this helps with flex/grid).
  3. Add pseudo-elements to give access to more parts leading to more stylability
  4. Better ability to animate these elements through additional changes (these are a part of a separate work stream but will benefit these elements).
  5. Improve ::marker styling

The exciting news is items 1 & 3 in the list above should be shipping in Chrome 131 Stable next week (first week of November 2024). This will bring a new ::details-content pseudo-element to the web, allowing more access to parts of <details>.

Further reading #

A Nod to Open UI #

Much of this work to improve form controls is started within the Open UI community group. The community there has been working for years to make progress in this space and getting all the browser vendors to agree and work together is often a difficult and time-consuming task. Cheers to all you do.

November 01, 2024 12:00 AM

October 28, 2024

Maíra Canal

Unleashing Power: Enabling Super Pages on the RPi

Unleashing the power of 3D graphics in the Raspberry Pi is a key commitment for Igalia through its collaboration with Raspberry Pi. The introduction of Super Pages for the Raspberry Pi 4 and 5 marks another step in this journey, offering some performance enhancements and more efficient memory usage. In this post, we’ll dive deep into the technical details of Super Pages, discuss the challenges we faced during implementation, and illustrate the benefits this feature brings to the Raspberry Pi ecosystem.

What are Super Pages?

A Memory Management Unit (MMU) is a hardware component responsible for handling memory access at the system level. It translates virtual addresses used by programs into physical addresses in main memory, enabling efficient memory management and protection. The MMU allows the operating system to allocate memory dynamically, isolating processes from one another to prevent them from interfering with each other’s memory.

Recommendation: 📚 Structured computer organization by Andrew Tanenbaum

The V3D MMU, which is part of the Broadcom GPU found in the Raspberry Pi 4 and 5, is responsible for translating 32-bit virtual addresses (VA) used by V3D into 40-bit physical addresses used externally to V3D. The MMU relies on a page table, stored in physical memory, which maps virtual addresses to their corresponding physical addresses. The operating system manages this page table, and the MMU uses it to perform address translation during memory access.

A fundamental principle of modern operating systems is that memory is not stored contiguously. Instead, a contiguous block of memory is divided into smaller blocks, called “pages”, which are scattered across the entire address space. These pages are typically 4KB in size. This approach enables more efficient memory management and allows for features like virtual memory and memory protection.

Over the years, the amount of available memory in computers has increased dramatically. An early IBM PC had up to 640 KiB of RAM, whereas the ThinkPad I’m typing on right now has 32 GB of RAM. Naturally, memory demands have grown alongside this increase. Today, it’s common for web browsers to consume several gigabytes of RAM, and a single shader can take up multiple megabytes.

As memory usage grows, a 4KB page size may become inefficient for managing large memory blocks. Handling a large number of small pages for a single block means the MMU must perform multiple address translations, which increases overhead. This can reduce the effectiveness of the Translation Lookaside Buffer (TLB), as it must store and handle more entries, potentially leading to more cache misses and reduced overall performance.

This is why many CPU manufacturers have introduced support for larger page sizes. For instance, x86 CPUs typically support 4KB and 2MB pages, with 1GB pages available if supported by the hardware. Similarly, ARM64 CPUs can support 4KB, 16KB, and 64KB page sizes. These larger page sizes help reduce the number of pages the MMU needs to manage, improving performance by reducing the overhead of address translation and making more efficient use of the TLB.

So, if CPUs are using bigger sizes, why shouldn’t GPUs do the same?

By default, V3D supports 4KB pages. However, by setting specific bits in the page table entry, it is possible to create 64KB “Big Pages” and 1MB “Super Pages.” The issue is that the current V3D driver available in Linux does not enable the use of Big or Super Pages, meaning this hardware feature is currently unused.

The advantage of enabling Big and Super Pages is that once an entry for any page within a Big or Super Page is cached in the MMU, it can be used to translate all virtual addresses within that page’s range without needing to fetch additional entries. In theory, this should result in improved performance, especially for applications with high memory demands, such as those using multiple large buffer objects (BOs).

As Igalia continually strives to enhance the experience for Raspberry Pi users, we decided to implement this feature in the upstream kernel. But before diving into the implementation details, let’s take a look at the real-world results and see if the theoretical benefits of Super Pages have translated into measurable improvements for Raspberry Pi users.

What Does This Feature Mean for RPi Users?

With Super Pages implemented, let’s now explore the actual performance improvements observed on the Raspberry Pi and see how impactful this feature is for users.

Benchmarking Super Pages: Traces and FPS Improvements

To measure the impact of Super Pages, we tested a variety of games and demos traces on the Raspberry Pi 4 and 5, covering genres from action to racing. On average, we observed a +1.40% FPS improvement on the Raspberry Pi 4 and a +1.30% improvement on the Raspberry Pi 5.

For instance, on the Raspberry Pi 4, Warzone 2100 saw an 8.36% FPS increase, and on the Raspberry Pi 5, Quake II enjoyed a 3.62% boost. These examples demonstrate the benefits of Super Pages in resource-demanding applications, where optimized memory handling becomes critical.

Raspberry Pi 4 FPS Improvements

Trace Before Super Pages After Super Pages Improvement
warzone2100.30secs.1024x768.trace 56.39 61.10 +8.36%
ue4_shooter_game_shooting_low_quality_640x480.gfxr 20.71 21.47 +3.65%
quake3e_capture_frames_1800_through_2400_1920x1080.gfxr 60.88 62.50 +2.67%
supertuxkart-menus_1024x768.trace 112.62 115.61 +2.65%
ue4_shooter_game_shooting_high_quality_640x480.gfxr 20.45 20.88 +2.10%
quake2-gles3-1280x720.trace 59.76 60.84 +1.82%
ue4_sun_temple_640x480.gfxr 27.60 28.03 +1.54%
vkQuake_capture_frames_1_through_1200_1280x720.gfxr 54.59 55.30 +1.29%
ue4_shooter_game_low_quality_640x480.gfxr 32.75 33.08 +1.00%
sponza_demo02_800x600.gfxr 20.90 21.03 +0.61%
supertuxkart-racing_1024x768.trace 8.58 8.63 +0.60%
ue4_shooter_game_high_quality_640x480.gfxr 19.62 19.74 +0.59%
serious_sam_trace02_1280x720.gfxr 44.00 44.21 +0.50%
ue4_vehicle_game-2_640x480.gfxr 12.59 12.65 +0.49%
sponza_demo01_800x600.gfxr 21.42 21.46 +0.19%
quake3e-1280x720.trace 84.45 84.52 +0.09%

Raspberry Pi 5 FPS Improvements

Trace Before Super Pages After Super Pages Improvement
quake2-gles3-1280x720.trace 151.77 157.26 +3.62%
supertuxkart-menus_1024x768.trace 306.79 313.88 +2.31%
warzone2100.30secs.1024x768.trace 140.92 144.03 +2.21%
vkQuake_capture_frames_1_through_1200_1280x720.gfxr 131.45 134.20 +2.10%
ue4_vehicle_game-2_640x480.gfxr 24.42 24.88 +1.89%
ue4_shooter_game_high_quality_640x480.gfxr 32.12 32.53 +1.29%
ue4_sun_temple_640x480.gfxr 42.05 42.55 +1.20%
ue4_shooter_game_shooting_high_quality_640x480.gfxr 52.77 53.31 +1.04%
quake3e-1280x720.trace 238.31 240.53 +0.93%
warzone2100.70secs.1024x768.trace 151.09 151.81 +0.48%
sponza_demo02_800x600.gfxr 50.81 51.05 +0.46%
supertuxkart-racing_1024x768.trace 20.91 20.98 +0.33%
ue4_shooter_game_low_quality_640x480.gfxr 59.68 59.86 +0.29%
quake3e_capture_frames_1_through_1800_1920x1080.gfxr 167.70 168.17 +0.29%
ue4_shooter_game_shooting_low_quality_640x480.gfxr 53.40 53.51 +0.22%
quake3e_capture_frames_1800_through_2400_1920x1080.gfxr 163.37 163.64 +0.17%
serious_sam_trace02_1280x720.gfxr 60.00 60.03 +0.06%
sponza_demo01_800x600.gfxr 45.04 45.04 <.01%

While an average +1% FPS improvement might seem modest, Super Pages can deliver more noticeable gains in memory-intensive 3D applications and when the GPU is under heavy usage. Let’s see how the Super Pages perform on Mesa CI.

Benchmarking Super Pages: Mesa CI Job Duration

To avoid introducing regressions in user-space, I usually test my custom kernels with Mesa CI, focusing on the “broadcom-postmerge” stage to verify that all Piglit and CTS tests ran smoothly. For Super Pages, I was pleasantly surprised by the job duration results, as some job durations were reduced by several minutes.

Mesa CI Jobs Duration Improvements

Job Before Super Pages After Super Pages
v3d-rpi4-traces:arm64 ~4m30s ~3m40s
v3d-rpi5-traces:arm64 ~3m30s ~2m45s
v3d-rpi4-gl-full:arm64 */6 ~24-25 minutes ~22-23 minutes
v3d-rpi5-gl-full:arm64 ~48 minutes ~48 minutes
v3dv-rpi4-vk-full:arm64 */6 ~44 minutes ~41 minutes
v3dv-rpi5-vk-full:arm64 ~102 minutes ~92 minutes

Seeing these reductions is especially rewarding. For example, the “v3dv-rpi5-vk-full:arm64” job duration decreased by 10 minutes, meaning more FPS for users and shorter wait times for Mesa developers.

Benchmarking Super Pages: PS2 Emulation

After sharing a couple of tables, I’ll admit that showcasing performance improvements solely through numbers doesn’t always convey the real impact. Personally, I find it more satisfying to see performance gains in action with real-world applications.

This led me to explore PlayStation 2 (PS2) emulation on the RPi 5. From watching YouTube videos, I noticed that PS2 is a popular console for the RPi 5. While the PlayStation (PS1) emulates well even on the RPi 4, and Nintendo 64 and Sega Saturn struggle across most hardware, PS2 hits a sweet spot for testing the RPi 5’s limits.

Fortunately, I still have my childhood PS2 — my second console after the Nintendo GameCube, and one of the most successful consoles worldwide, including in Brazil. With a library packed with titles like Metal Gear Solid, Resident Evil, Tomb Raider, and Shadow of the Colossus, the PS2 remains a great system for collectors and retro gamers alike.

I selected a few games from my collection to benchmark on the RPi 5 using a PS2 emulator. My emulator of choice was Aether SX2 with Vulkan support. Although AetherSX2 is no longer in development, it still performs well on the RPi.

Initially, many games were barely playable, especially those with large buffer objects, like Shadow of the Colossus and Gran Turismo 4. However, after enabling Super Pages support, I noticed immediate improvements. For example, Shadow of the Colossus wouldn’t even open before Super Pages, and while it’s not fully playable yet, it does load now. This isn’t a silver bullet, but it’s a step forward in improving the driver one piece at a time.

I ended up selecting four games for a video comparison: Burnout 3: Takedown, Metal Gear Solid 3: Snake Eater, Resident Evil 4, and Tekken 4.

Disclaimer: The BIOS used in the emulator was extracted from my own PS2, and I played only games I own, with ROMs I personally extracted. Neither I nor Igalia encourage using downloaded BIOS or ROM files from the internet.

From the video, we can see noticeable improvements in all four games. Although they aren’t perfectly playable yet, the performance gains are evident, particularly in Resident Evil 4, where the gameplay saw a solid 5 FPS boost. I realize 18 FPS might not satisfy most players, but I still had a lot of fun playing Resident Evil 4 on the RPi 5.

When tracking the FPS for these games, it’s clear that the performance gains go well beyond the average 1% seen in other benchmarks. Super Pages show their true potential in high-memory applications like PS2 emulation.

Having seen the performance gains Super Pages can bring to the Raspberry Pi, let’s now dive into the technical aspects of the feature.

Implementing Super Pages

The first challenge was figuring out how to allocate a contiguous block of memory using shmem. The Shared Memory Virtual Filesystem (shmem) is used as a flexible memory mechanism that allows the GPU and CPU to share access to BOs through the system’s temporary filesystem, tmpfs. tmpfs is a volatile filesystem that stores files in RAM, making it ideal for temporary or high-speed data that doesn’t need to persist on RAM.

For example, to allocate a 256KB BO across four 64KB pages, we need four contiguous 64KB memory blocks. However, by default, tmpfs only allocates memory in PAGE_SIZE chunks (as seen in shmem_file_setup()), whereas PAGE_SIZE is 4KB on the Raspberry Pi 4 and 16KB on the Raspberry Pi 5. Since the function drm_gem_object_init() — which initializes an allocated shmem-backed GEM object — relies on shmem_file_setup() to back these objects in memory, we had to consider alternatives, as the default PAGE_SIZE would divide memory into increments that are too small to ensure the large, contiguous blocks needed by the GPU.

The solution we proposed was to create drm_gem_object_init_with_mnt(), which allows us to specify the tmpfs mountpoint where the GEM object will be created. This enables us to allocate our BOs in a mountpoint that supports larger page sizes. Additionally, to ensure that our BOs are allocated in the correct mountpoint, we introduced drm_gem_shmem_create_with_mnt(), which allows the mountpoint to be specified when creating a new DRM GEM shmem object.

[PATCH v6 04/11] drm/gem: Create a drm_gem_object_init_with_mnt() function

[PATCH v6 06/11] drm/gem: Create shmem GEM object in a given mountpoint

The next challenge was figuring out how to create a new mountpoint that would allow for different page sizes based on the allocation. Simply creating a new tmpfs mountpoint with a fixed bigger page size wouldn’t suffice, as we needed flexibility for various allocations. Inspired by the i915 driver, we decided to use a tmpfs mountpoint with the “huge=within_size” flag. This flag, which requires the kernel to be configured with CONFIG_TRANSPARENT_HUGEPAGE, enables the allocation of huge pages.

Transparent Huge Pages (THP) is a kernel feature that automatically manages large memory pages to improve performance without needing changes from applications. THP dynamically combines smaller pages into larger ones, typically 2MB, reducing memory management overhead and improving cache efficiency.

To support our new allocation strategy, we created a dedicated tmpfs mountpoint for V3D, called gemfs, which provides us an ideal space for managing these larger allocations.

[PATCH v6 05/11] drm/v3d: Introduce gemfs

With everything in place for contiguous allocations, the next step was configuring V3D to enable Big/Super Page support.

We began by addressing a major source of memory pressure on the Raspberry Pi: the current 128KB alignment for allocations in the virtual memory space. This alignment wastes space when handling small BO allocations, especially since the userspace driver performs a large number of these small allocations.

As a result, we can’t fully utilize the 4GB address space available for the GPU on the Raspberry Pi 4 or 5. For example, we can currently allocate up to 32,000 BOs of 4KB (~140MB) and 3,000 BOs of 400KB (~1.3GB). This becomes a limitation for memory-intensive applications. By reducing the page alignment to 4KB, we can significantly increase the number of BOs, allowing up to 1,000,000 BOs of 4KB (~4GB) and 10,000 BOs of 400KB (~4GB).

Therefore, the first change I made was reducing the VA alignment of all allocations to 4KB.

[PATCH v6 07/11] drm/v3d: Reduce the alignment of the node allocation

With the alignment issue resolved, we can now implement the code to properly set the flags on the Page Table Entries (PTE) for Big/Super Pages. Setting these flags is straightforward — a simple bitwise operation. The challenge lies in determining which BOs can be allocated in Super Pages. For a BO to be eligible for a Big Page, its virtual address must be aligned to 64KB, and the same applies to its physical address. Same thing for Super Pages, but now the addresses must be aligned to 1MB.

If the BO qualifies for a Big/Super Page, we need to iterate over 16 4KB pages (for Big Pages) or 256 4KB pages (for Super Pages) and insert the appropriate PTE.

Additionally, we modified the way we iterate through the BO’s memory. This was necessary because the THP may not always allocate the entire BO contiguously. For example, it might only allocate contiguously 1MB of a 2MB block. To handle this, we now iterate over the blocks of contiguous memory scattered across the scatterlist, ensuring that each segment is properly handled during the allocation process.

What is a scatterlist? It is a Linux Kernel data structure that manages non-contiguous memory as if it were contiguous. It organizes separate memory blocks into a single logical buffer, allowing efficient data handling, especially in Direct Memory Access (DMA) operations, without needing a physically contiguous memory allocation.

[PATCH v6 08/11] drm/v3d: Support Big/Super Pages when writing out PTEs

However, the last few patches alone don’t fully enable the use of Super Pages. While PATCH 08/11 technically allows for Super Pages, we’re still relying on DRM GEM shmem objects, meaning allocations are still happening in PAGE_SIZE chunks. Although Big/Super Pages could potentially be used if the system naturally allocated 1MB or 64KB contiguously, this is quite rare and not our intended outcome. Our goal is to actively use Big/Super Pages as much as possible.

To achieve this, we’ll utilize the V3D-specific mountpoint we created earlier for BO allocation whenever possible. By creating BOs through drm_gem_shmem_create_with_mnt(), we can ensure that large pages are allocated contiguously when possible, enabling the consistent use of Big/Super Pages.

[PATCH v6 09/11] drm/v3d: Use gemfs/THP in BO creation if available

And there you have it — Big/Super Pages are now fully enabled in V3D. The only requirement to activate this feature in any given kernel is ensuring that CONFIG_TRANSPARENT_HUGEPAGE is enabled.

Final Words

You can learn more about ongoing enhancements to the Raspberry Pi driver stack in this XDC 2024 talk by José María “Chema” Casanova Crespo. In the talk, Chema discusses the Super Pages work I developed, along with other advancements in the driver stack.

Of course, there are still plenty of improvements on the horizon at Igalia. I’m currently experimenting with 64KB CLE allocations in user-space, and I hope to share more good news soon.

Finally, I’d like to express my gratitude to Iago Toral and Tvrtko Ursulin for their invaluable support in developing Super Pages for the V3D kernel driver. Thank you both for sharing your experience with me!

October 28, 2024 12:00 PM

October 26, 2024

Stephen Chenney

Caret Customization on the Web

A caret is the symbol used to show where text will appear in text input applications,such as a word processor, code editor or input form. It might be a vertical bar, or an underscore, or some other shape. It may flash, or pulse.

On the web, sites currently have some control over the editing caret through CSS, along with entirely custom solutions. Here I discuss the current state of caret customization on the web and look at proposals to allow more customization through CSS.

Carets on the Web #

The majority of web sites use the default editing caret. In all browsers it is a vertical bar in the text color that blinks. The blink rate and duration varies across browsers: Chrome’s cursor blinks continuously at a fixed rate; Firefox blinks the caret continuously at a rate from user settings; Safari blinks with keyboard input but switches to a strobe effect with dictation input.

All browsers support the CSS caret-color property allowing sites to color the caret independent of the text, and allow the color to be animated. You can also make the caret disappear with a transparent color.

Changing the caret shape is currently not possible through CSS in any browser. There are some ways to work around this, such as those documented in this Stack Overflow post. The general idea is hide the caret and replace it with an element controlled by script. The script relies on the selection range and other APIs to know where the element should be positioned. Or you can completely replace the default browser editing behavior with a custom javascript editor (as is done by, for example, Google Docs).

Browser support for caret customization currently leaves web developers in an awkward situation: accept the basic default cursor with color control, or implement a custom editor, but nothing in-between.

Improving Caret Customization #

There are at least two CSS properties not yet implemented in browsers that improve caret customization. The first concerns the interaction between color animation and the default blinking behavior, and the second adds support for different caret shapes.

The CSS caret-animation property #

When the caret-color is animated, there is no way to reliably synchronize the animation with the browser’s blinking rate. Different browsers blink at different rates, and it may be under user control. The CSS caret-animation property, when set to the value manual suppresses the blinking, giving the color animation complete control over the color of the caret at all times. Using caret-animation: auto (the initial value) leaves the blinking behavior under browser control.

Site control of the blinking is both an accessibility benefit and potentially harmful to users. For users sensitive to motion, disabling the blinking may be a benefit. At the same time, a cursor that does not blink is much harder for users to recognize. Please use caution when employing this property when it is available in browsers.

There is an implementation of caret-animation in Chrome 132 behind the CSSCaretAnimation flag, accessible through --enable-blink-features="CSSCaretAnimation" on the command line. Feel free to comment on the bug if you have thoughts on this feature.

The CSS caret-shape property #

The shape of the caret in native applications is most commonly a vertical bar, an underscore or a rectangular block. In addition, the shape often varies depending on the input mode, such as insert or replace. The CSS caret-shape property allows sites to choose one of these shapes for the caret, or leave the choice up to the browser. The recognized property values are auto, bar, block and underscore.

No browser currently supports the caret-shape property, and the specification needs a little work to confirm the exact location and shape of the underscore and block. Please leave feedback on the Chromium bug if you would like this feature to be implemented.

Other proposed caret properties #

The caret-shape property does not allow any control over the size of the caret, such as the thickness of the bar or block. There was, for example, a proposal to add caret-width as a means for working around browser bugs on zooming and transforming the caret. Please create a new CSS working Group issue if you would like greater customization and can provide use cases.

October 26, 2024 12:00 AM

October 23, 2024

Víctor Jáquez

GStreamer Conference 2024

Early this month I spent a couple weeks in Montreal, visiting the city, but mostly to attend the GStreamer Conference and the following hackfest, which happened to be co-located with the XDC Conference. It was my first time in Canada and I utterly enjoyed it. Thanks to all those from the GStreamer community that organized and attended the event.

For now you can replay the whole streams of both days of conference, but soon the split videos for each talk will be published:

GStreamer Conference 2024 - Day 1, Room 1 - October 7, 2024

https://www.youtube.com/watch?v=KLUL1D53VQI


GStreamer Conference 2024 - Day 1, Room 2 - October 7, 2024

https://www.youtube.com/watch?v=DH64D_6gc80


GStreamer Conference 2024 - Day 2, Room 2 - October 8, 2024

https://www.youtube.com/watch?v=jt6KyV757Dk


GStreamer Conference 2024 - Day 2, Room 1 - October 8, 2024

https://www.youtube.com/watch?v=W4Pjtg0DfIo


And a couple pictures of igalians :)

GStreamer Skia plugin
GStreamer Skia plugin

GStreamer VA
GStreamer VA

GstWebRTC
GstWebRTC

GstVulkan
GstVulkan

GStreamer Editing Services
gstreamer editing services

October 23, 2024 12:00 AM

October 22, 2024

Stephen Chenney

CSS Highlight Inheritance has Shipped!

The CSS highlight inheritance model describes the process for inheriting the CSS properties of the various highlight pseudo elements:

  • ::selection controlling the appearance of selected content
  • ::spelling-error controlling the appearance of misspelled word markers
  • ::grammar-error controlling how grammar errors are marked
  • ::target-text controlling the appearance of the string matching a target-text URL
  • ::highlight defining the appearance of a named highlight, accessed via the CSS highlight API

The feature has now launched in Chrome 131, and this post describes how to achieve the old selection styling behavior if selection is now not visible, or newly visible, on your site.

The most common problem comes when a site relies on ::selection properties only applying to the matched elements and not child elements. For example, sites don’t want <div> elements themselves to be visible, but do want the text content inside them to show the selected text. Some sites may have used such styling to work around old bugs in browsers whereby the highlight on the selected <div> would be too large.

Previously, this style block would have the intended result:

<style>
div::selection {
background-color: transparent;
}
</style>

With CSS highlight inheritance the children of the <div> will now also have transparent selection, which is probably not what was intended.

There’s a good chance that the selection styling added to work around bugs is no longer needed. So the first thing to try is removing the style rule entirely.

Should you really want to style selection on an element but not its children, simplest is to add a style rule to apply the default selection behavior to everything that is not a <div> (or whatever element you are selecting):

<style>
div::selection {
background-color: transparent;
}
:not(div)::selection {
background-color: Highlight;
}
</style>

Highlight is the system default selection background color, whatever that might be in the browser and operating system you are using.

More efficient is to use the child combinator because it will only create new pseudo style objects for the <div> and its direct children, inheriting the rest:

<style>
div::selection {
background-color: transparent;
}
div > *::selection {
background-color: Highlight;
}
</style>

Overall, the former may be better if you have multiple ::selection styles or your ::selection styles already use complex rules, because it avoids the need to have a matching child selector for every one of those.

There are other situations, but building on the examples we suggest two generalizations:

  • Make use of the CSS system colors Highlight and HighlightText to revert selection back to the platform default.
  • Augment any existing ::selection style rules with complementary rules to cancel the selection style changes on other elements that should not inherit the change.

Chrome issue 374528707 may be used to report breakage due to enabling CSS highlight inheritance and to see fixes for the reported cases.

October 22, 2024 12:00 AM

October 17, 2024

Jani Hautakangas

Bringing WebKit back to Android: Internals

In my last blog post, I introduced the WPE-Android project by providing a high-level overview of what the project aims to achieve and the motivations behind bringing WebKit back to Android. This post will take a deeper dive into the technical details and internal structure of WPE-Android, focusing on some key areas of its design and implementation.

WPE-Android is not a standalone WebKit port but rather an extension and adaptation of the existing WPE WebKit APIs. By leveraging the libwpe adaptation layer library and a platform-specific libwpebackend plugin, WPE-Android integrates seamlessly with Android. These components work together to provide WebKit with the necessary access to Android-specific functionality, enabling it to render web content efficiently on Android devices.

WPE-Android High Level

Overall Design #

At the core of WPE-Android lies the native WPE WebKit codebase, which powers much of the browser functionality. However, for Android applications to interact with this native code, a bridge must be established between the Java environment of Android apps and the C++ world of WebKit. This is where the Java Native Interface (JNI) comes into play.

The JNI allows Java code to call native methods implemented in C or C++ and vice versa. In the case of WPE-Android, a robust JNI layer is implemented to facilitate the communication between the Android system and the WebKit engine. This layer consists of several utility classes, which are designed to handle JNI calls efficiently and reduce the possibility of code duplication. These classes essentially act as intermediaries, ensuring that all interactions with the WebKit engine are managed smoothly and reliably.

Below is a simplified diagram of the overall design of WPE-Android internals, highlighting the key components involved in this architecture.

WPE-Android Design

WPEBackend-Android: Bridging the Platform #

WPE-Android relies heavily on the libwpe library to enable platform-specific functionalities. The role of libwpe is crucial because it abstracts away the platform-specific details, allowing WebKit to run on various platforms, including Android, without needing to be aware of the underlying system intricacies.

One of the primary responsibilities of libwpe is to interface with the platform-specific libwpebackend plugin, which handles tasks such as platform graphics buffer support and the sharing of buffers between the WebKit UIProcess and WebProcess. The libwpebackend plugin ensures that graphical content generated by the WebProcess can be efficiently displayed on the device’s screen by the UIProcess.

Although this libwpebackend plugin is a critical component of WPE-Android, I won’t go into describing its detailed implementation in this post. However, for those interested in the internal workings of the WPE Backend, I highly recommend reading Loïc Le Page’s comprehensive blog post on the subject: Create WPE Backends. In WPE-Android, this backend functionality is implemented in the WPEBackend-Android repository.

Recently, WPE-Android has been upgraded to WPE WebKit 2.46.0, which introduces an initial version of the new WPE adaptation API called WPE Platform API. This API is designed to provide a cleaner and more flexible way of integrating WPE WebKit with various platforms. However, since this API is still under active development, WPE-Android currently continues to use the older libwpe API.

The diagram below shows the internals of the WPEBackend-android and how it integrates to WPE WebKit

WPEBackend-android

Rendering Web Content: The Journey from WebProcess to Screen #

Rendering web content efficiently on Android involves several moving parts. Once the WebProcess has generated the graphical frames, these frames need to be displayed on the device’s screen in a seamless and performant manner. To achieve this, WPE-Android makes use of the Android SurfaceControl component.

SurfaceControl plays a key role in managing the surface that displays the rendered content. It allows buffers generated by the WebProcess to be posted directly to SurfaceFlinger, which is Android’s system compositor responsible for combining multiple surfaces into a single screen image. This direct posting of buffers to SurfaceFlinger ensures that the rendered frames are composed and displayed in a highly efficient way, with minimal overhead.

The diagram below illustrates how the WebProcess-generated frames are transferred to the Android system and eventually rendered on the device’s screen.

WPE-Android Rendering

Event loop #

WPE WebKit is built on top of GLib, which provides a wide array of functionality for managing events, network communications, and other system-level tasks. GLib is essential to the smooth operation of WPE WebKit, especially when it comes to handling asynchronous events and running the event loop.

To integrate WPE WebKit’s GLib-based event loop with the Android platform, WPE-Android uses a mechanism that drives the WebKit event loop using the Android looper. Specifically, event file descriptors (FDs) from the WPE WebKit GLib main loop are fed into the Android main looper. On each iteration of the event loop, WPE-Android checks whether any of the file descriptors in the GLib main loop have changed. Based on these changes, WPE-Android adjusts the Android looper by adding or removing FDs as necessary.

This complex logic is implemented in a class called MessagePump, which handles the synchronization between the two event loops and ensures that events are processed in a timely manner.

WPE-Android MessagePump

What’s New Since the Last Update #

Since the last update, WPE-Android has undergone a number of significant changes. These updates have brought new features, bug fixes, and performance improvements, making the project even more robust and capable of handling modern web content on Android devices. Below is a list of the most notable changes:

  • Upgrade to Cerbero 1.24.8: This upgrade ensures better compatibility and improved build processes.
  • Upgrade to NDK r27 LTS: The move to the latest NDK release provides long-term support and brings several new features and optimizations.
  • Upgrade to WPE WebKit 2.46.0: This version of WPE WebKit introduces new functionality, including support for the Skia backend.
  • Enabled Skia backend: Skia is a graphics engine that provides hardware-accelerated rasterization, which enhances rendering performance.
  • Enabled Remote Inspector: This allows developers to remotely inspect and debug web content.
  • Published WPEView to Maven Central: The WPEView library is now publicly available, making it easier for developers to integrate WPE-Android into their own projects.
  • Major bug fixes: Significant improvements have been made to the GLib main loop integration, ensuring smoother operation and fewer edge cases.
  • Dropped gnutls in favor of openssl: This change simplifies the security stack.
  • Streamlined libraries: Unnecessary deployment of libraries that are already provided by Android as system libraries has been removed.
  • Refined WPEView API: Internal methods that should not be exposed to developers have been moved to implementation details, reducing API surface and making the library more user-friendly

Demos #

Mediaplayer #

Demo shows WPE-Android mediaplayer demo on Android device. Demo source code can be found in mediaplayer

Remote Inspector #

Demo shows Remote Inspector usage on Android device. Detailed instructions how to run Remote Inspector can be found in README.md

This demo shows loading the www.igalia.com webpage in a desktop browser and connecting it to a remote inspector service on device. The video demonstrates how webpage elements can be inspected and edited in real-time using the remote inspector.

Try it yourself #

With the recent release of WPEView to Maven Central, it’s now easier than ever to experiment with WPE-Android and integrate it into your own Android projects

To get started, make sure you have mavenCentral() included in your project’s repository configuration. Here’s how you can do it:

dependencyResolutionManagement {
repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
repositories {
google()
mavenCentral()
}
}

Once that’s done, you can add WPEView as a dependency in your project:

dependencies {
implementation("org.wpewebkit.wpeview:wpeview:0.1.0")
...
}

If you’re interested in learning more or contributing to the project, you can find all the details on the WPE-Android GitHub page. We welcome feedback, contributions, and new ideas!

This project is partially funded by the NLNet Foundation, and we appreciate their support in making it possible.

October 17, 2024 12:00 AM

October 16, 2024

Enrique Ocaña

Adding buffering hysteresis to the WebKit GStreamer video player

The <video> element implementation in WebKit does its job by using a multiplatform player that relies on a platform-specific implementation. In the specific case of glib platforms, which base their multimedia on GStreamer, that’s MediaPlayerPrivateGStreamer.

WebKit GStreamer regular playback class diagram

The player private can have 3 buffering modes:

  • On-disk buffering: This is the typical mode on desktop systems, but is frequently disabled on purpose on embedded devices to avoid wearing out their flash storage memories. All the video content is downloaded to disk, and the buffering percentage refers to the total size of the video. A GstDownloader element is present in the pipeline in this case. Buffering level monitoring is done by polling the pipeline every second, using the fillTimerFired() method.
  • In-memory buffering: This is the typical mode on embedded systems and on desktop systems in case of streamed (live) content. The video is downloaded progressively and only the part of it ahead of the current playback time is buffered. A GstQueue2 element is present in the pipeline in this case. Buffering level monitoring is done by listening to GST_MESSAGE_BUFFERING bus messages and using the buffering level stored on them. This is the case that motivates the refactoring described in this blog post, what we actually wanted to correct in Broadcom platforms, and what motivated the addition of hysteresis working on all the platforms.
  • Local files: Files, MediaStream sources and other special origins of video don’t do buffering at all (no GstDownloadBuffering nor GstQueue2 element is present on the pipeline). They work like the on-disk buffering mode in the sense that fillTimerFired() is used, but the reported level is relative, much like in the streaming case. In the initial version of the refactoring I was unaware of this third case, and only realized about it when tests triggered the assert that I added to ensure that the on-disk buffering method was working in GST_BUFFERING_DOWNLOAD mode.

The current implementation (actually, its wpe-2.38 version) was showing some buffering problems on some Broadcom platforms when doing in-memory buffering. The buffering levels monitored by MediaPlayerPrivateGStreamer weren’t accurate because the Nexus multimedia subsystem used on Broadcom platforms was doing its own internal buffering. Data wasn’t being accumulated in the GstQueue2 element of playbin, because BrcmAudFilter/BrcmVidFilter was accepting all the buffers that the queue could provide. Because of that, the player private buffering logic was erratic, leading to many transitions between “buffer completely empty” and “buffer completely full”. This, it turn, caused many transitions between the HaveEnoughData, HaveFutureData and HaveCurrentData readyStates in the player, leading to frequent pauses and unpauses on Broadcom platforms.

So, one of the first thing I tried to solve this issue was to ask the Nexus PlayPump (the subsystem in charge of internal buffering in Nexus) about its internal levels, and add that to the levels reported by GstQueue2. There’s also a GstMultiqueue in the pipeline that can hold a significant amount of buffers, so I also asked it for its level. Still, the buffering level unstability was too high, so I added a moving average implementation to try to smooth it.

All these tweaks only make sense on Broadcom platforms, so they were guarded by ifdefs in a first version of the patch. Later, I migrated those dirty ifdefs to the new quirks abstraction added by Phil. A challenge of this migration was that I needed to store some attributes that were considered part of MediaPlayerPrivateGStreamer before. They still had to be somehow linked to the player private but only accessible by the platform specific code of the quirks. A special HashMap attribute stores those quirks attributes in an opaque way, so that only the specific quirk they belong to knows how to interpret them (using downcasting). I tried to use move semantics when storing the data, but was bitten by object slicing when trying to move instances of the superclass. In the end, moving the responsibility of creating the unique_ptr that stored the concrete subclass to the caller did the trick.

Even with all those changes, undesirable swings in the buffering level kept happening, and when doing a careful analysis of the causes I noticed that the monitoring of the buffering level was being done from different places (in different moments) and sometimes the level was regarded as “enough” and the moment right after, as “insufficient”. This was because the buffering level threshold was one single value. That’s something that a hysteresis mechanism (with low and high watermarks) can solve. So, a logical level change to “full” would only happen when the level goes above the high watermark, and a logical level change to “low” when it goes under the low watermark level.

For the threshold change detection to work, we need to know the previous buffering level. There’s a problem, though: the current code checked the levels from several scattered places, so only one of those places (the first one that detected the threshold crossing at a given moment) would properly react. The other places would miss the detection and operate improperly, because the “previous buffering level value” had been overwritten with the new one when the evaluation had been done before. To solve this, I centralized the detection in a single place “per cycle” (in updateBufferingStatus()), and then used the detection conclusions from updateStates().

So, with all this in mind, I refactored the buffering logic as https://commits.webkit.org/284072@main, so now WebKit GStreamer has a buffering code much more robust than before. The unstabilities observed in Broadcom devices were gone and I could, at last, close Issue 1309.

by eocanha at October 16, 2024 06:12 AM

October 15, 2024

Alex Bradbury

Accessing a cross-compiled build tree from qemu-system

I'm cross-compiling a large codebase (LLVM and sub-projects such as Clang) and want to access the full tree - the source and the build artifacts from under qemu. This post documents the results of my experiments in various ways to do this. Note that I'm explicitly choosing to run timings using something that approximates the work I want to do rather than any microbenchmark targeting just file access time.

Requirements:

  • The whole source tree and the build directories are accessible from the qemu-system VM.
  • Writes must be possible, but don't need to be reflected in the original directory (so importantly, although files are written as part of the test being run, I don't need to get them back out of the virtual machine).

Results

The test simply involves taking a cross-compiled LLVM tree (with RISC-V as the only enabled target) and running the equivalent of ninja check-llvm on it after exposing / transferring it to the VM using the listed method. Results in chart form for overall time taken in the "empty cache" case:

9pfs
13m05s 
virtiofsd
13m56s 
squashfs
11m33s 
ext4
11m42s 
  • 9pfs: 13m05s with empty cache, 12m45s with warm cache
  • virtiofsd: 13m56s with empty cache, 13m49s with warm cache
  • squashfs: 11m33s (~17s image build, 11m16s check-llvm) with empty cache, 11m16s (~3s image build, 11m13s check-llvm) with warm cache
  • new ext4 filesystem: 11m42s, (~38s image build, 11m4s check-llvm) with empty cache, 11m34s (~30s image build, 11m4s check-llvm) with warm cache.

By way of comparison, it takes ~10m40s to naively transfer the data by piping it over ssh (tarring on one end, untarring on the other).

For this use case, building and mounting a filesystem seems the most compelling option with the ext4 being overall simplest (no need to use overlayfs) assuming you have an e2fsprogs new enough to support tar input to mkfs.ext4. I'm not sure why I'm not seeing the purported performance improvements when using virtiofsd.

Notes on test setup

  • The effects of caching are accounted for by taking measurements in two scenarios below. Although neither is a precise match for the situation in a typical build system, it at least gives some insight into the impact of the page cache in a controlled manner.
    • "Empty cache". Caches are dropped (sync && echo 3 | sudo tee /proc/sys/vm/drop_caches) before building the filesystem (if necessary) and launching QEMU.
    • "Warm cache". Caches are dropped as above but immediately after that (i.e. before building any needed filesystem) we load all files that will be accessed so they will be cached. For simplicity, just do this by doing using a variant of the tar command described below.
  • Running on AMD Ryzen 9 7950X3D with 128GiB RAM.
  • QEMU 9.1.0, running qemu-system-riscv64, with 32 virtual cores matching the number 16 cores / 32 hyper-threads in the host.
  • LLVM HEAD as of 2024-10-13 (48deb3568eb2452ff).
  • Release+Asserts build of LLVM with all non-experimental targets enabled. Built in a two-stage cross targeting rv64gc_zba_zbb_zbs.
  • Running riscv64 Debian unstable within QEMU.
  • Results gathered via time ../bin/llvm-lit -sv . running from test/ within the build directory, and combining this with the timing for whatever command(s) are needed to transfer and extract the source+build tree.
  • Only the time taken to build the filesystem (if relevant) and to run the tests is measured. There may be very minor difference in the time taken to mount the different filesystems, but we assume this is too small to be worth measuring.
  • If I were to rerun this, I would invoke lit with --order=lexical to ensure runtimes aren't affected by the order of tests (e.g. long-running tests being scheduled last).
  • Total size of the files being exposed is ~8GiB or so.
  • Although not strictly necessary, I've tried to map to the uid:gid pair used in the guest where possible.

Details: 9pfs (QEMU -virtfs)

  • Add -virtfs local,path=$HOME/llvm-project/,mount_tag=llvm,security_model=none,id=llvm to the QEMU command line.
  • Within the guest sudo mount -t 9p -o trans=virtio,version=9p2000.L,msize=512000 llvm /mnt. I haven't swept different parameters for msize and mount reports 512000 is the maximum for the virtio transport.
  • Sadly 9pfs doesn't support ID-mapped mounts at the moment, which would allow us to easily remap the uid and gid used within the file tree on the host to one of an arbitrary user account in the guest without chowning everything (which would also require security_model=mapped-xattr or security_model=mapped-file to be passed to qemu. In the future you should be able to do sudo mount --bind --map-users 1000:1001:1 --map-groups 984:1001:1 /mnt mapped
  • We now need to set up an overlayfs that we can freely write to.
    • mkdir -p upper work llvm-project
    • sudo mount -t overlay overlay -o lowerdir=/mnt,upperdir=$HOME/upper,workdir=$HOME/work $HOME/llvm-project
  • You could create a user account with appropriate uid/gid to match those in the exposed directory, but for simplicity I've just run the tests as root within the VM.

Details: virtiofsd

  • Installed the virtiofsd package on Arch Linux (version 1.11.1). Refer to the documentation for examples of how to run.
  • Run /usr/lib/virtiofsd --socket-path=/tmp/vhostqemu --shared-dir=$HOME/llvm-project --cache=always
  • Add the following commands to your qemu launch command (where the memory size parameter to -object seems to need to match the -m argument you used):
  -chardev socket,id=char0,path=/tmp/vhostqemu \
  -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=llvm \
  -object memory-backend-memfd,id=mem,size=64G,share=on -numa node,memdev=mem
  • sudo mount -t virtiofs llvm /mnt
  • Unfortunately id-mapped mounts also aren't quite supported for virtiofs at the time of writing. The changes were recently merged into the upstream kernel and a merge request to virtiofsd is pending.
  • mkdir -p upper work llvm-project
  • sudo mount -t overlay overlay -o lowerdir=/mnt,upperdir=$HOME/upper,workdir=$HOME/work $HOME/llvm-project
  • As with the 9p tests, just ran the tests using sudo.

Details: squashfs

  • I couldn't get a regex that squashfs would accept to exclude all build/* directories except build/stage2cross, so use find to generate the individual directory exclusions.
  • As you can see in the commandline below, I disabled compression altogether.
  • mksquashfs ~/llvm-project llvm-project.squashfs -no-compression -force-uid 1001 -force-gid 1001 -e \ .git $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '-e build/%P ')
  • Add this to the qemu command line: -drive file=$HOME/llvm-project.squashfs,if=none,id=llvm,format=raw
  • sudo mount -t squashfs /dev/vdb /mnt
  • Use the same instructions for setting up overlayfs as for 9pfs above. There's no need to run the tests using sudo as we've set the appropriate uid.

Details: new ext4 filesystem

  • Make use of the -d parameter of mkfs.ext4 which allows you to create a filesystem, using the given directory or tarball to initialise it. As mkfs.ext4 lacks features for filtering the input or forcing the uid/gid of files, we rely on tarball input (only just added in e2fsprogs 1.47.1).
  • Create a filesystem image that's big enough with fallocate -l 15GiB llvm-project.img
  • Execute:
tar --create \
  --file=- \
  --owner=1001 \
  --group=1001 \
  --exclude=.git \
  $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '--exclude=build/%P ') \
  -C $HOME/llvm-project . \
  | mkfs.ext4 -d - llvm-project.img
  • Attach the drive by adding this to the qemu command line -device virtio-blk-device,drive=hdb -drive file=$HOME/llvm-project.img,format=raw,if=none,id=hdb
  • Mount in qemu with sudo mount -t ext4 /dev/vdb $HOME/llvm-project

Details: tar piped over ssh

  • Essentially, use the tar command above and pipe it to tar running under qemu-system over ssh to extract it to root filesystem (XFS in my VM setup).
  • This could be optimised. e.g. by selecting a less expensive ssh encryption algorithm (or just netcat), a better networking backend to QEMU, and so on. But the intent is to show the cost of the naive "obvious" solution.
  • Execute:
tar --create \
  --file=- \
  --owner=1001 \
  --group=1001 \
  --exclude=.git \
  $(find $HOME/llvm-project/build -maxdepth 1 -mindepth 1 -type d -not -name 'stage2cross' -printf '--exclude=build/%P ') \
  -C $HOME/llvm-project . \
  | ssh -p10222 asb@localhost "mkdir -p llvm-project && tar xf - -C \
  llvm-project"

Article changelog
  • 2024-12-04: Correct where I'd accidentally copied too much for the ext4 qemu command line.
  • 2024-10-15: Initial publication date.

October 15, 2024 12:00 PM

October 03, 2024

Andy Wingo

preliminary notes on a nofl field-logging barrier

When you have a generational collector, you aim to trace only the part of the object graph that has been allocated recently. To do so, you need to keep a remembered set: a set of old-to-new edges, used as roots when performing a minor collection. A language run-time maintains this set by adding write barriers: little bits of collector code that run when a mutator writes to a field.

Whippet’s nofl space is a block-structured space that is appropriate for use as an old generation or as part of a sticky-mark-bit generational collector. It used to have a card-marking write barrier; see my article diving into V8’s new write barrier, for more background.

Unfortunately, when running whiffle benchmarks, I was seeing no improvement for generational configurations relative to whole-heap collection. Generational collection was doing fine in my tiny microbenchmarks that are part of Whippet itself, but when translated to larger programs (that aren’t yet proper macrobenchmarks), it was a lose.

I had planned on doing some serious tracing and instrumentation to figure out what was happening, and thereby correct the problem. I still plan on doing this, but instead for this issue I used the old noggin technique instead: just, you know, thinking about the thing, eventually concluding that unconditional card-marking barriers are inappropriate for sticky-mark-bit collectors. As I mentioned in the earlier article:

An unconditional card-marking barrier applies to stores to slots in all objects, not just those in oldspace; a store to a new object will mark a card, but that card may contain old objects which would then be re-scanned. Or consider a store to an old object in a more dense part of oldspace; scanning the card may incur more work than needed. It could also be that Whippet is being too aggressive at re-using blocks for new allocations, where it should be limiting itself to blocks that are very sparsely populated with old objects.

That’s three problems. The second is well-known. But the first and last are specific to sticky-mark-bit collectors, where pages mix old and new objects.

a precise field-logging write barrier

Back in 2019, Steve Blackburn’s paper Design and Analysis of Field-Logging Write Barriers took a look at the state of the art in precise barriers that record not regions of memory that have been updated, but the precise edges (fields) that were written to. He ends up re-using this work later in the 2022 LXR paper (see §3.4), where the write barrier is used for deferred reference counting and a snapshot-at-the-beginning (SATB) barrier for concurrent marking. All in all field-logging seems like an interesting strategy. Relative to card-marking, work during the pause is much less: you have a precise buffer of all fields that were written to, and you just iterate that, instead of iterating objects. Field-logging does impose some mutator cost, but perhaps the payoff is worth it.

To log each old-to-new edge precisely once, you need a bit per field indicating whether the field is logged already. Blackburn’s 2019 write barrier paper used bits in the object header, if the object was small enough, and otherwise bits before the object start. This requires some cooperation between the collector, the compiler, and the run-time that I wasn’t ready to pay for. The 2022 LXR paper was a bit vague on this topic, saying just that it used “a side table”.

In Whippet’s nofl space, we have a side table already, used for a number of purposes:

  1. Mark bits.

  2. Iterability / interior pointers: is there an object at a given address? If so, it will have a recognizable bit pattern.

  3. End of object, to be able to sweep without inspecting the object itself

  4. Pinning, allowing a mutator to prevent an object from being evacuated, for example because a hash code was computed from its address

  5. A hack to allow fully-conservative tracing to identify ephemerons at trace-time; this re-uses the pinning bit, since in practice such configurations never evacuate

  6. Bump-pointer allocation into holes: the mark byte table serves the purpose of Immix’s line mark byte table, but at finer granularity. Because of this though, it is swept lazily rather than eagerly.

  7. Generations. Young objects have a bit set that is cleared when they are promoted.

Well. Why not add another thing? The nofl space’s granule size is two words, so we can use two bits of the byte for field logging bits. If there is a write to a field, a barrier would first check that the object being written to is old, and then check the log bit for the field being written. The old check will be to a byte that is nearby or possibly the same as the one to check the field logging bit. If the bit is unsert, we call out to a slow path to actually record the field.

preliminary results

I disassembled the fast path as compiled by GCC and got something like this on x86-64, in AT&T syntax, for the young-generation test:

mov    %rax,%rdx
and    $0xffffffffffc00000,%rdx
shr    $0x4,%rax
and    $0x3ffff,%eax
or     %rdx,%rax
testb  $0xe,(%rax)

The first five instructions compute the location of the mark byte, from the address of the object (which is known to be in the nofl space). If it has any of the bits in 0xe set, then it’s in the old generation.

Then to test a field logging bit it’s a similar set of instructions. In one of my tests the data type looks like this:

struct Node {
  uintptr_t tag;
  struct Node *left;
  struct Node *right;
  int i, j;
};

Writing the left field will be in the same granule as the object itself, so we can just test the byte we fetched for the logging bit directly with testb against $0x80. For right, we should be able to know it’s in the same slab (aligned 4 MB region) and just add to the previously computed byte address, but the C compiler doesn’t know that right now and so recomputes. This would work better in a JIT. Anyway I think these bit-swizzling operations are just lost in the flow of memory accesses.

For the general case where you don’t statically know the offset of the field in the object, you have to compute which bit in the byte to test:

mov    %r13,%rcx
mov    $0x40,%eax
shr    $0x3,%rcx
and    $0x1,%ecx
shl    %cl,%eax
test   %al,%dil

Is it good? Well, it improves things for my whiffle benchmarks, relative to the card-marking barrier, seeing a 1.05×-1.5× speedup across a range of benchmarks. I suspect the main advantage is in avoiding the “unconditional” part of card marking, where a write to a new object could cause old objects to be added to the remembered set. There are still quite a few whiffle configurations in which the whole-heap collector outperforms the sticky-mark-bit generational collector, though; I hope to understand this a bit more by building a more classic semi-space nursery, and comparing performance to that.

Implementation links: the barrier fast-path, the slow path, and the sequential store buffers. (At some point I need to make it so that allocating edge buffers in the field set causes the nofl space to page out a corresponding amount of memory, so as to be honest when comparing GC performance at a fixed heap size.)

Until next time, onwards and upwards!

by Andy Wingo at October 03, 2024 08:54 AM

October 01, 2024

Alex Bradbury

Arch Linux on remote server setup runbook

As I develop, I tend to offload a lot of the computationally heavy build or test tasks to a remote build machine. It's convenient for this to match the distribution I use on my local machine, and I so far haven't felt it would be advantageous to add another more server-oriented distro to the mix to act as a host to an Arch Linux container. This post acts as a quick reference / runbook for me on the occasions I want to spin up a new machine with this setup. To be very explicit, this is shared as something that might happen to be helpful if you have similar requirements, or to give some ideas if you have different ones. I definitely don't advocate that you blindly copy it.

Although this is something that could be fully automated, I find structuring this kind of thing as a series of commands to copy and paste and check/adjust manually hits the sweetspot for my usage. As always, the Arch Linux wiki and its install guide is a fantastic reference that you should go and check if you're unsure about anything listed here.

Details about my setup

  • I'm typically targeting Hetzner servers with two SSDs. I'm not worried about data loss or minor downtime waiting for a disk to be replaced before re-imaging, so RAID0 is just fine. The commands below assume this setup.
  • I do want encrypted root just in case anything sensitive is ever copied to the machine. As it is a remote machine, remote unlock over SSH must be supported (i.e. initrd contains an sshd that will accept the passphrase for unlocking the root partition before continuing the boot process).
    • The described setup means that if the disk is thrown away or given to another customer, you shouldn't have anything plain text in the root partition lying around it. As the boot partition is unencrypted, there are a whole bunch of threat models where someone is able to modify that boot partition which wouldn't be addressed by this approach.
  • The setup also assumes you start from some kind of rescue environment (in my case Hetzner's) that means you can freely repartition and overwrite the disk drives.
  • If you're not on Hetzner, you'll have to use different mirrors as they're not available externally.
  • Hosts typically provide their own base images and installer scripts. The approach detailed here is good if you'd rather control the process yourself, and although there are a few hoster-specific bits in here it's easy enough to replace them.
  • For simplicity I use UEFI boot and EFI stub. If you have problems with efibootmgr, you might want to use systemd-boot instead.
  • Disclaimer: As is hopefully obvious, the below instructions will overwrite anything on the hard drive of the system you're running them on.

Setting up and chrooting into Arch bootstrap environment from rescue system

See the docs on the Hetzner rescue system for details on how to enter it. If you've previously used this server, you likely want to remove old known_hosts entries with ssh-keygen -R your.server.

Now ssh in and do the following:

wget http://mirror.hetzner.de/archlinux/iso/latest/archlinux-bootstrap-x86_64.tar.zst
tar -xvf archlinux-bootstrap-x86_64.tar.zst --numeric-owner
sed -i '1s;^;Server=https://mirror.hetzner.de/archlinux/$repo/os/$arch\n\n;' root.x86_64/etc/pacman.d/mirrorlist
mount --bind root.x86_64/ root.x86_64/ # See <https://bugs.archlinux.org/task/46169>
printf "About to enter bootstrap chroot\n===============================\n"
./root.x86_64/bin/arch-chroot root.x86_64/

You are now chrooted into the Arch boostrap environment. We will do as much work from there as possible so as to minimise dependency on tools in the Hetzner rescue system.

Set up drives and create and chroot into actual rootfs

The goal is now to set up the drives, perform an Arch Linux bootstrap, and chroot into it. The UEFI partition is placed at 1MiB pretty much guaranteed to be properly aligned for any reasonable physical sector size (remembering space must be left at the beginning of the disk for the partition table itself).

We'll start by collecting info that will be used throughout the process:

printf "Enter the desired new hostname:\n"
read NEW_HOST_NAME; export NEW_HOST_NAME
printf "Enter the desired passphrase for unlocking the root partition:\n"
read ROOT_PART_PASSPHRASE
printf "Enter the public ssh key (e.g. cat ~/.ssh/id_ed22519.pub) that will be used for access:\n"
read PUBLIC_SSH_KEY; export PUBLIC_SSH_KEY
printf "Enter the name of the main user account to be created:\n"
read NEW_USER; export NEW_USER
printf "You've entered the following information:\n"
printf "NEW_HOST_NAME: %s\n" "$NEW_HOST_NAME"
printf "ROOT_PART_PASSPHRASE: %s\n" "$ROOT_PART_PASSPHRASE"
printf "PUBLIC_SSH_KEY: %s\n" "$PUBLIC_SSH_KEY"
printf "NEW_USER: %s\n" "$NEW_USER"

And now proceed to set up the disks, create filesystems, perform an initial bootstrap and chroot into the new rootfs:

pacman-key --init
pacman-key --populate archlinux
pacman -Sy --noconfirm xfsprogs dosfstools mdadm cryptsetup

for M in /dev/md?; do mdadm --stop $M; done
for DRIVE in /dev/nvme0n1 /dev/nvme1n1; do
  sfdisk $DRIVE <<EOF
label: gpt

start=1MiB, size=255MiB, type=uefi
start=256MiB, type=raid
EOF
done

# Warning from mdadm about falling back to creating md0 via node seems fine to ignore
yes | mdadm --create --verbose --level=0 --raid-devices=2 --homehost=any /dev/md/0 /dev/nvme0n1p2 /dev/nvme1n1p2

mkfs.fat -F32 /dev/nvme0n1p1
printf "%s\n" "$ROOT_PART_PASSPHRASE" | cryptsetup -y -v luksFormat --batch-mode /dev/md0
printf "%s\n" "$ROOT_PART_PASSPHRASE" | cryptsetup open /dev/md0 root
unset ROOT_PART_PASSPHRASE

mkfs.xfs /dev/mapper/root # rely on this calaculating appropriate sunit and swidth

mount /dev/mapper/root /mnt
mkdir /mnt/boot
mount /dev/nvme0n1p1 /mnt/boot
pacstrap /mnt base linux linux-firmware efibootmgr \
  xfsprogs dosfstools mdadm cryptsetup \
  mkinitcpio-netconf mkinitcpio-tinyssh mkinitcpio-utils python3 \
  openssh sudo net-tools git man-db man-pages vim
genfstab -U /mnt >> /mnt/etc/fstab
# Note that sunit and swidth in fstab confusingly use different units vs those shown by xfs <https://superuser.com/questions/701124/xfs-mount-overrides-sunit-and-swidth-options>

mdadm --detail --scan >> /mnt/etc/mdadm.conf
printf "About to enter newrootfs chroot\n===============================\n"
arch-chroot /mnt

Do final configuration from within the new rootfs chroot

Unfortunately I wasn't able to find a way to get ipv6 configured via DHCP on Hetzner's network, so we rely on hardcoding it (which seems to be their recommendation.

We set up sudo and also configure ssh to disable password-based login. A small swapfile is configured in order to allow the kernel to move allocated but not actively used pages there if deemed worthwhile.

sed /etc/locale.gen -i -e "s/^\#en_GB.UTF-8 UTF-8.*/en_GB.UTF-8 UTF-8/"
locale-gen
# Ignore "System has not been booted with systemd" and "Failed to connect to bus" error for next command.
systemd-firstboot --locale=en_GB.UTF-8 --timezone=UTC --hostname="$NEW_HOST_NAME"
ln -s /dev/null /etc/udev/rules.d/80-net-setup-link.rules # disable persistent network names

ssh-keygen -A # generate openssh host keys for the tinyssh hook to pick up
printf "%s\n" "$PUBLIC_SSH_KEY" > /etc/tinyssh/root_key
sed /etc/mkinitcpio.conf -i -e 's/^HOOKS=.*/HOOKS=(base udev autodetect microcode modconf kms keyboard keymap consolefont block mdadm_udev netconf tinyssh encryptssh filesystems fsck)/'
# The call to tinyssh-convert in mkinitcpio-linux is incorrect, so do it
# manually <https://github.com/grazzolini/mkinitcpio-tinyssh/issues/10>
tinyssh-convert /etc/tinyssh/sshkeydir < /etc/ssh/ssh_host_ed25519_key
mkinitcpio -p linux

printf "efibootmgr before changes:\n==========================\n"
efibootmgr -u
# Clean up any other efibootmgr entries from previous installs
for BOOTNUM in $(efibootmgr | grep '^Boot0' | grep -v PXE | sed 's/^Boot\([0-9]*\).*/\1/'); do
  efibootmgr -b $BOOTNUM -B
done
# Set up efistub
efibootmgr \
  --disk /dev/nvme0n1 \
  --part 1 \
  --create \
  --label 'Arch Linux' \
  --loader /vmlinuz-linux \
  --unicode "root=/dev/mapper/root rw cryptdevice=UUID=$(blkid -s UUID -o value /dev/md0):root:allow-discards initrd=\initramfs-linux.img ip=:::::eth0:dhcp" \
  --verbose
efibootmgr -o 0001,0000 # prefer PXE boot. May need to check IDs
printf "efibootmgr after changes:\n=========================\n"
efibootmgr -u

mkswap --size=8G --file /swapfile
cat - <<EOF > /etc/systemd/system/swapfile.swap
[Unit]
Description=Swap file

[Swap]
What=/swapfile

[Install]
WantedBy=multi-user.target
EOF
systemctl enable swapfile.swap

cat - <<EOF > /etc/systemd/network/10-eth0.network
[Match]
Name=eth0

[Network]
DHCP=yes
Address=$(ip -6 addr show dev eth0 scope global | grep "scope global" | cut -d' ' -f6)
Gateway=$(ip route show | head -n 1 | cut -d' ' -f 3)
Gateway=fe80::1
EOF
systemctl enable systemd-networkd.service systemd-resolved.service systemd-timesyncd.service
printf "PasswordAuthentication no\n" > /etc/ssh/sshd_config.d/20-no-password-auth.conf
systemctl enable sshd.service
useradd -m -g users -G wheel -s /bin/bash "$NEW_USER"
usermod --pass='!' root # disable root login
chmod +w /etc/sudoers
printf "%%wheel ALL=(ALL) ALL\n" >> /etc/sudoers
chmod -w /etc/sudoers
mkdir "/home/$NEW_USER/.ssh"
printf "%s\n" "$PUBLIC_SSH_KEY" > "/home/$NEW_USER/.ssh/authorized_keys"
chmod 700 "/home/$NEW_USER/.ssh"
chmod 600 "/home/$NEW_USER/.ssh/authorized_keys"
chown -R "$NEW_USER:users" "/home/$NEW_USER/.ssh"

Set user password and reboot

# The next command will ask for the user password (needed for sudo).
passwd "$NEW_USER"

Then ctrl+d twice and reboot.

Unlocking rootfs and logging in

Firstly, you likely want to remove any saved host public keys as there will be a new one for the newly provisioned machine ssh-keygen -R your.server. The server will have a different host key for the root key unlock so I'd recommend using a different known_hosts file for unlock. e.g. ssh -o UserKnownHostsFile=~/.ssh/known_hosts_unlock [email protected]. If you put something like the following in your ~/.ssh/config you could just ssh unlock-foo-host:

Host unlock-foo-host
HostName your.server
User root
UserKnownHostsFile ~/.ssh/known_hosts_unlock

Once you've successfully entered the key to unlock the rootfs, you can just ssh as normal to the server.


Article changelog
  • 2024-10-01: Initial publication date.

October 01, 2024 12:00 PM

September 28, 2024

Qiuyi Zhang (Joyee)

Reproducible Node.js built-in snapshots, part 1 - Overview and Node.js fixes

In a recent effort to make the Node.js builds reproducible again (at least on Linux), a big remaining piece was the built-in snapshot. I wrote some notes about the journey of making the Node

September 28, 2024 07:39 PM

Reproducible Node.js built-in snapshots, part 3 - fixing V8 startup snapshot

In the previous posts we looked into how the Node.js executable was made a bit more reproducible after the Node.js snapshot data and the V8 code cache were made reproducible, and did a bit of anatomy on the unreproducible V8 snapshot blobs

September 28, 2024 07:23 PM

Reproducible Node.js built-in snapshots, part 2 - V8 code cache and snapshot blobs

In the previous post, we covered how the Node.js built-in snapshot is generated and embedded into the executable, and how I fixed the Node

September 28, 2024 07:05 PM

September 27, 2024

Carlos García Campos

Graphics improvements in WebKitGTK and WPEWebKit 2.46

WebKitGTK and WPEWebKit recently released a new stable version 2.46. This version includes important changes in the graphics implementation.

Skia

The most important change in 2.46 is the introduction of Skia to replace Cairo as the 2D graphics renderer. Skia supports rendering using the GPU, which is now the default, but we also use it for CPU rendering using the same threaded rendering model we had with Cairo. The architecture hasn’t changed much for GPU rendering: we use the same tiled rendering approach, but buffers for dirty regions are rendered in the main thread as textures. The compositor waits for textures to be ready using fences and copies them directly to the compositor texture. This was the simplest approach that already resulted in much better performance, specially in the desktop with more powerful GPUs. In embedded systems, where GPUs are not so powerful, it’s still better to use the CPU with several rendering threads in most of the cases. It’s still too early to announce anything, but we are already experimenting with different models to improve the performance even more and make a better usage of the GPU in embedded devices.

Skia has received several GCC specific optimizations lately, but it’s always more optimized when built with clang. The optimizations are more noticeable in performance when using the CPU for rendering. For this reason, since version 2.46 we recommend to build WebKit with clang for the best performance. GCC is still supported, of course, and performance when built with GCC is quite good too.

HiDPI

Even though there aren’t specific changes about HiDPI in 2.46, users of high resolution screens using a device scale factor bigger than 1 will notice much better performance thanks to scaling being a lot faster on the GPU.

Accelerated canvas

The 2D canvas can be accelerated independently on whether the CPU or the GPU is used for painting layers. In 2.46 there’s a new setting WebKitSettings:enable-2d-canvas-acceleration to control the 2D canvas acceleration. In some embedded devices the combination of CPU rendering for layer tiles and GPU for the canvas gives the best performance. The 2D canvas is normally rendered into an image buffer that is then painted in the layer as an image. We changed that for the accelerated case, so that the canvas is now rendered into a texture that is copied to a compositor texture to be directly composited instead of painted into the layer as an image. In 2.46 the offscreen canvas is enabled by default.

There are more cases where accelerating the canvas is not desired, for example when the canvas size is not big enough it’s faster to use the GPU. Also when there’s going to be many operations to “download” pixels from GPU. Since this is not always easy to predict, in 2.46 we added support for the willReadFrequently canvas setting, so that when set by the application when creating the canvas it causes the canvas to be always unaccelerated.

Filters

All the CSS filters are now implemented using Skia APIs, and accelerated when possible. The most noticeable change here is that sites using blur filters are no longer slow.

Color spaces

Skia brings native support for color spaces, which allows us to greatly simplify the color space handling code in WebKit. WebKit uses color spaces in many scenarios – but especially in case of SVG and filters. In case of some filters, color spaces are necessary as some operations are simpler to perform in linear sRGB. The good example of that is feDiffuseLighting filter – it yielded wrong visual results for a very long time in case of Cairo-based implementation as Cairo doesn’t have a support for color spaces. At some point, however, Cairo-based WebKit implementation has been fixed by converting pixels to linear in-place before applying the filter and converting pixels in-place back to sRGB afterwards. Such a workarounds are not necessary anymore as with Skia, all the pixel-level operations are handled in a color-space-transparent way as long as proper color space information is provided. This not only impacts the results of some filters that are now correct, but improves performance and opens new possibilities for acceleration.

Font rendering

Font rendering is probably the most noticeable visual change after the Skia switch with mixed feedback. Some people reported that several sites look much better, while others reported problems with kerning in other sites. In other cases it’s not really better or worse, it’s just that we were used to the way fonts were rendered before.

Damage tracking

WebKit already tracks the area of the layers that has changed to paint only the dirty regions. This means that we only repaint the areas that changed but the compositor incorporates them and the whole frame is always composited and passed to the system compositor. In 2.46 there’s experimental code to track the damage regions and pass them to the system compositor in addition to the frame. Since this is experimental it’s disabled by default, but can be enabled with the runtime feature PropagateDamagingInformation. There’s also UnifyDamagedRegions feature that can be used in combination with PropagateDamagingInformation to unify the damage regions into one before passing it to the system compositor. We still need to analyze the impact of damage tracking in performance before enabling it by default. We have also started an experiment to use the damage information in WebKit compositor and avoid compositing the entire frame every time.

GPU info

Working on graphics can be really hard in Linux, there are too many variables that can result in different outputs for different users: the driver version, the kernel version, the system compositor, the EGL extensions available, etc. When something doesn’t work for some people and work for others, it’s key for us to gather as much information as possible about the graphics stack. In 2.46 we have added more useful information to webkit://gpu, like the DMA-BUF buffer format and modifier used (for GTK port and WPE when using the new API). Very often the symptom is the same, nothing is rendered in the web view, even when the causes could be very different. For those cases, it’s even more difficult to gather the info because webkit://gpu doesn’t render anything either. In 2.46 it’s possible to load webkit://gpu/stdout to get the information as a JSON directly in stdout.

Sysprof

Another common symptom for people having problems is that a particular website is slow to render, while for others it works fine. In these cases, in addition to the graphics stack information, we need to figure out where we are slower and why. This is very difficult to fix when you can’t reproduce the problem. We added initial support for profiling in 2.46 using sysprof. The code already has some marks so that when run under sysprof we get useful information about timings of several parts of the graphics pipeline.

Next

This is just the beginning, we are already working on changes that will allow us to make a better use of both the GPU and CPU for the best performance. We have also plans to do other changes in the graphics architecture to improve synchronization, latency and security. Now that we have adopted sysprof for profiling, we are also working on improvements and new tools.

by carlos garcia campos at September 27, 2024 10:30 AM

Ricardo García

Waiter, there's an IES in my DGC!

Finally! Yesterday Khronos published Vulkan 1.3.296 including VK_EXT_device_generated_commands. Thousands of engineering hours seeing the light of day, and awesome news for Linux gaming.

Device-Generated Commands, or DGC for short, are Vulkan’s equivalent to ExecuteIndirect in Direct3D 12. Thanks to this extension, originally based on a couple of NVIDIA vendor extensions, it will be possible to prepare sequences of commands to run directly from the GPU, and executing those sequences directly without any data going through the CPU. Also, Proton now has a much-more official leg to stand on when it has to translate ExecuteIndirect from D3D12 to Vulkan while you run games such as Starfield.

The extension not only provides functionality equivalent to ExecuteIndirect. It goes beyond that and offers more fine-grained control like explicit preprocessing of command sequences, or switching shaders and pipelines with each sequence thanks to something called Indirect Execution Sets, or IES for short, that potentially work with ray tracing, compute and graphics (both regular and mesh shading).

As part of my job at Igalia, I’ve implemented CTS tests for this extension and I had the chance to work very closely with an awesome group of developers discussing specification, APIs and test needs. I hope I don’t forget anybody and apologize in advance if so.

  • Mike Blumenkrantz, of course. Valve contractor, Super Good Coder and current OpenGL Working Group chair who took the initial specification work from Patrick Doane and carried it across the finish line. Be sure to read his blog post about DGC. Also incredibly important for me: he developed, and kept up-to-date, an implementation of the extension for lavapipe, the software Vulkan driver from Mesa. This was invaluable in allowing me to create tests for the extension much faster and making sure tests were in good shape when GPU driver authors started running them.

  • Spencer Fricke from LunarG. Spencer did something fantastic here. For the first time, the needed changes in the Vulkan Validation Layers for such a large extension were developed in parallel while tests and the spec were evolving. His work will be incredibly useful for app developers using the extension in their games. It also allowed me to detect test bugs and issues much earlier and fix them faster.

  • Samuel Pitoiset (Valve contractor), Connor Abbott (Valve contractor), Lionel Landwerlin (Intel) and Vikram Kushwaha (NVIDIA) providing early implementations of the extension, discussing APIs, reporting test bugs and needs, and making sure the extension works as good as possible for a variety of hardware vendors out there.

  • To a lesser degree, most others mentioned as spec contributors for the extension, such as Hans-Kristian Arntzen (Valve contractor), Baldur Karlsson (Valve contractor), Faith Ekstrand (Collabora), etc, making sure the spec works for them too and makes sense for Proton, RenderDoc, and drivers such as NVK and others.

If you’ve noticed, a significant part of the people driving this effort work for Valve and, from my side, the work has also been carried as part of Igalia’s collaboration with them. So my explicit thanks to Valve for sponsoring all this work.

If you want to know a bit more about DGC, stay tuned for future talks about this topic. In about a couple of weeks, I’ll present a lightning talk (5 mins) with an overview at XDC 2024 in Montreal. Don’t miss it!

September 27, 2024 09:42 AM

Brian Kardell

Igalia: 2024 Mid-season Power Rankings

Igalia: 2024 Mid-season Power Rankings

Let’s take the good old annual look at how Igalia is stacking up so far in terms of Open Source contributions - you know, "The One With the Charts"

Igalia is involved in so many things it's honestly hard for me to keep track. So, every now and then, I like to step back and take stock of it, and what it means to the open source ecosystem. Let's have a look...

The Big Browser Projects

Some interesting shifts this year in browser projects and where we've made the biggest impacts.

chromium

In 2018 Microsoft announced that they were switching to a Chromium based engine. However, in 2019 it was Igalia (not Microsoft) who wound up with the most commits to Chromium (after Google). We thought for sure that they would overtake us in 2020, but we also had the most commits in 2020. And in 2021, 2022 and 2023. 5 years later, after increases on their end and some shifts on ours, Microsoft has finally eclipsed us as the one with the most commits after Google. Congratulations Microsoft!... Enjoy it! For now :)

data

Top 10 contributors

ContributorContributions
Microsoft37.26%
Igalia17.94%
Intel12.41%
januschka.com6.72%
Opera6.09%
Samsung1.69%
Bytedance1.48%
LGE0.90%
Anton Bershanskyi0.79%
Md Hasibul Hasan0.74%

It's nice to see Intel and Opera stepping it up this year too!

It seems impossible to not comment on the fact that > 6.7% of commits this year are from a single individual, Helmut Januschka as mostly unpaid work. On the one hand, that's heroic but also, it feels a little exploitative, right? Multi-million, billion or even trillion dollar corporations with products built on chromium could surely fund a little more. If you work for one of those and would like to help change that...

Another distinction that is perhaps worth making is that this is measuring commits to the whole chromium project which is kind of a super project. The majority of the code (and commits) are in support of the browser (chromium is a also base browser implementation without some of the "Google" stuff that is in Chrome). These include graphics/rendering related standalone projects like SKIA, or wildvine for DRM, or things that deal with the different OS windowing systems (X/Wayland on Linux, for example), abstract plumbing for chromium based downstream. On the engine itself, it's mainly just Google, Microsoft, Igalia, Intel and Opera.

WebKit

Still the number one contributors after Apple to WebKit 💪🏻💪🏼💪🏽💪🏾💪🏿.

data

Top 10 contributors

ContributorContributions
Igalia65.65%
Sony16.40%
Ubie3.19%
Red Hat2.63%
softathome.com1.16%
Rose0.99%
Alexey Knyazev0.73%
GitHub0.73%
Ian Grunert0.69%
cox.net0.65%
Gecko

Hey look at that - Igalia moves into second place after Mozilla! I'm not sure how anyone could not observe that André Bargull (@anba, a contributor to Mozilla for 16 years!) is just amazingly prolific and important to the project. Mozilla has a long tail of almost 600 individual/unaffliated contributors!

data

Top 10 contributors

ContributorContributions
André Bargull23.75%
Igalia10.71%
Red Hat8.74%
Birchill6.19%
protonmail.com5.66%
longsonr3.38%
Jonatan Klemets2.43%
yahoo.com1.82%
Debadree Chatterjee1.63%
Mugurell1.52%

WHATWG Stuff

HTML

Would you expect Igalia to be #2 in HTML? We are, this year! It's astounding the size of the role Google plays in a lot of this stuff, really. Then again, they have the only economic model for it so it makes a lot of sense that they invest heavily - this is peanuts comparatively.

data

Top 10 contributors

ContributorContributions
Google40.12%
Igalia14.37%
Apple12.57%
Mozilla3.59%
ibiblio.org2.99%
w3.org2.99%
Alex Rudenko2.40%
serenityos.org1.80%
meiert.com1.80%
Tawanda Moyo1.80%
DOM

What about DOM? Hey-o it's us again! Good job Igalia!

data

Top 10 contributors

ContributorContributions
Google40.12%
Igalia14.37%
Apple12.57%
Mozilla3.59%
ibiblio.org2.99%
w3.org2.99%
Alex Rudenko2.40%
serenityos.org1.80%
meiert.com1.80%
Tawanda Moyo1.80%
Web Platform Tests

It's almost like we're the 4th browser engine - #4 in contributions to Web Platform Tests, after Google, Mozilla, Apple. Go us!

data

Top 10 contributors

ContributorContributions
Google45.51%
Mozilla18.49%
Apple7.62%
Igalia6.15%
Intel4.09%
Microsoft3.69%
Opera0.99%
Maksim Sadym0.96%
Jonathan Lee0.88%
longsonr0.78%

We are champions

There are also a few related projects where Igalia is now the champion: Servo, the independent, parallel, memory safe, Rust based engine - and Wolvic, the XR browser formerly known as Firefox Reality, and Babel where we have been playing a huge role. In each of these projects Igalia is doing sponsored work, and our own investment because we believe they are important to the ecosystem right now.

Servo

#1 in servo 🔥🔥!

It's still recent history that Igalia got involved in Servo, but it's been exciting to watch it develop! If you or your organization would be interested in funding some work, reach out to me. You can also donate via GitHub sponsors or open collective.

data

Top 10 contributors

ContributorContributions
Igalia43.94%
sagudev7.96%
Huawei4.16%
inventati.org3.71%
Taym Haddadi3.54%
Oluwatobi Sofela3.45%
Zesty2.83%
Rosemary Ajayi2.12%
Mozilla1.77%
Crab Nebula1.68%
Babel

Igalia is also #1 in Babel commits 🔥🔥! Did you expect it? Nicolò Ribaudo has been busy!

data

Top 10 contributors

ContributorContributions
Igalia38.58%
liuxingbaoyu28.19%
Huáng Jùnliàng20.18%
renovate[bot]2.08%
Amjad Yahia Robeen Hassan1.48%
Sukka1.19%
blake.id0.59%
fisker Cheung0.59%
samual.uk0.59%
ynnsuis0.30%
Wolvic

Of course, as the stewards of Wolvic, we're the #1 in contributors there too! 91.6% of commits! There's also now the wolvic-chromium

data

Top 10 contributors

ContributorContributions
Igalia91.64%
weblate.org1.55%
reimu.nl1.24%
сэр Аноним1.24%
Felipe Erias0.93%
hotmail.es0.93%
opensuse.org0.62%
Ha Anh Vu0.31%
Houssem Nasri0.31%
HoussemNasri0.31%

Accessibility

Igalia has had an investment in accessibility for years. We've got a lot of expertise, and play an important role in some key projects:

Orca

Realisically this should be above as we're pretty much the maintainers of Orca, the screen reader for Linux (shout out to my colleague Joanie!).

data

Top 10 contributors

ContributorContributions
Igalia76.43%
ubuntu.com2.42%
gnome.org2.17%
Andy Holmes1.78%
bk.ru1.15%
pickup.hu1.15%
gtu.ge1.15%
Yaron Shahrabani1.15%
ukr.net0.89%
Sabri Ünal0.89%
aria

Igalia is #3 in commits to the aria repo - unsurprising as our colleague Valerie Young is co-chair of the working group!

data

Top 10 contributors

ContributorContributions
Paciello Group21.51%
w3.org17.74%
Igalia16.13%
github-actions[bot]9.68%
Daniel Montalvo7.53%
Peter Krautzberger5.38%
Adobe4.84%
Matt Garrish4.30%
usablenet.com1.61%
pkra1.61%
CORE-AAM

The AAMs are what actually map the standards to the accessibility layer, there are several of them. Very few people work on these, and guess who is pretty important there too? If you guessed Igalia, you'd be right. Igalia is the #1 contributor in CORE-AAM, Graphics-AAM and MathML-AAMs.

data

Top 10 contributors

ContributorContributions
github-actions[bot]28.57%
Igalia28.57%
w3.org28.57%
Paciello Group14.29%

That's not remotely all

These are just some that I am closer to and feel are interesting to highlight, but the list of open source repositories in which Igalia is playing an important to critical role is way longer than this. For example:

  1. Igalia #2 in commits to GStreamer, the powerful multimedia framework. We're the main maintainers of the GStreamer ports of WebKit used in Linux browsers on desktops as well as on a plethora of embedded devices.
  2. We've been very involved in work on GL, and contributing to Khronos Vulkan, OpenGL, and OpenGL ES Conformance Tests. In fact, we are the #1 contributor to the Vulkan GL Conformance Test Suite Repository!
  3. Igalia is the #2 contributor to the CI-tron project, which was started by Valve and is currently developed mostly by Valve contractors. CI-tron lets developers easily create CI systems based on bare metal computers, which is particularly important for Mesa and other open-source graphics projects.
  4. Igalia is the #1 contributor to libsoup the HTTP client/server library for GNOME.
  5. Igalia is the #3 contributor to Mesa the 3D Graphics Library

Anyway, I love to take the time to look at these every year and see how we're doing. It's easy to get lost in the bustle of everything that's going on and miss how much we really do - it's good to take stock. I hope that others find this interesting too, and maybe learn something about Igalia. Most of the work that we do is funded by companies. If your organization has some needs, you know where to find me :)

September 27, 2024 04:00 AM

September 26, 2024

Andy Wingo

needed-bits optimizations in guile

Hey all, I had a fun bug this week and want to share it with you.

numbers and representations

First, though, some background. Guile’s numeric operations are defined over the complex numbers, not over e.g. a finite field of integers. This is generally great when writing an algorithm, because you don’t have to think about how the computer will actually represent the numbers you are working on.

In practice, Guile will represent a small exact integer as a fixnum, which is a machine word with a low-bit tag. If an integer doesn’t fit in a word (minus space for the tag), it is represented as a heap-allocated bignum. But sometimes the compiler can realize that e.g. the operands to a specific bitwise-and operation are within (say) the 64-bit range of unsigned integers, and so therefore we can use unboxed operations instead of the more generic functions that do run-time dispatch on the operand types, and which might perform heap allocation.

Unboxing is important for speed. It’s also tricky: under what circumstances can we do it? In the example above, there is information that flows from defs to uses: the operands of logand are known to be exact integers in a certain range and the operation itself is closed over its domain, so we can unbox.

But there is another case in which we can unbox, in which information flows backwards, from uses to defs: if we see (logand n #xff), we know:

  • the result will be in [0, 255]

  • that n will be an exact integer (or an exception will be thrown)

  • we are only interested in a subset of n‘s bits.

Together, these observations let us transform the more general logand to an unboxed operation, having first truncated n to a u64. And actually, the information can flow from use to def: if we know that n will be an exact integer but don’t know its range, we can transform the potentially heap-allocating computation that produces n to instead truncate its result to the u64 range where it is defined, instead of just truncating at the use; and potentially this information could travel farther up the dominator tree, to inputs of the operation that defines n, their inputs, and so on.

needed-bits: the |0 of scheme

Let’s say we have a numerical operation that produces an exact integer, but we don’t know the range. We could truncate the result to a u64 and use unboxed operations, if and only if only u64 bits are used. So we need to compute, for each variable in a program, what bits are needed from it.

I think this is generally known a needed-bits analysis, though both Google and my textbooks are failing me at the moment; perhaps this is because dynamic languages and flow analysis don’t get so much attention these days. Anyway, the analysis can be local (within a basic block), global (all blocks in a function), or interprocedural (larger than a function). Guile’s is global. Each CPS/SSA variable in the function starts as needing 0 bits. We then compute the fixpoint of visiting each term in the function; if a term causes a variable to flow out of the function, for example via return or call, the variable is recorded as needing all bits, as is also the case if the variable is an operand to some primcall that doesn’t have a specific needed-bits analyser.

Currently, only logand has a needed-bits analyser, and this is because sometimes you want to do modular arithmetic, for example in a hash function. Consider Bon Jenkins’ lookup3 string hash function:

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
#define mix(a,b,c) \
{ \
  a -= c;  a ^= rot(c, 4);  c += b; \
  b -= a;  b ^= rot(a, 6);  a += c; \
  c -= b;  c ^= rot(b, 8);  b += a; \
  a -= c;  a ^= rot(c,16);  c += b; \
  b -= a;  b ^= rot(a,19);  a += c; \
  c -= b;  c ^= rot(b, 4);  b += a; \
}
...

If we transcribe this to Scheme, we get something like:

(define (jenkins-lookup3-hashword2 str)
  (define (u32 x) (logand x #xffffFFFF))
  (define (shl x n) (u32 (ash x n)))
  (define (shr x n) (ash x (- n)))
  (define (rot x n) (logior (shl x n) (shr x (- 32 n))))
  (define (add x y) (u32 (+ x y)))
  (define (sub x y) (u32 (- x y)))
  (define (xor x y) (logxor x y))

  (define (mix a b c)
    (let* ((a (sub a c)) (a (xor a (rot c 4)))  (c (add c b))
           (b (sub b a)) (b (xor b (rot a 6)))  (a (add a c))
           (c (sub c b)) (c (xor c (rot b 8)))  (b (add b a))
           ...)
      ...))

  ...

These u32 calls are like the JavaScript |0 idiom, to tell the compiler that we really just want the low 32 bits of the number, as an integer. Guile’s compiler will propagate that information down to uses of the defined values but also back up the dominator tree, resulting in unboxed arithmetic for all of these operations.

(When writing this, I got all the way here and then realized I had already written quite a bit about this, almost a decade ago ago. Oh well, consider this your lucky day, you get two scoops of prose!)

the bug

All that was just prelude. So I said that needed-bits is a fixed-point flow analysis problem. In this case, I want to compute, for each variable, what bits are needed for its definition. Because of loops, we need to keep iterating until we have found the fixed point. We use a worklist to represent the conts we need to visit.

Visiting a cont may cause the program to require more bits from the variables that cont uses. Consider:

(define-significant-bits-handler
    ((logand/immediate label types out res) param a)
  (let ((sigbits (sigbits-intersect
                   (inferred-sigbits types label a)
                   param
                   (sigbits-ref out res))))
    (intmap-add out a sigbits sigbits-union)))

This is the sigbits (needed-bits) handler for logand when one of its operands (param) is a constant and the other (a) is variable. It adds an entry for a to the analysis out, which is an intmap from variable to a bitmask of needed bits, or #f for all bits. If a already has some computed sigbits, we add to that set via sigbits-union. The interesting point comes in the sigbits-intersect call: the bits that we will need from a are first the bits that we infer a to have, by forward type-and-range analysis; intersected with the bits from the immediate param; intersected with the needed bits from the result value res.

If the intmap-add call is idempotent—i.e., out already contains sigbits for a—then out is returned as-is. So we can check for a fixed-point by comparing out with the resulting analysis, via eq?. If they are not equal, we need to add the cont that defines a to the worklist.

The bug? The bug was that we were not enqueuing the def of a, but rather the predecessors of label. This works when there are no cycles, provided we visit the worklist in post-order; and regardless, it works for many other analyses in Guile where we compute, for each labelled cont (basic block), some set of facts about all other labels or about all other variables. In that case, enqueuing a predecessor on the worklist will cause all nodes up and to including the variable’s definition to be visited, because each step adds more information (relative to the analysis computed on the previous visit). But it doesn’t work for this case, because we aren’t computing a per-label analysis.

The solution was to rewrite that particular fixed-point to enqueue labels that define a variable (possibly multiple defs, because of joins and loop back-edges), instead of just the predecessors of the use.

Et voilà ! If you got this far, bravo. Type at y’all again soon!

by Andy Wingo at September 26, 2024 12:44 PM

September 25, 2024

Melissa Wen

Reflections on 2024 Linux Display Next Hackfest

Hey everyone!

The 2024 Linux Display Next hackfest concluded in May, and its outcomes continue to shape the Linux Display stack. Igalia hosted this year’s event in A Coruña, Spain, bringing together leading experts in the field. Samuel Iglesias and I organized this year’s edition and this blog post summarizes the experience and its fruits.

One of the highlights of this year’s hackfest was the wide range of backgrounds represented by our 40 participants (both on-site and remotely). Developers and experts from various companies and open-source projects came together to advance the Linux Display ecosystem. You can find the list of participants here.

The event covered a broad spectrum of topics affecting the development of Linux projects, user experiences, and the future of display technologies on Linux. From cutting-edge topics to long-term discussions, you can check the event agenda here.

Organization Highlights

The hackfest was marked by in-depth discussions and knowledge sharing among Linux contributors, making everyone inspired, informed, and connected to the community. Building on feedback from the previous year, we refined the unconference format to enhance participant preparation and engagement.

Structured Agenda and Timeboxes: Each session had a defined scope, time limit (1h20 or 2h10), and began with an introductory talk on the topic.

  • Participant-Led Discussions: We pre-selected in-person participants to lead discussions, allowing them to prepare introductions, resources, and scope.
  • Transparent Scheduling: The schedule was shared in advance as GitHub issues, encouraging participants to review and prepare for sessions of interest.

Engaging Sessions: The hackfest featured a variety of topics, including presentations and discussions on how participants were addressing specific subjects within their companies.

  • No Breakout Rooms, No Overlaps: All participants chose to attend all sessions, eliminating the need for separate breakout rooms. We also adapted run-time schedule to keep everybody involved in the same topics.
  • Real-time Updates: We provided notifications and updates through dedicated emails and the event matrix room.

Strengthening Community Connections: The hackfest offered ample opportunities for networking among attendees.

  • Social Events: Igalia sponsored coffee breaks, lunches, and a dinner at a local restaurant.

  • Museum Visit: Participants enjoyed a sponsored visit to the Museum of Estrela Galicia Beer (MEGA).

Fruitful Discussions and Follow-up

The structured agenda and breaks allowed us to cover multiple topics during the hackfest. These discussions have led to new display feature development and improvements, as evidenced by patches, merge requests, and implementations in project repositories and mailing lists.

With the KMS color management API taking shape, we discussed refinements and best approaches to cover the variety of color pipeline from different hardware-vendors. We are also investigating techniques for a performant SDR<->HDR content reproduction and reducing latency and power consumption when using the color blocks of the hardware.

Color Management/HDR

Color Management and HDR continued to be the hottest topic of the hackfest. We had three sessions dedicated to discuss Color and HDR across Linux Display stack layers.

Color/HDR (Kernel-Level)

Harry Wentland (AMD) led this session.

Here, kernel Developers shared the Color Management pipeline of AMD, Intel and NVidia. We counted with diagrams and explanations from HW-vendors developers that discussed differences, constraints and paths to fit them into the KMS generic color management properties such as advertising modeset needs, IN\_FORMAT, segmented LUTs, interpolation types, etc. Developers from Qualcomm and ARM also added information regarding their hardware.

Upstream work related to this session:

Color/HDR (Compositor-Level)

Sebastian Wick (RedHat) led this session.

It started with Sebastian’s presentation covering Wayland color protocols and compositor implementation. Also, an explanation of APIs provided by Wayland and how they can be used to achieve better color management for applications and discussions around ICC profiles and color representation metadata. There was also an intensive Q&A about LittleCMS with Marti Maria.

Upstream work related to this session:

Color/HDR (Use Cases and Testing)

Christopher Cameron (Google) and Melissa Wen (Igalia) led this session.

In contrast to the other sessions, here we focused less on implementation and more on brainstorming and reflections of real-world SDR and HDR transformations (use and validation) and gainmaps. Christopher gave a nice presentation explaining HDR gainmap images and how we should think of HDR. This presentation and Q&A were important to put participants at the same page of how to transition between SDR and HDR and somehow “emulating” HDR.

We also discussed on the usage of a kernel background color property.

Finally, we discussed a bit about Chamelium and the future of VKMS (future work and maintainership).

Power Savings vs Color/Latency

Mario Limonciello (AMD) led this session.

Mario gave an introductory presentation about AMD ABM (adaptive backlight management) that is similar to Intel DPST. After some discussions, we agreed on exposing a kernel property for power saving policy. This work was already merged on kernel and the userspace support is under development.

Upstream work related to this session:

Strategy for video and gaming use-cases

Leo Li (AMD) led this session.

Miguel Casas (Google) started this session with a presentation of Overlays in Chrome/OS Video, explaining the main goal of power saving by switching off GPU for accelerated compositing and the challenges of different colorspace/HDR for video on Linux.

Then Leo Li presented different strategies for video and gaming and we discussed the userspace need of more detailed feedback mechanisms to understand failures when offloading. Also, creating a debugFS interface came up as a tool for debugging and analysis.

Real-time scheduling and async KMS API

Xaver Hugl (KDE/BlueSystems) led this session.

Compositor developers have exposed some issues with doing real-time scheduling and async page flips. One is that the Kernel limits the lifetime of realtime threads and if a modeset takes too long, the thread will be killed and thus the compositor as well. Also, simple page flips take longer than expected and drivers should optimize them.

Another issue is the lack of feedback to compositors about hardware programming time and commit deadlines (the lastest possible time to commit). This is difficult to predict from drivers, since it varies greatly with the type of properties. For example, color management updates take much longer.

In this regard, we discusssed implementing a hw_done callback to timestamp when the hardware programming of the last atomic commit is complete. Also an API to pre-program color pipeline in a kind of A/B scheme. It may not be supported by all drivers, but might be useful in different ways.

VRR/Frame Limit, Display Mux, Display Control, and more… and beer

We also had sessions to discuss a new KMS API to mitigate headaches on VRR and Frame Limit as different brightness level at different refresh rates, abrupt changes of refresh rates, low frame rate compensation (LFC) and precise timing in VRR more.

On Display Control we discussed features missing in the current KMS interface for HDR mode, atomic backlight settings, source-based tone mapping, etc. We also discussed the need of a place where compositor developers can post TODOs to be developed by KMS people.

The Content-adaptive Scaling and Sharpening session focused on sharpening and scaling filters. In the Display Mux session, we discussed proposals to expose the capability of dynamic mux switching display signal between discrete and integrated GPUs.

In the last session of the 2024 Display Next Hackfest, participants representing different compositors summarized current and future work and built a Linux Display “wish list”, which includes: improvements to VTTY and HDR switching, better dmabuf API for multi-GPU support, definition of tone mapping, blending and scaling sematics, and wayland protocols for advertising to clients which colorspaces are supported.

We closed this session with a status update on feature development by compositors, including but not limited to: plane offloading (from libcamera to output) / HDR video offloading (dma-heaps) / plane-based scrolling for web pages, color management / HDR / ICC profiles support, addressing issues such as flickering when color primaries don’t match, etc.

After three days of intensive discussions, all in-person participants went to a guided tour at the Museum of Extrela Galicia beer (MEGA), pouring and tasting the most famous local beer.

Feedback and Future Directions

Participants provided valuable feedback on the hackfest, including suggestions for future improvements.

  • Schedule and Break-time Setup: Having a pre-defined agenda and schedule provided a better balance between long discussions and mental refreshments, preventing the fatigue caused by endless discussions.
  • Action Points: Some participants recommended explicitly asking for action points at the end of each session and assigning people to follow-up tasks.
  • Remote Participation: Remote attendees appreciated the inclusive setup and opportunities to actively participate in discussions.
  • Technical Challenges: There were bandwidth and video streaming issues during some sessions due to the large number of participants.

Thank you for joining the 2024 Display Next Hackfest

We can’t help but thank the 40 participants, who engaged in-person or virtually on relevant discussions, for a collaborative evolution of the Linux display stack and for building an insightful agenda.

A big thank you to the leaders and presenters of the nine sessions: Christopher Cameron (Google), Harry Wentland (AMD), Leo Li (AMD), Mario Limoncello (AMD), Sebastian Wick (RedHat) and Xaver Hugl (KDE/BlueSystems) for the effort in preparing the sessions, explaining the topic and guiding discussions. My acknowledge to the others in-person participants that made such an effort to travel to A Coruña: Alex Goins (NVIDIA), David Turner (Raspberry Pi), Georges Stavracas (Igalia), Joan Torres (SUSE), Liviu Dudau (Arm), Louis Chauvet (Bootlin), Robert Mader (Collabora), Tian Mengge (GravityXR), Victor Jaquez (Igalia) and Victoria Brekenfeld (System76). It was and awesome opportunity to meet you and chat face-to-face.

Finally, thanks virtual participants who couldn’t make it in person but organized their days to actively participate in each discussion, adding different perspectives and valuable inputs even remotely: Abhinav Kumar (Qualcomm), Chaitanya Borah (Intel), Christopher Braga (Qualcomm), Dor Askayo, Jiri Koten (RedHat), Jonas Ådahl (Red Hat), Leandro Ribeiro (Collabora), Marti Maria (Little CMS), Marijn Suijten, Mario Kleiner, Martin Stransky (Red Hat), Michel Dänzer (Red Hat), Miguel Casas-Sanchez (Google), Mitulkumar Golani (Intel), Naveen Kumar (Intel), Niels De Graef (Red Hat), Pekka Paalanen (Collabora), Pichika Uday Kiran (AMD), Shashank Sharma (AMD), Sriharsha PV (AMD), Simon Ser, Uma Shankar (Intel) and Vikas Korjani (AMD).

We look forward to another successful Display Next hackfest, continuing to drive innovation and improvement in the Linux display ecosystem!

September 25, 2024 01:50 PM

Paulo Matos

FEX x87 Stack Optimization

In this blog post, I will describe my recent work on improving and optimizing the handling of x87 code in FEX. This work landed on July 22, 2024, through commit https://github.com/FEX-Emu/FEX/commit/a1378f94ce8e7843d5a3cd27bc72847973f8e7ec (tests and updates to size benchmarks landed separately).

FEX is an ARM64 emulator of Intel assembly and, as such, needs to be able to handle things as old as x87, which was the way we did floating point pre-SSE.

Consider a function to calculate the well-known Pythagorean Identity:

float PytIdent(float x) {
return sinf(x)*sinf(x) + cosf(x)*cosf(x);
}

Here is how this can be translated into assembly:

  sub rsp, 4
  movss [rsp], xmm0
  fld [rsp]
  fsincos
  fmul st0, st0
  fxch
  fmul st0, st0
  faddp
  fstp [rsp]
  movss xmm0, [rsp]

These instructions will take x from register xmm0 and return the result of the identity (the number 1, as we expect) to the same register.

A brief visit to the x87 FPU

The x87 FPU consists of 8 registers (R0 to R7), which are aliased to the MMX registers - mm0 to mm7. The FPU uses them as a stack and tracks the top of the stack through the TOS (Top of Stack) or TOP (my preferred way to refer to it), which is kept in the control word. To understand how it all fits together, I will describe the three most important components of this setup for the purpose of this post:

  • FPU Stack-based operation
  • Status register
  • Tag register

From the Intel spec, these components look like this (image labels are from the Intel in case you need to refer to it):

Patch Details

The 8 registers are labelled as ST0-ST7 relative to the position of TOP. In the case above TOP is 3; therefore, the 3rd register is ST0. Since the stack grows down, ST1 is the 4th register and so on until it wraps around. When the FPU is initialized TOP is set to 0. Once a value is loaded, the TOP is decreased, its value becomes 7 and the first value added to the stack is in the 7th register. Because the MMX registers are aliased to the bottom 64 bits of the stack registers, we can also refer to them as MM0 to MM7. This is not strictly correct because the MMX registers are 64 bits and the FPU registers are 80 bits but if we squint, they are basically the same.

The spec provides an example of how the FPU stack works. I copy the image here verbatim for your understanding. Further information can be found in the spec itself.

Patch Details

The TOP value is stored along with other flags in the FPU Status Word, which we won't discuss in detail here.

Patch Details

And the tag word marks which registers are valid or empty. The Intel spec also allows marking the registers as zero or special. At the moment, tags in FEX are binary, marking the register as valid or invalid. Instead of two bits per register, we only use one bit of information per register.

Patch Details

Current issues

The way things worked in FEX before this patch landed is that stack based operations would essentially always lower to perform three operations:

  • Load stack operation argument register values from special memory location;
  • Call soft-float library to perform the operation;
  • Store result in special memory location;

For example, let's consider the instruction FADDP st4, st0. This instruction performs st0 <- f80add(st4, st0) && pop(). In x87.cpp, the lowering of this instruction involved the following steps:

  • Load current value of top from memory;
  • With the current value of top, we can load the values of the registers in the instruction, in this case st4 and st0 since they are at a memory offset from top;
  • Then we call the soft-float library function to perform 80-bit adds;
  • Finally we store the result into st4 (which is a location in memory);
  • And pop the stack, which means marking the current top value invalid and incrementing top;

In code with a lot of x87 usage, these operations become cumbersome and data movement makes the whole code extremely slow. Therefore, we redesigned the x87 system to cope better with stack operations. FEX support a reduced precision mode that uses 64bit floating points rather than 80 bits, avoiding the soft-float library calls and this work applies equally to the code in reduced precision mode.

The new pass

The main observation to understand the operation of the new pass is that when compiling a block of instructions, where multiple instructions are x87 instructions, before code generation we have an idea of what the stack will look like, and if we have complete stack information we can generate much better code than the one generated on a per-instruction basis.

Instead of lowering each x87 instruction in x87.cpp to its components as discussed earlier, we simply generate stack operation IR instructions (all of these added new since there were no stack operations in the IR). Then an optimization pass goes through the IR on a per block basis and optimizes the operations and lowers them.

Here’s a step-by-step explanation of how this scheme works for the code above. When the above code gets to the OpcodeDispatcher, there’ll be one function a reasonable amount of opcodes, but generally one per functionality, meaning that although there are several opcodes for different versions of fadd, there is a single function in OpcodeDispatcher for x87 (in x87.cpp) implementing it.

Each of these functions will just transform the incoming instruction into an IR node that deals with the stack. These stack IR operations have Stack in its name. See IR.json for a list of these.

x86 Asm IR nodes
fld qword [rsp] LoadMem + PushStack
fsincos F80SinCosStack
fmul st0, st0 F80MulStack
fxch F80StackXchange
fmul st0, st0 F80MulStack
faddp F80AddStack + PopStackDestroy
fstp qword [rsp] StoreStackMemory + PopStackDestroy

The IR will then look like:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1
	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
	(%12 i0) F80SINCOSStack
	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa
	(%15 i0) F80MulStack #0x0, #0x0
	(%16 i0) F80StackXchange #0x1
	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9
	(%19 i0) F80MulStack #0x0, #0x0
	(%20 i0) F80AddStack #0x1, #0x0
	(%21 i0) PopStackDestroy
	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

The Good Case

Remember that before this pass, we would have never generated this IR as is. There were no stack operations, therefore each of these F80MulStack operations, would generate loads from the MMX registers that are mapped into memory for each of the operands, call to F80Mul softfloat library function and then a write to the memory mapped MMX register to write the result.

However, once we get to the optimization pass a see a straight line block line this, we can step through the IR nodes and create a virtual stack with pointers to the IR nodes. I have sketched the stacks for the first few loops in the pass.

The pass will only process x87 instructions, these are the one whose IR nodes are marked with "X87: true in IR.json. Therefore the first two instructions above:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1

are ignored. But PushStack is not. PushStack will literally push the node into the virtual stack.

	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
Patch Details

Internally before we push the node into our internal stack, we update the necessary values like the top of the stack. It starts at zero and it's decremented to seven, so the value %10 i128 is inserted in what we'll call R7, now ST0.

Next up is the computation of SIN and COS. This will result in SIN ST0 replacing ST0 and then pushing COS ST0, where the value in ST0 here is %10.

	(%12 i0) F80SINCOSStack
Patch Details

Then we have two instructions that we just don't deal with in the pass, these set the C2 flag to zero - as required by FSINCOS.

	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa

Followed by a multiplication of the stack element to square them. So we square ST0.

	(%15 i0) F80MulStack #0x0, #0x0
Patch Details

The next instruction swaps ST0 with ST1 so that we can square the sine value.

	(%16 i0) F80StackXchange #0x1
Patch Details

Again, similarly two instructions that are ignored by the pass, which set flag C1 to zero - as required by FXCH.

	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9

And again we square the value at the top of the stack.

	(%19 i0) F80MulStack #0x0, #0x0
Patch Details

We are almost there - to finish off the computation we need to add these two values together.

	(%20 i0) F80AddStack #0x1, #0x0

In this case we add ST1 to ST0 and store the result in ST1.

Patch Details

This is followed by a pop of the stack.

	(%21 i0) PopStackDestroy

As expected this removed the F80Mul from the top of the stack, leaving us with the result of the computation at the top.

Patch Details

OK - there are three more instructions.

	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

The first two store the top of the stack to memory, which is the result of our computation and which maths assures us it's 1. And then we pop the stack just to clean it up, although this is not strictly necessary since we are finished anyway.

Note that we have finished analysing the block, and played it through in our virtual stack. Now we can generate the instructions that are necessary to calculate the stack values, without always load/stor'ing them back to memory. In addition to generating these instructions we also generate instructions to update the tag word, update the top of stack value and save whatever remaining values in the stack at the end of the block into memory - into their respective memory mapped MMX registers. This is the good case where all the values necessary for the computation were found in the block. However, this is not necessarily always the case.

The Bad Case

Let's say that FEX for some reason breaks the beautiful block we had before into two:

Block 1:

	%9 i64 = LoadRegister #0x4, GPR
	%10 i128 = LoadMem FPR, %9 i64, %Invalid, #0x10, SXTX, #0x1
	(%11 i0) PushStack %10 i128, %10 i128, #0x10, #0x1
	(%12 i0) F80SINCOSStack
	%13 i64 = Constant #0x0
	(%14 i8) StoreContext GPR, %13 i64, #0x3fa
	(%15 i0) F80MulStack #0x0, #0x0
	(%16 i0) F80StackXchange #0x1
	%17 i64 = Constant #0x0
	(%18 i8) StoreContext GPR, %17 i64, #0x3f9

Block 2:

	(%19 i0) F80MulStack #0x0, #0x0
	(%20 i0) F80AddStack #0x1, #0x0
	(%21 i0) PopStackDestroy
	%23 i64 = LoadRegister #0x4, GPR
	(%24 i0) StoreStackMemory %23 i64, i128, #0x1, #0x8
	(%25 i0) PopStackDestroy

When we generate code for the first block, there are absolutely no blockers and we run through it exactly as we did before, except that after StoreContext, we reach the end of the block and our virtual stack is in this state:

Patch Details

At this point, we'll do what we said we always do before at the end of the block. We save the values in the virtual stack to the respective MMX registers in memory mapped locations: %36 is saved to MM7 and %34 is saved to MM6. The TOP is recorded to be 6 and the tags for MM7 and MM6 are marked to be valid. Then we exit the block. And start analyzing the following block.

When we start the pass for the following block, our virtual stack is empty. We have no knowledge of virtual stacks for previous JITTed blocks, and we see this instruction:

	(%19 i0) F80MulStack #0x0, #0x0

We cannot play through this instruction in our virtual stack because it multiplies the value in ST0 with itself and our virtual stack does not have anything in ST0. It's empty after all.

In this case, we switch to the slow path that does exactly what we used to do before this work. We load the value of ST0 from memory by finding the current value for TOP and loading the respective register value, issue the F80Mul on this value and storing it back to the MMX register. This is then done for the remainder of the block.

The Ugly Case

The ugly case is when we actually force the slow path, not because we don't have all the values in the virtual stack but out of necessity.

There are a few instructions that trigger a _StackForceSlow() and forces us down the slow path. These are: FLDENV, FRSTOR, FLDCW. All of these load some part or all of the FPU environment from memory and it triggers a switch to the slow path so that values are loaded properly and we can then use those new values in the block. It is not impossible to think we could at least in some cases, load those values from memory, try to rebuild our virtual stack for the current block and continue without switching to the slow path but that hasn't been yet done.

There are a few instructions other instructions that trigger a _SyncStackToSlow(), which doesn't force us down the slow path but instead just updates the in-memory data from the virtual stack state. These are: FNSTENV, FNSAVE, and FNSTSW. All of these store some part or all of the FPU environment into memory and it ensures that the values in-memory are up-to-date so that the store doesn't use obsolete values.

Results

In addition to the huge savings in data movement from the loads and stores of register values, TOP and tags for each stack instruction, we also optimize memory copies through the stack. So if there's a bunch of value loaded to the stack, which are then stored to memory, we can just memcopy these values without going through the stack.

In our code size validation benchmarks from various Steam games, we saw significant reductions in code size. These size code reductions saw a similarly high jump in FPS numbers. For example:

Game Before After Reduction
Half-Life (Block 1) 2034 1368 32.7%
Half-Life (Block 2) 838 551 34.2%
Oblivion (Block 1) 34065 25782 24.3%
Oblivion (Block 2) 22089 16635 24.6%
Psychonauts (Block 1) 20545 16357 20.3%
Psychonauts Hot Loop (Matrix Swizzle) 2340 165 92.9%

Future work

The bulk of this work is now complete, however there are a few details that still need fixing. There's still the mystery of why Touhou: Luna Nights works with reduced precision but not normal precision (if anything you would expect it to be the other way around). Then there's also some work to be done on the interface between the FPU state and the MMX state as described in "MMX/x87 interaction is subtly broken".

This is something I am already focusing on and will continue working on this until complete.

September 25, 2024 12:00 AM

September 23, 2024

Eric Meyer

Announcing BCD Watch

One of the things I think we all struggle with is keeping up to date with changes in web development.  You might hear about a super cool new CSS feature or JavaScript API, but it’s never supported by all the browsers when you hear about it, right?  So you think “I’ll have to make sure check in on that again later” and quickly forget about it.  Then some time down the road you hear about it again, talked about like it’s been best practice for years.

To help address this, Brian Kardell and I have built a service called BCD Watch, with a nicely sleek design by Stephanie Stimac.  It’s free for all to use thanks to the generous support of Igalia in terms of our time and hosting the service.

What BCD Watch does is, it grabs releases of the Browser Compatibility Data (BCD) repository that underpins the support tables on MDN and services like caniuse.com.  It then analyzes what’s changed since the previous release.

Every Monday, BCD Watch produces two reports.  The Weekly Changes Report lists all the changes to BCD that happened in the previous week — what’s been added, removed, or renamed in the whole of BCD.  It also tells you which of the Big Three browsers newly support (or dropped support for) each listed feature, along with a progress bar showing how close the feature is to attaining Baseline status.

The Weekly Baselines Report is essentially a filter of the first report: instead of all the changes, it lists only changes to Baseline status, noting which features are newly Baseline.  Some weeks, it will have nothing to report. Other weeks, it will list everything that’s reached Baseline’s “Newly Available” tier.

Both reports are available as standalone RSS, Atom, and JSON feeds, which are linked at the bottom of each report.  So while you can drop in on the site every week to bask in the visual design if you want (and that’s fine!), you can also get a post or two in your feed reader every Monday that will get you up to date on what’s been happening in the world of web development.

If you want to look back at older reports, the home page has a details/summary collapsed list of weekly reports going back to the beginning of 2022, which we generated by downloading all the BCD releases back that far, and running the report script against them.

If you encounter any problems with BCD Watch or have suggestions for improvements, please feel free to open an issue in the repository, or submit suggested changes via pull request if you like.  We do expect the service to evolve over time, perhaps adding a report for things that have hit Baseline Widely Available status (30 months after hitting all three engines) or reports that look at more than just the Big Three engines.  Hard to say!  Always in motion, the future is.

Whatever we may add, though, we’ll keep BCD Watch centered on the idea of keeping you better up to date on web dev changes, once a week, every week.  We really hope this is useful and interesting for you!  We’ve definitely appreciated having the weekly updates as we built and tested this, and we think a lot of you will, too.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at September 23, 2024 03:54 PM

Brian Kardell

Keeping up: Signal to Noise

Keeping up: Signal to Noise

Last year in December, Eric Meyer and I did an episode of Igalia Chats titled "The Struggle to Keep Up with Web Tech". It recounts the multi-dimensional challenges that developers can face in trying to keep up with developments in the web platform.

I've thought for a long time now that we all need better ways to keep up, and especially to not miss the most important stuff. Today, I'm happy to annouce what we think is a real help at that.

Introducing bcd-watch

If you're not familiar, "bcd" is short for "browser compatibility data". The browser-compat-data repository is the source of what is shown on MDN and used for many other critical things as well. All of the browser vendors, contributors and tech writers strive together to keep the information there accurate and up to date.

Back in 2022, Eric and I had the idea to write a tool which would fetch this data from time to time, compare it with the previous data and put together a brief report that you could view, and an RSS feed you could subscribe to. I wrote a thing about BCD and the fact that you could actually even use GitHub's API to subscribe to the release notes, which are already pretty nice. This was November 2022, and it was teasing a few screenshots from the prototype we had.

A month and 5 days later, I unexpectedly had a heart attack.

So, it got shoved to the back burner for a while but today we're finally sharing it. My colleage Eric Meyer has a a much better post annoucing this that I'll just point you to: Annoucing BCD-Watch..

September 23, 2024 04:00 AM

September 18, 2024

Andy Wingo

whippet progress update: feature-complete!

Greetings, gentle readers. Today, an update on recent progress in the Whippet embeddable garbage collection library.

feature-completeness

When I started working on Whippet, two and a half years ago already, I was aiming to make a new garbage collector for Guile. In the beginning I was just focussing on proving that it would be advantageous to switch, and also learning how to write a GC. I put off features like ephemerons and heap resizing until I was satisfied with the basics.

Well now is the time: with recent development, Whippet is finally feature-complete! Huzzah! Now it’s time to actually work on getting it into Guile. (I have some performance noodling to do before then, adding tracing support, and of course I have lots of ideas for different collectors to build, but I don’t have any missing features at the current time.)

heap resizing

When you benchmark a garbage collector (or a program with garbage collection), you generally want to fix the heap size. GC overhead goes down with more memory, generally speaking, so you don’t want to compare one collector at one size to another collector at another size.

(Unfortunately, many (most?) benchmarks of dynamic language run-times and the programs that run on them fall into this trap. Imagine you are benchmarking a program before and after a change. Automatic heap sizing is on. Before your change the heap size is 200 MB, but after it is 180 MB. The benchmark shows the change to regress performance. But did it really? It could be that at the same heap size, the change improved performance. You won’t know unless you fix the heap size.)

Anyway, Whippet now has an implementation of MemBalancer. After every GC, we measure the live size of the heap, and compute a new GC speed, as a constant factor to live heap size. Every second, in a background thread, we observe the global allocation rate. The heap size is then the live data size plus the square root of the live data size times a factor. The factor has two components. One is constant, the expansiveness of the heap: the higher it is, the more room the program has. The other depends on the program, and is computed as the square root of the ratio of allocation speed to collection speed.

With MemBalancer, the heap ends up getting resized at every GC, and via the heartbeat thread. During the GC it’s easy because it’s within the pause; no mutators are running. From the heartbeat thread, mutators are active: taking the heap lock prevents concurrent resizes, but mutators are still consuming empty blocks and producing full blocks. This works out fine in the same way that concurrent mutators is fine: shrinking takes blocks from the empty list one by one, atomically, and returns them to the OS. Expanding might reclaim paged-out blocks, or allocate new slabs of blocks.

However, even with some exponentially weighted averaging on the speed observations, I have a hard time understanding whether the algorithm is overall a good thing. I like the heartbeat thread, as it can reduce memory use of idle processes. The general square-root idea sounds nice enough. But adjusting the heap size at every GC feels like giving control of your stereo’s volume knob to a hyperactive squirrel.

GC collection time vs memory usage graph comparing V8 with MemBalancer. MemBalancer is shown to be better.
Figure 5 from the MemBalancer paper

Furthermore, the graphs in the MemBalancer paper are not clear to me: the paper claims more optimal memory use even in a single-heap configuration, but the x axis of the graphs is “average heap size”, which I understand to mean that maximum heap size could be higher than V8’s maximum heap size, taking into account more frequent heap size adjustments. Also, some measurement of total time would have been welcome, in addition to the “garbage collection time” on the paper’s y axis; there are cases where less pause time doesn’t necessarily correlate to better total times.

deferred page-out

Motivated by MemBalancer’s jittery squirrel, I implemented a little queue for use in paging blocks in and out, for the mmc and pcc collectors: blocks are quarantined for a second or two before being returned to the OS via madvise(MADV_DONTNEED). That way if you release a page and then need to reacquire it again, you can do so without bothering the kernel or other threads. Does it matter? It seems to improve things marginally and conventional wisdom says to not mess with the page table too much, but who knows.

mmc rename

Relatedly, Whippet used to be three things: the project itself, consisting of an API and a collection of collectors; one specific collector; and one specific space within that collector. Last time I mentioned that I renamed the whippet space to the nofl space. Now I finally got around to renaming what was the whippet collector as well: it is now the mostly-marking collector, or mmc. Be it known!

Also as a service note, I removed the “serial copying collector” (scc). It had the same performance as the parallel copying collector with parallelism=1, and pcc can be compiled with GC_PARALLEL=0 to explicitly choose the simpler serial grey-object worklist.

per-object pinning

The nofl space has always supported pinned objects, but it was never exposed in the API. Now it is!

Of course, collectors with always-copying spaces won’t be able to pin objects. If we want to support use of these collectors with embedders that require pinning, perhaps because of conservative root scanning, we’d need to switch to some kind of mostly-copying algorithm.

safepoints without allocation

Another missing feature was a safepoint API. It hasn’t been needed up to now because my benchmarks all allocate, but for programs that have long (tens of microseconds maybe) active periods without allocation, you want to be able to stop them without waiting too long. Well we have that exposed in the API now.

removed ragged stop

Following on my article on ragged stops, I removed ragged-stop marking from mmc, for a nice net 180 line reduction in some gnarly code. Speed seems to be similar.

next up: tracing

And with that, I’m relieved to move on to the next phase of Whippet development. Thanks again to NLnet for their support of this work. Next up, adding fine-grained tracing, so that I can noodle a bit on performance. Happy allocating!

by Andy Wingo at September 18, 2024 02:18 PM