Blogging about open source virtualization
https://planet.virt-tools.org/
Thomas Huth: How to use secure RHCOS images on s390x
http://people.redhat.com/~thuth/blog/general/2024/10/23/secure-rhcos.html
<p>Recently, I needed to debug a problem that only occurred in
<a href="https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/architecture/architecture-rhcos">RHCOS</a>
images that are running in
<a href="https://www.ibm.com/docs/en/linux-on-systems?topic=execution-introduction">secure execution</a>
mode on an IBM Z system. Since I don’t have an
<a href="https://docs.redhat.com/en/documentation/openshift_container_platform">OCP</a>
installation at hand, I wanted to run such an image directly with
<a href="https://www.qemu.org/">QEMU</a> or <a href="https://www.libvirt.org/">libvirt</a>.
This sounded easy at first glance, since there are qcow2 images available
for RHCOS, but in the end, it was quite tricky to get this working, so I’d
like to summarize the steps here; maybe it’s helpful for somebody else, too.
Since the “secex” images are encrypted, you cannot play the usual tricks
with e.g. <a href="https://www.libguestfs.org/">guestfs</a> here; you have to go through
the ignition process of the image first.
Well, maybe there is already the right documentation for this available
somewhere and I missed it, but most other documents mainly talk about x86 or
normal (unencrypted) images (like the one for
<a href="https://docs.fedoraproject.org/en-US/fedora-coreos/provisioning-libvirt/#_launching_a_vm_instance">Fedora CoreOS on libvirt</a> ),
so I think it will be helpful to have this summary here anyway.</p>
<h2 id="preparation">Preparation</h2>
<p>First, make sure that you have the right tools installed for this task:</p>
<pre><code class="language-sh"><span class="nb">sudo </span>dnf <span class="nb">install </span>butane wget mkpasswd openssh virt-install qemu-img</code></pre>
<p>Since we are interested in the secure execution image, we have to
download the image with “secex” in the name, together with the right
GPG key that is required for encrypting the config file later, for
example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.16/4.16.3/rhcos-qemu-secex.s390x.qcow2.gz
wget https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.16/4.16.3/rhcos-ignition-secex-key.gpg.pub
</code></pre></div></div>
<p>Finally, uncompress the image. And since we want to avoid modifying the
original image, let’s also create an overlay qcow2 image for it:</p>
<pre><code class="language-sh"><span class="nb">gzip</span> <span class="nt">-d</span> rhcos-qemu-secex.s390x.qcow2.gz
qemu-img create <span class="nt">-f</span> qcow2 <span class="nt">-b</span> rhcos-qemu-secex.s390x.qcow2 <span class="nt">-F</span> qcow2 rhcos.qcow2</code></pre>
<h2 id="creation-of-the-configuration-file">Creation of the configuration file</h2>
<p>To be able to log in to your guest via ssh later, you need an ssh key,
so let’s create one and add it to your local ssh-agent:</p>
<pre><code class="language-sh">ssh-keygen <span class="nt">-f</span> rhcos-key
ssh-add rhcos-key</code></pre>
<p>If you also want to log in on the console via password, create a
password hash with the <code class="language-plaintext highlighter-rouge">mkpasswd</code> tool, too.</p>
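<p>For example, one way to generate such a hash (note: the available hashing methods depend on your mkpasswd version; sha-512 is another common choice):</p>
<pre><code>mkpasswd --method=yescrypt
</code></pre>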
<p>Now create a butane configuration file and save it as “<code class="language-plaintext highlighter-rouge">config.bu</code>”:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>variant: fcos
version: 1.4.0
passwd:
users:
- name: core
ssh_authorized_keys:
- INSERT_THE_CONTENTS_OF_YOUR_rhcos-key.pub_FILE_HERE
password_hash: INSERT_THE_HASH_FROM_mkpasswd_HERE
groups:
- wheel
storage:
files:
- path: /etc/se-hostkeys/ibm-z-hostkey-1
overwrite: true
contents:
local: HKD.crt
systemd:
units:
- name: [email protected]
mask: false
</code></pre></div></div>
<p>Make sure to replace the “INSERT_…” markers in the file with the contents
of your <code class="language-plaintext highlighter-rouge">rhcos-key.pub</code> and the hash from <code class="language-plaintext highlighter-rouge">mkpasswd</code>, and also
make sure to have the host key document (required for encrypting the
guest with <code class="language-plaintext highlighter-rouge">genprotimg</code>) available as <code class="language-plaintext highlighter-rouge">HKD.crt</code> in the current directory.</p>
<p>Next, the butane config file needs to be converted into an ignition file,
which then needs to be encrypted with the GPG key of the RHCOS image:</p>
<pre><code class="language-sh">butane <span class="nt">-d</span> <span class="nb">.</span> config.bu <span class="o">></span> config.ign
gpg <span class="nt">--recipient-file</span> rhcos-ignition-secex-key.gpg.pub <span class="nt">--yes</span> <span class="se">\</span>
<span class="nt">--output</span> config.crypted <span class="nt">--armor</span> <span class="nt">--encrypt</span> config.ign</code></pre>
<h2 id="ignition-of-the-guest-image">Ignition of the guest image</h2>
<p>The encrypted config file can now be used to start the ignition of the
guest. On s390x, the config file is not presented via the “fw_cfg”
mechanism to the guest (like it is done on x86), but with a drive that
has a special serial number. Thus QEMU should be started like this:</p>
<pre><code class="language-sh">/usr/libexec/qemu-kvm <span class="nt">-d</span> guest_errors <span class="nt">-accel</span> kvm <span class="nt">-m</span> 4G <span class="nt">-smp</span> 4 <span class="nt">-nographic</span> <span class="se">\</span>
<span class="nt">-object</span> s390-pv-guest,id<span class="o">=</span>pv0 <span class="nt">-machine</span> confidential-guest-support<span class="o">=</span>pv0 <span class="se">\</span>
<span class="nt">-drive</span> <span class="k">if</span><span class="o">=</span>none,id<span class="o">=</span>dr1,file<span class="o">=</span>rhcos.qcow2,auto-read-only<span class="o">=</span>off,cache<span class="o">=</span>unsafe <span class="se">\</span>
<span class="nt">-device</span> virtio-blk,drive<span class="o">=</span>dr1 <span class="nt">-netdev</span> user,id<span class="o">=</span>n1,hostfwd<span class="o">=</span>tcp::2222-:22 <span class="se">\</span>
<span class="nt">-device</span> virtio-net-ccw,netdev<span class="o">=</span>n1 <span class="se">\</span>
<span class="nt">-drive</span> <span class="k">if</span><span class="o">=</span>none,id<span class="o">=</span>drv_cfg,format<span class="o">=</span>raw,file<span class="o">=</span>config.crypted,readonly<span class="o">=</span>on <span class="se">\</span>
<span class="nt">-device</span> virtio-blk,serial<span class="o">=</span>ignition_crypted,iommu_platform<span class="o">=</span>on,drive<span class="o">=</span>drv_cfg</code></pre>
<p>This should start the ignition process during the first boot of the guest.
During future boots of the guest, you don’t have to specify the drive with
the “config.crypted” file anymore.
Once the ignition is done, you can log in to the guest either on the
console with the password that you created with <code class="language-plaintext highlighter-rouge">mkpasswd</code>, or via ssh:</p>
<pre><code class="language-sh">ssh <span class="nt">-p</span> 2222 core@localhost</code></pre>
<p>Now you should be able to use the image. But keep in mind that this is an
rpm-ostree based image, so for installing additional packages, you have to
use <code class="language-plaintext highlighter-rouge">rpm-ostree install</code> instead of <code class="language-plaintext highlighter-rouge">dnf install</code> here. And the kernel
can be replaced like this, for example:</p>
<pre><code class="language-sh"><span class="nb">sudo </span>rpm-ostree override replace <span class="se">\</span>
kernel-5.14.0-...s390x.rpm <span class="se">\</span>
kernel-core-5.14.0-...s390x.rpm <span class="se">\</span>
kernel-modules-5.14.0-...s390x.rpm <span class="se">\</span>
kernel-modules-core-5.14.0-...s390x.rpm <span class="se">\</span>
kernel-modules-extra-5.14.0-...s390x.rpm</code></pre>
<p>That’s it! Now you can enjoy your configured secure-execution RHCOS image!</p>
<p>Special thanks to Nikita D. for helping me understand the ignition
process of the secure execution images.</p>Wed, 23 Oct 2024 16:30:00 +0000

Stefan Hajnoczi: Video and slides available for "IOThread Virtqueue Mapping" talk at KVM Forum 2024
https://blog.vmsplice.net/2024/10/video-and-slides-available-for-iothread.html
<p>My KVM Forum 2024 talk "IOThread Virtqueue Mapping: Improving virtio-blk SMP scalability in QEMU" is now <a href="https://www.youtube.com/watch?v=tVIRDdf79-0">available on YouTube</a>. The slides are also available <a href="https://vmsplice.net/~stefan/stefanha-kvm-forum-2024.pdf">here</a>.</p>
<p>IOThread Virtqueue Mapping is a new QEMU feature for configuring multiple IOThreads that will handle a virtio-blk device's virtqueues. This means QEMU can take advantage of the host's Linux multi-queue block layer and assign CPUs to I/O processing. Giving additional resources to virtio-blk emulation allows QEMU to achieve higher IOPS and saturate fast local NVMe drives. This is especially important for applications that submit I/O on many vCPUs simultaneously - a workload that QEMU had trouble keeping up with in the past.</p>
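<p>As a rough sketch of what this can look like on the QEMU command line (not taken from the talk or blog post; check the documentation of your QEMU version for the exact property syntax), two IOThreads could be mapped to a virtio-blk device roughly like this:</p>
<pre><code>qemu-system-x86_64 ... \
  -object iothread,id=iot0 -object iothread,id=iot1 \
  -blockdev file,filename=disk.img,node-name=disk0 \
  -device '{"driver":"virtio-blk-pci","drive":"disk0",
            "iothread-vq-mapping":[{"iothread":"iot0"},{"iothread":"iot1"}]}'</code></pre>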
<p>You can read more about IOThread Virtqueue Mapping in this Red Hat <a href="https://developers.redhat.com/articles/2024/09/05/scaling-virtio-blk-disk-io-iothread-virtqueue-mapping">blog post</a>.</p>Tue, 01 Oct 2024 14:02:33 +0000

QEMU project: QEMU version 9.1.0 released
https://www.qemu.org/2024/09/03/qemu-9-1-0/
<p>We’d like to announce the availability of the QEMU 9.1.0 release. This release contains 2800+ commits from 263 authors.</p>
<p>You can grab the tarball from our <a href="https://www.qemu.org/download/#source">download page</a>. The full list of changes is available <a href="https://wiki.qemu.org/ChangeLog/9.1">in the changelog</a>.</p>
<p>Highlights include:</p>
<ul>
<li>migration: compression offload support via Intel In-Memory Analytics Accelerator (IAA) or User Space Accelerator Development Kit (UADK), along with enhanced support for postcopy failure recovery</li>
<li>virtio: support for VIRTIO_F_NOTIFICATION_DATA, allowing guest drivers to provide additional data as part of sending device notifications for performance/debug purposes</li>
<li>guest-agent: support for guest-network-get-route command on linux, guest-ssh-* commands on Windows, and enhanced CLI support for configuring allowed/blocked commands</li>
<li>block: security fixes for QEMU NBD server and NBD TLS encryption</li>
<li>ARM: emulation support for FEAT_NMI, FEAT_CSV2_3, FEAT_ETS2, FEAT_Spec_FPACC, FEAT_WFxT, FEAT_Debugv8p8 architecture features</li>
<li>ARM: nested/two-stage page table support for emulated SMMUv3</li>
<li>ARM: xilinx_zynq board support for cache controller and multiple CPUs, and B-L475E-IOT01A board support for a DM163 display</li>
<li>LoongArch: support for directly booting an ELF kernel and for running up to 256 vCPUs via extioi virt extension</li>
<li>LoongArch: enhanced debug/GDB support</li>
<li>RISC-V: support for version 1.13 of privileged architecture specification</li>
<li>RISC-V: support for Zve32x, Zve64x, Zimop, Zcmop, Zama16b, Zabha, Zawrs, and Smcntrpmf extensions</li>
<li>RISC-V: enhanced debug/GDB support and general fixes</li>
<li>SPARC: emulation support for FMAF, IMA, VIS3, and VIS4 architecture features</li>
<li>x86: KVM support for running AMD SEV-SNP guests</li>
<li>x86: CPU emulation support for Icelake-Server-v7, SapphireRapids-v3, and SierraForest</li>
<li>and lots more…</li>
</ul>
<p>Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!</p>Tue, 03 Sep 2024 23:08:00 +0000

KVM on Z: New Feature: Installation Assistant for Linux on IBM Z
https://kvmonz.blogspot.com/2024/08/new-feature-installation-assistant-for.html
<p>Ever struggled to create configuration files for starting Linux on IBM Z and LinuxONE installations? Fear no more, we have you covered now: A new assistant available online will help you create parameter files!<br />Writing parameter files can be a challenge, with bugs triggering cycles with lengthy turnaround times. Our new installation assistant generates installer parameter files by walking you through a step-by-step process, where you answer simple questions to generate a parameter file. It comes with contextual help at every stage, so you can follow along with what is happening!<br />It currently supports OSA and PCI networking devices, IPv4/v6, and VLAN installations, and targets RHEL 9 and SLES 15 SP5 or later.</p><p>Access the assistant at <a href="https://ibm.github.io/liz/">https://ibm.github.io/liz/</a></p><div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlSSw2Fb3mTt842HRx5vmttgxEIFM6sT2ATJUwI1_em8nB6EZUeem1rLHhHrHnRWkigCvyCtc4IxEHyDpPftroT3L-JDhAH6z0sEBQE3EELOQP9NX7TQ7pQbm9udmsz9ax1tgkv8R451GgjsMqupKh62S83221AAZmwIV4xH0HSlBrWGd_WFdXO3o5wYw/s1351/installation_assistant.png"><img border="0" height="228" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlSSw2Fb3mTt842HRx5vmttgxEIFM6sT2ATJUwI1_em8nB6EZUeem1rLHhHrHnRWkigCvyCtc4IxEHyDpPftroT3L-JDhAH6z0sEBQE3EELOQP9NX7TQ7pQbm9udmsz9ax1tgkv8R451GgjsMqupKh62S83221AAZmwIV4xH0HSlBrWGd_WFdXO3o5wYw/w400-h228/installation_assistant.png" width="400" /></a></div>Fri, 09 Aug 2024 12:11:06 +0000

Gerd Hoffmann: modern uefi network booting
https://www.kraxel.org/blog/2024/07/uefi-network-boot/
<h3>Network boot kickoff.</h3>
<p>
Step number one for the firmware on any system is sending out a DHCP
request, asking the DHCP server for an IP address, the boot server
(called "next server" in dhcp terms) and the bootfile.
</p>
<p>
On success the firmware will contact the boot server, fetch the
bootfile and hand over control to it. The traditional method
to serve the bootfile is tftp (trivial file transfer
protocol). Modern systems support http too. I have an article
on <a href="https://www.kraxel.org/blog/2021/09/vm-network-boot/">setting up the dhcp
server for virtual machines</a> you might want to check out.
</p>
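<p>
As an illustration (my sketch, not from this post), a dnsmasq setup
serving an EFI bootloader over tftp could look roughly like this;
file names, paths and the server address are placeholders:
</p>
<pre><code># /etc/dnsmasq.d/netboot.conf (sketch)
dhcp-match=set:efi-x86_64,option:client-arch,7   # 7 = x86-64 UEFI client
dhcp-boot=tag:efi-x86_64,grubx64.efi,,192.168.1.1
enable-tftp
tftp-root=/srv/tftp
</code></pre>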
<p>
What the bootfile is expected to be depends on the system being
booted. There are embedded systems -- for example IP phones --
which load the complete system software that way.
</p>
<p>
When booting UEFI systems the bootfile typically is an EFI binary.
That is not the only option though, more on that below.
</p>
<h3>UEFI network boot with a boot loader.</h3>
<p>
The traditional way to netboot linux on UEFI systems is using a
bootloader. The bootfile location handed out by the DHCP server
points to the bootloader and is the first file loaded over the
network. Typical choices for the bootloader
are <code>grub.efi</code>, <code>snponly.efi</code>
(from <a href="https://ipxe.org/">ipxe project</a>)
or <code>syslinux.efi</code>.
</p>
<p>
Next step is the bootloader fetching the config file. That works
the same way the bootloader itself was loaded, using the EFI network
driver provided by either the platform firmware (typically the case
for onboard NICs) or via PCI option rom (plug-in NICs). The
bootloader does not need its own network drivers.
</p>
<p>
The loaded config file controls how the boot will continue. This
can be very simple, three lines asking the bootloader to fetch
kernel + initrd from a fixed location, then start the kernel with
some command line. This can also be very complex, creating an
interactive menu system where the user has dozens of options to
choose from (see for
example <a href="https://netboot.xyz">netboot.xyz</a>).
</p>
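<p>
For instance, a minimal grub config fetched over the network could
look roughly like this (my sketch; server name, paths and kernel
command line are placeholders):
</p>
<pre><code>menuentry 'netboot linux' {
    linux  (http,boot.example.com)/images/vmlinuz console=ttyS0
    initrd (http,boot.example.com)/images/initrd.img
}
</code></pre>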
<p>
Now the user can -- in case the config file defines menus -- pick
what he wants to boot.
</p>
<p>
Final step is the bootloader fetching the kernel and initrd (again
using the EFI network driver) and starting the kernel. Voila.
</p>
<h3>Boot loaders and secure boot.</h3>
<p>
When using secure boot there is one more intermediate step needed:
The first binary needs to be <code>shim.efi</code>, which in turn
will download the actual bootloader. Most distros ship
only <code>grub.efi</code> with a secure boot signature, which
limits the boot loader choice to that.
</p>
<p>
Also all components (shim + grub + kernel) must come from the same
distribution. <code>shim.efi</code> has the distro secure boot
signing certificate embedded, so Fedora shim will only boot
grub + kernel with a secure boot signature from Fedora.
</p>
<h3>Netbooting machines without EFI network driver.</h3>
<p>
You probably do not have to worry about this. Shipping systems with
an EFI network driver and UEFI network boot support is a standard feature
today; <code>snponly.efi</code> should be used for these systems.
</p>
<p>
When using older hardware network boot support might be missing
though. Should that be the case
the <a href="https://ipxe.org/">ipxe project</a> can help because it
also features a large collection of firmware network drivers. It
ships an all-in-one EFI binary named <code>ipxe.efi</code> which
includes the bootloader and scripting features (which are
in <code>snponly.efi</code> too) and additionally all the ipxe
hardware drivers.
</p>
<p>
That way <code>ipxe.efi</code> can boot from the network even if the
firmware does not provide a driver. In that
case <code>ipxe.efi</code> itself must be loaded from local storage
though. You can download the efi binary and ready-to-use ISO/USB
images from <a href="https://boot.ipxe.org">boot.ipxe.org</a>.
</p>
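<p>
To give an idea of the scripting side (my example, not from this
post), a minimal ipxe script chained after DHCP could look like this;
the URL is a placeholder:
</p>
<pre><code>#!ipxe
dhcp
chain http://boot.example.com/menu.ipxe
</code></pre>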
<h3>UEFI network boot with a UKI.</h3>
<p>
A UKI
(<a href="https://github.com/uapi-group/specifications/blob/main/specs/unified_kernel_image.md">unified
kernel image</a>) is an EFI binary bundle. It contains a linux
kernel, an initrd, the command line and a few more optional
components (not covered here) in sections of the EFI binary. Also the
systemd efi stub, which handles booting the bundled linux kernel
with the bundled initrd.
</p>
<p>
One advantage is that the secure boot signature of a UKI image will
cover all components and not only the kernel itself, which is a big
step forward for linux boot security.
</p>
<p>
Another advantage is that a UKI is self-contained. It does not need
a bootloader which knows how to boot linux kernels and handle initrd
loading. It is simply an EFI binary which you can start any way you
want, for example from the EFI shell prompt.
</p>
<p>
The latter makes UKIs interesting for network booting, because they
can be used as bootfile too. The DHCP server hands out the UKI
location, the UEFI firmware fetches the UKI and starts it. Done.
</p>
<p>
Combining the bootloader and UKI approaches is possible too. UEFI
bootloaders can load not only linux kernels. EFI binaries
(including UKIs) can be loaded too, in case of <code>grub.efi</code>
with the <code>chainloader</code> command. So if you want
interactive menus to choose a UKI to boot you can do that.
</p>
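<p>
A matching grub config entry could look roughly like this (my sketch;
the location and file name are placeholders):
</p>
<pre><code>menuentry 'boot UKI' {
    chainloader (http,boot.example.com)/uki/linux.efi
}
</code></pre>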
<h3>UEFI network boot with an ISO image.</h3>
<p>
Modern UEFI implementations can netboot ISO images too.
Unfortunately there are a few restrictions though:
</p>
<ol>
<li>
It is a relatively new feature. It has existed for a few years already
in edk2, but with the glacial speed at which firmware feature updates are
delivered (if at all) this means there is hardware in the wild
which does not support it.
</li>
<li>
It is only supported for HTTP boot. Which makes sense given that
ISO images can be bulky and the http protocol typically is much
faster than the tftp protocol used by PXE boot. Nevertheless you
might need additional setup steps because of this.
</li>
</ol>
<p>
When the UEFI firmware gets an ISO image as bootfile from the DHCP
server it will load the image into a ramdisk, register the ramdisk
as block device and try to boot from it.
</p>
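<p>
For UEFI http boot the DHCP server has to hand out a full URL and the
HTTPClient vendor class. With dnsmasq that could be sketched roughly
like this (my example; server name and image path are placeholders):
</p>
<pre><code>dhcp-vendorclass=set:efi-http,HTTPClient
dhcp-option-force=tag:efi-http,60,HTTPClient
dhcp-boot=tag:efi-http,"http://boot.example.com/images/netinst.iso"
</code></pre>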
<p>
From that point on booting works the same way booting from a local
cdrom device works. The firmware will look for the boot loader on
the ramdisk and load it. The bootloader will find the other
components needed on the ramdisk, i.e. kernel and initrd in case of
linux. All without any network access.
</p>
<p>
The UEFI firmware will also create ACPI tables for a pseudo nvdimm
device. That way the booted linux kernel will find the ramdisk too.
You can use the standard Fedora / CentOS / RHEL netinst ISO image,
linux will find the <code>images/install.img</code> on the ramdisk
and boot up all the way to anaconda. With enough RAM you can even
load the DVD with all packages, then do the complete system install
from ramdisk.
</p>
<p>
The big advantage of this approach is that the netboot workflow
becomes very similar to other installation workflows. It's not the
odd kid on the block any more where loading kernel and initrd works
in a completely different way. The very same ISO image can be:
</p>
<ol>
<li>
Burned to a physical cdrom and used to boot a physical machine.
</li>
<li>
In many cases the ISO images are hybrid, so they can be flashed
to a USB stick too for booting a physical machine.
</li>
<li>
The ISO image can be attached as virtual device to a virtual
machine.
</li>
<li>
On server grade managed hardware the ISO image can be attached as
virtual media using redfish and the BMC.
</li>
<li>
And finally: The ISO image can be loaded into a ramdisk via UEFI
http boot.
</li>
</ol>
<p>
Bonus: secure boot support suddenly isn't a headache any more.
</p>
<h3>The kernel command line.</h3>
<p>
There is one problem with the fancy new world though. We have lots
of places in the linux world which depend on the linux kernel
command line for system configuration. For example anaconda expects
getting the URL of the install repository and the kickstart file
that way.
</p>
<p>
When using a boot loader that is simple. The kernel command line
simply goes into the boot loader config file.
</p>
<p>
With ISO images it is more complicated; changing the grub config
file on an ISO image is a cumbersome process. Also ISO images are
not exactly small, so install images with customized grub.cfg need
quite some storage space.
</p>
<p>
UKIs can pass through command line arguments to the linux kernel,
but that is only allowed in case secure boot is disabled. When
using UKIs with secure boot the best option is to use the UKIs built
and signed on distro build infrastructure. Which implies using the
kernel command line for customization is not going to work with
secure boot enabled.
</p>
<p>
So, all of the above (and UKIs in general) will work better if we
can replace the kernel command line as universal configuration
vehicle with something else. Which most likely will not be a single
thing but a number of different approaches depending on the use
case. Some steps into that direction did happen already. Systemd
can <a href="https://uapi-group.org/specifications/specs/discoverable_partitions_specification/">autodetect
partitions</a> (so booting from disk without <code>root=...</code>
on the kernel command line works).
And <a href="https://systemd.io/CREDENTIALS/">systemd
credentials</a> can be used to configure some aspects of a linux
system. There is still a loooong way to go though.
</p>Mon, 08 Jul 2024 22:00:00 +0000

KVM on Z: New Video: Configuring Crypto Express Adapters for KVM Guests
https://kvmonz.blogspot.com/2024/07/new-video-configuring-crypto-express.html
<p>A new video illustrating the steps to perform on a KVM host and in a virtual server configuration to make AP queues of cryptographic adapters available to a KVM guest can be found <a href="https://video.ibm.com/recorded/133759252">here</a>.</p>Tue, 02 Jul 2024 09:14:18 +0000

KVM on Z: virtio-blk scalability improvements
https://kvmonz.blogspot.com/2024/06/virtio-blk-scalability-improvements.html
<p><a href="https://wiki.qemu.org/ChangeLog/9.0#Block_devices" target="_blank">QEMU 9.0</a> and <a href="https://www.debugpoint.com/libvirt-10-0/" target="_blank">libvirt 10.0</a> introduced an important improvement for disk (virtio-blk) scalability. Until now, a virtio-blk device could use one iothread in the host (or the QEMU main thread).<br />For a while it was already possible to specify multiple virtio queues for a single virtio-block device, and the Linux guest driver was able to exploit thse concurrently:</p><pre class="literal-block"><span><driver name='qemu' queues='3'></span></pre><p>All queues have been handled by one iothread. This introduced a limit for the maximum amount of IO requests per seconds per disk no matter how many queues have been defined.</p><p>Now it is possible to assign a queue to a given host iothread allowing for higher throughput per device:</p><pre class="literal-block"><span><driver name='qemu' queues='3'><br /> <iothreads><br /> <iothread id='2'><br /> <queue id='1'/><br /> </iothread><br /> <iothread id='3'><br /> <queue id='0'/><br /> <queue id='2'/><br /> </iothread><br /> </iothreads><br /></driver></span></pre><p><span>Example with 3 queues to illustrate the possibility to have 2 queues on one iothread and one queue on another. In real life 2 or 4 queues make more sense.</span></p><p>Initial tests showed improved performance and reduced CPU cycles when going from 1 to 2 queues. More performance analysis needs to happen but this looks like a very promising improvement and going from 1 to 2 is almost a no-brainer. Adding more queues continues to improve the performance, but also increases the overall CPU consumption so this needs additional considerations.</p><p>Sharing iothreads across multiple disks continues to be possible. </p><p>This feature is also being backported into some distributions like RHEL 9.4 or will be available via regular QEMU/libvirt upgrades.</p>Tue, 18 Jun 2024 11:39:39 +0000KVM on Z: New Release: Ubuntu 24.04https://kvmonz.blogspot.com/2024/04/new-release-ubuntu-2404.html
https://kvmonz.blogspot.com/2024/04/new-release-ubuntu-2404.html
<p>Canonical released a new version of their Ubuntu server offering, <a href="http://releases.ubuntu.com/24.04/">Ubuntu Server 24.04</a>!</p><p>Highlights include</p><ul><li>HSM support for Secure Execution</li><li>Further Crypto enhancements and extensions</li></ul><p>See the announcement on the mailing list <a href="https://lists.ubuntu.com/archives/ubuntu-announce/2024-April/000301.html">here</a>, and the blog entry at Canonical with all Z-specific highlights <a href="https://ubuntu-on-big-iron.blogspot.com/2024/04/ubuntu-server-24.04-lts.html">here</a>.</p><p>This release is very significant, since it marks a so-called LTS (Long
Term Support) release, granting an extended service timeframe of up to 10 years, as illustrated <a href="https://ubuntu.com/about/release-cycle">here</a>.</p>Thu, 23 May 2024 08:07:23 +0000

QEMU project: KVM Forum 2024: Call for presentations
https://www.qemu.org/2024/05/06/kvm-forum-cfp/
<p>The <a href="https://kvm-forum.qemu.org/">KVM Forum 2024</a> conference will take
place in Brno, Czech Republic on September 22-23, 2024. KVM Forum brings
together the Linux virtualization community, especially around the KVM stack,
including QEMU and other virtual machine monitors.</p>
<p>The Call for Presentations is open until June 8, 2024. You are invited to
submit presentation proposals via the <a href="https://kvm-forum.qemu.org/2024/cfp/">KVM Forum CfP
page</a>. All presentation slots will be
25 minutes + 5 minutes for questions.</p>
<p>Suggested topics include:</p>
<ul>
<li>Scalability and Optimization</li>
<li>Hardening and security</li>
<li>Confidential computing</li>
<li>Testing</li>
<li>KVM and the Linux Kernel
<ul>
<li>New Features and Architecture Ports</li>
<li>Device Passthrough: VFIO, mdev, vDPA</li>
<li>Network Virtualization</li>
<li>Virtio and vhost</li>
</ul>
</li>
<li>Virtual Machine Monitors and Management
<ul>
<li>VMM Implementation: APIs, Live Migration, Performance Tuning, etc.</li>
<li>Multi-process VMMs: vhost-user, vfio-user, QEMU Storage Daemon, SPDK</li>
<li>QEMU without KVM: Hypervisor.framework, Windows Hypervisor Platform, etc.</li>
<li>Managing KVM: Libvirt, KubeVirt, Kata Containers</li>
</ul>
</li>
<li>Emulation
<ul>
<li>New Devices, Boards and Architectures</li>
<li>CPU Emulation and Binary Translation</li>
</ul>
</li>
</ul>Mon, 06 May 2024 06:00:00 +0000

KVM on Z: IBM Secure Execution for Linux support for Crypto Express adapters
https://kvmonz.blogspot.com/2024/04/ibm-secure-execution-for-linux-support.html
<p>IBM
Secure Execution for Linux -- the Linux Kernel Virtual Machine (KVM)
based Confidential Computing technology for IBM LinuxONE and Linux on
IBM Z -- now allows Secure Execution guests to leverage secure
passthrough access to up to 12 Crypto Express 8S adapter domains in
accelerator or EP11 co-processor mode.</p>
<p>Customers
who require the highest level of protection (FIPS 140-2 level 4
certified) for their cryptographic keys and thus for their sensitive
data can now have their workloads deployed as Secure Execution KVM
guests with access to Hardware Security Modules (HSMs) if the
provider uses IBM z16 or LinuxONE 4 servers with Crypto Express 8S
adapters. This combination provides business value for solutions
around key and certificate management, multi-party computation and
digital assets. But more use cases arise as confidential computing
becomes more common and the need to leverage such highly certified
HSM to protect AI models or provide data sovereignty across
organizational and infrastructure boundaries grows.</p>
<p>To
exploit this new function, IBM z16 or LinuxONE 4 servers with firmware
bundles S30 and S31b are needed. To use a Crypto Express 8S adapter
in EP11 mode the minimal EP11 firmware version loaded must be version
5.8.30.</p>
<p>IBM
is working with Linux distribution partners to include the required
Linux support for this function for both the KVM Hypervisor and the
Secure Execution guests in future distribution releases. Linux
support for this function is already available today with Ubuntu
24.04 (Noble Numbat).</p>
<p>This
new capability showcases IBM’s commitment and previously stated
direction to foster the use of confidential computing and expand the
security value proposition of existing security and crypto solutions
as the business needs of our customers and technical possibilities
evolve.</p>
<p>For
detailed information on how to use Crypto Express support see the
<a href="https://www.ibm.com/docs/en/linux-on-systems?topic=virtualization-secure-execution">Introducing IBM Secure Execution for Linux</a> publication<span lang="en-US">.</span></p><p><span lang="en-US"> </span>
</p>
Authored by<ul><li lang="en-US">Reinhard Bündgen – <span><u><a href="mailto:[email protected]">[email protected]</a></u></span></li><li lang="en-US"><span lang="en-US">Nicolas
Mäding – </span><span><u><a href="mailto:[email protected]"><span lang="en-US">[email protected]</span></a></u></span><span lang="en-US">
</span>
</li></ul>Mon, 29 Apr 2024 11:57:12 +0000QEMU project: QEMU version 9.0.0 releasedhttps://www.qemu.org/2024/04/23/qemu-9-0-0/
https://www.qemu.org/2024/04/23/qemu-9-0-0/
<p>We’d like to announce the availability of the QEMU 9.0.0 release. This release contains 2700+ commits from 220 authors.</p>
<p>You can grab the tarball from our <a href="https://www.qemu.org/download/#source">download page</a>. The full list of changes is available <a href="https://wiki.qemu.org/ChangeLog/9.0">in the changelog</a>.</p>
<p>Highlights include:</p>
<ul>
<li>block: virtio-blk now supports multiqueue where different queues of a single disk can be processed by different I/O threads</li>
<li>gdbstub: various improvements such as catching syscalls in user-mode, support for fork-follow modes, and support for siginfo:read</li>
<li>memory: preallocation of memory backends can now be handled concurrently using multiple threads in some cases</li>
<li>migration: support for “mapped-ram” capability allowing for more efficient VM snapshots, improved support for zero-page detection, and checkpoint-restart support for VFIO</li>
<li>ARM: architectural feature support for ECV (Enhanced Counter Virtualization), NV (Nested Virtualization), and NV2 (Enhanced Nested Virtualization)</li>
<li>ARM: board support for B-L475E-IOT01A IoT node, mp3-an536 (MPS3 dev board + AN536 firmware), and raspi4b (Raspberry Pi 4 Model B)</li>
<li>ARM: additional IO/disk/USB/SPI/ethernet controller and timer support for Freescale i.MX6, Allwinner R40, Banana Pi, npcm7xxx, and virt boards</li>
<li>HPPA: numerous bug fixes and SeaBIOS-hppa firmware updated to version 16</li>
<li>LoongArch: KVM acceleration support, including LSX/LASX vector extensions</li>
<li>RISC-V: ISA/extension support for Zacas, amocas, RVA22 profiles, Zaamo, Zalrsc, Ztso, and more</li>
<li>RISC-V: SMBIOS support for RISC-V virt machine, ACPI support for SRAT, SLIT, AIA, PLIC and updated RHCT table support, and numerous fixes</li>
<li>s390x: Emulation support for CVDG, CVB, CVBY and CVBG instructions, and fixes for LAE (Load Address Extended) emulation</li>
<li>and lots more…</li>
</ul>
<p>Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!</p>Tue, 23 Apr 2024 23:02:00 +0000

Marcin Juszkiewicz: ConfigurationManager in EDK2: just say no
https://marcin.juszkiewicz.com.pl/2024/04/16/configurationmanager-in-edk2-just-say-no/
<p>During my work on <span class="caps">SBSA</span> Reference Platform I have spent a lot of time in firmware’s
code. Which mostly meant Tianocore <span class="caps">EDK2</span> as Trusted Firmware is quite small.</p>
<p>Writing all those <span class="caps">ACPI</span> tables by hand takes time. So I checked the
ConfigurationManager component, which can do it for me.</p>
<!--MORE-->
<h3>Introduction</h3>
<p>In 2018 Sami Mujawar from Arm contributed Dynamic Tables Framework to Tianocore
<span class="caps">EDK2</span> project. The goal was to have code which generates all <span class="caps">ACPI</span> tables from
all those data structs describing hardware which <span class="caps">EDK2</span> already has.</p>
<p>In 2023 I was writing code for <span class="caps">IORT</span> and <span class="caps">GTDT</span> tables to generate them from C. And
started wondering about use of ConfigurationManager.</p>
<p>Mailed edk2-devel <span class="caps">ML</span> for pointers, documentation, hints. Got nothing in return,
idea went to the shelf.</p>
<h3><span class="caps">SBSA</span>-Ref and multiple <span class="caps">PCI</span> Express buses</h3>
<p>Last week I got the <span class="caps">SBSA</span>-Ref system booting in a <span class="caps">NUMA</span> configuration with three
separate <span class="caps">PCI</span> Express buses. And started working on getting <span class="caps">EDK2</span> firmware to
recognize them as such.</p>
<p>Took me a day and the <code>pci</code> command listed the cards properly:</p>
<pre><code>Shell> pci
Seg Bus Dev Func
--- --- --- ----
00 00 00 00 ==> Bridge Device - Host/PCI bridge
Vendor 1B36 Device 0008 Prog Interface 0
00 00 01 00 ==> Network Controller - Ethernet controller
Vendor 8086 Device 10D3 Prog Interface 0
00 00 02 00 ==> Bridge Device - PCI/PCI bridge
Vendor 1B36 Device 000C Prog Interface 0
00 00 03 00 ==> Bridge Device - Host/PCI bridge
Vendor 1B36 Device 000B Prog Interface 0
00 00 04 00 ==> Bridge Device - Host/PCI bridge
Vendor 1B36 Device 000B Prog Interface 0
00 01 00 00 ==> Mass Storage Controller - Non-volatile memory subsystem
Vendor 1B36 Device 0010 Prog Interface 2
00 40 00 00 ==> Bridge Device - PCI/PCI bridge
Vendor 1B36 Device 000C Prog Interface 0
00 40 01 00 ==> Bridge Device - PCI/PCI bridge
Vendor 1B36 Device 000C Prog Interface 0
00 41 00 00 ==> Base System Peripherals - SD Host controller
Vendor 1B36 Device 0007 Prog Interface 1
00 42 00 00 ==> Display Controller - Other display controller
Vendor 1234 Device 1111 Prog Interface 0
00 80 00 00 ==> Bridge Device - PCI/PCI bridge
Vendor 1B36 Device 000E Prog Interface 0
00 80 01 00 ==> Bridge Device - PCI/PCI bridge
Vendor 1B36 Device 000C Prog Interface 0
00 81 09 00 ==> Multimedia Device - Audio device
Vendor 1274 Device 5000 Prog Interface 0
00 81 10 00 ==> Network Controller - Ethernet controller
Vendor 8086 Device 100E Prog Interface 0
00 82 00 00 ==> Mass Storage Controller - Serial ATA controller
Vendor 8086 Device 2922 Prog Interface 1
</code></pre>
<p>The three buses are 0x00, 0x40 and 0x80. But then I had to tell the operating system
about them. Which meant playing with <span class="caps">ACPI</span> tables code in C.</p>
<p>So idea came “what about trying ConfigurationManager?”.</p>
<h3>Another try</h3>
<p>Mailed edk2-devel <span class="caps">ML</span> again for pointers, documentation, hints. And then looked at
code written for <span class="caps">N1SDP</span> and started playing with ConfigurationManager…</p>
<p>ConfigurationManager.c has an EDKII_PLATFORM_REPOSITORY_INFO struct with
hundreds of lines of data (as further structs). From listing which <span class="caps">ACPI</span> tables I
want to have (<span class="caps">FADT</span>, <span class="caps">GTDT</span>, <span class="caps">APIC</span>, <span class="caps">SPCR</span>, <span class="caps">DBG2</span>, <span class="caps">IORT</span>, <span class="caps">MCFG</span>, <span class="caps">SRAT</span>, <span class="caps">DSDT</span>, <span class="caps">PPTT</span> etc.)
to listing all hardware details like <span class="caps">GIC</span>, PCIe, Timers, <span class="caps">CPU</span> and Memory information.</p>
<p>Then code for querying this struct. I thought that <span class="caps">CM</span>/<span class="caps">DT</span>
(ConfigurationManager/DynamicTables) framework will have those already in <span class="caps">EDK2</span>
code but no — each platform has its own set of functions. Hundreds more lines
to maintain.</p>
<p>Took some time to get it built, then started filling proper data and compared
with <span class="caps">ACPI</span> tables I had previously. There were differences to sort out. But
digging more and more into code I saw that I go deeper and deeper into rabbit hole…</p>
<h3>Dynamic systems do not fit <span class="caps">CM</span>?</h3>
<p>For platforms with dynamic hardware configuration (like <span class="caps">SBSA</span>-Ref) I needed to
write code which would populate that struct with data at runtime. Check the number
of cpu cores and write cpu information (with topology, cache etc), create all
<span class="caps">GIC</span> structures and mappings. Then the same for PCIe buses. Etc. Etc. etc…</p>
<pre><code>STATIC
EFI_STATUS
EFIAPI
InitializePlatformRepository (
IN EDKII_PLATFORM_REPOSITORY_INFO * CONST PlatRepoInfo
)
{
GicInfo GicInfo;
CM_ARM_GIC_REDIST_INFO *GicRedistInfo;
CM_ARM_GIC_ITS_INFO *GicItsInfo;
CM_ARM_SMMUV3_NODE *SmmuV3Info;
GetGicDetails(&GicInfo);
PlatRepoInfo->GicDInfo.PhysicalBaseAddress = GicInfo.DistributorBase;
GicRedistInfo = &PlatRepoInfo->GicRedistInfo[0];
GicRedistInfo->DiscoveryRangeBaseAddress = GicInfo.RedistributorsBase;
GicItsInfo = &PlatRepoInfo->GicItsInfo[0];
GicItsInfo->PhysicalBaseAddress = GicInfo.ItsBase;
SmmuV3Info = &PlatRepoInfo->SmmuV3Info[0];
SmmuV3Info->BaseAddress = PcdGet64 (PcdSmmuBase);
return EFI_SUCCESS;
}
</code></pre>
<p>Which in my case can mean even more code written to populate <span class="caps">CM</span> struct of
structs than it would take to generate <span class="caps">ACPI</span> tables by hand.</p>
<h3>Summary</h3>
<p>ConfigurationManager and DynamicTables frameworks look tempting. There may be
systems where it can be used with success. I know that I do not want to touch it
again. All those structs of structs may look good for someone familiar with <span class="caps">LISP</span>
or <span class="caps">JSON</span> but not for me.</p>Tue, 16 Apr 2024 07:44:00 +0000Marcin Juszkiewicz: DT-free EDK2 on SBSA Reference Platformhttps://marcin.juszkiewicz.com.pl/2024/04/04/dt-free-edk2-on-sbsa-reference-platform/
https://marcin.juszkiewicz.com.pl/2024/04/04/dt-free-edk2-on-sbsa-reference-platform/
<p>During the last weeks we worked on getting rid of DeviceTree from <span class="caps">EDK2</span> on <span class="caps">SBSA</span>
Reference Platform. And finally we managed!</p>
<p>All code is merged into the upstream <span class="caps">EDK2</span> repository.</p>
<!--MORE-->
<h3>What?</h3>
<p>Someone may wonder where DeviceTree was in the <span class="caps">SBSA</span> Reference Platform. Wasn’t it
a <span class="caps">UEFI</span> and <span class="caps">ACPI</span> platform?</p>
<p>Yes, from the operating system’s point of view it is <span class="caps">UEFI</span> and <span class="caps">ACPI</span>. But if you look
deeper you will see DeviceTree hidden inside our chain of software components:</p>
<pre><code>/dts-v1/;
/ {
machine-version-minor = <0x03>;
machine-version-major = <0x00>;
#size-cells = <0x02>;
#address-cells = <0x02>;
compatible = "linux,sbsa-ref";
chosen {
};
memory@10000000000 {
reg = <0x100 0x00 0x00 0x40000000>;
device_type = "memory";
};
intc {
reg = <0x00 0x40060000 0x00 0x10000
0x00 0x40080000 0x00 0x4000000>;
its {
reg = <0x00 0x44081000 0x00 0x20000>;
};
};
cpus {
#size-cells = <0x00>;
#address-cells = <0x02>;
cpu@0 {
reg = <0x00 0x00>;
};
cpu@1 {
reg = <0x00 0x01>;
};
cpu@2 {
reg = <0x00 0x02>;
};
cpu@3 {
reg = <0x00 0x03>;
};
};
};
</code></pre>
<p>It is a very minimal one, providing us with only the information we need. It
does not even pass any compliance checks. For example for Linux, the <span class="caps">GIC</span> node
(the /intc/ one) should have a gazillion fields, but we only need the addresses.</p>
<p>Trusted Firmware reads and parses it, and provides information from it via Secure
Monitor Calls (<span class="caps">SMC</span>) to the upper firmware level (<span class="caps">EDK2</span> in our case). The DeviceTree is
provided too but we do not read it any more.</p>
<h3>Why?</h3>
<p>Our goal is to treat software components a bit differently than people may expect.
<span class="caps">QEMU</span> is “virtual hardware” layer, <span class="caps">TF</span>-A provides interface to “embedded
controller” (<span class="caps">EC</span>) layer and <span class="caps">EDK2</span> is firmware layer on top.</p>
<p>On physical hardware the firmware assumes some parts and asks the <span class="caps">EC</span> for the rest of
the system information. <span class="caps">QEMU</span> does not give us that, while giving us a way to alter
the system configuration, via a bunch of cli arguments, more than would be possible
on most hardware platforms.</p>
<p><span class="caps">EDK2</span> asks for <span class="caps">CPU</span>, <span class="caps">GIC</span> and Memory. When there is no info about processors or
memory, it informs the user and shuts down the system (such a situation does not
really have a chance of happening, but it works as an example).</p>
<h3>Bonus stuff: <span class="caps">NUMA</span></h3>
<p>A bonus part of this work was adding firmware support for <span class="caps">NUMA</span> configuration. When
<span class="caps">QEMU</span> is run with <span class="caps">NUMA</span> arguments then the operating system gets the whole memory and
proper configuration information.</p>
<p><span class="caps">QEMU</span> arguments used:</p>
<pre><code>-smp 4,sockets=4,maxcpus=4
-m 4G,slots=2,maxmem=5G
-object memory-backend-ram,size=1G,id=m0
-object memory-backend-ram,size=3G,id=m1
-numa node,nodeid=0,cpus=0-1,memdev=m0
-numa node,nodeid=1,cpus=2,memdev=m1
-numa node,nodeid=2,cpus=3
</code></pre>
<p>How the operating system sees the <span class="caps">NUMA</span> information:</p>
<pre><code>root@sbsa-ref:~# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 975 MB
node 0 free: 840 MB
node 1 cpus: 2
node 1 size: 2950 MB
node 1 free: 2909 MB
node 2 cpus: 3
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node 0 1 2
0: 10 20 20
1: 20 10 20
2: 20 20 10
</code></pre>
<h3>What next?</h3>
<p>There is <span class="caps">CPU</span> topology information in review queue. All those sockets, clusters,
cores and threads. <span class="caps">QEMU</span> will pass it in DeviceTree, <span class="caps">TF</span>-A will give it via <span class="caps">SMC</span>
and then <span class="caps">EDK2</span> will put it in one of <span class="caps">ACPI</span> tables (<span class="caps">PPTT</span> == Processor Properties
Topology Table).</p>
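<p>As an illustration (my example, not from the post), such a topology could be requested on the
<span class="caps">QEMU</span> side with an -smp line like this, assuming the machine type accepts the
clusters parameter:</p>
<pre><code>-smp 8,sockets=2,clusters=2,cores=2,threads=1
</code></pre>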
<p>If someone decides to write their own firmware for the <span class="caps">SBSA</span> Reference Platform (like a port
of U-Boot) then both the DeviceTree and the set of <span class="caps">SMC</span> calls will be waiting for them, ready
to be used to gather hardware information.</p>Thu, 04 Apr 2024 11:26:00 +0000

Stefan Hajnoczi: Where are the Supply Chain Safe Programming Languages?
https://blog.vmsplice.net/2024/03/where-are-supply-chain-safe-programming.html
<p>Programming languages currently offer few defences against <a href="https://en.wikipedia.org/wiki/Supply_chain_attack">supply chain
attacks</a> where a malicious third-party library compromises a program. As I write this, the
open source community is trying to figure out the details of the <a href="https://en.wikipedia.org/wiki/XZ_Utils#Supply_chain_attack">xz-utils
backdoor</a>, but there is a long history of supply chain attacks. <a href="https://blog.sonatype.com/npm-hijackers-at-it-again-popular-coa-and-rc-open-source-libraries-taken-over-to-spread-malware">High</a>
<a href="https://pytorch.org/blog/compromised-nightly-dependency/">profile</a>
<a href="https://www.wordfence.com/blog/2018/02/cryptomining-javascript-supply-chain-attack/">incidents</a>
have made plain the danger of shipping software
built from large numbers of dependencies, many of them unaudited and under little
scrutiny for malicious code. In this post I will share ideas on future
supply chain safe programming languages.</p>
<h2>Supply Chain Safe Programming Languages?</h2>
<p>I'm using the term Supply Chain Safe Programming Languages for languages
that defend against supply chain attacks and allow library dependencies to be
introduced with strong guarantees about what the dependencies can and cannot
do. This type of programming language is not yet widely available as of March
2024, to the best of my knowledge.</p>
<p>Supply chain safety is often associated with software packaging and
distribution techniques for verifying that software was built from known good
inputs. Although adding supply chain safety tools on top of existing
programming languages is a pragmatic solution, I think future progress requires
addressing supply chain safety directly in the programming language.</p>
<h2>Why today's languages are not supply chain safe</h2>
<p>Many existing languages have a module system that gives the programmer
control over the visibility of variables and functions. By hiding variables and functions from other modules, one might hope to achieve isolation so
that a component like a decompression library could not read a sensitive
variable from the program. Unfortunately this level of isolation between components is not really available in popular
programming languages today even in languages with public/private visibility features. Visibility is more of a software engineering tool
for keeping programs decoupled than an isolation mechanism that actually protects
components of a program from each other. There are many ways to bypass visibility.</p>
<p>The fundamental problem is that existing programming languages do not even
acknowledge that programs often consist of untrusted components. Compilers and
interpreters currently treat the entire input source code as having more or
less the same level of trust. Here are some of the ways in which today's programming languages
fall short:</p>
<ul>
<li>Unsafe programming languages like C, C++, and even Rust allow the programmer to bypass the type system to do pretty much anything.</li>
<li>Dynamic languages like Python and JavaScript have introspection and <a href="https://en.wikipedia.org/wiki/Monkey_patch">monkey patching</a> abilities that allow the programmer to hook other parts of the program and escape attempts at isolation.</li>
<li>Build systems and metaprogramming facilities like macros allow untrusted components to generate code that executes in the context of another component.</li>
<li>Standard libraries provide access to spawning new programs, remapping virtual memory, loading shared libraries written in unsafe languages, hijacking function calls through the linker, raw access to the program's memory space with <tt>/proc/self/mem</tt>, and so on. All of these can bypass language-level mechanisms for isolating components in a program.</li>
</ul>
<p>Whatever your current language, it's unlikely that the language itself
allows you to isolate components of a program. The best approach we have today
for run-time isolation is through <a href="https://en.wikipedia.org/wiki/Sandbox_(computer_security)">sandboxing</a>. Examples of sandboxing approaches
include <a href="https://en.wikipedia.org/wiki/Seccomp"><tt>seccomp(2)</tt></a>,
<a href="https://v8docs.nodesource.com/node-0.8/d5/dda/classv8_1_1_isolate.html">v8
Isolates</a> for JavaScript, invoking untrusted code in a WebAssembly runtime,
or the descendants of <a href="https://en.wikipedia.org/wiki/Chroot">chroot(2)</a>.</p>
<p>Sandboxes are not
supported directly by the programming language and have a number of drawbacks
and limitations. Integrating sandboxing into programs is tedious so they are
primarily used in the most critical attack surfaces like web browsers or
hypervisors. There is usually a performance overhead associated with
interacting with the sandbox because data needs to be marshalled or copied.
Sandboxing is an opt-in mechanism that doesn't raise the bar of software in
general. I believe that supply chain safe programming languages could offer
similar isolation but as the default for most software.</p>
<h2>What a Supply Chain Safe Programming Language looks like</h2>
<p>The goal of a supply chain safe programming language is to isolate components of a
program by default. Rather than leaving supply chain safety outside the scope
of the language, the language should allow components to be integrated with
strong guarantees about what effects they can have on each other. There may be
practical reasons to offer an escape hatch to unsafe behavior, but the default
needs to be safe.</p>
<p>At what level of granularity should isolation operate? I think modules are
too coarse grained because they are often collections of functions that perform
very different types of computation requiring different levels of access to
resources. The level of granularity should at least go down to the function
level within a component, although even achieving module-level granularity
would be a major improvement over today's standards.</p>
<p>An example is that a hash table lookup function should be unable to connect
to the internet. That way the function can be used without fear of it becoming
a liability if it contains bugs or its source code is manipulated by an
attacker.</p>
<p>A well-known problem in programming language security is that the majority
of languages expose ambient capabilities to all components in a program.
Ambient capabilities provide access to resources that are not explicitly passed
in to the component. Think of a file descriptor in a POSIX process that is
available to any function in the program, including a string compare function
that has no business manipulating file descriptors.</p>
<p><a href="https://en.wikipedia.org/wiki/Capability-based_security">Capability-based
security</a> approaches are a solution to the ambient capabilities problem
in languages today. Although mainstream programming languages do not offer
capabilities as part of the language, there have been special-purpose and
research languages that demonstrated that this approach works. In a type safe programming language with capability-based security it
becomes possible to give components access to only those resources that they
require. Usually type safety is the mechanism that prevents capabilities from
being created out of thin air, although other approaches may be possible for
dynamic languages. The type system will not allow a component to create itself
a new capability that the component does not already possess.</p>
<p>Capability-based security addresses safety at runtime, but it does not
address safety at compile time. If we want to compose programs from untrusted
components then it is not possible to rely on today's build scripts, code
generators, or macro systems. The problem is that they can be abused by a
component to execute code in the context of another component.</p>
<p>Compile-time supply chain safety means isolating components so their code
stays within their component. For example, a "leftpad" macro that pads a string
literal with leading spaces would be unsafe if it can generate code that is
compiled as part of the main program using the macro. Similarly, a build script for
the leftpad module must not be able to affect or escape the build
environment.</p>
<p>Macros, build scripts, code generators, and so on are powerful tools that
programmers find valuable. The challenge for supply chain safe programming
languages is to harness that power so that it remains convenient to use without
endangering safety. One example solution is running build scripts in an
isolated environment that cannot affect other components in the program. This
way a component can take advantage of custom build-time behavior without
endangering the program. However, it is unclear to me how far inter-component
facilities like macros can be made safe, if at all.</p>
<h2>Conclusion</h2>
<p>I don't have the answers or even a prototype, but I think supply chain safe
programming languages are an inevitability. Modern programs are composed of
many third-party components yet we do not have effective techniques for
confining components. Languages treat the entire program as trusted rather than
as separate untrusted components that must be isolated.</p>
<p>Hopefully we will begin to see new mainstream programming languages emerge
that are supply chain safe, not just memory safe!</p>Sun, 31 Mar 2024 01:16:49 +0000

Marcin Juszkiewicz: Running SBSA Reference Platform
https://marcin.juszkiewicz.com.pl/2024/03/25/running-sbsa-reference-platform/
<p>Recently people asked me how to run the <span class="caps">SBSA</span> Reference Platform for their own
testing and development. Which shows that I should write some documentation.</p>
<p>But first let me blog about it…</p>
<!--MORE-->
<h3>Requirements</h3>
<p>To run <span class="caps">SBSA</span> Reference Platform emulation you need:</p>
<ul>
<li><span class="caps">QEMU</span> (8.2+ recommended)</li>
<li><span class="caps">EDK2</span> firmware files</li>
</ul>
<p>That’s all. Sure, some hardware resources would be handy but everyone has some
kind of computer available, right?</p>
<h4><span class="caps">QEMU</span></h4>
<p>Nothing special is required as long as you have the <code>qemu-system-aarch64</code> binary available.</p>
<h4><span class="caps">EDK2</span></h4>
<p>We provide
<a href="https://artifacts.codelinaro.org/ui/native/linaro-419-sbsa-ref/"><span class="caps">EDK2</span> binaries</a>
on the CodeLinaro server. Go to the “latest/edk2” directory, fetch both “SBSA_FLASH*”
files, unpack them and you are ready to go. You may compare checksums (before
unpacking) with the values present in the “latest/<span class="caps">README</span>.txt” file.</p>
<p>Those binaries are built from release versions of Trusted Firmware (<span class="caps">TF</span>-A) and
Tianocore <span class="caps">EDK2</span> plus latest “edk2-platforms” code (as this repo is not using tags).</p>
<h5>Building <span class="caps">EDK2</span></h5>
<p>If you decide to build <span class="caps">EDK2</span> on your own then we provide <span class="caps">TF</span>-A binaries in
the “edk2-non-osi” repository. I update those when needed.</p>
<p>Instructions to build <span class="caps">EDK2</span> are provided in the
<a href="https://github.com/tianocore/edk2-platforms/blob/master/Platform/Qemu/SbsaQemu/Readme.md">Qemu/SbsaQemu</a>
directory of the “edk2-platforms” repository.</p>
<h3>Running <span class="caps">SBSA</span> Reference Platform emulation</h3>
<p>Note that this machine is fully emulated, even on AArch64 systems where
virtualization is available.</p>
<p>Let’s go through an example <span class="caps">QEMU</span> command line:</p>
<pre><code>qemu-system-aarch64
-machine sbsa-ref
-drive file=firmware/SBSA_FLASH0.fd,format=raw,if=pflash
-drive file=firmware/SBSA_FLASH1.fd,format=raw,if=pflash
-serial stdio
-device usb-kbd
-device usb-tablet
-cdrom disks/alpine-standard-3.19.1-aarch64.iso
</code></pre>
<p>First we select the “sbsa-ref” machine (it defaults to four Neoverse-N1 cpu cores
and <span class="caps">1GB</span> of ram). Then we point to the firmware files (their order is important).</p>
<p>The serial console is useful for diagnostic output; just remember not to press
Ctrl-C there unless you want to take the whole emulation down.</p>
<p><span class="caps">USB</span> devices are to have working keyboard and pointing device. <span class="caps">USB</span> tablet is more
useful that <span class="caps">USB</span> mouse (<code>-device usb-mouse</code> adds it). If you want to run *<span class="caps">BSD</span>
operating systems then I recommend to add <span class="caps">USB</span> mouse.</p>
<p>And the last entry adds the Alpine 3.19.1 <span class="caps">ISO</span> image.</p>
<p>The system boots to a text console on the graphical output. For some reason the boot console
is on the serial port.</p>
<h4>Adding hard disk</h4>
<p>If you want to add a hard disk then adding “<code>-hdb disk.img</code>” is enough (“hdb” because the
cdrom already took the first slot on the <span class="caps">AHCI</span> controller).</p>
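<p>If you do not have a disk image yet, one way to create and attach one (the name and size are arbitrary) is:</p>
<pre><code>qemu-img create -f raw disk.img 10G

qemu-system-aarch64 -machine sbsa-ref ... -hdb disk.img
</code></pre>
<p>Here “...” stands for the firmware, serial, <span class="caps">USB</span> and cdrom options from the command line shown earlier.</p>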
<p>A handy feature is the “virtual <span class="caps">FAT</span> drive”, which allows you to create a guest drive from a
directory on the host:</p>
<pre><code>-drive if=ide,file=fat:ro:DIRECTORY_ON_HOST,format=raw
</code></pre>
<p>This is useful for running <span class="caps">EFI</span> binaries, as this drive is visible in the <span class="caps">UEFI</span>
environment. It is not present once the operating system is booted.</p>
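<p>A small sketch of how this can be used (the binary name is just a placeholder): copy the <span class="caps">EFI</span> binary into a directory, expose that directory as the <span class="caps">FAT</span> drive, and then run the binary from the <span class="caps">UEFI</span> Shell after switching to the corresponding filesystem (for example “FS0:”):</p>
<pre><code>mkdir efi-dir
cp MyApp.efi efi-dir/

qemu-system-aarch64 -machine sbsa-ref ... -drive if=ide,file=fat:ro:efi-dir,format=raw
</code></pre>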
<h4>Adding <span class="caps">NVME</span> drive</h4>
<p><span class="caps">NVME</span> is composed from two things:</p>
<ul>
<li>PCIe device</li>
<li>storage drive</li>
</ul>
<p>So let’s add it in a nice way, using a PCIe root-port:</p>
<pre><code>-device pcie-root-port,id=root_port_for_nvme1,chassis=2,slot=0
-device nvme,serial=deadbeef,bus=root_port_for_nvme1,drive=nvme
-drive file=disks/nvme.img,format=raw,id=nvme,if=none
</code></pre>
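<p>The “<code>-drive</code>” line above expects the backing file to exist already; creating it is a one-liner (the size is arbitrary):</p>
<pre><code>qemu-img create -f raw disks/nvme.img 8G
</code></pre>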
<h4>Using <span class="caps">NUMA</span> configuration</h4>
<p><span class="caps">QEMU</span> can emulate Non-Uniform Memory Access (<span class="caps">NUMA</span>) setup. This usually means
multisocket systems with memory available per cpu socket.</p>
<p>Example config:</p>
<pre><code>-smp 4,sockets=4,maxcpus=4
-m 4G,slots=2,maxmem=5G
-object memory-backend-ram,size=1G,id=m0
-object memory-backend-ram,size=3G,id=m1
-numa node,nodeid=0,cpus=0-1,memdev=m0
-numa node,nodeid=1,cpus=2,memdev=m1
-numa node,nodeid=2,cpus=3
</code></pre>
<p>This adds four cpu sockets and <span class="caps">4GB</span> of memory. The first node has 2 cpu cores and <span class="caps">1GB</span> of
ram, the second node has 1 cpu and <span class="caps">3GB</span> of ram, and the last node has 1 cpu without local memory.</p>
<p>Note that support for such a setup is work in progress right now (March 2024). We merged the
required code into <span class="caps">TF</span>-A and have a set of patches for <span class="caps">EDK2</span> in review. Without them
you will see resources only from the first <span class="caps">NUMA</span> node.</p>
<h4>Complex <span class="caps">PCI</span> Express setup</h4>
<p>Our platform has <span class="caps">GIC</span> <span class="caps">ITS</span> support so we can try some complex <span class="caps">PCI</span> Express structures.</p>
<p>This example uses a PCIe switch to add more PCIe slots and then (to complicate
things) puts a PCIe-to-<span class="caps">PCI</span> bridge into one of them to make use of an old Intel e1000
network card:</p>
<pre><code>-device pcie-root-port,id=root_port_for_switch1,chassis=0,slot=0
-device x3130-upstream,id=up_port1,bus=root_port_for_switch1
-device xio3130-downstream,id=down_port1,bus=up_port1,chassis=1,slot=0
-device ac97,bus=down_port1
-device xio3130-downstream,id=down_port2,bus=up_port1,chassis=1,slot=1
-device pcie-pci-bridge,id=pci1,bus=down_port2
-device e1000,bus=pci1,addr=2
</code></pre>
<h3>Some helper scripts</h3>
<p>During the last year I wrote some helper scripts for working with <span class="caps">SBSA</span> Reference
Platform testing. They are stored in
<a href="https://github.com/hrw/sbsa-ref-status">sbsa-ref-status</a> repository on GitHub.</p>
<p>They may lack up-to-date documentation, but they show my way of using the platform.</p>
<h3>Summary</h3>
<p><span class="caps">SBSA</span> Reference Platform can be used for testing several things. From operating
systems to (S)<span class="caps">BSA</span> compliance of the platform. Or to check how some things are
emulated in <span class="caps">QEMU</span>. Or playing with PCIe setups (<span class="caps">NUMA</span> systems can have separate
<span class="caps">PCI</span> Express buses but we do not handle it yet in firmware).</p>
<p>Have fun!</p>Mon, 25 Mar 2024 10:21:00 +0000KVM on Z: Important Note on Verifying Secure Execution Host Key Documentshttps://kvmonz.blogspot.com/2024/03/important-note-on-verifying-secure.html
https://kvmonz.blogspot.com/2024/03/important-note-on-verifying-secure.html
<p>The certificates of the host key signing keys that
are needed to verify host key documents will expire on April 24, 2024
for IBM z15 and LinuxONE III and on March 29, 2024 for IBM z16 and
LinuxONE 4. Due to a requirement from the Certificate Authority
(DigiCert), the renewed certificates carry a new Locality
value (“Armonk” instead of “Poughkeepsie”). These renewed
certificates cause the current versions of the <i><b>genprotimg</b></i>,
<i><b>pvattest</b></i>, and <i><b>pvsecret</b></i>
tools to fail the verification of host key documents.</p>
<p><span lang="en-US">The
IBM Z team is preparing updates of the </span><span lang="en-US"><span>genprotimg</span></span><span lang="en-US">,
</span><span><span lang="en-US">pvattest</span></span><span lang="en-US">,
and </span><span><span lang="en-US">pvsecret</span></span><span lang="en-US">
tools to accept the new certificates and is working with Linux
distribution partners to release the updated tools.</span></p><p><span lang="en-US">To
build new Secure Execution images, attestation requests, or
add-secret requests before the updated tools are available in Linux
distributions, follow these steps:<br /><br /></span></p>
<p lang="en-US">
</p>
<h4><span lang="en-US">Step
1:</span></h4><p><span lang="en-US">Obtain the host key document, the host key signing key
certificate, the intermediate certificate from the Certificate
Authority, and the list of revoked host keys (CRL):</span></p>
<ul><li><p><span lang="en-US">For
IBM z15 and LinuxONE III, see
</span><span><u><a href="https://www.ibm.com/support/resourcelink/api/content/public/secure-execution-gen1.html"><span lang="en-US">https://www.ibm.com/support/resourcelink/api/content/public/secure-execution-gen1.html</span></a></u></span></p>
</li><li><p><span lang="en-US">For
IBM z16 and LinuxONE 4, see
</span><span><u><a href="https://www.ibm.com/support/resourcelink/api/content/public/secure-execution-gen2.html"><span lang="en-US">https://www.ibm.com/support/resourcelink/api/content/public/secure-execution-gen2.html<br /><br /></span></a></u></span><span lang="en-US"></span></p>
</li></ul>
<h4><span lang="en-US">Step
2:</span><span lang="en-US"> </span></h4><div><span lang="en-US">Download the script </span><span lang="en-US"><span>check_hostkeydoc</span></span><span lang="en-US">
from
</span><span><u><a href="https://github.com/ibm-s390-linux/s390-tools/blob/master/genprotimg/samples/check_hostkeydoc"><span lang="en-US">https://github.com/ibm-s390-linux/s390-tools/blob/master/genprotimg/samples/check_hostkeydoc</span></a></u></span><span lang="en-US">.<br /><br /></span></div>
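<p>For example, the script can be fetched and made executable like this (the raw.githubusercontent.com URL is simply the direct-download form of the GitHub link above):</p>
<pre>
# wget https://raw.githubusercontent.com/ibm-s390-linux/s390-tools/master/genprotimg/samples/check_hostkeydoc
# chmod +x check_hostkeydoc
</pre>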
<h4><span lang="en-US">Step
3:</span></h4><p><span lang="en-US">Verify each host key document using the </span><span lang="en-US"><i><b>check_hostkeydoc</b></i></span><span lang="en-US">
script. For example, issue</span></p>
<p><span lang="en-US">#
./check_hostkeydoc HKD1234.crt ibm-z-host-key-signing.crt \ -c
DigiCertCA.crt -r ibm-z-host-key.crl</span></p>
<p>This example verifies the host key document
HKD1234.crt using the host key signing key
certificate ibm-z-host-key-signing.crt, the
intermediate certificate of the Certificate Authority
DigiCertCA.crt, and the list
of revoked host keys ibm-z-host-key.crl.</p>
<p><span lang="en-US">After
the host key documents are verified using the </span><span lang="en-US"><span>check_hostkeydoc</span></span><span lang="en-US">
script, you can safely call </span><span lang="en-US"><span>genprotimg</span></span><span lang="en-US">,
</span><span><span lang="en-US">pvattest</span></span><span lang="en-US">,
or </span><span><span lang="en-US">pvsecret</span></span><span lang="en-US">
with the </span><span><span lang="en-US">–-no-verify</span></span><span lang="en-US">
option.</span></p>
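<p>For illustration only, building a Secure Execution image could then look like this (a rough sketch; the kernel, initrd, parmfile, and output paths are placeholders, and option names may differ between genprotimg versions, so check the man page of your installed version):</p>
<pre>
# genprotimg -k HKD1234.crt -i /boot/vmlinuz -r /boot/initrd.img -p parmfile --no-verify -o /boot/secure-image
</pre>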
<p lang="en-US">
</p>
<p><span lang="en-US">For
a description about how to manually verify host key documents, see
</span><span><u><a href="https://www.ibm.com/docs/en/linux-on-z?topic=execution-verify-host-key-document"><span lang="en-US">https://www.ibm.com/docs/en/linux-on-z?topic=execution-verify-host-key-document</span></a></u></span><span lang="en-US">.
</span>
</p>
<p></p>Mon, 25 Mar 2024 09:51:10 +0000Stefan Hajnoczi: How to access libvirt domains in KubeVirthttps://blog.vmsplice.net/2024/03/how-to-access-libvirt-domains-in.html
https://blog.vmsplice.net/2024/03/how-to-access-libvirt-domains-in.html
<p><a href="https://kubevirt.io/">KubeVirt</a> makes it possible to run virtual machines on Kubernetes alongside container workloads. Virtual machines are configured using <a href="https://kubevirt.io/user-guide/virtual_machines/virtual_machine_instances/">VirtualMachineInstance YAML</a>. But under the hood of KubeVirt lies the same <a href="https://libvirt.org/">libvirt</a> tooling that is commonly used to run KVM virtual machines on Linux. Accessing libvirt can be convenient for development and troubleshooting.</p>
<p><b>Note that bypassing KubeVirt must be done carefully.</b> Doing this in production may interfere with running VMs. If a feature is missing from KubeVirt, then please <a href="https://github.com/kubevirt/kubevirt/issues">request it</a>.</p>
<p>The following diagram shows how the user's VirtualMachineInstance is turned into a libvirt domain:</p>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdX43sTYCBkwaQU4ZJhZeEdP4Um82wt3BNvBDcxIwpwp9n7pDuHgqeNSb8QJt_UL-_8xdVnYDnBwT-sSbazWBgXuf37j_73qoiWQwfv8PZef1OYLC7jnFNJ3PdiI22ivlN59gWvQH4BEPFQeH3nfTMkd00_A6IlJeS2HXReWf4dyjnJcaYjDisyIOOq0w/s1600/virt-launcher.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdX43sTYCBkwaQU4ZJhZeEdP4Um82wt3BNvBDcxIwpwp9n7pDuHgqeNSb8QJt_UL-_8xdVnYDnBwT-sSbazWBgXuf37j_73qoiWQwfv8PZef1OYLC7jnFNJ3PdiI22ivlN59gWvQH4BEPFQeH3nfTMkd00_A6IlJeS2HXReWf4dyjnJcaYjDisyIOOq0w/s1600/virt-launcher.png" /></a></div>
<h2>Accessing virsh</h2>
<p>Libvirt's <tt>virsh</tt> command-line tool is available inside the virt-launcher Pod that runs a virtual machine. First determine <tt>vm1</tt>'s virt-launcher Pod name by filtering on its label (thanks to Alice Frosi for this trick!):</p>
<pre>
$ kubectl get pod -l vm.kubevirt.io/name=vm1
NAME READY STATUS RESTARTS AGE
virt-launcher-vm1-5gxvg 2/2 Running 0 8m13s
</pre>
<p>Find the name of the libvirt domain (this is guessable but it doesn't hurt to check):</p>
<pre>
$ kubectl exec virt-launcher-vm1-5gxvg -- virsh list
Id Name State
-----------------------------
1 default_vm1 running
</pre>
<p>Arbitrary virsh commands can be invoked. Here is an example of dumping the libvirt domain XML:</p>
<pre>
$ kubectl exec virt-launcher-vm1-5gxvg -- virsh dumpxml default_vm1
<domain type='kvm' id='1'>
<name>default_vm1</name>
...
</pre>
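<p>Connecting to the guest's serial console works the same way, provided the domain has a serial console configured (press Ctrl+] to disconnect):</p>
<pre>
$ kubectl exec -it virt-launcher-vm1-5gxvg -- virsh console default_vm1
</pre>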
<h2>Viewing libvirt logs and the full QEMU command-line</h2>
<p>The libvirt logs are captured by Kubernetes so you can view them with <tt>kubectl logs <virt-launcher-pod-name></tt>. If you don't know the virt-launcher pod name, check with <tt>kubectl get pod</tt> and look for your virtual machine's name.</p>
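<p>For example, assuming the libvirt and QEMU logs come from the <tt>compute</tt> container of the virt-launcher pod (that is the container name I have seen; drop <tt>-c</tt> or pick another container if your pod layout differs):</p>
<pre>
$ kubectl logs virt-launcher-vm1-5gxvg -c compute
</pre>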
<p>The full QEMU command-line is part of the libvirt logs, but unescaping the JSON string is inconvenient. Here is another way to get the full QEMU command-line:</p>
<pre>
$ kubectl exec <virt-launcher-pod-name> -- ps aux | grep qemu
</pre>
<h2>Customizing KubeVirt's libvirt domain XML</h2>
<p>KubeVirt has a feature for customizing libvirt domain XML called <i>hook sidecars</i>. After the libvirt XML is generated, it is sent to a user-defined container that processes the XML and returns it back. The libvirt domain is defined using this processed XML. To learn more about how it works, check out the <a href="https://deploy-preview-751--kubevirt-user-guide.netlify.app/operations/hook-sidecar/#hook-sidecar-container">documentation</a>.</p>
<p>Hook sidecars are available when the <tt>Sidecar</tt> feature gate is enabled in the <tt>kubevirt/kubevirt</tt> custom resource. Normally only the cluster administrator can modify the kubevirt CR, so be sure to check when trying this feature:</p>
<pre>
$ kubectl auth can-i update kubevirt/kubevirt -n kubevirt
yes
</pre>
<p>Although you can provide a complete container image for the hook sidecar, there is a shortcut if you just want to run a script. A generic hook sidecar image is available that launches a script which can be provided as a ConfigMap. Here is an example YAML, including a ConfigMap, that I've used to test the libvirt IOThread Virtqueue Mapping feature:</p>
<pre>
---
<b>apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
name: kubevirt
namespace: kubevirt
spec:
configuration:
developerConfiguration:
featureGates:
- Sidecar</b>
---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
name: "fedora"
spec:
storage:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
source:
http:
url: "https://download.fedoraproject.org/pub/fedora/linux/releases/38/Cloud/x86_64/images/Fedora-Cloud-Base-38-1.6.x86_64.raw.xz"
---
<b>apiVersion: v1
kind: ConfigMap
metadata:
name: sidecar-script
data:
my_script.sh: |
#!/usr/bin/env python3
import xml.etree.ElementTree as ET
import os.path
import sys
NUM_IOTHREADS = 4
VOLUME_NAME = 'data' # VirtualMachine volume name
def main(xml):
domain = ET.fromstring(xml)
domain.find('iothreads').text = str(NUM_IOTHREADS)
disk = domain.find(f"./devices/disk/alias[@name='ua-{VOLUME_NAME}']..")
driver = disk.find('driver')
del driver.attrib['iothread']
iothreads = ET.SubElement(driver, 'iothreads')
for i in range(NUM_IOTHREADS):
iothread = ET.SubElement(iothreads, 'iothread')
iothread.set('id', str(i + 1))
ET.dump(domain)
if __name__ == "__main__":
# Workaround for https://github.com/kubevirt/kubevirt/issues/11276
if os.path.exists('/tmp/ran-once'):
main(sys.argv[4])
else:
open('/tmp/ran-once', 'wb')
print(sys.argv[4])</b>
---
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
creationTimestamp: 2018-07-04T15:03:08Z
generation: 1
labels:
kubevirt.io/os: linux
name: vm1
annotations:
<b>hooks.kubevirt.io/hookSidecars: '[{"args": ["--version", "v1alpha3"],
"image": "kubevirt/sidecar-shim:20240108_99b6c4bdb",
"configMap": {"name": "sidecar-script",
"key": "my_script.sh",
"hookPath": "/usr/bin/onDefineDomain"}}]'</b>
spec:
domain:
ioThreadsPolicy: auto
cpu:
cores: 8
devices:
blockMultiQueue: true
disks:
- disk:
bus: virtio
name: disk0
- disk:
bus: virtio
name: data
machine:
type: q35
resources:
requests:
memory: 1024M
volumes:
- name: disk0
persistentVolumeClaim:
claimName: fedora
- name: data
emptyDisk:
capacity: 8Gi
</pre>
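<p>To try this out, save the YAML above to a file, apply it, and then check that the generated libvirt domain XML really contains the extra iothread elements (the file name here is arbitrary and the virt-launcher pod name will differ in your cluster):</p>
<pre>
$ kubectl apply -f iothreads-example.yaml
$ kubectl get pod -l vm.kubevirt.io/name=vm1
$ kubectl exec <virt-launcher-pod-name> -- virsh dumpxml default_vm1 | grep -A 6 '<driver'
</pre>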
<p>If you need to go down one level further and customize the QEMU command-line, see my post on <a href="https://blog.vmsplice.net/2011/04/how-to-pass-qemu-command-line-options.html">passing QEMU command-line options in libvirt domain XML</a>.</p>
<h2>More KubeVirt debugging tricks</h2>
<p>The official KubeVirt documentation has a <a href="https://kubevirt.io/user-guide/debug_virt_stack/logging/">Virtualization Debugging</a> section with more tricks for customizing libvirt logging, launching QEMU with strace or gdb, etc. Thanks to Alice Frosi for sharing the link!</p>
<h2>Conclusion</h2>
<p>It is possible to get libvirt access in KubeVirt for development and testing. This can make troubleshooting easier and it gives you the full range of libvirt domain XML if you want to experiment with features that are not yet exposed by KubeVirt.</p>Tue, 12 Mar 2024 19:55:40 +0000Stefano Garzarella: vDPA: support for block devices in Linux and QEMUhttps://stefano-garzarella.github.io/posts/2024-02-12-vdpa-blk/
https://stefano-garzarella.github.io/posts/2024-02-12-vdpa-blk/
<p>A <em>vDPA device</em> is a type of device that follows the <strong><a href="https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html">virtio specification</a>
for its datapath</strong> but has a <strong>vendor-specific control path</strong>.</p>
<p>vDPA devices can be both physically located on the hardware or emulated by
software.</p>
<p><img alt="vDPA overview" src="https://stefano-garzarella.github.io/img/vdpa_overview.png" /></p>
<p>A small vDPA parent driver in the host kernel is required only for the control
path. The main advantage is the <strong>unified software stack</strong> for all vDPA
devices:</p>
<ul>
<li><em>vhost interface</em> (vhost-vdpa) for userspace or guest virtio driver, like a
VM running in QEMU</li>
<li><em>virtio interface</em> (virtio-vdpa) for bare-metal or containerized applications
running in the host</li>
<li><em>management interface</em> (vdpa netlink) for instantiating devices and
configuring virtio parameters</li>
</ul>
<h3 id="useful-resources">Useful Resources</h3>
<p>Many blog posts and talks have been published in recent years that can help you
better understand vDPA and the use cases. On <a href="https://vdpa-dev.gitlab.io/">vdpa-dev.gitlab.io</a>
we collected some of them; I suggest you at least explore the following:</p>
<ul>
<li><a href="https://www.redhat.com/en/blog/introduction-vdpa-kernel-framework">Introduction to vDPA kernel framework</a></li>
<li><a href="https://www.redhat.com/en/blog/introducing-vduse-software-defined-datapath-virtio">Introducing VDUSE: a software-defined datapath for virtio</a></li>
</ul>
<h2 id="block-devices">Block devices</h2>
<p>Most of the work in vDPA has been driven by network devices, but in recent years,
we have also developed support for block devices.</p>
<p>The main use case is definitely leveraging the hardware to directly emulate the
virtio-blk device and support different network backends such as Ceph RBD or
iSCSI. This is the goal of some SmartNICs or DPUs, which are able to emulate
virtio-net devices of course, but also virtio-blk for network storage.</p>
<p>The abstraction provided by vDPA also makes software accelerators possible,
similar to existing vhost or vhost-user devices.
<a href="https://kvmforum2021.sched.com/event/ke3a/vdpa-blk-unified-hardware-and-software-offload-for-virtio-blk-stefano-garzarella-red-hat">We discussed about that at KVM Forum 2021</a>.</p>
<p>We talked about the fast path and the slow path in that talk. When QEMU needs
to handle requests, like supporting live migration or executing I/O throttling,
it uses the slow path. During the slow path, the device exposed to the guest is
emulated in QEMU. QEMU intercepts the requests and forwards them to the vDPA
device by taking advantage of the driver implemented in libblkio.
On the other hand, when QEMU doesn’t need to intervene, the fast path comes
into play. In this case, the vDPA device can be directly exposed to the guest,
bypassing QEMU’s emulation.</p>
<p><a href="https://gitlab.com/libblkio/libblkio">libblkio</a> exposes common API for accessing
block devices in userspace. It supports several drivers. We will focus more
on <code>virtio-blk-vhost-vdpa</code> driver, which is used by <code>virtio-blk-vhost-vdpa</code>
block device in QEMU. It only supports slow path for now, but in the future
it should be able to switch to fast path automatically. Since QEMU 7.2, it
supports libblkio drivers, so you can use the following options to attach a
vDPA block device to a VM:</p>
<div class="highlight"><pre class="chroma" tabindex="0"><code class="language-bash"><span class="line"><span class="cl"> -blockdev node-name<span class="o">=</span>drive_src1,driver<span class="o">=</span>virtio-blk-vhost-vdpa,path<span class="o">=</span>/dev/vhost-vdpa-0,cache.direct<span class="o">=</span>on <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -device virtio-blk-pci,id<span class="o">=</span>src1,bootindex<span class="o">=</span>2,drive<span class="o">=</span>drive_src1 <span class="se">\
</span></span></span></code></pre></div><p>Anyway, to fully leverage the performance of a vDPA hardware device, we can
always use the generic <code>vhost-vdpa-device-pci</code> device offered by QEMU that
supports any vDPA device and exposes it directly to the guest. Of course,
QEMU is not able to intercept requests in this scenario and therefore some
features offered by its block layer (e.g. live migration, disk format, etc.)
are not supported. Since QEMU 8.0, you can use the following options to attach
a generic vDPA device to a VM:</p>
<div class="highlight"><pre class="chroma" tabindex="0"><code class="language-bash"><span class="line"><span class="cl"> -device vhost-vdpa-device-pci,vhostdev<span class="o">=</span>/dev/vhost-vdpa-0
</span></span></code></pre></div><p>At KVM Forum 2022, <a href="https://kvmforum2022.sched.com/event/15jLs/introducing-the-libblkio-high-performance-block-io-api-stefan-hajnoczi-alberto-faria-red-hat">Alberto Faria and Stefan Hajnoczi introduced libblkio</a>,
while <a href="https://kvmforum2022.sched.com/event/15jK5/qemu-storage-daemon-and-libblkio-exploring-new-shores-for-the-qemu-block-layer-kevin-wolf-stefano-garzarella-red-hat">Kevin Wolf and I discussed its usage in the QEMU Storage Daemon (QSD)</a>.</p>
<h2 id="software-devices">Software devices</h2>
<p>One of the significant benefits of vDPA is its strong abstraction, enabling
the implementation of virtio devices in both hardware and software, whether in
the kernel or in user space. This unification under a single framework, where
devices appear identical to QEMU, facilitates the seamless integration of
hardware and software components.</p>
<h3 id="kernel-devices">Kernel devices</h3>
<p>Regarding in-kernel devices, starting from Linux v5.13, there exists a simple
simulator designed for development and debugging purposes. It is available
through the <code>vdpa-sim-blk</code> kernel module, which emulates a 128 MB ramdisk.
As highlighted in the presentation at KVM Forum 2021, a future device in the
kernel (similar to the repeatedly proposed but never merged <code>vhost-blk</code>)
could potentially offer excellent performance. Such a device could be used
as an alternative when hardware is unavailable, for instance, facilitating
live migration in any system, regardless of whether the destination system
features a SmartNIC/DPU or not.</p>
<h3 id="user-space-devices">User space devices</h3>
<p>Instead, regarding user space, we can use VDUSE. QSD supports it and thus
allows us to export any disk image supported by QEMU as a vDPA device,
in this way:</p>
<div class="highlight"><pre class="chroma" tabindex="0"><code class="language-bash"><span class="line"><span class="cl">qemu-storage-daemon <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> --blockdev file,filename<span class="o">=</span>/path/to/disk.qcow2,node-name<span class="o">=</span>file <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> --blockdev qcow2,file<span class="o">=</span>file,node-name<span class="o">=</span>qcow2 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> --export <span class="nv">type</span><span class="o">=</span>vduse-blk,id<span class="o">=</span>vduse0,name<span class="o">=</span>vduse0,node-name<span class="o">=</span>qcow2,writable<span class="o">=</span>on
</span></span></code></pre></div><h2 id="containers-vms-or-bare-metal">Containers, VMs, or bare-metal</h2>
<p>As mentioned in the introduction, vDPA supports different buses such as
<code>vhost-vdpa</code> and <code>virtio-vdpa</code>. This flexibility enables the utilization of
vDPA devices with virtual machines or user space drivers (e.g., libblkio)
through the <code>vhost-vdpa</code> bus. Additionally, it allows interaction with
applications running directly on the host or within containers via the
<code>virtio-vdpa</code> bus.</p>
<p>The <code>vdpa</code> tool in iproute2 facilitates the management of vdpa devices
through netlink, enabling the allocation and deallocation of these devices.</p>
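<p>As a quick illustration (using the in-kernel block simulator described above), listing the available management devices, creating a device, and deleting it again looks like this:</p>
<pre><code># list management devices that can instantiate vDPA devices
$ vdpa mgmtdev show

# create a device named vdpa0 and inspect it
$ vdpa dev add name vdpa0 mgmtdev vdpasim_blk
$ vdpa dev show vdpa0

# delete it again
$ vdpa dev del vdpa0
</code></pre>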
<p>Starting with Linux 5.17, vDPA drivers support <code>driver_override</code>. This
enhancement allows dynamic reconfiguration at runtime, permitting the
migration of a device from one bus to another in this way:</p>
<div class="highlight"><pre class="chroma" tabindex="0"><code class="language-bash"><span class="line"><span class="cl"><span class="c1"># load vdpa buses</span>
</span></span><span class="line"><span class="cl">$ modprobe -a virtio-vdpa vhost-vdpa
</span></span><span class="line"><span class="cl"><span class="c1"># load vdpa-blk in-kernel simulator</span>
</span></span><span class="line"><span class="cl">$ modprobe vdpa-sim-blk
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># instantiate a new vdpasim_blk device called `vdpa0`</span>
</span></span><span class="line"><span class="cl">$ vdpa dev add mgmtdev vdpasim_blk name vdpa0
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># `vdpa0` is attached to the first vDPA bus driver loaded</span>
</span></span><span class="line"><span class="cl">$ driverctl -b vdpa list-devices
</span></span><span class="line"><span class="cl">vdpa0 virtio_vdpa
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># change the `vdpa0` bus to `vhost-vdpa`</span>
</span></span><span class="line"><span class="cl">$ driverctl -b vdpa set-override vdpa0 vhost_vdpa
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># `vdpa0` is now attached to the `vhost-vdpa` bus</span>
</span></span><span class="line"><span class="cl">$ driverctl -b vdpa list-devices
</span></span><span class="line"><span class="cl">vdpa0 vhost_vdpa <span class="o">[</span>*<span class="o">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Note: driverctl(8) integrates with udev so the binding is preserved.</span>
</span></span></code></pre></div><h2 id="examples">Examples</h2>
<p>Below are several examples on how to use <em>VDUSE</em> and the <em>QEMU Storage Daemon</em>
with VMs (<code>QEMU</code>) or Containers (<code>podman</code>).
These steps are easily adaptable to any hardware that supports virtio-blk
devices via vDPA.</p>
<h3 id="qcow2-image-available-for-host-applications-and-containers">qcow2 image available for host applications and containers</h3>
<div class="highlight"><pre class="chroma" tabindex="0"><code class="language-bash"><span class="line"><span class="cl"><span class="c1"># load vdpa buses</span>
</span></span><span class="line"><span class="cl">$ modprobe -a virtio-vdpa vhost-vdpa
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># create an empty qcow2 image</span>
</span></span><span class="line"><span class="cl">$ qemu-img create -f qcow2 test.qcow2 10G
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># load vduse kernel module</span>
</span></span><span class="line"><span class="cl">$ modprobe vduse
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># launch QSD exposing the `test.qcow2` image as `vduse0` vDPA device</span>
</span></span><span class="line"><span class="cl">$ qemu-storage-daemon --blockdev file,filename<span class="o">=</span>test.qcow2,node-name<span class="o">=</span>file <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> --blockdev qcow2,file<span class="o">=</span>file,node-name<span class="o">=</span>qcow2 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> --export vduse-blk,id<span class="o">=</span>vduse0,name<span class="o">=</span>vduse0,num-queues<span class="o">=</span>1,node-name<span class="o">=</span>qcow2,writable<span class="o">=</span>on <span class="p">&</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># instantiate the `vduse0` device (same name used in QSD)</span>
</span></span><span class="line"><span class="cl">$ vdpa dev add name vduse0 mgmtdev vduse
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># be sure to attach it to the `virtio-vdpa` device to use with host applications</span>
</span></span><span class="line"><span class="cl">$ driverctl -b vdpa set-override vduse0 virtio_vdpa
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># device exposed as a virtio device, but attached to the host kernel</span>
</span></span><span class="line"><span class="cl">$ lsblk -pv
</span></span><span class="line"><span class="cl">NAME TYPE TRAN SIZE RQ-SIZE MQ
</span></span><span class="line"><span class="cl">/dev/vda disk virtio 10G <span class="m">256</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># start a container with `/dev/vda` attached</span>
</span></span><span class="line"><span class="cl">podman run -it --rm --device /dev/vda --group-add keep-groups fedora:39 bash
</span></span></code></pre></div><h3 id="launch-a-vm-using-a-vdpa-device">Launch a VM using a vDPA device</h3>
<div class="highlight"><pre class="chroma" tabindex="0"><code class="language-bash"><span class="line"><span class="cl"><span class="c1"># download Fedora cloud image (or use any other bootable image you want)</span>
</span></span><span class="line"><span class="cl">$ wget https://download.fedoraproject.org/pub/fedora/linux/releases/39/Cloud/x86_64/images/Fedora-Cloud-Base-39-1.5.x86_64.qcow2
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># launch QSD exposing the VM image as `vduse1` vDPA device</span>
</span></span><span class="line"><span class="cl">$ qemu-storage-daemon <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> --blockdev file,filename<span class="o">=</span>Fedora-Cloud-Base-39-1.5.x86_64.qcow2,node-name<span class="o">=</span>file <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> --blockdev qcow2,file<span class="o">=</span>file,node-name<span class="o">=</span>qcow2 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> --export vduse-blk,id<span class="o">=</span>vduse1,name<span class="o">=</span>vduse1,num-queues<span class="o">=</span>1,node-name<span class="o">=</span>qcow2,writable<span class="o">=</span>on <span class="p">&</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># instantiate the `vduse1` device (same name used in QSD)</span>
</span></span><span class="line"><span class="cl">$ vdpa dev add name vduse1 mgmtdev vduse
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># initially it's attached to the host (`/dev/vdb`), because `virtio-vdpa`</span>
</span></span><span class="line"><span class="cl"><span class="c1"># is the first kernel module we loaded</span>
</span></span><span class="line"><span class="cl">$ lsblk -pv
</span></span><span class="line"><span class="cl">NAME TYPE TRAN SIZE RQ-SIZE MQ
</span></span><span class="line"><span class="cl">/dev/vda disk virtio 10G <span class="m">256</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl">/dev/vdb disk virtio 5G <span class="m">256</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl">$ lsblk /dev/vdb
</span></span><span class="line"><span class="cl">NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
</span></span><span class="line"><span class="cl">vdb 251:16 <span class="m">0</span> 5G <span class="m">0</span> disk
</span></span><span class="line"><span class="cl">├─vdb1 251:17 <span class="m">0</span> 1M <span class="m">0</span> part
</span></span><span class="line"><span class="cl">├─vdb2 251:18 <span class="m">0</span> 1000M <span class="m">0</span> part
</span></span><span class="line"><span class="cl">├─vdb3 251:19 <span class="m">0</span> 100M <span class="m">0</span> part
</span></span><span class="line"><span class="cl">├─vdb4 251:20 <span class="m">0</span> 4M <span class="m">0</span> part
</span></span><span class="line"><span class="cl">└─vdb5 251:21 <span class="m">0</span> 3.9G <span class="m">0</span> part
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># and it is identified as `virtio1` in the host</span>
</span></span><span class="line"><span class="cl">$ ls /sys/bus/vdpa/devices/vduse1/
</span></span><span class="line"><span class="cl">driver driver_override power subsystem uevent virtio1
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># attach it to the `vhost-vdpa` device to use the device with VMs</span>
</span></span><span class="line"><span class="cl">$ driverctl -b vdpa set-override vduse1 vhost_vdpa
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># `/dev/vdb` is not available anymore</span>
</span></span><span class="line"><span class="cl">$ lsblk -pv
</span></span><span class="line"><span class="cl">NAME TYPE TRAN SIZE RQ-SIZE MQ
</span></span><span class="line"><span class="cl">/dev/vda disk virtio 10G <span class="m">256</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># the device is identified as `vhost-vdpa-1` in the host</span>
</span></span><span class="line"><span class="cl">$ ls /sys/bus/vdpa/devices/vduse1/
</span></span><span class="line"><span class="cl">driver driver_override power subsystem uevent vhost-vdpa-1
</span></span><span class="line"><span class="cl">$ ls -l /dev/vhost-vdpa-1
</span></span><span class="line"><span class="cl">crw-------. <span class="m">1</span> root root 511, <span class="m">0</span> Feb <span class="m">12</span> 17:58 /dev/vhost-vdpa-1
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># launch QEMU using `/dev/vhost-vdpa-1` device with the</span>
</span></span><span class="line"><span class="cl"><span class="c1"># `virtio-blk-vhost-vdpa` libblkio driver</span>
</span></span><span class="line"><span class="cl">$ qemu-system-x86_64 -m 512M -smp <span class="m">2</span> -M q35,accel<span class="o">=</span>kvm,memory-backend<span class="o">=</span>mem <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -object memory-backend-memfd,share<span class="o">=</span>on,id<span class="o">=</span>mem,size<span class="o">=</span><span class="s2">"512M"</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -blockdev node-name<span class="o">=</span>drive0,driver<span class="o">=</span>virtio-blk-vhost-vdpa,path<span class="o">=</span>/dev/vhost-vdpa-1,cache.direct<span class="o">=</span>on <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -device virtio-blk-pci,drive<span class="o">=</span>drive0
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># `virtio-blk-vhost-vdpa` blockdev can be used with any QEMU block layer</span>
</span></span><span class="line"><span class="cl"><span class="c1"># features (e.g live migration, I/O throttling).</span>
</span></span><span class="line"><span class="cl"><span class="c1"># In this example we are using I/O throttling:</span>
</span></span><span class="line"><span class="cl">$ qemu-system-x86_64 -m 512M -smp <span class="m">2</span> -M q35,accel<span class="o">=</span>kvm,memory-backend<span class="o">=</span>mem <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -object memory-backend-memfd,share<span class="o">=</span>on,id<span class="o">=</span>mem,size<span class="o">=</span><span class="s2">"512M"</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -blockdev node-name<span class="o">=</span>drive0,driver<span class="o">=</span>virtio-blk-vhost-vdpa,path<span class="o">=</span>/dev/vhost-vdpa-1,cache.direct<span class="o">=</span>on <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -blockdev node-name<span class="o">=</span>throttle0,driver<span class="o">=</span>throttle,file<span class="o">=</span>drive0,throttle-group<span class="o">=</span>limits0 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -object throttle-group,id<span class="o">=</span>limits0,x-iops-total<span class="o">=</span><span class="m">2000</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -device virtio-blk-pci,drive<span class="o">=</span>throttle0
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Alternatively, we can use the generic `vhost-vdpa-device-pci` to take</span>
</span></span><span class="line"><span class="cl"><span class="c1"># advantage of all the performance, but without having any QEMU block layer</span>
</span></span><span class="line"><span class="cl"><span class="c1"># features available</span>
</span></span><span class="line"><span class="cl">$ qemu-system-x86_64 -m 512M -smp <span class="m">2</span> -M q35,accel<span class="o">=</span>kvm,memory-backend<span class="o">=</span>mem <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -object memory-backend-memfd,share<span class="o">=</span>on,id<span class="o">=</span>mem,size<span class="o">=</span><span class="s2">"512M"</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span> -device vhost-vdpa-device-pci,vhostdev<span class="o">=</span>/dev/vhost-vdpa-0
</span></span></code></pre></div>Mon, 12 Feb 2024 17:42:57 +0000Stefan Hajnoczi: Key-Value Stores: The Foundation of File Systems and Databaseshttps://blog.vmsplice.net/2024/01/key-value-stores-foundation-of-file.html
https://blog.vmsplice.net/2024/01/key-value-stores-foundation-of-file.html
<p>File systems and relational databases are like cousins. They share more than is apparent at first glance.</p>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOe4ynFc3gNeT-GF281OiZQUOSVrxJQGYLH7PTcA2o5XFol66GllC4MPYokcI_TIqJn8V7tmXoIAl2jacElBPYYizB23PKfmuzuBV6Uis_zKcTBZWB0N_Q-2cN3fn8me6N7X543mhHMsmTQZBPUV5PtwYTzFxP_eNHJU6z3pQgXqxnzMRueKvKRi3cEio/s1600/fs-db.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOe4ynFc3gNeT-GF281OiZQUOSVrxJQGYLH7PTcA2o5XFol66GllC4MPYokcI_TIqJn8V7tmXoIAl2jacElBPYYizB23PKfmuzuBV6Uis_zKcTBZWB0N_Q-2cN3fn8me6N7X543mhHMsmTQZBPUV5PtwYTzFxP_eNHJU6z3pQgXqxnzMRueKvKRi3cEio/s1600/fs-db.png" /></a></div>
<p>It's not immediately obvious that relational databases and file systems rely upon the same underlying concept. That underlying concept is the <a href="https://en.wikipedia.org/wiki/Key%E2%80%93value_database">key-value store</a> and this article explores how both file systems and databases can be implemented on top of key-value stores.
</p>
<h2>The key-value store interface</h2>
<p>
Key-value stores provide an <i>ordered map</i> data structure. A <a href="https://en.wikipedia.org/wiki/Associative_array"><i>map</i></a> is a data structure that supports storing and retrieving from a collection of pairs. It's called a map because it is like a mathematical relation from a given <i>key</i> to an associated <i>value</i>. These are the key-value pairs that a key-value store holds. Finally, <i>ordered</i> means that the collection can be traversed in sorted key order. Not all key-value store implementations support ordered traversal, but both file systems and databases need this property as we shall see.
</p>
<p>Here is a key-value store with an integer key and a string value:</p>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGc-RFEj8ukpbK2f4C9O623icYwjcBygQTS8i06HXrNLSoaVCbTy2RsnkWdN6Q1Zjf3xzaFv-Cc-mzr43XBbVt7YEtFMfzr5stvaUcIqD-DPOy2eyPxkqdo1rXrv2SikSxUzTJn8GVI81hlGecyi3QDXA71vDk3Hr6woR41vexd659eej9U9JlZgRzaRQ/s1600/key-value.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGc-RFEj8ukpbK2f4C9O623icYwjcBygQTS8i06HXrNLSoaVCbTy2RsnkWdN6Q1Zjf3xzaFv-Cc-mzr43XBbVt7YEtFMfzr5stvaUcIqD-DPOy2eyPxkqdo1rXrv2SikSxUzTJn8GVI81hlGecyi3QDXA71vDk3Hr6woR41vexd659eej9U9JlZgRzaRQ/s1600/key-value.png" /></a></div>
<p>Notice that the keys can be enumerated in sorted order: 2 → 14 → 17.</p>
<p>A key-value store provides the following interface for storing and retrieving values by a given key:</p>
<ul>
<li><tt>put(Key, Value)</tt> - an insert/update operation that stores a value for a given key</li>
<li><tt>get(Key) -> Value</tt> - a lookup operation that retrieves the most recently stored value for a given key</li>
<li><tt>first() -> Key</tt>, <tt>last() -> Key</tt>, <tt>next(Key) -> Key</tt>, <tt>prev(Key) -> Key</tt> - a cursor API that enumerates keys in sorted order</li>
</ul>
<p>You've probably seen this sort of API if you have explored libraries like <a href="https://github.com/google/leveldb">LevelDB</a>, <a href="https://rocksdb.org/">RocksDB</a>, <a href="https://www.symas.com/lmdb">LMDB</a>, <a href="https://github.com/boltdb/bolt">BoltDB</a>, etc or used NoSQL key-value stores. File systems and databases usually implement their own customized key-value stores rather than use these off-the-shelf solutions.</p>
<h2>Why key-value stores are necessary</h2>
<p>Let's look at how the key-value store interface relates to disks. Disks present a range of blocks that can be read or written at their block addresses. Disks can be thought of like arrays in programming. They have O(1) lookup and update time complexity but inserting or removing a value before the end of the array is O(n) because subsequent elements need to be copied. They are efficient for dense datasets where every element is populated but inefficient for sparse datasets that involve insertion and removal.</p>
<p>Workloads that involve insertion or removal are not practical when the cost is O(n) for realistic sizes of n. That's why programs often use in-memory data structures like hash tables or balanced trees instead of arrays. Key-value stores can be thought of as the on-disk equivalent to these in-memory data structures. Inserting or removing values from a key-value store takes sub-linear time, perhaps O(log n) or even better amortized time. We won't go into the data structures used to implement key-value stores, but <a href="https://en.wikipedia.org/wiki/B%2B_tree">B+ trees</a> and <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">Log-Structured Merge-Trees</a> are popular choices.</p>
<p>This gives us an intuition about when key-value stores are needed and why they are an effective tool. Now let's look at how file systems and databases can be built on top of key-value stores next.</p>
<h2>Building a file system on a key-value store</h2>
<p>First let's start with how data is stored in files. A file system locates file data on disk by translating file offsets to <a href="https://en.wikipedia.org/wiki/Logical_block_addressing">Logical Block Addresses (LBAs)</a>. This is necessary because file data may not be stored contiguously on disk and files can be sparse with unallocated "holes" where nothing has been written yet. Thus, each file can be implemented as a key-value store with <Offset, <LBA, Length>> key-value pairs that comprise the translations needed to locate data on disk:</p>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjD3bLSAgd7gVFNpxa3gnIYi-BIC_Q8c6MjpS12BxehwImyxA6aMIpRo5wt9FL0zKA5tF5PNjabxXIr7Xyx6kJEml6rn-gcvi4RJcoT8J-bsY5tXDjkFH1RZqom20TLS5nm36acCURrxxd7REeJpfVhKXCsL1Cg8K0ZnIIxrFfHOeeyR64_w2yhCtIgIQk/s1600/file-mapping.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjD3bLSAgd7gVFNpxa3gnIYi-BIC_Q8c6MjpS12BxehwImyxA6aMIpRo5wt9FL0zKA5tF5PNjabxXIr7Xyx6kJEml6rn-gcvi4RJcoT8J-bsY5tXDjkFH1RZqom20TLS5nm36acCURrxxd7REeJpfVhKXCsL1Cg8K0ZnIIxrFfHOeeyR64_w2yhCtIgIQk/s1600/file-mapping.png" /></a></div>
<p>Reading and writing to the file involves looking up Offset -> LBA translations and inserting new translations when new blocks are allocated for the file. This is a good fit for a key-value store, but it's not the only place where file systems employ key-value stores.</p>
<p>File systems track free blocks that are not in use by files or metadata so that the block allocator can quickly satisfy allocation requests. This can be implemented as a key-value store with <LBA, Length> key-value pairs representing all free LBA ranges.</p>
<p>If the block allocator needs to satisfy contiguous allocation requests then a second key-value store with <Length, LBA> key-value pairs can serve as an efficient lookup or <i>index</i>. A best-fit allocator uses this key-value store by looking up the requested contiguous allocation size. Either a free LBA range of the matching size will be found, or, when the lookup fails, the next ordered key can be traversed to find a bigger free range capable of satisfying the allocation request. This is an important pattern with key-value stores: we can have one main key-value store plus one or more indices that are derived from the same key-value pairs but use a different datum as the key than the primary key-value store, allowing efficient lookups and ordered traversal. The same pattern will come up in databases too.</p>
<p>Next, let's look at how to represent directory metadata in a key-value store. Files are organized into a hierarchy of directories (or folders). The file system stores the directory entries belonging to each directory. Each directory can be organized as a key-value store with filenames as keys and inode numbers as values. Path traversal consists of looking up directory entries in each directory along file path components like <tt>home</tt>, <tt>user</tt>, and <tt>file</tt> in the path <tt>/home/user/file</tt>. When a file is created, a new directory entry is inserted. When a file is deleted, its directory entry is removed. The contents of a directory can be listed by traversing the keys.</p>
<p>Some file systems like <a href="https://dl.acm.org/doi/abs/10.1145/2501620.2501623">BTRFS</a> use key-value stores for other on-disk structures such as snapshots, checksums, and so on. There is even a root key-value store in BTRFS from which all these other key-value stores can be looked up. We'll see that the same concept of a "forest of trees" or a root key-value store that points to other key-value stores also appears in databases below.</p>
<h2>Building a database on a key-value store</h2>
<p>The core concept in <a href="https://en.wikipedia.org/wiki/Relational_database">relational databases</a> is the table, which contains the rows of the data we wish to store. The table columns are the various fields that are stored by each row. One or more columns make up the primary key by which table lookups are typically performed. The table can be implemented as a key-value store using the primary key columns as the key and the remainder of the columns as the value:</p>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-1izYyHKUydpmpZ-z3uZCOZ_jfyK0kD-vn775w96EW4_q7MHnvam4KDio1H-N8FehKNHpRMSXJVtuv6oUfFEXeokIAys0ch1iay8w3QtRcC4cDT56v_kW_ukswAFH2hC8VdBSaf-oNlCxChYXPeqH-1mnActT5bmutF5f8buvhQVbDW_MgOS4qbNJep0/s1600/relational-vs-kv.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-1izYyHKUydpmpZ-z3uZCOZ_jfyK0kD-vn775w96EW4_q7MHnvam4KDio1H-N8FehKNHpRMSXJVtuv6oUfFEXeokIAys0ch1iay8w3QtRcC4cDT56v_kW_ukswAFH2hC8VdBSaf-oNlCxChYXPeqH-1mnActT5bmutF5f8buvhQVbDW_MgOS4qbNJep0/s1600/relational-vs-kv.png" /></a></div>
<p>This key-value store can look up rows in the table by their Id. What if we want to look up a row by Username instead?
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjd8_SnvZPOUAdDXuY1zIHbQ52Zy6MyU93pgJPTSaGVHxEqnCL8nvsxZCTqOeKC8xy5vg01ZdpMc3xjejE0EzNX2WaPwAV9wsWvPYIqIb48GbRyIt3zp-TgoH59pM0QD_0rY9Ac21TZL_uQwiV51-ItEZmBy_Xgs6bOSgg2KBha4UjYzs9DTD6-7UNUI7w/s1600/table-and-index.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjd8_SnvZPOUAdDXuY1zIHbQ52Zy6MyU93pgJPTSaGVHxEqnCL8nvsxZCTqOeKC8xy5vg01ZdpMc3xjejE0EzNX2WaPwAV9wsWvPYIqIb48GbRyIt3zp-TgoH59pM0QD_0rY9Ac21TZL_uQwiV51-ItEZmBy_Xgs6bOSgg2KBha4UjYzs9DTD6-7UNUI7w/s1600/table-and-index.png" /></a></div>
<p>To enable efficient lookups by Username, a secondary key-value store called an index maintains a mapping from Username to Id. The index does not duplicate all the columns in the table, just the Username and Id. To perform a query like <tt>SELECT * FROM Users WHERE Username = 'codd'</tt>, the index is first used to look up the Id and then the remainder of the columns are looked up from the table.</p>
<p>SQLite's <a href="https://sqlite.org/fileformat2.html">file format documentation</a> shows the details of how data is organized along these lines and the power of key-value stores. The file format has a header that references the <a href="https://sqlite.org/fileformat2.html#storage_of_the_sql_database_schema">"table b-tree"</a> that points to the roots of all tables. This means there is an entry point key-value store that points to all the other key-value stores associated with tables, indices, etc in the database. This is similar to the forest of trees we saw in the BTRFS file system where the key-value store acts as the central data structure tying everything together.</p>
<h2>Conclusion</h2>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZlLCKuq7vh8Bqm4ErMno95cnjoRQs57dVgsYPxjMGG_BcaR7SmTp3cuXBjA69mqLcTaUPZw4fYHLZLLQSOO4x2w0Sdkf7c5D-FBQxN4k6DVBO3tHr2U0gqYm15lvLmJ_zYdIjzyFxGkTo-47ZOTdE8827rQSgYjlsNzMarP3qybHuqihQgA50vYjckAU/s1600/relationship.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZlLCKuq7vh8Bqm4ErMno95cnjoRQs57dVgsYPxjMGG_BcaR7SmTp3cuXBjA69mqLcTaUPZw4fYHLZLLQSOO4x2w0Sdkf7c5D-FBQxN4k6DVBO3tHr2U0gqYm15lvLmJ_zYdIjzyFxGkTo-47ZOTdE8827rQSgYjlsNzMarP3qybHuqihQgA50vYjckAU/s1600/relationship.png" /></a></div>
<p>If a disk is like an array in programming, then a key-value store is like a dict. It offers a convenient interface for storing and retrieving sparse data with good performance. Both file systems and databases are full of sparse data and therefore fit naturally on top of key-value stores. The actual key-value store implementations inside file systems and databases may be specialized variants of B-trees and other data structures that don't even call themselves key-value stores, but the fundamental abstraction upon which file systems and databases are built is the key-value store.</p>Fri, 26 Jan 2024 01:16:04 +0000Stefan Hajnoczi: QEMU AioContext removal and how it was donehttps://blog.vmsplice.net/2024/01/qemu-aiocontext-removal-and-how-it-was.html
https://blog.vmsplice.net/2024/01/qemu-aiocontext-removal-and-how-it-was.html
<p>This post is about the AioContext lock removal in QEMU 9.0 (planned for release in 2024), how we got here, and what it means for multi-threaded code in QEMU.</p>
<h2>Early QEMU as a single-threaded program</h2>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaRqkkKqhE7ghlhSAp8xKLgRs-R4JL13E2hNg6FnGlbLWGkl4m34B0sLntGiSdNB70eSySkV0tG5O0oV4XIG6KuM9roPIfKHJ6ANI7xiSH-xLI5N2us7AFDmj0dBMlien019UaDJgQ3FLZdO6K2n27U7hyphenhyphenAmvB6KE0M2kmY1NTarvHPDWWNpzpdNvZAEA/s1600/aiocontext-lock-removal.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaRqkkKqhE7ghlhSAp8xKLgRs-R4JL13E2hNg6FnGlbLWGkl4m34B0sLntGiSdNB70eSySkV0tG5O0oV4XIG6KuM9roPIfKHJ6ANI7xiSH-xLI5N2us7AFDmj0dBMlien019UaDJgQ3FLZdO6K2n27U7hyphenhyphenAmvB6KE0M2kmY1NTarvHPDWWNpzpdNvZAEA/s1600/aiocontext-lock-removal.png" /></a></div>
<p>Until 2009 QEMU was largely a single-threaded program. This had the benefit
that the code didn't need to consider thread-safety and was thus simpler and
less bug-prone. The main loop interleaved running the next piece of guest code
and handling external events such as timers, disk I/O, and network I/O. This
architecture had the downside that emulating multi-processor guests was
bottlenecked by the single host CPU on which QEMU ran. There was no parallelism
and this became problematic as multi-processor guests became popular.</p>
<h2>Multi-threading with vCPU threads and the Big QEMU Lock</h2>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiH839GLUwVrLnNx3fjDgB16sXcc85EsF5DxjWYPUueVjQvtCyVGB3hq_K78iES6Ks3eP97t2NS08WcKIJ9dC0ZwZuptwwNWyvoGB7uA9Lnh24hO5fN7Sky2CKGf2Z3HGXXdMmD8ViypgIKzPnfBbnzg33skQ9HRCp9Osmt9ZCuYNtoR3FV2EkMD42rFS0/s1600/aiocontext-lock-removal%281%29.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiH839GLUwVrLnNx3fjDgB16sXcc85EsF5DxjWYPUueVjQvtCyVGB3hq_K78iES6Ks3eP97t2NS08WcKIJ9dC0ZwZuptwwNWyvoGB7uA9Lnh24hO5fN7Sky2CKGf2Z3HGXXdMmD8ViypgIKzPnfBbnzg33skQ9HRCp9Osmt9ZCuYNtoR3FV2EkMD42rFS0/s1600/aiocontext-lock-removal%281%29.png" /></a></div>
<p>The architecture was modified to support running dedicated vCPU threads for KVM guests. This made parallelism possible for multi-processor guests but the feature was initially only available for KVM guests. The Multi-Threaded TCG (MTTCG) feature eventually allowed translated code
to also take advantage of vCPU threads in 2016.</p>
<p>A straightforward approach to making all existing code thread-safe was taken: the Big QEMU
Lock (BQL) was introduced to serialize access to QEMU's internal state. The BQL is a single global mutex that is used to protect the majority of QEMU's internal state. KVM vCPU threads do not need access to
QEMU's internal state while executing guest code, so they don't hold the BQL most of the time. The main loop thread drops the BQL while blocking in <tt>ppoll(2)</tt> and this allows vCPU threads to acquire the lock when they come out of guest code.</p>
<h2>Multi-threading with IOThreads and the AioContext lock</h2>
<p>Although the vCPU bottleneck had been solved, device emulation still ran with the BQL held. This meant that only a single QEMU thread could process I/O requests at a time. For I/O bound workloads this was a bottleneck and especially disk I/O performance suffered due to this limitation. My first attempt at removing the bottleneck in 2012 amounted to writing a new "dataplane" code path outside the BQL, but it lacked the features that users needed like disk image file formats, I/O throttling, etc because it couldn't use the existing code that relied on the BQL. The long term solution would be introducing thread-safety to the existing code and that led to the creation of the AioContext lock.</p>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxXoqTPO2j_GrEHm8cU73KXT1RlIphD4nTi6WX4pRjdSOYwcfzqtvDzQR5bCInPyZzmcrom_46oO1Kq595z2Ie0oK4LuajjWFHoSIkFU5RG0HZFMjv_5ub4E5j7GpdijsikrcsnCdQoBAvs4_h10TjdZs5SEvWbhonBE0B7YpUtsEf6MG1OKlQh5YwVyY/s1600/aiocontext-lock-removal%282%29.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxXoqTPO2j_GrEHm8cU73KXT1RlIphD4nTi6WX4pRjdSOYwcfzqtvDzQR5bCInPyZzmcrom_46oO1Kq595z2Ie0oK4LuajjWFHoSIkFU5RG0HZFMjv_5ub4E5j7GpdijsikrcsnCdQoBAvs4_h10TjdZs5SEvWbhonBE0B7YpUtsEf6MG1OKlQh5YwVyY/s1600/aiocontext-lock-removal%282%29.png" /></a></div>
<p>The AioContext lock was like a mini-BQL but for an event loop (QEMU calls this an AioContext) instead of the entire program. Initially the event loop would acquire the lock while running event handlers, thereby ensuring mutual exclusion for all handlers associated with the event loop. Another thread could acquire the lock to stop the event loop from running and safely access variables. This was a crude approach though and propagated the BQL way of thinking further. QEMU began to suffer from deadlocks and race conditions now that multi-threading was possible. Although I wrote developer documentation about how the model worked, it became tricky to gain confidence in the safety of the code as the whole QEMU block layer needed to grapple with AioContext locking and did so incompletely and inconsistently.</p>
<p>The upshot of all of this was that disk I/O processing could run in a dedicated event loop thread (QEMU calls this an IOThread) while the QEMU monitor could acquire the AioContext lock for a brief moment to inspect the emulated disk for an "info block" monitor command, for example. Unlike the earlier "dataplane" approach, it was now possible for the QEMU block layer to run outside the BQL and instead rely on the AioContext lock.</p>
<h2>Removing the AioContext lock</h2>
<p>Paolo Bonzini had the idea to gradually eliminate the AioContext lock in favor of fine-grained locks because we kept hitting problems with the AioContext lock that I described above. His insight was to change the model so that handler functions would explicitly take their AioContext's lock instead of acquiring the lock around the entire event loop iteration. The advantage to letting handlers take the lock was that they could also replace it with another mechanism. Eventually it would be possible to move away from the AioContext lock.</p>
<p>What came after was a multi-year journey that I credit to Paolo's vision. Emanuele Giuseppe Esposito worked with Paolo on putting fine-grained locking into practice and on sorting through the entire QEMU block layer to determine under which threads and locks variables were accessed. This was a massive effort and required a lot of persistence. Kevin Wolf figured out how to use clang's Thread Safety Analysis (TSA) to check some of the locking rules at compile time. Kevin also spent a lot of time protecting the block driver graph with a reader/writer lock so that in-flight I/O does not crash amidst modifications to the graph. Emanuele and Kevin gave a talk at KVM Forum 2023 about the larger QEMU multi-queue block layer effort and the slides are available <a href="https://kvm-forum.qemu.org/2023/Multiqueue_in_the_block_layer_wLom4Bt.pdf">here (PDF)</a>.</p>
<div class="separator"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoui6qLoeUNBnyz6Wz8LN9aZz-0kMBT0wU4dkZFnMWnaR61H9esMca1QZ0RJh1w-FkVrPGOMLrV622ZG4Wh2ucn3xFD6AIFxUIlV2D_mjyBIESdj4eMoEbISGdVTzOhoW10Dn01eCzP7pdjmR5EJRcD9_I3-rGmRbW4zbMJsgEJVJiHGfLHPogXr_Lkpw/s1600/aiocontext-lock-removal%283%29.png"><img alt="" border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoui6qLoeUNBnyz6Wz8LN9aZz-0kMBT0wU4dkZFnMWnaR61H9esMca1QZ0RJh1w-FkVrPGOMLrV622ZG4Wh2ucn3xFD6AIFxUIlV2D_mjyBIESdj4eMoEbISGdVTzOhoW10Dn01eCzP7pdjmR5EJRcD9_I3-rGmRbW4zbMJsgEJVJiHGfLHPogXr_Lkpw/s1600/aiocontext-lock-removal%283%29.png" /></a></div>
<p>Once everything that previously relied on the AioContext lock had switched to another form of thread-safety, it was possible to remove the AioContext lock as nothing used it anymore. The BQL is still widely used and covers global state that is accessed from few threads. Code that can run in any IOThread now uses its own locks or other mechanisms. The complexity of the codebase is still roughly the same as with the AioContext lock, but now there are fine-grained locks, which are easier to understand and there are fewer undocumented locking assumptions that led to deadlocks and races in the past.</p>
<h2>Conclusion</h2>
<p>QEMU's AioContext lock enabled multi-threading but was also associated with deadlocks and race conditions due to its ambient nature. From QEMU 9.0 onwards, QEMU will switch to fine-grained locks that are more localized and make thread-safety more explicit. Changing locking in a large program is time-consuming and difficult. It took a multi-person multi-year effort to complete this work, but it forms the basis for further work including the QEMU multi-queue block layer effort that pushes multi-threading further in QEMU.</p>Tue, 02 Jan 2024 20:08:41 +0000Stefan Hajnoczi: Upcoming talk: "Trust, confidentiality, and hardening: the virtio lessons" at LPC 2023https://blog.vmsplice.net/2023/11/upcoming-talk-trust-confidentiality-and.html
https://blog.vmsplice.net/2023/11/upcoming-talk-trust-confidentiality-and.html
<p><b>Update:</b> The video is now available <a href="https://www.youtube.com/watch?v=Gu0B9J4ZNNI">here</a> and the slides are available <a href="https://vmsplice.net/~stefan/stefanha-lpc-2023.pdf">here (PDF)</a>.</p>
<p>I will be at Linux Plumbers Conference 2023 to present <a href="https://lpc.events/event/17/contributions/1516/">"Trust, confidentiality, and hardening: the virtio lessons"</a> at 2:30pm on Wednesday, November 15th. Michael Tsirkin and I prepared this talk about the evolution of the trust model of the Linux VIRTIO drivers. It explores how the drivers have been hardened in response to new use cases for VIRTIO, including Linux VDUSE, hardware VIRTIO devices, and Confidential Computing.</p>
<p>Details are available on the <a href="https://lpc.events/event/17/contributions/1516/">LPC schedule</a>. Come watch the talk to find out how drivers work when you can't trust the hypervisor!</p>Mon, 01 Jan 2024 10:52:21 +0000Stefan Hajnoczi: Storage literature notes on free space management and snapshotshttps://blog.vmsplice.net/2024/01/storage-literature-notes-on-free-space.html
https://blog.vmsplice.net/2024/01/storage-literature-notes-on-free-space.html
<div><p>I recently looked at papers about free space management and snapshots in storage systems like file systems, volume managers, and key-value stores. I'm publishing my notes in case you find them useful, but the real value might simply be the links to papers in this field. They might be a useful starting point for someone wishing to read into this field.</p>
<p>My aim was to get an overview of data structures and algorithms used in modern storage systems for tracking free space and snapshotting objects.</p><h3>Literature</h3><ul><li>For a 50-year overview of file systems, see <a href="https://blog.koehntopp.info/2023/05/05/50-years-in-filesystems-1974.html">this blog series</a>.</li></ul><ul><li><a href="https://www.cs.hmc.edu/~rhodes/cs134/readings/The%20Zettabyte%20File%20System.pdf">The Zettabyte File system</a> (2003)<ul><li>The Storage Pool Allocator (SPA) provides allocation and freeing of blocks across physical disks. It deals in disk virtual addresses (DVAs) so the caller is unaware of which disk storage is located. Blocks can be migrated between devices without changing their DVA because the SPA can just update translation metadata.<ul><li>A slab allocator is used to satisfy contiguous block allocation requests of power-of-2 sizes (<a href="https://www.bsdcan.org/2016/schedule/attachments/366_ZFS%20Allocation%20Performance.pdf">see details</a>). Each device is divided into ~200 “metaslabs” (i.e. 0.5% of the device).</li></ul><ul><li>Allocations in a metaslab are written into a log called a space map and rewritten when the log becomes too long (<a href="https://www.delphix.com/blog/openzfs-code-walk-metaslabs-and-space-maps">see details</a>). In memory, range trees are built from the on-disk log so that free space can be looked up by offset or length (<a href="https://web.archive.org/web/20130311220814/https://blogs.oracle.com/bonwick/entry/space_maps">see details</a>).</li></ul></li></ul><ul><li>All blocks are checksummed. Checksums are stored along with the block pointer, so the integrity of the entire tree is protected via the checksum. When data is mirrored across drives it is possible to fix checksum failures.</li></ul><ul><li>The Data Management Unit (DMU) provides an object storage interface for creating, accessing, and deleting objects on top of the SPA.</li></ul><ul><li>The ZFS POSIX Layer (ZPL) implements POSIX file system semantics using the DMU to create objects for directories, files, etc.</li></ul><ul><li>When there are too many data blocks to store the block pointers, ZFS uses indirect blocks (up to 6 levels). Indirect blocks are blocks containing block pointers.</li></ul></li></ul><ul><li><a href="https://dominoweb.draco.res.ibm.com/reports/h-0245.pdf">B-trees, Shadowing, and Clones</a> (2006)<ul><li>Uses a copy-on-write B+-tree to implement an object storage device (OSD).</li></ul><ul><li>Requests are written to a log for recovery in between B+-tree checkpoints.</li></ul><ul><li>B+-tree pages are kept cached in memory until checkpoint write-out so that multiple updates to the same page are batched.</li></ul><ul><li>Hierarchical reference counts are used on tree nodes. This makes refcounts lazy and avoids having to increment/decrement refcounts on all blocks upfront.</li></ul></li></ul><ul><li><a href="https://www.usenix.org/legacy/event/usenix08/tech/full_papers/edwards/edwards.pdf">FlexVol: Flexible, Efficient File Volume Virtualization in WAFL</a> (2008)<ul><li>Introduces logical volumes into WAFL so that multiple file systems can be managed on the same physical storage with separate snapshots, policies, etc.</li></ul><ul><li>Delayed Block Freeing: do not actually free blocks and instead defer until 2% of blocks are ready to be freed in the background.</li></ul><ul><li>Cloning Volumes from Snapshots works like backing file chains in qcow2 or VMDK. 
WAFL knows which Snapshots are referenced and won’t free their metadata and blocks because Clone Volumes may still be using them. Clone Volumes can be detached from their Snapshots by copying out the data blocks to new blocks.</li></ul></li></ul><ul><li><a href="https://www.usenix.org/legacy/event/fast10/tech/full_papers/fast10proceedings.pdf#page=23">Tracking Back References in a Write-Anywhere File System</a> (2010)<ul><li>Log-structured back references are write-optimized so that block allocation, snapshot creation, etc efficiently record users of physical blocks. This information is needed during defragmentation and other data reorganization operations.</li></ul><ul><li>Serves queries from physical block address to logical block (inode, offset).</li></ul><ul><li>Implemented using a log-structured merge tree (requires periodic compaction) and a Bloom filter.</li></ul></li></ul><ul><li><a href="https://ldapcon.org/2011/downloads/chu-paper.pdf">MDB: A Memory-Mapped Database and Backend for OpenLDAP</a> (2011)<ul><li>LMDB is a read-optimized key-value store implemented as a copy-on-write B+-tree</li></ul><ul><li>Concurrency model: 1 writer and N readers at the same time</li></ul><ul><li>Entire database file is mmapped but writes and flushes use syscalls</li></ul><ul><li>Freelist B+-tree tracks free pages in database file</li></ul></li></ul><ul><li><a href="https://picture.iczhiku.com/resource/paper/syidkZUgkzLoyNVN.pdf">BTRFS: The Linux B-tree filesystem</a> (2012)<ul><li>Extent-based free space management<ul><li>Extent allocation tree stores back references, allowing extents to be moved later</li></ul><ul><li>Relies on contiguous free space, so background defragmentation is necessary</li></ul></li></ul><ul><li>Sub-volume tree nodes are reference counted</li></ul><ul><li>A 4KB write creates new inodes, file extents, checksums, and back references and corresponding b-tree spine nodes. When there are multiple modifications, spatial locality (sequential I/O or inode changes in a directory) helps batch these changes together resulting in fewer than N new nodes for N operations. Random I/O is less efficient.</li></ul></li></ul><ul><li><a href="https://storageconference.us/2015/Papers/09.Dragga.pdf">GCTrees: Garbage Collecting Snapshots</a> (2015)<ul><li>Rodeh's hierarchical reference counting delays refcount updates by keep refcounts on tree nodes and updating only the node's refcount closest to the root. Further tree modifications might eventually make it necessary to update subsets of refcounts in tree leaves. 
This can be combined with a refcount log to reduce the random I/O involved in updating many scattered refcounts.</li></ul><ul><li>GCTrees node store an offset to the parent GCTree node and a borrowed bitmap tracking which blocks are shared with the parent.<ul><li>When a GCTree is deleted:<ul><li>Blocks are ignored when the borrowed bit is set</li></ul><ul><li>The borrowed bit is checked in immediate child GCTree nodes to determine if the remaining blocks are still in use:<ul><li>If not in use, free the block</li></ul><ul><li>If in use, clear the borrowed bit in the child to transfer ownership of the block to the child (paper doesn't explain how this works when multiple immediate children borrow the same block because this research only considers read-only snapshots without writeable clone support)</li></ul></li></ul><ul><li>The linked list (relationship between GCTree nodes) is updated</li></ul></li></ul></li></ul></li></ul><ul><li><a href="https://www.usenix.org/system/files/conference/fast17/fast17-kesavan.pdf">Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL</a> (2017)<ul><li>WAFL keeps free space metadata up-to-date instead of eventually consistent (relying on scanning metadata in the background to identify free space).</li></ul><ul><li>Free space is recorded in a bitmap called activemap. Blocks are allocated near each other (e.g. contiguous), if possible, to minimize updates to the activemap.</li></ul><ul><li>WAFL implements background and inline defragmentation to make contiguous free space available.</li></ul><ul><li>File deletion does not instantly clear bits in the activemap because doing so would be expensive on large files. Deleted files are incrementally freed across checkpoints.</li></ul><ul><li>The Batched Free Log (BFLog) holds deleted blocks and sorts them so they can be deleted incrementally.</li></ul></li></ul><ul><li><a href="https://www.usenix.org/system/files/fast20-zhan.pdf">How to Copy Files</a> (2020)<ul><li>Aims to create what they call "nimble clones" (fast creation, fast read/write I/O, and efficient space utilization)</li></ul><ul><li>Read performance with btrfs, ZFS, xfs degrades after successive rounds of clone + write. 
The intuition is that at some point it's better to copy the blocks to avoid fragmentation instead of sharing them.<ul><li>They call this Copy-on-Abundant-Write (CAW)</li></ul></li></ul><ul><li>Implemented in BetrFS, a file system based on a Bε-tree key-value store that uses path names as keys instead of inode numbers.<ul><li>Uses hierarchical reference counts to track nodes</li></ul><ul><li>Free space is tracked in a bitmap in the node translation table, which is used for indirection to avoid rewriting nodes when physical block locations are updated</li></ul><ul><li>Didn't look in detail at the Bε-tree DAG technique introduced to implement efficient copies</li></ul></li></ul></li></ul><h3>Data structures</h3><ul><li><a href="https://en.wikipedia.org/wiki/B%2B_tree">B+ trees</a>: common in file systems and databases for ordered indexes</li></ul><ul><li>Bitmaps: widely used to track block allocation</li></ul><ul><li><a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">Log-structured merge trees</a>: write-optimized key-value stores that require periodic compaction</li></ul><ul><li><a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a>: probabilistic data structure for set membership tests sacrificing accuracy (there can be false positives) for low space requirements</li></ul><ul><li><a href="https://en.wikipedia.org/wiki/Skip_list">Skip lists</a>: probabilistic O(log n) multi-level linked list data structure atop a sorted array but not as popular as B+ trees for on-disk structures</li></ul><p>
</p></div><span></span>Mon, 01 Jan 2024 10:50:41 +0000QEMU project: QEMU version 8.2.0 releasedhttps://www.qemu.org/2023/12/20/qemu-8-2-0/
https://www.qemu.org/2023/12/20/qemu-8-2-0/
<p>We’d like to announce the availability of the QEMU 8.2.0 release. This release contains 3200+ commits from 238 authors.</p>
<p>You can grab the tarball from our <a href="https://www.qemu.org/download/#source">download page</a>. The full list of changes is available <a href="https://wiki.qemu.org/ChangeLog/8.2">in the changelog</a>.</p>
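<p>For those who want to try the release straight from source, a minimal build sketch (assuming the usual tarball location on download.qemu.org and that the standard build dependencies are already installed) looks roughly like this:</p>
<pre><code class="language-sh"># download and unpack the 8.2.0 source tarball
wget https://download.qemu.org/qemu-8.2.0.tar.xz
tar xf qemu-8.2.0.tar.xz
cd qemu-8.2.0

# configure a build for just the targets you need, then build it
./configure --target-list=x86_64-softmmu
make -j$(nproc)
</code></pre>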
<p>Highlights include:</p>
<ul>
<li>New virtio-sound device emulation</li>
<li>New virtio-gpu rutabaga device emulation used by Android emulator</li>
<li>New hv-balloon for dynamic memory protocol device for Hyper-V guests</li>
<li>New Universal Flash Storage device emulation</li>
<li>Network Block Device (NBD) 64-bit offsets for improved performance</li>
<li>dump-guest-memory now supports the standard kdump format</li>
<li>ARM: Xilinx Versal board now models the CFU/CFI, and the TRNG device</li>
<li>ARM: CPU emulation support for cortex-a710 and neoverse-n2</li>
<li>ARM: architectural feature support for PACQARMA3, EPAC, Pauth2, FPAC, FPACCOMBINE, TIDCP1, MOPS, HBC, and HPMN0</li>
<li>HPPA: CPU emulation support for 64-bit PA-RISC 2.0</li>
<li>HPPA: machine emulation support for C3700, including Astro memory controller and four Elroy PCI bridges</li>
<li>LoongArch: ISA support for LASX extension and PRELDX instruction</li>
<li>LoongArch: CPU emulation support for la132</li>
<li>RISC-V: ISA/extension support for AIA virtualization support via KVM, and vector cryptographic instructions</li>
<li>RISC-V: Numerous extension/instruction cleanups, fixes, and reworks</li>
<li>s390x: support for vfio-ap passthrough of crypto adapter for protected virtualization guests</li>
<li>Tricore: support for TC37x CPU which implements ISA v1.6.2</li>
<li>Tricore: support for CRCN, FTOU, FTOHP, and HPTOF instructions</li>
<li>x86: Xen support for PV console and network devices</li>
<li>and lots more…</li>
</ul>
<p>Thank you to everybody who contributed to this release, whether that was by writing code, reporting bugs, improving documentation, testing, or providing the project with CI resources. We couldn’t do these without you!</p>Wed, 20 Dec 2023 16:33:00 +0000Gerd Hoffmann: W^X in UEFI firmware and the linux boot chain.https://www.kraxel.org/blog/2023/12/uefi-nx-linux-boot/
https://www.kraxel.org/blog/2023/12/uefi-nx-linux-boot/
<h2>What is W^X?</h2>
<p>
If this sounds familiar to you, it probably is. It means that
memory should be either writable ("W", typically data), or
executeable ("X", typically code), but not both.
Elsewhere in the software industry this is standard security
practice since ages. Now it starts to take off for UEFI firmware
too.
</p>
<p>
This is a deep dive into recent changes, in both code (firmware) and
administration (secure boot signing), the consequences this has for
the linux, and the current state of affairs.
</p>
<h2>Changes in the UEFI spec and edk2</h2>
<p>
All UEFI memory allocations carry a memory type
(<code>EFI_MEMORY_TYPE</code>). UEFI has tracked since day one whether
a memory allocation is meant for code or data, among a bunch of
other properties such as boot service vs. runtime service memory.
</p>
<p>
For a long time it didn't matter much in practice. The concept of
virtual memory does not exist for UEFI. IA32 builds even run with
paging disabled (and this is unlikely to change until the
architecture disappears into irrelevance). Other architectures use
identity mappings.
</p>
<p>
While UEFI does not use address translation, nowadays it can use page
tables to enforce memory attributes, including (but not limited to)
write and execute permissions. When configured to do so it will set
code pages to <code>R-X</code> and data pages to <code>RW-</code>
instead of using <code>RWX</code> everywhere, so code using memory
types incorrectly will trigger page faults.
</p>
<p>
New in the UEFI spec (added in version 2.10) is
the <code>EFI_MEMORY_ATTRIBUTE_PROTOCOL</code>. Sometimes
properties of memory regions need to change, and this protocol can
be used to do so. One example is a self-uncompressing binary, where
the memory region the binary gets unpacked to initially must be
writable. Later (parts of) the memory region must be flipped from
writable to executable.
</p>
<p>
As of today (Dec 2023) edk2 has
a <code>EFI_MEMORY_ATTRIBUTE_PROTOCOL</code> implementation for the
ARM and AARCH64 architectures, so this is present in the ArmVirt
firmware builds but not in the OVMF builds.
</p>
<h2>Changed secure boot signing requirements</h2>
<p>
In an effort to improve firmware security in general and especially
for secure boot
Microsoft <a href="https://techcommunity.microsoft.com/t5/hardware-dev-center/updated-uefi-signing-requirements/ba-p/1062916">changed
the requirements</a> for binaries they are willing to sign with
their UEFI CA key.
</p>
<p>
One key requirement added is that the binary layout must allow
enforcing memory attributes with page tables, i.e. PE binary sections
must be aligned to page size (4k). Sections also can't be both
writable and executable. And the application must be able to deal
with data sections being mapped as non-executable (NX_COMPAT).
</p>
<p>
These requirements apply to the binary itself
(i.e. <code>shim.efi</code> for linux systems) and everything loaded
by the binary (i.e. <code>grub.efi</code>, <code>fwupd.efi</code>
and the linux kernel).
</p>
<h2>Where does linux stand?</h2>
<p>
We had and partly still have a bunch of problems in all components
involved in the linux boot process,
i.e. <code>shim.efi</code>, <code>grub.efi</code> and the efi stub
of the linux kernel.
</p>
<p>
Some are old bugs such as memory types not being used correctly,
which start to cause problems due to the firmware becoming more
strict. Some are new problems due to Microsoft raising the bar for
PE binaries, typically sections not being page-aligned. The latter
are easily fixed in most cases; often it is just a matter of adding
alignment to the right places in the linker scripts.
</p>
<p>
Let's have a closer look at the components one by one:
</p>
<dl>
<dt><code>shim.efi</code></dt>
<dd>
<p>
shim added code to use the new <code>EFI_MEMORY_ATTRIBUTE_PROTOCOL</code>
before it was actually implemented by any firmware. Then this
was released completely untested. That did not work out very
well: we got a nice time bomb, and edk2 implementing
<code>EFI_MEMORY_ATTRIBUTE_PROTOCOL</code> for arm triggered it
...
</p>
<p>
Fixed in <code>main</code> branch, no release yet.
</p>
<p>
Getting new shim.efi binaries signed by Microsoft depends on the
complete boot chain being compliant with the new requirements,
which prevents shim bugfixes being shipped to users right now.
</p>
<p>
That should be solved soon though, see the kernel section below.
</p>
</dd>
<dt><code>grub.efi</code></dt>
<dd>
<p>
grub.efi used to use memory types incorrectly.
</p>
<p>
Fixed upstream years ago, case closed.
</p>
<p>
Well, in theory. Upstream grub development goes at glacial
speeds, so all distros carry a big stack of downstream patches.
Not surprisingly that leads to upstream fixes being absorbed
slowly and also to bugs getting reintroduced.
</p>
<p>
So, in practice we still have buggy grub versions in the wild.
It is getting better though.
</p>
</dd>
<dt>The linux kernel</dt>
<dd>
<p>
The linux kernel efi stub had its fair share of bugs too. On
non-x86 architectures (arm, riscv, ...) all issues have been
fixed a few releases ago. They all share much of the efi stub
code base and also use the same self-decompressing method
(CONFIG_EFI_ZBOOT=y).
</p>
<p>
On x86 this all took a bit longer to sort out. For historical
reasons x86 can't use the zboot approach used by the other
architectures. At least as long as we need hybrid BIOS/UEFI
kernels, which most likely will be a number of years still.
</p>
<p>
The final x86 patch series has been merged during the 6.7 merge
window. So we should have a fixed stable kernel in early
January 2024, and distros picking up the new kernel in the
following weeks or months. Which in turn should finally unblock
shim updates.
</p>
</dd>
</dl>
<p>
There should be enough time to get everything sorted for the spring
distro releases (Fedora 40, Ubuntu 24.04).
</p>
<h2>edk2 config options</h2>
<p>
edk2 has a bunch of config options to fine tune the firmware
behavior, both compile time and runtime. The relevant ones for the
problems listed above are:
</p>
<dl>
<dt><code>PcdDxeNxMemoryProtectionPolicy</code></dt>
<dd>
<p>
Compile time option. Use the <code>--pcd</code> switch for the
edk2 <code>build</code> script to set it (see the example build
invocation after this list). It's a bitmask, with
one bit for each memory type, specifying whether the firmware
should apply memory protections for that particular memory type,
by setting the flags in the page tables accordingly.
</p>
<p>
Strict configuration is <code>PcdDxeNxMemoryProtectionPolicy =
0xC000000000007FD5</code>. This is also the default for ArmVirt
builds.
</p>
<p>
Bug compatible configuration
is <code>PcdDxeNxMemoryProtectionPolicy =
0xC000000000007FD1</code>. This excludes
the <code>EfiLoaderData</code> memory type from memory
protections, so using <code>EfiLoaderData</code> allocations for
code will not trigger page faults. Which is an very common
pattern seen in boot loader bugs.
</p>
</dd>
<dt><code>PcdUninstallMemAttrProtocol</code></dt>
<dd>
<p>
Compile time option, for ArmVirt only. Brand
new, <a href="https://github.com/tianocore/edk2/commit/cee7ba349c0c1ce489001a338a4e28555728b573">committed</a>
to the edk2 repo this week (Dec 12th 2023). When set to TRUE
the <code>EFI_MEMORY_ATTRIBUTE_PROTOCOL</code> will be
uninstalled.
Default is FALSE.
</p>
<p>
Setting this to TRUE will work around the shim bug.
</p>
</dd>
<dt><code>opt/org.tianocore/UninstallMemAttrProtocol</code></dt>
<dd>
<p>
Runtime option, for ArmVirt only. Also new. Can be set using
-fw_cfg on the qemu command line: <code>-fw_cfg
name=opt/org.tianocore/UninstallMemAttrProtocol,string=<i>y|n</i></code>.
This is a runtime override for PcdUninstallMemAttrProtocol.
Works for both enabling and disabling the shim bug workaround.
</p>
</dd>
</dl>
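<p>
As an illustration, here is roughly what a strict build invocation could
look like. The platform file, toolchain tag and token space name below are
assumptions, so double-check them against your edk2 tree:
</p>
<pre><code class="language-sh"># sketch: build OVMF with strict NX memory protections by overriding
# the PCD on the build command line (paths and token space assumed)
build -a X64 -t GCC5 -p OvmfPkg/OvmfPkgX64.dsc \
      --pcd gEfiMdeModulePkgTokenSpaceGuid.PcdDxeNxMemoryProtectionPolicy=0xC000000000007FD5
</code></pre>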
<p>
In the future <code>PcdDxeNxMemoryProtectionPolicy</code> will
probably disappear in favor of memory profiles, which will allow
configuring the same settings (plus a few more) at runtime.
</p>
<h2 id="strictnx">Hands on, part #1 — using fedora edk2 builds</h2>
<p>
The default builds in the <code>edk2-ovmf</code>
and <code>edk2-aarch64</code> packages are configured to be bug
compatible, so VMs should boot fine even in case the guests are
using a buggy boot chain.
</p>
<p>
While this is great for end users it doesn't help much for
bootloader development and testing, so there are alternatives.
The <code>edk2-experimental</code> package comes with a collection
of builds better suited for that use case, configured with strict
memory protections and (on
aarch64) <code>EFI_MEMORY_ATTRIBUTE_PROTOCOL</code> enabled, so you
can see buggy builds actually crash and burn. 🔥
</p>
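<p>
On Fedora the experimental builds can simply be installed alongside the
regular firmware packages (package names as given above):
</p>
<pre><code class="language-sh"># install the regular firmware packages plus the experimental strict builds
sudo dnf install edk2-ovmf edk2-aarch64 edk2-experimental
</code></pre>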
<h3>AARCH64 architecture</h3>
<p>
For AARCH64 this
is <code>/usr/share/edk2/experimental/QEMU_EFI-strictnx-pflash.raw</code>.
The magic words for libvirt are:
</p>
<pre><code class="language-xml"><span class="nt"><domain</span> <span class="na">type=</span><span class="s">'kvm'</span><span class="nt">></span>
[ ... ]
<span class="nt"><os></span>
<span class="nt"><type</span> <span class="na">arch=</span><span class="s">'aarch64'</span> <span class="na">machine=</span><span class="s">'virt'</span><span class="nt">></span>hvm<span class="nt"></type></span>
<span class="nt"><loader</span> <span class="na">readonly=</span><span class="s">'yes'</span> <span class="na">type=</span><span class="s">'pflash'</span><span class="nt">></span>/usr/share/edk2/experimental/QEMU_EFI-strictnx-pflash.raw<span class="nt"></loader></span>
<span class="nt"><nvram</span> <span class="na">template=</span><span class="s">'/usr/share/edk2/aarch64/vars-template-pflash.raw'</span><span class="nt">/></span>
<span class="nt"></os></span>
[ ... ]</code></pre>
<p>
If a page fault happens you will get this line ...
</p>
<pre>
Synchronous Exception at 0x00000001367E6578
</pre>
<p>
... on the serial console, followed by a stack trace and register
dump.
</p>
<h3>X64 architecture</h3>
<p>
For X64 this
is <code>/usr/share/edk2/experimental/OVMF_CODE_4M.secboot.strictnx.qcow2</code>.
Needs <code>edk2-20231122-12.fc39</code> or newer. The magic words
for libvirt are:
</p>
<pre><code class="language-xml"><span class="nt"><domain</span> <span class="na">type=</span><span class="s">'kvm'</span><span class="nt">></span>
[ ... ]
<span class="nt"><os></span>
<span class="nt"><type</span> <span class="na">arch=</span><span class="s">'x86_64'</span> <span class="na">machine=</span><span class="s">'q35'</span><span class="nt">></span>hvm<span class="nt"></type></span>
<span class="nt"><loader</span> <span class="na">readonly=</span><span class="s">'yes'</span> <span class="na">secure=</span><span class="s">'yes'</span> <span class="na">type=</span><span class="s">'pflash'</span> <span class="na">format=</span><span class="s">'qcow2'</span><span class="nt">></span>/usr/share/edk2/experimental/OVMF_CODE_4M.secboot.strictnx.qcow2<span class="nt"></loader></span>
<span class="nt"><nvram</span> <span class="na">template=</span><span class="s">'/usr/share/edk2/ovmf/OVMF_VARS_4M.secboot.qcow2'</span> <span class="na">format=</span><span class="s">'qcow2'</span><span class="nt">/></span>
<span class="nt"></os></span>
[ ... ]</code></pre>
<p>
It is also a good idea to add a debug console to capture the
firmware log:
</p>
<pre><code class="language-xml"> <span class="nt"><serial</span> <span class="na">type=</span><span class="s">'null'</span><span class="nt">></span>
<span class="nt"><log</span> <span class="na">file=</span><span class="s">'/path/to/firmware.log'</span> <span class="na">append=</span><span class="s">'off'</span><span class="nt">/></span>
<span class="nt"><target</span> <span class="na">type=</span><span class="s">'isa-debug'</span> <span class="na">port=</span><span class="s">'1'</span><span class="nt">></span>
<span class="nt"><model</span> <span class="na">name=</span><span class="s">'isa-debugcon'</span><span class="nt">/></span>
<span class="nt"></target></span>
<span class="nt"><address</span> <span class="na">type=</span><span class="s">'isa'</span> <span class="na">iobase=</span><span class="s">'0x402'</span><span class="nt">/></span>
<span class="nt"></serial></span></code></pre>
<p>
If you are lucky the page fault is logged there, also with a
register dump. If you are not so lucky the VM will just reset and
reboot.
</p>
<h2>Hands on, part #2 — using virt-firmware</h2>
<p>
The <a href="https://gitlab.com/kraxel/virt-firmware">virt-firmware</a>
project is a collection of python modules and scripts for working
with efi variables, efi varstores and also pe binaries. In case
your distro doesn't have packages you can install it
using <code>pip</code> like most python packages.
</p>
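<p>
For example, assuming the package on PyPI carries the same name as the
project, a per-user install could look like this:
</p>
<pre><code class="language-sh"># install virt-firmware from PyPI (package name assumed to match the project)
python3 -m pip install --user virt-firmware

# the scripts should then be available in $PATH
virt-fw-vars --help
</code></pre>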
<h3>virt-fw-vars</h3>
<p>
The <code>virt-fw-vars</code> utility can work with efi varstores.
For example it is used to create the <code>OVMF_VARS*secboot*</code>
files, enrolling the secure boot certificates into the efi security
databases.
</p>
<p>
The simplest operation is to print the variable store:
</p>
<pre><code class="language-sh">virt-fw-vars <span class="nt">--input</span> /usr/share/edk2/ovmf/OVMF_VARS_4M.secboot.qcow2 <span class="se">\</span>
<span class="nt">--print</span> <span class="nt">--verbose</span> | less</code></pre>
<p>
When updating edk2 varstores <code>virt-fw-vars</code> always needs
both input and output files. If you want change an existing
variable store both input and output can point to the same file.
For example you can turn on shim logging for an existing libvirt
guest this way:
</p>
<pre><code class="language-sh">virt-fw-vars <span class="nt">--input</span> /var/lib/libvirt/qemu/nvram/<span class="k">${</span><span class="nv">guest</span><span class="k">}</span>_VARS.qcow2 <span class="se">\</span>
<span class="nt">--output</span> /var/lib/libvirt/qemu/nvram/<span class="k">${</span><span class="nv">guest</span><span class="k">}</span>_VARS.qcow2 <span class="se">\</span>
<span class="nt">--set-shim-verbose</span></code></pre>
<p>
The next virt-firmware version will get a new <code>--inplace</code>
switch to avoid listing the file twice on the command line for this
use case.
</p>
<p>
If you want to start from scratch you can use an empty variable store
from <code>/usr/share/edk2</code> as input. For example when
creating a new variable store template with the test CA certificate
(shipped with pesign.rpm) enrolled additionally:
</p>
<pre><code class="language-sh">dnf <span class="nb">install</span> <span class="nt">-y</span> pesign
certutil <span class="nt">-L</span> <span class="nt">-d</span> /etc/pki/pesign-rh-test <span class="nt">-n</span> <span class="s2">"Red Hat Test CA"</span> <span class="nt">-a</span> <span class="se">\</span>
| openssl x509 <span class="nt">-text</span> <span class="o">></span> rh-test-ca.txt
virt-fw-vars <span class="nt">--input</span> /usr/share/edk2/ovmf/OVMF_VARS_4M.qcow2 <span class="se">\</span>
<span class="nt">--output</span> OVMF_VARS_4M.secboot.rhtest.qcow2 <span class="se">\</span>
<span class="nt">--enroll-redhat</span> <span class="nt">--secure-boot</span> <span class="se">\</span>
<span class="nt">--add-db</span> OvmfEnrollDefaultKeys rh-test-ca.txt</code></pre>
<p>
The test CA will be used by all Fedora, CentOS Stream and RHEL build
infrastructure to sign unofficial builds, for example when doing
scratch builds in koji or when building rpms locally on your
developer workstation. If you want to test such builds in a VM, with
secure boot enabled, this is a convenient way to do it.
</p>
<h3>pe-inspect</h3>
<p>
Useful for having a look at EFI binaries is <code>pe-inspect</code>.
If this isn't present try <code>pe-listsigs</code>. Initially the
utility only listed the signatures, but was extended over time to
show more information, so I added the <code>pe-inspect</code> alias
later on.
</p>
<p>
Below is the output for a 6.6 x86 kernel; you can see it does not
have the patches to page-align the sections:
</p>
<pre>
# file: /boot/vmlinuz-6.6.4-200.fc39.x86_64
# section: file 0x00000200 +0x00003dc0 virt 0x00000200 +0x00003dc0 r-x (.setup)
# section: file 0x00003fc0 +0x00000020 virt 0x00003fc0 +0x00000020 r-- (.reloc)
# section: file 0x00003fe0 +0x00000020 virt 0x00003fe0 +0x00000020 r-- (.compat)
# section: file 0x00004000 +0x00df6cc0 virt 0x00004000 +0x05047000 r-x (.text)
# sigdata: addr 0x00dfacc0 +0x00000d48
# signature: len 0x5da, type 0x2
# certificate
# subject CN: Fedora Secure Boot Signer
# issuer CN: Fedora Secure Boot CA
# signature: len 0x762, type 0x2
# certificate
# subject CN: kernel-signer
# issuer CN: fedoraca
</pre>
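<p>
The output above was produced by simply pointing the utility at the kernel
image, presumably something like this (adjust the version to whatever kernel
is installed):
</p>
<pre><code class="language-sh"># inspect the PE sections and signatures of the installed kernel
pe-inspect /boot/vmlinuz-6.6.4-200.fc39.x86_64

# on older virt-firmware versions the utility is only available as pe-listsigs
pe-listsigs /boot/vmlinuz-6.6.4-200.fc39.x86_64
</code></pre>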
<p>
<code>pe-inspect</code> also knows the names for a number of special
sections and supports decoding and pretty-printing them, for example
here:
</p>
<pre>
# file: /usr/lib/systemd/boot/efi/systemd-bootx64.efi
# section: file 0x00000400 +0x00011a00 virt 0x00001000 +0x0001191f r-x (.text)
# section: file 0x00011e00 +0x00003a00 virt 0x00013000 +0x00003906 r-- (.rodata)
# section: file 0x00015800 +0x00000400 virt 0x00017000 +0x00000329 rw- (.data)
# section: file 0x00015c00 +0x00000200 virt 0x00018000 +0x00000030 r-- (.sdmagic)
# #### LoaderInfo: systemd-boot 254.7-1.fc39 ####
# section: file 0x00015e00 +0x00000200 virt 0x00019000 +0x00000049 r-- (.osrel)
# section: file 0x00016000 +0x00000200 virt 0x0001a000 +0x000000de r-- (.sbat)
# sbat,1,SBAT Version,sbat,1,https://github.com/rhboot/shim/blob/main/SBAT.md
# systemd,1,The systemd Developers,systemd,254,https://systemd.io/
# systemd.fedora,1,Fedora Linux,systemd,254.7-1.fc39,https://bugzilla.redhat.com/
# section: file 0x00016200 +0x00000200 virt 0x0001b000 +0x00000084 r-- (.reloc)
</pre>
<h3>virt-fw-sigdb</h3>
<p>
The last utility I want to introduce is <code>virt-fw-sigdb</code>,
which can create, parse and modify signature databases. The
signature database format is used by the firmware to store
certificates and hashes in EFI variables. But sometimes the format
is used for files too. virt-firmware has the functionality anyway, so
I've added a small frontend utility to work with those files.
</p>
<p>
One file in signature database format
is <code>/etc/pki/ca-trust/extracted/edk2/cacerts.bin</code> which
contains the list of trusted CAs in signature database format.
Can be used to pass the CA list to the VM firmware for TLS
connections (https network boot).
</p>
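<p>
As a sketch, the file can be handed to the guest firmware via fw_cfg; the
item name below is the one OVMF is commonly documented to look for, so treat
it as an assumption and verify it against your firmware build:
</p>
<pre><code class="language-sh"># pass the host CA bundle (signature database format) to the guest firmware
# for https boot; the fw_cfg item name is an assumption
qemu-system-x86_64 \
    -machine q35 -m 2G \
    -fw_cfg name=etc/edk2/https/cacerts,file=/etc/pki/ca-trust/extracted/edk2/cacerts.bin
</code></pre>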
<p>
Shim also uses that format when compiling multiple certificates into
the built-in VENDOR_DB or VENDOR_DBX databases.
</p>
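<p>
A hypothetical invocation for listing the contents of such a file could look
like the sketch below; the option names are an assumption modeled on the
virt-fw-vars examples above, so check <code>virt-fw-sigdb --help</code> for
the real interface:
</p>
<pre><code class="language-sh"># sketch: dump the entries of a signature database file (options assumed)
virt-fw-sigdb --input /etc/pki/ca-trust/extracted/edk2/cacerts.bin --print
</code></pre>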
<h2>Final remarks</h2>
<p>
That's it for today, folks. Hope you find this useful.
</p>Thu, 14 Dec 2023 23:00:00 +0000KVM on Z: Red Hat Ansible Automation Platform available on IBM Z and LinuxONEhttps://kvmonz.blogspot.com/2023/12/red-hat-ansible-automation-platform.html
https://kvmonz.blogspot.com/2023/12/red-hat-ansible-automation-platform.html
<p>While Linux on IBM Z and LinuxONE has long been usable as a target for Ansible scripts, the backend had to run on other architectures. No longer: Starting today, the entire Red Hat Ansible Automation Platform is becoming available on IBM Z and LinuxONE!</p><p>See <a href="https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/gerald-hosch1/2023/12/07/red-hat-ansible-automation-platform-capabilities-o?CommunityKey=710b7087-fa90-4d58-94df-2926e18da67f">here</a> for more details, and <a href="https://www.redhat.com/en/blog/red-hat-ansible-automation-platform-now-available-on-ibm">here</a> for the formal announcement from Red Hat.</p>Mon, 11 Dec 2023 17:46:53 +0000Alex Bennée: A Systems Programmer's Perspectives on Generative AIhttps://www.bennee.com/~alex/blog/2023/12/10/a-systems-programmers-perspectives-on-generative-ai/
https://www.bennee.com/~alex/blog/2023/12/10/a-systems-programmers-perspectives-on-generative-ai/
<p>Like many people over the last few months I've been playing with a number of Large Language Models (LLMs). LLMs are
perhaps best typified by the current media star ChatGPT. It is hard to avoid the current media buzz while every tech
titan is developing their "AI" play and people are exposed to tools where the label of Artificial Intelligence is
liberally applied. The ability of these models to spit out <a href="https://www.bennee.com/~alex/blog/2023/10/22/comparing-forge-based-and-email-based-workflow-for-open-source-projects/">competent comprehensible
text</a> is seemingly a step change in ability compared to previous generations of tech.</p>
<p>I thought I would try and collect some of my thoughts and perspectives on this from the point of view of a <a href="https://en.wikipedia.org/wiki/Systems_programming" title="link to wikipedia definition">systems
programmer</a>. For those not familiar
with the term it refers to the low level development of providing platforms for the applications people actually use. In
my case a lot of the work I do is on <a href="https://www.qemu.org" title="link to QEMU homepage">QEMU</a>, which involves emulating the very
lowest level instructions a computer can do: the simple arithmetic and comparison of numbers that all code is eventually
expressed as.</p>
<h1>Magic numbers and computing them</h1>
<p>I claim no particular expertise on machine learning so expect this to be a very superficial explanation of what's going
on.</p>
<p>In normal code the CPU tends to execute a lot of different instruction sequences as a program runs through solving the
problem you have set it. The code that calculates where to draw your window will be different to the code checking the
network for new data, or the logic that stores information safely on your file system. Each of those tasks is decomposed
and abstracted into simpler and simpler steps until eventually it is simple arithmetic dictating what the processor
should do do next. You occasionally see hot spots where a particular sequence of instructions are doing a lot of heavy
lifting. There is a whole discipline devoted to managing computational complexity and ensuring algorithms are as
efficient as possible.</p>
<p>However the various technologies that are currently wowing the world work very differently. They are models of various
networks represented by a series of magic numbers or "weights" arranged in a hierarchical structure of interconnected
<a href="https://en.wikipedia.org/wiki/Matrix_(mathematics)">matrices</a>. While there is a lot of nuance to how problems are
encoded and fed into these models fundamentally the core piece of computation is multiplying a bunch of numbers with
another bunch of numbers feeding their results into the next layer of the network. At the end of the process the model
spits out a prediction of what the most likely next word is going to be. After selecting one the cycle repeats, taking into
account our expanded context to predict the most likely next word.</p>
<p>The "models" that drive these things are described mostly by the number of parameters they have. This encompasses the
number of inputs and outputs they have and the number of numbers in between. For example common small open source models
start at 3 billion parameters with 7, 13 and 34 billion also being popular sizes. Beyond that it starts getting hard to
run models locally on all but the most tricked out desktop PCs. As a developer my desktop is pretty beefy (32 cores,
64Gb RAM) and can chew through computationally expensive builds pretty easily. However as I can't off-load processing
onto my GPU a decent sized model will chug out a few words a second while maxing out my CPU. The ChatGPT v4 model is
speculated to run about 1.7 trillion parameters which needs to be run on expensive cloud hardware - I certainly don't
envy <a href="https://openai.com/">OpenAI</a> their infrastructure bill.</p>
<p>Of course the computational power needed to run these models is a mere fraction of what it took to train them. In fact
the bandwidth and processing requirements are so large it pays to develop custom silicon that is really good at
multiplying large amounts of numbers and not much else. You can get a lot more bang for your buck compared to running
those calculations on a general purpose CPU designed for tackling a wide range of computation problems.</p>
<h1>The Value of Numbers</h1>
<p>Because of the massive investment in synthesising these magic numbers they themselves become worth something. The "magic
sauce" behind a model is more about how it was trained and what data was used to do it. We already know its possible to
encode societies biases into models due to sloppy selection of the input data. One of the principle criticisms of
proprietary generative models is how opaque the training methods are making it hard to judge their safety. The degree to
which models may regurgitate data without any transformation is hard to quantify when you don't know what went into it.</p>
<p>As I'm fundamentally more interested in knowing how the technology I use works under the hood it's fortunate there is a
growing open source community working on building their own models. Credit should be given to Meta who made their
language model <a href="https://ai.meta.com/llama/" title="link to Meta's Llama page">LLaMA 2</a> freely available on fairly permissive
terms. Since then there has been an explosion of open source projects that can run the models (e.g:
<a href="https://github.com/ggerganov/llama.cpp" title="link to the llama.cpp project for running CPU bound models">llama.cpp</a>,
<a href="https://ollama.ai/" title="link to Ollama, another tool for locally running models">Ollama</a>) and provide front-ends (e.g:
<a href="https://github.com/oobabooga/text-generation-webui" title="link to Oobabooga's text generation UI">Oobabooga's text generation UI</a>, <a href="https://github.com/s-kostyaev/ellama" title="link to the Ellama github page">Ellama front-end for Emacs</a>) for them.</p>
<h1>Smaller Magic Numbers</h1>
<p>The principle place where this work is going on is <a href="https://huggingface.co/" title="link to the Hugging Face website">Hugging Face</a>. Think of it as the <a href="https://github.com" title="GitHub">GitHub</a> of the machine learning community. It provides an
environment for publishing and collaborating on data sets and models as well as hosting and testing their effectiveness in
various benchmarks. This makes experimenting with models accessible to developers who aren't part of the well funded
research divisions of the various tech titans. Datasets for example come with
<a href="https://huggingface.co/docs/hub/datasets-cards">cards</a> which describe the sources that went into these multi-terabyte
files.</p>
<p>One such example is the <a href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T" title="link to the RedPajama dataset">RedPajama dataset</a>. This is an open source initiative to recreate the LLaMA training data which combines
data from the open web as well as numerous permissively licensed sources such as Wikipedia, GitHub, StackExchange and
ArXiv. This dataset has been used to train models like <a href="https://huggingface.co/openlm-research" title="link to OpenLM research Hugging Face pages">OpenLLaMA</a> in an attempt to provide an unencumbered version of Meta's LLaMA 2. However
training up these foundational models is an expensive and time consuming task, the real action is taking these models
and then fine tuning them for particular tasks.</p>
<p>To fine tune a model you first take a general purpose model and further train it against data with a specific task in
mind. The purpose of this is not only to make your new model better suited for a particular task but also to optimise
the number of calculations that model has to do to achieve acceptable results. This is also where the style of prompting
will be set as you feed the model examples of the sort of questions and answers you want it to give.</p>
<p>There are further stages that can be applied including "alignment" where you ensure results are broadly in tune with the
values of the organisation. This is the reason the various chatbots around won't readily cough up the recipe to build
nukes or make it easier to explicitly break the law. This can be augmented with Reinforcement Learning from Human
Feedback (RLHF) which is practically the purpose of every <a href="https://en.wikipedia.org/wiki/CAPTCHA">CAPTCHA</a> you'll have
filled in over the last 25 years online.</p>
<p>Finally the model can be quantised to make it more manageable. This takes advantage of the fact that a lot of the
numbers will have a negligible effect on the result for a wide range of inputs. In those cases there is no point
storing them at full precision. As computation is a function of the number of bits of information being processed this
also reduces the cost of computation. While phones and other devices are increasingly including dedicated hardware to
process these models they are still constrained by physics - and the more you process the more heat you need to
dissipate, the more battery you use and the more bandwidth you consume. Obviously the more aggressively you quantise the
models the worse it will perform so there is an engineering trade off to make. Phones work best with multiple highly
tuned models solving specific tasks as efficiently as possible. Fully flexible models giving a
<a href="https://en.wikipedia.org/wiki/J.A.R.V.I.S.">J.A.R.V.I.S</a> like experience will probably always need to run in the cloud
where thermal management is simply an exercise in plumbing.</p>
<h1>Making magic numbers work for you</h1>
<p>Before we discuss using models I want to discuss 3 more concepts: "prompts", "context" and "hallucinations".</p>
<p>The prompt is the closest thing there is to "programming" the model. The prompt can be purely explicit or include other
inputs behind the scenes. For example the prompt can instruct the model to be friendly or terse, decorate code snippets
with markdown, make changes as diffs or in full functions. Generally the more explicit your prompt is about what you
want the better the result you get from the model. <a href="https://en.wikipedia.org/wiki/Prompt_engineering">Prompt
engineering</a> has the potential to be one of those newly created job
titles that will have to replace the jobs obsoleted by advancing AI. One of the ways to embed AI APIs into your app is
to create a task specific prompt that will be put in front of user input that guides the results to what you want.</p>
<p>The "context" is the rest of the input into the model. That could be the current conversation in a chat or the current
page of source code in a code editor. The larger the context the more reference the model has for its answer although
that does come at the cost of even more computation as the context makes for more input parameters into the model.</p>
<p>In a strong candidate for 2023's word of the year "hallucination" describes the quirky and sometime unsettling behaviour
of models outputting weird sometimes contradictory information. They will sincerely and confidently answer questions
with blatant lies or start <a href="https://www.theregister.com/2023/12/01/chatgpt_poetry_ai/">regurgitating training data</a> when
given certain prompts. It is a salient reminder that the statistical nature of these generative models will mean they
occasionally spout complete rubbish. They are also very prone to following the lead of their users - the longer you chat
with a model the more likely it is to end up agreeing with you.</p>
<p>So lets talk about what these models can and can't do. As a developer one of the areas I'm most interested in is their
ability to write code. Systems code especially is an exercise in precisely instructing a computer what to do in explicit
situations. I'd confidently predicted my job would be one of the last to succumb to the advance of AI as systems aren't
something you can get "mostly" right. It was quite a shock when I first saw quite how sophisticated the generated code
can be.</p>
<h2>Code Review</h2>
<p>One of the first things I asked ChatGPT to do was review a function I'd written. It managed to make 6 observations about
the code, 3 of which were actual logic problems I'd missed and 3 were general points about variable naming and
comments. The prompt is pretty important though. If not constrained to point out actual problems LLMs have a
tendency to spit out rather generic advice about writing clean well commented code.</p>
<p>They can be super useful when working with an unfamiliar language or framework. If you are having trouble getting
something to work it might be faster to ask an LLM how to fix your function than spending time reading multiple
<a href="https://stackoverflow.com/">StackOverflow</a> answers to figure out what you've misunderstood. If compiler errors are
confusing, supplying the message alongside the code can often be helpful in understanding what's going on.</p>
<h2>Writing Code</h2>
<p>However rather than just suggesting changes one very tempting use case is writing code from scratch based on a
description of what you want. Here the context is very important, the more detail you provide the better chance of
generating something useful. My experience has been that the solutions are usually fairly rudimentary and can often
benefit from a manual polishing step once you have something working.</p>
<p>For my <a href="https://www.bennee.com/~alex/presentations/kvm23-qemu-keynote.html">QEMU KVM Forum 2023 Keynote</a> I got ChatGPT
to write the first draft of a number of my data processing scripts. However it missed obvious optimisations by
repeatedly reading values inside inner loops that made the scripts slower than they needed to be.</p>
<p>If the task is a straight transformation they are very good. Ask an LLM to convert a function in one language into
another and it will do a pretty good job - and probably with less mistakes than your first attempt. However there are
limitations. For example I asked a model to convert some Aarch64 assembler into the equivalent 32 bit Arm assembler. It
did a very good job of the mechanical part of that but missed the subtle differences in how to set up the MMU. This
resulted in code which compiled but didn't work until debugged by a human who was paying close attention to the
architecture documentation as they went.</p>
<p>One of the jobs LLMs are very well suited for is writing code that matches an existing template. For example if you are
mechanically transforming a bunch of enums into a function to convert them to strings you need only do a few examples
before there is enough context for the LLM to reliably figure out what you are doing. LLMs are a lot more powerful than
a simple template expansion because you don't need to explicitly define a template first. The same is true of tasks like
generating test fixtures for your code.</p>
<p>There is a potential trap however with using LLMs to write code. As there is no source code and the proprietary models
are fairly cagey about exactly what data the models were trained on there are worries about them committing copyright
infringement. There are active debates ongoing in the open source community (e.g. <a href="https://lists.gnu.org/archive/html/qemu-devel/2023-11/msg05007.html" title="link to archive of discussion about LLM code generation">on
qemu-devel</a>) about the potential ramifications of a model regurgitating its training data. Without clarity on what
license that data has there is a risk of contaminating projects with code of unknown provenance. While I'm sure these
issues will be resolved in time it's certainly a problem you need to be cognisant of.</p>
<h2>Prose</h2>
<p>Writing prose is a much more natural problem territory for LLMs and an area where low-effort text generation will be
rapidly replaced by generative models like ChatGPT. "My" previous <a href="https://www.bennee.com/~alex/blog/2023/10/22/comparing-forge-based-and-email-based-workflow-for-open-source-projects/#comparing-forge-based-and-email-based-workflow-for-open-source-projects">blog
post</a>
was mostly written by ChatGPT based on a simple brief and a few requests for rewrites in a chat session. While it made
the process fairly quick the result comes across as a little bland and "off". I find there is a tendency for LLMs to
fall back on fairly obvious generalisations and erase any unique authorial voice there may have been.</p>
<p>However if you give enough structure it's very easy to get an LLM to expand a bullet list into more flowery prose.
They are more powerful when being fed a large piece of text and asked to summarise key information in a more accessible
way.</p>
<p>They are certainly an easy way to get a first-pass review of your writing, although I try to re-phrase things myself
rather than accept suggestions verbatim, to keep my own voice coming through the text.</p>
<h1>Final Thoughts</h1>
<p>The recent advances in LLMs and the public's exposure to popular tools like ChatGPT have certainly propelled the topic
of AI into the zeitgeist. While we are almost certainly approaching the "Peak of Inflated Expectations" stage of the <a href="https://en.wikipedia.org/wiki/Gartner_hype_cycle">hype
cycle</a>, they will undoubtedly be an important step on the road to the
eventual goal of <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">Artificial General Intelligence (AGI)</a>.
We are still a long way from being able to ask computers to solve complex problems the way they can in, for example,
Star Trek. However, in their current form they will certainly have a big impact on the way we work over the next decade
or so.</p>
<p>It's important that, as a society, we learn how these models are built, what their limitations are, and what their
computational cost and resulting environmental impact is. It will be a while before I'd want to trust a set of magic
numbers over a carefully developed algorithm to actuate the control surfaces on a plane I'm flying on. However they are
already well placed to help us learn new information through interactive questioning and to summarise random information
on the internet. We must learn to recognise when we've gone down a hallucinatory rabbit hole and verify what we've learned
with reference to trusted sources.</p>Sun, 10 Dec 2023 19:28:00 +0000KVM on Z: New Linux on IBM Z & LinuxONE Forum at Open Mainframe Projecthttps://kvmonz.blogspot.com/2023/12/new-linux-on-ibm-z-linuxone-forum-at.html
https://kvmonz.blogspot.com/2023/12/new-linux-on-ibm-z-linuxone-forum-at.html
<p>The Open Mainframe Project has launched a new forum dedicated to Linux on Z. It can be found <a href="https://community.openmainframeproject.org/c/linux-s390x/17">here</a>, and is intended to complement existing facilities like the <a href="https://www.vm.ibm.com/techinfo/listserv.html">mailing lists</a>
hosted at Marist College. Any topic around Linux on Z, including
virtualization as provided by z/VM and KVM, is fair game, and you may
use it to ask questions, share useful hints and tips, or simply have a
casual conversation about some aspect of the platform!</p>Fri, 08 Dec 2023 10:29:00 +0000KVM on Z: New Releases: RHEL 8.9 and RHEL 9.3 on IBM Z & LinuxONEhttps://kvmonz.blogspot.com/2023/12/new-releases-rhel-89-and-rhel-93-on-ibm.html
https://kvmonz.blogspot.com/2023/12/new-releases-rhel-89-and-rhel-93-on-ibm.html
<p>Both Red Hat Enterprise Linux 8.9 and 9.3 are out! See the press release <a href="https://www.redhat.com/en/about/press-releases/red-hat-launches-next-versions-worlds-leading-enterprise-linux-platform">here</a>, and Red Hat's blog entry <a href="https://www.redhat.com/en/blog/rhel-93-and-89">here</a>!</p><p>Both releases ship:<br /></p><ul><li>s390-tools v2.27 (renamed to <span>s390utils</span>)</li><li>smc-tools v1.8.2</li><li>openCryptoki v3.21</li></ul><p>Further information can be found in the release notes for <a href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/8.9_release_notes/index">RHEL 8.9</a> and <a href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/9.3_release_notes/index?extIdCarryOver=true&sc_cid=701f2000001OH6kAAG">RHEL 9.3</a>.</p>Fri, 01 Dec 2023 10:26:04 +0000Gerd Hoffmann: physical address space in qemuhttps://www.kraxel.org/blog/2023/12/qemu-phys-bits/
https://www.kraxel.org/blog/2023/12/qemu-phys-bits/
<p>
The physical address space is where all memory and most IO resources
are located: PCI memory bars, PCI MMIO bars, platform devices like the
lapic, io-apic, hpet, tpm, ...
</p>
<p>
On your Linux machine you can use <code>lscpu</code> to see the size
of the physical address space:
</p>
<pre>
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
^^^^^^^^^^^^^^^^
[ ... ]
</pre>
<p>
In <code>/proc/iomem</code> you can see how the address space is
used. Note that the actual addresses are only shown to root.
</p>
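<p>
For example (run as root so the real addresses are shown; the output below is
only illustrative and will look different on your machine):
</p>
<pre>
$ sudo grep -e "System RAM" -e "PCI Bus" /proc/iomem
00001000-0009ffff : System RAM
00100000-bffdffff : System RAM
c0000000-febfffff : PCI Bus 0000:00
[ ... ]
</pre>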
<h3>The physical address space problem on x86_64</h3>
<p>
The very first x86_64 processor (AMD Opteron) shipped with a
physical address space of 40 bits (aka one TeraByte). So when qemu
added support for the (back then) new architecture, the qemu vcpu
likewise got 40 bits of physical address space, probably assuming
that this would be a safe baseline. It is still the default in qemu
(version 8.1 as of today) for backward compatibility reasons.
</p>
<p>
Enter Intel. The first 64-bit processors shipped by Intel featured
only 36 bits of physical address space. More recent Intel
processors have 39, 42 or more physical address bits. The problem is that
this limit applies not only to the real physical address space, but
also to Extended Page Tables (EPT), which means the physical
address space of virtual machines is limited too.
</p>
<p>
So, the problem is the virtual machine firmware does not know how
much physical address space it actually has. When checking CPUID it
gets back 40 bits, but it could very well be that it actually has only
36 bits.
</p>
<h3>Traditional firmware behavior</h3>
<p>
To address that problem, the virtual machine firmware was very
conservative with address space usage, to avoid crossing the unknown
limit.
</p>
<p>
OVMF used to have an MMIO window with a fixed size (32GB), which was
based on the first multiple of 32GB after normal RAM. So a typical,
smallish virtual machine had 0 -> 32GB for RAM and 32GB -> 64GB for
IO, staying below the limit for 36 bits of physical address space
(which equals 64GB).
</p>
<p>
VMs having more than 30GB of RAM will need address space above 32GB
for RAM, which pushes the IO window above the 64GB limit. The
assumption that hosts which have enough physical memory to run such
big virtual machines also have a physical address space larger than
64GB seems to have worked well enough.
</p>
<p>
Nevertheless, the fixed 32GB IO window became increasingly
problematic. Memory sizes are growing, not only for main memory,
but also for device memory. GPUs have gigabytes of memory these
days.
</p>
<h3 id="qemucfg">Config options in qemu</h3>
<p>
Qemu has had three <code>-cpu</code> options to control the physical address
space advertised to the guest for quite a while already.
</p>
<dl>
<dt>host-phys-bits={on,off}</dt>
<dd>
When enabled, qemu will use the host's physical address bits for the
guest, i.e. the guest can see the actual limit. I recommend
enabling this everywhere.
<br />
Upstream default: <code>off</code> (except for <code>-cpu
host</code> where it is <code>on</code>).
<br />
Some downstream linux distro builds flip this to <code>on</code>
by default.
</dd>
<dt>host-phys-bits-limit=<i>bits</i></dt>
<dd>
Is used only with <code>host-phys-bits=on</code>. Can be used to
reduce the number of physical address space bits communicated to
the guest. Useful for live migration compatibility in case your
machine cluster has machines with different physical address space
sizes.
</dd>
<dt>phys-bits=<i>bits</i></dt>
<dd>
Is used only with <code>host-phys-bits=off</code>. Can be used to
set the number of physical address space bits to any value you
want, including non-working values. Use this only if you know what you
are doing; it's easy to shoot yourself in the foot with this one.
</dd>
</dl>
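<p>
As a rough sketch (the machine type, CPU model and the trailing "..." are
placeholders rather than a complete working invocation), this is how these
options typically appear on the command line:
</p>
<pre>
# take the physical address bits from the host CPU (recommended)
qemu-system-x86_64 -M q35 -cpu host,host-phys-bits=on ...

# same, but never advertise more than 40 bits,
# e.g. for live migration compatibility in a mixed cluster
qemu-system-x86_64 -M q35 -cpu host,host-phys-bits=on,host-phys-bits-limit=40 ...

# force an explicit value, ignoring the host (use with care)
qemu-system-x86_64 -M q35 -cpu Skylake-Server,phys-bits=42 ...
</pre>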
<h3>Changes in OVMF</h3>
<p>
Recent OVMF versions (edk2-stable202211 and newer) try to figure out the
size of the physical address space using a heuristic: In case the
physical address space bits value received via CPUID is 40 or below
it is checked against known-good values, which are 36 and 39 for
Intel processors and 40 for AMD processors. If that check passes or
the number of bits is 41 or higher, OVMF assumes qemu is configured
with <code>host-phys-bits=on</code> and the value can be trusted.
</p>
<p>
In case there is no trustworthy phys-bits value OVMF will continue
with the traditional behavior described above.
</p>
<p>
In case OVMF trusts the phys-bits value it will apply some
OVMF-specific limitations before actually using it:
</p>
<ul>
<li>
The concept of virtual memory does not exist in UEFI, so the
firmware will identity-map everything. Without 5-level paging
(which is not yet supported in OVMF) at most 128TB (phys-bits=47)
can be identity-mapped, so OVMF cannot use more than that.
<br />
The actual limit is phys-bits=46 (64TB) for now due to older Linux
kernels (4.15) having problems if OVMF uses phys-bits=47.
</li>
<li>
In case gigabyte pages are not available OVMF will not use more
than phys-bits=40 (1TB). This avoids high memory usage and long
boot times due to OVMF creating lots of page tables for the
identity mapping.
</li>
</ul>
<p>
The final phys-bits value will be used to calculate the size of the
physical address space available. The 64-bit IO window will be
placed as high as possible, i.e. at the end of the physical address
space. The size of the IO window and also the size of the PCI
bridge windows (for prefetchable 64-bit bars) will be scaled up with
the physical address space, i.e. on machines with a larger physical
address space you will also get larger IO windows.
</p>
<h3>Changes in SeaBIOS</h3>
<p>
Starting with version <b>1.16.3</b> SeaBIOS uses a heuristic similar
to OVMF to figure out whether there is a trustworthy phys-bits value.
</p>
<p>
If that is the case SeaBIOS will enable the 64-bit IO window by
default and place it at the end of the address space like OVMF does.
SeaBIOS will also scale the size of the IO window with the size of
the address space.
</p>
<p>
Although the overall behavior is similar, there are some noteworthy
differences:
</p>
<ul>
<li>
SeaBIOS will not enable the 64-bit IO window in case there is no
RAM above 4G, for better compatibility with old -- possibly 32-bit
-- guests.
</li>
<li>
SeaBIOS will not enable the 64-bit IO window in case the CPU has
no support for long mode (i.e. it is a 32-bit processor), likewise
for better compatibility with old guests.
</li>
<li>
SeaBIOS will limit phys-bits to 46, similar to OVMF, likewise for
better compatibility with old guests. SeaBIOS does not use paging
though and does not care about support for gigabyte pages, it will
never limit phys-bits to 40.
</li>
<li>
SeaBIOS has a list of devices which will never be placed in the
64-bit IO window. This list includes devices where SeaBIOS
drivers must be able to access the PCI bars. SeaBIOS runs in
32-bit mode so these PCI bars must be mapped below 4GB.
</li>
</ul>
<h3>Changes in qemu</h3>
<p>
Starting with release 8.2 the firmware images bundled with upstream
qemu are new enough to include the OVMF and SeaBIOS changes
described above.
</p>
<h3>Live migration and changes in libvirt</h3>
<p>
The new firmware behavior triggered a few bugs elsewhere ...
</p>
<p>
When doing live migration the vcpu configuration on source and
target host must be identical. That includes the size of the
physical address space.
</p>
<p>
libvirt can calculate the cpu baseline for a given cluster,
i.e. create a vcpu configuration which is compatible with all
cluster hosts. That calculation did <b>not</b> include the size of
the physical address space though.
</p>
<p>
With the traditional, very conservative firmware behavior this bug
did not cause problems in practice, but with OVMF starting to use
the full physical address space, live migrations in heterogeneous
clusters started to fail because of that.
</p>
<p>
In libvirt 9.5.0 and newer this has been fixed.
</p>
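<p>
As a hedged sketch of what that looks like in practice (the guest name below
is a placeholder, and the exact element name and minimum libvirt version
should be double-checked against the libvirt domain XML documentation), the
physical address space size can also be pinned in the guest definition rather
than on the raw qemu command line:
</p>
<pre>
# edit the guest definition ...
virsh edit mydomain

# ... and inside the <cpu> element add something along these lines,
# which mirrors host-phys-bits=on on the qemu command line:
#
#   <cpu mode='host-passthrough'>
#     <maxphysaddr mode='passthrough'/>
#   </cpu>
</pre>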
<h3>Troubleshooting tips</h3>
<p>
In general, it is a good idea to set the <a href="https://www.kraxel.org/blog/2023/12/qemu-phys-bits/#qemucfg">qemu
config option</a> <code>host-phys-bits=on</code>.
</p>
<p>
In case guests can't deal with PCI bars being mapped at high
addresses the <code>host-phys-bits-limit=<i>bits</i></code> option
can be used to limit the address space usage. I'd suggest sticking
to values seen in actual processors, so 40 for AMD and 39 for Intel
are good candidates.
</p>
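<p>
A quick way to check what the guest actually ended up with is to run the same
<code>lscpu</code> check from the beginning of this post inside the virtual
machine (illustrative output; your numbers will differ):
</p>
<pre>
guest$ lscpu | grep "Address sizes"
Address sizes:       39 bits physical, 48 bits virtual
</pre>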
<p>
In case you are running 32-bit guests with a lot of memory (which btw
isn't a good idea performance-wise) you might need to turn off long
mode support to force the PCI bars to be mapped below 4G. This can
be done by simply using <code>qemu-system-i386</code> instead
of <code>qemu-system-x86_64</code>, or by explicitly
setting <code>lm=off</code> in the <code>-cpu</code> options.
</p>Thu, 30 Nov 2023 23:00:00 +0000Daniel Berrange: ANNOUNCE: libvirt-glib release 5.0.0https://www.berrange.com/posts/2023/11/30/announce-libvirt-glib-release-5-0-0/
https://www.berrange.com/posts/2023/11/30/announce-libvirt-glib-release-5-0-0/
<p>I am pleased to announce that a new release of the libvirt-glib package, version 5.0.0, is now available from</p>
<pre><a href="https://libvirt.org/sources/glib/">https://libvirt.org/sources/glib/</a>
</pre>
<p>The packages are GPG signed with</p>
<pre>Key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)
</pre>
<p>Changes in this release:</p>
<ul>
<li>Fix compatibility with libxml2 >= 2.12.0</li>
<li>Bump min libvirt version to 2.3.0</li>
<li>Bump min meson to 0.56.0</li>
<li>Require use of GCC >= 4.8 / Clang > 3.4 / Xcode Clang > 5.1</li>
<li>Mark USB disks as removable by default</li>
<li>Add support for audio device backend config</li>
<li>Add support for DBus graphics backend config</li>
<li>Add support for controlling firmware feature flags</li>
<li>Improve compiler flag handling in meson</li>
<li>Extend library version script handling to FreeBSD</li>
<li>Fix pointer sign issue in capabilities config API</li>
<li>Fix compat with gnome.mkenums() in Meson 0.60.0</li>
<li>Avoid compiler warnings from gi-ir-scanner generated code by not setting glib version constraints</li>
<li>Be more robust about NULL GError parameters</li>
<li>Disable unimportant cast alignment compiler warnings</li>
<li>Use ‘pragma once’ in all header files</li>
<li>Updated translations</li>
</ul>
<p>Thanks to everyone who contributed to this new release.</p>Thu, 30 Nov 2023 14:59:31 +0000