@Indanz Indanz commented Sep 26, 2024

Implementation of the FPU RFC.

Simplest version of semi-lazy FPU switching.

Still work in progress, but in good enough shape for wider review. For previous development, see here.

TODO:

  • Test virtualisation on AArch64, AArch32 and 32-bit x86 (x86_64 done).
  • Test and benchmark.

Possible Optimisations:

  • Try alternative domain switch variants.
  • Try switchFpuState with or without non-NULL ksActiveFPUState.
  • Try global ksFPUEnabled.

@Indanz Indanz added verification Needs formal verification input/change, or is motivated by verification hw-test sel4test hardware builds + runs for this PR labels Sep 26, 2024
@Indanz Indanz marked this pull request as draft September 26, 2024 18:04
@Indanz Indanz mentioned this pull request Sep 26, 2024
Indanz commented Sep 26, 2024

For some reason if (config_set(CONFIG_HAVE_FPU)) doesn't work to hide things. Easily fixed by creating an empty fpuRelease, but I wouldn't expect that would be necessary. Any suggestions?

Edit: I fixed it by using ifdef.

@Indanz Indanz self-assigned this Sep 26, 2024
@Indanz Indanz force-pushed the fpu branch 3 times, most recently from 4ce3c41 to 5873c16 Compare September 27, 2024 13:18
@Indanz Indanz requested a review from chrisguikema October 17, 2024 12:10
@Indanz Indanz force-pushed the fpu branch 2 times, most recently from a31dae8 to 8f3f1b2 Compare October 31, 2024 17:06
Indanz commented Nov 5, 2024

@chrisguikema If you have time, could you review and test the x86 VCPU changes on your VMMs? It would be good to know whether there is any performance degradation too. I don't expect any, but if there is, it's probably a bug.

I'm not sure whether the guest's requested CR0.TS is being preserved or not (because I clear TS when unconditionally enabling the FPU for the guest). Is the CR0 used by guests the same as the CR0 used by hosts, or does it have its own architectural copy? I guess it has its own copy because of VMX_GUEST_CR0, but I'm not sure what the interaction is with CR0.

Edit: Or was it @Furao who promised me this at the summit? I forgot.

@Indanz Indanz force-pushed the fpu branch 2 times, most recently from e0d3443 to ad1be9f Compare February 27, 2025 21:12
corlewis commented Mar 24, 2025

Just a couple of last changes that we'd like to make to this for verification. Would it be ok to move prepareSetDomain out of setDomain and into decodeDomainInvocation? That matches up better with how the specifications have been written and allows us to separate out the proofs about FPU and domain state.

Separately, we need to label enableFpu with MODIFIES: phantom_machine_state, as it is where we currently have the machine interface for the proofs, and we need to know that it only modifies state that is not in the model.

See corlewis@ef617bc for the versions of these changes that I've been working with.

Also, when you have a chance another rebase would be helpful, to get the most recent verification-visible changes.

With these we're able to complete the functional correctness proofs for AArch64 and expect the other architectures to be relatively straightforward (although keeping in mind that the verified x86 configuration does not have hypervisor support so we won't have to handle the complications there that you've been dealing with).

@corlewis
> Just a couple of last changes that we'd like to make to this for verification. Would it be ok to move prepareSetDomain out of setDomain and into decodeDomainInvocation? That matches up better with how the specifications have been written and allows us to separate out the proofs about FPU and domain state.

Oh, I've just realised that if we do make this change then for consistency we should probably wrap them up in a new invokeDomain function and call that from decodeDomainInvocation, like we do with all of the other decode functions.

Indanz commented Mar 27, 2025

Okay, done. Let me know if this is fine.

@corlewis
> Okay, done. Let me know if this is fine.

Sorry, I thought that the new invokeDomainSetSet function would be fine but it returning an exception causes a surprising amount of pain in the proofs. Do you have any objections to making it a void function, like in corlewis@c852075? With that I've been able to update all of the proofs. There is also a precedent for this pattern in the various invokeIRQHandler functions that can be seen at https://github.com/seL4/seL4/blob/master/src/object/interrupt.c#L86.

@Indanz
Copy link
Contributor Author

Indanz commented Mar 31, 2025

> Sorry, I thought that the new invokeDomainSetSet function would be fine but it returning an exception causes a surprising amount of pain in the proofs. Do you have any objections to making it a void function, like in corlewis@c852075?

That's surprising, as that's how all other invoke functions are in that file. Actually, looking at all invoke functions, only the IRQ ones are void, all others return exception_t. I have no objection to changing it, but it might be worth looking into why this one is so hard compared to the others? Perhaps you can simplify the proofs a lot if you change the other invoke functions to void too.

@corlewis
> That's surprising, as that's how all other invoke functions are in that file. Actually, looking at all invoke functions, only the IRQ ones are void, all others return exception_t. I have no objection to changing it, but it might be worth looking into why this one is so hard compared to the others? Perhaps you can simplify the proofs a lot if you change the other invoke functions to void too.

Honestly, in this case a fair bit of the pain is probably because I already had working proofs and I'm now trying to minimise the changes I'm making for invokeDomainSetSet. If we'd had this from the start I would have just copied the specifications and proofs of the other functions and it would presumably have been fine.

One thing when considering the other invoke functions is that some of them actually can be preempted and return an exception, so those ones can't be changed to void. Now that I've done this though I think that in the cases where they can't be preempted it actually is nicer to be void, but I don't think that it's enough of an improvement to be worth going through and updating them.

lsf37 commented Mar 31, 2025

> One thing when considering the other invoke functions is that some of them actually can be preempted and return an exception, so those ones can't be changed to void. Now that I've done this though I think that in the cases where they can't be preempted it actually is nicer to be void, but I don't think that it's enough of an improvement to be worth going through and updating them.

When there is a quiet period at some point, we should consider it, because it is one of the rare cases where the C type system actually does help a little bit on the C side (when it returns void we know there is no preemption). Axel had a pull request up a while ago to change some of the existing invocations, but I never got around to doing the update. For the same reasons we don't want to do them now :-), but not because it's a bad idea as such.

lsf37 commented Jul 23, 2025

Tests are looking good now, I'll rebase for the cmake restyling.

Indanz and others added 7 commits July 24, 2025 10:12
Aligning saves one cycle per LDP/STP instruction.

Signed-off-by: Indan Zupancic <[email protected]>
It was blindly copied from AArch32 and has never been correct.

Signed-off-by: Indan Zupancic <[email protected]>
Otherwise ksActiveFPUState may point to freed memory after a vcpu
with active FPU state gets deleted.

Signed-off-by: Indan Zupancic <[email protected]>
Save current FPU state on domain switch.

Signed-off-by: Indan Zupancic <[email protected]>
Co-authored-by: Corey Lewis <[email protected]>
Signed-off-by: Indan Zupancic <[email protected]>
Retain the ridiculous name to make clear which invocation is being
handled.

Rename tptr to tcb for consistency within the file.

We have a dom_t type, use it as early as possible.

Signed-off-by: Indan Zupancic <[email protected]>
Indanz and others added 16 commits July 24, 2025 10:20
Add flags to tcb_t and the seL4_TCBFlag_fpuDisabled flag.

Enums are signed, so make TCB flags word_t to keep them unsigned.

Signed-off-by: Indan Zupancic <[email protected]>
Signed-off-by: Corey Lewis <[email protected]>
Remove fault-based FPU state saving and loading.

Signed-off-by: Indan Zupancic <[email protected]>
Signed-off-by: Gerwin Klein <[email protected]>
On x86 both the VM monitor and the VM guest have their own FPU state.
The monitor's FPU is controlled with the seL4_TCBFlag_fpuDisabled
TCB flag. For the guest running on the VCPU, the FPU is always
enabled.

Signed-off-by: Indan Zupancic <[email protected]>
Trivial change, except for x86 virtualisation and bootup.

Normally the seL4_TCBFlag_fpuDisabled flag is used to decide whether
to enable or disable the FPU. However, with x86 virtualisation there
is only one TCB. Use the task flag for the host and always enable FPU
for the guest.

When either the host or the guest's FPU is loaded, ksCurFPUOwner will
point to our TCB. saveFpuState and loadFpuState check fpu_active and
do the right thing depending on the current FPU user. To make this
work it was necessary to pass tcb_t to saveFpuState and loadFpuState.
Use the idle thread to store the initial FPU state. x86KSnullFpuState
is kept to simplify verification.

However, when Arch_initFpu is called, the idle thread hasn't been
created yet, so move the initialisation to after create_idle_thread,
but before create_initial_thread, as that leads to Arch_initFpuContext
using x86KSnullFpuState.

Also initialise VCPU FPU registers correctly for x86, otherwise the
initial state is wrong and can't be loaded when XSAVES is enabled.

Signed-off-by: Indan Zupancic <[email protected]>
Add config choice and change the default from XSAVEOPT to XSAVE.

The first config choice is used as the default option. Only XSAVE
is guaranteed to always work, the others require newer CPUs.

Get rid of dubious FPU state headers, we don't need them:
- XCOMP_BV_COMPACTED_FORMAT is set by xsavec or xsaves.
- We can init MXCSR with the ldmxcsr instruction.

Only system state should be configured in IA32_XSS_MSR,
setting FPU user state bits causes an exception.

All memory should be zeroed already, no need to do it again.

See also issue seL4#179.

Signed-off-by: Indan Zupancic <[email protected]>
Unused on ARM; on RISC-V it is actually the final decision and not
a cache.

Signed-off-by: Indan Zupancic <[email protected]>
To ease verification.

Fairly trivial change, except for x86, of course.

Signed-off-by: Indan Zupancic <[email protected]>
It's not only used when deleting a thread any more.

Swapping fpuRelease and dissociateVCPUTCB makes no practical
difference as they are independent, but it simplifies
verification slightly.

Signed-off-by: Indan Zupancic <[email protected]>
Also release FPU when dissociating a vcpu, otherwise ksCurFPUOwner
may be wrong.

fpuRelease() sets ksCurFPUOwner to NULL; if this happens before
we return to the host, it may end up with the FPU disabled and no
FPU state loaded. Instead of hunting down and handling all obscure
corner cases where this might happen (dissociating VCPUs, cross-core
operations), just check for this in vcpu_fpu_to_host().

Signed-off-by: Indan Zupancic <[email protected]>
disableFpu() is always called at boot and calls one of those.

Signed-off-by: Indan Zupancic <[email protected]>
Signed-off-by: Indan Zupancic <[email protected]>
Signed-off-by: Corey Lewis <[email protected]>
On 32-bit ARM the fpexc system register is set by loadFpuState,
which includes the FPU enable/disable bit FPEXC_EN_BIT. This
register is part of the usercontext and needs to be initialised
correctly, otherwise the FPU will be disabled by loadFpuState.

Before, this bug was hidden because the FPU was enabled lazily
after a trap. This bug just caused one extra FPU trap at first
FPU use for each task: The first handleFPUFault would fail to
enable the FPU, causing another FPU trap when user space gets
restarted.

On the second FPU fault, switchLocalFpuOwner calls enableFpu first
and then calls saveFpuState because ksActiveFPUState is set to the
current task's FPU state. Then it gets saved with FPEXC_EN_BIT set
and the task can continue with the FPU actually enabled.

This also means that with the old code, the initial FPEXC state
of each task was equal to the previous active FPU task's.

Signed-off-by: Indan Zupancic <[email protected]>
According to Arm documentation, the exception-handling routine
needs to zero the following bits in FPEXC:

DEX, IDF, IXF, UFF, OFF, DZF and IOF

In seL4, the user space fault handler should do this, as the kernel
doesn't know when user space is done handling the FPU trap.

In case of virtualisation, the guest kernel should clear these bits.

Now that the seL4 kernel doesn't handle FPU traps itself any more,
user space can handle asynchronous traps if it wants to.

Signed-off-by: Indan Zupancic <[email protected]>
Signed-off-by: Indan Zupancic <[email protected]>
lsf37 commented Jul 24, 2025

Alright, things are finally passing. The link check is failing on the RFC link that will only exist once merged, and the zcu106 failure is unrelated.

Will go ahead with the merge!

@lsf37 lsf37 merged commit 22f816a into seL4:master Jul 24, 2025
85 of 110 checks passed