XTeic


Jan Oleksiewicz [email protected]
document version 0.36.3
extension status: unstable/PoC
This document is released under a Creative Commons Attribution 4.0 International License

revision history

preface

This document uses semantic versioning with respect to potential hardware designs. An assembly syntax change is a minor increment. Version 1.0.0 will be the first somewhat usable one. Changes in prior versions are not versioned properly and not tracked in the revision history. The major version number does not indicate freeze or ratification status.

The document is written in a way that minimizes duplication, as duplicated content is hard to maintain.

1. Introduction

Even though the current risc-v "privileged" architecture is great for general unix systems, it fails to meet many embedded and hard real time requirements.

Instead of adding more and more on top of layered legacy, which leads to silicon waste, let’s replace the entire volume II (aka riscv privileged) with a minimal yet efficient embedded architecture.

The goal is to achieve an interrupt architecture capable of predictable and fast control loops by providing minimal interrupt latency and jitter.
Single digit cycles of interrupt latency to actual code and true zero jitter are offered as options, so as not to burden minimal implementations.
By leveraging the general purpose computing capability of the risc-v architecture, we can avoid the need for separate cores (often with asymmetric architectures) to offload low priority tasks (communication, HMI etc).

The lack of many "legacy" functionalities allows reduction of silicon area, power, and verification costs.

1.1. prior art

A quick recap of what we already have available.

1.1.1. cortex-m NVIC

[13] The de facto established "industry standard" of efficient interrupt handling. Anyone complaining about risc-v likes and wants the NVIC.

The addition of trustzone in armv8m increases the interrupt latency/jitter due to the need to preserve and zero extra "unnecessary" registers (to prevent potential leaks).

1.1.2. CLIC

CLIC [CLIC] is the designated go-to for interrupt handling, meant to fulfill everyone’s needs.

Attempts to be a unix capable interrupt controller with horizontal nesting of U, S, H (so far only proposed) and M mode.

All used registers must be saved in software, and trampoline handlers need to save all ABI registers. If interrupts can be taken at multiple privilege modes, then each handler at a higher privilege has to swap the stack pointer (and interrupt level ??) with 2 additional CSR instructions per handler. (during vertical nesting those instructions just copy the rs1 operand)

Preemption is handled in software by a special CSR mechanism that requires extra boilerplate code in every interrupt handler, even in "inline" handlers.

It should be possible to make the highest priority inline handlers similar to legacy ones.

Trampoline handlers mimic the late arrival and tail chaining optimizations. Currently trampoline handlers cannot be used alongside "inline" handlers [50].

Introduces unavoidable jitter due to:

  • blocks of code executed with disabled interrupts (additive jitter)

  • late arrival handled through mnxti read (subtractive jitter of entry time)

  • tail chaining handled by another mnxti read (and extra branch) in epilogue

  • indirect jump instruction to actual code (branch prediction)

Assuming 1 cycle per instruction, listings 10.2 and 11.1 from the CLIC spec [CLIC] offer:

  • entry + 6 cycles of jitter from "inline" handlers.

  • entry + 7 + 16 cycles of jitter from "C-ABI" trampoline entry

  • 4 + exit or abs(entry - 7) cycles of jitter from "C-ABI" trampoline epilogue

Note
trampoline jitter can be reduced by 16 cycles of register stacking at the cost of late arrival handling
Note
according to [21], handler entry time is 6 cycles on sifive E2 and 10 cycles in E3/5.
Note
BTW, my prediction is that the "competitor A" will be able to do a "comparison against riscv" without resorting to FUD tactics, right after CLIC is ratified

Typical CLIC interrupt latency was measured at 33 (inline handler) and 42 (trampoline) cycles for CV32E40P [53].

1.1.3. CV32RT fastirq

CV32RT "fastirq" [53] extends CLIC by moving prologue handling entirely into the hardware as well as introducing background lazy stacking from a shadow register set.

The epilogue is still handled in software.

Tail chaining is supported by the emret instruction, but a late arrival (higher priority) has to wait for the background stacking to finish. As a consequence there is jitter equal to the stacking window.

1.1.4. emb-riscv

emb-riscv [1] is a clean sheet design that attempts to be a universal solution for every microcontroller. Designed with a strong focus on RTOS support.

Note
Currently development is stalled due to "not encouraging general interest"

Achieves lower interrupt latency by introducing an EABI with a reduced number of caller-saved registers. FP registers are handled by lazy stacking.

Many similarities with NVIC.

Mandates 4 64bit timers (even on RV32):

  • cycle counter

  • instret counter

  • system timer

  • rtc timer

1.1.5. CLINT

Attaches to generic interrupt scheme.

According to [CLINT], it provides a memory mapped interface for timers and IPI.

Note
The official CLINT is called ACLINT but doesn’t differ much from the CLINT in sifive documentation.

1.1.6. generic riscv interrupts as described in "privileged" volume II

Very often referred to as CLINT, e.g. [4].

Has an optional vectored mode which simply jumps to the position in the vector table.

Doesn’t provide any nesting other than privilege levels or complex boilerplate code that prevents retaking active interrupts. Registers and CSR state (fcsr etc.) have to be pushed by software before use.

1.1.7. PLIC/AIA

[5], [6]

A heavyweight frontend for delivering interrupts to multiple cores running a typical unix OS. Not suitable for microcontrollers.

claim/complete architecture

handlers stay very similar to generic case.

AIA adds another set of CSR registers available only through indirect access mechanism (by miselect and mireg CSRs).

1.1.8. CH32 PFIC

Proprietary design by WCH built on top of generic riscv privileged [28], [29], [30].

Introduces HW stacking and single cycle register shadowing (aka HPE). It is of course necessary to use a custom toolchain that implements a "proprietary" attribute: __attribute__((interrupt("WCH-Interrupt-fast")))

Note
without the prestacked annotation there will be no portable way of doing this without compilers built on custom patches. The naked handler + mret trick doesn’t work in llvm, and it should break in gcc anyway due to the eventual use of callee saved registers and stack.

Another feature is the "vector table free" interrupt mechanism that allows skipping the fetch from the vector table and jumping to the handler directly. It provides a significant improvement only when all registers are "stacked" by the shadow regfile (or not stacked at all).

The descriptions of a lot of functional behaviour feel like a copy-paste of risc-v privileged. Highly under/undocumented.
e.g. There is nothing about what happens to mepc, mcause or mstatus during nesting (especially on the "V2" core).
It is also unknown whether the ra register has an additional use (like saving mepc…) during interrupt entry/exit and therefore cannot be used immediately, as the currently implemented gcc attribute treats those functions the same way as regular ABI ones with an mret based return.
In line with average chinese documentation standards.

The vendor provided headers, of course, contain 46 instances of "NVIC" string and just 5 for "PFIC"

There is also an under/undocumented "EABI enable" bit in INTSYSCR on the "V2" core. Most probably it reduces the number of HW stacked registers to match the official EABI proposal [31].

QingKeV4 implements 3 shadow register sets (aka HPE), given to handlers on a first come, first served basis. The result is that only the 3 lowest level handlers can practically use shadow registers.

Note
suppressing dynamic nesting by HWSTKOVEN would cause priority inversion.

1.1.9. RNMI (aka returnable NMI)

[44] Adds another horizontal nesting level above the machine mode, that works very similarly to generic interrupts. Achieved by providing additional set of CSR registers as well as interrupt return instruction (mnret).

1.1.10. PicoRV32 interrupts

Note: The IRQ handling features in PicoRV32 do not follow the RISC-V Privileged ISA specification. Instead a small set of very simple custom instructions is used to implement IRQ handling with minimal hardware overhead.

The original author of PicoRV32 found riscv-privileged to be too heavy for a minimal core, and provided his own interrupt scheme [9].

Note
Minimal FPGA cores are a non-goal for XTeic

1.1.11. ti c2000 (main core)

Proprietary TI architecture [23] sporting an ancient looking accumulator-memory architecture (with 8 pointer registers), similar to the classic CISCs. An x86 of motor control and signal processing. FPU [24] is more RISC-ish with a bit of VLIW in some instructions.

Note
TI is very hesitant to release any general purpose benchmark scores (speed/size etc.) [25], [26], claiming that their architecture "is optimized for real world control applications". Those kinds of scores are also almost non-existent in independent sources.

According to [22], the core automatically saves some of the registers, and the rest must be pushed in software.
"High priority" interrupts can also save and restore all 8 floating point registers into shadow registers using special instructions.
There are also 5 (4 in prologue) de facto useless instructions for aligning the stack and setting "C28 modes".

To allow nesting of "low priority" interrupts, handlers must include extra boilerplate code to handle priority masking in software. (8 instructions in prologue, 3 in epilogue)

As a consequence there are 21 cycles of jitter (to HPI and other LPIs) and 43 (HPI) or 63 (LPI) cycles of interrupt latency in the worst case.

Use of the RPT instruction will introduce even more jitter and latency, as the sequence is uninterruptible and takes an arbitrary number of cycles to execute.

Note
ISR entry latency is 10 cycles due to 8 stage pipeline and automatically stacking 13 registers. [40] suggests that the latency is 14 cycles for internal signals. Which would further increase the worst case jitter and latencies.

1.1.12. ti c2000 CLA

CLA [51] is a separate coprocessor designated to offload the main core from control loop tasks, "freeing it to handle other tasks such as handling communication stacks".
Exactly those workloads are the general purpose tasks which the "c2000 architecture was not optimized for".

Offers fewer registers/instructions and lacks the TMU, so it’s not always faster than the main core.

Can be used as a true coprocessor for delegating certain tasks to it. According to [52] this mode of operation brings just a 12% improvement in the motor FOC current loop.

CLA tasks are uninterruptible. TI claims [14],[15],[27] that their task driven machine "reduces interrupt latency and jitter" compared to classic CPU even though it does exactly the opposite when there is more than one (async) interrupt to handle (which happens in [14] example)

1.1.13. Xh3irq

Xh3irq extension (as implemented by hazard3) [54] provides nested and vectored interrupt handling that is conceptually similar to CLIC (mnxti) trampoline.

Unlike CLIC, the dispatcher has to index the pointer array in software (by using the index from meinext).

The example handler implements only a jumptable, but it can easily be converted into a pointer table.

Access to configuration bits of all 512 inputs is performed by inline windowing of configuration CSRs, which is incompatible with zicsrind.

1.2. overview/discussion of some concepts/features

1.2.1. whole app must be doable in C/C++

In this case interrupts must always push all caller saved registers to be able to use functions without the __attribute__((interrupt*)) annotation, leading to ABIs with less caller saved registers.

It also requires a preinitialized table with pointers to startup code, sp, gp, and of course any other addition like the Zcmt JVT csr.

This table is also not necessarily smaller than software setup, e.g. sp can usually be set with a single lui instruction.
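The single lui claim can be checked with a small sketch. lui only fills bits 31:12 and zeroes the low 12 bits, so any stack top aligned to 4 KiB qualifies on RV32. The helper names below are hypothetical and only illustrate the arithmetic:

```c
#include <stdbool.h>
#include <stdint.h>

/* lui writes imm20 << 12 into rd, so the low 12 bits are always zero.
 * A value is loadable with one lui only if those bits are already zero. */
static inline bool loadable_with_single_lui(uint32_t value)
{
    return (value & 0xFFFu) == 0;
}

/* The 20-bit immediate that such a lui would carry. */
static inline uint32_t lui_imm20(uint32_t value)
{
    return value >> 12;
}
```

For example, a stack top of 0x20001000 needs only `lui sp, 0x20001`, while 0x20000FFC would need a second instruction.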

There is still a risk of corruption if the compiler decides to reorder something before initialization of .data/.bss sections.

Such startup code is also inefficient as it will have to obey the ABI (spill ra to stack) and compilers can’t optimize out link time symbols anyway. (even though some can be assumed to always be at certain addresses or offset from each other)

Of course I often find that there is a competition on who will make the worst startup code in assembly. So pure C/C++ startup code turns out to be "better" due to confirmation effect. But let’s have a look at my "combotablecrt" implementation [7] for stm32f030x4/6. Is your compiler able to do that?

There is also a case of interrupt handlers that are using only a few registers and don’t need to take latency of the whole ABI/EABI.

1.2.2. ABIs with less caller saved registers

The rationale of introducing ABIs with reduced number of caller saved registers is to reduce interrupt latency.

The major downside of such an approach is lowered overall performance and code density, which is highly disliked across the riscv community [10] and stalls development of such an (E)ABI.

I think for marketing reasons we should have the RISC-V EABI mimic the competitor ABI as closely as possible, and be available and supported by the tools, even if almost no-one should end up actually using it.

Zcmp[e] was also prepared for such fragmentation by reserving first 4 points in rlist for EABI, so the cores can implement UABI and EABI push/pop instructions at the same time. Those 4 points are, of course, supposed to handle 20 caller saved regs of EABI (probably with some reuse of few higher points).

It will also leave processors capable of stacking 2 registers per cycle underutilized during HW stacking, as the stacking time becomes shorter than the pipeline refill.

An alternative is to provide interrupts with de facto customizable ABIs by e.g. the prestacked annotation (to match the HW stackers) and handle the function call pressure by IPRA.

1.2.3. "you are better off with soft stacking in inline handlers"

aka generic riscv __attribute__((interrupt))

The major issue lies within the principles of hardware stackers.

When entering an interrupt handler, the core first fetches the entry from the vector table and then jumps to that address. Both of those fetches can hit a flash waitstate or a cache miss. During that operation the data bus remains idle, waiting for the first store instruction to be executed.

Those cycles can be used for "free" stacking of registers. If a higher number of registers is stacked, it can hide a bit of the jitter coming from cache misses or flash waitstates.

Even stacking by special push instructions (e.g. XTheadInt [12] or PUSHINT [11] and maybe subsets of those) won’t help much. Those start pushing only after the latency of the double (waitstated) miss has been taken.

The only situation where soft stacking yields better results is when the HW stacker has to push way more registers than are actually used.
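The overlap argument can be put into a first-order model. This is only a sketch: the struct, the function names and the idea that stacking overlaps perfectly with both fetches are assumptions for illustration, not measurements of any core.

```c
#include <stdint.h>

/* Hypothetical first-order model; every field is an assumed parameter. */
typedef struct {
    uint32_t vector_fetch;   /* cycles to fetch the vector table entry */
    uint32_t handler_fetch;  /* cycles to fetch the first handler instruction */
    uint32_t regs;           /* number of registers to preserve */
    uint32_t regs_per_cycle; /* stacking throughput, must be >= 1 */
} entry_model;

static uint32_t stack_cycles(const entry_model *m)
{
    return (m->regs + m->regs_per_cycle - 1) / m->regs_per_cycle;
}

/* HW stacker: pushes run on the idle data bus while both fetches are pending. */
uint32_t hw_stacked_entry(const entry_model *m)
{
    uint32_t fetch = m->vector_fetch + m->handler_fetch;
    uint32_t stack = stack_cycles(m);
    return fetch > stack ? fetch : stack;
}

/* Soft stacking: the first store can only issue after both fetches are done. */
uint32_t soft_stacked_entry(const entry_model *m)
{
    return m->vector_fetch + m->handler_fetch + stack_cycles(m);
}
```

With 3 + 3 fetch cycles and 8 registers at 1 register per cycle, the model gives 8 cycles for the HW stacker and 14 for soft stacking. Soft stacking only wins when it pushes far fewer registers, e.g. 2 registers in software (8 cycles) against 16 forced by the HW stacker (16 cycles).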

Note
Zcmp[e] doesn’t cover caller saved registers except ra.

1.2.4. EABI for RVE must be subset of RVI EABI.

To be able to call RVE only code from the RVI ABI.
A recurring thing in RVE ABI proposals.

The idea is to allow compilers and software vendors to provide a single set of precompiled libraries for RVI and RVE ABIs.

The issue with this approach is that code arbitrarily compiled for RVE is likely to turn out less efficient than the RVI one. It also limits the capabilities of the RVI ABI, like trading off argument registers for temporary/saved ones.

1.2.5. one universal standard for everyone’s use cases

Having one universal solution for all possible scenarios brings a lot of inefficiency to all of them, due to mandatory support for a lot of rarely used functionality, keeping compatibility with unused legacy, or having to be a subset of a bigger architecture optimized for different use cases.

Even if that "flexibility" is made completely optional and non intrusive the vendors will implement it anyway for the sake of having the longest "flexibility" bar.

1.2.6. special handler return pattern

aka "HANDLER_RETURN" on emb-riscv and "EXC_RETURN" on ARM

The idea is to put a special pattern in ra during handler entry, and to exit by reusing the regular return mechanism provided by the ABI. Requires a certain memory area to be non executable (e.g. 0xF0000000 - 0xFFFFFFFF)

This mechanism follows the typical ABI function call and, together with HW stacking, allows the interrupt handlers to be regular C functions.

The downside is that ra and pc both have to be pushed onto the stack, and in some specific cases it could add extra stall cycles after the tail due to waitstates or a cache miss caused by delayed prefetch.

Alternatively we can just stack the ra and put there current pc with lowest bit set to trigger handler return operation. One less register counted towards interrupt latency.

Note
normally the jalr instruction just ignores the LSB of the resulting address. An LSB set in both the register and the immediate will lead to a "bogus" jump over 2 extra bytes. Even though this behaviour simplifies hardware, existing ABIs allow "auxiliary information" in pointers as well as in the jalr immediate, effectively making both useless.
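The jalr behaviour described in the note can be modelled in a few lines. The function name is hypothetical; the computation follows the base ISA definition (add the sign-extended 12-bit immediate to rs1, then clear bit 0):

```c
#include <stdint.h>

/* jalr target: (rs1 + sext(imm12)) with bit 0 of the sum cleared. */
uint32_t jalr_target(uint32_t rs1, int32_t imm12)
{
    return (rs1 + (uint32_t)imm12) & ~UINT32_C(1);
}
```

A set LSB in either operand alone is dropped, so 0x1001 + 0 and 0x1000 + 1 both land on 0x1000. When both are set, the carry survives the masking and 0x1001 + 1 lands on 0x1002, which is the "bogus" jump over 2 extra bytes.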

1.2.7. vector tables that are jumped to

It’s simply inefficient in a truly vectored scenario. The vector entries will have to be populated with jump instructions anyway. Those have to take a second round of waitstates or cache misses without amortization by register stacking.

And if the code is far away from the vector table (e.g. in SRAM for more deterministic execution), the compiler will have to emit a jump island, aka "veneer", that will perform yet another unamortized jump. Additionally, far jumps require a free register, which in a typical scenario requires pushing to the stack and returning to the veneer from the handler to handle the epilogue.

Allocating 8 bytes per entry, to allow a lui + jalr sequence, will severely hurt code density and performance in typical use scenarios.

Note
8051 allocated 8 bytes per entry, but it was sometimes able to fit an entire handler or one of the conditional paths, especially when the following entries were unused. This kind of optimization is exclusive to assembly programming and generally not practiced today.

1.2.8. MMIO vs CSR mapped config registers

In case of mass initialization MMIO could result in better code density. CSR space is also limited.

My take is that anything architecturally coupled to the core should reside in CSR space and keep the rest in MMIO.

Nothing should exist as both.

There is no point in avoiding CSR registers when the cost of Zicsr instructions is already taken.

1.2.9. "reduced/zero jitter"

Very often claimed, yet those claims rarely meet with reality.

Note
There are also many non-architectural sources of jitter like caches, waitstated flash, accessing peripherals in different clock domains (usually divided from sysclk), DMA contention, or just the code masking out the interrupts.

Cortex-m0 offers "zero jitter" through an optional IP (RTL for ASICs) configuration that extends the best case interrupt latency by an extra cycle to accommodate a random stall from bus contention.

Cortex-m3/4 offer up to 6 cycles of jitter due to "late arrival" and "pop pre-emption". Regular handler entry is dominated by stacking registers, giving some headroom for extra vector/instruction fetch latency.

Cortex-m7 of course suffers from Proprietary&Confidential syndrome. Most probably it’s similar to cm3/4.

In case of C2000 CLA, TI claims [14],[15],[27] that their task driven machine (non preemptible) "reduces interrupt latency and jitter" compared to classic CPU, even though it does exactly the opposite when there is more than 1 async interrupt to handle.

Note
Of course whenever TI compares CLA to "classic cpu", it’s always a cpu with preemption priorities only and background task not present on CLA. As if the similar "task machine" couldn’t be achieved by regular general purpose architecture (e.g. risc-v, cortex-m) without nesting and WFI loop (or "sleep on exit" feature) giving access to all GPRs in interrupts without stacking.

1.2.10. "everything will run Linux in future"

The Linux cargo cult.
Because the simplest tasks, suitable for a bunch of 555s and 74s or a simple microcontroller with a few KiB of flash and RAM, must be done under linux so they will somehow work "better".

To be able to properly run linux you need a quite beefy unit (usually with an MMU), 2-4MiB of flash, 4-8MiB of RAM (usually external DRAM), a long boot time and bad power consumption in idle.
Just to run the OS itself.

One of the most blatant examples is NOMMU linux on stm32f429 with memory mapped SDRAM that is not even cached by the cpu. If the XIP image doesn’t fit in the 2MiB internal flash, it has to land in external parallel NOR flash, which is of course not cached by the cpu and shares the bus with SDRAM.
Any attempt to touch internal SRAM regions will defeat the remaining "universality/portability of linux apps" arguments.
Not to mention a much higher unit price than typical 200+MHz cortex A5/7 SOCs.

Of course there are still actual reasons to use linux in non-realtime embedded, consisting of large collection of drivers for external devices, higher portability or access to the raw performance (at much better perf/price ratio) not available in typical microcontrollers [16].

1.2.10.1. RTLinux and hard-realtime

Whenever those rt patches are measured, both the interrupt latency and jitter are always given in tens or hundreds of microseconds, not cycles [17],[18],[19],[20].

In some scenarios those numbers are unacceptable.
As an example, industry standard FOC current loops close within 5-10us [35], and in some cases achieve sub 1us latency [34], on a <200 MHz core clock.
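To put microseconds and cycles on the same scale, a back of the envelope sketch (helper names are hypothetical, the arithmetic is just us × MHz = cycles):

```c
#include <stdbool.h>
#include <stdint.h>

/* Core cycles available within a time window: microseconds * MHz. */
static inline uint32_t cycle_budget(uint32_t window_us, uint32_t core_mhz)
{
    return window_us * core_mhz;
}

/* True if a given interrupt latency still leaves room inside the loop period. */
static inline bool latency_fits(uint32_t latency_us, uint32_t window_us)
{
    return latency_us < window_us;
}
```

A 10us loop at 200 MHz has a total budget of 2000 cycles, and a 1us loop just 200. An interrupt latency of, say, 20us (an illustrative value at the low end of "tens of microseconds") does not fit in either.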

1.2.11. lazy stacking

Lazy stacking allows skipping the stacking of FP registers if the handler doesn’t touch floating point registers.

The main issue is that all of the caller saved FP registers are saved onto the stack (execution stalls during the push) whenever an FP instruction is executed, even though only a few of the registers are used.

Requires an additional CSR to hold the address of the reserved space in the stack frame.

1.2.12. 64bit microcontrollers

So far, these are mostly application processors used in bare metal.

Use cases for those also have different requirements than typical 32bit microcontrollers.

1.3. required ABI

Ideally we should not change the established ABI, to avoid disruption. But we should definitely get rid of the tp register, which is overall useless.

1.3.1. stack alignment

should be 2x`XLEN`, mandated throughout entire program execution so as to not require special realignment in interrupts.

Note

psABI [33] says that:

stack pointer must remain aligned throughout procedure execution

and fails to enforce this anyway:

Non-standard ABI code must realign the stack pointer prior to invoking standard ABI procedures. The
operating system must realign the stack pointer prior to invoking a signal handler; hence, POSIX
signal handlers need not realign the stack pointer. In systems that service interrupts using the
interruptee’s stack, the interrupt service routine must realign the stack pointer if linked with any
code that uses a non-standard stack-alignment discipline, but need not realign the stack pointer if
all code adheres to the standard ABI

A major ilp32e issue is that the addi16sp instruction works on 16 byte stack increments. Once the c.addi range (-32..+31) is exhausted, compilers have to choose between denser code and more efficient use of stack.
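The tradeoff can be sketched as a pair of range checks. The function names are hypothetical; the ranges are those of the compressed encodings (c.addi: nonzero 6-bit signed immediate, c.addi16sp: nonzero multiple of 16 in -512..496):

```c
#include <stdbool.h>
#include <stdint.h>

/* c.addi sp, delta: nonzero 6-bit signed immediate. */
bool fits_c_addi(int32_t delta)
{
    return delta != 0 && delta >= -32 && delta <= 31;
}

/* c.addi16sp delta: nonzero, multiple of 16, within -512..496. */
bool fits_c_addi16sp(int32_t delta)
{
    return delta != 0 && delta % 16 == 0 && delta >= -512 && delta <= 496;
}

/* Stack bytes wasted when a frame is rounded up so that c.addi16sp applies. */
uint32_t addi16sp_padding(uint32_t frame_bytes)
{
    return (16u - frame_bytes % 16u) % 16u;
}
```

A 40 byte frame fits neither encoding, so the compiler either emits a 32 bit addi or pads the frame to 48 bytes and wastes 8 bytes of stack.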

The Zcmp extension was also designed for a 16 byte aligned stack. There is a Zcmpe extension postponed to the future which should handle the EABI. Lowering the stack alignment doubles (per bit of alignment) the codepoints wasted by push/pop instructions.

Note
addi8sp won’t be necessary as Zcmpe push/pop can prepare the initial 8 byte allocation for an (optionally) following addi16sp
Note
2x`XLEN` alignment allows more optimal use of microarchitectures capable of stacking 2 registers per cycle
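A minimal sketch of the 2x`XLEN` rule from the top of this section, assuming XLEN is given in bits and the stack grows downwards. All helper names are hypothetical:

```c
#include <stdint.h>

/* Stack alignment in bytes under a 2x XLEN rule (XLEN in bits). */
static inline uint32_t stack_align_bytes(uint32_t xlen_bits)
{
    return 2u * xlen_bits / 8u;
}

/* Round a stack pointer down to the given power-of-two alignment. */
static inline uintptr_t align_sp_down(uintptr_t sp, uint32_t align)
{
    return sp & ~((uintptr_t)align - 1u);
}

/* Frame size for n registers, rounded up so that sp stays aligned. */
static inline uint32_t frame_bytes(uint32_t n_regs, uint32_t xlen_bits)
{
    uint32_t align = stack_align_bytes(xlen_bits);
    uint32_t raw = n_regs * (xlen_bits / 8u);
    return (raw + align - 1u) & ~(align - 1u);
}
```

On RV32 this gives 8 byte alignment, so a frame of 3 registers is padded to 16 bytes, which is a whole number of register pairs for a stacker that pushes 2 registers per cycle.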

1.3.2. RVE

register  ABI name  Saver   description
x0        zero      -       Hardwired zero
x1        ra        caller  return address
x2        sp        callee  stack pointer
x3        gp        -       global pointer
x4        t0        caller  temporary
x5        t1        caller  temporary
x6        t2        caller  temporary
x7        t3        caller  temporary
x8        s0/fp     callee  saved/frame pointer
x9        s1        callee  saved
x10       a0        caller  argument/return
x11       a1        caller  argument/return
x12       a2        caller  argument
x13       a3        caller  argument
x14       a4        caller  argument
x15       a5        caller  argument
x16-x31   -         -       reserved for custom use

Note
ilp32e with tp turned into temporary.

1.3.3. RVI

register  ABI name  Saver   description
x0        zero      -       Hardwired zero
x1        ra        caller  return address
x2        sp        callee  stack pointer
x3        gp        -       global pointer
x4        t0        caller  temporary
x5        t1        caller  temporary
x6        t2        caller  temporary
x7        t3        caller  temporary
x8        s0/fp     callee  saved/frame pointer
x9        s1        callee  saved
x10       a0        caller  argument/return
x11       a1        caller  argument/return
x12-x17   a2-a7     caller  argument
x18-x27   s2-s11    callee  saved
x28-x31   t4-t7     caller  temporary

Note
ilp32 with tp turned into temporary.

1.4. debug

The official risc-v debug spec [45] is good enough to not necessitate another incompatible one, although the "minimal debug implementation" is actually not minimal.

Some of the minor things that could be "improved" for minimal implementations:

  • 1 entry progbuf accepting 32bit instructions only (saves 2 bits, currently must accept compressed insns)

  • writing this 1 entry progbuf immediately executes written instruction (ie. no storage in progbuf)

  • remove dpc CSR, and allow debuggers to get the "current" pc by executing auipc from progbuf

  • no mandatory abstract register reads (data exchange only through message registers)

  • get rid of certain discovery bits

  • etc.
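For the dpc replacement idea in the list above, the instruction a debugger would write into the 1 entry progbuf is easy to construct. A sketch, assuming the standard U-type encoding of auipc (opcode 0x17); the function name is hypothetical:

```c
#include <stdint.h>

/* U-type encoding of "auipc rd, imm20": imm[31:12] | rd[11:7] | opcode 0x17. */
uint32_t encode_auipc(uint32_t rd, uint32_t imm20)
{
    return ((imm20 & 0xFFFFFu) << 12) | ((rd & 0x1Fu) << 7) | 0x17u;
}
```

`encode_auipc(10, 0)` yields 0x00000517, i.e. `auipc a0, 0`, which leaves the pc of the executed instruction in a0. Whether that pc corresponds to the halted program depends on how the progbuf is executed, which is part of the proposal and not modelled here.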

Biggest offenders of course are and will be the actual implementations that despite being the "minimal" ones designated as "8bit killers", are happily implementing more than necessary. Like 8-word progbuf in ch32v003 [28].

1.4.1. DTM

Low pin count devices (8-32) need a denser debug interface as the JTAG uses too many wires.

There are industry proven 2 wire interfaces like cJTAG or ARM SWD.
It would be best to have 1 wire solution like avr8 debugWIRE/updi or the WCH "SDI" (aka "SWD") [46]

1.5. tooling issues to solve

1.5.1. prestacked annotation

Note
official RFC has been submitted here: riscv-non-isa/riscv-c-api-doc#53

Currently there is no universal solution to indicate which registers in interrupt handlers can be freely used without stacking them.

  • __attribute__((interrupt)) makes all registers callee saved and uses mret to return.

  • __attribute__((interrupt("SiFive-CLIC-preemptible"))) extends regular interrupt by CLIC preemption

  • __attribute__((interrupt("WCH-Interrupt-fast"))) requires a custom built toolchain, no floating point regs (even on the cores with F extension), still uses mret

  • Or just a plain C function that requires prestacking of all caller saved registers, reuses standard return mechanism to exit interrupt context

Even worse, there are already hardware stackers designed for ilp32e and ilp32. When a new and better ABI is introduced, it will be impossible to use with pre-existing HW stackers. The same applies to creating HW stackers that stack fewer registers to optimize interrupt latency.

Therefore we need a universal way to annotate which registers are available for use in a given function as de facto caller saved ones (aka create a custom calling convention)

  • prestacked("") attribute

  • no whitespace in the string parameter

  • a register range covers all registers between and including those specified (x4-x6 is equivalent to x4,x5,x6)

  • a register range must span at least 3 consecutive registers

  • registers/ranges are separated by commas

  • callee saved registers have to be properly turned into temporaries when included in the list

  • CSRs taking part in calling conventions are also subject to this mechanism

  • should use raw names instead of ABI mnemonics as to make it ABI agnostic (more portable)

  • registers must be sorted (integer, floating point, vector, custom, then by lowest numbered)

  • CSRs must be put after the architectural regfiles, those don’t have to be sorted

  • must not collide with __attribute__((interrupt)) as to support "legacy" handler return mechanisms

  • must not imply __attribute__((interrupt)) as well

  • custom CSRs would also have to be somehow covered. (hw loops etc.)

  • annotated functions should be callable by regular code

  • argument registers that are passed but not included in the list, can be assumed to be unmodified after return from an annotated function
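The list syntax rules above are simple enough to check mechanically. Below is a sketch of an expander for the integer register part only; it is not part of the RFC, the function names are hypothetical, and f registers, vector registers and CSRs are deliberately left out:

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Parses one raw register name "xN" (N = 0..31) and advances *p past it. */
static int parse_xreg(const char **p, unsigned *reg)
{
    const char *s = *p;
    char *end;
    unsigned long n;

    if (*s != 'x' || !isdigit((unsigned char)s[1]))
        return -1;
    n = strtoul(s + 1, &end, 10);
    if (n > 31)
        return -1;
    *reg = (unsigned)n;
    *p = end;
    return 0;
}

/* Expands e.g. "x5-x7,x10-x17" into a bitmask with bit n set for xn.
 * Returns 0 on success, -1 if the string breaks one of the rules. */
int prestacked_xmask(const char *list, uint32_t *mask)
{
    const char *p = list;
    uint32_t m = 0;
    int have_prev = 0;
    unsigned prev = 0;

    while (*p) {
        unsigned lo, hi;
        if (parse_xreg(&p, &lo) != 0)
            return -1;
        hi = lo;
        if (*p == '-') {
            p++;
            if (parse_xreg(&p, &hi) != 0)
                return -1;
            if (hi < lo + 2)          /* a range spans at least 3 registers */
                return -1;
        }
        if (have_prev && lo <= prev)  /* entries must be sorted, no overlap */
            return -1;
        for (unsigned r = lo; r <= hi; r++)
            m |= (uint32_t)1 << r;
        prev = hi;
        have_prev = 1;
        if (*p == ',') {
            p++;
            if (*p == '\0')           /* trailing comma */
                return -1;
        } else if (*p != '\0') {
            return -1;                /* whitespace or unknown character */
        }
    }
    *mask = m;
    return 0;
}
```

For "x5-x7,x10-x17,x28-x31" the resulting mask is 0xF003FCE0. A two register range like "x4-x5" or a list containing a space is rejected.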

ilp32 caller saved:

__attribute__((prestacked("x5-x7,x10-x17,x28-x31")))

ilp32f, caller saved:

__attribute__((prestacked("x5-x7,x10-x17,x28-x31,f0-f7,f10-f17,f28-f31,fcsr")))

preemptible CLIC irq with simplified ranges (e.g. shadow register file):

__attribute__((interrupt("CLIC-preemptible"), prestacked("x8-x15")))

TEIC irq, range0 + shadow regs of half integer regfile (where bit 2 of operand is set, covers range1+2) and F + P extensions:

__attribute__((prestacked("x4-x7,x10,x11,x12-x15,x20-x23,x28-x31,fcsr,vxsat")))

ch32v003 irq (ilp32e + PFIC HW stacker, assuming ra doesn’t have some undocumented use):

__attribute__((interrupt, prestacked("x1,x5-x7,x10-x15")))

Note
unannotated ra is assumed to be a valid return address, otherwise a special return mechanism must be used (e.g. return by mret in __attribute__((interrupt)))

1.5.1.1. optimization for noreturn functions

gcc/llvm compilers can purge the epilogue (even down the call tree) by automatic detection of an infinite loop or by using __attribute__((noreturn)) or __builtin_unreachable().

This is not the case for prologues though, leading to a waste of stack and codespace in the most typical embedded scenario of main or thread functions with an infinite loop.

This missing optimization is intentional [32], to allow backtracing (abort() etc.) and throwing exceptions (of course even under -fno-exceptions and in exception-less code).

By abusing the "prestacked annotation" we can get rid of this prologue by "prestacking" all of the available registers.
e.g. __attribute__((noreturn, prestacked("x1,x4-x31,f0-f31,fcsr")))

Note
addition of a noreturn_nobacktrace_noexcept attribute is very unlikely, optimizing the regular noreturn attribute even less so.
Note
__attribute__((naked)) won’t work, as it will remove the stack allocation and consequently underflow the stack.

1.5.1.2. functions with partially custom calling conventions

It can be additionally abused to:

  • define IPRA clobbers of assembly functions in their C function declarations (see applying IPRA to assembly functions)

  • certain (premature) optimizations (manually solving 2way IPRA recursion etc.)

  • dynamically linked functions with a subset of clobbers, e.g. functions like memcpy(), strcmp() etc. don’t need to clobber all caller saved registers, so only the common clobbers of the straightforward, unrolled (?) and vectorized implementations need to be applied. Requires standardization of canonical clobbers for each offending function. (quite unrealistic)

1.5.2. IPRA - Inter procedural register allocation

So far implemented only by llvm [8].
Limited to statically linked code.
There are almost no benchmark results, especially ones other than x86 at -O3.

In simple terms, it makes every function export information about its usage of caller saved registers, effectively allowing non leaf functions to use caller saved registers as callee saved ones. That avoids some of the stacking/spilling, leading to more efficient code.
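The exported information can be pictured as a clobber bitmask per function. The sketch below is a toy model and not how llvm represents it internally; the types and names are hypothetical and the call graph is assumed to be acyclic:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct func_info {
    uint32_t own_clobbers;                  /* bit n set: the body itself writes xn */
    const struct func_info *const *callees; /* direct callees, may be NULL */
    size_t n_callees;
} func_info;

/* Exported clobber set: own clobbers plus everything any callee may clobber.
 * Assumes there is no recursion in the call graph. */
uint32_t ipra_clobbers(const func_info *f)
{
    uint32_t mask = f->own_clobbers;
    for (size_t i = 0; i < f->n_callees; i++)
        mask |= ipra_clobbers(f->callees[i]);
    return mask;
}

/* A caller may keep a live value in xreg across a call to f,
 * without spilling, if f never touches that register. */
int survives_call(const func_info *f, unsigned reg)
{
    return (ipra_clobbers(f) & ((uint32_t)1 << reg)) == 0;
}
```

If a leaf writes only x10 and x11 and its caller additionally writes x5, a function calling that caller can keep a value in x6 (nominally caller saved) across the call as if it were callee saved.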

requirements and improvements needed for efficient IPRA:

  • this mechanism must cover the CSRs as well as the registers (e.g. fcsr, vtype, vl etc.)

  • custom registers and CSRs should also be covered (e.g. HW loops) (unnamed?)

  • compilers need to avoid using more registers than necessary (currently no reason)

  • registers from compressible range should be allocated only when it will benefit code density (currently no reason)

  • to avoid regressions, compilers need some kind of heuristic to detect when stacking certain (compressible) callee saved registers would yield better code density than using more temporaries from non compressible ranges

Note
on riscv it’s s0 and s1, in presence of Zcmp[e] pushing s0,s1 is free in non leaf functions, and just 2 16bit instructions in leaf. With IPRA it should be also possible to just move ra and s0/s1 into caller saved regs.
Note
This is also non IPRA optimization (-Oz kind)
Note

Automatic detection is not an option due to self constructed instructions (e.g. from [39]):

.word (0b0000000<<25)|(8<<20)|(0<<15)|(0b001<<12)|(10<<7)|0x43
.insn i CUSTOM_1, 0x0, 1, a0, 0x123
//equivalent to:
//tio.add0.xy a0, y0, s0
//tio.addi0.yx y1, a0, 0x123
  • precompiled libraries should also provide "IPRA exports"

  • very important point is resolving IPRA annotations of callbacks, where the callback call will use the smallest common regmask of all functions that can be called through this point

    • callbacks initialized once at startup (typical in many HALs)

    • callbacks passed as function parameters

    • queues (of structs) with callbacks

Note
callbacks are commonly used in peripheral interrupts, therefore it’s important to apply IPRA optimizations to those as well
  • it can be used to annotate that passed function arguments (through registers or stack) were not modified and can be recycled by caller (e.g. in loops)

  • it can also "export" list of deterministic constants (and addresses) that are left in registers after return

Note
This mechanism is portable to other architectures, the more caller saved registers are available, the higher relative gain is.
Note
vector extension can benefit from IPRA as current psABI makes all vector registers temporary, though the syscall destroys entire state
1.5.2.1. adjusting ipra wrt prestacked registers

Because the HW stackers (used with prestacked annotation) will prefer to stack out the compressible registers first, it might not be the best match for IPRA optimized allocation

Note
compilers usually don’t care about non-abi (interrupt) prologues/epilogues and emit code as if it was the regular ABI function

The solution could be:

  • optimize HW stacker for typical allocations

  • make compilers treat call trees growing from interrupt handlers specially

  • trump the general IPRA optimizations to use a0-a5 first

Handlers that do not call other functions should be straightforward, as long as the compiler allocators/optimizers are not going to outright ignore the prestacked annotation.

1.5.2.2. applying IPRA to assembly functions

Special attribute to annotate function declaration in header associated with assembly code (e.g. __attribute__((regmask("clobbered list here")))) was proposed [49], but it wasn’t implemented upstream.

The other option is to use inline asm clobbers to make calls to such functions

	__attribute__((always_inline))
	static inline int weird_call(int n, void* p)
	{
		register int result asm("a0") = n;
		register void* a1 asm("a1") = p;

		asm volatile(
			"call foo \n\t"
			: [ARG0] "+r" (result) // return in same register
			: [ARG1] "r" (a1)
			: "memory", "ra", "a2" // use clobber for any caller saved regs used
		);

		return result;
	}
  • requires the call pseudoinstruction that expands to a proper sequence. Otherwise we get errors when calling too far or missing optimization when short call can be made.

  • works in existing compilers (at least in gcc and llvm)

Another solution could be applying prestacked annotation e.g.

pure assembly function (FP compute kernel) using only subset of caller saved registers (a0 argument not modified):
__attribute__((prestacked("x5,x11-x15,f10-f13,v0,v1,v8-v31,fcsr,vl,vtype,vstart")))

Note
Both methods are insufficient for "annotating" unmodified stack arguments which are caller saved (documented on arm and defacto on risc-v)

2. XTeic (aka Total Embedded Interrupt Controller)

smallest profile?

machine mode only

RV32 only

2 or 4 interrupt nesting levels

little endian only; software shall assume little endian

2.1. implementation constants

name default value notes

TEIC_ENTRY_VECT_BASE

implementation specific

Base address of the first application entry point as well as its vector table. May have additional constraints on the alignment.

TEIC_EXEC_SRAM_BASE

implementation specific

Base address of the most designated executable SRAM memory. (Some devices implement a special memory area designated for interrupt handlers, aka "ITCM". Usually it will be the main memory address)

TEIC_MMIO_CTRL_BASE

0xFFFE0000

Base address of XTeic MMIO control block

TEIC_IRQ_NESTING_BITS

{0,1,2}

Number of implemented interrupt nesting priority bits

TEIC_IRQ_PRIORITY_BITS

{1,2,3,4}

Number of implemented interrupt sub-priority bits

TEIC_IRQ_VECT_ENTRIES

{9..1023}

Number of allocated interrupt entries including skipped ones and NMIs

TEIC_IRQ_VECT_ENTRY_SIZE

{2,4}

Size in bytes of the single entry in vector table. By default it’s 4. 2 if XTeicTinyIrqTable subextension is implemented.

2.2. startup behaviour

Upon hart reset:

  • all of the architectural registers are initialized to their reset state.

  • The MMIO control block registers are also initialized to their reset state.

  • The pc is set to the TEIC_ENTRY_VECT_BASE.

Performing the system reset will additionally initialize the state of the peripheral registers to their reset state.

The hart reset is always equivalent to a system reset until XTeicMP extension is implemented.

2.2.1. reset state of registers

The reset state of all architectural registers is undefined unless explicitly specified in specific extension.

Note
That means the reset state of integer, fp, and vector registers is undefined.
Note
some of the CSR registers also remain in undefined state.

2.2.2. bootloaders

If the application start is preceded by a bootloader, or the application enters the bootloader, then the switch code shall ensure that before redirecting execution to the target address:

  • all peripherals are disabled, or initialized to reset state if enabled on reset (e.g. watchdogs)

  • external GPIOs are configured to reset state

  • the oscillators, PLLs, clock selects and divisors are configured to their reset state

  • all nesting levels in teic_irq_msk are enabled

  • teic_irq_vect is set to the target entry point, right before the jump happens

Note
The rationale of these rules is to avoid bloat in startup code (and duplicate of it in SystemInit()), which is a result of assuming the worst case scenario
Note
bootloaders placed at application entry area (at TEIC_ENTRY_VECT_BASE) can be entered by setting a certain pattern in backup register and then executing system reset.
Note
Some devices switch between bootloader and application modes by performing whole system reset after modifying certain configuration registers (remap of executable area at TEIC_ENTRY_VECT_BASE)

2.3. interrupts

The interrupt controller supports only level triggered interrupts. The logical high is used to assert pending interrupt request lines.

The irq number is the position in vector table

Note
there is no irq offsetting like in NVIC

The stack pointer is not realigned. If the stack is not 8 byte aligned, the behaviour is implementation specified.

Note
typical HW won’t care about 4 byte stack, some dual issuers or hardened cores might want to set irqentryexit_unrec nmi request
Note
Zcmp similarly doesn’t specify the required alignment.

2.3.1. register ranges

Register ranges define which registers are pushed onto the stack on irq entry.

Adding a given range requires inclusion of all previous ranges.

The selection is implementation specific, fixed at silicon level. Shall not deviate from the predefined ranges.

Note
only highest nesting level has configurable stacking ranges.
range | registers added | stack area | mandatory implemented (all nesting)
0 | "x1,x10,x11,reserved" | XLEN * 4 | yes
1 | "x12-x15" | XLEN * 4 | yes
2 | "x4-x7" | XLEN * 4 | no
3 | "x16,x17,x28-x31" | XLEN * 6 | no

Note
Range 0+1 gives similar amount of usable registers as NVIC
stack frame pseudocode
// all ranges used
// range 0
sw x1, -4(sp)
sw x10, -8(sp)
sw x11, -12(sp)
sw reserved, -16(sp)

// range 1
sw x12, -20(sp)
sw x13, -24(sp)
sw x14, -28(sp)
sw x15, -32(sp)

// range 2
sw x4, -36(sp)
sw x5, -40(sp)
sw x6, -44(sp)
sw x7, -48(sp)

// range 3
sw x16, -52(sp)
sw x17, -56(sp)
sw x28, -60(sp)
sw x29, -64(sp)
sw x30, -68(sp)
sw x31, -72(sp)

addi sp, sp, -72
Note
reserved position in range0 window can be optionally used for preserving additional state during nesting
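
The frame size implied by the ranges above can be checked with a small host-side helper. This is only a sketch for RV32: the name teic_frame_bytes and the table teic_range_words are hypothetical, and the word counts are copied from the register range table (range 0 includes the reserved slot).

```c
#include <stdint.h>

/* Words pushed by each register range, taken from the table above.
 * Range 0 counts x1, x10, x11 and the reserved slot. */
static const uint8_t teic_range_words[4] = { 4, 4, 4, 6 };

/* Bytes stacked on RV32 when ranges 0..highest_range are in use.
 * highest_range uses the same numbering as the range column (0..3). */
static inline uint32_t teic_frame_bytes(unsigned highest_range)
{
    uint32_t words = 0;
    for (unsigned r = 0; r <= highest_range && r < 4; r++)
        words += teic_range_words[r];
    return words * 4u; /* 4 bytes per register on RV32 */
}
```

With all four ranges the result matches the final sp adjustment in the pseudocode above (72 bytes); ranges 0+1 alone give 32 bytes.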

2.3.2. interrupt entry

When a given interrupt nesting level (reflected by pending_nestx in teic_irq_status) becomes pending and is not masked out by the corresponding bit in the teic_irq_msk register, the interrupt entry procedure is triggered.

During the interrupt entry the hardware will:

  • stack the configured/implemented register ranges at the given nesting level (can be affected by n4_stacking)

  • decrement sp according to the largest configured/implemented register ranges

  • put the interrupted pc into the ra register with the lowest bit set

  • set the in_nestx bit in the teic_irq_status register

  • fetch the target address from the vector table pointed to by teic_irq_vect. The vector entry is selected by the handler dispatch process.

  • jump to the fetched address

Note
optimized microarchitectures will implement late arrival, tail chaining and pop preemption which further complicate entry/exit procedures

If an irq request is spuriously deasserted during the interrupt entry (or e.g. tail chaining), the core must either enter the offending handler or immediately return (or e.g. tail chain to yet another handler).

Note
Sometimes it takes a few cycles to deassert the irq request signal, after e.g. clearing a status flag. Behaviour must be deterministic, otherwise errata will follow.
2.3.2.1. handler dispatch

During the handler dispatch the hardware will evaluate all pending irq requests and select the one with highest configured sub-priority, ties are resolved by highest irq number.
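
The selection rule can be modelled in software as follows. This is a sketch, not the hardware algorithm: the type teic_irq_req and the function teic_dispatch are hypothetical names, a numerically larger sub_prio is assumed to mean a higher sub-priority, and all requests passed in are assumed to belong to the nesting level being entered.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t irqn;     /* position in the vector table */
    uint8_t  sub_prio; /* configured sub-priority (assumed: larger wins) */
    uint8_t  pending;  /* non-zero when the request line is asserted */
} teic_irq_req;

/* Returns the irq number to dispatch, or -1 if nothing is pending. */
static inline int teic_dispatch(const teic_irq_req *reqs, size_t count)
{
    int best = -1;
    uint8_t best_prio = 0;
    for (size_t i = 0; i < count; i++) {
        if (!reqs[i].pending)
            continue;
        /* Highest sub-priority wins, ties go to the highest irq number. */
        if (best < 0 || reqs[i].sub_prio > best_prio ||
            (reqs[i].sub_prio == best_prio && reqs[i].irqn > best)) {
            best = reqs[i].irqn;
            best_prio = reqs[i].sub_prio;
        }
    }
    return best;
}
```

For example, with irq 11 and irq 12 both pending at the same sub-priority, the model selects irq 12.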

2.3.3. interrupt exit

When jalr or cm.popret instruction is executed and the lowest bit in the source register is set (before calculating final target address), the interrupt exit procedure is triggered.
If no interrupt is currently active then irqretnest0_unrec nmi request is set.

During the interrupt exit the hardware will:

  • unstack the configured/implemented register ranges at the given nesting level (can be affected by n4_stacking)

  • increment sp according to the largest configured/implemented register ranges

  • clear the in_nestx bit in the teic_irq_status register

  • jump to the target address of the jalr or cm.popret instruction

Note
The bogus jalr target address issue remains as per unprivileged spec. Therefore conforming software shall not set the lsb in jalr immediate used for function returns
Note
only the lsb in source register is checked, not the computed target address of jalr instruction. It allows detection of irq ret condition earlier in the pipeline.
Note
optimized microarchitectures will implement late arrival, tail chaining and pop preemption which further complicate entry/exit procedures
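
The lsb convention used by entry and exit can be illustrated with a few host-side helpers. This is a sketch under stated assumptions: the function names are hypothetical, and only the behaviour described above is modelled (ra receives the interrupted pc with bit 0 set, the exit check looks at the source register only, and jalr clears bit 0 of the computed target as in the unprivileged spec).

```c
#include <stdbool.h>
#include <stdint.h>

/* Value written to ra on interrupt entry: interrupted pc with the lsb set. */
static inline uint32_t teic_entry_ra(uint32_t interrupted_pc)
{
    return interrupted_pc | 1u;
}

/* Exit detection uses the source register alone, before the immediate
 * is added, so it can be evaluated early in the pipeline. */
static inline bool teic_is_irq_return(uint32_t rs1_value)
{
    return (rs1_value & 1u) != 0;
}

/* jalr target as defined by the unprivileged spec: (rs1 + imm) with
 * bit 0 cleared. */
static inline uint32_t jalr_target(uint32_t rs1_value, int32_t imm)
{
    return (rs1_value + (uint32_t)imm) & ~1u;
}
```

A plain `ret` (jalr x0, 0(ra)) from a handler therefore lands back on the interrupted pc, while an ordinary function return with a clean ra never triggers the exit procedure.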

2.3.4. NMI interrupts

NMIs (non maskable interrupts) are a special type of interrupts that cannot be masked by teic_irq_msk register. Typically used for signalling critical conditions.

Entry/exit procedure is similar to regular IRQs with the following exceptions:

  • activity is signalled by in_nmi in teic_irq_status register

  • preserves at least range 0 registers, stacking ranges are implementation defined.

  • adjusts sp by stacked ranges

Note
typically NMIs will stack the same register ranges as regular interrupts

Before returning from NMI handler all requests in teic_nmi_cause CSR must be acknowledged (cleared).

2.3.4.1. NMI unrecoverable state

unrecoverable NMI handler is entered whenever:

  • any of the *_unrec requests is raised in teic_nmi_cause

  • synchronous exception is raised during active NMI handler

  • any of the synchronous exception flags (*_exc in teic_nmi_cause) is not cleared before performing interrupt exit from the NMI handler

  • *_async that was escalated to unrecoverable nmi request (escalated_async_unrec in teic_nmi_cause)

Entry procedure is similar to regular NMIs with the following exceptions:

  • activity is signalled by in_nmi_unrecoverable in teic_irq_status register

  • busfaults, alignment or other errors during stacking are ignored

  • not required to actually stack the registers; only ra shall be written with the pc at the time of the fault, and sp decremented by the range 0 area

2.3.4.2. NMI lockup state

The hart enters the NMI lockup state whenever

  • code attempts to return from Unrecoverable_NMI handler

  • synchronous or imprecise exception is raised within Unrecoverable_NMI handler

NMI lockup state halts any further code execution, except debug mode one.

Note
it is necessary to allow debuggers to read out state of registers/memory after experiencing lockup state.
Note
experiencing exceptions within (or return from) unrecoverable handler means a serious issue with control flow, where further attempts to execute code would do more harm than halting until watchdog performs system wide reset.
Note
lack of triple fault lockout can also lead to security vulnerabilities [43]
Note
microarchitectures can provide external output for signaling NMI lockup state as to allow immediate shutdown of certain peripherals (pwm timers etc.)
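
The escalation path from a recoverable NMI through the unrecoverable handler to lockup can be summarised as a small state machine. The sketch below is a simplification with hypothetical names: regular IRQ nesting is not modelled, asynchronous requests are left out, and every modelled event inside the unrecoverable handler is treated as leading to lockup.

```c
typedef enum { TEIC_RUN, TEIC_IN_NMI, TEIC_IN_UNREC, TEIC_LOCKUP } teic_nmi_state;

typedef enum {
    EV_SYNC_EXCEPTION,        /* synchronous exception raised */
    EV_UNREC_REQUEST,         /* any *_unrec request raised */
    EV_RETURN_FLAGS_CLEARED,  /* handler returns, all *_exc flags cleared */
    EV_RETURN_FLAGS_PENDING   /* handler returns with an *_exc flag still set */
} teic_nmi_event;

static inline teic_nmi_state teic_nmi_step(teic_nmi_state s, teic_nmi_event e)
{
    switch (s) {
    case TEIC_RUN:
        if (e == EV_SYNC_EXCEPTION) return TEIC_IN_NMI;
        if (e == EV_UNREC_REQUEST)  return TEIC_IN_UNREC;
        return TEIC_RUN; /* returns outside of NMIs are not modelled */
    case TEIC_IN_NMI:
        /* Only a clean return leaves the NMI handler normally. */
        if (e == EV_RETURN_FLAGS_CLEARED) return TEIC_RUN;
        return TEIC_IN_UNREC;
    case TEIC_IN_UNREC:
        /* Return attempts and further exceptions halt the hart. */
        return TEIC_LOCKUP;
    default:
        return TEIC_LOCKUP; /* only a reset leaves lockup */
    }
}
```

The model makes the "triple fault" shape visible: a fault in normal code, a second fault in the NMI handler, and a third in the unrecoverable handler end in lockup.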

2.3.5. vector table allocation

irq num type name notes

0

-

reserved

reserved for startup code (typically jump instruction)

1

NMI

reserved

2

NMI

IntegrityViolation_NMI

(optional) software and hardware integrity violations

3

NMI

ClockViolation_NMI

(optional) Lost clock or other anomaly. It should be assumed that the core/system clock could have been switched to a different one at this point.

4

NMI

WatchdogViolation_NMI

(optional) Entered right before any of the watchdogs trips and performs a (device) reset. Designated for safety measures and error logging. It should be assumed that execution could be frozen at this point and no further action can or need to be performed.

5

NMI

MemoryViolation_NMI

Bus or memory access fault

6

NMI

InstructionViolation_NMI

Illegal instruction exception

7

NMI

Unrecoverable_NMI

Nested nmi, unknown or a state that cannot be easily recovered from.

8

IRQ

Deffered0_IRQ

software deferred interrupt, can be used for context switch.

9

IRQ

reserved

reserved/systick???

10..1022

IRQ

*_IRQ

(optional) device specific interrupts

Unimplemented optional NMIs can be recycled for custom NMIs other than the ones provided in table above.

Note
XTeic doesn’t provide any peripheral API for optional watchdog, clock and integrity protection systems. It’s up to the implementer to provide them.
2.3.5.1. alternate vector table allocation

Alternate vector table allocation is designated for minimal implementations that do not make use of optional NMIs, but benefit from additional space savings.

Alternate vector table allocation is implementation defined. It’s neither discoverable nor configurable.

irq num type name notes

0

-

reserved

reserved for startup code (typically jump instruction)

1

NMI

HW_NMI

(optional) hardware related exceptions (watchdogs, ECC etc.)

2

NMI

SW_NMI

exceptions related to application execution on a given hart (illegal instr, integrity violations by sw etc.)

3

NMI

Unrecoverable_NMI

Nested nmi, unknown or a state that cannot be easily recovered from.

4

IRQ

Deffered0_IRQ

software deferred interrupt, can be used for context switch.

5

IRQ

reserved

reserved/systick???

6..1022

IRQ

*_IRQ

(optional) device specific interrupts

Note
Fragmentation is not a big deal, as all devices will be fragmented anyway by implementing their own layout of device specific IRQ handlers, which will be provided within startup files.

2.4. recycled volume II CSRs

To reduce disruption, some of the "privileged" CSRs have been recycled according to the "privileged" specification.

number name privilege description notes

0x001

fflags

URW

IEEE 754 exception flags

implemented when F,D,Zfinx,Zdinx is present

0x002

frm

URW

IEEE 754 dynamic rounding mode

implemented when F,D,Zfinx,Zdinx is present

0x003

fcsr

URW

frm+fflags

implemented when F,D,Zfinx,Zdinx is present

0xf11

mvendorid

MRO

vendor ID

jedec??

0xf12

marchid

MRO

architecture ID

0xf13

mimpid

MRO

implementation ID

0xf14

mhartid

MRO

hart ID

2.5. added instructions

2.5.1. wfi (Wait for interrupt)

Mnemonic
wfi
Encoding (RV32, RV64)
{reg:[
 { bits: 7, name: 0x73, attr: ['SYSTEM'] },
 { bits: 5, name: 0x0, attr: ['rd'] },
 { bits: 3, name: 0x0, attr: ['PRIV'] },
 { bits: 5, name: 0x0, attr: ['rs1'] },
 { bits: 12, name: 0x105, attr: ['WFI'] },
]}
Description

Execution of the wfi instruction stalls the execution and allows the core to enter various low power states until an interrupt is taken or any nesting level becomes pending.
It is allowed to terminate spontaneously or even be implemented as a nop.

In addition, the wfi instruction is allowed to optionally stack out certain registers ahead of the interrupts, to reduce their latency. In this case, sp is not changed until interrupt arrives.

Note
wfi can be executed when interrupts are disabled, which is a very common use case that avoids introducing non deterministic delays to event response time (i.e. irq arriving right before wfi).
Note
It is basically the same thing as the privileged wfi but without the configuration bits in privileged CSRs

2.5.2. teic.wfi.n4ign

Mnemonic
teic.wfi.n4ign
Encoding (RV32, RV64)
{reg:[
 { bits: 7, name: 0x73, attr: ['SYSTEM'] },
 { bits: 5, name: 0x0, attr: ['rd'] },
 { bits: 3, name: 0x0, attr: ['PRIV'] },
 { bits: 5, name: 0x0, attr: ['rs1'] },
 { bits: 12, name: 0x115, attr: ['WFI'] },
]}
Description

Similar to wfi instruction, but doesn’t have to terminate after executing interrupts at 4th nesting priority only. Shall terminate if any other nesting level was entered before returning from n4 irq. (i.e. tail chained to n3, then pop preempted back into n4)

If only single nesting priority is implemented (TEIC_IRQ_NESTING_BITS == 0) then this instruction behaves like a standard wfi.

Note
Designated to reduce wakeups caused by high frequency control loop interrupts that don’t need attention from rest of the application.
Note
A typical implementation would require additional hidden state to track whether an interrupt of lower nesting priority was entered.
Note
similarly to standard wfi it can terminate spontaneously so the additional functionality is optional

2.6. TEIC CSR map

number name privilege description

0xbc0

teic_irq_vect

MRW

interrupt vector table

0xbc1

teic_estate

MRW

irq saved state

0x800

teic_irq_msk

URW

interrupt mask

0x801

teic_irq_status

URO

current interrupt status

0xbc4

teic_nmi_cause

MRW

coarse mask of NMI causes

0xbc5

teic_cfg

MRW

config register

0xbc6

teic_sptlimit

MRW

added with XTeicStackLimit

0xbc7

teic_spmlimit

MRW

added with XTeicStackLimit&&XTeicRTOS

0xbc8

teic_swpspm

MRW

added with XTeicRTOS

2.6.1. teic_irq_vect

bit name type reset value description

[31:5]

vect_offset

WLRL

TEIC_ENTRY_VECT_BASE>>5

top bits of vector table offset.
Must be aligned to 64 bytes or rounded up to next power of 2, of the number of entries multiplied by the entry size, whichever is greater

[4:0]

reserved

WLRL

0

reserved

Note
alignment requirement allows to avoid use of the additional adder circuit during irq dispatch
Note
minimum alignment can be calculated with the following formula: max(64, pow(2, ceil(log2(TEIC_IRQ_VECT_ENTRIES))) * TEIC_IRQ_VECT_ENTRY_SIZE)
If the vector table consists of 100 entries in total, 4 bytes each, then the minimum required alignment is 512 bytes.
Note
vect_offset can be implemented with just enough bits to point at existing memory areas only, as to reduce necessary state to implement.
Note
Implementations may impose additional alignment requirement
Note
vect_offset can also be implemented as a read only constant pointing to the beginning of the flash memory
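
The alignment rule from the vect_offset description can be computed without floating point. The helper below is a sketch with a hypothetical name; it rounds the entry count up to the next power of two, multiplies by the entry size, and applies the 64 byte minimum stated in the field description.

```c
#include <stdint.h>

/* Minimum alignment in bytes of the vector table for a given
 * TEIC_IRQ_VECT_ENTRIES and TEIC_IRQ_VECT_ENTRY_SIZE (2 or 4). */
static inline uint32_t teic_min_vect_align(uint32_t entries, uint32_t entry_size)
{
    uint32_t pow2 = 1;
    while (pow2 < entries) /* round the entry count up to a power of two */
        pow2 <<= 1;
    uint32_t align = pow2 * entry_size;
    return align < 64u ? 64u : align; /* never below 64 bytes */
}
```

For the example in the note, 100 entries of 4 bytes round up to 128 entries, which gives 512 bytes. An implementation may still impose a stricter requirement.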

2.6.2. teic_estate

bit name type reset value description

[31:0]

estate_nl

WPRI

undefined

implementation specified pattern used to recover execution state upon interrupt return. Covers certain csr registers: (fcsr, vcsr, vstart etc.), and (optionally) multi cycle instruction progress. The content read as well as the write to this register is valid only at the lowest implemented nesting level. Otherwise read and write operations on this register are undefined.

Note
Although optional, the ability to interrupt multicycle instructions is especially important for cores implementing zero jitter features. As an example, the ratified Zcmp cm.popretz instruction has 3 uninterruptible instructions (one is a branch). (Even though it could be just 2, as zeroing a0 is restartable, the 3 instruction sequence will be formally pushed down your throats anyway)
Note
designated to allow an efficient context switch from the lowest priority interrupt
Note
As the risc-v doesn’t have condition codes for branching/predication, it is expected that the smallest implementations will not make use of estate register at all.
Note
due to maximum 5-level nesting and limited state to preserve, it was decided to not push previous state onto stack, that would increase interrupt latency.

2.6.3. teic_irq_msk

bit name type reset value description

[31:4]

reserved

WPRI

0

reserved

3

nest4

rw

1

Fourth nesting level
0: disabled
1: enabled

2

nest3

WARL

1

Third nesting level
0: disabled
1: enabled

1

nest2

WARL

1

Second nesting level
0: disabled
1: enabled

0

nest1

WARL

1

First nesting level
0: disabled
1: enabled

Disabling any nesting level shall take effect immediately before executing next instruction.

bits related to unimplemented nesting levels are hardwired to zero.

Note
only nest4 level is mandatory to implement
Note
TEIC_IRQ_NESTING_BITS == 1 implements nest2 and nest4 only
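
Since bits of unimplemented nesting levels are hardwired to zero, the value read back after a write depends on TEIC_IRQ_NESTING_BITS. The following sketch models that on the host. The macro and function names are hypothetical, and the assumption that two nesting bits implement all four levels is inferred from the notes above.

```c
#include <stdint.h>

/* Bit positions in teic_irq_msk, per the table above. */
#define TEIC_MSK_NEST1 (1u << 0)
#define TEIC_MSK_NEST2 (1u << 1)
#define TEIC_MSK_NEST3 (1u << 2)
#define TEIC_MSK_NEST4 (1u << 3)

/* Mask of implemented nesting level bits for a given
 * TEIC_IRQ_NESTING_BITS value (0, 1 or 2). */
static inline uint32_t teic_irq_msk_implemented(unsigned nesting_bits)
{
    switch (nesting_bits) {
    case 0:  return TEIC_MSK_NEST4;                  /* mandatory level only */
    case 1:  return TEIC_MSK_NEST2 | TEIC_MSK_NEST4; /* nest2 and nest4 */
    default: return TEIC_MSK_NEST1 | TEIC_MSK_NEST2 |
                    TEIC_MSK_NEST3 | TEIC_MSK_NEST4; /* assumed: all four */
    }
}

/* Value read back after writing `written`: unimplemented bits stay zero. */
static inline uint32_t teic_irq_msk_readback(unsigned nesting_bits, uint32_t written)
{
    return written & teic_irq_msk_implemented(nesting_bits);
}
```

Software can use the same idea at runtime to probe which levels exist: write all ones to the low four bits and inspect what reads back.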

2.6.4. teic_irq_status

bit name type reset value description

[31:12]

reserved

WPRI

0

reserved

11

n4_stacked

ro

0

(optional) signals that currently stacked registers cover only ranges configured for nest4 level.
It is used only when ranges configured by n123_stacking differs from n4_stacking.
If the interrupt handler is tailchained to lower nesting level then the core must stack the remaining ranges. Similarly the core can enter nest4 with n123 ranges stacked as well.
1: only nest4 ranges were stacked
0: all ranges stacked as per n123_stacking

10

nmi_lockup

ro

0

NMI lockup state, can be cleared only by hart/system reset
1: active
0: inactive

9

in_nmi_unrecoverable

ro

0

unrecoverable NMI handler state, can be cleared only by hart/system reset
1: active
0: inactive

8

in_nmi

ro

0

returnable NMI handler state
1: active
0: inactive

7

in_nest4

ro

0

irq handler at 4th nesting priority state
1: active
0: inactive

6

in_nest3

ro

0

irq handler at 3rd nesting priority state
1: active
0: inactive

5

in_nest2

ro

0

irq handler at 2nd nesting priority state
1: active
0: inactive

4

in_nest1

ro

0

irq handler at 1st nesting priority state
1: active
0: inactive

3

pending_nest4

ro

0

pending status of 4th nesting priority
1: active
0: inactive

2

pending_nest3

ro

0

pending status of 3rd nesting priority
1: active
0: inactive

1

pending_nest2

ro

0

pending status of 2nd nesting priority
1: active
0: inactive

0

pending_nest1

ro

0

pending status of 1st nesting priority
1: active
0: inactive

Note
nmi_lockup bit is defacto readable only by debugger
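
Decoding teic_irq_status in software could look like the sketch below. The helper names are hypothetical. It assumes that nest4 is the highest nesting priority, so when several in_nestx bits are set the highest numbered one is the handler currently executing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the currently executing nesting level (1..4), or 0 when no
 * regular irq handler is active. in_nest1..in_nest4 are bits 4..7. */
static inline unsigned teic_active_nest_level(uint32_t status)
{
    for (unsigned level = 4; level >= 1; level--)
        if (status & (1u << (3u + level)))
            return level;
    return 0;
}

/* True when a returnable (bit 8) or unrecoverable (bit 9) NMI handler
 * is active. */
static inline bool teic_in_any_nmi(uint32_t status)
{
    return (status & ((1u << 8) | (1u << 9))) != 0;
}
```

A status value of 0x30 (in_nest1 and in_nest2 set) therefore decodes as level 2 running on top of a preempted level 1 handler.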

2.6.5. teic_nmi_cause

bit name type reset value description

31

reserved

ro

0

30

irqretnest0_unrec

ro

0

irq return without active irq/nmi

29

irqentryexit_unrec

ro

0

any fault during irq entry/exit (stack alignment, memory faults etc.)

28

bus_fault_imprecise_unrec

ro

0

(optional) imprecise bus faults

27

hw_integrity_imprecise_unrec

ro

0

(optional) imprecise hw integrity error

26

sw_integrity_imprecise_unrec

ro

0

(optional) imprecise sw integrity error

25

nested_exc_unrec

ro

0

synchronous exception raised during execution of nmi handler

24

escalated_async_unrec

ro

0

(optional) escalated *_async requests

[23:10]

reserved

rw1c

0

reserved

9

clock_async

ro

0

(optional)

8

watchdog_async

ro

0

(optional)

7

reserved

ro

0

reserved

6

hw_integrity_async

ro

0

(optional) asynchronous integrity error not related to the architectural control flow (e.g. unrecoverable ECC error triggered by scrubber or speculative prefetch)

5

reserved

rw1c

0

reserved

4

sw_integrity_exc

rw1c

0

(optional) software related integrity exceptions
e.g. pmp, stacklimit or other control flow violations related to the software.

3

hw_integrity_exc

rw1c

0

(optional) hardware related integrity exceptions
e.g. ECC, parity, lockstep or other integrity error on core, memory or buses.

2

misaligned_address_exc

rw1c

0

(optional) misaligned load/store address

1

bus_fault_exc

rw1c

0

memory access faults

0

illegal_instruction_exc

rw1c

0

Illegal instruction exception and misaligned instr

The *_async nmi requests have to be cleared within the source peripheral.
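
The three groups of cause bits behave differently: *_exc flags are write-1-to-clear, *_async flags are cleared at the source peripheral, and *_unrec flags are read only. A host-side sketch of that behaviour follows. The macro and function names are hypothetical, and only the named fields from the table are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define TEIC_CAUSE_EXC_MASK   0x0000001Fu                        /* bits 4..0, rw1c */
#define TEIC_CAUSE_ASYNC_MASK ((1u << 9) | (1u << 8) | (1u << 6)) /* cleared in peripheral */
#define TEIC_CAUSE_UNREC_MASK 0x7F000000u                        /* bits 30..24, read only */

/* Register value after software writes `written`: only *_exc bits that
 * were written as 1 are cleared, everything else is unchanged. */
static inline uint32_t teic_nmi_cause_after_w1c(uint32_t cause, uint32_t written)
{
    return cause & ~(written & TEIC_CAUSE_EXC_MASK);
}

/* True when no synchronous exception flag is left set, which is the
 * condition checked on interrupt exit from the NMI handler. */
static inline bool teic_nmi_exc_flags_cleared(uint32_t cause)
{
    return (cause & TEIC_CAUSE_EXC_MASK) == 0;
}
```

A handler would typically read the register, service each set *_exc flag, write the same bits back to clear them, and only then return.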

2.6.6. teic_cfg

bit name type reset value description

[31:8]

reserved

WLRL

0

reserved

[7:6]

n4_stacking

WARL

implementation specific (highest implemented)

stacking ranges at 4th nesting level.
Cannot be set to higher ranges than implemented by lower nestings.
Must not be changed within interrupt handler, otherwise behaviour is undefined.
0b00: range 0
0b01: range 0, 1
0b10: range 0, 1, 2
0b11: range 0, 1, 2, 3

5

reserved

WARL

0

4

access_thread_regs_n1

WARL

0

(optional) Switches current (part of) register file to thread one if applicable.
It has effect only in interrupts at lowest implemented nesting priority.
Designated to allow context switching of threads in case of automatic irq shadow registers.
1: thread context remapped
0: no context remap

3

thread_enter

WARL

0

added with XTeicRTOS

2

escalate_async_nmi

WARL

0

(optional) if *_async nmi request is raised during active nmi, it will be escalated to unrecoverable nmi request (i.e. raises escalated_async_unrec nmi request)
1: enabled
0: disabled

1

sleeponexit

WARL

0

(optional)
1: enabled
0: disabled

0

zero_jitter

WARL

0

(optional) Ensure that the highest nesting priority interrupts are always entered within the same number of cycles regardless of the interrupted execution state.

Doesn’t affect tailchaining of handlers within the highest nesting priority, as well as irq return procedure. Various deep sleep states are also an exception.

It shall be assumed that irq vector table, highest level interrupt code and stack resides in zero waitstated memories and no HW measures will be implemented to adjust for a different scenario.
1: enabled
0: disabled

2.7. MMIO TEIC registers

private to the hart

offset from TEIC_MMIO_CTRL_BASE | entry size | name | non-native access | description
0x0 | 4 | teic_extra_cfg | no |
0x4 | 4 | teic_reset_req | no |
0x8 | 4 | teic_Deffered_pending | no |
0x10 | 4 | teic_Deffered_request | no |
0x20 | 4 | teic_irq_pending[32] | no |
0x40 | 4 | teicMP_irq_enable[32] | no | added with XTeicMP
0x400 | 1 | teic_prio_cfg[1023] | yes |

2.7.1. teic_extra_cfg

2.7.2. teic_reset_req

bit name type reset value description

[31:16]

reserved

rw

0

reserved

[15]

nmi_lockup_onreset

ro

dependent

1: nmi_lockup was active prior to reset
0: no nmi_lockup prior to reset

[14:11]

last_reset_cause

ro

dependent

0b0000: power on reset
0b0001: software reset
0b0010: watchdog reset
0b0011: external reset (master core, RST input pin etc.)
other: reserved

[10:3]

reset_key

wo

0

write of 0xC5 to this field performs system reset

[2:1]

reserved

rw

0

[0]

hart_only

rw

implementation specific

(optional) write 1 together with reset_key to reset only hart. If implementation allows only a hart reset, this field reads always 1, 0 otherwise

Note
[45] provides sysreset with excluded debug subsystem, in case of custom debug spec, it should at least provide its own config to exclude itself from reset
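
Composing and decoding teic_reset_req values is a matter of shifting the fields into place. The helpers below are a host-side sketch with hypothetical names, using only the field positions from the table above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TEIC_RESET_KEY 0xC5u /* value that triggers the reset, field [10:3] */

/* Value to write to teic_reset_req to request a reset.
 * hart_only selects the optional hart-only reset (bit 0). */
static inline uint32_t teic_reset_req_value(bool hart_only)
{
    return (TEIC_RESET_KEY << 3) | (hart_only ? 1u : 0u);
}

/* last_reset_cause, field [14:11]:
 * 0 power on, 1 software, 2 watchdog, 3 external. */
static inline unsigned teic_last_reset_cause(uint32_t reg)
{
    return (reg >> 11) & 0xFu;
}

/* nmi_lockup_onreset, bit 15. */
static inline bool teic_lockup_before_reset(uint32_t reg)
{
    return ((reg >> 15) & 1u) != 0;
}
```

On a target the composed value would be stored with a single 32 bit write to TEIC_MMIO_CTRL_BASE + 0x4, since the register map lists non-native access as unsupported for this register.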

2.7.3. teic_Deffered_pending

bit name type reset value description

[31:1]

deffered{i}_pending

rw1c

0

(optional) pending status of deffered1-deffered31 irq requests

[0]

deffered0_pending

rw1c

0

pending status of deffered0 irq request

2.7.4. teic_Deffered_request

bit name type reset value description

[31:1]

deffered{i}_req

w1s (wo)

undefined

(optional) write 1 to set deffered1-deffered31 irq requests

[0]

deffered0_req

w1s (wo)

undefined

write 1 to set deffered0 irq request

2.7.5. teic_irq_pending[32]

For each implemented irq vector, there is corresponding pending bit in pending register at teic_irq_pending[IRQn/32] position.

First 8 bit entries (corresponding to NMIs) are reserved.

bit name type reset value description

[31:0]

pending{i}_irq

ro

0

signals pending status of IRQn % 32 interrupt
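
The word and bit indexing described above can be sketched as follows. The function name is hypothetical, and the array parameter stands in for the memory mapped teic_irq_pending block.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the pending status of irq number irqn: word index is
 * irqn / 32, bit index within that word is irqn % 32. */
static inline bool teic_irq_is_pending(const volatile uint32_t pending[32],
                                       unsigned irqn)
{
    return ((pending[irqn / 32u] >> (irqn % 32u)) & 1u) != 0;
}
```

For instance, irq 40 is found at bit 8 of the second word.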

2.7.6. teic_prio_cfg[1023]

Consists of 1023 entries, 1 byte each. First 8 entries (corresponding to NMIs) are reserved.

For each implemented irq vector, there is corresponding priority config register at teic_prio_cfg[IRQn] position.

priority encoding
bit name type reset value description

[7:(8 - TEIC_IRQ_NESTING_BITS)]

nest_prio

rw

0

nesting priority bits

[(7 - TEIC_IRQ_NESTING_BITS):(8 - (TEIC_IRQ_NESTING_BITS + TEIC_IRQ_PRIORITY_BITS))]

sub_prio

rw

0

sub-priority bits

[(7 - (TEIC_IRQ_NESTING_BITS + TEIC_IRQ_PRIORITY_BITS)):0]

reserved

WLRL

0

reserved

Unimplemented bottom nesting bits are treated as if they were hardwired to 1. If only 1 bit is implemented then only nest2 and nest4 levels are possible.
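
A possible software view of the priority byte is sketched below. Both helpers are hypothetical and rest on two labelled assumptions: the fields are packed downward from the most significant bit of the 8 bit entry, and a two bit nesting value v selects level nest(v+1), which is consistent with nest2 and nest4 being the only levels when a single bit is implemented.

```c
#include <stdint.h>

/* Packs nest_prio and sub_prio into a teic_prio_cfg byte.
 * nesting_bits is 0..2 and priority_bits is 1..4, so at most 6 bits are
 * used and the remaining low bits stay zero (reserved). */
static inline uint8_t teic_prio_encode(unsigned nesting_bits, unsigned priority_bits,
                                       unsigned nest_prio, unsigned sub_prio)
{
    unsigned used = nesting_bits + priority_bits;
    unsigned nest = nest_prio & ((1u << nesting_bits) - 1u);
    unsigned sub  = sub_prio  & ((1u << priority_bits) - 1u);
    return (uint8_t)(((nest << priority_bits) | sub) << (8u - used));
}

/* Effective nesting level (1..4) for the implemented nest_prio bits.
 * Unimplemented bottom bits read as 1, as stated above. */
static inline unsigned teic_effective_nest_level(unsigned nesting_bits,
                                                 unsigned nest_prio)
{
    unsigned missing = 2u - nesting_bits;
    unsigned top     = nest_prio & ((1u << nesting_bits) - 1u);
    unsigned value2  = (top << missing) | ((1u << missing) - 1u);
    return value2 + 1u; /* assumption: value 0b11 maps to nest4 */
}
```

With no nesting bits implemented every interrupt lands in nest4, and with one bit implemented the two possible values map to nest2 and nest4.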

2.8. additional optional subextensions

2.8.1. XTeicMP

additional per vector entry interrupt enable

private to the hart

2.8.1.1. teicMP_irq_enable[32]

For each implemented irq vector, there is corresponding enable bit in "enable" register at teicMP_irq_enable[IRQn/32] position.

First 8 bit entries (corresponding to NMIs) are reserved.

bit name type reset value description

[31:0]

enable{i}_irq

WARL

0

enable control of IRQn % 32 interrupt
0: disabled
1: enabled

2.8.2. XTeicRTOS

Adds additional RTOS specific features

After thread mode (aka "user" or "unprivileged") is activated by thread_enter bit:

  • Current sp becomes a defacto thread stack

  • On irq entry from thread, current sp is swapped with the content of teic_swpspm register, which happens after stacking (registers are pushed to thread stack)

  • Thread mode protects only CSR registers, memory regions should be protected by additional PMP unit.

  • Interrupts always execute in machine mode.

2.8.2.1. thread_enter

bit in teic_cfg CSR

Setting this bit makes the hart enter thread mode (aka user mode in privileged nomenclature). Once set it cannot be cleared.

Must not be set within interrupt handler, otherwise behaviour is undefined.

Note
It is expected that startup code will turn itself into an idle thread after configuring everything in machine mode.
2.8.2.2. teic_swpspm

Holds the stack pointer to be swapped with sp when entering interrupt context.

Note
Separate interrupt stack allows thread stacks to allocate only the area for context switch storage in addition to its own usage (which can be statically analysed)

If access_thread_regs_n1 control bit is implemented, then it switches sp to thread stack as well.
When in effect, the teic_swpspm content is undefined. When another interrupt nests, it pushes registers onto the machine (interrupt) stack.

2.8.3. XTeicTinyIrqTable

Makes each address entry in the irq vector table take only 2 bytes. (TEIC_IRQ_VECT_ENTRY_SIZE == 2)

The effective address is constructed by concatenating the top 16 bits of the TEIC_ENTRY_VECT_BASE implementation constant with the 2 bytes of the vector entry content.

The TEIC_ENTRY_VECT_BASE must be 64KiB aligned.

The entry encoding with the least significant bit set is reserved.

Note
Extension designated for smallest devices where a vector table size has a significant code size impact.
Note
SRAM can be used for placing handlers if it is mapped within the same 64KiB block.

2.8.4. XTeicTinyIrqTableExt

Implies XTeicTinyIrqTable extension.

If the fetched vector entry has the lowest bit set, then the effective address is constructed by concatenating the top 16 bits of the TEIC_EXEC_SRAM_BASE implementation constant with the 2 bytes of the vector entry content.

The TEIC_EXEC_SRAM_BASE must be 64KiB aligned.
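
The address construction for 2 byte vector entries can be sketched as below. The two base values are made-up examples of 64KiB aligned constants, and the function name is hypothetical. Clearing the selector bit in the resulting address is an assumption of this sketch, since the text only describes the concatenation.

```c
#include <stdint.h>

/* Hypothetical implementation constants, both 64KiB aligned. */
#define TEIC_ENTRY_VECT_BASE UINT32_C(0x08000000)
#define TEIC_EXEC_SRAM_BASE  UINT32_C(0x20000000)

/* Build the handler address from a 16-bit vector table entry.
 * With XTeicTinyIrqTableExt, a set LSB selects the SRAM base. */
static inline uint32_t teic_tiny_vector_target(uint16_t entry)
{
    uint32_t base = (entry & 1u) ? TEIC_EXEC_SRAM_BASE : TEIC_ENTRY_VECT_BASE;

    /* Assumption: the LSB is only a selector and is cleared in the address. */
    return (base & UINT32_C(0xFFFF0000)) | (uint32_t)(entry & 0xFFFEu);
}
```

With these example constants, entry 0x0124 resolves to 0x08000124 and entry 0x0125 resolves to 0x20000124.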

Note
It is possible to implement this on devices with large flash memories and resort to compiler tricks, to keep handlers within 64KiB range. But the gains will be relatively low.

2.9. XTeicStackLimit

Provides additional CSR registers with stack address thresholds.

Throws the sw_integrity_exc exception when the sp (x2) register is written with a value lower than the one specified in the teic_sp*limit register.

Note
Local arrays can be created on the stack and then accessed through a pointer passed in a working register. Therefore the stack limit comparison must happen on writes to the sp register.

2.9.1. teic_sptlimit

Used for limiting sp when hart is in thread mode or thread_enter == 0.

bit | name | type | reset value | description
[31:3] | spt_limit | WLRL | 0 | top bits of bottom stack threshold, unsigned
[2:0] | reserved | WLRL | 0 | reserved

2.9.2. teic_spmlimit

available only with XTeicRTOS

Used for limiting sp when hart is in interrupt (machine) mode (thread_enter == 1).

bit | name | type | reset value | description
[31:3] | spm_limit | WLRL | 0 | top bits of bottom stack threshold, unsigned
[2:0] | reserved | WLRL | 0 | reserved
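
The limit check can be modelled on a host roughly as follows. This is a sketch, not the hardware definition: the struct, its `in_interrupt` flag and the function name are hypothetical, and the selection between the two limit registers follows the mode conditions given for each register above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the state that the check depends on. */
typedef struct {
    uint32_t teic_sptlimit; /* thread stack limit CSR */
    uint32_t teic_spmlimit; /* interrupt (machine) stack limit CSR, XTeicRTOS only */
    bool thread_enter;      /* thread mode has been activated */
    bool in_interrupt;      /* hart is currently executing a handler */
} teic_stacklimit_model_t;

/* Returns true when writing new_sp to sp would raise sw_integrity_exc. */
static inline bool teic_sp_write_faults(const teic_stacklimit_model_t *s, uint32_t new_sp)
{
    uint32_t csr = (s->thread_enter && s->in_interrupt) ? s->teic_spmlimit
                                                        : s->teic_sptlimit;
    uint32_t limit = csr & ~UINT32_C(7); /* bits [2:0] are reserved */

    return new_sp < limit;               /* unsigned comparison */
}
```

The comparison uses only bits [31:3] of the selected register, so the threshold always has 8 byte granularity.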

3. auxiliary extensions

Additional extensions that are useful additions to XTeic.

3.1. Xfenceiext

Because the J extension group is going to simply ignore the fact that the fence.i instruction allocates a whole 22.125 bits of opcode space, and introduce new instructions for operational subsets of fence.i (e.g. IMPORT.I) [38],[39], we don’t need to care about eventual sync with Zjid encodings.

The rationale is that fence.i encodes whole instruction side synchronization with an all zero immediate. Therefore we can drop all of the sync mechanisms by inverting their bits, leaving zero only the bit designated for the requested operation.

The uppermost 4 bits remain zero to allow enabling extra features not covered by fence.i.

3.1.1. teic.fence.ipipe

Flushes the pipeline and prefetch buffers before executing next instruction.
Encoded in bit 0 of fence.i immediate

Note
not suitable for synchronizing with architectural state modifications by CSR instructions, use teic.fence.icsrsync instead
Mnemonic
teic.fence.ipipe
Encoding (RV32, RV64)
{reg:[
 { bits: 7, name: 0xf, attr: ['MISC-MEM'] },
 { bits: 5, name: 0x0, attr: ['rd'] },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 0x0, attr: ['rs1'] },
 { bits: 12, name: 0x0fe, attr: ['imm'] },
]}

3.1.2. teic.fence.icsrsync

Ensures that the following instructions are executed after the architectural state change made by preceding CSR instructions (or equivalent) takes effect. Encoded in bit 1 of the fence.i immediate.

Note
In many cases CSR updates don’t require full pipeline flush, though it can be implemented as regular pipeline flush.
Note
necessary to sync e.g. irq vector table updates with respect to following (peripheral) MMIO accesses
Note
[41] does require fencing after an update of jvt and mtvec (even though jvt falls into the "program order" category).
Mnemonic
teic.fence.icsrsync
Encoding (RV32, RV64)
{reg:[
 { bits: 7, name: 0xf, attr: ['MISC-MEM'] },
 { bits: 5, name: 0x0, attr: ['rd'] },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 0x0, attr: ['rs1'] },
 { bits: 12, name: 0x0fd, attr: ['imm'] },
]}
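
The two immediates above follow directly from the inversion rule, which can be checked with a short sketch. The enum and function names are hypothetical; the field layout is the standard I-type layout shown in the encodings (opcode MISC-MEM = 0x0f, funct3 = 1, rd = rs1 = 0).

```c
#include <stdint.h>

/* Bit of the fence.i immediate designated for each operation. */
enum { XFENCE_BIT_IPIPE = 0, XFENCE_BIT_ICSRSYNC = 1 };

/* Lower 8 immediate bits are inverted (set), except the designated bit,
 * which stays zero. The uppermost 4 bits remain zero. */
static inline uint32_t xfenceiext_imm(unsigned op_bit)
{
    return UINT32_C(0x0FF) & ~(UINT32_C(1) << op_bit);
}

/* Assemble the full 32-bit instruction word. */
static inline uint32_t xfenceiext_insn(unsigned op_bit)
{
    const uint32_t opcode = 0x0Fu; /* MISC-MEM */
    const uint32_t rd = 0u, funct3 = 1u, rs1 = 0u;

    return (xfenceiext_imm(op_bit) << 20) | (rs1 << 15) | (funct3 << 12) |
           (rd << 7) | opcode;
}
```

This yields 0x0FE0100F for teic.fence.ipipe and 0x0FD0100F for teic.fence.icsrsync, while the plain fence.i word 0x0000100F keeps its all zero immediate.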

3.2. Xicsrmz

Implemented similarly to Zicsr with uimm=0 mapped into -1 constant.

Note
csrrsi/csrrci with uimm=0 still don’t write and don’t cause write side effects.
Note
This extension allows syncing the csrrwi instruction with some other extensions [39], so as not to cause additional immediate formats.
Note
csrrw rd, csr, x0 can still be used to write a zero into csr.

3.3. Xtolerantcsr

No CSR access shall raise an exception.

  • Writes to read only CSRs shall be ignored.

  • in machine mode access to unimplemented CSRs is undefined

  • in thread mode access to unimplemented CSRs as well as higher privilege ones shall cause no side effects, read a 0 value and have its write ignored

Note
The UNIMP instruction maps to a write into the cycle CSR, so it can no longer be used. c.unimp, which is encoded as all zeros, remains available.
Note
Extension designated for reduction of silicon use, reflects behaviour of certain privileged csr registers (e.g. misa, mvendorid etc.) when unimplemented

3.4. Xzcmpt

Implemented similarly to Zcmp but with an additional immediate bit to accommodate 8 byte aligned stacks, and with the following changes.

Note
addi8sp is not required, as the push instruction can prepare the initial allocation with 8 byte granularity.
rlist encoding
RV32E:
case 0: {reg_list="ra"; xreg_list="x1";}
case 1: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 2: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
case 3-15: reserved
RV32I:
case 0: {reg_list="ra"; xreg_list="x1";}
case 1: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 2: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
case 3: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
case 4: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
case 5: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
case 6: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
case 7: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
case 8: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
case 9: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
case 10: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
case 11: {reg_list="ra, s0-s10"; xreg_list="x1, x8-x9, x18-x26";}
case 12: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
case 13-15: reserved
stack_adj_base derivation from rlist
case 0..1:   stack_adj_base = 8
case 2..3:   stack_adj_base = 16
case 4..5:   stack_adj_base = 24
case 6..7:   stack_adj_base = 32
case 8..9:   stack_adj_base = 40
case 10..11: stack_adj_base = 48
case 12:     stack_adj_base = 56
case 13..15: reserved

Valid values:
case 0..1:   stack_adj = [ 8|16|24|32|40|48|56|64]
case 2..3:   stack_adj = [16|24|32|40|48|56|64|72]
case 4..5:   stack_adj = [24|32|40|48|56|64|72|80]
case 6..7:   stack_adj = [32|40|48|56|64|72|80|88]
case 8..9:   stack_adj = [40|48|56|64|72|80|88|96]
case 10..11: stack_adj = [48|56|64|72|80|88|96|104]
case 12:     stack_adj = [56|64|72|80|88|96|104|112]
case 13..15: reserved
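
The tables above follow a simple rule for RV32: the saved registers take 4 bytes each, rounded up to a multiple of 8, and the 3-bit spimm[5:3] field adds 0 to 7 extra 8 byte units. A minimal sketch, with hypothetical function names and covering only the RV32I rlist range (RV32E additionally reserves rlist 3 to 15):

```c
#include <stdint.h>

/* stack_adj_base in bytes for RV32, or 0 for a reserved rlist value. */
static inline uint32_t xzcmpt_stack_adj_base(uint32_t rlist)
{
    uint32_t nregs;

    if (rlist > 12u)
        return 0u;
    nregs = rlist + 1u;               /* ra plus rlist saved registers */
    return (nregs * 4u + 7u) & ~7u;   /* round up to 8 byte alignment */
}

/* stack_adj in bytes; spimm is the raw 3-bit field spimm[5:3] (0..7).
 * Returns 0 for reserved or out-of-range inputs. */
static inline uint32_t xzcmpt_stack_adj(uint32_t rlist, uint32_t spimm)
{
    uint32_t base = xzcmpt_stack_adj_base(rlist);

    if (base == 0u || spimm > 7u)
        return 0u;
    return base + 8u * spimm;
}
```

For instance, rlist 12 with spimm 7 gives the largest valid adjustment of 112 bytes.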
register stacking order

currently same as in Zcmp

3.4.1. teic.cm.push

Synopsis

Allocates stack frame and saves registers selected by rlist.

Mnemonic
teic.cm.push {reg_list}, -stack_adj
Encoding
{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  1, name: 'spimm[5]' },
 { bits:  2, name: 'rlist[1:0]' },
 { bits:  2, name: 'spimm[4:3]' },
 { bits:  2, name: 'rlist[3:2]' },
 { bits:  1, name: 0 },
 { bits:  2, name: 0x0 },
 { bits:  1, name: 0 },
 { bits:  3, name: 0x5, attr: ['C.FSDSP'] },
],config:{bits:16}}

3.4.2. teic.cm.pop

Synopsis

Deallocates stack frame and loads registers selected by rlist.

Mnemonic
teic.cm.pop {reg_list}, stack_adj
Encoding
{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  1, name: 'spimm[5]' },
 { bits:  2, name: 'rlist[1:0]' },
 { bits:  2, name: 'spimm[4:3]' },
 { bits:  2, name: 'rlist[3:2]' },
 { bits:  1, name: 1 },
 { bits:  2, name: 0x0 },
 { bits:  1, name: 0 },
 { bits:  3, name: 0x5, attr: ['C.FSDSP'] },
],config:{bits:16}}

3.4.3. teic.cm.popret

Synopsis

Deallocates stack frame, loads registers selected by rlist and returns.

Mnemonic
teic.cm.popret {reg_list}, stack_adj
Encoding
{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  1, name: 'spimm[5]' },
 { bits:  2, name: 'rlist[1:0]' },
 { bits:  2, name: 'spimm[4:3]' },
 { bits:  2, name: 'rlist[3:2]' },
 { bits:  1, name: 1 },
 { bits:  2, name: 0x0 },
 { bits:  1, name: 1 },
 { bits:  3, name: 0x5, attr: ['C.FSDSP'] },
],config:{bits:16}}
Description

The ra register may not be populated.

3.4.4. teic.cm.popretz

Synopsis

Deallocates stack frame, loads registers selected by rlist, writes zero to a0 and returns.

Mnemonic
teic.cm.popretz {reg_list}, stack_adj
Encoding
{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  1, name: 'spimm[5]' },
 { bits:  2, name: 'rlist[1:0]' },
 { bits:  2, name: 'spimm[4:3]' },
 { bits:  2, name: 'rlist[3:2]' },
 { bits:  1, name: 0 },
 { bits:  2, name: 0x0 },
 { bits:  1, name: 1 },
 { bits:  3, name: 0x5, attr: ['C.FSDSP'] },
],config:{bits:16}}
Description

The ra register may not be populated. Unlike in Zcmp the load to a0 is non atomic.
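
The four encodings differ only in bits 9 and 12, so one packing routine covers all of them. The sketch below uses hypothetical names and takes the raw 3-bit spimm[5:3] field (bit 2 of the argument is spimm[5], bits 1:0 are spimm[4:3]); it does not check for reserved rlist values.

```c
#include <stdint.h>

/* Operation selector: bit 0 goes to insn bit 9, bit 1 goes to insn bit 12. */
typedef enum {
    XZCMPT_PUSH    = 0, /* bit 12 = 0, bit 9 = 0 */
    XZCMPT_POP     = 1, /* bit 12 = 0, bit 9 = 1 */
    XZCMPT_POPRETZ = 2, /* bit 12 = 1, bit 9 = 0 */
    XZCMPT_POPRET  = 3  /* bit 12 = 1, bit 9 = 1 */
} xzcmpt_op_t;

/* Pack a 16-bit Xzcmpt instruction from its fields. */
static inline uint16_t xzcmpt_encode(xzcmpt_op_t op, unsigned rlist, unsigned spimm)
{
    unsigned a = (unsigned)op & 1u;
    unsigned b = ((unsigned)op >> 1) & 1u;

    return (uint16_t)(0x2u                        /* [1:0]   C2 quadrant */
        | (((spimm >> 2) & 1u) << 2)              /* [2]     spimm[5]    */
        | ((rlist & 3u) << 3)                     /* [4:3]   rlist[1:0]  */
        | ((spimm & 3u) << 5)                     /* [6:5]   spimm[4:3]  */
        | (((rlist >> 2) & 3u) << 7)              /* [8:7]   rlist[3:2]  */
        | (a << 9)                                /* [9]     op bit      */
        | (b << 12)                               /* [12]    op bit      */
        | (0x5u << 13));                          /* [15:13] C.FSDSP slot */
}
```

With rlist 0 and spimm 0 this gives 0xA002 for push, 0xA202 for pop, 0xB202 for popret and 0xB002 for popretz.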

3.4.5. todo: mva/mvs

those are quite annoying on rve

Appendix A: irq atomic block

mask out all interrupts
void foo()
{
	size_t tmp;
	asm volatile(
		"csrrci %[out], teic_irq_msk, 0b01111 \n\t"
		: [out] "=r" (tmp) :: "memory");
	//
	// execute code with irq disabled
	//
	asm volatile("csrw teic_irq_msk, %[in] \n\t" :: [in] "r" (tmp) : "memory");
}
mask out only nest1 level
void foo()
{
	size_t tmp;
	asm volatile(
		"csrrci %[out], teic_irq_msk, 0b00001 \n\t"
		: [out] "=r" (tmp) :: "memory");
	//
	// execute code with nest1 level irq disabled
	//
	asm volatile ("csrw teic_irq_msk, %[in] \n\t" :: [in] "r" (tmp) : "memory");
}

Appendix B: RTOS context switch

Appendix C: vendor software support packages

what headers, definitions, names etc. must be provided.

Appendix D: design decisions

D.1. no cause code

The cause code can be implied from the hardcoded vector table position, or from the peripherals’ state if the handler is shared. Therefore it’s redundant. The other issue is that it has to be somehow preserved during nesting.

Note
NMIs are handled through teic_nmi_cause CSR.

D.2. no single bit interrupt enable

It would be redundant to the irq_msk nest enables, which can be similarly managed by csrsi and csrci instructions.

D.3. no misa register

It’s useless.

Will it tell you whether Zbb, Zmmul or Zcmt is implemented? No.

On embedded targets, HW information about implemented extensions and ability to enable/disable them, has a rather low value.

D.4. stacking of floating point and vector registers

currently ???

Zfinx ???

Those can still be handled by IPRA anyway. FP push/pop instructions might be useful in such a case.

D.5. undefined initial state of architectural registers

It is said that registers have to be zeroed at reset "to protect software from itself" [36]. It doesn’t; it just hides bugs until they manifest in the worst possible scenario. Just like developing and debugging code at -O0.

This kind of use of uninitialized variables is UB in C/C++ and easily detectable by compilers. Languages like Rust or Ada are supposed to be free from this UB, so there is no need to spend transistors or code memory on zeroing those.

Note
V extension uses all ones for tail agnostic filling just to prevent software from relying on uarch dependent zeroing.

However, certain hardened cores may need to have all registers initialized to consistent state, as to avoid integrity faults when stacking out yet unused registers. In some cases, it’s still possible to require initialization of all registers in startup code instead.

D.6. little endian only

Why would you want to have big endian loads/stores?
Probably for handling tasks that compute "network byte order" data which uses big endian representation.

Nice. So, let’s add a big-endian mode (making it configurable at runtime of course), and enjoy mandatory endian neutral loads/stores ([37]) used by networking libraries, because one cannot be sure which endianness the code will be run on.

Just use rev8 for "network order" data. It’s much better than doing endian neutral access.
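
As an illustration of that approach, the sketch below reads a big endian field with a native load followed by a byte swap. The function names are hypothetical, `load_be32` assumes it runs on a little endian target, and whether the swap is lowered to a single rev8 depends on the compiler and on Zbb being enabled.

```c
#include <stdint.h>
#include <string.h>

/* Portable 32-bit byte swap; a compiler targeting Zbb may lower
 * this to one rev8 instruction (not guaranteed). */
static inline uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & UINT32_C(0x0000FF00)) |
           ((v << 8) & UINT32_C(0x00FF0000)) | (v << 24);
}

/* Read a "network order" (big endian) 32-bit field.
 * Assumes a little endian hart: native load, then swap. */
static inline uint32_t load_be32(const uint8_t *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v); /* native little endian load */
    return bswap32(v);
}
```

The swap is applied once at the protocol boundary, and the rest of the code works on native little endian values.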

Big endianness is also inefficient to handle in vector registers.

D.7. TEIC_MMIO_CTRL_BASE address selection

addressable through c.lui + offset

D.8. no csr scratch registers

Unlike the big unix machines, the RTOS context can be statically addressed by lui + addi sequence.

With hardware stacking there is no need to free up additional registers.

D.9. 1023 vector entries

One entry less than full 1024 due to 2s complement jump immediate.

This is the biggest capacity that can be escaped by a single c.j instruction from the first entry in the case of TEIC_IRQ_VECT_ENTRY_SIZE == 2 (XTeicTinyIrqTable).

This is also more than enough for any microcontroller.
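
The number follows from the c.j reach, as the arithmetic below shows. The names are hypothetical; the only external fact used is that c.j carries a signed 12-bit byte offset whose bit 0 is implicitly zero, giving a range of -2048 to +2046 bytes.

```c
#include <stdint.h>

#define CJ_OFFSET_BITS 12u /* signed byte offset, bit 0 always zero */

/* Largest forward byte offset reachable by c.j: 2^11 - 2. */
static inline uint32_t cj_max_forward_offset(void)
{
    return (UINT32_C(1) << (CJ_OFFSET_BITS - 1u)) - 2u;
}

/* Number of table entries a c.j placed in the first entry can jump over,
 * counting the entry that holds the c.j itself. Returns 0 for size 0. */
static inline uint32_t teic_max_escapable_entries(uint32_t entry_size)
{
    if (entry_size == 0u)
        return 0u;
    return cj_max_forward_offset() / entry_size;
}
```

With 2 byte entries, the +2046 byte target is exactly the first address after a table of 1023 entries.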

D.10. no per irq pending/enable in base extension

It is simply redundant to the in-peripheral enables, as well as the nestx interrupt enables.

It has a use case only when the same interrupts are routed to multiple harts or when peripheral interrupt lines are shared across multiple master units (e.g. FIFO empty irq signal shared with DMA).

D.11. no nmi/exception nesting

Nesting NMIs is an easy way to overflow the stack or greatly increase the worst case in static stack analysis (if there is even a bound).

It also becomes an issue in pure HW state preservation by estate_nl or shadow registers.

Normally such a condition is very rare and is usually a sign of bad coding or of a much more serious hardware issue that’s causing everything to fail at the same moment.

D.12. no software triggered interrupts

aka software trigger in ARM terminology [47]

Designated for triggering unallocated (or unused peripheral) vectors by writing to the special NVIC→STIR register, which is of course redundant to the use of NVIC→ISPRx registers.

However it’s rarely used and only "implemented" vectors can be triggered in such a way. Officially it is supposed to be 32 entry granularity in the ARM case, but it’s not even obvious whether you can use unimplemented vectors at all. [48]

Note
Even the PendSV is done by setting the ICSR→PENDSVSET bit instead of using this mechanism.
Note
TEIC instead provides a dedicated "peripheral" for handling software (deferred) interrupts.

All of this causes a lot of redundancy to allow handling peripheral interrupts and "software" triggered ones by the same handler. The ARM implementation also depends on the edge triggered irq mechanism, which is also omitted by XTeic.

D.13. no stack realignment upon interrupt entry/exit

This is just a waste of hardware. The ABI should mandate the alignment instead. If not followed then the microarchitecture should be allowed to trap.

Note
some architectures, due to legacy codebases, require explicit stack alignment instructions which also contribute to interrupt latency/jitter and impact code density.

D.14. "zero jitter" only in highest nesting level interrupts

It doesn’t make sense to implement "zero jitter" at any other level. If a given interrupt can be interrupted by a higher nesting priority, then it would no longer be considered a "zero jitter" one.

Note
NMIs can still break the "zero jitter" guarantee, though those should be considered as a rare fault/error condition.

D.15. only level triggered interrupts

Peripherals usually implement level triggered interrupts. (ie. require clearing trigger source by performing certain actions like reading FIFO registers or clearing the status flags)

Therefore it’s wasteful to spend additional resources (e.g. latch for pending status and related clear on irq entry) on the edge triggered mechanism, which is made redundant on every irq line (see D.12. no software triggered interrupts).

Note
Sampling edges on GPIO is usually done by a separate peripheral that turns those into level triggered ones.

D.16. no faulting addr register

aka mtval, which is often not implemented anyway, even by uarch without unaligned loads/stores support.

Due to the lack of MMU, the memory access exceptions are considered fatal errors anyway.

The faulting address can still be recovered in a more complex way, by decoding the faulting instruction.

D.17. no (default) "legacy" interrupt modes

Having our cores to boot with "legacy" interrupt modes

  • is a waste of transistors

  • it would require sync with the CLIC mode/submode encodings (or be incompatible with CLIC which is of course unwanted when lengthening the "flexibility" bar)

  • causes interrupt hole or additional boilerplate code to handle exceptions/NMIs that arrived before setting up mtvec and thus were routed to reset handler entry.

Note
There was even a CVE related to uninitialized mtvec: [43]

This also allows us to use a vector address with the two lowest bits zeroed, which, in some scenarios, allows setup of the vector table address with a single lui instruction.

Also, in cores designated to work in vectored mode, mtvec has the bottom address lines hardwired to 0, which leads to a large alignment granularity of the unvectored handler (e.g. on ch32v003 it’s 1KiB). The unvectored mode handler then either shares its entry with startup code or requires large alignment.

D.18. no sub-priority reflected in any status registers

Sub-priority is used only during irq handler dispatch. Current priority field would consume additional circuitry to latch in sub-priority of the current handler.

Additionally, the current sub-priority field would have to be somehow preserved during nesting.

D.19. only 4 irq nesting levels

It’s enough for a great majority of use cases, not to mention that a lot of applications would be fine with just 1 nesting level.

Adding more nesting levels will diminish the gains from tail chaining.

D.20. no syscall (in XTeicRTOS)

Problematic to properly implement.

Offers less separation of kernel structures from the thread (by MPU). Though cortex-m port of FreeRTOS uses it only to start a first thread.

D.21. no SEV/WFE

Most use cases are redundant to wfi. (e.g. SEVONPEND)

The SEV from irq method is rarely used and is supposed to reduce wakeups from high frequency interrupts which can be handled by teic.wfi.n4ign instead.

Bibliography