XTeic

Table of Contents

revision history
preface
1. Introduction
2. XTeic (aka Total Embedded Interrupt Controller)
3. auxiliary extensions
Appendix A: irq atomic block
Appendix B: RTOS context switch
Appendix C: vendor software support packages
Appendix D: design decisions
Bibliography

Jan Oleksiewicz jnk0le@hotmail.com
document version 0.36.3
extension status: unstable/PoC
This document is released under a Creative Commons Attribution 4.0 International License

revision history

preface

This document uses semantic versioning with respect to potential hardware designs. An assembly syntax change is a minor increment. Version 1.0.0 will be the first somewhat usable one. Changes in prior versions are not versioned properly and are not tracked in the revision history. The major version number does not indicate freeze or ratification status.

The document is written in a way that reduces duplication, as duplicated content is hard to maintain.

1. Introduction

Even though the current risc-v "privileged" architecture is great for general unix systems, it fails to meet many embedded and hard real time requirements.

Instead of adding more and more on top of layered legacy, which leads to silicon waste, let’s replace the entire volume II (aka riscv privileged) with a minimal yet efficient embedded architecture.

The goal is an interrupt architecture capable of predictable and fast control loops, achieved by providing minimal interrupt latency and jitter.
Single digit cycles of interrupt latency to actual code and true zero jitter are offered optionally, so as not to burden minimal implementations.
By leveraging the general purpose computing capability of the risc-v architecture, we can avoid the need for separate cores (often with asymmetric architectures) to offload low priority tasks (communication, HMI etc).

The lack of many "legacy" functionalities allows reduction of silicon area, power, and verification costs.

1.1. prior art

A quick recap of what we already have available.

1.1.1. cortex-m NVIC

[13] The de facto established "industry standard" of efficient interrupt handling. Anyone complaining about risc-v likes and wants the NVIC.

The addition of trustzone in armv8m increases the interrupt latency/jitter due to the need to preserve and zero extra "unnecessary" registers (to prevent potential leaks).

1.1.2. CLIC

CLIC [CLIC] is the designated go-to for interrupt handling, meant to fulfill everyone’s needs.

Attempts to be a unix capable interrupt controller with horizontal nesting of U, S, H (so far only proposed) and M mode.

All used registers must be saved in software; trampoline handlers need to save all ABI registers. If interrupts can be taken at multiple privilege modes, then each handler at a higher privilege has to swap the stack pointer (and interrupt level ??) using 2 additional CSR instructions per handler. (during vertical nesting those instructions just copy the rs1 operand)

Preemption is handled in software by a special CSR mechanism that requires extra boilerplate code in every interrupt handler, even in "inline" handlers.

It should be possible to make the highest priority inline handlers similar to legacy ones.

Trampoline handlers mimic the late arrival and tail chaining optimizations. Currently trampoline handlers cannot be used alongside "inline" handlers [50].

Introduces unavoidable jitter due to:

blocks of code executed with disabled interrupts (additive jitter)
late arrival handled through mnxti read (subtractive jitter of entry time)
tail chaining handled by another mnxti read (and extra branch) in epilogue
indirect jump instruction to actual code (branch prediction)

Assuming 1 cycle per instruction, listings 10.2 and 11.1 from the CLIC spec [CLIC] offer:

entry + 6 cycles of jitter from "inline" handlers.
entry + 7 + 16 cycles of jitter from "C-ABI" trampoline entry
4 + exit or abs(entry - 7) cycles of jitter from "C-ABI" trampoline epilogue

Note	trampoline jitter can be reduced by 16 cycles of register stacking at the cost of late arrival handling

Note	according to [21], handler entry time is 6 cycles on sifive E2 and 10 cycles in E3/5.

Note	BTW, my prediction is that the "competitor A" will be able to do a "comparison against riscv" without resorting to FUD tactics, right after CLIC is ratified

Typical CLIC interrupt latency was measured at 33 (inline handler) and 42 (trampoline) cycles for CV32E40P [53].

1.1.3. CV32RT fastirq

CV32RT "fastirq" [53] extends CLIC by moving prologue handling entirely into the hardware as well as introducing background lazy stacking from a shadow register set.

The epilogue is still handled in software.

Tail chaining is supported by emret instruction, but a late arrival (higher priority) will have to wait for the background stacking to finish. As a consequence there is a jitter equal to the stacking window.

1.1.4. emb-riscv

emb-riscv [1] is a clean sheet design that attempts to be a universal solution for every microcontroller. Designed with a strong focus on RTOS support.

Note	Currently development is stalled due to "not encouraging general interest"

Achieves lower interrupt latency by introducing an EABI with a reduced number of caller-saved registers. FP registers are handled by lazy stacking.

Many similarities with NVIC.

mandates 4 64bit timers (even on RV32):

cycle counter
instret counter
system timer
rtc timer

1.1.5. CLINT

Attaches to generic interrupt scheme.

According to [CLINT], it provides a memory mapped interface for timers and IPI.

Note	The official CLINT is called ACLINT, but it doesn’t differ much from the CLINT in sifive documentation.

1.1.6. generic riscv interrupts as described in "privileged" volume II

Very often referred to as CLINT, e.g. [4].

Has an optional vectored mode which simply jumps to the position in the vector table.

Doesn’t provide any nesting other than privilege levels or complex boilerplate code to disable retaking of active interrupts. Registers and CSR state (fcsr etc.) have to be pushed by software before use.

1.1.7. PLIC/AIA

[5], [6]

A heavyweight frontend for delivering interrupts to multiple cores running a typical unix OS. Not suitable for microcontrollers.

claim/complete architecture

handlers stay very similar to generic case.

AIA adds another set of CSR registers available only through indirect access mechanism (by miselect and mireg CSRs).

1.1.8. CH32 PFIC

Proprietary design by WCH built on top of generic riscv privileged [28], [29], [30].

Introduces HW stacking and single cycle register shadowing (aka HPE). It is of course necessary to use a custom toolchain that implements a "proprietary" attribute: __attribute__((interrupt("WCH-Interrupt-fast")))

Note	without the prestacked annotation there will be no portable way of doing this without compilers built with custom patches. The naked handler + mret trick doesn’t work in llvm, and it should break in gcc anyway due to eventual use of callee saved registers and stack.

Another feature is the "vector table free" interrupt mechanism that allows skipping the fetch from the vector table and jumping to the handler directly. It provides significant improvement only when all registers are "stacked" by a shadow regfile (or not stacked at all).

The descriptions of a lot of functional behaviour feel like a copy-paste of risc-v privileged. Highly under/undocumented.
e.g. There is nothing about what happens to mepc, mcause or mstatus during nesting (especially on "V2" core).
It is also unknown whether the ra register has an additional use (like saving mepc…) during interrupt entry/exit and therefore cannot be used immediately, as the currently implemented gcc attribute treats those functions the same way as regular ABI ones with an mret based return.
In line with average chinese documentation standards.

The vendor provided headers, of course, contain 46 instances of "NVIC" string and just 5 for "PFIC"

There is also an under/undocumented "EABI enable" bit in INTSYSCR on the "V2" core. Most probably it reduces the number of HW stacked registers to match the official EABI proposal [31].

QingKeV4 implements 3 shadow register sets (aka HPE), given to handlers on a first come, first served basis. The result is that only the 3 lowest level handlers can practically use shadow registers.

Note	suppressing dynamic nesting by `HWSTKOVEN` would cause priority inversion.

1.1.9. RNMI (aka returnable NMI)

[44] Adds another horizontal nesting level above machine mode that works very similarly to generic interrupts. Achieved by providing an additional set of CSR registers as well as an interrupt return instruction (mnret).

1.1.10. PicoRV32 interrupts

Note: The IRQ handling features in PicoRV32 do not follow the RISC-V Privileged ISA specification. Instead a small set of very simple custom instructions is used to implement IRQ handling with minimal hardware overhead.

The original author of PicoRV32 found riscv-privileged to be too heavy for a minimal core, and provided a custom interrupt scheme [9].

Note	Minimal FPGA cores are a non-goal for XTeic

1.1.11. ti c2000 (main core)

Proprietary TI architecture [23] sporting an ancient looking accumulator-memory architecture (with 8 pointer registers), similar to the classic CISCs. An x86 of motor control and signal processing. FPU [24] is more RISC-ish with a bit of VLIW in some instructions.

Note	TI is very hesitant to release any general purpose benchmark scores (speed/size etc.) [25], [26], claiming that their architecture "is optimized for real world control applications". Such scores are also almost non existent in independent sources.

According to [22], the core automatically saves some of the registers; the rest must be pushed in software.
"High priority" interrupts can also save and restore all 8 floating point registers into shadow registers using special instructions.
There are also 5 (4 in prologue) de facto useless instructions for aligning the stack and setting "C28 modes".

To allow nesting of "low priority" interrupts, handlers must include extra boilerplate code to handle priority masking in software. (8 instructions in prologue, 3 in epilogue)

As a consequence there are 21 cycles of jitter (to HPI and other LPIs) and 43 (HPI) or 63 (LPI) cycles of interrupt latency in the worst case.

Use of the RPT instruction will introduce even more jitter and latency, as the sequence is uninterruptible and takes an arbitrary number of cycles to execute.

Note	ISR entry latency is 10 cycles due to the 8 stage pipeline and automatic stacking of 13 registers. [40] suggests that the latency is 14 cycles for internal signals, which would further increase the worst case jitter and latencies.

1.1.12. ti c2000 CLA

CLA [51] is a separate coprocessor designated to offload control loop tasks from the main core, "freeing it to handle other tasks such as handling communication stacks".
Those are exactly the general purpose workloads for which "c2000 architecture was not optimized for".

Offers fewer registers/instructions and lacks the TMU, so it’s not always faster than the main core.

Can be used as a true coprocessor for delegation of certain tasks. According to [52] this mode of operation brings just a 12% improvement in the motor FOC current loop.

CLA tasks are uninterruptible. TI claims [14],[15],[27] that their task driven machine "reduces interrupt latency and jitter" compared to classic CPU even though it does exactly the opposite when there is more than one (async) interrupt to handle (which happens in [14] example)

1.1.13. Xh3irq

Xh3irq extension (as implemented by hazard3) [54] provides nested and vectored interrupt handling that is conceptually similar to CLIC (mnxti) trampoline.

Unlike CLIC, the dispatcher has to index the pointer array in software (by using the index from meinext).

The example handler implements only a jump table, but it can be easily converted into a pointer table.

Access to configuration bits of all 512 inputs is performed by inline windowing of configuration CSRs, which is incompatible with zicsrind.

1.2. overview/discussion of some concepts/features

1.2.1. whole app must be doable in C/C++

In this case interrupts must always push all caller saved registers to be able to use functions without the __attribute__((interrupt*)) annotation. This leads to ABIs with fewer caller saved registers.

It also requires a preinitialized table with pointers to startup code, sp, gp, and of course any other addition like the Zcmt JVT csr.

This table is also not necessarily smaller than software setup, e.g. sp can usually be set with a single lui instruction.
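A minimal sketch of why a single lui is usually enough: lui writes a 20 bit immediate into bits 31:12 and clears the low 12 bits, so any stack top aligned to 4 KiB qualifies. The helper names and the example address are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* lui rd, imm20 sets rd = imm20 << 12, so the low 12 bits must be zero. */
static inline bool fits_single_lui(uint32_t value)
{
    return (value & 0xFFFu) == 0;
}

/* Immediate field for "lui sp, imm20" (hypothetical helper). */
static inline uint32_t lui_imm20(uint32_t value)
{
    assert(fits_single_lui(value));
    return value >> 12;
}
```

For a made up SRAM end address of 0x20001000 this gives `lui sp, 0x20001`, while a stack top such as 0x20000FF0 would need an extra addi.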

There is still a risk of corruption if the compiler decides to reorder something before initialization of .data/.bss sections.

Such startup code is also inefficient as it will have to obey the ABI (spill ra to stack) and compilers can’t optimize out link time symbols anyway. (even though some can be assumed to always be at certain addresses or offset from each other)

Of course I often find that there is a competition on who will make the worst startup code in assembly. So pure C/C++ startup code turns out to be "better" due to confirmation effect. But let’s have a look at my "combotablecrt" implementation [7] for stm32f030x4/6. Is your compiler able to do that?

There is also a case of interrupt handlers that are using only a few registers and don’t need to take latency of the whole ABI/EABI.

1.2.2. ABIs with less caller saved registers

The rationale of introducing ABIs with reduced number of caller saved registers is to reduce interrupt latency.

The major downside of such an approach is lowered overall performance and code density, which is strongly disliked across the riscv community [10] and stalls development of such an (E)ABI.

I think for marketing reasons we should have the RISC-V EABI mimic the competitor ABI as closely as possible, and be available and supported by the tools, even if almost no-one should end up actually using it.

Zcmp[e] was also prepared for such fragmentation by reserving first 4 points in rlist for EABI, so the cores can implement UABI and EABI push/pop instructions at the same time. Those 4 points are, of course, supposed to handle 20 caller saved regs of EABI (probably with some reuse of few higher points).

It will also make the processors capable of stacking 2 registers per cycle, underutilized during HW stacking due to shorter stacking time than pipeline refill.

An alternative is to provide interrupts with de facto customizable ABIs by e.g. the prestacked annotation (to match the HW stackers) and handle the function call pressure by IPRA.

1.2.3. "you are better off with soft stacking in inline handlers"

aka generic riscv __attribute__((interrupt))

The major issue lies within the principles of hardware stackers.

When entering interrupt handler, the core first fetches the entry from vector table and then jumps to that address. Both of those fetches can hit a flash waitstate or a cache miss. During that operation the data bus remains idle waiting for a first store instruction to be executed.

Those cycles can be used for "free" stacking of registers. If a higher number of registers is stacked, it can hide a bit of jitter coming from cache misses or flash waitstates.

Even stacking by the special push instructions (e.g. XTheadInt [12] or PUSHINT [11] and maybe subsets of those) won’t help much. Those start pushing only after the latency of the double (waitstated) miss was taken.

The only situation when soft stacking yields better results is when the HW stacker has to push way more registers than are actually used.

Note	Zcmp[e] doesn’t cover caller saved registers except `ra`.
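The argument of this section can be put into a toy latency model. This is only a sketch under loud assumptions that are not taken from any core: one register is stacked per cycle, `fetch_cycles` lumps together the vector fetch and the first handler instruction fetch, and HW stacking overlaps perfectly with those fetches. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* HW stacker: pushes run on the idle data bus while the fetches are pending. */
static inline uint32_t hw_stacked_entry(uint32_t fetch_cycles, uint32_t hw_regs)
{
    assert(hw_regs <= 31u); /* x0 is never stacked */
    return max_u32(fetch_cycles, hw_regs);
}

/* Soft stacking: the first store can issue only after both fetches completed. */
static inline uint32_t soft_stacked_entry(uint32_t fetch_cycles, uint32_t used_regs)
{
    assert(used_regs <= 31u);
    return fetch_cycles + used_regs;
}
```

In this model soft stacking only wins when `hw_regs > fetch_cycles + used_regs`, i.e. when the HW stacker pushes far more registers than the handler needs.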

1.2.4. EABI for RVE must be subset of RVI EABI.

To be able to call RVE only code from RVI ABI.
A recurring thing in RVE ABI proposals.

The idea is to allow compilers and software vendors to provide a single set of precompiled libraries for RVI and RVE ABIs.

The issue with this approach is that the code arbitrarily compiled for RVE is likely to turn out to be less efficient than RVI one. It also limits the capabilities of RVI ABI like trading off argument registers for temporary/saved ones.

1.2.5. one universal standard for everyone’s use cases

Having one universal solution for all possible scenarios brings a lot of inefficiency to all of them, due to mandatory support for a lot of rarely used functionality, keeping compatibility with unused legacy, or having to be a subset of a bigger architecture optimized for different use cases.

Even if that "flexibility" is made completely optional and non intrusive the vendors will implement it anyway for the sake of having the longest "flexibility" bar.

1.2.6. special handler return pattern

aka "HANDLER_RETURN" on emb-riscv and "EXC_RETURN" on ARM

The idea is to put a special pattern in ra during handler entry and to exit by reusing the regular return mechanism provided by the ABI. Requires a certain memory area to be non executable (e.g. 0xF0000000 - 0xFFFFFFFF)

This mechanism follows the typical ABI function call and, together with HW stacking, allows the interrupt handlers to be regular C functions.

The downside is that ra and pc both have to be pushed onto the stack, and in some specific cases it could add extra stall cycles after the tail due to waitstates or a cache miss caused by delayed prefetch.

Alternatively we can just stack the ra and put there the current pc with the lowest bit set to trigger the handler return operation. One less register counted towards interrupt latency.

Note

normally the jalr instruction just ignores the LSB of the resulting address. An LSB set in both the register and the immediate will lead to a "bogus" jump over 2 extra bytes. Even though this behaviour simplifies hardware, existing ABIs are allowing "auxiliary information" in pointers as well as in the jalr immediate, effectively making both useless.
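Both return detection variants from this section can be modelled in a few lines. This is a sketch only: the function names are hypothetical, and the non executable range is the example range from the text, not a value fixed by any spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Variant 1: a magic pattern inside a non executable region (example range). */
#define HANDLER_RETURN_BASE 0xF0000000u

static inline bool is_pattern_return(uint32_t jump_target)
{
    return jump_target >= HANDLER_RETURN_BASE;
}

/* Variant 2: ra holds the interrupted pc with bit 0 set. */
static inline uint32_t make_return_token(uint32_t interrupted_pc)
{
    assert((interrupted_pc & 1u) == 0); /* pc is at least 2 byte aligned */
    return interrupted_pc | 1u;
}

static inline bool is_lsb_return(uint32_t jump_target)
{
    return (jump_target & 1u) != 0;
}

/* Address at which execution resumes after the handler returns. */
static inline uint32_t resume_pc(uint32_t token)
{
    return token & ~1u;
}
```

With the second variant the token itself carries the return address, so only ra has to be stacked instead of ra and pc.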

1.2.7. vector tables that are jumped to

It’s simply inefficient in a truly vectored scenario. The vector entries will have to be populated with jump instructions anyway. Those have to take a second round of waitstates or cache misses without amortization by register stacking.

And if the code is far away from the vector table (e.g. in SRAM for more deterministic execution), the compiler will have to emit a jump island, aka "veneer", that will perform yet another unamortized jump. Additionally, far jumps require a free register, which in a typical scenario requires pushing to the stack and returning to the veneer from the handler to handle the epilogue.

Allocating 8 bytes per entry, allowing a lui + jalr sequence, will severely hurt the code density and performance in typical use scenarios.

Note	8051 allocated 8 bytes per entry, but it was sometimes able to fit an entire handler or one of the conditional paths, especially when the following entries were unused. This kind of optimization is exclusive to assembly programming and generally not practised today.

1.2.8. MMIO vs CSR mapped config registers

In case of mass initialization MMIO could result in better code density. CSR space is also limited.

My take is that anything architecturally coupled to the core should reside in CSR space, and the rest should be kept in MMIO.

Nothing should exist as both.

There is no point in avoiding CSR registers when the cost of Zicsr instructions is already taken.

1.2.9. "reduced/zero jitter"

Very often claimed, yet those claims rarely match reality.

Note	There are also many non-architectural sources of jitter like caches, waitstated flash, accessing peripherals in different clock domains (usually divided from sysclk), DMA contention, or just the code masking out the interrupts.

Cortex-m0 offers "zero jitter" through an optional IP (RTL for ASICs) configuration that extends the best case interrupt latency by an extra cycle to accommodate a random stall from bus contention.

Cortex-m3/4 offer up to 6 cycles of jitter due to "late arrival" and "pop pre-emption". Regular handler entry is dominated by stacking registers, giving some headroom for extra vector/instruction fetch latency.

Cortex-m7 of course suffers from Proprietary&Confidential syndrome. Most probably it’s similar to cm3/4.

In case of C2000 CLA, TI claims [14],[15],[27] that their task driven machine (non preemptible) "reduces interrupt latency and jitter" compared to classic CPU, even though it does exactly the opposite when there is more than 1 async interrupt to handle.

Note

Of course whenever TI compares CLA to "classic cpu", it’s always a cpu with preemption priorities only and background task not present on CLA. As if the similar "task machine" couldn’t be achieved by regular general purpose architecture (e.g. risc-v, cortex-m) without nesting and WFI loop (or "sleep on exit" feature) giving access to all GPRs in interrupts without stacking.

1.2.10. "everything will run Linux in future"

The Linux cargo cult.
Because the simplest tasks, suitable for a bunch of 555&74s or a simple microcontroller with a few KiB of flash and RAM, must be done under linux so they will work somehow "better".

To be able to properly run linux you need quite a beefy unit (usually with MMU), 2-4MiB of flash and 4-8MiB of RAM (usually external DRAM), and you get a long boot time and bad power consumption in idle.
Just to run the OS itself.

One of the most blatant examples is NOMMU linux on stm32f429 with memory mapped SDRAM that is not even cached by the cpu. If the XIP image doesn’t fit in the 2MiB internal flash, it has to land in external parallel NOR flash, which is of course not cached by the cpu and shares the bus with SDRAM.
Any attempt to touch internal SRAM regions will defeat the remaining "universality/portability of linux apps" arguments.
Not to mention a much higher unit price than typical 200+MHz cortex A5/7 SOCs.

Of course there are still actual reasons to use linux in non-realtime embedded, such as the large collection of drivers for external devices, higher portability or access to raw performance (at a much better perf/price ratio) not available in typical microcontrollers [16].

1.2.10.1. RTLinux and hard-realtime

Whenever those rt patches are measured, both the interrupt latency and jitter are always given in tens or hundreds of microseconds, not cycles [17],[18],[19],[20].

In some scenarios those numbers are unacceptable.
As an example, industry standard FOC current loops close within 5-10us [35] and in some cases achieve sub 1us latency [34], on a <200 MHz core clock.
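To put both units on the same scale, microseconds can be converted into core cycles. The helper below is a trivial sketch with a hypothetical name; the 200 MHz clock is just the upper bound quoted in the text.

```c
#include <assert.h>
#include <stdint.h>

/* cycles = time [us] * core clock [MHz] */
static inline uint64_t us_to_cycles(uint64_t us, uint64_t core_mhz)
{
    assert(core_mhz > 0);
    return us * core_mhz;
}
```

At 200 MHz a whole 5us control loop gets a budget of 1000 cycles, while a latency of 100us alone corresponds to 20000 cycles.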

1.2.11. lazy stacking

Lazy stacking allows skipping the stacking of FP registers if the handler doesn’t touch floating point registers.

The main issue is that all of the caller saved FP registers are saved onto the stack (execution stalls during the push) whenever an FP instruction is executed, even though only a few of the registers are used.

Requires an additional CSR to hold the address of the reserved space in the stack frame.

1.2.12. 64bit microcontrollers

So far, these are mostly application processors used in bare metal.

Use cases for such also have different requirements than typical 32bit microcontrollers.

1.3. required ABI

Ideally we should not change the established ABI, to avoid disruption. But we should definitely get rid of the tp register, which is overall useless.

1.3.1. stack alignment

Should be 2x`XLEN`, mandated throughout entire program execution so as to not require special realignment in interrupts.

Note

psABI [33] says that:

stack pointer must remain aligned throughout procedure execution

and fails to enforce this anyway:

Non-standard ABI code must realign the stack pointer prior to invoking standard ABI procedures. The
operating system must realign the stack pointer prior to invoking a signal handler; hence, POSIX
signal handlers need not realign the stack pointer. In systems that service interrupts using the
interruptee’s stack, the interrupt service routine must realign the stack pointer if linked with any
code that uses a non-standard stack-alignment discipline, but need not realign the stack pointer if
all code adheres to the standard ABI

A major ilp32e issue is that the addi16sp instruction works on a 16 byte stack increment. Once the c.addi range (-32..+31) is exhausted, compilers have to choose between denser code and more efficient use of stack.

Zcmp extension was also designed for 16 byte aligned stack. There is a Zcmpe extension, postponed to the future, which should handle the EABI. Lowering the stack alignment doubles (per bit of alignment) the waste of codepoints by push/pop instructions.

Note	`addi8sp` won’t be necessary as Zcmpe `push`/`pop` can prepare initial 8 byte allocation for an (optionally) following `addi16sp`

Note	2x`XLEN` alignment allows more optimal use of microarchitectures capable of stacking 2 registers per cycle
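As a small illustration of the 2x`XLEN` rule, the sketch below computes the alignment in bytes and rounds a stack pointer down to it. It assumes XLEN is given in bits and uses a 32 bit sp for brevity; the helper names and example addresses are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2 x XLEN bits expressed in bytes: 8 on RV32, 16 on RV64. */
static inline uint32_t stack_align_bytes(uint32_t xlen_bits)
{
    assert(xlen_bits == 32u || xlen_bits == 64u);
    return 2u * (xlen_bits / 8u);
}

/* The stack grows downwards, so realignment rounds sp down. */
static inline uint32_t align_sp_down(uint32_t sp, uint32_t xlen_bits)
{
    return sp & ~(stack_align_bytes(xlen_bits) - 1u);
}

static inline bool sp_is_aligned(uint32_t sp, uint32_t xlen_bits)
{
    return (sp % stack_align_bytes(xlen_bits)) == 0;
}
```

If every piece of code keeps `sp_is_aligned()` true at all times, an interrupt entry never needs the `align_sp_down()` step.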

1.3.2. RVE

register	ABI name	Saver	description
x0	zero	-	Hardwired zero
x1	ra	caller	return address
x2	sp	callee	stack pointer
x3	gp	-	global pointer
x4	t0	caller	temporary
x5	t1	caller	temporary
x6	t2	caller	temporary
x7	t3	caller	temporary
x8	s0/fp	callee	saved/frame pointer
x9	s1	callee	saved
x10	a0	caller	argument/return
x11	a1	caller	argument/return
x12	a2	caller	argument
x13	a3	caller	argument
x14	a4	caller	argument
x15	a5	caller	argument
x16-x31	-	-	reserved for custom use

Note	ilp32e with `tp` turned into temporary.

1.3.3. RVI

register	ABI name	Saver	description
x0	zero	-	Hardwired zero
x1	ra	caller	return address
x2	sp	callee	stack pointer
x3	gp	-	global pointer
x4	t0	caller	temporary
x5	t1	caller	temporary
x6	t2	caller	temporary
x7	t3	caller	temporary
x8	s0/fp	callee	saved/frame pointer
x9	s1	callee	saved
x10	a0	caller	argument/return
x11	a1	caller	argument/return
x12-x17	a2-a7	caller	argument
x18-x27	s2-s11	callee	saved
x28-x31	t4-t7	caller	temporary

Note	ilp32 with `tp` turned into temporary.

1.4. debug

The official risc-v debug spec [45] is good enough to not necessitate another incompatible one, although the "minimal debug implementation" is actually not minimal.

Some of the minor things that could be "improved" for minimal implementations:

1 entry progbuf accepting 32bit instructions only (saves 2 bits, currently must accept compressed insns)
writing this 1 entry progbuf immediately executes written instruction (ie. no storage in progbuf)
remove dpc CSR, and allow debuggers to get the "current" pc by executing auipc from progbuf
no mandatory abstract register reads (data exchange only through message registers)
get rid of certain discovery bits
etc.

Biggest offenders of course are and will be the actual implementations that despite being the "minimal" ones designated as "8bit killers", are happily implementing more than necessary. Like 8-word progbuf in ch32v003 [28].

1.4.1. DTM

Low pin count devices (8-32) need a denser debug interface as the JTAG uses too many wires.

There are industry proven 2 wire interfaces like cJTAG or ARM SWD.
It would be best to have 1 wire solution like avr8 debugWIRE/updi or the WCH "SDI" (aka "SWD") [46]

1.5. tooling issues to solve

1.5.1. prestacked annotation

Note	official RFC has been submitted here: riscv-non-isa/riscv-c-api-doc#53

Currently there is no universal solution to indicate which registers in interrupt handlers can be freely used without stacking them.

__attribute__((interrupt)) makes all registers callee saved and uses mret to return.
__attribute__((interrupt("SiFive-CLIC-preemptible"))) extends regular interrupt by CLIC preemption
__attribute__((interrupt("WCH-Interrupt-fast"))) requires a custom built toolchain, no floating point regs (even on the cores with F extension), still uses mret
Or just a plain C function that requires prestacking of all caller saved registers, reuses standard return mechanism to exit interrupt context

Even worse, there are already hardware stackers designed for ilp32e and ilp32. When a new and better ABI is introduced, it will be impossible to use it with pre-existing HW stackers. The same applies to creating HW stackers that stack fewer registers to optimize interrupt latency.

Therefore we need a universal way to annotate which registers are available for use in a given function as de facto caller saved ones (aka create a custom calling convention).

prestacked("") attribute
no whitespaces in string parameter
register range covers all registers between and including the specified ones (x4-x6 is equivalent to x4,x5,x6)
register range must span at least 3 consecutive registers
registers/ranges are separated by comma
callee saved registers have to be properly turned into temporary when included in the list
CSRs taking part in calling conventions are also subject to this mechanism
should use raw names instead of ABI mnemonics as to make it ABI agnostic (more portable)
registers must be sorted (integer, floating point, vector, custom, then by lowest numbered)
CSRs must be put after the architectural regfiles, those don’t have to be sorted
must not collide with __attribute__((interrupt)) as to support "legacy" handler return mechanisms
must not imply __attribute__((interrupt)) as well
custom CSRs would also have to be somehow covered. (hw loops etc.)
annotated functions should be callable by regular code
argument registers that are passed but not included in the list, can be assumed to be unmodified after return from an annotated function

ilp32 caller saved:

__attribute__((prestacked("x5-x7,x10-x17,x28-x31")))

ilp32f, caller saved:

__attribute__((prestacked("x5-x7,x10-x17,x28-x31,f0-f7,f10-f17,f28-f31,fcsr")))

preemptible CLIC irq with simplified ranges (e.g. shadow register file):

__attribute__((interrupt("CLIC-preemptible"), prestacked("x8-x15")))

TEIC irq, range0 + shadow regs of half integer regfile (where bit 2 of operand is set, covers range1+2) and F + P extensions:

__attribute__((prestacked("x4-x7,x10,x11,x12-x15,x20-x23,x28-x31,fcsr,vxsat")))

ch32v003 irq (ilp32e + PFIC HW stacker, assuming ra doesn’t have some undocumented use):

__attribute__((interrupt, prestacked("x1,x5-x7,x10-x15")))

Note	unannotated `ra` is assumed to be a valid return address, otherwise a special return mechanism must be used (e.g. return by `mret` in `__attribute__((interrupt))`)
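The syntax rules of the list string can be checked mechanically. The sketch below is a hypothetical helper, not part of any compiler: it collects only the integer registers into a bitmask (bit N = xN), skips other tokens (f registers, CSR names) unchecked, and rejects whitespace, ranges shorter than 3 registers and unsorted integer registers. Ordering between register files is not verified here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "xN" (N in 0..31) at *p and advance *p. Returns N or -1. */
static int parse_xreg(const char **p)
{
    const char *s = *p;
    char *end;
    long n;

    if (s[0] != 'x' || !isdigit((unsigned char)s[1]))
        return -1;
    n = strtol(s + 1, &end, 10);
    if (n > 31)
        return -1;
    *p = end;
    return (int)n;
}

/* Returns 0 and fills *mask on success, -1 when a list rule is violated. */
static int prestacked_xmask(const char *list, uint32_t *mask)
{
    int prev = -1; /* highest x register seen so far */

    assert(list != NULL && mask != NULL);
    *mask = 0;
    while (*list != '\0') {
        const char *p = list;

        if (*p == 'x') {
            int lo = parse_xreg(&p);
            int hi = lo;

            if (lo < 0)
                return -1;
            if (*p == '-') {
                p++;
                hi = parse_xreg(&p);
                if (hi - lo < 2)    /* a range needs at least 3 registers */
                    return -1;
            }
            if (lo <= prev)         /* sorted, no duplicates or overlap */
                return -1;
            for (int r = lo; r <= hi; r++)
                *mask |= UINT32_C(1) << r;
            prev = hi;
        } else {
            while (*p != '\0' && *p != ',') {
                if (isspace((unsigned char)*p))
                    return -1;      /* no whitespace allowed */
                p++;
            }
            if (p == list)          /* empty token */
                return -1;
        }
        if (*p == ',' && p[1] != '\0')
            p++;                    /* step over the separator */
        else if (*p != '\0')
            return -1;              /* junk after token or trailing comma */
        list = p;
    }
    return 0;
}
```

For the ilp32 caller saved list "x5-x7,x10-x17,x28-x31" the resulting mask is 0xF003FCE0.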

1.5.1.1. optimization for `noreturn` functions

gcc/llvm compilers can purge the epilogue (even down the call tree) by automatic detection of infinite loop or by using __attribute__((noreturn)) or __builtin_unreachable().

It is not the case for prologues though, leading to waste of stack and code space in the most typical embedded scenario of main or thread functions with infinite loops.

This missing optimization is intentional [32] to allow backtracing (abort() etc.) and throwing exceptions (of course under -fno-exceptions and exception less code)

By abusing the "prestacked annotation" we can get rid of this prologue by "prestacking" all of the available registers.
e.g. __attribute__((noreturn, prestacked("x1,x4-x31,f0-f31,fcsr")))

Note	addition of `noreturn_nobacktrace_noexcept` attribute is very unlikely, optimizing regular `noreturn` attribute is even less likely.

Note	`__attribute__((naked))` won’t work, as it will remove the stack allocation and consequently underflow the stack.

1.5.1.2. functions with partially custom calling conventions

It can be additionally abused to:

define IPRA clobbers of assembly functions in its C function declarations (see applying IPRA to assembly functions)
certain (premature) optimizations (manually solving 2way IPRA recursion etc.)
dynamic linked functions with a subset of clobbers. e.g. functions like memcpy(),strcmp() etc. don’t need to clobber all caller saved registers so only common clobbers for straightforward, unrolled (?) and vectorized implementations need to be applied. Requires standardization of canonical clobbers for each offending function. (quite unrealistic)

1.5.2. IPRA - Inter procedural register allocation

So far implemented only by llvm [8].
Limited to statically linked code.
There are almost no benchmarks results, especially the ones other than x86 at -O3.

In simple terms, it makes every function export information about its usage of caller saved registers, effectively allowing non leaf functions to use caller saved registers as callee saved ones. That avoids some of the stacking/spilling, leading to more efficient code.

requirements and improvements needed for efficient IPRA:

this mechanism must cover the CSRs as well as the registers (e.g. fcsr, vtype, vl etc.)
custom registers and CSRs should also be covered (e.g. HW loops) (unnamed?)
compilers need to avoid using more registers than necessary (currently no reason)
registers from compressible range should be allocated only when it will benefit code density (currently no reason)
to avoid regressions, compilers need some kind of heuristic to detect when stacking certain (compressible) callee saved registers would yield better code density than using more temporaries from non compressible ranges

Note	on riscv it’s `s0` and `s1`, in presence of Zcmp[e] pushing `s0,s1` is free in non leaf functions, and just 2 16bit instructions in leaf. With IPRA it should be also possible to just move `ra` and `s0/s1` into caller saved regs.

Note	This is also non IPRA optimization (-Oz kind)

need special assembly directive to annotate such exports from pure assembly code (workaround exists applying IPRA to assembly functions)

Note	Automatic detection is not an option due to self constructed instructions (e.g. from [39]):

	.word (0b0000000<<25)|(8<<20)|(0<<15)|(0b001<<12)|(10<<7)|0x43
	.insn i CUSTOM_1, 0x0, 1, a0, 0x123
	//equivalent to:
	//tio.add0.xy a0, y0, s0
	//tio.addi0.yx y1, a0, 0x123

precompiled libraries should also do an "IPRA exports"
very important point is resolving IPRA annotations of callbacks, where the callback call will use the smallest common regmask of all functions that can be called through this point
- callbacks initialized once at startup (typical in many HALs)
- callbacks passed as function parameters
- queues (of structs) with callbacks

Note	callbacks are commonly used in peripheral interrupts, therefore it’s important to apply IPRA optimizations to those as well

it can be used to annotate that passed function arguments (through registers or stack) were not modified and can be recycled by caller (e.g. in loops)
it can also "export" list of deterministic constants (and addresses) that are left in registers after return

Note	This mechanism is portable to other architectures, the more caller saved registers are available, the higher relative gain is.

Note	vector extension can benefit from IPRA as current psABI makes all vector registers temporary, though the syscall destroys entire state
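The callback rule above can be sketched as a simple set operation. This is only an illustration: the `clobber_mask_t` type, the `REG_BIT` macro and the `callback_site_clobbers` function are hypothetical names, and the masks are expressed as clobber sets over x0-x31 (the "smallest common regmask" of preserved registers is the complement of the union computed here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clobber mask: bit n set means integer register xn
 * may be modified by the callee. */
typedef uint32_t clobber_mask_t;

#define REG_BIT(n) ((clobber_mask_t)1u << (n))

/* Clobbers the caller has to assume for an indirect call that may reach
 * any of the candidate callees: the union of their clobber sets
 * (equivalently, the intersection of their preserved-register masks). */
static clobber_mask_t callback_site_clobbers(const clobber_mask_t *callee,
                                             size_t count)
{
	clobber_mask_t all = 0;

	assert(callee != NULL || count == 0);
	for (size_t i = 0; i < count; i++)
		all |= callee[i];
	return all;
}
```

A callback slot that can only ever hold handlers clobbering `ra`, `a0` and `a1` would then let the caller keep every other caller saved register live across the call.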

1.5.2.1. adjusting ipra wrt prestacked registers

Because the HW stackers (used with prestacked annotation) will prefer to stack out the compressible registers first, it might not be the best match for IPRA optimized allocation

Note	compilers usually don’t care about non-abi (interrupt) prologues/epilogues and emit code as if it was the regular ABI function

The solution could be:

optimize HW stacker for typical allocations
make compilers treat specially a call trees growing from interrupt handlers
trump the general IPRA optimizations to use a0-a5 first

Handlers that are not calling other functions should be straightforward, as long as the compiler allocators/optimizers are not going to outright ignore the prestacked annotation.

1.5.2.2. applying IPRA to assembly functions

Special attribute to annotate function declaration in header associated with assembly code (e.g. __attribute__((regmask("clobbered list here")))) was proposed [49], but it wasn’t implemented upstream.

The other option is to use inline asm clobbers to make calls to such functions:

	__attribute__((always_inline))
	static inline int weird_call(int n, void* p)
	{
		register int result asm("a0") = n;
		register void* a1 asm("a1") = p;

		asm volatile(
			"call foo \n\t"
			: [ARG0] "+r" (result) // return in same register
			: [ARG1] "r" (a1)
			: "memory", "ra", "a2" // use clobber for any caller saved regs used
		);

		return result;
	}

requires the call pseudoinstruction that expands to a proper sequence. Otherwise we get errors when calling too far or missing optimization when short call can be made.
works in existing compilers (at least in gcc and llvm)

Another solution could be applying prestacked annotation e.g.

pure assembly function (FP compute kernel) using only subset of caller saved registers (a0 argument not modified):
__attribute__((prestacked("x5,x11-x15,f10-f13,v0,v1,v8-v31,fcsr,vl,vtype,vstart")))

Note	Both methods are insufficient for "annotating" unmodified stack arguments which are caller saved (documented on arm and defacto on risc-v)

2. XTeic (aka Total Embedded Interrupt Controller)

smallest profile?

machine mode only

RV32 only

2 or 4 interrupt nesting levels

little endian only (software shall assume little endian)

2.1. implementation constants

name	default value	notes
`TEIC_ENTRY_VECT_BASE`	implementation specific	Base address of the first application entry point as well as its vector table. May have additional constraints on the alignment.
`TEIC_EXEC_SRAM_BASE`	implementation specific	Base address of the most designated executable SRAM memory. (Some devices implement a special memory area designated for interrupt handlers, aka "ITCM". Usually it will be the main memory address.)
`TEIC_MMIO_CTRL_BASE`	0xFFFE0000	Base address of XTeic MMIO control block
`TEIC_IRQ_NESTING_BITS`	{0,1,2}	Number of implemented interrupt nesting priority bits
`TEIC_IRQ_PRIORITY_BITS`	{1,2,3,4}	Number of implemented interrupt sub-priority bits
`TEIC_IRQ_VECT_ENTRIES`	{9..1023}	Number of allocated interrupt entries including skipped ones and NMIs
`TEIC_IRQ_VECT_ENTRY_SIZE`	{2,4}	Size in bytes of the single entry in vector table. By default it’s 4. 2 if XTeicTinyIrqTable subextension is implemented.

2.2. startup behaviour

Upon hart reset:

all of the architectural registers are initialized to their reset state.
The MMIO control block registers are also initialized to their reset state.
The pc is set to the TEIC_ENTRY_VECT_BASE.

Performing the system reset will additionally initialize the state of the peripheral registers to their reset state.

The hart reset is always equivalent to a system reset until XTeicMP extension is implemented.

2.2.1. reset state of registers

The reset state of all architectural registers is undefined unless explicitly specified in specific extension.

Note	That means the reset state of integer, fp, and vector registers is undefined.

Note	some of the CSR registers also remain in undefined state.

2.2.2. bootloaders

If the application start is preceded by a bootloader, or the application enters the bootloader, then the switch code shall ensure that before redirecting execution to the target address:

all peripherals are disabled, or initialized to reset state if enabled on reset (e.g. watchdogs)
external GPIOs are configured to reset state
the oscillators, PLLs, clock selects and divisors are configured to their reset state
all nesting levels in teic_irq_msk are enabled
teic_irq_vect is set to the target entry point, right before the jump happens

Note	The rationale of these rules is to avoid bloat in startup code (and duplicate of it in `SystemInit()`), which is a result of assuming the worst case scenario

Note	bootloaders placed at application entry area (at `TEIC_ENTRY_VECT_BASE`) can be entered by setting a certain pattern in backup register and then executing system reset.

Note	Some devices switch between bootloader and application modes by performing whole system reset after modifying certain configuration registers (remap of executable area at `TEIC_ENTRY_VECT_BASE`)

2.3. interrupts

The interrupt controller supports only level triggered interrupts. The logical high is used to assert pending interrupt request lines.

The irq number is the position in vector table

Note	there is no irq offsetting like in NVIC

The stack pointer is not realigned. If the stack is not 8 byte aligned, the behaviour is implementation specified.

Note	typical HW won’t care about 4 byte stack, some dual issuers or hardened cores might want to set `irqentryexit_unrec` nmi request

Note	Zcmp similarly doesn’t specify the required alignment.

2.3.1. register ranges

Register ranges define which registers are pushed onto the stack on irq entry.

Adding a certain range requires inclusion of all previous ranges.

The selection is implementation specific, fixed at silicon level. Shall not deviate from the predefined ranges.

Note	only highest nesting level has configurable stacking ranges.

range	registers	added stack area	mandatory implemented (all nesting)
0	"x1,x10,x11,reserved"	XLEN * 4	yes
1	"x12-x15"	XLEN * 4	yes
2	"x4-x7"	XLEN * 4	no
3	"x16,x17,x28-x31"	XLEN * 6	no

Note	Range 0+1 gives similar amount of usable registers as NVIC

stack frame pseudocode

// all ranges used
// range 0
sw x1, -4(sp)
sw x10, -8(sp)
sw x11, -12(sp)
sw reserved, -16(sp)

// range 1
sw x12, -20(sp)
sw x13, -24(sp)
sw x14, -28(sp)
sw x15, -32(sp)

// range 2
sw x4, -36(sp)
sw x5, -40(sp)
sw x6, -44(sp)
sw x7, -48(sp)

// range 3
sw x16, -52(sp)
sw x17, -56(sp)
sw x28, -60(sp)
sw x29, -64(sp)
sw x30, -68(sp)
sw x31, -72(sp)

addi sp, sp, -72

Note	reserved position in range0 window can be optionally used for preserving additional state during nesting
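The frame size that follows from the range table can be computed as in the sketch below. It assumes RV32 (4 byte slots) and counts the reserved slot of range 0; `teic_range_words` and `teic_frame_bytes` are illustrative names, not part of the specification.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32-bit slots each register range adds to the stack frame
 * (range 0 includes the reserved slot). */
static const uint8_t teic_range_words[4] = { 4, 4, 4, 6 };

/* Frame size in bytes on RV32 when ranges 0..highest_range are stacked. */
static uint32_t teic_frame_bytes(unsigned highest_range)
{
	uint32_t words = 0;

	assert(highest_range <= 3);
	for (unsigned r = 0; r <= highest_range; r++)
		words += teic_range_words[r];
	return words * 4u;
}
```

With all ranges stacked this gives the 72 bytes used by the `addi sp, sp, -72` in the pseudocode above.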

2.3.2. interrupt entry

When a given interrupt nesting level (reflected by pending_nestx in teic_irq_status) becomes pending and is not masked out by the corresponding bit in the teic_irq_msk register, the interrupt entry procedure is triggered.

During the interrupt entry the hardware will:

stack the configured/implemented register ranges at the given nesting level (can be affected by n4_stacking)
decrement sp according to the largest configured/implemented register range
put the interrupted pc into the ra register with the lowest bit set
set the in_nestx bit in the teic_irq_status register
fetch the target address from the vector table pointed to by teic_irq_vect. The vector entry is selected by the handler dispatch process.
jump to the fetched address

Note	optimized microarchitectures will implement late arrival, tail chaining and pop preemption which further complicate entry/exit procedures

If the irq request is spuriously deasserted during the interrupt entry (or e.g. tail chaining), the core must either enter the offending handler or immediately return (or e.g. tail chain to yet another handler).

Note	Sometimes it takes a few cycles to deassert irq request signal, after e.g. clearing status flag. Behaviour must be deterministic. Otherwise erratas will be populated.
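The entry steps listed above can be sketched as a small host-side software model. This is not a hardware description: the `teic_model` structure, the flat memory and all function names are hypothetical, the reserved slot is simply written as zero, and late arrival, tail chaining and fault handling are left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MEM_BYTES 1024u

/* Hypothetical software model of a hart, for illustration only. */
struct teic_model {
	uint32_t x[32];                 /* integer registers, x[1] = ra, x[2] = sp */
	uint32_t pc;
	uint32_t irq_status;            /* models teic_irq_status */
	uint32_t irq_vect;              /* models teic_irq_vect */
	uint8_t  mem[MODEL_MEM_BYTES];  /* flat memory starting at address 0 */
};

/* Little endian 32-bit store into the model memory. */
static void model_sw(struct teic_model *m, uint32_t addr, uint32_t val)
{
	assert(addr <= MODEL_MEM_BYTES - 4u);
	for (unsigned i = 0; i < 4; i++)
		m->mem[addr + i] = (uint8_t)(val >> (8u * i));
}

/* Little endian 32-bit load from the model memory. */
static uint32_t model_lw(const struct teic_model *m, uint32_t addr)
{
	uint32_t val = 0;

	assert(addr <= MODEL_MEM_BYTES - 4u);
	for (unsigned i = 0; i < 4; i++)
		val |= (uint32_t)m->mem[addr + i] << (8u * i);
	return val;
}

/* Stacking order from the stack frame pseudocode; 0 marks the reserved slot. */
static const uint8_t stack_order[18] = {
	1, 10, 11, 0,            /* range 0 */
	12, 13, 14, 15,          /* range 1 */
	4, 5, 6, 7,              /* range 2 */
	16, 17, 28, 29, 30, 31   /* range 3 */
};
static const uint8_t words_up_to_range[4] = { 4, 8, 12, 18 };

static void teic_model_irq_entry(struct teic_model *m, unsigned nest,
                                 unsigned irqn, unsigned highest_range)
{
	assert(nest >= 1 && nest <= 4 && highest_range <= 3);

	unsigned words = words_up_to_range[highest_range];
	uint32_t sp = m->x[2];

	/* stack the ranges below the current sp */
	for (unsigned i = 0; i < words; i++) {
		unsigned reg = stack_order[i];
		model_sw(m, sp - 4u * (i + 1u), reg ? m->x[reg] : 0u);
	}
	m->x[2] = sp - 4u * words;               /* decrement sp */
	m->x[1] = m->pc | 1u;                    /* ra = interrupted pc, lsb set */
	m->irq_status |= 1u << (3u + nest);      /* in_nest1..4 are bits 4..7 */
	/* fetch the 4 byte vector entry and jump to it */
	m->pc = model_lw(m, (m->irq_vect & ~0x1Fu) + 4u * irqn);
}
```

Note that `ra` is stacked before it is overwritten with the interrupted pc, so the handler can later restore the original value on exit.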

2.3.2.1. handler dispatch

During the handler dispatch the hardware will evaluate all pending irq requests and select the one with highest configured sub-priority, ties are resolved by highest irq number.
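A minimal sketch of this selection rule follows. It assumes that a numerically larger sub-priority value means higher priority (the text above does not state the direction) and that all candidates already belong to the nesting level being entered; `teic_select_irq` is an illustrative name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the irq number to dispatch, or -1 if nothing is pending.
 * pending[i] and sub_prio[i] describe irq number i. */
static int teic_select_irq(const bool *pending, const uint8_t *sub_prio,
                           size_t count)
{
	int best = -1;

	assert(pending != NULL && sub_prio != NULL);
	for (size_t i = 0; i < count; i++) {
		if (!pending[i])
			continue;
		/* ">=" lets the higher irq number win a tie */
		if (best < 0 || sub_prio[i] >= sub_prio[best])
			best = (int)i;
	}
	return best;
}
```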

2.3.3. interrupt exit

When jalr or cm.popret instruction is executed and the lowest bit in the source register is set (before calculating final target address), the interrupt exit procedure is triggered.
If no interrupt is currently active then irqretnest0_unrec nmi request is set.

During the interrupt exit the hardware will:

unstack configured/implemented register ranges at given nesting level (can be affected by n4_stacking)
increment sp according to largest configured/implemented register ranges
clear in_nestx bit in teic_irq_status register
jump to the target address of the jalr or cm.popret instruction

Note	The bogus `jalr` target address issue remains as per unprivileged spec. Therefore conforming software shall not set the lsb in `jalr` immediate used for function returns

Note	only the lsb in source register is checked, not the computed target address of `jalr` instruction. It allows detection of irq ret condition earlier in the pipeline.

Note	optimized microarchitectures will implement late arrival, tail chaining and pop preemption which further complicate entry/exit procedures
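The difference between the exit condition and the final jump target can be illustrated with two helper functions. Both names are hypothetical; `jalr_target` follows the unprivileged definition of `jalr` (sum of source register and immediate with bit 0 cleared).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when a jalr (or cm.popret) with this source register value triggers
 * the interrupt exit procedure: only bit 0 of the source is checked. */
static bool teic_is_irq_exit(uint32_t rs1_value)
{
	return (rs1_value & 1u) != 0;
}

/* jalr target per the unprivileged spec: (rs1 + imm) with bit 0 cleared. */
static uint32_t jalr_target(uint32_t rs1_value, int32_t imm)
{
	assert(imm >= -2048 && imm <= 2047);   /* 12-bit signed immediate */
	return (rs1_value + (uint32_t)imm) & ~1u;
}
```

For example, a source value of `0x101` with a zero immediate is an interrupt exit to `0x100`, while a source value of `0x100` with an immediate of 1 jumps to the same address without triggering the exit procedure.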

2.3.4. NMI interrupts

NMIs (non maskable interrupts) are a special type of interrupts that cannot be masked by teic_irq_msk register. Typically used for signalling critical conditions.

Entry/exit procedure is similar to regular IRQs with the following exceptions:

activity is signalled by in_nmi in teic_irq_status register
preserves at least range 0 registers, stacking ranges are implementation defined.
adjusts sp by stacked ranges

Note	typically NMIs will stack the same register ranges as regular interrupts

Before returning from NMI handler all requests in teic_nmi_cause CSR must be acknowledged (cleared).

2.3.4.1. NMI unrecoverable state

unrecoverable NMI handler is entered whenever:

any of the *_unrec requests is raised in teic_nmi_cause
synchronous exception is raised during active NMI handler
any of the synchronous exception flag (*_exc in teic_nmi_cause) is not cleared before performing interrupt exit from NMI handler
*_async that was escalated to unrecoverable nmi request (escalated_async_unrec in teic_nmi_cause)

Entry procedure is similar to regular NMIs with the following exceptions:

activity is signalled by in_nmi_unrecoverable in teic_irq_status register
busfaults, alignment or other errors during stacking are ignored
not required to actually stack the registers; only ra shall be written with the pc at the time of the fault and sp decremented by the range 0 area

2.3.4.2. NMI lockup state

The hart enters the NMI lockup state whenever

code attempts to return from Unrecoverable_NMI handler
synchronous or imprecise exception is raised within Unrecoverable_NMI handler

NMI lockup state halts any further code execution, except debug mode one.

Note	it is necessary to allow debuggers to read out state of registers/memory after experiencing lockup state.

Note	experiencing exceptions within (or return from) unrecoverable handler means a serious issue with control flow, where further attempts to execute code would do more harm than halting until watchdog performs system wide reset.

Note	lack of triple fault lockout can also lead to security vulnerabilities [43]

Note	microarchitectures can provide external output for signaling NMI lockup state as to allow immediate shutdown of certain peripherals (pwm timers etc.)

2.3.5. vector table allocation

irq num	type	name	notes
0	-	reserved	reserved for startup code (typically jump instruction)
1	NMI		reserved
2	NMI	IntegrityViolation_NMI	(optional) software and hardware integrity violations
3	NMI	ClockViolation_NMI	(optional) Lost clock or other anomaly. It should be assumed that the core/system clock could have been switched to a different one at this point.
4	NMI	WatchdogViolation_NMI	(optional) Entered right before any of the watchdogs trips and performs a (device) reset. Designated for safety measures and error logging. It should be assumed that execution could be frozen at this point and no further action can or need to be performed.
5	NMI	MemoryViolation_NMI	Bus or memory access fault
6	NMI	InstructionViolation_NMI	Illegal instruction exception
7	NMI	Unrecoverable_NMI	Nested nmi, unknown or a state that cannot be easily recovered from.
8	IRQ	Deffered0_IRQ	software deferred interrupt, can be used for context switch.
9	IRQ	reserved	reserved/systick???
10..1022	IRQ	*_IRQ	(optional) device specific interrupts

Unimplemented optional NMIs can be recycled for custom NMIs other than the ones provided in table above.

Note	XTeic doesn’t provide any peripheral API for optional watchdog, clock and integrity protection systems. It’s up to the implementer to provide them.

2.3.5.1. alternate vector table allocation

Alternate vector table allocation is designated for minimal implementations that are not making use of optional NMIs, but benefit from additional space savings.

Alternate vector table allocation is implementation defined. It’s not discoverable nor configurable.

irq num	type	name	notes
0	-	reserved	reserved for startup code (typically jump instruction)
1	NMI	HW_NMI	(optional) hardware related exceptions (watchdogs, ECC etc.)
2	NMI	SW_NMI	exceptions related to application execution on a given hart (illegal instr, integrity violations by sw etc.)
3	NMI	Unrecoverable_NMI	Nested nmi, unknown or a state that cannot be easily recovered from.
4	IRQ	Deffered0_IRQ	software deferred interrupt, can be used for context switch.
5	IRQ	reserved	reserved/systick???
6..1022	IRQ	*_IRQ	(optional) device specific interrupts

Note	Fragmentation is not a big deal, as every device will be fragmented anyway by implementing its own layout of device specific IRQ handlers, which will be provided within startup files.

2.4. recycled volume II CSRs

To reduce disruption, some of the "privileged" CSRs have been recycled according to the "privileged" specification.

number	name	privilege	description	notes
0x001	`fflags`	URW	IEEE 754 exception flags	implemented when F,D,Zfinx,Zdinx is present
0x002	`frm`	URW	IEEE 754 dyn rounding mode	implemented when F,D,Zfinx,Zdinx is present
0x003	`fcsr`	URW	frm+fflags	implemented when F,D,Zfinx,Zdinx is present
0xf11	`mvendorid`	MRO	vendor ID	jedec??
0xf12	`marchid`	MRO	architecture ID
0xf13	`mimpid`	MRO	implementation ID
0xf14	`mhartid`	MRO	hart ID

2.5. added instructions

2.5.1. wfi (Wait for interrupt)

Mnemonic

wfi

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x73, attr: ['SYSTEM'] },
 { bits: 5, name: 0x0, attr: ['rd'] },
 { bits: 3, name: 0x0, attr: ['PRIV'] },
 { bits: 5, name: 0x0, attr: ['rs1'] },
 { bits: 12, name: 0x105, attr: ['WFI'] },
]}

Description: Execution of the wfi instruction stalls the execution and allows the core to enter various low power states until the interrupt is taken or any nesting level becomes pending.
It is allowed to terminate spontaneously or even be implemented as a nop.

In addition, the wfi instruction is allowed to optionally stack out certain registers ahead of the interrupts, to reduce their latency. In this case, sp is not changed until interrupt arrives.

Note	`wfi` can be executed when interrupts are disabled, which is a very common use case that avoids introducing non deterministic delays to event response time (i.e. irq arriving right before `wfi`).

Note	It is basically the same thing as the privileged `wfi` but without the configuration bits in privileged CSRs

2.5.2. teic.wfi.n4ign

Mnemonic

teic.wfi.n4ign

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0x73, attr: ['SYSTEM'] },
 { bits: 5, name: 0x0, attr: ['rd'] },
 { bits: 3, name: 0x0, attr: ['PRIV'] },
 { bits: 5, name: 0x0, attr: ['rs1'] },
 { bits: 12, name: 0x115, attr: ['WFI'] },
]}

Description: Similar to wfi instruction, but doesn’t have to terminate after executing interrupts at 4th nesting priority only. Shall terminate if any other nesting level was entered before returning from n4 irq. (i.e. tail chained to n3, then pop preempted back into n4)

If only single nesting priority is implemented (TEIC_IRQ_NESTING_BITS == 0) then this instruction behaves like a standard wfi.

Note	Designated to reduce wakeups caused by high frequency control loop interrupts that don’t need attention from rest of the application.

Note	A typical implementation would require additional hidden state to track if an interrupt of lower nesting priority was entered.

Note	similarly to standard `wfi` it can terminate spontaneously so the additional functionality is optional
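Both encodings above differ only in the 12-bit function field, so the instruction words can be derived with a one-line helper. This is a sketch with a hypothetical function name; it only reproduces the bit layout given in the encoding diagrams (opcode 0x73, rd = 0, funct3 = 0, rs1 = 0).

```c
#include <assert.h>
#include <stdint.h>

/* Builds a SYSTEM-opcode instruction word with rd = 0, funct3 = 0 (PRIV)
 * and rs1 = 0, taking only the 12-bit function field. */
static uint32_t teic_encode_priv(uint32_t funct12)
{
	assert(funct12 <= 0xFFFu);
	return (funct12 << 20) | 0x73u;
}
```

The resulting word (0x11500073 for `teic.wfi.n4ign`) could be emitted with a `.word` directive on toolchains that don’t know the mnemonic.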

2.6. TEIC CSR map

number	name	privilege	description
0xbc0	`teic_irq_vect`	MRW	interrupt vector table
0xbc1	`teic_estate`	MRW	irq saved state
0x800	`teic_irq_msk`	URW	interrupt mask
0x801	`teic_irq_status`	URO	current interrupt status
0xbc4	`teic_nmi_cause`	MRW	coarse mask of NMI causes
0xbc5	`teic_cfg`	MRW	config register
0xbc6	`teic_sptlimit`	MRW	added with XTeicStackLimit
0xbc7	`teic_spmlimit`	MRW	added with XTeicStackLimit&&XTeicRTOS
0xbc8	`teic_swpspm`	MRW	added with XTeicRTOS

2.6.1. `teic_irq_vect`

bit	name	type	reset value	description
[31:5]	`vect_offset`	WLRL	`TEIC_ENTRY_VECT_BASE>>5`	top bits of vector table offset. Must be aligned to 64 bytes or rounded up to next power of 2, of the number of entries multiplied by the entry size, whichever is greater
[4:0]	reserved	WLRL	0	reserved

Note	alignment requirement allows to avoid use of the additional adder circuit during irq dispatch

Note	minimum alignment can be calculated by the following formula: `pow(2, ceil(log2(TEIC_IRQ_VECT_ENTRIES))) * TEIC_IRQ_VECT_ENTRY_SIZE`. If the vector table consists of 100 entries total, 4 bytes each, then the minimum required alignment is 512 bytes.

Note	`vect_offset` can be implemented with just enough bits to point at existing memory areas only, as to reduce necessary state to implement.

Note	Implementations may impose additional alignment requirement

Note	`vect_offset` can also be implemented as a read only constant pointing to the beginning of the flash memory
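The alignment rule for `vect_offset` can be checked with integer arithmetic only, as in the sketch below. `teic_vect_min_alignment` is an illustrative name; it applies the power of 2 rounding from the formula above together with the 64 byte minimum from the field description.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum vector table alignment in bytes for a given number of entries
 * and entry size. */
static uint32_t teic_vect_min_alignment(uint32_t entries, uint32_t entry_size)
{
	uint32_t pow2 = 1;

	assert(entries >= 9 && entries <= 1023);
	assert(entry_size == 2 || entry_size == 4);
	while (pow2 < entries)       /* round entry count up to a power of two */
		pow2 <<= 1;

	uint32_t align = pow2 * entry_size;
	return align < 64u ? 64u : align;   /* never below the 64 byte minimum */
}
```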

2.6.2. `teic_estate`

bit	name	type	reset value	description
[31:0]	`estate_nl`	WPRI	undefined	implementation specified pattern used to recover execution state upon interrupt return. Covers certain csr registers: (`fcsr`, `vcsr`, `vstart` etc.), and (optionally) multi cycle instruction progress. The content read as well as the write to this register is valid only at the lowest implemented nesting level. Otherwise read and write operations on this register are undefined.

Note

Although optional, the ability to interrupt multicycle instructions is especially important for cores implementing zero jitter features. As an example, the ratified Zcmp cm.popretz instruction has 3 uninterruptible instructions (one is a branch). (Even though it could be just 2 as zeroing a0 is restartable. 3 instruction sequence will be formally pushed down your throats anyway)

Note	designated to allow an efficient context switch from the lowest priority interrupt

Note	As the risc-v doesn’t have condition codes for branching/predication, it is expected that the smallest implementations will not make use of `estate` register at all.

Note	due to maximum 5-level nesting and limited state to preserve, it was decided to not push previous state onto stack, that would increase interrupt latency.

2.6.3. `teic_irq_msk`

bit	name	type	reset value	description
[31:4]	reserved	WPRI	0	reserved
3	`nest4`	rw	1	Fourth nesting level 0: disabled 1: enabled
2	`nest3`	WARL	1	Third nesting level 0: disabled 1: enabled
1	`nest2`	WARL	1	Second nesting level 0: disabled 1: enabled
0	`nest1`	WARL	1	First nesting level 0: disabled 1: enabled

Disabling any nesting level shall take effect immediately before executing next instruction.

bits related to unimplemented nesting levels are hardwired to zero.

Note	only `nest4` level is mandatory to implement

Note	`TEIC_IRQ_NESTING_BITS == 1` implements `nest2` and `nest4` only
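The effect of unimplemented nesting levels on `teic_irq_msk` can be sketched as follows. The function names are hypothetical, and the mapping for `TEIC_IRQ_NESTING_BITS == 2` (all four levels present) is inferred from the "2 or 4 interrupt nesting levels" statement rather than spelled out in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Bits of teic_irq_msk that are implemented for a given
 * TEIC_IRQ_NESTING_BITS value. */
static uint32_t teic_irq_msk_implemented(unsigned nesting_bits)
{
	assert(nesting_bits <= 2);
	switch (nesting_bits) {
	case 0:  return 0x8u;   /* nest4 only */
	case 1:  return 0xAu;   /* nest2 and nest4 */
	default: return 0xFu;   /* nest1..nest4 */
	}
}

/* Value read back after a write: unimplemented levels read as zero. */
static uint32_t teic_irq_msk_readback(uint32_t written, unsigned nesting_bits)
{
	return written & teic_irq_msk_implemented(nesting_bits);
}
```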

2.6.4. `teic_irq_status`

bit	name	type	description
[31:12]	reserved	WPRI	reserved
11	`n4_stacked`	ro	(optional) signals that currently stacked registers cover only ranges configured for nest4 level. It is used only when ranges configured by `n123_stacking` differs from `n4_stacking`. If the interrupt handler is tailchained to lower nesting level then the core must stack the remaining ranges. Similarly the core can enter nest4 with n123 ranges stacked as well. 1: only nest4 ranges were stacked 0: all ranges stacked as per `n123_stacking`
10	`nmi_lockup`	ro	NMI lockup state, can be cleared only by hart/system reset 1: active 0: inactive
9	`in_nmi_unrecoverable`	ro	unrecoverable NMI handler state, can be cleared only by hart/system reset 1: active 0: inactive
8	`in_nmi`	ro	returnable NMI handler state 1: active 0: inactive
7	`in_nest4`	ro	irq handler at 4th nesting priority state 1: active 0: inactive
6	`in_nest3`	ro	irq handler at 3rd nesting priority state 1: active 0: inactive
5	`in_nest2`	ro	irq handler at 2nd nesting priority state 1: active 0: inactive
4	`in_nest1`	ro	irq handler at 1st nesting priority state 1: active 0: inactive
3	`pending_nest4`	ro	pending status of 4th nesting priority 1: active 0: inactive
2	`pending_nest3`	ro	pending status of 3rd nesting priority 1: active 0: inactive
1	`pending_nest2`	ro	pending status of 2nd nesting priority 1: active 0: inactive
0	`pending_nest1`	ro	pending status of 1st nesting priority 1: active 0: inactive

Note	`nmi_lockup` bit is defacto readable only by debugger
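A value read from `teic_irq_status` could be decoded as in the sketch below. The helper names are hypothetical, and `teic_active_nest_level` assumes that nest4 is the highest nesting priority, which is how the rest of this document uses the term.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* in_nest1..in_nest4 occupy bits 4..7 */
static bool teic_in_nest(uint32_t status, unsigned level)
{
	assert(level >= 1 && level <= 4);
	return ((status >> (3u + level)) & 1u) != 0;
}

/* pending_nest1..pending_nest4 occupy bits 0..3 */
static bool teic_pending_nest(uint32_t status, unsigned level)
{
	assert(level >= 1 && level <= 4);
	return ((status >> (level - 1u)) & 1u) != 0;
}

/* Highest nesting level (1..4) with an active handler, 0 if none. */
static unsigned teic_active_nest_level(uint32_t status)
{
	for (unsigned level = 4; level >= 1; level--)
		if (teic_in_nest(status, level))
			return level;
	return 0;
}
```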

2.6.5. `teic_nmi_cause`

bit	name	type	description
31	reserved	ro
30	`irqretnest0_unrec`	ro	irq return without active irq/nmi
29	`irqentryexit_unrec`	ro	any fault during irq entry/exit (stack alignment, memory faults etc.)
28	`bus_fault_imprecise_unrec`	ro	(optional) imprecise bus faults
27	`hw_integrity_imprecise_unrec`	ro	(optional) imprecise hw integrity error
26	`sw_integrity_imprecise_unrec`	ro	(optional) imprecise sw integrity error
25	`nested_exc_unrec`	ro	synchronous exception raised during execution of nmi handler
24	`escalated_async_unrec`	ro	(optional) escalated `*_async` requests
[23:10]	reserved	rw1c	reserved
9	`clock_async`	ro	(optional)
8	`watchdog_async`	ro	(optional)
7	reserved	ro	reserved
6	`hw_integrity_async`	ro	(optional) asynchronous integrity error not related to the architectural control flow (e.g. unrecoverable ECC error triggered by scrubber or speculative prefetch)
5	reserved	rw1c	reserved
4	`sw_integrity_exc`	rw1c	(optional) software related integrity exceptions e.g. pmp, stacklimit or other control flow violations related to the software.
3	`hw_integrity_exc`	rw1c	(optional) hardware related integrity exceptions e.g. ECC, parity, lockstep or other integrity error on core, memory or buses.
2	`misaligned_address_exc`	rw1c	(optional) misaligned load/store address
1	`bus_fault_exc`	rw1c	memory access faults
0	`illegal_instruction_exc`	rw1c	Illegal instruction exception and misaligned instr

The *_async nmi requests have to be cleared within the source peripheral.
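The three groups of cause bits can be separated with masks derived from the table above. This is a sketch only: the macro and function names are hypothetical, and the acknowledge value relies on the `rw1c` type of the `*_exc` flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TEIC_NMI_EXC_MASK    0x0000001Fu  /* bits 4..0: *_exc, rw1c */
#define TEIC_NMI_ASYNC_MASK  0x00000340u  /* bits 9, 8, 6: *_async */
#define TEIC_NMI_UNREC_MASK  0x7F000000u  /* bits 30..24: *_unrec */

static_assert((TEIC_NMI_EXC_MASK & TEIC_NMI_ASYNC_MASK) == 0 &&
              (TEIC_NMI_ASYNC_MASK & TEIC_NMI_UNREC_MASK) == 0,
              "cause groups must not overlap");

/* True if any *_unrec request is raised. */
static bool teic_nmi_is_unrecoverable(uint32_t cause)
{
	return (cause & TEIC_NMI_UNREC_MASK) != 0;
}

/* Value to write back to acknowledge every raised *_exc flag. */
static uint32_t teic_nmi_exc_ack(uint32_t cause)
{
	return cause & TEIC_NMI_EXC_MASK;
}

/* *_async requests that still need clearing in their source peripheral. */
static uint32_t teic_nmi_async_pending(uint32_t cause)
{
	return cause & TEIC_NMI_ASYNC_MASK;
}
```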

2.6.6. `teic_cfg`

bit	name	type	reset value	description
[31:8]	reserved	WLRL	0	reserved
[7:6]	`n4_stacking`	WARL	implementation specific (highest implemented)	stacking ranges at 4th nesting level. Cannot be set to higher ranges than implemented by lower nestings. Must not be changed within interrupt handler, otherwise behaviour is undefined. 0b00: range 0 0b01: range 0, 1 0b10: range 0, 1, 2 0b11: range 0, 1, 2, 3
5	reserved	WARL	0
4	`access_thread_regs_n1`	WARL	0	(optional) Switches current (part of) register file to thread one if applicable. It has effect only in interrupts at lowest implemented nesting priority. Designated to allow context switching of threads in case of automatic irq shadow registers. 1: thread context remapped 0: no context remap
3	`thread_enter`	WARL	0	added with XTeicRTOS
2	`escalate_async_nmi`	WARL	0	(optional) if `*_async` nmi request is raised during active nmi, it will be escalated to unrecoverable nmi request (i.e. raises `escalated_async_unrec` nmi request) 1: enabled 0: disabled
1	`sleeponexit`	WARL	0	(optional) 1: enabled 0: disabled
0	`zero_jitter`	WARL	0	(optional) Ensure that the highest nesting priority interrupts are always entered within the same number of cycles regardless of the interrupted execution state. Doesn’t affect tailchaining of handlers within the highest nesting priority, as well as irq return procedure. Various deep sleep states are also an exception. It shall be assumed that irq vector table, highest level interrupt code and stack resides in zero waitstated memories and no HW measures will be implemented to adjust for a different scenario. 1: enabled 0: disabled
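Updating the `n4_stacking` field is a plain read-modify-write of bits [7:6]. The helpers below are a sketch with hypothetical names; they operate on a value already read from `teic_cfg` and leave the actual CSR access to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Returns cfg with n4_stacking (bits [7:6]) replaced by the given range. */
static uint32_t teic_cfg_set_n4_stacking(uint32_t cfg, unsigned highest_range)
{
	assert(highest_range <= 3);
	return (cfg & ~(3u << 6)) | ((uint32_t)highest_range << 6);
}

/* Extracts n4_stacking from a teic_cfg value. */
static unsigned teic_cfg_get_n4_stacking(uint32_t cfg)
{
	return (cfg >> 6) & 3u;
}
```

As the field is WARL, software would normally read the register back to see which range was actually accepted.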

2.7. MMIO TEIC registers

private to the hart

offset from `TEIC_MMIO_CTRL_BASE`	entry size	name	non-native access	description
0x0	4	`teic_extra_cfg`	no
0x4	4	`teic_reset_req`	no
0x8	4	`teic_Deffered_pending`	no
0x10	4	`teic_Deffered_request`	no
0x20	4	`teic_irq_pending[32]`	no
0x40	4	`teicMP_irq_enable[32]`	no	added with XTeicMP
0x400	1	`teic_prio_cfg[1023]`	yes

2.7.1. `teic_extra_cfg`

2.7.2. `teic_reset_req`

bit	name	type	reset value	description
[31:16]	reserved	rw	0	reserved
[15]	`nmi_lockup_onreset`	ro	dependent	1: `nmi_lockup` was active prior to reset 0: no `nmi_lockup` prior to reset
[14:11]	`last_reset_cause`	ro	dependent	0b0000: power on reset 0b0001: software reset 0b0010: watchdog reset 0b0011: external reset (master core, RST input pin etc.) other: reserved
[10:3]	`reset_key`	wo	0	write of `0xC5` to this field performs system reset
[2:1]	reserved	rw	0
[0]	`hart_only`	rw	implementation specific	(optional) write 1 together with `reset_key` to reset only hart. If implementation allows only a hart reset, this field reads always 1, 0 otherwise

Note	[45] provides sysreset with excluded debug subsystem, in case of custom debug spec, it should at least provide its own config to exclude itself from reset
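The value written to `teic_reset_req` and the decoding of its read only fields can be sketched as below. The function names are hypothetical; only the field positions and the `0xC5` key come from the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TEIC_RESET_KEY 0xC5u

static_assert((TEIC_RESET_KEY << 3) == 0x628u, "reset_key sits in bits [10:3]");

/* Word to write to teic_reset_req to request a reset. */
static uint32_t teic_reset_request_value(bool hart_only)
{
	return (TEIC_RESET_KEY << 3) | (hart_only ? 1u : 0u);
}

/* last_reset_cause, bits [14:11] */
static unsigned teic_last_reset_cause(uint32_t reg)
{
	return (reg >> 11) & 0xFu;
}

/* nmi_lockup_onreset, bit 15 */
static bool teic_lockup_before_reset(uint32_t reg)
{
	return ((reg >> 15) & 1u) != 0;
}
```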

2.7.3. `teic_Deffered_pending`

bit	name	type	reset value	description
[31:1]	`deffered{i}_pending`	rw1c	0	(optional) pending status of deffered1-deffered31 irq requests
[0]	`deffered0_pending`	rw1c	0	pending status of deffered0 irq request

2.7.4. `teic_Deffered_request`

bit	name	type	reset value	description
[31:1]	`deffered{i}_req`	w1s (wo)	undefined	(optional) write 1 to set deffered1-deffered31 irq requests
[0]	`deffered0_req`	w1s (wo)	undefined	write 1 to set deffered0 irq request

2.7.5. `teic_irq_pending[32]`

For each implemented irq vector, there is corresponding pending bit in pending register at teic_irq_pending[IRQn/32] position.

First 8 bit entries (corresponding to NMIs) are reserved.

bit	name	type	reset value	description
[31:0]	`pending{i}_irq`	ro	0	signals pending status of `IRQn % 32` interrupt
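The word and bit selection described above can be sketched as a lookup helper. `teic_irq_is_pending` is a hypothetical name, and the pointer parameter stands for the address of `teic_irq_pending[0]` so that the same code can be exercised against a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reads the pending bit of irq number irqn from the 32-word
 * teic_irq_pending array. */
static bool teic_irq_is_pending(const volatile uint32_t *irq_pending,
                                unsigned irqn)
{
	assert(irq_pending != 0 && irqn <= 1023);
	return ((irq_pending[irqn / 32u] >> (irqn % 32u)) & 1u) != 0;
}
```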

2.7.6. `teic_prio_cfg[1023]`

Consists of 1023 entries, 1 byte each. First 8 entries (corresponding to NMIs) are reserved.

For each implemented irq vector, there is corresponding priority config register at teic_prio_cfg[IRQn] position.

priority encoding

bit	name	type	description
[8:(9 - `TEIC_IRQ_NESTING_BITS`)]	`nest_prio`	rw	nesting priority bits
[(8 - `TEIC_IRQ_NESTING_BITS`):(9 - (`TEIC_IRQ_NESTING_BITS` + `TEIC_IRQ_PRIORITY_BITS`))]	`sub_prio`	rw	sub-priority bits
[(8 - (`TEIC_IRQ_NESTING_BITS` + `TEIC_IRQ_PRIORITY_BITS`)):0]	reserved	WLRL	reserved

Unimplemented bottom nesting bits are treated as if they were hardwired to 1. If only 1 bit is implemented then only nest2 and nest4 levels are possible.

2.8. additional optional subextensions

2.8.1. XTeicMP

additional per vector entry interrupt enable

private to the hart

2.8.1.1. `teicMP_irq_enable[32]`

For each implemented irq vector, there is corresponding enable bit in "enable" register at teicMP_irq_enable[IRQn/32] position.

First 8 bit entries (corresponding to NMIs) are reserved.

bit	name	type	reset value	description
[31:0]	`enable{i}_irq`	WARL	0	enable control of `IRQn % 32` interrupt 0: disabled 1: enabled

2.8.2. XTeicRTOS

Adds additional RTOS specific features

After thread mode (aka "user" or "unprivileged") is activated by thread_enter bit:

Current sp becomes a defacto thread stack
On irq entry from thread, current sp is swapped with the context of teic_swpspm register which happens after stacking (registers are pushed to thread stack)
Thread mode protects only CSR registers, memory regions should be protected by additional PMP unit.
Interrupts are always executing in machine mode.

2.8.2.1. `thread_enter`

bit in teic_cfg CSR

Setting this bit will make the hart enter thread mode (aka user mode in privileged nomenclature). Once set it cannot be cleared.

Must not be set within interrupt handler, otherwise behaviour is undefined.

Note	It is expected that startup code will turn itself into an idle thread after configuring everything in machine mode.

2.8.2.2. `teic_swpspm`

Holds the stack pointer to be swapped with sp when entering interrupt context.

Note	Separate interrupt stack allows thread stacks to allocate only the area for context switch storage in addition to its own usage (which can be statically analysed)

If access_thread_regs_n1 control bit is implemented, then it switches sp to thread stack as well.
When in effect, the teic_swpspm content is undefined. When another interrupt nests, it pushes registers onto the machine (interrupt) stack.

2.8.3. XTeicTinyIrqTable

Makes each address entry in irq vector table take only 2 bytes in size. (TEIC_IRQ_VECT_ENTRY_SIZE == 2)

The effective address is constructed by concatenation of the 2 bytes of the vector entry content and top 16 bits of TEIC_ENTRY_VECT_BASE implementation constant.

The TEIC_ENTRY_VECT_BASE must be 64KiB aligned.

The entry encoding with the least significant bit set is reserved.

Note	Extension designated for the smallest devices, where the vector table size has a significant code size impact.

Note	SRAM can be used for placing handlers if mapped within the same 64KiB block.
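
A minimal sketch of the address construction, assuming a made up value for TEIC_ENTRY_VECT_BASE (the real constant is implementation defined) and hypothetical function names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value; the real constant is implementation defined. */
#define TEIC_ENTRY_VECT_BASE UINT32_C(0x08000000)

/* Entries with the least significant bit set are reserved. */
static inline bool tiny_entry_is_reserved(uint16_t entry)
{
    return (entry & 1u) != 0u;
}

/* Top 16 bits come from the base constant, low 16 bits from the table entry. */
static inline uint32_t tiny_entry_to_addr(uint16_t entry)
{
    assert((TEIC_ENTRY_VECT_BASE & 0xFFFFu) == 0u);  /* 64KiB alignment */
    assert(!tiny_entry_is_reserved(entry));
    return (TEIC_ENTRY_VECT_BASE & UINT32_C(0xFFFF0000)) | entry;
}
```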

2.8.4. XTeicTinyIrqTableExt

Implies XTeicTinyIrqTable extension.

If the fetched vector entry has the lowest bit set, then the effective address is constructed by concatenating the top 16 bits of the TEIC_EXEC_SRAM_BASE implementation constant with the 2 bytes of the vector entry content.

The TEIC_EXEC_SRAM_BASE must be 64KiB aligned.

Note	It is possible to implement this on devices with large flash memories and resort to compiler tricks, to keep handlers within 64KiB range. But the gains will be relatively low.
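
The base selection can be sketched as below. Both base values are made up for illustration, the function name is hypothetical, and masking the selector bit out of the resulting address is an assumption that the text above does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values; both constants are implementation defined. */
#define TEIC_ENTRY_VECT_BASE UINT32_C(0x08000000)
#define TEIC_EXEC_SRAM_BASE  UINT32_C(0x20000000)

static inline uint32_t tiny_ext_entry_to_addr(uint16_t entry)
{
    assert((TEIC_ENTRY_VECT_BASE & 0xFFFFu) == 0u);  /* 64KiB alignment */
    assert((TEIC_EXEC_SRAM_BASE & 0xFFFFu) == 0u);   /* 64KiB alignment */

    /* the lowest bit selects which 64KiB block the handler lives in */
    uint32_t base = (entry & 1u) ? TEIC_EXEC_SRAM_BASE : TEIC_ENTRY_VECT_BASE;

    /* ASSUMPTION: the selector bit is masked out of the target address */
    return base | (uint32_t)(entry & 0xFFFEu);
}
```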

2.9. XTeicStackLimit

Provides additional CSR registers with stack address thresholds.

Throws the sw_integrity_exc exception when the sp (x2) register is written with a value lower than the one specified in the teic_sp*limit register.

Note	Local arrays can be created on the stack and then accessed through a pointer passed in a working register. Therefore the stack limit comparison must happen on write to the `sp` register.

2.9.1. `teic_sptlimit`

Used for limiting sp when hart is in thread mode or thread_enter == 0.

bit	name	type	reset value	description
[31:3]	`spt_limit`	WLRL	0	top bits of bottom stack threshold, unsigned
[2:0]	reserved	WLRL	0	reserved

2.9.2. `teic_spmlimit`

available only with XTeicRTOS

Used for limiting sp when hart is in interrupt (machine) mode (thread_enter == 1).

bit	name	type	reset value	description
[31:3]	`spm_limit`	WLRL	0	top bits of bottom stack threshold, unsigned
[2:0]	reserved	WLRL	0	reserved
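
The check performed on each write to sp can be modelled roughly as follows. The struct, the function names and the `use_machine_limit` flag are hypothetical; the flag stands for "executing in interrupt (machine) mode after thread_enter was set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct teic_stacklimit_model {
    uint32_t teic_sptlimit;  /* thread mode (or thread_enter == 0) limit */
    uint32_t teic_spmlimit;  /* interrupt (machine) mode limit, XTeicRTOS only */
};

/* Bits [31:3] hold the threshold, bits [2:0] are reserved. */
static inline uint32_t limit_threshold(uint32_t csr_value)
{
    return csr_value & ~UINT32_C(7);
}

/* Returns true when writing new_sp to sp would throw sw_integrity_exc. */
static inline bool sp_write_faults(const struct teic_stacklimit_model *m,
                                   bool use_machine_limit, uint32_t new_sp)
{
    assert(m != 0);
    uint32_t limit = limit_threshold(use_machine_limit ? m->teic_spmlimit
                                                       : m->teic_sptlimit);
    return new_sp < limit;  /* unsigned compare: "lower than" the threshold */
}
```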

3. auxiliary extensions

Additional extensions that are useful alongside XTeic

3.1. Xfenceiext

Because the J extension group is going to simply ignore the fact that the fence.i instruction allocated a whole 22.125 bits of opcode space, and introduce new instructions for operational subsets of fence.i (e.g. IMPORT.I) [38],[39], we don’t need to care about eventual sync with Zjid encodings.

The rationale is that fence.i encodes whole instruction side synchronization with an all zero immediate. Therefore we can remove all of the sync mechanisms, other than the one designated for a certain operation, by inverting the immediate bits.

The uppermost 4 bits remain zero to allow enabling extra features not covered by fence.i.

3.1.1. teic.fence.ipipe

Flushes the pipeline and prefetch buffers before executing next instruction.
Encoded in bit 0 of fence.i immediate

Note	not suitable for synchronizing with architectural state modifications by CSR instructions, use `teic.fence.icsrsync` instead

Mnemonic

teic.fence.ipipe

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0xf, attr: ['MISC-MEM'] },
 { bits: 5, name: 0x0, attr: ['rd'] },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 0x0, attr: ['rs1'] },
 { bits: 12, name: 0x0fe, attr: ['imm'] },
]}

3.1.2. teic.fence.icsrsync

Ensures that the following instructions are executed after the architectural state change by preceding CSR instructions (or equivalent) takes effect. Encoded in bit 1 of the fence.i immediate.

Note	In many cases CSR updates don’t require a full pipeline flush, though it can be implemented as a regular pipeline flush.

Note	Necessary to sync e.g. irq vector table updates wrt following (peripheral) MMIO access.

Note	[41] does require fencing after an update of `jvt` and `mtvec` (even though `jvt` falls into the "program order" category).

Mnemonic

teic.fence.icsrsync

Encoding (RV32, RV64)

{reg:[
 { bits: 7, name: 0xf, attr: ['MISC-MEM'] },
 { bits: 5, name: 0x0, attr: ['rd'] },
 { bits: 3, name: 0x1 },
 { bits: 5, name: 0x0, attr: ['rs1'] },
 { bits: 12, name: 0x0fd, attr: ['imm'] },
]}
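
The two immediates above follow from the inversion rule: the low 8 bits are all ones except for the bit of the selected operation, and the top 4 bits stay zero. A small sketch that reproduces them (enum and function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define OPC_MISC_MEM   0x0Fu  /* opcode[6:0] */
#define FUNCT3_FENCE_I 0x1u   /* funct3 of fence.i */

enum xfenceiext_op {
    XFENCEI_IPIPE    = 0,  /* bit 0 of the immediate */
    XFENCEI_ICSRSYNC = 1   /* bit 1 of the immediate */
};

/* Low 8 bits inverted except for the selected operation; top 4 bits zero. */
static inline uint32_t xfenceiext_imm(enum xfenceiext_op op)
{
    assert(op == XFENCEI_IPIPE || op == XFENCEI_ICSRSYNC);
    return 0xFFu & ~(UINT32_C(1) << op);
}

/* Full 32 bit instruction word with rd = x0 and rs1 = x0. */
static inline uint32_t xfenceiext_encode(enum xfenceiext_op op)
{
    return (xfenceiext_imm(op) << 20) | (FUNCT3_FENCE_I << 12) | OPC_MISC_MEM;
}
```

Until an assembler knows the mnemonics, the resulting words could be emitted with a `.word` directive.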

3.2. Xicsrmz

Implemented similarly to Zicsr, with uimm=0 mapped to the -1 constant.

Note	`csrrsi`/`csrrci` with `uimm=0` still don’t write or cause write side effects.

Note	This extension allows syncing the `csrrwi` instruction with some other extensions [39], so as not to cause additional immediate formats.

Note	`csrrw rd, csr, x0` can still be used to write a zero into csr.
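
The resulting behaviour of the immediate forms can be summarised by a small model. It assumes RV32 (so -1 is 0xFFFFFFFF), ignores WARL/WLRL field behaviour of the target CSR, and uses hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum csr_imm_op { CSRRWI, CSRRSI, CSRRCI };

/* Returns the CSR value after executing the immediate form instruction.
 * old_value is the CSR content before it, uimm is the 5 bit immediate field. */
static inline uint32_t xicsrmz_apply(enum csr_imm_op op, uint32_t old_value, uint32_t uimm)
{
    assert(uimm < 32u);

    /* uimm = 0 is mapped to the -1 constant (all ones on RV32) */
    uint32_t operand = (uimm == 0u) ? UINT32_MAX : uimm;

    switch (op) {
    case CSRRWI:
        return operand;
    case CSRRSI:
        /* uimm = 0 still performs no write */
        return (uimm == 0u) ? old_value : (old_value | operand);
    case CSRRCI:
        return (uimm == 0u) ? old_value : (old_value & ~operand);
    }
    return old_value;  /* not reached for a valid op */
}
```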

3.3. Xtolerantcsr

No CSR access shall raise an exception.

Writes to read only CSRs shall be ignored.
In machine mode, access to unimplemented CSRs is undefined.
In thread mode, access to unimplemented CSRs, as well as higher privilege ones, shall cause no side effects, read a 0 value and have its write ignored.

Note	The `UNIMP` instruction maps to a write into the `cycle` CSR, so it can no longer be used. `c.unimp`, which is encoded as all zeros, remains available.

Note	Extension designated for reduction of silicon use. It reflects the behaviour of certain privileged CSRs (e.g. `misa`, `mvendorid` etc.) when unimplemented.

3.4. Xzcmpt

Implemented similarly to Zcmp, but with an additional immediate bit to accommodate 8 byte aligned stacks, and the following changes.

Note	addi8sp is not required, as the push instruction can prepare the initial allocation with 8 byte granularity.

rlist encoding

RV32E:
case 0: {reg_list="ra"; xreg_list="x1";}
case 1: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 2: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
case 3-15: reserved
RV32I:
case 0: {reg_list="ra"; xreg_list="x1";}
case 1: {reg_list="ra, s0"; xreg_list="x1, x8";}
case 2: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
case 3: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
case 4: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
case 5: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
case 6: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
case 7: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
case 8: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
case 9: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
case 10: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
case 11: {reg_list="ra, s0-s10"; xreg_list="x1, x8-x9, x18-x26";}
case 12: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
case 13-15: reserved

stack_adj_base derivation from rlist

case 0..1:   stack_adj_base = 8
case 2..3:   stack_adj_base = 16
case 4..5:   stack_adj_base = 24
case 6..7:   stack_adj_base = 32
case 8..9:   stack_adj_base = 40
case 10..11: stack_adj_base = 48
case 12:     stack_adj_base = 56
case 13..15: reserved

Valid values:
case 0..1:   stack_adj = [ 8|16|24|32|40|48|56|64]
case 2..3:   stack_adj = [16|24|32|40|48|56|64|72]
case 4..5:   stack_adj = [24|32|40|48|56|64|72|80]
case 6..7:   stack_adj = [32|40|48|56|64|72|80|88]
case 8..9:   stack_adj = [40|48|56|64|72|80|88|96]
case 10..11: stack_adj = [48|56|64|72|80|88|96|104]
case 12:     stack_adj = [56|64|72|80|88|96|104|112]
case 13..15: reserved
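
The RV32 tables above are reproduced by "4 bytes per saved register, rounded up to 8 bytes", plus up to seven extra 8 byte steps. The sketch below encodes that observation; it is derived from the tables rather than stated by the text, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XZCMPT_RLIST_MAX 12u  /* RV32I; RV32E only defines rlist 0..2 */

/* stack_adj_base: 4 bytes per saved register, rounded up to 8 bytes. */
static inline uint32_t xzcmpt_stack_adj_base(uint32_t rlist)
{
    assert(rlist <= XZCMPT_RLIST_MAX);
    uint32_t saved_bytes = 4u * (rlist + 1u);  /* ra plus rlist s-registers */
    return (saved_bytes + 7u) & ~7u;
}

/* A stack_adj is valid when it equals base + 0, 8, ..., 56. */
static inline bool xzcmpt_stack_adj_valid(uint32_t rlist, uint32_t stack_adj)
{
    uint32_t base = xzcmpt_stack_adj_base(rlist);
    return stack_adj >= base
        && stack_adj <= base + 56u
        && (stack_adj - base) % 8u == 0u;
}
```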

register stacking order: currently same as in Zcmp

3.4.1. teic.cm.push

Synopsis: Allocates stack frame and saves registers selected by rlist.
Mnemonic

teic.cm.push {reg_list}, -stack_adj

Encoding

{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  1, name: 'spimm[5]' },
 { bits:  2, name: 'rlist[1:0]' },
 { bits:  2, name: 'spimm[4:3]' },
 { bits:  2, name: 'rlist[3:2]' },
 { bits:  1, name: 0 },
 { bits:  2, name: 0x0 },
 { bits:  1, name: 0 },
 { bits:  3, name: 0x5, attr: ['C.FSDSP'] },
],config:{bits:16}}

3.4.2. teic.cm.pop

Synopsis: Deallocates stack frame and loads registers selected by rlist.
Mnemonic

teic.cm.pop {reg_list}, stack_adj

Encoding

{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  1, name: 'spimm[5]' },
 { bits:  2, name: 'rlist[1:0]' },
 { bits:  2, name: 'spimm[4:3]' },
 { bits:  2, name: 'rlist[3:2]' },
 { bits:  1, name: 1 },
 { bits:  2, name: 0x0 },
 { bits:  1, name: 0 },
 { bits:  3, name: 0x5, attr: ['C.FSDSP'] },
],config:{bits:16}}

3.4.3. teic.cm.popret

Synopsis: Deallocates stack frame, loads registers selected by rlist and returns.
Mnemonic

teic.cm.popret {reg_list}, stack_adj

Encoding

{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  1, name: 'spimm[5]' },
 { bits:  2, name: 'rlist[1:0]' },
 { bits:  2, name: 'spimm[4:3]' },
 { bits:  2, name: 'rlist[3:2]' },
 { bits:  1, name: 1 },
 { bits:  2, name: 0x0 },
 { bits:  1, name: 1 },
 { bits:  3, name: 0x5, attr: ['C.FSDSP'] },
],config:{bits:16}}

Description: The ra register may not be populated.

3.4.4. teic.cm.popretz

Synopsis: Deallocates stack frame, loads registers selected by rlist, writes zero to a0 and returns.
Mnemonic

teic.cm.popretz {reg_list}, stack_adj

Encoding

{reg:[
 { bits:  2, name: 0x2, attr: ['C2'] },
 { bits:  1, name: 'spimm[5]' },
 { bits:  2, name: 'rlist[1:0]' },
 { bits:  2, name: 'spimm[4:3]' },
 { bits:  2, name: 'rlist[3:2]' },
 { bits:  1, name: 0 },
 { bits:  2, name: 0x0 },
 { bits:  1, name: 1 },
 { bits:  3, name: 0x5, attr: ['C.FSDSP'] },
],config:{bits:16}}

Description: The ra register may not be populated. Unlike in Zcmp, the load to a0 is non-atomic.
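
The four encodings differ only in bits 9 and 12. A sketch of an encoder following the field layout of the diagrams above (least significant field first); it assumes that spimm is the extra adjustment in bytes on top of stack_adj_base, stored in spimm[5:3], and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum xzcmpt_op {           /* value = {bit 12, bit 9} of the encoding */
    XZCMPT_PUSH    = 0x0,  /* bit 12 = 0, bit 9 = 0 */
    XZCMPT_POP     = 0x1,  /* bit 12 = 0, bit 9 = 1 */
    XZCMPT_POPRETZ = 0x2,  /* bit 12 = 1, bit 9 = 0 */
    XZCMPT_POPRET  = 0x3   /* bit 12 = 1, bit 9 = 1 */
};

/* rlist: 0..12, spimm: stack_adj - stack_adj_base in bytes
 * (ASSUMPTION: a multiple of 8 in the range 0..56). */
static inline uint16_t xzcmpt_encode(enum xzcmpt_op op, uint32_t rlist, uint32_t spimm)
{
    assert(rlist <= 12u);
    assert(spimm <= 56u && spimm % 8u == 0u);

    uint32_t insn = 0x2u;                        /* [1:0]   C2 quadrant */
    insn |= ((spimm >> 5) & 0x1u) << 2;          /* [2]     spimm[5] */
    insn |= (rlist & 0x3u) << 3;                 /* [4:3]   rlist[1:0] */
    insn |= ((spimm >> 3) & 0x3u) << 5;          /* [6:5]   spimm[4:3] */
    insn |= ((rlist >> 2) & 0x3u) << 7;          /* [8:7]   rlist[3:2] */
    insn |= ((uint32_t)op & 0x1u) << 9;          /* [9]     op select, low bit */
    /* [11:10] are zero */
    insn |= (((uint32_t)op >> 1) & 0x1u) << 12;  /* [12]    op select, high bit */
    insn |= 0x5u << 13;                          /* [15:13] C.FSDSP funct3 */
    return (uint16_t)insn;
}
```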

3.4.5. todo: mva/mvs

those are quite annoying on rve

Appendix A: irq atomic block

mask out all interrupts

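#include <stddef.h>

void foo(void)
{
	size_t tmp;
	// atomically clear the selected mask bits, previous value lands in tmp
	asm volatile(
		"csrrci %[out], teic_irq_msk, 0b01111 \n\t"
		: [out] "=r" (tmp) :: "memory");
	//
	// execute code with irq disabled
	//
	// restore the previous mask
	asm volatile("csrw teic_irq_msk, %[in] \n\t" :: [in] "r" (tmp) : "memory");
}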

mask out only nest1 level

#include <stddef.h>

void foo(void)
{
	size_t tmp;
	// atomically clear only the nest1 bit, previous value lands in tmp
	asm volatile(
		"csrrci %[out], teic_irq_msk, 0b00001 \n\t"
		: [out] "=r" (tmp) :: "memory");
	//
	// execute code with irq disabled
	//
	// restore the previous mask
	asm volatile("csrw teic_irq_msk, %[in] \n\t" :: [in] "r" (tmp) : "memory");
}

Appendix B: RTOS context switch

Appendix C: vendor software support packages

what headers, definitions, names etc. must be provided.

Appendix D: design decisions

D.1. no cause code

The cause code can be implied from the hardcoded vector table position, or from the peripherals' state if the handler is shared. Therefore it’s redundant. The other issue is that it has to be somehow preserved during nesting.

Note	NMIs are handled through `teic_nmi_cause` CSR.

D.2. no single bit interrupt enable

It would be redundant to the irq_msk nest enables, which can be similarly managed by csrsi, csrci instructions.

D.3. no `misa` register

It’s useless.

Will it tell you whether Zbb, Zmmul or Zcmt is implemented? No.

On embedded targets, HW information about implemented extensions, and the ability to enable/disable them, has a rather low value.

D.4. stacking of floating point and vector registers

currently ???

Zfinx ???

Those can still be handled by IPRA anyway. FP push/pop instructions might be useful in such a case.

D.5. undefined initial state of architectural registers

It is said that registers have to be zeroed at reset "to protect software from itself" [36]. It doesn’t; it just hides bugs until they manifest in the worst possible scenario, just like developing and debugging code at -O0.

This kind of use of uninitialized variables is UB in C/C++ and easily detectable by compilers. Languages like Rust or Ada are supposed to be free from this UB, so there is no need to spend transistors or code memory on zeroing those.

Note	V extension uses all ones for `tail agnostic` filling just to prevent software from relying on uarch dependent zeroing.

However, certain hardened cores may need to have all registers initialized to a consistent state, so as to avoid integrity faults when stacking out yet unused registers. In some cases, it’s still possible to require initialization of all registers in startup code instead.

D.6. little endian only

Why would you want to have big endian loads/stores?
Probably for handling tasks that compute "network byte order" data which uses big endian representation.

Nice. So, let’s add a big-endian mode (making it configurable at runtime of course), and enjoy mandatory endian neutral loads/stores ([37]) used by networking libraries, because one cannot be sure which endianness the code will be run on.

Just use rev8 for "network order" data. It’s much better than doing endian neutral access.
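
To illustrate the difference, here is a sketch of both approaches in portable C. The function names are hypothetical. Whether a given compiler actually lowers the byte reversal to a single rev8 (when Zbb is available) is an expectation, not something verified here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte reversal of a 32 bit word; the operation rev8 performs on RV32. */
static inline uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Load a big endian ("network order") field on a little endian hart:
 * one native load plus one byte reversal. */
static inline uint32_t load_be32_le_host(const void *p)
{
    assert(p != NULL);
    uint32_t raw;
    memcpy(&raw, p, sizeof raw);  /* native, alignment safe load */
    return bswap32(raw);
}

/* Endian neutral variant for comparison: four byte loads plus shifts. */
static inline uint32_t load_be32_neutral(const void *p)
{
    assert(p != NULL);
    const uint8_t *b = p;
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```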

Big endianness is also inefficient to handle in vector registers.

D.7. `TEIC_MMIO_CTRL_BASE` address selection

addressable through c.lui + offset

D.8. no csr scratch registers

Unlike the big unix machines, the RTOS context can be statically addressed by lui + addi sequence.

With hardware stacking there is no need to free up additional registers.
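
For reference, this is how a static 32 bit address splits into the lui and addi immediates (what the assembler does for %hi/%lo). A sketch with hypothetical names; the example address in the usage note is made up.

```c
#include <assert.h>
#include <stdint.h>

struct lui_addi_pair {
    uint32_t hi20;  /* immediate for lui, 20 bits */
    int32_t  lo12;  /* immediate for addi, sign extended 12 bits */
};

static inline struct lui_addi_pair split_address(uint32_t addr)
{
    struct lui_addi_pair p;
    /* addi sign extends its immediate, so the upper part is rounded */
    p.hi20 = ((addr + 0x800u) >> 12) & 0xFFFFFu;
    p.lo12 = (int32_t)(addr & 0xFFFu);
    if (p.lo12 >= 0x800) {
        p.lo12 -= 0x1000;
    }
    return p;
}

/* What the lui + addi pair computes at run time. */
static inline uint32_t join_address(struct lui_addi_pair p)
{
    assert(p.hi20 <= 0xFFFFFu);
    assert(p.lo12 >= -2048 && p.lo12 <= 2047);
    return (p.hi20 << 12) + (uint32_t)p.lo12;
}
```

For example, 0x20000FFC splits into hi20 = 0x20001 and lo12 = -4. When the low 12 bits are zero, the addi is not needed at all.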

D.9. 1023 vector entries

One entry less than full 1024 due to 2s complement jump immediate.

This is the biggest capacity that can be escaped by single c.j instruction from a first entry in case of TEIC_IRQ_VECT_ENTRY_SIZE == 2 (XTeicTinyIrqTable)
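
The arithmetic behind the number, as a sketch: c.j carries a 12 bit signed offset with an implicit zero bit 0, so its largest forward reach is 2046 bytes, which is 1023 two byte entries. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CJ_IMM_BITS              12  /* c.j offset: signed, bit 0 implicitly zero */
#define TEIC_IRQ_VECT_ENTRY_SIZE 2   /* XTeicTinyIrqTable */

/* Largest forward offset of c.j in bytes: 2^11 - 2 = 2046. */
static inline int32_t cj_max_forward_offset(void)
{
    return (INT32_C(1) << (CJ_IMM_BITS - 1)) - 2;
}

/* Number of table entries a c.j placed in the first entry can jump over. */
static inline int32_t max_vector_entries(void)
{
    return cj_max_forward_offset() / TEIC_IRQ_VECT_ENTRY_SIZE;
}

/* Compile time check of the capacity given in this section. */
static_assert(((1 << (CJ_IMM_BITS - 1)) - 2) / TEIC_IRQ_VECT_ENTRY_SIZE == 1023,
              "c.j reach must cover 1023 two byte entries");
```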

This is also more than enough for any microcontroller.

D.10. no per irq pending/enable in base extension

It is simply redundant to the in-peripheral enables, as well as to the nestx interrupt enables.

It has a use case only when the same interrupts are routed to multiple harts, or when peripheral interrupt lines are shared across multiple master units (e.g. FIFO empty irq signal shared with DMA).

D.11. no nmi/exception nesting

Nesting NMIs is an easy way to overflow the stack, or to greatly increase the worst case in static stack analysis (if there is even a bound).

It also becomes an issue in pure HW state preservation by estate_nl or shadow registers.

Normally such a condition is very rare and is usually a sign of bad coding, or of a much more serious hardware issue that’s causing everything to fail at the same moment.

D.12. no software triggered interrupts

aka software trigger in ARM terminology [47]

Designated for triggering unallocated (or unused peripheral) vectors by writing to the special NVIC→STIR register, which is of course redundant to the use of NVIC→ISPRx registers.

However, it’s rarely used and only "implemented" vectors can be triggered in such a way. Officially it is supposed to be 32 entry granularity in the ARM case, but it’s not even obvious whether you can use unimplemented vectors at all. [48]

Note	Even the PendSV is done by setting the `ICSR→PENDSVSET` bit instead of using this mechanism.

Note	TEIC instead provides a dedicated "peripheral" for handling software (deferred) interrupts.

All of this causes a lot of redundancy just to allow handling peripheral interrupts and "software" triggered ones by the same handler. The ARM implementation also depends on the edge triggered irq mechanism, which is also omitted by XTeic.

D.13. no stack realignment upon interrupt entry/exit

This is just a waste of hardware. The ABI should mandate the alignment instead. If not followed then the microarchitecture should be allowed to trap.

Note	some architectures, due to legacy codebases, require explicit stack alignment instructions which also contribute to interrupt latency/jitter and impact code density.

D.14. "zero jitter" only in highest nesting level interrupts

It doesn’t make sense to implement "zero jitter" at any other level. If a given interrupt can be interrupted by a higher nesting priority, then it would no longer be considered a "zero jitter" one.

Note	NMIs can still break the "zero jitter" guarantee, though those should be considered as a rare fault/error condition.

D.15. only level triggered interrupts

Peripherals usually implement level triggered interrupts (i.e. they require clearing the trigger source by performing certain actions, like reading FIFO registers or clearing the status flags).

Therefore it’s wasteful to spend additional resources (e.g. a latch for pending status and the related clear on irq entry) on the edge triggered mechanism, which is made redundant on every irq line (see D.12. no software triggered interrupts).

Note	Sampling edges on GPIO is usually done by a separate peripheral that turns those into level triggered ones.

D.16. no faulting addr register

aka `mtval`, which is often not implemented anyway, even by uarchs without unaligned load/store support.

Due to the lack of an MMU, the memory access exceptions are considered fatal errors anyway.

The faulting address can still be recovered in a more complex way, by decoding the faulting instruction.

D.17. no (default) "legacy" interrupt modes

Having our cores boot with "legacy" interrupt modes:

is a waste of transistors
would require sync with the CLIC mode/submode encodings (or be incompatible with CLIC, which is of course unwanted when lengthening the "flexibility" bar)
causes an interrupt hole, or additional boilerplate code to handle exceptions/NMIs that arrived before setting up mtvec and thus were routed to the reset handler entry.

Note	There was even a CVE related to uninitialized `mtvec`: [43]

This also allows us to use a vector address with the two lowest bits zeroed, which, in some scenarios, allows setup of the vector table address with a single lui instruction.

Also, in cores designated to work in vectored mode, mtvec has the bottom address lines hardwired to 0. This leads to a large alignment granularity of the unvectored handler (e.g. on ch32v003 it’s 1KiB), making the unvectored mode handler either share its entry with startup code or require large alignment.

D.18. no sub-priority reflected in any status registers

Sub-priority is used only during irq handler dispatch. A current priority field would consume additional circuitry to latch in the sub-priority of the current handler.

Additionally, the current sub-priority field would have to be somehow preserved during nesting.

D.19. only 4 irq nesting levels

It’s enough for a great majority of use cases, not to mention that a lot of applications would be fine with just 1 nesting level.

Adding more nesting levels will diminish the gains from tail chaining.

D.20. no syscall (in XTeicRTOS)

Problematic to properly implement.

Offers less separation of kernel structures from the thread (by MPU). Though cortex-m port of FreeRTOS uses it only to start a first thread.

D.21. no SEV/WFE

Most use cases are redundant to wfi. (e.g. SEVONPEND)

The SEV from irq method is rarely used and is supposed to reduce wakeups from high frequency interrupts which can be handled by teic.wfi.n4ign instead.

Bibliography

[1] https://github.com/emb-riscv/specs-markdown
[2] https://github.com/riscv/riscv-fast-interrupt/blob/master/clic.adoc
[3] https://github.com/riscv/riscv-aclint/blob/main/riscv-aclint.adoc
[4] https://starfivetech.com/uploads/sifive-interrupt-cookbook-v1p2.pdf
[5] https://github.com/riscv/riscv-plic-spec
[6] https://github.com/riscv/riscv-aia
[7] https://github.com/jnk0le/simple-crt/blob/master/cm0/combotablecrt_stm32f030x6.S
[8] https://reviews.llvm.org/D23980
[9] https://github.com/YosysHQ/picorv32#custom-instructions-for-irq-handling
[10] https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/znKeVnmxsy8/m/NtdDII3kAAAJ
[11] riscv/riscv-fast-interrupt#108
[12] https://github.com/T-head-Semi/thead-extension-spec
[13] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/beginner-guide-on-interrupt-latency-and-interrupt-latency-of-the-arm-cortex-m-processors
[14] https://www.ti.com/lit/an/spracs0a/spracs0a.pdf?ts=1677348911359
[15] https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/intro.html
[16] https://jaycarlson.net/embedded-linux/
[17] https://elinux.org/images/d/de/Real_Time_Linux_Scheduling_Performance_Comparison.pdf
[18] https://static.lwn.net/lwn/images/conf/rtlws11/papers/proc/p19.pdf
[19] https://people.mpi-sws.org/~bbb/papers/pdf/ospert13.pdf
[20] https://www.osadl.org/fileadmin/events/rtlws-2007/Siro.pdf
[21] https://riscv.org/wp-content/uploads/2018/07/DAC-SiFive-Drew-Barbier.pdf
[22] https://www.ti.com/lit/an/spraan9a/spraan9a.pdf?ts=1677877354340
[23] https://www.ti.com/lit/ug/spru430f/spru430f.pdf?ts=1677869437551
[24] https://www.ti.com/lit/ug/spruhs1c/spruhs1c.pdf?ts=1677888169020
[25] https://e2e.ti.com/support/processors-group/processors/f/processors-forum/905744/tms320f28335
[26] https://e2e.ti.com/support/microcontrollers/c2000-microcontrollers-group/c2000/f/c2000-microcontrollers-forum/567535/tms320f28377d-dmips-calculation
[27] https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/_static/pdf/C2000_CLA_Software_Development_Guide.pdf
[28] http://www.wch-ic.com/downloads/QingKeV2_Processor_Manual_PDF.html
[29] http://www.wch-ic.com/downloads/QingKeV3_Processor_Manual_PDF.html
[30] http://www.wch-ic.com/downloads/QingKeV4_Processor_Manual_PDF.html
[31] https://github.com/riscv-non-isa/riscv-eabi-spec
[32] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56165#c2
[33] https://github.com/riscv-non-isa/riscv-elf-psabi-doc
[34] https://www.ti.com/lit/wp/swpy031/swpy031.pdf
[35] https://www.brianchavens.com/2018/09/20/motor-control-microcontroller-performance-comparison/
[36] openhwgroup/cv32e40p#221
[37] https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aes-armv4.pl#L216
[38] https://github.com/riscv/riscv-j-extension/blob/master/id-consistency-proposal.pdf
[39] https://lists.riscv.org/g/tech-j-ext/message/481
[39] https://github.com/jnk0le/XTightlyCoupledIO
[40] https://software-dl.ti.com/trainingTTO/trainingTTO_public_sw/c28x28035/C28x_Piccolo_MDW_2-1.pdf
[41] https://docs.openhwgroup.org/_/downloads/cv32e40s-user-manual/en/latest/pdf/
[43] https://youtu.be/iz_Y1lOtX08?t=1740
[44] https://github.com/riscv/riscv-isa-manual/pull/912/commits/869dcc608e11f9680e950bcb20a9b8294d2b82bd
[45] https://github.com/riscv/riscv-debug-spec
[46] https://github.com/openwch/ch32v003/blob/main/RISC-V%20QingKeV2%20Microprocessor%20Debug%20Manual.pdf
[47] https://developer.arm.com/documentation/dui0553/a/
[48] https://stackoverflow.com/questions/72523639/arm-cortex-m3-add-a-new-interrupt-to-the-end-of-the-vector-table
[49] https://lists.llvm.org/pipermail/cfe-dev/2016-July/050022.html
[50] riscv/riscv-fast-interrupt#314
[51] https://e2echina.ti.com/cfs-file/_key/communityserver-discussions-components-files/56/5504.2803x-CLA-_2800_1_2900.pdf
[52] https://www.ti.com/lit/an/spracw5a/spracw5a.pdf
[53] https://arxiv.org/pdf/2311.08320
[54] https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3.pdf
