- revision history
- preface
- 1. Introduction
- 1.1. prior art
- 1.1.1. cortex-m NVIC
- 1.1.2. CLIC
- 1.1.3. CV32RT fastirq
- 1.1.4. emb-riscv
- 1.1.5. CLINT
- 1.1.6. generic riscv interrupts as described in "privileged" volume II
- 1.1.7. PLIC/AIA
- 1.1.8. CH32 PFIC
- 1.1.9. RNMI (aka returnable NMI)
- 1.1.10. PicoRV32 interrupts
- 1.1.11. ti c2000 (main core)
- 1.1.12. ti c2000 CLA
- 1.1.13. Xh3irq
- 1.2. overview/discussion of some concepts/features
- 1.2.1. whole app must be doable in C/C++
- 1.2.2. ABIs with less caller saved registers
- 1.2.3. "you are better off with soft stacking in inline handlers"
- 1.2.4. EABI for RVE must be subset of RVI EABI.
- 1.2.5. one universal standard for everyone use cases
- 1.2.6. special handler return pattern
- 1.2.7. vector tables that are jumped to
- 1.2.8. MMIO vs CSR mapped config registers
- 1.2.9. "reduced/zero jitter"
- 1.2.10. "everything will run Linux in future"
- 1.2.11. lazy stacking
- 1.2.12. 64bit microcontrollers
- 1.3. required ABI
- 1.4. debug
- 1.5. tooling issues to solve
- 2. XTeic (aka Total Embedded Interrupt Controller)
- 3. auxiliary extensions
- Appendix A: irq atomic block
- Appendix B: RTOS context switch
- Appendix C: vendor software support packages
- Appendix D: design decisions
- D.1. no cause code
- D.2. no single bit interrupt enable
- D.3. no `misa` register
- D.4. stacking of floating point and vector registers
- D.5. undefined initial state of architectural registers
- D.6. little endian only
- D.7. `TEIC_MMIO_CTRL_BASE` address selection
- D.8. no csr scratch registers
- D.9. 1023 vector entries
- D.10. no per irq pending/enable in base extension
- D.11. no nmi/exception nesting
- D.12. no software triggered interrupts
- D.13. no stack realignment upon interrupt entry/exit
- D.14. "zero jitter" only in highest nesting level interrupts
- D.15. only level triggered interrupts
- D.16. no faulting addr register
- D.17. no (default) "legacy" interrupt modes
- D.18. no sub-priority reflected in any status registers
- D.19. only 4 irq nesting levels
- D.20. no syscall (in XTeicRTOS)
- D.21. no SEV/WFE
- Bibliography
Jan Oleksiewicz
document version 0.36.3
extension status: unstable/PoC
This document is released under a Creative Commons Attribution 4.0 International License
This document uses semantic versioning with respect to potential hardware designs. Assembly syntax change is a minor increment. Version 1.0.0 will be the first somewhat useable. Changes in prior versions are not versioned properly and not tracked in revision history. The number in a major revision doesn’t hold the freeze or ratification status.
This document is written in a way that reduces duplication, as duplicated content is hard to maintain.
Even though the current risc-v "privileged" architecture is great for general unix systems, it fails to meet many embedded and hard real time requirements.
Instead of adding more and more on top of layered legacy, which leads to silicon waste, let’s replace the entire volume II (aka riscv privileged) with a minimal yet efficient embedded architecture.
The goal is to achieve an interrupt architecture capable of predictable and fast
control loops by providing minimal interrupt latency and jitter.
Single digit cycle counts of interrupt latency to actual code and true zero jitter are offered as options,
so as not to burden minimal implementations.
By leveraging the general purpose computing capability of the risc-v architecture, we can
avoid the need for separate cores (often with asymmetric architectures) to offload
low priority tasks (communication, HMI etc).
The lack of many "legacy" functionalities allows reduction of silicon area, power, and verification costs.
A quick recap of what we already have available.
[13] The de facto established "industry standard" of efficient interrupt handling. Anyone complaining about risc-v likes and wants the NVIC.
The addition of trustzone in armv8m increases the interrupt latency/jitter due to the need to preserve and zero extra "unnecessary" registers (to prevent potential leaks).
CLIC is the designated go-to for interrupt handling, meant to fulfill everyone's needs.
Attempts to be a unix capable interrupt controller with horizontal nesting of U, S, H (so far only proposed) and M mode.
All used registers must be saved in software; trampoline handlers need to save all ABI registers.
If interrupts can be taken at multiple privilege modes, then each handler at a higher privilege
has to swap the stack pointer (and interrupt level ??) using 2 additional CSR instructions per handler
(during vertical nesting those instructions just copy the `rs1` operand).
Preemption is handled in software by a special CSR mechanism that requires extra boilerplate code in every interrupt handler, even in "inline" handlers.
It should be possible to make the highest priority inline handlers similar to legacy ones.
Trampoline handlers mimic the late arrival and tail chaining optimizations. Currently trampoline handlers cannot be used alongside "inline" handlers [50].
Introduces unavoidable jitter due to:
- blocks of code executed with disabled interrupts (additive jitter)
- late arrival handled through mnxti read (subtractive jitter of entry time)
- tail chaining handled by another mnxti read (and extra branch) in epilogue
- indirect jump instruction to actual code (branch prediction)
Assuming 1 cycle per instruction, listings 10.2 and 11.1 from the CLIC spec offer:
- `entry + 6` cycles of jitter from "inline" handlers
- `entry + 7 + 16` cycles of jitter from "C-ABI" trampoline entry
- `4 + exit` or `abs(entry - 7)` cycles of jitter from "C-ABI" trampoline epilogue
Note: trampoline jitter can be reduced by 16 cycles of register stacking at the cost of late arrival handling
Note: according to [21], handler entry time is 6 cycles on sifive E2 and 10 cycles in E3/5.
Note: BTW, my prediction is that the "competitor A" will be able to do a "comparison against riscv" without resorting to FUD tactics, right after CLIC is ratified
Typical CLIC interrupt latency was measured at 33 (inline handler) and 42 (trampoline) cycles for CV32E40P [53].
CV32RT "fastirq" [53] extends CLIC by moving prologue handling entirely into the hardware as well as introducing background lazy stacking from a shadow register set.
The epilogue is still handled in software.
Tail chaining is supported by the `emret` instruction, but a late arrival (higher priority) will have to
wait for the background stacking to finish.
As a consequence there is jitter equal to the stacking window.
emb-riscv [1] is a clean sheet design that attempts to be a universal solution for every microcontroller. Designed with a strong focus on RTOS support.
Note: Currently development is stalled due to "not encouraging general interest"
Achieves lower interrupt latency by introducing an EABI with a reduced number of caller-saved registers. FP registers are handled by lazy stacking.
Many similarities with NVIC.
Mandates four 64bit timers (even on RV32):
- cycle counter
- instret counter
- system timer
- rtc timer
Attaches to generic interrupt scheme.
According to CLINT, it provides memory mapped interface for timers and IPI.
Note: the official CLINT is called ACLINT but doesn’t differ much from the CLINT in sifive documentation.
Very often referred to as CLINT, e.g. [4].
Has an optional vectored mode which simply jumps to the position in the vector table.
Doesn’t provide any nesting other than privilege levels or complex boilerplate code to disable retaking active interrupts.
Registers and CSR state (`fcsr` etc.) have to be pushed by software before use.
A heavyweight frontend for delivering interrupts to multiple cores running a typical unix OS. Not suitable for microcontrollers.
claim/complete architecture
handlers stay very similar to generic case.
AIA adds another set of CSR registers available only through an indirect access
mechanism (by `miselect` and `mireg` CSRs).
Introduces HW stacking and single cycle register shadowing (aka HPE).
It is of course necessary to use a custom toolchain that implements a "proprietary" attribute:
__attribute__((interrupt("WCH-Interrupt-fast")))
Note: without the prestacked annotation there will be no portable way of doing this without compilers built on custom patches. The naked handler + mret trick doesn’t work in llvm, and it should break in gcc anyway due to eventual use of callee saved registers and stack.
Another feature is the "vector table free" interrupt mechanism that allows skipping the fetch from the vector table and jumping to the handler directly. It provides significant improvement only when all registers are "stacked" by the shadow regfile (or not stacked at all).
The descriptions of a lot of functional behaviour feel like a copy-paste of risc-v privileged.
Highly under/undocumented.
E.g. there is nothing about what happens to `mepc`, `mcause` or `mstatus` during nesting (especially on "V2" core).
It is also unknown whether the `ra` register has an additional use (like saving `mepc`…) during
interrupt entry/exit and cannot be used immediately, as the currently implemented gcc attribute treats
those functions the same way as the regular ABI ones with `mret` based return.
In line with average chinese documentation standards.
The vendor provided headers, of course, contain 46 instances of the "NVIC" string and just 5 of "PFIC".
There is also an under/undocumented "EABI enable" bit in `INTSYSCR` on "V2" core.
Most probably it reduces the number of HW stacked registers to match the official EABI proposal [31].
QingKeV4 implements 3 shadow register sets (aka HPE), given to handlers on a first come, first served basis. The result is that only the 3 lowest level handlers can practically use shadow registers.
Note: suppressing dynamic nesting by HWSTKOVEN would cause priority inversion.
[44] Adds another horizontal nesting level above the machine mode, that works very similarly
to generic interrupts.
Achieved by providing an additional set of CSR registers as well as an interrupt return instruction (`mnret`).
Note: The IRQ handling features in PicoRV32 do not follow the RISC-V Privileged ISA specification. Instead a small set of very simple custom instructions is used to implement IRQ handling with minimal hardware overhead.
The original author of PicoRV found riscv-privileged to be too heavy for a minimal core, and provided his own [9] interrupt scheme.
Note: minimal FPGA cores are a non-goal for XTeic
Proprietary TI architecture [23] sporting an ancient looking accumulator-memory architecture (with 8 pointer registers), similar to the classic CISCs. An x86 of motor control and signal processing. FPU [24] is more RISC-ish with a bit of VLIW in some instructions.
Note: TI is very hesitant to release any general purpose benchmark scores (speed/size etc.) [25], [26], claiming that their architecture "is optimized for real world control applications". Those kinds of scores are also almost non-existent in independent sources.
According to [22], the core automatically saves some of the registers; the rest must be pushed
in software.
"High priority" interrupts can also save and restore all 8 floating point registers into shadow
registers using special instructions.
There are also 5 (4 in prologue) de facto useless instructions for aligning the stack and setting "C28 modes".
To allow nesting of "low priority" interrupts, handlers must include extra boilerplate code to handle priority masking in software (8 instructions in prologue, 3 in epilogue).
As a consequence there are 21 cycles of jitter (to HPI and other LPIs) and 43 (HPI) or 63 (LPI) cycles of interrupt latency in the worst case.
Use of the `RPT` instruction will introduce even more jitter and latency, as the sequence is uninterruptible
and takes an arbitrary number of cycles to execute.
Note: ISR entry latency is 10 cycles due to the 8 stage pipeline and automatic stacking of 13 registers. [40] suggests that the latency is 14 cycles for internal signals, which would further increase the worst case jitter and latencies.
CLA [51] is a separate coprocessor designated to offload
the main core from control loop tasks, "freeing it to handle other tasks such as
handling communication stacks".
Exactly those workloads that are general purpose tasks
for which "c2000 architecture was not optimized for".
Offers fewer registers/instructions and lacks TMU so it’s not always faster than the main core.
Can be used as a true coprocessor for delegation of certain tasks to it. According to [52] this mode of operation brings just a 12% improvement in the motor FOC current loop.
Xh3irq extension (as implemented by hazard3) [54] provides nested and vectored
interrupt handling that is conceptually similar to the CLIC (`mnxti`) trampoline.
Unlike CLIC, the dispatcher has to index the pointer array in software (by using the index from `meinext`).
The example handler implements only a jump table but it can be easily converted into a pointer table.
Access to configuration bits of all 512 inputs is performed by inline windowing of configuration CSRs, which is incompatible with zicsrind.
In this case interrupts must always push all caller saved registers to be able to use functions without the
`__attribute__((interrupt*))` annotation, leading to ABIs with less caller saved registers.
It also requires a preinitialized table with pointers to startup code, `sp`, `gp`, and of course
any other addition like the Zcmt `JVT` csr.
This table is also not necessarily smaller than software setup, e.g. `sp` can usually be
set up with a single `lui` instruction.
There is still a risk of corruption if the compiler decides to reorder something before
the initialization of the `.data`/`.bss` sections.
Such startup code is also inefficient as it will have to obey the ABI (spill `ra` to stack) and
compilers can’t optimize out link time symbols anyway (even though some can be assumed to
always be at certain addresses or offsets from each other).
Of course I often find that there is a competition on who will make the worst startup code in assembly. So pure C/C++ startup code turns out to be "better" due to confirmation effect. But let’s have a look at my "combotablecrt" implementation [7] for stm32f030x4/6. Is your compiler able to do that?
There is also a case of interrupt handlers that are using only a few registers and don’t need to take latency of the whole ABI/EABI.
The rationale for introducing ABIs with a reduced number of caller saved registers is to reduce interrupt latency.
The major downside of such an approach is lowered overall performance and code density, which is widely disliked across the riscv community [10] and stalls development of such an (E)ABI.
I think for marketing reasons we should have the RISC-V EABI mimic the competitor ABI as closely as possible, and be available and supported by the tools, even if almost no-one should end up actually using it.
Zcmp[e] was also prepared for such fragmentation by reserving first 4 points in rlist for EABI, so the cores can implement UABI and EABI push/pop instructions at the same time. Those 4 points are, of course, supposed to handle 20 caller saved regs of EABI (probably with some reuse of few higher points).
It will also leave processors capable of stacking 2 registers per cycle underutilized during HW stacking, as the stacking time becomes shorter than the pipeline refill.
An alternative is to provide interrupts with defacto customizable ABIs by e.g. prestacked annotation (to match the HW stackers) and handle the function call pressure by IPRA.
aka generic riscv __attribute__((interrupt))
The major issue lies within the principles of hardware stackers.
When entering an interrupt handler, the core first fetches the entry from the vector table and then jumps to that address. Both of those fetches can hit a flash waitstate or a cache miss. During that operation the data bus remains idle, waiting for the first store instruction to be executed.
Those cycles can be used for "free" stacking of registers. If a higher number of registers is stacked then it can hide a bit of the jitter coming from cache misses or flash waitstates.
Even stacking by special push instructions (e.g. XTheadInt [12] or PUSHINT [11] and maybe subsets of those) won’t help much. Those start pushing only after the latency of the double (waitstated) miss was taken.
The only situation when soft stacking yields better results is when the HW stacker has to push way more registers than are actually used.
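The overlap argument above can be put into a rough cycle model. The following is only a sketch under simplifying assumptions: the structure, its field names and the perfect overlap of stacking with both fetches are hypothetical and not taken from any real core.

```c
#include <assert.h>

/* Hypothetical, simplified cost model; every field is an assumption. */
struct irq_cost {
    unsigned vector_fetch;   /* cycles to fetch the vector table entry */
    unsigned handler_fetch;  /* cycles to fetch the first handler instruction */
    unsigned regs;           /* number of registers that get pushed */
    unsigned regs_per_cycle; /* stacking throughput of the data bus */
};

/* Cycles spent purely on pushing registers, rounded up. */
static unsigned stacking_cycles(const struct irq_cost *c)
{
    assert(c->regs_per_cycle > 0);
    return (c->regs + c->regs_per_cycle - 1) / c->regs_per_cycle;
}

/* HW stacker: pushes run on the idle data bus while both fetches are in
 * flight, so only the longer of the two activities is visible. */
static unsigned hw_stacked_latency(const struct irq_cost *c)
{
    unsigned fetch = c->vector_fetch + c->handler_fetch;
    unsigned stack = stacking_cycles(c);
    return fetch > stack ? fetch : stack;
}

/* Soft stacking: the first store executes only after both fetches are done. */
static unsigned soft_stacked_latency(const struct irq_cost *c)
{
    return c->vector_fetch + c->handler_fetch + stacking_cycles(c);
}
```

With 3 cycles per fetch and 8 registers pushed one per cycle, this model gives 8 cycles for the HW stacker and 14 for soft stacking. Soft stacking wins only once the handler needs a single register (7 cycles), which matches the last remark above.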
Note: Zcmp[e] doesn’t cover caller saved registers except `ra`.
To be able to call RVE only code from RVI ABI.
A recurring thing in RVE ABI proposals.
The idea is to allow compilers and software vendors to provide a single set of precompiled libraries for RVI and RVE ABIs.
The issue with this approach is that the code arbitrarily compiled for RVE is likely to turn out to be less efficient than RVI one. It also limits the capabilities of RVI ABI like trading off argument registers for temporary/saved ones.
Having one universal solution for all possible scenarios brings a lot of inefficiency to all of them, due to mandatory support for a lot of rarely used functionality, keeping compatibility with unused legacy, or having to be a subset of a bigger architecture optimized for different use cases.
Even if that "flexibility" is made completely optional and non intrusive, the vendors will implement it anyway for the sake of having the longest "flexibility" bar.
aka "HANDLER_RETURN" on emb-riscv and "EXC_RETURN" on ARM
The idea is to put a special pattern in `ra` during handler entry and
exit by reusing the regular return mechanism provided by the ABI. Requires
a certain memory area to be non executable (e.g. 0xF0000000 - 0xFFFFFFFF).
This mechanism follows the typical ABI function call and, together with HW stacking, allows the interrupt handlers to be regular C functions.
The downside is that `ra` and `pc` both have to be pushed onto the stack
and, in some specific cases, it could add extra stall cycles after the tail due
to the waitstates or cache miss caused by delayed prefetch.
Alternatively we can just stack `ra` and put the current `pc` there with the lowest bit set
to trigger the handler return operation. One less register counted towards interrupt latency.
Note: normally the `jalr` instruction just ignores the LSB of the resulting address.
An LSB set in both the register and the immediate will lead to a "bogus" jump over 2 extra bytes.
Even though this behaviour simplifies hardware, existing ABIs are
allowing "auxiliary information" in pointers as well as the `jalr`
immediate, effectively making both useless.
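Both return-detection variants and the `jalr` LSB behaviour can be modelled in a few lines of C. This is a sketch only: the function names are hypothetical, the reserved window is the example range quoted above, and real hardware would make the decision in the branch unit rather than in software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Example non-executable window from the text (assumed, not normative). */
#define HANDLER_RETURN_BASE UINT32_C(0xF0000000)

/* Regular jalr: target is rs1 + imm with bit 0 cleared. */
static uint32_t jalr_target(uint32_t rs1, int32_t imm)
{
    return (rs1 + (uint32_t)imm) & ~UINT32_C(1);
}

/* Variant 1: a "return" into the reserved window means "leave the handler". */
static bool is_magic_return(uint32_t target)
{
    return target >= HANDLER_RETURN_BASE;
}

/* Variant 2: ra holds the interrupted pc with bit 0 set as the marker. */
static uint32_t make_marked_ra(uint32_t interrupted_pc)
{
    return interrupted_pc | UINT32_C(1);
}

/* The marker is the bit that a regular jalr would silently drop. */
static bool is_marked_return(uint32_t rs1, int32_t imm)
{
    return ((rs1 + (uint32_t)imm) & UINT32_C(1)) != 0;
}
```

In this model `jalr_target(0x1001, 1)` yields 0x1002, which is the "bogus" jump over 2 extra bytes from the note, while `jalr_target(make_marked_ra(pc), 0)` still recovers the interrupted `pc`.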
It’s simply inefficient in a truly vectored scenario. The vector entries will have to be populated with jump instructions anyway. Those have to take the second round of waitstates or cache miss without amortization by register stacking.
And if the code is far away from the vector table (e.g. in SRAM for more deterministic execution), the compiler will have to emit a jump island, aka "veneer", that will perform yet another unamortized jump. Additionally far jumps require a free register, which in a typical scenario requires pushing to stack and returning to the veneer from the handler to handle the epilogue.
Allocating 8 bytes per entry, allowing a `lui` + `jalr` sequence, will severely hurt
code density and performance in typical use scenarios.
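To make the cost of the 8 byte variant concrete, the sketch below splits an absolute 32 bit address into the `lui` upper part and the sign-extended 12 bit `jalr` immediate, and compares table sizes. The helper names are hypothetical; the 1023 entry count is the one from the design decisions appendix.

```c
#include <assert.h>
#include <stdint.h>

/* The two immediates of a lui + jalr far jump (RV32). */
struct far_jump {
    uint32_t hi20; /* lui immediate, upper 20 bits after rounding */
    int32_t  lo12; /* jalr immediate, sign-extended 12 bits */
};

static struct far_jump split_address(uint32_t addr)
{
    struct far_jump j;
    /* Add 0x800 so that a negative lo12 is compensated in the upper part. */
    j.hi20 = (addr + UINT32_C(0x800)) >> 12;
    j.lo12 = (int32_t)(addr & UINT32_C(0xFFF));
    if (j.lo12 >= 0x800)
        j.lo12 -= 0x1000;
    return j;
}

/* What the hardware computes: (hi20 << 12) + sext(lo12), modulo 2^32. */
static uint32_t join_address(struct far_jump j)
{
    return (j.hi20 << 12) + (uint32_t)j.lo12;
}

/* Flash footprint of a vector table. */
static uint32_t table_bytes(uint32_t entries, uint32_t bytes_per_entry)
{
    return entries * bytes_per_entry;
}
```

For 1023 entries this gives 4092 bytes for a table of plain pointers versus 8184 bytes for `lui` + `jalr` pairs, and every dispatch still executes two extra instructions.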
Note: 8051 allocated 8 bytes per entry, but it was sometimes able to fit an entire handler or one of the conditional paths, especially when the following entries were unused. This kind of optimization is exclusive to assembly programming and generally not practiced today.
In case of mass initialization MMIO could result in better code density. CSR space is also limited.
My take is that anything architecturally coupled to the core should reside in CSR space and keep the rest in MMIO.
Nothing should exist as both.
There is no point in avoiding CSR registers when the cost of Zicsr instructions is already taken.
Very often claimed, yet those claims rarely meet with reality.
Note: There are also many non-architectural sources of jitter like caches, waitstated flash, accessing peripherals in different clock domains (usually divided from sysclk), DMA contention, or just the code masking out the interrupts.
Cortex-m0 offers "zero jitter" through an optional IP (RTL for ASICs) configuration that extends the best case interrupt latency by an extra cycle to accommodate random stalls from bus contention.
Cortex-m3/4 offer up to 6 cycles of jitter due to "late arrival" and "pop pre-emption". Regular handler entry is dominated by stacking registers, giving some headroom for extra vector/instruction fetch latency.
Cortex-m7 of course suffers from Proprietary&Confidential syndrome. Most probably it’s similar to cm3/4.
In case of C2000 CLA, TI claims [14],[15],[27] that their task driven machine (non preemptible) "reduces interrupt latency and jitter" compared to a classic CPU, even though it does exactly the opposite when there is more than 1 async interrupt to handle.
Note: Of course whenever TI compares CLA to a "classic cpu", it’s always a cpu with preemption priorities only and a background task not present on CLA. As if a similar "task machine" couldn’t be achieved by a regular general purpose architecture (e.g. risc-v, cortex-m) without nesting and with a WFI loop (or "sleep on exit" feature), giving access to all GPRs in interrupts without stacking.
The Linux cargo cult.
Because the simplest tasks, suitable for a bunch of 555&74s or a simple microcontroller with a
few KiB of flash and RAM, must be done under linux so they will work somehow "better".
To be able to properly run linux you need quite a beefy unit (usually with MMU), 2-4MiB of flash,
4-8MiB of RAM (usually external DRAM), a long boot time and bad power consumption in idle.
Just to run the OS itself.
One of the most blatant examples is NOMMU linux on stm32f429 with
memory mapped SDRAM that is not even cached by the cpu. If the XIP image doesn’t fit
in the 2MiB internal flash, it has to land in external parallel NOR flash, which is of course
not cached by the cpu and shares the bus with SDRAM.
Any attempt to touch internal SRAM regions will defeat the remaining
"universality/portability of linux apps" arguments.
Not to mention a much higher unit price than typical 200+MHz cortex A5/7 SOCs.
Of course there are still actual reasons to use linux in non-realtime embedded, consisting of large collection of drivers for external devices, higher portability or access to the raw performance (at much better perf/price ratio) not available in typical microcontrollers [16].
Lazy stacking allows skipping the stacking of FP registers if the handler doesn’t touch floating point registers.
The main issue is that all of the caller saved FP registers are saved onto the stack (execution stalls during the push) whenever an FP instruction is executed, even though only a few of the registers are used.
Requires an additional CSR to hold the address of the reserved space in the stack frame.
Ideally we should not change the established ABI, to avoid disruption.
But definitely get rid of the `tp` register, which is overall useless.
Stack alignment should be 2x`XLEN`, mandated throughout the entire program execution so as to not require special realignment in interrupts.
Note: psABI [33] says that "the stack pointer must remain aligned throughout procedure execution" and fails to enforce this anyway: "Non-standard ABI code must realign the stack pointer prior to invoking standard ABI procedures. The operating system must realign the stack pointer prior to invoking a signal handler; hence, POSIX signal handlers need not realign the stack pointer. In systems that service interrupts using the interruptee’s stack, the interrupt service routine must realign the stack pointer if linked with any code that uses a non-standard stack-alignment discipline, but need not realign the stack pointer if all code adheres to the standard ABI"
A major ilp32e issue is that the `addi16sp` instruction works on 16 byte stack increments.
Once the `c.addi` range (-32..+31) is exhausted, compilers have to choose between
denser code and more efficient use of stack.
The Zcmp extension was also designed for a 16 byte aligned stack. There is a Zcmpe extension
postponed to the future which should handle the EABI. Lowering the stack alignment
requires doubling (per bit of alignment) the waste of codepoints by `push`/`pop` instructions.
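The density versus stack usage trade-off can be illustrated by checking which encoding a given stack adjustment fits into. This is a sketch: the enum and function names are hypothetical, while the immediate ranges are those of the standard C extension (`c.addi`: nonzero 6 bit signed, `c.addi16sp`: nonzero multiple of 16 in -512..496) and of the base `addi` (12 bit signed).

```c
#include <assert.h>

enum sp_adjust {
    SP_NONE,       /* nothing to do */
    SP_C_ADDI,     /* 16 bit c.addi sp, imm */
    SP_C_ADDI16SP, /* 16 bit c.addi16sp imm */
    SP_ADDI,       /* 32 bit addi sp, sp, imm */
    SP_MULTI       /* needs more than one instruction */
};

/* Pick the smallest encoding for "sp += bytes". */
static enum sp_adjust pick_sp_adjust(int bytes)
{
    if (bytes == 0)
        return SP_NONE;
    if (bytes >= -32 && bytes <= 31)
        return SP_C_ADDI;
    if (bytes % 16 == 0 && bytes >= -512 && bytes <= 496)
        return SP_C_ADDI16SP;
    if (bytes >= -2048 && bytes <= 2047)
        return SP_ADDI;
    return SP_MULTI;
}

/* Round a non-negative frame size up to the given alignment. */
static int round_up_frame(int size, int align)
{
    return (size + align - 1) / align * align;
}
```

A 40 byte frame on an 8 byte aligned stack needs the 32 bit `addi` (`pick_sp_adjust(-40)` is `SP_ADDI`), while rounding it up to 48 bytes gets the compressed `c.addi16sp` back at the cost of 8 wasted bytes of stack.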
Note: `addi8sp` won’t be necessary as Zcmpe `push`/`pop` can prepare the initial 8 byte allocation for an (optionally) following `addi16sp`
Note: 2x`XLEN` alignment allows more optimal use of microarchitectures capable of stacking 2 registers per cycle
register | ABI name | Saver | description |
---|---|---|---|
x0 | zero | - | Hardwired zero |
x1 | ra | caller | return address |
x2 | sp | callee | stack pointer |
x3 | gp | - | global pointer |
x4 | t0 | caller | temporary |
x5 | t1 | caller | temporary |
x6 | t2 | caller | temporary |
x7 | t3 | caller | temporary |
x8 | s0/fp | callee | saved/frame pointer |
x9 | s1 | callee | saved |
x10 | a0 | caller | argument/return |
x11 | a1 | caller | argument/return |
x12 | a2 | caller | argument |
x13 | a3 | caller | argument |
x14 | a4 | caller | argument |
x15 | a5 | caller | argument |
x16-x31 | - | - | reserved for custom use |
Note: ilp32e with `tp` turned into temporary.
register | ABI name | Saver | description |
---|---|---|---|
x0 | zero | - | Hardwired zero |
x1 | ra | caller | return address |
x2 | sp | callee | stack pointer |
x3 | gp | - | global pointer |
x4 | t0 | caller | temporary |
x5 | t1 | caller | temporary |
x6 | t2 | caller | temporary |
x7 | t3 | caller | temporary |
x8 | s0/fp | callee | saved/frame pointer |
x9 | s1 | callee | saved |
x10 | a0 | caller | argument/return |
x11 | a1 | caller | argument/return |
x12-x17 | a2-a7 | caller | argument |
x18-x27 | s2-s11 | callee | saved |
x28-x31 | t4-t7 | caller | temporary |
Note: ilp32 with `tp` turned into temporary.
The official risc-v debug spec [45] is good enough to not necessitate another incompatible one, although the "minimal debug implementation" is actually not minimal.
Some of the minor things that could be "improved" for minimal implementations:
- 1 entry `progbuf` accepting 32bit instructions only (saves 2 bits, currently must accept compressed insns)
- writing this 1 entry progbuf immediately executes written instruction (ie. no storage in progbuf)
- remove `dpc` CSR, and allow debuggers to get the "current" `pc` by executing `auipc` from `progbuf`
- no mandatory abstract register reads (data exchange only through message registers)
- get rid of certain discovery bits
- etc.
Biggest offenders of course are and will be the actual implementations that, despite being the "minimal"
ones designated as "8bit killers", are happily implementing more than necessary.
Like the 8-word `progbuf` in ch32v003 [28].
Low pin count devices (8-32) need a denser debug interface as the JTAG uses too many wires.
There are industry proven 2 wire interfaces like cJTAG or ARM SWD.
It would be best to have 1 wire solution like avr8 debugWIRE/updi
or the WCH "SDI" (aka "SWD") [46]
Note: official RFC has been submitted here: riscv-non-isa/riscv-c-api-doc#53
Currently there is no universal solution to indicate which registers in interrupt handlers can be freely used without stacking them.
- `__attribute__((interrupt))` makes all registers callee saved and uses mret to return.
- `__attribute__((interrupt("SiFive-CLIC-preemptible")))` extends regular interrupt by CLIC preemption
- `__attribute__((interrupt("WCH-Interrupt-fast")))` requires custom built toolchain, no floating point regs (even on the cores with F extension), still uses mret
- Or just a plain C function that requires prestacking of all caller saved registers, reuses standard return mechanism to exit interrupt context
Even worse, there are already hardware stackers designed for ilp32e and ilp32. When a new and better ABI is introduced, it will be impossible to use it with pre-existing HW stackers. The same applies to creating HW stackers that stack fewer registers to optimize interrupt latency.
Therefore we need a universal way to annotate which registers are available for use in a given function as de facto caller saved ones (aka create a custom calling convention)
- `prestacked("")` attribute
- no whitespaces in string parameter
- register range covers all registers between and including specified (`x4-x6` is equivalent to `x4,x5,x6`)
- register range must span at least 3 consecutive registers
- registers/ranges are separated by comma
- callee saved registers have to be properly turned into temporary when included in the list
- CSRs taking part in calling conventions are also subject to this mechanism
- should use raw names instead of ABI mnemonics as to make it ABI agnostic (more portable)
- registers must be sorted (integer, floating point, vector, custom, then by lowest numbered)
- CSRs must be put after the architectural regfiles, those don’t have to be sorted
- must not collide with `__attribute__((interrupt))` as to support "legacy" handler return mechanisms
- must not imply `__attribute__((interrupt))` as well
- custom CSRs would also have to be somehow covered. (hw loops etc.)
- annotated functions should be callable by regular code
- argument registers that are passed but not included in the list, can be assumed to be unmodified after return from an annotated function
ilp32 caller saved:
__attribute__((prestacked("x5-x7,x10-x17,x28-x31")))
ilp32f, caller saved:
__attribute__((prestacked("x5-x7,x10-x17,x28-x31,f0-f7,f10-f17,f28-f31,fcsr")))
preemptible CLIC irq with simplified ranges(e.g. shadow register file):
__attribute__((interrupt("CLIC-preemptible"), prestacked("x8-x15")))
TEIC irq, range0 + shadow regs of half integer regfile (where bit 2 of operand is set, covers range1+2) and F + P extensions:
__attribute__((prestacked("x4-x7,x10,x11,x12-x15,x20-x23,x28-x31,fcsr,vxsat")))
ch32v003 irq (ilp32e + PFIC HW stacker, assuming `ra` doesn’t have some undocumented use):
__attribute__((interrupt, prestacked("x1,x5-x7,x10-x15")))
Note: unannotated `ra` is assumed as a valid return address, otherwise a special return mechanism must be
used (e.g. return by `mret` in `__attribute__((interrupt))`)
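The syntax rules listed above are simple enough to be checked by a tiny parser. The following is a sketch that handles only the integer registers (`xN`) and turns the list into a bitmask; floating point, vector and CSR names are left out, and the function names are hypothetical rather than part of any compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Parse "xN" (N = 0..31) at *s and advance *s; returns N or -1 on error. */
static int parse_xreg(const char **s)
{
    const char *p = *s;
    int n = 0;

    if (*p++ != 'x' || *p < '0' || *p > '9')
        return -1;
    while (*p >= '0' && *p <= '9') {
        n = n * 10 + (*p++ - '0');
        if (n > 31)
            return -1;
    }
    *s = p;
    return n;
}

/* Convert a prestacked list of integer registers into a bitmask.
 * Enforces: no whitespace, comma separated items, ranges spanning
 * at least 3 registers, items sorted by register number. */
static bool prestacked_xmask(const char *list, uint32_t *mask)
{
    int prev = -1; /* highest register accepted so far */

    *mask = 0;
    if (*list == '\0')
        return true; /* empty list: nothing is prestacked */
    for (;;) {
        int first = parse_xreg(&list);
        int last = first;

        if (first < 0)
            return false;
        if (*list == '-') {
            list++;
            last = parse_xreg(&list);
            if (last < first + 2)
                return false; /* also catches a failed parse (-1) */
        }
        if (first <= prev)
            return false; /* unsorted or overlapping */
        for (int r = first; r <= last; r++)
            *mask |= UINT32_C(1) << r;
        prev = last;
        if (*list == '\0')
            return true;
        if (*list++ != ',')
            return false; /* whitespace or any other junk */
    }
}
```

For the ilp32 example above, `"x5-x7,x10-x17,x28-x31"` yields the mask 0xF003FCE0, while `"x4-x5"` (a range of only two registers) and `"x5, x6"` (whitespace) are rejected.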
gcc/llvm compilers can purge the epilogue (even down the call tree) by automatic
detection of an infinite loop or by using `__attribute__((noreturn))` or `__builtin_unreachable()`.
This is not the case for prologues though, leading to a waste of stack and codespace in the most typical embedded scenario of main or thread functions with infinite loops.
This missing optimization is intentional [32] to allow backtracing
(`abort()` etc.) and throwing exceptions (of course under -fno-exceptions and exception less code)
By abusing the "prestacked annotation" we can get rid of this prologue
by "prestacking" all of the available registers.
e.g. __attribute__((noreturn, prestacked("x1,x4-x31,f0-f31,fcsr")))
Note: addition of a `noreturn_nobacktrace_noexcept` attribute is very unlikely, optimizing the
regular `noreturn` attribute is even less so.
Note: `__attribute__((naked))` won’t work, as it will remove the stack allocation
and consequently underflow the stack.
It can be additionally abused to:
- define IPRA clobbers of assembly functions in its C function declarations (see applying IPRA to assembly functions)
- certain (premature) optimizations (manually solving 2way IPRA recursion etc.)
- dynamic linked functions with a subset of clobbers. e.g. functions like `memcpy()`, `strcmp()` etc. don’t need to clobber all caller saved registers so only common clobbers for straightforward, unrolled (?) and vectorized implementations need to be applied. Requires standardization of canonical clobbers for each offending function. (quite unrealistic)
So far implemented only by llvm [8].
Limited to statically linked code.
There are almost no benchmark results, especially ones other than x86 at -O3.
Simply put, it makes every function export information about its usage of caller saved registers, effectively allowing non leaf functions to use caller saved registers as callee saved ones. That avoids some of the stacking/spilling, leading to more efficient code.
requirements and improvements needed for efficient IPRA:
-
this mechanism must cover the CSRs as well as the registers (e.g.
fcsr
,vtype
,vl
etc.)
-
custom registers and CSRs should also be covered (e.g. HW loops) (unnamed?)
-
compilers need to avoid using more registers than necessary (currently no reason)
-
registers from compressible range should be allocated only when it will benefit code density (currently no reason)
-
to avoid regressions, compilers need some kind of heuristic to detect when stacking certain (compressible) callee saved registers would yield better code density than using more temporaries from non compressible ranges
Note
|
on riscv it’s s0 and s1 , in presence of Zcmp[e] pushing s0,s1 is free
in non leaf functions, and just 2 16bit instructions in leaf. With IPRA it should be also
possible to just move ra and s0/s1 into caller saved regs.
|
Note
|
This is also non IPRA optimization (-Oz kind) |
-
need special assembly directive to annotate such exports from pure assembly code (workaround exists applying IPRA to assembly functions)
Note
|
Automatic detection is not an option due to self constructed instructions (e.g. from [39]):
|
-
precompiled libraries should also do an "IPRA exports"
-
very important point is resolving IPRA annotations of callbacks, where the callback call will use the smallest common regmask of all functions that can be called through this point
-
callbacks initialized once at startup (typical in many HALs)
-
callbacks passed as function parameters
-
queues (of structs) with callbacks
-
Note
|
callbacks are commonly used in peripheral interrupts, therefore it’s important to apply IPRA optimizations to those as well |
-
it can be used to annotate that passed function arguments (through registers or stack) were not modified and can be recycled by caller (e.g. in loops)
-
it can also "export" list of deterministic constants (and addresses) that are left in registers after return
Note
|
This mechanism is portable to other architectures, the more caller saved registers are available, the higher relative gain is. |
Note
|
vector extension can benefit from IPRA as current psABI makes all vector registers temporary, though the syscall destroys entire state |
Because the HW stackers (used with prestacked annotation) will prefer to stack out the compressible registers first, it might not be the best match for IPRA optimized allocation
Note
|
compilers usually don’t care about non-abi (interrupt) prologues/epilogues and emit code as if it was the regular ABI function |
The solution could be:
-
optimize HW stacker for typical allocations
-
make compilers treat specially a call trees growing from interrupt handlers
-
trump the general IPRA optimizations to use
a0-a5
first
Handlers that do not call other functions should be straightforward, as long as the compiler allocators/optimizers are not going to outright ignore the prestacked annotation.
Special attribute to annotate function declaration in header associated with assembly code
(e.g. __attribute__((regmask("clobbered list here")))
) was proposed [49],
but it wasn’t implemented upstream.
The other option is to use inline asm clobbers to make calls to such functions:
__attribute__((always_inline))
static inline int weird_call(int n, void* p)
{
register int result asm("a0") = n;
register void* a1 asm("a1") = p;
asm volatile(
"call foo \n\t"
: [ARG0] "+r" (result) // return in same register
: [ARG1] "r" (a1)
: "memory", "ra", "a2" // use clobber for any caller saved regs used
);
return result;
}
-
requires the
call
pseudoinstruction that expands to a proper sequence. Otherwise we get errors when calling too far, or miss an optimization when a short call can be made.
-
works in existing compilers (at least in gcc and llvm)
Another solution could be applying prestacked annotation e.g.
pure assembly function (FP compute kernel) using only subset of caller saved
registers (a0
argument not modified):
__attribute__((prestacked("x5,x11-x15,f10-f13,v0,v1,v8-v31,fcsr,vl,vtype,vstart")))
Note
|
Both methods are insufficient for "annotating" unmodified stack arguments which are caller saved (documented on arm and defacto on risc-v) |
smallest profile?
machine mode only
RV32 only
2 or 4 interrupt nesting levels
little endian only software shall assume little endian
name | default value | notes |
---|---|---|
|
implementation specific |
Base address of the first application entry point as well as its vector table. May have additional constraints on the alignment. |
|
implementation specific |
Base address of the designated executable SRAM memory. (Some devices implement a special memory area designated for interrupt handlers, aka "ITCM". Usually it will be the main memory address) |
|
0xFFFE0000 |
Base address of XTeic MMIO control block |
|
{0,1,2} |
Number of implemented interrupt nesting priority bits |
|
{1,2,3,4} |
Number of implemented interrupt sub-priority bits |
|
{9..1023} |
Number of allocated interrupt entries including skipped ones and NMIs |
|
{2,4} |
Size in bytes of the single entry in vector table. By default it’s 4. 2 if XTeicTinyIrqTable subextension is implemented. |
Upon hart reset:
-
all of the architectural registers are initialized to their reset state.
-
The MMIO control block registers are also initialized to their reset state.
-
The pc is set to the
TEIC_ENTRY_VECT_BASE
.
Performing the system reset will additionally initialize the state of the peripheral registers to their reset state.
The hart reset is always equivalent to a system reset unless the XTeicMP extension is implemented.
The reset state of all architectural registers is undefined unless explicitly specified in specific extension.
Note
|
That means the reset state of integer, fp, and vector registers is undefined. |
Note
|
some of the CSR registers also remain in undefined state. |
If the application start is preceded by a bootloader, or the application enters the bootloader, then the switch code shall ensure that before redirecting execution to the target address:
-
all peripherals are disabled, or initialized to reset state if enabled on reset (e.g. watchdogs)
-
external GPIOs are configured to reset state
-
the oscillators, PLLs, clock selects and divisors are configured to their reset state
-
all nesting levels in
teic_irq_msk
are enabled
-
teic_irq_vect
is set to the target entry point, right before the jump happens
Note
|
The rationale of these rules is to avoid bloat in startup
code (and duplicate of it in SystemInit() ), which is a result of assuming the worst case scenario
|
Note
|
bootloaders placed at application entry area (at TEIC_ENTRY_VECT_BASE )
can be entered by setting a certain pattern in backup register and then executing system reset.
|
Note
|
Some devices switch between bootloader and application modes by performing
whole system reset after modifying certain configuration registers (remap of executable area
at TEIC_ENTRY_VECT_BASE )
|
The interrupt controller supports only level triggered interrupts. The logical high is used to assert pending interrupt request lines.
The irq number is the position in the vector table.
Note
|
there is no irq offsetting like in NVIC |
The stack pointer is not realigned. If the stack is not 8 byte aligned, the behaviour is implementation specific.
Note
|
typical HW won’t care about 4 byte stack, some dual issuers or hardened cores
might want to set irqentryexit_unrec nmi request
|
Note
|
Zcmp similarly doesn’t specify the required alignment. |
Register ranges define which registers are pushed onto the stack on irq entry.
Adding a certain range requires inclusion of all previous ranges.
The selection is implementation specific, fixed at silicon level. Shall not deviate from the predefined ranges.
Note
|
only highest nesting level has configurable stacking ranges. |
range | registers | added stack area | mandatory implemented (all nesting) |
---|---|---|---|
0 | "x1,x10,x11,reserved" | XLEN * 4 | yes |
1 | "x12-x15" | XLEN * 4 | yes |
2 | "x4-x7" | XLEN * 4 | no |
3 | "x16,x17,x28-x31" | XLEN * 6 | no |
Note
|
Range 0+1 gives similar amount of usable registers as NVIC |
- stack frame pseudocode
// all ranges used
// range 0
sw x1, -4(sp)
sw x10, -8(sp)
sw x11, -12(sp)
sw reserved, -16(sp)
// range 1
sw x12, -20(sp)
sw x13, -24(sp)
sw x14, -28(sp)
sw x15, -32(sp)
// range 2
sw x4, -36(sp)
sw x5, -40(sp)
sw x6, -44(sp)
sw x7, -48(sp)
// range 3
sw x16, -52(sp)
sw x17, -56(sp)
sw x28, -60(sp)
sw x29, -64(sp)
sw x30, -68(sp)
sw x31, -72(sp)
addi sp, sp, -72
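The frame size follows directly from the range table. A minimal sketch in C, assuming 4 byte registers (RV32) and that the implemented ranges are described by the highest range number. The names `teic_range_regs` and `teic_frame_bytes` are illustrative only and not defined by this specification:

```c
#include <assert.h>
#include <stdint.h>

/* Number of stack slots added by each range, taken from the range table
 * (range 0 includes the reserved slot). */
static const uint8_t teic_range_regs[4] = { 4, 4, 4, 6 };

/* Bytes pushed on irq entry when ranges 0..highest_range are stacked.
 * xlen_bytes is 4 on RV32. Hypothetical helper, not a spec defined API. */
static uint32_t teic_frame_bytes(unsigned highest_range, uint32_t xlen_bytes)
{
    assert(highest_range <= 3);
    uint32_t slots = 0;
    for (unsigned r = 0; r <= highest_range; r++)
        slots += teic_range_regs[r];
    return slots * xlen_bytes;
}
```

With all four ranges on RV32 this gives 72 bytes, which matches the final `addi sp, sp, -72` of the pseudocode. The mandatory ranges 0+1 alone take 32 bytes.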
Note
|
reserved position in range0 window can be optionally used for preserving additional state during nesting |
When a given interrupt nesting level (reflected by pending_nestx in teic_irq_status) becomes pending and is not masked out by the corresponding bit in the teic_irq_msk register, the interrupt entry procedure is triggered.
During the interrupt entry the hardware will:
-
stack configured/implemented register ranges at given nesting level (can be affected by n4_stacking)
-
decrement sp according to largest configured/implemented register ranges
-
put content of interrupted pc into ra register with lowest bit set
-
set in_nestx bit in teic_irq_status register
-
fetch target address from vector table pointed by teic_irq_vect. The vector entry is selected by handler dispatch process.
-
jump to the fetched address
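The entry steps can be sketched as a small host side model in C. This is only an illustration under several assumptions: RV32, ranges 0+1 stacked, 4 byte vector entries, a toy word addressed memory, and in_nestx placed at bit 4 + (level - 1) as suggested by the teic_irq_status layout. The type `hart_t` and function `irq_entry` are hypothetical names, and late arrival, tail chaining and fault handling are left out:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 256u            /* toy memory, byte address 0..1023 */

typedef struct {
    uint32_t x[32];               /* integer registers: x[1] is ra, x[2] is sp */
    uint32_t pc;
    uint32_t irq_status;          /* models teic_irq_status */
    uint32_t irq_vect;            /* models teic_irq_vect */
    uint32_t mem[MEM_WORDS];
} hart_t;

/* Stacking order of ranges 0+1 as in the stack frame pseudocode.
 * 0 marks the reserved slot (written as zero in this model). */
static const uint8_t stack_order[8] = { 1, 10, 11, 0, 12, 13, 14, 15 };

static void irq_entry(hart_t *h, unsigned irqn, unsigned level)
{
    assert(level >= 1 && level <= 4);
    uint32_t sp = h->x[2];
    assert(sp % 4 == 0 && sp >= 32 && sp <= MEM_WORDS * 4);

    for (unsigned i = 0; i < 8; i++) {            /* stack the ranges */
        uint32_t addr = sp - 4u * (i + 1);
        h->mem[addr / 4] = stack_order[i] ? h->x[stack_order[i]] : 0;
    }
    h->x[2] = sp - 32;                            /* decrement sp */
    h->x[1] = h->pc | 1u;                         /* ra = pc with lsb set */
    h->irq_status |= 1u << (4 + level - 1);       /* set in_nestx (assumed position) */
    uint32_t entry = h->irq_vect + 4u * irqn;     /* vector entry address */
    assert(entry / 4 < MEM_WORDS);
    h->pc = h->mem[entry / 4];                    /* fetch and jump */
}
```

The set lsb in `ra` is what later lets a plain `jalr` through `ra` act as the interrupt return.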
Note
|
optimized microarchitectures will implement late arrival, tail chaining and pop preemption which further complicate entry/exit procedures |
If an irq request is spuriously deasserted during the interrupt entry (or e.g. tail chaining), the core must either enter the offending handler or immediately return (or e.g. tail chain to yet another handler).
Note
|
Sometimes it takes a few cycles to deassert the irq request signal, after e.g. clearing a status flag. Behaviour must be deterministic, otherwise errata will follow. |
When a jalr or cm.popret instruction is executed and the lowest bit in the source register is set (before calculating final target address), the interrupt exit procedure is triggered.
If no interrupt is currently active then the irqretnest0_unrec nmi request is set.
During the interrupt exit the hardware will:
-
unstack configured/implemented register ranges at given nesting level (can be affected by n4_stacking)
-
increment sp according to largest configured/implemented register ranges
-
clear in_nestx bit in teic_irq_status register
-
jump to the target address of jalr or cm.popret instruction
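The exit trigger condition can be sketched in C as a decode time check. This is a simplified model, assuming that "active" means any of in_nest1..in_nest4 or in_nmi is set and that those bits sit at positions 4..8 of teic_irq_status. The enum and function names are hypothetical, and the unrecoverable NMI case is not modelled:

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    RET_PLAIN_JUMP,     /* ordinary jalr, no irq exit */
    RET_IRQ_EXIT,       /* interrupt exit procedure is triggered */
    RET_NMI_REQUEST     /* irqretnest0_unrec nmi request is set */
} ret_kind_t;

/* in_nest1..in_nest4 (bits 4-7) and in_nmi (bit 8), assumed layout */
#define TEIC_ACTIVE_MASK 0x1F0u

/* Only the source register value is inspected. The jalr immediate is
 * deliberately ignored, as the computed target does not take part. */
static ret_kind_t classify_jalr(uint32_t rs1_value, int32_t imm,
                                uint32_t irq_status)
{
    (void)imm;
    if ((rs1_value & 1u) == 0)
        return RET_PLAIN_JUMP;
    if ((irq_status & TEIC_ACTIVE_MASK) == 0)
        return RET_NMI_REQUEST;
    return RET_IRQ_EXIT;
}
```

Because the immediate is ignored, `jalr` with an even source register and an odd immediate is still treated as a plain jump in this model.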
Note
|
The bogus jalr target address issue remains as per unprivileged spec.
Therefore conforming software shall not set the lsb in jalr immediate used for function returns
|
Note
|
only the lsb in source register is checked, not the computed target
address of jalr instruction. It allows detection of irq ret condition earlier in the pipeline.
|
Note
|
optimized microarchitectures will implement late arrival, tail chaining and pop preemption which further complicate entry/exit procedures |
NMIs (non maskable interrupts) are a special type of interrupts that cannot be masked
by teic_irq_msk
register. Typically used for signalling critical conditions.
Entry/exit procedure is similar to regular IRQs with the following exceptions:
-
activity is signalled by in_nmi in teic_irq_status register
-
preserves at least range 0 registers, stacking ranges are implementation defined.
-
adjusts sp by stacked ranges
Note
|
typically NMIs will stack the same register ranges as regular interrupts |
Before returning from NMI handler all requests in teic_nmi_cause
CSR must be acknowledged (cleared).
The unrecoverable NMI handler is entered whenever:
-
any of the *_unrec requests is raised in teic_nmi_cause
-
synchronous exception is raised during active NMI handler
-
any of the synchronous exception flags (*_exc in teic_nmi_cause) is not cleared before performing interrupt exit from NMI handler
-
*_async that was escalated to unrecoverable nmi request (escalated_async_unrec in teic_nmi_cause)
Entry procedure is similar to regular NMIs with the following exceptions:
-
activity is signalled by in_nmi_unrecoverable in teic_irq_status register
-
busfaults, alignment or other errors during stacking are ignored
-
not required to actually stack the registers, only the ra shall be written with pc during fault and sp decremented by range 0 area
The hart enters the NMI lockup state whenever:
-
code attempts to return from Unrecoverable_NMI handler
-
synchronous or imprecise exception is raised within Unrecoverable_NMI handler
NMI lockup state halts any further code execution, except in debug mode.
Note
|
it is necessary to allow debuggers to read out state of registers/memory after experiencing lockup state. |
Note
|
experiencing exceptions within (or return from) unrecoverable handler means a serious issue with control flow, where further attempts to execute code would do more harm than halting until watchdog performs system wide reset. |
Note
|
lack of triple fault lockout can also lead to security vulnerabilities [43] |
Note
|
microarchitectures can provide external output for signaling NMI lockup state as to allow immediate shutdown of certain peripherals (pwm timers etc.) |
irq num | type | name | notes |
---|---|---|---|
0 | - | reserved | reserved for startup code (typically jump instruction) |
1 | NMI | reserved | |
2 | NMI | IntegrityViolation_NMI | (optional) software and hardware integrity violations |
3 | NMI | ClockViolation_NMI | (optional) Lost clock or other anomaly. It should be assumed that the core/system clock could have been switched to a different one at this point. |
4 | NMI | WatchdogViolation_NMI | (optional) Entered right before any of the watchdogs trips and performs a (device) reset. Designated for safety measures and error logging. It should be assumed that execution could be frozen at this point and no further action can or need to be performed. |
5 | NMI | MemoryViolation_NMI | Bus or memory access fault |
6 | NMI | InstructionViolation_NMI | Illegal instruction exception |
7 | NMI | Unrecoverable_NMI | Nested nmi, unknown or a state that cannot be easily recovered from. |
8 | IRQ | Deffered0_IRQ | software deferred interrupt, can be used for context switch. |
9 | IRQ | reserved | reserved/systick??? |
10..1022 | IRQ | *_IRQ | (optional) device specific interrupts |
Unimplemented optional NMIs can be recycled for custom NMIs other than the ones provided in table above.
Note
|
XTeic doesn’t provide any peripheral API for optional watchdog, clock and integrity protection systems. It’s up to the implementer to provide them. |
Alternate vector table allocation, designated for minimal implementations that do not make use of optional NMIs but benefit from additional space savings.
Alternate vector table allocation is implementation defined. It’s neither discoverable nor configurable.
irq num | type | name | notes |
---|---|---|---|
0 | - | reserved | reserved for startup code (typically jump instruction) |
1 | NMI | HW_NMI | (optional) hardware related exceptions (watchdogs, ECC etc.) |
2 | NMI | SW_NMI | exceptions related to application execution on a given hart (illegal instr, integrity violations by sw etc.) |
3 | NMI | Unrecoverable_NMI | Nested nmi, unknown or a state that cannot be easily recovered from. |
4 | IRQ | Deffered0_IRQ | software deferred interrupt, can be used for context switch. |
5 | IRQ | reserved | reserved/systick??? |
6..1022 | IRQ | *_IRQ | (optional) device specific interrupts |
Note
|
Fragmentation is not a big deal, as all devices will be fragmented anyway by implementing their own layout of device specific IRQ handlers, which will be provided within startup files. |
To reduce disruption some of the "privileged" csr have been recycled according to "privileged" specification.
number | name | privilege | description | notes |
---|---|---|---|---|
0x001 |
|
URW |
IEEE 754 exception flags |
implemented when F,D,Zfinx,Zdinx is present |
0x002 |
|
URW |
IEEE 754 dynamic rounding mode |
implemented when F,D,Zfinx,Zdinx is present |
0x003 |
|
URW |
frm+fflags |
implemented when F,D,Zfinx,Zdinx is present |
0xf11 |
|
MRO |
vendor ID |
jedec?? |
0xf12 |
|
MRO |
architecture ID |
|
0xf13 |
|
MRO |
implementation ID |
|
0xf14 |
|
MRO |
hart ID |
- Mnemonic
wfi
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x73, attr: ['SYSTEM'] }, { bits: 5, name: 0x0, attr: ['rd'] }, { bits: 3, name: 0x0, attr: ['PRIV'] }, { bits: 5, name: 0x0, attr: ['rs1'] }, { bits: 12, name: 0x105, attr: ['WFI'] }, ]}
- Description
-
Execution of the wfi instruction stalls the execution and allows the core to enter various low power states until the interrupt is taken or any nesting level becomes pending.
It is allowed to terminate spontaneously or even be implemented as a nop.
In addition, the wfi instruction is allowed to optionally stack out certain registers ahead of the interrupts, to reduce their latency. In this case, sp is not changed until the interrupt arrives.
Note
|
wfi can be executed when interrupts are disabled. Which is a very common
use case that avoids introduction of non deterministic delays to event response time.
(i.e. irq arriving right before wfi )
|
Note
|
It is basically the same thing as privileged wfi but without the
configuration bits in privileged CSR’s
|
- Mnemonic
teic.wfi.n4ign
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0x73, attr: ['SYSTEM'] }, { bits: 5, name: 0x0, attr: ['rd'] }, { bits: 3, name: 0x0, attr: ['PRIV'] }, { bits: 5, name: 0x0, attr: ['rs1'] }, { bits: 12, name: 0x115, attr: ['WFI'] }, ]}
- Description
-
Similar to
wfi
instruction, but doesn’t have to terminate after executing interrupts at 4th nesting priority only. Shall terminate if any other nesting level was entered before returning from n4 irq. (i.e. tail chained to n3, then pop preempted back into n4)
If only single nesting priority is implemented
(TEIC_IRQ_NESTING_BITS == 0
) then this instruction
behaves like a standard wfi
.
Note
|
Designated to reduce wakeups caused by high frequency control loop interrupts that don’t need attention from rest of the application. |
Note
|
A typical implementation would require additional hidden state to track if an interrupt of lower nesting priority was entered. |
Note
|
similarly to standard wfi it can terminate spontaneously
so the additional functionality is optional
|
number | name | privilege | description |
---|---|---|---|
0xbc0 |
|
MRW |
interrupt vector table |
0xbc1 |
|
MRW |
irq saved state |
0x800 |
|
URW |
interrupt mask |
0x801 |
|
URO |
current interrupt status |
0xbc4 |
|
MRW |
coarse mask of NMI causes |
0xbc5 |
|
MRW |
config register |
0xbc6 |
|
MRW |
added with XTeicStackLimit |
0xbc7 |
|
MRW |
added with XTeicStackLimit&&XTeicRTOS |
0xbc8 |
|
MRW |
added with XTeicRTOS |
bit | name | type | reset value | description |
---|---|---|---|---|
[31:5] |
|
WLRL |
|
top bits of vector table offset. |
[4:0] |
reserved |
WLRL |
0 |
reserved |
Note
|
alignment requirement allows to avoid use of the additional adder circuit during irq dispatch |
Note
|
minimum alignment can be calculated by the following formula:
pow(2, ceil(log2(TEIC_IRQ_VECT_ENTRIES))) * TEIC_IRQ_VECT_ENTRY_SIZE
If the vector table consists of 100 entries total, 4 bytes each, then the minimum required alignment is 512 bytes |
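The alignment formula can be evaluated without floating point. A minimal sketch in C, where `round_up_pow2` and `teic_vect_min_align` are illustrative helper names and not part of this specification:

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= n, i.e. pow(2, ceil(log2(n))) in integers. */
static uint32_t round_up_pow2(uint32_t n)
{
    assert(n >= 1 && n <= 0x80000000u);
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Minimum vector table alignment in bytes.
 * entries    corresponds to TEIC_IRQ_VECT_ENTRIES
 * entry_size corresponds to TEIC_IRQ_VECT_ENTRY_SIZE (2 or 4) */
static uint32_t teic_vect_min_align(uint32_t entries, uint32_t entry_size)
{
    return round_up_pow2(entries) * entry_size;
}
```

For 100 entries of 4 bytes each this yields 512 bytes, and for the maximum of 1023 entries of 4 bytes it yields 4096 bytes. An implementation may still impose a stricter alignment.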
Note
|
vect_offset can be implemented with just enough bits to point at existing memory areas only,
as to reduce necessary state to implement.
|
Note
|
Implementations may impose additional alignment requirement |
Note
|
vect_offset can also be implemented as a read only constant pointing to beginning of the flash memory
|
bit | name | type | reset value | description |
---|---|---|---|---|
[31:0] |
|
WPRI |
undefined |
implementation specified pattern
used to recover execution state upon interrupt return. Covers certain csr registers:
( |
Note
|
Although optional, the ability to interrupt multicycle instructions is especially
important for cores implementing zero jitter features.
As an example the ratified Zcmp cm.popretz instruction has 3 uninterruptible instructions (one is branch).
(Even though it could be just 2 as zeroing a0 is restartable. 3 instruction sequence will be formally pushed
down your throats anyway)
|
Note
|
designated to allow an efficient context switch from the lowest priority interrupt |
Note
|
As the risc-v doesn’t have condition codes for branching/predication, it is
expected that the smallest implementations will not make use of estate register at all.
|
Note
|
due to maximum 5-level nesting and limited state to preserve, it was decided to not push previous state onto stack, that would increase interrupt latency. |
bit | name | type | reset value | description |
---|---|---|---|---|
[31:4] |
reserved |
WPRI |
0 |
reserved |
3 |
|
rw |
1 |
Fourth nesting level |
2 |
|
WARL |
1 |
Third nesting level |
1 |
|
WARL |
1 |
Second nesting level |
0 |
|
WARL |
1 |
First nesting level |
Disabling any nesting level shall take effect immediately before executing next instruction.
bits related to unimplemented nesting levels are hardwired to zero.
Note
|
only nest4 level is mandatory to implement
|
Note
|
TEIC_IRQ_NESTING_BITS == 1 implements nest2 and nest4 only
|
bit | name | type | reset value | description |
---|---|---|---|---|
[31:12] |
reserved |
WPRI |
0 |
reserved |
11 |
|
ro |
0 |
(optional) signals that currently stacked registers cover only ranges
configured for nest4 level. |
10 |
|
ro |
0 |
NMI lockup state, can be cleared only by
hart/system reset |
9 |
|
ro |
0 |
unrecoverable NMI handler state, can be
cleared only by hart/system reset |
8 |
|
ro |
0 |
returnable NMI handler state |
7 |
|
ro |
0 |
irq handler at 4th nesting priority state |
6 |
|
ro |
0 |
irq handler at 3rd nesting priority state |
5 |
|
ro |
0 |
irq handler at 2nd nesting priority state |
4 |
|
ro |
0 |
irq handler at 1st nesting priority state |
3 |
|
ro |
0 |
pending status of 4th nesting priority |
2 |
|
ro |
0 |
pending status of 3rd nesting priority |
1 |
|
ro |
0 |
pending status of 2nd nesting priority |
0 |
|
ro |
0 |
pending status of 1st nesting priority |
Note
|
nmi_lockup bit is defacto readable only by debugger
|
bit | name | type | reset value | description |
---|---|---|---|---|
31 |
reserved |
ro |
0 |
|
30 |
|
ro |
0 |
irq return without active irq/nmi |
29 |
|
ro |
0 |
any fault during irq entry/exit (stack alignment, memory faults etc.) |
28 |
|
ro |
0 |
(optional) imprecise bus faults |
27 |
|
ro |
0 |
(optional) imprecise hw integrity error |
26 |
|
ro |
0 |
(optional) imprecise sw integrity error |
25 |
|
ro |
0 |
synchronous exception raised during execution of nmi handler |
24 |
|
ro |
0 |
(optional) escalated |
[23:10] |
reserved |
rw1c |
0 |
reserved |
9 |
|
ro |
0 |
(optional) |
8 |
|
ro |
0 |
(optional) |
7 |
reserved |
ro |
0 |
reserved |
6 |
|
ro |
0 |
(optional) asynchronous integrity error not related to the architectural control flow (e.g. unrecoverable ECC error triggered by scrubber or speculative prefetch) |
5 |
reserved |
rw1c |
0 |
reserved |
4 |
|
rw1c |
0 |
(optional) software related integrity exceptions |
3 |
|
rw1c |
0 |
(optional) hardware related integrity exceptions |
2 |
|
rw1c |
0 |
(optional) misaligned load/store address |
1 |
|
rw1c |
0 |
memory access faults |
0 |
|
rw1c |
0 |
Illegal instruction exception and misaligned instr |
The *_async
nmi requests have to be cleared within the source peripheral.
bit | name | type | reset value | description |
---|---|---|---|---|
[31:8] |
reserved |
WLRL |
0 |
reserved |
[7:6] |
|
WARL |
implementation specific (highest implemented) |
stacking ranges at 4th nesting level. |
5 |
reserved |
WARL |
0 |
|
4 |
|
WARL |
0 |
(optional)
Switches current (part of) register file to thread one if applicable. |
3 |
|
WARL |
0 |
added with XTeicRTOS |
2 |
|
WARL |
0 |
(optional) if |
1 |
|
WARL |
0 |
(optional) |
0 |
|
WARL |
0 |
(optional) Ensure that the highest nesting priority interrupts are always entered within the same number of cycles regardless of the interrupted execution state. Doesn’t affect tailchaining of handlers within the highest nesting priority, as well as irq return procedure. Various deep sleep states are also an exception. It shall be assumed that irq vector table, highest level interrupt code and stack resides in zero
waitstated memories and no HW measures will be implemented to adjust for a different scenario. |
private to the hart
offset from TEIC_MMIO_CTRL_BASE |
entry size | name | non-native access | description |
---|---|---|---|---|
0x0 |
4 |
|
no |
|
0x4 |
4 |
|
no |
|
0x8 |
4 |
|
no |
|
0x10 |
4 |
|
no |
|
0x20 |
4 |
|
no |
|
0x40 |
4 |
|
no |
added with XTeicMP |
0x400 |
1 |
|
yes |
bit | name | type | reset value | description |
---|---|---|---|---|
[31:16] |
reserved |
rw |
0 |
reserved |
[15] |
|
ro |
dependent |
1: |
[14:11] |
|
ro |
dependent |
0b0000: power on reset |
[10:3] |
|
wo |
0 |
write of |
[2:1] |
reserved |
rw |
0 |
|
[0] |
|
rw |
implementation specific |
(optional) write 1 together with |
Note
|
[45] provides sysreset with excluded debug subsystem, in case of custom debug spec, it should at least provide its own config to exclude itself from reset |
bit | name | type | reset value | description |
---|---|---|---|---|
[31:1] |
|
rw1c |
0 |
(optional) pending status of deffered1-deffered31 irq requests |
[0] |
|
rw1c |
0 |
pending status of deffered0 irq request |
bit | name | type | reset value | description |
---|---|---|---|---|
[31:1] |
|
w1s (wo) |
undefined |
(optional) write 1 to set deffered1-deffered31 irq requests |
[0] |
|
w1s (wo) |
undefined |
write 1 to set deffered0 irq request |
For each implemented irq vector, there is corresponding pending bit in pending register at
teic_irq_pending[IRQn/32]
position.
First 8 bit entries (corresponding to NMIs) are reserved.
bit | name | type | reset value | description |
---|---|---|---|---|
[31:0] |
|
ro |
0 |
signals pending status of |
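A lookup of a single pending bit could look as follows in C. This is only a sketch: the text gives the word index as IRQn/32, while the bit position IRQn%32 inside the word is an assumption, as is the helper name `teic_irq_is_pending`. The caller passes a pointer to the first teic_irq_pending word:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the given irq is pending.
 * Entries 0..7 are reserved (NMIs), the highest vector number is 1022. */
static bool teic_irq_is_pending(const volatile uint32_t *teic_irq_pending,
                                unsigned irqn)
{
    assert(irqn >= 8 && irqn <= 1022);
    /* word index per the text, bit index within the word is assumed */
    return ((teic_irq_pending[irqn / 32] >> (irqn % 32)) & 1u) != 0;
}
```

On a real device the pointer would be derived from TEIC_MMIO_CTRL_BASE plus the register offset. For host side testing a plain array works just as well.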
Consists of 1023 entries, 1 byte each. First 8 entries (corresponding to NMIs) are reserved.
For each implemented irq vector, there is corresponding priority config register at
teic_prio_cfg[IRQn]
position.
- priority encoding
bit | name | type | reset value | description |
---|---|---|---|---|
[8:(9 - |
|
rw |
0 |
nesting priority bits |
[(8 - |
|
rw |
0 |
sub-priority bits |
[(8 - ( |
reserved |
WLRL |
0 |
reserved |
Unimplemented bottom nesting bits are treated as if they were hardwired to 1.
If only 1 bit is implemented then only nest2
and nest4
levels are possible.
additional per vector entry interrupt enable
private to the hart
For each implemented irq vector, there is corresponding enable bit in "enable" register at
teicMP_irq_enable[IRQn/32]
position.
First 8 bit entries (corresponding to NMIs) are reserved.
bit | name | type | reset value | description |
---|---|---|---|---|
[31:0] |
|
WARL |
0 |
enable control of |
Adds additional RTOS specific features
After thread mode (aka "user" or "unprivileged") is activated by thread_enter
bit:
-
Current sp becomes a defacto thread stack
-
On irq entry from thread, current sp is swapped with the content of teic_swpspm register, which happens after stacking (registers are pushed to thread stack)
-
Thread mode protects only CSR registers, memory regions should be protected by additional PMP unit.
-
Interrupts are always executing in machine mode.
bit in teic_cfg
CSR
Setting this bit will make the hart enter thread mode (aka user mode in privileged nomenclature). Once set it cannot be cleared.
Must not be set within interrupt handler, otherwise behaviour is undefined.
Note
|
It is expected that startup code will turn itself into an idle thread after configuring everything in machine mode. |
Holds the stack pointer to be swapped with sp
when entering interrupt context.
Note
|
Separate interrupt stack allows thread stacks to allocate only the area for context switch storage in addition to its own usage (which can be statically analysed) |
If access_thread_regs_n1
control bit is implemented, then it switches sp
to thread stack as well.
When in effect, the teic_swpspm
content is undefined.
When another interrupt nests, it pushes registers onto the machine (interrupt) stack.
Makes each address entry in irq vector table take only 2 bytes in size.
(TEIC_IRQ_VECT_ENTRY_SIZE == 2
)
The effective address is constructed by concatenation of the 2 bytes of the
vector entry content and top 16 bits of TEIC_ENTRY_VECT_BASE
implementation constant.
The TEIC_ENTRY_VECT_BASE
must be 64KiB aligned.
The entry encoding with the least significant bit set, is reserved.
Note
|
Extension designated for smallest devices where a vector table size has a significant code size impact. |
Note
|
SRAM can be used for placing handlers if mapped within the same 64KiB block |
Implies XTeicTinyIrqTable extension.
If the fetched vector entry has the lowest bit set, then
the effective address is constructed by concatenation of the 2 bytes of the
vector entry content and top 16 bits of TEIC_EXEC_SRAM_BASE
implementation constant.
The TEIC_EXEC_SRAM_BASE
must be 64KiB aligned.
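The address construction for 2 byte vector entries can be sketched in C as below. The base addresses are made up example values (both constants are implementation specific), and `teic_tiny_entry_to_addr` is a hypothetical helper name. The sketch follows the text literally and concatenates the entry as fetched; whether hardware masks bit 0 of the resulting address is not stated here, so it is left in place:

```c
#include <assert.h>
#include <stdint.h>

/* Example values only, both are implementation specific constants. */
#define TEIC_ENTRY_VECT_BASE 0x08000000u
#define TEIC_EXEC_SRAM_BASE  0x20000000u

/* Builds the effective handler address from a 16 bit vector entry.
 * Entries with bit 0 set select the SRAM block, all others the block
 * containing the entry point. */
static uint32_t teic_tiny_entry_to_addr(uint16_t entry)
{
    assert((TEIC_ENTRY_VECT_BASE & 0xFFFFu) == 0);   /* 64KiB aligned */
    assert((TEIC_EXEC_SRAM_BASE & 0xFFFFu) == 0);    /* 64KiB aligned */
    uint32_t top = (entry & 1u) ? TEIC_EXEC_SRAM_BASE : TEIC_ENTRY_VECT_BASE;
    return (top & 0xFFFF0000u) | entry;              /* bit 0 kept as fetched */
}
```

Without the SRAM variant of the extension only the first branch applies, since entries with the least significant bit set are reserved there.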
Note
|
It is possible to implement this on devices with large flash memories and resort to compiler tricks, to keep handlers within 64KiB range. But the gains will be relatively low. |
Provides additional CSR registers with stack address thresholds.
Throws sw_integrity_exc
exception, when sp
(x2
) register is written with value lower than
the one specified in teic_sp*limit
register.
Note
|
local arrays can be created on stack and then accessed by pointer passed in working register.
Therefore stacklimit comparison must happen on write to sp register
|
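The comparison performed on a write to sp can be modelled in C as follows. This is a sketch under the assumption that the reserved low bits [2:0] of the limit register are simply ignored. `teic_sp_write_faults` is an illustrative name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when writing new_sp into sp would raise sw_integrity_exc.
 * limit_reg models one teic_sp*limit CSR: bits [31:3] hold the bottom
 * stack threshold, bits [2:0] are reserved and masked out here. */
static bool teic_sp_write_faults(uint32_t new_sp, uint32_t limit_reg)
{
    uint32_t threshold = limit_reg & ~0x7u;
    return new_sp < threshold;      /* unsigned compare, field is unsigned */
}
```

A write equal to the threshold passes, anything below it faults. Accesses through other pointers into the stack are not covered, which is why the check is tied to the sp write itself.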
Used for limiting sp
when hart is in thread mode or thread_enter == 0
.
bit | name | type | reset value | description |
---|---|---|---|---|
[31:3] |
|
WLRL |
0 |
top bits of bottom stack threshold, unsigned |
[2:0] |
reserved |
WLRL |
0 |
reserved |
available only with XTeicRTOS
Used for limiting sp
when hart is in interrupt (machine) mode (thread_enter == 1
).
bit | name | type | reset value | description |
---|---|---|---|---|
[31:3] |
|
WLRL |
0 |
top bits of bottom stack threshold, unsigned |
[2:0] |
reserved |
WLRL |
0 |
reserved |
Additional extensions that are useful additions to XTeic
Because J extension group is going to simply ignore the fact that fence.i
instruction
allocated whole 22.125 bits of opcodes, and introduce new instructions for operational
subset of fence.i
(e.g. IMPORT.I
) [38],[39]. We don’t need to care about eventual
sync with Zjid encodings.
The rationale is that the fence.i
encodes whole instruction side synchronization with all zero immediate.
Therefore we can remove all of the sync mechanisms by inverting the bits, other than the one designated for
certain operation.
The uppermost 4 bits remain zero to allow enabling extra features not covered by fence.i
.
Flushes the pipeline and prefetch buffers before executing next instruction.
Encoded in bit 0 of fence.i
immediate
Note
|
not suitable for synchronizing with architectural state modifications by
CSR instructions, use teic.fence.icsrsync instead
|
- Mnemonic
teic.fence.ipipe
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0xf, attr: ['MISC-MEM'] }, { bits: 5, name: 0x0, attr: ['rd'] }, { bits: 3, name: 0x1 }, { bits: 5, name: 0x0, attr: ['rs1'] }, { bits: 12, name: 0x0fe, attr: ['imm'] }, ]}
Ensures that the following instructions are executed after the architectural state change
by a preceding CSR instructions (or equivalent) takes effect.
Encoded in bit 1 of fence.i
immediate
Note
|
In many cases CSR updates don’t require full pipeline flush, though it can be implemented as regular pipeline flush. |
Note
|
necessary to sync e.g irq vector table updates wrt following (peripheral) MMIO access |
Note
|
[41] do require fencing after update of jvt and mtvec
(even though jvt falls into "program order" category).
|
- Mnemonic
teic.fence.icsrsync
- Encoding (RV32, RV64)
{reg:[ { bits: 7, name: 0xf, attr: ['MISC-MEM'] }, { bits: 5, name: 0x0, attr: ['rd'] }, { bits: 3, name: 0x1 }, { bits: 5, name: 0x0, attr: ['rs1'] }, { bits: 12, name: 0x0fd, attr: ['imm'] }, ]}
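The relation between the operation bit and the final instruction word can be checked with a few lines of C. The sketch assumes that the 8 low immediate bits are the invertible operation bits (the uppermost 4 stay zero) and that rd and rs1 are zero as in the encodings above. The function and macro names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define OPC_MISC_MEM   0x0Fu    /* 7 bit major opcode */
#define FUNCT3_FENCE_I 0x1u

/* I-type word with rd = rs1 = 0 and the given 12 bit immediate. */
static uint32_t encode_fence_i_variant(uint32_t imm12)
{
    assert(imm12 <= 0xFFFu);
    return (imm12 << 20) | (FUNCT3_FENCE_I << 12) | OPC_MISC_MEM;
}

/* Immediate that keeps only the operation designated by 'bit':
 * all other low bits are inverted to one, the selected bit stays zero. */
static uint32_t teic_fence_imm(unsigned bit)
{
    assert(bit < 8);
    return 0x0FFu & ~(1u << bit);
}
```

Bit 0 gives the immediate 0x0fe of teic.fence.ipipe (word 0x0FE0100F) and bit 1 gives 0x0fd of teic.fence.icsrsync (word 0x0FD0100F). An all zero immediate reproduces the plain fence.i word 0x0000100F.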
Implemented similarly to Zicsr with uimm=0
mapped into -1 constant.
Note
|
csrrsi /csrrci with uimm=0 still don’t write or cause write side effects.
|
Note
|
This extension allows the csrrwi instruction to be kept in sync with some other extensions
[39], so as not to introduce additional immediate formats.
|
Note
|
csrrw rd, csr, x0 can still be used to write a zero into csr.
|
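The immediate mapping can be modelled in a few lines. This is only a sketch of the operand seen by csrrwi on RV32, under the assumption that the mapping is a plain substitution of the zero immediate; the function name `teic_csrrwi_operand` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Value written by csrrwi for a given 5 bit uimm field (RV32 model).
 * uimm = 0 is mapped into the -1 constant, other values are zero extended. */
uint32_t teic_csrrwi_operand(unsigned uimm5)
{
    assert(uimm5 < 32u);
    return (uimm5 == 0u) ? UINT32_MAX : (uint32_t)uimm5;
}
```

The model does not cover csrrsi/csrrci, for which uimm=0 still means that no write takes place.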
No CSR access shall raise an exception.
-
Writes to read only CSRs shall be ignored.
-
In machine mode, access to unimplemented CSRs is undefined.
-
In thread mode, access to unimplemented CSRs, as well as to higher privilege ones, shall cause no side effects: reads return 0 and writes are ignored.
Note
|
The UNIMP instruction maps to a write into the cycle CSR, so it can
no longer be used. c.unimp, which is encoded as all zero, remains available.
|
Note
|
Extension designated for reduction of silicon use; it reflects the behaviour of
certain privileged CSRs (e.g. misa, mvendorid etc.) when unimplemented.
|
Implemented similarly to Zcmp, but with an additional immediate bit to accommodate 8 byte aligned stacks, and with the following changes.
Note
|
addi8sp is not required, as the push instruction can prepare the initial allocation with 8 byte granularity. |
rlist
encoding
RV32E:
    case 0: {reg_list="ra"; xreg_list="x1";}
    case 1: {reg_list="ra, s0"; xreg_list="x1, x8";}
    case 2: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
    case 3-15: reserved
RV32I:
    case 0: {reg_list="ra"; xreg_list="x1";}
    case 1: {reg_list="ra, s0"; xreg_list="x1, x8";}
    case 2: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
    case 3: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
    case 4: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
    case 5: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
    case 6: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
    case 7: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
    case 8: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
    case 9: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
    case 10: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
    case 11: {reg_list="ra, s0-s10"; xreg_list="x1, x8-x9, x18-x26";}
    case 12: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
    case 13-15: reserved
stack_adj_base derivation from rlist
    case 0..1: stack_adj_base = 8
    case 2..3: stack_adj_base = 16
    case 4..5: stack_adj_base = 24
    case 6..7: stack_adj_base = 32
    case 8..9: stack_adj_base = 40
    case 10..11: stack_adj_base = 48
    case 12: stack_adj_base = 56
    case 13..15: reserved
Valid values:
    case 0..1: stack_adj = [ 8|16|24|32|40|48|56|64]
    case 2..3: stack_adj = [16|24|32|40|48|56|64|72]
    case 4..5: stack_adj = [24|32|40|48|56|64|72|80]
    case 6..7: stack_adj = [32|40|48|56|64|72|80|88]
    case 8..9: stack_adj = [40|48|56|64|72|80|88|96]
    case 10..11: stack_adj = [48|56|64|72|80|88|96|104]
    case 12: stack_adj = [56|64|72|80|88|96|104|112]
    case 13..15: reserved
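The tables above follow a simple closed form: rlist selects ra plus rlist saved registers of 4 bytes each, rounded up to the 8 byte stack granularity, and the 3 bit immediate adds further 8 byte steps. A minimal sketch for RV32 follows; the function names are hypothetical, and treating the immediate as a plain 0..7 step count is an assumption drawn from the "Valid values" table.

```c
#include <assert.h>

#define TEIC_RLIST_MAX 12u /* rlist 13..15 are reserved */

/* Smallest 8 byte aligned frame that holds ra plus `rlist` saved registers. */
unsigned teic_stack_adj_base(unsigned rlist)
{
    assert(rlist <= TEIC_RLIST_MAX);
    unsigned bytes = 4u * (rlist + 1u); /* 4 bytes per register on RV32 */
    return (bytes + 7u) & ~7u;          /* round up to a multiple of 8 */
}

/* Total adjustment for a given 3 bit immediate step count (0..7). */
unsigned teic_stack_adj(unsigned rlist, unsigned spimm)
{
    assert(spimm < 8u);
    return teic_stack_adj_base(rlist) + 8u * spimm;
}
```

For example, rlist = 12 with the largest immediate gives 56 + 7 * 8 = 112, the last value in the table.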
- register stacking order
-
currently same as in Zcmp
- Synopsis
-
Allocates stack frame and saves registers selected by rlist.
- Mnemonic
teic.cm.push {reg_list}, -stack_adj
- Encoding
{reg:[ { bits: 2, name: 0x2, attr: ['C2'] }, { bits: 1, name: 'spimm[5]' }, { bits: 2, name: 'rlist[1:0]' }, { bits: 2, name: 'spimm[4:3]' }, { bits: 2, name: 'rlist[3:2]' }, { bits: 1, name: 0 }, { bits: 2, name: 0x0 }, { bits: 1, name: 0 }, { bits: 3, name: 0x5, attr: ['C.FSDSP'] }, ],config:{bits:16}}
- Synopsis
-
Deallocates stack frame and loads registers selected by rlist.
- Mnemonic
teic.cm.pop {reg_list}, stack_adj
- Encoding
{reg:[ { bits: 2, name: 0x2, attr: ['C2'] }, { bits: 1, name: 'spimm[5]' }, { bits: 2, name: 'rlist[1:0]' }, { bits: 2, name: 'spimm[4:3]' }, { bits: 2, name: 'rlist[3:2]' }, { bits: 1, name: 1 }, { bits: 2, name: 0x0 }, { bits: 1, name: 0 }, { bits: 3, name: 0x5, attr: ['C.FSDSP'] }, ],config:{bits:16}}
- Synopsis
-
Deallocates stack frame, loads registers selected by rlist and returns.
- Mnemonic
teic.cm.popret {reg_list}, stack_adj
- Encoding
{reg:[ { bits: 2, name: 0x2, attr: ['C2'] }, { bits: 1, name: 'spimm[5]' }, { bits: 2, name: 'rlist[1:0]' }, { bits: 2, name: 'spimm[4:3]' }, { bits: 2, name: 'rlist[3:2]' }, { bits: 1, name: 1 }, { bits: 2, name: 0x0 }, { bits: 1, name: 1 }, { bits: 3, name: 0x5, attr: ['C.FSDSP'] }, ],config:{bits:16}}
- Description
-
The
ra
register may not be populated.
- Synopsis
-
Deallocates stack frame, loads registers selected by rlist, writes zero to a0 and returns.
- Mnemonic
teic.cm.popretz {reg_list}, stack_adj
- Encoding
{reg:[ { bits: 2, name: 0x2, attr: ['C2'] }, { bits: 1, name: 'spimm[5]' }, { bits: 2, name: 'rlist[1:0]' }, { bits: 2, name: 'spimm[4:3]' }, { bits: 2, name: 'rlist[3:2]' }, { bits: 1, name: 0 }, { bits: 2, name: 0x0 }, { bits: 1, name: 1 }, { bits: 3, name: 0x5, attr: ['C.FSDSP'] }, ],config:{bits:16}}
- Description
-
The
ra
register may not be populated. Unlike in Zcmp, the load to a0 is non-atomic.
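The four encodings differ only in bits 9 and 12, so a single helper can assemble all of them. The sketch below follows the bit fields listed in the encodings; the names `teic_cm_op` and `teic_cm_encode` are hypothetical, and passing the immediate as a 0..7 count of 8 byte steps (held in spimm[5:3]) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Operation selector, stored as (bit12 << 1) | bit9. */
enum teic_cm_op {
    TEIC_CM_PUSH    = 0x0, /* bit12 = 0, bit9 = 0 */
    TEIC_CM_POP     = 0x1, /* bit12 = 0, bit9 = 1 */
    TEIC_CM_POPRETZ = 0x2, /* bit12 = 1, bit9 = 0 */
    TEIC_CM_POPRET  = 0x3, /* bit12 = 1, bit9 = 1 */
};

uint16_t teic_cm_encode(enum teic_cm_op op, unsigned rlist, unsigned spimm)
{
    assert(rlist <= 12u); /* RV32I range, RV32E allows only 0..2 */
    assert(spimm < 8u);

    uint32_t w = 0x2u;                      /* bits 1:0   C2 quadrant   */
    w |= ((spimm >> 2) & 1u) << 2;          /* bit 2      spimm[5]      */
    w |= (rlist & 3u) << 3;                 /* bits 4:3   rlist[1:0]    */
    w |= (spimm & 3u) << 5;                 /* bits 6:5   spimm[4:3]    */
    w |= ((rlist >> 2) & 3u) << 7;          /* bits 8:7   rlist[3:2]    */
    w |= ((unsigned)op & 1u) << 9;          /* bit 9      op select     */
                                            /* bits 11:10 stay zero     */
    w |= (((unsigned)op >> 1) & 1u) << 12;  /* bit 12     op select     */
    w |= 0x5u << 13;                        /* bits 15:13 C.FSDSP slot  */
    return (uint16_t)w;
}
```

Under these assumptions, `teic_cm_encode(TEIC_CM_PUSH, 2, 0)` would correspond to `teic.cm.push {ra, s0-s1}, -16`.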
- mask out all interrupts
#include <stddef.h>

void foo()
{
    size_t tmp;

    // save the current mask and clear all nest enable bits
    asm volatile("csrrci %[out], teic_irq_msk, 0b01111 \n\t"
                 : [out] "=r" (tmp) :: "memory");

    //
    // execute code with irq disabled
    //

    // restore the previously saved mask
    asm volatile("csrw teic_irq_msk, %[in] \n\t"
                 :: [in] "r" (tmp) : "memory");
}
- mask out only
nest1
level
#include <stddef.h>

void foo()
{
    size_t tmp;

    // save the current mask and clear only the nest1 enable bit
    asm volatile("csrrci %[out], teic_irq_msk, 0b00001 \n\t"
                 : [out] "=r" (tmp) :: "memory");

    //
    // execute code with irq disabled
    //

    // restore the previously saved mask
    asm volatile("csrw teic_irq_msk, %[in] \n\t"
                 :: [in] "r" (tmp) : "memory");
}
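Restoring the saved value with csrw, rather than setting the bits back, is what makes such blocks safe to nest. The following host side model illustrates this; it is only a sketch, with `model_irq_msk` standing in for the teic_irq_msk CSR, an assumed initial value of 0x0f (all four nest levels enabled), and hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Software stand-in for the teic_irq_msk CSR (assumed: set bit = enabled). */
uint32_t model_irq_msk = 0x0fu;

/* csrrci rd, csr, uimm: return the old value, then clear the bits in uimm. */
uint32_t model_csrrci(uint32_t uimm5)
{
    assert(uimm5 != 0u && uimm5 < 32u);
    uint32_t old = model_irq_msk;
    model_irq_msk = old & ~uimm5;
    return old;
}

/* csrw csr, rs1: plain write. */
void model_csrw(uint32_t value)
{
    model_irq_msk = value;
}

/* An outer block masking nest1 with an inner block masking everything.
 * Returns the mask observed inside the inner block. */
uint32_t nested_sections_demo(void)
{
    uint32_t outer = model_csrrci(0x01u); /* mask out only nest1     */
    uint32_t inner = model_csrrci(0x0fu); /* mask out all interrupts */
    uint32_t seen  = model_irq_msk;
    model_csrw(inner);                    /* back to "nest1 masked"  */
    model_csrw(outer);                    /* back to the initial mask */
    return seen;
}
```

After the inner block exits, nest1 remains masked until the outer block restores its own saved value.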
what headers, definitions, names etc. must be provided.
The cause code can be implied from the hardcoded vector table position, or from the peripherals' state if the handler is shared. Therefore it's redundant. The other issue is that it has to be somehow preserved during nesting.
Note
|
NMIs are handled through teic_nmi_cause CSR.
|
A single bit interrupt enable would be redundant to the irq_msk nest enables,
which can be similarly managed by csrsi, csrci instructions.
The misa register is useless.
Will it tell you whether Zbb, Zmmul or Zcmt is implemented? No.
On embedded targets, HW information about implemented extensions, and the ability to enable/disable them, has rather low value.
currently ???
Zfinx ???
Those can still be handled by IPRA anyway. An FP push/pop instruction might be useful in such a case.
It is said that registers have to be zeroed at reset "to protect software from itself" [36]. It doesn't, it just hides bugs until they manifest in the worst possible scenario. Just like developing and debugging code at -O0.
This kind of use of uninitialized variables is UB in C/C++ and easily detectable by compilers. Languages like Rust or Ada are supposed to be free from this UB, so there is no need to spend transistors or code memory for zeroing those.
Note
|
V extension uses all ones for tail agnostic filling just to prevent software
from relying on uarch dependent zeroing.
|
However, certain hardened cores may need to have all registers initialized to a consistent state, so as to avoid integrity faults when stacking out yet unused registers. In some cases, it's still possible to require initialization of all registers in startup code instead.
Why would you want to have big endian loads/stores?
Probably for handling tasks that compute "network byte order" data which uses big endian representation.
Nice. So, let's add a big-endian mode (making it configurable at runtime of course), and enjoy mandatory endian neutral loads/stores ([37]) used by networking libraries, because one cannot be sure which endianness the code will be run on.
Just use rev8
for "network order" data. It’s much better than doing endian neutral access.
Big endianness is also inefficient to handle in vector registers.
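As an illustration of the rev8 approach, the sketch below gives a portable C equivalent of a 32 bit byte reversal and uses it to convert a word held in network byte order. The function names are hypothetical, and a little endian core is assumed, as mandated here.

```c
#include <stdint.h>

/* Portable equivalent of what rev8 does on an RV32 register. */
uint32_t rev8_u32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u)
         | (x << 24);
}

/* Convert a word that was loaded from a big endian ("network order")
 * field with a regular little endian load. */
uint32_t be32_to_cpu(uint32_t raw)
{
    return rev8_u32(raw);
}
```

A compiler targeting a core with Zbb would typically be able to turn such a pattern into a single rev8, whereas an endian neutral access needs four byte loads plus shifts.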
Unlike the big unix machines, the RTOS context can be statically
addressed by lui
+ addi
sequence.
With hardware stacking there is no need to free up additional registers.
One entry less than the full 1024, due to the 2s complement jump immediate.
This is the biggest capacity that can be escaped by a single c.j instruction
from the first entry in the case of TEIC_IRQ_VECT_ENTRY_SIZE == 2 (XTeicTinyIrqTable).
This is also more than enough for any microcontroller.
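The number 1023 can be checked with simple arithmetic. The sketch below assumes that "escaping" means a c.j placed in the first entry reaching the first address past the table; c.j itself carries a signed, 2 byte aligned offset in the range -2048..+2046. The macro and function names are hypothetical.

```c
#include <assert.h>

#define C_J_MAX_FORWARD_OFFSET 2046u /* largest positive c.j offset in bytes */
#define TEIC_TINY_ENTRY_SIZE   2u    /* TEIC_IRQ_VECT_ENTRY_SIZE == 2        */

/* Largest table (in entries) whose end is still reachable from entry 0. */
unsigned teic_tiny_table_max_entries(void)
{
    return C_J_MAX_FORWARD_OFFSET / TEIC_TINY_ENTRY_SIZE;
}

/* Nonzero when a table of `entries` entries can be escaped by one c.j. */
int teic_tiny_table_escapable(unsigned entries)
{
    assert(entries > 0u);
    return entries * TEIC_TINY_ENTRY_SIZE <= C_J_MAX_FORWARD_OFFSET;
}
```

A 1024 entry table would end at offset 2048, which is one step beyond the positive range of the immediate.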
Per irq pending/enable is simply redundant to the in-peripheral enables, as well as the nestx
interrupt enables.
It has a use case only when the same interrupts are routed to multiple harts, or when peripheral interrupt lines are shared across multiple master units (e.g. FIFO empty irq signal shared with DMA).
Nesting NMIs is an easy way to overflow the stack or greatly increase the worst case in static stack analysis (if there is even a bound).
It also becomes an issue in pure HW state preservation by estate_nl
or shadow registers.
Normally such a condition is very rare and is usually a sign of bad coding or of a much more serious hardware issue that is causing everything to fail at the same moment.
aka software trigger in ARM terminology [47]
Designated for triggering unallocated (or unused peripheral) vectors by writing to
the special NVIC→STIR register,
which is of course redundant to the use of NVIC→ISPRx registers.
However, it's rarely used and only "implemented" vectors can be triggered in such a way. Officially it is supposed to be 32 entry granularity in the ARM case, but it's not even obvious whether you can use unimplemented vectors at all. [48]
Note
|
Even the PendSV is done by setting the ICSR→PENDSVSET bit instead of using this mechanism.
|
Note
|
TEIC instead provides a dedicated "peripheral" for handling software (deferred) interrupts. |
All of this causes a lot of redundancy just to allow handling peripheral interrupts and "software" triggered ones by the same handler. The ARM implementation also depends on the edge triggered irq mechanism, which is also omitted by XTeic.
This is just a waste of hardware. The ABI should mandate the alignment instead. If not followed then the microarchitecture should be allowed to trap.
Note
|
some architectures, due to legacy codebases, require explicit stack alignment instructions which also contribute to interrupt latency/jitter and impact code density. |
It doesn't make sense to implement "zero jitter" at any other level. If a given interrupt can be interrupted by a higher nesting priority, then it would no longer be considered a "zero jitter" one.
Note
|
NMIs can still break the "zero jitter" guarantee, though those should be considered as a rare fault/error condition. |
Peripherals usually implement level triggered interrupts (i.e. they require clearing the trigger source by performing certain actions like reading FIFO registers or clearing the status flags).
Therefore it's wasteful to spend additional resources (e.g. latch for pending status and related clear on irq entry) on the edge triggered mechanism, which is made redundant on every irq line (see [no "software interrupts"]).
Note
|
Sampling edges on GPIO is usually done by a separate peripheral that turns those into level triggered ones. |
aka mtval, which is often not implemented anyway, even by uarchs without unaligned loads/stores support.
Due to the lack of an MMU, memory access exceptions are considered fatal errors anyway.
The faulting address can still be recovered in a more complex way, by decoding the faulting instruction.
Having our cores to boot with "legacy" interrupt modes
-
is a waste of transistors
-
it would require sync with the CLIC mode/submode encodings (or be incompatible with CLIC, which is of course unwanted when lengthening the "flexibility" bar)
-
causes interrupt hole or additional boilerplate code to handle exceptions/NMIs that arrived before setting up
mtvec
and thus were routed to reset handler entry.
Note
|
There was even a CVE related to uninitialized mtvec : [43]
|
This also allows us to use a vector address with the two lowest bits zeroed,
which, in some scenarios, allows setup of the vector table address with a single lui
instruction.
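The single lui case applies when the low 12 bits of the table address are all zero, since lui only supplies bits 31:12. A minimal sketch of the check and of the resulting RV32 instruction word follows; the helper name `encode_lui` and the example address are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Encode `lui rd, %hi(addr)` for RV32.
 * Only valid as a complete address load when addr is 4 KiB aligned;
 * any nonzero low bits (e.g. mode bits) would need a following addi. */
uint32_t encode_lui(unsigned rd, uint32_t addr)
{
    assert(rd < 32u);
    assert((addr & 0xfffu) == 0u);
    return (addr & 0xfffff000u) | (rd << 7) | 0x37u; /* 0x37 = LUI opcode */
}
```

For instance, a table placed at 0x20001000 could be loaded into t0 (x5) by the single word returned from `encode_lui(5, 0x20001000u)`.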
Also, in cores designated to work in vectored mode, the mtvec
has the bottom address lines hardwired to 0.
This leads to a large alignment granularity of the unvectored handler (e.g. on ch32v003 it's 1KiB),
making the unvectored mode handler either share its entry with startup code or require large alignment.
Sub-priority is used only during irq handler dispatch. Current priority field would consume additional circuitry to latch in sub-priority of the current handler.
Additionally, the current sub-priority field would have to be somehow preserved during nesting.
It’s enough for a great majority of use cases, not to mention that a lot of applications would be fine with just 1 nesting level.
Adding more nesting levels will diminish the gains from tail chaining.
Problematic to properly implement.
Offers less separation of kernel structures from the thread (by MPU). Though cortex-m port of FreeRTOS uses it only to start a first thread.
-
[2] https://github.com/riscv/riscv-fast-interrupt/blob/master/clic.adoc
-
[3] https://github.com/riscv/riscv-aclint/blob/main/riscv-aclint.adoc
-
[4] https://starfivetech.com/uploads/sifive-interrupt-cookbook-v1p2.pdf
-
[7] https://github.com/jnk0le/simple-crt/blob/master/cm0/combotablecrt_stm32f030x6.S
-
[9] https://github.com/YosysHQ/picorv32#custom-instructions-for-irq-handling
-
[10] https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/znKeVnmxsy8/m/NtdDII3kAAAJ
-
[13] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/beginner-guide-on-interrupt-latency-and-interrupt-latency-of-the-arm-cortex-m-processors
-
[14] https://www.ti.com/lit/an/spracs0a/spracs0a.pdf?ts=1677348911359
-
[15] https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/intro.html
-
[17] https://elinux.org/images/d/de/Real_Time_Linux_Scheduling_Performance_Comparison.pdf
-
[18] https://static.lwn.net/lwn/images/conf/rtlws11/papers/proc/p19.pdf
-
[19] https://people.mpi-sws.org/~bbb/papers/pdf/ospert13.pdf
-
[20] https://www.osadl.org/fileadmin/events/rtlws-2007/Siro.pdf
-
[21] https://riscv.org/wp-content/uploads/2018/07/DAC-SiFive-Drew-Barbier.pdf
-
[22] https://www.ti.com/lit/an/spraan9a/spraan9a.pdf?ts=1677877354340
-
[23] https://www.ti.com/lit/ug/spru430f/spru430f.pdf?ts=1677869437551
-
[24] https://www.ti.com/lit/ug/spruhs1c/spruhs1c.pdf?ts=1677888169020
-
[25] https://e2e.ti.com/support/processors-group/processors/f/processors-forum/905744/tms320f28335
-
[26] https://e2e.ti.com/support/microcontrollers/c2000-microcontrollers-group/c2000/f/c2000-microcontrollers-forum/567535/tms320f28377d-dmips-calculation
-
[27] https://software-dl.ti.com/C2000/docs/cla_software_dev_guide/_static/pdf/C2000_CLA_Software_Development_Guide.pdf
-
[28] http://www.wch-ic.com/downloads/QingKeV2_Processor_Manual_PDF.html
-
[29] http://www.wch-ic.com/downloads/QingKeV3_Processor_Manual_PDF.html
-
[30] http://www.wch-ic.com/downloads/QingKeV4_Processor_Manual_PDF.html
-
[35] https://www.brianchavens.com/2018/09/20/motor-control-microcontroller-performance-comparison/
-
[37] https://github.com/openssl/openssl/blob/master/crypto/aes/asm/aes-armv4.pl#L216
-
[38] https://github.com/riscv/riscv-j-extension/blob/master/id-consistency-proposal.pdf
-
[40] https://software-dl.ti.com/trainingTTO/trainingTTO_public_sw/c28x28035/C28x_Piccolo_MDW_2-1.pdf
-
[41] https://docs.openhwgroup.org/_/downloads/cv32e40s-user-manual/en/latest/pdf/
-
[44] https://github.com/riscv/riscv-isa-manual/pull/912/commits/869dcc608e11f9680e950bcb20a9b8294d2b82bd
-
[46] https://github.com/openwch/ch32v003/blob/main/RISC-V%20QingKeV2%20Microprocessor%20Debug%20Manual.pdf
-
[48] https://stackoverflow.com/questions/72523639/arm-cortex-m3-add-a-new-interrupt-to-the-end-of-the-vector-table
-
[49] https://lists.llvm.org/pipermail/cfe-dev/2016-July/050022.html
-
[51] https://e2echina.ti.com/cfs-file/_key/communityserver-discussions-components-files/56/5504.2803x-CLA-_2800_1_2900.pdf
-
[54] https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3.pdf