Writing an OS in Rust

Updates in March 2020

Wed, 01 Apr 2020 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the corresponding libraries and tools.

I focused my time this month on finishing the long-planned post about Async/Await. In addition to that, there were a few updates to the crates behind the scenes, including some great contributions and a new vga crate.

As mentioned in the Async/Await post, I’m currently looking for a job in Karlsruhe (Germany) or remote, so please let me know if you’re interested.

`blog_os`

The repository of the Writing an OS in Rust blog received the following updates:

In addition to the changes above, there were a lot of typo fixes by external contributors. Thanks a lot!

`x86_64`

The x86_64 crate provides support for CPU-specific instructions, registers, and data structures of the x86_64 architecture. In March, there was only a single addition, which was required for the Async/Await post:

Add an enable_interrupts_and_hlt function that executes sti; hlt (released as v0.9.6)

`bootloader`

The bootloader crate received two contributions this month:

Implement boot-info-address by @Darksecond (released as v0.8.9)
Identity-map complete vga region (0xa0000 to 0xc0000) by @RKennedy9064 (released as v0.9.0)

`bootimage`

The bootimage tool builds the bootloader and creates a bootable disk image from a kernel. It received a RUSTFLAGS-related bugfix:

Set empty RUSTFLAGS to ensure that no .cargo/config applies

`vga`

There is a new crate under the rust-osdev organization: vga, created by @RKennedy9064. The purpose of the library is to provide abstractions for the VGA hardware. For example, the crate allows switching the VGA hardware to graphics mode, which makes it possible to draw on a pixel-based framebuffer:

For more information about the crate, check out its API documentation and the GitHub repository.

Async/Await

Fri, 27 Mar 2020 00:00:00 +0000

In this post, we explore cooperative multitasking and the async/await feature of Rust. We take a detailed look at how async/await works in Rust, including the design of the Future trait, the state machine transformation, and pinning. We then add basic support for async/await to our kernel by creating an asynchronous keyboard task and a basic executor.

This blog is openly developed on GitHub. If you have any problems or questions, please open an issue there. You can also leave comments at the bottom. The complete source code for this post can be found in the post-12 branch.

Multitasking

One of the fundamental features of most operating systems is multitasking, which is the ability to execute multiple tasks concurrently. For example, you probably have other programs open while looking at this post, such as a text editor or a terminal window. Even if you have only a single browser window open, there are probably various background tasks for managing your desktop windows, checking for updates, or indexing files.

While it seems like all tasks run in parallel, only a single task can be executed on a CPU core at a time. To create the illusion that the tasks run in parallel, the operating system rapidly switches between active tasks so that each one can make a bit of progress. Since computers are fast, we don’t notice these switches most of the time.

While single-core CPUs can only execute a single task at a time, multi-core CPUs can run multiple tasks in a truly parallel way. For example, a CPU with 8 cores can run 8 tasks at the same time. We will explain how to set up multi-core CPUs in a future post. For this post, we will focus on single-core CPUs for simplicity. (It’s worth noting that all multi-core CPUs start with only a single active core, so we can treat them as single-core CPUs for now.)

There are two forms of multitasking: Cooperative multitasking requires tasks to regularly give up control of the CPU so that other tasks can make progress. Preemptive multitasking uses operating system functionality to switch tasks at arbitrary points in time by forcibly pausing them. In the following, we will explore the two forms of multitasking in more detail and discuss their respective advantages and drawbacks.

Preemptive Multitasking

The idea behind preemptive multitasking is that the operating system controls when to switch tasks. For that, it utilizes the fact that it regains control of the CPU on each interrupt. This makes it possible to switch tasks whenever new input is available to the system. For example, it would be possible to switch tasks when the mouse is moved or a network packet arrives. The operating system can also determine the exact time that a task is allowed to run by configuring a hardware timer to send an interrupt after that time.

The following graphic illustrates the task switching process on a hardware interrupt:

In the first row, the CPU is executing task A1 of program A. All other tasks are paused. In the second row, a hardware interrupt arrives at the CPU. As described in the Hardware Interrupts post, the CPU immediately stops the execution of task A1 and jumps to the interrupt handler defined in the interrupt descriptor table (IDT). Through this interrupt handler, the operating system now has control of the CPU again, which allows it to switch to task B1 instead of continuing task A1.

Saving State

Since tasks are interrupted at arbitrary points in time, they might be in the middle of some calculations. In order to be able to resume them later, the operating system must back up the whole state of the task, including its call stack and the values of all CPU registers. This process is called a context switch.

As the call stack can be very large, the operating system typically sets up a separate call stack for each task instead of backing up the call stack content on each task switch. Such a task with its own stack is called a thread of execution or thread for short. By using a separate stack for each task, only the register contents need to be saved on a context switch (including the program counter and stack pointer). This approach minimizes the performance overhead of a context switch, which is very important since context switches often occur up to 100 times per second.

Discussion

The main advantage of preemptive multitasking is that the operating system can fully control the allowed execution time of a task. This way, it can guarantee that each task gets a fair share of the CPU time, without the need to trust the tasks to cooperate. This is especially important when running third-party tasks or when multiple users share a system.

The disadvantage of preemption is that each task requires its own stack. Compared to a shared stack, this results in higher memory usage per task and often limits the number of tasks in the system. Another disadvantage is that the operating system always has to save the complete CPU register state on each task switch, even if the task only used a small subset of the registers.

Preemptive multitasking and threads are fundamental components of an operating system because they make it possible to run untrusted userspace programs. We will discuss these concepts in full detail in future posts. For this post, however, we will focus on cooperative multitasking, which also provides useful capabilities for our kernel.

Cooperative Multitasking

Instead of forcibly pausing running tasks at arbitrary points in time, cooperative multitasking lets each task run until it voluntarily gives up control of the CPU. This allows tasks to pause themselves at convenient points in time, for example, when they need to wait for an I/O operation anyway.

Cooperative multitasking is often used at the language level, for example in the form of coroutines or async/await. The idea is that either the programmer or the compiler inserts yield operations into the program, which give up control of the CPU and allow other tasks to run. For example, a yield could be inserted after each iteration of a complex loop.

It is common to combine cooperative multitasking with asynchronous operations. Instead of waiting until an operation is finished and preventing other tasks from running during this time, asynchronous operations return a “not ready” status if the operation is not finished yet. In this case, the waiting task can execute a yield operation to let other tasks run.
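To make the idea of voluntary yielding more concrete, here is a minimal sketch of a round-robin cooperative scheduler in plain Rust. The `Step` enum, the `Task` alias, and the `counting_task` and `run_all` functions are hypothetical names invented for this illustration; they are not part of the kernel built in this post. Each task is a closure that runs until its next yield point and then reports whether it wants to be scheduled again.

```rust
use std::collections::VecDeque;

/// Result of running a task up to its next yield point (hypothetical type).
enum Step {
    /// The task gave up the CPU voluntarily and wants to run again later.
    Yielded,
    /// The task has finished.
    Done,
}

/// A cooperative task: a closure that runs until its next yield point.
type Task = Box<dyn FnMut(&mut Vec<String>) -> Step>;

/// Creates a task that logs one line per step and yields after each step.
fn counting_task(name: &'static str, steps: u32) -> Task {
    let mut done = 0;
    Box::new(move |log: &mut Vec<String>| {
        done += 1;
        log.push(format!("{name} step {done}"));
        if done >= steps { Step::Done } else { Step::Yielded }
    })
}

/// Round-robin scheduler: runs each task until it yields, then moves on.
fn run_all(tasks: Vec<Task>) -> Vec<String> {
    let mut queue: VecDeque<Task> = tasks.into();
    let mut log = Vec::new();
    while let Some(mut task) = queue.pop_front() {
        match task(&mut log) {
            // A task that yielded is put at the back of the queue.
            Step::Yielded => queue.push_back(task),
            // A finished task is simply dropped.
            Step::Done => {}
        }
    }
    log
}

fn main() {
    let log = run_all(vec![counting_task("A", 2), counting_task("B", 3)]);
    for line in &log {
        println!("{line}");
    }
}
```

Note that the scheduler has no way to interrupt a task: a closure that never returns would block all other tasks, which is exactly the drawback of the cooperative approach.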

Saving State

Since tasks define their pause points themselves, they don’t need the operating system to save their state. Instead, they can save exactly the state they need for continuation before they pause themselves, which often results in better performance. For example, a task that just finished a complex computation might only need to back up the final result of the computation since it does not need the intermediate results anymore.

Language-supported implementations of cooperative tasks are often even able to back up the required parts of the call stack before pausing. As an example, Rust’s async/await implementation stores all local variables that are still needed in an automatically generated struct (see below). By backing up the relevant parts of the call stack before pausing, all tasks can share a single call stack, which results in much lower memory consumption per task. This makes it possible to create an almost arbitrary number of cooperative tasks without running out of memory.

Discussion

The drawback of cooperative multitasking is that an uncooperative task can potentially run for an unlimited amount of time. Thus, a malicious or buggy task can prevent other tasks from running and slow down or even block the whole system. For this reason, cooperative multitasking should only be used when all tasks are known to cooperate. As a counterexample, it’s not a good idea to make the operating system rely on the cooperation of arbitrary user-level programs.

However, the strong performance and memory benefits of cooperative multitasking make it a good approach for usage within a program, especially in combination with asynchronous operations. Since an operating system kernel is a performance-critical program that interacts with asynchronous hardware, cooperative multitasking seems like a good approach for implementing concurrency.

Async/Await in Rust

The Rust language provides first-class support for cooperative multitasking in the form of async/await. Before we can explore what async/await is and how it works, we need to understand how futures and asynchronous programming work in Rust.

Futures

A future represents a value that might not be available yet. This could be, for example, an integer that is computed by another task or a file that is downloaded from the network. Instead of waiting until the value is available, futures make it possible to continue execution until the value is needed.

Example

The concept of futures is best illustrated with a small example:

This sequence diagram shows a main function that reads a file from the file system and then calls a function foo. This process is repeated two times: once with a synchronous read_file call and once with an asynchronous async_read_file call.

With the synchronous call, the main function needs to wait until the file is loaded from the file system. Only then can it call the foo function, which requires it to again wait for the result.

With the asynchronous async_read_file call, the file system directly returns a future and loads the file asynchronously in the background. This allows the main function to call foo much earlier, which then runs in parallel with the file load. In this example, the file load even finishes before foo returns, so main can directly work with the file without further waiting after foo returns.

Futures in Rust

In Rust, futures are represented by the Future trait, which looks like this:

pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output>;
}

The associated type Output specifies the type of the asynchronous value. For example, the async_read_file function in the diagram above would return a Future instance with Output set to File.

The poll method makes it possible to check whether the value is already available. It returns a Poll enum, which looks like this:

pub enum Poll<T> {
    Ready(T),
    Pending,
}

When the value is already available (e.g. the file was fully read from disk), it is returned wrapped in the Ready variant. Otherwise, the Pending variant is returned, which signals to the caller that the value is not yet available.
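As a small illustration of how a caller might inspect a `Poll` value, the following sketch uses the `Poll` type from the standard library. The `describe` function is a hypothetical helper written only for this example.

```rust
use std::task::Poll;

/// Turns a poll result into a human-readable description (hypothetical helper).
fn describe(result: Poll<u32>) -> String {
    match result {
        Poll::Ready(value) => format!("ready: {value}"),
        Poll::Pending => String::from("pending"),
    }
}

fn main() {
    println!("{}", describe(Poll::Ready(42)));
    println!("{}", describe(Poll::Pending));

    // `Poll::map` transforms the value inside `Ready` and leaves `Pending` as it is.
    let doubled = Poll::Ready(21).map(|v| v * 2);
    assert!(doubled.is_ready());
}
```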

The poll method takes two arguments: self: Pin<&mut Self> and cx: &mut Context. The former behaves similarly to a normal &mut self reference, except that the Self value is pinned to its memory location. Understanding Pin and why it is needed is difficult without understanding how async/await works first. We will therefore explain it later in this post.

The purpose of the cx: &mut Context parameter is to pass a Waker instance to the asynchronous task, e.g., the file system load. This Waker allows the asynchronous task to signal that it (or a part of it) is finished, e.g., that the file was loaded from disk. Since the main task knows that it will be notified when the Future is ready, it does not need to call poll over and over again. We will explain this process in more detail later in this post when we implement our own waker type.
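The interplay between poll and the Waker can be sketched with the standard library alone. The example below is only an illustration under simplifying assumptions: `CountingWaker` and `ReadyOnSecondPoll` are hypothetical types, and the future wakes itself immediately instead of being woken by real hardware or a file system.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that only counts how often it was woken (hypothetical type).
struct CountingWaker {
    wake_count: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wake_count.fetch_add(1, Ordering::SeqCst);
    }
}

/// A future that is pending on the first poll and ready on the second.
struct ReadyOnSecondPoll {
    polled_before: bool,
}

impl Future for ReadyOnSecondPoll {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.polled_before {
            Poll::Ready("file content")
        } else {
            self.polled_before = true;
            // Signal through the waker that polling again is worthwhile.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Polls the future twice and returns both results plus the number of wake-ups.
fn poll_twice() -> (Poll<&'static str>, Poll<&'static str>, usize) {
    let counter = Arc::new(CountingWaker { wake_count: AtomicUsize::new(0) });
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);

    let mut future = ReadyOnSecondPoll { polled_before: false };
    let first = Pin::new(&mut future).poll(&mut cx);
    let second = Pin::new(&mut future).poll(&mut cx);
    (first, second, counter.wake_count.load(Ordering::SeqCst))
}

fn main() {
    let (first, second, wakes) = poll_twice();
    println!("first: {first:?}, second: {second:?}, wake-ups: {wakes}");
}
```

A real executor would use the wake-up as the trigger for the second poll instead of polling twice unconditionally.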

Working with Futures

We now know how futures are defined and understand the basic idea behind the poll method. However, we still don’t know how to effectively work with futures. The problem is that futures represent the results of asynchronous tasks, which might not be available yet. In practice, however, we often need these values directly for further calculations. So the question is: How can we efficiently retrieve the value of a future when we need it?

Waiting on Futures

One possible answer is to wait until a future becomes ready. This could look something like this:

let future = async_read_file("foo.txt");
let file_content = loop {
    match future.poll(…) {
        Poll::Ready(value) => break value,
        Poll::Pending => {}, // do nothing
    }
};

Here we actively wait for the future by calling poll over and over again in a loop. The arguments to poll don’t matter here, so we omitted them. While this solution works, it is very inefficient because we keep the CPU busy until the value becomes available.
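For readers who want to run the busy-waiting loop, the following sketch fills in the omitted arguments using only the standard library. It assumes a hosted environment with `std`; `NoopWaker`, `busy_wait`, and `FakeRead` are hypothetical names, and `FakeRead` merely simulates a file read that needs a few polls to complete.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing; busy waiting never relies on notifications.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls `future` in a tight loop until it is ready and returns the value
/// together with the number of `poll` calls that were needed.
fn busy_wait<F: Future>(future: F) -> (F::Output, u32) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    // Pin the future to the stack so that `poll` can be called on it.
    let mut future = pin!(future);
    let mut polls = 0;
    loop {
        polls += 1;
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return (value, polls),
            Poll::Pending => {} // do nothing, just keep the CPU busy
        }
    }
}

/// Stand-in for a file read: pending for the first `remaining` polls.
struct FakeRead {
    remaining: u32,
}

impl Future for FakeRead {
    type Output = String;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<String> {
        if self.remaining == 0 {
            Poll::Ready(String::from("hello"))
        } else {
            self.remaining -= 1;
            Poll::Pending
        }
    }
}

fn main() {
    let (value, polls) = busy_wait(FakeRead { remaining: 3 });
    println!("got {value:?} after {polls} polls");
}
```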

A more efficient approach could be to block the current thread until the future becomes available. This is, of course, only possible if you have threads, so this solution does not work for our kernel, at least not yet. Even on systems where blocking is supported, it is often not desired because it turns an asynchronous task into a synchronous task again, thereby inhibiting the potential performance benefits of parallel tasks.

Future Combinators

An alternative to waiting is to use future combinators. Future combinators are methods like map that allow chaining and combining futures together, similar to the methods of the Iterator trait. Instead of waiting on the future, these combinators return a future themselves, which applies the mapping operation on poll.

As an example, a simple string_len combinator for converting a Future<Output = String> to a Future<Output = usize> could look like this:

struct StringLen<F> {
    inner_future: F,
}

impl<F> Future for StringLen<F> where F: Future<Output = String> {
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match self.inner_future.poll(cx) {
            Poll::Ready(s) => Poll::Ready(s.len()),
            Poll::Pending => Poll::Pending,
        }
    }
}

fn string_len(string: impl Future<Output = String>)
    -> impl Future<Output = usize>
{
    StringLen {
        inner_future: string,
    }
}

// Usage
fn file_len() -> impl Future<Output = usize> {
    let file_content_future = async_read_file("foo.txt");
    string_len(file_content_future)
}

This code does not quite work because it does not handle pinning, but it suffices as an example. The basic idea is that the string_len function wraps a given Future instance into a new StringLen struct, which also implements Future. When the wrapped future is polled, it polls the inner future. If the value is not ready yet, Poll::Pending is returned from the wrapped future too. If the value is ready, the string is extracted from the Poll::Ready variant and its length is calculated. Afterwards, it is wrapped in Poll::Ready again and returned.
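One way to obtain a compiling variant of this combinator is to restrict it to inner futures that implement `Unpin`, so that a pinned reference can be created with `Pin::new`. This is a simplifying assumption made for this sketch, not how the futures crate implements its combinators. `NoopWaker` and `poll_string_len_once` are hypothetical helpers used only to drive the example.

```rust
use std::future::{ready, Future};
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct StringLen<F> {
    inner_future: F,
}

// `F: Unpin` is the simplifying assumption of this sketch: it allows creating
// a `Pin<&mut F>` from a plain `&mut F` with `Pin::new`.
impl<F> Future for StringLen<F>
where
    F: Future<Output = String> + Unpin,
{
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match Pin::new(&mut self.inner_future).poll(cx) {
            Poll::Ready(s) => Poll::Ready(s.len()),
            Poll::Pending => Poll::Pending,
        }
    }
}

fn string_len(string: impl Future<Output = String> + Unpin) -> impl Future<Output = usize> {
    StringLen { inner_future: string }
}

/// A waker that does nothing (hypothetical helper).
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls the combinator once with an inner future that is already completed.
fn poll_string_len_once(text: &str) -> Poll<usize> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(string_len(ready(String::from(text))));
    future.as_mut().poll(&mut cx)
}

fn main() {
    println!("{:?}", poll_string_len_once("hello"));
}
```

The `Unpin` bound excludes most futures created by async functions, which is one reason why real combinator implementations handle pinning differently.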

With this string_len function, we can calculate the length of an asynchronous string without waiting for it. Since the function returns a Future again, the caller can’t work directly on the returned value, but needs to use combinator functions again. This way, the whole call graph becomes asynchronous and we can efficiently wait for multiple futures at once at some point, e.g., in the main function.

Because manually writing combinator functions is difficult, they are often provided by libraries. While the Rust standard library itself provides no combinator methods yet, the semi-official (and no_std compatible) futures crate does. Its FutureExt trait provides high-level combinator methods such as map or then, which can be used to manipulate the result with arbitrary closures.

Advantages

The big advantage of future combinators is that they keep the operations asynchronous. In combination with asynchronous I/O interfaces, this approach can lead to very high performance. The fact that future combinators are implemented as normal structs with trait implementations allows the compiler to optimize them aggressively. For more details, see the Zero-cost futures in Rust post, which announced the addition of futures to the Rust ecosystem.

Drawbacks

While future combinators make it possible to write very efficient code, they can be difficult to use in some situations because of the type system and the closure-based interface. For example, consider code like this:

fn example(min_len: usize) -> impl Future<Output = String> {
    async_read_file("foo.txt").then(move |content| {
        if content.len() < min_len {
            Either::Left(async_read_file("bar.txt").map(|s| content + &s))
        } else {
            Either::Right(future::ready(content))
        }
    })
}

(Try it on the playground)

Here we read the file foo.txt and then use the then combinator to chain a second future based on the file content. If the content length is smaller than the given min_len, we read a different bar.txt file and append it to content using the map combinator. Otherwise, we return only the content of foo.txt.

We need to use the move keyword for the closure passed to then because otherwise there would be a lifetime error for min_len. The reason for the Either wrapper is that if and else blocks must always have the same type. Since we return different future types in the blocks, we must use the wrapper type to unify them into a single type. The ready function wraps a value into a future, which is immediately ready. The function is required here because the Either wrapper expects that the wrapped value implements Future.

As you can imagine, this can quickly lead to very complex code for larger projects. It gets especially complicated if borrowing and different lifetimes are involved. For this reason, a lot of work was invested in adding support for async/await to Rust, with the goal of making asynchronous code radically simpler to write.

The Async/Await Pattern

The idea behind async/await is to let the programmer write code that looks like normal synchronous code, but is turned into asynchronous code by the compiler. It works based on the two keywords async and await. The async keyword can be used in a function signature to turn a synchronous function into an asynchronous function that returns a future:

async fn foo() -> u32 {
    0
}

// the above is roughly translated by the compiler to:
fn foo() -> impl Future<Output = u32> {
    future::ready(0)
}

This keyword alone wouldn’t be that useful. However, inside async functions, the await keyword can be used to retrieve the asynchronous value of a future:

async fn example(min_len: usize) -> String {
    let content = async_read_file("foo.txt").await;
    if content.len() < min_len {
        content + &async_read_file("bar.txt").await
    } else {
        content
    }
}

(Try it on the playground)

This function is a direct translation of the example function from above that used combinator functions. Using the .await operator, we can retrieve the value of a future without needing any closures or Either types. As a result, we can write our code like we write normal synchronous code, with the difference that this is still asynchronous code.
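To try the async version outside of the playground, it can be combined with a stand-in for async_read_file and a very small driver. Both are assumptions of this sketch: the stand-in returns canned strings instead of reading files, and `block_on` is a hypothetical busy-polling helper that requires `std`, not the executor developed for the kernel.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for `async_read_file`: returns canned content
/// immediately instead of touching a file system.
async fn async_read_file(name: &str) -> String {
    match name {
        "foo.txt" => String::from("foo"),
        _ => String::from("bar"),
    }
}

async fn example(min_len: usize) -> String {
    let content = async_read_file("foo.txt").await;
    if content.len() < min_len {
        content + &async_read_file("bar.txt").await
    } else {
        content
    }
}

/// A waker that does nothing (hypothetical helper).
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling driver; sufficient here because nothing stays pending.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    println!("{}", block_on(example(2)));
    println!("{}", block_on(example(10)));
}
```

With the canned contents, a `min_len` of 2 yields only the content of foo.txt, while a `min_len` of 10 causes bar.txt to be appended.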

State Machine Transformation

Behind the scenes, the compiler converts the body of the async function into a state machine, with each .await call representing a different state. For the above example function, the compiler creates a state machine with the following four states:

Each state represents a different pause point in the function. The “Start” and “End” states represent the function at the beginning and end of its execution. The “Waiting on foo.txt” state represents that the function is currently waiting for the first async_read_file result. Similarly, the “Waiting on bar.txt” state represents the pause point where the function is waiting on the second async_read_file result.

The state machine implements the Future trait by making each poll call a possible state transition:

The diagram uses arrows to represent state switches and diamond shapes to represent alternative ways. For example, if the foo.txt file is not ready, the path marked with “no” is taken and the “Waiting on foo.txt” state is reached. Otherwise, the “yes” path is taken. The small red diamond without a caption represents the if content.len() < min_len branch of the example function.

We see that the first poll call starts the function and lets it run until it reaches a future that is not ready yet. If all futures on the path are ready, the function can run till the “End” state, where it returns its result wrapped in Poll::Ready. Otherwise, the state machine enters a waiting state and returns Poll::Pending. On the next poll call, the state machine then starts from the last waiting state and retries the last operation.

Saving State

In order to be able to continue from the last waiting state, the state machine must keep track of the current state internally. In addition, it must save all the variables that it needs to continue execution on the next poll call. This is where the compiler can really shine: Since it knows which variables are used when, it can automatically generate structs with exactly the variables that are needed.

As an example, the compiler generates structs like the following for the above example function:

// The `example` function again so that you don't have to scroll up
async fn example(min_len: usize) -> String {
    let content = async_read_file("foo.txt").await;
    if content.len() < min_len {
        content + &async_read_file("bar.txt").await
    } else {
        content
    }
}

// The compiler-generated state structs:

struct StartState {
    min_len: usize,
}

struct WaitingOnFooTxtState {
    min_len: usize,
    foo_txt_future: impl Future<Output = String>,
}

struct WaitingOnBarTxtState {
    content: String,
    bar_txt_future: impl Future<Output = String>,
}

struct EndState {}

In the “start” and “Waiting on foo.txt” states, the min_len parameter needs to be stored for the later comparison with content.len(). The “Waiting on foo.txt” state additionally stores a foo_txt_future, which represents the future returned by the async_read_file call. This future needs to be polled again when the state machine continues, so it needs to be saved.

The “Waiting on bar.txt” state contains the content variable for the later string concatenation when bar.txt is ready. It also stores a bar_txt_future that represents the in-progress load of bar.txt. The struct does not contain the min_len variable because it is no longer needed after the content.len() comparison. In the “end” state, no variables are stored because the function has already run to completion.

Keep in mind that this is only an example of the code that the compiler could generate. The struct names and the field layout are implementation details and might be different.

The Full State Machine Type

While the exact compiler-generated code is an implementation detail, it helps to imagine how the generated state machine could look for the example function. We already defined the structs representing the different states and containing the required variables. To create a state machine on top of them, we can combine them into an enum:

enum ExampleStateMachine {
    Start(StartState),
    WaitingOnFooTxt(WaitingOnFooTxtState),
    WaitingOnBarTxt(WaitingOnBarTxtState),
    End(EndState),
}

We define a separate enum variant for each state and add the corresponding state struct to each variant as a field. To implement the state transitions, the compiler generates an implementation of the Future trait based on the example function:

impl Future for ExampleStateMachine {
    type Output = String; // return type of `example`

    fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
        loop {
            match self { // TODO: handle pinning
                ExampleStateMachine::Start(state) => {…}
                ExampleStateMachine::WaitingOnFooTxt(state) => {…}
                ExampleStateMachine::WaitingOnBarTxt(state) => {…}
                ExampleStateMachine::End(state) => {…}
            }
        }
    }
}

The Output type of the future is String because it’s the return type of the example function. To implement the poll function, we use a match statement on the current state inside a loop. The idea is that we switch to the next state as long as possible and use an explicit return Poll::Pending when we can’t continue.

For simplicity, we only show simplified code and don’t handle pinning, ownership, lifetimes, etc. So this and the following code should be treated as pseudo-code and not used directly. Of course, the real compiler-generated code handles everything correctly, albeit possibly in a different way.

To keep the code excerpts small, we present the code for each match arm separately. Let’s begin with the Start state:

ExampleStateMachine::Start(state) => {
    // from body of `example`
    let foo_txt_future = async_read_file("foo.txt");
    // `.await` operation
    let state = WaitingOnFooTxtState {
        min_len: state.min_len,
        foo_txt_future,
    };
    *self = ExampleStateMachine::WaitingOnFooTxt(state);
}

The state machine is in the Start state when it is right at the beginning of the function. In this case, we execute all the code from the body of the example function until the first .await. To handle the .await operation, we change the state of the self state machine to WaitingOnFooTxt, which includes the construction of the WaitingOnFooTxtState struct.

Since the match self {…} statement is executed in a loop, the execution jumps to the WaitingOnFooTxt arm next:

ExampleStateMachine::WaitingOnFooTxt(state) => {
    match state.foo_txt_future.poll(cx) {
        Poll::Pending => return Poll::Pending,
        Poll::Ready(content) => {
            // from body of `example`
            if content.len() < state.min_len {
                let bar_txt_future = async_read_file("bar.txt");
                // `.await` operation
                let state = WaitingOnBarTxtState {
                    content,
                    bar_txt_future,
                };
                *self = ExampleStateMachine::WaitingOnBarTxt(state);
            } else {
                *self = ExampleStateMachine::End(EndState {});
                return Poll::Ready(content);
            }
        }
    }
}

In this match arm, we first call the poll function of the foo_txt_future. If it is not ready, we exit the loop and return Poll::Pending. Since self stays in the WaitingOnFooTxt state in this case, the next poll call on the state machine will enter the same match arm and retry polling the foo_txt_future.

When the foo_txt_future is ready, we assign the result to the content variable and continue to execute the code of the example function: If content.len() is smaller than the min_len saved in the state struct, the bar.txt file is read asynchronously. We again translate the .await operation into a state change, this time into the WaitingOnBarTxt state. Since we’re executing the match inside a loop, the execution directly jumps to the match arm for the new state afterward, where the bar_txt_future is polled.

In case we enter the else branch, no further .await operation occurs. We reach the end of the function and return content wrapped in Poll::Ready. We also change the current state to the End state.

The code for the WaitingOnBarTxt state looks like this:

ExampleStateMachine::WaitingOnBarTxt(state) => {
    match state.bar_txt_future.poll(cx) {
        Poll::Pending => return Poll::Pending,
        Poll::Ready(bar_txt) => {
            *self = ExampleStateMachine::End(EndState {});
            // from body of `example`
            return Poll::Ready(state.content + &bar_txt);
        }
    }
}

Similar to the WaitingOnFooTxt state, we start by polling the bar_txt_future. If it is still pending, we exit the loop and return Poll::Pending. Otherwise, we can perform the last operation of the example function: concatenating the content variable with the result from the future. We update the state machine to the End state and then return the result wrapped in Poll::Ready.

Finally, the code for the End state looks like this:

ExampleStateMachine::End(_) => {
    panic!("poll called after Poll::Ready was returned");
}

Futures should not be polled again after they returned Poll::Ready, so we panic if poll is called while we are already in the End state.
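For readers who want to experiment, the match arms above can be assembled into a version that actually compiles. The following is only a sketch under several assumptions that differ from the pseudo-code: async_read_file is replaced by a hypothetical stand-in that returns an already-completed `std::future::Ready<String>`, the state structs are folded into struct-like enum variants, and ownership is handled by temporarily swapping the state out with `mem::replace`. Because all fields are `Unpin`, pinning can be sidestepped with `get_mut`. The real compiler-generated code works differently.

```rust
use std::future::{ready, Future, Ready};
use std::mem;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for `async_read_file` that is ready immediately.
fn async_read_file(name: &str) -> Ready<String> {
    ready(String::from(if name == "foo.txt" { "foo" } else { "bar" }))
}

enum ExampleStateMachine {
    Start { min_len: usize },
    WaitingOnFooTxt { min_len: usize, foo_txt_future: Ready<String> },
    WaitingOnBarTxt { content: String, bar_txt_future: Ready<String> },
    End,
}

impl Future for ExampleStateMachine {
    type Output = String;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<String> {
        // Every field is `Unpin`, so the enum may be accessed through `&mut`.
        let this = self.get_mut();
        loop {
            // Take the current state by value; `End` is a temporary placeholder.
            match mem::replace(this, ExampleStateMachine::End) {
                ExampleStateMachine::Start { min_len } => {
                    let foo_txt_future = async_read_file("foo.txt");
                    *this = ExampleStateMachine::WaitingOnFooTxt { min_len, foo_txt_future };
                }
                ExampleStateMachine::WaitingOnFooTxt { min_len, mut foo_txt_future } => {
                    let result = Pin::new(&mut foo_txt_future).poll(cx);
                    match result {
                        Poll::Pending => {
                            // Restore the state so that the next poll retries.
                            *this = ExampleStateMachine::WaitingOnFooTxt { min_len, foo_txt_future };
                            return Poll::Pending;
                        }
                        Poll::Ready(content) if content.len() < min_len => {
                            let bar_txt_future = async_read_file("bar.txt");
                            *this = ExampleStateMachine::WaitingOnBarTxt { content, bar_txt_future };
                        }
                        // `this` already holds `End` at this point.
                        Poll::Ready(content) => return Poll::Ready(content),
                    }
                }
                ExampleStateMachine::WaitingOnBarTxt { content, mut bar_txt_future } => {
                    let result = Pin::new(&mut bar_txt_future).poll(cx);
                    match result {
                        Poll::Pending => {
                            *this = ExampleStateMachine::WaitingOnBarTxt { content, bar_txt_future };
                            return Poll::Pending;
                        }
                        Poll::Ready(bar_txt) => return Poll::Ready(content + &bar_txt),
                    }
                }
                ExampleStateMachine::End => {
                    panic!("poll called after Poll::Ready was returned")
                }
            }
        }
    }
}

fn example(min_len: usize) -> ExampleStateMachine {
    ExampleStateMachine::Start { min_len }
}

/// A waker that does nothing (hypothetical helper).
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls the state machine once. The stand-in futures are always ready,
/// so a single poll runs the machine to completion.
fn poll_once(mut machine: ExampleStateMachine) -> Poll<String> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    Pin::new(&mut machine).poll(&mut cx)
}

fn main() {
    println!("{:?}", poll_once(example(2)));
    println!("{:?}", poll_once(example(10)));
}
```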

We now know what the compiler-generated state machine and its implementation of the Future trait could look like. In practice, the compiler generates code in a different way. (In case you’re interested, the implementation is currently based on coroutines, but this is only an implementation detail.)

The last piece of the puzzle is the generated code for the example function itself. Remember, the function header was defined like this:

async fn example(min_len: usize) -> String

Since the complete function body is now implemented by the state machine, the only thing that the function needs to do is to initialize the state machine and return it. The generated code for this could look like this:

fn example(min_len: usize) -> ExampleStateMachine {
    ExampleStateMachine::Start(StartState {
        min_len,
    })
}

The function no longer has an async modifier since it now explicitly returns an ExampleStateMachine type, which implements the Future trait. As expected, the state machine is constructed in the Start state and the corresponding state struct is initialized with the min_len parameter.

Note that this function does not start the execution of the state machine. This is a fundamental design decision of futures in Rust: they do nothing until they are polled for the first time.
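This lazy behavior can be observed with a small experiment. The sketch below assumes a hosted environment with `std`; `set_flag`, `observe_laziness`, and `NoopWaker` are hypothetical names. The async function sets a flag when its body runs, and the flag is checked before and after the first poll.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing (hypothetical helper).
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Sets the flag when its body actually runs.
async fn set_flag(flag: &Cell<bool>) -> u32 {
    flag.set(true);
    42
}

/// Returns whether the body had run before the first poll and after it.
fn observe_laziness() -> (bool, bool) {
    let flag = Cell::new(false);
    // Calling the async function only constructs the state machine.
    let mut future = pin!(set_flag(&flag));
    let before_poll = flag.get();

    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    // The first poll executes the body up to its end.
    let result = future.as_mut().poll(&mut cx);
    assert_eq!(result, Poll::Ready(42));

    (before_poll, flag.get())
}

fn main() {
    let (before, after) = observe_laziness();
    println!("body ran before first poll: {before}, after first poll: {after}");
}
```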

Pinning

We already stumbled across pinning multiple times in this post. Now is finally the time to explore what pinning is and why it is needed.

Self-Referential Structs

As explained above, the state machine transformation stores the local variables of each pause point in a struct. For small examples like our example function, this was straightforward and did not lead to any problems. However, things become more difficult when variables reference each other. For example, consider this function:

async fn pin_example() -> i32 {
    let array = [1, 2, 3];
    let element = &array[2];
    async_write_file("foo.txt", element.to_string()).await;
    *element
}

This function creates a small array with the contents 1, 2, and 3. It then creates a reference to the last array element and stores it in an element variable. Next, it asynchronously writes the number converted to a string to a foo.txt file. Finally, it returns the number referenced by element.

Since the function uses a single await operation, the resulting state machine has three states: start, end, and "waiting on write". The function takes no arguments, so the struct for the start state is empty. Like before, the struct for the end state is empty because the function is finished at this point. The struct for the "waiting on write" state is more interesting:

struct WaitingOnWriteState {
    array: [1, 2, 3],
    element: 0x1001c, // address of the last array element
}

We need to store both the array and element variables because element is required for the return value and array is referenced by element. Since element is a reference, it stores a pointer (i.e., a memory address) to the referenced element. We used 0x1001c as an example memory address here. In reality, it needs to be the address of the last element of the array field, so it depends on where the struct lives in memory. Structs with such internal pointers are called self-referential structs because they reference themselves from one of their fields.

The Problem with Self-Referential Structs

The internal pointer of our self-referential struct leads to a fundamental problem, which becomes apparent when we look at its memory layout:

The array field starts at address 0x10014 and the element field at address 0x10020. It points to address 0x1001c because the last array element lives at this address. At this point, everything is still fine. However, an issue occurs when we move this struct to a different memory address:

We moved the struct a bit so that it starts at address 0x10024 now. This could, for example, happen when we pass the struct as a function argument or assign it to a different stack variable. The problem is that the element field still points to address 0x1001c even though the last array element now lives at address 0x1002c. Thus, the pointer is dangling, with the result that undefined behavior occurs on the next poll call.
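The effect of such a move can be observed without triggering undefined behavior if we store the internal pointer as a plain address and only compare addresses instead of dereferencing them. The following sketch assumes a hosted target with std; the element_addr field and the move_demo function are made up for this illustration, and the move is forced by boxing the struct rather than by passing it to a function.

```rust
/// Stand-in for the state struct. The internal pointer is stored as a plain
/// address (usize) so that the example never dereferences a dangling pointer.
struct WaitingOnWriteState {
    array: [i32; 3],
    element_addr: usize,
}

/// Returns (stored address, actual address of the last element) after a move.
fn move_demo() -> (usize, usize) {
    let mut state = WaitingOnWriteState {
        array: [1, 2, 3],
        element_addr: 0,
    };
    // Make the struct "self-referential" while it lives on the stack.
    state.element_addr = &state.array[2] as *const i32 as usize;

    // Moving the struct to the heap copies its bytes to a new location,
    // but the stored address is copied verbatim and is now stale.
    let moved = Box::new(state);
    let actual_addr = &moved.array[2] as *const i32 as usize;

    (moved.element_addr, actual_addr)
}
```

The two returned addresses differ, which is exactly the dangling-pointer situation described above. A real reference in place of element_addr would now point into memory that no longer holds the array.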

Possible Solutions

There are three fundamental approaches to solving the dangling pointer problem:

Update the pointer on move: The idea is to update the internal pointer whenever the struct is moved in memory so that it is still valid after the move. Unfortunately, this approach would require extensive changes to Rust that would result in potentially huge performance losses. The reason is that some kind of runtime would need to keep track of the type of all struct fields and check on every move operation whether a pointer update is required.
Store an offset instead of self-references: To avoid the requirement for updating pointers, the compiler could try to store self-references as offsets from the struct's beginning instead. For example, the element field of the above WaitingOnWriteState struct could be stored in the form of an element_offset field with a value of 8 because the array element that the reference points to starts 8 bytes after the struct's beginning. Since the offset stays the same when the struct is moved, no field updates are required.

The problem with this approach is that it requires the compiler to detect all self-references. This is not possible at compile-time because the value of a reference might depend on user input, so we would need a runtime system again to analyze references and correctly create the state structs. This would not only result in runtime costs but also prevent certain compiler optimizations, so that it would cause large performance losses again.
Forbid moving the struct: As we saw above, the dangling pointer only occurs when we move the struct in memory. By completely forbidding move operations on self-referential structs, the problem can also be avoided. The big advantage of this approach is that it can be implemented at the type system level without additional runtime costs. The drawback is that it puts the burden of dealing with move operations on possibly self-referential structs on the programmer.
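The offset-based approach from the second item can be written by hand when the self-reference is known in advance, which is exactly the knowledge the compiler lacks in general. The following sketch is an illustration under that assumption and not something the compiler generates; OffsetState and offset_demo are hypothetical names, and repr(C) is used so that the array field is guaranteed to start at offset 0.

```rust
/// Hand-written variant of the state struct that stores an offset.
/// `repr(C)` fixes the field order so that `array` starts at offset 0.
#[repr(C)]
struct OffsetState {
    array: [i32; 3],
    element_offset: usize, // bytes from the start of the struct
}

impl OffsetState {
    fn new() -> Self {
        OffsetState {
            array: [1, 2, 3],
            // The last element starts two `i32`s (8 bytes) after the struct start.
            element_offset: 2 * std::mem::size_of::<i32>(),
        }
    }

    /// Resolves the offset relative to the struct's current address.
    fn element(&self) -> i32 {
        let base = self as *const Self as *const u8;
        // Sound for this sketch: the offset stays inside `array`, which lives
        // at offset 0, and the resulting address is aligned for `i32`.
        unsafe { *(base.add(self.element_offset) as *const i32) }
    }
}

/// Returns the referenced element before and after moving the struct.
fn offset_demo() -> (i32, i32) {
    let state = OffsetState::new();
    let before = state.element();
    let moved = Box::new(state); // move the struct to a different address
    (before, moved.element())
}
```

Both reads return 3 because the offset is resolved against wherever the struct currently lives.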

Rust chose the third solution because of its principle of providing zero cost abstractions, which means that abstractions should not impose additional runtime costs. The pinning API was proposed for this purpose in RFC 2349. In the following, we will give a short overview of this API and explain how it works with async/await and futures.

Heap Values

The first observation is that heap-allocated values already have a fixed memory address most of the time. They are created using a call to allocate and then referenced by a pointer type such as Box<T>. While moving the pointer type is possible, the heap value that the pointer points to stays at the same memory address until it is freed through a deallocate call again.

Using heap allocation, we can try to create a self-referential struct:

fn main() {
    let mut heap_value = Box::new(SelfReferential {
        self_ptr: 0 as *const _,
    });
    let ptr = &*heap_value as *const SelfReferential;
    heap_value.self_ptr = ptr;
    println!("heap value at: {:p}", heap_value);
    println!("internal reference: {:p}", heap_value.self_ptr);
}

struct SelfReferential {
    self_ptr: *const Self,
}

(Try it on the playground)

We create a simple struct named SelfReferential that contains a single pointer field. First, we initialize this struct with a null pointer and then allocate it on the heap using Box::new. We then determine the memory address of the heap-allocated struct and store it in a ptr variable. Finally, we make the struct self-referential by assigning the ptr variable to the self_ptr field.

When we execute this code on the playground, we see that the address of the heap value and its internal pointer are equal, which means that the self_ptr field is a valid self-reference. Since the heap_value variable is only a pointer, moving it (e.g., by passing it to a function) does not change the address of the struct itself, so the self_ptr stays valid even if the pointer is moved.

However, there is still a way to break this example: We can move out of a Box<T> or replace its content:

let stack_value = mem::replace(&mut *heap_value, SelfReferential {
    self_ptr: 0 as *const _,
});
println!("value at: {:p}", &stack_value);
println!("internal reference: {:p}", stack_value.self_ptr);

(Try it on the playground)

Here we use the mem::replace function to replace the heap-allocated value with a new struct instance. This allows us to move the original heap_value to the stack, while the self_ptr field of the struct is now a dangling pointer that still points to the old heap address. When you try to run the example on the playground, you see that the printed "value at:" and "internal reference:" lines indeed show different pointers. So heap allocating a value is not enough to make self-references safe.

The fundamental problem that allowed the above breakage is that Box<T> allows us to get a &mut T reference to the heap-allocated value. This &mut reference makes it possible to use methods like mem::replace or mem::swap to invalidate the heap-allocated value. To resolve this problem, we must prevent &mut references to self-referential structs from being created.

`Pin<Box<T>>` and `Unpin`

The pinning API provides a solution to the &mut T problem in the form of the Pin wrapper type and the Unpin marker trait. The idea behind these types is to gate all methods of Pin that can be used to get &mut references to the wrapped value (e.g., get_mut or deref_mut) on the Unpin trait. The Unpin trait is an auto trait, which is automatically implemented for all types except those that explicitly opt out. By making self-referential structs opt out of Unpin, there is no (safe) way to get a &mut T from a Pin<Box<T>> type for them. As a result, their internal self-references are guaranteed to stay valid.

As an example, let's update the SelfReferential type from above to opt out of Unpin:

use core::marker::PhantomPinned;

struct SelfReferential {
    self_ptr: *const Self,
    _pin: PhantomPinned,
}

We opt out by adding a second _pin field of type PhantomPinned. This type is a zero-sized marker type whose only purpose is to not implement the Unpin trait. Because of the way auto traits work, a single field that is not Unpin suffices to make the complete struct opt out of Unpin.
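The gating on Unpin can be illustrated with a small sketch for a hosted target. NotUnpin, assert_unpin, and unpin_demo are hypothetical names for this example; the commented-out lines are the ones we would expect the compiler to reject.

```rust
use std::marker::PhantomPinned;
use std::pin::Pin;

struct NotUnpin {
    value: u32,
    _pin: PhantomPinned,
}

/// Compiles only for types that implement `Unpin`.
fn assert_unpin<T: Unpin>() {}

fn unpin_demo() -> (u32, u32) {
    assert_unpin::<u32>();
    // A `Box` is `Unpin` even if its content is not, since moving the box
    // does not move the heap value.
    assert_unpin::<Box<NotUnpin>>();
    // assert_unpin::<NotUnpin>(); // would not compile

    // For `Unpin` types, `Pin` hands out `&mut` references in safe code.
    let mut pinned_int: Pin<Box<u32>> = Box::pin(1);
    *pinned_int.as_mut().get_mut() = 2;

    // For types that are not `Unpin`, safe code only gets shared access.
    let pinned: Pin<Box<NotUnpin>> = Box::pin(NotUnpin {
        value: 3,
        _pin: PhantomPinned,
    });
    // pinned.as_mut().get_mut(); // would not compile

    (*pinned_int, pinned.value)
}
```

The single PhantomPinned field is enough to remove get_mut from the safe API of the pinned NotUnpin value, while the pinned u32 can still be modified freely.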

The second step is to change the Box<SelfReferential> type in the example to a Pin<Box<SelfReferential>> type. The easiest way to do this is to use the Box::pin function instead of Box::new for creating the heap-allocated value:

let mut heap_value = Box::pin(SelfReferential {
    self_ptr: 0 as *const _,
    _pin: PhantomPinned,
});

In addition to changing Box::new to Box::pin, we also need to add the new _pin field in the struct initializer. Since PhantomPinned is a zero-sized type, we only need its type name to initialize it.

When we try to run our adjusted example now, we see that it no longer works:

error[E0594]: cannot assign to data in a dereference of `std::pin::Pin<std::boxed::Box<SelfReferential>>`
  --> src/main.rs:10:5
   |
10 |     heap_value.self_ptr = ptr;
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^ cannot assign
   |
   = help: trait `DerefMut` is required to modify through a dereference, but it is not implemented for `std::pin::Pin<std::boxed::Box<SelfReferential>>`

error[E0596]: cannot borrow data in a dereference of `std::pin::Pin<std::boxed::Box<SelfReferential>>` as mutable
  --> src/main.rs:16:36
   |
16 |     let stack_value = mem::replace(&mut *heap_value, SelfReferential {
   |                                    ^^^^^^^^^^^^^^^^ cannot borrow as mutable
   |
   = help: trait `DerefMut` is required to modify through a dereference, but it is not implemented for `std::pin::Pin<std::boxed::Box<SelfReferential>>`

Both errors occur because the Pin<Box<SelfReferential>> type no longer implements the DerefMut trait. This is exactly what we wanted because the DerefMut trait would return a &mut reference, which we wanted to prevent. This only happens because we both opted out of Unpin and changed Box::new to Box::pin.

The problem now is that the compiler does not only prevent moving the type in line 16, but also forbids initializing the self_ptr field in line 10. This happens because the compiler can't differentiate between valid and invalid uses of &mut references. To get the initialization working again, we have to use the unsafe get_unchecked_mut method:

// safe because modifying a field doesn't move the whole struct
unsafe {
    let mut_ref = Pin::as_mut(&mut heap_value);
    Pin::get_unchecked_mut(mut_ref).self_ptr = ptr;
}

(Try it on the playground)

The get_unchecked_mut function works on a Pin<&mut T> instead of a Pin<Box<T>>, so we have to use Pin::as_mut for converting the value. Then we can set the self_ptr field using the &mut reference returned by get_unchecked_mut.

Now the only error left is the desired error on mem::replace. Remember, this operation tries to move the heap-allocated value to the stack, which would break the self-reference stored in the self_ptr field. By opting out of Unpin and using Pin<Box<T>>, we can prevent this operation at compile time and thus safely work with self-referential structs. As we saw, the compiler is not able to prove that the creation of the self-reference is safe (yet), so we need to use an unsafe block and verify the correctness ourselves.
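Putting the pieces of this section together, a complete version of the example might look like the following sketch. It assumes a hosted target with std; the new_pinned and is_self_reference_valid helpers are hypothetical names that do not appear in the playground example, and ptr::null() is used in place of the 0 as *const _ cast.

```rust
use std::marker::PhantomPinned;
use std::pin::Pin;
use std::ptr;

struct SelfReferential {
    self_ptr: *const Self,
    _pin: PhantomPinned,
}

/// Creates the pinned heap value and initializes its self-reference.
fn new_pinned() -> Pin<Box<SelfReferential>> {
    let mut heap_value = Box::pin(SelfReferential {
        self_ptr: ptr::null(),
        _pin: PhantomPinned,
    });
    let ptr = &*heap_value as *const SelfReferential;
    // Safe because assigning a field does not move the struct itself.
    unsafe {
        Pin::get_unchecked_mut(heap_value.as_mut()).self_ptr = ptr;
    }
    heap_value
}

/// Takes ownership of the handle, which moves only the pointer, and checks
/// that the internal pointer still matches the address of the heap value.
fn is_self_reference_valid(value: Pin<Box<SelfReferential>>) -> bool {
    let current = &*value as *const SelfReferential;
    ptr::eq(current, value.self_ptr)
}
```

Passing the result of new_pinned to is_self_reference_valid moves the Pin<Box<…>> handle into another function, yet the check succeeds because the heap value itself never changes its address.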

Stack Pinning and `Pin<&mut T>`

In the previous section, we learned how to use Pin<Box<T>> to safely create a heap-allocated self-referential value. While this approach works fine and is relatively safe (apart from the unsafe construction), the required heap allocation comes with a performance cost. Since Rust strives to provide zero-cost abstractions whenever possible, the pinning API also allows creating Pin<&mut T> instances that point to stack-allocated values.

Unlike Pin<Box<T>> instances, which have ownership of the wrapped value, Pin<&mut T> instances only temporarily borrow the wrapped value. This makes things more complicated, as it requires the programmer to ensure additional guarantees themselves. Most importantly, a Pin<&mut T> must stay pinned for the whole lifetime of the referenced T, which can be difficult to verify for stack-based variables. To help with this, crates like pin-utils exist, but I still wouldn't recommend pinning to the stack unless you really know what you're doing.

For further reading, check out the documentation of the pin module and the Pin::new_unchecked method.
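Toolchains newer than the one this post was written against also ship a pin! macro in the standard library (std::pin::pin, stabilized in Rust 1.68) that performs stack pinning without unsafe code. The following sketch assumes such a toolchain and a hosted target; answer, NoopWake, and stack_pin_demo are hypothetical names used only here.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that ignores all notifications.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

async fn answer() -> u32 {
    42
}

fn stack_pin_demo() -> Poll<u32> {
    // `pin!` moves the future into a hidden local variable and returns a
    // `Pin<&mut _>` to it, so the future cannot be moved afterwards.
    let mut future = pin!(answer());

    let waker = Waker::from(Arc::new(NoopWake));
    let mut context = Context::from_waker(&waker);
    future.as_mut().poll(&mut context)
}
```

No heap allocation is involved here. The price is that the pinned future cannot outlive the stack frame of stack_pin_demo, which is why this form does not fit futures that need to be stored elsewhere.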

Pinning and Futures

As we already saw in this post, the Future::poll method uses pinning in the form of a Pin<&mut Self> parameter:

fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output>

The reason that this method takes self: Pin<&mut Self> instead of the normal &mut self is that future instances created from async/await are often self-referential, as we saw above. By wrapping Self into Pin and letting the compiler opt out of Unpin for self-referential futures generated from async/await, it is guaranteed that the futures are not moved in memory between poll calls. This ensures that all internal references are still valid.

It is worth noting that moving futures before the first poll call is fine. This is a result of the fact that futures are lazy and do nothing until they're polled for the first time. The start state of the generated state machines therefore only contains the function arguments but no internal references. In order to call poll, the caller must wrap the future into Pin first, which ensures that the future cannot be moved in memory anymore. Since stack pinning is more difficult to get right, I recommend always using Box::pin combined with Pin::as_mut for this.

In case you're interested in understanding how to safely implement a future combinator function using stack pinning yourself, take a look at the relatively short source of the map combinator method of the futures crate and the section about projections and structural pinning of the pin documentation.

Executors and Wakers

Using async/await, it is possible to ergonomically work with futures in a completely asynchronous way. However, as we learned above, futures do nothing until they are polled. This means we have to call poll on them at some point, otherwise the asynchronous code is never executed.

With a single future, we can always wait for it manually using a loop as described above. However, this approach is very inefficient and not practical for programs that create a large number of futures. The most common solution to this problem is to define a global executor that is responsible for polling all futures in the system until they are finished.

Executors

The purpose of an executor is to allow spawning futures as independent tasks, typically through some sort of spawn method. The executor is then responsible for polling all futures until they are completed. The big advantage of managing all futures in a central place is that the executor can switch to a different future whenever a future returns Poll::Pending. Thus, asynchronous operations are run in parallel and the CPU is kept busy.

Many executor implementations can also take advantage of systems with multiple CPU cores. They create a thread pool that is able to utilize all cores if there is enough work available and use techniques such as work stealing to balance the load between cores. There are also special executor implementations for embedded systems that optimize for low latency and memory overhead.

To avoid the overhead of polling futures repeatedly, executors typically take advantage of the waker API supported by Rust's futures.

Wakers

The idea behind the waker API is that a special Waker type is passed to each invocation of poll, wrapped in the Context type. This Waker type is created by the executor and can be used by the asynchronous task to signal its (partial) completion. As a result, the executor does not need to call poll on a future that previously returned Poll::Pending until it is notified by the corresponding waker.

This is best illustrated by a small example:

async fn write_file() {
    async_write_file("foo.txt", "Hello").await;
}

This function asynchronously writes the string "Hello" to a foo.txt file. Since hard disk writes take some time, the first poll call on this future will likely return Poll::Pending. However, the hard disk driver will internally store the Waker passed to the poll call and use it to notify the executor when the file is written to disk. This way, the executor does not need to waste any time trying to poll the future again before it receives the waker notification.

We will see how the Waker type works in detail when we create our own executor with waker support in the implementation section of this post.
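Until then, the interaction described above can be sketched on a hosted target with std. Everything in this example is hypothetical: WriteFuture stands in for the future behind async_write_file, the Shared struct plays the role of the driver state, and CountingWake is a waker that merely counts notifications instead of rescheduling a task.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// State shared between the future and a hypothetical "driver".
#[derive(Default)]
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

struct WriteFuture {
    shared: Arc<Mutex<Shared>>,
}

impl Future for WriteFuture {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<()> {
        let mut shared = self.shared.lock().unwrap();
        if shared.done {
            Poll::Ready(())
        } else {
            // Remember the waker so that the driver can notify the executor.
            shared.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

/// Waker that only counts how often it was notified.
struct CountingWake(AtomicUsize);

impl Wake for CountingWake {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns (first poll pending?, number of notifications, second poll ready?).
fn waker_demo() -> (bool, usize, bool) {
    let shared = Arc::new(Mutex::new(Shared::default()));
    let mut future = Box::pin(WriteFuture { shared: shared.clone() });

    let counter = Arc::new(CountingWake(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut context = Context::from_waker(&waker);

    let first_pending = future.as_mut().poll(&mut context).is_pending();

    // The "driver" finishes the operation and wakes the stored waker.
    let stored = {
        let mut guard = shared.lock().unwrap();
        guard.done = true;
        guard.waker.take()
    };
    if let Some(stored_waker) = stored {
        stored_waker.wake();
    }

    let second_ready = future.as_mut().poll(&mut context).is_ready();
    (first_pending, counter.0.load(Ordering::SeqCst), second_ready)
}
```

The first poll returns Poll::Pending and stores the waker, the driver triggers exactly one notification, and the second poll returns Poll::Ready. A real executor would use the notification to decide when that second poll should happen.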

Cooperative Multitasking?

At the beginning of this post, we talked about preemptive and cooperative multitasking. While preemptive multitasking relies on the operating system to forcibly switch between running tasks, cooperative multitasking requires that the tasks voluntarily give up control of the CPU through a yield operation on a regular basis. The big advantage of the cooperative approach is that tasks can save their state themselves, which results in more efficient context switches and makes it possible to share the same call stack between tasks.

It might not be immediately apparent, but futures and async/await are an implementation of the cooperative multitasking pattern:

Each future that is added to the executor is basically a cooperative task.
Instead of using an explicit yield operation, futures give up control of the CPU core by returning Poll::Pending (or Poll::Ready at the end).
- There is nothing that forces futures to give up the CPU. If they want, they can never return from poll, e.g., by spinning endlessly in a loop.
- Since each future can block the execution of the other futures in the executor, we need to trust them to not be malicious.
Futures internally store all the state they need to continue execution on the next poll call. With async/await, the compiler automatically detects all variables that are needed and stores them inside the generated state machine.
- Only the minimum state required for continuation is saved.
- Since the poll method gives up the call stack when it returns, the same stack can be used for polling other futures.

We see that futures and async/await fit the cooperative multitasking pattern perfectly; they just use some different terminology. In the following, we will therefore use the terms "task" and "future" interchangeably.
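The correspondence between returning Poll::Pending and a yield operation can be made visible with a small sketch for a hosted target. YieldNow, worker, and interleave_demo are hypothetical names; the round-robin loop is a stand-in for an executor and ignores wakers entirely.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Future that returns `Poll::Pending` once, which acts like a `yield`.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            Poll::Pending
        }
    }
}

type Log = Rc<RefCell<Vec<String>>>;

/// Cooperative task: logs a step, then voluntarily gives up the CPU.
async fn worker(name: &'static str, log: Log) {
    for step in 0..2 {
        log.borrow_mut().push(format!("{name}{step}"));
        YieldNow { yielded: false }.await;
    }
}

fn interleave_demo() -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let mut queue: VecDeque<Pin<Box<dyn Future<Output = ()>>>> = VecDeque::new();
    queue.push_back(Box::pin(worker("a", log.clone())));
    queue.push_back(Box::pin(worker("b", log.clone())));

    let waker = Waker::from(Arc::new(NoopWake));
    let mut context = Context::from_waker(&waker);
    // Round-robin: a task that returns `Pending` goes to the back of the queue.
    while let Some(mut task) = queue.pop_front() {
        if task.as_mut().poll(&mut context).is_pending() {
            queue.push_back(task);
        }
    }

    let result = log.borrow().clone();
    result
}
```

The resulting log is a0, b0, a1, b1: the two workers alternate because each one returns Poll::Pending after every step. If a worker looped without awaiting, the other one would never get to run.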

Implementation

Now that we understand how cooperative multitasking based on futures and async/await works in Rust, it's time to add support for it to our kernel. Since the Future trait is part of the core library and async/await is a feature of the language itself, there is nothing special we need to do to use it in our #![no_std] kernel. The only requirement is that we use at least nightly 2020-03-25 of Rust because async/await was not no_std compatible before.

With a recent-enough nightly, we can start using async/await in our main.rs:

// in src/main.rs

async fn async_number() -> u32 {
    42
}

async fn example_task() {
    let number = async_number().await;
    println!("async number: {}", number);
}

The async_number function is an async fn, so the compiler transforms it into a state machine that implements Future. Since the function only returns 42, the resulting future will directly return Poll::Ready(42) on the first poll call. Like async_number, the example_task function is also an async fn. It awaits the number returned by async_number and then prints it using the println macro.

To run the future returned by example_task, we need to call poll on it until it signals its completion by returning Poll::Ready. To do this, we need to create a simple executor type.
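Before building that executor, the idea of "polling until Poll::Ready" can be tried out on a hosted target. The following is a minimal sketch with hypothetical names (block_on, doubled, NoopWake); it busy-loops and uses std, so it is meant for experimentation rather than for the kernel.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

async fn async_number() -> u32 {
    42
}

/// Hypothetical task that awaits `async_number` and returns a derived value.
async fn doubled() -> u32 {
    async_number().await * 2
}

/// Polls `future` in a busy loop until it is ready.
/// Returns the output together with the number of `poll` calls.
fn block_on<F: Future>(future: F) -> (F::Output, usize) {
    let mut future = Box::pin(future);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut context = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(output) = future.as_mut().poll(&mut context) {
            return (output, polls);
        }
    }
}
```

Calling block_on(doubled()) returns (84, 1): since neither function waits for anything, a single poll call runs the whole chain to completion.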

Task

Before we start the executor implementation, we create a new task module with a Task type:

// in src/lib.rs

pub mod task;

// in src/task/mod.rs

use core::{future::Future, pin::Pin};
use alloc::boxed::Box;

pub struct Task {
    future: Pin<Box<dyn Future<Output = ()>>>,
}

The Task struct is a newtype wrapper around a pinned, heap-allocated, and dynamically dispatched future with the empty type () as output. Let's go through it in detail:

We require that the future associated with a task returns (). This means that tasks don't return any result; they are just executed for their side effects. For example, the example_task function we defined above has no return value, but it prints something to the screen as a side effect.
The dyn keyword indicates that we store a trait object in the Box. This means that the methods on the future are dynamically dispatched, allowing different types of futures to be stored in the Task type. This is important because each async fn has its own type and we want to be able to create multiple different tasks.
As we learned in the section about pinning, the Pin<Box> type ensures that a value cannot be moved in memory by placing it on the heap and preventing the creation of &mut references to it. This is important because futures generated by async/await might be self-referential, i.e., contain pointers to themselves that would be invalidated when the future is moved.

To allow the creation of new Task structs from futures, we create a new function:

// in src/task/mod.rs

impl Task {
    pub fn new(future: impl Future<Output = ()> + 'static) -> Task {
        Task {
            future: Box::pin(future),
        }
    }
}

The function takes an arbitrary future with an output type of () and pins it in memory through the Box::pin function. Then it wraps the boxed future in the Task struct and returns it. The 'static lifetime is required here because the returned Task can live for an arbitrary time, so the future needs to be valid for that time too.

We also add a poll method to allow the executor to poll the stored future:

// in src/task/mod.rs

use core::task::{Context, Poll};

impl Task {
    fn poll(&mut self, context: &mut Context) -> Poll<()> {
        self.future.as_mut().poll(context)
    }
}

Since the poll method of the Future trait expects to be called on a Pin<&mut T> type, we use the Pin::as_mut method to convert the self.future field of type Pin<Box<T>> first. Then we call poll on the converted self.future field and return the result. Since the Task::poll method should only be called by the executor that we'll create in a moment, we keep the function private to the task module.

Simple Executor

Since executors can be quite complex, we deliberately start by creating a very basic executor before implementing a more featureful executor later. For this, we first create a new task::simple_executor submodule:

// in src/task/mod.rs

pub mod simple_executor;

// in src/task/simple_executor.rs

use super::Task;
use alloc::collections::VecDeque;

pub struct SimpleExecutor {
    task_queue: VecDeque<Task>,
}

impl SimpleExecutor {
    pub fn new() -> SimpleExecutor {
        SimpleExecutor {
            task_queue: VecDeque::new(),
        }
    }

    pub fn spawn(&mut self, task: Task) {
        self.task_queue.push_back(task)
    }
}

The struct contains a single task_queue field of type VecDeque, which is basically a vector that allows for push and pop operations on both ends. The idea behind using this type is that we insert new tasks through the spawn method at the end and pop the next task for execution from the front. This way, we get a simple FIFO queue ("first in, first out").
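The FIFO behavior can be checked in isolation with a tiny sketch. It uses std::collections::VecDeque on a hosted target (the kernel uses the same type from alloc), and plain numbers stand in for tasks; fifo_demo is a hypothetical name.

```rust
use std::collections::VecDeque;

/// Returns the order in which the queued items are taken out again.
fn fifo_demo() -> Vec<u32> {
    let mut queue = VecDeque::new();
    // "Spawn" three tasks by inserting them at the back.
    queue.push_back(1);
    queue.push_back(2);
    queue.push_back(3);

    // "Run" them by always taking the next item from the front.
    let mut order = Vec::new();
    while let Some(next) = queue.pop_front() {
        order.push(next);
    }
    order
}
```

The items come out as 1, 2, 3, i.e., in the order in which they were inserted.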

Dummy Waker

In order to call the poll method, we need to create a Context type, which wraps a Waker type. To start simple, we will first create a dummy waker that does nothing. For this, we create a RawWaker instance, which defines the implementation of the different Waker methods, and then use the Waker::from_raw function to turn it into a Waker:

// in src/task/simple_executor.rs

use core::task::{Waker, RawWaker};

fn dummy_raw_waker() -> RawWaker {
    todo!();
}

fn dummy_waker() -> Waker {
    unsafe { Waker::from_raw(dummy_raw_waker()) }
}

The from_raw function is unsafe because undefined behavior can occur if the programmer does not uphold the documented requirements of RawWaker. Before we look at the implementation of the dummy_raw_waker function, we first try to understand how the RawWaker type works.

`RawWaker`

The RawWaker type requires the programmer to explicitly define a virtual method table (vtable) that specifies the functions that should be called when the RawWaker is cloned, woken, or dropped. The layout of this vtable is defined by the RawWakerVTable type. Each function receives a *const () argument, which is a type-erased pointer to some value. The reason for using a *const () pointer instead of a proper reference is that the RawWaker type should be non-generic but still support arbitrary types. The pointer is provided by putting it into the data argument of RawWaker::new, which just initializes a RawWaker. The Waker then uses this RawWaker to call the vtable functions with data.

Typically, the RawWaker is created for some heap-allocated struct that is wrapped into the Box or Arc type. For such types, methods like Box::into_raw can be used to convert the Box<T> to a *const T pointer. This pointer can then be cast to an anonymous *const () pointer and passed to RawWaker::new. Since each vtable function receives the same *const () as an argument, the functions can safely cast the pointer back to a Box<T> or a &T to operate on it. As you can imagine, this process is highly dangerous and can easily lead to undefined behavior on mistakes. For this reason, manually creating a RawWaker is not recommended unless necessary.
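To make the pointer round trip concrete, here is a sketch of a RawWaker backed by an Arc<AtomicUsize> that counts wake-ups. It targets a hosted environment with std, and all names (VTABLE, counting_waker, raw_waker_demo, and the four vtable functions) are hypothetical. The important part is the reference counting: clone must add a reference, wake and drop must release one, and wake_by_ref must leave the count untouched.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{RawWaker, RawWakerVTable, Waker};

static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, wake, wake_by_ref, drop_waker);

unsafe fn clone(data: *const ()) -> RawWaker {
    // The new RawWaker holds its own reference to the counter.
    unsafe { Arc::increment_strong_count(data as *const AtomicUsize) };
    RawWaker::new(data, &VTABLE)
}

unsafe fn wake(data: *const ()) {
    // `wake` consumes the waker, so we take the reference back and drop it.
    let counter = unsafe { Arc::from_raw(data as *const AtomicUsize) };
    counter.fetch_add(1, Ordering::SeqCst);
}

unsafe fn wake_by_ref(data: *const ()) {
    // Only borrow the counter; the waker stays alive.
    let counter = unsafe { &*(data as *const AtomicUsize) };
    counter.fetch_add(1, Ordering::SeqCst);
}

unsafe fn drop_waker(data: *const ()) {
    drop(unsafe { Arc::from_raw(data as *const AtomicUsize) });
}

/// Turns the `Arc` into a type-erased pointer and wraps it in a `Waker`.
fn counting_waker(counter: Arc<AtomicUsize>) -> Waker {
    let data = Arc::into_raw(counter) as *const ();
    unsafe { Waker::from_raw(RawWaker::new(data, &VTABLE)) }
}

/// Returns (number of wake-ups, remaining strong references).
fn raw_waker_demo() -> (usize, usize) {
    let counter = Arc::new(AtomicUsize::new(0));
    let waker = counting_waker(counter.clone());
    waker.wake_by_ref();
    waker.clone().wake();
    drop(waker);
    (counter.load(Ordering::SeqCst), Arc::strong_count(&counter))
}
```

After two wake-ups and after all wakers are gone, only the original Arc is left, so raw_waker_demo returns (2, 1). A mistake in any of the four functions would show up as a leaked or prematurely freed counter, which illustrates why this API is easy to get wrong.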

A Dummy `RawWaker`

While manually creating a RawWaker is not recommended, there is currently no other way to create a dummy Waker that does nothing. Fortunately, the fact that we want to do nothing makes it relatively safe to implement the dummy_raw_waker function:

// in src/task/simple_executor.rs

use core::task::RawWakerVTable;

fn dummy_raw_waker() -> RawWaker {
    fn no_op(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker {
        dummy_raw_waker()
    }

    let vtable = &RawWakerVTable::new(clone, no_op, no_op, no_op);
    RawWaker::new(0 as *const (), vtable)
}

First, we define two inner functions named no_op and clone. The no_op function takes a *const () pointer and does nothing. The clone function also takes a *const () pointer and returns a new RawWaker by calling dummy_raw_waker again. We use these two functions to create a minimal RawWakerVTable: The clone function is used for the cloning operations, and the no_op function is used for all other operations. Since the RawWaker does nothing, it does not matter that we return a new RawWaker from clone instead of cloning it.

After creating the vtable, we use the RawWaker::new function to create the RawWaker. The passed *const () does not matter since none of the vtable functions use it. For this reason, we simply pass a null pointer.

A `run` Method

Now that we have a way to create a Waker instance, we can use it to implement a run method on our executor. The simplest run method repeatedly polls all queued tasks in a loop until all are done. This is not very efficient since it does not utilize the notifications of the Waker type, but it is an easy way to get things running:

// in src/task/simple_executor.rs

use core::task::{Context, Poll};

impl SimpleExecutor {
    pub fn run(&mut self) {
        while let Some(mut task) = self.task_queue.pop_front() {
            let waker = dummy_waker();
            let mut context = Context::from_waker(&waker);
            match task.poll(&mut context) {
                Poll::Ready(()) => {} // task done
                Poll::Pending => self.task_queue.push_back(task),
            }
        }
    }
}

The function uses a while let loop to handle all tasks in the task_queue. For each task, it first creates a Context type by wrapping a Waker instance returned by our dummy_waker function. Then it invokes the Task::poll method with this context. If the poll method returns Poll::Ready, the task is finished and we can continue with the next task. If the task is still Poll::Pending, we add it to the back of the queue again so that it will be polled again in a subsequent loop iteration.
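The cost of this busy polling can be measured with a hosted sketch that mirrors the loop above and counts the poll calls. The Countdown future, run_counting, and busy_poll_demo are hypothetical names; the waker comes from the standard library's Wake trait instead of our dummy_waker function.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical future that stays pending for `remaining` poll calls.
struct Countdown {
    remaining: u32,
}

impl Future for Countdown {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context) -> Poll<()> {
        if self.remaining == 0 {
            Poll::Ready(())
        } else {
            self.remaining -= 1;
            Poll::Pending
        }
    }
}

type BoxedTask = Pin<Box<dyn Future<Output = ()>>>;

/// Same loop as in `SimpleExecutor::run`, but counting the poll calls.
fn run_counting(mut queue: VecDeque<BoxedTask>) -> usize {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut context = Context::from_waker(&waker);
    let mut polls = 0;
    while let Some(mut task) = queue.pop_front() {
        polls += 1;
        match task.as_mut().poll(&mut context) {
            Poll::Ready(()) => {} // task done
            Poll::Pending => queue.push_back(task),
        }
    }
    polls
}

fn busy_poll_demo() -> usize {
    let mut queue: VecDeque<BoxedTask> = VecDeque::new();
    queue.push_back(Box::pin(Countdown { remaining: 3 }));
    queue.push_back(Box::pin(async {}));
    run_counting(queue)
}
```

The demo needs five poll calls: one for the empty async block and four for the countdown, three of which accomplish nothing. With a task that waits for an external event, the number of wasted polls is unbounded.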

Trying It

With our SimpleExecutor type, we can now try running the task returned by the example_task function in our main.rs:

// in src/main.rs

use blog_os::task::{Task, simple_executor::SimpleExecutor};

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    // […] initialization routines, including `init_heap`

    let mut executor = SimpleExecutor::new();
    executor.spawn(Task::new(example_task()));
    executor.run();

    // […] test_main, "it did not crash" message, hlt_loop
}


// Below is the example_task function again so that you don't have to scroll up

async fn async_number() -> u32 {
    42
}

async fn example_task() {
    let number = async_number().await;
    println!("async number: {}", number);
}

When we run it, we see that the expected "async number: 42" message is printed to the screen:

Let's summarize the various steps that happen in this example:

First, a new instance of our SimpleExecutor type is created with an empty task_queue.
Next, we call the asynchronous example_task function, which returns a future. We wrap this future in the Task type, which moves it to the heap and pins it, and then add the task to the task_queue of the executor through the spawn method.
We then call the run method to start the execution of the single task in the queue. This involves:
- Popping the task from the front of the task_queue.
- Creating a RawWaker for the task, converting it to a Waker instance, and then creating a Context instance from it.
- Calling the poll method on the future of the task, using the Context we just created.
- Since the example_task does not wait for anything, it can directly run till its end on the first poll call. This is where the "async number: 42" line is printed.
- Since the example_task directly returns Poll::Ready, it is not added back to the task queue.
The run method returns after the task_queue becomes empty. The execution of our kernel_main function continues and the "It did not crash!" message is printed.

Async Keyboard Input

Our simple executor does not utilize the Waker notifications and simply loops over all tasks until they are done. This wasn't a problem for our example since our example_task can directly run to finish on the first poll call. To see the performance advantages of a proper Waker implementation, we first need to create a task that is truly asynchronous, i.e., a task that will probably return Poll::Pending on the first poll call.

We already have some kind of asynchronicity in our system that we can use for this: hardware interrupts. As we learned in the Interrupts post, hardware interrupts can occur at arbitrary points in time, determined by some external device. For example, a hardware timer sends an interrupt to the CPU after some predefined time has elapsed. When the CPU receives an interrupt, it immediately transfers control to the corresponding handler function defined in the interrupt descriptor table (IDT).

In the following, we will create an asynchronous task based on the keyboard interrupt. The keyboard interrupt is a good candidate for this because it is both non-deterministic and latency-critical. Non-deterministic means that there is no way to predict when the next key press will occur because it is entirely dependent on the user. Latency-critical means that we want to handle the keyboard input in a timely manner, otherwise the user will feel a lag. To support such a task in an efficient way, it will be essential that the executor has proper support for Waker notifications.

Scancode Queue

Currently, we handle the keyboard input directly in the interrupt handler. This is not a good idea for the long term because interrupt handlers should stay as short as possible as they might interrupt important work. Instead, interrupt handlers should only perform the minimal amount of work necessary (e.g., reading the keyboard scancode) and leave the rest of the work (e.g., interpreting the scancode) to a background task.

A common pattern for delegating work to a background task is to create some sort of queue. The interrupt handler pushes units of work to the queue, and the background task handles the work in the queue. Applied to our keyboard interrupt, this means that the interrupt handler only reads the scancode from the keyboard, pushes it to the queue, and then returns. The keyboard task sits on the other end of the queue and interprets and handles each scancode that is pushed to it:

A simple implementation of that queue could be a mutex-protected VecDeque. However, using mutexes in interrupt handlers is not a good idea since it can easily lead to deadlocks. For example, when the user presses a key while the keyboard task has locked the queue, the interrupt handler tries to acquire the lock again and hangs indefinitely. Another problem with this approach is that VecDeque automatically increases its capacity by performing a new heap allocation when it becomes full. This can lead to deadlocks again because our allocator also uses a mutex internally. Further problems are that heap allocations can fail or take a considerable amount of time when the heap is fragmented.
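To make the first deadlock scenario concrete, here is a minimal sketch that runs on a normal host system instead of our kernel. It uses std::sync::Mutex as a stand-in for a kernel spinlock and simulates the interrupt by running the handler logic while the task still holds the lock. The function name is made up for this illustration. Since a blocking lock call would hang forever at this point, the sketch uses try_lock to observe the situation:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Simulates a keyboard interrupt that arrives while the keyboard task
/// holds the queue lock. Returns `true` if the handler could push.
fn handler_can_push_while_queue_is_locked() -> bool {
    let queue: Mutex<VecDeque<u8>> = Mutex::new(VecDeque::new());

    // The keyboard task locks the queue to pop a scancode ...
    let _task_guard = queue.lock().unwrap();

    // ... and is interrupted right here. The handler runs before the task
    // can release its guard, so a blocking `lock()` would wait forever.
    // `try_lock` lets us observe this without hanging.
    let pushed = match queue.try_lock() {
        Ok(mut guard) => {
            guard.push_back(0x1e);
            true
        }
        Err(_) => false,
    };
    pushed
}
```

The function returns false: the handler cannot get the lock, and because the interrupted task only continues after the handler returns, waiting for the lock would never end.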

To prevent these problems, we need a queue implementation that does not require mutexes or allocations for its push operation. Such queues can be implemented by using lock-free atomic operations for pushing and popping elements. This way, it is possible to create push and pop operations that only require a &self reference and are thus usable without a mutex. To avoid allocations on push, the queue can be backed by a pre-allocated fixed-size buffer. While this makes the queue bounded (i.e., it has a maximum length), it is often possible to define reasonable upper bounds for the queue length in practice, so that this isn’t a big problem.
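As a rough illustration of the idea, the following sketch implements a bounded queue for u8 values using only atomic operations. It is deliberately limited to exactly one producer and one consumer, and the SpscQueue name and the capacity of 4 are assumptions made for this example. Queues that support multiple producers and consumers are considerably more involved.

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};

const CAPACITY: usize = 4; // power of two, so index wrap-around stays consistent

/// Hypothetical bounded queue for one producer and one consumer.
struct SpscQueue {
    slots: [AtomicU8; CAPACITY],
    head: AtomicUsize, // next index to pop, only written by the consumer
    tail: AtomicUsize, // next index to push, only written by the producer
}

impl SpscQueue {
    const fn new() -> Self {
        const EMPTY: AtomicU8 = AtomicU8::new(0);
        SpscQueue {
            slots: [EMPTY; CAPACITY],
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Never blocks and never allocates; returns the value if the queue is full.
    fn push(&self, value: u8) -> Result<(), u8> {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == CAPACITY {
            return Err(value);
        }
        self.slots[tail % CAPACITY].store(value, Ordering::Relaxed);
        // Publish the slot to the consumer.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    fn pop(&self) -> Option<u8> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // queue is empty
        }
        let value = self.slots[head % CAPACITY].load(Ordering::Relaxed);
        // Hand the slot back to the producer.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}
```

Because push and pop only take &self and new is a const fn, a queue like this can be stored in a static without a mutex wrapper.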

The `crossbeam` Crate

Implementing such a queue in a correct and efficient way is very difficult, so I recommend sticking to existing, well-tested implementations. One popular Rust project that implements various mutex-free types for concurrent programming is crossbeam. It provides a type named ArrayQueue that is exactly what we need in this case. And we’re lucky: the type is fully compatible with no_std crates with allocation support.

To use the type, we need to add a dependency on the crossbeam-queue crate:

# in Cargo.toml

[dependencies.crossbeam-queue]
version = "0.3.11"
default-features = false
features = ["alloc"]

By default, the crate depends on the standard library. To make it no_std compatible, we need to disable its default features and instead enable the alloc feature. (Note that we could also add a dependency on the main crossbeam crate, which re-exports the crossbeam-queue crate, but this would result in a larger number of dependencies and longer compile times.)

Queue Implementation

Using the ArrayQueue type, we can now create a global scancode queue in a new task::keyboard module:

// in src/task/mod.rs

pub mod keyboard;

// in src/task/keyboard.rs

use conquer_once::spin::OnceCell;
use crossbeam_queue::ArrayQueue;

static SCANCODE_QUEUE: OnceCell<ArrayQueue<u8>> = OnceCell::uninit();

Since ArrayQueue::new performs a heap allocation, which is not possible at compile time (yet), we can’t initialize the static variable directly. Instead, we use the OnceCell type of the conquer_once crate, which makes it possible to perform a safe one-time initialization of static values. To include the crate, we need to add it as a dependency in our Cargo.toml:

# in Cargo.toml

[dependencies.conquer-once]
version = "0.2.0"
default-features = false

Instead of the OnceCell primitive, we could also use the lazy_static macro here. However, the OnceCell type has the advantage that we can ensure that the initialization does not happen in the interrupt handler, thus preventing the interrupt handler from performing a heap allocation.
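The difference to lazy initialization is easiest to see in a small host-side sketch. It uses std::sync::OnceLock as a rough analogue of the OnceCell type, since std is available outside the kernel. The mapping of get to try_get and of set to try_init_once is only an approximation, and the function name is made up for this example:

```rust
use std::sync::OnceLock;

/// Returns: (initialized before?, first init ok?, second init ok?, final value)
fn init_once_demo() -> (bool, bool, bool, Option<String>) {
    let cell: OnceLock<String> = OnceLock::new();

    // Reading never initializes the cell. This is the property we want
    // for code that runs inside an interrupt handler.
    let ready_before = cell.get().is_some();

    // Initialization is an explicit step that succeeds only once.
    let first_init = cell.set(String::from("first")).is_ok();
    let second_init = cell.set(String::from("second")).is_ok();

    (ready_before, first_init, second_init, cell.get().cloned())
}
```

With a lazily initialized value, the first access from any context would run the initializer, including its allocation. With the explicit variant, we control where the initialization happens.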

Filling the Queue

To fill the scancode queue, we create a new add_scancode function that we will call from the interrupt handler:

// in src/task/keyboard.rs

use crate::println;

/// Called by the keyboard interrupt handler
///
/// Must not block or allocate.
pub(crate) fn add_scancode(scancode: u8) {
    if let Ok(queue) = SCANCODE_QUEUE.try_get() {
        if let Err(_) = queue.push(scancode) {
            println!("WARNING: scancode queue full; dropping keyboard input");
        }
    } else {
        println!("WARNING: scancode queue uninitialized");
    }
}

We use OnceCell::try_get to get a reference to the initialized queue. If the queue is not initialized yet, we ignore the keyboard scancode and print a warning. It’s important that we don’t try to initialize the queue in this function because it will be called by the interrupt handler, which should not perform heap allocations. Since this function should not be callable from our main.rs, we use the pub(crate) visibility to make it only available to our lib.rs.

The fact that the ArrayQueue::push method requires only a &self reference makes it very simple to call the method on the static queue. The ArrayQueue type performs all the necessary synchronization itself, so we don’t need a mutex wrapper here. In case the queue is full, we print a warning too.

To call the add_scancode function on keyboard interrupts, we update our keyboard_interrupt_handler function in the interrupts module:

// in src/interrupts.rs

extern "x86-interrupt" fn keyboard_interrupt_handler(
    _stack_frame: InterruptStackFrame
) {
    use x86_64::instructions::port::Port;

    let mut port = Port::new(0x60);
    let scancode: u8 = unsafe { port.read() };
    crate::task::keyboard::add_scancode(scancode); // new

    unsafe {
        PICS.lock()
            .notify_end_of_interrupt(InterruptIndex::Keyboard.as_u8());
    }
}

We removed all the keyboard handling code from this function and instead added a call to the add_scancode function. The rest of the function stays the same as before.

As expected, keypresses are no longer printed to the screen when we run our project using cargo run now. Instead, we see the warning that the scancode queue is uninitialized for every keystroke.

Scancode Stream

To initialize the SCANCODE_QUEUE and read the scancodes from the queue in an asynchronous way, we create a new ScancodeStream type:

// in src/task/keyboard.rs

pub struct ScancodeStream {
    _private: (),
}

impl ScancodeStream {
    pub fn new() -> Self {
        SCANCODE_QUEUE.try_init_once(|| ArrayQueue::new(100))
            .expect("ScancodeStream::new should only be called once");
        ScancodeStream { _private: () }
    }
}

The purpose of the _private field is to prevent construction of the struct from outside of the module. This makes the new function the only way to construct the type. In the function, we first try to initialize the SCANCODE_QUEUE static. We panic if it is already initialized to ensure that only a single ScancodeStream instance can be created.

To make the scancodes available to asynchronous tasks, the next step is to implement a poll-like method that tries to pop the next scancode off the queue. While this sounds like we should implement the Future trait for our type, this does not quite fit here. The problem is that the Future trait only abstracts over a single asynchronous value and expects that the poll method is not called again after it returns Poll::Ready. Our scancode queue, however, contains multiple asynchronous values, so it is okay to keep polling it.

The `Stream` Trait

Since types that yield multiple asynchronous values are common, the futures crate provides a useful abstraction for such types: the Stream trait. The trait is defined like this:

pub trait Stream {
    type Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context)
        -> Poll<Option<Self::Item>>;
}

This definition is quite similar to the Future trait, with the following differences:

- The associated type is named Item instead of Output.
- Instead of a poll method that returns Poll<Self::Item>, the Stream trait defines a poll_next method that returns a Poll<Option<Self::Item>> (note the additional Option).

There is also a semantic difference: The poll_next can be called repeatedly, until it returns Poll::Ready(None) to signal that the stream is finished. In this regard, the method is similar to the Iterator::next method, which also returns None after the last value.
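The following host-side sketch shows this calling convention in action. To stay self-contained, it declares a local copy of the trait instead of importing it from the futures crate. The Countdown and NoopWaker types and the drain function are made up for this illustration:

```rust
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Local copy of the trait definition, so the example needs no dependencies.
trait Stream {
    type Item;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>)
        -> Poll<Option<Self::Item>>;
}

/// Hypothetical stream that yields n, n-1, ..., 1 and then finishes.
struct Countdown(u8);

impl Stream for Countdown {
    type Item = u8;

    fn poll_next(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Option<u8>> {
        if self.0 == 0 {
            return Poll::Ready(None); // stream is finished
        }
        let item = self.0;
        self.0 -= 1;
        Poll::Ready(Some(item))
    }
}

/// Waker that does nothing; enough for a stream that is always ready.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls the stream repeatedly until it stops yielding items.
fn drain(stream: &mut Countdown) -> Vec<u8> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut items = Vec::new();
    while let Poll::Ready(Some(item)) = Pin::new(&mut *stream).poll_next(&mut cx) {
        items.push(item);
    }
    items
}
```

Calling drain on Countdown(3) collects 3, 2, and 1. The loop ends as soon as poll_next returns Poll::Ready(None).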

Implementing `Stream`

Let’s implement the Stream trait for our ScancodeStream to provide the values of the SCANCODE_QUEUE in an asynchronous way. For this, we first need to add a dependency on the futures-util crate, which contains the Stream trait:

# in Cargo.toml

[dependencies.futures-util]
version = "0.3.4"
default-features = false
features = ["alloc"]

We disable the default features to make the crate no_std compatible and enable the alloc feature to make its allocation-based types available (we will need this later). (Note that we could also add a dependency on the main futures crate, which re-exports the futures-util crate, but this would result in a larger number of dependencies and longer compile times.)

Now we can import and implement the Stream trait:

// in src/task/keyboard.rs

use core::{pin::Pin, task::{Poll, Context}};
use futures_util::stream::Stream;

impl Stream for ScancodeStream {
    type Item = u8;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<u8>> {
        let queue = SCANCODE_QUEUE.try_get().expect("not initialized");
        match queue.pop() {
            Some(scancode) => Poll::Ready(Some(scancode)),
            None => Poll::Pending,
        }
    }
}

We first use the OnceCell::try_get method to get a reference to the initialized scancode queue. This should never fail since we initialize the queue in the new function, so we can safely use the expect method to panic if it’s not initialized. Next, we use the ArrayQueue::pop method to try to get the next element from the queue. If it succeeds, we return the scancode wrapped in Poll::Ready(Some(…)). If it fails, it means that the queue is empty. In that case, we return Poll::Pending.

Waker Support

Like the Future::poll method, the Stream::poll_next method requires the asynchronous task to notify the executor when it becomes ready after Poll::Pending is returned. This way, the executor does not need to poll the same task again until it is notified, which greatly reduces the performance overhead of waiting tasks.

To send this notification, the task should extract the Waker from the passed Context reference and store it somewhere. When the task becomes ready, it should invoke the wake method on the stored Waker to notify the executor that the task should be polled again.

AtomicWaker

To implement the Waker notification for our ScancodeStream, we need a place where we can store the Waker between poll calls. We can’t store it as a field in the ScancodeStream itself because it needs to be accessible from the add_scancode function. The solution to this is to use a static variable of the AtomicWaker type provided by the futures-util crate. Like the ArrayQueue type, this type is based on atomic instructions and can be safely stored in a static and modified concurrently.

Let’s use the AtomicWaker type to define a static WAKER:

// in src/task/keyboard.rs

use futures_util::task::AtomicWaker;

static WAKER: AtomicWaker = AtomicWaker::new();

The idea is that the poll_next implementation stores the current waker in this static, and the add_scancode function calls the wake function on it when a new scancode is added to the queue.

Storing a Waker

The contract defined by poll/poll_next requires the task to register a wakeup for the passed Waker when it returns Poll::Pending. Let’s modify our poll_next implementation to satisfy this requirement:

// in src/task/keyboard.rs

impl Stream for ScancodeStream {
    type Item = u8;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Option<u8>> {
        let queue = SCANCODE_QUEUE
            .try_get()
            .expect("scancode queue not initialized");

        // fast path
        if let Some(scancode) = queue.pop() {
            return Poll::Ready(Some(scancode));
        }

        WAKER.register(cx.waker());
        match queue.pop() {
            Some(scancode) => {
                WAKER.take();
                Poll::Ready(Some(scancode))
            }
            None => Poll::Pending,
        }
    }
}

Like before, we first use the OnceCell::try_get function to get a reference to the initialized scancode queue. We then optimistically try to pop from the queue and return Poll::Ready when it succeeds. This way, we can avoid the performance overhead of registering a waker when the queue is not empty.

If the first call to queue.pop() does not succeed, the queue is potentially empty. Only potentially because the interrupt handler might have filled the queue asynchronously immediately after the check. Since this race condition can occur again for the next check, we need to register the Waker in the WAKER static before the second check. This way, a wakeup might happen before we return Poll::Pending, but it is guaranteed that we get a wakeup for any scancodes pushed after the check.

After registering the Waker contained in the passed Context through the AtomicWaker::register function, we try to pop from the queue a second time. If it now succeeds, we return Poll::Ready. We also remove the registered waker again using AtomicWaker::take because a waker notification is no longer needed. In case queue.pop() fails for a second time, we return Poll::Pending like before, but this time with a registered wakeup.

Note that there are two ways that a wakeup can happen for a task that did not return Poll::Pending (yet). One way is the mentioned race condition when the wakeup happens immediately before returning Poll::Pending. The other way is when the queue is no longer empty after registering the waker, so that Poll::Ready is returned. Since these spurious wakeups are not preventable, the executor needs to be able to handle them correctly.

Waking the Stored Waker

To wake the stored Waker, we add a call to WAKER.wake() in the add_scancode function:

// in src/task/keyboard.rs

pub(crate) fn add_scancode(scancode: u8) {
    if let Ok(queue) = SCANCODE_QUEUE.try_get() {
        if let Err(_) = queue.push(scancode) {
            println!("WARNING: scancode queue full; dropping keyboard input");
        } else {
            WAKER.wake(); // new
        }
    } else {
        println!("WARNING: scancode queue uninitialized");
    }
}

The only change that we made is to add a call to WAKER.wake() if the push to the scancode queue succeeds. If a waker is registered in the WAKER static, this method will call the equally-named wake method on it, which notifies the executor. Otherwise, the operation is a no-op, i.e., nothing happens.

It is important that we call wake only after pushing to the queue because otherwise the task might be woken too early while the queue is still empty. This can, for example, happen when using a multi-threaded executor that starts the woken task concurrently on a different CPU core. While we don’t have thread support yet, we will add it soon and don’t want things to break then.
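The interplay between registering and waking can be tried out on a host system with the following sketch. It is not the kernel code: the Source and CountingWaker types are made up for this illustration, and mutex-protected std types stand in for ArrayQueue and AtomicWaker, which means that this version would not be safe to use from an interrupt handler.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical waker that only counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Stand-in for the SCANCODE_QUEUE and WAKER statics.
struct Source {
    queue: Mutex<VecDeque<u8>>,
    waker: Mutex<Option<Waker>>,
}

impl Source {
    fn new() -> Self {
        Source { queue: Mutex::new(VecDeque::new()), waker: Mutex::new(None) }
    }

    fn pop(&self) -> Option<u8> {
        self.queue.lock().unwrap().pop_front()
    }

    fn poll_next(&self, cx: &mut Context<'_>) -> Poll<Option<u8>> {
        // fast path
        if let Some(scancode) = self.pop() {
            return Poll::Ready(Some(scancode));
        }
        // register first, then check again
        *self.waker.lock().unwrap() = Some(cx.waker().clone());
        match self.pop() {
            Some(scancode) => {
                self.waker.lock().unwrap().take();
                Poll::Ready(Some(scancode))
            }
            None => Poll::Pending,
        }
    }

    /// Counterpart of add_scancode: push first, wake afterwards.
    fn push(&self, scancode: u8) {
        self.queue.lock().unwrap().push_back(scancode);
        let waker = self.waker.lock().unwrap().take();
        if let Some(waker) = waker {
            waker.wake();
        }
    }
}

/// Returns: (first poll pending?, wake-ups after push, result of second poll)
fn wake_demo() -> (bool, usize, Option<u8>) {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let source = Source::new();

    let first_is_pending = source.poll_next(&mut cx).is_pending();
    source.push(0x1e);
    let wake_count = counter.0.load(Ordering::SeqCst);
    let second = match source.poll_next(&mut cx) {
        Poll::Ready(value) => value,
        Poll::Pending => None,
    };
    (first_is_pending, wake_count, second)
}
```

The first poll finds an empty queue and registers the waker. The push then triggers exactly one wake-up, and the second poll takes the fast path.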

Keyboard Task

Now that we implemented the Stream trait for our ScancodeStream, we can use it to create an asynchronous keyboard task:

// in src/task/keyboard.rs

use futures_util::stream::StreamExt;
use pc_keyboard::{layouts, DecodedKey, HandleControl, Keyboard, ScancodeSet1};
use crate::print;

pub async fn print_keypresses() {
    let mut scancodes = ScancodeStream::new();
    let mut keyboard = Keyboard::new(ScancodeSet1::new(),
        layouts::Us104Key, HandleControl::Ignore);

    while let Some(scancode) = scancodes.next().await {
        if let Ok(Some(key_event)) = keyboard.add_byte(scancode) {
            if let Some(key) = keyboard.process_keyevent(key_event) {
                match key {
                    DecodedKey::Unicode(character) => print!("{}", character),
                    DecodedKey::RawKey(key) => print!("{:?}", key),
                }
            }
        }
    }
}

The code is very similar to the code we had in our keyboard interrupt handler before we modified it in this post. The only difference is that, instead of reading the scancode from an I/O port, we take it from the ScancodeStream. For this, we first create a new ScancodeStream and then repeatedly use the next method provided by the StreamExt trait to get a Future that resolves to the next element in the stream. By using the await operator on it, we asynchronously wait for the result of the future.

We use while let to loop until the stream returns None to signal its end. Since our poll_next method never returns None, this is effectively an endless loop, so the print_keypresses task never finishes.

Let’s add the print_keypresses task to our executor in our main.rs to get working keyboard input again:

// in src/main.rs

use blog_os::task::keyboard; // new

fn kernel_main(boot_info: &'static BootInfo) -> ! {

    // […] initialization routines, including init_heap, test_main

    let mut executor = SimpleExecutor::new();
    executor.spawn(Task::new(example_task()));
    executor.spawn(Task::new(keyboard::print_keypresses())); // new
    executor.run();

    // […] "it did not crash" message, hlt_loop
}

When we execute cargo run now, we see that keyboard input works again:

If you keep an eye on the CPU utilization of your computer, you will see that the QEMU process now continuously keeps the CPU busy. This happens because our SimpleExecutor polls tasks over and over again in a loop. So even if we don’t press any keys on the keyboard, the executor repeatedly calls poll on our print_keypresses task, even though the task cannot make any progress and will return Poll::Pending each time.

Executor with Waker Support

To fix the performance problem, we need to create an executor that properly utilizes the Waker notifications. This way, the executor is notified when the next keyboard interrupt occurs, so it does not need to keep polling the print_keypresses task over and over again.

Task Id

The first step in creating an executor with proper support for waker notifications is to give each task a unique ID. This is required because we need a way to specify which task should be woken. We start by creating a new TaskId wrapper type:

// in src/task/mod.rs

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TaskId(u64);

The TaskId struct is a simple wrapper type around u64. We derive a number of traits for it to make it printable, copyable, comparable, and sortable. The latter is important because we want to use TaskId as the key type of a BTreeMap in a moment.

To create a new unique ID, we create a TaskId::new function:

use core::sync::atomic::{AtomicU64, Ordering};

impl TaskId {
    fn new() -> Self {
        static NEXT_ID: AtomicU64 = AtomicU64::new(0);
        TaskId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
    }
}

The function uses a static NEXT_ID variable of type AtomicU64 to ensure that each ID is assigned only once. The fetch_add method atomically increases the value and returns the previous value in one atomic operation. This means that even when the TaskId::new method is called in parallel, every ID is returned exactly once. The Ordering parameter defines whether the compiler is allowed to reorder the fetch_add operation in the instruction stream. Since we only require that the ID be unique, the Relaxed ordering with the weakest requirements is enough in this case.
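The uniqueness guarantee can be checked on a host system, where threads are available. The following sketch repeats the TaskId definition with std imports and adds a helper function, whose name is made up for this example, that creates IDs from several threads in parallel:

```rust
use std::collections::BTreeSet;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TaskId(u64);

impl TaskId {
    fn new() -> Self {
        static NEXT_ID: AtomicU64 = AtomicU64::new(0);
        TaskId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
    }
}

/// Creates IDs on multiple threads and collects them into a set.
/// Duplicate IDs would collapse into a single set entry.
fn ids_from_threads(threads: usize, per_thread: usize) -> BTreeSet<TaskId> {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                (0..per_thread).map(|_| TaskId::new()).collect::<Vec<_>>()
            })
        })
        .collect();

    handles
        .into_iter()
        .flat_map(|handle| handle.join().unwrap())
        .collect()
}
```

With 4 threads that create 100 IDs each, the resulting set contains 400 entries, so no ID was handed out twice.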

We can now extend our Task type with an additional id field:

// in src/task/mod.rs

pub struct Task {
    id: TaskId, // new
    future: Pin<Box<dyn Future<Output = ()>>>,
}

impl Task {
    pub fn new(future: impl Future<Output = ()> + 'static) -> Task {
        Task {
            id: TaskId::new(), // new
            future: Box::pin(future),
        }
    }
}

The new id field makes it possible to uniquely name a task, which is required for waking a specific task.

The `Executor` Type

We create our new Executor type in a task::executor module:

// in src/task/mod.rs

pub mod executor;

// in src/task/executor.rs

use super::{Task, TaskId};
use alloc::{collections::BTreeMap, sync::Arc};
use core::task::Waker;
use crossbeam_queue::ArrayQueue;

pub struct Executor {
    tasks: BTreeMap<TaskId, Task>,
    task_queue: Arc<ArrayQueue<TaskId>>,
    waker_cache: BTreeMap<TaskId, Waker>,
}

impl Executor {
    pub fn new() -> Self {
        Executor {
            tasks: BTreeMap::new(),
            task_queue: Arc::new(ArrayQueue::new(100)),
            waker_cache: BTreeMap::new(),
        }
    }
}

Instead of storing tasks in a VecDeque like we did for our SimpleExecutor, we use a task_queue of task IDs and a BTreeMap named tasks that contains the actual Task instances. The map is indexed by the TaskId to allow efficient continuation of a specific task.

The task_queue field is an ArrayQueue of task IDs, wrapped into the Arc type that implements reference counting. Reference counting makes it possible to share ownership of the value among multiple owners. It works by allocating the value on the heap and counting the number of active references to it. When the number of active references reaches zero, the value is no longer needed and can be deallocated.
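The counting behavior can be observed directly through Arc::strong_count. This small sketch uses a plain Vec as a placeholder for the queue, and the function name is made up for this example:

```rust
use std::sync::Arc;

/// Returns: (count with one owner, count after clone,
///           both handles point to the same allocation?, count after drop)
fn reference_counts() -> (usize, usize, bool, usize) {
    let queue = Arc::new(vec![1_u64, 2, 3]);
    let one_owner = Arc::strong_count(&queue);

    // Cloning the Arc does not copy the Vec; it only bumps the counter.
    let for_waker = Arc::clone(&queue);
    let two_owners = Arc::strong_count(&queue);
    let same_allocation = Arc::ptr_eq(&queue, &for_waker);

    // Dropping one handle decreases the counter again.
    drop(for_waker);
    let after_drop = Arc::strong_count(&queue);

    (one_owner, two_owners, same_allocation, after_drop)
}
```

The value itself is freed only when the last handle is dropped.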

We use this Arc<ArrayQueue> type for the task_queue because it will be shared between the executor and wakers. The idea is that the wakers push the ID of the woken task to the queue. The executor sits on the receiving end of the queue, retrieves the woken tasks by their ID from the tasks map, and then runs them. The reason for using a fixed-size queue instead of an unbounded queue such as SegQueue is that interrupt handlers should not allocate on push to this queue.

In addition to the task_queue and the tasks map, the Executor type has a waker_cache field that is also a map. This map caches the Waker of a task after its creation. This has two reasons: First, it improves performance by reusing the same waker for multiple wake-ups of the same task instead of creating a new waker each time. Second, it ensures that reference-counted wakers are not deallocated inside interrupt handlers because it could lead to deadlocks (there are more details on this below).

To create an Executor, we provide a simple new function. We choose a capacity of 100 for the task_queue, which should be more than enough for the foreseeable future. In case our system will have more than 100 concurrent tasks at some point, we can easily increase this size.

Spawning Tasks

As for the SimpleExecutor, we provide a spawn method on our Executor type that adds a given task to the tasks map and immediately wakes it by pushing its ID to the task_queue:

// in src/task/executor.rs

impl Executor {
    pub fn spawn(&mut self, task: Task) {
        let task_id = task.id;
        if self.tasks.insert(task.id, task).is_some() {
            panic!("task with same ID already in tasks");
        }
        self.task_queue.push(task_id).expect("queue full");
    }
}

If there is already a task with the same ID in the map, the BTreeMap::insert method returns it. This should never happen since each task has a unique ID, so we panic in this case since it indicates a bug in our code. Similarly, we panic when the task_queue is full since this should never happen if we choose a large-enough queue size.
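The return value of insert is easy to check in isolation. In this sketch, string labels stand in for the Task instances, and the function name is made up for this example:

```rust
use std::collections::BTreeMap;

/// Inserts two values under the same key and returns what `insert` reported.
fn insert_twice() -> (Option<&'static str>, Option<&'static str>) {
    let mut tasks = BTreeMap::new();
    // The key is new, so there is no previous value.
    let first = tasks.insert(1_u64, "keyboard task");
    // The key exists already: the old value is replaced and returned.
    let second = tasks.insert(1_u64, "duplicate task");
    (first, second)
}
```

Only the second call reports a previous value, which is the case that our spawn method treats as a bug.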

Running Tasks

To execute all tasks in the task_queue, we create a private run_ready_tasks method:

// in src/task/executor.rs

use core::task::{Context, Poll};

impl Executor {
    fn run_ready_tasks(&mut self) {
        // destructure `self` to avoid borrow checker errors
        let Self {
            tasks,
            task_queue,
            waker_cache,
        } = self;

        while let Some(task_id) = task_queue.pop() {
            let task = match tasks.get_mut(&task_id) {
                Some(task) => task,
                None => continue, // task no longer exists
            };
            let waker = waker_cache
                .entry(task_id)
                .or_insert_with(|| TaskWaker::new(task_id, task_queue.clone()));
            let mut context = Context::from_waker(waker);
            match task.poll(&mut context) {
                Poll::Ready(()) => {
                    // task done -> remove it and its cached waker
                    tasks.remove(&task_id);
                    waker_cache.remove(&task_id);
                }
                Poll::Pending => {}
            }
        }
    }
}

The basic idea of this function is similar to our SimpleExecutor: Loop over all tasks in the task_queue, create a waker for each task, and then poll them. However, instead of adding pending tasks back to the end of the task_queue, we let our TaskWaker implementation take care of adding woken tasks back to the queue. The implementation of this waker type will be shown in a moment.

Let’s look into some of the implementation details of this run_ready_tasks method:

- We use destructuring to split self into its three fields to avoid some borrow checker errors. Namely, our implementation needs to access the self.task_queue from within a closure, which currently tries to borrow self completely. This is a fundamental borrow checker issue that will be resolved when RFC 2229 is implemented.
- For each popped task ID, we retrieve a mutable reference to the corresponding task from the tasks map. Since our ScancodeStream implementation registers wakers before checking whether a task needs to be put to sleep, it might happen that a wake-up occurs for a task that no longer exists. In this case, we simply ignore the wake-up and continue with the next ID from the queue.
- To avoid the performance overhead of creating a waker on each poll, we use the waker_cache map to store the waker for each task after it has been created. For this, we use the BTreeMap::entry method in combination with Entry::or_insert_with to create a new waker if it doesn’t exist yet and then get a mutable reference to it. For creating a new waker, we clone the task_queue and pass it together with the task ID to the TaskWaker::new function (implementation shown below). Since the task_queue is wrapped into an Arc, the clone only increases the reference count of the value, but still points to the same heap-allocated queue. Note that reusing wakers like this is not possible for all waker implementations, but our TaskWaker type will allow it.

A task is finished when it returns Poll::Ready. In that case, we remove it from the tasks map using the BTreeMap::remove method. We also remove its cached waker, if it exists.
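The caching behavior of the entry and or_insert_with combination can be seen in a reduced sketch. Here, a String stands in for the cached Waker, and the function name is made up for this example:

```rust
use std::collections::BTreeMap;

/// Looks up the same key three times and returns how often the
/// "create a new waker" closure actually ran.
fn creations_for_three_polls() -> usize {
    let mut cache: BTreeMap<u64, String> = BTreeMap::new();
    let mut created = 0_usize;

    for _ in 0..3 {
        // The closure only runs if there is no entry for the key yet.
        cache.entry(42).or_insert_with(|| {
            created += 1;
            String::from("waker for task 42")
        });
    }
    created
}
```

The closure runs only for the first lookup. The later lookups reuse the stored value, which is what saves us from constructing a new waker on every poll.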

Waker Design

The job of the waker is to push the ID of the woken task to the task_queue of the executor. We implement this by creating a new TaskWaker struct that stores the task ID and a reference to the task_queue:

// in src/task/executor.rs

struct TaskWaker {
    task_id: TaskId,
    task_queue: Arc<ArrayQueue<TaskId>>,
}

Since the ownership of the task_queue is shared between the executor and wakers, we use the Arc wrapper type to implement shared reference-counted ownership.

The implementation of the wake operation is quite simple:

// in src/task/executor.rs

impl TaskWaker {
    fn wake_task(&self) {
        self.task_queue.push(self.task_id).expect("task_queue full");
    }
}

We push the task_id to the referenced task_queue. Since modifications to the ArrayQueue type only require a shared reference, we can implement this method on &self instead of &mut self.

The `Wake` Trait

In order to use our TaskWaker type for polling futures, we need to convert it to a Waker instance first. This is required because the Future::poll method takes a Context instance as an argument, which can only be constructed from the Waker type. While we could do this by providing an implementation of the RawWaker type, it’s both simpler and safer to instead implement the Arc-based Wake trait and then use the From implementations provided by the standard library to construct the Waker.

The trait implementation looks like this:

// in src/task/executor.rs

use alloc::task::Wake;

impl Wake for TaskWaker {
    fn wake(self: Arc<Self>) {
        self.wake_task();
    }

    fn wake_by_ref(self: &Arc<Self>) {
        self.wake_task();
    }
}

Since wakers are commonly shared between the executor and the asynchronous tasks, the trait methods require that the Self instance is wrapped in the Arc type, which implements reference-counted ownership. This means that we have to move our TaskWaker to an Arc in order to call them.

The difference between the wake and wake_by_ref methods is that the latter only requires a reference to the Arc, while the former takes ownership of the Arc and thus often requires an increase of the reference count. Not all types support waking by reference, so implementing the wake_by_ref method is optional. However, it can lead to better performance because it avoids unnecessary reference count modifications. In our case, we can simply forward both trait methods to our wake_task function, which requires only a shared &self reference.

Creating Wakers

Since the Waker type supports From conversions for all Arc-wrapped values that implement the Wake trait, we can now implement the TaskWaker::new function that is required by our Executor::run_ready_tasks method:

// in src/task/executor.rs

impl TaskWaker {
    fn new(task_id: TaskId, task_queue: Arc<ArrayQueue<TaskId>>) -> Waker {
        Waker::from(Arc::new(TaskWaker {
            task_id,
            task_queue,
        }))
    }
}

We create the TaskWaker using the passed task_id and task_queue. We then wrap the TaskWaker in an Arc and use the Waker::from implementation to convert it to a Waker. This from method takes care of constructing a RawWakerVTable and a RawWaker instance for our TaskWaker type. In case you’re interested in how it works in detail, check out the implementation in the alloc crate.
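To see the conversion and the wake path working together outside of the kernel, here is a host-side sketch of the same design. It is an approximation under a few assumptions: a mutex-protected VecDeque replaces the ArrayQueue, a plain u64 replaces the TaskId type, and the woken_ids function is made up for this example.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

// Stand-in for Arc<ArrayQueue<TaskId>>; not suitable for interrupt handlers.
type TaskQueue = Arc<Mutex<VecDeque<u64>>>;

struct TaskWaker {
    task_id: u64,
    task_queue: TaskQueue,
}

impl Wake for TaskWaker {
    fn wake(self: Arc<Self>) {
        self.wake_by_ref();
    }

    fn wake_by_ref(self: &Arc<Self>) {
        // Waking a task means pushing its ID to the shared queue.
        self.task_queue.lock().unwrap().push_back(self.task_id);
    }
}

/// Wakes task 7 twice through a `Waker` and returns the queue contents.
fn woken_ids() -> Vec<u64> {
    let queue: TaskQueue = Arc::new(Mutex::new(VecDeque::new()));
    let waker = Waker::from(Arc::new(TaskWaker {
        task_id: 7,
        task_queue: queue.clone(),
    }));

    waker.wake_by_ref(); // keeps the waker usable
    waker.clone().wake(); // consumes the cloned waker

    let ids: Vec<u64> = queue.lock().unwrap().iter().copied().collect();
    ids
}
```

Both calls end up in the same push operation, so the queue contains the ID 7 twice afterwards.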

A `run` Method

With our waker implementation in place, we can finally construct a run method for our executor:

// in src/task/executor.rs

impl Executor {
    pub fn run(&mut self) -> ! {
        loop {
            self.run_ready_tasks();
        }
    }
}

This method just calls the run_ready_tasks function in a loop. While we could theoretically return from the function when the tasks map becomes empty, this would never happen since our print_keypresses task never finishes, so a simple loop should suffice. Since the function never returns, we use the ! return type to mark the function as diverging to the compiler.

We can now change our kernel_main to use our new Executor instead of the SimpleExecutor:

// in src/main.rs

use blog_os::task::executor::Executor; // new

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    // […] initialization routines, including init_heap, test_main

    let mut executor = Executor::new(); // new
    executor.spawn(Task::new(example_task()));
    executor.spawn(Task::new(keyboard::print_keypresses()));
    executor.run();
}

We only need to change the import and the type name. Since our run function is marked as diverging, the compiler knows that it never returns, so we no longer need a call to hlt_loop at the end of our kernel_main function.

When we run our kernel using cargo run now, we see that keyboard input still works:

However, the CPU utilization of QEMU did not get any better. The reason for this is that we still keep the CPU busy the whole time. We no longer poll tasks until they are woken again, but we still check the task_queue in a busy loop. To fix this, we need to put the CPU to sleep if there is no more work to do.

Sleep If Idle

The basic idea is to execute the hlt instruction when the task_queue is empty. This instruction puts the CPU to sleep until the next interrupt arrives. The fact that the CPU immediately becomes active again on interrupts ensures that we can still directly react when an interrupt handler pushes to the task_queue.

To implement this, we create a new sleep_if_idle method in our executor and call it from our run method:

// in src/task/executor.rs

impl Executor {
    pub fn run(&mut self) -> ! {
        loop {
            self.run_ready_tasks();
            self.sleep_if_idle();   // new
        }
    }

    fn sleep_if_idle(&self) {
        if self.task_queue.is_empty() {
            x86_64::instructions::hlt();
        }
    }
}

Since we call sleep_if_idle directly after run_ready_tasks, which loops until the task_queue becomes empty, checking the queue again might seem unnecessary. However, a hardware interrupt might occur directly after run_ready_tasks returns, so there might be a new task in the queue at the time the sleep_if_idle function is called. Only if the queue is still empty, do we put the CPU to sleep by executing the hlt instruction through the instructions::hlt wrapper function provided by the x86_64 crate.

Unfortunately, there is still a subtle race condition in this implementation. Since interrupts are asynchronous and can happen at any time, it is possible that an interrupt happens right between the is_empty check and the call to hlt:

if self.task_queue.is_empty() {
    // <--- interrupt can happen here
    x86_64::instructions::hlt();
}

In case this interrupt pushes to the task_queue, we put the CPU to sleep even though there is now a ready task. In the worst case, this could delay the handling of a keyboard interrupt until the next keypress or the next timer interrupt. So how do we prevent it?

The answer is to disable interrupts on the CPU before the check and atomically enable them again together with the hlt instruction. This way, all interrupts that happen in between are delayed until after the hlt instruction so that no wake-ups are missed. To implement this approach, we can use the interrupts::enable_and_hlt function provided by the x86_64 crate.

The updated implementation of our sleep_if_idle function looks like this:

// in src/task/executor.rs

impl Executor {
    fn sleep_if_idle(&self) {
        use x86_64::instructions::interrupts::{self, enable_and_hlt};

        interrupts::disable();
        if self.task_queue.is_empty() {
            enable_and_hlt();
        } else {
            interrupts::enable();
        }
    }
}

To avoid race conditions, we disable interrupts before checking whether the task_queue is empty. If it is, we use the enable_and_hlt function to enable interrupts and put the CPU to sleep as a single atomic operation. In case the queue is no longer empty, it means that an interrupt woke a task after run_ready_tasks returned. In that case, we enable interrupts again and directly continue execution without executing hlt.
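The same check-then-sleep pattern exists in ordinary threaded programs, which makes for a useful analogy. In the following sketch (std only, not kernel code), holding the Mutex plays the role of disabling interrupts, and Condvar::wait plays the role of enable_and_hlt: it releases the lock and puts the thread to sleep in one atomic step, so a push that happens in between cannot be missed. The ReadyQueue type and its method names are made up for this illustration:

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// Hypothetical queue of ready task IDs, shared between a producer and a consumer.
pub struct ReadyQueue {
    queue: Mutex<VecDeque<u64>>,
    not_empty: Condvar,
}

impl ReadyQueue {
    pub fn new() -> Self {
        ReadyQueue {
            queue: Mutex::new(VecDeque::new()),
            not_empty: Condvar::new(),
        }
    }

    /// Plays the role of the interrupt handler: push an ID and wake the sleeper.
    pub fn push(&self, id: u64) {
        self.queue.lock().unwrap().push_back(id);
        self.not_empty.notify_one();
    }

    /// Plays the role of `sleep_if_idle` followed by popping a ready task.
    pub fn pop_or_sleep(&self) -> u64 {
        // Taking the lock corresponds to `interrupts::disable()`.
        let mut queue = self.queue.lock().unwrap();
        loop {
            if let Some(id) = queue.pop_front() {
                return id;
            }
            // Atomically unlock and sleep, like `enable_and_hlt()`.
            queue = self.not_empty.wait(queue).unwrap();
        }
    }
}
```

If the emptiness check and the sleep were two separate steps without the lock held, a push could slip in between them, which is exactly the race described above.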

Now our executor properly puts the CPU to sleep when there is nothing to do. We can see that the QEMU process has a much lower CPU utilization when we run our kernel using cargo run again.

Possible Extensions

Our executor is now able to run tasks in an efficient way. It utilizes waker notifications to avoid polling waiting tasks and puts the CPU to sleep when there is currently no work to do. However, our executor is still quite basic, and there are many possible ways to extend its functionality:

Scheduling: For our task_queue, we currently use the ArrayQueue type to implement a first in first out (FIFO) strategy, which is often also called round robin scheduling. This strategy might not be the most efficient for all workloads. For example, it might make sense to prioritize latency-critical tasks or tasks that do a lot of I/O. See the scheduling chapter of the Operating Systems: Three Easy Pieces book or the Wikipedia article on scheduling for more information.
Task Spawning: Our Executor::spawn method currently requires a &mut self reference and is thus no longer available after invoking the run method. To fix this, we could create an additional Spawner type that shares some kind of queue with the executor and allows task creation from within tasks themselves. The queue could be the task_queue directly or a separate queue that the executor checks in its run loop.
Utilizing Threads: We don’t have support for threads yet, but we will add it in the next post. This will make it possible to launch multiple instances of the executor in different threads. The advantage of this approach is that the delay imposed by long-running tasks can be reduced because other tasks can run concurrently. This approach also makes it possible to utilize multiple CPU cores.
Load Balancing: When adding threading support, it becomes important to know how to distribute the tasks between the executors to ensure that all CPU cores are utilized. A common technique for this is work stealing.
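As an illustration of the task spawning idea, here is a rough sketch of a Spawner that shares a queue with its executor. It is written against std so that it can run outside the kernel. The names MiniExecutor, Spawner, and run_until_idle are hypothetical, and the executor is deliberately naive: it uses a waker that does nothing and simply re-queues pending tasks, so a task that stays pending forever would keep it spinning:

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Wake, Waker};

type BoxedTask = Pin<Box<dyn Future<Output = ()> + Send>>;
type SharedQueue = Arc<Mutex<VecDeque<BoxedTask>>>;

/// Hypothetical handle that can be cloned and moved into tasks.
#[derive(Clone)]
pub struct Spawner {
    queue: SharedQueue,
}

impl Spawner {
    /// Only needs `&self`, so it stays usable while the executor runs.
    pub fn spawn(&self, future: impl Future<Output = ()> + Send + 'static) {
        self.queue.lock().unwrap().push_back(Box::pin(future));
    }
}

/// Waker that does nothing; enough here because pending tasks are re-queued.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

pub struct MiniExecutor {
    queue: SharedQueue,
}

impl MiniExecutor {
    /// Creates an executor together with a spawner that feeds its queue.
    pub fn new() -> (Self, Spawner) {
        let queue: SharedQueue = Arc::new(Mutex::new(VecDeque::new()));
        (MiniExecutor { queue: queue.clone() }, Spawner { queue })
    }

    /// Polls tasks in FIFO order until the shared queue is empty.
    pub fn run_until_idle(&mut self) {
        let waker = Waker::from(Arc::new(NoopWaker));
        let mut cx = Context::from_waker(&waker);
        loop {
            // The lock is released at the end of this statement, so a task
            // may call `Spawner::spawn` while it is being polled.
            let next = self.queue.lock().unwrap().pop_front();
            let Some(mut task) = next else { break };
            if task.as_mut().poll(&mut cx).is_pending() {
                self.queue.lock().unwrap().push_back(task);
            }
        }
    }
}
```

A task that owns a clone of the Spawner can then create further tasks from within its own body, which is not possible with a spawn method that takes &mut self on the executor.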

Summary

We started this post by introducing multitasking and differentiating between preemptive multitasking, which forcibly interrupts running tasks regularly, and cooperative multitasking, which lets tasks run until they voluntarily give up control of the CPU.

We then explored how Rust’s support of async/await provides a language-level implementation of cooperative multitasking. Rust bases its implementation on top of the polling-based Future trait, which abstracts asynchronous tasks. Using async/await, it is possible to work with futures almost like with normal synchronous code. The difference is that asynchronous functions return a Future, which needs to be added to an executor at some point in order to run it.

Behind the scenes, the compiler transforms async/await code to state machines, with each .await operation corresponding to a possible pause point. By utilizing its knowledge about the program, the compiler is able to save only the minimal state for each pause point, resulting in a very small memory consumption per task. One challenge is that the generated state machines might contain self-referential structs, for example when local variables of the asynchronous function reference each other. To prevent pointer invalidation, Rust uses the Pin type to ensure that futures cannot be moved in memory anymore after they have been polled for the first time.

For our implementation, we first created a very basic executor that polls all spawned tasks in a busy loop without using the Waker type at all. We then showed the advantage of waker notifications by implementing an asynchronous keyboard task. The task defines a static SCANCODE_QUEUE using the mutex-free ArrayQueue type provided by the crossbeam crate. Instead of handling keypresses directly, the keyboard interrupt handler now puts all received scancodes in the queue and then wakes the registered Waker to signal that new input is available. On the receiving end, we created a ScancodeStream type to provide a Future resolving to the next scancode in the queue. This made it possible to create an asynchronous print_keypresses task that uses async/await to interpret and print the scancodes in the queue.

To utilize the waker notifications of the keyboard task, we created a new Executor type that uses an Arc-shared task_queue for ready tasks. We implemented a TaskWaker type that pushes the ID of woken tasks directly to this task_queue, which are then polled again by the executor. To save power when no tasks are runnable, we added support for putting the CPU to sleep using the hlt instruction. Finally, we discussed some potential extensions to our executor, for example, providing multi-core support.

What’s Next?

Using async/await, we now have basic support for cooperative multitasking in our kernel. While cooperative multitasking is very efficient, it leads to latency problems when individual tasks keep running for too long, thus preventing other tasks from running. For this reason, it makes sense to also add support for preemptive multitasking to our kernel.

In the next post, we will introduce threads as the most common form of preemptive multitasking. In addition to resolving the problem of long-running tasks, threads will also prepare us for utilizing multiple CPU cores and running untrusted user programs in the future.

Updates in February 2020

Mon, 02 Mar 2020 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the corresponding libraries and tools.

`blog_os`

The repository of the Writing an OS in Rust blog received the following updates:

`x86_64`

The x86_64 crate provides support for CPU-specific instructions, registers, and data structures of the x86_64 architecture. There were lots of great contributions this month:

Add User Mode registers by @vinaychandra (released together with #118 as v0.9.0)
Improve PageTableIndex and PageOffset by @m-ou-se (released as v0.9.1)
Remove the cast dependency by @m-ou-se (released as v0.9.2)
Fix GitHub actions to run latest available rustfmt by @m-ou-se
Enable usage with non-nightly rust by @haraldh (released as v0.9.3)
- asm: add target_env = "musl" to pickup the underscore asm names by @haraldh (released as v0.9.4)
Add #[inline] attribute to small functions by @AntoineSebert (released as v0.9.5)
Fix clippy warnings by @AntoineSebert
- Resolve remaining clippy warnings and add clippy job to CI

`bootloader`

The bootloader crate received two small bugfixes and one new feature this month:

Objcopy replaces . chars with _ chars (released as v0.8.6)
Fix docs.rs build by specifying an explicit target (released as v0.8.7)
Add basic support for ELF thread local storage segments (released as v0.8.8)

`bootimage`

There were no updates to the bootimage tool this month.

`cargo-xbuild`

The cargo-xbuild crate provides support for cross-compiling libcore and liballoc. It received the following contributions this month:

Added new option to the configuration table by @parraman (released as v0.5.22)
Pick up xbuild config from workspace manifest by @ascjones (released as v0.5.23)
Make fn build and Args public to enable use as lib by @ascjones (released as v0.5.24)
Fix: Not all projects have a root package (released as v0.5.25)
Improvements to args and config for lib usage by @ascjones (released as v0.5.26)
Add cargo xfix command by @tjhu (released as v0.5.27)
Update dependencies by @parasyte (released as v0.5.28)

`uart_16550`

The uart_16550 crate, which provides basic support for uart_16550 serial output, received the following updates:

Switch CI to GitHub Actions
Cargo.toml: update x86_64 dependency by @haraldh (released as v0.2.3)
Enable usage with non-nightly rust by @haraldh (released as v0.2.4)

`multiboot2-elf64`

The multiboot2-elf64 crate provides abstractions for reading the boot information of the multiboot 2 standard, which is implemented by bootloaders like GRUB. There were two updates to the crate in February:

Add MemoryAreaType, to allow users to access memory area types in a type-safe way by @CWood1
Add some basic documentation by @mental32 (released as v0.8.2)

Updates in January 2020

Sat, 01 Feb 2020 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the corresponding libraries and tools.

`blog_os`

The repository of the Writing an OS in Rust blog received the following updates:

I also started working on the upcoming post about threads.

`bootloader`

The bootloader crate received two minor updates this month:

Since I focused my time on the new Allocator Designs post, I did not have the time to make more progress on my plan to rewrite the 16-bit/32-bit stages of the bootloader in Rust. I hope to get back to it soon.

`bootimage`

There were no updates to the bootimage tool this month.

`x86_64`

The following changes were merged this month:

Allow immediate port version of in/out instructions by @m-ou-se
Make more functions const by @m-ou-se
- Released as version 0.8.3
Return the UnusedPhysFrame on MapToError::PageAlreadyMapped by @haraldh
- This is a breaking change since it changes the signature of a type.
- No new release was published yet to give us the option to bundle it with other breaking changes.

There are also some pull requests that have some open design questions and are still being discussed:

Add p23_insert_flag_mask argument to mapper.map_to() by @haraldh
- Related proposal: Page Table Visitors by @mark-i-m
Add User Mode registers by @vinaychandra

Please feel free to join these discussions if you have opinions on the matter.

`cargo-xbuild`

The cargo-xbuild crate, which cross-compiles the sysroot, received the following updates this month:

Override target path for building sysroot by @upsuper
- Published as version 0.5.21

`uart_16550`

The uart_16550 crate, which provides basic support for uart_16550 serial output, received a small dependency update:

Update dependency for x86_64 by @haraldh
- Published as version 0.2.2

Allocator Designs

Mon, 20 Jan 2020 00:00:00 +0000

This post explains how to implement heap allocators from scratch. It presents and discusses different allocator designs, including bump allocation, linked list allocation, and fixed-size block allocation. For each of the three designs, we will create a basic implementation that can be used for our kernel.

Introduction

In the previous post, we added basic support for heap allocations to our kernel. For that, we created a new memory region in the page tables and used the linked_list_allocator crate to manage that memory. While we have a working heap now, we left most of the work to the allocator crate without trying to understand how it works.

In this post, we will show how to create our own heap allocator from scratch instead of relying on an existing allocator crate. We will discuss different allocator designs, including a simplistic bump allocator and a basic fixed-size block allocator, and use this knowledge to implement an allocator with improved performance (compared to the linked_list_allocator crate).

Design Goals

The responsibility of an allocator is to manage the available heap memory. It needs to return unused memory on alloc calls and keep track of memory freed by dealloc so that it can be reused again. Most importantly, it must never hand out memory that is already in use somewhere else because this would cause undefined behavior.

Apart from correctness, there are many secondary design goals. For example, the allocator should effectively utilize the available memory and keep fragmentation low. Furthermore, it should work well for concurrent applications and scale to any number of processors. For maximal performance, it could even optimize the memory layout with respect to the CPU caches to improve cache locality and avoid false sharing.

These requirements can make good allocators very complex. For example, jemalloc has over 30,000 lines of code. This complexity is often undesired in kernel code, where a single bug can lead to severe security vulnerabilities. Fortunately, the allocation patterns of kernel code are often much simpler compared to userspace code, so that relatively simple allocator designs often suffice.

In the following, we present three possible kernel allocator designs and explain their advantages and drawbacks.

Bump Allocator

The simplest allocator design is a bump allocator (also known as stack allocator). It allocates memory linearly and only keeps track of the number of allocated bytes and the number of allocations. It is only useful in very specific use cases because it has a severe limitation: it can only free all memory at once.

Idea

The idea behind a bump allocator is to linearly allocate memory by increasing (“bumping”) a next variable, which points to the start of the unused memory. At the beginning, next is equal to the start address of the heap. On each allocation, next is increased by the allocation size so that it always points to the boundary between used and unused memory:

The next pointer only moves in a single direction and thus never hands out the same memory region twice. When it reaches the end of the heap, no more memory can be allocated, resulting in an out-of-memory error on the next allocation.

A bump allocator is often implemented with an allocation counter, which is increased by 1 on each alloc call and decreased by 1 on each dealloc call. When the allocation counter reaches zero, it means that all allocations on the heap have been deallocated. In this case, the next pointer can be reset to the start address of the heap, so that the complete heap memory is available for allocations again.

Implementation

We start our implementation by declaring a new allocator::bump submodule:

// in src/allocator.rs

pub mod bump;

The content of the submodule lives in a new src/allocator/bump.rs file, which we create with the following content:

// in src/allocator/bump.rs

pub struct BumpAllocator {
    heap_start: usize,
    heap_end: usize,
    next: usize,
    allocations: usize,
}

impl BumpAllocator {
    /// Creates a new empty bump allocator.
    pub const fn new() -> Self {
        BumpAllocator {
            heap_start: 0,
            heap_end: 0,
            next: 0,
            allocations: 0,
        }
    }

    /// Initializes the bump allocator with the given heap bounds.
    ///
    /// This method is unsafe because the caller must ensure that the given
    /// memory range is unused. Also, this method must be called only once.
    pub unsafe fn init(&mut self, heap_start: usize, heap_size: usize) {
        self.heap_start = heap_start;
        self.heap_end = heap_start + heap_size;
        self.next = heap_start;
    }
}

The heap_start and heap_end fields keep track of the lower and upper bounds of the heap memory region. The caller needs to ensure that these addresses are valid, otherwise the allocator would return invalid memory. For this reason, the init function needs to be unsafe to call.

The purpose of the next field is to always point to the first unused byte of the heap, i.e., the start address of the next allocation. It is set to heap_start in the init function because at the beginning, the entire heap is unused. On each allocation, this field will be increased by the allocation size (“bumped”) to ensure that we don’t return the same memory region twice.

The allocations field is a simple counter for the active allocations with the goal of resetting the allocator after the last allocation has been freed. It is initialized with 0.

We chose to create a separate init function instead of performing the initialization directly in new in order to keep the interface identical to the allocator provided by the linked_list_allocator crate. This way, the allocators can be switched without additional code changes.

Implementing `GlobalAlloc`

As explained in the previous post, all heap allocators need to implement the GlobalAlloc trait, which is defined like this:

pub unsafe trait GlobalAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8;
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout);

    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 { ... }
    unsafe fn realloc(
        &self,
        ptr: *mut u8,
        layout: Layout,
        new_size: usize
    ) -> *mut u8 { ... }
}

Only the alloc and dealloc methods are required; the other two methods have default implementations and can be omitted.

First Implementation Attempt

Let’s try to implement the alloc method for our BumpAllocator:

// in src/allocator/bump.rs

use alloc::alloc::{GlobalAlloc, Layout};

unsafe impl GlobalAlloc for BumpAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // TODO alignment and bounds check
        let alloc_start = self.next;
        self.next = alloc_start + layout.size();
        self.allocations += 1;
        alloc_start as *mut u8
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        todo!();
    }
}

First, we use the next field as the start address for our allocation. Then we update the next field to point to the end address of the allocation, which is the next unused address on the heap. Before returning the start address of the allocation as a *mut u8 pointer, we increase the allocations counter by 1.

Note that we don’t perform any bounds checks or alignment adjustments, so this implementation is not safe yet. This does not matter much because it fails to compile anyway with the following error:

error[E0594]: cannot assign to `self.next` which is behind a `&` reference
  --> src/allocator/bump.rs:29:9
   |
29 |         self.next = alloc_start + layout.size();
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ `self` is a `&` reference, so the data it refers to cannot be written

(The same error also occurs for the self.allocations += 1 line. We omitted it here for brevity.)

The error occurs because the alloc and dealloc methods of the GlobalAlloc trait only operate on an immutable &self reference, so updating the next and allocations fields is not possible. This is problematic because updating next on every allocation is the essential principle of a bump allocator.

`GlobalAlloc` and Mutability

Before we look at a possible solution to this mutability problem, let’s try to understand why the GlobalAlloc trait methods are defined with &self arguments: As we saw in the previous post, the global heap allocator is defined by adding the #[global_allocator] attribute to a static that implements the GlobalAlloc trait. Static variables are immutable in Rust, so there is no way to call a method that takes &mut self on the static allocator. For this reason, all the methods of GlobalAlloc only take an immutable &self reference.

Fortunately, there is a way to get a &mut self reference from a &self reference: We can use synchronized interior mutability by wrapping the allocator in a spin::Mutex spinlock. This type provides a lock method that performs mutual exclusion and thus safely turns a &self reference to a &mut self reference. We’ve already used the wrapper type multiple times in our kernel, for example for the VGA text buffer.
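The following sketch shows this pattern on a hosted target. It assumes std::sync::Mutex as a stand-in for spin::Mutex, and the IdGenerator type is a made-up example that has nothing to do with allocation. The point is only that a method taking &self can still mutate state behind the lock, even when the value lives in an immutable static:

```rust
use std::sync::Mutex;

/// Hypothetical example type: hands out increasing IDs through `&self`.
pub struct IdGenerator {
    next: Mutex<u64>,
}

impl IdGenerator {
    /// A `const fn`, so the value can be used to initialize a `static`.
    pub const fn new() -> Self {
        IdGenerator { next: Mutex::new(0) }
    }

    /// Takes `&self`, yet mutates the counter behind the lock.
    pub fn next_id(&self) -> u64 {
        let mut next = self.next.lock().unwrap();
        let id = *next;
        *next += 1;
        id
    }
}

/// An immutable `static`, similar to a `#[global_allocator]` static.
pub static IDS: IdGenerator = IdGenerator::new();
```

Each call to IDS.next_id() returns the previous value plus one, although IDS itself is never declared mutable.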

A `Locked` Wrapper Type

With the help of the spin::Mutex wrapper type, we can implement the GlobalAlloc trait for our bump allocator. The trick is to implement the trait not for the BumpAllocator directly, but for the wrapped spin::Mutex<BumpAllocator> type:

unsafe impl GlobalAlloc for spin::Mutex<BumpAllocator> {…}

Unfortunately, this still doesn’t work because the Rust compiler does not permit implementing a trait that is defined in another crate for a type that is also defined in another crate:

error[E0117]: only traits defined in the current crate can be implemented for arbitrary types
  --> src/allocator/bump.rs:28:1
   |
28 | unsafe impl GlobalAlloc for spin::Mutex<BumpAllocator> {
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^--------------------------
   | |                           |
   | |                           `spin::mutex::Mutex` is not defined in the current crate
   | impl doesn't use only types from inside the current crate
   |
   = note: define and implement a trait or new type instead

To fix this, we need to create our own wrapper type around spin::Mutex:

// in src/allocator.rs

/// A wrapper around spin::Mutex to permit trait implementations.
pub struct Locked<A> {
    inner: spin::Mutex<A>,
}

impl<A> Locked<A> {
    pub const fn new(inner: A) -> Self {
        Locked {
            inner: spin::Mutex::new(inner),
        }
    }

    pub fn lock(&self) -> spin::MutexGuard<A> {
        self.inner.lock()
    }
}

The type is a generic wrapper around a spin::Mutex<A>. It imposes no restrictions on the wrapped type A, so it can be used to wrap all kinds of types, not just allocators. It provides a simple new constructor function that wraps a given value. For convenience, it also provides a lock function that calls lock on the wrapped Mutex. Since the Locked type is general enough to be useful for other allocator implementations too, we put it in the parent allocator module.
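To try the wrapper outside the kernel, the sketch below rebuilds Locked on top of std::sync::Mutex (an assumption, since our kernel uses spin::Mutex) and implements GlobalAlloc for a wrapped CountingAllocator. That allocator type is hypothetical: it forwards every request to the System allocator from std and only counts live allocations:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::{Mutex, MutexGuard};

/// Same idea as the kernel's `Locked`, but based on `std::sync::Mutex`.
pub struct Locked<A> {
    inner: Mutex<A>,
}

impl<A> Locked<A> {
    pub const fn new(inner: A) -> Self {
        Locked { inner: Mutex::new(inner) }
    }

    pub fn lock(&self) -> MutexGuard<'_, A> {
        self.inner.lock().unwrap()
    }
}

/// Hypothetical allocator state: only counts live allocations.
pub struct CountingAllocator {
    pub live: usize,
}

// Allowed because `Locked` is defined in this crate, even though
// `GlobalAlloc` and `Mutex` are not.
unsafe impl GlobalAlloc for Locked<CountingAllocator> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Forward the actual work to the system allocator.
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            self.lock().live += 1; // mutation through `&self`
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        self.lock().live -= 1;
    }
}
```

The sketch is meant to be called directly rather than registered with #[global_allocator]. Its only purpose is to show that the trait implementation compiles for the local wrapper type.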

Implementation for `Locked<BumpAllocator>`

The Locked type is defined in our own crate (in contrast to spin::Mutex), so we can use it to implement GlobalAlloc for our bump allocator. The full implementation looks like this:

// in src/allocator/bump.rs

use super::{align_up, Locked};
use alloc::alloc::{GlobalAlloc, Layout};
use core::ptr;

unsafe impl GlobalAlloc for Locked<BumpAllocator> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let mut bump = self.lock(); // get a mutable reference

        let alloc_start = align_up(bump.next, layout.align());
        let alloc_end = match alloc_start.checked_add(layout.size()) {
            Some(end) => end,
            None => return ptr::null_mut(),
        };

        if alloc_end > bump.heap_end {
            ptr::null_mut() // out of memory
        } else {
            bump.next = alloc_end;
            bump.allocations += 1;
            alloc_start as *mut u8
        }
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        let mut bump = self.lock(); // get a mutable reference

        bump.allocations -= 1;
        if bump.allocations == 0 {
            bump.next = bump.heap_start;
        }
    }
}

The first step for both alloc and dealloc is to call the Mutex::lock method through the inner field to get a mutable reference to the wrapped allocator type. The instance remains locked until the end of the method, so that no data race can occur in multithreaded contexts (we will add threading support soon).

Compared to the previous prototype, the alloc implementation now respects alignment requirements and performs a bounds check to ensure that the allocations stay inside the heap memory region. The first step is to round up the next address to the alignment specified by the Layout argument. The code for the align_up function is shown in a moment. We then add the requested allocation size to alloc_start to get the end address of the allocation. To prevent integer overflow on large allocations, we use the checked_add method. If an overflow occurs or if the resulting end address of the allocation is larger than the end address of the heap, we return a null pointer to signal an out-of-memory situation. Otherwise, we update the next address and increase the allocations counter by 1 like before. Finally, we return the alloc_start address converted to a *mut u8 pointer.

The dealloc function ignores the given pointer and Layout arguments. Instead, it just decreases the allocations counter. If the counter reaches 0 again, it means that all allocations were freed again. In this case, it resets the next address to the heap_start address to make the complete heap memory available again.
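To experiment with this bookkeeping without booting the kernel, the logic can be modeled with plain integers. The BumpModel type below is a hypothetical userspace stand-in that mirrors the fields of BumpAllocator. It returns an Option<usize> address instead of a raw pointer, and it uses the bitmask variant of align_up, which assumes that align is a power of two:

```rust
/// Rounds `addr` up to the next multiple of `align` (a power of two).
fn align_up(addr: usize, align: usize) -> usize {
    (addr + align - 1) & !(align - 1)
}

/// Userspace model of the bump allocator's bookkeeping.
pub struct BumpModel {
    pub heap_start: usize,
    pub heap_end: usize,
    pub next: usize,
    pub allocations: usize,
}

impl BumpModel {
    pub fn new(heap_start: usize, heap_size: usize) -> Self {
        BumpModel {
            heap_start,
            heap_end: heap_start + heap_size,
            next: heap_start,
            allocations: 0,
        }
    }

    /// Returns the start address of the allocation, or `None` when out of memory.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        let alloc_start = align_up(self.next, align);
        let alloc_end = alloc_start.checked_add(size)?;
        if alloc_end > self.heap_end {
            return None;
        }
        self.next = alloc_end;
        self.allocations += 1;
        Some(alloc_start)
    }

    /// Ignores which allocation is freed; resets only when the counter hits zero.
    pub fn dealloc(&mut self) {
        self.allocations -= 1;
        if self.allocations == 0 {
            self.next = self.heap_start;
        }
    }
}
```

With a 64-byte heap at 0x1000, a 1-byte allocation followed by an 8-byte allocation with 8-byte alignment yields the addresses 0x1000 and 0x1008. The next field only returns to 0x1000 after both allocations have been freed.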

Address Alignment

The align_up function is general enough that we can put it into the parent allocator module. A basic implementation looks like this:

// in src/allocator.rs

/// Align the given address `addr` upwards to alignment `align`.
fn align_up(addr: usize, align: usize) -> usize {
    let remainder = addr % align;
    if remainder == 0 {
        addr // addr already aligned
    } else {
        addr - remainder + align
    }
}

The function first computes the remainder of the division of addr by align. If the remainder is 0, the address is already aligned with the given alignment. Otherwise, we align the address by subtracting the remainder (so that the new remainder is 0) and then adding the alignment (so that the address does not become smaller than the original address).

Note that this isn’t the most efficient way to implement this function. A much faster implementation looks like this:

/// Align the given address `addr` upwards to alignment `align`.
///
/// Requires that `align` is a power of two.
fn align_up(addr: usize, align: usize) -> usize {
    (addr + align - 1) & !(align - 1)
}

This method requires align to be a power of two, which can be guaranteed by utilizing the GlobalAlloc trait (and its Layout parameter). This makes it possible to create a bitmask to align the address in a very efficient way. To understand how it works, let’s go through it step by step, starting on the right side:

Since align is a power of two, its binary representation has only a single bit set (e.g. 0b000100000). This means that align - 1 has all the lower bits set (e.g. 0b000011111).
By creating the bitwise NOT through the ! operator, we get a number that has all the bits set except for the bits lower than align (e.g. 0b…111111111100000).
By performing a bitwise AND on an address and !(align - 1), we align the address downwards. This works by clearing all the bits that are lower than align.
Since we want to align upwards instead of downwards, we increase the addr by align - 1 before performing the bitwise AND. This way, already aligned addresses remain the same while non-aligned addresses are rounded to the next alignment boundary.

Which variant you choose is up to you. Both compute the same result, only using different methods.
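To convince ourselves that the two variants agree, we can put them side by side under different names and compare their results for power-of-two alignments. The names align_up_modulo and align_up_mask are only used in this sketch:

```rust
/// Remainder-based variant; works for any non-zero `align`.
pub fn align_up_modulo(addr: usize, align: usize) -> usize {
    let remainder = addr % align;
    if remainder == 0 {
        addr // already aligned
    } else {
        addr - remainder + align
    }
}

/// Bitmask variant; requires `align` to be a power of two.
pub fn align_up_mask(addr: usize, align: usize) -> usize {
    (addr + align - 1) & !(align - 1)
}

/// Returns true if both variants agree for all addresses below `limit`.
pub fn variants_agree(limit: usize, align: usize) -> bool {
    (0..limit).all(|addr| align_up_modulo(addr, align) == align_up_mask(addr, align))
}
```

For example, both functions round the address 13 up to 16 for an alignment of 8 and leave 16 unchanged. Note that the comparison only holds for power-of-two alignments: for an alignment such as 12, the bitmask variant does not compute a multiple of the alignment.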

Using It

To use the bump allocator instead of the linked_list_allocator crate, we need to update the ALLOCATOR static in allocator.rs:

// in src/allocator.rs

use bump::BumpAllocator;

#[global_allocator]
static ALLOCATOR: Locked<BumpAllocator> = Locked::new(BumpAllocator::new());

Here it becomes important that we declared BumpAllocator::new and Locked::new as const functions. If they were normal functions, a compilation error would occur because the initialization expression of a static must be evaluable at compile time.

We don’t need to change the ALLOCATOR.lock().init(HEAP_START, HEAP_SIZE) call in our init_heap function because the bump allocator provides the same interface as the allocator provided by the linked_list_allocator.

Now our kernel uses our bump allocator! Everything should still work, including the heap_allocation tests that we created in the previous post:

> cargo test --test heap_allocation
[…]
Running 3 tests
simple_allocation... [ok]
large_vec... [ok]
many_boxes... [ok]

Discussion

The big advantage of bump allocation is that it’s very fast. Compared to other allocator designs (see below) that need to actively look for a fitting memory block and perform various bookkeeping tasks on alloc and dealloc, a bump allocator can be optimized to just a few assembly instructions. This makes bump allocators useful for optimizing the allocation performance, for example when creating a virtual DOM library.

While a bump allocator is seldom used as the global allocator, the principle of bump allocation is often applied in the form of arena allocation, which basically batches individual allocations together to improve performance. An example of an arena allocator for Rust is contained in the toolshed crate.

The Drawback of a Bump Allocator

The main limitation of a bump allocator is that it can only reuse deallocated memory after all allocations have been freed. This means that a single long-lived allocation suffices to prevent memory reuse. We can see this when we add a variation of the many_boxes test:

// in tests/heap_allocation.rs

#[test_case]
fn many_boxes_long_lived() {
    let long_lived = Box::new(1); // new
    for i in 0..HEAP_SIZE {
        let x = Box::new(i);
        assert_eq!(*x, i);
    }
    assert_eq!(*long_lived, 1); // new
}

Like the many_boxes test, this test creates a large number of allocations to provoke an out-of-memory failure if the allocator does not reuse freed memory. Additionally, the test creates a long_lived allocation, which lives for the whole loop execution.

When we try to run our new test, we see that it indeed fails:

> cargo test --test heap_allocation
Running 4 tests
simple_allocation... [ok]
large_vec... [ok]
many_boxes... [ok]
many_boxes_long_lived... [failed]

Error: panicked at 'allocation error: Layout { size_: 8, align_: 8 }', src/lib.rs:86:5

Let's try to understand why this failure occurs in detail: First, the long_lived allocation is created at the start of the heap, thereby increasing the allocations counter by 1. For each iteration of the loop, a short-lived allocation is created and directly freed again before the next iteration starts. This means that the allocations counter is temporarily increased to 2 at the beginning of an iteration and decreased to 1 at the end of it. The problem now is that the bump allocator can only reuse memory after all allocations have been freed, i.e., when the allocations counter falls to 0. Since this doesn't happen before the end of the loop, each loop iteration allocates a new region of memory, leading to an out-of-memory error after a number of iterations.
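To make the counter behaviour concrete, here is a small host-side sketch that models only the bookkeeping of a bump allocator (a next offset and an allocations counter) without touching real memory. The BumpModel type, its methods, and the run_loop helper are made up for this illustration and are not part of the kernel code. Alignment is ignored for simplicity.

```rust
/// Hypothetical model of the bump allocator's bookkeeping (no real memory).
struct BumpModel {
    heap_size: usize,
    next: usize,        // offset of the next free byte
    allocations: usize, // number of live allocations
}

impl BumpModel {
    fn new(heap_size: usize) -> Self {
        BumpModel { heap_size, next: 0, allocations: 0 }
    }

    /// Returns the offset of the new allocation, or None when out of memory.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let end = self.next.checked_add(size)?;
        if end > self.heap_size {
            return None;
        }
        let start = self.next;
        self.next = end;
        self.allocations += 1;
        Some(start)
    }

    /// Memory only becomes reusable once the counter drops to 0.
    fn dealloc(&mut self) {
        self.allocations -= 1;
        if self.allocations == 0 {
            self.next = 0;
        }
    }
}

/// Runs `iterations` short-lived 8-byte allocations and reports whether
/// all of them succeeded.
fn run_loop(heap_size: usize, iterations: usize, with_long_lived: bool) -> bool {
    let mut heap = BumpModel::new(heap_size);
    if with_long_lived {
        heap.alloc(8).expect("heap too small");
    }
    for _ in 0..iterations {
        if heap.alloc(8).is_none() {
            return false; // out of memory
        }
        heap.dealloc();
    }
    true
}
```

Calling run_loop(1024, 10_000, false) succeeds because the counter reaches 0 after every iteration, whereas run_loop(1024, 10_000, true) runs out of memory because the long-lived allocation keeps the counter at 1 or above.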

Fixing the Test?

There are two potential tricks that we could utilize to fix the test for our bump allocator:

We could update dealloc to check whether the freed allocation was the last allocation returned by alloc by comparing its end address with the next pointer. In case they're equal, we can safely reset next back to the start address of the freed allocation. This way, each loop iteration reuses the same memory block.
We could add an alloc_back method that allocates memory from the end of the heap using an additional next_back field. Then we could manually use this allocation method for all long-lived allocations, thereby separating short-lived and long-lived allocations on the heap. Note that this separation only works if it's clear beforehand how long each allocation will live. Another drawback of this approach is that manually performing allocations is cumbersome and potentially unsafe.
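The first trick could look roughly like the following sketch. It is a simplified host-side model, not our kernel's BumpAllocator: the BumpModel type is hypothetical, it works with plain offsets instead of pointers, and it leaves out the allocations counter and alignment handling.

```rust
/// Hypothetical bump model in which `dealloc` can undo the latest allocation.
struct BumpModel {
    heap_size: usize,
    next: usize, // offset of the next free byte
}

impl BumpModel {
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let end = self.next.checked_add(size)?;
        if end > self.heap_size {
            return None;
        }
        let start = self.next;
        self.next = end;
        Some(start)
    }

    fn dealloc(&mut self, start: usize, size: usize) {
        // Only the most recent allocation ends exactly at `next`.
        if start + size == self.next {
            self.next = start;
        }
    }
}
```

With this change, a loop that allocates and immediately frees a block keeps getting the same offset back, even while an older allocation is still alive. The limitation is also visible: if two blocks are allocated and the older one is freed first, its memory is not reclaimed because it does not end at next.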

While both of these approaches work to fix the test, they are not a general solution since they are only able to reuse memory in very specific cases. The question is: Is there a general solution that reuses all freed memory?

Reusing All Freed Memory?

As we learned in the previous post, allocations can live arbitrarily long and can be freed in an arbitrary order. This means that we need to keep track of a potentially unbounded number of non-contiguous, unused memory regions, as illustrated by the following example:

The graphic shows the heap over the course of time. At the beginning, the complete heap is unused, and the next address is equal to heap_start (line 1). Then the first allocation occurs (line 2). In line 3, a second memory block is allocated and the first allocation is freed. Many more allocations are added in line 4. Half of them are very short-lived and already get freed in line 5, where another new allocation is also added.

Line 5 shows the fundamental problem: We have five unused memory regions with different sizes, but the next pointer can only point to the beginning of the last region. While we could store the start addresses and sizes of the other unused memory regions in an array of size 4 for this example, this isn't a general solution since we could easily create an example with 8, 16, or 1000 unused memory regions.

Normally, when we have a potentially unbounded number of items, we can just use a heap-allocated collection. This isn't really possible in our case, since the heap allocator can't depend on itself (it would cause endless recursion or deadlocks). So we need to find a different solution.

Linked List Allocator

A common trick to keep track of an arbitrary number of free memory areas when implementing allocators is to use these areas themselves as backing storage. This utilizes the fact that the regions are still mapped to a virtual address and backed by a physical frame, but the stored information is not needed anymore. By storing the information about the freed region in the region itself, we can keep track of an unbounded number of freed regions without needing additional memory.

The most common implementation approach is to construct a singly linked list in the freed memory, with each node being a freed memory region:

Each list node contains two fields: the size of the memory region and a pointer to the next unused memory region. With this approach, we only need a pointer to the first unused region (called head) to keep track of all unused regions, regardless of their number. The resulting data structure is often called a free list.

As you might guess from the name, this is the technique that the linked_list_allocator crate uses. Allocators that use this technique are also often called pool allocators.

Implementation

In the following, we will create our own simple LinkedListAllocator type that uses the above approach for keeping track of freed memory regions. This part of the post isn't required for future posts, so you can skip the implementation details if you like.

The Allocator Type

We start by creating a private ListNode struct in a new allocator::linked_list submodule:

// in src/allocator.rs

pub mod linked_list;

// in src/allocator/linked_list.rs

struct ListNode {
    size: usize,
    next: Option<&'static mut ListNode>,
}

Like in the graphic, a list node has a size field and an optional pointer to the next node, represented by the Option<&'static mut ListNode> type. The &'static mut type semantically describes an owned object behind a pointer. Basically, it's a Box without a destructor that frees the object at the end of the scope.

We implement the following set of methods for ListNode:

// in src/allocator/linked_list.rs

impl ListNode {
    const fn new(size: usize) -> Self {
        ListNode { size, next: None }
    }

    fn start_addr(&self) -> usize {
        self as *const Self as usize
    }

    fn end_addr(&self) -> usize {
        self.start_addr() + self.size
    }
}

The type has a simple constructor function named new and methods to calculate the start and end addresses of the represented region. We make the new function a const function, which will be required later when constructing a static linked list allocator. Note that any use of mutable references in const functions (including setting the next field to None) is still unstable. In order to get it to compile, we need to add #![feature(const_mut_refs)] to the beginning of our lib.rs.

With the ListNode struct as a building block, we can now create the LinkedListAllocator struct:

// in src/allocator/linked_list.rs

pub struct LinkedListAllocator {
    head: ListNode,
}

impl LinkedListAllocator {
    /// Creates an empty LinkedListAllocator.
    pub const fn new() -> Self {
        Self {
            head: ListNode::new(0),
        }
    }

    /// Initialize the allocator with the given heap bounds.
    ///
    /// This function is unsafe because the caller must guarantee that the given
    /// heap bounds are valid and that the heap is unused. This method must be
    /// called only once.
    pub unsafe fn init(&mut self, heap_start: usize, heap_size: usize) {
        self.add_free_region(heap_start, heap_size);
    }

    /// Adds the given memory region to the front of the list.
    unsafe fn add_free_region(&mut self, addr: usize, size: usize) {
        todo!();
    }
}

The struct contains a head node that points to the first heap region. We are only interested in the value of the next pointer, so we set the size to 0 in the ListNode::new function. Making head a ListNode instead of just a &'static mut ListNode has the advantage that the implementation of the alloc method will be simpler.

Like for the bump allocator, the new function doesn't initialize the allocator with the heap bounds. In addition to maintaining API compatibility, the reason is that the initialization routine requires writing a node to the heap memory, which can only happen at runtime. The new function, however, needs to be a const function that can be evaluated at compile time because it will be used for initializing the ALLOCATOR static. For this reason, we again provide a separate, non-constant init method.

The init method uses an add_free_region method, whose implementation will be shown in a moment. For now, we use the todo! macro to provide a placeholder implementation that always panics.

The `add_free_region` Method

The add_free_region method provides the fundamental push operation on the linked list. We currently only call this method from init, but it will also be the central method in our dealloc implementation. Remember, the dealloc method is called when an allocated memory region is freed again. To keep track of this freed memory region, we want to push it to the linked list.

The implementation of the add_free_region method looks like this:

// in src/allocator/linked_list.rs

use super::align_up;
use core::mem;

impl LinkedListAllocator {
    /// Adds the given memory region to the front of the list.
    unsafe fn add_free_region(&mut self, addr: usize, size: usize) {
        // ensure that the freed region is capable of holding ListNode
        assert_eq!(align_up(addr, mem::align_of::<ListNode>()), addr);
        assert!(size >= mem::size_of::<ListNode>());

        // create a new list node and insert it at the start of the list
        let mut node = ListNode::new(size);
        node.next = self.head.next.take();
        let node_ptr = addr as *mut ListNode;
        node_ptr.write(node);
        self.head.next = Some(&mut *node_ptr)
    }
}

The method takes the address and size of a memory region as an argument and adds it to the front of the list. First, it ensures that the given region has the necessary size and alignment for storing a ListNode. Then it creates the node and inserts it into the list through the following steps:

Step 0 shows the state of the heap before add_free_region is called. In step 1, the method is called with the memory region marked as freed in the graphic. After the initial checks, the method creates a new node on its stack with the size of the freed region. It then uses the Option::take method to set the next pointer of the node to the current head pointer, thereby resetting the head pointer to None.

In step 2, the method writes the newly created node to the beginning of the freed memory region through the write method. It then points the head pointer to the new node. The resulting pointer structure looks a bit chaotic because the freed region is always inserted at the beginning of the list, but if we follow the pointers, we see that each free region is still reachable from the head pointer.
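To see the resulting list order outside the kernel, the push operation can be tried on a normal host program. The sketch below repeats the push-to-front logic on a hypothetical FreeList type and uses a deliberately leaked buffer as stand-in heap memory. The FreeList type, the region_sizes helper, and the demo_region_sizes function are assumptions made for this example only.

```rust
use std::mem;

struct ListNode {
    size: usize,
    next: Option<&'static mut ListNode>,
}

struct FreeList {
    head: ListNode,
}

impl FreeList {
    /// Same push-to-front logic as `add_free_region` in the post.
    unsafe fn add_free_region(&mut self, addr: usize, size: usize) {
        assert_eq!(addr % mem::align_of::<ListNode>(), 0);
        assert!(size >= mem::size_of::<ListNode>());

        // the new node takes over the old head pointer (head becomes None)
        let node = ListNode { size, next: self.head.next.take() };
        let node_ptr = addr as *mut ListNode;
        unsafe {
            node_ptr.write(node);
            self.head.next = Some(&mut *node_ptr);
        }
    }

    /// Walks the list from the head and collects the region sizes.
    fn region_sizes(&self) -> Vec<usize> {
        let mut sizes = Vec::new();
        let mut current = &self.head;
        while let Some(node) = current.next.as_deref() {
            sizes.push(node.size);
            current = node;
        }
        sizes
    }
}

/// Hypothetical demo: pushes two regions of a leaked buffer onto the list.
fn demo_region_sizes() -> Vec<usize> {
    // Leak a 256-byte buffer so that it lives for 'static; u128 elements
    // give an alignment that is sufficient for a ListNode.
    let base = Box::leak(Box::new([0u128; 16])).as_mut_ptr() as usize;
    let mut list = FreeList { head: ListNode { size: 0, next: None } };
    unsafe {
        list.add_free_region(base, 64);       // first freed region
        list.add_free_region(base + 128, 32); // second freed region
    }
    list.region_sizes()
}
```

Because every region is inserted at the front, the region that was freed last (size 32) is the first one reached from the head, followed by the region freed earlier (size 64).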

The `find_region` Method

The second fundamental operation on a linked list is finding an entry and removing it from the list. This is the central operation needed for implementing the alloc method. We implement the operation as a find_region method in the following way:

// in src/allocator/linked_list.rs

impl LinkedListAllocator {
    /// Looks for a free region with the given size and alignment and removes
    /// it from the list.
    ///
    /// Returns a tuple of the list node and the start address of the allocation.
    fn find_region(&mut self, size: usize, align: usize)
        -> Option<(&'static mut ListNode, usize)>
    {
        // reference to current list node, updated for each iteration
        let mut current = &mut self.head;
        // look for a large enough memory region in linked list
        while let Some(ref mut region) = current.next {
            if let Ok(alloc_start) = Self::alloc_from_region(&region, size, align) {
                // region suitable for allocation -> remove node from list
                let next = region.next.take();
                let ret = Some((current.next.take().unwrap(), alloc_start));
                current.next = next;
                return ret;
            } else {
                // region not suitable -> continue with next region
                current = current.next.as_mut().unwrap();
            }
        }

        // no suitable region found
        None
    }
}

The method uses a current variable and a while let loop to iterate over the list elements. At the beginning, current is set to the (dummy) head node. On each iteration, it is then updated to the next field of the current node (in the else block). If the region is suitable for an allocation with the given size and alignment, the region is removed from the list and returned together with the alloc_start address.

When the current.next pointer becomes None, the loop exits. This means we iterated over the whole list but found no region suitable for an allocation. In that case, we return None. Whether a region is suitable is checked by the alloc_from_region function, whose implementation will be shown in a moment.

Let's take a more detailed look at how a suitable region is removed from the list:

Step 0 shows the situation before any pointer adjustments. The region and current regions and the region.next and current.next pointers are marked in the graphic. In step 1, both the region.next and current.next pointers are reset to None by using the Option::take method. The original pointers are stored in local variables called next and ret.

In step 2, the current.next pointer is set to the local next pointer, which is the original region.next pointer. The effect is that current now directly points to the region after region, so that region is no longer an element of the linked list. The function then returns the pointer to region stored in the local ret variable.
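The same take-and-relink pattern can be practiced on an ordinary heap-allocated list. The following sketch uses Box instead of &'static mut references and only compares sizes, so it is a simplified stand-in for find_region rather than the kernel code. The Node type and the remove_first_fit, build_list, and sizes helpers are hypothetical names.

```rust
struct Node {
    size: usize,
    next: Option<Box<Node>>,
}

/// Removes and returns the first node whose size is at least `min_size`.
fn remove_first_fit(head: &mut Node, min_size: usize) -> Option<Box<Node>> {
    let mut current = head;
    // advance while the next node exists but is too small
    while current.next.as_ref().map_or(false, |n| n.size < min_size) {
        current = current.next.as_mut().unwrap();
    }
    // step 1: detach the node (current.next becomes None)
    let mut removed = current.next.take()?;
    // step 2: link current directly to the node after the removed one
    current.next = removed.next.take();
    Some(removed)
}

/// Builds a list with a dummy head and the given sizes in order.
fn build_list(sizes: &[usize]) -> Node {
    let mut head = Node { size: 0, next: None };
    for &size in sizes.iter().rev() {
        head.next = Some(Box::new(Node { size, next: head.next.take() }));
    }
    head
}

/// Collects the sizes of all nodes behind the dummy head.
fn sizes(head: &Node) -> Vec<usize> {
    let mut out = Vec::new();
    let mut current = head;
    while let Some(node) = current.next.as_deref() {
        out.push(node.size);
        current = node;
    }
    out
}
```

For a list with sizes 16, 64, and 32, a request for at least 48 bytes removes the 64-byte node and leaves 16 and 32 in the list. A request that no node can satisfy returns None and leaves the list unchanged.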

The `alloc_from_region` Function

The alloc_from_region function returns whether a region is suitable for an allocation with a given size and alignment. It is defined like this:

// in src/allocator/linked_list.rs

impl LinkedListAllocator {
    /// Try to use the given region for an allocation with given size and
    /// alignment.
    ///
    /// Returns the allocation start address on success.
    fn alloc_from_region(region: &ListNode, size: usize, align: usize)
        -> Result<usize, ()>
    {
        let alloc_start = align_up(region.start_addr(), align);
        let alloc_end = alloc_start.checked_add(size).ok_or(())?;

        if alloc_end > region.end_addr() {
            // region too small
            return Err(());
        }

        let excess_size = region.end_addr() - alloc_end;
        if excess_size > 0 && excess_size < mem::size_of::<ListNode>() {
            // rest of region too small to hold a ListNode (required because the
            // allocation splits the region in a used and a free part)
            return Err(());
        }

        // region suitable for allocation
        Ok(alloc_start)
    }
}

First, the function calculates the start and end address of a potential allocation, using the align_up function we defined earlier and the checked_add method. If an overflow occurs or if the end address lies beyond the end address of the region, the allocation doesn't fit in the region and we return an error.

The function performs a less obvious check after that. This check is necessary because most of the time an allocation does not fit a suitable region perfectly, so that a part of the region remains usable after the allocation. This part of the region must store its own ListNode after the allocation, so it must be large enough to do so. The check verifies exactly that: either the allocation fits perfectly (excess_size == 0) or the excess size is large enough to store a ListNode.
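To make the two checks easier to follow, here is a pointer-free variant that works on plain numbers. It is only a sketch: the function takes the region bounds as integers, and the node_size parameter is an assumption that stands in for mem::size_of::<ListNode>() (16 bytes on a 64-bit target).

```rust
/// Rounds `addr` up to the next multiple of `align` (a power of two).
fn align_up(addr: usize, align: usize) -> usize {
    (addr + align - 1) & !(align - 1)
}

/// Hypothetical numeric version of the region check.
fn alloc_from_region(
    region_start: usize,
    region_size: usize,
    size: usize,
    align: usize,
    node_size: usize,
) -> Result<usize, ()> {
    let region_end = region_start + region_size;
    let alloc_start = align_up(region_start, align);
    let alloc_end = alloc_start.checked_add(size).ok_or(())?;

    if alloc_end > region_end {
        return Err(()); // region too small
    }

    let excess_size = region_end - alloc_end;
    if excess_size > 0 && excess_size < node_size {
        return Err(()); // remainder could not hold a list node
    }

    Ok(alloc_start)
}
```

For a 64-byte region at 0x1000, a 48-byte allocation is accepted because the 16 remaining bytes can hold a node, and a 64-byte allocation is accepted because it fits perfectly. A 56-byte allocation is rejected since the 8 leftover bytes are too small for a node.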

Implementing `GlobalAlloc`

With the fundamental operations provided by the add_free_region and find_region methods, we can now finally implement the GlobalAlloc trait. As with the bump allocator, we don't implement the trait directly for the LinkedListAllocator but only for a wrapped Locked<LinkedListAllocator>. The Locked wrapper adds interior mutability through a spinlock, which allows us to modify the allocator instance even though the alloc and dealloc methods only take &self references.

The implementation looks like this:

// in src/allocator/linked_list.rs

use super::Locked;
use alloc::alloc::{GlobalAlloc, Layout};
use core::ptr;

unsafe impl GlobalAlloc for Locked<LinkedListAllocator> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // perform layout adjustments
        let (size, align) = LinkedListAllocator::size_align(layout);
        let mut allocator = self.lock();

        if let Some((region, alloc_start)) = allocator.find_region(size, align) {
            let alloc_end = alloc_start.checked_add(size).expect("overflow");
            let excess_size = region.end_addr() - alloc_end;
            if excess_size > 0 {
                allocator.add_free_region(alloc_end, excess_size);
            }
            alloc_start as *mut u8
        } else {
            ptr::null_mut()
        }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // perform layout adjustments
        let (size, _) = LinkedListAllocator::size_align(layout);

        self.lock().add_free_region(ptr as usize, size)
    }
}

Let's start with the dealloc method because it is simpler: First, it performs some layout adjustments, which we will explain in a moment. Then, it retrieves a &mut LinkedListAllocator reference by calling the Mutex::lock function on the Locked wrapper. Lastly, it calls the add_free_region function to add the deallocated region to the free list.

The alloc method is a bit more complex. It starts with the same layout adjustments and also calls the Mutex::lock function to receive a mutable allocator reference. Then it uses the find_region method to find a suitable memory region for the allocation and remove it from the list. If this doesn't succeed and None is returned, it returns null_mut to signal an error as there is no suitable memory region.

In the success case, the find_region method returns a tuple of the suitable region (no longer in the list) and the start address of the allocation. Using alloc_start, the allocation size, and the end address of the region, it calculates the end address of the allocation and the excess size again. If the excess size is not zero, it calls add_free_region to add the excess part of the memory region back to the free list. Finally, it returns the alloc_start address cast to a *mut u8 pointer.

Layout Adjustments

So what are these layout adjustments that we make at the beginning of both alloc and dealloc? They ensure that each allocated block is capable of storing a ListNode. This is important because the memory block is going to be deallocated at some point, where we want to write a ListNode to it. If the block is smaller than a ListNode or does not have the correct alignment, undefined behavior can occur.

The layout adjustments are performed by the size_align function, which is defined like this:

// in src/allocator/linked_list.rs

impl LinkedListAllocator {
    /// Adjust the given layout so that the resulting allocated memory
    /// region is also capable of storing a `ListNode`.
    ///
    /// Returns the adjusted size and alignment as a (size, align) tuple.
    fn size_align(layout: Layout) -> (usize, usize) {
        let layout = layout
            .align_to(mem::align_of::<ListNode>())
            .expect("adjusting alignment failed")
            .pad_to_align();
        let size = layout.size().max(mem::size_of::<ListNode>());
        (size, layout.align())
    }
}

First, the function uses the align_to method on the passed Layout to increase the alignment to the alignment of a ListNode if necessary. It then uses the pad_to_align method to round up the size to a multiple of the alignment to ensure that the start address of the next memory block will have the correct alignment for storing a ListNode too. In the second step, it uses the max method to enforce a minimum allocation size of mem::size_of::<ListNode>. This way, the dealloc function can safely write a ListNode to the freed memory block.
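The effect of these adjustments can be checked on the host with the standard library's Layout type. The sketch below copies size_align into a free function and pairs it with a local ListNode definition. Both are stand-ins for the kernel code, and the concrete numbers in the note assume a 64-bit target.

```rust
use std::alloc::Layout;
use std::mem;

#[allow(dead_code)]
struct ListNode {
    size: usize,
    next: Option<&'static mut ListNode>,
}

/// Free-function copy of `LinkedListAllocator::size_align` for experiments.
fn size_align(layout: Layout) -> (usize, usize) {
    let layout = layout
        .align_to(mem::align_of::<ListNode>())
        .expect("adjusting alignment failed")
        .pad_to_align();
    let size = layout.size().max(mem::size_of::<ListNode>());
    (size, layout.align())
}
```

A one-byte layout such as Layout::new::<u8>() is raised to the size and alignment of a ListNode (16 and 8 on a 64-bit target). A layout of 24 bytes with 16-byte alignment keeps its alignment, but its size is padded to 32 bytes.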

Using it

We can now update the ALLOCATOR static in the allocator module to use our new LinkedListAllocator:

// in src/allocator.rs

use linked_list::LinkedListAllocator;

#[global_allocator]
static ALLOCATOR: Locked<LinkedListAllocator> =
    Locked::new(LinkedListAllocator::new());

Since the init function behaves the same for the bump and linked list allocators, we don't need to modify the init call in init_heap.

When we now run our heap_allocation tests again, we see that all tests pass now, including the many_boxes_long_lived test that failed with the bump allocator:

> cargo test --test heap_allocation
simple_allocation... [ok]
large_vec... [ok]
many_boxes... [ok]
many_boxes_long_lived... [ok]

This shows that our linked list allocator is able to reuse freed memory for subsequent allocations.

Discussion

In contrast to the bump allocator, the linked list allocator is much more suitable as a general-purpose allocator, mainly because it is able to directly reuse freed memory. However, it also has some drawbacks. Some of them are only caused by our basic implementation, but there are also fundamental drawbacks of the allocator design itself.

Merging Freed Blocks

The main problem with our implementation is that it only splits the heap into smaller blocks but never merges them back together. Consider this example:

In the first line, three allocations are created on the heap. Two of them are freed again in line 2 and the third is freed in line 3. Now the complete heap is unused again, but it is still split into four individual blocks. At this point, a large allocation might not be possible anymore because none of the four blocks is large enough. Over time, the process continues, and the heap is split into smaller and smaller blocks. At some point, the heap is so fragmented that even normal-sized allocations will fail.

To fix this problem, we need to merge adjacent freed blocks back together. For the above example, this would mean the following:

Like before, two of the three allocations are freed in line 2. Instead of keeping the fragmented heap, we now perform an additional step in line 2a to merge the two rightmost blocks back together. In line 3, the third allocation is freed (like before), resulting in a completely unused heap represented by three distinct blocks. In an additional merging step in line 3a, we then merge the three adjacent blocks back together.

The linked_list_allocator crate implements this merging strategy in the following way: Instead of inserting freed memory blocks at the beginning of the linked list on deallocate, it always keeps the list sorted by start address. This way, merging can be performed directly on the deallocate call by examining the addresses and sizes of the two neighboring blocks in the list. Of course, the deallocation operation is slower this way, but it prevents the heap fragmentation we saw above.
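The idea of merging on a sorted free list can be sketched with a plain Vec of (start, size) pairs. This is only an illustration of the principle under simplified assumptions: it is not how the linked_list_allocator crate is implemented, since the crate stores its list inside the freed memory itself. The insert_and_merge function is a hypothetical name.

```rust
/// Inserts the free block `(start, size)` into `free`, which is kept sorted
/// by start address, and merges it with directly adjacent neighbors.
fn insert_and_merge(free: &mut Vec<(usize, usize)>, start: usize, size: usize) {
    // position of the first block that starts after the new one
    let idx = free.partition_point(|&(s, _)| s < start);
    free.insert(idx, (start, size));

    // merge with the following block if it begins where the new block ends
    if idx + 1 < free.len() && free[idx].0 + free[idx].1 == free[idx + 1].0 {
        free[idx].1 += free[idx + 1].1;
        free.remove(idx + 1);
    }
    // merge with the preceding block if it ends where the new block begins
    if idx > 0 && free[idx - 1].0 + free[idx - 1].1 == free[idx].0 {
        free[idx - 1].1 += free[idx].1;
        free.remove(idx);
    }
}
```

Freeing the blocks (0, 16) and (32, 16) first leaves two separate entries. Freeing (16, 16) afterwards closes the gap, so all three collapse into the single block (0, 48).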

Performance

As we learned above, the bump allocator is extremely fast and can be optimized to just a few assembly operations. The linked list allocator performs much worse in this category. The problem is that an allocation request might need to traverse the complete linked list until it finds a suitable block.

Since the list length depends on the number of unused memory blocks, the performance can vary extremely for different programs. A program that only creates a couple of allocations will experience relatively fast allocation performance. For a program that fragments the heap with many allocations, however, the allocation performance will be very bad because the linked list will be very long and mostly contain very small blocks.

It's worth noting that this performance issue isn't a problem caused by our basic implementation but a fundamental problem of the linked list approach. Since allocation performance can be very important for kernel-level code, we explore a third allocator design in the following that trades improved performance for reduced memory utilization.

Fixed-Size Block Allocator

In the following, we present an allocator design that uses fixed-size memory blocks for fulfilling allocation requests. This way, the allocator often returns blocks that are larger than needed for allocations, which results in wasted memory due to internal fragmentation. On the other hand, it drastically reduces the time required to find a suitable block (compared to the linked list allocator), resulting in much better allocation performance.

Introduction

The idea behind a fixed-size block allocator is the following: Instead of allocating exactly as much memory as requested, we define a small number of block sizes and round up each allocation to the next block size. For example, with block sizes of 16, 64, and 512 bytes, an allocation of 4 bytes would return a 16-byte block, an allocation of 48 bytes a 64-byte block, and an allocation of 128 bytes a 512-byte block.

Like the linked list allocator, we keep track of the unused memory by creating a linked list in the unused memory. However, instead of using a single list with different block sizes, we create a separate list for each size class. Each list then only stores blocks of a single size. For example, with block sizes of 16, 64, and 512, there would be three separate linked lists in memory:

Instead of a single head pointer, we have the three head pointers head_16, head_64, and head_512 that each point to the first unused block of the corresponding size. All nodes in a single list have the same size. For example, the list started by the head_16 pointer only contains 16-byte blocks. This means that we no longer need to store the size in each list node since it is already specified by the name of the head pointer.

Since each element in a list has the same size, each list element is equally suitable for an allocation request. This means that we can very efficiently perform an allocation using the following steps:

Round up the requested allocation size to the next block size. For example, when an allocation of 12 bytes is requested, we would choose the block size of 16 in the above example.
Retrieve the head pointer for the list, e.g., for block size 16, we need to use head_16.
Remove the first block from the list and return it.

Most notably, we can always return the first element of the list and no longer need to traverse the full list. Thus, allocations are much faster than with the linked list allocator.
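The three allocation steps can be modeled in a few lines. In this sketch, each free list is represented by a Vec of block start addresses used as a stack, which is an assumption made to keep the example short, as a real allocator would store the lists inside the free blocks. The block sizes are the example sizes from the text, and the Model type is hypothetical.

```rust
/// Example block sizes from the text.
const BLOCK_SIZES: [usize; 3] = [16, 64, 512];

/// Step 1: index of the smallest block size that fits `size`, if any.
fn list_index(size: usize) -> Option<usize> {
    BLOCK_SIZES.iter().position(|&block| block >= size)
}

/// Hypothetical model: one stack of free block addresses per block size.
struct Model {
    free_lists: [Vec<usize>; 3],
}

impl Model {
    fn alloc(&mut self, size: usize) -> Option<usize> {
        // step 2: select the list for the rounded-up block size
        let index = list_index(size)?;
        // step 3: remove the first block of that list, no traversal needed
        self.free_lists[index].pop()
    }
}
```

A request for 12 bytes maps to index 0 (the 16-byte list) and a request for 48 bytes to index 1. Requests larger than 512 bytes have no matching list in this model, and an empty list simply yields None.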

Block Sizes and Wasted Memory

Depending on the block sizes, we lose a lot of memory by rounding up. For example, when a 512-byte block is returned for a 128-byte allocation, three-quarters of the allocated memory is unused. By defining reasonable block sizes, it is possible to limit the amount of wasted memory to some degree. For example, when using the powers of 2 (4, 8, 16, 32, 64, 128, ...) as block sizes, we can limit the memory waste to half of the allocation size in the worst case and a quarter of the allocation size in the average case.
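The worst-case bound for power-of-2 block sizes is easy to verify numerically. The following sketch uses the standard next_power_of_two method. The wasted_bytes and worst_case_holds helpers are hypothetical, and the minimum block size is ignored for simplicity.

```rust
/// Bytes wasted when `request` is rounded up to the next power of two.
fn wasted_bytes(request: usize) -> usize {
    request.next_power_of_two() - request
}

/// Checks that for every request size from 1 to `max`, less than half of
/// the returned block is wasted.
fn worst_case_holds(max: usize) -> bool {
    (1..=max).all(|r| 2 * wasted_bytes(r) < r.next_power_of_two())
}
```

A request of 65 bytes is close to the worst case: it receives a 128-byte block and wastes 63 bytes, which is just under half of the block. A request of exactly 128 bytes wastes nothing.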

It is also common to optimize block sizes based on common allocation sizes in a program. For example, we could additionally add block size 24 to improve memory usage for programs that often perform allocations of 24 bytes. This way, the amount of wasted memory can often be reduced without losing the performance benefits.

Deallocation

Much like allocation, deallocation is also very performant. It involves the following steps:

Round up the freed allocation size to the next block size. This is required since the compiler only passes the requested allocation size to dealloc, not the size of the block that was returned by alloc. By using the same size-adjustment function in both alloc and dealloc, we can make sure that we always free the correct amount of memory.
Retrieve the head pointer for the list.
Add the freed block to the front of the list by updating the head pointer.

Most notably, no traversal of the list is required for deallocation either. This means that the time required for a dealloc call stays the same regardless of the list length.
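Deallocation can be modeled in the same spirit. The sketch below again represents each free list as a Vec used as a stack, which is a simplification for illustration. The Model type and its methods are hypothetical, and the block sizes are the example sizes from above. Note that dealloc receives the originally requested size and has to round it up itself.

```rust
/// Example block sizes from the text.
const BLOCK_SIZES: [usize; 3] = [16, 64, 512];

/// Index of the smallest block size that fits `size`, if any.
fn list_index(size: usize) -> Option<usize> {
    BLOCK_SIZES.iter().position(|&block| block >= size)
}

/// Hypothetical model: one stack of free block addresses per block size.
struct Model {
    free_lists: [Vec<usize>; 3],
}

impl Model {
    fn new() -> Self {
        Model { free_lists: [Vec::new(), Vec::new(), Vec::new()] }
    }

    /// `size` is the size that was originally requested, not the block size.
    fn dealloc(&mut self, addr: usize, size: usize) {
        // same rounding as on allocation, so the block returns to its list
        let index = list_index(size).expect("no fallback allocator in this sketch");
        // push to the front of the list in constant time
        self.free_lists[index].push(addr);
    }
}
```

Freeing a block that was requested with 12 bytes puts it back on the 16-byte list, so the next small allocation can reuse it right away.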

Fallback Allocator

Given that large allocations (>2 KB) are often rare, especially in operating system kernels, it might make sense to fall back to a different allocator for these allocations. For example, we could fall back to a linked list allocator for allocations greater than 2048 bytes in order to reduce memory waste. Since only very few allocations of that size are expected, the linked list would stay small and the (de)allocations would still be reasonably fast.

Creating new Blocks

Above, we assumed that there are always enough blocks of a specific size in the list to fulfill all allocation requests. However, at some point, the linked list for a given block size becomes empty. At this point, there are two ways we can create new unused blocks of a specific size to fulfill an allocation request:

Allocate a new block from the fallback allocator (if there is one).
Split a larger block from a different list. This works best if block sizes are powers of two. For example, a 32-byte block can be split into two 16-byte blocks.

For our implementation, we will allocate new blocks from the fallback allocator since the implementation is much simpler.

Implementation

Now that we know how a fixed-size block allocator works, we can start our implementation. We won't depend on the implementation of the linked list allocator created in the previous section, so you can follow this part even if you skipped the linked list allocator implementation.

List Node

We start our implementation by creating a ListNode type in a new allocator::fixed_size_block module:

// in src/allocator.rs

pub mod fixed_size_block;

// in src/allocator/fixed_size_block.rs

struct ListNode {
    next: Option<&'static mut ListNode>,
}

This type is similar to the ListNode type of our linked list allocator implementation, with the difference that we don't have a size field. It isn't needed because every block in a list has the same size with the fixed-size block allocator design.

Block Sizes

Next, we define a constant BLOCK_SIZES slice with the block sizes used for our implementation:

// in src/allocator/fixed_size_block.rs

/// The block sizes to use.
///
/// The sizes must each be a power of 2 because they are also used as
/// the block alignment (alignments must always be powers of 2).
const BLOCK_SIZES: &[usize] = &[8, 16, 32, 64, 128, 256, 512, 1024, 2048];

As block sizes, we use powers of 2, starting from 8 up to 2048. We don't define any block sizes smaller than 8 because each block must be capable of storing a 64-bit pointer to the next block when freed. For allocations greater than 2048 bytes, we will fall back to a linked list allocator.

To simplify the implementation, we define the size of a block as its required alignment in memory. So a 16-byte block is always aligned on a 16-byte boundary and a 512-byte block is aligned on a 512-byte boundary. Since alignments always need to be powers of 2, this rules out any other block sizes. If we need block sizes that are not powers of 2 in the future, we can still adjust our implementation for this (e.g., by defining a second BLOCK_ALIGNMENTS array).

The Allocator Type

Using the ListNode type and the BLOCK_SIZES slice, we can now define our allocator type:

// in src/allocator/fixed_size_block.rs

pub struct FixedSizeBlockAllocator {
    list_heads: [Option<&'static mut ListNode>; BLOCK_SIZES.len()],
    fallback_allocator: linked_list_allocator::Heap,
}

The list_heads field is an array of head pointers, one for each block size. This is implemented by using the len() of the BLOCK_SIZES slice as the array length. As a fallback allocator for allocations larger than the largest block size, we use the allocator provided by the linked_list_allocator. We could also use the LinkedListAllocator we implemented ourselves instead, but it has the disadvantage that it does not merge freed blocks.

For constructing a FixedSizeBlockAllocator, we provide the same new and init functions that we implemented for the other allocator types too:

// in src/allocator/fixed_size_block.rs

impl FixedSizeBlockAllocator {
    /// Creates an empty FixedSizeBlockAllocator.
    pub const fn new() -> Self {
        const EMPTY: Option<&'static mut ListNode> = None;
        FixedSizeBlockAllocator {
            list_heads: [EMPTY; BLOCK_SIZES.len()],
            fallback_allocator: linked_list_allocator::Heap::empty(),
        }
    }

    /// Initialize the allocator with the given heap bounds.
    ///
    /// This function is unsafe because the caller must guarantee that the given
    /// heap bounds are valid and that the heap is unused. This method must be
    /// called only once.
    pub unsafe fn init(&mut self, heap_start: usize, heap_size: usize) {
        self.fallback_allocator.init(heap_start, heap_size);
    }
}

The new function just initializes the list_heads array with empty nodes and creates an empty linked list allocator as fallback_allocator. The EMPTY constant is needed to tell the Rust compiler that we want to initialize the array with a constant value. Initializing the array directly as [None; BLOCK_SIZES.len()] does not work, because then the compiler requires Option<&'static mut ListNode> to implement the Copy trait, which it does not. This is a current limitation of the Rust compiler, which might go away in the future.
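The EMPTY pattern can be tried outside the kernel as well. The sketch below is an assumption-laden stand-in: it uses a hypothetical `Node` type with `Box` instead of `&'static mut ListNode`, so that it compiles on a hosted target without unstable features, and a hypothetical `LIST_COUNT` constant in place of `BLOCK_SIZES.len()`:

```rust
/// Stand-in for the kernel's `ListNode`; `Box` replaces `&'static mut`
/// so that the example compiles on a hosted target.
struct Node {
    next: Option<Box<Node>>,
}

const LIST_COUNT: usize = 9;

fn empty_heads() -> [Option<Box<Node>>; LIST_COUNT] {
    // `[None; LIST_COUNT]` would be rejected here because
    // `Option<Box<Node>>` does not implement `Copy`.
    // Repeating a named constant is accepted for non-`Copy` types.
    const EMPTY: Option<Box<Node>> = None;
    [EMPTY; LIST_COUNT]
}
```

The array returned by `empty_heads` contains nine `None` entries, which mirrors the initial state of list_heads.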

If you haven’t done so already for the LinkedListAllocator implementation, you also need to add #![feature(const_mut_refs)] to the top of your lib.rs. The reason is that any use of mutable reference types in const functions is still unstable, including the Option<&'static mut ListNode> array element type of the list_heads field (even if we set it to None).

The unsafe init function only calls the init function of the fallback_allocator without doing any additional initialization of the list_heads array. Instead, we will initialize the lists lazily on alloc and dealloc calls.

For convenience, we also create a private fallback_alloc method that allocates using the fallback_allocator:

// in src/allocator/fixed_size_block.rs

use alloc::alloc::Layout;
use core::ptr;

impl FixedSizeBlockAllocator {
    /// Allocates using the fallback allocator.
    fn fallback_alloc(&mut self, layout: Layout) -> *mut u8 {
        match self.fallback_allocator.allocate_first_fit(layout) {
            Ok(ptr) => ptr.as_ptr(),
            Err(_) => ptr::null_mut(),
        }
    }
}

The Heap type of the linked_list_allocator crate does not implement GlobalAlloc (as it’s not possible without locking). Instead, it provides an allocate_first_fit method that has a slightly different interface. Rather than returning a *mut u8 and using a null pointer to signal an error, it returns a Result<NonNull<u8>, ()>. The NonNull type is an abstraction for a raw pointer that is guaranteed to not be a null pointer. By mapping the Ok case to the NonNull::as_ptr method and the Err case to a null pointer, we can easily translate this back to a *mut u8 type.
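The translation between the two conventions can be looked at in isolation. This is a minimal sketch for a hosted target; `to_raw` is a hypothetical helper name, and the `Result<NonNull<u8>, ()>` parameter only imitates the return type described above:

```rust
use std::ptr::{self, NonNull};

/// Translates the `Result` style described above into the
/// null-pointer convention that `GlobalAlloc::alloc` expects.
fn to_raw(result: Result<NonNull<u8>, ()>) -> *mut u8 {
    match result {
        // success: unwrap the guaranteed non-null pointer
        Ok(ptr) => ptr.as_ptr(),
        // failure: signal the error through a null pointer
        Err(_) => ptr::null_mut(),
    }
}
```

Passing an `Err(())` yields a null pointer, while an `Ok` value is returned unchanged as a raw pointer.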

Calculating the List Index

Before we implement the GlobalAlloc trait, we define a list_index helper function that returns the index of the smallest block size that fits a given Layout:

// in src/allocator/fixed_size_block.rs

/// Choose an appropriate block size for the given layout.
///
/// Returns an index into the `BLOCK_SIZES` array.
fn list_index(layout: &Layout) -> Option<usize> {
    let required_block_size = layout.size().max(layout.align());
    BLOCK_SIZES.iter().position(|&s| s >= required_block_size)
}

The block must have at least the size and alignment required by the given Layout. Since we defined that the block size is also its alignment, this means that the required_block_size is the maximum of the layout’s size() and align() attributes. To find the next-larger block in the BLOCK_SIZES slice, we first use the iter() method to get an iterator and then the position() method to find the index of the first block that is at least as large as the required_block_size.

Note that we don’t return the block size itself, but the index into the BLOCK_SIZES slice. The reason is that we want to use the returned index as an index into the list_heads array.
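To see how size and alignment interact, the function can be run on a hosted target with a few sample layouts. The sketch below repeats list_index and BLOCK_SIZES as defined above, uses `std::alloc::Layout` instead of the kernel's `alloc::alloc::Layout`, and adds a hypothetical `layout` helper for building test inputs:

```rust
use std::alloc::Layout;

const BLOCK_SIZES: &[usize] = &[8, 16, 32, 64, 128, 256, 512, 1024, 2048];

fn list_index(layout: &Layout) -> Option<usize> {
    let required_block_size = layout.size().max(layout.align());
    BLOCK_SIZES.iter().position(|&s| s >= required_block_size)
}

/// Hypothetical helper for the examples: builds a layout or panics on invalid input.
fn layout(size: usize, align: usize) -> Layout {
    Layout::from_size_align(size, align).expect("align must be a power of 2")
}
```

A 12-byte layout with 4-byte alignment maps to index 1 (the 16-byte list), whereas a 4-byte layout with 64-byte alignment maps to index 3 (the 64-byte list) because the alignment dominates. A 2049-byte layout yields None and is therefore handled by the fallback allocator.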

Implementing `GlobalAlloc`

The last step is to implement the GlobalAlloc trait:

// in src/allocator/fixed_size_block.rs

use super::Locked;
use alloc::alloc::GlobalAlloc;

unsafe impl GlobalAlloc for Locked<FixedSizeBlockAllocator> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        todo!();
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        todo!();
    }
}

Like for the other allocators, we don’t implement the GlobalAlloc trait directly for our allocator type, but use the Locked wrapper to add synchronized interior mutability. Since the alloc and dealloc implementations are relatively large, we introduce them one by one in the following.

`alloc`

The implementation of the alloc method looks like this:

// in `impl` block in src/allocator/fixed_size_block.rs

unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    let mut allocator = self.lock();
    match list_index(&layout) {
        Some(index) => {
            match allocator.list_heads[index].take() {
                Some(node) => {
                    allocator.list_heads[index] = node.next.take();
                    node as *mut ListNode as *mut u8
                }
                None => {
                    // no block exists in list => allocate new block
                    let block_size = BLOCK_SIZES[index];
                    // only works if all block sizes are a power of 2
                    let block_align = block_size;
                    let layout = Layout::from_size_align(block_size, block_align)
                        .unwrap();
                    allocator.fallback_alloc(layout)
                }
            }
        }
        None => allocator.fallback_alloc(layout),
    }
}

Let’s go through it step by step:

First, we use the Locked::lock method to get a mutable reference to the wrapped allocator instance. Next, we call the list_index function we just defined to calculate the appropriate block size for the given layout and get the corresponding index into the list_heads array. If this index is None, no block size fits the allocation, so we allocate from the fallback_allocator through the fallback_alloc function.

If the list index is Some, we try to remove the first node in the corresponding list started by list_heads[index] using the Option::take method. If the list is not empty, we enter the Some(node) branch of the match statement, where we point the head pointer of the list to the successor of the popped node (by using take again). Finally, we return the popped node pointer as a *mut u8.

If the list head is None, it indicates that the list of blocks is empty. This means that we need to construct a new block as described above. For that, we first get the current block size from the BLOCK_SIZES slice and use it as both the size and the alignment for the new block. Then we create a new Layout from it and call the fallback_alloc method to perform the allocation. The reason for adjusting the size and alignment of the layout is that the block will be added to the block list on deallocation.

`dealloc`

The implementation of the dealloc method looks like this:

// in src/allocator/fixed_size_block.rs

use core::{mem, ptr::NonNull};

// inside the `unsafe impl GlobalAlloc` block

unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
    let mut allocator = self.lock();
    match list_index(&layout) {
        Some(index) => {
            let new_node = ListNode {
                next: allocator.list_heads[index].take(),
            };
            // verify that block has size and alignment required for storing node
            assert!(mem::size_of::<ListNode>() <= BLOCK_SIZES[index]);
            assert!(mem::align_of::<ListNode>() <= BLOCK_SIZES[index]);
            let new_node_ptr = ptr as *mut ListNode;
            new_node_ptr.write(new_node);
            allocator.list_heads[index] = Some(&mut *new_node_ptr);
        }
        None => {
            let ptr = NonNull::new(ptr).unwrap();
            allocator.fallback_allocator.deallocate(ptr, layout);
        }
    }
}

Like in alloc, we first use the lock method to get a mutable allocator reference and then the list_index function to get the block list corresponding to the given Layout. If the index is None, no fitting block size exists in BLOCK_SIZES, which indicates that the allocation was created by the fallback allocator. Therefore, we use its deallocate to free the memory again. The method expects a NonNull instead of a *mut u8, so we need to convert the pointer first. (The unwrap call only fails when the pointer is null, which should never happen when the compiler calls dealloc.)

If list_index returns a block index, we need to add the freed memory block to the list. For that, we first create a new ListNode that points to the current list head (by using Option::take again). Before we write the new node into the freed memory block, we first assert that the current block size specified by index has the required size and alignment for storing a ListNode. Then we perform the write by converting the given *mut u8 pointer to a *mut ListNode pointer and then calling the unsafe write method on it. The last step is to set the head pointer of the list, which is currently None since we called take on it, to our newly written ListNode. For that, we convert the raw new_node_ptr to a mutable reference.

There are a few things worth noting:

We don’t differentiate between blocks allocated from a block list and blocks allocated from the fallback allocator. This means that new blocks created in alloc are added to the block list on dealloc, thereby increasing the number of blocks of that size.
The alloc method is the only place where new blocks are created in our implementation. This means that we initially start with empty block lists and only fill these lists lazily when allocations of their block size are performed.
We don’t need unsafe blocks in alloc and dealloc, even though we perform some unsafe operations. The reason is that Rust currently treats the complete body of unsafe functions as one large unsafe block. Since using explicit unsafe blocks has the advantage that it’s obvious which operations are unsafe and which are not, there is a proposed RFC to change this behavior.
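The lazy filling of the block lists can be illustrated with a small model that runs on a hosted target. This is only a sketch of the bookkeeping, not the kernel implementation: the hypothetical `Model` type stores addresses as plain numbers, uses a `Vec` per block size instead of an intrusive linked list, and replaces the fallback allocator with a bump counter:

```rust
const BLOCK_SIZES: &[usize] = &[8, 16, 32, 64, 128, 256, 512, 1024, 2048];

/// Hosted model of the allocator's bookkeeping. Addresses are plain numbers,
/// and a bump counter stands in for the fallback allocator.
struct Model {
    free_lists: Vec<Vec<usize>>, // one stack of freed block addresses per block size
    next_fresh: usize,           // next unused address of the simulated heap
    fresh_blocks: usize,         // number of blocks created through the "fallback"
}

impl Model {
    fn new() -> Self {
        Model {
            free_lists: vec![Vec::new(); BLOCK_SIZES.len()],
            next_fresh: 0,
            fresh_blocks: 0,
        }
    }

    /// Returns `None` for sizes above the largest block size; the real
    /// allocator forwards those requests to the fallback allocator instead.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let index = BLOCK_SIZES.iter().position(|&s| s >= size)?;
        if let Some(addr) = self.free_lists[index].pop() {
            return Some(addr); // reuse a freed block
        }
        // list is empty => create a new block that is aligned to its own size
        let block_size = BLOCK_SIZES[index];
        let addr = (self.next_fresh + block_size - 1) & !(block_size - 1);
        self.next_fresh = addr + block_size;
        self.fresh_blocks += 1;
        Some(addr)
    }

    fn dealloc(&mut self, addr: usize, size: usize) {
        if let Some(index) = BLOCK_SIZES.iter().position(|&s| s >= size) {
            self.free_lists[index].push(addr); // the block joins its list
        }
    }
}
```

In this model, allocating 10 bytes creates one fresh 16-byte block. After freeing it, a subsequent 16-byte allocation receives the same address and no additional block is created, which corresponds to the first two observations in the list above.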

Using it

To use our new FixedSizeBlockAllocator, we need to update the ALLOCATOR static in the allocator module:

// in src/allocator.rs

use fixed_size_block::FixedSizeBlockAllocator;

#[global_allocator]
static ALLOCATOR: Locked<FixedSizeBlockAllocator> = Locked::new(
    FixedSizeBlockAllocator::new());

Since the init function behaves the same for all allocators we implemented, we don’t need to modify the init call in init_heap.

When we now run our heap_allocation tests again, all tests should still pass:

> cargo test --test heap_allocation
simple_allocation... [ok]
large_vec... [ok]
many_boxes... [ok]
many_boxes_long_lived... [ok]

Our new allocator seems to work!

Discussion

While the fixed-size block approach has much better performance than the linked list approach, it wastes up to half of the memory when using powers of 2 as block sizes. Whether this tradeoff is worth it heavily depends on the application type. For an operating system kernel, where performance is critical, the fixed-size block approach seems to be the better choice.
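The "up to half" figure follows directly from rounding up to the next power of 2. The sketch below computes the loss for a single request; `wasted_bytes` is a hypothetical helper introduced only for this illustration:

```rust
const BLOCK_SIZES: &[usize] = &[8, 16, 32, 64, 128, 256, 512, 1024, 2048];

/// Hypothetical helper: bytes lost to internal fragmentation for a request
/// of `size` bytes, or `None` if the request would go to the fallback allocator.
fn wasted_bytes(size: usize) -> Option<usize> {
    let block = BLOCK_SIZES.iter().copied().find(|&s| s >= size)?;
    Some(block - size)
}
```

A 64-byte request wastes nothing, while a 65-byte request is placed in a 128-byte block and wastes 63 bytes, which is just under half of the block. This worst case occurs whenever a request is one byte larger than a block size.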

On the implementation side, there are various things that we could improve in our current implementation:

Instead of only allocating blocks lazily using the fallback allocator, it might be better to pre-fill the lists to improve the performance of initial allocations.
To simplify the implementation, we only allowed block sizes that are powers of 2 so that we could also use them as the block alignment. By storing (or calculating) the alignment in a different way, we could also allow arbitrary other block sizes. This way, we could add more block sizes, e.g., for common allocation sizes, in order to minimize the wasted memory.
We currently only create new blocks, but never free them again. This results in fragmentation and might eventually result in allocation failure for large allocations. It might make sense to enforce a maximum list length for each block size. When the maximum length is reached, subsequent deallocations are freed using the fallback allocator instead of being added to the list.
Instead of falling back to a linked list allocator, we could have a special allocator for allocations greater than 4 KiB. The idea is to utilize paging, which operates on 4 KiB pages, to map a continuous block of virtual memory to non-continuous physical frames. This way, fragmentation of unused memory is no longer a problem for large allocations.
With such a page allocator, it might make sense to add block sizes up to 4 KiB and drop the linked list allocator completely. The main advantages of this would be reduced fragmentation and improved performance predictability, i.e., better worst-case performance.

It’s important to note that the implementation improvements outlined above are only suggestions. Allocators used in operating system kernels are typically highly optimized for the specific workload of the kernel, which is only possible through extensive profiling.

Variations

There are also many variations of the fixed-size block allocator design. Two popular examples are the slab allocator and the buddy allocator, which are also used in popular kernels such as Linux. In the following, we give a short introduction to these two designs.

Slab Allocator

The idea behind a slab allocator is to use block sizes that directly correspond to selected types in the kernel. This way, allocations of those types fit a block size exactly and no memory is wasted. Sometimes, it might even be possible to preinitialize type instances in unused blocks to further improve performance.

Slab allocation is often combined with other allocators. For example, it can be used together with a fixed-size block allocator to further split an allocated block in order to reduce memory waste. It is also often used to implement an object pool pattern on top of a single large allocation.
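The object pool idea can be sketched in a few lines. The following is a simplified illustration for a hosted target, not how production slab allocators are built: the hypothetical `Slab<T>` type keeps all slots in one `Vec` that is allocated once and tracks unused slots by index:

```rust
/// Sketch of an object pool with equally sized slots for one type `T`.
struct Slab<T> {
    slots: Vec<Option<T>>, // backing storage, allocated once
    free: Vec<usize>,      // indices of unused slots
}

impl<T> Slab<T> {
    fn with_capacity(capacity: usize) -> Self {
        Slab {
            slots: (0..capacity).map(|_| None).collect(),
            free: (0..capacity).rev().collect(),
        }
    }

    /// Stores `value` in a free slot and returns its index, or `None` if full.
    fn insert(&mut self, value: T) -> Option<usize> {
        let index = self.free.pop()?;
        self.slots[index] = Some(value);
        Some(index)
    }

    /// Removes the value at `index` and marks the slot as free again.
    fn remove(&mut self, index: usize) -> Option<T> {
        let value = self.slots.get_mut(index)?.take()?;
        self.free.push(index);
        Some(value)
    }
}
```

Because every slot has exactly the size of `T`, no space is lost to rounding, and a removed slot is immediately available for the next insert.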

Buddy Allocator

Instead of using a linked list to manage freed blocks, the buddy allocator design uses a binary tree data structure together with power-of-2 block sizes. When a new block of a certain size is required, it splits a larger sized block into two halves, thereby creating two child nodes in the tree. Whenever a block is freed again, its neighbor block in the tree is analyzed. If the neighbor is also free, the two blocks are joined back together to form a block of twice the size.
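One common way to locate the neighbor block is address arithmetic. This is an assumption for illustration, since the description above does not prescribe a concrete layout: if block offsets are measured from the heap start and every block is aligned to its own power-of-2 size, a block and its buddy differ in exactly one bit. The helpers `buddy_of` and `merged_offset` are hypothetical names:

```rust
/// Offset of the buddy of the block at `offset`, where both values are
/// relative to the heap start and `size` is a power of 2.
fn buddy_of(offset: usize, size: usize) -> usize {
    debug_assert!(size.is_power_of_two() && offset % size == 0);
    offset ^ size
}

/// Offset of the merged block that results from joining a block with its buddy.
fn merged_offset(offset: usize, size: usize) -> usize {
    offset & !size
}
```

For 64-byte blocks, the block at offset 128 has its buddy at offset 192, and merging either of them produces a 128-byte block at offset 128.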

The advantage of this merge process is that external fragmentation is reduced so that small freed blocks can be reused for a large allocation. It also does not use a fallback allocator, so the performance is more predictable. The biggest drawback is that only power-of-2 block sizes are possible, which might result in a large amount of wasted memory due to internal fragmentation. For this reason, buddy allocators are often combined with a slab allocator to further split an allocated block into multiple smaller blocks.

Summary

This post gave an overview of different allocator designs. We learned how to implement a basic bump allocator, which hands out memory linearly by increasing a single next pointer. While bump allocation is very fast, it can only reuse memory after all allocations have been freed. For this reason, it is rarely used as a global allocator.

Next, we created a linked list allocator that uses the freed memory blocks itself to create a linked list, the so-called free list. This list makes it possible to store an arbitrary number of freed blocks of different sizes. While no memory waste occurs, the approach suffers from poor performance because an allocation request might require a complete traversal of the list. Our implementation also suffers from external fragmentation because it does not merge adjacent freed blocks back together.

To fix the performance problems of the linked list approach, we created a fixed-size block allocator that predefines a fixed set of block sizes. For each block size, a separate free list exists so that allocations and deallocations only need to insert/pop at the front of the list and are thus very fast. Since each allocation is rounded up to the next larger block size, some memory is wasted due to internal fragmentation.

There are many more allocator designs with different tradeoffs. Slab allocation works well to optimize the allocation of common fixed-size structures, but is not applicable in all situations. Buddy allocation uses a binary tree to merge freed blocks back together, but wastes a large amount of memory because it only supports power-of-2 block sizes. It’s also important to remember that each kernel implementation has a unique workload, so there is no “best” allocator design that fits all cases.

What’s next?

With this post, we conclude our memory management implementation for now. Next, we will start exploring multitasking, starting with cooperative multitasking in the form of async/await. In subsequent posts, we will then explore threads, multiprocessing, and processes.

Updates in December 2019

Tue, 07 Jan 2020 00:00:00 +0000

Happy New Year!

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the corresponding libraries and tools.

`blog_os`

The repository of the Writing an OS in Rust blog received the following updates:

Update x86_64 dependency to version 0.8.1. This included the dependency update itself, an update of the frame allocation code, and an update of the blog.
License the blog/content folder under CC BY-NC
Reword sentence in first post by @pamolloy

Further, we’re still working on adding Experimental Support for Community Translations to the blog, starting with Simplified Chinese and Traditional Chinese. Any help is appreciated!

`bootloader`

There were no updates to the bootloader this month.

I’m currently working on rewriting the 16-bit/32-bit stages in Rust and making the bootloader more modular in the process. This should make it much easier to add support for UEFI and GRUB booting later.

`bootimage`

There were no updates to the bootimage tool this month.

`x86_64`

We landed a number of breaking changes this month:

These changes were released as version 0.8.0. Unfortunately, there was a missing re-export for the new UnusedPhysFrame type. We fixed it in #110 and released the fix as version 0.8.1.

There was one more addition to the x86_64 crate afterwards:

Add support for cr4 control register (with complete documentation) by @KarimAllah (released as version 0.8.2).

There were also a few changes related to continuous integration:

`cargo-xbuild`

The cargo-xbuild crate, which cross-compiles the sysroot, received the following updates this month:

Add --quiet flag that suppresses “waiting for file lock” message by @Nils-TUD (published as version 0.5.19)
Fix wrong feature name for memcpy=false (released as version 0.5.20)

Updates in October and November 2019

Mon, 02 Dec 2019 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the used libraries and tools.

I moved to a new apartment mid-October and had lots of work to do there, so I didn’t have the time for creating the October status update post. Therefore, this post lists the changes from both October and November. I’m slowly picking up speed again, but I still have a lot of mails in my backlog. Sorry if you haven’t received an answer yet!

`blog_os`

The blog itself received only a minor update: Use panic! instead of println! + loop in double fault handler. This fixes an issue where a double fault during cargo xtest leads to an endless loop without any output on the serial port.

We also have other news: We plan to add Experimental Support for Community Translations to the blog. While this imposes additional challenges, it makes the content accessible to people who don’t speak English, so it’s definitely worth trying in my opinion. The first additional language will be Chinese, based on an existing translation by @luojia65. Many thanks also to @TheBegining and @Rustin-Liu for helping with the translation!

`bootloader`

Change the way the kernel entry point is called to honor alignment ABI by @GuillaumeDIDIER (published as version 0.8.2)
Add support for GitHub Actions
Remove unnecessary extern C on panic handler to fix not-ffi-safe warning by @cmsd2 (published as version 0.8.3)

`bootimage`

Don’t exit with expected exit code when failed to read QEMU exit code

`x86_64`

Switch to GitHub Actions for CI
Use repr C to suppress not-ffi-safe when used with extern handler functions by @cmsd2 (published as version 0.7.6)
Add slice and slice_mut methods to IDT by @foxcob (published as version 0.7.7)

`cargo-xbuild`

Add support for publishing and installing cross compiled crates by @ALSchwalm (published as version 0.5.18)

Updates in September 2019

Sun, 06 Oct 2019 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the used libraries and tools.

I finished my master’s thesis and got my degree this month, so I only had limited time for my open source work. I still managed to perform a few minor updates, including code simplifications for the Paging Implementation post and the evaluation of GitHub Actions as a CI service.

`blog_os`

Improve Paging Implementation Post: Improves and simplifies the code in multiple places
Use GitHub Actions to build and deploy blog
Set up GitHub Actions for post-XX branches: post-01, post-02, post-04
Update to bootloader 0.8.0: Considerably reduces compile times
Update to Zola 0.9.0: Updates the used static site generator to the latest version

`cargo-xbuild`

Print a warning when building for the host target

`bootloader`

Add a Cargo Feature for Enabling SSE

`uart_16550`

`x86_64`

No updates were merged in September. However, I’m planning some breaking changes for the crate, namely:

Updates in August 2019

Mon, 09 Sep 2019 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the used libraries and tools.

I was very busy with finishing my master’s thesis, so I didn’t have any time to implement notable changes myself. Thanks to contributions by @vinaychandra and @64, we were still able to publish new versions of the x86_64, bootimage and bootloader crates.

`blog_os`

Apart from rewriting the section about no-harness tests of the Testing post, there were no notable changes to the blog in August. Now that I have some more free time again, I plan to upgrade the blog to the latest versions of bootloader and bootimage, evaluate the use of GitHub Actions for the repository, and continue the work on the upcoming post about heap allocator implementations.

`x86_64`

Thanks to @vinaychandra, the x86_64 crate now has support for the FsBase and GsBase registers. The change was published as version 0.7.5.

`bootimage`

To allow bootloaders to read configuration from the Cargo.toml file of the kernel, the bootimage crate now passes the location of the kernel’s Cargo.toml to bootloader crates. This change was implemented by @64 and published as version 0.7.7.

`bootloader`

Apart from initializing the CPU and loading the kernel, the bootloader crate is also responsible for creating several memory regions for the kernel, for example a program stack and the boot information struct. These regions must be mapped at some address in the virtual address space.

As a stop-gap solution, the bootloader crate used fixed virtual addresses for these regions, which resulted in errors if the kernel tried to use the same address ranges itself. For example, the (optional) recursive mapping of page tables often conflicted with so-called higher half kernels, which live at the upper end of the address space. To avoid these conflicts, @64 updated the bootloader crate to dynamically map the kernel stack, boot info, physical memory, and recursive table regions at an unused virtual address range.

To also support specifying explicit addresses for these regions, @64 further added support for parsing bootloader configuration from the kernel’s Cargo.toml. This way, the virtual addresses of the kernel stack and physical memory mapping can now be configured using a package.metadata.bootloader key in the Cargo.toml of the kernel. In a third pull request, @64 also made the kernel stack size configurable.

The changes were published together as version 0.8.0. This is a breaking update because the new configuration system requires at least version 0.7.7 of bootimage, which is the first version that passes the location of the kernel’s Cargo.toml file.

Updates in July 2019

Fri, 02 Aug 2019 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the used libraries and tools.

Since I’m still very busy with my master’s thesis, I haven’t had the time to work on a new post. But there were quite a few maintenance updates this month and also a few new features such as the new OffsetPageTable type in the x86_64 crate.

We also had some great contributions this month. Thanks to the efforts of @64, we were able to considerably lower the compile times of the x86_64 and bootloader crates. Thanks to @Aehmlo, the cargo-xbuild crate now has a cargo xdoc subcommand and support for the cargo {c, b, t, r} aliases.

The following list gives a short overview of notable changes to the different projects.

blog_os

x86_64

Reexport MappedPageTable on non-x86_64 platforms too
Update GDT docs, add user_data_segment function and WRITABLE flag by @64 (published as version 0.7.2)
Add a new OffsetPageTable mapper type (published as version 0.7.3)
Update integration tests to use new testing framework
Remove raw-cpuid dependency and use rdrand intrinsics by @64 (published as version 0.7.4)

bootloader

Remove stabilized publish-lockfile feature (published as version 0.6.2)
Update CI badge, use latest version of x86_64 crate and rustfmt by @64 (published as version 0.6.3)
Use volatile accesses in VGA code and make font dependency optional by @64
- Making the dependency optional should improve compile times when the VGA text mode is used
- Published as version 0.6.4
Breaking: Only include dependencies when binary feature is enabled (published as version 0.7.0)

bootimage

If the bootloader has a feature named binary, enable it (published as version 0.7.6)
- This is required for building bootloader 0.7.0 or later

cargo-xbuild

Add cargo xdoc command for invoking cargo doc by @Aehmlo (published as version 0.5.13)
Don’t append a --sysroot argument to RUSTFLAGS if it already contains one (published as version 0.5.14)
Add xb, xt, xc, and xr subcommands by @Aehmlo (published as version 0.5.15)

Updates in June 2019

Sat, 06 Jul 2019 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and the used libraries and tools.

My focus this month was to finish the Heap Allocation post, on which I had been working since March. I originally wanted to include a section about different allocator designs (bump, linked list, slab, …) and how to implement them, but I decided to split it out into a separate post because it became much too long. I will try to release this half-done post soon.

Apart from the new post, there were some minor updates to the x86_64, bootloader and cargo-xbuild crates. The following gives a short overview of notable changes to the different projects.

blog_os

x86_64

Add ring-3 flag to GDT descriptor by @mark-i-m (released as version 0.7.1)
Add bochs magic breakpoint, read instruction pointer, inline instructions by @64

bootloader

cargo-xbuild

Heap Allocation

Wed, 26 Jun 2019 00:00:00 +0000

This post adds support for heap allocation to our kernel. First, it gives an introduction to dynamic memory and shows how the borrow checker prevents common allocation errors. It then implements the basic allocation interface of Rust, creates a heap memory region, and sets up an allocator crate. At the end of this post, all the allocation and collection types of the built-in alloc crate will be available to our kernel.

Local and Static Variables

We currently use two types of variables in our kernel: local variables and static variables. Local variables are stored on the call stack and are only valid until the surrounding function returns. Static variables are stored at a fixed memory location and always live for the complete lifetime of the program.

Local Variables

Local variables are stored on the call stack, which is a stack data structure that supports push and pop operations. On each function entry, the parameters, the return address, and the local variables of the called function are pushed by the compiler:

The above example shows the call stack after the outer function called the inner function. We see that the call stack contains the local variables of outer first. On the inner call, the parameter 1 and the return address for the function were pushed. Then control was transferred to inner, which pushed its local variables.

After the inner function returns, its part of the call stack is popped again and only the local variables of outer remain:

We see that the local variables of inner only live until the function returns. The Rust compiler enforces these lifetimes and throws an error when we use a value for too long, for example when we try to return a reference to a local variable:

fn inner(i: usize) -> &'static u32 {
    let z = [1, 2, 3];
    &z[i]
}

(run the example on the playground)

While returning a reference makes no sense in this example, there are cases where we want a variable to live longer than the function. We already saw such a case in our kernel when we tried to load an interrupt descriptor table and had to use a static variable to extend the lifetime.

Static Variables

Static variables are stored at a fixed memory location separate from the stack. This memory location is assigned at compile time by the linker and encoded in the executable. Statics live for the complete runtime of the program, so they have the 'static lifetime and can always be referenced from local variables:

When the inner function returns in the above example, its part of the call stack is destroyed. The static variables live in a separate memory range that is never destroyed, so the &Z[1] reference is still valid after the return.
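In code, a static counterpart of the failing inner function from the previous section could look as follows. This is a minimal sketch; the names `Z` and `inner` mirror the ones used in the text:

```rust
static Z: [u32; 3] = [1, 2, 3];

fn inner(i: usize) -> &'static u32 {
    // `Z` lives for the whole program, so the reference may outlive this call
    &Z[i]
}
```

In contrast to the version with a local array, this function compiles because the returned reference points into memory that is never destroyed.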

Apart from the 'static lifetime, static variables also have the useful property that their location is known at compile time, so that no reference is needed for accessing them. We utilized that property for our println macro: By using a static Writer internally, there is no &mut Writer reference needed to invoke the macro, which is very useful in exception handlers, where we don’t have access to any additional variables.

However, this property of static variables brings a crucial drawback: they are read-only by default. Rust enforces this because a data race would occur if, e.g., two threads modified a static variable at the same time. The only way to modify a static variable is to encapsulate it in a Mutex type, which ensures that only a single &mut reference exists at any point in time. We already used a Mutex for our static VGA buffer Writer.
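The Mutex pattern can be sketched on a hosted target as follows. Note the assumption: the example uses `std::sync::Mutex`, which is not available in our kernel, where a spinlock-based mutex takes its place. The `COUNTER` static and the `increment` function are hypothetical names:

```rust
use std::sync::Mutex;

// A plain `static COUNTER: u32` would be read-only. Wrapping the value in a
// mutex allows mutation while guaranteeing exclusive access.
static COUNTER: Mutex<u32> = Mutex::new(0);

fn increment() -> u32 {
    // the guard gives us the only `&mut` access until it is dropped
    let mut guard = COUNTER.lock().unwrap();
    *guard += 1;
    *guard
}
```

Each call to `increment` locks the mutex, modifies the value, and releases the lock again when the guard goes out of scope.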

Dynamic Memory

Local and static variables are already very powerful together and enable most use cases. However, we saw that they both have their limitations:

Local variables only live until the end of the surrounding function or block. This is because they live on the call stack and are destroyed after the surrounding function returns.
Static variables always live for the complete runtime of the program, so there is no way to reclaim and reuse their memory when they’re no longer needed. Also, they have unclear ownership semantics and are accessible from all functions, so they need to be protected by a Mutex when we want to modify them.

Another limitation of local and static variables is that they have a fixed size. So they can’t store a collection that dynamically grows when more elements are added. (There are proposals for unsized rvalues in Rust that would allow dynamically sized local variables, but they only work in some specific cases.)

To circumvent these drawbacks, programming languages often support a third memory region for storing variables called the heap. The heap supports dynamic memory allocation at runtime through two functions called allocate and deallocate. It works in the following way: The allocate function returns a free chunk of memory of the specified size that can be used to store a variable. This variable then lives until it is freed by calling the deallocate function with a reference to the variable.

Let’s go through an example:

Here the inner function uses heap memory instead of static variables for storing z. It first allocates a memory block of the required size, which returns a *mut u32 raw pointer. It then uses the ptr::write method to write the array [1,2,3] to it. In the last step, it uses the offset function to calculate a pointer to the i-th element and then returns it. (Note that we omitted some required casts and unsafe blocks in this example function for brevity.)

The allocated memory lives until it is explicitly freed through a call to deallocate. Thus, the returned pointer is still valid even after inner returned and its part of the call stack was destroyed. The advantage of using heap memory compared to static memory is that the memory can be reused after it is freed, which we do through the deallocate call in outer. After that call, the situation looks like this:

We see that the z[1] slot is free again and can be reused for the next allocate call. However, we also see that z[0] and z[2] are never freed because we never deallocate them. Such a bug is called a memory leak and is often the cause of excessive memory consumption of programs (just imagine what happens when we call inner repeatedly in a loop). This might seem bad, but there are much more dangerous types of bugs that can happen with dynamic allocation.
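
The pseudo-code in this section is simplified. For comparison, here is a sketch of manual allocation with the real `std::alloc` functions on a hosted system (not in our kernel). Note that, unlike in the simplified example, the real `dealloc` function must receive the pointer originally returned by `alloc` together with the same `Layout`, so it is not possible to free only a single array element. The function name `read_from_heap` is made up for this illustration:

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};
use std::ptr;

// Allocates the array [1, 2, 3] on the heap, reads element `i`,
// and frees the complete block again.
fn read_from_heap(i: usize) -> u32 {
    assert!(i < 3, "index out of bounds");
    let layout = Layout::new::<[u32; 3]>();
    unsafe {
        // Request a block that is large enough for three `u32` values.
        let block = alloc(layout) as *mut [u32; 3];
        if block.is_null() {
            handle_alloc_error(layout);
        }
        // Initialize the freshly allocated memory.
        ptr::write(block, [1, 2, 3]);
        // Calculate a pointer to the i-th element and read it.
        let element = (block as *mut u32).add(i);
        let value = ptr::read(element);
        // Free the whole block; forgetting this call would leak the memory.
        dealloc(block as *mut u8, layout);
        value
    }
}

fn main() {
    println!("z[1] = {}", read_from_heap(1));
}
```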

Common Errors

Apart from memory leaks, which are unfortunate but don’t make the program vulnerable to attackers, there are two common types of bugs with more severe consequences:

When we accidentally continue to use a variable after calling deallocate on it, we have a so-called use-after-free vulnerability. Such a bug causes undefined behavior and can often be exploited by attackers to execute arbitrary code.
When we accidentally free a variable twice, we have a double-free vulnerability. This is problematic because it might free a different allocation that was allocated in the same spot after the first deallocate call. Thus, it can lead to a use-after-free vulnerability again.

These types of vulnerabilities are commonly known, so one might expect that people have learned how to avoid them by now. But no, such vulnerabilities are still regularly found, for example this use-after-free vulnerability in Linux (2019) that allowed arbitrary code execution. A web search like use-after-free linux {current year} will probably always yield results. This shows that even the best programmers are not always able to correctly handle dynamic memory in complex projects.

To avoid these issues, many languages, such as Java or Python, manage dynamic memory automatically using a technique called garbage collection. The idea is that the programmer never invokes deallocate manually. Instead, the program is regularly paused and scanned for unused heap variables, which are then automatically deallocated. Thus, the above vulnerabilities can never occur. The drawbacks are the performance overhead of the regular scan and the potentially long pause times.

Rust takes a different approach to the problem: It uses a concept called ownership that is able to check the correctness of dynamic memory operations at compile time. Thus, no garbage collection is needed to avoid the mentioned vulnerabilities, which means that there is no performance overhead. Another advantage of this approach is that the programmer still has fine-grained control over the use of dynamic memory, just like with C or C++.

Allocations in Rust

Instead of letting the programmer manually call allocate and deallocate, the Rust standard library provides abstraction types that call these functions implicitly. The most important type is Box, which is an abstraction for a heap-allocated value. It provides a Box::new constructor function that takes a value, calls allocate with the size of the value, and then moves the value to the newly allocated slot on the heap. To free the heap memory again, the Box type implements the Drop trait to call deallocate when it goes out of scope:

{
    let z = Box::new([1,2,3]);
    […]
} // z goes out of scope and `deallocate` is called

This pattern has the strange name resource acquisition is initialization (or RAII for short). It originated in C++, where it is used to implement a similar abstraction type called std::unique_ptr.
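
To observe the RAII pattern in action, we can box a value whose destructor leaves a visible trace. The following sketch assumes a hosted environment; the `Tracked` type and the `drop_runs_at_scope_end` function are hypothetical helpers for illustration:

```rust
use std::cell::Cell;

// A hypothetical type that records when its destructor runs.
struct Tracked<'a> {
    dropped: &'a Cell<bool>,
}

impl Drop for Tracked<'_> {
    fn drop(&mut self) {
        self.dropped.set(true);
    }
}

// Returns whether the boxed value was dropped at the end of the inner scope.
fn drop_runs_at_scope_end() -> bool {
    let dropped = Cell::new(false);
    {
        let _boxed = Box::new(Tracked { dropped: &dropped });
        // The box is still alive here, so the destructor has not run yet.
        assert!(!dropped.get());
    } // `_boxed` goes out of scope: the value is dropped and its heap memory is freed
    dropped.get()
}

fn main() {
    println!("dropped at scope end: {}", drop_runs_at_scope_end());
}
```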

Such a type alone does not suffice to prevent all use-after-free bugs since programmers can still hold on to references after the Box goes out of scope and the corresponding heap memory slot is deallocated:

let x = {
    let z = Box::new([1,2,3]);
    &z[1]
}; // z goes out of scope and `deallocate` is called
println!("{}", x);

This is where Rust’s ownership comes in. It assigns an abstract lifetime to each reference, which is the scope in which the reference is valid. In the above example, the x reference is taken from the z array, so it becomes invalid after z goes out of scope. When you run the above example on the playground you see that the Rust compiler indeed throws an error:

error[E0597]: `z[_]` does not live long enough
 --> src/main.rs:4:9
  |
2 |     let x = {
  |         - borrow later stored here
3 |         let z = Box::new([1,2,3]);
4 |         &z[1]
  |         ^^^^^ borrowed value does not live long enough
5 |     }; // z goes out of scope and `deallocate` is called
  |     - `z[_]` dropped here while still borrowed

The terminology can be a bit confusing at first. Taking a reference to a value is called borrowing the value since it’s similar to a borrow in real life: You have temporary access to an object but need to return it sometime, and you must not destroy it. By checking that all borrows end before an object is destroyed, the Rust compiler can guarantee that no use-after-free situation can occur.
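
One possible way to make the rejected example compile is to copy the element out of the box instead of borrowing it, so that no reference outlives the allocation. This is only a sketch of one fix, and the function name `second_element` is made up for this illustration:

```rust
// Returns the second element by value instead of by reference.
fn second_element() -> i32 {
    let z = Box::new([1, 2, 3]);
    // `i32` is `Copy`, so this copies the value out of the heap allocation.
    let x = z[1];
    x
} // `z` goes out of scope and is deallocated, but `x` is an independent copy

fn main() {
    println!("{}", second_element());
}
```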

Rust’s ownership system goes even further, preventing not only use-after-free bugs but also providing complete memory safety, as garbage collected languages like Java or Python do. Additionally, it guarantees thread safety and is thus even safer than those languages in multi-threaded code. And most importantly, all these checks happen at compile time, so there is no runtime overhead compared to hand-written memory management in C.

Use Cases

We now know the basics of dynamic memory allocation in Rust, but when should we use it? We’ve come really far with our kernel without dynamic memory allocation, so why do we need it now?

First, dynamic memory allocation always comes with a bit of performance overhead since we need to find a free slot on the heap for every allocation. For this reason, local variables are generally preferable, especially in performance-sensitive kernel code. However, there are cases where dynamic memory allocation is the best choice.

As a basic rule, dynamic memory is required for variables that have a dynamic lifetime or a variable size. The most important type with a dynamic lifetime is Rc, which counts the references to its wrapped value and deallocates it after all references have gone out of scope. Examples for types with a variable size are Vec, String, and other collection types that dynamically grow when more elements are added. These types work by allocating a larger amount of memory when they become full, copying all elements over, and then deallocating the old allocation.

For our kernel, we will mostly need the collection types, for example, to store a list of active tasks when implementing multitasking in future posts.

The Allocator Interface

The first step in implementing a heap allocator is to add a dependency on the built-in alloc crate. Like the core crate, it is a subset of the standard library that additionally contains the allocation and collection types. To add the dependency on alloc, we add the following to our lib.rs:

// in src/lib.rs

extern crate alloc;

Contrary to normal dependencies, we don’t need to modify the Cargo.toml. The reason is that the alloc crate ships with the Rust compiler as part of the standard library, so the compiler already knows about the crate. By adding this extern crate statement, we specify that the compiler should try to include it. (Historically, all dependencies needed an extern crate statement, which is now optional.)

Since we are compiling for a custom target, we can’t use the precompiled version of alloc that is shipped with the Rust installation. Instead, we have to tell cargo to recompile the crate from source. We can do that by adding it to the unstable.build-std array in our .cargo/config.toml file:

# in .cargo/config.toml

[unstable]
build-std = ["core", "compiler_builtins", "alloc"]

Now the compiler will recompile and include the alloc crate in our kernel.

The reason that the alloc crate is disabled by default in #[no_std] crates is that it has additional requirements. When we try to compile our project now, we will see these requirements as errors:

error: no global memory allocator found but one is required; link to std or add
       #[global_allocator] to a static item that implements the GlobalAlloc trait.

The error occurs because the alloc crate requires a heap allocator, which is an object that provides the allocate and deallocate functions. In Rust, heap allocators are described by the GlobalAlloc trait, which is mentioned in the error message. To set the heap allocator for the crate, the #[global_allocator] attribute must be applied to a static variable that implements the GlobalAlloc trait.

The `GlobalAlloc` Trait

The GlobalAlloc trait defines the functions that a heap allocator must provide. The trait is special because it is almost never used directly by the programmer. Instead, the compiler will automatically insert the appropriate calls to the trait methods when using the allocation and collection types of alloc.

Since we will need to implement the trait for all our allocator types, it is worth taking a closer look at its declaration:

pub unsafe trait GlobalAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8;
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout);

    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 { ... }
    unsafe fn realloc(
        &self,
        ptr: *mut u8,
        layout: Layout,
        new_size: usize
    ) -> *mut u8 { ... }
}

It defines the two required methods alloc and dealloc, which correspond to the allocate and deallocate functions we used in our examples:

The alloc method takes a Layout instance as an argument, which describes the desired size and alignment that the allocated memory should have. It returns a raw pointer to the first byte of the allocated memory block. Instead of an explicit error value, the alloc method returns a null pointer to signal an allocation error. This is a bit non-idiomatic, but it has the advantage that wrapping existing system allocators is easy since they use the same convention.
The dealloc method is the counterpart and is responsible for freeing a memory block again. It receives two arguments: the pointer returned by alloc and the Layout that was used for the allocation.

The trait additionally defines the two methods alloc_zeroed and realloc with default implementations:

The alloc_zeroed method is equivalent to calling alloc and then setting the allocated memory block to zero, which is exactly what the provided default implementation does. An allocator implementation can override the default implementations with a more efficient custom implementation if possible.
The realloc method allows growing or shrinking an allocation. The default implementation allocates a new memory block with the desired size and copies over all the content from the previous allocation. Again, an allocator implementation can probably provide a more efficient implementation of this method, for example by growing/shrinking the allocation in-place if possible.
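
To get a feeling for these four methods, we can call them by hand on an existing allocator. The following sketch assumes a hosted environment and uses the `System` allocator of the standard library as a stand-in, since our kernel has no allocator yet. The `zeroed_then_grown` function is a hypothetical helper:

```rust
use std::alloc::{GlobalAlloc, Layout, System};

// Allocates 4 zeroed bytes, writes to the first byte, grows the block to
// 8 bytes, and returns (first byte before the write, first byte after growing).
fn zeroed_then_grown() -> (u8, u8) {
    let layout = Layout::from_size_align(4, 1).unwrap();
    unsafe {
        // `alloc_zeroed` behaves like `alloc` followed by zeroing the block.
        let ptr = System.alloc_zeroed(layout);
        assert!(!ptr.is_null(), "allocation failed");
        let first = *ptr;
        *ptr = 42;

        // `realloc` takes the old layout and the new size in bytes.
        let grown = System.realloc(ptr, layout, 8);
        assert!(!grown.is_null(), "reallocation failed");
        let kept = *grown; // the old content is preserved

        // `dealloc` must be called with the layout that matches the current block.
        System.dealloc(grown, Layout::from_size_align(8, 1).unwrap());
        (first, kept)
    }
}

fn main() {
    println!("{:?}", zeroed_then_grown());
}
```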

Unsafety

One thing to notice is that both the trait itself and all trait methods are declared as unsafe:

The reason for declaring the trait as unsafe is that the programmer must guarantee that the trait implementation for an allocator type is correct. For example, the alloc method must never return a memory block that is already used somewhere else because this would cause undefined behavior.
Similarly, the reason that the methods are unsafe is that the caller must ensure various invariants when calling the methods, for example, that the Layout passed to alloc specifies a non-zero size. This is not really relevant in practice since the methods are normally called directly by the compiler, which ensures that the requirements are met.

A `DummyAllocator`

Now that we know what an allocator type should provide, we can create a simple dummy allocator. For that, we create a new allocator module:

// in src/lib.rs

pub mod allocator;

Our dummy allocator does the absolute minimum to implement the trait and always returns an error when alloc is called. It looks like this:

// in src/allocator.rs

use alloc::alloc::{GlobalAlloc, Layout};
use core::ptr::null_mut;

pub struct Dummy;

unsafe impl GlobalAlloc for Dummy {
    unsafe fn alloc(&self, _layout: Layout) -> *mut u8 {
        null_mut()
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        panic!("dealloc should never be called")
    }
}

The struct does not need any fields, so we create it as a zero-sized type. As mentioned above, we always return the null pointer from alloc, which corresponds to an allocation error. Since the allocator never returns any memory, a call to dealloc should never occur. For this reason, we simply panic in the dealloc method. The alloc_zeroed and realloc methods have default implementations, so we don’t need to provide implementations for them.
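
We can check the behavior of such an allocator outside of the kernel by calling its methods directly. The sketch below assumes a hosted environment and therefore imports from `std::alloc` instead of `alloc::alloc`; the `dummy_always_fails` helper is hypothetical. It also shows that the default `alloc_zeroed` implementation simply forwards the null pointer:

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::ptr::null_mut;

// Same shape as the kernel's dummy allocator, but built against `std` for testing.
pub struct Dummy;

unsafe impl GlobalAlloc for Dummy {
    unsafe fn alloc(&self, _layout: Layout) -> *mut u8 {
        null_mut()
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        panic!("dealloc should never be called")
    }
}

// Returns true if both `alloc` and the default `alloc_zeroed` report an error.
fn dummy_always_fails() -> bool {
    let layout = Layout::new::<u32>();
    let plain = unsafe { Dummy.alloc(layout) };
    let zeroed = unsafe { Dummy.alloc_zeroed(layout) };
    plain.is_null() && zeroed.is_null()
}

fn main() {
    println!("dummy allocator always fails: {}", dummy_always_fails());
}
```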

We now have a simple allocator, but we still have to tell the Rust compiler that it should use this allocator. This is where the #[global_allocator] attribute comes in.

The `#[global_allocator]` Attribute

The #[global_allocator] attribute tells the Rust compiler which allocator instance it should use as the global heap allocator. The attribute is only applicable to a static that implements the GlobalAlloc trait. Let’s register an instance of our Dummy allocator as the global allocator:

// in src/allocator.rs

#[global_allocator]
static ALLOCATOR: Dummy = Dummy;

Since the Dummy allocator is a zero-sized type, we don’t need to specify any fields in the initialization expression.

With this static, the compilation errors should be fixed. Now we can use the allocation and collection types of alloc. For example, we can use a Box to allocate a value on the heap:

// in src/main.rs

extern crate alloc;

use alloc::boxed::Box;

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    // […] print "Hello World!", call `init`, create `mapper` and `frame_allocator`

    let x = Box::new(41);

    // […] call `test_main` in test mode

    println!("It did not crash!");
    blog_os::hlt_loop();
}

Note that we need to specify the extern crate alloc statement in our main.rs too. This is required because the lib.rs and main.rs parts are treated as separate crates. However, we don’t need to create another #[global_allocator] static because the global allocator applies to all crates in the project. In fact, specifying an additional allocator in another crate would be an error.

When we run the above code, we see that a panic occurs:

The panic occurs because the Box::new function implicitly calls the alloc function of the global allocator. Our dummy allocator always returns a null pointer, so every allocation fails. To fix this, we need to create an allocator that actually returns usable memory.

Creating a Kernel Heap

Before we can create a proper allocator, we first need to create a heap memory region from which the allocator can allocate memory. To do this, we need to define a virtual memory range for the heap region and then map this region to physical frames. See the “Introduction To Paging” post for an overview of virtual memory and page tables.

The first step is to define a virtual memory region for the heap. We can choose any virtual address range that we like, as long as it is not already used for a different memory region. Let’s define it as the memory starting at address 0x_4444_4444_0000 so that we can easily recognize a heap pointer later:

// in src/allocator.rs

pub const HEAP_START: usize = 0x_4444_4444_0000;
pub const HEAP_SIZE: usize = 100 * 1024; // 100 KiB

We set the heap size to 100 KiB for now. If we need more space in the future, we can simply increase it.

If we tried to use this heap region now, a page fault would occur since the virtual memory region is not mapped to physical memory yet. To resolve this, we create an init_heap function that maps the heap pages using the Mapper API that we introduced in the “Paging Implementation” post:

// in src/allocator.rs

use x86_64::{
    structures::paging::{
        mapper::MapToError, FrameAllocator, Mapper, Page, PageTableFlags, Size4KiB,
    },
    VirtAddr,
};

pub fn init_heap(
    mapper: &mut impl Mapper<Size4KiB>,
    frame_allocator: &mut impl FrameAllocator<Size4KiB>,
) -> Result<(), MapToError<Size4KiB>> {
    let page_range = {
        let heap_start = VirtAddr::new(HEAP_START as u64);
        let heap_end = heap_start + HEAP_SIZE - 1u64;
        let heap_start_page = Page::containing_address(heap_start);
        let heap_end_page = Page::containing_address(heap_end);
        Page::range_inclusive(heap_start_page, heap_end_page)
    };

    for page in page_range {
        let frame = frame_allocator
            .allocate_frame()
            .ok_or(MapToError::FrameAllocationFailed)?;
        let flags = PageTableFlags::PRESENT | PageTableFlags::WRITABLE;
        unsafe {
            mapper.map_to(page, frame, flags, frame_allocator)?.flush()
        };
    }

    Ok(())
}

The function takes mutable references to a Mapper and a FrameAllocator instance, both limited to 4 KiB pages by using Size4KiB as the generic parameter. The return value of the function is a Result with the unit type () as the success variant and a MapToError as the error variant, which is the error type returned by the Mapper::map_to method. Reusing the error type makes sense here because the map_to method is the main source of errors in this function.

The implementation can be broken down into two parts:

Creating the page range: To create a range of the pages that we want to map, we convert the HEAP_START pointer to a VirtAddr type. Then we calculate the heap end address from it by adding the HEAP_SIZE. We want an inclusive bound (the address of the last byte of the heap), so we subtract 1. Next, we convert the addresses into Page types using the containing_address function. Finally, we create a page range from the start and end pages using the Page::range_inclusive function.
Mapping the pages: The second step is to map all pages of the page range we just created. For that, we iterate over these pages using a for loop. For each page, we do the following:
- We allocate a physical frame that the page should be mapped to using the FrameAllocator::allocate_frame method. This method returns None when there are no more frames left. We deal with that case by mapping it to a MapToError::FrameAllocationFailed error through the Option::ok_or method and then applying the question mark operator to return early in the case of an error.
- We set the required PRESENT flag and the WRITABLE flag for the page. With these flags, both read and write accesses are allowed, which makes sense for heap memory.
- We use the Mapper::map_to method for creating the mapping in the active page table. The method can fail, so we use the question mark operator again to forward the error to the caller. On success, the method returns a MapperFlush instance that we can use to update the translation lookaside buffer using the flush method.
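
To see how many pages the loop maps, we can redo the page range calculation with plain integers. This is only a sketch of the arithmetic under the assumption of 4 KiB pages; it does not use the x86_64 crate, and the `PAGE_SIZE` constant and the `heap_page_count` function are hypothetical names:

```rust
const HEAP_START: u64 = 0x_4444_4444_0000;
const HEAP_SIZE: u64 = 100 * 1024; // 100 KiB
const PAGE_SIZE: u64 = 4096; // assumed page size (4 KiB)

// Number of pages between the first and the last heap byte (inclusive).
fn heap_page_count() -> u64 {
    // Index of the page that contains the first heap byte.
    let first_page = HEAP_START / PAGE_SIZE;
    // Index of the page that contains the last heap byte (inclusive bound).
    let last_page = (HEAP_START + HEAP_SIZE - 1) / PAGE_SIZE;
    last_page - first_page + 1
}

fn main() {
    println!("init_heap maps {} pages", heap_page_count());
}
```

Since HEAP_START is page-aligned and 100 KiB is a multiple of 4 KiB, the heap covers exactly 25 pages, so init_heap requests 25 frames from the frame allocator (plus any frames that map_to needs for new page tables).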

The final step is to call this function from our kernel_main:

// in src/main.rs

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    use blog_os::allocator; // new import
    use blog_os::memory::{self, BootInfoFrameAllocator};

    println!("Hello World{}", "!");
    blog_os::init();

    let phys_mem_offset = VirtAddr::new(boot_info.physical_memory_offset);
    let mut mapper = unsafe { memory::init(phys_mem_offset) };
    let mut frame_allocator = unsafe {
        BootInfoFrameAllocator::init(&boot_info.memory_map)
    };

    // new
    allocator::init_heap(&mut mapper, &mut frame_allocator)
        .expect("heap initialization failed");

    let x = Box::new(41);

    // […] call `test_main` in test mode

    println!("It did not crash!");
    blog_os::hlt_loop();
}

We show the full function for context here. The only new lines are the blog_os::allocator import and the call to the allocator::init_heap function. In case the init_heap function returns an error, we panic using the Result::expect method since there is currently no sensible way for us to handle this error.

We now have a mapped heap memory region that is ready to be used. The Box::new call still uses our old Dummy allocator, so you will still see the “out of memory” error when you run it. Let’s fix this by using a proper allocator.

Using an Allocator Crate

Since implementing an allocator is somewhat complex, we start by using an external allocator crate. We will learn how to implement our own allocator in the next post.

A simple allocator crate for no_std applications is the linked_list_allocator crate. Its name comes from the fact that it uses a linked list data structure to keep track of deallocated memory regions. See the next post for a more detailed explanation of this approach.

To use the crate, we first need to add a dependency on it in our Cargo.toml:

# in Cargo.toml

[dependencies]
linked_list_allocator = "0.9.0"

Then we can replace our dummy allocator with the allocator provided by the crate:

// in src/allocator.rs

use linked_list_allocator::LockedHeap;

#[global_allocator]
static ALLOCATOR: LockedHeap = LockedHeap::empty();

The struct is named LockedHeap because it uses the spinning_top::Spinlock type for synchronization. This is required because multiple threads could access the ALLOCATOR static at the same time. As always, when using a spinlock or a mutex, we need to be careful to not accidentally cause a deadlock. This means that we shouldn’t perform any allocations in interrupt handlers, since they can run at an arbitrary time and might interrupt an in-progress allocation.
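
The deadlock scenario can be illustrated with a small hosted sketch. It uses `std::sync::Mutex` as a stand-in for the spinlock of the allocator (an assumption made for runnability) and `try_lock` instead of `lock`, so that the example reports the problem instead of hanging. The function name is hypothetical:

```rust
use std::sync::Mutex;

// Simulates an "interrupt handler" that tries to take the allocator lock
// while the interrupted code still holds it.
fn second_lock_would_block() -> bool {
    let heap_lock = Mutex::new(0u32);
    // The interrupted code is in the middle of an allocation and holds the lock.
    let _in_progress = heap_lock.lock().unwrap();
    // The handler now also needs the lock. A spinning `lock` call would wait
    // forever here because the interrupted code cannot continue to release it.
    let blocked = heap_lock.try_lock().is_err();
    blocked
}

fn main() {
    println!("second lock attempt blocked: {}", second_lock_would_block());
}
```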

Setting the LockedHeap as global allocator is not enough. The reason is that we use the empty constructor function, which creates an allocator without any backing memory. Like our dummy allocator, it always returns an error on alloc. To fix this, we need to initialize the allocator after creating the heap:

// in src/allocator.rs

pub fn init_heap(
    mapper: &mut impl Mapper<Size4KiB>,
    frame_allocator: &mut impl FrameAllocator<Size4KiB>,
) -> Result<(), MapToError<Size4KiB>> {
    // […] map all heap pages to physical frames

    // new
    unsafe {
        ALLOCATOR.lock().init(HEAP_START, HEAP_SIZE);
    }

    Ok(())
}

We use the lock method on the inner spinlock of the LockedHeap type to get an exclusive reference to the wrapped Heap instance, on which we then call the init method with the heap bounds as arguments. Because the init function already tries to write to the heap memory, we must initialize the heap only after mapping the heap pages.

After initializing the heap, we can now use all allocation and collection types of the built-in alloc crate without error:

// in src/main.rs

use alloc::{boxed::Box, vec, vec::Vec, rc::Rc};

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    // […] initialize interrupts, mapper, frame_allocator, heap

    // allocate a number on the heap
    let heap_value = Box::new(41);
    println!("heap_value at {:p}", heap_value);

    // create a dynamically sized vector
    let mut vec = Vec::new();
    for i in 0..500 {
        vec.push(i);
    }
    println!("vec at {:p}", vec.as_slice());

    // create a reference counted vector -> will be freed when count reaches 0
    let reference_counted = Rc::new(vec![1, 2, 3]);
    let cloned_reference = reference_counted.clone();
    println!("current reference count is {}", Rc::strong_count(&cloned_reference));
    core::mem::drop(reference_counted);
    println!("reference count is {} now", Rc::strong_count(&cloned_reference));

    // […] call `test_main` in test context
    println!("It did not crash!");
    blog_os::hlt_loop();
}

This code example shows some uses of the Box, Vec, and Rc types. For the Box and Vec types, we print the underlying heap pointers using the {:p} formatting specifier. To showcase Rc, we create a reference-counted heap value and use the Rc::strong_count function to print the current reference count before and after dropping an instance (using core::mem::drop).

When we run it, we see the following:

As expected, we see that the Box and Vec values live on the heap, as indicated by the pointer starting with the 0x_4444_4444_* prefix. The reference counted value also behaves as expected, with the reference count being 2 after the clone call, and 1 again after one of the instances was dropped.

The vector starts at offset 0x800 not because the boxed value is 0x800 bytes large, but because of the reallocations that occur when the vector needs to increase its capacity. For example, when the vector’s capacity is 32 and we try to add the next element, the vector allocates a new backing array with a capacity of 64 behind the scenes and copies all elements over. Then it frees the old allocation.
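
We can make these reallocations visible by watching the capacity of a vector while pushing to it. The following sketch assumes a hosted environment; the exact growth strategy of Vec is an implementation detail, so the code only counts capacity changes without relying on specific values. The `count_capacity_changes` function is a hypothetical helper:

```rust
// Pushes `n` elements and counts how often the capacity of the vector changes.
// Each capacity change corresponds to a new (or resized) backing allocation.
fn count_capacity_changes(n: usize) -> usize {
    let mut vec = Vec::new();
    let mut capacity = vec.capacity();
    let mut changes = 0;
    for i in 0..n {
        vec.push(i);
        if vec.capacity() != capacity {
            capacity = vec.capacity();
            changes += 1;
        }
    }
    changes
}

fn main() {
    println!(
        "capacity changed {} times for 500 pushes",
        count_capacity_changes(500)
    );
}
```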

Of course, there are many more allocation and collection types in the alloc crate that we can now all use in our kernel, including:

the thread-safe reference counted pointer Arc
the owned string type String and the format! macro
LinkedList
the growable ring buffer VecDeque
the BinaryHeap priority queue
BTreeMap and BTreeSet

These types will become very useful when we want to implement thread lists, scheduling queues, or support for async/await.

Adding a Test

To ensure that we don’t accidentally break our new allocation code, we should add an integration test for it. We start by creating a new tests/heap_allocation.rs file with the following content:

// in tests/heap_allocation.rs

#![no_std]
#![no_main]
#![feature(custom_test_frameworks)]
#![test_runner(blog_os::test_runner)]
#![reexport_test_harness_main = "test_main"]

extern crate alloc;

use bootloader::{entry_point, BootInfo};
use core::panic::PanicInfo;

entry_point!(main);

fn main(boot_info: &'static BootInfo) -> ! {
    unimplemented!();
}

#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    blog_os::test_panic_handler(info)
}

We reuse the test_runner and test_panic_handler functions from our lib.rs. Since we want to test allocations, we enable the alloc crate through the extern crate alloc statement. For more information about the test boilerplate, check out the Testing post.

The implementation of the main function looks like this:

// in tests/heap_allocation.rs

fn main(boot_info: &'static BootInfo) -> ! {
    use blog_os::allocator;
    use blog_os::memory::{self, BootInfoFrameAllocator};
    use x86_64::VirtAddr;

    blog_os::init();
    let phys_mem_offset = VirtAddr::new(boot_info.physical_memory_offset);
    let mut mapper = unsafe { memory::init(phys_mem_offset) };
    let mut frame_allocator = unsafe {
        BootInfoFrameAllocator::init(&boot_info.memory_map)
    };
    allocator::init_heap(&mut mapper, &mut frame_allocator)
        .expect("heap initialization failed");

    test_main();
    loop {}
}

It is very similar to the kernel_main function in our main.rs, with the differences that we don’t invoke println, don’t include any example allocations, and call test_main unconditionally.

Now we’re ready to add a few test cases. First, we add a test that performs some simple allocations using Box and checks the allocated values to ensure that basic allocations work:

// in tests/heap_allocation.rs
use alloc::boxed::Box;

#[test_case]
fn simple_allocation() {
    let heap_value_1 = Box::new(41);
    let heap_value_2 = Box::new(13);
    assert_eq!(*heap_value_1, 41);
    assert_eq!(*heap_value_2, 13);
}

Most importantly, this test verifies that no allocation error occurs.

Next, we iteratively build a large vector, to test both large allocations and multiple allocations (due to reallocations):

// in tests/heap_allocation.rs

use alloc::vec::Vec;

#[test_case]
fn large_vec() {
    let n = 1000;
    let mut vec = Vec::new();
    for i in 0..n {
        vec.push(i);
    }
    assert_eq!(vec.iter().sum::<u64>(), (n - 1) * n / 2);
}

We verify the sum by comparing it with the formula for the n-th partial sum. This gives us some confidence that the allocated values are all correct.
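
Written out, the check relies on the closed form for the sum of the first n natural numbers starting at zero, which for our choice of n = 1000 evaluates as follows:

```latex
\sum_{i=0}^{n-1} i = \frac{(n-1)\,n}{2},
\qquad
n = 1000:\quad \frac{999 \cdot 1000}{2} = 499500
```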

As a third test, we create a large number of allocations one after another. The loop runs HEAP_SIZE times, which is 102,400 iterations with our current heap size:

// in tests/heap_allocation.rs

use blog_os::allocator::HEAP_SIZE;

#[test_case]
fn many_boxes() {
    for i in 0..HEAP_SIZE {
        let x = Box::new(i);
        assert_eq!(*x, i);
    }
}

This test ensures that the allocator reuses freed memory for subsequent allocations since it would run out of memory otherwise: each boxed usize takes 8 bytes, so HEAP_SIZE allocations would need at least eight times the size of the heap if nothing was reused. This might seem like an obvious requirement for an allocator, but there are allocator designs that don’t do this. An example is the bump allocator design that will be explained in the next post.

Let’s run our new integration test:

> cargo test --test heap_allocation
[…]
Running 3 tests
simple_allocation... [ok]
large_vec... [ok]
many_boxes... [ok]

All three tests succeeded! You can also invoke cargo test (without the --test argument) to run all unit and integration tests.

Summary

This post gave an introduction to dynamic memory and explained why and where it is needed. We saw how Rust’s borrow checker prevents common vulnerabilities and learned how Rust’s allocation API works.

After creating a minimal implementation of Rust’s allocator interface using a dummy allocator, we created a proper heap memory region for our kernel. For that, we defined a virtual address range for the heap and then mapped all pages of that range to physical frames using the Mapper and FrameAllocator from the previous post.

Finally, we added a dependency on the linked_list_allocator crate to add a proper allocator to our kernel. With this allocator, we were able to use Box, Vec, and other allocation and collection types from the alloc crate.

What’s next?

While we already added heap allocation support in this post, we left most of the work to the linked_list_allocator crate. The next post will show in detail how an allocator can be implemented from scratch. It will present multiple possible allocator designs, show how to implement simple versions of them, and explain their advantages and drawbacks.

Updates in May 2019

Mon, 03 Jun 2019 00:00:00 +0000

This post gives an overview of the recent updates to the Writing an OS in Rust blog and to the tools it uses. I was quite busy with my master’s thesis this month, so I didn’t have the time to create new content or major new features. However, there were quite a few minor updates.

x86_64

Use cast crate instead of usize_conversions crate (released as version 0.5.5).
Make FrameAllocator an unsafe trait (released as version 0.6.0).
Change Port::read and PortReadOnly::read to take &mut self (released as version 0.7.0).
@npmccallum started working on moving the type declarations to a separate crate to make them usable for more projects. We created the experimental x86_64_types crate for this.

Cargo-Xbuild

Make backtraces optional to remove the transitive dependency on the cc crate, which has additional compile-time requirements (e.g. a working gcc installation). These requirements caused problems for some people, so we decided to disable backtraces by default. Released as version 0.5.9.
Error when the sysroot path contains spaces: This pull request adds a special error message that points to rust-lang/cargo#6139 when a sysroot path contains spaces. This should avoid the regular confusion, e.g. here, here, or here.
Add a XBUILD_SYSROOT_PATH environment variable to override sysroot path: This feature is useful when the default sysroot path contains a space. Released as version 0.5.10.
Fix the new XBUILD_SYSROOT_PATH environment variable. Released as version 0.5.11.
Update Azure Pipelines CI script
- Build all branches instead of just master and the bors staging branch.
- Rustup is now included in the official Windows image of Azure Pipelines, so we don’t need to install it again.

Bootloader

@rybot666 started working on porting the 16-bit assembly of the bootloader to Rust.

Bootimage

@toothbrush7777777 landed a pull request to pad the boot image to a hard disk block size. This is required for booting the image in VirtualBox. Released as version 0.7.4.
Set XBUILD_SYSROOT_PATH when building bootloader. Released as version 0.7.5.

Blog OS

Update to version 0.6.0 of x86_64, which made the FrameAllocator trait unsafe to implement.
Use -serial stdio instead of -serial mon:stdio as QEMU arguments when testing.
Update x86_64 to version 0.7.0, which changed the Port::read method to take &mut self instead of &self.
@josephlr replaced some leftover tabs with spaces.
Rewrite CompareMessage struct to check the whole string.

Updates in April 2019

Wed, 01 May 2019 00:00:00 +0000

Lots of things changed in the Writing an OS in Rust series in the past month, both on the blog itself and in the tools behind the scenes. This post gives an overview of the most important updates.

This post is an experiment inspired by This Week in Rust and similar series. The goal is to provide a resource that allows following the project more closely and staying up-to-date with the changes in the tools/libraries behind the scenes. If enough people find this useful, I will try to turn this into a semi-regular series.

Bootloader

The build system of the bootloader was rewritten to do a proper linking instead of appending the kernel executable manually like before. The relevant pull requests are Rewrite build system and Updates for new build system. These (breaking) changes were released as version 0.5.0 (changelog).
To make the bootloader work with future versions of bootimage, a package.metadata.bootloader.target key was added to the Cargo.toml of the bootloader. This key specifies the name of the target JSON file, so that bootimage knows which --target argument to pass. This change was released as version 0.5.1 (changelog).
In the Version 0.6.0 pull request, the #[cfg(not(test))] attribute was removed from the entry_point macro. This makes it possible to use the macro together with cargo xtest and a custom test framework. Since the change is breaking, it was released as version 0.6.0 (changelog).

Bootimage

The Rewrite bootimage for new bootloader build system pull request completely revamped the implementation of the crate. This was released as version 0.7.0. See the changelog for a list of changes.
- The rewrite had the unintended side-effect that bootimage run no longer ignored executables named test-*, so that an additional --bin argument was required for specifying which executable to run. To avoid breaking users of bootimage test, we yanked version 0.7.0. After fixing the issue, version 0.7.1 was released (changelog).
The New features for bootimage runner pull request added support for additional arguments and various functionality for supporting cargo xtest. The changes were released as version 0.7.2 (changelog).
An argument parsing bug that broke the new cargo bootimage subcommand on Windows was fixed. The fix was released as version 0.7.3.

Blog OS

Performed an Update to new bootloader 0.5.1 and bootimage 0.7.2. Apart from requiring the llvm-tools-preview rustup component, this only changes version numbers.
The Rewrite the linking section of “A Freestanding Rust Binary” pull request updated the first post to compile for the bare-metal thumbv7em-none-eabihf target instead of adding linker arguments for Linux/Windows/macOS.
Since the blog came close to the free bandwidth limit of Netlify, we needed to Migrate from Netlify to Github Pages to avoid additional fees.
With the Minimal Rust Kernel: Use a runner to make cargo xrun work pull request, we integrated the new bootimage runner into the blog.
- The required updates to the post-02 and post-03 branches were performed in the Add .cargo/config file to post-02 branch and Merge the changes from #585 into the post-03 branch pull requests.
In the New testing post pull request, we replaced the previous Unit Testing and Integration Tests with the new Testing post, which uses cargo xtest and a custom test framework for running tests.
- The required updates for the post-04 branch were performed in the Implement code for new testing post in post-xx branches pull request. The updates for the other post-* branches were pushed manually to avoid spamming the repository with pull requests. You can find a list of the commits in the pull request description.
The Avoid generic impl trait parameters in BootInfoFrameAllocator pull request made the BootInfoFrameAllocator non-generic by reconstructing the frame iterator on every allocation. This way, we avoid using an impl Trait type parameter, which makes it impossible to store the type in a static. See rust-lang/rust#60367 for the fundamental problem.

Testing

Sat, 27 Apr 2019 00:00:00 +0000

This post explores unit and integration testing in no_std executables. We will use Rust’s support for custom test frameworks to execute test functions inside our kernel. To report the results out of QEMU, we will use different features of QEMU and the bootimage tool.

Requirements

This post replaces the (now deprecated) Unit Testing and Integration Tests posts. It assumes that you have followed the A Minimal Rust Kernel post after 2019-04-27. Mainly, it requires that you have a .cargo/config.toml file that sets a default target and defines a runner executable.

Testing in Rust

Rust has a built-in test framework that is capable of running unit tests without the need to set anything up. Just create a function that checks some results through assertions and add the #[test] attribute to the function header. Then cargo test will automatically find and execute all test functions of your crate.

Unfortunately, it’s a bit more complicated for no_std applications such as our kernel. The problem is that Rust’s test framework implicitly uses the built-in test library, which depends on the standard library. This means that we can’t use the default test framework for our #[no_std] kernel.

We can see this when we try to run cargo test in our project:

> cargo test
   Compiling blog_os v0.1.0 (/…/blog_os)
error[E0463]: can't find crate for `test`

Since the test crate depends on the standard library, it is not available for our bare metal target. While porting the test crate to a #[no_std] context is possible, it is highly unstable and requires some hacks, such as redefining the panic macro.

Custom Test Frameworks

Fortunately, Rust supports replacing the default test framework through the unstable custom_test_frameworks feature. This feature requires no external libraries and thus also works in #[no_std] environments. It works by collecting all functions annotated with a #[test_case] attribute and then invoking a user-specified runner function with the list of tests as an argument. Thus, it gives the implementation maximal control over the test process.

The disadvantage compared to the default test framework is that many advanced features, such as should_panic tests, are not available. Instead, it is up to the implementation to provide such features itself if needed. This is ideal for us since we have a very special execution environment where the default implementations of such advanced features probably wouldn’t work anyway. For example, the #[should_panic] attribute relies on stack unwinding to catch the panics, which we disabled for our kernel.

To implement a custom test framework for our kernel, we add the following to our main.rs:

// in src/main.rs

#![feature(custom_test_frameworks)]
#![test_runner(crate::test_runner)]

#[cfg(test)]
pub fn test_runner(tests: &[&dyn Fn()]) {
    println!("Running {} tests", tests.len());
    for test in tests {
        test();
    }
}

Our runner just prints a short debug message and then calls each test function in the list. The argument type &[&dyn Fn()] is a slice of trait object references of the Fn() trait. It is basically a list of references to types that can be called like a function. Since the function is useless for non-test runs, we use the #[cfg(test)] attribute to include it only for tests.
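To make the &[&dyn Fn()] argument type more concrete, here is a minimal host-side sketch in ordinary std Rust (not kernel code) that builds such a slice by hand. The names run_all, first_test, second_test, and the CALLS counter are made up for this illustration; in our kernel, the custom_test_frameworks feature assembles the slice from the #[test_case] functions for us.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical counter so that we can observe that the tests really ran.
static CALLS: AtomicUsize = AtomicUsize::new(0);

fn first_test() {
    CALLS.fetch_add(1, Ordering::SeqCst);
}

fn second_test() {
    CALLS.fetch_add(1, Ordering::SeqCst);
}

/// Calls every test in the slice and returns how many were invoked.
fn run_all(tests: &[&dyn Fn()]) -> usize {
    for test in tests {
        // `test` is a reference to a trait object, which can be called
        // like a normal function.
        test();
    }
    tests.len()
}
```

Calling run_all(&[&first_test, &second_test]) coerces each function into a &dyn Fn() trait object and invokes both of them in order.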

When we run cargo test now, we see that it succeeds (if it doesn’t, see the note below). However, we still see our “Hello World” instead of the message from our test_runner. The reason is that our _start function is still used as the entry point. The custom test frameworks feature generates a main function that calls test_runner, but this function is ignored because we use the #[no_main] attribute and provide our own entry point.

Note: There is currently a bug in cargo that leads to “duplicate lang item” errors on cargo test in some cases. It occurs when you have set panic = "abort" for a profile in your Cargo.toml. Try removing it, then cargo test should work. Alternatively, if that doesn’t work, then add panic-abort-tests = true to the [unstable] section of your .cargo/config.toml file. See the cargo issue for more information on this.

To fix this, we first need to change the name of the generated function to something different than main through the reexport_test_harness_main attribute. Then we can call the renamed function from our _start function:

// in src/main.rs

#![reexport_test_harness_main = "test_main"]

#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    #[cfg(test)]
    test_main();

    loop {}
}

We set the name of the test framework entry function to test_main and call it from our _start entry point. We use conditional compilation to add the call to test_main only in test contexts because the function is not generated on a normal run.

When we now execute cargo test, we see the “Running 0 tests” message from our test_runner on the screen. We are now ready to create our first test function:

// in src/main.rs

#[test_case]
fn trivial_assertion() {
    print!("trivial assertion... ");
    assert_eq!(1, 1);
    println!("[ok]");
}

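When we run cargo test now, we see the following output in the QEMU window:

[Screenshot: QEMU printing “Hello World!”, “Running 1 tests”, and “trivial assertion... [ok]”]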

The tests slice passed to our test_runner function now contains a reference to the trivial_assertion function. From the trivial assertion... [ok] output on the screen, we see that the test was called and that it succeeded.

After executing the tests, our test_runner returns to the test_main function, which in turn returns to our _start entry point function. At the end of _start, we enter an endless loop because the entry point function is not allowed to return. This is a problem, because we want cargo test to exit after running all tests.

Exiting QEMU

Right now, we have an endless loop at the end of our _start function and need to close QEMU manually on each execution of cargo test. This is unfortunate because we also want to run cargo test in scripts without user interaction. The clean solution to this would be to implement a proper way to shut down our OS. Unfortunately, this is relatively complex because it requires implementing support for either the APM or ACPI power management standard.

Luckily, there is an escape hatch: QEMU supports a special isa-debug-exit device, which provides an easy way to exit QEMU from the guest system. To enable it, we need to pass a -device argument to QEMU. We can do so by adding a package.metadata.bootimage.test-args configuration key in our Cargo.toml:

# in Cargo.toml

[package.metadata.bootimage]
test-args = ["-device", "isa-debug-exit,iobase=0xf4,iosize=0x04"]

The bootimage runner appends the test-args to the default QEMU command for all test executables. For a normal cargo run, the arguments are ignored.

Together with the device name (isa-debug-exit), we pass the two parameters iobase and iosize that specify the I/O port through which the device can be reached from our kernel.

I/O Ports

There are two different approaches for communicating between the CPU and peripheral hardware on x86: memory-mapped I/O and port-mapped I/O. We already used memory-mapped I/O for accessing the VGA text buffer through the memory address 0xb8000. This address is not mapped to RAM but to some memory on the VGA device.

In contrast, port-mapped I/O uses a separate I/O bus for communication. Each connected peripheral has one or more port numbers. To communicate with such an I/O port, there are special CPU instructions called in and out, which take a port number and a data byte (there are also variations of these commands that allow sending a u16 or u32).

The isa-debug-exit device uses port-mapped I/O. The iobase parameter specifies on which port address the device should live (0xf4 is a generally unused port on the x86’s IO bus) and the iosize specifies the port size (0x04 means four bytes).

Using the Exit Device

The functionality of the isa-debug-exit device is very simple. When a value is written to the I/O port specified by iobase, it causes QEMU to exit with exit status (value << 1) | 1. So when we write 0 to the port, QEMU will exit with exit status (0 << 1) | 1 = 1, and when we write 1 to the port, it will exit with exit status (1 << 1) | 1 = 3.
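The transformation is easy to reproduce on the host. The following small sketch just encodes the formula from the paragraph above in plain Rust; the function name qemu_exit_status is our own choice for this illustration and is not part of QEMU or any crate.

```rust
/// Exit status that QEMU reports to the host after `value` was written to
/// the isa-debug-exit port, following the formula `(value << 1) | 1`.
fn qemu_exit_status(value: u32) -> u32 {
    (value << 1) | 1
}
```

With this helper, qemu_exit_status(0) evaluates to 1 and qemu_exit_status(1) evaluates to 3, which matches the two examples above. Since the lowest bit is always set, the resulting exit status is always odd and therefore never 0.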

Instead of manually invoking the in and out assembly instructions, we use the abstractions provided by the x86_64 crate. To add a dependency on that crate, we add it to the dependencies section in our Cargo.toml:

# in Cargo.toml

[dependencies]
x86_64 = "0.14.2"

Now we can use the Port type provided by the crate to create an exit_qemu function:

// in src/main.rs

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum QemuExitCode {
    Success = 0x10,
    Failed = 0x11,
}

pub fn exit_qemu(exit_code: QemuExitCode) {
    use x86_64::instructions::port::Port;

    unsafe {
        let mut port = Port::new(0xf4);
        port.write(exit_code as u32);
    }
}

The function creates a new Port at 0xf4, which is the iobase of the isa-debug-exit device. Then it writes the passed exit code to the port. We use u32 because we specified the iosize of the isa-debug-exit device as 4 bytes. Both operations are unsafe because writing to an I/O port can generally result in arbitrary behavior.

To specify the exit status, we create a QemuExitCode enum. The idea is to exit with the success exit code if all tests succeeded and with the failure exit code otherwise. The enum is marked as #[repr(u32)] to represent each variant by a u32 integer. We use the exit code 0x10 for success and 0x11 for failure. The actual exit codes don’t matter much, as long as they don’t clash with the default exit codes of QEMU. For example, using exit code 0 for success is not a good idea because it becomes (0 << 1) | 1 = 1 after the transformation, which is the default exit code when QEMU fails to run. So we could not differentiate a QEMU error from a successful test run.

We can now update our test_runner to exit QEMU after all tests have run:

// in src/main.rs

fn test_runner(tests: &[&dyn Fn()]) {
    println!("Running {} tests", tests.len());
    for test in tests {
        test();
    }
    // new
    exit_qemu(QemuExitCode::Success);
}

When we run cargo test now, we see that QEMU immediately closes after executing the tests. The problem is that cargo test interprets the test as failed even though we passed our Success exit code:

> cargo test
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running target/x86_64-blog_os/debug/deps/blog_os-5804fc7d2dd4c9be
Building bootloader
   Compiling bootloader v0.5.3 (/home/philipp/Documents/bootloader)
    Finished release [optimized + debuginfo] target(s) in 1.07s
Running: `qemu-system-x86_64 -drive format=raw,file=/…/target/x86_64-blog_os/debug/
    deps/bootimage-blog_os-5804fc7d2dd4c9be.bin -device isa-debug-exit,iobase=0xf4,
    iosize=0x04`
error: test failed, to rerun pass '--bin blog_os'

The problem is that cargo test considers all error codes other than 0 as failure.

Success Exit Code

To work around this, bootimage provides a test-success-exit-code configuration key that maps a specified exit code to the exit code 0:

# in Cargo.toml

[package.metadata.bootimage]
test-args = […]
test-success-exit-code = 33         # (0x10 << 1) | 1

With this configuration, bootimage maps our success exit code to exit code 0, so that cargo test correctly recognizes the success case and does not count the test as failed.
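Conceptually, the mapping can be pictured like the following sketch. This is not the actual bootimage source code; the function name map_exit_code and the choice to pass all other exit codes through unchanged are assumptions made purely for illustration.

```rust
/// Hypothetical model of the mapping: the configured success code becomes 0,
/// every other QEMU exit code is passed through unchanged (an assumption).
fn map_exit_code(qemu_exit_code: i32, success_exit_code: Option<i32>) -> i32 {
    match success_exit_code {
        // QEMU exited with the configured success code: report success.
        Some(code) if code == qemu_exit_code => 0,
        // No key configured or a different exit code: keep it as is.
        _ => qemu_exit_code,
    }
}
```

With test-success-exit-code = 33, our Success variant (0x10, which QEMU turns into 33) would be reported as 0, while the Failed variant (0x11, which becomes 35) would stay a non-zero exit code.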

Our test runner now automatically closes QEMU and correctly reports the test results. We still see the QEMU window open for a very short time, but it does not suffice to read the results. It would be nice if we could print the test results to the console instead, so we can still see them after QEMU exits.

Printing to the Console

To see the test output on the console, we need to send the data from our kernel to the host system somehow. There are various ways to achieve this, for example, by sending the data over a TCP network interface. However, setting up a networking stack is quite a complex task, so we will choose a simpler solution instead.

Serial Port

A simple way to send the data is to use the serial port, an old interface standard which is no longer found in modern computers. It is easy to program and QEMU can redirect the bytes sent over serial to the host’s standard output or a file.

The chips implementing a serial interface are called UARTs. There are lots of UART models on x86, but fortunately the only differences between them are some advanced features we don’t need. The common UARTs today are all compatible with the 16550 UART, so we will use that model for our testing framework.

We will use the uart_16550 crate to initialize the UART and send data over the serial port. To add it as a dependency, we update our Cargo.toml and main.rs:

# in Cargo.toml

[dependencies]
uart_16550 = "0.2.0"

The uart_16550 crate contains a SerialPort struct that represents the UART registers, but we still need to construct an instance of it ourselves. For that, we create a new serial module with the following content:

// in src/main.rs

mod serial;

// in src/serial.rs

use uart_16550::SerialPort;
use spin::Mutex;
use lazy_static::lazy_static;

lazy_static! {
    pub static ref SERIAL1: Mutex<SerialPort> = {
        let mut serial_port = unsafe { SerialPort::new(0x3F8) };
        serial_port.init();
        Mutex::new(serial_port)
    };
}

Like with the VGA text buffer, we use lazy_static and a spinlock to create a static writer instance. By using lazy_static we can ensure that the init method is called exactly once on its first use.
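The “initialized exactly once on first use” behavior can be observed on the host with a small sketch. It uses OnceLock and Mutex from the standard library as stand-ins for lazy_static and the spinlock, and a Vec<u8> as a stand-in for the serial port. The names PORT, port, and INIT_CALLS are hypothetical and only exist in this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

// Counts how often the initialization closure runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

// Stand-in for the lazily initialized SERIAL1 static.
static PORT: OnceLock<Mutex<Vec<u8>>> = OnceLock::new();

/// Returns the shared "port" and initializes it on the first call only.
fn port() -> &'static Mutex<Vec<u8>> {
    PORT.get_or_init(|| {
        // This corresponds to the `serial_port.init()` call in the kernel.
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        Mutex::new(Vec::new())
    })
}
```

No matter how often port() is called, the closure passed to get_or_init runs only on the first access, so INIT_CALLS ends up as 1.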

Like the isa-debug-exit device, the UART is programmed using port I/O. Since the UART is more complex, it uses multiple I/O ports for programming different device registers. The unsafe SerialPort::new function expects the address of the first I/O port of the UART as an argument, from which it can calculate the addresses of all needed ports. We’re passing the port address 0x3F8, which is the standard port number for the first serial interface.

To make the serial port easily usable, we add serial_print! and serial_println! macros:

// in src/serial.rs

#[doc(hidden)]
pub fn _print(args: ::core::fmt::Arguments) {
    use core::fmt::Write;
    SERIAL1.lock().write_fmt(args).expect("Printing to serial failed");
}

/// Prints to the host through the serial interface.
#[macro_export]
macro_rules! serial_print {
    ($($arg:tt)*) => {
        $crate::serial::_print(format_args!($($arg)*));
    };
}

/// Prints to the host through the serial interface, appending a newline.
#[macro_export]
macro_rules! serial_println {
    () => ($crate::serial_print!("\n"));
    ($fmt:expr) => ($crate::serial_print!(concat!($fmt, "\n")));
    ($fmt:expr, $($arg:tt)*) => ($crate::serial_print!(
        concat!($fmt, "\n"), $($arg)*));
}

The implementation is very similar to the implementation of our print and println macros. Since the SerialPort type already implements the fmt::Write trait, we don’t need to provide our own implementation.
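To see why an existing fmt::Write implementation is all that _print needs, consider this host-side sketch. It forwards the output of format_args! to any type that implements the trait. The helper name print_to is made up for this example, and a String (which implements fmt::Write in the standard library) takes the place of the serial port.

```rust
use std::fmt::{self, Write};

/// Forwards pre-formatted arguments to any sink that implements `fmt::Write`.
/// The `write_fmt` method is provided by the trait itself.
fn print_to<W: Write>(sink: &mut W, args: fmt::Arguments) -> fmt::Result {
    sink.write_fmt(args)
}
```

For example, print_to(&mut buffer, format_args!("Running {} tests\n", 1)) appends the formatted text to a String named buffer, just like our _print function sends it to the locked SERIAL1 port.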

Now we can print to the serial interface instead of the VGA text buffer in our test code:

// in src/main.rs

#[cfg(test)]
fn test_runner(tests: &[&dyn Fn()]) {
    serial_println!("Running {} tests", tests.len());
    […]
}

#[test_case]
fn trivial_assertion() {
    serial_print!("trivial assertion... ");
    assert_eq!(1, 1);
    serial_println!("[ok]");
}

Note that the serial_println macro lives directly under the root namespace because we used the #[macro_export] attribute, so importing it through use crate::serial::serial_println will not work.

QEMU Arguments

To see the serial output from QEMU, we need to use the -serial argument to redirect the output to stdout:

# in Cargo.toml

[package.metadata.bootimage]
test-args = [
    "-device", "isa-debug-exit,iobase=0xf4,iosize=0x04", "-serial", "stdio"
]

When we run cargo test now, we see the test output directly in the console:

> cargo test
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running target/x86_64-blog_os/debug/deps/blog_os-7b7c37b4ad62551a
Building bootloader
    Finished release [optimized + debuginfo] target(s) in 0.02s
Running: `qemu-system-x86_64 -drive format=raw,file=/…/target/x86_64-blog_os/debug/
    deps/bootimage-blog_os-7b7c37b4ad62551a.bin -device
    isa-debug-exit,iobase=0xf4,iosize=0x04 -serial stdio`
Running 1 tests
trivial assertion... [ok]

However, when a test fails, we still see the output inside QEMU because our panic handler still uses println. To simulate this, we can change the assertion in our trivial_assertion test to assert_eq!(0, 1):

[Screenshot: QEMU window showing “Hello World!” and the panic message of the failed assertion]

We see that the panic message is still printed to the VGA buffer, while the other test output is printed to the serial port. The panic message is quite useful, so it would be nice to see it in the console too.

Print an Error Message on Panic

To exit QEMU with an error message on a panic, we can use conditional compilation to use a different panic handler in testing mode:

// in src/main.rs

// our existing panic handler
#[cfg(not(test))] // new attribute
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    println!("{}", info);
    loop {}
}

// our panic handler in test mode
#[cfg(test)]
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    serial_println!("[failed]\n");
    serial_println!("Error: {}\n", info);
    exit_qemu(QemuExitCode::Failed);
    loop {}
}

For our test panic handler, we use serial_println instead of println and then exit QEMU with a failure exit code. Note that we still need an endless loop after the exit_qemu call because the compiler does not know that the isa-debug-exit device causes a program exit.

Now QEMU also exits for failed tests and prints a useful error message on the console:

> cargo test
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running target/x86_64-blog_os/debug/deps/blog_os-7b7c37b4ad62551a
Building bootloader
    Finished release [optimized + debuginfo] target(s) in 0.02s
Running: `qemu-system-x86_64 -drive format=raw,file=/…/target/x86_64-blog_os/debug/
    deps/bootimage-blog_os-7b7c37b4ad62551a.bin -device
    isa-debug-exit,iobase=0xf4,iosize=0x04 -serial stdio`
Running 1 tests
trivial assertion... [failed]

Error: panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `1`', src/main.rs:65:5

Since we see all test output on the console now, we no longer need the QEMU window that pops up for a short time. So we can hide it completely.

Hiding QEMU

Since we report out the complete test results using the isa-debug-exit device and the serial port, we don’t need the QEMU window anymore. We can hide it by passing the -display none argument to QEMU:

# in Cargo.toml

[package.metadata.bootimage]
test-args = [
    "-device", "isa-debug-exit,iobase=0xf4,iosize=0x04", "-serial", "stdio",
    "-display", "none"
]

Now QEMU runs completely in the background and no window gets opened anymore. This is not only less annoying, but also allows our test framework to run in environments without a graphical user interface, such as CI services or SSH connections.

Timeouts

Since cargo test waits until the test runner exits, a test that never returns can block the test runner forever. That’s unfortunate, but not a big problem in practice since it’s usually easy to avoid endless loops. In our case, however, endless loops can occur in various situations:

The bootloader fails to load our kernel, which causes the system to reboot endlessly.
The BIOS/UEFI firmware fails to load the bootloader, which causes the same endless rebooting.
The CPU enters a loop {} statement at the end of some of our functions, for example because the QEMU exit device doesn’t work properly.
The hardware causes a system reset, for example when a CPU exception is not caught (explained in a future post).

Since endless loops can occur in so many situations, the bootimage tool sets a timeout of 5 minutes for each test executable by default. If the test does not finish within this time, it is marked as failed and a “Timed Out” error is printed to the console. This feature ensures that tests that are stuck in an endless loop don’t block cargo test forever.

You can try it yourself by adding a loop {} statement in the trivial_assertion test. When you run cargo test, you see that the test is marked as timed out after 5 minutes. The timeout duration is configurable through a test-timeout key in the Cargo.toml:

# in Cargo.toml

[package.metadata.bootimage]
test-timeout = 300          # (in seconds)

If you don’t want to wait 5 minutes for the trivial_assertion test to time out, you can temporarily decrease the above value.
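The general idea behind such a timeout can be sketched on the host with a helper thread and a channel. Note that this is only an illustration of the technique, not how bootimage implements it; the function name run_with_timeout is hypothetical, and a real runner would additionally have to kill the stuck QEMU process.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `job` on a helper thread and waits at most `timeout` for its result.
/// Returns `None` if the job did not finish in time (or panicked).
fn run_with_timeout<T, F>(job: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the receiver has already given up.
        let _ = sender.send(job());
    });
    // Wait for the result, but give up once the timeout has elapsed.
    receiver.recv_timeout(timeout).ok()
}
```

A job that returns quickly yields Some(result), while a job that is stuck in an endless loop makes the function return None once the timeout has passed. In this sketch, the stuck helper thread simply keeps running until the process exits.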

Insert Printing Automatically

Our trivial_assertion test currently needs to print its own status information using serial_print!/serial_println!:

#[test_case]
fn trivial_assertion() {
    serial_print!("trivial assertion... ");
    assert_eq!(1, 1);
    serial_println!("[ok]");
}

Manually adding these print statements for every test we write is cumbersome, so let’s update our test_runner to print these messages automatically. To do that, we need to create a new Testable trait:

// in src/main.rs

pub trait Testable {
    fn run(&self);
}

The trick now is to implement this trait for all types T that implement the Fn() trait:

// in src/main.rs

impl<T> Testable for T
where
    T: Fn(),
{
    fn run(&self) {
        serial_print!("{}...\t", core::any::type_name::<T>());
        self();
        serial_println!("[ok]");
    }
}

We implement the run function by first printing the function name using the any::type_name function. This function is implemented directly in the compiler and returns a string description of every type. For functions, the type is their name, so this is exactly what we want in this case. The \t character is the tab character, which adds some alignment to the [ok] messages.

After printing the function name, we invoke the test function through self(). This only works because we require that self implements the Fn() trait. After the test function returns, we print [ok] to indicate that the function did not panic.

The last step is to update our test_runner to use the new Testable trait:

// in src/main.rs

#[cfg(test)]
pub fn test_runner(tests: &[&dyn Testable]) { // new
    serial_println!("Running {} tests", tests.len());
    for test in tests {
        test.run(); // new
    }
    exit_qemu(QemuExitCode::Success);
}

The only two changes are the type of the tests argument from &[&dyn Fn()] to &[&dyn Testable] and the fact that we now call test.run() instead of test().
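If you want to experiment with the blanket implementation outside of the kernel, the following host-side sketch models it with std. It differs from our kernel code in two labeled ways: it prints with print!/println! instead of the serial macros, and run returns the test name so that the result can be inspected (an addition made only for this example). The run_all function is likewise a hypothetical stand-in for test_runner.

```rust
use std::any::type_name;

/// Host-side stand-in for the kernel's `Testable` trait. Unlike the kernel
/// version, `run` returns the test name so that the caller can inspect it.
trait Testable {
    fn run(&self) -> &'static str;
}

// Blanket implementation for everything that can be called like `fn()`.
impl<T: Fn()> Testable for T {
    fn run(&self) -> &'static str {
        let name = type_name::<T>();
        print!("{}...\t", name);
        self();
        println!("[ok]");
        name
    }
}

/// Runs every test in the slice and collects the reported names.
fn run_all(tests: &[&dyn Testable]) -> Vec<&'static str> {
    tests.iter().map(|test| test.run()).collect()
}

fn trivial_assertion() {
    assert_eq!(1, 1);
}
```

Calling run_all(&[&trivial_assertion]) prints the path of the function followed by [ok]. The exact string returned by type_name is not guaranteed by the standard library, so it is best used for diagnostic output only.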

We can now remove the print statements from our trivial_assertion test since they’re now printed automatically:

// in src/main.rs

#[test_case]
fn trivial_assertion() {
    assert_eq!(1, 1);
}

The cargo test output now looks like this:

Running 1 tests
blog_os::trivial_assertion...	[ok]

The function name now includes the full path to the function, which is useful when test functions in different modules have the same name. Otherwise, the output looks the same as before, but we no longer need to add print statements to our tests manually.

Testing the VGA Buffer

Now that we have a working test framework, we can create a few tests for our VGA buffer implementation. First, we create a very simple test to verify that println works without panicking:

// in src/vga_buffer.rs

#[test_case]
fn test_println_simple() {
    println!("test_println_simple output");
}

The test just prints something to the VGA buffer. If it finishes without panicking, it means that the println invocation did not panic either.

To ensure that no panic occurs even if many lines are printed and lines are shifted off the screen, we can create another test:

// in src/vga_buffer.rs

#[test_case]
fn test_println_many() {
    for _ in 0..200 {
        println!("test_println_many output");
    }
}

We can also create a test function to verify that the printed lines really appear on the screen:

// in src/vga_buffer.rs

#[test_case]
fn test_println_output() {
    let s = "Some test string that fits on a single line";
    println!("{}", s);
    for (i, c) in s.chars().enumerate() {
        let screen_char = WRITER.lock().buffer.chars[BUFFER_HEIGHT - 2][i].read();
        assert_eq!(char::from(screen_char.ascii_character), c);
    }
}

The function defines a test string, prints it using println, and then iterates over the screen characters of the static WRITER, which represents the VGA text buffer. Since println prints to the last screen line and then immediately appends a newline, the string should appear on line BUFFER_HEIGHT - 2.

By using enumerate, we count the number of iterations in the variable i, which we then use for loading the screen character corresponding to c. By comparing the ascii_character of the screen character with c, we ensure that each character of the string really appears in the VGA text buffer.
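To see why the string ends up on line BUFFER_HEIGHT - 2, it can help to play through the behavior with a simplified model of the writer that runs on the host. The Screen type below is a made-up stand-in for our Writer: it stores plain bytes without colors or volatile access and assumes the usual VGA text mode dimensions of 25 rows and 80 columns.

```rust
const BUFFER_HEIGHT: usize = 25;
const BUFFER_WIDTH: usize = 80;

/// Simplified stand-in for the VGA writer (no colors, no volatile access).
struct Screen {
    chars: [[u8; BUFFER_WIDTH]; BUFFER_HEIGHT],
    column: usize,
}

impl Screen {
    fn new() -> Self {
        Screen { chars: [[b' '; BUFFER_WIDTH]; BUFFER_HEIGHT], column: 0 }
    }

    /// Writes to the last row and starts a new line when the row is full.
    fn write_byte(&mut self, byte: u8) {
        if byte == b'\n' {
            self.new_line();
            return;
        }
        if self.column >= BUFFER_WIDTH {
            self.new_line();
        }
        self.chars[BUFFER_HEIGHT - 1][self.column] = byte;
        self.column += 1;
    }

    /// Shifts every row up by one and clears the last row.
    fn new_line(&mut self) {
        for row in 1..BUFFER_HEIGHT {
            self.chars[row - 1] = self.chars[row];
        }
        self.chars[BUFFER_HEIGHT - 1] = [b' '; BUFFER_WIDTH];
        self.column = 0;
    }

    /// Models `println!`: the text followed by a newline.
    fn println(&mut self, text: &str) {
        for byte in text.bytes() {
            self.write_byte(byte);
        }
        self.write_byte(b'\n');
    }
}
```

After a single println call on a fresh Screen, the text sits in row BUFFER_HEIGHT - 2 because the trailing newline has already shifted it up by one row, and the last row is empty again.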

As you can imagine, we could create many more test functions. For example, a function that tests that no panic occurs when printing very long lines and that they’re wrapped correctly, or a function for testing that newlines, non-printable characters, and non-unicode characters are handled correctly.

For the rest of this post, however, we will explain how to create integration tests to test the interaction of different components together.

Integration Tests

The convention for integration tests in Rust is to put them into a tests directory in the project root (i.e., next to the src directory). Both the default test framework and custom test frameworks will automatically pick up and execute all tests in that directory.

All integration tests are their own executables and completely separate from our main.rs. This means that each test needs to define its own entry point function. Let’s create an example integration test named basic_boot to see how it works in detail:

// in tests/basic_boot.rs

#![no_std]
#![no_main]
#![feature(custom_test_frameworks)]
#![test_runner(crate::test_runner)]
#![reexport_test_harness_main = "test_main"]

use core::panic::PanicInfo;

#[no_mangle] // don't mangle the name of this function
pub extern "C" fn _start() -> ! {
    test_main();

    loop {}
}

fn test_runner(tests: &[&dyn Fn()]) {
    unimplemented!();
}

#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    loop {}
}

Since integration tests are separate executables, we need to provide all the crate attributes (no_std, no_main, test_runner, etc.) again. We also need to create a new entry point function _start, which calls the test entry point function test_main. We don’t need any cfg(test) attributes because integration test executables are never built in non-test mode.

We use the unimplemented macro that always panics as a placeholder for the test_runner function and just loop in the panic handler for now. Ideally, we want to implement these functions exactly as we did in our main.rs using the serial_println macro and the exit_qemu function. The problem is that we don’t have access to these functions since tests are built completely separately from our main.rs executable.

If you run cargo test at this stage, you will get an endless loop because the panic handler loops endlessly. You need to use the ctrl+c keyboard shortcut for exiting QEMU.

Create a Library

To make the required functions available to our integration test, we need to split off a library from our main.rs, which can be included by other crates and integration test executables. To do this, we create a new src/lib.rs file:

// src/lib.rs

#![no_std]

Like the main.rs, the lib.rs is a special file that is automatically recognized by cargo. The library is a separate compilation unit, so we need to specify the #![no_std] attribute again.

To make our library work with cargo test, we need to also move the test functions and attributes from main.rs to lib.rs:

// in src/lib.rs

#![cfg_attr(test, no_main)]
#![feature(custom_test_frameworks)]
#![test_runner(crate::test_runner)]
#![reexport_test_harness_main = "test_main"]

use core::panic::PanicInfo;

pub trait Testable {
    fn run(&self) -> ();
}

impl<T> Testable for T
where
    T: Fn(),
{
    fn run(&self) {
        serial_print!("{}...\t", core::any::type_name::<T>());
        self();
        serial_println!("[ok]");
    }
}

pub fn test_runner(tests: &[&dyn Testable]) {
    serial_println!("Running {} tests", tests.len());
    for test in tests {
        test.run();
    }
    exit_qemu(QemuExitCode::Success);
}

pub fn test_panic_handler(info: &PanicInfo) -> ! {
    serial_println!("[failed]\n");
    serial_println!("Error: {}\n", info);
    exit_qemu(QemuExitCode::Failed);
    loop {}
}

/// Entry point for `cargo test`
#[cfg(test)]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    test_main();
    loop {}
}

#[cfg(test)]
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    test_panic_handler(info)
}

To make our test_runner available to executables and integration tests, we make it public and don’t apply the cfg(test) attribute to it. We also factor out the implementation of our panic handler into a public test_panic_handler function, so that it is available for executables too.

Since our lib.rs is tested independently of our main.rs, we need to add a _start entry point and a panic handler when the library is compiled in test mode. By using the cfg_attr crate attribute, we conditionally enable the no_main attribute in this case.
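To see what the blanket Testable implementation does without booting QEMU, here is a hosted sketch of the same pattern. It swaps serial_print and serial_println for the standard `print!` and `println!` macros, and the `name` method, the return value of `test_runner`, and the `trivial_assertion` function are additions made only for this illustration.

```rust
use std::any::type_name;

pub trait Testable {
    /// Name that the runner prints before running the test.
    fn name(&self) -> &'static str;
    fn run(&self);
}

// Every plain function or closure without arguments is a test.
impl<T: Fn()> Testable for T {
    fn name(&self) -> &'static str {
        // For a function item, this is the path of the function.
        type_name::<T>()
    }

    fn run(&self) {
        print!("{}...\t", self.name());
        self();
        println!("[ok]");
    }
}

/// Hosted stand-in for the kernel's runner; returns how many tests ran
/// instead of exiting QEMU.
pub fn test_runner(tests: &[&dyn Testable]) -> usize {
    println!("Running {} tests", tests.len());
    for test in tests {
        test.run();
    }
    tests.len()
}

/// Example test function (hypothetical).
pub fn trivial_assertion() {
    assert_eq!(1, 1);
}
```

Calling `test_runner(&[&trivial_assertion])` prints the function path followed by `[ok]`. The exact string returned by `type_name` is not guaranteed by the standard library, so it should only be used for diagnostics like this.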

We also move over the QemuExitCode enum and the exit_qemu function and make them public:

// in src/lib.rs

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum QemuExitCode {
    Success = 0x10,
    Failed = 0x11,
}

pub fn exit_qemu(exit_code: QemuExitCode) {
    use x86_64::instructions::port::Port;

    unsafe {
        let mut port = Port::new(0xf4);
        port.write(exit_code as u32);
    }
}
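As a side note on these values: QEMU's isa-debug-exit device does not use the written value directly as the process exit status but reports `(value << 1) | 1`. The sketch below computes that status for both variants. `qemu_exit_status` is a hypothetical helper for illustration and is not part of the kernel.

```rust
/// Exit codes written to the `isa-debug-exit` port, as defined in the post.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum QemuExitCode {
    Success = 0x10,
    Failed = 0x11,
}

/// Exit status that the QEMU process reports after `code` was written
/// to the `isa-debug-exit` device.
pub fn qemu_exit_status(code: QemuExitCode) -> u32 {
    ((code as u32) << 1) | 1
}
```

With these definitions, `Success` maps to exit status 33 and `Failed` to 35, so neither collides with QEMU's own status codes 0 and 1.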

Now executables and integration tests can import these functions from the library and don’t need to define their own implementations. To also make println and serial_println available, we move the module declarations too:

// in src/lib.rs

pub mod serial;
pub mod vga_buffer;

We make the modules public to make them usable outside of our library. This is also required for making our println and serial_println macros usable since they use the _print functions of the modules.

Now we can update our main.rs to use the library:

// in src/main.rs

#![no_std]
#![no_main]
#![feature(custom_test_frameworks)]
#![test_runner(blog_os::test_runner)]
#![reexport_test_harness_main = "test_main"]

use core::panic::PanicInfo;
use blog_os::println;

#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    #[cfg(test)]
    test_main();

    loop {}
}

/// This function is called on panic.
#[cfg(not(test))]
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    println!("{}", info);
    loop {}
}

#[cfg(test)]
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    blog_os::test_panic_handler(info)
}

The library is usable like a normal external crate. It is called blog_os, like our crate. The above code uses the blog_os::test_runner function in the test_runner attribute and the blog_os::test_panic_handler function in our cfg(test) panic handler. It also imports the println macro to make it available to our _start and panic functions.

At this point, cargo run and cargo test should work again. Of course, cargo test still loops endlessly (you can exit with ctrl+c). Let’s fix this by using the required library functions in our integration test.

Completing the Integration Test

Like our src/main.rs, our tests/basic_boot.rs executable can import types from our new library. This allows us to import the missing components to complete our test:

// in tests/basic_boot.rs

#![test_runner(blog_os::test_runner)]

#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    blog_os::test_panic_handler(info)
}

Instead of reimplementing the test runner, we use the test_runner function from our library by changing the #![test_runner(crate::test_runner)] attribute to #![test_runner(blog_os::test_runner)]. We then don’t need the test_runner stub function in basic_boot.rs anymore, so we can remove it. For our panic handler, we call the blog_os::test_panic_handler function like we did in our main.rs.

Now cargo test exits normally again. When you run it, you will see that it builds and runs the tests for our lib.rs, main.rs, and basic_boot.rs separately, one after another. For the main.rs and the basic_boot integration tests, it reports “Running 0 tests” since these files don’t have any functions annotated with #[test_case].

We can now add tests to our basic_boot.rs. For example, we can test that println works without panicking, like we did in the VGA buffer tests:

// in tests/basic_boot.rs

use blog_os::println;

#[test_case]
fn test_println() {
    println!("test_println output");
}

When we run cargo test now, we see that it finds and executes the test function.

The test might seem a bit useless right now since it’s almost identical to one of the VGA buffer tests. However, in the future, the _start functions of our main.rs and lib.rs might grow and call various initialization routines before running the test_main function, so that the two tests are executed in very different environments.

By testing println in a basic_boot environment without calling any initialization routines in _start, we can ensure that println works right after booting. This is important because we rely on it, e.g., for printing panic messages.

Future Tests

The power of integration tests is that they’re treated as completely separate executables. This gives them complete control over the environment, which makes it possible to test that the code interacts correctly with the CPU or hardware devices.

Our basic_boot test is a very simple example of an integration test. In the future, our kernel will become much more featureful and interact with the hardware in various ways. By adding integration tests, we can ensure that these interactions work (and keep working) as expected. Some ideas for possible future tests are:

CPU Exceptions: When the code performs invalid operations (e.g., divides by zero), the CPU throws an exception. The kernel can register handler functions for such exceptions. An integration test could verify that the correct exception handler is called when a CPU exception occurs or that the execution continues correctly after a resolvable exception.
Page Tables: Page tables define which memory regions are valid and accessible. By modifying the page tables, it is possible to allocate new memory regions, for example when launching programs. An integration test could modify the page tables in the _start function and verify that the modifications have the desired effects in #[test_case] functions.
Userspace Programs: Userspace programs are programs with limited access to the system’s resources. For example, they don’t have access to kernel data structures or to the memory of other programs. An integration test could launch userspace programs that perform forbidden operations and verify that the kernel prevents them all.

As you can imagine, many more tests are possible. By adding such tests, we can ensure that we don’t break them accidentally when we add new features to our kernel or refactor our code. This is especially important when our kernel becomes larger and more complex.

Tests that Should Panic

The test framework of the standard library supports a #[should_panic] attribute that allows constructing tests that should fail. This is useful, for example, to verify that a function fails when an invalid argument is passed. Unfortunately, this attribute isn’t supported in #[no_std] crates since it requires support from the standard library.

While we can’t use the #[should_panic] attribute in our kernel, we can get similar behavior by creating an integration test that exits with a success exit code from the panic handler. Let’s start creating such a test with the name should_panic:

// in tests/should_panic.rs

#![no_std]
#![no_main]

use core::panic::PanicInfo;
use blog_os::{QemuExitCode, exit_qemu, serial_println};

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    serial_println!("[ok]");
    exit_qemu(QemuExitCode::Success);
    loop {}
}

This test is still incomplete as it doesn’t define a _start function or any of the custom test runner attributes yet. Let’s add the missing parts:

// in tests/should_panic.rs

#![feature(custom_test_frameworks)]
#![test_runner(test_runner)]
#![reexport_test_harness_main = "test_main"]

#[no_mangle]
pub extern "C" fn _start() -> ! {
    test_main();

    loop {}
}

pub fn test_runner(tests: &[&dyn Fn()]) {
    serial_println!("Running {} tests", tests.len());
    for test in tests {
        test();
        serial_println!("[test did not panic]");
        exit_qemu(QemuExitCode::Failed);
    }
    exit_qemu(QemuExitCode::Success);
}

Instead of reusing the test_runner from our lib.rs, the test defines its own test_runner function that exits with a failure exit code when a test returns without panicking (we want our tests to panic). If no test function is defined, the runner exits with a success exit code. Since the runner always exits after running a single test, it does not make sense to define more than one #[test_case] function.
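The inverted pass/fail logic of this runner can be tried out on a hosted target. The following sketch is only an analogy: it uses `std::panic::catch_unwind` in place of the kernel's panic handler, which is not available in our no_std kernel, and `Outcome`, `should_panic_runner`, and the two example functions are hypothetical names.

```rust
use std::panic;

/// Stand-in for the QEMU exit code chosen by the runner or panic handler.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Success,
    Failed,
}

/// Hosted analogue of the `should_panic` test runner. Like the kernel
/// version, it decides after the first test and ignores all others.
pub fn should_panic_runner(tests: &[fn()]) -> Outcome {
    match tests.first() {
        // No test defined: nothing can fail.
        None => Outcome::Success,
        // `catch_unwind` plays the role of the panic handler here.
        Some(test) => match panic::catch_unwind(*test) {
            Err(_) => Outcome::Success, // the test panicked as desired
            Ok(()) => Outcome::Failed,  // "[test did not panic]"
        },
    }
}

/// Example test that panics.
pub fn should_fail() {
    assert_eq!(0, 1);
}

/// Example test that returns normally.
pub fn does_not_panic() {}
```

Note that the panic message of `should_fail` is still printed to stderr by the default panic hook, even though the panic is caught.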

Now we can create a test that should fail:

// in tests/should_panic.rs

use blog_os::serial_print;

#[test_case]
fn should_fail() {
    serial_print!("should_panic::should_fail...\t");
    assert_eq!(0, 1);
}

The test uses assert_eq to assert that 0 and 1 are equal. Of course, this fails, so our test panics as desired. Note that we need to manually print the function name using serial_print! here because we don’t use the Testable trait.

When we run the test through cargo test --test should_panic we see that it is successful because the test panicked as expected. When we comment out the assertion and run the test again, we see that it indeed fails with the “test did not panic” message.

A significant drawback of this approach is that it only works for a single test function. With multiple #[test_case] functions, only the first function is executed because the execution cannot continue after the panic handler has been called. I currently don’t know of a good way to solve this problem, so let me know if you have an idea!

No Harness Tests

For integration tests that only have a single test function (like our should_panic test), the test runner isn’t really needed. For cases like this, we can disable the test runner completely and run our test directly in the _start function.

The key to this is to disable the harness flag for the test in the Cargo.toml, which defines whether a test runner is used for an integration test. When it’s set to false, both the default test runner and the custom test runner feature are disabled, so that the test is treated like a normal executable.

Let’s disable the harness flag for our should_panic test:

# in Cargo.toml

[[test]]
name = "should_panic"
harness = false

Now we vastly simplify our should_panic test by removing the test_runner-related code. The result looks like this:

// in tests/should_panic.rs

#![no_std]
#![no_main]

use core::panic::PanicInfo;
use blog_os::{exit_qemu, serial_print, serial_println, QemuExitCode};

#[no_mangle]
pub extern "C" fn _start() -> ! {
    should_fail();
    serial_println!("[test did not panic]");
    exit_qemu(QemuExitCode::Failed);
    loop {}
}

fn should_fail() {
    serial_print!("should_panic::should_fail...\t");
    assert_eq!(0, 1);
}

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    serial_println!("[ok]");
    exit_qemu(QemuExitCode::Success);
    loop {}
}

We now call the should_fail function directly from our _start function and exit with a failure exit code if it returns. When we run cargo test --test should_panic now, we see that the test behaves exactly as before.

Apart from creating should_panic tests, disabling the harness attribute can also be useful for complex integration tests, for example, when the individual test functions have side effects and need to be run in a specified order.

Summary

Testing is a very useful technique to ensure that certain components have the desired behavior. Even if tests cannot show the absence of bugs, they’re still a useful tool for finding them and especially for avoiding regressions.

This post explained how to set up a test framework for our Rust kernel. We used Rust’s custom test frameworks feature to implement support for a simple #[test_case] attribute in our bare-metal environment. Using the isa-debug-exit device of QEMU, our test runner can exit QEMU after running the tests and report the test status. To print error messages to the console instead of the VGA buffer, we created a basic driver for the serial port.

After creating some tests for our println macro, we explored integration tests in the second half of the post. We learned that they live in the tests directory and are treated as completely separate executables. To give them access to the exit_qemu function and the serial_println macro, we moved most of our code into a library that can be imported by all executables and integration tests. Since integration tests run in their own separate environment, they make it possible to test interactions with the hardware or to create tests that should panic.

We now have a test framework that runs in a realistic environment inside QEMU. By creating more tests in future posts, we can keep our kernel maintainable when it becomes more complex.

What’s next?

In the next post, we will explore CPU exceptions. These exceptions are thrown by the CPU when something illegal happens, such as a division by zero or an access to an unmapped memory page (a so-called “page fault”). Being able to catch and examine these exceptions is very important for debugging future errors. Exception handling is also very similar to the handling of hardware interrupts, which is required for keyboard support.

Paging Implementation

Thu, 14 Mar 2019 00:00:00 +0000

This post shows how to implement paging support in our kernel. It first explores different techniques to make the physical page table frames accessible to the kernel and discusses their respective advantages and drawbacks. It then implements an address translation function and a function to create a new mapping.

Introduction

The previous post gave an introduction to the concept of paging. It motivated paging by comparing it with segmentation, explained how paging and page tables work, and then introduced the 4-level page table design of x86_64. We found out that the bootloader already set up a page table hierarchy for our kernel, which means that our kernel already runs on virtual addresses. This improves safety since illegal memory accesses cause page fault exceptions instead of modifying arbitrary physical memory.

The post ended with the problem that we can’t access the page tables from our kernel because they are stored in physical memory and our kernel already runs on virtual addresses. This post explores different approaches to making the page table frames accessible to our kernel. We will discuss the advantages and drawbacks of each approach and then decide on an approach for our kernel.

To implement the approach, we will need support from the bootloader, so we’ll configure it first. Afterward, we will implement a function that traverses the page table hierarchy in order to translate virtual to physical addresses. Finally, we learn how to create new mappings in the page tables and how to find unused memory frames for creating new page tables.

Accessing Page Tables

Accessing the page tables from our kernel is not as easy as it may seem. To understand the problem, let’s take a look at the example 4-level page table hierarchy from the previous post again:

The important thing here is that each page table entry stores the physical address of the next table. This avoids the need to run a translation for these addresses too, which would be bad for performance and could easily cause endless translation loops.

The problem for us is that we can’t directly access physical addresses from our kernel since our kernel also runs on top of virtual addresses. For example, when we access address 4 KiB we access the virtual address 4 KiB, not the physical address 4 KiB where the level 4 page table is stored. When we want to access the physical address 4 KiB, we can only do so through some virtual address that maps to it.

So in order to access page table frames, we need to map some virtual pages to them. There are different ways to create these mappings that all allow us to access arbitrary page table frames.

Identity Mapping

A simple solution is to identity map all page tables:

In this example, we see various identity-mapped page table frames. This way, the physical addresses of page tables are also valid virtual addresses so that we can easily access the page tables of all levels starting from the CR3 register.

However, it clutters the virtual address space and makes it more difficult to find contiguous memory regions of larger sizes. For example, imagine that we want to create a virtual memory region of size 1000 KiB in the above graphic, e.g., for memory-mapping a file. We can’t start the region at 28 KiB because it would collide with the already mapped page at 1004 KiB. So we have to look further until we find a large enough unmapped area, for example at 1008 KiB. This is a fragmentation problem similar to the one we saw with segmentation.

Equally, it makes it much more difficult to create new page tables because we need to find physical frames whose corresponding pages aren’t already in use. For example, let’s assume that we reserved the virtual 1000 KiB memory region starting at 1008 KiB for our memory-mapped file. Now we can’t use any frame with a physical address between 1008 KiB and 2008 KiB anymore, because we can’t identity map it.
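The search for a free region described above can be sketched as a first-fit scan over the already mapped pages. This is only an illustration: `find_free_region` is a hypothetical helper, and apart from the page at 1004 KiB, the mapped pages used with it are made-up values.

```rust
const KIB: u64 = 1024;
const PAGE_SIZE: u64 = 4 * KIB;

/// First-fit search: returns the lowest start address `>= from` such that
/// the region `start..start + size` contains none of the `mapped_pages`.
/// All addresses are assumed to be page-aligned.
pub fn find_free_region(mapped_pages: &[u64], from: u64, size: u64) -> u64 {
    let mut start = from;
    loop {
        // Highest mapped page that overlaps the candidate region, if any.
        let collision = mapped_pages
            .iter()
            .cloned()
            .filter(|&page| page + PAGE_SIZE > start && page < start + size)
            .max();
        match collision {
            // No region can start at or below the colliding page,
            // so continue the search right behind it.
            Some(page) => start = page + PAGE_SIZE,
            None => return start,
        }
    }
}
```

With identity-mapped pages at, say, 4 KiB, 16 KiB, 24 KiB, and 1004 KiB, `find_free_region(&pages, 28 * KIB, 1000 * KIB)` skips the candidate at 28 KiB and returns 1008 KiB.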

Map at a Fixed Offset

To avoid the problem of cluttering the virtual address space, we can use a separate memory region for page table mappings. So instead of identity mapping page table frames, we map them at a fixed offset in the virtual address space. For example, the offset could be 10 TiB:

By using the virtual memory in the range 10 TiB..(10 TiB + physical memory size) exclusively for page table mappings, we avoid the collision problems of the identity mapping. Reserving such a large region of the virtual address space is only possible if the virtual address space is much larger than the physical memory size. This isn’t a problem on x86_64 since the 48-bit address space is 256 TiB large.
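With a fixed offset, converting between a physical address and the virtual address that maps it is plain arithmetic. The sketch below uses the 10 TiB example offset from the text. The function and constant names are hypothetical and chosen only for this illustration.

```rust
const KIB: u64 = 1 << 10;
const TIB: u64 = 1 << 40;

/// Example offset from the text; any unused region of the virtual
/// address space would work as well.
pub const PAGE_TABLE_OFFSET: u64 = 10 * TIB;

/// Virtual address through which the physical address `phys` is reachable.
pub fn phys_to_virt(phys: u64) -> u64 {
    PAGE_TABLE_OFFSET + phys
}

/// Inverse direction. Returns `None` if `virt` lies outside the window
/// `10 TiB..(10 TiB + phys_mem_size)`.
pub fn virt_to_phys(virt: u64, phys_mem_size: u64) -> Option<u64> {
    let phys = virt.checked_sub(PAGE_TABLE_OFFSET)?;
    if phys < phys_mem_size {
        Some(phys)
    } else {
        None
    }
}

/// Size of the 48-bit virtual address space in TiB.
pub fn address_space_tib() -> u64 {
    (1u64 << 48) / TIB
}
```

For example, a level 4 table stored in the physical frame at 4 KiB would be accessed through the virtual address `10 TiB + 4 KiB`.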

This approach still has the disadvantage that we need to create a new mapping whenever we create a new page table. Also, it does not allow accessing page tables of other address spaces, which would be useful when creating a new process.

Map the Complete Physical Memory

We can solve these problems by mapping the complete physical memory instead of only page table frames:

This approach allows our kernel to access arbitrary physical memory, including page table frames of other address spaces. The reserved virtual memory range has the same size as before, with the difference that it no longer contains unmapped pages.

The disadvantage of this approach is that additional page tables are needed for storing the mapping of the physical memory. These page tables need to be stored somewhere, so they use up a part of physical memory, which can be a problem on devices with a small amount of memory.

On x86_64, however, we can use huge pages with a size of 2 MiB for the mapping, instead of the default 4 KiB pages. This way, mapping 32 GiB of physical memory only requires 132 KiB for page tables since only one level 3 table and 32 level 2 tables are needed. Huge pages are also more cache efficient since they use fewer entries in the translation lookaside buffer (TLB).

Temporary Mapping
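The 132 KiB figure can be reproduced with a short calculation. The sketch below counts the page tables below the level 4 table that are needed to map a given amount of physical memory, assuming 512 entries per table and 4 KiB per table. `table_count` and `table_bytes` are hypothetical helper names.

```rust
const KIB: u64 = 1 << 10;
const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;
const ENTRIES_PER_TABLE: u64 = 512;
const TABLE_SIZE: u64 = 4 * KIB;

fn div_ceil(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}

/// Number of page tables below level 4 needed to map `phys_mem` bytes.
/// `table_levels` is 2 for 2 MiB huge pages (level 2 and level 3 tables)
/// and 3 for 4 KiB pages (level 1, level 2, and level 3 tables).
pub fn table_count(phys_mem: u64, page_size: u64, table_levels: u32) -> u64 {
    let mut entries = div_ceil(phys_mem, page_size);
    let mut tables = 0;
    for _ in 0..table_levels {
        // Each table holds 512 entries. The tables of this level are in
        // turn the entries of the next higher level.
        entries = div_ceil(entries, ENTRIES_PER_TABLE);
        tables += entries;
    }
    tables
}

/// Physical memory occupied by these tables.
pub fn table_bytes(phys_mem: u64, page_size: u64, table_levels: u32) -> u64 {
    table_count(phys_mem, page_size, table_levels) * TABLE_SIZE
}
```

For 32 GiB and 2 MiB pages, this yields 32 level 2 tables plus one level 3 table, i.e. 33 tables or 132 KiB. The same mapping with 4 KiB pages would need 16417 tables, which is roughly 64 MiB.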

For devices with very small amounts of physical memory, we could map the page table frames only temporarily when we need to access them. To be able to create the temporary mappings, we only need a single identity-mapped level 1 table:

The level 1 table in this graphic controls the first 2 MiB of the virtual address space. This is because it is reachable by starting at the CR3 register and following the 0th entry in the level 4, level 3, and level 2 page tables. The entry with index 8 maps the virtual page at address 32 KiB to the physical frame at address 32 KiB, thereby identity mapping the level 1 table itself. The graphic shows this identity-mapping by the horizontal arrow at 32 KiB.

By writing to the identity-mapped level 1 table, our kernel can create up to 511 temporary mappings (512 minus the entry required for the identity mapping). In the above example, the kernel created two temporary mappings:

By mapping the 0th entry of the level 1 table to the frame with address 24 KiB, it created a temporary mapping of the virtual page at 0 KiB to the physical frame of the level 2 page table, indicated by the dashed arrow.
By mapping the 9th entry of the level 1 table to the frame with address 4 KiB, it created a temporary mapping of the virtual page at 36 KiB to the physical frame of the level 4 page table, indicated by the dashed arrow.

Now the kernel can access the level 2 page table by writing to page 0 KiB and the level 4 page table by writing to page 36 KiB.

The process for accessing an arbitrary page table frame with temporary mappings would be:

Search for a free entry in the identity-mapped level 1 table.
Map that entry to the physical frame of the page table that we want to access.
Access the target frame through the virtual page that maps to the entry.
Set the entry back to unused, thereby removing the temporary mapping again.
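The four steps above can be sketched with a small model of the level 1 table. This is a simulation on plain Rust data, not code that touches real page tables. `TempMapper` and its methods are hypothetical names, and the model simply takes the first free entry.

```rust
const ENTRY_COUNT: usize = 512;
const PAGE_SIZE: u64 = 4096;

/// Model of the identity-mapped level 1 table covering the first 2 MiB.
pub struct TempMapper {
    /// `entries[i]` is the frame address that the page `i * 4 KiB` maps to.
    entries: [Option<u64>; ENTRY_COUNT],
    /// Index of the entry that identity-maps the table itself.
    identity_index: usize,
}

impl TempMapper {
    /// `table_frame` is the physical address of the level 1 table. It must
    /// be page-aligned and below 2 MiB so that the table can map itself.
    pub fn new(table_frame: u64) -> Self {
        assert!(table_frame % PAGE_SIZE == 0);
        assert!(table_frame < ENTRY_COUNT as u64 * PAGE_SIZE);
        let identity_index = (table_frame / PAGE_SIZE) as usize;
        let mut entries = [None; ENTRY_COUNT];
        entries[identity_index] = Some(table_frame);
        TempMapper { entries, identity_index }
    }

    pub fn free_entries(&self) -> usize {
        self.entries.iter().filter(|e| e.is_none()).count()
    }

    /// Steps 1 and 2: find a free entry and map it to `frame`.
    /// Returns the address of the virtual page that now maps to `frame`.
    pub fn map(&mut self, frame: u64) -> Option<u64> {
        let index = self.entries.iter().position(|e| e.is_none())?;
        self.entries[index] = Some(frame);
        Some(index as u64 * PAGE_SIZE)
    }

    /// Step 3 (modelled): the frame that an access to `page` would reach.
    pub fn translate(&self, page: u64) -> Option<u64> {
        self.entries.get((page / PAGE_SIZE) as usize).cloned().unwrap_or(None)
    }

    /// Step 4: remove the temporary mapping again. The identity mapping
    /// of the table itself is never removed.
    pub fn unmap(&mut self, page: u64) {
        let index = (page / PAGE_SIZE) as usize;
        if index < ENTRY_COUNT && index != self.identity_index {
            self.entries[index] = None;
        }
    }
}
```

A table placed at 32 KiB starts with 511 free entries. Mapping the frame at 24 KiB then makes it reachable through page 0 KiB until `unmap` is called.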

This approach reuses the same 512 virtual pages for creating the mappings and thus requires only 4 KiB of physical memory. The drawback is that it is a bit cumbersome, especially since a new mapping might require modifications to multiple table levels, which means that we would need to repeat the above process multiple times.

Recursive Page Tables

Another interesting approach, which requires no additional page tables at all, is to map the page table recursively. The idea behind this approach is to map an entry from the level 4 page table to the level 4 table itself. By doing this, we effectively reserve a part of the virtual address space and map all current and future page table frames to that space.

Let’s go through an example to understand how this all works:

The only difference to the example at the beginning of this post is the additional entry at index 511 in the level 4 table, which is mapped to physical frame 4 KiB, the frame of the level 4 table itself.

By letting the CPU follow this entry on a translation, it doesn’t reach a level 3 table but the same level 4 table again. This is similar to a recursive function that calls itself, therefore this table is called a recursive page table. The important thing is that the CPU assumes that every entry in the level 4 table points to a level 3 table, so it now treats the level 4 table as a level 3 table. This works because tables of all levels have the exact same layout on x86_64.

By following the recursive entry one or multiple times before we start the actual translation, we can effectively shorten the number of levels that the CPU traverses. For example, if we follow the recursive entry once and then proceed to the level 3 table, the CPU thinks that the level 3 table is a level 2 table. Going further, it treats the level 2 table as a level 1 table and the level 1 table as the mapped frame. This means that we can now read and write the level 1 page table because the CPU thinks that it is the mapped frame. The graphic below illustrates the five translation steps:

Similarly, we can follow the recursive entry twice before starting the translation to reduce the number of traversed levels to two:

Let’s go through it step by step: First, the CPU follows the recursive entry on the level 4 table and thinks that it reaches a level 3 table. Then it follows the recursive entry again and thinks that it reaches a level 2 table. But in reality, it is still on the level 4 table. When the CPU now follows a different entry, it lands on a level 3 table but thinks it is already on a level 1 table. So while the next entry points to a level 2 table, the CPU thinks that it points to the mapped frame, which allows us to read and write the level 2 table.

Accessing the tables of levels 3 and 4 works in the same way. To access the level 3 table, we follow the recursive entry three times, tricking the CPU into thinking it is already on a level 1 table. Then we follow another entry and reach a level 3 table, which the CPU treats as a mapped frame. For accessing the level 4 table itself, we just follow the recursive entry four times until the CPU treats the level 4 table itself as the mapped frame (in blue in the graphic below).

It might take some time to wrap your head around the concept, but it works quite well in practice.
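If it helps, the effect of the recursive entry can be simulated on plain data. In the sketch below, frames are identified by small numbers instead of physical addresses, and `walk` follows four table indexes just as the CPU does on a translation. `Memory`, `walk`, and `example_memory` are hypothetical names, and the layout is a minimal made-up hierarchy with a single mapped page.

```rust
const ENTRY_COUNT: usize = 512;
const RECURSIVE_INDEX: usize = 511;

/// A page table: each entry optionally points to another frame.
type Table = [Option<usize>; ENTRY_COUNT];

pub struct Memory {
    /// `frames[id]` is `Some(table)` if the frame holds a page table
    /// and `None` if it is an ordinary data frame.
    frames: Vec<Option<Table>>,
}

impl Memory {
    /// Follows the four indexes starting at the level 4 table and
    /// returns the frame that the "CPU" ends up on.
    pub fn walk(&self, l4_frame: usize, indexes: [usize; 4]) -> Option<usize> {
        let mut frame = l4_frame;
        for &index in indexes.iter() {
            // The walk does not know which level a table belongs to.
            let table = self.frames.get(frame)?.as_ref()?;
            frame = table[index]?;
        }
        Some(frame)
    }
}

/// Frame 0 = level 4, 1 = level 3, 2 = level 2, 3 = level 1 table,
/// frame 4 = the mapped data frame.
pub fn example_memory() -> Memory {
    let mut l4: Table = [None; ENTRY_COUNT];
    let mut l3: Table = [None; ENTRY_COUNT];
    let mut l2: Table = [None; ENTRY_COUNT];
    let mut l1: Table = [None; ENTRY_COUNT];
    l4[0] = Some(1);
    l3[0] = Some(2);
    l2[0] = Some(3);
    l1[0] = Some(4);
    // The recursive entry points back to the level 4 table itself.
    l4[RECURSIVE_INDEX] = Some(0);
    Memory {
        frames: vec![Some(l4), Some(l3), Some(l2), Some(l1), None],
    }
}
```

A normal walk with indexes `[0, 0, 0, 0]` ends on the data frame 4. Each leading 511 shortens the real walk by one level, so `[511, 0, 0, 0]` ends on the level 1 table and `[511, 511, 511, 511]` on the level 4 table itself.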

In the “Address Calculation” section below, we explain how to construct virtual addresses for following the recursive entry one or multiple times. We will not use recursive paging for our implementation, so you don’t need to read it to continue with the post.

Address Calculation

We saw that we can access tables of all levels by following the recursive entry once or multiple times before the actual translation. Since the indexes into the tables of the four levels are derived directly from the virtual address, we need to construct special virtual addresses for this technique. Remember, the page table indexes are derived from the address in the following way:

Let’s assume that we want to access the level 1 page table that maps a specific page. As we learned above, this means that we have to follow the recursive entry once before continuing with the level 4, level 3, and level 2 indexes. To do that, we move each block of the address one block to the right and set the original level 4 index to the index of the recursive entry:

For accessing the level 2 table of that page, we move each index block two blocks to the right and set both the blocks of the original level 4 index and the original level 3 index to the index of the recursive entry:

Accessing the level 3 table works by moving each block three blocks to the right and using the recursive index for the original level 4, level 3, and level 2 address blocks:

Finally, we can access the level 4 table by moving each block four blocks to the right and using the recursive index for all address blocks except for the offset:

We can now calculate virtual addresses for the page tables of all four levels. We can even calculate an address that points exactly to a specific page table entry by multiplying its index by 8, the size of a page table entry.

The table below summarizes the address structure for accessing the different kinds of frames:

Virtual Address for	Address Structure (octal)
Page	`0o_SSSSSS_AAA_BBB_CCC_DDD_EEEE`
Level 1 Table Entry	`0o_SSSSSS_RRR_AAA_BBB_CCC_DDDD`
Level 2 Table Entry	`0o_SSSSSS_RRR_RRR_AAA_BBB_CCCC`
Level 3 Table Entry	`0o_SSSSSS_RRR_RRR_RRR_AAA_BBBB`
Level 4 Table Entry	`0o_SSSSSS_RRR_RRR_RRR_RRR_AAAA`

Here, AAA is the level 4 index, BBB the level 3 index, CCC the level 2 index, DDD the level 1 index of the mapped frame, and EEEE the offset into it. RRR is the index of the recursive entry. When an index (three digits) is transformed to an offset (four digits), it is done by multiplying it by 8 (the size of a page table entry). With this offset, the resulting address directly points to the respective page table entry.

SSSSSS are sign extension bits, which means that they are all copies of bit 47. This is a special requirement for valid addresses on the x86_64 architecture. We explained it in the previous post.

We use octal numbers for representing the addresses since each octal character represents three bits, which allows us to clearly separate the 9-bit indexes of the different page table levels. This isn’t possible with the hexadecimal system, where each character represents four bits.

In Rust Code

To construct such addresses in Rust code, you can use bitwise operations:

// the virtual address whose corresponding page tables you want to access
let addr: usize = […];

let r = 0o777; // recursive index
let sign = 0o177777 << 48; // sign extension

// retrieve the page table indices of the address that we want to translate
let l4_idx = (addr >> 39) & 0o777; // level 4 index
let l3_idx = (addr >> 30) & 0o777; // level 3 index
let l2_idx = (addr >> 21) & 0o777; // level 2 index
let l1_idx = (addr >> 12) & 0o777; // level 1 index
let page_offset = addr & 0o7777;

// calculate the table addresses
let level_4_table_addr =
    sign | (r << 39) | (r << 30) | (r << 21) | (r << 12);
let level_3_table_addr =
    sign | (r << 39) | (r << 30) | (r << 21) | (l4_idx << 12);
let level_2_table_addr =
    sign | (r << 39) | (r << 30) | (l4_idx << 21) | (l3_idx << 12);
let level_1_table_addr =
    sign | (r << 39) | (l4_idx << 30) | (l3_idx << 21) | (l2_idx << 12);

The above code assumes that the last level 4 entry with index 0o777 (511) is recursively mapped. This isn’t the case currently, so the code won’t work yet. See below on how to tell the bootloader to set up the recursive mapping.
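The table addresses point to the start of a page table. To reach a specific entry, the next index is multiplied by 8 and used as the offset, as described for the octal address structure. A minimal sketch of that last step, under the same assumption that entry 511 is the recursive one (the function names are hypothetical, and `u64` is used so that the example also runs on a host machine):

```rust
const R: u64 = 0o777; // recursive index (assumed to be entry 511)
const SIGN: u64 = 0o177777 << 48; // sign extension for addresses with bit 47 set

/// Virtual address of the level 1 table that is responsible for `addr`.
pub fn level_1_table_addr(addr: u64) -> u64 {
    let l4_idx = (addr >> 39) & 0o777;
    let l3_idx = (addr >> 30) & 0o777;
    let l2_idx = (addr >> 21) & 0o777;
    SIGN | (R << 39) | (l4_idx << 30) | (l3_idx << 21) | (l2_idx << 12)
}

/// Virtual address of the level 1 *entry* that maps `addr`.
pub fn level_1_entry_addr(addr: u64) -> u64 {
    let l1_idx = (addr >> 12) & 0o777;
    // each page table entry is 8 bytes large
    level_1_table_addr(addr) | (l1_idx * 8)
}
```

For `0xb8000`, the level 1 index `0o270` becomes the offset `0o2700`, so the entry address is `0o177777_777_000_000_000_2700`, which matches the `0o_SSSSSS_RRR_AAA_BBB_CCC_DDDD` pattern.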

As an alternative to performing the bitwise operations by hand, you can use the RecursivePageTable type of the x86_64 crate, which provides safe abstractions for various page table operations. For example, the code below shows how to translate a virtual address to its mapped physical address:

// in src/memory.rs

use x86_64::structures::paging::{Mapper, Page, PageTable, RecursivePageTable};
use x86_64::{VirtAddr, PhysAddr};

// Create a RecursivePageTable instance from the level 4 address.
let level_4_table_addr = […];
let level_4_table_ptr = level_4_table_addr as *mut PageTable;
let recursive_page_table = unsafe {
    let level_4_table = &mut *level_4_table_ptr;
    // no trailing semicolon, so the block evaluates to the new instance
    RecursivePageTable::new(level_4_table).unwrap()
};


// Retrieve the physical address for the given virtual address.
let addr: u64 = […];
let addr = VirtAddr::new(addr);
let page: Page = Page::containing_address(addr);

// perform the translation
let frame = recursive_page_table.translate_page(page);
frame.map(|frame| frame.start_address() + u64::from(addr.page_offset()))

Again, a valid recursive mapping is required for this code. With such a mapping, the missing level_4_table_addr can be calculated as in the first code example.

Recursive Paging is an interesting technique that shows how powerful a single mapping in a page table can be. It is relatively easy to implement and only requires a minimal amount of setup (just a single recursive entry), so it’s a good choice for first experiments with paging.

However, it also has some disadvantages:

It occupies a large amount of virtual memory (512 GiB). This isn’t a big problem in the large 48-bit address space, but it might lead to suboptimal cache behavior.
It only allows accessing the currently active address space easily. Accessing other address spaces is still possible by changing the recursive entry, but a temporary mapping is required for switching back. We described how to do this in the (outdated) Remap The Kernel post.
It heavily relies on the page table format of x86 and might not work on other architectures.
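The 512 GiB figure from the first point follows directly from the table layout: a level 1 entry covers one 4 KiB page, and each higher level multiplies the covered range by 512. As a small sketch (the function name is hypothetical):

```rust
pub const GIB: u64 = 1 << 30;

/// Bytes of virtual memory covered by a single entry of a page table at the
/// given level (1 to 4), assuming 4 KiB pages and 512 entries per table.
pub const fn bytes_per_entry(level: u32) -> u64 {
    // every level above 1 adds 9 more address bits
    4096u64 << (9 * (level - 1))
}
```

Since the recursive entry is a level 4 entry, it reserves `bytes_per_entry(4)`, i.e., 512 GiB of the 256 TiB address space.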

Bootloader Support

All of these approaches require page table modifications for their setup. For example, mappings for the physical memory need to be created or an entry of the level 4 table needs to be mapped recursively. The problem is that we can’t create these required mappings without an existing way to access the page tables.

This means that we need the help of the bootloader, which creates the page tables that our kernel runs on. The bootloader has access to the page tables, so it can create any mappings that we need. In its current implementation, the bootloader crate has support for two of the above approaches, controlled through cargo features:

The map_physical_memory feature maps the complete physical memory somewhere into the virtual address space. Thus, the kernel has access to all physical memory and can follow the Map the Complete Physical Memory approach.
With the recursive_page_table feature, the bootloader maps an entry of the level 4 page table recursively. This allows the kernel to access the page tables as described in the Recursive Page Tables section.

We choose the first approach for our kernel since it is simple, platform-independent, and more powerful (it also allows access to non-page-table-frames). To enable the required bootloader support, we add the map_physical_memory feature to our bootloader dependency:

[dependencies]
bootloader = { version = "0.9", features = ["map_physical_memory"]}

With this feature enabled, the bootloader maps the complete physical memory to some unused virtual address range. To communicate the virtual address range to our kernel, the bootloader passes a boot information structure.

Boot Information

The bootloader crate defines a BootInfo struct that contains all the information it passes to our kernel. The struct is still in an early stage, so expect some breakage when updating to future semver-incompatible bootloader versions. With the map_physical_memory feature enabled, it currently has the two fields memory_map and physical_memory_offset:

The memory_map field contains an overview of the available physical memory. This tells our kernel how much physical memory is available in the system and which memory regions are reserved for devices such as the VGA hardware. The memory map can be queried from the BIOS or UEFI firmware, but only very early in the boot process. For this reason, it must be provided by the bootloader because there is no way for the kernel to retrieve it later. We will need the memory map later in this post.
The physical_memory_offset tells us the virtual start address of the physical memory mapping. By adding this offset to a physical address, we get the corresponding virtual address. This allows us to access arbitrary physical memory from our kernel.
This physical memory offset can be customized by adding a [package.metadata.bootloader] table in Cargo.toml and setting the field physical-memory-offset = "0x0000f00000000000" (or any other value). However, note that the bootloader can panic if it runs into physical address values that start to overlap with the space beyond the offset, i.e., areas it would have previously mapped to some other early physical addresses. So in general, the higher the value (> 1 TiB), the better.
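The address arithmetic behind physical_memory_offset is a single addition. The following sketch models it with plain `u64` values so that it can be tried outside of the kernel. The function names and the offset value used in the usage note are assumptions for illustration only.

```rust
/// Virtual address at which the physical address `phys` can be accessed when
/// the complete physical memory is mapped at `physical_memory_offset`.
/// Returns `None` if the sum does not fit into 64 bits.
pub fn phys_to_virt(phys: u64, physical_memory_offset: u64) -> Option<u64> {
    physical_memory_offset.checked_add(phys)
}

/// Inverse direction, only meaningful for addresses inside the mapping.
/// Returns `None` if `virt` lies below the start of the mapping.
pub fn virt_to_phys(virt: u64, physical_memory_offset: u64) -> Option<u64> {
    virt.checked_sub(physical_memory_offset)
}
```

With a hypothetical offset of `0x0000_1000_0000_0000`, the VGA buffer at physical address `0xb8000` would be reachable at virtual address `0x0000_1000_000b_8000`.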

The bootloader passes the BootInfo struct to our kernel in the form of a &'static BootInfo argument to our _start function. We don’t have this argument declared in our function yet, so let’s add it:

// in src/main.rs

use bootloader::BootInfo;

#[no_mangle]
pub extern "C" fn _start(boot_info: &'static BootInfo) -> ! { // new argument
    […]
}

It wasn’t a problem to leave off this argument before because the x86_64 calling convention passes the first argument in a CPU register. Thus, the argument is simply ignored when it isn’t declared. However, it would be a problem if we accidentally used a wrong argument type, since the compiler doesn’t know the correct type signature of our entry point function.

The `entry_point` Macro

Since our _start function is called externally from the bootloader, no checking of our function signature occurs. This means that we could let it take arbitrary arguments without any compilation errors, but it would fail or cause undefined behavior at runtime.

To make sure that the entry point function always has the signature that the bootloader expects, the bootloader crate provides an entry_point macro, which offers a type-checked way to define a Rust function as the entry point. Let’s rewrite our entry point function to use this macro:

// in src/main.rs

use bootloader::{BootInfo, entry_point};

entry_point!(kernel_main);

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    […]
}

We no longer need to use extern "C" or no_mangle for our entry point, as the macro defines the real lower level _start entry point for us. The kernel_main function is now a completely normal Rust function, so we can choose an arbitrary name for it. The important thing is that it is type-checked so that a compilation error occurs when we use a wrong function signature, for example by adding an argument or changing the argument type.
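To get an intuition for how such a type check can work, consider the following simplified sketch. It is not the actual macro of the bootloader crate, whose details may differ. The `BootInfo` stand-in, the macro name `checked_entry_point`, and the generated `checked_start` function are all hypothetical, and the functions return a `u64` instead of diverging so that the example can be exercised on a host machine.

```rust
/// Stand-in for the bootloader's `BootInfo` (hypothetical, heavily reduced).
pub struct BootInfo {
    pub physical_memory_offset: u64,
}

/// Simplified sketch of a macro that type-checks the entry point function.
macro_rules! checked_entry_point {
    ($path:path) => {
        /// Wrapper that plays the role of the low-level entry point.
        pub fn checked_start(boot_info: &'static BootInfo) -> u64 {
            // Assigning the function to a typed function pointer only
            // compiles if it has exactly this signature.
            let f: fn(&'static BootInfo) -> u64 = $path;
            f(boot_info)
        }
    };
}

checked_entry_point!(kernel_main);

// An ordinary Rust function with an arbitrary name.
fn kernel_main(boot_info: &'static BootInfo) -> u64 {
    boot_info.physical_memory_offset
}
```

If we changed `kernel_main` to take a `u32` or an additional argument, the assignment to `f` would no longer type-check, so the mistake would surface as a compilation error instead of as undefined behavior at runtime.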

Let’s perform the same change in our lib.rs:

// in src/lib.rs

#[cfg(test)]
use bootloader::{entry_point, BootInfo};

#[cfg(test)]
entry_point!(test_kernel_main);

/// Entry point for `cargo test`
#[cfg(test)]
fn test_kernel_main(_boot_info: &'static BootInfo) -> ! {
    // like before
    init();
    test_main();
    hlt_loop();
}

Since the entry point is only used in test mode, we add the #[cfg(test)] attribute to all items. We give our test entry point the distinct name test_kernel_main to avoid confusion with the kernel_main of our main.rs. We don’t use the BootInfo parameter for now, so we prefix the parameter name with a _ to silence the unused variable warning.

Implementation

Now that we have access to physical memory, we can finally start to implement our page table code. First, we will take a look at the currently active page tables that our kernel runs on. In the second step, we will create a translation function that returns the physical address that a given virtual address is mapped to. As a last step, we will try to modify the page tables in order to create a new mapping.

Before we begin, we create a new memory module for our code:

// in src/lib.rs

pub mod memory;

For the module, we create an empty src/memory.rs file.

Accessing the Page Tables

At the end of the previous post, we tried to take a look at the page tables our kernel runs on, but failed since we couldn’t access the physical frame that the CR3 register points to. We’re now able to continue from there by creating an active_level_4_table function that returns a reference to the active level 4 page table:

// in src/memory.rs

use x86_64::{
    structures::paging::PageTable,
    VirtAddr,
};

/// Returns a mutable reference to the active level 4 table.
///
/// This function is unsafe because the caller must guarantee that the
/// complete physical memory is mapped to virtual memory at the passed
/// `physical_memory_offset`. Also, this function must only be called once
/// to avoid aliasing `&mut` references (which is undefined behavior).
pub unsafe fn active_level_4_table(physical_memory_offset: VirtAddr)
    -> &'static mut PageTable
{
    use x86_64::registers::control::Cr3;

    let (level_4_table_frame, _) = Cr3::read();

    let phys = level_4_table_frame.start_address();
    let virt = physical_memory_offset + phys.as_u64();
    let page_table_ptr: *mut PageTable = virt.as_mut_ptr();

    &mut *page_table_ptr // unsafe
}

First, we read the physical frame of the active level 4 table from the CR3 register. We then take its physical start address, convert it to a u64, and add it to physical_memory_offset to get the virtual address where the page table frame is mapped. Finally, we convert the virtual address to a *mut PageTable raw pointer through the as_mut_ptr method and then unsafely create a &mut PageTable reference from it. We create a &mut reference instead of a & reference because we will mutate the page tables later in this post.

We don’t need to use an unsafe block here because Rust treats the complete body of an unsafe fn like a large unsafe block. This makes our code more dangerous since we could accidentally introduce an unsafe operation in previous lines without noticing. It also makes it much more difficult to spot unsafe operations in between safe operations. There is an RFC to change this behavior.

We can now use this function to print the entries of the level 4 table:

// in src/main.rs

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    use blog_os::memory::active_level_4_table;
    use x86_64::VirtAddr;

    println!("Hello World{}", "!");
    blog_os::init();

    let phys_mem_offset = VirtAddr::new(boot_info.physical_memory_offset);
    let l4_table = unsafe { active_level_4_table(phys_mem_offset) };

    for (i, entry) in l4_table.iter().enumerate() {
        if !entry.is_unused() {
            println!("L4 Entry {}: {:?}", i, entry);
        }
    }

    // as before
    #[cfg(test)]
    test_main();

    println!("It did not crash!");
    blog_os::hlt_loop();
}

First, we convert the physical_memory_offset of the BootInfo struct to a VirtAddr and pass it to the active_level_4_table function. We then use the iter function to iterate over the page table entries and the enumerate combinator to additionally add an index i to each element. We only print non-empty entries because all 512 entries wouldn’t fit on the screen.

When we run it, we see the following output:

We see that there are various non-empty entries, which all map to different level 3 tables. There are so many regions because kernel code, kernel stack, physical memory mapping, and boot information all use separate memory areas.

To traverse the page tables further and take a look at a level 3 table, we can take the mapped frame of an entry and convert it to a virtual address again:

// in the `for` loop in src/main.rs

use x86_64::structures::paging::PageTable;

if !entry.is_unused() {
    println!("L4 Entry {}: {:?}", i, entry);

    // get the physical address from the entry and convert it
    let phys = entry.frame().unwrap().start_address();
    let virt = phys.as_u64() + boot_info.physical_memory_offset;
    let ptr = VirtAddr::new(virt).as_mut_ptr();
    let l3_table: &PageTable = unsafe { &*ptr };

    // print non-empty entries of the level 3 table
    for (i, entry) in l3_table.iter().enumerate() {
        if !entry.is_unused() {
            println!("  L3 Entry {}: {:?}", i, entry);
        }
    }
}

For looking at the level 2 and level 1 tables, we repeat that process for the level 3 and level 2 entries. As you can imagine, this gets very verbose very quickly, so we don’t show the full code here.

Traversing the page tables manually is interesting because it helps to understand how the CPU performs the translation. However, most of the time, we are only interested in the mapped physical address for a given virtual address, so let’s create a function for that.

Translating Addresses

To translate a virtual to a physical address, we have to traverse the four-level page table until we reach the mapped frame. Let’s create a function that performs this translation:

// in src/memory.rs

use x86_64::PhysAddr;

/// Translates the given virtual address to the mapped physical address, or
/// `None` if the address is not mapped.
///
/// This function is unsafe because the caller must guarantee that the
/// complete physical memory is mapped to virtual memory at the passed
/// `physical_memory_offset`.
pub unsafe fn translate_addr(addr: VirtAddr, physical_memory_offset: VirtAddr)
    -> Option<PhysAddr>
{
    translate_addr_inner(addr, physical_memory_offset)
}

We forward the function to a safe translate_addr_inner function to limit the scope of unsafe. As we noted above, Rust treats the complete body of an unsafe fn like a large unsafe block. By calling into a private safe function, we make each unsafe operation explicit again.

The private inner function contains the real implementation:

// in src/memory.rs

/// Private function that is called by `translate_addr`.
///
/// This function is safe to limit the scope of `unsafe` because Rust treats
/// the whole body of unsafe functions as an unsafe block. This function must
/// only be reachable through `unsafe fn` from outside of this module.
fn translate_addr_inner(addr: VirtAddr, physical_memory_offset: VirtAddr)
    -> Option<PhysAddr>
{
    use x86_64::structures::paging::page_table::FrameError;
    use x86_64::registers::control::Cr3;

    // read the active level 4 frame from the CR3 register
    let (level_4_table_frame, _) = Cr3::read();

    let table_indexes = [
        addr.p4_index(), addr.p3_index(), addr.p2_index(), addr.p1_index()
    ];
    let mut frame = level_4_table_frame;

    // traverse the multi-level page table
    for &index in &table_indexes {
        // convert the frame into a page table reference
        let virt = physical_memory_offset + frame.start_address().as_u64();
        let table_ptr: *const PageTable = virt.as_ptr();
        let table = unsafe {&*table_ptr};

        // read the page table entry and update `frame`
        let entry = &table[index];
        frame = match entry.frame() {
            Ok(frame) => frame,
            Err(FrameError::FrameNotPresent) => return None,
            Err(FrameError::HugeFrame) => panic!("huge pages not supported"),
        };
    }

    // calculate the physical address by adding the page offset
    Some(frame.start_address() + u64::from(addr.page_offset()))
}

Instead of reusing our active_level_4_table function, we read the level 4 frame from the CR3 register again. We do this because it simplifies this prototype implementation. Don’t worry, we will create a better solution in a moment.

The VirtAddr struct already provides methods to compute the indexes into the page tables of the four levels. We store these indexes in a small array because it allows us to traverse the page tables using a for loop. Outside of the loop, we remember the last visited frame to calculate the physical address later. The frame points to page table frames while iterating and to the mapped frame after the last iteration, i.e., after following the level 1 entry.

Inside the loop, we again use the physical_memory_offset to convert the frame into a page table reference. We then read the entry of the current page table and use the PageTableEntry::frame function to retrieve the mapped frame. If the entry is not mapped to a frame, we return None. If the entry maps a huge 2 MiB or 1 GiB page, we panic for now.
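The same walk can be tried on a host machine by simulating physical memory. The sketch below is only a model under stated assumptions: physical memory is a `HashMap` from frame addresses to page tables, entries are raw `u64` values with the present bit at bit 0 and the huge page bit at bit 7, and all names (`PhysMem`, `WalkError`, `translate`, `example_memory`) are invented for this example. Like the loop above, it checks the huge page bit on every level for simplicity.

```rust
use std::collections::HashMap;

const PRESENT: u64 = 1;
const HUGE_PAGE: u64 = 1 << 7;
const ADDR_MASK: u64 = 0x000f_ffff_ffff_f000;

/// Simulated physical memory: frame start address -> 512 page table entries.
pub type PhysMem = HashMap<u64, Vec<u64>>;

#[derive(Debug, PartialEq, Eq)]
pub enum WalkError {
    NotMapped,
    HugePage,
}

/// Walks the simulated 4-level page table, starting at `level_4_frame`.
pub fn translate(mem: &PhysMem, level_4_frame: u64, addr: u64) -> Result<u64, WalkError> {
    let indexes = [
        (addr >> 39) & 0o777,
        (addr >> 30) & 0o777,
        (addr >> 21) & 0o777,
        (addr >> 12) & 0o777,
    ];
    let mut frame = level_4_frame;
    for &index in &indexes {
        // "convert the frame into a page table reference"
        let table = mem.get(&frame).ok_or(WalkError::NotMapped)?;
        let entry = table[index as usize];
        if entry & PRESENT == 0 {
            return Err(WalkError::NotMapped);
        }
        if entry & HUGE_PAGE != 0 {
            return Err(WalkError::HugePage);
        }
        frame = entry & ADDR_MASK;
    }
    // the page offset is not part of the translation
    Ok(frame + (addr & 0o7777))
}

/// Builds a tiny address space with the level 4 table at frame 0x1000.
pub fn example_memory() -> PhysMem {
    let mut mem = PhysMem::new();
    for frame in [0x1000, 0x2000, 0x3000, 0x4000] {
        mem.insert(frame, vec![0; 512]);
    }
    // 0x201008 has the indexes l4 = 0, l3 = 0, l2 = 1, l1 = 1
    mem.get_mut(&0x1000).unwrap()[0] = 0x2000 | PRESENT;
    mem.get_mut(&0x2000).unwrap()[0] = 0x3000 | PRESENT;
    mem.get_mut(&0x3000).unwrap()[1] = 0x4000 | PRESENT;
    mem.get_mut(&0x4000).unwrap()[1] = 0x40_0000 | PRESENT;
    // a huge page entry in the level 2 table at index 2
    mem.get_mut(&0x3000).unwrap()[2] = 0x80_0000 | PRESENT | HUGE_PAGE;
    mem
}
```

In this model, `translate(&example_memory(), 0x1000, 0x201008)` follows four entries and ends at frame `0x40_0000`, so it returns `Ok(0x40_0008)`. An address without a present level 4 entry yields `NotMapped`, and an address inside the huge page yields `HugePage`, which corresponds to the panic in our kernel code.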

Let’s test our translation function by translating some addresses:

// in src/main.rs

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    // new import
    use blog_os::memory::translate_addr;

    […] // hello world and blog_os::init

    let phys_mem_offset = VirtAddr::new(boot_info.physical_memory_offset);

    let addresses = [
        // the identity-mapped vga buffer page
        0xb8000,
        // some code page
        0x201008,
        // some stack page
        0x0100_0020_1a10,
        // virtual address mapped to physical address 0
        boot_info.physical_memory_offset,
    ];

    for &address in &addresses {
        let virt = VirtAddr::new(address);
        let phys = unsafe { translate_addr(virt, phys_mem_offset) };
        println!("{:?} -> {:?}", virt, phys);
    }

    […] // test_main(), "it did not crash" printing, and hlt_loop()
}

When we run it, we see the following output:

As expected, the identity-mapped address 0xb8000 translates to the same physical address. The code page and the stack page translate to some arbitrary physical addresses, which depend on how the bootloader created the initial mapping for our kernel. Itâ€™s worth noting that the last 12 bits always stay the same after translation, which makes sense because these bits are the page offset and not part of the translation.

Since each physical address can be accessed by adding the physical_memory_offset, the translation of the physical_memory_offset address itself should point to physical address 0. However, the translation fails because the mapping uses huge pages for efficiency, which is not supported in our implementation yet.
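Supporting huge pages mostly changes the last step of the translation: the walk stops one or two levels earlier, and a larger part of the virtual address is used as the offset into the frame. A small sketch of that offset calculation, with a hypothetical `MappedSize` enum and function name that are not part of the x86_64 crate:

```rust
/// Size of the page that a translation ended at (hypothetical helper type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MappedSize {
    Size4KiB,
    Size2MiB,
    Size1GiB,
}

impl MappedSize {
    /// Page size in bytes.
    pub const fn bytes(self) -> u64 {
        match self {
            MappedSize::Size4KiB => 1 << 12,
            MappedSize::Size2MiB => 1 << 21,
            MappedSize::Size1GiB => 1 << 30,
        }
    }
}

/// Physical address for `addr`, given the start of the mapped frame and the
/// size of the page that maps it.
pub fn phys_addr_in_page(frame_start: u64, addr: u64, size: MappedSize) -> u64 {
    // the low 12, 21, or 30 bits of the address are the offset into the page
    frame_start + (addr & (size.bytes() - 1))
}
```

For a 2 MiB page, the level 1 index and the 12-bit page offset together form a 21-bit offset. For a 1 GiB page, the level 2 index is part of the offset as well.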

Using `OffsetPageTable`

Translating virtual to physical addresses is a common task in an OS kernel, so the x86_64 crate provides an abstraction for it. The implementation already supports huge pages and several other page table functions apart from translate_addr, so we will use it in the following instead of adding huge page support to our own implementation.

At the basis of the abstraction are two traits that define various page table mapping functions:

The Mapper trait is generic over the page size and provides functions that operate on pages. Examples are translate_page, which translates a given page to a frame of the same size, and map_to, which creates a new mapping in the page table.
The Translate trait provides functions that work with multiple page sizes, such as translate_addr or the general translate.

The traits only define the interface; they don’t provide any implementation. The x86_64 crate currently provides three types that implement the traits with different requirements. The OffsetPageTable type assumes that the complete physical memory is mapped to the virtual address space at some offset. The MappedPageTable is a bit more flexible: It only requires that each page table frame is mapped to the virtual address space at a calculable address. Finally, the RecursivePageTable type can be used to access page table frames through recursive page tables.

In our case, the bootloader maps the complete physical memory at a virtual address specified by the physical_memory_offset variable, so we can use the OffsetPageTable type. To initialize it, we create a new init function in our memory module:

use x86_64::structures::paging::OffsetPageTable;

/// Initialize a new OffsetPageTable.
///
/// This function is unsafe because the caller must guarantee that the
/// complete physical memory is mapped to virtual memory at the passed
/// `physical_memory_offset`. Also, this function must only be called once
/// to avoid aliasing `&mut` references (which is undefined behavior).
pub unsafe fn init(physical_memory_offset: VirtAddr) -> OffsetPageTable<'static> {
    let level_4_table = active_level_4_table(physical_memory_offset);
    OffsetPageTable::new(level_4_table, physical_memory_offset)
}

// make private
unsafe fn active_level_4_table(physical_memory_offset: VirtAddr)
    -> &'static mut PageTable
{…}

The function takes the physical_memory_offset as an argument and returns a new OffsetPageTable instance with a 'static lifetime. This means that the instance stays valid for the complete runtime of our kernel. In the function body, we first call the active_level_4_table function to retrieve a mutable reference to the level 4 page table. We then invoke the OffsetPageTable::new function with this reference. As the second parameter, the new function expects the virtual address at which the mapping of the physical memory starts, which is given in the physical_memory_offset variable.

The active_level_4_table function should only be called from the init function from now on because it can easily lead to aliased mutable references when called multiple times, which can cause undefined behavior. For this reason, we make the function private by removing the pub specifier.

We can now use the Translate::translate_addr method instead of our own memory::translate_addr function. We only need to change a few lines in our kernel_main:

// in src/main.rs

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    // new: different imports
    use blog_os::memory;
    use x86_64::{structures::paging::Translate, VirtAddr};

    […] // hello world and blog_os::init

    let phys_mem_offset = VirtAddr::new(boot_info.physical_memory_offset);
    // new: initialize a mapper
    let mapper = unsafe { memory::init(phys_mem_offset) };

    let addresses = […]; // same as before

    for &address in &addresses {
        let virt = VirtAddr::new(address);
        // new: use the `mapper.translate_addr` method
        let phys = mapper.translate_addr(virt);
        println!("{:?} -> {:?}", virt, phys);
    }

    […] // test_main(), "it did not crash" printing, and hlt_loop()
}

We need to import the Translate trait in order to use the translate_addr method it provides.

When we run it now, we see the same translation results as before, with the difference that the huge page translation now also works:

As expected, the translations of 0xb8000 and the code and stack addresses stay the same as with our own translation function. Additionally, we now see that the virtual address physical_memory_offset is mapped to the physical address 0x0.

By using the translation function of the OffsetPageTable type, we can spare ourselves the work of implementing huge page support. We also have access to other page functions, such as map_to, which we will use in the next section.

At this point, we no longer need our memory::translate_addr and memory::translate_addr_inner functions, so we can delete them.

Creating a new Mapping

Until now, we only looked at the page tables without modifying anything. Let’s change that by creating a new mapping for a previously unmapped page.

We will use the map_to function of the Mapper trait for our implementation, so let’s take a look at that function first. The documentation tells us that it takes four arguments: the page that we want to map, the frame that the page should be mapped to, a set of flags for the page table entry, and a frame_allocator. The frame allocator is needed because mapping the given page might require creating additional page tables, which need unused frames as backing storage.

A `create_example_mapping` Function

The first step of our implementation is to create a new create_example_mapping function that maps a given virtual page to 0xb8000, the physical frame of the VGA text buffer. We choose that frame because it allows us to easily test if the mapping was created correctly: We just need to write to the newly mapped page and check whether the write appears on the screen.

The create_example_mapping function looks like this:

// in src/memory.rs

use x86_64::{
    PhysAddr,
    structures::paging::{Page, PhysFrame, Mapper, Size4KiB, FrameAllocator}
};

/// Creates an example mapping for the given page to frame `0xb8000`.
pub fn create_example_mapping(
    page: Page,
    mapper: &mut OffsetPageTable,
    frame_allocator: &mut impl FrameAllocator<Size4KiB>,
) {
    use x86_64::structures::paging::PageTableFlags as Flags;

    let frame = PhysFrame::containing_address(PhysAddr::new(0xb8000));
    let flags = Flags::PRESENT | Flags::WRITABLE;

    let map_to_result = unsafe {
        // FIXME: this is not safe, we do it only for testing
        mapper.map_to(page, frame, flags, frame_allocator)
    };
    map_to_result.expect("map_to failed").flush();
}

In addition to the page that should be mapped, the function expects a mutable reference to an OffsetPageTable instance and a frame_allocator. The frame_allocator parameter uses the impl Trait syntax to be generic over all types that implement the FrameAllocator trait. The trait is generic over the PageSize trait to work with both standard 4 KiB pages and huge 2 MiB/1 GiB pages. We only want to create a 4 KiB mapping, so we set the generic parameter to Size4KiB.

The map_to method is unsafe because the caller must ensure that the frame is not already in use. The reason for this is that mapping the same frame twice could result in undefined behavior, for example when two different &mut references point to the same physical memory location. In our case, we reuse the VGA text buffer frame, which is already mapped, so we break the required condition. However, the create_example_mapping function is only a temporary testing function and will be removed after this post, so it is ok. To remind us of the unsafety, we put a FIXME comment on the line.

In addition to the page and the frame, the map_to method takes a set of flags for the mapping and a reference to the frame_allocator, which will be explained in a moment. For the flags, we set the PRESENT flag because it is required for all valid entries and the WRITABLE flag to make the mapped page writable. For a list of all possible flags, see the Page Table Format section of the previous post.

The map_to function can fail, so it returns a Result. Since this is just some example code that does not need to be robust, we just use expect to panic when an error occurs. On success, the function returns a MapperFlush type that provides an easy way to flush the newly mapped page from the translation lookaside buffer (TLB) with its flush method. Like Result, the type uses the #[must_use] attribute to emit a warning when we accidentally forget to use it.

A dummy `FrameAllocator`

To be able to call create_example_mapping, we need to create a type that implements the FrameAllocator trait first. As noted above, the trait is responsible for allocating frames for new page tables if they are needed by map_to.

Let’s start with the simple case and assume that we don’t need to create new page tables. For this case, a frame allocator that always returns None suffices. We create such an EmptyFrameAllocator for testing our mapping function:

// in src/memory.rs

/// A FrameAllocator that always returns `None`.
pub struct EmptyFrameAllocator;

unsafe impl FrameAllocator<Size4KiB> for EmptyFrameAllocator {
    fn allocate_frame(&mut self) -> Option<PhysFrame> {
        None
    }
}

Implementing the FrameAllocator is unsafe because the implementer must guarantee that the allocator yields only unused frames. Otherwise, undefined behavior might occur, for example when two virtual pages are mapped to the same physical frame. Our EmptyFrameAllocator only returns None, so this isn’t a problem in this case.

Choosing a Virtual Page

We now have a simple frame allocator that we can pass to our create_example_mapping function. However, the allocator always returns None, so this will only work if no additional page table frames are needed for creating the mapping. To understand when additional page table frames are needed and when not, let’s consider an example:

The graphic shows the virtual address space on the left, the physical address space on the right, and the page tables in between. The page tables are stored in physical memory frames, indicated by the dashed lines. The virtual address space contains a single mapped page at address 0x803fe00000, marked in blue. To translate this page to its frame, the CPU walks the 4-level page table until it reaches the frame at address 36 KiB.

Additionally, the graphic shows the physical frame of the VGA text buffer in red. Our goal is to map a previously unmapped virtual page to this frame using our create_example_mapping function. Since our EmptyFrameAllocator always returns None, we want to create the mapping so that no additional frames are needed from the allocator. This depends on the virtual page that we select for the mapping.

The graphic shows two candidate pages in the virtual address space, both marked in yellow. One page is at address 0x803fdfd000, which is 3 pages before the mapped page (in blue). While the level 4 and level 3 page table indices are the same as for the blue page, the level 2 and level 1 indices are different (see the previous post). The different index into the level 2 table means that a different level 1 table is used for this page. Since this level 1 table does not exist yet, we would need to create it if we chose that page for our example mapping, which would require an additional unused physical frame. In contrast, the second candidate page at address 0x803fe02000 does not have this problem because it uses the same level 1 page table as the blue page. Thus, all the required page tables already exist.
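We can verify these index claims with a few lines of host-side code. The helper names below are made up for this example. Two pages use the same level 1 table exactly when their level 4, level 3, and level 2 indexes are identical.

```rust
/// Level 4, 3, 2, and 1 page table indexes of a virtual address.
pub fn table_indexes(addr: u64) -> [u64; 4] {
    [
        (addr >> 39) & 0o777,
        (addr >> 30) & 0o777,
        (addr >> 21) & 0o777,
        (addr >> 12) & 0o777,
    ]
}

/// Returns `true` if both addresses are mapped through the same level 1
/// table, i.e., if they only differ in the level 1 index or page offset.
pub fn share_level_1_table(a: u64, b: u64) -> bool {
    table_indexes(a)[..3] == table_indexes(b)[..3]
}
```

The blue page at `0x803fe00000` has the indexes `[1, 0, 511, 0]`. The first candidate at `0x803fdfd000` has `[1, 0, 510, 509]`, so it needs a different level 1 table, while the second candidate at `0x803fe02000` has `[1, 0, 511, 2]` and can reuse the existing one.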

In summary, the difficulty of creating a new mapping depends on the virtual page that we want to map. In the easiest case, the level 1 page table for the page already exists and we just need to write a single entry. In the most difficult case, the page is in a memory region for which no level 3 table exists yet, so we need to create new level 3, level 2 and level 1 page tables first.

For calling our create_example_mapping function with the EmptyFrameAllocator, we need to choose a page for which all page tables already exist. To find such a page, we can utilize the fact that the bootloader loads itself in the first megabyte of the virtual address space. This means that a valid level 1 table exists for all pages in this region. Thus, we can choose any unused page in this memory region for our example mapping, such as the page at address 0. Normally, this page should stay unused to guarantee that dereferencing a null pointer causes a page fault, so we know that the bootloader leaves it unmapped.

Creating the Mapping

We now have all the required parameters for calling our create_example_mapping function, so let’s modify our kernel_main function to map the page at virtual address 0. Since we map the page to the frame of the VGA text buffer, we should be able to write to the screen through it afterward. The implementation looks like this:

// in src/main.rs

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    use blog_os::memory;
    use x86_64::{structures::paging::Page, VirtAddr}; // new import

    […] // hello world and blog_os::init

    let phys_mem_offset = VirtAddr::new(boot_info.physical_memory_offset);
    let mut mapper = unsafe { memory::init(phys_mem_offset) };
    let mut frame_allocator = memory::EmptyFrameAllocator;

    // map an unused page
    let page = Page::containing_address(VirtAddr::new(0));
    memory::create_example_mapping(page, &mut mapper, &mut frame_allocator);

    // write the string `New!` to the screen through the new mapping
    let page_ptr: *mut u64 = page.start_address().as_mut_ptr();
    unsafe { page_ptr.offset(400).write_volatile(0x_f021_f077_f065_f04e)};

    […] // test_main(), "it did not crash" printing, and hlt_loop()
}

We first create the mapping for the page at address 0 by calling our create_example_mapping function with a mutable reference to the mapper and the frame_allocator instances. This maps the page to the VGA text buffer frame, so we should see any write to it on the screen.

Then we convert the page to a raw pointer and write a value to offset 400. We don’t write to the start of the page because the top line of the VGA buffer is directly shifted off the screen by the next println. We write the value 0x_f021_f077_f065_f04e, which represents the string “New!” on a white background. As we learned in the “VGA Text Mode” post, writes to the VGA buffer should be volatile, so we use the write_volatile method.
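
To see where the magic number comes from, here is a small sketch that builds it from the characters. It assumes the VGA text mode layout from the earlier post (two bytes per cell, ASCII byte first, attribute byte second, 80 columns per row) and a little-endian CPU; the function names are made up for this illustration.

```rust
/// Packs four ASCII characters into a single u64 of VGA text cells.
/// Each cell is 16 bits: the low byte is the character, the high byte
/// is the color attribute. On little-endian x86, the lowest 16 bits
/// end up in the first cell.
fn pack_vga_cells(text: [u8; 4], attribute: u8) -> u64 {
    let mut value = 0u64;
    for (i, &ch) in text.iter().enumerate() {
        let cell = ((attribute as u64) << 8) | ch as u64;
        value |= cell << (16 * i);
    }
    value
}

/// Returns the (row, column) of the first cell that a write at the given
/// `*mut u64` offset touches, assuming 80 columns and 2 bytes per cell.
fn cell_position(u64_offset: usize) -> (usize, usize) {
    let cell_index = u64_offset * 8 / 2;
    (cell_index / 80, cell_index % 80)
}
```

With the attribute byte 0xf0, packing `*b"New!"` gives exactly 0x_f021_f077_f065_f04e, and offset 400 corresponds to the start of row 20, which is far enough down that the next println does not scroll it away.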

When we run it in QEMU, we see the following output:

The “New!” on the screen is caused by our write to page 0, which means that we successfully created a new mapping in the page tables.

Creating that mapping only worked because the level 1 table responsible for the page at address 0 already exists. When we try to map a page for which no level 1 table exists yet, the map_to function fails because it tries to create new page tables by allocating frames with the EmptyFrameAllocator. We can see that happen when we try to map page 0xdeadbeaf000 instead of 0:

// in src/main.rs

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    […]
    let page = Page::containing_address(VirtAddr::new(0xdeadbeaf000));
    […]
}

When we run it, a panic with the following error message occurs:

panicked at 'map_to failed: FrameAllocationFailed', /…/result.rs:999:5

To map pages that don’t have a level 1 page table yet, we need to create a proper FrameAllocator. But how do we know which frames are unused and how much physical memory is available?

Allocating Frames

In order to create new page tables, we need to create a proper frame allocator. To do that, we use the memory_map that is passed by the bootloader as part of the BootInfo struct:

// in src/memory.rs

use bootloader::bootinfo::MemoryMap;

/// A FrameAllocator that returns usable frames from the bootloader's memory map.
pub struct BootInfoFrameAllocator {
    memory_map: &'static MemoryMap,
    next: usize,
}

impl BootInfoFrameAllocator {
    /// Create a FrameAllocator from the passed memory map.
    ///
    /// This function is unsafe because the caller must guarantee that the passed
    /// memory map is valid. The main requirement is that all frames that are marked
    /// as `USABLE` in it are really unused.
    pub unsafe fn init(memory_map: &'static MemoryMap) -> Self {
        BootInfoFrameAllocator {
            memory_map,
            next: 0,
        }
    }
}

The struct has two fields: A 'static reference to the memory map passed by the bootloader and a next field that keeps track of the number of the next frame that the allocator should return.

As we explained in the Boot Information section, the memory map is provided by the BIOS/UEFI firmware. It can only be queried very early in the boot process, so the bootloader already calls the respective functions for us. The memory map consists of a list of MemoryRegion structs, which contain the start address, the length, and the type (e.g. unused, reserved, etc.) of each memory region.

The init function initializes a BootInfoFrameAllocator with a given memory map. The next field is initialized with 0 and will be increased for every frame allocation to avoid returning the same frame twice. Since we don’t know if the usable frames of the memory map were already used somewhere else, our init function must be unsafe to require additional guarantees from the caller.

A `usable_frames` Method

Before we implement the FrameAllocator trait, we add an auxiliary method that converts the memory map into an iterator of usable frames:

// in src/memory.rs

use bootloader::bootinfo::MemoryRegionType;

impl BootInfoFrameAllocator {
    /// Returns an iterator over the usable frames specified in the memory map.
    fn usable_frames(&self) -> impl Iterator<Item = PhysFrame> {
        // get usable regions from memory map
        let regions = self.memory_map.iter();
        let usable_regions = regions
            .filter(|r| r.region_type == MemoryRegionType::Usable);
        // map each region to its address range
        let addr_ranges = usable_regions
            .map(|r| r.range.start_addr()..r.range.end_addr());
        // transform to an iterator of frame start addresses
        let frame_addresses = addr_ranges.flat_map(|r| r.step_by(4096));
        // create `PhysFrame` types from the start addresses
        frame_addresses.map(|addr| PhysFrame::containing_address(PhysAddr::new(addr)))
    }
}

This function uses iterator combinator methods to transform the initial MemoryMap into an iterator of usable physical frames:

First, we call the iter method to convert the memory map to an iterator of MemoryRegions.
Then we use the filter method to skip any reserved or otherwise unavailable regions. The bootloader updates the memory map for all the mappings it creates, so frames that are used by our kernel (code, data, or stack) or to store the boot information are already marked as InUse or similar. Thus, we can be sure that Usable frames are not used somewhere else.
Afterwards, we use the map combinator and Rust’s range syntax to transform our iterator of memory regions to an iterator of address ranges.
Next, we use flat_map to transform the address ranges into an iterator of frame start addresses, choosing every 4096th address using step_by. Since 4096 bytes (= 4 KiB) is the page size, we get the start address of each frame. The bootloader page-aligns all usable memory areas so that we don’t need any alignment or rounding code here. By using flat_map instead of map, we get an Iterator<Item = u64> instead of an Iterator<Item = Iterator<Item = u64>>.
Finally, we convert the start addresses to PhysFrame types to construct an Iterator<Item = PhysFrame>.

The return type of the function uses the impl Trait feature. This way, we can specify that we return some type that implements the Iterator trait with item type PhysFrame but don’t need to name the concrete return type. This is important here because we can’t name the concrete type since it depends on unnamable closure types.
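
The same combinator chain can be tried outside the kernel with plain integers. The sketch below uses simplified stand-in types (`Region`, `RegionType`, `sample_map` are invented for this example and are not the bootloader's types) and returns frame start addresses as u64 instead of PhysFrame values.

```rust
/// Simplified stand-ins for the bootloader's memory map types.
#[derive(PartialEq)]
enum RegionType {
    Usable,
    Reserved,
}

struct Region {
    start: u64, // inclusive, assumed page-aligned
    end: u64,   // exclusive, assumed page-aligned
    region_type: RegionType,
}

/// Returns the start address of every 4 KiB frame inside a usable region.
fn usable_frame_addrs(map: &[Region]) -> impl Iterator<Item = u64> + '_ {
    map.iter()
        // skip reserved regions
        .filter(|r| r.region_type == RegionType::Usable)
        // turn each region into an address range
        .map(|r| r.start..r.end)
        // pick every 4096th address and flatten the nested iterators
        .flat_map(|range| range.step_by(4096))
}

/// A made-up memory map: two usable regions around a reserved one.
fn sample_map() -> Vec<Region> {
    vec![
        Region { start: 0x0000, end: 0x2000, region_type: RegionType::Usable },
        Region { start: 0x2000, end: 0x5000, region_type: RegionType::Reserved },
        Region { start: 0x5000, end: 0x7000, region_type: RegionType::Usable },
    ]
}
```

Collecting `usable_frame_addrs(&sample_map())` yields the four addresses 0x0, 0x1000, 0x5000, and 0x6000; the frames of the reserved region are skipped.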

Implementing the `FrameAllocator` Trait

Now we can implement the FrameAllocator trait:

// in src/memory.rs

unsafe impl FrameAllocator<Size4KiB> for BootInfoFrameAllocator {
    fn allocate_frame(&mut self) -> Option<PhysFrame> {
        let frame = self.usable_frames().nth(self.next);
        self.next += 1;
        frame
    }
}

We first use the usable_frames method to get an iterator of usable frames from the memory map. Then, we use the Iterator::nth function to get the frame with index self.next (thereby skipping the first self.next frames). Before returning that frame, we increase self.next by one so that we return the following frame on the next call.
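
The counting behavior of nth is easy to verify with a toy allocator. This is only a sketch under simplifying assumptions: `ToyFrameAllocator` is a hypothetical type that stores usable regions as (start, end) address pairs and hands out frame start addresses as u64.

```rust
/// Toy allocator that mirrors the `next` counter approach.
struct ToyFrameAllocator {
    usable_regions: Vec<(u64, u64)>, // (start, end) with exclusive end
    next: usize,
}

impl ToyFrameAllocator {
    /// Rebuilds the iterator over all usable frame start addresses.
    fn usable_frames(&self) -> impl Iterator<Item = u64> + '_ {
        self.usable_regions
            .iter()
            .flat_map(|&(start, end)| (start..end).step_by(4096))
    }

    /// `nth(n)` skips the first `n` items and returns the item at index `n`,
    /// so every call walks the iterator from the beginning again.
    fn allocate_frame(&mut self) -> Option<u64> {
        let frame = self.usable_frames().nth(self.next);
        self.next += 1;
        frame
    }
}
```

With a single usable region from 0x1000 to 0x3000, the first two calls return 0x1000 and 0x2000, and the third call returns None because no frames are left.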

This implementation is not quite optimal since it recreates the usable_frames iterator on every allocation. It would be better to directly store the iterator as a struct field instead. Then we wouldn’t need the nth method and could just call next on every allocation. The problem with this approach is that it’s not possible to store an impl Trait type in a struct field currently. It might work someday when named existential types are fully implemented.

Using the `BootInfoFrameAllocator`

We can now modify our kernel_main function to pass a BootInfoFrameAllocator instance instead of an EmptyFrameAllocator:

// in src/main.rs

fn kernel_main(boot_info: &'static BootInfo) -> ! {
    use blog_os::memory::BootInfoFrameAllocator;
    […]
    let mut frame_allocator = unsafe {
        BootInfoFrameAllocator::init(&boot_info.memory_map)
    };
    […]
}

With the boot info frame allocator, the mapping succeeds and we see the black-on-white “New!” on the screen again. Behind the scenes, the map_to method creates the missing page tables in the following way:

Use the passed frame_allocator to allocate an unused frame.
Zero the frame to create a new, empty page table.
Map the entry of the higher level table to that frame.
Continue with the next table level.
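
The steps above can be modeled in a few lines of ordinary Rust. The following is a toy simulation, not how the x86_64 crate implements map_to: `ToyMemory` is a hypothetical type that keeps page tables in a HashMap keyed by physical frame address, uses a bump counter as its frame allocator, and only tracks a PRESENT bit.

```rust
use std::collections::HashMap;

const PRESENT: u64 = 1;

/// Toy physical memory holding page table frames by physical address.
struct ToyMemory {
    tables: HashMap<u64, [u64; 512]>,
    next_free_frame: u64,
}

impl ToyMemory {
    fn new(level_4_frame: u64, first_free_frame: u64) -> Self {
        let mut tables = HashMap::new();
        tables.insert(level_4_frame, [0u64; 512]);
        ToyMemory { tables, next_free_frame: first_free_frame }
    }

    /// Step 1: allocate an unused frame (simple bump allocation).
    fn allocate_frame(&mut self) -> u64 {
        let frame = self.next_free_frame;
        self.next_free_frame += 4096;
        frame
    }

    /// Maps `page` to `frame` and returns how many tables were created.
    fn map_to(&mut self, level_4_frame: u64, page: u64, frame: u64) -> usize {
        let mut created = 0;
        let mut table = level_4_frame;
        // walk level 4, level 3, and level 2
        for shift in [39, 30, 21] {
            let index = ((page >> shift) & 0o777) as usize;
            let entry = self.tables[&table][index];
            table = if entry & PRESENT == 0 {
                let new_frame = self.allocate_frame();
                // Step 2: zero the frame to get an empty page table
                self.tables.insert(new_frame, [0u64; 512]);
                // Step 3: point the higher level entry to the new table
                self.tables.get_mut(&table).unwrap()[index] = new_frame | PRESENT;
                created += 1;
                new_frame
            } else {
                entry & !0xfff
            };
            // Step 4: continue with the next table level
        }
        let l1_index = ((page >> 12) & 0o777) as usize;
        self.tables.get_mut(&table).unwrap()[l1_index] = frame | PRESENT;
        created
    }
}
```

In this model, mapping page 0xdeadbeaf000 into an empty level 4 table creates three new tables (level 3, level 2, and level 1), while mapping the neighboring page 0xdeadbeb0000 afterwards creates none because it shares the same level 1 table.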

While our create_example_mapping function is just some example code, we are now able to create new mappings for arbitrary pages. This will be essential for allocating memory or implementing multithreading in future posts.

At this point, we should delete the create_example_mapping function again to avoid accidentally invoking undefined behavior, as explained above.

Summary

In this post we learned about different techniques to access the physical frames of page tables, including identity mapping, mapping of the complete physical memory, temporary mapping, and recursive page tables. We chose to map the complete physical memory since it’s simple, portable, and powerful.

We can’t map the physical memory from our kernel without page table access, so we need support from the bootloader. The bootloader crate supports creating the required mapping through optional cargo crate features. It passes the required information to our kernel in the form of a &BootInfo argument to our entry point function.

For our implementation, we first manually traversed the page tables to implement a translation function, and then used the MappedPageTable type of the x86_64 crate. We also learned how to create new mappings in the page table and how to create the necessary FrameAllocator on top of the memory map passed by the bootloader.

What’s next?

The next post will create a heap memory region for our kernel, which will allow us to allocate memory and use various collection types.

Advanced Paging

Mon, 28 Jan 2019 00:00:00 +0000

This post explains techniques to make the physical page table frames accessible to our kernel. It then uses such a technique to implement a function that translates virtual to physical addresses. It also explains how to create new mappings in the page tables.

Introduction

In the previous post we learned about the principles of paging and how the 4-level page tables on x86_64 work. We also found out that the bootloader already set up a page table hierarchy for our kernel, which means that our kernel already runs on virtual addresses. This improves safety since illegal memory accesses cause page fault exceptions instead of modifying arbitrary physical memory.

However, it also causes a problem when we try to access the page tables from our kernel because we can’t directly access the physical addresses that are stored in page table entries or the CR3 register. We experienced that problem already at the end of the previous post when we tried to inspect the active page tables.

The next section discusses the problem in detail and provides different approaches to a solution. Afterward, we implement a function that traverses the page table hierarchy in order to translate virtual to physical addresses. Finally, we learn how to create new mappings in the page tables and how to find unused memory frames for creating new page tables.

Dependency Versions

This post requires version 0.3.12 of the bootloader dependency and version 0.5.0 of the x86_64 dependency. You can set the dependency versions in your Cargo.toml:

[dependencies]
bootloader = "0.3.12"
x86_64 = "0.5.0"

Accessing Page Tables

Accessing the page tables from our kernel is not as easy as it may seem. To understand the problem, let’s take a look at the example 4-level page table hierarchy of the previous post again:

The problem for us is that we can’t directly access physical addresses from our kernel since our kernel also runs on top of virtual addresses. For example, when we access address 4 KiB, we access the virtual address 4 KiB, not the physical address 4 KiB where the level 4 page table is stored. When we want to access the physical address 4 KiB, we can only do so through some virtual address that maps to it.

So in order to access page table frames, we need to map some virtual pages to them. There are different ways to create these mappings that all allow us to access arbitrary page table frames:

A simple solution is to identity map all page tables:

In this example, we see various identity-mapped page table frames. This way the physical addresses of page tables are also valid virtual addresses so that we can easily access the page tables of all levels starting from the CR3 register.

However, it clutters the virtual address space and makes it more difficult to find continuous memory regions of larger sizes. For example, imagine that we want to create a virtual memory region of size 1000 KiB in the above graphic, e.g. for memory-mapping a file. We can’t start the region at 28 KiB because it would collide with the already mapped page at 1004 KiB. So we have to look further until we find a large enough unmapped area, for example at 1008 KiB. This is a similar fragmentation problem as with segmentation.

Equally, it makes it much more difficult to create new page tables, because we need to find physical frames whose corresponding pages aren’t already in use. For example, let’s assume that we reserved the virtual 1000 KiB memory region starting at 1008 KiB for our memory-mapped file. Now we can’t use any frame with a physical address between 1008 KiB and 2008 KiB anymore, because we can’t identity map it.

Alternatively, we could map the page table frames only temporarily when we need to access them. To be able to create the temporary mappings, we only need a single identity-mapped level 1 table:

The level 1 table in this graphic controls the first 2 MiB of the virtual address space. This is because it is reachable by starting at the CR3 register and following the 0th entry in the level 4, level 3, and level 2 page tables. The entry with index 8 maps the virtual page at address 32 KiB to the physical frame at address 32 KiB, thereby identity mapping the level 1 table itself. The graphic shows this identity-mapping by the horizontal arrow at 32 KiB.
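
The numbers in this paragraph follow directly from the table geometry. As a quick sanity check, here is a tiny sketch with made-up helper names, assuming 4 KiB pages and 512 entries per table:

```rust
const PAGE_SIZE: u64 = 4096;
const ENTRIES_PER_TABLE: u64 = 512;

/// Size of the virtual memory range controlled by one level 1 table.
fn level_1_table_span() -> u64 {
    PAGE_SIZE * ENTRIES_PER_TABLE
}

/// Start address of the page controlled by the given entry of the
/// level 1 table that covers the beginning of the address space.
fn page_address_for_entry(index: u64) -> u64 {
    index * PAGE_SIZE
}
```

The span is 512 × 4 KiB = 2 MiB, and entry 8 controls the page starting at 8 × 4 KiB = 32 KiB.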

By writing to the identity-mapped level 1 table, our kernel can create up to 511 temporary mappings (512 minus the entry required for the identity mapping). In the above example, the kernel mapped the 0th entry of the level 1 table to the frame with address 24 KiB. This created a temporary mapping of the virtual page at 0 KiB to the physical frame of the level 2 page table, indicated by the dashed arrow. Now the kernel can access the level 2 page table by writing to the page starting at 0 KiB.

The process for accessing an arbitrary page table frame with temporary mappings would be:
- Search for a free entry in the identity-mapped level 1 table.
- Map that entry to the physical frame of the page table that we want to access.
- Access the target frame through the virtual page that maps to the entry.
- Set the entry back to unused thereby removing the temporary mapping again.
This approach keeps the virtual address space clean since it reuses the same 512 virtual pages for creating the mappings. The drawback is that it is a bit cumbersome, especially since a new mapping might require modifications of multiple table levels, which means that we would need to repeat the above process multiple times.
While both of the above approaches work, there is a third technique called recursive page tables that combines their advantages: It keeps all page table frames mapped at all times so that no temporary mappings are needed, and also keeps the mapped pages together to avoid fragmentation of the virtual address space. This is the technique that we will use for our implementation, therefore it is described in detail in the following section.

Recursive Page Tables

The idea behind this approach is to map some entry of the level 4 page table to the level 4 table itself. By doing this, we effectively reserve a part of the virtual address space and map all current and future page table frames to that space.

Let’s go through an example to understand how this all works:

By letting the CPU follow this entry on a translation, it doesn’t reach a level 3 table, but the same level 4 table again. This is similar to a recursive function that calls itself, therefore this table is called a recursive page table. The important thing is that the CPU assumes that every entry in the level 4 table points to a level 3 table, so it now treats the level 4 table as a level 3 table. This works because tables of all levels have the exact same layout on x86_64.

Similarly, we can follow the recursive entry twice before starting the translation to reduce the number of traversed levels to two:

Let’s go through it step by step: First, the CPU follows the recursive entry on the level 4 table and thinks that it reaches a level 3 table. Then it follows the recursive entry again and thinks that it reaches a level 2 table. But in reality, it is still on the level 4 table. When the CPU now follows a different entry, it lands on a level 3 table but thinks it is already on a level 1 table. So while the next entry points at a level 2 table, the CPU thinks that it points to the mapped frame, which allows us to read and write the level 2 table.

Accessing the tables of levels 3 and 4 works in the same way. For accessing the level 3 table, we follow the recursive entry three times, tricking the CPU into thinking it is already on a level 1 table. Then we follow another entry and reach a level 3 table, which the CPU treats as a mapped frame. For accessing the level 4 table itself, we just follow the recursive entry four times until the CPU treats the level 4 table itself as mapped frame (in blue in the graphic below).

It might take some time to wrap your head around the concept, but it works quite well in practice.

Address Calculation

Let’s assume that we want to access the level 1 page table that maps a specific page. As we learned above, this means that we have to follow the recursive entry one time before continuing with the level 4, level 3, and level 2 indexes. To do that, we move each block of the address one block to the right and set the original level 4 index to the index of the recursive entry:

Accessing the level 3 table works by moving each block three blocks to the right and using the recursive index for the original level 4, level 3, and level 2 address blocks:

Finally, we can access the level 4 table by moving each block four blocks to the right and using the recursive index for all address blocks except for the offset:

The table below summarizes the address structure for accessing the different kinds of frames:

Virtual Address for	Address Structure (octal)
Page	`0o_SSSSSS_AAA_BBB_CCC_DDD_EEEE`
Level 1 Table Entry	`0o_SSSSSS_RRR_AAA_BBB_CCC_DDDD`
Level 2 Table Entry	`0o_SSSSSS_RRR_RRR_AAA_BBB_CCCC`
Level 3 Table Entry	`0o_SSSSSS_RRR_RRR_RRR_AAA_BBBB`
Level 4 Table Entry	`0o_SSSSSS_RRR_RRR_RRR_RRR_AAAA`

SSSSSS are sign extension bits, which means that they are all copies of bit 47. This is a special requirement for valid addresses on the x86_64 architecture. We explained it in the previous post.
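
The block-shifting scheme from the table can be written as a small function. This is a sketch for illustration only: `table_addr` and `sign_extend` are hypothetical helpers, the recursive index `r` is passed in as a parameter, and the function returns the start address of the table (the address of a specific entry would additionally add the entry index times 8 as the offset).

```rust
/// Copies bit 47 into bits 48 to 63 to form a valid x86_64 address.
fn sign_extend(addr: u64) -> u64 {
    if addr & (1 << 47) != 0 {
        addr | 0xffff_0000_0000_0000
    } else {
        addr & 0x0000_ffff_ffff_ffff
    }
}

/// Returns the virtual address of the level `level` page table (1 to 4)
/// that is responsible for `page_addr`, given the recursive index `r`.
fn table_addr(page_addr: u64, level: u32, r: u64) -> u64 {
    assert!((1..=4).contains(&level));
    // the four 9-bit index blocks AAA_BBB_CCC_DDD as one 36-bit number
    let indices = (page_addr >> 12) & 0o777_777_777_777;
    // move each block `level` blocks to the right
    let mut blocks = indices >> (9 * level);
    // fill the freed blocks on the left with the recursive index
    for i in 0..level {
        blocks |= r << (9 * (3 - i));
    }
    // append a zero page offset and add the sign extension bits
    sign_extend(blocks << 12)
}
```

For example, with the recursive index 0o777, the level 4 table is reachable at 0xffff_ffff_ffff_f000, and the level 1 table for the page at 0xb8000 starts at 0xffff_ff80_0000_0000.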

Implementation

After all this theory we can finally start our implementation. Conveniently, the bootloader not only created page tables for our kernel, but it also created a recursive mapping in the last entry of the level 4 table. The bootloader did this because otherwise there would be a chicken or egg problem: We need to access the level 4 table to create a recursive mapping, but we can’t access it without some kind of mapping.

We already used this recursive mapping at the end of the previous post to access the level 4 table. We did this through the hardcoded address 0xffff_ffff_ffff_f000. When we convert this address to octal and compare it with the above table, we can see that it exactly follows the structure of a level 4 table entry with RRR = 0o777, AAAA = 0, and the sign extension bits set to 1 each:

structure: 0o_SSSSSS_RRR_RRR_RRR_RRR_AAAA
address:   0o_177777_777_777_777_777_0000

With our knowledge about recursive page tables we can now create virtual addresses to access all active page tables. This allows us to create a translation function in software.

Translating Addresses

As a first step, let’s create a function that translates a virtual address to a physical address by walking the page table hierarchy:

// in src/lib.rs

pub mod memory;

// in src/memory.rs

use x86_64::PhysAddr;
use x86_64::structures::paging::PageTable;

/// Returns the physical address for the given virtual address, or `None` if the
/// virtual address is not mapped.
pub fn translate_addr(addr: usize) -> Option<PhysAddr> {
    // introduce variables for the recursive index and the sign extension bits
    // TODO: Don't hardcode these values
    let r = 0o777; // recursive index
    let sign = 0o177777 << 48; // sign extension

    // retrieve the page table indices of the address that we want to translate
    let l4_idx = (addr >> 39) & 0o777; // level 4 index
    let l3_idx = (addr >> 30) & 0o777; // level 3 index
    let l2_idx = (addr >> 21) & 0o777; // level 2 index
    let l1_idx = (addr >> 12) & 0o777; // level 1 index
    let page_offset = addr & 0o7777;

    // calculate the table addresses
    let level_4_table_addr =
        sign | (r << 39) | (r << 30) | (r << 21) | (r << 12);
    let level_3_table_addr =
        sign | (r << 39) | (r << 30) | (r << 21) | (l4_idx << 12);
    let level_2_table_addr =
        sign | (r << 39) | (r << 30) | (l4_idx << 21) | (l3_idx << 12);
    let level_1_table_addr =
        sign | (r << 39) | (l4_idx << 30) | (l3_idx << 21) | (l2_idx << 12);

    // check that level 4 entry is mapped
    let level_4_table = unsafe { &*(level_4_table_addr as *const PageTable) };
    if level_4_table[l4_idx].addr().is_null() {
        return None;
    }

    // check that level 3 entry is mapped
    let level_3_table = unsafe { &*(level_3_table_addr as *const PageTable) };
    if level_3_table[l3_idx].addr().is_null() {
        return None;
    }

    // check that level 2 entry is mapped
    let level_2_table = unsafe { &*(level_2_table_addr as *const PageTable) };
    if level_2_table[l2_idx].addr().is_null() {
        return None;
    }

    // check that level 1 entry is mapped and retrieve physical address from it
    let level_1_table = unsafe { &*(level_1_table_addr as *const PageTable) };
    let phys_addr = level_1_table[l1_idx].addr();
    if phys_addr.is_null() {
        return None;
    }

    Some(phys_addr + page_offset)
}

First, we introduce variables for the recursive index (511 = 0o777) and the sign extension bits (which are 1 each). Then we calculate the page table indices and the page offset from the address through bitwise operations as specified in the graphic:

In the next step we calculate the virtual addresses of the four page tables as described in the address calculation section. We transform each of these addresses to PageTable references later in the function. These transformations are unsafe operations since the compiler can’t know that these addresses are valid.

After the address calculation, we use the indexing operator to look at the entry in the level 4 table. If that entry is null, there is no level 3 table for this level 4 entry, which means that the addr is not mapped to any physical memory, so we return None. If the entry is not null, we know that a level 3 table exists. We then do the same cast and entry-checking as with the level 4 table.

After we checked the three higher level page tables, we can finally read the entry of the level 1 table that tells us the physical frame that the address is mapped to. As the last step, we add the page offset to that address and return it.

If we knew that the address is mapped, we could directly access the level 1 table without looking at the higher level page tables first. But since we don’t know this, we have to check whether the level 1 table exists first, otherwise our function would cause a page fault for unmapped addresses.

Try it out

We can use our new translation function to translate some virtual addresses in our _start function:

// in src/main.rs

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    […] // initialize GDT, IDT, PICS

    use blog_os::memory::translate_addr;

    let addresses = [
        // the identity-mapped vga buffer page
        0xb8000,
        // some code page
        0x20010a,
        // some stack page
        0x57ac_001f_fe48,
    ];

    for &address in &addresses {
        println!("{:?} -> {:?}", address, translate_addr(address));
    }

    println!("It did not crash!");
    blog_os::hlt_loop();
}

When we run it, we see the following output:

The `RecursivePageTable` Type

The x86_64 crate provides a RecursivePageTable type that implements safe abstractions for various page table operations. The type implements the MapperAllSizes trait, which already contains a translate_addr function that we can use instead of hand-rolling our own. To create a new RecursivePageTable, we create a memory::init function:

// in src/memory.rs

use x86_64::structures::paging::{Mapper, Page, PageTable, RecursivePageTable};
use x86_64::{VirtAddr, PhysAddr};

/// Creates a RecursivePageTable instance from the level 4 address.
///
/// This function is unsafe because it can break memory safety if an invalid
/// address is passed.
pub unsafe fn init(level_4_table_addr: usize) -> RecursivePageTable<'static> {
    let level_4_table_ptr = level_4_table_addr as *mut PageTable;
    let level_4_table = &mut *level_4_table_ptr;
    RecursivePageTable::new(level_4_table).unwrap()
}

The RecursivePageTable type encapsulates the unsafety of the page table walk completely so that we no longer need unsafe to implement our own translate_addr function. The init function needs to be unsafe because the caller has to guarantee that the passed level_4_table_addr is valid.

We can now use the MapperAllSizes::translate_addr function in our _start function:

// in src/main.rs

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    […] // initialize GDT, IDT, PICS

    use blog_os::memory;
    use x86_64::{
        structures::paging::MapperAllSizes,
        VirtAddr,
    };

    const LEVEL_4_TABLE_ADDR: usize = 0o_177777_777_777_777_777_0000;
    let recursive_page_table = unsafe { memory::init(LEVEL_4_TABLE_ADDR) };

    let addresses = […]; // as before
    for &address in &addresses {
        let virt_addr = VirtAddr::new(address);
        let phys_addr = recursive_page_table.translate_addr(virt_addr);
        println!("{:?} -> {:?}", virt_addr, phys_addr);
    }

    println!("It did not crash!");
    blog_os::hlt_loop();
}

Instead of using u64 for all addresses we now use the VirtAddr and PhysAddr wrapper types to differentiate the two kinds of addresses. In order to be able to call the translate_addr method, we need to import the MapperAllSizes trait.

By using the RecursivePageTable type, we now have a safe abstraction and clear ownership semantics. This ensures that we can’t accidentally modify the page table concurrently, because an exclusive borrow of the RecursivePageTable is needed in order to modify it.

When we run it, we see the same result as with our handcrafted translation function.

Making Unsafe Functions Safer

Our memory::init function is an unsafe function, which means that an unsafe block is required for calling it because the caller has to guarantee that certain requirements are met. In our case, the requirement is that the passed address is mapped to the physical frame of the level 4 page table.

The second property of unsafe functions is that their complete body is treated as an unsafe block, which means that it can perform all kinds of unsafe operations without additional unsafe blocks. This is the reason that we didn’t need an unsafe block for dereferencing the raw level_4_table_ptr:

pub unsafe fn init(level_4_table_addr: usize) -> RecursivePageTable<'static> {
    let level_4_table_ptr = level_4_table_addr as *mut PageTable;
    let level_4_table = &mut *level_4_table_ptr; // <- this operation is unsafe
    RecursivePageTable::new(level_4_table).unwrap()
}

The problem with this is that we don’t immediately see which parts are unsafe. For example, we don’t know whether the RecursivePageTable::new function is unsafe or not without looking at its definition. This makes it very easy to accidentally do something unsafe without noticing.

To avoid this problem, we can add a safe inner function:

// in src/memory.rs

pub unsafe fn init(level_4_table_addr: usize) -> RecursivePageTable<'static> {
    /// Rust currently treats the whole body of unsafe functions as an unsafe
    /// block, which makes it difficult to see which operations are unsafe. To
    /// limit the scope of unsafe we use a safe inner function.
    fn init_inner(level_4_table_addr: usize) -> RecursivePageTable<'static> {
        let level_4_table_ptr = level_4_table_addr as *mut PageTable;
        let level_4_table = unsafe { &mut *level_4_table_ptr };
        RecursivePageTable::new(level_4_table).unwrap()
    }

    init_inner(level_4_table_addr)
}

Now an unsafe block is required again for dereferencing the level_4_table_ptr and we immediately see that this is the only unsafe operation in the function. There is currently an open RFC to change this unfortunate property of unsafe functions that would allow us to avoid the above boilerplate.
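
The inner function pattern is independent of page tables, so it can be tried in any Rust program. The following minimal sketch uses a made-up function `read_doubled` that reads a value through a raw pointer. As a side note, newer compiler versions also offer the unsafe_op_in_unsafe_fn lint, which flags unsafe operations that are not wrapped in an explicit unsafe block inside an unsafe function.

```rust
/// Reads the value behind `ptr` and returns it doubled.
///
/// This function is unsafe because the caller must guarantee that `ptr`
/// points to a valid, initialized `u32`.
pub unsafe fn read_doubled(ptr: *const u32) -> u32 {
    /// Safe inner function: every unsafe operation in here needs its
    /// own `unsafe` block, which makes it easy to spot.
    fn inner(ptr: *const u32) -> u32 {
        let value = unsafe { *ptr }; // the only unsafe operation
        value.wrapping_mul(2) // ordinary safe code
    }

    inner(ptr)
}
```

A caller still needs an unsafe block, for example `unsafe { read_doubled(&x as *const u32) }` for some local variable `x: u32`, but inside the function the unsafe part is now limited to the single dereference.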

Creating a new Mapping

After reading the page tables and creating a translation function, the next step is to create a new mapping in the page table hierarchy.

The difficulty of creating a new mapping depends on the virtual page that we want to map. In the easiest case, the level 1 page table for the page already exists and we just need to write a single entry. In the most difficult case, the page is in a memory region for which no level 3 table exists yet, so we need to create new level 3, level 2, and level 1 page tables first.

Let’s start with the simple case and assume that we don’t need to create new page tables. The bootloader loads itself in the first megabyte of the virtual address space, so we know that a valid level 1 table exists for this region. We can choose any unused page in this memory region for our example mapping, for example, the page at address 0x1000. As the target frame we use 0xb8000, the frame of the VGA text buffer. This way we can easily test whether our mapping worked.

We implement it in a new create_example_mapping function in our memory module:

// in src/memory.rs

use x86_64::structures::paging::{FrameAllocator, PhysFrame, Size4KiB};

pub fn create_example_mapping(
    recursive_page_table: &mut RecursivePageTable,
    frame_allocator: &mut impl FrameAllocator<Size4KiB>,
) {
    use x86_64::structures::paging::PageTableFlags as Flags;

    let page: Page = Page::containing_address(VirtAddr::new(0x1000));
    let frame = PhysFrame::containing_address(PhysAddr::new(0xb8000));
    let flags = Flags::PRESENT | Flags::WRITABLE;

    let map_to_result = unsafe {
        recursive_page_table.map_to(page, frame, flags, frame_allocator)
    };
    map_to_result.expect("map_to failed").flush();
}

The function takes a mutable reference to the RecursivePageTable because it needs to modify it, and a FrameAllocator that is explained below. It then uses the map_to function of the Mapper trait to map the page at address 0x1000 to the physical frame at address 0xb8000. The function is unsafe because it's possible to break memory safety with invalid arguments.

Apart from the page and frame arguments, the map_to function takes two more arguments. The third argument is a set of flags for the page table entry. We set the PRESENT flag because it is required for all valid entries and the WRITABLE flag to make the mapped page writable.

The fourth argument needs to be some structure that implements the FrameAllocator trait. The map_to method needs this argument because it might need unused frames for creating new page tables. The Size4KiB argument in the trait implementation is needed because the Page and PhysFrame types are generic over the PageSize trait to work with both standard 4KiB pages and huge 2MiB/1GiB pages.

The map_to function can fail, so it returns a Result. Since this is just some example code that does not need to be robust, we just use expect to panic when an error occurs. On success, the function returns a MapperFlush type that provides an easy way to flush the newly mapped page from the translation lookaside buffer (TLB) with its flush method. Like Result, the type uses the #[must_use] attribute (https://doc.rust-lang.org/std/result/#results-must-be-used) to emit a warning when we accidentally forget to use it.

Since we know that no new page tables are required for the address 0x1000, a frame allocator that always returns None suffices. We create such an EmptyFrameAllocator for testing our mapping function:

// in src/memory.rs

/// A FrameAllocator that always returns `None`.
pub struct EmptyFrameAllocator;

impl FrameAllocator<Size4KiB> for EmptyFrameAllocator {
    fn allocate_frame(&mut self) -> Option<PhysFrame> {
        None
    }
}

(If you're getting a 'method allocate_frame is not a member of trait FrameAllocator' error, you need to update x86_64 to version 0.4.0.)

We can now test the new mapping function in our main.rs:

// in src/main.rs

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    […] // initialize GDT, IDT, PICS

    use blog_os::memory::{create_example_mapping, EmptyFrameAllocator};

    const LEVEL_4_TABLE_ADDR: usize = 0o_177777_777_777_777_777_0000;
    let mut recursive_page_table = unsafe { memory::init(LEVEL_4_TABLE_ADDR) };

    create_example_mapping(&mut recursive_page_table, &mut EmptyFrameAllocator);
    unsafe { (0x1900 as *mut u64).write_volatile(0xf021_f077_f065_f04e)};

    println!("It did not crash!");
    blog_os::hlt_loop();
}

We first create the mapping for the page at 0x1000 by calling our create_example_mapping function with a mutable reference to the RecursivePageTable instance. This maps the page 0x1000 to the VGA text buffer, so we should see any write to it on the screen.

Then we write the value 0xf021_f077_f065_f04e to this page, which represents the string "New!" on a white background. We don't write directly to the beginning of the page at 0x1000 since the top line is immediately shifted off the screen by the next println. Instead, we write to offset 0x900, which is about in the middle of the screen. As we learned in the "VGA Text Mode" post, writes to the VGA buffer should be volatile, so we use the write_volatile method.
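To see where the magic number comes from, here is a small sketch that rebuilds it. The helper names pack_vga_cells and offset_to_row_col are made up for this illustration, and the sketch assumes the standard 80-column text mode in which every screen cell takes two bytes (the character byte first, then the color attribute):

```rust
/// Packs up to four ASCII characters into one u64 of VGA text-mode cells.
/// Each cell is two bytes: the character code (low byte) and the color
/// attribute (high byte). x86 is little-endian, so the first character
/// ends up in the least significant bytes of the value.
fn pack_vga_cells(text: &str, attribute: u8) -> u64 {
    let mut value = 0u64;
    for (i, byte) in text.bytes().take(4).enumerate() {
        let cell = (u64::from(attribute) << 8) | u64::from(byte);
        value |= cell << (16 * i);
    }
    value
}

/// Converts a byte offset into the text buffer to a (row, column) pair,
/// assuming 80 columns per row and 2 bytes per cell.
fn offset_to_row_col(offset: usize) -> (usize, usize) {
    let cell = offset / 2;
    (cell / 80, cell % 80)
}
```

With the attribute 0xf0 (black text on a white background), pack_vga_cells("New!", 0xf0) yields 0xf021_f077_f065_f04e, and offset_to_row_col(0x900) yields row 14, column 32, which is roughly the middle of the 25 text rows.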

When we run it in QEMU, we see the following output:

The "New!" on the screen is caused by our write to 0x1900, which means that we successfully created a new mapping in the page tables.

This only worked because there was already a level 1 table for mapping page 0x1000. When we try to map a page for which no level 1 table exists yet, the map_to function fails because it tries to allocate frames from the EmptyFrameAllocator for creating new page tables. We can see that happen when we try to map page 0xdeadbeaf000 instead of 0x1000:

// in src/memory.rs

pub fn create_example_mapping(…) {
    […]
    let page: Page = Page::containing_address(VirtAddr::new(0xdeadbeaf000));
    […]
}

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    […]
    unsafe { (0xdeadbeaf900 as *mut u64).write_volatile(0xf021_f077_f065_f04e)};
    […]
}

When we run it, a panic with the following error message occurs:

panicked at 'map_to failed: FrameAllocationFailed', /…/result.rs:999:5

To map pages that don't have a level 1 page table yet, we need to create a proper FrameAllocator. But how do we know which frames are unused and how much physical memory is available?

Boot Information

The amount of physical memory and the memory regions reserved by devices like the VGA hardware vary between different machines. Only the BIOS or UEFI firmware knows exactly which memory regions can be used by the operating system and which regions are reserved. Both firmware standards provide functions to retrieve the memory map, but they can only be called very early in the boot process. For this reason, the bootloader already queries this and other information from the firmware.

To communicate this information to our kernel, the bootloader passes a reference to a boot information structure as an argument when calling our _start function. Right now we don't have this argument declared in our function, so it is ignored. Let's add it:

// in src/main.rs

use bootloader::bootinfo::BootInfo;

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start(boot_info: &'static BootInfo) -> ! { // new argument
    […]
}

The BootInfo struct is still in an early stage, so expect some breakage when updating to future semver-incompatible bootloader versions. It currently has the three fields p4_table_addr, memory_map, and package:

The p4_table_addr field contains the recursive virtual address of the level 4 page table. By using this field we can avoid hardcoding the address 0o_177777_777_777_777_777_0000.
The memory_map field is most interesting to us since it contains a list of all memory regions and their type (i.e. unused, reserved, or other).
The package field is an in-progress feature to bundle additional data with the bootloader. The implementation is not finished, so we can ignore this field for now.

Before we use the memory_map field to create a proper FrameAllocator, we want to ensure that we can't use a boot_info argument of the wrong type.

The `entry_point` Macro

To make sure that the entry point function always has the signature that the bootloader expects, the bootloader crate provides an entry_point macro, which offers a type-checked way to define a Rust function as the entry point. Let's rewrite our entry point function to use this macro:

// in src/main.rs

use bootloader::{bootinfo::BootInfo, entry_point};

entry_point!(kernel_main);

#[cfg(not(test))]
fn kernel_main(boot_info: &'static BootInfo) -> ! {
    […] // initialize GDT, IDT, PICS

    let mut recursive_page_table = unsafe {
        memory::init(boot_info.p4_table_addr as usize)
    };

    […] // create and test example mapping

    println!("It did not crash!");
    blog_os::hlt_loop();
}

We no longer need to use extern "C" or no_mangle for our entry point, as the macro defines the real lower level _start entry point for us. The kernel_main function is now a completely normal Rust function, so we can choose an arbitrary name for it. The important thing is that it is type-checked so that a compilation error occurs when we now try to modify the function signature in any way, for example adding an argument or changing the argument type.

Note that we now pass boot_info.p4_table_addr instead of a hardcoded address to our memory::init. Thus our code continues to work even if a future version of the bootloader chooses a different entry of the level 4 page table for the recursive mapping.
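To make the dependency on the recursive entry concrete, the following sketch computes the virtual address of the level 4 table for a given recursive index. The function name level_4_table_addr is hypothetical; the calculation follows the usual recursive-mapping scheme, in which the recursive index is used for all four table levels and the result is then sign-extended:

```rust
/// Virtual address of the level 4 table when entry `recursive_index` of the
/// level 4 table points back to the table itself.
fn level_4_table_addr(recursive_index: u64) -> u64 {
    assert!(recursive_index < 512, "a page table has only 512 entries");
    let r = recursive_index;
    // Use the recursive index for all four table levels; the page offset is 0.
    let addr = (r << 39) | (r << 30) | (r << 21) | (r << 12);
    // Copy bit 47 into bits 48 to 63 so that the address is sign-extended.
    if addr & (1 << 47) != 0 {
        addr | 0xffff_0000_0000_0000
    } else {
        addr
    }
}
```

For index 511 this returns 0o_177777_777_777_777_777_0000, the address we hardcoded before. Any other index gives a different address, which is why reading p4_table_addr from the boot information is the more robust choice.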

Allocating Frames

Now that we have access to the memory map through the boot information we can create a proper frame allocator on top. We start with a generic skeleton:

// in src/memory.rs

pub struct BootInfoFrameAllocator<I> where I: Iterator<Item = PhysFrame> {
    frames: I,
}

impl<I> FrameAllocator<Size4KiB> for BootInfoFrameAllocator<I>
    where I: Iterator<Item = PhysFrame>
{
    fn allocate_frame(&mut self) -> Option<PhysFrame> {
        self.frames.next()
    }
}

The frames field can be initialized with an arbitrary Iterator of frames. This allows us to just delegate allocate_frame calls to the Iterator::next method.

The initialization of the BootInfoFrameAllocator happens in a new init_frame_allocator function:

// in src/memory.rs

use bootloader::bootinfo::{MemoryMap, MemoryRegionType};

/// Create a FrameAllocator from the passed memory map
pub fn init_frame_allocator(
    memory_map: &'static MemoryMap,
) -> BootInfoFrameAllocator<impl Iterator<Item = PhysFrame>> {
    // get usable regions from memory map
    let regions = memory_map
        .iter()
        .filter(|r| r.region_type == MemoryRegionType::Usable);
    // map each region to its address range
    let addr_ranges = regions.map(|r| r.range.start_addr()..r.range.end_addr());
    // transform to an iterator of frame start addresses
    let frame_addresses = addr_ranges.flat_map(|r| r.into_iter().step_by(4096));
    // create `PhysFrame` types from the start addresses
    let frames = frame_addresses.map(|addr| {
        PhysFrame::containing_address(PhysAddr::new(addr))
    });

    BootInfoFrameAllocator { frames }
}

This function uses iterator combinator methods to transform the initial MemoryMap into an iterator of usable physical frames:

First, we call the iter method to convert the memory map to an iterator of MemoryRegions. Then we use the filter method to skip any reserved or otherwise unavailable regions. The bootloader updates the memory map for all the mappings it creates, so frames that are used by our kernel (code, data or stack) or to store the boot information are already marked as InUse or similar. Thus we can be sure that Usable frames are not used somewhere else.
In the second step, we use the map combinator and Rust's range syntax to transform our iterator of memory regions to an iterator of address ranges.
The third step is the most complicated: We convert each range to an iterator through the into_iter method and then choose every 4096th address using step_by. Since 4096 bytes (= 4 KiB) is the page size, we get the start address of each frame. The bootloader page-aligns all usable memory areas so that we don't need any alignment or rounding code here. By using flat_map instead of map, we get an Iterator<Item = u64> instead of an Iterator<Item = Iterator<Item = u64>>.
In the final step, we convert the start addresses to PhysFrame types to construct the desired Iterator<Item = PhysFrame>. We then use this iterator to create and return a new BootInfoFrameAllocator.
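The iterator chain can be tried out on a normal host without the bootloader types. The sketch below uses a hypothetical Region struct and RegionType enum as stand-ins for the real memory map types, and it yields plain u64 start addresses instead of PhysFrame values:

```rust
/// Hypothetical stand-in for the region type of the bootloader memory map.
#[derive(Clone, Copy, PartialEq)]
enum RegionType {
    Usable,
    Reserved,
}

/// Hypothetical stand-in for a memory region (`end` is exclusive).
struct Region {
    start: u64,
    end: u64,
    kind: RegionType,
}

/// Yields the start address of every 4 KiB frame inside the usable regions.
fn usable_frame_addresses(regions: &[Region]) -> impl Iterator<Item = u64> + '_ {
    regions
        .iter()
        // step 1: skip everything that is not usable
        .filter(|r| r.kind == RegionType::Usable)
        // step 2: turn each region into an address range
        .map(|r| r.start..r.end)
        // step 3: pick every 4096th address and flatten the nested iterators
        .flat_map(|range| range.step_by(4096))
}
```

For a usable region 0x0..0x2000, a reserved region 0x2000..0x5000, and a usable region 0x5000..0x8000, the iterator yields 0x0, 0x1000, 0x5000, 0x6000, and 0x7000. Like the kernel code, the sketch relies on the region bounds being page-aligned.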

We can now modify our kernel_main function to pass a BootInfoFrameAllocator instance instead of an EmptyFrameAllocator:

// in src/main.rs

#[cfg(not(test))]
fn kernel_main(boot_info: &'static BootInfo) -> ! {
    […] // initialize GDT, IDT, PICS

    use x86_64::structures::paging::{PageTable, RecursivePageTable};

    let mut recursive_page_table = unsafe {
        memory::init(boot_info.p4_table_addr as usize)
    };
    // new
    let mut frame_allocator = memory::init_frame_allocator(&boot_info.memory_map);

    blog_os::memory::create_example_mapping(&mut recursive_page_table, &mut frame_allocator);
    unsafe { (0xdeadbeaf900 as *mut u64).write_volatile(0xf021_f077_f065_f04e)};

    println!("It did not crash!");
    blog_os::hlt_loop();
}

Now the mapping succeeds and we see the black-on-white "New!" on the screen again. Behind the scenes, the map_to method creates the missing page tables in the following way:

Allocate an unused frame from the passed frame_allocator.
Map the entry of the higher level table to that frame. Now the frame is accessible through the recursive page table.
Zero the frame to create a new, empty page table.
Continue with the next table level.
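The steps above can be imitated with a toy model that runs on a normal host. This is only a sketch of the idea, not the implementation in the x86_64 crate: the ToyPageTables type is hypothetical, it stores tables in a HashMap keyed by frame address instead of real memory, and it does not model the recursive mapping at all.

```rust
use std::collections::HashMap;

const ENTRIES: usize = 512;

/// One toy page table: each entry optionally holds a frame address.
type Table = [Option<u64>; ENTRIES];

/// Toy "physical memory" that only stores page tables, keyed by frame address.
struct ToyPageTables {
    tables: HashMap<u64, Table>,
    level_4_frame: u64,
}

impl ToyPageTables {
    fn new(level_4_frame: u64) -> Self {
        let mut tables = HashMap::new();
        tables.insert(level_4_frame, [None; ENTRIES]);
        ToyPageTables { tables, level_4_frame }
    }

    /// Maps `page` to `frame` and creates missing tables with frames
    /// taken from `allocator`.
    fn map_to(
        &mut self,
        page: u64,
        frame: u64,
        allocator: &mut impl Iterator<Item = u64>,
    ) -> Result<(), &'static str> {
        let mut table_frame = self.level_4_frame;
        // Walk from the level 4 table down to the level 2 table.
        for level in (2..=4u32).rev() {
            let index = ((page >> (12 + 9 * (level - 1))) & 0x1ff) as usize;
            let entry = self.tables[&table_frame][index];
            table_frame = match entry {
                Some(next_table) => next_table,
                None => {
                    // Allocate an unused frame.
                    let new_frame = allocator.next().ok_or("FrameAllocationFailed")?;
                    // Point the entry of the higher level table to it.
                    self.tables.get_mut(&table_frame).unwrap()[index] = Some(new_frame);
                    // "Zero" the frame, i.e. create an empty table in it.
                    self.tables.insert(new_frame, [None; ENTRIES]);
                    new_frame
                }
            };
        }
        // Finally write the level 1 entry that maps the page itself.
        let index = ((page >> 12) & 0x1ff) as usize;
        self.tables.get_mut(&table_frame).unwrap()[index] = Some(frame);
        Ok(())
    }
}
```

With an empty allocator, mapping page 0xdeadbeaf000 fails with the same FrameAllocationFailed message as before. With an allocator that hands out frames, the call creates one new level 3, level 2, and level 1 table. A second mapping in the same 2 MiB region then needs no further frames.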

Summary

In this post we learned how a recursive level 4 table entry can be used to map all page table frames to calculable virtual addresses. We used this technique to implement an address translation function and to create a new mapping in the page tables.

We saw that the creation of new mappings requires unused frames for creating new page tables. Such a frame allocator can be implemented on top of the boot information structure that the bootloader passes to our kernel.

What's next?

The next post will create a heap memory region for our kernel, which will allow us to allocate memory and use various collection types.

Introduction to Paging

Mon, 14 Jan 2019 00:00:00 +0000

This post introduces paging, a very common memory management scheme that we will also use for our operating system. It explains why memory isolation is needed, how segmentation works, what virtual memory is, and how paging solves memory fragmentation issues. It also explores the layout of multilevel page tables on the x86_64 architecture.

Memory Protection

One main task of an operating system is to isolate programs from each other. Your web browser shouldn't be able to interfere with your text editor, for example. To achieve this goal, operating systems utilize hardware functionality to ensure that memory areas of one process are not accessible by other processes. There are different approaches depending on the hardware and the OS implementation.

As an example, some ARM Cortex-M processors (used for embedded systems) have a Memory Protection Unit (MPU), which allows you to define a small number (e.g., 8) of memory regions with different access permissions (e.g., no access, read-only, read-write). On each memory access, the MPU ensures that the address is in a region with correct access permissions and throws an exception otherwise. By changing the regions and access permissions on each process switch, the operating system can ensure that each process only accesses its own memory and thus isolates processes from each other.

On x86, the hardware supports two different approaches to memory protection: segmentation and paging.

Segmentation

Segmentation was already introduced in 1978, originally to increase the amount of addressable memory. The situation back then was that CPUs only used 16-bit addresses, which limited the amount of addressable memory to 64 KiB. To make more than these 64 KiB accessible, additional segment registers were introduced, each containing an offset address. The CPU automatically added this offset on each memory access, so that up to 1 MiB of memory was accessible.

The segment register is chosen automatically by the CPU depending on the kind of memory access: For fetching instructions, the code segment CS is used, and for stack operations (push/pop), the stack segment SS is used. Other instructions use the data segment DS or the extra segment ES. Later, two additional segment registers, FS and GS, were added, which can be used freely.

In the first version of segmentation, the segment registers directly contained the offset and no access control was performed. This was changed later with the introduction of the protected mode. When the CPU runs in this mode, the segment registers contain an index into a local or global descriptor table, whose entries contain, in addition to an offset address, the segment size and access permissions. By loading separate global/local descriptor tables for each process, which confine memory accesses to the process's own memory areas, the OS can isolate processes from each other.

By modifying the memory addresses before the actual access, segmentation already employed a technique that is now used almost everywhere: virtual memory.

Virtual Memory

The idea behind virtual memory is to abstract away the memory addresses from the underlying physical storage device. Instead of directly accessing the storage device, a translation step is performed first. For segmentation, the translation step is to add the offset address of the active segment. Imagine a program accessing memory address 0x1234000 in a segment with an offset of 0x1111000: The address that is really accessed is 0x2345000.

To differentiate the two address types, addresses before the translation are called virtual, and addresses after the translation are called physical. One important difference between these two kinds of addresses is that physical addresses are unique and always refer to the same distinct memory location. Virtual addresses, on the other hand, depend on the translation function. It is entirely possible that two different virtual addresses refer to the same physical address. Also, identical virtual addresses can refer to different physical addresses when they use different translation functions.

An example where this property is useful is running the same program twice in parallel:

Here the same program runs twice, but with different translation functions. The first instance has a segment offset of 100, so that its virtual addresses 0–150 are translated to the physical addresses 100–250. The second instance has an offset of 300, which translates its virtual addresses 0–150 to physical addresses 300–450. This allows both programs to run the same code and use the same virtual addresses without interfering with each other.
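The segment-based translation fits in a few lines of Rust. The Segment type below is a hypothetical simplification: it models only an offset and a size, where the size check corresponds to the limit that protected-mode segmentation enforces.

```rust
/// A simplified segment: virtual addresses `0..size` are shifted by `offset`.
struct Segment {
    offset: u64,
    size: u64,
}

impl Segment {
    /// Translates a virtual address to a physical address, or returns
    /// `None` if the address lies outside of the segment.
    fn translate(&self, virtual_addr: u64) -> Option<u64> {
        if virtual_addr < self.size {
            Some(self.offset + virtual_addr)
        } else {
            None
        }
    }
}
```

Two segments with offsets 100 and 300 translate the same virtual address 20 to the physical addresses 120 and 320, so both program instances can use identical virtual addresses without touching each other's memory.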

Another advantage is that programs can now be placed at arbitrary physical memory locations, even if they use completely different virtual addresses. Thus, the OS can utilize the full amount of available memory without needing to recompile programs.

Fragmentation

The differentiation between virtual and physical addresses makes segmentation really powerful. However, it has the problem of fragmentation. As an example, imagine that we want to run a third copy of the program we saw above:

There is no way to map the third instance of the program to physical memory without overlapping, even though there is more than enough free memory available. The problem is that we need continuous memory and can't use the small free chunks.

One way to combat this fragmentation is to pause execution, move the used parts of the memory closer together, update the translation, and then resume execution:

Now there is enough continuous space to start the third instance of our program.

The disadvantage of this defragmentation process is that it needs to copy large amounts of memory, which decreases performance. It also needs to be done regularly before the memory becomes too fragmented. This makes performance unpredictable since programs are paused at random times and might become unresponsive.

The fragmentation problem is one of the reasons that segmentation is no longer used by most systems. In fact, segmentation is not even supported in 64-bit mode on x86 anymore. Instead, paging is used, which completely avoids the fragmentation problem.

Paging

The idea is to divide both the virtual and physical memory space into small, fixed-size blocks. The blocks of the virtual memory space are called pages, and the blocks of the physical address space are called frames. Each page can be individually mapped to a frame, which makes it possible to split larger memory regions across non-continuous physical frames.

The advantage of this becomes visible if we recap the example of the fragmented memory space, but use paging instead of segmentation this time:

In this example, we have a page size of 50 bytes, which means that each of our memory regions is split across three pages. Each page is mapped to a frame individually, so a continuous virtual memory region can be mapped to non-continuous physical frames. This allows us to start the third instance of the program without performing any defragmentation before.
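A flat page table for this toy setup could be sketched as follows. The page size of 50 comes from the example, while the function name translate_paged and the frame addresses in the usage note are made up for illustration:

```rust
/// Toy page size from the example (real page sizes are powers of two).
const PAGE_SIZE: u64 = 50;

/// Translates a virtual address with a flat page table, where
/// `page_table[n]` holds the start address of the frame for page `n`.
fn translate_paged(page_table: &[Option<u64>], virtual_addr: u64) -> Option<u64> {
    let page_number = (virtual_addr / PAGE_SIZE) as usize;
    let offset = virtual_addr % PAGE_SIZE;
    // An out-of-range page number or an empty entry means "not mapped".
    let frame_start = page_table.get(page_number).copied().flatten()?;
    Some(frame_start + offset)
}
```

With the table [Some(500), Some(100), Some(350)], the continuous virtual range 0–150 is spread over three unrelated frames: address 75 lies in page 1 at offset 25 and therefore translates to 125.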

Hidden Fragmentation

Compared to segmentation, paging uses lots of small, fixed-sized memory regions instead of a few large, variable-sized regions. Since every frame has the same size, there are no frames that are too small to be used, so no fragmentation occurs.

Or it seems like no fragmentation occurs. There is still some hidden kind of fragmentation, the so-called internal fragmentation. Internal fragmentation occurs because not every memory region is an exact multiple of the page size. Imagine a program of size 101 in the above example: It would still need three pages of size 50, so it would occupy 49 bytes more than needed. To differentiate the two types of fragmentation, the kind of fragmentation that happens when using segmentation is called external fragmentation.

Internal fragmentation is unfortunate but often better than the external fragmentation that occurs with segmentation. It still wastes memory, but does not require defragmentation and makes the amount of fragmentation predictable (on average half a page per memory region).
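The amount of internal fragmentation for a single memory region is easy to calculate. The two helper functions below are hypothetical names used only for this illustration:

```rust
/// Number of pages needed to hold `region_size` bytes (rounds up).
fn pages_needed(region_size: u64, page_size: u64) -> u64 {
    (region_size + page_size - 1) / page_size
}

/// Bytes that are allocated but unused in the last page of the region.
fn internal_fragmentation(region_size: u64, page_size: u64) -> u64 {
    pages_needed(region_size, page_size) * page_size - region_size
}
```

For the program of size 101 and a page size of 50, this gives 3 pages and 49 wasted bytes. With real 4 KiB pages, a region of 4097 bytes wastes 4095 bytes, which is the worst case of almost a full page.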

Page Tables

We saw that each of the potentially millions of pages is individually mapped to a frame. This mapping information needs to be stored somewhere. Segmentation uses an individual segment selector register for each active memory region, which is not possible for paging since there are way more pages than registers. Instead, paging uses a table structure called page table to store the mapping information.

For our above example, the page tables would look like this:

We see that each program instance has its own page table. A pointer to the currently active table is stored in a special CPU register. On x86, this register is called CR3. It is the job of the operating system to load this register with the pointer to the correct page table before running each program instance.

On each memory access, the CPU reads the table pointer from the register and looks up the mapped frame for the accessed page in the table. This is entirely done in hardware and completely invisible to the running program. To speed up the translation process, many CPU architectures have a special cache that remembers the results of the last translations.

Depending on the architecture, page table entries can also store attributes such as access permissions in a flags field. In the above example, the "r/w" flag makes the page both readable and writable.

Multilevel Page Tables

The simple page tables we just saw have a problem in larger address spaces: they waste memory. For example, imagine a program that uses the four virtual pages 0, 1_000_000, 1_000_050, and 1_000_100 (we use _ as a thousands separator):

It only needs 4 physical frames, but the page table has over a million entries. We can't omit the empty entries because then the CPU would no longer be able to jump directly to the correct entry in the translation process (e.g., it is no longer guaranteed that the fourth page uses the fourth entry).

To reduce the wasted memory, we can use a two-level page table. The idea is that we use different page tables for different address regions. An additional table called level 2 page table contains the mapping between address regions and (level 1) page tables.

This is best explained by an example. Let's define that each level 1 page table is responsible for a region of size 10_000. Then the following tables would exist for the above example mapping:

Page 0 falls into the first 10_000 byte region, so it uses the first entry of the level 2 page table. This entry points to level 1 page table T1, which specifies that page 0 points to frame 0.

The pages 1_000_000, 1_000_050, and 1_000_100 all fall into the 100th 10_000 byte region, so they use the 100th entry of the level 2 page table. This entry points to a different level 1 page table T2, which maps the three pages to frames 100, 150, and 200. Note that the page address in level 1 tables does not include the region offset. For example, the entry for page 1_000_050 is just 50.

We still have 100 empty entries in the level 2 table, but far fewer than the million empty entries before. The reason for these savings is that we don't need to create level 1 page tables for the unmapped memory regions between 10_000 and 1_000_000.
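The two-level lookup from this example can be written down as a small sketch. It keeps the toy values (page size 50, region size 10_000) and the frame numbers from the text, while the type and function names are assumptions made for illustration:

```rust
const PAGE_SIZE: usize = 50;
const REGION_SIZE: usize = 10_000;
const PAGES_PER_REGION: usize = REGION_SIZE / PAGE_SIZE;

/// Level 1 table: indexed by the page number within its region.
type Level1Table = Vec<Option<u64>>;
/// Level 2 table: indexed by the region number.
type Level2Table = Vec<Option<Level1Table>>;

/// Translates a virtual address by walking both table levels.
fn translate_two_level(level_2: &Level2Table, virtual_addr: usize) -> Option<u64> {
    let region = virtual_addr / REGION_SIZE;
    let page_in_region = (virtual_addr % REGION_SIZE) / PAGE_SIZE;
    let offset = (virtual_addr % PAGE_SIZE) as u64;
    // No level 1 table exists for unmapped regions.
    let level_1 = level_2.get(region)?.as_ref()?;
    let frame = level_1.get(page_in_region).copied().flatten()?;
    Some(frame + offset)
}

/// Builds the tables of the example: T1 for region 0 and T2 for region 100.
fn example_tables() -> Level2Table {
    let mut t1: Level1Table = vec![None; PAGES_PER_REGION];
    t1[0] = Some(0); // page 0 -> frame 0
    let mut t2: Level1Table = vec![None; PAGES_PER_REGION];
    t2[0] = Some(100); // page 1_000_000 -> frame 100
    t2[1] = Some(150); // page 1_000_050 -> frame 150
    t2[2] = Some(200); // page 1_000_100 -> frame 200
    let mut level_2: Level2Table = vec![None; 101];
    level_2[0] = Some(t1);
    level_2[100] = Some(t2);
    level_2
}
```

Only two level 1 tables are allocated here. A lookup of address 500_000 stops at the empty level 2 entry for region 50 and returns None without touching any level 1 table.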

The principle of two-level page tables can be extended to three, four, or more levels. Then the page table register points to the highest level table, which points to the next lower level table, which points to the next lower level, and so on. The level 1 page table then points to the mapped frame. The principle in general is called a multilevel or hierarchical page table.

Now that we know how paging and multilevel page tables work, we can look at how paging is implemented in the x86_64 architecture (we assume in the following that the CPU runs in 64-bit mode).

Paging on x86_64

The x86_64 architecture uses a 4-level page table and a page size of 4 KiB. Each page table, independent of the level, has a fixed size of 512 entries. Each entry has a size of 8 bytes, so each table is 512 * 8 B = 4 KiB large and thus fits exactly into one page.

The page table index for each level is derived directly from the virtual address:

We see that each table index consists of 9 bits, which makes sense because each table has 2^9 = 512 entries. The lowest 12 bits are the offset in the 4 KiB page (2^12 bytes = 4 KiB). Bits 48 to 63 are discarded, which means that x86_64 is not really 64-bit since it only supports 48-bit addresses.

Even though bits 48 to 63 are discarded, they can't be set to arbitrary values. Instead, all bits in this range have to be copies of bit 47 in order to keep addresses unique and allow future extensions like the 5-level page table. This is called sign-extension because it's very similar to the sign extension in two's complement. When an address is not correctly sign-extended, the CPU throws an exception.
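The sign-extension rule can be checked with simple bit operations. The following two functions are a minimal sketch for 48-bit virtual addresses, and their names are chosen only for this illustration:

```rust
/// Returns true if the upper 16 bits of `addr` are all copies of bit 47.
fn is_sign_extended(addr: u64) -> bool {
    // After the shift, bits 47 to 63 remain (17 bits in total).
    let upper = addr >> 47;
    upper == 0 || upper == 0x1ffff
}

/// Overwrites the upper 16 bits of `addr` with copies of bit 47.
fn sign_extend_48(addr: u64) -> u64 {
    // Move bit 47 into the sign position, then shift back arithmetically.
    (((addr << 16) as i64) >> 16) as u64
}
```

For example, 0x0000_8000_0000_0000 has bit 47 set but the upper bits cleared, so it is not a valid address. Sign-extending it yields 0xffff_8000_0000_0000, the lowest address of the upper half of the address space.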

It's worth noting that the recent "Ice Lake" Intel CPUs optionally support 5-level page tables to extend virtual addresses from 48-bit to 57-bit. Given that optimizing our kernel for a specific CPU does not make sense at this stage, we will only work with standard 4-level page tables in this post.

Example Translation

Let's go through an example to understand how the translation process works in detail:

The physical address of the currently active level 4 page table, which is the root of the 4-level page table, is stored in the CR3 register. Each page table entry then points to the physical frame of the next level table. The entry of the level 1 table then points to the mapped frame. Note that all addresses in the page tables are physical instead of virtual, because otherwise the CPU would need to translate those addresses too (which could cause a never-ending recursion).

The above page table hierarchy maps two pages (in blue). From the page table indices, we can deduce that the virtual addresses of these two pages are 0x803FE7F000 and 0x803FE00000. Let's see what happens when the program tries to read from address 0x803FE7F5CE. First, we convert the address to binary and determine the page table indices and the page offset for the address:

With these indices, we can now walk the page table hierarchy to determine the mapped frame for the address:

We start by reading the address of the level 4 table out of the CR3 register.
The level 4 index is 1, so we look at the entry with index 1 of that table, which tells us that the level 3 table is stored at address 16 KiB.
We load the level 3 table from that address and look at the entry with index 0, which points us to the level 2 table at 24 KiB.
The level 2 index is 511, so we look at the last entry of that table to find out the address of the level 1 table.
Through the entry with index 127 of the level 1 table, we finally find out that the page is mapped to frame 12 KiB, or 0x3000 in hexadecimal.
The final step is to add the page offset to the frame address to get the physical address 0x3000 + 0x5ce = 0x35ce.
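The index calculation used in this walk can be reproduced with a few shifts and masks. The function names below are hypothetical, and the frame address 0x3000 is taken from the example hierarchy rather than read from real page tables:

```rust
/// Splits a virtual address into its four table indices
/// (ordered from level 4 down to level 1) and the page offset.
fn split_virtual_address(addr: u64) -> ([u16; 4], u16) {
    // Each index is 9 bits wide; the level 1 index starts at bit 12.
    let index = |level: u32| ((addr >> (12 + 9 * (level - 1))) & 0x1ff) as u16;
    let offset = (addr & 0xfff) as u16;
    ([index(4), index(3), index(2), index(1)], offset)
}

/// Combines the start address of the mapped frame with the page offset.
fn physical_address(frame_start: u64, page_offset: u16) -> u64 {
    frame_start + u64::from(page_offset)
}
```

For 0x803FE7F5CE, the function returns the indices 1, 0, 511, and 127 with the offset 0x5ce. Adding that offset to the frame at 0x3000 gives the physical address 0x35ce.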

The permissions for the page in the level 1 table are r, which means read-only. The hardware enforces these permissions and would throw an exception if we tried to write to that page. Permissions in higher-level tables restrict the possible permissions in lower levels, so if we set the level 3 entry to read-only, no pages that use this entry can be writable, even if lower levels specify read/write permissions.

It's important to note that even though this example used only a single instance of each table, there are typically multiple instances of each level in each address space. At maximum, there are:

one level 4 table,
512 level 3 tables (because the level 4 table has 512 entries),
512 * 512 level 2 tables (because each of the 512 level 3 tables has 512 entries), and
512 * 512 * 512 level 1 tables (512 entries for each level 2 table).
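These numbers show why page tables are only created on demand. As a quick calculation, with helper names invented for this sketch, we can add up how much memory a fully populated hierarchy would need:

```rust
const ENTRIES_PER_TABLE: u64 = 512;
const TABLE_SIZE_BYTES: u64 = 4096;

/// Maximum number of tables at the given level (4 = root, 1 = lowest).
fn max_tables_at_level(level: u32) -> u64 {
    assert!((1..=4).contains(&level));
    ENTRIES_PER_TABLE.pow(4 - level)
}

/// Memory needed if every table of every level existed.
fn max_page_table_bytes() -> u64 {
    (1..=4).map(max_tables_at_level).sum::<u64>() * TABLE_SIZE_BYTES
}
```

The level 1 tables alone would take 512 * 512 * 512 * 4 KiB = 512 GiB, so the total is slightly above 513 GiB for a single address space.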

Page Table Format

Page tables on the x86_64 architecture are basically an array of 512 entries. In Rust syntax:

#[repr(align(4096))]
pub struct PageTable {
    entries: [PageTableEntry; 512],
}

As indicated by the repr attribute, page tables need to be page-aligned, i.e., aligned on a 4 KiB boundary. This requirement guarantees that a page table always fills a complete page and allows an optimization that makes entries very compact.

Each entry is 8 bytes (64 bits) large and has the following format:

Bit(s)	Name	Meaning
0	present	the page is currently in memory
1	writable	it's allowed to write to this page
2	user accessible	if not set, only kernel mode code can access this page
3	write-through caching	writes go directly to memory
4	disable cache	no cache is used for this page
5	accessed	the CPU sets this bit when this page is used
6	dirty	the CPU sets this bit when a write to this page occurs
7	huge page/null	must be 0 in P1 and P4, creates a 1 GiB page in P3, creates a 2 MiB page in P2
8	global	page isn't flushed from caches on address space switch (PGE bit of CR4 register must be set)
9-11	available	can be used freely by the OS
12-51	physical address	the page-aligned 52-bit physical address of the frame or the next page table
52-62	available	can be used freely by the OS
63	no execute	forbid executing code on this page (the NXE bit in the EFER register must be set)

We see that only bits 12–51 are used to store the physical frame address. The remaining bits are used as flags or can be freely used by the operating system. This is possible because we always point to a 4096-byte aligned address, either to a page-aligned page table or to the start of a mapped frame. This means that bits 0–11 are always zero, so there is no reason to store these bits because the hardware can just set them to zero before using the address. The same is true for bits 52–63, because the x86_64 architecture only supports 52-bit physical addresses (similar to how it only supports 48-bit virtual addresses).
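The bit layout translates directly into masks. The constants and functions below are a hand-written sketch of the format described in the table, not the types of the x86_64 crate:

```rust
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const NO_EXECUTE: u64 = 1 << 63;
/// Bits 12 to 51 hold the physical address.
const ADDRESS_MASK: u64 = 0x000f_ffff_ffff_f000;

/// Builds an entry from a page-aligned physical address and a set of flags.
fn make_entry(frame_addr: u64, flags: u64) -> u64 {
    assert_eq!(frame_addr & !ADDRESS_MASK, 0, "address must be page-aligned");
    frame_addr | flags
}

/// Extracts the physical address by masking out all flag bits.
fn entry_address(entry: u64) -> u64 {
    entry & ADDRESS_MASK
}

/// Checks whether the given flag bit is set in the entry.
fn entry_has_flag(entry: u64, flag: u64) -> bool {
    entry & flag != 0
}
```

An entry for the VGA buffer frame 0xb8000 with the present and writable flags is stored as 0xb8003. Masking with ADDRESS_MASK recovers 0xb8000 because the low 12 bits of a page-aligned address are always zero.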

Let's take a closer look at the available flags:

The present flag differentiates mapped pages from unmapped ones. It can be used to temporarily swap out pages to disk when the main memory becomes full. When the page is accessed subsequently, a special exception called page fault occurs, to which the operating system can react by reloading the missing page from disk and then continuing the program.
The writable and no execute flags control whether the contents of the page are writable or contain executable instructions, respectively.
The accessed and dirty flags are automatically set by the CPU when a read or write to the page occurs. This information can be leveraged by the operating system, e.g., to decide which pages to swap out or whether the page contents have been modified since the last save to disk.
The write-through caching and disable cache flags allow the control of caches for every page individually.
The user accessible flag makes a page available to userspace code; otherwise, it is only accessible when the CPU is in kernel mode. This feature can be used to make system calls faster by keeping the kernel mapped while a userspace program is running. However, the Spectre vulnerability can allow userspace programs to read these pages nonetheless.
The global flag signals to the hardware that a page is available in all address spaces and thus does not need to be removed from the translation cache (see the section about the TLB below) on address space switches. This flag is commonly used together with a cleared user accessible flag to map the kernel code to all address spaces.
The huge page flag allows the creation of pages of larger sizes by letting the entries of the level 2 or level 3 page tables directly point to a mapped frame. With this bit set, the page size increases by a factor of 512 to either 2 MiB = 512 * 4 KiB for level 2 entries or even 1 GiB = 512 * 2 MiB for level 3 entries. The advantage of using larger pages is that fewer lines of the translation cache and fewer page tables are needed.
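To make the flag semantics a bit more concrete, here is a minimal sketch that tests individual flag bits of a raw entry and derives the mapped page size from the table level and the huge page bit. The bit positions follow the entry format table; all constant and function names are illustrative assumptions, not the API of the x86_64 crate.

```rust
// Bit positions as listed in the entry format table. The names are chosen
// for this sketch and are not taken from the x86_64 crate.
const PRESENT: u64 = 1 << 0;
const USER_ACCESSIBLE: u64 = 1 << 2;
const HUGE_PAGE: u64 = 1 << 7;
const GLOBAL: u64 = 1 << 8;

/// Checks whether a single flag bit is set in a raw entry.
fn has_flag(entry: u64, flag: u64) -> bool {
    entry & flag != 0
}

/// Size in bytes of the page that an entry of the given table level maps.
/// Returns `None` if the entry is not present, points to a next-level
/// table, or uses the huge page bit at a level where it must be 0.
fn mapped_page_size(level: u8, entry: u64) -> Option<u64> {
    const KIB: u64 = 1024;
    if !has_flag(entry, PRESENT) {
        return None;
    }
    match (level, has_flag(entry, HUGE_PAGE)) {
        (1, false) => Some(4 * KIB),            // normal 4 KiB page
        (2, true) => Some(512 * 4 * KIB),       // 2 MiB huge page
        (3, true) => Some(512 * 512 * 4 * KIB), // 1 GiB huge page
        _ => None,
    }
}

/// The combination described for kernel code: present and global,
/// but not accessible from userspace.
fn is_kernel_global_mapping(entry: u64) -> bool {
    has_flag(entry, PRESENT)
        && has_flag(entry, GLOBAL)
        && !has_flag(entry, USER_ACCESSIBLE)
}
```

A level 2 entry with `PRESENT | HUGE_PAGE` set reports 2 MiB, while the same entry without the huge page bit points to a level 1 table and therefore reports `None`.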

The x86_64 crate provides types for page tables and their entries, so we don’t need to create these structures ourselves.

The Translation Lookaside Buffer

A 4-level page table makes the translation of virtual addresses expensive because each translation requires four memory accesses. To improve performance, the x86_64 architecture caches the last few translations in the so-called translation lookaside buffer (TLB). This allows skipping the translation when it is still cached.

Unlike the other CPU caches, the TLB is not fully transparent and does not update or remove translations when the contents of page tables change. This means that the kernel must manually update the TLB whenever it modifies a page table. To do this, there is a special CPU instruction called invlpg (“invalidate page”) that removes the translation for the specified page from the TLB, so that it is loaded again from the page table on the next access. The TLB can also be flushed completely by reloading the CR3 register, which simulates an address space switch. The x86_64 crate provides Rust functions for both variants in the tlb module.
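The effect of a missing flush can be shown with a toy model that runs as a normal Rust program. This is only a sketch under simplifying assumptions: pages and frames are plain numbers, the `ToyMmu` type and its methods are invented for this example, and real hardware of course does not use hash maps.

```rust
use std::collections::HashMap;

/// Toy model of an address space: a page table plus a TLB that caches
/// translations. Invented for illustration only.
struct ToyMmu {
    page_table: HashMap<u64, u64>,
    tlb: HashMap<u64, u64>,
}

impl ToyMmu {
    fn new() -> Self {
        ToyMmu { page_table: HashMap::new(), tlb: HashMap::new() }
    }

    /// Changes a mapping in the page table without touching the TLB.
    fn map(&mut self, page: u64, frame: u64) {
        self.page_table.insert(page, frame);
    }

    /// Translates a page, preferring a cached TLB entry over the page table.
    fn translate(&mut self, page: u64) -> Option<u64> {
        if let Some(&frame) = self.tlb.get(&page) {
            return Some(frame);
        }
        let frame = *self.page_table.get(&page)?;
        self.tlb.insert(page, frame);
        Some(frame)
    }

    /// Counterpart of `invlpg`: drops the cached translation of one page.
    fn flush(&mut self, page: u64) {
        self.tlb.remove(&page);
    }

    /// Counterpart of reloading CR3: drops all cached translations.
    fn flush_all(&mut self) {
        self.tlb.clear();
    }
}
```

If page 1 is first mapped to frame 100, translated once, and then remapped to frame 200, `translate(1)` keeps returning 100 until `flush(1)` or `flush_all()` is called. This is the stale-translation behavior that the manual flush prevents.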

It is important to remember to flush the TLB on each page table modification because otherwise, the CPU might keep using the old translation, which can lead to non-deterministic bugs that are very hard to debug.

Implementation

One thing that we did not mention yet: Our kernel already runs on paging. The bootloader that we added in the “A minimal Rust Kernel” post has already set up a 4-level paging hierarchy that maps every page of our kernel to a physical frame. The bootloader does this because paging is mandatory in 64-bit mode on x86_64.

This means that every memory address that we used in our kernel was a virtual address. Accessing the VGA buffer at address 0xb8000 only worked because the bootloader identity mapped that memory page, which means that it mapped the virtual page 0xb8000 to the physical frame 0xb8000.

Paging makes our kernel already relatively safe, since every memory access that is out of bounds causes a page fault exception instead of writing to random physical memory. The bootloader even sets the correct access permissions for each page, which means that only the pages containing code are executable and only data pages are writable.

Page Faults

Let’s try to cause a page fault by accessing some memory outside of our kernel. First, we create a page fault handler and register it in our IDT, so that we see a page fault exception instead of a generic double fault:

// in src/interrupts.rs

lazy_static! {
    static ref IDT: InterruptDescriptorTable = {
        let mut idt = InterruptDescriptorTable::new();

        […]

        idt.page_fault.set_handler_fn(page_fault_handler); // new

        idt
    };
}

use x86_64::structures::idt::PageFaultErrorCode;
use crate::hlt_loop;

extern "x86-interrupt" fn page_fault_handler(
    stack_frame: InterruptStackFrame,
    error_code: PageFaultErrorCode,
) {
    use x86_64::registers::control::Cr2;

    println!("EXCEPTION: PAGE FAULT");
    println!("Accessed Address: {:?}", Cr2::read());
    println!("Error Code: {:?}", error_code);
    println!("{:#?}", stack_frame);
    hlt_loop();
}

The CR2 register is automatically set by the CPU on a page fault and contains the accessed virtual address that caused the page fault. We use the Cr2::read function of the x86_64 crate to read and print it. The PageFaultErrorCode type provides more information about the type of memory access that caused the page fault, for example, whether it was caused by a read or write operation. For this reason, we print it too. We can’t continue execution without resolving the page fault, so we enter a hlt_loop at the end.

Now we can try to access some memory outside our kernel:

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    blog_os::init();

    // new
    let ptr = 0xdeadbeaf as *mut u8;
    unsafe { *ptr = 42; }

    // as before
    #[cfg(test)]
    test_main();

    println!("It did not crash!");
    blog_os::hlt_loop();
}

When we run it, we see that our page fault handler is called:

The CR2 register indeed contains 0xdeadbeaf, the address that we tried to access. The error code tells us through the CAUSED_BY_WRITE flag that the fault occurred while trying to perform a write operation. It tells us even more through the bits that are not set. For example, the fact that the PROTECTION_VIOLATION flag is not set means that the page fault occurred because the target page wasn’t present.

We see that the current instruction pointer is 0x2031b2, so we know that this address points to a code page. Code pages are mapped read-only by the bootloader, so reading from this address works but writing causes a page fault. You can try this by changing the 0xdeadbeaf pointer to 0x2031b2:

// Note: The actual address might be different for you. Use the address that
// your page fault handler reports.
let ptr = 0x2031b2 as *mut u8;

// read from a code page
unsafe { let x = *ptr; }
println!("read worked");

// write to a code page
unsafe { *ptr = 42; }
println!("write worked");

By commenting out the last line, we see that the read access works, but the write access causes a page fault:

We see that the “read worked” message is printed, which indicates that the read operation did not cause any errors. However, instead of the “write worked” message, a page fault occurs. This time the PROTECTION_VIOLATION flag is set in addition to the CAUSED_BY_WRITE flag, which indicates that the page was present, but the operation was not allowed on it. In this case, writes to the page are not allowed since code pages are mapped as read-only.
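The two cases we observed can be summarized in a small decoding sketch. It assumes the usual x86_64 error code layout, in which bit 0 is the protection violation bit and bit 1 is the write bit. The `FaultKind` enum and the `classify` function are hypothetical helpers for this illustration; the kernel itself uses the PageFaultErrorCode type.

```rust
// The two lowest bits of the page fault error code (assumed layout).
const PROTECTION_VIOLATION: u64 = 1 << 0;
const CAUSED_BY_WRITE: u64 = 1 << 1;

/// Hypothetical summary of what the two lowest error code bits tell us.
#[derive(Debug, PartialEq)]
enum FaultKind {
    ReadFromUnmappedPage,
    WriteToUnmappedPage,
    ReadNotPermitted,
    WriteNotPermitted,
}

/// Classifies a raw error code by its present/write bits only.
/// All other bits are ignored in this sketch.
fn classify(error_code: u64) -> FaultKind {
    let page_was_present = error_code & PROTECTION_VIOLATION != 0;
    let was_write = error_code & CAUSED_BY_WRITE != 0;
    match (page_was_present, was_write) {
        (false, false) => FaultKind::ReadFromUnmappedPage,
        (false, true) => FaultKind::WriteToUnmappedPage,
        (true, false) => FaultKind::ReadNotPermitted,
        (true, true) => FaultKind::WriteNotPermitted,
    }
}
```

The write to 0xdeadbeaf corresponds to `WriteToUnmappedPage` (only CAUSED_BY_WRITE set), while the write to the code page corresponds to `WriteNotPermitted` (both flags set).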

Accessing the Page Tables

Let’s try to take a look at the page tables that define how our kernel is mapped:

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    blog_os::init();

    use x86_64::registers::control::Cr3;

    let (level_4_page_table, _) = Cr3::read();
    println!("Level 4 page table at: {:?}", level_4_page_table.start_address());

    […] // test_main(), println(…), and hlt_loop()
}

The Cr3::read function of the x86_64 crate returns the currently active level 4 page table from the CR3 register. It returns a tuple of a PhysFrame and a Cr3Flags type. We are only interested in the frame, so we ignore the second element of the tuple.

When we run it, we see the following output:

Level 4 page table at: PhysAddr(0x1000)

So the currently active level 4 page table is stored at address 0x1000 in physical memory, as indicated by the PhysAddr wrapper type. The question now is: how can we access this table from our kernel?

Accessing physical memory directly is not possible when paging is active, since programs could easily circumvent memory protection and access the memory of other programs otherwise. So the only way to access the table is through some virtual page that is mapped to the physical frame at address 0x1000. This problem of creating mappings for page table frames is a general problem since the kernel needs to access the page tables regularly, for example, when allocating a stack for a new thread.

Solutions to this problem are explained in detail in the next post.

Summary

This post introduced two memory protection techniques: segmentation and paging. While the former uses variable-sized memory regions and suffers from external fragmentation, the latter uses fixed-sized pages and allows much more fine-grained control over access permissions.

Paging stores the mapping information for pages in page tables with one or more levels. The x86_64 architecture uses 4-level page tables and a page size of 4 KiB. The hardware automatically walks the page tables and caches the resulting translations in the translation lookaside buffer (TLB). This buffer is not updated transparently and needs to be flushed manually on page table changes.

We learned that our kernel already runs on top of paging and that illegal memory accesses cause page fault exceptions. We tried to access the currently active page tables, but we weren’t able to do it because the CR3 register stores a physical address that we can’t access directly from our kernel.

What’s next?

The next post explains how to implement support for paging in our kernel. It presents different ways to access physical memory from our kernel, which makes it possible to access the page tables that our kernel runs on. At this point, we are able to implement functions for translating virtual to physical addresses and for creating new mappings in the page tables.

Hardware Interrupts

Mon, 22 Oct 2018 00:00:00 +0000

In this post, we set up the programmable interrupt controller to correctly forward hardware interrupts to the CPU. To handle these interrupts, we add new entries to our interrupt descriptor table, just like we did for our exception handlers. We will learn how to get periodic timer interrupts and how to get input from the keyboard.

Overview

Interrupts provide a way to notify the CPU from attached hardware devices. So instead of letting the kernel periodically check the keyboard for new characters (a process called polling), the keyboard can notify the kernel of each keypress. This is much more efficient because the kernel only needs to act when something happened. It also allows faster reaction times since the kernel can react immediately and not only at the next poll.

Connecting all hardware devices directly to the CPU is not possible. Instead, a separate interrupt controller aggregates the interrupts from all devices and then notifies the CPU:

                                    ____________             _____
               Timer ------------> |            |           |     |
               Keyboard ---------> | Interrupt  |---------> | CPU |
               Other Hardware ---> | Controller |           |_____|
               Etc. -------------> |____________|

Most interrupt controllers are programmable, which means they support different priority levels for interrupts. For example, this makes it possible to give timer interrupts a higher priority than keyboard interrupts to ensure accurate timekeeping.

Unlike exceptions, hardware interrupts occur asynchronously. This means they are completely independent from the executed code and can occur at any time. Thus, we suddenly have a form of concurrency in our kernel with all the potential concurrency-related bugs. Rust’s strict ownership model helps us here because it forbids mutable global state. However, deadlocks are still possible, as we will see later in this post.

The 8259 PIC

The Intel 8259 is a programmable interrupt controller (PIC) introduced in 1976. It has long been replaced by the newer APIC, but its interface is still supported on current systems for backwards compatibility reasons. The 8259 PIC is significantly easier to set up than the APIC, so we will use it to introduce ourselves to interrupts before we switch to the APIC in a later post.

The 8259 has eight interrupt lines and several lines for communicating with the CPU. The typical systems back then were equipped with two instances of the 8259 PIC, one primary and one secondary PIC, connected to one of the interrupt lines of the primary:

                     ____________                          ____________
Real Time Clock --> |            |   Timer -------------> |            |
ACPI -------------> |            |   Keyboard-----------> |            |      _____
Available --------> | Secondary  |----------------------> | Primary    |     |     |
Available --------> | Interrupt  |   Serial Port 2 -----> | Interrupt  |---> | CPU |
Mouse ------------> | Controller |   Serial Port 1 -----> | Controller |     |_____|
Co-Processor -----> |            |   Parallel Port 2/3 -> |            |
Primary ATA ------> |            |   Floppy disk -------> |            |
Secondary ATA ----> |____________|   Parallel Port 1----> |____________|

This graphic shows the typical assignment of interrupt lines. We see that most of the 15 lines have a fixed mapping, e.g., line 4 of the secondary PIC is assigned to the mouse.

Each controller can be configured through two I/O ports, one “command” port and one “data” port. For the primary controller, these ports are 0x20 (command) and 0x21 (data). For the secondary controller, they are 0xa0 (command) and 0xa1 (data). For more information on how the PICs can be configured, see the article on osdev.org.

Implementation

The default configuration of the PICs is not usable because it sends interrupt vector numbers in the range of 0–15 to the CPU. These numbers are already occupied by CPU exceptions. For example, number 8 corresponds to a double fault. To fix this overlapping issue, we need to remap the PIC interrupts to different numbers. The actual range doesn’t matter as long as it does not overlap with the exceptions, but typically the range of 32–47 is chosen, because these are the first free numbers after the 32 exception slots.
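The arithmetic behind the remapping is simple enough to sketch in a few lines: the vector number that reaches the CPU is the configured offset of the PIC plus the line number. The helper names below are assumptions made for this illustration and do not come from any crate.

```rust
/// The first 32 vector numbers are reserved for CPU exceptions.
const EXCEPTION_SLOTS: u8 = 32;
/// Each 8259 PIC has eight interrupt lines.
const LINES_PER_PIC: u8 = 8;

/// Vector number that reaches the CPU for `line` (0–7) of a PIC that was
/// configured with the given offset. Returns `None` for an invalid line
/// or if the sum would not fit into a `u8`.
fn vector_for_line(offset: u8, line: u8) -> Option<u8> {
    if line < LINES_PER_PIC {
        offset.checked_add(line)
    } else {
        None
    }
}

/// Whether a PIC with this offset would collide with the exception vectors.
fn overlaps_exceptions(offset: u8) -> bool {
    offset < EXCEPTION_SLOTS
}
```

With offsets 32 and 40, line 0 of the primary PIC becomes vector 32 and line 4 of the secondary PIC becomes vector 44. An offset of 8, in contrast, would deliver line 0 as vector 8, which is the double fault exception.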

The configuration happens by writing special values to the command and data ports of the PICs. Fortunately, there is already a crate called pic8259, so we don’t need to write the initialization sequence ourselves. However, if you are interested in how it works, check out its source code. It’s fairly small and well documented.

To add the crate as a dependency, we add the following to our project:

# in Cargo.toml

[dependencies]
pic8259 = "0.10.1"

The main abstraction provided by the crate is the ChainedPics struct that represents the primary/secondary PIC layout we saw above. It is designed to be used in the following way:

// in src/interrupts.rs

use pic8259::ChainedPics;
use spin;

pub const PIC_1_OFFSET: u8 = 32;
pub const PIC_2_OFFSET: u8 = PIC_1_OFFSET + 8;

pub static PICS: spin::Mutex<ChainedPics> =
    spin::Mutex::new(unsafe { ChainedPics::new(PIC_1_OFFSET, PIC_2_OFFSET) });

As noted above, we’re setting the offsets for the PICs to the range 32–47. By wrapping the ChainedPics struct in a Mutex, we can get safe mutable access (through the lock method), which we need in the next step. The ChainedPics::new function is unsafe because wrong offsets could cause undefined behavior.

We can now initialize the 8259 PIC in our init function:

// in src/lib.rs

pub fn init() {
    gdt::init();
    interrupts::init_idt();
    unsafe { interrupts::PICS.lock().initialize() }; // new
}

We use the initialize function to perform the PIC initialization. Like the ChainedPics::new function, this function is also unsafe because it can cause undefined behavior if the PIC is misconfigured.

If all goes well, we should continue to see the “It did not crash” message when executing cargo run.

Enabling Interrupts

Until now, nothing happened because interrupts are still disabled in the CPU configuration. This means that the CPU does not listen to the interrupt controller at all, so no interrupts can reach the CPU. Let’s change that:

// in src/lib.rs

pub fn init() {
    gdt::init();
    interrupts::init_idt();
    unsafe { interrupts::PICS.lock().initialize() };
    x86_64::instructions::interrupts::enable();     // new
}

The interrupts::enable function of the x86_64 crate executes the special sti instruction (“set interrupts”) to enable external interrupts. When we try cargo run now, we see that a double fault occurs:

The reason for this double fault is that the hardware timer (the Intel 8253, to be exact) is enabled by default, so we start receiving timer interrupts as soon as we enable interrupts. Since we didn’t define a handler function for it yet, our double fault handler is invoked.

Handling Timer Interrupts

As we see from the graphic above, the timer uses line 0 of the primary PIC. This means that it arrives at the CPU as interrupt 32 (0 + offset 32). Instead of hardcoding index 32, we store it in an InterruptIndex enum:

// in src/interrupts.rs

#[derive(Debug, Clone, Copy)]
#[repr(u8)]
pub enum InterruptIndex {
    Timer = PIC_1_OFFSET,
}

impl InterruptIndex {
    fn as_u8(self) -> u8 {
        self as u8
    }

    fn as_usize(self) -> usize {
        usize::from(self.as_u8())
    }
}

The enum is a C-like enum so that we can directly specify the index for each variant. The repr(u8) attribute specifies that each variant is represented as a u8. We will add more variants for other interrupts in the future.

Now we can add a handler function for the timer interrupt:

// in src/interrupts.rs

use crate::print;

lazy_static! {
    static ref IDT: InterruptDescriptorTable = {
        let mut idt = InterruptDescriptorTable::new();
        idt.breakpoint.set_handler_fn(breakpoint_handler);
        […]
        idt[InterruptIndex::Timer.as_usize()]
            .set_handler_fn(timer_interrupt_handler); // new

        idt
    };
}

extern "x86-interrupt" fn timer_interrupt_handler(
    _stack_frame: InterruptStackFrame)
{
    print!(".");
}

Our timer_interrupt_handler has the same signature as our exception handlers, because the CPU reacts identically to exceptions and external interrupts (the only difference is that some exceptions push an error code). The InterruptDescriptorTable struct implements the IndexMut trait, so we can access individual entries through array indexing syntax.

In our timer interrupt handler, we print a dot to the screen. As the timer interrupt happens periodically, we would expect to see a dot appearing on each timer tick. However, when we run it, we see that only a single dot is printed:

End of Interrupt

The reason is that the PIC expects an explicit “end of interrupt” (EOI) signal from our interrupt handler. This signal tells the controller that the interrupt was processed and that the system is ready to receive the next interrupt. So the PIC thinks we’re still busy processing the first timer interrupt and waits patiently for the EOI signal before sending the next one.

To send the EOI, we use our static PICS struct again:

// in src/interrupts.rs

extern "x86-interrupt" fn timer_interrupt_handler(
    _stack_frame: InterruptStackFrame)
{
    print!(".");

    unsafe {
        PICS.lock()
            .notify_end_of_interrupt(InterruptIndex::Timer.as_u8());
    }
}

The notify_end_of_interrupt function figures out whether the primary or secondary PIC sent the interrupt and then uses the command and data ports to send an EOI signal to the respective controllers. If the secondary PIC sent the interrupt, both PICs need to be notified because the secondary PIC is connected to an input line of the primary PIC.

We need to be careful to use the correct interrupt vector number, otherwise we could accidentally delete an important unsent interrupt or cause our system to hang. This is the reason that the function is unsafe.
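The decision of which controller needs an EOI can be modeled without any hardware access. The following is a sketch of that decision only, not the actual pic8259 implementation. The `Eoi` enum and both functions are invented for this example, and each PIC is assumed to handle the eight vectors starting at its offset.

```rust
/// Which controllers have to receive an EOI for a given vector.
/// Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
enum Eoi {
    Neither,
    PrimaryOnly,
    SecondaryAndPrimary,
}

/// A PIC with the given offset handles the eight vectors
/// `offset..offset + 8`.
fn handles(offset: u8, vector: u8) -> bool {
    vector >= offset && vector - offset < 8
}

/// Interrupts of the secondary PIC pass through the primary PIC,
/// so both controllers must be notified in that case.
fn eoi_targets(primary_offset: u8, secondary_offset: u8, vector: u8) -> Eoi {
    if handles(secondary_offset, vector) {
        Eoi::SecondaryAndPrimary
    } else if handles(primary_offset, vector) {
        Eoi::PrimaryOnly
    } else {
        Eoi::Neither
    }
}
```

With our offsets of 32 and 40, the timer vector 32 only requires an EOI to the primary PIC, whereas vector 44 (the mouse line in the graphic above) requires notifying both.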

When we now execute cargo run we see dots periodically appearing on the screen:

Configuring the Timer

The hardware timer that we use is called the Programmable Interval Timer, or PIT for short. As the name says, it is possible to configure the interval between two interrupts. We won’t go into details here because we will switch to the APIC timer soon, but the OSDev wiki has an extensive article about configuring the PIT.

Deadlocks

We now have a form of concurrency in our kernel: The timer interrupts occur asynchronously, so they can interrupt our _start function at any time. Fortunately, Rust’s ownership system prevents many types of concurrency-related bugs at compile time. One notable exception is deadlocks. Deadlocks occur if a thread tries to acquire a lock that will never become free. Thus, the thread hangs indefinitely.

We can already provoke a deadlock in our kernel. Remember, our println macro calls the vga_buffer::_print function, which locks a global WRITER using a spinlock:

// in src/vga_buffer.rs

[…]

#[doc(hidden)]
pub fn _print(args: fmt::Arguments) {
    use core::fmt::Write;
    WRITER.lock().write_fmt(args).unwrap();
}

It locks the WRITER, calls write_fmt on it, and implicitly unlocks it at the end of the function. Now imagine that an interrupt occurs while the WRITER is locked and the interrupt handler tries to print something too:

Timestep	_start	interrupt_handler
0	calls `println!`	
1	`print` locks `WRITER`	
2		interrupt occurs, handler begins to run
3		calls `println!`
4		`print` tries to lock `WRITER` (already locked)
5		`print` tries to lock `WRITER` (already locked)
…		…
never	unlock `WRITER`

The WRITER is locked, so the interrupt handler waits until it becomes free. But this never happens, because the _start function only continues to run after the interrupt handler returns. Thus, the entire system hangs.
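The situation can be reproduced in an ordinary Rust program without hanging it, by letting the simulated handler use `try_lock` instead of a blocking lock. This is only a sketch: it uses `std::sync::Mutex` instead of the spinlock in our kernel, and the function names are made up. A spinlock in the same position would retry forever instead of returning.

```rust
use std::sync::Mutex;

/// Stand-in for an interrupt handler that wants to print. It only checks
/// whether the lock is available instead of spinning on it.
fn handler_can_lock(writer: &Mutex<String>) -> bool {
    writer.try_lock().is_ok()
}

/// Stand-in for `_print`: locks the writer and is then "interrupted"
/// while the lock is still held. Returns what the handler observed.
fn print_interrupted(writer: &Mutex<String>) -> bool {
    let mut guard = writer.lock().unwrap();
    guard.push('-');
    // The "interrupt" arrives here, before `guard` is dropped.
    handler_can_lock(writer)
}
```

`print_interrupted` returns `false`: the handler finds the lock taken, and since the interrupted code cannot continue until the handler returns, a spinning handler would never see it released.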

Provoking a Deadlock

We can easily provoke such a deadlock in our kernel by printing something in the loop at the end of our _start function:

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    […]
    loop {
        use blog_os::print;
        print!("-");        // new
    }
}

When we run it in QEMU, we get an output of the form:

We see that only a limited number of hyphens are printed until the first timer interrupt occurs. Then the system hangs because the timer interrupt handler deadlocks when it tries to print a dot. This is the reason that we see no dots in the above output.

The actual number of hyphens varies between runs because the timer interrupt occurs asynchronously. This non-determinism is what makes concurrency-related bugs so difficult to debug.

Fixing the Deadlock

To avoid this deadlock, we can disable interrupts as long as the Mutex is locked:

// in src/vga_buffer.rs

/// Prints the given formatted string to the VGA text buffer
/// through the global `WRITER` instance.
#[doc(hidden)]
pub fn _print(args: fmt::Arguments) {
    use core::fmt::Write;
    use x86_64::instructions::interrupts;   // new

    interrupts::without_interrupts(|| {     // new
        WRITER.lock().write_fmt(args).unwrap();
    });
}

The without_interrupts function takes a closure and executes it in an interrupt-free environment. We use it to ensure that no interrupt can occur as long as the Mutex is locked. When we run our kernel now, we see that it keeps running without hanging. (We still don’t notice any dots, but this is because they’re scrolling by too fast. Try to slow down the printing, e.g., by putting a for _ in 0..10000 {} inside the loop.)
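To get an intuition for what such a helper has to do, consider the following model that replaces the CPU’s interrupt flag with a plain boolean. It is a sketch under the assumption that the helper saves the previous state and only re-enables interrupts if they were enabled before. The `ToyCpu` type is invented for this illustration and is not the implementation of the x86_64 crate.

```rust
use std::cell::Cell;

/// Stand-in for the CPU's interrupt flag. A real kernel reads and changes
/// this flag with CPU instructions instead of a `Cell`.
struct ToyCpu {
    interrupts_enabled: Cell<bool>,
}

impl ToyCpu {
    fn new(enabled: bool) -> Self {
        ToyCpu { interrupts_enabled: Cell::new(enabled) }
    }

    fn enabled(&self) -> bool {
        self.interrupts_enabled.get()
    }

    /// Runs `f` with interrupts disabled and restores the previous state.
    fn without_interrupts<F, R>(&self, f: F) -> R
    where
        F: FnOnce() -> R,
    {
        let were_enabled = self.enabled();
        self.interrupts_enabled.set(false);
        let result = f();
        // Only re-enable if interrupts were on before, so that a nested
        // call does not turn them back on too early.
        if were_enabled {
            self.interrupts_enabled.set(true);
        }
        result
    }
}
```

Restoring the saved state instead of unconditionally enabling interrupts matters for nesting: an inner call must not re-enable interrupts while an outer call still relies on them being off.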

We can apply the same change to our serial printing function to ensure that no deadlocks occur with it either:

// in src/serial.rs

#[doc(hidden)]
pub fn _print(args: ::core::fmt::Arguments) {
    use core::fmt::Write;
    use x86_64::instructions::interrupts;       // new

    interrupts::without_interrupts(|| {         // new
        SERIAL1
            .lock()
            .write_fmt(args)
            .expect("Printing to serial failed");
    });
}

Note that disabling interrupts shouldn’t be a general solution. The problem is that it increases the worst-case interrupt latency, i.e., the time until the system reacts to an interrupt. Therefore, interrupts should only be disabled for a very short time.

Fixing a Race Condition

If you run cargo test, you might see the test_println_output test failing:

> cargo test --lib
[…]
Running 4 tests
test_breakpoint_exception...[ok]
test_println... [ok]
test_println_many... [ok]
test_println_output... [failed]

Error: panicked at 'assertion failed: `(left == right)`
  left: `'.'`,
 right: `'S'`', src/vga_buffer.rs:205:9

The reason is a race condition between the test and our timer handler. Remember, the test looks like this:

// in src/vga_buffer.rs

#[test_case]
fn test_println_output() {
    let s = "Some test string that fits on a single line";
    println!("{}", s);
    for (i, c) in s.chars().enumerate() {
        let screen_char = WRITER.lock().buffer.chars[BUFFER_HEIGHT - 2][i].read();
        assert_eq!(char::from(screen_char.ascii_character), c);
    }
}

The test prints a string to the VGA buffer and then checks the output by manually iterating over the buffer.chars array. The race condition occurs because the timer interrupt handler might run between the println and the reading of the screen characters. Note that this isn’t a dangerous data race, which Rust completely prevents at compile time. See the Rustonomicon for details.

To fix this, we need to keep the WRITER locked for the complete duration of the test, so that the timer handler can’t write a . to the screen in between. The fixed test looks like this:

// in src/vga_buffer.rs

#[test_case]
fn test_println_output() {
    use core::fmt::Write;
    use x86_64::instructions::interrupts;

    let s = "Some test string that fits on a single line";
    interrupts::without_interrupts(|| {
        let mut writer = WRITER.lock();
        writeln!(writer, "\n{}", s).expect("writeln failed");
        for (i, c) in s.chars().enumerate() {
            let screen_char = writer.buffer.chars[BUFFER_HEIGHT - 2][i].read();
            assert_eq!(char::from(screen_char.ascii_character), c);
        }
    });
}

We performed the following changes:

We keep the writer locked for the complete test by using the lock() method explicitly. Instead of println, we use the writeln macro that allows printing to an already locked writer.
To avoid another deadlock, we disable interrupts for the test’s duration. Otherwise, the test might get interrupted while the writer is still locked.
Since the timer interrupt handler can still run before the test, we print an additional newline \n before printing the string s. This way, we avoid test failure when the timer handler has already printed some . characters to the current line.

With the above changes, cargo test now deterministically succeeds again.

This was a very harmless race condition that only caused a test failure. As you can imagine, other race conditions can be much more difficult to debug due to their non-deterministic nature. Luckily, Rust prevents us from data races, which are the most serious class of race conditions since they can cause all kinds of undefined behavior, including system crashes and silent memory corruptions.

The `hlt` Instruction

Until now, we used a simple empty loop statement at the end of our _start and panic functions. This causes the CPU to spin endlessly, and thus works as expected. But it is also very inefficient, because the CPU continues to run at full speed even though there’s no work to do. You can see this problem in your task manager when you run your kernel: The QEMU process needs close to 100% CPU the whole time.

What we really want to do is to halt the CPU until the next interrupt arrives. This allows the CPU to enter a sleep state in which it consumes much less energy. The hlt instruction does exactly that. Let’s use this instruction to create an energy-efficient endless loop:

// in src/lib.rs

pub fn hlt_loop() -> ! {
    loop {
        x86_64::instructions::hlt();
    }
}

The instructions::hlt function is just a thin wrapper around the assembly instruction. It is safe because there’s no way it can compromise memory safety.

We can now use this hlt_loop instead of the endless loops in our _start and panic functions:

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    […]

    println!("It did not crash!");
    blog_os::hlt_loop();            // new
}


#[cfg(not(test))]
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    println!("{}", info);
    blog_os::hlt_loop();            // new
}

Let’s update our lib.rs as well:

// in src/lib.rs

/// Entry point for `cargo test`
#[cfg(test)]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    init();
    test_main();
    hlt_loop();         // new
}

pub fn test_panic_handler(info: &PanicInfo) -> ! {
    serial_println!("[failed]\n");
    serial_println!("Error: {}\n", info);
    exit_qemu(QemuExitCode::Failed);
    hlt_loop();         // new
}

When we run our kernel now in QEMU, we see a much lower CPU usage.

Keyboard Input

Now that we are able to handle interrupts from external devices, we are finally able to add support for keyboard input. This will allow us to interact with our kernel for the first time.

Like the hardware timer, the keyboard controller is already enabled by default. So when you press a key, the keyboard controller sends an interrupt to the PIC, which forwards it to the CPU. The CPU looks for a handler function in the IDT, but the corresponding entry is empty. Therefore, a double fault occurs.

So let’s add a handler function for the keyboard interrupt. It’s quite similar to how we defined the handler for the timer interrupt; it just uses a different interrupt number:

// in src/interrupts.rs

#[derive(Debug, Clone, Copy)]
#[repr(u8)]
pub enum InterruptIndex {
    Timer = PIC_1_OFFSET,
    Keyboard, // new
}

lazy_static! {
    static ref IDT: InterruptDescriptorTable = {
        let mut idt = InterruptDescriptorTable::new();
        idt.breakpoint.set_handler_fn(breakpoint_handler);
        […]
        // new
        idt[InterruptIndex::Keyboard.as_usize()]
            .set_handler_fn(keyboard_interrupt_handler);

        idt
    };
}

extern "x86-interrupt" fn keyboard_interrupt_handler(
    _stack_frame: InterruptStackFrame)
{
    print!("k");

    unsafe {
        PICS.lock()
            .notify_end_of_interrupt(InterruptIndex::Keyboard.as_u8());
    }
}

As we see from the graphic above, the keyboard uses line 1 of the primary PIC. This means that it arrives at the CPU as interrupt 33 (1 + offset 32). We add this index as a new Keyboard variant to the InterruptIndex enum. We don’t need to specify the value explicitly, since it defaults to the previous value plus one, which is also 33. In the interrupt handler, we print a k and send the end of interrupt signal to the interrupt controller.
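The implicit discriminant can be checked in isolation with a stripped-down copy of the enum. The `vector_numbers` helper is only added for this illustration.

```rust
const PIC_1_OFFSET: u8 = 32;

#[derive(Debug, Clone, Copy)]
#[repr(u8)]
enum InterruptIndex {
    Timer = PIC_1_OFFSET,
    Keyboard, // no explicit value: previous discriminant plus one
}

/// Returns the vector numbers of both variants as plain `u8` values.
fn vector_numbers() -> (u8, u8) {
    (InterruptIndex::Timer as u8, InterruptIndex::Keyboard as u8)
}
```

Calling `vector_numbers()` returns `(32, 33)`, which matches the line numbers 0 and 1 of the primary PIC plus the offset 32.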

We now see that a k appears on the screen when we press a key. However, this only works for the first key we press. Even if we continue to press keys, no more ks appear on the screen. This is because the keyboard controller won’t send another interrupt until we have read the so-called scancode of the pressed key.

Reading the Scancodes

To find out which key was pressed, we need to query the keyboard controller. We do this by reading from the data port of the PS/2 controller, which is the I/O port with the number 0x60:

// in src/interrupts.rs

extern "x86-interrupt" fn keyboard_interrupt_handler(
    _stack_frame: InterruptStackFrame)
{
    use x86_64::instructions::port::Port;

    let mut port = Port::new(0x60);
    let scancode: u8 = unsafe { port.read() };
    print!("{}", scancode);

    unsafe {
        PICS.lock()
            .notify_end_of_interrupt(InterruptIndex::Keyboard.as_u8());
    }
}

We use the Port type of the x86_64 crate to read a byte from the keyboard’s data port. This byte is called the scancode and it represents the key press/release. We don’t do anything with the scancode yet, other than print it to the screen:

The above image shows me slowly typing “123”. We see that adjacent keys have adjacent scancodes and that pressing a key causes a different scancode than releasing it. But how do we translate the scancodes to the actual key actions exactly?

Interpreting the Scancodes

There are three different standards for the mapping between scancodes and keys, the so-called scancode sets. All three go back to the keyboards of early IBM computers: the IBM XT, the IBM 3270 PC, and the IBM AT. Later computers fortunately did not continue the trend of defining new scancode sets, but rather emulated the existing sets and extended them. Today, most keyboards can be configured to emulate any of the three sets.

By default, PS/2 keyboards emulate scancode set 1 (“XT”). In this set, the lower 7 bits of a scancode byte define the key, and the most significant bit defines whether it’s a press (“0”) or a release (“1”). Keys that were not present on the original IBM XT keyboard, such as the enter key on the keypad, generate two scancodes in succession: a 0xe0 escape byte and then a byte representing the key. For a list of all set 1 scancodes and their corresponding keys, check out the OSDev Wiki.
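The bit layout described above can be sketched as a tiny decoder. This is a simplified illustration, not how the kernel handles scancodes later: the names KeyState, RawKey, and Set1Decoder are hypothetical, and the sketch only knows about the 0xe0 escape byte, not about any longer sequences:

```rust
/// Whether a scancode reports a key going down or coming back up.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyState {
    Pressed,
    Released,
}

/// One decoded set 1 scancode (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RawKey {
    /// The lower 7 bits of the scancode byte.
    pub code: u8,
    /// True if the byte was preceded by the 0xe0 escape byte.
    pub extended: bool,
    /// Derived from the most significant bit.
    pub state: KeyState,
}

/// Remembers whether the previous byte was the 0xe0 escape byte.
#[derive(Debug, Default)]
pub struct Set1Decoder {
    pending_escape: bool,
}

impl Set1Decoder {
    /// Feeds one byte read from the data port. Returns `None` for the
    /// escape byte itself, since the key byte is still to come.
    pub fn feed(&mut self, byte: u8) -> Option<RawKey> {
        if byte == 0xe0 {
            self.pending_escape = true;
            return None;
        }
        // Consume the escape flag so it only applies to this byte.
        let extended = core::mem::take(&mut self.pending_escape);
        let state = if byte & 0x80 == 0 {
            KeyState::Pressed
        } else {
            KeyState::Released
        };
        Some(RawKey { code: byte & 0x7f, extended, state })
    }
}
```

Feeding 0x02 and then 0x82 yields a press and a release of the same key code, which matches the observation that pressing and releasing a key produce different scancodes.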

To translate the scancodes to keys, we can use a match statement:

// in src/interrupts.rs

extern "x86-interrupt" fn keyboard_interrupt_handler(
    _stack_frame: InterruptStackFrame)
{
    use x86_64::instructions::port::Port;

    let mut port = Port::new(0x60);
    let scancode: u8 = unsafe { port.read() };

    // new
    let key = match scancode {
        0x02 => Some('1'),
        0x03 => Some('2'),
        0x04 => Some('3'),
        0x05 => Some('4'),
        0x06 => Some('5'),
        0x07 => Some('6'),
        0x08 => Some('7'),
        0x09 => Some('8'),
        0x0a => Some('9'),
        0x0b => Some('0'),
        _ => None,
    };
    if let Some(key) = key {
        print!("{}", key);
    }

    unsafe {
        PICS.lock()
            .notify_end_of_interrupt(InterruptIndex::Keyboard.as_u8());
    }
}

The above code translates keypresses of the number keys 0-9 and ignores all other keys. It uses a match statement to assign a character or None to each scancode. It then uses if let to destructure the optional key. By using the same variable name key in the pattern, we shadow the previous declaration, which is a common pattern for destructuring Option types in Rust.

Now we can write numbers:

Translating the other keys works in the same way. Fortunately, there is a crate named pc-keyboard for translating scancodes of scancode sets 1 and 2, so we don’t have to implement this ourselves. To use the crate, we add it to our Cargo.toml and import it in our lib.rs:

# in Cargo.toml

[dependencies]
pc-keyboard = "0.7.0"

Now we can use this crate to rewrite our keyboard_interrupt_handler:

// in src/interrupts.rs

extern "x86-interrupt" fn keyboard_interrupt_handler(
    _stack_frame: InterruptStackFrame)
{
    use pc_keyboard::{layouts, DecodedKey, HandleControl, Keyboard, ScancodeSet1};
    use spin::Mutex;
    use x86_64::instructions::port::Port;

    lazy_static! {
        static ref KEYBOARD: Mutex<Keyboard<layouts::Us104Key, ScancodeSet1>> =
            Mutex::new(Keyboard::new(ScancodeSet1::new(),
                layouts::Us104Key, HandleControl::Ignore)
            );
    }

    let mut keyboard = KEYBOARD.lock();
    let mut port = Port::new(0x60);

    let scancode: u8 = unsafe { port.read() };
    if let Ok(Some(key_event)) = keyboard.add_byte(scancode) {
        if let Some(key) = keyboard.process_keyevent(key_event) {
            match key {
                DecodedKey::Unicode(character) => print!("{}", character),
                DecodedKey::RawKey(key) => print!("{:?}", key),
            }
        }
    }

    unsafe {
        PICS.lock()
            .notify_end_of_interrupt(InterruptIndex::Keyboard.as_u8());
    }
}

We use the lazy_static macro to create a static Keyboard object protected by a Mutex. We initialize the Keyboard with a US keyboard layout and scancode set 1. The HandleControl parameter makes it possible to map ctrl+[a-z] to the Unicode characters U+0001 through U+001A. We don’t want to do that, so we use the Ignore option to handle ctrl like a normal key.

On each interrupt, we lock the Mutex, read the scancode from the keyboard controller, and pass it to the add_byte method, which translates the scancode into an Option<KeyEvent>. The KeyEvent contains the key which caused the event and whether it was a press or release event.

To interpret this key event, we pass it to the process_keyevent method, which translates the key event to a character, if possible. For example, it translates a press event of the A key to either a lowercase a character or an uppercase A character, depending on whether the shift key was pressed.

With this modified interrupt handler, we can now write text:

Configuring the Keyboard

It’s possible to configure some aspects of a PS/2 keyboard, for example, which scancode set it should use. We won’t cover it here because this post is already long enough, but the OSDev Wiki has an overview of possible configuration commands.

Summary

This post explained how to enable and handle external interrupts. We learned about the 8259 PIC and its primary/secondary layout, the remapping of the interrupt numbers, and the “end of interrupt” signal. We implemented handlers for the hardware timer and the keyboard and learned about the hlt instruction, which halts the CPU until the next interrupt.

Now we are able to interact with our kernel and have some fundamental building blocks for creating a small shell or simple games.

What’s next?

Timer interrupts are essential for an operating system because they provide a way to periodically interrupt the running process and let the kernel regain control. The kernel can then switch to a different process and create the illusion of multiple processes running in parallel.

But before we can create processes or threads, we need a way to allocate memory for them. The next posts will explore memory management to provide this fundamental building block.

Double Faults

Mon, 18 Jun 2018 00:00:00 +0000

This post explores the double fault exception in detail, which occurs when the CPU fails to invoke an exception handler. By handling this exception, we avoid fatal triple faults that cause a system reset. To prevent triple faults in all cases, we also set up an Interrupt Stack Table to catch double faults on a separate kernel stack.

What is a Double Fault?

In simplified terms, a double fault is a special exception that occurs when the CPU fails to invoke an exception handler. For example, it occurs when a page fault is triggered but there is no page fault handler registered in the Interrupt Descriptor Table (IDT). So it’s kind of similar to catch-all blocks in programming languages with exceptions, e.g., catch(...) in C++ or catch(Exception e) in Java or C#.

A double fault behaves like a normal exception. It has the vector number 8 and we can define a normal handler function for it in the IDT. It is really important to provide a double fault handler, because if a double fault is unhandled, a fatal triple fault occurs. Triple faults can’t be caught, and most hardware reacts with a system reset.

Triggering a Double Fault

Let’s provoke a double fault by triggering an exception for which we didn’t define a handler function:

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    blog_os::init();

    // trigger a page fault
    unsafe {
        *(0xdeadbeef as *mut u8) = 42;
    };

    // as before
    #[cfg(test)]
    test_main();

    println!("It did not crash!");
    loop {}
}

We use unsafe to write to the invalid address 0xdeadbeef. The virtual address is not mapped to a physical address in the page tables, so a page fault occurs. We haven’t registered a page fault handler in our IDT, so a double fault occurs.

When we start our kernel now, we see that it enters an endless boot loop. The reason for the boot loop is the following:

The CPU tries to write to 0xdeadbeef, which causes a page fault.
The CPU looks at the corresponding entry in the IDT and sees that no handler function is specified. Thus, it can’t call the page fault handler and a double fault occurs.
The CPU looks at the IDT entry of the double fault handler, but this entry does not specify a handler function either. Thus, a triple fault occurs.
A triple fault is fatal. QEMU reacts to it like most real hardware and issues a system reset.

So in order to prevent this triple fault, we need to either provide a handler function for page faults or a double fault handler. We want to avoid triple faults in all cases, so let’s start with a double fault handler that is invoked for all unhandled exception types.
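The escalation steps from the boot loop can be written down as a small decision function. This is a deliberately simplified model with hypothetical names (Outcome, dispatch_page_fault); it only captures which handlers are registered and ignores every other way in which invoking a handler can fail:

```rust
/// Where the CPU ends up after a page fault in this simplified model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The page fault handler runs.
    PageFaultHandler,
    /// No page fault handler is registered, so the double fault handler runs.
    DoubleFaultHandler,
    /// Neither handler is registered: fatal triple fault and system reset.
    TripleFaultReset,
}

/// Models the escalation for a page fault, given which IDT entries
/// have a handler function.
pub fn dispatch_page_fault(
    has_page_fault_handler: bool,
    has_double_fault_handler: bool,
) -> Outcome {
    if has_page_fault_handler {
        Outcome::PageFaultHandler
    } else if has_double_fault_handler {
        Outcome::DoubleFaultHandler
    } else {
        Outcome::TripleFaultReset
    }
}
```

Our kernel is currently in the (false, false) case, which is why it resets. Registering only a double fault handler moves it to the DoubleFaultHandler outcome.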

A Double Fault Handler

A double fault is a normal exception with an error code, so we can specify a handler function similar to our breakpoint handler:

// in src/interrupts.rs

lazy_static! {
    static ref IDT: InterruptDescriptorTable = {
        let mut idt = InterruptDescriptorTable::new();
        idt.breakpoint.set_handler_fn(breakpoint_handler);
        idt.double_fault.set_handler_fn(double_fault_handler); // new
        idt
    };
}

// new
extern "x86-interrupt" fn double_fault_handler(
    stack_frame: InterruptStackFrame, _error_code: u64) -> !
{
    panic!("EXCEPTION: DOUBLE FAULT\n{:#?}", stack_frame);
}

Our handler prints a short error message and dumps the exception stack frame. The error code of the double fault handler is always zero, so there’s no reason to print it. One difference to the breakpoint handler is that the double fault handler is diverging. The reason is that the x86_64 architecture does not permit returning from a double fault exception.

When we start our kernel now, we should see that the double fault handler is invoked:

It worked! Here is what happened this time:

The CPU tries to write to 0xdeadbeef, which causes a page fault.
Like before, the CPU looks at the corresponding entry in the IDT and sees that no handler function is defined. Thus, a double fault occurs.
The CPU jumps to the now-present double fault handler.

The triple fault (and the boot-loop) no longer occurs, since the CPU can now call the double fault handler.

That was quite straightforward! So why do we need a whole post for this topic? Well, we’re now able to catch most double faults, but there are some cases where our current approach doesn’t suffice.

Causes of Double Faults

Before we look at the special cases, we need to know the exact causes of double faults. Above, we used a pretty vague definition:

A double fault is a special exception that occurs when the CPU fails to invoke an exception handler.

What does “fails to invoke” mean exactly? The handler is not present? The handler is swapped out? And what happens if a handler causes exceptions itself?

For example, what happens if:

a breakpoint exception occurs, but the corresponding handler function is swapped out?
a page fault occurs, but the page fault handler is swapped out?
a divide-by-zero handler causes a breakpoint exception, but the breakpoint handler is swapped out?
our kernel overflows its stack and the guard page is hit?

Fortunately, the AMD64 manual (PDF) has an exact definition (in Section 8.2.9). According to it, a “double fault exception can occur when a second exception occurs during the handling of a prior (first) exception handler”. The “can” is important: Only very specific combinations of exceptions lead to a double fault. These combinations are:

First Exception	Second Exception
Divide-by-zero, Invalid TSS, Segment Not Present, Stack-Segment Fault, General Protection Fault	Invalid TSS, Segment Not Present, Stack-Segment Fault, General Protection Fault
Page Fault	Page Fault, Invalid TSS, Segment Not Present, Stack-Segment Fault, General Protection Fault

So, for example, a divide-by-zero fault followed by a page fault is fine (the page fault handler is invoked), but a divide-by-zero fault followed by a general-protection fault leads to a double fault.
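The table can also be read as a predicate over pairs of exceptions. The following is a minimal sketch of that reading; the Exception enum and the causes_double_fault function are hypothetical names, and only the exceptions mentioned in this section are modeled:

```rust
/// The exceptions that appear in the table, plus Breakpoint as an
/// example of an exception that is in neither row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Exception {
    DivideByZero,
    Breakpoint,
    InvalidTss,
    SegmentNotPresent,
    StackSegmentFault,
    GeneralProtectionFault,
    PageFault,
}

/// Returns true if `second`, raised while the CPU tries to handle
/// `first`, leads to a double fault according to the table.
pub fn causes_double_fault(first: Exception, second: Exception) -> bool {
    use Exception::*;

    // The second-exception column of the first row.
    let in_first_row_column = matches!(
        second,
        InvalidTss | SegmentNotPresent | StackSegmentFault | GeneralProtectionFault
    );

    match first {
        DivideByZero
        | InvalidTss
        | SegmentNotPresent
        | StackSegmentFault
        | GeneralProtectionFault => in_first_row_column,
        // The page fault row additionally lists a second page fault.
        PageFault => second == PageFault || in_first_row_column,
        // Not listed as a first exception: the second exception is
        // handled normally.
        Breakpoint => false,
    }
}
```

For example, causes_double_fault(DivideByZero, PageFault) is false, while causes_double_fault(DivideByZero, GeneralProtectionFault) is true.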

With the help of this table, we can answer the first three of the above questions:

If a breakpoint exception occurs and the corresponding handler function is swapped out, a page fault occurs and the page fault handler is invoked.
If a page fault occurs and the page fault handler is swapped out, a double fault occurs and the double fault handler is invoked.
If a divide-by-zero handler causes a breakpoint exception, the CPU tries to invoke the breakpoint handler. If the breakpoint handler is swapped out, a page fault occurs and the page fault handler is invoked.

In fact, even the case of an exception without a handler function in the IDT follows this scheme: When the exception occurs, the CPU tries to read the corresponding IDT entry. Since the entry is 0, which is not a valid IDT entry, a general protection fault occurs. We did not define a handler function for the general protection fault either, so another general protection fault occurs. According to the table, this leads to a double fault.

Kernel Stack Overflow

Let’s look at the fourth question:

What happens if our kernel overflows its stack and the guard page is hit?

A guard page is a special memory page at the bottom of a stack that makes it possible to detect stack overflows. The page is not mapped to any physical frame, so accessing it causes a page fault instead of silently corrupting other memory. The bootloader sets up a guard page for our kernel stack, so a stack overflow causes a page fault.

When a page fault occurs, the CPU looks up the page fault handler in the IDT and tries to push the interrupt stack frame onto the stack. However, the current stack pointer still points to the non-present guard page. Thus, a second page fault occurs, which causes a double fault (according to the above table).

So the CPU tries to call the double fault handler now. However, on a double fault, the CPU tries to push the exception stack frame, too. The stack pointer still points to the guard page, so a third page fault occurs, which causes a triple fault and a system reboot. So our current double fault handler can’t avoid a triple fault in this case.

Let’s try it ourselves! We can easily provoke a kernel stack overflow by calling a function that recurses endlessly:

// in src/main.rs

#[no_mangle] // don't mangle the name of this function
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    blog_os::init();

    fn stack_overflow() {
        stack_overflow(); // for each recursion, the return address is pushed
    }

    // trigger a stack overflow
    stack_overflow();

    […] // test_main(), println(…), and loop {}
}

When we try this code in QEMU, we see that the system enters a bootloop again.

So how can we avoid this problem? We can’t omit the pushing of the exception stack frame, since the CPU itself does it. So we need to ensure somehow that the stack is always valid when a double fault exception occurs. Fortunately, the x86_64 architecture has a solution to this problem.

Switching Stacks

The x86_64 architecture is able to switch to a predefined, known-good stack when an exception occurs. This switch happens at hardware level, so it can be performed before the CPU pushes the exception stack frame.

The switching mechanism is implemented as an Interrupt Stack Table (IST). The IST is a table of 7 pointers to known-good stacks. In Rust-like pseudocode:

struct InterruptStackTable {
    stack_pointers: [Option<StackPointer>; 7],
}

For each exception handler, we can choose a stack from the IST through the Interrupt Stack Table Index bits in the options field of the corresponding IDT entry. For example, our double fault handler could use the first stack in the IST. Then the CPU automatically switches to this stack whenever a double fault occurs. This switch would happen before anything is pushed, preventing the triple fault.
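The selection the hardware performs can be modeled in a few lines. This is a sketch under the assumption that the 3-bit Interrupt Stack Table Index of an IDT entry is interpreted as 0 for "don't switch" and 1 to 7 for the n-th IST stack; the function name stack_for_handler is hypothetical:

```rust
/// Models which stack pointer the CPU uses when it invokes a handler.
///
/// `ist` holds the seven known-good stack pointers, `ist_index_field`
/// is the raw 3-bit field from the IDT entry, and `current_rsp` is the
/// stack pointer at the time of the exception.
pub fn stack_for_handler(ist: &[u64; 7], ist_index_field: u8, current_rsp: u64) -> u64 {
    match ist_index_field & 0b111 {
        // 0 means: stay on the current stack.
        0 => current_rsp,
        // 1..=7 select the n-th stack, i.e. array index n - 1.
        n => ist[usize::from(n) - 1],
    }
}
```

With a field value of 1, the handler starts on the first IST stack regardless of where current_rsp points, which is exactly what we need when the current stack has overflowed into the guard page.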

The IST and TSS

The Interrupt Stack Table (IST) is part of an old legacy structure called Task State Segment (TSS). The TSS used to hold various pieces of information (e.g., processor register state) about a task in 32-bit mode and was, for example, used for hardware context switching. However, hardware context switching is no longer supported in 64-bit mode and the format of the TSS has changed completely.

On x86_64, the TSS no longer holds any task-specific information at all. Instead, it holds two stack tables (the IST is one of them). The only common field between the 32-bit and 64-bit TSS is the pointer to the I/O port permissions bitmap.

The 64-bit TSS has the following format:

Field	Type
(reserved)	`u32`
Privilege Stack Table	`[u64; 3]`
(reserved)	`u64`
Interrupt Stack Table	`[u64; 7]`
(reserved)	`u64`
(reserved)	`u16`
I/O Map Base Address	`u16`

The Privilege Stack Table is used by the CPU when the privilege level changes. For example, if an exception occurs while the CPU is in user mode (privilege level 3), the CPU normally switches to kernel mode (privilege level 0) before invoking the exception handler. In that case, the CPU would switch to the 0th stack in the Privilege Stack Table (since 0 is the target privilege level). We don’t have any user-mode programs yet, so we will ignore this table for now.
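To make the format table concrete, here is a sketch of a struct with the same field order and types. The name Tss64 and the field names are hypothetical, and the packed(4) representation is an assumption of this sketch: it is needed because the u64 fields do not start at 8-byte-aligned offsets:

```rust
/// A struct that mirrors the 64-bit TSS format table field by field.
#[repr(C, packed(4))]
pub struct Tss64 {
    reserved_1: u32,
    /// Stacks used on privilege level changes.
    pub privilege_stack_table: [u64; 3],
    reserved_2: u64,
    /// The seven known-good stacks of the IST.
    pub interrupt_stack_table: [u64; 7],
    reserved_3: u64,
    reserved_4: u16,
    /// Offset of the I/O port permissions bitmap.
    pub iomap_base: u16,
}

/// Total size of the structure in bytes.
pub const TSS_SIZE: usize = core::mem::size_of::<Tss64>();
```

Adding up the table gives 4 + 24 + 8 + 56 + 8 + 2 + 2 = 104 bytes, which is what TSS_SIZE evaluates to with this layout.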

Creating a TSS

Let’s create a new TSS that contains a separate double fault stack in its interrupt stack table. For that, we need a TSS struct. Fortunately, the x86_64 crate already contains a TaskStateSegment struct that we can use.

We create the TSS in a new gdt module (the name will make sense later):

// in src/lib.rs

pub mod gdt;

// in src/gdt.rs

use x86_64::VirtAddr;
use x86_64::structures::tss::TaskStateSegment;
use lazy_static::lazy_static;

pub const DOUBLE_FAULT_IST_INDEX: u16 = 0;

lazy_static! {
    static ref TSS: TaskStateSegment = {
        let mut tss = TaskStateSegment::new();
        tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX as usize] = {
            const STACK_SIZE: usize = 4096 * 5;
            static mut STACK: [u8; STACK_SIZE] = [0; STACK_SIZE];

            let stack_start = VirtAddr::from_ptr(unsafe { &STACK });
            let stack_end = stack_start + STACK_SIZE;
            stack_end
        };
        tss
    };
}

We use lazy_static because Rust’s const evaluator is not yet powerful enough to do this initialization at compile time. We define that the 0th IST entry is the double fault stack (any other IST index would work too). Then we write the top address of a double fault stack to the 0th entry. We write the top address because stacks on x86 grow downwards, i.e., from high addresses to low addresses.
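The address arithmetic can be tried out on the host. This is only a sketch of the start/end computation with plain usize values instead of VirtAddr; the helper names stack_bounds and push_u64 are hypothetical, and the sketch uses an immutable static because it never writes to the array, unlike the kernel's stack storage:

```rust
/// Same size as the double fault stack in the kernel code.
pub const STACK_SIZE: usize = 4096 * 5;

// Only used for address arithmetic in this sketch; nothing is written to it.
static STACK: [u8; STACK_SIZE] = [0; STACK_SIZE];

/// Returns the lowest address of the stack storage and the address
/// one past its last byte (the "top" that goes into the IST).
pub fn stack_bounds() -> (usize, usize) {
    let stack_start = STACK.as_ptr() as usize;
    let stack_end = stack_start + STACK_SIZE;
    (stack_start, stack_end)
}

/// Models pushing an 8-byte value: the stack pointer moves towards
/// lower addresses before the value is stored.
pub fn push_u64(stack_pointer: usize) -> usize {
    stack_pointer - 8
}
```

Starting at stack_end, the first push lands inside the array. Starting at stack_start instead, the first push would already leave it, which is why the top address is the one we store.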

We haven’t implemented memory management yet, so we don’t have a proper way to allocate a new stack. Instead, we use a static mut array as stack storage for now. The unsafe is required because the compiler can’t guarantee race freedom when mutable statics are accessed. It is important that it is a static mut and not an immutable static, because otherwise the bootloader will map it to a read-only page. We will replace this with a proper stack allocation in a later post; then the unsafe will no longer be needed at this place.

Note that this double fault stack has no guard page that protects against stack overflow. This means that we should not do anything stack-intensive in our double fault handler because a stack overflow might corrupt the memory below the stack.

Loading the TSS

Now that we’ve created a new TSS, we need a way to tell the CPU that it should use it. Unfortunately, this is a bit cumbersome since the TSS uses the segmentation system (for historical reasons). Instead of loading the table directly, we need to add a new segment descriptor to the Global Descriptor Table (GDT). Then we can load our TSS by invoking the ltr instruction with the respective GDT index. (This is the reason why we named our module gdt.)

The Global Descriptor Table

The Global Descriptor Table (GDT) is a relic that was used for memory segmentation before paging became the de facto standard. However, it is still needed in 64-bit mode for various things, such as kernel/user mode configuration or TSS loading.

The GDT is a structure that contains the segments of the program. It was used on older architectures to isolate programs from each other before paging became the standard. For more information about segmentation, check out the equally named chapter of the free “Three Easy Pieces” book. While segmentation is no longer supported in 64-bit mode, the GDT still exists. It is mostly used for two things: Switching between kernel space and user space, and loading a TSS structure.

Creating a GDT

Let’s create a static GDT that includes a segment for our TSS static:

// in src/gdt.rs

use x86_64::structures::gdt::{GlobalDescriptorTable, Descriptor};

lazy_static! {
    static ref GDT: GlobalDescriptorTable = {
        let mut gdt = GlobalDescriptorTable::new();
        gdt.add_entry(Descriptor::kernel_code_segment());
        gdt.add_entry(Descriptor::tss_segment(&TSS));
        gdt
    };
}

As before, we use lazy_static again. We create a new GDT with a code segment and a TSS segment.

Loading the GDT

To load our GDT, we create a new gdt::init function that we call from our init function:

// in src/gdt.rs

pub fn init() {
    GDT.load();
}

// in src/lib.rs

pub fn init() {
    gdt::init();
    interrupts::init_idt();
}

Now our GDT is loaded (since the _start function calls init), but we still see the boot loop on stack overflow.

The Final Steps

The problem is that the GDT segments are not yet active because the segment and TSS registers still contain the values from the old GDT. We also need to modify the double fault IDT entry so that it uses the new stack.

In summary, we need to do the following:

Reload code segment register: We changed our GDT, so we should reload cs, the code segment register. This is required since the old segment selector could now point to a different GDT descriptor (e.g., a TSS descriptor).
Load the TSS: We loaded a GDT that contains a TSS selector, but we still need to tell the CPU that it should use that TSS.
Update the IDT entry: As soon as our TSS is loaded, the CPU has access to a valid interrupt stack table (IST). Then we can tell the CPU that it should use our new double fault stack by modifying our double fault IDT entry.

For the first two steps, we need access to the code_selector and tss_selector variables in our gdt::init function. We can achieve this by making them part of the static through a new Selectors struct:

// in src/gdt.rs

use x86_64::structures::gdt::SegmentSelector;

lazy_static! {
    static ref GDT: (GlobalDescriptorTable, Selectors) = {
        let mut gdt = GlobalDescriptorTable::new();
        let code_selector = gdt.add_entry(Descriptor::kernel_code_segment());
        let tss_selector = gdt.add_entry(Descriptor::tss_segment(&TSS));
        (gdt, Selectors { code_selector, tss_selector })
    };
}

struct Selectors {
    code_selector: SegmentSelector,
    tss_selector: SegmentSelector,
}

Now we can use the selectors to reload the cs register and load our TSS:

// in src/gdt.rs

pub fn init() {
    use x86_64::instructions::tables::load_tss;
    use x86_64::instructions::segmentation::{CS, Segment};
    
    GDT.0.load();
    unsafe {
        CS::set_reg(GDT.1.code_selector);
        load_tss(GDT.1.tss_selector);
    }
}

We reload the code segment register using CS::set_reg and load the TSS using load_tss. The functions are marked as unsafe, so we need an unsafe block to invoke them. The reason is that it might be possible to break memory safety by loading invalid selectors.

Now that we have loaded a valid TSS and interrupt stack table, we can set the stack index for our double fault handler in the IDT:

// in src/interrupts.rs

use crate::gdt;

lazy_static! {
    static ref IDT: InterruptDescriptorTable = {
        let mut idt = InterruptDescriptorTable::new();
        idt.breakpoint.set_handler_fn(breakpoint_handler);
        unsafe {
            idt.double_fault.set_handler_fn(double_fault_handler)
                .set_stack_index(gdt::DOUBLE_FAULT_IST_INDEX); // new
        }

        idt
    };
}

The set_stack_index method is unsafe because the caller must ensure that the used index is valid and not already used for another exception.

That’s it! Now the CPU should switch to the double fault stack whenever a double fault occurs. Thus, we are able to catch all double faults, including kernel stack overflows:

From now on, we should never see a triple fault again! To ensure that we don’t accidentally break the above, we should add a test for this.

A Stack Overflow Test

To test our new gdt module and ensure that the double fault handler is correctly called on a stack overflow, we can add an integration test. The idea is to provoke a double fault in the test function and verify that the double fault handler is called.

Let’s start with a minimal skeleton:

// in tests/stack_overflow.rs

#![no_std]
#![no_main]

use core::panic::PanicInfo;

#[no_mangle]
pub extern "C" fn _start() -> ! {
    unimplemented!();
}

#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    blog_os::test_panic_handler(info)
}

Like our panic_handler test, the test will run without a test harness. The reason is that we can’t continue execution after a double fault, so more than one test doesn’t make sense. To disable the test harness for the test, we add the following to our Cargo.toml:

# in Cargo.toml

[[test]]
name = "stack_overflow"
harness = false

Now cargo test --test stack_overflow should compile successfully. The test fails, of course, since the unimplemented macro panics.

Implementing `_start`

The implementation of the _start function looks like this:

// in tests/stack_overflow.rs

use blog_os::serial_print;

#[no_mangle]
pub extern "C" fn _start() -> ! {
    serial_print!("stack_overflow::stack_overflow...\t");

    blog_os::gdt::init();
    init_test_idt();

    // trigger a stack overflow
    stack_overflow();

    panic!("Execution continued after stack overflow");
}

#[allow(unconditional_recursion)]
fn stack_overflow() {
    stack_overflow(); // for each recursion, the return address is pushed
    volatile::Volatile::new(0).read(); // prevent tail recursion optimizations
}

We call our gdt::init function to initialize a new GDT. Instead of calling our interrupts::init_idt function, we call an init_test_idt function that will be explained in a moment. The reason is that we want to register a custom double fault handler that does an exit_qemu(QemuExitCode::Success) instead of panicking.

The stack_overflow function is almost identical to the function in our main.rs. The only difference is that at the end of the function, we perform an additional volatile read using the Volatile type to prevent a compiler optimization called tail call elimination. Among other things, this optimization allows the compiler to transform a function whose last statement is a recursive function call into a normal loop. Thus, no additional stack frame is created for the function call, so the stack usage remains constant.

In our case, however, we want the stack overflow to happen, so we add a dummy volatile read statement at the end of the function, which the compiler is not allowed to remove. Thus, the function is no longer tail recursive, and the transformation into a loop is prevented. We also add the allow(unconditional_recursion) attribute to silence the compiler warning that the function recurses endlessly.
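The same idea can be tried outside the kernel. The sketch below swaps in core::ptr::read_volatile from the core library in place of the Volatile type, and it adds a depth limit (the hypothetical depth parameter) so that it terminates instead of overflowing the stack:

```rust
/// Recurses `depth` times. The recursive call is not the last
/// operation of the function, so it is not in tail position.
pub fn recurse(depth: u32) {
    if depth == 0 {
        return;
    }

    recurse(depth - 1); // for each recursion, the return address is pushed

    // A volatile read is a side effect that the compiler is not allowed
    // to remove, so there is always work left to do after the call.
    let marker = 0u8;
    let _ = unsafe { core::ptr::read_volatile(&marker) };
}
```

Removing the depth check turns this back into the endless recursion used in the test, where the growing stack usage is the whole point.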

The Test IDT

As noted above, the test needs its own IDT with a custom double fault handler. The implementation looks like this:

// in tests/stack_overflow.rs

use lazy_static::lazy_static;
use x86_64::structures::idt::InterruptDescriptorTable;

lazy_static! {
    static ref TEST_IDT: InterruptDescriptorTable = {
        let mut idt = InterruptDescriptorTable::new();
        unsafe {
            idt.double_fault
                .set_handler_fn(test_double_fault_handler)
                .set_stack_index(blog_os::gdt::DOUBLE_FAULT_IST_INDEX);
        }

        idt
    };
}

pub fn init_test_idt() {
    TEST_IDT.load();
}

The implementation is very similar to our normal IDT in interrupts.rs. Like in the normal IDT, we set a stack index in the IST for the double fault handler in order to switch to a separate stack. The init_test_idt function loads the IDT on the CPU through the load method.

The Double Fault Handler

The only missing piece is our double fault handler. It looks like this:

// in tests/stack_overflow.rs

use blog_os::{exit_qemu, QemuExitCode, serial_println};
use x86_64::structures::idt::InterruptStackFrame;

extern "x86-interrupt" fn test_double_fault_handler(
    _stack_frame: InterruptStackFrame,
    _error_code: u64,
) -> ! {
    serial_println!("[ok]");
    exit_qemu(QemuExitCode::Success);
    loop {}
}

When the double fault handler is called, we exit QEMU with a success exit code, which marks the test as passed. Since integration tests are completely separate executables, we need to set the #![feature(abi_x86_interrupt)] attribute again at the top of our test file.

Now we can run our test through cargo test --test stack_overflow (or cargo test to run all tests). As expected, we see the stack_overflow... [ok] output in the console. Try to comment out the set_stack_index line; it should cause the test to fail.

Summary

In this post, we learned what a double fault is and under which conditions it occurs. We added a basic double fault handler that prints an error message and added an integration test for it.

We also enabled the hardware-supported stack switching on double fault exceptions so that it also works on stack overflow. While implementing it, we learned about the task state segment (TSS), the contained interrupt stack table (IST), and the global descriptor table (GDT), which was used for segmentation on older architectures.

What’s next?

The next post explains how to handle interrupts from external devices such as timers, keyboards, or network controllers. These hardware interrupts are very similar to exceptions, e.g., they are also dispatched through the IDT. However, unlike exceptions, they don’t arise directly on the CPU. Instead, an interrupt controller aggregates these interrupts and forwards them to the CPU depending on their priority. In the next post, we will explore the Intel 8259 (“PIC”) interrupt controller and learn how to implement keyboard support.

CPU Exceptions

Sun, 17 Jun 2018 00:00:00 +0000

CPU exceptions occur in various erroneous situations, for example, when accessing an invalid memory address or when dividing by zero. To react to them, we have to set up an interrupt descriptor table that provides handler functions. At the end of this post, our kernel will be able to catch breakpoint exceptions and resume normal execution afterward.

Overview

An exception signals that something is wrong with the current instruction. For example, the CPU issues an exception if the current instruction tries to divide by 0. When an exception occurs, the CPU interrupts its current work and immediately calls a specific exception handler function, depending on the exception type.

On x86, there are about 20 different CPU exception types. The most important are:

Page Fault: A page fault occurs on illegal memory accesses. For example, if the current instruction tries to read from an unmapped page or tries to write to a read-only page.
Invalid Opcode: This exception occurs when the current instruction is invalid, for example, when we try to use new SSE instructions on an old CPU that does not support them.
General Protection Fault: This is the exception with the broadest range of causes. It occurs on various kinds of access violations, such as trying to execute a privileged instruction in user-level code or writing reserved fields in configuration registers.
Double Fault: When an exception occurs, the CPU tries to call the corresponding handler function. If another exception occurs while calling the exception handler, the CPU raises a double fault exception. This exception also occurs when there is no handler function registered for an exception.
Triple Fault: If an exception occurs while the CPU tries to call the double fault handler function, it issues a fatal triple fault. We can't catch or handle a triple fault. Most processors react by resetting themselves and rebooting the operating system.

For the full list of exceptions, check out the OSDev wiki.

The Interrupt Descriptor Table

In order to catch and handle exceptions, we have to set up a so-called Interrupt Descriptor Table (IDT). In this table, we can specify a handler function for each CPU exception. The hardware uses this table directly, so we need to follow a predefined format. Each entry must have the following 16-byte structure:

Type	Name	Description
u16	Function Pointer [0:15]	The lower bits of the pointer to the handler function.
u16	GDT selector	Selector of a code segment in the global descriptor table.
u16	Options	(see below)
u16	Function Pointer [16:31]	The middle bits of the pointer to the handler function.
u32	Function Pointer [32:63]	The remaining bits of the pointer to the handler function.
u32	Reserved
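
As a rough illustration of this layout, the following host-side sketch mirrors the six fields in a `#[repr(C)]` struct and shows how a 64-bit handler address could be split across the three pointer fields. The `RawIdtEntry` name and the selector and options values are placeholders chosen for this example; this is not the type used by the x86_64 crate.

```rust
use std::mem::size_of;

/// Hypothetical mirror of the 16-byte IDT entry layout (illustration only).
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RawIdtEntry {
    pointer_low: u16,
    gdt_selector: u16,
    options: u16,
    pointer_middle: u16,
    pointer_high: u32,
    reserved: u32,
}

impl RawIdtEntry {
    /// Splits a 64-bit handler address into the three pointer fields.
    fn new(handler_addr: u64, gdt_selector: u16, options: u16) -> Self {
        RawIdtEntry {
            pointer_low: handler_addr as u16,
            gdt_selector,
            options,
            pointer_middle: (handler_addr >> 16) as u16,
            pointer_high: (handler_addr >> 32) as u32,
            reserved: 0,
        }
    }

    /// Reassembles the handler address from the three pointer fields.
    fn handler_addr(&self) -> u64 {
        u64::from(self.pointer_low)
            | (u64::from(self.pointer_middle) << 16)
            | (u64::from(self.pointer_high) << 32)
    }
}

fn main() {
    // Address, selector, and options are placeholder values.
    let entry = RawIdtEntry::new(0xFFFF_8000_1234_5678, 0x08, 0x8E00);
    assert_eq!(size_of::<RawIdtEntry>(), 16);
    assert_eq!(entry.handler_addr(), 0xFFFF_8000_1234_5678);
    println!("{:#x?}", entry);
}
```

The `#[repr(C)]` attribute matters here: without it, the compiler would be free to reorder the fields, and the hardware would read the wrong values.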

The options field has the following format:

Bits	Name	Description
0-2	Interrupt Stack Table Index	0: Don't switch stacks, 1-7: Switch to the n-th stack in the Interrupt Stack Table when this handler is called.
3-7	Reserved
8	0: Interrupt Gate, 1: Trap Gate	If this bit is 0, interrupts are disabled when this handler is called.
9-11	must be one
12	must be zero
13-14	Descriptor Privilege Level (DPL)	The minimal privilege level required for calling this handler.
15	Present
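
To make the bit layout more concrete, here is a minimal sketch of how the options field could be encoded in Rust. The `EntryOptions` type and its method names are hypothetical helpers for this example (the x86_64 crate provides its own abstraction), and the code only performs the bit arithmetic from the table above.

```rust
/// Hypothetical helper that encodes the 16-bit options field (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct EntryOptions(u16);

impl EntryOptions {
    /// Bits 9-11 set to one, everything else zero (not present, interrupt gate).
    fn minimal() -> Self {
        EntryOptions(0b1110_0000_0000)
    }

    /// Bit 15: present.
    fn set_present(self, present: bool) -> Self {
        self.with_bit(15, present)
    }

    /// Bit 8: 0 = interrupt gate (interrupts disabled), 1 = trap gate.
    fn set_trap_gate(self, trap: bool) -> Self {
        self.with_bit(8, trap)
    }

    /// Bits 13-14: descriptor privilege level (0-3).
    fn set_privilege_level(self, dpl: u16) -> Self {
        assert!(dpl <= 3, "DPL must fit in two bits");
        EntryOptions((self.0 & !(0b11u16 << 13)) | (dpl << 13))
    }

    /// Bits 0-2: interrupt stack table index (0 = don't switch stacks).
    fn set_stack_index(self, index: u16) -> Self {
        assert!(index <= 7, "IST index must fit in three bits");
        EntryOptions((self.0 & !0b111u16) | index)
    }

    /// Sets or clears a single bit.
    fn with_bit(self, bit: u16, value: bool) -> Self {
        if value {
            EntryOptions(self.0 | (1u16 << bit))
        } else {
            EntryOptions(self.0 & !(1u16 << bit))
        }
    }
}

fn main() {
    let options = EntryOptions::minimal().set_present(true);
    assert_eq!(options.0, 0x8E00);
    println!("present interrupt gate: {:#06x}", options.0);
}
```

A present trap gate callable from privilege level 3 would be `EntryOptions::minimal().set_present(true).set_trap_gate(true).set_privilege_level(3)` in this sketch.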

Each exception has a predefined IDT index. For example, the invalid opcode exception has table index 6 and the page fault exception has table index 14. Thus, the hardware can automatically load the corresponding IDT entry for each exception. The Exception Table in the OSDev wiki shows the IDT indexes of all exceptions in the "Vector nr." column.
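
Because every entry is 16 bytes long, the position of an entry inside the table follows directly from its index. The small sketch below shows this arithmetic for the two indexes mentioned above; the constant and function names are made up for this example.

```rust
/// Size of one IDT entry in bytes.
const IDT_ENTRY_SIZE: usize = 16;

/// Table indexes named in the text (the constant names are hypothetical).
const INVALID_OPCODE_VECTOR: usize = 6;
const PAGE_FAULT_VECTOR: usize = 14;

/// Byte offset of the entry for `vector`, relative to the start of the IDT.
fn idt_entry_offset(vector: usize) -> usize {
    vector * IDT_ENTRY_SIZE
}

fn main() {
    assert_eq!(idt_entry_offset(INVALID_OPCODE_VECTOR), 96);
    assert_eq!(idt_entry_offset(PAGE_FAULT_VECTOR), 224);
    println!(
        "page fault entry starts at byte {}",
        idt_entry_offset(PAGE_FAULT_VECTOR)
    );
}
```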

When an exception occurs, the CPU roughly does the following:

1. Push some registers on the stack, including the instruction pointer and the RFLAGS register. (We will use these values later in this post.)
2. Read the corresponding entry from the Interrupt Descriptor Table (IDT). For example, the CPU reads the entry at index 14 when a page fault occurs.
3. Check if the entry is present and, if not, raise a double fault.
4. Disable hardware interrupts if the entry is an interrupt gate (bit 40 not set).
5. Load the specified GDT selector into the CS (code segment).
6. Jump to the specified handler function.

Don't worry about steps 4 and 5 for now; we will learn about the global descriptor table and hardware interrupts in future posts.

An IDT Type

Instead of creating our own IDT type, we will use the InterruptDescriptorTable struct of the x86_64 crate, which looks like this:

#[repr(C)]
pub struct InterruptDescriptorTable {
    pub divide_by_zero: Entry<HandlerFunc>,
    pub debug: Entry<HandlerFunc>,
    pub non_maskable_interrupt: Entry<HandlerFunc>,
    pub breakpoint: Entry<HandlerFunc>,
    pub overflow: Entry<HandlerFunc>,
    pub bound_range_exceeded: Entry<HandlerFunc>,
    pub invalid_opcode: Entry<HandlerFunc>,
    pub device_not_available: Entry<HandlerFunc>,
    pub double_fault: Entry<HandlerFuncWithErrCode>,
    pub invalid_tss: Entry<HandlerFuncWithErrCode>,
    pub segment_not_present: Entry<HandlerFuncWithErrCode>,
    pub stack_segment_fault: Entry<HandlerFuncWithErrCode>,
    pub general_protection_fault: Entry<HandlerFuncWithErrCode>,
    pub page_fault: Entry<PageFaultHandlerFunc>,
    pub x87_floating_point: Entry<HandlerFunc>,
    pub alignment_check: Entry<HandlerFuncWithErrCode>,
    pub machine_check: Entry<HandlerFunc>,
    pub simd_floating_point: Entry<HandlerFunc>,
    pub virtualization: Entry<HandlerFunc>,
    pub security_exception: Entry<HandlerFuncWithErrCode>,
    // some fields omitted
}

The fields have the type idt::Entry<F>, which is a struct that represents the fields of an IDT entry (see the table above). The type parameter F defines the expected handler function type. We see that some entries require a HandlerFunc and some entries require a HandlerFuncWithErrCode. The page fault even has its own special type: PageFaultHandlerFunc.

Let's look at the HandlerFunc type first:

type HandlerFunc = extern "x86-interrupt" fn(_: InterruptStackFrame);

It's a type alias for an extern "x86-interrupt" fn type. The extern keyword defines a function with a foreign calling convention and is often used to communicate with C code (extern "C" fn). But what is the x86-interrupt calling convention?

The Interrupt Calling Convention

Exceptions are quite similar to function calls: The CPU jumps to the first instruction of the called function and executes it. Afterwards, the CPU jumps to the return address and continues the execution of the parent function.

However, there is a major difference between exceptions and function calls: A function call is invoked voluntarily by a compiler-inserted call instruction, while an exception might occur at any instruction. In order to understand the consequences of this difference, we need to examine function calls in more detail.

Calling conventions specify the details of a function call. For example, they specify where function parameters are placed (e.g. in registers or on the stack) and how results are returned. On x86_64 Linux, the following rules apply for C functions (specified in the System V ABI):

the first six integer arguments are passed in registers rdi, rsi, rdx, rcx, r8, r9
additional arguments are passed on the stack
results are returned in rax and rdx

Note that Rust does not follow the C ABI (in fact, there isn't even a Rust ABI yet), so these rules apply only to functions declared as extern "C" fn.

Preserved and Scratch Registers

The calling convention divides the registers into two parts: preserved and scratch registers.

The values of preserved registers must remain unchanged across function calls. So a called function (the "callee") is only allowed to overwrite these registers if it restores their original values before returning. Therefore, these registers are called "callee-saved". A common pattern is to save these registers to the stack at the function's beginning and restore them just before returning.

In contrast, a called function is allowed to overwrite scratch registers without restrictions. If the caller wants to preserve the value of a scratch register across a function call, it needs to back it up before the call and restore it afterward (e.g., by pushing it to the stack). So the scratch registers are caller-saved.

On x86_64, the C calling convention specifies the following preserved and scratch registers:

preserved registers	scratch registers
`rbp`, `rbx`, `rsp`, `r12`, `r13`, `r14`, `r15`	`rax`, `rcx`, `rdx`, `rsi`, `rdi`, `r8`, `r9`, `r10`, `r11`
callee-saved	caller-saved

The compiler knows these rules, so it generates the code accordingly. For example, most functions begin with a push rbp, which backs up rbp on the stack (because it's a callee-saved register).

Preserving all Registers

In contrast to function calls, exceptions can occur on any instruction. In most cases, we don't even know at compile time if the generated code will cause an exception. For example, the compiler can't know if an instruction causes a stack overflow or a page fault.

Since we don't know when an exception occurs, we can't back up any registers beforehand. This means we can't use a calling convention that relies on caller-saved registers for exception handlers. Instead, we need a calling convention that preserves all registers. The x86-interrupt calling convention is such a calling convention, so it guarantees that all register values are restored to their original values on function return.

Note that this does not mean all registers are saved to the stack at function entry. Instead, the compiler only backs up the registers that are overwritten by the function. This way, very efficient code can be generated for short functions that only use a few registers.

The Interrupt Stack Frame

On a normal function call (using the call instruction), the CPU pushes the return address before jumping to the target function. On function return (using the ret instruction), the CPU pops this return address and jumps to it. So the stack frame of a normal function call looks like this:

For exception and interrupt handlers, however, pushing a return address would not suffice, since interrupt handlers often run in a different context (stack pointer, CPU flags, etc.). Instead, the CPU performs the following steps when an interrupt occurs:

Saving the old stack pointer: The CPU reads the stack pointer (rsp) and stack segment (ss) register values and remembers them in an internal buffer.
Aligning the stack pointer: An interrupt can occur at any instruction, so the stack pointer can have any value, too. However, some CPU instructions (e.g., some SSE instructions) require that the stack pointer be aligned on a 16-byte boundary, so the CPU performs such an alignment right after the interrupt.
Switching stacks (in some cases): A stack switch occurs when the CPU privilege level changes, for example, when a CPU exception occurs in a user-mode program. It is also possible to configure stack switches for specific interrupts using the so-called Interrupt Stack Table (described in the next post).
Pushing the old stack pointer: The CPU pushes the rsp and ss values that it saved in the first step to the stack. This makes it possible to restore the original stack pointer when returning from an interrupt handler.
Pushing and updating the RFLAGS register: The RFLAGS register contains various control and status bits. On interrupt entry, the CPU changes some bits and pushes the old value.
Pushing the instruction pointer: Before jumping to the interrupt handler function, the CPU pushes the instruction pointer (rip) and the code segment (cs). This is comparable to the return address push of a normal function call.
Pushing an error code (for some exceptions): For some specific exceptions, such as page faults, the CPU pushes an error code, which describes the cause of the exception.
Invoking the interrupt handler: The CPU reads the address and the segment descriptor of the interrupt handler function from the corresponding field in the IDT. It then invokes this handler by loading the values into the rip and cs registers.

So the interrupt stack frame looks like this:

In the x86_64 crate, the interrupt stack frame is represented by the InterruptStackFrame struct. It is passed to interrupt handlers as an argument and can be used to retrieve additional information about the exception's cause. The struct contains no error code field, since only a few exceptions push an error code. These exceptions use the separate HandlerFuncWithErrCode function type, which has an additional error_code argument.
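
To visualize the resulting memory layout, the following host-side sketch replays the push order described above on a simulated stack. The `StackFrameSketch` struct and the `push_frame` function are hypothetical names for this example and are not the crate's actual definition; the sketch assumes that every pushed value occupies 8 bytes and that no error code is pushed.

```rust
use std::mem::size_of;

/// Hypothetical mirror of the values the CPU pushes (illustration only).
/// Fields are ordered from the lowest to the highest memory address.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StackFrameSketch {
    instruction_pointer: u64, // pushed last, lowest address
    code_segment: u64,
    cpu_flags: u64,
    stack_pointer: u64,
    stack_segment: u64, // pushed first, highest address
}

/// Replays the CPU's pushes on a simulated stack and reads the frame back.
fn push_frame(ss: u64, rsp: u64, rflags: u64, cs: u64, rip: u64) -> StackFrameSketch {
    let mut pushed: Vec<u64> = Vec::new();
    // Push order: ss, rsp, rflags, cs, rip.
    for value in [ss, rsp, rflags, cs, rip] {
        pushed.push(value);
    }
    // The stack grows downward, so the last push sits at the lowest address.
    pushed.reverse();
    StackFrameSketch {
        instruction_pointer: pushed[0],
        code_segment: pushed[1],
        cpu_flags: pushed[2],
        stack_pointer: pushed[3],
        stack_segment: pushed[4],
    }
}

fn main() {
    // All register values below are made-up sample values.
    let frame = push_frame(0x10, 0x7ff8, 0x202, 0x08, 0x20_1000);
    assert_eq!(size_of::<StackFrameSketch>(), 40);
    assert_eq!(frame.instruction_pointer, 0x20_1000);
    println!("{:#x?}", frame);
}
```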

Behind the Scenes

The x86-interrupt calling convention is a powerful abstraction that hides almost all of the messy details of the exception handling process. However, sometimes it's useful to know what's happening behind the curtain. Here is a short overview of the things that the x86-interrupt calling convention takes care of:

Retrieving the arguments: Most calling conventions expect that the arguments are passed in registers. This is not possible for exception handlers since we must not overwrite any register values before backing them up on the stack. Instead, the x86-interrupt calling convention is aware that the arguments already lie on the stack at a specific offset.
Returning using iretq: Since the interrupt stack frame completely differs from stack frames of normal function calls, we can't return from handler functions through the normal ret instruction. So instead, the iretq instruction must be used.
Handling the error code: The error code, which is pushed for some exceptions, makes things much more complex. It changes the stack alignment (see the next point) and needs to be popped off the stack before returning. The x86-interrupt calling convention handles all that complexity. However, it doesn't know which handler function is used for which exception, so it needs to deduce that information from the number of function arguments. That means the programmer is still responsible for using the correct function type for each exception. Luckily, the InterruptDescriptorTable type defined by the x86_64 crate ensures that the correct function types are used.
Aligning the stack: Some instructions (especially SSE instructions) require a 16-byte stack alignment. The CPU ensures this alignment whenever an exception occurs, but for some exceptions it destroys it again later when it pushes an error code. The x86-interrupt calling convention takes care of this by realigning the stack in this case.
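
The effect of the error code on the alignment can be checked with a bit of arithmetic. The sketch below is a simplified model, assuming that no stack switch happens, that the CPU rounds rsp down to a 16-byte boundary, and that each pushed value occupies 8 bytes; the function names are hypothetical.

```rust
/// Rounds a stack pointer down to the next 16-byte boundary.
fn align_down_16(rsp: u64) -> u64 {
    rsp & !0xF
}

/// Models the stack pointer at handler entry: align first, then push
/// ss, rsp, rflags, cs, rip and, for some exceptions, an error code.
fn rsp_at_handler_entry(old_rsp: u64, has_error_code: bool) -> u64 {
    let aligned = align_down_16(old_rsp);
    let pushes: u64 = if has_error_code { 6 } else { 5 };
    aligned - pushes * 8
}

fn main() {
    let old_rsp = 0x1009; // arbitrary sample value
    let without = rsp_at_handler_entry(old_rsp, false);
    let with = rsp_at_handler_entry(old_rsp, true);
    // The extra 8-byte push shifts the alignment relative to the other case.
    assert_eq!(without % 16, 8);
    assert_eq!(with % 16, 0);
    println!("without error code: {:#x}, with error code: {:#x}", without, with);
}
```

In this model, the two cases always differ by 8 bytes, which is why a handler with an error code needs different alignment handling than one without.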

If you are interested in more details, we also have a series of posts that explain exception handling using naked functions linked at the end of this post.

Implementation

Now that we've understood the theory, it's time to handle CPU exceptions in our kernel. We'll start by creating a new interrupts module in src/interrupts.rs with an init_idt function that creates a new InterruptDescriptorTable:

// in src/lib.rs

pub mod interrupts;

// in src/interrupts.rs

use x86_64::structures::idt::InterruptDescriptorTable;

pub fn init_idt() {
    let mut idt = InterruptDescriptorTable::new();
}

Now we can add handler functions. We start by adding a handler for the breakpoint exception. The breakpoint exception is the perfect exception to test exception handling. Its only purpose is to temporarily pause a program when the breakpoint instruction int3 is executed.

The breakpoint exception is commonly used in debuggers: When the user sets a breakpoint, the debugger overwrites the corresponding instruction with the int3 instruction so that the CPU throws the breakpoint exception when it reaches that line. When the user wants to continue the program, the debugger replaces the int3 instruction with the original instruction again and continues the program. For more details, see the "How debuggers work" series.

For our use case, we don't need to overwrite any instructions. Instead, we just want to print a message when the breakpoint instruction is executed and then continue the program. So let's create a simple breakpoint_handler function and add it to our IDT:

// in src/interrupts.rs

use x86_64::structures::idt::{InterruptDescriptorTable, InterruptStackFrame};
use crate::println;

pub fn init_idt() {
    let mut idt = InterruptDescriptorTable::new();
    idt.breakpoint.set_handler_fn(breakpoint_handler);
}

extern "x86-interrupt" fn breakpoint_handler(
    stack_frame: InterruptStackFrame)
{
    println!("EXCEPTION: BREAKPOINT\n{:#?}", stack_frame);
}

Our handler just outputs a message and pretty-prints the interrupt stack frame.

When we try to compile it, the following error occurs:

error[E0658]: x86-interrupt ABI is experimental and subject to change (see issue #40180)
  --> src/main.rs:53:1
   |
53 | / extern "x86-interrupt" fn breakpoint_handler(stack_frame: InterruptStackFrame) {
54 | |     println!("EXCEPTION: BREAKPOINT\n{:#?}", stack_frame);
55 | | }
   | |_^
   |
   = help: add #![feature(abi_x86_interrupt)] to the crate attributes to enable

This error occurs because the x86-interrupt calling convention is still unstable. To use it anyway, we have to explicitly enable it by adding #![feature(abi_x86_interrupt)] at the top of our lib.rs.

Loading the IDT

In order for the CPU to use our new interrupt descriptor table, we need to load it using the lidt instruction. The InterruptDescriptorTable struct of the x86_64 crate provides a load method for that. Let's try to use it:

// in src/interrupts.rs

pub fn init_idt() {
    let mut idt = InterruptDescriptorTable::new();
    idt.breakpoint.set_handler_fn(breakpoint_handler);
    idt.load();
}

When we try to compile it now, the following error occurs:

error: `idt` does not live long enough
  --> src/interrupts/mod.rs:43:5
   |
43 |     idt.load();
   |     ^^^ does not live long enough
44 | }
   | - borrowed value only lives until here
   |
   = note: borrowed value must be valid for the static lifetime...

So the load method expects a &'static self, that is, a reference valid for the complete runtime of the program. The reason is that the CPU will access this table on every interrupt until we load a different IDT. So using a shorter lifetime than 'static could lead to use-after-free bugs.

In fact, this is exactly what happens here. Our idt is created on the stack, so it is only valid inside the init function. Afterwards, the stack memory is reused for other functions, so the CPU would interpret random stack memory as IDT. Luckily, the InterruptDescriptorTable::load method encodes this lifetime requirement in its function definition, so that the Rust compiler is able to prevent this possible bug at compile time.

In order to fix this problem, we need to store our idt at a place where it has a 'static lifetime. To achieve this, we could allocate our IDT on the heap using Box and then convert it to a 'static reference, but we are writing an OS kernel and thus don't have a heap (yet).
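
For comparison, this is roughly what the heap-based approach looks like in a normal program with the standard library. The sketch runs on the host, not in our kernel, and `MockTable` is a hypothetical stand-in whose `load` method only imitates the `&'static self` requirement.

```rust
/// Hypothetical stand-in for a table that the hardware keeps reading after `load`.
struct MockTable {
    entries: [u64; 4],
}

impl MockTable {
    fn new() -> Self {
        MockTable { entries: [0; 4] }
    }

    /// Like the real `load`, this demands a reference that is valid for the
    /// remaining runtime of the program.
    fn load(&'static self) -> &'static [u64; 4] {
        &self.entries
    }
}

fn init_table() -> &'static [u64; 4] {
    let mut table = Box::new(MockTable::new());
    table.entries[3] = 0xdead_beef; // sample value

    // `Box::leak` gives up ownership, so the allocation is never freed and
    // the returned reference is valid for the 'static lifetime.
    let table: &'static MockTable = Box::leak(table);
    table.load()
}

fn main() {
    let entries = init_table();
    assert_eq!(entries[3], 0xdead_beef);
    println!("{:x?}", entries);
}
```

Calling `load` on a local `MockTable` instead of the leaked box would be rejected by the compiler, just like the stack-allocated idt above.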

As an alternative, we could try to store the IDT as a static:

static IDT: InterruptDescriptorTable = InterruptDescriptorTable::new();

pub fn init_idt() {
    IDT.breakpoint.set_handler_fn(breakpoint_handler);
    IDT.load();
}

However, there is a problem: Statics are immutable, so we can't modify the breakpoint entry from our init function. We could solve this problem by using a static mut:

static mut IDT: InterruptDescriptorTable = InterruptDescriptorTable::new();

pub fn init_idt() {
    unsafe {
        IDT.breakpoint.set_handler_fn(breakpoint_handler);
        IDT.load();
    }
}

This variant compiles without errors, but it's far from idiomatic. Mutable statics are very prone to data races, so every access requires an unsafe block.

Lazy Statics to the Rescue

Fortunately, the lazy_static macro exists. Instead of evaluating a static at compile time, the macro performs the initialization when the static is referenced the first time. Thus, we can do almost everything in the initialization block and are even able to read runtime values.
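
The idea of initializing a static on first access can be illustrated on the host with the standard library's `OnceLock` type. This is only an analogy under the assumption that a std environment is available; it is not how lazy_static is implemented, and the names `TABLE`, `INIT_CALLS`, and `table` are made up for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the initialization closure runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Starts out empty and is filled on the first call to `table`.
static TABLE: OnceLock<Vec<u32>> = OnceLock::new();

/// Returns the lazily initialized table.
fn table() -> &'static Vec<u32> {
    TABLE.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        // Arbitrary code can run here, including code that reads runtime values.
        vec![3, 6, 14]
    })
}

fn main() {
    let first = table();
    let second = table();
    // Both calls return the same data, and the closure ran only once.
    assert_eq!(first, second);
    assert_eq!(INIT_CALLS.load(Ordering::SeqCst), 1);
    println!("{:?}", first);
}
```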

We already imported the lazy_static crate when we created an abstraction for the VGA text buffer. So we can directly use the lazy_static! macro to create our static IDT:

// in src/interrupts.rs

use lazy_static::lazy_static;

lazy_static! {
    static ref IDT: InterruptDescriptorTable = {
        let mut idt = InterruptDescriptorTable::new();
        idt.breakpoint.set_handler_fn(breakpoint_handler);
        idt
    };
}

pub fn init_idt() {
    IDT.load();
}

Note how this solution requires no unsafe blocks. The lazy_static! macro does use unsafe behind the scenes, but it is abstracted away in a safe interface.

Running it

The last step for making exceptions work in our kernel is to call the init_idt function from our main.rs. Instead of calling it directly, we introduce a general init function in our lib.rs:

// in src/lib.rs

pub fn init() {
    interrupts::init_idt();
}

With this function, we now have a central place for initialization routines that can be shared between the different _start functions in our main.rs, lib.rs, and integration tests.

Now we can update the _start function of our main.rs to call init and then trigger a breakpoint exception:

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    blog_os::init(); // new

    // invoke a breakpoint exception
    x86_64::instructions::interrupts::int3(); // new

    // as before
    #[cfg(test)]
    test_main();

    println!("It did not crash!");
    loop {}
}

When we run it in QEMU now (using cargo run), we see the following:

It works! The CPU successfully invokes our breakpoint handler, which prints the message, and then returns back to the _start function, where the It did not crash! message is printed.

We see that the interrupt stack frame tells us the instruction and stack pointers at the time when the exception occurred. This information is very useful when debugging unexpected exceptions.

Adding a Test

Let's create a test that ensures that the above continues to work. First, we update the _start function in our lib.rs to also call init:

// in src/lib.rs

/// Entry point for `cargo test`
#[cfg(test)]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    init();      // new
    test_main();
    loop {}
}

Remember, this _start function is used when running cargo test --lib, since Rust tests the lib.rs completely independently of the main.rs. We need to call init here to set up an IDT before running the tests.

Now we can create a test_breakpoint_exception test:

// in src/interrupts.rs

#[test_case]
fn test_breakpoint_exception() {
    // invoke a breakpoint exception
    x86_64::instructions::interrupts::int3();
}

The test invokes the int3 function to trigger a breakpoint exception. By checking that the execution continues afterward, we verify that our breakpoint handler is working correctly.

You can try this new test by running cargo test (all tests) or cargo test --lib (only tests of lib.rs and its modules). You should see the following in the output:

blog_os::interrupts::test_breakpoint_exception...	[ok]

Too much Magic?

The x86-interrupt calling convention and the InterruptDescriptorTable type made the exception handling process relatively straightforward and painless. If this was too much magic for you and you'd like to learn all the gory details of exception handling, we've got you covered: Our "Handling Exceptions with Naked Functions" series shows how to handle exceptions without the x86-interrupt calling convention and also creates its own IDT type. Historically, these posts were the main exception handling posts before the x86-interrupt calling convention and the x86_64 crate existed. Note that these posts are based on the first edition of this blog and might be out of date.

What's next?

We've successfully caught our first exception and returned from it! The next step is to ensure that we catch all exceptions because an uncaught exception causes a fatal triple fault, which leads to a system reset. The next post explains how we can avoid this by correctly catching double faults.

Integration Tests

Fri, 15 Jun 2018 00:00:00 +0000

To complete the testing picture we implement a basic integration test framework, which allows us to run tests on the target system. The idea is to run tests inside QEMU and report the results back to the host through the serial port.

Requirements

This post builds upon the Unit Testing post, so you need to follow it first. Alternatively, consider reading the new Testing post instead, which replaces both Unit Testing and this post. The new post implements similar functionality but integrates it directly in cargo xtest, so that both unit and integration tests run in a realistic environment inside QEMU.

Overview

In the previous post we added support for unit tests. The goal of unit tests is to test small components in isolation to ensure that each of them works as intended. The tests are run on the host machine and thus shouldn't rely on architecture-specific functionality.

To test the interaction of the components, both with each other and the system environment, we can write integration tests. Compared to unit tests, integration tests are more complex, because they need to run in a realistic environment. What this means depends on the application type. For example, for webserver applications it often means setting up a database instance. For an operating system kernel like ours, it means that we run the tests on the target hardware without an underlying operating system.

Running on the target architecture allows us to test all hardware-specific code such as the VGA buffer or the effects of page table modifications. It also allows us to verify that our kernel boots without problems and that no CPU exception occurs.

In this post we will implement a very basic test framework that runs integration tests inside instances of the QEMU virtual machine. It is not as realistic as running them on real hardware, but it is much simpler and should be sufficient as long as we only use standard hardware that is well supported in QEMU.

The Serial Port

The naive way of doing an integration test would be to add some assertions in the code, launch QEMU, and manually check if a panic occurred or not. This is very cumbersome and not practical if we have hundreds of integration tests. So we want an automated solution that runs all tests and fails if not all of them pass.

Such an automated test framework needs to know whether a test succeeded or failed. It can't look at the screen output of QEMU, so we need a different way of retrieving the test results on the host system. A simple way to achieve this is by using the serial port, an old interface standard which is no longer found in modern computers. It is easy to program and QEMU can redirect the bytes sent over serial to the host's standard output or a file.

The chips implementing a serial interface are called UARTs. There are lots of UART models on x86, but fortunately the only differences between them are some advanced features we don't need. The common UARTs today are all compatible with the 16550 UART, so we will use that model for our testing framework.

Port I/O

In contrast to memory-mapped I/O, which we already used for the VGA text buffer, port-mapped I/O uses a separate I/O bus for communication. Each connected peripheral has one or more port numbers. To communicate with such an I/O port there are special CPU instructions called in and out, which take a port number and a data byte (there are also variations of these commands that allow sending a u16 or u32).

The UART uses port-mapped I/O. Fortunately there are already several crates that provide abstractions for I/O ports and even UARTs, so we don't need to invoke the in and out assembly instructions manually.

Implementation

We will use the uart_16550 crate to initialize the UART and send data over the serial port. To add it as a dependency, we update our Cargo.toml and main.rs:

# in Cargo.toml

[dependencies]
uart_16550 = "0.1.0"

The uart_16550 crate contains a SerialPort struct that represents the UART registers, but we still need to construct an instance of it ourselves. For that we create a new serial module with the following content:

// in src/main.rs

mod serial;

// in src/serial.rs

use uart_16550::SerialPort;
use spin::Mutex;
use lazy_static::lazy_static;

lazy_static! {
    pub static ref SERIAL1: Mutex<SerialPort> = {
        let mut serial_port = SerialPort::new(0x3F8);
        serial_port.init();
        Mutex::new(serial_port)
    };
}

As with the VGA text buffer, we use lazy_static and a spinlock to create a static. However, this time we use lazy_static to ensure that the init method is called before first use. We're using the port address 0x3F8, which is the standard port number for the first serial interface.

To make the serial port easily usable, we add serial_print! and serial_println! macros:

#[doc(hidden)]
pub fn _print(args: ::core::fmt::Arguments) {
    use core::fmt::Write;
    SERIAL1.lock().write_fmt(args).expect("Printing to serial failed");
}

/// Prints to the host through the serial interface.
#[macro_export]
macro_rules! serial_print {
    ($($arg:tt)*) => {
        $crate::serial::_print(format_args!($($arg)*));
    };
}

/// Prints to the host through the serial interface, appending a newline.
#[macro_export]
macro_rules! serial_println {
    () => ($crate::serial_print!("\n"));
    ($fmt:expr) => ($crate::serial_print!(concat!($fmt, "\n")));
    ($fmt:expr, $($arg:tt)*) => ($crate::serial_print!(
        concat!($fmt, "\n"), $($arg)*));
}

The SerialPort type already implements the fmt::Write trait, so we don't need to provide an implementation.
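
To see how the pieces fit together without real hardware, the following host-side sketch swaps the serial port for a hypothetical `MockSerial` type that collects the output in a string. It assumes a std environment (so it uses std's Mutex instead of a spinlock), and the `mock_serial_print!` and `mock_serial_println!` names are made up for this example. The macros call `_print` directly instead of going through `$crate`, since everything lives in a single file here.

```rust
use std::fmt::{self, Write};
use std::sync::Mutex;

/// Hypothetical stand-in for the serial port: collects output in a string.
struct MockSerial(String);

impl Write for MockSerial {
    // Implementing `write_str` is enough; `write_fmt` is provided by the trait.
    fn write_str(&mut self, s: &str) -> fmt::Result {
        self.0.push_str(s);
        Ok(())
    }
}

static MOCK_SERIAL: Mutex<MockSerial> = Mutex::new(MockSerial(String::new()));

fn _print(args: fmt::Arguments) {
    MOCK_SERIAL
        .lock()
        .unwrap()
        .write_fmt(args)
        .expect("Printing to mock serial failed");
}

/// Prints to the mock serial buffer.
macro_rules! mock_serial_print {
    ($($arg:tt)*) => {
        _print(format_args!($($arg)*))
    };
}

/// Prints to the mock serial buffer, appending a newline.
macro_rules! mock_serial_println {
    () => (mock_serial_print!("\n"));
    ($fmt:expr) => (mock_serial_print!(concat!($fmt, "\n")));
    ($fmt:expr, $($arg:tt)*) => (mock_serial_print!(
        concat!($fmt, "\n"), $($arg)*));
}

fn main() {
    mock_serial_println!("Hello Host{}", "!");
    mock_serial_print!("{} + {} = {}", 1, 2, 1 + 2);
    assert_eq!(MOCK_SERIAL.lock().unwrap().0, "Hello Host!\n1 + 2 = 3");
}
```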

Now we can print to the serial interface in our main.rs:

// in src/main.rs

mod serial;

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!"); // prints to vga buffer
    serial_println!("Hello Host{}", "!");

    loop {}
}

QEMU Arguments

To see the serial output in QEMU, we can use the -serial argument to redirect the output to stdout:

> qemu-system-x86_64 \
    -drive format=raw,file=target/x86_64-blog_os/debug/bootimage-blog_os.bin \
    -serial mon:stdio
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
Hello Host!

If you chose a different name than blog_os, you need to update the paths of course. Note that you can no longer exit QEMU through Ctrl+c. As an alternative you can use Ctrl+a and then x.

As an alternative to this long command, we can pass the argument to bootimage run, with an additional -- to separate the build arguments (passed to cargo) from the run arguments (passed to QEMU).

bootimage run -- -serial mon:stdio

Instead of standard output, QEMU supports many more target devices. For redirecting the output to a file, the argument is:

-serial file:output-file.txt

Shutting Down QEMU

Right now we have an endless loop at the end of our _start function and need to close QEMU manually. This does not work for automated tests. We could try to kill QEMU automatically from the host, for example after some special output was sent over serial, but this would be a bit hacky and difficult to get right. The cleaner solution would be to implement a way to shut down our OS. Unfortunately this is relatively complex, because it requires implementing support for either the APM or ACPI power management standard.

Luckily, there is an escape hatch: QEMU supports a special isa-debug-exit device, which provides an easy way to exit QEMU from the guest system. To enable it, we add the following argument to our QEMU command:

-device isa-debug-exit,iobase=0xf4,iosize=0x04

The iobase specifies on which port address the device should live (0xf4 is a generally unused port on the x86's IO bus) and the iosize specifies the port size (0x04 means four bytes). Now the guest can write a value to the 0xf4 port and QEMU will exit with exit status (passed_value << 1) | 1.
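
The exit status formula is easy to check with a few sample values. The function name in this small sketch is hypothetical; it only evaluates the expression given above.

```rust
/// Computes the QEMU exit status for a value written to the
/// isa-debug-exit port, using the formula `(passed_value << 1) | 1`.
fn qemu_exit_status(passed_value: u32) -> u32 {
    (passed_value << 1) | 1
}

fn main() {
    for value in [0, 1, 2] {
        println!("write {} -> exit status {}", value, qemu_exit_status(value));
    }
    assert_eq!(qemu_exit_status(0), 1);
}
```

Because the lowest bit is always set, the exit status is never 0, no matter which value the guest writes.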

To write to the I/O port, we use the x86_64 crate:

# in Cargo.toml

[dependencies]
x86_64 = "0.5.2"

// in src/main.rs

pub unsafe fn exit_qemu() {
    use x86_64::instructions::port::Port;

    let mut port = Port::<u32>::new(0xf4);
    port.write(0);
}

We mark the function as unsafe because it relies on the fact that a special QEMU device is attached to the I/O port with address 0xf4. For the port type we choose u32 because the iosize is 4 bytes. As value we write a zero, which causes QEMU to exit with exit status (0 << 1) | 1 = 1.

Note that we could also use the exit status instead of the serial interface for sending the test results, for example 1 for success and 2 for failure. However, this wouldnâ€™t allow us to send panic messages like the serial interface does and would also prevent us from replacing exit_qemu with a proper shutdown someday. Therefore we continue to use the serial interface and just always write a 0 to the port.

We can now test the QEMU shutdown by calling exit_qemu from our _start function:

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!"); // prints to vga buffer
    serial_println!("Hello Host{}", "!");

    unsafe { exit_qemu(); }

    loop {}
}

You should see that QEMU immediately closes after booting when executing:

bootimage run -- -serial mon:stdio -device isa-debug-exit,iobase=0xf4,iosize=0x04

Hiding QEMU

We are now able to launch a QEMU instance that writes its output to the serial port and automatically exits itself when it’s done. So we no longer need the VGA buffer output or the graphical representation that still pops up. We can disable it by passing the -display none parameter to QEMU. The full command looks like this:

qemu-system-x86_64 \
    -drive format=raw,file=target/x86_64-blog_os/debug/bootimage-blog_os.bin \
    -serial mon:stdio \
    -device isa-debug-exit,iobase=0xf4,iosize=0x04 \
    -display none

Or, with bootimage run:

bootimage run -- \
    -serial mon:stdio \
    -device isa-debug-exit,iobase=0xf4,iosize=0x04 \
    -display none

Now QEMU runs completely in the background and no window is opened anymore. This is not only less annoying, but also allows our test framework to run in environments without a graphical user interface, such as Travis CI.

Test Organization

Right now we’re doing the serial output and the QEMU exit from the _start function in our main.rs and can no longer run our kernel in a normal way. We could try to fix this by adding an integration-test cargo feature and using conditional compilation:

# in Cargo.toml

[features]
integration-test = []

// in src/main.rs

#[cfg(not(feature = "integration-test"))] // new
#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!"); // prints to vga buffer

    // normal execution

    loop {}
}

#[cfg(feature = "integration-test")] // new
#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    serial_println!("Hello Host{}", "!");

    run_test_1();
    run_test_2();
    // run more tests

    unsafe { exit_qemu(); }

    loop {}
}

However, this approach has a big problem: All tests run in the same kernel instance, which means that they can influence each other. For example, if run_test_1 misconfigures the system by loading an invalid page table, it can cause run_test_2 to fail. This isn’t something that we want because it makes it very difficult to find the actual cause of an error.

Instead, we want our test instances to be as independent as possible. If a test wants to destroy most of the system configuration to ensure that some property still holds in catastrophic situations, it should be able to do so without needing to restore a correct system state afterwards. This means that we need to launch a separate QEMU instance for each test.

With the above conditional compilation we only have two modes: Run the kernel normally or execute all integration tests. To run each test in isolation we would need a separate cargo feature for each test with that approach, which would result in very complex conditional compilation bounds and confusing code.

A better solution is to create an additional executable for each test.

Additional Test Executables

Cargo allows adding additional executables to a project by putting them inside src/bin. We can use that feature to create a separate executable for each integration test. For example, a test-something executable could be added like this:

// src/bin/test-something.rs

#![cfg_attr(not(test), no_std)]
#![cfg_attr(not(test), no_main)]
#![cfg_attr(test, allow(unused_imports))]

use core::panic::PanicInfo;

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    // run tests
    loop {}
}

#[cfg(not(test))]
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

By providing a new implementation for _start we can create a minimal test case that only tests one specific thing and is independent of the rest. For example, if we don’t print anything to the VGA buffer, the test still succeeds even if the vga_buffer module is broken.

We can now run this executable in QEMU by passing a --bin argument to bootimage:

bootimage run --bin test-something

It should build the test-something.rs executable instead of main.rs and launch an empty QEMU window (since we don’t print anything). So this approach allows us to create completely independent executables without cargo features or conditional compilation, and without cluttering our main.rs.

However, there is a problem: This is a completely separate executable, which means that we can’t access any functions from our main.rs, including serial_println and exit_qemu. Duplicating the code would work, but we would also need to copy everything we want to test. This would mean that we no longer test the original function but only a possibly outdated copy.

Fortunately there is a way to share most of the code between our main.rs and the testing binaries: We move most of the code from our main.rs to a library that we can include from all executables.

Split Off A Library

Cargo supports hybrid projects that are both a library and a binary. We only need to create a src/lib.rs file and split the contents of our main.rs in the following way:

// src/lib.rs

#![cfg_attr(not(test), no_std)] // don't link the Rust standard library

// NEW: We need to add `pub` here to make them accessible from the outside
pub mod vga_buffer;
pub mod serial;

pub unsafe fn exit_qemu() {
    use x86_64::instructions::port::Port;

    let mut port = Port::<u32>::new(0xf4);
    port.write(0);
}

// src/main.rs

#![cfg_attr(not(test), no_std)]
#![cfg_attr(not(test), no_main)]
#![cfg_attr(test, allow(unused_imports))]

use core::panic::PanicInfo;
use blog_os::println;

/// This function is the entry point, since the linker looks for a function
/// named `_start` by default.
#[cfg(not(test))]
#[no_mangle] // don't mangle the name of this function
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    loop {}
}

/// This function is called on panic.
#[cfg(not(test))]
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    println!("{}", info);
    loop {}
}

So we move everything except _start and panic to lib.rs and make the vga_buffer and serial modules public. Everything should work exactly as before, including bootimage run and cargo test. To run tests only for the library part of our crate and avoid the additional output we can execute cargo test --lib.

Test Basic Boot

We are finally able to create our first integration test executable. We start simple and only test that the basic boot sequence works and the _start function is called:

// in src/bin/test-basic-boot.rs

#![cfg_attr(not(test), no_std)]
#![cfg_attr(not(test), no_main)] // disable all Rust-level entry points
#![cfg_attr(test, allow(unused_imports))]

use core::panic::PanicInfo;
use blog_os::{exit_qemu, serial_println};

/// This function is the entry point, since the linker looks for a function
/// named `_start` by default.
#[cfg(not(test))]
#[no_mangle] // don't mangle the name of this function
pub extern "C" fn _start() -> ! {
    serial_println!("ok");

    unsafe { exit_qemu(); }
    loop {}
}


/// This function is called on panic.
#[cfg(not(test))]
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    serial_println!("failed");

    serial_println!("{}", info);

    unsafe { exit_qemu(); }
    loop {}
}

We don’t do anything special here: we just print ok if _start is called and failed with the panic message when a panic occurs. Let’s try it:

> bootimage run --bin test-basic-boot -- \
    -serial mon:stdio -display none \
    -device isa-debug-exit,iobase=0xf4,iosize=0x04
Building kernel
   Compiling blog_os v0.2.0 (file:///…/blog_os)
    Finished dev [unoptimized + debuginfo] target(s) in 0.19s
    Updating registry `https://github.com/rust-lang/crates.io-index`
Creating disk image at target/x86_64-blog_os/debug/bootimage-test-basic-boot.bin
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
ok

We got our ok, so it worked! Try inserting a panic!() before the ok printing; you should see output like this:

failed
panicked at 'explicit panic', src/bin/test-basic-boot.rs:19:5

Test Panic

To test that our panic handler is really invoked on a panic, we create a test-panic test:

// in src/bin/test-panic.rs

#![cfg_attr(not(test), no_std)]
#![cfg_attr(not(test), no_main)]
#![cfg_attr(test, allow(unused_imports))]

use core::panic::PanicInfo;
use blog_os::{exit_qemu, serial_println};

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    panic!();
}

#[cfg(not(test))]
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    serial_println!("ok");

    unsafe { exit_qemu(); }
    loop {}
}

This executable is almost identical to test-basic-boot; the only difference is that we print ok from our panic handler and invoke an explicit panic!() in our _start function.

A Test Runner

The final step is to create a test runner, a program that executes all integration tests and checks their results. The basic steps that it should do are:

Look for integration tests in the current project, maybe by some convention (e.g. executables starting with test-).
Run all integration tests and interpret their results.
- Use a timeout to ensure that an endless loop does not block the test runner forever.
Report the test results to the user and set a successful or failing exit status.

Such a test runner is useful to many projects, so we decided to add one to the bootimage tool.

Bootimage Test

The test runner of the bootimage tool can be invoked via bootimage test. It uses the following conventions:

All executables starting with test- are treated as integration tests.
Tests must print either ok or failed over the serial port. When printing failed they can print additional information such as a panic message (in the next lines).
Tests are run with a timeout of 1 minute. If the test has not completed in time, it is reported as “timed out”.
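
To make these conventions concrete, the following host-side sketch classifies the serial output of a single test. It is not the actual implementation of bootimage test; the TestResult enum, the interpret function, and the Invalid variant for unexpected output are assumptions made for illustration:

```rust
/// Possible outcomes of one integration test (hypothetical type).
#[derive(Debug, PartialEq)]
enum TestResult {
    Ok,
    /// The test printed `failed`; the remaining lines are kept as the message.
    Failed(String),
    /// No output arrived before the timeout.
    TimedOut,
    /// Assumption: anything other than `ok`/`failed` is treated as invalid.
    Invalid(String),
}

/// Interprets the serial output of a test. `None` stands for a run that
/// hit the timeout before completing.
fn interpret(serial_output: Option<&str>) -> TestResult {
    let output = match serial_output {
        Some(output) => output,
        None => return TestResult::TimedOut,
    };
    let mut lines = output.lines();
    match lines.next().map(str::trim) {
        Some("ok") => TestResult::Ok,
        // Additional information such as a panic message follows on the next lines.
        Some("failed") => TestResult::Failed(lines.collect::<Vec<_>>().join("\n")),
        _ => TestResult::Invalid(output.to_string()),
    }
}
```

A real runner would additionally have to spawn QEMU, capture its serial output, and enforce the timeout; those parts are left out here.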

The test-basic-boot and test-panic tests we created above begin with test- and follow the ok/failed conventions, so they should work with bootimage test:

> bootimage test
test-panic
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Ok

test-basic-boot
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Ok

test-something
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Timed Out

The following tests failed:
    test-something: TimedOut

We see that our test-panic and test-basic-boot succeeded and that the test-something test timed out after one minute. We no longer need test-something, so we delete it (if you haven’t done so already). Now bootimage test should execute successfully.

Summary

In this post we learned about the serial port and port-mapped I/O and saw how to configure QEMU to print serial output to the command line. We also learned a trick for exiting QEMU without needing to implement a proper shutdown.

We then split our crate into a library and binary part in order to create additional executables for integration tests. We added two example tests for testing that the _start function is correctly called and that a panic invokes our panic handler. Finally, we presented bootimage test as a basic test runner for our integration tests.

We now have a working integration test framework and can finally start to implement functionality in our kernel. We will continue to use the test framework over the next posts to test new components we add.

What’s next?

Unit Testing

Sun, 29 Apr 2018 00:00:00 +0000

This post explores unit testing in no_std executables using Rust’s built-in test framework. We will adjust our code so that cargo test works and add some basic unit tests to our VGA buffer module.

Requirements

In this post we explore how to execute cargo test on the host system (as a normal Linux/Windows/macOS executable). This only works if you don’t have a .cargo/config file that sets a default target. If you followed the Minimal Rust Kernel post before 2019-04-27, you should be fine. If you followed it after that date, you need to remove the build.target key from your .cargo/config file and explicitly pass a target argument to cargo xbuild.

Alternatively, consider reading the new Testing post instead. It sets up functionality similar to this post, but instead of running the tests on your host system, they are run in a realistic environment inside QEMU.

Unit Tests for `no_std` Binaries

Unfortunately it’s a bit more complicated for no_std applications such as our kernel. If we run cargo test (without adding any test yet), we get the following error:

> cargo test
   Compiling blog_os v0.2.0 (file:///…/blog_os)
error[E0152]: duplicate lang item found: `panic_impl`.
  --> src/main.rs:35:1
   |
35 | / fn panic(info: &PanicInfo) -> ! {
36 | |     println!("{}", info);
37 | |     loop {}
38 | | }
   | |_^
   |
   = note: first defined in crate `std`.

The problem is that unit tests are built for the host machine, with the std library included. This makes sense because they should be able to run as a normal application on the host operating system. Since the standard library has its own panic_handler function, we get the above error. To fix it, we use conditional compilation to include our implementation of the panic handler only in non-test environments:

// in src/main.rs

use core::panic::PanicInfo;

#[cfg(not(test))] // only compile when the test flag is not set
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    println!("{}", info);
    loop {}
}

The only change is the added #[cfg(not(test))] attribute. The #[cfg(…)] attribute ensures that the annotated item is only included if the passed condition is met. The test configuration is set when the crate is compiled for unit tests. Through not(…) we negate the condition so that the language item is only compiled for non-test builds.
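
The effect of the attribute can also be observed in an ordinary host program. The following sketch is independent of our kernel; the build_mode function is a made-up name used only for illustration. Exactly one of the two definitions is compiled, depending on whether the test configuration is set:

```rust
/// Compiled only when the `test` configuration is set (e.g. by `cargo test`).
#[cfg(test)]
fn build_mode() -> &'static str {
    "test"
}

/// Compiled only when the `test` configuration is not set
/// (e.g. by `cargo build` or `cargo run`).
#[cfg(not(test))]
fn build_mode() -> &'static str {
    "normal"
}
```

Since the two conditions are mutually exclusive, there is never a duplicate definition, which is the same reason the conflict with the panic handler of std disappears.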

When we now try cargo test again, we get an ugly linker error:

error: linking with `cc` failed: exit code: 1
  |
  = note: "cc" "-Wl,--as-needed" "-Wl,-z,noexecstack" "-m64" "-L" "/…/lib/rustlib/x86_64-unknown-linux-gnu/lib" […]
  = note: /…/blog_os-969bdb90d27730ed.2q644ojj2xqxddld.rcgu.o: In function `_start':
          /…/blog_os/src/main.rs:17: multiple definition of `_start'
          /usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/Scrt1.o:(.text+0x0): first defined here
          /usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/Scrt1.o: In function `_start':
          (.text+0x20): undefined reference to `main'
          collect2: error: ld returned 1 exit status

I shortened the output here because it is extremely verbose. The relevant part is at the bottom, after the second “note:”. We got two distinct errors here, “multiple definition of _start” and “undefined reference to main”.

The reason for the first error is that the test framework injects its own main and _start functions, which will run the tests when invoked. So we get two functions named _start when compiling in test mode, one from the test framework and the one we defined ourselves. To fix this, we need to exclude our _start function in that case, which we can do by marking it as #[cfg(not(test))]:

// in src/main.rs

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! { … }

The second problem is that we use the #![no_main] attribute for our crate, which suppresses any main generation, including the test main. To solve this, we use the cfg_attr attribute to conditionally enable the no_main attribute only in non-test mode:

// in src/main.rs

#![cfg_attr(not(test), no_main)] // instead of `#![no_main]`

Now cargo test works:

> cargo test
   Compiling blog_os v0.2.0 (file:///…/blog_os)
    [some warnings]
    Finished dev [unoptimized + debuginfo] target(s) in 0.98 secs
     Running target/debug/deps/blog_os-1f08396a9eff0aa7

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

The test framework seems to work as intended. We don’t have any tests yet, but we already get a test result summary.

Silencing the Warnings

We get a few warnings about unused imports, because we no longer compile our _start function. To silence such unused code warnings, we can add the following to the top of our main.rs:

#![cfg_attr(test, allow(unused_imports))]

Like before, the cfg_attr attribute sets the passed attribute if the passed condition holds. Here, we set the allow(…) attribute when compiling in test mode. We use the allow attribute to disable warnings for the unused_imports lint.

Lints are classes of warnings, for example dead_code for unused code or missing-docs for missing documentation. Lints can be set to four different states:

allow: no errors, no warnings
warn: causes a warning
deny: causes a compilation error
forbid: like deny, but can’t be overridden

Some lints are allow by default (such as missing-docs), others are warn by default (such as dead_code), and a few are even deny by default. The default can be overridden by the allow, warn, deny and forbid attributes. For a list of all lints, see rustc -W help. There is also the clippy project, which provides many additional lints.
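
As a small illustration of overriding lint levels on individual items, consider the following host-side sketch. The functions unused_helper and double are hypothetical examples and not part of our kernel:

```rust
// `allow`: no warning is emitted even though this function is never called.
#[allow(dead_code)]
fn unused_helper() {}

// `deny`: inside this function, an ignored `Result` is a compilation error
// instead of the default warning.
#[deny(unused_must_use)]
fn double(text: &str) -> Result<u32, std::num::ParseIntError> {
    // Writing `text.parse::<u32>();` as a bare statement here would be
    // rejected because the returned `Result` would be unused.
    let number = text.parse::<u32>()?;
    Ok(number * 2)
}
```

The same attributes can be applied to a whole module or crate, which is what the crate-level #![cfg_attr(test, allow(unused_imports))] above does for test builds.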

Including the Standard Library

Unit tests run on the host machine, so it’s possible to use the complete standard library inside them. To link the standard library in test mode, we can make the #![no_std] attribute conditional through cfg_attr too:

-#![no_std]
+#![cfg_attr(not(test), no_std)]

Testing the VGA Module

Now that we have set up the test framework, we can add a first unit test for our vga_buffer module:

// in src/vga_buffer.rs

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn foo() {}
}

We add the test in an inline test submodule. This isn’t necessary, but a common way to separate test code from the rest of the module. By adding the #[cfg(test)] attribute, we ensure that the module is only compiled in test mode. Through use super::*, we import all items of the parent module (the vga_buffer module), so that we can test them easily.

The #[test] attribute on the foo function tells the test framework that the function is a unit test. The framework will find it automatically, even if it’s private and inside a private module as in our case:

> cargo test
   Compiling blog_os v0.2.0 (file:///…/blog_os)
    Finished dev [unoptimized + debuginfo] target(s) in 2.99 secs
     Running target/debug/deps/blog_os-1f08396a9eff0aa7

running 1 test
test vga_buffer::test::foo ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

We see that the test was found and executed. It didn’t panic, so it counts as passed.

Constructing a Writer

In order to test the VGA methods, we first need to construct a Writer instance. Since we will need such an instance for other tests too, we create a separate function for it:

// in src/vga_buffer.rs

#[cfg(test)]
mod test {
    use super::*;

    fn construct_writer() -> Writer {
        use std::boxed::Box;

        let buffer = construct_buffer();
        Writer {
            column_position: 0,
            color_code: ColorCode::new(Color::Blue, Color::Magenta),
            buffer: Box::leak(Box::new(buffer)),
        }
    }

    fn construct_buffer() -> Buffer { … }
}

We set the initial column position to 0 and choose some arbitrary colors for foreground and background color. The difficult part is the buffer construction, which is described in detail below. We then use Box::new and Box::leak to transform the created Buffer into a &'static mut Buffer, because the buffer field needs to be of that type.
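
The Box::new and Box::leak combination can be tried out in isolation. The generic helper leak_to_static below is a hypothetical name introduced for this sketch; it only relies on the standard library:

```rust
/// Moves `value` to the heap and leaks the allocation, which yields a
/// reference that is valid for the rest of the program.
/// (`leak_to_static` is a hypothetical helper used for illustration.)
fn leak_to_static<T: 'static>(value: T) -> &'static mut T {
    // `Box::new` allocates on the heap; `Box::leak` gives up ownership
    // without freeing, so the memory is never reclaimed.
    Box::leak(Box::new(value))
}
```

Leaking is acceptable here because the memory only needs to live as long as the short-lived test process.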

Buffer Construction

So how do we create a Buffer instance? The naive approach does not work unfortunately:

fn construct_buffer() -> Buffer {
    Buffer {
        chars: [[Volatile::new(empty_char()); BUFFER_WIDTH]; BUFFER_HEIGHT],
    }
}

fn empty_char() -> ScreenChar {
    ScreenChar {
        ascii_character: b' ',
        color_code: ColorCode::new(Color::Green, Color::Brown),
    }
}

When running cargo test the following error occurs:

error[E0277]: the trait bound `volatile::Volatile<vga_buffer::ScreenChar>: core::marker::Copy` is not satisfied
   --> src/vga_buffer.rs:186:21
    |
186 |             chars: [[Volatile::new(empty_char); BUFFER_WIDTH]; BUFFER_HEIGHT],
    |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `core::marker::Copy` is not implemented for `volatile::Volatile<vga_buffer::ScreenChar>`
    |
    = note: the `Copy` trait is required because the repeated element will be copied

The problem is that array construction in Rust requires that the contained type is Copy. The ScreenChar is Copy, but the Volatile wrapper is not. There is currently no easy way to circumvent this without using unsafe, but fortunately there is the array_init crate that provides a safe interface for such operations.

To use that crate, we add the following to our Cargo.toml:

[dev-dependencies]
array-init = "0.0.3"

Note that we’re using the dev-dependencies table instead of the dependencies table, because we only need the crate for cargo test and not for a normal build.

Now we can fix our construct_buffer function:

fn construct_buffer() -> Buffer {
    use array_init::array_init;

    Buffer {
        chars: array_init(|_| array_init(|_| Volatile::new(empty_char()))),
    }
}

See the documentation of array_init for more information about using that crate.
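
As an aside, newer Rust versions (1.63 and later) ship std::array::from_fn, which covers the same use case without an extra crate. This is an alternative to the array_init approach used in this post, not what the post itself does. In the sketch below, Cell is a stand-in for a non-Copy wrapper such as Volatile<ScreenChar>, and the small dimensions are arbitrary:

```rust
use std::array;

/// Stand-in for a non-`Copy` wrapper type (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
struct Cell(u8);

const WIDTH: usize = 4;
const HEIGHT: usize = 2;

/// Builds a two-dimensional array by calling a closure for every element,
/// so the element type does not need to implement `Copy`.
fn blank_grid() -> [[Cell; WIDTH]; HEIGHT] {
    array::from_fn(|_| array::from_fn(|_| Cell(b' ')))
}
```

The closure receives the index of the element being created, which is ignored here because all elements are identical.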

Testing `write_byte`

Now we’re finally able to write a first unit test that tests the write_byte method:

// in src/vga_buffer.rs

mod test {
    […]

    #[test]
    fn write_byte() {
        let mut writer = construct_writer();
        writer.write_byte(b'X');
        writer.write_byte(b'Y');

        for (i, row) in writer.buffer.chars.iter().enumerate() {
            for (j, screen_char) in row.iter().enumerate() {
                let screen_char = screen_char.read();
                if i == BUFFER_HEIGHT - 1 && j == 0 {
                    assert_eq!(screen_char.ascii_character, b'X');
                    assert_eq!(screen_char.color_code, writer.color_code);
                } else if i == BUFFER_HEIGHT - 1 && j == 1 {
                    assert_eq!(screen_char.ascii_character, b'Y');
                    assert_eq!(screen_char.color_code, writer.color_code);
                } else {
                    assert_eq!(screen_char, empty_char());
                }
            }
        }
    }
}

We construct a Writer, write two bytes to it, and then check that the right screen characters were updated. When we run cargo test, we see that the test is executed and passes:

running 1 test
test vga_buffer::test::write_byte ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Try to play around a bit with this function and verify that the test fails if you change something, e.g. if you print a third byte without adjusting the for loop.

(If you’re getting a “binary operation == cannot be applied to type vga_buffer::ScreenChar” error, you need to also derive PartialEq for ScreenChar and ColorCode).

Testing Strings

Let’s add a second unit test to test formatted output and newline behavior:

// in src/vga_buffer.rs

mod test {
    […]

    #[test]
    fn write_formatted() {
        use core::fmt::Write;

        let mut writer = construct_writer();
        writeln!(&mut writer, "a").unwrap();
        writeln!(&mut writer, "b{}", "c").unwrap();

        for (i, row) in writer.buffer.chars.iter().enumerate() {
            for (j, screen_char) in row.iter().enumerate() {
                let screen_char = screen_char.read();
                if i == BUFFER_HEIGHT - 3 && j == 0 {
                    assert_eq!(screen_char.ascii_character, b'a');
                    assert_eq!(screen_char.color_code, writer.color_code);
                } else if i == BUFFER_HEIGHT - 2 && j == 0 {
                    assert_eq!(screen_char.ascii_character, b'b');
                    assert_eq!(screen_char.color_code, writer.color_code);
                } else if i == BUFFER_HEIGHT - 2 && j == 1 {
                    assert_eq!(screen_char.ascii_character, b'c');
                    assert_eq!(screen_char.color_code, writer.color_code);
                } else if i >= BUFFER_HEIGHT - 2 {
                    assert_eq!(screen_char.ascii_character, b' ');
                    assert_eq!(screen_char.color_code, writer.color_code);
                } else {
                    assert_eq!(screen_char, empty_char());
                }
            }
        }
    }
}

In this test we’re using the writeln! macro to print strings with newlines to the buffer. Most of the for loop is similar to the write_byte test and only verifies if the written characters are at the expected place. The new if i >= BUFFER_HEIGHT - 2 case verifies that the empty lines that are shifted in on a newline have the writer.color_code, which is different from the initial color.
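
The writeln! macro works with any type that implements the fmt::Write trait; it formats the arguments and passes the resulting string pieces, including the trailing newline, to write_str. The following host-side sketch shows this with a made-up LineRecorder type that stands in for our Writer and simply collects finished lines:

```rust
use std::fmt::{self, Write};

/// Hypothetical in-memory writer that records every completed line.
struct LineRecorder {
    lines: Vec<String>,
    current: String,
}

impl Write for LineRecorder {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        for character in s.chars() {
            if character == '\n' {
                // A newline finishes the current line and starts a new one.
                self.lines.push(std::mem::take(&mut self.current));
            } else {
                self.current.push(character);
            }
        }
        Ok(())
    }
}

/// Writes the same two lines as the test above and returns what was recorded.
fn record_example() -> Vec<String> {
    let mut recorder = LineRecorder { lines: Vec::new(), current: String::new() };
    writeln!(recorder, "a").unwrap();
    writeln!(recorder, "b{}", "c").unwrap();
    recorder.lines
}
```

Unlike our VGA Writer, this stand-in has no fixed-size buffer and no colors; it only illustrates how formatted output and newlines reach the write_str method.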

More Tests

We only present two basic tests here as an example, but of course many more tests are possible. For example a test that changes the writer color in between writes. Or a test that checks that the top line is correctly shifted off the screen on a newline. Or a test that checks that non-ASCII characters are handled correctly.

Summary

Unit testing is a very useful technique to ensure that certain components have a desired behavior. Even though unit tests cannot show the absence of bugs, they’re still a useful tool for finding them and especially for avoiding regressions.

This post explained how to set up unit testing in a Rust kernel. We now have a functioning test framework and can easily add tests by adding functions with a #[test] attribute. To run them, a short cargo test suffices. We also added a few basic tests for our VGA buffer as an example of what unit tests could look like.

We also learned a bit about conditional compilation, Rust’s lint system, how to initialize arrays with non-Copy types, and the dev-dependencies section of the Cargo.toml.

What’s next?

We now have a working unit testing framework, which gives us the ability to test individual components. However, unit tests have the disadvantage that they run on the host machine and are thus unable to test how components interact with platform specific parts. For example, we can’t test the println! macro with a unit test because it writes to the VGA text buffer at address 0xb8000, which only exists in the bare metal environment.

The next post will close this gap by creating a basic integration test framework, which runs the tests in QEMU and thus has access to platform specific components. This will allow us to test the full system, for example that our kernel boots correctly or that no deadlock occurs on nested println! invocations.

Writing an OS in pure Rust

Fri, 09 Mar 2018 00:00:00 +0000

Over the past six months we’ve been working on a second edition of this blog. Our goals for this new version are numerous and we are still not done yet, but today we reached a major milestone: It is now possible to build the OS natively on Windows, macOS, and Linux without any non-Rust dependencies.

The first edition required several C-tools for building:

We used the GRUB bootloader for booting our kernel. To create a bootable disk/CD image we used the grub-mkrescue tool, which is very difficult to get to run on Windows.
The xorriso program was also required, because it is used by grub-mkrescue.
GRUB only boots to protected mode, so we needed some assembly code for entering long mode. For building the assembly code, we used the nasm assembler.
We used the GNU linker ld for linking together the assembly files with the rust code, using a custom linker script.
Finally, we used make for automating the various build steps (assembling, compiling the Rust code, linking, invoking grub-mkrescue).

We got lots of feedback that this setup was difficult to get running under macOS and Windows. As a workaround, we added support for docker, but that still required users to install and understand an additional dependency. So when we decided to create a second edition of the blog - originally because the order of posts led to jumps in difficulty - we thought about how we could avoid these C-dependencies.

There are lots of alternatives to make, including some Rust tools such as just and cargo-make. Avoiding nasm is also possible by using Rust’s global_asm feature instead. So there are only two problems left: the bootloader and the linker.

A custom Bootloader

To avoid the dependency on GRUB and to make things more ergonomic, we decided to write our own bootloader using Rust’s global_asm feature. This way, the kernel can be significantly simplified, since the switch to long mode and the initial page table layout can already be done in the bootloader. Thus, we can avoid the initial assembly level blog posts in the second edition and directly start with high level Rust code.

The bootloader is still an early prototype, but it is already capable of switching to long mode and loading the kernel in the form of a 64-bit ELF binary. It also performs the correct page table mapping (with the correct read/write/execute permissions) as it’s specified in the ELF file and creates an initial physical memory map.

The plan for the future is to make the bootloader more stable, add documentation, and ultimately add a “Writing a Bootloader” series to the blog, which explains in detail how the bootloader works.

Linking with LLD

With our custom bootloader in place, the last remaining problem is platform independent linking. Fortunately there is LLD, the cross-platform linker from the LLVM project, which is already very stable for the x86 architecture. As a bonus, LLD is now shipped with Rust, which means that it can be used without any extra installation.

The new Posts

The second edition is already live at https://os.phil-opp.com/second-edition. Please tell us if you have any feedback on the new posts! We’re planning to move over the content from the first edition iteratively, in a different order and with various other improvements.

Many thanks to everyone who helped to make Rust an even better language for OS development!

VGA Text Mode

Mon, 26 Feb 2018 00:00:00 +0000

The VGA text mode is a simple way to print text to the screen. In this post, we create an interface that makes its usage safe and simple by encapsulating all unsafety in a separate module. We also implement support for Rust’s formatting macros.

The VGA Text Buffer

To print a character to the screen in VGA text mode, one has to write it to the text buffer of the VGA hardware. The VGA text buffer is a two-dimensional array with typically 25 rows and 80 columns, which is directly rendered to the screen. Each array entry describes a single screen character through the following format:

Bit(s)	Value
0-7	ASCII code point
8-11	Foreground color
12-14	Background color
15	Blink

The first byte represents the character that should be printed in the ASCII encoding. To be more specific, it isn’t exactly ASCII, but a character set named code page 437 with some additional characters and slight modifications. For simplicity, we will proceed to call it an ASCII character in this post.

The second byte defines how the character is displayed. The first four bits define the foreground color, the next three bits the background color, and the last bit whether the character should blink. The following colors are available:

Number	Color	Number + Bright Bit	Bright Color
0x0	Black	0x8	Dark Gray
0x1	Blue	0x9	Light Blue
0x2	Green	0xa	Light Green
0x3	Cyan	0xb	Light Cyan
0x4	Red	0xc	Light Red
0x5	Magenta	0xd	Pink
0x6	Brown	0xe	Yellow
0x7	Light Gray	0xf	White

The fourth bit of a color number (value 0x8) is the bright bit, which turns, for example, blue into light blue. For the background color, this bit is repurposed as the blink bit.
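
The bit layout from the first table can be expressed as a small encoding function. This is only a host-side sketch for checking the arithmetic; encode_cell is a hypothetical helper that takes the raw color numbers from the table above rather than the types we define later in this post:

```rust
/// Packs a character and its attributes into the 16-bit format of one
/// text buffer entry. (`encode_cell` is a hypothetical helper.)
fn encode_cell(ascii: u8, foreground: u8, background: u8, blink: bool) -> u16 {
    let character = ascii as u16;                   // bits 0-7
    let foreground = (foreground & 0x0f) as u16;    // bits 8-11 (incl. bright bit)
    let background = (background & 0x07) as u16;    // bits 12-14
    let blink = blink as u16;                       // bit 15
    character | (foreground << 8) | (background << 12) | (blink << 15)
}
```

For example, a light green (0xa) H (0x48) on black (0x0) without blinking is encoded as 0x0a48.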

The VGA text buffer is accessible via memory-mapped I/O at the address 0xb8000. This means that reads and writes to that address don’t access the RAM but directly access the text buffer on the VGA hardware. Consequently, we can read and write it through normal memory operations on that address.

Note that memory-mapped hardware might not support all normal RAM operations. For example, a device could only support byte-wise reads and return junk when a u64 is read. Fortunately, the text buffer supports normal reads and writes, so we don’t have to treat it in a special way.
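
Since every buffer entry occupies two bytes and the rows are stored one after another, the address of a given screen position can be computed directly. The following sketch assumes the typical 80x25 layout mentioned above; cell_address and the constants are hypothetical names used for illustration, and the code only computes addresses without accessing them:

```rust
const VGA_BUFFER_ADDRESS: usize = 0xb8000;
const BUFFER_WIDTH: usize = 80;
const BUFFER_HEIGHT: usize = 25;
const BYTES_PER_CELL: usize = 2; // character byte + color byte

/// Returns the memory address of the entry at (`row`, `col`), or `None`
/// if the position lies outside the 80x25 buffer.
fn cell_address(row: usize, col: usize) -> Option<usize> {
    if row >= BUFFER_HEIGHT || col >= BUFFER_WIDTH {
        return None;
    }
    Some(VGA_BUFFER_ADDRESS + (row * BUFFER_WIDTH + col) * BYTES_PER_CELL)
}
```

The whole buffer thus spans 80 * 25 * 2 = 4000 bytes starting at 0xb8000.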

A Rust Module

Now that we know how the VGA buffer works, we can create a Rust module to handle printing:

// in src/main.rs
mod vga_buffer;

For the content of this module, we create a new src/vga_buffer.rs file. All of the code below goes into our new module (unless specified otherwise).

Colors

First, we represent the different colors using an enum:

// in src/vga_buffer.rs

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Color {
    Black = 0,
    Blue = 1,
    Green = 2,
    Cyan = 3,
    Red = 4,
    Magenta = 5,
    Brown = 6,
    LightGray = 7,
    DarkGray = 8,
    LightBlue = 9,
    LightGreen = 10,
    LightCyan = 11,
    LightRed = 12,
    Pink = 13,
    Yellow = 14,
    White = 15,
}

We use a C-like enum here to explicitly specify the number for each color. Because of the repr(u8) attribute, each enum variant is stored as a u8. Actually 4 bits would be sufficient, but Rust doesn’t have a u4 type.

Normally the compiler would issue a warning for each unused variant. By using the #[allow(dead_code)] attribute, we disable these warnings for the Color enum.

By deriving the Copy, Clone, Debug, PartialEq, and Eq traits, we enable copy semantics for the type and make it printable and comparable.

To represent a full color code that specifies foreground and background color, we create a newtype on top of u8:

// in src/vga_buffer.rs

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
struct ColorCode(u8);

impl ColorCode {
    fn new(foreground: Color, background: Color) -> ColorCode {
        ColorCode((background as u8) << 4 | (foreground as u8))
    }
}

The ColorCode struct contains the full color byte, containing foreground and background color. Like before, we derive the Copy and Debug traits for it. To ensure that the ColorCode has the exact same data layout as a u8, we use the repr(transparent) attribute.
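The packing done by ColorCode::new can be tried out on the host with plain u8 values. This is only a sketch: `pack_color` and `unpack_color` are hypothetical helpers, and unlike the constructor above, the sketch masks the foreground to four bits.

```rust
/// Packs a color byte: background in the upper four bits, foreground in the lower four.
pub fn pack_color(foreground: u8, background: u8) -> u8 {
    (background << 4) | (foreground & 0x0f)
}

/// Splits a color byte back into `(foreground, background)`.
pub fn unpack_color(code: u8) -> (u8, u8) {
    (code & 0x0f, code >> 4)
}
```

Yellow (14) on black (0) yields 0x0e, and white (15) on blue (1) yields 0x1f.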

Text Buffer

Now we can add structures to represent a screen character and the text buffer:

// in src/vga_buffer.rs

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(C)]
struct ScreenChar {
    ascii_character: u8,
    color_code: ColorCode,
}

const BUFFER_HEIGHT: usize = 25;
const BUFFER_WIDTH: usize = 80;

#[repr(transparent)]
struct Buffer {
    chars: [[ScreenChar; BUFFER_WIDTH]; BUFFER_HEIGHT],
}

Since the field ordering in default structs is undefined in Rust, we need the repr(C) attribute. It guarantees that the struct’s fields are laid out exactly like in a C struct and thus guarantees the correct field ordering. For the Buffer struct, we use repr(transparent) again to ensure that it has the same memory layout as its single field.
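To check the effect of these layout attributes, a host-side sketch can query the sizes with std::mem. It redefines simplified copies of the types (a plain u8 stands in for ColorCode, which has the same layout), and the function `layout_summary` is hypothetical.

```rust
use std::mem::{align_of, size_of};

// Simplified stand-in for the ScreenChar type of this post.
#[allow(dead_code)]
#[derive(Clone, Copy)]
#[repr(C)]
struct ScreenChar {
    ascii_character: u8,
    color_code: u8, // stands in for ColorCode, which is layout-compatible with u8
}

// Same shape as the Buffer type: 25 rows of 80 cells.
#[allow(dead_code)]
#[repr(transparent)]
struct Buffer {
    chars: [[ScreenChar; 80]; 25],
}

/// Returns (size of a cell, alignment of a cell, size of the whole buffer) in bytes.
pub fn layout_summary() -> (usize, usize, usize) {
    (size_of::<ScreenChar>(), align_of::<ScreenChar>(), size_of::<Buffer>())
}
```

Each cell occupies two bytes, so the whole buffer spans 25 × 80 × 2 = 4000 bytes.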

To actually write to screen, we now create a writer type:

// in src/vga_buffer.rs

pub struct Writer {
    column_position: usize,
    color_code: ColorCode,
    buffer: &'static mut Buffer,
}

The writer will always write to the last line and shift lines up when a line is full (or on \n). The column_position field keeps track of the current position in the last row. The current foreground and background colors are specified by color_code and a reference to the VGA buffer is stored in buffer. Note that we need an explicit lifetime here to tell the compiler how long the reference is valid. The 'static lifetime specifies that the reference is valid for the whole program run time (which is true for the VGA text buffer).

Printing

Now we can use the Writer to modify the buffer’s characters. First we create a method to write a single ASCII byte:

// in src/vga_buffer.rs

impl Writer {
    pub fn write_byte(&mut self, byte: u8) {
        match byte {
            b'\n' => self.new_line(),
            byte => {
                if self.column_position >= BUFFER_WIDTH {
                    self.new_line();
                }

                let row = BUFFER_HEIGHT - 1;
                let col = self.column_position;

                let color_code = self.color_code;
                self.buffer.chars[row][col] = ScreenChar {
                    ascii_character: byte,
                    color_code,
                };
                self.column_position += 1;
            }
        }
    }

    fn new_line(&mut self) {/* TODO */}
}

If the byte is the newline byte \n, the writer does not print anything. Instead, it calls a new_line method, which we’ll implement later. Other bytes get printed to the screen in the second match case.

When printing a byte, the writer checks if the current line is full. In that case, a new_line call is used to wrap the line. Then it writes a new ScreenChar to the buffer at the current position. Finally, the current column position is advanced.

To print whole strings, we can convert them to bytes and print them one-by-one:

// in src/vga_buffer.rs

impl Writer {
    pub fn write_string(&mut self, s: &str) {
        for byte in s.bytes() {
            match byte {
                // printable ASCII byte or newline
                0x20..=0x7e | b'\n' => self.write_byte(byte),
                // not part of printable ASCII range
                _ => self.write_byte(0xfe),
            }

        }
    }
}

The VGA text buffer only supports ASCII and the additional bytes of code page 437. Rust strings are UTF-8 by default, so they might contain bytes that are not supported by the VGA text buffer. We use a match to differentiate printable ASCII bytes (a newline or anything in between a space character and a ~ character) and unprintable bytes. For unprintable bytes, we print a ■ character, which has the hex code 0xfe on the VGA hardware.
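The same byte filter can be tried on the host without any VGA hardware. The following sketch collects the bytes that write_string would pass on to write_byte. The function `to_vga_bytes` is a hypothetical helper and not part of our module.

```rust
/// Maps each byte of `s` to the byte that would be written to the screen:
/// printable ASCII and newlines are kept, everything else becomes 0xfe.
pub fn to_vga_bytes(s: &str) -> Vec<u8> {
    s.bytes()
        .map(|byte| match byte {
            // printable ASCII byte or newline
            0x20..=0x7e | b'\n' => byte,
            // not part of the printable ASCII range
            _ => 0xfe,
        })
        .collect()
}
```

Calling `to_vga_bytes("W\u{f6}rld!")` (the string "Wörld!") produces two 0xfe bytes in place of the ö.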

Try it out!

To write some characters to the screen, you can create a temporary function:

// in src/vga_buffer.rs

pub fn print_something() {
    let mut writer = Writer {
        column_position: 0,
        color_code: ColorCode::new(Color::Yellow, Color::Black),
        buffer: unsafe { &mut *(0xb8000 as *mut Buffer) },
    };

    writer.write_byte(b'H');
    writer.write_string("ello ");
    writer.write_string("Wörld!");
}

It first creates a new Writer that points to the VGA buffer at 0xb8000. The syntax for this might seem a bit strange: First, we cast the integer 0xb8000 as a mutable raw pointer. Then we convert it to a mutable reference by dereferencing it (through *) and immediately borrowing it again (through &mut). This conversion requires an unsafe block, since the compiler can’t guarantee that the raw pointer is valid.
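The pointer-to-reference conversion can also be tried on the host. Since the address 0xb8000 is not valid in a normal process, the following sketch points the raw pointer at a local array instead. The `Grid` type and the function `write_through_raw_pointer` are hypothetical.

```rust
/// A tiny stand-in for the Buffer type: 2 rows of 4 cells.
#[repr(transparent)]
struct Grid {
    cells: [[u16; 4]; 2],
}

/// Writes one cell through a reference that was created from a raw pointer
/// and returns the backing memory afterwards.
pub fn write_through_raw_pointer() -> [u16; 8] {
    let mut backing = [0u16; 8];
    // Cast the pointer to the backing memory into a raw pointer to our grid type.
    let raw = backing.as_mut_ptr() as *mut Grid;
    {
        // SAFETY: `raw` points to 8 valid, properly aligned u16 values, which is
        // exactly the size of `Grid`, and no other reference is used meanwhile.
        let grid: &mut Grid = unsafe { &mut *raw };
        grid.cells[1][2] = 0x0e48;
    }
    backing
}
```

The write to row 1, column 2 ends up at index 6 of the backing array, because the rows are stored one after another.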

Then it writes the byte b'H' to it. The b prefix creates a byte literal, which represents an ASCII character. By writing the strings "ello " and "Wörld!", we test our write_string method and the handling of unprintable characters. To see the output, we need to call the print_something function from our _start function:

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    vga_buffer::print_something();

    loop {}
}

When we run our project now, a Hello W■■rld! should be printed in the lower left corner of the screen in yellow.

Notice that the ö is printed as two ■ characters. That’s because ö is represented by two bytes in UTF-8, neither of which falls into the printable ASCII range. In fact, this is a fundamental property of UTF-8: the individual bytes of multi-byte values are never valid ASCII.
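This UTF-8 property is easy to check on the host. The sketch below encodes a single character and reports whether any of its bytes is in the ASCII range. The function `contains_ascii_byte` is a hypothetical helper.

```rust
/// Returns true if any byte of the UTF-8 encoding of `c` is an ASCII byte.
pub fn contains_ascii_byte(c: char) -> bool {
    // A char needs at most four bytes in UTF-8.
    let mut buffer = [0u8; 4];
    c.encode_utf8(&mut buffer).bytes().any(|byte| byte.is_ascii())
}
```

Only characters that are ASCII themselves return true. For multi-byte characters such as ö (`'\u{f6}'`), every byte has its highest bit set.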

Volatile

We just saw that our message was printed correctly. However, it might not work with future Rust compilers that optimize more aggressively.

The problem is that we only write to the Buffer and never read from it again. The compiler doesn’t know that we really access VGA buffer memory (instead of normal RAM) and knows nothing about the side effect that some characters appear on the screen. So it might decide that these writes are unnecessary and can be omitted. To avoid this erroneous optimization, we need to specify these writes as volatile. This tells the compiler that the write has side effects and should not be optimized away.

In order to use volatile writes for the VGA buffer, we use the volatile library. This crate (as packages are called in the Rust world) provides a Volatile wrapper type with read and write methods. These methods internally use the read_volatile and write_volatile functions of the core library and thus guarantee that the reads/writes are not optimized away.
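To give an idea of what such a wrapper does, here is a minimal sketch built on read_volatile and write_volatile. It is not the implementation of the volatile crate. The type name `VolatileCell` is hypothetical, and the sketch uses std::ptr so that it compiles on the host (in a kernel, core::ptr provides the same functions).

```rust
use std::ptr;

/// A hypothetical minimal volatile wrapper, for illustration only.
#[repr(transparent)]
pub struct VolatileCell<T: Copy>(T);

impl<T: Copy> VolatileCell<T> {
    pub fn new(value: T) -> Self {
        VolatileCell(value)
    }

    /// Performs a volatile read that the compiler must not optimize away.
    pub fn read(&self) -> T {
        // SAFETY: the pointer comes from a valid reference to `self.0`.
        unsafe { ptr::read_volatile(&self.0) }
    }

    /// Performs a volatile write that the compiler must not optimize away.
    pub fn write(&mut self, value: T) {
        // SAFETY: the pointer comes from a valid, exclusive reference to `self.0`.
        unsafe { ptr::write_volatile(&mut self.0, value) }
    }
}
```

Because the inner field is private, code outside the defining module can only access the value through the volatile read and write methods.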

We can add a dependency on the volatile crate by adding it to the dependencies section of our Cargo.toml:

# in Cargo.toml

[dependencies]
volatile = "0.2.6"

Make sure to specify volatile version 0.2.6. Newer versions of the crate are not compatible with this post. 0.2.6 is the semantic version number. For more information, see the Specifying Dependencies guide of the cargo documentation.

Let’s use it to make writes to the VGA buffer volatile. We update our Buffer type as follows:

// in src/vga_buffer.rs

use volatile::Volatile;

struct Buffer {
    chars: [[Volatile<ScreenChar>; BUFFER_WIDTH]; BUFFER_HEIGHT],
}

Instead of a ScreenChar, we’re now using a Volatile<ScreenChar>. (The Volatile type is generic and can wrap (almost) any type.) This ensures that we can’t accidentally write to it “normally”. Instead, we have to use the write method now.

This means that we have to update our Writer::write_byte method:

// in src/vga_buffer.rs

impl Writer {
    pub fn write_byte(&mut self, byte: u8) {
        match byte {
            b'\n' => self.new_line(),
            byte => {
                ...

                self.buffer.chars[row][col].write(ScreenChar {
                    ascii_character: byte,
                    color_code,
                });
                ...
            }
        }
    }
    ...
}

Instead of a typical assignment using =, we’re now using the write method. Now we can guarantee that the compiler will never optimize away this write.

Formatting Macros

It would be nice to support Rust’s formatting macros, too. That way, we can easily print different types, like integers or floats. To support them, we need to implement the core::fmt::Write trait. The only required method of this trait is write_str, which looks quite similar to our write_string method, just with a fmt::Result return type:

// in src/vga_buffer.rs

use core::fmt;

impl fmt::Write for Writer {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        self.write_string(s);
        Ok(())
    }
}

The Ok(()) is just an Ok Result containing the () type.

Now we can use Rust’s built-in write!/writeln! formatting macros:

// in src/vga_buffer.rs

pub fn print_something() {
    use core::fmt::Write;
    let mut writer = Writer {
        column_position: 0,
        color_code: ColorCode::new(Color::Yellow, Color::Black),
        buffer: unsafe { &mut *(0xb8000 as *mut Buffer) },
    };

    writer.write_byte(b'H');
    writer.write_string("ello! ");
    write!(writer, "The numbers are {} and {}", 42, 1.0/3.0).unwrap();
}

Now you should see a Hello! The numbers are 42 and 0.3333333333333333 at the bottom of the screen. The write! call returns a Result which causes a warning if not used, so we call the unwrap function on it, which panics if an error occurs. This isn’t a problem in our case, since writes to the VGA buffer never fail.
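The same mechanism works for any type that implements fmt::Write, which makes it easy to experiment on the host. In the following sketch, a hypothetical `ByteSink` type collects the formatted bytes in a Vec instead of writing them to the screen. The function `format_numbers` is hypothetical as well.

```rust
use std::fmt::{self, Write};

/// A hypothetical writer that collects all formatted bytes in memory.
pub struct ByteSink {
    pub bytes: Vec<u8>,
}

impl fmt::Write for ByteSink {
    // The only required method: everything the formatting machinery
    // produces arrives here as string slices.
    fn write_str(&mut self, s: &str) -> fmt::Result {
        self.bytes.extend_from_slice(s.as_bytes());
        Ok(())
    }
}

/// Formats two numbers through the `write!` macro and returns the result.
pub fn format_numbers() -> String {
    let mut sink = ByteSink { bytes: Vec::new() };
    write!(sink, "The numbers are {} and {}", 42, 1.5).unwrap();
    String::from_utf8(sink.bytes).unwrap()
}
```

The write! macro only needs the write_str method. The trait supplies write_fmt, which the macro calls, as a provided method.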

Newlines

Right now, we just ignore newlines and characters that don’t fit into the line anymore. Instead, we want to move every character one line up (the top line gets deleted) and start at the beginning of the last line again. To do this, we add an implementation for the new_line method of Writer:

// in src/vga_buffer.rs

impl Writer {
    fn new_line(&mut self) {
        for row in 1..BUFFER_HEIGHT {
            for col in 0..BUFFER_WIDTH {
                let character = self.buffer.chars[row][col].read();
                self.buffer.chars[row - 1][col].write(character);
            }
        }
        self.clear_row(BUFFER_HEIGHT - 1);
        self.column_position = 0;
    }

    fn clear_row(&mut self, row: usize) {/* TODO */}
}

We iterate over all the screen characters and move each character one row up. Note that the upper bound of the range notation (..) is exclusive. We also omit the 0th row (the first range starts at 1) because it’s the row that is shifted off screen.

To finish the newline code, we add the clear_row method:

// in src/vga_buffer.rs

impl Writer {
    fn clear_row(&mut self, row: usize) {
        let blank = ScreenChar {
            ascii_character: b' ',
            color_code: self.color_code,
        };
        for col in 0..BUFFER_WIDTH {
            self.buffer.chars[row][col].write(blank);
        }
    }
}

This method clears a row by overwriting all of its characters with a space character.
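The scrolling logic of new_line and clear_row can be tested on the host with a plain in-memory array. The sketch below is a simplified stand-in that works on u8 cells without colors or volatile accesses. The function `scroll_up` is hypothetical.

```rust
/// Moves every row one row up (dropping row 0) and fills the last row with `blank`.
pub fn scroll_up<const W: usize, const H: usize>(rows: &mut [[u8; W]; H], blank: u8) {
    // Row 0 is shifted off screen, so we start copying at row 1.
    for row in 1..H {
        rows[row - 1] = rows[row];
    }
    // Clear the last row, like clear_row does for the real buffer.
    if H > 0 {
        rows[H - 1] = [blank; W];
    }
}
```

Applied to a 3×3 screen containing the rows "abc", "def", and "ghi", the result is "def", "ghi", and a row of spaces.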

A Global Interface

To provide a global writer that can be used as an interface from other modules without carrying a Writer instance around, we try to create a static WRITER:

// in src/vga_buffer.rs

pub static WRITER: Writer = Writer {
    column_position: 0,
    color_code: ColorCode::new(Color::Yellow, Color::Black),
    buffer: unsafe { &mut *(0xb8000 as *mut Buffer) },
};

However, if we try to compile it now, the following errors occur:

error[E0015]: calls in statics are limited to constant functions, tuple structs and tuple variants
 --> src/vga_buffer.rs:7:17
  |
7 |     color_code: ColorCode::new(Color::Yellow, Color::Black),
  |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0396]: raw pointers cannot be dereferenced in statics
 --> src/vga_buffer.rs:8:22
  |
8 |     buffer: unsafe { &mut *(0xb8000 as *mut Buffer) },
  |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ dereference of raw pointer in constant

error[E0017]: references in statics may only refer to immutable values
 --> src/vga_buffer.rs:8:22
  |
8 |     buffer: unsafe { &mut *(0xb8000 as *mut Buffer) },
  |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ statics require immutable values

error[E0017]: references in statics may only refer to immutable values
 --> src/vga_buffer.rs:8:13
  |
8 |     buffer: unsafe { &mut *(0xb8000 as *mut Buffer) },
  |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ statics require immutable values

To understand what’s happening here, we need to know that statics are initialized at compile time, in contrast to normal variables that are initialized at run time. The component of the Rust compiler that evaluates such initialization expressions is called the “const evaluator”. Its functionality is still limited, but there is ongoing work to expand it, for example in the “Allow panicking in constants” RFC.

The issue with ColorCode::new would be solvable by using const functions, but the fundamental problem here is that Rust’s const evaluator is not able to convert raw pointers to references at compile time. Maybe it will work someday, but until then, we have to find another solution.

Lazy Statics

The one-time initialization of statics with non-const functions is a common problem in Rust. Fortunately, there already exists a good solution in a crate named lazy_static. This crate provides a lazy_static! macro that defines a lazily initialized static. Instead of computing its value at compile time, the static lazily initializes itself when accessed for the first time. Thus, the initialization happens at runtime, so arbitrarily complex initialization code is possible.
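The idea of lazy initialization can be illustrated on the host with std::sync::OnceLock from the standard library. This is only an analogue for experimentation: OnceLock requires std and is therefore not an option for our kernel. The names `GREETING`, `INIT_CALLS`, `greeting`, and `init_calls` are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the initialization code runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);
/// The lazily initialized static; it starts out empty.
static GREETING: OnceLock<String> = OnceLock::new();

/// Returns the value, running the initialization code on first access only.
pub fn greeting() -> &'static str {
    GREETING.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        // Arbitrary run-time code is allowed here, e.g. a heap allocation.
        String::from("initialized at run time")
    })
}

/// Returns how many times the initializer has run so far.
pub fn init_calls() -> usize {
    INIT_CALLS.load(Ordering::SeqCst)
}
```

No matter how often greeting is called, the initializer runs exactly once, on the first access.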

Let’s add the lazy_static crate to our project:

# in Cargo.toml

[dependencies.lazy_static]
version = "1.0"
features = ["spin_no_std"]

We need the spin_no_std feature, since we don’t link the standard library.

With lazy_static, we can define our static WRITER without problems:

// in src/vga_buffer.rs

use lazy_static::lazy_static;

lazy_static! {
    pub static ref WRITER: Writer = Writer {
        column_position: 0,
        color_code: ColorCode::new(Color::Yellow, Color::Black),
        buffer: unsafe { &mut *(0xb8000 as *mut Buffer) },
    };
}

However, this WRITER is pretty useless since it is immutable. This means that we can’t write anything to it (since all the write methods take &mut self). One possible solution would be to use a mutable static. But then every read and write to it would be unsafe since it could easily introduce data races and other bad things. Using static mut is highly discouraged. There were even proposals to remove it. But what are the alternatives? We could try to use an immutable static with a cell type like RefCell or even UnsafeCell that provides interior mutability. But these types aren’t Sync (with good reason), so we can’t use them in statics.

Spinlocks

To get synchronized interior mutability, users of the standard library can use Mutex. It provides mutual exclusion by blocking threads when the resource is already locked. But our basic kernel does not have any blocking support or even a concept of threads, so we can’t use it either. However, there is a really basic kind of mutex in computer science that requires no operating system features: the spinlock. Instead of blocking, the threads simply try to lock it again and again in a tight loop, thus burning CPU time until the mutex is free again.
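To illustrate the principle, here is a minimal spinlock sketch based on an atomic flag. It is not the implementation of the spin crate. The names `SpinLock` and `SpinGuard` are hypothetical, and the sketch uses std paths so that it runs on the host (the same items also exist in core).

```rust
use std::cell::UnsafeCell;
use std::hint;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, Ordering};

/// A hypothetical minimal spinlock, for illustration only.
pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// SAFETY: the atomic flag ensures that only one guard exists at a time.
unsafe impl<T: Send> Sync for SpinLock<T> {}

/// Grants access to the protected value and unlocks on drop.
pub struct SpinGuard<'a, T> {
    lock: &'a SpinLock<T>,
}

impl<T> SpinLock<T> {
    pub const fn new(value: T) -> Self {
        SpinLock { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    pub fn lock(&self) -> SpinGuard<'_, T> {
        // Try again and again in a tight loop until the flag flips from false to true.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            hint::spin_loop();
        }
        SpinGuard { lock: self }
    }
}

impl<T> Deref for SpinGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        // SAFETY: holding the guard means we own the lock.
        unsafe { &*self.lock.value.get() }
    }
}

impl<T> DerefMut for SpinGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        // SAFETY: holding the guard means we own the lock exclusively.
        unsafe { &mut *self.lock.value.get() }
    }
}

impl<T> Drop for SpinGuard<'_, T> {
    fn drop(&mut self) {
        // Release the lock so that other users can acquire it.
        self.lock.locked.store(false, Ordering::Release);
    }
}
```

Note that lock takes &self, not &mut self. This is what lets a lock of this kind provide interior mutability for an immutable static.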

To use a spinning mutex, we can add the spin crate as a dependency:

# in Cargo.toml
[dependencies]
spin = "0.5.2"

Then we can use the spinning mutex to add safe interior mutability to our static WRITER:

// in src/vga_buffer.rs

use spin::Mutex;
...
lazy_static! {
    pub static ref WRITER: Mutex<Writer> = Mutex::new(Writer {
        column_position: 0,
        color_code: ColorCode::new(Color::Yellow, Color::Black),
        buffer: unsafe { &mut *(0xb8000 as *mut Buffer) },
    });
}

Now we can delete the print_something function and print directly from our _start function:

// in src/main.rs
#[no_mangle]
pub extern "C" fn _start() -> ! {
    use core::fmt::Write;
    vga_buffer::WRITER.lock().write_str("Hello again").unwrap();
    write!(vga_buffer::WRITER.lock(), ", some numbers: {} {}", 42, 1.337).unwrap();

    loop {}
}

We need to import the fmt::Write trait in order to be able to use its functions.

Safety

Note that we only have a single unsafe block in our code, which is needed to create a Buffer reference pointing to 0xb8000. Afterwards, all operations are safe. Rust uses bounds checking for array accesses by default, so we can’t accidentally write outside the buffer. Thus, we encoded the required conditions in the type system and are able to provide a safe interface to the outside.

A println Macro

Now that we have a global writer, we can add a println macro that can be used from anywhere in the codebase. Rust’s macro syntax is a bit strange, so we won’t try to write a macro from scratch. Instead, we look at the source of the println! macro in the standard library:

#[macro_export]
macro_rules! println {
    () => (print!("\n"));
    ($($arg:tt)*) => (print!("{}\n", format_args!($($arg)*)));
}

Macros are defined through one or more rules, similar to match arms. The println macro has two rules: The first rule is for invocations without arguments, e.g., println!(), which is expanded to print!("\n") and thus just prints a newline. The second rule is for invocations with parameters such as println!("Hello") or println!("Number: {}", 4). It is also expanded to an invocation of the print! macro, passing all arguments and an additional newline \n at the end.
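The two-rule structure can be reproduced in a small host-side macro that returns a String instead of printing. This is a sketch for experimentation: the macro `format_line!` and the function `demo_lines` are hypothetical.

```rust
/// A hypothetical macro with the same two rules as println!,
/// but returning the resulting line as a String.
macro_rules! format_line {
    // Rule 1: no arguments, just a newline.
    () => {
        String::from("\n")
    };
    // Rule 2: forward all arguments and append a newline.
    ($($arg:tt)*) => {
        format!("{}\n", format_args!($($arg)*))
    };
}

/// Invokes the macro once per rule.
pub fn demo_lines() -> (String, String) {
    (format_line!(), format_line!("Number: {}", 4))
}
```

The first invocation matches the empty rule, and the second matches the `$($arg:tt)*` rule, which captures an arbitrary sequence of tokens.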

The #[macro_export] attribute makes the macro available to the whole crate (not just the module it is defined in) and external crates. It also places the macro at the crate root, which means we have to import the macro through use std::println instead of std::macros::println.

The print! macro is defined as:

#[macro_export]
macro_rules! print {
    ($($arg:tt)*) => ($crate::io::_print(format_args!($($arg)*)));
}

The macro expands to a call of the _print function in the io module. The $crate variable ensures that the macro also works from outside the std crate by expanding to std when it’s used in other crates.

The format_args macro builds a fmt::Arguments type from the passed arguments, which is passed to _print. The _print function of libstd calls print_to, which is rather complicated because it supports different Stdout devices. We don’t need that complexity since we just want to print to the VGA buffer.

To print to the VGA buffer, we just copy the println! and print! macros, but modify them to use our own _print function:

// in src/vga_buffer.rs

#[macro_export]
macro_rules! print {
    ($($arg:tt)*) => ($crate::vga_buffer::_print(format_args!($($arg)*)));
}

#[macro_export]
macro_rules! println {
    () => ($crate::print!("\n"));
    ($($arg:tt)*) => ($crate::print!("{}\n", format_args!($($arg)*)));
}

#[doc(hidden)]
pub fn _print(args: fmt::Arguments) {
    use core::fmt::Write;
    WRITER.lock().write_fmt(args).unwrap();
}

One thing that we changed from the original println definition is that we prefixed the invocations of the print! macro with $crate as well. This ensures that we don’t need to import the print! macro if we only want to use println.

Like in the standard library, we add the #[macro_export] attribute to both macros to make them available everywhere in our crate. Note that this places the macros in the root namespace of the crate, so importing them via use crate::vga_buffer::println does not work. Instead, we have to do use crate::println.

The _print function locks our static WRITER and calls the write_fmt method on it. This method is from the Write trait, which we need to import. The additional unwrap() at the end panics if printing isn’t successful. But since we always return Ok in write_str, that should not happen.

Since the macros need to be able to call _print from outside of the module, the function needs to be public. However, since we consider this a private implementation detail, we add the doc(hidden) attribute to hide it from the generated documentation.

Hello World using `println`

Now we can use println in our _start function:

// in src/main.rs

#[no_mangle]
pub extern "C" fn _start() -> ! {
    println!("Hello World{}", "!");

    loop {}
}

Note that we don’t have to import the macro in the main function, because it already lives in the root namespace.

As expected, we now see a “Hello World!” on the screen.

Printing Panic Messages

Now that we have a println macro, we can use it in our panic function to print the panic message and the location of the panic:

// in main.rs

/// This function is called on panic.
#[panic_handler]
fn panic(info: &PanicInfo) -> ! {
    println!("{}", info);
    loop {}
}

When we now insert panic!("Some panic message"); in our _start function, we get the following output:

So we know not only that a panic has occurred, but also the panic message and where in the code it happened.

Summary

In this post, we learned about the structure of the VGA text buffer and how it can be written through the memory mapping at address 0xb8000. We created a Rust module that encapsulates the unsafety of writing to this memory-mapped buffer and presents a safe and convenient interface to the outside.

Thanks to cargo, we also saw how easy it is to add dependencies on third-party libraries. Two of the dependencies that we added, lazy_static and spin, are very useful in OS development and we will use them in more places in future posts.

What’s next?

The next post explains how to set up Rust’s built-in unit test framework. We will then create some basic unit tests for the VGA buffer module from this post.

A Freestanding Rust Binary

Sat, 10 Feb 2018 00:00:00 +0000

The first step in creating our own operating system kernel is to create a Rust executable that does not link the standard library. This makes it possible to run Rust code on the bare metal without an underlying operating system.

Introduction

To write an operating system kernel, we need code that does not depend on any operating system features. This means that we can’t use threads, files, heap memory, the network, random numbers, standard output, or any other features requiring OS abstractions or specific hardware. This makes sense, since we’re trying to write our own OS and our own drivers.

This means that we can’t use most of the Rust standard library, but there are a lot of Rust features that we can use. For example, we can use iterators, closures, pattern matching, option and result, string formatting, and of course the ownership system. These features make it possible to write a kernel in a very expressive, high level way without worrying about undefined behavior or memory safety.

In order to create an OS kernel in Rust, we need to create an executable that can be run without an underlying operating system. Such an executable is often called a “freestanding” or “bare-metal” executable.

This post describes the necessary steps to create a freestanding Rust binary and explains why the steps are needed. If you’re just interested in a minimal example, you can jump to the summary.

Disabling the Standard Library

By default, all Rust crates link the standard library, which depends on the operating system for features such as threads, files, or networking. It also depends on the C standard library libc, which closely interacts with OS services. Since our plan is to write an operating system, we can’t use any OS-dependent libraries. So we have to disable the automatic inclusion of the standard library through the no_std attribute.

We start by creating a new cargo application project. The easiest way to do this is through the command line:

cargo new blog_os --bin --edition 2018

I named the project blog_os, but of course you can choose your own name. The --bin flag specifies that we want to create an executable binary (in contrast to a library) and the --edition 2018 flag specifies that we want to use the 2018 edition of Rust for our crate. When we run the command, cargo creates the following directory structure for us:

blog_os
├── Cargo.toml
└── src
    └── main.rs

The Cargo.toml contains the crate configuration, for example the crate name, the author, the semantic version number, and dependencies. The src/main.rs file contains the root module of our crate and our main function. You can compile your crate through cargo build and then run the compiled blog_os binary in the target/debug subfolder.

The `no_std` Attribute

Right now our crate implicitly links the standard library. Let’s try to disable this by adding the no_std attribute:

// main.rs

#![no_std]

fn main() {
    println!("Hello, world!");
}

When we try to build it now (by running cargo build), the following error occurs:

error: cannot find macro `println!` in this scope
 --> src/main.rs:4:5
  |
4 |     println!("Hello, world!");
  |     ^^^^^^^

The reason for this error is that the println macro is part of the standard library, which we no longer include. So we can no longer print things. This makes sense, since println writes to standard output, which is a special file descriptor provided by the operating system.

So let’s remove the printing and try again with an empty main function:

// main.rs

#![no_std]

fn main() {}

> cargo build
error: `#[panic_handler]` function required, but not found
error: language item required, but not found: `eh_personality`

Now the compiler is missing a #[panic_handler] function and a language item.

Panic Implementation

The panic_handler attribute defines the function that the compiler should invoke when a panic occurs. The standard library provides its own panic handler function, but in a no_std environment we need to define it ourselves:

// in main.rs

use core::panic::PanicInfo;

/// This function is called on panic.
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

The PanicInfo parameter contains the file and line where the panic happened and the optional panic message. The function should never return, so it is marked as a diverging function by returning the “never” type !. There is not much we can do in this function for now, so we just loop indefinitely.

The `eh_personality` Language Item

Language items are special functions and types that are required internally by the compiler. For example, the Copy trait is a language item that tells the compiler which types have copy semantics. When we look at the implementation, we see it has the special #[lang = "copy"] attribute that defines it as a language item.

While providing custom implementations of language items is possible, it should only be done as a last resort. The reason is that language items are highly unstable implementation details and not even type checked (so the compiler doesn’t even check if a function has the right argument types). Fortunately, there is a more stable way to fix the above language item error.

The eh_personality language item marks a function that is used for implementing stack unwinding. By default, Rust uses unwinding to run the destructors of all live stack variables in case of a panic. This ensures that all used memory is freed and allows the parent thread to catch the panic and continue execution. Unwinding, however, is a complicated process and requires some OS-specific libraries (e.g. libunwind on Linux or structured exception handling on Windows), so we don’t want to use it for our operating system.

Disabling Unwinding

There are other use cases as well for which unwinding is undesirable, so Rust provides an option to abort on panic instead. This disables the generation of unwinding symbol information and thus considerably reduces binary size. There are multiple places where we can disable unwinding. The easiest way is to add the following lines to our Cargo.toml:

[profile.dev]
panic = "abort"

[profile.release]
panic = "abort"

This sets the panic strategy to abort for both the dev profile (used for cargo build) and the release profile (used for cargo build --release). Now the eh_personality language item should no longer be required.

We have now fixed both of the above errors. However, if we try to compile the crate, another error occurs:

> cargo build
error: requires `start` lang_item

Our program is missing the start language item, which defines the entry point.

The `start` attribute

One might think that the main function is the first function called when you run a program. However, most languages have a runtime system, which is responsible for things such as garbage collection (e.g. in Java) or software threads (e.g. goroutines in Go). This runtime needs to be called before main, since it needs to initialize itself.

In a typical Rust binary that links the standard library, execution starts in a C runtime library called crt0 (“C runtime zero”), which sets up the environment for a C application. This includes creating a stack and placing the arguments in the right registers. The C runtime then invokes the entry point of the Rust runtime, which is marked by the start language item. Rust only has a very minimal runtime, which takes care of some small things such as setting up stack overflow guards or printing a backtrace on panic. The runtime then finally calls the main function.

Our freestanding executable does not have access to the Rust runtime and crt0, so we need to define our own entry point. Implementing the start language item wouldn’t help, since it would still require crt0. Instead, we need to overwrite the crt0 entry point directly.

Overwriting the Entry Point

To tell the Rust compiler that we don’t want to use the normal entry point chain, we add the #![no_main] attribute.

#![no_std]
#![no_main]

use core::panic::PanicInfo;

/// This function is called on panic.
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

You might notice that we removed the main function. The reason is that a main doesn’t make sense without an underlying runtime that calls it. Instead, we are now overwriting the operating system entry point with our own _start function:

#[no_mangle]
pub extern "C" fn _start() -> ! {
    loop {}
}

By using the #[no_mangle] attribute, we disable name mangling to ensure that the Rust compiler really outputs a function with the name _start. Without the attribute, the compiler would generate some cryptic _ZN3blog_os4_start7hb173fedf945531caE symbol to give every function a unique name. The attribute is required because we need to tell the name of the entry point function to the linker in the next step.

We also have to mark the function as extern "C" to tell the compiler that it should use the C calling convention for this function (instead of the unspecified Rust calling convention). The reason for naming the function _start is that this is the default entry point name for most systems.

The ! return type means that the function is diverging, i.e. not allowed to ever return. This is required because the entry point is not called by any function, but invoked directly by the operating system or bootloader. So instead of returning, the entry point should e.g. invoke the exit system call of the operating system. In our case, shutting down the machine could be a reasonable action, since there’s nothing left to do if a freestanding binary returns. For now, we fulfill the requirement by looping endlessly.
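To make the ! return type more concrete, here is a small hosted sketch. It uses the standard library, so it is not part of our freestanding binary, and the names halt_with and parse_or_halt are made up for illustration. It shows that a diverging function never produces a value, which is why the compiler accepts a call to it in a position where a u32 is expected:

```rust
/// Hypothetical helper: it never returns, so its return type is `!`.
fn halt_with(msg: &str) -> ! {
    // `panic!` itself diverges, which satisfies the `!` return type.
    panic!("{}", msg);
}

/// Because `halt_with` diverges, the `Err` arm does not need to
/// produce a `u32` value.
fn parse_or_halt(input: &str) -> u32 {
    match input.parse::<u32>() {
        Ok(value) => value,
        Err(_) => halt_with("not a number"),
    }
}

fn main() {
    println!("{}", parse_or_halt("42"));
}
```

Our _start function uses an endless loop instead of a panic, but the idea is the same: control never flows back to the caller.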

When we run cargo build now, we get an ugly linker error.

Linker Errors

The linker is a program that combines the generated code into an executable. Since the executable format differs between Linux, Windows, and macOS, each system has its own linker that throws a different error. The fundamental cause of the errors is the same: the default configuration of the linker assumes that our program depends on the C runtime, which it does not.

To solve the errors, we need to tell the linker that it should not include the C runtime. We can do this either by passing a certain set of arguments to the linker or by building for a bare metal target.

Building for a Bare Metal Target

By default, Rust tries to build an executable that is able to run in your current system environment. For example, if you’re using Windows on x86_64, Rust tries to build an .exe Windows executable that uses x86_64 instructions. This environment is called your “host” system.

To describe different environments, Rust uses a string called a target triple. You can see the target triple for your host system by running rustc --version --verbose:

rustc 1.35.0-nightly (474e7a648 2019-04-07)
binary: rustc
commit-hash: 474e7a6486758ea6fc761893b1a49cd9076fb0ab
commit-date: 2019-04-07
host: x86_64-unknown-linux-gnu
release: 1.35.0-nightly
LLVM version: 8.0

The above output is from an x86_64 Linux system. We see that the host triple is x86_64-unknown-linux-gnu, which includes the CPU architecture (x86_64), the vendor (unknown), the operating system (linux), and the ABI (gnu).
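The four fields can be pulled apart mechanically. The following hosted sketch is only an illustration: the TargetTriple type and the parse_triple function are hypothetical names, and the sketch assumes the four-field form shown above. Some real triples leave out a field, and for those this simple version returns None:

```rust
/// The four parts of a target triple such as `x86_64-unknown-linux-gnu`.
#[derive(Debug, PartialEq)]
struct TargetTriple<'a> {
    arch: &'a str,
    vendor: &'a str,
    os: &'a str,
    abi: &'a str,
}

/// Hypothetical helper: splits a triple into its four fields.
/// Returns `None` unless the string has exactly four `-`-separated parts.
fn parse_triple(triple: &str) -> Option<TargetTriple<'_>> {
    let mut parts = triple.split('-');
    let parsed = TargetTriple {
        arch: parts.next()?,
        vendor: parts.next()?,
        os: parts.next()?,
        abi: parts.next()?,
    };
    // Reject anything with additional trailing fields.
    if parts.next().is_some() {
        return None;
    }
    Some(parsed)
}

fn main() {
    println!("{:?}", parse_triple("x86_64-unknown-linux-gnu"));
}
```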

By compiling for our host triple, the Rust compiler and the linker assume that there is an underlying operating system such as Linux or Windows that uses the C runtime by default, which causes the linker errors. So, to avoid the linker errors, we can compile for a different environment with no underlying operating system.

An example of such a bare metal environment is the thumbv7em-none-eabihf target triple, which describes an embedded ARM system. The details are not important; all that matters is that the target triple has no underlying operating system, which is indicated by the none in the target triple. To be able to compile for this target, we need to add it in rustup:

rustup target add thumbv7em-none-eabihf

This downloads a copy of the standard (and core) library for the system. Now we can build our freestanding executable for this target:

cargo build --target thumbv7em-none-eabihf

By passing a --target argument we cross compile our executable for a bare metal target system. Since the target system has no operating system, the linker does not try to link the C runtime and our build succeeds without any linker errors.

This is the approach that we will use for building our OS kernel. Instead of thumbv7em-none-eabihf, we will use a custom target that describes an x86_64 bare metal environment. The details will be explained in the next post.

Linker Arguments

Instead of compiling for a bare metal system, it is also possible to resolve the linker errors by passing a certain set of arguments to the linker. This isn’t the approach that we will use for our kernel, so this section is optional and only provided for completeness.

In this section we discuss the linker errors that occur on Linux, Windows, and macOS, and explain how to solve them by passing additional arguments to the linker. Note that the executable format and the linker differ between operating systems, so that a different set of arguments is required for each system.

Linux

On Linux the following linker error occurs (shortened):

error: linking with `cc` failed: exit code: 1
  |
  = note: "cc" […]
  = note: /usr/lib/gcc/../x86_64-linux-gnu/Scrt1.o: In function `_start':
          (.text+0x12): undefined reference to `__libc_csu_fini'
          /usr/lib/gcc/../x86_64-linux-gnu/Scrt1.o: In function `_start':
          (.text+0x19): undefined reference to `__libc_csu_init'
          /usr/lib/gcc/../x86_64-linux-gnu/Scrt1.o: In function `_start':
          (.text+0x25): undefined reference to `__libc_start_main'
          collect2: error: ld returned 1 exit status

The problem is that the linker includes the startup routine of the C runtime by default, which is also called _start. It requires some symbols of the C standard library libc that we don’t include due to the no_std attribute, therefore the linker can’t resolve these references. To solve this, we can tell the linker that it should not link the C startup routine by passing the -nostartfiles flag.

One way to pass linker attributes via cargo is the cargo rustc command. The command behaves exactly like cargo build, but allows passing options to rustc, the underlying Rust compiler. rustc has the -C link-arg flag, which passes an argument to the linker. Combined, our new build command looks like this:

cargo rustc -- -C link-arg=-nostartfiles

Now our crate builds as a freestanding executable on Linux!

We didn’t need to specify the name of our entry point function explicitly since the linker looks for a function with the name _start by default.

Windows

On Windows, a different linker error occurs (shortened):

error: linking with `link.exe` failed: exit code: 1561
  |
  = note: "C:\\Program Files (x86)\\…\\link.exe" […]
  = note: LINK : fatal error LNK1561: entry point must be defined

The “entry point must be defined” error means that the linker can’t find the entry point. On Windows, the default entry point name depends on the used subsystem. For the CONSOLE subsystem, the linker looks for a function named mainCRTStartup and for the WINDOWS subsystem, it looks for a function named WinMainCRTStartup. To override the default and tell the linker to look for our _start function instead, we can pass an /ENTRY argument to the linker:

cargo rustc -- -C link-arg=/ENTRY:_start

From the different argument format we clearly see that the Windows linker is a completely different program than the Linux linker.

Now a different linker error occurs:

error: linking with `link.exe` failed: exit code: 1221
  |
  = note: "C:\\Program Files (x86)\\…\\link.exe" […]
  = note: LINK : fatal error LNK1221: a subsystem can't be inferred and must be
          defined

This error occurs because Windows executables can use different subsystems. For normal programs, they are inferred depending on the entry point name: If the entry point is named main, the CONSOLE subsystem is used, and if the entry point is named WinMain, the WINDOWS subsystem is used. Since our _start function has a different name, we need to specify the subsystem explicitly:

cargo rustc -- -C link-args="/ENTRY:_start /SUBSYSTEM:console"

We use the CONSOLE subsystem here, but the WINDOWS subsystem would work too. Instead of passing -C link-arg multiple times, we use -C link-args which takes a space separated list of arguments.

With this command, our executable should build successfully on Windows.

macOS

On macOS, the following linker error occurs (shortened):

error: linking with `cc` failed: exit code: 1
  |
  = note: "cc" […]
  = note: ld: entry point (_main) undefined. for architecture x86_64
          clang: error: linker command failed with exit code 1 […]

This error message tells us that the linker can’t find an entry point function with the default name main (for some reason, all functions are prefixed with a _ on macOS). To set the entry point to our _start function, we pass the -e linker argument:

cargo rustc -- -C link-args="-e __start"

The -e flag specifies the name of the entry point function. Since all functions have an additional _ prefix on macOS, we need to set the entry point to __start instead of _start.

Now the following linker error occurs:

error: linking with `cc` failed: exit code: 1
  |
  = note: "cc" […]
  = note: ld: dynamic main executables must link with libSystem.dylib
          for architecture x86_64
          clang: error: linker command failed with exit code 1 […]

macOS does not officially support statically linked binaries and requires programs to link the libSystem library by default. To override this and link a static binary, we pass the -static flag to the linker:

cargo rustc -- -C link-args="-e __start -static"

This still does not suffice, as a third linker error occurs:

error: linking with `cc` failed: exit code: 1
  |
  = note: "cc" […]
  = note: ld: library not found for -lcrt0.o
          clang: error: linker command failed with exit code 1 […]

This error occurs because programs on macOS link to crt0 (“C runtime zero”) by default. This is similar to the error we had on Linux and can also be solved by adding the -nostartfiles linker argument:

cargo rustc -- -C link-args="-e __start -static -nostartfiles"

Now our program should build successfully on macOS.

Unifying the Build Commands

Right now we have different build commands depending on the host platform, which is not ideal. To avoid this, we can create a file named .cargo/config.toml that contains the platform-specific arguments:

# in .cargo/config.toml

[target.'cfg(target_os = "linux")']
rustflags = ["-C", "link-arg=-nostartfiles"]

[target.'cfg(target_os = "windows")']
rustflags = ["-C", "link-args=/ENTRY:_start /SUBSYSTEM:console"]

[target.'cfg(target_os = "macos")']
rustflags = ["-C", "link-args=-e __start -static -nostartfiles"]

The rustflags key contains arguments that are automatically added to every invocation of rustc. For more information on the .cargo/config.toml file, check out the official documentation.
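The cfg(target_os = …) keys pick one section per platform. As a rough illustration of that selection, the lookup below spells out the same three cases in ordinary Rust. The function link_args_for and the constant names are hypothetical, and the arguments are listed one per element here rather than as the space-separated link-args string used in the config file:

```rust
// Linker arguments from the three config sections, one element per argument.
const LINUX_ARGS: &[&str] = &["-nostartfiles"];
const WINDOWS_ARGS: &[&str] = &["/ENTRY:_start", "/SUBSYSTEM:console"];
const MACOS_ARGS: &[&str] = &["-e", "__start", "-static", "-nostartfiles"];

/// Hypothetical helper: returns the linker arguments for a given
/// `target_os` value, or `None` for platforms not covered in this post.
fn link_args_for(target_os: &str) -> Option<&'static [&'static str]> {
    match target_os {
        "linux" => Some(LINUX_ARGS),
        "windows" => Some(WINDOWS_ARGS),
        "macos" => Some(MACOS_ARGS),
        _ => None,
    }
}

fn main() {
    // `std::env::consts::OS` names the OS this program was compiled for.
    let os = std::env::consts::OS;
    println!("{}: {:?}", os, link_args_for(os));
}
```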

Now our program should be buildable on all three platforms with a simple cargo build.

Should You Do This?

While it’s possible to build a freestanding executable for Linux, Windows, and macOS, it’s probably not a good idea. The reason is that our executable still expects various things, for example that a stack is initialized when the _start function is called. Without the C runtime, some of these requirements might not be fulfilled, which might cause our program to fail, e.g. through a segmentation fault.

If you want to create a minimal binary that runs on top of an existing operating system, including libc and setting the #[start] attribute as described here is probably a better idea.

Summary

A minimal freestanding Rust binary looks like this:

src/main.rs:

#![no_std] // don't link the Rust standard library
#![no_main] // disable all Rust-level entry points

use core::panic::PanicInfo;

#[no_mangle] // don't mangle the name of this function
pub extern "C" fn _start() -> ! {
    // this function is the entry point, since the linker looks for a function
    // named `_start` by default
    loop {}
}

/// This function is called on panic.
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

Cargo.toml:

[package]
name = "crate_name"
version = "0.1.0"
authors = ["Author Name <[email protected]>"]

# the profile used for `cargo build`
[profile.dev]
panic = "abort" # disable stack unwinding on panic

# the profile used for `cargo build --release`
[profile.release]
panic = "abort" # disable stack unwinding on panic

To build this binary, we need to compile for a bare metal target such as thumbv7em-none-eabihf:

cargo build --target thumbv7em-none-eabihf

Alternatively, we can compile it for the host system by passing additional linker arguments:

# Linux
cargo rustc -- -C link-arg=-nostartfiles
# Windows
cargo rustc -- -C link-args="/ENTRY:_start /SUBSYSTEM:console"
# macOS
cargo rustc -- -C link-args="-e __start -static -nostartfiles"

Note that this is just a minimal example of a freestanding Rust binary. This binary expects various things, for example, that a stack is initialized when the _start function is called. So for any real use of such a binary, more steps are required.

What’s next?

The next post explains the steps needed for turning our freestanding binary into a minimal operating system kernel. This includes creating a custom target, combining our executable with a bootloader, and learning how to print something to the screen.

A Minimal Rust Kernel

Sat, 10 Feb 2018 00:00:00 +0000

In this post, we create a minimal 64-bit Rust kernel for the x86 architecture. We build upon the freestanding Rust binary from the previous post to create a bootable disk image that prints something to the screen.

The Boot Process

When you turn on a computer, it begins executing firmware code that is stored in motherboard ROM. This code performs a power-on self-test, detects available RAM, and pre-initializes the CPU and hardware. Afterwards, it looks for a bootable disk and starts booting the operating system kernel.

On x86, there are two firmware standards: the “Basic Input/Output System” (BIOS) and the newer “Unified Extensible Firmware Interface” (UEFI). The BIOS standard is old and outdated, but simple and well-supported on any x86 machine since the 1980s. UEFI, in contrast, is more modern and has many more features, but is more complex to set up (at least in my opinion).

Currently, we only provide BIOS support, but support for UEFI is planned, too. If you’d like to help us with this, check out the GitHub issue.

BIOS Boot

Almost all x86 systems have support for BIOS booting, including newer UEFI-based machines that use an emulated BIOS. This is great, because you can use the same boot logic across all machines from the last century. But this wide compatibility is at the same time the biggest disadvantage of BIOS booting, because it means that the CPU is put into a 16-bit compatibility mode called real mode before booting so that archaic bootloaders from the 1980s would still work.

But let’s start from the beginning:

When you turn on a computer, it loads the BIOS from some special flash memory located on the motherboard. The BIOS runs self-test and initialization routines of the hardware, then it looks for bootable disks. If it finds one, control is transferred to its bootloader, which is a 512-byte portion of executable code stored at the disk’s beginning. Most bootloaders are larger than 512 bytes, so bootloaders are commonly split into a small first stage, which fits into 512 bytes, and a second stage, which is subsequently loaded by the first stage.

The bootloader has to determine the location of the kernel image on the disk and load it into memory. It also needs to switch the CPU from the 16-bit real mode first to the 32-bit protected mode, and then to the 64-bit long mode, where 64-bit registers and the complete main memory are available. Its third job is to query certain information (such as a memory map) from the BIOS and pass it to the OS kernel.

Writing a bootloader is a bit cumbersome as it requires assembly language and a lot of non-insightful steps like “write this magic value to this processor register”. Therefore, we don’t cover bootloader creation in this post and instead provide a tool named bootimage that automatically prepends a bootloader to your kernel.

If you are interested in building your own bootloader: Stay tuned, a set of posts on this topic is already planned!

The Multiboot Standard

To avoid every operating system implementing its own bootloader that is only compatible with a single OS, the Free Software Foundation created an open bootloader standard called Multiboot in 1995. The standard defines an interface between the bootloader and the operating system, so that any Multiboot-compliant bootloader can load any Multiboot-compliant operating system. The reference implementation is GNU GRUB, which is the most popular bootloader for Linux systems.

To make a kernel Multiboot compliant, one just needs to insert a so-called Multiboot header at the beginning of the kernel file. This makes it very easy to boot an OS from GRUB. However, GRUB and the Multiboot standard have some problems too:

They support only the 32-bit protected mode. This means that you still have to do the CPU configuration to switch to the 64-bit long mode.
They are designed to make the bootloader simple instead of the kernel. For example, the kernel needs to be linked with an adjusted default page size, because GRUB can’t find the Multiboot header otherwise. Another example is that the boot information, which is passed to the kernel, contains lots of architecture-dependent structures instead of providing clean abstractions.
Both GRUB and the Multiboot standard are only sparsely documented.
GRUB needs to be installed on the host system to create a bootable disk image from the kernel file. This makes development on Windows or Mac more difficult.

Because of these drawbacks, we decided to not use GRUB or the Multiboot standard. However, we plan to add Multiboot support to our bootimage tool, so that it’s possible to load your kernel on a GRUB system too. If you’re interested in writing a Multiboot compliant kernel, check out the first edition of this blog series.

UEFI

(We don’t provide UEFI support at the moment, but we would love to! If you’d like to help, please tell us in the GitHub issue.)

A Minimal Kernel

Now that we roughly know how a computer boots, it’s time to create our own minimal kernel. Our goal is to create a disk image that prints a “Hello World!” to the screen when booted. We do this by extending the previous post’s freestanding Rust binary.

As you may remember, we built the freestanding binary through cargo, but depending on the operating system, we needed different entry point names and compile flags. That’s because cargo builds for the host system by default, i.e., the system you’re running on. This isn’t something we want for our kernel, because a kernel that runs on top of, e.g., Windows, does not make much sense. Instead, we want to compile for a clearly defined target system.

Installing Rust Nightly

Rust has three release channels: stable, beta, and nightly. The Rust Book explains the difference between these channels really well, so take a minute and check it out. For building an operating system, we will need some experimental features that are only available on the nightly channel, so we need to install a nightly version of Rust.

To manage Rust installations, I highly recommend rustup. It allows you to install nightly, beta, and stable compilers side-by-side and makes it easy to update them. With rustup, you can use a nightly compiler for the current directory by running rustup override set nightly. Alternatively, you can add a file called rust-toolchain with the content nightly to the project’s root directory. You can check that you have a nightly version installed by running rustc --version: The version number should contain -nightly at the end.

The nightly compiler allows us to opt-in to various experimental features by using so-called feature flags at the top of our file. For example, we could enable the experimental asm! macro for inline assembly by adding #![feature(asm)] to the top of our main.rs. Note that such experimental features are completely unstable, which means that future Rust versions might change or remove them without prior warning. For this reason, we will only use them if absolutely necessary.

Target Specification

Cargo supports different target systems through the --target parameter. The target is described by a so-called target triple, which describes the CPU architecture, the vendor, the operating system, and the ABI. For example, the x86_64-unknown-linux-gnu target triple describes a system with an x86_64 CPU, no clear vendor, and a Linux operating system with the GNU ABI. Rust supports many different target triples, including arm-linux-androideabi for Android or wasm32-unknown-unknown for WebAssembly.

For our target system, however, we require some special configuration parameters (e.g. no underlying OS), so none of the existing target triples fits. Fortunately, Rust allows us to define our own target through a JSON file. For example, a JSON file that describes the x86_64-unknown-linux-gnu target looks like this:

{
    "llvm-target": "x86_64-unknown-linux-gnu",
    "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
    "arch": "x86_64",
    "target-endian": "little",
    "target-pointer-width": "64",
    "target-c-int-width": "32",
    "os": "linux",
    "executables": true,
    "linker-flavor": "gcc",
    "pre-link-args": ["-m64"],
    "morestack": false
}

Most fields are required by LLVM to generate code for that platform. For example, the data-layout field defines the size of various integer, floating point, and pointer types. Then there are fields that Rust uses for conditional compilation, such as target-pointer-width. The third kind of field defines how the crate should be built. For example, the pre-link-args field specifies arguments passed to the linker.
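As a small illustration of the conditional-compilation part, the hosted sketch below reads the target_pointer_width and target_endian configuration values through the cfg! macro. The function names are made up for this example; the cfg keys themselves are standard Rust:

```rust
/// Returns the pointer width in bits that this code was compiled for.
/// Rust currently knows the widths 16, 32, and 64.
fn pointer_width_bits() -> u32 {
    if cfg!(target_pointer_width = "64") {
        64
    } else if cfg!(target_pointer_width = "32") {
        32
    } else {
        16
    }
}

/// Returns true if the compilation target is little-endian.
fn is_little_endian() -> bool {
    cfg!(target_endian = "little")
}

fn main() {
    println!(
        "pointer width: {} bits, little endian: {}",
        pointer_width_bits(),
        is_little_endian()
    );
}
```

For a target file with "target-pointer-width": "64" and "target-endian": "little", the first function would take the 64 branch and the second would return true.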

We also target x86_64 systems with our kernel, so our target specification will look very similar to the one above. Letâ€™s start by creating an x86_64-blog_os.json file (choose any name you like) with the common content:

{
    "llvm-target": "x86_64-unknown-none",
    "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
    "arch": "x86_64",
    "target-endian": "little",
    "target-pointer-width": "64",
    "target-c-int-width": "32",
    "os": "none",
    "executables": true
}

Note that we changed the OS in the llvm-target and the os field to none, because we will run on bare metal.

We add the following build-related entries:

"linker-flavor": "ld.lld",
"linker": "rust-lld",

Instead of using the platform’s default linker (which might not support Linux targets), we use the cross-platform LLD linker that is shipped with Rust for linking our kernel.

"panic-strategy": "abort",

This setting specifies that the target doesn’t support stack unwinding on panic, so instead the program should abort directly. This has the same effect as the panic = "abort" option in our Cargo.toml, so we can remove it from there. (Note that, in contrast to the Cargo.toml option, this target option also applies when we recompile the core library later in this post. So, even if you prefer to keep the Cargo.toml option, make sure to include this option.)

"disable-redzone": true,

We’re writing a kernel, so we’ll need to handle interrupts at some point. To do that safely, we have to disable a certain stack pointer optimization called the “red zone”, because it would cause stack corruption otherwise. For more information, see our separate post about disabling the red zone.

"features": "-mmx,-sse,+soft-float",

The features field enables/disables target features. We disable the mmx and sse features by prefixing them with a minus and enable the soft-float feature by prefixing it with a plus. Note that there must be no spaces between different flags, otherwise LLVM fails to interpret the features string.
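The format of the features string can be illustrated with a few lines of hosted Rust. This is only a sketch of the plus/minus convention described above, not how LLVM actually parses the string, and split_features is a hypothetical name. It rejects entries that contain whitespace to mirror the no-spaces rule:

```rust
/// Hypothetical helper: splits a features string into (enabled, disabled)
/// feature names. Returns `None` if an entry contains whitespace or does
/// not start with `+` or `-`.
fn split_features(features: &str) -> Option<(Vec<&str>, Vec<&str>)> {
    let mut enabled = Vec::new();
    let mut disabled = Vec::new();
    for entry in features.split(',') {
        if entry.chars().any(char::is_whitespace) {
            return None;
        }
        if let Some(name) = entry.strip_prefix('+') {
            enabled.push(name);
        } else if let Some(name) = entry.strip_prefix('-') {
            disabled.push(name);
        } else {
            return None;
        }
    }
    Some((enabled, disabled))
}

fn main() {
    println!("{:?}", split_features("-mmx,-sse,+soft-float"));
}
```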

The mmx and sse features determine support for Single Instruction Multiple Data (SIMD) instructions, which can often speed up programs significantly. However, using the large SIMD registers in OS kernels leads to performance problems. The reason is that the kernel needs to restore all registers to their original state before continuing an interrupted program. This means that the kernel has to save the complete SIMD state to main memory on each system call or hardware interrupt. Since the SIMD state is very large (512–1600 bytes) and interrupts can occur very often, these additional save/restore operations considerably harm performance. To avoid this, we disable SIMD for our kernel (not for applications running on top!).

A problem with disabling SIMD is that floating point operations on x86_64 require SIMD registers by default. To solve this problem, we add the soft-float feature, which emulates all floating point operations through software functions based on normal integers.

For more information, see our post on disabling SIMD.

Putting it Together

Our target specification file now looks like this:

{
    "llvm-target": "x86_64-unknown-none",
    "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
    "arch": "x86_64",
    "target-endian": "little",
    "target-pointer-width": "64",
    "target-c-int-width": "32",
    "os": "none",
    "executables": true,
    "linker-flavor": "ld.lld",
    "linker": "rust-lld",
    "panic-strategy": "abort",
    "disable-redzone": true,
    "features": "-mmx,-sse,+soft-float"
}

Building our Kernel

Compiling for our new target will use Linux conventions, since the ld.lld linker-flavor instructs LLVM to compile with the -flavor gnu flag (for more linker options, see the rustc documentation). This means that we need an entry point named _start as described in the previous post:

// src/main.rs

#![no_std] // don't link the Rust standard library
#![no_main] // disable all Rust-level entry points

use core::panic::PanicInfo;

/// This function is called on panic.
#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

#[no_mangle] // don't mangle the name of this function
pub extern "C" fn _start() -> ! {
    // this function is the entry point, since the linker looks for a function
    // named `_start` by default
    loop {}
}

Note that the entry point needs to be called _start regardless of your host OS.

We can now build the kernel for our new target by passing the name of the JSON file as --target:

> cargo build --target x86_64-blog_os.json

error[E0463]: can't find crate for `core`

It fails! The error tells us that the Rust compiler no longer finds the core library. This library contains basic Rust types such as Result, Option, and iterators, and is implicitly linked to all no_std crates.

The problem is that the core library is distributed together with the Rust compiler as a precompiled library. So it is only valid for supported host triples (e.g., x86_64-unknown-linux-gnu) but not for our custom target. If we want to compile code for other targets, we need to recompile core for these targets first.

The `build-std` Option

That’s where the build-std feature of cargo comes in. It allows us to recompile core and other standard library crates on demand, instead of using the precompiled versions shipped with the Rust installation. This feature is very new and still not finished, so it is marked as “unstable” and only available on nightly Rust compilers.

To use the feature, we need to create a local cargo configuration file at .cargo/config.toml (the .cargo folder should be next to your src folder) with the following content:

# in .cargo/config.toml

[unstable]
build-std = ["core", "compiler_builtins"]

This tells cargo that it should recompile the core and compiler_builtins libraries. The latter is required because it is a dependency of core. In order to recompile these libraries, cargo needs access to the Rust source code, which we can install with rustup component add rust-src.

Note: The unstable.build-std configuration key requires at least the Rust nightly from 2020-07-15.

After setting the unstable.build-std configuration key and installing the rust-src component, we can rerun our build command:

> cargo build --target x86_64-blog_os.json
   Compiling core v0.0.0 (/…/rust/src/libcore)
   Compiling rustc-std-workspace-core v1.99.0 (/…/rust/src/tools/rustc-std-workspace-core)
   Compiling compiler_builtins v0.1.32
   Compiling blog_os v0.1.0 (/…/blog_os)
    Finished dev [unoptimized + debuginfo] target(s) in 0.29 secs

We see that cargo build now recompiles the core, rustc-std-workspace-core (a dependency of compiler_builtins), and compiler_builtins libraries for our custom target.

The Rust compiler assumes that a certain set of built-in functions is available for all systems. Most of these functions are provided by the compiler_builtins crate that we just recompiled. However, there are some memory-related functions in that crate that are not enabled by default because they are normally provided by the C library on the system. These functions include memset, which sets all bytes in a memory block to a given value, memcpy, which copies one memory block to another, and memcmp, which compares two memory blocks. While we didn’t need any of these functions to compile our kernel right now, they will be required as soon as we add some more code to it (e.g. when copying structs around).
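To make the three operations concrete, here is a hosted sketch that expresses them with safe slice methods from the standard library. The wrapper names set_all, copy_block, and compare_blocks are made up for this example, and the sketch is not the compiler_builtins implementation. It only shows the kind of everyday code for which the compiler may end up emitting calls to memset, memcpy, or memcmp:

```rust
use std::cmp::Ordering;

/// Like memset: sets every byte in `block` to `value`.
fn set_all(block: &mut [u8], value: u8) {
    block.fill(value);
}

/// Like memcpy: copies `src` into `dest`.
/// `copy_from_slice` panics if the two lengths differ.
fn copy_block(dest: &mut [u8], src: &[u8]) {
    dest.copy_from_slice(src);
}

/// Like memcmp: compares two blocks byte by byte.
fn compare_blocks(a: &[u8], b: &[u8]) -> Ordering {
    a.cmp(b)
}

fn main() {
    let mut block = [0u8; 4];
    set_all(&mut block, 0xff);

    let mut copy = [0u8; 4];
    copy_block(&mut copy, &block);

    println!("{:?} {:?}", copy, compare_blocks(&block, &copy));
}
```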

Since we can’t link to the C library of the operating system, we need an alternative way to provide these functions to the compiler. One possible approach for this could be to implement our own memset etc. functions and apply the #[no_mangle] attribute to them (to avoid the automatic renaming during compilation). However, this is dangerous since the slightest mistake in the implementation of these functions could lead to undefined behavior. For example, implementing memcpy with a for loop may result in an infinite recursion because for loops implicitly call the IntoIterator::into_iter trait method, which may call memcpy again. So it’s a good idea to reuse existing, well-tested implementations instead.

Fortunately, the compiler_builtins crate already contains implementations for all the needed functions; they are just disabled by default to not collide with the implementations from the C library. We can enable them by setting cargo’s build-std-features flag to ["compiler-builtins-mem"]. Like the build-std flag, this flag can be either passed on the command line as a -Z flag or configured in the unstable table in the .cargo/config.toml file. Since we always want to build with this flag, the config file option makes more sense for us:

# in .cargo/config.toml

[unstable]
build-std-features = ["compiler-builtins-mem"]
build-std = ["core", "compiler_builtins"]

(Support for the compiler-builtins-mem feature was only added very recently, so you need at least Rust nightly 2020-09-30 for it.)

Behind the scenes, this flag enables the mem feature of the compiler_builtins crate. The effect of this is that the #[no_mangle] attribute is applied to the memcpy etc. implementations of the crate, which makes them available to the linker.

With this change, our kernel has valid implementations for all compiler-required functions, so it will continue to compile even if our code gets more complex.

Set a Default Target

To avoid passing the --target parameter on every invocation of cargo build, we can override the default target. To do this, we add the following to our cargo configuration file at .cargo/config.toml:

# in .cargo/config.toml

[build]
target = "x86_64-blog_os.json"

This tells cargo to use our x86_64-blog_os.json target when no explicit --target argument is passed. This means that we can now build our kernel with a simple cargo build. For more information on cargo configuration options, check out the official documentation.

We are now able to build our kernel for a bare metal target with a simple cargo build. However, our _start entry point, which will be called by the boot loader, is still empty. It’s time that we output something to the screen from it.

Printing to Screen

The easiest way to print text to the screen at this stage is the VGA text buffer. It is a special memory area mapped to the VGA hardware that contains the contents displayed on screen. It normally consists of 25 lines that each contain 80 character cells. Each character cell displays an ASCII character with some foreground and background colors. The screen output looks like this:

We will discuss the exact layout of the VGA buffer in the next post, where we write a first small driver for it. For printing “Hello World!”, we just need to know that the buffer is located at address 0xb8000 and that each character cell consists of an ASCII byte and a color byte.

The implementation looks like this:

static HELLO: &[u8] = b"Hello World!";

#[no_mangle]
pub extern "C" fn _start() -> ! {
    let vga_buffer = 0xb8000 as *mut u8;

    for (i, &byte) in HELLO.iter().enumerate() {
        unsafe {
            *vga_buffer.offset(i as isize * 2) = byte;
            *vga_buffer.offset(i as isize * 2 + 1) = 0xb;
        }
    }

    loop {}
}

First, we cast the integer 0xb8000 into a raw pointer. Then we iterate over the bytes of the static HELLO byte string. We use the enumerate method to additionally get a running variable i. In the body of the for loop, we use the offset method to write the string byte and the corresponding color byte (0xb is a light cyan).
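The cell layout used by this loop can also be tried out on a normal host machine, without touching real VGA memory. The following is a minimal sketch that assumes the standard library is available (so it is not kernel code); the function name encode_vga_cells is made up for illustration. It produces the same byte sequence that the loop above writes to 0xb8000:

```rust
/// Encodes `text` as VGA text-mode cells: each character becomes two bytes,
/// the ASCII code followed by a color attribute byte.
/// (Hypothetical helper for illustration; it uses `Vec`, so it needs `std`.)
fn encode_vga_cells(text: &[u8], color: u8) -> Vec<u8> {
    let mut cells = Vec::with_capacity(text.len() * 2);
    for &byte in text {
        cells.push(byte); // character byte, lands at offset 2 * i
        cells.push(color); // color byte, lands at offset 2 * i + 1
    }
    cells
}
```

Calling encode_vga_cells(b"Hello World!", 0xb) yields 24 bytes, which is exactly what the kernel writes starting at the beginning of the VGA buffer.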

Note that there’s an unsafe block around all memory writes. The reason is that the Rust compiler can’t prove that the raw pointers we create are valid. They could point anywhere and lead to data corruption. By putting them into an unsafe block, we’re basically telling the compiler that we are absolutely sure that the operations are valid. Note that an unsafe block does not turn off Rust’s safety checks. It only allows you to do five additional things.

I want to emphasize that this is not the way we want to do things in Rust! It’s very easy to mess up when working with raw pointers inside unsafe blocks. For example, we could easily write beyond the buffer’s end if we’re not careful.

So we want to minimize the use of unsafe as much as possible. Rust gives us the ability to do this by creating safe abstractions. For example, we could create a VGA buffer type that encapsulates all unsafety and ensures that it is impossible to do anything wrong from the outside. This way, we would only need minimal amounts of unsafe code and can be sure that we don’t violate memory safety. We will create such a safe VGA buffer abstraction in the next post.
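To give a rough idea of what such an abstraction could look like, here is a small sketch. It is not the design used in the next post; the CellBuffer type and its write_cell method are hypothetical names, and the buffer is an ordinary byte slice instead of the memory at 0xb8000. The point is only that bounds are checked inside the type, so callers cannot write past the end:

```rust
/// Hypothetical wrapper around a byte buffer that is laid out like the
/// VGA text buffer (two bytes per character cell).
struct CellBuffer<'a> {
    bytes: &'a mut [u8],
}

impl<'a> CellBuffer<'a> {
    fn new(bytes: &'a mut [u8]) -> Self {
        CellBuffer { bytes }
    }

    /// Writes one character cell. Returns `Err(())` instead of writing
    /// out of bounds when `index` does not fit into the buffer.
    fn write_cell(&mut self, index: usize, ascii: u8, color: u8) -> Result<(), ()> {
        let start = index.checked_mul(2).ok_or(())?;
        let end = start.checked_add(2).ok_or(())?;
        let cell = self.bytes.get_mut(start..end).ok_or(())?;
        cell[0] = ascii;
        cell[1] = color;
        Ok(())
    }
}
```

Because all accesses go through write_cell, no unsafe code is needed by the caller, and an out-of-range index results in an error value rather than memory corruption.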

Running our Kernel

Now that we have an executable that does something perceptible, it is time to run it. First, we need to turn our compiled kernel into a bootable disk image by linking it with a bootloader. Then we can run the disk image in the QEMU virtual machine or boot it on real hardware using a USB stick.

Creating a Bootimage

To turn our compiled kernel into a bootable disk image, we need to link it with a bootloader. As we learned in the section about booting, the bootloader is responsible for initializing the CPU and loading our kernel.

Instead of writing our own bootloader, which is a project on its own, we use the bootloader crate. This crate implements a basic BIOS bootloader without any C dependencies, just Rust and inline assembly. To use it for booting our kernel, we need to add a dependency on it:

# in Cargo.toml

[dependencies]
bootloader = "0.9"

Note: This post is only compatible with bootloader v0.9. Newer versions use a different build system and will result in build errors when following this post.

Adding the bootloader as a dependency is not enough to actually create a bootable disk image. The problem is that we need to link our kernel with the bootloader after compilation, but cargo has no support for post-build scripts.

To solve this problem, we created a tool named bootimage that first compiles the kernel and bootloader, and then links them together to create a bootable disk image. To install the tool, go into your home directory (or any directory outside of your cargo project) and execute the following command in your terminal:

cargo install bootimage

For running bootimage and building the bootloader, you need to have the llvm-tools-preview rustup component installed. You can do so by executing rustup component add llvm-tools-preview.

After installing bootimage and adding the llvm-tools-preview component, you can create a bootable disk image by going back into your cargo project directory and executing:

> cargo bootimage

We see that the tool recompiles our kernel using cargo build, so it will automatically pick up any changes you make. Afterwards, it compiles the bootloader, which might take a while. Like all crate dependencies, it is only built once and then cached, so subsequent builds will be much faster. Finally, bootimage combines the bootloader and your kernel into a bootable disk image.

After executing the command, you should see a bootable disk image named bootimage-blog_os.bin in your target/x86_64-blog_os/debug directory. You can boot it in a virtual machine or copy it to a USB drive to boot it on real hardware. (Note that this is not a CD image, which has a different format, so burning it to a CD doesn’t work.)

How does it work?

The bootimage tool performs the following steps behind the scenes:

It compiles our kernel to an ELF file.
It compiles the bootloader dependency as a standalone executable.
It links the bytes of the kernel ELF file to the bootloader.

When booted, the bootloader reads and parses the appended ELF file. It then maps the program segments to virtual addresses in the page tables, zeroes the .bss section, and sets up a stack. Finally, it reads the entry point address (our _start function) and jumps to it.

Booting it in QEMU

We can now boot the disk image in a virtual machine. To boot it in QEMU, execute the following command:

> qemu-system-x86_64 -drive format=raw,file=target/x86_64-blog_os/debug/bootimage-blog_os.bin

This opens a separate window which should look similar to this:

We see that our “Hello World!” is visible on the screen.

Real Machine

It is also possible to write it to a USB stick and boot it on a real machine, but be careful to choose the correct device name, because everything on that device is overwritten:

> dd if=target/x86_64-blog_os/debug/bootimage-blog_os.bin of=/dev/sdX && sync

Where sdX is the device name of your USB stick.

After writing the image to the USB stick, you can run it on real hardware by booting from it. You probably need to use a special boot menu or change the boot order in your BIOS configuration to boot from the USB stick. Note that it currently doesn’t work for UEFI machines, since the bootloader crate has no UEFI support yet.

Using `cargo run`

To make it easier to run our kernel in QEMU, we can set the runner configuration key for cargo:

# in .cargo/config.toml

[target.'cfg(target_os = "none")']
runner = "bootimage runner"

The target.'cfg(target_os = "none")' table applies to all targets whose target configuration file’s "os" field is set to "none". This includes our x86_64-blog_os.json target. The runner key specifies the command that should be invoked for cargo run. The command is run after a successful build with the executable path passed as the first argument. See the cargo documentation for more details.

The bootimage runner command is specifically designed to be usable as a runner executable. It links the given executable with the project’s bootloader dependency and then launches QEMU. See the Readme of bootimage for more details and possible configuration options.

Now we can use cargo run to compile our kernel and boot it in QEMU.

What’s next?

In the next post, we will explore the VGA text buffer in more detail and write a safe interface for it. We will also add support for the println macro.

Handling Exceptions

Sun, 26 Mar 2017 00:00:00 +0000

In this post, we start exploring CPU exceptions. Exceptions occur in various erroneous situations, for example when accessing an invalid memory address or when dividing by zero. To catch them, we have to set up an interrupt descriptor table that provides handler functions. At the end of this post, our kernel will be able to catch breakpoint exceptions and to resume normal execution afterwards.

As always, the complete source code is available on GitHub. Please file issues for any problems, questions, or improvement suggestions. There is also a comment section at the end of this page.

Exceptions

We’ve already seen several types of exceptions in our kernel:

Invalid Opcode: This exception occurs when the current instruction is invalid. For example, this exception occurred when we tried to use SSE instructions before enabling SSE. Without SSE, the CPU didn’t know the movups and movaps instructions, so it threw an exception when it stumbled over them.
Page Fault: A page fault occurs on illegal memory accesses, for example, if the current instruction tries to read from an unmapped page or tries to write to a read-only page.
Double Fault: When an exception occurs, the CPU tries to call the corresponding handler function. If another exception occurs while calling the exception handler, the CPU raises a double fault exception. This exception also occurs when there is no handler function registered for an exception.
Triple Fault: If an exception occurs while the CPU tries to call the double fault handler function, it issues a fatal triple fault. We can’t catch or handle a triple fault. Most processors react by resetting themselves and rebooting the operating system. This causes the bootloops we experienced in the previous posts.

For the full list of exceptions check out the OSDev wiki.

The Interrupt Descriptor Table

In order to catch and handle exceptions, we have to set up a so-called Interrupt Descriptor Table (IDT). In this table we can specify a handler function for each CPU exception. The hardware uses this table directly, so we need to follow a predefined format. Each entry must have the following 16-byte structure:

Type	Name	Description
u16	Function Pointer [0:15]	The lower bits of the pointer to the handler function.
u16	GDT selector	Selector of a code segment in the GDT.
u16	Options	(see below)
u16	Function Pointer [16:31]	The middle bits of the pointer to the handler function.
u32	Function Pointer [32:63]	The remaining bits of the pointer to the handler function.
u32	Reserved
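The table above maps directly onto a struct with C layout. The following sketch is only an illustration of the format, not the IdtEntry type of the x86_64 crate; the names RawIdtEntry, new, and handler_addr are made up here. It shows how a 64-bit handler address is split across the three pointer fields:

```rust
/// Illustrative mirror of the 16-byte IDT entry layout (hypothetical type).
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct RawIdtEntry {
    pointer_low: u16,    // function pointer bits 0..=15
    gdt_selector: u16,   // code segment selector in the GDT
    options: u16,        // see the options table
    pointer_middle: u16, // function pointer bits 16..=31
    pointer_high: u32,   // function pointer bits 32..=63
    reserved: u32,
}

impl RawIdtEntry {
    /// Splits `handler_addr` into the low, middle, and high pointer fields.
    fn new(handler_addr: u64, gdt_selector: u16, options: u16) -> Self {
        RawIdtEntry {
            pointer_low: handler_addr as u16,
            gdt_selector,
            options,
            pointer_middle: (handler_addr >> 16) as u16,
            pointer_high: (handler_addr >> 32) as u32,
            reserved: 0,
        }
    }

    /// Reassembles the full 64-bit handler address from the three fields.
    fn handler_addr(&self) -> u64 {
        (self.pointer_low as u64)
            | ((self.pointer_middle as u64) << 16)
            | ((self.pointer_high as u64) << 32)
    }
}
```

With #[repr(C)], the fields are laid out in declaration order without padding, so the struct occupies exactly the 16 bytes that the hardware expects.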

The options field has the following format:

Bits	Name	Description
0-2	Interrupt Stack Table Index	0: Don’t switch stacks, 1-7: Switch to the n-th stack in the Interrupt Stack Table when this handler is called.
3-7	Reserved
8	0: Interrupt Gate, 1: Trap Gate	If this bit is 0, interrupts are disabled when this handler is called.
9-11	must be one
12	must be zero
13-14	Descriptor Privilege Level (DPL)	The minimal privilege level required for calling this handler.
15	Present
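As a worked example of this bit layout, the following sketch assembles an options value from its parts. The function entry_options and its parameters are hypothetical names chosen for illustration; the x86_64 crate exposes this through its own API instead:

```rust
/// Builds the 16-bit options field of an IDT entry (illustrative helper).
fn entry_options(ist_index: u8, trap_gate: bool, dpl: u8, present: bool) -> u16 {
    let mut bits: u16 = 0b111 << 9; // bits 9-11 must be one, bit 12 stays zero
    bits |= (ist_index as u16) & 0b111; // bits 0-2: Interrupt Stack Table index
    if trap_gate {
        bits |= 1 << 8; // bit 8: 1 = trap gate, 0 = interrupt gate
    }
    bits |= ((dpl as u16) & 0b11) << 13; // bits 13-14: descriptor privilege level
    if present {
        bits |= 1 << 15; // bit 15: present
    }
    bits
}
```

For a present interrupt gate with no stack switch and DPL 0, this gives 0x8e00; setting the trap gate bit turns it into 0x8f00.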

Each exception has a predefined IDT index. For example, the invalid opcode exception has table index 6 and the page fault exception has table index 14. Thus, the hardware can automatically load the corresponding IDT entry for each exception. The Exception Table in the OSDev wiki shows the IDT indexes of all exceptions in the “Vector nr.” column.

When an exception occurs, the CPU roughly does the following:

Push some registers on the stack, including the instruction pointer and the RFLAGS register. (We will use these values later in this post.)
Read the corresponding entry from the Interrupt Descriptor Table (IDT). For example, the CPU reads the 14-th entry when a page fault occurs.
Check if the entry is present. Raise a double fault if not.
Disable interrupts if the entry is an interrupt gate (bit 40 not set).
Load the specified GDT selector into the CS segment.
Jump to the specified handler function.

An IDT Type

Instead of creating our own IDT type, we will use the Idt struct of the x86_64 crate, which looks like this:

#[repr(C)]
pub struct Idt {
    pub divide_by_zero: IdtEntry<HandlerFunc>,
    pub debug: IdtEntry<HandlerFunc>,
    pub non_maskable_interrupt: IdtEntry<HandlerFunc>,
    pub breakpoint: IdtEntry<HandlerFunc>,
    pub overflow: IdtEntry<HandlerFunc>,
    pub bound_range_exceeded: IdtEntry<HandlerFunc>,
    pub invalid_opcode: IdtEntry<HandlerFunc>,
    pub device_not_available: IdtEntry<HandlerFunc>,
    pub double_fault: IdtEntry<HandlerFuncWithErrCode>,
    pub invalid_tss: IdtEntry<HandlerFuncWithErrCode>,
    pub segment_not_present: IdtEntry<HandlerFuncWithErrCode>,
    pub stack_segment_fault: IdtEntry<HandlerFuncWithErrCode>,
    pub general_protection_fault: IdtEntry<HandlerFuncWithErrCode>,
    pub page_fault: IdtEntry<PageFaultHandlerFunc>,
    pub x87_floating_point: IdtEntry<HandlerFunc>,
    pub alignment_check: IdtEntry<HandlerFuncWithErrCode>,
    pub machine_check: IdtEntry<HandlerFunc>,
    pub simd_floating_point: IdtEntry<HandlerFunc>,
    pub virtualization: IdtEntry<HandlerFunc>,
    pub security_exception: IdtEntry<HandlerFuncWithErrCode>,
    pub interrupts: [IdtEntry<HandlerFunc>; 224],
    // some fields omitted
}

The fields have the type IdtEntry<F>, which is a struct that represents the fields of an IDT entry (see the table above). The type parameter F defines the expected handler function type. We see that some entries require a HandlerFunc and some entries require a HandlerFuncWithErrCode. The page fault even has its own special type: PageFaultHandlerFunc.

Let’s look at the HandlerFunc type first:

type HandlerFunc = extern "x86-interrupt" fn(_: &mut ExceptionStackFrame);

The Interrupt Calling Convention

Exceptions are quite similar to function calls: The CPU jumps to the first instruction of the called function and executes it. Afterwards, if the function is not diverging, the CPU jumps to the return address and continues the execution of the parent function.

However, there is a major difference between exceptions and function calls: A function call is invoked voluntarily by a compiler-inserted call instruction, while an exception might occur at any instruction. In order to understand the consequences of this difference, we need to examine function calls in more detail. For C functions on x86_64 Linux, the calling convention (specified in the System V ABI) defines the following rules:

the first six integer arguments are passed in registers rdi, rsi, rdx, rcx, r8, r9
additional arguments are passed on the stack
results are returned in rax and rdx

Note that Rust does not follow the C ABI (in fact, there isn’t even a Rust ABI yet). So these rules apply only to functions declared as extern "C" fn.

Preserved and Scratch Registers

The calling convention divides the registers into two parts: preserved and scratch registers.

The values of preserved registers must remain unchanged across function calls. So a called function (the “callee”) is only allowed to overwrite these registers if it restores their original values before returning. Therefore these registers are called “callee-saved”. A common pattern is to save these registers to the stack at the function’s beginning and restore them just before returning.

In contrast, a called function is allowed to overwrite scratch registers without restrictions. If the caller wants to preserve the value of a scratch register across a function call, it needs to back it up before the function call (e.g. by pushing it to the stack) and restore it afterwards. So the scratch registers are caller-saved.

On x86_64, the C calling convention specifies the following preserved and scratch registers:

preserved registers	scratch registers
`rbp`, `rbx`, `rsp`, `r12`, `r13`, `r14`, `r15`	`rax`, `rcx`, `rdx`, `rsi`, `rdi`, `r8`, `r9`, `r10`, `r11`
callee-saved	caller-saved

Preserving all Registers

In contrast to function calls, exceptions can occur on any instruction. In most cases we don’t even know at compile time if the generated code will cause an exception. For example, the compiler can’t know if an instruction causes a stack overflow or a page fault.

Since we don’t know when an exception occurs, we can’t back up any registers beforehand. This means that we can’t use a calling convention that relies on caller-saved registers for exception handlers. Instead, we need a calling convention that preserves all registers. The x86-interrupt calling convention is such a calling convention, so it guarantees that all register values are restored to their original values on function return.

The Exception Stack Frame

When an interrupt or exception occurs, the CPU performs the following steps:

Aligning the stack pointer: An interrupt can occur at any instruction, so the stack pointer can have any value, too. However, some CPU instructions (e.g. some SSE instructions) require that the stack pointer is aligned on a 16-byte boundary, therefore the CPU performs such an alignment right after the interrupt.
Switching stacks (in some cases): A stack switch occurs when the CPU privilege level changes, for example when a CPU exception occurs in a user-mode program. It is also possible to configure stack switches for specific interrupts using the so-called Interrupt Stack Table (described in the next post).
Pushing the old stack pointer: The CPU pushes the values of the stack pointer (rsp) and the stack segment (ss) registers at the time when the interrupt occurred (before the alignment). This makes it possible to restore the original stack pointer when returning from an interrupt handler.
Pushing and updating the RFLAGS register: The RFLAGS register contains various control and status bits. On interrupt entry, the CPU changes some bits and pushes the old value.
Pushing the instruction pointer: Before jumping to the interrupt handler function, the CPU pushes the instruction pointer (rip) and the code segment (cs). This is comparable to the return address push of a normal function call.
Pushing an error code (for some exceptions): For some specific exceptions such as page faults, the CPU pushes an error code, which describes the cause of the exception.
Invoking the interrupt handler: The CPU reads the address and the segment descriptor of the interrupt handler function from the corresponding field in the IDT. It then invokes this handler by loading the values into the rip and cs registers.

So the exception stack frame looks like this:

In the x86_64 crate, the exception stack frame is represented by the ExceptionStackFrame struct. It is passed to interrupt handlers as &mut and can be used to retrieve additional information about the exception’s cause. The struct contains no error code field, since only a few exceptions push an error code. These exceptions use the separate HandlerFuncWithErrCode function type, which has an additional error_code argument.
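The push order described in the steps above determines the memory layout of the frame: the value pushed last (the instruction pointer) ends up at the lowest address. The following sketch mirrors that layout with a hypothetical StackFrameSketch type. It is an illustration based on the described pushes, not the actual definition from the x86_64 crate, and it assumes that every value occupies 8 bytes on the stack, as is the case in 64-bit mode:

```rust
/// Illustrative layout of the exception stack frame, lowest address first.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct StackFrameSketch {
    instruction_pointer: u64, // pushed last, so it is at the lowest address
    code_segment: u64,        // cs, padded to 8 bytes
    cpu_flags: u64,           // old RFLAGS value
    stack_pointer: u64,       // rsp at the time of the interrupt
    stack_segment: u64,       // ss, pushed first, so it is at the highest address
}

/// Interprets five consecutive stack words (lowest address first) as a frame.
fn frame_from_words(words: [u64; 5]) -> StackFrameSketch {
    StackFrameSketch {
        instruction_pointer: words[0],
        code_segment: words[1],
        cpu_flags: words[2],
        stack_pointer: words[3],
        stack_segment: words[4],
    }
}
```

The frame is 5 * 8 = 40 bytes large. For exceptions with an error code, the error code lies directly below the instruction pointer and is not part of this struct.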

Behind the Scenes

The x86-interrupt calling convention takes care of the following details for us:

Retrieving the arguments: Most calling conventions expect that the arguments are passed in registers. This is not possible for exception handlers, since we must not overwrite any register values before backing them up on the stack. Instead, the x86-interrupt calling convention is aware that the arguments already lie on the stack at a specific offset.
Returning using iretq: Since the exception stack frame completely differs from stack frames of normal function calls, we can’t return from handler functions through the normal ret instruction. Instead, the iretq instruction must be used.
Handling the error code: The error code, which is pushed for some exceptions, makes things much more complex. It changes the stack alignment (see the next point) and needs to be popped off the stack before returning. The x86-interrupt calling convention handles all that complexity. However, it doesn’t know which handler function is used for which exception, so it needs to deduce that information from the number of function arguments. That means that the programmer is still responsible for using the correct function type for each exception. Luckily, the Idt type defined by the x86_64 crate ensures that the correct function types are used.
Aligning the stack: There are some instructions (especially SSE instructions) that require a 16-byte stack alignment. The CPU ensures this alignment whenever an exception occurs, but for some exceptions it destroys it again later when it pushes an error code. The x86-interrupt calling convention takes care of this by realigning the stack in this case.
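The alignment problem in the last point is plain arithmetic. The sketch below computes the stack pointer value at handler entry; rsp_at_handler_entry is a hypothetical helper, and the sketch assumes the System V convention that a function expects rsp + 8 to be a multiple of 16 when it is entered (because a call instruction has just pushed an 8-byte return address):

```rust
/// Computes the stack pointer at the start of an exception handler
/// (illustrative model of the steps the CPU performs).
fn rsp_at_handler_entry(interrupted_rsp: u64, has_error_code: bool) -> u64 {
    let aligned = interrupted_rsp & !0xf; // CPU aligns down to a 16-byte boundary
    let after_frame = aligned - 5 * 8; // pushes ss, rsp, rflags, cs, rip
    if has_error_code {
        after_frame - 8 // the additional error code push
    } else {
        after_frame
    }
}
```

Without an error code, 40 bytes are pushed, so rsp % 16 is 8, which matches what a normal function sees at entry. With an error code, 48 bytes are pushed and rsp % 16 becomes 0, so the stack must be realigned before code that relies on the usual alignment runs.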

If you are interested in more details: We also have a series of posts that explains exception handling using naked functions linked at the end of this post.

Implementation

Now that we’ve understood the theory, it’s time to handle CPU exceptions in our kernel. We start by creating a new interrupts module:

// in src/lib.rs
...
mod interrupts;
...

In the new module, we create an init function that creates a new Idt:

// in src/interrupts.rs

use x86_64::structures::idt::Idt;

pub fn init() {
    let mut idt = Idt::new();
}

Now we can add handler functions. We start by adding a handler for the breakpoint exception. The breakpoint exception is the perfect exception to test exception handling. Its only purpose is to temporarily pause a program when the breakpoint instruction int3 is executed.

For our use case, we don’t need to overwrite any instructions (it wouldn’t even be possible since we set the page table flags to read-only). Instead, we just want to print a message when the breakpoint instruction is executed and then continue the program.

So let’s create a simple breakpoint_handler function and add it to our IDT:

// in src/interrupts.rs

use x86_64::structures::idt::ExceptionStackFrame;

pub fn init() {
    let mut idt = Idt::new();
    idt.breakpoint.set_handler_fn(breakpoint_handler);
}

extern "x86-interrupt" fn breakpoint_handler(
    stack_frame: &mut ExceptionStackFrame)
{
    println!("EXCEPTION: BREAKPOINT\n{:#?}", stack_frame);
}

Our handler just outputs a message and pretty-prints the exception stack frame.

When we try to compile it, the following error occurs:

error: x86-interrupt ABI is experimental and subject to change (see issue #40180)
  --> src/interrupts.rs:8:1
   |
8  |   extern "x86-interrupt" fn breakpoint_handler(
   |  _^ starting here...
9  | |     stack_frame: &mut ExceptionStackFrame)
10 | | {
11 | |     println!("EXCEPTION: BREAKPOINT\n{:#?}", stack_frame);
12 | | }
   | |_^ ...ending here
   |
   = help: add #![feature(abi_x86_interrupt)] to the crate attributes to enable

Loading the IDT

In order for the CPU to use our new interrupt descriptor table, we need to load it using the lidt instruction. The Idt struct of the x86_64 crate provides a load method for that. Let’s try to use it:

pub fn init() {
    let mut idt = Idt::new();
    idt.breakpoint.set_handler_fn(breakpoint_handler);
    idt.load();
}

When we try to compile it now, the following error occurs:

error: `idt` does not live long enough
  --> src/interrupts/mod.rs:43:5
   |
43 |     idt.load();
   |     ^^^ does not live long enough
44 | }
   | - borrowed value only lives until here
   |
   = note: borrowed value must be valid for the static lifetime...

So the load method expects a &'static self, that is, a reference that is valid for the complete runtime of the program. The reason is that the CPU will access this table on every interrupt until we load a different IDT. So using a shorter lifetime than 'static could lead to use-after-free bugs.

In fact, this is exactly what happens here. Our idt is created on the stack, so it is only valid inside the init function. Afterwards the stack memory is reused for other functions, so the CPU would interpret random stack memory as IDT. Luckily, the Idt::load method encodes this lifetime requirement in its function definition, so that the Rust compiler is able to prevent this possible bug at compile time.
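The effect of a &'static self receiver can be reproduced on a host machine with a small stand-in type. In the sketch below, Table, LOADED_TABLE_ADDR, and EXAMPLE_TABLE are hypothetical names, and load only records the table address instead of executing lidt. It assumes the standard library is available:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for a descriptor table (hypothetical type).
struct Table {
    entries: [u64; 4],
}

/// Remembers the address of the table that was "loaded" last.
static LOADED_TABLE_ADDR: AtomicUsize = AtomicUsize::new(0);

impl Table {
    /// Only accepts references that live for the whole program, mirroring
    /// the `&'static self` receiver described above.
    fn load(&'static self) {
        LOADED_TABLE_ADDR.store(self as *const Table as usize, Ordering::SeqCst);
    }
}

/// A static lives for the whole program, so it can be loaded.
static EXAMPLE_TABLE: Table = Table { entries: [0; 4] };

// A table in a local variable is rejected at compile time:
//
//     let local = Table { entries: [0; 4] };
//     local.load(); // error: `local` does not live long enough
```

Calling EXAMPLE_TABLE.load() compiles, while the commented-out variant with a stack-allocated table does not, which is the same class of error as the one shown above.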

In order to fix this problem, we need to store our idt at a place where it has a 'static lifetime. To achieve this, we could either allocate our IDT on the heap using Box and then convert it to a 'static reference or we can store the IDT as a static. Let’s try the latter:

static IDT: Idt = Idt::new();

pub fn init() {
    IDT.breakpoint.set_handler_fn(breakpoint_handler);
    IDT.load();
}

There are two problems with this. First, statics are immutable, so we can’t modify the breakpoint entry from our init function. Second, the Idt::new function is not a const function, so it can’t be used to initialize a static. We could solve this problem by using a static mut of type Option<Idt>:

static mut IDT: Option<Idt> = None;

pub fn init() {
    unsafe {
        // assign to the static itself instead of shadowing it with a local
        IDT = Some(Idt::new());
        let idt = IDT.as_mut().unwrap();
        idt.breakpoint.set_handler_fn(breakpoint_handler);
        idt.load();
    }
}

This variant compiles without errors but it’s far from idiomatic. static muts are very prone to data races, so we need an unsafe block on each access. Also, we need to explicitly unwrap the IDT on each use, since it might be None.

Lazy Statics to the Rescue

The one-time initialization of statics with non-const functions is a common problem in Rust. Fortunately, there already exists a good solution in a crate named lazy_static. This crate provides a lazy_static! macro that defines a lazily initialized static. Instead of computing its value at compile time, the static lazily initializes itself when it’s accessed the first time. Thus, the initialization happens at runtime so that arbitrarily complex initialization code is possible.

Let’s add the lazy_static crate to our project:

// in src/lib.rs

#[macro_use]
extern crate lazy_static;

# in Cargo.toml

[dependencies.lazy_static]
version = "0.2.4"
features = ["spin_no_std"]

We need the spin_no_std feature, since we don’t link the standard library. We also need the #[macro_use] attribute on the extern crate line to import the lazy_static! macro.

Now we can create our static IDT using lazy_static:

lazy_static! {
    static ref IDT: Idt = {
        let mut idt = Idt::new();
        idt.breakpoint.set_handler_fn(breakpoint_handler);
        idt
    };
}

pub fn init() {
    IDT.load();
}

Note how this solution requires no unsafe blocks or unwrap calls.

Aside: How does the lazy_static! macro work?

The macro generates a static of type Once<Idt>. The Once type is provided by the spin crate and allows deferred one-time initialization. It is implemented using an AtomicUsize for synchronization and an UnsafeCell for storing the (possibly uninitialized) value. So this solution also uses unsafe behind the scenes, but it is abstracted away in a safe interface.
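The same idea of deferred one-time initialization can be tried on a host machine with the standard library’s OnceLock type. This is only an analogy: OnceLock is used here as a stand-in for the Once type of the spin crate and is not available in our no_std kernel, and the names INIT_CALLS, LAZY_VALUES, and lazy_values are made up for this sketch:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the initialization closure runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Starts out empty; the value is created on first access.
static LAZY_VALUES: OnceLock<Vec<u32>> = OnceLock::new();

/// Returns the lazily initialized value, initializing it on the first call.
fn lazy_values() -> &'static Vec<u32> {
    LAZY_VALUES.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        vec![1, 2, 3] // arbitrary runtime initialization code
    })
}
```

No matter how often lazy_values is called, the closure runs only once, and every caller receives a 'static reference to the same value.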

Testing it

Now we should be able to handle breakpoint exceptions! Let’s try it in our rust_main:

// in src/lib.rs

pub extern "C" fn rust_main(...) {
    ...
    memory::init(boot_info);

    // initialize our IDT
    interrupts::init();

    // invoke a breakpoint exception
    x86_64::instructions::interrupts::int3();

    println!("It did not crash!");
    loop {}
}

When we run it in QEMU now (using make run), we see the following:

It works! The CPU successfully invokes our breakpoint handler, which prints the message, and then returns to the rust_main function, where the It did not crash! message is printed.

Aside: If it doesn’t work and a boot loop occurs, this might be caused by a kernel stack overflow. Try increasing the stack size to at least 16kB (4096 * 4 bytes) in the boot.asm file.

We see that the exception stack frame tells us the instruction and stack pointers at the time when the exception occurred. This information is very useful when debugging unexpected exceptions. For example, we can look at the corresponding assembly line using objdump:

> objdump -d -C build/kernel-x86_64.bin | grep -B5 "1140a6:"
00000000001140a0 <x86_64::instructions::interrupts::int3::h015bf61815bb8afe>:
  1140a0:	55                   	push   %rbp
  1140a1:	48 89 e5             	mov    %rsp,%rbp
  1140a4:	50                   	push   %rax
  1140a5:	cc                   	int3
  1140a6:	48 83 c4 08          	add    $0x8,%rsp

The -d flag disassembles the code section and the -C flag makes function names more readable by demangling them. The -B flag of grep specifies the number of preceding lines that should be shown (5 in our case).

We clearly see the int3 instruction that caused the breakpoint exception at address 1140a5. Wait… the stored instruction pointer was 1140a6, which is a normal add operation. What’s happening here?

Faults, Aborts, and Traps

The answer is that the stored instruction pointer only points to the causing instruction for fault type exceptions, but not for trap or abort type exceptions. The difference between these types is the following:

Faults are exceptions that can be corrected so that the program can continue as if nothing happened. An example is the page fault, which can often be resolved by loading the accessed page from the disk into memory.
Aborts are fatal exceptions that canâ€™t be recovered. Examples are machine check exception or the double fault.
Traps are only reported to the kernel, but donâ€™t hinder the continuation of the program. Examples are the breakpoint exception and the overflow exception.

The reason for the different instruction pointer values is that the stored value is also the return address. So for faults, the instruction that caused the exception is restarted and might cause the same exception again if it’s not resolved. This would not make much sense for traps, since invoking the breakpoint exception again would just cause another breakpoint exception¹. Thus the instruction pointer points to the next instruction for these exceptions.
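This rule can be written down as a tiny model. The sketch below covers only faults and traps, since the text above makes no statement about the saved value for aborts; ExceptionKind and saved_instruction_pointer are hypothetical names used for illustration:

```rust
/// The two exception types for which the saved instruction pointer is
/// described above (aborts are left out on purpose).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ExceptionKind {
    Fault,
    Trap,
}

/// Returns the instruction pointer value that is stored in the exception
/// stack frame, given the address and length of the causing instruction.
fn saved_instruction_pointer(kind: ExceptionKind, instr_addr: u64, instr_len: u64) -> u64 {
    match kind {
        // faults restart the causing instruction
        ExceptionKind::Fault => instr_addr,
        // traps continue with the following instruction
        ExceptionKind::Trap => instr_addr + instr_len,
    }
}
```

Applied to the objdump output from before: int3 is a one-byte instruction (cc) at address 1140a5, so the saved instruction pointer for this trap is 1140a5 + 1 = 1140a6.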

In some cases, the distinction between faults and traps is vague. For example, the debug exception behaves like a fault in some cases, but like a trap in others. So to find out the meaning of the saved instruction pointer, it is a good idea to read the official documentation for the exception, which can be found in the AMD64 manual in Section 8.2. For example, for the breakpoint exception it says:

#BP is a trap-type exception. The saved instruction pointer points to the byte after the INT3 instruction.

The documentation of the Idt struct and the OSDev Wiki also contain this information.

Too much Magic?

The x86-interrupt calling convention and the Idt type made the exception handling process relatively straightforward and painless. If this was too much magic for you and you would like to learn all the gory details of exception handling, we’ve got you covered: Our “Handling Exceptions with Naked Functions” series shows how to handle exceptions without the x86-interrupt calling convention and also creates its own Idt type. Historically, these posts were the main exception handling posts before the x86-interrupt calling convention and the x86_64 crate existed.

What’s next?

We’ve successfully caught our first exception and returned from it! The next step is to add handlers for other common exceptions such as page faults. We also need to make sure that we never cause a triple fault, since it causes a complete system reset. The next post explains how we can avoid this by correctly catching double faults.

Footnotes

There are valid use cases for restarting an instruction that caused a breakpoint. The most common use case is a debugger: When setting a breakpoint on some code line, the debugger overwrites the corresponding instruction with an int3 instruction, so that the CPU traps when that line is executed. When the user continues execution, the debugger swaps in the original instruction and continues the program from the replaced instruction.

Double Faults

Mon, 02 Jan 2017 00:00:00 +0000

In this post we explore double faults in detail. We also set up an Interrupt Stack Table to catch double faults on a separate kernel stack. This way, we can completely prevent triple faults, even on kernel stack overflow.

As always, the complete source code is available on GitHub. Please file issues for any problems, questions, or improvement suggestions. There is also a gitter chat and a comment section at the end of this page.

What is a Double Fault?

In simplified terms, a double fault is a special exception that occurs when the CPU fails to invoke an exception handler. For example, it occurs when a page fault is triggered but there is no page fault handler registered in the Interrupt Descriptor Table (IDT). So it’s kind of similar to catch-all blocks in programming languages with exceptions, e.g. catch(...) in C++ or catch(Exception e) in Java or C#.

A double fault behaves like a normal exception. It has the vector number 8 and we can define a normal handler function for it in the IDT. It is really important to provide a double fault handler, because if a double fault is unhandled, a fatal triple fault occurs. Triple faults can’t be caught and most hardware reacts with a system reset.

Triggering a Double Fault

Let’s provoke a double fault by triggering an exception for which we didn’t define a handler function:

// in src/lib.rs

#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    ...
    // initialize our IDT
    interrupts::init();

    // trigger a page fault
    unsafe {
        *(0xdeadbeaf as *mut u64) = 42;
    };

    println!("It did not crash!");
    loop {}
}

We try to write to address 0xdeadbeaf, but the corresponding page is not present in the page tables. Thus, a page fault occurs. We haven’t registered a page fault handler in our IDT, so a double fault occurs.

When we start our kernel now, we see that it enters an endless boot loop:

The reason for the boot loop is the following:

The CPU tries to write to 0xdeadbeaf, which causes a page fault.
The CPU looks at the corresponding entry in the IDT and sees that the present bit isn’t set. Thus, it can’t call the page fault handler and a double fault occurs.
The CPU looks at the IDT entry of the double fault handler, but this entry is also non-present. Thus, a triple fault occurs.
A triple fault is fatal. QEMU reacts to it like most real hardware and issues a system reset.

So in order to prevent this triple fault, we need to either provide a handler function for page faults or a double fault handler. Let’s start with the latter, since we want to avoid triple faults in all cases.

A Double Fault Handler

A double fault is a normal exception with an error code, so we can define a handler function with the x86-interrupt calling convention, similar to our breakpoint handler, and register it in the IDT:

// in src/interrupts.rs

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();

        idt.breakpoint.set_handler_fn(breakpoint_handler);
        idt.double_fault.set_handler_fn(double_fault_handler);

        idt
    };
}

// our new double fault handler
extern "x86-interrupt" fn double_fault_handler(
    stack_frame: &mut ExceptionStackFrame, _error_code: u64)
{
    println!("\nEXCEPTION: DOUBLE FAULT\n{:#?}", stack_frame);
    loop {}
}

Our handler prints a short error message and dumps the exception stack frame. The error code of the double fault handler is always zero, so there’s no reason to print it.

When we start our kernel now, we should see that the double fault handler is invoked:

It worked! Here is what happens this time:

The CPU tries to write to 0xdeadbeaf, which causes a page fault.
Like before, the CPU looks at the corresponding entry in the IDT and sees that the present bit isn’t set. Thus, a double fault occurs.
The CPU jumps to the (now present) double fault handler.

The triple fault (and the boot-loop) no longer occurs, since the CPU can now call the double fault handler.
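The two walkthroughs above can be summarized in a small model. This is only a sketch of the escalation logic for a page fault under the simplified rules described so far; `IdtModel`, `Outcome`, and `deliver_page_fault` are hypothetical names for this illustration and do not correspond to any real API.

```rust
/// What finally happens when the CPU tries to deliver a page fault.
#[derive(Debug, PartialEq)]
enum Outcome {
    PageFaultHandler,
    DoubleFaultHandler,
    /// A triple fault: QEMU and most real hardware reset the system.
    SystemReset,
}

/// Simplified IDT that only tracks whether the two relevant entries are present.
struct IdtModel {
    page_fault_present: bool,
    double_fault_present: bool,
}

fn deliver_page_fault(idt: &IdtModel) -> Outcome {
    if idt.page_fault_present {
        Outcome::PageFaultHandler
    } else if idt.double_fault_present {
        // the page fault handler can't be invoked, so a double fault occurs
        Outcome::DoubleFaultHandler
    } else {
        // neither handler can be invoked, so a triple fault occurs
        Outcome::SystemReset
    }
}
```

With both entries non-present we get the boot loop from before; with only the double fault entry present we end up in our new handler.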

Causes of Double Faults

Before we look at the special cases, we need to know the exact causes of double faults. Above, we used a pretty vague definition:

A double fault is a special exception that occurs when the CPU fails to invoke an exception handler.

What does “fails to invoke” mean exactly? The handler is not present? The handler is swapped out? And what happens if a handler causes exceptions itself?

For example, what happens if:

a divide-by-zero exception occurs, but the corresponding handler function is swapped out?
a page fault occurs, but the page fault handler is swapped out?
a divide-by-zero handler causes a breakpoint exception, but the breakpoint handler is swapped out?
our kernel overflows its stack and the guard page is hit?

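According to the AMD64 manual, only very specific combinations of a first exception and a second exception that occurs while the first one is being handled lead to a double fault. These combinations are:

First Exception	Second Exception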
Divide-by-zero, Invalid TSS, Segment Not Present, Stack-Segment Fault, General Protection Fault	Invalid TSS, Segment Not Present, Stack-Segment Fault, General Protection Fault
Page Fault	Page Fault, Invalid TSS, Segment Not Present, Stack-Segment Fault, General Protection Fault

So for example a divide-by-zero fault followed by a page fault is fine (the page fault handler is invoked), but a divide-by-zero fault followed by a general-protection fault leads to a double fault.

With the help of this table, we can answer the first three of the above questions:

If a divide-by-zero exception occurs and the corresponding handler function is swapped out, a page fault occurs and the page fault handler is invoked.
If a page fault occurs and the page fault handler is swapped out, a double fault occurs and the double fault handler is invoked.
If a divide-by-zero handler causes a breakpoint exception, the CPU tries to invoke the breakpoint handler. If the breakpoint handler is swapped out, a page fault occurs and the page fault handler is invoked.

In fact, even the case of a non-present handler follows this scheme: A non-present handler causes a segment-not-present exception. We didn’t define a segment-not-present handler, so another segment-not-present exception occurs. According to the table, this leads to a double fault.
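The table can also be written down as a small lookup function. The following sketch encodes exactly the two rows of the table above and nothing more; the `Exception` enum and the `causes_double_fault` function are illustrative names, not part of any crate.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Exception {
    DivideByZero,
    Breakpoint,
    InvalidTss,
    SegmentNotPresent,
    StackSegmentFault,
    GeneralProtectionFault,
    PageFault,
}

/// Returns true if `second`, occurring while `first` is being handled,
/// leads to a double fault according to the table.
fn causes_double_fault(first: Exception, second: Exception) -> bool {
    use Exception::*;

    // the second-exception column shared by both table rows
    let second_in_shared_column = matches!(
        second,
        InvalidTss | SegmentNotPresent | StackSegmentFault | GeneralProtectionFault
    );

    match first {
        // first table row
        DivideByZero | InvalidTss | SegmentNotPresent | StackSegmentFault
        | GeneralProtectionFault => second_in_shared_column,
        // second table row: a page fault additionally escalates on another page fault
        PageFault => second == PageFault || second_in_shared_column,
        // every other combination just invokes the second handler
        _ => false,
    }
}
```

For example, `causes_double_fault(Exception::DivideByZero, Exception::PageFault)` is false, while a segment-not-present exception followed by another one is true, which is the non-present handler case described above.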

Kernel Stack Overflow

Let’s look at the fourth question:

What happens if our kernel overflows its stack and the guard page is hit?

When our kernel overflows its stack and hits the guard page, a page fault occurs. The CPU looks up the page fault handler in the IDT and tries to push the exception stack frame onto the stack. However, our current stack pointer still points to the non-present guard page. Thus, a second page fault occurs, which causes a double fault (according to the above table).

So the CPU tries to call our double fault handler now. However, on a double fault the CPU tries to push the exception stack frame, too. Our stack pointer still points to the guard page, so a third page fault occurs, which causes a triple fault and a system reboot. So our current double fault handler can’t avoid a triple fault in this case.

Let’s try it ourselves! We can easily provoke a kernel stack overflow by calling a function that recurses endlessly:

// in src/lib.rs

#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    ...
    // initialize our IDT
    interrupts::init();

    fn stack_overflow() {
        stack_overflow(); // for each recursion, the return address is pushed
    }

    // trigger a stack overflow
    stack_overflow();

    println!("It did not crash!");
    loop {}
}

When we try this code in QEMU, we see that the system enters a boot-loop again.

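Switching Stacks

The x86_64 architecture is able to switch to a predefined, known-good stack when an exception occurs. This switch happens at the hardware level, so it can be performed before the CPU pushes the exception stack frame.

This switching mechanism is implemented as an Interrupt Stack Table (IST). The IST is a table of 7 pointers to known-good stacks. In Rust-like pseudo code: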

struct InterruptStackTable {
    stack_pointers: [Option<StackPointer>; 7],
}

For each exception handler, we can choose a stack from the IST through the options field in the corresponding IDT entry. For example, we could use the first stack in the IST for our double fault handler. Then the CPU would automatically switch to this stack whenever a double fault occurs. This switch would happen before anything is pushed, so it would prevent the triple fault.
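As a rough model of this selection, the sketch below computes the stack pointer a handler would start with. It reuses the shape of the pseudo code above with plain `u64` addresses; `handler_stack_pointer` is a hypothetical helper, and treating an empty IST slot like "no switch" is a simplification of this model, not documented hardware behavior.

```rust
/// Model of the Interrupt Stack Table: up to 7 tops of known-good stacks.
struct InterruptStackTable {
    stack_pointers: [Option<u64>; 7],
}

/// Returns the stack pointer that is active before the exception stack frame
/// is pushed: the selected IST entry if the IDT entry names one, otherwise the
/// current (possibly overflowed) stack pointer.
fn handler_stack_pointer(
    ist: &InterruptStackTable,
    ist_index: Option<usize>,
    current_stack_pointer: u64,
) -> u64 {
    let selected = ist_index.and_then(|index| ist.stack_pointers.get(index).copied().flatten());
    match selected {
        Some(top) => top,              // switch to the known-good stack
        None => current_stack_pointer, // no switch: keep the current stack
    }
}
```

In the stack overflow scenario, the current stack pointer points into the guard page, so only the first branch gives the CPU a usable stack for pushing the exception stack frame.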

Allocating a new Stack

In order to fill an Interrupt Stack Table later, we need a way to allocate new stacks. Therefore we extend our memory module with a new stack_allocator submodule:

// in src/memory/mod.rs

mod stack_allocator;

First, we create a new StackAllocator struct and a constructor function:

// in src/memory/stack_allocator.rs

use memory::paging::PageIter;

pub struct StackAllocator {
    range: PageIter,
}

impl StackAllocator {
    pub fn new(page_range: PageIter) -> StackAllocator {
        StackAllocator { range: page_range }
    }
}

We create a simple StackAllocator that allocates stacks from a given range of pages (PageIter is an Iterator over a range of pages; we introduced it in the kernel heap post).

We add an alloc_stack method that allocates a new stack:

// in src/memory/stack_allocator.rs

use memory::paging::{self, Page, ActivePageTable};
use memory::{PAGE_SIZE, FrameAllocator};

impl StackAllocator {
    pub fn alloc_stack<FA: FrameAllocator>(&mut self,
                                           active_table: &mut ActivePageTable,
                                           frame_allocator: &mut FA,
                                           size_in_pages: usize)
                                           -> Option<Stack> {
        if size_in_pages == 0 {
            return None; /* a zero sized stack makes no sense */
        }

        // clone the range, since we only want to change it on success
        let mut range = self.range.clone();

        // try to allocate the stack pages and a guard page
        let guard_page = range.next();
        let stack_start = range.next();
        let stack_end = if size_in_pages == 1 {
            stack_start
        } else {
            // choose the (size_in_pages-2)th element, since index
            // starts at 0 and we already allocated the start page
            range.nth(size_in_pages - 2)
        };

        match (guard_page, stack_start, stack_end) {
            (Some(_), Some(start), Some(end)) => {
                // success! write back updated range
                self.range = range;

                // map stack pages to physical frames
                for page in Page::range_inclusive(start, end) {
                    active_table.map(page, paging::WRITABLE, frame_allocator);
                }

                // create a new stack
                let top_of_stack = end.start_address() + PAGE_SIZE;
                Some(Stack::new(top_of_stack, start.start_address()))
            }
            _ => None, /* not enough pages */
        }
    }
}

The method takes mutable references to the ActivePageTable and a FrameAllocator, since it needs to map the new virtual stack pages to physical frames. We define that the stack size is a multiple of the page size.

Instead of operating directly on self.range, we clone it and only write it back on success. This way, subsequent stack allocations can still succeed if there are pages left (e.g., a call with size_in_pages = 3 can still succeed after a failed call with size_in_pages = 100).

In order to be able to clone PageIter, we add a #[derive(Clone)] to its definition in src/memory/paging/mod.rs. We also need to make the start_address method of the Page type public (in the same file).

The actual allocation is straightforward: First, we choose the next page as guard page. Then we choose the next size_in_pages pages as stack pages using Iterator::nth. If all three variables are Some, the allocation succeeded and we map the stack pages to physical frames using ActivePageTable::map. The guard page remains unmapped.
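The clone-and-write-back logic and the use of Iterator::nth are easy to get wrong by one, so here is a minimal sketch of the same page selection on plain page numbers. It uses a standard `RangeInclusive<u64>` in place of our PageIter and leaves out the mapping step; `alloc_stack_pages` is a hypothetical stand-in for the real method.

```rust
use std::ops::RangeInclusive;

/// Takes a guard page followed by `size_in_pages` stack pages from `free_pages`.
/// Returns the first and last stack page number on success.
fn alloc_stack_pages(
    free_pages: &mut RangeInclusive<u64>,
    size_in_pages: usize,
) -> Option<(u64, u64)> {
    if size_in_pages == 0 {
        return None; // a zero sized stack makes no sense
    }

    // work on a copy, so that a failed allocation leaves `free_pages` untouched
    let mut range = free_pages.clone();

    let guard_page = range.next();
    let stack_start = range.next();
    let stack_end = if size_in_pages == 1 {
        stack_start
    } else {
        // `nth(k)` skips k pages and returns the one after them
        range.nth(size_in_pages - 2)
    };

    match (guard_page, stack_start, stack_end) {
        (Some(_), Some(start), Some(end)) => {
            *free_pages = range; // success: write back the advanced range
            Some((start, end))
        }
        _ => None, // not enough pages
    }
}
```

Starting with the pages 10 to 20, a three-page stack uses page 10 as the guard page and pages 11 to 13 as stack pages. A following request for 100 pages fails without consuming anything, so a later one-page request still succeeds.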

Finally, we create and return a new Stack, which we define as follows:

// in src/memory/stack_allocator.rs

#[derive(Debug)]
pub struct Stack {
    top: usize,
    bottom: usize,
}

impl Stack {
    fn new(top: usize, bottom: usize) -> Stack {
        assert!(top > bottom);
        Stack {
            top: top,
            bottom: bottom,
        }
    }

    pub fn top(&self) -> usize {
        self.top
    }

    pub fn bottom(&self) -> usize {
        self.bottom
    }
}

The Stack struct describes a stack through its top and bottom addresses.

The Memory Controller

Now we’re able to allocate a new double fault stack. However, we add one more level of abstraction to make things easier. For that we add a new MemoryController type to our memory module:

// in src/memory/mod.rs

pub use self::stack_allocator::Stack;

pub struct MemoryController {
    active_table: paging::ActivePageTable,
    frame_allocator: AreaFrameAllocator,
    stack_allocator: stack_allocator::StackAllocator,
}

impl MemoryController {
    pub fn alloc_stack(&mut self, size_in_pages: usize) -> Option<Stack> {
        let &mut MemoryController { ref mut active_table,
                                    ref mut frame_allocator,
                                    ref mut stack_allocator } = self;
        stack_allocator.alloc_stack(active_table, frame_allocator,
                                    size_in_pages)
    }
}

The MemoryController struct holds the three types that are required for alloc_stack and provides a simpler interface (only one argument). The alloc_stack wrapper just takes the three fields as &mut through destructuring and forwards them to the stack_allocator. The ref mut-s are needed to take the inner fields by mutable reference. Note that we’re re-exporting the Stack type since it is returned by alloc_stack.

The last step is to create a StackAllocator and return a MemoryController from memory::init:

// in src/memory/mod.rs

pub fn init(boot_info: &BootInformation) -> MemoryController {
    ...

    let stack_allocator = {
        let stack_alloc_start = heap_end_page + 1;
        let stack_alloc_end = stack_alloc_start + 100;
        let stack_alloc_range = Page::range_inclusive(stack_alloc_start,
                                                      stack_alloc_end);
        stack_allocator::StackAllocator::new(stack_alloc_range)
    };

    MemoryController {
        active_table: active_table,
        frame_allocator: frame_allocator,
        stack_allocator: stack_allocator,
    }
}

We create a new StackAllocator with a range of 100 pages starting right after the last heap page.

In order to do arithmetic on pages (e.g. calculate the hundredth page after stack_alloc_start), we implement Add<usize> for Page:

// in src/memory/paging/mod.rs

use core::ops::Add;

impl Add<usize> for Page {
    type Output = Page;

    fn add(self, rhs: usize) -> Page {
        Page { number: self.number + rhs }
    }
}

Allocating a Double Fault Stack

Now we can allocate a new double fault stack by passing the memory controller to our interrupts::init function:

// in src/lib.rs

#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    ...

    // set up guard page and map the heap pages
    let mut memory_controller = memory::init(boot_info); // new return type

    // initialize our IDT
    interrupts::init(&mut memory_controller); // new argument

    ...
}


// in src/interrupts.rs

use memory::MemoryController;

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    IDT.load();
}

We allocate a 4096-byte stack (one page) for our double fault handler. Now we just need some way to tell the CPU that it should use this stack for handling double faults.

The IST and TSS

The Interrupt Stack Table (IST) is part of an old legacy structure called Task State Segment (TSS). The TSS used to hold various information (e.g. processor register state) about a task in 32-bit mode and was for example used for hardware context switching. However, hardware context switching is no longer supported in 64-bit mode and the format of the TSS changed completely.

On x86_64, the TSS no longer holds any task specific information at all. Instead, it holds two stack tables (the IST is one of them). The only common field between the 32-bit and 64-bit TSS is the pointer to the I/O port permissions bitmap.

The 64-bit TSS has the following format:

Field	Type
(reserved)	`u32`
Privilege Stack Table	`[u64; 3]`
(reserved)	`u64`
Interrupt Stack Table	`[u64; 7]`
(reserved)	`u64`
(reserved)	`u16`
I/O Map Base Address	`u16`

The Privilege Stack Table is used by the CPU when the privilege level changes. For example, if an exception occurs while the CPU is in user mode (privilege level 3), the CPU normally switches to kernel mode (privilege level 0) before invoking the exception handler. In that case, the CPU would switch to the 0th stack in the Privilege Stack Table (since 0 is the target privilege level). We don’t have any user mode programs yet, so we ignore this table for now.
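To double-check the table, we can write the format down as a packed struct and let the compiler compute its size. This is only an illustrative layout with made-up field names (`TaskStateSegmentLayout`, `reserved_1`, and so on); the struct we actually use comes from the x86_64 crate.

```rust
use std::mem::size_of;

/// The 64-bit TSS format from the table, field by field.
/// `packed` is needed because the `u64` arrays start at offset 4.
#[allow(dead_code)]
#[repr(C, packed)]
struct TaskStateSegmentLayout {
    reserved_1: u32,
    privilege_stack_table: [u64; 3],
    reserved_2: u64,
    interrupt_stack_table: [u64; 7],
    reserved_3: u64,
    reserved_4: u16,
    iomap_base: u16,
}

/// Size of the structure in bytes: 4 + 24 + 8 + 56 + 8 + 2 + 2.
const TSS_SIZE: usize = size_of::<TaskStateSegmentLayout>();
```

The fields add up to 104 bytes, so an inclusive limit for such a structure would be 103 (0x67).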

Creating a TSS

Let’s create a new TSS that contains our double fault stack in its interrupt stack table. For that we need a TSS struct. Fortunately, the x86_64 crate already contains a TaskStateSegment struct that we can use:

// in src/interrupts.rs

use x86_64::structures::tss::TaskStateSegment;

Let’s create a new TSS in our interrupts::init function:

// in src/interrupts.rs

use x86_64::VirtualAddress;

const DOUBLE_FAULT_IST_INDEX: usize = 0;

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    let mut tss = TaskStateSegment::new();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = VirtualAddress(
        double_fault_stack.top());

    IDT.load();
}

We define that the 0th IST entry is the double fault stack (any other IST index would work too). We create a new TSS through the TaskStateSegment::new function and load the top address (stacks grow downwards) of the double fault stack into the 0th entry.

Loading the TSS

Now that we created a new TSS, we need a way to tell the CPU that it should use it. Unfortunately, this is a bit cumbersome, since the TSS is a Task State Segment (for historical reasons). So instead of loading the table directly, we need to add a new segment descriptor to the Global Descriptor Table (GDT). Then we can load our TSS by invoking the ltr instruction with the respective GDT index.

The Global Descriptor Table (again)

The Global Descriptor Table (GDT) is a relic that was used for memory segmentation before paging became the de facto standard. It is still needed in 64-bit mode for various things such as kernel/user mode configuration or TSS loading.

We already created a GDT when switching to long mode. Back then, we used assembly to create valid code and data segment descriptors, which were required to enter 64-bit mode. We could just edit that assembly file and add an additional TSS descriptor. However, we now have the expressiveness of Rust, so let’s do it in Rust instead.

We start by creating a new interrupts::gdt submodule. For that we need to rename the src/interrupts.rs file to src/interrupts/mod.rs. Then we can create a new submodule:

// in src/interrupts/mod.rs

mod gdt;

// in src/interrupts/gdt.rs

pub struct Gdt {
    table: [u64; 8],
    next_free: usize,
}

impl Gdt {
    pub fn new() -> Gdt {
        Gdt {
            table: [0; 8],
            next_free: 1,
        }
    }
}

We create a simple Gdt struct with two fields. The table field contains the actual GDT modeled as a [u64; 8]. Theoretically, a GDT can have up to 8192 entries, but this doesn’t make much sense in 64-bit mode (since there is no real segmentation support). Eight entries should be more than enough for our system.

The next_free field stores the index of the next free entry. We initialize it with 1 since the 0th entry always needs to be 0 in a valid GDT.

User and System Segments

There are two types of GDT entries in long mode: user and system segment descriptors. Descriptors for code and data segments are user segment descriptors. They contain no addresses since segments always span the complete address space on x86_64 (real segmentation is no longer supported). Thus, user segment descriptors only contain a few flags (e.g. present or user mode) and fit into a single u64 entry.

System descriptors such as TSS descriptors are different. They often contain a base address and a limit (e.g. TSS start and length) and thus need more than 64 bits. Therefore, system segments are 128 bits. They are stored as two consecutive entries in the GDT.

Consequently, we model a Descriptor as an enum:

// in src/interrupts/gdt.rs

pub enum Descriptor {
    UserSegment(u64),
    SystemSegment(u64, u64),
}

The flag bits are common between all descriptor types, so we create a general DescriptorFlags type (using the bitflags macro):

// in src/interrupts/gdt.rs

bitflags! {
    struct DescriptorFlags: u64 {
        const CONFORMING        = 1 << 42;
        const EXECUTABLE        = 1 << 43;
        const USER_SEGMENT      = 1 << 44;
        const PRESENT           = 1 << 47;
        const LONG_MODE         = 1 << 53;
    }
}

We only add flags that are relevant in 64-bit mode. For example, we omit the read/write bit, since it is completely ignored by the CPU in 64-bit mode.

Code Segments

We add a function to create kernel mode code segments:

// in src/interrupts/gdt.rs

impl Descriptor {
    pub fn kernel_code_segment() -> Descriptor {
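        // the flags are associated constants of the `DescriptorFlags` type
        let flags = DescriptorFlags::USER_SEGMENT | DescriptorFlags::PRESENT
            | DescriptorFlags::EXECUTABLE | DescriptorFlags::LONG_MODE;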
        Descriptor::UserSegment(flags.bits())
    }
}

We set the USER_SEGMENT bit to indicate a 64 bit user segment descriptor (otherwise the CPU expects a 128 bit system segment descriptor). The PRESENT, EXECUTABLE, and LONG_MODE bits are also needed for a 64-bit mode code segment.
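To see which raw value ends up in the GDT, we can combine the same bit positions by hand. The sketch below uses plain `u64` constants instead of the bitflags macro so that it has no dependencies; `kernel_code_segment_bits` is a hypothetical helper for this illustration.

```rust
// Bit positions as in the DescriptorFlags definition above.
const EXECUTABLE: u64 = 1 << 43;
const USER_SEGMENT: u64 = 1 << 44;
const PRESENT: u64 = 1 << 47;
const LONG_MODE: u64 = 1 << 53;

/// Raw 64-bit descriptor value for a kernel mode code segment in long mode.
fn kernel_code_segment_bits() -> u64 {
    USER_SEGMENT | PRESENT | EXECUTABLE | LONG_MODE
}
```

The result is 0x0020_9800_0000_0000. The privilege bits (45 and 46) stay zero, which makes it a ring 0 segment.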

The data segment registers ds, ss, and es are completely ignored in 64-bit mode, so we don’t need any data segment descriptors in our GDT.

TSS Segments

A TSS descriptor is a system segment descriptor with the following format:

Bit(s)	Name	Meaning
0-15	limit 0-15	the first 2 bytes of the TSS’s limit
16-39	base 0-23	the first 3 bytes of the TSS’s base address
40-43	type	must be `0b1001` for an available 64-bit TSS
44	zero	must be 0
45-46	privilege	the ring level: 0 for kernel, 3 for user
47	present	must be 1 for valid selectors
48-51	limit 16-19	bits 16 to 19 of the segment’s limit
52	available	freely available to the OS
53-54	ignored
55	granularity	if it’s set, the limit is the number of pages, else it’s a byte number
56-63	base 24-31	the fourth byte of the base address
64-95	base 32-63	the last four bytes of the base address
96-127	ignored/must be zero	bits 104-108 must be zero, the rest is ignored

We only need the limit 0-15, base, type, and present fields for our TSS descriptor. For example, we don’t need the limit 16-19 field since a TSS has a fixed size that is smaller than 2^16.

Let’s add a function to our descriptor that creates a TSS descriptor for a given TSS:

// in src/interrupts/gdt.rs

use x86_64::structures::tss::TaskStateSegment;

impl Descriptor {
    pub fn tss_segment(tss: &'static TaskStateSegment) -> Descriptor {
        use core::mem::size_of;
        use bit_field::BitField;

        let ptr = tss as *const _ as u64;

        let mut low = DescriptorFlags::PRESENT.bits();
        // base
        low.set_bits(16..40, ptr.get_bits(0..24));
        low.set_bits(56..64, ptr.get_bits(24..32));
        // limit (the `-1` is needed since the bound is inclusive)
        low.set_bits(0..16, (size_of::<TaskStateSegment>() - 1) as u64);
        // type (0b1001 = available 64-bit tss)
        low.set_bits(40..44, 0b1001);

        let mut high = 0;
        high.set_bits(0..32, ptr.get_bits(32..64));

        Descriptor::SystemSegment(low, high)
    }
}

The set_bits and get_bits methods are provided by the BitField trait of the bit_field crate. They allow us to easily get or set specific bits in an integer without using bit masks or shift operations. For example, for a 32-bit x we can do x.set_bits(8..12, 10) instead of x = (x & 0xfffff0ff) | (10 << 8). Note that the new value must fit into the given bit range, which is 4 bits wide in this example.
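For comparison, this is roughly what such helpers do under the hood. The free functions below are a simplified sketch on `u64` values written for this post, not the implementation of the bit_field crate; they return the new value instead of modifying the integer in place.

```rust
use std::ops::Range;

/// Returns a mask with the lowest `width` bits set (1 <= width <= 64).
fn low_mask(width: u32) -> u64 {
    if width == 64 { u64::MAX } else { (1u64 << width) - 1 }
}

/// Returns the bits of `value` in `range`, shifted down to bit 0.
fn get_bits(value: u64, range: Range<u32>) -> u64 {
    assert!(range.start < range.end && range.end <= 64);
    (value >> range.start) & low_mask(range.end - range.start)
}

/// Returns `value` with the bits in `range` replaced by `new_bits`.
/// Panics if `new_bits` does not fit into the range.
fn set_bits(value: u64, range: Range<u32>, new_bits: u64) -> u64 {
    assert!(range.start < range.end && range.end <= 64);
    let mask = low_mask(range.end - range.start);
    assert!(new_bits <= mask, "value does not fit into bit range");
    // clear the target bits, then insert the new ones
    (value & !(mask << range.start)) | (new_bits << range.start)
}
```

For example, `set_bits(0xffff_ffff, 8..12, 0xa)` yields 0xffff_faff, and `get_bits` on that result with the same range returns 0xa again.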

To link the bit_field crate, we modify our Cargo.toml and our src/lib.rs:

[dependencies]
bit_field = "0.7.0"

extern crate bit_field;

We require the 'static lifetime for the TaskStateSegment reference, since the hardware might access it on every interrupt as long as the OS runs.
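To verify the bit positions, here is a dependency-free sketch that builds the same two descriptor halves with explicit shifts and masks. The `tss_descriptor` function and its parameters are hypothetical and take a raw base address and size instead of a TaskStateSegment reference.

```rust
/// Encodes a 64-bit TSS descriptor as (low, high) GDT entries
/// from the TSS base address and its size in bytes.
fn tss_descriptor(base: u64, size_in_bytes: u64) -> (u64, u64) {
    let limit = size_in_bytes - 1; // the limit is an inclusive bound

    let mut low = 1u64 << 47;                // present
    low |= 0b1001u64 << 40;                  // type: available 64-bit TSS
    low |= limit & 0xffff;                   // limit bits 0-15
    low |= (base & 0xff_ffff) << 16;         // base bits 0-23
    low |= ((base >> 24) & 0xff) << 56;      // base bits 24-31

    let high = base >> 32;                   // base bits 32-63

    (low, high)
}
```

For a 104-byte TSS at the made-up address 0xffff_8000_1234_5678, this gives a low value of 0x1200_8934_5678_0067 and a high value of 0xffff_8000.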

Adding Descriptors to the GDT

In order to add descriptors to the GDT, we add an add_entry method:

// in src/interrupts/gdt.rs

use x86_64::structures::gdt::SegmentSelector;
use x86_64::PrivilegeLevel;

impl Gdt {
    pub fn add_entry(&mut self, entry: Descriptor) -> SegmentSelector {
        let index = match entry {
            Descriptor::UserSegment(value) => self.push(value),
            Descriptor::SystemSegment(value_low, value_high) => {
                let index = self.push(value_low);
                self.push(value_high);
                index
            }
        };
        SegmentSelector::new(index as u16, PrivilegeLevel::Ring0)
    }
}

For a user segment we just push the u64 and remember the index. For a system segment, we push the low and high u64 and use the index of the low value. We then use this index to return a new SegmentSelector.

The push method looks like this:

// in src/interrupts/gdt.rs

impl Gdt {
    fn push(&mut self, value: u64) -> usize {
        if self.next_free < self.table.len() {
            let index = self.next_free;
            self.table[index] = value;
            self.next_free += 1;
            index
        } else {
            panic!("GDT full");
        }
    }
}

The method just writes to the next_free entry and returns the corresponding index. If there is no free entry left, we panic since this likely indicates a programming error (we should never need to create more than two or three GDT entries for our kernel).
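The following self-contained sketch replays the two methods on a model GDT and shows which selector values come out. It returns the raw selector as a `u16` instead of the SegmentSelector type of the x86_64 crate; a selector stores the table index in bits 3 and above, with the lower bits (privilege level and table indicator) left at 0 here. `GdtModel` is a hypothetical name.

```rust
enum Descriptor {
    UserSegment(u64),
    SystemSegment(u64, u64),
}

struct GdtModel {
    table: [u64; 8],
    next_free: usize,
}

impl GdtModel {
    fn new() -> GdtModel {
        // the 0th entry must stay 0, so the first free index is 1
        GdtModel { table: [0; 8], next_free: 1 }
    }

    fn push(&mut self, value: u64) -> usize {
        assert!(self.next_free < self.table.len(), "GDT full");
        let index = self.next_free;
        self.table[index] = value;
        self.next_free += 1;
        index
    }

    /// Adds the entry and returns the raw selector for ring 0.
    fn add_entry(&mut self, entry: Descriptor) -> u16 {
        let index = match entry {
            Descriptor::UserSegment(value) => self.push(value),
            Descriptor::SystemSegment(low, high) => {
                let index = self.push(low);
                self.push(high); // occupies the following slot
                index
            }
        };
        (index as u16) << 3
    }
}
```

Adding a code segment and then a TSS descriptor to a fresh table yields the selectors 0x08 and 0x10, and the TSS descriptor occupies the entries 2 and 3.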

Loading the GDT

To load the GDT, we add a new load method:

// in src/interrupts/gdt.rs

impl Gdt {
    pub fn load(&'static self) {
        use x86_64::instructions::tables::{DescriptorTablePointer, lgdt};
        use core::mem::size_of;

        let ptr = DescriptorTablePointer {
            base: self.table.as_ptr() as u64,
            limit: (self.table.len() * size_of::<u64>() - 1) as u16,
        };

        unsafe { lgdt(&ptr) };
    }
}

We use the DescriptorTablePointer struct and the lgdt function provided by the x86_64 crate to load our GDT. Again, we require a 'static reference since the GDT possibly needs to live for the rest of the run time.

Putting it together

We now have a double fault stack and are able to create and load a TSS (which contains an IST). So let’s put everything together to catch kernel stack overflows.

We already created a new TSS in our interrupts::init function. Now we can load this TSS by creating a new GDT:

// in src/interrupts/mod.rs

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    let mut tss = TaskStateSegment::new();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = VirtualAddress(
        double_fault_stack.top());

    let mut gdt = gdt::Gdt::new();
    let code_selector = gdt.add_entry(gdt::Descriptor::kernel_code_segment());
    let tss_selector = gdt.add_entry(gdt::Descriptor::tss_segment(&tss));
    gdt.load();

    IDT.load();
}

However, when we try to compile it, the following errors occur:

error: `tss` does not live long enough
   --> src/interrupts/mod.rs:118:68
    |
118 |    let tss_selector = gdt.add_entry(gdt::Descriptor::tss_segment(&tss));
    |                                         does not live long enough ^^^
...
122 | }
    | - borrowed value only lives until here
    |
    = note: borrowed value must be valid for the static lifetime...

error: `gdt` does not live long enough
   --> src/interrupts/mod.rs:119:5
    |
119 |    gdt.load();
    |    ^^^ does not live long enough
...
122 | }
    | - borrowed value only lives until here
    |
    = note: borrowed value must be valid for the static lifetime...

The problem is that we require that the TSS and GDT are valid for the rest of the run time (i.e. for the 'static lifetime). But our created tss and gdt live on the stack and are thus destroyed at the end of the init function. So how do we fix this problem?

We could allocate our TSS and GDT on the heap using Box and use into_raw and a bit of unsafe to convert it to &'static references (RFC 1233 was closed unfortunately).

Alternatively, we could store them in a static somehow. The lazy_static macro doesn’t work here, since we need access to the MemoryController for initialization. However, we can use its fundamental building block, the spin::Once type.

spin::Once

Let’s try to solve our problem using spin::Once:

// in src/interrupts/mod.rs

use spin::Once;

static TSS: Once<TaskStateSegment> = Once::new();
static GDT: Once<gdt::Gdt> = Once::new();

The Once type allows us to initialize a static at runtime. It is safe because the only way to access the static value is through the provided methods (call_once, try, and wait). Thus, no value can be read before initialization and the value can only be initialized once.
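The same pattern can be tried out on a normal host system. The sketch below uses `std::sync::OnceLock` from the standard library as a stand-in, since its `get_or_init` method plays a similar role to call_once. This is an analogy only: our kernel has no standard library and keeps using spin::Once, and the `Config` type and `init_config` function are made up for this example.

```rust
use std::sync::OnceLock;

/// Stand-in for a structure that needs runtime data for initialization.
#[derive(Debug)]
struct Config {
    double_fault_stack_top: u64,
}

static CONFIG: OnceLock<Config> = OnceLock::new();

/// Initializes the static on the first call. Later calls return the existing
/// value and ignore their argument, because the closure only runs once.
fn init_config(stack_top: u64) -> &'static Config {
    CONFIG.get_or_init(|| Config { double_fault_stack_top: stack_top })
}
```

The returned reference has the 'static lifetime even though the value was computed at runtime, which is exactly the property we need for the TSS and the GDT.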

(Once was added in spin 0.4, so you probably need to update your spin dependency.)

So let’s rewrite our interrupts::init function to use the static TSS and GDT:

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    let tss = TSS.call_once(|| {
        let mut tss = TaskStateSegment::new();
        tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = VirtualAddress(
            double_fault_stack.top());
        tss
    });

    let gdt = GDT.call_once(|| {
        let mut gdt = gdt::Gdt::new();
        let code_selector = gdt.add_entry(gdt::Descriptor::
                            kernel_code_segment());
        let tss_selector = gdt.add_entry(gdt::Descriptor::tss_segment(&tss));
        gdt
    });
    gdt.load();

    IDT.load();
}

Now it should compile again!

The final Steps

We’re almost done. We successfully loaded our new GDT, which contains a TSS descriptor. Now there are just a few steps left:

We changed our GDT, so we should reload cs, the code segment register. This is required since the old segment selector could point to a different GDT descriptor now (e.g. a TSS descriptor).
We loaded a GDT that contains a TSS selector, but we still need to tell the CPU that it should use that TSS.
As soon as our TSS is loaded, the CPU has access to a valid interrupt stack table (IST). Then we can tell the CPU that it should use our new double fault stack by modifying our double fault IDT entry.

For the first two steps, we need access to the code_selector and tss_selector variables outside of the closure. We can achieve this by moving the let declarations out of the closure:

// in src/interrupts/mod.rs
pub fn init(memory_controller: &mut MemoryController) {
    use x86_64::structures::gdt::SegmentSelector;
    use x86_64::instructions::segmentation::set_cs;
    use x86_64::instructions::tables::load_tss;
    ...

    let mut code_selector = SegmentSelector(0);
    let mut tss_selector = SegmentSelector(0);
    let gdt = GDT.call_once(|| {
        let mut gdt = gdt::Gdt::new();
        code_selector = gdt.add_entry(gdt::Descriptor::kernel_code_segment());
        tss_selector = gdt.add_entry(gdt::Descriptor::tss_segment(&tss));
        gdt
    });
    gdt.load();

    unsafe {
        // reload code segment register
        set_cs(code_selector);
        // load TSS
        load_tss(tss_selector);
    }

    IDT.load();
}

We first set the selectors to zero and then update them from inside the closure (which implicitly borrows them as &mut). Now we’re able to reload the code segment register using set_cs and to load the TSS using load_tss.
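The capture behavior can be reproduced in isolation. This sketch again uses `std::sync::OnceLock` as a host-side stand-in for spin::Once (an assumption for illustration only), and `TABLE` and `init_table` are hypothetical names. It also shows a caveat of this pattern: if the closure does not run because the value was already initialized, the outer variable keeps its placeholder value.

```rust
use std::sync::OnceLock;

static TABLE: OnceLock<Vec<u64>> = OnceLock::new();

/// Builds the table on the first call and reports the index of the added
/// entry through a variable that lives outside of the closure.
fn init_table() -> (&'static Vec<u64>, usize) {
    let mut code_index = 0; // placeholder, updated by the closure
    let table = TABLE.get_or_init(|| {
        let mut table = vec![0]; // the 0th entry stays zero
        table.push(0x0020_9800_0000_0000);
        code_index = table.len() - 1; // the closure borrows `code_index` mutably
        table
    });
    // the mutable borrow ended with the call, so we can read the variable again
    (table, code_index)
}
```

The first call returns index 1. A second call would return the placeholder 0, since the closure is skipped. This is not a problem for our init function as long as it is only called once.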

Now that we loaded a valid TSS and interrupt stack table, we can set the stack index for our double fault handler in the IDT:

// in src/interrupts/mod.rs

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();
        ...
        unsafe {
            idt.double_fault.set_handler_fn(double_fault_handler)
                .set_stack_index(DOUBLE_FAULT_IST_INDEX as u16);
        }
        ...
    };
}

The set_stack_index method is unsafe because the caller must ensure that the used index is valid and not already used for another exception.
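To make the notion of a "valid index" concrete, here is a minimal sketch of how a stack index could be written into the options word of an IDT entry. The helper name with_stack_index and the constant are hypothetical and not part of the x86_64 crate. The sketch only relies on the hardware layout: the IST field occupies the low three bits, the value 0 means "don't switch stacks", and the values 1 to 7 select one of the seven IST entries.

```rust
/// Number of stack pointers in the interrupt stack table of a 64-bit TSS.
const IST_ENTRIES: u16 = 7;

/// Hypothetical helper: stores a zero-based IST index in the low three bits
/// of an IDT entry's options word. The hardware field is one-based (0 means
/// "do not switch stacks"), so index 0 is stored as 1.
fn with_stack_index(options: u16, index: u16) -> Option<u16> {
    if index >= IST_ENTRIES {
        return None; // there is no such IST slot
    }
    // Clear the old IST bits, then write the one-based index.
    Some((options & !0b111) | (index + 1))
}
```

An out-of-range index is rejected here, but nothing in such a check can tell whether the stack behind a valid index is already used by another handler. That part stays the caller's responsibility.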

That’s it! Now the CPU should switch to the double fault stack whenever a double fault occurs. Thus, we are able to catch all double faults, including kernel stack overflows:

From now on we should never see a triple fault again!

What’s next?

Now that we mastered exceptions, it’s time to explore another kind of interrupt: interrupts from external devices such as timers, keyboards, or network controllers. These hardware interrupts are very similar to exceptions, e.g. they are also dispatched through the IDT.

However, unlike exceptions, they don’t arise directly on the CPU. Instead, an interrupt controller aggregates these interrupts and forwards them to the CPU depending on their priority. In the next posts we will explore the two interrupt controller variants on x86: the Intel 8259 (“PIC”) and the APIC. This will allow us to react to keyboard and mouse input.

Returning from Exceptions

Wed, 21 Sep 2016 00:00:00 +0000

In this post, we learn how to return from exceptions correctly. In the course of this, we will explore the iretq instruction, the C calling convention, multimedia registers, and the red zone.

As always, the complete source code is on GitHub. Please file issues for any problems, questions, or improvement suggestions. There is also a gitter chat and a comment section at the end of this page.

Note: This post describes how to handle exceptions using naked functions (see “Handling Exceptions with Naked Functions” for an overview). Our new way of handling exceptions can be found in the “Handling Exceptions” post.

Introduction

Most exceptions are fatal and can’t be resolved. For example, we can’t return from a divide-by-zero exception in a reasonable way. However, there are some exceptions that we can resolve:

Imagine a system that uses memory mapped files: We map a file into the virtual address space without loading it into memory. Whenever we access a part of the file for the first time, a page fault occurs. However, this page fault is not fatal. We can resolve it by loading the corresponding page from disk into memory and setting the present flag in the page table. Then we can return from the page fault handler and restart the failed instruction, which now successfully accesses the file data.

Memory mapped files are completely out of scope for us right now (we have neither a file concept nor a hard disk driver). So we need an exception that we can resolve easily so that we can return from it in a reasonable way. Fortunately, there is an exception that needs no resolution at all: the breakpoint exception.

The Breakpoint Exception

The breakpoint exception is the perfect exception to test our upcoming return-from-exception logic. Its only purpose is to temporarily pause a program when the breakpoint instruction int3 is executed.

Catching Breakpoints

Let’s start by defining a handler function for the breakpoint exception:

// in src/interrupts/mod.rs

extern "C" fn breakpoint_handler(stack_frame: &ExceptionStackFrame) -> !
{
    println!("\nEXCEPTION: BREAKPOINT at {:#x}\n{:#?}",
        stack_frame.instruction_pointer, stack_frame);
    loop {}
}

We print an error message and also output the instruction pointer and the rest of the stack frame. Note that this function does not return yet, since our handler! macro still requires a diverging function.

We need to register our new handler function in the interrupt descriptor table (IDT):

// in src/interrupts/mod.rs

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();

        idt.set_handler(0, handler!(divide_by_zero_handler));
        idt.set_handler(3, handler!(breakpoint_handler)); // new
        idt.set_handler(6, handler!(invalid_opcode_handler));
        idt.set_handler(14, handler_with_error_code!(page_fault_handler));

        idt
    };
}

We set the IDT entry with number 3 since it’s the vector number of the breakpoint exception.

Testing it

In order to test it, we insert an int3 instruction in our rust_main:

// in src/lib.rs
...
#[macro_use] // needed for the `int!` macro
extern crate x86_64;
...

#[no_mangle]
pub extern "C" fn rust_main(...) {
    ...
    interrupts::init();

    // trigger a breakpoint exception
    unsafe { int!(3) };

    println!("It did not crash!");
    loop {}
}

When we execute make run, we see the following:

It works! Now we “just” need to return from the breakpoint handler somehow so that we see the It did not crash message again.

Returning from Exceptions

So how do we return from exceptions? To make it easier, we look at a normal function return first:

When calling a function, the call instruction pushes the return address on the stack. When the called function is finished, it can return to the parent function through the ret instruction, which pops the return address from the stack and then jumps to it.

The exception stack frame, in contrast, looks a bit different:

Instead of pushing a return address, the CPU pushes the stack and instruction pointers (with their segment selectors), the RFLAGS register, and an optional error code. It also aligns the stack pointer to a 16 byte boundary before pushing values.
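As a rough sketch, the values pushed by the CPU can be described by a struct like the following. Only the instruction_pointer field name appears in this post. The other field names and the frame_from_words helper are assumptions made for illustration, and the optional error code is left out.

```rust
/// Values the CPU pushes on an exception, lowest address first.
/// Field names other than `instruction_pointer` are assumptions.
#[derive(Debug)]
#[repr(C)]
struct ExceptionStackFrame {
    instruction_pointer: u64,
    code_segment: u64,
    cpu_flags: u64,
    stack_pointer: u64,
    stack_segment: u64,
}

/// Size of the frame in bytes (without the optional error code).
const FRAME_SIZE: usize = core::mem::size_of::<ExceptionStackFrame>();

/// Hypothetical helper: interprets five consecutive stack words,
/// starting at the lowest address, as an exception stack frame.
fn frame_from_words(words: [u64; 5]) -> ExceptionStackFrame {
    ExceptionStackFrame {
        instruction_pointer: words[0],
        code_segment: words[1],
        cpu_flags: words[2],
        stack_pointer: words[3],
        stack_segment: words[4],
    }
}
```

The frame is five 8-byte values, so it occupies 40 bytes. A plain ret would only pop the first of them.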

So we can’t use a normal ret instruction, since it expects a different stack frame layout. Instead, there is a special instruction for returning from exceptions: iretq.

The `iretq` Instruction

The iretq instruction is the one and only way to return from exceptions and is specifically designed for this purpose. The AMD64 instruction manual (PDF) even demands that iretq “must be used to terminate the exception or interrupt handler associated with the exception”.

IRETQ restores rip, cs, rflags, rsp, and ss from the values saved on the stack and thus continues the interrupted program. The instruction does not handle the optional error code, so it must be popped from the stack before.

We see that iretq treats the stored instruction pointer as return address. For most exceptions, the stored rip points to the instruction that caused the fault. So by executing iretq, we restart the failing instruction. This makes sense because we should have resolved the exception when returning from it, so the instruction should no longer fail (e.g. the accessed part of the memory mapped file is now present in memory).

The situation is a bit different for the breakpoint exception, since it needs no resolution. Restarting the int3 instruction wouldn’t make sense, since it would cause a new breakpoint exception and we would enter an endless loop. For this reason the hardware designers decided that the stored rip should point to the next instruction after the int3 instruction.

Let’s check this for our breakpoint handler. Remember, the handler printed the following message (see the image above):

EXCEPTION: BREAKPOINT at 0x110970

So let’s disassemble the instruction at 0x110970 and its predecessor:

> objdump -d build/kernel-x86_64.bin | grep -B1 "110970:"
11096f:	cc                   	int3
110970:	48 c7 01 2a 00 00 00 	movq   $0x2a,(%rcx)

We see that 0x110970 indeed points to the next instruction after int3. So we can simply jump to the stored instruction pointer when we want to return from the breakpoint exception.

Implementation

Let’s update our handler! macro to support non-diverging exception handlers:

// in src/interrupts/mod.rs

macro_rules! handler {
    ($name: ident) => {{
        #[naked]
        extern "C" fn wrapper() -> ! {
            unsafe {
                asm!("mov rdi, rsp
                      sub rsp, 8 // align the stack pointer
                      call $0"
                      :: "i"($name as extern "C" fn(
                          &ExceptionStackFrame)) // no longer diverging
                      : "rdi" : "intel", "volatile");

                // new
                asm!("add rsp, 8 // undo stack pointer alignment
                      iretq"
                      :::: "intel", "volatile");
                ::core::intrinsics::unreachable();
            }
        }
        wrapper
    }}
}

When an exception handler returns from the call instruction, we use the iretq instruction to continue the interrupted program. Note that we need to undo the stack pointer alignment before, so that rsp points to the end of the exception stack frame again.

We’ve changed the handler function type, so we need to adjust our existing exception handlers:

// in src/interrupts/mod.rs

extern "C" fn divide_by_zero_handler(
-   stack_frame: &ExceptionStackFrame) -> ! {...}
+   stack_frame: &ExceptionStackFrame) {...}

extern "C" fn invalid_opcode_handler(
-   stack_frame: &ExceptionStackFrame) -> ! {...}
+   stack_frame: &ExceptionStackFrame) {...}

extern "C" fn breakpoint_handler(
-   stack_frame: &ExceptionStackFrame) -> ! {
+   stack_frame: &ExceptionStackFrame) {
    println!(...);
-   loop {}
}

Note that we also removed the loop {} at the end of our breakpoint_handler so that it no longer diverges. The divide_by_zero_handler and the invalid_opcode_handler still diverge (although the new function type would allow a return).

Testing

Let’s try our new iretq logic:

Instead of the expected “It did not crash” message after the breakpoint exception, we get a page fault. The strange thing is that our kernel tried to access address 0x1, which should never happen. So it seems like we messed up something important.

Debugging

Let’s debug it using GDB. For that we execute make debug in one terminal (which starts QEMU with the -s -S flags) and then make gdb (which starts and connects GDB) in a second terminal. For more information about GDB debugging, check out our Set Up GDB guide.

First we want to check if our iretq was successful. Therefore we set a breakpoint on the println!("It did not crash!") line in src/lib.rs. Let’s assume that it’s on line 61:

(gdb) break blog_os/src/lib.rs:61
Breakpoint 1 at 0x110a95: file /home/.../blog_os/src/lib.rs, line 61.

This line is after the int3 instruction, so we know that the iretq succeeded when the breakpoint is hit. To test this, we continue the execution:

(gdb) continue
Continuing.

Breakpoint 1, blog_os::rust_main (multiboot_information_address=1539136)
    at /home/.../blog_os/src/lib.rs:61
61	    println!("It did not crash!");

It worked! So our kernel successfully returned from the int3 instruction, which means that the iretq itself works.

However, when we continue the execution again, we get the page fault. So the exception occurs somewhere in the println logic. This means that it occurs in code generated by the compiler (and not e.g. in inline assembly). But the compiler should never access 0x1, so how is this happening?

The answer is that we’ve used the wrong calling convention for our exception handlers. Thus, we violate some compiler invariants so that the code that works fine without intermediate exceptions starts to violate memory safety when it’s executed after a breakpoint exception.

Calling Conventions

Exceptions are quite similar to function calls: The CPU jumps to the first instruction of the (handler) function and executes the function. Afterwards, if the function is not diverging, the CPU jumps to the return address and continues the execution of the parent function.

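On x86_64 Linux, the C calling convention (specified in the System V ABI) defines the following rules for such calls:

the first six integer arguments are passed in registers rdi, rsi, rdx, rcx, r8, r9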
additional arguments are passed on the stack
results are returned in rax and rdx

Note that Rust does not follow the C ABI (in fact, there isn’t even a Rust ABI yet). So these rules apply only to functions declared as extern "C" fn.

Preserved and Scratch Registers

The calling convention divides the registers into two parts: preserved and scratch registers.

The values of the preserved registers must remain unchanged across function calls. So a called function (the “callee”) is only allowed to overwrite these registers if it restores their original values before returning. Therefore these registers are called “callee-saved”. A common pattern is to save these registers to the stack at the function’s beginning and restore them just before returning.

In contrast, a called function is allowed to overwrite scratch registers without restrictions. If the caller wants to preserve the value of a scratch register across a function call, it needs to back it up and restore it (e.g. by pushing it to the stack before the function call). So the scratch registers are caller-saved.

On x86_64, the C calling convention specifies the following preserved and scratch registers:

preserved registers	scratch registers
`rbp`, `rbx`, `rsp`, `r12`, `r13`, `r14`, `r15`	`rax`, `rcx`, `rdx`, `rsi`, `rdi`, `r8`, `r9`, `r10`, `r11`
callee-saved	caller-saved
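The table can also be written down as code. The following sketch encodes the classification from the table; the Register enum and the function names are made up for this illustration and do not come from any crate.

```rust
/// The sixteen general purpose registers of x86_64.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Register {
    Rax, Rbx, Rcx, Rdx, Rsi, Rdi, Rsp, Rbp,
    R8, R9, R10, R11, R12, R13, R14, R15,
}

const ALL_REGISTERS: [Register; 16] = [
    Register::Rax, Register::Rbx, Register::Rcx, Register::Rdx,
    Register::Rsi, Register::Rdi, Register::Rsp, Register::Rbp,
    Register::R8, Register::R9, Register::R10, Register::R11,
    Register::R12, Register::R13, Register::R14, Register::R15,
];

/// Returns true if the callee must preserve the register (callee-saved).
fn is_preserved(reg: Register) -> bool {
    use Register::*;
    matches!(reg, Rbp | Rbx | Rsp | R12 | R13 | R14 | R15)
}

/// All registers a called function may overwrite freely (caller-saved).
fn scratch_registers() -> Vec<Register> {
    ALL_REGISTERS
        .iter()
        .copied()
        .filter(|reg| !is_preserved(*reg))
        .collect()
}
```

Seven of the sixteen registers are preserved, which leaves nine scratch registers.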

The Exception Calling Convention

Since we don’t know when an exception occurs, we can’t back up any registers beforehand. This means that we can’t use a calling convention that relies on caller-saved registers for our exception handlers. But we do so at the moment: Our exception handlers are declared as extern "C" fn and thus use the C calling convention.

So here is what happens:

rust_main is executing; it writes some memory address into rax.
The int3 instruction causes a breakpoint exception.
Our breakpoint_handler prints to the screen and assumes that it can overwrite rax freely (since it’s a scratch register). Somehow the value 0 ends up in rax.
We return from the breakpoint exception using iretq.
rust_main continues and accesses the memory address in rax.
The CPU tries to access address 0x1, which causes a page fault.

So our exception handler erroneously assumes that the scratch registers were saved by the caller. But the caller (rust_main) couldn’t save any registers since it didn’t know that an exception would occur. So nobody saves rax and the other scratch registers, which leads to the page fault.
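The sequence of events can be replayed in a toy model. This is only a simulation in ordinary Rust, not kernel code: the Cpu struct, the LEFTOVER value, and the function names are assumptions chosen to mirror the steps listed above.

```rust
/// Toy model of the one register that matters here.
struct Cpu {
    rax: u64,
}

/// Arbitrary value the handler leaves behind (an assumption of this model).
const LEFTOVER: u64 = 0x1;

/// Stands in for an `extern "C"` handler, which may overwrite `rax` freely.
fn breakpoint_handler(cpu: &mut Cpu) {
    cpu.rax = LEFTOVER;
}

/// Models `rust_main`: keep an address in `rax`, get interrupted, then use
/// `rax` again. Returns the address that is actually accessed afterwards.
fn address_used_after_exception(address: u64) -> u64 {
    let mut cpu = Cpu { rax: address };
    breakpoint_handler(&mut cpu); // the exception arrives here, unannounced
    cpu.rax
}
```

Whatever address the interrupted code stored in rax, it ends up accessing the handler's leftover value instead.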

The problem is that we use a calling convention with caller-saved registers for our exception handlers. Instead, we need a calling convention that preserves all registers. In other words, all registers must be callee-saved:

extern "all-registers-callee-saved" fn exception_handler() {...}

Unfortunately, Rust does not support such a calling convention. It was proposed once, but did not get accepted for various reasons. The primary reason was that such calling conventions can be simulated by writing a naked wrapper function.

(Remember: Naked functions are functions without prologue and can contain only inline assembly. They were discussed in the previous post.)

A naked wrapper function

Such a naked wrapper function might look like this:

#[naked]
extern "C" fn calling_convention_wrapper() {
    unsafe {
        asm!("
            push rax
            push rcx
            push rdx
            push rsi
            push rdi
            push r8
            push r9
            push r10
            push r11
            // TODO: call exception handler with C calling convention
            pop r11
            pop r10
            pop r9
            pop r8
            pop rdi
            pop rsi
            pop rdx
            pop rcx
            pop rax
        " :::: "intel", "volatile");
    }
}

This wrapper function saves all scratch registers to the stack before calling the exception handler and restores them afterwards. Note that we pop the registers in reverse order.

We don’t need to back up preserved registers since they are callee-saved in the C calling convention. Thus, the compiler already takes care of preserving their values.
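The save and restore logic of such a wrapper can be modeled in plain Rust with a Vec as the stack. This is a simulation for illustration only. The Scratch type and both function names are hypothetical, and the real wrapper has to be written in assembly as shown above.

```rust
/// Model of the nine scratch registers, in the order the wrapper pushes
/// them: rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11.
type Scratch = [u64; 9];

/// Saves all scratch registers, runs the handler, and restores them.
fn call_with_wrapper(regs: &mut Scratch, handler: fn(&mut Scratch)) {
    let mut stack: Vec<u64> = Vec::new();
    for value in regs.iter() {
        stack.push(*value); // push rax .. push r11
    }
    handler(regs); // the handler may clobber every scratch register
    for slot in regs.iter_mut().rev() {
        // pop r11 .. pop rax: the last value pushed is restored first
        *slot = stack.pop().expect("one saved value per register");
    }
}

/// A handler that overwrites all scratch registers.
fn clobbering_handler(regs: &mut Scratch) {
    *regs = [0; 9];
}
```

Because a stack is last-in, first-out, iterating in reverse while popping puts every value back into the register it came from.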

Fixing our Handler Macro

Let’s update our handler macro to fix the calling convention problem. To do so, we need to back up and restore all scratch registers. For that we create two new macros:

// in src/interrupts/mod.rs

macro_rules! save_scratch_registers {
    () => {
        asm!("push rax
              push rcx
              push rdx
              push rsi
              push rdi
              push r8
              push r9
              push r10
              push r11
        " :::: "intel", "volatile");
    }
}

macro_rules! restore_scratch_registers {
    () => {
        asm!("pop r11
              pop r10
              pop r9
              pop r8
              pop rdi
              pop rsi
              pop rdx
              pop rcx
              pop rax
            " :::: "intel", "volatile");
    }
}

We need to declare these macros above our handler macro, since macros are only available after their declaration.

Now we can use these macros to fix our handler! macro:

// in src/interrupts/mod.rs

macro_rules! handler {
    ($name: ident) => {{
        #[naked]
        extern "C" fn wrapper() -> ! {
            unsafe {
                save_scratch_registers!();
                asm!("mov rdi, rsp
                      add rdi, 9*8 // calculate exception stack frame pointer
                      // sub rsp, 8 (stack is aligned already)
                      call $0"
                      :: "i"($name as
                             extern "C" fn(&ExceptionStackFrame))
                      : "rdi" : "intel", "volatile");

                restore_scratch_registers!();
                asm!("
                      // add rsp, 8 (undo stack alignment; not needed anymore)
                      iretq"
                      :::: "intel", "volatile");
                ::core::intrinsics::unreachable();
            }
        }
        wrapper
    }}
}

It’s important that we save the registers first, before we modify any of them. After the call instruction (but before iretq) we restore the registers again. Because we’re now changing rsp (by pushing the register values) before we load it into rdi, we would get a wrong exception stack frame pointer. Therefore we need to adjust it by adding the number of bytes we push. We push 9 registers that are 8 bytes each, so 9 * 8 bytes in total.

Note that we no longer need to manually align the stack pointer, because we’re pushing an odd number of registers in save_scratch_registers. Thus the stack pointer already has the required 16-byte alignment.
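The arithmetic behind both claims can be checked with a few lines of Rust. The sketch assumes an exception without error code, so the CPU aligns the stack pointer to 16 bytes and then pushes five 8-byte values. The function names and the example address 0x1000 are made up for this illustration.

```rust
/// Size of one pushed value in bytes.
const WORD: u64 = 8;

/// Stack pointer after `count` pushes of 8 bytes each (the stack grows down).
fn rsp_after_pushes(rsp: u64, count: u64) -> u64 {
    rsp - count * WORD
}

/// Alignment that the C calling convention expects at a `call` instruction.
fn is_16_byte_aligned(rsp: u64) -> bool {
    rsp % 16 == 0
}

/// Address of the exception stack frame, computed from the stack pointer
/// after `saved` registers were pushed on top of the frame.
fn frame_pointer(rsp: u64, saved: u64) -> u64 {
    rsp + saved * WORD
}
```

Starting from an aligned 0x1000, the five pushes of the CPU leave rsp at 0xfd8, which is off by 8. The nine register pushes move it to 0xf90, which is aligned again, and adding 9 * 8 to that value yields 0xfd8, the start of the exception stack frame. For exceptions with an error code the CPU pushes six values, so the same reasoning would not hold unchanged.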

Testing it again

Let’s test it again with our corrected handler! macro:

The page fault is gone and we see the “It did not crash” message again!

So the page fault occurred because our exception handler didn’t preserve the scratch register rax. Our new handler! macro fixes this problem by saving all scratch registers (including rax) before calling exception handlers. Thus, rax still contains the valid memory address when rust_main continues execution.

Multimedia Registers

When we discussed calling conventions above, we assumed that an x86_64 CPU only has the following 16 registers: rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8, r9, r10, r11, r12, r13, r14, and r15. These registers are called general purpose registers since each of them can be used for arithmetic and load/store instructions.

However, modern CPUs also have a set of special purpose registers, which can be used to improve performance in several use cases. On x86_64, the most important set of special purpose registers are the multimedia registers. These registers are larger than the general purpose registers and can be used to speed up audio/video processing or matrix calculations. For example, we could use them to add two 4-dimensional vectors in a single CPU instruction:

Such multimedia instructions are called Single Instruction Multiple Data (SIMD) instructions, because they simultaneously perform an operation (e.g. addition) on multiple data words. Good compilers are able to transform normal loops into such SIMD code automatically. This process is called auto-vectorization and can lead to huge performance improvements.

However, auto-vectorization causes a problem for us: Most of the multimedia registers are caller-saved. According to our discussion of calling conventions above, this means that our exception handlers erroneously assume that they are allowed to overwrite them without preserving their values.

We don’t use any multimedia registers explicitly, but the Rust compiler might auto-vectorize our code (including the exception handlers). Thus we could silently clobber the multimedia registers, which leads to the same problems as above:

This example shows a program that is using the first three multimedia registers (mm0 to mm2). At some point, an exception occurs and control is transferred to the exception handler. The exception handler uses mm1 for its own data and thus overwrites the previous value. When the exception is resolved, the CPU continues the interrupted program again. However, the program is now corrupt since it relies on the original mm1 value.

Saving and Restoring Multimedia Registers

In order to fix this problem, we need to back up all caller-saved multimedia registers before we call the exception handler. The problem is that the set of multimedia registers varies between CPUs. There are different standards:

MMX: The MMX instruction set was introduced in 1997 and defines eight 64 bit registers called mm0 through mm7. These registers are just aliases for the registers of the x87 floating point unit.
SSE: The Streaming SIMD Extensions instruction set was introduced in 1999. Instead of re-using the floating point registers, it adds a completely new register set. The sixteen new registers are called xmm0 through xmm15 and are 128 bits each.
AVX: The Advanced Vector Extensions are extensions that further increase the size of the multimedia registers. The new registers are called ymm0 through ymm15 and are 256 bits each. They extend the xmm registers, so e.g. xmm0 is the lower half of ymm0.

The Rust compiler (and LLVM) assume that the x86_64-unknown-linux-gnu target supports only MMX and SSE, so we don’t need to save ymm0 through ymm15. But we need to save xmm0 through xmm15 and also mm0 through mm7. There is a special instruction to do this: fxsave. This instruction saves the floating point and multimedia state to a given address. It needs 512 bytes to store that state.

In order to save/restore the multimedia registers, we could add new macros:

macro_rules! save_multimedia_registers {
    () => {
        asm!("sub rsp, 512
              fxsave [rsp]
        " :::: "intel", "volatile");
    }
}

macro_rules! restore_multimedia_registers {
    () => {
        asm!("fxrstor [rsp]
              add rsp, 512
            " :::: "intel", "volatile");
    }
}

First, we reserve the 512 bytes on the stack and then we use fxsave to back up the multimedia registers. In order to restore them later, we use the fxrstor instruction. Note that fxsave and fxrstor require a 16 byte aligned memory address.
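The size and alignment requirements can be expressed in the type system. The following sketch shows a buffer type that would satisfy them. The name FxSaveArea and its methods are hypothetical, and the sketch does not execute fxsave itself.

```rust
/// Buffer with the size and alignment that `fxsave` and `fxrstor` require.
/// The type name is an assumption made for this illustration.
#[repr(C, align(16))]
struct FxSaveArea {
    bytes: [u8; 512],
}

impl FxSaveArea {
    /// Creates a zero-filled save area.
    fn zeroed() -> Self {
        FxSaveArea { bytes: [0; 512] }
    }

    /// Address that would be passed to `fxsave` as its memory operand.
    fn address(&self) -> usize {
        self.bytes.as_ptr() as usize
    }
}
```

In the macros above the same guarantee comes from the stack pointer instead: rsp is 16-byte aligned after save_scratch_registers, and 512 is a multiple of 16.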

However, we won’t do it that way. The problem is the large amount of memory required. We will reuse the same code when we handle hardware interrupts in a future post. So for each mouse click, pressed key, or arrived network packet we need to write 512 bytes to memory. This would be a huge performance problem.

Fortunately, there exists an alternative solution.

Disabling Multimedia Extensions

We just disable MMX, SSE, and all the other fancy multimedia extensions in our kernel¹. This way, our exception handlers won’t clobber the multimedia registers because they won’t use them at all.

¹ Userspace programs will still be able to use the multimedia registers.

This solution has its own disadvantages, of course. For example, it leads to slower kernel code because the compiler can’t perform any auto-vectorization optimizations. But it’s still the faster solution (since we save many memory accesses) and most kernels do it this way (including Linux).

So how do we disable MMX and SSE? Well, we just tell the compiler that our target system doesn’t support it. Since the very beginning, we’re compiling our kernel for the x86_64-unknown-linux-gnu target. This worked fine so far, but now we want a different target without support for multimedia extensions. We can do so by creating a target configuration file.

Target Specifications

In order to disable the multimedia extensions for our kernel, we need to compile for a custom target. We want a target that is equal to x86_64-unknown-linux-gnu, but without MMX and SSE support. Rust allows us to specify such a target using a JSON configuration file.

A minimal target specification that describes the x86_64-unknown-linux-gnu target looks like this:

{
  "llvm-target": "x86_64-unknown-linux-gnu",
  "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
  "target-endian": "little",
  "target-pointer-width": "64",
  "target-c-int-width": "32",
  "arch": "x86_64",
  "os": "none"
}

The llvm-target field specifies the target triple that is passed to LLVM. We want to derive a 64-bit Linux target, so we choose x86_64-unknown-linux-gnu. The data-layout field is also passed to LLVM and specifies how data should be laid out in memory. It consists of various specifications separated by a - character. For example, the e means little endian and S128 specifies that the stack should be 128 bits (= 16 byte) aligned. The format is described in detail in the LLVM documentation but there shouldn’t be a reason to change this string.

The other fields are used for conditional compilation. This allows crate authors to use cfg variables to write special code depending on the OS or the architecture. There isn’t any up-to-date documentation about these fields but the corresponding source code is quite readable.
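As a small illustration of such cfg variables, the following sketch queries two of them through the cfg! macro. The function names are made up. The target_pointer_width and target_os variables correspond to the "target-pointer-width" and "os" fields of the target specification.

```rust
/// True when compiling for a target whose specification says
/// `"target-pointer-width": "64"`.
fn is_64_bit_target() -> bool {
    cfg!(target_pointer_width = "64")
}

/// Picks a description based on the `os` field of the target.
fn os_description() -> &'static str {
    if cfg!(target_os = "none") {
        "bare metal"
    } else {
        "hosted"
    }
}
```

The same variables can be used in #[cfg(...)] attributes to include or exclude whole items, which is how crates typically provide target-specific code.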

Disabling MMX and SSE

In order to disable the multimedia extensions, we create a new target named x86_64-blog_os. To describe this target, we create a file named x86_64-blog_os.json in the project root with the following content:

{
  "llvm-target": "x86_64-unknown-linux-gnu",
  "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
  "target-endian": "little",
  "target-pointer-width": "64",
  "target-c-int-width": "32",
  "arch": "x86_64",
  "os": "none",
  "features": "-mmx,-sse"
}

It’s equal to the x86_64-unknown-linux-gnu target but has one additional option: "features": "-mmx,-sse". So we added two target features: -mmx and -sse. The minus prefix specifies that our target does not support the feature. So by specifying -mmx and -sse, we disable the default mmx and sse features.

In order to compile for the new target, we need to adjust our Makefile:

# in `Makefile`

 arch ?= x86_64
-target ?= $(arch)-unknown-linux-gnu
+target ?= $(arch)-blog_os
...

The new target name (x86_64-blog_os) is the file name of the JSON configuration file without the .json extension.

Cross compilation

Let’s check whether our kernel still works with the new target:

> make run
Compiling raw-cpuid v2.0.1
Compiling rlibc v0.1.5
Compiling x86 v0.7.1
Compiling spin v0.3.5
error[E0463]: can't find crate for `core`

error: aborting due to previous error

Build failed, waiting for other jobs to finish...
...
Makefile:52: recipe for target 'cargo' failed
make: *** [cargo] Error 101

It doesn’t compile anymore. The error tells us that the Rust compiler no longer finds the core library.

The core library is implicitly linked to all no_std crates and contains things such as Result, Option, and iterators. We’ve used that library without problems since the very beginning, so why is it no longer available?

The problem is that the core library is distributed together with the Rust compiler as a precompiled library. So it is only valid for the host triple, which is x86_64-unknown-linux-gnu in our case. If we want to compile code for other targets, we need to recompile core for these targets first.

Xargo

That’s where xargo comes in. It is a wrapper for cargo that eases cross compilation. We can install it by executing:

cargo install xargo

Xargo depends on the rust source code, which we can install with rustup component add rust-src.

Xargo is “a drop-in replacement for cargo”, so every cargo command also works with xargo. You can do e.g. xargo --help, xargo clean, or xargo doc. However, the build command gains additional functionality: xargo build will automatically cross compile the core library when compiling for custom targets.

That’s exactly what we want, so we change one letter in our Makefile:

# in `Makefile`
...

cargo:
-	@cargo build --target $(target)
+	@xargo build --target $(target)
...

Now the build goes through xargo, which should fix the compilation error. Let’s try it out:

> make run
Compiling core v0.0.0 (file:///home/…/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore)
LLVM ERROR: SSE register return with SSE disabled
error: Could not compile `core`.

Well, we get a different error now, so it seems like we’re making progress :). It seems like there is an “SSE register return” although SSE is disabled. But what’s an “SSE register return”?

SSE Register Return

Remember when we discussed calling conventions above? The calling convention defines which registers are used for return values. Well, the System V ABI defines that xmm0 should be used for returning floating point values. So somewhere in the core library a function returns a float and LLVM doesn’t know what to do. The ABI says “use xmm0” but the target specification says “don’t use xmm registers”.
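A function of the following shape is enough to run into this conflict. The two functions are made-up examples, not code from the core library. On a target with SSE enabled, the first one hands its result back in xmm0, while the second one uses rax as described in the calling convention rules above.

```rust
/// Returns a floating point value. Under the System V ABI with SSE
/// available, the result travels back to the caller in `xmm0`.
extern "C" fn half(value: f64) -> f64 {
    value / 2.0
}

/// Returns an integer value, which is passed back in `rax` instead.
extern "C" fn twice(value: u64) -> u64 {
    value * 2
}
```

Only the first kind of function is a problem for a target without SSE, since there is no xmm0 register that the compiler is allowed to use.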

In order to fix this problem, we need to change our float ABI. The idea is to avoid normal hardware-supported floats and use a pure software implementation instead. We can do so by enabling the soft-float feature for our target. For that, we edit x86_64-blog_os.json:

{
  "llvm-target": "x86_64-unknown-linux-gnu",
  ...
  "features": "-mmx,-sse,+soft-float"
}

The plus prefix tells LLVM to enable the soft-float feature.
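To make the prefix notation explicit, here is a small sketch that splits such a feature string into names and enabled flags. The parse_features function is a hypothetical helper written for this illustration. It is not how rustc or LLVM process the string internally.

```rust
/// Splits a comma-separated feature string into (name, enabled) pairs.
/// A `+` prefix enables a feature and a `-` prefix disables it. Entries
/// without a prefix are skipped in this sketch.
fn parse_features(features: &str) -> Vec<(&str, bool)> {
    features
        .split(',')
        .filter_map(|entry| {
            let entry = entry.trim();
            if let Some(name) = entry.strip_prefix('+') {
                Some((name, true))
            } else if let Some(name) = entry.strip_prefix('-') {
                Some((name, false))
            } else {
                None
            }
        })
        .collect()
}
```

For our feature string this yields mmx and sse as disabled and soft-float as enabled.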

Let’s try make run again:

> make run
   Compiling core v0.0.0 (file:///…/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore)
    Finished release [optimized] target(s) in 21.95 secs
   Compiling spin v0.4.5
   Compiling once v0.3.2
   Compiling x86 v0.8.0
   Compiling bitflags v0.9.1
   Compiling raw-cpuid v2.0.1
   Compiling rlibc v0.1.5
   Compiling linked_list_allocator v0.2.3
   Compiling volatile v0.1.0
   Compiling bitflags v0.4.0
   Compiling bit_field v0.5.0
   Compiling spin v0.3.5
   Compiling multiboot2 v0.1.0
   Compiling lazy_static v0.2.2
   Compiling hole_list_allocator v0.1.0 (file:///…/libs/hole_list_allocator)
   Compiling blog_os v0.1.0 (file:///…)
error[E0463]: can't find crate for `alloc`
  --> src/lib.rs:33:1
   |
33 | extern crate alloc;
   | ^^^^^^^^^^^^^^^^^^^ can't find crate

error: aborting due to previous error

We see that xargo now compiles the core crate in release mode. Then it starts the normal cargo build. Cargo then recompiles all dependencies, since it needs to generate different code for the new target.

However, the build still fails. The reason is that xargo only installs core by default, but we also need the alloc crate. We can enable it by creating a file named Xargo.toml with the following contents:

# Xargo.toml

[target.x86_64-blog_os.dependencies]
alloc = {}

Now xargo compiles alloc, too:

> make run
   Compiling core v0.0.0 (file:///…/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore)
   Compiling std_unicode v0.0.0 (file:///…/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd_unicode)
   Compiling alloc v0.0.0 (file:///…/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/liballoc)
    Finished release [optimized] target(s) in 28.84 secs
   Compiling blog_os v0.1.0 (file:///…/Documents/blog_os/master)
warning: unused variable: `allocator` […]
warning: unused variable: `frame` […]

    Finished debug [unoptimized + debuginfo] target(s) in 1.75 secs

It worked! Now we have a kernel that never touches the multimedia registers! We can verify this by executing:

> objdump -d build/kernel-x86_64.bin | grep "mm[0-9]"

If the command produces no output, our kernel uses neither MMX (mm0 – mm7) nor SSE (xmm0 – xmm15) registers.
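Note that the pattern mm[0-9] covers both register families, since every xmmN name contains mmN. A minimal sketch of the same check in Rust, assuming the disassembly is available as text (the function name is made up for illustration):

```rust
/// Returns all lines of a disassembly listing that contain "mm" followed by
/// a digit, which matches MMX (`mm0`) as well as SSE (`xmm0`) register names.
fn lines_with_multimedia_registers(disassembly: &str) -> Vec<&str> {
    disassembly
        .lines()
        .filter(|line| {
            line.as_bytes()
                .windows(3)
                .any(|w| w[0] == b'm' && w[1] == b'm' && w[2].is_ascii_digit())
        })
        .collect()
}
```

An empty result corresponds to the empty grep output described above.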

So now our return-from-exception logic works without problems in most cases. However, there is still a pitfall hidden in the C calling convention, which might cause hideous bugs in some rare cases.

The Red Zone

The red zone is an optimization of the System V ABI that allows functions to temporarily use the 128 bytes below their stack frame without adjusting the stack pointer:

The image shows the stack frame of a function with n local variables. On function entry, the stack pointer is adjusted to make room on the stack for the local variables.

The red zone is defined as the 128 bytes below the adjusted stack pointer. The function can use this area for temporary data that’s not needed across function calls. Thus, the two instructions for adjusting the stack pointer can be avoided in some cases (e.g. in small leaf functions).

However, this optimization leads to huge problems with exceptions. Let’s assume that an exception occurs while a function uses the red zone:

The CPU and the exception handler overwrite the data in the red zone. But this data is still needed by the interrupted function. So the function won’t work correctly anymore when we return from the exception handler. It might fail or cause another exception, but it could also lead to strange bugs that take weeks to debug.
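The following toy model illustrates the clobbering. It is only a sketch that runs on the host: ToyStack and all concrete values are made up for illustration, and the alignment step of the CPU is left out because the stack pointer is already 16-byte aligned here.

```rust
/// A toy stack that grows downwards. `slots[i]` models the 8 bytes at
/// byte address `i * 8`; `rsp` is a byte address.
struct ToyStack {
    slots: Vec<u64>,
    rsp: usize,
}

impl ToyStack {
    fn new(slot_count: usize) -> Self {
        ToyStack { slots: vec![0; slot_count], rsp: slot_count * 8 }
    }

    /// Models a push: decrement rsp by 8, then store the value there.
    fn push(&mut self, value: u64) {
        self.rsp -= 8;
        self.slots[self.rsp / 8] = value;
    }
}

/// Returns the red zone temporary before and after a simulated exception.
fn red_zone_clobber_demo() -> (u64, u64) {
    let mut stack = ToyStack::new(64);
    stack.rsp -= 16 * 8; // space used by earlier callers (placeholder)

    // A leaf function stores a temporary in the red zone, i.e. directly
    // below rsp, without adjusting rsp.
    let temp_slot = (stack.rsp - 8) / 8;
    stack.slots[temp_slot] = 0x1234;
    let before = stack.slots[temp_slot];

    // An exception occurs: the CPU pushes ss, rsp, rflags, cs, and rip.
    let interrupted_rsp = stack.rsp as u64;
    for value in [0x10, interrupted_rsp, 0x2, 0x8, 0x10_0000] {
        stack.push(value);
    }

    (before, stack.slots[temp_slot])
}
```

The very first push of the CPU lands in the slot that held the temporary, so the function sees 0x10 (the pushed stack segment) instead of 0x1234 after the handler returns.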

Adjusting our Exception Handler?

The problem is that the System V ABI demands that the red zone “shall not be modified by signal or interrupt handlers.” Our current exception handlers do not respect this. We could try to fix it by subtracting 128 from the stack pointer before pushing anything:

sub rsp, 128
save_scratch_registers()
...
call ...
...
restore_scratch_registers()
add rsp, 128

iretq

This will not work. The problem is that the CPU pushes the exception stack frame before even calling our handler function. So the CPU itself will clobber the red zone and there is nothing we can do about that. So our only chance is to disable the red zone.

Disabling the Red Zone

The red zone is a property of our target, so in order to disable it we edit our x86_64-blog_os.json one last time:

{
  "llvm-target": "x86_64-unknown-linux-gnu",
  ...
  "features": "-mmx,-sse,+soft-float",
  "disable-redzone": true
}

We add one additional option at the end: "disable-redzone": true. As you might guess, this option disables the red zone optimization.

Now we have a red-zone-free kernel!

Exceptions with Error Codes

We’re now able to correctly return from exceptions without error codes. However, we still can’t return from exceptions that push an error code (e.g. page faults). Let’s fix that by updating our handler_with_error_code macro:

// in src/interrupts/mod.rs

macro_rules! handler_with_error_code {
    ($name: ident) => {{
        #[naked]
        extern "C" fn wrapper() -> ! {
            unsafe {
                asm!("pop rsi // pop error code into rsi
                      mov rdi, rsp
                      sub rsp, 8 // align the stack pointer
                      call $0"
                      :: "i"($name as extern "C" fn(
                          &ExceptionStackFrame, u64))
                      : "rdi","rsi" : "intel");
                asm!("iretq" :::: "intel", "volatile");
                ::core::intrinsics::unreachable();
            }
        }
        wrapper
    }}
}

First, we change the type of the handler function: no more -> !, so it no longer needs to diverge. We also add an iretq instruction at the end.

Now we can make our page_fault_handler non-diverging:

// in src/interrupts/mod.rs

 extern "C" fn page_fault_handler(stack_frame: &ExceptionStackFrame,
-   error_code: u64) -> ! { ... }
+   error_code: u64) { ... }

However, now we have the same problem as above: The handler function will overwrite the scratch registers and cause bugs when returning. Let’s fix this by invoking save_scratch_registers at the beginning:

// in src/interrupts/mod.rs

macro_rules! handler_with_error_code {
    ($name: ident) => {{
        #[naked]
        extern "C" fn wrapper() -> ! {
            unsafe {
                save_scratch_registers!();
                asm!("pop rsi // pop error code into rsi
                      mov rdi, rsp
                      add rdi, 10*8 // calculate exception stack frame pointer
                      sub rsp, 8 // align the stack pointer
                      call $0
                      add rsp, 8 // undo stack pointer alignment
                      " :: "i"($name as extern "C" fn(
                          &ExceptionStackFrame, u64))
                      : "rdi","rsi" : "intel");
                restore_scratch_registers!();
                asm!("iretq" :::: "intel", "volatile");
                ::core::intrinsics::unreachable();
            }
        }
        wrapper
    }}
}

Now we back up the scratch registers to the stack right at the beginning and restore them just before the iretq. Like in the handler macro, we now need to add 10*8 to rdi in order to get the correct exception stack frame pointer (save_scratch_registers pushes nine 8-byte registers, plus the error code). We also need to undo the stack pointer alignment after the call ².

² The stack alignment is actually wrong here, since we additionally pushed an odd number of registers. However, the pop rsi is wrong too, since the error code is no longer at the top of the stack. When we fix that problem, the stack alignment becomes correct again. So I left it in to keep things simple.

Now we have one last bug: We pop the error code into rsi, but the error code is no longer at the top of the stack (since save_scratch_registers pushed 9 registers on top of it). So we need to do it differently:

// in src/interrupts/mod.rs

macro_rules! handler_with_error_code {
    ($name: ident) => {{
        #[naked]
        extern "C" fn wrapper() -> ! {
            unsafe {
                save_scratch_registers!();
                asm!("mov rsi, [rsp + 9*8] // load error code into rsi
                      mov rdi, rsp
                      add rdi, 10*8 // calculate exception stack frame pointer
                      sub rsp, 8 // align the stack pointer
                      call $0
                      add rsp, 8 // undo stack pointer alignment
                      " :: "i"($name as extern "C" fn(
                          &ExceptionStackFrame, u64))
                      : "rdi","rsi" : "intel");
                restore_scratch_registers!();
                asm!("add rsp, 8 // pop error code
                      iretq" :::: "intel", "volatile");
                ::core::intrinsics::unreachable();
            }
        }
        wrapper
    }}
}

Instead of using pop, we calculate the error code address manually (save_scratch_registers pushes nine 8-byte registers) and load it into rsi using a mov. So now the error code stays on the stack. But iretq doesn’t handle the error code, so we need to pop it before invoking iretq.
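The two offsets can be checked with a small host-side simulation of the pushes. This is only a sketch: the stack is modeled as an array, the register contents are placeholders, and the error code value 0b10 is an arbitrary example.

```rust
/// Models a push onto a downward-growing stack of 8-byte slots.
fn push(stack: &mut [u64], rsp: &mut usize, value: u64) {
    *rsp -= 8;
    stack[*rsp / 8] = value;
}

/// Replays all pushes up to the point where the wrapper loads its arguments.
/// Returns (value at rsp + 9*8, value at rsp + 10*8).
fn load_handler_arguments() -> (u64, u64) {
    let mut stack = [0u64; 32];
    let mut rsp = stack.len() * 8; // byte offset of the stack top

    // The CPU pushes ss, rsp, rflags, cs, and rip (example values) ...
    for value in [16, 0x11dea0, 0x200006, 8, 0x110281] {
        push(&mut stack, &mut rsp, value);
    }
    // ... followed by the error code.
    push(&mut stack, &mut rsp, 0b10);

    // save_scratch_registers pushes nine registers (placeholder contents).
    for register in 0..9u64 {
        push(&mut stack, &mut rsp, 0xaaaa_0000 + register);
    }

    let error_code = stack[(rsp + 9 * 8) / 8]; // mov rsi, [rsp + 9*8]
    let frame_start = stack[(rsp + 10 * 8) / 8]; // first field at rsp + 10*8
    (error_code, frame_start)
}
```

The first value is the error code and the second is the instruction pointer, which is the first field of the exception stack frame.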

Phew! That was a lot of fiddling with assembly. Let’s test if it still works.

Testing

First, we test if the exception stack frame pointer and the error code are still correct:

// in rust_main in src/lib.rs

...
unsafe { int!(3) };

// provoke a page fault
unsafe { *(0xdeadbeaf as *mut u64) = 42; }

println!("It did not crash!");
loop {}

This should cause the following error message:

EXCEPTION: PAGE FAULT while accessing 0xdeadbeaf
error code: CAUSED_BY_WRITE
ExceptionStackFrame {
    instruction_pointer: 1114753,
    code_segment: 8,
    cpu_flags: 2097158,
    stack_pointer: 1171104,
    stack_segment: 16
}

The error code should still be CAUSED_BY_WRITE and the exception stack frame values should also be correct (e.g. code_segment should be 8 and stack_segment should be 16).
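The values appear in decimal because of the derived Debug implementation. If hexadecimal output is easier to read, one option is a manual Debug implementation. This is only a sketch that uses std on the host (in the kernel it would be core::fmt), and the Hex helper type is hypothetical:

```rust
use std::fmt;

#[repr(C)]
struct ExceptionStackFrame {
    instruction_pointer: u64,
    code_segment: u64,
    cpu_flags: u64,
    stack_pointer: u64,
    stack_segment: u64,
}

/// Hypothetical helper that prints a number in hexadecimal in Debug output.
struct Hex(u64);

impl fmt::Debug for Hex {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:#x}", self.0)
    }
}

impl fmt::Debug for ExceptionStackFrame {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("ExceptionStackFrame")
            .field("instruction_pointer", &Hex(self.instruction_pointer))
            .field("code_segment", &Hex(self.code_segment))
            .field("cpu_flags", &Hex(self.cpu_flags))
            .field("stack_pointer", &Hex(self.stack_pointer))
            .field("stack_segment", &Hex(self.stack_segment))
            .finish()
    }
}
```

With this implementation, the frame from the output above would show instruction_pointer as 0x110281 and stack_segment as 0x10.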

Returning from Page Faults

Let’s see what happens if we comment out the trailing loop in our page fault handler:

We see that the same error message is printed over and over again. Here is what happens:

The CPU executes rust_main and tries to access 0xdeadbeaf. This causes a page fault.
The page fault handler prints an error message and returns without fixing the cause of the exception (0xdeadbeaf is still inaccessible).
The CPU restarts the instruction that caused the page fault and thus tries to access 0xdeadbeaf again. Of course, this causes a page fault again.
The page fault handler prints the error message and returns.

… and so on. Thus, our code indefinitely jumps between the page fault handler and the instruction that accesses 0xdeadbeaf.

This is a good thing! It means that our iretq logic is working correctly, since it returns to the correct instruction every time. So our handler_with_error_code macro seems to be correct.
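The restart behavior can be modeled in a few lines. This is a simplified sketch, not kernel code: the handler_fixes_cause flag stands for a hypothetical handler that makes the address accessible, and max_attempts only exists to keep the simulation finite.

```rust
/// Simulates an instruction that is restarted after every page fault.
/// Returns the number of faults and whether the access finally succeeded.
fn run_faulting_access(handler_fixes_cause: bool, max_attempts: u32) -> (u32, bool) {
    let mut address_accessible = false;
    let mut fault_count = 0;
    for _ in 0..max_attempts {
        if address_accessible {
            // the restarted instruction completes
            return (fault_count, true);
        }
        // the access faults and the page fault handler runs
        fault_count += 1;
        if handler_fixes_cause {
            address_accessible = true;
        }
        // iretq returns to the faulting instruction, so we try again
    }
    (fault_count, false)
}
```

A handler that only prints a message corresponds to handler_fixes_cause = false, so every attempt faults again.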

What’s next?

We are now able to catch exceptions and to return from them. However, there are still exceptions that completely crash our kernel by causing a triple fault. In the next post, we will fix this issue by handling a special type of exception: the double fault. Thus, we will be able to avoid random reboots in our kernel.

Better Exception Messages

Wed, 03 Aug 2016 00:00:00 +0000

In this post, we explore exceptions in more detail. Our goal is to print additional information when an exception occurs, for example the values of the instruction and stack pointer. In the course of this, we will explore inline assembly and naked functions. We will also add a handler function for page faults and read the associated error code.

Note: This post describes how to handle exceptions using naked functions (see “Handling Exceptions with Naked Functions” for an overview). Our new way of handling exceptions can be found in the “Handling Exceptions” post.

Exceptions in Detail

An exception signals that something is wrong with the currently-executed instruction. Whenever an exception occurs, the CPU interrupts its current work and starts an internal exception routine.

This routine involves reading the interrupt descriptor table and invoking the registered handler function. But first, the CPU pushes various pieces of information onto the stack, which describe the current state and provide information about the cause of the exception:

The pushed information contains the instruction and stack pointer, the current CPU flags, and (for some exceptions) an error code, which contains further information about the cause of the exception. Let’s look at the fields in detail:

First, the CPU aligns the stack pointer on a 16-byte boundary. This allows the handler function to use SSE instructions, which partly expect such an alignment.
After that, the CPU pushes the stack segment descriptor (SS) and the old stack pointer (from before the alignment) onto the stack. This allows us to restore the previous stack pointer when we want to resume the interrupted program.
Then the CPU pushes the contents of the RFLAGS register. This register contains various state information of the interrupted program. For example, it indicates if interrupts were enabled and whether the last executed instruction returned zero.
Next the CPU pushes the instruction pointer and its code segment descriptor onto the stack. This tells us the address of the last executed instruction, which caused the exception.
Finally, the CPU pushes an error code for some exceptions. This error code only exists for exceptions such as page faults or general protection faults and provides additional information. For example, it tells us whether a page fault was caused by a read or a write request.
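The steps above can be sketched as a small host-side function. It is only a model under simplifying assumptions: InterruptedState and push_exception_frame are hypothetical names, and the memory contents are returned as a vector instead of being written to a real stack.

```rust
/// The state of the interrupted program that the CPU saves.
struct InterruptedState {
    ss: u64,
    rsp: u64,
    rflags: u64,
    cs: u64,
    rip: u64,
}

/// Returns the new stack pointer and the pushed values, lowest address first.
fn push_exception_frame(state: &InterruptedState, error_code: Option<u64>) -> (u64, Vec<u64>) {
    // 1. align the stack pointer down to a 16-byte boundary
    let mut rsp = state.rsp & !0xf;
    let mut pushed = Vec::new();

    // 2.-4. push ss, the old (unaligned) rsp, rflags, cs, and rip
    for value in [state.ss, state.rsp, state.rflags, state.cs, state.rip] {
        rsp -= 8;
        pushed.push(value);
    }
    // 5. some exceptions additionally push an error code
    if let Some(code) = error_code {
        rsp -= 8;
        pushed.push(code);
    }

    // the value pushed last lives at the lowest address
    pushed.reverse();
    (rsp, pushed)
}
```

Without an error code, the new stack pointer is 40 bytes below the aligned old one, and the instruction pointer is the value stored at that address.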

Printing the Exception Stack Frame

Let’s create a struct that represents the exception stack frame:

// in src/interrupts/mod.rs

#[derive(Debug)]
#[repr(C)]
struct ExceptionStackFrame {
    instruction_pointer: u64,
    code_segment: u64,
    cpu_flags: u64,
    stack_pointer: u64,
    stack_segment: u64,
}

The divide-by-zero fault pushes no error code, so we leave it out for now. Note that the stack grows downwards in memory, so we need to declare the fields in reverse order (compared to the figure above).
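To see why the reversed declaration order matches the memory layout, we can replay the pushes into an array on the host and reinterpret the result as the struct. This is just a sketch: frame_from_push_order is a hypothetical helper and the values are examples.

```rust
#[derive(Debug, Clone, Copy)]
#[repr(C)]
struct ExceptionStackFrame {
    instruction_pointer: u64,
    code_segment: u64,
    cpu_flags: u64,
    stack_pointer: u64,
    stack_segment: u64,
}

/// Takes the five values in the order the CPU pushes them
/// (ss, rsp, rflags, cs, rip) and reads them back as a struct.
fn frame_from_push_order(pushed: [u64; 5]) -> ExceptionStackFrame {
    let mut stack = [0u64; 5];
    let mut next_free = stack.len(); // the stack grows towards index 0
    for value in pushed {
        next_free -= 1;
        stack[next_free] = value;
    }
    // SAFETY: `stack` holds five initialized u64 values, and the struct is
    // repr(C) with exactly five u64 fields, so size and alignment match.
    unsafe { std::ptr::read(stack.as_ptr() as *const ExceptionStackFrame) }
}
```

The value pushed last (the instruction pointer) ends up at the lowest address and therefore in the first field.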

Now we need a way to find the memory address of this stack frame. When we look at the above graphic again, we see that the start address of the exception stack frame is the new stack pointer. So we just need to read the value of rsp at the very beginning of our handler function:

// in src/interrupts/mod.rs

extern "C" fn divide_by_zero_handler() -> ! {
    let stack_frame: &ExceptionStackFrame;
    unsafe {
        asm!("mov $0, rsp" : "=r"(stack_frame) ::: "intel");
    }
    println!("\nEXCEPTION: DIVIDE BY ZERO\n{:#?}", stack_frame);
    loop {}
}

We’re using inline assembly here to load the value from the rsp register into stack_frame. The syntax is a bit strange, so here’s a quick explanation:

The asm! macro emits raw assembly instructions. This is the only way to read raw register values in Rust.
We insert a single assembly instruction: mov $0, rsp. It moves the value of rsp to some register (the $0 is a placeholder for an arbitrary register, which gets filled by the compiler).
The colons are separators. After the first colon, the asm! macro expects output operands. We’re specifying our stack_frame variable as a single output operand here. The =r tells the compiler that it should use any register for the first placeholder $0.
After the second colon, we can specify input operands. We don’t need any, therefore we leave it empty.
After the third colon, the macro expects so-called clobbers. We don’t change any register values, so we leave it empty too.
The last block (after the 4th colon) specifies options. The intel option tells the compiler that our code is in Intel assembly syntax (instead of the default AT&T syntax).
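The colon-separated syntax described above is the one the asm! macro used when this post was written. Current Rust toolchains use a different operand syntax. As a rough host-side sketch (assuming an x86_64 machine; the fallback for other architectures only exists to keep the example compiling and is not part of the technique), reading rsp looks like this:

```rust
/// Reads the current stack pointer with the current `asm!` syntax.
#[cfg(target_arch = "x86_64")]
fn read_stack_pointer() -> u64 {
    let value: u64;
    unsafe {
        // `{}` is the placeholder, `out(reg)` lets the compiler pick any
        // register for it. Intel syntax is the default on x86 targets.
        std::arch::asm!(
            "mov {}, rsp",
            out(reg) value,
            options(nomem, nostack, preserves_flags)
        );
    }
    value
}

/// Fallback for other architectures: approximates the stack pointer with
/// the address of a local variable.
#[cfg(not(target_arch = "x86_64"))]
fn read_stack_pointer() -> u64 {
    let marker = 0u8;
    &marker as *const u8 as u64
}
```

The placeholder, the output operand, and the options play the same roles as $0, =r, and the last colon block in the old syntax.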

So the inline assembly loads the stack pointer value to stack_frame at the very beginning of our function. Thus we have a pointer to the exception stack frame and are able to pretty-print its Debug formatting through the {:#?} argument.

Testing it

Let’s try it by executing make run:

Those ExceptionStackFrame values look very wrong. The instruction pointer definitely shouldn’t be 1 and the code segment should be 0x8 instead of some big number. So what’s going on here?

Debugging

It seems like we somehow got the pointer wrong. The ExceptionStackFrame type and our inline assembly seem correct, so something must be modifying rsp before we load it into stack_frame.

Let’s see what’s happening by looking at the disassembly of our function:

> objdump -d build/kernel-x86_64.bin | grep -A20 "divide_by_zero_handler"

 [...]
000000000010ced0 <_ZN7blog_os10interrupts22divide_by_zero_handler17h62189e8E>:
 10ced0:	55                   	push   %rbp
 10ced1:	48 89 e5             	mov    %rsp,%rbp
 10ced4:	48 81 ec b0 00 00 00 	sub    $0xb0,%rsp
 10cedb:	48 8d 45 98          	lea    -0x68(%rbp),%rax
 10cedf:	48 b9 1d 1d 1d 1d 1d 	movabs $0x1d1d1d1d1d1d1d1d,%rcx
 10cee6:	1d 1d 1d
 10cee9:	48 89 4d 98          	mov    %rcx,-0x68(%rbp)
 10ceed:	48 89 4d f8          	mov    %rcx,-0x8(%rbp)
 10cef1:	48 89 e1             	mov    %rsp,%rcx
 10cef4:	48 89 4d f8          	mov    %rcx,-0x8(%rbp)
 10cef8:  ...
[...]

Our divide_by_zero_handler starts at address 0x10ced0. Let’s look at the instruction at address 0x10cef1:

mov %rsp,%rcx

This is our inline assembly instruction, which loads the stack pointer into the stack_frame variable. It just looks a bit different, since it’s in AT&T syntax and contains rcx instead of our $0 placeholder. It moves rsp to rcx, and then the next instruction (mov %rcx,-0x8(%rbp)) moves rcx to the variable on the stack.

We can clearly see the problem here: The compiler inserted various other instructions before our inline assembly. These instructions modify the stack pointer so that we don’t read the original rsp value and get a wrong pointer. But why is the compiler doing this?

The reason is that we need some place on the stack to store things like variables. Therefore the compiler inserts a so-called function prologue, which prepares the stack and reserves space for all variables. In our case, the compiler subtracts from the stack pointer to make room for, among other things, our stack_frame variable. This prologue is the first thing in every function and comes before any other code.
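As a quick worked example, we can add up the two prologue instructions from the disassembly above (push %rbp and sub $0xb0,%rsp). The function is just host-side arithmetic with a made-up name:

```rust
/// Computes the value of rsp after the prologue shown in the disassembly.
fn rsp_after_prologue(rsp_at_entry: u64) -> u64 {
    let after_push_rbp = rsp_at_entry - 8; // push %rbp
    after_push_rbp - 0xb0 // sub $0xb0,%rsp
}
```

So by the time our mov executes, rsp is 0xb8 (184) bytes below the start of the exception stack frame, which explains the bogus values.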

So in order to correctly load the exception frame pointer, we need some way to circumvent the automatic prologue generation.

Naked Functions

Fortunately there is a way to disable the prologue: naked functions. A naked function has no prologue and immediately starts with the first instruction of its body. However, most Rust code requires the prologue. Therefore naked functions should only contain inline assembly.

A naked function looks like this (note the #[naked] attribute):

#[naked]
extern "C" fn naked_function_example() {
    unsafe {
        asm!("mov rax, 0x42" ::: "rax" : "intel");
    };
}

Naked functions are highly unstable, so we need to add #![feature(naked_functions)] to our src/lib.rs.

If you want to try it, insert it in src/lib.rs and call it from rust_main. When we inspect the disassembly, we see that the function prologue is missing:

> objdump -d build/kernel-x86_64.bin | grep -A5 "naked_function_example"
[...]
000000000010df90 <_ZN7blog_os22naked_function_example17ha9f733dfe42b595dE>:
  10df90:	48 c7 c0 2a 00 00 00 	mov    $0x42,%rax
  10df97:	c3                   	retq
  10df98:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  10df9f:	00

It contains just the specified inline assembly and a return instruction (you can ignore the junk values after the return statement). So let’s try to use a naked function to retrieve the exception frame pointer.

A Naked Exception Handler

We can’t use Rust code in naked functions, but we still want to use Rust in our exception handler. Therefore we split our handler function into two parts: a main exception handler in Rust and a small naked wrapper function, which just loads the exception frame pointer and then calls the main handler.

Our new two-stage exception handler looks like this:

// in src/interrupts/mod.rs

#[naked]
extern "C" fn divide_by_zero_wrapper() -> ! {
    unsafe {
        asm!(/* load exception frame pointer and call main handler */);
    }
}

extern "C" fn divide_by_zero_handler(stack_frame: &ExceptionStackFrame)
    -> !
{
    println!("\nEXCEPTION: DIVIDE BY ZERO\n{:#?}",
        unsafe { &*stack_frame });
    loop {}
}

The naked wrapper function retrieves the exception stack frame pointer and then calls the divide_by_zero_handler with the pointer as argument. We can’t use Rust code in naked functions, so we need to do both things in inline assembly.

Retrieving the pointer to the exception stack frame is easy: We just need to load it from the rsp register. Our wrapper function has no prologue (it’s naked), so we can be sure that nothing modifies the register before.

Calling the main handler is a bit more complicated, since we need to pass the argument correctly. Our main handler uses the C calling convention, which specifies that the first argument is passed in the rdi register. So we need to load the pointer value into rdi and then use the call instruction to call divide_by_zero_handler.

Translated to assembly, it looks like this:

mov rdi, rsp
call divide_by_zero_handler

It moves the exception stack frame pointer from rsp to rdi, where the first argument is expected, and then calls the main handler. Letâ€™s create the corresponding inline assembly to complete our wrapper function:

#[naked]
extern "C" fn divide_by_zero_wrapper() -> ! {
    unsafe {
        asm!("mov rdi, rsp; call $0"
             :: "i"(divide_by_zero_handler as extern "C" fn(_) -> !)
             : "rdi" : "intel");
    }
}

Instead of call divide_by_zero_handler, we use a placeholder again. The reason is Rust’s name mangling, which changes the name of the divide_by_zero_handler function. To circumvent this, we pass a function pointer as input parameter (after the second colon). The "i" tells the compiler that it is an immediate value, which can be directly inserted for the placeholder. We also specify a clobber after the third colon, which tells the compiler that we change the value of the rdi register.

Intrinsics::Unreachable

When we try to compile it, we get the following error:

error: computation may converge in a function marked as diverging
  --> src/interrupts/mod.rs:23:1
   |>
23 |> extern "C" fn divide_by_zero_wrapper() -> ! {
   |> ^

The reason is that we marked our divide_by_zero_wrapper function as diverging (the !). We call another diverging function in inline assembly, so it is clear that the function diverges. However, the Rust compiler doesn’t understand inline assembly, so it doesn’t know that. To fix this, we tell the compiler that all code after the asm! macro is unreachable:

#[naked]
extern "C" fn divide_by_zero_wrapper() -> ! {
    unsafe {
        asm!("mov rdi, rsp; call $0"
             :: "i"(divide_by_zero_handler as extern "C" fn(_) -> !)
             : "rdi" : "intel");
        ::core::intrinsics::unreachable();
    }
}

The intrinsics::unreachable function is unstable, so we need to add #![feature(core_intrinsics)] to our src/lib.rs. It is just an annotation for the compiler and produces no real code. (Not to be confused with the unreachable! macro, which is completely different!)
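To illustrate the difference between the two, here is a host-side sketch. It uses std::hint::unreachable_unchecked, the stable counterpart of the intrinsic in current Rust versions, instead of the unstable intrinsics::unreachable from this post. The two lookup functions are made-up examples:

```rust
use std::hint::unreachable_unchecked;

/// Maps a vector number to a name without any runtime check.
///
/// # Safety
/// The caller must guarantee that `vector` is 0 or 6. Any other value is
/// undefined behavior, because the last arm is only a compiler hint.
unsafe fn exception_name_unchecked(vector: u8) -> &'static str {
    match vector {
        0 => "divide by zero",
        6 => "invalid opcode",
        // just an annotation: the compiler may assume this is never reached
        _ => unsafe { unreachable_unchecked() },
    }
}

/// The same lookup with the `unreachable!` macro, which panics at runtime
/// if the arm is reached after all.
fn exception_name_checked(vector: u8) -> &'static str {
    match vector {
        0 => "divide by zero",
        6 => "invalid opcode",
        _ => unreachable!("unexpected vector {}", vector),
    }
}
```

The macro produces real code (a panic), while the hint produces none, which is why only the hint fits into our naked wrapper.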

It works!

The last step is to update the interrupt descriptor table (IDT) to use our new wrapper function:

// in src/interrupts/mod.rs

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();
        idt.set_handler(0, divide_by_zero_wrapper); // changed
        idt
    };
}

Now we see a correct exception stack frame when we execute make run:

Testing on real Hardware

Virtual machines such as QEMU are very convenient to quickly test our kernel. However, they might behave a bit differently from real hardware in some situations. So we should test our kernel on real hardware, too.

Let’s do it by burning it to a USB stick:

> sudo dd if=build/os-x86_64.iso of=/dev/sdX && sync

Replace sdX with the device name of your USB stick. But be careful! The command will erase everything on that device.

Now we should be able to boot from this USB stick. When we do it, we see that it works fine on real hardware, too. Great!

However, this section wouldn’t exist if there weren’t a problem. To trigger this problem, we add some example code to the start of our divide_by_zero_handler:

// in src/interrupts/mod.rs

extern "C" fn divide_by_zero_handler(...) {
    let x = (1u64, 2u64, 3u64);
    let y = Some(x);
    for i in (0..100).map(|z| (z, z - 1)) {}

    println!(...);
    loop {}
}

This is just some garbage code that doesn’t do anything useful. When we try it in QEMU using make run, it still works fine. However, when we burn it to a USB stick again and boot from it on real hardware, we see that our computer reboots just before printing the exception message.

So our code, which worked well in QEMU, causes a triple fault on real hardware. What’s happening?

Reproducing the Bug in QEMU

Debugging on a real machine is difficult. Fortunately there is a way to reproduce this bug in QEMU: We use Linux’s Kernel-based Virtual Machine (KVM) by passing the -enable-kvm flag:

> qemu-system-x86_64 -cdrom build/os-x86_64.iso -enable-kvm

Now QEMU triple faults as well. This should make debugging much easier.

Debugging

QEMU’s -d int flag, which prints every exception, doesn’t seem to work in KVM mode. However, -d cpu_reset still works. It prints the complete CPU state whenever the CPU resets. Let’s try it:

> qemu-system-x86_64 -cdrom build/os-x86_64.iso -enable-kvm -d cpu_reset
CPU Reset (CPU 0)
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000000
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=00000000 EFL=00000000 [-------] CPL=0 II=0 A20=0 SMM=0 HLT=0
[...]
CPU Reset (CPU 0)
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000663
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
[...]
CPU Reset (CPU 0)
RAX=0000000000118cb8 RBX=0000000000000800 RCX=1d1d1d1d1d1d1d1d RDX=0..0000000
RSI=0000000000112cd0 RDI=0000000000118d38 RBP=0000000000118d28 RSP=0..0118c68
R8 =0000000000000000 R9 =0000000000000100 R10=0000000000118700 R11=0..0118a00
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0..0000000
RIP=000000000010cf08 RFL=00210002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
[...]

The first two resets occur while the CPU is still in 32-bit mode (EAX instead of RAX), so we ignore them. The third reset is the interesting one, because it occurs in 64-bit mode. The register dump tells us that the instruction pointer (rip) was 0x10cf08 just before the reset. This might be the address of the instruction that caused the triple fault.

We can find the corresponding instruction by disassembling our kernel:

> objdump -d build/kernel-x86_64.bin | grep "10cf08:"
  10cf08:	0f 29 45 b0          	movaps %xmm0,-0x50(%rbp)

The movaps instruction is an SSE instruction that moves aligned 128-bit values. It can fail for a number of reasons:

For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.
For an illegal address in the SS segment.
If a memory operand is not aligned on a 16-byte boundary.
For a page fault.
If TS in CR0 is set.

The segment registers contain no meaningful values in long mode, so they can’t contain illegal addresses. We did not change the TS bit in CR0 and there is no reason for a page fault either. So it has to be the third option: a memory operand that is not aligned on a 16-byte boundary.

16-byte Alignment

Some SSE instructions such as movaps require that memory operands are 16-byte aligned. In our case, the instruction is movaps %xmm0,-0x50(%rbp), which writes to address rbp - 0x50. Therefore rbp needs to be 16-byte aligned.

Let’s look at the above -d cpu_reset dump again and check the value of rbp:

CPU Reset (CPU 0)
RAX=[...] RBX=[...] RCX=[...] RDX=[...]
RSI=[...] RDI=[...] RBP=0000000000118d28 RSP=[...]
...

RBP is 0x118d28, which is not 16-byte aligned. So this is the reason for the triple fault. (It seems like QEMU doesn’t check the alignment for movaps, but real hardware of course does.)
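The alignment check is simple enough to verify with a few lines of host-side Rust. The helper names are made up for this sketch; the numbers are taken from the register dump and the movaps instruction above.

```rust
/// Returns true if `address` is a multiple of `alignment` (a power of two).
fn is_aligned(address: u64, alignment: u64) -> bool {
    debug_assert!(alignment.is_power_of_two());
    address & (alignment - 1) == 0
}

/// The address written by `movaps %xmm0,-0x50(%rbp)`.
fn movaps_operand_address(rbp: u64) -> u64 {
    rbp - 0x50
}
```

For rbp = 0x118d28 the operand address is 0x118cd8, which is 8 bytes off a 16-byte boundary. Since 0x50 is itself a multiple of 16, the operand is aligned exactly when rbp is.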

But how did we end up with a misaligned rbp register?

The Base Pointer

In order to solve this mystery, we need to look at the disassembly of the preceding code:

> objdump -d build/kernel-x86_64.bin | grep -B10 "10cf08:"
000000000010cee0 <_ZN7blog_os10interrupts22divide_by_zero_handler17hE>:
  10cee0:	55                   	push   %rbp
  10cee1:	48 89 e5             	mov    %rsp,%rbp
  10cee4:	48 81 ec c0 00 00 00 	sub    $0xc0,%rsp
  10ceeb:	48 8d 45 90          	lea    -0x70(%rbp),%rax
  10ceef:	48 b9 1d 1d 1d 1d 1d 	movabs $0x1d1d1d1d1d1d1d1d,%rcx
  10cef6:	1d 1d 1d
  10cef9:	48 89 4d 90          	mov    %rcx,-0x70(%rbp)
  10cefd:	48 89 7d f8          	mov    %rdi,-0x8(%rbp)
  10cf01:	0f 10 05 a8 51 00 00 	movups 0x51a8(%rip),%xmm0
  10cf08:	0f 29 45 b0          	movaps %xmm0,-0x50(%rbp)

At the last line we have the movaps instruction, which caused the triple fault. The exception occurs inside our divide_by_zero_handler function. We see that rbp is loaded with the value of rsp at the beginning (at 0x10cee1). The rbp register holds the so-called base pointer, which points to the beginning of the stack frame. It is used in the rest of the function to address variables and other values on the stack.

The base pointer is initialized directly from the stack pointer (rsp) after pushing the old base pointer. There is no special alignment code, so the compiler blindly assumes that (rsp - 8)¹ is always 16-byte aligned. This seems to be wrong in our case. But why does the compiler assume this?

¹ By pushing the old base pointer, rsp is updated to rsp-8.

Calling Conventions

The reason is that our exception handler is defined as an extern "C" function, which specifies that it’s using the C calling convention. On x86_64 Linux, the C calling convention is specified by the System V AMD64 ABI (PDF). Section 3.2.2 defines the following:

The end of the input argument area shall be aligned on a 16 byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 when control is transferred to the function entry point.

The “end of the input argument area” refers to the last stack-passed argument (in our case there aren’t any). So the stack pointer must be 16-byte aligned whenever we call a C-compatible function. The call instruction then pushes the return address on the stack so that “the value (%rsp + 8) is a multiple of 16 when control is transferred to the function entry point”.

Summary: The calling convention requires a 16 byte aligned stack pointer before call instructions. The compiler relies on this requirement, but we broke it somehow. Thus the generated code triple faults due to a misaligned memory address in the movaps instruction.

Fixing the Alignment

In order to fix this bug, we need to make sure that the stack pointer is correctly aligned before calling extern "C" functions. Let’s summarize the stack pointer modifications that occur before the exception handler is called:

The CPU aligns the stack pointer to a 16 byte boundary.
The CPU pushes ss, rsp, rflags, cs, and rip. So it pushes five 8 byte registers, which makes rsp misaligned.
The wrapper function calls divide_by_zero_handler with a misaligned stack pointer.

The problem is that an odd number of 8-byte registers gets pushed. Thus we need to align the stack pointer again before the call instruction:

#[naked]
extern "C" fn divide_by_zero_wrapper() -> ! {
    unsafe {
        asm!("mov rdi, rsp
              sub rsp, 8 // align the stack pointer
              call $0"
              :: "i"(divide_by_zero_handler as extern "C" fn(_) -> !)
              : "rdi" : "intel");
        ::core::intrinsics::unreachable();
    }
}

The additional sub rsp, 8 instruction aligns the stack pointer to a 16 byte boundary. Now it should work on real hardware (and in QEMU KVM mode) again.
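We can double-check the arithmetic with a host-side sketch that replays the stack pointer changes. The function names are made up, and the starting value 0x118d60 is an assumption, chosen so that the unfixed path reproduces the rbp value from the register dump.

```rust
/// Computes rsp at the entry of the main handler.
fn rsp_at_handler_entry(interrupted_rsp: u64, realign: bool) -> u64 {
    let mut rsp = interrupted_rsp & !0xf; // the CPU aligns the stack pointer
    rsp -= 5 * 8; // the CPU pushes ss, rsp, rflags, cs, and rip
    if realign {
        rsp -= 8; // our additional `sub rsp, 8`
    }
    rsp -= 8; // `call` pushes the return address
    rsp
}

/// The System V requirement: (rsp + 8) is a multiple of 16 at function entry.
fn satisfies_entry_alignment(rsp_at_entry: u64) -> bool {
    (rsp_at_entry + 8) % 16 == 0
}
```

Without the fix, the handler is entered with rsp = 0x118d30, and its push %rbp then yields the misaligned rbp = 0x118d28. With the fix, the requirement holds.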

A Handler Macro

The next step is to add handlers for other exceptions. However, we would need wrapper functions for them too. To avoid this code duplication, we create a handler macro that creates the wrapper functions for us:

// in src/interrupts/mod.rs

macro_rules! handler {
    ($name: ident) => {{
        #[naked]
        extern "C" fn wrapper() -> ! {
            unsafe {
                asm!("mov rdi, rsp
                      sub rsp, 8 // align the stack pointer
                      call $0"
                      :: "i"($name as extern "C" fn(
                          &ExceptionStackFrame) -> !)
                      : "rdi" : "intel");
                ::core::intrinsics::unreachable();
            }
        }
        wrapper
    }}
}

The macro takes a single Rust identifier (ident) as argument and expands to a {} block (hence the double braces). The block defines a new wrapper function that calls the function $name and passes a pointer to the exception stack frame. Note that we’re fixing the argument type to &ExceptionStackFrame. If we used a _ like before, the passed function could accept an arbitrary argument, which would lead to ugly bugs at runtime.
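The same pattern can be tried out on the host without any assembly. The following sketch uses a made-up wrapped! macro and two toy functions; it only demonstrates how the double braces scope the inner wrapper function and how the cast fixes the signature of the passed function.

```rust
// The macro must be defined before its first use.
macro_rules! wrapped {
    ($name: ident) => {{
        // Each expansion defines its own `wrapper`; the surrounding block
        // keeps the different `wrapper` functions from colliding.
        fn wrapper(value: u64) -> u64 {
            // The cast only compiles if `$name` has exactly this signature.
            let inner = $name as fn(u64) -> u64;
            inner(value) + 1
        }
        wrapper
    }};
}

fn double(x: u64) -> u64 {
    x * 2
}

fn square(x: u64) -> u64 {
    x * x
}

/// Builds a small table of wrapper functions, similar to filling the IDT.
fn build_table() -> [fn(u64) -> u64; 2] {
    let first: fn(u64) -> u64 = wrapped!(double);
    let second: fn(u64) -> u64 = wrapped!(square);
    [first, second]
}
```

Passing a function with a different signature to wrapped! results in a compile-time error at the cast, which is the same effect we get from spelling out &ExceptionStackFrame in the handler macro.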

Now we can remove the divide_by_zero_wrapper and use our new handler! macro instead:

// in src/interrupts/mod.rs

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();
        idt.set_handler(0, handler!(divide_by_zero_handler)); // new
        idt
    };
}

Note that the handler! macro needs to be defined above the static IDT, because macros are only available after their definition.

Invalid Opcode Exception

With the handler! macro we can create new handler functions easily. For example, we can add a handler for the invalid opcode exception as follows:

// in src/interrupts/mod.rs

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();
        idt.set_handler(0, handler!(divide_by_zero_handler));
        idt.set_handler(6, handler!(invalid_opcode_handler)); // new
        idt
    };
}

extern "C" fn invalid_opcode_handler(stack_frame: &ExceptionStackFrame)
    -> !
{
    let stack_frame = unsafe { &*stack_frame };
    println!("\nEXCEPTION: INVALID OPCODE at {:#x}\n{:#?}",
        stack_frame.instruction_pointer, stack_frame);
    loop {}
}

Invalid opcode faults have the vector number 6, so we set the IDT entry with index 6. This time we additionally print the address of the invalid instruction.

We can test our new handler with the special ud2 instruction, which generates an invalid opcode exception:

// in src/lib.rs

#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    ...

    // initialize our IDT
    interrupts::init();

    // provoke an invalid opcode exception
    unsafe { asm!("ud2") };

    println!("It did not crash!");
    loop {}
}

Exceptions with Error Codes

When a divide-by-zero exception occurs, we immediately know the reason: Someone tried to divide by zero. In contrast, there are faults with many possible causes. For example, a page fault occurs on many occasions: when accessing a non-present page, when writing to a read-only page, when the page table is malformed, etc. In order to differentiate these causes, the CPU pushes an additional error code onto the stack for such exceptions, which gives additional information.

A new Macro

Since the CPU pushes an additional error code, the stack frame is different and our handler! macro is not applicable. Therefore we create a new handler_with_error_code! macro for them:

// in src/interrupts/mod.rs

macro_rules! handler_with_error_code {
    ($name: ident) => {{
        #[naked]
        extern "C" fn wrapper() -> ! {
            unsafe {
                asm!("pop rsi // pop error code into rsi
                      mov rdi, rsp
                      sub rsp, 8 // align the stack pointer
                      call $0"
                      :: "i"($name as extern "C" fn(
                          &ExceptionStackFrame, u64) -> !)
                      : "rdi","rsi" : "intel");
                ::core::intrinsics::unreachable();
            }
        }
        wrapper
    }}
}

The difference from the handler! macro is the additional error code argument. The CPU pushes the error code last, so we pop it right at the beginning of the wrapper function. We pop it into rsi because the C calling convention expects the second argument in that register.
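To see why the sub rsp, 8 is still needed after popping the error code, we can model the stack on the host. This is a sketch with made-up names (ModelStack, push_exception_frame) and it assumes a 16-byte aligned stack pointer before the CPU starts pushing. No real memory is touched.

```rust
/// Minimal model of a downward-growing stack of 8-byte slots.
pub struct ModelStack {
    pub rsp: u64,
    slots: Vec<u64>,
}

impl ModelStack {
    /// `aligned_rsp` is assumed to be 16-byte aligned.
    pub fn new(aligned_rsp: u64) -> Self {
        ModelStack { rsp: aligned_rsp, slots: Vec::new() }
    }

    pub fn push(&mut self, value: u64) {
        self.rsp -= 8;
        self.slots.push(value);
    }

    pub fn pop(&mut self) -> Option<u64> {
        let value = self.slots.pop()?;
        self.rsp += 8;
        Some(value)
    }
}

/// Mimics the CPU: five frame values first (ss, rsp, rflags, cs, rip),
/// then the error code as the last push.
pub fn push_exception_frame(stack: &mut ModelStack, frame: [u64; 5], error_code: u64) {
    for value in frame.iter() {
        stack.push(*value);
    }
    stack.push(error_code);
}
```

After six pushes the model stack pointer is aligned, but popping the error code leaves it 8 bytes off again. That is the same situation as in the handler! macro, so the same correction applies.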

A Page Fault Handler

Let’s write a page fault handler which analyzes and prints the error code:

// in src/interrupts/mod.rs

extern "C" fn page_fault_handler(stack_frame: &ExceptionStackFrame,
                                 error_code: u64) -> !
{
    println!(
        "\nEXCEPTION: PAGE FAULT with error code {:?}\n{:#?}",
        error_code, unsafe { &*stack_frame });
    loop {}
}

We need to register our new handler function in the static interrupt descriptor table (IDT):

// in src/interrupts/mod.rs

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();

        idt.set_handler(0, handler!(divide_by_zero_handler));
        idt.set_handler(6, handler!(invalid_opcode_handler));
        // new
        idt.set_handler(14, handler_with_error_code!(page_fault_handler));

        idt
    };
}

Page faults have the vector number 14, so we set the IDT entry with index 14.

Testing it

Let’s test our new page fault handler by provoking a page fault in our main function:

// in src/lib.rs

#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    ...

    // initialize our IDT
    interrupts::init();

    // provoke a page fault
    unsafe { *(0xdeadbeaf as *mut u64) = 42 };

    println!("It did not crash!");
    loop {}
}

We get the following output:

The Page Fault Error Code

“Error code 2” is not really a useful error message. Let’s improve this by creating a PageFaultErrorCode type:

// in src/interrupts/mod.rs

bitflags! {
    struct PageFaultErrorCode: u64 {
        const PROTECTION_VIOLATION = 1 << 0;
        const CAUSED_BY_WRITE = 1 << 1;
        const USER_MODE = 1 << 2;
        const MALFORMED_TABLE = 1 << 3;
        const INSTRUCTION_FETCH = 1 << 4;
    }
}

When the PROTECTION_VIOLATION flag is set, the page fault was caused e.g. by a write to a read-only page. If it’s not set, it was caused by accessing a non-present page.
The CAUSED_BY_WRITE flag specifies if the fault was caused by a write (if set) or a read (if not set).
The USER_MODE flag is set when the fault occurred in non-privileged mode.
The MALFORMED_TABLE flag is set when the page table entry has a 1 in a reserved field.
When the INSTRUCTION_FETCH flag is set, the page fault occurred while fetching the next instruction.
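The decoding can also be sketched without any crate, which makes it easy to try on the host. The FLAGS table and the decode_page_fault function below are illustrative names, not part of the kernel. The function returns None for unknown bits, similar in spirit to from_bits of the bitflags crate.

```rust
/// Flag bits of the page fault error code, as listed above.
const FLAGS: [(u64, &str); 5] = [
    (1 << 0, "PROTECTION_VIOLATION"),
    (1 << 1, "CAUSED_BY_WRITE"),
    (1 << 2, "USER_MODE"),
    (1 << 3, "MALFORMED_TABLE"),
    (1 << 4, "INSTRUCTION_FETCH"),
];

/// Returns the names of all set flags, or `None` if an unknown bit is set.
fn decode_page_fault(error_code: u64) -> Option<Vec<&'static str>> {
    // Bit mask of all flags we know about.
    let known: u64 = FLAGS.iter().map(|(bit, _)| bit).sum();
    if (error_code & !known) != 0 {
        return None;
    }
    Some(
        FLAGS
            .iter()
            .filter(|(bit, _)| (error_code & bit) != 0)
            .map(|(_, name)| *name)
            .collect(),
    )
}
```

For the error code 2 from our test, this yields only CAUSED_BY_WRITE: the fault was a write, and the page was not present.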

Now we can improve our page fault error message by using the new PageFaultErrorCode. We also print the accessed memory address:

extern "C" fn page_fault_handler(stack_frame: &ExceptionStackFrame,
                                 error_code: u64) -> !
{
    use x86_64::registers::control_regs;
    println!(
        "\nEXCEPTION: PAGE FAULT while accessing {:#x}\
        \nerror code: {:?}\n{:#?}",
        unsafe { control_regs::cr2() },
        PageFaultErrorCode::from_bits(error_code).unwrap(),
        unsafe { &*stack_frame });
    loop {}
}

The from_bits function tries to convert the u64 into a PageFaultErrorCode. We use unwrap to panic if the error code has invalid bits set, since this indicates an error in our PageFaultErrorCode definition or a stack corruption. We also print the contents of the cr2 register. It contains the accessed memory address, which was the cause of the page fault.

Now we get a useful error message when a page fault occurs, which allows us to debug it more easily:

As expected, the page fault was caused by a write to 0xdeadbeaf. The PROTECTION_VIOLATION flag is not set, so the accessed page was not present.

What’s next?

Now we’re able to catch and analyze various exceptions. The next step is to resolve exceptions, if possible. An example is demand paging: The OS swaps out memory pages to disk so that a page fault occurs when the page is accessed the next time. In that case, the OS can resolve the exception by bringing the page back into memory. Afterwards, the OS resumes the interrupted program as if nothing had happened.

The next post will implement the first portion of demand paging: saving and restoring the complete state of a program. This will allow us to transparently interrupt and resume programs in the future.

Catching Exceptions

Sat, 28 May 2016 00:00:00 +0000

In this post, we start exploring exceptions. We set up an interrupt descriptor table and add handler functions. At the end of this post, our kernel will be able to catch divide-by-zero faults.

As always, the complete source code is on GitHub. Please file issues for any problems, questions, or improvement suggestions. There is also a comment section at the end of this page.

Note: This post describes how to handle exceptions using naked functions (see “Handling Exceptions with Naked Functions” for an overview). Our new way of handling exceptions can be found in the “Handling Exceptions” post.

Exceptions

We’ve already seen several types of exceptions in our kernel:

Invalid Opcode: This exception occurs when the current instruction is invalid. For example, this exception occurred when we tried to use SSE instructions before enabling SSE. Without SSE, the CPU didn’t know the movups and movaps instructions, so it threw an exception when it stumbled over them.
Page Fault: A page fault occurs on illegal memory accesses. For example, if the current instruction tries to read from an unmapped page or tries to write to a read-only page.
Double Fault: When an exception occurs, the CPU tries to call the corresponding handler function. If another exception occurs while calling the exception handler, the CPU raises a double fault exception. This exception also occurs when there is no handler function registered for an exception.
Triple Fault: If an exception occurs while the CPU tries to call the double fault handler function, it issues a fatal triple fault. We can’t catch or handle a triple fault. Most processors react by resetting themselves and rebooting the operating system. This causes the bootloops we experienced in the previous posts.

For the full list of exceptions check out the OSDev wiki.

The Interrupt Descriptor Table

Type	Name	Description
u16	Function Pointer [0:15]	The lower bits of the pointer to the handler function.
u16	GDT selector	Selector of a code segment in the GDT.
u16	Options	(see below)
u16	Function Pointer [16:31]	The middle bits of the pointer to the handler function.
u32	Function Pointer [32:63]	The remaining bits of the pointer to the handler function.
u32	Reserved

The options field has the following format:

Bits	Name	Description
0-2	Interrupt Stack Table Index	0: Don’t switch stacks, 1-7: Switch to the n-th stack in the Interrupt Stack Table when this handler is called.
3-7	Reserved
8	0: Interrupt Gate, 1: Trap Gate	If this bit is 0, interrupts are disabled when this handler is called.
9-11	must be one
12	must be zero
13-14	Descriptor Privilege Level (DPL)	The minimal privilege level required for calling this handler.
15	Present

When an exception occurs, the CPU roughly does the following:

Read the corresponding entry from the Interrupt Descriptor Table (IDT). For example, the CPU reads the entry with index 14 when a page fault occurs.
Check if the entry is present. Raise a double fault if not.
Push some registers on the stack, including the instruction pointer and the RFLAGS register. (We will use these values in a future post.)
Disable interrupts if the entry is an interrupt gate (bit 40 not set).
Load the specified GDT selector into the CS segment.
Jump to the specified handler function.
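The steps above can be sketched as a small host-side model. It deliberately follows the simplified description in the list, leaves out the register pushes and the code segment load, and treats entries outside the table like non-present ones. ModelEntry, Outcome, and dispatch are made-up names.

```rust
/// Simplified IDT entry for illustration; the real layout is described above.
#[derive(Clone, Copy)]
pub struct ModelEntry {
    pub present: bool,
    pub interrupt_gate: bool,
    pub handler: u64,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The CPU jumps to `handler`; `interrupts_enabled` is the new state.
    Jump { handler: u64, interrupts_enabled: bool },
    /// The entry was missing or not present.
    DoubleFault,
}

/// Models what happens when the exception with number `vector` occurs.
pub fn dispatch(idt: &[ModelEntry], vector: usize, interrupts_enabled: bool) -> Outcome {
    match idt.get(vector) {
        Some(entry) if entry.present => Outcome::Jump {
            handler: entry.handler,
            // Interrupt gates disable interrupts, trap gates leave them unchanged.
            interrupts_enabled: interrupts_enabled && !entry.interrupt_gate,
        },
        _ => Outcome::DoubleFault,
    }
}
```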

Handling Exceptions

Let’s try to catch and handle CPU exceptions. We start by creating a new interrupts module with an idt submodule:

// in src/lib.rs
...
mod interrupts;
...

// src/interrupts/mod.rs

mod idt;

Now we create types for the IDT and its entries:

// src/interrupts/idt.rs

use x86_64::instructions::segmentation;
use x86_64::structures::gdt::SegmentSelector;
use x86_64::PrivilegeLevel;

pub struct Idt([Entry; 16]);

#[derive(Debug, Clone, Copy)]
#[repr(C, packed)]
pub struct Entry {
    pointer_low: u16,
    gdt_selector: SegmentSelector,
    options: EntryOptions,
    pointer_middle: u16,
    pointer_high: u32,
    reserved: u32,
}

The IDT is variable sized and can have up to 256 entries. We only need the first 16 entries in this post, so we define the table as [Entry; 16]. The remaining 240 handlers are treated as non-present by the CPU.

The Entry type is the translation of the above table to Rust. The repr(C, packed) attribute ensures that the compiler keeps the field ordering and does not add any padding between them. Instead of describing the gdt_selector as a plain u16, we use the SegmentSelector type of the x86_64 crate. We also merge bits 32 to 47 into an options field, because Rust has no u3 or u1 type. The EntryOptions type is described below:
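A quick host-side check confirms that this layout adds up to the 16 bytes of an IDT entry. The sketch below uses plain u16 fields in place of SegmentSelector and EntryOptions, assuming that both are thin wrappers around a u16. RawEntry and ENTRY_SIZE are illustrative names.

```rust
use std::mem::size_of;

/// Host-side copy of the entry layout with plain integers.
#[repr(C, packed)]
pub struct RawEntry {
    pub pointer_low: u16,
    pub gdt_selector: u16,
    pub options: u16,
    pub pointer_middle: u16,
    pub pointer_high: u32,
    pub reserved: u32,
}

/// Size of one entry in bytes: 4 * 2 + 2 * 4.
pub const ENTRY_SIZE: usize = size_of::<RawEntry>();
```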

Entry Options

The EntryOptions type has the following skeleton:

#[derive(Debug, Clone, Copy)]
pub struct EntryOptions(u16);

impl EntryOptions {
    fn new() -> Self {...}

    pub fn set_present(&mut self, present: bool) {...}

    pub fn disable_interrupts(&mut self, disable: bool) {...}

    pub fn set_privilege_level(&mut self, dpl: u16) {...}

    pub fn set_stack_index(&mut self, index: u16) {...}
}

The implementations of these methods need to modify the correct bits of the u16 without touching the other bits. For example, we would need the following bit-fiddling to set the stack index:

self.0 = (self.0 & 0xfff8) | stack_index;

Or alternatively:

self.0 = (self.0 & (!0b111)) | stack_index;

Or:

self.0 = ((self.0 >> 3) << 3) | stack_index;

Well, none of these variants is really readable and it’s very easy to make mistakes somewhere. Therefore I created a BitField trait that provides the following Range-based API:

self.0.set_bits(0..3, stack_index);

I think it is much more readable, since we abstracted away all bit-masking details. The BitField trait is contained in the bit_field crate. (It’s pretty new, so it might still contain bugs.) To add it as a dependency, we run cargo add bit_field and add extern crate bit_field; to our src/lib.rs.
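To show what such a range-based setter has to do internally, here is a hand-written sketch for u16. It is not the implementation of the bit_field crate, just a free function with a similar shape that can be compared against the manual variants from above.

```rust
use std::ops::Range;

/// Returns `value` with the bits in `range` replaced by `new_bits`.
/// The upper bound of `range` is exclusive.
pub fn set_bits(value: u16, range: Range<u8>, new_bits: u16) -> u16 {
    assert!(range.start < range.end && range.end <= 16, "invalid bit range");
    let width = u32::from(range.end - range.start);
    // Build the mask in u32 so that a full 16-bit range does not overflow.
    let mask = (((1u32 << width) - 1) as u16) << range.start;
    let shifted = new_bits << range.start;
    assert!(
        (shifted >> range.start) == new_bits && (shifted & !mask) == 0,
        "value does not fit into the range"
    );
    (value & !mask) | shifted
}
```

For the stack index, set_bits(old, 0..3, index) gives the same result as (old & 0xfff8) | index, but the caller no longer has to work out the mask by hand.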

Now we can use the trait to implement the methods of EntryOptions:

// in src/interrupts/idt.rs

use bit_field::BitField;

#[derive(Debug, Clone, Copy)]
pub struct EntryOptions(u16);

impl EntryOptions {
    fn minimal() -> Self {
        let mut options: u16 = 0;
        options.set_bits(9..12, 0b111); // 'must-be-one' bits
        EntryOptions(options)
    }

    fn new() -> Self {
        let mut options = Self::minimal();
        options.set_present(true).disable_interrupts(true);
        options
    }

    pub fn set_present(&mut self, present: bool) -> &mut Self {
        self.0.set_bit(15, present);
        self
    }

    pub fn disable_interrupts(&mut self, disable: bool) -> &mut Self {
        self.0.set_bit(8, !disable);
        self
    }

    pub fn set_privilege_level(&mut self, dpl: u16) -> &mut Self {
        self.0.set_bits(13..15, dpl);
        self
    }

    pub fn set_stack_index(&mut self, index: u16) -> &mut Self {
        self.0.set_bits(0..3, index);
        self
    }
}

Note that the ranges exclude the upper bound. The minimal function creates an EntryOptions value with only the “must-be-one” bits set. The new function, on the other hand, chooses reasonable defaults: It sets the present bit (why would you want to create a non-present entry?) and disables interrupts (normally we don’t want our exception handlers to be interrupted). By returning the self reference from the set_* methods, we allow easy method chaining such as options.set_present(true).disable_interrupts(true).
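The chaining pattern and the resulting raw values can be tried on the host with plain bit operations. The Options type below is a reduced sketch with only two setters and does not use the bit_field crate.

```rust
/// Host-side variant of the options type using plain bit operations.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Options(pub u16);

impl Options {
    pub fn minimal() -> Self {
        Options(0b111 << 9) // 'must-be-one' bits 9-11
    }

    pub fn set_present(&mut self, present: bool) -> &mut Self {
        self.0 = (self.0 & !(1 << 15)) | ((present as u16) << 15);
        self
    }

    pub fn disable_interrupts(&mut self, disable: bool) -> &mut Self {
        // Bit 8 is 0 for an interrupt gate (interrupts disabled).
        self.0 = (self.0 & !(1 << 8)) | ((!disable as u16) << 8);
        self
    }
}
```

Starting from Options::minimal() (0x0e00), the chain set_present(true).disable_interrupts(true) produces 0x8e00. Each setter returns the same mutable reference, so the calls can be strung together on one line.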

Creating IDT Entries

Now we can add a function to create new IDT entries:

impl Entry {
    fn new(gdt_selector: SegmentSelector, handler: HandlerFunc) -> Self {
        let pointer = handler as u64;
        Entry {
            gdt_selector: gdt_selector,
            pointer_low: pointer as u16,
            pointer_middle: (pointer >> 16) as u16,
            pointer_high: (pointer >> 32) as u32,
            options: EntryOptions::new(),
            reserved: 0,
        }
    }
}

We take a GDT selector and a handler function as arguments and create a new IDT entry for it. The HandlerFunc type is described below. It is a function pointer that can be converted to a u64. We choose the lower 16 bits for pointer_low, the next 16 bits for pointer_middle and the remaining 32 bits for pointer_high. For the options field we choose our default options, i.e. present and disabled interrupts.
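The pointer splitting is easy to verify in isolation. The two helper functions below are illustrative and operate on a plain u64 instead of a real function pointer.

```rust
/// Splits a 64-bit handler address into the three pointer fields of an entry.
pub fn split_pointer(pointer: u64) -> (u16, u16, u32) {
    let low = pointer as u16;            // bits 0-15 (the cast truncates)
    let middle = (pointer >> 16) as u16; // bits 16-31
    let high = (pointer >> 32) as u32;   // bits 32-63
    (low, middle, high)
}

/// Reassembles the address from the three fields.
pub fn join_pointer(low: u16, middle: u16, high: u32) -> u64 {
    u64::from(low) | (u64::from(middle) << 16) | (u64::from(high) << 32)
}
```

The as casts silently drop the upper bits, which is exactly what we want here. Joining the three parts again gives back the original address.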

The Handler Function Type

The HandlerFunc type is a type alias for a function type:

pub type HandlerFunc = extern "C" fn() -> !;

It needs to be a function with a defined calling convention, as it is called directly by the hardware. The C calling convention is the de facto standard in OS development, so we’re using it, too. The function takes no arguments, since the hardware doesn’t supply any arguments when jumping to the handler function.

It is important that the function is diverging, i.e. it must never return. The reason is that the hardware doesn’t call the handler functions, it just jumps to them after pushing some values to the stack. So our stack might look different:

If our handler function returned normally, it would try to pop the return address from the stack. But it might get some completely different value then. For example, the CPU pushes an error code for some exceptions. Bad things would happen if we interpreted this error code as return address and jumped to it. Therefore interrupt handler functions must diverge¹.

Another reason is that we overwrite the current register values by executing the handler function. Thus, the interrupted function loses its state and can’t proceed anyway.

IDT methods

Let’s add a function to create new interrupt descriptor tables:

impl Idt {
    pub fn new() -> Idt {
        Idt([Entry::missing(); 16])
    }
}

impl Entry {
    fn missing() -> Self {
        Entry {
            gdt_selector: SegmentSelector::new(0, PrivilegeLevel::Ring0),
            pointer_low: 0,
            pointer_middle: 0,
            pointer_high: 0,
            options: EntryOptions::minimal(),
            reserved: 0,
        }
    }
}

The missing function creates a non-present Entry. We could choose any values for the pointer and GDT selector fields as long as the present bit is not set.

However, a table with non-present entries is not very useful. So we create a set_handler method to add new handler functions:

impl Idt {
    pub fn set_handler(&mut self, entry: u8, handler: HandlerFunc)
        -> &mut EntryOptions
    {
        self.0[entry as usize] = Entry::new(segmentation::cs(), handler);
        &mut self.0[entry as usize].options
    }
}

The method overwrites the specified entry with the given handler function. We use the segmentation::cs function of the x86_64 crate to get the current code segment selector. There’s no need for different kernel code segments in long mode, so the current cs value should always be the right choice.

By returning a mutable reference to the entry’s options, we allow the caller to override the default settings. For example, the caller could add a non-present entry by executing: idt.set_handler(11, handler_fn).set_present(false).

Loading the IDT

Now we’re able to create new interrupt descriptor tables with registered handler functions. We just need a way to load an IDT, so that the CPU uses it. The x86 architecture uses a special register to store the active IDT and its length. In order to load a new IDT we need to update this register through the lidt instruction.

The lidt instruction expects a pointer to a special data structure, which specifies the start address of the IDT and its length:

Type	Name	Description
u16	Limit	The maximum addressable byte in the table. Equal to the table size in bytes minus 1.
u64	Offset	Virtual start address of the table.

This structure is already contained in the x86_64 crate, so we don’t need to create it ourselves. The same is true for the lidt function. So we just need to put the pieces together to create a load method:

impl Idt {
    pub fn load(&self) {
        use x86_64::instructions::tables::{DescriptorTablePointer, lidt};
        use core::mem::size_of;

        let ptr = DescriptorTablePointer {
            base: self as *const _ as u64,
            limit: (size_of::<Self>() - 1) as u16,
        };

        unsafe { lidt(&ptr) };
    }
}

The method does not need to modify the IDT, so it takes self by immutable reference. First, we create a DescriptorTablePointer and then we pass it to lidt. The lidt function expects that the base field has the type u64, therefore we need to cast the self pointer. For calculating the limit we use mem::size_of. The additional -1 is needed because the limit field has to be the maximum addressable byte (inclusive bound). We need an unsafe block around lidt, because the function assumes that the specified handler addresses are valid.
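The limit calculation can be checked with a few lines on the host. ENTRY_BYTES and idt_limit are illustrative names. The sketch assumes 16-byte entries, as defined by the Entry type, and a table of at most 256 entries.

```rust
/// Size of a single IDT entry in bytes.
const ENTRY_BYTES: usize = 16;

/// Value for the limit field of a table with `entries` entries,
/// or `None` if the table is empty or has more than 256 entries.
pub fn idt_limit(entries: usize) -> Option<u16> {
    if entries == 0 || entries > 256 {
        return None;
    }
    // The limit is the highest addressable byte offset, hence the `- 1`.
    Some((entries * ENTRY_BYTES - 1) as u16)
}
```

For our table with 16 entries the limit is 255, and a full table with 256 entries has a limit of 4095.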

Safety

But can we really guarantee that handler addresses are always valid? Let’s see:

The Idt::new function creates a new table populated with non-present entries. There’s no way to set these entries to present from outside of this module, so this function is fine.
The set_handler method allows us to overwrite a specified entry and point it to some handler function. Rust’s type system guarantees that function pointers are always valid (as long as no unsafe is involved), so this function is fine, too.

There are no other public functions in the idt module (except load), so it should be safe… right?

Wrong! Imagine the following scenario:

pub fn init() {
    load_idt();
    cause_page_fault();
}

fn load_idt() {
    let mut idt = idt::Idt::new();
    idt.set_handler(14, page_fault_handler);
    idt.load();
}

fn cause_page_fault() {
    let x = [1,2,3,4,5,6,7,8,9];
    unsafe{ *(0xdeadbeaf as *mut u64) = x[4] };
}

This won’t work. If we’re lucky, we get a triple fault and a boot loop. If we’re unlucky, our kernel does strange things and fails at some completely unrelated place. So what’s the problem here?

Well, we construct an IDT on the stack and load it. It is perfectly valid until the end of the load_idt function. But as soon as the function returns, its stack frame can be reused by other functions. Thus, the IDT gets overwritten by the stack frame of the cause_page_fault function. So when the page fault occurs and the CPU tries to read the entry, it only sees some garbage values and issues a double fault, which escalates to a triple fault and a CPU reset.

Now imagine that the cause_page_fault function declared an array of pointers instead. If the present bit was coincidentally set, the CPU would jump to some random pointer and interpret random memory as code. This would be a clear violation of memory safety.

Fixing the load method

So how do we fix it? We could make the load function itself unsafe and push the unsafety to the caller. However, there is a much better solution in this case. In order to see it, we formulate the requirement for the load method:

The referenced IDT must be valid until a new IDT is loaded.

We can’t know when the next IDT will be loaded. Maybe never. So in the worst case:

The referenced IDT must be valid as long as our kernel runs.

This is exactly the definition of a static lifetime. So we can easily ensure that the IDT lives long enough by adding a 'static requirement to the signature of the load function:

pub fn load(&'static self) {...}
//           ^^^^^^^ ensure that the IDT reference has the 'static lifetime

That’s it! Now the Rust compiler ensures that the above error can’t happen anymore:

error: `idt` does not live long enough
  --> src/interrupts/mod.rs:78:5
78 |>     idt.load();
   |>     ^^^
note: reference must be valid for the static lifetime...
note: ...but borrowed value is only valid for the block suffix following
          statement 0 at 75:34
  --> src/interrupts/mod.rs:75:35
75 |>     let mut idt = idt::Idt::new();
   |>                                   ^
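The effect of the 'static requirement can be reproduced on the host with a stand-in type. The sketch below shows two ways to obtain a reference that satisfies it: a static item and a deliberately leaked Box. Table is a hypothetical placeholder, and its const fn constructor only works because the type is trivial.

```rust
/// Placeholder for a descriptor table (not the kernel's Idt type).
pub struct Table {
    entries: [u64; 16],
}

impl Table {
    pub const fn new() -> Self {
        Table { entries: [0; 16] }
    }

    /// Only callable on references that live for the rest of the program.
    /// Returns the address that would be handed to the CPU.
    pub fn load(&'static self) -> usize {
        self.entries.as_ptr() as usize
    }
}

/// Option 1: a `static`, which lives for the whole program.
pub static STATIC_TABLE: Table = Table::new();

/// Option 2: a heap allocation that is deliberately never freed.
pub fn leaked_table() -> &'static Table {
    Box::leak(Box::new(Table::new()))
}
```

Calling load on a local variable, as in the load_idt function above, is rejected by the compiler with this signature, while STATIC_TABLE.load() and leaked_table().load() are accepted.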

A static IDT

So a valid IDT needs to have the 'static lifetime. We can either create a static IDT or deliberately leak a Box. We will most likely only need a single IDT for the foreseeable future, so let’s try the static approach:

// in src/interrupts/mod.rs

static IDT: idt::Idt = {
    let mut idt = idt::Idt::new();

    idt.set_handler(0, divide_by_zero_handler);

    idt
};

extern "C" fn divide_by_zero_handler() -> ! {
    println!("EXCEPTION: DIVIDE BY ZERO");
    loop {}
}

We register a single handler function for a divide by zero error (index 0). Like the name says, this exception occurs when dividing a number by 0. Thus we have an easy way to test our new exception handler.

However, it doesn’t work this way:

error: calls in statics are limited to constant functions, struct and enum
       constructors [E0015]
...
error: blocks in statics are limited to items and tail expressions [E0016]
...
error: references in statics may only refer to immutable values [E0017]
...

The reason is that the Rust compiler is not able to evaluate the value of the static at compile time. Maybe it will work someday when const functions become more powerful. But until then, we have to find another solution.

Lazy Statics to the Rescue

Fortunately the lazy_static macro exists. Instead of evaluating a static at compile time, the macro performs the initialization when the static is referenced the first time. Thus, we can do almost everything in the initialization block and are even able to read runtime values.
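The idea of initializing on first access can be illustrated on the host with OnceLock from the standard library. This is a stand-in for demonstration only and not how lazy_static is implemented. OnceLock is not available in our no_std kernel, and the names below are made up.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the initializer ran (for demonstration only).
pub static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

static TABLE: OnceLock<Vec<u64>> = OnceLock::new();

/// Returns the lazily initialized table; the closure runs at most once.
pub fn table() -> &'static Vec<u64> {
    TABLE.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        // Arbitrary runtime code is allowed here, unlike in a plain static.
        let mut entries = vec![0; 16];
        entries[0] = 0x1000; // hypothetical handler address
        entries
    })
}
```

The first call to table() runs the closure, and every later call returns the same reference without running it again.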

Let’s add the lazy_static crate to our project:

// in src/lib.rs

#[macro_use]
extern crate lazy_static;

# in Cargo.toml

[dependencies.lazy_static]
version = "0.2.1"
features = ["spin_no_std"]

We need the spin_no_std feature, since we don’t link the standard library.

With lazy_static, we can define our IDT without problems:

// in src/interrupts/mod.rs

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();

        idt.set_handler(0, divide_by_zero_handler);

        idt
    };
}

Now we’re ready to load our IDT! To do so, we add an interrupts::init function:

// in src/interrupts/mod.rs

pub fn init() {
    IDT.load();
}

We don’t need our assert_has_not_been_called macro here, since nothing bad happens when init is called twice. It just reloads the same IDT again.

Testing it

Now we should be able to catch divide-by-zero exceptions! Let’s try it in our rust_main:

// in src/lib.rs

pub extern "C" fn rust_main(...) {
    ...
    memory::init(boot_info);

    // initialize our IDT
    interrupts::init();

    // provoke a divide-by-zero fault
    42 / 0;

    println!("It did not crash!");
    loop {}
}

When we run it, we get a runtime panic:

PANIC in src/lib.rs at line 57:
    attempted to divide by zero

That’s not our exception handler. The reason is that Rust itself checks for a possible division by zero and panics in that case. So in order to raise a divide-by-zero error in the CPU, we need to bypass the Rust compiler somehow.
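The check that Rust performs before a division can be made visible with checked_div from the standard library, which returns None instead of panicking. The safe_div wrapper is just an illustrative name.

```rust
/// Division with the zero check made explicit: `None` instead of a panic.
pub fn safe_div(dividend: u32, divisor: u32) -> Option<u32> {
    dividend.checked_div(divisor)
}
```

In both cases the CPU never executes a division by zero, so no CPU exception is raised and our handler is not involved.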

Inline Assembly

In order to cause a divide-by-zero exception, we need to execute a div or idiv assembly instruction with operand 0. We could write a small assembly function and call it from our Rust code. An easier way is to use Rust’s inline assembly macro.

Inline assembly allows us to write raw x86 assembly within a Rust function. The feature is unstable, so we need to add #![feature(asm)] to our src/lib.rs. Then we’re able to write a divide_by_zero function:

fn divide_by_zero() {
    unsafe {
        asm!("mov dx, 0; div dx" ::: "ax", "dx" : "volatile", "intel")
    }
}

Let’s try to decode it:

The asm! macro emits raw assembly instructions, so it’s unsafe to use it.
We insert two assembly instructions here: mov dx, 0 and div dx. The former loads a 0 into the dx register (a subset of rdx) and the latter divides by dx. (The div instruction always implicitly uses the ax register as the dividend; with a 16-bit operand, it divides the 32-bit value in dx:ax.)
The colons are separators. After the first : we could specify output operands and after the second : we could specify input operands. We need neither, so we leave these areas empty.
After the third colon, we specify the so-called clobbers. These tell the compiler that our assembly modifies the values of some registers. Otherwise, the compiler assumes that the registers preserve their value. In our case, we clobber dx (we load 0 to it) and ax (the div instruction places the result in it).
The last block (after the 4th colon) specifies some options. The volatile option tells the compiler: “This code has side effects. Do not delete it and do not move it elsewhere”. In our case, the “side effect” is the divide-by-zero exception. Finally, the intel option allows us to use the Intel assembly syntax instead of the default AT&T syntax.

Let’s use our new divide_by_zero function to raise a CPU exception:

// in src/lib.rs

pub extern "C" fn rust_main(...) {
    ...

    // provoke a divide-by-zero fault
    divide_by_zero();

    println!("It did not crash!");
    loop {}
}

It works! We see an EXCEPTION: DIVIDE BY ZERO message at the bottom of our screen:

What’s next?

We’ve successfully caught our first exception! However, our EXCEPTION: DIVIDE BY ZERO message doesn’t contain much information about the cause of the exception. The next post improves the situation by printing, among other things, the current stack pointer and the address of the causing instruction. We will also explore other exceptions such as page faults, for which the CPU pushes an error code on the stack.

Kernel Heap

Mon, 11 Apr 2016 00:00:00 +0000

In the previous posts we created a frame allocator and a page table module. Now we are ready to create a kernel heap and a memory allocator. Thus, we will unlock Box, Vec, BTreeMap, and the rest of the alloc crate.

As always, you can find the complete source code on GitHub. Please file issues for any problems, questions, or improvement suggestions. There is also a comment section at the end of this page.

Introduction

The heap is the memory area for long-lived allocations. The programmer can access it by using types like Box or Vec. Behind the scenes, the compiler manages that memory by inserting calls to some memory allocator. By default, Rust links to the jemalloc allocator (for binaries) or the system allocator (for libraries). However, both rely on system calls such as sbrk and are thus unusable in our kernel. So we need to create and link our own allocator.

A good allocator is fast and reliable. It also effectively utilizes the available memory and keeps fragmentation low. Furthermore, it works well for concurrent applications and scales to any number of processors. It even optimizes the memory layout with respect to the CPU caches to improve cache locality and avoid false sharing.

These requirements make good allocators pretty complex. For example, jemalloc has over 30,000 lines of code. This complexity is out of scope for our kernel, so we will create a much simpler allocator. Nevertheless, it should suffice for the foreseeable future, since we’ll allocate only when it’s absolutely necessary.

The Allocator Interface

The allocator interface in Rust is defined through the Alloc trait, which looks like this:

pub unsafe trait Alloc {
    unsafe fn alloc(&mut self, layout: Layout) -> Result<*mut u8, AllocErr>;
    unsafe fn dealloc(&mut self, ptr: *mut u8, layout: Layout);
    [...] // about 13 methods with default implementations
}

The alloc method should allocate a memory block with the size and alignment given through the Layout parameter. The dealloc method should free such memory blocks again. Both methods are unsafe, as is the trait itself. This has different reasons:

Implementing the Alloc trait is unsafe, because the implementation must satisfy a set of contracts. Among other things, pointers returned by alloc must point to valid memory and adhere to the Layout requirements.
Calling alloc is unsafe because the caller must ensure that the passed layout does not have size zero. I think this is because of compatibility reasons with existing C-allocators, where zero-sized allocations are undefined behavior.
Calling dealloc is unsafe because the caller must guarantee that the passed parameters adhere to the contract. For example, ptr must denote a valid memory block allocated via this allocator.

To set the system allocator, the global_allocator attribute can be added to a static that implements Alloc for a shared reference of itself. For example:

#[global_allocator]
static MY_ALLOCATOR: MyAllocator = MyAllocator {...};

impl<'a> Alloc for &'a MyAllocator {
    unsafe fn alloc(&mut self, layout: Layout) -> Result<*mut u8, AllocErr> {...}
    unsafe fn dealloc(&mut self, ptr: *mut u8, layout: Layout) {...}
}

Note that Alloc needs to be implemented for &MyAllocator, not for MyAllocator. The reason is that the alloc and dealloc methods require mutable self references, but there’s no way to get such a reference safely from a static. By requiring implementations for &MyAllocator, the global allocator interface avoids this problem and pushes the burden of synchronization onto the user.
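
To see why implementing a trait for a shared reference helps, consider the following sketch. The Bump trait and SharedCounter type are made-up stand-ins (they are not part of the alloc crate); the point is only that a `&mut self` method can be called on a `&SharedCounter` handle that was taken from a static, as long as the type synchronizes its own state.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for `Alloc`: the method takes `&mut self`.
pub trait Bump {
    fn bump(&mut self, amount: usize) -> usize;
}

pub struct SharedCounter {
    next: AtomicUsize,
}

impl SharedCounter {
    pub const fn new(start: usize) -> Self {
        Self { next: AtomicUsize::new(start) }
    }
}

// The trait is implemented for the shared reference, not for the type.
// Inside `bump`, `self` has type `&mut &SharedCounter`, so only the
// reference is mutable; the counter itself is still accessed through `&`.
impl<'a> Bump for &'a SharedCounter {
    fn bump(&mut self, amount: usize) -> usize {
        // Synchronization is our job: the atomic makes this race-free.
        self.next.fetch_add(amount, Ordering::Relaxed)
    }
}

static COUNTER: SharedCounter = SharedCounter::new(100);

/// Bumps the global counter and returns its previous value.
pub fn bump_global(amount: usize) -> usize {
    let mut handle: &SharedCounter = &COUNTER; // safe: shared reference to a static
    handle.bump(amount)
}
```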

Including the alloc crate

The Alloc trait is part of the alloc crate, which like core is a subset of Rust’s standard library. Apart from the trait, the crate also contains the standard types that require allocations such as Box, Vec and Arc. We can include it through a simple extern crate:

// in src/lib.rs
#![feature(alloc)] // the alloc crate is still unstable

[...]

#[macro_use]
extern crate alloc;

We don’t need to add anything to our Cargo.toml, since the alloc crate is part of the standard library and shipped with the Rust compiler. The alloc crate provides the format! and vec! macros, so we use #[macro_use] to import them.

When we try to compile our crate now, the following error occurs:

error[E0463]: can't find crate for `alloc`
  --> src/lib.rs:10:1
   |
16 | extern crate alloc;
   | ^^^^^^^^^^^^^^^^^^^ can't find crate

The problem is that xargo only cross compiles libcore by default. To also cross compile the alloc crate, we need to create a file named Xargo.toml in our project root (right next to the Cargo.toml) with the following content:

[target.x86_64-blog_os.dependencies]
alloc = {}

This instructs xargo that we also need alloc. It still doesn’t compile, since we need to define a global allocator in order to use the alloc crate:

error: no #[default_lib_allocator] found but one is required; is libstd not linked?

A Bump Allocator

For our first allocator, we start simple. We create a memory::heap_allocator module containing a so-called bump allocator:

// in src/memory/mod.rs

mod heap_allocator;

// in src/memory/heap_allocator.rs

use alloc::heap::{Alloc, AllocErr, Layout};

/// A simple allocator that allocates memory linearly and ignores freed memory.
#[derive(Debug)]
pub struct BumpAllocator {
    heap_start: usize,
    heap_end: usize,
    next: usize,
}

impl BumpAllocator {
    pub const fn new(heap_start: usize, heap_end: usize) -> Self {
        Self { heap_start, heap_end, next: heap_start }
    }
}

unsafe impl Alloc for BumpAllocator {
    unsafe fn alloc(&mut self, layout: Layout) -> Result<*mut u8, AllocErr> {
        let alloc_start = align_up(self.next, layout.align());
        let alloc_end = alloc_start.saturating_add(layout.size());

        if alloc_end <= self.heap_end {
            self.next = alloc_end;
            Ok(alloc_start as *mut u8)
        } else {
            Err(AllocErr::Exhausted{ request: layout })
        }
    }

    unsafe fn dealloc(&mut self, ptr: *mut u8, layout: Layout) {
        // do nothing, leak memory
    }
}

We also need to add #![feature(allocator_api)] to our lib.rs, since the allocator API is still unstable.

The heap_start and heap_end fields contain the start and end address of our kernel heap. The next field contains the next free address and is increased after every allocation. To allocate a memory block we align the next address using the align_up function (described below). Then we add up the desired size and make sure that we don’t exceed the end of the heap. We use a saturating add so that the alloc_end cannot overflow, which could lead to an invalid allocation. If everything goes well, we update the next address and return a pointer to the start address of the allocation. Otherwise, we return an AllocErr::Exhausted error.
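
The bump logic can be checked on a normal host without touching real memory. The sketch below is a simplified model, not the kernel code: BumpModel is a hypothetical name, addresses are plain numbers, the alignment is done inline with a bitmask, and checked_add takes the role of the saturating add (an overflow simply fails the allocation).

```rust
/// Host-side model of the bump allocator. No memory is accessed; the
/// methods only compute addresses.
pub struct BumpModel {
    heap_end: usize,
    next: usize,
}

impl BumpModel {
    pub const fn new(heap_start: usize, heap_end: usize) -> Self {
        Self { heap_end, next: heap_start }
    }

    /// Returns the start address of the new block, or `None` if the heap
    /// is exhausted. `align` must be a power of two.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        assert!(align.is_power_of_two(), "`align` must be a power of 2");
        // Round `next` up to the requested alignment.
        let alloc_start = self.next.checked_add(align - 1)? & !(align - 1);
        // An overflowing end address can never be inside the heap.
        let alloc_end = alloc_start.checked_add(size)?;

        if alloc_end <= self.heap_end {
            self.next = alloc_end; // bump the pointer past the new block
            Some(alloc_start)
        } else {
            None // out of memory
        }
    }
}
```

For a heap from 0x1000 to 0x1100, a 10-byte allocation with alignment 8 starts at 0x1000, and a following 16-byte allocation with alignment 16 starts at 0x1010 because 0x100a is rounded up.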

Alignment

In order to simplify alignment, we add align_down and align_up functions:

/// Align downwards. Returns the greatest x with alignment `align`
/// so that x <= addr. The alignment must be a power of 2.
pub fn align_down(addr: usize, align: usize) -> usize {
    if align.is_power_of_two() {
        addr & !(align - 1)
    } else if align == 0 {
        addr
    } else {
        panic!("`align` must be a power of 2");
    }
}

/// Align upwards. Returns the smallest x with alignment `align`
/// so that x >= addr. The alignment must be a power of 2.
pub fn align_up(addr: usize, align: usize) -> usize {
    align_down(addr + align - 1, align)
}

Let’s start with align_down: If the alignment is a valid power of two (i.e. in {1,2,4,8,…}), we use some bitwise operations to return the aligned address. It works because every power of two has exactly one bit set in its binary representation. For example, the numbers {1,2,4,8,…} are {1,10,100,1000,…} in binary. By subtracting 1 we get {0,01,011,0111,…}. These binary numbers have a 1 at exactly the positions that need to be zeroed in addr. For example, the last 3 bits need to be zeroed for an alignment of 8.

To align addr, we create a bitmask from align-1. We want a 0 at the position of each 1, so we invert it using !. After that, the binary numbers look like this: {…11111,…11110,…11100,…11000,…}. Finally, we zero the correct bits using a binary AND.

Aligning upwards is simple now. We just increase addr by align-1 and call align_down. We add align-1 instead of align because we would otherwise waste align bytes for already aligned addresses.
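
A few concrete values make the bit tricks easier to follow. The sketch below repeats align_down in a stricter form (it panics for an alignment of 0 instead of returning addr) and adds a hypothetical checked_align_up variant, since addr + align - 1 can overflow for addresses near the top of the address space.

```rust
/// Align downwards by clearing the low bits. `align` must be a power of 2.
pub fn align_down(addr: usize, align: usize) -> usize {
    assert!(align.is_power_of_two(), "`align` must be a power of 2");
    // For align = 8: align - 1 = 0b0111, !(align - 1) = 0b...11111000
    addr & !(align - 1)
}

/// Align upwards. Returns `None` if `align` is not a power of 2 or if the
/// aligned address would not fit into a `usize`.
pub fn checked_align_up(addr: usize, align: usize) -> Option<usize> {
    if !align.is_power_of_two() {
        return None;
    }
    // Adding align - 1 moves every unaligned address past the next
    // boundary, while already aligned addresses stay below it.
    let bumped = addr.checked_add(align - 1)?;
    Some(align_down(bumped, align))
}
```

With an alignment of 8, 0x1237 is aligned down to 0x1230, 0x1231 is aligned up to 0x1238, and 0x1238 stays unchanged in both directions.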

Reusing Freed Memory

The heap memory is limited, so we should reuse freed memory for new allocations. This sounds simple, but is not so easy in practice since allocations can live arbitrarily long (and can be freed in an arbitrary order). This means that we need some kind of data structure to keep track of which memory areas are free and which are in use. This data structure should be very optimized since it causes overheads in both space (i.e. it needs backing memory) and time (i.e. accessing and organizing it needs CPU cycles).

Our bump allocator only keeps track of the next free memory address, which doesn’t suffice to keep track of freed memory areas. So our only choice is to ignore deallocations and leak the corresponding memory. Thus our allocator quickly runs out of memory in a real system, but it suffices for simple testing. Later in this post, we will introduce a better allocator that does not leak freed memory.

Using it as System Allocator

Above we saw that we can use a static allocator as system allocator through the global_allocator attribute:

#[global_allocator]
static ALLOCATOR: MyAllocator = MyAllocator {...};

This requires an implementation of Alloc for &MyAllocator, i.e. a shared reference. If we try to add such an implementation for our bump allocator (unsafe impl<'a> Alloc for &'a BumpAllocator), we have a problem: Our alloc method requires updating the next field, which is not possible for a shared reference.

One solution could be to put the bump allocator behind a Mutex and wrap it into a new type, for which we can implement Alloc for a shared reference:

struct LockedBumpAllocator(Mutex<BumpAllocator>);

unsafe impl<'a> Alloc for &'a LockedBumpAllocator {
    unsafe fn alloc(&mut self, layout: Layout) -> Result<*mut u8, AllocErr> {
        self.0.lock().alloc(layout)
    }

    unsafe fn dealloc(&mut self, ptr: *mut u8, layout: Layout) {
        self.0.lock().dealloc(ptr, layout)
    }
}

However, there is a more interesting solution for our bump allocator that avoids locking altogether. The idea is to exploit that we only need to update a single usize field by using an AtomicUsize type. This type uses special synchronized hardware instructions to ensure data race freedom without requiring locks.

A lock-free Bump Allocator

A lock-free implementation looks like this:

use core::sync::atomic::{AtomicUsize, Ordering};

/// A simple allocator that allocates memory linearly and ignores freed memory.
#[derive(Debug)]
pub struct BumpAllocator {
    heap_start: usize,
    heap_end: usize,
    next: AtomicUsize,
}

impl BumpAllocator {
    pub const fn new(heap_start: usize, heap_end: usize) -> Self {
        // NOTE: requires adding #![feature(const_atomic_usize_new)] to lib.rs
        Self { heap_start, heap_end, next: AtomicUsize::new(heap_start) }
    }
}

unsafe impl<'a> Alloc for &'a BumpAllocator {
    unsafe fn alloc(&mut self, layout: Layout) -> Result<*mut u8, AllocErr> {
        loop {
            // load current state of the `next` field
            let current_next = self.next.load(Ordering::Relaxed);
            let alloc_start = align_up(current_next, layout.align());
            let alloc_end = alloc_start.saturating_add(layout.size());

            if alloc_end <= self.heap_end {
                // update the `next` pointer if it still has the value `current_next`
                let next_now = self.next.compare_and_swap(current_next, alloc_end,
                    Ordering::Relaxed);
                if next_now == current_next {
                    // next address was successfully updated, allocation succeeded
                    return Ok(alloc_start as *mut u8);
                }
            } else {
                return Err(AllocErr::Exhausted{ request: layout })
            }
        }
    }

    unsafe fn dealloc(&mut self, ptr: *mut u8, layout: Layout) {
        // do nothing, leak memory
    }
}

The implementation is a bit more complicated now. First, there is now a loop around the whole method body, since we might need multiple tries until we succeed (e.g. if multiple threads try to allocate at the same time). Also, the load operation is an explicit method call now, i.e. self.next.load(Ordering::Relaxed) instead of just self.next. The ordering parameter makes it possible to restrict the automatic instruction reordering performed by both the compiler and the CPU itself. For example, it is used when implementing locks to ensure that no write to the locked variable happens before the lock is acquired. We don’t have such requirements, so we use the less restrictive Relaxed ordering.

The heart of this lock-free method is the compare_and_swap call that updates the next address:

...
let next_now = self.next.compare_and_swap(current_next, alloc_end,
    Ordering::Relaxed);
if next_now == current_next {
    // next address was successfully updated, allocation succeeded
    return Ok(alloc_start as *mut u8);
}
...

Compare-and-swap is a special CPU instruction that updates a variable with a given value if it still contains the value we expect. If it doesn’t, it means that another thread updated the value simultaneously, so we need to try again. The important feature is that this happens in a single uninterruptible operation (thus the name atomic), so no partial updates or intermediate states are possible.

In detail, compare_and_swap works by comparing next with the first argument and, in case they’re equal, updates next with the second parameter (the previous value is returned). To find out whether a switch happened, we check the returned previous value of next. If it is equal to the first parameter, the values were swapped. Otherwise, we try again in the next loop iteration.
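
The retry loop can also be exercised with real threads on a hosted target. The following sketch is a model under a few assumptions: AtomicBump is a hypothetical name, addresses are plain numbers, and it uses compare_exchange_weak, since compare_and_swap is deprecated in newer Rust versions in favor of the compare_exchange family. Instead of the previous value, these methods return a Result that says directly whether the update happened.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Lock-free model of the bump allocator (addresses only, no real memory).
pub struct AtomicBump {
    heap_end: usize,
    next: AtomicUsize,
}

impl AtomicBump {
    pub const fn new(heap_start: usize, heap_end: usize) -> Self {
        Self { heap_end, next: AtomicUsize::new(heap_start) }
    }

    /// Returns the start address of a new block or `None` if the heap is
    /// exhausted. `align` must be a power of two.
    pub fn alloc(&self, size: usize, align: usize) -> Option<usize> {
        assert!(align.is_power_of_two(), "`align` must be a power of 2");
        let mut current = self.next.load(Ordering::Relaxed);
        loop {
            let alloc_start = current.checked_add(align - 1)? & !(align - 1);
            let alloc_end = alloc_start.checked_add(size)?;
            if alloc_end > self.heap_end {
                return None;
            }
            // Store `alloc_end` only if `next` still equals `current`.
            match self.next.compare_exchange_weak(
                current,
                alloc_end,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(alloc_start),
                // Another thread was faster (or the weak variant failed
                // spuriously): retry with the value that is stored now.
                Err(actual) => current = actual,
            }
        }
    }
}
```

If several threads allocate from the same AtomicBump, every returned start address is handed out exactly once, because only one thread can win the exchange for a given value of next.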

Setting the Global Allocator

Now we can define a static bump allocator that we can set as system allocator:

pub const HEAP_START: usize = 0o_000_001_000_000_0000;
pub const HEAP_SIZE: usize = 100 * 1024; // 100 KiB

#[global_allocator]
static HEAP_ALLOCATOR: BumpAllocator = BumpAllocator::new(HEAP_START,
    HEAP_START + HEAP_SIZE);

We use 0o_000_001_000_000_0000 as heap start address, which is the address starting at the second P3 entry. It doesn’t really matter which address we choose here as long as it’s unused. We use a heap size of 100 KiB, which should be large enough for the near future.
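
The octal notation is convenient because each page table index is 9 bits wide, which is exactly three octal digits, and the page offset is 12 bits, i.e. four octal digits. As a small sketch (table_indices is a hypothetical helper, and u64 is used so that the shifts also compile on 32-bit hosts), we can split the heap start address into its four indices:

```rust
pub const HEAP_START: u64 = 0o_000_001_000_000_0000;

/// Splits a virtual address into its P4, P3, P2, and P1 table indices.
pub fn table_indices(addr: u64) -> [u64; 4] {
    [
        (addr >> 39) & 0o777, // P4 index
        (addr >> 30) & 0o777, // P3 index
        (addr >> 21) & 0o777, // P2 index
        (addr >> 12) & 0o777, // P1 index
    ]
}
```

For HEAP_START this yields the indices 0, 1, 0, and 0, which is the second entry of the first P3 table. The address itself is 0x40000000, i.e. exactly 1 GiB.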

Putting the above in the memory::heap_allocator module would make most sense, but unfortunately there is currently a weird bug in the global allocator implementation that requires putting the global allocator in the root module. I hope it’s fixed soon, but until then we need to put the above lines in src/lib.rs. For that, we need to make the memory::heap_allocator module public and add an import for BumpAllocator. We also need to add #![feature(global_allocator)] at the top of our lib.rs, since the global_allocator attribute is still unstable.

That’s it! We have successfully created and linked a custom system allocator. Now we’re ready to test it.

Testing

We should be able to allocate memory on the heap now. Let’s try it in our rust_main:

// in rust_main in src/lib.rs

use alloc::boxed::Box;
let heap_test = Box::new(42);

When we run it, a triple fault occurs and causes permanent rebooting. Let’s try to debug it using QEMU and objdump as described in the previous post:

> qemu-system-x86_64 -d int -no-reboot -cdrom build/os-x86_64.iso
…
check_exception old: 0xffffffff new 0xe
     0: v=0e e=0002 i=0 cpl=0 IP=0008:0000000000102860 pc=0000000000102860
        SP=0010:0000000000116af0 CR2=0000000040000000
…

Aha! It’s a page fault (v=0e) and was caused by the code at 0x102860. The code tried to write (e=0002) to address 0x40000000. This address is 0o_000_001_000_000_0000 in octal, which is the HEAP_START address defined above. Of course it page-faults: We have forgotten to map the heap memory to some physical memory.
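
The error code in the QEMU output is a bit field. As a rough sketch (PageFaultInfo and decode_page_fault are hypothetical names, and only four of the bits defined by the x86_64 architecture are decoded), it can be taken apart like this:

```rust
/// Decoded subset of the x86_64 page fault error code.
#[derive(Debug, PartialEq, Eq)]
pub struct PageFaultInfo {
    /// Bit 0: set for a protection violation, clear if the page was not present.
    pub protection_violation: bool,
    /// Bit 1: set if the access was a write, clear for a read.
    pub caused_by_write: bool,
    /// Bit 2: set if the access happened in user mode.
    pub user_mode: bool,
    /// Bit 4: set if the fault was caused by an instruction fetch.
    pub instruction_fetch: bool,
}

pub fn decode_page_fault(error_code: u64) -> PageFaultInfo {
    PageFaultInfo {
        protection_violation: error_code & (1 << 0) != 0,
        caused_by_write: error_code & (1 << 1) != 0,
        user_mode: error_code & (1 << 2) != 0,
        instruction_fetch: error_code & (1 << 4) != 0,
    }
}
```

For e=0002 only the write bit is set: kernel code tried to write to a page that is not present, which fits an unmapped heap.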

Some Refactoring

In order to map the heap cleanly, we do a bit of refactoring first. We move all memory initialization from our rust_main to a new memory::init function. Now our rust_main looks like this:

// in src/lib.rs

#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    // ATTENTION: we have a very small stack and no guard page
    vga_buffer::clear_screen();
    println!("Hello World{}", "!");

    let boot_info = unsafe {
        multiboot2::load(multiboot_information_address)
    };
    enable_nxe_bit();
    enable_write_protect_bit();

    // set up guard page and map the heap pages
    memory::init(boot_info);

    use alloc::boxed::Box;
    let heap_test = Box::new(42);

    println!("It did not crash!");

    loop {}
}

The memory::init function looks like this:

// in src/memory/mod.rs

use multiboot2::BootInformation;

pub fn init(boot_info: &BootInformation) {
    let memory_map_tag = boot_info.memory_map_tag().expect(
        "Memory map tag required");
    let elf_sections_tag = boot_info.elf_sections_tag().expect(
        "Elf sections tag required");

    let kernel_start = elf_sections_tag.sections()
        .filter(|s| s.is_allocated()).map(|s| s.addr).min().unwrap();
    let kernel_end = elf_sections_tag.sections()
        .filter(|s| s.is_allocated()).map(|s| s.addr + s.size).max()
        .unwrap();

    println!("kernel start: {:#x}, kernel end: {:#x}",
             kernel_start,
             kernel_end);
    println!("multiboot start: {:#x}, multiboot end: {:#x}",
             boot_info.start_address(),
             boot_info.end_address());

    let mut frame_allocator = AreaFrameAllocator::new(
        kernel_start as usize, kernel_end as usize,
        boot_info.start_address(), boot_info.end_address(),
        memory_map_tag.memory_areas());

    paging::remap_the_kernel(&mut frame_allocator, boot_info);
}

We’ve just moved the code to a new function. However, we’ve sneaked some improvements in:

An additional .filter(|s| s.is_allocated()) in the calculation of kernel_start and kernel_end. This ignores all sections that aren’t loaded to memory (such as debug sections). Thus, the kernel end address is no longer artificially increased by such sections.
We use the start_address() and end_address() methods of boot_info instead of calculating the addresses manually.
We use the alternate {:#x} form when printing kernel/multiboot addresses. Before, we used 0x{:x}, which leads to the same result. For a complete list of these “alternate” formatting forms, check out the std::fmt documentation.
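
The equivalence of the two hexadecimal forms is easy to check on a hosted target, where format! is available from std. The helper names below are made up for this sketch:

```rust
/// Formats `value` with the alternate form and with a manual prefix.
pub fn hex_forms(value: usize) -> (String, String) {
    (format!("{:#x}", value), format!("0x{:x}", value))
}

/// The `#` flag also adds prefixes for octal and binary output.
pub fn octal_and_binary(value: usize) -> (String, String) {
    (format!("{:#o}", value), format!("{:#b}", value))
}
```

Both strings returned by hex_forms are identical for every input, so the change only shortens the format string.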

Safety

It is important that the memory::init function is called only once, because it creates a new frame allocator based on kernel and multiboot start/end. When we call it a second time, a new frame allocator is created that reassigns the same frames, even if they are already in use.

In the second call it would use an identical frame allocator to remap the kernel. The remap_the_kernel function would request a frame from the frame allocator to create a new page table. But the returned frame is already in use, since we used it to create our current page table in the first call. In order to initialize the new table, the function zeroes it. This is the point where everything breaks, since we zero our current page table. The CPU is unable to read the next instruction and throws a page fault.

So we need to ensure that memory::init can only be called once. We could mark it as unsafe, which would bring it in line with Rust’s memory safety rules. However, that would just push the unsafety to the caller. The caller can still accidentally call the function twice; the only difference is that the mistake needs to happen inside unsafe blocks.

A better solution is to insert a check at the function’s beginning that panics if the function is called a second time. This approach has a small runtime cost, but we only call it once, so it’s negligible. And we avoid two unsafe blocks (one at the calling site and one at the function itself), which is always good.

In order to make such checks easy, I created a small crate named once. To add it, we run cargo add once and add the following to our src/lib.rs:

// in src/lib.rs

#[macro_use]
extern crate once;

The crate provides an assert_has_not_been_called! macro (sorry for the long name :D). We can use it to fix the safety problem easily:

// in src/memory/mod.rs

pub fn init(boot_info: &BootInformation) {
    assert_has_not_been_called!("memory::init must be called only once");

    let memory_map_tag = ...
    ...
}

That’s it. Now our memory::init function can only be called once. The macro works by creating a static AtomicBool named CALLED, which is initialized to false. When the macro is invoked, it checks the value of CALLED and sets it to true. If the value was already true before, the macro panics.
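
The described behavior can be sketched by hand as follows. This is not the actual expansion of the macro from the once crate; CalledFlag and first_call are hypothetical names, and the SeqCst ordering is an assumption of this sketch. The important detail is that swap stores true and returns the previous value in one atomic step, so two concurrent callers can never both see false.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A flag that remembers whether it was triggered before.
pub struct CalledFlag(AtomicBool);

impl CalledFlag {
    pub const fn new() -> Self {
        Self(AtomicBool::new(false))
    }

    /// Sets the flag and returns `true` only for the very first call.
    pub fn first_call(&self) -> bool {
        // `swap` writes `true` and hands back what was stored before.
        !self.0.swap(true, Ordering::SeqCst)
    }
}

static CALLED: CalledFlag = CalledFlag::new();

/// Stand-in for `memory::init`: panics when called a second time.
pub fn init() {
    assert!(CALLED.first_call(), "init must be called only once");
    // the actual initialization would follow here
}
```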

Mapping the Heap

Now we’re ready to map the heap pages. In order to do it, we need access to the ActivePageTable or Mapper instance (see the page table and kernel remapping posts). For that we return it from the paging::remap_the_kernel function:

// in src/memory/paging/mod.rs

pub fn remap_the_kernel<A>(allocator: &mut A, boot_info: &BootInformation)
    -> ActivePageTable // new
    where A: FrameAllocator
{
    ...
    println!("guard page at {:#x}", old_p4_page.start_address());

    active_table // new
}

Now we have full page table access in the memory::init function. This allows us to map the heap pages to physical frames:

// in src/memory/mod.rs

pub fn init(boot_info: &BootInformation) {
    ...

    let mut frame_allocator = ...;

    // below is the new part

    let mut active_table = paging::remap_the_kernel(&mut frame_allocator,
        boot_info);

    use self::paging::Page;
    use {HEAP_START, HEAP_SIZE};

    let heap_start_page = Page::containing_address(HEAP_START);
    let heap_end_page = Page::containing_address(HEAP_START + HEAP_SIZE-1);

    for page in Page::range_inclusive(heap_start_page, heap_end_page) {
        active_table.map(page, paging::WRITABLE, &mut frame_allocator);
    }
}

The Page::range_inclusive function is just a copy of the Frame::range_inclusive function:

// in src/memory/paging/mod.rs

#[derive(…, PartialEq, Eq, PartialOrd, Ord)]
pub struct Page {...}

impl Page {
    ...
    pub fn range_inclusive(start: Page, end: Page) -> PageIter {
        PageIter {
            start: start,
            end: end,
        }
    }
}

pub struct PageIter {
    start: Page,
    end: Page,
}

impl Iterator for PageIter {
    type Item = Page;

    fn next(&mut self) -> Option<Page> {
        if self.start <= self.end {
            let page = self.start;
            self.start.number += 1;
            Some(page)
        } else {
            None
        }
    }
}

Now we map the whole heap to physical frames. This needs some time and might introduce a noticeable delay when we increase the heap size in the future. Another drawback is that we consume a large number of physical frames even though we might not need the whole heap space. We will fix these problems in a future post by mapping the pages lazily.
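
To get a feeling for the numbers, the following sketch computes the page range of our heap on a host, using plain page numbers instead of the Page type and assuming 4 KiB pages (containing_page and heap_pages are hypothetical helpers). It also shows why the end page is calculated from HEAP_START + HEAP_SIZE - 1: the address HEAP_START + HEAP_SIZE already belongs to the first page after the heap.

```rust
use std::ops::RangeInclusive;

pub const PAGE_SIZE: usize = 4096;
pub const HEAP_START: usize = 0o_000_001_000_000_0000;
pub const HEAP_SIZE: usize = 100 * 1024; // 100 KiB

/// Number of the page that contains the given virtual address.
pub fn containing_page(addr: usize) -> usize {
    addr / PAGE_SIZE
}

/// Inclusive range of page numbers that cover the heap.
pub fn heap_pages() -> RangeInclusive<usize> {
    let start = containing_page(HEAP_START);
    let end = containing_page(HEAP_START + HEAP_SIZE - 1); // last heap byte
    start..=end
}
```

With a heap size of 100 KiB the loop maps 25 pages. Without the subtraction it would map a 26th page that the heap never uses.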

It works!

Now Box and Vec should work. For example:

// in rust_main in src/lib.rs

use alloc::boxed::Box;
let mut heap_test = Box::new(42);
*heap_test -= 15;
let heap_test2 = Box::new("hello");
println!("{:?} {:?}", heap_test, heap_test2);

let mut vec_test = vec![1,2,3,4,5,6,7];
vec_test[3] = 42;
for i in &vec_test {
    print!("{} ", i);
}

We can also use all other types of the alloc crate, including:

the reference counted pointers Rc and Arc
the owned string type String and the format! macro
LinkedList
the growable ring buffer VecDeque
BinaryHeap
BTreeMap and BTreeSet

A better Allocator

Right now, we leak every freed memory block. Thus, we run out of memory quickly, for example, by creating a new String in each iteration of a loop:

// in rust_main in src/lib.rs

for i in 0..10000 {
    format!("Some String");
}

To fix this, we need to create an allocator that keeps track of freed memory blocks and reuses them if possible. This introduces some challenges:

We need to keep track of a possibly unlimited number of freed blocks. For example, an application could allocate n one-byte sized blocks and free every second block, which creates n/2 freed blocks. We can’t rely on any upper bound on the number of freed blocks since n could be arbitrarily large.
We can’t use any of the collections from above, since they rely on allocations themselves. (It might be possible as soon as RFC #1398 is implemented, which allows user-defined allocators for specific collection instances.)
We need to merge adjacent freed blocks if possible. Otherwise, the freed memory is no longer usable for large allocations. We will discuss this point in more detail below.
Our allocator should search the set of freed blocks quickly and keep fragmentation low.

Creating a List of freed Blocks

Where do we store the information about an unlimited number of freed blocks? We can’t use any fixed size data structure since it could always be too small for some allocation sequences. So we need some kind of dynamically growing set.

One possible solution could be to use an array-like data structure that starts at some unused virtual address. If the array becomes full, we increase its size and map new physical frames as backing storage. This approach would require a large part of the virtual address space since the array could grow significantly. We would need to create a custom implementation of a growable array and manipulate the page tables when deallocating. It would also consume a possibly large number of physical frames as backing storage.

We will choose another solution with different tradeoffs. It’s not clearly “better” than the approach above and has significant disadvantages itself. However, it has one big advantage: It does not need any additional physical or virtual memory at all. This makes it less complex since we don’t need to manipulate any page tables. The idea is the following:

A freed memory block is not used anymore and no one needs the stored information. It is still mapped to a virtual address and backed by a physical frame. So we just store the information about the freed block in the block itself. We keep a pointer to the first block and store a pointer to the next block in each block. Thus, we create a singly linked list:

In the following, we call a freed block a hole. Each hole stores its size and a pointer to the next hole. If a hole is larger than needed, we leave the remaining memory unused. By storing a pointer to the first hole, we are able to traverse the complete list.
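
A hole as described above could be modeled like this. It is only a sketch with made-up names: in the kernel, each Hole would be written into the freed block itself and the references would live for the whole program, whereas here the holes are ordinary local values with a lifetime parameter so that the example runs on a host.

```rust
use std::mem::size_of;

/// A freed block: its size and an optional pointer to the next hole.
pub struct Hole<'a> {
    pub size: usize,
    pub next: Option<&'a mut Hole<'a>>,
}

/// Walks the list from the first hole and sums up the free bytes.
pub fn total_free(first: &Hole<'_>) -> usize {
    let mut sum = 0;
    let mut current = Some(first);
    while let Some(hole) = current {
        sum += hole.size;
        current = hole.next.as_deref(); // follow the link, if any
    }
    sum
}

/// Bytes needed to store the bookkeeping data of a single hole.
pub fn hole_header_size() -> usize {
    size_of::<Hole<'static>>()
}
```

Since a reference can never be null, the compiler represents None as a null pointer and Option<&mut Hole> takes no extra space. The header is therefore exactly two machine words, i.e. 16 bytes on x86_64.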

Initialization

When the heap is created, all of its memory is unused. Thus, it forms a single large hole:

The optional pointer to the next hole is set to None.

Allocation

In order to allocate a block of memory, we need to find a hole that satisfies the size and alignment requirements. If the found hole is larger than required, we split it into two smaller holes. For example, when we allocate a 24 byte block right after initialization, we split the single hole into a hole of size 24 and a hole with the remaining size:

Then we use the new 24 byte hole to perform the allocation:

To find a suitable hole, we can use several search strategies:

best fit: Search the whole list and choose the smallest hole that satisfies the requirements.
worst fit: Search the whole list and choose the largest hole that satisfies the requirements.
first fit: Search the list from the beginning and choose the first hole that satisfies the requirements.

Each strategy has its advantages and disadvantages. Best fit uses the smallest hole possible and leaves larger holes for large allocations. But splitting the smallest hole might create a tiny hole, which is too small for most allocations. In contrast, the worst fit strategy always chooses the largest hole. Thus, it does not create tiny holes, but it consumes the large block, which might be required for large allocations.

For our use case, the best fit strategy is better than worst fit. The reason is that we have a minimal hole size of 16 bytes, since each hole needs to be able to store a size (8 bytes) and a pointer to the next hole (8 bytes). Thus, even the best fit strategy leads to holes of usable size. Furthermore, we will need to allocate very large blocks occasionally (e.g. for DMA buffers).

However, both best fit and worst fit have a significant problem: They need to scan the whole list for each allocation in order to find the optimal block. This leads to long allocation times if the list is long. The first fit strategy does not have this problem, as it returns as soon as it finds a suitable hole. It is fairly fast for small allocations and might only need to scan the whole list for large allocations.
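
The three strategies differ only in which suitable hole they pick. As a simplified sketch, the functions below operate on a slice of hole sizes instead of a linked list and ignore alignment; each one returns the index of the chosen hole.

```rust
/// First fit: stop at the first hole that is large enough.
pub fn first_fit(holes: &[usize], size: usize) -> Option<usize> {
    holes.iter().position(|&hole| hole >= size)
}

/// Best fit: scan everything and take the smallest suitable hole.
pub fn best_fit(holes: &[usize], size: usize) -> Option<usize> {
    holes
        .iter()
        .enumerate()
        .filter(|&(_, &hole)| hole >= size)
        .min_by_key(|&(_, &hole)| hole)
        .map(|(index, _)| index)
}

/// Worst fit: scan everything and take the largest suitable hole.
pub fn worst_fit(holes: &[usize], size: usize) -> Option<usize> {
    holes
        .iter()
        .enumerate()
        .filter(|&(_, &hole)| hole >= size)
        .max_by_key(|&(_, &hole)| hole)
        .map(|(index, _)| index)
}
```

For holes of 64, 24, 128, and 32 bytes and a 30-byte request, first fit picks the 64-byte hole at index 0, best fit the 32-byte hole at index 3, and worst fit the 128-byte hole at index 2. Only first fit can return before it has looked at every hole.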

Deallocation

To deallocate a block of memory, we can just insert its corresponding hole somewhere into the list. However, we need to merge adjacent holes. Otherwise, we are unable to reuse the freed memory for larger allocations. For example:

In order to use these adjacent holes for a large allocation, we need to merge them to a single large hole first:

The easiest way to ensure that adjacent holes are always merged is to keep the hole list sorted by address. Thus, we only need to check the predecessor and the successor in the list when we free a memory block. If they are adjacent to the freed block, we merge the corresponding holes. Otherwise, we insert the freed block as a new hole at the correct position.
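
The merge logic can be sketched on a host with an address-sorted Vec. Note that this is purely illustrative: a real allocator can’t use Vec for its bookkeeping, and FreeRange and free_block are hypothetical names. The sketch also assumes that the freed range does not overlap an existing hole.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FreeRange {
    pub start: usize,
    pub size: usize,
}

impl FreeRange {
    fn end(&self) -> usize {
        self.start + self.size
    }
}

/// Inserts a freed block into the address-sorted hole list and merges it
/// with its direct neighbors if they are adjacent.
pub fn free_block(holes: &mut Vec<FreeRange>, start: usize, size: usize) {
    // Position that keeps the list sorted by start address.
    let index = holes.partition_point(|hole| hole.start < start);
    holes.insert(index, FreeRange { start, size });

    // Merge with the successor if the new hole ends where it begins.
    if index + 1 < holes.len() && holes[index].end() == holes[index + 1].start {
        let successor = holes.remove(index + 1);
        holes[index].size += successor.size;
    }
    // Merge with the predecessor if it ends where the new hole begins.
    if index > 0 && holes[index - 1].end() == holes[index].start {
        let freed = holes.remove(index);
        holes[index - 1].size += freed.size;
    }
}
```

Because the list is sorted, only the two direct neighbors ever need to be inspected, which covers all three merge cases: with the previous hole, with the next hole, and with both.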

Implementation

The detailed implementation would go beyond the scope of this post, since it contains several hidden difficulties. For example:

Several merge cases: Merge with the previous hole, merge with the next hole, merge with both holes.
We need to satisfy the alignment requirements, which requires additional splitting logic.
The minimal hole size of 16 bytes: We must not create smaller holes when splitting a hole.

I created the linked_list_allocator crate to handle all of these cases. It consists of a Heap struct that provides an allocate_first_fit and a deallocate method. It also contains a LockedHeap type that wraps Heap in a spinlock so that it’s usable as a static system allocator. If you are interested in the implementation details, check out the source code.

We need to add the extern crate to our Cargo.toml and our lib.rs:

> cargo add linked_list_allocator

// in src/lib.rs
extern crate linked_list_allocator;

Now we can change our global allocator:

use linked_list_allocator::LockedHeap;

#[global_allocator]
static HEAP_ALLOCATOR: LockedHeap = LockedHeap::empty();

We can’t initialize the linked list allocator statically, since it needs to initialize the first hole (as described above). This can’t be done at compile time, so the function can’t be a const function. Therefore we can only create an empty heap and initialize it later at runtime. For that, we add the following lines to our rust_main function:

// in src/lib.rs

#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    […]

    // set up guard page and map the heap pages
    memory::init(boot_info);

    // initialize the heap allocator
    unsafe {
        HEAP_ALLOCATOR.lock().init(HEAP_START, HEAP_START + HEAP_SIZE);
    }
    […]
}

It is important that we initialize the heap after mapping the heap pages, since the init function writes to the heap memory (the first hole).

Our kernel uses the new allocator now, so we can deallocate memory without leaking it. The example from above should work now without causing an OOM situation:

// in rust_main in src/lib.rs

for i in 0..10000 {
    format!("Some String");
}

Performance

The linked list based approach has some performance problems. Each allocation or deallocation might need to scan the complete list of holes in the worst case. However, I think it’s good enough for now, since our heap will stay relatively small for the near future. When our allocator becomes a performance problem eventually, we can just replace it with a faster alternative.

Summary

Now we’re able to use heap storage in our kernel without leaking memory. This allows us to effectively process dynamic data such as user supplied strings in the future. We can also use Rc and Arc to create types with shared ownership. And we have access to various data structures such as Vec or LinkedList, which will make our lives much easier. We even have some well tested and optimized binary heap and B-tree implementations!

What’s next?

This post concludes the section about memory management for now. We will revisit this topic eventually, but now it’s time to explore other topics. The upcoming posts will be about CPU exceptions and interrupts. We will catch page faults and double faults, which prevents triple faults, and create a driver to read keyboard input. The next post starts by setting up a so-called Interrupt Descriptor Table.

Remap the Kernel

Fri, 01 Jan 2016 00:00:00 +0000

In this post we will create a new page table to map the kernel sections correctly. Therefore we will extend the paging module to support modifications of inactive page tables as well. Then we will switch to the new table and secure our kernel stack by creating a guard page.

As always, you can find the source code on GitHub. Don’t hesitate to file issues there if you have any problems or improvement suggestions. There is also a comment section at the end of this page. Note that this post requires a current Rust nightly.

Motivation

In the previous post, we had a strange bug in the unmap function. Its reason was a silent stack overflow, which corrupted the page tables. Fortunately, our kernel stack is right above the page tables so that we noticed the overflow relatively quickly. This won’t be the case when we add threads with new stacks in the future. Then a silent stack overflow could overwrite some data without us noticing. But eventually some completely unrelated function fails because a variable changed its value.

As you can imagine, these kinds of bugs are horrendous to debug. For that reason we will create a new hierarchical page table in this post, which has a guard page below the stack. A guard page is basically an unmapped page that causes a page fault when accessed. Thus we can catch stack overflows right when they happen.

Also, we will use the information about kernel sections to map the various sections individually instead of blindly mapping the first gigabyte. To improve safety even further, we will set the correct page table flags for the various sections. Thus it won’t be possible to modify the contents of .text or to execute code from .data anymore.

Preparation

There are many things that can go wrong when we switch to a new table. Therefore it’s a good idea to set up a debugger. You should not need it when you follow this post, but it’s good to know how to debug a problem when it occurs¹.

We also update the Page and Frame types to make our lives easier. The Page struct gets some derived traits:

// in src/memory/paging/mod.rs

#[derive(Debug, Clone, Copy)]
pub struct Page {
    number: usize,
}

By making it Copy, we can still use it after passing it to functions such as map_to. We also make the Page::containing_address function public (if it isn’t already).

The Frame type gets a clone method too, but it does not implement the Clone trait:

// in src/memory/mod.rs

impl Frame {
    ...
    fn clone(&self) -> Frame {
        Frame { number: self.number }
    }
}

The big difference is that this clone method is private. If we implemented the Clone trait, it would be public and usable from other modules. For example, other modules could abuse it to free the same frame twice in the frame allocator.

So why do we implement Copy for Page and make even its constructor public, but keep Frame as private as possible? The reason is that we can easily check the status of a Page by looking at the page tables. For example, the map_to function can easily check that the given page is unused.

We can’t do that for a Frame. If we wanted to be sure that a given frame is unused, we would need to look at all mapped pages and verify that none of them is mapped to the given frame. Since this is impractical, we need to rely on the fact that a passed Frame is always unused. For that reason it must not be possible to create a new Frame or to clone one from other modules. The only valid way to get a frame is to allocate it from a FrameAllocator.
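The visibility rules described above can be tried out on the host. The following is a minimal sketch, not the kernel code: BumpAllocator and the number() accessor are hypothetical names invented for the demo, and the allocator simply hands out increasing frame numbers.

```rust
mod memory {
    /// A physical frame. The `number` field is private, so code outside
    /// this module cannot construct a `Frame` directly.
    #[derive(Debug, PartialEq, Eq)]
    pub struct Frame {
        number: usize,
    }

    impl Frame {
        /// Read-only accessor (hypothetical, only used by this demo).
        pub fn number(&self) -> usize {
            self.number
        }

        /// Private on purpose: only this module may duplicate a frame.
        fn clone(&self) -> Frame {
            Frame { number: self.number }
        }
    }

    /// Hypothetical stand-in for a real frame allocator.
    pub struct BumpAllocator {
        next: usize,
    }

    impl BumpAllocator {
        pub fn new(first_frame: usize) -> BumpAllocator {
            BumpAllocator { next: first_frame }
        }

        /// The only public way to obtain a `Frame`.
        pub fn allocate_frame(&mut self) -> Option<Frame> {
            let frame = Frame { number: self.next };
            self.next += 1;
            // Inside the module, the private `clone` is still usable.
            Some(frame.clone())
        }
    }
}
```

Outside of the memory module, both `memory::Frame { number: 0 }` and `frame.clone()` are rejected by the compiler, so every Frame that other modules hold must have come from the allocator.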

Recap: The Paging Module

This post builds upon the post about page tables, so let’s start by quickly recapitulating what we’ve done there.

We created a memory::paging module, which reads and modifies the hierarchical page table through recursive mapping. The owner of the active P4 table and thus all subtables is an ActivePageTable struct, which must be instantiated only once.

The ActivePageTable struct provides the following interface:

/// Translates a virtual to the corresponding physical address.
/// Returns `None` if the address is not mapped.
pub fn translate(&self, virtual_address: VirtualAddress) ->
    Option<PhysicalAddress>
{...}

/// Maps the page to the frame with the provided flags.
/// The `PRESENT` flag is added by default. Needs a
/// `FrameAllocator` as it might need to create new page tables.
pub fn map_to<A>(&mut self,
                 page: Page,
                 frame: Frame,
                 flags: EntryFlags,
                 allocator: &mut A)
    where A: FrameAllocator
{...}

/// Maps the page to some free frame with the provided flags.
/// The free frame is allocated from the given `FrameAllocator`.
pub fn map<A>(&mut self, page: Page, flags: EntryFlags, allocator: &mut A)
    where A: FrameAllocator
{...}

/// Identity map the given frame with the provided flags.
/// The `FrameAllocator` is used to create new page tables if needed.
pub fn identity_map<A>(&mut self,
                       frame: Frame,
                       flags: EntryFlags,
                       allocator: &mut A)
    where A: FrameAllocator
{...}


/// Unmaps the given page and adds all freed frames to the given
/// `FrameAllocator`.
fn unmap<A>(&mut self, page: Page, allocator: &mut A)
    where A: FrameAllocator
{...}

Overview

Our goal is to use the ActivePageTable functions to map the kernel sections correctly in a new page table. In pseudo code:

fn remap_the_kernel(boot_info: &BootInformation) {
    let new_table = create_new_table();

    for section in boot_info.elf_sections {
        for frame in section {
            new_table.identity_map(frame, section.flags);
        }
    }

    new_table.activate();
    create_guard_page_for_stack();
}

But the ActivePageTable methods, as the name suggests, only work for the active table. So we would need to activate new_table before we use identity_map. But this is not possible since it would cause an immediate page fault when the CPU tries to read the next instruction.

So we need a way to use the ActivePageTable methods on inactive page tables as well.

Inactive Tables

Let’s start by creating a type for inactive page tables. Like an ActivePageTable, an InactivePageTable owns a P4 table. The difference is that the inactive P4 table is not used by the CPU.

We create the struct in memory/paging/mod.rs:

pub struct InactivePageTable {
    p4_frame: Frame,
}

impl InactivePageTable {
    pub fn new(frame: Frame) -> InactivePageTable {
        // TODO zero and recursive map the frame
        InactivePageTable { p4_frame: frame }
    }
}

Without zeroing, the P4 table contains complete garbage and maps random memory. But we can’t zero it right now because the p4_frame is not mapped to a virtual address.

Well, maybe it’s still part of the identity mapped first gigabyte. Then we could zero it without problems since the physical address would be a valid virtual address, too. But this “solution” is hacky and won’t work after this post anymore (since we will remove all needless identity mapping).

Instead, we will temporarily map the frame to some virtual address.

Temporary Mapping

For this we add a TemporaryPage struct. We create it in a new temporary_page submodule to keep the paging module clean. It looks like this:

// src/memory/paging/mod.rs
mod temporary_page;

// src/memory/paging/temporary_page.rs

use super::Page;

pub struct TemporaryPage {
    page: Page,
}

We add methods to temporarily map and unmap the page:

use super::{ActivePageTable, VirtualAddress};
use memory::Frame;

impl TemporaryPage {
    /// Maps the temporary page to the given frame in the active table.
    /// Returns the start address of the temporary page.
    pub fn map(&mut self, frame: Frame, active_table: &mut ActivePageTable)
        -> VirtualAddress
    {
        use super::entry::WRITABLE;

        assert!(active_table.translate_page(self.page).is_none(),
                "temporary page is already mapped");
        active_table.map_to(self.page, frame, WRITABLE, ???);
        self.page.start_address()
    }

    /// Unmaps the temporary page in the active table.
    pub fn unmap(&mut self, active_table: &mut ActivePageTable) {
        active_table.unmap(self.page, ???)
    }
}

The ??? needs to be some FrameAllocator. We could just add an additional allocator argument but there is a better solution.

It takes advantage of the fact that we always map the same page. So the allocator only needs to hold 3 frames: one P3, one P2, and one P1 table (the P4 table is always mapped). This allows us to create a tiny allocator and add it as a field to the TemporaryPage struct itself:

pub struct TemporaryPage {
    page: Page,
    allocator: TinyAllocator,
}

impl TemporaryPage {
    // as above, but with `&mut self.allocator` instead of `???`
}

struct TinyAllocator([Option<Frame>; 3]);

Our tiny allocator just consists of 3 slots to store frames. It will be empty when the temporary page is mapped and full when all corresponding page tables are unmapped.

To turn TinyAllocator into a frame allocator, we need to add the trait implementation:

use memory::FrameAllocator;

impl FrameAllocator for TinyAllocator {
    fn allocate_frame(&mut self) -> Option<Frame> {
        for frame_option in &mut self.0 {
            if frame_option.is_some() {
                return frame_option.take();
            }
        }
        None
    }

    fn deallocate_frame(&mut self, frame: Frame) {
        for frame_option in &mut self.0 {
            if frame_option.is_none() {
                *frame_option = Some(frame);
                return;
            }
        }
        panic!("Tiny allocator can hold only 3 frames.");
    }
}

On allocation, we use the Option::take function to take an available frame from the first filled slot and on deallocation, we put the frame back into the first free slot.
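To see the slot behavior in isolation, here is a host-side sketch of the same logic. It is simplified under a few assumptions: Frame is a plain struct with a public shape, the FrameAllocator trait is left out, and filled_allocator is a hypothetical helper that stands in for filling the slots from another allocator.

```rust
#[derive(Debug, PartialEq, Eq)]
struct Frame {
    number: usize,
}

struct TinyAllocator([Option<Frame>; 3]);

impl TinyAllocator {
    /// Takes the frame out of the first filled slot, if any.
    fn allocate_frame(&mut self) -> Option<Frame> {
        for slot in &mut self.0 {
            if slot.is_some() {
                return slot.take();
            }
        }
        None
    }

    /// Puts the frame back into the first free slot.
    fn deallocate_frame(&mut self, frame: Frame) {
        for slot in &mut self.0 {
            if slot.is_none() {
                *slot = Some(frame);
                return;
            }
        }
        panic!("Tiny allocator can hold only 3 frames.");
    }
}

/// Hypothetical helper: builds an allocator holding frames 1, 2, and 3.
fn filled_allocator() -> TinyAllocator {
    TinyAllocator([
        Some(Frame { number: 1 }),
        Some(Frame { number: 2 }),
        Some(Frame { number: 3 }),
    ])
}
```

After three allocations the allocator is empty and returns None. A deallocated frame lands in the first free slot, so it is also the next one handed out.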

To finish the TinyAllocator, we add a constructor that fills it from some other allocator:

impl TinyAllocator {
    fn new<A>(allocator: &mut A) -> TinyAllocator
        where A: FrameAllocator
    {
        let mut f = || allocator.allocate_frame();
        let frames = [f(), f(), f()];
        TinyAllocator(frames)
    }
}

We use a little closure here that saves us some typing.

Now our TemporaryPage type is nearly complete. We only add one more method for convenience:

use super::table::{Table, Level1};

/// Maps the temporary page to the given page table frame in the active
/// table. Returns a reference to the now mapped table.
pub fn map_table_frame(&mut self,
                       frame: Frame,
                       active_table: &mut ActivePageTable)
                       -> &mut Table<Level1> {
    unsafe { &mut *(self.map(frame, active_table) as *mut Table<Level1>) }
}

This function interprets the given frame as a page table frame and returns a Table reference. We return a table of level 1 because it forbids calling the next_table methods. Calling next_table must not be possible since it’s not a page of the recursive mapping. To be able to return a Table<Level1>, we need to make the Level1 enum in memory/paging/table.rs public.

The unsafe block is safe since the VirtualAddress returned by the map function is always valid and the type cast just reinterprets the frame’s content.

To complete the temporary_page module, we add a TemporaryPage::new constructor:

pub fn new<A>(page: Page, allocator: &mut A) -> TemporaryPage
    where A: FrameAllocator
{
    TemporaryPage {
        page: page,
        allocator: TinyAllocator::new(allocator),
    }
}

Zeroing the InactivePageTable

Now we can use TemporaryPage to fix our InactivePageTable::new function:

// in src/memory/paging/mod.rs

use self::temporary_page::TemporaryPage;

impl InactivePageTable {
    pub fn new(frame: Frame,
               active_table: &mut ActivePageTable,
               temporary_page: &mut TemporaryPage)
               -> InactivePageTable {
        {
            let table = temporary_page.map_table_frame(frame.clone(),
                active_table);
            // now we are able to zero the table
            table.zero();
            // set up recursive mapping for the table
            table[511].set(frame.clone(), PRESENT | WRITABLE);
        }
        temporary_page.unmap(active_table);

        InactivePageTable { p4_frame: frame }
    }
}

We added two new arguments, active_table and temporary_page. We need an inner scope to ensure that the table variable is dropped before we try to unmap the temporary page again. This is required since the table variable exclusively borrows temporary_page as long as it’s alive.

Now we are able to create valid inactive page tables, which are zeroed and recursively mapped. But we still can’t modify them. To resolve this problem, we need to look at recursive mapping again.

Revisiting Recursive Mapping

Recursive mapping works by mapping the last P4 entry to the P4 table itself. Thus we can access the page tables by looping one or more times.

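For example, accessing a P3 table requires looping three times: the address translation passes through the recursive P4 entry three times before the last index selects the P3 table.

We can use the same mechanism to access inactive tables. The trick is to change the recursive mapping of the active P4 table to point to the inactive P4 table.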

Now the inactive table can be accessed exactly as the active table, even the magic addresses are the same. This allows us to use the ActivePageTable interface and the existing mapping methods for inactive tables, too. Note that everything besides the recursive mapping continues to work exactly as before since we’ve never changed the active table in the CPU.
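The “magic addresses” follow directly from the looping. The sketch below computes them on the host, assuming the recursive entry sits at index 511 as in this post; the function names are made up for illustration and are not part of the paging module.

```rust
/// Index of the recursive entry in the P4 table.
const RECURSIVE_INDEX: u64 = 511;

/// Builds a canonical virtual address from four 9-bit table indices.
fn address_from_indices(p4: u64, p3: u64, p2: u64, p1: u64) -> u64 {
    let address = (p4 << 39) | (p3 << 30) | (p2 << 21) | (p1 << 12);
    // Sign-extend bit 47 into bits 48..=63 to get a canonical address.
    if address & (1u64 << 47) != 0 {
        address | 0xffff_0000_0000_0000
    } else {
        address
    }
}

/// Loop four times: every index is the recursive one.
fn p4_table_address() -> u64 {
    let r = RECURSIVE_INDEX;
    address_from_indices(r, r, r, r)
}

/// Loop three times, then select the P3 table through its P4 index.
fn p3_table_address(p4_index: u64) -> u64 {
    let r = RECURSIVE_INDEX;
    address_from_indices(r, r, r, p4_index)
}

/// Loop twice, then walk P4 and P3.
fn p2_table_address(p4_index: u64, p3_index: u64) -> u64 {
    let r = RECURSIVE_INDEX;
    address_from_indices(r, r, p4_index, p3_index)
}

/// Loop once, then walk P4, P3, and P2.
fn p1_table_address(p4_index: u64, p3_index: u64, p2_index: u64) -> u64 {
    address_from_indices(RECURSIVE_INDEX, p4_index, p3_index, p2_index)
}
```

These addresses depend only on the indices, not on which P4 frame the recursive entry points to. That is why the same addresses reach the inactive table once the recursive entry is redirected.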

Implementation Draft

We add a method to ActivePageTable that temporarily changes the recursive mapping and executes a given closure in the new context:

pub fn with<F>(&mut self,
               table: &mut InactivePageTable,
               f: F)
    where F: FnOnce(&mut ActivePageTable)
{
    use x86_64::instructions::tlb;

    // overwrite recursive mapping
    self.p4_mut()[511].set(table.p4_frame.clone(), PRESENT | WRITABLE);
    tlb::flush_all();

    // execute f in the new context
    f(self);

    // TODO restore recursive mapping to original p4 table
}

It overwrites the 511th P4 entry and points it to the inactive table frame. Then it flushes the translation lookaside buffer (TLB), which still contains some old translations. We need to flush all pages that are part of the recursive mapping, so the easiest way is to flush the TLB completely.

Now that the recursive mapping points to the given inactive table, we execute the closure in the new context. The closure can call all active table methods such as translate or map_to. It could even call with again and chain another inactive table! Wait… that would not work.

In that case, the closure would call with again and thus change the recursive mapping of the inactive table to point to a second inactive table. If we then want to modify the P1 of the second inactive table, we land on the P1 of the first inactive table instead, since we never follow the pointer to the second table. Only when modifying a P2, P3, or P4 table do we really access the second inactive table. This inconsistency would break our mapping functions completely.

So we should really prohibit the closure from calling with again. We could add some runtime assertion that panics when the active table is not recursively mapped anymore. But a cleaner solution is to split off the mapping code from ActivePageTable into a new Mapper type.

Refactoring

We start by creating a new memory/paging/mapper.rs submodule and moving the ActivePageTable struct and its impl block to it. Then we rename it to Mapper and make all methods public (so we can still use them from the paging module). The with method is removed.

After adjusting the imports, the module should look like this:

// in memory/paging/mod.rs
mod mapper;

// memory/paging/mapper.rs

use super::{VirtualAddress, PhysicalAddress, Page, ENTRY_COUNT};
use super::entry::*;
use super::table::{self, Table, Level4, Level1};
use memory::{PAGE_SIZE, Frame, FrameAllocator};
use core::ptr::Unique;

pub struct Mapper {
    p4: Unique<Table<Level4>>,
}

impl Mapper {
    pub unsafe fn new() -> Mapper {...}

    pub fn p4(&self) -> &Table<Level4> {...}

    // the remaining mapping methods, all public
}

Now we create a new ActivePageTable struct in memory/paging/mod.rs:

pub use self::mapper::Mapper;
use core::ops::{Deref, DerefMut};

pub struct ActivePageTable {
    mapper: Mapper,
}

impl Deref for ActivePageTable {
    type Target = Mapper;

    fn deref(&self) -> &Mapper {
        &self.mapper
    }
}

impl DerefMut for ActivePageTable {
    fn deref_mut(&mut self) -> &mut Mapper {
        &mut self.mapper
    }
}

impl ActivePageTable {
    unsafe fn new() -> ActivePageTable {
        ActivePageTable {
            mapper: Mapper::new(),
        }
    }

    pub fn with<F>(&mut self,
                   table: &mut InactivePageTable,
                   f: F)
        where F: FnOnce(&mut Mapper) // `Mapper` instead of `ActivePageTable`
    {...}
}

The Deref and DerefMut implementations allow us to use the ActivePageTable exactly as before; for example, we can still call map_to on it (because of deref coercions). But the closure called in the with function can no longer invoke with again. The reason is that we changed the type of the generic F parameter a bit: Instead of an ActivePageTable, the closure just gets a Mapper as argument.
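The effect of this split can be reproduced with a stripped-down model. In the sketch below, Mapper only counts mappings and with does not touch any page table; both are placeholders chosen for the demo, so only the type-level restriction matches the real code.

```rust
use std::ops::{Deref, DerefMut};

/// Stand-in for the real `Mapper`; it only counts how often `map` ran.
struct Mapper {
    mapped_pages: usize,
}

impl Mapper {
    fn map(&mut self) {
        self.mapped_pages += 1;
    }
}

struct ActivePageTable {
    mapper: Mapper,
}

impl Deref for ActivePageTable {
    type Target = Mapper;

    fn deref(&self) -> &Mapper {
        &self.mapper
    }
}

impl DerefMut for ActivePageTable {
    fn deref_mut(&mut self) -> &mut Mapper {
        &mut self.mapper
    }
}

impl ActivePageTable {
    /// The real method would redirect the recursive mapping around the
    /// call; here we only forward to the closure.
    fn with<F>(&mut self, f: F)
        where F: FnOnce(&mut Mapper)
    {
        // Deref coercion turns `&mut ActivePageTable` into `&mut Mapper`.
        f(self);
    }
}
```

Calling table.map() works through the Deref implementations, and so does mapper.map() inside the closure. A nested mapper.with(...) does not compile, because Mapper has no with method.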

Restoring the Recursive Mapping

Right now, the with function overwrites the recursive mapping and calls the closure. But it does not restore the previous recursive mapping yet. So let’s fix that!

To back up the physical P4 frame of the active table, we can either read it from the 511th P4 entry (before we change it) or from the CR3 control register directly. We will do the latter as it should be faster and we already have an external crate that makes it easy:

use x86_64::registers::control_regs;
let backup = Frame::containing_address(
    unsafe { control_regs::cr3() } as usize
);

Why is it unsafe? Because reading the CR3 register leads to a CPU exception if the processor is not running in kernel mode (Ring 0). But this code will always run in kernel mode, so the unsafe block is completely safe here.

Now that we have a backup of the original P4 frame, we need a way to restore it after the closure has run. So we need to somehow modify the 511th entry of the original P4 frame, which is still the active table in the CPU. But we can’t access it because the recursive mapping now points to the inactive table.

It’s just not possible to access the active P4 entry in 4 steps, so we can’t reach it through the 4-level page table.

We could try to overwrite the recursive mapping of the inactive P4 table and point it back to the original P4 frame.

Then we could reach the active P4 entry in 4 steps and restore the original mapping in the active table. But this hack has a drawback: The inactive table is now invalid since it is no longer recursively mapped. We would need to fix it by using a temporary page again (as above).

But if we need a temporary page anyway, we can just use it to map the original P4 frame directly. Thus we avoid the above hack and make the code simpler. So let’s do it that way.

Completing the Implementation

The with method gets an additional TemporaryPage argument, which we use to back up and restore the original recursive mapping:

pub fn with<F>(&mut self,
               table: &mut InactivePageTable,
               temporary_page: &mut temporary_page::TemporaryPage, // new
               f: F)
    where F: FnOnce(&mut Mapper)
{
    use x86_64::instructions::tlb;
    use x86_64::registers::control_regs;

    {
        let backup = Frame::containing_address(
            control_regs::cr3().0 as usize);

        // map temporary_page to current p4 table
        let p4_table = temporary_page.map_table_frame(backup.clone(), self);

        // overwrite recursive mapping
        self.p4_mut()[511].set(table.p4_frame.clone(), PRESENT | WRITABLE);
        tlb::flush_all();

        // execute f in the new context
        f(self);

        // restore recursive mapping to original p4 table
        p4_table[511].set(backup, PRESENT | WRITABLE);
        tlb::flush_all();
    }

    temporary_page.unmap(self);
}

Again, the inner scope is needed to end the borrow of temporary_page so that we can unmap it again. Note that we need to flush the TLB another time after we restored the original recursive mapping.

Now the with function is ready to be used!

Remapping the Kernel

Let’s tackle the main task of this post: remapping the kernel sections. For this we create a remap_the_kernel function in memory/paging/mod.rs:

use multiboot2::BootInformation;
use memory::{PAGE_SIZE, Frame, FrameAllocator};

pub fn remap_the_kernel<A>(allocator: &mut A, boot_info: &BootInformation)
    where A: FrameAllocator
{
    let mut temporary_page = TemporaryPage::new(Page { number: 0xcafebabe },
        allocator);

    let mut active_table = unsafe { ActivePageTable::new() };
    let mut new_table = {
        let frame = allocator.allocate_frame().expect("no more frames");
        InactivePageTable::new(frame, &mut active_table, &mut temporary_page)
    };

    active_table.with(&mut new_table, &mut temporary_page, |mapper| {
        let elf_sections_tag = boot_info.elf_sections_tag()
            .expect("Elf sections tag required");

        for section in elf_sections_tag.sections() {
            // TODO mapper.identity_map() all pages of `section`
        }
    });
}

First, we create a temporary page at page number 0xcafebabe. We could use 0xdeadbeaf or 0x123456789 as well, as long as the page is unused. The active_table and the new_table are created using their constructor functions.

Then we use the with function to temporarily change the recursive mapping and execute the closure as if the new_table were active. This allows us to map the sections in the new table without changing the active mapping. To get the kernel sections, we use the Multiboot information structure.

Let’s resolve the above TODO by identity mapping the sections:

for section in elf_sections_tag.sections() {
    use self::entry::WRITABLE;

    if !section.is_allocated() {
        // section is not loaded to memory
        continue;
    }
    assert!(section.start_address() % PAGE_SIZE == 0,
            "sections need to be page aligned");

    println!("mapping section at addr: {:#x}, size: {:#x}",
        section.addr, section.size);

    let flags = WRITABLE; // TODO use real section flags

    let start_frame = Frame::containing_address(section.start_address());
    let end_frame = Frame::containing_address(section.end_address() - 1);
    for frame in Frame::range_inclusive(start_frame, end_frame) {
        mapper.identity_map(frame, flags, allocator);
    }
}

We skip all sections that were not loaded into memory (e.g. debug sections). We require that all sections are page aligned because a page must not contain sections with different flags. For example, a page that contains parts of both the .text and the .data section would need to be executable and writable at the same time. Thus we could modify the running code or execute bytes from the .data section as code.

To map a section, we iterate over all of its frames by using a new Frame::range_inclusive function (shown below). Note that the end address is exclusive, so it’s not part of the section anymore (it’s the first byte after the section). Thus we need to subtract 1 to get the end_frame.
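The off-by-one matters whenever a section ends exactly on a page boundary. Here is a small host-side sketch of the arithmetic, using plain frame numbers instead of the Frame type. The frame_range helper and its handling of empty sections are assumptions made for this demo.

```rust
const PAGE_SIZE: usize = 4096;

/// Number of the frame that contains the given physical address.
fn containing_frame(address: usize) -> usize {
    address / PAGE_SIZE
}

/// Inclusive range of frame numbers covered by `[start, start + size)`.
/// Returns `None` for an empty section (a choice made for this demo).
fn frame_range(start: usize, size: usize) -> Option<(usize, usize)> {
    if size == 0 {
        return None;
    }
    // `end` is exclusive: it is the first byte after the section.
    let end = start + size;
    // Subtracting 1 yields the last byte that belongs to the section.
    Some((containing_frame(start), containing_frame(end - 1)))
}
```

A section at 0x100000 with size 0x2000 covers exactly frames 0x100 and 0x101. Without the subtraction, the range would also include frame 0x102, which belongs to whatever follows the section.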

The Frame::range_inclusive function looks like this:

// in src/memory/mod.rs

impl Frame {
    fn range_inclusive(start: Frame, end: Frame) -> FrameIter {
        FrameIter {
            start: start,
            end: end,
        }
    }
}

struct FrameIter {
    start: Frame,
    end: Frame,
}

impl Iterator for FrameIter {
    type Item = Frame;

    fn next(&mut self) -> Option<Frame> {
        if self.start <= self.end {
            let frame = self.start.clone();
            self.start.number += 1;
            Some(frame)
        } else {
            None
        }
    }
}

Instead of creating a custom iterator, we could have used the Range struct of the standard library. But it requires that we implement the One and Add traits for Frame. Then every module could perform arithmetic operations on frames, for example let frame3 = frame1 + frame2. This would violate our safety invariants because frame3 could be already in use. The range_inclusive function does not have these problems because it is only available inside the memory module.

Page Align Sections

Right now our sections aren’t page aligned, so the assertion in remap_the_kernel would fail. We can fix this by making the section size a multiple of the page size. To do this, we add an ALIGN statement to all sections in the linker file. For example:

SECTIONS {
  . = 1M;

  .text :
  {
    *(.text .text.*)
    . = ALIGN(4K);
  }
}

The . is the “current location counter” and represents the current virtual address. At the beginning of the SECTIONS block we set it to 1M, so our kernel starts at 1MiB. We use the ALIGN function to align the current location counter to the next 4K boundary (4K is the page size). Thus the end of the .text section, and with it the beginning of the next section, is page aligned.
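The rounding that ALIGN(4K) performs on the location counter is ordinary power-of-two arithmetic. As a rough illustration in Rust (the align_up name is ours, not something the linker or our kernel provides):

```rust
const PAGE_SIZE: u64 = 4096;

/// Rounds `address` up to the next multiple of `align`.
/// `align` must be a power of two; aligned addresses stay unchanged.
fn align_up(address: u64, align: u64) -> u64 {
    assert!(align.is_power_of_two(), "alignment must be a power of two");
    (address + align - 1) & !(align - 1)
}
```

For example, align_up(0x100001, PAGE_SIZE) yields 0x101000, while the already aligned 0x100000 stays as it is.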

To put all sections on their own page, we add the ALIGN statement to all of them:

/* src/arch/x86_64/linker.ld */

ENTRY(start)

SECTIONS {
  . = 1M;

  .rodata :
  {
    /* ensure that the multiboot header is at the beginning */
    KEEP(*(.multiboot_header))
    *(.rodata .rodata.*)
    . = ALIGN(4K);
  }

  .text :
  {
    *(.text .text.*)
    . = ALIGN(4K);
  }

  .data :
  {
    *(.data .data.*)
    . = ALIGN(4K);
  }

  .bss :
  {
    *(.bss .bss.*)
    . = ALIGN(4K);
  }

  .got :
  {
    *(.got)
    . = ALIGN(4K);
  }

  .got.plt :
  {
    *(.got.plt)
    . = ALIGN(4K);
  }

  .data.rel.ro : ALIGN(4K) {
    *(.data.rel.ro.local*) *(.data.rel.ro .data.rel.ro.*)
    . = ALIGN(4K);
  }

  .gcc_except_table : ALIGN(4K) {
    *(.gcc_except_table)
    . = ALIGN(4K);
  }
}

Instead of page aligning the .multiboot_header section, we merge it into the .rodata section. That way, we don’t waste a whole page for the few bytes of the Multiboot header. We could merge it into any section, but .rodata fits best because it has the same flags (neither writable nor executable). The Multiboot header still needs to be at the beginning of the file, so .rodata must be our first section now.

Testing it

Time to test it! We re-export the remap_the_kernel function from the memory module and call it from rust_main:

// in src/memory/mod.rs
pub use self::paging::remap_the_kernel;

// in src/lib.rs
#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    // ATTENTION: we have a very small stack and no guard page

    // the same as before
    vga_buffer::clear_screen();
    println!("Hello World{}", "!");

    let boot_info = unsafe {
        multiboot2::load(multiboot_information_address)
    };
    let memory_map_tag = boot_info.memory_map_tag()
        .expect("Memory map tag required");
    let elf_sections_tag = boot_info.elf_sections_tag()
        .expect("Elf sections tag required");

    let kernel_start = elf_sections_tag.sections().map(|s| s.addr)
        .min().unwrap();
    let kernel_end = elf_sections_tag.sections().map(|s| s.addr + s.size)
        .max().unwrap();

    let multiboot_start = multiboot_information_address;
    let multiboot_end = multiboot_start + (boot_info.total_size as usize);

    println!("kernel start: 0x{:x}, kernel end: 0x{:x}",
        kernel_start, kernel_end);
    println!("multiboot start: 0x{:x}, multiboot end: 0x{:x}",
        multiboot_start, multiboot_end);

    let mut frame_allocator = memory::AreaFrameAllocator::new(
        kernel_start as usize, kernel_end as usize, multiboot_start,
        multiboot_end, memory_map_tag.memory_areas());

    // this is the new part
    memory::remap_the_kernel(&mut frame_allocator, boot_info);
    println!("It did not crash!");

    loop {}
}

If you see the It did not crash! message, the kernel survived our page table modifications without causing a CPU exception. But did we map the kernel sections correctly?

Let’s try it out by switching to the new table! We identity map all kernel sections, so it should work without problems.

Switching Tables

Switching tables is easy. We just need to reload the CR3 register with the physical address of the new P4 frame.

We do this in a new ActivePageTable::switch method:

// in `impl ActivePageTable` in src/memory/paging/mod.rs

pub fn switch(&mut self, new_table: InactivePageTable) -> InactivePageTable {
    use x86_64::PhysicalAddress;
    use x86_64::registers::control_regs;

    let old_table = InactivePageTable {
        p4_frame: Frame::containing_address(
            control_regs::cr3().0 as usize
        ),
    };
    unsafe {
        control_regs::cr3_write(PhysicalAddress(
            new_table.p4_frame.start_address() as u64));
    }
    old_table
}

This function activates the given inactive table and returns the previous active table as an InactivePageTable. We don’t need to flush the TLB here, as the CPU does it automatically when the P4 table is switched. In fact, the tlb::flush_all function, which we used above, does nothing more than reloading the CR3 register.

Now we are finally able to switch to the new table. We do it by adding the following lines to our remap_the_kernel function:

// in remap_the_kernel in src/memory/paging/mod.rs

...
active_table.with(&mut new_table, &mut temporary_page, |mapper| {
    ...
});

let old_table = active_table.switch(new_table);
println!("NEW TABLE!!!");

Let’s cross our fingers and run it…

… and it fails with a boot loop.

Debugging

A QEMU boot loop indicates that some CPU exception occurred. We can see all thrown CPU exceptions by starting QEMU with -d int:

> qemu-system-x86_64 -d int -no-reboot -cdrom build/os-x86_64.iso
...
check_exception old: 0xffffffff new 0xe
     0: v=0e e=0002 i=0 cpl=0 IP=0008:000000000010ab97 pc=000000000010ab97
        SP=0010:00000000001182d0 CR2=00000000000b8f00
...

These lines are the important ones. We can read a lot of useful information from them:

v=0e: An exception with number 0xe occurred, which is a page fault according to the OSDev Wiki.
e=0002: The CPU set an error code, which tells us why the exception occurred. The 0x2 bit tells us that it was caused by a write operation. And since the 0x1 bit is not set, the target page was not present.
IP=0008:000000000010ab97 or pc=000000000010ab97: The program counter register tells us that the exception occurred when the CPU tried to execute the instruction at 0x10ab97. We can disassemble this address to see the corresponding function. The 0008: prefix in IP indicates the code GDT segment.
SP=0010:00000000001182d0: The stack pointer was 0x1182d0 (the 0010: prefix indicates the data GDT segment). This tells us whether the stack overflowed.
CR2=00000000000b8f00: Finally, the most useful register. It tells us which virtual address caused the page fault. In our case it’s 0xb8f00, which is part of the VGA text buffer.

So let’s find out which function caused the exception:
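Reading the error code by hand gets tedious, so a small decoder can help while debugging. The sketch below follows the x86 page fault error code layout (bit 0 to bit 4); the PageFaultCause struct and its field names are invented for this example and are not part of our kernel.

```rust
/// Decoded view of the low bits of a page fault error code.
#[derive(Debug, PartialEq, Eq)]
struct PageFaultCause {
    /// Bit 0: set for a protection violation, clear if the page was
    /// not present.
    protection_violation: bool,
    /// Bit 1: set if the access was a write, clear for a read.
    write: bool,
    /// Bit 2: set if the access happened in user mode.
    user_mode: bool,
    /// Bit 3: set if a reserved bit was set in a page table entry.
    reserved_bit: bool,
    /// Bit 4: set if the fault was caused by an instruction fetch.
    instruction_fetch: bool,
}

fn decode_page_fault(error_code: u64) -> PageFaultCause {
    PageFaultCause {
        protection_violation: (error_code & 0x01) != 0,
        write: (error_code & 0x02) != 0,
        user_mode: (error_code & 0x04) != 0,
        reserved_bit: (error_code & 0x08) != 0,
        instruction_fetch: (error_code & 0x10) != 0,
    }
}
```

For the e=0002 from the QEMU output, only write is set: a write from kernel mode to a page that was not present.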

objdump -d build/kernel-x86_64.bin | grep -B100 "10ab97"

We disassemble our kernel and search for 10ab97. The -B100 option prints the 100 preceding lines too. The output tells us the responsible function:

...
000000000010aa80 <_ZN10vga_buffer6Writer10write_byte20h4601f5e405b6e89facaE>:
  10aa80:	55                   	push   %rbp
  ...
  10ab93:	66 8b 55 aa          	mov    -0x56(%rbp),%dx
  10ab97:	66 89 14 48          	mov    %dx,(%rax,%rcx,2)

The reason for the cryptic function name is Rust’s name mangling. But we can identify the vga_buffer::Writer::write_byte function nonetheless.

So the reason for the page fault is that the write_byte function tried to write to the VGA text buffer at 0xb8f00. Of course this provokes a page fault: We forgot to identity map the VGA buffer in the new page table.

The fix is pretty simple:

// in src/memory/paging/mod.rs

pub fn remap_the_kernel<A>(allocator: &mut A, boot_info: &BootInformation)
    where A: FrameAllocator
{
    ...
    active_table.with(&mut new_table, &mut temporary_page, |mapper| {
        ...
        for section in elf_sections_tag.sections() {
            ...
        }

        // identity map the VGA text buffer
        let vga_buffer_frame = Frame::containing_address(0xb8000); // new
        mapper.identity_map(vga_buffer_frame, WRITABLE, allocator); // new
    });

    let old_table = active_table.switch(new_table);
    println!("NEW TABLE!!!");
}

Now we should see the NEW TABLE!!! message (and also the It did not crash! line again). Congratulations! We successfully switched our kernel to a new page table!

Fixing the Frame Allocator

The same problem as above occurs when we try to use our AreaFrameAllocator again. Try to add the following to rust_main after switching to the new table:

// in src/lib.rs
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    ...
    memory::remap_the_kernel(&mut frame_allocator, boot_info);
    frame_allocator.allocate_frame(); // new: try to allocate a frame
    println!("It did not crash!");

This causes the same boot loop as above. The reason is that the AreaFrameAllocator uses the memory map of the Multiboot information structure. But we did not map the Multiboot structure, so it causes a page fault. To fix it, we identity map it as well:

// in `remap_the_kernel` in src/memory/paging/mod.rs
active_table.with(&mut new_table, &mut temporary_page, |mapper| {

    // … identity map the allocated kernel sections
    // … identity map the VGA text buffer

    // new:
    // identity map the multiboot info structure
    let multiboot_start = Frame::containing_address(boot_info.start_address());
    let multiboot_end = Frame::containing_address(boot_info.end_address() - 1);
    for frame in Frame::range_inclusive(multiboot_start, multiboot_end) {
        mapper.identity_map(frame, PRESENT, allocator);
    }
});

Normally the multiboot struct fits on one page. But GRUB can place it anywhere, so it could randomly cross a page boundary. Therefore we use range_inclusive to be on the safe side. Note that we need to subtract 1 to get the address of the last byte because the end address is exclusive.
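The off-by-one reasoning can be checked in isolation. The following is only a sketch, not the kernel's code: frame_range is a hypothetical helper, and frames are plain numbers instead of the Frame type.

```rust
const PAGE_SIZE: usize = 4096;

/// Hypothetical helper: returns the first and the last frame number touched
/// by the byte range `start..end`, where `end` is exclusive.
fn frame_range(start: usize, end: usize) -> (usize, usize) {
    assert!(end > start, "range must not be empty");
    // `end - 1` is the address of the last byte that belongs to the range.
    (start / PAGE_SIZE, (end - 1) / PAGE_SIZE)
}
```

A structure that occupies exactly 0x1000..0x2000 yields the frames 1 to 1. Without the subtraction, the result would be 1 to 2, and we would map a frame that the structure does not use.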

Now we should be able to allocate frames again.

Using the Correct Flags

Right now, our new table maps all kernel sections as writable and executable. To fix this, we add an EntryFlags::from_elf_section_flags function:

// in src/memory/paging/entry.rs

use multiboot2::ElfSection;

impl EntryFlags {
    pub fn from_elf_section_flags(section: &ElfSection) -> EntryFlags {
        use multiboot2::{ELF_SECTION_ALLOCATED, ELF_SECTION_WRITABLE,
            ELF_SECTION_EXECUTABLE};

        let mut flags = EntryFlags::empty();

        if section.flags().contains(ELF_SECTION_ALLOCATED) {
            // section is loaded to memory
            flags = flags | PRESENT;
        }
        if section.flags().contains(ELF_SECTION_WRITABLE) {
            flags = flags | WRITABLE;
        }
        if !section.flags().contains(ELF_SECTION_EXECUTABLE) {
            flags = flags | NO_EXECUTE;
        }

        flags
    }
}

It just converts the ELF section flags to page table flags.
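To see the mapping without the multiboot2 and bitflags types, here is a standalone sketch that works on raw integers. The SHF_* values are the section flag bits from the ELF specification; the function name page_flags_from_elf is made up for this example.

```rust
// ELF section header flag bits (from the ELF specification).
const SHF_WRITE: u64 = 0x1;
const SHF_ALLOC: u64 = 0x2;
const SHF_EXECINSTR: u64 = 0x4;

// Page table entry bits used in this post.
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const NO_EXECUTE: u64 = 1 << 63;

/// Hypothetical stand-in for `EntryFlags::from_elf_section_flags`.
fn page_flags_from_elf(section_flags: u64) -> u64 {
    let mut flags = 0;
    if section_flags & SHF_ALLOC != 0 {
        flags |= PRESENT; // section is loaded to memory
    }
    if section_flags & SHF_WRITE != 0 {
        flags |= WRITABLE;
    }
    if section_flags & SHF_EXECINSTR == 0 {
        flags |= NO_EXECUTE; // everything that is not code
    }
    flags
}
```

An allocated and executable section ends up as PRESENT only, while an allocated read-only data section becomes PRESENT | NO_EXECUTE.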

Now we can use it to fix the TODO in our remap_the_kernel function:

// in src/memory/paging/mod.rs

pub fn remap_the_kernel<A>(allocator: &mut A, boot_info: &BootInformation)
    where A: FrameAllocator
{
    ...
    active_table.with(&mut new_table, &mut temporary_page, |mapper| {
        ...
        for section in elf_sections_tag.sections() {
            ...
            if !section.is_allocated() {
                // section is not loaded to memory
                continue;
            }
            ...
            // this is the new part
            let flags = EntryFlags::from_elf_section_flags(section);
            ...
            for frame in Frame::range_inclusive(start_frame, end_frame) {
                mapper.identity_map(frame, flags, allocator);
            }
        }
        ...
    });
    ...
}

But when we test it now, we get a page fault again. We can use the same technique as above to get the responsible function. I won’t bother you with the QEMU output and just tell you the results:

This time the responsible function is control_regs::cr3_write() itself. From the error code we learn that it was a page protection violation caused by “reading a 1 in a reserved field”. So the page table had some reserved bit set that should always be 0. It must be the NO_EXECUTE flag, since it’s the only new bit that we set in the page table.

The NXE Bit

The reason is that the NO_EXECUTE bit must only be used when the NXE bit in the Extended Feature Enable Register (EFER) is set. That register is similar to Rust’s feature gating and can be used to enable all sorts of advanced CPU features. Since the NXE bit is off by default, we caused a page fault when we added the NO_EXECUTE bit to the page table.

So we need to enable the NXE bit. For that we use the x86_64 crate again:

// in lib.rs

fn enable_nxe_bit() {
    use x86_64::registers::msr::{IA32_EFER, rdmsr, wrmsr};

    let nxe_bit = 1 << 11;
    unsafe {
        let efer = rdmsr(IA32_EFER);
        wrmsr(IA32_EFER, efer | nxe_bit);
    }
}

The unsafe block is needed since accessing the EFER register is only allowed in kernel mode. But we are in kernel mode, so everything is fine.

When we call this function before calling remap_the_kernel, everything should work again.
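The read-modify-write step itself is plain bit arithmetic, so we can try it without touching a real register. The sketch below only models the value; with_nxe is a hypothetical helper, and the example value 0x500 assumes an EFER in which only the long mode bits (LME, bit 8, and LMA, bit 10) are set.

```rust
/// Bit 11 of the EFER register (MSR 0xC000_0080): no-execute enable.
const EFER_NXE: u64 = 1 << 11;

/// Hypothetical helper: returns the EFER value with the NXE bit set and
/// all other bits unchanged.
fn with_nxe(efer: u64) -> u64 {
    efer | EFER_NXE
}
```

For the assumed value 0x500 the result is 0xd00. Applying the function twice gives the same value, so calling enable_nxe_bit more than once would be harmless.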

The Write Protect Bit

Right now, we are still able to modify the .code and .rodata sections, even though we did not set the WRITABLE flag for them. The reason is that the CPU ignores this bit in kernel mode by default. To enable write protection for the kernel as well, we need to set the Write Protect bit in the CR0 register:

// in lib.rs

fn enable_write_protect_bit() {
    use x86_64::registers::control_regs::{cr0, cr0_write, Cr0};

    unsafe { cr0_write(cr0() | Cr0::WRITE_PROTECT) };
}

The cr0 functions are unsafe because accessing the CR0 register is only allowed in kernel mode.

If we haven’t forgotten to set the WRITABLE flag somewhere, it should still work without crashing.
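The effect of the Write Protect bit can be summarized as a small truth table. The following is a deliberately simplified model with a made-up function name; it looks only at present pages and ignores the user accessible bit.

```rust
/// Simplified model: may a write to a present page proceed?
/// `supervisor` means the CPU runs in kernel mode, `cr0_wp` is the
/// Write Protect bit, and `page_writable` is the WRITABLE flag.
fn write_allowed(supervisor: bool, cr0_wp: bool, page_writable: bool) -> bool {
    if supervisor && !cr0_wp {
        // Without write protection, the kernel may write to read-only pages.
        true
    } else {
        page_writable
    }
}
```

Only the combination of kernel mode and a cleared Write Protect bit bypasses the WRITABLE flag, which is the situation we were in before calling enable_write_protect_bit.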

Creating a Guard Page

The final step is to create a guard page for our kernel stack.

The decision to place the kernel stack right above the page tables was already useful to detect a silent stack overflow in the previous post. Now we profit from it again. Let’s look at our assembly .bss section again to understand why:

; in src/arch/x86_64/boot.asm

section .bss
align 4096
p4_table:
    resb 4096
p3_table:
    resb 4096
p2_table:
    resb 4096
stack_bottom:
    resb 4096 * 4
stack_top:

The old page tables are right below the stack. They are still identity mapped since they are part of the kernel’s .bss section. We just need to turn the old p4_table into a guard page to secure the kernel stack. That way we even reuse the memory of the old P3 and P2 tables to increase the stack size.

So let’s implement it:

// in src/memory/paging/mod.rs
pub fn remap_the_kernel<A>(allocator: &mut A, boot_info: &BootInformation)
    where A: FrameAllocator
{
    ...
    let old_table = active_table.switch(new_table);
    println!("NEW TABLE!!!");

    // below is the new part

    // turn the old p4 page into a guard page
    let old_p4_page = Page::containing_address(
      old_table.p4_frame.start_address()
    );
    active_table.unmap(old_p4_page, allocator);
    println!("guard page at {:#x}", old_p4_page.start_address());
}

Now we have a very basic guard page: The page below the stack is unmapped, so a stack overflow causes an immediate page fault. Thus, silent stack overflows are no longer possible.

Or to be precise, they are improbable. If we have a function with many big stack variables, it’s possible that the guard page is missed. For example, the following function could still corrupt memory below the stack:

fn stack_overflow() {
    let x = [0; 99999];
}

This creates a very big array on the stack, which is currently filled from bottom to top. Therefore it misses the guard page and overwrites some memory below the stack. Eventually it hits the bottom of the guard page and causes a page fault. But before, it messes up memory, which is bad.

Fortunately, there exists a solution called stack probes. The basic idea is to check all required stack pages at the beginning of each function. For example, a function that needs 9000 bytes on the stack would try to access SP + 0, SP + 4096, and SP + 2 * 4096 (SP is the stack pointer). If the stack is not big enough, the guard page is hit and a page fault occurs. The function can’t mess up memory anymore since the stack check occurs right at its start.
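To illustrate which accesses a probe performs, the following sketch lists the probed offsets for a given frame size. It is not how the compiler implements probes; probe_offsets is a hypothetical function that only reproduces the arithmetic of the example, using the same SP + n notation.

```rust
const PAGE_SIZE: usize = 4096;

/// Hypothetical helper: the offsets (relative to the stack pointer) that a
/// probe for a stack frame of `frame_size` bytes would touch, one per page.
fn probe_offsets(frame_size: usize) -> Vec<usize> {
    (0..frame_size).step_by(PAGE_SIZE).collect()
}
```

For 9000 bytes this yields the three offsets 0, 4096, and 8192. For the array in stack_overflow above (99999 elements of 4 bytes each, assuming the default i32 element type) it yields 98 offsets, so no page is skipped and the guard page would be hit.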

Unfortunately, stack probes require compiler support. They already work on Windows but they don’t exist on Linux yet. The problem seems to be in LLVM, which Rust uses as backend. Hopefully it gets resolved soon so that our kernel stack becomes safe. For the current status and more information about stack probes check out the tracking issue.

What’s next?

Now that we have a (mostly) safe kernel stack and a working page table module, we can add a virtual memory allocator. The next post will explore Rust’s allocator API and create a very basic allocator. At the end of that post, we will be able to use Rust’s allocation and collections types such as Box, Vec, or even BTreeMap.

Footnotes

For this post the most useful GDB command is probably p/x *((long int*)0xfffffffffffff000)@512. It prints all entries of the recursively mapped P4 table by interpreting it as an array of 512 long ints (the @512 is GDB’s array syntax). Of course you can also print other tables by adjusting the address.

Page Tables

Wed, 09 Dec 2015 00:00:00 +0000

In this post we will create a paging module, which allows us to access and modify the 4-level page table. We will explore recursive page table mapping and use some Rust features to make it safe. Finally we will create functions to translate virtual addresses and to map and unmap pages.

You can find the source code and this post itself on GitHub. Please file an issue there if you have any problems or improvement suggestions. There is also a comment section at the end of this page. Note that this post requires a current Rust nightly.

Paging

Paging is a memory management scheme that separates virtual and physical memory. The address space is split into equal sized pages and page tables specify which virtual page points to which physical frame. For an extensive paging introduction take a look at the paging chapter (PDF) of the Three Easy Pieces OS book.

The x86 architecture uses a 4-level page table in 64-bit mode. A virtual address has the following structure:

The bits 48–63 are so-called sign extension bits and must be copies of bit 47. The following 36 bits define the page table indexes (9 bits per table) and the last 12 bits specify the offset in the 4KiB page.

Each table has 2^9 = 512 entries and each entry is 8 bytes. Thus a page table fits exactly in one page (4 KiB).

To translate an address, the CPU reads the P4 address from the CR3 register. Then it uses the indexes to walk the tables:

The P4 entry points to a P3 table, where the next 9 bits of the address are used to select an entry. The P3 entry then points to a P2 table and the P2 entry points to a P1 table. The P1 entry, which is specified through bits 12–20, finally points to the physical frame.
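The split of an address into its four table indexes and the page offset can be written down directly. This is a standalone sketch with a made-up function name, operating on a plain u64 instead of the types we define later:

```rust
/// Splits a virtual address into (P4 index, P3 index, P2 index, P1 index,
/// page offset). Each index is 9 bits wide, the offset 12 bits.
fn split_address(address: u64) -> (u64, u64, u64, u64, u64) {
    (
        (address >> 39) & 0o777, // bits 39-47
        (address >> 30) & 0o777, // bits 30-38
        (address >> 21) & 0o777, // bits 21-29
        (address >> 12) & 0o777, // bits 12-20
        address & 0o7777,        // bits 0-11
    )
}
```

For example, the address 0xb8f00 inside the VGA text buffer uses entry 0 of the P4, P3, and P2 tables, entry 184 of the P1 table, and the offset 0xf00.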

A Basic Paging Module

Let’s create a basic paging module in memory/paging/mod.rs:

use memory::PAGE_SIZE; // needed later

const ENTRY_COUNT: usize = 512;

pub type PhysicalAddress = usize;
pub type VirtualAddress = usize;

pub struct Page {
   number: usize,
}

We import the PAGE_SIZE and define a constant for the number of entries per table. To make future function signatures more expressive, we can use the type aliases PhysicalAddress and VirtualAddress. The Page struct is similar to the Frame struct in the previous post, but represents a virtual page instead of a physical frame.

Page Table Entries

To model page table entries, we create a new entry submodule:

use memory::Frame; // needed later

pub struct Entry(u64);

impl Entry {
    pub fn is_unused(&self) -> bool {
        self.0 == 0
    }

    pub fn set_unused(&mut self) {
        self.0 = 0;
    }
}

We define that an unused entry is completely 0. That allows us to distinguish unused entries from other non-present entries in the future. For example, we could define one of the available bits as the swapped_out bit for pages that are swapped to disk.

Next we will model the contained physical address and the various flags. Remember, entries have the following format:

Bit(s)	Name	Meaning
0	present	the page is currently in memory
1	writable	it’s allowed to write to this page
2	user accessible	if not set, only kernel mode code can access this page
3	write through caching	writes go directly to memory
4	disable cache	no cache is used for this page
5	accessed	the CPU sets this bit when this page is used
6	dirty	the CPU sets this bit when a write to this page occurs
7	huge page/null	must be 0 in P1 and P4, creates a 1GiB page in P3, creates a 2MiB page in P2
8	global	page isn’t flushed from caches on address space switch (PGE bit of CR4 register must be set)
9-11	available	can be used freely by the OS
12-51	physical address	the page aligned 52-bit physical address of the frame or the next page table
52-62	available	can be used freely by the OS
63	no execute	forbid executing code on this page (the NXE bit in the EFER register must be set)

To model the various flags, we will use the bitflags crate. To add it as a dependency, add the following to your Cargo.toml:

[dependencies]
...
bitflags = "0.9.1"

To import the macro, we need to use #[macro_use] above the extern crate definition:

// in src/lib.rs
#[macro_use]
extern crate bitflags;

Now we can model the various flags:

bitflags! {
    pub struct EntryFlags: u64 {
        const PRESENT =         1 << 0;
        const WRITABLE =        1 << 1;
        const USER_ACCESSIBLE = 1 << 2;
        const WRITE_THROUGH =   1 << 3;
        const NO_CACHE =        1 << 4;
        const ACCESSED =        1 << 5;
        const DIRTY =           1 << 6;
        const HUGE_PAGE =       1 << 7;
        const GLOBAL =          1 << 8;
        const NO_EXECUTE =      1 << 63;
    }
}

To extract the flags from the entry we create an Entry::flags method that uses from_bits_truncate:

pub fn flags(&self) -> EntryFlags {
    EntryFlags::from_bits_truncate(self.0)
}

This allows us to check for flags through the contains() function. For example, flags().contains(PRESENT | WRITABLE) returns true if the entry contains both flags.

To extract the physical address, we add a pointed_frame method:

pub fn pointed_frame(&self) -> Option<Frame> {
    if self.flags().contains(PRESENT) {
        Some(Frame::containing_address(
            self.0 as usize & 0x000fffff_fffff000
        ))
    } else {
        None
    }
}

If the entry is present, we mask bits 12–51 and return the corresponding frame. If the entry is not present, it does not point to a valid frame so we return None.

To modify entries, we add a set method that updates the flags and the pointed frame:

pub fn set(&mut self, frame: Frame, flags: EntryFlags) {
    assert!(frame.start_address() & !0x000fffff_fffff000 == 0);
    self.0 = (frame.start_address() as u64) | flags.bits();
}

The start address of a frame should be page aligned and smaller than 2^52 (since x86 uses 52-bit physical addresses). Since an invalid address could mess up the entry, we add an assertion. To actually set the entry, we just need to bitwise OR the start address and the flag bits.
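The encoding is easy to try out on raw integers. The following standalone sketch mirrors set, pointed_frame, and the contains check, but without the Entry, Frame, and EntryFlags types; the function names encode, address_of, and contains are chosen for this example only.

```rust
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const NO_EXECUTE: u64 = 1 << 63;
/// Bits 12-51 hold the physical address.
const ADDRESS_MASK: u64 = 0x000f_ffff_ffff_f000;

/// Combines a frame start address and flag bits into a raw entry.
fn encode(frame_start: u64, flags: u64) -> u64 {
    assert!(frame_start & !ADDRESS_MASK == 0, "unaligned or too large");
    frame_start | flags
}

/// Returns the physical address stored in a present entry.
fn address_of(entry: u64) -> Option<u64> {
    if entry & PRESENT != 0 {
        Some(entry & ADDRESS_MASK)
    } else {
        None
    }
}

/// True if all bits of `required` are set in `entry`.
fn contains(entry: u64, required: u64) -> bool {
    entry & required == required
}
```

Mapping the frame at 0x1000 as present and writable gives the raw entry 0x1003. The mask removes the flags again, including NO_EXECUTE in bit 63, so address_of returns 0x1000 in both cases.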

The missing Frame::start_address method is pretty simple:

use self::paging::PhysicalAddress;

fn start_address(&self) -> PhysicalAddress {
    self.number * PAGE_SIZE
}

We add it to the impl Frame block in memory/mod.rs.

Page Tables

To model page tables, we create a basic Table struct in a new table submodule:

use memory::paging::entry::*;
use memory::paging::ENTRY_COUNT;

pub struct Table {
    entries: [Entry; ENTRY_COUNT],
}

It’s just an array of 512 page table entries.

To make the Table indexable itself, we can implement the Index and IndexMut traits:

use core::ops::{Index, IndexMut};

impl Index<usize> for Table {
    type Output = Entry;

    fn index(&self, index: usize) -> &Entry {
        &self.entries[index]
    }
}

impl IndexMut<usize> for Table {
    fn index_mut(&mut self, index: usize) -> &mut Entry {
        &mut self.entries[index]
    }
}

Now it’s possible to get the entry at index 42 through some_table[42]. Of course we could replace usize with u32 or even u16 here, but it would cause more numerical conversions (x as u16).

Let’s add a method that sets all entries to unused. We will need it when we create new page tables in the future. The method looks like this:

pub fn zero(&mut self) {
    for entry in self.entries.iter_mut() {
        entry.set_unused();
    }
}

Now we can read page tables and retrieve the mapping information. We can also update them through the IndexMut trait and the Entry::set method. But how do we get references to the various page tables?

We could read the CR3 register to get the physical address of the P4 table and read its entries to get the P3 addresses. The P3 entries then point to the P2 tables and so on. But this method only works for identity-mapped pages. In the future we will create new page tables, which aren’t in the identity-mapped area anymore. Since we can’t access them through their physical address, we need a way to map them to virtual addresses.

Mapping Page Tables

So how do we map the page tables themselves? We don’t have that problem for the current P4, P3, and P2 tables since they are part of the identity-mapped area, but we need a way to access future tables, too.

One solution is to identity map all page tables. That way we would not need to differentiate virtual and physical addresses and could easily access the tables. But it clutters the virtual address space and increases fragmentation. And it makes creating page tables much more complicated since we need a physical frame whose corresponding page isn’t already used for something else.

An alternative solution is to map the page tables only temporarily. To read/write a page table, we would map it to some free virtual address until we’re done. We could use a small pool of such virtual addresses and reuse them for various tables. This method occupies only a few virtual addresses and thus is a good solution for 32-bit systems, which have small address spaces. But it makes things much more complicated since we need to temporarily map up to 4 tables to access a single page. And the temporary mapping requires modification of other page tables, which need to be mapped, too.

We will solve the problem in another way using a trick called recursive mapping.

Recursive Mapping

The trick is to map the P4 table recursively: The last entry doesn’t point to a P3 table, but to the P4 table itself. We can use this entry to remove a translation level so that we land on a page table instead. For example, we can “loop” once to access a P1 table:

By selecting the 511th P4 entry, which points to the P4 table itself, the P4 table is used as the P3 table. Similarly, the P3 table is used as a P2 table and the P2 table is treated like a P1 table. Thus the P1 table becomes the target page and can be accessed through the offset.

It’s also possible to access P2 tables by looping twice. And if we select the 511th entry three times, we can access and modify P3 tables:

So we just need to specify the desired P3 table in the address through the P1 index. By choosing the 511th entry multiple times, we stay on the P4 table until the address’s P1 index becomes the actual P4 index.

To access the P4 table itself, we loop once more and thus never leave the frame:

So we can access and modify page tables of all levels by just setting one P4 entry once. Most work is done by the CPU; we just use the recursive entry to remove one or more translation levels. It may seem a bit strange at first, but it’s a clean and simple solution once you have wrapped your head around it.

By using recursive mapping, each page table is accessible through a unique virtual address. The math checks out, too: If all page tables are used, there is 1 P4 table, 511 P3 tables (the last entry is used for the recursive mapping), 511*512 P2 tables, and 511*512*512 P1 tables. So there are 134217728 page tables altogether. Each page table occupies 4KiB, so we need 134217728 * 4KiB = 512GiB to store them. That’s exactly the amount of memory that can be accessed through one P4 entry since 4KiB per page * 512 P1 entries * 512 P2 entries * 512 P3 entries = 512GiB.
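If you want to double-check the numbers, the calculation fits into a few lines. The function names are made up for this sketch:

```rust
/// Number of page tables if every table of every level is in use.
fn page_table_count() -> u64 {
    let p4 = 1;
    let p3 = 511; // the last P4 entry is the recursive one
    let p2 = 511 * 512;
    let p1 = 511 * 512 * 512;
    p4 + p3 + p2 + p1
}

/// Memory needed to store all of these tables, in bytes.
fn page_table_bytes() -> u64 {
    page_table_count() * 4096
}
```

The count is 134217728, which equals 2^27, and the total size is 2^39 bytes, i.e. 512GiB.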

Of course recursive mapping has some disadvantages, too. It occupies a P4 entry and thus 512GiB of the virtual address space. But since we’re in long mode and have a 48-bit address space (256TiB), there are still 255.5TiB left. The bigger problem is that only the active table can be modified by default. To access another table, the recursive entry needs to be replaced temporarily. We will tackle this problem in the next post when we switch to a new page table.

Implementation

To map the P4 table recursively, we just need to point the 511th entry to the table itself. Of course we could do it in Rust, but it would require some fiddling with unsafe pointers. It’s easier to just add some lines to our boot assembly:

mov eax, p4_table
or eax, 0b11 ; present + writable
mov [p4_table + 511 * 8], eax

I put it right after the set_up_page_tables label, but you can add it wherever you like.

Now we can use special virtual addresses to access the page tables. The P4 table is available at 0xfffffffffffff000. Let’s add a P4 constant to the table submodule:

pub const P4: *mut Table = 0xffffffff_fffff000 as *mut _;

Let’s switch to the octal system, since it makes more sense for the other special addresses. The P4 address from above is equivalent to 0o177777_777_777_777_777_0000 in octal. You can see that it has index 777 in all tables and offset 0000. The 177777 bits on the left are the sign extension bits, which are copies of bit 47. They are required because x86 only uses 48-bit virtual addresses.

The other tables can be accessed through the following addresses:

Table	Address	Indexes
P4	`0o177777_777_777_777_777_0000`	–
P3	`0o177777_777_777_777_XXX_0000`	`XXX` is the P4 index
P2	`0o177777_777_777_XXX_YYY_0000`	like above, and `YYY` is the P3 index
P1	`0o177777_777_XXX_YYY_ZZZ_0000`	like above, and `ZZZ` is the P2 index

If we look closely, we can see that the P3 address is equal to (P4 << 9) | XXX_0000. And the P2 address is calculated through (P3 << 9) | YYY_0000. So to get the next address, we need to shift it 9 bits to the left and add the table index. As a formula:

next_table_address = (table_address << 9) | (index << 12)
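We can check the formula against the table above with plain integer arithmetic. This is a standalone sketch on u64 values; unlike the Table method we add next, it takes the table address as an explicit parameter and does not look at any entry flags.

```rust
/// Virtual address of the recursively mapped P4 table.
const P4_ADDRESS: u64 = 0o177777_777_777_777_777_0000;

/// Address of the table that entry `index` of the table at
/// `table_address` points to, under recursive mapping.
fn next_table_address(table_address: u64, index: u64) -> u64 {
    assert!(index < 512);
    // The shift drops the uppermost 9 bits. The sign extension survives
    // because the bits that move into its place are all 1.
    (table_address << 9) | (index << 12)
}
```

Starting at the P4 table and following entry 42 (0o052) gives the P3 address 0o177777_777_777_777_052_0000. Following entry 1 of that table gives the P2 address 0o177777_777_777_052_001_0000, which matches the patterns in the table.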

The `next_table` Methods

Let’s add the above formula as a Table method:

fn next_table_address(&self, index: usize) -> Option<usize> {
    let entry_flags = self[index].flags();
    if entry_flags.contains(PRESENT) && !entry_flags.contains(HUGE_PAGE) {
        let table_address = self as *const _ as usize;
        Some((table_address << 9) | (index << 12))
    } else {
        None
    }
}

The next table address is only valid if the corresponding entry is present and does not create a huge page. Then we can do some pointer casting to get the table address and use the formula to calculate the next address.

If the index is out of bounds, the function will panic since Rust checks array bounds. The panic is desired here since a wrong index should not be possible and indicates a bug.

To convert the address into references, we add two functions:

pub fn next_table(&self, index: usize) -> Option<&Table> {
    self.next_table_address(index)
        .map(|address| unsafe { &*(address as *const _) })
}

pub fn next_table_mut(&mut self, index: usize) -> Option<&mut Table> {
    self.next_table_address(index)
        .map(|address| unsafe { &mut *(address as *mut _) })
}

We convert the address into raw pointers through as casts and then convert them into Rust references through &mut *. The latter is an unsafe operation since Rust can’t guarantee that the raw pointer is valid.

Note that self stays borrowed as long as the returned reference is valid. This is because of Rust’s lifetime elision rules. Basically, these rules say that the lifetime of an output reference is the same as the lifetime of the input reference by default. So the above function signatures are expanded to:

pub fn next_table<'a>(&'a self, index: usize) -> Option<&'a Table> {...}

pub fn next_table_mut<'a>(&'a mut self, index: usize)
    -> Option<&'a mut Table>
{...}

Note the additional lifetime parameters, which are identical for input and output references. That’s exactly what we want. It ensures that we can’t modify tables as long as we have references to lower tables. For example, it would be very bad if we could unmap a P3 table if we still write to one of its P2 tables.

Safety

Now we can start at the P4 constant and use the next_table functions to access the lower tables. And we don’t even need unsafe blocks to do it! Right now, your alarm bells should be ringing. Thanks to Rust, everything we’ve done before in this post was completely safe. But we just introduced two unsafe blocks to convince Rust that there are valid tables at the specified addresses. Can we really be sure?

First, these addresses are only valid if the P4 table is mapped recursively. Since the paging module will be the only module that modifies page tables, we can introduce an invariant for the module:

The 511th entry of the active P4 table must always be mapped to the active P4 table itself.

So if we switch to another P4 table at some time, it needs to be recursively mapped before it becomes active. As long as we obey this invariant, we can safely use the special addresses. But even with this invariant, there is a big problem with the two methods:

What happens if we call them on a P1 table?

Well, they would calculate the address of the next table (which does not exist) and treat it as a page table. Either they construct an invalid address (if XXX < 400)¹ or access the mapped page itself. That way, we could easily corrupt memory or cause CPU exceptions by accident. So these two functions are not safe in Rust terms. Thus we need to make them unsafe functions unless we find some clever solution.

Some Clever Solution

We can use Rust’s type system to statically guarantee that the next_table methods can only be called on P4, P3, and P2 tables, but not on a P1 table. The idea is to add a Level parameter to the Table type and implement the next_table methods only for level 4, 3, and 2.

To model the levels we use a trait and empty enums:

pub trait TableLevel {}

pub enum Level4 {}
pub enum Level3 {}
pub enum Level2 {}
pub enum Level1 {}

impl TableLevel for Level4 {}
impl TableLevel for Level3 {}
impl TableLevel for Level2 {}
impl TableLevel for Level1 {}

An empty enum has size zero and disappears completely after compiling. Unlike an empty struct, it’s not possible to instantiate an empty enum. Since we will use TableLevel and the table levels in exported types, they need to be public.

To differentiate the P1 table from the other tables, we introduce a HierarchicalLevel trait, which is a subtrait of TableLevel. But we implement it only for the levels 4, 3, and 2:

pub trait HierarchicalLevel: TableLevel {}

impl HierarchicalLevel for Level4 {}
impl HierarchicalLevel for Level3 {}
impl HierarchicalLevel for Level2 {}

Now we add the level parameter to the Table type:

use core::marker::PhantomData;

pub struct Table<L: TableLevel> {
    entries: [Entry; ENTRY_COUNT],
    level: PhantomData<L>,
}

We need to add a PhantomData field because unused type parameters are not allowed in Rust.
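The level parameter does not change the memory layout, which matters because the CPU expects a table to be exactly one page. The following standalone sketch checks this with size_of. It uses std instead of core so that it can run outside the kernel, and table_size is a helper made up for this check.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

const ENTRY_COUNT: usize = 512;

#[allow(dead_code)]
struct Entry(u64);

trait TableLevel {}
enum Level4 {}
impl TableLevel for Level4 {}

#[allow(dead_code)]
struct Table<L: TableLevel> {
    entries: [Entry; ENTRY_COUNT],
    level: PhantomData<L>,
}

/// Size of a level 4 table in bytes.
fn table_size() -> usize {
    size_of::<Table<Level4>>()
}
```

Both the empty enum and the PhantomData field have size 0, so the table still occupies 512 * 8 = 4096 bytes.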

Since we changed the Table type, we need to update every use of it:

pub const P4: *mut Table<Level4> = 0xffffffff_fffff000 as *mut _;
...
impl<L> Table<L> where L: TableLevel
{
    pub fn zero(&mut self) {...}
}

impl<L> Table<L> where L: HierarchicalLevel
{
    pub fn next_table(&self, index: usize) -> Option<&Table<???>> {...}

    pub fn next_table_mut(&mut self, index: usize) -> Option<&mut Table<???>>
    {...}

    fn next_table_address(&self, index: usize) -> Option<usize> {...}
}

impl<L> Index<usize> for Table<L> where L: TableLevel {...}

impl<L> IndexMut<usize> for Table<L> where L: TableLevel {...}

Now the next_table methods are only available for P4, P3, and P2 tables. But they have the incomplete return type Table<???> now. What should we fill in for the ????

For a P4 table we would like to return a Table<Level3>, for a P3 table a Table<Level2>, and for a P2 table a Table<Level1>. So we want to return a table of the next level.

We can define the next level by adding an associated type to the HierarchicalLevel trait:

pub trait HierarchicalLevel: TableLevel {
    type NextLevel: TableLevel;
}

impl HierarchicalLevel for Level4 {
    type NextLevel = Level3;
}

impl HierarchicalLevel for Level3 {
    type NextLevel = Level2;
}

impl HierarchicalLevel for Level2 {
    type NextLevel = Level1;
}

Now we can replace the Table<???> types with Table<L::NextLevel> types and our code works as intended. You can try it with a simple test function:

fn test() {
    let p4 = unsafe { &*P4 };
    p4.next_table(42)
      .and_then(|p3| p3.next_table(1337))
      .and_then(|p2| p2.next_table(0xdeadbeaf))
      .and_then(|p1| p1.next_table(0xcafebabe));
}

Most of the indexes are completely out of bounds, so it would panic if it’s called. But we don’t need to call it since it already fails at compile time:

error: no method named `next_table` found for type
  `&memory::paging::table::Table<memory::paging::table::Level1>`
  in the current scope

Remember that this is bare metal kernel code. We just used type system magic to make low-level page table manipulations safer. Rust is just awesome!
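The same pattern also works outside of the kernel, which makes it easy to experiment with. The sketch below repeats the level types and adds a NUMBER constant and a next_level_number function. Both are hypothetical additions that exist only to make the associated type observable at runtime; they are not part of our paging module.

```rust
trait TableLevel {
    /// Hypothetical constant, only for this demonstration.
    const NUMBER: u8;
}

enum Level4 {}
enum Level3 {}
enum Level2 {}
enum Level1 {}

impl TableLevel for Level4 { const NUMBER: u8 = 4; }
impl TableLevel for Level3 { const NUMBER: u8 = 3; }
impl TableLevel for Level2 { const NUMBER: u8 = 2; }
impl TableLevel for Level1 { const NUMBER: u8 = 1; }

trait HierarchicalLevel: TableLevel {
    type NextLevel: TableLevel;
}

impl HierarchicalLevel for Level4 { type NextLevel = Level3; }
impl HierarchicalLevel for Level3 { type NextLevel = Level2; }
impl HierarchicalLevel for Level2 { type NextLevel = Level1; }

/// Level number of the table that a table of level `L` points to.
fn next_level_number<L: HierarchicalLevel>() -> u8 {
    <L::NextLevel as TableLevel>::NUMBER
}
```

Calling next_level_number::<Level4>() returns 3, whereas next_level_number::<Level1>() is rejected by the compiler because Level1 does not implement HierarchicalLevel.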

Translating Addresses

Now let’s do something useful with our new module. We will create a function that translates a virtual address to the corresponding physical address. We add it to the paging/mod.rs module:

pub fn translate(virtual_address: VirtualAddress)
    -> Option<PhysicalAddress>
{
    let offset = virtual_address % PAGE_SIZE;
    translate_page(Page::containing_address(virtual_address))
        .map(|frame| frame.number * PAGE_SIZE + offset)
}

It uses two functions we haven’t defined yet: translate_page and Page::containing_address. Let’s start with the latter:

pub fn containing_address(address: VirtualAddress) -> Page {
    assert!(address < 0x0000_8000_0000_0000 ||
        address >= 0xffff_8000_0000_0000,
        "invalid address: 0x{:x}", address);
    Page { number: address / PAGE_SIZE }
}

The assertion is needed because there can be invalid addresses. Addresses on x86 are just 48 bits long and the other bits are just sign extension, i.e. copies of bit 47, the most significant bit of the 48-bit address. For example:

invalid address: 0x0000_8000_0000_0000
valid address:   0xffff_8000_0000_0000
                        └── bit 47

So the address space is split into two halves: the higher half containing addresses with sign extension and the lower half containing addresses without. Everything in between is invalid.
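An equivalent way to express the check is to sign-extend bit 47 and compare the result with the original address. This is a standalone sketch on u64 values with a made-up function name; the assertion in containing_address performs the same test with two range comparisons.

```rust
/// True if bits 48-63 of `address` are copies of bit 47.
fn is_canonical(address: u64) -> bool {
    // Shift bit 47 into the sign position, then shift back arithmetically
    // so that it is copied into the upper 16 bits.
    let sign_extended = (((address << 16) as i64) >> 16) as u64;
    sign_extended == address
}
```

The last address of the lower half, 0x0000_7fff_ffff_ffff, and the first address of the higher half, 0xffff_8000_0000_0000, are both accepted. Everything in between, such as 0x0000_8000_0000_0000, is rejected.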

Since we added containing_address, we add the inverse method as well (maybe we need it later):

fn start_address(&self) -> usize {
    self.number * PAGE_SIZE
}

The other missing function, translate_page, looks like this:

use memory::Frame;

fn translate_page(page: Page) -> Option<Frame> {
    use self::entry::HUGE_PAGE;

    let p3 = unsafe { &*table::P4 }.next_table(page.p4_index());

    let huge_page = || {
        // TODO
    };

    p3.and_then(|p3| p3.next_table(page.p3_index()))
      .and_then(|p2| p2.next_table(page.p2_index()))
      .and_then(|p1| p1[page.p1_index()].pointed_frame())
      .or_else(huge_page)
}

We use an unsafe block to convert the raw P4 pointer to a reference. Then we use the Option::and_then function to go through the four table levels. If some entry along the way is None, we check if the page is a huge page through the (unimplemented) huge_page closure.

The Page::p*_index functions return the different table indexes. They look like this:

fn p4_index(&self) -> usize {
    (self.number >> 27) & 0o777
}
fn p3_index(&self) -> usize {
    (self.number >> 18) & 0o777
}
fn p2_index(&self) -> usize {
    (self.number >> 9) & 0o777
}
fn p1_index(&self) -> usize {
    (self.number >> 0) & 0o777
}

Safety

We use an unsafe block to convert the raw P4 pointer into a shared reference. It’s safe because we don’t create any &mut references to the table right now and don’t switch the P4 table either. But as soon as we do something like that, we have to revisit this method.

Huge Pages

The huge_page closure calculates the corresponding frame if huge pages are used. Its content looks like this:

p3.and_then(|p3| {
      let p3_entry = &p3[page.p3_index()];
      // 1GiB page?
      if let Some(start_frame) = p3_entry.pointed_frame() {
          if p3_entry.flags().contains(HUGE_PAGE) {
              // address must be 1GiB aligned
              assert!(start_frame.number % (ENTRY_COUNT * ENTRY_COUNT) == 0);
              return Some(Frame {
                  number: start_frame.number + page.p2_index() *
                          ENTRY_COUNT + page.p1_index(),
              });
          }
      }
      if let Some(p2) = p3.next_table(page.p3_index()) {
          let p2_entry = &p2[page.p2_index()];
          // 2MiB page?
          if let Some(start_frame) = p2_entry.pointed_frame() {
              if p2_entry.flags().contains(HUGE_PAGE) {
                  // address must be 2MiB aligned
                  assert!(start_frame.number % ENTRY_COUNT == 0);
                  return Some(Frame {
                      number: start_frame.number + page.p1_index()
                  });
              }
          }
      }
      None
  })

This function is much longer and more complex than the translate_page function itself. To avoid this complexity in the future, we will only work with standard 4KiB pages from now on.
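
The frame arithmetic inside the closure can be checked in isolation. The following standalone sketch uses plain frame numbers instead of our Frame type, and the two function names are invented for this illustration:

```rust
const ENTRY_COUNT: usize = 512;

/// Frame number of a 4KiB page that lies inside a 1GiB huge page.
/// `start_frame` is the first frame of the huge page.
pub fn frame_in_1gib_page(start_frame: usize, p2_index: usize, p1_index: usize) -> usize {
    // A 1GiB page spans 512 * 512 frames and must be aligned to that size.
    assert!(start_frame % (ENTRY_COUNT * ENTRY_COUNT) == 0);
    start_frame + p2_index * ENTRY_COUNT + p1_index
}

/// Frame number of a 4KiB page that lies inside a 2MiB huge page.
pub fn frame_in_2mib_page(start_frame: usize, p1_index: usize) -> usize {
    // A 2MiB page spans 512 frames and must be aligned to that size.
    assert!(start_frame % ENTRY_COUNT == 0);
    start_frame + p1_index
}
```

The indexes that would normally select the lower-level tables simply become an offset into the huge page.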

Mapping Pages

Let’s add a function that modifies the page tables to map a Page to a Frame:

pub use self::entry::*;
use memory::FrameAllocator;

pub fn map_to<A>(page: Page, frame: Frame, flags: EntryFlags,
                 allocator: &mut A)
    where A: FrameAllocator
{
    let p4 = unsafe { &mut *P4 };
    let mut p3 = p4.next_table_create(page.p4_index(), allocator);
    let mut p2 = p3.next_table_create(page.p3_index(), allocator);
    let mut p1 = p2.next_table_create(page.p2_index(), allocator);

    assert!(p1[page.p1_index()].is_unused());
    p1[page.p1_index()].set(frame, flags | PRESENT);
}

We add a re-export for all entry types since they are required to call the function. We assert that the page is unmapped and always set the present flag (since it wouldn’t make sense to map a page without setting it).

The Table::next_table_create method doesn’t exist yet. It should return the next table if it exists, or create a new one. For the implementation we need the FrameAllocator from the previous post and the Table::zero method:

use memory::FrameAllocator;

pub fn next_table_create<A>(&mut self,
                            index: usize,
                            allocator: &mut A)
                            -> &mut Table<L::NextLevel>
    where A: FrameAllocator
{
    if self.next_table(index).is_none() {
        assert!(!self.entries[index].flags().contains(HUGE_PAGE),
                "mapping code does not support huge pages");
        let frame = allocator.allocate_frame().expect("no frames available");
        self.entries[index].set(frame, PRESENT | WRITABLE);
        self.next_table_mut(index).unwrap().zero();
    }
    self.next_table_mut(index).unwrap()
}

We can use unwrap() here since the next table definitely exists.
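
The "return it or create it first" pattern can also be tried out in user space. The sketch below is a toy model, not our kernel code: ToyTable and walk_create are invented names, and child tables live in a Box instead of a physical frame from a FrameAllocator.

```rust
const ENTRY_COUNT: usize = 512;

/// Toy stand-in for a page table: each slot optionally owns a child table.
pub struct ToyTable {
    children: Vec<Option<Box<ToyTable>>>,
}

impl ToyTable {
    pub fn new() -> ToyTable {
        ToyTable {
            children: (0..ENTRY_COUNT).map(|_| None).collect(),
        }
    }

    /// Returns the child table at `index` and creates it first if it is
    /// missing. `created` counts how many tables had to be allocated.
    pub fn next_table_create(&mut self, index: usize, created: &mut usize) -> &mut ToyTable {
        if self.children[index].is_none() {
            self.children[index] = Some(Box::new(ToyTable::new()));
            *created += 1;
        }
        // The child definitely exists now, so unwrap cannot fail.
        self.children[index].as_deref_mut().unwrap()
    }
}

/// Walks P4 -> P3 -> P2 -> P1 and returns the number of new tables.
pub fn walk_create(p4: &mut ToyTable, p4_index: usize, p3_index: usize, p2_index: usize) -> usize {
    let mut created = 0;
    let p3 = p4.next_table_create(p4_index, &mut created);
    let p2 = p3.next_table_create(p3_index, &mut created);
    let _p1 = p2.next_table_create(p2_index, &mut created);
    created
}
```

Starting from an empty toy P4 table, the first walk creates three tables. A second walk along the same path creates none, and a walk that only differs in the P2 index creates a single new P1 table.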

Safety

We used an unsafe block in map_to to convert the raw P4 pointer to a &mut reference. That’s bad. It’s now possible that the &mut reference is not exclusive, which breaks Rust’s guarantees. It’s only a matter of time before we run into a data race. For example, imagine that one thread maps an entry to frame_A and another thread (on the same core) tries to map the same entry to frame_B.

The problem is that there’s no clear owner for the page tables. So let’s define page table ownership!

Page Table Ownership

We define the following:

A page table owns all of its subtables.

We already obey this rule: To get a reference to a table, we need to borrow it from its parent table through the next_table method. But who owns the P4 table?

The recursively mapped P4 table is owned by an ActivePageTable struct.

We just defined some random owner for the P4 table. But it will solve our problems. And it will also provide the interface to other modules.

So let’s create the struct:

use self::table::{Table, Level4};
use core::ptr::Unique;

pub struct ActivePageTable {
    p4: Unique<Table<Level4>>,
}

We can’t store the Table<Level4> directly because it needs to be at a special memory location (like the VGA text buffer). We could use a raw pointer or &mut instead of Unique, but Unique indicates ownership better.

Because the ActivePageTable owns the unique recursively mapped P4 table, there must be only one ActivePageTable instance. Thus we make the constructor function unsafe:

impl ActivePageTable {
    pub unsafe fn new() -> ActivePageTable {
        ActivePageTable {
            p4: Unique::new_unchecked(table::P4),
        }
    }
}

We add some methods to get P4 references:

fn p4(&self) -> &Table<Level4> {
    unsafe { self.p4.as_ref() }
}

fn p4_mut(&mut self) -> &mut Table<Level4> {
    unsafe { self.p4.as_mut() }
}

Since we will only create valid P4 pointers, the unsafe blocks are safe. However, we don’t make these functions public since they can be used to make page tables invalid. Only the higher level functions (such as translate or map_to) should be usable from other modules.

Now we can make the map_to and translate functions safe by making them methods of ActivePageTable:

impl ActivePageTable {
    pub unsafe fn new() -> ActivePageTable {...}

    fn p4(&self) -> &Table<Level4> {...}

    fn p4_mut(&mut self) -> &mut Table<Level4> {...}

    pub fn translate(&self, virtual_address: VirtualAddress)
        -> Option<PhysicalAddress>
    {
        ...
        self.translate_page(...).map(...)
    }

    fn translate_page(&self, page: Page) -> Option<Frame> {
        let p3 = self.p4().next_table(...);
        ...
    }

    pub fn map_to<A>(&mut self,
                     page: Page,
                     frame: Frame,
                     flags: EntryFlags,
                     allocator: &mut A)
        where A: FrameAllocator
    {
        let mut p3 = self.p4_mut().next_table_create(...);
        ...
    }
}

Now the p4() and p4_mut() methods should be the only methods containing an unsafe block in the paging/mod.rs file.
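
The ownership idea itself does not depend on page tables. As a sketch that can be run on stable Rust in user space, the following example wraps a raw pointer in the same way. Note the assumptions: it uses core::ptr::NonNull as a stand-in for Unique, a plain u64 instead of the P4 table, and the SoleOwner name is invented for this illustration.

```rust
use core::ptr::NonNull;

/// Owns the value behind a raw pointer (hypothetical illustration type).
pub struct SoleOwner<T> {
    ptr: NonNull<T>,
}

impl<T> SoleOwner<T> {
    /// Unsafe because the caller must guarantee that `ptr` is valid and
    /// that no other owner or reference to the target exists.
    pub unsafe fn new(ptr: *mut T) -> SoleOwner<T> {
        SoleOwner {
            ptr: unsafe { NonNull::new_unchecked(ptr) },
        }
    }

    /// A shared borrow of the owner hands out a shared reference.
    pub fn get(&self) -> &T {
        unsafe { self.ptr.as_ref() }
    }

    /// Only an exclusive borrow of the owner hands out a `&mut` reference.
    pub fn get_mut(&mut self) -> &mut T {
        unsafe { self.ptr.as_mut() }
    }
}
```

The single unsafe promise is made once in new. After that, the borrow checker enforces that a &mut reference to the target can only exist while the owner itself is borrowed mutably, which is exactly what p4() and p4_mut() rely on.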

More Mapping Functions

For convenience, we add a map method that just picks a free frame for us:

pub fn map<A>(&mut self, page: Page, flags: EntryFlags, allocator: &mut A)
    where A: FrameAllocator
{
    let frame = allocator.allocate_frame().expect("out of memory");
    self.map_to(page, frame, flags, allocator)
}

We also add an identity_map function to make it easier to remap the kernel in the next post:

pub fn identity_map<A>(&mut self,
                       frame: Frame,
                       flags: EntryFlags,
                       allocator: &mut A)
    where A: FrameAllocator
{
    let page = Page::containing_address(frame.start_address());
    self.map_to(page, frame, flags, allocator)
}

Unmapping Pages

To unmap a page, we set the corresponding P1 entry to unused:

fn unmap<A>(&mut self, page: Page, allocator: &mut A)
    where A: FrameAllocator
{
    assert!(self.translate(page.start_address()).is_some());

    let p1 = self.p4_mut()
                 .next_table_mut(page.p4_index())
                 .and_then(|p3| p3.next_table_mut(page.p3_index()))
                 .and_then(|p2| p2.next_table_mut(page.p2_index()))
                 .expect("mapping code does not support huge pages");
    let frame = p1[page.p1_index()].pointed_frame().unwrap();
    p1[page.p1_index()].set_unused();
    // TODO free p(1,2,3) table if empty
    allocator.deallocate_frame(frame);
}

The assertion ensures that the page is mapped. Thus the corresponding P1 table and frame must exist for a standard 4KiB page. We set the entry to unused and free the associated frame in the supplied frame allocator.

We can also free the P1, P2, or even P3 table when the last entry is freed. But checking the whole table on every unmap would be very expensive. So we leave the TODO in place until we find a good solution. I’m open to suggestions :).

Spoiler: There is an ugly bug in this function, which we will find in the next section.

Testing and Bugfixing

To test it, we add a test_paging function in memory/paging/mod.rs:

pub fn test_paging<A>(allocator: &mut A)
    where A: FrameAllocator
{
    let mut page_table = unsafe { ActivePageTable::new() };

    // test it
}

We borrow the frame allocator since we will need it for the mapping functions. To be able to call that function from main, we need to re-export it in memory/mod.rs:

// in memory/mod.rs
pub use self::paging::test_paging;

// lib.rs
let mut frame_allocator = ...;
memory::test_paging(&mut frame_allocator);

map_to

Let’s test the map_to function:

let addr = 42 * 512 * 512 * 4096; // 42nd P3 entry
let page = Page::containing_address(addr);
let frame = allocator.allocate_frame().expect("no more frames");
println!("None = {:?}, map to {:?}",
         page_table.translate(addr),
         frame);
page_table.map_to(page, frame, EntryFlags::empty(), allocator);
println!("Some = {:?}", page_table.translate(addr));
println!("next free frame: {:?}", allocator.allocate_frame());

We just map some random page to a free frame. To be able to borrow the page table as &mut, we need to make it mutable.

You should see output similar to this:

None = None, map to Frame { number: 0 }
Some = Some(0)
next free frame: Some(Frame { number: 3 })

It’s frame 0 because it’s the first frame returned by the frame allocator. Since we map the 42nd P3 entry, the mapping code needs to create a P2 and a P1 table. So the next free frame returned by the allocator is frame 3.

unmap

To test the unmap function, we unmap the test page so that it translates to None again:

page_table.unmap(Page::containing_address(addr), allocator);
println!("None = {:?}", page_table.translate(addr));

It causes a panic since we call the unimplemented deallocate_frame method in unmap. If we comment this call out, it works without problems. But there is some bug in this function nevertheless.

Let’s read something from the mapped page (of course before we unmap it again):

println!("{:#x}", unsafe {
    *(Page::containing_address(addr).start_address() as *const u64)
});

Since we don’t zero the mapped pages, the output is random. For me, it’s 0xf000ff53f000ff53.

If unmap worked correctly, reading it again after unmapping should cause a page fault. But it doesn’t. Instead, it just prints the same number again. When we remove the first read, we get the desired page fault (i.e. QEMU reboots again and again). So this seems to be some cache issue.

An x86 processor has many different caches because always accessing the main memory would be very slow. Most of these caches are completely transparent. That means everything works exactly the same as without them, it’s just much faster. But there is one cache that needs to be updated manually: the translation lookaside buffer.

The translation lookaside buffer, or TLB, caches the translation of virtual to physical addresses. It’s filled automatically when a page is accessed. But it’s not updated transparently when the mapping of a page changes. This is the reason that we can still access the page even though we unmapped it in the page table.

So to fix our unmap function, we need to remove the cached translation from the TLB. We can use the x86_64 crate to do this easily. To add it, we append the following to our Cargo.toml:

[dependencies]
...
x86_64 = "0.1.2"

Now we can use it to fix unmap:

...
    p1[page.p1_index()].set_unused();

    use x86_64::instructions::tlb;
    use x86_64::VirtualAddress;
    tlb::flush(VirtualAddress(page.start_address()));

    // TODO free p(1,2,3) table if empty
    //allocator.deallocate_frame(frame);
}

Now the desired page fault occurs even when we access the page before.
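
The stale translation can be reproduced in a toy model that runs in user space. This is only a sketch of the caching behavior, not of real hardware: ToyMmu and its methods are invented names, and both the page table and the TLB are simple hash maps from page numbers to frame numbers.

```rust
use std::collections::HashMap;

/// Toy model of address translation with a translation cache.
pub struct ToyMmu {
    page_table: HashMap<usize, usize>,
    tlb: HashMap<usize, usize>,
}

impl ToyMmu {
    pub fn new() -> ToyMmu {
        ToyMmu { page_table: HashMap::new(), tlb: HashMap::new() }
    }

    pub fn map(&mut self, page: usize, frame: usize) {
        self.page_table.insert(page, frame);
    }

    /// Removes the mapping but leaves a cached translation in place.
    pub fn unmap_without_flush(&mut self, page: usize) {
        self.page_table.remove(&page);
    }

    /// Removes the cached translation for a single page.
    pub fn flush(&mut self, page: usize) {
        self.tlb.remove(&page);
    }

    /// Looks in the cache first and only walks the page table on a miss.
    pub fn translate(&mut self, page: usize) -> Option<usize> {
        if let Some(&frame) = self.tlb.get(&page) {
            return Some(frame);
        }
        let frame = *self.page_table.get(&page)?;
        self.tlb.insert(page, frame);
        Some(frame)
    }
}
```

If a page is translated once and then unmapped without a flush, translate still returns the old frame. Only after flush does it return None, which corresponds to the page fault we wanted. Removing the first translate call also leads to None, just like removing the first read did above.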

Conclusion

This post has become pretty long. So let’s summarize what we’ve done:

we created a paging module and modeled page tables plus entries
we mapped the P4 page recursively and created next_table methods
we used empty enums and associated types to make the next_table functions safe
we wrote a function to translate virtual to physical addresses
we created safe functions to map and unmap pages
and we fixed stack overflow and TLB related bugs

What’s next?

In the next post we will extend this module and add a function to modify inactive page tables. Through that function, we will create a new page table hierarchy that maps the kernel correctly using 4KiB pages. Then we will switch to the new table to get a safer kernel environment.

Afterwards, we will use this paging module to build a heap allocator. This will allow us to use allocation and collection types such as Box and Vec.

Image sources: ²

Footnotes

If the XXX part of the address is smaller than 0o400, its binary representation doesn’t start with 1. But the sign extension bits, which should be a copy of that bit, are 1 instead of 0. Thus the address is not valid.

Image sources: Modified versions of an image from Wikipedia. The modified files are licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Allocating Frames

Sun, 15 Nov 2015 00:00:00 +0000

In this post we create an allocator that provides free physical frames for a future paging module. To get the required information about available and used memory we use the Multiboot information structure. Additionally, we improve the panic handler to print the corresponding message and source line.

The full source code is available on GitHub. Feel free to open issues there if you have any problems or improvements. You can also leave a comment at the bottom.

Preparation

We still have a really tiny stack of 64 bytes, which won’t suffice for this post. So we increase it to 16 KiB (four pages) in boot.asm:

section .bss
...
stack_bottom:
    resb 4096 * 4
stack_top:

The Multiboot Information Structure

When a Multiboot compliant bootloader loads a kernel, it passes a pointer to a boot information structure in the ebx register. We can use it to get information about available memory and loaded kernel sections.

First, we need to pass this pointer to our kernel as an argument to rust_main. To find out how arguments are passed to functions, we can look at the calling convention of Linux:

The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX, R8, and R9

So to pass the pointer to our kernel, we need to move it to rdi before calling the kernel. Since we’re not using the rdi/edi register in our bootstrap code, we can simply set the edi register right after booting (in boot.asm):

start:
    mov esp, stack_top
    mov edi, ebx       ; Move Multiboot info pointer to edi

Now we can add the argument to our rust_main:

pub extern fn rust_main(multiboot_information_address: usize) { ... }

Instead of writing our own Multiboot module, we use the multiboot2 crate. It gives us some basic information about mapped kernel sections and available memory. I just wrote it for this blog post since I could not find any other Multiboot 2 crate. It’s still incomplete, but it does its job.

So let’s add it as a dependency:

# in Cargo.toml
[dependencies]
...
multiboot2 = "0.1.0"

// in src/lib.rs
extern crate multiboot2;

Now we can use it to print available memory areas.

Available Memory

The boot information structure consists of various tags. See section 3.4 of the Multiboot specification (PDF) for a complete list. The memory map tag contains a list of all available RAM areas. Special areas such as the VGA text buffer at 0xb8000 are not available. Note that some of the available memory is already used by our kernel and by the multiboot information structure itself.

To print all available memory areas, we can use the multiboot2 crate in our rust_main as follows:

let boot_info = unsafe{ multiboot2::load(multiboot_information_address) };
let memory_map_tag = boot_info.memory_map_tag()
    .expect("Memory map tag required");

println!("memory areas:");
for area in memory_map_tag.memory_areas() {
    println!("    start: 0x{:x}, length: 0x{:x}",
        area.base_addr, area.length);
}

The load function is unsafe because it relies on a valid address. Since the memory tag is not required by the Multiboot specification, the memory_map_tag() function returns an Option. The memory_areas() function returns the desired memory area iterator.

The output looks like this:

Hello World!
memory areas:
    start: 0x0, length: 0x9fc00
    start: 0x100000, length: 0x7ee0000

So we have one area from 0x0 to 0x9fc00, which is a bit below the 1MiB mark. The second, bigger area starts at 1MiB and contains the rest of available memory. The area from 0x9fc00 to 1MiB is not available since it contains for example the VGA text buffer at 0xb8000. This is the reason for putting our kernel at 1MiB and not somewhere below.
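
The arithmetic behind these statements can be sketched with a small stand-in type. Area is a hypothetical simplification of the memory area type of the multiboot2 crate, reduced to the two fields we printed above:

```rust
/// Hypothetical stand-in for a memory area with the two printed fields.
#[derive(Debug, Clone, Copy)]
pub struct Area {
    pub base_addr: u64,
    pub length: u64,
}

impl Area {
    /// First address after the area (exclusive end).
    pub fn end_addr(&self) -> u64 {
        self.base_addr + self.length
    }
}

/// Sums up the lengths of all areas in bytes.
pub fn total_available(areas: &[Area]) -> u64 {
    areas.iter().map(|area| area.length).sum()
}
```

For the output above, the first area ends at 0x9fc00, which is below 1MiB (0x100000), and the second area ends at 0x7fe0000, which is just below 128MiB (0x8000000).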

If you give QEMU more than 4GiB of memory by passing -m 5G, you get another unusable area below the 4GiB mark. This memory is normally mapped to some hardware devices. See the OSDev Wiki for more information.

Handling Panics

We used expect in the code above, which will panic if there is no memory map tag. But our current panic handler just loops without printing any error message. Of course we could replace expect by a match, but we should fix the panic handler nonetheless:

#[lang = "panic_fmt"]
#[no_mangle]
pub extern fn panic_fmt() -> ! {
    println!("PANIC");
    loop{}
}

Now we get a PANIC message. But we can do even better. The panic_fmt function actually has some arguments:

#[lang = "panic_fmt"]
#[no_mangle]
pub extern fn panic_fmt(fmt: core::fmt::Arguments, file: &'static str,
    line: u32) -> !
{
    println!("\n\nPANIC in {} at line {}:", file, line);
    println!("    {}", fmt);
    loop{}
}

Be careful with these arguments as the compiler does not check the function signature for lang_items.

Now we get the panic message and the causing source line. You can try it by inserting a panic somewhere.

Kernel ELF Sections

To read and print the sections of our kernel ELF file, we can use the Elf-sections tag:

let elf_sections_tag = boot_info.elf_sections_tag()
    .expect("Elf-sections tag required");

println!("kernel sections:");
for section in elf_sections_tag.sections() {
    println!("    addr: 0x{:x}, size: 0x{:x}, flags: 0x{:x}",
        section.addr, section.size, section.flags);
}

This should print out the start address and size of all kernel sections. If the section is writable, the 0x1 bit is set in flags. The 0x4 bit marks an executable section and the 0x2 bit indicates that the section was loaded in memory. For example, the .text section is executable but not writable and the .data section just the opposite.
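
To make the printed flags easier to read, they can be decoded into words. The following standalone sketch does this for the three bits mentioned above. The constant names follow the ELF specification (SHF_WRITE, SHF_ALLOC, SHF_EXECINSTR), while describe_flags is a made-up helper that is not part of the multiboot2 crate:

```rust
pub const SHF_WRITE: u64 = 0x1; // section is writable
pub const SHF_ALLOC: u64 = 0x2; // section is loaded in memory
pub const SHF_EXECINSTR: u64 = 0x4; // section is executable

/// Turns a section flags value into a short description.
/// Bits other than the three above are ignored in this sketch.
pub fn describe_flags(flags: u64) -> String {
    let mut parts = Vec::new();
    if flags & SHF_ALLOC != 0 {
        parts.push("loaded");
    }
    if flags & SHF_WRITE != 0 {
        parts.push("writable");
    }
    if flags & SHF_EXECINSTR != 0 {
        parts.push("executable");
    }
    if parts.is_empty() {
        "none".to_string()
    } else {
        parts.join(", ")
    }
}
```

A .text section with flags 0x6 is then described as "loaded, executable" and a .data section with flags 0x3 as "loaded, writable".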

But when we execute it, tons of really small sections are printed. We can use the objdump -h build/kernel-x86_64.bin command to list the sections with their names. There seem to be over 200 sections and many of them start with .text.* or .data.rel.ro.local.*. This is because the Rust compiler puts e.g. each function in its own .text subsection. That way, unused functions are removed when the linker omits unused sections.

To merge these subsections, we need to update our linker script:

ENTRY(start)

SECTIONS {
    . = 1M;

    .boot :
    {
        KEEP(*(.multiboot_header))
    }

    .text :
    {
        *(.text .text.*)
    }

    .rodata : {
        *(.rodata .rodata.*)
    }

    .data.rel.ro : {
        *(.data.rel.ro.local*) *(.data.rel.ro .data.rel.ro.*)
    }
}

These lines are taken from the default linker script of ld, which can be obtained through ld -verbose. The .text output section now contains all .text.* input sections of the static library (and the same applies to the .rodata and .data.rel.ro sections).

Now there are only 12 sections left and we get a much more useful output:

If you like, you can compare this output to the objdump -h build/kernel-x86_64.bin output. You will see that the start addresses and sizes match exactly for each section. The sections with flags 0x0 are mostly debug sections, so they don’t need to be loaded. And the last few sections of the QEMU output aren’t in the objdump output because they are special sections such as string tables.

Start and End of Kernel

We can now use the ELF section tag to calculate the start and end address of our loaded kernel:

let kernel_start = elf_sections_tag.sections().map(|s| s.addr)
    .min().unwrap();
let kernel_end = elf_sections_tag.sections().map(|s| s.addr + s.size)
    .max().unwrap();

The other used memory area is the Multiboot Information structure:

let multiboot_start = multiboot_information_address;
let multiboot_end = multiboot_start + (boot_info.total_size as usize);

Printing these numbers gives us:

kernel_start: 0x100000, kernel_end: 0x11a168
multiboot_start: 0x11d400, multiboot_end: 0x11d9c8

So the kernel starts at 1MiB (as expected) and is about 105 KiB in size. The multiboot information structure was placed at 0x11d400 by GRUB and needs 1480 bytes. Of course your numbers could be a bit different due to different versions of Rust or GRUB (or some differences in the source code).
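
Since we will soon think in 4096-byte frames instead of addresses, it helps to convert these ranges. The frame_range helper below is a hypothetical illustration, not part of the kernel. It treats the end address as exclusive, so it looks at the last used byte (end - 1):

```rust
pub const PAGE_SIZE: usize = 4096;

/// Returns the first and last frame number touched by the byte range
/// `start..end`. The end address is exclusive and the range must not
/// be empty.
pub fn frame_range(start: usize, end: usize) -> (usize, usize) {
    assert!(end > start);
    (start / PAGE_SIZE, (end - 1) / PAGE_SIZE)
}
```

For the numbers above, the kernel occupies frames 256 to 282 and the Multiboot information structure fits completely into frame 285.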

A frame allocator

When using paging, the physical memory is split into equally sized chunks (normally 4096 bytes). Such a chunk is called “physical page” or “frame”. These frames can be mapped to any virtual page through page tables. For more information about paging take a peek at the next post.

We will need a free frame in many cases. For example when we want to increase the size of our future kernel heap. Or when we create a new page table. Or when we add a new kernel thread and thus need to allocate a new stack. So we need some kind of allocator that keeps track of physical frames and gives us a free one when needed.

There are various ways to write such a frame allocator:

We could create some kind of linked list from the free frames. For example, each frame could begin with a pointer to the next free frame. Since the frames are free, this would not overwrite any data. Our allocator would just save the head of the list and could easily allocate and deallocate frames by updating pointers. Unfortunately, this approach has a problem: It requires reading and writing these free frames. So we would need to map all physical frames to some virtual address, at least temporarily. Another disadvantage is that we need to create this linked list at startup. That implies that we need to set over one million pointers at startup if the machine has 4GiB of RAM.

Another approach is to create some kind of data structure such as a bitmap or a stack to manage free frames. We could place it in the already identity mapped area right behind the kernel or multiboot structure. That way we would not need to (temporarily) map each free frame. But it has the same problem of the slow initial creating/filling. In fact, we will use this approach in a future post to manage frames that are freed again. But for the initial management of free frames, we use a different method.

In the following, we will use Multiboot’s memory map directly. The idea is to maintain a simple counter that starts at frame 0 and is increased constantly. If the current frame is available (part of an available area in the memory map) and not used by the kernel or the multiboot structure (we know their start and end addresses), we know that it’s free and return it. Else, we increase the counter to the next possibly free frame. That way, we don’t need to create a data structure when booting and the physical frames can remain unmapped. The only problem is that we cannot reasonably free frames again, but we will solve that problem in a future post (by adding an intermediate frame stack that saves freed frames).

So let’s start implementing our memory map based frame allocator.

A Memory Module

First we create a memory module with a Frame type (src/memory/mod.rs):

#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Frame {
    number: usize,
}

(Don’t forget to add the mod memory line to src/lib.rs.) Instead of e.g. the start address, we just store the frame number. We use usize here since the number of frames depends on the memory size. The long derive line makes frames printable and comparable.

To make it easy to get the corresponding frame for a physical address, we add a containing_address method:

pub const PAGE_SIZE: usize = 4096;

impl Frame {
    fn containing_address(address: usize) -> Frame {
        Frame{ number: address / PAGE_SIZE }
    }
}

We also add a FrameAllocator trait:

pub trait FrameAllocator {
    fn allocate_frame(&mut self) -> Option<Frame>;
    fn deallocate_frame(&mut self, frame: Frame);
}

This allows us to create another, more advanced frame allocator in the future.

The Allocator

Now we can put everything together and create the actual frame allocator. For that we create a src/memory/area_frame_allocator.rs submodule. The allocator struct looks like this:

use memory::{Frame, FrameAllocator};
use multiboot2::{MemoryAreaIter, MemoryArea};

pub struct AreaFrameAllocator {
    next_free_frame: Frame,
    current_area: Option<&'static MemoryArea>,
    areas: MemoryAreaIter,
    kernel_start: Frame,
    kernel_end: Frame,
    multiboot_start: Frame,
    multiboot_end: Frame,
}

The next_free_frame field is a simple counter that is increased every time we return a frame. It’s initialized to 0 and every frame below it counts as used. The current_area field holds the memory area that contains next_free_frame. If next_free_frame leaves this area, we will look for the next one in areas. When there are no areas left, all frames are used and current_area becomes None. The {kernel, multiboot}_{start, end} fields are used to avoid returning already used frames.

To implement the FrameAllocator trait, we need to implement the allocation and deallocation methods:

impl FrameAllocator for AreaFrameAllocator {
    fn allocate_frame(&mut self) -> Option<Frame> {
        // TODO (see below)
    }

    fn deallocate_frame(&mut self, frame: Frame) {
        // TODO (see below)
    }
}

The allocate_frame method looks like this:

// in `allocate_frame` in `impl FrameAllocator for AreaFrameAllocator`

if let Some(area) = self.current_area {
    // "Clone" the frame to return it if it's free. Frame doesn't
    // implement Clone, but we can construct an identical frame.
    let frame = Frame{ number: self.next_free_frame.number };

    // the last frame of the current area
    let current_area_last_frame = {
        let address = area.base_addr + area.length - 1;
        Frame::containing_address(address as usize)
    };

    if frame > current_area_last_frame {
        // all frames of current area are used, switch to next area
        self.choose_next_area();
    } else if frame >= self.kernel_start && frame <= self.kernel_end {
        // `frame` is used by the kernel
        self.next_free_frame = Frame {
            number: self.kernel_end.number + 1
        };
    } else if frame >= self.multiboot_start && frame <= self.multiboot_end {
        // `frame` is used by the multiboot information structure
        self.next_free_frame = Frame {
            number: self.multiboot_end.number + 1
        };
    } else {
        // frame is unused, increment `next_free_frame` and return it
        self.next_free_frame.number += 1;
        return Some(frame);
    }
    // `frame` was not valid, try it again with the updated `next_free_frame`
    self.allocate_frame()
} else {
    None // no free frames left
}

The choose_next_area method isn’t part of the trait and thus goes into a new impl AreaFrameAllocator block:

// in `impl AreaFrameAllocator`

fn choose_next_area(&mut self) {
    self.current_area = self.areas.clone().filter(|area| {
        let address = area.base_addr + area.length - 1;
        Frame::containing_address(address as usize) >= self.next_free_frame
    }).min_by_key(|area| area.base_addr);

    if let Some(area) = self.current_area {
        let start_frame = Frame::containing_address(area.base_addr as usize);
        if self.next_free_frame < start_frame {
            self.next_free_frame = start_frame;
        }
    }
}

This function chooses the area with the minimal base address that still has free frames, i.e. next_free_frame is smaller than its last frame. Note that we need to clone the iterator because the min_by_key function consumes it. If there are no areas with free frames left, min_by_key automatically returns the desired None.

If the next_free_frame is below the new current_area, it needs to be updated to the area’s start frame. Else, the allocate_frame call could return an unavailable frame.

We don’t have a data structure to store free frames, so we can’t implement deallocate_frame reasonably. Thus we use the unimplemented macro, which just panics when the method is called:

impl FrameAllocator for AreaFrameAllocator {
    fn allocate_frame(&mut self) -> Option<Frame> {
        // described above
    }

    fn deallocate_frame(&mut self, _frame: Frame) {
        unimplemented!()
    }
}

Now we only need a constructor function to make the allocator usable:

pub fn new(kernel_start: usize, kernel_end: usize,
      multiboot_start: usize, multiboot_end: usize,
      memory_areas: MemoryAreaIter) -> AreaFrameAllocator
{
    let mut allocator = AreaFrameAllocator {
        next_free_frame: Frame::containing_address(0),
        current_area: None,
        areas: memory_areas,
        kernel_start: Frame::containing_address(kernel_start),
        kernel_end: Frame::containing_address(kernel_end),
        multiboot_start: Frame::containing_address(multiboot_start),
        multiboot_end: Frame::containing_address(multiboot_end),
    };
    allocator.choose_next_area();
    allocator
}

Note that we call choose_next_area manually here because allocate_frame returns None as soon as current_area is None. So by calling choose_next_area we initialize it to the area with the minimal base address.
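
The whole allocation logic can also be tried out in user space. The following sketch is a simplified variation, not the AreaFrameAllocator itself: it works on plain frame numbers and a slice of hypothetical Area values instead of the multiboot2 types, looks up the current area on every call instead of caching it, and uses a loop instead of recursion. Areas are assumed to have a nonzero length.

```rust
pub const PAGE_SIZE: u64 = 4096;

/// Hypothetical stand-in for a memory area of the Multiboot memory map.
#[derive(Debug, Clone, Copy)]
pub struct Area {
    pub base_addr: u64,
    pub length: u64,
}

pub struct CounterAllocator<'a> {
    next_free_frame: u64,
    areas: &'a [Area],
    kernel: (u64, u64),    // first and last frame used by the kernel
    multiboot: (u64, u64), // first and last frame of the Multiboot structure
}

impl<'a> CounterAllocator<'a> {
    pub fn new(areas: &'a [Area], kernel: (u64, u64), multiboot: (u64, u64)) -> Self {
        CounterAllocator { next_free_frame: 0, areas, kernel, multiboot }
    }

    /// The area with the lowest base address that still has free frames,
    /// i.e. whose last frame is not below `next_free_frame`.
    fn current_area(&self) -> Option<Area> {
        self.areas
            .iter()
            .copied()
            .filter(|a| (a.base_addr + a.length - 1) / PAGE_SIZE >= self.next_free_frame)
            .min_by_key(|a| a.base_addr)
    }

    pub fn allocate_frame(&mut self) -> Option<u64> {
        loop {
            // No area left means no free frames left.
            let area = self.current_area()?;
            // Skip ahead if the counter is still below the area start.
            let frame = self.next_free_frame.max(area.base_addr / PAGE_SIZE);
            if frame >= self.kernel.0 && frame <= self.kernel.1 {
                // `frame` is used by the kernel
                self.next_free_frame = self.kernel.1 + 1;
            } else if frame >= self.multiboot.0 && frame <= self.multiboot.1 {
                // `frame` is used by the Multiboot information structure
                self.next_free_frame = self.multiboot.1 + 1;
            } else {
                // `frame` is unused, advance the counter and return it
                self.next_free_frame = frame + 1;
                return Some(frame);
            }
        }
    }
}
```

With the two memory areas printed earlier, a kernel in frames 256 to 282, and the Multiboot structure in frame 285, this sketch returns frames 0 to 159 first and then continues with 283, 284, and 286.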

Testing it

In order to test it in main, we need to re-export the AreaFrameAllocator in the memory module. Then we can create a new allocator:

let mut frame_allocator = memory::AreaFrameAllocator::new(
    kernel_start as usize, kernel_end as usize, multiboot_start,
    multiboot_end, memory_map_tag.memory_areas());

Now we can test it by adding some frame allocations:

println!("{:?}", frame_allocator.allocate_frame());

You will see that the frame number starts at 0 and increases steadily, but the kernel and Multiboot frames are left out (you need to allocate many frames to see this since the kernel starts at frame 256).

The following for loop allocates all frames and prints out the total number of allocated frames:

for i in 0.. {
    if let None = frame_allocator.allocate_frame() {
        println!("allocated {} frames", i);
        break;
    }
}

You can try different amounts of memory by passing e.g. -m 500M to QEMU. To compare these numbers, WolframAlpha can be very helpful.

Conclusion

Now we have a working frame allocator. It is a bit rudimentary and cannot free frames, but it also is very fast since it reuses the Multiboot memory map and does not need any costly initialization. A future post will build upon this allocator and add a stack-like data structure for freed frames.

What’s next?

The next post will be about paging again. We will use the frame allocator to create a safe module that allows us to switch page tables and map pages. Then we will use this module and the information from the Elf-sections tag to remap the kernel correctly.

Recommended Posts

Eric Kidd started the Bare Metal Rust series last week. Like this post, it builds upon the code from Printing to Screen, but tries to support keyboard input instead of wrestling through memory management details.

Printing to Screen

Fri, 23 Oct 2015 00:00:00 +0000

In the previous post we switched from assembly to Rust, a systems programming language that provides great safety. But so far we are using unsafe features like raw pointers whenever we want to print to screen. In this post we will create a Rust module that provides a safe and easy-to-use interface for the VGA text buffer. It will support Rust’s formatting macros, too.

This post uses recent unstable features, so you need an up-to-date nightly compiler. If you have any questions, problems, or suggestions please file an issue or create a comment at the bottom. The code from this post is also available on GitHub.

The VGA Text Buffer

The text buffer starts at physical address 0xb8000 and contains the characters displayed on screen. It has 25 rows and 80 columns. Each screen character has the following format:

Bit(s)	Value
0-7	ASCII code point
8-11	Foreground color
12-14	Background color
15	Blink

The following colors are available:

Number	Color	Number + Bright Bit	Bright Color
0x0	Black	0x8	Dark Gray
0x1	Blue	0x9	Light Blue
0x2	Green	0xa	Light Green
0x3	Cyan	0xb	Light Cyan
0x4	Red	0xc	Light Red
0x5	Magenta	0xd	Pink
0x6	Brown	0xe	Yellow
0x7	Light Gray	0xf	White

The fourth bit of each color number (value 0x8) is the bright bit, which turns, for example, blue into light blue. It is unavailable for the background color because that bit position (bit 15 of the screen character) controls whether the text blinks. If you want to use a light background color (e.g. white), you have to disable blinking through a BIOS function.
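To make the bit layout from the tables above concrete, here is a minimal sketch that packs one screen character into its 16-bit representation. The function name `encode_cell` and its parameters are hypothetical and only serve as an illustration; the module we build below uses separate types instead.

```rust
/// Packs one VGA text-mode cell into its 16-bit representation.
/// `fg` uses 4 bits (including the bright bit), `bg` uses 3 bits,
/// and `blink` ends up in bit 15.
fn encode_cell(ascii: u8, fg: u8, bg: u8, blink: bool) -> u16 {
    let attribute =
        ((blink as u16) << 7) | (((bg & 0x7) as u16) << 4) | ((fg & 0xf) as u16);
    (attribute << 8) | ascii as u16
}

fn main() {
    // Light green (0xa) 'H' on black (0x0), no blinking.
    println!("{:#06x}", encode_cell(b'H', 0xa, 0x0, false));
}
```

Masking `bg` with 0x7 mirrors the hardware behavior described above: a bright background color cannot be expressed, since its fourth bit is taken by the blink flag.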

A basic Rust Module

Now that we know how the VGA buffer works, we can create a Rust module to handle printing:

// in src/lib.rs
mod vga_buffer;

The content of this module can live either in src/vga_buffer.rs or src/vga_buffer/mod.rs. The latter supports submodules while the former does not. But our module does not need any submodules so we create it as src/vga_buffer.rs.

All of the code below goes into our new module (unless specified otherwise).

Colors

First, we represent the different colors using an enum:

#[allow(dead_code)]
#[repr(u8)]
pub enum Color {
    Black      = 0,
    Blue       = 1,
    Green      = 2,
    Cyan       = 3,
    Red        = 4,
    Magenta    = 5,
    Brown      = 6,
    LightGray  = 7,
    DarkGray   = 8,
    LightBlue  = 9,
    LightGreen = 10,
    LightCyan  = 11,
    LightRed   = 12,
    Pink       = 13,
    Yellow     = 14,
    White      = 15,
}

We use a C-like enum here to explicitly specify the number for each color. Because of the repr(u8) attribute, each enum variant is stored as a u8. Actually 4 bits would be sufficient, but Rust doesn’t have a u4 type.

Normally the compiler would issue a warning for each unused variant. By using the #[allow(dead_code)] attribute we disable these warnings for the Color enum.

To represent a full color code that specifies foreground and background color, we create a newtype on top of u8:

struct ColorCode(u8);

impl ColorCode {
    const fn new(foreground: Color, background: Color) -> ColorCode {
        ColorCode((background as u8) << 4 | (foreground as u8))
    }
}

The ColorCode contains the full color byte, containing foreground and background color. Blinking is enabled implicitly by using a bright background color (soon we will disable blinking anyway). The new function is a const function to allow it in static initializers. As const functions are unstable we need to add the const_fn feature in src/lib.rs.

The Text Buffer

Now we can add structures to represent a screen character and the text buffer:

#[repr(C)]
struct ScreenChar {
    ascii_character: u8,
    color_code: ColorCode,
}

const BUFFER_HEIGHT: usize = 25;
const BUFFER_WIDTH: usize = 80;

struct Buffer {
    chars: [[ScreenChar; BUFFER_WIDTH]; BUFFER_HEIGHT],
}

Since the field ordering in default structs is undefined in Rust, we need the repr(C) attribute. It guarantees that the struct’s fields are laid out exactly like in a C struct and thus guarantees the correct field ordering.
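The effect of repr(C) can be checked on a normal host system. The following sketch is not part of the kernel: it uses std, replaces the ColorCode newtype with a plain u8, and introduces a type alias `Chars` purely for illustration.

```rust
use std::mem::size_of;

#[derive(Clone, Copy)]
#[repr(C)]
struct ScreenChar {
    ascii_character: u8,
    color_code: u8, // stands in for the ColorCode newtype
}

const BUFFER_HEIGHT: usize = 25;
const BUFFER_WIDTH: usize = 80;

// Same shape as the `chars` field of `Buffer`.
type Chars = [[ScreenChar; BUFFER_WIDTH]; BUFFER_HEIGHT];

fn main() {
    // One screen character occupies exactly two bytes.
    assert_eq!(size_of::<ScreenChar>(), 2);
    // 25 rows * 80 columns * 2 bytes.
    assert_eq!(size_of::<Chars>(), 4000);

    // With repr(C), the fields are laid out in declaration order.
    let c = ScreenChar { ascii_character: b'A', color_code: 0x0a };
    let bytes: [u8; 2] = unsafe { std::mem::transmute(c) };
    assert_eq!(bytes, [b'A', 0x0a]);
    println!("layout matches the VGA cell format");
}
```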

To actually write to screen, we now create a writer type:

use core::ptr::Unique;

pub struct Writer {
    column_position: usize,
    color_code: ColorCode,
    buffer: Unique<Buffer>,
}

The writer will always write to the last line and shift lines up when a line is full (or on \n). The column_position field keeps track of the current position in the last row. The current foreground and background colors are specified by color_code and a pointer to the VGA buffer is stored in buffer. To make it possible to create a static Writer later, the buffer field stores a Unique<Buffer> instead of a plain *mut Buffer. Unique is a wrapper that implements Send/Sync and is thus usable as a static. Since it’s unstable, you may need to add the unique feature to lib.rs:

// in src/lib.rs
#![feature(unique)]

Printing Characters

Now we can use the Writer to modify the buffer’s characters. First we create a method to write a single ASCII byte (it doesn’t compile yet):

impl Writer {
    pub fn write_byte(&mut self, byte: u8) {
        match byte {
            b'\n' => self.new_line(),
            byte => {
                if self.column_position >= BUFFER_WIDTH {
                    self.new_line();
                }

                let row = BUFFER_HEIGHT - 1;
                let col = self.column_position;

                let color_code = self.color_code;
                self.buffer().chars[row][col] = ScreenChar {
                    ascii_character: byte,
                    color_code: color_code,
                };
                self.column_position += 1;
            }
        }
    }

    fn buffer(&mut self) -> &mut Buffer {
        unsafe{ self.buffer.as_mut() }
    }

    fn new_line(&mut self) {/* TODO */}
}

If the byte is the newline byte \n, the writer does not print anything. Instead it calls a new_line method, which we’ll implement later. Other bytes get printed to the screen in the second match case.

When printing a byte, the writer checks if the current line is full. In that case, it first calls new_line to wrap the line. Then it writes a new ScreenChar to the buffer at the current position. Finally, the current column position is advanced.

The buffer() auxiliary method converts the raw pointer in the buffer field into a safe mutable buffer reference. The unsafe block is needed because the as_mut() method of Unique is unsafe. But our buffer() method itself isn’t marked as unsafe, so it must not introduce any unsafety (e.g. cause segfaults). To guarantee that, it’s very important that the buffer field always points to a valid Buffer. It’s like a contract that we must uphold every time we create a Writer. To ensure that it’s not possible to create an invalid Writer from outside of the module, the struct must have at least one private field, and public creation functions are not allowed either.

Cannot Move out of Borrowed Content

When we try to compile it, we get the following error:

error[E0507]: cannot move out of borrowed content
  --> src/vga_buffer.rs:79:34
   |
79 | let color_code = self.color_code;
   |                  ^^^^ cannot move out of borrowed content

The reason is that Rust moves values by default instead of copying them like other languages. And we cannot move color_code out of self because we only borrowed self. For more information check out the ownership section in the Rust book.

To fix it, we can implement the Copy trait for the ColorCode type. The easiest way to do this is to use the built-in derive macro:

#[derive(Debug, Clone, Copy)]
struct ColorCode(u8);

We also derive the Clone trait, since it’s a requirement for Copy, and the Debug trait, which allows us to print this field for debugging purposes.

Now our project should compile again.

However, the documentation for Copy says: “if your type can implement Copy, it should”. Therefore we also derive Copy for Color and ScreenChar:

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
#[repr(u8)]
pub enum Color {...}

#[derive(Debug, Clone, Copy)]
#[repr(C)]
struct ScreenChar {...}
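The difference that Copy makes can be reproduced in isolation. The following sketch is a stripped-down stand-in for our Writer; the method name `current_color` and the additional PartialEq derive are assumptions made only for this example. Without the Copy derive, the marked line would fail with the same E0507 error as above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct ColorCode(u8);

struct Writer {
    color_code: ColorCode,
}

impl Writer {
    /// Reads the field through a borrowed `self`. This compiles only
    /// because `ColorCode` is `Copy`: the value is copied, not moved.
    fn current_color(&mut self) -> ColorCode {
        let color_code = self.color_code; // would be E0507 without Copy
        color_code
    }
}

fn main() {
    let mut writer = Writer { color_code: ColorCode(0x0a) };
    let copy = writer.current_color();
    // The original field is still usable after the copy.
    assert_eq!(copy, writer.color_code);
    println!("{:?}", copy);
}
```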

Try it out!

To write some characters to the screen, you can create a temporary function:

pub fn print_something() {
    let mut writer = Writer {
        column_position: 0,
        color_code: ColorCode::new(Color::LightGreen, Color::Black),
        buffer: unsafe { Unique::new_unchecked(0xb8000 as *mut _) },
    };

    writer.write_byte(b'H');
}

It just creates a new Writer that points to the VGA buffer at 0xb8000. To use the unstable Unique::new_unchecked function, we need to add the feature flag #![feature(const_unique_new)] to the top of our src/lib.rs.

Then it writes the byte b'H' to it. The b prefix creates a byte character, which represents an ASCII code point. When we call vga_buffer::print_something in main, an H should be printed in the lower left corner of the screen in light green:

Volatile

We just saw that our H was printed correctly. However, it might not work with future Rust compilers that optimize more aggressively.

The problem is that we only write to the Buffer and never read from it again. The compiler doesn’t know about the side effect that some characters appear on the screen. So it might decide that these writes are unnecessary and can be omitted.

To avoid this erroneous optimization, we need to specify these writes as volatile. This tells the compiler that the write has side effects and should not be optimized away.

In order to use volatile writes for the VGA buffer, we use the volatile library. This crate (as packages are called in the Rust world) provides a Volatile wrapper type with read and write methods. These methods internally use the read_volatile and write_volatile functions of the standard library and thus guarantee that the reads/writes are not optimized away.

We can add a dependency on the volatile crate by adding it to the dependencies section of our Cargo.toml:

# in Cargo.toml

[dependencies]
volatile = "0.1.0"

The 0.1.0 is the semantic version number. For more information, see the Specifying Dependencies guide of the cargo documentation.

Now we’ve declared that our project depends on the volatile crate and are able to import it in src/lib.rs:

// in src/lib.rs

extern crate volatile;

Let’s use it to make writes to the VGA buffer volatile. We update our Buffer type as follows:

// in src/vga_buffer.rs

use volatile::Volatile;

struct Buffer {
    chars: [[Volatile<ScreenChar>; BUFFER_WIDTH]; BUFFER_HEIGHT],
}

Instead of a ScreenChar, we’re now using a Volatile<ScreenChar>. (The Volatile type is generic and can wrap (almost) any type.) This ensures that we can’t accidentally write to it through a “normal” write. Instead, we have to use the write method now.

This means that we have to update our Writer::write_byte method:

impl Writer {
    pub fn write_byte(&mut self, byte: u8) {
        match byte {
            b'\n' => self.new_line(),
            byte => {
                ...

                self.buffer().chars[row][col].write(ScreenChar {
                    ascii_character: byte,
                    color_code: color_code,
                });
                ...
            }
        }
    }
    ...
}

Instead of a normal assignment using =, we’re now using the write method. This guarantees that the compiler will never optimize away this write.
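To get a feeling for what such a wrapper does, here is a minimal sketch built on the read_volatile and write_volatile functions. It is a simplified stand-in written for this explanation, not the actual source of the volatile crate, and it uses std so that it can be run on a host system.

```rust
use std::ptr;

/// Simplified stand-in for a volatile wrapper type (not the crate's code).
#[repr(transparent)]
struct Volatile<T: Copy>(T);

impl<T: Copy> Volatile<T> {
    fn new(value: T) -> Self {
        Volatile(value)
    }

    /// Volatile read: the compiler must not elide or merge it.
    fn read(&self) -> T {
        unsafe { ptr::read_volatile(&self.0) }
    }

    /// Volatile write: the compiler must not optimize it away.
    fn write(&mut self, value: T) {
        unsafe { ptr::write_volatile(&mut self.0, value) }
    }
}

fn main() {
    let mut cell = Volatile::new(0u16);
    cell.write(0x0a48);
    println!("{:#06x}", cell.read());
}
```

Because the inner field is private to the wrapper, the only way to change the value from other modules is through write, which is exactly the property we rely on for the VGA buffer.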

Printing Strings

To print whole strings, we can convert them to bytes and print them one-by-one:

// in `impl Writer`
pub fn write_str(&mut self, s: &str) {
    for byte in s.bytes() {
        self.write_byte(byte);
    }
}

You can try it yourself in the print_something function.

When you print strings with some special characters like ä or λ, you’ll notice that they cause weird symbols on screen. That’s because they are represented by multiple bytes in UTF-8. By converting them to bytes, we of course get strange results. But since the VGA buffer doesn’t support UTF-8, it’s not possible to display these characters anyway.
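The multi-byte encoding is easy to observe on a host system. The sketch below also shows one possible way to deal with such bytes, namely replacing everything outside the printable ASCII range with a fallback character. The helper `printable_or_fallback` and the choice of `?` as fallback are assumptions for this example; our Writer does not do this.

```rust
/// Maps a byte to something the VGA text buffer can display sensibly.
/// The fallback `b'?'` is an arbitrary choice for this sketch.
fn printable_or_fallback(byte: u8) -> u8 {
    match byte {
        // Printable ASCII range and newline are passed through.
        0x20..=0x7e | b'\n' => byte,
        _ => b'?',
    }
}

fn main() {
    // U+00E4 (a with diaeresis) is one character but two bytes in UTF-8.
    let bytes: Vec<u8> = "\u{e4}".bytes().collect();
    assert_eq!(bytes, vec![0xc3, 0xa4]);

    // Each of the two bytes is replaced separately.
    let shown: Vec<u8> = "H\u{e4}!".bytes().map(printable_or_fallback).collect();
    assert_eq!(shown, b"H??!".to_vec());
    println!("{}", String::from_utf8(shown).unwrap());
}
```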

Support Formatting Macros

It would be nice to support Rust’s formatting macros, too. That way, we can easily print different types like integers or floats. To support them, we need to implement the core::fmt::Write trait. The only required method of this trait is write_str, which looks quite similar to our write_str method. To implement the trait, we just need to move it into an impl fmt::Write for Writer block and add a return type:

use core::fmt;

impl fmt::Write for Writer {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        for byte in s.bytes() {
            self.write_byte(byte);
        }
        Ok(())
    }
}

The Ok(()) is just an Ok Result containing the () type. We can drop the pub because trait methods are always public.

Now we can use Rust’s built-in write!/writeln! formatting macros:

// in the `print_something` function
use core::fmt::Write;
let mut writer = Writer {...};
writer.write_byte(b'H');
writer.write_str("ello! ");
write!(writer, "The numbers are {} and {}", 42, 1.0/3.0);

Now you should see a Hello! The numbers are 42 and 0.3333333333333333 at the bottom of the screen.
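The formatting macros work with any type that implements the Write trait, which makes it possible to try out the same calls on a host system. The following sketch uses std and a hypothetical `MockWriter` that collects the bytes in a Vec instead of writing them to the VGA buffer.

```rust
use std::fmt::{self, Write};

/// Hypothetical host-side stand-in for the VGA `Writer`:
/// it collects bytes in a `Vec` instead of writing to 0xb8000.
struct MockWriter {
    bytes: Vec<u8>,
}

impl MockWriter {
    fn write_byte(&mut self, byte: u8) {
        self.bytes.push(byte);
    }
}

impl fmt::Write for MockWriter {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        for byte in s.bytes() {
            self.write_byte(byte);
        }
        Ok(())
    }
}

fn main() {
    let mut writer = MockWriter { bytes: Vec::new() };
    writer.write_byte(b'H');
    writer.write_str("ello! ").unwrap();
    write!(writer, "The numbers are {} and {}", 42, 1.0 / 3.0).unwrap();
    println!("{}", String::from_utf8(writer.bytes).unwrap());
}
```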

Newlines

Right now, we just ignore newlines and characters that don’t fit into the line anymore. Instead we want to move every character one line up (the top line gets deleted) and start at the beginning of the last line again. To do this, we add an implementation for the new_line method of Writer:

// in `impl Writer`

fn new_line(&mut self) {
    for row in 1..BUFFER_HEIGHT {
        for col in 0..BUFFER_WIDTH {
            let buffer = self.buffer();
            let character = buffer.chars[row][col].read();
            buffer.chars[row - 1][col].write(character);
        }
    }
    self.clear_row(BUFFER_HEIGHT-1);
    self.column_position = 0;
}

fn clear_row(&mut self, row: usize) {/* TODO */}

We iterate over all screen characters and move each character one row up. Note that the range notation (..) excludes the upper bound. We also omit the 0th row (the first range starts at 1) because it’s the row that is shifted off screen.

Now we only need to implement the clear_row method to finish the newline code:

// in `impl Writer`
fn clear_row(&mut self, row: usize) {
    let blank = ScreenChar {
        ascii_character: b' ',
        color_code: self.color_code,
    };
    for col in 0..BUFFER_WIDTH {
        self.buffer().chars[row][col].write(blank);
    }
}

This method clears a row by overwriting all of its characters with a space character.
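The scrolling logic can be tried out on a tiny in-memory buffer. The sketch below is a host-side model with made-up names (`Screen`, `HEIGHT`, `WIDTH`) and plain bytes instead of volatile screen characters, so whole rows can be copied at once.

```rust
const HEIGHT: usize = 3; // tiny stand-in for BUFFER_HEIGHT
const WIDTH: usize = 4; // tiny stand-in for BUFFER_WIDTH

struct Screen {
    chars: [[u8; WIDTH]; HEIGHT],
    column_position: usize,
}

impl Screen {
    fn new_line(&mut self) {
        // Copy every row one position up; row 0 is overwritten and lost.
        for row in 1..HEIGHT {
            self.chars[row - 1] = self.chars[row];
        }
        self.clear_row(HEIGHT - 1);
        self.column_position = 0;
    }

    fn clear_row(&mut self, row: usize) {
        self.chars[row] = [b' '; WIDTH];
    }
}

fn main() {
    let mut screen = Screen {
        chars: [*b"top.", *b"mid.", *b"bot."],
        column_position: 4,
    };
    screen.new_line();
    for row in screen.chars.iter() {
        println!("{:?}", std::str::from_utf8(row).unwrap());
    }
}
```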

Providing an Interface

To provide a global writer that can be used as an interface from other modules, we can add a static writer:

pub static WRITER: Writer = Writer {
    column_position: 0,
    color_code: ColorCode::new(Color::LightGreen, Color::Black),
    buffer: unsafe { Unique::new_unchecked(0xb8000 as *mut _) },
};

But we can’t use it to print anything! You can try it yourself in the print_something function. The reason is that we try to take a mutable reference (&mut) to an immutable static when calling WRITER.write_byte.

To resolve it, we could use a mutable static. But then every read and write to it would be unsafe since it could easily introduce data races and other bad things. Using static mut is highly discouraged; there are even proposals to remove it.

But what are the alternatives? We could try to use a cell type like RefCell or even UnsafeCell to provide interior mutability. But these types aren’t Sync (with good reason), so we can’t use them in statics.

To get synchronized interior mutability, users of the standard library can use Mutex. It provides mutual exclusion by blocking threads when the resource is already locked. But our basic kernel does not have any blocking support or even a concept of threads, so we can’t use it either. However there is a really basic kind of mutex in computer science that requires no operating system features: the spinlock. Instead of blocking, the threads simply try to lock it again and again in a tight loop and thus burn CPU time until the mutex is free again.
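The idea behind a spinlock fits into a few lines. The following is a minimal sketch based on an atomic flag, written for illustration only; it is not the implementation of any particular crate, and the `SpinLock` type and its `with_lock` method are made-up names. It uses std so that it can be run on a host system.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal spinlock sketch (illustration only).
struct SpinLock<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

// Sharing is sound because access to `data` is guarded by `locked`.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    const fn new(data: T) -> Self {
        SpinLock { locked: AtomicBool::new(false), data: UnsafeCell::new(data) }
    }

    /// Runs `f` with exclusive access to the protected value.
    /// Note: if `f` panics, the lock stays held in this sketch.
    fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Busy-wait until we manage to flip `locked` from false to true.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        let result = f(unsafe { &mut *self.data.get() });
        self.locked.store(false, Ordering::Release);
        result
    }
}

static COUNTER: SpinLock<u32> = SpinLock::new(0);

fn main() {
    COUNTER.with_lock(|c| *c += 1);
    println!("{}", COUNTER.with_lock(|c| *c));
}
```

Since `new` is a const function, the lock can be used in a static, which is the property we need for a global writer.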

To use a spinning mutex, we can add the spin crate as a dependency:

# in Cargo.toml
[dependencies]
rlibc = "0.1.4"
spin = "0.4.5"

// in src/lib.rs
extern crate spin;

Then we can use the spinning Mutex to add interior mutability to our static writer:

// in src/vga_buffer.rs again
use spin::Mutex;
...
pub static WRITER: Mutex<Writer> = Mutex::new(Writer {
    column_position: 0,
    color_code: ColorCode::new(Color::LightGreen, Color::Black),
    buffer: unsafe { Unique::new_unchecked(0xb8000 as *mut _) },
});

Mutex::new is a const function, too, so it can be used in statics.

Now we can easily print from our main function:

// in src/lib.rs
pub extern fn rust_main() {
    use core::fmt::Write;
    vga_buffer::WRITER.lock().write_str("Hello again");
    write!(vga_buffer::WRITER.lock(), ", some numbers: {} {}", 42, 1.337);
    loop{}
}

Note that we need to import the Write trait if we want to use its functions.

A println macro

Rust’s macro syntax is a bit strange, so we won’t try to write a macro from scratch. Instead we look at the source of the println! macro in the standard library:

macro_rules! println {
    ($fmt:expr) => (print!(concat!($fmt, "\n")));
    ($fmt:expr, $($arg:tt)*) => (print!(concat!($fmt, "\n"), $($arg)*));
}

Macros are defined through one or more rules, which are similar to match arms. The println macro has two rules: The first rule is for invocations with a single argument (e.g. println!("Hello")) and the second rule is for invocations with additional parameters (e.g. println!("{}{}", 4, 2)).

Both rules simply append a newline character (\n) to the format string and then invoke the print! macro, which is defined as:

macro_rules! print {
    ($($arg:tt)*) => ($crate::io::_print(format_args!($($arg)*)));
}

The macro expands to a call of the _print function in the io module. The $crate variable ensures that the macro also works from outside the std crate. For example, it expands to ::std when it’s used in other crates.

The format_args macro builds a fmt::Arguments type from the passed arguments, which is passed to _print. The _print function of libstd is rather complicated, as it supports different Stdout devices. We don’t need that complexity since we just want to print to the VGA buffer.

To print to the VGA buffer, we just copy the println! macro and modify the print! macro to use our static WRITER instead of _print:

// in src/vga_buffer.rs
macro_rules! print {
    ($($arg:tt)*) => ({
        use core::fmt::Write;
        let mut writer = $crate::vga_buffer::WRITER.lock();
        writer.write_fmt(format_args!($($arg)*)).unwrap();
    });
}

Instead of a _print function, we call the write_fmt method of our static Writer. Since we’re using a method from the Write trait, we need to import it first. The additional unwrap() at the end panics if printing isn’t successful. But since we always return Ok in write_str, that should not happen.

Note the additional {} scope around the macro: We write => ({…}) instead of => (…). The additional {} prevents the Write trait from being silently imported into the parent scope when print is used.

Clearing the screen

We can now use println! to add a rather trivial function to clear the screen:

// in src/vga_buffer.rs
pub fn clear_screen() {
    for _ in 0..BUFFER_HEIGHT {
        println!("");
    }
}

Hello World using `println`

To use println in lib.rs, we need to import the macros of the VGA buffer module first. Therefore we add a #[macro_use] attribute to the module declaration:

// in src/lib.rs

#[macro_use]
mod vga_buffer;

#[no_mangle]
pub extern fn rust_main() {
    // ATTENTION: we have a very small stack and no guard page
    vga_buffer::clear_screen();
    println!("Hello World{}", "!");

    loop{}
}

Since we imported the macros at crate level, they are available in all modules and thus provide an easy and safe interface to the VGA buffer.

As expected, we now see a “Hello World!” on a cleared screen:

Deadlocks

Whenever we use locks, we must be careful to not accidentally introduce deadlocks. A deadlock occurs when a thread/program waits for a lock that will never be released. Normally, this happens when multiple threads access multiple locks. For example, when thread A holds lock 1 and tries to acquire lock 2 while, at the same time, thread B holds lock 2 and tries to acquire lock 1.

However, a deadlock can also occur when a thread tries to acquire the same lock twice. This way we can trigger a deadlock in our VGA driver:

// in rust_main in src/lib.rs

println!("{}", { println!("inner"); "outer" });

The argument passed to println is a new block that resolves to the string “outer” (a block always returns the result of the last expression). But before returning “outer”, the block tries to print the string “inner”.

When we try this code in QEMU, we see that neither of the strings is printed. To understand what’s happening, we take a look at our print macro again:

macro_rules! print {
    ($($arg:tt)*) => ({
        use core::fmt::Write;
        let mut writer = $crate::vga_buffer::WRITER.lock();
        writer.write_fmt(format_args!($($arg)*)).unwrap();
    });
}

So we first lock the WRITER and then we evaluate the arguments using format_args. The problem is that the argument in our code example contains another println, which tries to lock the WRITER again. So now the inner println waits for the outer println and vice versa. Thus, a deadlock occurs and the CPU spins endlessly.

Fixing the Deadlock

In order to fix the deadlock, we need to evaluate the arguments before locking the WRITER. We can do so by moving the locking and printing logic into a new print function (like it’s done in the standard library):

// in src/vga_buffer.rs

macro_rules! print {
    ($($arg:tt)*) => ({
        $crate::vga_buffer::print(format_args!($($arg)*));
    });
}

pub fn print(args: fmt::Arguments) {
    use core::fmt::Write;
    WRITER.lock().write_fmt(args).unwrap();
}

Now the macro only evaluates the arguments (through format_args!) and passes them to the new print function. The print function then locks the WRITER and prints the formatting arguments using write_fmt. So now the arguments are evaluated before locking the WRITER.

Thus, we fixed the deadlock:

We see that both “inner” and “outer” are printed.
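The evaluation order can also be reproduced on a host system. The sketch below is an assumption-laden model of our setup: it uses the Mutex of the standard library and a String named `SINK` instead of the spinning mutex and the VGA writer, and the macro is called `my_print` to avoid clashing with the built-in print macro.

```rust
use std::fmt::{self, Write};
use std::sync::Mutex;

/// Host-side stand-in for the VGA writer: output is collected in a String.
static SINK: Mutex<String> = Mutex::new(String::new());

fn print(args: fmt::Arguments) {
    // The lock is taken only here, after `args` has been built.
    SINK.lock().unwrap().write_fmt(args).unwrap();
}

macro_rules! my_print {
    ($($arg:tt)*) => ({
        print(format_args!($($arg)*));
    });
}

fn main() {
    // The inner call runs to completion while the arguments of the
    // outer call are evaluated, so the lock is never requested twice
    // at the same time.
    my_print!("{}", { my_print!("inner "); "outer" });
    println!("{}", SINK.lock().unwrap());
}
```

If the macro locked `SINK` before evaluating format_args, as our first version did with WRITER, the inner call would request a lock that the same thread already holds.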

What’s next?

In the next posts we will map the kernel pages correctly so that accessing 0x0 or writing to .rodata is not possible anymore. To obtain the loaded kernel sections we will read the Multiboot information structure. Then we will create a paging module and use it to switch to a new page table where the kernel sections are mapped correctly.

The next post describes the Multiboot information structure and creates a frame allocator using the information about memory areas.

Other Rust OS Projects

Now that you know the very basics of OS development in Rust, you should also check out the following projects:

Rust Bare-Bones Kernel: A basic kernel with roughly the same functionality as ours. Writes output to the serial port instead of the VGA buffer and maps the kernel to the higher half (instead of our identity mapping). Note: You need to cross compile binutils to build it (or you create some symbolic links¹ if you’re on x86_64).
RustOS: More advanced kernel that supports allocation, keyboard inputs, and threads. It also has a scheduler and a basic network driver.
“Tifflin” Experimental Kernel: Big kernel project by thepowersgang, that is actively developed and has over 650 commits. It has a separate userspace and supports multiple file systems, even a GUI is included. Needs a cross compiler.
Redox: Probably the most complete Rust OS today. It has an active community and over 1000 Github stars. File systems, network, an audio player, a picture viewer, and much more. Just take a look at the screenshots.

Footnotes

You will need to symlink x86_64-none_elf-XXX to /usr/bin/XXX where XXX is in {as, ld, objcopy, objdump, strip}. The x86_64-none_elf-XXX files must be in some folder that is in your $PATH. But then you can only build for your x86_64 host architecture, so use this hack only for testing.

Set Up Rust

Wed, 02 Sep 2015 00:00:00 +0000

In the previous posts we created a minimal Multiboot kernel and switched to Long Mode. Now we can finally switch to Rust code. Rust is a high-level language without runtime. It allows us to not link the standard library and write bare metal code. Unfortunately the setup is not quite hassle-free yet.

This blog post tries to set up Rust step-by-step and point out the different problems. If you have any questions, problems, or suggestions please file an issue or create a comment at the bottom. The code from this post is in a Github repository, too.

Installing Rust

We need a nightly compiler, as we will use many unstable features. To manage Rust installations I highly recommend rustup. It allows you to install nightly, beta, and stable compilers side-by-side and makes it easy to update them. To use a nightly compiler for the current directory, you can run rustup override add nightly. Alternatively, you can add a file called rust-toolchain to the project’s root directory:

nightly

Creating a Cargo project

Cargo is Rust’s excellent package manager. Normally you would call cargo new when you want to create a new project folder. We can’t use it because our folder already exists, so we need to do it manually. Fortunately we only need to add a cargo configuration file named Cargo.toml:

[package]
name = "blog_os"
version = "0.1.0"
authors = ["Philipp Oppermann <[email protected]>"]

[lib]
crate-type = ["staticlib"]

The package section contains required project metadata such as the semantic crate version. The lib section specifies that we want to build a static library, i.e. a library that contains all of its dependencies. This is required to link the Rust project with our kernel.

Now we place our root source file in src/lib.rs:

#![feature(lang_items)]
#![no_std]

#[no_mangle]
pub extern fn rust_main() {}

#[lang = "eh_personality"] #[no_mangle] pub extern fn eh_personality() {}
#[lang = "panic_fmt"] #[no_mangle] pub extern fn panic_fmt() -> ! {loop{}}

Let’s break it down:

#! defines an attribute of the current module. Since we are at the root module, the attributes apply to the crate itself.
The feature attribute is used to allow the specified feature-gated attributes in this crate. You can’t do that in a stable/beta compiler, so this is one reason we need a Rust nightly.
The no_std attribute prevents the automatic linking of the standard library. We can’t use std because it relies on operating system features like files, system calls, and various device drivers. Remember that currently the only “feature” of our OS is printing OKAY :).
A # without a ! afterwards defines an attribute for the following item (a function in our case).
The no_mangle attribute disables the automatic name mangling that Rust uses to get unique function names. We want to call rust_main from our assembly code, so this function name must stay as it is.
We mark our main function as extern to make it compatible with the standard C calling convention.
The lang attribute defines a Rust language item.
The eh_personality function is used for Rust’s unwinding on panic!. We can leave it empty since we don’t have any unwinding support in our OS yet.
The panic_fmt function is the entry point on panic. Right now we can’t do anything useful, so we just make sure that it doesn’t return (required by the ! return type).

Building Rust

We can now build it using cargo build, which creates a static library at target/debug/libblog_os.a. However, the resulting library is specific to our host operating system. This is undesirable, because our target system might be different.

Let’s define some properties of our target system:

x86_64: Our target CPU is a recent x86_64 CPU.
No operating system: Our target does not run any operating system (we’re currently writing it), so the compiler should not assume any OS-specific functionality.
Handles hardware interrupts: We’re writing a kernel, so we’ll need to handle asynchronous hardware interrupts at some point. This means that we have to disable a certain stack pointer optimization (the so-called red zone), because it would cause stack corruption otherwise.
No SSE: Our target might not have SSE support. Even if it does, we probably don’t want to use SSE instructions in our kernel, because it makes interrupt handling much slower. We will explain this in detail in the “Handling Exceptions” post.
No hardware floats: The x86_64 architecture uses SSE instructions for floating point operations, which we don’t want to use (see the previous point). So we also need to avoid hardware floating point operations in our kernel. Instead, we will use soft floats, which are basically software functions that emulate floating point operations using normal integers.

Target Specifications

Rust allows us to define custom targets through a JSON configuration file. A minimal target specification equal to x86_64-unknown-linux-gnu (the default 64-bit Linux target) looks like this:

{
  "llvm-target": "x86_64-unknown-linux-gnu",
  "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
  "linker-flavor": "gcc",
  "target-endian": "little",
  "target-pointer-width": "64",
  "target-c-int-width": "32",
  "arch": "x86_64",
  "os": "linux"
}

The llvm-target field specifies the target triple that is passed to LLVM. Target triples are a naming convention that defines the CPU architecture (e.g., x86_64 or arm), the vendor (e.g., apple or unknown), the operating system (e.g., windows or linux), and the ABI (e.g., gnu or msvc). For example, the target triple for 64-bit Linux is x86_64-unknown-linux-gnu and for 32-bit Windows the target triple is i686-pc-windows-msvc.

The data-layout field is also passed to LLVM and specifies how data should be laid out in memory. It consists of various specifications separated by a - character. For example, the e means little endian and S128 specifies that the stack should be 128 bits (= 16 bytes) aligned. The format is described in detail in the LLVM documentation but there shouldn’t be a reason to change this string.

The linker-flavor field was recently introduced in #40018 with the intention to add support for the LLVM linker LLD, which is platform independent. In the future, this might allow easy cross compilation without the need to install a gcc cross compiler for linking.

A Kernel Target Specification

For our target system, we define the following JSON configuration in a file named x86_64-blog_os.json:

{
  "llvm-target": "x86_64-unknown-none",
  "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
  "linker-flavor": "gcc",
  "target-endian": "little",
  "target-pointer-width": "64",
  "target-c-int-width": "32",
  "arch": "x86_64",
  "os": "none",
  "disable-redzone": true,
  "features": "-mmx,-sse,+soft-float"
}

As llvm-target we use x86_64-unknown-none, which defines the x86_64 architecture, an unknown vendor, and no operating system (none). The ABI doesn’t matter for us, so we just leave it off. The data-layout field is just copied from the x86_64-unknown-linux-gnu target. We also use the same values for the target-endian, target-pointer-width, target-c-int-width, and arch fields. For the os field we choose none, since our kernel runs on bare metal.

The Red Zone

The red zone is an optimization of the System V ABI that allows functions to temporarily use the 128 bytes below their stack frame without adjusting the stack pointer:

The image shows the stack frame of a function with n local variables. On function entry, the stack pointer is adjusted to make room on the stack for the local variables.

However, this optimization leads to huge problems with exceptions or hardware interrupts. Let’s assume that an exception occurs while a function uses the red zone:

The CPU and the exception handler overwrite the data in the red zone. But this data is still needed by the interrupted function. So the function won’t work correctly anymore when we return from the exception handler. This might lead to strange bugs that take weeks to debug.

To avoid such bugs when we implement exception handling in the future, we disable the red zone right from the beginning. This is achieved by adding the "disable-redzone": true line to our target configuration file.

SIMD Extensions

The features field enables/disables target features. We disable the mmx and sse features by prefixing them with a minus and enable the soft-float feature by prefixing it with a plus. The mmx and sse features determine support for Single Instruction Multiple Data (SIMD) instructions, which simultaneously perform an operation (e.g. addition) on multiple data words. The x86 architecture supports the following standards:

MMX: The Multi Media Extension instruction set was introduced in 1997 and defines eight 64 bit registers called mm0 through mm7. These registers are just aliases for the registers of the x87 floating point unit.
SSE: The Streaming SIMD Extensions instruction set was introduced in 1999. Instead of re-using the floating point registers, it adds a completely new register set. The sixteen new registers are called xmm0 through xmm15 and are 128 bits each.
AVX: The Advanced Vector Extensions are extensions that further increase the size of the multimedia registers. The new registers are called ymm0 through ymm15 and are 256 bits each. They extend the xmm registers, so e.g. xmm0 is the lower half of ymm0.

By using such SIMD standards, programs can often speed up significantly. Good compilers are able to transform normal loops into such SIMD code automatically through a process called auto-vectorization.

However, the large SIMD registers lead to problems in OS kernels. The reason is that the kernel has to back up all registers that it uses on each hardware interrupt (we will look into this in the “Handling Exceptions” post). So if the kernel uses SIMD registers, it has to back up a lot more data, which noticeably decreases performance. To avoid this performance loss, we disable the sse and mmx features (the avx feature is disabled by default).

As noted above, floating point operations on x86_64 use SSE registers, so floats are no longer usable without SSE. Unfortunately, the Rust core library already uses floats (e.g., it implements traits for f32 and f64), so we need an alternative way to implement float operations. The soft-float feature solves this problem by emulating all floating point operations through software functions based on normal integers.

Compiling

To build our kernel for our new target, we pass the configuration file’s name as --target argument. There is currently an open bug for custom target specifications, so you also need to set the RUST_TARGET_PATH environment variable to the current directory, otherwise Rust doesn’t find your target. The full command is:

RUST_TARGET_PATH=$(pwd) cargo build --target x86_64-blog_os

However, the following error occurs:

error[E0463]: can't find crate for `core`
  |
  = note: the `x86_64-blog_os` target may not be installed

The error tells us that the Rust compiler no longer finds the core library. The core library is implicitly linked to all no_std crates and contains things such as Result, Option, and iterators.

The problem is that the core library is distributed together with the Rust compiler as a precompiled library. So it is only valid for the host triple (e.g., x86_64-unknown-linux-gnu) but not for our custom target. If we want to compile code for other targets, we need to recompile core for these targets first.

Xargo

That’s where xargo comes in. It is a wrapper for cargo that eases cross compilation. We can install it by executing:

cargo install xargo

Xargo depends on the rust source code, which we can install with rustup component add rust-src.

Let’s try it:

> RUST_TARGET_PATH=$(pwd) xargo build --target=x86_64-blog_os
   Compiling core v0.0.0 (file:///…/rust/src/libcore)
    Finished release [optimized] target(s) in 22.87 secs
   Compiling blog_os v0.1.0 (file:///…/blog_os/tags)
    Finished dev [unoptimized + debuginfo] target(s) in 0.29 secs

It worked! We see that xargo cross-compiled the core library for our new custom target and then continued to compile our blog_os crate. After compilation, we can find a static library at target/x86_64-blog_os/debug/libblog_os.a, which can be linked with our assembly kernel.

Integrating Rust

Let’s try to integrate our Rust library into our assembly kernel so that we can call the rust_main function. For that we need to pass the libblog_os.a file to the linker, together with the assembly object files.

Adjusting the Makefile

To build and link the Rust library on make, we extend our Makefile (full file):

# ...
target ?= $(arch)-blog_os
rust_os := target/$(target)/debug/libblog_os.a
# ...
.PHONY: all clean run iso kernel
# ...
$(kernel): kernel $(rust_os) $(assembly_object_files) $(linker_script)
	@ld -n -T $(linker_script) -o $(kernel) \
		$(assembly_object_files) $(rust_os)

kernel:
	@RUST_TARGET_PATH=$(shell pwd) xargo build --target $(target)

We add a new kernel target that just executes xargo build and modify the $(kernel) target to link the created static lib. We also add the new kernel target to the .PHONY list, since it does not belong to a file with that name.

But now xargo build is executed on every make, even if no source file was changed. And the ISO is recreated on every make iso/make run, too. We could try to avoid this by adding dependencies on all rust source and cargo configuration files to the kernel target, but the ISO creation takes only half a second on my machine and most of the time we will have changed a Rust file when we run make. So we keep it simple for now and let cargo do the bookkeeping of changed files (it does it anyway).

Calling Rust

Now we can call the rust_main function in long_mode_start:

bits 64
long_mode_start:
    ...

    ; call the rust main
    extern rust_main     ; new
    call rust_main       ; new

    ; print `OKAY` to screen
    mov rax, 0x2f592f412f4b2f4f
    mov qword [0xb8000], rax
    hlt

By defining rust_main as extern we tell nasm that the function is defined in another file. As the linker takes care of linking them together, we’ll get a linker error if we have a typo in the name or forget to mark the Rust function as pub extern.

If we’ve done everything right, we should still see the green OKAY when executing make run. That means that we successfully called the Rust function and returned back to assembly.

Fixing Linker Errors

Now we can try some Rust code:

pub extern fn rust_main() {
    let x = ["Hello", "World", "!"];
    let y = x;
}

When we test it using make run, it fails with undefined reference to 'memcpy'. The memcpy function is one of the basic functions of the C library (libc). Usually the libc crate is linked to every Rust program together with the standard library, but we opted out through #![no_std]. We could try to fix this by adding the libc crate as extern crate. But libc is just a wrapper for the system libc, for example glibc on Linux, so this won’t work for us. Instead we need to recreate the basic libc functions such as memcpy, memmove, memset, and memcmp in Rust.
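To give an idea of what recreating such a function involves, here is a minimal sketch of a byte-wise copy. The name memcpy_sketch is hypothetical and only chosen so that it does not clash with the real symbol; an actual replacement would have to be exported under the name memcpy with the C calling convention.

```rust
/// Copies `n` bytes from `src` to `dest`, one byte at a time.
///
/// Hypothetical illustration: the name `memcpy_sketch` is made up so that
/// it does not collide with the real `memcpy` symbol.
///
/// # Safety
/// `src` must be valid for reads of `n` bytes, `dest` must be valid for
/// writes of `n` bytes, and the two regions must not overlap.
pub unsafe fn memcpy_sketch(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        // Raw pointer arithmetic without any bounds checks.
        unsafe { *dest.add(i) = *src.add(i) };
        i += 1;
    }
    dest
}
```

Like the C original, the sketch returns the destination pointer. It makes no attempt to copy more than one byte per iteration, so it is simple rather than fast.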

rlibc

Fortunately there already is a crate for that: rlibc. When we look at its source code we see that it contains no magic, just some raw pointer operations in a while loop. To add rlibc as a dependency we just need to add two lines to the Cargo.toml:

...
[dependencies]
rlibc = "1.0"

and an extern crate definition in our src/lib.rs:

...
extern crate rlibc;

#[no_mangle]
pub extern fn rust_main() {
...

Now make run doesn’t complain about memcpy anymore. Instead it will show a pile of new ugly linker errors:

target/x86_64-blog_os/debug/libblog_os.a(core-92335f822fa6c9a6.0.o):
    In function `_$LT$f32$u20$as$u20$core..num..dec2flt..
        rawfp..RawFloat$GT$::from_int::h50f7952efac3fdca':
    core.cgu-0.rs:(.text._ZN59_$LT$f32$u20$as$u20$core..num..dec2flt..
        rawfp..RawFloat$GT$8from_int17h50f7952efac3fdcaE+0x2):
    undefined reference to `__floatundisf'
target/x86_64-blog_os/debug/libblog_os.a(core-92335f822fa6c9a6.0.o):
    In function `_$LT$f64$u20$as$u20$core..num..dec2flt..rawfp..
        RawFloat$GT$::from_int::h12a81f175246914a':
    core.cgu-0.rs:(.text._ZN59_$LT$f64$u20$as$u20$core..num..dec2flt..rawfp..
        RawFloat$GT$8from_int17h12a81f175246914aE+0x2):
    undefined reference to `__floatundidf'
target/x86_64-blog_os/debug/libblog_os.a(core-92335f822fa6c9a6.0.o):
    In function `core::num::from_str_radix::h09b12650704e0508':
    core.cgu-0.rs:(.text._ZN4core3num14from_str_radix
        17h09b12650704e0508E+0xcf):
    undefined reference to `__muloti4'
...

--gc-sections

The new errors are linker errors about various missing functions such as __floatundisf or __muloti4. These functions are part of LLVM’s compiler-rt builtins and are normally linked by the standard library. For no_std crates like ours, one has to link the compiler-rt library manually. Unfortunately, this library is implemented in C and the build process is a bit cumbersome. Alternatively, there is the compiler-builtins crate that tries to port the library to Rust, but it isn’t complete yet.

In our case, there is a much simpler solution, since our kernel doesn’t really need any of those functions yet. So we can just tell the linker to remove unused program sections and hopefully all references to these functions will disappear. Removing unused sections is generally a good idea as it reduces kernel size. The magic linker flag for this is --gc-sections, which stands for “garbage collect sections”. Let’s add it to the $(kernel) target in our Makefile:

$(kernel): kernel $(rust_os) $(assembly_object_files) $(linker_script)
	@ld -n --gc-sections -T $(linker_script) -o $(kernel) \
		$(assembly_object_files) $(rust_os)

Now we can do a make run again and it compiles without errors. However, it doesn’t boot anymore:

GRUB error: no multiboot header found.

What happened? Well, the linker removed unused sections. And since we don’t use the Multiboot section anywhere, ld removes it, too. So we need to tell the linker explicitly that it should keep this section. The KEEP command does exactly that, so we add it to the linker script (linker.ld):

.boot :
{
    /* ensure that the multiboot header is at the beginning */
    KEEP(*(.multiboot_header))
}

Now everything should work again (the green OKAY). But there is another linking issue, which is triggered by some other example code.

panic = "abort"

The following snippet still fails:

    ...
    let test = (0..3).flat_map(|x| 0..x).zip(0..);

The error is a linker error again (hence the ugly error message):

target/x86_64-blog_os/debug/libblog_os.a(blog_os-b5a29f28b14f1f1f.0.o):
    In function `core::ptr::drop_in_place<core::iter::Zip<
        core::iter::FlatMap<core::ops::Range<i32>, core::ops::Range<i32>,
        closure>, core::ops::RangeFrom<i32>>>':
        /…/rust/src/libcore/ptr.rs:66:
    undefined reference to `_Unwind_Resume'
target/x86_64-blog_os/debug/libblog_os.a(blog_os-b5a29f28b14f1f1f.0.o):
    In function `core::iter::iterator::Iterator::zip<core::iter::FlatMap<
        core::ops::Range<i32>, core::ops::Range<i32>, closure>,
        core::ops::RangeFrom<i32>>':
        /…/rust/src/libcore/iter/iterator.rs:389:
    undefined reference to `_Unwind_Resume'
...

So the linker can’t find a function named _Unwind_Resume that is referenced e.g. in iter/iterator.rs:389 in libcore. This reference is not really there at line 389 of libcore’s iterator.rs. Instead, it is a compiler-inserted landing pad, which is used for panic handling.

By default, the destructors of all stack variables are run when a panic occurs. This is called unwinding and allows parent threads to recover from panics. However, it requires a platform specific gcc library, which isn’t available in our kernel.

Fortunately, Rust allows us to disable unwinding for our target. For that we add the following line to our x86_64-blog_os.json file:

{
  "...",
  "panic-strategy": "abort"
}

By setting the panic strategy to abort instead of the default unwind, we disable all unwinding in our kernel. Let’s try make run again:

   Compiling core v0.0.0 (file:///…/rust/src/libcore)
    Finished release [optimized] target(s) in 22.24 secs
    Finished dev [unoptimized + debuginfo] target(s) in 0.5 secs
target/x86_64-blog_os/debug/libblog_os.a(blog_os-b5a29f28b14f1f1f.0.o):
    In function `core::ptr::drop_in_place<…>':
    /…/src/libcore/ptr.rs:66:
    undefined reference to `_Unwind_Resume'
...

We see that xargo recompiles the core crate, but the _Unwind_Resume error still occurs. This is because our blog_os crate was not recompiled somehow and thus still references the unwinding function. To fix this, we need to force a recompile using cargo clean:

> cargo clean
> make run
   Compiling rlibc v1.0.0
   Compiling blog_os v0.1.0 (file:///home/philipp/Documents/blog_os/tags)
warning: unused variable: `test` […]

    Finished dev [unoptimized + debuginfo] target(s) in 0.60 secs

It worked! We no longer see linker errors and our kernel prints OKAY again.

Hello World!

Finally, it’s time for a Hello World! from Rust:

#[no_mangle]
pub extern fn rust_main() {
    // ATTENTION: we have a very small stack and no guard page

    let hello = b"Hello World!";
    let color_byte = 0x1f; // white foreground, blue background

    let mut hello_colored = [color_byte; 24];
    for (i, char_byte) in hello.into_iter().enumerate() {
        hello_colored[i*2] = *char_byte;
    }

    // write `Hello World!` to the center of the VGA text buffer
    let buffer_ptr = (0xb8000 + 1988) as *mut _;
    unsafe { *buffer_ptr = hello_colored };

    loop{}
}

Some notes:

The b prefix creates a byte string, which is just an array of u8
enumerate is an Iterator method that adds the current index i to elements
buffer_ptr is a raw pointer that points to the center of the VGA text buffer
Rust doesn’t know the VGA buffer and thus can’t guarantee that writing to the buffer_ptr is safe (it could point to important data). So we need to tell Rust that we know what we are doing by using an unsafe block.
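The interleaving loop and the magic offset 1988 can be checked on the host without touching any hardware. The following sketch is only an illustration: the helper names colorize and vga_offset are hypothetical, and the offset calculation assumes the standard 80-column VGA text mode with two bytes per cell, in which 1988 corresponds to row 12, column 34.

```rust
/// Interleaves each character byte with the color byte, like the loop in
/// `rust_main`: even indices hold characters, odd indices hold the color.
pub fn colorize(text: &[u8; 12], color: u8) -> [u8; 24] {
    let mut out = [color; 24];
    for (i, char_byte) in text.iter().enumerate() {
        out[i * 2] = *char_byte;
    }
    out
}

/// Byte offset of a cell relative to the start of the VGA text buffer.
/// Assumption: 80 columns per row and 2 bytes (character + color) per cell.
pub fn vga_offset(row: usize, col: usize) -> usize {
    (row * 80 + col) * 2
}
```

Keeping the buffer layout in small pure functions like these makes it possible to test the logic before the result is written through a raw pointer.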

Stack Overflows

Since we still use the small 64 byte stack from the last post, we must be careful not to overflow it. Normally, Rust tries to avoid stack overflows through guard pages: The page below the stack isn’t mapped, so a stack overflow triggers a page fault (instead of silently overwriting random memory). But we can’t unmap the page below our stack right now since we currently use only a single big page. Fortunately the stack is located just above the page tables. So some important page table entry would probably get overwritten on stack overflow and then a page fault occurs, too.

What’s next?

Until now we write magic bits to some memory location when we want to print something to screen. In the next post we create an abstraction for the VGA text buffer that allows us to print strings in different colors and provides a simple interface.

Entering Long Mode

Tue, 25 Aug 2015 00:00:00 +0000

In the previous post we created a minimal multiboot kernel. It just prints OK and hangs. The goal is to extend it and call 64-bit Rust code. But the CPU is currently in protected mode and allows only 32-bit instructions and up to 4GiB memory. So we need to set up Paging and switch to the 64-bit long mode first.

I tried to explain everything in detail and to keep the code as simple as possible. If you have any questions, suggestions, or issues, please leave a comment or create an issue on GitHub. The source code is available in a repository, too.

Some Tests

To avoid bugs and strange errors on old CPUs we should check if the processor supports every needed feature. If not, the kernel should abort and display an error message. To handle errors easily, we create an error procedure in boot.asm. It prints a rudimentary ERR: X message, where X is an error code letter, and hangs:

; Prints `ERR: ` and the given error code to screen and hangs.
; parameter: error code (in ascii) in al
error:
    mov dword [0xb8000], 0x4f524f45
    mov dword [0xb8004], 0x4f3a4f52
    mov dword [0xb8008], 0x4f204f20
    mov byte  [0xb800a], al
    hlt

At address 0xb8000 begins the so-called VGA text buffer. It’s an array of screen characters that are displayed by the graphics card. A future post will cover the VGA buffer in detail and create a Rust interface to it. But for now, manual bit-fiddling is the easiest option.

A screen character consists of an 8-bit color code and an 8-bit ASCII character. We used the color code 4f for all characters, which means white text on red background. 0x52 is an ASCII R, 0x45 is an E, 0x3a is a :, and 0x20 is a space. The second space is overwritten by the given ASCII byte. Finally the CPU is stopped with the hlt instruction.
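The dword constants in the error procedure can be derived rather than memorized. As a small sketch (the helper name pack2 is hypothetical), two screen characters are packed into one little-endian 32-bit value, with each ASCII byte followed by its color byte:

```rust
/// Packs two screen characters that share one color code into a dword.
/// In memory the bytes appear as: first char, color, second char, color.
/// x86 is little-endian, so the first character ends up in the lowest byte.
pub fn pack2(first: u8, second: u8, color: u8) -> u32 {
    u32::from_le_bytes([first, color, second, color])
}
```

With the color code 0x4f, pack2(b'E', b'R', 0x4f) yields 0x4f524f45, the first constant written by the error procedure. The same helper with the color code 0x2f reproduces the 0x2f4b2f4f value used for the green OK.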

Now we can add some check functions. A function is just a normal label with a ret (return) instruction at the end. The call instruction can be used to call it. Unlike the jmp instruction that just jumps to a memory address, the call instruction will push a return address to the stack (and the ret will jump to this address). But we don’t have a stack yet. The stack pointer in the esp register could point to some important data or even invalid memory. So we need to update it and point it to some valid stack memory.

Creating a Stack

To create stack memory we reserve some bytes at the end of our boot.asm:

...
section .bss
stack_bottom:
    resb 64
stack_top:

A stack doesn’t need to be initialized because we will pop only when we pushed before. So storing the stack memory in the executable file would make it unnecessarily large. By using the .bss section and the resb (reserve byte) command, we just store the length of the uninitialized data (= 64). When loading the executable, GRUB will create the section of required size in memory.

To use the new stack, we update the stack pointer register right after start:

global start

section .text
bits 32
start:
    mov esp, stack_top

    ; print `OK` to screen
    ...

We use stack_top because the stack grows downwards: A push eax subtracts 4 from esp and does a mov [esp], eax afterwards (eax is a general purpose register).
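The push and pop behavior described above can be modeled in a few lines. This is only a host-side sketch under simplifying assumptions: the Stack type is hypothetical, esp is an index into a 64-byte array instead of a real address, and values are stored in little-endian order as on x86.

```rust
/// A toy model of the 64-byte boot stack. `esp` starts at the top (64)
/// and moves towards 0 as values are pushed.
pub struct Stack {
    memory: [u8; 64],
    esp: usize,
}

impl Stack {
    pub fn new() -> Self {
        Stack { memory: [0; 64], esp: 64 }
    }

    /// Like `push eax`: first subtract 4 from esp, then store the value.
    pub fn push(&mut self, value: u32) {
        self.esp -= 4;
        self.memory[self.esp..self.esp + 4].copy_from_slice(&value.to_le_bytes());
    }

    /// Like `pop eax`: first load the value, then add 4 to esp.
    pub fn pop(&mut self) -> u32 {
        let mut bytes = [0u8; 4];
        bytes.copy_from_slice(&self.memory[self.esp..self.esp + 4]);
        self.esp += 4;
        u32::from_le_bytes(bytes)
    }

    pub fn esp(&self) -> usize {
        self.esp
    }
}
```

In this model the 64 bytes are used up after 16 pushes, and the next push fails in the index arithmetic. The real CPU performs no such check and would simply overwrite whatever lies below the stack.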

Now we have a valid stack pointer and are able to call functions. The following check functions are just here for completeness and I won’t explain details. Basically they all work the same: They will check for a feature and jump to error if it’s not available.

Multiboot check

We rely on some Multiboot features in the next posts. To make sure the kernel was really loaded by a Multiboot compliant bootloader, we can check the eax register. According to the Multiboot specification (PDF), the bootloader must write the magic value 0x36d76289 to it before loading a kernel. To verify that we can add a simple function:

check_multiboot:
    cmp eax, 0x36d76289
    jne .no_multiboot
    ret
.no_multiboot:
    mov al, "0"
    jmp error

We use the cmp instruction to compare the value in eax to the magic value. If the values are equal, the cmp instruction sets the zero flag in the FLAGS register. The jne (“jump if not equal”) instruction reads this zero flag and jumps to the given address if it’s not set. Thus we jump to the .no_multiboot label if eax does not contain the magic value.

In .no_multiboot, we use the jmp (“jump”) instruction to jump to our error function. We could just as well use the call instruction, which additionally pushes the return address. But the return address is not needed because error never returns. To pass 0 as error code to the error function, we move it into al before the jump (error will read it from there).

CPUID check

CPUID is a CPU instruction that can be used to get various information about the CPU. But not every processor supports it. CPUID detection is quite laborious, so we just copy a detection function from the OSDev wiki:

check_cpuid:
    ; Check if CPUID is supported by attempting to flip the ID bit (bit 21)
    ; in the FLAGS register. If we can flip it, CPUID is available.

    ; Copy FLAGS in to EAX via stack
    pushfd
    pop eax

    ; Copy to ECX as well for comparing later on
    mov ecx, eax

    ; Flip the ID bit
    xor eax, 1 << 21

    ; Copy EAX to FLAGS via the stack
    push eax
    popfd

    ; Copy FLAGS back to EAX (with the flipped bit if CPUID is supported)
    pushfd
    pop eax

    ; Restore FLAGS from the old version stored in ECX (i.e. flipping the
    ; ID bit back if it was ever flipped).
    push ecx
    popfd

    ; Compare EAX and ECX. If they are equal then that means the bit
    ; wasn't flipped, and CPUID isn't supported.
    cmp eax, ecx
    je .no_cpuid
    ret
.no_cpuid:
    mov al, "1"
    jmp error

Basically, the CPUID instruction is supported if we can flip some bit in the FLAGS register. We can’t operate on the flags register directly, so we need to load it into some general purpose register such as eax first. The only way to do this is to push the FLAGS register on the stack through the pushfd instruction and then pop it into eax. Equally, we write it back through push ecx and popfd. To flip the bit we use the xor instruction to perform an exclusive OR. Finally we compare the two values and jump to .no_cpuid if both are equal (je: “jump if equal”). The .no_cpuid code just jumps to the error function with error code 1.

Don’t worry, you don’t need to understand the details.

Long Mode check

Now we can use CPUID to detect whether long mode can be used. I use code from OSDev again:

check_long_mode:
    ; test if extended processor info is available
    mov eax, 0x80000000    ; implicit argument for cpuid
    cpuid                  ; get highest supported argument
    cmp eax, 0x80000001    ; it needs to be at least 0x80000001
    jb .no_long_mode       ; if it's less, the CPU is too old for long mode

    ; use extended info to test if long mode is available
    mov eax, 0x80000001    ; argument for extended processor info
    cpuid                  ; returns various feature bits in ecx and edx
    test edx, 1 << 29      ; test if the LM-bit is set in the D-register
    jz .no_long_mode       ; If it's not set, there is no long mode
    ret
.no_long_mode:
    mov al, "2"
    jmp error

Like many low-level things, CPUID is a bit strange. Instead of taking a parameter, the cpuid instruction implicitly uses the eax register as argument. To test if long mode is available, we need to call cpuid with 0x80000001 in eax. This loads some information to the ecx and edx registers. Long mode is supported if bit 29 in edx is set. Wikipedia has detailed information.

If you look at the assembly above, you’ll probably notice that we call cpuid twice. The reason is that the CPUID command started with only a few functions and was extended over time. So old processors may not know the 0x80000001 argument at all. To test if they do, we need to invoke cpuid with 0x80000000 in eax first. It returns the highest supported parameter value in eax. If it’s at least 0x80000001, we can test for long mode as described above. Else the CPU is old and doesn’t know what long mode is either. In that case, we directly jump to .no_long_mode through the jb instruction (“jump if below”).
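The decision logic of check_long_mode can be written down as a pure function. The sketch below does not execute cpuid itself; instead, the two values that the instruction would return are passed in as parameters, so the code runs on any host. The function and parameter names are hypothetical.

```rust
/// Models the two-step test of `check_long_mode`.
///
/// * `max_extended_leaf`: the value cpuid returns in eax for argument
///   0x80000000 (the highest supported extended argument).
/// * `extended_edx`: the value cpuid returns in edx for argument 0x80000001.
///   It is only meaningful if that argument is supported.
pub fn long_mode_supported(max_extended_leaf: u32, extended_edx: u32) -> bool {
    // Unsigned comparison, like `jb` in the assembly version.
    if max_extended_leaf < 0x8000_0001 {
        return false; // CPU is too old to report extended processor info
    }
    // The LM bit is bit 29 of edx.
    extended_edx & (1 << 29) != 0
}
```

Note that the order matters in the same way as in the assembly: the edx value is only inspected after the first check has confirmed that the 0x80000001 argument exists.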

Putting it together

We just call these check functions right after start:

global start

section .text
bits 32
start:
    mov esp, stack_top

    call check_multiboot
    call check_cpuid
    call check_long_mode

    ; print `OK` to screen
    ...

When the CPU doesn’t support a needed feature, we get an error message with a unique error code. Now we can start the real work.

Paging

Paging is a memory management scheme that separates virtual and physical memory. The address space is split into equal sized pages and a page table specifies which virtual page points to which physical page. If you never heard of paging, you might want to look at the paging introduction (PDF) of the Three Easy Pieces OS book.

In long mode, x86 uses a page size of 4096 bytes and a 4 level page table that consists of:

the Page-Map Level-4 Table (PML4),
the Page-Directory Pointer Table (PDP),
the Page-Directory Table (PD),
and the Page Table (PT).

As I don’t like these names, I will call them P4, P3, P2, and P1 from now on.

Each page table contains 512 entries and one entry is 8 bytes, so they fit exactly in one page (512*8 = 4096). To translate a virtual address to a physical address the CPU¹ will do the following²:

Get the address of the P4 table from the CR3 register
Use bits 39-47 (9 bits) as an index into P4 (2^9 = 512 = number of entries)
Use the following 9 bits as an index into P3
Use the following 9 bits as an index into P2
Use the following 9 bits as an index into P1
Use the last 12 bits as page offset (2^12 = 4096 = page size)

But what happens to bits 48-63 of the 64-bit virtual address? Well, they can’t be used. The “64-bit” long mode is in fact just a 48-bit mode. The bits 48-63 must be copies of bit 47, so each valid virtual address is still unique. For more information see Wikipedia.
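The index extraction and the rule about bits 48-63 can be expressed as two small functions. This is a sketch for illustration, with hypothetical function names; it only performs the bit arithmetic and does not walk any real page tables.

```rust
/// Splits a virtual address into `(p4, p3, p2, p1, offset)`.
/// Each table index is 9 bits wide, the page offset is 12 bits wide.
pub fn split_virtual_address(addr: u64) -> (usize, usize, usize, usize, usize) {
    let index = |shift: u32| ((addr >> shift) & 0x1ff) as usize;
    (
        index(39), // bits 39-47
        index(30), // bits 30-38
        index(21), // bits 21-29
        index(12), // bits 12-20
        (addr & 0xfff) as usize, // bits 0-11
    )
}

/// Returns true if bits 48-63 are copies of bit 47.
pub fn is_canonical(addr: u64) -> bool {
    let upper = addr >> 47; // bits 47-63, 17 bits in total
    upper == 0 || upper == 0x1_ffff
}
```

For example, the VGA text buffer address 0xb8000 has index 0 in P4, P3, and P2, index 0xb8 in P1, and a page offset of 0.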

An entry in the P4, P3, P2, and P1 tables consists of the page aligned 52-bit physical address of the frame or the next page table and the following bits that can be OR-ed in:

Bit(s)	Name	Meaning
0	present	the page is currently in memory
1	writable	it’s allowed to write to this page
2	user accessible	if not set, only kernel mode code can access this page
3	write through caching	writes go directly to memory
4	disable cache	no cache is used for this page
5	accessed	the CPU sets this bit when this page is used
6	dirty	the CPU sets this bit when a write to this page occurs
7	huge page/null	must be 0 in P1 and P4, creates a 1GiB page in P3, creates a 2MiB page in P2
8	global	page isn’t flushed from caches on address space switch (PGE bit of CR4 register must be set)
9-11	available	can be used freely by the OS
52-62	available	can be used freely by the OS
63	no execute	forbid executing code on this page (the NXE bit in the EFER register must be set)
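As a sketch of how such an entry is composed and taken apart again, the flags from the table can be defined as constants and OR-ed into a page-aligned address. The constant and function names are hypothetical, and only a subset of the flags is shown.

```rust
pub const PTE_PRESENT: u64 = 1 << 0;
pub const PTE_WRITABLE: u64 = 1 << 1;
pub const PTE_USER_ACCESSIBLE: u64 = 1 << 2;
pub const PTE_HUGE_PAGE: u64 = 1 << 7;
pub const PTE_NO_EXECUTE: u64 = 1 << 63;

/// Mask for the physical address part of an entry (bits 12-51).
pub const PTE_ADDRESS_MASK: u64 = 0x000f_ffff_ffff_f000;

/// Builds an entry from a page-aligned physical address and a set of flags.
/// The caller is responsible for passing an aligned address.
pub fn make_entry(physical_address: u64, flags: u64) -> u64 {
    physical_address | flags
}

/// Extracts the physical address of the frame or the next table.
pub fn entry_address(entry: u64) -> u64 {
    entry & PTE_ADDRESS_MASK
}

/// Checks whether all bits of `flags` are set in the entry.
pub fn has_flags(entry: u64, flags: u64) -> bool {
    entry & flags == flags
}
```

Because the address is page aligned, its lowest 12 bits are always zero, which is what leaves room for the flag bits 0-11 in the same 8 bytes.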

Set Up Identity Paging

When we switch to long mode, paging will be activated automatically. The CPU will then try to read the instruction at the following address, but this address is now a virtual address. So we need to do identity mapping, i.e. map a physical address to the same virtual address.

The huge page bit is now very useful to us. It creates a 2MiB (when used in P2) or even a 1GiB page (when used in P3). So we could map the first gigabytes of the kernel with only one P4 and one P3 table by using 1GiB pages. Unfortunately, 1GiB pages are a relatively new feature; for example, Intel introduced them in 2010 with the Westmere architecture. Therefore we will use 2MiB pages instead to make our kernel compatible with older computers, too.

To identity map the first gigabyte of our kernel with 512 2MiB pages, we need one P4, one P3, and one P2 table. Of course we will replace them with finer-grained tables later. But now that we’re stuck with assembly, we choose the easiest way.

We can add these three tables at the beginning³ of the .bss section:

...

section .bss
align 4096
p4_table:
    resb 4096
p3_table:
    resb 4096
p2_table:
    resb 4096
stack_bottom:
    resb 64
stack_top:

The resb command reserves the specified amount of bytes without initializing them, so the 12KiB for the three tables don’t need to be saved in the executable. The align 4096 ensures that the page tables are page aligned.

When GRUB creates the .bss section in memory, it will initialize it to 0. So the p4_table is already valid (it contains 512 non-present entries) but not very useful. To be able to map 2MiB pages, we need to link P4’s first entry to the p3_table and P3’s first entry to the p2_table:

set_up_page_tables:
    ; map first P4 entry to P3 table
    mov eax, p3_table
    or eax, 0b11 ; present + writable
    mov [p4_table], eax

    ; map first P3 entry to P2 table
    mov eax, p2_table
    or eax, 0b11 ; present + writable
    mov [p3_table], eax

    ; TODO map each P2 entry to a huge 2MiB page
    ret

We just set the present and writable bits (0b11 is a binary number) in the aligned P3 table address and move it to the first 4 bytes of the P4 table. Then we do the same to link the first P3 entry to the p2_table.

Now we need to map P2’s first entry to a huge page starting at 0, P2’s second entry to a huge page starting at 2MiB, P2’s third entry to a huge page starting at 4MiB, and so on. It’s time for our first (and only) assembly loop:

set_up_page_tables:
    ...
    ; map each P2 entry to a huge 2MiB page
    mov ecx, 0         ; counter variable

.map_p2_table:
    ; map ecx-th P2 entry to a huge page that starts at address 2MiB*ecx
    mov eax, 0x200000  ; 2MiB
    mul ecx            ; start address of ecx-th page
    or eax, 0b10000011 ; present + writable + huge
    mov [p2_table + ecx * 8], eax ; map ecx-th entry

    inc ecx            ; increase counter
    cmp ecx, 512       ; if counter == 512, the whole P2 table is mapped
    jne .map_p2_table  ; else map the next entry

    ret

Maybe I should first explain how an assembly loop works. We use the ecx register as a counter variable, just like i in a for loop. After mapping the ecx-th entry, we increase ecx by one and jump to .map_p2_table again if it’s still smaller than 512.

To map a P2 entry we first calculate the start address of its page in eax: The ecx-th entry needs to be mapped to ecx * 2MiB. We use the mul operation for that, which multiplies eax with the given register and stores the result in eax. Then we set the present, writable, and huge page bits and write it to the P2 entry. The address of the ecx-th entry in P2 is p2_table + ecx * 8, because each entry is 8 bytes large.

Now the first gigabyte (512 * 2MiB) of our kernel is identity mapped and thus accessible through the same physical and virtual addresses.
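For comparison, the same loop can be sketched in Rust. The function name is hypothetical and the table is an ordinary array on the host rather than the real p2_table, but the arithmetic is the same as in the assembly version.

```rust
/// Fills a P2 table so that entry `i` maps a huge page starting at
/// `i * 2MiB`, with the present, writable, and huge page bits set.
pub fn identity_map_p2() -> [u64; 512] {
    let mut table = [0u64; 512];
    for (i, entry) in table.iter_mut().enumerate() {
        let start_address = (i as u64) * 0x20_0000; // 2MiB * i
        *entry = start_address | 0b1000_0011; // present + writable + huge
    }
    table
}
```

The last entry maps the huge page that starts at 511 * 2MiB and ends just below 1GiB, which is why one completely filled P2 table covers exactly the first gigabyte.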

Enable Paging

To enable paging and enter long mode, we need to do the following:

Write the address of the P4 table to the CR3 register (the CPU will look there, see the paging section)
Enable Physical Address Extension (PAE), since long mode is an extension of it
Set the long mode bit in the EFER register
Enable paging

The assembly function looks like this (some boring bit-moving to various registers):

enable_paging:
    ; load P4 to cr3 register (cpu uses this to access the P4 table)
    mov eax, p4_table
    mov cr3, eax

    ; enable PAE-flag in cr4 (Physical Address Extension)
    mov eax, cr4
    or eax, 1 << 5
    mov cr4, eax

    ; set the long mode bit in the EFER MSR (model specific register)
    mov ecx, 0xC0000080
    rdmsr
    or eax, 1 << 8
    wrmsr

    ; enable paging in the cr0 register
    mov eax, cr0
    or eax, 1 << 31
    mov cr0, eax

    ret

The or eax, 1 << X is a common pattern. It sets the bit X in the eax register (<< is a left shift). Through rdmsr and wrmsr it’s possible to read/write to the so-called model specific registers at address ecx (in this case ecx points to the EFER register).
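The bit positions used in enable_paging can be collected in one place. The following sketch only models the or eax, 1 << X pattern on plain integers; the constant names are hypothetical, and nothing here touches real control registers.

```rust
/// Bit positions taken from the assembly above.
pub const CR4_PAE_BIT: u32 = 5; // Physical Address Extension
pub const EFER_LONG_MODE_BIT: u32 = 8; // long mode enable
pub const CR0_PAGING_BIT: u32 = 31; // paging enable

/// Address of the EFER model specific register, as loaded into ecx.
pub const EFER_MSR: u32 = 0xC000_0080;

/// Equivalent of `or eax, 1 << bit`: sets one bit, keeps all others.
pub fn set_bit(value: u32, bit: u32) -> u32 {
    value | (1 << bit)
}
```

Reading the register first and OR-ing the new bit in, as the assembly does, is what preserves the bits that are already set.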

Finally we need to call our new functions in start:

...
start:
    mov esp, stack_top

    call check_multiboot
    call check_cpuid
    call check_long_mode

    call set_up_page_tables ; new
    call enable_paging     ; new

    ; print `OK` to screen
    mov dword [0xb8000], 0x2f4b2f4f
    hlt
...

To test it we execute make run. If the green OK is still printed, we have successfully enabled paging!

The Global Descriptor Table

After enabling Paging, the processor is in long mode. So we can use 64-bit instructions now, right? Wrong. The processor is still in a 32-bit compatibility submode. To actually execute 64-bit code, we need to set up a new Global Descriptor Table. The Global Descriptor Table (GDT) was used for Segmentation in old operating systems. I won’t explain Segmentation but the Three Easy Pieces OS book has a good introduction (PDF) again.

Today almost everyone uses Paging instead of Segmentation (and so do we). But on x86, a GDT is always required, even when you’re not using Segmentation. GRUB has set up a valid 32-bit GDT for us but now we need to switch to a long mode GDT.

A GDT always starts with a 0-entry and contains an arbitrary number of segment entries afterwards. A 64-bit entry has the following format:

Bit(s)	Name	Meaning
0-41	ignored	ignored in 64-bit mode
42	conforming	the current privilege level can be higher than the specified level for code segments (else it must match exactly)
43	executable	if set, it's a code segment, else it's a data segment
44	descriptor type	should be 1 for code and data segments
45-46	privilege	the ring level: 0 for kernel, 3 for user
47	present	must be 1 for valid selectors
48-52	ignored	ignored in 64-bit mode
53	64-bit	should be set for 64-bit code segments
54	32-bit	must be 0 for 64-bit segments
55-63	ignored	ignored in 64-bit mode

We need one code segment; a data segment is not necessary in 64-bit mode. Code segments have the following bits set: descriptor type, present, executable and the 64-bit flag. Translated to assembly, the long mode GDT looks like this:

section .rodata
gdt64:
    dq 0 ; zero entry
    dq (1<<43) | (1<<44) | (1<<47) | (1<<53) ; code segment

We chose the .rodata section here because it's initialized read-only data. The dq command stands for define quad and outputs a 64-bit constant (similar to dw and dd). The (1<<43) is a bit shift that sets bit 43.
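To double-check the value that the dq line emits, the same flags can be combined in Rust. The constant and function names below are assumptions made for this sketch; only the bit positions come from the table above.

```rust
// Flag bits of a 64-bit GDT entry, taken from the table above.
const EXECUTABLE: u64 = 1 << 43;
const DESCRIPTOR_TYPE: u64 = 1 << 44;
const PRESENT: u64 = 1 << 47;
const LONG_MODE: u64 = 1 << 53;

/// The same value that `dq (1<<43) | (1<<44) | (1<<47) | (1<<53)` emits.
const CODE_SEGMENT: u64 = EXECUTABLE | DESCRIPTOR_TYPE | PRESENT | LONG_MODE;

/// Extracts the privilege level (bits 45-46) of an entry.
fn privilege_level(entry: u64) -> u8 {
    ((entry >> 45) & 0b11) as u8
}
```

The combined value is 0x0020_9800_0000_0000. Since bits 45-46 are left at zero, privilege_level(CODE_SEGMENT) returns 0, i.e. a kernel (ring 0) segment.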

Loading the GDT

To load our new 64-bit GDT, we have to tell the CPU its address and length. We do this by passing the memory location of a special pointer structure to the lgdt (load GDT) instruction. The pointer structure looks like this:

gdt64:
    dq 0 ; zero entry
    dq (1<<43) | (1<<44) | (1<<47) | (1<<53) ; code segment
.pointer:
    dw $ - gdt64 - 1
    dq gdt64

The first 2 bytes specify the (GDT length - 1). The $ is a special symbol that is replaced with the current address (it's equal to .pointer in our case). The following 8 bytes specify the GDT address. Labels that start with a dot (such as .pointer) are sub-labels of the last label without a dot. To access them, they must be prefixed with the parent label (e.g., gdt64.pointer).
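The layout of the pointer structure can be sketched in Rust as a 10-byte array. The function name gdt_pointer and the base address used in the example are hypothetical; in the real kernel the assembler and linker fill in these values.

```rust
/// Builds the 10-byte structure that `lgdt` reads: a 16-bit limit
/// (table length in bytes minus one) followed by the 64-bit table address,
/// both in little-endian byte order.
fn gdt_pointer(base: u64, entries: &[u64]) -> [u8; 10] {
    // A GDT always contains at least the zero entry.
    assert!(!entries.is_empty());
    let limit = (entries.len() * 8 - 1) as u16;

    let mut pointer = [0u8; 10];
    pointer[..2].copy_from_slice(&limit.to_le_bytes());
    pointer[2..].copy_from_slice(&base.to_le_bytes());
    pointer
}
```

For our two-entry GDT the limit is 2 * 8 - 1 = 15, which matches what $ - gdt64 - 1 evaluates to at the .pointer label.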

Now we can load the GDT in start:

start:
    ...
    call enable_paging

    ; load the 64-bit GDT
    lgdt [gdt64.pointer]

    ; print `OK` to screen
    ...

If you still see the green OK, everything went fine and the new GDT is loaded. But we still can't execute 64-bit code: The code selector register cs still has the values from the old GDT. To update it, we need to load it with the GDT offset (in bytes) of the desired segment. In our case the code segment starts at byte 8 of the GDT, but we don't want to hardcode that 8 (in case we modify our GDT later). Instead, we add a .code label to our GDT that calculates the offset directly from the GDT:

section .rodata
gdt64:
    dq 0 ; zero entry
.code: equ $ - gdt64 ; new
    dq (1<<43) | (1<<44) | (1<<47) | (1<<53) ; code segment
.pointer:
    ...

We can't just use a normal label here, since we need the table offset. We calculate this offset using the current address $ and set the label to this value using equ. Now we can use gdt64.code instead of 8 and this label will still work if we modify the GDT.
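The offset calculation itself is simple arithmetic. As a small sketch (the names ENTRY_SIZE and gdt_offset are made up for illustration), the byte offset of an entry only depends on how many 8-byte entries come before it:

```rust
/// Size of one GDT entry in bytes (each entry is a `dq`, i.e. 64 bits).
const ENTRY_SIZE: u16 = 8;

/// Byte offset of the entry at `index`, counted from the start of the table.
/// This is the value that `.code: equ $ - gdt64` computes for the code segment.
fn gdt_offset(index: u16) -> u16 {
    index * ENTRY_SIZE
}
```

The code segment is entry 1, so gdt_offset(1) is 8. If another entry were inserted before it, the offset would become 16, which is why the label is preferable to a hardcoded 8.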

In order to finally enter the true 64-bit mode, we need to load cs with gdt64.code. But we can't do it through mov. The only way to reload the code selector is a far jump or a far return. These instructions work like a normal jump/return but change the code selector. We use a far jump to a long mode label:

global start
extern long_mode_start
...
start:
    ...
    lgdt [gdt64.pointer]

    jmp gdt64.code:long_mode_start
...

The actual long_mode_start label is defined as extern, so it's part of another file. The jmp gdt64.code:long_mode_start is the mentioned far jump.

I put the 64-bit code into a new file to separate it from the 32-bit code, so that we can't call the (now invalid) 32-bit code accidentally. The new file (I named it long_mode_init.asm) looks like this:

global long_mode_start

section .text
bits 64
long_mode_start:
    ; print `OKAY` to screen
    mov rax, 0x2f592f412f4b2f4f
    mov qword [0xb8000], rax
    hlt

You should see a green OKAY on the screen. Some notes on this last step:

As the CPU expects 64-bit instructions now, we use bits 64
We can now use the extended registers. Instead of the 32-bit eax, ebx, etc. we now have the 64-bit rax, rbx, ...
We can write these 64-bit registers directly to memory using mov qword (quad word)
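The constant 0x2f592f412f4b2f4f is not magic. In VGA text mode each screen cell takes two bytes, the ASCII character followed by a color byte, and x86 stores values in little-endian order. The following Rust sketch (pack_vga is a hypothetical helper, not part of the kernel) rebuilds the constant from the text:

```rust
/// Packs four ASCII characters into one 64-bit value for the VGA text buffer.
/// Each screen cell is two bytes: the character, then its color byte.
fn pack_vga(text: &[u8; 4], color: u8) -> u64 {
    let mut bytes = [0u8; 8];
    for (i, &ch) in text.iter().enumerate() {
        bytes[2 * i] = ch; // character byte
        bytes[2 * i + 1] = color; // color byte (0x2f: white on green)
    }
    // x86 is little-endian, so the first cell ends up in the lowest bytes.
    u64::from_le_bytes(bytes)
}
```

Calling pack_vga(b"OKAY", 0x2f) gives exactly the value that we move into rax above.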

Congratulations! You have successfully wrestled through this CPU configuration and compatibility mode mess :).

One Last Thing

Above, we reloaded the code segment register cs with the new GDT offset. However, the data segment registers ss, ds, es, fs, and gs still contain the data segment offsets of the old GDT. This isn't necessarily bad, since they're ignored by almost all instructions in 64-bit mode. However, there are a few instructions that expect a valid data segment descriptor or the null descriptor in those registers. An example is the iretq instruction that we'll need in the Returning from Exceptions post.

To avoid future problems, we reload all data segment registers with null:

long_mode_start:
    ; load 0 into all data segment registers
    mov ax, 0
    mov ss, ax
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax

    ; print `OKAY` to screen
    ...

What's next?

It's time to finally leave assembly behind and switch to Rust. Rust is a systems language without garbage collection that guarantees memory safety. Through a real type system and many abstractions it feels like a high-level language but can still be low-level enough for OS development. The next post describes the Rust setup.

Footnotes

In the x86 architecture, the page tables are hardware walked, so the CPU will look at the table on its own when it needs a translation. Other architectures, for example MIPS, just throw an exception and let the OS translate the virtual address.

Image source: Wikipedia, with modified font size, page table naming, and removed sign extended bits. The modified file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Page tables need to be page-aligned as the bits 0-11 are used for flags. By putting these tables at the beginning of .bss, the linker can just page align the whole section and we don't have unused padding bytes in between.

A minimal Multiboot Kernel

Tue, 18 Aug 2015 00:00:00 +0000

This post explains how to create a minimal x86 operating system kernel using the Multiboot standard. In fact, it will just boot and print OK to the screen. In subsequent blog posts we will extend it using the Rust programming language.

I tried to explain everything in detail and to keep the code as simple as possible. If you have any questions, suggestions or other issues, please leave a comment or create an issue on GitHub. The source code is available in a repository, too.

Note that this tutorial is written mainly for Linux. For some known problems on OS X see the comment section and this issue. If you want to use a virtual Linux machine, you can find instructions and a Vagrantfile in Ashley Williams's x86-kernel repository.

Overview

When you turn on a computer, it loads the BIOS from some special flash memory. The BIOS runs self test and initialization routines of the hardware, then it looks for bootable devices. If it finds one, the control is transferred to its bootloader, which is a small portion of executable code stored at the device's beginning. The bootloader has to determine the location of the kernel image on the device and load it into memory. It also needs to switch the CPU to the so-called protected mode because x86 CPUs start in the very limited real mode by default (to be compatible with programs from 1978).

We won't write a bootloader because that would be a complex project on its own (if you really want to do it, check out Rolling Your Own Bootloader). Instead we will use one of the many well-tested bootloaders out there to boot our kernel from a CD-ROM. But which one?

Multiboot

Fortunately there is a bootloader standard: the Multiboot Specification. Our kernel just needs to indicate that it supports Multiboot and every Multiboot-compliant bootloader can boot it. We will use the Multiboot 2 specification (PDF) together with the well-known GRUB 2 bootloader.

To indicate our Multiboot 2 support to the bootloader, our kernel must start with a Multiboot Header, which has the following format:

Field	Type	Value
magic number	u32	`0xE85250D6`
architecture	u32	`0` for i386, `4` for MIPS
header length	u32	total header size, including tags
checksum	u32	`-(magic + architecture + header_length)`
tags	variable
end tag	(u16, u16, u32)	`(0, 0, 8)`

Converted to an x86 assembly file it looks like this (Intel syntax):

section .multiboot_header
header_start:
    dd 0xe85250d6                ; magic number (multiboot 2)
    dd 0                         ; architecture 0 (protected mode i386)
    dd header_end - header_start ; header length
    ; checksum
    dd 0x100000000 - (0xe85250d6 + 0 + (header_end - header_start))

    ; insert optional multiboot tags here

    ; required end tag
    dw 0    ; type
    dw 0    ; flags
    dd 8    ; size
header_end:

If you don't know x86 assembly, here is a quick guide:

the header will be written to a section named .multiboot_header (we need this later)
header_start and header_end are labels that mark a memory location, we use them to calculate the header length easily
dd stands for define double (32bit) and dw stands for define word (16bit). They just output the specified 32bit/16bit constant.
the additional 0x100000000 in the checksum calculation is a small hack¹ to avoid a compiler warning

We can already assemble this file (which I called multiboot_header.asm) using nasm. It produces a flat binary by default, so the resulting file just contains our 24 bytes (in little endian if you work on an x86 machine):

> nasm multiboot_header.asm
> hexdump -x multiboot_header
0000000    50d6    e852    0000    0000    0018    0000    af12    17ad
0000010    0000    0000    0008    0000
0000018
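The checksum in the hexdump (af12 17ad, i.e. 0x17adaf12 in little endian) can be reproduced with wrapping 32-bit arithmetic. The following Rust sketch uses made-up names (MAGIC, ARCH_I386, checksum, fields_sum_to_zero) and assumes a header length of 24 bytes, as in our file:

```rust
const MAGIC: u32 = 0xe85250d6; // Multiboot 2 magic number
const ARCH_I386: u32 = 0; // architecture field for protected mode i386

/// Computes `-(magic + architecture + header_length)` truncated to 32 bits.
fn checksum(header_length: u32) -> u32 {
    let sum = MAGIC.wrapping_add(ARCH_I386).wrapping_add(header_length);
    0u32.wrapping_sub(sum)
}

/// Checks that magic, architecture, length, and checksum add up to zero
/// when the sum is truncated to 32 bits.
fn fields_sum_to_zero(header_length: u32) -> bool {
    MAGIC
        .wrapping_add(ARCH_I386)
        .wrapping_add(header_length)
        .wrapping_add(checksum(header_length))
        == 0
}
```

checksum(24) returns 0x17adaf12, the same value that the 0x100000000 - (...) expression in the assembly file produces after truncation to 32 bits.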

The Boot Code

To boot our kernel, we must add some code that the bootloader can call. Let's create a file named boot.asm:

global start

section .text
bits 32
start:
    ; print `OK` to screen
    mov dword [0xb8000], 0x2f4b2f4f
    hlt

There are some new commands:

global exports a label (makes it public). As start will be the entry point of our kernel, it needs to be public.
the .text section is the default section for executable code
bits 32 specifies that the following lines are 32-bit instructions. It's needed because the CPU is still in Protected mode when GRUB starts our kernel. When we switch to Long mode in the next post we can use bits 64 (64-bit instructions).
the mov dword instruction moves the 32bit constant 0x2f4b2f4f to the memory at address 0xb8000 (it prints OK to the screen, an explanation follows in the next posts)
hlt is the halt instruction and causes the CPU to stop
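As a small preview of that explanation, the constant can be taken apart in Rust. In VGA text mode every screen cell consists of a character byte followed by an attribute byte, whose low four bits select the foreground color and whose high bits select the background. The Cell type and decode_cells function below are hypothetical names for this sketch only.

```rust
/// One character cell of the VGA text buffer (the type name is hypothetical).
#[derive(Debug, PartialEq)]
struct Cell {
    character: char,
    foreground: u8, // low 4 bits of the attribute byte
    background: u8, // high 4 bits of the attribute byte
}

/// Splits a 32-bit value, as written by `mov dword`, into two screen cells.
fn decode_cells(value: u32) -> [Cell; 2] {
    // x86 is little-endian: the lowest byte is stored at the lowest address.
    let b = value.to_le_bytes();
    let cell = |ch: u8, attr: u8| Cell {
        character: ch as char,
        foreground: attr & 0x0f,
        background: attr >> 4,
    };
    [cell(b[0], b[1]), cell(b[2], b[3])]
}
```

For 0x2f4b2f4f this yields the characters O and K, each with foreground 0xf (white) and background 0x2 (green).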

Through assembling, viewing and disassembling we can see the CPU Opcodes in action:

> nasm boot.asm
> hexdump -x boot
0000000    05c7    8000    000b    2f4b    2f4f    00f4
000000b
> ndisasm -b 32 boot
00000000  C70500800B004B2F  mov dword [dword 0xb8000],0x2f4b2f4f
         -4F2F
0000000A  F4                hlt

Building the Executable

To boot our executable later through GRUB, it should be an ELF executable. So we want nasm to create ELF object files instead of plain binaries. To do that, we simply pass the -f elf64 argument to it.

To create the ELF executable, we need to link the object files together. We use a custom linker script named linker.ld:

ENTRY(start)

SECTIONS {
    . = 1M;

    .boot :
    {
        /* ensure that the multiboot header is at the beginning */
        *(.multiboot_header)
    }

    .text :
    {
        *(.text)
    }
}

Let's translate it:

start is the entry point, the bootloader will jump to it after loading the kernel
. = 1M; sets the load address of the first section to 1 MiB, which is a conventional place to load a kernel²
the executable will have two sections: .boot at the beginning and .text afterwards
the .text output section contains all input sections named .text
Sections named .multiboot_header are added to the first output section (.boot) to ensure they are at the beginning of the executable. This is necessary because GRUB expects to find the Multiboot header very early in the file.

So let's create the ELF object files and link them using our new linker script:

> nasm -f elf64 multiboot_header.asm
> nasm -f elf64 boot.asm
> ld -n -o kernel.bin -T linker.ld multiboot_header.o boot.o

It's important to pass the -n (or --nmagic) flag to the linker, which disables the automatic section alignment in the executable. Otherwise the linker may page align the .boot section in the executable file. If that happens, GRUB isn't able to find the Multiboot header because it isn't at the beginning anymore.

We can use objdump to print the sections of the generated executable and verify that the .boot section has a low file offset:

> objdump -h kernel.bin
kernel.bin:     file format elf64-x86-64

Sections:
Idx Name      Size      VMA               LMA               File off  Algn
  0 .boot     00000018  0000000000100000  0000000000100000  00000080  2**0
              CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .text     0000000b  0000000000100020  0000000000100020  000000a0  2**4
              CONTENTS, ALLOC, LOAD, READONLY, CODE
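The .text section does not start directly after the 0x18 bytes of .boot because the listing shows an alignment of 2**4 = 16 for it. A minimal Rust sketch of the rounding (align_up is a hypothetical helper, and the linker follows more rules than this single formula) reproduces the numbers from the output:

```rust
/// Rounds `addr` up to the next multiple of `align`.
/// `align` must be a power of two, like the `2**4` in the objdump output.
fn align_up(addr: u64, align: u64) -> u64 {
    assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}
```

With these inputs, align_up(0x100000 + 0x18, 16) gives 0x100020 for the address, and align_up(0x80 + 0x18, 16) gives 0xa0 for the file offset.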

Note: The ld and objdump commands are platform specific. If you're not working on x86_64 architecture, you will need to cross compile binutils. Then use x86_64-elf-ld and x86_64-elf-objdump instead of ld and objdump.

Creating the ISO

All PC BIOSes know how to boot from a CD-ROM, so we want to create a bootable CD-ROM image, containing our kernel and the GRUB bootloader's files, in a single file called an ISO. Make the following directory structure and copy the kernel.bin to the right place:

isofiles
└── boot
    ├── grub
    │   └── grub.cfg
    └── kernel.bin

The grub.cfg specifies the file name of our kernel and its Multiboot 2 compliance. It looks like this:

set timeout=0
set default=0

menuentry "my os" {
    multiboot2 /boot/kernel.bin
    boot
}

Now we can create a bootable image using the command:

grub-mkrescue -o os.iso isofiles

Note: grub-mkrescue causes problems on some platforms. If it does not work for you, try the following steps:

try to run it with --verbose
make sure xorriso is installed (xorriso or libisoburn package)
If you're using an EFI-system, grub-mkrescue tries to create an EFI image by default. You can either pass -d /usr/lib/grub/i386-pc to avoid EFI or install the mtools package to get a working EFI image
on some systems the command is named grub2-mkrescue

Booting

Now it's time to boot our OS. We will use QEMU:

qemu-system-x86_64 -cdrom os.iso

Notice the green OK in the upper left corner. If it does not work for you, take a look at the comment section.

Let's summarize what happens:

the BIOS loads the bootloader (GRUB) from the virtual CD-ROM (the ISO)
the bootloader reads the kernel executable and finds the Multiboot header
it copies the .boot and .text sections to memory (to addresses 0x100000 and 0x100020)
it jumps to the entry point (0x100020, you can obtain it through objdump -f)
our kernel prints the green OK and stops the CPU

You can test it on real hardware, too. Just burn the ISO to a disk or USB stick and boot from it.

Build Automation

Right now we need to execute 4 commands in the right order every time we change a file. That's bad. So let's automate the build using a Makefile. But first we should create a clean directory structure for our source files to separate the architecture specific files:

…
├── Makefile
└── src
    └── arch
        └── x86_64
            ├── multiboot_header.asm
            ├── boot.asm
            ├── linker.ld
            └── grub.cfg

The Makefile looks like this (indented with tabs instead of spaces):

arch ?= x86_64
kernel := build/kernel-$(arch).bin
iso := build/os-$(arch).iso

linker_script := src/arch/$(arch)/linker.ld
grub_cfg := src/arch/$(arch)/grub.cfg
assembly_source_files := $(wildcard src/arch/$(arch)/*.asm)
assembly_object_files := $(patsubst src/arch/$(arch)/%.asm, \
	build/arch/$(arch)/%.o, $(assembly_source_files))

.PHONY: all clean run iso

all: $(kernel)

clean:
	@rm -r build

run: $(iso)
	@qemu-system-x86_64 -cdrom $(iso)

iso: $(iso)

$(iso): $(kernel) $(grub_cfg)
	@mkdir -p build/isofiles/boot/grub
	@cp $(kernel) build/isofiles/boot/kernel.bin
	@cp $(grub_cfg) build/isofiles/boot/grub
	@grub-mkrescue -o $(iso) build/isofiles 2> /dev/null
	@rm -r build/isofiles

$(kernel): $(assembly_object_files) $(linker_script)
	@ld -n -T $(linker_script) -o $(kernel) $(assembly_object_files)

# compile assembly files
build/arch/$(arch)/%.o: src/arch/$(arch)/%.asm
	@mkdir -p $(shell dirname $@)
	@nasm -felf64 $< -o $@

Some comments (see the Makefile tutorial if you don't know make):

the $(wildcard src/arch/$(arch)/*.asm) chooses all assembly files in the src/arch/$(arch) directory, so you don't have to update the Makefile when you add a file
the patsubst operation for assembly_object_files just translates src/arch/$(arch)/XYZ.asm to build/arch/$(arch)/XYZ.o
the $< and $@ in the assembly target are automatic variables
if you're using cross-compiled binutils just replace ld with x86_64-elf-ld
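The path translation done by patsubst can be mimicked in a few lines of Rust. This is only a sketch for a single path, and object_path is a made-up name. Note one deliberate difference: make leaves non-matching words unchanged, while this function returns None for them.

```rust
/// Mimics `$(patsubst src/arch/$(arch)/%.asm, build/arch/$(arch)/%.o, ...)`
/// for a single path. The `%` stem is whatever lies between prefix and suffix.
/// Returns `None` if the path does not match the pattern.
fn object_path(arch: &str, source: &str) -> Option<String> {
    let prefix = format!("src/arch/{}/", arch);
    let stem = source
        .strip_prefix(prefix.as_str())?
        .strip_suffix(".asm")?;
    Some(format!("build/arch/{}/{}.o", arch, stem))
}
```

For example, object_path("x86_64", "src/arch/x86_64/boot.asm") results in build/arch/x86_64/boot.o, which is the target that the pattern rule at the end of the Makefile builds.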

Now we can invoke make and all updated assembly files are compiled and linked. The make iso command also creates the ISO image and make run will additionally start QEMU.

What's next?

In the next post we will create a page table and do some CPU configuration to switch to the 64-bit long mode.

Footnotes

The formula from the table, -(magic + architecture + header_length), creates a negative value that doesn't fit into 32bit. By subtracting from 0x100000000 (= 2^32) instead, we keep the value positive without changing its truncated value. Without the additional sign bit(s) the result fits into 32bit and the compiler is happy :).

We don't want to load the kernel to e.g. 0x0 because there are many special memory areas below the 1MB mark (for example the so-called VGA buffer at 0xb8000, that we use to print OK to the screen).
