Faster virtual machines: Speeding up programming language execution

Date: 2023-01-15
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00015-fast-interpreters.md

In this post, I hope to explore how interpreters are often implemented, what a "virtual machine" means in this context, and how to make them faster.

Note: This post will contain a lot of C source code. Most of it is fairly simple C which should be easy to follow, but some familiarity with the C language is suggested.

What is a (virtual) machine?

For our purposes, a "machine" is anything which can read some sequence of instructions ("code") and act upon them. A Turing machine reads instructions from the cells of a tape and changes its state accordingly. Your CPU is a machine which reads instructions in the form of binary data representing x86 or ARM machine code and modifies its state accordingly. A LISP machine reads instructions in the form of LISP code and modifies its state accordingly.

Your computer's CPU is a physical machine, with all the logic required to read and execute its native machine code implemented as circuitry in hardware. But we can also implement a "machine" to read and execute instructions in software. A software implementation of a machine is what we call a virtual machine. QEMU is an example of a project which implements common CPU instruction sets in software, so we can take native machine code for ARM64 and run it in a virtual ARM64 machine regardless of what architecture our physical CPU implements.

But we don't have to limit ourselves to virtual machines which emulate real CPU architectures. In the world of programming languages, a "virtual machine" is usually used to mean something which takes some language-specific code and executes it.

What is bytecode?

Many programming languages are separated into roughly two parts: the front-end, which parses your textual source code and emits some form of machine-readable code, and the virtual machine, which executes the instructions in this machine-readable code. This machine-readable code that's intended to be executed by a virtual machine is usually called "bytecode".

You're probably familiar with this from Java, where the Java compiler produces .class files containing Java bytecode, and the Java Virtual Machine (JVM) executes these .class files. (You may be more familiar with .jar files, which are essentially zip files with a bunch of .class files.)

Python is also an example of a programming language with these two parts. The only difference between Python's approach and Java's approach is that the Python compiler and the Python virtual machine are part of the same executable, and you're not meant to distribute the Python bytecode. But Python also generates bytecode files; the __pycache__ directories and .pyc files Python generates contain Python bytecode. This lets Python avoid compiling your source code to bytecode every time you run a Python script, speeding up startup times.

So what does this "bytecode" look like? Well, it usually has a concept of an "operation" (represented by some numeric "op-code") and "operands" (some fixed numeric argument which somehow modifies the behavior of the instruction). But other than that, it varies wildly between languages.

Note: Sometimes "bytecode" is used interchangeably with any form of code intended to be executed by a virtual machine. Other times, it's used to mean specifically code where an instruction is always encoded using exactly one byte for an "op-code".

Our own bytecode

In this post, we will invent our own bytecode with these characteristics:

  • Each operation is a 1-byte "op-code", sometimes followed by a 4-byte operand that's interpreted as a 32-bit signed integer (little endian).
  • The machine has a stack, where each value on the stack is a 32-bit signed integer.
  • In the machine's model of the stack, stackptr[0] represents the value at the top of the stack, stackptr[1] the one before that, etc.

This is the set of instructions our bytecode language will have:

00000000: CONSTANT c:
Push 'c' onto the stack.
> push(c);

00000001: ADD:
Pop two values from the stack, push their
sum onto the stack.
> b = pop();
> a = pop();
> push(a + b);

00000010: PRINT:
Pop a value from the stack and print it.
> print(pop());

00000011: INPUT:
Read a value from some external input,
and push it onto the stack.
> push(input())

00000100: DISCARD:
Pop a value from the stack and discard it.
> pop();

00000101: GET offset:
Find the value at the 'offset' from the
top of the stack and push it onto the stack.
> val = stackptr[offset];
> push(val);

00000110: SET offset:
Pop a value from the stack, replace the value
at the 'offset' with the popped value.
> val = pop();
> stackptr[offset] = val;

00000111: CMP:
Compare two values on the stack, push -1 if
the first is smaller than the second, 1 if the
first is bigger than the second, and 0 otherwise.
> b = pop();
> a = pop();
> if (a > b) push(1);
> else if (a < b) push(-1);
> else push(0);

00001000: JGT offset:
Pop the stack, jump relative to the given 'offset'
if the popped value is positive.
> val = pop();
> if (val > 0) instrptr += offset;

00001001: HALT:
Stop execution

I'm sure you can imagine expanding this instruction set with more instructions. Maybe a SUB instruction, maybe more jump instructions, maybe more I/O. If you want, you can play along with this post and expand my code to implement your own custom instructions!
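For example, a hypothetical SUB instruction (it would simply get the next free op-code) could be specified in the same style as the listing above:

SUB:
Pop two values from the stack, push their
difference onto the stack.
> b = pop();
> a = pop();
> push(a - b);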

Throughout this blog post, I will be using an example program which multiplies two numbers together. Here's the program in pseudocode:

A = input()
B = input()

Accumulator = 0
do {
	Accumulator = Accumulator + A
	B = B - 1
} while (B > 0)

print(Accumulator)

(This program assumes B is greater than 0 for simplicity.)

Here's that program implemented in our bytecode language:

INPUT // A = input()
INPUT // B = input()

CONSTANT 0 // Accumulator = 0

// Loop body:

// Accumulator + A
GET 0
GET 3
ADD
// Accumulator = <result>
SET 0

// B - 1
GET 1
CONSTANT -1
ADD
// B = <result>
SET 1

// B CMP 0
GET 1
CONSTANT 0
CMP
// Jump to start of loop body if <result> > 0
// We get the value -43 by counting the bytes from
// the first instruction in the loop body.
// Operations are 1 byte, operands are 4 bytes.
JGT -43

// Accumulator
GET 0
// print(<result>)
PRINT

HALT

You should take some moments to convince yourself that the bytecode truly reflects the pseudocode. Maybe you can even imagine how you could write a compiler which takes a syntax tree reflecting the source code and produces bytecode? (Hint: Every expression and sub-expression leaves exactly one thing on the stack.)
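As a rough sketch of that idea, here's a minimal compiler for a hypothetical expression AST with just integer constants and additions. The AST type and emit helpers are made up for illustration; they aren't part of the bytecode spec above:

#include <stdint.h>

enum expr_kind { EXPR_CONSTANT, EXPR_ADD };

struct expr {
	enum expr_kind kind;
	int32_t constant;        // used when kind == EXPR_CONSTANT
	struct expr *lhs, *rhs;  // used when kind == EXPR_ADD
};

// Append one byte of bytecode to the output buffer.
static void emit_byte(unsigned char **out, unsigned char byte) {
	*(*out)++ = byte;
}

// Append a 32-bit operand, little-endian.
static void emit_operand(unsigned char **out, int32_t val) {
	uint32_t v = (uint32_t)val; // two's complement bit pattern
	emit_byte(out, v & 0xff);
	emit_byte(out, (v >> 8) & 0xff);
	emit_byte(out, (v >> 16) & 0xff);
	emit_byte(out, (v >> 24) & 0xff);
}

// Compile an expression. The emitted code always leaves exactly one
// value (the expression's value) on top of the stack.
static void compile_expr(struct expr *e, unsigned char **out) {
	if (e->kind == EXPR_CONSTANT) {
		emit_byte(out, 0 /* CONSTANT */);
		emit_operand(out, e->constant);
	} else {
		compile_expr(e->lhs, out);   // leaves the left value on the stack
		compile_expr(e->rhs, out);   // leaves the right value on the stack
		emit_byte(out, 1 /* ADD */); // pops both, pushes the sum
	}
}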

Implementing a bytecode interpreter

A bytecode interpreter can be basically just a loop with a switch statement. Here's my shot at implementing one in C for the bytecode language we invented:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum op {
	OP_CONSTANT, OP_ADD, OP_PRINT, OP_INPUT, OP_DISCARD,
	OP_GET, OP_SET, OP_CMP, OP_JGT, OP_HALT,
};

void interpret(unsigned char *bytecode, int32_t *input) {
	// Create a "stack" of 128 integers,
	// and a "stack pointer" which always points to the first free stack slot.
	// That means the value at the top of the stack is always 'stackptr[-1]'.
	int32_t stack[128];
	int32_t *stackptr = stack;

	// Create an instruction pointer which keeps track of where in the bytecode we are.
	unsigned char *instrptr = bytecode;

	// Some utility macros, to pop a value from the stack, push a value to the stack,
	// peek into the stack at an offset, and interpret the next 4 bytes as a 32-bit
	// signed integer to read an instruction's operand.
	#define POP() (*(--stackptr))
	#define PUSH(val) (*(stackptr++) = (val))
	#define STACK(offset) (*(stackptr - 1 - offset))
	#define OPERAND() ( \
		((int32_t)instrptr[1] << 0) | \
		((int32_t)instrptr[2] << 8) | \
		((int32_t)instrptr[3] << 16) | \
		((int32_t)instrptr[4] << 24))

	int32_t a, b;

	// This is where we just run one instruction at a time, using a switch statement
	// to figure out what to do in response to each op-code.
	while (1) {
		enum op op = (enum op)*instrptr;
		switch (op) {
		case OP_CONSTANT:
			PUSH(OPERAND());
			// We move past 5 bytes, 1 for the op-code, 4 for the 32-bit operand
			instrptr += 5; break;
		case OP_ADD:
			b = POP();
			a = POP();
			PUSH(a + b);
			// This instruction doesn't have an operand, so we move only 1 byte
			instrptr += 1; break;
		case OP_PRINT:
			a = POP();
			printf("%i\n", (int)a);
			instrptr += 1; break;
		case OP_INPUT:
			PUSH(*(input++));
			instrptr += 1; break;
		case OP_DISCARD:
			POP();
			instrptr += 1; break;
		case OP_GET:
			a = STACK(OPERAND());
			PUSH(a);
			instrptr += 5; break;
		case OP_SET:
			a = POP();
			STACK(OPERAND()) = a;
			instrptr += 5; break;
		case OP_CMP:
			b = POP();
			a = POP();
			if (a > b) PUSH(1);
			else if (a < b) PUSH(-1);
			else PUSH(0);
			instrptr += 1; break;
		case OP_JGT:
			a = POP();
			if (a > 0) instrptr += OPERAND();
			else instrptr += 5;
			break;
		case OP_HALT:
			return;
		}
	}
}

That's it. That's a complete virtual machine for our little bytecode language. Let's give it a spin! Here's a main function which exercises it:

int main(int argc, char **argv) {
	unsigned char program[] = {
		OP_INPUT, OP_INPUT,
		OP_CONSTANT, 0, 0, 0, 0,

		OP_GET, 0, 0, 0, 0,
		OP_GET, 3, 0, 0, 0,
		OP_ADD,
		OP_SET, 0, 0, 0, 0,

		OP_GET, 1, 0, 0, 0,
		OP_CONSTANT, 0xff, 0xff, 0xff, 0xff, // -1 32-bit little-endian (two's complement)
		OP_ADD,
		OP_SET, 1, 0, 0, 0,

		OP_GET, 1, 0, 0, 0,
		OP_CONSTANT, 0, 0, 0, 0,
		OP_CMP,
		OP_JGT, 0xd5, 0xff, 0xff, 0xff, // -43 in 32-bit little-endian (two's complement)

		OP_GET, 0, 0, 0, 0,
		OP_PRINT,

		OP_HALT,
	};
	int32_t input[] = {atoi(argv[1]), atoi(argv[2])};
	interpret(program, input);
}

Note: We use two's complement to represent negative numbers, because that's what the CPU does. A 32-bit number can represent the numbers between 0 and 4'294'967'295. Two's complement is a convention where the numbers between 0 and 2'147'483'647 are treated normally, and the numbers between 2'147'483'648 and 4'294'967'295 represent the numbers between -2'147'483'648 and -1.

Little-endian just means that the order of the bytes is "swapped" compared to what you'd expect. For example, to express the number 35799 (10001011'11010111 in binary) as 2 bytes in little-endian, we put the last 8 bits first and the first 8 bits last: unsigned char bytes[] = {0b11010111, 0b10001011}. It's a bit counter-intuitive, but it's how most CPU architectures these days represent numbers larger than one byte.
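To make that concrete, here's a tiny stand-alone helper (purely for illustration) which prints the little-endian two's complement encoding of an int32_t; this is how you'd arrive at bytes like 0xd5, 0xff, 0xff, 0xff for the -43 operand in the program above:

#include <stdint.h>
#include <stdio.h>

int main(void) {
	int32_t val = -43;
	// Casting to uint32_t gives us the two's complement bit pattern.
	uint32_t bits = (uint32_t)val;
	printf("0x%02x, 0x%02x, 0x%02x, 0x%02x\n",
		(unsigned)(bits & 0xff), (unsigned)((bits >> 8) & 0xff),
		(unsigned)((bits >> 16) & 0xff), (unsigned)((bits >> 24) & 0xff));
	// Prints: 0xd5, 0xff, 0xff, 0xff
}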

When I compile and run the full C program with the inputs 3 and 5, it prints 15. Success!

If I instead ask it to calculate 1 * 100'000'000, my laptop (Apple M1 Pro, Apple Clang 14.0.0 with -O3) runs the program in 1.4 seconds. My desktop (AMD R9 5950x, GCC 12.2.0 with -O3) runs the same program in 1.1 seconds. The loop contains 12 instructions, and there are 6 instructions outside of the loop, so a complete run executes 100'000'000*12+6=1'200'000'006 instructions. That means my laptop runs 856 million bytecode instructions per second ("IPS") on average, and my desktop runs 1.1 billion instructions per second.

                            Clang + Apple M1 Pro    GCC + AMD R9 5950x
                            Time       IPS          Time       IPS
Basic bytecode interpreter  1'401ms    856M         1'096ms    1'095M

Note: The actual benchmarked code defines the program variable in a separate translation unit from the main function and interpret function, and link-time optimization is disabled. This prevents the compiler from optimizing based on the knowledge of the bytecode program.

Not bad, but can we do better?

Managing our own jump table

Looking at Godbolt, the assembly generated for our loop + switch is roughly like this:

loop:
	jmp jmp_table[*instrptr]

jmp_table:
	.quad case_op_constant
	.quad case_op_add
	.quad case_op_print
	.quad case_op_discard
	.quad case_op_get
	.quad case_op_set
	.quad case_op_cmp
	.quad case_op_jgt
	.quad case_op_halt

case_op_constant:
	; (code...)
	add instrptr, 5
	jmp loop

case_op_add:
	; (code...)
	add instrptr, 1
	jmp loop

; etc

Note: This isn't real x86 or ARM assembly, but it gives an idea of what's going on without getting into the weeds of assembly syntax.

We can see that the compiler generated a jump table; a table of memory addresses to jump to. At the beginning of each iteration of the loop, it looks up the target address in the jump table based on the opcode at the instruction pointer, then jumps to it. And at the end of executing each switch case, it jumps back to the beginning of the loop. This is fine, but it's a bit unnecessary to jump to the start of the loop just to immediately jump again based on the next op-code. We could just replace the jmp loop with jmp jmp_table[*instrptr] like this:

	jmp jmp_table[*instrptr]

jmp_table:
	.quad case_op_constant
	.quad case_op_add
	.quad case_op_print
	.quad case_op_discard
	.quad case_op_get
	.quad case_op_set
	.quad case_op_cmp
	.quad case_op_jgt
	.quad case_op_halt

case_op_constant:
	; code
	add instrptr, 5
	jmp jmp_table[*instrptr]

case_op_add:
	; code
	add instrptr, 1
	jmp jmp_table[*instrptr]

; etc

This has the advantage of using one less instruction per iteration, but that's negligible; completely predictable jumps such as our jmp loop are essentially free. However, there's a much bigger advantage: the CPU can exploit the inherent predictability of our bytecode instruction stream to improve its branch prediction. For example, a CMP instruction is usually going to be followed by the JGT instruction, so the CPU can start speculatively executing the JGT instruction before it's even done executing the CMP instruction. (At least that's what I believe is happening; figuring out why something is as fast or slow as it is, at an instruction-by-instruction level, is incredibly difficult on modern CPUs.)

Sadly, standard C doesn't let us express this style of jump table. But GNU C does! With GNU's Labels as Values extension, we can create our own jump table and indirect goto:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum op {
	OP_CONSTANT, OP_ADD, OP_PRINT, OP_INPUT, OP_DISCARD,
	OP_GET, OP_SET, OP_CMP, OP_JGT, OP_HALT,
};

void interpret(unsigned char *bytecode, int32_t *input) {
	int32_t stack[128];
	int32_t *stackptr = stack;
	unsigned char *instrptr = bytecode;

	#define POP() (*(--stackptr))
	#define PUSH(val) (*(stackptr++) = (val))
	#define STACK(offset) (*(stackptr - 1 - offset))
	#define OPERAND() ( \
		((int32_t)instrptr[1] << 0) | \
		((int32_t)instrptr[2] << 8) | \
		((int32_t)instrptr[3] << 16) | \
		((int32_t)instrptr[4] << 24))

	// Note: This jump table must be synchronized with the 'enum op',
	// so that `jmptable[op]` represents the label with the code for the instruction 'op'
	void *jmptable[] = {
		&&case_constant, &&case_add, &&case_print, &&case_input, &&case_discard,
		&&case_get, &&case_set, &&case_cmp, &&case_jgt, &&case_halt,
	};

	int32_t a, b;
	goto *jmptable[*instrptr];

case_constant:
	PUSH(OPERAND());
	instrptr += 5; goto *jmptable[*instrptr];
case_add:
	b = POP();
	a = POP();
	PUSH(a + b);
	instrptr += 1; goto *jmptable[*instrptr];
case_print:
	a = POP();
	printf("%i\n", (int)a);
	instrptr += 1; goto *jmptable[*instrptr];
case_input:
	PUSH(*(input++));
	instrptr += 1; goto *jmptable[*instrptr];
case_discard:
	POP();
	instrptr += 1; goto *jmptable[*instrptr];
case_get:
	a = STACK(OPERAND());
	PUSH(a);
	instrptr += 5; goto *jmptable[*instrptr];
case_set:
	a = POP();
	STACK(OPERAND()) = a;
	instrptr += 5; goto *jmptable[*instrptr];
case_cmp:
	b = POP();
	a = POP();
	if (a > b) PUSH(1);
	else if (a < b) PUSH(-1);
	else PUSH(0);
	instrptr += 1; goto *jmptable[*instrptr];
case_jgt:
	a = POP();
	if (a > 0) instrptr += OPERAND();
	else instrptr += 5;
	goto *jmptable[*instrptr];
case_halt:
	return;
}

With this interpreter loop, my laptop calculates 1 * 100'000'000 in 898ms, while my desktop does it in 1 second. It's interesting that Clang + M1 is significantly slower than GCC + AMD with the basic interpreter but significantly faster for this custom jump table approach. At least it's a speed-up in both cases.

                            Clang + Apple M1 Pro    GCC + AMD R9 5950x
                            Time       IPS          Time       IPS
Basic bytecode interpreter  1'401ms    856M         1'096ms    1'095M
Custom jump table           898ms      1'336M       1'011ms    1'187M

Getting rid of the switch entirely with tail calls

Both of the implementations so far have essentially been of the form, "Look at the current instruction, and decide what code to run with some kind of jump table". But we don't actually need that. Instead of doing the jump table look-up every time, we could do the look-up once for every instruction before starting execution. Instead of an array of op codes, we could have an array of pointers to some machine code.

The easiest and most standard way to do this would be to have each instruction as its own function, and let that function tail-call the next function. Here's an implementation of that:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

union instr {
	void (*fn)(union instr *instrs, int32_t *stackptr, int32_t *input);
	int32_t operand;
};

#define POP() (*(--stackptr))
#define PUSH(val) (*(stackptr++) = (val))
#define STACK(offset) (*(stackptr - 1 - offset))
#define OPERAND() (instrs[1].operand)

static void op_constant(union instr *instrs, int32_t *stackptr, int32_t *input) {
	PUSH(OPERAND());
	instrs[2].fn(&instrs[2], stackptr, input);
}

static void op_add(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t b = POP();
	int32_t a = POP();
	PUSH(a + b);
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_print(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t a = POP();
	printf("%i\n", (int)a);
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_input(union instr *instrs, int32_t *stackptr, int32_t *input) {
	PUSH(*(input++));
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_discard(union instr *instrs, int32_t *stackptr, int32_t *input) {
	POP();
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_get(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t a = STACK(OPERAND());
	PUSH(a);
	instrs[2].fn(&instrs[2], stackptr, input);
}

static void op_set(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t a = POP();
	STACK(OPERAND()) = a;
	instrs[2].fn(&instrs[2], stackptr, input);
}

static void op_cmp(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t b = POP();
	int32_t a = POP();
	if (a > b) PUSH(1);
	else if (a < b) PUSH(-1);
	else PUSH(0);
	instrs[1].fn(&instrs[1], stackptr, input);
}

static void op_jgt(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t a = POP();
	if (a > 0) instrs += instrs[1].operand;
	else instrs += 2;
	instrs[0].fn(&instrs[0], stackptr, input);
}

static void op_halt(union instr *instrs, int32_t *stackptr, int32_t *input) {
	return;
}

This time, we can't just feed our interpreter an array of bytes as the bytecode, since there isn't really "an interpreter", there's just a collection of functions. We can manually create a program containing function pointers like this:

int main(int argc, char **argv) {
	union instr program[] = {
		{.fn = op_input}, {.fn = op_input},

		{.fn = op_constant}, {.operand = 0},

		{.fn = op_get}, {.operand = 0},
		{.fn = op_get}, {.operand = 3},
		{.fn = op_add},
		{.fn = op_set}, {.operand = 0},

		{.fn = op_get}, {.operand = 1},
		{.fn = op_constant}, {.operand = -1},
		{.fn = op_add},
		{.fn = op_set}, {.operand = 1},

		{.fn = op_get}, {.operand = 1},
		{.fn = op_constant}, {.operand = 0},
		{.fn = op_cmp},
		{.fn = op_jgt}, {.operand = -19},

		{.fn = op_get}, {.operand = 0},
		{.fn = op_print},

		{.fn = op_halt},
	};

	int32_t input[] = {atoi(argv[1]), atoi(argv[2])};
	int32_t stack[128];
	program[0].fn(program, stack, input);
}

And that works.

In a real use-case, you would probably want to have some code to automatically generate such an array of union instr based on bytecode, but we'll ignore that for now.
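Just to illustrate what that generation step might look like, here's a rough sketch (assuming the enum op from earlier plus the union instr and op_* functions above). One detail to watch out for: the bytecode's JGT operand is a byte offset, while op_jgt jumps by array elements, so jump operands have to be rebased:

static union instr *translate(unsigned char *bytecode, size_t size) {
	// Worst case, every bytecode byte becomes its own array element.
	union instr *instrs = malloc(size * sizeof(union instr));
	size_t *elem_at_byte = malloc(size * sizeof(size_t));
	size_t n = 0;

	// First pass: translate op-codes and operands, and remember which
	// array element each bytecode offset ended up at.
	for (size_t i = 0; i < size;) {
		enum op op = (enum op)bytecode[i];
		elem_at_byte[i] = n;
		int has_operand =
			op == OP_CONSTANT || op == OP_GET || op == OP_SET || op == OP_JGT;

		switch (op) {
		case OP_CONSTANT: instrs[n++].fn = op_constant; break;
		case OP_ADD:      instrs[n++].fn = op_add; break;
		case OP_PRINT:    instrs[n++].fn = op_print; break;
		case OP_INPUT:    instrs[n++].fn = op_input; break;
		case OP_DISCARD:  instrs[n++].fn = op_discard; break;
		case OP_GET:      instrs[n++].fn = op_get; break;
		case OP_SET:      instrs[n++].fn = op_set; break;
		case OP_CMP:      instrs[n++].fn = op_cmp; break;
		case OP_JGT:      instrs[n++].fn = op_jgt; break;
		case OP_HALT:     instrs[n++].fn = op_halt; break;
		}

		i += 1;
		if (has_operand) {
			instrs[n++].operand =
				((int32_t)bytecode[i + 0] << 0) | ((int32_t)bytecode[i + 1] << 8) |
				((int32_t)bytecode[i + 2] << 16) | ((int32_t)bytecode[i + 3] << 24);
			i += 4;
		}
	}

	// Second pass: rewrite JGT operands from byte offsets to element offsets.
	for (size_t i = 0, elem = 0; i < size;) {
		enum op op = (enum op)bytecode[i];
		int has_operand =
			op == OP_CONSTANT || op == OP_GET || op == OP_SET || op == OP_JGT;
		if (op == OP_JGT) {
			int32_t byte_offset = instrs[elem + 1].operand;
			instrs[elem + 1].operand =
				(int32_t)elem_at_byte[i + byte_offset] - (int32_t)elem;
		}
		i += has_operand ? 5 : 1;
		elem += has_operand ? 2 : 1;
	}

	free(elem_at_byte);
	return instrs;
}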

With this approach, my laptop calculates 1 * 100'000'000 in 841ms, while my desktop does it in only 553ms. It's not a huge improvement for the Clang + M1 case, but it's almost twice as fast with GCC + AMD! And compared to the previous approach, it's written in completely standard ISO C99, with the caveat that the compiler must perform tail call elimination. (Most compilers will do this at higher optimization levels, and most compilers let us specify per-function optimization levels with pragmas, so that's not a big issue in practice.)

                            Clang + Apple M1 Pro    GCC + AMD R9 5950x
                            Time       IPS          Time       IPS
Basic bytecode interpreter  1'401ms    856M         1'096ms    1'095M
Custom jump table           898ms      1'336M       1'011ms    1'187M
Tail calls                  841ms      1'427M       553ms      2'171M

Note: The timings from the benchmark include the time it takes to convert the bytecode into this function pointer array form.
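On the tail call requirement above: recent Clang versions also have a musttail attribute, which turns a missed tail call into a compile error instead of a silent stack-overflow hazard. As an illustration (Clang-specific, so treat it as a sketch rather than portable code), op_add could be written like this:

static void op_add(union instr *instrs, int32_t *stackptr, int32_t *input) {
	int32_t b = POP();
	int32_t a = POP();
	PUSH(a + b);
	// Clang guarantees tail call elimination here (or refuses to compile),
	// even at -O0.
	__attribute__((musttail)) return instrs[1].fn(&instrs[1], stackptr, input);
}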

Final step: A compiler

All approaches so far have relied on finding ever faster ways to select which source code snippet to run next. As it turns out, the fastest way to do that is to simply put the right source code snippets after each other!

If we have the following bytecode:

CONSTANT 5
INPUT
ADD
PRINT

We can just generate C source code to do what we want:

PUSH(5);

PUSH(INPUT());

b = POP();
a = POP();
PUSH(a + b);

printf("%i\n", (int)POP());

We can then either shell out to GCC/Clang, or link with libclang to compile the generated C code. This also lets us take advantage of those projects' excellent optimizers.

Note: At this point, we don't have a "virtual machine" anymore.

One challenge is how to deal with jumps. The easiest solution from a code generation perspective is probably to wrap all the code in a switch statement in a loop:

int32_t index = 0;
while (1) {
	switch (index) {
	case 0:
		PUSH(5);

	case 5:
		PUSH(INPUT());

	case 6:
		a = POP();
		b = POP();
		PUSH(a + b);

	case 7:
		printf("%i\n", (int)POP());
	}
}

With this approach, a jump to instruction N becomes index = N; break;.

Note: Remember that in C, switch statement cases fall through to the next case unless you explicitly jump to the end with a break. So once the code for instruction 5 is done, we just fall through to instruction 6.

Here's my implementation:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum op {
	OP_CONSTANT, OP_ADD, OP_PRINT, OP_INPUT, OP_DISCARD,
	OP_GET, OP_SET, OP_CMP, OP_JGT, OP_HALT,
};

void write_operand(unsigned char *i32le, FILE *out) {
	fprintf(out, "    operand = %i;\n",
		(int)i32le[0] | (int)i32le[1] << 8 | (int)i32le[2] << 16 | (int)i32le[3] << 24);
}

void compile(unsigned char *bytecode, size_t size, FILE *out) {
	fputs(
		"#include <stdio.h>\n"
		"#include <stdint.h>\n"
		"#include <stdlib.h>\n"
		"\n"
		"int main(int argc, char **argv) {\n"
		"  int32_t stack[128];\n"
		"  int32_t *stackptr = stack;\n"
		"  char **inputptr = &argv[1];\n"
		"\n"
		"#define POP() (*(--stackptr))\n"
		"#define PUSH(val) (*(stackptr++) = (val))\n"
		"#define STACK(offset) (*(stackptr - 1 - offset))\n"
		"\n"
		"  int32_t a, b, operand;\n"
		"  int32_t index = 0;\n"
		"  while (1) switch (index) {\n",
		out);

	for (size_t i = 0; i < size;) {
		fprintf(out, "  case %zi:\n", i);

		enum op op = (enum op)bytecode[i];
		switch (op) {
		case OP_CONSTANT:
			write_operand(&bytecode[i + 1], out);
			fputs("    PUSH(operand);\n", out);
			i += 5; break;

		case OP_ADD:
			fputs(
				"    b = POP();\n"
				"    a = POP();\n"
				"    PUSH(a + b);\n",
				out);
			i += 1; break;

		case OP_PRINT:
			fputs(
				"    a = POP();\n"
				"    printf(\"%i\\n\", (int)a);\n",
				out);
			i += 1; break;

		case OP_INPUT:
			fputs("    PUSH(atoi(*(inputptr++)));\n", out);
			i += 1; break;

		case OP_DISCARD:
			fputs("    POP();\n", out);
			i += 1; break;

		case OP_GET:
			write_operand(&bytecode[i + 1], out);
			fputs(
				"    a = STACK(operand);\n"
				"    PUSH(a);\n",
				out);
			i += 5; break;

		case OP_SET:
			write_operand(&bytecode[i + 1], out);
			fputs(
				"    a = POP();\n"
				"    STACK(operand) = a;\n",
				out);
			i += 5; break;

		case OP_CMP:
			fputs(
				"    b = POP();\n"
				"    a = POP();\n"
				"    if (a > b) PUSH(1);\n"
				"    else if (a < b) PUSH(-1);\n"
				"    else PUSH(0);\n",
				out);
			i += 1; break;

		case OP_JGT:
			write_operand(&bytecode[i + 1], out);
			fprintf(out,
				"    a = POP();\n"
				"    if (a > 0) { index = %zi + operand; break; }\n",
				i);
			i += 5; break;

		case OP_HALT:
			fputs("    return 0;\n", out);
			i += 1; break;
		}
	}

	fputs(
		"  }\n"
		"\n"
		"  abort(); // If we get here, there's a missing HALT\n"
		"}",
		out);
}
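One way to drive this compiler is to write the generated C to a temporary file and shell out to the system C compiler. A minimal sketch (the file names and the global program/program_size variables here are assumptions made up for the example):

#include <stdio.h>
#include <stdlib.h>

// Assumed to be defined elsewhere: the bytecode to compile and its size.
extern unsigned char program[];
extern size_t program_size;

int main(void) {
	FILE *out = fopen("program.c", "w");
	if (out == NULL) abort();
	compile(program, program_size, out);
	if (fclose(out) != 0) abort();

	// Let the system C compiler and its optimizer do the heavy lifting.
	if (system("cc -O3 -o program program.c") != 0) abort();
	return 0;
}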

If we run our compiler on the bytecode for our multiplication program, it outputs this C code:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  int32_t stack[128];
  int32_t *stackptr = stack;
  char **inputptr = &argv[1];

  #define POP() (*(--stackptr))
  #define PUSH(val) (*(stackptr++) = (val))
  #define STACK(offset) (*(stackptr - 1 - offset))

  int32_t a, b, operand;
  int32_t index = 0;
  while (1) switch (index) {
  case 0:
    PUSH(atoi(*(inputptr++)));
  case 1:
    PUSH(atoi(*(inputptr++)));
  case 2:
    operand = 0;
    PUSH(operand);
  case 7:
    operand = 0;
    a = STACK(operand);
    PUSH(a);

  /* ... */

  case 49:
    b = POP();
    a = POP();
    if (a > b) PUSH(1);
    else if (a < b) PUSH(-1);
    else PUSH(0);
  case 50:
    operand = -43;
    a = POP();
    if (a > 0) { index = 50 + operand; break; }
  case 55:
    operand = 0;
    a = STACK(operand);
    PUSH(a);
  case 60:
    a = POP();
    printf("%i\n", (int)a);
  case 61:
    return 0;
  }

  abort(); // If we get here, there's a missing HALT
}

If we compile the generated C code with -O3, my laptop runs the 1 * 100'000'000 calculation in 204ms! That's over 4 times faster than the fastest interpreter we've had so far. That also means we're executing our 1'200'000'006 bytecode instructions at 5'882 million instructions per second! Its CPU only runs at 3'220 million CPU clock cycles per second, meaning it's spending significantly less than a clock cycle per bytecode instruction on average. My desktop with GCC is doing even better, executing all the code in 47ms, which means a whopping 25.7 billion instructions per second!

Note that in this particular case, the compiler is able to see that some instructions always happen after each other, which means it can optimize across bytecode instructions. For example, the bytecode contains a sequence GET 1; CONSTANT -1; ADD;, which the compiler is able to prove you won't ever jump into the middle of, so it optimizes out all the implied stack manipulation code; it's optimized into a single sub instruction which subtracts the constant 1 from one register and writes the result to another.

This is kind of an important point. The compiler can generate amazing code, if it can figure out which instructions (i.e switch cases) are potential jump targets. This is information you probably have access to in the source code, so it's worth thinking about how you can design your bytecode such that GCC or Clang can figure it out when looking at your compiler output. One approach could be to add "label" bytecode instructions, and only permit jumping to such a label. With our bytecode, the only jump instruction we have jumps to a known location, since the jump offset is an immediate operand to the instruction. If we added an instruction which reads the jump target from the stack instead, we might quickly get into situations where GCC/Clang has lost track of which instructions can be jump targets, and must therefore make sure not to optimize across instruction boundaries.
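As a sketch of how that could work with our current bytecode (where every jump target is an immediate operand, and thus statically known), the compiler could first collect all JGT targets, and then only emit case labels for the entry point and those targets; every other instruction boundary is then free for the C compiler to optimize across:

#include <string.h>

// Marks which bytecode offsets can be jumped to. Assumes the same
// 'enum op' and instruction encoding as the compiler above.
static void find_jump_targets(unsigned char *bytecode, size_t size,
		unsigned char *is_target /* one flag per bytecode byte */) {
	memset(is_target, 0, size);
	is_target[0] = 1; // the entry point

	for (size_t i = 0; i < size;) {
		enum op op = (enum op)bytecode[i];
		if (op == OP_JGT) {
			int32_t offset =
				((int32_t)bytecode[i + 1] << 0) | ((int32_t)bytecode[i + 2] << 8) |
				((int32_t)bytecode[i + 3] << 16) | ((int32_t)bytecode[i + 4] << 24);
			is_target[i + offset] = 1;
		}

		int has_operand =
			op == OP_CONSTANT || op == OP_GET || op == OP_SET || op == OP_JGT;
		i += has_operand ? 5 : 1;
	}
}

The compile function would then only print the "  case %zi:\n" label for offsets where is_target[i] is set.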

We can prevent the compiler from optimizing across instruction boundaries by inserting this code after case 61: (the code for the HALT instruction):

if (argc > 100) { PUSH(argc); index = argc % 61; break; }

With this modification, every single instruction might be a branch target, so every instruction must make sense in its own right regardless of which instruction was executed before or how the stack looks.

This time, the 1 * 100'000'000 calculation happens in 550ms on my laptop with Clang, which is still not bad. It means we're executing 2'181 million bytecode instructions per second. My desktop is doing even better, at 168ms.

At this point, I got curious about whether it's the CPU or the compiler making the difference, so the next table contains all the benchmarks for both compilers on both systems.

                            Apple M1 Pro                AMD R9 5950x
                            GCC 12.1.0   Clang 14.0.0   GCC 12.2.0   Clang 15.0.2
Basic bytecode interpreter  1'902ms      1'402ms        1'135ms      2'347ms
Custom jump table           816ms        897ms          1'023ms      912ms
Tail calls                  1'068ms      843ms          557ms        645ms
Compiler (pessimized)       342ms        548ms          172ms        302ms
Compiler                    71ms         205ms          52ms         161ms

I have no intelligent commentary on those numbers. They're all over the place. In the basic interpreter case for example, GCC is much faster than Clang on the AMD CPU, but Clang is much faster than GCC on the Apple CPU. It's the opposite in the custom jump table case, where GCC is much faster than Clang on the Apple CPU, but Clang is much faster than GCC on the AMD CPU. The overall pattern we've been looking at holds though, for the most part: for any given CPU + compiler combination, every implementation I've introduced is faster than the one before it. The big exception is the tail call version, where the binary compiled by GCC performs horribly on the Apple CPU (even though it performs excellently on the AMD CPU!).

If anything though, this mess of numbers indicates the value of knowing about all the different possible approaches and choosing the right one for the situation. Which takes us to...

Bringing it all together

We have 4 different implementations of the same bytecode, all with different advantages and drawbacks. And even though every instruction does the same thing in every implementation, we have written 4 separate implementations of every instruction.

That seems unnecessary. After all, we know that ADD, in every implementation, will do some variant of this:

b = POP();
a = POP();
PUSH(a + b);
GO_TO_NEXT_INSTRUCTION();

What exactly it means to POP or to PUSH or to go to the next instruction might depend on the implementation, but the core functionality is the same for all of them. We can utilize that regularity to specify the instructions only once in a way that's re-usable across implementations using so-called X macros.

We create a file instructions.x which contains code to define all our instructions:

X(CONSTANT, 1, {
	PUSH(OPERAND());
	NEXT();
})

X(ADD, 0, {
	b = POP();
	a = POP();
	PUSH(a + b);
	NEXT();
})

// etc...

Let's say we want to create an instructions.h which contains an enum op with all the operation types and a const char *op_names[] which maps enum values to strings. We can implement that by doing something like this:

#ifndef INSTRUCTIONS_H
#define INSTRUCTIONS_H

enum op {
#define X(name, has_operand, code...) OP_ ## name,
#include "instructions.x"
#undef X
};

static const char *op_names[] = {
#define X(name, has_operand, code...) [OP_ ## name] = "OP_" #name,
#include "instructions.x"
#undef X
};

#endif

This code might look a bit confusing at first glance, but it makes sense: we have generic descriptions of instructions in the instructions.x file, and then we define a macro called X to extract information from those descriptions. It's basically a weird preprocessor-based application of the visitor pattern. In the above example, we use the instruction definitions twice: once to define the enum op, and once to define the const char *op_names[]. If we run the code through the preprocessor, we get something roughly like this:

enum op {
OP_CONSTANT,
OP_ADD,
};

const char *op_names[] = {
[OP_CONSTANT] = "OP_CONSTANT",
[OP_ADD] = "OP_ADD",
};

Now let's say we want to write a function which executes an instruction. We could write that function like this:

void execute(enum op op) {
	switch (op) {
#define X(name, has_operand, code...) case OP_ ## name: code break;
#include "instructions.x"
#undef X
	}
}

Which expands to:

void execute(enum op op) {
	switch (op) {
	case OP_CONSTANT:
		{
			PUSH(OPERAND());
			NEXT();
		} break;
	case OP_ADD:
		{
			b = POP();
			a = POP();
			PUSH(a + b);
			NEXT();
		} break;
	}
}

Note: We use a variadic argument for the code block because the C preprocessor has annoying splitting rules. Code such as X(FOO, 1, {int32_t a, b;}) would call the macro X with 4 arguments: FOO, 1, {int32_t a, and b;}. Using a variadic argument "fixes" this, because when we expand code in the macro body, the preprocessor will insert a comma between the arguments. You can read about more stupid preprocessor hacks here: https://mort.coffee/home/obscure-c-features/

This is starting to look reasonable, but it doesn't quite work. We haven't defined those PUSH/OPERAND/NEXT/POP macros, nor the a and b variables. We need to be a bit more rigorous about what exactly is expected by the instruction, and what's expected by the environment which the instruction's code is expanded into. So let's design a sort of "contract" between the instruction and the execution environment.

The environment must:

  • Provide a POP() macro which pops the stack and evaluates to the result.
  • Provide a PUSH(val) macro which pushes the value to the stack.
  • Provide a STACK(offset) macro which evaluates to an lvalue for the stack value at offset.
  • Provide an OPERAND() macro which evaluates to the current instruction's operand as an int32_t.
  • Provide an INPUT() macro which reads external input and evaluates to the result.
  • Provide a PRINT(val) macro which outputs the value somehow (such as by printing to stdout).
  • Provide a GOTO_RELATIVE(offset) macro which jumps to currentInstruction + offset
  • Provide a NEXT() macro which goes to the next instruction
  • Provide a HALT() macro which halts execution.
  • Provide the variables int32_t a and int32_t b as general-purpose variables. (This turns out to significantly speed up execution in some cases compared to defining the variables locally within the scope.)

As for the instruction:

  • It must call X(name, has_operand, code...) with an identifier for name, a 0 or 1 for has_operand, and a brace-enclosed code block for code....
  • The code block may only invoke OPERAND() if it has set has_operand to 1.
  • The code block must only contain standard C code and calls to the macros we defined earlier.
  • The code block must not try to directly access any other variables which may exist in the context in which it is expanded.
  • The code block can assume that the following C headers are included: <stdio.h>, <stdlib.h>, <stdint.h>.
  • The code must not change the stack pointer and dereference it in the same expression (essentially, no PUSH(STACK(1)), since there's no sequence point between the dereference and the increment).

With this, we can re-implement our basic bytecode interpreter:

#include "instructions.h"

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

void interpret(unsigned char *bytecode, int32_t *input) {
	int32_t stack[128];
	int32_t *stackptr = stack;
	unsigned char *instrptr = bytecode;

	int instrsize; // Will be initialized later

	#define POP() (*(--stackptr))
	#define PUSH(val) (*(stackptr++) = (val))
	#define STACK(offset) (*(stackptr - 1 - offset))
	#define OPERAND() ( \
		((int32_t)instrptr[1] << 0) | \
		((int32_t)instrptr[2] << 8) | \
		((int32_t)instrptr[3] << 16) | \
		((int32_t)instrptr[4] << 24))
	#define INPUT() (*(input++))
	#define PRINT(val) (printf("%i\n", (int)(val)))
	#define GOTO_RELATIVE(offset) (instrptr += (offset))
	#define NEXT() (instrptr += instrsize)
	#define HALT() return

	int32_t a, b;
	while (1) {
		switch ((enum op)*instrptr) {
#define X(name, has_operand, code...) \
		case OP_ ## name: \
			instrsize = has_operand ? 5 : 1; \
			code \
			break;
#include "instructions.x"
#undef X
		}
	}
}

And that's it! That's our whole generic basic bytecode interpreter, defined using the instruction definitions in instructions.x. And any time we add more bytecode instructions to instructions.x, the instructions are automatically added to the enum op and const char *op_names[] in instructions.h, and they're automatically supported by this new basic interpreter.

I won't deny that this style of code is a bit harder to follow than straight C code. However, I've seen VMs with their own custom domain-specific languages and code generators to define instructions, and I find that much harder to follow than this preprocessor-based approach. Even though the C preprocessor is flawed in many ways, it has the huge advantage that C programmers already understand how it works for the most part, and they're used to following code which uses macros and includes. With decent comments in strategic places, I don't think this sort of "abuse" of the C preprocessor is wholly unreasonable. Your mileage may differ though, and my threshold for "too much preprocessor magic" might be set too high.

For completeness, let's amend instructions.x with all the instructions in the bytecode language I defined at the start of this post:

X(CONSTANT, 1, {
	PUSH(OPERAND());
	NEXT();
})

X(ADD, 0, {
	b = POP();
	a = POP();
	PUSH(a + b);
	NEXT();
})

X(PRINT, 0, {
	PRINT(POP());
	NEXT();
})

X(INPUT, 0, {
	PUSH(INPUT());
	NEXT();
})

X(DISCARD, 0, {
	(void)POP();
	NEXT();
})

X(GET, 1, {
	a = STACK(OPERAND());
	PUSH(a);
	NEXT();
})

X(SET, 1, {
	a = POP();
	STACK(OPERAND()) = a;
	NEXT();
})

X(CMP, 0, {
	b = POP();
	a = POP();
	if (a > b) PUSH(1);
	else if (a < b) PUSH(-1);
	else PUSH(0);
	NEXT();
})

X(JGT, 1, {
	a = POP();
	if (a > 0) { GOTO_RELATIVE(OPERAND()); }
	else { NEXT(); }
})

X(HALT, 0, {
	HALT();
})

Implementing the custom jump table variant and the tail-call variant using this X-macro system is left as an exercise to the reader. However, just to show that it's possible, here's the compiler variant implemented generically:

#include "instructions.h"

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

void compile(unsigned char *bytecode, size_t size, FILE *out) {
	fputs(
		"#include <stdio.h>\n"
		"#include <stdint.h>\n"
		"#include <stdlib.h>\n"
		"\n"
		"int main(int argc, char **argv) {\n"
		"  int32_t stack[128];\n"
		"  int32_t *stackptr = stack;\n"
		"  char **inputptr = &argv[1];\n"
		"\n"
		"#define POP() (*(--stackptr))\n"
		"#define PUSH(val) (*(stackptr++) = (val))\n"
		"#define STACK(offset) (*(stackptr - 1 - offset))\n"
		"#define OPERAND() operand\n"
		"#define INPUT() (atoi(*(inputptr++)))\n"
		"#define PRINT(val) printf(\"%i\\n\", (int)(val))\n"
		"#define GOTO_RELATIVE(offset) index += offset; break\n"
		"#define NEXT()\n"
		"#define HALT() return 0\n"
		"\n"
		"  int32_t a, b, operand;\n"
		"  int32_t index = 0;\n"
		"  while (1) switch (index) {\n",
		out);

	for (size_t i = 0; i < size;) {
		fprintf(out, "  case %zi:\n", i);

		enum op op = (enum op)bytecode[i];
		switch (op) {
#define X(name, has_operand, code...) \
		case OP_ ## name: \
			fprintf(out, "    index = %zi;\n", i); \
			i += 1; \
			if (has_operand) { \
				fprintf(out, "    operand = %i;\n", (int)( \
					((int32_t)bytecode[i + 0] << 0) | ((int32_t)bytecode[i + 1] << 8) | \
					((int32_t)bytecode[i + 2] << 16) | ((int32_t)bytecode[i + 3] << 24))); \
				i += 4; \
			} \
			fputs("    " #code "\n", out); \
			break;
#include "instructions.x"
#undef X
		}
	}

	fputs(
		"  }\n"
		"\n"
		"  abort(); // If we get here, there's a missing HALT\n"
		"}",
		out);
}

A word on real-world performance

I thought I should mention that the techniques described in this post won't magically make any interpreted language much faster. The main source of the performance differences we have explored here is due to the overhead involved in selecting which instruction to execute next; the code which runs between the instructions. By reducing this overhead, we're able to make our simple bytecode execute blazing fast. But that's really only because all our instructions are extremely simple.

In the case of something like Python, each instruction might be much more complex to execute. The BINARY_ADD operation, for example, pops two values from the stack, adds them together, and pushes the result onto the stack, much like how our bytecode's ADD operation does. However, our ADD operation knows that the two popped values are 32-bit signed integers. In Python, the popped values may be strings, they may be arrays, they may be numbers, they may be objects with a custom __add__ method, etc. This means that the time it takes to actually execute instructions in Python will dominate to the point that speeding up instruction dispatch is likely insignificant. Optimizing highly dynamic languages like Python kind of requires some form of tracing JIT to stamp out specialized functions which make assumptions about what types their arguments are, which is outside the scope of this post.

But that doesn't mean the speed-up I have shown here is unrealistic. If you're making a language with static types, you can have dedicated fast instructions for adding i32s, adding doubles, etc. And at that point, the optimizations shown in this post will give drastic speed-ups.

Further reading


So those are my thoughts on speeding up virtual machine execution. If you want, you may check out my programming languages Gilia and osyris. Neither makes use of any of the techniques discussed in this post, but playing with Gilia's VM is what got me started down this path of exploring different techniques. If I ever get around to implementing these ideas into Gilia's VM, I'll add a link to the relevant parts of the source code here.

The tar archive format, its extensions, and why GNU tar extracts in quadratic time

Date: 2022-07-23
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00014-tar.md

(If you're here from Google and just need help with tar being slow: If you trust the tar archive, extract with -P to make tar fast.)

A couple of days ago, I had a 518GiB tar.gz file (1.1 TiB uncompressed) that I had to extract. At first, GNU tar was doing a great job, chewing through the tar.gz at around 100MiB/s. But after a while, it slowed significantly; down to less than a kilobyte per second. pv's time estimate went from a bit over an hour, to multiple hours, to over a day, to almost a week. After giving it some time, and after failing to find anything helpful through Google, I decided that learning the tar file format and making my own tar extractor would probably be faster than waiting for tar. And I was right; before the day was over, I had a working tar extractor, and I had successfully extracted my 1.1TiB tarball.

I will explain why GNU tar is so slow later in this post, but first, let's take a look at:

The original tar file format

Tar is pretty unusual for an archive file format. There's no archive header, no index of files to facilitate seeking, no magic bytes to help file(1) and its ilk detect whether a file is a tar archive, no footer, no archive-wide metadata. The only kind of thing in a tar file is a file object.

So, how do these file objects look? Well, they start with a 512-byte file object header which looks like this:

struct file_header {
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	char padding[255];
};

Followed by ceil(file_size / 512) 512-byte blocks of payload (i.e file contents).

We have most of the attributes we would expect a file object to have: the file path, the mode, the modification time (mtime), the user/group ID, the file size, and the file type. To support symlinks and hard links, there's also a link path.

The original tar file format defines these possible values for the file_type field:

  • '0' (or sometimes '\0', the NUL character): Normal file
  • '1': Hard link
  • '2': Symbolic link

Future extensions to tar implement additional file types, among them '5', which represents a directory. Some old tar implementations apparently used a trailing slash '/' in a '0'-type file object to represent directories, at least according to Wikipedia.

You may think that the numeric values (file_mode, file_size, file_mtime, ...) would be encoded in base 10, or maybe in hex, or using plain binary numbers ("base 256"). But no, they're actually encoded as octal strings (with a NUL terminator, or sometimes a space terminator). Tar is the only file format I know of which uses base 8 to encode numbers. I don't quite understand why, since octal is neither space-efficient nor human-friendly. When representing numbers in this post, I will write them in decimal (base 10).

To encode a tar archive with one file called "hello.txt" and the content "Hello World", we need two 512-byte blocks:

  1. Bytes 0-511: Header, type='0', file_path="./hello.txt", file_size=11
  2. Bytes 512-1023: "Hello World", followed by 501 zero bytes

In addition, a tar file is supposed to end with 1024 zero-bytes to represent an end-of-file marker.

The two big limitations of the original tar format are that paths can't be longer than 100 characters, and files can't be larger than 8GiB (8^11 bytes). Otherwise though, I quite like the simplicity of the format. We'll discuss how various extensions address the limitations later, but first, let's try to implement an extractor:

(Feel free to skip this source code, but you should at least skim the comments)

// tarex.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>

struct file_header {
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	char padding[255];
};

// We don't bother with great error reporting, just abort on error
#define check(x) if (!(x)) abort()

// Utilities to abort on short read/write
#define xfread(ptr, size, f) check(fread(ptr, 1, size, f) == size)
#define xfwrite(ptr, size, f) check(fwrite(ptr, 1, size, f) == size)

// Tar represents all its numbers as octal
size_t parse_octal(char *str, size_t maxlen) {
	size_t num = 0;
	for (size_t i = 0; i < maxlen && str[i] >= '0' && str[i] <= '7'; ++i) {
		num *= 8;
		num += str[i] - '0';
	}

	return num;
}

// Extract one file from the archive.
// Returns 1 if it extracted something, or 0 if it reached the end.
int extract(FILE *f) {
	unsigned char header_block[512];
	xfread(header_block, sizeof(header_block), f);
	struct file_header *header = (struct file_header *)header_block;

	// The end of the archive is represented with blocks of all-zero content.
	// For simplicity, assume that if the file path is empty, the block is all zero
	// and we reached the end.
	if (header->file_path[0] == '\0') {
		return 0;
	}

	// The file path and link path fields aren't always 0-terminated, so we need to copy them
	// into our own buffers, otherwise we break on files with exactly 100 character paths.
	char file_path[101] = {0};
	memcpy(file_path, header->file_path, 100);
	char link_path[101] = {0};
	memcpy(link_path, header->link_path, 100);

	// We need these for later
	size_t file_size = parse_octal(header->file_size, sizeof(header->file_size));
	FILE *out_file = NULL;

	if (header->file_type == '0' || header->file_type == '\0') {
		// A type of '0' means that this is a plain file.
		// Some early implementations also use a NUL character ('\0') instead of an ASCII zero.

		fprintf(stderr, "Regular file: %s\n", file_path);
		out_file = fopen(file_path, "w");
		check(out_file != NULL);

	} else if (header->file_type == '1') {
		// A type of '1' means that this is a hard link.
		// That means we create a hard link at 'file_path' which links to the file at 'link_path'.

		fprintf(stderr, "Hard link: %s -> %s\n", file_path, link_path);
		check(link(link_path, file_path) >= 0);

	} else if (header->file_type == '2') {
		// A type of '2' means that this is a symbolic link.
		// That means we create a symlink at 'file_path' which links to the file at 'link_path'.

		fprintf(stderr, "Symbolic link: %s -> %s\n", file_path, link_path);
		check(symlink(link_path, file_path) >= 0);

	} else if (header->file_type == '5') {
		// A type of '5' means that this is a directory.

		fprintf(stderr, "Directory: %s\n", file_path);
		check(mkdir(file_path, 0777) >= 0);

		// Directories sometimes use the size field, but they don't contain data blocks.
		// Zero out file_size to avoid skipping entries.
		file_size = 0;

	} else {
		// There are other possible fields added by various tar implementations and standards,
		// but we'll ignore those for this implementation.
		fprintf(stderr, "Unsupported file type %c: %s\n", header->file_type, file_path);
	}

	// We have read the header block, now we need to read the payload.
	// If we're reading a file (i.e if 'outfile' is non-NULL) we will also write the body,
	// but otherwise we'll just skip it.
	char block[512];
	while (file_size > 0) {
		xfread(block, sizeof(block), f);
		size_t n = file_size > 512 ? 512 : file_size;

		file_size -= n;
		if (out_file != NULL) {
			xfwrite(block, n, out_file);
		}
	}

	if (out_file != NULL) {
		check(fclose(out_file) >= 0);
	}

	// Indicate that we have successfully extracted a file object, and are ready to read the next
	return 1;
}

int main() {
	while (extract(stdin));
}

Let's see it in action:

~/tarex $ ls
tarex.c testdir
~/tarex $ gcc -o tarex tarex.c
~/tarex $ tree
.
├── tarex.c
├── tarex
└── testdir
    ├── hello-symlink -> hello.txt
    ├── hello.txt
    └── subdir
        └── file.txt

~/tarex $ tar c testdir >testdir.tar
~/tarex $ mkdir extract && cd extract

~/tarex/extract $ ../tarex <../testdir.tar
Directory: testdir/
Symbolic link: testdir/hello-symlink -> hello.txt
Directory: testdir/subdir/
Regular file: testdir/hello.txt
Regular file: testdir/subdir/file.txt

~/tarex/extract $ tree
.
└── testdir
    ├── hello-symlink -> hello.txt
    ├── hello.txt
    └── subdir
        └── file.txt

The UStar file format

The first major extension to the tar file format we will look at is the UStar format, which increases the file length limit to 256 characters and adds some new file types. The header is expanded to this:

struct file_header {
	// Original tar header fields
	char file_path[100];
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
	char link_path[100];

	// New UStar fields
	char magic_bytes[6];
	char version[2];
	char owner_user_name[32];
	char owner_group_name[32];
	char device_major_number[8];
	char device_minor_number[8];
	char prefix[155];

	char padding[12];
};

We now have some magic bytes (defined to be "ustar\0" for the UStar format), as well as the owner user/group names. But most importantly, we have a prefix field, which allows up to 256 character file paths. With UStar, instead of just extracting the bytes from file_path and link_path like before, we must construct a file path like this:

void read_path(char dest[257], char path[100], char prefix[155]) {
	// If there's no prefix, use name directly
	if (prefix[0] == '\0') {
		memcpy(dest, path, 100);
		dest[100] = '\0';
		return;
	}

	// If there is a prefix, the path is: <prefix> '/' <path>
	size_t prefix_len = strnlen(prefix, 155);
	memcpy(dest, prefix, prefix_len);
	dest[prefix_len] = '/';
	memcpy(&dest[prefix_len + 1], path, 100);
	dest[prefix_len + 101] = '\0';
}

int extract(FILE *f) {
	/* ... */

	char file_path[257];
	read_path(file_path, header->file_path, header->prefix);
	char link_path[257];
	read_path(link_path, header->link_path, header->prefix);

	/* ... */
}

The original tar format had the file types '0' (or '\0'), '1' and '2', for regular files, hard links and symlinks. UStar defines these additional file types:

  • '3' and '4': Character devices and block devices. These are the reason for the new device_major_number and device_minor_number fields.
  • '5': Directories.
  • '6': FIFO files.
  • '7': Contiguous files. This type isn't really used much these days, and most implementations just treat it as a regular file.

This is definitely an improvement, but we can still only encode up to 256 character long paths. And that 8GiB file size limit still exists. Which leads us to:

The pax file format

The POSIX.1-2001 standard introduced the pax command line tool, and with it, a new set of extensions to the tar file format. This format is identical to UStar, except that it adds two new file object types: 'x' and 'g'. Both of these types let us define "extended header records", as the spec calls it. Records set with 'x' apply to only the next file, while records set with 'g' apply to all following files.

With this new extended header, we can encode the access and modification times with more precision, user/group IDs above 8^7, file sizes over 8^11, file paths of arbitrary length, and a whole lot more. The records are in the payload of the extended header file object, and use a simple length-prefixed key/value syntax. To represent our "hello.txt" example file with an access time attribute, we need these four 512-byte blocks:

  1. Header, type='x', file_size=30
  2. "30 atime=1658409251.551879906\n", followed by 482 zeroes
  3. Header, type='0', file_path="hello.txt", file_size=11
  4. "Hello World", followed by 501 zero bytes

Interestingly, these extended header records all seem to use decimal (base 10). On the one hand, using base 10 makes sense, but on the other hand, wouldn't it be nice to stick to one way of representing numbers?
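To give an idea of what that involves for an extractor, here's a rough sketch of a record parser, assuming the 'x' or 'g' file object's payload has already been read into memory (the callback is just for illustration):

#include <string.h>

// Each record is "<length> <key>=<value>\n", where <length> is the decimal
// length of the entire record, including the length field and the newline.
static void parse_pax_records(char *payload, size_t size,
		void (*on_record)(const char *key, const char *value)) {
	size_t off = 0;
	while (off < size && payload[off] >= '0' && payload[off] <= '9') {
		// Parse the decimal length prefix.
		size_t record_len = 0;
		size_t i = off;
		while (i < size && payload[i] >= '0' && payload[i] <= '9') {
			record_len = record_len * 10 + (payload[i] - '0');
			i += 1;
		}
		if (i >= size || payload[i] != ' ' || record_len == 0 ||
				off + record_len > size) {
			break;
		}
		i += 1; // skip the space after the length

		// The key and value are separated by '=' and terminated by '\n'.
		char *key = &payload[i];
		char *eq = memchr(key, '=', off + record_len - i);
		if (eq == NULL) break;
		*eq = '\0';
		payload[off + record_len - 1] = '\0'; // overwrite the trailing '\n'

		on_record(key, eq + 1);
		off += record_len;
	}
}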

Anyways, we can see that the file format has become quite complex now. Just the file path can be provided in any of four different ways:

  • The full path might be in the file_path field.
  • The path might be a combination of the prefix and the file_path fields.
  • The previous file object might've been an 'x' type record which set a path property.
  • There might've been some 'g' type file object earlier in the archive which set a path property.

The GNU tar file format

GNU tar has its own file format, called gnu, which is different from the pax format. Like pax, the gnu format is based on UStar, but it has a different way of encoding arbitrary length paths and large file sizes:

  • It introduces the 'L' type, where the payload of the file object represents the file_path of the next file object.
  • It introduces the 'K' type, where the payload of the file object represents the link_path of the next file object.
  • A link with both a long file_path and a long link_path is preceded by both an 'L' type file object and a 'K' type file object. The order isn't specified from what I can tell.
  • If a file is over 8GiB, it will set the high bit of the first character in file_size, and the rest of the string is parsed as base 256 (i.e it's treated as a 95-bit integer, big endian).
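Here's a sketch of a size parser that handles both encodings, assuming the 12-byte file_size field from the header struct earlier:

#include <stddef.h>

size_t parse_file_size(unsigned char *field, size_t len) {
	size_t num = 0;

	if (field[0] & 0x80) {
		// GNU base-256 extension: the rest is a big-endian binary number,
		// with the marker bit masked out of the first byte.
		num = field[0] & 0x7f;
		for (size_t i = 1; i < len; ++i) {
			num = (num << 8) | field[i];
		}
	} else {
		// Plain old octal string, NUL- or space-terminated.
		for (size_t i = 0; i < len && field[i] >= '0' && field[i] <= '7'; ++i) {
			num = num * 8 + (field[i] - '0');
		}
	}

	return num;
}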

In some ways, I prefer this approach over the pax approach, since it's much simpler; the pax format requires the extractor to parse the record grammar. On the other hand, the pax format is both more space efficient and vastly more flexible.

In any case, the result is that a tar extractor which wants to support both pax tar files and GNU tar files needs to support 5 different ways of reading the file path, 5 different ways of reading the link path, and 3 different ways of reading the file size.

Whatever happened to the nice and simple format we started out with?

Why GNU tar extracts in quadratic time

Our simple tar extraction implementation has what could be considered a quite serious security bug: It allows people to put files outside the directory we're extracting to. Nothing is stopping an evil archive from containing a file object with file_path="../hello.txt". You might try to fix that by just disallowing file objects from using ".." as a path component, but it's not that simple. Consider the following sequence of file objects:

  1. Symlink, file_path="./foo", link_path=".."
  2. Normal file, file_path="./foo/hello.txt"

We want to allow symlinks which point to their parent directory, since there are completely legitimate use cases for that. We could try to figure out whether a symlink will end up pointing to somewhere outside of the extraction directory, but that gets complicated real fast when you have to consider symlinks to symlinks and hard links to symlinks. It might be possible to do correctly, but it's not the solution GNU tar goes for.

When GNU tar encounters a hard link or symlink with ".." as a path component in its link_path, tar will create a regular file in its place as a placeholder, and put a note about the delayed link in a linked list data structure. When it's done extracting the entire archive, it will go through the whole list of delayed links and replace the placeholders with proper links. So far, so good.

The problem comes when trying to extract a hard link which doesn't contain ".." as a path component in its link_path. GNU tar wants to create such hard links immediately if it can. But it can't create a hard link if the target is occupied by a placeholder file. That means, every time GNU tar wants to create a hard link, it first has to walk the entire linked list of delayed links and see if the target is a delayed link. If the target is a delayed link, the new link must also be delayed.

Your time complexity alarm bells should be starting to ring now. For every hard link, we walk the list of all delayed links. But it actually gets worse; for reasons I don't quite understand yet, tar will go through the entire list of delayed links again if it finds out that it can create the link immediately. So for every "normal" hard link, it has to go through the entire linked list of delayed links twice.
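
To make the pattern concrete, here's my own rough illustration of the shape of the problem. This is not GNU tar's actual code or data structures, just the essence of "check every new hard link against a linked list of delayed links":

#include <string.h>

struct delayed_link {
	struct delayed_link *next;
	char *placeholder_path; /* where the placeholder file was created */
	/* ...whatever else is needed to create the real link later... */
};

static struct delayed_link *delayed_links;

/* Called for every hard link which could otherwise be created right away.
 * Walking the list is O(number of delayed links), and doing that for
 * every hard link is what makes the whole extraction quadratic. */
static int target_is_delayed(const char *target) {
	for (struct delayed_link *d = delayed_links; d; d = d->next) {
		if (strcmp(d->placeholder_path, target) == 0)
			return 1;
	}
	return 0;
}

Swap that list for a hash map keyed on the placeholder path and the check becomes constant time; that's the fix I allude to at the end of this post.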

If you're a bit crafty, you can construct a tar archive which GNU tar extracts in precisely O(n^2) time; you just need to alternate between links whose link_path has ".." as a path component and thus get delayed, and "normal" hard links which don't get delayed. If you're a bit unlucky, you might have a totally benign tarball which nevertheless happens to contain a bunch of symlinks which refer to files in a parent directory, followed by a bunch of normal hard links. This is what had happened to me. My tarball happened to contain over 800 000 links with ".." as a path component. It also happened to contain over 5.4 million hard links. Every one of those hard links had to go through the entire list of every hitherto deferred link. No wonder tar got slow.

If you ever find yourself in this situation, pass the --absolute-paths (or -P) parameter to tar. Tar's documentation says this about --absolute-paths:

Preserve pathnames. By default, absolute pathnames (those that begin with a / character) have the leading slash removed both when creating archives and extracting from them. Also, tar will refuse to extract archive entries whose pathnames contain .. or whose target directory would be altered by a symlink. This option suppresses these behaviors.

You would never guess it from reading the documentation, but when you pass --absolute-paths during extraction, tar assumes that the archive is benign and the whole delayed linking mechanism is disabled. Make sure you trust the tar archive though! When extracted with --absolute-paths, a malicious archive will be able to put files anywhere it wants.

I'm absolutely certain that it's possible to make GNU tar extract in O(n) without --absolute-paths by replacing the linked list with a hash map. But that's an adventure for another time.

References

These are the documents I've drawn information from when researching for my tar extractor and this blog post:

If I have represented anything inaccurately in this post, please do correct me.

]]>
https://mort.coffee/home/tar 23 Jul 2022 21:00 +0200
C/C++: 70x faster file embeds using string literals https://mort.coffee/home/fast-cpp-embeds <![CDATA[

Date: 2020-08-03
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00013-fast-cpp-embeds.md
Tool: https://github.com/mortie/strliteral

It's really common to want to embed some static data into a binary. Game developers want to embed their shaders. Developers of graphical apps may want to embed sounds or icons. Developers of programming language interpreters may want to embed their language's standard library. I have many times built software whose GUI is in the form of a web app served from a built-in HTTP server, where I want to embed the HTML/JS/CSS into the binary.

Since neither C nor C++ currently has a built-in way to embed files, we use work-arounds. These usually fall into one of two categories: Either we use toolchain-specific features to generate object files with the data exposed as symbols, or we generate C code which we subsequently compile to object files. Since the toolchain-specific features are, well, toolchain-specific, people writing cross-platform software generally prefer code generation.

The most common tool I'm aware of to generate C code for embedding data is xxd, whose -i option will generate C code with an unsigned char array literal.

Given the following input text:

<html>
	<head>
		<title>Hello World</title>
	</head>
	<body>
		Hello World
	</body>
</html>
index.html

The command xxd -i index.html will produce this C code:

unsigned char index_html[] = {
  0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a, 0x09, 0x3c, 0x68, 0x65, 0x61,
  0x64, 0x3e, 0x0a, 0x09, 0x09, 0x3c, 0x74, 0x69, 0x74, 0x6c, 0x65, 0x3e,
  0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x3c,
  0x2f, 0x74, 0x69, 0x74, 0x6c, 0x65, 0x3e, 0x0a, 0x09, 0x3c, 0x2f, 0x68,
  0x65, 0x61, 0x64, 0x3e, 0x0a, 0x09, 0x3c, 0x62, 0x6f, 0x64, 0x79, 0x3e,
  0x0a, 0x09, 0x09, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72,
  0x6c, 0x64, 0x0a, 0x09, 0x3c, 0x2f, 0x62, 0x6f, 0x64, 0x79, 0x3e, 0x0a,
  0x3c, 0x2f, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a
};
unsigned int index_html_len = 92;

This works fairly well. Any C or C++ compiler can compile that code and produce an object file with our static data, which we can link against to embed that data into our binary. All in a cross-platform and cross-toolchain way.

There's just one problem: It's slow. Really slow. On my laptop, embedding a megabyte this way takes 2 seconds using g++. Embedding one decent quality MP3 at 8.4MB takes 23 seconds, using 2.5 gigabytes of RAM.

bippety-boppety.mp3, an 8.4MB song

Whether or not we should embed files of that size into our binaries is a question I won't cover in this article, and the answer depends a lot on context. Regardless, processing data at roughly 400kB per second is objectively terrible. We can do so much better.

The main reason it's so slow is that parsing arbitrary C++ expressions is actually really complicated. Every single byte is a separate expression, parsed using a complex general expression parser, presumably separately allocated as its own node in the syntax tree. If only we could generate code which combines lots of bytes of data into one token...

I wrote a small tool, called strliteral, which outputs data as a string literal rather than a character array. The command strliteral index.html will produce this C code:

const unsigned char index_html[] =
	"<html>\n\t<head>\n\t\t<title>Hello World</title>\n\t</head>\n\t<body>\n\t\tHello"
	" World\n\t</body>\n</html>\n";
const unsigned int index_html_len = 92;

It should come as no surprise that this is many times faster to parse than the character array approach. Instead of invoking a full expression parser for each and every byte, most of the time will just be spent in a tight loop which reads bytes and appends them to an array. The grammar for a string literal is ridiculously simple compared to the grammar for an array literal.
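
To give an idea of what that tight loop can look like, here's a minimal sketch. It is not strliteral's actual code, and it ignores details like line wrapping and the ?/:/% escaping discussed further down; printable ASCII is passed straight through, and everything else becomes a three-digit octal escape.

#include <stdio.h>

/* Minimal sketch, not strliteral's actual code: write the contents of
 * 'in' to 'out' as the body of a C string literal. */
static void emit_string_literal(FILE *in, FILE *out) {
	int c;
	fputc('"', out);
	while ((c = fgetc(in)) != EOF) {
		if (c == '"' || c == '\\')
			fprintf(out, "\\%c", c); /* escape quotes and backslashes */
		else if (c >= ' ' && c <= '~')
			fputc(c, out); /* printable ASCII goes straight through */
		else
			fprintf(out, "\\%03o", c); /* always three octal digits */
	}
	fputc('"', out);
}

Always emitting exactly three octal digits matters, since octal escapes greedily consume up to three digits; "\70" followed by a literal '1' would otherwise be read back as the single escape "\701".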

Compared to xxd's 23 seconds and 2.5GB of RAM usage for my 8.4MB file, my strliteral tool produces code which g++ can compile in 0.6 seconds, using only 138 megs of RAM. That's almost a 40x speed-up, and an 18x reduction in RAM usage. It's processing data at a rate of 15MB per second, compared to xxd's 0.4MB per second. As a bonus, my tool generates 26MB of C code, compared to xxd's 52MB.

Here's how that song looks, encoded with strliteral:

const unsigned char bippety_boppety_mp3[] =
	"\377\373\340D\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\242\240,\2253]5\234\316\020\234\375\246\072D\307\027\203R\030\307\221\314`\243B\370\013\301\220\256"
	"\235\036\243E\246\331\216\026\004\341\362uU&\255\030@,\227\021q]1\231L\304\010E\311\231\005W\231\210"
	"j-\"\374|\210II0\221\026\045\021}qC\206\t9<\320\013\246w\350\263EmH`#\262\037\252\304\272\340\355`7\217"
	"\343*\016\236\320\345oa\217\204\361~k\224\255|\301cy\371\375\034\366K'\236\037\271\204\371\275\rV\267"
	"\252\020\245\322~\233\350\222\343\347\204\332\340~\236-\355S.W\045\365\301=\\+\236\270F\312\246g\266"
	"CX2\376\265V\242T0\337I\031\343\347\320\336\322\020\016\020H\250\007]\031\201\235\025\300h\2628d\000"
	/* 249707 lines snipped */
	"\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252"
	"\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252\252"
	"\252TAG\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000Created with LMMS\000"
	"\000\000\000\000\000\000\000\000\000\000\000\000\377";
unsigned int bippety_boppety_mp3_len = 8779359;

The difference is even bigger when processing mostly-ASCII text rather than binary data. Since xxd produces the same 6 bytes of source code for every byte of input (0x, two hex digits, comma, space), the data itself doesn't matter. However, strliteral produces 4 bytes of source code (\, then three octal digits) for every "weird" character, but just one byte of source code for every "regular" ASCII character.

Graphs

I wrote some benchmarking code to compare various aspects of xxd and strliteral. All times are measured using an Intel Core i7-8705G CPU in a Dell XPS 15 9575. g++ and xxd are from the Ubuntu 20.04 repositories. strliteral is compiled with gcc -O3 -o strliteral strliteral.c using GCC 9.3.0. The benchmarking source code can be found here: https://github.com/mortie/strliteral/tree/master/benchmark

Here's a graph which shows exactly how the two tools compare, across a range of input sizes, given either text or random binary data:

The 70x number in the title comes from this graph. The 60ms spent compiling strliteral-generated code is 72x faster than the 4324ms spent compiling xxd-generated code. Comparing random binary data instead of text would show a lower - though still respectable - speed-up of 25x.

Though most of the time spent when embedding data with xxd comes from the compiler, the xxd tool itself is actually fairly slow too:

Those ~200 milliseconds xxd takes to generate code for a 2MB file aren't very significant compared to the 4.3 second compile time, but if strliteral was equally slow, 75% of the time would've been spent generating code as opposed to compiling code. Luckily, strliteral runs through 2MB of text in 11ms.

Looking at the xxd source code, the reason it's so slow seems to be that it prints every single byte using a call to fprintf:

while ((length < 0 || p < length) && (c = getc(fp)) != EOF)
  {
    if (fprintf(fpo, (hexx == hexxa) ? "%s0x%02x" : "%s0X%02X",
                (p % cols) ? ", " : &",\n  "[2*!p],  c) < 0)
      die(3);
    p++;
  }

Finally, here's a graph over g++'s memory usage:

Caveats

Update: In the reddit discussion, someone pointed out that MSVC, Microsoft's compiler, has a fairly low maximum string length limit (the exact limit is fairly complicated). I had assumed that any modern compiler would just keep strings in a variable sized array. Maybe strliteral will eventually grow an MSVC-specific workaround, but until then, using a better compiler like Clang or GCC on Windows is an option.

Using string literals for arbitrary binary data is a bit more complicated than using an array with integer literals. Both xxd and strliteral might have trouble in certain edge cases, such as when cross-compiling if the host and target disagree on the number of bits in a byte. Using string literals adds an extra complication due to the distinction between the "source character set" and the "execution character set". The C11 spec (5.2.1p2) states:

In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters.

If you run strliteral on a file which contains the byte 97, it will output the code const unsigned char data[] = "a";. If that C code is compiled with a "source character set" of ASCII and an "execution character set" of EBCDIC, my understanding of the standard text is that the ASCII "a" (byte 97) will be translated to the EBCDIC "a" (byte 129). Whether that's even a bug or not depends on whether the intention is to embed binary data or textual data, but it's probably not what people expect from a tool to embed files.

This should only ever become an issue if you're compiling with different source and execution charsets, and the execution charset isn't based on ASCII. Compiling with a UTF-8 source charset and an EBCDIC execution charset will cause issues, but since all non-ASCII characters are printed as octal escape sequences, compiling with e.g. a UTF-8 source charset and a LATIN-1 execution charset isn't an issue.

It seems extremely unlikely to me that someone will compile with different source and execution charsets where the execution charset isn't based on ASCII, but I suppose it's something to keep in mind. If it does become an issue, the --always-escape option will cause strliteral to only generate octal escape sequences. That should work the same as xxd -i in all cases, just faster.

Some implementation notes

C is a weird language. For some reason, probably to better support systems where bytes are bigger than 8 bits, hex string escapes like "\x6c" can contain an arbitrary number of hex digits. "\xfffff" represents a string with one character whose numeric value is 1048575. That obviously won't work on machines with 8-bit bytes, but it could conceivably be useful on a machine with 24-bit bytes, so it's allowed. Luckily, octal escapes are at most 3 digits, so while "\xf8ash" won't work, "\370ash" will.

C also has a concept of trigraphs and digraphs, and they're expanded even within string literals. The string literals "??(" and "[" are identical (at least in C, and in C++ before C++17). Currently, strliteral just treats ?, : and % as "special" characters which are escaped, which means no digraphs or trigraphs will ever appear in the generated source code. I decided it's not worth the effort to add more "clever" logic which e.g. escapes a ( if the two preceding characters are question marks.

]]>
https://mort.coffee/home/fast-cpp-embeds 03 Aug 2020 15:00 +0200
Hacking on Clang is surprisingly easy https://mort.coffee/home/clang-compiler-hacking <![CDATA[

Date: 2020-01-27
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00012-clang-compiler-hacking.md

I happen to think that the current lambda syntax for C++ is kind of verbose. I'm not the only one to have thought that, and there has already been a paper discussing a possible abbreviated lambda syntax (though it was rejected).

In this blog post, I will detail my attempt to implement a sort of simplest possible version of an abbreviated lambda syntax. Basically, this:

[](auto &&a, auto &&b) => a.id() < b.id();

should mean precisely:

[](auto &&a, auto &&b) { return a.id() < b.id(); };

I will leave a discussion about whether that change is worth it or not to the end. Most of this article will just assume that we want that new syntax, and discuss how to actually implement it in Clang.

If you want to read more discussion on the topic, I wrote a somewhat controversial post on Reddit discussing why I think it might be a good idea.

Here's the implementation I will go through in this post: https://github.com/mortie/llvm/commit/e4726dc9d9d966978714fc3d85c6e9c335a38ab8 - 28 additions, including comments and whitespace, across 3 files.

Getting the Clang code

This wasn't my first time compiling Clang, but it was my first time downloading the source code with the intent to change it.

LLVM has a nice page which details getting and building Clang, but the tl;dr is:

git clone https://github.com/llvm/llvm-project.git
cd llvm-project && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/inst -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_BUILD_TYPE=Release ../llvm
make -j 8
make install

A few points to note:

  • The build will take a long time. Clang is big.
  • I prefer -DCMAKE_BUILD_TYPE=Release because it's way faster to build. Linking Clang with debug symbols and everything takes ages and will OOM your machine.
  • This will install your built clang to inst (short for "install"). The clang binary itself will be in inst/bin/clang.

Now that we have a clang setup, we can have a look at how the project is laid out, and play with it.

Changing the Clang code

The feature I want to add is very simple: Basically, I want [] => 10 to mean the exact same thing as [] { return 10; }. In order to understand how one would achieve that, an extremely short introduction to how compilers work is necessary:

Our code is just a sequence of bytes, like [] => 10 + 20. In order for Clang to make sense of that, it will go through many steps. We can basically divide a compiler into two parts: the "front-end", which goes through many steps to build a thorough understanding of the code as a tree structure, and the "back-end" which goes through many steps to remove information, eventually ending up with a simple series of bytes again, but this time as machine code instead of ASCII.

We'll ignore the back-end for now. The front-end basically works like this:

  1. Split the stream of bytes into a stream of tokens. This step turns [] => 10 + 20 into something like (open-bracket) (close-bracket) (fat-arrow) (number: 10) (plus) (number: 20).
  2. Go through those tokens and construct a tree. This step turns the sequence of tokens into a tree: (lambda-expression (body (return-statement (add-expression (number 10) (number 20))))) (Yeah, this looks a lot like Lisp. There's a reason people say Lisp basically has no syntax; you're just writing out the syntax tree by hand.)
  3. Add semantic information, such as types.

The first phase is usually called lexical analysis, or tokenization, or scanning. The second phase is what we call parsing. The third phase is usually called semantic analysis or type checking.

Well, the change I want to make involves adding a new token, the "fat arrow" token =>. That means we'll have to find out how the lexer (or tokenizer) is implemented; where it keeps its list of valid tokens types, and where it turns the input text into tokens. After some grepping, I found the file clang/include/clang/Basic/TokenKinds.def, which includes a bunch of token descriptions, such as PUNCTUATOR(arrow, "->"). This file seems to be a "supermacro"; a file which exists to be included by another file as a form of macro expansion.

I added PUNCTUATOR(fatarrow, "=>") right below the PUNCTUATOR(arrow, "->") line.

Now that we have defined our token, we need to get the lexer to actually generate it.

After some more grepping, I found clang/lib/Lex/Lexer.cpp, where the Lexer::LexTokenInternal function is what's actually looking at characters and deciding what tokens they represent. It has a case statement to deal with tokens which start with an = character:

case '=':
	Char = getCharAndSize(CurPtr, SizeTmp);
	if (Char == '=') {
		// If this is '====' and we're in a conflict marker, ignore it.
		if (CurPtr[1] == '=' && HandleEndOfConflictMarker(CurPtr-1))
			goto LexNextToken;

		Kind = tok::equalequal;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);
	} else {
		Kind = tok::equal;
	}
	break;

Given that, the change to support my fatarrow token is really simple:

case '=':
	Char = getCharAndSize(CurPtr, SizeTmp);
	if (Char == '=') {
		// If this is '====' and we're in a conflict marker, ignore it.
		if (CurPtr[1] == '=' && HandleEndOfConflictMarker(CurPtr-1))
		goto LexNextToken;

		Kind = tok::equalequal;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);

	// If the first character is a '=', and it's followed by a '>', it's a fat arrow
	} else if (Char == '>') {
		Kind = tok::fatarrow;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);

	} else {
		Kind = tok::equal;
	}
	break;

Now that we have a lexer which generates a tok::fatarrow any time it encounters a => in our code, we can start changing the parser to make use of it.

Since I want to change lambda parsing, the code which parses a lambda seems like a good place to start (duh). I found that in a file called clang/lib/Parse/ParseExprCXX.cpp, in the function ParseLambdaExpressionAfterIntroducer. Most of the function deals with things like the template parameter list and trailing return type, which I don't want to change, but the very end of the function contains this gem:

// Parse compound-statement.
if (!Tok.is(tok::l_brace)) {
	Diag(Tok, diag::err_expected_lambda_body);
	Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
	return ExprError();
}

StmtResult Stmt(ParseCompoundStatementBody());
BodyScope.Exit();
TemplateParamScope.Exit();

if (!Stmt.isInvalid() && !TrailingReturnType.isInvalid())
	return Actions.ActOnLambdaExpr(LambdaBeginLoc, Stmt.get(), getCurScope());

Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
return ExprError();

In other words:

  1. If the next token isn't an opening brace, error.
  2. Parse a compound statement body (i.e consume a {, read statements until the }).
  3. After some housekeeping, act on the now fully parsed lambda expression.

In principle, what we want to do is to check if the next token is a => instead of a {; if it is, we want to parse an expression instead of a compound statement, and then somehow pretend that the expression is a return statement. Through some trial, error and careful copy/pasting, I came up with this block of code which I put right before the if (!Tok.is(tok::l_brace)):

// If this is an arrow lambda, we just need to parse an expression.
// We parse the expression, then put that expression in a return statement,
// and use that return statement as our body.
if (Tok.is(tok::fatarrow)) {
	SourceLocation ReturnLoc(ConsumeToken());

	ExprResult Expr(ParseExpression());
	if (Expr.isInvalid()) {
		Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
		return ExprError();
	}

	StmtResult Stmt = Actions.ActOnReturnStmt(ReturnLoc, Expr.get(), getCurScope());

	BodyScope.Exit();
	TemplateParamScope.Exit();

	if (!Stmt.isInvalid() && !TrailingReturnType.isInvalid())
		return Actions.ActOnLambdaExpr(LambdaBeginLoc, Stmt.get(), getCurScope());

	Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
	return ExprError();
}

// Otherwise, just parse a compound statement as usual.
if (!Tok.is(tok::l_brace)) ...

This is really basic; if the token is a => instead of a {, parse an expression, then put that expression into a return statement, and then use that return statement as our lambda's body.

And it works! Lambda expressions with fat arrows are now successfully parsed as if they were regular lambdas whose body is a single return statement:

Demonstration of our new feature

Was it worth it?

Implementing this feature into Clang was definitely worth it just to get more familiar with how the code base works. However, is the feature itself a good idea at all?

I think the best way to decide if a new syntax is better or not is to look at old code which could've made use of the new syntax, and decide if the new syntax makes a big difference. Therefore, and now that I have a working compiler, I have gone through all the single-expression lambdas and replaced them with my fancy new arrow lambdas in some projects I'm working on.


Before:

std::erase_if(active_chunks_, [](Chunk *chunk) { return !chunk->isActive(); });

After:

std::erase_if(active_chunks_, [](Chunk *chunk) => !chunk->isActive());

This code deletes a chunk from a game world if the chunk isn't currently active. In my opinion, the version with the arrow function is a bit clearer, but a better solution could be C++17's std::invoke. If I understand std::invoke correctly, if C++ was to adopt std::invoke for algorithms, this code could be written like this:

std::erase_if(active_chunks_, &Chunk::isInactive);

This looks nicer, but has the disadvantage that you need to add an extra method to the class. Having both isActive and its negation isInactive as member functions just because someone might want to use it as a predicate in an algorithm sounds unfortunate. I prefer lambdas' flexibility.


Before:

return map(begin(worldgens_), end(worldgens_), [](auto &ptr) { return ptr.get(); });

After:

return map(begin(worldgens_), end(worldgens_), [](auto &ptr) => ptr.get());

This code maps a vector of unique pointers to raw pointers. This is yet another case where I think the arrow syntax is slightly nicer than the C++11 alternative, but this time, we could actually use the member function invocation if I changed my map function to use std::invoke:

return map(begin(worldgens_), end(worldgens_), &std::unique_ptr<Swan::WorldGen::Factory>::get);

Well, this illustrates that invoking a member function doesn't really work with overly complex types. Imagine if the type was instead something more elaborate:

return map(begin(worldgens_), end(worldgens_),
	&std::unique_ptr<Swan::WorldGen<int>::Factory, Swan::WorldGen<int>::Factory::Deleter>::get);

This also happens to be unspecified behavior, because taking the address of a function in the standard library is generally not legal. From https://en.cppreference.com/w/cpp/language/extending_std:

The behavior of a C++ program is unspecified (possibly ill-formed) if it explicitly or implicitly attempts to form a pointer, reference (for free functions and static member functions) or pointer-to-member (for non-static member functions) to a standard library function or an instantiation of a standard library function template, unless it is designated an addressable function.


Before:

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	[](auto &w) { return w.needRender(); });

After:

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	[](auto &w) => w.needRender());

Again, the short lambda version looks a bit better to me. However, here again, we could replace the lambda with a member reference if algorithms were changed to use std::invoke:

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	&WidgetContainer::needRender);

Overall, I see a short lambda syntax as a modest improvement. The biggest readability win mostly stems from the lack of that awkward ; }); at the end of an expression; foo([] => bar()) instead of foo([] { return bar(); });. It certainly breaks down a bit when the argument list is long; neither of these two lines are particularly short:

foo([](auto const &&a, auto const &&b, auto const &&c) => a + b + c);
foo([](auto const &&a, auto const &&b, auto const &&c) { return a + b + c; });

I think, considering the minimal cost of implementing this short lambda syntax, the modest improvements outweigh the added complexity. However, there's also an opportunity cost associated with my version of the short lambda syntax: it makes a better, future short lambda syntax either impossible or more challenging. For example, accepting my syntax would mean we couldn't really adopt a ratified version of P0573R2's short lambda syntax in the future, even if the issues with it were otherwise fixed.

Therefore, I will argue strongly that my syntax makes code easier to read, but I can't say anything about whether it should be standardized or not.

Aside: Corner cases

If we were to standardize this syntax, we would have to consider all kinds of corner cases, not just accept what clang with my changes happens to do. However, I'm still curious about what exactly clang happens to do with my changes.

How does this interact with the comma operator?

The comma operator in C++ (and most C-based languages) is kind of strange. For example, what does foo(a, b) do? We know it calls foo with the arguments a and b, but you could technically decide to parse it as foo(a.operator,(b)).

My arrow syntax parses foo([] => 10, 20) as calling foo with one argument; a function with the body 10, 20 (where the comma operator means the 10 does nothing, and 20 is returned). I would probably want that to be changed, so that foo is called with two arguments; a lambda and an int.

This turns out to be fairly easy to fix, because Clang already has ways of dealing with expressions which can't include top-level commas. After all, there's precedent here; since clang parses foo(10, 20) without interpreting the top-level comma as a comma operator, we can use the same infrastructure for arrow lambdas.

In clang/lib/Parser/ParseExpr.cpp, Clang defines a function ParseAssignmentExpression, which has this comment:

Parse an expr that doesn't include (top-level) commas.

Calling ParseAssignmentExpression is also the first thing the ParseExpression function does. It seems like it's just the general function for parsing an expression without a top-level comma operator, even though the name is somewhat misleading. This patch changes arrow lambdas to use ParseAssignmentExpression instead of ParseExpression: https://github.com/mortie/llvm/commit/c653318c0056d06a512dfce0799b66032edbed4c

How do immediately invoked lambdas work?

With C++11 lambdas, you can write an immediately invoked lambda in the obvious way; just do [] { return 10; }(), and that expression will return 10. With my arrow lambda syntax, it's not quite as obvious. Would [] => foo() be interpreted as immediately invoking the lambda [] => foo, or would it be interpreted as creating a lambda whose body is foo()?

In my opinion, the only sane way for arrow lambdas to work would be that [] => foo() creates a lambda with the body foo(), and that creating an immediately invoked lambda would require extra parens; ([] => foo())(). That's also how my implementation happens to work.

How does this interact with explicit return types, specifiers, etc?

Since literally all the code before the arrow/opening brace is shared between arrow lambdas and C++11 lambdas, everything should work exactly the same. That means that all of these statements should work:

auto l1 = []() -> long => 10;
auto l2 = [foo]() mutable => foo++;
auto l3 = [](auto a, auto b) noexcept(noexcept(a + b)) => a + b;

And so would any other combination of captures, template params, params, specifiers, attributes, constraints, etc, except that the body has to be a single expression.

]]>
https://mort.coffee/home/clang-compiler-hacking 27 Jan 2020 12:00 GMT
C compiler quirks I have encountered https://mort.coffee/home/c-compiler-quirks <![CDATA[

Date: 2018-07-26
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00011-c-compiler-quirks.md

In a previous blog post, I wrote about some weird features of C, the C preprocessor, and GNU extensions to C that I used in my testing library, Snow.

This post will be about some of the weird compiler and language quirks, limitations, and annoyances I've come across. I don't mean to bash compilers or the specification; most of these quirks have good technical or practical reasons.

Compilers lie about what version of the standard they support

There's a handy macro, called __STDC_VERSION__, which describes the version of the C standard your C implementation conforms to. We can check #if (__STDC_VERSION__ >= 201112L) to see if our C implementation conforms to C11 or higher (C11 was published in December 2011, hence 2011 12). That's really useful if, say, you're a library author and have a macro which uses _Generics, but also have alternative ways of doing the same and want to warn people when they use the C11-only macro in an older compiler.

In theory, this should always work; any implementation of C which conforms to all of C11 will define __STDC_VERSION__ as 201112L, while any implementation which doesn't conform to C11, but conforms to some earlier version, will define __STDC_VERSION__ to be less than 201112L. Therefore, unless the _Generic feature gets removed in a future version of the standard, __STDC_VERSION__ >= 201112L means that we can safely use _Generic.

Sadly, the real world is not that clean. Already in GCC 4.7, you could enable C11 by passing in -std=c11, which would set __STDC_VERSION__ to 201112L, but the first release to actually implement all non-optional features of C11 was GCC 4.9. That means, if we just check the value of __STDC_VERSION__, users on GCC 4.7 and GCC 4.8 who use -std=c11 will see really confusing error messages instead of our nice error message. Annoyingly, GCC 4.7 and 4.8 happen to still be extremely widespread versions of GCC. (Relevant: GCC Wiki's C11Status page)

The solution still seems relatively simple; just don't use -std=c11. More recent compilers default to C11 anyways, and there's no widely used compiler that I know of which will default to setting __STDC_VERSION__ to C11 without actually supporting all of C11. That works well enough, but there's one problem: GCC 4.9 supports all of C11 just fine, but only if we give it -std=c11. GCC 4.9 also seems to be one of those annoyingly widespread versions of GCC, so we'd prefer to encourage users to set -std=c11 and make the macros which rely on _Generic work in GCC 4.9.

Again, the solution seems obvious enough, if a bit ugly: if the compiler is GCC, we only use _Generic if the GCC version is 4.9 or greater and __STDC_VERSION__ is C11. If the compiler is not GCC, we just trust it if it says it supports C11. This should in theory work perfectly:

#if (__STDC_VERSION__ >= 201112L)
# ifdef __GNUC__
#  if (__GNUC__ >= 5 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))
#   define IS_C11
#  endif
# else
#  define IS_C11
# endif
#endif

Our new IS_C11 macro should now always be defined if we can use _Generic and always not be defined when we can't use _Generic, right?

Wrong. It turns out that in their quest to support code written for GCC, Clang also defines the __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ macros, specifically to fool code which checks for GCC into thinking Clang is GCC. However, it doesn't really go far enough; it defines the __GNUC_* macros to correspond to the version of clang, not the version of GCC which Clang claims to imitate. Clang gained support for C11 in 3.6, but using our code, we would conclude that it doesn't support C11 because __GNUC__ is 3 and __GNUC_MINOR__ is 6. Update: it turns out that Clang always pretends to be GCC 4.2, but the same issue still applies; __GNUC__ is 4, and __GNUC_MINOR__ is 2, so it fails our version check. We can solve this by adding a special case for when __clang__ is defined:

#if (__STDC_VERSION__ >= 201112L)
# if defined(__GNUC__) && !defined(__clang__)
#  if (__GNUC__ >= 5 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))
#   define IS_C11
#  endif
# else
#  define IS_C11
# endif
#endif

Now our code works with both Clang and with GCC, and should work with all other compilers which don't try to imitate GCC - but for every compiler which does imitate GCC, we would have to add a new special case. This is starting to smell a lot like user agent strings.

The Intel compiler is at least nice enough to define __GNUC__ and __GNUC_MINOR__ to correspond to the version of GCC installed on the system; so even though our version check is completely irrelevant for the Intel compiler, at least it will only prevent an otherwise C11-compliant Intel compiler from using _Generic if the user has an older version of GCC installed.

User: Hi, I'm using the Intel compiler, and your library claims my compiler doesn't support C11, even though it does.

You: Upgrading GCC should solve the issue. What version of GCC do you have installed?

User: ...but I'm using the Intel compiler, not GCC.

You: Still, what version of GCC do you have?

User: 4.8, but I really don't see how that's relevant...

You: Try upgrading GCC to at least version 4.9.

(Relevant: Intel's Additional Predefined Macros page)

_Pragma in macro arguments

C has had pragma directives for a long time. It's a useful way to tell our compiler something implementation-specific; something which there's no way to say using only standard C. For example, using GCC, we could use a pragma directive to tell our compiler to ignore a warning for a couple of lines, without changing warning settings globally:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
// my_float being 0 indicates a horrible failure case.
if (my_float == 0)
	abort();
#pragma GCC diagnostic pop

We might also want to define a macro which outputs the above code, so C99 introduced the _Pragma operator, which works like #pragma, but can be used in macros. Once this code goes through the preprocessor, it will do exactly the same as the above code:

#define abort_if_zero(x) \
	_Pragma("GCC diagnostic push") \
	_Pragma("GCC diagnostic ignored \"-Wfloat-equal\"") \
	if (x == 0) \
		abort(); \
	_Pragma("GCC diagnostic pop")

abort_if_zero(my_float);

Now, imagine that we want a macro to trace certain lines; a macro which takes a line of code, and prints that line of code while executing the line. This code looks completely reasonable, right?

#define trace(x) \
	fprintf(stderr, "TRACE: %s\n", #x); \
	x

trace(abort_if_zero(my_float));

However, if we run that code through GCC's preprocessor, we see this mess:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
#pragma GCC diagnostic pop
fprintf(stderr, "TRACE: %s\n", "abort_if_zero(my_float)"); if (my_float == 0) abort();

The pragmas all got bunched up at the top! From what I've heard, this isn't against the C standard, because the standard isn't entirely clear on what happens when you send in _Pragma operators as macro arguments, but it sure surprised me when I encountered it nonetheless.

For the Snow library, this means that there are certain warnings which I would have loved to only disable for a few lines, but which I have to disable for all code following the #include <snow/snow.h> line.

Side note: Clang's preprocessor does exactly what one would expect, and produces this output:

fprintf(stderr, "TRACE: %s\n", "abort_if_zero(my_float)");
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
 if (my_float == 0) abort();
#pragma GCC diagnostic pop

Line numbers in macro arguments

Until now, the quirks I've shown have been issues you could potentially encounter in decent, real-world code. If this quirk has caused issues for you however, it might be a sign that you're slightly over-using macros.

All testing code in Snow happens within macro arguments. This allows for what I think is a really nice looking API, and allows all testing code to be disabled just by changing one macro definition. This is a small example of a Snow test suite:

#include <stdio.h>
#include <snow/snow.h>

describe(files, {
	it("writes to files", {
		FILE *f = fopen("testfile", "w");
		assertneq(f, NULL);
		defer(remove("testfile"));
		defer(fclose(f));

		char str[] = "hello there";
		asserteq(fwrite(str, 1, sizeof(str), f), sizeof(str));
	});
});

snow_main();

If that assertneq or asserteq fails, we would like and expect to see a line number. Unfortunately, after the code goes through the preprocessor, the entire nested macro expansion ends up on a single line. All line number information is lost. __LINE__ just returns the number of the last line of the macro expansion, which is 14 in this case. All __LINE__ expressions inside the block we pass to describe will return the same number. I have googled around a bunch for a solution to this issue, but none of the solutions I've looked at actually solve the issue. The only actual solution I can think of is to write my own preprocessor.

Some warnings can't be disabled with pragma

Like the above example, this is probably an issue you shouldn't have come across in production code.

First, some background. In Snow, both the code which is being tested and the test cases can be in the same file. This is to make it possible to test static functions and other functionality which isn't part of the component's public API. The idea is that at the bottom of the file, after all non-testing code, one should include <snow/snow.h> and write the test cases. In a non-testing build, all the testing code will be removed by the preprocessor, because the describe(...) macro expands to nothing unless SNOW_ENABLED is defined.

My personal philosophy is that your regular builds should not have -Werror, and that your testing builds should have as strict warnings as possible and be compiled with -Werror. Your users may be using a different compiler version from you, and that compiler might produce some warnings which you haven't fixed yet. Being a user of a rolling release distro, with a very recent version of GCC, I have way too often had to edit someone else's Makefile and remove -Werror just to make their code compile. Compiling the test suite with -Werror and regular builds without -Werror has none of the drawbacks of using -Werror for regular builds, and most or all of the advantages (at least if you don't accept contributions which break your test suite).

This all means that I want to be able to compile all files with at least -Wall -Wextra -Wpedantic -Werror, even if the code includes <snow/snow.h>. However, Snow contains code which produces warnings (and therefore errors) with those settings; among other things, it uses some GNU extensions which aren't actually part of the C standard.

I would like to let users of Snow compile their code with at least -Wall -Wextra -Wpedantic -Werror, but Snow has to disable at least -Wpedantic for all code after the inclusion of the library. In theory, that shouldn't be an issue, right? We just include #pragma GCC diagnostic ignored "-Wpedantic" somewhere.

Well, as it turns out, disabling -Wpedantic with a pragma doesn't disable all the warnings enabled by -Wpedantic; there are some warnings which are impossible to disable once they're enabled. One such warning is about using directives (like #ifdef) inside macro arguments. As I explained earlier, everything in Snow happens inside of macro arguments. That means that when compiling with -Wpedantic, this code produces a warning which it's impossible to disable without removing -Wpedantic from the compiler's arguments:

describe(some_component, {
#ifndef __MINGW32__
	it("does something which can't be tested on mingw", {
		/* ... */
	});
#endif
});

That's annoying, because it's perfectly legal in GNU's dialect of C. The only reason we can't do it is that it just so happens to be impossible to disable that particular warning with a pragma.

To be completely honest, this issue makes complete sense. I imagine the preprocessor stage, which is where macros are expanded, doesn't care much about pragmas. It feels unnecessary to implement pragma parsing for the preprocessor just in order to let people compile files with -Wpedantic but still selectively disable this particular warning. That doesn't make it less annoying though.

Funnily enough, I encountered this issue while writing Snow's test suite. My solution was to just define a macro called NO_MINGW which is empty if __MINGW32__ is defined, and expands to the contents of its arguments otherwise.
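
In case that sounds abstract, here's roughly what such a macro can look like (a sketch of the idea; the real macro in Snow's test suite may differ in the details):

#ifdef __MINGW32__
# define NO_MINGW(...)
#else
# define NO_MINGW(...) __VA_ARGS__
#endif

/* The #ifdef now lives outside of any macro argument: */
describe(some_component, {
	NO_MINGW(
		it("does something which can't be tested on mingw", {
			/* ... */
		});
	)
});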

]]>
https://mort.coffee/home/c-compiler-quirks 26 Jul 2018 12:00 GMT
Some obscure C features you might not know about https://mort.coffee/home/obscure-c-features <![CDATA[

Date: 2018-01-25
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00010-obscure-c-features.md

I have been working on Snow, a unit testing library for C. I wanted to see how close I could come to making a DSL (domain specific language) with its own syntax and features, using only the C preprocessor and more obscure C features and GNU extensions. I will not go into detail about how Snow works unless it's directly relevant, so I recommend taking a quick look at the readme on the GitHub page.

Sending blocks as arguments to macros

Let's start with the trick that's probably both the most useful in everyday code, and the least technically complicated.

Originally, I defined macros like describe, subdesc, and it similar to this:

#define describe(name, block) \
	void test_##name() { \
		/* some code, omitted for brevity */ \
		block \
		/* more code */ \
	}

The intended use would then be like this:

describe(something, {
	/* code */
});

The C preprocessor doesn't really understand the code; it only copies and pastes strings around. It splits the string between the opening ( and the closing ) by comma; that means, in this case, something would be sent in as the first argument, and { /* code */ } as the second argument (pretend /* code */ is actual code; the preprocessor actually strips out comments). The C preprocessor is smart enough to know that you might want to pass function calls to macros, and function calls contain commas, so parentheses will "guard" the commas they contain. describe(something, foo(10, 20)) would therefore pass something as the first argument, and foo(10, 20) as the second argument.

Now, we're not passing in function calls, but blocks. The preprocessor only considers parentheses; braces { } or brackets [ ] don't guard their contents. That means this call will fail:

describe(something, {
	int a, b;
	/* code */
});

The preprocessor will interpret something as the first argument, { int a as the second argument, and b; /* code */ } as the third argument, but describe only takes two arguments! The preprocessor will halt and show an error message.

So, how do we fix this? Not being able to write commas outside of parentheses in our blocks is quite the limitation. Not only does it prevent us from declaring multiple variables in one statement, it also messes with array declarations like int foo[] = { 10, 20, 30 };.

Well, the preprocessor supports variadic macros; macros which can take an unlimited number of arguments. The way they are implemented is that any extra arguments (indicated by ... in the macro definition) are made available through the __VA_ARGS__ identifier; __VA_ARGS__ is replaced with all the extra arguments separated by commas. So, what happens if we define the macro like this?

#define describe(name, ...) \
	void test_##name() { \
		/* some code, omitted for brevity */ \
		__VA_ARGS__ \
		/* more code */ \
	}

Let's call describe like we did above:

describe(something, {
	int a, b;
	/* code */
});

Now, the arguments will be interpreted the same way as before; something will be the first argument, { int a will be the second argument, and b; /* code */ } will be the third. However, __VA_ARGS__ will be replaced by the second and third argument with a comma inbetween, and together they produce { int a, b; /* code */ }, just as we intended. The entire describe call will be expanded into this (with added newlines and indentation for clarity; the actual preprocessor would put it all on one line):

void test_something() {
	/* some code, omitted for brevity */
	{
		int a, b;
		/* code */
	}
	/* more code */
}

And just like that, we successfully passed a block of code, with unguarded commas, to a macro.

Credit for this solution goes to this stackoverflow answer.

Generic macros with _Generic

I wanted to be able to use one set of macros, asserteq and assertneq, for most simple equality checks, instead of having to write asserteq_str for strings, asserteq_int for integers, etc. The C11 standard added the _Generic keyword, which sounds like it's perfect for that; given a list of types and expressions, _Generic will choose the expression whose associated type is compatible with a controlling expression. For example, this code will print "I am an int":

_Generic(10,
	int: printf("I am an int\n"),
	char *: printf("I am a string\n")
);

By itself, _Generic isn't terribly useful, but it can be used to make faux-generic function-like macros. The cppreference.com page uses the example of a generic cbrt (cube root) macro:

#define cbrt(x) _Generic((x), \
	long double: cbrtl, \
	float: cbrtf, \
	default: cbrt)(x)

Calling cbrt on a long double will now call cbrtl, while calling cbrt on a double will call the regular cbrt function, etc. Note that _Generic is not part of the preprocessor; the preprocessor will just spit out the _Generic syntax with x replaced with the macro's argument, and it's the actual compiler's job to figure out what type the controlling expression is and choose the appropriate expression.

I have a bunch of asserteq functions for the various types; asserteq_ptr(void *a, void *b), asserteq_int(intmax_t a, intmax_t b), asserteq_str(const char *a, const char *b), etc. (In reality, the function signatures are a lot uglier, and they're prefixed with _snow_, but for the sake of this article, I'll pretend they look like void asserteq_<suffix>(<type> a, <type> b)).

At first glance, _Generic looks perfect for this use case; just define an asserteq macro like this:

#define asserteq(a, b) _Generic((b), \
	const char *: asserteq_str, \
	char *: asserteq_str, \
	void *: asserteq_ptr, \
	int: asserteq_int)(a, b)

It's sadly not that simple. _Generic will match only specific types; int matches only int, not long. void * matches void pointers, not any other form of pointer. There's no way to say "match every pointer type", for example.

However, there is a default clause, just like in switch statements. My first solution was to just pass anything not otherwise specified to asserteq_int, and use _Pragma (like #pragma, but can be used inside macros) to ignore the warnings:

#define asserteq(a, b) \
	do { \
		_Pragma("GCC diagnostic push") \
		_Pragma("GCC diagnostic ignored \"-Wint-conversion\"") \
		_Generic((b), \
			const char *: asserteq_str, \
			char *: asserteq_str, \
			default: asserteq_int)(a, b); \
		_Pragma("GCC diagnostic pop") \
	} while (0)

That solution worked but it's not exactly nice. I assume it would eventually break, either due to compiler optimizations or due to weird systems where an intmax_t is smaller than a pointer or whatever. Luckily, the good people over in ##C@freenode had an answer: subtracting a pointer from a pointer results in a ptrdiff_t! That means we can nest _Generics, and appropriately choose asserteq_int for any integer types, or asserteq_ptr for any pointer types:

#define asserteq(a, b) _Generic((b), \
	const char *: asserteq_str, \
	char *: asserteq_str, \
	default: _Generic((b) - (b), \
		ptrdiff_t: asserteq_ptr, \
		default: asserteq_int))(a, b)

Defer, label pointers, and goto *(void *)

I once saw a demonstration of Golang's defer statement, and fell in love. It immediately struck me as a much better way to clean up than relying solely on the try/catch stuff we've been used to ever since 1985. Naturally, I wanted to use that for tearing down test cases in Snow, but there's not exactly any obvious way to implement it in C.

For those unfamiliar with it, in Go, defer is basically a way to say, "run this expression once the function returns". It works like a stack; when the function returns, the most recently deferred expression will be executed first, and the first deferred expression will be executed last. The beautiful part is that even if the function returns early, either because some steps can be skipped, or because something failed, all the appropriate deferred expressions, and only the appropriate deferred expressions, will be executed. Replace "function" with "test case", and it sounds perfect for tearing down tests.

So, how would you implement that in C? Well, it turns out that GCC has two useful non-standard extensions (which are also supported by Clang by the way): local labels, and labels as values.

Local labels are basically regular labels which you can jump to with goto, but instead of being global to the entire function, they're only available in the block they're declared in. That's fairly straightforward. You declare that a label should be block scoped by just putting __label__ label_name; at the top of the block, and then you can use label_name: anywhere within the block to actually create the label. A goto label_name from anywhere within the block will then go to the label, as expected.
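
Here's a tiny illustration of my own: a block-scoped label used to break out of a nested loop, without the label leaking into the rest of the function.

{
	__label__ out; /* 'out' only exists within this block */

	for (int i = 0; i < 10; i++) {
		for (int j = 0; j < 10; j++) {
			if (i * j == 42)
				goto out; /* breaks out of both loops */
		}
	}

	out: ;
}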

Labels as values is weirder. GCC adds a new unary && operator, which gets a pointer to a label as a void *. Moreover, if you save that pointer in a variable which is accessible outside the block, you can jump back into that block from outside of it, even though it's a local label. This will print "hello" in an infinite loop:

{
	void *somelabel;

	{
		__label__ lbl;
		lbl:
		somelabel = &&lbl;
		printf("hello\n");
	}

	goto *somelabel;
}

Yes, the somelabel is a void *. Yes, we dereference somelabel to go to it. I don't know how that works, but the important part is that it does. Other than being dereferencable, the void * we get from the unary && works exactly like any other void *, and can even be in an array. Knowing this, implementing defer isn't too hard; here's a simplified implementation of the it(description, block) macro (using the __VA_ARGS__ trick from before) which describes one test case, and the defer(expr) macro which can be used inside the it block:

#define it(description, ...) \
	do { \
		__label__ done_label; \
		void *defer_labels[32]; \
		int defer_count = 0; \
		int run_defer = 0; \
		__VA_ARGS__ \
		done_label: \
		run_defer = 1; \
		if (defer_count > 0) { \
			defer_count -= 1; \
			goto *defer_labels[defer_count]; \
		} \
	} while (0)

#define defer(expr) \
	do { \
		__label__ lbl; \
		lbl: \
		if (run_defer) { \
			expr; \
			/* Go to the previous defer, or the end of the `it` block */ \
			if (defer_count > 0) { \
				defer_count -= 1; \
				goto *defer_labels[defer_count]; \
			} else { \
				goto done_label; \
			} \
		} else { \
			defer_labels[defer_count] = &&lbl; \
			defer_count += 1; \
		} \
	} while (0)

That might not be the most understandable code you've ever seen, but let's break it down with an example.

it("whatever", {
	printf("Hello World\n");
	defer(printf("world\n"));
	defer(printf("hello "));
});

Running that through the preprocessor, we get this code:

do {
	__label__ done_label;
	void *defer_labels[32];
	int defer_count = 0;
	int run_defer = 0;

	{
		printf("Hello World\n");

		do {
			__label__ lbl;
			lbl:
			if (run_defer) {
				printf("world\n");

				/* Go to the previous defer, or the end of the `it` block */
				if (defer_count > 0) {
					defer_count -= 1;
					goto *defer_labels[defer_count];
				} else {
					goto done_label;
				}
			} else {
				defer_labels[defer_count] = &&lbl;
				defer_count += 1;
			}
		} while (0);

		do {
			__label__ lbl;
			lbl:
			if (run_defer) {
				printf("hello ");

				/* Go to the previous defer, or the end of the `it` block */
				if (defer_count > 0) {
					defer_count -= 1;
					goto *defer_labels[defer_count];
				} else {
					goto done_label;
				}
			} else {
				defer_labels[defer_count] = &&lbl;
				defer_count += 1;
			}
		} while (0);
	}

	done_label:
	run_defer = 1;
	if (defer_count > 0) {
		defer_count -= 1;
		goto *defer_labels[defer_count];
	}
} while (0);

That's still not extremely obvious on first sight, but it's at least more obvious than staring at the macro definitions. The first time through, run_defer is false, so both the defer blocks will just add their labels to the defer_labels array and increment defer_count. Then, just through normal execution (without any goto), we end up at the label called done_label, where we set run_defer to true. Because defer_count is 2, we decrement defer_count and jump to defer_labels[1], which is the last defer.

This time, because run_defer is true, we run the deferred expression printf("hello "), decrement defer_count again, and jump to defer_labels[0], which is the first defer.

The first defer runs its expression, printf("world\n"), but because defer_count is now 0, we jump back to done_label. defer_count is of course still 0, so we just exit the block.

The really nice thing about this system is that a failing assert can at any time just say goto done_label, and only the expressions which were deferred before the goto will be executed.
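
For illustration, a failing assertion in this style might look something like the following sketch. The my_assert name and its message are made up for this example, and Snow's real assertion macros do more (file and line information, formatted messages, and so on), but the goto done_label part is the important bit.

#define my_assert(cond) \
	do { \
		if (!(cond)) { \
			/* Report the failure, then bail out of the test case; */ \
			/* only the defers registered so far will run. */ \
			fprintf(stderr, "Assertion failed: %s\n", #cond); \
			goto done_label; \
		} \
	} while (0)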

(Note: in the actual implementation in Snow, defer_labels is of course a dynamically allocated array which is realloc'd when necessary. It's also global to avoid an allocation and free for every single test case. I omitted that part because it's not that relevant, and would've made the example code unnecessarily complicated.)

Update: A bunch of people on Reddit and Hacker News have suggested ways to accomplish this. I ended up using the __attribute__((constructor)) function attribute, which makes a given function execute before the main function. Basically, each describe creates a function called test_##name, and a constructor function called _snow_constructor_##name whose only job is to add test_##name to a global list of functions. Here's the code: https://github.com/mortie/snow/blob/7ee25ebbf0edee519c6eb6d36b82d784b0fdcbfb/snow/snow.h#L393-L421
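
In rough strokes, the constructor approach looks something like this minimal sketch. The names test_funcs and test_count are made up for the example; the real code linked above handles registration and growth of the list differently.

#include <stdio.h>

typedef void (*test_func)(void);

/* Global list of test functions, filled in before main runs */
static test_func test_funcs[512];
static int test_count = 0;

#define describe(name, ...) \
	void test_##name(void) { __VA_ARGS__ } \
	__attribute__((constructor)) \
	static void _snow_constructor_##name(void) { \
		test_funcs[test_count++] = &test_##name; \
	}

describe(foo, {
	printf("Hello from foo\n");
})

int main(void) {
	/* Every constructor has already run, so every test is registered */
	for (int i = 0; i < test_count; ++i)
		test_funcs[i]();
}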

Automatically call all functions created by describe

The describe macro is meant to be used at the top level, outside of functions, because it creates functions. It's basically just this:

#define describe(name, ...) \
	void test_##name() { \
		__VA_ARGS__ \
	}

Calling describe(something, {}) will create a function called test_something. Currently, that function has to be called manually, because no other part of Snow knows what the function is named. If you have used the describe macro to define the functions test_foo, test_bar, and test_baz, the main function will look like this:

snow_main({
	test_foo();
	test_bar();
	test_baz();
})

I would have loved it if snow_main could just know what functions are declared by describe, and automatically call them. I will go over a couple of ways I tried, which eventually turned out to not be possible, and then one way which would definitely work, but which is a little too crazy, even for me.

Static array of function pointers

What if, instead of just declaring functions with describe, we also appended them to an array of function pointers? What if snow.h contained code like this:

void (*described_functions[512])();

#define describe(name, ...) \
	void test_##name() { \
		__VA_ARGS__ \
	} \
	described_functions[__COUNTER__] = &test_##name

__COUNTER__ is a special macro which starts at 0, and is incremented by one every time it's referenced. That means that, assuming nothing else uses __COUNTER__, this solution would have worked, and would have been relatively clean, if only it were valid syntax. Sadly, you can't assign to an array element like that at the top level in C; such assignments are only allowed inside functions.
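
To spell out that limitation with a tiny (hypothetical) example: an assignment is a statement, and statements can only appear inside functions, so the commented-out assignment below won't compile, while the same line inside a function is fine.

void (*described_functions[512])();

void test_foo() {}

/* described_functions[0] = &test_foo;  <- error: not allowed at file scope */

void register_foo(void) {
	described_functions[0] = &test_foo; /* OK: inside a function */
}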

Appending to a macro

What if we had a macro which we appended test_##name(); to every time a function is declared by describe? It turns out that this is almost possible using some obscure GCC extensions. I found this solution on StackOverflow:

#define described_functions test_foo();

#pragma push_macro("described_functions")
#undef described_functions
#define described_functions _Pragma("pop_macro(\"described_functions\")") described_functions test_bar();

#pragma push_macro("described_functions")
#undef described_functions
#define described_functions _Pragma("pop_macro(\"described_functions\")") described_functions test_baz();

described_functions // expands to test_foo(); test_bar(); test_baz();

This actually works as a way to append text to a macro's expansion, at least in GCC. Snow could have used that... except for one problem: you of course can't use #define from within a macro, and we would have needed to do this from within the describe macro. I have searched far and wide for a way, even a weird GCC-specific possibly pragma-related way, to redefine a macro from within another macro, but I haven't found anything. Close, but no cigar.

The way which actually works

I mentioned that there is actually one way to do it. Before I show you, I need to cover dlopen and dlsym.

void *dlopen(const char *filename, int flags) opens a binary (usually a shared object... usually), and returns a handle. Giving dlopen NULL as the file name gives us a handle to the main program.

void *dlsym(void *handle, const char *symbol) returns a pointer to a symbol (for example a function) in the binary which handle refers to.

We can use dlopen and dlsym like this:

#include <stdio.h>
#include <dlfcn.h>

void foo() {
	printf("hello world\n");
}

int main() {
	void *h = dlopen(NULL, RTLD_LAZY);

	void *fptr = dlsym(h, "foo");
	void (*f)() = fptr;
	f();

	dlclose(h);
}

Compile that code with gcc -Wl,--export-dynamic -ldl -o something something.c, and run ./something, and you'll see it print hello world to the terminal. That means we can actually call functions dynamically based on an arbitrary string at runtime. (The -Wl,--export-dynamic is necessary to tell the linker to export the symbols, such that they're available to us through dlsym).

Being able to run functions based on a runtime C string, combined with our friend __COUNTER__, opens up some interesting possibilities. We could write a program like this:

#include <stdio.h>
#include <dlfcn.h>

/* Annoyingly, the concat_ and concat macros are necessary to
 * be able to use __COUNTER__ in an identifier name */
#define concat_(a, b) a ## b
#define concat(a, b) concat_(a, b)

#define describe(...) \
	void concat(test_, __COUNTER__)() { \
		__VA_ARGS__ \
	}

describe({
	printf("Hello from function 0\n");
})

describe({
	printf("Hi from function 1\n");
})

int main() {
	void *h = dlopen(NULL, RTLD_LAZY);
	char symbol[32] = { '\0' };

	for (int i = 0; i < __COUNTER__; ++i) {
		snprintf(symbol, 31, "test_%i", i);
		void *fptr = dlsym(h, symbol);
		void (*f)() = fptr;
		f();
	}

	dlclose(h);
}

Run that through the preprocessor, and we get:

void test_0() {
	{ printf("Hello from function 0\n"); }
}
void test_1() {
	{ printf("Hi from function 1\n"); }
}

int main() {
	void *h = dlopen(NULL, RTLD_LAZY);
	char symbol[32] = { '\0' };

	for (int i = 0; i < 2; ++i) {
		snprintf(symbol, 31, "test_%i", i);
		void *fptr = dlsym(h, symbol);
		void (*f)() = fptr;
		f();
	}

	dlclose(h);
}

That for loop in our main function will first call test_0(), then test_1().

I hope you understand why even though this technically works, it's not exactly something I want to include in Snow ;)

]]>
https://mort.coffee/home/obscure-c-features 25 Jan 2018 12:00 GMT
Replacing Apple TV https://mort.coffee/home/replacing-apple-tv <![CDATA[

Date: 2015-12-18
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00009-replacing-apple-tv.md

For a long time now, my family and I have used an Apple TV as our media PC. Not those newfangled ones with third-party games and apps, but the older generation, those with a set of pre-installed "apps" which let you access certain quarantines of content, such as Netflix, YouTube, iTunes, etc.

The Apple TV worked well enough when accessing content from those sources. The Netflix client was good, the YouTube client kind of lackluster, and the iTunes client decent enough. There were various other "apps", but those went mostly unused. The main problem however, affecting basically everything on the platform, is that I live in Norway; as a result, most of the time, somewhat new TV shows or movies we want to watch simply aren't available through those sources. Often, we needed to play video files obtained through other means. This left us with two options:

  1. Find a Mac, play the video file in VLC there, mirror the screen to the Apple TV. This gives us various degrees of choppy frame rate, but lets us play the video instantly after it's downloaded, and lets us use subtitles if we so desire.
  2. Spend around half an hour converting the video to mp4, and stream it to the TV with this tool. This gives smooth frame rate, but takes a while due to converting media. It also doesn't support subtitles.

One day, I decided I'd had enough. I found an old laptop, threw Linux on it, connected it to the TV, and started writing code.

Introducing MMPC

MMPC, Mort's Media PC, is the fruit of my endeavours. It's designed to be controlled from afar, with a web interface. It's also written in a modular fashion, and I'll go through what each module does.

Media Streaming

https://github.com/mortie/mmpc-media-streamer

The media streamer module is the most important module. When playing a movie or an episode of a TV show, we generally have a torrent containing the media file. In the past, we would download the movie from the torrent, and then find a way to play it on the TV when it was done. This module instead lets us either paste in a torrent link or upload a torrent file, and it'll stream that to a VLC window which opens on the big screen. VLC also comes with a web interface, so once you start playing a video, the browser loads the VLC web interface, and you can control video playback from there.

The control panel, letting you paste a magnet link, youtube link, etc:

Control Panel

VLC playback controls:

Playback Controls

Remote Desktop

https://github.com/mortie/mmpc-remote-desktop

Sometimes, you need more than streaming torrent files. Netflix, for example, is very useful whenever it has the content we want to watch, and the same applies to various other websites. As the computer is running a full Linux distro instead of some locked-down version of iOS, it's useful to have a way to control it directly. However, it's also annoying to have a wireless keyboard and mouse constantly connected to it and use those. Therefore, I decided it would be nice to be able to remote control it from the browser.

Implementing remote desktop in the browser sounded like it would be an interesting challenge. However, it turned out to be surprisingly easy. There's already a library out there called jsmpg which basically does everything I need. It has a client to stream an mpeg stream to a canvas element, and a server to stream to the client using websockets. The server also has an HTTP server, which you can stream video to and have it appear in all connected clients. Ffmpeg can both record the screen, and output to an HTTP server.

Once I had streaming video to the client working, the rest was just listening for various events on the client (mousemove, mousedown, etc.) and sending HTTP requests to an HTTP server, which then promptly runs an xdotool command, and voila, remote desktop.

Remote Desktop

Live Wallpapers

https://github.com/mortie/mmpc-wallpaper

One nice thing about the Apple TV is that it can be set to display random pictures from your photo gallery. However, those pictures have to be in your iCloud photo library, which is sort of problematic, considering I don't use Apple devices, and dislike that kind of platform lock-in for something as important as photos. I therefore moved everything from what we used to use for photos, a shared iCloud photo stream, over to a NAS, mounted that NAS on the media PC as a webdav volume with davfs, and wrote this module to pick a random picture every 5 seconds and set it as the wallpaper. If the picture it picks is in portrait orientation, it finds another portrait picture and puts them side by side using imagemagick before setting the result as the wallpaper.

Why not Plex/Kodi/whatever?

Media PC software already exists. However, both Plex and Kodi, to my knowledge, sort of expect you to have a media collection stored on hard drives somewhere. They excel at letting you browse and play that media, but we rarely find ourselves in a situation where that would be beneficial. Most of the time, we just want to watch a movie we haven't seen before and have had no reason to already have in a media library, and we generally decide what movie to watch shortly before watching it. Writing the software myself lets me tailor it specifically to our use case.

UPDATE: It has come to my attention that Kodi has some addons which let you stream torrent files. However, even with that, there are some things Kodi doesn't do:

  • Streaming from arbitrary video services - some services, like Netflix, have Kodi addons, but many streaming services don't have such plugins. There's no plugin for daisuki for example. Just having a regular desktop with Google Chrome, and some links to websites on the desktop for easier access, solves this.
  • Having those unobstructed dynamic live wallpapers of pictures we've taken whenever video isn't playing is rather nice.
  • Being able to control from a laptop instead of a remote control is useful; remote controls get lost, laptops don't. Typing on a laptop keyboard is also a lot easier than with a remote control.

You could get many of the features I want from Kodi by installing, and maybe writing, lots of plugins, but I'm not convinced that would've been much easier than just writing the thousand lines of javascript this project required.

]]>
https://mort.coffee/home/replacing-apple-tv 18 Dec 2015 12:00 GMT
Housecat, my new static site generator https://mort.coffee/home/housecat <![CDATA[

Date: 2015-10-08
Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00008-housecat.md

This website has gone through several content management systems throughout the times. Years ago, it was WordPress, before I switched to a basic homegrown one written in PHP. A while after that, I switched to a static site generator written in JavaScript, which I called Nouwell. Now, the time has come to yet again move, as I just completed Housecat, my new static site generator.

Nouwell, like the PHP blogging system which came before it, was designed to be one complete solution to write and manage blog posts. It was an admin interface and a site builder in a complete package. With Housecat, I decided to take a different route. It's around 1500 lines of C code, compared to Nouwell's roughly 5000 lines of javascript for node.js, PHP, and HTML/CSS/JS. That's because its scope is so much more limited and well defined; take a bunch of source files in a given directory structure, and create a bunch of output files in another directory structure.

Housecat is designed to be a tool in a bigger system. It does its one thing, and, in my opinion, it does it pretty well. Currently, I'm editing this blog post with vim. I'm navigating and administrating articles with regular unix directory and file utilities. I have a tiny bash script which converts my articles from markdown to HTML before Housecat processes them. Eventually, I even plan to make a web interface for administrating things and writing blog posts and such, which will use Housecat as a back-end. To my understanding, this is what the UNIX philosophy is all about.

Housecat features a rather powerful theming system, a plugin system, pagination, and drafts (start a post with the string "DRAFT:", and it'll only be accessible through the canonical URL, not listed anywhere). It should be compatible with any sane web server, and is, of course, open source.

Now, some of you might be wondering why anyone would ever use C to write a static site generator. To be honest, the main reason I chose C was that I wanted to learn it better. However, after using it for a while, it doesn't seem like a bad choice at all. Coming mostly from javascript, it's refreshing to have a compiler actually tell me when something's wrong, instead of just randomly blowing up in certain situations. C certainly isn't perfect when it comes to compiler warnings, as anyone who has ever seen the phrase segmentation fault (core dumped) will tell you, but having a compiler tell you you're wrong at all is a very nice change, and valgrind helps a lot with those segfaults. I also think that being forced to have more control over what I'm doing and what goes where helps; with javascript, you can generally throw enough hacks at the problem, and it disappears for a while. That strategy simply doesn't work at all in C. That isn't to say that you can't write good code in javascript, nor that you can't write bad code in C, but I found it nice nonetheless.

]]>
https://mort.coffee/home/housecat 08 Oct 2015 12:00 GMT
Apple and security: Are we back to using our favorite band as our passwords? https://mort.coffee/home/apple-and-security <![CDATA[

I have relatively recently started switching all my online accounts over to a password system where all my passwords are over 20 characters long, and the password is different for each and every account. Each password also contains numbers, lowercase characters, and uppercase characters. I should be safe, right? Well, not quite.

A while ago, I just randomly decided to try out Apple’s “forgot password” feature. I’m a web developer, and am sometimes curious as to how websites implement that kind of thing, so I headed over to http://iforgot.apple.com/ and typed in my Apple ID. I noticed that it gave me the option to answer security questions.

I was first greeted with this screen, asking me for my date of birth:

apple-dob

The date of birth is obviously not classified information, and is basically available to anyone who knows my name and how to use Google.

Having typed in this, I get to a new page, which looks like this:

apple-secquestions

It asks me what my favourite band is and what my first teacher’s name is. None of that is secret either; anyone who knows me knows that my favorite band is Metallica, and there are traces of that all throughout the Internet, and if it’s not in public records somewhere, anyone could just randomly ask me what my first teacher’s name was, and I’d probably answer honestly.

Anyways, typing in that information, I find something truly terrifying:

apple-terrifying

I was able to change my password. Only knowing my email address, my date of birth, my favourite band, and my first teacher, anyone could take complete control of all my Apple devices, remotely delete everything on them, access all my images, all documents, everything. And there would be nothing I could do to stop it. After some days, I would probably notice that I couldn’t log in to anything, and would call tech support, but at that point, it would already have been way too late. Anyone could by then have remotely deleted all my data, after backing it up on their machine. Only by knowing publicly available information about me, or asking me seemingly innocent questions via chat.

This isn’t even a case of me using terrible security questions either. Apple only allows you to pick from a small set of security questions, and the vast majority of them were completely inapplicable to me. I have no favourite children’s book. I’m not sure what my dream job is. I didn’t have a childhood nickname, unless we count “mort”, which isn’t really a “childhood” nickname, as it’s my current nick to this day. I don’t have a car, so I don’t know the model of my first car. I have no favourite film star or character. Et cetera. Those are all questions I could’ve chosen instead of “Who was your favourite band or singer in school?”, but none are applicable to me, and more importantly, none of them would be more secure than my current security questions.

Is this standard for security really acceptable from anyone, much less the world’s most valuable tech company, in this day and age? Are we really back to the dark ages of using birth dates, favorite bands, and other personal information as our passwords? Didn’t security experts find out that this was a bad idea a long time ago?

There are of course ways to mitigate the effects of Apple's poorly designed system. You could generate new random passwords for each security question if you're using a password manager, or you could make up fake answers. I highly suggest going to https://appleid.apple.com/signin and changing your security questions right away. However, Apple's solution is still broken. I expect that the vast majority of people will give their actual personal information as the answers, because after all, that's what the website asks you to do.

]]>
https://mort.coffee/home/apple-and-security 01 Sep 2014 12:00 GMT
Yet Another "Types of Programmers"-post https://mort.coffee/home/type-of-programmers <![CDATA[

Some friends and I were hanging out in a Google Docs document earlier, and made our own "types of programmers" list. The result:

The kinda-master:

This kind of programmer always assumes they know best, and commits without any testing, often breaking sections of the code. They invent unusual methods of doing things, and insist they are the best, especially when they are not. They are very skilled at many things, and assume this translates to everything.

The Delusional:

This kind of programmer takes knowing more than the average person to be knowing more than any person. They code badly and refuse to improve, under the impression everyone else is a moron, especially those who know better than them.

The Ninja:

This breed of programmer quietly twiddles their thumbs until their code is due, at which point they promptly ejaculate a pile of code resembling the result of spaghetti being fed through a blender.

The Bugfixer:

This kind of programmer does nothing at all to assist in the development of a project, leaving it to someone else, who as a result of having more work will produce a buggy and terrible end project. It is at this point that the bugfixer springs into action, finding single lines of code and suggesting slight improvements, all the while making snobby comments about how they should have been written better.

The Magician:

The one you go to whenever you need help, only to get an answer you can't seem to comprehend. You proceed to assume The Magician is an expert in the field, as he seems to possess some long lost knowledge you never knew existed.

The Tinkerer:

The programmer who starts one project, then spends the rest of his life working on said project. The Tinkerer will often end up with a neat end product, but it will be his only product. He will become an expert in any field his project touches, but will know nothing of any other field, save random information he happens to stumble upon.

The Procrastinator:

The one who always theorises about great projects. The Procrastinator will rarely get started with a project, and if they do, you can bet your ass it won't live for more than a week, tops.

The Perfectionist:

A close relative of The Procrastinator. There's one important difference though: while The Procrastinator rarely gets any idea started, The Perfectionist will usually start his projects. The Perfectionist will never get anything done though, as his time isn't spent writing code, but rather trying to find ways to make his code perfect.

The Duct-Tape Programmer:

The one who tends to get projects done, but usually in a messy manner. The Duct-Tape Programmer is a polar opposite of The Perfectionist and The Procrastinator. Much like The Procrastinator, The Duct-Tape Programmer's head is sprawling with ideas. The difference is that The Duct-Tape Programmer will immediately start writing code, without thinking through anything. The shortcomings of his code will be fixed with duct tape.

The Anti Insertionist:

This kind of programmer relies on others to create a bad codebase, then improves it quickly, ending up with less insertions than deletions. They fix the most bugs and are useful as hell when finishing a project, but are useless by themselves. Since they are able to improve everyone else's code, they have a higher degree of knowledge of what they are doing than anyone else.

The Medium-rare:

Delicious.

The Spandex:

The programmer who doesn't specialise in one field, but rather does his best to cover all fields. The Spandex will know something about everything, but a lot about nothing. Jack of all trades, master of none.

The Delusional Magikarp:

Is not a programmer, but thinks they are one. The most common case of this is the front end web developer who thinks knowing HTML and CSS puts him in the same league as those writing physics engines in C.

The Newbie:

The one who has just recently discovered the art of programming. The Newbie is a dangerous beast, his code is usually riddled with bugs, memory leaks, SQL injections and what have you not, but at least his heart is in the right place, and he's willing to learn.

The Douchebag:

The individual who has no good intentions. The Douchebag is often seen writing code which relies on a key value store which can only be interfaced with by writing a distributed Map-Reduce function. To make matters worse, his language of choice is Erlang.

The Insane:

Conversations with himself which usually go something like this:

observation: mort often makes jokes I don't get.

observation: inv often doesn't (stop confusing don't and doesn't) (sorry D:::) (>:C) get the jokes I make

observation: who the hell are you. < who typed this dunno

who t

she hell are YOUXeooooooolokk swwag

swedgeis this cursor purple???? ANSEWR ME >:L BLEURIOE:( wut color D: But invalid is blue :O WHO IS THAT

LE PURPLE

I WANT TO BE HEDGEHOG >:LJOP no its not its bzxcvzxcvzxcvAAAAAAAAAAAAAsdlue many dolphin, such wow, much anonymous if you are a hedgehog it is purple Can i haz purple hedgehog no? never

Plox ;n; POWEJPRWEJ D:::::::: BUT HEDGEHOG >:L fack u no not now

]]>
https://mort.coffee/home/type-of-programmers 01 Aug 2014 12:00 GMT
Experimenting with static site generation https://mort.coffee/home/experimenting-with-ssg <![CDATA[

Hi!

If you visit this blog on a regular basis, which you probably don't, you may have noticed that it looks a bit different from how it used to look. The overall theme is the same, but the URLs look way different, and there is no comment section. So what happened?

A while ago, I wrote my own content management system in PHP, replacing WordPress. That was your standard CMS: for every request, a script (a PHP script in this case) generates the page dynamically, showing you the content you requested. In March this year (2014) though, I started a new project which works fundamentally differently.

Enter jsSiteBuilder.

jsSiteBuilder is a static site generator. That means that nothing is built on the fly. There's a script which reads the content of a MySQL database and generates all the required HTML files for each and every page and post. Whenever a user requests the website, the one and only thing the web server ever does is what web servers do best - it fetches the file and sends it back to the user.

The most obvious advantage of this method is performance. Both server load and request time are brought down to the absolute minimum. It also provides some much needed stability. Your website doesn't go down, even in the case of an extreme disaster. Say you accidentally delete your MySQL database, or the database host goes down, or file permissions get messed up. Usually, this would take down the website. With this CMS however, all the HTML files will just stay there, available for everyone to see. Your users won't notice a thing, while you can take all the time you need to properly fix whatever issue appeared. You can even re-run the site building script as much as you like while everything is down; it won't delete anything.

Not everything in jsSiteBuilder is static however. More specifically, it has the admin control panel you'd expect from an old-fashioned dynamic site generator, like WordPress. That's because the admin panel is written in PHP. This makes it easy to create, edit, and generally administer and set up the website. All the admin interface does is interact with the MySQL database. Once you're done making whatever changes you want to make, you can update the user-facing portion of the website with the push of a button. On my blog, with my server, completely regenerating the entire site takes no more than a few tenths of a second.

]]>
https://mort.coffee/home/experimenting-with-ssg 27 Jun 2014 12:00 GMT
A small WTF regarding CSS units https://mort.coffee/home/css-units-wtf <![CDATA[

CSS. A tool loved by web developers all over the world. It lets us style our HTML easily, and creates somewhat loose coupling between content and layout.

CSS. A tool hated by web developers all over the world. It makes us spend countless hours trying to accomplish what seemed to be the most mundane task.

Today, CSS got a little bit weirder.

It all started with a discussion between me and a friend, Stef Velzel (or Invalid). He has a website at ckefworx.com, in case you're reading this in the year 3026 and he has finally removed that "under construction" banner.

Anyways, Invalid and I were discussing, as we so often do. I had come across a blog post on Reddit claiming that in CSS, one pixel (px) is always 1/96 of an inch. I even found CSS specifications which supported the statement. Invalid disagreed though, and maintained that, in his experience, 1px is always one pixel on the screen.

It turns out I was right. In a sense at least. But so was Invalid. You see, 1px being 1/96 of an inch and 1px being exactly one pixel aren't mutually exclusive. Not with CSS at least.

"But wait", I hear you say. "That doesnt make sense?"

Well, yes it does. Sort of. If you're a mathemagician, you may have figured this out already. The only way this adds up is if one inch is 96 pixels, 96 points of light on your screen.

Have you spotted the issue here? no? yes? 96 pixels isn't equivalent to 1 inch. Not in a world where pixel density varies wildly from device to device. CSS can't just redefine inches like that can it? I mean, 1 inch is exactly 1 inch, isn't it..?

Well apparently, CSS can indeed just redefine units of measurement like that. 1 inch is 96 dots of light, not 1 inch.

On some deep level, this makes sense. It seems like common sense to define units of measurement in terms of the fundamental unit of the display instead of arbitrarily defined real-life measurements. The problem is that it's marketed as inches, centimetres, etc. instead of what it actually is. This causes a lot of confusion.

I should probably add that the spec doesn't state that one inch should be 96 pixels. It rather says that one px should be 1/96 of a real-world inch, which at least makes a little sense. Browser vendors implement it how I described above though, and in the end, that's what matters.

Update: I should probably have included some of the tests I did, and some sources.

First off, let's see how a pixel is defined by the W3C:

The absolute length units are fixed in relation to each other and anchored to some physical measurement. They are mainly useful when the output environment is known. The absolute units consist of the physical units (in, cm, mm, pt, pc) and the px unit:

cm: centimeters

[...]

in: inches; 1in is equal to 2.54cm

px: pixels; 1px is equal to 1/96th of 1in

[...]

According to that, it would seem like 1 inch is exactly 1 inch, regardless of resolution. 1px should also be the same regardless of resolution, as it is defined using inches.

Look at this example element however:

<div style="background-color: #000; height: 1in; width: 1in"></div>

I don't know about your browser, but mine does at least not render that as exactly 1 inch. Have a look at this however:

<div style="background-color: #00F; width: 1in; height: 1in; display: inline-block"></div>
<div style="background-color: #F00; width: 96px; height: 96px; display: inline-block"></div>

I don't know about you, but those look fairly similar in all browsers I've tested with. This shows that, regardless of what the CSS spec says, an in isn't an inch, at least not in all browsers. It's 96 pixels.

]]>
https://mort.coffee/home/css-units-wtf 01 May 2014 12:00 GMT
Rant about YouTube https://mort.coffee/home/rant-about-youtube <![CDATA[

Since its inception in 2005, YouTube has grown out of proportions. It is to videos what Google is to search.

With a user base of billions of people and thousands of hours of footage uploaded daily, you'd almost think they knew what they were doing. And they do, from an infrastructure standpoint. That amount of traffic requires vast server farms all around the world, all working together.

Where YouTube lacks however, is in terms of its user interface. There are some real disasters in this department.

Take for instance selecting a video's quality. When you change the quality from the default 360p to, say, 720p, a user would expect the quality to change. The user model says that when you change the quality, the quality changes. Believe it or not, the YouTube team actually managed to get this wrong.

When you change a video's quality, the video remains in its original quality. It changes to the selected quality only when it has played through all that is already buffered.

For me, a pattern like this is not unusual:

I click on a video. I fullscreen it on my big 1080p display, before changing the quality to 1080p, but due to my 70 Mb/s internet connection and YouTube's great infrastructure, 3/4 of the video is already loaded. I proceed to watch 3/4 of the video in horrible quality, before it switches over to beautiful full HD.

One definition of great software is that the program model corresponds to the user model. Basically, the program should behave like the user expects. The user model is definitely not that changes in quality settings apply only after watching a random portion of the video, if at all.

Another cause of bad usability is frequent changes in the user interface. Not small changes, like altering the looks of a button here and adding some gradients there. No, we're talking total overhauls of the UI. Completely revamping how everything is structured.

YouTube has had quite a few of these huge overhauls. Some times it has restructured everything multiple times per year.

Users hate big changes. The reason is that users don't analyze the interface and look for the most logical place for a feature every time they need said feature. No, we users memorize where what we need is, and navigate there out of habit more than anything else, at least with programs we use frequently.

When the user interface is restructured, we still go looking for what we want in the areas we're used to. When that doesn't work, we lose our feeling of control. The program now has control over you. This happens unconsciously, and leads to frustration.

Furthermore, it forces us to reanalyze the interface and look for features we previously knew the location of. As it turns out, this causes quite a bit of cognitive overhead. Our brains are way better at just looking up already stored information than processing brand new data. When YouTube is completely overhauling its website multiple times per year, this becomes quite a bit of an annoyance.

The mobile application for iOS, and possibly Android, is a bit of a disaster too. At least in some respects. Sure, it has its bright sides, but as this is a hateful rant, I'll jump gracefully over those and focus on the bad aspects.

In YouTube, there is a comment section, as you may know already. It's possible to reply to people's comments. You can even see who a comment is a reply to, and by the press of a button, you can see the original comment.

The team working on the mobile app rightfully decided to implement the comment section. What they did not do, however, was let you see who a comment is a reply to, rendering it useless. You see, this reply-to feature is frequently used. It's used so much that without it, the comment section is just a bunch of random statements, completely out of context.

The app also has a feature which dynamically sets the video's quality according to your internet speed. Unfortunately, this doesn't quite work right.

In my home, there's an area which the router's WiFi doesn't quite reach. When I move into it, YouTube notices the bad connection and drops the quality to below unbearable. It seriously looks like a grunge torn apart by a crack in the space-time continuum, just slightly less fancy. Unfortunately, once the connection picks up again, it does not adjust itself for quite some time. This forces me to exit out of the video, scan the list of videos for the proper one, wait for it to load, and navigate to where I was. As you might expect, this completely kills my flow.

In short, how to fix YouTube:

  • Make quality changes immediate
  • Less frequent UI overhauls
  • Implement replies into the mobile app's comment sections
  • Provide optional quality controls to the mobile app
  • Make the app better at automatically selecting quality
]]>
https://mort.coffee/home/rant-about-youtube 01 Jan 2014 12:00 GMT
JavaScript's Rough Childhood https://mort.coffee/home/javascripts-rough-childhood <![CDATA[

As some of you may know, I'm a fan of JavaScript. Pretty much all of my projects are web apps, and as such, JavaScript is an important part of them.

JavaScript does have its good parts. Parts which at the very least are on par with other languages. It does, however, also have quite a few bad parts.

In the beginning, there was Brendan Eich. Eich got hired by Netscape to design a programming language for their web browser. At first, what he had in mind was something resembling Scheme, a dialect of Lisp.

When he had worked for some time on this Scheme-esque language, someone, presumably in Netscape's management, decided they wanted something else. Eich got told to start from scratch. This whole Java-thing seemed to be taking off, so he would have to make it more like Java. "And by the way, we need it in ten days", they told him.

Now, ten days is orders of magnitude less than you need to design a great language. Eich did a great job, but as you would expect however, it did have quite a few unforeseen quirks.

Unsurprisingly, Microsoft decided to copy Netscape. They had a team dedicated to find JavaScript's quirks and replicate them.

Fast forward a bit, and Netscape submits their language to the European Computer Manufacturers Association, or ECMA, to make it a standard. ECMA agrees, on the premise that it won't be called JavaScript anymore. A team of people started writing detailed documentation for what was internally called ECMAScript. Microsoft had a key role in this documentation process, and due to their work on accurately cloning Netscape's JavaScript implementation they knew exactly what odd quirks JavaScript had, and thus what they should make sure to avoid. Or, as it turns out, make sure it got into the standard. Yeah.

This glorious work from Microsoft's side is part of the reason JavaScript is the inconsistent mess it is today. Take for instance how typeof null is "object". That, and a bunch more, is a result of the incredibly short amount of time Eich had to make JavaScript, and Microsoft's effort to make sure all quirks from the original JavaScript implementation stuck in the ECMAScript standard.

Abstraction

Even though it can be tempting to blame Microsoft for everything wrong in the world, it should be said that they aren't the root of all problems with JavaScript. Some of the problems aren't even real problems, but a result of leaky abstractions.

It is fairly obvious that JavaScript tries to be fairly abstract. Take for instance how it doesn't have types. That's an abstraction. Internally, the computer does distinguish between text, integers, numbers with decimals, booleans and more. The JavaScript language tries to hide this however. In the world of JavaScript, everything's a "variable", declared by the keyword var.

One problem which almost exploded in my face is related to how some of JavaScript's types are passed by reference, and others by value.

Now, some of you may never have heard of passing values by reference or by value, nor understand why it's a big deal. Even if you're a programmer, this can be a completely foreign concept to you. If that's the case, chances are you're using a very abstract language like JavaScript.

I won't get too much into the inner workings of computers, but I will try to explain the basics of passing by references and values.

Say you have two variables, Foo and Bar. Say you set Foo to 5:

Foo = 5

Now, we set Bar to Foo:

Bar = Foo

If we pass by value, Bar and Foo will be two distinct, completely unrelated variables. Changing one will never ever in a billion years affect the other.

If we however pass by reference, Bar isn't a value in itself. When someone asks what Bar is, it simply says "go check out Foo, maybe he knows"; Bar is what we call a pointer. Because of this, when you change the thing Foo points to, Bar's value also changes, and vice versa: the two variables refer to the same thing, just under different names. In JavaScript specifically, this sharing applies to the object behind the names; modifying the object through either name is visible through both. (Reassigning one of the variables to a completely new value, on the other hand, doesn't change the other.)

Passing by reference can be a lot faster when dealing with big variables. Therefore, JavaScript passes some types by reference. Those types are functions, arrays and objects. The problem here is that when programming in JavaScript, there's no clear distinction between types. After all, that's the point of being an untyped language isn't it?

This can cause some really confusing quirks. For instance, after this code, bar is 10:

foobar = 10;
bar = foobar;
foobar = 20;

while after this code, bar.bar is 20:

foo = {"bar": 10};
bar = foo;
foo.bar = 20;

If you're not experienced in JavaScript, or programming in general, this might not make a lot of sense to you. Trust me though when I say that this can cause severe problems.

Of course, the whole problem would be gone if JavaScript by default passed values by value. Passing by reference could be an option. This is how C does it, and it works great.

There are other quite freaky abstractions out there. Take for instance how JavaScript doesn't force you to use semicolons at the end of lines. It does this by automatically inserting semicolons where they are missing. One of the ways it does this is really creepy: when parsing a line of code fails, it inserts a semicolon at the end and tries again. There are a few problems caused by this which I won't get into, but most of all it's just creeping me out to know that JavaScript does that. It also teaches new programmers the horrible custom of ignoring semicolons, so using languages where semicolons are required becomes a hell. Therefore, use semicolons!

Solutions

What can we do to make our web programming lives easier, and overcome JavaScript's flaws?

One of the solutions can be to translate other languages to JavaScript. Lots of such translators have been made, and nowadays you can translate pretty much any language (C, C++, C#, Java, you name it) into JavaScript. People are even designing languages whose sole purpose is to be translated into JavaScript code. CoffeeScript is an example of this. A problem with translating other languages however is that the web browser will still be running JavaScript code, and will spew out errors in the JavaScript code. It can't magically know where in the code you wrote the error is. This adds a lot of complication to debugging, and you pretty much have to be fluent in JavaScript anyways to be able to see what the error really is.

Another solution is to simply go with JavaScript, learn to love its quirks, or at least learn how to overcome them. Know that when you typeof null, it will return "object". Learn that if you declare variables certain ways, they are objects, arrays or functions, and as such are passed by reference, and learn what passing by value/reference really means. Learn to always have your Google machine ready. Learn that while high levels of abstraction makes languages a lot easier to get involved with, it also makes it quite a bit harder to really get to know the language.

(Some of the things I've written here, I learned from a talk about JavaScript. I think the talk was by Douglas Crockford. Sadly I can't find it again.)

]]>
https://mort.coffee/home/javascripts-rough-childhood 01 Nov 2013 12:00 GMT
"'Considered Harmful' essays considered harmful" essays considered harmful https://mort.coffee/home/considered-harmful-essays <![CDATA[

Okay, that title is a bit of a brain twister. Hear me out though, I promise I'll eventually make some kind of sense.

Since the late 60s, a type of computer-related essay, namely the so-called "considered harmful" essay, has been popular.

Considered harmful essays are all about writing page up and page down about why something programming-related is bad and should be avoided. The first considered harmful essay, at least the first somewhat mainstream one, was written in 1968 by the Dutchman Edsger W. Dijkstra. It was called "Go To Statement Considered Harmful", and, as you might have guessed by now, is about how goto statements have a tendency to produce some really messy spaghetti.

After Dijkstra's essay, the style of writing got so popular you could say it became a cliché. We got a metric ton of "considered harmful" essays, each essay nitpicking on its own small area. "with" statements, XSL, Java, the "new" keyword, namespaces - all of which, and more, have been considered harmful by someone or other.

One of the later additions to the considered harmful family of essays is "'considered harmful' essays considered harmful".

Now, what's harmful about "'considered harmful' considered harmful" essays?

Well, one of the more obvious effects it has, is this post. I mean, "'Considered Harmful' essays considered harmful" essays considered harmful. If that title doesn't blow your brain out of your ears, I don't know what will.

But it doesn't stop with molten brains. Oh no, far from it. You see, considered harmful essays aren't necessarily there to tell you not to do or use whatever the essay is about. They work more like warnings. Not so much "don't do x", but "before you do x, make sure you know what you're doing".

There are lots of new programmers out there. I myself am fairly new. With the extreme levels of abstraction in the languages which are considered great for beginners, it's easy to do something which in the code looks completely sane, but when a virtual machine runs it, it forces the CPU to do a gazillion operations. If you had taken a slightly different approach to the problem however, it would only have taken a few billion operations.

Or things could behave unexpectedly. For instance, the language could suddenly decide that nope, that variable (everything's variables these days - goodbye data types) is passed by reference, while all other variables are passed by value. This can have gastronomical implications, and completely break a project.

Many considered harmful essays are there to tell you about those pitfalls of leaky abstractions.

In addition to the informative value, they're a joy to read. I, at least, love reading a well-written considered harmful essay.

]]>
https://mort.coffee/home/considered-harmful-essays 01 Sep 2013 12:00 GMT
My Worst Code https://mort.coffee/home/my-worst-code <![CDATA[

Someone called Xeomorpher made a thread on the Open Redstone Engineers forum (more about ORE some here), asking what people's worst pieces of code were. I wrote a response, which I might as well post here too:

TequilaJumper!

Made for Ludum Dare with minimal amounts of experience, it's not one of my prettiest works. It did however spawn some offspring in the form of xeo's TofuJumper.

Because of open sourceness, the source code can be found here: https://github.com/mortie/tequilaJumper

So, let's have a look at it shall we?

You don't even have to look at any of the source code to find the first horrible decision. Everything in one file. One index.html, containing almost 800 lines of source code. Yeeah.

Opening the file, we see some disastrous code. Take for instance this draw code: [line 218]

gameCtx.fillStyle = "rgba(0, 0, 0, 0.5";
gameCtx.beginPath();
gameCtx.moveTo(Math.floor(platformStartX[i] + platformWidth[i]/2), Math.floor(drawYModifier(platformStartY[i] + platformHeight[i]/2, 0))); 
gameCtx.lineTo(Math.floor(platformEndX[i] + platformWidth[i]/2), Math.floor(drawYModifier(platformEndY[i] + platformHeight[i]/2, 0)));
gameCtx.stroke();

Beautiful, right? That was the code for drawing lines marking the path of moving platforms (play the game for yourself, and you'll see what I mean).

This one-liner is quite extraordinary too: [line 271]

gameCtx.fillRect (Math.floor(platformX[i]), Math.floor(drawYModifier(platformY[i], platformHeight[i])), Math.floor(platformWidth[i]), Math.floor(platformHeight[i]));

Yeeah, that's one line. Believe it or not.

In the code for handling the movement of platforms (which is a complex mess of work too by the way, starting at line 283 and ending at 349): [line 324]

//HAAAAAACK!
platformMovementInvertedX[i] = platformMovementInvertedY[i];

Encountered a bug I didn't manage to fix, so I simply hacked my way around it using an extremely dirty trick.

Another thing, which isn't as clearly expressed in the code, but maybe is worst of them all:

When a platform is disappearing off of the screen, it doesn't really disappear. It's still stored in memory - it doesn't get overwritten by new platforms. This leads to a horrible memory leak. That's right. Platforms never despawn.

That's a selection of the worst parts of the code. Other interesting areas are:

Oh, and I almost forgot:

Almost all variables are global. Just look at the variable declaration part [line 7-73] :S

In my defence, it was made for Ludum Dare, AND I attended a party which took most of my weekend. The time was therefore short. It also kinda feels wonderful to hack away on code, not spending a single thought on structure, and just see where you end up. The code becomes extremely horrible and unreadable, but it's rather fun :P

]]>
https://mort.coffee/home/my-worst-code 01 Jul 2013 12:00 GMT