DecemberAdventure 2024
author: @[email protected]
avatar: 🐐
license: CC SA
The December Adventure is low key. The goal is to do a bit of creative coding every day in December and write about it.
This year, hoping to make some progress with:
- ✅ finding a **new** gemini host
- 🚧 write a lex-like utility for ADSL
- 🚧 use it to write a toy C-compiler in C against Nora Sandler's advice in "Writing a C Compiler" from Nostarch Press
- 🚧 use that to find a set of VM instructions that make adding that VM as a C compiler target architecture trivial
Further work:
- ⏳ fold any improvements I make back into my C implementation of Kroc's v80 assembler
- ⏳ fork v80 assembler C implementation to assemble code from the toy compiler
- ⏳ adjust my DecAdv2023 VM to run code from the assembler
- ⏳ support all of C99, maybe even C17
- ⏳ write a C preprocesser to the same standard
I'll also be logging stuff on mastodon:
2024-12-03 16:21 UTC
Day 三 of #DecemberAdventure
One of the other reasons (aside from getting bogged down in all the AST book-keeping code) that I abandoned my initial speed run through the first few chapters of "Writing a C Compiler" is that I realized too late that recording the file, line & column numbers associated only with each Token instance makes it hard to spit out good diagnostics for source code errors not discovered until after the AST had already been created. For example, I could detect when a new variable declaration shadows an earlier declarion with the same name in an outer scope with a variable validation pass over the finished AST, but since the memory for the tokens that were used to build it was long since recycled, I had now lost track of what file and line that the shadowed declaration came from. So the best diagnostic I could output was "variable declaration for 'a' shadows another declaration in an outer scope, you can grep the sources to figure out where exactly 🤦".
I want to preempt that problem in the do over, so I've made a new Location module by abstracting the filename/lineno/colno fields out of yesterday's Token type. Now it's easy to add Location fields to AST node structs too, and I'll be able to provide error diagnostics that refer back to the specific line, column and filename where the error was introduced when I get back to that point again.
And finally for today, to represent this single line from the ADSL spec I posted on Day 一:
| Typed(char *argname, unsigned argnamelen, char *argtype, unsigned argtypelen)
I wrote this:
enum asdtype { ... ASD_TYPED, }; typedef struct asd ASD; typedef struct { char *argname; unsigned argnamelen; char *argtype; unsigned argtypelen; } ASD_Typed; struct asd { union { ... ASD_Typed typed; }; enum asdtype type; Location loc; }; static ASD * asd_new_typed(Location *ploc, char *argname, unsigned argnamelen, char *argtype, unsigned argtypelen) { ASD *r = xmalloc(sizeof *r); r->type = ASD_TYPED; r->typed.argname = xmemdup(argname, argnamelen); r->typed.argnamelen = argnamelen; r->typed.argtype = xmemdup(argtype, argtypelen); r->typed.argtypelen = argtypelen; r->loc.zfname = xstrdup(ploc->zfname); r->loc.lineno = ploc->lineno; r->loc.colno = ploc->colno; return r; } static ASD * asd_free(ASD *stale) { switch(stale->type) { ... case ASD_TYPED: xfree(stale->typed.argname); xfree(stale->typed.argtype); break; } xfree(stale->loc.zfname; return xfree(stale); }
And almost 400 lines more that all look very similar to this, in order to support the other 10 node types for this particular AST. It's all very repetitive, but certainly should be able to generate it reliably after writing a recursive descent parser that uses these boring functions to assemble an AST from the token stream. I'll also need to write a serialization function that lets me dump an S-Expr that represents the structure of the instantiated AST -- really I just want that so that I can eyeball the structure and make sure I parsed inputs correctly, but a more general ASDL tool should provide a way to load and dump a AST from a text file given the matching ASDL description so that it's easy to write standalone tools to load a serialized AST from a file into memory, apply some optimizations or even transform into a different AST, and then serialize that memory back to another file. Chaining together the various stages of a compiler on a low power machine would then be a pipeline of transformation processes communicating with the serialized trees. I'm not planning to do that. But, it might make for a fun follow up project if, say, I wanted to run the completed C Compiler on a CP/M machine that doesn't have enough memory or processing power to run all the stages from tokenizing through assembling with a single program.
I'll save the recursive descent parser until tomorrow. And if that goes quickly, maybe the serialization function to check that the in-memory AST looks right...
2024-12-02 14:28 UTC
Day 二 of #DecemberAdventure
Today I split the tokenizer I wrote for my aborted first run through "Writing a C Compiler" into a generic lexer module that has the common functionality that will be shared between the ASDL generator and the C Compiler do over, including a base struct for the Token type that can be type cast to a more specific Token that has the additional fields needed for a particular tokenizer, and a Lexer struct along with functions to managing peeking and consuming characters from an input stream while keeping track of the line and column number so that the calling code doesn't have to.
This started life as the tokenizer I contributed to a C implementation of Kroc's v80 assembler:
The v80 tokenizer is a fully baked tokenizer for Kroc's clever v80 assembler specification and weighs just 260SLOC. The general version I abstracted out here weighs 192SLOC for the generic Lexer, and then another 294SLOC for the ASDL tokenizer even though this version makes no attempt to keep within C89 compatibility, because it's also paying for the abstraction layer. It already handles reading the Lex like input format below and spewing out a stream of properly location tagged tokens.
%{ any C code destined for the generated .h file in here %} <the ASDL AST spec I posted yesterday pasted here verbatim> %% everything after the line above gets added to the end of the generated .c file
Running the tokenizer over the nascent asd.asdl input now gives me:
src/asd.asdl:1.1: #<HEADER: "#include "xalloc.h" "> src/asd.asdl:5.1: #<IDENTIFIER: "asdl"> src/asd.asdl:5.6: #<EQUAL: '='> src/asd.asdl:5.8: #<IDENTIFIER: "statement"> src/asd.asdl:5.17: #<STAR: '*'> src/asd.asdl:6.1: #<IDENTIFIER: "statement"> src/asd.asdl:6.11: #<EQUAL: '='> src/asd.asdl:6.13: #<IDENTIFIER: "Header"> src/asd.asdl:6.19: #<LPAREN: '('> src/asd.asdl:6.20: #<IDENTIFIER: "char"> src/asd.asdl:6.25: #<STAR: '*'> src/asd.asdl:6.26: #<IDENTIFIER: "text"> src/asd.asdl:6.30: #<COMMA: ','> src/asd.asdl:6.32: #<IDENTIFIER: "unsigned"> src/asd.asdl:6.41: #<IDENTIFIER: "textlen"> src/asd.asdl:6.48: #<RPAREN: ')'> src/asd.asdl:7.5: #<PIPE: '|'> src/asd.asdl:7.7: #<IDENTIFIER: "Production"> src/asd.asdl:7.17: #<LPAREN: '('> src/asd.asdl:7.18: #<IDENTIFIER: "char"> src/asd.asdl:7.23: #<STAR: '*'> src/asd.asdl:7.24: #<IDENTIFIER: "name"> src/asd.asdl:7.28: #<COMMA: ','> src/asd.asdl:7.30: #<IDENTIFIER: "unsigned"> src/asd.asdl:7.39: #<IDENTIFIER: "namelen"> src/asd.asdl:7.46: #<COMMA: ','> src/asd.asdl:7.48: #<IDENTIFIER: "term"> src/asd.asdl:7.52: #<STAR: '*'> src/asd.asdl:7.53: #<RPAREN: ')'> src/asd.asdl:8.5: #<PIPE: '|'> ... src/asd.asdl:20.1: #<CODE: " #include <stdlib.h> #include "error.h" #include ..."> src/asd.asdl:565.1: #<ENDOFFILE: "EOF">
That's a nice milestone for the day. Tomorrow I'll start hand coding the boring AST types, constructors and destructors but in a way that should be easy to regenerate automatically once I've bootstrapped this tool.
I'm happy to publish all the code as I go. If anyone is interested in that, please let me know... I'm not thrilled with Github these days so I'll probably figure out how to host it on my VPS alongside this gemini capsule, so please be patient if you ask and it takes me a few days to get something working!
2024-12-01 19:33 UTC
Day 一 of #DecemberAdventure 🎉🎉🎉
I have been working through Nora Sandler's excellent "Writing a C Compiler" for the last few weeks. It is an excellent book, itself based on the paper "An Incremental Approach to Compiler Construction" by Abdulaziz Ghuloum, which I read many years before I knew enough compiler theory to understand it. In addition to being hand-held through creating a C Compiler one step at a time, I'm hoping this exercise will also push me beyond being an x86 assembly language dabbler to actually being able to write idiomatic assembly code, something I haven't been capable of since my childhood days of 8-bit CPUs.
In the introduction to the book, Nora says: "We'll be creating a compiler *for* C, but I don't recommend writing it *in* C". Of course I totally ignored that advice because who wouldn't want a self hosted C compiler after a few months of work? Through November I worked through the first few chapters, until I recognized why ignoring that advice was at my own peril. Modern compilers are generally implemented as a series of Abstract Syntax Trees and transformations on those trees, after turning the source into tokens, we first make an AST, and then turn that into a different tree to represent the intermediate format, make some passes over that for optimization, and then transform into a low level syntax tree closer to the assembly code, make some more passes over that, and eventually walk the final tree with a code generator to spit out x86_64 assembly language. The interesting part of the code is slowly adding support for more C syntax to the initial AST, and then tweaking the subsequent transformations to accomodate it. The *bulk* of the code in my C implementation turns out to be reams and reams of mechanical AST node data types, constructors, destructors and code to dump each type of tree in various states of transformation to the console for debugging purposes. C is not a great language for pushing around trees of pointers and strings and keeping track of all the memory, especially when you have to type all that code out by hand.
Having now recognized that I'm introducing and chasing many dumb bugs in the reams of boring mechanical syntax tree wrangling code, I've decided to start again, but with a code generator that reads an ASDL format description of each syntax tree and the relationships between the many (many!) different types of nodes, and generates all the boring mechanical code for me. Correctly. I think I'll be able to enjoy the rest of the book much more by working more carefully on just the interesting parts of the implementation, and letting the generator handle all the boring code after updating the ASDL description of each tree.
Today, after making a start at the ASDL generator, I realize that it too is actually reading an ascii source file with the ASDL tree description into a tokenizer and then making an AST of the ASDL description in memory, and then generating some C code as it walks that tree. So I can not only reuse a lot of the code I wrote earlier from v1 of my C compiler implementation, but I should be able to have it generate the boring parts of its own code if I feed it an ASDL description of the ASDL format itself...
Behold, an ASDL description of an AST for ASDL!
asdl = statement* statement = Header(char *text, unsigned textlen) | Production(char *name, unsigned namelen, term*) | Code(char *text, unsigned textlen) term = Terminal(char *name, unsigned namelen) | List(char *name, unsigned namelen) | Nonterminal(char *name, unsigned namelen, arg*) arg = Untyped(char *argname, unsigned argnamelen) | Typed(char *argname, unsigned argnamelen, char *argtype, unsigned argtypelen)
(apologies for lazily using C data types instead of real ASDL for the node fields)
Response: 20 (Success), text/gemini
Original URL | gemini://gary.vaughan.pe/2024-DecemberAdventure.gmi |
---|---|
Status Code | 20 (Success) |
Content-Type | text/gemini; charset=utf-8; lang=en-US |