This post is Part 1 of a series of posts about my project to create a C compiler within a boot sector. The code for the project is on github.
Part 0 Part 1 Part 2
In Part 0, I talked about how x86 computers boot up from nothing, and the environment that the first code on the system gets. Part 1 will be about what I did in that limited and hostile environment and how I managed to fit a compiler (even a very bad one) in 446 bytes.
The Code
This project uses nasm as the assembler of choice. No particular reason for nasm over anything else, though I was set on using Intel Syntax for the assembly, which does eliminate some choices.
The first order of business is, of course, to set up a basic file that can boot, even if it does nothing.
BITS 16 ; real mode is 16 bit code
org 0x7C00 ; BIOS drops us at this address
main:
jmp $
TIMES 0x1BE-($-$$) db 0x00
; THIS WOULD NORMALLY BE THE MBR
; REMOVED FOR SIMPLICITY (and qemu doesn't need it)
TIMES 64 db 0x00
dw 0xAA55 ; boot signatureThis code is the bare minimum to run. The Master Boot Record is not shown in that snippet because it doesn't really matter for our purposes. Additionally, the label main is not special. The first byte in the file will be run, so you need to be careful to never put anything before main, or have a jump to main as the first thing. The label is simply there for organizational purposes.
This code doesn't do anything though. jmp $ means "jump to the start of this line". This will repeat forever, keeping the CPU in an infinite loop. Assembling the code with nasm -w+all -w+error -f bin boot.asm -o boot.bin results in a 512 byte binary file, which is our boot sector. Running this sector with qemu-system-x86_64 -drive format=raw,file=boot.bin -s greets us with a screen, with some text that says "booting from hard disk...", and nothing happens. This is expected though, we're not doing anything. The -s flag tells qemu to allow a debugger to connect, so we can connect gdb with gdb --init-eval-command="target remote :1234" We will see that the program is indeed at 0x7C00 and if we step forward, it remains there forever.
This is what the development cycle looks like, make some changes to the asm, build it, launch qemu, and then inspect the state with a debugger to see if things are correct. Because the output of this compiler is x86 machine code bytes in memory (and due to space limitations), all debugging is done though gdb. There's not much that's useful to print to an output, and even if there were, inspecting the generated program will require looking at raw memory anyway. I am now intimately familiar with the code that my compiler generates, to the point that sometimes I don't even use a disassembler to see what's wrong, I can identify it by the patterns I expect.
If you are interested in working in this environment yourself, good luck. I find it highly interesting, but it's not a particularly useful skill these days. If you're actually interested in modern bootloaders and operating systems, UEFI is what everything uses now, which provides a much nicer 64 bit environment and standardized APIs for doing the things that a bootloader needs to do. Writing a Real Bootloader is a very different project, but one that provides similar joy in my opinion, and you might end up with something you could actually use. Piece by piece replacing every part of your computer's software stack with something you created is ideal, in my opinion, and it would be super fun to run a bootloader I wrote, but that may be a future project.
Implementing the Compiler
Input Program
The first thing that the compiler needs to do is actually get a source program. I chose to send the program over the serial port so that the binary does not have to include the source. This also makes it possible to type a program by hand, if you connect to the serial port, though there is no support for editing or backspace or moving up or down lines.
The basic code to read from the serial port looks like this
mov di, 0x6000 ; location in memory reserved for the source code
._load_program:
mov dx, COM1_PORT + 5 ; status port
in al, dx ; read from the status port
and al, 0x01 ; bit 0 is set if there is a character ready
jz ._load_program ; loop until the character is ready
mov dx, COM1_PORT ; data port
in al, dx ; read from the data port
stosb ; *di = al; di += 1;
or al, al ; exit when a 0x00 byte is read
jnz ._load_programThere is additional code that is not shown to initialize the serial port, since it has to be told what rate to send bits and other configuration. This code simply places the input program in memory, where it will later be read by the compiler.
In total, the serial port code totals to 0x2F bytes (mostly due to the initialization code). 0x1BE - 0x2F = 0x18F bytes remain.
Tokenization
The compiler needs to split the input program into tokens. In most compilers this is a fairly complicated process to allow for optional whitespace and line breaks and other characters. x=4*(x+2) should act the same as x = <newline> 4 * ( x + 2 ). My compiler, of course, throws that out the window. Tokens are delimited by whitespace. Note that all characters with ascii value 0x20 or less are considered whitespace. If there's not a whitespace, my compiler treats it as still the same token. x=4*(x+2) is considered One Token to my compiler. This greatly simplifies the implementation of the tokenizer, at great cost to readability of code. However, since C allows lots of weird whitespace, my compiler requiring it means that this is still a subset of C. This is what some code might look like in my version of C:
int meow ;
int main (){
meow = 20 ;
while( 1 == 1 ){
read_char ();
print_char ();
}
}At the surface, this doesn't look so bad. However, all of those weird spaces are required. There must be whitespace:
- between a variable declaration and a semicolon
- between the name of a function and the token
(){(more on this later) - between the left hand side of an assignment, the
=, and the right hand side of an assignment - between the left hand side of a binary expression, the binary operator, and the right hand side of the binary expression
- before the semicolon after every statement
- between the name of a function and the token
();to call it (once again a weird token) - before the closing brace of any block.
- many other weird places within
ifandwhileandasm(more weird tokens here)
This is surprisingly difficult to get right. Formatters will scream at you if you try to code like this, so you just have to remember the rules. Additionally, there's no error handling. At all. If the compiler expects a token, that token better exist exactly as the compiler expects since, if it doesn't, all hell will break loose and you will certainly not get the output you expect (most of the time the compiler crashes and starts executing random memory).
The tokenizer is deceptively simple. In general its algorithm is:
<skip whitespace>
ident = 0
get_character()
while character > 0x20
ident = 10 * ident + (character - 0x30)
get_character()
return ident
This is a terrible algorithm for identifying anything. I copied the idea from another project that aimed to make a small C compiler1. The idea here is that integer literals get converted into their value, and everything else just so happens to produce a number that looks Varied Enough for use. This however does mean that it is impossible to differentiate between numbers and not numbers, for example ; and 11 have the same value. This is fortunately not an issue, as users will write well formed programs which always specific formats, so we never need to make that distinction.
The compiler only needs to know about a few different tokens and they are as follows:
| NAME | Value | meaning |
|---|---|---|
| INT | 0x18F4 | int |
| INT_PTR | 0xF982 | int* |
| OPEN_PAREN | 0xFFF8 | ( |
| CLOSE_PAREN | 0xFFF9 | ) |
| CLOSE_BRACE | 0x004D | } |
| FN_ARGS | 0xFCE5 | (){ |
| FN_CALL | 0xFCA5 | (); |
| SEMICOLON | 0x000B | ; |
| IF_START | 0x1858 | if( |
| WHILE_START | 0xDA02 | while( |
| ASM_START | 0x973E | asm(" |
| DOT_BYTE | 0x9491 | .byte |
| ASM_END | 0xFA4D | "); |
| STAR | 0xFFFA | * |
| AND | 0xFFF6 | & |
| RETURN | 0x7DA7 | return; |
| PLUS | 0xFFFB | + |
| MINUS | 0xFFFD | - |
| OR | 0x004C | | |
| XOR | 0x002E | ^ |
| SHL | 0x0084 | << |
| SHR | 0x009A | >> |
| EQUAL_EQUAL | 0x008F | == |
| NOT_EQUAL | 0xFF77 | != |
| LESS | 0x000C | < |
| LESS_EQUAL | 0x0085 | <= |
Some notes on these tokens:
- comments don't exist, functionality is more important
(){is one token. functions may not have arguments. use global variables instead.();(see above)if(is one token. this has tripped me up several times, since in many languages, the convention is to place a space between theifand the(.while((same as if)return;is one token. functions may not return values. use global variables instead.- only
<and<=exist. swap the order of the arguments if you want a greater than comparison. - the
asm(",.byte, and");tokens exist to implement theasmextension. This is compatible with gcc and clang's extensions. It is used likeasm(" .byte 235 ; .byte 254 ; ");Only decimal literals exist, so this is very inconvenient to write.
The grammar for this implementation of C is also very limited. (expr) does not exist, so you cannot have compound expressions, only simple binary expressions. Function calls can only appear alone, as a statement (since they can't return a value). Declaring and assigning to variables must be separate, there's no variable initializers. Only global variables exist. Local variables require a surprising amount of implementation work to do with scopes and the stack, so you can only declare global variables. The lack of local variables is possibly one of the worst "features" of this language. Declaring variables away from their usage makes it easier to misuse them, and leads to awful conventions to prevent collisions of names, like int _read_char_state ; (this variable really wants to be a local named state).
Compiler Architecture
Several registers have meanings that are consistent throughout the compiler:
axis used for holding token values or for emitting generated codebxis scratch, typically pointers, because onlybx,si, anddican be used for pointers in real mode, and the other two are in usecxholds the location in memory where a specific function or variable isdxis mostly used for generating code, otherwise scratchsiis the pointer to the start of the next token in the source codediis the pointer to the next place to emit bytes for the compiled code
The compiler processes tokens by getting tokens sequentially and using recursive descent. It starts by looking for top level declarations, which are variables and functions. The type of the declaration is read first. Variables must have type int or int* while functions may have type void, int, or int*, but should use void, since return values do not exist. If the type of a declaration is int*, the location of the declaration generated program is incremented by 1. All other declarations are aligned to 2 bytes, so this allows code to determine if a variable is a pointer or not without having to store that data in a separate table. For all declarations, first an entry is placed in a location of memory that acts as an ident -> pogram location map. This map is 128KiB large, since there's 65536 possible ident numbers and each one uses two bytes to represent the location of the declaration in the generated program. To distinguish a function declaration from a variable declaration, the code looks for a (){ token after the name of the variable or function. Since functions cannot have arguments, this suffices to identify all functions.
Generating code for variables is very simple - the compiled code pointer is incremented by 2 to reserve space for the variable, and the compiler loops to check for the next top level declaration.
The rest of the compiler's code is dedicated to generating code for functions, including all the different types of statements and expressions that can be within them.
There are 7 top level statements within a function body: variable assignment, assigning behind a pointer, function calls, if, while, return, and asm.
Function Calls
This was actually the very first thing that I implemented within the compiler, because it's surprisingly simple! The call x86 instruction implements everything we need, since we don't have locals or arguments. There exists a call <register> form too, so the compiler just generates
mov bx, ADDRESS
call bxTo obtain the address to call, the compiler looks up the address of the function in the ident -> address map. This is a constant over the lifetime of the program, so the generated code only needs to load that constant and call it.
Assignment
Variable assignment (var = expr) and assignment through a pointer (* ptr = expr) share most of their logic, but are some of the most complicated statements. Assignment has to remember the variable that it is assigning to, then parse and codegen a complicated arbitrary expression, and then store the result of that expression in the variable, or in the case of assignment through a pointer, in the memory location pointed to. The generated code for the assignment itself is fairly simple, it assumes that there's a value in ax and then stores it to the appropriate location.
; assume expr result in ax
; variable assignment
mov bx, VARIABLE_ADDRESS
mov word [bx], ax ; store to the variable
; assignment through a pointer
mov bx, VARIABLE_ADDRESS
mov bx, word [bx] ; get the pointer's value
mov word [bx], ax ; store to that addressExpressions
A vast majority of the complexity of the compiler is in the expression parsing and codegen required for assignment. Expressions can be either unary expressions (a single Thing, like a number, or another variable), or binary expressions, which are a unary expression followed by a binary operator, and then another unary expression. Expressions cannot contain an expression themselves, which makes writing complex code difficult. This is especially annoying in if and while, where code frequently has to have intermediate variables to combine conditions.
cond = a & b ;
cond = cond & c ;
cond = cond & d ;
if( cond | other ){
}
However, despite the flaws in using these expressions, the general design of the implementation is generally easy.
parse_unary();
if next_token() != ";" {
binary_op = parse_binary_operator();
parse_unary();
emit(binary_op); // in this order to combine the two unary expressions
}
Some notes on the implementation:
The code for parsing and generating the unary expressions on the left hand side and right hand side of the binary operator is identical. Since the generated code always writes the value of the unary expression to ax and uses bx for addresses, we need to save the left hand side of the operator in cx so that the right hand side can generate into ax, and then swap them back so that the left hand side is in ax and the right hand side is in cx. Thankfully this is only a few bytes to emit a xchg cx, ax before and after the right hand side's code. By ensuring that the LHS of the expression is in ax and the RHS is in cx, the code that emits the binary operator only needs to pick a sequence of bytes depending on the operator. It might emit add ax, cx to add two values or cmp ax, cx; sete al to check if two values are equal, but it doesn't concern itself with complicated register allocation.
There are a lot of assumptions in the binary operator code that assume that the user wrote a valid binary operator. Only the last byte of the idents are checked, meaning that there are 256 different ident numbers (and an infinite number of strings) that the compiler will consider to be a binary operator. If the token provided is not a binary operator, it's highly likely that a table will be indexed out of bounds and the generated program will be garbage.
Numbers and variables are indistinguishable. The way that the compiler knows a variable is being accessed is if that ident number has an entry in the address table. This means that certain integer literals cannot be used. However, it should be possible to use the sum or difference of integer literals to form any integer that you want. I have yet to encounter this problem in my code, so it's hopefully Unlikely Enough. This is something that cannot be improved however, not without completely redoing the way that idents are managed.
In C, doing pointer + integer adds sizeof(type) * integer bytes to the pointer. Additionally, subtracting pointers divides the result of the subtraction by sizeof(type), giving the number of elements between the pointers. These are both special cased in the compiler. If the left hand side of a binary operation is a pointer (determined by its variable address being odd) and the operator is a +, twice as much is added to the pointer. If the left hand side is a pointer and the operator is a -, the result of the addition is divided by 2. If the left hand side is a pointer and the operator is anything else, normal behavior is used, which allows ==, !=, <, and <= to work without concern. If the left hand side of an expression is not a pointer, none of these special cases are used.
ptr == ptr was actually doing ptr == (ptr * 2). Then, once I fixed that, I realized that ptr - ptr isn't equal to ptr - (ptr * 2) and that I had just so happened to pick a pair of pointers for testing that made it look like it worked. These bugs were then fixed and the previous paragraph updated.
if and while
Weirdly, implementing if and while is not very complicated. The basic structure of an if is
<expression> ; result in ax
or al, al
je end ; skip the block if the condition is not met
<block>
end:
This means that the expression code and the block code (used for generating function bodies) can both be reused. The compiler only needs to emit a conditional jump before the block. One complicating factor is that the target for the conditional jump is not known until the block is generated. This is solved by storing the location of the jump and then fixing it to point to the correct location after generating the block.
while can be composed similarly
while_start:
<expression> ; result in ax
or al, al
je while_end
<block>
jmp while_start ; loop to check the condition again
while_end:
This is very similar to an if, but with an extra jmp at the end, and a slightly different jump target if the condition fails. And indeed, in the compiler, that's exactly how it's implemented. The compiler generates an if, adjusts the jump target for the if forwards by 3 bytes and then generates a jump back to the start of the while loop.
All of these design choices, with all of their flaws, are specifically to reduce the size of the compiler. As much as possible needs to be ripped out of the language to reduce the number of things to implement, and then the implementation needs to be exceptionally optimized to be able to fit everything left. Even the choice to omit > and >= is critical, implementing those would cost 6 bytes under the current implementation, but 6 bytes is not worth the convenience of those operators.
The next post will be about exactly how I saved so much space in the implementation of the compiler. There's actually only a few particularly tricky things that I had to do to get this space, but I will go over all of them in detail.




