Like many who fall into building programming languages as personal projects or research rabbit holes, there was a period where I became obsessed with type systems, formal languages, the specifics of what the lambda calculus is and what other computational calculi are.
One of my favourite things to do in programming languages with a strong type system is to find or otherwise design a type signature that leaves little room for "misuse". I am not unique in enjoying this, it's the premise of type-driven development and it's arguably one of the most interesting itches to scratch when you have a complex type system to play with.
It is not unexpected to want to, once you understand a set of tools, turn those tools in on themselves. It feels smart, it feels satisfying, it feels like you're coming up against the limits of a concept. Sometimes this is a quagmire, where you just get lost trying to define things from their surroundings in a way which becomes circular or muddied or distressing, but with formal systems there's the clarity of playing in a sandbox.
This post serves as an illustration of my favourite way of leveraging powerful type systems to make the path for adding type theories – specifically type checking and type inference – to toy formal language projects as clear as possible. To be clear: I am talking about outlining type checking, not implementing type checking. That's what you go on wikipedia and then google scholar and then JSTOR and other rabbit holes of research for. Or you follow your heart and decide to mess with these things from first-principles, as long as it's not in your heart purely to feel The Smartest Of All.
First things first, as we're talking about a typed language developed in a typed language, I'll be referring to them this way:
- Host language: The language we're writing things in, not the language we're designing.
- Host type system: the type system of the host language, not the type system we're designing.
- Toy1 language: The language we're designing, not the language we're writing things in.
- Toy type system: The type system we're designing, not the type system we're using while designing the type system.
The host language I'm using is haskell, which definitely isn't the most practical but I can leverage the host type system really well. I will also show how to do this in Rust, but that's more as an afterthought. What your host language will need is 1. Algebraic Data Types aka Enums but Fancy ✨ and 2. Type Parameters aka Generics aka Parametric Polymorphism. These are the two features that tend to make a programming language theorist's heart swell three times its size anyway. Appropriate host languages might then be:
- Java 20 (yes, really)
- Rust
- Many ancestors of Rust, like OCaml
- Haskell
- Any number of haskell derivatives, i.e. Idris, Purescript
Again, I will be using Haskell but please don't let that scare you off. It's just the language that this can be done the easiest and taken to the furthest length in.
Your toy language
Where does your toy type system "live" in your toy language? This is a question you'll need to answer. I'll be illustrating with a lambda calculus, but this is not necessarily the foundation you'll go with. Here are some questions to ask yourself, if you haven't already:
- Do you have a point in the toy language where variables are explicitly introduced? Not all formal languages do.
- Are there constants in your toy language whose type can be manually defined by you?
- Are there constants in your toy type system that can be defined by you?
- Are there rules you want types to follow in your toy type system?
This isn't an exhaustive list, I am not a type theorist and this is a blog post. Point is for this exercise, a language has a type system when its terms have types and those types have rules you follow. It's up to you to figure out how to do that, and it's up to you to see how the rules can be constructed and bent 2.
To illustrate asking these questions, I'll be using a lambda calculus (something we definitely know how to add a type system to), and asking these questions as I go through.
An example: a typed lambda calculus
Fig: a half-remembered definition of a typed lambda calculus. Please don't get lost here, I promise it's not too spooky.
The lambda calculus is... fine, fun, alright, good enough. And mathematical notation can be hard to decipher, so instead of the above alone here's a definition of it in the host language I'll be using:
data ToyType
-- When I refer to toy type expressions in the form `A -> B` I mean
-- effectively `Function "A" "B"`.`
= Function ToyType ToyType
| ... -- Not defining all other parts of the type, there can be many.
data LambdaCalc
= Var String
| Apply LambdaCalc LambdaCalc
| Abstract (String, ToyType) LambdaCalcThis is pretty kosher, though using strings for variables is a bit too specific and perhaps gaudy. But before getting distracted by the taunt of perfection, let's ask those questions from before:
- Do you have a point in the toy language where variables are explicitly introduced?
- Yes!
Abstractplays the part of the lambda operator, introducing a new variable to an expression body.
- Yes!
- Are there constants in your toy language whose type can be manually defined by you?
- Hmm, no. This lambda calculus doesn't have constants. But, we could add them. We'd want a stronger way of differentiating between variables and constants if we did, mind.
- Are there constants in your toy type system that can be defined by you?
- Yes! At the very least, we've defined
Functionas a type that describes function types. The rest of the type system is left up to the imagination, but it could include explicit primitives like integers, etc.
- Yes! At the very least, we've defined
- Are there rules you want types to follow in your toy type system?
- Yes! We want the following, in plain terms:
- Variables have types, and those types are introduced at the same point as those variables: in the
Abstractpart of the syntax tree. Applyreduces the type of its left expression if it's a function type, and its right expression should match the argument type of its left. In other words:- In our toy language we have two expressions
aandb.ahas typeXandbhas typeX -> Y. Apply b ashould type check as having typeY, because we applied an expression of typeXto a function of typeX -> Y.Apply a bshould fail to type check, as we're trying to apply the expressionbto the non-function expressionaApply b bshould fail to type check, as we're trying to apply an expression of typeX -> Yto a function of typeX -> Y, which only takes arguments of typeX.
- In our toy language we have two expressions
AbstractIntroduces a variable as a parameter to an expression, making it a function. So if it introduces a variablevof typeXinto an expression whose type isZ, then the expression it forms has the typeX -> Z
- Variables have types, and those types are introduced at the same point as those variables: in the
- Yes! We want the following, in plain terms:
Abstracting the toy type system
Another thing is that we don't want all types to always have to be annotated in every circumstance, as we can probably design a nice type inference algorithm.
So what, do we then extend the ToyType type so there's a lack of a type as one of its possibilities? That sounds horrible, an introduction of null to our toy type system also removes the certainty that when a variable has a type it has a type and not just a hole for a type to be derived into.
So do we then duplicate LambdaCalc into PartiallyTypedLambdaCalc? Where we replace ToyType with Maybe ToyType? That's also bizarre as a data structure design choice. In haskell, in typed functional programming languages, in well-typed languages overall we get trained into viewing things like this as things we can solve with the host type system. And yeah, there's a fairly obvious choice:
-- The toy type system is now a type parameter over our lambda calculus
-- in our host language implementation.
data LambdaCalc t =
Var String
Apply (LambdaCalc t) (LambdaCalc t)
Abstract (String, t) (LambdaCalc t)
deriving (Functor, ...) -- oh we could derive so many things!
We even get to do things such as define an untyped lambda calculus as LambdaCalc () using the Unit type. We get to define type erasure as just mapping \a -> () over LambdaCalc because we can derive Functor (or implement it manually, pick your poison) for it in the host language.
This is the point of typing things this way, we get more operations for free. We have to do less work overall and there are more rules that the compiler of the host language takes care of, instead of us making unchecked assumptions.
This is also an small example of an instance where Functor does not mean "Container". It's an easy misstep, but this works as a dweeby illustration of Functor's more nuanced semantics. Just because a data structure has a parameter does not mean it is supposed to be an alternative to an array, a set, a tree.
Aside: Context, , Monads
If you go through papers and web pages on type theories, you'll come across in the definitions. This is the representation for "Context" aka known and ongoing information or perhaps the most loathed term of all: state. Thinking through how we might do type checking operations we can imagine that when doing type checking on some toy languages we want an ongoing collection of named known expressions and their known types.
If we say that we mean that in our context we know of a symbol add_one with the type Int -> Int.
If we have a messy prototype complete with a helper function to parse basic type expressions then we could say that typecheck_context.get("add_one") == parse_type("Int -> Int") is true.
In practice, is implemented in any number of ways. It could be that you're passing around a key-value map that you're passing to every function, or there's a global variable everything is referring to, or a database, or a server, or a web crawler that finds anything that looks like typed symbol, or anything your heart desires.
Or, you could do what I'm doing in Haskell. In Haskell we have a very obvious tool for representing context aka ongoing information aka state that lets us dismiss all the explicit notation away: The monad. If you're not familiar with monads, just read this as "the way I'm accessing and modifying context". Or, watch this video. Or, read this blog post. Or, try using promises and callbacks in javascript with the then method.
The assumption from here on is that you know that when I talk about using monads in the host language of my choice, I'm talking about the structure that manages information that we know.
-- from the haskell package "mtl"
import Control.Monad.State.Lazy (State)
-- Typechecking could plausibly involve reading/writing from disk, network.
-- But that's getting into the weeds.
-- As an example here, I'm using a state monad whose state is a key-value
-- map between symbol names and their types. This is not a prescription of
-- what your type checking context should be, only an example.
-- And really, this example is flawed as it also needs to support
-- erroring and returning without having found a type for the
-- expression as well.
type ToyContextM = State (Map String ToyType)
Typing Type Checking
For me, the most important operation is finding the type of a fully-typed expression. If the type can be found, I know the following:
- The type of the expression overall (clearly).
- The type checking operation was successful.
- The types of any subexpressions, which may be added to the ongoing context.
This gives us more information than some isCorrectlyTyped :: Context -> LambdaCalc ToyType -> Bool function might give us.
So, we want an host language function that takes a toy language expression in its fully typed form and returns its type, plus any modifications to the context.
typecheck :: LambdaCalc ToyType -> ToyContextM ToyType
typecheck expr = -- It'd take a lot of space here to write this all out.
-- But, I believe in you. You can do it!
Typing Type Inference
We also can do something similar for type inference, which is an operation that takes a partially-typed expression and derives the types of the rest of the expression:
infer :: LambdaCalc (Maybe ToyType) -> ToyContextM (LambdaCalc ToyType)
infer expr = -- It's true! Google hindley-milner type inference to find out more.
Which we can express really nicely in the type system because we made the type system a parameter to LambdaCalc
Making a Typeclass/Trait/Interface
We have a structure, so let's be brash and make an assumption. Let's assume that all typed languages will work kind of the same way, or can be modified a little to work the same way. Let's assume we can make typeclasses that can encompass both inference and type checking analogously to how I've already presented them.
This will depend on host language features, but in Haskell3 it will look a bit like this:
{-# LANGUAGE FunctionalDependencies #-}
class Monad m => Typecheck m l t | l t -> m where
-- Here we define a typechecking operation by asking for a
-- function which, when given a typed expression, returns
-- its type within some monadic context `m`.
typecheck :: l t -> m t
class (Typecheck m l t) => Infer m l t | l t -> m where
-- Something that is inferrable must also be typecheckable,
-- as we have defined the type checking operation as being
-- able to determine a fully-typed expression's type and
-- we can imagine that inference would need that anyway.
infer :: l (Maybe t) -> m (l t)
Note that Infer requires a Typecheck implementation to be implemented. This is because I'm making an assumption that while doing type inference we'll want to type check subexpressions to know their types so that we can infer the correct type of the overall expression.
What does this give us, though? It puts two operations in a neat little box, it gives us direction. It is not the be-all end-all of what types or static analysis in general has to be. It's efficient communication to someone who wants to make their data structures behave a certain way. It formalises a data transformation in a way that asks the user to think about how they'd implement these functions for the data structures that matter to them.
It's also satisfying if you've been through bending your brain around type theories, typed languages. It's a way of presenting a general form in a way that makes you feel like something more true has been uncovered, even though all that's been done is the removal of specificities. We can now look at other abstract syntax trees for other languages and see how we might apply this way of doing things.
The most joyful/painful thing of all: Actually making your type system
This is where the scope of this blog post abruptly ends, as I am not instructing in type theory construction here. What I can do is once again offer a short list of what might be generously called insights:
- It's perfectly okay to do something that's already been done before.
- A language can have a type system and also be interpreted instead of compiled.
- Values don't have to only exist at "runtime".
- If you want to find some basic inspiration for toy type system features, look up the "lambda cube".
- Resolving free type variables honestly sucks and I'm really sorry. Terms you'll be looking for include: alpha equivalence, term unification, occurs check. Don't decide that you need to figure it out all by your self like I did when I was deep into this, reach out to others for help or to explain how to do things.
- Chase ambition sure, but we are a social lot. If you isolate yourself to solve all the problems you come up against, you'll find yourself lonely and perhaps without community to confidently share your work with or have appreciate your work. And don't work yourself into sickness.
- You don't need to have a runtime to have fun working on a language.
- What I laid out here won't necessarily work for everything.
- A language doesn't necessarily have only one type system.
Designing a type system for your language is largely about decision making and lining up what you want with what you can do. We are privileged as people who write software in this world to have access to a very tight feedback loop in the form of compilers and automated tests which can give us confidence in what we're doing very rapidly. Leverage your tools and keep an open mind.
Conclusion and postscripts
I hope this exercise in type-driven development as applied to building toy languages (domain-specific, programming, general, or otherwise) has been helpful and not too esoteric. This way of doing things lets us separate different stages of type checking from each other without having to repeat definitions of data structures, and without being able to plug wrong stages into each other, all while being able to tell some basics of what is known about the types during which processes.
Adding types to your languages can seem like an unnecessary and academic wander too far into the weeds. It doesn't have to be. The first step can be just looking at your toy language's AST and seeing if you can add a type parameter for extra data for your variables where they're introduced, and if that extra data can be types or thought of as types.
Next time: This, but for dependent types.
Postscript: Doing most of this in Rust
I do like Rust, and you can still do most of the cool stuff here in Rust. Rust doesn't have Higher-Kinded Types (or it does, but... I don't want to scare you off) so we can't get functors for free, or use monads, or easily state a trait that defines type checking and type inference.But we can do the important things, so Rust is perfectly better-than-okay for this way of doing things:
struct ToyContext {
// ...
}
enum ToyType {
// ...
}
enum LambdaCalc<T> {
Var(String),
Apply(Box<Self>, Box<Self>),
Abstract(String, T, Box<Self>)
}
type UntypedLC = LambdaCalc<()>
// Altneratively, ctx could be `&mut ToyContext` and
// you wouldn't need to return the `ToyContext`, but
// that's not the only way you could do things and would
// get in the way of parallelisation efforts. I'm just
// doing it this way here to line up with the haskell examples.
fn typecheck(
ctx: ToyContext,
expr: LambdaCalc<Type>,
) -> Result<(Ctx, Type), Err> {
// ...
}
fn infer(
ctx: ToyContext,
expr: LambdaCalc<Option<Type>>,
) -> Result<(ToyContext, LambdaCalc<Type>), Err> {
// ...
}
Note that we need to specify context as a variable, but that's fine. It's easier to understand from the point of view of someone not entrenched in Haskell. Crates such as nom do this same thing, and it works out fine for them.
If Haskell is too esoteric these days I might do a Rust Version of this post. Should get with the times, after all.
Postscript: Meta
I really wanted to call this "typing the type checker" but 1. not everyone has read Typing The Technical Interview and 2. what I'm doing here isn't nearly as complicated, I do not think it could live up to the reference.-
Toy here not meant to demean the language being developed, but instead because this is more about having fun with language design than getting serious about the guts and bureaucracy of things.
-
I mean this in an informal sense, not as a sweeping ontological statement of linguistics, type theory, mathematics.
-
While I'm enabling the
FunctionalDependenciesfeature here, isn't a necessary addition. It does let us specify that the language and type system are what determine the type of the monadic context.
