jamesmunns

Doing embedded stuff

Be nice, work hard. Build tools, tell stories. Start five year fights. Kein Ort für Nazis.


A list of all the other services I'm on around the internet


Posts with code highlighted with codehost by @wavebeem.


All posts licensed under CC-BY-SA 4.0, unless otherwise stated.


This is the rough version of a post I'd like to polish up later. Forgive me for not defining terms, or giving good didactical explanations. There will be a better post later, when I have more time and energy.

I work a lot with systems that talk to other systems. Technically, this puts them in the class of distributed systems, but unlike databases or something similarly complex, it's usually some kind of PC (laptop, webserver), talking to some kind of embedded system (a USB device, a sensor).

Regardless of complexity, these systems need to talk to each other to achieve some goal: obtaining sensor data, changing configuration values, etc.


My go-to example is a bootloader:

  • The "host" (a PC) holds the whole new firmware
  • The "client" (an embedded system) doesn't have room to hold the whole new firmware in memory
  • The host must send the firmware in pieces, to allow the client to handle it
  • The client will need to erase the old firmware in relatively large blocks, say 4KiB at a time
  • The host will need to send pieces of the new firmware in relatively small blocks, say 256 bytes at a time
  • This process needs to repeat until the full image has been transferred
  • The host finally say "good luck, boot into your new firmware"

This sucks to write

Writing this usually means writing three things:

  • A state machine for the client - code to walk through all the steps above
  • A state machine for the host - usually pushing the host through each step
  • A "wire protocol", for the actual kinds of message and the sequence of messages sent to achieve all the state transitions

This means I write three chunks of software in three places, and if I change any one part, I need to make sure the other two are consistent.

Yes, I know I'm writing three things. But that sucks, because I'm not writing three things: I'm writing one thing that exists in three places:

  • On the client
  • On the host
  • Over the wire

I haven't been able to find a good way of specifying this, at all.

For state machines, we have state machines diagrams, and a million ways to specify them.

For communication, we have flow diagrams, and a million other ways to specify them.

But how the hell do I specify a "protocol"? And how the hell do I write unfragile code that lets me write something that has a good chance of working out of the box?

Let's tarantino this problem

Let me make some assertions:

  • All programs are state machines. QED if I write some code, it's a valid way of specifying a state machine
  • Async programs are state machines that don't make me write the "wait for an event to occur" part of state machines
  • Traits (in Rust) are a way separating "abstract behavior" from "concrete implementation"

So how about some async rust code that describes a bootloader state machine?

async fn bootload(something: &mut Something) -> Result<(), ()> {
    // start the bootloading, and get the image size
    let image_end = something.start().await?;

    // While there are more sectors to erase/fill...
    while let Some(start) = something.next_sector() {
        // ...erase the sector...
        something.erase_sector(start, SECTOR_SIZE).await?;
        let sector_end = start + SECTOR_SIZE;
        let mut now = start;

        // ...while there is more data to write in this sector...
        while (now < sector_end) && (now < image_end) {
            now += something.write_next_chunk().await?;
        }
    }

    // All done, now boot.
    something.boot().await?;

    Ok(())
}

That seems like a decent representation of the system I described above.

Let's look at this one chunk at a time:

let image_end = something.start().await?;

We start the bootloading.

  • The host (the PC) should send a message to the client (the embedded system) to start the bootloading. It should send the total size of the image we want to send. It should then wait for an acknowledgement.
  • The client receives the message, and knows that the client would like to start a bootloading. It recieves the total size of the image the host wants to send. The client should also send an acknowledgement to say "yes go ahead thank you"
while let Some(start) = something.next_sector() { /* ... */ }

We iterate over the number of sectors. Both the host and client will know how many sectors need to be written.

// ...erase the sector...
something.erase_sector(start, SECTOR_SIZE).await?;

We need to erase a given sector.

  • The host should send a message to prompt the erasing of a section, and wait for an ACK
  • The client should erase the given sector, and send a message back one the erasing has completed.
let sector_end = start + SECTOR_SIZE;
let mut now = start;

// ...while there is more data to write in this sector...
while (now < sector_end) && (now < image_end) {
    now += something.write_next_chunk().await?;
}

We need to send manageable chunks of the new firmware image, until we have either filled this now-empty sector, or until we have finished transferring the whole firmware image.

  • The host should send one chunk at a time, waiting for an acknowledgement before proceeding.
  • The client should write the chunk of data, confirming once the write is complete
// All done, now boot.
something.boot().await?;

Once we are done transferring, we should prompt the client to boot into its new firmware.

  • The host should send a "okay all done now boot" message, and wait for confirmation
  • The client should say "okay here I go byyyeeeeee"

Okay that's cool but what?

So, I feel like I have sufficiently shown that the SAME vague state machine is suitable for both the host and client.

I could copy and paste this code to both projects, like an animal, or I could instead somehow make this generic over whether this state machine is running on a host, or running on a client.

Like I said before, this is what traits are for.

trait TraitMachine {
    const SECTOR_SIZE: usize;
    const CHUNK_SIZE: usize;

    fn next_sector(&mut self) -> Option<usize>;

    // These are all the joint state transitions, basically?
    async fn start(&mut self) -> Result<usize, ()>;
    async fn erase_sector(&mut self, start: usize, len: usize) -> Result<(), ()>;
    async fn write_next_chunk(&mut self) -> Result<usize, ()>;
    async fn boot(&mut self) -> Result<(), ()>;
    async fn abort(&mut self) -> Result<(), ()>;
}
async fn bootload<TM: TraitMachine>(tm: &mut TM) -> Result<(), ()> {
    // start the bootloading, and get the image size
    let image_end = tm.start().await?;

    // While there are more sectors to erase/fill...
    while let Some(start) = tm.next_sector() {
        // ...erase the sector...
        tm.erase_sector(start, TM::SECTOR_SIZE).await?;
        let sector_end = start + TM::SECTOR_SIZE;
        let mut now = start;

        // ...while there is more data to write in this sector...
        while (now < sector_end) && (now < image_end) {
            now += tm.write_next_chunk().await?;
        }
    }

    // All done, now boot.
    tm.boot().await?;

    Ok(())
}

Now I can actually specify TWO TOTALLY DIFFERENT BEHAVIORS for the host and client.

Let's look at the start method on the host:

impl TraitMachine for Host {
    // ...
    async fn start(&mut self) -> Result<usize, ()> {
        self.channel
            .send(Host2Client::Start {
                total_size: self.image.len(),
            })
            .await?;
        match self.channel.recv().await? {
            Client2Host::Starting => Ok(self.image.len()),
            _ => Err(()),
        }
    }
    // ...

We:

  • Send a "start" message to the client, with the total image size
  • We wait for a "starting" acknowledgement message from the client

Now lets look at the start method on the client:

impl TraitMachine for Client {
    // ...
    async fn start(&mut self) -> Result<usize, ()> {
        match self.channel.recv().await? {
            Host2Client::Start { total_size } if total_size <= Self::TOTAL_SIZE => {
                self.image_len = Some(total_size);
                self.channel.send(Client2Host::Starting).await?;
                Ok(total_size)
            }
            _ => Err(()),
        }
    }
    // ...
}

We:

  • Wait for a "Start" message from the host, with the total image size
  • We send a "starting" acknowledgement to the host

Mission Accomplished

Okay, is this actually better?

I dunno. I haven't seen anyone do it like this before, and seems neat that I can use out-of-the-box tools of Rust to achieve this.

The whole demo is here, and you can run it for yourself to see how it works.

I don't think this could be a "library", but more like a "pattern", similar to how things like xtask work.

Have a cool distributed state machine you work on that could be represented like this? Lemme know.

Oh, I call these "phase locked" state machines, or "lockstep" state machines, because they are the same state machine running in two places, but tightly "locked" to the same state or phase.


You must log in to comment.

in reply to @jamesmunns's post:

I had someone else mention session types! The initial examples I can find are... pretty dry and not very illuminating, if you have any good resources (blog posts, papers) that are more practical than theoretical, I'd love to read them!

Interested to hear what you think when you've finished reading the post.

It’s been a while since I last tried to mess with session types (it was probably like 2019), the Idris examples are the best I’ve seen. My impression I’m piecing back together is that basically you’d write the type the same way you wrote bootstrap, and it would produce a type that can be split in two and both halves used to write the two ends of the channel. I’ll have to try messing with the Idris session types library and see if I can write up this protocol with it.

"But how the hell do I specify a "protocol"? And how the hell do I write unfragile code that lets me write something that has a good chance of working out of the box?

I think talking about a stateful protocol in terms of a state machine is a good way to describe it to a human, in terms of how it's supposed to work. But I don't think it really gets us directly to unfragile implementations. The core issue, at least in the domain where protocols typically matter, is that each end of the conversation may have a different idea of protocol state. This could be due to buggy implementation, lossy communications, a malicious actor on the other end, cosmic rays, who knows.

The way I've dealt with this in the past is by starting with a protocol state machine, as you did here, but then moving into specific state machine specifications for each end of the communication. This is the sweet spot for model checkers; they're designed for exactly this kind of verification. I've used TLA+ for this in the past, pretty successfully.

Of course, a TLA+ model is not the code. It may be that if you write your endpoints against the right abstraction, you could verify the properties you care about with https://github.com/model-checking/kani or similar. Not sure, but it would be pretty sweet if you could.

w.r.t. session types: I've used this in the past as a way to make sure the types in your code are a direct reflection of an endpoint's observation of the protocol state. It doesn't solve the 'shit on the wire is painful' design problem, but it does let you statically check that you're only taking allowed transitions for one of the endpoint state machines. BUT, at least in Rust, I've session typing to be more trouble than it's worth. I inevitably build the non-session-typed version of a protocol client as well, which I always seem to need because the types don't /quite/ work out otherwise. Ymmv, of course.

Honestly

shit on the wire is painful

is actually super untrue for me! I use postcard for everything, so I typically don't even worry about the wire format - it's just data types in and data types out. "what data to send, when, and what messages are valid at what stage" are the much harder questions.

TLA+ is a good point, and I'll have to check out kani!

I've been reading the erlang paper lately, and am really coming around to "shits always gunna break, just handle brokenness at the system level", which is honestly very appealing. We still need a good way to specify the correct "happy path" tho, even if we defer all "non happy path" steps as "immediately bail and reset the state machine (exceptions in erlang, error returns in Rust).

The core issue, at least in the domain where protocols typically matter, is that each end of the conversation may have a different idea of protocol state.

I think this is exactly the thrust that I'm getting at - I want to be able to specify the protocol as an atomic unit, rather than implement it in two places. At least for my current work - where I'm writing the software for both ends of the wire.

In degenerate cases, where the state machines HAVE fallen out of phase, we declare that is an error!

While it is sometimes necessary to separate the different parts (different langs, protocol versions, etc.), IMO that should be the weird case, not the default option.

Funny enough, I have a case almost exactly like the one you described.

I can see this being particularly useful for implementing chatty stateful protocols that encapsulate a single complex operation (a "procedure").

[Reasoning out loud] *However you must be absolutely sure the state machine "contract" matches for both the client and server (either by versioning or as a checksum).

Defer any reliability concerns to a protocol layer like TCP. If someone violates the protocol, the other end resets the connection.*

I wonder if this would possibly also be useful in-process.