This is the rough version of a post I'd like to polish up later. Forgive me for not defining terms, or giving good didactic explanations. There will be a better post later, when I have more time and energy.
I work a lot with systems that talk to other systems. Technically this puts them in the class of distributed systems, but unlike databases or similarly complex software, it's usually some kind of PC (a laptop, a web server) talking to some kind of embedded system (a USB device, a sensor).
Regardless of complexity, these systems need to talk to each other to achieve some goal: obtaining sensor data, changing configuration values, etc.
My go-to example is a bootloader:
- The "host" (a PC) holds the whole new firmware
- The "client" (an embedded system) doesn't have room to hold the whole new firmware in memory
- The host must send the firmware in pieces, to allow the client to handle it
- The client will need to erase the old firmware in relatively large blocks, say 4KiB at a time
- The host will need to send pieces of the new firmware in relatively small blocks, say 256 bytes at a time
- This process needs to repeat until the full image has been transferred
- The host finally says "good luck, boot into your new firmware"
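As a quick sanity check on the example numbers above (illustrative constants, not from any real device): with 4KiB erase blocks and 256-byte chunks, each erased sector absorbs 16 chunk messages before the loop moves on.

```rust
// Illustrative constants matching the example numbers above.
const SECTOR_SIZE: usize = 4096; // erase granularity: 4 KiB
const CHUNK_SIZE: usize = 256; // transfer granularity: 256 bytes

// How many host->client chunk messages does one erased sector absorb?
fn chunks_per_sector() -> usize {
    SECTOR_SIZE / CHUNK_SIZE
}
```

So every erase is followed by (up to) 16 writes before the next erase.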
This sucks to write
Writing this usually means writing three things:
- A state machine for the client - code to walk through all the steps above
- A state machine for the host - usually pushing the host through each step
- A "wire protocol", for the actual kinds of messages and the sequence of messages sent to achieve all the state transitions
This means I write three chunks of software in three places, and if I change any one part, I need to make sure the other two are consistent.
Yes, I know I'm writing three things. But that sucks, because I'm not writing three things: I'm writing one thing that exists in three places:
- On the client
- On the host
- Over the wire
I haven't been able to find a good way of specifying this, at all.
For state machines, we have state machines diagrams, and a million ways to specify them.
For communication, we have flow diagrams, and a million other ways to specify them.
But how the hell do I specify a "protocol"? And how the hell do I write unfragile code that lets me write something that has a good chance of working out of the box?
Let's tarantino this problem
Let me make some assertions:
- All programs are state machines. QED: if I write some code, it's a valid way of specifying a state machine
- Async programs are state machines that don't make me write the "wait for an event to occur" part of state machines
- Traits (in Rust) are a way of separating "abstract behavior" from "concrete implementation"
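To make the first two assertions concrete: this is roughly the enum-and-match state machine you'd write by hand for one slice of the protocol, including the explicit "wait for an event" plumbing that async sugar hides. All names here are illustrative.

```rust
// A hand-written state machine for a fragment of the protocol.
// An async fn compiles down to something of this shape.
#[derive(Debug, PartialEq)]
enum Step {
    AwaitingAck,
    Writing { chunks_left: u32 },
    Done,
}

enum Event {
    Ack,
    ChunkWritten,
}

fn advance(step: Step, event: Event) -> Step {
    match (step, event) {
        // This is the "wait for an event to occur" part
        // we'd rather not write by hand:
        (Step::AwaitingAck, Event::Ack) => Step::Writing { chunks_left: 16 },
        (Step::Writing { chunks_left }, Event::ChunkWritten) if chunks_left > 1 => {
            Step::Writing { chunks_left: chunks_left - 1 }
        }
        (Step::Writing { .. }, Event::ChunkWritten) => Step::Done,
        // Unexpected events leave the state unchanged.
        (step, _) => step,
    }
}
```

Every `.await` in an async fn is one of these "wait here until an event arrives" states, generated for free.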
So how about some async Rust code that describes a bootloader state machine?
```rust
async fn bootload(something: &mut Something) -> Result<(), ()> {
    // start the bootloading, and get the image size
    let image_end = something.start().await?;
    // While there are more sectors to erase/fill...
    while let Some(start) = something.next_sector() {
        // ...erase the sector...
        something.erase_sector(start, SECTOR_SIZE).await?;
        let sector_end = start + SECTOR_SIZE;
        let mut now = start;
        // ...while there is more data to write in this sector...
        while (now < sector_end) && (now < image_end) {
            now += something.write_next_chunk().await?;
        }
    }
    // All done, now boot.
    something.boot().await?;
    Ok(())
}
```
That seems like a decent representation of the system I described above.
Let's look at this one chunk at a time:
```rust
let image_end = something.start().await?;
```
We start the bootloading.
- The host (the PC) should send a message to the client (the embedded system) to start the bootloading, including the total size of the image it wants to send. It should then wait for an acknowledgement.
- The client receives the message, and now knows that the host would like to start bootloading, and the total size of the image the host wants to send. The client should send an acknowledgement back to say "yes go ahead thank you"
```rust
while let Some(start) = something.next_sector() { /* ... */ }
```
We iterate over the sectors that need to be erased and filled. Both the host and the client can work out how many sectors there are, since both know the image size and the sector size.
```rust
// ...erase the sector...
something.erase_sector(start, SECTOR_SIZE).await?;
```
We need to erase a given sector.
- The host should send a message to prompt the erasing of a section, and wait for an ACK
- The client should erase the given sector, and send a message back once the erasing has completed.
```rust
let sector_end = start + SECTOR_SIZE;
let mut now = start;
// ...while there is more data to write in this sector...
while (now < sector_end) && (now < image_end) {
    now += something.write_next_chunk().await?;
}
```
We need to send manageable chunks of the new firmware image, until we have either filled this now-empty sector, or until we have finished transferring the whole firmware image.
- The host should send one chunk at a time, waiting for an acknowledgement before proceeding.
- The client should write the chunk of data, confirming once the write is complete
```rust
// All done, now boot.
something.boot().await?;
```
Once we are done transferring, we should prompt the client to boot into its new firmware.
- The host should send a "okay all done now boot" message, and wait for confirmation
- The client should say "okay here I go byyyeeeeee"
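Collecting the four exchanges above, the wire protocol boils down to two small message enums, one per direction. The `Start`/`Starting` variants match the code further down; the other variants are my assumed shapes for the remaining steps.

```rust
// Messages from the host (PC) to the client (embedded system).
// Only Start appears in the snippets below; the rest are assumed shapes.
#[derive(Debug, PartialEq)]
enum Host2Client {
    Start { total_size: usize },
    EraseSector { start: usize, len: usize },
    WriteChunk { offset: usize, data: Vec<u8> },
    Boot,
}

// Messages from the client back to the host.
#[derive(Debug, PartialEq)]
enum Client2Host {
    Starting,
    SectorErased,
    ChunkWritten { len: usize },
    Booting,
}
```

Every state transition in the shared state machine corresponds to one request/acknowledgement pair from these two enums.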
Okay that's cool, but so what?
So, I feel like I have sufficiently shown that the SAME vague state machine is suitable for both the host and client.
I could copy and paste this code to both projects, like an animal, or I could instead somehow make this generic over whether this state machine is running on a host, or running on a client.
Like I said before, this is what traits are for.
```rust
trait TraitMachine {
    const SECTOR_SIZE: usize;
    const CHUNK_SIZE: usize;

    fn next_sector(&mut self) -> Option<usize>;

    // These are all the joint state transitions, basically?
    async fn start(&mut self) -> Result<usize, ()>;
    async fn erase_sector(&mut self, start: usize, len: usize) -> Result<(), ()>;
    async fn write_next_chunk(&mut self) -> Result<usize, ()>;
    async fn boot(&mut self) -> Result<(), ()>;
    async fn abort(&mut self) -> Result<(), ()>;
}
```
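Note that `next_sector` is the only non-async method: it's pure bookkeeping that both sides can compute locally, with no message exchange. One illustrative way it might be implemented (names and struct are my own, not from the demo): walk sector-aligned offsets until the image length is covered.

```rust
// Illustrative bookkeeping for `next_sector`: both sides know the image
// length (from the Start exchange) and the sector size, so both can
// derive the same sequence of sector start offsets independently.
const SECTOR_SIZE: usize = 4096;

struct Progress {
    image_len: usize,
    next: usize,
}

impl Progress {
    fn next_sector(&mut self) -> Option<usize> {
        if self.next >= self.image_len {
            return None;
        }
        let start = self.next;
        self.next += SECTOR_SIZE;
        Some(start)
    }
}
```

Because both sides run the same computation on the same inputs, neither needs to tell the other which sector comes next.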
```rust
async fn bootload<TM: TraitMachine>(tm: &mut TM) -> Result<(), ()> {
    // start the bootloading, and get the image size
    let image_end = tm.start().await?;
    // While there are more sectors to erase/fill...
    while let Some(start) = tm.next_sector() {
        // ...erase the sector...
        tm.erase_sector(start, TM::SECTOR_SIZE).await?;
        let sector_end = start + TM::SECTOR_SIZE;
        let mut now = start;
        // ...while there is more data to write in this sector...
        while (now < sector_end) && (now < image_end) {
            now += tm.write_next_chunk().await?;
        }
    }
    // All done, now boot.
    tm.boot().await?;
    Ok(())
}
```
Now I can actually specify TWO TOTALLY DIFFERENT BEHAVIORS for the host and client.
Let's look at the start method on the host:
```rust
impl TraitMachine for Host {
    // ...
    async fn start(&mut self) -> Result<usize, ()> {
        self.channel
            .send(Host2Client::Start {
                total_size: self.image.len(),
            })
            .await?;
        match self.channel.recv().await? {
            Client2Host::Starting => Ok(self.image.len()),
            _ => Err(()),
        }
    }
    // ...
}
```
We:
- Send a "Start" message to the client, with the total image size
- Wait for a "Starting" acknowledgement message from the client
Now let's look at the start method on the client:
```rust
impl TraitMachine for Client {
    // ...
    async fn start(&mut self) -> Result<usize, ()> {
        match self.channel.recv().await? {
            Host2Client::Start { total_size } if total_size <= Self::TOTAL_SIZE => {
                self.image_len = Some(total_size);
                self.channel.send(Client2Host::Starting).await?;
                Ok(total_size)
            }
            _ => Err(()),
        }
    }
    // ...
}
```
We:
- Wait for a "Start" message from the host, with the total image size, and check that the image fits in our storage
- Send a "Starting" acknowledgement back to the host
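To see the two halves interlock, here's the `start` step stripped down to a blocking sketch over std channels. Async, the trait, and the remaining message variants are all omitted, and every name here is illustrative rather than from the demo.

```rust
use std::sync::mpsc;

#[derive(Debug)]
enum Host2Client {
    Start { total_size: usize },
}

#[derive(Debug)]
enum Client2Host {
    Starting,
}

// Host side: announce the image size, then wait for the ack.
fn host_start(
    tx: &mpsc::Sender<Host2Client>,
    rx: &mpsc::Receiver<Client2Host>,
    image_len: usize,
) -> Result<usize, ()> {
    tx.send(Host2Client::Start { total_size: image_len })
        .map_err(|_| ())?;
    match rx.recv().map_err(|_| ())? {
        Client2Host::Starting => Ok(image_len),
    }
}

// Client side: wait for the announcement, check that it fits, ack it.
fn client_start(
    rx: &mpsc::Receiver<Host2Client>,
    tx: &mpsc::Sender<Client2Host>,
    capacity: usize,
) -> Result<usize, ()> {
    match rx.recv().map_err(|_| ())? {
        Host2Client::Start { total_size } if total_size <= capacity => {
            tx.send(Client2Host::Starting).map_err(|_| ())?;
            Ok(total_size)
        }
        _ => Err(()),
    }
}
```

Spawn `client_start` on one thread, call `host_start` from another, and both sides walk through the same transition together: that's the "lockstep" in action.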
Mission Accomplished
Okay, is this actually better?
I dunno. I haven't seen anyone do it like this before, and it seems neat that I can use the out-of-the-box tools of Rust to achieve this.
The whole demo is here, and you can run it for yourself to see how it works.
I don't think this could be a "library"; it's more of a "pattern", similar to how things like xtask work.
Have a cool distributed state machine you work on that could be represented like this? Lemme know.
Oh, I call these "phase locked" state machines, or "lockstep" state machines, because they are the same state machine running in two places, but tightly "locked" to the same state or phase.
