In which an arrogant little bug scuttles around the ancient halls of message protocols.

This weekend I tackled some ugliness that crept in while I was getting things working for my pt.2 blog post.

Some Background

Architecturally, this project splits concerns across a few crates:

  • networking is the meat & potatoes; it has the precious GameConnection abstraction as well as the vital Stream abstraction. GameConnections are a collection of 1 special "system" Stream and up to 31 slots of general-purpose Streams (a rough sketch of that shape follows this list).
  • protocol has all the funny little messages that go onto and off of the wire, so byte-level serde is its main concern, and its main customer is networking.
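
To make that shape concrete, here's an illustrative sketch of how I think about it; the names and layout here are stand-ins, not the real networking crate's types (which carry much more state):

const GENERAL_STREAM_SLOTS: usize = 31;

// Illustrative sketch: a connection owns one dedicated "system" stream plus
// up to 31 general-purpose stream slots.
struct Stream {
    // framing, reliability/ordering configuration, buffers, etc.
}

struct GameConnection {
    // the one special "system" stream every connection has
    system: Stream,
    // general-purpose streams, allocated as the application needs them
    general: [Option<Stream>; GENERAL_STREAM_SLOTS],
}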

Those two are the "heart" of everything. These building blocks are not, however, user-friendly or ergonomic. They assume you understand the desired encryption scheme, the nuances of the handshake procedures, the notion of the state machine that a GameConnection scrunches itself through during said handshakes, etc.

So we add two more crates:

  • server_traffic contains all the glue for "central server" or "hub and spoke" architectures. There is one primary many-to-many UDP socket, which must discriminate among incoming traffic in various ways. Does it get routed to an "active" connection? Is this a handshake packet that should go to a pending connection? Is it some unknown garbage?
  • client_traffic is much simpler, comparatively. This one takes an interesting step and bundles an HTTP client dependency, so it may be in charge of making the HTTPS POST /join call to securely trade the secrets necessary to configure its pending connection (sketched just below). (Note: server_traffic does not house the HTTP server! It may in the future, but server-side concerns are more complicated; you may, for example, want a distinct application handling HTTPS /join calls and keep that responsibility away from the main realtime game server.)
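
Here's a hypothetical sketch of what that /join round trip might look like, assuming a reqwest-based HTTP client. The response fields (pending ID, session key, UDP address) are my shorthand guesses at "the necessary secrets", not the real payload:

// Hypothetical sketch only: the URL shape, field names, and response contents
// are stand-ins for whatever the real /join exchange carries.
use serde::Deserialize;

#[derive(Deserialize)]
struct JoinResponse {
    pending_id: u64,        // short-lived ID the server expects on handshake packets
    session_key: [u8; 32],  // secret used to encrypt the join challenge
    udp_addr: String,       // where the realtime UDP socket lives
}

async fn join(http: &reqwest::Client, base_url: &str) -> Result<JoinResponse, reqwest::Error> {
    // The HTTPS round trip is the only non-UDP step; everything after this
    // happens over the GameConnection handshake.
    http.post(format!("{base_url}/join"))
        .send()
        .await?
        .json::<JoinResponse>()
        .await
}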

Finally we have two crates that act as actual "real world" use cases, a test_server and test_client, which right now are just an echo server and echo client. The test_server is a little special, as it houses the HTTP server that is responsible for POST /join calls.

I don't have the energy to make great before/after diagrams, so we'll work backwards from the "after" diagram I did here: dependency_graph

Shaking Hands Is Difficult

Until this weekend I had a VERY simplified connection handshake. The client GameConnection basically sent a 1024-byte challenge, assumed the server received it, and promoted itself. The server received this challenge and, if it was valid, would promote that pending connection to active and store it in our venerated "active" connection collection, available for normal routing!

It was way too simple:

too simple

This led me into a small trap of not caring where the handshake lived. I'm trying not to overthink stuff for once in my life, so "anywhere is fine" has been okay; I wait until things are a problem and then I fix them. 😩

I may have been a LITTLE too sloppy, as I let server_ingress handle the server-side elements of the handshake, and I also took shortcuts that affected the entire system's behavior, since it was all so local. Let's take a look at this. Note that this snippet is from server_ingress's main event loop, which consumes enum tokens like "udp packet received" (sent from its own UDP recv pump loop) and "build a new pending connection" (sent from the application), plus some others (like querying active connections).

IngressCmd::RecvPacket(from, payload) => match active.get(&from) {
    Some(conn) => {
        conn.read().await.handle_encrypted(payload);
    }
    None => {
        // if we couldn't map the 'from' address+port to an active game connection,
        // we'll check if it's a hello packet and test a decryption
        if let Some(valid_connection) = verify_join_challenge(payload, &pending).await {
            println!("Promoting connection from pending to active: {:?}", from);
            let mut pending_conn: GameConnection<Pending> =
                pending.remove(&valid_connection).unwrap();

            // free up the write lock asap
            pending_conn.set_remote_addr(from);

            // at this point we're passing along ownership of this connection into
            // the hands of the application writer, it's up to them to call
            // .promote() on it and to hand it back to us via the IngressCmd::AdoptActiveConnection
            // this resolves a bunch of ownership headaches and lets the application
            // have as much context as it wants to configure the system stream.
            if on_conn_promote(pending_conn).is_err() {
                println!("Error promoting connection.")
            }
        } else {
            println!("Ignoring weird packet from {:?}", from);
        }
    }
},

This snippet shows the IngressCmd::RecvPacket(..) command, which is hit for every single incoming UDP message; it's the udp_ingress task's job to route it to an active or pending connection. If the "from" address isn't known, we take a chance and assume this unknown payload MIGHT be a pending connection's handshake-related message, and assume the first 8 bytes are a secret pending-ID that is only good for 5 seconds and only for pending connections.

If the u64 we pull out of the raw, unknown payload does appear to match up with a connection in our pending map, we'll ask that connection to decrypt & verify the payload under the assumption that it's a join challenge. If it passes? Awesome, promote the connection, welcome to the club buddy. If it fails any branch here? Quietly ignore it, junk traffic, who cares.
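
A rough sketch of that lookup, with illustrative types (the real verify_join_challenge also handles the decrypt & verify step; details like endianness are guesses on my part):

use std::collections::HashMap;
use std::time::{Duration, Instant};

const PENDING_ID_TTL: Duration = Duration::from_secs(5);

struct PendingEntry {
    created_at: Instant,
    // .. keys, expected challenge material, etc.
}

/// Try to read the leading 8 bytes as a pending-ID and find a live entry for it.
fn find_pending(payload: &[u8], pending: &HashMap<u64, PendingEntry>) -> Option<u64> {
    // Too short to even carry the pending-ID prefix? Definitely junk.
    let prefix: [u8; 8] = payload.get(..8)?.try_into().ok()?;
    let id = u64::from_le_bytes(prefix);

    let entry = pending.get(&id)?;
    // The pending-ID is only honored for a few seconds after it was issued.
    if entry.created_at.elapsed() > PENDING_ID_TTL {
        return None;
    }
    Some(id)
}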

This worked great, it was simple, and let me focus on other areas. But last weekend while doing some gentle stress testing (creating tons of echo clients and sending/receiving small bits of traffic) I witnessed something I knew was inevitable: a small race condition caused a peer to try and send system-stream payloads when the other peer wasn't ready to receive them.

That was the code BEFORE any of these big refactors, and things got out of control as I initially attempted the refactor. Sadly I didn't think to capture my shame in a commit to share here as a sample, but imagine that IngressCmd::RecvPacket arm, except 1000x worse and growing huge with sprawling method calls. I knew it was a bad idea even before I started, and I'm glad I cleaned it up in the end.

It was time to implement a proper handshake. Let's outline what we want to have happen: outline of handshake

One of the great things about async runtimes is that you can write less re-entrant, stateful code and instead just write linear flows. I personally really enjoy this for protocol-oriented programming like this, because it means you can keep the relevant state on the stack or in small code blocks, and the state progression reads as a linear flow even though it may be happening over long periods of time (tens, hundreds, thousands of millis).

As a result I ended up with two new functions inside of impl GameConnection<Pending>. One for servers/listeners, and one for clients/initiators:

impl GameConnection<Pending> {
    // .. snip

    pub async fn do_client_join_handshake(mut self) -> Result<GameConnection<Ready>, &'static str> { /* ... */ }

    pub async fn do_await_client_handshake(
        mut self,
        mut pending_traffic: UnboundedReceiver<(SocketAddr, Box<Bytes>)>,
    ) -> Result<GameConnection<Ready>, &'static str> { /* ... */ }

    // .. snip
}

These functions are a little too long and a little too sloppy to show here, about 240 lines in sum total, but needless to say there is a lot of send_challenge_and_retry loop { ... } style code that picks apart responses to see if they are contextually relevant to the various little phases of the handshake protocol, with retries marshalled via tokio::time::timeout(..). Overall this is working great and I'm pleased with it.
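
For a flavor of what that looks like, here's a condensed sketch of the send-then-await-with-a-deadline shape. Every name here (PendingConn, Reply, the constants) is a stand-in, not the real code:

use std::time::Duration;
use tokio::time::timeout;

struct Reply;
impl Reply {
    fn is_relevant_to_current_phase(&self) -> bool { true } // stub
}

struct PendingConn;
impl PendingConn {
    async fn send_challenge(&mut self) {}                                     // stub
    async fn recv_handshake_reply(&mut self) -> Option<Reply> { Some(Reply) } // stub
}

async fn send_challenge_and_retry(conn: &mut PendingConn) -> Result<Reply, &'static str> {
    const MAX_ATTEMPTS: usize = 5;
    const REPLY_TIMEOUT: Duration = Duration::from_millis(500);

    for _attempt in 0..MAX_ATTEMPTS {
        conn.send_challenge().await;

        // The async runtime lets this read as a straight line: send, then await
        // the reply with a deadline, instead of re-entering a big state machine.
        match timeout(REPLY_TIMEOUT, conn.recv_handshake_reply()).await {
            Ok(Some(reply)) if reply.is_relevant_to_current_phase() => return Ok(reply),
            Ok(_) => continue,         // unrelated or malformed message; try again
            Err(_elapsed) => continue, // deadline hit; re-send the challenge
        }
    }
    Err("handshake timed out")
}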

Decrypting & Routing Caveat

One of the ideas called out in my protocol sequence diagram sketch up above is the client peer optimistically promoting itself once it receives the modified server challenge and sends back a nice little "LOOKING_GOOD" ACK message. This optimistic promotion, without waiting for the server to confirm receipt, is somewhat necessary due to some important principles I have in mind for this whole system, which I'll spell out further below.

This optimistic leap means the server peer COULD be left hanging and try to re-send the server challenge, still hoping for a LOOKING_GOOD ACK. The client peer, however, no longer has the means to handle and respond to handshake-style messages, so the server peer finds itself unable to promote to "active" and will time out.

To get around this I have one very special shim in the "decrypt and route" event loop of an Active GameConnection, client peers only:

let encrypted_envelope = match PublicEnvelope::try_from(encrypted_msg) {
    Ok(envelope) => envelope,
    Err(mut unknown_msg) => {
        // near the end of the join handshakes, there is a "leap-of-faith" moment
        // where both peers have enough context to try and promote to an Active
        // state, but will no longer be able to exchange out of band, non-stream
        // messages. The idea is: once the connection becomes Active, let the
        // application writers use the system stream to drive ALL further mutations
        // of the GameConnection.
        //
        // This puts us in a chicken & egg problem with wanting an overcomplicated
        // ACK/SYN/SYNACK/SYNACKSYNACKSYNACK dance before either peer is willing to
        // effectively abandon their handshake messaging.
        //
        // To work around this, whenever we fail to decrypt a payload, we'll make
        // some assumption that this payload MIGHT be a lingering handshake message,
        // such as a lingering re-attempt at delivery. With this assumption, we'll
        // send back the final "ACK" this peer attempted to deliver before promoting
        // to Active, which should let the remote peer attempt to do the same with
        // some reliability!
        use protocol::hello::extract_potential_slot_id;
        if !is_authority {
            // the only time the handshake gap should occur is when the server is
            // re-sending modified challenges because it hasn't yet promoted to
            // ACTIVE. So this re-emission concern is ONLY for clients.
            // But maybe it could happen if a packet was lost in the ether for
            // many seconds and arrived way out of order.
            println!(
                "Ignoring an unexpected handshake message for connection_id {}",
                conn_id
            );
            continue;
        }

        match extract_potential_slot_id(&mut unknown_msg) {
            Some(id) => {
                if id != conn_id {
                    continue; // didn't match! weird
                }
                // fall through onto the golden path
            }
            None => continue, // ignore weird message
        };

        // golden path: if we're here then it looks like this message is a remnant
        // of the join handshake, possibly our peer didn't receive our final ACKs
        // before promoting, and we should send them another.
        {
            println!("Re-sending challenge ACK to authoritative peer.");
            let mut conn = conn.write().await;
            let compliment = conn.get_readymade_compliment();
            conn.udp_send
                .send_to(compliment[..].as_ref(), remote_peer)
                .await
                .unwrap();
        }

        continue;
    }
};

I condensed some of my thinking into the comments; maybe you'll find it interesting.

Active connections must only send/receive traffic via the Stream constructs.

Streams are, generally speaking, a set of protocols, envelopes, and messages orthogonal to the limited set of "handshake" messages. Once a connection is "promoted" out of Pending, the only messages flowing back and forth should be stream-part messages.

Streams are only intended for use by the 'application' layer

GameConnection should only be marshalling messages along Streams for the application's benefit. Once "Active", GameConnection has no out-of-band control constructs for the state of the connection for either peer; that is all the application's responsibility. The GameConnection must not do much of its own thinking.

Tying that together...

What this means is that Connections have 3 states they promote through, in order (a minimal typestate sketch follows this list):

  • Pending
    • Has specializations for sending/receiving handshake protocol traffic.
  • Ready (added in these refactors)
    • Cannot send or receive traffic, exists in a liminal state awaiting the application to configure the system stream, which will then promote it to active.
  • Active
    • System stream is configured, we will only send/receive traffic that relates to the underlying streams.
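
Here's a minimal sketch of that typestate progression, assuming marker types and eliding everything the real GameConnection actually carries (sockets, crypto context, stream slots):

use std::marker::PhantomData;

// Marker types for the three connection states.
struct Pending;
struct Ready;
struct Active;

struct GameConnection<State> {
    // sockets, crypto context, stream slots, ... all elided
    _state: PhantomData<State>,
}

impl GameConnection<Pending> {
    // Handshake traffic only; success consumes self and hands back a Ready connection.
    async fn do_client_join_handshake(self) -> Result<GameConnection<Ready>, &'static str> {
        // .. challenge / modified-challenge / ACK dance ..
        Ok(GameConnection { _state: PhantomData })
    }
}

impl GameConnection<Ready> {
    // No traffic at all in this state; the application configures the system
    // stream and then promotes.
    fn promote(self /* , system stream config */) -> GameConnection<Active> {
        GameConnection { _state: PhantomData }
    }
}

impl GameConnection<Active> {
    // From here on out, only stream-part traffic flows.
}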

A loaded term in this tangent is "application". The "application" in this case is something like server_ingress, a consumer of GameConnection that hides these lower-level details from someone maybe using Unity, Unreal, Godot, etc. who doesn't want to know all that stuff and just wants to send/receive messages with some parameterization (reliability, ordering concerns, etc).

With this concept, the idea is that a savvy developer (me) would wrap all these ideas up in a nice package that maybe handles FFI to C# and a game's architecture- and design-specific concerns. Maybe you're making a multi-scene chatroom game and need to have a few different streams of information going at once (global reliable meta, current-scene realtime events unreliable, current-scene realtime events reliable, an unreliable mini-map stream, etc).
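
For that chatroom example, the kind of per-stream parameterization I have in mind might look something like this (hypothetical config types, not the library's real API):

enum Reliability {
    Reliable,
    Unreliable,
}

struct StreamConfig {
    name: &'static str,
    reliability: Reliability,
    ordered: bool,
}

// One possible layout for a multi-scene chatroom, using a handful of the 31
// general-purpose slots, each tuned differently.
const CHATROOM_STREAMS: &[StreamConfig] = &[
    StreamConfig { name: "global-meta", reliability: Reliability::Reliable, ordered: true },
    StreamConfig { name: "scene-events-reliable", reliability: Reliability::Reliable, ordered: true },
    StreamConfig { name: "scene-events-realtime", reliability: Reliability::Unreliable, ordered: false },
    StreamConfig { name: "minimap", reliability: Reliability::Unreliable, ordered: false },
];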

The goal of the lower-level parts of this library is to be as non-prescriptive as possible once a connection is established. Eliminating "meta" control channels like a sub-protocol message bus lets us keep those concerns application-specific.

That's it thanks!

Okay, I don't have any more thoughts. Next time I might write up some notes about security, replay attacks, amplification attacks, etc. and some ideas I'm employing to handle them (maybe unsuccessfully 🤷‍♀️ im no cybersec expert!)
