vogon

the evil "Website Boy"

member of @staff, lapsed linguist and drummer, electronics hobbyist

zip's bf

no supervisor but ludd means the threads any good


twitter (inactive)
twitter.com/vogon
bluesky
if bluesky has a million haters I am one of them, if bluesky has one hater that's me, if bluesky has no haters then I am no more on the earth (more details: https://cohost.org/vogon/post/1845751-bonus-pure-speculati)
irl
seattle, WA

MichelleDraws
@MichelleDraws

Centralia's VOL 1+2 Kickstarter is 75% funded!! Wowie zowies!

Here's the first of 4 postcard print designs that are available with the postcard print set ☆ the set is included in all bundle pledges and is available as an add-on to all pledge levels!

Get Centralia VOL 1+2 on Kickstarter!



mononcqc
@mononcqc

I've written a big incident report after Honeycomb had its biggest outage since having paying customers. There's a short copy on the blog, but I encourage people to read the long-form report [PDF].

It's got a bit of everything, but the TL;DR of it is that we migrated across clusters to avoid a bug, inadvertently discovered that a feature flag we had used many times did not work how we thought (because deploys were not running this one time and it somehow needed deploys), which in turn killed the update feed that was used to keep the ingest cache warm, which ended up creating heavy load on a DB, which ended up having an internal deadlock (in the MySQL implementation, not our transactions), which took down 100% of the system, which required a full ingest clamp and some SQL surgery to bring everything back to life.

What's specifically "fun" about this incident is that pretty much all the contributing factors were attempts to do good things to keep the situation from getting worse, and they actually made it worse until the full outage.

It is a bit ironic how feature flags, frequent deploys, suspending deploys during incidents, and learning from prior near-misses all technically contributed to this incident, while being some of the most trusted practices we have to make our system safer.

Hopefully there are interesting lessons in there for our readers.