
mononcqc
@mononcqc

I've written a big incident report after Honeycomb had its biggest outage since having paying customers. There's a short copy on the blog, but I encourage people to read the long-form report [PDF].

It's got a bit of everything but the TL;DR of it is that we migrated across clusters to avoid a bug, inadvertently discovered that a feature flag we had used many times did not work how we thought (because deploys were not running this one time and it somehow needed deploys), which in turn killed the update feed that was used to keep the ingest cache warm, which ended up creating heavy load on a DB, which ended up having an internal deadlock (in the MySQL implementation, not our transactions), which took down 100% of the system, which required a full ingest clamp and some SQL surgery to bring everything back to life.
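To make the cache step of that chain concrete, here is a toy sketch of how losing an update feed can turn cache hits into database reads. It is not Honeycomb's implementation: the `WarmCache` class, the TTL value, the key count, and the `simulate` helper are all hypothetical, and the only assumptions are that entries expire after a TTL and that the feed refreshes them before they do.

```python
class WarmCache:
    """Toy TTL cache that falls back to a database loader on a miss (hypothetical)."""

    def __init__(self, ttl, load_from_db):
        self.ttl = ttl
        self.load_from_db = load_from_db
        self.entries = {}   # key -> (value, expires_at)
        self.db_reads = 0   # counts how often we had to hit the database

    def apply_update(self, key, value, now):
        # Called by the update feed: refreshes the entry and extends its lifetime.
        self.entries[key] = (value, now + self.ttl)

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # still warm, no database involved
        # Entry is missing or expired: fall through to the database.
        self.db_reads += 1
        value = self.load_from_db(key)
        self.entries[key] = (value, now + self.ttl)
        return value


def simulate(feed_alive, ticks=10, keys=100, ttl=3):
    """Return the number of database reads over `ticks` rounds of lookups."""
    cache = WarmCache(ttl, load_from_db=lambda k: f"value-{k}")
    for k in range(keys):
        cache.apply_update(k, f"value-{k}", now=0)  # start fully warm
    for now in range(1, ticks + 1):
        if feed_alive:
            for k in range(keys):
                cache.apply_update(k, f"value-{k}", now)  # feed keeps entries fresh
        for k in range(keys):
            cache.get(k, now)
    return cache.db_reads


if __name__ == "__main__":
    print("feed alive:", simulate(feed_alive=True))
    print("feed dead: ", simulate(feed_alive=False))
```

In this toy model a healthy feed means the database is never read, while a dead feed makes every key expire at the same moment and hit the database together, which is the general shape of the load shift described above.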

What's specifically "fun" about this incident is that pretty much all the contributing factors were us trying to do good things to prevent the situation from getting worse, and actually making it worse until the full outage.

It is a bit ironic how feature flags, frequent deploys, suspending deploys during incidents, and learning from prior near-misses all technically contributed to this incident, while being some of the most trusted practices we have to make our system safer.

Hopefully there are interesting lessons in there for our readers.



in reply to @mononcqc's post:

i could go on at great length at any technical depth you care to hear about why mysql is entirely unsuited for any production workload and you should use postgres instead if it is in any way possible

yeah. personally my preferences generally include "have working tools" and "be able to produce results" and "not be spending the entire time constantly putting out exhausting fires from N+1 unintuitive footguns." despite these preferences happening to align with the true interests of the company, it's still usually an impossibly hard sell to ever make this kind of change once the sunk cost mindset has set in. especially when the people making the decisions are neither personally aware of the terrible cost being paid, nor have ever had conscious experience with a system that runs as designed. this gets even more frustrating when your job isn't merely the maintenance of an existing system, but extending and reshaping and futureproofing it, except your hands are tied behind your back.

so yeah i try to steer people away from making these same terrible choices in new projects but there's only so much you can do.

I’d be out of work if I were picky about tools, because I’d rather be writing infrastructure in Erlang in the first place, and that practically no longer exists as a job opportunity.

I try to go for functional organizations with good teammates first these days.