staff
@staff

hi folks! you may have noticed that we had some downtime last night from approximately 6:47pm to 7:10pm PST. we’ve been able to determine the initial cause, as well as the cascade of events that made recovery take longer than it should have, and we’re sharing our findings here in the interest of transparency.

first, this downtime only involved data storage systems which contain cached and otherwise replaceable data, and after the downtime period even that data was restored; there was no loss of user data.

to provide initial context, cohost consists of two main components: app, which is what renders pages; and api, which is where all requests go once the page has loaded. these are the same codebase, but are deployed separately as the performance characteristics and resource requirements are dramatically different for page rendering and API responses.

Timeline (all times in PST, UTC-8)

  • 6:47pm: app stops reporting performance metrics
  • 6:48pm: we receive an e-mail from DigitalOcean reporting that our primary redis node has failed and a replica has been spun up in its place; this goes unnoticed because we don’t have e-mail notifications turned on for this kind of alert
  • 6:54pm: we receive first reports from users that the site is inaccessible
  • 6:58pm: jae and colin begin investigating together on voice
  • 6:59pm: app and api are fully restarted, allowing app to begin serving requests again; recovery begins
  • 7:06pm: we manually increase the minimum replica count for app to deal with the surge in load
  • 7:10pm: full site stability restored

What happened

we were able to map a fairly clear chain of events that prevented the service from recovering automatically, requiring us to step in and manually restart services while fixing scaling values.

  1. redis primary replica fails and is replaced
    • we don’t know what caused this. unfortunately, DigitalOcean doesn’t persist metrics from old replicas, so when the primary replica failed we lost any data that could have told us what happened.
    • ACTION ITEM: figure out how to control these metrics so that, if/when this happens again, we have some hope of figuring out why
  2. app is unable to access redis to pull session data for requests, blocking all requests
    • grabbing session data is mandatory for the site to operate at all, so a backlog of hanging redis commands will prevent any requests from succeeding
    • interesting note: the redis library we use (ioredis) doesn’t set a command timeout by default! this means that the commands issued in the brief period between the primary replica failing and the swap finalizing were sent to a dead redis instance and hung forever (even after the HTTP request timed out!) until we restarted the server.
    • ACTION ITEM: configure a redis command timeout so that old, guaranteed-to-fail commands actually fail, clearing space for commands that would be executed against the replica (and thus complete); see the sketch after this list
    • restarting app was sufficient to kill the old, broken connections; we consider this to be the start of recovery
  3. due to requests never completing, app is unable to send performance metrics up to DataDog
    • we are dependent on these performance metrics for scaling cohost!
    • we know roughly how many requests-per-second a single running copy of app can serve, and we use metrics pulled from DataDog to automatically scale the number of running app instances to hit that target; a simplified sketch of this math is below, after the list
    • while the redis server was down, requests couldn’t complete, so our “requests” metric dropped to zero, and the Kubernetes Horizontal Pod Autoscaler fell back to the required minimum number of replicas
    • our minimum was set too low (2 replicas?? what clown made that decision???? spoiler: the clown writing this), which meant we suddenly had around a sixth of the replicas we truly needed to serve traffic.
    • when app gets overloaded with too many requests, site performance degrades heavily for everyone.
    • app was so overloaded that it was unable to send accurate metrics to DataDog, which prevented correct scaling and let the performance issues continue.
    • we resolved this by temporarily setting the minimum replica count to the value we were running at prior to downtime. this immediately resolved lingering performance issues and allowed for full recovery.
    • ACTION ITEM: come up with a better minimum replica count
    • ACTION ITEM: figure out if there’s a secondary metric source we can use in the event of a DataDog failure
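
for the curious, here's roughly what that command timeout fix could look like. this is a minimal sketch, not our production config: commandTimeout, connectTimeout, and retryStrategy are real ioredis options, but the specific values and the session key format here are placeholders.

```typescript
import Redis from "ioredis";

// minimal sketch: an ioredis client that fails fast instead of hanging forever.
// the values below are illustrative placeholders, not our production settings.
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  // fail any single command that hasn't completed within 1s, so a dead
  // primary can't hold the connection's command queue hostage
  commandTimeout: 1000,
  // give up on establishing a connection after 5s
  connectTimeout: 5000,
  // cap the reconnection backoff so we pick up the promoted replica quickly
  retryStrategy: (attempt) => Math.min(attempt * 200, 2000),
});

// with commandTimeout set, this rejects instead of hanging forever
// when the primary is unreachable
async function getSession(sessionId: string): Promise<string | null> {
  return redis.get(`session:${sessionId}`);
}
```

once a command times out its promise rejects, so the HTTP request can fail (or retry against the promoted replica) instead of sitting in a backlog forever.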
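
and to make the scaling failure in step 3 concrete, here's a simplified, hypothetical sketch of the replica-count math an HPA-style autoscaler ends up doing. every name and number in it is made up for illustration; the point is just that when the request metric reads zero, the target collapses straight down to the minimum.

```typescript
// simplified, hypothetical sketch of HPA-style scaling math.
// all numbers below are made up for illustration.
function desiredReplicas(
  requestsPerSecond: number,   // the metric reported via DataDog
  targetRpsPerReplica: number, // roughly what one copy of app can serve
  minReplicas: number,
  maxReplicas: number,
): number {
  const ideal = Math.ceil(requestsPerSecond / targetRpsPerReplica);
  return Math.max(minReplicas, Math.min(maxReplicas, ideal));
}

desiredReplicas(600, 50, 2, 30); // => 12 replicas under normal load
desiredReplicas(0, 50, 2, 30);   // => 2: the metric reads zero, so we fall to the floor
```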

as stated above, we do not know what caused the redis crash, and we do not have enough information to figure it out. we know from reviewing metrics for app and api that there was not a sudden spike in requests or redis commands leading up to this; site usage and performance were As Expected. we don’t like not knowing what happened.

we are pleased that DigitalOcean was able to switch over to the secondary replica quickly. however, our systems failed to handle that failover and recover on their own. most of our action items revolve around ensuring that automatic recovery is possible in the future.

since we don’t know what caused redis to crash in the first place, it’s impossible to take any measures to prevent a recurrence.

  • ACTION ITEM: pray this doesn’t happen again

we’re sorry for the inconvenience this downtime caused and we hope this writeup was at least somewhat enlightening, if a little jargon-y (turns out it’s hard to discuss this sort of thing without jargon).

thanks, as always, for using cohost! :eggbug: :host-love:


in reply to @staff's post:

time to backseat drive honk honk, so do take this with a pinch of salt

figure out if there’s a secondary metric source we can use in the event of a DataDog failure

it might be better to ask "how can we make good decisions when our metrics fail?" rather than just "can we have backup metrics?", because you'll still be left with the case where the backups fail, too.

the answer might just be "if datadog is offline, don't change the autoscaler", but you'll know the answer better than i will.

i'm pretty sure this question was raised, but it would be reassuring to see it as its own action point, alongside "secondary metrics???"

we resolved this by temporarily setting the minimum replica count to the value we were running at prior to downtime.

honestly this might make a good permanent fix, or at least a fix you revisit each month as the site grows, ensuring you have some reasonable minimum for when your autoscaler decides to take a break for whatever reason

again, i think this is linked to the "action item: decide how to handle metric failure"

configure a redis command timeout so that old, guaranteed-to-fail commands actually fail, clearing space for commands that would be executed against the replica (and thus complete)

not to be all "where there's smoke, there's fire", but it might be worth adding an action item around "check our other libraries have timeouts set"

you might want to consider running a fire drill: turning off redis and seeing how long the site takes to recover.

you may also want to document which automatic recovery processes you think are in place, and which ones have been tested by hand or by acts of god.

there's a lot of potential work here, and i'm not saying you need to exhaustively test, but i do think there's a lot of use in mapping out the options

figure out how to control these metrics so that, if/when this happens again, we have some hope of figuring out why

odds are, if redis crashes once, it's not your fault.

it could have been a disk error, it could have been a rack losing power, it could have been a weird segfault in some other service on the machine that caused DO to flip it

on the other hand, if redis crashes repeatedly, then, yep, it's probably your fault. you might get more mileage out of monitoring redis uptime than just redis availability. once you fix the timeouts, you want to know if the redis machines are cycling more than you expect.

that said, it might not help much to begin with.

without sounding too mean, there's a real tendency to add stuff to the dashboard every time an incident happens, and as a result, it's more of a hall of fame than a debugging tool.

although a good dashboard should give you a heads up when things are about to get bad (longer request latencies, servers power cycling, scaling going up and down), a dashboard can get bad real quickly if you end up with fifty graphs that only answer the question "is the site online" in different ways.

it is good to think about "would metrics help" but it's also ok if the answer is "no"

either way, ty for your writeup and ty for your clarity, it's real nice to see the cogs moving, even if it's jargon-filled

way better than bug fixes and performance improvements

the answer might just be "if datadog is offline, don't change the autoscaler", but you'll know the answer better than i will.

yeah; unfortunately, I briefly looked into this earlier, and the obvious way to have our autoscaler not change its target wouldn't work, because the metrics we're scaling on go to zero instead of leaving a convenient gap that datadog would let us treat as "last good value". we might investigate smoothing or one-sided rate limits or some other way of patching over this data anomaly, though.
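
purely for illustration, a one-sided patch over that anomaly could look something like the sketch below; the names and the "zero means the metric broke" heuristic are hypothetical, not something we've built.

```typescript
// hypothetical sketch of a one-sided patch over the metric gap: if the
// request-rate metric suddenly reads zero, keep feeding the autoscaler the
// last good (nonzero) value instead of the bogus zero.
let lastGoodRps = 0;

function smoothedRps(reportedRps: number): number {
  if (reportedRps === 0 && lastGoodRps > 0) {
    // a genuine drop to zero traffic is implausible for a live site, so
    // treat it as a measurement failure and hold the previous value
    return lastGoodRps;
  }
  lastGoodRps = reportedRps;
  return reportedRps;
}
```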

not to be all "where there's smoke, there's fire", but it might be worth adding an action item around "check our other libraries have timeouts set"

totally.