hi folks! you may have noticed that we had some downtime last night from approximately 6:47pm - 7:10pm PST. we’ve been able to determine the initial cause, as well as the cascade of events that caused recovery to take longer than it should have, and we’re sharing our findings here in the interest of transparency.
first, this downtime only involved data storage systems which contain cached and otherwise replaceable data, and after the downtime period even that data was restored; there was no loss of user data.
to provide initial context, cohost consists of two main components: app, which is what renders pages; and api, which is where all requests go once the page has loaded. these are the same codebase, but are deployed separately as the performance characteristics and resource requirements are dramatically different for page rendering and API responses.
Timeline (all times in PST UTC-8)
- 6:47pm: `app` stops reporting performance metrics
- 6:48pm: we receive an e-mail from DigitalOcean reporting that our primary redis node has failed and a replica has been spun up in its place; this goes missed because we don’t have e-mail notifications turned on for this sort of thing
- 6:54pm: we receive first reports from users that the site is inaccessible
- 6:58pm: jae and colin begin investigating together on voice
- 6:59pm: `app` and `api` are fully restarted, allowing `app` to begin serving requests again; recovery begins
- 7:06pm: we manually increase the minimum replica count for `app` to deal with the surge in load
- 7:10pm: full site stability restored
What happened
we were able to map a fairly clear chain of events that prevented the service from recovering automatically, requiring us to step in and manually restart services while fixing scaling values.
- redis primary replica fails and is replaced
  - we don’t know what caused this. unfortunately, DigitalOcean doesn’t persist metrics from old replicas, so when the primary replica failed we lost any data that could have told us what happened.
  - ACTION ITEM: figure out how to control these metrics so that, if/when this happens again, we have some hope of figuring out why
- `app` is unable to access redis to pull session data for requests, blocking all requests
  - grabbing session data is mandatory for the site to operate at all, so a backlog of hanging redis commands will prevent any requests from succeeding
  - interesting note: the redis library we use (`ioredis`) doesn’t set a command timeout by default! this means that, since these requests were made in the brief period between the primary replica failing and the swap finalizing, they were against a dead redis instance and hung forever (even after the HTTP request timed out!) until we restarted the server.
  - ACTION ITEM: configure a redis command timeout so that old, guaranteed-to-fail commands actually fail, clearing space for commands that would be executed against the replica (and thus complete); there’s a rough sketch of what this could look like after this list
  - restarting `app` was sufficient to kill the old, broken connections; we consider this to be the start of recovery
- due to requests never completing, `app` is unable to send performance metrics up to DataDog
  - we are dependent on these performance metrics for scaling cohost!
  - we know roughly how many requests-per-second a single running copy of `app` can serve, and we use metrics pulled from DataDog to automatically scale the number of running `app` instances to hit that target
  - while the redis server was down, requests couldn’t complete, so our “requests” metric dropped to zero, and the Kubernetes Horizontal Pod Autoscaler fell back to the required minimum number of replicas (there’s a toy version of that calculation after this list)
  - our minimum was set too low (2 replicas?? what clown made that decision???? spoiler: the clown writing this), which meant we suddenly had around a sixth of the replicas we truly needed to serve traffic.
  - when `app` gets overloaded with too many requests, site performance degrades heavily for everyone. `app` was so overloaded it was unable to send accurate metrics to DataDog, preventing correct scaling and causing the performance issues to continue.
    - we resolved this by temporarily setting the minimum replica count to the value we were running at prior to downtime. this immediately resolved lingering performance issues and allowed for full recovery.
  - ACTION ITEM: come up with a better minimum replica count
  - ACTION ITEM: figure out if there’s a secondary metric source we can use in the event of a DataDog failure
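as promised above, here’s a minimal sketch of the command-timeout change, assuming an `ioredis` client constructed directly. the option names (`commandTimeout`, `connectTimeout`, `maxRetriesPerRequest`, `retryStrategy`) are ioredis’s documented options, but the specific values (and the `REDIS_HOST` env var and `getSession` helper) are made up for illustration; they’re not necessarily what we’ll ship.

```typescript
import Redis from "ioredis";

// ioredis's default is *no* command timeout, which is what let commands
// issued against the dead primary hang forever. values below are illustrative.
const redis = new Redis({
  host: process.env.REDIS_HOST, // hypothetical env var, for this sketch only
  port: 6379,
  connectTimeout: 10_000,       // give up on establishing a connection after 10s
  commandTimeout: 5_000,        // reject any command that hasn't gotten a reply within 5s
  maxRetriesPerRequest: 2,      // don't let a single request queue retries indefinitely
  retryStrategy: (attempts) => Math.min(attempts * 200, 2_000), // backoff while reconnecting
});

// with a command timeout set, a session lookup against a dead node rejects
// ("Command timed out") instead of hanging, so the request can fail fast and the
// connection isn't clogged with commands that can never complete.
async function getSession(sessionId: string): Promise<string | null> {
  return redis.get(`session:${sessionId}`);
}
```

the tradeoff is that an overly aggressive timeout turns brief redis latency blips into failed requests, so the real values need to come from our actual command latencies rather than the numbers above.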
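and to make the scaling half of this concrete: the Horizontal Pod Autoscaler’s core rule for an average-value metric target is roughly “desired replicas = ceil(total metric ÷ per-replica target)”, clamped between the configured minimum and maximum. here’s a toy TypeScript version of that arithmetic; the numbers are invented (picked so the drop matches the “around a sixth” ratio above), not our real targets or replica counts.

```typescript
// toy model of the HPA's scaling rule for an average-value metric target:
//   desired = ceil(totalMetric / targetPerReplica), clamped to [min, max]
function desiredReplicas(
  totalRequestsPerSecond: number, // what DataDog reports for `app` in aggregate
  targetRpsPerReplica: number,    // how many requests we expect one `app` replica to handle
  minReplicas: number,
  maxReplicas: number,
): number {
  const raw = Math.ceil(totalRequestsPerSecond / targetRpsPerReplica);
  return Math.max(minReplicas, Math.min(maxReplicas, raw));
}

// illustrative numbers only:
console.log(desiredReplicas(600, 50, 2, 40));  // healthy: 600 rps / 50 per replica => 12
console.log(desiredReplicas(0, 50, 2, 40));    // redis outage: metric reads 0 => clamped to the minimum, 2
console.log(desiredReplicas(600, 50, 12, 40)); // mitigation: a sane minimum holds 12 even when the metric lies
```

the real load never dropped, only the measurement of it did, which is why a minimum of 2 turned a redis failover into a site-wide performance problem and why “come up with a better minimum replica count” is on the action item list.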
as stated above, we do not know what caused the redis crash, and we do not have enough information to figure it out. we know from reviewing metrics for `app` and `api` that there was not a sudden spike in requests or redis commands leading up to this; site usage and performance were As Expected. we don’t like not knowing what happened.
we are pleased that DigitalOcean was able to switch over to the secondary replica quickly. however, our systems failed to handle that failover and recover on their own. most of our action items revolve around ensuring that automatic recovery is possible in the future.
since we don’t know what caused redis to crash in the first place, it’s impossible to take any measures to prevent a recurrence.
- ACTION ITEM: pray this doesn’t happen again
we’re sorry for the inconvenience this downtime caused and we hope this writeup was at least somewhat enlightening, if a little jargon-y (turns out it’s hard to discuss this sort of thing without jargon).
thanks, as always, for using cohost!