for context: cohost is hosted in a kubernetes cluster. we (mostly) like kubernetes, mostly because of its extremely robust tooling community and support from fucking Everybody. there's a lot of things we wanted to do with previous hosting solutions and just couldn't because the tooling wasn't there and we don't have the staff to write shit ourselves.
we've had Vertical autoscaling (adjusting CPU/memory requests based on actual usage to improve scheduling) and Node autoscaling (adding/removing nodes from the cluster based on scheduling needs) since launch, which means that during spikes we've been able to hit an "add more replicas" button to add more replicas.
but having to do this manually fucking sucks for a bunch of reasons, the three primary (in my eyes) being:
- you have to Guess how many replicas you need
- you have to Guess when you can scale back down
- if you Guess wrong you are spending money you don't need to be spending (assuming nodes need to scale up as well)
kubernetes has mechanisms for scaling replicas automatically (Horizontal autoscaling) that can, in the most basic implementation, scale based off of CPU/memory usage, but we haven't found that to be a good measure of actual load[1]. the work today was getting our external metrics set up (specifically, p90 latency[2]) for us to scale off of.
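the math behind horizontal autoscaling is pretty simple: the HPA computes a desired replica count as ceil(currentReplicas × currentMetric / targetMetric). here's a rough python sketch of that formula plus a nearest-rank p90 — function names are mine for illustration, not our actual metrics pipeline:

```python
import math

def p90(samples: list[float]) -> float:
    """nearest-rank 90th percentile of a list of latency samples."""
    ordered = sorted(samples)
    idx = math.ceil(0.9 * len(ordered)) - 1  # nearest-rank index (1-based rank -> 0-based)
    return ordered[idx]

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """the standard HPA scaling formula:
    desired = ceil(current * observed / target)."""
    return math.ceil(current_replicas * current_metric / target_metric)
```

so if p90 latency is sitting at twice the target, the HPA asks for twice the replicas; once it drops back under target, the desired count shrinks again.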
we had a latency spike we needed to scale for about an hour ago and i'm thrilled that everything worked perfectly. after 30s sustained, we scaled up and then dropped back down to baseline once the load had subsided. at the current levels, we are using fewer nodes than we previously were (which is Money Saved) and are being waaaaaaaay more efficient with the resources we have, all of which kicks ass.
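the "30s sustained" part is basically a debounce: don't react to a blip, only act once the metric has stayed over threshold for the whole hold window. a hypothetical sketch of that logic (not our actual implementation, which lives in the autoscaler config):

```python
class SustainedTrigger:
    """fires only after the metric stays above threshold for `hold` seconds."""

    def __init__(self, threshold: float, hold: float):
        self.threshold = threshold
        self.hold = hold
        self.breach_start: float | None = None  # when the metric first went over

    def update(self, now: float, value: float) -> bool:
        if value <= self.threshold:
            self.breach_start = None  # any dip back under resets the clock
            return False
        if self.breach_start is None:
            self.breach_start = now
        return now - self.breach_start >= self.hold
```

a momentary spike resets to nothing, but 30 seconds of sustained breach fires the trigger — which is exactly the behavior that kept us from flapping replicas up and down all day.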
in the end it should save us 15% or so a month on hosting, which is pretty great! plus i have a new graph i can look at when i'm bored, which is always nice to have.
[1] plus it interacts REALLY BADLY with vertical autoscaling, which i learned when i briefly enabled it before launch and everything scaled way out of control extremely quickly
[2] if you're curious, we consider 5s our maximum acceptable for the frontend app (inherently slower, unfortunately, because rendering is expensive) and 1s our maximum acceptable for the API. our worker jobs scale if there's more than 1 pending job for over 30 seconds (only happens during extreme spikes; we usually sit at 0 during the reporting windows but during the verge bump we hit 140)
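for a concrete feel of those thresholds, here they are as a lookup — names and structure are made up for illustration; the real values live in our metrics adapter config:

```python
# latency targets per service, in seconds (the numbers stated above)
LATENCY_TARGETS_S = {
    "frontend": 5.0,  # looser bound: server-side rendering is expensive
    "api": 1.0,
}

def over_latency_target(service: str, p90_latency_s: float) -> bool:
    """true when a service's p90 latency exceeds its maximum acceptable value."""
    return p90_latency_s > LATENCY_TARGETS_S[service]

def workers_should_scale(pending_jobs: int, seconds_pending: float) -> bool:
    """workers scale when more than 1 job has been pending for over 30s."""
    return pending_jobs > 1 and seconds_pending > 30.0
```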