(aside: performance has also been improving because jae has been making some changes to the way requests work which make each individual request faster. those are less general and harder to explain, though, so I'm not gonna. we'll write about them in the patch notes.)
every business process has a bottleneck -- the step whose throughput limits the throughput of the entire process -- and this is also true for software. for, like, Real industries, to alleviate a bottleneck, one of the things you can do is to set up multiple production lines; if one line turns out a widget every 10 minutes, two lines together turn one out every 5. one of the silly magic bits of modern software operations is that software, having no physical form, can autoscale: you can set up your orchestration framework to automatically commission (and decommission!) additional instances of a particular piece of software as it becomes the bottleneck in the processes it's involved in.
until recently, we were autoscaling services based on request latency: if average latency drops below (say) 2 seconds, take an instance out of commission; if it climbs above (say) 5 seconds, put another one in. this works great, as long as the service you're scaling actually is the bottleneck.
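to make the rule concrete, here's a rough python sketch of that scaling decision. the thresholds, the replica cap, and the names are made up for illustration -- in reality this is a knob in the orchestration framework's config, not code we wrote ourselves:

```python
# toy sketch of latency-threshold autoscaling (illustrative only; the real
# thing is configuration in the orchestration framework, not hand-rolled code).

SCALE_DOWN_BELOW_SECONDS = 2.0   # average latency under this? remove an instance
SCALE_UP_ABOVE_SECONDS = 5.0     # average latency over this? add an instance
MIN_REPLICAS = 1
MAX_REPLICAS = 10                # hypothetical cap

def desired_replicas(current_replicas: int, avg_latency_seconds: float) -> int:
    """decide how many instances of this service we want, given recent latency."""
    if avg_latency_seconds > SCALE_UP_ABOVE_SECONDS:
        return min(current_replicas + 1, MAX_REPLICAS)
    if avg_latency_seconds < SCALE_DOWN_BELOW_SECONDS:
        return max(current_replicas - 1, MIN_REPLICAS)
    return current_replicas

# the implicit assumption: if this service's latency is high, adding another
# instance of *this service* will bring it down. that only holds when this
# service is actually the bottleneck.
```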
if it's not, what can happen is:
- the orchestration framework notices that things are running a little slow, and it brings a new instance of the service into operation.
- the actual bottleneck now has n + 1 clients to deal with instead of n, and it gets a little slower, because its workload has gotten burstier or because it's suffering from the added overhead of having more requests in flight, or whatever. (it's not important exactly why; I'm sure we can all relate viscerally to the idea that when you're already working at capacity, having more work to do can make you get less done.)
- a few minutes later, the orchestration framework notices that things are still running a little slow and the whole process starts over again.
after a couple hours, your services are sitting at their max replica count, things are slower than ever, and -- maybe worst of all -- you start making what you think should be slam-dunk performance improvements and they either have no effect or actually make performance worse, because the bottleneck is still untouched. sometimes when you restart the service the bottleneck decides to be in a Good Mood and mostly keeps up with the workload for a while, but most of the time it's in a Bad Mood, starts behind, and gets behinder.
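here's a toy simulation of that loop, with completely made-up numbers, just to show the shape of it: the autoscaled service isn't the bottleneck, so every replica the scaler adds just puts one more client on the shared bottleneck, latency keeps climbing, and the scaler keeps "helping" until it hits the replica cap:

```python
# toy simulation of the runaway-scaling loop (made-up numbers, purely illustrative).
# the autoscaled service's latency is dominated by a shared bottleneck whose
# latency grows with the number of clients keeping requests in flight against it.

SCALE_UP_ABOVE_SECONDS = 5.0
MAX_REPLICAS = 10

def bottleneck_latency_seconds(client_replicas: int) -> float:
    # pretend each extra client costs the bottleneck another half second.
    return 4.5 + 0.5 * client_replicas

replicas = 2
for step in range(12):
    latency = bottleneck_latency_seconds(replicas)
    print(f"step {step}: {replicas} replicas, avg latency {latency:.1f}s")
    if latency > SCALE_UP_ABOVE_SECONDS and replicas < MAX_REPLICAS:
        replicas += 1   # the scaler "helps", and the bottleneck only gets slower
```

run it and you watch replicas march up to the cap while latency never gets anywhere near the target.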
after tearing our hair out over this for a bit, we decided mostly on a whim to turn off autoscaling and drastically scale down our least-loaded service by hand, and noticed that its latency immediately improved, and the puzzle pieces clicked into place.
so we kept doing it for a couple hours across the whole stack. we've since turned autoscaling back on, but using an alternate scaling metric that will hopefully not lead to the same bad behavior.

not the safest way to learn devops but we're making progress