tef
@tef

the nice thing about working in big companies is that your tiny fuckups can impact millions of people. this is a story about how i fucked up, big time. worse still, most of the customers were programmers, so you just know they were dicks about the entire thing.

anyway: this is why your ****** backups stopped working

i was a month into my new job, and frankly i'd never worked anywhere as important. up until now i'd worked on small-time startups, non-profits both accidental and intentional, and i'd never been paid this much either. i felt like an imposter, and i was desperate to prove myself.

which is why i didn't complain at my first big assignment:

there was a backup system that had been somewhat abandoned, an internal api had been changed, and some other poor sod had been lumped with "just make it work", turning slightly crispy in the process. it was now my turn. the code was written, running in production, and someone just needed to answer support tickets if a problem occurred.

i did my due diligence as best i could.

i asked how long the migration had been running: "several months." i asked if we'd hit many problems: "no." i asked if we expected any issues: "not anymore." the code was finished and all we needed was some chump to babysit the cron job.

in hindsight, what i should have done was ignore any progress made and treat the whole thing as unknown and risky. what i did in practice was the opposite: i trusted my boss, i trusted my coworker's insight, and i was desperate to prove myself to the team. my boss told me to make things go fast, so after a few weeks of "it works ok", i turned up the big dial marked "migrations", and everything seemed to be working a few weeks later.

then everything broke at once.

that was the problem: the bugs took over a month to appear, and by that time most of the users were on the new system. backups taking forever to run. backups randomly erroring out. backups regularly hanging. people were screaming on twitter, talking about the utter carelessness and idiocy of whatever developer was in charge.

... and that was just the user-visible issues: if you added too many workers, the system would crash its own database, but if you didn't have enough workers, everything would also grind to a halt. the nightmare carried on for the best part of a year. i answered so, so many support tickets.

aside: it's worth noting that this wasn't the "real backup system" we used. we took regular snapshots and logs, uploaded them to a bucket, and those continued to work fine. this was the "customer-facing backup system", something the team considered relatively non-essential. its primary use case was making exports and imports of your database.

i could go on at length about how wrong we were here, but it's not that essential to the story. the important part is that although no data was ever at risk, it really looked that way to customers, and they were not happy about it. frankly, i don't blame them.

the code was a mess. i recall one argument where we had a mystery bug and neither i nor the original developer could make sense of what the code should be doing. he wanted to keep debugging it, i wanted to rewrite it. thankfully, i won. it's easier to debug things when you keep them annoyingly simple.

we managed to track down several weird edge cases in the ruby i/o libraries, we cleaned up the error handling, and eventually we got to the point where backups didn't error out or hang. still, the work had barely begun.

sometimes backups just didn't run, and we'd worked out that it was linked to load on the system. the more customers trying to take backups, the more the backups went missing. customers would start new backups as a result, and everything just kept getting worse.

i did fix this bug, but it took me a while.

the scheduler was simple: at the beginning of the hour, find all the scheduled backups that should start, and create jobs for each one. the workers would search for the waiting jobs and kick things off. it worked pretty well, until it didn't.

sooner or later, there were so many jobs lined up that it took more than an hour to clear all the scheduled backups. the next hour of scheduled backups started late, and so did the next, and eventually the scheduler ran at like 06:50am, finished at 08:10am, and skipped over all the jobs due at 7am. whoops.
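i don't have the original code to hand, so here's a minimal ruby sketch of roughly what it did. the Schedule and Job structs and the hourly_tick name are all invented; only the shape matters:

```ruby
# a simplified sketch of the original scheduler, not the real code.
# Schedule, Job, and hourly_tick are all invented names.
Schedule = Struct.new(:id, :next_run_at)
Job      = Struct.new(:schedule_id, :enqueued_at)

def hourly_tick(schedules, queue, now: Time.now)
  # "at the beginning of the hour, find the backups that should start now"
  hour_start = Time.new(now.year, now.month, now.day, now.hour)
  due_this_hour = schedules.select do |s|
    s.next_run_at >= hour_start && s.next_run_at < hour_start + 3600
  end

  due_this_hour.each do |s|
    queue << Job.new(s.id, now) # create a job for the workers to pick up
    s.next_run_at += 24 * 3600  # a daily backup is due again tomorrow
  end
  # the failure mode: if this tick takes more than an hour, the next one
  # starts late, and anything due in the hour that got skipped never
  # matches "this hour" again. those backups simply don't run.
end
```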

i changed it to ask "which jobs should have started by now, and sort them by how overdue they are." and finally, the missing backups were solved.
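the query change itself was tiny, something like this. it reuses the made-up Schedule struct from the sketch above, and is an illustration rather than the real code:

```ruby
# the fixed selection: anything that should already have started is due,
# no matter which hour the scheduler wakes up in, and the most overdue
# schedules get handled first.
def overdue_schedules(schedules, now: Time.now)
  schedules
    .select  { |s| s.next_run_at <= now }
    .sort_by { |s| s.next_run_at } # oldest deadline first, i.e. most overdue first
end
```

the nice property is that a late or skipped run doesn't lose anything: overdue work just waits at the front of the line for the next pass.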

that just left the precarious situation of "too many workers crashes the system." that problem was simple: when a hundred workers asked the scheduler for work and there wasn't any, they hammered the database. one exponential backoff later, and suddenly, things weren't falling apart.

well, they were falling apart, just in a manageable way.
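if you've never written one, the shape of that fix is roughly this. a generic sketch of backoff-with-jitter, not our actual worker code, and every name and number in it is invented:

```ruby
# a generic polling loop with exponential backoff and jitter, not the real
# worker code: when there's no work, wait longer and longer before asking
# again, instead of a hundred workers hammering the database in lockstep.
def worker_loop(queue, base: 1.0, max_delay: 300.0)
  delay = base
  loop do
    job = queue.shift                    # ask for work (nil when the queue is empty)
    if job
      job.call                           # run the backup
      delay = base                       # found work, so reset the backoff
    else
      sleep(delay + rand * delay)        # jitter spreads the workers back out
      delay = [delay * 2, max_delay].min # double the wait, up to a cap
    end
  end
end
```

the jitter matters as much as the doubling: without it, every worker that backs off at the same moment comes back at exactly the same moment too.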

the other bugs barely register in my memory, but it was another four months of waiting before i had any confidence that things were working as intended. the only thing i really recall is an embarrassing support ticket, from a user asking why their backup was taking so long.

"i'm not sure i can make things go faster," i explained, "but i will look." and then i found a "go extra slow so the system doesn't crash again" hotfix that was no longer necessary. the user didn't even say thank you. rude.

it was a huge relief to finally be done with it.

a complete nightmare from start to finish, and to make it worse, i should have seen it coming.

the replacement had been rushed out even before i was lumped with it, dumped in the lap of someone who hadn't really done much backend work or distributed horrors. then we were impatient in pushing it into service, not expecting to find problems six months later, only to discover the cost of hubris.

i wanted to prove myself, and made a fool of myself in the process. my boss was deeply unhappy with me, the support staff all knew me by name, and several hundred angry twitter users were calling for my blood.

it was deeply frustrating being blamed for choices i didn't make, and being rushed into fixing them as quickly as possible rather than addressing the fundamental architectural problems that created the mess in the first place. i know it's unbecoming to say "i wouldn't have done it that way", but i think it's fairer to say "i already learned the hard way not to do it that way, and i didn't enjoy the refresher course."

in the end, the biggest lesson was never learned.

the real problem wasn't the code, or even the migration strategy, but our complete failure at customer communication.

if the "real backups" had appeared somewhere in the cli/admin pages, customers wouldn't have immediately panicked. if we'd called the tool "snapshot" instead of "backup", customers wouldn't have assumed it was so load-bearing. to cap it all off, no-one internally felt the tool was important, so no-one realized how many customers had come to rely on it.

we never lost any data but we did lose a hell of a lot of trust.

i have other stories, like the time we stayed in a rat-filled warehouse and i almost got heat stroke, but those are for another time. fast forward a year, and my boss still holds a grudge, i'm burned out from being one of four people on the pager rota, and a doctor told me to quit. so i did.

there's no real moral to this story. it was your run of the mill clusterfuck, and even if you learn anything from it, you won't have the agency to stop it happening to you.



in reply to @tef's post:

very purple and slightly weeb too

we had an offsite meeting, but the team had gotten quite big, so my boss cheaped out with an airbnb that didn't have aircon

it also had rats

i ended up crashing at a friend's for the last two days, amazed at not waking up in a sweat, and came back home almost a stone lighter

around the same time, i yelled at my boss for not reading a cv from a woman i'd recommended (we eventually hired her; she absolutely demolished the interview)

the new guy should never have to worry about breaking something important, because nobody should ever trust the new guy to be able to do anything significant in an unfamiliar codebase after just one month