Stegosaur
@Stegosaur

I am lucky enough to have the day off, so I don't have to deal with thousands of Windows endpoints running CrowdStrike requiring recovery mode reboots and BitLocker keys to fix a bad update. So instead, I'm going to offer some insights from my ~15 years of professional experience for the rest of y'all as a learning experience.

Here are my hot takes:

  • This outage is a demonstration of the overcentralization of IT in general. One OS with one security agent has managed to take out banks, hospitals, airlines, government agencies, and media outlets. This should be terrifying to the casual observer, because it shows a single point of failure for modern economies.
  • This is also just a small preview of an eventual public cloud outage. What happens when AWS goes down globally? Or Azure? Or Google? This isn't doomsaying either, as all three have already had outages on a regional scale.
  • Too much of modern tech is just building bandaids for the problems of bigger tech companies - like CrowdStrike as a bandaid for Microsoft's abysmal product security. There are few to no companies actually championing a brand new way of doing things that removes dependencies on older technologies. Even Microsoft hasn't stopped supporting Active Directory on-prem or eliminated it entirely, as every major Cloud Identity Provider inevitably hooks into it for Authentication and, often, Authorization.
  • Tech stocks have been overvalued for years, and this incident really shows their weakness. These aren't "radical innovators", they're just really good at selling minor improvements as gargantuan features (like Apple) or outright snake oil, and pumping their valuations accordingly.
  • The real innovators are being derided for not pandering to the status quo. Outfits like Oxide Computer are trying fresh, new approaches to longstanding problems, but because their products force businesses to behave better (by removing legacy systems, modernizing infrastructure, or just investing in their fucking workers), they're often passed over in favor of whatever lets the company keep the status quo a little longer and outsource abroad.
  • Speaking of outsourcing, that is absolutely an underlying contributor to today's events. Microsoft and CrowdStrike are popular precisely because they're a known quantity for outsourced workers. Companies don't keep Microsoft because it's a good value or a high-quality product; they keep it so they can fire the expensive worker in the USA and ship that role to India for pennies on the dollar, all because CrowdStrike "makes it safer". They keep it because it's something "everyone already knows".
  • On the topic of outsourcing, the present solution to the CrowdStrike fiasco involves booting to recovery mode and juggling encryption keys (if encryption is enabled - and it had better be enabled if you're good at your job). Companies that outsourced everything abroad will have slower recovery times for the very reason that those folks can't physically assist their employees with recovery, and in many cases have to read or transmit these keys and instructions out-of-band (meaning, outside normal communications channels). Having a domestic workforce, even in a remote-only working environment, better connects workers and allows faster recovery times.
  • One more on outsourcing: these outsourcing and MSP firms often support dozens, maybe hundreds of customers. How do you know where you sit in the priority list for a global outage like this? How long was your recovery delayed because you outsourced in the first place? In-house staff put you first, while outsourced staff have to prioritize whoever pays the most.

TL;DR Version

Even non-tech people should be horrified at what's happening today, because actually competent IT people (like myself) have been screaming about this for over a decade now, and these outages are only growing more frequent, larger in scope, and more severe. The ultimate root cause of this is centralization of technology and business into the hands of very few companies, and the lack of both incentives to proactively fix their fucking shit and meaningful punishments for failing to do so. It's a matter of global defense to decentralize our infrastructure, diversify across multiple vendors, and improve regulations so that the next time something like this happens, the executives and shareholders who supported share buybacks and layoffs over R&D or improvements are held to account.

This was not a failure of technical controls; it was a failure of basic competency.