

tef
@tef

"spaceflight is risky business" explains the tweet. i mean, yes, space is a hostile environment, but so is nasa.

i think about how the apollo 1 astronauts sent a picture to their boss' boss, of them praying towards the capsule, a few months before they burned to death in a test.

during the investigation, one senator said NASA had an "evasiveness, ... lack of candor, ... patronizing attitude toward Congress ... refusal to respond fully and forthrightly to legitimate Congressional inquiries"

i also think about the challenger disaster. the one where the rocket contractor went "don't fly" and nasa went "are you suuuuuuuuuuure?" and the contractor went "shrug."

i think about how the investigation only found out the truth because someone leaked the reports to the commission—who then protected their sources by inviting feynman over for dinner, and theatrically working on their car and talking about o-rings, so he'd have an excuse to start looking.

"The commission concluded that the safety culture and management structure at NASA were insufficient to properly report, analyze, and prevent flight issues."

or columbia, where management cancelled attempts to look for debris damage, because it would ruin the schedule.

"the board determined that NASA lacked the appropriate communication and integration channels to allow problems to be discussed and effectively routed and addressed."

the thing that gets me, apart from nasa killing astronauts every twenty years, is that so many people see it as an acceptable cost of progress. if anything, they see it proving the difficulty and the merit of spaceflight.

i'll give you two guesses for how the people who make "self driving cars" think


vogon
@vogon
  • while building the Apollo 13 service module, one of the frames that the oxygen tanks were mounted on was dropped a couple inches from a crane, causing a drain tube to get misaligned.
  • much of the service module hardware (basically anything that needed to run on the ground) had been redesigned to tolerate 65 volts DC power from the pad during the Gemini program, to allow NASA to cycle vehicles on the pad faster.
  • the fail-safe switches that shut off the oxygen tank heaters if the tank got above 80°F were not among this hardware, and were only rated for 28 volts DC.
  • during preflight testing, the oxygen tanks needed to be loaded and drained several times.
  • when they were attempting to detank the oxygen preflight, they noticed that one of the tanks wouldn't drain through the drain tube, and decided to boil the oxygen off instead of draining it in order to prevent delaying the mission by a month.¹
  • the pad power going through the heater caused the shut-off switches to fail short, but the good news is that they were redundantly measuring the temperature of the heater on the pad so they could avoid causing permanent damage to the heater circuitry!
  • the bad news is that the thermometer on the pad pegged at 85°F, so the heater was, unbeknownst to everyone, getting hot enough to burn off all the insulation from the wires to the tank stir fans and staying there for several hours. these wires sparked 180,000 miles from earth and resulted in the crew having to ration water and electricity for the next week.

the story everyone remembers about Apollo 13 is that human ingenuity saved three people's lives, but the more interesting story to me has always been the mix of cost-cutting, tight deadlines, groupthink, and design failures that endangered them in the first place.


  1. the report notes that the crew had signed off on this detanking procedure, but even aside from the fact that nobody (least of all them, who were probably spending much of their time getting ready to fly to the moon instead of thinking about engineering details) had the full information at the time, can you imagine the pressure on the astronauts not to be the ones who stopped the line for a month?


mifune
@mifune

To assess the risk of a component, a system, or the entire thing failing, you can perform a Failure Mode, Effects, and Criticality Analysis. This is a method to put a number on the chance of a thing breaking and on what kind of havoc that will cause, which lets you identify and fix the riskiest parts. These analyses have been done since the 1940s, including during the Apollo program. A key point of these analyses is that you only look at single failures. More information here: https://en.wikipedia.org/wiki/Failure_mode,_effects,_and_criticality_analysis
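To make that concrete, here's a toy sketch of the ranking arithmetic behind this kind of analysis, using the common risk priority number scoring (RPN = severity × occurrence × detection, each rated 1-10). The failure modes and ratings below are invented for illustration, not taken from any real NASA analysis:

```python
# Toy FMECA-style ranking: score each single-point failure mode on
# severity, occurrence, and detectability, then sort by RPN so the
# riskiest items float to the top of the fix-it list.
from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str
    failure: str
    severity: int    # 10 = loss of crew/vehicle
    occurrence: int  # 10 = almost certain to happen
    detection: int   # 10 = essentially undetectable before it bites

    @property
    def rpn(self) -> int:
        # Risk priority number: higher means deal with it sooner.
        return self.severity * self.occurrence * self.detection

# Hypothetical entries, loosely themed on the accidents discussed here.
modes = [
    FailureMode("O2 tank heater switch", "welds shut under overvoltage", 9, 3, 8),
    FailureMode("SRB field joint", "O-ring fails to seal when cold", 10, 4, 7),
    FailureMode("ET insulation", "foam sheds and strikes the orbiter", 9, 6, 5),
]

for m in sorted(modes, key=lambda m: m.rpn, reverse=True):
    print(f"RPN {m.rpn:4d}  {m.component}: {m.failure}")
```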

What makes the deadly NASA accidents so bad is that they can be reduced to a single failure. For Apollo 1 this was ignition of flammable materials in an oxygen-rich environment, Challenger can be reduced to a failure of the O-rings in the SRB, and Columbia to a foam strike from the fuel tank. These are the things you should catch with an FMECA. The shuttle disasters were especially bad, because earlier missions had similar issues which should have raised the risk level.

Apollo 13 is a bit different in that you need multiple failures:

1. You have to drop an oxygen tank.
2. During draining, you have to decide to evaporate the oxygen instead of fully investigating or replacing the tank.
3. You have to plug the heater into the higher-voltage system, breaking a switch.
4. The pad's temperature sensor doesn't read high enough.
5. The thing has to short in a way that it only blows up 290,000 km from Earth.

And that is why I consider Apollo 13 to be less bad than the other disasters. It's a failure that slipped through the cracks, rather than a willfully ignored single-failure risk. Loads of people fucked up, and probably only realized it after the fact.




in reply to @mifune's post:

@tef:

"What makes the deadly NASA accidents so bad is that they can be reduced to a single failure. For Apollo 1 this was ignition of flammable materials in an oxygen-rich environment, Challenger can be reduced to a failure of the O-rings in the SRB, and Columbia to a foam strike from the fuel tank."

This is tempting, but not always accurate.

Apollo 1? The astronauts begged to have flammable substances removed. They complied, and then the contractors put them back in. The door wouldn't open because ground fires weren't considered, and a previous door design had opened accidentally.

It wasn't just sparks in an oxygen rich atmosphere. It was filling the capsule with flammable materials, and not being able to escape, too.

Challenger? The O-rings didn't work as designed; they only worked because the O-ring deformed enough to form a correct seal later in the burn. This was known to the contractor and NASA, and it came up again in the discussion the day before launch, where the contractors said "it is not safe to launch."

They didn't want to push the launch back, they wanted it to happen before Reagan's State of the Union address.

Columbia? Foam strikes had happened several times before, but were dismissed. The debris strike team tried to get DoD imagery but was denied.

As for Apollo 13? Well, it's still better to view it through the lens of process failure than as a string of uncaught individual failures. Damaged tanks being used. Checks not being run. Launching with things operating outside of design specifications.

You have to ask "Why were these safety checks bypassed?" and it's the same old story: management pressure to avoid delay.

The thing is: Single failures happen all the time in a complex system, especially one as large as a space shuttle. It's a process failure when these things escalate to a loss of life.

In each accident, there was a rush to perform, a haphazard approach to safety from management. That, if anything, is the single failure common to all accidents.

Even so, reducing an accident to a single failure is in itself a symptom of poor safety culture. Saying "we just need to have better O-rings" instead of asking "why did we launch knowing it was unsafe?" just leads to new, preventable accidents.

@mifune:

You're right that you can point a lot of blame at management for applying too much pressure, and if you look back on an accident you'll always find multiple causes for it, most of them organizational.

It is still useful to look at the number of steps required for a major accident, because it says a lot about the way an organization is broken and what has to change to fix it.

If only one component failure was needed, and only one or two preventative measures weren't taken, something is deeply wrong with risk management in the organization. It won't be just this one thing that was deliberately ignored; there are probably several more accidents waiting to happen.

There will be fundamental flaws in designs and procedures. You can't fix it by taking care of the thing that caused the accident. You'll have to ask fundamental questions about why this failure was allowed to happen, what other potential failures are still there, and what changes have to be made in managing risk. It's a management problem, and not something with a root cause in the field.

A tool you can use for that is an FMECA, because it forces you to look at all single failures, classify them by risk, and come up with fixes or measures to mitigate those risks. But this is what should have been done in the engineering phase, not after something horrible has happened.

An accident requiring a Rube Goldberg-like chain of events exposes another type of organizational failure: the high-risk failures may be under control, but low-risk failures are accumulating due to pressure from management. The way to fix this is also different. For example, it requires better feedback from the field to engineering to fix small issues, more resources for QC, and things like mandates from upper management to stop work in unsafe situations.

Basically, working conditions in general have to be improved. The Big Accident can be a driver for that, but what you'll see in most organizations is that they want to reduce the number of injuries and quality issues.

Of course the real world is less black and white, but this can tell you whether you have to start looking for fundamental management problems or for bad working conditions.

@tef:

It's more that I don't think you can divide up the failures (apollo 1, columbia, challenger, apollo 13) into a simple "complex cause/simple cause" taxonomy.

If anything, the only way to divide up these accidents is into "everyone died" / "no-one died"

edit: i mean, it is worth looking at the steps, sure, but it's very easy to have one step when you think about it one way, and five steps when you think about it another

like saying "it was a spark in a high oxygen environment" skips over the decisions and actions that lead there

"nitrogen atmospheres were considered higher risk" "the door was not designed to be opened in the event of a fire" "flammable material was not removed"

i could just as easily say "a damaged tank overheated and exploded in space", and skip over how and why a damaged tank got sent into space, or how and why the tank exploded