Apollo 13: Engineering Lessons Under Crisis

This video is a bit different from my normal fare, but by about halfway through the film I knew I wanted to do this.

Links of Note

Script

Hey, everybody. This video is going to be a little different from my normal fare. I recently got the opportunity to re-watch Apollo 13, a movie I enjoyed growing up but have not seen in more than two decades. Even as a kid, I loved the engineering sequences, in which Mission Control needed to solve unexpected and extremely challenging problems to bring back home Jim Lovell, Fred Haise, and Jack Swigert. In this video, I’m going to delve into detail on those challenges and important take-aways for engineers in any discipline, like, say, software development. By the way, when I use the term “engineer” in this video, I’m purposefully including all kinds of roles, from engineers proper to technicians and operators. The exact job title here isn’t important; it’s the skill set and mentality which matters the most. Anyhow, sit back and enjoy:

APOLLO 13: ENGINEERING LESSONS UNDER CRISIS

Before I get started in earnest, I want to make note that this video is based solely on the movie Apollo 13, a dramatization of Jim Lovell’s book, Lost Moon: the Perilous Voyage of Apollo 13. This is not a recounting of actual events as they happened–I understand there were areas where the film wasn’t completely accurate to reality and they made up a couple of bits to ratchet up the suspense, but my plan is to appreciate the film as a film rather than as a reflection of actual events. That said, it’s still a lot closer to reality than most such movies.

As another aside, something I really appreciated was just how good the acting was in this movie. It struck me throughout the film that they were doing a great job of following the adage “Show, don’t tell.” For example, we learn a bit about the personality of Jim Lovell–played by Tom Hanks–as a former test pilot in the opening sequence, where we see him zooming along the highway in Houston, starting with a forward-facing shot of a car zipping through traffic and switching to a view of Lovell in his shiny, red Corvette. (By the way, I said stick to the film, but in real life, Lovell’s Corvette was blue. How dare you, Hollywood.) Anyway, he’s a strait-laced guy who lives fast, and we get that information simply from this first scene, comparing Lovell’s bearing and appearance with his need for speed. This fits the early astronaut profile to a tee.

Another instance of this is later on in the film, when you see a television interview–taped sometime before the mission–of Lovell recounting a story as a combat pilot in which he is, at night, with no homing signal or radar, trying to find an aircraft carrier running dark. Then the electronics short out on his plane, leaving him completely in the dark, but instead of panicking he keeps his head straight, maintains control, and ends up finding the carrier safely due to a bit of luck and a trail of phosphorescent algae. This is playing on the television for us to get a measure of the man and I appreciate that they do it this way rather than having Lovell tell his fellow astronauts “Hey, I survived finding an aircraft carrier with no running or interior lights in the middle of a dark ocean, so we can do this!” He acts calm and collected by showing his training, and he doesn’t need to tell us any of this.

Speaking of the other astronauts, in early sequences in the film, we see Lovell, Ken Mattingly–played by Gary Sinise–and Fred Haise–played by Bill Paxton–become very close in training, to the point where they can interpret each other’s moves and have implicit trust. When Jack Swigert, played by Kevin Bacon, has to swap in for Mattingly at the 11th hour, there is a bit of conflict. We can see an example of that conflict in the way the trio address each other–Lovell and Mattingly call Haise Freddo, but Swigert, being the outsider, refers to him as Fred. But by the climax of the film, we see that conflict resolved in a natural way, and we see the comradery between the two men as Swigert too refers to Haise as Freddo. It’s these little character touches, as well as exactly the right facial expression or the right tone of a word which tell so much more to us than what the actors need to say. I wanted to get that in early because the acting and characterization really hold up and they’re worth discussing even in a video about Mission Control.

So now let’s get to the main event.

My first point is:

CONFLICT HAPPENS

Conflict happens, but interpersonal conflict shouldn’t drive animosities. In a healthy team, people should have the attitude that we’re on the same side here, trying to solve the same problems. But we do see conflict in ideas, in part because people weigh their preferences differently and have different sets of information available to them. In the film, one of the first Mission Control crisis scenes has a pair of engineers arguing over whether to turn the Apollo 13 command module around or try to slingshot around the moon. Both men make good points: the engineer wanting to turn the ship around notes that it would be the fastest way to get the crew back to Earth and that’s critical because they’ve already lost a considerable amount of oxygen. The opposing side points out that the service module engine is damaged and might explode if they try to use it, so it’s effectively dead. They’re hurtling toward the moon, so to turn around, they would need to cancel all of that velocity and then build up velocity in the opposite direction, and the vessel doesn’t have enough known-good engine power to do this. But if they continue on their path, they can redirect their existing momentum and use the moon’s gravitational pull to slingshot them back toward Earth. It will take longer to get back, but is much more likely to succeed.

Although tensions run high during the scene, the engineers all understand that this is an important part of decision-making under crisis: you need to understand the options available and be willing to speak up if you see a mistake in progress. Ed Harris’s Gene Kranz also does a good job here by allowing the squabbling but shutting it down when it stops being productive. He makes sure that the good engineers are able to voice their concerns and lay out all of the necessary information–even if they don’t do a perfect job of remaining calm and thoughtful. As a group, they are able to think through the problems, and as the flight director, Kranz’s job is to make the call.

This works because:

THERE IS NO SOLE SUPERGENIUS

Something I appreciated in the film is that we didn’t have a singular, brilliant mind answering all of the problems. Because that’s not how it works in good organizations. You have different groups of people with specializations in different areas and the capability to solve different types of problems. Mission Control had different groups of engineers solving problems like CO2 scrubbing, turning the flight computer back on, and coming up with a flight path that will let the command module land in the ocean without burning up in the atmosphere or skipping past the earth like a rock on a pond. This wasn’t Gene Kranz sitting in his lair, meticulously plotting and having scribes and underlings carry things out; this was a big effort from a lot of people quite capable of solving problems.

What’s interesting in the film is that Ken Mattingly is kind of set up in the “supergenius” role due to his tremendous work ethic and deep knowledge of the command module. But even in the sequences where Mattingly works to get the flight computer power-on sequence correct, he’s a member of a team. He’s not the one who realized that there would be power problems in the first place, and although he’s the one who we see come up with the final process, he’s not doing it alone. Furthermore, they don’t have Ken fix the flight computer, figure out burn time, and scrub CO2–he’s a smart man performing a heroic effort in doing his part to bring three men home alive. That he’s not a sole supergenius takes nothing away from that, but it does give us a critical insight that engineering under crisis is a team sport.

And while I’m talking about Mattingly, I do want to point out that although he’s an astronaut, he’s tied in with the engineers too. At that point in its history, NASA selected primarily for a few critical traits: intelligence, composure, and creativity under stress. This is part of why they loved choosing test pilots, as that’s a great demonstration of all three. The training program was rigorous and included intimate knowledge of the critical systems, as once you’re out in space, you’re on your own. Mission Control may be available most of the time–assuming no radio outages–but you have to know how to operate the different systems, how to troubleshoot problems, and how to make things work in stressful conditions. The film shows this during the sequences where Mattingly and Swigert dock with and retrieve the Lunar Module, or LM (pronounced LEM). Mission Control simulates different failures and forces the command module pilot to think on his feet, diagnosing the problem, assessing the situation, and performing the right response. In order to be that highly proficient of an operator, the astronauts need to have almost the same understanding of the systems that engineers have. You may not have asked Ken Mattingly to design and fabricate a command module, but he has to know the hows and whys behind its design because it’s literally his life on the line.

Let me give you a more prosaic example of how all of this fits into the real world. IT teams are made up of people with a range of specialties: front-end developers, back-end developers, database developers, database administrators, network administrators, systems administrators, architects, and a whole lot more. If you have a Sev-1 crisis and your company’s applications are down, it’s natural to expect these groups of people to work within their specialties to solve problems. Ideally, you don’t have a single The Expert whose galaxy brain is responsible for solving all of the problems. This is because The Expert can be wrong and, no matter how long The Expert has been around in an organization, there’s always going to be more knowledge than one person can hold. In a well-functioning engineering team under crisis, each person understands their piece of the puzzle and knows how to get information. App developers are looking through their logs while database administrators try to figure out if the issue is in their domain, network administrators review packet and firewall logs, and systems administrators check server logs. They can take and synthesize the information they’re processing to help shed light on the problem and develop viable solutions. When there are people with demonstrated expertise–as opposed to being The Expert–they can help narrow down places to look, compare the current issue to prior issues, and make connections which specialists alone might not have made. But even so, that person is still one member of a group.

This is also bleeding over into my next point:

FOCUS ON THE PROBLEM AT HAND

In one interesting scene, Swigert mentions to Lovell that they’re coming in too shallow and at this trajectory, will skip right out of the atmosphere. Lovell responds by saying “There are a thousand things that have to happen in order. We are on number 8; you’re talking about number 692.” In a tough situation, it’s easy to focus on a specific problem to the detriment of coming up with a solution. In a few lines of dialogue, we get the crux of an important point: document that there is an issue, follow up on the issue when it’s the appropriate time, and continue working on the most important thing. Under normal circumstances, it’s easy to prioritize issues based on who’s complaining loudest or most recently. To give you a data platform example, I might notice some slow reports in a log and decide to take the day speeding those up. It may not have been the single most important thing I could have done at that point in time, but it was a reasonable thing to do as it was causing people pain right then. But in a crisis situation, we have to avoid that temptation and focus on the single most important thing. Yes, those reports are slow. And they might cause us to lose the contract if we don’t fix them. But right now, the database is refusing connections and we need to fix that before we can think about reports.
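If you want to see that “number 8 versus number 692” idea in code, here is a minimal sketch in Python. Everything in it is an assumption of mine for illustration: the CrisisBacklog class, the method names, and the priority numbers are hypothetical and not taken from any real incident tool. The point is simply that you write the issue down, rank it, and always pull the most urgent item next.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Issue:
    priority: int                            # lower number = more urgent
    seq: int                                 # tie-breaker: order in which it was logged
    description: str = field(compare=False)  # not used for ordering


class CrisisBacklog:
    """Hypothetical backlog: document every issue, work the most urgent first."""

    def __init__(self):
        self._heap = []
        self._counter = 0

    def log(self, description, priority):
        # Documenting an issue does not mean working on it right now.
        heapq.heappush(self._heap, Issue(priority, self._counter, description))
        self._counter += 1

    def next_issue(self):
        # Return the most urgent open issue, or None when the backlog is empty.
        if not self._heap:
            return None
        return heapq.heappop(self._heap).description


backlog = CrisisBacklog()
backlog.log("Reports are slow", priority=692)
backlog.log("Database is refusing connections", priority=8)
first = backlog.next_issue()
print(first)
```

The slow reports never get lost, because they stay in the backlog. They just wait until the database accepts connections again.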

Tying back to the previous idea of expertise without The Expert, this also allows engineering teams to focus on solving a problem. We have navigation specialists focusing on burn calculations, a sub-set of engineers trying to figure out how to put a square peg into a round hole, and a group of engineers working with Mattingly to power on a flight computer while using less than 12 amps and “You can’t run a vacuum cleaner on 12 amps, John.” Kranz and Glynn Lunney, as the two shifts’ flight directors we get to see in the film, need to worry about all of these problems, ensure that teams are working on the most critical things, and coordinate between teams. But they also need to be able to delegate tasks to team leads–they have to trust that those team leads will do the right thing because they don’t have the time or energy needed to both coordinate efforts and solve problems.

Taking this back to the real world, part of my company’s production emergency process is to designate a coordinator. Typically, that person is someone separate from the engineers working on the problem, as the coordinator needs to communicate and process information effectively and it’s hard to do that while you’re typing furiously and scrambling to remember where that stupid log file gets written to and where you saw error code 3400. That coordinator role is only one portion of what Kranz and Lunney needed to do, but it’s a vital part, especially when a lot of people are working in small groups on different parts of the problem.

The film did an excellent job of portraying my next theme:

CREATIVITY AND PRAGMATISM

Apollo 13 really captured the spirit of great engineers under pressure: it is a combination of creativity in solving problems mixed with a down-to-earth thought process which forces you to grapple with the problem as it is rather than as you’d like it to be.

I think the line which most exemplified the pragmatism side was when Ken Mattingly prepared to enter the command module simulator, demanding that conditions be set the same as what the crew on board were experiencing. “I need a flashlight,” he says. When a tech offers up the flashlight in his hands, Mattingly responds, “That’s not what they have up there. Don’t give me anything they don’t have onboard.” Mattingly limits himself to the conditions and circumstances of the actual command module to ensure that what he does on the ground can be replicated exactly by the crew onboard the Apollo 13 command module as it currently is, not as they ideally would wish it to be.

Meanwhile, the first major scene which really brought the combination of skills together was when Gene Kranz brought together all of his engineers and made a show of throwing away the existing flight plan. All of their effort and planning is out the window because circumstances changed to the point where that plan is untenable. And now it’s time to improvise a new mission. We are in a crisis, so our expectations of normal behavior have gone in the trash along with that flight plan.

But we see another example of the brilliance that can come from creative, pragmatic people under pressure. In one of the most iconic sequences in the film, NASA engineers need to fit a square peg into a round hole. The carbon dioxide scrubbers in the lunar module are cylindrical and designed for a cabin with two people living in it for a fairly short amount of time. The command module CO2 scrubbers are larger, boxier, and completely useless as-is. Because nobody planned on using the lunar module as a lifeboat, there weren’t any spare CO2 scrubbers for the lunar module and without one, the crew would die of carbon dioxide poisoning long before they reached Earth.

Cue the engineers, dumping out everything available to them on a table and one technician saying, “Okay, people, listen up. The people upstairs handed us this one and we gotta come through.” No grousing, no complaints about how this isn’t normal. Instead, it’s a challenge. The other technicians start with the most logical step: let’s get it organized. Oh, and they get to brewing coffee too. I guess that’s the second logical step.

What they come up with is a bit of a mess, but it solves the problem. It is a hack in the most positive sense of the term: a clever and unexpected solution to a problem. You don’t want to make a living building these, but it’s good enough to get three men home.

Taking this a step further, use of the LM itself was a great example of creativity and pragmatism. The lunar module was designed to land on the moon, collect some rocks, and get them back to the command module. Using it as a lifeboat was, as the Grumman rep points out, outside of expected parameters. Even in this crisis, the rep is thinking as a bureaucrat: his team received specifications and requirements, his team met those specifications and requirements, and anything outside of those specifications and requirements he can’t guarantee. They did exactly what they needed to do in order to fulfill the contract and that, to them, was all they could do. This bureaucratic mindset is not a bad thing in normal circumstances–it reduces risk of legal liability by setting clear boundaries on expected behavior and promotes success by ensuring that people are willing to agree on the objectives to reach. But in the crisis, the Grumman rep gets shunted off to the side and the engineers take over. Because that’s what engineers do to bureaucrats when the going gets tough.

Another example of creativity marrying pragmatism involves the LM burn sequence. Lovell and Haise needed to perform a precise, controlled burn, but because the flight computer was turned off, they needed a way to ensure they stayed on track and did not stray too far. Their answer was to find a single, fixed point: Earth. As Lovell states, “If we can keep the earth in the window, fly manually, the co-ax crosshairs right on its terminator, all I have to know is how long do we need to burn the engine.” That sequence displays knowledge, creativity, and a way to make the best out of a given situation.

My third LM example takes us back to Ken Mattingly and the LM power supply. Mattingly is struggling to power on the flight computer with fewer than 12 amps and failing. He hits on a wild idea: the LM batteries still have power remaining, and there is an umbilical which provides power from the command module to the LM as a backup system. If they reverse the flow of power, they’ll draw from the LM batteries and into the command module, giving them enough power to get over the hump. Their unexpected lifeboat gives them one final miracle.
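To make the power budget problem a little more concrete, here is a rough sketch in Python of the kind of check Mattingly is effectively doing by hand in the simulator. I am assuming a very simplified model in which each step of the power-on sequence adds a fixed draw to a running total. The system names and amp values are made up for illustration; only the 12 amp ceiling comes from the film.

```python
def first_overload(steps, limit_amps):
    """Return the name of the first step that pushes total draw past the limit.

    steps is a list of (name, amps) pairs in power-on order.
    Returns None if the whole sequence stays within the limit.
    """
    total = 0.0
    for name, amps in steps:
        total += amps
        if total > limit_amps:
            return name
    return None


# Hypothetical sequence and draws, invented purely for illustration.
attempt = [
    ("guidance", 3.5),
    ("telemetry", 4.0),
    ("heaters", 2.0),
    ("radios", 3.0),
]

failing_step = first_overload(attempt, limit_amps=12.0)

# Dropping one system from the sequence brings the total back under budget.
trimmed = [step for step in attempt if step[0] != "heaters"]
trimmed_result = first_overload(trimmed, limit_amps=12.0)
print(failing_step, trimmed_result)
```

The real procedure had far more to worry about than a running sum, but the shape of the work is the same: try a sequence, see where it breaks the budget, change something, and try again.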

Creativity and pragmatism are invaluable assets for engineers in a crisis situation, letting them take stock of how things are and make best use of the circumstances to solve challenging problems. But they also need:

PERSEVERANCE UNDER PRESSURE

Being creative in a simulated situation or in day-to-day work is different from being able to get the same results under pressure. Perseverance is one of those key traits which can make the difference between success and failure in a crisis.

We get a glimpse of engineers persevering–and another example of show, don’t tell–when we see engineers sleeping in cots in a back room. “Is it A.M. or P.M.?” is a question you don’t generally want to hear from your employees, but sometimes a crisis requires this. Now, if your engineers have to do this regularly, you have a serious organizational problem. But when you are in the middle of a crisis, it’s good to know that you have people willing to push their limits like this. In other words, you want your employees to be willing to do this, but your employees shouldn’t be doing this except in true emergencies. Like three guys you’re trying to bring home from outer space.

Pressure doesn’t have to be huge and external, either. Think of the everyday, little things. Like, say, a projector bulb bursting. By the way, I had one of those happen to me during a training once. They do make a pretty loud bang when they go. Anyhow, what does Gene Kranz do after the first tool he uses fails on him? Move to a backup: the chalkboard. Have backups to your backups and be willing to move on. If they spent twenty minutes waiting for a staff member to find a spare overhead bulb, that’s twenty minutes they’re wasting during a crisis, at a point in which they need every minute they can get. Fail fast and be ready to abandon tools which are impeding a solution. By the way, I liked Kranz’s response to this: he’s frustrated. Frustration is okay–it’s part of the human condition, especially under stress. But don’t let it prevent you from doing what you need to do, and don’t get tunnel vision on this frustration, as you’ll only compound it without helping the situation any.

Underlying most of my video so far is an implicit cheerful, go-get-em, “We can do this!” attitude. But you don’t need that attitude all the time as an engineer in a state of crisis. Doubt, like frustration, is part of the human condition and to experience doubt in a situation of stress is normal. But again, just like frustration, focusing on the doubt gets nothing accomplished. Even if we doubt we can come up with something, we keep searching for a solution. A great example of this is where the technician doubts Mattingly will be able to turn on the flight computer without breaking 12 amps, but even so, he keeps at it because he refuses to let Mattingly or the crew on Apollo 13 down.

While we’re talking about topics like frustration and doubt, let’s make something clear:

WE STILL MAKE MISTAKES

Just because you’re an engineer–maybe even a smart engineer, maybe even one of the best in the world at what you do–it doesn’t mean you live mistake-free. And it’s surprisingly easy to make big mistakes simply by assuming that what you know is correct and failing to take into consideration changing circumstances.

Let me give you an example of that. During one attempt at maneuver, Lovell reports “We’re all out of whack. I’m trying to pitch down, but we’re yawing to the left. Why can’t I null this out?” Fred responds that “She wasn’t designed to fly like this, our center of gravity with the command module.” Lovell trained in the simulators and knew exactly how the LM is supposed to handle, but none of the simulations assumed that the LM would maneuver while still attached to the command module. As a result, all of those in-built expectations of behavior go right out the window and the crew are forced to compensate and learn anew.

Similarly, when Mission Control provided the crew with burn times to correct a trajectory problem, they made a crucial mistake: one of the flight path specialists tells Kranz, “We’re still shallowing up a bit in the re-entry corridor. It’s almost like they’re under weight.” They quickly realize the problem: Apollo 13 was intended to land on the moon, collect approximately 200 pounds of moon rocks, and bring them back to Earth. Therefore, they calculated the return trajectory based on that weight, but when the mission profile changed drastically, that small discrepancy made a difference and forced the crew to transfer items from the LM to the command module to keep things properly balanced.

A third example involves Haise, who is confused about why the CO2 numbers are rising so quickly, as “I went over those numbers three times.” But what he went over were the expectations for a two-man crew. Jack Swigert’s breathing threw off Fred Haise’s numbers, leading to the crew having less time than expected to solve this problem.

We generally do not need to be perfect in a crisis, and that’s a good thing–even at the best of times, I’m pretty sure nobody is perfect. What is important in these cases is that the engineers and crew understand that there is a mistake and correct things before they cause too much damage. And one way to limit the damage of mistakes is to have cross-checkers. I really liked this scene in which Lovell is filling in calculations for gimbal conversions and needs a double check of the arithmetic. Instead of trusting that he got the numbers right, Lovell goes to Mission Control. And instead of one person doing the math, you see half a dozen engineers independently working on the problem. Having people who can check that what you are doing is correct is huge, and even more so if they’re able to perform independent evaluations.
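Here is a small sketch of that cross-checking idea in Python. The cross_check function, the tolerance, and the sample numbers are all hypothetical and of my own choosing. The assumption is that several people have each worked the same arithmetic independently, and we only accept an answer when they all agree within a small tolerance.

```python
def cross_check(results, tolerance=0.01):
    """Accept a value only if all independent results agree within tolerance.

    results is a list of numbers, one per independent checker.
    Returns the average of the results when they agree.
    Raises ValueError when there are too few results or they disagree.
    """
    if len(results) < 2:
        raise ValueError("a cross-check needs at least two independent results")
    spread = max(results) - min(results)
    if spread > tolerance:
        raise ValueError(f"checkers disagree: spread of {spread:.3f} exceeds {tolerance}")
    return sum(results) / len(results)


# Hypothetical answers from three people working the same problem separately.
agreed = cross_check([34.5, 34.5, 34.5])

# One checker made an arithmetic slip, so the result is rejected.
try:
    cross_check([34.5, 34.5, 43.5])
    rejected = False
except ValueError:
    rejected = True
print(agreed, rejected)
```

A disagreement doesn’t tell you who is wrong, only that somebody is, and that’s exactly the signal you want before committing to a burn or, in my world, a production change.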

As we near the close of this video, I want to talk about one last thing:

MANAGING ENGINEERS UNDER CRISIS

I really liked the way we saw Gene Kranz and Glynn Lunney manage engineers during the film, especially Kranz. He does an admirable job in performing the role of a manager. Early on, when the different teams are in a back room trying to figure out what to do, Kranz opens the floor and lets engineers confer and share ideas. This is an important part of management of engineers: they are idea people, but to get ideas out of them, they need to know you’re willing to listen and not shut them down immediately. But it’s also the job of a manager to make a decision. Engineers can go around and around on an issue until the cows come home or get knee-deep into the minutiae of a situation, and sometimes, it’s the manager’s job to refocus people. Let your specialists figure out the options, but it’s your job to understand when it’s time to move.

Also, don’t solve a problem you don’t need to. We can spend a lot of time thinking about hypothetical situations and how we might proceed, but that’s a waste of brainpower and time, two critical resources during a crisis. Focus on the things you need to solve. Your team might need to solve more than one problem at a time, but make sure they’re actual problems and that they are the most important problems to solve right then.

And make sure your engineers understand exactly what it is they have to do. If you expect them to sleep in cots at the office for the duration of a crisis, you don’t want them twiddling their thumbs or having to look for work. If you’re a manager of managers, make sure that your leads have a grasp on the problem, understand the current state of affairs, and know what to work on. I’ve been in engineering crises–thankfully none which were life or death situations–and the most frustrating part is sitting there with nothing to do. You can’t go home (and it’s usually pretty late, so you want to go home), but there’s nothing you can actually do at the moment. That’s a failing of management we want to avoid.

To that extent, maximize communication. Bias toward over-sharing of problems and solutions rather than under-sharing. If a team sees a problem, it may turn out that it doesn’t actually need a solution and someone can tell them that. Or perhaps a member of another team sees an aspect of the problem which is worse than first anticipated, causing you to re-evaluate the problem altogether. Or maybe it turns out that a third team actually has the solution, but didn’t realize it was needed because they weren’t experiencing the problem.

Let me give you a real-life example of this, though I’m changing a few of the details to maintain appropriate confidences. Customers are calling because their data is not showing up in our system and it’s a critical time of the year where their data absolutely needs to be correct and up-to-date. Application developers are looking through their logs, trying to figure out what’s going on (and sometimes, trying to remember exactly how things work). I, as the database specialist, am looking through log entries in tables, database error logs, and whatever monitoring I can get my hands on to figure out if the database is rejecting data, if there are application errors when trying to write data, or what the deal is. We’re working together on this, but without sysadmins or network engineers to help us out. After some struggle, we engage the sysadmin, who tells us that things look fine on the servers, but noticed a major drop in network utilization over the past few hours, and the network engineer tells us that, oh yeah, there was an issue with a network link at a separate site and our co-lo’s technicians were working on the problem, but it only affected a few servers…specifically, our servers. This piece of information ended up being crucial, but there was a communication gap between the development teams (application and database) and the administration teams (systems and network). Had it been made clear to those administrative teams at the beginning, we might have saved hours of time trying to diagnose something that turned out not even to be something we caused.

That sharing also needs to be bi-directional. As a manager, getting regular reports from engineers is nice, but you have to be able to share syntheses of reports with everyone involved. One team might know the problem but lack the solution, and another team might have the solution but not know that there’s a problem; ensuring that people know what they need to know without a flood of unnecessary information is tricky, but that’s what a great manager has to do.

One thing great managers don’t do is management by hovering. Engineers know this all too well: we’re in a crisis, so you have levels of management standing over your desk as you’re trying to solve the problem. You can tell how big of a crisis it is by how many levels of management are all standing at your desk. They’re standing there, observing, often without providing direct insight or meaningful guidance. Oh, they may provide you guidance, but it’s rare that hoverers give you anything helpful. Hovering is a natural human instinct when you feel like you don’t have anything you can do but need to get a problem solved. To the extent that it signals the importance of the issue, it’s not the worst thing ever, but it does stress out engineers working on the problem. And hovering managers aren’t coordinating, which means all of those people who need information to help solve the problem aren’t getting it because the manager is standing over one person and watching that person type and click stuff.

Instead of hovering, be available. Frankly, even if you have nothing to do at that point in time, don’t hover. Gene Kranz? He doesn’t hover. He’s at his desk, he’s leading meetings, he’s talking to people. He’s available, but he doesn’t need to stand over the engineers working on the CO2 scrubber solution or the flight computer power solution or any other solution–he knows he has capable people who understand the problem and are motivated to solve it, so he’s in his role as collector and disseminator of information and as the person responsible for ensuring that people are doing what they need to do.

Finally, if you are a manager, expressing yourself in a positive but realistic manner is the way to go. Both traits are necessary here: if you are overly negative, people will wonder why they’re even there. If you’re sending the signal that you expect failure, that demotivates the people working on problems–after all, if we’re going to fail anyhow, why should I even try? But being overly cheerful or optimistic sounds Pollyannaish and runs the risk of losing the trust of engineers. Watch Gene Kranz throughout the film and he does a great job of straddling the line, including at two critical points in the film:

“Failure is not an option!”

And:

“With all due respect, sir, I believe this is going to be our finest hour.”

These are not the types of phrases you deploy on a day-to-day basis, but if you’re in a crisis, you want to know that the people in charge understand the situation and will move heaven and Earth to make it right. And from the manager’s side, that means making sure you give your engineers the best possible chance of success.

I hope you enjoyed this video. As I mentioned at the beginning, it’s a bit outside the norm for me, but it was a lot of fun to put together. And if you haven’t seen Apollo 13 in a while (or at all!), if you sat through this video, you’re definitely going to enjoy the movie. So until next time, take care.
