Apollo 13: Engineering Lessons Under Crisis

This video is a bit different from my normal fare, but by about halfway through the film I knew I wanted to do this.

Links of Note

Script

Hey, everybody. This video is going to be a little different from my normal fare. I recently got the opportunity to re-watch Apollo 13, a movie I enjoyed growing up but have not seen in more than two decades. Even as a kid, I loved the engineering sequences, in which Mission Control needed to solve unexpected and extremely challenging problems to bring back home Jim Lovell, Fred Haise, and Jack Swigert. In this video, I’m going to delve into detail on those challenges and important take-aways for engineers in any discipline, like, say, software development. By the way, when I use the term “engineer” in this video, I’m purposefully including all kinds of roles, from engineers proper to technicians and operators. The exact job title here isn’t important; it’s the skill set and mentality which matters the most. Anyhow, sit back and enjoy:

APOLLO 13: ENGINEERING LESSONS UNDER CRISIS

Before I get started in earnest, I want to make note that this video is based solely on the movie Apollo 13, a dramatization of Jim Lovell’s book, Lost Moon: the Perilous Voyage of Apollo 13. This is not a recounting of actual events as they happened–I understand there were areas where the film wasn’t completely accurate to reality and they made up a couple of bits to ratchet up the suspense, but my plan is to appreciate the film as such rather than inasmuch as it was a reflector of actual events. That said, it’s still a lot closer to reality than most such movies.

As another aside, something I really appreciated was just how good the acting was in this movie. It struck me throughout the film that they were doing a great job of the adage “Show, don’t tell.” For example, we learn a bit about the personality of Jim Lovell–played by Tom Hanks–as a former test pilot in the opening sequence, where we see him zooming along the highway in Houston, starting with a forward-facing shot of a car zipping through traffic and switching to a view of Lovell in his shiny, red Corvette. (By the way, I said stick to the film, but in real life, Lovell’s Corvette was blue. How dare you, Hollywood.) Anyway, he’s a straight-laced guy who lives fast, and we get that information simply from this first scene, comparing Lovell’s bearing and appearance with his need for speed. This fits the early astronaut profile to a tee.

Another instance of this is later on in the film, when you see a television interview–taped sometime before the mission–of Lovell recounting a story as a combat pilot in which he is, at night, with no homing signal or radar, trying to find an aircraft carrier running dark. Then the electronics short out on his plane, leaving him completely in the dark, but instead of panicking he keeps his head straight, maintains control, and ends up finding the carrier safely due to a bit of luck and a trail of phosphorescent algae. This is playing on the television for us to get a measure of the man and I appreciate that they do it this way rather than having Lovell tell his fellow astronauts “Hey, I survived finding an aircraft carrier with no running or interior lights in the middle of a dark ocean, so we can do this!” He acts calm and collected by showing his training, and he doesn’t need to tell us any of this.

Speaking of the other astronauts, in early sequences in the film, we see Lovell, Ken Mattingly–played by Gary Sinise–and Fred Haise–played by Bill Paxton–become very close in training, to the point where they can interpret each others’ moves and have implicit trust. When Jack Swigert, played by Kevin Bacon, has to swap in for Mattingly at the 11th hour, there is a bit of conflict. We can see an example of that conflict in the way the trio address each other–Lovell and Mattingly call Haise Freddo, but Swigert, being the outsider, refers to him as Fred. But by the climax of the film, we see that conflict resolved in a natural way, and we see the comradery between the two men as Swigert too refers to Haise as Freddo. It’s these little character touches, as well as exactly the right facial expression or the right tone of a word which tell so much more to us than what the actors need to say. I wanted to get that in early because the acting and characterization really hold up and they’re worth discussing even in a video about Mission Control.

So now let’s get to the main event.

My first point is:

CONFLICT HAPPENS

Conflict happens, but interpersonal conflict shouldn’t drive animosities. In a healthy team, people should have the attitude that we’re on the same side here, trying to solve the same problems. But we do see conflict in ideas, in part because they have different weights on preferences and different sets of information available to them. In the film, one of the first Mission Control crisis scenes has a pair of engineers arguing over whether to turn the Apollo 13 command module around or try to slingshot around the moon. Both men make good points: the engineer wanting to turn the ship around notes that it would be the fastest way to get the crew back to Earth and that’s critical because they’ve already lost a considerable amount of oxygen. The opposing side points out that the service module engine is damaged and might explode if they try to use it, so it’s effectively dead. They’re hurtling toward the moon, so to turn around, they would need to cancel all of that velocity and then build up velocity in the opposite direction, and the vessel doesn’t have enough known-good engine power to do this. But if they continue on their path, they can redirect their existing momentum and use the moon’s gravitational pull to slingshot them back toward Earth. It will take longer to get back, but is much more likely to succeed.

Although tensions run high during the scene, the engineers all understand that this is an important part of decision-making under crisis: you need to understand the options available and be willing to speak up if you see a mistake in progress. Ed Harris’s Gene Krantz also does a good job here by allowing the squabbling but shutting it down when it stops being productive. He makes sure that the good engineers are able to voice their concerns and lay out all of the necessary information–even if they don’t do a perfect job of remaining calm and thoughtful. As a group, they are able to think through the problems, and as the flight director, Krantz’s job is to make the call.

This works because:

THERE IS NO SOLE SUPERGENIUS

Something I appreciated in the film is that we didn’t have a singular, brilliant mind answering all of the problems. Because that’s not how it works in good organizations. You have different groups of people with specializations in different areas and the capability to solve different types of problems. Mission Control had different groups of engineers solving problems like CO2 scrubbing, turning back on the flight computer, and coming up with a flight path that will let the command module land in the ocean without burning up in atmosphere or skipping past the earth like a rock on a pond. This wasn’t Gene Krantz sitting in his lair, meticulously plotting and having scribes and underlings carry things out; this was a big effort from a lot of people quite capable of solving problems.

What’s interesting in the film is that Ken Mattingly is kind of set up in the “supergenius” role due to his tremendous work ethic and deep knowledge of the command module. But even in the sequences where Mattingly works to get the flight computer power-on sequence correct, he’s a member of a team. He’s not the one who realized that there would be power problems in the first place, and although he’s the one who we see come up with the final process, he’s not doing it alone. Furthermore, they don’t have Ken fix the flight computer, figure out burn time, and scrub CO2–he’s a smart man performing a heroic effort in doing his part to bring three men home alive. That he’s not a sole supergenius takes nothing away from that, but it does give us a critical insight that engineering under crisis is a team sport.

And while I’m talking about Mattingly, I do want to point out that although he’s an astronaut, he’s tied in with the engineers too. At that point in its history, NASA selected primarily for a few critical traits: intelligence, composure, and creativity under stress. This is part of why they loved choosing test pilots, as that’s a great demonstration of all three. The training program was rigorous and included intimate knowledge of the critical systems, as once you’re out in space, you’re on your own. Mission Control may be available most of the time–assuming no radio outages–but you have to know how to operate the different systems, how to troubleshoot problems, and how to make things work in stressful conditions. The film shows this during the sequences where Mattingly and Swigert dock with and retrieve the Lunar Module, or LM (pronounced LEM). Mission Control simulates different failures and forces the command module pilot to think on his feet, diagnosing the problem, assessing the situation, and performing the right response. In order to be that highly proficient of an operator, the astronauts need to have almost the same understanding of the systems that engineers have. You may not have asked Ken Mattingly to design and fabricate a command module, but he has to know the hows and whys behind its design because it’s literally his life on the line.

Let me give you a more prosaic example of how all of this fits into the real world. IT teams are made up of people with a range of specialties: front-end developers, back-end developers, database developers, database administrators, network administrators, systems administrators, architects, and a whole lot more. If you have a Sev-1 crisis and your company’s applications are down, it’s natural to expect these groups of people to work within their specialties to solve problems. Ideally, you don’t have a single The Expert whose galaxy brain is responsible for solving all of the problems. This is because The Expert can be wrong and, no matter how long The Expert has been around in an organization, there’s always going to be a surfeit of knowledge. In a well-functioning engineering team under crisis, each person understands their piece of the puzzle and knows how to get information. App developers are looking through their logs while database administrators try to figure out if the issue is in their domain, network administrators review packet and firewall logs, and systems administrators check server logs. They can take and synthesize the information they’re processing to help shed light on the problem and develop viable solutions. When there are people with demonstrated expertise–as opposed to being The Expert–they can help narrow down places to look, compare the current issue to prior issues, and make connections which specialists alone might not have done. But even so, that person is still one member of a group.

This is also bleeding over into my next point:

FOCUS ON THE PROBLEM AT HAND

In one interesting scene, Swigert mentions to Lovell that they’re coming in too shallow and at this trajectory, will skip right out of the atmosphere. Lovell respond by saying “There are a thousand things that have to happen in order. We are on number 8; you’re talking about number 692.” In a tough situation, it’s easy to focus on a specific problem to the detriment of coming up with a solution. In a few lines of dialogue, we get the crux of an important point: document that there is an issue, follow up on the issue when it’s the appropriate time, and continue working on the most important thing. Under normal circumstances, it’s easy to prioritize issues based on who’s complaining loudest or most recently. To give you a data platform example, I might notice some slow reports in a log and decide to take the day speeding those up. It may not have been the single most important thing I could have done at that point in time, but it was a reasonable thing to do as it was causing people pain right then. But in a crisis situation, we have to avoid that temptation and focus on the single most important thing. Yes, those reports are slow. And they might cause us to lose the contract if we don’t fix them. But right now, the database is refusing connections and we need to fix that before we can think about reports.

Tying back to the previous idea of expertise without The Expert, this also allows engineering teams to focus on solving a problem. We have navigation specialists focusing on burn calculations, a sub-set of engineers trying to figure out how to put a square peg into a round hole, and a group of engineers working with Mattingly to power on a flight computer while using less than 12 amps and “You can’t run a vacuum cleaner on 12 amps, John.” Krantz and Glynn Lunney, as the two shifts’ flight directors we get to see in the film, need to worry about all of these problems, ensure that teams are working on the most critical things, and coordinate between teams. But they also need to be able to delegate tasks to team leads–they have to trust that those team leads will do the right thing because they don’t have the time or energy needed to coordinate efforts and solve problems.

Taking this back to the real world, part of my company’s production emergency process is to designate a coordinator. Typically, that person is someone separate from the engineers working on the problem, as the coordinator needs to communicate and process information effectively and it’s hard to do that while you’re typing furiously and scrambling to remember where that stupid log file gets written to and where you saw error code 3400. That coordinator role is only one portion of what Krantz and Lunney needed to do, but it’s a vital part, especially when a lot of people are working in small groups on different parts of the problem.

The film did an excellent job of portraying my next theme:

CREATIVITY AND PRAGMATISM

Apollo 13 really captured the spirit of great engineers under pressure: it is a combination of creativity in solving problems mixed with a down-to-earth thought process which forces you to grapple with the problem as it is rather than as you’d like it to be.

I think the line which most exemplified the pragmatism side was when Ken Mattingly prepared to enter the command module simulator, demanding that conditions be set the same as what the crew on board were experiencing. “I need a flashlight,” he says. When a tech offers up the flashlight in his hands, Mattingly responds, “That’s not what they have up there. Don’t give me anything they don’t have onboard.” Mattingly limits himself to the conditions and cirumstances of the actual command module to ensure that what he does on the ground can be replicated exactly by the crew onboard the Apollo 13 command module as it currently is, not as they ideally would wish it to be.

Meanwhile, the first major scene which really brought the combination of skills together was when Gene Krantz brought together all of his engineers and makes a show of throwing away the existing flight plan. All of their effort and planning is out the window because circumstances changed to the point where that plan is untenable. And now it’s time to improvise a new mission. We are in a crisis, so our expectations of normal behavior have gone in the trash along with that flight plan.

But we see another example of the brilliance that can come from creative, pragmatic people under pressure. In one of the most iconic sequences in the film, NASA engineers need to fit a square peg into a round hole. The carbon dioxide scrubbers in the lunar module are cylindrical and designed for a cabin with two people living in it for a fairly short amount of time. The command module CO2 scrubbers are larger, boxier, and completely useless as-is. Because nobody planned on using the lunar module as a lifeboat, there weren’t any spare CO2 scrubbers for the lunar module and without one, the crew would die of carbon dioxide poisoning long before they reached Earth.

Cue the engineers, dumping out everything available to them on a table and one technician saying, “Okay, people, listen up. The people upstairs handed us this one and we gotta come through.” No grousing, no complaints about how this isn’t normal. Instead, it’s a challenge. The other technicians start with the most logical step: let’s get it organized. Oh, and they get to brewing coffee too. I guess that’s the second logical step.

What they come up with is a bit of a mess, but it solves the problem. It is a hack in the most positive sense of the term: a clever and unexpected solution to a problem. You don’t want to make a living building these, but it’s good enough to get three men home.

Taking this a step further, use of the LM itself was a great example of creativity and pragmatism. The lunar module was designed to land on the moon, collect some rocks, and get them back to the command module. Using it as a lifeboat was, as the Grumman rep points out, outside of expected parameters. Even in this crisis, the rep is thinking as a bureaucrat: his team received specifications and requirements, his team met those specifications and requirements, and anything outside of those specifications and requirements he can’t guarantee. They did exactly what they needed to do in order to fulfill the contract and that, to them, was all they could do. This bureaucratic mindset is not a bad thing in normal circumstances–it reduces risk of legal liability by setting clear boundaries on expected behavior and promotes success by ensuring that people are willing to agree on the objectives to reach. But in the crisis, the Grumman rep gets shunted off to the side and the engineers take over. Because that’s what engineers do to bureaucrats when the going gets tough.

Another example of creativity marrying pragmatism involves the LM burn sequence. Lovell and Haise needed to perform a precise, controlled burn, but because the flight computer was turned off, they needed a way to ensure they stayed on track and did not stray too far. Their answer was to find a single, fixed point: Earth. As Lovell states, “If we can keep the earth in the window, fly manually, the co-ax crosshairs right on its terminator, all I have to know is how long do we need to burn the engine.” That sequence displays knowledge, creativity, and a way to make the best out of a given situation.

My third LM example takes us back to Ken Mattingly and the LM power supply. Mattingly is struggling to power on the flight computer with fewer than 12 amps and failing. He hits on a wild idea: the LM batteries still have power remaining, and there is an umbilical which provides power from the command module to the LM as a backup system. If they reverse the flow of power, they’ll draw from the LM batteries and into the command module, giving them enough power to get over the hump. Their unexpected lifeboat gives them one final miracle.

Creativity and pragmatism are invaluable assets for engineers in a crisis situation, letting them take stock of how things are and make best use of the circumstances to solve challenging problems. But they also need:

PERSEVERENCE UNDER PRESSURE

Being creative in a simulated situation or in day-to-day work is different from being able to get the same results under pressure. Perseverence is one of those key traits which can make the difference between success and failure in a crisis.

We get a glimpse of engineers persevering–and another example of show, don’t tell–when we see engineers sleeping in cots in a back room. “Is it A.M. or P.M.?” is a question you don’t generally want to hear from your employees, but sometimes a crisis requires this. Now, if your engineers have to do this regularly, you have a serious organizational problem. But when you are in the middle of a crisis, it’s good to know that you have people willing to push their limits like this. In other words, you want your employees to be willing to do this, but your employees shouldn’t be doing this except in true emergencies. Like three guys you’re trying to bring home from outer space.

Pressure doesn’t have to be huge and external, either. Think of the everyday, little things. Like, say, a projector bulb bursting. By the way, I had one of those happen to me during a training once. They do make a pretty loud bang when they go. Anyhow, what does Gene Krantz do after the first tool he uses fails on him? Move to a backup: the chalkboard. Have backups to your backups and be willing to move on. If they spent twenty minutes waiting for a staff member to find a spare overhead bulb, that’s twenty minutes they’re wasting during a crisis, at a point in which they need every minute they can get. Fail fast and be ready to abandon tools which are impeding a solution. By the way, I liked Krantz’s response to this: he’s frustrated. Frustration is okay–it’s part of the human condition, especially under stress. But don’t let it prevent you from doing what you need to do, and don’t get tunnel vision on this frustration, as you’ll only compound it without helping the situation any.

Underlying most of my video so far is an implicit cheerful, go-get-em, “We can do this!” attitude. But you don’t need that attitude all the time as an engineer in a state of crisis. Doubt, like frustration, is part of the human condition and to experience doubt in a situation of stress is normal. But again, just like frustration, focusing on the doubt gets nothing accomplished. Even if we doubt we can come up with something, we keep searching for a solution. A great example of this is where the technician doubts Mattingly will be able to turn on the flight computer without breaking 12 amps, but even so, he keeps at it because he refuses to let Mattingly or the crew on Apollo 13 down.

While we’re talking about topics like frustration and doubt, let’s make something clear:

WE STILL MAKE MISTAKES

Just because you’re an engineer–maybe even a smart engineer, maybe even one of the best in the world at what you do–it doesn’t mean you live mistake-free. And it’s surprisingly easy to make big mistakes simply by assuming that what you know is correct and failing to take into consideration changing circumstances.

Let me give you an example of that. During one attempt at maneuver, Lovell reports “We’re all out of whack. I’m trying to pitch down, but we’re yawing to the left. Why can’t I null this out?” Fred responds that “She wasn’t designed to fly like this, our center of gravity with the command module.” Lovell trained in the simulators and knew exactly how the LM is supposed to handle, but none of the simulations assumed that the LM would maneuver while still attached to the command module. As a result, all of those in-built expectations of behavior go right out the window and the crew are forced to compensate and learn anew.

Similarly, when Mission Control provided the crew with burn times to correct a trajectory problem, they made a crucial mistake: one of the flight path specialists tells Krantz, “We’re still shallowing up a bit in the re-entry corridor. It’s almost like they’re under weight.” They quickly realize the problem: Apollo 13 was intended to land on the moon, collect approximately 200 pounds of moon rocks, and bring them back to Earth. Therefore, they calculated the return trajectory based on that weight, but when the mission profile changed drastically, that small discrepancy made a difference and forced the crew to transfer items from the LM to the command module to keep things properly balanced.

A third example involves Haise, who is confused about why the CO2 numbers are rising so quickly, as “I went over those numbers three times.” But what he went over were the expectations for a two-man crew. Jack Swigert’s breathing threw off Fred Haise’s numbers, leading to the crew having less time than expected to solve this problem.

We generally do not need to be perfect in a crisis, and that’s a good thing–even at the best of times, I’m pretty sure nobody is perfect. What is important in these cases is that the engineers and crew understand that there is a mistake and correct things before they cause too much damage. And one way to limit the damage of mistakes is to have cross-checkers. I really liked this scene in which Lovell is filling in calculations for gimbal conversions and needs a double check of the arithmetic. Instead of trusting that Lovell will get the numbers right, he goes to Mission Control. And instead of one person doing the math, you see half a dozen engineers independently working on the problem. Having people who can check that what you are doing is correct is huge, and even more so if they’re able to perform independent evaluations.

As we near the close of this video, I want to talk about one last thing:

MANAGING ENGINEERS UNDER CRISIS

I really liked the way we saw Gene Krantz and Glynn Lunney manage engineers during the film, especially Krantz. He does an admirable job in performing the role of a manager. Early on, when the different teams are in a back room trying to figure out what to do, Krantz opens the floor and lets engineers confer and share ideas. This is an important part of management of engineers: they are idea people, but to get ideas out of them, they need to know you’re willing to listen and not shut them down immediately. But it’s also the job of a manager to make a decision. Engineers can go around and around on an issue until the cows come home or get knee-deep into the minutae of a situation, and sometimes, it’s the manager’s job to refocus people. Let your specialists figure out the options, but it’s your job to understand when it’s time to move.

Also, don’t solve a problem you don’t need to. We can spend a lot of time thinking about hypothetical situations and how we might proceed, but that’s a waste of brainpower and time, two critical resources during a crisis. Focus on the things you need to solve. Your team might need to solve more than one problem at a time, but make sure they’re actual problems and that they are the most important problems to solve right then.

And make sure your engineers understand exactly what it is they have to do. If you expect them to sleep in cots at the office for the duration of a crisis, you don’t want them twiddling their thumbs or having to look for work. If you’re a manager of managers, make sure that your leads have a grasp on the problem, understand the current state of affairs, and know what to work on. I’ve been in engineering crises–thankfully none which were life or death situations–and the most frustrating part is sitting there with nothing to do. You can’t go home (and it’s usually pretty late, so you want to go home), but there’s nothing you can actually do at the moment. That’s a failing of management we want to avoid.

To that extent, maximize communication. Bias toward over-sharing of problems and solutions rather than under-sharing. If a team sees a problem, it may turn out that it doesn’t actually need a solution and someone can tell them that. Or perhaps a member of another team sees an aspect of the problem which is worse than first anticipated, causing you to re-evaluate the problem altogether. Or maybe it turns out that a third team actually has the solution, but didn’t realize it was needed because they weren’t experiencing the problem.

Let me give you a real-life example of this, though I’m changing a few of the details to maintain appropriate confidences. Customers are calling because their data is not showing up in our system and it’s a critical time of the year where their data absolutely needs to be correct and up-to-date. Application developers are looking through their logs, trying to figure out what’s going on (and sometimes, trying to remember exactly how things work). I, as the database specialist, am looking through log entries in tables, database error logs, and whatever monitoring I can get my hands on to figure out if the database is rejecting data, if there are application errors when trying to write data, or what the deal is. We’re working together on this, but without sysadmins or network engineers to help us out. After some struggle, we engage the sysadmin, who tells us that things look fine on the servers, but noticed a major drop in network utilization over the past few hours, and the network engineer tells us that, oh yeah, there was an issue with a network link at a separate site and our co-lo’s technicians were working on the problem, but it only affected a few servers…specifically, our servers. This piece of information ended up being crucial, but there was a communication gap between the development teams (application and database) and the administration teams (systems and network). Had it been made clear to those administrative teams at the beginning, we might have saved hours of time trying to diagnose something that turned out not even to be something we caused.

That sharing also needs to be bi-directional. As a manager, getting regular reports from engineers is nice, but you have to be able to share syntheses of reports with everyone involved. One team might know the problem but lack the solution, and another team might have the solution but not know that there’s a problem; ensuring that people know what they need to know without a flood of unnecessary information is tricky, but that’s what a great manager has to do.

One thing great managers don’t do is management by hovering. Engineers know this all too well: we’re in a crisis, so you have levels of management standing over your desk as you’re trying to solve the problem. You can tell how big of a crisis it is by how many levels of management are all standing at your desk. They’re standing there, observing, often without providing direct insight or meaningful guidance. Oh, they may provide you guidance, but it’s rare that hoverers give you anything helpful. Hovering is a natural human instinct when you feel like you don’t have anything you can do but need to get a problem solved. To the extent that it signals the importance of the issue, it’s not the worst thing ever, but it does stress out engineers working on the problem. And hovering managers aren’t coordinating, which means all of those people who need information to help solve the problem aren’t getting it because the manager is standing over one person and watching that person type and click stuff.

Instead of hovering, be available. Frankly, even if you have nothing to do at that point in time, don’t hover. Gene Krantz? He doesn’t hover. He’s at his desk, he’s leading meetings, he’s talking to people. He’s available, but he doesn’t need to stand over the engineers working on the CO2 scrubber solution or the flight computer power solution or any other solution–he knows he has capable people who understand the problem and are motivated to solve it, so he’s in his role as collector and dissemninator of information and as the person responsible for ensuring that people are doing what they need to do.

Finally, if you are a manager, expressing in a positive but realistic manner is the way to go. Both traits are necessary here: if you are overly negative, people will wonder why they’re even there. If you’re sending the signal that you expect failure, that demotivates the people working on problems–after all, if we’re going to fail anyhow, why should I even try? But being overly cheerful or optimistic sounds Polyannish and runs the risk of losing the trust of engineers. Watch Gene Krantz throughout the film and he does a great job of straddling the line, including at two critical points in the film:

“Failure is not an option!”

And:

“With all due respect, sir, I believe this is going to be our finest hour.”

These are not the types of phrases you deploy on a day-to-day basis, but if you’re in a crisis, you want to know that the people in charge understand the situation and will move heaven and Earth to make it right. And from the manager’s side, that means making sure you give your engineers the best possible chance of success.

I hope you enjoyed this video. As I mentioned at the beginning, it’s a bit outside the norm for me, but it was a lot of fun to put together. And if you haven’t seen Apollo 13 in a while (or at all!), if you sat through this video, you’re definitely going to enjoy the movie. So until next time, take care.

Upcoming Events: Data Platform Summit 2020

I’ve talked about Data Platform Summit before, but now that we’re close to go time, I wanted to mention it again.

This year, I will be presenting one breakout session and one training class.

The breakout session is entitled Optimizing Backup Performance using Data Science Techniques. In it, I walk through the configuration settings we have available to us for tuning backups (and restorations) of databases and then apply data science techniques to find the best set for a given environment. It’s a great way of applying data science skills to a classic database administration task and should be fun for both audiences. That session, along with more than 200 others, will run from December 2nd through the 4th.

The training, meanwhile, will take place over two days, December 7th and 8th. The training is entitled Applying Forensic Accounting Techniques to Data Analysis in Python and SQL and is a lengthy enough title that it’s self-explanatory. In it, I answer the perennial question for data platform specialists: what does it mean to know your data? The same processes and techniques forensic accountants use to analyze data sets can help us gain a detailed understanding of data in our own environments and become more proficient. Along the way, we uncover a bit of dirty dealing in an artificial data set because hey, where’s the fun in talking about forensic accounting if we can’t have a fraud-based example?

Save Some Money

Use the discount code TRIANGLESQL to save big on sessions and trainings, and also kick a little bit to the Triangle Area SQL Server Users Group.

Q&A: The Curated Data Platform

On Thursday, I presented a session at PASS Summit entitled The Curated Data Platform. You can grab slides and links to additional information on my website. Thank you to everyone who attended the session.

During and after the session, I had a few questions come in from the audience, and I wanted to cover them here.

Cross-Platform Data Strategies

The first question was, “What handles the translation between the relational truth system and the Document system?” The context of the question comes from a discussion about product catalogs, and specifically this slide.

Document databases are great for things like product catalogs, where we meet the following properties:

  • Data has a very high read-to-write ratio.
  • You generally look at one item per page—in this example, one product.
  • The set of data to appear on a page is complex and typically has nested items: a product has attributes (title, price, brand, description) but also collections of sub-items (product image links, current stock in different stores, top reviews, etc.).
  • The data is not mission-critical: if updates are delayed or even occasionally lost, that is acceptable.

But I do like keeping a “golden record” version of the data and my biases push me toward storing that golden record in a relational database. I mentioned two processes in my talk: a regular push on updates and an occasional full sync to true up the document database(s).

And that leads to the question of, how do we do that? There are products from companies like CData and Oracle which can handle this, or you can write your own. If your source is SQL Server, I’d push for a two-phase process:

  1. Enable Change Data Capture on the SQL Server instance and have a scheduled task query the latest changes and write them to your document database(s). You can use constructs like FOR JSON PATH in SQL Server to shape the documents directly, or pull in the source data and shape it in your application code.
  2. Periodically (e.g., once an hour, once a day), grab all of the data, shape the documents, and perform a comparison with what’s out there. This will confirm that nothing slips through the cracks for longer than one time period and will keep disparate clusters of document data separated.

Of course, this golden record doesn’t need to be in a relational database—you could store it in a document database and use replication there to push data to different clusters. If you use Cosmos DB, for example, you can replicate to other regions easily.

Document Databases: Scale-Out vs Replication

Another attendee asked about “Document databases and scale-out vs replication.” In retrospect, I think I misinterpreted the question as asked, as I mentioned that scale-out and replication are one and the same: you replicate data between nodes in a cluster to achieve scale-out.

But this time around, I’m going to answer the question, “How do I choose between making my current cluster bigger and replicating out to a new cluster?”

Here are some key considerations:

  • If the issue you are trying to solve is geographical in nature, replicate out to a new cluster closer to your users. In other words, suppose you have a cluster hosted in Virginia. Many of your users are in Japan, so they have to deal with the network latency of pulling data from a Virginia-based data center. If this is the problem, create another document database cluster in Japan and replicate to it from Virginia.
  • If your cluster is in Virginia and is getting hammered hard by relatively local users, scaling out is generally a good option. That is, adding more servers to the existing cluster. Depending on your technology, there will be a maximum number of nodes or a maximum scale-out size available to you, so you’d have to check out those details.
  • If you’re getting close to that maximum scale-out size, it may make sense to replicate to another cluster in the same region and use a load balancer to shift load between the two. I have to be vague here because different technologies have different limits and I’m definitely not an expert on any document database technology’s systems architecture.

Cosmos DB and MongoDB

Another attendee asked, “I have heard that Azure Cosmos DB is built upon an older version of MongoDB – do you know if this is true?”

The answer is no, it’s not. The two platforms are different. I believe where the question comes from is around the MongoDB API for Cosmos DB. For a while, Cosmos DB supported an older version of the MongoDB API, specifically 3.2. That API was released in December of 2015. New Cosmos DB clusters support the MongoDB 3.6 API, which is still active.

But something I want to point out is that the API is an interface, not an implementation. That Cosmos DB supports a specific MongoDB API version doesn’t mean that the code bases are similar; it only means that you can safely (well, presumably safely) interact with both and expect to get the same results when you perform the same set of API steps with the same inputs.

Graph Languages

My last question came from an attendee who mentioned, “I thought GraphQL was the common standard for graph querying.”

The context for this is in my discussion of graph databases, particularly the slide in which I talk about the key issues I have with graph databases. For a bit more background than what I had time to get into during the session, Mala Mahadevan and I have talked about graph databases in a little bit of detail on a couple of occasions, once on the SQL Data Partners Podcast and once on Shop Talk.

As for the question, the comment I had made was that there is no common graph language. We have SQL for relational databases (and other mature data platform technologies) but historically haven’t had a common language for graph platforms, meaning that you have to learn a new language each time you move to a different platform. The Gremlin language is an attempt at creating a single language for graph databases and it’s making enough strides that it may indeed become the standard. But it’s not there yet.

Meanwhile, GraphQL, despite the name, is not a language for graph databases. It’s actually a language for querying APIs. The key idea is that you ask for data from an API and you get back just the data you want. But behind the API, your data can be stored in any sort of data source—or even multiple data sources. In other words, I might expose a product catalog API which hits Cosmos DB, a finance API which hits SQL Server, and a product associations API which hits Neo4j. Those three APIs could all be queried using GraphQL, as it’s up to the API to interpret inputs and return the appropriate outputs.

Approaching Zero: The Video

Thanks to a bit of gentle nudging from Anthony Nocentino (b | t), I’ve finally put together a recording of my session entitled Approaching Zero: Minimizing Downtime During Deployments.

I tried to work the word “asymptotic” into the title, but my editor didn’t let me.

It’s a long video but there’s a lot of content that I hope you enjoy. Long-time blog readers might remember my series on the topic, which gives you a bit more detail. You can also get the slides, code, and links to additional resources on my presentation page.

Finally, one call-out to Carlos L. Chacon (w | t): I’m not saying you chose a bad color, but one of us has a problem here:

It’s not easy being wiped out by a chroma key.

PASS Virtual Summit is Next Week

I have two sessions at PASS Virtual Summit 2020 next week, one a full-day training and the other a general session.

SQL Server on the Edge: a Full-Day Training

Next Monday, I’m giving a pre-con on using SQL Server, .NET Core, and Azure to help solve IoT issues. We’ll walk through a realistic scenario involving building security via employee key cards, show how to use Azure IoT Edge and Azure SQL Edge to power edge devices, and perform machine learning on employee comings and goings to track behavioral changes.

Here is the PASS Summit writeup if you’d like more information. Use the discount code LGDISCWY7 to save a bit of cash along the way.

The Curated Data Platform

In addition, I’m going to give a regular session entitled The Curated Data Platform.

The idea is, let’s take a fictional but realistic company, walk through the types of data problems it experiences, and see which data platform technologies solve its problems, along with the biggest players in those spaces, and some reference architectures to boot.

Register for PASS Summit

Registration is still open for PASS Summit 2020, so join me for the virtual event. Use the discount code LGDISCWY7 for additional savings and I hope to see you there.

PolyBase and Postgres

Last month, I presented at year 4 of SQL Trail, a conference normally in Richmond, Virginia but entirely online this year due to obvious reasons. Before the conference, one of the attendees asked for a session on PolyBase, specifically around accessing Postgres data from SQL Server. I didn’t have an example on that, so figured I could come up with one and (eventually) blog about it.

Preparing Postgres

I didn’t have a Postgres server lying around anywhere, so I decided to set up my own using a Docker container:

docker pull postgres
docker run --name pg -e POSTGRES_PASSWORD=pgtestpwd -p 5432:5432 -d postgres

What this does is first, pull down the latest Postgres image from Docker Hub. Then, we create a container named pg based off of the postgres image. We’re going to forward requests on port 5432 to the Postgres container and set the administrative password to pgtestpwd, clearly a secure password.

Next up, we need to retrieve the Postgres ODBC driver. You can find drivers on the Postgres website. I’m using Windows for this demo, so I grabbed the latest MSI from the MSI directory. We need to install the ODBC driver on the machine which is running SQL Server. In this example, I have SQL Server installed on my PC, not in a Docker container or VM, so I’ll install the ODBC driver directly on my PC. In a prod scenario, you’d install it on your production machine running SQL Server, not the production machine running PostgreSQL.

After that, we need to get some data for our database. I’m going to create a simple table called event which stores event data. Because the data itself doesn’t really matter, roll with me here. You can connect to your new Postgres server from Azure Data Studio, pgadmin4, or whatever other tool you’d like.

If you do connect to Postgres via Azure Data Studio, like I did, you want to make sure that you’ve set the connection type correctly:

Connect to a Postgres instance

If you don’t see this connection type, make sure you have the PostgreSQL extension for Azure Data Studio.

Make sure you have the PostgreSQL extension.

Once connected, we can run the following script. I decided to use the default postgres database, but you can certainly create your own database as well if you’d like to make this more realistic.

CREATE TABLE event(
    id serial,
    machine_id int,
    event_id varchar(40),
    event_type varchar(70),
    entity_type varchar(70),
    entity_id varchar(40),
    event_data json
)

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Run this several times to generate data.
INSERT INTO event(
    machine_id,
    event_id,
    event_type,
    entity_type,
    entity_id,
    event_data
)
VALUES
(
    21644,
    uuid_generate_v4(),
    'Telemetry Ingest',
    'MachineTelemetry',
    uuid_generate_v4(),
    '{ "Test": "Yes" }'
);

SELECT * FROM event;

If everything worked as expected, the final SELECT statement should return some results for you.

Connecting to Postgres

Now that we have some data, let’s go back to SQL Server. I assume you’ve already installed and configured PolyBase—if not, check out my presentation on PolyBase. Note that this requires SQL Server 2019 or later, as that’s the first version which supports PolyBase to ODBC. Here’s a script which assumes a database named Scratch and a master key <<SomeSecureKey>>.

First up, we want to create a database scoped credential so that we can authenticate to Postgres. This credential gets encrypted using the database master key.

USE [Scratch]
GO
OPEN MASTER KEY DECRYPTION BY PASSWORD = '<<SomeSecureKey>>';
GO

IF NOT EXISTS
(
	SELECT 1
	FROM sys.database_scoped_credentials
	WHERE
		name = N'PostgresCredential'
)
BEGIN
	CREATE DATABASE SCOPED CREDENTIAL PostgresCredential
	WITH IDENTITY = 'postgres',
	SECRET = 'pgtestpwd';
END
GO

After creating the database scoped credential, we will create a data source pointing to Postgres, connecting on localhost and port 5432, just like we set up with Docker. We’ll specify the PostgreSQL Unicode driver and point to the postgres database. Note that PostgreSQL contains ANSI and Unicode drivers, meaning that all of your strings will be either ANSI or Unicode. Given my preferences in SQL Server, I think it’d be obvious which way I chose.

IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_data_sources ds
    WHERE
        ds.name = N'PostgresEvents'
)
BEGIN
    CREATE EXTERNAL DATA SOURCE PostgresEvents WITH
    (
        LOCATION = 'odbc://localhost:5432',
        CONNECTION_OPTIONS = 'Driver={PostgreSQL Unicode}; Database=postgres',
		CREDENTIAL = PostgresCredential,
		PUSHDOWN = ON
    );
END
GO

Next, we’ll wrap up the external object creation by building out an external table. This external table will use NVARCHAR to represent both Postgres strings and the JSON data type.

IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_tables t
    WHERE
        t.name = N'PostgresEvent'
)
BEGIN
    CREATE EXTERNAL TABLE dbo.PostgresEvent
    (
        id INT,
        machine_id INT,
        event_id NVARCHAR(40),
        event_type NVARCHAR(70),
        entity_type NVARCHAR(70),
        entity_id NVARCHAR(40),
        event_data NVARCHAR(255)
    )
    WITH
    (
        LOCATION = 'event',
        DATA_SOURCE = PostgresEvents
    );
END
GO

The final test comes when we query the data. Run the following and you should see the same set of rows that you saw when connecting directly to Postgres:

SELECT * FROM dbo.PostgresEvent;
GO

Postgres-Specific Notes

I am not particularly familiar with PostgreSQL, certainly not to the point where I could give you the ins and outs of compatibility issues. My simple tests were successful and I didn’t find a huge number of people complaining about compatibility issues with PolyBase, but Postgres has multiple languages and enough complex data types that there may very well be compatibility problems. But that said, one of the problems I expected—having a JSON data type—converted just fine. I did see an issue around the NUMERIC data type but didn’t dive in.

Shameless Advertising

In case you want to learn a lot more about PolyBase, check out PolyBase Revealed, the only book-length treatment on the topic. Not only that, but it also makes for a great monitor stand—it’s a thick enough book to make a difference, but thin enough that you can get more precise monitor adjustments with it. Buy several copies and relieve neck strain today!

Query Store: QDS Toolbox

I wanted to announce the first open source project officially released by ChannelAdvisor: the QDS Toolbox. This is an effort which Pablo Lozano and Efraim Sharon pushed hard internally and several database administrators and database engineers contributed to (though I wasn’t one of them).

From the summary page:

This is a collection of tools (comprised of a combination of views, procedures, functions…) developed using the Query Store functionality as a base to facilitate its usage and reports’ generation. These include but are not limited to:

– Implementations of SSMS’ GUI reports that can be invoked using T-SQL code, with added funcionalities (parameterization, saving results to tables) so they can be programmatically executed and used to send out mails.
– Quick analysis of a server’s overall activity to identify bottlenecks and points of high pressure on the SQL instance at any given time, both in real time or in the past.
– Cleanup of QDS’ cache with a smaller footprint than the internal one generates, with customization parameters to enable a customizable cleanup (such as removing information regarding dropped objects, cleaning details of ad-hoc or internal queries executed on the server as index maintenance operations).

The biggest of these is the third item. In our environment, Query Store could be a beast when trying to delete old data, and would often be the biggest performance problem on a given server.

In addition, several procedures exist as a way of aggregating data across databases. We have a sharded multi-tenant environment, where we might have 5-15 replicas of a database schema and assign customers to those databases. QDS Toolbox helps aggregate information across these databases so that you don’t need to look at each individually to understand performance problems. The database team has then created reports off of this to improve their understanding of what’s going on.

Check out the QDS Toolbox as a way to clean up data better than the built-in cleanup process and get additional information aggregated in a smart way.

Transaction Modes in SQL Server

In the following video, I take a look at the three most important transaction modes in SQL Server: autocommit, explicit transactions, and implicit transactions. Sorry, batch-scoped transactions, but nobody loves you.

If you’d prefer a textual rendering of the video, here are the pre-demo and post-demo sections, lightly edited for narrative flow.

Setting the Stage: Transactions and Modes

What I want to do in today’s post is to cover the different sorts of transaction modes and get into the debate about whether you should use explicit transactions or rely on auto-committed transactions for data modification in SQL Server. This came from an interesting discussion at work, where some of the more recent database engineers were curious about our company policy around transaction modes and understanding the whys behind it. I didn’t come up with the policy, but my thinking isn’t too far off from the people who did.

But before I get too far off course, let’s briefly lay out some of the basics around transactions.

When you modify data (that is, run a command like INSERT, UPDATE, MERGE, or TRUNCATE) or tables (CREATE, ALTER, DROP, etc.), that operation takes place inside a transaction.  A transaction is, according to Microsoft Docs, a single unit of work.  Everything in a transaction will either succeed as a whole or fail as a whole–you won’t end up with some operations succeeding and others not–it’s really all or nothing.  The importance of this goes back to relational databases having ACID properties, but because that’s a little far afield of where I want to go, I’ll give you a link if you’d like to learn more about the topic, as it helps explain why transactions are useful for relational database developers.

What I do want to get to is that there are three kinds of transactions:  autocommit transactions, explicit transactions, and implicit transactions.  There’s actually a fourth kind, batch-scoped transactions, but that only applies to Multiple Active Result Sets transactions and if you find yourself there, you’ve got bigger problems than deciding how you want to deal with transactions.

In the demo for the video, I show off each of the three transaction modes, including how you enable them, how you work with them, and any important considerations around them.

Recommendations

The easy recommendation is, don’t use implicit transactions.  For SQL Server developers and database administrators, this is unexpected behavior–the default is to use autocommit, so that if you run an INSERT statement by itself, the transaction automatically commits at the end.  If you set implicit transactions on, there is no UI indication that this is on and it becomes really easy to forget to commit a transaction.  I understand that if you come from an Oracle background, where implicit transactions are the norm, it might feel comfortable to enable this, but it becomes really easy to start a transaction, forget to commit or rollback, and leave for lunch, blocking access to a table for a considerable amount of time. And if you’re using Azure Data Studio, it appears that implicit transactions might not even work, so you’d be in a world of hurt if you were relying upon them.  So let’s throw this one away as a recommendation.

My recommendation, whenever you have data modification on non-temporary tables, is to use explicit transactions over autocommit.  I have a few reasons for this.

First, consistency.  Sometimes you will need explicit transactions.  For example, if I need to ensure that I delete from table A only if an insert into table B and an update of table C are successful, I want to link those together with an explicit transaction.  That way, either all three operations succeed or none of them succeed.  Given that I need explicit transactions some of the time, I’d rather be in the habit of using them; so to build that habit, I’d prefer to use them for all data modification queries.

Second, explicit transactions give you clarity around what is actually necessary in a transaction.  Suppose you query a table and load the results into a temporary table.  From there, you make some modifications, join to other tables, and reshape the data a bit.  So far, nothing I’ve mentioned requires an explicit transaction because you’re only working with temp tables here.  When you take the final results and update a real table, now we want to open a transaction.  By using an explicit transaction, I make it clear exactly what I intend to have in the transaction:  the update of a real table, but not the temp table shenanigans.

Third, as an implication of the second point, explicit transactions can help you reduce the amount of time you’re locking tables.  You can do all of your heavy lifting in temp tables before opening the transaction, and that means you don’t have to do that while locking the real table. In the best case, autocommit will behave the same, but saying “Here is where I want my transaction to be” also lets you think about whether you really want to do everything at one statement or break it up into smaller chunks.

Finally, if you use a loop, whether that be a cursor or WHILE statement, you can control whether you want one transaction per loop iteration or one transaction in total, and that’s entirely to do with whether you begin and commit the transaction outside of the loop or inside.  Having one transaction in total can be considerably faster in some circumstances, but if you have an expensive action in the loop, you can commit after each loop iteration.  This will minimize the amount of time you block any single operation waiting to access this table.  It will increase the total runtime of your query, but minimize the pain to other users, and that’s a trade-off  you can only make if you use explicit transactions.

Rules of Thumb

First, if you have a stored procedure which is simply running a SELECT statement, use autocommit.  There’s no real advantage to putting this into an explicit transaction and there is the downside that you might forget to commit.

Second, if you have a stored procedure which performs data modification on non-temporary tables, use an explicit transaction only over the area which modifies data.  Don’t begin the transaction until you’re ready to start modifying tables; this will minimize the amount of time you need to keep the transaction open and resources locked.

As a corollary of the second point, note that you can use explicit transactions to control parent-child relationships with stored procedures, where the parent begins a transaction, calls each child, and rolls back or commits at the end depending upon the results. That’s something you can’t do with autocommit, as each data modification statement would run in its own auto-committed transaction.

Third, if you are working with non-global temporary tables beforehand, don’t include any modification of those inside the explicit transaction.  If you are working with global temporary tables, I suppose you should treat them like non-temporary tables here if you expect other sessions to use them and care about blocking, though there’s a pretty small number of cases where it makes sense to have global temporary tables with multiple users, so I’d call that an edge case.

Fourth, in a loop, choose whether you want to put the explicit transaction around the loop or inside it.  In most cases, I prefer to put the transaction inside the loop to minimize the amount of time that I’m blocking others. This is probably the smarter move to make in busy transactional environments, where you want to prevent blocking as much as possible.  Also, If one loop iteration fails, you’ll have less you need to roll back, so you can fix the issue and pick back up where you left off. Note that at that point, you trade atomicity on the entire set of data for atomicity on a block of data, so if that’s a big enough concern, bite the bullet and put an explicit transaction around the loop. Or see if you can make it faster without a loop.

Fifth, outside of a stored procedure—that is, when I’m just writing ad hoc statements in a client tool—use explicit transactions if you’re doing something potentially risky. I know this brings up the question of “Why are you doing risky things in a client tool to begin with?” But that’s a story for a different day.

Sixth, watch out for nested transactions.  In SQL Server, there’s very little utility in them and their behavior is weird. Paul Randal explains in great detail just how broken they are, and I’d rather the product never have had them.  Anyhow, check to see if you’re in a transaction before opening one. The pattern I like to use comes from my Working Effectively with Legacy SQL talk (which, ironically enough, needs some changes to be brought up to date) and originally from smart people in the office who put it together before I got there. Here’s a simplified version of it for a sample stored procedure:

CREATE OR ALTER PROCEDURE dbo.GetFraction
@Divisor INT = 5
AS
 
DECLARE
    @AlreadyInTransaction BIT;

BEGIN TRY 
    IF ( @@TRANCOUNT > 0 )
    BEGIN
        SET @AlreadyInTransaction = 1;
    END
    ELSE
    BEGIN
        SET @AlreadyInTransaction = 0;
        BEGIN TRANSACTION;
    END;
 
    -- Note:  this is where you'd put your data modification statements.
    SELECT
        1.0 / @Divisor AS Quotient;
 
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        COMMIT TRANSACTION;
    END;
END TRY
BEGIN CATCH
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        ROLLBACK TRANSACTION;
    END;

    THROW;
END CATCH
GO

--Test the procedure
EXEC dbo.GetFraction @Divisor = 5;

--Start an explicit transaction
BEGIN TRANSACTION
EXEC dbo.GetFraction @Divisor = 5;
SELECT @@TRANCOUNT;
ROLLBACK TRANSACTION

Finally, make sure you roll back the transaction on failure.  If you write code using try-catch blocks, commit at the end of the TRY block or rollback at the beginning of the CATCH. Explicit transactions offer you more power, but come with the responsibility of handling transactions appropriately.

Thoughts?

What are your thoughts on explicit transactions versus autocommit? Do you prefer the ease of autocommit or the power of explicit transactions? Or where do you draw the line between the two? Leave your thoughts in the comments section below—either here or on the video—and let me know.