Apollo 13: Engineering Lessons Under Crisis

This video is a bit different from my normal fare, but by about halfway through the film I knew I wanted to do this.

Links of Note

Script

Hey, everybody. This video is going to be a little different from my normal fare. I recently got the opportunity to re-watch Apollo 13, a movie I enjoyed growing up but have not seen in more than two decades. Even as a kid, I loved the engineering sequences, in which Mission Control needed to solve unexpected and extremely challenging problems to bring Jim Lovell, Fred Haise, and Jack Swigert back home. In this video, I’m going to delve into those challenges and the important takeaways for engineers in any discipline, like, say, software development. By the way, when I use the term “engineer” in this video, I’m purposefully including all kinds of roles, from engineers proper to technicians and operators. The exact job title here isn’t important; it’s the skill set and mentality that matter most. Anyhow, sit back and enjoy:

APOLLO 13: ENGINEERING LESSONS UNDER CRISIS

Before I get started in earnest, I want to make note that this video is based solely on the movie Apollo 13, a dramatization of Jim Lovell’s book, Lost Moon: The Perilous Voyage of Apollo 13. This is not a recounting of actual events as they happened–I understand there were areas where the film wasn’t completely accurate to reality and they made up a couple of bits to ratchet up the suspense, but my plan is to appreciate the film as a film rather than as a reflection of actual events. That said, it’s still a lot closer to reality than most such movies.

As another aside, something I really appreciated was just how good the acting was in this movie. It struck me throughout the film that they were doing a great job of the adage “Show, don’t tell.” For example, we learn a bit about the personality of Jim Lovell–played by Tom Hanks–as a former test pilot in the opening sequence, where we see him zooming along the highway in Houston, starting with a forward-facing shot of a car zipping through traffic and switching to a view of Lovell in his shiny, red Corvette. (By the way, I said stick to the film, but in real life, Lovell’s Corvette was blue. How dare you, Hollywood.) Anyway, he’s a straight-laced guy who lives fast, and we get that information simply from this first scene, comparing Lovell’s bearing and appearance with his need for speed. This fits the early astronaut profile to a tee.

Another instance of this is later on in the film, when you see a television interview–taped sometime before the mission–of Lovell recounting a story from his time as a combat pilot, when he is, at night, with no homing signal or radar, trying to find an aircraft carrier running dark. Then the electronics short out on his plane, leaving him completely in the dark, but instead of panicking he keeps his head straight, maintains control, and ends up finding the carrier safely due to a bit of luck and a trail of phosphorescent algae. This is playing on the television for us to get a measure of the man, and I appreciate that they do it this way rather than having Lovell tell his fellow astronauts “Hey, I survived finding an aircraft carrier with no running or interior lights in the middle of a dark ocean, so we can do this!” He comes across as calm and collected because we see his training in action; he doesn’t need to tell us any of it.

Speaking of the other astronauts, in early sequences in the film, we see Lovell, Ken Mattingly–played by Gary Sinise–and Fred Haise–played by Bill Paxton–become very close in training, to the point where they can interpret each other’s moves and have implicit trust. When Jack Swigert, played by Kevin Bacon, has to swap in for Mattingly at the 11th hour, there is a bit of conflict. We can see an example of that conflict in the way the trio address each other–Lovell and Mattingly call Haise Freddo, but Swigert, being the outsider, refers to him as Fred. But by the climax of the film, we see that conflict resolved in a natural way, and we see the camaraderie between the two men as Swigert too refers to Haise as Freddo. It’s these little character touches–exactly the right facial expression, the right tone on a word–that tell us so much more than what the actors actually say. I wanted to get that in early because the acting and characterization really hold up and they’re worth discussing even in a video about Mission Control.

So now let’s get to the main event.

My first point is:

CONFLICT HAPPENS

Conflict happens, but interpersonal conflict shouldn’t drive animosities. In a healthy team, people should have the attitude that we’re on the same side here, trying to solve the same problems. But we do see conflict in ideas, in part because people weight their preferences differently and have different sets of information available to them. In the film, one of the first Mission Control crisis scenes has a pair of engineers arguing over whether to turn the Apollo 13 command module around or try to slingshot around the moon. Both men make good points: the engineer wanting to turn the ship around notes that it would be the fastest way to get the crew back to Earth, and that’s critical because they’ve already lost a considerable amount of oxygen. The opposing side points out that the service module engine is damaged and might explode if they try to use it, so it’s effectively dead. They’re hurtling toward the moon, so to turn around, they would need to cancel all of that velocity and then build up velocity in the opposite direction, and the vessel doesn’t have enough known-good engine power to do this. But if they continue on their path, they can redirect their existing momentum and use the moon’s gravitational pull to slingshot them back toward Earth. It will take longer to get back, but is much more likely to succeed.

Although tensions run high during the scene, the engineers all understand that this is an important part of decision-making under crisis: you need to understand the options available and be willing to speak up if you see a mistake in progress. Ed Harris’s Gene Kranz also does a good job here by allowing the squabbling but shutting it down when it stops being productive. He makes sure that his engineers are able to voice their concerns and lay out all of the necessary information–even if they don’t do a perfect job of remaining calm and thoughtful. As a group, they are able to think through the problems, and as the flight director, Kranz’s job is to make the call.

This works because:

THERE IS NO SOLE SUPERGENIUS

Something I appreciated in the film is that we didn’t have a singular, brilliant mind answering all of the problems. Because that’s not how it works in good organizations. You have different groups of people with specializations in different areas and the capability to solve different types of problems. Mission Control had different groups of engineers solving problems like CO2 scrubbing, turning the flight computer back on, and coming up with a flight path that would let the command module land in the ocean without burning up in the atmosphere or skipping off it like a rock on a pond. This wasn’t Gene Kranz sitting in his lair, meticulously plotting and having scribes and underlings carry things out; this was a big effort from a lot of people quite capable of solving problems.

What’s interesting in the film is that Ken Mattingly is kind of set up in the “supergenius” role due to his tremendous work ethic and deep knowledge of the command module. But even in the sequences where Mattingly works to get the flight computer power-on sequence correct, he’s a member of a team. He’s not the one who realized that there would be power problems in the first place, and although he’s the one who we see come up with the final process, he’s not doing it alone. Furthermore, they don’t have Ken fix the flight computer, figure out burn time, and scrub CO2–he’s a smart man performing a heroic effort in doing his part to bring three men home alive. That he’s not a sole supergenius takes nothing away from that, but it does give us a critical insight that engineering under crisis is a team sport.

And while I’m talking about Mattingly, I do want to point out that although he’s an astronaut, he’s tied in with the engineers too. At that point in its history, NASA selected primarily for a few critical traits: intelligence, composure, and creativity under stress. This is part of why they loved choosing test pilots, as that’s a great demonstration of all three. The training program was rigorous and included intimate knowledge of the critical systems, as once you’re out in space, you’re on your own. Mission Control may be available most of the time–assuming no radio outages–but you have to know how to operate the different systems, how to troubleshoot problems, and how to make things work in stressful conditions. The film shows this during the sequences where Mattingly and Swigert dock with and retrieve the Lunar Module, or LM (pronounced LEM). Mission Control simulates different failures and forces the command module pilot to think on his feet, diagnosing the problem, assessing the situation, and performing the right response. In order to be that highly proficient of an operator, the astronauts need to have almost the same understanding of the systems that engineers have. You may not have asked Ken Mattingly to design and fabricate a command module, but he has to know the hows and whys behind its design because it’s literally his life on the line.

Let me give you a more prosaic example of how all of this fits into the real world. IT teams are made up of people with a range of specialties: front-end developers, back-end developers, database developers, database administrators, network administrators, systems administrators, architects, and a whole lot more. If you have a Sev-1 crisis and your company’s applications are down, it’s natural to expect these groups of people to work within their specialties to solve problems. Ideally, you don’t have a single The Expert whose galaxy brain is responsible for solving all of the problems. This is because The Expert can be wrong and, no matter how long The Expert has been around in an organization, there are always going to be gaps in that one person’s knowledge. In a well-functioning engineering team under crisis, each person understands their piece of the puzzle and knows how to get information. App developers are looking through their logs while database administrators try to figure out if the issue is in their domain, network administrators review packet and firewall logs, and systems administrators check server logs. They can take and synthesize the information they’re processing to help shed light on the problem and develop viable solutions. When there are people with demonstrated expertise–as opposed to being The Expert–they can help narrow down places to look, compare the current issue to prior issues, and make connections which specialists alone might not have made. But even so, that person is still one member of a group.

This bleeds over into my next point:

FOCUS ON THE PROBLEM AT HAND

In one interesting scene, Swigert mentions to Lovell that they’re coming in too shallow and at this trajectory, will skip right out of the atmosphere. Lovell responds by saying “There are a thousand things that have to happen in order. We are on number 8; you’re talking about number 692.” In a tough situation, it’s easy to fixate on a problem that isn’t up yet to the detriment of solving the one in front of you. In a few lines of dialogue, we get the crux of an important point: document that there is an issue, follow up on the issue when it’s the appropriate time, and continue working on the most important thing. Under normal circumstances, it’s easy to prioritize issues based on who’s complaining loudest or most recently. To give you a data platform example, I might notice some slow reports in a log and decide to spend the day speeding those up. It may not have been the single most important thing I could have done at that point in time, but it was a reasonable thing to do as it was causing people pain right then. But in a crisis situation, we have to avoid that temptation and focus on the single most important thing. Yes, those reports are slow. And they might cause us to lose the contract if we don’t fix them. But right now, the database is refusing connections and we need to fix that before we can think about reports.

Tying back to the previous idea of expertise without The Expert, this also allows engineering teams to focus on solving a problem. We have navigation specialists focusing on burn calculations, a subset of engineers trying to figure out how to put a square peg into a round hole, and a group of engineers working with Mattingly to power on a flight computer while using less than 12 amps and “You can’t run a vacuum cleaner on 12 amps, John.” Kranz and Glynn Lunney, as the two shifts’ flight directors we get to see in the film, need to worry about all of these problems, ensure that teams are working on the most critical things, and coordinate between teams. But they also need to be able to delegate tasks to team leads–they have to trust that those team leads will do the right thing because they don’t have the time or energy to coordinate every effort and solve every problem themselves.

Taking this back to the real world, part of my company’s production emergency process is to designate a coordinator. Typically, that person is someone separate from the engineers working on the problem, as the coordinator needs to communicate and process information effectively and it’s hard to do that while you’re typing furiously and scrambling to remember where that stupid log file gets written to and where you saw error code 3400. That coordinator role is only one portion of what Kranz and Lunney needed to do, but it’s a vital part, especially when a lot of people are working in small groups on different parts of the problem.

The film did an excellent job of portraying my next theme:

CREATIVITY AND PRAGMATISM

Apollo 13 really captured the spirit of great engineers under pressure: it is a combination of creativity in solving problems mixed with a down-to-earth thought process which forces you to grapple with the problem as it is rather than as you’d like it to be.

I think the line which most exemplified the pragmatism side was when Ken Mattingly prepared to enter the command module simulator, demanding that conditions be set the same as what the crew on board were experiencing. “I need a flashlight,” he says. When a tech offers up the flashlight in his hands, Mattingly responds, “That’s not what they have up there. Don’t give me anything they don’t have onboard.” Mattingly limits himself to the conditions and circumstances of the actual command module–as it currently is, not as anyone would wish it to be–to ensure that whatever he works out on the ground can be replicated exactly by the crew of Apollo 13.

Meanwhile, the first major scene which really brings the combination of skills together is the one where Gene Kranz gathers all of his engineers and makes a show of throwing away the existing flight plan. All of their effort and planning is out the window because circumstances have changed to the point where that plan is untenable, and now it’s time to improvise a new mission. We are in a crisis, so our expectations of normal behavior have gone in the trash along with that flight plan.

But we see another example of the brilliance that can come from creative, pragmatic people under pressure. In one of the most iconic sequences in the film, NASA engineers need to fit a square peg into a round hole. The carbon dioxide scrubbers in the lunar module are cylindrical and designed for a cabin with two people living in it for a fairly short amount of time. The command module CO2 scrubbers are larger, boxier, and completely useless as-is. Because nobody planned on using the lunar module as a lifeboat, there weren’t any spare CO2 scrubbers for the lunar module and without one, the crew would die of carbon dioxide poisoning long before they reached Earth.

Cue the engineers, dumping out everything available to them on a table and one technician saying, “Okay, people, listen up. The people upstairs handed us this one and we gotta come through.” No grousing, no complaints about how this isn’t normal. Instead, it’s a challenge. The other technicians start with the most logical step: let’s get it organized. Oh, and they get to brewing coffee too. I guess that’s the second logical step.

What they come up with is a bit of a mess, but it solves the problem. It is a hack in the most positive sense of the term: a clever and unexpected solution to a problem. You don’t want to make a living building these, but it’s good enough to get three men home.

Taking this a step further, use of the LM itself was a great example of creativity and pragmatism. The lunar module was designed to land on the moon, collect some rocks, and get them back to the command module. Using it as a lifeboat was, as the Grumman rep points out, outside of expected parameters. Even in this crisis, the rep is thinking as a bureaucrat: his team received specifications and requirements, his team met those specifications and requirements, and anything outside of those specifications and requirements he can’t guarantee. They did exactly what they needed to do in order to fulfill the contract and that, to them, was all they could do. This bureaucratic mindset is not a bad thing in normal circumstances–it reduces risk of legal liability by setting clear boundaries on expected behavior and promotes success by ensuring that people are willing to agree on the objectives to reach. But in the crisis, the Grumman rep gets shunted off to the side and the engineers take over. Because that’s what engineers do to bureaucrats when the going gets tough.

Another example of creativity marrying pragmatism involves the LM burn sequence. Lovell and Haise needed to perform a precise, controlled burn, but because the flight computer was turned off, they needed a way to ensure they stayed on track and did not stray too far. Their answer was to find a single, fixed point: Earth. As Lovell states, “If we can keep the earth in the window, fly manually, the co-ax crosshairs right on its terminator, all I have to know is how long do we need to burn the engine.” That sequence displays knowledge, creativity, and a way to make the best out of a given situation.

My third LM example takes us back to Ken Mattingly and the LM power supply. Mattingly is struggling to power on the flight computer with fewer than 12 amps and failing. He hits on a wild idea: the LM batteries still have power remaining, and there is an umbilical which provides power from the command module to the LM as a backup system. If they reverse the flow of power, they’ll draw power from the LM batteries into the command module, giving them enough to get over the hump. Their unexpected lifeboat gives them one final miracle.

Creativity and pragmatism are invaluable assets for engineers in a crisis situation, letting them take stock of how things are and make best use of the circumstances to solve challenging problems. But they also need:

PERSEVERANCE UNDER PRESSURE

Being creative in a simulated situation or in day-to-day work is different from being able to get the same results under pressure. Perseverance is one of those key traits which can make the difference between success and failure in a crisis.

We get a glimpse of engineers persevering–and another example of show, don’t tell–when we see engineers sleeping in cots in a back room. “Is it A.M. or P.M.?” is a question you don’t generally want to hear from your employees, but sometimes a crisis requires this. Now, if your engineers have to do this regularly, you have a serious organizational problem. But when you are in the middle of a crisis, it’s good to know that you have people willing to push their limits like this. In other words, you want your employees to be willing to do this, but your employees shouldn’t be doing this except in true emergencies. Like three guys you’re trying to bring home from outer space.

Pressure doesn’t have to be huge and external, either. Think of the everyday, little things. Like, say, a projector bulb bursting. By the way, I had one of those happen to me during a training once. They do make a pretty loud bang when they go. Anyhow, what does Gene Kranz do after the first tool he uses fails on him? Move to a backup: the chalkboard. Have backups to your backups and be willing to move on. If they spent twenty minutes waiting for a staff member to find a spare overhead bulb, that’s twenty minutes wasted during a crisis, at a point when they need every minute they can get. Fail fast and be ready to abandon tools which are impeding a solution. By the way, I liked Kranz’s response to this: he’s frustrated. Frustration is okay–it’s part of the human condition, especially under stress. But don’t let it prevent you from doing what you need to do, and don’t get tunnel vision on this frustration, as you’ll only compound it without helping the situation any.

Underlying most of my video so far is an implicit cheerful, go-get-em, “We can do this!” attitude. But you don’t need that attitude all the time as an engineer in a state of crisis. Doubt, like frustration, is part of the human condition and to experience doubt in a situation of stress is normal. But again, just like frustration, focusing on the doubt gets nothing accomplished. Even if we doubt we can come up with something, we keep searching for a solution. A great example of this is where the technician doubts Mattingly will be able to turn on the flight computer without breaking 12 amps, but even so, he keeps at it because he refuses to let Mattingly or the crew on Apollo 13 down.

While we’re talking about topics like frustration and doubt, let’s make something clear:

WE STILL MAKE MISTAKES

Just because you’re an engineer–maybe even a smart engineer, maybe even one of the best in the world at what you do–it doesn’t mean you live mistake-free. And it’s surprisingly easy to make big mistakes simply by assuming that what you know is correct and failing to take into consideration changing circumstances.

Let me give you an example of that. During one attempted maneuver, Lovell reports “We’re all out of whack. I’m trying to pitch down, but we’re yawing to the left. Why can’t I null this out?” Fred responds that “She wasn’t designed to fly like this, our center of gravity with the command module.” Lovell trained in the simulators and knew exactly how the LM was supposed to handle, but none of the simulations assumed that the LM would maneuver while still attached to the command module. As a result, all of those in-built expectations of behavior go right out the window and the crew are forced to compensate and learn anew.

Similarly, when Mission Control provided the crew with burn times to correct a trajectory problem, they made a crucial mistake: one of the flight path specialists tells Krantz, “We’re still shallowing up a bit in the re-entry corridor. It’s almost like they’re under weight.” They quickly realize the problem: Apollo 13 was intended to land on the moon, collect approximately 200 pounds of moon rocks, and bring them back to Earth. Therefore, they calculated the return trajectory based on that weight, but when the mission profile changed drastically, that small discrepancy made a difference and forced the crew to transfer items from the LM to the command module to keep things properly balanced.

A third example involves Haise, who is confused about why the CO2 numbers are rising so quickly, as “I went over those numbers three times.” But what he went over were the expectations for a two-man crew. Jack Swigert’s breathing threw off Fred Haise’s numbers, leading to the crew having less time than expected to solve this problem.

We generally do not need to be perfect in a crisis, and that’s a good thing–even at the best of times, I’m pretty sure nobody is perfect. What is important in these cases is that the engineers and crew understand that there is a mistake and correct things before they cause too much damage. And one way to limit the damage of mistakes is to have cross-checkers. I really liked the scene in which Lovell fills in calculations for gimbal conversions and needs a double-check of the arithmetic. Instead of trusting himself to get the numbers right, he radios Mission Control. And instead of one person doing the math, you see half a dozen engineers independently working on the problem. Having people who can check that what you are doing is correct is huge, and even more so if they’re able to perform independent evaluations.

As we near the close of this video, I want to talk about one last thing:

MANAGING ENGINEERS UNDER CRISIS

I really liked the way we saw Gene Kranz and Glynn Lunney manage engineers during the film, especially Kranz. He does an admirable job in performing the role of a manager. Early on, when the different teams are in a back room trying to figure out what to do, Kranz opens the floor and lets engineers confer and share ideas. This is an important part of management of engineers: they are idea people, but to get ideas out of them, they need to know you’re willing to listen and not shut them down immediately. But it’s also the job of a manager to make a decision. Engineers can go around and around on an issue until the cows come home or get knee-deep into the minutiae of a situation, and sometimes, it’s the manager’s job to refocus people. Let your specialists figure out the options, but it’s your job to understand when it’s time to move.

Also, don’t solve a problem you don’t need to. We can spend a lot of time thinking about hypothetical situations and how we might proceed, but that’s a waste of brainpower and time, two critical resources during a crisis. Focus on the things you need to solve. Your team might need to solve more than one problem at a time, but make sure they’re actual problems and that they are the most important problems to solve right then.

And make sure your engineers understand exactly what it is they have to do. If you expect them to sleep in cots at the office for the duration of a crisis, you don’t want them twiddling their thumbs or having to look for work. If you’re a manager of managers, make sure that your leads have a grasp on the problem, understand the current state of affairs, and know what to work on. I’ve been in engineering crises–thankfully none which were life or death situations–and the most frustrating part is sitting there with nothing to do. You can’t go home (and it’s usually pretty late, so you want to go home), but there’s nothing you can actually do at the moment. That’s a failing of management we want to avoid.

To that end, maximize communication. Bias toward over-sharing of problems and solutions rather than under-sharing. If a team sees a problem, it may turn out that it doesn’t actually need a solution and someone can tell them that. Or perhaps a member of another team sees an aspect of the problem which is worse than first anticipated, causing you to re-evaluate the problem altogether. Or maybe it turns out that a third team actually has the solution, but didn’t realize it was needed because they weren’t experiencing the problem.

Let me give you a real-life example of this, though I’m changing a few of the details to maintain appropriate confidences. Customers are calling because their data is not showing up in our system and it’s a critical time of the year where their data absolutely needs to be correct and up-to-date. Application developers are looking through their logs, trying to figure out what’s going on (and sometimes, trying to remember exactly how things work). I, as the database specialist, am looking through log entries in tables, database error logs, and whatever monitoring I can get my hands on to figure out if the database is rejecting data, if there are application errors when trying to write data, or what the deal is. We’re working together on this, but without sysadmins or network engineers to help us out. After some struggle, we engage the sysadmin, who tells us that things look fine on the servers but that he noticed a major drop in network utilization over the past few hours, and the network engineer tells us that, oh yeah, there was an issue with a network link at a separate site and our co-lo’s technicians were working on the problem, but it only affected a few servers…specifically, our servers. This piece of information ended up being crucial, but there was a communication gap between the development teams (application and database) and the administration teams (systems and network). Had the problem been made clear to those administration teams at the beginning, we might have saved hours trying to diagnose something that, it turned out, we didn’t even cause.

That sharing also needs to be bi-directional. As a manager, getting regular reports from engineers is nice, but you have to be able to share syntheses of reports with everyone involved. One team might know the problem but lack the solution, and another team might have the solution but not know that there’s a problem; ensuring that people know what they need to know without a flood of unnecessary information is tricky, but that’s what a great manager has to do.

One thing great managers don’t do is management by hovering. Engineers know this all too well: we’re in a crisis, so you have levels of management standing over your desk as you’re trying to solve the problem. You can tell how big of a crisis it is by how many levels of management are all standing at your desk. They’re standing there, observing, often without providing direct insight or meaningful guidance. Oh, they may provide you guidance, but it’s rare that hoverers give you anything helpful. Hovering is a natural human instinct when you feel like you don’t have anything you can do but need to get a problem solved. To the extent that it signals the importance of the issue, it’s not the worst thing ever, but it does stress out engineers working on the problem. And hovering managers aren’t coordinating, which means all of those people who need information to help solve the problem aren’t getting it because the manager is standing over one person and watching that person type and click stuff.

Instead of hovering, be available. Frankly, even if you have nothing to do at that point in time, don’t hover. Gene Kranz? He doesn’t hover. He’s at his desk, he’s leading meetings, he’s talking to people. He’s available, but he doesn’t need to stand over the engineers working on the CO2 scrubber solution or the flight computer power solution or any other solution–he knows he has capable people who understand the problem and are motivated to solve it, so he’s in his role as collector and disseminator of information and as the person responsible for ensuring that people are doing what they need to do.

Finally, if you are a manager, communicating in a positive but realistic manner is the way to go. Both traits are necessary here: if you are overly negative, people will wonder why they’re even there. If you’re sending the signal that you expect failure, that demotivates the people working on problems–after all, if we’re going to fail anyhow, why should I even try? But being overly cheerful or optimistic sounds Pollyannaish and runs the risk of losing the trust of engineers. Watch Gene Kranz throughout the film and he does a great job of straddling that line, including at two critical points in the film:

“Failure is not an option!”

And:

“With all due respect, sir, I believe this is going to be our finest hour.”

These are not the types of phrases you deploy on a day-to-day basis, but if you’re in a crisis, you want to know that the people in charge understand the situation and will move heaven and Earth to make it right. And from the manager’s side, that means making sure you give your engineers the best possible chance of success.

I hope you enjoyed this video. As I mentioned at the beginning, it’s a bit outside the norm for me, but it was a lot of fun to put together. And if you haven’t seen Apollo 13 in a while (or at all!) but sat through this video, you’re definitely going to enjoy the movie. So until next time, take care.

Q&A: The Curated Data Platform

On Thursday, I presented a session at PASS Summit entitled The Curated Data Platform. You can grab slides and links to additional information on my website. Thank you to everyone who attended the session.

During and after the session, I had a few questions come in from the audience, and I wanted to cover them here.

Cross-Platform Data Strategies

The first question was, “What handles the translation between the relational truth system and the Document system?” The context of the question comes from a discussion about product catalogs, and specifically this slide.

Document databases are great for things like product catalogs, where we meet the following properties:

  • Data has a very high read-to-write ratio.
  • You generally look at one item per page—in this example, one product.
  • The set of data to appear on a page is complex and typically has nested items: a product has attributes (title, price, brand, description) but also collections of sub-items (product image links, current stock in different stores, top reviews, etc.).
  • The data is not mission-critical: if updates are delayed or even occasionally lost, that is acceptable.

But I do like keeping a “golden record” version of the data and my biases push me toward storing that golden record in a relational database. I mentioned two processes in my talk: a regular push on updates and an occasional full sync to true up the document database(s).

And that leads to the question of, how do we do that? There are products from companies like CData and Oracle which can handle this, or you can write your own. If your source is SQL Server, I’d push for a two-phase process:

  1. Enable Change Data Capture on the SQL Server instance and have a scheduled task query the latest changes and write them to your document database(s). You can use constructs like FOR JSON PATH in SQL Server to shape the documents directly, or pull in the source data and shape it in your application code (see the sketch after this list).
  2. Periodically (e.g., once an hour, once a day), grab all of the data, shape the documents, and perform a comparison with what’s out there. This will confirm that nothing slips through the cracks for longer than one time period and will keep disparate clusters of document data from drifting apart.
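
Here is a minimal sketch of step 1, assuming a hypothetical dbo.Product source table (with ProductID, Title, Price, and Brand columns) and the default capture instance name dbo_Product. In a real job, you would persist the last-processed LSN between runs and fold the nested collections (images, stock, reviews) into the document shape as well.

-- One-time setup: enable CDC on the database and on the source table.
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'Product',
    @role_name = NULL,
    @supports_net_changes = 1;
GO

-- Scheduled task: pull net changes since the last run and shape each one as a
-- JSON document, ready to push to the document database(s).
DECLARE
    @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Product'),
    @to_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT
    c.ProductID,
    (
        SELECT
            p.ProductID,
            p.Title,
            p.Price,
            p.Brand
        FROM dbo.Product p
        WHERE p.ProductID = c.ProductID
        FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
    ) AS ProductDocument
FROM cdc.fn_cdc_get_net_changes_dbo_Product(@from_lsn, @to_lsn, N'all') c;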

Of course, this golden record doesn’t need to be in a relational database—you could store it in a document database and use replication there to push data to different clusters. If you use Cosmos DB, for example, you can replicate to other regions easily.

Document Databases: Scale-Out vs Replication

Another attendee asked about “Document databases and scale-out vs replication.” In retrospect, I think I misinterpreted the question as asked, as I mentioned that scale-out and replication are one and the same: you replicate data between nodes in a cluster to achieve scale-out.

But this time around, I’m going to answer the question, “How do I choose between making my current cluster bigger and replicating out to a new cluster?”

Here are some key considerations:

  • If the issue you are trying to solve is geographical in nature, replicate out to a new cluster closer to your users. In other words, suppose you have a cluster hosted in Virginia. Many of your users are in Japan, so they have to deal with the network latency of pulling data from a Virginia-based data center. If this is the problem, create another document database cluster in Japan and replicate to it from Virginia.
  • If your cluster is in Virginia and is getting hammered hard by relatively local users, scaling out is generally a good option. That is, adding more servers to the existing cluster. Depending on your technology, there will be a maximum number of nodes or a maximum scale-out size available to you, so you’d have to check out those details.
  • If you’re getting close to that maximum scale-out size, it may make sense to replicate to another cluster in the same region and use a load balancer to shift load between the two. I have to be vague here because different technologies have different limits and I’m definitely not an expert on any document database technology’s systems architecture.

Cosmos DB and MongoDB

Another attendee asked, “I have heard that Azure Cosmos DB is built upon an older version of MongoDB – do you know if this is true?”

The answer is no, it’s not. The two platforms are different. I believe where the question comes from is around the MongoDB API for Cosmos DB. For a while, Cosmos DB supported an older version of the MongoDB API, specifically 3.2. That API was released in December of 2015. New Cosmos DB clusters support the MongoDB 3.6 API, which is still active.

But something I want to point out is that the API is an interface, not an implementation. That Cosmos DB supports a specific MongoDB API version doesn’t mean that the code bases are similar; it only means that you can safely (well, presumably safely) interact with both and expect to get the same results when you perform the same set of API steps with the same inputs.

Graph Languages

My last question came from an attendee who mentioned, “I thought GraphQL was the common standard for graph querying.”

The context for this is in my discussion of graph databases, particularly the slide in which I talk about the key issues I have with graph databases. For a bit more background than what I had time to get into during the session, Mala Mahadevan and I have talked about graph databases in a little bit of detail on a couple of occasions, once on the SQL Data Partners Podcast and once on Shop Talk.

As for the question, the comment I had made was that there is no common graph language. We have SQL for relational databases (and other mature data platform technologies) but historically haven’t had a common language for graph platforms, meaning that you have to learn a new language each time you move to a different platform. The Gremlin language is an attempt at creating a single language for graph databases and it’s making enough strides that it may indeed become the standard. But it’s not there yet.

Meanwhile, GraphQL, despite the name, is not a language for graph databases. It’s actually a language for querying APIs. The key idea is that you ask for data from an API and you get back just the data you want. But behind the API, your data can be stored in any sort of data source—or even multiple data sources. In other words, I might expose a product catalog API which hits Cosmos DB, a finance API which hits SQL Server, and a product associations API which hits Neo4j. Those three APIs could all be queried using GraphQL, as it’s up to the API to interpret inputs and return the appropriate outputs.

Query Store: QDS Toolbox

I wanted to announce the first open source project officially released by ChannelAdvisor: the QDS Toolbox. This is an effort which Pablo Lozano and Efraim Sharon pushed hard internally and several database administrators and database engineers contributed to (though I wasn’t one of them).

From the summary page:

This is a collection of tools (comprised of a combination of views, procedures, functions…) developed using the Query Store functionality as a base to facilitate its usage and reports’ generation. These include but are not limited to:

– Implementations of SSMS’ GUI reports that can be invoked using T-SQL code, with added funcionalities (parameterization, saving results to tables) so they can be programmatically executed and used to send out mails.
– Quick analysis of a server’s overall activity to identify bottlenecks and points of high pressure on the SQL instance at any given time, both in real time or in the past.
– Cleanup of QDS’ cache with a smaller footprint than the internal one generates, with customization parameters to enable a customizable cleanup (such as removing information regarding dropped objects, cleaning details of ad-hoc or internal queries executed on the server as index maintenance operations).

The biggest of these is the third item. In our environment, Query Store could be a beast when trying to delete old data, and would often be the biggest performance problem on a given server.

In addition, several procedures exist as a way of aggregating data across databases. We have a sharded multi-tenant environment, where we might have 5-15 replicas of a database schema and assign customers to those databases. QDS Toolbox helps aggregate information across these databases so that you don’t need to look at each individually to understand performance problems. The database team has then created reports off of this to improve their understanding of what’s going on.

Check out the QDS Toolbox as a way to clean up data better than the built-in cleanup process and get additional information aggregated in a smart way.

Transaction Modes in SQL Server

In the following video, I take a look at the three most important transaction modes in SQL Server: autocommit, explicit transactions, and implicit transactions. Sorry, batch-scoped transactions, but nobody loves you.

If you’d prefer a textual rendering of the video, here are the pre-demo and post-demo sections, lightly edited for narrative flow.

Setting the Stage: Transactions and Modes

What I want to do in today’s post is to cover the different sorts of transaction modes and get into the debate about whether you should use explicit transactions or rely on auto-committed transactions for data modification in SQL Server. This came from an interesting discussion at work, where some of the more recent database engineers were curious about our company policy around transaction modes and understanding the whys behind it. I didn’t come up with the policy, but my thinking isn’t too far off from the people who did.

But before I get too far off course, let’s briefly lay out some of the basics around transactions.

When you modify data (that is, run a command like INSERT, UPDATE, MERGE, or TRUNCATE) or tables (CREATE, ALTER, DROP, etc.), that operation takes place inside a transaction.  A transaction is, according to Microsoft Docs, a single unit of work.  Everything in a transaction will either succeed as a whole or fail as a whole–you won’t end up with some operations succeeding and others not–it’s really all or nothing.  The importance of this goes back to relational databases having ACID properties, but because that’s a little far afield of where I want to go, I’ll give you a link if you’d like to learn more about the topic, as it helps explain why transactions are useful for relational database developers.

What I do want to get to is that there are three kinds of transactions:  autocommit transactions, explicit transactions, and implicit transactions.  There’s actually a fourth kind, batch-scoped transactions, but that only applies to Multiple Active Result Sets transactions and if you find yourself there, you’ve got bigger problems than deciding how you want to deal with transactions.

In the demo for the video, I show off each of the three transaction modes, including how you enable them, how you work with them, and any important considerations around them.
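
As a stand-in here, a stripped-down sketch of the three modes (not the video’s actual demo), using a throwaway dbo.TransactionTest table:

CREATE TABLE dbo.TransactionTest
(
    Id INT IDENTITY(1,1) NOT NULL,
    Val VARCHAR(30) NOT NULL
);
GO
-- Autocommit (the default): each statement is its own transaction and commits on success.
INSERT INTO dbo.TransactionTest (Val) VALUES ('Autocommit');
GO
-- Explicit: you define the boundaries with BEGIN and COMMIT (or ROLLBACK).
BEGIN TRANSACTION;
    INSERT INTO dbo.TransactionTest (Val) VALUES ('Explicit');
COMMIT TRANSACTION;
GO
-- Implicit: the first qualifying statement silently opens a transaction which
-- stays open until you COMMIT or ROLLBACK yourself.
SET IMPLICIT_TRANSACTIONS ON;
INSERT INTO dbo.TransactionTest (Val) VALUES ('Implicit');
SELECT @@TRANCOUNT AS OpenTransactions; -- Returns 1, even though we never wrote BEGIN TRANSACTION.
COMMIT TRANSACTION;
SET IMPLICIT_TRANSACTIONS OFF;
GO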

Recommendations

The easy recommendation is, don’t use implicit transactions.  For SQL Server developers and database administrators, this is unexpected behavior–the default is to use autocommit, so that if you run an INSERT statement by itself, the transaction automatically commits at the end.  If you set implicit transactions on, there is no UI indication that this is on and it becomes really easy to forget to commit a transaction.  I understand that if you come from an Oracle background, where implicit transactions are the norm, it might feel comfortable to enable this, but it becomes really easy to start a transaction, forget to commit or rollback, and leave for lunch, blocking access to a table for a considerable amount of time. And if you’re using Azure Data Studio, it appears that implicit transactions might not even work, so you’d be in a world of hurt if you were relying upon them.  So let’s throw this one away as a recommendation.

My recommendation, whenever you have data modification on non-temporary tables, is to use explicit transactions over autocommit.  I have a few reasons for this.

First, consistency.  Sometimes you will need explicit transactions.  For example, if I need to ensure that I delete from table A only if an insert into table B and an update of table C are successful, I want to link those together with an explicit transaction.  That way, either all three operations succeed or none of them succeed.  Given that I need explicit transactions some of the time, I’d rather be in the habit of using them; so to build that habit, I’d prefer to use them for all data modification queries.
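
A sketch of that scenario, with placeholder table names:

DECLARE @OrderID INT = 42;

BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.TableB (OrderID) VALUES (@OrderID);
    UPDATE dbo.TableC SET ProcessedDate = GETUTCDATE() WHERE OrderID = @OrderID;
    DELETE FROM dbo.TableA WHERE OrderID = @OrderID;

    -- Either all three statements stick...
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- ...or none of them do.
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH;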

Second, explicit transactions give you clarity around what is actually necessary in a transaction.  Suppose you query a table and load the results into a temporary table.  From there, you make some modifications, join to other tables, and reshape the data a bit.  So far, nothing I’ve mentioned requires an explicit transaction because you’re only working with temp tables here.  When you take the final results and update a real table, now we want to open a transaction.  By using an explicit transaction, I make it clear exactly what I intend to have in the transaction:  the update of a real table, but not the temp table shenanigans.
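
Here is a sketch of that shape, with hypothetical table and column names:

-- Heavy lifting against a temp table: no explicit transaction needed.
SELECT
    s.CustomerID,
    SUM(s.Amount) AS TotalAmount
INTO #CustomerTotals
FROM dbo.Sales s
GROUP BY s.CustomerID;

-- Only the write to the real table goes inside the transaction.
BEGIN TRANSACTION;
    UPDATE c
    SET c.LifetimeValue = ct.TotalAmount
    FROM dbo.Customer c
        INNER JOIN #CustomerTotals ct
            ON c.CustomerID = ct.CustomerID;
COMMIT TRANSACTION;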

Third, as an implication of the second point, explicit transactions can help you reduce the amount of time you’re locking tables.  You can do all of your heavy lifting in temp tables before opening the transaction, and that means you don’t have to do that while locking the real table. In the best case, autocommit will behave the same, but saying “Here is where I want my transaction to be” also lets you think about whether you really want to do everything at one statement or break it up into smaller chunks.

Finally, if you use a loop, whether that be a cursor or WHILE statement, you can control whether you want one transaction per loop iteration or one transaction in total, and that’s entirely to do with whether you begin and commit the transaction outside of the loop or inside.  Having one transaction in total can be considerably faster in some circumstances, but if you have an expensive action in the loop, you can commit after each loop iteration.  This will minimize the amount of time you block any single operation waiting to access this table.  It will increase the total runtime of your query, but minimize the pain to other users, and that’s a trade-off  you can only make if you use explicit transactions.
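
A sketch of that trade-off, using batched deletes against a hypothetical logging table:

DECLARE @RowsAffected INT = 1;

WHILE (@RowsAffected > 0)
BEGIN
    -- Transaction inside the loop: each batch commits on its own, keeping blocking short.
    -- Move BEGIN/COMMIT outside the loop instead if the whole purge needs to be atomic.
    BEGIN TRANSACTION;

    DELETE TOP (5000)
    FROM dbo.EventLog
    WHERE EventDate < DATEADD(DAY, -90, GETUTCDATE());

    SET @RowsAffected = @@ROWCOUNT;

    COMMIT TRANSACTION;
END;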

Rules of Thumb

First, if you have a stored procedure which is simply running a SELECT statement, use autocommit.  There’s no real advantage to putting this into an explicit transaction and there is the downside that you might forget to commit.

Second, if you have a stored procedure which performs data modification on non-temporary tables, use an explicit transaction only over the area which modifies data.  Don’t begin the transaction until you’re ready to start modifying tables; this will minimize the amount of time you need to keep the transaction open and resources locked.

As a corollary of the second point, note that you can use explicit transactions to control parent-child relationships with stored procedures, where the parent begins a transaction, calls each child, and rolls back or commits at the end depending upon the results. That’s something you can’t do with autocommit, as each data modification statement would run in its own auto-committed transaction.

Third, if you are working with non-global temporary tables beforehand, don’t include any modification of those inside the explicit transaction.  If you are working with global temporary tables, I suppose you should treat them like non-temporary tables here if you expect other sessions to use them and care about blocking, though there’s a pretty small number of cases where it makes sense to have global temporary tables with multiple users, so I’d call that an edge case.

Fourth, in a loop, choose whether you want to put the explicit transaction around the loop or inside it.  In most cases, I prefer to put the transaction inside the loop to minimize the amount of time that I’m blocking others. This is probably the smarter move to make in busy transactional environments, where you want to prevent blocking as much as possible.  Also, if one loop iteration fails, you’ll have less you need to roll back, so you can fix the issue and pick back up where you left off. Note that at that point, you trade atomicity on the entire set of data for atomicity on a block of data, so if that’s a big enough concern, bite the bullet and put an explicit transaction around the loop. Or see if you can make it faster without a loop.

Fifth, outside of a stored procedure—that is, when I’m just writing ad hoc statements in a client tool—use explicit transactions if you’re doing something potentially risky. I know this brings up the question of “Why are you doing risky things in a client tool to begin with?” But that’s a story for a different day.

Sixth, watch out for nested transactions.  In SQL Server, there’s very little utility in them and their behavior is weird. Paul Randal explains in great detail just how broken they are, and I’d rather the product never have had them.  Anyhow, check to see if you’re in a transaction before opening one. The pattern I like to use comes from my Working Effectively with Legacy SQL talk (which, ironically enough, needs some changes to be brought up to date) and originally from smart people in the office who put it together before I got there. Here’s a simplified version of it for a sample stored procedure:

CREATE OR ALTER PROCEDURE dbo.GetFraction
@Divisor INT = 5
AS
 
DECLARE
    @AlreadyInTransaction BIT;

BEGIN TRY 
    IF ( @@TRANCOUNT > 0 )
    BEGIN
        SET @AlreadyInTransaction = 1;
    END
    ELSE
    BEGIN
        SET @AlreadyInTransaction = 0;
        BEGIN TRANSACTION;
    END;
 
    -- Note:  this is where you'd put your data modification statements.
    SELECT
        1.0 / @Divisor AS Quotient;
 
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        COMMIT TRANSACTION;
    END;
END TRY
BEGIN CATCH
    IF ( @AlreadyInTransaction = 0 AND @@TRANCOUNT > 0 )
    BEGIN
        ROLLBACK TRANSACTION;
    END;

    THROW;
END CATCH
GO

--Test the procedure
EXEC dbo.GetFraction @Divisor = 5;

--Start an explicit transaction
BEGIN TRANSACTION
EXEC dbo.GetFraction @Divisor = 5;
SELECT @@TRANCOUNT;
ROLLBACK TRANSACTION

Finally, make sure you roll back the transaction on failure.  If you write code using try-catch blocks, commit at the end of the TRY block or rollback at the beginning of the CATCH. Explicit transactions offer you more power, but come with the responsibility of handling transactions appropriately.

Thoughts?

What are your thoughts on explicit transactions versus autocommit? Do you prefer the ease of autocommit or the power of explicit transactions? Or where do you draw the line between the two? Leave your thoughts in the comments section below—either here or on the video—and let me know.

With ML Services, Watch Those Resource Groups

I wanted to cover something which has bitten me in two separate ways regarding SQL Server Machine Learning Services and Resource Governor.

Resource Governor and Default Memory

If you install a brand new copy of SQL Server and enable SQL Server Machine Learning Services, you’ll want to look at sys.resource_governor_external_resource_pools:

That’s a mighty fine cap you’re wearing.

By default, SQL Server will grant 20% of available memory to any R or Python scripts running. The purpose of this limit is to prevent you from hurting server performance with expensive external scripts (like, say, training large neural networks on a SQL Server).

Here’s the kicker: this affects you even if you don’t have Resource Governor enabled. If you see out-of-memory exceptions in Python or error messages about memory allocation in R, I’d recommend bumping this max memory percent up above 20, and I have scripts to help you with the job. Of course, making this change assumes that your server isn’t stressed to the breaking point; if it is, you might simply want to offload that work somewhere else.
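
Here is a sketch of the kind of change those scripts make; the 50 below is just an example value, so pick a number which fits your server:

-- Check the current external resource pool settings.
SELECT
    erp.external_pool_id,
    erp.name,
    erp.max_cpu_percent,
    erp.max_memory_percent
FROM sys.resource_governor_external_resource_pools erp;

-- Raise the memory cap for external (R and Python) scripts and apply the change.
-- The same pattern with MAX_CPU_PERCENT fixes the CPU misconfiguration I describe below.
ALTER EXTERNAL RESOURCE POOL "default" WITH (MAX_MEMORY_PERCENT = 50);
ALTER RESOURCE GOVERNOR RECONFIGURE;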

Resource Governor and CPU

Notice that by default, the max CPU percent for external pools is 100, meaning that we get to push the server to its limits with respect to CPU.

Well, what happens if you accidentally change that? I found out the answer the hard way!

In my case, our servers were accidentally scaled down to 1% max CPU utilization. The end result was that even something as simple as print("Hello") in either R or Python would fail after 30 seconds. I thought it had to do with the Launchpad service causing problems, but after investigation, this was the culprit.

Identities blurred to protect the innocent.

The trickiest part about diagnosing this was that the error messages gave no indication of what the problem was—the client-side error was a vague “could not connect to Launchpad” message and the Launchpad error logs didn’t have anything about the failed queries. So that’s one more thing to keep in mind when troubleshooting Machine Learning Services failures.

NVARCHAR Everywhere: A Thought Experiment

Doubling Down on Madness

In the last episode of Shop Talk, I laid out an opinion which was…not well received. So I wanted to take some time and walk through my thinking a little more cogently than I was able to do during Shop Talk.

Here’s the short version. When you create a table and need a string column, you have a couple options available: VARCHAR and NVARCHAR. Let’s say that you’re a developer creating a table to store this string data. Do you choose VARCHAR or NVARCHAR? The classic answer is, “It depends.” And so I talk about why that is in video format right below these words.

A Video from the Void

The Camps

Camp One: only use VARCHAR. Prior to SQL Server 2019, this is basically the set of people who never have to deal with internationalization. If you’re running solo projects or building systems where you know the complete set of users, and if there’s no need for Unicode, I can understand this camp. For projects of any significance, though, you usually have to go elsewhere.

Camp Two: default to VARCHAR, only use NVARCHAR when necessary. There are a lot of people in this camp, especially in the western world. Most of the companies I’ve worked at live in this camp.

Camp Three: default to NVARCHAR, but use VARCHAR when you know you don’t need Unicode. This is a fairly popular group as well, and outside of this thought experiment, I probably end up here.

Aaron Bertrand lays out the costs and benefits of Camps Two and Three (or, Camps VARCHAR Mostly and NVARCHAR Mostly), so I recommend reading his arguments and understanding that I am sympathetic to them.

But there is also Camp Four: NVARCHAR everywhere. And this camp is growing on me.

Why NVARCHAR Everywhere?

I see several benefits to this:

  • Developers and product owners don’t need to think about or guess whether a particular string value will ever contain Unicode data. Sometimes we guess wrong, and migrating from VARCHAR to NVARCHAR can be a pain.
  • NVARCHAR Everywhere avoids implicit conversion between string columns because you can assume that everything is NVARCHAR. Implicit conversion can be a nasty performance impediment.
  • Furthermore, you can train developers to preface string literals with N, ensure that data access tools ask for Unicode strings (most ORMs either default to Unicode or know enough to do it right), and ensure that every stored procedure string parameter is NVARCHAR because there are no exceptions. That’s one less thing you ever have to think about when designing or tuning a database and one less area where ambiguity in design can creep in.
  • If somebody tries to store Unicode data in a VARCHAR column, that information is silently lost; going NVARCHAR everywhere takes that failure mode off the table (see the quick illustration below).
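
A quick illustration of that silent loss, assuming a database whose default collation is not UTF-8 enabled:

DECLARE
    @Lossy VARCHAR(20) = N'🐭 mouse',
    @Kept NVARCHAR(20) = N'🐭 mouse';

-- The VARCHAR value comes back as '?? mouse' (or similar); the NVARCHAR value survives intact.
SELECT
    @Lossy AS LossyVarchar,
    @Kept AS KeptNvarchar;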

Why Not NVARCHAR Everywhere?

The first thing you’ll hear from people about this is storage requirements: NVARCHAR characters are typically 2 bytes, whereas equivalent VARCHAR characters are typically 1 byte. For the nuanced version of this, Solomon Rutzky goes into great detail on the topic, but let’s stick with the simplistic version for now because I don’t think the added nuance changes the story any.

SQL Server has Unicode compression, meaning that, per row, if the data in a column can fit in your collation’s code page, the database engine can compress the Unicode data to take as much space as equivalent VARCHAR data would—maybe it’s a little bigger but we’re talking a tiny amount. Enabling row-level compression turns on Unicode compression as well and can provide additional compression benefits. And page-level compression does an even better job at saving space on disk. There are CPU costs, but my experience has been that compression will often be faster because I/O subsystems are so much slower than CPU, even with fancy all-flash arrays or direct-attached NVMe.
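
If you want a quick estimate of how much row (and therefore Unicode) compression would buy you on an existing table before rebuilding anything, sp_estimate_data_compression_savings can tell you. Here’s a sketch against dbo.NTestTable from the demo code below:

EXEC sys.sp_estimate_data_compression_savings
    @schema_name = N'dbo',
    @object_name = N'NTestTable',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = N'ROW';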

The exception is if you are using NVARCHAR(MAX) as your data type. In that case, Unicode and row-level compression won’t do anything and page-level compression only works if your data fits on a page rather than falling into LOB. Hugo Kornelis covers why that is. So that’s a weakness, which means I need a bulleted list here.

  • NVARCHAR(MAX) columns with overflow to LOB will be larger than their VARCHAR counterparts and we cannot use Unicode, Row, or Page compression to reduce storage.
  • If your maximum data length is between 4,001 and 8,000 characters, you know the column will never have Unicode characters, and the data is highly compressible, you will save a lot of space with VARCHAR plus page-level compression; in that zone, the NVARCHAR alternative is an NVARCHAR(MAX) column, and you lose out.
  • If you are in the unlucky situation where even row-level compression tanks your performance—something I’ve never seen but acknowledge it as a possibility—going with NVARCHAR becomes a trade-off between reducing storage and maximizing performance.

The Demo Code

In case you want to try out the demo code on your own, here it is:

USE [Scratch]
GO
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
GO
CREATE TABLE dbo.TestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn VARCHAR(150) NOT NULL,
    SomeOtherStringColumn VARCHAR(30) NOT NULL,
    CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED(Id)
);
GO
CREATE INDEX [IX_TestTable_SomeStringColumn] ON dbo.TestTable
(
    SomeStringColumn
);
GO
CREATE TABLE dbo.NTestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(150) NOT NULL,
    SomeOtherStringColumn NVARCHAR(30) NOT NULL,
    CONSTRAINT [PK_NTestTable] PRIMARY KEY CLUSTERED(Id)
);
CREATE INDEX [IX_NTestTable_SomeStringColumn] ON dbo.NTestTable
(
    SomeStringColumn
);
GO

-- Test 1:  It's All ASCII.
INSERT INTO dbo.TestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE('A', 150),
    REPLICATE('X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 150),
    REPLICATE(N'X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 2:  Unicode me.
SELECT DATALENGTH(N'🐭');
SELECT DATALENGTH(N'🪕');
GO
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'🐭', 75),
    REPLICATE(N'🪕', 15)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 3:  Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO

-- Check the DATALENGTH
SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.TestTable;

SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.NTestTable;

-- Row Compression includes Unicode compression.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = ROW);
GO
-- Test 3a:  Continue to Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO

-- Another check of the DATALENGTH
SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.TestTable;

SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.NTestTable;

-- Let's check the LOB
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO
CREATE TABLE dbo.NTestTableLob
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(MAX) NOT NULL,
    SomeOtherStringColumn NVARCHAR(MAX) NOT NULL,
    CONSTRAINT [PK_NTestTableLob] PRIMARY KEY CLUSTERED(Id) WITH(DATA_COMPRESSION = ROW)
);
-- Can't use NVARCHAR(MAX) as a key column in an index...
/* CREATE INDEX [IX_NTestTableLob_SomeStringColumn] ON dbo.NTestTableLob
(
    SomeStringColumn
)  WITH(DATA_COMPRESSION = ROW); */
GO

-- No overflow necessary.
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- What about page-level compression?
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- And to be fair, we'll see the same on NTestTable.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- My page runneth over.
TRUNCATE TABLE dbo.NTestTableLob;
-- Let's reset the data compression.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = NONE);
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(10000)
    -- REPLICATE truncates its result at 8,000 bytes unless the input is a MAX type,
    -- so convert first to make sure these rows really overflow to LOB.
    REPLICATE(CONVERT(NVARCHAR(MAX), N'🐭'), 14800),
    REPLICATE(CONVERT(NVARCHAR(MAX), N'X'), 28000) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- Now we compress.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- Time to clean up.
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO

SELECT
    N'🐭' as Mouse,
    '🐭' as [Mouse?];

Final Thoughts…For Now

I think where I stand right now is, for greenfield database development, I heavily bias toward NVARCHAR and could even say NVARCHAR Everywhere. I think the benefits outweigh the costs here.

For brownfield database development, it’s a harder call to make because you almost certainly have a mix of VARCHAR and NVARCHAR data types. If you already have a solid system within a brownfield database, stick with that system. For example, you might use NVARCHAR for user-entry fields but VARCHAR for internal system fields like codes and levels. If that pattern works for you, that’s fine.

If you’re in a brownfield development mess, I can see the potential benefit of migrating to NVARCHAR Everywhere, but the work-to-benefit ratio is probably not going to be acceptable for most companies. The exception here is if you find out that you’re losing valuable customer data and need to go through an internationalization project. It might be tempting to change the minimum amount necessary, though my philosophy is that if you have the opportunity to make big improvements, take them.

But as I mention in the video, I’m interested in your thoughts as well. So add them to the video or drop them in here. Is there something big I’m missing which makes NVARCHAR Everywhere untenable? Have I shifted your thinking at least a little bit? Let me know.

Building a Docker Container of a SQL Server Database

Today, we’re going to go through the process of turning a database you’ve built into a Docker container. Before we get started, here are the expectations:

  1. I want a fully running copy of SQL Server with whatever database I’m using, as well as key components installed.
  2. I want this not to be on a persistent volume. In other words, when I destroy the container and create a new one from my image, I want to reset back to the original state. I’m using this for technical demos, where I want to be at the same starting point each time.
  3. I want this to be as easy as possible for users of my container. I consider the use of a container here as not particularly noteworthy in and of itself, so the more time I make people spend trying to set up my demo environment, the more likely it is that they will simply give up.

With that preamble aside, let’s get to work!

Step One: Have a Database

This might seem a little obvious, but I want to make it clear that we need the database set up exactly how we want it. This includes user configuration (usually with SQL authentication, given that we’re using Linux containers and passing them out to who-knows-where), database objects like tables and procedures, and whatever else we need. After that, I will take a database backup and save it to my local disk.
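
For completeness, the backup itself is a one-liner. This sketch writes to the same local path the docker cp step below expects; ApproachingZero is my demo database, so substitute your own:

BACKUP DATABASE ApproachingZero
    TO DISK = N'D:\SQLServer\Backup\ApproachingZero.bak'
    WITH INIT, COMPRESSION;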

For this example, I’m going to talk through how I containerized the demos for my Approaching Zero talk. This is a fairly straightforward database with some pre-staged data and objects, but no features such as PolyBase, SQL Agent, or Machine Learning Services to throw a spanner in the works. Incidentally, if you do use those types of features, I was able to get them working in my Native Scoring demo, so it’s possible to do and the process actually isn’t that far removed from what we’re doing here.

Step Two: Spin Up a Container

It takes a container to beat a container. Well, not exactly, but this is the route that I like to take.

I’m going to spin up a container. Here’s the command I run:

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e SA_PASSWORD=YourPasswordGoesHereButDontUseThisOneBecauseItIsntAGood1! --name approaching-zero-db -p 51433:1433 mcr.microsoft.com/mssql/server:2019-latest

Taking this step by step, I want to run a Docker container in the background (that’s what the -d flag means). I’m going to set several environment variables: MSSQL_PID (to tell SQL Server which edition to use), ACCEPT_EULA (to promise Microsoft we won’t sue them), and SA_PASSWORD (to throw off our enemies with deceptively bad passwords). My container will be named approaching-zero-db. You can, of course, name it whatever you’d like, and it doesn’t even have to be the same as what we’re going to push out to Docker Hub, so get creative if you really want.

The -p flag says that we’d like the container’s port 1433 to be represented on our host as port 51433. I already have SQL Server running on my host machine, so I’m selecting a totally different, unused port for the container. But as far as the container is concerned, it is listening on its port 1433, so it has that going for it.

Microsoft has their own container repository for SQL Server, and we’re getting the version tagged as 2019-latest, which is CU4 as of the time of this blog post going live. Downloading this may take a while, so try not to do this on a really slow internet connection.

Incidentally, you might get the following error:

C:\Program Files\Docker\Docker\resources\bin\docker.exe: Error response from daemon: Ports are not available: listen tcp 0.0.0.0:51433: bind: An attempt was made to access a socket in a way forbidden by its access permissions.

If you get this error, it means that something else is already listening on port 51433 on your host, so you’ll have to use a different port instead. Maybe 52433 or 51434 or something. You’ll need to run docker rm approaching-zero-db to clean up the mess before you try again, but don’t shed too many tears over the containers we have to slaughter along the path to glory.
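
Assuming port 52433 happens to be free on your machine, the retry is the same command with a new host port:

docker rm approaching-zero-db
docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e SA_PASSWORD=YourPasswordGoesHereButDontUseThisOneBecauseItIsntAGood1! --name approaching-zero-db -p 52433:1433 mcr.microsoft.com/mssql/server:2019-latest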

Step Three: Restore the Dauphin

We have a container and a database backup, so let’s do what we do best: fitting square pegs into round holes.

To do this, first I will make a backup directory within our container:

docker exec -it approaching-zero-db mkdir /var/opt/mssql/backup

This command is pretty straightforward: we’re going to execute a command in interactive mode (-i) and allocate a pseudo-TTY (-t). Don’t know what a pseudo-TTY is? If it really matters, learn all about tty and then come back here.

Our backup has a home, so now let’s move it in and make it feel comfortable.

docker cp D:\SQLServer\Backup\ApproachingZero.bak approaching-zero-db:/var/opt/mssql/backup/ApproachingZero.bak

We’re using Docker to copy the backup from my local directory into the container. You would, of course, modify this to fit your system.

After the backup is up there, I like to run the following command in Powershell. The prior commands you could run in cmd, but this one’s a lot easier to read as a multi-liner:

docker exec -it approaching-zero-db /opt/mssql-tools/bin/sqlcmd -S localhost `
    -U SA -P YourPasswordGoesHereButDontUseThisOneBecauseItIsntAGood1! `
    -Q "RESTORE DATABASE ApproachingZero FROM DISK = '/var/opt/mssql/backup/ApproachingZero.bak'
        WITH MOVE 'ApproachingZero' TO '/var/opt/mssql/data/ApproachingZero.mdf',
        MOVE 'ApproachingZeor_Log' TO '/var/opt/mssql/data/ApproachingZero.ldf'"

Some builds of SQL Server containers don’t have mssql-tools installed, so you might need to install them separately. Let’s talk about that in a sub-header.

Step Three Point A: What to Expect when you Expected mssql-tools

If, for some reason, your container does not have mssql-tools installed, that’s okay. As long as you have an internet connection, you can get this done.

We’re first going to open up a shell on the container:

docker exec -it approaching-zero-db /bin/bash

Why bash and not ksh? Because I’m not hardcore enough to live my life in vi. I’m a frequent vi tourist but not a native.

Next up, we’re going to install some stuff. Unlike in the Microsoft instructions, we are Master and Commander of this container and so sudo won’t do much.

curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list | tee /etc/apt/sources.list.d/msprod.list

apt-get update 
apt-get install mssql-tools unixodbc-dev

And now we have mssql-tools, so we can return to the restore in step three, already in progress.

Step Four: Perform Sanity Tests

After you’ve restored the database, connect to it in SQL Server Management Studio, Azure Data Studio, or your tool of choice. Make sure the objects are there, that you can log in with the accounts you need, etc. You might need to create logins for those SQL authenticated accounts, for example—unless you’re like me and giving away sa access like it’s free candy at Halloween.
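
If you do need a SQL authenticated account beyond sa, here’s a minimal sketch; AppUser and its password are made-up placeholders:

CREATE LOGIN [AppUser] WITH PASSWORD = N'AnotherBadPassword1!';
GO
USE ApproachingZero;
GO
CREATE USER [AppUser] FOR LOGIN [AppUser];
ALTER ROLE db_datareader ADD MEMBER [AppUser];
GO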

Here’s the important thing, though: any changes you make here will be on the container’s permanent record, so if you want a pristine experience every time you reconstitute that container, you’ll want to handle this step lightly.

Step Five: Admit Defeat on Sanity, Get Committed

We’ve made sure that the database is what we expected, that all of the pieces are there, and that queries are running as we expect. From there, it’s time to push this to Docker Hub.

The first step in pushing a container is to commit the current container as a new image with the docker commit command:

docker commit -m="Creating the Approaching Zero database." -a "{Your Name Here}" approaching-zero-db approaching-zero-db

We are running the commit command and passing in a message with -m, setting the author with -a, selecting the running container to commit (approaching-zero-db) and naming the local repository to which we want to commit this thing (approaching-zero-db).

At this point, you can spin up a local container off of your approaching-zero-db model, but nobody else can. So let’s fix that.

Step Six: Bust Out and Let the Whole World See You

We want to push this new image to Docker Hub. If you have not already, you’ll probably need to log in:

docker login

You might be able to authenticate with cached credentials, but if not you’ll enter your username and password and be good to go.

Now, we want to tag our local image and tell Docker what it will represent up in Docker Hub:

docker tag approaching-zero-db docker.io/feaselkl/presentations:approaching-zero-db

This takes my local repository named approaching-zero-db (with the default latest tag) and ties it back to Docker Hub, where in my feaselkl account I have a presentations repository and a tag called approaching-zero-db.

My last step is to push the fruits of my labor into Docker Hub for all to see.

docker push docker.io/feaselkl/presentations:approaching-zero-db

Step Seven: Using the Image

Now that I have the image up in Docker Hub, I can run the following commands from any machine with Docker installed and spin up a container based on my image:

docker pull docker.io/feaselkl/presentations:approaching-zero-db

docker run --name approaching-zero-db -p 51433:1433 docker.io/feaselkl/presentations:approaching-zero-db

And that’s it. I connect to port 51433 on localhost and authenticate with my extra-great sa username and password, and can run through whatever I need to do. When I’m done, I can stop and kill the image whenever I’d like:

docker stop approaching-zero-db
docker rm approaching-zero-db

Conclusion

In this post, we learned how to take an existing database in SQL Server—whether that be for Windows or Linux—and create a SQL Server container which includes this database.

Coda: Discourses and Distractions

I wanted to save a couple of notes here to the end in order not to distract you with too many tangents in the main body of the post.

Let’s do this as Q&A, as I haven’t done that in a while.

Couldn’t you just use Dockerfiles for all of this?

Short answer, yes. Long answer, I didn’t want to. Dropping a Dockerfile in my GitHub repo makes it easier for me, sure, but then makes it more difficult for people to follow along. As an experiment, I did include all of the steps in my Map to Success repository. Compare the section on “Run Docker Images” to “Build Docker Images” and tell me which one is easier.

If I were putting together a training on Docker, then it would make perfect sense. But I’m using containers as a background tool, and I want to get past it as soon as possible and with as little learner friction as possible.

Can’t you just script all of these steps?

Short answer, yes. Long answer, yes you can.

The only thing to watch out for in scripting is that I noticed a timing issue between when you copy the database backup to the container and when sqlcmd recognizes that the backup is there. I wasn’t able to get it all working in the same Powershell script, even when I did hacky things like adding Start-Sleep. But maybe you’ll have better luck.
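
If you do want to take a crack at scripting it anyway, one option is to poll until the copied backup is visible inside the container before kicking off the restore. Here’s a rough Powershell sketch, with no promise that it fares any better than Start-Sleep did for me:

# Wait up to a minute for the copied backup to show up inside the container.
$found = $false
for ($i = 0; $i -lt 12 -and -not $found; $i++) {
    docker exec approaching-zero-db ls /var/opt/mssql/backup/ApproachingZero.bak *> $null
    if ($LASTEXITCODE -eq 0) { $found = $true } else { Start-Sleep -Seconds 5 }
}
if (-not $found) { Write-Warning "The backup never showed up; something else is wrong." }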

Why did you label the local repository as approaching-zero-db when you knew you were going to push this as presentations:approaching-zero-db?

That’s a really good question.

Why not expose a volume in the container, put the database backup there, and commit the image without the database backup once you’re done?

Credit Bartosz Ratajzyk for the great question.

The short version is that I actually forgot about volumes when I was writing this up. But using docker cp isn’t the worst thing, especially when copying one file. Bartosz makes a great point though that we should remove the backup files before committing, regardless of the technique we use.

So why would you actually use this technique in real life?

Andragogy is real life.

Aside from that, this technique is quite useful for building automated test environments. Create a pre-defined database with a known beginning state, spin up a container, hammer that database with all of your automated tests, and destroy the container when you’re done. Unlike most cases, where you want to save the data permanently, these sorts of tests cry out for ephemeral databases.

If you’re looking at containers for most other purposes (like, say, production databases), you’d definitely be interested in persisted volumes and I’d hand you off to Andrew Pruski for that.

Why would you want to restore the Dauphin? You don’t seem like an Armagnac.

Yeah, if you pressed me on it I’d probably admit to Burgundian sympathies, but that’s mostly because of the role it played in the subsequent Dutch republic. But if you pressed me even further, I’ll admit that I was just rooting for injuries in that whole tete-a-tete.

The Importance of 0 in Regressions

Mala Mahadevan (b | t) brought an article to my attention regarding a regression between elevation and introversion/extroversion by state from a few years back. Before I get into this, I want to note that I haven’t read the linked journal article and am not casting aspersions at the blog post author or the journal article authors, but this was a good learning opportunity for an important concept.

Here is the original image:

I see a line. Definitely, definitely a line.

So boom, extroverts want to live on flat land and introverts in the mountains. Except that there are a few problems with this interpretation. Let’s go through them. I’ll focus entirely on interpreting this regression and try to avoid getting down any tangential rabbit holes…though knowing me, I’ll probably end up in one or two.

The Line is NOT the Data

One of the worst things we can do as data analysts is to interpret a regression line as the most important thing on a visual. The important thing here is the per-state set of data points, but our eyes are drawn to the line. The line mentally replaces the data, but in doing so, we lose the noise. And boy, is there a lot of noise.

Boy, is There a Lot of Noise

I don’t have the raw values but I think I can fake it well enough here to explain my point. If you look at a given elevation difference, there are some huge swings in values. For example, check out the four boxes I drew:

Mastery of boxes? Check.

In the left-most box, roughly the same elevation difference corresponds to NEO E zscores ranging from about -0.6 to 1.8. Considering that the actual data ranges from -2 to approximately 3, that’s a huge slice. The second box spans the two extremes, and the third and fourth boxes each cover well over half of the available range.

This takes us to a problem with the thin line:

The Thin Black Line

When we draw a regression line, we typically draw a thin line to avoid overwhelming the visual. The downside to this is that it implies a level of precision which the data won’t support. We don’t see states clustered around this thin line; they’re all around it. Incorporating the variance in NEO E zscore for a given elevation difference, we have something which looks more like this:

That’s a thick line.

Mind you, I don’t have the actual numbers so I’m not drawing a proper confidence interval. I think it’d be pretty close to this, though, just from eyeballing the data and recognizing the paucity of data points.

So what’s the problem here? The lines are all pointing in the same direction, so there’s definitely a relationship…right?

Zeroing in on the Problem

Looking at the vertical axis, we have a score which runs from -2 to 3(ish), where negative numbers mean introversion and positive numbers extroversion. That makes 0 the midpoint where people are neither introverted nor extroverted. This is important because we want to show not only that this relationship is negative, but that it is meaningful. A quick and dirty way we can check this is to see how much of our confidence interval is outside the zero line. After all, we’re trying to interpret this as “people who live in higher-elevation areas tend to be more introverted.”

The answer? Not much.

With our fat confidence interval guess, the confidence interval for all 50 states (plus one swamp) includes the 0 line, meaning that even though we can draw a line pointing downward, we can’t conclusively say that there is any sort of relationship between introversion/extroversion and differences in elevation because both answers are within our realm of possibility for the entire range of the visual.

But hey, maybe I’m way off on my confidence interval guess. Let’s tighten it up quite a bit and shrink it roughly in half. That gives us an image which looks like this:

Conclusive evidence that Alaskans are introverts.

If I cut that confidence interval roughly in half, enough states fall outside the band that those CI bars are probably too narrow. Even so, conclusions we can draw include:

  • Any state with an elevation difference over ~16,000 is likely to have a NEO E zscore below 0.
  • Alaska is the only state with an elevation difference over 16,000.

For all of the other states, well, we still can’t tell.

Conclusion

Looking solely at this image, we can’t tell much about NEO E zscore versus elevation difference except that there appears to be a negative correlation which is meaningful for any state above 16,000 feet of difference in elevation. Based on the raw picture, however, your eyes want to believe that there’s a meaningful negative correlation. It’s just not there, though.

Bonus Round: Rabbit Holes I Semi-Successfully Avoided

I won’t get into any of these because they’re tangents and the further I get away from looking at the one picture, the more likely it is that I end up talking about something which the paper authors covered. Let me reiterate that I’m not trashing the underlying paper, as I haven’t read it. But here are a few things I’d want to think about:

  • This data is at the state level and shows elevation difference. When sampling, it seems like you’d want to sample at something closer to the county level in order to get actual elevation. After all, the conjecture is that there is a separating equilibrium between extroverts and introverts based on elevation.
  • Elevation difference is also a bit weird of a proxy to use by state. Not so weird that it’s hinky, but weird enough to make me think about it.
  • Looking at Alaska in particular, they had 710K people as of the 2010 census, but here are the top cities and their elevations:
City          Population    Elevation (feet)
Anchorage     291,826       102
Fairbanks     31,535        446
Juneau        31,275        56
Sitka         8,881         26
Ketchikan     8,050         0
Wasilla       7,831         341
Kenai         7,100         72
Kodiak        6,130         49
Bethel        6,080         3
I gave up after 9. Frankly, that’s 9 more than I expected to do.

This tells us that, at a minimum, ~56% of Alaska residents lived at or near sea level despite being one of the most mountainous states. If introverts want to live in high-elevation areas, it’s a little weird that they’re flocking to the coastline, which is supposed to be a high-extroversion area based on the journal article’s summary. But again, I didn’t read the article (or even look for a non-gated copy), so take that with plenty of grains of salt.

Learning Goals for 2020

Unlike the last couple of years (e.g., 2019), I’m lopping off the “Presentation” portion and focusing more on what I want to learn. Presentations will follow from some of this but there are few guarantees. I’m going to break this up into sections because if I just gave the full list I’d demoralize myself.

The Wide World of Spark

It’s a pretty good time to be working with Apache Spark, and I’m interested in deepening my knowledge considerably this year. Here’s what I’m focusing on in this area:

  • Azure Databricks. I’m pretty familiar with Azure Databricks already and have spent quite a bit of time working with the Community Edition, but I want to spend more time diving into the product and gain expertise.
  • Spark.NET, particularly F#. Getting better with F# is a big part of my 2020 learning goals, and so this fits two goals at once.
  • SparkR and sparklyr have been on my radar for quite a while, but I’ve yet to get comfortable with them. That changes in 2020.
  • Microsoft is putting a lot of money into Big Data Clusters and Azure Synapse Analytics, and I want to be at a point in 2020 where I’m comfortable working with both.
  • Finally, Kafka Streams and Spark Streaming round out my list. Kafka Streams isn’t Spark-related, but I want to be proficient with both of these.

Azure Everywhere

I’m eyeing a few Azure-related certifications for 2020. Aside from the elements above (Databricks, Big Data Clusters, and Synapse Analytics), I’ve got a few things I want to get better at:

  • Azure Data Factory. I skipped Gen1 because it didn’t seem worth it. Gen2 is a different story and it’s about time I got into ADF for real.
  • Azure Functions, especially F#. F# seems like a perfect language for serverless operations.
  • Azure Data Lake Gen2. Same deal with ADF, where Gen1 was blah, but Gen2 looks a lot better. I’ve got the hang of data lakes but really want to dive into theory and practices.

Getting Functional

I released my first F# projects in 2019. These ranged from a couple F#-only projects to combination C#-F# solutions. I learned a huge amount along the way, and 2020 is a good year for me to get even more comfortable with the language.

  • Serverless F#. This relates to Azure Functions up above.
  • Fable and the SAFE Stack. This is a part of programming where I’ve been pretty weak (that’s the downside of specializing in the data platform side of things), so it’d be nice to build up some rudimentary skills here.
  • Become comfortable with .NET Core. I’ve been a .NET Framework kind of guy for a while, and I’m just not quite used to Core. That changes this year.
  • Become comfortable with computational expressions and async in F#. I can use them, but I want to use them without having to think about it first.
  • Finish Domain Modeling Made Functional and Category Theory for Programmers. I’ve been working through these two books and plan to complete them.
  • Get more active in the community. I’ve created one simple pull request for FSharp.Data.SqlClient. I’d really like to add a few more in areas where it makes sense.

Data Science + Pipelines

I have two sets of goals in this category. The first is to become more comfortable with neural networks and boosted decision trees, and get back to where I was in grad school with regression.

The other set of goals is all about data science pipelines. I think you’ll be hearing a lot more about this over the next couple of years, but the gist is using data science-oriented version control systems (like DVC), Docker containers governed by Kubernetes, and pipeline software like MLFlow to build repeatable solutions for data science problems. Basically, data science is going through the maturation phase that software development in general went through over the past decade.

Video Editing

This last set of goals pertains to video editing rather than a data platform or programming topic. I want to make 2020 the year of the Linux desktop for video production, and that means sitting down and learning the software side of things. I’m including YouTube tutorials and videos as well as improving my use of OBS Studio for TriPASS’s Twitch channel. If I get good enough at it, I might do a bit more with Twitch, but we’ll see.

Conclusion

Looking back at the list, it’s a pretty ambitious set of goals. Still, these are the areas where I think it’s worth me spending those crucial off-work hours, and I’d expect to see three or four new presentations come out of all of this.

New Website Design

I’ve been working on this for a little while now, but I finally decided to flip over the Catallaxy Services website to its new design today.

The old design was something I came up with about seven years ago and it definitely showed its age. In particular, it had a crummy mobile experience, with different font sizes, needing to pinch-scroll to see anything, and other dodgy bits.

I like the new website a lot better. Granted, if I didn’t like it as much, I wouldn’t have made the switch, but there are a few things I want to point out.

Resources

I do a lot of stuff, and it’s nice to be able to sort out most of those things in one location. That’s the Resources section.

Some of the things I do.

I also broke out resources by section, so if you want an overview of my community resources, you can click the button to show just those resources. Or if you want to give me money in exchange for goods and/or services, I have a “Paid Resources” section too.

Mobile-First Experience

I wanted to make sure not only that the site would work on a phone, but that it looked good on a phone. When I present, this will often be the first way people check out my site, so I want it to be a good experience.

Images

I’m still working on this, but one of the things about the old site which made it fast but boring is that it was entirely textual. There were no images. Now, there are images on pretty much every page. I want to see what I can do to improve things a bit more, but it’s a start.

Presentations Page

Like the Resources section, you can see a list of presentations and also filter by topic.

These machines, they never learn.

For each presentation, I have the same contents that I had before: abstract, slides, demo code, additional media, and links/resources. I’ve shuffled things around a bit so that it looks reasonably good on a wide-screen monitor but also works on a phone or tablet. Here’s a presentation where everything’s nice and filled out. For other talks, where not everything is filled out, I at least have sensible defaults.

Services

Something I didn’t do a great job of calling out before was availability for consulting and paid services. With the new site, I emphasize this in a few ways, whether that’s consulting, paid training (though actual paid training is forthcoming), or books + third-party offerings like PolyBase Revealed.

What’s Next

I have a few more things around the site, namely around custom training, which I intend to work on during the winter. I also have a couple of smaller ideas which I plan to implement over the next several days, but won’t call them out specifically. For now, though, I think this is a pretty good start.