This is part one of a series on launching a data science project.
This is the beginning of a series of posts on growing a data science project from the germ of an idea to its fruition as a sturdy oak. Before I get into the process, I want to start with a few data adages to which I stubbornly subscribe and which drive the need for quality processes.
You may disagree with any of these adages, but they drive my decision-making processes and I believe that they feed at least a little bit of the paranoia necessary to be an effective data architect.
Clean Data Is An Aspiration, Not A Reality
The concept of “clean” data is appealing to us—I have a talk on the topic and spend more time than I’m willing to admit trying to clean up data. But the truth is that, in a real-world production scenario, we will never have truly clean data. Whenever there is the possibility of human interaction, there is the chance of mistyping, misunderstanding, or misclicking, each of which can introduce invalid results. Sometimes we can see these results—like if we allow free-form fields and let people type in whatever they desire—but other times, the error is a bit more pernicious, like an extra 0 at the end of an entry or a 10-key operator striking 4 instead of 7.
Even with fully automated processes, we still run the risk of dirty data: sensors have error ranges, packets can get dropped or sent out of order, and services fail for a variety of reasons. Each of these can negatively impact your data, leaving you with invalid entries.
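To make this concrete, here is a minimal sketch of the kind of range check that catches some—but never all—dirty data. The column names and sentinel values are hypothetical, not from any real system:

```python
import pandas as pd

# Hypothetical sensor readings: one sentinel value (-999) and one
# "extra zero" typo (2140.0 instead of 214.0 or 21.4).
readings = pd.DataFrame({
    "sensor_id": [1, 1, 2, 2, 3],
    "temperature_c": [21.4, 21.7, -999.0, 22.1, 2140.0],
})

# Flag values outside a plausible physical range. This catches the
# sentinel and the extra-zero typo, but a subtle 4-vs-7 keystroke
# error would sail right through.
valid_range = readings["temperature_c"].between(-40.0, 60.0)
suspect = readings[~valid_range]
```

Checks like this reduce the problem; they don't eliminate it, which is the point of the adage.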
Data Source Disagreements Will Happen
If you have the same data stored in two systems, there will inevitably be disagreements. These disagreements can happen for a number of reasons, including:
- Different business rules mean that different subsets of data go into each system. For example, one system might track data on work items, and a separate system could track data on work items specifically for CapEx projects. If you don’t understand the business rules of each system, you might look at the difference in numbers and get confused.
- With manual data entry, people can make mistakes, and those mistakes might manifest in different ways in separate systems. If a person has to type data into two systems, the likelihood of typos affecting each system exactly the same way is fairly low.
- Different systems might purport to be the same but actually are based on different data sources. For example, the Penn World Table has two different mechanisms to calculate GDP: expenditure-side GDP and output-side GDP. In an ideal world, these are the same thing, but in reality, they’re a little bit different. If I build one system based off of expenditure-side GDP and you build another system based off of output-side GDP, our calculations will clash even though they’re supposed to represent the same thing.
- Some systems get updated more frequently than others, so one side might have newer data, even if they both come from the same source and have the same rules and calculations applied.
- Even in a scenario where you are reading from a warehouse which gets its data from a single source system, there is still latency, meaning that you might get an extract of data from the warehouse which is out of date. That too can lead to data discrepancies between sources.
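One simple way to surface these disagreements is an outer join with an indicator column, which separates "in one system only" rows from "in both, but different" rows. This is a hedged sketch with made-up system names and values:

```python
import pandas as pd

# Hypothetical extracts of the "same" work-item hours from two systems.
system_a = pd.DataFrame({"item_id": [1, 2, 3], "hours": [8.0, 4.0, 6.0]})
system_b = pd.DataFrame({"item_id": [2, 3, 4], "hours": [4.0, 6.5, 3.0]})

merged = system_a.merge(system_b, on="item_id", how="outer",
                        suffixes=("_a", "_b"), indicator=True)

# Rows present in only one system (different business rules, latency).
only_one_side = merged[merged["_merge"] != "both"]

# Rows present in both systems but with different values
# (typos, update timing, different underlying calculations).
disagree = merged[(merged["_merge"] == "both")
                  & (merged["hours_a"] != merged["hours_b"])]
```

The interesting work starts after this query: figuring out *which* of the reasons above explains each discrepancy.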
You Will Always Have More Questions Than Data
This seems pretty self-explanatory—our ability to collect and process information is finite, whereas the set of questions we could ask is infinite. You might be able to collect an exhaustive data set about a very particular incident or set of incidents, but there are always more and broader questions a person can ask for which the data is not available. For example, let’s say that we have a comprehensive set of data about a single baseball game, including lineups, game actions, pitch locations, bat speeds on contact, and so on. No matter how detailed the data you provide, someone will be able to ask questions that your data cannot answer. One class of these questions involves trying to discern human behavior: why the manager picked one reliever over another, why the runner decided to advance from first to third base on a single to left field in the 4th inning, why this pitcher followed up a four-seam fastball inside with a changeup outside, etc.
Decision-Makers Often Don’t Know The Questions They’ll Have
I’ve built a few data warehouses in my time. The most frustrating part of building a data warehouse is that you have to optimize it for the questions people will have, but it’s hard for people to imagine those questions far enough in advance for you to develop the warehouse. Decision-makers tend not to know the types of questions they can ask, including whether those questions are realistic or reasonable, until you prod them in a direction.
Even worse, you’ll be able to answer some of their questions, but invariably they will have questions your system cannot answer, meaning that you either extend the system to answer those questions as well, users find some other way to satisfy their curiosity, or they forget about the question and potentially lose a valuable thread.
We have ways of coping with this, like storyboarding, iterative development, and storing vast amounts of semi-structured data, but it’s tough to figure out what to include in your data lake when you don’t have a prescribed set of required questions to answer (like, for example, a set of compliance forms you need to fill out regularly).
Data Abstracts The Particulars Of Time And Place
This is one that I have repeatedly stolen over the years from F.A. Hayek, who made this point in his essay The Use of Knowledge in Society (which contributed to his winning a Nobel Prize three decades later). We abstract and aggregate data in order to make sense of it, but that data covers up a lot of deeper information. For example, we talk about “the” price of something, but price—itself an abstraction of information—depends upon the particulars of time and place, so the spot price of a gallon of gasoline can differ significantly over the course of just a few miles or a few days. We can collect the prices of gasoline at different stations over the course of time and can infer and adduce some of the underlying causes for these levels and changes, but the data we have explains just a fragment of the underlying reality.
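The gasoline example can be sketched in a few lines: a single headline average hides the station-to-station and day-to-day variation it was computed from. The numbers here are invented purely for illustration:

```python
import pandas as pd

# Hypothetical per-station, per-day gasoline prices.
prices = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "day":     [1, 2, 1, 2],
    "price":   [3.09, 3.19, 3.45, 3.39],
})

# "The" price: one aggregate number...
overall_mean = prices["price"].mean()

# ...versus the spread of particulars it abstracts away.
price_range = prices["price"].max() - prices["price"].min()
```

The aggregate is useful, but it is a summary of the particulars of time and place, not a replacement for them.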
The Need For Process
I consider all of the adages above to be true, and yet it’s my job to figure something out. To deal with these sorts of roadblocks, we build processes which give us structure and help us navigate some of the difficulties of managing this imperfect, incomplete, messy data.
Over the course of this series, I’m going to cover one particular process: the Microsoft Team Data Science Process. I don’t follow it strictly, but I do like the concepts behind it and I think it works well to describe how to increase the likelihood of launching a good data science project.
There are a couple of things that I like about this process. First, it hits most of the highlights: I think the combination of business understanding, data acquisition, modeling, and deployment is probably the right level of granularity. Each of these has plenty of details that we can (and will) dig into, but I think it’s a good starting point.
The other thing that I like about this process is that it explicitly recognizes that you will bounce around between these items—it’s not like you perform data acquisition once and are done with it. Instead, you may get into that phase, gather some data, start modeling, and then realize that you need to go back and ask more pointed business questions, or maybe you need to gather more data. This is an explicitly iterative process, and I think that correctly captures the state of the art.
Our Sample Project
Through this series, I’m going to use the 2018 Data Professionals Salary Survey. I took a look at the 2017 version when it came out, and then another look at the 2017 survey during my genetic programming series, but now I want to use the latest data. As we walk through each step of the Team Data Science process, we’ll cover implementation details and try to achieve a better understanding of data professional salaries. Of course, the key word here is probably try…