This is part two of a series on launching a data science project.
How Is Babby Data Science Project Formed?
Behind each data science project, there is (hopefully) someone higher up on the business side who wants it done. This person might have been the visionary behind this project or might simply be the sponsor who drives it because of the project’s potential value. Either way, the data science team needs to seek out and gather as much information about that champion’s vision as possible. In a perfect scenario, this is the person handing out sacks of cash to you, and you want those sacks of cash because they buy pretty hardware and let you bring in smart people to work with (or for) you. You might even have several people interested in your project; if so, you’ll want to build a composite vision, one which hopefully includes all of their visions. Just keep in mind that sometimes you can’t combine everybody’s dreams and get a coherent outcome, so you’ll need to drive the champion(s) toward clarity. Here are a few clues to help.
Learn The Domain
The first clue is figuring out the domain. This means asking questions like what the company does, what other companies in the industry do, what kind of jargon people use, etc. The more you know, the better prepared you are to understand the mind(s) of your champion(s). But even if you’ve been at that company for a long time and have a detailed understanding of the business, you still want to interview your champion(s) and ask questions which expose the ideal outcome.
Stop, Collaborate, and Listen
The second clue is simple: listen. When interviewing people, listen for the following types of questions:
- How much / how many?
- Which category does this belong to?
- How can we segment this data?
- Is this weird?
- Which option should I choose?
Each of these is a different type of problem with its own set of statistical techniques and rules. For example, the “Is this weird?” question relates to anomaly detection: finding outliers or “weird” results. Figuring out which of these types of questions is most important to your champion is crucial to delivering the right product. You can build the absolute best regression ever, but if the person was expecting a recommendation engine, you’re going to disappoint.
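As a rough cheat sheet, the question types above can be paired with technique families. Only the anomaly detection pairing is spelled out in this post; the rest are the conventional pairings and should be read as my assumption, not a rule. The names `QUESTION_TO_TECHNIQUE` and `technique_for` are made up for this sketch.

```python
# Assumed pairing of question types with technique families.
# Only "Is this weird?" -> anomaly detection is stated outright in the post.
QUESTION_TO_TECHNIQUE = {
    "How much / how many?": "regression",
    "Which category does this belong to?": "classification",
    "How can we segment this data?": "clustering",
    "Is this weird?": "anomaly detection",
    "Which option should I choose?": "recommendation",
}


def technique_for(question: str) -> str:
    """Return the technique family for a question type, or 'unknown'."""
    return QUESTION_TO_TECHNIQUE.get(question, "unknown")


print(technique_for("Is this weird?"))
```

Real interviews rarely produce questions this tidy, so treat the lookup as a prompt for follow-up questions and not as a classifier.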
As you listen to these types of questions, your goal is to nail down a specific problem with a specific answer. You want to narrow down the scope to something that your team can achieve, ideally something with a built-in measure for success. For example, here are a few specific problems that we could go solve:
- Find a model which predicts quarterly sales to within 5% no later than 30 days into the quarter.
- Given a title and description for a product, tell me a listing category which Amazon will, with at least 90% confidence, consider valid for this product.
- Determine the top three factors which most affect the number of years the first owner holds onto our mid-range sedan.
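A built-in measure for success, like the 5% target in the first problem, can be written down as a check before any modeling starts. This is a minimal sketch; the function name `within_tolerance` and the sales figures are invented for illustration.

```python
def within_tolerance(predicted: float, actual: float, tolerance: float = 0.05) -> bool:
    """True when the prediction's relative error is at most `tolerance`."""
    if actual == 0:
        raise ValueError("relative error is undefined when actual is 0")
    return abs(predicted - actual) / abs(actual) <= tolerance


# Made-up quarterly sales figures, purely illustrative.
print(within_tolerance(predicted=1_030_000, actual=1_000_000))  # 3% off
print(within_tolerance(predicted=1_080_000, actual=1_000_000))  # 8% off
```

Agreeing on a check like this with your champion up front makes it much harder to argue later about whether the project succeeded.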
With a specific problem in mind, you can look for relevant data. Of course, you’ll probably need to modify the scope of this problem over time as you gather new information, but this gives you a starting point for success. Also, don’t expect something as clear-cut as the above early on; instead, people will hem and haw, not quite sure what they really want. You can take a fuzzy goal into data acquisition, but as you acquire data, you will want to work with the champion to focus down to a targeted and valuable problem.
Dig For Data
Once you have an interesting question, or even the bones of a question, start looking for data. Your champion likely has a decent understanding of what is possible given your data, so the first place to look is in-house data sources like databases, Excel, flat files, data accessible through internal APIs, and even reports (generated reports like PDFs or even printed-out copies). Your champion will hopefully be able to point you in the right direction and tell you where some of this data is located, especially if it’s hidden in paper format or in a spreadmart somewhere. Even so, you’re going to be doing a lot of legwork between here and the data processing phase, and you’ll likely bounce back and forth between the two phases a number of times.
As you gather data resources, you will probably want to build a data dictionary for your data sources. A great data dictionary will include things like:
- The data type of each attribute: numeric, string, categorical, binary.
- The data format: CSV, SQL Server table, Hive table, JSON file, etc.
- The size of the data and number of records.
- The enumeration of valid categorical values.
- Other domain rules (if known).
I’d love to say that I always do this…but that’d be a lie. Still, as the saying goes, hypocrisy is the tribute that vice pays to virtue.
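For what it’s worth, a data dictionary doesn’t need to be heavyweight. Here is a minimal sketch of one as plain Python dataclasses, covering the items in the list above. The class names, field names, and the `orders` example source are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AttributeEntry:
    name: str
    data_type: str  # "numeric", "string", "categorical", or "binary"
    valid_values: List[str] = field(default_factory=list)  # for categoricals
    domain_rules: List[str] = field(default_factory=list)  # if known


@dataclass
class SourceEntry:
    name: str
    data_format: str  # e.g. "CSV", "SQL Server table", "JSON file"
    size_bytes: Optional[int] = None
    record_count: Optional[int] = None
    attributes: List[AttributeEntry] = field(default_factory=list)

    def categorical_attributes(self) -> List[str]:
        """Names of attributes recorded as categorical."""
        return [a.name for a in self.attributes if a.data_type == "categorical"]


# Hypothetical source, purely illustrative.
orders = SourceEntry(
    name="orders",
    data_format="CSV",
    attributes=[
        AttributeEntry("order_total", "numeric", domain_rules=["must be >= 0"]),
        AttributeEntry("status", "categorical",
                       valid_values=["open", "shipped", "returned"]),
    ],
)
print(orders.categorical_attributes())
```

A spreadsheet or wiki page works just as well; the point is that the entries are written down somewhere the whole team can find them.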
Learn Your Outputs
While you’re looking for data and focusing in on the critical problem, you also need to figure out the endgame for your product. Will there be a different engineering team (or multiple teams?) expecting to call a microservice API and get results back? Will you get a set of files each day and dump the results into a warehouse? What’s the acceptable latency?
The Engineering team should help solve this technical problem, although your champion should have insight here depending upon how the business side will need to use your results. If the business side is already getting files in once a day, they may be fine with your process running overnight and having results in a system by 8 AM, when the analysts start showing up at work. By contrast, you may have a fraud detection system which needs to make a decision in milliseconds. These two systems will have radically different sets of requirements, even if the output looks the same to the business side.
Going Through An Example
My motivating example, as mentioned yesterday, is data professional salaries—figuring out how to get more money does, after all, motivate me!
Let’s suppose we work for Data Platform Specialists, a company dedicated to providing DBAs and other data platform professionals with valuable market knowledge. We have come into possession of a survey of data professionals and want to build insights that we can share with our client base.
If you haven’t seen the survey yet, I recommend checking it out.
Once we have the salary data, we want to start building a data dictionary and see what the shape of the data looks like. We’d get information on the total number of rows, note that this is stored in Excel on a single worksheet, and then make some notes on the columns. A number of these features are categorical: for example, TelecommuteDaysPerWeek has six options, ranging from “less than 1” to “5 or more.” By contrast, hours worked per week is an integer, which ranges from 5 to 200 (umm…).
There are quite a few columns, most of which originally came from dropdowns rather than users typing the data in. This is good, because users are typically the biggest enemy of clean data. But even in this example, we can see some interesting results: for example, about halfway through the image, you can see “111,000” in the SalaryUSD column. It turns out that this was a string field rather than numeric. If you simply try to convert it to a numeric type, it will handle “111,000” correctly, but it would turn a German’s entry of “111.000” from $111,000 into $111 if you’re in the US. But I’m getting a bit ahead of myself here…
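To make the separator problem concrete, here is a minimal sketch of a locale-agnostic parser. It leans on one assumption I’m making about salary data: a separator followed by exactly three digits is a thousands separator, never a decimal mark. The function name `parse_salary` is hypothetical, and this is not necessarily how the real cleanup should go.

```python
import re
from typing import Optional

# Digits grouped in threes by "." or ",": 111,000 or 111.000 or 1.250.000
_GROUPED = re.compile(r"^\d{1,3}([.,]\d{3})+$")
_PLAIN = re.compile(r"\d+(\.\d+)?")


def parse_salary(raw: str) -> Optional[float]:
    """Parse a salary string, returning None when it cannot be read."""
    text = raw.strip().lstrip("$").replace(" ", "")
    if _GROUPED.match(text):
        # Assumption: every separator here marks a thousands group.
        return float(re.sub(r"[.,]", "", text))
    if "," in text and "." in text:
        # Whichever separator appears last is the decimal mark.
        decimal = "," if text.rfind(",") > text.rfind(".") else "."
        thousands = "." if decimal == "," else ","
        text = text.replace(thousands, "").replace(decimal, ".")
    elif "," in text:
        # A lone comma not followed by three digits: decimal comma.
        text = text.replace(",", ".")
    return float(text) if _PLAIN.fullmatch(text) else None


for entry in ["111,000", "111.000", "111.000,50", "95000", "n/a"]:
    print(entry, "->", parse_salary(entry))
```

The three-digit heuristic would be wrong for a column with fractional values like “1.234”, so it only makes sense where whole-dollar amounts are the norm.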
The first thing I want to ask is: what are the interesting types of questions we can ask given our vast wealth of domain knowledge and this compendium of valuable pricing insight? (Too obsequious? Maybe.)
- How much money does a DBA make? Similarly, how much does a data scientist make, or a developer who specializes in writing T-SQL?
- Which category of DBA (e.g., junior, mid-level, senior) does a particular type of work?
- How can we segment the DBAs in our survey?
- Suppose I work 80 hours per week. Compared to my peers, is this many hours per week weird?
- Which option should I choose as a career path? DBA? Data scientist? BI specialist?
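As an illustration of the “is this weird?” question, here is a minimal sketch using the common 1.5 × IQR rule of thumb. That rule is a technique I’m swapping in for the example, not something the survey analysis is committed to. The hours below are invented for illustration and are not taken from the survey.

```python
import statistics
from typing import Sequence


def is_weird(value: float, sample: Sequence[float], k: float = 1.5) -> bool:
    """Flag `value` if it falls outside the k * IQR fences of `sample`."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


# Invented hours-per-week values, NOT from the survey.
hypothetical_hours = [35, 40, 40, 40, 42, 45, 45, 50, 55, 60]

print(is_weird(80, hypothetical_hours))
print(is_weird(45, hypothetical_hours))
```

Whether 80 hours counts as weird depends entirely on the peer group you compare against, which is one more reason to pin down the question with your champion first.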
Talking this through with our champion and other stakeholders, we can share some of the information we’ve already gathered, prime them toward asking focused questions, and narrow down to our question of interest:
How much money should we expect a data professional to make?
Well, that’s…broad. But this early on in the process, that’s probably a reasonable question. We might not want to put hard boundaries on ranges yet, but as we go along, we can narrow the question down further to something like, “Predict, within $5K USD, how much we should expect a data professional to make depending upon country of residence, education level, years of experience, and favorite Teenage Mutant Ninja Turtle (original series only).”
The end product that we want to support is a small website which allows people to build profiles and then takes that profile information and gets an estimate of how much they could make in different roles. Our job is to build a microservice API which returns a dollar amount based on the inputs we define as important.
Today’s post was all about getting into the business users’ heads. These are the people handing us big sacks of cash, after all, so we have to give them something nice in return. In the next post, we’ll go a lot deeper into data processing and ask the question, if data platform professionals are the gatekeepers between good data and bad, why are we so bad at filling out forms?