One of the things I like to point out in my Launching a Data Science Project talk is that all data is, by its nature, dirty. I figured I could get into that a little bit here and explain why. I’ll pick a few common problems and cover why each is intractable.
Subjective Data Lacks a Common Foundation of Understanding
Any time you ask for or collect subjective data, you assume a common foundation which does not exist. One great example of this is the Likert scale, which is usually a grading on a scale of 1-5 or 1-7 or 1-49 or whatever level of gradation you desire.
We see Likert scales often in surveys: “How enthusiastic are you about having to stand in line two more hours to pick up your tickets?” Then the answers typically range on a scale from Very Unenthusiastic to Unenthusiastic to Meh to Enthusiastic to Very Enthusiastic and people pick one of the five.
But here’s the problem: your “Enthusiastic” and my “Enthusiastic” are not necessarily the same. For a five-point scale it’s not quite as bad, but as the number of points of gradation increases, we’re more likely to find discrepancies. Even on a 10-point scale, my 5 and your 7 could be the same.
Here’s an example where this matters: a buddy of mine purchased a car and the dealer asked him to fill out a manufacturer’s survey. He was happy with the experience and rated the dealer a 9 out of 10 because he doesn’t give out 10s (I believe the phrase was something like “I still had to pay for the car. If they gave it to me for free, I’d give them a 10.”). In a big analysis, one person providing a 9 instead of a 10 doesn’t mean much—it might shift the mean down a thousandth of a point—but the manufacturer’s analysis penalizes dealers who get ratings lower than a 10. The underlying problem here is that the manufacturer is looking for happy customers. They have a happy customer, but due to underlying differences in definitions their system does not recognize him as happy. What’s funny in this is that a simple “Are you happy with your purchase?” would have gotten the answer they wanted without the pseudo-analytical numbering system and avoided the problem altogether.
Systems are Gameable
I went into this on the blog more than a decade ago, explaining why social welfare functions don’t really exist and how you can’t simply use money as an alternative for utility.
Suppose you want an analysis of how best to distribute goods among your population. An example of this might be to figure out budgets for different departments. You send out a survey and ask people to provide some sort of numeric score representing their needs. Department A responds on a scale from 1-10. Department B responds on a scale from 1-100. Department C responds on a scale from 999,999 to 1,000,000 because Department C is run by a smart thinker.
Fine. You send out a second survey, one stack ranking each department from A to G, ranking them 1-7 and doling out the budget based on perceived rank.
Well, as the head of Department C, I know that A and B are the big boys and I want their money. Departments F and G are run by paint-huffers and paste-eaters respectively, so nobody’s going to vote for them. Therefore, I will rank in order C, F/G, D/E, B/A. This gets even better if I can convince F and G to go along in my scheme, promising them more dollars for paint and paste if they also vote for me atop their lists and then for each other next. Knowing that D, E, B, and A will rank themselves at the top, our coalition of Trolls and Bozos has just enough push to take a bunch of money.
If your end users potentially receive value (or get penalized) based on the data they send, they will game the system to send the best data possible.
Data Necessarily Abstracts the Particulars of Time and Place
This is probably the most Hayekian slide I’ve ever created in a technical presentation, in no small part because I reference indirectly The Use of Knowledge in Society, an essay from 1945 which critiques many of the pretensions of central planners. A significant part of this essay is the idea that “data” is often subjective and incomplete, even without the two problems I’ve described above. An example Hayek uses is that the price of a particular agricultural commodity has within it implicit information concerning weather conditions, expectations of future yields, and a great deal of information which people might not even be able to articulate, much less explicitly correlate. This includes expectations (which naturally differ from person to person), different weightings of factors, and internalized experiences (which pop up as hunches or feelings).
This essay was key to Hayek eventually winning the Nobel Prize in Economics and holds up quite well today.
But What Does This Mean?
If you are a wanna-be central planner, it means you will fail from the get-go. Most of us aren’t wanna-be central planners, however, so the answer isn’t nearly as bad.
In each of these cases, one of the biggest conclusions we can draw is that we will never explain all of the variance in a system, particularly one which involves humans. People are complex, cranky, contrarian, and possess subtle knowledge you cannot extract as data. The complexities of humans will be a source of natural error which will make your analyses less accurate than if you were dealing with rule-based automatons.
It also means that adopting additional precision for imprecise problems is the wrong way of doing it. If you do use a Likert-type scale, fewer broad options beats many fine options because you’re less likely to run into expectation differences (where my 7 is your 5 and Ted’s 9.14).