All Data is Dirty: Subjective Data, Likert Scales, and More

One of the things I like to point out in my Launching a Data Science Project talk is that all data is, by its nature, dirty. I figured I could get into that a little bit here and explain why. I’ll pick a few common problems and cover why each is intractable.

Subjective Data Lacks a Common Foundation of Understanding

Any time you ask for or collect subjective data, you assume a common foundation which does not exist. One great example of this is the Likert scale, which usually asks for a rating on a scale of 1-5 or 1-7 or 1-49 or whatever level of gradation you desire.

We see Likert scales often in surveys: “How enthusiastic are you about having to stand in line two more hours to pick up your tickets?” Then the answers typically range on a scale from Very Unenthusiastic to Unenthusiastic to Meh to Enthusiastic to Very Enthusiastic and people pick one of the five.

But here’s the problem: your “Enthusiastic” and my “Enthusiastic” are not necessarily the same. For a five-point scale it’s not quite as bad, but as the number of points of gradation increases, we’re more likely to find discrepancies. Even on a 10-point scale, my 5 and your 7 could be the same.

Here’s an example where this matters: a buddy of mine purchased a car and the dealer asked him to fill out a manufacturer’s survey. He was happy with the experience and rated the dealer a 9 out of 10 because he doesn’t give out 10s (I believe the phrase was something like “I still had to pay for the car. If they gave it to me for free, I’d give them a 10.”). In a big analysis, one person providing a 9 instead of a 10 doesn’t mean much—it might shift the mean down a thousandth of a point—but the manufacturer’s analysis penalizes dealers who get ratings lower than a 10. The underlying problem here is that the manufacturer is looking for happy customers. They have a happy customer, but due to underlying differences in definitions their system does not recognize him as happy. What’s funny in this is that a simple “Are you happy with your purchase?” would have gotten the answer they wanted without the pseudo-analytical numbering system and avoided the problem altogether.
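To see how much the scoring rule matters, here’s a minimal sketch in Python with made-up numbers: one dealer, a thousand happy customers, and one principled holdout who gives a 9. The mean barely notices, but a top-box rule that only counts perfect 10s (roughly what the manufacturer seems to be doing) flags that same customer as a problem.

```python
# Minimal sketch with made-up data: one dealer, 1,000 happy customers,
# one of whom refuses to give out 10s on principle.
ratings = [10] * 999 + [9]

# Mean-based score: the lone 9 shifts the average by a thousandth of a point.
mean_score = sum(ratings) / len(ratings)

# "Top-box" score: only perfect 10s count as happy customers, so the same
# lone 9 registers as an unhappy customer even though he isn't.
top_box_rate = sum(1 for r in ratings if r == 10) / len(ratings)

print(f"Mean score:   {mean_score:.3f}")    # 9.999
print(f"Top-box rate: {top_box_rate:.1%}")  # 99.9%
```

Which of those two numbers gets acted on determines whether my buddy’s dealer takes a hit for a customer who was, by his own account, perfectly happy.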

Systems are Gameable

I went into this on the blog more than a decade ago, explaining why social welfare functions don’t really exist and how you can’t simply use money as an alternative for utility.

Suppose you want an analysis of how best to distribute goods among your population. An example of this might be to figure out budgets for different departments. You send out a survey and ask people to provide some sort of numeric score representing their needs. Department A responds on a scale from 1-10. Department B responds on a scale from 1-100. Department C responds on a scale from 999,999 to 1,000,000 because Department C is run by a smart thinker.

Fine. You send out a second survey, this time asking each department to stack-rank departments A through G from 1 to 7, and you dole out the budget based on the resulting ranks.

Well, as the head of Department C, I know that A and B are the big boys and I want their money. Departments F and G are run by paint-huffers and paste-eaters respectively, so nobody’s going to vote for them. Therefore, I will rank in order C, F/G, D/E, B/A. This gets even better if I can convince F and G to go along with my scheme, promising them more dollars for paint and paste if they also vote for me atop their lists and then for each other next. Knowing that D, E, B, and A will rank themselves at the top, our coalition of Trolls and Bozos has just enough push to take a bunch of money.
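To put rough numbers on that scheme, here’s a toy sketch in Python. The Borda-style scoring and the “honest” ballots are my own invented stand-ins for however the budget office might aggregate the rankings; the point is only to show how far a small coalition can move the result.

```python
departments = ["A", "B", "C", "D", "E", "F", "G"]
consensus = ["A", "B", "C", "D", "E", "F", "G"]  # invented "true need" ordering

def honest_ballot(dept):
    # Sincere voters follow the consensus order, except for the mild,
    # predictable bias of bumping themselves to the top.
    return [dept] + [d for d in consensus if d != dept]

def borda_scores(ballots):
    # Borda-style scoring: rank 1 earns 7 points, rank 7 earns 1 point.
    scores = {d: 0 for d in departments}
    for ballot in ballots:
        for points, dept in zip(range(len(departments), 0, -1), ballot):
            scores[dept] += points
    return scores

# Scenario 1: every department votes sincerely.
honest = borda_scores([honest_ballot(d) for d in departments])

# Scenario 2: C, F, and G collude -- C on top, the coalition next,
# and the big departments (A and B) buried at the bottom.
coalition_ballots = {
    "C": ["C", "F", "G", "D", "E", "B", "A"],
    "F": ["C", "F", "G", "D", "E", "B", "A"],
    "G": ["C", "G", "F", "D", "E", "B", "A"],
}
strategic = borda_scores(
    [coalition_ballots.get(d, honest_ballot(d)) for d in departments]
)

for d in departments:
    print(f"{d}: honest={honest[d]:3d}  strategic={strategic[d]:3d}")
```

With these invented ballots, C vaults from third place to the top score while A and B, the honest front-runners, lose their lead entirely.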

If your end users potentially receive value (or get penalized) based on the data they send, they will game the system to send the best data possible.

Data Necessarily Abstracts the Particulars of Time and Place

This is probably the most Hayekian slide I’ve ever created in a technical presentation, in no small part because I indirectly reference The Use of Knowledge in Society, a 1945 essay which critiques many of the pretensions of central planners. A significant part of this essay is the idea that “data” is often subjective and incomplete, even without the two problems I’ve described above. An example Hayek uses is that the price of a particular agricultural commodity has within it implicit information concerning weather conditions, expectations of future yields, and a great deal more which people might not even be able to articulate, much less explicitly correlate: expectations (which naturally differ from person to person), different weightings of factors, and internalized experiences (which pop up as hunches or feelings).

This essay was key to Hayek eventually winning the Nobel Prize in Economics and holds up quite well today.

But What Does This Mean?

If you are a wanna-be central planner, it means you will fail from the get-go. Most of us aren’t wanna-be central planners, however, so the news isn’t nearly as bad.

In each of these cases, one of the biggest conclusions we can draw is that we will never explain all of the variance in a system, particularly one which involves humans. People are complex, cranky, contrarian, and possess subtle knowledge you cannot extract as data. The complexities of humans will be a source of natural error which will make your analyses less accurate than if you were dealing with rule-based automatons.

It also means that adding extra precision to imprecise problems is the wrong approach. If you do use a Likert-type scale, a few broad options beat many fine-grained options because you’re less likely to run into expectation differences (where my 7 is your 5 and Ted’s 9.14).
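As a rough illustration of why fewer, broader options help, here’s a toy simulation in Python. Everything in it is an assumption I made up for this post: two raters share exactly the same underlying opinion, one of them calibrates a little higher (the “my 7 is your 5” effect), and both round onto whatever scale they’re handed. The finer the scale, the more often that small calibration gap turns into two different answers.

```python
import random

def disagreement_rate(points, trials=100_000, offset=0.05, seed=42):
    """Share of trials where two raters with the same underlying opinion
    report different values on an n-point scale.  The offset is a made-up
    stand-in for personal calibration differences."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(trials):
        opinion = rng.random()                          # shared opinion in [0, 1]
        a = round(opinion * (points - 1)) + 1           # rater A's answer
        b = round(min(1.0, opinion + offset) * (points - 1)) + 1  # rater B leans higher
        disagreements += (a != b)
    return disagreements / trials

for points in (3, 5, 7, 10, 49):
    print(f"{points:2d}-point scale: {disagreement_rate(points):.1%} disagreement")
```

Under these made-up assumptions, a three- or five-point scale absorbs most of the calibration gap, while a 49-point scale turns it into a disagreement nearly every time.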

Getting Things Done Using Calendar Entries

I’ve been using a technique lately to try to get more stuff done. I wanted to share this technique in the hopes that it might help somebody else.

Ennui and the Overabundance of Things to Do

I tend to be a list-maker. On each of my computers I have a list of things to do, my phone has a list of things to do, and I have (literal) piles of books to read. I realize that I have years and years of material piling up, and I keep adding more to it because hey, there’s always something else to learn and I’m sure I’ll get to this thing…someday…

The problem is that the list (or collection of lists) gets big enough that the best option is to give up and play video games all day. That makes the list grow larger and creates a vicious cycle. Sometimes it means that I ignore a technology which fades away quickly (hi, Silverlight!) but other times it means that I’m late to the party on something important (like, say, neural networks).

Over the past couple of months, I’ve been working through a technique to try to handle things better: calendar entries.

The Spark

There was a person I worked with at one point in time who had calendar entries for everything. This seemed a bit silly to me: yeah, calendar entries are good for meetings and reminders to go someplace, but this got down to ridiculous levels like filling out paperwork, working on certain projects, sending e-mails, and even the mundane (which I won’t go into here to protect the guilty).

While I was at PASS Summit, I was thinking about all the things I needed to catch up on and how I had been neglecting too many things for too long. But although guilt is a powerful motivator, it clearly wasn’t enough to get me to do more. Nor were to-do lists sufficient for getting me, well, to do things. That’s when I thought about the silly idea of calendar entries everywhere.

Calendar Entries Everywhere

At Summit, before I left for home, I decided to start filling in calendar entries. As I’ve gone along, I’ve gotten a bit better about it, but here’s an early week as an example. I’ve marked out a bunch of entries which are private (appointments, personal to-dos, that kind of thing), which involve work, or which I just don’t want to share with the broader world. I also didn’t show all of my calendars here, so my days end up a little more packed than they may appear at first blush.

[Screenshot of one week of my calendar. Caption: “I think I can slot you in between 2:24 PM and 2:29 PM on Thursday.”]
Bonus caption bit: Operation 10% was where I decided to throw away a lot of junk which had accumulated in my house over the past several years. This is what happens when I’m actually at home for a couple weeks in a row: I start getting hare-brained ideas and actually implement them.

As you can see, there’s a lot going on here. That leads into my next section, which starts now:

The Silly Psychology of Calendar Entries

By themselves, calendar entries don’t really do much. What I do is turn on notifications for them; that way, my phone pops up a message and Google Calendar throws up a JavaScript popup box telling me that hey, it’s time to go do something. Those notifications spur me to action. Not surprisingly, I then tend to go do something, sometimes even the thing that multiple devices are yelling at me to do!

This discipline technique works both ways: it’s a way for Present Me to get Future Me away from playing video games all the time, but it’s also a way for Future Me to remind Present Me that there are only so many hours in the day to do things, so if I want to work on a presentation, I can’t also read a book, write a blog post, do work, and review material at the same time. That, says Future Me, is why he plays so many video games all the time. Future Me is also really good at coming up with post hoc excuses.

Tips from a Practitioner

Here are a few tips I’ve learned the hard way:

  • Try to plan out the next 3-7 days in advance. If you need to, block out some time for that as well. In my case, I usually spend about 10-20 minutes during slack times thinking about what I want to do and creating those entries for the next week.
  • Be generous with your time allocations. I tend not to fill out anything less than half an hour, and typically use 60-90 minute blocks of time. Things always seem to take longer than you’d expect, so instead of estimating 15 or 30 minutes, go with the full hour if there’s any doubt.
  • Try to limit the number of distinct things you have going on. I want to learn a bunch of things all at once, but to keep things from getting jumbled, I try to have two or maybe three separate topics a week.
  • Variety is great. I’m the type of person who has trouble staying on topic for more than 60-90 minutes, so I want to switch things up during the day. Occasionally, I do get in the zone and blast through something for 3-4 hours straight, and when that occurs, I can always move calendar blocks around.
  • If you feel like getting started early, do it! The calendar entries aren’t strict edicts; they’re reminders of what you have going on.
  • Similarly, if you can’t do something at the appointed time or are completely unmotivated, no worries. Either move that block to another time or delete the entry. But even this is nice because it forces you to think about what you’re doing and if you’re simply procrastinating, that little bit of guilt might be enough to get you going.
  • Leave room in the calendar. Things will come up and you’ll want some flexibility. Also, even if you have an entire day to yourself, you won’t want to fill it with 12 hours of work-like stuff. Leave some gaps between things so you can take a break. That’s one of the big things I learned since the week that I listed above—even on a busy day, I tend not to book more than about 8 hours, and I tend to leave slack room for daily life (i.e., multiple multi-hour naps).

Conclusion: Will this Work for Me?

You might be asking that question. If so, I have no idea—I’m not you and I’m not making money on this so I don’t have much incentive to convince you that it would.

My best guess is that people with the following characteristics will probably get the most out of this technique:

  • You have some flexibility in your schedule right now. If you’re already booked solid for the next six months, there’s not much we can do here.
  • You can commit to performing some action for a block of time. Having young children will likely make this technique all but impossible.
  • You are at least a little neurotic. If you don’t care at all about skipping appointments or ignoring mechanical devices blinking and buzzing at you, this technique probably isn’t going to work.
  • You have a semblance of a plan. One important thing with this technique is that you can switch to the task without much mental overhead. A task which reads “go do stuff” won’t help you at all. By contrast, a task to continue reading a book is easy: you pick up where you left off. For things that are in between, you can always fill out the calendar entry’s description with some details of things to do if that helps.

One nice thing I’ve picked up from using this is that you know pretty quickly after starting whether the technique is working for you or not. Since I started creating calendar entries everywhere, I’ve skipped a few tasks and had things come up, but the technique has been very helpful in keeping me on track by reminding me of the things I need to do each day in a format which is a bit harder to ignore than to-do lists.

Become A Better Speaker

That’s a big title.  Expect “(on the margin)” in tiny font after the title.

What Other People Say

Let me start off with some great resources, and then I’ll move to a few things that have helped me in the past.

Troy Hunt

Troy is a master of preparation.  Check out the prep work he did for NDC.  I don’t remember if he mentioned it in either of these posts, but he will schedule tweets for certain parts of his talk because he knows exactly when he’ll get to that point.  Also check out his speaker anti-patterns and fix them.  Of his anti-patterns, I most hate the Reader.  Please don’t read your slides; your audience can read faster than you can speak, and they can read the slides later.  Tell engaging stories, tie things together, and give people something they can remember when they do re-read those slides.  You’ll also enjoy his checklist of the 19 things he needs to do before the talk begins.

What I get from Troy:  Practice, practice, practice.  One thing I want to get better at is taping myself and forcing myself to listen to it.  There are verbal tics which you can only find when you listen to yourself.

Brent Ozar

Brent is a top marketer in the SQL Server world.  Here’s some great advice from him on delivering technical presentations.  He’s absolutely right about giving people a link at the end telling them where to get additional materials.  He has additional information on how to handle slides, working with the audience, and getting prepped for the session.  I think his music idea is interesting; at a couple all-day cons we’ve hosted locally, the guy who hosts tends to put on some early morning music before festivities begin.  It lightens up the mood considerably and works a lot better than a bunch of introverts staring quietly at their phones for 40 minutes until someone makes a move to begin.

What I get from Brent:  I’ve cribbed a few things from him.  All of my slide decks have two pieces of information at the end:  “To learn more, go here” and “And for help, contact me.”  The first bit points to a short-form URL I bought (http://CSmore.info), which redirects you to the longer-form address.  At that link, I include slides, a link to my GitHub repo, and additional links and helpful information.  The second bit on contacting me includes my e-mail address and Twitter handle.

Paul Randal

Paul has a great Pluralsight course on communication.  This course isn’t just about presentations, but there are two modules, one on writing presentations and one on giving presentations.  I highly recommend this course.

What I get from Paul:  Practice, prepare, and don’t panic.  Compare against my previous review of this course.  Things will inevitably go wrong, so take spare cables and extras of anything you can fit.  At this point, I even carry a mini projector in case of emergency.  I’ve not needed to use it but there might come a day.

Julia Evans

Julia has a fantastic blog post on improving talks and conferences.  I definitely like her point about understanding your target audience.  Her argument in favor of lightning talks is interesting, and I think for beginners, it’s a good idea.  For more experienced speakers, however, 10 minutes is barely an introduction, and sometimes I want a deep dive.  Those have to be longer talks just by their nature.

Another great point she makes is to give hard talks:  aim for beginners to grasp part of it but for experts to learn from it as well.  I absolutely love Kyle Kingsbury’s work too; he helped me get a handle on distributed systems in a way that let me re-read his posts several months later and pick out points I never got before.

What I get from Julia:  Find your motivation and make your talks work better for a broader range of people.  I have one talk in particular on the APPLY operator whose goal is to make sure that pretty much anybody, regardless of how long they’ve been in the field, learns something new.  There are a couple of examples which are easier for new users to understand and a couple of examples which are definitely more advanced but still straightforward enough that a new user can get there (even if it does take a little longer).  Ideally, I’d like all of my talks to be that way.

What I Recommend

Here are a few recommendations that I’ll throw out at you.  I’m going to try not to include too much overlap with the above links, as I really want you to read those posts and watch those videos.

  • Practice!  Practice until you know the first five minutes cold.  There are some presenters who will practice a talk multiple times a day for several days in a row.  I’m not one of those people, but if you’re capable of it, go for it.
  • Record yourself.  Find all of those placeholder words and beat them out of yourself.  I don’t mean just “Uh,” “Er,” “Um,” “Ah,” and so on.  In my case, I have a bad habit of starting sentences with “So…”  I’m working on eliminating that habit.  Recordings keep you honest.
  • Tell interesting stories.  To crib from one of my 18th century main men, Edmund Burke, “Example is the school of mankind, and they will learn at no other.”  Theory ties your work together, and stories drive the audience.  Stories about failure and recovery from failure are particularly interesting; that’s one of the core tenets of drama.
  • Prep and practice your demos.  If you’re modifying anything (databases, settings, etc.) over the course of your demo, have a revert script at the end or revert your VM.  Otherwise, you’ll finish a great talk, forget that you never rolled everything back, give the talk again later, and watch your demos fail.  Not that this has happened to me…
  • Speaking of failure, prepare for failure.
    • Have extra cables.  I have all kinds of adapters for different types of projectors.  I have VGA, DVI (though I’ve only seen one or two projectors which required this), and HDMI adapters for both of my laptops in my bag at all times.
    • Prepare to be offline.  If your talk can be done offline, you should do it that way.  Internet connections at conferences are pretty crappy, and a lot of demo failures can be chalked up to flaky networks.  This means having your slides available locally, having your demo scripts available locally, etc.
    • Have your slides and demo scripts on a spare drive, USB stick, or something.  If all else fails, maybe you can borrow somebody else’s laptop for the talk.  I had to do this once.  It was embarrassing, but I got through it and actually got good scores.  The trick is to adapt, improvise, and overcome.  And you do that with preparation and practice.
    • Have your slides and demo scripts available online.  I know I mentioned assuming that your internet connection will flake out, but if your system flakes out and someone lends you a laptop but can’t accept USB sticks (maybe it’s a company laptop), at least you can grab the slides and code online.
    • If you do need an internet connection, have a MiFi or phone you can tether to your laptop, just in case.  If you have two or three redundant internet sources, the chances of them all failing are much lower than any single one failing.
    • Have a spare laptop if you can.  That’s hard and potentially expensive, but sometimes a computer just goes bye-bye an hour before your presentation.
  • Install updates on your VMs regularly.  Do it two nights before your presentation; that way, if an update nukes your system, you have a day to recover.  Also, it reduces the risk that Windows 10 will pick your presentation as the perfect time to install 700 updates.  Very helpful, that Windows 10 is.
  • When in doubt, draw it out.  I have embraced touchscreens on laptops, bought a nice stylus, and love drawing on the screen.  I think it helps the audience understand where you’re going better than using a laser pointer, and sometimes you don’t have a whiteboard.  If you don’t like touchscreens, ZoomIt still works with a mouse.
  • Speaking of which, learn how to use ZoomIt or some other magnification tool.  Even if you set your fonts bigger (which yes, you need to do), you will want to focus in on certain parts of text or deal with apps like SQL Server Management Studio which have fixed-size sections.

There are tomes of useful information on this topic, so a single blog post won’t have all of the answers, but hopefully this is a start.

Upcoming Events

First, a plug.  The folks at Red Gate have re-published my Tribal SQL chapter as a standalone article.  If you like it, go out and get a copy of the book, as there are a number of excellent authors there.

  • On May 16-18, I will be attending CarolinaCon.  It’s hard to beat $20 for a weekend of security talks.
  • On May 21st, I am going to present to the TriNUG Data SIG on In-Memory OLTP in SQL Server 2014 (AKA, Hekaton).
  • On June 14th, I am going to attend SQL Saturday #299 in Columbus.  I’m not sure yet if I’ll have the opportunity to present there, however.

Over the next few weeks, I plan to start putting up some more technical posts as I look into Hekaton and a couple of other topics.  The big one that I have is an analogy:  F# is to development DBAs what Powershell is to production DBAs.  I want to introduce F# as a set-based functional language which plays very nicely with a SQL Server infrastructure and provides a shallower learning curve than C#.

Also, now that the weather is warming up, it’s time to start hitting the trails and visiting parks.  This means I should be able to start taking more photographs again.

Continuity: Friend or enemy?

io9 has a great article about continuity and/or the lack thereof in the major contenders to the Marvel Universe.

I begin this with the following preface: I did not and do not read comics. As a kid, I was fascinated by the Transformers comics (why? Because fucking robot dinosaurs, THAT’S WHY), but I never internalized them the way others have. It’s only been through the tremendous comic book movies and video games of the last few years that I’ve gotten interested. I still don’t know that I’d want to read comic books, but I’ve mulled it over, which is more than I could have said, say, ten years ago.

I agree that choice #1 is doomed: Spider-Man is a great character, but you can’t build an entire universe around him. The supporting characters are shit. The only way to pull it off would be to give Green Goblin his own movie, with Spider-Man entirely in the background, and I just don’t know if people would watch that. Venom is an interesting character, but you’d need an insane slow burn to pull it off properly, as far as I understand. Slow burns don’t work in movies; TV series, maybe, but not movies.

#3 is also a tough sell, but I haven’t seen Man of Steel yet, so I’ll refrain from commenting on Superman for the moment. Yet the Batman trilogy was brilliant, and one I can watch again and again (or would, if I had a Blu-Ray player here). And Ben Affleck as Batman…  You’re going to have a nearly impossible time, in terms of pure continuity, trying to connect Nolanverse Batman to Ben Affleck. The only alternative is to treat Batman vs. Superman as a complete reboot of Batman, which runs the risk of losing the very trilogy that made DC Films profitable enough that you COULD even think about a Justice League film.

So entry #2 — Fox, who owns the Fantastic Four and the X-Men? How doomed are they? According to another io9 article, very.

There’s an inherent weakness to the X-Men, in that it’s pretty much an ensemble cast. The X-Men are about complex interrelationships between lots and lots of people. There’s a core — Wolverine, Magneto, Professor X, Jean Grey, maybe Mystique — but there’s a lot of byplay, people being brought forward and pushed back. That’s all perfectly acceptable. Yet what makes the Avengers work is firmly established characters, with clearly defined roles. That doesn’t really exist, at present, for the X-Men films. Cyclops/Scott Summers — a focal point in the comics and many of the video games — is killed off camera in the third movie. Okay, so let’s make a spinoff, say, about Wolverine? Give him his own movies? Those range from “eh” to “god awful.” The Magneto movie — and if any single character needs his own movie in X-Men, it’s Magneto — never materialized.

Yet Fox has made a decision: to make continuity out of all of the existing X-Men movies. First Class was really good; I enjoyed it immensely. Then again, I liked the first two X-Men movies also (the third one was a C+, until I saw what the actual story ought to have been; it’s drifted to a low C, high D now). Getting any kind of reasonable continuity seems nigh impossible, especially given the heavy use of time travel in Days of Future Past, which will make internal continuity difficult; I think it will take a herculean effort. If they pull it off, great; I’m looking forward to seeing how they treat Apocalypse and the Horsemen in the 2016 film.

This brings me to the subject of the Fantastic Four. Those movies I also enjoyed, somewhat. Michael Chiklis was an absolutely brilliant Thing. Jessica Alba was, uh, hot, I guess. Everyone else is more or less forgettable (Chris Evans nailed Captain America in a way he couldn’t the Human Torch). The problem began with how they conceptualized Dr. Doom. Look — Dr. Doom is the key to making the Fantastic Four work. He’s one of the best villains of all time. He’s smart, rich, and has diplomatic immunity; what’s not to hate? The movies made him into a sniveling twerp.

As for the planned reboot… I really don’t have high hopes. I like Kate Mara. I even like Michael B. Jordan. What we don’t know is who will play Dr. Doom. If a good choice is made, there’s a lot of room to grow there. Use the Silver Surfer/Galactus storyline again, too, but AFTER Dr. Doom is well established. Except that I’ve heard of precisely none of the people on the Dr. Doom shortlist. Okay, you’re going young, so you need a young Dr. Doom also. We can’t have Michael Fassbender play Magneto and Dr. Doom, I guess. (They want to make Fantastic Four and X-Men part of the same universe.) Matt Smith, the Eleventh Doctor, wouldn’t be a bad name, to pick one at random, but I’m honestly struggling to think of a great young actor who could pull this off. Everything will depend on that choice. If the actor works, the franchise will do okay out of the gate (because very few people will watch it based on the name alone, and there is almost zero star power right now). If he doesn’t…

Maybe io9 is right.

Hire Fast, Fire Fast

Great advice for a startup.  Great advice for any company.

One bad employee can be a cancer on an organization.  Keeping bad employees around means that you are essentially rewarding poor behavior, so good employees have less incentive to work hard.  Mark Suster focuses on startups, but this applies to large companies as well, even though it’s a lot harder to fire somebody from a large company than from a startup.