The Importance of 0 in Regressions

Mala Mahadevan (b | t) brought an article to my attention regarding a regression between elevation and introversion/extroversion by state from a few years back. Before I get into this, I want to note that I haven’t read the linked journal article and am not casting aspersions at the blog post author or the journal article authors, but this was a good learning opportunity for an important concept.

Here is the original image:

I see a line. Definitely, definitely a line.

So boom, extroverts want to live on flat land and introverts in the mountains. Except that there are a few problems with this interpretation. Let’s go through them. I’ll focus entirely on interpreting this regression and try to avoid getting down any tangential rabbit holes…though knowing me, I’ll probably end up in one or two.

The Line is NOT the Data

One of the worst things we can do as data analysts is to interpret a regression line as the most important thing on a visual. The important thing here is the per-state set of data points, but our eyes are drawn to the line. The line mentally replaces the data, but in doing so, we lose the noise. And boy, is there a lot of noise.

Boy, is There a Lot of Noise

I don’t have the raw values but I think I can fake it well enough here to explain my point. If you look at a given elevation difference, there are some huge swings in values. For example, check out the four boxes I drew:

Mastery of boxes? Check.

On the left-most box, approximately the same elevation difference relates to ranges from roughly -0.6 to 1.8 or so. Considering that our actual data ranges from -2 to approximately 3, we’re talking about a huge slice. The second box spans the two extremes. The third and fourth boxes also take up well over half the available space.

This takes us to a problem with the thin line:

The Thin Black Line

When we draw a regression line, we typically draw a thin line to avoid overwhelming the visual. The downside to this is that it implies a level of precision which the data won’t support. We don’t see states clustered around this thin line; they’re all around it. Incorporating the variance in NEO E zscore for a given elevation difference, we have something which looks more like this:

That’s a thick line.

Mind you, I don’t have the actual numbers so I’m not drawing a proper confidence interval. I think it’d be pretty close to this, though, just from eyeballing the data and recognizing the paucity of data points.

So what’s the problem here? The lines are all pointing in the same direction, so there’s definitely a relationship…right?

Zeroing in on the Problem

Looking at the vertical axis, we have a score which runs from -2 to 3(ish), where negative numbers mean introversion and positive numbers extroversion. That makes 0 the midpoint where people are neither introverted nor extroverted. This is important because we want to show not only that this relationship is negative, but that it is meaningful. A quick and dirty way we can check this is to see how much of our confidence interval is outside the zero line. After all, we’re trying to interpret this as “people who live in higher-elevation areas tend to be more introverted.”

The answer? Not much.

With our fat confidence interval guess, the confidence interval for all 50 states (plus one swamp) includes the 0 line, meaning that even though we can draw a line pointing downward, we can’t conclusively say that there is any sort of relationship between introversion/extroversion and differences in elevation because both answers are within our realm of possibility for the entire range of the visual.

But hey, maybe I’m way off on my confidence interval guess. Let’s tighten it up quite a bit and shrink it roughly in half. That gives us an image which looks like this:

Conclusive evidence that Alaskans are introverts.

If I cut that confidence interval roughly in half, I lose enough states that those CI bars probably are too narrow. Conclusions we can draw include:

  • Any state with an elevation difference over ~16,000 is likely to have a NEO E zscore below 0.
  • Alaska is the only state with an elevation difference over 16,000.

For all of the other states, well, we still can’t tell.

Conclusion

Looking solely at this image, we can’t tell much about NEO E zscore versus elevation difference except that there appears to be a negative correlation which is meaningful for any state above 16,000 feet of difference in elevation. Based on the raw picture, however, your eyes want to believe that there’s a meaningful negative correlation. It’s just not there, though.

Bonus Round: Rabbit Holes I Semi-Successfully Avoided

I won’t get into any of these because they’re tangents and the further I get away from looking at the one picture, the more likely it is that I end up talking about something which the paper authors covered. Let me reiterate that I’m not trashing the underlying paper, as I haven’t read it. But here are a few things I’d want to think about:

  • This data is at the state level and shows elevation difference. When sampling, it seems like you’d want to sample at something closer to the county level in order to get actual elevation. After all, the conjecture is that there is a separating equilibrium between extroverts and introverts based on elevation.
  • Elevation difference is also a bit weird of a proxy to use by state. Not so weird that it’s hinky, but weird enough to make me think about it.
  • Looking at Alaska in particular, they had 710K people as of the 2010 census, but here are the top cities and their elevations:
CityPopulationElevation (feet)
Anchorage291,826102
Fairbanks31,535446
Juneau31,27556
Sitka8,88126
Ketchikan8,0500
Wasilla7,831341
Kenai7,10072
Kodiak6,13049
Bethel6,0803
I gave up after 9. Frankly, that’s 9 more than I expected to do.

This tells us that, at a minimum, ~56% of Alaska residents lived at or near sea level despite being one of the most mountainous states. If introverts want to live in high-elevation areas, it’s a little weird that they’re flocking to the coastline, which is supposed to be a high-extroversion area based on the journal article’s summary. But again, I didn’t read the article (or even look for a non-gated copy), so take that with plenty of grains of salt.

One thought on “The Importance of 0 in Regressions

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s