Upcoming Events: SQL Saturday Tampa

Key Details

What: SQL Saturday Tampa.
Where: USF MUMA College of Business, 12212 USF Maple Drive, Tampa, Florida, 33620
When: Saturday, February 29th.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

1:15 PM — 2:15 PM — Data Virtualization with PolyBase

I couldn’t fit Tampa into my travel schedule last year, so it’s definitely nice to get the opportunity to go once more.

PolyBase Now Supports the Microsoft Access Excel Driver

SQL Server 2019 Cumulative Update 2 has come out and it is chock-full of PolyBase-related stuff.

One interesting tidbit I read was the following:

FIX: Include support for Excel ODBC driver and Text ODBC driver from Microsoft Access Database Engine Redistributable package in PolyBase for SQL Server 2019

If you tried to use Microsoft’s Excel driver prior to 2019 CU2, you’d get the following error:

Msg 105082, Level 16, State 1, Line LineNumber
105082;Generic ODBC error: [Microsoft][ODBC Excel Driver]Optional feature not implemented

To this point, I recommended in PolyBase Revealed that you use a different driver, like CData’s, which did work. CData’s driver still works (I assume…PolyBase ODBC support is a fluid situation, it seems), but now I can officially say that PolyBase supports the Microsoft Access Database Engine Redistributable driver for Microsoft Excel. Let’s go to the tape.

Prep Work

First thing’s first, we need to get a driver. There are three versions of this driver, one for 2016, one for 2013, and one for 2010. I installed the 2016 version on a virtual machine with SQL Server 2019 CU2 and PolyBase installed and configured, but no version of Microsoft Office installed. I installed the 64-bit driver. If you did this correctly, you should see the following in your ODBC Data Source Administrator:

Finally, support for dBASE.

After that, we create our external data source. If you want to play along at home, grab the North Carolina Population spreadsheet from my totally not sketchy at all Blob Storage container. It should be 479 KB downloaded.

Once you have the file, let’s open up our database master key and create an external data source:

USE [Scratch]
GO
OPEN MASTER KEY DECRYPTION BY PASSWORD = '<<SomeSecureKey>>';
GO

IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_data_sources ds
    WHERE
        ds.name = N'NorthCarolinaPopulationExcelUntyped'
)
BEGIN
    CREATE EXTERNAL DATA SOURCE NorthCarolinaPopulationExcelUntyped WITH
    (
        LOCATION = 'odbc://noplace',
        CONNECTION_OPTIONS = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)}; DBQ=C:\SourceCode\Polybase\NorthCarolinaPopulation.xlsx'
    );
END
GO

If you have already purchased enough copies of PolyBase Revealed to re-roof your house, you may notice that this code looks familiar. It’s pretty close to what we used for reading an Excel file using the CData driver, though instead of specifying a DSN that I’ve created, I’m specifying the file’s location here. Also of importance is that the driver must be listed as Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb) exactly, with all of those file types specified. If you don’t get that right, you’ll get the following error:

Msg 105070, Level 16, State 1, Line 28
105070; The ODBC data source name was not specified or the driver is missing.

Assuming you copy-paste or type the driver’s name correctly, it’s time to create an external table. The CData driver I used in the book required adding a RowID column and took a good look at the column types rather than the column values. That’s why the Excel table creation scripts look the way they do. With the Microsoft Access Database Engine Redistributable driver for Microsoft Excel, there are a few changes. Here’s a valid create script:

IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_tables t
    WHERE
        t.name = N'NorthCarolinaPopulationExcelUntyped'
)
BEGIN
    CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulationExcelUntyped
    (
        SUMLEV FLOAT(53),
        COUNTY FLOAT(53),
        PLACE FLOAT(53),
        PRIMGEO_FLAG FLOAT(53),
        NAME NVARCHAR(255),
        POPTYPE NVARCHAR(255),
        YEAR FLOAT(53),
        POPULATION FLOAT(53)
    )
    WITH
    (
        LOCATION = '[NorthCarolinaPopulation$]',
        DATA_SOURCE = NorthCarolinaPopulationExcelUntyped
    );
END
GO

How is this different? Let me count the ways.

  1. The script does not include a RowID column. If you do try to include this, the create statement will fail.
  2. The data types are different. This Excel driver looks at the first several rows (I think it’s 8 by default, but I’m not an expert with this driver) and infers data types which fit. Therefore, all of the numeric features were turned into FLOAT(53) even though I have the columns typed as generic text in the Excel file itself.
  3. The LOCATION parameter is a bit different: I need to add a dollar sign after the worksheet name. The brackets around the name are optional, but if you fail to include the dollar sign, you will get the following error:

Msg 2706, Level 16, State 1, Line 30
Table ‘`NorthCarolinaPopulation`’ does not exist.

Once you take care of creating the external table the right way, you get results:

Finally, we can unleash the power of Microsoft Excel.

Conclusion

Microsoft continues to invest in PolyBase. After the release of SQL Server 2019, the SQL Server team had 16 separate fixes and enhancements which reference PolyBase in the title. This is a nice one because it means you can read Excel files without having to pay for a third-party driver.

Upcoming Events: SQL Saturday South Florida BI

Key Details

What: SQL Saturday South Florida, BI Edition.
Where: Microsoft FLL, 6750 N Andrews Ave, Suite #400, Fort Lauderdale, Florida, 33309
When: Saturday, February 22nd.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

4:00 PM — 4:50 PM — Data Virtualization with PolyBase

This will be my second time at a SQL Saturday South Florida event, and I think this is a great time of year to host one.

The Importance of 0 in Regressions

Mala Mahadevan (b | t) brought an article to my attention regarding a regression between elevation and introversion/extroversion by state from a few years back. Before I get into this, I want to note that I haven’t read the linked journal article and am not casting aspersions at the blog post author or the journal article authors, but this was a good learning opportunity for an important concept.

Here is the original image:

I see a line. Definitely, definitely a line.

So boom, extroverts want to live on flat land and introverts in the mountains. Except that there are a few problems with this interpretation. Let’s go through them. I’ll focus entirely on interpreting this regression and try to avoid getting down any tangential rabbit holes…though knowing me, I’ll probably end up in one or two.

The Line is NOT the Data

One of the worst things we can do as data analysts is to interpret a regression line as the most important thing on a visual. The important thing here is the per-state set of data points, but our eyes are drawn to the line. The line mentally replaces the data, but in doing so, we lose the noise. And boy, is there a lot of noise.

Boy, is There a Lot of Noise

I don’t have the raw values but I think I can fake it well enough here to explain my point. If you look at a given elevation difference, there are some huge swings in values. For example, check out the four boxes I drew:

Mastery of boxes? Check.

On the left-most box, approximately the same elevation difference relates to ranges from roughly -0.6 to 1.8 or so. Considering that our actual data ranges from -2 to approximately 3, we’re talking about a huge slice. The second box spans the two extremes. The third and fourth boxes also take up well over half the available space.

This takes us to a problem with the thin line:

The Thin Black Line

When we draw a regression line, we typically draw a thin line to avoid overwhelming the visual. The downside to this is that it implies a level of precision which the data won’t support. We don’t see states clustered around this thin line; they’re all around it. Incorporating the variance in NEO E zscore for a given elevation difference, we have something which looks more like this:

That’s a thick line.

Mind you, I don’t have the actual numbers so I’m not drawing a proper confidence interval. I think it’d be pretty close to this, though, just from eyeballing the data and recognizing the paucity of data points.

So what’s the problem here? The lines are all pointing in the same direction, so there’s definitely a relationship…right?

Zeroing in on the Problem

Looking at the vertical axis, we have a score which runs from -2 to 3(ish), where negative numbers mean introversion and positive numbers extroversion. That makes 0 the midpoint where people are neither introverted nor extroverted. This is important because we want to show not only that this relationship is negative, but that it is meaningful. A quick and dirty way we can check this is to see how much of our confidence interval is outside the zero line. After all, we’re trying to interpret this as “people who live in higher-elevation areas tend to be more introverted.”

The answer? Not much.

With our fat confidence interval guess, the confidence interval for all 50 states (plus one swamp) includes the 0 line, meaning that even though we can draw a line pointing downward, we can’t conclusively say that there is any sort of relationship between introversion/extroversion and differences in elevation because both answers are within our realm of possibility for the entire range of the visual.

But hey, maybe I’m way off on my confidence interval guess. Let’s tighten it up quite a bit and shrink it roughly in half. That gives us an image which looks like this:

Conclusive evidence that Alaskans are introverts.

If I cut that confidence interval roughly in half, I lose enough states that those CI bars probably are too narrow. Conclusions we can draw include:

  • Any state with an elevation difference over ~16,000 is likely to have a NEO E zscore below 0.
  • Alaska is the only state with an elevation difference over 16,000.

For all of the other states, well, we still can’t tell.

Conclusion

Looking solely at this image, we can’t tell much about NEO E zscore versus elevation difference except that there appears to be a negative correlation which is meaningful for any state above 16,000 feet of difference in elevation. Based on the raw picture, however, your eyes want to believe that there’s a meaningful negative correlation. It’s just not there, though.

Bonus Round: Rabbit Holes I Semi-Successfully Avoided

I won’t get into any of these because they’re tangents and the further I get away from looking at the one picture, the more likely it is that I end up talking about something which the paper authors covered. Let me reiterate that I’m not trashing the underlying paper, as I haven’t read it. But here are a few things I’d want to think about:

  • This data is at the state level and shows elevation difference. When sampling, it seems like you’d want to sample at something closer to the county level in order to get actual elevation. After all, the conjecture is that there is a separating equilibrium between extroverts and introverts based on elevation.
  • Elevation difference is also a bit weird of a proxy to use by state. Not so weird that it’s hinky, but weird enough to make me think about it.
  • Looking at Alaska in particular, they had 710K people as of the 2010 census, but here are the top cities and their elevations:
CityPopulationElevation (feet)
Anchorage291,826102
Fairbanks31,535446
Juneau31,27556
Sitka8,88126
Ketchikan8,0500
Wasilla7,831341
Kenai7,10072
Kodiak6,13049
Bethel6,0803
I gave up after 9. Frankly, that’s 9 more than I expected to do.

This tells us that, at a minimum, ~56% of Alaska residents lived at or near sea level despite being one of the most mountainous states. If introverts want to live in high-elevation areas, it’s a little weird that they’re flocking to the coastline, which is supposed to be a high-extroversion area based on the journal article’s summary. But again, I didn’t read the article (or even look for a non-gated copy), so take that with plenty of grains of salt.