Upcoming Events: Hartford SQL Server Users Group

Key Details

What: Hartford SQL Server Users Group.
Where: 29 South Main Street, 4th floor, West Hartford, CT
When: Saturday, March 7th.
Admission is free. Register on the user group’s Meetup.

What I’m Presenting

5:30 PM — 7:30 PM — Approaching Zero

I won’t be able to make it to Hartford, unfortunately, so this will be a remote presentation.

Upcoming Events: SQL Saturday Baton Rouge BI

Key Details

What: SQL Saturday Baton Rouge, BI Edition.
Where: LSU Patrick Taylor Hall, Baton Rouge, Louisiana
When: Saturday, March 7th.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

11:30 AM — 12:30 PM — Classification with Naive Bayes
1:45 PM — 2:45 PM — Launching a Data Science Project: Cleaning is Half the Battle

It took me several years to make it to Baton Rouge for a SQL Saturday, and now I’m going back for the BI edition as well.

SQL Saturday Raleigh Pre-Cons

We have two pre-cons going this year at SQL Saturday Raleigh. These are both full-day trainings and will take place on Friday, April 17th at the Microsoft office in Raleigh.

Data Science with Databricks by Ginger Grant, currently $125.

Databricks is a very popular environment for developing data science solutions. More and more companies are interested in Databricks as it is very simple to set up and contains a collaborative workspace for working with a team of people. In this session you will see how to create Machine Learning Solutions with multiple workflows starting with ETL, to data exploration, model experimentation, and lastly to a production release of a data science solution.

Today, more and more development is performed on very large datasets. Attendees will learn how to use Apache Spark, which is part of Databricks, to rapidly analyze lots of data. Learn how to use Databricks to reduces operational complexity to create solutions which can be scaled up or down depending on the amount of data needed to process without having to change the underlying code.

Python, Jupyter Notebooks, and Apache Spark are the technologies used to create solutions within this session. No experience is required.

PowerShell Top to Bottom by Mark Wilkinson, also currently $125.

Lots of people are talking about PowerShell these days. If you are DBA that manages a few servers you have probably even used some PowerShell here and there that you found on the internet, but what if you need to write a script or function from scratch? What if you already have a few scripts that you have on hand but want to organize them into a module to distribute to your team members?

In this session we will be exploring the powerful and flexible “PowerShell” scripting language. We’ll start with the basics of writing simple scripts and work our way up to more complex topics like writing advanced functions, executing tasks in parallel, creating modules, and embedding .NET code in your PowerShell code. This will be an interactive session where we will work to build a useful set of functions to illustrate the many strengths of PowerShell.

These two trainings are a great way of getting even more out of SQL Saturday Raleigh.

PolyBase Now Supports the Microsoft Access Excel Driver

SQL Server 2019 Cumulative Update 2 has come out and it is chock-full of PolyBase-related stuff.

One interesting tidbit I read was the following:

FIX: Include support for Excel ODBC driver and Text ODBC driver from Microsoft Access Database Engine Redistributable package in PolyBase for SQL Server 2019

If you tried to use Microsoft’s Excel driver prior to 2019 CU2, you’d get the following error:

Msg 105082, Level 16, State 1, Line LineNumber
105082;Generic ODBC error: [Microsoft][ODBC Excel Driver]Optional feature not implemented

To this point, I recommended in PolyBase Revealed that you use a different driver, like CData’s, which did work. CData’s driver still works (I assume…PolyBase ODBC support is a fluid situation, it seems), but now I can officially say that PolyBase supports the Microsoft Access Database Engine Redistributable driver for Microsoft Excel. Let’s go to the tape.

Prep Work

First thing’s first, we need to get a driver. There are three versions of this driver, one for 2016, one for 2013, and one for 2010. I installed the 2016 version on a virtual machine with SQL Server 2019 CU2 and PolyBase installed and configured, but no version of Microsoft Office installed. I installed the 64-bit driver. If you did this correctly, you should see the following in your ODBC Data Source Administrator:

Finally, support for dBASE.

After that, we create our external data source. If you want to play along at home, grab the North Carolina Population spreadsheet from my totally not sketchy at all Blob Storage container. It should be 479 KB downloaded.

Once you have the file, let’s open up our database master key and create an external data source:

USE [Scratch]
GO
OPEN MASTER KEY DECRYPTION BY PASSWORD = '<<SomeSecureKey>>';
GO

IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_data_sources ds
    WHERE
        ds.name = N'NorthCarolinaPopulationExcelUntyped'
)
BEGIN
    CREATE EXTERNAL DATA SOURCE NorthCarolinaPopulationExcelUntyped WITH
    (
        LOCATION = 'odbc://noplace',
        CONNECTION_OPTIONS = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)}; DBQ=C:\SourceCode\Polybase\NorthCarolinaPopulation.xlsx'
    );
END
GO

If you have already purchased enough copies of PolyBase Revealed to re-roof your house, you may notice that this code looks familiar. It’s pretty close to what we used for reading an Excel file using the CData driver, though instead of specifying a DSN that I’ve created, I’m specifying the file’s location here. Also of importance is that the driver must be listed as Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb) exactly, with all of those file types specified. If you don’t get that right, you’ll get the following error:

Msg 105070, Level 16, State 1, Line 28
105070; The ODBC data source name was not specified or the driver is missing.

Assuming you copy-paste or type the driver’s name correctly, it’s time to create an external table. The CData driver I used in the book required adding a RowID column and took a good look at the column types rather than the column values. That’s why the Excel table creation scripts look the way they do. With the Microsoft Access Database Engine Redistributable driver for Microsoft Excel, there are a few changes. Here’s a valid create script:

IF NOT EXISTS
(
    SELECT 1
    FROM sys.external_tables t
    WHERE
        t.name = N'NorthCarolinaPopulationExcelUntyped'
)
BEGIN
    CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulationExcelUntyped
    (
        SUMLEV FLOAT(53),
        COUNTY FLOAT(53),
        PLACE FLOAT(53),
        PRIMGEO_FLAG FLOAT(53),
        NAME NVARCHAR(255),
        POPTYPE NVARCHAR(255),
        YEAR FLOAT(53),
        POPULATION FLOAT(53)
    )
    WITH
    (
        LOCATION = '[NorthCarolinaPopulation$]',
        DATA_SOURCE = NorthCarolinaPopulationExcelUntyped
    );
END
GO

How is this different? Let me count the ways.

  1. The script does not include a RowID column. If you do try to include this, the create statement will fail.
  2. The data types are different. This Excel driver looks at the first several rows (I think it’s 8 by default, but I’m not an expert with this driver) and infers data types which fit. Therefore, all of the numeric features were turned into FLOAT(53) even though I have the columns typed as generic text in the Excel file itself.
  3. The LOCATION parameter is a bit different: I need to add a dollar sign after the worksheet name. The brackets around the name are optional, but if you fail to include the dollar sign, you will get the following error:

Msg 2706, Level 16, State 1, Line 30
Table ‘`NorthCarolinaPopulation`’ does not exist.

Once you take care of creating the external table the right way, you get results:

Finally, we can unleash the power of Microsoft Excel.

Conclusion

Microsoft continues to invest in PolyBase. After the release of SQL Server 2019, the SQL Server team had 16 separate fixes and enhancements which reference PolyBase in the title. This is a nice one because it means you can read Excel files without having to pay for a third-party driver.

Upcoming Events: SQL Saturday South Florida BI

Key Details

What: SQL Saturday South Florida, BI Edition.
Where: Microsoft FLL, 6750 N Andrews Ave, Suite #400, Fort Lauderdale, Florida, 33309
When: Saturday, February 22nd.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

4:00 PM — 4:50 PM — Data Virtualization with PolyBase

This will be my second time at a SQL Saturday South Florida event, and I think this is a great time of year to host one.