With ML Services, Watch Those Resource Groups

I wanted to cover something which has bitten me in two separate ways regarding SQL Server Machine Learning Services and Resource Governor.

Resource Governor and Default Memory

If you install a brand new copy of SQL Server and enable SQL Server Machine Learning Services, you’ll want to look at sys.resource_governor_external_resource_pools:

That’s a mighty fine cap you’re wearing.

By default, SQL Server will grant 20% of available memory to any R or Python scripts running. The purpose of this limit is to prevent you from hurting server performance with expensive external scripts (like, say, training large neural networks on a SQL Server).

Here’s the kicker: this affects you even if you don’t have Resource Governor enabled. If you see out-of-memory exceptions in Python or error messages about memory allocation in R, I’d recommend bumping this max memory percent up above 20, and I have scripts to help you with the job. Of course, making this change assumes that your server isn’t stressed to the breaking point; if it is, you might simply want to offload that work somewhere else.
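As a minimal sketch of the idea (not a substitute for those scripts): check the current caps, then raise the memory cap on the default external pool. The 50 below is just an example value, so pick a percentage appropriate for your server.

-- See the current caps on the external resource pools.
SELECT name, max_cpu_percent, max_memory_percent, max_processes
FROM sys.resource_governor_external_resource_pools;

-- Raise the memory cap on the default external pool and apply the change.
ALTER EXTERNAL RESOURCE POOL [default] WITH (MAX_MEMORY_PERCENT = 50);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO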

Resource Governor and CPU

Notice that by default, the max CPU percent for external pools is 100, meaning that we get to push the server to its limits with respect to CPU.

Well, what happens if you accidentally change that? I found out the answer the hard way!

In my case, our servers were accidentally scaled down to 1% max CPU utilization. The end result was that even something as simple as print("Hello") in either R or Python would fail after 30 seconds. I thought it had to do with the Launchpad service causing problems, but after investigation, this was the culprit.

Identities blurred to protect the innocent.

The trickiest part about diagnosing this was that nothing pointed at the real problem: the query failed with a vague “could not connect to Launchpad” error, and the Launchpad error logs didn’t have any messages about the failed queries. So that’s one more thing to keep in mind when troubleshooting Machine Learning Services failures.
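For reference, putting the cap back is quick once you know it’s the culprit. This sketch assumes the default external pool was the one scaled down; adjust the pool name if yours differs.

-- Restore the default external pool's CPU cap to 100 percent and apply the change.
ALTER EXTERNAL RESOURCE POOL [default] WITH (MAX_CPU_PERCENT = 100);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO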

PolyBase and Excel: TOP Now Works

Back with SQL Server 2019 CU2, I reported an error with PolyBase connecting to Excel when trying to select TOP(10) from the table. I’m using the Microsoft Access Database Engine 2016 Redistributable’s Excel driver.

Here’s a sample external data source and table statement:

CREATE EXTERNAL DATA SOURCE VolcanoType WITH
(
    LOCATION = 'odbc://noplace',
    CONNECTION_OPTIONS = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)}; DBQ=C:\SourceCode\Polybase\Scripts\04 - Data Virtualization\VolcanoTypes.xlsx'
);

CREATE EXTERNAL TABLE dbo.VolcanoType
(
    Type NVARCHAR(100),
    Description NVARCHAR(1000)
)
WITH
(
    LOCATION = '[VolcanoTypes$]',
    DATA_SOURCE = VolcanoType
);

Prior to CU5, I could run SELECT * FROM dbo.VolcanoType successfully, but trying SELECT TOP(10) * FROM dbo.VolcanoType would return the following error:

Msg 7320, Level 16, State 110, Line 1
Cannot execute the query “Remote Query” against OLE DB provider “MSOLEDBSQL” for linked server “(null)”. 105082;Generic ODBC error: [Microsoft][ODBC Excel Driver] Syntax error (missing operator) in query expression ‘`T_0`.`Type` `Type`’. .

Now, I could get this to work by turning off external pushdown: SELECT TOP(10) * FROM dbo.VolcanoType OPTION(DISABLE EXTERNALPUSHDOWN);

But now with CU5, I was excited to see bug reference 13525968 corrected:

Disables PolyBase Generic ODBC External Data Sources’ default behavior of pushing down the TOP operator and calling the SQLRowCount function

After installing CU5, I can run a TOP() operation hitting Excel and it won’t fail, so that’s good. Unfortunately, another bug still hasn’t been fixed, so I’m hoping they’re able to take care of it in a future CU.

NVARCHAR Everywhere: A Thought Experiment

Doubling Down on Madness

In the last episode of Shop Talk, I laid out an opinion which was…not well received. So I wanted to take some time and walk through my thinking a little more cogently than I was able to do during Shop Talk.

Here’s the short version. When you create a table and need a string column, you have a couple of options available: VARCHAR and NVARCHAR. Let’s say that you’re a developer creating a table to store some string data. Do you choose VARCHAR or NVARCHAR? The classic answer is, “It depends.” And so I talk about why that is in video format right below these words.

A Video from the Void

The Camps

Camp One: only use VARCHAR. Prior to SQL Server 2019, this is basically the set of people who never have to deal with internationalization. If you’re running solo projects or building systems where you know the complete set of users, and if there’s no need for Unicode, I can understand this camp. For projects of any significance, though, you usually have to go elsewhere.

Camp Two: default to VARCHAR, only use NVARCHAR when necessary. There are a lot of people in this camp, especially in the western world. Most of the companies I’ve worked at live in this camp.

Camp Three: default to NVARCHAR, but use VARCHAR when you know you don’t need Unicode. This is a fairly popular group as well, and outside of this thought experiment, I probably end up here.

Aaron Bertrand lays out the costs and benefits of Camps Two and Three (or, Camps VARCHAR Mostly and NVARCHAR Mostly), so I recommend reading his arguments and understanding that I am sympathetic to them.

But there is also Camp Four: NVARCHAR everywhere. And this camp is growing on me.

Why NVARCHAR Everywhere?

I see several benefits to this:

  • Developers and product owners don’t need to think about or guess whether a particular string value will ever contain Unicode data. Sometimes we guess wrong, and migrating from VARCHAR to NVARCHAR can be a pain.
  • NVARCHAR Everywhere avoids implicit conversion between string columns because you can assume that everything is NVARCHAR. Implicit conversion can be a nasty performance impediment (see the sketch after this list).
  • Furthermore, you can train developers to preface string literals with N, ensure that data access tools ask for Unicode strings (most ORMs either default to Unicode or know enough to do it right), and ensure that every stored procedure string parameter is NVARCHAR because there are no exceptions. That’s one less thing you ever have to think about when designing or tuning a database and one less area where ambiguity in design can creep in.
  • If somebody tries to store Unicode data in a VARCHAR column, that information is silently lost; with NVARCHAR everywhere, that risk goes away.
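To make the implicit conversion point concrete, here is a minimal sketch against a hypothetical dbo.Customer table; the table, column, and literal are made up for illustration, but the mechanics are the general ones.

-- Hypothetical table: a VARCHAR code column with a supporting index.
CREATE TABLE dbo.Customer
(
    CustomerID INT IDENTITY(1,1) NOT NULL,
    CustomerCode VARCHAR(20) NOT NULL,
    CONSTRAINT [PK_Customer] PRIMARY KEY CLUSTERED(CustomerID)
);
GO
CREATE INDEX [IX_Customer_CustomerCode] ON dbo.Customer
(
    CustomerCode
);
GO
-- The N'...' literal is NVARCHAR, which outranks VARCHAR in data type precedence,
-- so SQL Server wraps CustomerCode in a CONVERT_IMPLICIT.  Under a SQL collation,
-- that can turn an index seek into an index scan; if everything is NVARCHAR,
-- the types already match and the question never comes up.
SELECT c.CustomerID
FROM dbo.Customer c
WHERE c.CustomerCode = N'ABC123';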

Why Not NVARCHAR Everywhere?

The first thing you’ll hear from people about this is storage requirements: NVARCHAR characters are typically 2 bytes, whereas equivalent VARCHAR characters are typically 1 byte. For the nuanced version of this, Solomon Rutzky goes into great detail on the topic, but let’s stick with the simplistic version for now because I don’t think the added nuance changes the story any.

SQL Server has Unicode compression, meaning that, per row, if the data in a column can fit in your collation’s code page, the database engine can compress the Unicode data to take as much space as equivalent VARCHAR data would—maybe it’s a little bigger but we’re talking a tiny amount. Enabling row-level compression turns on Unicode compression as well and can provide additional compression benefits. And page-level compression does an even better job at saving space on disk. There are CPU costs, but my experience has been that compression will often be faster because I/O subsystems are so much slower than CPU, even with fancy all-flash arrays or direct-attached NVMe.

The exception is if you are using NVARCHAR(MAX) as your data type. In that case, Unicode and row-level compression won’t do anything and page-level compression only works if your data fits on a page rather than falling into LOB. Hugo Kornelis covers why that is. So that’s a weakness, which means I need a bulleted list here.

  • NVARCHAR(MAX) columns with overflow to LOB will be larger than their VARCHAR counterparts and we cannot use Unicode, Row, or Page compression to reduce storage.
  • If your maximum data length is between 4001 and 8000 characters, you know the column will never have Unicode characters, and the data is highly compressible, you will save a lot of space with VARCHAR plus page-level compression, whereas in this zone, you’d need to use an NVARCHAR(MAX) column and lose out.
  • If you are in the unlucky situation where even row-level compression tanks your performance—something I’ve never seen but acknowledge it as a possibility—going with NVARCHAR becomes a trade-off between reducing storage and maximizing performance.

The Demo Code

In case you want to try out the demo code on your own, here it is:

USE [Scratch]
GO
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
GO
CREATE TABLE dbo.TestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn VARCHAR(150) NOT NULL,
    SomeOtherStringColumn VARCHAR(30) NOT NULL,
    CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED(Id)
);
GO
CREATE INDEX [IX_TestTable_SomeStringColumn] ON dbo.TestTable
(
    SomeStringColumn
);
GO
CREATE TABLE dbo.NTestTable
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(150) NOT NULL,
    SomeOtherStringColumn NVARCHAR(30) NOT NULL,
    CONSTRAINT [PK_NTestTable] PRIMARY KEY CLUSTERED(Id)
);
CREATE INDEX [IX_NTestTable_SomeStringColumn] ON dbo.NTestTable
(
    SomeStringColumn
);
GO

-- Test 1:  It's All ASCII.
INSERT INTO dbo.TestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE('A', 150),
    REPLICATE('X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 150),
    REPLICATE(N'X', 30)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 2:  Unicode me.
SELECT DATALENGTH(N'🐭');
SELECT DATALENGTH(N'🪕');
GO
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'🐭', 75),
    REPLICATE(N'🪕', 15)
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';

-- Test 3:  Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO

-- Check the DATALENGTH
SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.TestTable;

SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.NTestTable;

-- Row Compression includes Unicode compression.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = ROW);
GO
-- Test 3a:  Continue to Mix It Up.
TRUNCATE TABLE dbo.NTestTable;
INSERT INTO dbo.NTestTable
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.TestTable';
EXEC sp_spaceused 'dbo.NTestTable';
GO

-- Another check of the DATALENGTH
SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.TestTable;

SELECT TOP(1)
    SomeStringColumn,
    DATALENGTH(SomeStringColumn)
FROM dbo.NTestTable;

-- Let's check the LOB
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO
CREATE TABLE dbo.NTestTableLob
(
    Id INT IDENTITY(1,1) NOT NULL,
    SomeStringColumn NVARCHAR(MAX) NOT NULL,
    SomeOtherStringColumn NVARCHAR(MAX) NOT NULL,
    CONSTRAINT [PK_NTestTableLob] PRIMARY KEY CLUSTERED(Id) WITH(DATA_COMPRESSION = ROW)
);
-- Can't use NVARCHAR(MAX) as a key column in an index...
/* CREATE INDEX [IX_NTestTableLob_SomeStringColumn] ON dbo.NTestTableLob
(
    SomeStringColumn
)  WITH(DATA_COMPRESSION = ROW); */
GO

-- No overflow necessary.
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(100000)
    REPLICATE(N'A', 148) + N'🐭',
    REPLICATE(N'X', 28) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- What about page-level compression?
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- And to be fair, we'll see the same on NTestTable.
ALTER INDEX ALL ON dbo.NTestTable REBUILD WITH (DATA_COMPRESSION = PAGE);

EXEC sp_spaceused 'dbo.NTestTable';
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- My page runneth over.
TRUNCATE TABLE dbo.NTestTableLob;
-- Let's reset the data compression.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = NONE);
INSERT INTO dbo.NTestTableLob
(
    SomeStringColumn,
    SomeOtherStringColumn
)
SELECT TOP(10000)
    -- Cast to NVARCHAR(MAX) first; otherwise REPLICATE truncates its result at 8,000 bytes.
    REPLICATE(CONVERT(NVARCHAR(MAX), N'🐭'), 14800),
    REPLICATE(CONVERT(NVARCHAR(MAX), N'X'), 28000) + N'🪕'
FROM sys.all_columns ac
    CROSS JOIN sys.all_columns ac2;
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO
-- Now we compress.
ALTER INDEX ALL ON dbo.NTestTableLob REBUILD WITH (DATA_COMPRESSION = PAGE);
GO
EXEC sp_spaceused 'dbo.NTestTableLob';
GO

-- Time to clean up.
DROP TABLE IF EXISTS dbo.TestTable;
DROP TABLE IF EXISTS dbo.NTestTable;
DROP TABLE IF EXISTS dbo.NTestTableLob;
GO

SELECT
    N'🐭' as Mouse,
    '🐭' as [Mouse?];

Final Thoughts…For Now

I think where I stand right now is, for greenfield database development, I heavily bias toward NVARCHAR and could even say NVARCHAR Everywhere. I think the benefits outweigh the costs here.

For brownfield database development, it’s a harder call to make because you almost certainly have a mix of VARCHAR and NVARCHAR data types. If you already have a solid system within a brownfield database, stick with that system. For example, you might use NVARCHAR for user-entry fields but VARCHAR for internal system fields like codes and levels. If that pattern works for you, that’s fine.

If you’re in a brownfield development mess, I can see the potential benefit of migrating to NVARCHAR Everywhere, but the work-to-benefit ratio is probably not going to be acceptable for most companies. The exception here is if you find out that you’re losing valuable customer data and need to go through an internationalization project. It might be tempting to change the minimum amount necessary, though my philosophy is that if you have the opportunity to make big improvements, take them.

But as I mention in the video, I’m interested in your thoughts as well. So add them to the video or drop them in here. Is there something big I’m missing which makes NVARCHAR Everywhere untenable? Have I shifted your thinking at least a little bit? Let me know.

The Non-Technical Life Tip for Getting More Things Done

T-SQL Tuesday

This month’s T-SQL Tuesday topic comes from Kenneth Fisher.

T-SQL Tuesday #127

How about for this months TSQL Tuesday let’s do another month of tips and tricks. But just to be a bit different, nothing to do with SQL Server (in fact let’s say nothing related to a DBMS). No SSMS tricks, no T-SQL tips, something completely unrelated. For example did you know that, in Windows, if you grab a window and shake it then all of the other windows will be minimized? And then if you shake it again they all pop back up.

So give me a non SQL related tip or trick. OS (windows, mac), Cloud (Azure, AWS), etc. Something you think will be handy for the rest of us.

Today, I’m going back to a classic.

The Non-Technical Life Tip for Getting More Things Done

You can put special emphasis on “the” there if you’d like. But that tip is managing your calendar. Back in 2019, I talked about a strategy I had put together over the prior couple of months. The gist of it is, fill in your schedule with the things you want to do. Here’s what it looked like back then:

This worked pretty well, but it had a couple of problems:

  • I still had multiple calendars, so the only place where I had my schedule in full was on my phone.
  • My phone is where I go to lose lots of time.
  • When I do need to be flexible, rearranging the calendar becomes a mess.
  • It’s hard to keep track of the things I want to do but don’t have strictly scheduled times set up already.

So what ended up happening is that, over time, I switched back to using a todo list, landing on Todoist given its functionality.

ToDo Lists Aren’t Perfect

Using a todo list did help me with some parts of the problem, particularly around flexibility and wanting to do things later but where “later” isn’t on a schedule. Eventually, I stopped filling in the calendar entries altogether.

But that led me back to the original problem I had: I wasn’t getting stuff done. I’d have 60-70 items on my todo list, knowing that my regular cadence was maybe 10-12 items per day, with 15-18 on a very productive day.

The math doesn’t work out, so let me walk through a dramatized version of my day:

  • Add a few items to my todo list that need to be done today.
  • Finish my normal 10-12 items.
  • Figure out how to reschedule the other 50 things on the list, bumping them back by a week or two.

The problem is that todo lists don’t really give you any of the discipline of having a plan or knowledge of what you can achieve. So you end up with eyes bigger than your stomach (to misappropriate a metaphor) and a huge list of things to do but hey, there’s still 6 hours in the day to accomplish these 50 tasks so let’s watch some TV for a while.

Becoming Indistractable

Not too long ago, I started listening to a book called Indistractable. I’m roughly halfway through it and most of the book is stuff I’ve heard, done, or otherwise know. But in the book, Nir Eyal recommends keeping a calendar with specific entries…sort of like what I did in 2019. But there are two extra catches.

Here’s what my new calendar looks like:

Calendar 2: Calend Harder

The first difference is that this is all of my calendar entries in one, including work, community, and personal. I don’t have to use my phone to see the whole thing (although I still can).

The second difference is in the colors. I now have several colors, which represent different things. Green represents personal time; yellow and orange represent work and work-related meetings, respectively; and dark grey represents relationships and time with others. I have a couple of other entry types as well, such as presentations I’m giving, which show up in blue and salmon, but it’s mostly green, yellow, and dark grey.

The color difference is big because it gives you a feel for what you’re doing with yourself. It helped me realize that, aside from dinner, I didn’t really see my wife often at all despite her being mere feet from me. Eyal mentions that for a lot of people, relationships get the remainder of whatever time is left over in your calendar, but if you don’t actively make time, other things encroach to the point where you just forget about them altogether. I know that scheduling time for relationships sounds a bit weird, but it’s really not—you’re protecting that time and ensuring that you don’t schedule meetings over it or feel the need to fill it with something “more productive” because calendars, like nature, abhor a vacuum.

Lightning Round: Bonus Tips

All of these revolve around the core notion of having a calendar in place, but I’ll burn through them quickly to keep this from being an essay.

Keep That Todo List

Todo lists are still important even when you have a calendar. The calendar is your commitment to yourself (and others) as to what you will do in the near term. The todo list then becomes a reminder of what you need to do in the near term, as well as things you’d like to do over time. The trick is, you don’t need to schedule all of the todo items up-front.

Every Day: 30 Minutes for Day Planning and Prep

Alongside the todo list, I have an entry at 9 PM to do planning and preparation. Early on, this was the time I spent simply getting those calendar entries in. Now that I am in a rhythm, I can spend this time looking at what’s left in my calendar and deciding what I want to do. For example, I had a few courses on my todo list for months; I’d bought access to them and wanted to learn, but never really felt like I had the time or motivation. Now I can put them in at certain time blocks, and know that I’m going to work on these. Importantly, I have others that I don’t put in or even think about. I’ll get to them someday, but I don’t need to fret about it in the meantime.

Timebox Most Things

This might make sense for people like me rather than the general population, but timebox things. Commit to 30-60 minutes on an item and then be done with it. If you need to get back to it, that’s fine—schedule another entry. My problem is that I tend to get squirrely after about an hour on a single thing and so I have a bunch of different 15-60 minute tasks throughout the day. If I really get in the zone on something, I don’t have to quit, but can reschedule other things. But that leads me to my next point.

Understand Your Limits

The point of timeboxing and calendar-based commitment isn’t to become some sort of machine that can work 18-hour days. I purposefully stop scheduling productive work at 9 PM. If I’m feeling really hyped about something and just want to keep going, I can, but I normally reserve the last few hours of the day to unwind, play games, read, and relax.

Focus on One Thing at a Time

Very few people can juggle multiple tasks at the same time. Even the best multi-taskers tend to be pre-emptive interrupt types rather than truly concurrent: they switch between several tasks but focus on one task at a time. The problem is, the modern world is designed to take you out of focus. Between your phone, computer, multiple screens, group chats like Slack/Teams, social networks, and flashy services, everything’s trying to get your attention all the time.

Turn off most notifications. If an app spams me more than once, I turn off its notifications altogether. Spam, here, is simply unwanted messages. The American Airlines app notifies me when my plane is about to board—that’s a helpful message. If it started messaging me about how I can buy bunches of miles, I’d turn off notifications immediately.

Going further, I keep my phone on silent and sometimes even turn on Do Not Disturb mode (Android and iOS both offer this). Specific people (like my wife, my boss, my employees, and my parents) can still notify me immediately, but other people can wait a bit. I’ve timed this a few times and realized that during certain stretches of the day, I would get a work-related notification every 90-120 seconds. That could be e-mail, group chat, or whatever. Most of those messages I could ignore (e-mails which don’t require immediate action, group chat messages not directed toward me), but the phone buzz interrupted me nonetheless because hey, maybe this time it’s important!

The other exception is the calendar entries themselves–I have notifications on all of my calendar entries so that I do get a message that it’s time to move on to something else. That means I don’t have to check the clock quite so frequently and won’t get sucked into notification swamps.

When you end one task and are ready to go to the next, go ahead and check those e-mails and chat messages, but if you need to act on something non-urgent, set up a calendar entry and do it during that time. If it’s urgent, of course this goes out the window, but you’d be surprised how little is truly urgent.

Spend Less Time on Social Media

If you really need to be on Facebook, Twitter, Instagram, TikTok, whatever, schedule that time. Eyal recommended uninstalling the apps from your phone and just using the web interface in your pre-committed time. I don’t go quite that far, but I also spend very little time on social networks at this point and frankly, I’m not sure I’m really missing that much. I’ll check for notifications occasionally, maybe hang out for a little bit, but it just doesn’t form a big part of my life.

Sometimes, Call an Audible

Remember that you rule your calendar, not the other way around. If the weather is beautiful, go out and take that walk right now instead of waiting two days. Aside from meetings or hard deadlines, you’re filling out this calendar to get the most from your life. You can leave a bit of slack (hopefully more than I have) in case you need it or move/remove items as you determine that the world isn’t quite what you imagined it to be a few days ago when you created the calendar. And sometimes you get done with something early; in that case, you’ve got bonus time, so do with it what you will and indulge all of those time-sucking notifications and frivolities until the next calendar block hits.

“But What if I Don’t Like Your Advice?”

So you’ve gotten this far and decided that you don’t like what I’m saying, or you have loads of objections (or maybe “well, I like it but with this twist”). I am a normative individualist, so I fully believe that you know better than I do what works for you. If you want to take 30% of my advice and go in a wildly different direction on the rest, go for it and I hope it works out great. I won’t pretend that this is the only—or even the best—way to do things, just that it’s a method which works for me.

As I mentioned the last time I hit this topic, you might not be able to pull this off due to your job (e.g., working in a support role where you can’t necessarily schedule when you get to work on things), your kids, or other circumstances. This isn’t a universal trick, and I think you have to hit the characteristics I keyed in on last time for it to work to the utmost:

  • You have some flexibility in your schedule right now. If you’re already booked solid for the next six months, there’s not much we can do here.
  • You can commit to performing some action for a block of time. Having young children will likely make this technique all but impossible.
  • You are at least a little neurotic. If you don’t care at all about skipping appointments or ignoring mechanical devices blinking and buzzing at you, this technique probably isn’t going to work.
  • You have a semblance of a plan. One important thing with this technique is that you can switch to the task without much mental overhead. A task which reads “go do stuff” won’t help you at all. By contrast, a task to continue reading a book is easy: you pick up where you left off. For things that are in between, you can always fill out the calendar entry’s description with some details of things to do if that helps.

Installing TensorFlow and Keras for R on SQL Server 2019 ML Services

I make fairly heavy use of SQL Server Machine Learning Services in my work environment, and this includes using TensorFlow and Keras (in R and Python) to train neural networks and generate predictions against them.

With the upgrade to SQL Server 2019, the security model around Machine Learning Services has changed, restricting what you are able to do with the product. On the whole, that’s a good thing—in 2016 and 2017, you have some pretty good leeway to execute arbitrary code on a server, but they really limit that with 2019.

The purpose of today’s post is to show you how you can install R-Keras and TensorFlow against a server running SQL Server 2019 Machine Learning Services. If you want to install this for SQL Server R Services in 2016 or SQL Server 2017 Machine Learning Services, the process is a little bit easier.

A Major Caveat

There might be a better way to do this than what I’m going to show you. If so, it wouldn’t be the first time that I brute-forced a solution with Machine Learning Services that had a more elegant solution. But I can confirm that it works for me.

Step One: Prep Work

I’m going to assume you know how to install SQL Server ML Services. If you don’t, check out my presentation on ML Services in Production.

Tear Down This Firewall

You might need to disable a firewall rule which reads something like “Block network access for R local user accounts on SQL Server instance MSSQLSERVER.” This rule prohibits outbound web connections, making exfiltration of data harder. Unfortunately, it also makes package updating more difficult and my recollection was that it prevented installation of Keras, though I didn’t test this in writing the blog post, so I could be making that last part up. If your security team gets paranoid about disabling this firewall rule, turn it back on after installation and configuration is finished, though that does mean you won’t be able to do things like, say, hit an API endpoint with R or Python.

Install Anaconda

Next, we will want to install the Anaconda distribution of Python. This is true even if you have Python Services installed. I was never able to install TensorFlow + Keras on Python Services and have R Services configured to point over to Python Services to get R-Keras working correctly. Maybe I was doing something wrong, but installing another version of Anaconda does the trick. Plus, you’re going to have five or six versions of Anaconda installed on the machine regardless, as it seems like every other app these days wants its own version installed.

By the way, when installing Anaconda, there’s a check box about adding Anaconda to your PATH. Check that box even though they recommend you not do so. I needed to have the box checked to get R Services to identify Anaconda properly.

Double-Check that PATH

I installed Anaconda to C:\Anaconda, so keep that in mind. Make sure that you have the following in your PATH system environment variable:

  • C:\Anaconda
  • C:\Anaconda\Library\mingw-w64\bin
  • C:\Anaconda\Library\usr\bin
  • C:\Anaconda\Library\bin
  • C:\Anaconda\Scripts
  • C:\Anaconda\bin
  • C:\Anaconda\condabin

Then, add a new system environment variable: RETICULATE_PYTHON=C:\Anaconda\python.exe

Turning Off All of the Security

This next part may be a little scandalous, but my recollection is that I needed to do this to get everything working. I don’t remember if it was to get around our company’s permissions setup or if ML Services needed it set, but I had to set the Anaconda folder’s permissions recursively to grant Full Control to Users. You can try doing this without making that change and hopefully it works, but again, trial and error told me I needed to do this on my end. I also needed to grant All Application Packages and All Restricted Packages full control over the Anaconda folder. If you run icacls on the folder you’ll see something like the following:

Giving away the store with respect to rights.

Restart Everything

After you’ve set up the environment variables and rights, restart the SQL Server instance, both the database engine and the Launchpad service. This way they’ll pick up the new path variables.

Step Two: Install Keras

Now that we have ML Services installed and a good Python installation for Keras to live in, we need to run the R shell. For SQL Server 2019, that’s %PROGRAMFILES%\Microsoft SQL Server\MSSQL15.MSSQLSERVER\R_SERVICES\bin\R.exe by default, but you may need to change the path based on your installation location, version of SQL Server, and whether this is a named instance.

Inside R, run the following:

install.packages("reticulate", repos = "http://archive.linux.duke.edu/cran/")
install.packages("tensorflow", repos = "http://archive.linux.duke.edu/cran/")
install.packages("keras", repos = "http://archive.linux.duke.edu/cran/")

Note that I specified a repo here. If you don’t specify a repo, you’ll get Microsoft’s CRAN repo for SQL Server 2019, which is way out of date. We need current versions of reticulate, tensorflow, and keras for this to work. I use Duke University’s archive because it’s close, and Case Western University’s when Duke gives me troubles, but you can also find your own favorite mirror.

What we’ve done so far is install Python support (reticulate), as well as the shells for TensorFlow and Keras. We need to hook them up next. If you read the R-Keras installation instructions, you’ll see that there is a way to configure from within R. I was able to get that working for 2016 and 2017, but 2019 gave me problems, so we’re trying a different route.

Step Three: A Pivot to Python

Open up an Anaconda prompt (though I don’t recall needing to do so as Administrator, probably because I already gave away the store in Step One) and run the following command:

conda update -n base -c defaults conda

What we’re doing here is updating the conda package itself on our base environment using the default channel. It may not be strictly necessary, but usually conda will be a little out of date when you install Anaconda on Windows, so it’s worth getting the latest version.

Then close and re-open your Anaconda prompt. After that, run the following:

conda create --name r-reticulate
conda activate r-reticulate
pip install tensorflow-probability
pip install tensorflow
pip install keras
 
conda activate base
pip install tensorflow-probability
pip install tensorflow
pip install keras

What I’m doing is building a new virtual environment named r-reticulate, which is what the reticulate package in R desires. Inside that virtual environment, I’m installing the latest versions of tensorflow-probability, tensorflow, and keras. I had DLL loading problems with TensorFlow 2.1 on Windows, so if you run into those, the proper solution is to ensure that you have the appropriate Visual C++ redistributables installed on your server.

Then, I switched back to the base virtual environment and installed the same packages. My thinking here is that I’ll probably need them for other stuff as well (and don’t tell anybody, but I’m not very good with Python environments).

Step Four: Test Things Out

Bounce back to R.exe and run the following commands:

reticulate::import("h5py")
reticulate::import("keras")

If those worked, congratulations! You have TensorFlow for CPU and Keras installed. If you need GPU support, you’ll need to configure that on your own; I got it working on a desktop in the past but between CUDA, TensorFlow, Keras, and R, it’s a lot of pain.

From here, you should be able to open up your SQL Server client of choice, connect to the SQL Server instance, and run the following:

EXEC sys.sp_execute_external_script
	@language = N'R',
	@script = N'reticulate::import("h5py")
	reticulate::import("keras")'

This will help ensure that ML Services is working as expected.

Bonus: Installing TensorFlow and Keras for Python on Python Services

If you just want to use Python to execute TensorFlow and Keras code and don’t want to tie it into R at all, the steps are a lot easier. That’s where my talk on Developing a Solution with SQL Server Machine Learning Services comes into play. We’re going to use sqlmlutils to install the Python packages.

First, install sqlmlutils on your local machine. This does not need to be the same as the server where you’ve installed SQL Server Machine Learning Services. To do this, open up a command prompt (oh, and you’ll need Python installed locally; I’d probably recommend installing Anaconda locally as well) and run the following:

pip install sqlmlutils

Once you have sqlmlutils installed, we can continue on. For each database in which you need to access Keras and TensorFlow, run the following:

# sqlmlutils needs to be imported first; Scope ships with the package.
import sqlmlutils
from sqlmlutils import Scope

# Connect to the Database
# NOTE:  need to do this for *each* database!
# The server and database names below are placeholders; substitute your own.
conn = sqlmlutils.ConnectionInfo(server="MyGreatServer", database="SomeDatabase")

sqlpy = sqlmlutils.SQLPythonExecutor(conn)
pkgmanager = sqlmlutils.SQLPackageManager(conn)

# Install a package as dbo, allowing anybody to use it.
# Upgrade if there is a newer version than what is already installed.
pkgmanager.install("tensorflow", scope = Scope.public_scope())
pkgmanager.install("keras", scope = Scope.public_scope())

Yeah, that was a lot easier. On the plus side, the method we followed to install Keras and TensorFlow for R allows us to use it in every database by default, rather than the per-database installation method for sqlmlutils.

Bonus Bonus: TensorFlow and Keras on R with sqlmlutils

I want to take a quick moment at the end of the post to explain why, although sqlmlutils works for installing TensorFlow and Keras against Python Services, I needed to run R.exe on the server for R Services.

I tried doing this all from sqlmlutils, but I ran into two problems. The first was in installing the reticulate, tensorflow, and keras packages. They would start running but would never actually finish. I let this run overnight just to see if it’d complete, and it never did. By contrast, installing them from the server worked every time.

Second, I was unable to run install_keras(method = "conda", conda = "C:\\Anaconda\\Scripts\\conda.exe") successfully, either via sqlmlutils or R.exe. That’s why I ended up creating the virtual environment and installing the TensorFlow-related packages separately.

Building a Docker Container of a SQL Server Database

Today, we’re going to go through the process of turning a database you’ve built into a Docker container. Before we get started, here are the expectations:

  1. I want a fully running copy of SQL Server with whatever database I’m using, as well as key components installed.
  2. I want this not to be on a persistent volume. In other words, when I destroy the container and create a new one from my image, I want to reset back to the original state. I’m using this for technical demos, where I want to be at the same starting point each time.
  3. I want this to be as easy as possible for users of my container. I consider the use of a container here as not particularly noteworthy in and of itself, so the more time I make people spend trying to set up my demo environment, the more likely it is that they will simply give up.

With that preamble aside, let’s get to work!

Step One: Have a Database

This might seem a little obvious, but I want to make it clear that we need the database set up exactly how we want it. This includes user configuration (usually with SQL authentication, given that we’re using Linux containers and passing them out to who-knows-where), database objects like tables and procedures, and whatever else we need. After that, I will take a database backup and save it to my local disk.

For this example, I’m going to talk through how I containerized the demos for my Approaching Zero talk. This is a fairly straightforward database with some pre-staged data and objects, but no features such as PolyBase, SQL Agent, or Machine Learning Services to throw a spanner in the works. Incidentally, if you do use those types of features, I was able to get them working in my Native Scoring demo, so it’s possible to do and the process actually isn’t that far removed from what we’re doing here.

Step Two: Spin Up a Container

It takes a container to beat a container. Well, not exactly, but this is the route that I like to take.

I’m going to spin up a container. Here’s the command I run:

docker run -d -e MSSQL_PID=Developer -e ACCEPT_EULA=Y -e SA_PASSWORD=YourPasswordGoesHereButDontUseThisOneBecauseItIsntAGood1! --name approaching-zero-db -p 51433:1433 mcr.microsoft.com/mssql/server:2019-latest

Taking this step by step, I want to run a Docker container in the background (that’s what the -d flag means). I’m going to set several environment variables: MSSQL_PID (to tell SQL Server which edition to use), ACCEPT_EULA (to promise Microsoft we won’t sue them), and SA_PASSWORD (to throw off our enemies with deceptively bad passwords). My container will be named approaching-zero-db. You can, of course, name it whatever you’d like, and it doesn’t even have to be the same as what we’re going to push out to Docker Hub, so get creative if you really want.

The -p flag says that we’d like the container’s port 1433 to be represented on our host as port 51433. I already have SQL Server running on my host machine, so I’m selecting a totally different, unused port for the container. But as far as the container is concerned, it is listening on its port 1433, so it has that going for it.

Microsoft has their own container repository for SQL Server, and we’re getting the version tagged as 2019-latest, which is CU4 as of the time of this blog post going live. Downloading this may take a while, so try not to do this on a really slow internet connection.

Incidentally, you might get the following error:

C:\Program Files\Docker\Docker\resources\bin\docker.exe: Error response from daemon: Ports are not available: listen tcp 0.0.0.0:51433: bind: An attempt was made to access a socket in a way forbidden by its access permissions.

If you get this error, it means that something else is already listening on port 51433 on your host, so you’ll have to use a different port instead. Maybe 52433 or 51434 or something. You’ll need to run docker rm approaching-zero-db to clean up the mess before you try again, but don’t shed too many tears over the containers we have to slaughter along the path to glory.

Step Three: Restore the Dauphin

We have a container and a database backup, so let’s do what we do best: fitting square pegs into round holes.

To do this, first I will make a backup directory within our container:

docker exec -it approaching-zero-db mkdir /var/opt/mssql/backup

This command is pretty straightforward: we’re going to execute a command in interactive mode (-i) and allocate a pseudo-TTY (-t). Don’t know what a pseudo-TTY is? If it really matters, learn all about tty and then come back here.

Our backup has a home, so now let’s move it in and make it feel comfortable.

docker cp D:\SQLServer\Backup\ApproachingZero.bak approaching-zero-db:/var/opt/mssql/backup/ApproachingZero.bak

We’re using Docker to copy the backup from my local directory into the container. You would, of course, modify this to fit your system.

After the backup is up there, I like to run the following command in Powershell. The prior commands you could run in cmd, but this one’s a lot easier to read as a multi-liner:

docker exec -it approaching-zero-db /opt/mssql-tools/bin/sqlcmd -S localhost `
    -U SA -P YourPasswordGoesHereButDontUseThisOneBecauseItIsntAGood1! `
    -Q "RESTORE DATABASE ApproachingZero FROM DISK =
        '/var/opt/mssql/backup/ApproachingZero.bak' WITH MOVE 'ApproachingZero' TO '/var/opt/mssql/data/ApproachingZero.mdf', MOVE 'ApproachingZero_Log' TO '/var/opt/mssql/data/ApproachingZero.ldf'"

Some builds of SQL Server containers don’t have mssql-tools installed, so you might need to install them separately. Let’s talk about that in a sub-header.

Step Three Point A: What to Expect when you Expected mssql-tools

If, for some reason, your container does not have mssql-tools installed, that’s okay. As long as you have an internet connection, you can get this done.

We’re first going to open up a shell on the container:

docker exec -it approaching-zero-db /bin/bash

Why bash and not ksh? Because I’m not hardcore enough to live my life in vi. I’m a frequent vi tourist but not a native.

Next up, we’re going to install some stuff. Unlike in the Microsoft instructions, we are Master and Commander of this container and so sudo won’t do much.

curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list | tee /etc/apt/sources.list.d/msprod.list

apt-get update 
apt-get install mssql-tools unixodbc-dev

And now we have mssql-tools and we can return to step 4, already in progress.

Step Four: Perform Sanity Tests

After you’ve restored the database, connect to it in SQL Server Management Studio, Azure Data Studio, or your tool of choice. Make sure the objects are there, that you can log in with the accounts you need, etc. You might need to create logins for those SQL authenticated accounts, for example—unless you’re like me and giving away sa access like it’s free candy at Halloween.
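As a rough sketch of what those checks might look like in T-SQL (the database name comes from my demo, so swap in your own):

-- Confirm the database restored and is online, then spot-check object counts.
SELECT name, state_desc
FROM sys.databases
WHERE name = N'ApproachingZero';

SELECT COUNT(*) AS TableCount FROM ApproachingZero.sys.tables;
SELECT COUNT(*) AS ProcedureCount FROM ApproachingZero.sys.procedures;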

Here’s the important thing, though: any changes you make here will be on the container’s permanent record, so if you want a pristine experience every time you reconstitute that container, you’ll want to handle this step lightly.

Step Five: Admit Defeat on Sanity, Get Committed

We’ve made sure that the database is what we expected, that all of the pieces are there, and that queries are running as we expect. From there, it’s time to push this to Docker Hub.

The first step in pushing a container is to commit the current container as a new image with the docker commit command:

docker commit -m="Creating the Approaching Zero database." -a "{Your Name Here}" approaching-zero-db approaching-zero-db

We are running the commit command and passing in a message with -m, setting the author with -a, selecting the running container to commit (approaching-zero-db) and naming the local repository to which we want to commit this thing (approaching-zero-db).

At this point, you can spin up a local container off of your approaching-zero-db model, but nobody else can. So let’s fix that.

Step Six: Bust Out and Let the Whole World See You

We want to push this new image to Docker Hub. If you have not already, you’ll probably need to log in:

docker login

You might be able to authenticate with cached credentials, but if not you’ll enter your username and password and be good to go.

Now, we want to tag our local image and tell Docker what it will represent up in Docker Hub:

docker tag approaching-zero-db docker.io/feaselkl/presentations:approaching-zero-db

This takes my local repository named approaching-zero-db (with the default latest tag) and ties it back to Docker Hub, where in my feaselkl account I have a presentations repository and a tag called approaching-zero-db.

My last step is to push the fruits of my labor into Docker Hub for all to see.

docker push docker.io/feaselkl/presentations:approaching-zero-db

Step Seven: Using the Image

Now that I have the image up in Docker Hub, I can run the following commands from any machine with Docker installed and spin up a container based on my image:

docker pull docker.io/feaselkl/presentations:approaching-zero-db

docker run --name approaching-zero-db -p 51433:1433 docker.io/feaselkl/presentations:approaching-zero-db

And that’s it. I connect to port 51433 on localhost and authenticate with my extra-great sa username and password, and can run through whatever I need to do. When I’m done, I can stop and remove the container whenever I’d like:

docker stop approaching-zero-db
docker rm approaching-zero-db

Conclusion

In this post, we learned how to take an existing database in SQL Server—whether that be for Windows or Linux—and create a SQL Server container which includes this database.

Coda: Discourses and Distractions

I wanted to save a couple of notes here to the end in order not to distract you with too many tangents in the main body of the post.

Let’s do this as Q&A, as I haven’t done that in a while.

Couldn’t you just use Dockerfiles for all of this?

Short answer, yes. Long answer, I didn’t want to. Dropping a Dockerfile in my GitHub repo makes it easier for me, sure, but then makes it more difficult for people to follow along. As an experiment, I did include all of the steps in my Map to Success repository. Compare the section on “Run Docker Images” to “Build Docker Images” and tell me which one is easier.

If I were putting together a training on Docker, then it would make perfect sense. But I’m using containers as a background tool, and I want to get past it as soon as possible and with as little learner friction as possible.

Can’t you just script all of these steps?

Short answer, yes. Long answer, yes you can.

The only thing to watch out for in scripting is that I noticed a timing issue between when you copy the database backup to the container and when sqlcmd recognizes that the backup is there. I wasn’t able to get it all working in the same Powershell script, even when I did hacky things like adding Start-Sleep. But maybe you’ll have better luck.

Why did you label the local repository as approaching-zero-db when you knew you were going to push this as presentations:approaching-zero-db?

That’s a really good question.

Why not expose a volume in the container, put the database backup there, and commit the image without the database backup once you’re done?

Credit Bartosz Ratajzyk for the great question.

The short version is that I actually forgot about volumes when I was writing this up. But using docker cp isn’t the worst thing, especially when copying one file. Bartosz makes a great point though that we should remove the backup files before committing, regardless of the technique we use.

So why would you actually use this technique in real life?

Andragogy is real life.

Aside from that, this technique is quite useful for building automated test environments. Create a pre-defined database with a known beginning state, spin up a container, hammer that database with all of your automated tests, and destroy the container when you’re done. Unlike most cases, where you want to save the data permanently, these sorts of tests cry out for ephemeral databases.

If you’re looking at containers for most other purposes (like, say, production databases), you’d definitely be interested in persisted volumes and I’d hand you off to Andrew Pruski for that.

Why would you want to restore the Dauphin? You don’t seem like an Armagnac.

Yeah, if you pressed me on it I’d probably admit to Burgundian sympathies, but that’s mostly because of the role it played in the subsequent Dutch republic. But if you pressed me even further, I’ll admit that I was just rooting for injuries in that whole tete-a-tete.

Upcoming Events: Machine Learning Week

Key Details

What: Machine Learning Week, sponsored by Predictive Analytics World.
Where: This is a virtual event.
When: Sunday, May 31st through Thursday, June 4th.
Registration is $990 for the livestream and $1490 for livestream and recordings. Register on the Predictive Analytics World website.

What I’m Presenting

Wednesday, 2:45 PM — 3:05 PM PDT — Forecasting Demand in the e-Commerce Space

This is the first time I’m giving a public presentation explicitly related to what my company does. Typically, I present on tangential things: database administration, data science, security. But here, it’s front-and-center. This made it an interesting challenge: providing useful information without laying out anything proprietary.

Upcoming Events: SQL Saturday Brisbane (Virtual)

Key Details

What: SQL Saturday Brisbane.
Where: Internet Australia. You have to turn your monitor upside-down to see it correctly.
When: Saturday, May 30th.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

12:15 PM — 1:15 PM AEST — Data Virtualization with PolyBase

Please note that all of the times are in Australian Eastern Standard Time (AEST). This is UTC+10, so if you’re on the Eastern Daylight Time zone like me (UTC-4), that’s 14 hours ahead. In other words, I present on Friday starting at 10:15 PM EDT.

SQL on the Edge: Full-Day Training at PASS Summit 2020

I’m presenting a full-day training at PASS Summit again this year. Here are the details:

SQL Server on the Edge: IoT with SQL Server and .NET Core

In this day-long training session, you will learn about Azure SQL Database Edge, the version of SQL Server intended to run on Internet of Things (IoT) devices. We will discuss the types of scenarios we might try to solve using IoT devices. From there, we will learn about device management through the Azure IoT Hub, including installation of applications from the Azure marketplace, as well as the development and deployment of custom Docker containers in an Azure Container Registry. Over the course of the day, we will build out a practical scenario and take a look at features in Azure SQL Database Edge, including handling time series data and machine learning on edge devices. As we expand out from a single device, we will learn how to automate deployment and updates at scale using capabilities in Azure IoT Hub.

Prerequisites: A solid knowledge of T-SQL, a basic understanding of Docker and containers, some knowledge of .NET (C# or F#).

What Can You Expect?

I’m still working out all of the details, but here are the top-level items:

Module 0 — Prep Work

In the first module, we will understand why we might want to care about the Internet of Things, looking at the types of scenarios IoT can solve.

Module 1 — Configuring Azure IoT Edge Devices

In this module, we will set up Azure IoT hub, take a look at the Raspberry Pi 4, and install Azure IoT Edge on the Pi. We will also see how to connect a virtual machine to Azure IoT Hub to assist with testing.

Module 2 — Azure SQL Database Edge Installation

Once we have a device in Azure IoT Hub, we will see how to install Azure SQL Database Edge, including configuration and deployment of dacpacs on a VM and on a Raspberry Pi 4.

Module 3 — Developing and Deploying an Application

With a database in place, we will work on an IoT solution in .NET Core and connect to our Azure SQL Database Edge instance.

Module 4 — Diving into Time Series

One of the main promises of Azure SQL Database Edge is the support for time series, and we will investigate what is available on that front.

Module 5 — Machine Learning on the Edge

In this module, we will review ONNX, the Open Neural Network Exchange. We will see how to train a model on a host, deploy it to Azure SQL Database Edge, and predict using the native PREDICT operator.

Module 6 — Device Management

The final module will extend us beyond a single device, as scale is the name of the game with IoT. We will also look at tools available for monitoring and providing insights.

Course Objectives

Upon completion of this course, attendees will be able to:

  • Configure an Azure IoT Hub
  • Connect IoT edge devices (such as the Raspberry Pi) to Azure IoT Hub
  • Deploy Azure SQL Database Edge to edge devices en masse
  • Develop and deploy custom .NET code using containers
  • Deploy machine learning models to edge devices

If this sounds interesting to you, be sure to register for PASS Summit 2020.

SQL Server Machine Learning Services — The Joy of Native Scoring

I have released a new video on YouTube!

Well, it’s kind of an old video. You can tell because I could still get away with going to Hope, Washington at the time.

If John Rambo wants to go through your town, just let him.

A version of this video was first released for the PASS Insights newsletter, and I’ve held off for a bit before posting this version to let it ride its course.

If you want to get the code used in the demo, grab it from my GitHub repo.

Doing more of these videos has been on my agenda for a while, so hopefully I’ll get a cadence going, and someday it will become a regular cadence. That’d be great.