The Curated Data Platform is a 2-hour video training aimed at providing you a 30,000 foot overview of the data platform space. In this course, I take you through a variety of data platform technologies—such as relational databases, document databases, caching technologies, data lakes, and graph databases. I show you use cases in which these technologies can be great fits, as well as which companies and products are most relevant in that space today. This includes on-premises technologies as well as major services in Amazon Web Services and Azure.
Get This Course for Free! (Limited Time Offer)
Through Sunday, July 25, 2021, you can register for this course for free using the coupon code FIRSTMOVER when you check out. My one request with this is, if you use the coupon code, please be sure to leave feedback on the course—things which you liked, as well as things you wanted to see but didn’t. I intend to update this course over time to make it better based in part on learner feedback.
I’ve talked about Data Platform Summit before, but now that we’re close to go time, I wanted to mention it again.
This year, I will be presenting one breakout session and one training class.
The breakout session is entitled Optimizing Backup Performance using Data Science Techniques. In it, I walk through the configuration settings we have available to us for tuning backups (and restorations) of databases and then apply data science techniques to find the best set for a given environment. It’s a great way of applying data science skills to a classic database administration task and should be fun for both audiences. That session, along with more than 200 others, will run from December 2nd through the 4th.
The training, meanwhile, will take place over two days, December 7th and 8th. The training is entitled Applying Forensic Accounting Techniques to Data Analysis in Python and SQL and is a lengthy enough title that it’s self-explanatory. In it, I answer the perennial question for data platform specialists: what does it mean to know your data? The same processes and techniques forensic accountants use to analyze data sets can help us gain a detailed understanding of data in our own environments and become more proficient. Along the way, we uncover a bit of dirty dealing in an artificial data set because hey, where’s the fun in talking about forensic accounting if we can’t have a fraud-based example?
How about for this months TSQL Tuesday let’s do another month of tips and tricks. But just to be a bit different, nothing to do with SQL Server (in fact let’s say nothing related to a DBMS). No SSMS tricks, no T-SQL tips, something completely unrelated. For example did you know that, in Windows, if you grab a window and shake it then all of the other windows will be minimized? And then if you shake it again they all pop back up.
So give me a non SQL related tip or trick. OS (windows, mac), Cloud (Azure, AWS), etc. Something you think will be handy for the rest of us.
Today, I’m going back to a classic.
The Non-Technical Life Tip for Getting More Things Done
This worked pretty well, but it had a couple of problems:
I still had multiple calendars, so the only place where I had my schedule in full was on my phone.
My phone is where I go to lose lots of time.
When I do need to be flexible, rearranging the calendar becomes a mess.
It’s hard to keep track of the things I want to do but don’t have strictly scheduled times set up already.
So what ended up happening is that, over time, I switched back to using a todo list, landing on Todoist given its functionality.
ToDo Lists Aren’t Perfect
Using a todo list did help me with some parts of the problem, particularly around flexibility and wanting to do things later but where “later” isn’t on a schedule. Eventually, I stopped filling in the calendar entries altogether.
But that led me back to the original problem I had: I wasn’t getting stuff done. I’d have 60-70 items on my todo list, knowing that my regular cadence was maybe 10-12 items per day, with 15-18 on a very productive day.
The math doesn’t work out, so let me walk through a dramatized version of my day:
Add a few items to my todo list that need done today.
Finish my normal 10-12 items.
Figure out how to reschedule the other 50 things on the list, bumping them back by a week or two.
The problem is that todo lists don’t really give you any of the discipline of having a plan or knowledge of what you can achieve. So you end up with eyes bigger than your stomach (to misappropriate a metaphor) and a huge list of things to do but hey, there’s still 6 hours in the day to accomplish these 50 tasks so let’s watch some TV for a while.
Not too long ago, I started listening to a book called Indistractable. I’m roughly halfway through it and most of the book is stuff I’ve heard, done, or otherwise know. But in the book, Nir Eyal recommends keeping a calendar with specific entries…sort of like what I did in 2019. But there are two extra catches.
Here’s what my new calendar looks like:
The first difference is that this is all of my calendar entries in one, including work, community, and personal. I don’t have to use my phone to see the whole thing (although I still can).
The second difference is in the colors. I now have several colors, which represent different things. Green represents personal time; yellow and orange represent work and work-related meetings, respectively; and dark grey represents relationships and time with others. I have a couple other types of entry such as presentations I’m giving are listed in blue + salmon, but it’s mostly green, yellow, and dark grey.
The color difference is big because it gives you a feel for what you’re doing with yourself. It helped me realize that, aside from dinner, I didn’t really see my wife often at all despite her being mere feet from me. Eyal mentions that for a lot of people, relationships get the remainder of whatever time is left over in your calendar, but if you don’t actively make time, other things encroach to the point where you just forget about them altogether. I know that scheduling time for relationships sounds a bit weird, but it’s really not—you’re protecting that time and ensuring that you don’t schedule meetings over it or feel the need to fill it with something “more productive” because calendars, like nature, abhor a vacuum.
Lightning Round: Bonus Tips
All of these revolve around the core notion of having a calendar in place, but I’ll burn through them quickly to keep this from being an essay.
Keep That Todo List
Todo lists are still important even when you have a calendar. The calendar is your commitment to yourself (and others) as to what you will do in the near term. The todo list then becomes a reminder of what you need to do in the near term, as well as things you’d like to do over time. The trick is, you don’t need to schedule all of the todo items up-front.
Every Day: 30 Minutes for Day Planning and Prep
Alongside the todo list, I have an entry at 9 PM to do planning and preparation. Early on, this was the time I spent simply getting those calendar entries in. Now that I am in a rhythm, I can spend this time looking at what’s left in my calendar and deciding what I want to do. For example, I had a few courses on my todo list for months; I’d bought access to them and wanted to learn, but never really felt like I had the time or motivation. Now I can put them in at certain time blocks, and know that I’m going to work on these. Importantly, I have others that I don’t put in or even think about. I’ll get to them someday, but I don’t need to fret about it in the meantime.
Timebox Most Things
This might make sense for people like me rather than the general population, but timebox things. Commit to 30-60 minutes on an item and then be done with it. If you need to get back to it, that’s fine—schedule another entry. My problem is that I tend to get squirrely after about an hour on a single thing and so I have a bunch of different 15-60 minute tasks throughout the day. If I really get in the zone on something, I don’t have to quit, but can reschedule other things. But that leads me to my next point.
Understand Your Limits
The point of timeboxing and calendar-based commitment isn’t to become some sort of machine that can work 18-hour days. I purposefully stop scheduling productive work at 9 PM. If I’m feeling really hyped about something and just want to keep going, I can, but I normally reserve the last few hours of the day to unwind, play games, read, and relax.
Focus on One Thing at a Time
Very few people can juggle multiple tasks at the same time. Even the best multi-taskers tend to be pre-emptive interrupt types rather than truly concurrent—they switch between several tasks but focus on the one task at a time. The problem is, the modern world is designed to take you out of focus. Between your phone, computer, multiple screens, group chats like Slack/Teams, social networks, and flashy services, everything’s trying to get your attention all the time.
Turn off most notifications. If an app spams me more than once, I turn off its notifications altogether. Spam, here, is simply unwanted messages. The American Airlines app notifies me when my plane is about to board—that’s a helpful message. If it started messaging me about how I can buy bunches of miles, I’d turn off notifications immediately.
Going further, I keep my phone on silent and sometimes even turn on Do Not Disturb mode (Android and iOS both offer this). Specific people (like my wife, my boss, my employees, and my parents) can still notify me immediately, but other people can wait a bit. I’ve timed this a few times and realized that during certain stretches of the day, I would get a work-related notification every 90-120 seconds. That could be e-mail, group chat, or whatever. Most of those messages I could ignore (e-mails which don’t require immediate action, group chat messages not directed toward me), but the phone buzz interrupted me nonetheless because hey, maybe this time it’s important!
The other exception is the calendar entries themselves–I have notifications on all of my calendar entries so that I do get a message that it’s time to move on to something else. That means I don’t have to check the clock quite so frequently and won’t get sucked into notification swamps.
When you end one task and are ready to go to the next, go ahead and check those e-mails and chat messages, but if you need to act on something non-urgent, set up a calendar entry and do it during that time. If it’s urgent, of course this goes out the window, but you’d be surprised how little is truly urgent.
Spend Less Time on Social Media
If you really need to be on Facebook, Twitter, Instagram, TikTok, whatever, schedule that time. Eyal recommended uninstalling the apps from your phone and just using the web interface in your pre-committed time. I don’t go quite that far, but I also spend very little time on social networks at this point and frankly, I’m not sure I’m really missing that much. I’ll check for notifications occasionally, maybe hang out for a little bit, but it just doesn’t form a big part of my life.
Sometimes, Call an Audible
Remember that you rule your calendar, not the other way around. If the weather is beautiful, go out and take that walk right now instead of waiting two days. Aside from meetings or hard deadlines, you’re filling out this calendar to get the most from your life. You can leave a bit of slack (hopefully more than I have) in case you need it or move/remove items as you determine that the world isn’t quite what you imagined it to be a few days ago when you created the calendar. And sometimes you get done with something early; in that case, you’ve got bonus time, so do with it what you will and indulge all of those time-sucking notifications and frivolities until the next calendar block hits.
“But What if I Don’t Like Your Advice?”
So you’ve gotten this far and decided that you don’t like what I’m saying, or you have loads of objections (or maybe “well, I like it but with this twist”). I am a normative individualist, so I fully believe that you know better than I do what works for you. If you want to take 30% of my advice and go in a wildly different direction on the rest, go for it and I hope it works out great. I won’t pretend that this is the only—or even the best—way to do things, just that it’s a method which works for me.
As I mentioned the last time I hit this topic, you might not be able to pull this off due to your job (e.g., working in a support role where you can’t necessarily schedule when you get to work on things), your kids, or other circumstances. This isn’t a universal trick, and I think you have to hit the characteristics I keyed in on last time for it to work to the utmost:
You have some flexibility in your schedule right now. If you’re already booked solid for the next six months, there’s not much we can do here.
You can commit to performing some action for a block of time. Having young children will likely make this technique all but impossible.
You are at least a little neurotic. If you don’t care at all about skipping appointments or ignoring mechanical devices blinking and buzzing at you, this technique probably isn’t going to work.
You have a semblance of a plan. One important thing with this technique is that you can switch to the task without much mental overhead. A task which reads “go do stuff” won’t help you at all. By contrast, a task to continue reading a book is easy: you pick up where you left off. For things that are in between, you can always fill out the calendar entry’s description with some details of things to do if that helps.
I make fairly heavy use of SQL Server Machine Learning Services in my work environment, and this includes using TensorFlow and Keras (in R and Python) to train neural networks and generate predictions against them.
With the upgrade to SQL Server 2019, the security model around Machine Learning Services has changed, restricting what you are able to do with the product. On the whole, that’s a good thing—in 2016 and 2017, you have some pretty good leeway to execute arbitrary code on a server, but they really limit that with 2019.
The purpose of today’s post is to show you how you can install R-Keras and TensorFlow against a server running SQL Server 2019 Machine Learning Services. If you want to install this for SQL Server R Services in 2016 or SQL Server 2017 Machine Learning Services, the process is a little bit easier.
A Major Caveat
There might be a better way to do this than what I’m going to show you. If so, it wouldn’t be the first time that I brute-forced a solution with Machine Learning Services that had a more elegant solution. But I can confirm that it works for me.
You might need to disable a firewall rule which reads something like “Block network access for R local user accounts on SQL Server instance MSSQLSERVER.” This rule prohibits outbound web connections, making exfiltration of data harder. Unfortunately, it also makes package updating more difficult and my recollection was that it prevented installation of Keras, though I didn’t test this in writing the blog post, so I could be making that last part up. If your security team gets paranoid about disabling this firewall rule, turn it back on after installation and configuration is finished, though that does mean you won’t be able to do things like, say, hit an API endpoint with R or Python.
Next, we will want to install the Anaconda distribution of Python. This is true even if you have Python Services installed. I was never able to install TensorFlow + Keras on Python Services and have R Services configured to point over to Python Services to get R-Keras working correctly. Maybe I was doing something wrong, but installing another version of Anaconda does the trick. Plus, you’re going to have five or six versions of Anaconda installed on the machine regardless, as it seems like every other app these days wants its own version installed.
By the way, when installing Anaconda, there’s a check box about adding Anaconda to your PATH. Check that box even though they recommend you not do so. I needed to have the box checked to get R Services to identify Anaconda properly.
Double-Check that PATH
I installed Anaconda to C:\Anaconda, so keep that in mind. Make sure that you have the following in your PATH system environment variable:
Then, add a new system environment variable: RETICULATE_PYTHON=C:\Anaconda\python.exe
Turning Off All of the Security
This next part may be a little scandalous, but my recollection is that I needed to do this to get everything working. I don’t remember if it was to get around our company’s permissions setup or if ML Services needed it set, but I had to set the Anaconda folder’s permissions recursively to grant Full Control to Users. You can try doing this without making that change and hopefully it works, but again, trial and error told me I needed to do this on my end. I also needed to grant All Application Packages and All Restricted Packages full control over the Anaconda folder. If you run icacls on the folder you’ll see something like the following:
After you’ve set up the environment variables and rights, restart the SQL Server instance, both the database engine and the Launchpad service. This way they’ll pick up the new path variables.
Step Two: Install Keras
Now that we have ML Services installed and a good Python installation for Keras to live, we need to run the R shell. For SQL Server 2019, that’s %PROGRAMFILES%\Microsoft SQL Server\MSSQL15.MSSQLSERVER\R_SERVICES\bin\R.exe by default, but you may need to change the path based on your installation location, version of SQL Server, and whether this is a named instance.
Note that I specified a repo here. If you don’t specify a repo, you’ll get Microsoft’s CRAN repo for SQL Server 2019, which is way out of date. We need current versions of reticulate, tensorflow, and keras for this to work. I use Duke University’s archive because it’s close, and Case Western University’s when Duke gives me troubles, but you can also find your own favorite mirror.
What we’ve done so far is install Python support (reticulate), as well as the shells for TensorFlow and Keras. We need to hook them up next. If you read the R-Keras installation instructions, you’ll see that there is a way to configure from within R. I was able to get that working for 2016 and 2017, but 2019 gave me problems, so we’re trying a different route.
Step Three: a Pivot to Python
Open up an Anaconda prompt (though I don’t recall needing to do so as Administrator, probably because I already gave away the store in Step One) and run the following command:
conda update -n base -c defaults conda
What we’re doing here is updating the conda package itself on our base environment using the default channel. It may not be strictly necessary, but usually conda will be a little out of date when you install Anaconda on Windows, so it’s worth getting the latest version.
Then close and re-open your Anaconda prompt. After that, run the following:
conda create --name r-reticulate
conda activate r-reticulate
pip install tensorflow-probability
pip install tensorflow
pip install keras
conda activate base
pip install tensorflow-probability
pip install tensorflow
pip install keras
What I’m doing is building a new virtual environment named r-reticulate, which is what the reticulate package in R desires. Inside that virtual environment, I’m installing the latest versions of tensorflow-probability, tensorflow , and keras. I had DLL loading problems with TensorFlow 2.1 on Windows, so if you run into those, the proper solution is to ensure that you have the appropriate Visual C++ redistributables installed on your server.
Then, I switched back to the base virtual environment and installed the same packages. My thinking here is that I’ll probably need them for other stuff as well (and don’t tell anybody, but I’m not very good with Python environments).
Step Four: Test Things Out
Bounce back to R.exe and run the following commands:
If those worked, congratulations! You have TensorFlow for CPU and Keras installed. If you need GPU support, you’ll need to configure that on your own; I got it working on a desktop in the past but between CUDA, TensorFlow, Keras, and R, it’s a lot of pain.
From here, you should be able to open up your SQL Server client of choice, connect to the SQL Server instance, and run the following:
First, install sqlmlutils on your local machine. This does not need to be the same as the server where you’ve installed SQL Server Machine Learning Services. To do this, open up command prompt, run Python (oh, you need Python, and I’d probably recommend installing Anaconda locally as well), and run the following:
pip install sqlmlutils
Once you have sqlmlutils installed, we can continue on. For each database in which you need to access Keras and TensorFlow, run the following:
# Connect to the Database
# NOTE: need to do this for *each* database!
#conn = sqlmlutils.ConnectionInfo(server="MyGreatServer", database="SomeDatabase")
sqlpy = sqlmlutils.SQLPythonExecutor(conn)
pkgmanager = sqlmlutils.SQLPackageManager(conn)
# Install a package as dbo, allowing anybody to use it.
# Upgrade if there is a newer version than what is already installed.
pkgmanager.install("tensorflow", scope = Scope.public_scope())
pkgmanager.install("keras", scope = Scope.public_scope())
Yeah, that was a lot easier. On the plus side, the method we followed to install Keras and TensorFlow for R allows us to use it in every database by default, rather than the per-database installation method for sqlmlutils.
Bonus Bonus: TensorFlow and Keras on R with sqlmlutils
I want to take a quick moment at the end of the post to explain why, although sqlmlutils works for installing TensorFlow and Keras against Python Services, I needed to run R.exe on the server for R Services.
I tried doing this all from sqlmlutils, but I ran into two problems. The first was in installing the reticulate, tensorflow, and keras packages. They would start running but would never actually finish. I let this run overnight just to see if it’d complete, and it never did. By contrast, installing them from the server worked every time.
Second, I was unable to run install_keras(method = "conda", conda = "C:\\Anaconda\\Scripts\\conda.exe) successfully, either via sqlmlutils or R.exe. That’s why I ended up creating the virtual environment and installing TensorFlow-related packages separately.
This will be a review of my presentation goals for 2019. Tomorrow, I’m going to talk about a bunch of other stuff I did. It’s been a busy year.
Coming into the year, I had 5 major goals and 2 stretch goals. Here’s how I did on each of them.
Speak at 20 SQL Saturdays and 10 User Groups
I ended up hitting 20 SQL Saturdays on the nose. As far as user groups go, I had 13 user group presentations, including two remote. One successful goal down.
Speak in 4 Countries
I ended up speaking at events in six countries: the United States, Canada, Germany, Ireland, the United Kingdom, and Australia. Two successful goals down.
Update 4 Older Talks to Use GitPitch
I’ve become a lot more comfortable with GitPitch (to the point where I’m paying for the full product, as I want it around for a while). At the beginning of the year, I listed out a half-dozen talks which would be good to migrate to GitPitch. I moved four of them, rewrote a fifth (on PolyBase), and will probably retire the 6th at some point. Speaking of which…
Retire 4 Talks
Sometimes, it’s good to put a talk out to pasture. I ended up giving up on a few of them this year, though funny enough, I brought my talk on Biml out of retirement due to internal demand.
I missed on my two stretch goals, but did end up hitting all of my main goals, so I’ve got that going for me. Tomorrow, I get to brag about all of the stuff I was doing while missing those stretch goals…
The term “basic analysis” is pretty broad, so we’re going to cover a few concepts today. Most of this is in the realm of Exploratory Data Analysis, looking at the data to gain a better understanding of what you have before you generate hypotheses. This sounds bland and straightforward today, which is an indication of how well Tukey’s book was received and how insightful he was.
In today’s post, we’ll look at exploring our data along five avenues: summary analysis, growth analysis, gaps in data, duplicates and cardinality, and regression analysis. Most of this work will be in R, using the DBI package to connect to SQL Server.
Summary analysis is the easiest of these and generally should be your starting point. Here, you simply want to get an idea of the shape of your data. If you’re in SQL, I tend to start with a SELECT TOP(100) * from the table so I can get an idea of columns, value spread, and data lengths without putting much impact on the server. You can also peruse metadata like data types in sys.columns, foreign key relationships, constraints, and even calling code.
Over in R, we have the summary() function, which gives us a statistical summary of each column in a data frame. We can also use head(), which shows me the first few rows in a data frame. Here is an example of running a few summary functions against the bus table.
What we’ve learned here is that there are 700 buses (which you don’t see directly but can infer from the BusID quartiles). some of them came into service on 1990-01-01, which is probably a default value because our data set starts in 2011. Buses have retirement dates as well, and a bus without a retirement date is still active. In SQL, those would be rows with a NULL value; in R, those are the rows with NA values. There are 468 active buses in our current data set.
We would repeat this type of analysis for each of the major elements in our data set. Let me take a moment and cover my process for finding those “major elements” because I’m all too liable to gloss over it. Here’s what I tend to do when someone hands me a brand new database:
Find the tables with the most rows in them. In SQL Server Management Studio, you can right-click on a database and navigate to Reports –> Standard Reports –> Disk Usage by Table. This will generate a Reporting Services report showing approximate row counts (based on statistics), data size, and index size. Look at the top few tables there. Some of them are probably logging tables, which I tend to come back to later.
Pick one of the large tables which appears to have user-relevant information and perform summary analysis, looking at the top few rows to determine what’s in that table. These large tables tend to be fact-style data, storing relevant measures customers care about.
Recursively find foreign key constraints to other tables and look at those tables. These are often explanatory data, providing context to the fact-style data above. If you need help with recursively finding key constraints, I have a script (with bonus ASCII art) and a longer post as well. If your database has no foreign keys, there are still ways around it like looking at joins in the plan cache.
Repeat step 2 until you run out of big tables. Then move on to medium-sized tables you have not already reviewed.
Creating a database diagram will also be helpful here, especially if foreign key constraints exist.
Now that we’ve summarily covered summary analysis, let’s jump into the next category: growth analysis.
Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.
For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.
COUNT(*) AS NumberOfBuses
FROM dbo.Bus b
INNER JOIN dbo.Calendar c
ON b.DateFirstInService <= c.Date
AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
c.CalendarDayOfYear = 1
AND c.CalendarYear >= 2011
AND c.CalendarYear < 2019
I can show you the SQL results but let’s drop this into R and build a quick and dirty plot.
ggplot(activeBuses, aes(x = CalendarYear, y = NumberOfBuses)) +
labs(x = "Calendar Year", y = "Number of Buses", title = "Number of Buses by Year") +
ylim(0, 500) +
The first line with options is something I do with Jupyter to prevent it from creating huge graphs. From there, we’re creating a scatterplot with a line overlaid, labeling the axes, starting from 0, and using a minimalistic theme. Note that starting from 0 is not required—both line charts and scatter plots can start from points other than 0. I did this to make the steady growth more apparent.
Next, I want to look at the number of invoices per year. We invoice on a per-bus, per-item basis, so I would expect invoice growth to track bus growth reasonably closely. You can argue about economies of scale (maintenance workers become more efficient, you might get bigger discounts on parts, it makes sense to purchase capital equipment to make the per-bus maintenance fees lower, those types of things) but with the bus count growing so steadily, I’d think that invoices would grow similarly. So let’s take a look.
COUNT(*) AS NumberOfInvoices
FROM dbo.LineItem li
INNER JOIN dbo.Calendar c
ON li.LineItemDate = c.Date
Here is the R code:
ggplot(invoicesPerYear, aes(x = CalendarYear, y = NumberOfInvoices)) +
labs(x = "Calendar Year", y = "Number of Invoices", title = "Number of Invoices by Year") +
And the plot:
You can see that invoice growth was fairly steady from 2011 through 2017. Yeah, there are ups and downs but that’s normal in any real data set. The jump in 2018, however, is huge: we’ve effectively doubled the number of invoices despite bus growth being steady. Here’s the plot for expenditures by year, which code I’ve left out for the sake of making you do your own dirty work:
Those extra invoices added about a million dollars over expectations. This is our first indication that something interesting has happened. Note that this is not evidence of fraud, as there can be a number of innocent explanations: maybe the buses need to go through more maintenance because they’re older, maybe regulatory requirements forced more work on the buses, maybe we got a batch of lemons which need more work done on them. There are plenty of potential causes, but this is well outside the realm of noise.
We’ll shelve this for a little bit and look at our next topic, gap analysis.
Gap analysis is something you’d typically run when you care about the lack of a value. For example, accountants tend to get picky about check numbers and invoice numbers being complete. If you go from check 1001 to 1002 to 1004, an accountant wants to know what happened to check 1003. The reason is that if you don’t have a record of 1003, it’s possible that there was embezzlement.
To perform a quick gap analysis on line items, we can use the LEAD() window function, available since SQL Server 2012. Here’s an example of the window function in action:
WITH C AS
li.LineItemID AS CurrentLineItemID,
LEAD(li.LineItemID) OVER (ORDER BY li.LineItemID) AS NextLineItemID
FROM dbo.LineItem li
CurrentLineItemID + 1 AS rangestart,
NextLineItemID- 1 AS rangeend
NextLineItemID - CurrentLineItemID > 1;
Here’s what we get back:
We have several ranges of missing values here, which is a bit concerning, as our invoice numbers should be a complete set. There might be an innocuous reason for this. If we look at sys.columns, we can see that LineItemID is an identity column.
Identity columns are great for auto-incrementing surrogate keys but are less great for auto-incrementing keys with informational context. Let me explain what I mean. If we have line items from 1 to 1000 in the table, the next row we insert will have an ID of 1001 (assuming nobody has changed the seed and our increment value is 1). But what happens if we get an error trying to insert value 1001 and need to roll back the statement? In that case, the value 1001 has been burned and our next insertion attempt will be 1002. This can leave gaps in our data for a totally innocuous reason and without anybody actually knowing.
The same applies to sequence types: it is possible that you fail to insert using a sequence value and might lose that value forever. If you need to track a value like an invoice number, your best bet might be to gin up your own solution. You can create a table which stores your latest used value. Then, when it’s time to use the next value, go into the serializable transaction isolation level and take a lock on the table by beginning a transaction and selecting the value from the table. That will prevent anybody else from using the table and potentially grabbing the same invoice number as you.
In your insertion code, you can then increment the value, insert into the table, and if your operation was successful, update the value in the latest value table. Then close the transaction so other sessions can do their work.
This answer works in a low-throughput situation where you don’t expect more than one or two updates every few seconds. Fortunately, most systems which require this level of scrutiny tend to be fairly low-throughput or at least have relatively low concurrency. A process like generating checks for tens of thousands of employees has periods of high throughput but if you batch it all in one transaction on one session, the problem is still tractable.
I’m going to gloss over duplicates here because I’ll get into it in much more detail when we talk about cohort analysis later. For now, here are a few things I’d like to put in your mind.
What is a Duplicate?
There are two different ways we can think about duplicate data. The first way is exact matches on relevant columns where there is no unique key constraint preventing duplication. Suppose we have a LineItemID (which is just a surrogate key) and an InvoiceNumber on our table. That invoice number should be contiguous and unique for each line item. If we don’t have a unique key constraint on that table, however, it becomes possible for someone to use the same invoice number for two lines.
The other side of a duplicate is something which ought to be the same but isn’t, maybe due to a typo. My favorite example of this happens to come from a bank fraud case from a few years back:
When the Federal Reserve Bank of New York cleared five transactions made by the Bangladesh Bank hackers, the money went in two directions. On Thursday, Feb. 4, the Fed’s system sent $20 million to Sri Lanka and $81 million to the Philippines.
The Sri Lankan transaction contained a small but crucial error: The money was being sent to a bank account in the name of a nonprofit foundation, but the electronic message spelled it “fundation.” That prompted Deutsche Bank, an intermediary in the transaction, and a Sri Lankan bank to contact Bangladesh Bank, which led to the payment being cancelled and the money returned.
Here, “foundation” and “fundation” were supposed to be the same but a small typo made a big difference.
Duplicates and Fraud
In the Wake County fraud case, one measure of duplication is the number of invoices received on a single day. We can’t have a unique key on date and vendor (or date, vendor, and bus in our case) because it’s completely reasonable for a customer, on occasion, to send two invoices on the same day. In the Wake County case, however, they had 24 separate days with at least 50 invoices. 50 goes beyond reasonable.
I’m not going to be able to give much more than a primer here. Regression analysis is the topic of many a book and course in statistics and getting regression right can be a major time sink. Acknowledging that we will remain superficial here, we can still cover some of the ground. In its most basic form, regression is all about determining if there is a relationship between one or more input variables (also known as independent variables) and our output (the dependent variable).
We saw the line graph of invoices by year and of buses by year. My question is how much the number of buses ends up driving the number of invoices. My expectation is that the number of buses is a key factor in the number of invoices we deal with: as we add new buses to the fleet, I’d expect an approximately linear increase in the amount of maintenance work to perform, as well as the number of parts to purchase. We may see fluctuations but I expect to see a trend.
Regression by Month and Year
The first thing I want to do is regress the number of invoices versus buses using monthly data. My thought here is that the number of buses drives the monthly number of invoices and that the number of invoices grows approximately linearly with the number of buses. Let’s try these out.
First, I have my SQL query that I use to populate a data frame:
WITH buses AS
COUNT(*) AS NumberOfBuses
FROM dbo.Bus b
INNER JOIN dbo.Calendar c
ON b.DateFirstInService <= c.Date
AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
c.Date = c.FirstDayOfMonth
AND c.CalendarYear >= 2011
AND c.CalendarYear < 2019
COUNT(*) AS NumberOfInvoices,
SUM(li.Amount) AS TotalInvoicedAmount
FROM dbo.LineItem li
INNER JOIN dbo.Calendar c
ON li.LineItemDate = c.Date
FROM buses b
INNER JOIN expenses e
ON b.FirstDayOfMonth = e.FirstDayOfMonth
Then, I’d like to build a regression. Here is the R code for an Ordinary Least Squares linear regression:
In one function call, I get my linear regression which focuses on tying the number of invoices to the number of buses. I should note that my data is a filter where the date is earlier than 2018. We saw the big jump in invoices in 2018 and that ruins our results. Because I think something’s odd about that data, I’d like to see what it looks like if we factor out 2018 and look at 2011 through 2017. Here’s what I get back:
There are a couple of things to pick out of this. First, our R^2 is 0.45, so we are explaining 45% of the variance in NumberOfInvoices. That’s okay but really not that good. In social science contexts, explaining 45% of human behavior is a really good result. But here we’re explaining expenditures and I’d much rather see 85-95% of the variance explained before I think an expenses model is accurate.
One thing we can do to try to improve the regression is to add features.
Adding Features to Regression
We have two additional features at hand: calendar month and calendar year. Let’s try calendar month first:
The R^2 didn’t move much at all—it went from 45% to 46%. Frankly, that’s noise. At this level, if we’re not seeing a 10% bump (or more) in R^2, I don’t know if I want to include that feature. Notice also that calendar month is not significant according to p-value. We can and should make fun of p values as much as possible, but here’s a case where the results are clear and sensible. Calendar month isn’t a factor in this regression. So let’s remove it and try calendar year.
Now this result is quite interesting. Our R^2 didn’t change but now neither variable is significant! This is a great example of something called multicollinearity, one of the challenges of regression. Put simply, the number of buses increases by about the same number every year, so there is very high correlation between number of buses and calendar year. Running a correlation test against the two, I end up with a value of 0.978.
That is, 97.9% of the variance reflected in buses is also reflected in year. These two variables are co-linear. Because these two variables move almost 1 for 1, it is difficult for the regression algorithm to separate behavior in one versus the other. They’re both fighting to explain the same variance and so both end up with higher p-values. Also of interest is that the R^2 doesn’t change. Multicollinearity doesn’t make your overall predictions worse, but it does make it tougher to tell which independent variables are driving the change.
This is an extreme scenario, mind you, but mutlicollinearity is a common enough occurrence that you will want to be on the lookout for it. The other linear regression amigos are serial correlation (AKA autocorrelation) and heteroskedasticity (my favorite of the three).
Now let’s take a step back, as we’re not getting the job done with regressing at the month level. Instead of grouping by month, I’ve changed the SQL query to include just calendar year and number of buses / invoices. Let’s see how that looks:
I didn’t include the SQL code because it’s a trivial variant on the prior version. Yet I included the trivial variants on the R code because that’s how I roll. Here are my results:
Wow. We went from explaining less than half of all variance to explaining 97% of the variance. That’s a huge difference and is definitely an interesting result. For a fairly mechanical problem like this one, an R^2 of .97 is high but not “shake your head” high. If this were a social sciences problem and I got an R^2 of .97, I’d wonder what I did wrong.
I don’t like that I have so few data points, but even with the low number of data points, our regression output is indicating that there’s something there. We can also run plot(regICAnnualPre2018 and see that our residuals are both positive and negative and a small percentage of the total values:
What this tells us is that we do not see the residuals (that is, estimated – actual) consistently above or below 0, but rather spread out between them. If we saw the residuals consistently over (or under) 0, the residuals would show bias, which can be a problem when performing a regression analysis.
Finally, now that we have a good fit for the pre-2018 data, let’s see what adding 2018 does:
That’s a drop from 97% to 71%. It’s a huge drop. If we have no suspicions about data quality, that kind of drop can be devastating to us: it means our model is no longer a great model. But I do harbor some suspicions because 2018’s values are so much larger that I think there’s something weird going on.
One last note, we can take the annual pre-2018 model and generate a prediction to see what our model thinks 2018’s value ought to have been:
This returns 5362 versus our actual invoice count of 7700. That’s a difference of more than 2000. Again, this isn’t proof of wrongdoing but it helps us put into perspective the scope of what’s going on. It’s a data point that maybe something weird is going on and this is the scale of that weirdness.
In this post, we looked at a number of analytical techniques to gain insight into our data. We focused mostly on high-level aggregates here, which can help us get a basic understanding of our data. In the next post, we’re going to move to another level of analysis: cohort analysis. This will give us a better idea of just what’s going on with our data.
For the past three events, we hosted at William Peace University, which I absolutely enjoyed having as a venue. But with Wake Tech opening a campus just a couple of miles from the center of the Triangle, we wanted to make the trip more convenient for people living out in Chapel Hill and Durham.
The venue is gorgeous, a brand new building which looks fantastic. You should definitely submit a talk or four and come to the best SQL Saturday in the United States…that day…unless someone else launches one after us…
UPDATED 2018-03-13: SQL Server 2017 CU4 fixes this issue. See below.
We call SQL Server ML Services a lot. As in building hundreds of thousands of times a day to build models. It turns out that doing this has a negative effect: ML Services plans end up staying in the plan cache and don’t get removed. Here’s how our plan cache looks:
A plan cache. Precipitous drops are predicated by service restarts.
What happens is that things work fine for a while, until our plan cache hits about 70 GB, after which point we start getting RESOURCE_SEMAPHORE waits on some of our queries and the available space for buffer pool drops to single-digit gigabytes.
This is a problem on SQL Server 2016 and SQL Server 2017. It’s very unlikely to affect most people, as most people don’t do crazy stuff at this scale. But hey, what’s the fun in having a server of my own if I can’t bring it to its knees every once in a while?
The first thing that you might do here is try to run something like DBCC FREEPROCCACHE or maybe DBCC FREESYSTEMCACHE('SQL Plans') WITH MARK_IN_USE_FOR_REMOVAL; but neither of those did anything. It appears that R/ML Services plans are not marked for removal and will not clear, no matter how many times you try to flush the cache.
For now, the workaround I have is to restart the SQL Server service occasionally. You can see that I have done it twice in the above screenshot. Our application is resilient to short database downtimes, so this isn’t a bad workaround for us; it’s just a little bit of an annoyance.
One thing to keep in mind if you are in this scenario is that if you are running ML Services hundreds of thousands of times a day, your ExtensibilityData folders might have a lot of cruft which may prevent the Launchpad service from starting as expected. I’ve had to delete all folders in \MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData\MSSQLSERVER01 after stopping the SQL Server service and before restarting it. The Launchpad service automatically does this, but if you have a huge number of folders in there, the service can time out trying to delete all of them. In my experience at least, the other folders didn’t have enough sub-folders inside to make it worth deleting, but that may just be an artifact of how we use ML Services.
I have worked with Microsoft on the issue and they’re going to release a patch in a future SQL Server 2017 CU to fix this issue. I’m not sure about SQL Server 2016 and also don’t know exactly when this patch will ship, but it’s working through the pipeline and I’m happy for that.
Microsoft has released SQL Server 2017 CU4, which fixes this buffer pool issue. After the patch, my SQL plan cache has not grown beyond 2 GB after 4 days, whereas prior to the patch, it’d be in the 50-60 GB range by then.
This post is a bit late, as I actually blew past 50 SQL Saturdays earlier in the year, but now that the year is over, I wanted to reflect just a little bit on why I enjoy speaking at SQL Saturdays.
I’ve spoken at 57 of them so far and want to break 75 in 2018. Here’s the year-by-year breakdown:
I love that the institution of SQL Saturday is popular enough that someone can attend 20+ events a year all all around the world. In my case, the vast majority of my travel is inside the United States, but I get to see parts of the country that I otherwise couldn’t (or wouldn’t think to visit), and I like that.
If you’ve thought about speaking at a SQL Saturday, I highly recommend giving it a shot. You don’t have to be a great speaker, and definitely don’t need to be a natural speaker in order to present. That’s exactly the situation I was in back in 2013, when I gave my first SQL Saturday presentation. That talk had a total of four attendees and it wasn’t a polished talk at all—which is a nice way of saying that even my recollection of the talk was that it wasn’t very good…though at least I still had all 4 attendees at the end of the talk! But like with everything else, you get better through practice and training.
You also don’t need to criss-cross the country; start with a local conference if there are any, or a nearby regional conference if you can get away with it. If you speak at one a year, you’re still getting good experience presenting and helping share your knowledge with the community, as well as picking up additional information and potentially making great contacts and friends.
It’s been another busy year for me presenting. Over the course of 2017, I gave a total of 50 talks at 42 events. It’s been a lot of fun getting to travel around the world, hitting places as far apart as Vienna, Austria and Sydney, Australia. I’m hoping to keep up this pace for next year as well.
Now, to look at the goals I had set for the year. As a quick reminder of my 2017 goals:
Speak at 20 SQL Saturdays and 10 user groups
Speak at 2 paid conferences
Give 6 webinars
Do a full-length, pictures-only talk
I wasn’t quite as successful this year as I was last year. Let’s see how I did:
Speak at 20 SQL Saturdays and 10 user groups
I ended up speaking at 22 SQL Saturdays this year (and would have been 23, had I not gotten sick the day before SQL Saturday Pittsburgh), so I beat that part of the goal. As far as user groups go, I barely eeked out speaking at 10 distinct user groups. So a big green checkmark for this goal.
Missed it by that much. I ended up doing 5 webinars this year. I was pushing for them earlier in the year but slacked off in the middle and that made all the difference.
Do a full-length, pictures-only talk
I haven’t gotten that far yet. I did end up doing a 10-minute talk which was dominated with pictures, but that’s not quite good enough to count.
On the plus side, I’ve been focusing on more graphics-heavy talks, so at least I’m moving toward a better equilibrium on that front.
I know I made my 2017 goals pretty ambitious, so I’m happy that I was able to do two of them. Doing more webinars is high on my agenda, particularly now that I have a decent recording setup. And that will also let me create more videos of my talks.