What: Data Platform Summit 2020
Where: Online
When: November 30th through December 8th
Register at the Data Platform Summit site.
What I’m Presenting
Monday, December 7th and Tuesday, December 8th, 12 PM through 4 PM EST — Data Analysis in Python and SQL
This is a paid post-event training. I promise to spend no more than about 20 minutes complaining about how data analysis is a lot easier to do in R… But I’m really looking forward to this—two separate 4-hour days digging into data analysis techniques using T-SQL as well as Python.
I’m also scheduled to present a regular session, though that hasn’t been announced yet, so stay tuned for that.
Here is the PASS Summit writeup if you’d like more information. And if you do decide to register for this full-day training, you’ll save $200 compared to what it would have cost had we attended PASS Summit in person.
The Curated Data Platform
I am really excited about this talk. Several years ago, I had a talk called Big Data, Small Data, and Everything in Between, and the idea of the talk was to walk through various data platform technologies and see where they fit.
That talk isn’t that out of date, but I decided to revamp it entirely, taking advantage of my insanity (er, dedication) as a Curator to give it a better name and a better theme.
The idea now is, let’s take a fictional but realistic company, walk through the types of data problems it experiences, and see which data platform technologies solve its problems, along with the biggest players in those spaces, and some reference architectures to boot.
The talk is currently under development and I plan to revise it a fair bit between now and Summit, but here’s a sneak peek of the agenda:
6:00 PM — 7:30 PM EDT — IoT and Machine Learning in Azure
This won’t be a formal talk so much as a discussion of IoT strategies around Azure. I’ll talk about combining several services in Azure, point out where the pain points are, and discuss a few alternative strategies for processing and analyzing data.
I wanted to cover something which has bitten me in two separate ways regarding SQL Server Machine Learning Services and Resource Governor.
Resource Governor and Default Memory
If you install a brand new copy of SQL Server and enable SQL Server Machine Learning Services, you’ll want to look at sys.resource_governor_external_resource_pools:
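A minimal sketch of that lookup, assuming the documented columns of the catalog view. The query is held in a Python string so it can be handed to whatever client you use; the example row and the describe_pool helper are hypothetical and only illustrate the shape of the result.

```python
# T-SQL to inspect the limits on external resource pools.
# Column names are assumed from the documented catalog view.
EXTERNAL_POOLS_QUERY = """
SELECT
    name,
    max_cpu_percent,
    max_memory_percent,
    max_processes
FROM sys.resource_governor_external_resource_pools;
""".strip()


def describe_pool(row):
    """Format one result row (a dict keyed by column name) for display."""
    return (
        f"{row['name']}: max CPU {row['max_cpu_percent']}%, "
        f"max memory {row['max_memory_percent']}%"
    )


# Hypothetical row mirroring the defaults discussed in this post.
# Real values come from running the query against your own instance.
example_row = {
    "name": "default",
    "max_cpu_percent": 100,
    "max_memory_percent": 20,
    "max_processes": 0,
}
print(describe_pool(example_row))
```

You can also paste the query text straight into Management Studio; the Python wrapper is only there for convenience.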
By default, SQL Server caps R and Python scripts at 20% of available memory through the max memory percent setting on the default external pool. The purpose of this limit is to prevent you from hurting server performance with expensive external scripts (like, say, training large neural networks on a SQL Server).
Here’s the kicker: this affects you even if you don’t have Resource Governor enabled. If you see out-of-memory exceptions in Python or error messages about memory allocation in R, I’d recommend bumping this max memory percent up above 20, and I have scripts to help you with the job. Of course, making this change assumes that your server isn’t stressed to the breaking point; if it is, you might simply want to offload that work somewhere else.
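As a rough illustration of what such a change looks like (this is not the linked script, just a sketch), the helper below builds the T-SQL to raise the memory cap on an external pool. The function name build_memory_fix and the value 50 are assumptions for the example; pick a percentage that suits your server.

```python
def build_memory_fix(max_memory_percent, pool_name="default"):
    """Return T-SQL that raises an external pool's memory cap.

    Hypothetical helper: it only builds the statement text and
    does not connect to or change any server.
    """
    if not isinstance(max_memory_percent, int) or not 1 <= max_memory_percent <= 100:
        raise ValueError("max_memory_percent must be an integer from 1 to 100")
    if '"' in pool_name:
        raise ValueError("pool_name must not contain double quotes")
    return (
        f'ALTER EXTERNAL RESOURCE POOL "{pool_name}" '
        f"WITH (max_memory_percent = {max_memory_percent});\n"
        # The new limit takes effect once Resource Governor reconfigures.
        "ALTER RESOURCE GOVERNOR RECONFIGURE;"
    )


# 50 is an illustrative value, not a recommendation.
print(build_memory_fix(50))
```

Run the generated statements as a sysadmin, then re-check the catalog view to confirm the new limit.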
Resource Governor and CPU
Notice that by default, the max CPU percent for external pools is 100, meaning that we get to push the server to its limits with respect to CPU.
Well, what happens if you accidentally change that? I found out the answer the hard way!
In my case, our servers were accidentally scaled down to 1% max CPU utilization. The end result was that even something as simple as print("Hello") in either R or Python would fail after 30 seconds. I thought it had to do with the Launchpad service causing problems, but after investigation, this was the culprit.
The trickiest part about diagnosing this was that the error messages gave no indication of what the problem was: the error itself was a vague “could not connect to Launchpad” message, and the Launchpad error logs didn’t have any entries for the failed queries. So that’s one more thing to keep in mind when troubleshooting Machine Learning Services failures.
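Since the logs won't point you at the cause, one option is to check the pool settings directly. The sketch below flags external pools whose CPU cap looks suspiciously low and builds the T-SQL to reset it to 100. The function names, the threshold of 10, and the sample row are all assumptions for illustration; the rows would come from querying sys.resource_governor_external_resource_pools.

```python
def find_throttled_pools(rows, min_cpu_percent=10):
    """Return names of pools whose CPU cap is below min_cpu_percent.

    rows: iterable of dicts with 'name' and 'max_cpu_percent' keys.
    The default threshold of 10 is an arbitrary choice for this sketch.
    """
    return [r["name"] for r in rows if r["max_cpu_percent"] < min_cpu_percent]


def build_cpu_reset(pool_name="default"):
    """Return T-SQL restoring an external pool's CPU cap to 100 percent."""
    if '"' in pool_name:
        raise ValueError("pool_name must not contain double quotes")
    return (
        f'ALTER EXTERNAL RESOURCE POOL "{pool_name}" '
        "WITH (max_cpu_percent = 100);\n"
        "ALTER RESOURCE GOVERNOR RECONFIGURE;"
    )


# Hypothetical row mirroring the 1% situation described above.
rows = [{"name": "default", "max_cpu_percent": 1}]
for name in find_throttled_pools(rows):
    print(build_cpu_reset(name))
```

A check like this is cheap enough to add to a regular health check, which beats finding out from a 30-second timeout.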