Cross-Database Memory-Optimized Queries Still Not Available

In SQL Server 2014, we learned that memory-optimized tables cannot join to tables outside the current database.  I wanted to see if this had changed in SQL Server 2016 CTP 3.2.

CREATE TABLE [dbo].[MOTest]
(
	MOTestID BIGINT IDENTITY(1,1) NOT NULL,
	ProductSubcategoryName VARCHAR(50) NOT NULL,
	CONSTRAINT [PK_MOTest] PRIMARY KEY NONCLUSTERED
	(
		MOTestID ASC
	)
) WITH(MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY)
GO

INSERT into dbo.MOTest
(
	ProductSubcategoryName
)
VALUES
('Mountain Bikes'),
('Road Bikes'),
('Touring Bikes'),
('Cranksets'),
('Chains'),
('Something Else');

SELECT TOP (100)
	*
FROM AdventureWorks2014.Production.ProductSubcategory ps
	INNER JOIN dbo.MOTest m
		ON m.ProductSubcategoryName = ps.Name;

My result:

(6 row(s) affected)
Msg 41317, Level 16, State 5, Line 24
A user transaction that accesses memory optimized tables or natively compiled modules cannot access more than one user database or databases model and msdb, and it cannot write to master.

In other words, we still cannot cross databases with memory-optimized tables.

New Talk Planned

I’m introducing a new talk for 2016:  Securing SQL Server.  Here’s my abstract:

A default SQL Server installation is reasonably secure, but “reasonably secure” doesn’t cut it anymore in an era in which one bad line of code, one weak password, or one open port can result in your customer database ending up on Pastebin.  In this talk, we will look at different methods of securing a SQL Server instance, from venerable (principle of least privilege, Transparent Data Encryption) to novel (Always Encrypted, row-level security).  These tools and techniques will show us ways for developers, database administrators, and network specialists to work together to secure corporate assets and prevent CISOs from ending up on the nightly news.

My plan is to hit on a few topics, introduce some resources to learn more about the topic, and move on.  The goal is to provide an overview of some of the tools and technologies available, including:

  • Applying the principle of least privilege to SQL Server logins, users, and roles.
  • Implementing Transparent Data Encryption.
  • Implementing Always Encrypted in SQL Server 2016.
  • Implementing row-level security, both in SQL Server 2016 and prior to 2016.
  • Using network segmentation to limit access to SQL Server instances.
  • Writing good, parameterized SQL to protect against SQL injection.
  • Reducing the SQL Server surface area.

There are entire books on SQL Server security, so a one-hour talk can’t cover everything.  Nonetheless, my intent with this talk is to give everybody who attends at least one new resource.

Checking SSRS Report Usage

At work, I was recently asked whether some reports were still in use. These are SQL Server Reporting Services reports, so fortunately there’s an easy way to find that out. Connect to the SQL Server instance hosting Reporting Services and open the report server database (ReportServer by default). Then, the following query will get reports run over the past week:

SELECT
	el.ReportPath,
	SUM(CASE WHEN el.RequestType = 'Interactive' THEN 1 ELSE 0 END) AS Interactive,
	SUM(CASE WHEN el.RequestType = 'Subscription' THEN 1 ELSE 0 END) AS Subscription
FROM dbo.ExecutionLog2 el
WHERE
	el.TimeStart > DATEADD(WEEK, -1, GETUTCDATE())
GROUP BY
	el.ReportPath
ORDER BY
	el.ReportPath;

This also breaks out interactive report requests versus subscriptions to see if people are seeking out these reports or if you’re pushing them.

Presentation Goals For Next Year

This was a big year for me as a presenter, having the opportunity to present at a dozen SQL Saturdays, not to mention SQL Server and .NET user group meetings in the Raleigh area as well as Lynchburg, Virginia.  Over the course of the next year, I want to expand upon my hard-earned experience and take the next step as a speaker.  Here are my goals:

  1. Present four new talks.
  2. Present to two brand new audiences.
  3. Speak at 12 SQL Saturdays and 8 user groups.
  4. Get paid to speak at a conference.

Present Four New Talks

Over the past year and a half, I’ve honed two talks to the point that I’m happy to give them anywhere, any time.  I also have a couple of talks that I’ve given in the past and would like to tune to the point that I’m totally comfortable with them.  That range is okay, but I want to expand and give some brand new talks.  Some of it is me embracing the new, shiny things (R and Hadoop), as well as some of the “weird” side of the SQL world, like BIML.

Why four talks?  Because that sounds like an achievable goal.  It’s one talk per quarter, with the expectation that I can give each talk two or three times, sharpening it as I go along.  I’ll have plenty of material for user groups, and this will make me a more enticing option to bring to a user group or conference, because I’ll have a broader range of topics.  This will also help me grow as a speaker.

In order for me to grow as a speaker, I need to get out of my comfort zone and present to some new audiences, and the easiest way to do that is to build up talks geared more for Python developers, data analysts, Hadoop developers, and people who know those technologies better than I do, but may not know how to tie them to SQL Server.  This leads right into my next topic.

Present To Two New Audiences

The RTP area has an abundance of user groups covering all kinds of technical interests.  Historically, I’ve presented at the .NET User Group’s data SIG as well as the local PASS Chapter.  I really enjoy both of those groups, but I’m also a member of several other user groups, including PowerShell, Hadoop, and research & analytics groups.  I’ve never presented before these audiences, but as I get to know people in those fields a little better and begin to understand the world according to a Hadoop developer or administrator, or what makes a data analyst tick, it will help me grow in my career and open up new interests.  This is an exciting prospect.

To these audiences, I think the biggest thing I have to bring to the table is my knowledge of the SQL Server side of things.  This will let me focus on topics like how to integrate SQL Server + Hadoop (using PolyBase, SSIS, Sqoop, etc.) or SQL Server + R (using RODBC or SQL Server 2016).  A fair percentage of the audience may not use SQL Server, but I want to entertain and inform, even if it’s not quite in their wheelhouse—that’s what I do with Hadoop, after all, when I bring it to audiences of .NET and SQL developers who haven’t yet seen the elephant in the room.

I’d love to be able to expand from there and become more than “the SQL Server guy who knows a little X,” but the only way to do that, short of a career change, is to start at the beginning, make these connections, and learn from others in these communities.  And like the saying goes, there’s no better way to learn than to teach.

Speak At 12 SQL Saturdays And 8 User Groups

Holy moley, that’s a lot of talks.  I had the great fortune to speak at SQL Saturdays across the eastern US and Midwest, and it was hard at times.  2016 is going to be the year in which I take those hard-earned lessons from the past year and push a little further.  I love helping people learn, really enjoy traveling to speak, and have grown to appreciate the value of making connections with other speakers, community leaders, and conference attendees.

So where do I plan to speak?  Obviously, the answer depends upon who selects me, but I’d love to repeat at the SQL Saturdays I hit last year, as well as add some places like Kansas City (for the barbecue crawl and to see some friendly faces in the area) and China.  The wife and I are thinking of a trip to China next year, and the opportunity to present at a conference in Peking, Shanghai, Tianjin, or someplace else would fit perfectly with our plans.

A little closer to home, I know there are several user groups in the North Carolina and Virginia area, and introducing myself to those SQL Server User Groups and .NET User Groups would be a great way to give back to some of those places.

Get Paid To Speak At A Conference

I don’t want to turn speaking into a full-time job, but I think the next big step for me is to get accepted to speak at a conference in which I get paid to speak.  I love SQL Saturdays and what they bring to the SQL Server community.  They’re a fantastic way of seeing the world and getting good, free training.  But for me as a speaker, the next step is to get accepted to a conference which will pay me to come out.  It’s not so much about the money (although hey, I like money), but rather the vindication in knowing that I have the skills to teach and train people willing to fork out hard-earned cash.  I want to show that I can earn that respect.

Note that “paid conference” doesn’t have to be a huge one like PASS Summit or DevConnections.  I may submit for some of those, but I expect that I’m at least a year or two away from serious consideration.  I need to hone my skill and improve my performances before I think I’m ready to speak at the big show.

Presentation Year In Review

This year was my breakout year for presenting.  Here’s a quick look by the numbers.

User Group Meetings

I presented at the following user group meetings this year:

  1. TriNUG Data SIG, January 21st.
  2. TriPASS, March 17th.
  3. Lynchburg SQL Server User Group, March 26th.
  4. TriNUG Data SIG, April 22nd.
  5. F#/Analytics SIG, July 28th.
  6. TriNUG lightning talk, December 9th.
  7. TriPASS, December 15th.

SQL Saturdays

I presented at the following SQL Saturdays this year:

  1. Cleveland, February 7th.
  2. Tampa, February 28th.
  3. Richmond, March 21st.
  4. Jacksonville, May 9th.
  5. Rochester, May 16th.
  6. Nova Scotia, June 8th.
  7. Chattanooga, June 27th.
  8. Columbus, July 11th.
  9. Indianapolis, August 8th.
  10. Omaha, August 15th.
  11. Pittsburgh, October 3rd.
  12. Raleigh, October 10th.  (Yeah, I didn’t technically present here, but I ran it, so I’m counting it!)
  13. Charlotte, October 17th.

Podcasts

I participated in my first podcast this year, when Carlos Chacon interviewed me for the SQL Data Partners podcast.

Benford’s Law

One of the talks I went to at SQL Saturday Charlotte was Bill Pearson’s talk on forensic analysis.  During the talk, Pearson mentioned Benford’s Law, and when I heard about it, I knew I wanted to investigate this phenomenon further.

Benford’s Law

The basic idea behind Benford’s Law is that, in many data sets, smaller leading digits appear more frequently than larger ones.  The digit 1 appears as the leading digit almost three times as frequently as it would under a uniform distribution, whereas the digits 7 through 9 each appear approximately half as frequently.

Benford’s Law is not a universal law, but it does hold in a large number of scenarios, including financial data.  Pearson mentioned using Benford’s Law to find fraudulent journal entries, although in that scenario, analysts will look at the second and later digits as well as the first.

To get an idea of what a Benford distribution looks like, here’s some quick R code:

bp <- data.frame(
  Digit = 1:9,
  Frequency = c(0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046)
)

plot(bp)
lines(bp)

And here’s the ugly plot:

BenfordProbabilities
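
Those frequencies aren’t arbitrary:  Benford’s Law says the leading digit d should appear with probability log10(1 + 1/d).  A quick sanity check in R reproduces the vector I hard-coded above:

# Benford's expected leading-digit frequencies: P(d) = log10(1 + 1/d)
round(log10(1 + 1 / (1:9)), 3)
# [1] 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046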

Testing Benford’s Law

I’m going to look at three separate sets of data to check Benford’s Law.  First, we’ll look at a Fibonacci sequence to see if it follows.  Second, we’ll look at my local HOA’s budget figures for 2013-2015.  Finally, I’ll grab random numbers.

Fibonacci Sequences

Generating a Fibonacci sequence in R is pretty easy, and so is getting the leading digit of each element in a vector.  Let’s combine the two:

library(MASS)

len <- 1000
fibvals <- numeric(len)
fibvals[1] <- 1
fibvals[2] <- 1
for (i in 3:len) {
    fibvals[i] <- fibvals[i-1] + fibvals[i-2]
}

firstdigit <- function(k) {
    as.numeric(head(strsplit(as.character(k), '')[[1]],n=1))
}

fibfirst <- sapply(fibvals, firstdigit)
truehist(fibfirst, nbins=10)

This code snippet generates the first thousand Fibonacci numbers and uses the MASS library’s truehist() function to generate a histogram of the leading digits 1-9.  The end result looks like this:

FibonacciBenford

Notice that this histogram fits very nicely with the plot above, showing that Fibonacci sequences will tend to follow Benford’s Law.
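
To put a number on “fits very nicely,” we can compare the observed leading-digit proportions against the Benford probabilities directly.  This assumes the fibfirst vector from the snippet above is still in scope:

# Observed first-digit proportions across the first thousand Fibonacci numbers
round(prop.table(table(fibfirst)), 3)

# Benford's expected frequencies, for comparison
round(log10(1 + 1 / (1:9)), 3)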

Budget Values

Each year, my HOA announces its annual budget.  I decided to grab the individual budgeted account values from the 2013, 2014, and 2015 budgets and apply the same analysis to each.  There are 107 line items in the budget, and here’s the breakdown by year:

library(MASS)

hoa <- read.table('http://www.catallaxyservices.com/media/blog/HOAValues.txt',header=T,sep='\t',quote="")

firstdigit <- function(k){
  as.numeric(head(strsplit(as.character(k), '')[[1]],n=1))
}

hoa$F2013 <- sapply(hoa$X2013,firstdigit)
hoa$F2014 <- sapply(hoa$X2014,firstdigit)
hoa$F2015 <- sapply(hoa$X2015,firstdigit)

f2013 <- subset(hoa, hoa$F2013 > 0)
f2014 <- subset(hoa, hoa$F2014 > 0)
f2015 <- subset(hoa, hoa$F2015 > 0)

truehist(f2013$F2013, nbins=10, xlab = "Leading Digit")
title("2013 Budget")
truehist(f2014$F2014, nbins=10, xlab = "Leading Digit")
title("2014 Budget")
truehist(f2015$F2015, nbins=10, xlab = "Leading Digit")
title("2015 Budget")

This code is just a little bit more complex than the other code above, and that’s because I need to filter out the $0 budget items. When I do that, I get the following graphs:

2013Budget

2014Budget

2015Budget

So, these results don’t quite follow Benford’s Law.  Instead of a smooth downward curve on the histogram, we see some jitter, as well as a consistent bump at the digit 7, meaning a lot of $7,000-$7,999 or $70,000-$79,999 entries.  Based on this, I might wonder whether our treasurer likes that number more than others, or whether it’s just coincidence.
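
Eyeballing histograms only goes so far, though.  If I wanted something a little more formal, one option would be a chi-squared goodness-of-fit test against the Benford probabilities.  Here’s a minimal sketch for the 2015 figures, assuming the f2015 data frame from the block above is still around (with only around 100 line items, expect R to warn about small expected counts):

# Benford's expected probabilities for leading digits 1-9 (these sum to 1)
benford <- log10(1 + 1 / (1:9))

# Observed counts of each leading digit in the 2015 budget
observed <- table(factor(f2015$F2015, levels = 1:9))

# Goodness-of-fit test: is the 2015 distribution consistent with Benford's Law?
chisq.test(observed, p = benford)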

Random Numbers

Unlike with the other two data sets, my expectation coming in is that we should see a fairly uniform distribution of leading digits.  So let’s generate some random numbers:

library(MASS)

firstdigit <- function(k){
  as.numeric(head(strsplit(as.character(k), '')[[1]],n=1))
}

rn <- runif(10000, 1, 999999)
rnf <- sapply(rn,firstdigit)
truehist(rnf, nbins=10)

This generates a plot like so:

RandomNumbers

Moral of the story:  don’t use a random number generator to generate financial data…

Conclusion

Benford’s Law is a rather interesting concept.  You can find that it applies to a large number of data sets, including things as disparate as per-capita GDP and Twitter follower counts.  In today’s post, we saw how it applies to Fibonacci sequences and (with a bit of noise) to a relatively small set of budget items.

Using R For Quick Analysis

Before I got accustomed to using R, I would use Excel as my primary means for quick analysis.  Between pivot tables and simple graphs, it generally answered my questions.  The other day, however, I wanted to satisfy a bit of a curiosity, and found R to be a fantastic tool to get what I wanted.  Read on for two examples of plotting made quick.

Call Logging Trends

Note that I’m writing this blog post on November 21st, so Black Friday hasn’t hit yet.

At my employer, late November and early December make up the peak busy season, as we support retailers whose primary revenue-generating period is the Thanksgiving-to-Christmas stretch.  My expectation, therefore, was that we were pushing our systems harder, but I was curious as to whether things had really cranked up yet.

Over the past year, I have been working on a solution that can answer this kind of question:  aggregated call logging.  For years, we’ve collected data on a subset of our database calls, but due to the way this data was stored and processed, we only kept about a month’s worth of information.  I took the processing and long-term storage out of SQL Server and put it into Hadoop, bringing back aggregated information which I can store indefinitely because the row count is so much smaller.  The primary aggregation is called CallLogDailyAggregate; its grain is irrelevant to this discussion except to say that it includes the log date, and it has a measure called TotalNumberOfCalls, which does exactly what it sounds like.

Taking this aggregation, I can pull data from SQL Server and plot it in R very quickly.  First, because I’m running this from a Windows PC, I need to open the ODBC Data Sources and create a DSN.  I called mine “MONDB.”  From there, we can run the following bit of code:

library(RODBC)

logdata <- odbcConnect("MONDB")
res <- sqlQuery(logdata, "SELECT LogDate, SUM(TotalNumberOfCalls) AS NumberOfCalls FROM dbo.CallLogDailyAggregate clda where clda.LogDate >= '2015-08-01' GROUP by LogDate order by logdate")

plot(res)
reg <- lm(NumberOfCalls~LogDate, res)
abline(reg)

This builds a simplistic but very informative plot:

CallLog

This plot is ugly, but it does what we need it to do:  it shows that yes, over the past two months, we’ve seen a solid increase in the number of calls, moving from ~3.5 million per day up to ~4.5 million per day.  Note that there are a few outliers, including one in late September.  Most of these are probably just missing data that would need to be reprocessed, and the one in November is due to the fact that the day is underway, so we’ve only pulled in a few hours’ worth of data.  Certainly there’s a lot I could do to make this plot look nicer, and it’s not something that I would show upper management in its current form, but to satisfy my itch, I only need a couple lines of code—I don’t really even need the plot line, as I can eyeball the trend.
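
If I did want to put a number on the trend, the regression object from the snippet above already has one.  Assuming LogDate comes back from RODBC as a Date column, the slope of the fitted line is the estimated change in daily call volume per day:

# Slope of the fitted trend line (additional calls per day)
coef(reg)["LogDate"]

# Or the full model summary, including R-squared
summary(reg)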

CPU Usage Over Time

Glenn Berry has a fantastic set of DMV queries, and I use them quite regularly.  One query I used to troubleshoot a system issue is his CPU utilization over time query.  Here is a slightly cleaned up version of it:

DECLARE
	@ts_now BIGINT =
	(
		SELECT
			cpu_ticks / (cpu_ticks / ms_ticks)
		FROM sys.dm_os_sys_info WITH (NOLOCK)
	);

SELECT TOP (256)
	SQLProcessUtilization,
	SystemIdle,
	100 - SystemIdle - SQLProcessUtilization AS [OtherProcessUtilization],
	DATEADD(ms, -1 * (@ts_now - [TIMESTAMP]), GETDATE()) AS [EventTime]
FROM
	(
		SELECT
			record.value('(./Record/@id)[1]', 'int') AS record_id,
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
			AS [SystemIdle],
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
			AS [SQLProcessUtilization],
			[TIMESTAMP]
		FROM
		(
			SELECT
				[TIMESTAMP],
				CONVERT(XML, record) AS [record]
			FROM sys.dm_os_ring_buffers WITH (NOLOCK)
			WHERE
				ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
				AND record LIKE N'%<SystemHealth>%'
		) AS x
	) AS y
ORDER BY
	record_id DESC OPTION (RECOMPILE);

This query gives us back data covering roughly the past 256 minutes and indicates CPU pressure from SQL Server as well as from other processes on the server. The problem is that it comes back as 256 rows of numbers, which is hard for humans to digest. Fortunately, we have some R skills at our disposal:

library(RODBC)

logdata <- odbcConnect("MONDB")
res <- sqlQuery(logdata, "DECLARE
	@ts_now BIGINT =
                (
                SELECT
                cpu_ticks / (cpu_ticks / ms_ticks)
                FROM sys.dm_os_sys_info WITH (NOLOCK)
                );
                
                SELECT TOP (256)
                SQLProcessUtilization,
                SystemIdle,
                100 - SystemIdle - SQLProcessUtilization AS [OtherProcessUtilization],
                DATEADD(ms, -1 * (@ts_now - [TIMESTAMP]), GETDATE()) AS [EventTime]
                FROM
                (
                SELECT
                record.value('(./Record/@id)[1]', 'int') AS record_id,
                record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
                AS [SystemIdle],
                record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
                AS [SQLProcessUtilization],
                [TIMESTAMP]
                FROM
                (
                SELECT
                [TIMESTAMP],
                CONVERT(XML, record) AS [record]
                FROM sys.dm_os_ring_buffers WITH (NOLOCK)
                WHERE
                ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
                AND record LIKE N'%<SystemHealth>%'
                ) AS x
                ) AS y
                ORDER BY
                record_id DESC OPTION (RECOMPILE);")

plot(res$EventTime,res$SQLProcessUtilization,type="l",col="#0072B2",ylim=c(0,100))
lines(res$EventTime,res$SystemIdle,col="#D55E00")
lines(res$EventTime,res$OtherProcessUtilization,col="#CC79A7")

This looks like a lot of code, but honestly, most of that is the SQL query.  When we use plot, it creates a quick and dirty graph.  In this case, I’m creating a line graph for SQL Server process utilization.  I then use the lines() function to plot the idle time and other process time, and I’m specifying colorblind-safe colors.  Here’s what my graph looks like against MONDB this fine morning:

CPUPlot

Again, if I were to send this to higher-level management, I’d clean it up, but if all I want to know is whether we’re seeing CPU spikes, this is good enough.  We can see here that MONDB is running fine, with less than 50% CPU utilization over this time frame and nothing else running which could steal CPU resources away from SQL Server.
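
Cleaning it up wouldn’t take much, either.  The biggest omissions in the quick-and-dirty version are axis labels and a legend, which base R won’t add for you here.  Here’s a sketch of how I might dress up the same res data frame:

# Re-plot with axis labels, then add a legend mapping colors to series
plot(res$EventTime, res$SQLProcessUtilization, type = "l", col = "#0072B2",
     ylim = c(0, 100), xlab = "Event Time", ylab = "Percent of CPU")
lines(res$EventTime, res$SystemIdle, col = "#D55E00")
lines(res$EventTime, res$OtherProcessUtilization, col = "#CC79A7")
legend("topright",
       legend = c("SQL Server", "Idle", "Other Processes"),
       col = c("#0072B2", "#D55E00", "#CC79A7"),
       lty = 1)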

Conclusion

You don’t need to be a data scientist to take advantage of data analysis techniques, and as database administrators, we already have a wealth of data available to us.  Sometimes you’ll get tools which generate pretty graphs, but maybe your company can’t afford them, or maybe they’re only on a certain subset of servers.  In that case, don’t be afraid to pull out RStudio and write a few SQL queries to spot some trends and hopefully make your life easier.