Using R For Quick Analysis

Before I got accustomed to using R, I would use Excel as my primary means for quick analysis.  Between pivot tables and simple graphs, it generally answered my questions.  The other day, however, I wanted to satisfy a bit of a curiosity, and found R to be a fantastic tool to get what I wanted.  Read on for two examples of plotting made quick.

Call Logging Trends

Note that I’m writing this blog post on November 21st, so Black Friday hasn’t hit yet.

At my employer, late November and early December are the peak busy season, as we support retailers whose primary revenue-generating event is the Thanksgiving-to-Christmas season.  My expectation, therefore, is that we are pushing our systems harder, but I was curious as to whether that has really cranked up yet.

Over the past year, I have been working on a solution that lets me answer this kind of question:  aggregated call logging.  For years, we’ve collected data on a subset of our database calls, but due to the way this data was stored and processed, we only kept about a month’s worth of information.  I took the processing and long-term storage out of SQL Server and put it into Hadoop, bringing back aggregated information which I can store indefinitely because the row count is so much smaller.  The primary aggregation is called CallLogDailyAggregate, whose grain is irrelevant to this discussion except to say that it includes the log date and has a measure called TotalNumberOfCalls, which does exactly what it sounds like.

Taking this aggregation, I can pull data from SQL Server and plot it in R very quickly.  First, because I’m running this from a Windows PC, I need to open the ODBC Data Sources and create a DSN.  I called mine “MONDB.”  From there, we can run the following bit of code:

library(RODBC)

# Connect via the DSN created in ODBC Data Sources.
conn <- odbcConnect("MONDB")

# Pull daily call totals starting in August.
res <- sqlQuery(conn, "SELECT LogDate, SUM(TotalNumberOfCalls) AS NumberOfCalls FROM dbo.CallLogDailyAggregate clda WHERE clda.LogDate >= '2015-08-01' GROUP BY LogDate ORDER BY LogDate;")

# Scatter plot of calls by date, with a linear trend line layered on top.
plot(res)
reg <- lm(NumberOfCalls ~ LogDate, res)
abline(reg)
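As an aside:  if you would rather have an actual number than an eyeballed slope, summary() on the regression object reports one.  This is a minimal check, not a rigorous model, and it assumes LogDate comes back as a date type, so that the coefficient reads as calls per day:

# Inspect the fitted trend; the LogDate coefficient approximates
# the change in NumberOfCalls per day (assuming LogDate is a date type).
summary(reg)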

Either way, the plot() call builds a simplistic but very informative plot:

[Plot: CallLog (daily call totals with trend line)]

This plot is ugly, but it does what we need it to do:  it shows that yes, over the past two months, we’ve seen a solid increase in the number of calls, moving from ~3.5 million per day up to ~4.5 million per day.  Note that there are a few outliers, including one in late September.  Most of these are probably just missing data that would need to be reprocessed, and the one in November is due to the fact that the day is still underway, so we’ve only pulled in a few hours’ worth of data.  Certainly there’s a lot I could do to make this plot look nicer, and it’s not something I would show upper management in its current form, but to satisfy my itch, I only need a couple of lines of code; I don’t really even need the trend line, as I can eyeball the trend.
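If I did need to dress this up for management, a few more lines would get most of the way there.  Here is a minimal sketch of a cleaner version, assuming the ggplot2 package is installed (the labels are my own):

library(ggplot2)

# Same data, nicer presentation: points plus a fitted linear trend.
ggplot(res, aes(x = LogDate, y = NumberOfCalls)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Log Date", y = "Number of Calls", title = "Daily Call Volume")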

CPU Usage Over Time

Glenn Berry has a fantastic set of DMV queries, and I use them quite regularly.  One query I used to troubleshoot a system issue is his CPU utilization over time query.  Here is a slightly cleaned up version of it:

DECLARE
	@ts_now BIGINT =
	(
		SELECT
			cpu_ticks / (cpu_ticks / ms_ticks)
		FROM sys.dm_os_sys_info WITH (NOLOCK)
	);

SELECT TOP (256)
	SQLProcessUtilization,
	SystemIdle,
	100 - SystemIdle - SQLProcessUtilization AS [OtherProcessUtilization],
	DATEADD(ms, -1 * (@ts_now - [TIMESTAMP]), GETDATE()) AS [EventTime]
FROM
	(
		SELECT
			record.value('(./Record/@id)[1]', 'int') AS record_id,
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
			AS [SystemIdle],
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
			AS [SQLProcessUtilization],
			[TIMESTAMP]
		FROM
		(
			SELECT
				[TIMESTAMP],
				CONVERT(XML, record) AS [record]
			FROM sys.dm_os_ring_buffers WITH (NOLOCK)
			WHERE
				ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
				AND record LIKE N'%<SystemHealth>%'
		) AS x
	) AS y
ORDER BY
	record_id DESC
OPTION (RECOMPILE);

This query gives us data covering roughly the past 256 minutes and indicates CPU pressure from SQL Server as well as from other processes on the server. The problem is that it returns 256 rows, and it’s hard for humans to digest 256 rows of numbers. Fortunately, we have some R skills at our disposal:

library(RODBC)

# Same DSN as before; the query is the CPU utilization query from above.
conn <- odbcConnect("MONDB")
res <- sqlQuery(conn, "DECLARE
	@ts_now BIGINT =
	(
		SELECT
			cpu_ticks / (cpu_ticks / ms_ticks)
		FROM sys.dm_os_sys_info WITH (NOLOCK)
	);

SELECT TOP (256)
	SQLProcessUtilization,
	SystemIdle,
	100 - SystemIdle - SQLProcessUtilization AS [OtherProcessUtilization],
	DATEADD(ms, -1 * (@ts_now - [TIMESTAMP]), GETDATE()) AS [EventTime]
FROM
	(
		SELECT
			record.value('(./Record/@id)[1]', 'int') AS record_id,
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
			AS [SystemIdle],
			record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
			AS [SQLProcessUtilization],
			[TIMESTAMP]
		FROM
		(
			SELECT
				[TIMESTAMP],
				CONVERT(XML, record) AS [record]
			FROM sys.dm_os_ring_buffers WITH (NOLOCK)
			WHERE
				ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
				AND record LIKE N'%<SystemHealth>%'
		) AS x
	) AS y
ORDER BY
	record_id DESC
OPTION (RECOMPILE);")

# Line graph of SQL Server CPU utilization over time, with idle time and
# other-process utilization layered on in colorblind-safe colors.
plot(res$EventTime, res$SQLProcessUtilization, type = "l", col = "#0072B2", ylim = c(0, 100))
lines(res$EventTime, res$SystemIdle, col = "#D55E00")
lines(res$EventTime, res$OtherProcessUtilization, col = "#CC79A7")

This looks like a lot of code, but honestly, most of it is the SQL query.  The plot() call creates a quick and dirty graph; in this case, I’m creating a line graph of SQL Server process utilization.  I then use the lines() function to layer on the idle time and the other-process time, specifying colorblind-safe colors throughout.  Here’s what my graph looks like against MONDB this fine morning:

[Plot: CPUPlot (CPU utilization over time on MONDB)]

Again, if I were to send this to higher-level management, I’d clean it up, but if all I want to know is whether we’re seeing CPU spikes, this is good enough.  We can see here that MONDB is running fine, with less than 50% CPU utilization over this time frame and nothing else running that could steal CPU resources away from SQL Server.
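One cheap improvement if you do share a graph like this:  a legend, so nobody has to remember which color means what.  A quick sketch using base R’s legend() function (the series labels are my own):

# Label the three series so the colors are self-explanatory.
legend("topright",
       legend = c("SQL Server", "Idle", "Other processes"),
       col = c("#0072B2", "#D55E00", "#CC79A7"),
       lty = 1)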

Conclusion

You don’t need to be a data scientist to take advantage of data analysis techniques, and as database administrators, we already have a wealth of data available to us.  Sometimes you’ll get tools that generate pretty graphs, but maybe your company can’t afford them, or maybe they’re only deployed on a certain subset of servers.  In that case, don’t be afraid to pull out RStudio and write a few SQL queries to spot some trends and, hopefully, make your life easier.
