Something I am trying to harp upon is that R isn’t just a language for data analysts; it makes sense for DBAs to learn the language as well. Here’s a really simple example.
I have a client data warehouse which holds daily rollups of revenue and cost for customers. We’ve had some issues with the warehouse lately where data was not getting loaded due to system errors and timeouts, and our services team gave me a list of some customers who had gaps in their data due to persistent processing failures. I figured out the root cause behind this (which will show up as tomorrow’s post), but I wanted to make sure that we filled in all of the gaps.
My obvious solution is to write a T-SQL query, getting some basic information by day for each customer. I could scan through that result set, but the problem is that people aren’t great at reading tables of numbers; they do much better looking at pictures. This is where R comes into play.
My solution is just a few lines of R code, as well as a few lines of T-SQL. I’m using SQL Server 2016 but we don’t use SQL Server R Services (yet!), so I’m doing this the “old-fashioned” way by pulling data across the wire. Here’s the code:
install.packages("ggplot2") install.packages("RODBC") library(RODBC) library(ggplot2) conn <- odbcDriverConnect("Driver=SQL Server;Server=MYSERVER;Initial Catalog=MYDATABASE;Provider=SQLNCLI11.1;Integrated Security=SSPI") wh <- sqlQuery(conn, "SELECT fc.CustomerWK, fc.DateWK, COUNT(1) AS NumRecords, SUM(fc.Revenue) AS Revenue, SUM(fc.Cost) AS Cost FROM MYDATABASE.dbo.FactTable fc WHERE fc.CustomerWK IN (78296,80030,104098,104101,104104,108371) AND fc.DateWK > 20160901 GROUP BY fc.CustomerWK, fc.DateWK;") wh$CustomerWK <- as.factor(wh$CustomerWK) wh$DateWK <- as.Date(as.character(wh$DateWK), "%Y%m%d") str(wh) ggplot(wh, aes(x=DateWK, y=NumRecords, colour=CustomerWK, group=CustomerWK)) + geom_line() + xlab("Date") + ylab("Number Of Records")
Let’s walk through this step by step. After installing and loading the two relevant packages (RODBC to connect to SQL Server and ggplot2 to help us create a pretty graph), I open a connection to my server (with the name replaced to protect the innocent).
Next, I create a data frame called wh and populate it with the results of a query to the warehouse. This is a pretty simple SQL query which gets the number of rows by customer by day, and also shows me revenue and cost. I’m not using revenue and cost in this graph, but did look at them as part of my sanity checks.
Next up, I want to “fix” a couple data types. CustomerWK is an int and represents the customer’s surrogate key. Although this is an integer type, it’s really a factor. I have a small, unique set of categories; there’s no mathematical relationship between CustomerWK 1 and CustomerWK2. Anyhow, I replace CustomerWK with this new, factorized attribute.
After taking care of the CustomerWK factor, I convert DateWK to a date type. DateWK is an integer representation of the date in ISO format, so January 15th, 2016 would be represented as 20160115. I need to convert this from an integer to a character string, and then I can convert it to a date. I replace the DateWK value with this date type. I included the str(wh) call to show that my data frame really does have the correct types.
Finally, I call ggplot, passing in my warehouse data frame. I create an aesthetic, which tells the graphing engine what I want to see on the screen. I want to see the number of records in the fact table per day for each customer, so my Y coordinate is defined by NumRecords, my X coordinate by DateWK, and my group by CustomerWK. To make it easier to read, I color-code each customer.
After creating the aesthetic, I plot the results as a line graph using the geom_line() function, and then give meaningful X and Y axis labels.
What I get in return is a decent-enough looking graph:
I can easily see that customer 108371 experienced a major dropoff sometime in mid-October, and fell off the cliff in early November. The other customers have been fairly stable, leading me to believe that just one customer (in this cohort) has an issue. I was able to investigate the issue and determine the root cause of the falloff—that the customer stopped sending us data.
This is another example where knowing a little bit of R can be very helpful. Even if you aren’t building predictive models or performing advanced regressions, the ability to throw results into a plot and quickly spot outliers makes the job much easier. If I had to discern results from a Management Studio result set, I could still do the job, but I might have been thrown for a loop with customer 78296, whose counts fluctuated over a fairly wide band.