What: SQL Saturday Raleigh
Where: Wake Tech RTP, 10908 Chapel Hill Road, Morrisville, North Carolina, 27560
When: Saturday, April 27th, all day

Admission is free. Sign up at the SQL Saturday website.

What I’m Presenting

Nothing at all!

This will be the 5th SQL Saturday Raleigh that I’ve been involved with and my 4th running the event. This year, we’re in a brand new venue, one which just opened up a few months ago. This venue is right in the heart of the Triangle, so it should be a bit more convenient to get to for people than our prior events.

We will also be hosting a SQL Clinic, where attendees can ask questions about production issues, where (or what) to learn, or anything else.

This is part five in a series on applying forensic accounting techniques to SQL and R.

Dealing with Time

This post will be a pretty short one. In my talk, I don’t have any demos for this topic, mostly because much of the cohort analysis we’ve already covered has secretly been time series analysis at the same time. Instead, I’ll lob out a few points and call it a day.

Time series analysis, at its core, is all about how your data changes over time. The grain for time series analysis is important: as we saw in the last post, we were able to get an excellent result at the yearly level when regressing number of active buses versus number of line items.

Going down to the monthly grain, the number of active buses explained none of the monthly variance within a year, so the correlation was much weaker.

This is one of the difficulties when doing time series work: the specific grain you work against can make or break an analysis.

Calendar tables will allow you to define your grain to almost any level. If you store data as dates, a calendar table can let you find such weird items as the third Tuesday of each month, Sundays after weeks with holidays on Wednesdays, or the third quarter of fiscal year 2017 (oh, I meant federal fiscal year, not state fiscal year).
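To make that concrete, here is a hedged sketch (in Python, with 2019 chosen arbitrarily) of the kind of rule a calendar table bakes in, finding the third Tuesday of each month:

```python
from datetime import date, timedelta

def third_tuesday(year, month):
    # Walk forward from the 1st to the first Tuesday (weekday() == 1),
    # then jump ahead two weeks to land on the third Tuesday.
    d = date(year, month, 1)
    while d.weekday() != 1:
        d += timedelta(days=1)
    return d + timedelta(weeks=2)

# The third Tuesday of each month in 2019.
tuesdays = [third_tuesday(2019, m) for m in range(1, 13)]
```

In a calendar table you would compute attributes like this once per date row, then simply filter on them in queries.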

Simple Analysis & Seasonality

One of the simplest forms of analysis is count of items per day or week or month. The goal is to look for seasonality: repeated periodic behavior. You might see more sales on one day of the week or one day of the month or one month of the year. You can then use techniques like ARIMA to project forward and estimate future behavior given prior trends.

Additive and Semi-Additive Measures

When performing a time series analysis, it’s important to understand whether your measure is additive, semi-additive, or non-additive with respect to time. Additivity is, simply, whether you can sum results across a dimension and get reasonable results.

For example, an additive measure over time is the sum of deposits in a bank account. If you deposit $10 today, $50 tomorrow, and $100 a week from now, you will have deposited a total of $160. Additive measures give us the most flexibility in how we analyze the data, so we typically prefer to deal with them.

Non-additive measures cannot be aggregated across any dimension, whether time or otherwise. An example of this is a person’s account number. You can’t sum up all of the account numbers of customers at your bank and get something meaningful.

Semi-additive measures fit somewhere in between: they are additive across certain dimensions and non-additive across others. Bank account balance is a great example of a semi-additive measure. Bank account balances are additive across customers: you can sum up all of the balances and determine the total amount of deposits in the bank at a particular point in time. What you cannot do, however, is sum up one customer’s account balances over different points in time and get a meaningful number. If my bank account had $100 in it yesterday and $100 in it today, I don’t have a claim for $200; I have $100 because that is my current balance.

With semi-additive measures, you have a couple of routes for analysis. First, you can look at the state of the measure at fixed points in time. For example, we might look at every account at 9 PM UTC on a daily basis; that helps us see how account balances change and get a consistent view of the bank’s holdings across time.

The other technique is to use feature engineering to convert from balances to debits and credits—that is, the individual transactions which make up the total balance. If you have those, then the values become additive across time and you could use those instead of account balances.
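As a small sketch of that conversion (made-up balances, in Python for brevity), note that the balance snapshots are not additive over time but the deltas between them are:

```python
# Daily balance snapshots for one account (illustrative numbers).
balances = [100, 150, 150, 120, 220]

# Summing the snapshots gives a meaningless 740 "total."
# The deltas between snapshots are the credits and debits, and those ARE additive.
deltas = [b - a for a, b in zip(balances, balances[1:])]

# The sum of the deltas is the net change over the window.
net_change = sum(deltas)
assert net_change == balances[-1] - balances[0]
```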

Looking at Date Oddities

In this section, I’ll briefly cover some date oddities which might be worth investigating. First, you can look for “odd” date combinations: do you have order dates after ship dates? Or do you have adjustments after a journal closes? When you have a natural temporal flow, it’s interesting to look for variances in that flow because they’ll tell you a lot about exceptions to processes. For example, there might be a rare circumstance where an order date comes after a ship date because your company might ship out demo products without an order and if a customer buys the unit you fill in the order date and let the customer hang onto the product. These rules are usually hard to discern and are the fuzzy domain of subject matter experts (who themselves may not have thought through the implications of these choices).

The other area where you might find oddities is to focus on date attributes. For example, slice data by the day of the week, by the day of the month, by weekdays versus weekends, by mid-week versus early week versus late week. You might find that invoices never get processed on weekends (because it’s a government agency and nobody works weekends) until one time a bunch of invoices get processed on a Saturday. Simple sums, counts, and averages can get you a long way toward understanding user behavior in these cases.
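As a quick sketch of that kind of slicing (hypothetical invoice dates, in Python), grouping by a weekday-versus-weekend attribute is just a count over a derived flag:

```python
from collections import Counter
from datetime import date

# Hypothetical invoice dates; in practice these would come from dbo.LineItem.
invoice_dates = [date(2018, 3, 5), date(2018, 3, 6), date(2018, 3, 10),
                 date(2018, 3, 12), date(2018, 3, 13), date(2018, 3, 17)]

# weekday() >= 5 means Saturday or Sunday.
by_day_type = Counter("weekend" if d.weekday() >= 5 else "weekday"
                      for d in invoice_dates)
```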

Don’t Forget About Holidays

Holidays change behavior as well. On some holidays, like Singles Day, sales can go through the roof. On others, like Christmas Day, sales tend to be much lower than you’d expect from just another day of the week. If you’re dealing with worker behavior, their holidays matter too: you won’t see much work done on Christmas Day because they’re all at home.

Also, be aware of floating holidays like Easter and observed holidays like Christmas Day observed. If December 25th is on a Sunday, then companies tend to observe the holiday on December 26th, meaning that even though it’s technically not a holiday, December 26th will still exhibit holiday-like behavior in certain circumstances.
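A sketch of that observation rule (in Python; real observation rules vary by employer, so treat this as illustrative):

```python
from datetime import date, timedelta

def observed_christmas(year):
    # If December 25th falls on a Sunday (weekday() == 6), companies tend to
    # observe the holiday on Monday, December 26th.
    christmas = date(year, 12, 25)
    if christmas.weekday() == 6:
        return christmas + timedelta(days=1)
    return christmas

# December 25th, 2016 fell on a Sunday, so the observed holiday was the 26th.
```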

It’s easiest to use a calendar table (hint, hint) so you can avoid calculating these kinds of rules on the fly.

Conclusion

This is a tiny fraction of everything you want to do with time series, but should serve as a quick primer on things to think about as you try to understand your data better.

This is part four in a series on applying forensic accounting techniques to SQL and R.

Mr. Keynes’ Aggregates Conceal the Most Fundamental Mechanisms of Change

In the last post, we focused on high-level aggregates to gain a basic understanding of our data. We saw some suspicious results but couldn’t say much more than “This looks weird” due to our level of aggregation. In this post, I want to dig into data at a lower level of detail. My working conception is the cohort, a broad-based comparison of data sliced by some business-relevant or analysis-relevant component.

Those familiar with Kimball-style data warehousing already understand where I’m going with this. In the basic analysis, we essentially look at fact data with a little bit of disaggregation, such as looking at data by year. In this analysis, we introduce dimensions (sort of) and slice our data by dimensions.

Our key dimensions in this analysis are buses, employees, vendors, and expense categories. We can break out our data by any combination of these. Typically, I try to break things out one core category at a time at first and then start combining them together for more complicated slices once I have a decent understanding of what’s going on.

In today’s post, we will look at cohorts along three angles: cardinality, dimensional analysis, and clustering behavior.

Cardinality and Duplicates

One of the first things I want to figure out as I dive into this data set is, how often do I see “duplicate” results? For line items, I’m calling it “duplicate” when I see more than one invoice per bus per vendor per day. The idea here is that I know vendors work on individual buses and we force them to invoice per-bus. Here’s the SQL query:

WITH records AS
(
    SELECT
        li.LineItemDate,
        li.BusID,
        li.VendorID,
        COUNT(*) AS NumberOfInvoices
    FROM dbo.LineItem li
    GROUP BY
        li.LineItemDate,
        li.BusID,
        li.VendorID
)
SELECT
    NumberOfInvoices,
    COUNT(*) AS NumberOfOccurrences
FROM records
GROUP BY
    NumberOfInvoices
ORDER BY
    NumberOfInvoices;

And my results:

NumberOfInvoices    NumberOfOccurrences
1                   37224
2                   118
3                   1

This says that 99.4% of the time, an invoice is unique (note that you need to double-count 118 and triple-count 1 to get the right denominator). That’s a really good sign that this is an appropriate grain for determining what normally is unique.
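The denominator arithmetic is worth spelling out (a quick check in Python):

```python
# Occurrence counts from the query above: 37,224 singletons, 118 pairs, 1 triple.
occurrences = {1: 37224, 2: 118, 3: 1}

# Each pair contributes 2 line items and the triple contributes 3,
# so the denominator is the full line item count.
total_line_items = sum(n * count for n, count in occurrences.items())

# Share of line items which belong to a unique date/bus/vendor combination.
unique_share = occurrences[1] / total_line_items
```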

By the way, how did I come up with that combination of columns? Well, I probably got an idea from the business side. With a bit of logic and a sane data model, you can get to a place where you need only a little bit of experimentation to figure out what “ought” to be unique.

So my next question is, who’s sending these duplicate invoices? Are all of our vendors doing it equally or do we have one or two dupe-senders? Well, let’s find out:

WITH records AS
(
    SELECT
        li.LineItemDate,
        li.BusID,
        li.VendorID,
        COUNT(*) AS NumberOfInvoices
    FROM dbo.LineItem li
    GROUP BY
        li.LineItemDate,
        li.BusID,
        li.VendorID
)
SELECT
    VendorID,
    COUNT(*) AS NumberOfOccurrences
FROM records
WHERE
    NumberOfInvoices > 1
GROUP BY
    VendorID
ORDER BY
    VendorID;

Once more, here are our results:

VendorID    NumberOfOccurrences
2           11
5           59
6           1
7           2
8           4
9           16
10          4
11          1
12          6
13          6
14          9

At first glance, there’s a fair spread in vendors sending dupes. Most of our vendors have sent us duplicates at least once during the seven years of data we have. Sorting this by number of occurrences descending, we can see that vendor 5 has more than 3x as many dupes as the next-highest. That’s a little weird and it’s something we should at least keep in the back of our minds, but by itself, that’s not really weird behavior.

Now let’s look at cardinality. Cardinality is the number of distinct values in a column, something we can compute quite easily in R using the rapply() function, which applies a function to each column of a data frame. I have loaded a data frame called lineItems, which is just a simple SELECT * FROM dbo.LineItem; query pulled in via the DBI package.
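The call is presumably something along the lines of rapply(lineItems, function(x) length(unique(x)), how = "unlist"). For non-R readers, the same per-column distinct count looks like this in Python (toy data, purely illustrative):

```python
# A toy stand-in for the lineItems data frame: column name -> column values.
line_items = {
    "LineItemID": [1, 2, 3, 4, 5],
    "BusID":      [101, 101, 205, 205, 205],
    "VendorID":   [5, 5, 5, 9, 9],
}

# Cardinality: the number of distinct values in each column.
cardinality = {col: len(set(values)) for col, values in line_items.items()}
```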

Running this against the real data tells us that we have 37,463 unique line item IDs out of 37,463 rows. This is our primary key, so it needs to be unique.

By contrast, we see 664 bus IDs, meaning that 664 buses have had service done during their lifespans. There are 700 buses in our data set, so that means 36 buses never had any maintenance done on them before retirement.

We can also look at interesting subsets of the data and see how that behavior differs from the broader set. For example, we know that you need two employees to approve an expenditure of more than $1000. So let’s look at high-value invoices which are under the two-signer cutoff, maybe $850 to $999.99.

12 of our 15 vendors have sent in invoices in this price range. Let’s use data.table’s setDT() function to convert our data frame into a data.table, which lets us slice our data by a variable like VendorID. This is equivalent to writing a SQL statement which selects the count of results grouped by VendorID.
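In other words, the slice is just a group-by-and-count; a language-neutral sketch (hypothetical vendor IDs, in Python):

```python
from collections import Counter

# Hypothetical VendorIDs for invoices in the $850-$999.99 band.
vendor_ids = [5, 5, 9, 5, 12, 9, 5]

# Equivalent of SELECT VendorID, COUNT(*) FROM ... GROUP BY VendorID;
invoices_per_vendor = Counter(vendor_ids)
```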

Four employees handled those invoices. Maybe those four are the only four who work with the vendor? I’ve seen places which do that, splitting out which accountant works with which vendor.

Nope. All 12 have worked with this vendor. But the four in question each have 30x as many invoices as the rest. Let’s dig deeper and see if we can unravel this mystery.

Dimensional Analysis

When I use the term “dimensional analysis,” I don’t necessarily mean true Kimball-style dimensions. Instead, think of it as an analysis of data grouped by some explanatory variable (or variables). For example, we can have a breakdown by time. Our data will come from the following SQL query:

SELECT
    li.LineItemID,
    li.BusID,
    li.VendorID,
    v.VendorName,
    li.ExpenseCategoryID,
    ec.ExpenseCategory,
    li.EmployeeID,
    e.FirstName,
    e.LastName,
    CONCAT(e.FirstName, ' ', e.LastName) AS EmployeeName,
    li.LineItemDate,
    c.CalendarMonth,
    c.MonthName,
    c.CalendarYear,
    c.FirstDayOfMonth,
    li.Amount
FROM dbo.LineItem li
    INNER JOIN dbo.Employee e
        ON li.EmployeeID = e.EmployeeID
    INNER JOIN dbo.ExpenseCategory ec
        ON li.ExpenseCategoryID = ec.ExpenseCategoryID
    INNER JOIN dbo.Vendor v
        ON li.VendorID = v.VendorID
    INNER JOIN dbo.Calendar c
        ON li.LineItemDate = c.Date;

Here’s an example of the number of invoices per month grouped by year and shown as a box plot:

And here’s the R code to generate that plot:

itemsByMonthAndYear <- lineItems %>%
    group_by(FirstDayOfMonth, CalendarYear, CalendarMonth) %>%
    summarize(n = n())

ggplot(data = itemsByMonthAndYear, aes(x = CalendarYear, y = n, group = CalendarYear)) +
    geom_boxplot() +
    theme_minimal() +
    labs(x = NULL, y = NULL, title = "Box Plot of Number of Invoices by Month and Year")

Something that strikes me as odd about 2018 is that its low points are right where, based on 2011 through 2017, I would expect its medium to high points to be. Let me look at each month as a scatter plot to see this more clearly.

ggplot(data = itemsByMonthAndYear, aes(x = FirstDayOfMonth, y = n)) +
    geom_point() +
    theme_minimal() +
    labs(x = "Month", y = "Number of Invoices") +
    theme(axis.text.x = element_blank())

Even in 2018, we have several months which fit the trend and a few which don’t. Let’s turn on our CSI Enhancer™ and look specifically at 2018.

ggplot(data = filter(itemsByMonthAndYear, CalendarYear == 2018), aes(x = FirstDayOfMonth, y = n)) +
    geom_point() +
    scale_x_date() +
    theme_minimal() +
    labs(x = "Month", y = "Number of Invoices")

Something started sometime around March and slowed down around September. Maybe there was something big going on that year which required more vehicle maintenance?

We’re not sure yet exactly what’s going on but we’re getting there.

Pivot to Expense Category

Here’s a box plot of sales by expense category:

Delving into the category with the biggest positive outliers (Glass, Windshields & Windows), I found out that the only vendor selling here is vendor #5, Glass and Sons Glass and Accessories. Let’s take a closer look at Mr. Glass and his sons.

Glass and Sons

I’m going to look at the four expense categories in which Glass and Sons specializes. This covers the years 2011-2017 and shows expenses per invoice line within a category for each competitor in that category.

Glass and Sons is right in line with the other vendors for each category throughout the seven-year period. One thing I want to point out is that I added geom_boxplot(varwidth = TRUE) so the width of each box reflects the number of invoices it represents. We can see that there’s no real impetus to choose one vendor over another during this time stretch.

In 2018, Glass and Sons became both a lot more expensive and a lot more popular than its competition. I may not be a big-city lawyer (or any kind of lawyer) but this smells fishy. At this point, I have concerns with this data: if it’s accurate, we may be looking at fraudulent behavior. Before I start leveling accusations, though, I want to dig a bit further. In order for there to be fraud, I’d expect that there’s at least one person on the inside working with Glass and Sons, and I don’t want to contact anyone on that side because I can’t be positive who’s involved.

Analyzing Employees

I’m suspicious but don’t have enough yet to go to anyone. Let’s keep digging, this time looking at Glass and Sons invoices per county employee. First, prior to 2018:

Let’s go a step further and look at our big months versus our small months. Maybe there was a county change, where these four were on some kind of special assignment? First, let’s look at the non-weird months:

These are my four persons of interest. I think there’s something going on and I have enough evidence to go to my superiors and get a real investigation going.

Clustering Behavior

The last bit I want to look at today is clustering behavior. The idea here is that we might see clusters around certain points due to explicit human behavior. For example, suppose we have a corporate policy where any meal over $20 needs a receipt but any meal under $20 doesn’t. Many (most?) people will report accurately, but there is an incentive for people to report meals of $19.95 or $19.99 even if they only spent $3.00 at the grocery store that morning.

In our example, I know that $1000 is our cutoff point, so I’d expect some clustering just under $1K. I can use a histogram to see this in action.
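The histogram idea reduces to bucketing amounts and comparing the bucket just under the threshold to its neighbors. A hedged sketch (hypothetical amounts, $50 bins, and an arbitrary 3x rule of thumb):

```python
from collections import Counter

# Hypothetical invoice amounts; note the pile-up just under the $1,000 cutoff.
amounts = [310, 475, 520, 610, 980, 985, 990, 995, 999, 1050, 430, 995, 975]

# Bucket each amount into a $50-wide bin keyed by its lower bound.
bins = Counter(50 * (int(a) // 50) for a in amounts)

# Flag the bin just below the two-signer threshold if it towers over its neighbor.
just_under = bins[950]   # $950-$999.99
neighbor = bins[900]     # $900-$949.99
suspicious = just_under >= 3 * max(neighbor, 1)
```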

We can see a fairly smooth curve regularly dropping…until we get to just under $1000. One last cohort-style comparison. First, everybody but Glass and Sons:

And here’s Glass and Sons:

The multiple peaks aren’t outlandish. The giant wall just under the cutoff point for two employee signatures? Yeah, that’s suspicious.

Conclusion

In today’s post, we looked at different ways to slice and dice data sets, drilling down far enough to find reasonable evidence of fraud involving one vendor and four employees. Again, I want to stress that there’s no absolute proof here, but there’s enough that we can sic the real accountants on it: figuring out whether all of those invoices match up with parts, whether those parts made it to inventory, and all of those other pieces of evidence which would prove fraud.

In the next post, we’ll talk a little more about specific ways to handle time-series data.

The term “basic analysis” is pretty broad, so we’re going to cover a few concepts today. Most of this is in the realm of Exploratory Data Analysis, looking at the data to gain a better understanding of what you have before you generate hypotheses. This sounds bland and straightforward today, which is an indication of how well Tukey’s book was received and how insightful he was.

In today’s post, we’ll look at exploring our data along five avenues: summary analysis, growth analysis, gaps in data, duplicates and cardinality, and regression analysis. Most of this work will be in R, using the DBI package to connect to SQL Server.

Summary Analysis

Summary analysis is the easiest of these and generally should be your starting point. Here, you simply want to get an idea of the shape of your data. If I’m in SQL, I tend to start with a SELECT TOP(100) * from the table so I can get an idea of columns, value spread, and data lengths without putting much load on the server. You can also peruse metadata like data types in sys.columns, foreign key relationships, constraints, and even calling code.

Over in R, we have the summary() function, which gives us a statistical summary of each column in a data frame. We can also use head(), which shows me the first few rows in a data frame. Here is an example of running a few summary functions against the bus table.
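In this post those calls are summary(bus) and head(bus). As a sketch of the quartile inference mentioned below (synthetic data, assuming BusID is a surrogate key running 1 through 700 with no gaps):

```python
from statistics import quantiles

# Synthetic surrogate key: if BusID runs 1..700 with no gaps,
# the quartiles of the ID column give away the row count.
bus_ids = list(range(1, 701))

q1, med, q3 = quantiles(bus_ids, n=4)
# A median BusID of 350.5 implies roughly 700 rows.
```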

What we’ve learned here is that there are 700 buses (which you don’t see directly but can infer from the BusID quartiles). Some of them came into service on 1990-01-01, which is probably a default value because our data set starts in 2011. Buses have retirement dates as well, and a bus without a retirement date is still active. In SQL, those would be rows with a NULL value; in R, those are the rows with NA values. There are 468 active buses in our current data set.

We would repeat this type of analysis for each of the major elements in our data set. Let me take a moment and cover my process for finding those “major elements” because I’m all too liable to gloss over it. Here’s what I tend to do when someone hands me a brand new database:

1. Find the tables with the most rows in them. In SQL Server Management Studio, you can right-click on a database and navigate to Reports –> Standard Reports –> Disk Usage by Table. This will generate a Reporting Services report showing approximate row counts (based on statistics), data size, and index size. Look at the top few tables there. Some of them are probably logging tables, which I tend to come back to later.

2. Pick one of the large tables which appears to have user-relevant information and perform summary analysis, looking at the top few rows to determine what’s in that table. These large tables tend to be fact-style data, storing relevant measures customers care about.

3. Recursively find foreign key constraints to other tables and look at those tables. These are often explanatory data, providing context to the fact-style data above. If you need help with recursively finding key constraints, I have a script (with bonus ASCII art) and a longer post as well. If your database has no foreign keys, there are still ways around it like looking at joins in the plan cache.

4. Repeat step 2 until you run out of big tables. Then move on to medium-sized tables you have not already reviewed.

Creating a database diagram will also be helpful here, especially if foreign key constraints exist.

Now that we’ve summarily covered summary analysis, let’s jump into the next category: growth analysis.

Growth Analysis

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.

For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.

SELECT
    c.CalendarYear,
    COUNT(*) AS NumberOfBuses
FROM dbo.Bus b
    INNER JOIN dbo.Calendar c
        ON b.DateFirstInService <= c.Date
        AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
WHERE
    c.CalendarDayOfYear = 1
    AND c.CalendarYear >= 2011
    AND c.CalendarYear < 2019
GROUP BY
    c.CalendarYear
ORDER BY
    c.CalendarYear;

I can show you the SQL results but let’s drop this into R and build a quick and dirty plot.

options(repr.plot.width=6, repr.plot.height=4)
ggplot(activeBuses, aes(x = CalendarYear, y = NumberOfBuses)) +
    geom_point() +
    geom_line() +
    labs(x = "Calendar Year", y = "Number of Buses", title = "Number of Buses by Year") +
    ylim(0, 500) +
    theme_minimal()

The first line with options is something I do with Jupyter to prevent it from creating huge graphs. From there, we’re creating a scatterplot with a line overlaid, labeling the axes, starting from 0, and using a minimalistic theme. Note that starting from 0 is not required—both line charts and scatter plots can start from points other than 0. I did this to make the steady growth more apparent.

Next, I want to look at the number of invoices per year. We invoice on a per-bus, per-item basis, so I would expect invoice growth to track bus growth reasonably closely. You can argue about economies of scale (maintenance workers become more efficient, you might get bigger discounts on parts, it makes sense to purchase capital equipment to make the per-bus maintenance fees lower, those types of things) but with the bus count growing so steadily, I’d think that invoices would grow similarly. So let’s take a look.

SELECT
    c.CalendarYear,
    COUNT(*) AS NumberOfInvoices
FROM dbo.LineItem li
    INNER JOIN dbo.Calendar c
        ON li.LineItemDate = c.Date
GROUP BY
    c.CalendarYear
ORDER BY
    c.CalendarYear;

Here is the R code:

ggplot(invoicesPerYear, aes(x = CalendarYear, y = NumberOfInvoices)) +
    geom_point() +
    geom_line() +
    labs(x = "Calendar Year", y = "Number of Invoices", title = "Number of Invoices by Year") +
    theme_minimal()

And the plot:

You can see that invoice growth was fairly steady from 2011 through 2017. Yeah, there are ups and downs but that’s normal in any real data set. The jump in 2018, however, is huge: we’ve effectively doubled the number of invoices despite bus growth being steady. Here’s the plot for expenditures by year, which code I’ve left out for the sake of making you do your own dirty work:

Those extra invoices added about a million dollars over expectations. This is our first indication that something interesting has happened. Note that this is not evidence of fraud, as there can be a number of innocent explanations: maybe the buses need to go through more maintenance because they’re older, maybe regulatory requirements forced more work on the buses, maybe we got a batch of lemons which need more work done on them. There are plenty of potential causes, but this is well outside the realm of noise.

We’ll shelve this for a little bit and look at our next topic, gap analysis.

Gap Analysis

Gap analysis is something you’d typically run when you care about the lack of a value. For example, accountants tend to get picky about check numbers and invoice numbers being complete. If you go from check 1001 to 1002 to 1004, an accountant wants to know what happened to check 1003. The reason is that if you don’t have a record of 1003, it’s possible that there was embezzlement.

To perform a quick gap analysis on line items, we can use the LEAD() window function, available since SQL Server 2012. Here’s an example of the window function in action:

WITH C AS
(
    SELECT
        li.LineItemID AS CurrentLineItemID,
        LEAD(li.LineItemID) OVER (ORDER BY li.LineItemID) AS NextLineItemID
    FROM dbo.LineItem li
)
SELECT
    CurrentLineItemID + 1 AS rangestart,
    NextLineItemID - 1 AS rangeend
FROM C
WHERE
    NextLineItemID - CurrentLineItemID > 1;

Here’s what we get back:

We have several ranges of missing values here, which is a bit concerning, as our invoice numbers should be a complete set. There might be an innocuous reason for this. If we look at sys.columns, we can see that LineItemID is an identity column.

Identity columns are great for auto-incrementing surrogate keys but are less great for auto-incrementing keys with informational context. Let me explain what I mean. If we have line items from 1 to 1000 in the table, the next row we insert will have an ID of 1001 (assuming nobody has changed the seed and our increment value is 1). But what happens if we get an error trying to insert value 1001 and need to roll back the statement? In that case, the value 1001 has been burned and our next insertion attempt will be 1002. This can leave gaps in our data for a totally innocuous reason and without anybody actually knowing.

The same applies to sequence types: it is possible that you fail to insert using a sequence value and might lose that value forever. If you need to track a value like an invoice number, your best bet might be to gin up your own solution. You can create a table which stores your latest used value. Then, when it’s time to use the next value, go into the serializable transaction isolation level and take a lock on the table by beginning a transaction and selecting the value from the table. That will prevent anybody else from using the table and potentially grabbing the same invoice number as you.

In your insertion code, you can then increment the value, insert into the table, and if your operation was successful, update the value in the latest value table. Then close the transaction so other sessions can do their work.

This answer works in a low-throughput situation where you don’t expect more than one or two updates every few seconds. Fortunately, most systems which require this level of scrutiny tend to be fairly low-throughput or at least have relatively low concurrency. A process like generating checks for tens of thousands of employees has periods of high throughput but if you batch it all in one transaction on one session, the problem is still tractable.
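To make the pattern concrete, here is a sketch of the same idea against SQLite from Python. The table names are hypothetical, and BEGIN IMMEDIATE plays the role of the serializable isolation level plus table lock described above:

```python
import sqlite3

# Autocommit mode so we can manage the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE LatestInvoiceNumber (Value INTEGER NOT NULL)")
conn.execute("INSERT INTO LatestInvoiceNumber (Value) VALUES (1000)")
conn.execute("CREATE TABLE Invoice (InvoiceNumber INTEGER PRIMARY KEY)")

def next_invoice_number(conn):
    # Take the write lock up front so no other session can grab the same number.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (current,) = conn.execute("SELECT Value FROM LatestInvoiceNumber").fetchone()
        nxt = current + 1
        conn.execute("INSERT INTO Invoice (InvoiceNumber) VALUES (?)", (nxt,))
        # Only bump the stored value once the insert has succeeded.
        conn.execute("UPDATE LatestInvoiceNumber SET Value = ?", (nxt,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return nxt
```

A failed insert rolls back both the invoice row and the counter bump, so no number is burned.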

Duplicates

I’m going to gloss over duplicates here because I’ll get into it in much more detail when we talk about cohort analysis later. For now, here are a few things I’d like to put in your mind.

What is a Duplicate?

There are two different ways we can think about duplicate data. The first way is exact matches on relevant columns where there is no unique key constraint preventing duplication. Suppose we have a LineItemID (which is just a surrogate key) and an InvoiceNumber on our table. That invoice number should be contiguous and unique for each line item. If we don’t have a unique key constraint on that table, however, it becomes possible for someone to use the same invoice number for two lines.

The other side of a duplicate is something which ought to be the same but isn’t, maybe due to a typo. My favorite example of this happens to come from a bank fraud case from a few years back:

When the Federal Reserve Bank of New York cleared five transactions made by the Bangladesh Bank hackers, the money went in two directions. On Thursday, Feb. 4, the Fed’s system sent $20 million to Sri Lanka and $81 million to the Philippines.

The Sri Lankan transaction contained a small but crucial error: The money was being sent to a bank account in the name of a nonprofit foundation, but the electronic message spelled it “fundation.” That prompted Deutsche Bank, an intermediary in the transaction, and a Sri Lankan bank to contact Bangladesh Bank, which led to the payment being cancelled and the money returned.

Here, “foundation” and “fundation” were supposed to be the same but a small typo made a big difference.

Duplicates and Fraud

In the Wake County fraud case, one measure of duplication is the number of invoices received on a single day. We can’t have a unique key on date and vendor (or date, vendor, and bus in our case) because it’s completely reasonable for a customer, on occasion, to send two invoices on the same day. In the Wake County case, however, they had 24 separate days with at least 50 invoices. 50 goes beyond reasonable.

Regression Analysis

I’m not going to be able to give much more than a primer here. Regression analysis is the topic of many a book and course in statistics and getting regression right can be a major time sink. Acknowledging that we will remain superficial here, we can still cover some of the ground. In its most basic form, regression is all about determining if there is a relationship between one or more input variables (also known as independent variables) and our output (the dependent variable).

We saw the line graph of invoices by year and of buses by year. My question is how much the number of buses ends up driving the number of invoices. My expectation is that the number of buses is a key factor in the number of invoices we deal with: as we add new buses to the fleet, I’d expect an approximately linear increase in the amount of maintenance work to perform, as well as the number of parts to purchase. We may see fluctuations but I expect to see a trend.

Regression by Month and Year

The first thing I want to do is regress the number of invoices versus buses using monthly data. My thought here is that the number of buses drives the monthly number of invoices and that the number of invoices grows approximately linearly with the number of buses. Let’s try these out.

First, I have my SQL query that I use to populate a data frame:

WITH buses AS
(
    SELECT
        c.FirstDayOfMonth,
        c.CalendarMonth,
        c.CalendarYear,
        COUNT(*) AS NumberOfBuses
    FROM dbo.Bus b
        INNER JOIN dbo.Calendar c
            ON b.DateFirstInService <= c.Date
            AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date
    WHERE
        c.Date = c.FirstDayOfMonth
        AND c.CalendarYear >= 2011
        AND c.CalendarYear < 2019
    GROUP BY
        c.FirstDayOfMonth,
        c.CalendarMonth,
        c.CalendarYear
),
expenses AS
(
    SELECT
        c.FirstDayOfMonth,
        COUNT(*) AS NumberOfInvoices,
        SUM(li.Amount) AS TotalInvoicedAmount
    FROM dbo.LineItem li
        INNER JOIN dbo.Calendar c
            ON li.LineItemDate = c.Date
    GROUP BY
        c.FirstDayOfMonth
)
SELECT
    b.FirstDayOfMonth,
    b.CalendarMonth,
    b.CalendarYear,
    b.NumberOfBuses,
    e.NumberOfInvoices,
    e.TotalInvoicedAmount
FROM buses b
    INNER JOIN expenses e
        ON b.FirstDayOfMonth = e.FirstDayOfMonth
ORDER BY
    b.FirstDayOfMonth;

Then, I’d like to build a regression. Here is the R code for an Ordinary Least Squares linear regression:
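A minimal sketch of that call, using simulated data as a stand-in for the query results (column names match the SQL above; the numbers are invented, so the output will not match the post’s):

```r
set.seed(106)

# Simulated stand-in for the monthly query results, 2011 through 2017.
monthlyData <- data.frame(
  CalendarYear = rep(2011:2017, each = 12),
  CalendarMonth = rep(1:12, times = 7)
)
monthlyData$NumberOfBuses <- 600 + 20 * (monthlyData$CalendarYear - 2011)
monthlyData$NumberOfInvoices <- round(0.4 * monthlyData$NumberOfBuses +
                                        rnorm(nrow(monthlyData), sd = 25))

# Ordinary Least Squares: number of invoices as a function of number of buses.
regIC <- lm(NumberOfInvoices ~ NumberOfBuses, data = monthlyData)
summary(regIC)
```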

In one function call, I get my linear regression, which ties the number of invoices to the number of buses. I should note that I filter the data to dates earlier than 2018: we saw the big jump in invoices in 2018, and that ruins our results. Because I think something’s odd about that data, I’d like to see what the relationship looks like if we factor out 2018 and look at 2011 through 2017. Here’s what I get back:

There are a couple of things to pick out of this. First, our R^2 is 0.45, so we are explaining 45% of the variance in NumberOfInvoices. That’s okay but really not that good. In social science contexts, explaining 45% of human behavior is a really good result. But here we’re explaining expenditures and I’d much rather see 85-95% of the variance explained before I think an expenses model is accurate.

One thing we can do to try to improve the regression is to add features.

Adding Features to Regression

We have two additional features at hand: calendar month and calendar year. Let’s try calendar month first:
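A sketch of that model, again with simulated stand-in data. Note that treating CalendarMonth as a number fits a single linear month effect; wrapping it in factor() would instead give each month its own coefficient:

```r
set.seed(106)

# Simulated stand-in for the monthly query results.
monthlyData <- data.frame(
  CalendarYear = rep(2011:2017, each = 12),
  CalendarMonth = rep(1:12, times = 7)
)
monthlyData$NumberOfBuses <- 600 + 20 * (monthlyData$CalendarYear - 2011)
monthlyData$NumberOfInvoices <- round(0.4 * monthlyData$NumberOfBuses +
                                        rnorm(nrow(monthlyData), sd = 25))

# Add calendar month as a second feature.
regICMonth <- lm(NumberOfInvoices ~ NumberOfBuses + CalendarMonth,
                 data = monthlyData)
summary(regICMonth)
```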

The R^2 didn’t move much at all: it went from 45% to 46%. Frankly, that’s noise. At this level, if we’re not seeing a 10% bump (or more) in R^2, I don’t know if I want to include that feature. Notice also that calendar month is not significant according to its p-value. We can and should make fun of p-values as much as possible, but here’s a case where the results are clear and sensible. Calendar month isn’t a factor in this regression. So let’s remove it and try calendar year.
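A sketch with calendar year in place of calendar month (simulated stand-in data again), along with a cor() call to check how strongly the two independent variables move together:

```r
set.seed(106)

# Simulated stand-in data; a little jitter on bus counts keeps the two
# independent variables highly (but not perfectly) correlated.
monthlyData <- data.frame(CalendarYear = rep(2011:2017, each = 12))
monthlyData$NumberOfBuses <- 600 + 20 * (monthlyData$CalendarYear - 2011) +
  sample(-5:5, nrow(monthlyData), replace = TRUE)
monthlyData$NumberOfInvoices <- round(0.4 * monthlyData$NumberOfBuses +
                                        rnorm(nrow(monthlyData), sd = 25))

# Swap calendar month out for calendar year.
regICYear <- lm(NumberOfInvoices ~ NumberOfBuses + CalendarYear,
                data = monthlyData)
summary(regICYear)

# How strongly do the two independent variables move together?
cor(monthlyData$NumberOfBuses, monthlyData$CalendarYear)
```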

Now this result is quite interesting. Our R^2 didn’t change but now neither variable is significant! This is a great example of something called multicollinearity, one of the challenges of regression. Put simply, the number of buses increases by about the same number every year, so there is very high correlation between number of buses and calendar year. Running a correlation test against the two, I end up with a value of 0.978.

That is, the correlation is nearly perfect; squaring it, roughly 96% of the variance in the number of buses can be explained by the calendar year. These two variables are collinear. Because they move almost 1-for-1, it is difficult for the regression algorithm to separate behavior in one versus the other. They’re both fighting to explain the same variance, so both end up with higher p-values. Also of interest is that the R^2 doesn’t change: multicollinearity doesn’t make your overall predictions worse, but it does make it tougher to tell which independent variables are driving the change.

This is an extreme scenario, mind you, but multicollinearity is a common enough occurrence that you will want to be on the lookout for it. The other two linear regression amigos are serial correlation (AKA autocorrelation) and heteroskedasticity (my favorite of the three).

Now let’s take a step back, as regressing at the month level is not getting the job done. Instead of grouping by month, I’ve changed the SQL query to include just calendar year and the numbers of buses and invoices. Let’s see how that looks:
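A sketch of the annual model, with simulated stand-in data; the model name regICAnnualPre2018 matches the plot() call later in the post:

```r
set.seed(106)

# Simulated annual stand-ins for 2011 through 2017.
annualData <- data.frame(CalendarYear = 2011:2017)
annualData$NumberOfBuses <- 600 + 20 * (annualData$CalendarYear - 2011)
annualData$NumberOfInvoices <- round(4.8 * annualData$NumberOfBuses +
                                       rnorm(7, sd = 40))

# The same regression, now at the yearly grain.
regICAnnualPre2018 <- lm(NumberOfInvoices ~ NumberOfBuses, data = annualData)
summary(regICAnnualPre2018)
```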

I didn’t include the SQL code because it’s a trivial variant on the prior version. Yet I included the trivial variants on the R code because that’s how I roll. Here are my results:

Wow. We went from explaining less than half of all variance to explaining 97% of the variance. That’s a huge difference and is definitely an interesting result. For a fairly mechanical problem like this one, an R^2 of .97 is high but not “shake your head” high. If this were a social sciences problem and I got an R^2 of .97, I’d wonder what I did wrong.

I don’t like that I have so few data points, but even with the low number of data points, our regression output is indicating that there’s something there. We can also run plot(regICAnnualPre2018) and see that our residuals fall on both sides of zero and are a small percentage of the total values:

What this tells us is that the residuals (that is, actual minus estimated) are not consistently above or below 0, but rather spread around zero. If we saw the residuals consistently over (or under) 0, the residuals would show bias, which can be a problem when performing a regression analysis.

Finally, now that we have a good fit for the pre-2018 data, let’s see what adding 2018 does:
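A sketch of the same regression with 2018 included (simulated data, with an artificially inflated 2018 invoice count standing in for the real jump, so the R^2 will not match the post’s 71%):

```r
set.seed(106)

# Simulated annual data, 2011-2018, with 2018's invoice count inflated
# well beyond the trend to mimic the jump in the real data.
annualData <- data.frame(CalendarYear = 2011:2018)
annualData$NumberOfBuses <- 600 + 20 * (annualData$CalendarYear - 2011)
annualData$NumberOfInvoices <- round(4.8 * annualData$NumberOfBuses +
                                       rnorm(8, sd = 40))
annualData$NumberOfInvoices[annualData$CalendarYear == 2018] <- 7700

# Refit with the anomalous year included and check the fit.
regICAnnual <- lm(NumberOfInvoices ~ NumberOfBuses, data = annualData)
summary(regICAnnual)$r.squared
```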

That’s a drop from 97% to 71%. It’s a huge drop. If we have no suspicions about data quality, that kind of drop can be devastating: it means our model is no longer a great model. But I do harbor some suspicions, because 2018’s values are so much larger than the trend that I think there’s something weird going on.

One last note: we can take the annual pre-2018 model and generate a prediction to see what our model thinks 2018’s value ought to have been:
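A sketch of that prediction (simulated data again, so the result will not be the post’s 5362; the 2018 bus count of 740 is likewise invented):

```r
set.seed(106)

# Refit the annual pre-2018 model on simulated stand-in data.
annualData <- data.frame(CalendarYear = 2011:2017)
annualData$NumberOfBuses <- 600 + 20 * (annualData$CalendarYear - 2011)
annualData$NumberOfInvoices <- round(4.8 * annualData$NumberOfBuses +
                                       rnorm(7, sd = 40))
regICAnnualPre2018 <- lm(NumberOfInvoices ~ NumberOfBuses, data = annualData)

# Ask the model what 2018 "ought" to look like, given a hypothetical bus count.
predict(regICAnnualPre2018, newdata = data.frame(NumberOfBuses = 740))
```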

This returns 5362 versus our actual invoice count of 7700, a difference of more than 2000. Again, this isn’t proof of wrongdoing, but it helps us put the scope of the problem into perspective: it’s a data point suggesting that something weird may be going on, and it shows us the scale of that weirdness.

Conclusion

In this post, we looked at a number of analytical techniques to gain insight into our data. We focused mostly on high-level aggregates here, which can help us get a basic understanding of our data. In the next post, we’re going to move to another level of analysis: cohort analysis. This will give us a better idea of just what’s going on with our data.

I think this course was a good length and includes a lot of quality content. Janani presents everything in Python but I was able to follow along and do similar examples in Scala. The entire course length is just over 2 hours, which I think is fine for a “getting started” type of course.

This course contains a great overview of Spark 2 and how it differs from Spark 1, including the expansion of DataFrames, the combination of Spark contexts into a single SparkSession, and the Tungsten and Catalyst engines. Those two, in particular, were fascinating topics. I wish Janani had a chance to dig into them further, though I suppose that probably would not have been in the scope of a single course.

The use of Jupyter Notebooks for demos was a smart idea. Spark doesn’t have a “natural” UI and Jupyter lets you interact smoothly with the Python kernel and Spark cluster. I’m more used to Apache Zeppelin for Spark notebooks, so it was nice to see Jupyter in use here.

The one thing I regret about the course is the overuse of SnagIt-style boxes and circles. Highlighting something with a rectangle makes sense when there’s a lot on the screen and you need to direct the viewer to a particular element. Here, though, we would regularly see Janani type something in and then immediately see a rectangle or circle around it, which was overkill: more distracting than illuminating.

Despite that nitpick, if you are in the market for some introductory content on Spark and don’t mind working in Python, this is a great course to check out.

What: Southern Fried F# Where: Red Hat, 100 East Davie Street, Raleigh, NC 27601 When: Saturday, April 13th, all day Admission is free. Sign up on Eventbrite.

What I’m Presenting

Nothing at all!

I’m one of the organizers for the inaugural edition of Southern Fried F#. Our goal is to expand functional programming in the Triangle area and we have a great lineup of speakers and topics.