This is part three in a series on applying forensic accounting techniques to SQL and R.
The term “basic analysis” is pretty broad, so we’re going to cover a few concepts today. Most of this is in the realm of Exploratory Data Analysis, looking at the data to gain a better understanding of what you have before you generate hypotheses. This sounds bland and straightforward today, which is an indication of how well Tukey’s book was received and how insightful he was.
In today’s post, we’ll look at exploring our data along five avenues: summary analysis, growth analysis, gaps in data, duplicates and cardinality, and regression analysis. Most of this work will be in R, using the DBI package to connect to SQL Server.
Summary analysis is the easiest of these and generally should be your starting point. Here, you simply want to get an idea of the shape of your data. If you’re in SQL, I tend to start with a
SELECT TOP(100) * from the table so I can get an idea of columns, value spread, and data lengths without putting much impact on the server. You can also peruse metadata like data types in
sys.columns, foreign key relationships, constraints, and even calling code.
Over in R, we have the
summary() function, which gives us a statistical summary of each column in a data frame. We can also use
head(), which shows me the first few rows in a data frame. Here is an example of running a few summary functions against the bus table.
What we’ve learned here is that there are 700 buses (which you don’t see directly but can infer from the BusID quartiles). some of them came into service on
1990-01-01, which is probably a default value because our data set starts in 2011. Buses have retirement dates as well, and a bus without a retirement date is still active. In SQL, those would be rows with a
NULL value; in R, those are the rows with
NA values. There are 468 active buses in our current data set.
We would repeat this type of analysis for each of the major elements in our data set. Let me take a moment and cover my process for finding those “major elements” because I’m all too liable to gloss over it. Here’s what I tend to do when someone hands me a brand new database:
- Find the tables with the most rows in them. In SQL Server Management Studio, you can right-click on a database and navigate to Reports –> Standard Reports –> Disk Usage by Table. This will generate a Reporting Services report showing approximate row counts (based on statistics), data size, and index size. Look at the top few tables there. Some of them are probably logging tables, which I tend to come back to later.
- Pick one of the large tables which appears to have user-relevant information and perform summary analysis, looking at the top few rows to determine what’s in that table. These large tables tend to be fact-style data, storing relevant measures customers care about.
- Recursively find foreign key constraints to other tables and look at those tables. These are often explanatory data, providing context to the fact-style data above. If you need help with recursively finding key constraints, I have a script (with bonus ASCII art) and a longer post as well. If your database has no foreign keys, there are still ways around it like looking at joins in the plan cache.
- Repeat step 2 until you run out of big tables. Then move on to medium-sized tables you have not already reviewed.
Creating a database diagram will also be helpful here, especially if foreign key constraints exist.
Now that we’ve summarily covered summary analysis, let’s jump into the next category: growth analysis.
Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.
For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.
SELECT c.CalendarYear, COUNT(*) AS NumberOfBuses FROM dbo.Bus b INNER JOIN dbo.Calendar c ON b.DateFirstInService <= c.Date AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date WHERE c.CalendarDayOfYear = 1 AND c.CalendarYear >= 2011 AND c.CalendarYear < 2019 GROUP BY c.CalendarYear ORDER BY c.CalendarYear;
I can show you the SQL results but let’s drop this into R and build a quick and dirty plot.
options(repr.plot.width=6, repr.plot.height=4) ggplot(activeBuses, aes(x = CalendarYear, y = NumberOfBuses)) + geom_point() + geom_line() + labs(x = "Calendar Year", y = "Number of Buses", title = "Number of Buses by Year") + ylim(0, 500) + theme_minimal()
The first line with
options is something I do with Jupyter to prevent it from creating huge graphs. From there, we’re creating a scatterplot with a line overlaid, labeling the axes, starting from 0, and using a minimalistic theme. Note that starting from 0 is not required—both line charts and scatter plots can start from points other than 0. I did this to make the steady growth more apparent.
Next, I want to look at the number of invoices per year. We invoice on a per-bus, per-item basis, so I would expect invoice growth to track bus growth reasonably closely. You can argue about economies of scale (maintenance workers become more efficient, you might get bigger discounts on parts, it makes sense to purchase capital equipment to make the per-bus maintenance fees lower, those types of things) but with the bus count growing so steadily, I’d think that invoices would grow similarly. So let’s take a look.
SELECT c.CalendarYear, COUNT(*) AS NumberOfInvoices FROM dbo.LineItem li INNER JOIN dbo.Calendar c ON li.LineItemDate = c.Date GROUP BY c.CalendarYear ORDER BY c.CalendarYear;
Here is the R code:
ggplot(invoicesPerYear, aes(x = CalendarYear, y = NumberOfInvoices)) + geom_point() + geom_line() + labs(x = "Calendar Year", y = "Number of Invoices", title = "Number of Invoices by Year") + theme_minimal()
And the plot:
You can see that invoice growth was fairly steady from 2011 through 2017. Yeah, there are ups and downs but that’s normal in any real data set. The jump in 2018, however, is huge: we’ve effectively doubled the number of invoices despite bus growth being steady. Here’s the plot for expenditures by year, which code I’ve left out for the sake of making you do your own dirty work:
Those extra invoices added about a million dollars over expectations. This is our first indication that something interesting has happened. Note that this is not evidence of fraud, as there can be a number of innocent explanations: maybe the buses need to go through more maintenance because they’re older, maybe regulatory requirements forced more work on the buses, maybe we got a batch of lemons which need more work done on them. There are plenty of potential causes, but this is well outside the realm of noise.
We’ll shelve this for a little bit and look at our next topic, gap analysis.
Gap analysis is something you’d typically run when you care about the lack of a value. For example, accountants tend to get picky about check numbers and invoice numbers being complete. If you go from check 1001 to 1002 to 1004, an accountant wants to know what happened to check 1003. The reason is that if you don’t have a record of 1003, it’s possible that there was embezzlement.
To perform a quick gap analysis on line items, we can use the
LEAD() window function, available since SQL Server 2012. Here’s an example of the window function in action:
WITH C AS ( SELECT li.LineItemID AS CurrentLineItemID, LEAD(li.LineItemID) OVER (ORDER BY li.LineItemID) AS NextLineItemID FROM dbo.LineItem li ) SELECT CurrentLineItemID + 1 AS rangestart, NextLineItemID- 1 AS rangeend FROM C WHERE NextLineItemID - CurrentLineItemID > 1;
Here’s what we get back:
We have several ranges of missing values here, which is a bit concerning, as our invoice numbers should be a complete set. There might be an innocuous reason for this. If we look at
sys.columns, we can see that
LineItemID is an identity column.
Identity columns are great for auto-incrementing surrogate keys but are less great for auto-incrementing keys with informational context. Let me explain what I mean. If we have line items from 1 to 1000 in the table, the next row we insert will have an ID of 1001 (assuming nobody has changed the seed and our increment value is 1). But what happens if we get an error trying to insert value 1001 and need to roll back the statement? In that case, the value 1001 has been burned and our next insertion attempt will be 1002. This can leave gaps in our data for a totally innocuous reason and without anybody actually knowing.
The same applies to sequence types: it is possible that you fail to insert using a sequence value and might lose that value forever. If you need to track a value like an invoice number, your best bet might be to gin up your own solution. You can create a table which stores your latest used value. Then, when it’s time to use the next value, go into the serializable transaction isolation level and take a lock on the table by beginning a transaction and selecting the value from the table. That will prevent anybody else from using the table and potentially grabbing the same invoice number as you.
In your insertion code, you can then increment the value, insert into the table, and if your operation was successful, update the value in the latest value table. Then close the transaction so other sessions can do their work.
This answer works in a low-throughput situation where you don’t expect more than one or two updates every few seconds. Fortunately, most systems which require this level of scrutiny tend to be fairly low-throughput or at least have relatively low concurrency. A process like generating checks for tens of thousands of employees has periods of high throughput but if you batch it all in one transaction on one session, the problem is still tractable.
I’m going to gloss over duplicates here because I’ll get into it in much more detail when we talk about cohort analysis later. For now, here are a few things I’d like to put in your mind.
What is a Duplicate?
There are two different ways we can think about duplicate data. The first way is exact matches on relevant columns where there is no unique key constraint preventing duplication. Suppose we have a
LineItemID (which is just a surrogate key) and an
InvoiceNumber on our table. That invoice number should be contiguous and unique for each line item. If we don’t have a unique key constraint on that table, however, it becomes possible for someone to use the same invoice number for two lines.
The other side of a duplicate is something which ought to be the same but isn’t, maybe due to a typo. My favorite example of this happens to come from a bank fraud case from a few years back:
When the Federal Reserve Bank of New York cleared five transactions made by the Bangladesh Bank hackers, the money went in two directions. On Thursday, Feb. 4, the Fed’s system sent $20 million to Sri Lanka and $81 million to the Philippines.
The Sri Lankan transaction contained a small but crucial error: The money was being sent to a bank account in the name of a nonprofit foundation, but the electronic message spelled it “fundation.” That prompted Deutsche Bank, an intermediary in the transaction, and a Sri Lankan bank to contact Bangladesh Bank, which led to the payment being cancelled and the money returned.
Here, “foundation” and “fundation” were supposed to be the same but a small typo made a big difference.
Duplicates and Fraud
In the Wake County fraud case, one measure of duplication is the number of invoices received on a single day. We can’t have a unique key on date and vendor (or date, vendor, and bus in our case) because it’s completely reasonable for a customer, on occasion, to send two invoices on the same day. In the Wake County case, however, they had 24 separate days with at least 50 invoices. 50 goes beyond reasonable.
I’m not going to be able to give much more than a primer here. Regression analysis is the topic of many a book and course in statistics and getting regression right can be a major time sink. Acknowledging that we will remain superficial here, we can still cover some of the ground. In its most basic form, regression is all about determining if there is a relationship between one or more input variables (also known as independent variables) and our output (the dependent variable).
We saw the line graph of invoices by year and of buses by year. My question is how much the number of buses ends up driving the number of invoices. My expectation is that the number of buses is a key factor in the number of invoices we deal with: as we add new buses to the fleet, I’d expect an approximately linear increase in the amount of maintenance work to perform, as well as the number of parts to purchase. We may see fluctuations but I expect to see a trend.
Regression by Month and Year
The first thing I want to do is regress the number of invoices versus buses using monthly data. My thought here is that the number of buses drives the monthly number of invoices and that the number of invoices grows approximately linearly with the number of buses. Let’s try these out.
First, I have my SQL query that I use to populate a data frame:
WITH buses AS ( SELECT c.FirstDayOfMonth, c.CalendarMonth, c.CalendarYear, COUNT(*) AS NumberOfBuses FROM dbo.Bus b INNER JOIN dbo.Calendar c ON b.DateFirstInService <= c.Date AND ISNULL(b.DateRetired, '2018-12-31') >= c.Date WHERE c.Date = c.FirstDayOfMonth AND c.CalendarYear >= 2011 AND c.CalendarYear < 2019 GROUP BY c.FirstDayOfMonth, c.CalendarMonth, c.CalendarYear ), expenses AS ( SELECT c.FirstDayOfMonth, COUNT(*) AS NumberOfInvoices, SUM(li.Amount) AS TotalInvoicedAmount FROM dbo.LineItem li INNER JOIN dbo.Calendar c ON li.LineItemDate = c.Date GROUP BY c.FirstDayOfMonth ) SELECT b.FirstDayOfMonth, b.CalendarMonth, b.CalendarYear, b.NumberOfBuses, e.NumberOfInvoices, e.TotalInvoicedAmount FROM buses b INNER JOIN expenses e ON b.FirstDayOfMonth = e.FirstDayOfMonth ORDER BY b.FirstDayOfMonth;
Then, I’d like to build a regression. Here is the R code for an Ordinary Least Squares linear regression:
regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses, data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018)) summary(regICPre2018)
In one function call, I get my linear regression which focuses on tying the number of invoices to the number of buses. I should note that my data is a filter where the date is earlier than 2018. We saw the big jump in invoices in 2018 and that ruins our results. Because I think something’s odd about that data, I’d like to see what it looks like if we factor out 2018 and look at 2011 through 2017. Here’s what I get back:
There are a couple of things to pick out of this. First, our R^2 is 0.45, so we are explaining 45% of the variance in
NumberOfInvoices. That’s okay but really not that good. In social science contexts, explaining 45% of human behavior is a really good result. But here we’re explaining expenditures and I’d much rather see 85-95% of the variance explained before I think an expenses model is accurate.
One thing we can do to try to improve the regression is to add features.
Adding Features to Regression
We have two additional features at hand: calendar month and calendar year. Let’s try calendar month first:
regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses + CalendarMonth, data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018)) summary(regICPre2018)
The R^2 didn’t move much at all—it went from 45% to 46%. Frankly, that’s noise. At this level, if we’re not seeing a 10% bump (or more) in R^2, I don’t know if I want to include that feature. Notice also that calendar month is not significant according to p-value. We can and should make fun of p values as much as possible, but here’s a case where the results are clear and sensible. Calendar month isn’t a factor in this regression. So let’s remove it and try calendar year.
regICPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses + CalendarYear, data = filter(expenditures, lubridate::year(FirstDayOfMonth) < 2018)) summary(regICPre2018)
Now this result is quite interesting. Our R^2 didn’t change but now neither variable is significant! This is a great example of something called multicollinearity, one of the challenges of regression. Put simply, the number of buses increases by about the same number every year, so there is very high correlation between number of buses and calendar year. Running a correlation test against the two, I end up with a value of 0.978.
That is, 97.9% of the variance reflected in buses is also reflected in year. These two variables are co-linear. Because these two variables move almost 1 for 1, it is difficult for the regression algorithm to separate behavior in one versus the other. They’re both fighting to explain the same variance and so both end up with higher p-values. Also of interest is that the R^2 doesn’t change. Multicollinearity doesn’t make your overall predictions worse, but it does make it tougher to tell which independent variables are driving the change.
This is an extreme scenario, mind you, but mutlicollinearity is a common enough occurrence that you will want to be on the lookout for it. The other linear regression amigos are serial correlation (AKA autocorrelation) and heteroskedasticity (my favorite of the three).
Now let’s take a step back, as we’re not getting the job done with regressing at the month level. Instead of grouping by month, I’ve changed the SQL query to include just calendar year and number of buses / invoices. Let’s see how that looks:
regICAnnualPre2018 <- lm(formula = NumberOfInvoices ~ NumberOfBuses, data = filter(annualExpenditures, CalendarYear < 2018)) summary(regICAnnualPre2018)
I didn’t include the SQL code because it’s a trivial variant on the prior version. Yet I included the trivial variants on the R code because that’s how I roll. Here are my results:
Wow. We went from explaining less than half of all variance to explaining 97% of the variance. That’s a huge difference and is definitely an interesting result. For a fairly mechanical problem like this one, an R^2 of .97 is high but not “shake your head” high. If this were a social sciences problem and I got an R^2 of .97, I’d wonder what I did wrong.
I don’t like that I have so few data points, but even with the low number of data points, our regression output is indicating that there’s something there. We can also run
plot(regICAnnualPre2018 and see that our residuals are both positive and negative and a small percentage of the total values:
What this tells us is that we do not see the residuals (that is, estimated – actual) consistently above or below 0, but rather spread out between them. If we saw the residuals consistently over (or under) 0, the residuals would show bias, which can be a problem when performing a regression analysis.
Finally, now that we have a good fit for the pre-2018 data, let’s see what adding 2018 does:
regICAnnual <- lm(formula = NumberOfInvoices ~ NumberOfBuses, data = annualExpenditures) summary(regICAnnual)
That’s a drop from 97% to 71%. It’s a huge drop. If we have no suspicions about data quality, that kind of drop can be devastating to us: it means our model is no longer a great model. But I do harbor some suspicions because 2018’s values are so much larger that I think there’s something weird going on.
One last note, we can take the annual pre-2018 model and generate a prediction to see what our model thinks 2018’s value ought to have been:
predict(regICAnnualPre2018, newdata = filter(annualExpenditures, CalendarYear == 2018))
This returns 5362 versus our actual invoice count of 7700. That’s a difference of more than 2000. Again, this isn’t proof of wrongdoing but it helps us put into perspective the scope of what’s going on. It’s a data point that maybe something weird is going on and this is the scale of that weirdness.
In this post, we looked at a number of analytical techniques to gain insight into our data. We focused mostly on high-level aggregates here, which can help us get a basic understanding of our data. In the next post, we’re going to move to another level of analysis: cohort analysis. This will give us a better idea of just what’s going on with our data.