Forensic Accounting: Time Series Analysis

This is part five in a series on applying forensic accounting techniques to SQL and R.

Dealing with Time

This post will be a pretty short one. In my talk, I don’t have any demos for this topic, mostly because much of the cohort analysis has secretly been time series analysis at the same time. Instead, I’ll lob out a few points and call it a day.

Time series analysis, at its core, is all about how your data changes over time. The grain for time series analysis is important: as we saw in the last post, we were able to get an excellent result at the yearly level when regressing number of active buses versus number of line items.

Annual buses versus line items regression, A-OK.

Going down to the monthly grain, the number of active buses explained none of the monthly variance within a year, so the correlation was much weaker.

Monthly buses versus line items regression, A-No-Way.

This is one of the difficulties when doing time series work: the specific grain you work against can make or break an analysis.

Use Calendar Tables

This is a strong recommendation for pretty much any SQL Server instance: have a calendar table. I have a two-part series on calendar tables, with part 1 showing how to build a table and part 2 showing sample queries.

Calendar tables will allow you to define your grain to almost any level. If you store data as dates, a calendar table can let you find such weird items as the third Tuesday of each month, Sundays after weeks with holidays on Wednesdays, or the third quarter of fiscal year 2017 (oh, I meant federal fiscal year, not state fiscal year).
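As a quick sketch of that flexibility, here is the kind of query a calendar table enables. I’m assuming columns named DayName and DayOfWeekInMonth (the Nth occurrence of that weekday within its month); your calendar table may spell these differently, but the shape of the query stays the same.

SELECT
    c.Date
FROM dbo.Calendar c
WHERE
    c.DayName = 'Tuesday'
    AND c.DayOfWeekInMonth = 3 -- the third Tuesday of each month
    AND c.CalendarYear = 2018
ORDER BY
    c.Date;

Compare that to computing “third Tuesday” on the fly with DATEPART math and you can see why I keep recommending calendar tables.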

Simple Analysis & Seasonality

One of the simplest forms of analysis is count of items per day or week or month. The goal is to look for seasonality: repeated periodic behavior. You might see more sales on one day of the week or one day of the month or one month of the year. You can then use techniques like ARIMA to project forward and estimate future behavior given prior trends.
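To make that concrete with this series’ data set, here is a sketch of the simplest possible seasonality check: a count of line items per month. From there you can eyeball repeated peaks and troughs before reaching for anything fancier like ARIMA.

SELECT
    c.CalendarYear,
    c.CalendarMonth,
    COUNT(*) AS NumberOfLineItems
FROM dbo.LineItem li
    INNER JOIN dbo.Calendar c
        ON li.LineItemDate = c.Date
GROUP BY
    c.CalendarYear,
    c.CalendarMonth
ORDER BY
    c.CalendarYear,
    c.CalendarMonth;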

Additive and Semi-Additive Measures

When performing a time series analysis, it’s important to understand whether your measure is additive, semi-additive, or non-additive with respect to time. Additivity is, simply, whether you can sum results across a dimension and get reasonable results.

For example, an additive measure over time is the sum of deposits in a bank account. If you deposit $10 today, $50 tomorrow, and $100 a week from now, you will have deposited a total of $160. Additive measures give us the most flexibility in how we analyze the data, so we typically prefer to deal with them.

Non-additive measures cannot be aggregated across any dimension, whether time or otherwise. An example of this is a person’s account number. You can’t sum up all of the account numbers of customers at your bank and get something meaningful.

Semi-additive measures fit somewhere in between: they are additive across certain dimensions and non-additive across others. Bank account balance is a great example of a semi-additive measure. Bank account balances are additive across customers: you can sum up all of the balances and determine the total amount of deposits in the bank at a particular point in time. What you cannot do, however, is sum up one customer’s account balances over different points in time and get a meaningful number. If my bank account had $100 in it yesterday and $100 in it today, I don’t have a claim for $200; I have $100 because that is my current balance.

With semi-additive measures, you have a couple of routes for analysis. First, you can look at the state of the measure at fixed points in time. For example, we might look at every account at 9 PM UTC on a daily basis; that helps us see how account balances change and get a consistent view of the bank’s holdings across time.

The other technique is to use feature engineering to convert from balances to debits and credits—that is, the individual transactions which make up the total balance. If you have those, then the values become additive across time and you could use those instead of account balances.
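Here is a minimal sketch of that second route, assuming a hypothetical dbo.AccountBalance table with one row per account per day. LAG() turns each day’s balance into a day-over-day net change, and those net changes are additive across time.

SELECT
    ab.AccountID,
    ab.BalanceDate,
    -- Net change for the day: today's balance minus the prior day's.
    -- The first row per account falls back to 0, so its "change" is the opening balance.
    ab.Balance - LAG(ab.Balance, 1, 0.0) OVER (
        PARTITION BY ab.AccountID
        ORDER BY ab.BalanceDate) AS NetChange
FROM dbo.AccountBalance ab; -- hypothetical table, for illustration only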

Looking at Date Oddities

In this section, I’ll briefly cover some date oddities which might be worth investigating. First, you can look for “odd” date combinations: do you have order dates after ship dates? Or do you have adjustments after a journal closes? When you have a natural temporal flow, it’s interesting to look for variances in that flow because they’ll tell you a lot about exceptions to processes. For example, there might be a rare circumstance where an order date comes after a ship date because your company might ship out demo products without an order and if a customer buys the unit you fill in the order date and let the customer hang onto the product. These rules are usually hard to discern and are the fuzzy domain of subject matter experts (who themselves may not have thought through the implications of these choices).

The other area where you might find oddities is to focus on date attributes. For example, slice data by the day of the week, by the day of the month, by weekdays versus weekends, by mid-week versus early week versus late week. You might find that invoices never get processed on weekends (because it’s a government agency and nobody works weekends) until one time a bunch of invoices get processed on a Saturday. Simple sums, counts, and averages can get you a long way toward understanding user behavior in these cases.
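With the calendar table from this series, that kind of slicing is a short query. Here is a sketch which splits invoice counts and average amounts by weekday versus weekend; in a data set like this one, any weekend activity at all would be worth a second look.

SELECT
    c.IsWeekend,
    COUNT(*) AS NumberOfLineItems,
    AVG(li.Amount) AS AverageAmount
FROM dbo.LineItem li
    INNER JOIN dbo.Calendar c
        ON li.LineItemDate = c.Date
GROUP BY
    c.IsWeekend;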

Don’t Forget About Holidays

Holidays change behavior as well. On some holidays, like Singles Day, sales can go through the roof. On others, like Christmas Day, sales tend to be much lower than you’d expect from just another day of the week. If you’re dealing with worker behavior, their holidays matter too: you won’t see much work done on Christmas Day because they’re all at home.

Also, be aware of floating holidays like Easter and observed holidays like Christmas Day observed. If December 25th is on a Sunday, then companies tend to observe the holiday on December 26th, meaning that even though it’s technically not a holiday, December 26th will still exhibit that kind of holiday behavior in certain circumstances.

It’s easiest to use a calendar table (hint, hint) so you can avoid calculating these kinds of rules on the fly.
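For example, with the HolidayName column on this series’ calendar table, “did anything get invoiced on a holiday?” becomes a short query rather than a pile of date math:

SELECT
    c.HolidayName,
    COUNT(*) AS NumberOfLineItems
FROM dbo.LineItem li
    INNER JOIN dbo.Calendar c
        ON li.LineItemDate = c.Date
WHERE
    c.HolidayName IS NOT NULL
GROUP BY
    c.HolidayName;

In this series’ data set, purchases only happen on non-holiday weekdays, so any rows coming back from a query like this would themselves be an oddity worth chasing.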

Conclusion

This is a tiny fraction of everything you want to do with time series, but should serve as a quick primer on things to think about as you try to understand your data better.


Forensic Accounting: Cohort Analysis

This is part four in a series on applying forensic accounting techniques to SQL and R.

Mr. Keynes’ Aggregates Conceal the Most Fundamental Mechanisms of Change

In the last post, we focused on high-level aggregates to gain a basic understanding of our data. We saw some suspicious results but couldn’t say much more than “This looks weird” due to our level of aggregation. In this post, I want to dig into data at a lower level of detail. My working conception is the cohort, a broad-based comparison of data sliced by some business-relevant or analysis-relevant component.

Those familiar with Kimball-style data warehousing already understand where I’m going with this. In the basic analysis, we essentially look at fact data with a little bit of disaggregation, such as looking at data by year. In this analysis, we introduce dimensions (sort of) and slice our data by dimensions.

Our key dimensions in this analysis are buses, employees, vendors, and expense categories. We can break out our data by any combination of these. Typically, I try to break things out one core category at a time at first and then start combining them together for more complicated slices once I have a decent understanding of what’s going on.

In today’s post, we will look at cohorts along three angles: cardinality, dimensional analysis, and clustering behavior.

Cardinality and Duplicates

One of the first things I want to figure out as I dive into this data set is, how often do I see “duplicate” results? For line items, I’m calling it “duplicate” when I see more than one invoice per bus per vendor per day. The idea here is that I know vendors work on individual buses and we force them to invoice per-bus. Here’s the SQL query:

WITH records AS
(
	SELECT
		li.LineItemDate,
		li.BusID,
		li.VendorID,
		COUNT(*) AS NumberOfInvoices
	FROM dbo.LineItem li
	GROUP BY
		li.LineItemDate,
		li.BusID,
		li.VendorID
)
SELECT
	NumberOfInvoices,
	COUNT(*) AS NumberOfOccurrences
FROM records
GROUP BY
	NumberOfInvoices
ORDER BY
	NumberOfInvoices;

And my results:

NumberOfInvoices    NumberOfOccurrences
1                   37,224
2                   118
3                   1

This says that 99.4% of the time, an invoice is unique (note that you need to double-count 118 and triple-count 1 to get the right denominator). That’s a really good sign that this is an appropriate grain for determining what normally is unique.
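If you’d rather have SQL do that arithmetic for you, a small extension of the query above weights each bucket by its invoice count. This is just a sketch, but it returns the same ratio.

WITH records AS
(
    SELECT
        li.LineItemDate,
        li.BusID,
        li.VendorID,
        COUNT(*) AS NumberOfInvoices
    FROM dbo.LineItem li
    GROUP BY
        li.LineItemDate,
        li.BusID,
        li.VendorID
)
SELECT
    -- Invoices which are alone at this grain, divided by all invoices.
    1.0 * SUM(CASE WHEN NumberOfInvoices = 1 THEN NumberOfInvoices ELSE 0 END)
        / SUM(NumberOfInvoices) AS PercentUnique
FROM records;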

By the way, how did I come up with that combination of columns? Well, I probably got an idea from the business side. With a bit of logic and a sane data model, you can get to a place where you need only a little bit of experimentation to figure out what “ought” to be unique.

So my next question is, who’s sending these duplicate invoices? Are all of our vendors doing it equally or do we have one or two dupe-senders? Well, let’s find out:

WITH records AS
(
	SELECT
		li.LineItemDate,
		li.BusID,
		li.VendorID,
		COUNT(*) AS NumberOfInvoices
	FROM dbo.LineItem li
	GROUP BY
		li.LineItemDate,
		li.BusID,
		li.VendorID
)
SELECT
	VendorID,
	COUNT(*) AS NumberOfOccurrences
FROM records
WHERE
	NumberOfInvoices > 1
GROUP BY
	VendorID
ORDER BY
	VendorID;

Once more, here are our results:

VendorID    NumberOfOccurrences
2           11
5           59
6           1
7           2
8           4
9           16
10          4
11          1
12          6
13          6
14          9

At first glance, there’s a fair spread in vendors sending dupes. Most of our vendors have sent us duplicates at least once during the eight years of data we have. Sorting this by number of occurrences descending, we can see that vendor 5 has more than 3x as many dupes as the next-highest. That’s something we should at least keep in the back of our minds, but by itself, it isn’t proof of anything.

Now let’s look at cardinality. Cardinality is the number of distinct values in a column, and we can calculate it quite easily in R using the rapply() function, which applies a function to each column in a data frame. I have loaded a data frame called lineItems, which is just a simple SELECT * FROM dbo.LineItem; query pulled in via the DBI package.

rapply(lineItems, function(x) { length(unique(x)) })

That gives us the following results:

Column               Values
LineItemID           37,463
BusID                664
VendorID             15
ExpenseCategoryID    28
EmployeeID           12
LineItemDate         1,837
Amount               23,585

This tells us that we have 37,463 unique line item IDs out of 37,463 rows. This is our primary key constraint, so it needs to be unique.

By contrast, we see 664 bus IDs, meaning that 664 buses have had service done during their lifespans. There are 700 buses in our data set, so that means 36 buses never had any maintenance done on them before retirement.
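If you want to double-check those cardinalities from the SQL side, COUNT(DISTINCT ...) returns the same numbers. Here is a sketch covering a few of the columns:

SELECT
    COUNT(*) AS NumberOfRows,
    COUNT(DISTINCT li.LineItemID) AS LineItemIDs,
    COUNT(DISTINCT li.BusID) AS BusIDs,
    COUNT(DISTINCT li.VendorID) AS VendorIDs,
    COUNT(DISTINCT li.EmployeeID) AS EmployeeIDs,
    COUNT(DISTINCT li.LineItemDate) AS LineItemDates
FROM dbo.LineItem li;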

We can also look at interesting subsets of the data and see how that behavior differs from the broader set. For example, we know that you need two employees to approve an expenditure of more than $1000. So let’s look at high-value invoices which are under the two-signer cutoff, maybe $850 to $999.99.

highValueInvoices <- lineItems %>% dplyr::filter(Amount >= 850 & Amount < 1000)
rapply(highValueInvoices, function(x) { length(unique(x)) })

Column               Values
LineItemID           996
BusID                452
VendorID             12
ExpenseCategoryID    9
EmployeeID           12
LineItemDate         525
Amount               576

12 of our 15 vendors have sent in invoices in this price range. Let’s use data.table’s setDT() function to convert the data frame into a data.table, which lets us slice our data by a variable like VendorID. This is equivalent to a SQL statement which counts results grouped by VendorID (a sketch of that query appears after the results below).

data.table::setDT(highValueInvoices)[, .N, keyby=VendorID]

VendorID    N
1           75
2           22
5           525
6           48
7           46
8           42
9           12
11          72
12          10
13          11
14          88
15          45
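For reference, the SQL version of that count-by-vendor query, filtered to the same $850 to $999.99 band, would be nothing fancy:

SELECT
    li.VendorID,
    COUNT(*) AS N
FROM dbo.LineItem li
WHERE
    li.Amount >= 850
    AND li.Amount < 1000
GROUP BY
    li.VendorID
ORDER BY
    li.VendorID;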

Hmm, vendor 5 shows up again. Maybe they sell expensive equipment. Let’s look at invoices over $1000.

data.table::setDT(filter(lineItems, Amount > 1000))[, .N, keyby=VendorID]

VendorID    N
1           171
2           7
6           37
7           105
8           101
9           6
11          161
12          7
13          11
14          157
15          46

Hmm, no vendor 5 at all? That’s starting to get weird. Let’s look at amounts over $995 for this vendor.

data.table::setDT(filter(lineItems, VendorID == 5 & Amount > 995))[, .N, keyby=Amount]

Amount     N
996.06     1
997.25     1
997.43     1
999.29     1
999.99     411

411 invoices of exactly one penny under the cutoff? That’s…weird.

How about employees? Who’s working with this vendor on these items?

data.table::setDT(filter(lineItems, VendorID == 5 & Amount > 995))[, .N, keyby=EmployeeID]

EmployeeID    N
4             80
8             104
10            123
12            108

Four employees handled those invoices. Maybe those four are the only four who work with the vendor? I’ve seen places which do that, splitting out which accountant works with which vendor.

data.table::setDT(filter(lineItems, VendorID == 5 & year(LineItemDate) == 2018))[, .N, keyby=EmployeeID]

EmployeeID    N
1             24
2             22
3             21
4             610
5             25
6             21
7             26
8             666
9             19
10            667
11            28
12            631

Nope. All 12 have worked with this vendor. But the four in question each have 30x as many invoices as the rest. Let’s dig deeper and see if we can unravel this mystery.

Dimensional Analysis

When I use the term “dimensional analysis,” I don’t necessarily mean true Kimball-style dimensions. Instead, think of it as an analysis of data grouped by some explanatory variable (or variables). For example, we can have a breakdown by time. Our data will come from the following SQL query:

SELECT
	li.LineItemID,
	li.BusID,
	li.VendorID,
	v.VendorName,
	li.ExpenseCategoryID,
	ec.ExpenseCategory,
	li.EmployeeID,
	e.FirstName,
	e.LastName,
	CONCAT(e.FirstName, ' ', e.LastName) AS EmployeeName,
	li.LineItemDate,
	c.CalendarMonth,
	c.MonthName,
	c.CalendarYear,
	c.FirstDayOfMonth,
	li.Amount
FROM dbo.LineItem li
	INNER JOIN dbo.Employee e
		ON li.EmployeeID = e.EmployeeID
	INNER JOIN dbo.ExpenseCategory ec
		ON li.ExpenseCategoryID = ec.ExpenseCategoryID
	INNER JOIN dbo.Vendor v
		ON li.VendorID = v.VendorID
	INNER JOIN dbo.Calendar c
		ON li.LineItemDate = c.Date;

Here’s an example of the number of invoices per month grouped by year and shown as a box plot:

This is normal growth, Jim. Normal growth.

And here’s the R code to generate that plot:

itemsByMonthAndYear <- lineItems %>% group_by(FirstDayOfMonth, CalendarYear, CalendarMonth) %>% summarize(n = n())
ggplot(data = itemsByMonthAndYear, aes(x = CalendarYear, y = n, group = CalendarYear)) +
    geom_boxplot() +
    theme_minimal() +
    labs(x = NULL, y = NULL, title = "Box Plot of Number of Invoices by Month and Year")

Something that strikes me as odd about 2018 is that its low points sit right where I would expect its medium to high points to be, given the 2011 through 2017 trend. Let me look at each month as a scatter plot to see this more clearly.

ggplot(data = itemsByMonthAndYear, aes(x = FirstDayOfMonth, y = n)) +
    geom_point() +
    theme_minimal() +
    labs(x = "Month", y = "Number of Invoices") +
    theme(axis.text.x = element_blank())
Pictured: the Loch Ness Monster in its natural habitat, the X-Y plane.

Even in 2018, we have several months which fit the trend and a few which don’t. Let’s turn on our CSI Enhancer ™ and look specifically at 2018.

ggplot(data = filter(itemsByMonthAndYear, CalendarYear == 2018), aes(x = FirstDayOfMonth, y = n)) +
    geom_point() +
    scale_x_date() +
    theme_minimal() +
    labs(x = "Month", y = "Number of Invoices")
Just a normal trend, yep.

Something happened sometime around March and slowed down in September. Maybe there was something big going on that year which required more vehicle maintenance?

We’re not sure yet exactly what’s going on but we’re getting there.

Pivot to Expense Category

Here’s a box plot of sales by expense category. First the code, and then the image:

expenseCategoriesByMonth <- lineItems %>%
    group_by(ExpenseCategory, FirstDayOfMonth, CalendarMonth, CalendarYear) %>%
    summarize(n = n(), total = sum(Amount), avg = mean(Amount))

ggplot(data = expenseCategoriesByMonth, aes(x = ExpenseCategory, y = total)) +
    geom_boxplot() +
    scale_y_continuous(trans = "log10", labels = scales::dollar) +
    coord_flip() +
    theme_minimal()
I look at this and think of a bunch of resistors.

Delving into the category with the biggest positive outliers (Glass, Windshields & Windows), I found out that the only vendor selling here is vendor #5, Glass and Sons Glass and Accessories. Let’s take a closer look at Mr. Glass and his sons.
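As an aside, one quick way to confirm which vendors can even sell in a category is a join against the vendor-to-category mapping table. Here is a sketch, assuming the category’s stored name matches the label on the chart:

SELECT
    v.VendorID,
    v.VendorName
FROM dbo.VendorExpenseCategory vec
    INNER JOIN dbo.Vendor v
        ON vec.VendorID = v.VendorID
    INNER JOIN dbo.ExpenseCategory ec
        ON vec.ExpenseCategoryID = ec.ExpenseCategoryID
WHERE
    ec.ExpenseCategory = 'Glass, Windshields & Windows'
ORDER BY
    v.VendorID;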

Glass and Sons

I’m going to look at the four expense categories which Glass and Sons specializes in. This covers the years 2011-2017 and shows the amount per invoice line item within each category for every competitor in that category.

options(repr.plot.width=8, repr.plot.height=6)
ggplot(data = filter(lineItems, ExpenseCategoryID %in% c(4, 8, 18, 19) & CalendarYear < 2018),
       aes(x = VendorName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal() +
    facet_wrap(facets = ~ExpenseCategory, ncol = 2)
Apparently, this county government lives in a space with perfect competition.

Glass and Sons is right in line with the other vendors in each category throughout the seven-year period. One thing I want to point out is that I added geom_boxplot(varwidth = TRUE) so the width of each box reflects the number of invoices behind it. We can see that there’s no real impetus to choose one vendor over another during this time stretch.

So how about 2018?

options(repr.plot.width=8, repr.plot.height=6)
ggplot(data = filter(lineItems, ExpenseCategoryID %in% c(4, 8, 18, 19) & CalendarYear == 2018),
       aes(x = VendorName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal() +
    facet_wrap(facets = ~ExpenseCategory, ncol = 2)
Demand curves slope upward?

In 2018, Glass and Sons became both a lot more expensive and a lot more popular than its competition. I may not be a big-city lawyer (or any kind of lawyer) but this smells fishy. At this point, I have concerns with this data. If it’s accurate, we might be looking at fraudulent behavior. Before I start leveling accusations, though, I want to dig a bit further. For there to be fraud, I’d expect at least one person on the inside working with Glass and Sons. I don’t want to contact anyone on the other side because I can’t be positive who’s involved.

Me right now.

Analyzing Employees

I’m suspicious but don’t have enough yet to go to anyone. Let’s keep digging, this time at Glass and Sons invoices per county employee. First, prior to 2018:

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = filter(lineItems, VendorID == 5 & CalendarYear < 2018),
       aes(x = EmployeeName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal()
Some spread, nothing outlandish.

Now for 2018:

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = filter(lineItems, VendorID == 5 & CalendarYear == 2018),
       aes(x = EmployeeName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal()
Those 4 people from before? They showed up again.

Let’s go a step further and look at our big months versus our small months. Maybe there was some change at the county where these four were on a special assignment of some kind? First let’s look at the non-weird months:

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = filter(lineItems, VendorID == 5 & CalendarYear == 2018 & CalendarMonth %in% c(1, 2, 10, 11, 12)),
       aes(x = EmployeeName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal()
Still a spread. Not a ridiculous spread, but a spread.

And now the weird months:

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = filter(lineItems, VendorID == 5 & CalendarYear == 2018 & CalendarMonth %in% c(3, 4, 5, 6, 7, 8, 9)),
       aes(x = EmployeeName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal()
Special assignment. The specialest.

These are my four persons of interest. I think there’s something going on and I have enough evidence to go to my superiors and get a real investigation going.

Clustering Behavior

The last bit I want to look at today is clustering behavior. The idea here is that we might see clusters around certain points due to explicit human behavior. For example, suppose we have a corporate policy where any meal over $20 needs a receipt but any meal under $20 doesn’t. Many (most?) people will report accurately, but there is an incentive for people to report meals of $19.95 or $19.99 even if they only spent $3.00 at the grocery store that morning.

In our example, I know that $1000 is our cutoff point, so I’d expect some clustering just under $1K. I can use a histogram to see this in action.

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = lineItems, aes(x = Amount)) +
    geom_histogram(bins = 100) +
    scale_x_continuous(labels = scales::dollar) +
    labs(y = NULL, x = NULL) +
    theme_minimal()
The “asymptotic curve hits a wall” graph.

We can see a fairly smooth curve regularly dropping…until we get to just under $1000. One last cohort-style comparison. First, everybody but Glass and Sons:

A smooth descent.

And here’s Glass and Sons:

This is normal, right?

The multiple peaks aren’t outlandish. The giant wall just under the cutoff point for two employee signatures? Yeah, that’s suspicious.
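If you want a number to go with the picture, a quick count of invoices sitting within a few dollars of the two-signature cutoff, broken out by vendor, tells the same story. The $990 lower bound here is an arbitrary choice for the sketch.

SELECT
    li.VendorID,
    COUNT(*) AS InvoicesJustUnderCutoff
FROM dbo.LineItem li
WHERE
    li.Amount >= 990
    AND li.Amount < 1000
GROUP BY
    li.VendorID
ORDER BY
    InvoicesJustUnderCutoff DESC;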

Conclusion

In today’s post, we looked at different ways to slice and dice data sets, drilling down far enough to find reasonable evidence of fraud involving one vendor and four employees. Again, I want to stress that there’s no absolute proof here, but there’s enough that we can sic the real accountants on it: figuring out whether all of those invoices match up with parts, whether those parts made it into inventory, and the other pieces of evidence which would prove fraud.

In the next post, we’ll talk a little more about specific ways to handle time-series data.

Azure Data Studio Notebooks: GitHub Issues

There are a couple of Azure Data Studio GitHub issues near and dear to my heart that I’d like to point out in the hopes that it’ll raise their profile and give the Azure Data Studio product team more reason to make my life easier.

Command Mode

The first issue I’d love to see addressed is to have a command mode, which Jupyter Notebooks already includes. Having a command mode makes life easier because it lets you navigate and manipulate cells from the keyboard without clicking into them, which ties directly into the next issue.

Jupyter Notebook Shortcuts

The other issue I’d love to see addressed is to add Jupyter notebook shortcuts. This includes edit mode shortcuts as well as command mode shortcuts.

Forensic Accounting: Generating Sketchy (But Not Too Sketchy) Data

This is part two in a series on applying forensic accounting techniques to SQL and R.

Start with a Data Model

My plan here is to create a simplified version of the Wake County transportation fraud case. To do so, I’ve created the data model which we see below:

The forensic accounting database diagram courtesy of SQLdbm.

As a quick note, I include a Calendar table to make it easier for me to generate data. I’ll also use it for some queries later on in the series.

Here’s the gist: we have employees at our organization. We own and operate buses. Buses come into service on a particular date and they eventually get retired. During their operational window, we need to perform maintenance on the buses and buy new parts. We purchase these parts from vendors and the parts fit into expense categories. Each vendor sells products or services from at least one category and we can only buy parts where we have established a link between vendor and category.

All of these are simple tables and way easier than reality—for example, we have categories of products but no specific products, no variants, etc. If you came to this series looking for a complete data model covering vehicle transportation and maintenance, you’ll be sorely disappointed.
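To give a rough sense of what sits behind the diagram, here is a sketch of two of the central tables. The column names match what we use throughout the series; the data types, keys, and constraints are my assumptions rather than the exact definitions.

CREATE TABLE dbo.Bus
(
    BusID INT NOT NULL PRIMARY KEY,
    DateFirstInService DATE NOT NULL,
    DateRetired DATE NULL
);

CREATE TABLE dbo.LineItem
(
    LineItemID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    BusID INT NOT NULL REFERENCES dbo.Bus (BusID),
    VendorID INT NOT NULL,           -- references dbo.Vendor
    ExpenseCategoryID INT NOT NULL,  -- references dbo.ExpenseCategory
    EmployeeID INT NOT NULL,         -- references dbo.Employee
    LineItemDate DATE NOT NULL,
    Amount DECIMAL(19, 2) NOT NULL
);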

Add Data and Stir

The next thing we need to do is fill in this data. I’m not going to include every static row here, but I will mention how I load each table.

Static Tables

  • dbo.Employee: I picked 12 names and added them to the table.
  • dbo.ExpenseCategory: I have 28 categories. 3 of them I made up myself and the other 25 I got from a website which sells bus parts and supplies. #Research
  • dbo.Vendor: I created 15 phony vendors and added them to the table.
  • dbo.VendorExpenseCategory: For each of the 15 vendors, I picked expense categories which appeared to fit. For example, a vendor named Bus Repair Shack (a classy organization, I’m sure) performs vehicle repair and vehicle maintenance but doesn’t sell any parts directly.

That leaves just two tables, but they’re doozies.

Bus

We’ll start with the bus table. Our agency started at the beginning of 2011 with an allotment of 300 decrepit buses:

INSERT INTO dbo.Bus
(
	BusID,
	DateFirstInService,
	DateRetired
)
SELECT TOP (300)
	ROW_NUMBER() OVER (ORDER BY NEWID()) AS BusID,
	'1990-01-01' AS DateFirstInService,
	NULL AS DateRetired
FROM dbo.Calendar;

We add 50 shiny new buses each year:

INSERT INTO dbo.Bus
(
	BusID,
	DateFirstInService,
	DateRetired
)
SELECT
	300 + ROW_NUMBER() OVER (ORDER BY c.Date, ao.object_id),
	c.Date,
	NULL AS DateRetired
FROM dbo.Calendar c
	CROSS JOIN ( SELECT TOP(50) object_id FROM sys.all_objects ) ao
WHERE
	c.CalendarDayOfYear = 1
	AND c.Date < '2019-01-01';

Then we retire approximately 36 beat-up hunks of junk on the first day of each year:

WITH candidates AS
(
	SELECT
		ROW_NUMBER() OVER (PARTITION BY c.Date ORDER BY NEWID()) AS rownum,
		c.Date,
		b.BusID,
		b.DateFirstInService,
		b.DateRetired
	FROM dbo.Calendar c
		INNER JOIN dbo.Bus b
			ON c.Date > b.DateFirstInService
	WHERE
		c.CalendarDayOfYear = 1
		AND c.Date < '2019-01-01'
)
UPDATE b
SET
	DateRetired = c.Date
FROM candidates c
	INNER JOIN dbo.Bus b
		ON c.BusID = b.BusID
WHERE
	c.rownum <= 36;

This means that, on net each year, we add somewhere around 14 new buses.

Line Items — Legit Data

This is where stuff gets crazy. First, I created a table named #ValuePerCategory, which has the mean price and the price standard deviation for each expense category. To get this information, I trawled through the catalog and picked reasonable-enough values for each of the categories. This is my level of commitment to getting things right(ish). The standard deviations, though, I just made up. I didn’t look at huge numbers of products and calculate these values myself. That’s the limit of my commitment to excellence and why I don’t have a giant banner on my stadium.

With this temp table in place, I created a new temporary table for line items and loaded it with a heap of data:

DECLARE
	@InclusionThreshold DECIMAL(7,6) = 0.00067,
	@Precision INT = 2;

INSERT INTO #LineItem
(
	BusID,
	VendorID,
	ExpenseCategoryID,
	EmployeeID,
	LineItemDate,
	Amount
)
SELECT
	b.BusID,
	vec.VendorID,
	vec.ExpenseCategoryID,
	r.EmployeeID,
	c.Date,
	ABS(s.Amount) AS Amount
FROM dbo.Calendar c
	CROSS JOIN dbo.Bus b
	CROSS JOIN dbo.VendorExpenseCategory vec
	INNER JOIN #ValuePerCategory vpc
		ON vec.ExpenseCategoryID = vpc.ExpenseCategoryID
	CROSS APPLY (
		SELECT
			RAND(CHECKSUM(NEWID())) AS rand1,
			RAND(CHECKSUM(NEWID())) AS rand2,
			FLOOR(RAND(CHECKSUM(NEWID())) * (12 - 1 + 1)) + 1 AS EmployeeID,
			RAND(CHECKSUM(NEWID())) AS InclusionThreshold
	) r
	CROSS APPLY (
		SELECT
			ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * vpc.StdDevPrice, @Precision) + vpc.MeanPrice AS Amount
	) s
WHERE
	c.Date >= '2011-01-01'
	AND c.Date < '2019-01-01'
	AND c.IsWeekend = 0
	AND c.HolidayName IS NULL
	AND b.DateFirstInService <= c.Date
	AND ISNULL(b.DateRetired, '9999-12-31') > c.Date
	AND r.InclusionThreshold < @InclusionThreshold;

Basically, we generate one candidate row per combination of calendar day, bus, and vendor expense category. We then assume that prices follow a normal distribution and draw an amount randomly from a normal distribution defined by the mean price and price standard deviation for that expense category. We only make purchases on non-holiday weekdays (these are government employees, after all) for buses which are in service. Finally, for each candidate row we draw a uniform random number between 0 and 1 using RAND(); if that draw falls below our inclusion threshold, we keep the row in our sample. I played around with numbers until I landed on 0.067% of rows. Even with that low a threshold, we still ended up with tens of thousands of entries, which is plenty for what we need.

The nice thing about following normal distributions and having a large number of rows is that there will be oddities even without introducing fraud. I want to make it clear that oddities are not in themselves proof of fraud. Assuming you pull randomly from a normal distribution, you can expect approximately 3 in 1000 rows to be more than three standard deviations from the mean. Real-life distributions aren’t always normal, so even that rule of thumb has its limits. But to the extent that distributions aren’t normal, they tend to increase the number of seeming outliers, not decrease it.

Line Items — Fraudulent Data

Now we want to introduce some fraud. Some people decided to collude with one of our vendors. They started their fraud on February 9th, 2018 and ended their fraud on October 14th, 2018. They made sure to limit purchases to no more than $999.99 because our cutoff for requiring two signatures is $1000. On each day, there is a very small chance of a fraudulent invoice per combination of calendar day, bus, and valid expense category: 0.0012%.

SET @InclusionThreshold = 0.000012;
INSERT INTO #LineItem
(
	BusID,
	VendorID,
	ExpenseCategoryID,
	EmployeeID,
	LineItemDate,
	Amount
)
SELECT
	b.BusID,
	vec.VendorID,
	vec.ExpenseCategoryID,
	CASE r.EmployeeID
		WHEN 1 THEN 4
		WHEN 2 THEN 8
		WHEN 3 THEN 10
		WHEN 4 THEN 12
		ELSE 4
	END AS EmployeeID,
	c.Date,
	CASE
		WHEN ABS(s.Amount) >= 1000 THEN 999.99
		ELSE ABS(s.Amount)
	END AS Amount
FROM dbo.Calendar c
	CROSS JOIN dbo.Bus b
	CROSS JOIN dbo.VendorExpenseCategory vec
	-- This is to simulate multiple entries
	CROSS JOIN dbo.Bus b_simmult
	INNER JOIN #ValuePerCategory vpc
		ON vec.ExpenseCategoryID = vpc.ExpenseCategoryID
	CROSS APPLY (
		SELECT
			RAND(CHECKSUM(NEWID())) AS rand1,
			RAND(CHECKSUM(NEWID())) AS rand2,
			FLOOR(RAND(CHECKSUM(NEWID())) * (4 - 1 + 1)) + 1 AS EmployeeID,
			RAND(CHECKSUM(NEWID())) AS InclusionThreshold
	) r
	CROSS APPLY (
		SELECT
			ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * (vpc.StdDevPrice * 2), @Precision) + (vpc.MeanPrice * 5) AS Amount
	) s
WHERE
	c.Date >= '2018-02-09'
	AND c.Date < '2018-10-14'
	AND c.IsWeekend = 0
	AND c.HolidayName IS NULL
	AND b.DateFirstInService <= c.Date
	AND ISNULL(b.DateRetired, '9999-12-31') > c.Date
	AND vec.VendorID = 5
	AND r.InclusionThreshold < @InclusionThreshold;

INSERT INTO dbo.LineItem 
(
	BusID,
	VendorID,
	ExpenseCategoryID,
	EmployeeID,
	LineItemDate,
	Amount
)
SELECT
	li.BusID,
	li.VendorID,
	li.ExpenseCategoryID,
	li.EmployeeID,
	li.LineItemDate,
	li.Amount
FROM #LineItem li
ORDER BY
	li.LineItemDate,
	li.BusID,
	li.VendorID,
	li.ExpenseCategoryID,
	li.Amount DESC;

Still, we do add in a pretty large number of invoices, enough to make some noise.

Line Item — Missing Invoices

Our late CFO (who got hit by a bus while listening to Alanis Morissette) kinda-sorta accidentally(?) deleted some line items from our database. Nobody knows what was in there, but we can say the late CFO was wearing some very expensive suits for someone making county government money.

DELETE
FROM dbo.LineItem
WHERE LineItemID IN (6, 7, 14, 20, 199, 200, 201, 202, 339, 340, 341);

Conclusion

Before digging into the sundry techniques around forensic accounting, I wanted to make sure we had a good understanding of our data set, as I’ll use it consistently through the series. I did kind of give away the game with this post, but as we walk through techniques, we’ll be able to see if we find the problem or if our culprits end up hiding in the noise.

Forensic Accounting: Introduction and Case Study

This is part one in a series on applying forensic accounting techniques to SQL and R.

Necessary Disclaimers

Before I get started, I have some disclaimers I need to get out of the way.

I am not a CPA. My wife is a CPA, but I’m the one writing this, so you get my interpretation, not hers. I have taken accounting courses and worked on accounting and budgeting projects, but I would have landed Al Capone in jail and would probably miss important tax deductions, so don’t take accounting advice from me. Also, no CPEs for you accountants who do end up reading this…

The other disclaimer is that all of the data in this series is totally made-up. I am working off of a case study which did happen, but we are going to use some artificial data rather than the real data. This is for a couple of reasons. First, I wasn’t able to get ahold of the original data set. But even if I did, I think artificial data is better for this talk because I can pare things down to the minimum needed to understand the problem. The more realistic things get, the muddier they are and the more likely I am to lose people in the weeds. Therefore, I intend to keep things simple and a good bit easier than reality.

Who are You People and Why are You in my Living Room?

My intended audience for this series is developers and database administrators, not accountants. If you are an accountant looking to hone some data platform skills, you’ll probably pick up some stuff. If you are a developer or database administrator, my goal is to add some really cool tools to your belt.

This is NOT Just About Fraud

Before I dig into my case study, I want to make it absolutely clear that these techniques will help you do a lot more than uncover fraud in your environment. My hope is that there is no fraud going on in your environment and you never need to use these tools for that purpose.

Even with no fraud, there is an excellent reason to learn and use these tools: they help you better understand your data. A common refrain from data platform presenters is “Know your data.” I say it myself. Then we do some hand-waving stuff, give a few examples of what that entails, and go on to the main point of whatever talks we’re giving. Well, this series is dedicated to knowing your data and giving you the right tools to learn and know your data.

A Case Study with a Happy Ending

About a decade ago, WRAL began reporting on a fraud investigation involving the Wake County Public School System in Wake County, North Carolina (which is primarily the city of Raleigh). They did some great investigative reporting and I highly recommend reading some of the articles to learn more about what happened. What follows is my summary.

Several employees in the Wake County Public School System’s transportation department, including the then-director, conspired with Barnes Motor & Parts in a scheme involving false invoices for non-existent parts and services. The parties then collected the money from these fraudulent invoices, split it up, and used it to purchase everything from gift cards to trucks and boats. This happened over the course of a few years before investigators caught it. To this day we still don’t know exactly how much money was embezzled from Wake County taxpayers, but the government was able to claw back $5 million. Barnes Motor & Parts paid $3 million, including a $2.5 million criminal fine and a $500,000 civil fine, for their part in the scheme. As for county employees, several of them went to prison.

It’s good to know that the people responsible for this crime paid, but for our purposes, I think there are two questions which are interesting:

  1. How were they able to get away with this for so long?
  2. How did investigators eventually uncover this fraud?

I’ll briefly cover each of those questions with their own headers because that makes this jumble of words appear a bit more coherent from a distance.

How They Got Away With It: No Meddling Kids or Talking Dogs

One of the worst parts of this story is the complete lack of oversight or accountability in the transportation department, which allowed corruption to fester for years.

The fraud participants were able to take advantage of a rule which existed at the time: any invoice of under $2500 needed only one employee signature, whereas invoices over $2500 required two signatures. Therefore, the culprits at Barnes Motor & Parts submitted invoices under the $2500 limit so that the corrupt county employees could sign off without further oversight.

During this time, it appears that there were no internal or external audits taking place. When one finally did occur, they found discrepancies and eventually unraveled this fraud. So let’s talk a bit about how the investigators did it.

What Investigators Did

One of the key tools investigators used here was actually pretty simple: linear regression. Michael East had a breakout session at a CFO Symposium where he covered this and a few other topics. Unfortunately, the PDF with materials is no longer online (as far as I can tell) but I was able to snag the following graph from his talk:

One of these dots is not like the others.

As the number of buses increases, maintenance costs should grow linearly (or sub-linearly if you can take advantage of economies of scale, as I think the chart really shows). But in the year with the greatest amount of fraud, maintenance costs were millions of dollars over what you would have expected. That is a red flag.

Another red flag is that there were 24 separate days in which Barnes Motor & Parts submitted more than 50 invoices, all of which happened to be under the $2500 limit. Vendors will often try to fit as much as possible onto one invoice because creating invoices is annoying and sending out multiple invoices increases the risk of some of them being lost in the process. I can see a company creating 2 or 3 invoices for separate business units or something, but it’s really hard to think of a reasonable explanation for 50+ invoices all coming in on the same day from the same vendor.
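That sort of pattern is easy to hunt for once the invoices are sitting in a table. Against the simplified line item schema we’ll build later in this series (a stand-in for the real purchasing system, not the investigators’ actual query), a sketch of the check is just a grouped count:

SELECT
    li.VendorID,
    li.LineItemDate,
    COUNT(*) AS NumberOfInvoices
FROM dbo.LineItem li
GROUP BY
    li.VendorID,
    li.LineItemDate
HAVING
    COUNT(*) > 50
ORDER BY
    NumberOfInvoices DESC;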

The Flow of This Series

From here on out, we’re going to re-create something similar to the Wake County problem but using artificial data. We’ll then walk through a series of techniques to learn more about our data. Yes, the exercise will be fraud detection but the skills you learn are ultimately about gaining a better understanding of your data. Even if you have zero fraud, you will still be better at your job because you understand what it means to know your data and can design solutions which better fit your specific circumstances.

In the next post, I will show how we can generate some sketchy (but not outlandish) data. Then, the next several posts in the series will cover analysis techniques. Finally, we’ll wrap up with additional techniques I don’t cover in detail but which are still potentially interesting. Stay tuned!

WhoIsActive In SSMS

Several years ago now, I put together a post on helpful wrapper procedures for calling sp_WhoIsActive. Since then, I’ve settled on three sp_whoisactive calls which I keep bound to SSMS keyboard shortcuts.

Ctrl + F1: What’s Going On?

Here is my bread and butter query in a format you can easily copy and paste:

EXEC sp_whoisactive @get_full_inner_text = 1, @get_outer_command = 1, @find_block_leaders = 1, @get_task_info = 2, @get_transaction_info = 1, @get_plans = 0, @output_column_list = '[session_id][login_name][dd%][CPU][reads][writes][wait_info][blocking_session_id][blocked_session_count][sql_text][sql_command][status][tempdb_%][%]', @sort_order = '[session_id]';

And here it is in a little bit nicer of a format so we can cover it:

EXEC sp_whoisactive
	@get_full_inner_text = 1,
	@get_outer_command = 1,
	@find_block_leaders = 1,
	@get_task_info = 2,
	@get_transaction_info = 1,
	@get_plans = 0,
	@output_column_list = '[session_id][login_name][dd%][CPU][reads][writes][wait_info][blocking_session_id][blocked_session_count][sql_text][sql_command][status][tempdb_%][%]',
	@sort_order = '[session_id]';

This call is typically very fast. There are cases where sp_whoisactive can be slow, such as pulling back execution plans. In scenarios where I just want to see what’s going on with a minimum of fuss, I run this command. I’ve ordered the list so that I can see some of the most important columns for me up-front: how long queries have been running, their total resource utilization, reasons for waits, and blocking details. Getting this info, as well as the command, without scrolling is nice when troubleshooting.

Ctrl + 3 For Queries From Me

This second command focuses on my queries. This is useful when looking at busy servers with a lot of activity going on, where it might be difficult to figure out which session is actually yours. Here is the copy-pastable block:

EXEC sp_whoisactive @filter_type = 'login', @filter = '{{YOURNAMEHERE}}', @get_full_inner_text = 1, @get_outer_command = 1, @find_block_leaders = 1, @get_task_info = 2, @get_transaction_info = 1, @get_plans = 0, @output_column_list = '[session_id][login_name][dd%][CPU][reads][writes][wait_info][blocking_session_id][blocked_session_count][sql_text][sql_command][status][tempdb_%][%]', @sort_order = '[session_id]';

And the block which humans can read:

EXEC sp_whoisactive
	@filter_type = 'login',
	@filter = '{{YOURNAMEHERE}}',
	@get_full_inner_text = 1,
	@get_outer_command = 1,
	@find_block_leaders = 1,
	@get_task_info = 2,
	@get_transaction_info = 1,
	@get_plans = 0,
	@output_column_list = '[session_id][login_name][dd%][CPU][reads][writes][wait_info][blocking_session_id][blocked_session_count][sql_text][sql_command][status][tempdb_%][%]',
	@sort_order = '[session_id]';

The column set is the same as before but now I’m filtering to my login. As a quick note, you’ll obviously want to replace {{YOURNAMEHERE}} with your login.

Ctrl + 4 For Even More

When I need to get execution plans, that’s when I break out Ctrl+4. Here is the one-liner:

EXEC sp_whoisactive @get_full_inner_text = 1, @get_outer_command = 1, @find_block_leaders = 1, @get_task_info = 2, @get_transaction_info = 1, @get_plans = 1, @output_column_list = '[session_id][login_name][dd%][CPU][reads][writes][wait_info][blocking_session_id][blocked_session_count][sql_text][sql_command][status][tempdb_%][%]', @sort_order = '[session_id]';

And the full query:

EXEC sp_whoisactive
	@get_full_inner_text = 1,
	@get_outer_command = 1,
	@find_block_leaders = 1,
	@get_task_info = 2,
	@get_transaction_info = 1,
	@get_plans = 1,
	@output_column_list = '[session_id][login_name][dd%][CPU][reads][writes][wait_info][blocking_session_id][blocked_session_count][sql_text][sql_command][status][tempdb_%][%]',
	@sort_order = '[session_id]';

The big difference here is the addition of @get_plans = 1, which can make the query slower on busy systems. As a result, this is one I’m less likely to run than the others. Still, when you need an execution plan, this is a valuable query to run.

Conclusion: I Ran out of Mnemonics

Those are three of the ways that I like using sp_whoisactive in day-to-day database development and query tuning work. This is one of the most flexible stored procedures out there, but that flexibility can seem overwhelming at first. Hopefully this post gives you a starting point to build out your own helper queries.

Vote For SPARK-27006

In lieu of a PolyBase Revealed post today, I’m going to ask a favor of you. Yes, you, the person wearing the thing, right there.

Vote For Duke

There is a Jira ticket for the Apache Spark project, SPARK-27006. The gist of this ticket is to bring .NET support to Spark, specifically by supporting DataFrames in C# (and hopefully F#). There is no support for Datasets or RDDs included here, but giving .NET developers DataFrame access would make it easy for us to write code which interacts with Spark SQL and a good chunk of the SparkSession object.

One of the more interesting parts of this is the support of C# UDFs and UDAFs, allowing us to write potentially complicated functions in .NET and execute them directly against data in Spark.

If you like the idea of being able to write C# (or hopefully F#) code against Spark clusters, I have a request to make of you: vote for this Jira ticket. If you do not already have a Jira account for Apache projects, you will need to create one, but that’s a couple of minutes’ work. Then click the “Vote for this issue” link on the right-hand side. Having a large number of votes for this issue will show the Spark community that there’s value in bringing .NET support in as a first-class citizen. It also tells Microsoft that there’s a community interested in the topic, so they should dedicate time and get it right.