Forensic Accounting: Cohort Analysis

This is part four in a series on applying forensic accounting techniques to SQL and R.

Mr. Keynes’ Aggregates Conceal the Most Fundamental Mechanisms of Change

In the last post, we focused on high-level aggregates to gain a basic understanding of our data. We saw some suspicious results but couldn’t say much more than “This looks weird” due to our level of aggregation. In this post, I want to dig into data at a lower level of detail. My working conception is the cohort, a broad-based comparison of data sliced by some business-relevant or analysis-relevant component.

Those familiar with Kimball-style data warehousing already understand where I’m going with this. In the basic analysis, we essentially looked at fact data with a little bit of disaggregation, such as breaking data out by year. In this analysis, we introduce dimensions (sort of) and slice our data by them.

Our key dimensions in this analysis are buses, employees, vendors, and expense categories. We can break out our data by any combination of these. Typically, I try to break things out one core category at a time at first and then start combining them together for more complicated slices once I have a decent understanding of what’s going on.

In today’s post, we will look at cohorts from three angles: cardinality, dimensional analysis, and clustering behavior.

Cardinality and Duplicates

One of the first things I want to figure out as I dive into this data set is, how often do I see “duplicate” results? For line items, I’m calling it “duplicate” when I see more than one invoice per bus per vendor per day. The idea here is that I know vendors work on individual buses and we force them to invoice per-bus. Here’s the SQL query:

WITH records AS
(
	SELECT
		li.LineItemDate,
		li.BusID,
		li.VendorID,
		COUNT(*) AS NumberOfInvoices
	FROM dbo.LineItem li
	GROUP BY
		li.LineItemDate,
		li.BusID,
		li.VendorID
)
SELECT
	NumberOfInvoices,
	COUNT(*) AS NumberOfOccurrences
FROM records
GROUP BY
	NumberOfInvoices
ORDER BY
	NumberOfInvoices;

And my results:

NumberOfInvoices    NumberOfOccurrences
1                   37224
2                   118
3                   1

This says that 99.4% of the time, an invoice is unique (note that you need to double-count the 118 pairs and triple-count the 1 triple to get the right denominator). That’s a really good sign that this is an appropriate grain for determining what normally is unique.
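To double-check that arithmetic in R:

# 37,224 singleton grains, 118 doubles, and 1 triple
totalInvoices <- 37224 + (118 * 2) + (1 * 3)   # 37,463 line items
37224 / totalInvoices                          # ~0.9936, or 99.4% unique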

By the way, how did I come up with that combination of columns? Well, I probably got an idea from the business side. With a bit of logic and a sane data model, you can get to a place where you need only a little bit of experimentation to figure out what “ought” to be unique.

So my next question is, who’s sending these duplicate invoices? Are all of our vendors doing it equally or do we have one or two dupe-senders? Well, let’s find out:

WITH records AS
(
	SELECT
		li.LineItemDate,
		li.BusID,
		li.VendorID,
		COUNT(*) AS NumberOfInvoices
	FROM dbo.LineItem li
	GROUP BY
		li.LineItemDate,
		li.BusID,
		li.VendorID
)
SELECT
	VendorID,
	COUNT(*) AS NumberOfOccurrences
FROM records
WHERE
	NumberOfInvoices > 1
GROUP BY
	VendorID
ORDER BY
	VendorID;

Once more, here are our results:

VendorID    NumberOfOccurrences
2           11
5           59
6           1
7           2
8           4
9           16
10          4
11          1
12          6
13          6
14          9

At first glance, there’s a fair spread in vendors sending dupes. Most of our vendors have sent us duplicates at least once during the seven years of data we have. Sorting this by number of occurrences descending, we can see that vendor 5 has more than 3x as many dupes as the next-highest. That’s a little weird and it’s something we should at least keep in the back of our minds, but by itself, it’s not proof of anything untoward.
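Incidentally, the same rollup is a few lines of dplyr if you’d rather stay in R, using the lineItems data frame we load just below:

library(dplyr)

# Count invoices per (date, bus, vendor) grain, keep the duplicated grains,
# then roll those up by vendor. Same logic as the SQL above.
lineItems %>%
    count(LineItemDate, BusID, VendorID, name = "NumberOfInvoices") %>%
    filter(NumberOfInvoices > 1) %>%
    count(VendorID, sort = TRUE, name = "NumberOfOccurrences")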

Now let’s look at cardinality. Cardinality is the number of distinct values in a column, and we can compute it quite easily in R using the rapply() function, which applies a function to each column in a data frame. I have loaded a data frame called lineItems, which is just a simple SELECT * FROM dbo.LineItem; query pulled in via the DBI package.
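If you’re playing along at home, that load looks something like the following sketch. The driver, server, and database names here are placeholders rather than anything from the original environment:

library(DBI)

# Hypothetical connection details; substitute your own SQL Server instance.
conn <- dbConnect(
    odbc::odbc(),
    Driver = "ODBC Driver 17 for SQL Server",
    Server = "localhost",
    Database = "ForensicAccounting",
    Trusted_Connection = "yes"
)
lineItems <- dbGetQuery(conn, "SELECT * FROM dbo.LineItem;")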

rapply(lineItems, function(x) { length(unique(x)) })

That gives us the following results:

Column               Values
LineItemID           37,463
BusID                664
VendorID             15
ExpenseCategoryID    28
EmployeeID           12
LineItemDate         1837
Amount               23,585

This tells us that we have 37,463 unique line item IDs out of 37,463 rows. LineItemID is our primary key, so it needs to be unique.

By contrast, we see 664 bus IDs, meaning that 664 buses have had service done during their lifespans. There are 700 buses in our data set, so that means 36 buses never had any maintenance done on them before retirement.
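As a quick sanity check on that 36, an anti-join surfaces buses with no maintenance history at all. This assumes the bus table is dbo.Bus, which is my guess rather than something confirmed above:

# Hypothetical: assumes a dbo.Bus table with one row per bus.
buses <- dbGetQuery(conn, "SELECT BusID FROM dbo.Bus;")

# Buses which never appear in LineItem were never serviced.
neverServiced <- dplyr::anti_join(buses, lineItems, by = "BusID")
nrow(neverServiced)   # expect 36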

We can also look at interesting subsets of the data and see how that behavior differs from the broader set. For example, we know that you need two employees to approve an expenditure of more than $1000. So let’s look at high-value invoices which are under the two-signer cutoff, maybe $850 to $999.99.

highValueInvoices <- lineItems %>% dplyr::filter(Amount >= 850 & Amount < 1000)
rapply(highValueInvoices, function(x) { length(unique(x)) })

Column               Values
LineItemID           996
BusID                452
VendorID             12
ExpenseCategoryID    9
EmployeeID           12
LineItemDate         525
Amount               576

12 of our 15 vendors have sent in invoices in this price range. Let’s use data.table’s setDT() function, which converts our data frame to a data.table; from there, the [, .N, keyby = VendorID] syntax counts rows by VendorID. This is equivalent to writing a SQL statement which selects COUNT(*) grouped by VendorID.

data.table::setDT(highValueInvoices)[, .N, keyby=VendorID]

VendorID    N
1           75
2           22
5           525
6           48
7           46
8           42
9           12
11          72
12          10
13          11
14          88
15          45
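For comparison, the dplyr spelling of that same rollup is a one-liner:

# Number of high-value invoices per vendor, matching the data.table call above.
highValueInvoices %>% count(VendorID, name = "N")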

Hmm, vendor 5 shows up again. Maybe they sell expensive equipment. Let’s look at invoices over $1000.

data.table::setDT(filter(lineItems, Amount > 1000))[, .N, keyby=VendorID]

VendorID    N
1           171
2           7
6           37
7           105
8           101
9           6
11          161
12          7
13          11
14          157
15          46

Hmm, no vendor 5 at all? That’s starting to get weird. Let’s look at amounts over $995 for this vendor.

data.table::setDT(filter(lineItems, VendorID == 5 & Amount > 995))[, .N, keyby=Amount]

Amount    N
996.06    1
997.25    1
997.43    1
999.29    1
999.99    411

411 invoices of exactly one penny under the cutoff? That’s…weird.

How about employees? Who’s working with this vendor on these items?

data.table::setDT(filter(lineItems, VendorID == 5 & Amount > 995))[, .N, keyby=EmployeeID]

EmployeeID    N
4             80
8             104
10            123
12            108

Four employees handled those invoices. Maybe those four are the only four who work with the vendor? I’ve seen places which do that, splitting out which accountant works with which vendor.

data.table::setDT(filter(lineItems, VendorID == 5 & lubridate::year(LineItemDate) == 2018))[, .N, keyby=EmployeeID]

EmployeeID    N
1             24
2             22
3             21
4             610
5             25
6             21
7             26
8             666
9             19
10            667
11            28
12            631

Nope. All 12 have worked with this vendor. But the four in question each have 30x as many invoices as the rest. Let’s dig deeper and see if we can unravel this mystery.

Dimensional Analysis

When I use the term “dimensional analysis,” I don’t necessarily mean true Kimball-style dimensions. Instead, think of it as an analysis of data grouped by some explanatory variable (or variables). For example, we can have a breakdown by time. Our data will come from the following SQL query:

SELECT
	li.LineItemID,
	li.BusID,
	li.VendorID,
	v.VendorName,
	li.ExpenseCategoryID,
	ec.ExpenseCategory,
	li.EmployeeID,
	e.FirstName,
	e.LastName,
	CONCAT(e.FirstName, ' ', e.LastName) AS EmployeeName,
	li.LineItemDate,
	c.CalendarMonth,
	c.MonthName,
	c.CalendarYear,
	c.FirstDayOfMonth,
	li.Amount
FROM dbo.LineItem li
	INNER JOIN dbo.Employee e
		ON li.EmployeeID = e.EmployeeID
	INNER JOIN dbo.ExpenseCategory ec
		ON li.ExpenseCategoryID = ec.ExpenseCategoryID
	INNER JOIN dbo.Vendor v
		ON li.VendorID = v.VendorID
	INNER JOIN dbo.Calendar c
		ON li.LineItemDate = c.Date;
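As before, I pull the result into R via DBI. A sketch, where lineItemQuery is my own name for a string holding the query above and conn is the connection from earlier:

# Re-pull lineItems, now with the joined vendor, employee, and calendar columns.
lineItems <- dbGetQuery(conn, lineItemQuery)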

Here’s an example of the number of invoices per month grouped by year and shown as a box plot:

This is normal growth, Jim. Normal growth.

And here’s the R code to generate that plot:

itemsByMonthAndYear <- lineItems %>% group_by(FirstDayOfMonth, CalendarYear, CalendarMonth) %>% summarize(n = n())
ggplot(data = itemsByMonthAndYear, aes(x = CalendarYear, y = n, group = CalendarYear)) +
    geom_boxplot() +
    theme_minimal() +
    labs(x = NULL, y = NULL, title = "Box Plot of Number of Invoices by Month and Year")

Something that strikes me as odd about 2018 is that its low points sit right where I would expect its medium-to-high points to be, given 2011 through 2017. Let me look at each month as a scatter plot to see this more clearly.

ggplot(data = itemsByMonthAndYear, aes(x = FirstDayOfMonth, y = n)) +
    geom_point() +
    theme_minimal() +
    labs(x = "Month", y = "Number of Invoices") +
    theme(axis.text.x = element_blank())
Pictured: the Loch Ness Monster in its natural habitat, the X-Y plane.

Even in 2018, we have several months which fit the trend and a few which don’t. Let’s turn on our CSI Enhancer ™ and look specifically at 2018.

ggplot(data = filter(itemsByMonthAndYear, CalendarYear == 2018), aes(x = FirstDayOfMonth, y = n)) +
    geom_point() +
    scale_x_date() +
    theme_minimal() +
    labs(x = "Month", y = "Number of Invoices")
Just a normal trend, yep.

Something happened starting around March and tapering off around September. Maybe there was something big going on that year which required more vehicle maintenance?
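One quick way to pin down the window is to print 2018’s monthly counts directly:

# Monthly invoice counts for 2018, in calendar order.
itemsByMonthAndYear %>%
    ungroup() %>%
    filter(CalendarYear == 2018) %>%
    arrange(CalendarMonth) %>%
    select(CalendarMonth, n)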

We’re not sure yet exactly what’s going on but we’re getting there.

Pivot to Expense Category

Here’s a box plot of monthly expenses by expense category. First the code, and then the image:

expenseCategoriesByMonth <- lineItems %>%
    group_by(ExpenseCategory, FirstDayOfMonth, CalendarMonth, CalendarYear) %>%
    summarize(n = n(), total = sum(Amount), avg = mean(Amount))

ggplot(data = expenseCategoriesByMonth, aes(x = ExpenseCategory, y = total)) +
    geom_boxplot() +
    scale_y_continuous(trans = "log10", labels = scales::dollar) +
    coord_flip() +
    theme_minimal()
I look at this and think of a bunch of resistors.

Delving into the category with the biggest positive outliers (Glass, Windshields & Windows), I found out that the only vendor selling here is vendor #5, Glass and Sons Glass and Accessories. Let’s take a closer look at Mr. Glass and his sons.

Glass and Sons

I’m going to look at the 4 expense categories in which Glass and Sons specializes. This covers the years 2011-2017 and shows the per-line-item expense distribution for each vendor competing in a category.

options(repr.plot.width=8, repr.plot.height=6)
ggplot(data = filter(lineItems, ExpenseCategoryID %in% c(4, 8, 18, 19) & CalendarYear < 2018),
       aes(x = VendorName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal() +
    facet_wrap(facets = ~ExpenseCategory, ncol = 2)
Apparently, this county government lives in a space with perfect competition.

Glass and Sons is right in line with the other vendors for each category throughout the seven-year period. One thing I want to point out is that I added varwidth = TRUE to geom_boxplot(), which scales the width of each box with the number of invoices it represents. We can see that there’s no real impetus to choose one vendor over another during this time stretch.

So how about 2018?

options(repr.plot.width=8, repr.plot.height=6)
ggplot(data = filter(lineItems, ExpenseCategoryID %in% c(4, 8, 18, 19) & CalendarYear == 2018),
       aes(x = VendorName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal() +
    facet_wrap(facets = ~ExpenseCategory, ncol = 2)
Demand curves slope upward?

In 2018, Glass and Sons became both a lot more expensive and a lot more popular than its competition. I may not be a big-city lawyer (or any kind of lawyer) but this smells fishy. At this point, I have concerns with this data: if it’s accurate, we might be looking at fraudulent behavior. Before I start leveling accusations, though, I want to dig a bit further. For there to be fraud, I’d expect at least one person on the inside to be working with Glass and Sons, and I don’t want to contact anyone on the other side because I can’t be positive who’s involved.

Analyzing Employees

I’m suspicious but don’t have enough yet to go to anyone. Let’s keep digging, this time looking at Glass and Sons invoices per county employee. First, prior to 2018:

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = filter(lineItems, VendorID == 5 & CalendarYear < 2018),
       aes(x = EmployeeName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal()
Some spread, nothing outlandish.

Now for 2018:

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = filter(lineItems, VendorID == 5 & CalendarYear == 2018),
       aes(x = EmployeeName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal()
Those 4 people from before? They showed up again.

Let’s go a step further and look at our big months versus our small months. Maybe these four were on some kind of special county assignment? First let’s look at the non-weird months:

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = filter(lineItems, VendorID == 5 & CalendarYear == 2018 & CalendarMonth %in% c(1, 2, 10, 11, 12)),
       aes(x = EmployeeName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal()
Still a spread. Not a ridiculous spread, but a spread.

And now the weird months:

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = filter(lineItems, VendorID == 5 & CalendarYear == 2018 & CalendarMonth %in% c(3, 4, 5, 6, 7, 8, 9)),
       aes(x = EmployeeName, y = Amount)) +
    geom_boxplot(varwidth = TRUE) +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip() +
    labs(y = NULL, x = NULL) +
    theme_minimal()
Special assignment. The specialest.

These are my four persons of interest. I think there’s something going on and I have enough evidence to go to my superiors and get a real investigation going.

Clustering Behavior

The last bit I want to look at today is clustering behavior. The idea here is that we might see clusters around certain points due to explicit human behavior. For example, suppose we have a corporate policy where any meal over $20 needs a receipt but any meal under $20 doesn’t. Many (most?) people will report accurately, but there is an incentive for people to report meals of $19.95 or $19.99 even if they only spent $3.00 at the grocery store that morning.

In our example, I know that $1000 is our cutoff point, so I’d expect some clustering just under $1K. I can use a histogram to see this in action.

options(repr.plot.width=6, repr.plot.height=4)
ggplot(data = lineItems, aes(x = Amount)) +
    geom_histogram(bins = 100) +
    scale_x_continuous(labels = scales::dollar) +
    labs(y = NULL, x = NULL) +
    theme_minimal()
The “asymptotic curve hits a wall” graph.

We can see a fairly smooth curve regularly dropping…until we get to just under $1000. One last cohort-style comparison. First, everybody but Glass and Sons:
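The post includes only the images for these last two charts, so here is a sketch of code which would produce them, following the same histogram pattern as above:

ggplot(data = filter(lineItems, VendorID != 5), aes(x = Amount)) +
    geom_histogram(bins = 100) +
    scale_x_continuous(labels = scales::dollar) +
    labs(y = NULL, x = NULL) +
    theme_minimal()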

A smooth descent.

And here’s Glass and Sons:
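Again a sketch, this time restricted to Glass and Sons:

ggplot(data = filter(lineItems, VendorID == 5), aes(x = Amount)) +
    geom_histogram(bins = 100) +
    scale_x_continuous(labels = scales::dollar) +
    labs(y = NULL, x = NULL) +
    theme_minimal()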

This is normal, right?

The multiple peaks aren’t outlandish. The giant wall just under the cutoff point for two employee signatures? Yeah, that’s suspicious.

Conclusion

In today’s post, we looked at different ways to slice and dice data sets, drilling down far enough to find reasonable evidence of fraud involving one vendor and four employees. Again, I want to stress that there’s no absolute proof here, but there’s enough to sic the real accountants on: they can figure out whether all of those invoices match up with actual parts, whether those parts made it into inventory, and the other pieces of evidence which would prove fraud.

In the next post, we’ll talk a little more about specific ways to handle time-series data.
