Forensic Accounting: Generating Sketchy (But Not Too Sketchy) Data

This is part two in a series on applying forensic accounting techniques to SQL and R.

Start with a Data Model

My plan here is to create a model which is a simple version of the Wake County transportation fraud case. To do so, I’ve created the simple data model which we see below:

The forensic accounting database diagram courtesy of SQLdbm.

As a quick note, I include a Calendar table to make it easier for me to generate data. I’ll also use it for some queries later on in the series.

Here’s the gist: we have employees at our organization. We own and operate buses. Buses come into service on a particular date and they eventually get retired. During their operational window, we need to perform maintenance on the buses and buy new parts. We purchase these parts from vendors and the parts fit into expense categories. Each vendor sells products or services from at least one category and we can only buy parts where we have established a link between vendor and category.

All of these are simple tables and way easier than reality—for example, we have categories of products but no specific products, no variants, etc. If you came to this series looking for a complete data model covering vehicle transportation and maintenance, you’ll be sorely disappointed.

Add Data and Stir

The next thing we need to do is fill in this data. I’m not going to include every static row here, but I will mention how I load each table.

Static Tables

  • dbo.Employee: I picked 12 names and added them to the table.
  • dbo.ExpenseCategory: I have 28 categories. 3 of them I made up myself and the other 25 I got from a website which sells bus parts and supplies. #Research
  • dbo.Vendor: I created 15 phony vendors and added them to the table.
  • dbo.VendorExpenseCategory: For each of the 15 vendors, I picked expense categories which appeared to fit. For example, a vendor named Bus Repair Shack (a classy organization, I’m sure) performs vehicle repair and vehicle maintenance but doesn’t sell any parts directly.

That leaves just two tables, but they’re doozies.

Bus

We’ll start with the bus table. Our agency started at the beginning of 2011 with an allotment of 300 decrepit buses:

INSERT INTO dbo.Bus
(
	BusID,
	DateFirstInService,
	DateRetired
)
SELECT TOP (300)
	ROW_NUMBER() OVER (ORDER BY NEWID()) AS BusID,
	'1990-01-01' AS DateFirstInService,
	NULL AS DateRetired
FROM dbo.Calendar;

We add 50 shiny new buses each year:

INSERT INTO dbo.Bus
(
	BusID,
	DateFirstInService,
	DateRetired
)
SELECT
	300 + ROW_NUMBER() OVER (ORDER BY c.Date, ao.object_id),
	c.Date,
	NULL AS DateRetired
FROM dbo.Calendar c
	CROSS JOIN ( SELECT TOP(50) object_id FROM sys.all_objects ) ao
WHERE
	c.CalendarDayOfYear = 1
	AND c.Date < '2019-01-01';

Then we retire approximately 36 beat-up hunks of junk on the first day of each year:

WITH candidates AS
(
	SELECT
		ROW_NUMBER() OVER (PARTITION BY c.Date ORDER BY NEWID()) AS rownum,
		c.Date,
		b.BusID,
		b.DateFirstInService,
		b.DateRetired
	FROM dbo.Calendar c
		INNER JOIN dbo.Bus b
			ON c.Date > b.DateFirstInService
	WHERE
		c.CalendarDayOfYear = 1
		AND c.Date < '2019-01-01'
)
UPDATE b
SET
	DateRetired = c.Date
FROM candidates c
	INNER JOIN dbo.Bus b
		ON c.BusID = b.BusID
WHERE
	c.rownum <= 36;

This means that, on net each year, we add somewhere around 14 new buses.

Line Items — Legit Data

This is where stuff gets crazy. First, I created a table named #ValuePerCategory, which has the mean price and the price standard deviation for each expense category. To get this information, I trawled through the catalog and picked reasonable-enough values for each of the categories. This is my level of commitment to getting things right(ish). The standard deviations, though, I just made up. I didn’t look at huge numbers of products and calculate these values myself. That’s the limit of my commitment to excellence and why I don’t have a giant banner on my stadium.

With this temp table in place, I created a new temporary table for line items and loaded it with a heap of data:

DECLARE
	@InclusionThreshold DECIMAL(7,6) = 0.00067,
	@Precision INT = 2;

INSERT INTO #LineItem
(
	BusID,
	VendorID,
	ExpenseCategoryID,
	EmployeeID,
	LineItemDate,
	Amount
)
SELECT
	b.BusID,
	vec.VendorID,
	vec.ExpenseCategoryID,
	r.EmployeeID,
	c.Date,
	ABS(s.Amount) AS Amount
FROM dbo.Calendar c
	CROSS JOIN dbo.Bus b
	CROSS JOIN dbo.VendorExpenseCategory vec
	INNER JOIN #ValuePerCategory vpc
		ON vec.ExpenseCategoryID = vpc.ExpenseCategoryID
	CROSS APPLY (
		SELECT
			RAND(CHECKSUM(NEWID())) AS rand1,
			RAND(CHECKSUM(NEWID())) AS rand2,
			FLOOR(RAND(CHECKSUM(NEWID())) * (12 - 1 + 1)) + 1 AS EmployeeID,
			RAND(CHECKSUM(NEWID())) AS InclusionThreshold
	) r
	CROSS APPLY (
		SELECT
			ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * vpc.StdDevPrice, @Precision) + vpc.MeanPrice AS Amount
	) s
WHERE
	c.Date >= '2011-01-01'
	AND c.Date < '2019-01-01'
	AND c.IsWeekend = 0
	AND c.HolidayName IS NULL
	AND b.DateFirstInService <= c.Date
	AND ISNULL(b.DateRetired, '9999-12-31') > c.Date
	AND r.InclusionThreshold < @InclusionThreshold;

Basically, we generate one row per combination of calendar day, bus, and vendor expense category. We then assume that prices follow a normal distribution and choose an amount randomly from a normal distribution defined by the mean price and price standard deviation for that expense category. We only make purchases on non-holiday weekdays (these are government employees after all) for buses which are in service. Finally, the inclusion threshold is a random number from 0-1 using RAND() to give us a uniform distribution. If the resulting random value is below our inclusion threshold, we include the value in our sample. I played around with numbers until I landed on 0.067% of rows. Even with that low a threshold, we still ended up with tens of thousands of entries, which is plenty for what we need.

The nice thing about following normal distributions and having a large number of rows is that there will be oddities even without introducing fraud. I want to make it clear that oddities are not in themselves proof of fraud. Assuming you pull randomly from a normal distribution, you can expect approximately 3 in 1000 rows to be greater than three standard deviations from the mean. Real-life distributions aren’t always normal, so even that rule of thumb has its limits. But to the extent that distributions aren’t normal, they tend to increase the number of seeming outliers, not decrease.

Line Items — Fraudulent Data

Now we want to introduce some fraud. Some people decided to collude with one of our vendors. They started their fraud on February 9th, 2018 and ended their fraud on October 14th, 2018. They made sure to limit purchases to no more than $999.99 because our cutoff for requiring two signatures is $1000. On each day, there is a very small chance of a fraudulent invoice per combination of calendar day, bus, and valid expense category: 0.0012%.

SET @InclusionThreshold = 0.000012;
INSERT INTO #LineItem
(
	BusID,
	VendorID,
	ExpenseCategoryID,
	EmployeeID,
	LineItemDate,
	Amount
)
SELECT
	b.BusID,
	vec.VendorID,
	vec.ExpenseCategoryID,
	CASE r.EmployeeID
		WHEN 1 THEN 4
		WHEN 2 THEN 8
		WHEN 3 THEN 10
		WHEN 4 THEN 12
		ELSE 4
	END AS EmployeeID,
	c.Date,
	CASE
		WHEN ABS(s.Amount) >= 1000 THEN 999.99
		ELSE ABS(s.Amount)
	END AS Amount
FROM dbo.Calendar c
	CROSS JOIN dbo.Bus b
	CROSS JOIN dbo.VendorExpenseCategory vec
	-- This is to simulate multiple entries
	CROSS JOIN dbo.Bus b_simmult
	INNER JOIN #ValuePerCategory vpc
		ON vec.ExpenseCategoryID = vpc.ExpenseCategoryID
	CROSS APPLY (
		SELECT
			RAND(CHECKSUM(NEWID())) AS rand1,
			RAND(CHECKSUM(NEWID())) AS rand2,
			FLOOR(RAND(CHECKSUM(NEWID())) * (4 - 1 + 1)) + 1 AS EmployeeID,
			RAND(CHECKSUM(NEWID())) AS InclusionThreshold
	) r
	CROSS APPLY (
		SELECT
			ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * (vpc.StdDevPrice * 2), @Precision) + (vpc.MeanPrice * 5) AS Amount
	) s
WHERE
	c.Date >= '2018-02-09'
	AND c.Date < '2018-10-14'
	AND c.IsWeekend = 0
	AND c.HolidayName IS NULL
	AND b.DateFirstInService <= c.Date
	AND ISNULL(b.DateRetired, '9999-12-31') > c.Date
	AND vec.VendorID = 5
	AND r.InclusionThreshold < @InclusionThreshold;

INSERT INTO dbo.LineItem 
(
	BusID,
	VendorID,
	ExpenseCategoryID,
	EmployeeID,
	LineItemDate,
	Amount
)
SELECT
	li.BusID,
	li.VendorID,
	li.ExpenseCategoryID,
	li.EmployeeID,
	li.LineItemDate,
	li.Amount
FROM #LineItem li
ORDER BY
	li.LineItemDate,
	li.BusID,
	li.VendorID,
	li.ExpenseCategoryID,
	li.Amount DESC;

Still, we do add in a pretty large number of invoices, enough to make some noise.

Line Item — Missing Invoices

Our late CFO (who got hit by a bus while listening to Alanis Morisette) kinda-sorta accidentally(?) deleted some line items from our database. Nobody knows what was in there but we can say the late CFO was wearing some very expensive suits for someone making county government money.

DELETE
FROM dbo.LineItem
WHERE LineItemID IN (6, 7, 14, 20, 199, 200, 201, 202, 339, 340, 341);

Conclusion

Before digging into the sundry techniques around forensic accounting, I wanted to make sure we had a good understanding of our data set, as I’ll use it consistently through the series. I did kind of give away the game with this post, but as we walk through techniques, we’ll be able to see if we find the problem or if our culprits end up hiding in the noise.