This is part two in a series on applying forensic accounting techniques to SQL and R.
Start with a Data Model
My plan here is to create a model which is a simple version of the Wake County transportation fraud case. To do so, I’ve created the simple data model which we see below:
As a quick note, I include a Calendar table to make it easier for me to generate data. I’ll also use it for some queries later on in the series.
Here’s the gist: we have employees at our organization. We own and operate buses. Buses come into service on a particular date and they eventually get retired. During their operational window, we need to perform maintenance on the buses and buy new parts. We purchase these parts from vendors and the parts fit into expense categories. Each vendor sells products or services from at least one category and we can only buy parts where we have established a link between vendor and category.
All of these are simple tables and way easier than reality—for example, we have categories of products but no specific products, no variants, etc. If you came to this series looking for a complete data model covering vehicle transportation and maintenance, you’ll be sorely disappointed.
Add Data and Stir
The next thing we need to do is fill in this data. I’m not going to include every static row here, but I will mention how I load each table.
Static Tables
- dbo.Employee: I picked 12 names and added them to the table.
- dbo.ExpenseCategory: I have 28 categories. 3 of them I made up myself and the other 25 I got from a website which sells bus parts and supplies. #Research
- dbo.Vendor: I created 15 phony vendors and added them to the table.
- dbo.VendorExpenseCategory: For each of the 15 vendors, I picked expense categories which appeared to fit. For example, a vendor named Bus Repair Shack (a classy organization, I’m sure) performs vehicle repair and vehicle maintenance but doesn’t sell any parts directly.
That leaves just two tables, but they’re doozies.
Bus
We’ll start with the bus table. Our agency started at the beginning of 2011 with an allotment of 300 decrepit buses:
INSERT INTO dbo.Bus
(
BusID,
DateFirstInService,
DateRetired
)
SELECT TOP (300)
ROW_NUMBER() OVER (ORDER BY NEWID()) AS BusID,
'1990-01-01' AS DateFirstInService,
NULL AS DateRetired
FROM dbo.Calendar;
We add 50 shiny new buses each year:
INSERT INTO dbo.Bus
(
BusID,
DateFirstInService,
DateRetired
)
SELECT
300 + ROW_NUMBER() OVER (ORDER BY c.Date, ao.object_id),
c.Date,
NULL AS DateRetired
FROM dbo.Calendar c
CROSS JOIN ( SELECT TOP(50) object_id FROM sys.all_objects ) ao
WHERE
c.CalendarDayOfYear = 1
AND c.Date < '2019-01-01';
Then we retire approximately 36 beat-up hunks of junk on the first day of each year:
WITH candidates AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY c.Date ORDER BY NEWID()) AS rownum,
c.Date,
b.BusID,
b.DateFirstInService,
b.DateRetired
FROM dbo.Calendar c
INNER JOIN dbo.Bus b
ON c.Date > b.DateFirstInService
WHERE
c.CalendarDayOfYear = 1
AND c.Date < '2019-01-01'
)
UPDATE b
SET
DateRetired = c.Date
FROM candidates c
INNER JOIN dbo.Bus b
ON c.BusID = b.BusID
WHERE
c.rownum <= 36;
This means that, on net each year, we add somewhere around 14 new buses.
Line Items — Legit Data
This is where stuff gets crazy. First, I created a table named #ValuePerCategory
, which has the mean price and the price standard deviation for each expense category. To get this information, I trawled through the catalog and picked reasonable-enough values for each of the categories. This is my level of commitment to getting things right(ish). The standard deviations, though, I just made up. I didn’t look at huge numbers of products and calculate these values myself. That’s the limit of my commitment to excellence and why I don’t have a giant banner on my stadium.
With this temp table in place, I created a new temporary table for line items and loaded it with a heap of data:
DECLARE
@InclusionThreshold DECIMAL(7,6) = 0.00067,
@Precision INT = 2;
INSERT INTO #LineItem
(
BusID,
VendorID,
ExpenseCategoryID,
EmployeeID,
LineItemDate,
Amount
)
SELECT
b.BusID,
vec.VendorID,
vec.ExpenseCategoryID,
r.EmployeeID,
c.Date,
ABS(s.Amount) AS Amount
FROM dbo.Calendar c
CROSS JOIN dbo.Bus b
CROSS JOIN dbo.VendorExpenseCategory vec
INNER JOIN #ValuePerCategory vpc
ON vec.ExpenseCategoryID = vpc.ExpenseCategoryID
CROSS APPLY (
SELECT
RAND(CHECKSUM(NEWID())) AS rand1,
RAND(CHECKSUM(NEWID())) AS rand2,
FLOOR(RAND(CHECKSUM(NEWID())) * (12 - 1 + 1)) + 1 AS EmployeeID,
RAND(CHECKSUM(NEWID())) AS InclusionThreshold
) r
CROSS APPLY (
SELECT
ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * vpc.StdDevPrice, @Precision) + vpc.MeanPrice AS Amount
) s
WHERE
c.Date >= '2011-01-01'
AND c.Date < '2019-01-01'
AND c.IsWeekend = 0
AND c.HolidayName IS NULL
AND b.DateFirstInService <= c.Date
AND ISNULL(b.DateRetired, '9999-12-31') > c.Date
AND r.InclusionThreshold < @InclusionThreshold;
Basically, we generate one row per combination of calendar day, bus, and vendor expense category. We then assume that prices follow a normal distribution and choose an amount randomly from a normal distribution defined by the mean price and price standard deviation for that expense category. We only make purchases on non-holiday weekdays (these are government employees after all) for buses which are in service. Finally, the inclusion threshold is a random number from 0-1 using RAND()
to give us a uniform distribution. If the resulting random value is below our inclusion threshold, we include the value in our sample. I played around with numbers until I landed on 0.067% of rows. Even with that low a threshold, we still ended up with tens of thousands of entries, which is plenty for what we need.
The nice thing about following normal distributions and having a large number of rows is that there will be oddities even without introducing fraud. I want to make it clear that oddities are not in themselves proof of fraud. Assuming you pull randomly from a normal distribution, you can expect approximately 3 in 1000 rows to be greater than three standard deviations from the mean. Real-life distributions aren’t always normal, so even that rule of thumb has its limits. But to the extent that distributions aren’t normal, they tend to increase the number of seeming outliers, not decrease.
Line Items — Fraudulent Data
Now we want to introduce some fraud. Some people decided to collude with one of our vendors. They started their fraud on February 9th, 2018 and ended their fraud on October 14th, 2018. They made sure to limit purchases to no more than $999.99 because our cutoff for requiring two signatures is $1000. On each day, there is a very small chance of a fraudulent invoice per combination of calendar day, bus, and valid expense category: 0.0012%.
SET @InclusionThreshold = 0.000012;
INSERT INTO #LineItem
(
BusID,
VendorID,
ExpenseCategoryID,
EmployeeID,
LineItemDate,
Amount
)
SELECT
b.BusID,
vec.VendorID,
vec.ExpenseCategoryID,
CASE r.EmployeeID
WHEN 1 THEN 4
WHEN 2 THEN 8
WHEN 3 THEN 10
WHEN 4 THEN 12
ELSE 4
END AS EmployeeID,
c.Date,
CASE
WHEN ABS(s.Amount) >= 1000 THEN 999.99
ELSE ABS(s.Amount)
END AS Amount
FROM dbo.Calendar c
CROSS JOIN dbo.Bus b
CROSS JOIN dbo.VendorExpenseCategory vec
-- This is to simulate multiple entries
CROSS JOIN dbo.Bus b_simmult
INNER JOIN #ValuePerCategory vpc
ON vec.ExpenseCategoryID = vpc.ExpenseCategoryID
CROSS APPLY (
SELECT
RAND(CHECKSUM(NEWID())) AS rand1,
RAND(CHECKSUM(NEWID())) AS rand2,
FLOOR(RAND(CHECKSUM(NEWID())) * (4 - 1 + 1)) + 1 AS EmployeeID,
RAND(CHECKSUM(NEWID())) AS InclusionThreshold
) r
CROSS APPLY (
SELECT
ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * (vpc.StdDevPrice * 2), @Precision) + (vpc.MeanPrice * 5) AS Amount
) s
WHERE
c.Date >= '2018-02-09'
AND c.Date < '2018-10-14'
AND c.IsWeekend = 0
AND c.HolidayName IS NULL
AND b.DateFirstInService <= c.Date
AND ISNULL(b.DateRetired, '9999-12-31') > c.Date
AND vec.VendorID = 5
AND r.InclusionThreshold < @InclusionThreshold;
INSERT INTO dbo.LineItem
(
BusID,
VendorID,
ExpenseCategoryID,
EmployeeID,
LineItemDate,
Amount
)
SELECT
li.BusID,
li.VendorID,
li.ExpenseCategoryID,
li.EmployeeID,
li.LineItemDate,
li.Amount
FROM #LineItem li
ORDER BY
li.LineItemDate,
li.BusID,
li.VendorID,
li.ExpenseCategoryID,
li.Amount DESC;
Still, we do add in a pretty large number of invoices, enough to make some noise.
Line Item — Missing Invoices
Our late CFO (who got hit by a bus while listening to Alanis Morisette) kinda-sorta accidentally(?) deleted some line items from our database. Nobody knows what was in there but we can say the late CFO was wearing some very expensive suits for someone making county government money.
DELETE
FROM dbo.LineItem
WHERE LineItemID IN (6, 7, 14, 20, 199, 200, 201, 202, 339, 340, 341);
Conclusion
Before digging into the sundry techniques around forensic accounting, I wanted to make sure we had a good understanding of our data set, as I’ll use it consistently through the series. I did kind of give away the game with this post, but as we walk through techniques, we’ll be able to see if we find the problem or if our culprits end up hiding in the noise.
2 thoughts on “Forensic Accounting: Generating Sketchy (But Not Too Sketchy) Data”