This is part six in a series on applying forensic accounting techniques to SQL and R.
What the Numbers Tell Us
In this last section of analysis, we’re going to look at three tells for finding potentially fraudulent data: round number analysis, last-digit analysis, and first-digit analysis with Benford’s Law.
Round Number Analysis
Round number analysis focuses on the final digits of a number, specifically looking at how many 0s there are at the end after rounding off the decimal. We’re interested in the percentage of records in each rank (or Type). For example, $10,820 is a Type 1 because there is a single 0 at the end. $8,300 is a Type 2 because there are two 0s at the end. $11,000 is a Type 3. And $9,009 is a Type 0 because there are no 0s at the end.
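As a quick illustration of the Type rules (in Python here, rather than the SQL we use for the real analysis below), a sketch that classifies an amount by counting trailing zeroes:

```python
def round_number_type(amount: float) -> int:
    """Return the round-number Type: the count of trailing zeroes
    after rounding the amount to the nearest whole number."""
    rounded = round(amount)
    if rounded == 0:
        return 0  # avoid looping forever on a zero amount
    zeroes = 0
    while rounded % 10 == 0:
        zeroes += 1
        rounded //= 10
    return zeroes

print(round_number_type(10820))  # 1 (Type 1)
print(round_number_type(8300))   # 2 (Type 2)
print(round_number_type(11000))  # 3 (Type 3)
print(round_number_type(9009))   # 0 (Type 0)
```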
With that in mind, how do we do this kind of calculation in SQL Server? With one of my favorite operators: APPLY. Here is a query to get the breakdown of ranks by vendor in our generated data set:
%%sql
WITH records AS
(
    SELECT
        v.VendorName,
        a.RoundedAmount
    FROM dbo.LineItem li
        INNER JOIN dbo.Vendor v
            ON li.VendorID = v.VendorID
        CROSS APPLY
        (
            SELECT ROUND(li.Amount, 0) AS RoundedAmount
        ) a
    WHERE
        a.RoundedAmount > 0
)
SELECT
    r.VendorName,
    SUM(t4.IsType4) AS Type4,
    SUM(t3.IsType3) AS Type3,
    SUM(t2.IsType2) AS Type2,
    SUM(t1.IsType1) AS Type1,
    SUM(t0.IsType0) AS Type0,
    COUNT(1) AS NumberOfInvoices,
    CAST(100.0 * SUM(t0.IsType0) / COUNT(1) AS DECIMAL(5,2)) AS PercentType0
FROM records r
    CROSS APPLY(SELECT CASE WHEN r.RoundedAmount % 10000 = 0 THEN 1 ELSE 0 END AS IsType4) t4
    CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND r.RoundedAmount % 1000 = 0 THEN 1 ELSE 0 END AS IsType3) t3
    CROSS APPLY(SELECT CASE WHEN t3.IsType3 = 0 AND r.RoundedAmount % 100 = 0 THEN 1 ELSE 0 END AS IsType2) t2
    CROSS APPLY(SELECT CASE WHEN t2.IsType2 = 0 AND r.RoundedAmount % 10 = 0 THEN 1 ELSE 0 END AS IsType1) t1
    CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND t2.IsType2 = 0 AND t1.IsType1 = 0 THEN 1 ELSE 0 END AS IsType0) t0
GROUP BY r.VendorName
ORDER BY PercentType0 DESC;
Let’s talk through this code. First, we have a common table expression, records. We get the vendor name and rounded amount of each invoice, using CROSS APPLY to “hide” the rounding calculation. In this case, the CROSS APPLY is probably overkill, but for more complex scenarios, it can make life easier by letting us see what the author intended in the SELECT clause rather than piecing through calculations.
SELECT
    v.VendorName,
    a.RoundedAmount
FROM dbo.LineItem li
    INNER JOIN dbo.Vendor v
        ON li.VendorID = v.VendorID
    CROSS APPLY
    (
        SELECT ROUND(li.Amount, 0) AS RoundedAmount
    ) a
WHERE
    a.RoundedAmount > 0
From there, we take our records CTE and check to see if there are any Type 4 (or higher) results:
FROM records r
    CROSS APPLY(SELECT CASE WHEN r.RoundedAmount % 10000 = 0 THEN 1 ELSE 0 END AS IsType4) t4
In other words, if RoundedAmount divides cleanly by 10,000, then we have a Type 4 or higher. We’ll call that boolean flag IsType4 and alias the derived table as t4.
Next, we want to chain another CROSS APPLY onto our result, this time checking whether a value is not a Type 4 but is still cleanly divisible by 1,000. If so, it is a Type 3.
FROM records r
    CROSS APPLY(SELECT CASE WHEN r.RoundedAmount % 10000 = 0 THEN 1 ELSE 0 END AS IsType4) t4
    CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND r.RoundedAmount % 1000 = 0 THEN 1 ELSE 0 END AS IsType3) t3
We continue to do the same type of operation for t2 and t1, and any result which is not Type 1 through Type 4 must therefore be Type 0.
The Value of Round Number Analysis
Okay, so why do we care about the number of round numbers? Because we might be able to find a result like the following:
|VendorName|Type4|Type3|Type2|Type1|Type0|NumberOfInvoices|PercentType0|
|---|---|---|---|---|---|---|---|
|Bus Repair Shack|0|1|10|74|899|983|91.45|
|The Longevity Crew|0|0|4|41|458|503|91.05|
|Clean Sweep Cleaning Supplies|0|0|10|242|2548|2800|91.00|
|Fuel Suppliers, Ltd|0|0|0|46|450|496|90.73|
|Doodads and Thingies, Inc.|0|1|22|293|3009|3324|90.52|
|VehiCo Parts and Accessories|0|0|23|296|3044|3363|90.51|
|Tony’s Fixer Upper Buses|0|0|25|335|3388|3748|90.39|
|Electronics and Repairs|0|0|18|168|1738|1924|90.33|
|Comfort Rider, GmbH|0|0|20|254|2538|2812|90.26|
|Mobility Accessories R Us|0|0|9|173|1685|1867|90.25|
|Fuel Associates, Unlimited|0|0|2|51|446|499|89.38|
|Tony’s Bus Fixer Uppers|0|0|17|145|1333|1495|89.16|
|Glass and Sons Glass and Accessories|0|411|15|767|3612|4394|82.20|
One of these vendors looks suspicious: the one with 411 Type 3 values. In a normal data set (where “normal” generally means “not a budget”), about 90% of your results will be Type 0, about 9% will be Type 1, about 0.9% will be Type 2, and so on.
In other words, in most cases, each successive Type should account for about one-tenth as many invoices as the Type before it.
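That rule of thumb is easy to sanity-check with a quick simulation. Here is a sketch in Python (rather than the SQL above) using synthetic normally distributed amounts as a stand-in for real invoice data:

```python
import random
from collections import Counter

# Assumption: normally distributed amounts, standing in for invoice data.
random.seed(42)
amounts = [round(random.normalvariate(5000, 1500)) for _ in range(100_000)]
amounts = [a for a in amounts if a > 0]  # ignore non-positive amounts

def round_number_type(n: int) -> int:
    """Count trailing zeroes: Type 0 through Type 4+."""
    zeroes = 0
    while n % 10 == 0:
        zeroes += 1
        n //= 10
    return zeroes

counts = Counter(round_number_type(a) for a in amounts)
for t in range(5):
    # Roughly 90% Type 0, 9% Type 1, 0.9% Type 2, and so on.
    print(f"Type {t}: {100.0 * counts[t] / len(amounts):6.2f}%")
```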
Last Digit Analysis
The rule of thumb for last digit analysis is that the last digit of each number in a set of numbers will likely follow a uniform distribution. In other words, there is no discernible pattern.
For example, here is a column chart of the distribution of the last digit of the population of towns and cities in North Carolina as of 2013:
You can see a pretty flat spread along the whole of the North Carolina population data set. So how about our fraudulent data?
This distribution is quite uniform as well. Incidentally, if you recall back to the second part of this series, we generated all of our “real” and “fraudulent” data using normal distributions. With normally distributed data, the last digit follows a uniform distribution.
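To see that concretely, here is a small Python sketch, again using synthetic normally distributed amounts as an assumption-laden stand-in for the generated data, tallying their last digits:

```python
import random
from collections import Counter

# Assumption: normally distributed amounts, echoing how the series
# generated its "real" and "fraudulent" data.
random.seed(7)
amounts = [round(random.normalvariate(5000, 1500)) for _ in range(100_000)]

# Each digit 0-9 should land near 10% of the total.
last_digits = Counter(a % 10 for a in amounts)
for d in range(10):
    print(f"Last digit {d}: {100.0 * last_digits[d] / len(amounts):5.2f}%")
```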
When Does This Not Hold?
There are a few situations where you can expect the last digit not to be uniform:
- Budgets are heavily skewed toward 0. This makes sense because budgets are estimates.
- Approximations in measurements tend to be biased toward 0 and 5.
- People fudging data tend to bias away from 0 and 5 because they’re not “random enough.”
Now, there has also been some research arguing that uniformity is less frequent than we might think, but I haven’t read that article.
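If you want something more rigorous than eyeballing a column chart, a chi-squared goodness-of-fit test can quantify how far a set of last digits strays from uniformity. Here is a stdlib-only Python sketch; the “fudged” digit weights are purely hypothetical, meant to mimic someone avoiding 0 and 5:

```python
import random

def chi_square_uniform(digits):
    """Chi-squared goodness-of-fit statistic for last digits
    against a uniform distribution over 0-9."""
    n = len(digits)
    expected = n / 10.0
    counts = [0] * 10
    for d in digits:
        counts[d] += 1
    return sum((c - expected) ** 2 / expected for c in counts)

# A perfectly flat set of last digits for comparison...
uniform_digits = list(range(10)) * 500

# ...versus hypothetical "fudged" digits from someone avoiding 0 and 5.
random.seed(123)
weights = [0.02, 0.13, 0.13, 0.12, 0.12, 0.02, 0.12, 0.12, 0.12, 0.10]
fudged_digits = random.choices(range(10), weights=weights, k=5000)

CRITICAL_VALUE = 16.92  # chi-squared critical value, df = 9, alpha = 0.05
print(chi_square_uniform(uniform_digits))  # 0.0: no deviation at all
print(chi_square_uniform(fudged_digits))   # far above the critical value
```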
First Digit Analysis with Benford’s Law
I have a longer post on Benford’s Law already, so I’d recommend reading that.
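As a taste of what first-digit analysis looks like, here is a short Python sketch comparing observed first digits against the Benford expectation of log10(1 + 1/d). The multiplicatively growing series is a hypothetical stand-in for real amounts, since data spanning several orders of magnitude tends to fit Benford’s Law well:

```python
import math
from collections import Counter

# Benford's Law: P(first digit = d) = log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Hypothetical data spanning several orders of magnitude; multiplicative
# growth like this tends to follow Benford's Law closely.
amounts = [1.05 ** i for i in range(1, 300)]
first_digits = Counter(int(str(a)[0]) for a in amounts)

n = len(amounts)
for d in range(1, 10):
    print(f"{d}: observed {100 * first_digits[d] / n:5.2f}%, "
          f"Benford {100 * benford[d]:5.2f}%")
```

The key pattern to look for is the steep drop-off: 1 should lead about 30% of values, while 9 should lead under 5%.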
If you’re curious about our generated data, here’s what it looks like, including with the fraudulent data:
And here’s what it looks like for just 2018, the year of fraudulent data:
There’s a bit of a difference to be sure, and that might spur on additional analysis.
In today’s post, I wanted to look at a few analytical techniques around distributions of numbers. They are useful in fraud detection mostly because people are terrible random number generators, so these techniques can help us spot potential bias. They won’t typically prove bias, but they can be a good first step.