Forensic Accounting: Numeral Analysis

This is part six in a series on applying forensic accounting techniques to SQL and R.

What the Numbers Tell Us

In this last section of analysis, we’re going to look at three tells for finding potentially fraudulent data: round number analysis, last-digit analysis, and first-digit analysis with Benford’s Law.

Round Number Analysis

Round number analysis focuses on the final digits of a number, specifically looking at how many 0s there are at the end after rounding off the decimal. We’re interested in the percentage of records in each rank (or Type). For example, $10,820 is a Type 1 because there is a single 0 at the end. $8,300 is a Type 2 because there are two 0s at the end. $11,000 is a Type 3. And $9,009 is a Type 0 because there are no 0s at the end.
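In code terms, the Type is just the number of trailing zeros in the rounded amount, capped at some maximum. Here is a minimal Python sketch of the idea (an illustration only, not the series’ actual implementation, which is the SQL below):

```python
def round_number_type(amount, max_type=4):
    """Count trailing zeros of an amount after rounding to a whole number.

    Examples from the text: $10,820 -> Type 1, $8,300 -> Type 2,
    $11,000 -> Type 3, $9,009 -> Type 0.
    """
    n = round(amount)
    if n <= 0:
        raise ValueError("expected a positive amount")
    zeros = 0
    while zeros < max_type and n % 10 == 0:
        n //= 10
        zeros += 1
    return zeros
```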

With that in mind, how do we do this kind of calculation in SQL Server? With one of my favorite operators: APPLY. Here is a query to get the breakdown of ranks by vendor in our generated data set:

WITH records AS
(
	SELECT
		v.VendorName,
		a.RoundedAmount
	FROM dbo.LineItem li
		INNER JOIN dbo.Vendor v
			ON li.VendorID = v.VendorID
		CROSS APPLY
		(
			SELECT
				ROUND(li.Amount, 0) AS RoundedAmount
		) a
	WHERE
		a.RoundedAmount > 0
)
SELECT
	r.VendorName,
	SUM(t4.IsType4) AS Type4,
	SUM(t3.IsType3) AS Type3,
	SUM(t2.IsType2) AS Type2,
	SUM(t1.IsType1) AS Type1,
	SUM(t0.IsType0) AS Type0,
	COUNT(1) AS NumberOfInvoices,
	CAST(100.0 * SUM(t0.IsType0) / COUNT(1) AS DECIMAL(5,2)) AS PercentType0
FROM records r
	-- Each tier excludes all higher tiers, so every invoice lands in exactly one Type.
	CROSS APPLY(SELECT CASE WHEN r.RoundedAmount % 10000 = 0 THEN 1 ELSE 0 END AS IsType4) t4
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND r.RoundedAmount % 1000 = 0 THEN 1 ELSE 0 END AS IsType3) t3
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND r.RoundedAmount % 100 = 0 THEN 1 ELSE 0 END AS IsType2) t2
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND t2.IsType2 = 0 AND r.RoundedAmount % 10 = 0 THEN 1 ELSE 0 END AS IsType1) t1
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND t3.IsType3 = 0 AND t2.IsType2 = 0 AND t1.IsType1 = 0 THEN 1 ELSE 0 END AS IsType0) t0
GROUP BY
	r.VendorName
ORDER BY
	PercentType0 DESC;

Let’s talk through this code. First, we have a common table expression, records. We get the vendor name and rounded amount of each invoice, using CROSS APPLY to “hide” the rounding calculation. In this case, the CROSS APPLY is probably overkill, but for more complex scenarios, it can make life easier by letting us see what the author intended in the SELECT clause rather than piecing together the calculations ourselves.

SELECT
	v.VendorName,
	a.RoundedAmount
FROM dbo.LineItem li
	INNER JOIN dbo.Vendor v
		ON li.VendorID = v.VendorID
	CROSS APPLY
	(
		SELECT
			ROUND(li.Amount, 0) AS RoundedAmount
	) a
WHERE
	a.RoundedAmount > 0

From there, we take our records CTE and check to see if there are any Type 4 (or higher) results:

FROM records r
	CROSS APPLY(SELECT CASE WHEN r.RoundedAmount % 10000 = 0 THEN 1 ELSE 0 END AS IsType4) t4

In other words, if RoundedAmount divides cleanly by 10,000, then we have a Type 4 or higher. We’ll call that boolean flag IsType4 and alias the APPLY result as t4.

Next, we want to chain another function onto our result, this time seeing if something is not a Type 4 but is still divisible cleanly by 1,000. If so, this is a Type 3.

FROM records r
	CROSS APPLY(SELECT CASE WHEN r.RoundedAmount % 10000 = 0 THEN 1 ELSE 0 END AS IsType4) t4
	CROSS APPLY(SELECT CASE WHEN t4.IsType4 = 0 AND r.RoundedAmount % 1000 = 0 THEN 1 ELSE 0 END AS IsType3) t3

We continue with the same type of operation for t2 and t1, and any results which are not Type 1 through Type 4 (or higher) must therefore be Type 0.
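If it helps to see the classify-then-aggregate flow outside of SQL, here is a rough Python equivalent of what the query does per vendor (the pair layout and function names here are illustrative, not the series’ actual schema):

```python
from collections import Counter, defaultdict

def trailing_zero_type(n, max_type=4):
    """Type = number of trailing zeros in a positive whole-dollar amount, capped at 4."""
    zeros = 0
    while zeros < max_type and n % 10 == 0:
        n //= 10
        zeros += 1
    return zeros

def percent_type0_by_vendor(line_items):
    """line_items: iterable of (vendor, rounded_amount) pairs with amount > 0."""
    types = defaultdict(Counter)
    for vendor, amount in line_items:
        types[vendor][trailing_zero_type(amount)] += 1
    # Percentage of each vendor's invoices that are Type 0, as in PercentType0.
    return {
        vendor: round(100.0 * counts[0] / sum(counts.values()), 2)
        for vendor, counts in types.items()
    }
```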

The Value of Round Number Analysis

Okay, so why do we care about the number of round numbers? Because we might be able to find a result like the following:

Vendor                                  Type4  Type3  Type2  Type1  Type0  NumberOfInvoices  PercentType0
Bus Repair Shack                            0      1     10     74    899               983         91.45
The Longevity Crew                          0      0      4     41    458               503         91.05
Clean Sweep Cleaning Supplies               0      0     10    242   2548              2800         91.00
Fuel Suppliers, Ltd                         0      0      0     46    450               496         90.73
Doodads and Thingies, Inc.                  0      1     22    293   3009              3324         90.52
VehiCo Parts and Accessories                0      0     23    296   3044              3363         90.51
Tony’s Fixer Upper Buses                    0      0     25    335   3388              3748         90.39
Electronics and Repairs                     0      0     18    168   1738              1924         90.33
Safety First                                0      0     23    536   5219              5778         90.33
Comfort Rider, GmbH                         0      0     20    254   2538              2812         90.26
Mobility Accessories R Us                   0      0      9    173   1685              1867         90.25
Engine Mates                                0      0     28    319   3038              3385         89.75
Fuel Associates, Unlimited                  0      0      2     51    446               499         89.38
Tony’s Bus Fixer Uppers                     0      0     17    145   1333              1495         89.16
Glass and Sons Glass and Accessories        0    411     15    767   3612              4394         82.20

One of these vendors looks suspicious: the one with 411 Type 3 values. In a normal data set (where “normal” generally means “not a budget”), about 90% of your results will be Type 0, about 9% will be Type 1, about 0.9% will be Type 2, and so on.
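That 90/9/0.9 breakdown falls out of simple arithmetic: a uniformly chosen whole number ends in a nonzero digit nine times out of ten, in exactly one zero nine times out of a hundred, and so on. A quick Python check over the amounts 1 through 10,000 (a sketch, not the article’s data set):

```python
# Tally round-number types for every whole-dollar amount from 1 to 10,000.
amounts = range(1, 10001)
type0 = sum(1 for n in amounts if n % 10 != 0)                     # no trailing zeros
type1 = sum(1 for n in amounts if n % 10 == 0 and n % 100 != 0)    # exactly one
type2 = sum(1 for n in amounts if n % 100 == 0 and n % 1000 != 0)  # exactly two

pct_type0 = 100.0 * type0 / len(amounts)  # about 90%
```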

In other words, in most cases, you can expect roughly 90% / 10^N of your records to be Type N.

Last Digit Analysis

The rule of thumb for last digit analysis is that the last digit of each number in a set of numbers will likely follow the uniform distribution. In other words, there is no discernible pattern.

For example, here is a column chart of the distribution of the last digit of the population of towns and cities in North Carolina as of 2013:

Ignore the line. It was not meant for you. Yet.

You can see a pretty flat spread along the whole of the North Carolina population data set. So how about our fraudulent data?

As flat as Buzz Aldrin’s hair.

This distribution is quite uniform as well. Incidentally, if you recall from the second part of this series, we generated all of our “real” and “fraudulent” data using normal distributions. With normally distributed data (at least when the spread is large relative to 10), the last digit follows a uniform distribution.
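You can see this claim about normally distributed data in a seeded simulation (the mean and standard deviation below are arbitrary illustration values, not the series’ actual parameters):

```python
import random
from collections import Counter

random.seed(42)  # deterministic for the sake of the example

# Simulate rounded "invoice amounts" from a normal distribution.
amounts = [round(random.gauss(5000, 1500)) for _ in range(100_000)]
last_digits = Counter(abs(a) % 10 for a in amounts)

# Each digit 0-9 lands near 10% of the sample.
shares = {d: last_digits[d] / len(amounts) for d in range(10)}
```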

When Does This Not Hold?

There are a couple of situations where you can expect the last digit not to be uniform:

  1. Budgets are heavily skewed toward 0. This makes sense because budgets are estimates.
  2. Approximations in measurements tend to be biased toward 0 and 5.
  3. People fudging data tend to bias away from 0 and 5 because they’re not “random enough.”

Now, there has also been some research arguing that uniformity is less frequent than we might think, but I haven’t read that article.

Benford’s Law

I have a longer post on Benford’s Law already, so I’d recommend reading that.
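As a quick refresher on the mechanics: Benford’s Law says the probability that the first significant digit is d equals log10(1 + 1/d), about 30.1% for a leading 1 down to 4.6% for a leading 9. A small sketch:

```python
from math import log10

def benford_expected(d):
    """Expected share of values whose first significant digit is d (1 through 9)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return log10(1 + 1 / d)

def first_digit(amount):
    """First significant (nonzero) digit of a positive amount."""
    for ch in str(abs(amount)):
        if ch.isdigit() and ch != "0":
            return int(ch)
    raise ValueError("no significant digit")
```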

If you’re curious about our generated data, here’s what the first-digit distribution looks like, fraudulent data included:

Benford strikes again.

And here’s what it looks like for just 2018, the year of fraudulent data:

All 9, all the time.

There’s a bit of a difference to be sure, and that might spur on additional analysis.


In today’s post, I wanted to look at a few analytical techniques around distributions of numbers. They are useful in fraud detection mostly because people are terrible random number generators, so these techniques can help you spot potential bias. They won’t typically prove bias, but they can be a good first step.


Upcoming Events: Roanoke Valley .NET User Group

Key Details

What: Roanoke Valley .NET User Group
Where: Grandin Co-Lab, 1327 Grandin Road SW, Roanoke, VA
When: Thursday, May 2nd at 6 PM
Admission is free. Register on Meetup.

What I’m Presenting

06:00 PM — 08:00 PM — Machine Learning with .NET

This is a brand new talk that I’ll unveil for this user group, so it should be fun.

Buffalo Draft Review, Day 2

Pick 38: Buffalo trades a 5th rounder to the Raiders to move up two picks, then selects RT Cody Ford.

Buffalo has added 7 new offensive linemen this offseason, including 6 unrestricted free agents and now Ford. Ford is likely the new starting right tackle. With him in place, the question is now whether Ty Nsekhe or Dion Dawkins is the LT. Nsekhe was better as a left tackle than a right tackle, and Dawkins can project as a really good guard…though the Bills brought in 3 linemen who can play guard in Spencer Long, Jon Feliciano, and Quinton Spain. Add Wyatt Teller and there’s a lot of competition for those guard positions.

Ford is a first-round talent and a great pass blocker, which I’m all for, and trading a 5th rounder to move up to get him is something I won’t lose any sleep tonight over. It’s going to be a tough minicamp and preseason for offensive line coach Bobby Johnson.

Pick 74: Buffalo drafts RB Devin Singletary.

Singletary is a shifty runner. Perhaps a shady runner? Best case scenario, he becomes an elusive back and potential starter. And learning from McCoy can’t hurt. He also seems to be a bit more powerful than his size indicates.

However, right after him the Packers picked Jace Sternberger, and I think I’d rather have a good tight end prospect than a running back prospect for Buffalo here. The Bills have zero good tight ends. After passing on Irv Smith in the 2nd round, Sternberger was probably the last good tight end prospect on the board.

Pick 96: Buffalo trades both 4th round picks to the Redskins to get pick 96 and choose TE Dawson Knox.

After my complaining about missing Jace Sternberger, the Bills trade both 4th rounders to move back into the 3rd round and choose Dawson Knox, a tight end out of Ole Miss. He’s fast and has some of the skills which translate to success in the NFL. On the downside, he’s a major project and never scored a touchdown in college. Granted, Ole Miss had several targets above him on the list, but I would have liked to have seen a bit more production.

Overall, I’m happy with pick #2, less happy with pick #3, and moderately pleased with pick #4. I think I would rather have had Sternberger at 74 than trading back up for Knox, but I will give Brandon Beane credit: he’s brought in at least two impact guys at positions of need and is giving Josh Allen all the tools he can get to succeed. I don’t know if Knox contributes much next year but maybe he develops into a good tight end over the next couple of seasons.

Upcoming Events: Triad BI User Group

Key Details

What: Triad SQL Server BI User Group
Where: 4821 Koger Boulevard, Greensboro, North Carolina
When: Tuesday, April 30th, 6 PM
Admission is free. Register on Meetup.

What I’m Presenting

06:30 PM — 08:00 PM — Launching a Data Science Project: Cleaning is Half the Battle

The Triad user groups are always a pleasure to visit. I don’t have a chance to get out that way very often considering how provincial I am, but if you are in the area, stop on by.

Book Review: Mastering .NET Machine Learning

This is a review of Jamie Dixon’s book Mastering .NET Machine Learning. For the sake of transparency, Jamie is a friend and if I hated his book I wouldn’t badmouth it in public, so adjust your priors accordingly.

I enjoyed this book. Jamie wrote his book before Microsoft started on the ML.NET library, so he focuses on Accord. The book starts with a little bit on C# but quickly moves to F# to cover a number of scenarios around data science in .NET including data cleansing, regression analysis, clustering, cross-validation, and a bit on neural networks and even IoT.

On the plus side, Jamie’s conversational tone is evident throughout. He makes it easy for .NET developers new to F# and the world of data science to get a toehold in the field and I don’t think there’s a part of the book which is overly complex. Jamie also has a couple of consistent themes he uses throughout the book, including Adventure Works data and open data sources such as traffic stops.

Speaking of traffic stop information, I appreciate that Jamie includes failed scenarios and not just successes. Failure to find something interesting is at least as common as success, whether that be due to technique, incorrectly specified features, or simply a lack of correlation. But this does lead me to an issue I have with the book: you get invested in some of the data sets but I don’t think there’s a really good model in the end. For example, Jamie uses a few techniques to try to gain insights from traffic stop information but I don’t think he ever gets to a strong conclusion.

Overall, would I recommend this book? If you’re a .NET developer wanting to learn a bit about statistical analysis without wanting to learn R or Python, this is a good book for you. You still have to learn F# but the learning curve from C# to there is shallower than R or Python, I think…though only a little. I think the strongest parts of this book are where Jamie integrates data science with practical implementation like building a .NET website to act as a front end.

Upcoming Events: SQL Saturday Raleigh

Key Details

What: SQL Saturday Raleigh
Where: Wake Tech RTP, 10908 Chapel Hill Road, Morrisville, North Carolina, 27560
When: Saturday, April 27th, all day
Admission is free. Sign up at the SQL Saturday website.

What I’m Presenting

Nothing at all!

This will be the 5th SQL Saturday Raleigh that I’ve been involved with and my 4th running the event. This year, we’re in a brand new venue, one which just opened up a few months ago. This venue is right in the heart of the Triangle, so it should be a bit more convenient to get to for people than our prior events.

We will also be hosting a SQL Clinic, where attendees can ask questions about production issues, where (or what) to learn, or anything else.