Benford’s Law

One of the talks I went to at SQL Saturday Charlotte was Bill Pearson’s talk on forensic analysis.  During the talk, Pearson mentioned Benford’s Law, and when I heard about it, I knew I wanted to investigate this phenomenon further.

Benford’s Law

The basic idea behind Benford’s Law is that, in many data sets, smaller leading digits appear more frequently than larger ones.  A leading 1 appears almost three times as frequently as it would under a uniform distribution, whereas 7, 8, and 9 each appear approximately half as frequently.
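For reference, the expected share of leading digit d under Benford’s Law is log10(1 + 1/d). A minimal sketch in R (the variable names are illustrative, not from the talk) that compares those shares against a uniform 1/9:

```r
# Expected leading-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d)
digits <- 1:9
benford <- log10(1 + 1 / digits)

# Ratio of each Benford probability to the uniform probability of 1/9
ratio <- benford / (1 / 9)

round(data.frame(Digit = digits, Benford = benford, RatioToUniform = ratio), 3)
```

The ratio for 1 comes out at about 2.7, and the ratios for 7 through 9 fall between roughly 0.41 and 0.52.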

Benford’s Law is not a universal law, but it does appear to hold in a large number of scenarios, including financial data.  Pearson mentioned using Benford’s Law to find fraudulent journal entries, although in that scenario, analysts will look at the second and later digits as well as the first.
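The second-digit check has its own expected distribution, which is much flatter than the first-digit one. As a rough sketch (not something shown in the talk, and with a variable name of my own choosing), the probability of second digit d2 is the sum over first digits d1 of log10(1 + 1/(10*d1 + d2)):

```r
# Probability that the second digit equals d2 (0-9) under Benford's Law:
# sum over first digits d1 = 1..9 of log10(1 + 1 / (10 * d1 + d2))
second_digit_prob <- sapply(0:9, function(d2) {
  d1 <- 1:9
  sum(log10(1 + 1 / (10 * d1 + d2)))
})
names(second_digit_prob) <- 0:9

round(second_digit_prob, 3)
```

The values run from about 0.12 for a second digit of 0 down to about 0.085 for a 9, so deviations at the second digit are subtler than at the first.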

To get a quick idea of what a Benford distribution would look like, here’s some quick R code:

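# Assumed reconstruction (the original call was cut off): Benford frequencies log10(1 + 1/d)
bp <- data.frame(1:9, log10(1 + 1 / (1:9)))

names(bp)[1] <- "Digit"
names(bp)[2] <- "Frequency"
# Assumed plotting call; the original is missing
barplot(bp$Frequency, names.arg = bp$Digit, xlab = "Digit", ylab = "Frequency")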

And here’s the ugly plot:


Testing Benford’s Law

I’m going to look at three separate sets of data to check Benford’s Law.  First, we’ll look at a Fibonacci sequence to see if it follows.  Second, we’ll look at my local HOA’s budget figures for 2013-2015.  Finally, I’ll grab random numbers.

Fibonacci Sequences

Generating a Fibonacci sequence in R is pretty easy.  Getting the leading digit of each element in an array is easy.  Let’s combine them together:


library(MASS)  # provides truehist()

len <- 1000
fibvals <- numeric(len)
fibvals[1] <- 1
fibvals[2] <- 1
for (i in 3:len) {
    fibvals[i] <- fibvals[i-1] + fibvals[i-2]
}

# Return the leading digit of a positive number
firstdigit <- function(k) {
    as.numeric(head(strsplit(as.character(k), '')[[1]],n=1))
}

fibfirst <- sapply(fibvals, firstdigit)
truehist(fibfirst, nbins=10)

This code snippet generates the first thousand Fibonacci numbers and uses the MASS library’s truehist() function to generate a histogram for the digits 1-9.  The end result looks like:


Notice that this histogram fits very nicely with the plot above, showing that Fibonacci sequences will tend to follow Benford’s Law.
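To put numbers on that visual comparison, the observed leading-digit shares can be set side by side with the Benford expectations. This is only a sketch: lead_digit is a hypothetical vectorized helper that reads the first character of the number in scientific notation, rather than the firstdigit function used above.

```r
# Rebuild the first 1,000 Fibonacci numbers (stored as doubles)
len <- 1000
fibvals <- numeric(len)
fibvals[1:2] <- 1
for (i in 3:len) {
  fibvals[i] <- fibvals[i - 1] + fibvals[i - 2]
}

# Hypothetical helper: leading digit taken from the "d.ddd...e+xx" representation
lead_digit <- function(x) {
  as.integer(substr(formatC(x, format = "e", digits = 15), 1, 1))
}

# Share of each leading digit 1-9, next to the Benford expectation
observed <- as.numeric(table(factor(lead_digit(fibvals), levels = 1:9))) / len
expected <- log10(1 + 1 / (1:9))

comparison <- data.frame(Digit = 1:9,
                         Observed = round(observed, 3),
                         Expected = round(expected, 3))
comparison
```

The two columns should agree closely, which is the same story the histogram tells.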

Budget Values

Each year, my HOA announces its annual budget.  I decided to grab the individual budgeted account values from the 2013, 2014, and 2015 budgets and apply the same analysis to each.  There are 107 line items in the budget, and here’s the breakdown by year:


library(MASS)  # provides truehist()

hoa <- read.table('',header=TRUE,sep='\t',quote="")

firstdigit <- function(k){
  as.numeric(head(strsplit(as.character(k), '')[[1]],n=1))
}

hoa$F2013 <- sapply(hoa$X2013,firstdigit)
hoa$F2014 <- sapply(hoa$X2014,firstdigit)
hoa$F2015 <- sapply(hoa$X2015,firstdigit)

# Drop the $0 line items, whose leading digit is 0
f2013 <- subset(hoa, F2013 > 0)
f2014 <- subset(hoa, F2014 > 0)
f2015 <- subset(hoa, F2015 > 0)

# truehist() draws each histogram; title() then adds a heading to it
truehist(f2013$F2013, nbins=10, xlab = "Leading Digit")
title("2013 Budget")
truehist(f2014$F2014, nbins=10, xlab = "Leading Digit")
title("2014 Budget")
truehist(f2015$F2015, nbins=10, xlab = "Leading Digit")
title("2015 Budget")

This code is just a little bit more complex than the other code above, and that’s because I need to filter out the $0 budget items. When I do that, I get the following graphs:




So, these results don’t quite follow Benford’s Law.  Instead of a smooth downward curve on the histogram, we see some jitter as well as a consistent bump at the digit 7, meaning a lot of $7000-7999 or $70,000-79,999 entries.  Based on this, I might wonder if our treasurer likes that number more than others, or if it’s just coincidence.
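One way to probe the coincidence question is a chi-squared goodness-of-fit test against the Benford proportions. The sketch below is not the analysis from this post: benford_chisq is a hypothetical helper, and its input is simulated leading digits drawn from a Benford distribution rather than the HOA figures. It is here only to show the mechanics on a sample of the same size.

```r
benford_p <- log10(1 + 1 / (1:9))

# Hypothetical helper: chi-squared goodness-of-fit of leading digits vs Benford.
# simulate.p.value = TRUE uses Monte Carlo, since some expected counts are small.
benford_chisq <- function(lead_digits) {
  counts <- table(factor(lead_digits, levels = 1:9))
  chisq.test(counts, p = benford_p, simulate.p.value = TRUE, B = 2000)
}

# Simulated stand-in data (NOT the HOA budget): 107 digits drawn from Benford
set.seed(42)
simulated <- sample(1:9, size = 107, replace = TRUE, prob = benford_p)

result <- benford_chisq(simulated)
result$statistic
result$p.value
```

Swapping in a real vector of leading digits, such as f2013$F2013, would give a p-value for that year. With only about a hundred line items, a fair amount of jitter is expected even when the underlying process follows Benford’s Law.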

Random Numbers

Unlike the other two, my expectation coming in here is that we should see a fairly uniform distribution of numbers.  So let’s generate some random numbers:


library(MASS)  # provides truehist()

firstdigit <- function(k){
  as.numeric(head(strsplit(as.character(k), '')[[1]],n=1))
}

# 10,000 draws, uniformly distributed between 1 and 999,999
rn <- runif(10000, 1, 999999)
rnf <- sapply(rn,firstdigit)
truehist(rnf, nbins=10)

This generates a plot like so:


Moral of the story:  don’t use a random number generator to generate financial data…
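The flip side is that random numbers spread evenly on a logarithmic scale, across a whole number of powers of ten, do follow Benford’s Law. A small sketch under that assumption (lead_digit and digit_share are hypothetical helper names, and the seed is arbitrary):

```r
set.seed(1)
n <- 10000

# Uniform on [1, 999999], the same range as the runif() call above
uniform_vals <- runif(n, 1, 999999)

# Log-uniform over six whole decades, 10^0 to 10^6
loguniform_vals <- 10^runif(n, 0, 6)

# Hypothetical helper: leading digit from the scientific-notation string
lead_digit <- function(x) {
  as.integer(substr(formatC(x, format = "e", digits = 15), 1, 1))
}

# Hypothetical helper: share of each leading digit 1-9
digit_share <- function(x) {
  as.numeric(table(factor(lead_digit(x), levels = 1:9))) / length(x)
}

round(rbind(Uniform = digit_share(uniform_vals),
            LogUniform = digit_share(loguniform_vals),
            Benford = log10(1 + 1 / (1:9))), 3)
```

The uniform row should sit near 1/9 for every digit, while the log-uniform row should track the Benford row.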


Benford’s Law is a rather interesting concept.  You can find that it applies to a large number of data sets, including things as disparate as per-capita GDP and Twitter follower counts.  In today’s post, we saw how it applies to Fibonacci sequences and (with a bit of noise) to a relatively small set of budget items.

