In the Papers: Data Sets Conforming to Benford’s Law

This is a review of R.C. Hall’s Properties of Data Sets that Conform to Benford’s Law, a math-heavy article which lays out proofs of certain properties you would expect from a dataset which follows Benford’s Law. I’ve long found Benford’s Law interesting, and wrote blog posts on the topic back in 2015 and 2019, so I was interested in reading the article primarily to get a better idea of which distributions follow Benford’s Law and which don’t.

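Hall lays this out on page 3: “If the probability density function of the mantissas of the logarithm of a data set is Uniform distribution then the data set conforms exactly to Benford’s law.” Page 4 follows with a proof. But, uh, if you don’t have a solid stats background, that’s a tough statement to follow. In plainer terms: take the base-10 logarithm of every value and keep only the fractional part; if those fractional parts are spread evenly between 0 and 1, the first digits follow Benford’s Law. Fortunately, there are great examples of distributions which do follow Benford’s Law and distributions which don’t.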
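Before getting to those examples, here is a minimal Python sketch of what the page 3 statement means in practice. It is my own illustration, not code from the paper: I build values as 10 raised to a uniformly random fractional part (plus an arbitrary whole-number exponent, which I picked as 0 through 5) and compare the observed first-digit frequencies against the Benford probabilities log10(1 + 1/d). The helper names are made up for this post.

```python
import math
import random


def benford_expected():
    """Benford probability of each first digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading significant digit of a positive number.

    Formatting with 17 significant digits avoids rounding a value such as
    9.9999999 up to 1.0e+01 and reporting the wrong digit.
    """
    return int(f"{x:.16e}"[0])


def first_digit_freqs(values):
    """Observed share of values starting with each digit 1..9."""
    counts = {d: 0 for d in range(1, 10)}
    n = 0
    for v in values:
        counts[first_digit(v)] += 1
        n += 1
    return {d: c / n for d, c in counts.items()}


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption for illustration: the fractional part of log10(x) is uniform on
# [0, 1); the whole-number part (0..5 here) does not affect the first digit.
sample = [10 ** (rng.randint(0, 5) + rng.random()) for _ in range(100_000)]

expected = benford_expected()
observed = first_digit_freqs(sample)

for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.4f}  observed {observed[d]:.4f}")
```

The two columns should agree up to sampling noise, which is all the theorem is saying: a uniform fractional part of the logarithm is the same thing as Benford first digits.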

Among the distributions which do, exponential distributions follow Benford exactly. Gamma, beta, and Weibull distributions follow pretty closely. Log-normal approaches Benford as the standard deviation increases, and chi-squared approaches it as the number of degrees of freedom increases. Distributions which emphatically do not follow Benford are Gaussian (normal) and uniform distributions.

Coming into this reading, I had an intuition that datasets are more likely to follow Benford’s Law if they span several orders of magnitude. Looking at the two sets of distributions, that makes a lot of sense, especially if we consider the cases of log-normal and chi-squared, which only approach Benford under circumstances which increase the spread of values.
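That intuition is easy to poke at with a quick simulation. The sketch below is my own, not something from the paper: it draws log-normal samples with increasing standard deviation (of the underlying normal, in natural-log units), plus a normal and a uniform sample for contrast, and reports the largest gap between observed first-digit frequencies and Benford’s. The specific parameters (sigma values, normal(100, 15), uniform(0, 1000)) are arbitrary choices for illustration.

```python
import math
import random

# Benford probabilities for first digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def max_benford_gap(values):
    """Largest absolute gap between observed first-digit shares and Benford's."""
    counts = [0] * 9
    for v in values:
        # First character of scientific notation is the leading digit.
        counts[int(f"{v:.16e}"[0]) - 1] += 1
    n = sum(counts)
    return max(abs(c / n - p) for c, p in zip(counts, BENFORD))


rng = random.Random(2024)  # fixed seed so the run is repeatable
N = 100_000

gaps = {}

# Log-normal: a larger sigma spreads the values over more orders of magnitude.
for sigma in (0.25, 0.5, 1.0, 2.0):
    sample = (rng.lognormvariate(0, sigma) for _ in range(N))
    gaps[f"log-normal, sigma={sigma}"] = max_benford_gap(sample)

# Contrast cases; non-positive draws are dropped since they have no first digit.
normal = (x for x in (rng.gauss(100, 15) for _ in range(N)) if x > 0)
gaps["normal(100, 15)"] = max_benford_gap(normal)

uniform = (x for x in (rng.uniform(0, 1000) for _ in range(N)) if x > 0)
gaps["uniform(0, 1000)"] = max_benford_gap(uniform)

for name, gap in gaps.items():
    print(f"{name:24s} max gap from Benford: {gap:.4f}")
```

A small gap means the sample’s first digits look like Benford’s. The narrow log-normal, the normal, and the uniform samples should all show large gaps, while the wide log-normal should be close to zero.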

Starting on page 33, Hall also debunks “the summarization test,” which involves “adding all numbers that begin with a particular first digit or first two digits” and expecting the resulting sums to be uniformly distributed across those digits. Hall proves that this holds for exponential distributions but not for the others, meaning that a series can be very close to a Benford series yet fail the summarization test. That makes summarization a bad test, because most of the relevant real-world sets following Benford’s Law are actually log-normal, not exponential.
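Here is a rough sketch of that test in Python, again my own illustration rather than Hall’s method. It sums the values under each first digit and reports each digit’s share of the grand total. As a stand-in for a set that conforms exactly, I use values whose log10 is uniform over three full decades (that choice is my assumption, not necessarily the construction in the paper), and I compare it with a wide log-normal sample whose first digits are very close to Benford.

```python
import random


def digit_sum_shares(values):
    """Share of the grand total contributed by values with each first digit 1..9."""
    sums = [0.0] * 9
    for v in values:
        # First character of scientific notation is the leading digit.
        sums[int(f"{v:.16e}"[0]) - 1] += v
    total = sum(sums)
    return [s / total for s in sums]


rng = random.Random(7)  # fixed seed so the run is repeatable
N = 200_000

# Assumed stand-in for an exactly conforming set: log10(x) uniform on [0, 3).
log_uniform = [10 ** rng.uniform(0, 3) for _ in range(N)]

# A wide log-normal sample (sigma chosen arbitrarily for illustration).
log_normal = [rng.lognormvariate(0, 1.5) for _ in range(N)]

shares_log_uniform = digit_sum_shares(log_uniform)
shares_log_normal = digit_sum_shares(log_normal)

print("digit  log-uniform  log-normal  (equal shares would be 0.1111)")
for d in range(1, 10):
    print(f"{d:5d}  {shares_log_uniform[d - 1]:11.4f}  {shares_log_normal[d - 1]:10.4f}")
```

In this setup the log-uniform sums come out roughly equal across digits, while the log-normal sums are heavily tilted toward the low digits, even though the log-normal first-digit counts are a good Benford match. That is the failure mode described above: a dataset can pass a first-digit check and still flunk the sums.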

Overall, I recommend this paper even if you aren’t deeply familiar with statistics and have trouble following the proofs. Hall includes quite a few graphs and writes clearly. The one unfortunate thing is that I read this paper on a grayscale tablet, meaning that all of the charts were really difficult to follow due to the color choices Excel makes by default. For that reason, I’d recommend reading the paper on a color-enabled screen.
