Mathematics Laws for Data Scientists: Benford’s Law

Edward Girling
5 min read · Mar 29, 2021

Not uniform as expected but Benford’s…

This article series will cover more obscure, but nonetheless useful mathematics laws.

At the bottom of the page are links to the resources folder, which contains all code and data, and to the other articles in the series. To run the code in the Google Colab notebooks, download the whole Google Drive folder (Mathematics Laws for Data Scientists), unzip it, and upload it to your Google Drive (so in Colab the path is ‘drive/my drive/Mathematics Laws for Data Scientists’). All data is included, and the notebooks will run without any path alterations.

Benford’s Law

Benford’s law is also known as the Newcomb-Benford law, the law of anomalous numbers, or the first-digit law.

Factoid: The law was first discovered in 1881 by Simon Newcomb, who noticed anomalous wear patterns on the pages of logarithm tables. It was rediscovered in 1938 by Frank Benford, who observed the same wear patterns. Frank Benford also rigorously tested the law across 20 different domains.

Statement of the Law: A set of numbers satisfies Benford’s law if the leading digit d, an element of the set {1, 2, …, 9}, occurs with probability

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$

As a histogram with probability on the y axis and digit on x axis, this looks like:
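The bar heights of that histogram follow from the standard form of the law, P(d) = log10(1 + 1/d). A minimal Python sketch that tabulates them (the function name `benford_probability` is just an illustrative choice, not taken from the article's notebooks):

```python
import math


def benford_probability(d: int) -> float:
    """Expected share of numbers whose leading digit is d (1 through 9)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Probability for every possible leading digit.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    # One '#' per percentage point gives a rough text histogram.
    print(f"{d}: {p:.3f} {'#' * round(p * 100)}")
```

The digit 1 leads about 30.1% of the time, while 9 leads only about 4.6% of the time.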

Benford’s law also applies to numbers written in bases other than base 10, and it can be generalized to digits beyond the first one.
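Both generalizations can be sketched in a few lines. This assumes the usual textbook forms: in base b the leading digit d has probability log_b(1 + 1/d), and the base-10 second-digit probability is obtained by summing over every possible first digit. The function names are illustrative only.

```python
import math


def first_digit_probability(d: int, base: int = 10) -> float:
    """P(leading digit = d) for numbers written in the given base."""
    if not 1 <= d < base:
        raise ValueError("leading digit must be between 1 and base - 1")
    return math.log(1 + 1 / d, base)


def second_digit_probability(d: int) -> float:
    """P(second digit = d) in base 10, summed over first digits 1..9."""
    if not 0 <= d <= 9:
        raise ValueError("second digit must be between 0 and 9")
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))


octal = [first_digit_probability(d, base=8) for d in range(1, 8)]
second = [second_digit_probability(d) for d in range(10)]

print("base 8, first digit :", [round(p, 3) for p in octal])
print("base 10, second digit:", [round(p, 3) for p in second])
```

The second-digit distribution is much flatter than the first-digit one: it falls from roughly 12.0% for 0 to roughly 8.5% for 9.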

Assumptions:

  • The set spans multiple orders of magnitude, e.g. the smallest numbers are 1/1000 the size of the largest numbers.
  • The set cannot exclude particular digits, e.g. a set in which no number contains the digit 8 violates this.
  • The set has no arbitrary cutoffs, e.g. a bank account that has a minimum account balance.

Generally Applicable:

  • Sets with numbers that result from mathematical combinations of numbers, e.g. quantity × stock price.
  • Sets of naturally occurring phenomena.

Generally Not Applicable:

  • Sets with numbers assigned sequentially.
  • Sets with numbers influenced by human thought, e.g. prices.
  • Sets of transactions to an account that has a maximum or minimum value.

Examining Election Fraud Claims with Benford’s law and Python

There have been widespread and unsubstantiated claims of election fraud in the 2020 US presidential election. I will do a very cursory analysis to see whether the data offers any support for these claims. The data consists of county-level (constituency-level) vote counts by candidate for the 2016 and 2020 presidential elections, with 2012 included for reference.

Data Collection Source (2012 & 2016): MIT ELECTION LAB

Data Primary Source (2012 & 2016): HARVARD DATAVERSE

Data Primary Source (2020): KAGGLE

Checking that the assumptions are met (2016):

  1. The set is naturally occurring with no arbitrary maximum or minimum values. A candidate can receive anywhere from 0 votes upward.
  2. The data span several orders of magnitude. The min vote count is 3; the max vote count is 2.46 million.
  3. (optional) The data has a large number of points (9,468 points).
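The order-of-magnitude check in item 2 can be made explicit. The sketch below is an assumption about how one might do it, not code from the notebooks: it measures the span as log10(max / min) over the strictly positive values, since zero has no leading digit. Only the minimum and maximum quoted above are used.

```python
import math


def orders_of_magnitude(values) -> float:
    """Span of the data as log10(max / min), ignoring non-positive values."""
    positive = [v for v in values if v > 0]
    if not positive:
        raise ValueError("need at least one positive value")
    return math.log10(max(positive) / min(positive))


# Minimum and maximum 2016 vote counts quoted in the list above;
# the rest of the county-level data is not reproduced here.
span_2016 = orders_of_magnitude([3, 2_460_000])
print(f"2016 vote counts span about {span_2016:.1f} orders of magnitude")
```

A span of nearly six orders of magnitude comfortably satisfies the "multiple orders of magnitude" assumption.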

Since we have a lot of points, a simple histogram can be used to examine the data heuristically. I used a histogram plotted as a scatter plot since the points were so close together.
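The core of such a comparison is extracting the leading digit of each vote count and tallying the shares. Here is a minimal sketch using only the standard library; the `toy_counts` list is made up for illustration and is not the election data.

```python
from collections import Counter


def leading_digit(n: int) -> int:
    """First (most significant) decimal digit of a non-zero integer."""
    n = abs(int(n))
    while n >= 10:
        n //= 10
    return n


def first_digit_frequencies(counts) -> dict:
    """Share of each leading digit 1..9, skipping zero counts."""
    digits = [leading_digit(c) for c in counts if c != 0]
    if not digits:
        raise ValueError("no non-zero counts to analyse")
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}


# Made-up vote counts, for illustration only.
toy_counts = [3, 17, 120, 1450, 1999, 2300, 0, 48210, 905, 13]
observed = first_digit_frequencies(toy_counts)

for d, share in observed.items():
    print(f"{d}: {share:.2f}")
```

Zero counts are dropped because zero has no leading digit between 1 and 9. The resulting shares can then be plotted next to the Benford probabilities.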

Checking that the assumptions are met (2020):

  1. The set is naturally occurring with no arbitrary maximum or minimum values. A candidate can receive anywhere from 0 votes upward.
  2. The data span several orders of magnitude. The min vote count is 0; the max vote count is 3.03 million.
  3. (optional) The data has a large number of points (32,177 points). This set has more points because it breaks out the top significant candidates.

I did 2012 as a placeholder while waiting for a request for 2020 data to come through. It is presented below for reference.

Checking that the assumptions are met (2012): All are met, since this is the same dataset as 2016, just a different subsample.

Let’s look at the histogram again.

Overall the plots are very close.

More analytical approaches exist, such as using Pearson’s chi-squared test to determine whether the observed distribution differs from Benford’s at the 90, 95, or 99 percent confidence level.
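As a rough sketch of that test (not something the article's notebooks are confirmed to do), the chi-squared statistic can be computed by hand and compared with the standard critical values for 8 degrees of freedom (9 digits minus 1). The digit tallies below are made up for illustration.

```python
import math

# Upper-tail critical values of the chi-squared distribution with
# 8 degrees of freedom, keyed by significance level (1 - confidence).
CRITICAL_VALUES = {0.10: 13.362, 0.05: 15.507, 0.01: 20.090}


def benford_chi_squared(digit_counts: dict) -> float:
    """Pearson chi-squared statistic of observed digit counts vs. Benford."""
    total = sum(digit_counts.get(d, 0) for d in range(1, 10))
    if total == 0:
        raise ValueError("no observations")
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (digit_counts.get(d, 0) - expected) ** 2 / expected
    return stat


def rejected_levels(stat: float) -> dict:
    """For each significance level, whether Benford conformity is rejected."""
    return {alpha: stat > crit for alpha, crit in CRITICAL_VALUES.items()}


# Made-up tallies: one close to Benford, one roughly uniform.
benford_like = {1: 301, 2: 176, 3: 125, 4: 97, 5: 79, 6: 67, 7: 58, 8: 51, 9: 46}
uniform_like = {d: 111 for d in range(1, 10)}

for name, counts in [("benford_like", benford_like), ("uniform_like", uniform_like)]:
    stat = benford_chi_squared(counts)
    print(name, round(stat, 3), rejected_levels(stat))
```

With very large samples, the chi-squared test can flag deviations that are tiny in practical terms, so the statistic is best read alongside the plots rather than instead of them. If SciPy is available, `scipy.stats.chisquare` returns the same statistic together with a p-value.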

Benford’s law can be very useful; however, it is also contested. This article from Reuters explains it in depth. In essence, a distribution may not follow Benford’s law because the assumptions are violated in a subtle way. A researcher may not be able to detect this violation and may wrongly conclude that the data is fraudulent. For example, were I to zoom in and look at individual groupings of precincts (smaller than counties), I would see lots of non-Benford distributions. Walter Mebane explains in depth why Benford’s law does not hold there. In essence, precinct size is a major driver of the first digit, and the vote counts do not vary across multiple orders of magnitude.
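The order-of-magnitude point can be illustrated with a small simulation. This is a hypothetical sketch, not Mebane's analysis and not the election data: "precinct-like" counts are drawn uniformly from 400 to 899 (a single order of magnitude), while "county-like" counts are drawn log-uniformly over six orders of magnitude. The ranges, sample size, and seed are arbitrary assumptions.

```python
import random
from collections import Counter


def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n


def first_digit_shares(values) -> dict:
    """Share of each leading digit 1..9 among the given positive integers."""
    tally = Counter(leading_digit(v) for v in values)
    return {d: tally.get(d, 0) / len(values) for d in range(1, 10)}


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical precinct-like counts: similar sizes, one order of magnitude.
narrow = [rng.randint(400, 899) for _ in range(20_000)]
# Hypothetical county-like counts: log-uniform over six orders of magnitude.
wide = [int(10 ** rng.uniform(0, 6)) for _ in range(20_000)]

narrow_shares = first_digit_shares(narrow)
wide_shares = first_digit_shares(wide)

for d in range(1, 10):
    print(f"{d}: narrow={narrow_shares[d]:.3f}  wide={wide_shares[d]:.3f}")
```

In the narrow sample the digits 1, 2, 3, and 9 never lead, so it fails Benford's law without any fraud involved. The wide sample lands close to the Benford shares.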

(Opinion) I take a very conservative approach to a contested topic. I primarily use this tool as a weak indicator of fraud. Failing Benford’s law causes me to examine the data more closely, but if I find nothing, I move on.

The code is best reviewed interactively through Google Colab. The code is mostly data cleaning, but as any good data scientist knows, that is just how it goes.

All resources for the series are consolidated in the resources folder.

Edward Girling

Mathematician, enjoys his knowledge distilled. Find my insight deep, my jokes laughable, my resources useful, connect with me on twitter @Rowlando_13