Benford’s law stems from the observation that in many real-life datasets, the leading digit (or leading set of digits) follows a distribution in which frequencies decrease successively, with the digit 1 occurring most often. As an example, take the populations of all countries. The data comes from a Kaggle dataset, and the leading digits are extracted as follows:
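library(dplyr)    # provides %>% and select()
library(stringr)  # provides str_extract()

# Read the country population data downloaded from Kaggle
pop_data <- read.csv("./population.csv")
# Keep only the 2020 population column, renamed to `pop`
ben_data <- pop_data %>% select(pop = `Population..2020.`)
# Extract the leading digit and convert it to an integer
ben_data$digi <- as.integer(str_extract(ben_data$pop, "^\\d{1}"))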
The next step is to plot the histogram using the extracted digits.
[Figure: histogram of the leading digits of country populations]
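The plotting step can be sketched in base R as below, with Benford's expected frequencies, P(d) = log10(1 + 1/d), overlaid for comparison. This is only an illustration under stated assumptions, not the code behind the figure: the helper `leading_digit` and the stand-in data `demo_values` (powers of 2, which are known to follow Benford's law closely) are hypothetical names, and in the article's pipeline `ben_data$digi` would take the place of `digi`.

```r
# Hypothetical helper (not from the article): first significant digit of positive numbers.
leading_digit <- function(x) {
  # Scientific notation puts the first significant digit first, e.g. "1.43e+09"
  as.integer(substr(sprintf("%.15e", x), 1, 1))
}

# Stand-in data: powers of 2. With the article's data, use ben_data$digi instead.
demo_values <- 2^(0:99)
digi <- leading_digit(demo_values)

# Observed share of each leading digit 1..9 (the factor keeps unused digits as 0)
observed <- as.numeric(table(factor(digi, levels = 1:9))) / length(digi)

# Expected share under Benford's law: P(d) = log10(1 + 1/d)
expected <- log10(1 + 1 / (1:9))

# Bar chart of observed shares with the expected values overlaid as a red line
mids <- barplot(observed, names.arg = 1:9,
                xlab = "Leading digit", ylab = "Relative frequency")
lines(mids, expected, type = "b", col = "red")
```

If the bars track the red line, the data is consistent with Benford's law. A formal check, such as a chi-squared test, would be a separate step.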
Let’s not stop here: next, extract the first two digits and plot them.
[Figure: histogram of the first two digits of country populations]
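A rough sketch of the two-digit version follows, again in base R and again on stand-in data rather than the population file. The helper `first_two_digits` and the vector `demo_values` are hypothetical. With the article's stringr approach, the analogous extraction would presumably change the pattern to `"^\\d{2}"`. Values below 10 are set to `NA` here, mirroring what that pattern would do with a single-digit number.

```r
# Hypothetical helper (not from the article): first two significant digits
# of positive numbers, e.g. 1439323776 -> 14.
first_two_digits <- function(x) {
  # "1.439...e+09": character 1 is the first digit, character 3 the second
  s <- sprintf("%.15e", x)
  out <- as.integer(paste0(substr(s, 1, 1), substr(s, 3, 3)))
  out[x < 10] <- NA  # single-digit values have no second digit
  out
}

# Stand-in data: powers of 2. With the article's data, use the population column.
demo_values <- 2^(0:99)
two <- first_two_digits(demo_values)
two <- two[!is.na(two)]

# Observed share of each two-digit prefix 10..99
observed2 <- as.numeric(table(factor(two, levels = 10:99))) / length(two)

# Expected share under Benford's law for two leading digits: log10(1 + 1/d)
expected2 <- log10(1 + 1 / (10:99))

# Bar chart of observed shares with the expected values overlaid in red
mids <- barplot(observed2, names.arg = 10:99,
                xlab = "First two digits", ylab = "Relative frequency")
lines(mids, expected2, col = "red")
```

With 90 possible prefixes, a small dataset such as a list of countries will look much noisier than the single-digit histogram, so only the overall downward trend should be expected to show.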