Data & Statistics

The Betting Equation

In his book, The Ten Equations that Rule the World, David Sumpter discusses how he could create an ‘edge’ in betting. The term edge refers to the advantage a bettor wants to develop over the bookmaker. Let’s understand Sumpter’s first equation that runs the world.  

P = \frac{1}{1 + \alpha x^{\beta}}

Before getting into the details, let’s recall how odds are specified in UK-style (fractional) betting. Odds are quoted as a/b (a-to-b), which means your profit on a winning bet is (a/b) x the wagered amount. The bookmaker’s probability associated with winning an a/b bet is:

P = \frac{1}{(a+b)/b} = \frac{1}{1+a/b}

Notice that this equation represents a fair probability, with no margin for the bookmaker, which does not hold in real life. Replace a/b with x, and you will notice the similarity with the earlier equation.

P = \frac{1}{1+x}

From several years of betting data, Sumpter concludes that the bookmaker’s probability underestimates strong favourites and overestimates weak favourites. He adds alpha and beta to compensate for this behaviour, which also becomes the ‘edge’. 

Let’s illustrate how this edge develops. Suppose the odds are 1/3 (1 to 3). The probability of the favourite winning, as per the bookmaker, is:
P = 1/(1+1/3) = 0.75
The expected payoff (on betting a dollar) is 0.75 x (1/3) – 0.25 x 1 = 0.

Now, use the modified equation with alpha = 1.16 and beta = 1.25, which Sumpter estimated by regressing historical data.
P = 1/(1 + 1.16 x (1/3)^1.25) = 0.77
The expected payoff is 0.77 x (1/3) – 0.23 x 1 = 0.027 or a 2.7% edge! 
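
The arithmetic above can be reproduced with a minimal R sketch. The function names win_prob and expected_payoff are illustrative, not Sumpter’s; alpha = 1 and beta = 1 recover the bookmaker’s fair probability. Note that the unrounded probability gives an edge of about 3%, while rounding P to 0.77 first gives the 2.7% quoted above.

```r
# Betting equation: P = 1 / (1 + alpha * x^beta), with x = a/b
win_prob <- function(x, alpha = 1, beta = 1) {
  1 / (1 + alpha * x^beta)
}

# Expected profit per unit staked: win x with probability p, lose 1 otherwise
expected_payoff <- function(p, x) {
  p * x - (1 - p) * 1
}

x <- 1 / 3                                        # odds of 1/3
p_fair <- win_prob(x)                             # bookmaker's fair probability
p_adj  <- win_prob(x, alpha = 1.16, beta = 1.25)  # adjusted probability

expected_payoff(p_fair, x)            # zero edge at fair odds
expected_payoff(round(p_adj, 2), x)   # edge using P rounded to 0.77
expected_payoff(p_adj, x)             # edge using the unrounded P
```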


Two Kings in a Deck – The Simulation

Let’s build an R simulation to verify the results obtained in the previous post.

Step 1: Create a deck of cards

suits <- c("Diamonds", "Spades", "Hearts", "Clubs")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King")
deck <- expand.grid(Number = numbers, Suit = suits)
deck <- paste(deck$Number, deck$Suit)
"Ace Diamonds"   "Deuce Diamonds" "Three Diamonds" "Four Diamonds"  "Five Diamonds"  "Six Diamonds"   "Seven Diamonds" "Eight Diamonds" "Nine Diamonds"  "Ten Diamonds"   "Jack Diamonds"  "Queen Diamonds" "King Diamonds"  "Ace Spades"     "Deuce Spades"   "Three Spades"  "Four Spades"    "Five Spades"    "Six Spades"     "Seven Spades"   "Eight Spades"   "Nine Spades"    "Ten Spades"     "Jack Spades"   
"Queen Spades"   "King Spades"    "Ace Hearts"     "Deuce Hearts"   "Three Hearts"   "Four Hearts"    "Five Hearts"    "Six Hearts"    "Seven Hearts"   "Eight Hearts"   "Nine Hearts"    "Ten Hearts"     "Jack Hearts"    "Queen Hearts"   "King Hearts"    "Ace Clubs"     
"Deuce Clubs"    "Three Clubs"    "Four Clubs"     "Five Clubs"     "Six Clubs"      "Seven Clubs"    "Eight Clubs"    "Nine Clubs"    "Ten Clubs"      "Jack Clubs"     "Queen Clubs"    "King Clubs" 

Step 2: Identify the positions containing the ‘string’. Here is an example.

green <- c("Green Wood", "Green paper", "Red Wood", "Green chilly", "Light green")
grep("Green", green, ignore.case = TRUE)
1 2 4 5

The vector ‘green’ carries the string ‘green’ at positions 1, 2, 4 and 5.

Step 3: Find the differences between successive positions and check if 1 appears anywhere (meaning two matches sit next to each other).

any(diff(grep("Green", green, ignore.case = TRUE)) ==1)
TRUE

Step 4: Apply the scheme to the deck of cards, run it a million times, and find the fraction of runs in which the ‘any’ command gave ‘TRUE’.

itr <- 1000000

card_match <- replicate(itr, {
  shuffle <- sample(deck, 52, replace = FALSE)
  mat <- grep("King", shuffle, ignore.case = TRUE)
  consec <- diff(mat)
  
  if(any(consec == 1)) {
    counter <- 1 
  }else{
    counter <- 0
  }
})

mean(card_match)
0.216743

Not far from what we got analytically.


Two Kings in a Deck

What is the probability of at least one instance of two “Kings” being next to each other in a well-shuffled deck of cards? 

When you see the term “at least one”, you can either estimate the chances of exactly one, two, three (and so on) occurrences and add them up, or calculate the probability of zero occurrences and subtract it from 1. We will do the latter.

The situation in which no Kings are next to each other can be obtained by arranging the non-King cards in a row and placing the Kings in the gaps between them. It is like treating the 52-4 = 48 non-Kings as the bars and the 4 Kings as the stars. For each arrangement of the bars, the stars can be placed at 49 locations (from the left of the first bar to the right of the last bar), with at most one star per location.

*| | * | ….| * | * |

The number of ways to place the 4 Kings in 49 locations is 49P4 for each ordering of the other cards. Since the 48 other cards can be ordered in 48! ways, the total number becomes 49P4 x 48!. Dividing this quantity by the total number of arrangements of the deck, 52!, gives the probability that no two Kings are adjacent.

P(no adjacent Kings) = (49P4 x 48!)/52! = ([49!/(49-4)!] x 48!)/52!

P(at least one adjacent pair) = 1 – ([49!/(49-4)!] x 48!)/52! = 0.217
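
As a quick check of the arithmetic, here is a minimal R sketch. It relies on the cancellation 48!/52! = 1/(52 x 51 x 50 x 49), so no large factorials are needed; the variable names are illustrative.

```r
# 49P4: ordered placements of the 4 Kings into the 49 gaps
gap_placements <- prod(49:46)

# 52! / 48! = 52 * 51 * 50 * 49
deck_ratio <- prod(52:49)

p_no_adjacent  <- gap_placements / deck_ratio
p_at_least_one <- 1 - p_no_adjacent
round(p_at_least_one, 3)
```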

How do we verify this is correct? Let’s perform this exercise a million times; that is next. 


Cheryl’s Birthday

Here is the viral puzzle about Cheryl’s birthday. Cheryl gives a list of possible dates and asks her friends Albert and Bernard to guess her birthday.

May 15, May 16, May 19, June 17, June 18, July 14, July 16, Aug 14, Aug 15, Aug 17

As a clue, she gives the Month to Albert and the Date to Bernard. 

Albert (has info on Month) says, “I don’t know the birthday, but I know Bernard doesn’t know either.”
Bernard (who has info on Date) says, “I didn’t know at first, but now I do.” 
Albert: “Now I also know Cheryl’s birthday.”

So what is Cheryl’s birthday?

Let’s start the analysis by placing the Months and Dates in the following table. 

        14   15   16   17   18   19
May     -    X    X    -    -    X
June    -    -    -    X    X    -
July    X    -    X    -    -    -
Aug     X    X    -    X    -    -

Looking at the table column-wise, you can find two unique numbers, 18 and 19; if Bernard (‘Date Guy’) gets those, he will quickly identify the birthday (as June 18 and May 19 are the only possibilities). But Albert (Month guy) says he knew Bernard did not know. The only way Albert can say this with confidence is because May and June were not the months he got as the clue.  Let’s remove those two months from the table.

        14   15   16   17   18   19
May     -    -    -    -    -    -
June    -    -    -    -    -    -
July    X    -    X    -    -    -
Aug     X    X    -    X    -    -

Now, Bernard got the message. He says he knows it now. This means that the number he got was not 14; otherwise, he could not have chosen between July 14 and August 14. Remove those as well.

        14   15   16   17   18   19
May     -    -    -    -    -    -
June    -    -    -    -    -    -
July    -    -    X    -    -    -
Aug     -    X    -    X    -    -

Albert can only say he now knows if the month is July, the only month left with a single date. Cheryl’s birthday is July 16.
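
The three eliminations can also be run mechanically. The following is a minimal R sketch of that reasoning; the helper unique_once and the variable names are illustrative.

```r
# Candidate dates from the puzzle
dates <- data.frame(
  month = c("May", "May", "May", "June", "June", "July", "July", "Aug", "Aug", "Aug"),
  day   = c(15, 16, 19, 17, 18, 14, 16, 14, 15, 17),
  stringsAsFactors = FALSE
)

# Values that occur exactly once in a vector
unique_once <- function(v) v[!(duplicated(v) | duplicated(v, fromLast = TRUE))]

# Albert knows Bernard cannot know: drop every month containing a unique day
bad_months <- unique(dates$month[dates$day %in% unique_once(dates$day)])
step1 <- dates[!dates$month %in% bad_months, ]

# Bernard now knows: his day must be unique among the remaining dates
step2 <- step1[step1$day %in% unique_once(step1$day), ]

# Albert now knows: his month must be unique among the remaining dates
step3 <- step2[step2$month %in% unique_once(step2$month), ]
step3
```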

Cheryl’s Birthday: Wiki


Birthday Problem – The UK Example

We have simulated the birthday problem, assuming that births are equally likely to occur throughout the year. However, we saw from the UK example that the actual births do not always follow this assumption and can vary.  

So, we want to know the impact of this deviation on the chance that people in a group share a birthday. We repeat the simulation but incorporate the actual probabilities. Here is a comparison for a group of 23. In the first case, we used the actual probabilities (average daily births), and in the second case, we assumed that all days are equally likely.

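library(magrittr)  # provides the %>% pipe used below

birth_day <- function(people, iterations, probs){
  birth <- replicate(iterations, {
    # draw one birthday per person, weighted by the daily probabilities
    days <- sample(1:366, people, replace = TRUE, prob = probs)
    # 1 if at least two people share a birthday, 0 otherwise
    duplicated(days) %>% max()
  })
  mean(birth)
}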

birth_day(23, 10000, B_data$average)
birth_day(23, 10000, rep(1/366, 366))
0.5079
0.5026

Like before, we can scan the whole spectrum of groups and compare the theory with the actual. Unfilled circles represent the theory, and the red line represents the actual.

Reference

How popular is your birthday?: ONS


More about the Birthday Problem

We have seen the birthday problem earlier. A group of 23 has about a 50% chance that at least two members share a birthday.

The R code that scans the probability of two people sharing a birthday against the number of people in the group is given by:

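library(magrittr)  # provides the %>% pipe used below

birth_day <- function(people, iterations, probs){
  birth <- replicate(iterations, {
    # draw one birthday per person, weighted by the daily probabilities
    days <- sample(1:366, people, replace = TRUE, prob = probs)
    # 1 if at least two people share a birthday, 0 otherwise
    duplicated(days) %>% max()
  })
  mean(birth)
}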

itr <- 80
b_day <- rep(0, itr)

for (i in 1:itr) {
   b_day[i] <- birth_day(i, 10000, rep(1/366, 366))
}

You don’t really need to write such lengthy code, as R has a built-in function, ‘pbirthday’, that calculates the birthday probability.

for (i in 1:80) {
  b_day[i] <- pbirthday(i, classes = 366, coincident = 2)
}
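
For comparison, the exact probability under the equally-likely assumption is 1 minus the chance that all n birthdays are distinct. The sketch below is illustrative only (the helper name exact_birthday is not from the original post) and checks it against pbirthday for a group of 23 with 366 possible days.

```r
exact_birthday <- function(n, classes = 366) {
  # Chance that all n birthdays differ: (c/c) * ((c-1)/c) * ... * ((c-n+1)/c)
  p_all_distinct <- prod((classes - seq_len(n) + 1) / classes)
  1 - p_all_distinct
}

exact_birthday(23)
pbirthday(23, classes = 366, coincident = 2)
```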

The two assumptions used in this work are that birthdays are independent and equally likely. However, these are not guaranteed to be true in real life. Here is an example of how births are distributed: the average daily births in England and Wales from 1995 to 2014.

Some months, such as July, September and October, are more popular than others for giving birth. But does it make the probabilities of the birthday problem different from the theory? We will see that next. 

Reference

How popular is your birthday?: ONS


Stars, Bars and Cookies 

How many ways are there to distribute 7 cookies to 4 kids? Here are two things to help you. First, it is perfectly OK for someone to get no cookie. Second, the cookies are identical, so the order in which they are handed out doesn’t matter. Imagine the third child getting two cookies; whether that happens at the beginning, in the middle or at the end makes no difference.

The problem is solved using the famous ‘stars and bars’ method. Step one is to box cookies into four compartments, each representing a child. Suppose kid 1 gets one, kid 2 gets two, kid 3 gets one, and kid 4 gets three. The representation is: 

* | ** | * | ***

The solution to the problem has now become counting the arrangements of these stars and bars. Since the order doesn’t matter, it is a ‘combinations’ problem: out of 10 positions (7 stars + 3 bars), choose the three that hold bars (10C3). Equivalently, choose the 7 positions that hold stars (10C7).

10C3 = 10C7 = 120
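
The count can be confirmed by brute force. This minimal R sketch lists every way of giving each of the four kids between 0 and 7 cookies and keeps the cases that add up to 7; the variable names are illustrative.

```r
# Closed form: choose the 3 bar positions out of 10
choose(10, 3)

# Brute force: non-negative integer solutions of k1 + k2 + k3 + k4 = 7
grid <- expand.grid(k1 = 0:7, k2 = 0:7, k3 = 0:7, k4 = 0:7)
n_ways <- sum(rowSums(grid) == 7)
n_ways
```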


The Necktie Paradox

Two friends, Andy and Boris, are making a bet about their neckties, which their spouses gifted them (so they don’t know the price). The bet goes like this: each will call the spouse, and whoever has the cheaper tie wins and gets the other person’s more expensive tie. 

Andy reasons that he has a 50/50 chance of winning the bet. If he loses, he loses the value of his tie. If he wins, he wins more than the value of his tie. In other words, there is a 50% probability he loses x and a 50% chance he gets more than x. So, he should take the bet.

Boris thinks the same. Obviously, both men cannot have the advantage in the same game.

There is a logical error in their reasoning. Each man treats x, the value of his own tie, as the same amount in both outcomes. But he loses only when his tie is the more expensive one and wins only when his tie is the cheaper one, so the loss is not smaller than the gain. Suppose a tie costs either $20 or $40. Since neither knows the prices at the time of the bet, there are four possibilities with a 25% chance each.

Andy has a $40 tie, and Boris has a $20 tie – Andy’s gain: – $40
Andy has a $20 tie, and Boris has a $40 tie – Andy’s gain: + $40
Andy has a $40 tie, and Boris has a $40 tie – Andy’s gain: 0
Andy has a $20 tie, and Boris has a $20 tie – Andy’s gain: 0

The expected value is: 0.25 x -40 + 0.25 x 40 + 0.25 x 0 + 0.25 x 0 = 0. The same goes for Boris. 
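
The four cases can be enumerated in a few lines of R. This is a minimal sketch under the same assumption as above (each tie is equally likely to cost $20 or $40); the variable names are illustrative.

```r
prices <- c(20, 40)

# All four equally likely (Andy, Boris) price combinations
cases <- expand.grid(andy = prices, boris = prices)

# The cheaper tie wins the other tie; equal prices mean nothing changes hands
cases$andy_gain <- with(cases,
  ifelse(andy < boris, boris, ifelse(andy > boris, -andy, 0)))
cases$boris_gain <- -cases$andy_gain

mean(cases$andy_gain)    # expected gain for Andy
mean(cases$boris_gain)   # expected gain for Boris
```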


About the Shrinking Middle Class

The term ‘middle class’ is ubiquitous in economics, sociology and politics. Yet, most people who hear it don’t know what it means! The term originated from the need to fill the void between the wealthy upper class and the poor lower class. We attempt to understand the concept quantitatively and will use a US viewpoint.

There is no single definition for the household income range required to qualify as middle-class. The first approach was to form intuition about middle-class income. Because of this, it varied from individual to individual. To some, it was $24,000 to $96,000, whereas to others, it was $20,000 to $200,000.

The second method was to divide into income quintiles (5 sets of 20%) and give the top 20% to the upper and the bottom 20% to the lower class. The remaining (middle) three quintiles (20% to 80%) are middle class.

Pew Research Center follows a different definition. It takes the median national household income for a family of four and constructs the middle-class income range to be between 67% and 200% of it. For example, if the median household income is $70,000, the middle class becomes between $47,000 and $140,000. But why is it shrinking?

The Pew Research Center found that between 1971 and 2021, the share of adults in middle-income households decreased from 61% to 50%. Below is a closer view of the data.

Median household income    1971        2020        % increase
Upper-income               $130,008    $219,572    69%
Middle-class               $59,934     $90,131     50%
Lower-income               $20,604     $29,963     45%
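
The percentage increases can be recomputed from the medians. This minimal R sketch takes the middle-class median as rising from $59,934 to $90,131, which is the direction the stated 50% increase implies; the column names are illustrative.

```r
income <- data.frame(
  tier  = c("Upper-income", "Middle-class", "Lower-income"),
  y1971 = c(130008, 59934, 20604),
  y2020 = c(219572, 90131, 29963)
)

# Percentage increase from 1971 to 2020, rounded to the nearest percent
income$pct_increase <- round(100 * (income$y2020 / income$y1971 - 1))
income
```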

The data suggest that upper-income households grew their income faster than the middle, and lower-income households more slowly. This would mean expansion of the top and bottom brackets, naturally, at the expense of the middle.

Share of adults    1971    2020
Upper-income       14%     21%
Middle-class       61%     50%
Lower-income       25%     29%

References

Middle Class: Wiki
How the American middle class has changed in the past five decades: Pew Research Center
Steven Pressman, Defining and Measuring the Middle Class, Working Paper 007, American Institute for Economic Research.


Cavaliers AND LeBron AND Playoff

We have seen one extreme of the AND rule of probability, where people fail to realise how a conjunction makes events rarer. A well-known case is the Linda problem. Here is the pictorial representation of the AND rule, which combines three events.

The shaded region shows the joint probability of A, B, and C. As the number of events increases, the ‘common area’ shrinks. There is another extreme case of conjunction fallacy, typically used by journalists. Read the following title that appeared in CBS Sports. 

Cavaliers win first playoff series without LeBron James since 1993 by taking Game 7 over Magic

The writer has combined Cleveland’s playoff entry, the first-round victory, and LeBron’s absence, making it a ‘rare’ sensational event.

1. Is this the first playoff entry? No, this young Cavaliers team has been playing well. They were also in the playoffs last season (2022–23) but lost to the Knicks.
2. So this must be the first series win (ever)? No, they recently made four consecutive NBA Finals appearances (2014-15, 2015-16, 2016-17, 2017-18). Note that these are not just single series victories; we are talking about the championship finals, four times!
3. But this happened after a long time? No, the team won its first-round series in 2007-08, 2008-09 and 2009-10.
4. Then, how can I make it a rare event? Find things in common and start subtracting them. Win, LeBron, Decade, the list goes on.
