Data & Statistics

What is in the Envelope

Here is a puzzle. A 200 EURO note is in one of two envelopes, A and B. If you guess the right one, you get the cash. Additionally, you can buy a clue, which works as follows: there is a jar with five balls in it, three of them carrying the letter (A or B) of the envelope that holds the note and two carrying the other letter. You can pick one ball at random if you like. The price of the clue is 25 EURO. The questions are:

1) Is the clue worth 25 EURO?
2) If not, what is the maximum amount you would be willing to pay?
3) Would you be willing to pay for a second clue and pick up another ball?

Let’s answer the first question. The expected value of the guess without taking any clue is 0.5 x 200 + 0.5 x 0 = 100 EURO, because there is a 50-50 chance that your guess turns out right. What is the expected value of the guess with the first clue? It is 0.6 x 200 + 0.4 x 0 = 120 EURO. When you pick one ball, there is a 60% (0.6) chance that it carries the right letter (3 out of 5) and a 40% chance that it carries the wrong one.

Therefore, the maximum added value of going for the clue is 120 – 100 = 20 EURO. So, the answer to the first question is NO, and the second is 20 EURO.
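
The arithmetic behind the first two answers can be checked with a few lines of R. This is only a sketch; the variable names are illustrative and do not come from the original puzzle.

```r
# Envelope puzzle: value of the first clue (variable names are illustrative)
prize       <- 200    # EURO in the winning envelope
p_blind     <- 0.5    # chance of a correct guess without a clue
p_with_clue <- 3 / 5  # chance the drawn ball carries the right letter

ev_blind     <- p_blind * prize       # expected value without the clue
ev_with_clue <- p_with_clue * prize   # expected value when following the ball
clue_value   <- ev_with_clue - ev_blind

clue_value        # the most the clue is worth
clue_value >= 25  # is it worth the asking price?
```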

What about a second pick? To answer this, we will need to compute several conditional probabilities using our favourite Bayes’ rule, which we’ll do next.

Is Extra Information Helpful? A Probability Puzzle: William Spaniel

The Pirate Problem

The pirate chief and his four mates found 100 gold coins and wanted to divide them among themselves. As expected, they are perfectly rational and strategic people. Here are a few rules.

  1. The leader can propose a division.
  2. If at least half of the team (including the leader) accepts the proposal, it becomes valid.
  3. If not, the chief will be thrown out, the next in line will become the chief, and the game will continue.

So, what should be the chief’s offer to survive?

To find the solution to this problem, we must start from the last pirate and work backwards.

If the last one becomes the chief, he doesn’t need to make any offer and can keep all 100 coins. Simple! But what happens if two pirates remain? Then, the chief can decide not to give anything to the last one, because his own vote is enough to secure approval.

Now move one level up, to three pirates. The chief requires at least two votes but has only one, his own. He also doesn’t want to give away more money than he needs to. Which of the other two pirates is cheaper to buy? There is no point in giving money to the next in line, who will disapprove anyway: he knows he can keep all the coins by becoming the next chief. The last pirate, who would get nothing in that case, will vote for the current chief if he gets at least one coin.

Now, four. The chief needs two votes, so one more than his own. He looks at the other three and works out what would happen if he loses and the second pirate becomes the new chief. If that happens, the third pirate will not get any coin. The third pirate is therefore the cheapest vote to buy: a single coin is enough.

In the last case, the original one with five pirates, the chief needs three votes to survive. One is his own, so he needs to buy two more. There is one cheap way: we know what happens if the proposal fails and the next pirate becomes the chief. That leaves the third and fifth pirates without any coins, and they know it. So the chief buys those two with one coin each and keeps the remaining 98.
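
The backward induction above can be sketched in R. The function name `pirate_split` is hypothetical, and the sketch assumes that a pirate votes for a proposal only when it gives him at least one coin more than he would get if the current chief were thrown out.

```r
# Backward induction for the pirate game (pirate_split is a hypothetical name).
# Assumption: a vote is bought by offering one coin more than the pirate
# would receive if the current chief were thrown out.
pirate_split <- function(n, coins = 100) {
  alloc <- coins                 # a lone pirate keeps everything
  if (n == 1) return(alloc)
  for (k in 2:n) {
    # Votes needed: at least half of k; the chief supplies one himself
    votes_to_buy <- ceiling(k / 2) - 1
    offer <- rep(0, k - 1)       # offers to the k - 1 junior pirates
    if (votes_to_buy > 0) {
      # alloc holds what each junior gets if this chief is removed;
      # buy the pirates who would get the least
      cheapest <- order(alloc)[seq_len(votes_to_buy)]
      offer[cheapest] <- alloc[cheapest] + 1
    }
    alloc <- c(coins - sum(offer), offer)  # chief first, then juniors
  }
  alloc
}

pirate_split(5)  # coins for the chief and then pirates 2 to 5
```

Calling `pirate_split(3)` or `pirate_split(4)` reproduces the intermediate steps of the argument.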

Cournot Duopoly Game 

Cournot Duopoly is an economic strategy game where two firms producing the same type of products compete for the market by controlling their output. It is a simultaneous move game, and there is no collusion.

Let the firms be 1 and 2, choosing to produce quantities q1 and q2 of goods. In this model, they only decide how much to make. The price will be determined by the market using an inverse demand curve. So, price P = a – b x Q, where Q = q1 + q2 and a and b are positive numbers.

The marginal costs of production are C1 and C2, respectively. This gives the profit of firm 1: Profit 1 = revenue – cost = P x q1 – C1 x q1. Similarly, Profit 2 = P x q2 – C2 x q2.

Profit 1 = (a – b x (q1 + q2)) x q1 – C1 x q1
= (a – bq1 – bq2 – C1)q1
Profit 2 = (a – bq1 – bq2 – C2)q2

Nash Equilibrium

To get the Nash equilibrium, we’ll maximise the payoffs (profits) of firm 1 and firm 2 by differentiating Profit 1 with respect to q1 and Profit 2 with respect to q2, and setting the derivatives to zero.

d(Profit 1) / d(q1) = a – 2bq1 – bq2 – C1 = 0
d(Profit 2) / d(q2) = a – bq1 – 2bq2 – C2 = 0
q1 = (a – bq2 – C1) / (2b)
q2 = (a – bq1 – C2) / (2b)

So, q1 is a function of q2 and vice versa. The first equation, a straight line, will look like the following.
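
As a rough numerical sketch, the first best-response line can be traced, and the two first-order conditions solved together, in R. The values of a, b, C1 and C2 below are assumptions chosen only for illustration, and `best_response_1` is a hypothetical helper name.

```r
# Illustrative parameters (assumed, not from the text)
a  <- 100; b  <- 1   # inverse demand P = a - b * Q
C1 <- 10;  C2 <- 20  # marginal costs of firm 1 and firm 2

# Best response of firm 1 to a given q2: a straight line in q2
best_response_1 <- function(q2) (a - b * q2 - C1) / (2 * b)
best_response_1(c(0, 30, 60, 90))

# The two first-order conditions written as a linear system:
#   2b*q1 +  b*q2 = a - C1
#    b*q1 + 2b*q2 = a - C2
A   <- matrix(c(2 * b, b,
                b, 2 * b), nrow = 2, byrow = TRUE)
rhs <- c(a - C1, a - C2)
q_star <- solve(A, rhs)  # quantities (q1, q2) satisfying both conditions
q_star
```

At the solution, each quantity is a best response to the other, which is what makes the pair a Nash equilibrium.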

Road Safety in India – Comparison with the US

In the last three posts, we have been looking at the statistics of road accidents in India. It would be interesting at this stage to compare that with the US.

Parameter                                                 The US        India
Population (mln)                                          330           1321
Fatalities                                                38,824        131,714
Fatalities per million population                         117.65        100
Injured persons                                           2,282,015     348,729
Injured per million population                            6915          263
Crashes / accidents                                       6,393,624     366,138
Accidents per million population                          19374         277
Survival probability: injured / (injured + fatalities)    0.98          0.72
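
The derived rows of the table follow from the raw counts. A minimal sketch in R, with a data frame and column names that are illustrative only; the computed values may differ from the table in the last digit because of rounding.

```r
# Raw counts from the table (data frame and column names are illustrative)
road <- data.frame(
  country    = c("US", "India"),
  pop_mln    = c(330, 1321),
  fatalities = c(38824, 131714),
  injured    = c(2282015, 348729),
  accidents  = c(6393624, 366138)
)

# Rates per million population
road$fatal_per_mln    <- road$fatalities / road$pop_mln
road$injured_per_mln  <- road$injured / road$pop_mln
road$accident_per_mln <- road$accidents / road$pop_mln

# Survival probability = injured / (injured + fatalities)
road$survival <- road$injured / (road$injured + road$fatalities)

print(road)
```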

Road Safety in India – Survival rate

In the final episode of accident data analysis, we will go into the remaining key stats – injuries and fatalities – and postulate a potential problem with the interpretation, i.e. data registration. But first, a plot of the number of injuries per population.

Kerala’s figure is 33% higher than that of its nearest rival, almost suggesting it is the most dangerous state for a passenger. But is that entirely true? Let’s see the following statistic – the fatalities per 100,000 population.

Strangely, Kerala moves down to 16th place. Puducherry, which is third in injuries, also goes down. To understand this better, let’s define survival rate = the number of injured / (number of injured + number of dead).

Yes, Kerala has a > 90% survival chance after an accident. It may indicate a few things:
1) Kerala has better accident care for the injured (that prevents them from dying)
2) Kerala has a higher proportion of low-intensity accidents than other states
3) Kerala’s registration system is more thorough in recording incidents, so the higher survival rate is an artefact of a higher reporting rate for all incidents, however minor they may be.

Not so fast

Just when you are about to conclude that it is all down to data collection, here is another statistic: the proportion of grievously injured people among the total injured.

Almost 75% of the injured are seriously injured. So to conclude, Kerala remains one of the most dangerous states for road safety, but most of the injured are somehow saved, despite the severity.

Road Safety in India – Dangerous States

One of the rather unfortunate aspects of statistics is that they don’t say why something has happened. They also can’t reveal data quality, making it difficult to compare different entities. That leaves the burden of interpretation in the hands of the (responsible) reporter. Not always a desirable combination! With this introduction, let’s continue with the road safety data. This time we go deeper into state-level statistics.

Number of Accidents

Does this make Goa the most accident-prone region? Not necessarily. It is one of the smaller states in India, with a population of about 1.4 mln. The same goes for Puducherry, at number four, with a quarter of a million. If you want to know the difficulties of interpreting data from a smaller population, read this post. Another factor is the incident reporting system. It may not be a coincidence that the top four regions are also known for better data recording, with three of the four (Kerala, Goa and Puducherry) in the top five of the human development index. We’ll come back to this a bit later.

The same statistics on a different basis – the number of accidents per 10,000 vehicles – are below:

Before we move on, let’s try to understand whether the top candidates can be explained by their vehicle density, i.e. vehicles per person. For that, we divide accidents per 100,000 population by accidents per 10,000 vehicles and then divide the result by 10.
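
To see why this ratio gives vehicles per person, here is a small sketch in R. The population, vehicle and accident counts are made-up round numbers for a single hypothetical region, not values from the data set.

```r
# Made-up figures for one hypothetical region (not from the data set)
population <- 33e6    # people
vehicles   <- 16.5e6  # registered vehicles
accidents  <- 40000   # accidents in a year

acc_per_100k_pop <- accidents / population * 1e5
acc_per_10k_veh  <- accidents / vehicles * 1e4

# The accident count cancels out; dividing by 10 fixes the 100,000 vs 10,000 bases
vehicle_density <- acc_per_100k_pop / acc_per_10k_veh / 10
vehicle_density
vehicles / population  # the same quantity, computed directly
```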

Yes, the top regions of the previous plot (Sikkim, Madhya Pradesh and Jammu & Kashmir) are way down in this plot. Again, the statistics of smaller samples. That leaves one curious entity that we haven’t addressed so far: Kerala, which has been near the top throughout and is small neither in population (33 million) nor in vehicle density (~ 0.5). More about this coming up next.

The R code used for building the plots is below:

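library(dplyr)    # provides the %>% pipe
library(ggplot2)

# Horizontal bar chart of accidents per population, states ordered by value
state_data %>% 
  ggplot(aes(x = reorder(State, Acci_per_Pop), y = Acci_per_Pop, fill = State)) + 
  geom_col() +    # geom_col() uses the values as bar heights; a second bar layer is not needed
  coord_flip()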

HDI of Indian States: Wiki

Road Safety in India

One of the reasons statistics have a poor reputation in society is the way commentators tell incomplete stories. Typically, data can hold multiple layers of truth; not all are evident from the descriptions. In the next few posts, we will try and understand how road safety has performed in India in the last 50 years.

Road Accidents

The number of accidents has been increasing but shows a slight turnaround in the last decade.

The Number of Fatalities

Surely, the numbers are stabilizing but not decreasing. We need to go deeper into any confounding effects, such as population change or any growth in the number of vehicles.

Risk to a person

So, the risk to an average person remains high though it has stabilized in recent times. The next question is if road travel has become more dangerous.

Risk to a passenger

In the basic sense, it is just a reflection of the exponential growth of vehicles – the base or denominator – in the last few years. In other words, the threat to life has not increased proportionally to the increase in the number of vehicles. One can also argue that automobiles are becoming better in safety performance.

Subscribing Irrationality

We have seen the role of expected values as a rational means of making decisions. Or the expected utility in other cases. But life is not as simple as a textbook example. Life never presents situations such as betting on the number on a die, or an 80% chance of $45 vs a sure-shot $30, where someone can estimate the value arithmetically. It gives options on products with price tags. But how is the value of a product visible to the decision-maker?

Dan Ariely, the author of Predictably Irrational, discusses this dilemma and concludes that most humans like to have a reference and use a value based on relativity. Be it the price of a meal or a television, we need something to relate to before choosing an option. And the sellers know that very well and try to use it in pricing their products. Here is one possible example I encountered this morning – the subscription offers of The Atlantic magazine.

Select your plan

First, the big picture: here is what you see on the website:

There are three options: online, online + print and online + print + something else! We shall come to that something else sometime later. Imagine if the choice was between the two options, digital and digital + print:

As seen in various studies, the aspiring subscriber makes a comparison and may go for the second most expensive option. She may further justify her choice of the online version as a new way of working in the digitalised world.

The third option stands out in three ways:
It is more expensive – thrice the difference between the first two
It is visibly distinct – a three-digit whole number vs two-digit fractions with deception (e.g. 79.99 sounding like 70 instead of 80)
It has repeated mentions of the word ‘free’: likely a lure for the emotional few.

Let’s do a few hypothetical calculations to demonstrate the expected value (to the seller).

Case 1: two options – 80% for option 1 and 20% for option 2. The seller’s earnings per subscription = 0.8 x 80 + 0.2 x 90 = 82.
Case 2: three options and no ‘free’ – 60% for option 1 and 40% for option 2. Earnings per subscription = 0.6 x 80 + 0.4 x 90 = 84.
Case 3: three options and ‘free’ – 60% for option 1, 30% for option 2 and 10% for option 3. Earnings per subscription = 0.6 x 80 + 0.3 x 90 + 0.1 x 120 = 87.
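
The three cases are weighted averages of the prices, which a short R sketch makes explicit. The rounded prices 80, 90 and 120 and the hypothetical shares are those used in the cases above; the object names are illustrative.

```r
# Rounded prices of the three subscription options
prices <- c(80, 90, 120)

# Hypothetical shares of subscribers choosing each option in each case
shares <- list(
  case1 = c(0.8, 0.2, 0.0),  # two options on offer
  case2 = c(0.6, 0.4, 0.0),  # three options, no 'free'
  case3 = c(0.6, 0.3, 0.1)   # three options with 'free'
)

# Expected earnings per subscription = sum of share x price
earnings <- sapply(shares, function(s) sum(s * prices))
earnings
```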

Dan Ariely, Predictably Irrational

Hypergeometric Distribution – Picking Without Replacement

‘Picking without replacement’ is the key phrase for understanding the hypergeometric probability distribution. Here is another example: 30 names, 10 girls and 20 boys, are put in a sorting hat, and five are randomly selected for top prizes. What is the probability that four girls and one boy will win the honours?

Needless to say, it is a game without replacement. We know how to do such problems, as we have done a few earlier using the combinations formula. Multiply the number of ways of picking 4 girls from 10 by the number of ways of picking 1 boy from 20, and divide by the total number of combinations of 5 from 30.

P(\textrm{4 girls and 1 boy}) = \frac{{}_{10}C_4 \times {}_{20}C_1}{{}_{30}C_5}

(10!/(4!6!)) x (20!/(1!19!)) / (30!/(5!25!))
= ((10 x 9 x 8 x 7) / (4 x 3 x 2)) x 20 / ((30 x 29 x 28 x 27 x 26) / (5 x 4 x 3 x 2))
= (5 x 4 x 3 x 2 x 20 x 10 x 9 x 8 x 7) / (4 x 3 x 2 x 30 x 29 x 28 x 27 x 26)
= (5 x 10 x 2 x 7) / (3 x 29 x 7 x 3 x 13)
= 700 / 23751

choose(10,4)*choose(20,1) / choose(30,5)

Or simply,

dhyper(4, 10, 20, 5, log = FALSE)

There is a 2.95 % (0.02947244) chance that it can happen this way!

Hypergeometric Distribution

The hypergeometric distribution is a discrete distribution for drawing without replacement, which makes it well suited to estimating probabilities in card games. For example, what is the probability distribution of the number of spades in a five-card poker hand? Before getting into the formula, we’ll see how R estimates it.

dhyper(x, m, n, k, log = FALSE)

For zero occurrence of spades after drawing five cards without replacement,
x: number of spades = 0
m: number of spades in the deck = 13
n: number of other cards in the deck = total cards – m = 52- 13 = 39
k: number of cards drawn from the deck = 5

dhyper(0, 13, 39, 5, log = FALSE)

Here is the distribution of spades in a five-card poker hand.
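
As a sketch, the whole distribution can be tabulated by passing the vector 0:5 as x to dhyper, with the same m, n and k as above. The object names `spades` and `prob` are illustrative.

```r
# Probability of 0 to 5 spades in a five-card hand
spades <- 0:5
prob   <- dhyper(spades, m = 13, n = 39, k = 5)

# One row per possible number of spades
data.frame(spades = spades, probability = round(prob, 4))

sum(prob)  # the probabilities of all possible counts add up to 1
```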
