Data & Statistics

Friendship Paradox

Your friends, on average, have more friends than you do! That is a difficult fact to swallow, so we will explore it mathematically. On the one hand, it follows from what we have seen before – the inspection paradox and the waiting-time paradox – but we will use a different approach here.

Count your friends

Consider the following relationship tree.

What it means is that A has only one friend (i.e., B), denoted by A(1; B). Similarly B(4; A, C, E, H), C(4; B, D, E, H), D(2; C, H), E(3; B, C, F), F(2; E, G), G(1; F), H(3; B, C, D). So the total number of friends among those eight is 1 + 4 + 4 + 2 + 3 + 2 + 1 + 3 = 20. The average number of friends, therefore, is 20/8 = 2.5.

And their friends

How do we do it? The easier way is to ask each of them to count their friends' friends. Take A: she asks her only friend, B, how many friends B has. B has four friends (note that this count includes A). Let us represent that as A{B(4)}. Similarly, B{A(1), C(4), E(3), H(3)}, C{B(4), D(2), E(3), H(3)}, D{C(4), H(3)}, E{B(4), C(4), F(2)}, F{E(3), G(1)}, G{F(2)}, H{B(4), C(4), D(2)}. The total is 60. To calculate the average number of friends' friends, divide 60 by the total number of friends calculated earlier, i.e., 20: 60/20 = 3.

So by counting, you have shown that the average number of friends (2.5) is smaller than the average number of friends' friends (3). The whole exercise is summarised in the following table.

Individual | Friends (P) | Friends' friends (Q) | Mean friends' friends (Q/P)
A          | 1           | 4                    | 4
B          | 4           | 11                   | 2.75
C          | 4           | 12                   | 3
D          | 2           | 7                    | 3.5
E          | 3           | 10                   | 3.33
F          | 2           | 4                    | 2
G          | 1           | 2                    | 2
H          | 3           | 10                   | 3.33
Total      | 20          | 60                   | 23.91
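The counts above can be verified programmatically. Below is a quick sketch (in Python, rather than the R used elsewhere on this site) that rebuilds the table's totals from the friendship links:

```python
# Friendship links from the diagram; each list holds that person's friends.
friends = {
    "A": ["B"],
    "B": ["A", "C", "E", "H"],
    "C": ["B", "D", "E", "H"],
    "D": ["C", "H"],
    "E": ["B", "C", "F"],
    "F": ["E", "G"],
    "G": ["F"],
    "H": ["B", "C", "D"],
}

# Column P total: everyone's friend count; column Q total: their friends' friend counts.
total_friends = sum(len(f) for f in friends.values())
total_fof = sum(len(friends[name]) for f in friends.values() for name in f)

avg_friends = total_friends / len(friends)  # 20 / 8 = 2.5
avg_fof = total_fof / total_friends         # 60 / 20 = 3.0
print(avg_friends, avg_fof)
```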

Analytical Proof

Look at the diagram once more. We replace the number of friends that A has with dA (dA is the degree of the vertex representing person A).

\\ \text{In other words, the total number of friends here is } \\\\ d_A + d_B + d_C + d_D + d_E + d_F + d_G + d_H =  \Sigma{x_i}

which we know is 20. The average number of friends is obtained by dividing this by the total number of individuals, n.

\\ \frac{d_A + d_B + d_C + d_D + d_E + d_F + d_G + d_H}{n} = \frac{\Sigma{x_i}}{n}

Now, we will move to the total friends of friends.

\\ \text{the number of friends that A's friends have = } d_B \\ \\  \text{the number of friends that B's friends have = } d_A + d_C + d_E + d_H \\ \\  \text{the number of friends that C's friends have = } d_B + d_D + d_E + d_H \\ \\  \text{the number of friends that D's friends have = } d_C + d_H \\ \\  \text{the number of friends that E's friends have = } d_B + d_C + d_F \\ \\  \text{the number of friends that F's friends have = } d_E + d_G \\ \\  \text{the number of friends that G's friends have = } d_F \\ \\  \text{the number of friends that H's friends have = } d_B + d_C + d_D \\ \\  \text{Total number of friends that all friends have = } d_B + d_A + d_C + d_E + d_H + d_B + d_D + d_E + d_H + d_C + d_H + d_B + d_C + d_F + d_E + d_G + d_F + d_B + d_C + d_D \\ \\ = d_A + 4 d_B + 4 d_C + 2 d_D + 3 d_E + 2 d_F + d_G + 3 d_H \\ \\ \text{the average number of friends of friends = } \frac{d_A + 4 d_B + 4 d_C + 2 d_D + 3 d_E + 2 d_F + d_G + 3 d_H}{d_A + d_B + d_C + d_D + d_E + d_F + d_G + d_H}

which was 60/20 in our case. If you are confused, remember these:

the total number of individuals = n
the avg. number of friends of individuals = total number of friends / total number of individuals
the avg. number of friends of friends = total number of friends of friends / total number of friends

Back to the equations.

\\ \text{the average number of friends of friends = } \\\\ \frac{d_A + 4 d_B + 4 d_C + 2 d_D + 3 d_E + 2 d_F + d_G + 3 d_H}{d_A + d_B + d_C + d_D + d_E + d_F + d_G + d_H}

Look carefully: dA appears once in the numerator, and A has one friend (B); dB appears four times, and B has four friends, and so on. This is no accident: each person's friend count is counted once for every time they appear in someone's friends list, i.e., dX times. Apply this rule to the equation, and you get:

\\ \text{the average number of friends of friends = } \\\\ \frac{d_A*d_A + d_B*d_B + d_C*d_C + d_D*d_D + d_E*d_E + d_F*d_F + d_G*d_G + d_H*d_H}{d_A + d_B + d_C + d_D + d_E + d_F + d_G + d_H} \\ \\ = \frac{d_A^2 + d_B^2 + d_C^2 + d_D^2 + d_E^2 + d_F^2 + d_G^2 + d_H^2}{d_A + d_B + d_C + d_D + d_E + d_F + d_G + d_H} \\ \\ = \frac{\Sigma{x_i^2}}{\Sigma{x_i}} \\ \\ \text{divide the numerator and the denominator by n} \\ \\  =  \frac{\Sigma{x_i^2}/n}{\Sigma{x_i}/n} \\ \\ \text{add and subtract } (\Sigma{x_i})^2/n^2 \text { at the numerator} \\ \\  =  \frac{[\Sigma{x_i^2}/n +  (\Sigma{x_i})^2/n^2  -   (\Sigma{x_i})^2/n^2]}{\Sigma{x_i}/n} \\ \\ =  \frac{[\Sigma{x_i^2}/n  -   (\Sigma{x_i})^2/n^2]}{\Sigma{x_i}/n} +  \frac{[(\Sigma{x_i})/n]^2}{\Sigma{x_i}/n} \\ \\ =  \frac{[\Sigma{x_i^2}/n  -   (\Sigma{x_i})^2/n^2]}{\Sigma{x_i}/n} + {\Sigma{x_i}/n}

The second term, we know from the earlier section, is the average number of friends of individuals. The first term is nothing but the variance divided by the mean.

The mean number of friends of friends = (mean of friends) + (variance / mean), which is greater than or equal to the mean of friends, since the variance is never negative.

References

Feld, S. L., Why do your friends have more friends than you do? American Journal of Sociology (1991)

The friendship paradox: MIT Blossoms


Waiting-Time Paradox

Have you ever wondered why you always seem to wait longer than expected at the bus stop? If you feel that way but suspect it is all in your head, hold on a bit. Your feeling may well be valid and can be explained using probability theory.

Remember the inspection paradox?

See the previous post if you don’t know what I am talking about. The waiting-time paradox is a variation of the inspection paradox, and we will see how, so brush up on your probability and expected-value basics. As a one-line summary, the expected value is a probability-weighted average.

A bus every 10 minutes

You know a bus for your destination comes every 10 minutes at a bus stop – in other words, six buses every hour. You start with the premise that the average waiting time is five minutes, assuming you reach the stop at a random time. One day you arrive a minute before the next bus (waiting time = 1 min); another day, a minute after the last one (waiting time = 9 min), and so on.

Do not forget that buses come with uncertainties of their own (unless your pick-up is the starting station). Now, let’s get the arrival times of this bus at a stop some distance after the starting point. I could make them up for illustration but will instead draw random time intervals between consecutive buses from a Poisson distribution.

Here they are, made using the R code rpois(6, 10): 10, 10, 11, 3, 10, 16. These are the randomly created gaps, in minutes, between consecutive buses within an hour. The call rpois(n = sample size, lambda = expected interval) generated six intervals with a mean value of 10.

The average waiting times inside these six slots are 10/2, 10/2, 11/2, 3/2, 10/2 and 16/2. You know what happens next.

Probabilities and expectations

Compute the probability of arriving during each slot of that hour. They are 10/60, 10/60, 11/60, 3/60, 10/60 and 16/60. Each numerator is the length of a slot, the gap between two consecutive buses, so a longer slot is proportionally more likely to catch you, as described in the previous paragraph.

The expected value (average waiting time) = Probability x average waiting time corresponding to that probability = (10/60) x (10/2) + (10/60) x (10/2) + (11/60) x (11/2) + (3/60) x (3/2) + 10/60 x (10/2) + (16/60) x (16/2) = 5.72.
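The same expectation can be computed in a few lines. Here is an illustrative Python sketch using the six gaps quoted above:

```python
# Six random gaps (minutes) between consecutive buses, taken from the post.
gaps = [10, 10, 11, 3, 10, 16]

total = sum(gaps)  # the 60 minutes of the hour
# P(landing in a slot) = slot length / 60; mean wait inside a slot = length / 2.
expected_wait = sum((g / total) * (g / 2) for g in gaps)
print(round(expected_wait, 2))  # 5.72
```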

Average is > 5 minutes

The following code summarises the whole exercise.

number_of_buses <- 100
avg_interval <- 10  # mean gap between consecutive buses, in minutes

bus_gaps <- rpois(number_of_buses, avg_interval)  # random gaps between buses
prob_bus_slot <- bus_gaps / sum(bus_gaps)         # chance of arriving in each slot
avg_wait_slot <- bus_gaps / 2                     # mean wait within a slot
expected_wait <- sum(prob_bus_slot * avg_wait_slot)


Inspection Paradox

Suppose a school has three classes, A, B and C, with 20, 60 and 100 students, respectively. What is the average number of students in a class? Well, it depends on whom you ask. Let’s understand this more deeply.

If you ask the school principal, she will give you 60 as the average number. It’s simple, (20 + 60 + 100) / 3 = 60. But what happens if you choose to sample 50 students and ask them about the number of students in their classes and then average it?

You wait at the school bus stop and randomly select 50 students. Since the sampling is random, you are likely to catch the following numbers from each class.

The probability of finding a student from class A = (20/180). Therefore, the number of students from class A in the total sample of 50 = 50 x (20/180) = 5.55, about 6.

Similarly, from class B, you catch 50 x (60 / 180) = 16.66 or about 17 students, and from class C, 50 x (100 / 180) = 27.77 or about 28 students.

So what are the responses from the students? About 5.55 of them will say there are 20 students in their class because they are from class A; 16.66 will say 60, and 27.77 will say 100. So the survey average is (5.55 x 20 + 16.66 x 60 + 27.77 x 100) / (5.55 + 16.66 + 27.77) = 3887.6 / 50 = 77.8.

So from the school, you get 60, and from the survey, 78, and no one is lying.
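The two averages follow directly from the class sizes; here is a short Python sketch of the arithmetic:

```python
class_sizes = [20, 60, 100]
total = sum(class_sizes)  # 180 students in the school

principal_avg = total / len(class_sizes)  # simple mean over classes
# Each student reports their own class size, so a class of n contributes n responses.
survey_avg = sum(n * n for n in class_sizes) / total

print(principal_avg, round(survey_avg, 1))  # 60.0 77.8
```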


Wisdom of the Crowd

Wisdom of the crowd is an intriguing phenomenon. It gained popularity in public discourse through Galton, who found that the public’s average guess of the weight of a prize-winning ox was surprisingly close to the true value. In other words, the errors in individuals’ guesses cancelled out when averaged. An essential prerequisite for achieving the group’s wisdom is that the individuals’ errors are statistically independent of each other.

Vul and Pashler took this to the next level. They wanted to test it for single individuals by allowing them to make multiple guesses. They asked 428 participants (through an internet poll) eight questions regarding general knowledge. Half of the participants were asked to make a second guess immediately after the questionnaire, and the other half were asked after three weeks (without telling them it would come after three weeks).

The results were surprising. The mean squared error of the within-person average of the two guesses was lower than that of either guess alone for the immediate responders, and lower still for the late responders.


One-Son Policy

You may have heard about the one-child policy. Imagine a country deciding on a one-son policy: a family may keep having children until a boy arrives. The question is: will it lead to a gender-imbalanced society? Intuition suggests that the land will end up with more girls because of this policy. Let’s analyse this by estimating the expected values.

The analytical solution

Assume that the probability of either sex is equal, 1/2. We have seen earlier that if p is the probability of an event, then the expected number of trials (denoted by E) we need to wait for the first success is 1/p. In our case, if N is the number of children in a family with 1/2 as the parameter, E(N) = 1/(1/2) = 2. Let Nm be the number of male children and Nf the number of female children, so N = Nm + Nf and E(N) = E(Nm) + E(Nf) = 2. E(Nm) for a one-son policy regime has to be 1, which means E(Nf) = 2 – 1 = 1. So the expected numbers of males and females are equal! Not convinced? We will illustrate with examples.

Testing the concept

1 child scenario: The simplest case is one child in a family. The child can be M (male) or F (female). p(M) = 1/2, p(F) = 1/2. Nm(M) = 1, Nm(F) = 0, Nf(M) = 0, and Nf(F) = 1.

E(Nm) = Nm(M) x p(M) + Nm(F) x p(F) = 1 x (1/2) + 0 x (1/2) = 1/2.
E(Nf) = Nf(M) x p(M) + Nf(F) x p(F) = 0 x (1/2) + 1 x (1/2) = 1/2.
So the expected values of males and females are equal.

A bit of explanation on the notation:
Nm(M) is the number of boys in an M scenario, and Nm(F) is the number of boys in an F scenario.
Nf(M) is the number of girls in an M scenario, and Nf(F) is the number of girls in an F scenario.

Maximum 2 children in a family: The combinations are M, FM, FF.
p(M) = 1/2, p(FM) = 1/4, p(FF) = 1/4
Nm(M) = 1, Nm(FM) = 1, Nm(FF) = 0
Nf(M) = 0, Nf(FM) = 1, Nf(FF) = 2

E(Nm) = Nm(M) x p(M) + Nm(FM) x p(FM) + Nm(FF) x p(FF) = 1 x (1/2) + 1 x (1/4) + 0 x (1/4) = 3/4.
E(Nf) = Nf(M) x p(M) + Nf(FM) x p(FM) + Nf(FF) x p(FF) = 0 x (1/2) + 1 x (1/4) + 2 x (1/4) = 3/4.
Again, the expected values of males and females are equal.

Maximum 3 children in a family: The combinations are M, FM, FFM, FFF.
p(M) = 1/2, p(FM) = 1/4, p(FFM) = 1/8, p(FFF) = 1/8
Nm(M) = 1, Nm(FM) = 1, Nm(FFM) = 1, Nm(FFF) = 0
Nf(M) = 0, Nf(FM) = 1, Nf(FFM) = 2, Nf(FFF) = 3

E(Nm) = Nm(M) x p(M) + Nm(FM) x p(FM) + Nm(FFM) x p(FFM) + Nm(FFF) x p(FFF)
= 1 x (1/2) + 1 x (1/4) + 1 x (1/8) + 0 x (1/8) = 7/8.
E(Nf) = Nf(M) x p(M) + Nf(FM) x p(FM) + Nf(FFM) x p(FFM) + Nf(FFF) x p(FFF)
= 0 x (1/2) + 1 x (1/4) + 2 x (1/8) + 3 x (1/8)= 7/8.
Once again!

More Proof?

We will do it one final time, where the family waits for a boy but stops at four children. The combinations are M, FM, FFM, FFFM, and FFFF.
p(M) = 1/2, p(FM) = 1/4, p(FFM) = 1/8, p(FFFM) = 1/16, p(FFFF) = 1/16.
Nm(M) = 1, Nm(FM) = 1, Nm(FFM) = 1, Nm(FFFM) = 1, Nm(FFFF) = 0
Nf(M) = 0, Nf(FM) = 1, Nf(FFM) = 2, Nf(FFFM) = 3, Nf(FFFF) = 4

E(Nm) = Nm(M) x p(M) + Nm(FM) x p(FM) + Nm(FFM) x p(FFM) + Nm(FFFM) x p(FFFM) + Nm(FFFF) x p(FFFF)
= 1 x (1/2) + 1 x (1/4) + 1 x (1/8) + 1 x (1/16) + 0 x (1/16) = 15/16.
E(Nf) = Nf(M) x p(M) + Nf(FM) x p(FM) + Nf(FFM) x p(FFM) + Nf(FFFM) x p(FFFM) + Nf(FFFF) x p(FFFF)
= 0 x (1/2) + 1 x (1/4) + 2 x (1/8) + 3 x (1/16) + 4 x (1/16) = 15/16.
Still, there is a balance between boys and girls.
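Rather than enumerating scenarios by hand, the exact expectations can be computed for any cap on family size. This Python sketch (an illustration, not part of the original post) sums over the possible birth sequences under the stopping rule:

```python
def expected_children(max_children):
    """Exact E(Nm) and E(Nf) under a one-son stopping rule with a family-size cap."""
    e_boys = e_girls = 0.0
    # Sequences of j-1 girls followed by a boy, for j = 1 .. max_children.
    for j in range(1, max_children + 1):
        p = 0.5 ** j
        e_boys += p            # exactly one boy in this scenario
        e_girls += p * (j - 1) # j-1 girls before him
    # The all-girls scenario: the family stops at the cap without a boy.
    e_girls += 0.5 ** max_children * max_children
    return e_boys, e_girls

for k in (1, 2, 3, 4):
    print(k, expected_children(k))  # equal expectations at every cap
```

For a cap of 4, both expectations come out to 15/16 = 0.9375, matching the hand calculation above.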


Hot Hands of Randomness

A lot of basketball fans believe in hot hands and streak shooting. These phrases roughly suggest that the probability of a basket is higher following a hit than following a miss. A spectator suddenly gains confidence that a player who made the previous two shots will make the next. At that moment, she is not thinking about probabilities (coin-tossing games) but only about the shooter’s momentum and confidence.

It is interesting to notice that this behaviour of the fans is just the opposite of the belief in “the law of small numbers”, or the gambler’s fallacy. The reason: basketball is played by individuals of flesh and blood (and lucky charms), whereas gambling is done by machines that work on chance (and expected values).

In 1985, Gilovich et al. published a study analysing data from 48 home games of the Philadelphia 76ers and their opponents during the 1980-81 season. The following is what they found.

The first table shows the conditional probability of a hit, given that the player had made the previous 3, 2 or 1 shots. P(hit) represents the overall shooting percentage of the player.

Player     | P(hit) | P(hit|3H) | P(hit|2H) | P(hit|1H)
Richardson | 0.50   | 0.48      | 0.50      | 0.49
Erving     | 0.52   | 0.48      | 0.52      | 0.53
Hollins    | 0.46   | 0.32      | 0.46      | 0.46
Cheeks     | 0.56   | 0.59      | 0.54      | 0.55
Jones      | 0.47   | 0.27      | 0.43      | 0.45
Toney      | 0.46   | 0.34      | 0.40      | 0.43
Jones      | 0.54   | 0.53      | 0.47      | 0.53
Mix        | 0.52   | 0.36      | 0.48      | 0.51
Dawkins    | 0.62   | 0.51      | 0.58      | 0.57

Similar statistics for a player to hit after missing the previous shots are presented below.

Player     | P(hit) | P(hit|3M) | P(hit|2M) | P(hit|1M)
Richardson | 0.50   | 0.50      | 0.47      | 0.56
Erving     | 0.52   | 0.52      | 0.51      | 0.51
Hollins    | 0.46   | 0.50      | 0.49      | 0.46
Cheeks     | 0.56   | 0.77      | 0.60      | 0.60
Jones      | 0.47   | 0.50      | 0.48      | 0.47
Toney      | 0.46   | 0.52      | 0.53      | 0.51
Jones      | 0.54   | 0.61      | 0.58      | 0.58
Mix        | 0.52   | 0.70      | 0.56      | 0.52
Dawkins    | 0.62   | 0.88      | 0.73      | 0.71

The data showed just the opposite of what people believed: hit rates following previous hits were, if anything, lower than hit rates following previous misses.

Gilovich, T.; Vallone, R.; Tversky, A., Cognitive Psychology 17, 295-314 (1985)


Whose Fault was It?

A company gets its tools from three factories: 10% from F1, 50% from F2 and 40% from F3. The fault rates of the three factories are 2.5%, 1.5% and 1%, respectively. If the company finds one of the tools defective, from which factory is it most likely to have come?

Let’s write things down in the proper format and test the probabilities one by one. Let P(F1|D) be the probability that the tool came from F1 given the information (that it is faulty), P(D|F1) the fault rate of F1, and P(F1) the prior share of F1 in the company.

\\ P(F1|D) = \frac{P(D|F1)*P(F1)}{P(D|F1)*P(F1) + P(D|F2)*P(F2) + P(D|F3)*P(F3)}  \\ \\ P(F2|D) = \frac{P(D|F2)*P(F2)}{P(D|F1)*P(F1) + P(D|F2)*P(F2) + P(D|F3)*P(F3)}  \\ \\ P(F3|D) = \frac{P(D|F3)*P(F3)}{P(D|F1)*P(F1) + P(D|F2)*P(F2) + P(D|F3)*P(F3)}

I guess you already know that I applied Bayes’ theorem to each factory. Now, plug in the numbers and solve.

\\ P(F1|D) = \frac{0.025*0.1}{0.025*0.1 + 0.015*0.5 + 0.01*0.4}  = \frac{0.0025}{0.014} = 0.18\\ \\ P(F2|D) = \frac{0.015*0.5}{0.025*0.1 + 0.015*0.5 + 0.01*0.4}  = \frac{0.0075}{0.014} = 0.54\\ \\ P(F3|D) = \frac{0.01*0.4}{0.025*0.1 + 0.015*0.5 + 0.01*0.4}  = \frac{0.004}{0.014} = 0.29

The tool is more likely to have come from F2.
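The same Bayes' theorem arithmetic can be scripted. A minimal Python sketch using the numbers above:

```python
priors = {"F1": 0.1, "F2": 0.5, "F3": 0.4}         # share of tools per factory
fault_rates = {"F1": 0.025, "F2": 0.015, "F3": 0.01}

# Total probability of a defective tool (the denominator in Bayes' theorem).
p_defect = sum(priors[f] * fault_rates[f] for f in priors)  # 0.014

posteriors = {f: priors[f] * fault_rates[f] / p_defect for f in priors}
print({f: round(p, 2) for f, p in posteriors.items()})
# {'F1': 0.18, 'F2': 0.54, 'F3': 0.29}
```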


Coin Tossing and NBA Playoffs

In the previous post, we established the expected probabilities of the possible series lengths (winning margins) in the NBA playoffs. This time, we compare them with past data and test our hypothesis.

The null hypothesis (H0) is that the game outcomes occur with equal probabilities, irrespective of regular-season rank. The alternative hypothesis (H1) is that the outcomes are not equally likely but depend on the teams’ strength.

Matches including round 1

We gather all the playoff outcomes from 2003 (the first year of the best-of-seven format for round one) until 2021. The summary of the observed results and expectations is in the following table.

Series        | Observed (O) | Expected (E)         | (O-E)²/E
4-game series | 52           | 0.125 x 285 = 35.63  | 7.52
5-game series | 74           | 0.25 x 285 = 71.25   | 0.106
6-game series | 99           | 0.3125 x 285 = 89.06 | 1.109
7-game series | 60           | 0.3125 x 285 = 89.06 | 9.48
Total         | 285          | 285                  | 18.21
obsfreq <- c(52, 74, 99, 60)
nullprobs <- c(0.125, 0.25, 0.3125, 0.3125)  # probabilities must sum to 1
chisq.test(obsfreq, p = nullprobs)

Not among equals

With a chi-squared statistic of 18.2 (above the critical value of 7.81 at the 5% significance level) and a p-value of 0.00042, we reject the null hypothesis that the playoff games were coin tosses. The result is hardly surprising because the data include the first round, where most matches are not played among equals: the regular-season number one plays number 8, 2 plays 7, 3 plays 6, and 4 plays 5. A closer look at the table shows far more 4-0 sweeps (52) than expected (35.63) and fewer 4-3 series (60) than the coin-toss expectation (89.06).
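For readers who prefer the arithmetic spelled out, here is a Python sketch equivalent to the chisq.test call above (computing the statistic directly rather than via R):

```python
observed = [52, 74, 99, 60]            # 4-, 5-, 6-, 7-game series counts
probs = [0.125, 0.25, 0.3125, 0.3125]  # coin-toss expectations
n = sum(observed)                      # 285 series in total

# Chi-squared statistic: sum of (O - E)^2 / E over the four categories.
chi2 = sum((o - p * n) ** 2 / (p * n) for o, p in zip(observed, probs))
print(round(chi2, 1))  # 18.2, well above the 7.81 critical value (df = 3, 5% level)
```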

After the first round

Once the screening of the first round is complete, the conference semi-finals begin. We now anticipate that the matches are between teams of comparable strength. We repeat the test with data from 2003 to 2021, excluding the first round. The analysis is below.

Series        | Observed (O) | Expected (E)         | (O-E)²/E
4-game series | 19           | 0.125 x 132 = 16.5   | 0.38
5-game series | 30           | 0.25 x 132 = 33      | 0.27
6-game series | 52           | 0.3125 x 132 = 41.25 | 2.80
7-game series | 31           | 0.3125 x 132 = 41.25 | 2.55
Total         | 132          | 132                  | 6.00

The chi-squared statistic (6.00) is below the critical value of 7.81 at the 5% significance level. Yes, the null hypothesis stays.

The final

We will now take it to the next level and cover only the NBA finals of the last 75 years. We expect the outcomes to resemble those of equal-probability events, and the chi-squared value could be the lowest we have seen so far. Keep your fingers crossed.

Series        | Observed (O) | Expected (E)         | (O-E)²/E
4-game series | 9            | 0.125 x 75 = 9.38    | 0.015
5-game series | 18           | 0.25 x 75 = 18.75    | 0.03
6-game series | 29           | 0.3125 x 75 = 23.44  | 1.32
7-game series | 19           | 0.3125 x 75 = 23.44  | 0.84
Total         | 75           | 75                   | 2.21

Yes, it is – a chi-squared value of 2.21 with a p-value of about 0.53.


Are NBA Playoffs Random?

NBA playoff matches are progressing these days, and it is time we look at the winning probabilities. How significant is the role of chance in these games? Are the outcomes independent and random with equal probabilities for wins and losses, like coin-tossing?

Equal probability scenario

Although the title speaks about randomness, what we examine here is equal-probability outcomes. We will formulate hypotheses, calculate expected probabilities and test them against historical data. Playoff matches follow a best-of-seven format: the team that reaches four wins first takes the round (series) and advances to the next.

Scenario 1: 4 games

There are only two ways the series can end in four games: A wins all four, or B does. P(AAAA) = (1/2)^4 = 0.0625 (if the games are independent), and P(BBBB) = (1/2)^4 = 0.0625. The probability that the series ends in four games is 0.0625 + 0.0625 = 0.125.

Scenario 2: 5 games

The ways A can win in five games are BAAAA, ABAAA, AABAA and AAABA. The sequence AAAAB does not exist, as team A reaches the necessary four wins before a potential fifth game. Each of these sequences has probability (1/2)^5, and there are eight of them (4 each for A and B). The probability that the series ends in five games is 4 x (1/2)^5 + 4 x (1/2)^5 = 0.25.

Scenario 3: 6 games

The 6-game winning options for A are BBAAAA, BABAAA, BAABAA, BAAABA, ABBAAA, ABABAA, ABAABA, AABBAA, AABABA and AAABBA. As before, sequences such as AAABAB and AAAABB are irrelevant. The probability that the series ends in six games is 10 x (1/2)^6 + 10 x (1/2)^6 = 0.3125.

Scenario 4: 7 games

Estimating the probability of a seven-game series is easy: subtract all the previous probabilities from 1. That gives 1 – (0.125 + 0.25 + 0.3125) = 0.3125.
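All four probabilities also follow from one formula: the winner must take the final game plus exactly three of the earlier ones, and either team can be the winner. A short Python sketch (illustrative) using binomial coefficients:

```python
from math import comb

def p_series_length(k):
    """P(best-of-seven series ends at game k): the winner takes game k and
    exactly 3 of the first k-1 games; factor 2 for either team winning."""
    return 2 * comb(k - 1, 3) * 0.5 ** k

print([p_series_length(k) for k in range(4, 8)])
# [0.125, 0.25, 0.3125, 0.3125]
```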

Chi2 to the rescue

The null hypothesis, H0: the outcomes happen with equal probabilities, irrespective of regular-season rank. In other words, it is anybody’s game on a given day. The alternative hypothesis, H1: the outcomes are not due to luck but depend on the teams’ strength. We will use chi-squared statistics to test this. That is next.


Birthday Problem Reloaded

We have seen earlier that the probability of two people sharing a birthday in a group of 40 is about 90%, and that surprises many. Part of the reason is that you start imagining someone sharing your own birthday. Well, that is quite another problem.

Probability of someone sharing My birthday

The probability of someone sharing “my” birthday is different. This problem is not as unconstrained as the previous one, since at least one person (i.e., me) is fixed! Let’s start with the inverse problem.

What is the probability that no one else has your birthday in a group of n people? In a group of 2, the chance is (364/365); for 3, it is (364/365) x (364/365); and so on. In general, for a group of n people, the chance is (364/365)^(n-1). Since “someone shares your birthday” and “no one shares your birthday” are complementary events (exactly one of them must happen), you subtract one from 1 to get the other: the answer is 1 – (364/365)^(n-1).

For n = 10, the probability is 2.4%; for n = 100, it becomes 24%. To get a 50% chance that someone shares your birthday, you need 253 other people in the room (a group of 254, including you)!
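These numbers are easy to check; a short Python sketch:

```python
def p_shares_my_birthday(others):
    """Chance that at least one of `others` people shares your birthday."""
    return 1 - (364 / 365) ** others

print(round(p_shares_my_birthday(9), 3))   # group of 10: about 0.024
print(round(p_shares_my_birthday(99), 2))  # group of 100: about 0.24
print(p_shares_my_birthday(253) > 0.5)     # 253 others cross the 50% mark
```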
