Are NBA Playoffs Random?

The NBA playoffs are progressing these days, and it is time we looked at the winning probabilities. How significant is the role of chance in these games? Are the outcomes independent and random, with equal probabilities for wins and losses, like coin tossing?

Equal probability scenario

Although the title speaks of randomness, what we examine here is equal-probability outcomes. We will formulate hypotheses, calculate the expected probabilities and test them against historical data. Playoff rounds follow a best-of-seven format: the team that first reaches four wins takes the round (series) and advances to the next.

Scenario 1: 4 games

There are only two ways the round can end in four games: A wins all four, or B does. Assuming the games are independent, P(AAAA) = (1/2)^4 = 0.0625 and P(BBBB) = (1/2)^4 = 0.0625. The probability that the series ends in four games is 0.0625 + 0.0625 = 0.125.

Scenario 2: 5 games

The ways A can win in five games are BAAAA, ABAAA, AABAA and AAABA. A fifth option, AAAAB, doesn't exist, as team A gets the necessary four wins before a fifth game is played. Each of these sequences has probability (1/2)^5, and there are eight of them (four each for A and B). The probability that the series ends in five games is 4 × (1/2)^5 + 4 × (1/2)^5 = 0.25.

Scenario 3: 6 games

The six-game winning options for A are BBAAAA, BABAAA, BAABAA, BAAABA, ABBAAA, ABABAA, ABAABA, AABBAA, AABABA and AAABBA. As before, sequences such as AAABAB and AAAABB are irrelevant. The probability that the series ends in six games is 10 × (1/2)^6 + 10 × (1/2)^6 = 0.3125.

Scenario 4: 7 games

Estimating the probability of a seven-game series is easy: subtract all the previous ones from 1, i.e., 1 – (0.125 + 0.25 + 0.3125) = 0.3125.
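The four scenarios can be verified with a short R snippet. A series ends at game k when the eventual winner takes game k and exactly three of the first k − 1 games, and either team can be the winner (hence the factor of 2).

```r
# P(series ends at game k) for a best-of-seven with 50-50, independent games
p_len <- sapply(4:7, function(k) 2 * choose(k - 1, 3) * 0.5^k)
p_len        # 0.1250 0.2500 0.3125 0.3125
sum(p_len)   # 1
```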

Chi2 to the rescue

The null hypothesis, H0: the outcomes occur with equal probabilities, irrespective of rank in the regular season. In other words, on a given day it is anybody's game. The alternative hypothesis, H1: the outcomes are not due to luck but reflect the level of the teams. We will use chi-squared statistics to test this. That is next.


Birthday Problem Reloaded

We have seen earlier that the probability of two people sharing a birthday in a group of 40 is about 90%. And that surprises many of you. Part of the reason is that you imagine someone sharing their birthday with yours. Well, that is quite another problem.

Probability of some sharing My birthday

The probability of someone sharing my birthday ("my" in inverted commas) is different. This problem is not as symmetric as the previous one, as at least one person (i.e., me) is fixed! Let's start with the inverse problem.

What is the probability that no one else in a group of n people has your birthday? In a group of 2, the chance is (364/365); for 3, it is (364/365) × (364/365), and so on. In general, for a group of n people, the chance is (364/365)^(n-1). Since sharing a birthday with at least one person and sharing with no one are complementary events (exactly one of them must happen), you subtract one from 1 to get the other: the answer is 1 – (364/365)^(n-1).

For n = 10, the probability is 2.4%; for n = 100, it becomes 24%. To get a 50% chance that someone shares your birthday, you need 253 other people in the room!
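These numbers are easy to reproduce in R; the last line searches for the smallest number of other people at which the chance crosses 50%.

```r
# P(at least one of m other people shares my birthday)
p_share <- function(m) 1 - (364/365)^m

p_share(9)                          # group of 10 -> ~0.024
p_share(99)                         # group of 100 -> ~0.24
min(which(p_share(1:1000) >= 0.5))  # 253 others needed for a 50% chance
```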


Hypothesis Testing – Chi-Squared

The following table lists the weights of 50 randomly sampled six-year-old boys. Can you test the hypothesis that the weights of six-year-old boys follow a normal distribution with mean = 25 and standard deviation = 2? We will do a chi-squared test to find out.

28 24 27 24 27
26 25 29 22 24
23 25 21 22 25
26 27 27 26 29
28 27 22 23 21
29 24 23 23 22
25 22 29 28 30
24 28 26 25 25
28 29 26 27 30
22 31 25 24 27

The hypotheses

The null hypothesis, H0, in this case: there is no difference between the observed frequencies and the expected frequencies of a normal distribution with mean = 25 and standard deviation = 2.

The alternative hypothesis, HA: there is a difference between the observed frequencies and the expected frequencies of a normal distribution with mean = 25 and standard deviation = 2.

Estimation of chi2

Let us divide the data in the previous table into six groups of equal ranges and count the frequency in each range. The expected frequency for each range is estimated from the cumulative distribution function of the normal distribution using the formula

Ei = n × [F(Ui) – F(Li)]

where n is the number of samples, F is the cumulative distribution function of the normal distribution, and Ui and Li are the upper and lower limits of range i.
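In R, the formula can be sketched with pnorm, the normal CDF. The bin boundaries below are an assumption for illustration (open-ended first and last bins) and may differ slightly from the ones used for the table that follows.

```r
# sketch: expected frequencies E_i = n * (F(U_i) - F(L_i)) for N(25, 2);
# the bin boundaries are an assumption, not necessarily those of the table
n <- 50
lower <- c(-Inf, 22, 24, 26, 28, 30)  # lower limits of the six ranges
upper <- c(22, 24, 26, 28, 30, Inf)   # upper limits
E <- n * (pnorm(upper, mean = 25, sd = 2) - pnorm(lower, mean = 25, sd = 2))
round(E, 2)
```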

Range    Observed (O)   Expected (E)   (O-E)²/E
20 – 21       2             0.83         1.65
22 – 23      10             6.65         1.69
24 – 25      13            16.45         0.72
26 – 27      12            16.07         1.03
28 – 29      10             6.21         2.31
30 – 31       3             0.94         4.51
Total                                   11.92

The critical value at the 5% significance level and the p-value are estimated using the following R code.

qchisq(0.05, 5, lower.tail = FALSE)
pchisq(11.92, df=5, lower.tail=FALSE)

The critical value is 11.07, and the p-value is 0.036. Since the estimated chi-squared (11.92) exceeds the critical value, we reject the null hypothesis that the data follow a normal distribution with mean = 25 and standard deviation = 2.


Chi-square for Independence

Another application of chi-squared statistics is the test for independence, applied to categorical variables. For example, suppose you did a survey and want to know whether higher education depends on gender. The following data were collected.

                Male   Female
No Graduation     7      6
College          16     13
Bachelors        15     16
Masters          11      8

Test for Independence

You perform a chi-squared test for independence.

                Male (O)   Female (O)   Total
No Graduation      7           6          13
College           16          13          29
Bachelors         15          16          31
Masters           11           8          19
Total             49          43          92
Observed data

The expected data for a perfectly independent scenario are calculated as below: the expected value in (row i, column j) is RowSum(i) × ColumnSum(j) / (Grand Total).
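In R, the outer product of the row sums and column sums gives the whole expected table in one line:

```r
# observed counts: rows = education level, columns = (Male, Female)
obs <- matrix(c( 7,  6,
                16, 13,
                15, 16,
                11,  8), ncol = 2, byrow = TRUE)

# expected counts under independence: RowSum(i) * ColSum(j) / GrandTotal
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 2)   # first row: 6.92 and 6.08
```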

                Male (E)            Female (E)           Total
No Graduation   13×49/92 = 6.92     13×43/92 = 6.08        13
College         29×49/92 = 15.45    29×43/92 = 13.55       29
Bachelors       31×49/92 = 16.51    31×43/92 = 14.49       31
Masters         19×49/92 = 10.12    19×43/92 = 8.88        19
Total           49                  43                     92
Expected data

The chi-squared statistic is calculated as

                Male (O-E)²/E                Female (O-E)²/E
No Graduation   (7-6.92)²/6.92 = 0.0009      (6-6.08)²/6.08 = 0.0011
College         (16-15.45)²/15.45 = 0.0196   (13-13.55)²/13.55 = 0.0223
Bachelors       (15-16.51)²/16.51 = 0.138    (16-14.49)²/14.49 = 0.157
Masters         (11-10.12)²/10.12 = 0.0765   (8-8.88)²/8.88 = 0.087
Total           0.235                        0.268
Chi² = 0.235 + 0.268 = 0.503

As we have done previously, plug the 5% significance level (0.05) into the R function qchisq with 3 degrees of freedom. The answer, 7.81, is the critical value. The calculated value of 0.503 is lower than 7.81; therefore, the null hypothesis that education level is independent of gender cannot be rejected. The p-value is obtained by passing 0.503 to the R function pchisq. The answer is 0.918.

qchisq(0.05, 3, lower.tail = FALSE)
pchisq(0.503, df=3, lower.tail=FALSE)

The R code for the whole exercise

edu_data <- matrix(c(7, 16, 15, 11, 6, 13, 16, 8), ncol = 4 , byrow = TRUE)
colnames(edu_data) <- c("no Grad", "College", "Bachelors", "Masters")
rownames(edu_data) <- c("male", "female")

chisq.test(edu_data)


Chi-Square Test for the Lefties

Randomness and the resulting scatter in data can confuse people interpreting observations. Take this example: from studies, we know that 10% of the population is left-handed. You survey 150 randomly selected people and find that 20 are left-handed. Does this violate the theory, or is it just chance? What do we do?

Goodness of fit

You perform a chi-square goodness of fit test on the data.

        Observed (O)   Expected (E)   (O-E)²/E
Left        20             15          25/15
Right      130            135          25/135
Total      150            150          1.85
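The chi-squared value in the table can be computed directly:

```r
observed <- c(20, 130)            # left- and right-handed counts
expected <- c(0.10, 0.90) * 150   # 15 and 135 under the 10% theory
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq   # 1.8519
```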

We will reject the notion (that 10% are left-handed) at the 5% significance level. In other words, the evidence must fall outside the 95% confidence interval to support the alternative hypothesis; in our case, the alternative hypothesis is that the proportion of lefties is more than 10% of the population. So how do you estimate the critical value at the 0.05 (5%) significance level? The old-fashioned way is a lookup table, where you find the number by matching the degrees of freedom (in this case, df = 1) against the significance level. We use the following R code instead.

qchisq(0.05, 1, lower.tail = FALSE) # qchisq(p, df)

The answer is 3.84. In other words, the calculated chi-squared value needs to be greater than 3.84 to reject the notion (the null hypothesis). In our case, it is 1.85, which is less than 3.84, so we can't reject the notion of 10% lefties, although we saw 20 in 150!

p-value

How do we calculate our favourite p-value from this? Plug the chi-squared value (1.85) into the pchisq function.

pchisq(1.85, df=1, lower.tail=FALSE)

The answer is 0.1737. Needless to say, pchisq is the inverse of qchisq. In other words

qchisq(0.1737, 1, lower.tail = FALSE)

gives 1.85.

Everything in one step

The following R code will do everything from the start

obsfreq <- c(20,130)
nullprobs <- c(0.1,0.9)
chisq.test(obsfreq,p=nullprobs)

The answer will be in the following format

	Chi-squared test for given probabilities

data:  obsfreq
X-squared = 1.8519, df = 1, p-value = 0.1736


Expectations of a Random process

Gambler’s fallacy is the belief that a random process is somehow self-correcting. The word "somehow" is important to note: a random process has no brain with which to self-correct! Check this out:

“The mean IQ of children in a city is 100. You take a sample of 50 children for a study. If the first child measures an IQ of 150, what is the expected average IQ of the group?”

Tversky and Kahneman, Psychological Bulletin, 1971

The answer is (150 + 49 × 100) / 50 = 101, although most people think it would still be 100.
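A quick simulation confirms the arithmetic. The population standard deviation of 15 used here is an assumption (the usual IQ scale); it adds noise but does not shift the mean.

```r
# one child fixed at 150, the other 49 drawn from a population with mean 100
set.seed(7)
group_means <- replicate(10000, mean(c(150, rnorm(49, mean = 100, sd = 15))))
mean(group_means)   # close to 101, not 100
```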

Another famous example is the "hot hand" in basketball: the spectators' belief that a player's chance of making a basket depends on previous successes. But numerous studies have found no evidence for such dependence.


When We Start Predicting Randomness

The answer to the birthday problem comes as a surprise to most of us. If you have not heard about this puzzle, you may read one of my earlier posts. At the same time, it is far less controversial than, say, the Monty Hall problem, as the former can be solved mathematically without much ambiguity.

A similar story about a Canadian lottery is narrated in the book The Drunkard’s Walk: How Randomness Rules Our Lives by Leonard Mlodinow. The officials decided to give away 500 cars among 2.4 million subscribers using the unclaimed prize money of the past. They used a computer program to pick 500 individuals at random and published the unsorted results, evidently unaware of the possibility of double-counting. And the result? One number was repeated, and a person got a car twice!

Let’s write R code to estimate numerically the probability of such an event: a repeat within 500 numbers drawn from 2.4 million. The program repeats the random draw (the sample function), 500 numbers at a time, 10,000 times, and averages the outcomes.

B <- 10000
n <- 500

results <- replicate(B, {
  lot <- sample(1:2400000, n, replace = TRUE)
  any(duplicated(lot))
})
mean(results)

The program output, the probability of a repeat inside 500 numbers, is 5.4% – not an infinitesimally small number by any standard. The following plot shows how it grows with the sample size.

You will see that the repeat is almost certain before the number reaches about 4000.
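The simulation agrees with the standard birthday-problem approximation, p ≈ 1 − exp(−n(n−1)/2N), for n draws from N equally likely values:

```r
# analytic approximation for a repeat among n draws from N values
p_dup <- function(n, N = 2.4e6) 1 - exp(-n * (n - 1) / (2 * N))
p_dup(500)    # ~0.05, matching the simulation
p_dup(4000)   # ~0.96, nearly certain
```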


Probability of Defective Coin

You suspect that one of the 100 coins in a bag is 60% biased towards heads. You take one coin at random, flip it 50 times and get 30 heads. What is the probability that the coin you chose is biased?

Let’s do this Bayesian challenge step by step. Let B = biased, nB = not biased, and D = the outcome (data). The probability that the coin is biased, given the data of 30 heads in 50 flips, is

P(B|D) = \frac{P(D|B)P(B)}{P(D|B)P(B) + P(D|nB)P(nB)}

The likelihood, P(Data|B)

The first term of the numerator is called the likelihood. What is the likelihood of getting 30 heads in 50 flips if the coin is 60% biased? We know how to get that quantity: apply a binomial trial with probability 0.6 for success and 0.4 for failure. The term in the denominator, P(D|nB)×P(nB), the probability of 30 heads if the coin is not biased, uses 0.5 for both success and failure.

Since you have the information that 1 in 100 coins is faulty, you choose 1/100 as the prior. The R code for the calculation is given below.

head_p <- 0.6
tail_q <- 1 - head_p
prior <- 1/100

flip   <- 50
success_head <- 30

prob_bias <- choose(flip, success_head)*(head_p^(success_head))*(tail_q^(flip-success_head))  

head_p <- 0.5
tail_q <- 1 - head_p
prior_no <- 1 - prior
prob_no_bias <- choose(flip, success_head)*(head_p^(success_head))*(tail_q^(flip-success_head))  

post <- (prob_bias*prior)/(prob_bias*prior + prob_no_bias*prior_no )
post

You get a probability of ca. 2.7%. You are not satisfied, so you flip another 50 times and get 35 heads. What is your inference now? Repeat the same calculation, but don’t forget to update your belief from 1 in 100 to about 2.7 in 100.
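The same calculation is shorter with dbinom, R's binomial probability function, and the second round follows the same pattern with the first posterior as the new prior:

```r
# first round: 30 heads in 50 flips, prior 1/100
prior <- 1 / 100
post1 <- dbinom(30, 50, 0.6) * prior /
         (dbinom(30, 50, 0.6) * prior + dbinom(30, 50, 0.5) * (1 - prior))
post1   # ~0.027, the posterior computed above

# second round: 35 heads in 50 flips, with post1 as the updated prior
post2 <- dbinom(35, 50, 0.6) * post1 /
         (dbinom(35, 50, 0.6) * post1 + dbinom(35, 50, 0.5) * (1 - post1))
post2   # ~0.36: the extra heads make the bias much more plausible
```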


Expected Value of A Die

You already know the concept of expected values: an outcome multiplied by its probability, summed over all outcomes. What is the expected value of an n-sided die? Let us estimate the answer mathematically.

\\ \text{Let X take the values } 1, 2, \ldots, n \text{, the sides of the die} \\ \\ p(X) = \frac{1}{n} \text{, assuming a fair die with each side equally likely} \\ \\ E(X) = \sum\limits_{X=1}^n p(X) \cdot X \\ \\ E(X) = 1 \cdot \frac{1}{n} + 2 \cdot \frac{1}{n} + 3 \cdot \frac{1}{n} + \ldots + n \cdot \frac{1}{n} \\ \\ = \frac{1}{n} [1 + 2 + 3 + \ldots + n] = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2}

If you want to test: for a normal six-sided die, n = 6, so E(X) = (6+1)/2 = 3.5. Similarly, for the sum of two dice, E(X1 + X2) = E(X1) + E(X2) = 3.5 + 3.5 = 7.

This may worry some of you: how can the expected value of a fair die be a single number, and a fraction at that? You know by now that the expected value is more of a concept: it equals the long-term average, in this case over many rolls of the die.
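A long-run average of simulated rolls illustrates the point:

```r
set.seed(1)
rolls <- sample(1:6, 1e6, replace = TRUE)  # a million fair-die rolls
mean(rolls)   # ~3.5: no single roll gives 3.5, but the average converges to it
```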


Two Envelope Problem

Consider this: there are two envelopes on the table containing some cash. You don’t know how much it is, but you know that one contains twice the amount as the other. You were given one of them at random. You have two choices: open the one you got and take the cash, or switch to the other. What will you do?

Expected Value

Naturally, you will compute the expected value of a switch, and if it is greater than the money at hand, you will make the exchange. Let the amount of money inside your envelope be X. The other envelope holds either 2X or X/2, and they are equally likely (probability of half each). The expected value, in that case, is (1/2) × 2X + (1/2) × (X/2) = X + (1/4)X > X. So, you switch, right?

The problem with the problem

No, you will not. Think about the issue with a swap. After making the switch, you can make the same expected value calculations for the original envelope, draw the same conclusion, and switch back—and you do it forever!

The problem, in reality, involves two conditional probabilities and needs to be solved separately. First, you assume that the second envelope contains less than the first. In the second case, you consider the opposite, i.e., the second envelope holds more than the first. You multiply the respective probabilities to get the answer.

E(2nd) = V(2nd | 1st < 2nd) × P(1st < 2nd) + V(2nd | 1st > 2nd) × P(1st > 2nd).
For X and 2X, E(2nd) = 2X × (1/2) + X × (1/2) = 1.5X.

Naturally, if X and 2X are the two options, the average value of the first envelope is not X but (X + 2X)/2 = 1.5X.

They match, and no need to switch!
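A simulation of the X / 2X setup shows the symmetry directly; X = 100 is an arbitrary choice.

```r
# envelopes hold X and 2X; you receive one of them at random
set.seed(1)
X <- 100
kept     <- sample(c(X, 2 * X), 1e5, replace = TRUE)  # envelope you were given
switched <- ifelse(kept == X, 2 * X, X)               # the other envelope
mean(kept)       # ~150 = 1.5X
mean(switched)   # ~150 as well: switching gains nothing
```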
