Data & Statistics

Simpson’s Paradox – Berkeley data

May 23, 2023

We have seen Simpson’s paradox in one of the earlier posts. A famous example is the discrepancy in observed admission rates of men and women across six departments at Berkeley. Here is what the data show; the dataset is available on GitHub.

Admit	Gender	Dept	Frequency
Admitted	Male	A	512
Rejected	Male	A	313
Admitted	Female	A	89
Rejected	Female	A	19
Admitted	Male	B	353
Rejected	Male	B	207
Admitted	Female	B	17
Rejected	Female	B	8
Admitted	Male	C	120
Rejected	Male	C	205
Admitted	Female	C	202
Rejected	Female	C	391
Admitted	Male	D	138
Rejected	Male	D	279
Admitted	Female	D	131
Rejected	Female	D	244
Admitted	Male	E	53
Rejected	Male	E	138
Admitted	Female	E	94
Rejected	Female	E	299
Admitted	Male	F	22
Rejected	Male	F	351
Admitted	Female	F	24
Rejected	Female	F	317

The paradox

If one considers the university as a whole, here is the summary

Admit	Gender	#
Admitted	Male	1198
Rejected	Male	1493
Admitted	Female	557
Rejected	Female	1278
Total		4526

Proportion of males admitted = 1198 / (1198 + 1493) = 0.45

Proportion of females admitted = 557 / (557 + 1278) = 0.30
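The two proportions can be reproduced in R. This is a minimal sketch that assumes the built-in `UCBAdmissions` table (package `datasets`) holds the same counts as the GitHub file mentioned above; the variable names are only illustrative.

```r
# UCBAdmissions is a 3-D table: Admit x Gender x Dept
totals <- apply(UCBAdmissions, c(1, 2), sum)        # collapse over departments

# admitted / (admitted + rejected) for each gender
admit_rate <- totals["Admitted", ] / colSums(totals)
round(admit_rate, 2)
```

Collapsing over `Dept` is exactly the step that hides the department effect discussed next.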

There is a difference in success rates for men and women. But what about department-wise ‘discrimination’? Here are the success rates of males and females in each department.

Department	Male	Female
A	0.62	0.82
B	0.63	0.68
C	0.37	0.34
D	0.33	0.35
E	0.28	0.24
F	0.06	0.07
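The department-wise rates can be computed the same way. Again a sketch, assuming the built-in `UCBAdmissions` table matches the data listed above.

```r
# Applicants per gender and department (Gender x Dept)
applicants <- apply(UCBAdmissions, c(2, 3), sum)

# Admission rate per gender and department
dept_rate <- UCBAdmissions["Admitted", , ] / applicants
round(t(dept_rate), 2)          # rows = departments, columns = Male, Female

# Departments in which the female rate exceeds the male rate
colnames(dept_rate)[dept_rate["Female", ] > dept_rate["Male", ]]
```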

Success rates of females are on par or even higher in most departments: women lead in A, B, D and F and trail only slightly in C and E. Let’s probe further and check where they applied against the success rates.

Department	% Male Applied	% Female Applied	Admission Rate (%)
A	31	6	64
B	21	1	63
C	12	32	35
D	15	20	34
E	7	21	25
F	14	19	6
Total	100	100

Women preferred more competitive departments with lower acceptance rates, whereas more men opted for departments with better acceptance rates.
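One way to see how the application pattern produces the overall gap: each gender's overall admission rate is the average of the department rates, weighted by where that gender applied. The sketch below assumes the built-in `UCBAdmissions` table carries the same counts as the data above.

```r
applicants <- apply(UCBAdmissions, c(2, 3), sum)     # Gender x Dept
admitted   <- UCBAdmissions["Admitted", , ]          # Gender x Dept

share     <- prop.table(applicants, margin = 1)      # where each gender applied (rows sum to 1)
dept_rate <- admitted / applicants                   # admission rate per gender and department

round(100 * t(share))                                # % of each gender applying to A..F

# overall rate = sum over departments of (application share x department rate)
overall <- rowSums(share * dept_rate)
round(overall, 2)
```

Because the female weights sit mostly on the low-rate departments C to F, the weighted average drops even though the department-level rates are similar.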

Confounding vs Effect Modification

May 22, 2023

We have seen confounders before; a confounder is a factor that is associated with both the exposure and the outcome, thereby misleading investigators into seeing a causal relationship between the two.

For example, smoking is a confounder that misleads people to conclude that drinking can lead to lung cancer. In reality, smokers have a higher tendency to drink, and smokers have a higher tendency to get lung cancer. Until you stratify and find the impact of drinking on smokers and non-smokers, you are unlikely to figure out the error.
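A small numerical sketch of the stratification step follows. The counts are invented purely for illustration (they come from no study): smokers drink more often, and lung cancer risk depends on smoking only.

```r
# Hypothetical counts, for illustration only
d <- data.frame(
  smoker  = c(TRUE, TRUE, FALSE, FALSE),
  drinker = c(TRUE, FALSE, TRUE, FALSE),
  n       = c(800, 200, 200, 800),   # people in each group
  cancer  = c(80, 20, 2, 8)          # lung cancer cases in each group
)

risk <- function(x) sum(x$cancer) / sum(x$n)

# Crude comparison: drinkers vs non-drinkers, ignoring smoking
crude_rr <- risk(d[d$drinker, ]) / risk(d[!d$drinker, ])

# Stratified comparison: drinkers vs non-drinkers within each smoking group
rr_smokers    <- risk(d[d$smoker & d$drinker, ])  / risk(d[d$smoker & !d$drinker, ])
rr_nonsmokers <- risk(d[!d$smoker & d$drinker, ]) / risk(d[!d$smoker & !d$drinker, ])

round(c(crude = crude_rr, smokers = rr_smokers, nonsmokers = rr_nonsmokers), 2)
```

With these made-up numbers the crude risk ratio is close to 3, while it is 1 inside each smoking stratum: the apparent effect of drinking comes entirely from smoking.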

On the other hand, if the variable impacts the outcome and not the exposure, it is an effect modifier. A simple example is the immunisation status of an individual, which can change the person’s susceptibility to infection from a virus.

Collider Bias – The Math

May 21, 2023

So far, I have addressed the collider-bias phenomenon qualitatively. This time, I will try to show it through numbers. It can look complex as the illustration involves a lot of arithmetic. The reference material provided at the end is a good read for grasping the concept further.

Imagine a situation where the exposure is obesity, the risk factor is smoking, the outcome is mortality, and the collider is diabetes. If you are confused about what each represents, here is the expected storyline: a research group studies the impact of obesity on mortality in a set of people who have diabetes and comes up with a counterintuitive conclusion (perhaps that obesity decreases mortality)!

Set of information

Total study population = 1000
Smokers = 500
Non-smokers = 500
Obese = 500
Non-obese = 500
Baseline diabetes risk (non-smoking, non-obese) = 4%
Obesity increases diabetes risk by 16% points
Smoking increases diabetes risk by 12% points
Baseline mortality risk (non-smoking, non-obese, non-diabetic) = 5%
Obesity increases mortality risk by 2.5% points
Smoking increases mortality risk by 15% points
Diabetes increases mortality risk by 5% points

Calculations on the total sample

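The overall study population splits into four quadrants of 250 people each: NS-NO (non-smoking, non-obese), S-NO (smoking, non-obese), S-O (smoking, obese) and NS-O (non-smoking, obese).

Now, calculate the number of deaths in each quadrant and group the quadrants into obese and non-obese.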

Total mortality of NS-NO (non-smoking, non-obese) quadrant
= # of diabetic x diabetic mortality + # non-diabetic x baseline mortality
= 0.04 x 250 x (0.05 + 0.05) + (250 – 0.04 x 250) x 0.05
= 1 + 12 = 13
(note that diabetic mortality = baseline mortality + the increase in mortality due to diabetes)

S-NO (smoking, non-obese) quadrant
= # of diabetic x (diabetic mortality + smoking mortality) + # non-diabetic x (Baseline mortality + smoking mortality)
= (0.04 + 0.12) x 250 x (0.05 + 0.05 + 0.15) + (250 – (0.04 + 0.12) x 250) x (0.05 + 0.15)
= 52

S-O (smoking, obese) quadrant
= (0.04 + 0.12 + 0.16) x 250 x (0.05 + 0.05 + 0.15 + 0.025) + (250 – (0.04 + 0.12 + 0.16) x 250) x (0.05 + 0.15 + 0.025)
= 60

NS-O (non-smoking, obese) quadrant
= (0.04 + 0.16) x 250 x (0.05 + 0.05 + 0.025) + (250 – (0.04 + 0.16) x 250) x (0.05 + 0.025)
= 21

Calculations (for the total sample)
Mortality rate with obesity = (60 + 21) / 500 = 16.2%
Mortality rate without obesity = (13 + 52) / 500 = 13%
An increase of 3.2% points
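The quadrant arithmetic can be checked in a few lines of R. This is a sketch under the inputs listed in the set of information, assuming 250 people per quadrant and additive risks; the variable names are mine.

```r
n_quadrant <- 250    # assumed: 1000 people split evenly over smoking x obesity

# One row per quadrant: NS-NO, S-NO, NS-O, S-O
grid <- expand.grid(smoker = c(0, 1), obese = c(0, 1))

p_diab          <- 0.04 + 0.12 * grid$smoker + 0.16 * grid$obese    # diabetes risk
p_death_nondiab <- 0.05 + 0.15 * grid$smoker + 0.025 * grid$obese   # mortality, non-diabetic
p_death_diab    <- p_death_nondiab + 0.05                           # mortality, diabetic

n_diab <- n_quadrant * p_diab
deaths <- n_diab * p_death_diab + (n_quadrant - n_diab) * p_death_nondiab
cbind(grid, deaths)

# Mortality rate by obesity status (500 people in each group)
rate <- tapply(deaths, grid$obese, sum) / 500
round(100 * rate, 1)
```

Without rounding the quadrant totals (21.25 and 60.25 deaths in the two obese quadrants), the rate with obesity comes to about 16.3% against 13% without.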

Calculations on the sub-sample

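Suppose the study stratified the sample and analysed only people who have diabetes. The study sample then consists of the diabetics in each quadrant: 10 (NS-NO), 40 (S-NO), 80 (S-O) and 50 (NS-O), i.e. 130 obese and 50 non-obese people.

Do the same exercise as before.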

NS-NO quadrant
= # of diabetic x diabetic mortality
= 0.04 x 250 x (0.05 + 0.05)
= 1

S-NO quadrant
= # of diabetic x (diabetic mortality + smoking mortality)
= (0.04 + 0.12) x 250 x (0.05 + 0.05 + 0.15)
= 10

S-O quadrant
= (0.04 + 0.12 + 0.16) x 250 x (0.05 + 0.05 + 0.15 + 0.025)
= 22

NS-O quadrant
= (0.04 + 0.16) x 250 x (0.05 + 0.05 + 0.025)
= 6

Calculations (for the sub-sample)
Mortality rate with obesity = (22 + 6) / 130 = 21.5%
Mortality rate without obesity = (1 + 10) / 50 = 22%
A decrease of 0.5% points
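The same check for the diabetic sub-sample, again as a sketch that assumes 250 people per quadrant and additive risks. It also prints the share of smokers among diabetics, which is where the reversal comes from.

```r
n_quadrant <- 250    # assumed: 1000 people split evenly over smoking x obesity
grid <- expand.grid(smoker = c(0, 1), obese = c(0, 1))

# Diabetics per quadrant and their mortality risk
n_diab       <- n_quadrant * (0.04 + 0.12 * grid$smoker + 0.16 * grid$obese)
p_death_diab <- 0.05 + 0.05 + 0.15 * grid$smoker + 0.025 * grid$obese
deaths_diab  <- n_diab * p_death_diab

# Mortality rate among diabetics, by obesity status
rate_diab <- tapply(deaths_diab, grid$obese, sum) / tapply(n_diab, grid$obese, sum)
round(100 * rate_diab, 1)

# Share of smokers among diabetics, by obesity status
smoker_share <- tapply(n_diab * grid$smoker, grid$obese, sum) / tapply(n_diab, grid$obese, sum)
round(100 * smoker_share, 1)
```

Among diabetics, 80% of the non-obese are smokers against roughly 62% of the obese, so the non-obese group carries more of the stronger risk factor and ends up with the slightly higher mortality (22% against about 21.7% unrounded).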

Reference

Collider Bias in Observational Studies: Dtsch Arztebl Int.

The Obesity Paradox

May 20, 2023

The obesity paradox is the idea that people who are overweight live longer than normal-weight people. While later studies have found this claim invalid, the notion has stayed in public discourse ever since.

There are many explanations for this odd observation. One of them goes with the parameter of measurement itself – the survival rate after getting cardiovascular disease. Studies found that obese people may get the disease much earlier in life and therefore survive a longer proportion of life with it.

Another one is collider stratification bias, which happens when two variables, e.g., risk factor and outcome, influence a third, namely, the likelihood of being sampled. It works in the following way:

Obese individuals may have developed CAD because they are obese or because of another, stronger condition, e.g., smoking or genetics. In other words, CAD, the collider, is caused by 1) obesity and 2) the (more severe) condition (smoking). In this simple two-cause model, stratifying on CAD means that, among individuals with CAD, obese individuals are less likely to be smokers and non-obese individuals are more likely to be smokers. Consequently, obesity may appear protective against mortality (the outcome) because its presence indicates the absence of a more harmful risk factor, smoking.

References

The ‘obesity paradox’ may not be a paradox at all: International Journal of Obesity

Obesity is bad regardless of the obesity paradox for hypertension and heart disease: J Clin Hypertens

Association of Body Mass Index With Lifetime Risk of Cardiovascular Disease and Compression of Morbidity: JAMA Cardiology

Physical Activity and Health

May 17, 2023

The March issue of the British Journal of Sports Medicine carried the results of a 9-year cohort study on physical activity and its impact on influenza and pneumonia.

Before we get into details, note that it is a cohort study of 577 909 US adults. Cohort studies are observational, whereas randomised controlled trials (RCTs) are interventional. Establishing causation from observational studies is problematic.

A key finding of the study is that aerobic physical activity was associated with a lower risk of influenza and pneumonia.

Reference

Webber BJ, et al. Br J Sports Med 2023;0:1–8.

Fisher’s Exact Test

May 16, 2023

Fisher’s exact test is a statistical significance test for an association between two categorical variables; it computes the p-value exactly rather than through a large-sample approximation. For example, scientists tagged 50 king penguins in each of three nesting areas (lower, middle, and upper) and counted the numbers that were alive or dead after a year. The following were the results.

	Alive	Dead
Lower nesting area	43	7
Middle nesting area	44	6
Upper nesting area	49	1

Are these differences significant?

penguin.nest <- data.frame("Alive" = c(43, 44, 49), "Dead" = c(7, 6, 1), row.names = c("Lower", "Middle", "Upper"))
fisher.test(penguin.nest)

The p-value is 0.0896; the differences are not significant at the 0.05 level.
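A quick look at the expected counts shows why an exact test suits these data. The sketch below rebuilds the table from the R snippet above and computes the counts expected under independence; the usual rule of thumb (an addition here, not part of the original analysis) is to prefer an exact test when some expected counts fall below 5.

```r
penguin.nest <- data.frame(Alive = c(43, 44, 49), Dead = c(7, 6, 1),
                           row.names = c("Lower", "Middle", "Upper"))

# Expected counts under independence: row total x column total / grand total
expected <- outer(rowSums(penguin.nest), colSums(penguin.nest)) / sum(penguin.nest)
round(expected, 2)

# Exact p-value for the observed table
p_value <- fisher.test(as.matrix(penguin.nest))$p.value
p_value
```

Every expected count in the Dead column is 14/3, about 4.67, which is below 5.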

Hypergeometry of counterfeits

May 15, 2023

A collection of 15 gold coins contains 4 counterfeits. If 2 of them are randomly selected to be sold at the auction, find the probability that

neither of them is a counterfeit
only one of them is a counterfeit
both coins are counterfeits.

This is a hypergeometric probability distribution – picking without replacement. If X is the number of counterfeit coins (hypergeometric random variable),

$P(X = 0) = \frac{_{4}C_0 \textrm{ }*\textrm{ } _{11}C_2\textrm{ }}{_{15}C_2}$

choose(4,0)*choose(11,2) / choose(15,2)

0.52

$P(X = 1) = \frac{_{4}C_1 \textrm{ }*\textrm{ } _{11}C_1\textrm{ }}{_{15}C_2}$

choose(4,1)*choose(11,1) / choose(15,2)

0.42

$P(X = 2) = \frac{_{4}C_2 \textrm{ }*\textrm{ } _{11}C_0\textrm{ }}{_{15}C_2}$

choose(4,2)*choose(11,0) / choose(15,2)

0.06

Or simply,

dhyper(0:2, 4, 11, 2, log = FALSE)   # P(X = 0), P(X = 1), P(X = 2) in one call
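As a sanity check, the three probabilities can also be approximated by simulating the draw. This is a sketch: the seed and the number of repetitions are arbitrary choices of mine.

```r
set.seed(1)                                   # fixed only for repeatability
coins <- rep(c("counterfeit", "genuine"), times = c(4, 11))

# Draw 2 coins without replacement, many times, and count the counterfeits
n_counterfeit <- replicate(100000, sum(sample(coins, 2) == "counterfeit"))

simulated <- tabulate(n_counterfeit + 1, nbins = 3) / length(n_counterfeit)
exact     <- dhyper(0:2, 4, 11, 2)
round(rbind(simulated, exact), 3)             # columns: 0, 1, 2 counterfeits
```

The simulated frequencies should land close to the exact values 55/105, 44/105 and 6/105.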

Coffee Overflow

May 14, 2023

A coffee machine is regulated to dispense 195 ml per cup with a standard deviation of 5 ml. Assuming the amount of fill is normally distributed, what is the probability that a 200 ml cup will overflow?

For normal distributions,

$P(X \ge 200) = P(z \ge \frac{200-\mu}{\sigma}) = P(z \ge \frac{200-195}{5})$

Or you may use this simple R command

1 - pnorm(200, 195, 5)

0.1586553
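The same number follows from the standardised form of the formula. A minimal sketch, with variable names of my own choosing:

```r
mu    <- 195   # mean fill (ml)
sigma <- 5     # standard deviation (ml)
cup   <- 200   # cup capacity (ml)

z <- (cup - mu) / sigma                          # z-score of the cup capacity
p_overflow <- pnorm(z, lower.tail = FALSE)       # P(Z >= z) for a standard normal
p_overflow
```

Here z = 1, so roughly 16% of the cups overflow, in line with the value above.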

Craps Probability – Don’t Pass

May 13, 2023

Another type of bet in craps is the ‘don’t pass’ bet. Here, the winning opportunities are the opposite of what we have seen before. Well, not really; had that been the case, the player would have got an exactly opposite, +1.41% advantage, which is absurd. A player never holds winning odds in gambling! The rules are almost the opposite, but getting 12 in the first throw is a push (no win, no loss). Let’s list all the possible outcomes and the payoff table.

The player throws the dice and wins at once if the total for the first throw is 2 or 3.
The player loses if the outcome is 7 or 11.
It’s a push if the outcome is 12.
The throws 4, 5, 6, 8, 9 or 10 are called points.
If the first throw is a point, the dice are thrown again until either the same number (the point) comes back (player loses) or a 7 appears (player wins).

The probability of winning a point 4 is the joint probability of getting 4 in the first roll and then getting a 7 before another 4 in the rolls that follow.

Dice Roll	Payoff	Probability	Return
7 or 11 (come-out loss)	-1	16.67 + 5.56 = 22.23	-22.23
2, 3 (come-out win)	1	2.78 + 5.56 = 8.34	8.34
12 (come-out push)	0	2.78	0
Point 4 loss	-1	8.33*8.33/(8.33+16.67) = 2.78	-2.78
Point 5 loss	-1	11.11*11.11/(11.11+16.67) = 4.44	-4.44
Point 6 loss	-1	13.89*13.89/(13.89+16.67) = 6.31	-6.31
Point 8 loss	-1	13.89*13.89/(13.89+16.67) = 6.31	-6.31
Point 9 loss	-1	11.11*11.11/(11.11+16.67) = 4.44	-4.44
Point 10 loss	-1	8.33*8.33/(8.33+16.67) = 2.78	-2.78
Point 4 win	1	8.33*16.67/(8.33+16.67) = 5.55	5.55
Point 5 win	1	11.11*16.67/(11.11+16.67) = 6.67	6.67
Point 6 win	1	13.89*16.67/(13.89+16.67) = 7.58	7.58
Point 8 win	1	13.89*16.67/(13.89+16.67) = 7.58	7.58
Point 9 win	1	11.11*16.67/(11.11+16.67) = 6.67	6.67
Point 10 win	1	8.33*16.67/(8.33+16.67) = 5.55	5.55
Overall		100	-1.35

So, as usual, the house wins.
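The table can be cross-checked with exact fractions instead of rounded percentages. A sketch of that calculation, using the rules listed above (the variable names are mine):

```r
# Probability of each total of two fair dice
p <- table(outer(1:6, 1:6, "+")) / 36
p <- setNames(as.numeric(p), names(p))

points <- c("4", "5", "6", "8", "9", "10")

# Don't pass: win on 2 or 3, or when 7 arrives before the point
win  <- p["2"] + p["3"] + sum(p[points] * p["7"] / (p[points] + p["7"]))
# Lose on 7 or 11, or when the point arrives before 7
lose <- p["7"] + p["11"] + sum(p[points] * p[points] / (p[points] + p["7"]))
push <- p["12"]

expected_return <- unname(win - lose)
round(100 * expected_return, 2)
```

The exact expected return is -3/220, about -1.36%; the -1.35 in the table differs only because the table adds up rounded percentages.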

Craps Probability

May 12, 2023

Here we continue and determine the probability of winning one of the craps bets, the pass line bet. Let’s summarise the ways of winning (and losing) and the corresponding payoffs.

Dice Roll	Payoff	Probability
7 or 11 (come-out win)	1	P₇ + P₁₁
2, 3, or 12 (come-out loss)	-1	P₂ + P₃ + P₁₂
Point 4 win	1	P₄*P_4/7
Point 5 win	1	P₅*P_5/7
Point 6 win	1	P₆*P_6/7
Point 8 win	1	P₈*P_8/7
Point 9 win	1	P₉*P_9/7
Point 10 win	1	P₁₀*P_10/7
Point 4 loss	-1	P₄*P_7/4
Point 5 loss	-1	P₅*P_7/5
Point 6 loss	-1	P₆*P_7/6
Point 8 loss	-1	P₈*P_7/8
Point 9 loss	-1	P₉*P_7/9
Point 10 loss	-1	P₁₀*P_7/10

The notations are:
P₇ = probability of getting a 7
P_4/7 = probability of getting a 4 before a 7 (in the throws that follow a 4 in the first throw), etc.

The probability of winning a point 4 is the joint probability of getting 4 in the first roll and then getting another 4 before a 7 in the rolls that follow. Let’s calculate each of these probabilities using the reference table.

Dice Roll	Probability	%
2	1/36	2.78
3	2/36	5.56
4	3/36	8.33
5	4/36	11.11
6	5/36	13.89
7	6/36	16.67
8	5/36	13.89
9	4/36	11.11
10	3/36	8.33
11	2/36	5.56
12	1/36	2.78
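The reference table can be generated by enumerating all 36 equally likely outcomes of two dice. A minimal sketch:

```r
# All 36 totals of two fair dice
totals <- outer(1:6, 1:6, "+")

# Probability of each total, as a fraction of 36 outcomes
p <- table(totals) / 36
round(100 * p, 2)      # in percent, as in the table above
```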

A sample calculation goes like this: the probability of winning a point 4 is P₄ (8.33) multiplied by the chance of getting a 4 out of the outcomes 4 or 7 (8.33/(8.33 + 16.67)), i.e., 8.33*8.33/(8.33 + 16.67) = 2.78. Similarly, the probability of losing a point 4 = P₄ (8.33) x the chance of getting a 7 out of 4 or 7 (16.67/(8.33 + 16.67)).

Dice Roll	Payoff	Probability	Return
7 or 11 (come-out win)	1	16.67 + 5.56 = 22.23	22.23
2, 3, or 12 (come-out loss)	-1	2.78 + 5.56 + 2.78 = 11.12	-11.12
Point 4 win	1	8.33*8.33/(8.33+16.67) = 2.78	2.78
Point 5 win	1	11.11*11.11/(11.11+16.67) = 4.44	4.44
Point 6 win	1	13.89*13.89/(13.89+16.67) = 6.31	6.31
Point 8 win	1	13.89*13.89/(13.89+16.67) = 6.31	6.31
Point 9 win	1	11.11*11.11/(11.11+16.67) = 4.44	4.44
Point 10 win	1	8.33*8.33/(8.33+16.67) = 2.78	2.78
Point 4 loss	-1	8.33*16.67/(8.33+16.67) = 5.55	-5.55
Point 5 loss	-1	11.11*16.67/(11.11+16.67) = 6.67	-6.67
Point 6 loss	-1	13.89*16.67/(13.89+16.67) = 7.58	-7.58
Point 8 loss	-1	13.89*16.67/(13.89+16.67) = 7.58	-7.58
Point 9 loss	-1	11.11*16.67/(11.11+16.67) = 6.67	-6.67
Point 10 loss	-1	8.33*16.67/(8.33+16.67) = 5.55	-5.55
Overall		100	-1.41

No surprise, the house wins, with an edge of about 1.41%.
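The house edge can be verified with exact fractions rather than rounded percentages. A sketch of the pass line calculation, following the payoff table above (variable names are mine):

```r
# Probability of each total of two fair dice
p <- table(outer(1:6, 1:6, "+")) / 36
p <- setNames(as.numeric(p), names(p))

points <- c("4", "5", "6", "8", "9", "10")

win_comeout  <- p["7"] + p["11"]
lose_comeout <- p["2"] + p["3"] + p["12"]

# After a point: win if the point repeats before a 7, lose otherwise
win_point  <- sum(p[points] * p[points] / (p[points] + p["7"]))
lose_point <- sum(p[points] * p["7"] / (p[points] + p["7"]))

expected_return <- unname((win_comeout + win_point) - (lose_comeout + lose_point))
round(100 * expected_return, 2)
```

The exact value is -7/495, about -1.41%; adding up the rounded entries of the Return column gives -1.43 instead, which is a rounding artefact.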
