The Z-score and Percentile

Student scores are found to follow a normal distribution with a mean of 75 and a standard deviation of 10. The top 10% are admitted to the university. What is the minimum mark a student needs to gain admission?

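The first step is to convert the percentage to a Z-score: the point on the standard normal curve with 10% of the area above it (equivalently, 90% below it). It can be done in one of two ways; both return approximately 1.28.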

qnorm(0.1, lower.tail = FALSE)
qnorm(0.9, lower.tail = TRUE)

Note that if you do not specify, the default for qnorm will be lower.tail = TRUE.

Z = (X – mean)/standard deviation
X = Z x standard deviation + mean
X = 1.28 x 10 + 75 = 87.8
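
The same calculation can be sketched in R. The variable names below are illustrative; the second approach passes the mean and standard deviation directly to qnorm, which skips the manual back-conversion from Z to X.

```r
# Parameters from the problem statement
mean_score <- 75
sd_score   <- 10

# Z-score that leaves 10% of the area in the upper tail
z <- qnorm(0.1, lower.tail = FALSE)

# Convert the Z-score back to a mark: X = Z x sd + mean
cutoff <- z * sd_score + mean_score

# Equivalent one-step version using qnorm's mean and sd arguments
cutoff_direct <- qnorm(0.9, mean = mean_score, sd = sd_score)

round(c(z = z, cutoff = cutoff, cutoff_direct = cutoff_direct), 2)
```

Using the unrounded Z-score gives a cutoff marginally above the hand calculation, which rounds Z to 1.28 first.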


Test for Independence – Illustration

We have seen how R calculates the chi-squared test for independence. This time, we will compute it manually to develop an intuition for the calculations. Here are the observed values.

       High School Bachelors Masters Ph.d. Total
Female          60        54      46    41   201
Male            40        44      53    57   194
Total          100        98      99    98   395

Now, the expected values are estimated by assuming independence, which allows us to multiply the marginal probabilities to obtain the joint probabilities.

First cell

The observed frequency for the cell (female, high school) is 60. The expected frequency, if gender and education level are independent, is the product of the marginal probabilities (being female and having a high school education) multiplied by the sample size: (201/395) x (100/395) x 395 = 50.88. The final multiplication by 395 converts the joint probability into a frequency. In the same way, we can estimate the other cells.

       High School Bachelors Masters Ph.d. Total
Female       50.88     49.87   50.38 49.87   201
Male         49.11     48.13   48.62 48.13   194
Total          100        98      99    98   395

chi-squared = sum[(observed - expected)^2 / expected]
= (60 - 50.88)^2/50.88 + (54 - 49.87)^2/49.87 + (46 - 50.38)^2/50.38 + (41 - 49.87)^2/49.87 + (40 - 49.11)^2/49.11 + (44 - 48.13)^2/48.13 + (53 - 48.62)^2/48.62 + (57 - 48.13)^2/48.13
= 8.008746

You can look up 8.008746 in a chi-squared table with degrees of freedom = (2 - 1) x (4 - 1) = 3 to get the p-value.
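
The manual steps can be sketched in R as a check. This is a minimal illustration with variable names of my choosing; it keeps the expected values unrounded, so the statistic comes out slightly lower than the hand calculation (about 8.006 rather than 8.008746), and pchisq replaces the table lookup.

```r
# Observed counts: rows are gender, columns are education level
observed <- matrix(c(60, 54, 46, 41,
                     40, 44, 53, 57),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Female", "Male"),
                                   c("High School", "Bachelors", "Masters", "Ph.d.")))

n <- sum(observed)

# Expected counts under independence: (row total x column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Chi-squared statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability replaces the chi-squared table lookup
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 2)
c(chi_sq = chi_sq, df = df, p_value = p_value)
```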


Law of Total Probability

Three machines make parts in a factory. The following information about the production line is available.

Machine 1 makes 50% of the parts
Machine 2 makes 25% of the parts
Machine 3 makes 25% of the parts

5% of the parts by Machine 1 are defective
10% of the parts by Machine 2 are defective
12% of the parts by Machine 3 are defective

If a part is randomly selected, what is the probability that it is defective?

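The solution rests on a fundamental rule of probability, the law of total probability, which expresses the probability of an event A through its joint (and hence conditional) probabilities over mutually exclusive and exhaustive events B1, B2, …, Bk.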

P(A) = P(A ∩ B1) + P(A ∩ B2) + … + P(A ∩ Bk)
P(A) = P(B1)P(A|B1) + P(B2)P(A|B2) + … + P(Bk)P(A|Bk)

Using this equation, we get
P(Defective) = P(Machine1)P(Defective|Machine1) + P(Machine2)P(Defective|Machine2) + P(Machine3)P(Defective|Machine3)
= 0.5×0.05 + 0.25×0.1 + 0.25×0.12 = 0.08 or 8% chance.
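
As a quick check, the weighted sum can be written in R. The vector names are illustrative only.

```r
# P(Machine i): share of parts made by each machine
p_machine <- c(M1 = 0.50, M2 = 0.25, M3 = 0.25)

# P(Defective | Machine i): defect rate of each machine
p_def_given_machine <- c(M1 = 0.05, M2 = 0.10, M3 = 0.12)

# The machines must partition the production line
stopifnot(isTRUE(all.equal(sum(p_machine), 1)))

# Law of total probability: sum of P(Bi) x P(A | Bi)
p_defective <- sum(p_machine * p_def_given_machine)
p_defective
```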


Gender and Education Level

Here is a step-by-step process for performing a chi-squared test of independence using R. The following is a survey result from a random sample of 395 people, recording each participant’s gender and education level. Based on the collected data, is there a relationship between gender and education level? Consider a 5% significance level.

       High School Bachelors Masters Ph.d. Total
Female          60        54      46    41   201
Male            40        44      53    57   194
Total          100        98      99    98   395

Step 1: Make a Table

survey <- matrix(c(60, 54, 46, 41, 40, 44, 53, 57), ncol = 4, byrow = TRUE)

colnames(survey) <- c('High School', 'Bachelors', 'Masters', 'Ph.d.')
rownames(survey) <- c('Female', 'Male')
survey

       High School Bachelors Masters Ph.d.
Female          60        54      46    41
Male            40        44      53    57

Step 2: Apply chisq.test on the table

chisq.test(survey)

	Pearson's Chi-squared test

data:  survey
X-squared = 8.0061, df = 3, p-value = 0.04589

Step 3: Interpret the results

The chi-squared statistic is 8.0061 with 3 degrees of freedom. As the p-value = 0.04589 < 0.05, we reject the null hypothesis of independence; at the 5% significance level, education level and gender are associated.
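
The decision in Step 3 can also be read off the object that chisq.test returns. The sketch below rebuilds the table so that it runs on its own; the names result and alpha are illustrative.

```r
survey <- matrix(c(60, 54, 46, 41, 40, 44, 53, 57), ncol = 4, byrow = TRUE,
                 dimnames = list(c("Female", "Male"),
                                 c("High School", "Bachelors", "Masters", "Ph.d.")))

result <- chisq.test(survey)
alpha  <- 0.05

# Expected counts under independence, as used internally by the test
round(result$expected, 2)

# TRUE means the null hypothesis of independence is rejected at level alpha
result$p.value < alpha
```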

Chi-Square Tests: PennState


Children and Poisson Distribution

A recent survey showed the number of children per household as follows:

# Children per household    0     1     2     3     4     5
Survey value                4.3   1.0   2.3   0.3   0.2   0

Check how the data compares with a Poisson distribution with lambda = 1.

You may recall that lambda is the expected value of the Poisson distribution. First, we compute the Poisson probabilities for each category, from 0 to 5 children per household.

poi_prob <- dpois(0:5, 1)
0.367879441 0.367879441 0.183939721 0.061313240 0.015328310 0.003065662

Having established the expected probabilities, we perform the chi-squared test.

poi_prob <- dpois(0:5, 1)
child_perHouse <- c(4.3, 1.0, 2.3, 0.3, 0.2, 0)
chisq.test(child_perHouse, p = poi_prob, rescale.p = TRUE)
	Chi-squared test for given probabilities

data:  child_perHouse
X-squared = 2.4883, df = 5, p-value = 0.7783
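
To see what rescale.p = TRUE does, the statistic can be reproduced by hand. This is a sketch with illustrative variable names: the six Poisson probabilities do not sum to exactly 1 (the tail beyond 5 is cut off), so they are rescaled before computing the expected values. Note that several expected values are well below 5, so the chi-squared approximation is rough here.

```r
poi_prob       <- dpois(0:5, 1)
child_perHouse <- c(4.3, 1.0, 2.3, 0.3, 0.2, 0)

# Rescale so the six probabilities sum to 1 (what rescale.p = TRUE does)
p_rescaled <- poi_prob / sum(poi_prob)

# Expected values under the Poisson model, on the same scale as the data
expected <- sum(child_perHouse) * p_rescaled

# Goodness-of-fit statistic, degrees of freedom, and p-value
chi_sq  <- sum((child_perHouse - expected)^2 / expected)
df      <- length(child_perHouse) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```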


Family-Wise Error Rate and The Bonferroni Correction

We have seen the family-wise error rate (FWER) as the probability of making at least one Type I error when conducting n hypothesis tests.

FWER = P(falsely reject at least one null hypothesis)
= 1 - P(do not falsely reject any null hypothesis)
= 1 - P(∩_{j=1}^{n} {do not falsely reject H_{0,j}})

If the tests are independent and each is run at significance level α, the required probability equals (1 - α)^n, and
FWER = 1 - (1 - α)^n

For example, if the significance level is 0.05 (α) for five tests,
FWER = 1 - (1 - 0.05)^5 = 0.226
And, if you make n = 100 independent tests,
FWER = 1 - (1 - 0.05)^100 = 0.994; at least one Type I error is almost certain.

One of the classical methods of managing the FWER is the Bonferroni correction. As per this, the corrected alpha is the original alpha divided by the number of tests, n.
Bonferroni corrected α = original α / n

For five tests,
FWER = 1 - (1 - 0.05/5)^5 = 0.049; and for 100 tests
FWER = 1 - (1 - 0.05/100)^100 = 0.049
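
These numbers are easy to reproduce in R. The helper fwer below is a hypothetical name introduced only for this sketch, and it assumes independent tests, as in the formula above.

```r
# FWER for n independent tests, each run at significance level alpha
fwer <- function(alpha, n) 1 - (1 - alpha)^n

n_tests <- c(5, 100)

# Without correction: every test uses alpha = 0.05
uncorrected <- fwer(0.05, n_tests)

# With the Bonferroni correction: every test uses alpha / n
corrected <- fwer(0.05 / n_tests, n_tests)

round(data.frame(n_tests, uncorrected, corrected), 3)
```

The corrected FWER stays just under the original 0.05 regardless of the number of tests, which is the point of the correction.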


Family-Wise Error Rate

Imagine a fair coin is tossed 10 times to test the hypothesis, H0: the coin is unbiased. A fair coin will most likely land on heads about 5 times, though counts such as 4 or 6 are also common. If it lands 4 times on heads and 6 times on tails, we can do a simple chi-squared test to verify.
chi-squared = (4 - 5)^2/5 + (6 - 5)^2/5 = 0.4. This is far below the critical value (3.84 at the 5% level for one degree of freedom), so we cannot reject the null hypothesis.

But what happens if all the tosses land on tails?
chi-squared = (0 - 5)^2/5 + (10 - 5)^2/5 = 10. We reject the null hypothesis at the 1% significance level. We know the probability of this happening with a fair coin is (1/2)^10 = 1/1024.

chisq.test(c(0, 10))

	Chi-squared test for given probabilities

data:  c(0, 10)
X-squared = 10, df = 1, p-value = 0.001565

What about 1024 players, each tossing a (fair) coin 10 times? The probability that at least one of them gets all tails is:

1 - dbinom(0, 1024, 1/1024)

In other words, there is about a 63% chance that at least one player sees ten tails in a row, rejects the null hypothesis, and concludes that the coin is biased! This is an incorrect rejection of a true null hypothesis, i.e., a false positive. So, if we test a lot (a family) of hypotheses, there is a high probability of getting at least one very small p-value by chance.
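
The probability can be checked against a small simulation. This is a sketch only; the seed and the number of repetitions (2000) are arbitrary choices, and the simulated value will differ slightly from the analytic one.

```r
n_players   <- 1024
p_all_tails <- (1 / 2)^10   # probability that one player gets 10 tails

# Analytic probability that at least one player gets all tails
analytic <- 1 - (1 - p_all_tails)^n_players

# Simulation: each player's number of heads in 10 tosses is binomial;
# zero heads means all ten tosses landed on tails
set.seed(1)
simulated <- mean(replicate(2000,
  any(rbinom(n_players, size = 10, prob = 0.5) == 0)))

c(analytic = analytic, simulated = simulated)
```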


The Log-Rank Test for Survival

Here are Kaplan-Meier plots for males and females taken from the results of a cancer study. The data comes from the ‘BrainCancer’ dataset from the R library ISLR2. It contains the survival times for patients with primary brain tumours undergoing treatment.

At first glance, it appears that females were doing better, up to about 50 months, until the two lines merged. The question is: is the difference (between the two survival plots) statistically significant?

You may think of using two-sample t-tests comparing the means of survival times. But the presence of censoring makes life difficult. So, we use the log-rank test.

The idea here is to test a null hypothesis, H0, about the expected value of a random variable X, and to build a test statistic of the following form, where E(X) and Var(X) are the expectation and variance of X under H0,

W = \frac{X - E(X)}{\sqrt{Var(X)}}

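X is the total number of deaths in one of the two groups (group 1), summed over the K distinct event times; q_{1k} is the number of group 1 patients who died at the k-th event time.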

X = \sum\limits_{k = 1}^{K} q_{1k}
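
To make the statistic concrete, here is a minimal base-R sketch on a small made-up dataset (the times, statuses, and groups below are hypothetical, not taken from BrainCancer). It assumes the standard log-rank expressions: at each event time with r patients at risk (r1 of them in group 1) and q deaths, the expected number of group 1 deaths under H0 is r1 x q / r, and the variance is the hypergeometric one. The symbols r, r1, and q are my notation.

```r
# Hypothetical data: status 1 = event observed, 0 = censored
time   <- c(5, 8, 12, 15, 20, 6, 9, 14, 18, 25)
status <- c(1, 1, 0, 1, 1, 1, 0, 1, 1, 0)
group  <- rep(c(1, 2), each = 5)

event_times <- sort(unique(time[status == 1]))

x <- 0; ex <- 0; vx <- 0
for (t in event_times) {
  r  <- sum(time >= t)                             # at risk, both groups
  r1 <- sum(time >= t & group == 1)                # at risk, group 1
  q  <- sum(time == t & status == 1)               # deaths, both groups
  q1 <- sum(time == t & status == 1 & group == 1)  # deaths, group 1

  x  <- x + q1            # observed deaths in group 1
  ex <- ex + r1 * q / r   # expected deaths in group 1 under H0
  if (r > 1) {
    vx <- vx + q * (r1 / r) * (1 - r1 / r) * (r - q) / (r - 1)
  }
}

W <- (x - ex) / sqrt(vx)

# W^2 is compared with a chi-squared distribution with 1 degree of freedom
p_value <- pchisq(W^2, df = 1, lower.tail = FALSE)
c(X = x, E_X = ex, Var_X = vx, W = W, p_value = p_value)
```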

R does the job for you; use the survival library.

# Requires library(survival); assumes attach(BrainCancer) after library(ISLR2)
survdiff(Surv(time, status) ~ sex)

Call:
survdiff(formula = Surv(time, status) ~ sex)

            N Observed Expected (O-E)^2/E (O-E)^2/V
sex=Female 45       15     18.5     0.676      1.44
sex=Male   43       20     16.5     0.761      1.44

 Chisq= 1.4  on 1 degrees of freedom, p= 0.2 

p = 0.2; we cannot reject the null hypothesis of no difference in survival curves between females and males.


An introduction to Statistical Learning: James, Witten, Hastie, Tibshirani, Taylor


Kaplan-Meier Estimate

The outcome variable in survival analysis is the time until an event occurs. Since studies are often time-bounded, some patients may not have experienced the event by the end of the study, and others may stop responding to the survey midway through. In either case, those patients’ survival times are censored. As censored patients still provide valuable data, the analyst faces a dilemma over whether to discard them.

Let’s examine five patients in a study. The filled circles represent the completion of the event (e.g., death), and the open circles represent the censoring (either dropping out or surviving the study’s end date).

The survival function, S(t), is the probability that the true survival time (T) exceeds some fixed number t.
S(t) = P(T > t)
S(t) never increases with t: the longer the time horizon, the lower the probability of surviving beyond it.

In the above example, how do you conclude the probability of surviving 300 days, S(300)? Will it be 1/3 = 0.33 (only one of the three patients with an observed event survived past 300 days, ignoring the censored ones) or 3/5 = 0.6 (assuming the censored patients also survived)? And what difference does it make to the conclusion if one of them dropped out early because she was too sick?

Kaplan and Meier came up with a smart solution to this. Note that they worked on this problem separately. Their survival curve is made the following way.
1) The first event happened at time 100. The probability of survival at t = 100 is 4/5, noting that four of the five patients were known to have survived that stage.

2) We now proceed to the next event, patient 3. Note that we skipped the censored time of patient 2.

Three patients were still at risk just before this event, and two of them survived it. The overall survival probability at t = 200 is (4/5) x (2/3).

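3) Move to the last event (patient 5). None of the patients still at risk survives it, so the survival function drops to zero: (4/5) x (2/3) x 0 = 0. Plotting these step-wise probabilities against time gives the Kaplan-Meier plot.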
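
The three steps can be sketched in base R. Only the event times 100 and 200 come from the example; the censoring times (150 and 250) and the last event time (350) are assumed values chosen to match the order of events described above.

```r
# Five patients; status 1 = event, 0 = censored.
# Times 150, 250, and 350 are assumptions made for illustration.
time   <- c(100, 150, 200, 250, 350)
status <- c(1,   0,   1,   0,   1)

event_times <- sort(unique(time[status == 1]))

# Patients still at risk just before each event, and events at that time
at_risk <- sapply(event_times, function(t) sum(time >= t))
deaths  <- sapply(event_times, function(t) sum(time == t & status == 1))

# Kaplan-Meier estimate: running product of the conditional survival fractions
surv <- cumprod(1 - deaths / at_risk)

data.frame(time = event_times, at_risk, deaths, surv)
```

Censored patients never count as deaths, but they do count in at_risk up to the moment they are censored, which is how their partial information is retained.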


Survival analysis – Censoring

We have seen survival plots before. Survival plots represent ‘time to event’ in survival analysis. For example, in the case of cancer diagnostics, survival analysis measures the time it takes from exposure to the event, which is most likely death.

These analyses are done by following a group of candidates (or patients) between two time points, i.e., the start and end of the study. Candidates are enrolled at different times during the period, and the ‘time to event’ is noted down. Censoring is a term in survival analysis that denotes cases where the researcher does not know the exact time-to-event for an included observation.

Right censoring
The term is used when you know the person is still surviving at the end of the study period. Let x_i be the time from enrollment to the end of the study; all we then know is that the time-to-event t_i > x_i. Imagine a study that started in 2010 and ended in 2020, and a person who enrolled in 2018 and was still alive at the study’s end. So we know that t_i > 2 years. The same category applies to patients who are lost to follow-up.
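
In data, right censoring is usually recorded as an observed time plus a status flag. The sketch below uses three hypothetical patients; the third one is the 2018 enrollee from the example, while the other enrollment and event years are made up.

```r
study_end <- 2020

# Hypothetical patients; NA means no event was observed during the study
enrolled   <- c(2010, 2012, 2018)
event_year <- c(2015, NA,   NA)

# Observed time: time to event if seen, otherwise time to the end of the study
observed_time <- ifelse(is.na(event_year),
                        study_end - enrolled,
                        event_year - enrolled)

# status 1 = event observed, 0 = right-censored (true time exceeds observed_time)
status <- as.integer(!is.na(event_year))

data.frame(enrolled, observed_time, status)
```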

Left censoring
This happens in observational studies, where the event has already occurred before the subject enters the study. Because of this, the researcher cannot observe the time when the event occurred. Obviously, this can’t happen if the event is death.

Interval censoring
It occurs when the time until an event of interest is not known precisely and is instead only known to fall between two time stamps.
