Randomness and the resulting scatter in data can confuse people interpreting observations. Take this example: from studies, we know that 10% of the population is left-handed. You survey 150 randomly selected people and find that 20 are left-handed. Does this contradict the theory, or is it just normal variation? What do we do?
Goodness of fit
You perform a chi-square goodness of fit test on the data.
Handedness | Observed (O) | Expected (E) | (O-E)^2/E |
Left | 20 | 15 | 25/15 |
Right | 130 | 135 | 25/135 |
Total | 150 | 150 | 1.85 |
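The arithmetic in the table can be reproduced in a few lines of R. This is only a minimal sketch: the variable names (observed, expected, contrib, chisq_stat) are chosen here for illustration and do not come from any package.

```r
# Observed counts from the survey: left-handed and right-handed
observed <- c(left = 20, right = 130)

# Expected counts if 10% of the 150 people were left-handed
expected <- sum(observed) * c(left = 0.1, right = 0.9)

# Contribution of each category, (O - E)^2 / E, and their sum
contrib <- (observed - expected)^2 / expected
chisq_stat <- sum(contrib)

print(round(contrib, 4))
print(round(chisq_stat, 2))
```

The two contributions are 25/15 and 25/135, and their sum is the chi-square value in the Total row.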
We will test the notion (that 10% are left-handed) at a 5% significance level. In other words, we reject it only if the calculated chi-square falls among the most extreme 5% of the values we would expect if the notion were true. In our case, the alternative hypothesis is that the proportion of lefties is not 10% of the population; the chi-square statistic measures the size of the deviation, not its direction. So how do you find the critical value at a 0.05 (5%) significance level? The old-fashioned way is a lookup table, where you find the number by matching the degrees of freedom (in this case, df = 1, one less than the number of categories) and the significance level. Instead, we use the following R code to get it.
qchisq(0.05, 1, lower.tail = FALSE) # qchisq(p, df)
The answer is 3.84. In other words, the calculated chi-square value needs to be greater than 3.84 to reject the notion (the null hypothesis). In our case, it is 1.85, which is less than 3.84, so we can’t reject the notion of 10% lefties, even though we see 20 in 150!
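The decision rule can also be written out as a direct comparison. The sketch below assumes the chi-square value of 1.85 from the table; the names alpha, critical, and reject_null are illustrative only.

```r
chisq_stat <- 1.85   # calculated chi-square value from the table
alpha <- 0.05        # significance level

# Critical value: the point with 5% of the distribution to its right (df = 1)
critical <- qchisq(alpha, df = 1, lower.tail = FALSE)

# Reject the null hypothesis only if the statistic exceeds the critical value
reject_null <- chisq_stat > critical

print(round(critical, 2))   # 3.84
print(reject_null)          # FALSE
```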
p-value
How do we calculate our favourite p-value from this? We plug the chi-square value (1.85) into the pchisq function.
pchisq(1.85, df=1, lower.tail=FALSE)
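The answer is about 0.174. Needless to say, pchisq is the inverse of qchisq: pchisq turns a chi-square value into a tail probability, and qchisq turns the probability back into the chi-square value. In other words,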
qchisq(0.1737, 1, lower.tail = FALSE)
gives 1.85.
Everything in one step
The following R code does everything from the start:
obsfreq <- c(20, 130)        # observed counts: left-handed, right-handed
nullprobs <- c(0.1, 0.9)     # proportions under the null hypothesis
chisq.test(obsfreq, p = nullprobs)
The answer will be in the following format:
Chi-squared test for given probabilities
data: obsfreq
X-squared = 1.8519, df = 1, p-value = 0.1736
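If you need the numbers rather than the printed summary, the object returned by chisq.test can be stored and its components read individually. The sketch below uses the standard components of that object (statistic, parameter, p.value, expected); the name result is just an illustrative choice.

```r
obsfreq <- c(20, 130)
nullprobs <- c(0.1, 0.9)

# Store the test result instead of printing it
result <- chisq.test(obsfreq, p = nullprobs)

print(result$statistic)   # the chi-square value (X-squared)
print(result$parameter)   # degrees of freedom
print(result$p.value)     # p-value
print(result$expected)    # expected counts under the null hypothesis

# The same decision as before, now based on the p-value
print(result$p.value < 0.05)   # FALSE
```

Comparing the p-value with 0.05 is equivalent to comparing the chi-square value with the critical value of 3.84.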