Data & Statistics

Bland-Altman Plot

Bland-Altman analysis is used to study the agreement between two measurement methods. Here is how the plot is created in R.

Step 1: Collect the two measurements

Sample_Data <- data.frame(
  A = c(6, 5, 3, 5, 6, 6, 5, 4, 7, 8, 9, 10, 11, 13, 10, 4, 15, 8, 22, 5),
  B = c(5, 4, 3, 5, 5, 6, 8, 6, 4, 7, 7, 11, 13, 5, 10, 11, 14, 8, 9, 4))

Step 2: Calculate the mean of measurement 1 and measurement 2 for each pair

Sample_Data$average <- rowMeans(Sample_Data) 

Step 3: Calculate the difference between measurement 1 and measurement 2

Sample_Data$difference <- Sample_Data$A - Sample_Data$B

Step 4: Calculate the limits of agreement; the factor 1.96 corresponds to a 95% confidence level

mean_difference <- mean(Sample_Data$difference)
lower_limit <- mean_difference - 1.96*sd( Sample_Data$difference )
upper_limit <- mean_difference + 1.96*sd( Sample_Data$difference )

Step 5: Create a scatter plot with the average on the X-axis and the difference on the Y-axis. Mark the mean difference and the limits of agreement.

library(ggplot2)

ggplot(Sample_Data, aes(x = average, y = difference)) +
  geom_point(size = 3) +
  geom_hline(yintercept = mean_difference, color = "red", lwd = 1.5) +
  geom_hline(yintercept = lower_limit, color = "green", lwd = 1.5) +
  geom_hline(yintercept = upper_limit, color = "green", lwd = 1.5) +
  ggtitle("Bland-Altman Plot") +
  ylab("Difference") +
  xlab("Average")


Larry Bird and Binomial Distribution

Following are the free throw statistics from basketball great Larry Bird’s two seasons.

Total pairs of throws: 338
Pairs where both throws missed: 5
Pairs where one missed: 82
Pairs where both made: 251

Test the hypothesis that Mr Bird's free throws follow a binomial distribution with p = 0.8.
H0: The number of successful throws in a pair follows a binomial distribution with p = 0.8
HA: The distribution does not follow a binomial distribution with p = 0.8

We will use the chi-squared goodness-of-fit test. The probabilities of making 0, 1 and 2 free throws in a pair, for a shooter with a success probability of 0.8, are

bino_prob <- dbinom(0:2, 2, 0.8)
 0.04 0.32 0.64

The chi-square test is:

bird_throws <- c(5, 82, 251)
chisq.test(bird_throws, p = bino_prob)

	Chi-squared test for given probabilities

data:  bird_throws
X-squared = 17.256, df = 2, p-value = 0.000179

As the p-value is far below 0.05, we reject the null hypothesis: the free throw pairs do not follow a binomial distribution with p = 0.8.
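The statistic can also be computed by hand from the definition; here is a sketch (the variable names are mine, and the observed counts are the pair statistics above):

```r
# Binomial probabilities of making 0, 1 and 2 throws out of a pair
bino_prob <- dbinom(0:2, 2, 0.8)

observed <- c(5, 82, 251)    # both missed, one made, both made
expected <- 338 * bino_prob  # 13.52 108.16 216.32

chi_sq <- sum((observed - expected)^2 / expected)
chi_sq                                       # 17.256
pchisq(chi_sq, df = 2, lower.tail = FALSE)   # p-value, about 0.00018
```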


This Sentence is False!

‘This sentence is false’ is an example of what is known as the Liar Paradox.

This sentence is false.

Look at the first option for the answer: true. If the sentence is true, then what it says about itself holds, namely that it is false. It would then be both true and false, a contradiction, so the answer 'true' is not acceptable.

The second option is false. If the sentence is false, then its claim that it is false is itself false, which makes the sentence true: again a contradiction.


The Z-score and Percentile

It has been found that the scores obtained by students follow a normal distribution with a mean of 75 and a standard deviation of 10. The top 10% get admission to the university. What is the minimum mark a student needs for admission?

The first step is to convert the percentile to a Z-score. It can be done in either of two ways.

qnorm(0.1, lower.tail = FALSE)
qnorm(0.9, lower.tail = TRUE)
1.28

Note that if you do not specify it, the default for qnorm is lower.tail = TRUE.

Z = (X – mean)/standard deviation
X = Z x standard deviation + mean
X = 1.28 x 10 + 75 = 87.8
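The conversion back to the original scale can also be done in a single step by passing the mean and standard deviation directly to qnorm:

```r
# Minimum mark for the top 10%, computed directly
cutoff <- qnorm(0.9, mean = 75, sd = 10)
cutoff   # 87.81552, i.e., about 87.8
```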


Test for Independence – Illustration

We have seen how R calculates the chi-squared test for independence. This time, we will estimate it manually while developing an intuition of the calculations. Here are the observed values.

        High School  Bachelors  Masters  Ph.d.  Total
Female           60         54       46     41    201
Male             40         44       53     57    194
Total           100         98       99     98    395

Now, the expected values are estimated by assuming independence, which allows us to multiply the marginal probabilities to obtain the joint probabilities.

First cell

The observed frequency for the female/high-school cell is 60. The expected frequency, if the variables are independent, is the product of the marginal probabilities (being female and being in high school) times the total: (201/395) x (100/395) x 395 = 50.88. The final multiplication by 395 converts the probability into a frequency. In the same way, we can estimate the other cells.

        High School  Bachelors  Masters  Ph.d.  Total
Female        50.88      49.87    50.38  49.87    201
Male          49.11      48.13    48.62  48.13    194
Total           100         98       99     98    395
chi-squared = sum of (observed - expected)^2 / expected over all cells
= (60 - 50.88)^2/50.88 + (54 - 49.87)^2/49.87 + (46 - 50.38)^2/50.38 + (41 - 49.87)^2/49.87 + (40 - 49.11)^2/49.11 + (44 - 48.13)^2/48.13 + (53 - 48.62)^2/48.62 + (57 - 48.13)^2/48.13
= 8.008746

You can look up the chi-squared value 8.008746 with degrees of freedom = (2 - 1) x (4 - 1) = 3 in the chi-squared table for the p-value.
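The whole manual calculation can be sketched in a few lines of R (the variable names are mine):

```r
observed <- matrix(c(60, 54, 46, 41,
                     40, 44, 53, 57), nrow = 2, byrow = TRUE)

# Expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi_sq <- sum((observed - expected)^2 / expected)
chi_sq                                       # 8.0061 (8.008746 above, due to rounded expected values)
pchisq(chi_sq, df = 3, lower.tail = FALSE)   # p-value, about 0.0459
```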


Law of Total Probability

Three machines make parts in a factory. The following information about the production line is available.

Machine 1 makes 50% of the parts
Machine 2 makes 25% of the parts
Machine 3 makes 25% of the parts

5% of the parts by Machine 1 are defective
10% of the parts by Machine 2 are defective
12% of the parts by Machine 3 are defective

If a part is randomly selected, what is the probability that it is defective?

The solution uses a fundamental rule, the law of total probability, which relates the total probability of an event to marginal and conditional probabilities.

P(A) = P(A ∩ B1) + P(A ∩ B2) + … + P(A ∩ Bk)
P(A) = P(B1)P(A|B1) + P(B2)P(A|B2) + … + P(Bk)P(A|Bk)

Using this equation, we get
P(Defective) = P(Machine1)P(Defective|Machine1) + P(Machine2)P(Defective|Machine2) + P(Machine3)P(Defective|Machine3)
= 0.5×0.05 + 0.25×0.1 + 0.25×0.12 = 0.08 or 8% chance.
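In R, the calculation is a single dot product of the marginal and conditional probabilities (the variable names are mine):

```r
p_machine <- c(0.5, 0.25, 0.25)    # P(Machine i)
p_defect  <- c(0.05, 0.10, 0.12)   # P(Defective | Machine i)

p_defective <- sum(p_machine * p_defect)
p_defective   # 0.08
```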


Gender and Education Level

Here is a step-by-step process for performing a chi-squared test of independence using R. The following is a survey result from a random sample of 395 people. The survey asked about participants' gender and education level. Based on the collected data, do you find any relationship between the two? Consider a 5% significance level.

        High School  Bachelors  Masters  Ph.d.  Total
Female           60         54       46     41    201
Male             40         44       53     57    194
Total           100         98       99     98    395

Step 1: Make a Table

data <- matrix(c(60, 54, 46, 41, 40, 44, 53, 57), ncol = 4, byrow = TRUE)

colnames(data) <- c('High School', 'Bachelors', 'Masters', 'Ph.d.')
rownames(data) <- c('Female', 'Male')

survey <- as.table(data)
survey
       High School Bachelors Masters Ph.d.
Female          60        54      46    41
Male            40        44      53    57

Step 2: Apply chisq.test on the table

chisq.test(survey)
	Pearson's Chi-squared test

data:  survey
X-squared = 8.0061, df = 3, p-value = 0.04589

Step 3: Interpret the results

The chi-squared statistic is 8.0061 with 3 degrees of freedom. As the p-value = 0.04589 < 0.05, we reject the null hypothesis of independence; education level is associated with gender at the 5% significance level.

Chi-Square Tests: PennState


Children and Poisson Distribution

A recent survey showed the number of children per household as follows:

# Children per household    Population
0                           4.3
1                           1.0
2                           2.3
3                           0.3
4                           0.2
5                           0

Check how the data compares with a Poisson distribution with lambda = 1.

You may recall that lambda is the expected value of the Poisson distribution. First, we create the Poisson probabilities for each category, from 0 to 5 children per household.

poi_prob <- dpois(0:5, 1)
0.367879441 0.367879441 0.183939721 0.061313240 0.015328310 0.003065662

Having established the expected probabilities, we perform the chi-squared test.

poi_prob <- dpois(0:5, 1)
child_perHouse <- c(4.3, 1.0, 2.3, 0.3, 0.2, 0)
chisq.test(child_perHouse, p = poi_prob, rescale.p = TRUE)
	Chi-squared test for given probabilities

data:  child_perHouse
X-squared = 2.4883, df = 5, p-value = 0.7783
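A note on rescale.p = TRUE: the six Poisson probabilities cover only the categories 0 to 5, so they do not sum exactly to one, which chisq.test would otherwise reject:

```r
poi_prob <- dpois(0:5, 1)
sum(poi_prob)   # 0.9994058, slightly short of 1 because 6 or more children is excluded
```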


Family-Wise Error Rate and The Bonferroni Correction

We have seen the family-wise error rate (FWER) as the probability of making at least one Type I error when conducting n hypothesis tests.

FWER = P(falsely reject at least one null hypothesis)
= 1 – P(do not reject any null hypothesis)
= 1 – P(∩_{j=1}^{n} {do not falsely reject H0,j})

If each of these tests is independent, the required probability equals (1 – α)^n, and
FWER = 1 – (1 – α)^n

For example, if the significance level α is 0.05 for five tests,
FWER = 1 – (1 – 0.05)^5 = 0.226
And, if you make n = 100 independent tests,
FWER = 1 – (1 – 0.05)^100 = 0.994; you are virtually guaranteed to make at least one Type I error.

One of the classical methods of managing the FWER is the Bonferroni correction. As per this, the corrected alpha is the original alpha divided by the number of tests, n.
Bonferroni corrected α = original α / n

For five tests,
FWER = 1 – (1 – 0.05/5)^5 = 0.049; and for 100 tests
FWER = 1 – (1 – 0.05/100)^100 = 0.049
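These numbers can be checked with a small helper function (mine, not from the post):

```r
# FWER for n independent tests at per-test significance level alpha
fwer <- function(alpha, n) 1 - (1 - alpha)^n

fwer(0.05, 5)          # 0.226
fwer(0.05, 100)        # 0.994
fwer(0.05 / 5, 5)      # 0.049, Bonferroni-corrected
fwer(0.05 / 100, 100)  # 0.049, Bonferroni-corrected
```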
