Data & Statistics

Sampling Methods: Non-Probabilistic Sampling

In non-probability sampling, in contrast with its probability analogue, some elements in the population have a zero or unknown probability of being selected. Because of this, the representativeness of the sample becomes doubtful, and the margin of error cannot be reliably estimated.

Convenience Sampling

As the name suggests, convenience sampling picks the samples that are easily accessible. For example, a researcher studying the relationship between coffee drinking and body mass surveys her close friends and relatives!

Snowball Sampling

In snowball sampling, a small group of participants is recruited initially (as per convenience!). The sample is then extended by asking the first group to refer their contacts as new participants, who in turn refer theirs, and so on.

Purposive Sampling

In purposive sampling, the researcher uses her judgment to select the participants. A typical example is seeking expert opinion.

Quota Sampling

Quota sampling is similar to stratified sampling in that the population is first divided into subgroups; however, the elements within each subgroup are then selected by convenience rather than at random.


Sampling Methods: Probabilistic Sampling

As we have seen earlier, a statistician estimates the population parameters from the sample statistics. Sampling is the all-important process of selecting subjects or groups that provide the (representative) data required for the work.

Sampling can be of two types – probabilistic and non-probabilistic.

In probabilistic sampling, individual samples are selected based on a known probability distribution. In other words, each element in the group has a known and non-zero probability of being selected. This minimises the risk of systematic bias, i.e., the over- or under-representation of sub-groups while picking participants. There are four major types of probabilistic sampling.

Random Sampling

In simple random sampling, each element in the sampling frame has an equal and independent probability of being included. It works well when the population is homogeneous. Random sampling is usually done without replacement, although sampling with replacement is also valid. A simple way to do it is to number all cases in the sampling frame and draw uniform random numbers to pick them.
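A minimal sketch of the draw described above, assuming a hypothetical sampling frame of 100 numbered cases (the frame size, the sample size and the seed are all illustrative choices, not from the original post):

```r
# Hypothetical sampling frame: 100 cases, identified by number
set.seed(42)                      # only to make the sketch reproducible
frame_ids <- 1:100

# Simple random sample of 10 cases, without replacement (the usual choice)
srs_without <- sample(frame_ids, size = 10)

# The same draw with replacement: a case may appear more than once
srs_with <- sample(frame_ids, size = 10, replace = TRUE)

print(srs_without)
print(srs_with)
```

Without replacement, all ten selected cases are distinct; with replacement, repeats are possible.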

Stratified Sampling

In stratified random sampling, the population is divided into multiple mutually exclusive strata. Sampling is then done within each stratum separately, using random sampling. The separately sampled elements are added together to form the final sample. This technique is critical in less homogeneous populations, as it ensures that every stratum is represented in the sample.
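As a rough illustration, here is one way to do this in base R. The data frame `frame`, its four strata and the 10% sampling fraction are hypothetical choices made for this sketch:

```r
set.seed(42)                      # only to make the sketch reproducible

# Hypothetical sampling frame with four strata of unequal size
frame <- data.frame(
  id      = 1:100,
  stratum = rep(c("A", "B", "C", "D"), times = c(40, 30, 20, 10))
)

# Draw 10% of each stratum by simple random sampling, stratum by stratum
ids_by_stratum <- split(frame$id, frame$stratum)
picked <- unlist(lapply(ids_by_stratum, function(ids) {
  ids[sample.int(length(ids), size = length(ids) %/% 10)]
}))

# Combine the separately sampled elements into the final sample
strat_sample <- frame[frame$id %in% picked, ]
print(table(strat_sample$stratum))
```

Because the same fraction is taken from every stratum, the sample keeps the 40:30:20:10 proportions of the frame.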

Cluster Sampling

In multistage cluster sampling, samples are randomly selected in stages. The steps are:
1) Divide the population into mutually exclusive clusters.
2) Use random sampling to select clusters.
3) Use second-level random sampling inside the selected clusters to select the samples.
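The three steps above can be sketched as follows. The frame of 20 clusters with 10 elements each, and the choice of 4 clusters and 3 elements per cluster, are assumptions made only for illustration:

```r
set.seed(42)                      # only to make the sketch reproducible

# Step 1: hypothetical population of 200 elements in 20 mutually exclusive clusters
frame <- data.frame(id = 1:200, cluster = rep(1:20, each = 10))

# Step 2: randomly select 4 of the 20 clusters
chosen_clusters <- sample(unique(frame$cluster), size = 4)
in_chosen <- frame[frame$cluster %in% chosen_clusters, ]

# Step 3: randomly select 3 elements inside each chosen cluster
picked <- unlist(lapply(split(in_chosen$id, in_chosen$cluster), function(ids) {
  ids[sample.int(length(ids), size = 3)]
}))

cluster_sample <- frame[frame$id %in% picked, ]
print(cluster_sample)
```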

Bootstrap Aggregating

In bootstrap aggregating, or bagging, several samples are generated (or bagged) by drawing randomly, with replacement, from the available data. A separate model is then fitted to each sample, and the results are aggregated.
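A minimal sketch of the resample-and-aggregate idea, assuming a hypothetical numeric data set `x` and using the sample mean as a stand-in for whatever model would be fitted to each bag:

```r
set.seed(42)                      # only to make the sketch reproducible

# Hypothetical observed data (50 values); not from the original post
x <- rnorm(50, mean = 10, sd = 2)

# Draw 200 bootstrap samples (same size as x, with replacement)
# and compute the statistic of interest on each one
n_bags <- 200
boot_means <- replicate(n_bags, mean(sample(x, replace = TRUE)))

# Aggregate the per-sample results into a single estimate
bagged_estimate <- mean(boot_means)
print(bagged_estimate)
print(sd(boot_means))             # spread of the estimate across the bags
```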


Pascal’s Mugging

Remember Pascal’s Wager? It was an argument for believing in God based on a payoff matrix that is heavily dominated by the case in which God exists. So, the conclusion was to believe in God without seeking evidence. Pascal’s mugging is a similar argument that doesn’t require infinite rewards; the concept was coined by Eliezer Yudkowsky and later elaborated by Nick Bostrom.

Pascal was walking down the street when a shady-looking man stopped him and asked for his wallet. It was an attempted mugging, but the man had no weapon to threaten his victim with. Knowing this, the mugger offers a deal instead: “Give me your wallet and I will give back double the money tomorrow.”

A utilitarian who trusts expected value theory can make the following calculation and decide:
Imagine I have $10 in my wallet. To break even, the probability that the mugger keeps his promise must be at least 50%.
This is because the expected value = -10 + 0.5 x 20 = 0. Since the shady-looking man is not convincing enough to have such a high chance of repaying, I decide not to hand over my wallet.

Hearing the answer, the mugger raises the offer to 10x. Now, if Pascal thinks the mugger has more than a 1/10 chance of paying 10x, he should hand over the wallet. The answer is again negative. The mugger then raises the payment to 1000, 10000, and a million. What would Pascal do?

Now, Pascal is in a dilemma. On the one hand, he knows that the probability that the mugger will pay a million dollars is close to zero, and therefore, he must not hand over the wallet. However, as a rational utilitarian, he can’t ignore the fact that the expected value turns positive if the payback is 1 million dollars and the chance of repayment is even slightly above 1 in 100,000.
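To see how quickly the break-even probability shrinks, here is a small sketch of the expected-value calculation. It assumes the $10 wallet from the example and reads the successive offers as paybacks of $20, $100, $1,000, $10,000 and $1,000,000; the function name `expected_value` is made up for this illustration.

```r
wallet  <- 10                                  # dollars handed over today
payback <- c(20, 100, 1000, 10000, 1e6)        # dollars promised for tomorrow

# Expected value of accepting: lose the wallet for sure, gain the payback with probability p
expected_value <- function(p, payback, wallet = 10) -wallet + p * payback

# Probability of repayment at which the expected value is exactly zero
break_even_p <- wallet / payback

print(data.frame(payback, break_even_p))
print(expected_value(0.5, 20))                 # the break-even case from the text
```

Any belief above the break-even probability makes the expected value positive, which is exactly what traps the expected-value maximiser as the promised payback grows.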


Winning Russian Roulette

Ana and Becky want to play a safer version of Russian Roulette. The game starts with a coin toss. Whoever wins the toss puts a single round in a six-shot toy revolver, spins the cylinder, places the muzzle against a target, and pulls the trigger. If the loaded chamber aligns with the barrel, the weapon fires, and the player wins. If not, the other player spins the cylinder and takes her turn, and so on until the revolver fires.

Given the rules of the game, how important is the toss?

We will evaluate the probability of each player – the one who wins (W) the toss and the one who loses (L).

Let’s write down the scenarios where the toss winner (W) wins the contest.
1) W wins in the first round.
2) W doesn’t win in the first, L doesn’t win in the second, and W wins in the third.
3) W doesn’t win in the first, L doesn’t win in the second, W doesn’t win in the third, L doesn’t win in the fourth, and W wins in the fifth.
And so on, for every odd-numbered round.

The total probability of W winning is the sum of all individual probabilities.

1) chance of winning the first round = 1/6
2) chance of winning the third round = chance of not winning the first x chance of L not winning the second x chance of winning the third = (5/6) x (5/6) x (1/6) = (1/6) x (5/6)^2
3) chance of winning the fifth = (1/6) x (5/6)^4
Overall probability = 1/6 + (1/6) x (5/6)^2 + (1/6) x (5/6)^4 + (1/6) x (5/6)^6 + … = (1/6) / [1 - (5/6)^2] = 6/11 = 0.545 = 54.5%

On the other hand, the person who lost the toss has a total probability of 45.5% of winning the game. So, the toss gives an advantage of about 9 percentage points.
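The sum can be checked numerically. This sketch assumes, as the calculation does, that the cylinder is spun before every pull, so each pull fires with probability 1/6; the variable names are illustrative.

```r
p_fire <- 1 / 6          # chance that a given pull fires
q      <- 1 - p_fire     # chance that a given pull does not fire

# Closed form of the geometric series (1/6) x [1 + (5/6)^2 + (5/6)^4 + ...]
p_W <- p_fire / (1 - q^2)

# L wins only if W's first pull fails, after which L is in W's original position
p_L <- q * p_W

# Partial sum of the series as a cross-check (terms beyond k = 200 are negligible)
k <- 0:200
p_W_series <- sum(p_fire * q^(2 * k))

print(c(W = p_W, L = p_L, W_series = p_W_series))
```

The two probabilities are 6/11 and 5/11, and they add up to 1 because the game always ends eventually.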

Surviving The Deadliest 2-Player Game: Vsauce2


The Carpenter Rule

Here is another problem with conditional probability. There are three men and two women. If one of the men is a carpenter, what is the probability of randomly selecting a carpenter and a man from the group?

Let M represent a man, W a woman, and C a carpenter.
P(M AND C) = P(M) x P(C|M)
The probability of choosing a man is 3 in 5; the probability that it’s a carpenter, given a man is chosen, is 1/3
= 3/5 x 1/3 = 1/5

This can also be done the following way:
P(C AND M) = P(C) x P(M|C)
The probability of choosing a carpenter is 1 in 5; the probability that it’s a man, given it’s a carpenter, is 100%.
= 1/5 x 1 = 1/5
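Both routes can be verified by listing the five people. The data frame below is a hypothetical encoding of the group, assuming that exactly one person is a carpenter and that he is one of the men:

```r
# Hypothetical encoding of the group: three men, two women, one male carpenter
group <- data.frame(
  sex       = c("M", "M", "M", "W", "W"),
  carpenter = c(TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Direct count: share of people who are both a man and a carpenter
p_joint <- mean(group$sex == "M" & group$carpenter)

# Route 1: P(M) x P(C|M)
p_M         <- mean(group$sex == "M")
p_C_given_M <- mean(group$carpenter[group$sex == "M"])

# Route 2: P(C) x P(M|C)
p_C         <- mean(group$carpenter)
p_M_given_C <- mean(group$sex[group$carpenter] == "M")

print(c(direct = p_joint, route1 = p_M * p_C_given_M, route2 = p_C * p_M_given_C))
```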


Non-Transitive Dice

There are three fair 6-sided dice with the following sides:
A. [2, 2, 4, 4, 9, 9]
B. [1, 1, 6, 6, 8, 8]
C. [3, 3, 5, 5, 7, 7]

If A plays against B, what is the probability of A winning?

A vs B
The required Probability is:
Probability of rolling a 2 x probability of 2 winning + Probability of rolling a 4 x probability of 4 winning + Probability of rolling a 9 x probability of 9 winning
= (2/6) x (2/6) + (2/6) x (2/6) + (2/6) x 1 = 20/36 = 55.55%

Equivalently, list all 36 equally likely pairs of faces, count the pairs in which A shows the higher number, and divide by 36.

What is the chance of B winning if B plays against C?

B vs C
= (2/6) x (0) + (2/6) x (4/6) + (2/6) x 1 = 20/36 = 55.55%

and

C vs A
= (2/6) x (2/6) + (2/6) x (4/6) + (2/6) x (4/6)= 20/36 = 55.55%

Each die beats the next one in the cycle (A beats B, B beats C, and C beats A) with a probability of 55.55%.
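The three results can be reproduced by comparing every face of one die with every face of the other. The helper `p_beats` is a name made up for this sketch:

```r
A <- c(2, 2, 4, 4, 9, 9)
B <- c(1, 1, 6, 6, 8, 8)
C <- c(3, 3, 5, 5, 7, 7)

# Share of the 36 equally likely face pairs in which die x shows the higher number
p_beats <- function(x, y) mean(outer(x, y, ">"))

print(c(A_beats_B = p_beats(A, B),
        B_beats_C = p_beats(B, C),
        C_beats_A = p_beats(C, A)))
```

Since no two dice share a face value, ties are impossible, and the losing die in each pairing wins the remaining 16/36 of the time.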


Identical Twins

1 in 10 sets of twins are identical, and 9 in 10 are fraternal. What is the probability that Adam and his twin brother Ben are identical twins?

Fraternal twins result from fertilising two eggs with two sperm during the same pregnancy. They may not be the same sex. On the other hand, identical twins result from the fertilisation of a single egg by a single sperm, with the fertilised egg then splitting into two. As a result, identical twins share the same genomes and are always of the same sex.

National Human Genome Research Institute

Contrary to how it may first appear, this is not a marginal probability but a conditional one. The information that both twins are male (Adam and his brother) is what makes it conditional. We will use Bayes’ rule to solve this problem. The probability that Adam and Ben are identical twins, given they are males, is:
P(I|M) = P(M|I)xP(I)/[P(M|I)xP(I) + P(M|F)xP(F)]

P(M|I) = Probability of both males, given they are identical twins, is 1/2 (always the same sex, so either both male or both female).
P(I) = Probability of twins being identical twins = 1/10
P(M|F) = Probability of both males, given they are fraternal twins, is 1/4 (MM out of the possible four, MM, FM, MF, FF)
P(F) = Probability of twins being fraternal twins = 9/10

P(I|M) = (1/2)x(1/10)/[(1/2)x(1/10) + (1/4)x(9/10)] = 2/11 = 0.1818 or 18.18%
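The same Bayes calculation, written out in R with the numbers given above (the variable names are illustrative):

```r
p_I    <- 1 / 10   # prior: twins are identical
p_F    <- 9 / 10   # prior: twins are fraternal
p_MM_I <- 1 / 2    # both male, given identical
p_MM_F <- 1 / 4    # both male, given fraternal

# Bayes' rule: P(I | both male)
posterior <- (p_MM_I * p_I) / (p_MM_I * p_I + p_MM_F * p_F)
print(posterior)
```

Learning that both twins are male raises the probability of identical twins from the prior 10% to 2/11.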

Reference

FRATERNAL TWINS: NIH

Counter-Intuitive Probability. What’s The Chance Twin Brothers Are Identical?: MindYourDecisions


People v. Collins v. Statistics

People v. Collins was a 1968 case in which the Supreme Court of California reversed the second-degree robbery convictions of Janet and Malcolm Collins, handed down by a jury in Los Angeles. The events that led to the original conviction were as follows:

On June 18, 1964, Mrs. Juanita Brooks was walking home along an alley in San Pedro, City of Los Angeles. She was suddenly pushed to the ground, and she saw a young woman running, wearing “something dark.” She had hair “between a dark blond and a light blond.” After the incident, Mrs. Brooks discovered that her purse, containing between $35 and $40, was missing.

At about the same time, John Bass, who lived on the street at the end of the alley, saw a woman run out of the alley and enter a yellow automobile driven by a black male wearing a moustache and beard.

The prosecutor (of the jury trial) brought a statistician to prove the crime using the laws of probability. The expert hypothesised the following chances and made his calculations.

   Characteristic              Probability
1  Yellow automobile           1/10
2  Man with moustache          1/4
3  Girl with ponytail          1/10
4  Girl with blonde hair       1/3
5  Black man with beard        1/10
6  Interracial couple in car   1/1000

The expert used the product rule (the AND rule) of probability to prove the case. Multiplying all the probabilities, he concluded that the chance for any couple to possess the characteristics of the defendants is 1/12,000,000! Note that such multiplication is only valid if the individual events are independent.

But the statistician (and the jury) conveniently ignored a few things:
1) The validity of the probabilities. Those were his ‘inventions’ without any support from data.
2) The events (characteristics) were not independent of each other. More likely than not, a person with a beard also has a moustache.
3) Once a blond girl and a black man with a beard have been counted, assigning a low probability to an interracial couple in a car is wrong. Think about it: the probability of an interracial couple, given one is a blond girl and the other is a bearded black man, must be close to 1.
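A short sketch of the expert's multiplication, and of what point 3 does to it. The six values are the hypothesised chances from the trial; the 1/3 for blonde hair is the value implied by the stated product of 1/12,000,000, and the vector names are made up for this illustration.

```r
# The expert's hypothesised probabilities for the six characteristics
probs <- c(
  yellow_car         = 1 / 10,
  moustache          = 1 / 4,
  ponytail           = 1 / 10,
  blonde_hair        = 1 / 3,
  black_man_beard    = 1 / 10,
  interracial_couple = 1 / 1000
)

# Product rule, valid only if all six events are independent
joint <- prod(probs)
print(1 / joint)                    # "1 in ..." figure quoted in court

# Point 3: given a blond girl and a bearded black man, the couple is interracial
# with probability close to 1, so the last factor should be about 1, not 1/1000
joint_without_couple <- prod(probs[names(probs) != "interracial_couple"])
print(1 / joint_without_couple)
```

Correcting that single dependence already shrinks the figure by a factor of 1,000, before even questioning the invented probabilities or the beard and moustache overlap.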

References

People v. Collins: Justia US Law
The Worst Math Ever Used In Court: Vsauce2
A Conversation About Collins: William B. Fairley; Frederick Mosteller


El Niño Year

The world is experiencing another El Niño episode. El Niño is a climate pattern in which the surface waters of the eastern Pacific Ocean become warmer than usual. It is defined as a phenomenon in the equatorial Pacific Ocean (Niño 3.4 region) in which the three-month running mean sea surface temperature (SST = 28°C) shows a positive departure of at least 0.5°C for five consecutive periods.
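The definition can be turned into a simple check. The monthly anomalies below are made-up values for illustration only (they are not NCEI data), and reading the threshold as "at least +0.5°C" is an assumption of this sketch:

```r
# Hypothetical monthly SST anomalies (degrees C) for the Nino 3.4 region
anom <- c(0.1, 0.3, 0.6, 0.8, 1.0, 1.1, 0.9, 0.8, 0.7, 0.4, 0.2, 0.0)

# Centred three-month running mean; the first and last values are NA
run3 <- as.numeric(stats::filter(anom, rep(1 / 3, 3), sides = 2))

# Flag the running means that are at least 0.5 C above average
warm <- !is.na(run3) & run3 >= 0.5

# El Nino condition: five or more consecutive flagged running means
runs <- rle(warm)
el_nino <- any(runs$values & runs$lengths >= 5)

print(round(run3, 2))
print(el_nino)
```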

Reference

Equatorial Pacific Sea Surface Temperatures: NCEI


Durbin-Watson Test

We have seen examples of regression where the basic assumption of uncorrelated residuals is compromised. Finding the autocorrelation of the residuals using Durbin–Watson is one way to diagnose the correlation. Here, we perform a step-by-step estimation.

Step 1: Plot the data

plot(Nif_data$Year, Nif_data$Index, xlab = "Year", ylab = "Index")

Step 2: Develop a regression model

fit <- lm(Index ~ Year, data=Nif_data)
summary(fit)
Call:
lm(formula = Index ~ Year, data = Nif_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3410.3  -544.5   -96.5   507.6  5603.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -438.128     27.801  -15.76   <2e-16 ***
Year         566.726      2.243  252.65   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1020 on 5352 degrees of freedom
Multiple R-squared:  0.9226,	Adjusted R-squared:  0.9226 
F-statistic: 6.383e+04 on 1 and 5352 DF,  p-value: < 2.2e-16

Step 3: Estimate Residuals

Nif_data$resid <- resid(fit)

Step 4: Durbin–Watson (D-W) Statistic

The D-W statistic is the sum of the squared differences between successive residuals divided by the sum of the squared residuals.

D-W statistic = sum((e_i - e_(i-1))^2) / sum(e_i^2)
sum(diff(Nif_data$resid)^2) / sum(Nif_data$resid^2)
0.006301032

R can do better – using the ‘durbinWatsonTest’ function from the library ‘car’.

library(car)
fit <- lm(Index ~ Year, data=Nif_data)
durbinWatsonTest(fit)
lag Autocorrelation D-W Statistic p-value
   1       0.9936623   0.006301032       0
 Alternative hypothesis: rho != 0
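For reading the result: the D-W statistic ranges from 0 to 4. A value near 2 indicates no first-order autocorrelation, values towards 0 indicate positive autocorrelation, and values towards 4 indicate negative autocorrelation. A value of 0.0063 therefore points to very strong positive autocorrelation in the residuals. The sketch below illustrates the two ends of that scale on simulated series (not the Nif_data residuals); the helper `dw_stat` and both series are assumptions made for this illustration.

```r
set.seed(1)                        # only to make the sketch reproducible

# Same formula as in Step 4, wrapped in a helper
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

# Independent noise: no autocorrelation, so the statistic should be close to 2
e_indep <- rnorm(1000)

# Random walk: each value is the previous one plus noise, so successive
# values are strongly positively correlated and the statistic is close to 0
e_walk <- cumsum(rnorm(1000))

print(dw_stat(e_indep))
print(dw_stat(e_walk))
```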
