Data & Statistics

The Small Sample Fallacy

You may remember an older post titled “Life in a Funnel”. It discussed the wild (i.e., extreme) variation of averages (rates of certain illnesses, income levels, etc.) arising in groups and places with smaller populations. An often-quoted example is the regions with the lowest rates of kidney cancer in the US: they lie mostly in rural, sparsely populated, traditionally Republican states. People who come across this data rationalise it as cleaner air, healthier lifestyles or fresher food. But the regions with the highest prevalence of the same disease are also mostly rural, sparsely populated and Republican!

[Plot: simulated incidence rates by sample size, with the 95% confidence interval marked.]

The plot shows the variability of observed averages simulated from precisely the same incidence rate. As the sample size decreases, the sample average swings up or down dramatically, creating the illusion that something real is going on. This is known as the small sample fallacy: the mistake of attaching a causal explanation to a statistical artefact of a small sample size.

A simple illustration is to imagine a bowl containing several marbles – 50% red and 50% green. If you take only four samples out of it to check the proportion, there is a 12.5% chance of getting either all green or all red. Work out this binomial probability with p = 0.5, n = 4 and x = 4.
P(X = 4 red) = 4C4 x 0.5^4 x 0.5^0 = 0.0625
P(X = 4 green) = 4C4 x 0.5^4 x 0.5^0 = 0.0625
P(all red or all green) = 0.0625 + 0.0625 = 0.125, or 12.5%.

There is a 12.5% chance of seeing an extreme sample average from an otherwise perfectly balanced population. In other words, about one in eight investigators will see a distorted value!

Now, increase the sample to 10; the probability of observing all red or all green reduces dramatically to,
P(all red or all green) = 10C10 x 0.5^10 x 0.5^0 + 10C10 x 0.5^10 x 0.5^0
P(all red or all green) = 0.00098 + 0.00098 = 0.00196, or just 0.2%.
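These binomial calculations are easy to verify in R with the built-in `dbinom` (a quick sketch; the helper name `extreme` is just for illustration):

```r
# Probability of drawing all red or all green from a 50:50 bowl
extreme <- function(n) {
  dbinom(n, size = n, prob = 0.5) +   # all red
    dbinom(0, size = n, prob = 0.5)   # all green
}
extreme(4)    # 0.125
extreme(10)   # ~0.00195
```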

Reference

The Small Sample Fallacy: Kevin deLaplante


Gambler’s Ruin

A gambler starts with a fixed amount of money ($i) and bets $1 in a fair game (i.e., the probability of winning or losing is 0.5) each time until she has 0 or n dollars. What is the probability she ends up with $0, and what is the chance she ends up with $n?

This is a perfect example of a Markov process because the only thing relevant to the gambler at any point in time is the money she has at that moment. Imagine her end goal is to reach $5. Let’s assume she has probability p of winning a dollar and q = 1 - p of losing the bet amount. The Markov chain representation is as follows.

Here is the transition matrix.

Take two cases: a fair bet (p = 0.5, q = 0.5) and a favourable bet (p = 0.6, q = 0.4). She starts with $3, represented as X = [0, 0, 0, 1, 0, 0]. Note that the first element represents $0, the second $1, and so on; the sixth element denotes $5 (the goal). After 50 steps, the end-state probability is,
P^50 X

P^50 [0, 0, 0, 1, 0, 0] = [0.4, 0, 0, 0, 0, 0.6]. The answer is simply the fourth column of P^50. There is a 40% chance she ends up with $0 and a 60% chance she reaches $5.

By the way, for p = 0.5, the analytical solution for the probability of reaching n, starting from i, is,
a_i = i/n; a_3 = 3/5 = 60%.

For p not equal to 0.5, it is,

a_i = (1 - r^i)/(1 - r^n)
r = (1 - p)/p

Now, imagine the probability of winning is 0.6.

p = 0.6; r = 0.4/0.6 = 0.67
a_3 = (1 - r^3)/(1 - r^5) = 0.81
See the fourth column of the matrix above.
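The matrix-power calculation can be sketched in a few lines of R. Here the transition matrix is built with P[i, j] as the probability of moving from $i to $j, so the start distribution is left-multiplied as a row vector:

```r
# Gambler's ruin: end-state distribution after 50 steps, states $0..$5
ruin <- function(p, i = 3, n = 5, steps = 50) {
  P <- matrix(0, n + 1, n + 1)
  P[1, 1] <- 1                        # $0 is absorbing
  P[n + 1, n + 1] <- 1                # $n is absorbing
  for (s in 2:n) {
    P[s, s - 1] <- 1 - p              # lose $1
    P[s, s + 1] <- p                  # win $1
  }
  Pk <- diag(n + 1)
  for (k in 1:steps) Pk <- Pk %*% P   # P^steps
  start <- replace(rep(0, n + 1), i + 1, 1)
  as.vector(start %*% Pk)             # end-state distribution
}
round(ruin(0.5), 2)   # ruin ~0.4, reach $5 ~0.6
round(ruin(0.6), 2)   # ruin ~0.19, reach $5 ~0.81
```

The favourable-bet result matches the analytical a_3 = 0.81 above.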

Reference

L26.9 Gambler’s Ruin: MIT OpenCourseWare


Markovian Umbrella Run

Becky has four umbrellas. During her workday, she travels between home and the office. She takes an umbrella only when it rains; otherwise, it remains where it was last used, at the office or at home. Suppose that on a given day all her umbrellas are in the office while she’s at home preparing to leave; if it rains, she will get wet. The question is:
If the location has a 60% probability of rain, what is the chance that Becky gets wet?

The problem can be solved as a Markov chain. For that, we divide the situation into five states, where each state counts the umbrellas at Becky’s current location. They are
0: no umbrella state
1: one umbrella
2: two umbrellas
3: three umbrellas
4: four umbrellas

To develop the transition probabilities, we must know which movements are possible from one state to another.

From 0: Becky must go from state 0 to state 4, from the place without an umbrella to the other one with all the umbrellas. As this happens whether or not it rains, the probability of this movement is 1.

From 1: If it rains, p = 0.6, Becky carries the umbrella with her. In other words, she goes from state 1 to state 4 (3 already + 1 incoming).
If it doesn’t rain, p = 0.4, she will go from state 1 to state 3.

From 2: If it rains, state 2 to state 3. If it doesn’t rain, she will go from 2 to 2.

From 3: If it rains, 3 to 2; if it doesn’t, 3 to 1.

From 4: If it rains, 4 to 1; if it doesn’t, 4 to 0.

Here is the diagram representing the chain.

The transition matrix is

The required task is to find the stable end distribution of states, which can be done using the relationship

X_n = P^n X_0

We use a matrix calculator to compute P^100.

Multiplying this by any starting distribution, we get the end-state probabilities:

The probability that she’s in state 0 is P(0) = 0.09. Since the probability of rain when Becky is in state 0 is 0.6, the chance she gets drenched is 0.09 x 0.6 = 0.054, or about 5%.
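The whole chain can be reproduced in R. A sketch, with the matrix raised to a high power (more than the 100 used above, purely for tight numerical convergence):

```r
# Umbrella chain: states 0..4 = umbrellas at Becky's current location
p_rain <- 0.6
P <- matrix(0, 5, 5)                        # P[from + 1, to + 1]
P[1, 5] <- 1                                # 0 -> 4, rain or shine
P[2, 5] <- p_rain; P[2, 4] <- 1 - p_rain    # 1 -> 4 or 1 -> 3
P[3, 4] <- p_rain; P[3, 3] <- 1 - p_rain    # 2 -> 3 or 2 -> 2
P[4, 3] <- p_rain; P[4, 2] <- 1 - p_rain    # 3 -> 2 or 3 -> 1
P[5, 2] <- p_rain; P[5, 1] <- 1 - p_rain    # 4 -> 1 or 4 -> 0
Pk <- diag(5)
for (k in 1:1000) Pk <- Pk %*% P            # P^1000, fully converged
steady <- as.vector(c(1, 0, 0, 0, 0) %*% Pk)
round(steady[1] * p_rain, 3)                # chance Becky gets wet, ~0.055
```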


Confidence Interval in Poll Surveys

Consider a large population from which you randomly sample 1000 people. The task is to get a simple YES or NO answer from each survey participant about a candidate. Suppose 450 people answered YES; what is the margin of error of the estimate at a 95% confidence level?

h \pm 1.96 \frac{\sigma}{\sqrt{n}}

The sample size is n = 1000. Since 450 out of 1000 responded YES, we approximate 450/1000 (the sample proportion, h) as the population probability (p) of YES. The next step is to estimate sigma, the standard deviation. This can be done in two ways.

Solution as Bernoulli trial

This is a Bernoulli trial, and the standard deviation per trial is nothing but the square root of p x (1-p), where p is the probability of YES.
sd = root(p x (1-p)) = root(0.45 x 0.55) = 0.497.

0.45 \pm 1.96 \frac{0.497}{\sqrt{1000}} = 0.45 \pm 0.031

Thus, the population proportion p lies in the interval [0.45 – 0.031, 0.45 + 0.031], or [0.42, 0.48], at the 95% confidence level.
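The same margin-of-error arithmetic, as a quick R sketch:

```r
# 95% margin of error for a YES share of 450 out of 1000
n <- 1000
h <- 450 / n
sd <- sqrt(h * (1 - h))            # Bernoulli standard deviation
moe <- 1.96 * sd / sqrt(n)         # ~0.031
round(c(h - moe, h + moe), 3)      # ~[0.419, 0.481]
```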


Polls and 3.5%

In this post and the next, we will solve the poll survey problem in two different ways. First: how many people are required in the survey for a 3.5-percentage-point difference to register as a signal?

The steps are:
find the signal (we know that already: 3.5%, or 0.035)
find the noise (the standard deviation)
estimate signal/noise and equate it to 1.96/root(n)
estimate n

Standard deviation

Imagine a survey asking a random potential voter a question about a candidate. The answer is YES or NO; YES carries the value 1 and NO the value 0. Let p be the probability of getting a YES, something we don’t know yet. This is a decent approximation of a Bernoulli trial, whose variance is p x (1-p) per trial, so the standard deviation is the square root of p x (1-p). For p = 0.5 (equal probabilities for YES and NO), the standard deviation (sd) is 0.5. The value of sd is 0.49 for 60:40 and 40:60 splits, 0.46 for 70:30 and 30:70, etc. Therefore, using a standard deviation of 0.5 in the poll won’t be a big crime.

Samples

signal/noise = 0.035/0.5 = 0.07
n = (1.96/0.07)^2
= 784
or about 1000 people.
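The sample-size arithmetic, as a quick R check:

```r
# Sample size for a 3.5-point signal at 95% confidence, worst-case sd
signal <- 0.035
noise <- 0.5                   # standard deviation for p = 0.5
n <- (1.96 * noise / signal)^2
round(n)                       # 784, i.e., roughly 1000 people
```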

We will address the same problem in the opposite way in the next post.


Confidence with an Edge

Anne has developed a 3% edge in sports betting based on some intelligent math, but she doesn’t yet know her advantage for certain. She wants to bet $1 on a team at odds of 1/2. How many bets does she need to make before she can be confident of her edge?

Before we get into the question, let’s familiarise ourselves with what Nate Silver popularised as ‘signal’ and ‘noise’. The signal is what we expect—in simple language, it’s the mean. The noise is the variability, or, in other words, the standard deviation.

The problem above is another way of asking how many trials Anne needs before a confidence interval (say 95%) clearly distinguishes the signal (the edge, 0.03) from the noise (the standard deviation). Just a reminder: for a fair bet, the signal (the long-term average) would have been 0, but since Anne has an edge of 0.03, it must be 0.03.

The confidence interval per average trial is given by the following formula. h is the signal, sigma is the standard deviation, and n is the number of trials.

h \pm 1.96 \frac{\sigma}{\sqrt{n}}

If the odds are 1/2, a wager of $1 returns 0.5 on a win and loses 1 otherwise. The implied winning probability is,
2/(2+1) = 0.667.

Standard deviation

The squared distance for a win is (0.5 – 0.03)^2 and for a loss, (-1 – 0.03)^2. The average squared distance (the variance) is,

\sigma^2 = 0.667 (0.5 - 0.03)^2 + 0.333 (-1 - 0.03)^2 = 0.5

The standard deviation is the square root = 0.71

Confidence interval

0.03 \pm 1.96 \frac{0.71}{\sqrt{n}}

Now, all we need to do is estimate n such that the term on the right-hand side of the plus/minus is equal to or less than the signal value.

1.96 x noise / root(n) < signal
1.96 / root(n) < signal/noise
n > (1.96 / (signal/noise))^2
n > (1.96 / (0.03/0.71))^2
n > 2140
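Anne's numbers can be reproduced in R (a sketch; carrying the unrounded standard deviation through gives the ~2140 figure):

```r
# Bets needed before a 3% edge clears the 95% noise band
p_win <- 2 / 3                          # implied by odds of 1/2
edge <- 0.03
sigma2 <- p_win * (0.5 - edge)^2 + (1 - p_win) * (-1 - edge)^2
sigma <- sqrt(sigma2)                   # ~0.71
n <- (1.96 * sigma / edge)^2
round(n)                                # ~2140 bets
```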

Reference

The Ten Equations that Rule the World: David Sumpter


Viterbi Algorithm – NLP

Let’s try out the Viterbi algorithm in R, using the example given on the ritvikmath channel. It is about part-of-speech tagging of a sentence,
“The Fans Watch The Race”.

The transition and emission probabilities are given in matrix forms.

library(HMM)
hmm <- initHMM(c("DET", "NOUN", "VERB"), c("THE", "FANS", "WATCH", "RACE"),
	transProbs = matrix(c(0.0, 0.0, 0.5, 0.9, 0.5, 0.5, 0.1, 0.5, 0.0), nrow = 3),
	emissionProbs = matrix(c(0.2, 0.0, 0.0, 0.0, 0.1, 0.2, 0.0, 0.3, 0.15, 0.0, 0.1, 0.3), nrow = 3))
print(hmm)
$States
[1] "DET"  "NOUN" "VERB"

$Symbols
[1] "THE"   "FANS"  "WATCH" "RACE" 

$startProbs
      DET      NOUN      VERB 
0.3333333 0.3333333 0.3333333 

$transProbs
      to
from   DET NOUN VERB
  DET  0.0  0.9  0.1
  NOUN 0.0  0.5  0.5
  VERB 0.5  0.5  0.0

$emissionProbs
      symbols
states THE FANS WATCH RACE
  DET  0.2  0.0  0.00  0.0
  NOUN 0.0  0.1  0.30  0.1
  VERB 0.0  0.2  0.15  0.3

Now, write down the observations (The Fans Watch The Race) and run the following commands.

observations <- c("THE","FANS", "WATCH", "THE", "RACE")

vPath <- viterbi(hmm,observations)

vPath 
 "DET"  "NOUN" "VERB" "DET"  "NOUN"

References

The Viterbi Algorithm: ritvikmath


Viterbi algorithm – R Program

The steps we built in the previous post can be done using the following R code. Note that you must install the HMM library for this.

library(HMM)
hmm <- initHMM(c("H","F"), c("NOR","COL", "DZY"), transProbs=matrix(c(0.7, 0.4, 0.3, 0.6), nrow = 2),
	emissionProbs=matrix(c(0.5, 0.1, 0.4, 0.3, 0.1, 0.6), nrow = 2))

observations <- c("NOR","COL","DZY")

vPath <- viterbi(hmm,observations)

vPath 
"H" "H" "F"


Viterbi algorithm – The Solution

A doctor wants to diagnose whether a patient has a fever or is healthy. The patient can explain the conditions in three options: “Normal,” “Cold,” or “Dizzy.” The doctor has statistical data on health and how patients feel. If a patient comes to the doctor and reports “normal” on the first day, “cold” on the second, and “dizzy” on the third, how is the diagnosis done?

The probability of H today, given it was H the previous day = 0.7
The probability of F today, given it was H the previous day = 0.3
The probability of F today, given it was F the previous day = 0.6
The probability of H today, given it was F the previous day = 0.4
The probability of appearing N, given H, P(N|H) = 0.5
The probability of appearing C, given H, P(C|H) = 0.4
The probability of appearing D, given H, P(D|H) = 0.1
The probability of appearing N, given F, P(N|F) = 0.1
The probability of appearing C, given F, P(C|F) = 0.3
The probability of appearing D, given F, P(D|F) = 0.6

The Viterbi steps are:
1) Estimate prior probabilities of being healthy (H) or fever (F).
2) Calculate the posterior (the numerator) of day 1 using Bayes.
3) Compare the two posteriors (H vs F) and find the most likely condition.
4) Use it as the prior for the next day and repeat steps 2 and 3.

Day 1: “Normal”

Step 2: Calculate the posterior (the numerator) of day 1 using Bayes.
P(H|N) = P(N|H) x P(H) = 0.5 x 0.57 = 0.29
P(F|N) = P(N|F) x P(F) = 0.1 x 0.43 = 0.04
You can see the pattern already: one factor is the prior (0.57 or 0.43, the stationary probabilities of the transition matrix), and the other is the emission (0.5 or 0.1).

Step 3: Compare the posteriors
P(H|N) > P(F|N)

Day 2: “Cold”

Now, we move to the next day. Note that we can’t remove the branch that was lower on the first day, as it can still contribute to the next day.

Part 1: We start with Healthy (day 1) – Healthy (day 2): multiply the transition probability (0.7) by the emission probability (0.4) and by the contribution from the earlier step (0.29): 0.7 x 0.4 x 0.29 ≈ 0.08.
That is not the only way to arrive at “Healthy”; it can also come from the “Fever” of the previous day. Doing the same math: 0.4 x 0.4 x 0.04 = 0.0064.

So, there are two ways to arrive at H on the second day. One has a probability of 0.08 (start – H – H), and the other 0.0064 (start – F – H). The first is greater than the second; retain start – H – H.

0.08 is the maximum probability we carry now for H to fight against F on day 2.

Part 2: Do the same analysis for the maximum probability for F on day 2.
Two arrows lead to “Fever”. The first gives a probability of 0.6 x 0.3 x 0.04 = 0.0072, and the second, 0.3 x 0.3 x 0.29 ≈ 0.026. Retain the larger one, roughly 0.02, as the probability for “Fever”.

Final Part: Compete H vs F. 0.08 > 0.02. So H is on day 2 as well.

Day 3: “Dizzy”

H-H: 0.7 x 0.1 x 0.08 = 0.0056
F-H: 0.4 x 0.1 x 0.02 = 0.0008
F-F: 0.6 x 0.6 x 0.02 = 0.0072
H-F: 0.3 x 0.6 x 0.08 = 0.0144

Given the patient’s feedback, the assessment would have resulted in Healthy on days 1 and 2 but Fever on day 3.
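As a cross-check, the trellis above can be coded directly in R without the HMM package. A sketch, where the day-1 priors 0.57/0.43 are the stationary probabilities of the transition matrix:

```r
# Manual Viterbi trellis for the fever example
states <- c("H", "F")
trans <- matrix(c(0.7, 0.3, 0.4, 0.6), 2, 2, byrow = TRUE,
                dimnames = list(states, states))
emit <- matrix(c(0.5, 0.4, 0.1, 0.1, 0.3, 0.6), 2, 3, byrow = TRUE,
               dimnames = list(states, c("NOR", "COL", "DZY")))
obs <- c("NOR", "COL", "DZY")
v <- c(H = 0.57, F = 0.43) * emit[, obs[1]]       # day-1 scores
back <- matrix("", 2, length(obs), dimnames = list(states, NULL))
for (t in 2:length(obs)) {
  scores <- v * trans                             # scores[i, j] = v[i] * P(i -> j)
  back[, t] <- states[apply(scores, 2, which.max)]
  v <- apply(scores, 2, max) * emit[, obs[t]]
}
# Backtrack from the best final state
path <- names(which.max(v))
for (t in length(obs):2) path <- c(back[path[1], t], path)
path   # "H" "H" "F"
```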

References

Viterbi algorithm: Wiki

The Viterbi Algorithm: ritvikmath


Viterbi algorithm

A doctor wants to diagnose whether a patient has a fever or is healthy. The patient can explain the conditions in three options: “Normal,” “Cold,” or “Dizzy.” The doctor has statistical data on health and how patients feel. If a patient comes to the doctor and reports “normal” on the first day, “cold” on the second, and “dizzy” on the third, how is the diagnosis done?

Before we get to the solution, let’s recognise this as a hidden Markov process.
The hidden (latent) variables are “Fever” and “Healthy.” They have transition probabilities.
The observed variables are: “Normal,” “Cold,” and “Dizzy.” They have emission probabilities.

The Viterbi algorithm recursively finds the most likely state sequence using the maximum a posteriori probability. As the earlier post shows, the Bayes equation estimates the posterior probability. As the objective is to find the maximum a posteriori, we require only its numerator in Viterbi. In our case,

P(H|N) = P(N|H) x P(H)

H represents “Healthy”, and N denotes “Normal”.

The steps are:
1) Estimate prior probabilities of being healthy (H) or fever (F).
2) Calculate the posterior (the numerator) of day 1 using Bayes.
3) Compare the two posteriors (H vs F) and find the most likely condition.
4) Use it as the prior for the next day and repeat steps 2 and 3.

Here is the hidden Markov tree, and we will see the calculations next.
