Data & Statistics

Gambler, Learner and Logician

Amanda, Becky and Carol are at a betting station near the coin-flipping game. They observe five tosses and see all of them landing on heads.

Amanda: “It landed five times on heads; a tail is due, and I will bet on tails this time.”
Becky: “It is a biased coin. The probability for the next flip to land on heads is high; I will bet on heads.”
Carol: “Amanda has committed a fallacy. Becky may be right, but induction based on the first five tosses can still be logically incorrect. So there is no point betting either way.”

Who is right here?

Amanda has committed the Gambler’s fallacy. By expecting that a tail is due, she ignores the independence of the trials.
Becky’s stand is based on her interpretation of the observations, but her argument is still not logically water-tight. From the casino’s point of view, using a biased coin is risky; people like Becky would detect it easily and become rich.
In the absence of strong evidence of bias, the coin is best assumed fair, and Amanda’s bet, though her reasoning was fallacious, is as acceptable as any.


Dieing to Fill Glasses

Here is a game: there are six empty glasses, numbered 1 through 6. You roll a die and fill the empty glass that matches the die roll. If the number on the die matches an already-filled glass, that glass is emptied. On average, how many rolls are required to fill all six glasses?

Let E(5) denote the expected number of further die rolls required before the game ends, given five glasses are already filled. By this definition, E(0) is the expected number of rolls to finish the game starting with zero filled glasses, i.e., the answer to the original question.

E(5) = (1/6)[1] + (5/6)[1 + E(4) ]

Extending the same logic gives the remaining equations. (With k glasses filled, the roll matches a filled glass with probability k/6 and an empty one with probability (6−k)/6.)
E(4) = (2/6)[1 + E(5)] + (4/6)[1 + E(3) ]
E(3) = (3/6)[1 + E(4)] + (3/6)[1 + E(2) ]
E(2) = (4/6)[1 + E(3)] + (2/6)[1 + E(1) ]
E(1) = (5/6)[1 + E(2)] + (1/6)[1 + E(0) ]
E(0) = (6/6)[1 + E(1)]

Solving the six equations with six unknowns,

E(0) = 83.2
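The equations above can be checked numerically. Here is a minimal sketch in Python (the post itself uses no code) that solves the recurrences by iterative substitution:

```python
# E[k] = expected rolls to finish with k glasses filled; E[6] = 0 (game over).
# From k filled glasses, a roll hits an empty glass with probability (6-k)/6
# and a filled one with probability k/6, matching the equations above.
E = [0.0] * 7
for _ in range(10_000):  # Gauss-Seidel sweeps; converges quickly for 6 states
    for k in range(5, -1, -1):
        hit_empty = (6 - k) / 6 * E[k + 1]
        hit_filled = k / 6 * E[k - 1] if k > 0 else 0.0
        E[k] = 1 + hit_empty + hit_filled

print(round(E[0], 1))  # 83.2
```

The intermediate values come out as E(5) = 63, E(4) = 74.4, E(3) = 78.6, E(2) = 80.8 and E(1) = 82.2.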

Reference

Can You Solve The Dice Rolling Drinking Game?: MindYourDecisions


Contingency Tables – Continued

Contingency Tables are one way to organise data. Here is a data summary of computer users in a group.

             PC    Mac    Row Totals
Male         45     38        83
Female       40     55        95
Col. Totals  85     93       178

Joint Probability

What is the joint probability of Female and Mac?
First, the answer: go to the cell at the junction of Female and Mac (55) and divide by the grand total: 55/178 = 0.309.

Now the theory:
P (F AND Mac) = P(F | Mac) x P(Mac)
P(F | Mac) = 55/93
P(Mac) = 93/178
P (F AND Mac) = (55/93) x (93/178) = 55/178 = 0.309.
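The arithmetic can be verified with a few lines of Python (illustrative only; the dictionary layout is ours, not part of the post):

```python
# Counts from the contingency table (rows: gender, columns: computer type)
table = {"Male": {"PC": 45, "Mac": 38}, "Female": {"PC": 40, "Mac": 55}}
grand_total = sum(n for row in table.values() for n in row.values())  # 178

# Joint probability: cell count divided by the grand total
p_f_and_mac = table["Female"]["Mac"] / grand_total

# Equivalently, P(F | Mac) x P(Mac)
mac_total = sum(row["Mac"] for row in table.values())  # 93
p_check = (table["Female"]["Mac"] / mac_total) * (mac_total / grand_total)

print(round(p_f_and_mac, 3), round(p_check, 3))  # 0.309 0.309
```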

                 PC              Mac            Row Totals
Male        45/178 = 0.25   38/178 = 0.21    83/178 = 0.47
Female      40/178 = 0.22   55/178 = 0.31    95/178 = 0.53
Col. Totals 85/178 = 0.48   93/178 = 0.52   178/178 = 1.00

Conditional Probabilities

Conditional probability is the probability that an event occurs, given another event has happened. Given that a customer is female, what is the probability she’ll purchase a Mac?

Take the female–Mac cell (55) and divide it by the female row total (95): 55/95 = 0.58.

Conditioning on gender (dividing each cell by its row total):

                   PC                Mac         Row Totals
Male        P(PC|M) = 45/83   P(Mac|M) = 38/83       83
Female      P(PC|F) = 40/95   P(Mac|F) = 55/95       95
Col. Totals        85                93              178

Conditioning on the computer type (dividing each cell by its column total) instead:

                   PC                Mac         Row Totals
Male        P(M|PC) = 45/85   P(M|Mac) = 38/93       83
Female      P(F|PC) = 40/85   P(F|Mac) = 55/93       95
Col. Totals        85                93              178
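These ratios can be computed directly. A minimal Python sketch (the dictionary layout is ours):

```python
# Counts from the contingency table (rows: gender, columns: computer type)
table = {"Male": {"PC": 45, "Mac": 38}, "Female": {"PC": 40, "Mac": 55}}

# P(Mac | Female): the female-Mac cell over the female row total
p_mac_given_f = table["Female"]["Mac"] / sum(table["Female"].values())
print(round(p_mac_given_f, 2))  # 0.58
```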


Contingency Tables

Contingency Tables are one way to organise data. Here is a data summary of computer users in a group.

             PC    Mac    Row Totals
Male         45     38        83
Female       40     55        95
Col. Totals  85     93       178

The intersection of a row and a column defines one piece of information. For example, the intersection of PC and Male (45) is the number of males in the survey who use a PC at work; the junction of the row totals and Female (95) is the total number of females in the survey; and the grand total shows 178 people in the study.

Marginal, joint, and conditional probabilities

Before we get into the calculations, a gentle reminder on probability.
P(event) = # Events / # Outcomes.

Marginal probabilities are the probabilities for single events without counting the other events in the table.
P(Female) = # Females / Grand Total = 95/178 = 0.53.
P(Mac) = # Mac users / Grand Total = 93/178 = 0.52.

Let’s redraw the contingency table with marginal probabilities now.

                 PC              Mac            Row Totals
Male                                          83/178 = 0.47
Female                                        95/178 = 0.53
Col. Totals 85/178 = 0.48   93/178 = 0.52   178/178 = 1.0

Clearly, the numbers are all sitting on the margins, hence the name.
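A quick Python check of the margins (the dictionary layout is ours, for illustration):

```python
# Counts from the contingency table (rows: gender, columns: computer type)
table = {"Male": {"PC": 45, "Mac": 38}, "Female": {"PC": 40, "Mac": 55}}
grand_total = sum(n for row in table.values() for n in row.values())  # 178

p_female = sum(table["Female"].values()) / grand_total        # row margin
p_mac = sum(row["Mac"] for row in table.values()) / grand_total  # column margin
print(round(p_female, 2), round(p_mac, 2))  # 0.53 0.52
```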

We’ll see the other two probabilities in the next post.

Reference

Statistics By Jim: Page


Chi-Square Distribution

Chi-Square is a family of continuous distributions, widely used in hypothesis tests. The shape of a chi-square distribution is determined by what is known as the degrees of freedom (df).

A chi-square test operates by comparing the observed distribution to what you expect if there is no relationship between the categorical variables.
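As a sketch of the mechanics, with made-up counts (not data from the post): the chi-square statistic is the sum of (observed − expected)² / expected over the cells, where the expected counts assume no relationship between the variables.

```python
# Hypothetical observed counts vs. counts expected under independence
observed = [48, 35, 15, 2]
expected = [50.0, 30.0, 15.0, 5.0]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 2.71
```

A large statistic relative to the chi-square distribution with the appropriate df suggests the observed counts deviate from independence.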


Likelihood Function – Part II

In the previous post, we estimated the likelihood of getting six people sick for two parameters (prevalence), 7% and 8%. We can also calculate the ratio between the two likelihoods:

L(theta = 0.07 | data = 6) / L(theta = 0.08 | data = 6) = 0.153 / 0.123 = 1.24. 

It means that the prevalence of 7% supports the data 1.24 times more than the prevalence of 8%. What about a sweep of likelihood over the entire parameter space? The function that gives the distribution of likelihoods of all possible values of parameters for a given data is the likelihood function.
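The ratio can be reproduced with a few lines of stdlib-only Python (the function name `likelihood` is ours, not from the post):

```python
from math import comb

def likelihood(theta, y=6, n=100):
    # Binomial likelihood of observing y positives in n samples at prevalence theta
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

ratio = likelihood(0.07) / likelihood(0.08)
print(round(ratio, 2))  # 1.24
```

Sweeping `theta` over a grid between 0 and 1 traces out the whole likelihood function for the observed data.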

As the parameter (theta) defines a model (e.g., a binomial probability mass function), the likelihood function tells us how strongly each candidate model supports the data we have. In other words, we want the model that is most likely to have produced our data.


Likelihood Function

Consider a rare disease whose prevalence in two cities is 0.07 and 0.08, respectively. If 100 samples are taken from each city and six people are found positive, which prevalence value is more likely?

Let’s visualise situation 1:

And situation 2:

It is clear that the first possibility, prevalence (‘the parameter’) 0.07, is more likely given six positives: the probability is 0.153 in the first case versus 0.123 in the second.

Summarising: for the parameter of 7%, the probability of getting six out of a hundred is 0.153. It becomes the likelihood.
L(theta = 0.07; y = 6) = 0.153 and L(theta = 0.08; y = 6) = 0.123
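Both probabilities are binomial point masses; a stdlib-only Python analogue of R’s dbinom reproduces them:

```python
from math import comb

def dbinom(y, n, p):
    # Binomial probability mass: P(Y = y) for Y ~ Binomial(n, p)
    return comb(n, y) * p**y * (1 - p)**(n - y)

print(round(dbinom(6, 100, 0.07), 3))  # 0.153
print(round(dbinom(6, 100, 0.08), 3))  # 0.123
```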

Here is the R code that generated the plot in situation 2.

library(tidyverse)   # ggplot2 and the pipe
library(ggthemes)    # theme_solarized

xx <- seq(1, 20)
P <- dbinom(xx, 100, prob = 0.08)   # binomial probabilities for 1-20 positives
binom_data <- data.frame(xx, P)

binom_data %>%
  ggplot(aes(x = xx, y = P, label = P,
             fill = factor(ifelse(xx == 6, "Highlighted", "Normal")))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = factor(ifelse(P > 0.01, round(P, 3), "")))) +
  scale_x_discrete(name = "Positive Sample", limits = factor(seq(1, 20, 1))) +
  scale_y_continuous(name = "Probability") +
  theme_solarized(light = TRUE)


Period Life Expectancy – Plots

We have seen the calculations behind life expectancy, the lifespan of a hypothetical cohort ageing based on the measured mortality rates of a given period, as a statistical projection of the current conditions. Here, we plot the life expectancy that we estimated previously.

library(tidyverse)
library(ggthemes)

# L_data: the life-table data frame computed in the previous post
# (columns: Age, Life.Exp)
L_data %>% ggplot(aes(Age, Life.Exp)) +
  geom_point() +
  geom_rug() +
  scale_x_continuous(name = "Age", limits = c(0, 120), minor_breaks = seq(0, 120, 5), breaks = seq(0, 120, 10)) +
  scale_y_continuous(name = "Life Expectancy", limits = c(0, 80), minor_breaks = seq(0, 80, 5), breaks = seq(0, 80, 10)) +
  theme_solarized(light = TRUE)

The death probability (data) at each age is presented below.

The plot with the Y-axis in the logarithmic scale shows finer details, especially in the lower age categories.

You can see below the dynamics of survival – 85,000 of the 100,000 are alive until almost the age of 60.


Period Life Expectancy

The period-life expectancy at a given age is the average remaining number of years expected for a person at that exact age, estimated from the mortality rate of that particular time. Let’s work out the calculation using the death probability (probability of dying within one year) table. The death probability is estimated from the mortality rates at each age (from census data for a short period). Here are the first few lines of the data (for complete data, see reference).

Age   P(Death)
 0    0.005837
 1    0.000410
 2    0.000254
 3    0.000207
 4    0.000167

We start with 100,000 people in the cohort. The number of deaths in a given year, Yx = the probability of death in Yx × the number alive in Yx. In our example, for Y0, it is 100,000 × 0.005837 = 583.7.

The number of people alive in the next year (Yx+1) = the number alive in Yx − the number of deaths in Yx. I.e., # alive in Y1 = 100,000 − 583.7 = 99,416. This number multiplied by the next death probability gives the number of deaths in Yx+1.

The next step is to calculate the average number of people alive in the age category. It is the mid-point average: the number alive in Yx+1 plus half of the deaths in Yx. That equals 99,416 + 0.5 × 583.7 = 99,708.

Next comes the total number of person-years lived by the cohort from age x until all cohort members have died. It is the sum of the mid-point average column from age x to the last row in the table. Suppose there are 120 rows (ages) in total and you want the person-years at age 24; you add all the average-alive values from Y24 to Y119.

Life expectancy for a given age = person-years / persons alive.
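The steps above can be sketched in Python with the five death probabilities shown (the full SSA table runs to about 120 ages; `q` here is truncated for illustration):

```python
# Death probabilities q[x] for ages 0-4, from the table above
q = [0.005837, 0.00041, 0.000254, 0.000207, 0.000167]

alive = [100_000.0]           # cohort alive at the start of each age
deaths = []
for qx in q:
    d = alive[-1] * qx        # deaths during the year
    deaths.append(d)
    alive.append(alive[-1] - d)

# Mid-point average number alive in each age interval
mid = [alive[x + 1] + 0.5 * deaths[x] for x in range(len(q))]

# With the full 120-row table, life expectancy at age x = sum(mid[x:]) / alive[x]
print(round(deaths[0], 1), round(alive[1]), round(mid[0]))  # 583.7 99416 99708
```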

Here are the first and the last 10 years of calculations of a table that has 120 rows (Y0 – Y119).

References

Actuarial Life Table: SSA

The Life Table: lifeexpectancy.org


Likelihood Ratio – Fagan’s Nomogram

We have seen the likelihood ratio as a property of a diagnostic tool. Let’s take the fictitious screening tool we evaluated in the last post with LR+ = 10.7. Imagine a patient comes to a clinic with a few symptoms of a disease with a prevalence of 0.1 (age-adjusted), and this screening is a possible option. Would you recommend it? Note that the doctor will decide on further (costly) treatment only if she gets a confirmation (posterior probability) of > 50% chance of the disease.

From the relationship we derived last time, 

OR_Post = LR x OR_Pri

Odds (posterior) = 10.7 x (0.1/0.9) = 10.7 x 0.11 = 1.19
P(posterior) / (1 – P(posterior)) = 1.19
1/P(posterior) = 1 + 1/1.19
P(posterior) = 1.19/2.19 = 0.54

A nomogram of the following type is built to make such calculations simpler.

Draw a line from the ‘pre-test probability’ through the ‘likelihood ratio’ and extend it to the ‘post-test probability’ line. The intersection gives the posterior probability.

Here is an illustration of the method. Let’s use Fagan’s nomogram for the previous case,

To answer the original question: this test may be recommended, as it can bring the probability above 0.5 if the result is positive. Not to forget: if the test comes back negative (LR- = 0.044), the posterior probability drops to 0.005.
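The odds arithmetic generalises into a small helper (a sketch in Python; the nomogram is just its graphical equivalent):

```python
def posterior(prior, lr):
    # Bayes in odds form: posterior odds = likelihood ratio x prior odds
    prior_odds = prior / (1 - prior)
    post_odds = lr * prior_odds
    return post_odds / (1 + post_odds)

print(round(posterior(0.1, 10.7), 2))   # 0.54  (positive test, LR+ = 10.7)
print(round(posterior(0.1, 0.044), 3))  # 0.005 (negative test, LR- = 0.044)
```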

Smaller prior

On the other hand, if the prior probability is lower, say, 0.01, as you can see below, the test is not very useful to make a conclusive decision.

Such a disease would require a diagnostic tool with a likelihood ratio of 100 or above to make a decision. Connect 0.01 (prior probability) to 0.5 (minimum decision criterion) and find out the likelihood ratio.
