March 2024

Identical Twins

1 in 10 sets of twins are identical, and 9 in 10 are fraternal. What is the probability that Adam and his twin brother Ben are identical twins?

Fraternal twins result from fertilising two eggs with two sperm during the same pregnancy. They may not be the same sex. On the other hand, identical twins result from the fertilisation of a single egg by a single sperm, with the fertilised egg then splitting into two. As a result, identical twins share the same genomes and are always of the same sex.

National Human Genome Research Institute

Contrary to first appearances, this is not a marginal probability but a conditional one. The information that both twins are male (Adam and his brother) triggers the shift. We will use Bayes' rule to solve the problem. The probability that Adam and Ben are identical twins, given that they are both male, is:
P(I|M) = P(M|I) x P(I) / [P(M|I) x P(I) + P(M|F) x P(F)]

P(M|I) = Probability that both are male, given identical twins = 1/2 (identical twins are always the same sex, so the pair is either MM or FF)
P(I) = Probability that the twins are identical = 1/10
P(M|F) = Probability that both are male, given fraternal twins = 1/4 (MM out of the four possibilities MM, MF, FM, FF)
P(F) = Probability that the twins are fraternal = 9/10

P(I|M) = (1/2) x (1/10) / [(1/2) x (1/10) + (1/4) x (9/10)] = 0.05 / 0.275 = 0.1818, or 18.18%
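
The same calculation in R (a minimal sketch; the variable names are just for illustration):

p_I <- 1/10          # prior: identical twins
p_F <- 9/10          # prior: fraternal twins
p_M_given_I <- 1/2   # both male, given identical
p_M_given_F <- 1/4   # both male, given fraternal
(p_M_given_I * p_I) / (p_M_given_I * p_I + p_M_given_F * p_F)   # 0.1818182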

Reference

Fraternal Twins: NIH

Counter-Intuitive Probability. What’s The Chance Twin Brothers Are Identical?: MindYourDecisions


People v. Collins v. Statistics

People v. Collins was a 1968 case in which the Supreme Court of California reversed the second-degree robbery convictions of Janet and Malcolm Collins, handed down by a jury in Los Angeles. The events that led to the original conviction were as follows:

On June 18, 1964, Mrs. Juanita Brooks was walking home along an alley in San Pedro, City of Los Angeles. She was suddenly pushed to the ground and saw a young woman running away, wearing “something dark” and with hair “between a dark blond and a light blond.” After the incident, Mrs. Brooks discovered that her purse, containing between $35 and $40, was missing.

At about the same time, John Bass, who lived on the street at the end of the alley, saw a woman run out of the alley and enter a yellow automobile driven by a black male wearing a moustache and beard.

The prosecutor (at the jury trial) brought in a statistician to prove the case using the laws of probability. The expert hypothesised the following chances and made his calculations.

Characteristic                    Probability
1. Yellow automobile              1/10
2. Man with moustache             1/4
3. Girl with ponytail             1/10
4. Girl with blonde hair          1/3
5. Black man with beard           1/10
6. Interracial couple in car      1/1000

The expert used the product rule (the AND rule) of probability to prove the case. Multiplying all the probabilities, he concluded that the chance of any couple possessing all the characteristics of the defendants is 1 in 12,000,000! Note that such multiplication is valid only if the individual probabilities are independent.
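
The multiplication itself is a one-liner in R (using the probabilities from the table above):

p <- c(1/10, 1/4, 1/10, 1/3, 1/10, 1/1000)
prod(p)       # 8.333333e-08, i.e. 1 in 12,000,000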

But the statistician (and the jury) conveniently ignored a few things:
1) The validity of the probabilities. They were his ‘inventions’, without any support from data.
2) The events (characteristics) were not independent of each other. More likely than not, a person with a beard also has a moustache.
3) Once a blonde girl and a bearded black man have been counted, talking about the low probability of an interracial couple in a car is wrong. Think about it: the probability of an interracial couple, given that one is a blonde girl and the other is a bearded black man, must be close to 1.

References

People v. Collins: Justia US Law
The Worst Math Ever Used In Court: Vsauce2
A Conversation About Collins: William B. Fairley; Frederick Mosteller


El Niño Year

The world is experiencing another El Niño episode. El Niño is a climate pattern of warmer-than-usual surface waters in the eastern Pacific Ocean. It is defined as a phenomenon in the equatorial Pacific Ocean (Niño 3.4 region) marked by a positive departure of the three-month running mean sea surface temperature (SST) of at least +0.5°C for five consecutive overlapping seasons.
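
As a toy illustration of the criterion in R (the anomaly values below are made up, purely for illustration):

# Hypothetical three-month running mean SST anomalies (°C) for the Niño 3.4 region
oni <- c(0.3, 0.4, 0.6, 0.7, 0.9, 1.0, 0.8, 0.6, 0.4, 0.2)
runs <- rle(oni >= 0.5)              # runs of seasons at or above +0.5 °C
any(runs$lengths[runs$values] >= 5)  # TRUE: five or more consecutive seasons qualify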

Reference

Equatorial Pacific Sea Surface Temperatures: NCEI


Durbin-Watson Test

We have seen examples of regression where the basic assumption of uncorrelated residuals is violated. Computing the Durbin–Watson statistic of the residuals is one way to diagnose that correlation. Here, we perform a step-by-step estimation.

Step 1: Plot the data

plot(Nif_data$Year, Nif_data$Index, xlab = "Year", ylab = "Index")

Step 2: Develop a regression model

fit <- lm(Index ~ Year, data=Nif_data)
summary(fit)
Call:
lm(formula = Index ~ Year, data = Nif_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3410.3  -544.5   -96.5   507.6  5603.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -438.128     27.801  -15.76   <2e-16 ***
Year         566.726      2.243  252.65   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1020 on 5352 degrees of freedom
Multiple R-squared:  0.9226,	Adjusted R-squared:  0.9226 
F-statistic: 6.383e+04 on 1 and 5352 DF,  p-value: < 2.2e-16

Step 3: Estimate Residuals

Nif_data$resid <- resid(fit)

Step 4: Durbin–Watson (D-W) Statistics

The D-W statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals.

D-W statistic = Σ(e_i - e_{i-1})² / Σ e_i²
sum(diff(Nif_data$resid)^2) / sum(Nif_data$resid^2)
0.006301032

R can do better – using the ‘durbinWatsonTest’ function from the library ‘car’.

library(car)
fit <- lm(Index ~ Year, data=Nif_data)
durbinWatsonTest(fit)
lag Autocorrelation D-W Statistic p-value
   1       0.9936623   0.006301032       0
 Alternative hypothesis: rho != 0
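
The D-W statistic ranges from 0 to 4: a value near 2 indicates no autocorrelation, values near 0 indicate strong positive autocorrelation, and values near 4 indicate strong negative autocorrelation. The residuals here are therefore almost perfectly positively correlated. A quick way to visualise this (a sketch, reusing the fitted model above) is the ACF of the residuals:

# The spike near 1 at lag 1 corresponds to the near-zero D-W statistic
acf(resid(fit), lag.max = 10, main = "ACF of regression residuals")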


Autocorrelation of Sine Waves

We have seen autocorrelation and cross-correlation functions. This time, we will create a few neat patterns using sinusoidal waves.

sine_wave = sin(pi*1:100/10)
plot(sine_wave, main=expression( y == sin(pi*t/10)), col = "red", type = "l", xlim=c(0,100), xlab = "t", ylab = "y(t)")

And the autocorrelation of the function is:

acf(sine_wave, lag.max = 80, main = expression(Autocorrelation  ~ of ~ y == sin(pi*t/10)))

What about a cross-correlation between a sine and a cosine wave?

sine_wave = sin(pi*1:100/10)
cos_wave = cos(pi*1:100/10)
ccf(sine_wave, cos_wave, 80, main = expression(Crosscorrelation ~ of ~ y == sin(pi*t/10) ~ AND ~ y == cos(pi*t/10)))

We zoom in on the plot to find where the cross-correlation is maximum, which is expected to be at the time delay between the waves: 5 in our case.
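
One way to zoom in is to re-plot the ccf with a smaller maximum lag (a sketch using the same series as above):

# Restricting the lags makes the peak, expected at a lag of 5, easy to read off
ccf(sine_wave, cos_wave, 10, main = expression(Crosscorrelation ~ of ~ y == sin(pi*t/10) ~ AND ~ y == cos(pi*t/10)))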


Cross-Correlation

If autocorrelation represents the correlation of a variable with itself but at different time lags, cross-correlation is between two variables (at time lags). We will use a few R functions to illustrate the concept. The following are data from our COVID archives.

The ‘ccf’ function of the ‘stats’ library is similar to the ‘acf’ we used for autocorrelation. Here is the cross-correlation between the infected and the recovered.

ccf(C_data$Confirmed, C_data$Recovered, 25)

The ‘lag2.plot’ function of the ‘astsa’ library presents the same information differently.

library(astsa)
lag2.plot(C_data$Confirmed, C_data$Recovered, 20)

The correlation peaked at a lag of about 12 days, which may be interpreted as the recovery time from the disease. On the other hand, the infected vs deceased gives a slightly higher lag at the maximum correlation.
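
To read off the lag of maximum correlation programmatically (a sketch, assuming the same C_data frame), ccf can return its values instead of plotting them:

cc <- ccf(C_data$Confirmed, C_data$Recovered, 25, plot = FALSE)
cc$lag[which.max(cc$acf)]   # lag at which the cross-correlation peaks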


Autocorrelation – gg_lag

We continue with the Australian beer production dataset. In the last post, we manually built scatter plots at a few lags (lag = 0, 1, 2, and 4) and introduced the ‘acf’ function.

There is another R function, ‘gg_lag’, which gives an even better illustration.

library(fpp3)   # gg_lag() is provided by the feasts package, which fpp3 loads
new_production %>% gg_lag(Beer)

Take, for example, the lag = 1 plot (top left). The yellow cluster represents the plot of Q4 on the Y-axis against Q3 of the same year on the X-axis. The purple cluster, on the other hand, represents the plot of Q1 on the Y-axis against Q4 of the previous year on the X-axis.

Similarly, for the lag = 2 plot (top middle), the yellow represents the plot of Q4 against Q2 (two quarters earlier) of the same year, and the purple represents the plot of Q1 against Q3 of the previous year.

The lag = 4 plot shows a strong positive correlation (every point near the diagonal line). This is not surprising; it plots Q4 of one year against Q4 of the previous year, Q1 of one year against Q1 of the previous year, and so on.


Autocorrelation – Quarterly Production

As we have seen, autocorrelation is the correlation of a variable in a time series with itself, but at a time lag. For example, how are the values at time t correlated with those at time t-1? We will use the aus_production dataset from the R library ‘fpp3’ to illustrate the concept of autocorrelation.

library(fpp3)
str(aus_production)
 $ Quarter    : qtr [1:74] 1992 Q1, 1992 Q2, 1992 Q3, 1992 Q4, 1993 Q1, 1993 Q2, 1993 Q3, 1993 Q4, 1994 Q1, 1994 Q2, 1994 Q3, 1994 Q4, 1995 Q1, 1995 Q2, 1995 Q3,...
 $ Beer       : num [1:74] 443 410 420 532 433 421 410 512 449 381 ...
 $ Tobacco    : num [1:74] 5777 5853 6416 5825 5724 ...
 $ Bricks     : num [1:74] 383 404 446 420 394 462 475 443 421 475 ...
 $ Cement     : num [1:74] 1289 1501 1539 1568 1450 ...
 $ Electricity: num [1:74] 38332 39774 42246 38498 39460 ...
 $ Gas        : num [1:74] 117 151 175 129 116 149 163 138 127 159 ...

We will use the beer production data.
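
The plots that follow assume the beer series has been pulled out of the (1992 onwards) production data into a plain vector. A minimal sketch, with new_production, B_data, and n as assumed names:

new_production <- aus_production %>% filter(year(Quarter) >= 1992)  # the 74 quarters shown in str() above
B_data <- new_production$Beer   # quarterly beer production as a numeric vector
n <- length(B_data)             # number of quarters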

Let’s plot the production data with itself without any lag in time.

plot(B_data, B_data, xlab = "Beer Production at i", ylab = "Beer Production at i")

There is no surprise here; the data is in perfect correlation. In the next step, we will give a lag of one time interval, i.e., a plot of (2,1), (3,2), (4,3), etc. The easiest way to achieve this is to remove the first element of the vector and plot against the vector with the last element removed.

plot(B_data[-1], B_data[-n], xlab = "Beer Production at i - 1", ylab = "Beer Production at i")

You can clearly see the lack of correlation. What about plots with lags of 2 and 4?
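
The same trick of dropping elements from each end gives the lag-2 and lag-4 plots (a sketch, using the B_data and n defined earlier):

# Lag of 2: value at quarter i (Y) against the value two quarters earlier (X)
plot(B_data[-((n-1):n)], B_data[-(1:2)],
     xlab = "Beer Production at i - 2", ylab = "Beer Production at i")
# Lag of 4: value at quarter i (Y) against the value four quarters earlier (X)
plot(B_data[-((n-3):n)], B_data[-(1:4)],
     xlab = "Beer Production at i - 4", ylab = "Beer Production at i")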

There is a negative correlation compared with the time series 2 quarters ago.

There is a good correlation compared with the time series 4 quarters ago.

The whole process is automated by the autocorrelation function (ACF).

autocorrelation <- acf(B_data, lag.max=10, plot=FALSE)
plot(autocorrelation,
     main="Autocorrelation",
     xlab="Lag Parameter",
     ylab="ACF")

The ACF at lag 1 indicates how successive values of beer production relate to each other, the ACF at lag 2 indicates how values two periods apart relate to each other, and so on.


Heteroskedasticity

We have seen it before: the residuals in linear regression must be random and independent. It also means that the residuals are scattered around the mean with constant variance. This is known as homoscedasticity; when it is violated, we have heteroscedasticity.

Here is the (simulated) height-weight data from the UCLA statistics site, with the regression line.

model <- lm(Wt.Pounds ~ Ht.Inch, data=H_data)
plot(H_data$Ht.Inch , H_data$Wt.Pounds, xlab = "Height (Inches)", ylab = "Weight (Pounds)")
abline(model, col = "deepskyblue3")

We will make a residual plot: a collection of points corresponding to the distances of the observations from the best-fit line.

H_data$resid <- resid(model)
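
A minimal sketch of the residual plot, assuming the same H_data frame and model as above:

# Residuals against the predictor; homoscedastic residuals form a band of roughly constant width around zero
plot(H_data$Ht.Inch, H_data$resid, xlab = "Height (Inches)", ylab = "Residual")
abline(h = 0, col = "deepskyblue3")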

On the other hand, see the plot of sale price vs property area.

The relationship shows heteroscedasticity, as is evident from the residuals spreading out towards one side.
