If autocorrelation is the correlation of a variable with itself at different time lags, cross-correlation is the correlation between two different variables at different time lags. We will use a few R functions to illustrate the concept, starting with data from our COVID archives.
The ‘ccf’ function of the ‘stats’ library is similar to the ‘acf’ function we used for autocorrelation. Here is the cross-correlation between the infected and the recovered.
ccf(C_data$Confirmed, C_data$Recovered, lag.max = 25)
The ‘lag2.plot’ function of the ‘astsa’ library presents the same information differently.
lag2.plot(C_data$Confirmed, C_data$Recovered, 20)
The correlation peaked at a lag of about 12 days, which may be interpreted as the recovery time from the disease. On the other hand, infected vs deceased shows its maximum correlation at a slightly larger lag.
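To illustrate how the peak lag is read off, here is a minimal, hypothetical sketch: two simulated series where the second trails the first by 12 steps, mimicking the confirmed-vs-recovered relationship (the series themselves are made up, not the COVID data).

```r
# Simulate an AR(1) series x and a delayed, slightly noisy copy y
set.seed(1)
x <- as.numeric(arima.sim(model = list(ar = 0.8), n = 300))
y <- c(rep(0, 12), head(x, -12)) + rnorm(300, sd = 0.2)  # y[t] ~ x[t - 12]

cc <- ccf(x, y, lag.max = 25, plot = FALSE)
cc$lag[which.max(cc$acf)]   # close to -12: x leads y by about 12 steps
```

The sign convention matters: `ccf(x, y)` at lag k estimates the correlation of x at time t + k with y at time t, so a series that lags behind shows its peak at a negative lag.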
We continue from the Aus beer production data set. In the last post, we manually built scatter plots at a few lags (lag = 0, 1, 2, and 4) and introduced the ‘acf’ function.
There is another R function, ‘gg_lag’, which provides an even better illustration.
new_production %>% gg_lag(Beer)
Consider the lag = 1 plot (top left). The yellow cluster plots Q4 on the Y-axis against Q3 of the same year on the X-axis, while the purple cluster plots Q1 on the Y-axis against Q4 of the previous year on the X-axis.
Similarly, for the lag = 2 plot (top middle), the yellow cluster plots Q4 against Q2 (two quarters earlier) of the same year, and the purple plots Q1 against Q3 of the previous year.
The lag = 4 plot shows a strong positive correlation (every point lies close to the diagonal line). This is not surprising: it plots Q4 of one year against Q4 of the previous year, Q1 of one year against Q1 of the previous year, and so on.
As we have seen, autocorrelation is the correlation of a time series with itself at a time lag. For example, how are the values at time t correlated with the values at time t - 1? We will use the aus_production dataset from the R library ‘fpp3’ to illustrate the concept of autocorrelation.
Let’s plot the production data with itself without any lag in time.
plot(B_data, B_data, xlab = "Beer Production at i", ylab = "Beer Production at i")
There is no surprise here; the data is in perfect correlation with itself. In the next step, we will introduce a lag of one time interval, i.e., plot the pairs (2,1), (3,2), (4,3), etc. The easiest way to achieve this is to plot the vector with its first element removed against the vector with its last element removed.
plot(B_data[-length(B_data)], B_data[-1], xlab = "Beer Production at i - 1", ylab = "Beer Production at i")
A lack of correlation is clearly visible. What about plots with lags of 2 and 4?
There is a negative correlation with the time series 2 quarters ago.
There is a good correlation with the time series 4 quarters ago.
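The lag-2 and lag-4 plots above can be built with the same index-shifting trick. A sketch follows, with a simulated quarterly series standing in for B_data (substitute the real beer production vector):

```r
# Simulated stand-in: a quarterly seasonal pattern plus noise
set.seed(7)
B_data <- 400 + 50 * rep(c(-1, 0, 0.3, 1), 20) + rnorm(80, sd = 10)
n <- length(B_data)

# Lag 2: value at i against value at i - 2
plot(B_data[1:(n - 2)], B_data[3:n],
     xlab = "Beer Production at i - 2", ylab = "Beer Production at i")
cor(B_data[1:(n - 2)], B_data[3:n])   # negative correlation

# Lag 4: value at i against value at i - 4
plot(B_data[1:(n - 4)], B_data[5:n],
     xlab = "Beer Production at i - 4", ylab = "Beer Production at i")
cor(B_data[1:(n - 4)], B_data[5:n])   # strong positive correlation
```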
The whole process is automated by the autocorrelation function (ACF). The ACF at lag 1 indicates how successive values of beer production relate to each other, the ACF at lag 2 indicates how production values two periods apart relate to each other, etc.
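The manual shifting above is exactly what acf() computes for every lag at once. A sketch, again with a simulated stand-in for the beer series:

```r
# Simulated quarterly-like series (replace with the real beer data)
set.seed(7)
B_data <- 400 + 50 * rep(c(-1, 0, 0.3, 1), 20) + rnorm(80, sd = 10)

a <- acf(B_data, lag.max = 8, plot = FALSE)
a$acf[3]   # lag-2 autocorrelation (index 1 holds lag 0): negative
a$acf[5]   # lag-4 autocorrelation: strongly positive
```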
We have seen it before: the residuals in linear regression must be random and independent. It also means the residuals are scattered around the mean with constant variance. This property is known as homoscedasticity; when it is violated, we have heteroscedasticity.
Here is the (simulated) height weight data from the UCLA statistics, with the regression line.
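The UCLA data itself is not reproduced here, but the pattern is easy to demonstrate with a hypothetical sketch: a simulated height-weight relationship whose error spread grows with height, so the residuals of a linear fit fan out rather than staying in a constant band.

```r
# Simulated heteroscedastic data: noise sd increases with height
set.seed(42)
height <- runif(200, 150, 190)
weight <- -100 + height + rnorm(200, sd = (height - 140) / 5)

fit <- lm(weight ~ height)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")  # fan shape = heteroscedasticity
```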
White noise is a series consisting of a collection of uncorrelated random variables. The following generates a time series of Gaussian white noise:
W_data <- 5 * rnorm(500, 0, 1)
The series has zero mean and a finite variance (here 5^2 = 25).
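A quick check of those stated properties on a generated series:

```r
set.seed(123)
W_data <- 5 * rnorm(500, 0, 1)

mean(W_data)   # close to 0
var(W_data)    # close to 5^2 = 25
acf(W_data, plot = FALSE)$acf[2]   # lag-1 autocorrelation, close to 0
```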
Signal with noise
Many real-life signals are periodic signals contaminated by noise, such as the one shown above. Here is a periodic signal given by 2cos(2pi*t/100 + pi).
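A sketch of generating this periodic signal and contaminating it with Gaussian white noise:

```r
t <- 1:500
signal <- 2 * cos(2 * pi * t / 100 + pi)   # period of 100 time units
set.seed(11)
noisy <- signal + rnorm(500, 0, 1)         # signal plus white noise

plot.ts(cbind(signal, noisy), main = "Signal with noise")
```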
Data from a television production company suggest that 10% of their shows are blockbuster hits, 15% are moderate successes, 50% break even, and 25% lose money. Production managers select new shows based on how they fare in pilot episodes. The company has seen 95% of the blockbusters, 70% of the moderate successes, 60% of the break-evens, and 20% of the losers receive positive feedback.
Given this background: 1) How likely is a new pilot to get positive feedback? 2) What is the probability that a new series will be a blockbuster, given that the pilot gets positive feedback?
The first step is to list all the marginal probabilities given in the background.
Outcome      | Pilot Positive | Pilot Negative | Total
Huge Success |                |                | 0.10
Moderate     |                |                | 0.15
Break Even   |                |                | 0.50
Loser        |                |                | 0.25
Total        |                |                | 1.0
The next step is to estimate the joint probabilities of pilot feedback in each category. For example, 95% of blockbusters get positive feedback, so P(positive and blockbuster) = 0.95 x 0.10 = 0.095. Let's fill the respective cells with the joint probabilities.
Outcome      | Pilot Positive | Pilot Negative | Total
Huge Success | 0.095          | 0.005          | 0.10
Moderate     | 0.105          | 0.045          | 0.15
Break Even   | 0.30           | 0.20           | 0.50
Loser        | 0.05           | 0.20           | 0.25
Total        | 0.55           | 0.45           | 1.0
The rest is straightforward. The answer to the first question: the chance of positive feedback is the sum of all probabilities in the positive column = 0.55, or 55%. The second question: P(success|positive) = 0.095/0.55 = 0.17, or 17%.
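The same table arithmetic, sketched in R (the vector names are illustrative):

```r
# Marginal probabilities of each outcome, and P(positive | outcome)
prior <- c(blockbuster = 0.10, moderate = 0.15, breakeven = 0.50, loser = 0.25)
p_pos <- c(blockbuster = 0.95, moderate = 0.70, breakeven = 0.60, loser = 0.20)

joint <- prior * p_pos          # joint probabilities: the 'positive' column
p_positive <- sum(joint)        # answer 1: 0.55
p_hit <- unname(joint["blockbuster"] / p_positive)  # answer 2: about 0.17
```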
Here, we plot the daily electricity production data that was used in the last post.
The R code uses the filter function to build the two-sided moving average. The subsequent plot illustrates the method on the first five points (the red circle represents the centred average of the first five points).
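A minimal sketch of that approach with stats::filter, where both the simulated data and the name elec_prod are illustrative stand-ins for the real production series:

```r
# Stand-in daily series: a random walk around 60
set.seed(99)
elec_prod <- 60 + cumsum(rnorm(365, sd = 2))

# 5-point two-sided (centred) moving average; sides = 2 centres the window
ma5 <- stats::filter(elec_prod, rep(1/5, 5), sides = 2)

plot(elec_prod, type = "l", col = "grey", ylab = "Production")
lines(ma5, col = "red", lwd = 2)   # smoothed trend overlaid on the raw data
```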
Moving averages are a means to smooth noisy (time series) data, unearthing the underlying trends. The process is also known as filtering the data.
Moving averages (MA) are a series of averages estimated over a pre-defined number of consecutive data points. For example, MA5 is the set of averages of 5 successive members of the original series. The following relationship shows how a centred, or two-sided, moving average is estimated.
MA5(t) = (X(t-2) + X(t-1) + X(t) + X(t+1) + X(t+2)) / 5
‘t’ represents the time at which the moving average is estimated; t-1 is the observation just before t, and t+1 is the observation immediately after t. To illustrate the concept, we use the following table of 10 consecutive points (electric signals) in a dataset.
Date | Signal
1    | 72.5052
2    | 70.6720
3    | 62.4502
4    | 57.4714
5    | 55.3151
6    | 58.0904
7    | 62.6202
8    | 63.2485
9    | 60.5846
10   | 56.3154
The centred MA starts from point 3 (the midpoint of 1, 2, 3, 4, 5). The value at 3 is the mean of the first five points (72.5052 + 70.6720 + 62.4502 + 57.4714 + 55.3151) /5 = 63.68278.
Date | Signal  | Average
1    | 72.5052 |
2    | 70.6720 |
3    | 62.4502 | 63.68278
4    | 57.4714 |
5    | 55.3151 |
6    | 58.0904 |
7    | 62.6202 |
8    | 63.2485 |
9    | 60.5846 |
10   | 56.3154 |
This process continues: the MA at point 4 is the mean of points 2, 3, 4, 5, and 6, etc.
Date | Signal  | MA
1    | 72.5052 |
2    | 70.6720 |
3    | 62.4502 | 63.68278 [1-5]
4    | 57.4714 | 60.79982 [2-6]
5    | 55.3151 | 59.18946 [3-7]
6    | 58.0904 | 59.34912 [4-8]
7    | 62.6202 | 59.97176 [5-9]
8    | 63.2485 | 60.17182 [6-10]
9    | 60.5846 |
10   | 56.3154 |
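The centred MA column above can be reproduced directly with stats::filter:

```r
# The 10-point signal from the table
signal <- c(72.5052, 70.6720, 62.4502, 57.4714, 55.3151,
            58.0904, 62.6202, 63.2485, 60.5846, 56.3154)

# sides = 2 gives the centred (two-sided) moving average
centred <- stats::filter(signal, rep(1/5, 5), sides = 2)
round(centred[3], 5)   # 63.68278, matching the table
```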
The one-sided moving average is different: it reports the MA at the end of the interval. MA5(t) = (X(t-4) + X(t-3) + X(t-2) + X(t-1) + X(t)) / 5
Date | Signal  | MA
1    | 72.5052 |
2    | 70.6720 |
3    | 62.4502 |
4    | 57.4714 |
5    | 55.3151 | 63.68278 [1-5]
6    | 58.0904 | 60.79982 [2-6]
7    | 62.6202 | 59.18946 [3-7]
8    | 63.2485 | 59.34912 [4-8]
9    | 60.5846 | 59.97176 [5-9]
10   | 56.3154 | 60.17182 [6-10]
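In stats::filter, the one-sided (trailing) version is the same call with sides = 1:

```r
# The 10-point signal from the table
signal <- c(72.5052, 70.6720, 62.4502, 57.4714, 55.3151,
            58.0904, 62.6202, 63.2485, 60.5846, 56.3154)

# sides = 1 uses only the current and previous observations
trailing <- stats::filter(signal, rep(1/5, 5), sides = 1)
trailing[5]   # 63.68278: the same value, now reported at point 5 instead of 3
```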
We will look at the impact of moving averages in the next post.