mtcars Dataset – Correlation Coefficient

We have seen a couple of plots showing relationships between variables in the ‘mtcars’ dataset.

Statisticians use single numbers to quantify the strength and direction of a relationship. One of them is the correlation coefficient, which quantifies linear relationships. Before going into the correlation coefficient, let’s first look at the covariance between two variables.

Covariance

The sample covariance between two variables, based on N paired observations, is

Cov(x,y) = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})

The denominator is N − 1 rather than N because the population means are not known and are replaced by the sample means (x bar and y bar). Dividing by N − 1 corrects the bias that this substitution introduces.

# N - 1 = 31 for the 32 cars in mtcars
sum((car_data$mpg - mean(car_data$mpg)) * (car_data$cyl - mean(car_data$cyl))) / (nrow(car_data) - 1)

Or simply,

cov(car_data$mpg, car_data$cyl)
-9.17
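
As a quick sanity check, the manual formula and cov() can be compared side by side. This is only a sketch: it assumes car_data is a plain copy of R's built-in mtcars data frame, and the names n, dev_mpg, dev_cyl, cov_manual and cov_builtin are introduced here for illustration.

```r
# Assumption: car_data is a copy of R's built-in mtcars data frame.
car_data <- mtcars

n <- nrow(car_data)                            # number of observations (32 cars)

# Deviations of each variable from its sample mean
dev_mpg <- car_data$mpg - mean(car_data$mpg)
dev_cyl <- car_data$cyl - mean(car_data$cyl)

# Sample covariance from the formula, with N - 1 in the denominator
cov_manual <- sum(dev_mpg * dev_cyl) / (n - 1)

# The same quantity from the built-in function
cov_builtin <- cov(car_data$mpg, car_data$cyl)

all.equal(cov_manual, cov_builtin)             # TRUE
round(cov_builtin, 2)                          # -9.17
```

Deriving the denominator from nrow() rather than typing 31 keeps the calculation valid if rows are added or filtered out.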

Correlation coefficient

The Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations.

r_{x,y} = \frac{Cov(x,y)}{s_xs_y}

cov(car_data$mpg, car_data$cyl) /(sd(car_data$mpg)*sd(car_data$cyl))

Or simply use cor(), which is part of base R (the ‘stats’ package):

cor(car_data$mpg, car_data$cyl)
-0.85
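
Dividing by the standard deviations removes the units, which is what makes the correlation coefficient comparable across variable pairs. The sketch below illustrates this under two assumptions: car_data is a copy of the built-in mtcars data frame, and the factor of 10 is an arbitrary rescaling chosen only for demonstration.

```r
# Assumption: car_data is a copy of R's built-in mtcars data frame.
car_data <- mtcars

# Correlation from its definition versus the built-in function
r_manual  <- cov(car_data$mpg, car_data$cyl) /
  (sd(car_data$mpg) * sd(car_data$cyl))
r_builtin <- cor(car_data$mpg, car_data$cyl)
all.equal(r_manual, r_builtin)                 # TRUE

# Rescale mpg by an arbitrary factor (illustrative only)
mpg_scaled <- car_data$mpg * 10

cov_scaled <- cov(mpg_scaled, car_data$cyl)    # 10 times the original covariance
cor_scaled <- cor(mpg_scaled, car_data$cyl)    # same as r_builtin

all.equal(cov_scaled, 10 * cov(car_data$mpg, car_data$cyl))  # TRUE
all.equal(cor_scaled, r_builtin)                             # TRUE
```

The covariance changes with the measurement scale, while the correlation stays the same.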

The greater the absolute value of the correlation coefficient, the stronger the linear relationship. The coefficient ranges from -1 to +1, and either extreme represents a perfectly linear relationship. A positive value means that when one variable increases, the other tends to increase as well. A negative value means that when one variable increases, the other tends to decrease. Here, the value of -0.85 indicates that cars with more cylinders tend to have noticeably lower mpg.
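
A few toy vectors can make these extremes concrete. The vectors below (x, y_up, y_down, z) are made up for illustration and are not part of mtcars. The last case is a reminder that the coefficient measures only linear association.

```r
# Hypothetical toy data, not taken from mtcars
x <- 1:10

y_up   <- 2 * x + 3        # exact increasing straight line
y_down <- -0.5 * x + 7     # exact decreasing straight line

r_up   <- cor(x, y_up)
r_down <- cor(x, y_down)

all.equal(r_up, 1)         # TRUE: perfect positive linear relationship
all.equal(r_down, -1)      # TRUE: perfect negative linear relationship

# A perfect but nonlinear (symmetric, quadratic) relationship
z <- -5:5
r_quad <- cor(z, z^2)

abs(r_quad) < 1e-12        # TRUE: essentially zero linear correlation
```

A coefficient near zero therefore does not rule out a relationship. It only says that a straight line describes it poorly.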

In exploratory analyses, however, you may want to know the relationships between several variables in one go. That is the topic for the next post.
