Data & Statistics

Principal Component Analysis Applied

Let’s apply what we learned to the ‘mtcars’ data. We use R to perform the calculations. We require two packages, ‘stats’ and ‘ggbiplot’, to do the job.

library(stats)
library(ggbiplot)

Start with the simplest case first: two variables, mpg and disp.

data("mtcars")
car_data <- mtcars
mtcars.pca <- prcomp(car_data[, c(1, 3)], center = TRUE, scale. = TRUE)
ggbiplot(mtcars.pca,
  ellipse = TRUE,
  labels = rownames(car_data)
)

You can see a few clusters: cars on the right, on the left and in the centre. You can also see two arrows, one corresponding to mpg and another to disp. It’s true that we don’t need a PCA for two variables; a 2-D scatter plot can do the job already.

You can already start interpreting the PCA plot. The Cadillac and Lincoln are closer to the disp arrow in the PCA plot, which corresponds to the northwest of the Displacement vs Mileage plot. On the other hand, Honda, Porsche etc., are closer to the mpg arrow.

# Four variables: mpg, disp, wt and qsec (columns 1, 3, 6 and 7)
mtcars.pca <- prcomp(car_data[, c(1, 3, 6, 7)], center = TRUE, scale. = TRUE)
ggbiplot(mtcars.pca)
ggbiplot(mtcars.pca,
  ellipse = TRUE,
  labels = rownames(car_data)
)
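The biplot can be complemented with numbers. The following is a minimal sketch in base R, assuming the same four columns (mpg, disp, wt, qsec); the names `pca4` and `var_explained` are illustrative and not from the original post. It shows how much of the total variance each principal component carries and how the original variables are weighted in each component.

```r
# Sketch: inspect the four-variable PCA numerically (base R only)
data("mtcars")
pca4 <- prcomp(mtcars[, c("mpg", "disp", "wt", "qsec")],
               center = TRUE, scale. = TRUE)

# Proportion of the total variance carried by each principal component
var_explained <- pca4$sdev^2 / sum(pca4$sdev^2)
print(round(var_explained, 3))

# Loadings: the weight of each original variable in each component
print(round(pca4$rotation, 3))
```

If the first two proportions add up to most of the total, the 2-D biplot is a reasonable summary of the four variables.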


Principal Component Analysis – The Concept

Last time we saw the practical difficulty of analysing data from four or more measured variables. This demands a means of reducing the number of dimensions to two so that the data appears on a 2-D plot but still gives the message we want: that similar candidates cluster together.

In other words, one must perform the necessary mathematical manipulations to convert the parameters to a different set of variables (principal components), select the top two principal components, and plot them. All this happens without losing much of the information embedded in the data.

PCA is the technique of compressing data from a large set of measurements into a smaller number of independent (i.e., uncorrelated) variables that capture the core of the original data. Note that the principal components themselves are linear combinations of the original variables.

The first principal component, which becomes the X-axis, defines the direction of the maximum variation of data.
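These properties can be checked directly. The sketch below is only an illustration, assuming four mtcars columns (mpg, disp, wt, qsec) as the measurements; the variable names are hypothetical. It verifies that the component scores are linear combinations of the scaled variables, that the components are mutually uncorrelated, and that the first component has the largest variance.

```r
# Sketch: verify the defining properties of principal components
data("mtcars")
X <- scale(mtcars[, c("mpg", "disp", "wt", "qsec")])  # centre and scale
pca <- prcomp(X)

# Scores are linear combinations of the scaled original variables
scores_manual <- X %*% pca$rotation
max_diff <- max(abs(scores_manual - pca$x))
print(max_diff)            # effectively zero

# Correlations between different components (off-diagonal entries)
off_diag <- cor(pca$x)[upper.tri(diag(4))]
print(round(off_diag, 10))

# Variance of each component, largest first
pc_var <- apply(pca$x, 2, var)
print(round(pc_var, 3))
```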


Principal Component Analysis – Building the Case

Do you remember the “mtcars” dataset? It’s data collected from the 1974 Motor Trend US magazine and it comprises fuel consumption and ten aspects of automobile design and performance for 32 automobiles (1973–74 models). We’ll use it to explain the concept of principal component analysis or PCA.

If we measure only one aspect, we can present the data on a line plot:

You can see that Toyota Corolla, Fiat 128 etc., are similar to each other and have relatively high mileage values, whereas Cadillac Fleetwood and Lincoln Continental have lower values.

If we measure two properties, we can present the data in a 2-D graph.

If we measure one more property, we would add one more axis to the graph for a 3-D plot. But what happens if we have four or more parameters? PCA can take four or more measurements and make a 2-D PCA plot.


Population Inflection

The news that India has overtaken China as the most populous country in the world sparked a flurry of debates in the public discourse. As usual, many of them instilled fear and aimed at demonising specific communities. But, as we have seen before, the data is not as bad as one would imagine.

And the reason is visible in the following plot. You may see an inflection point, denoting a change in growth rate (not absolute growth). The location of inflection is estimated using R with the help of the package “inflection”.

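library(inflection)  # provides ese()

# in_pop is assumed to be a data frame loaded beforehand,
# with columns Year and All (total population)
x <- in_pop$Year[1:72]
y <- in_pop$All[1:72] / 1e6  # population in millions

plot(x, y, cex = 0.3, pch = 19, ylab = "Population in Millions", xlab = "Year", ylim = c(0, 1500), col = "blue", type = "l", lwd = 3)
grid()

# Extremum Surface Estimator; index 0 is for a convex-then-concave curve
bb <- ese(x, y, 0)
pese <- bb[, 3]  # third column: estimated inflection point

# Mark the inflection year on the plot
abline(v = pese, col = "red", lwd = 2, lty = 2)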

And this will lead to an eventual peak and a further decline, as per projections.
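To see what an inflection point means numerically, here is a minimal base-R sketch on a synthetic logistic curve. The numbers are hypothetical and are not the population data used above, and the method is a plain first-difference search rather than the estimator from the "inflection" package. The inflection sits where the yearly increase is largest.

```r
# Sketch: locate the inflection of a synthetic S-shaped curve (hypothetical data)
x  <- 1950:2021
x0 <- 1990.5                          # true inflection of the synthetic curve
y  <- 1600 / (1 + exp(-0.05 * (x - x0)))

# Growth per year; the inflection is where this increment peaks
growth <- diff(y)
x_mid  <- x[-1] - 0.5                 # midpoint of each one-year interval
x_infl <- x_mid[which.max(growth)]
print(x_infl)
```

Before the inflection the yearly increments keep rising; after it they shrink, which is why the curve eventually flattens.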

In the following plot, you will see what happens to the different age groups. The under-25 group (green) has already peaked, the 25-65 group (brown) will peak in a couple of decades from now, and the old (> 65, white) is projected to level off by the end of this century.

India’s population growth will come to an end: Our World in Data


The Lost Diamond of Bayes

Here is a problem that combines combinations with Bayes’s rule. A card is lost from the 52-card deck. Two cards are drawn from the deck and found to be both diamonds. What is the probability that the lost card is a diamond?

Let’s write down Bayes’ equation first.

P(L_D|2_D) = \frac{P(2_D|L_D)*P(L_D)}{P(2_D|L_D)*P(L_D) + P(2_D|L_{nD})*P(L_{nD})}

P(L_D|2_D) = the probability that the lost card is a diamond, given that two diamonds are drawn.
P(2_D|L_D) = the probability of drawing two diamonds if the lost card is a diamond.
P(L_D) = the probability of losing a diamond.
P(2_D|L_{nD}) = the probability of drawing two diamonds if the lost card is not a diamond.
P(L_{nD}) = the probability of losing a card other than a diamond.

Evaluating each term:
As there are 13 diamonds in a pack of 52 cards, P(L_D) is 13 in 52 (13/52 = 1/4), and P(L_{nD}) is 52-13 in 52 (39/52 = 3/4).
P(2_D|L_D), or the probability of drawing two diamonds from a deck with a missing diamond, is 12C2 / 51C2 = 12 x 11 / (51 x 50).
P(2_D|L_{nD}), or the probability of drawing two diamonds from a deck with a missing non-diamond, is 13C2 / 51C2 = 13 x 12 / (51 x 50).

P(L_D|2_D) = \frac{\frac{12*11}{51*50}*\frac{1}{4}}{\frac{12*11}{51*50}*\frac{1}{4} + \frac{13*12}{51*50}*\frac{3}{4}} = \frac{12*11*(1/4)}{12*11*(1/4) + 13*12*(3/4)} = \frac{11}{50}

P(L_D|2_D) = 11/50 = 22%
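The arithmetic can be reproduced in a few lines of R. This is a minimal sketch using `choose()` for the combinations; the variable names are illustrative only.

```r
# Sketch: Bayes' rule for the lost-card problem
p_LD  <- 13 / 52                                # prior: lost card is a diamond
p_LnD <- 39 / 52                                # prior: lost card is not a diamond

p_2D_given_LD  <- choose(12, 2) / choose(51, 2) # two diamonds, one diamond missing
p_2D_given_LnD <- choose(13, 2) / choose(51, 2) # two diamonds, no diamond missing

posterior <- (p_2D_given_LD * p_LD) /
  (p_2D_given_LD * p_LD + p_2D_given_LnD * p_LnD)
print(posterior)   # 0.22
```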


Committee of Couples

From a group of five married couples, how many committees of four or five people can be formed if no two people on the committee may be married to each other?

4-member committee

There are 5C4 ways to choose four couples. Then there are 2C1 ways to pick one person from each couple.

5C4 x 2C1 x 2C1 x 2C1 x 2C1 = 5 x 2 x 2 x 2 x 2 = 80

5-member committee

All five couples must be represented, and there are 2C1 ways to pick one person from each couple.

5C5 x 2C1 x 2C1 x 2C1 x 2C1 x 2C1 = 1 x 2 x 2 x 2 x 2 x 2 = 32

The required combinations (OR = union) = 80 + 32 = 112

Without those restrictions, there could have been 10C4 + 10C5 = 210 + 252 = 462 possibilities.
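The counts are small enough to verify by brute force. The sketch below, with hypothetical names such as `people` and `count_valid`, enumerates every committee with `combn()` and keeps only those in which no couple appears twice.

```r
# Sketch: brute-force check of the committee counts
# Ten people; persons 1-2 form couple 1, persons 3-4 couple 2, and so on
people <- data.frame(id = 1:10, couple = rep(1:5, each = 2))

count_valid <- function(k) {
  committees <- combn(people$id, k)   # each column is one committee
  valid <- apply(committees, 2, function(members) {
    !any(duplicated(people$couple[members]))   # no couple represented twice
  })
  sum(valid)
}

n4 <- count_valid(4)
n5 <- count_valid(5)
print(c(four = n4, five = n5, total = n4 + n5))
```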


Rearranging Mississippi

How many distinct ways can all the letters in MISSISSIPPI be arranged to form a new word?

Before we answer this, let’s do something simpler: the number of ways of arranging the letters of the word CAT. They can form CAT, CTA, TCA, TAC, ACT, and ATC, i.e. six arrangements.

We can also use the permutation formula to arrive at the same number. Why permutation? Well, the order matters here; otherwise, there would be only one possible combination. So, 3P3 = 3!/0! = 3! = 3 x 2 x 1 = 6.

MISSISSIPPI

There are 11 letters in the word MISSISSIPPI. If they were all different, there would be 11! arrangements. But some of the letters are the same: there are four Is, four Ss and two Ps. You don’t want to count the repeated ones multiple times. The way to avoid it is to divide the original permutations (11!) by the permutations of each repeated letter. So the required value is

11!/(4!4!2!) = 11 x 10 x 9 x 8 x 7 x 6 x 5 x 4! /(4! x 4 x 3 x 2 x 1 x 2 x 1)

= 11 x 10 x 9 x 7 x 5 = 34650.
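The same calculation can be written for any word. Here is a minimal R sketch with illustrative variable names: it counts each letter with `table()` and divides the factorial of the word length by the factorials of the letter counts.

```r
# Sketch: distinct arrangements of the letters of a word
letters_in_word <- strsplit("MISSISSIPPI", "")[[1]]
counts <- as.vector(table(letters_in_word))   # how often each letter occurs

arrangements <- factorial(length(letters_in_word)) / prod(factorial(counts))
print(round(arrangements))   # 34650
```

Replacing the string with "CAT" gives 6, matching the simpler example.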


In a 5-card hand – Counting

We evaluated three card probabilities in the previous post. It is important to verify the calculations by actually counting the occurrences: shuffle the deck a million times, draw five cards each time and count how often the event happens. But first, build the deck:

suits <- c("Diamonds", "Spades", "Hearts", "Clubs")
face <- c("Jack", "Queen", "King")
numb <- c("Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten")
face_card <- expand.grid(Face = face, Suit = suits)
face_card <- paste(face_card$Face, face_card$Suit)

numb_card <- expand.grid(Numb = numb, Suit = suits)
numb_card <- paste(numb_card$Numb, numb_card$Suit)

Aces <- paste("Ace", suits) 

deck <- c(Aces, numb_card, face_card)

Four face cards

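library(stringr)  # provides str_detect()

itr <- 1000000

shuff <- replicate(itr, {
  # Draw five cards without replacement; every card is equally likely
  draw <- sample(deck, 5, replace = FALSE)

  # Count the face cards in the hand
  dr <- sum(str_detect(draw, "Queen|King|Jack"))

  # 1 if exactly four face cards, 0 otherwise
  as.integer(dr == 4)
})

# Fraction of hands with exactly four face cards
mean(shuff)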

The answer turns out to be: 0.007548

Three cards are kings

itr <- 1000000

shuff <- replicate(itr, {
  # Draw five cards without replacement
  draw <- sample(deck, 5, replace = FALSE)

  # Count the kings in the hand
  dr <- sum(str_detect(draw, "King"))

  # 1 if exactly three kings, 0 otherwise
  as.integer(dr == 3)
})

mean(shuff)

The answer turns out to be: 0.001717

All five cards are hearts

itr <- 1000000

shuff <- replicate(itr, {
  # Draw five cards without replacement
  draw <- sample(deck, 5, replace = FALSE)

  # Count the hearts in the hand
  dr <- sum(str_detect(draw, "Hearts"))

  # 1 if all five cards are hearts, 0 otherwise
  as.integer(dr == 5)
})

mean(shuff)

The answer turns out to be: 0.00048


In a 5-card hand

In a 5-card hand, what is the probability of getting four face cards?

It is a 52-card deck, and it has 12 face cards. That means there are 40 non-face cards. The required hand has five cards, of which four are face cards and one is a non-face card.

Since the order in which they come doesn’t matter, we use combinations.

Out of the 12 face cards, we choose four, and out of the 40 other cards, we choose one. We then divide by the number of all possible hands, i.e. out of the 52 cards, choose five.

12C4 x 40C1 / 52C5 = 0.0076

Three cards are kings

Out of the 4 kings, we choose three, and out of the 48 other cards, we choose two non-kings.

P = 4C3 x 48C2 / 52C5 = 0.001736

All five cards are hearts

P = 13C5 / 52C5 = 0.000495
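The three probabilities can be evaluated with R's `choose()` function. This is a minimal sketch with illustrative variable names.

```r
# Sketch: exact probabilities for the three 5-card hands
total_hands <- choose(52, 5)

# Four face cards and one non-face card
p_four_face <- choose(12, 4) * choose(40, 1) / total_hands

# Three kings and two non-kings
p_three_kings <- choose(4, 3) * choose(48, 2) / total_hands

# All five cards are hearts
p_all_hearts <- choose(13, 5) / total_hands

print(signif(c(p_four_face, p_three_kings, p_all_hearts), 3))
```

The values come out at roughly 0.0076, 0.0017 and 0.0005.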


Summary Statistics of Linear Transformations

Here are the summary statistics for 31 daily high temperatures of a location in degrees Fahrenheit. What are the corresponding numbers in degrees Celsius?

Mean: 86.6 °F
Median: 87.3 °F
Standard Deviation: 5.2 °F
Variance: 27.04 °F^2

Central tendency and variability during transformations

A few exercises before we try to estimate the answer. Consider three numbers: 5, 6 and 7. The mean, median, standard deviation and variance of the collection are 6, 6, 1 and 1.

Now add 3 to each and find the summary statistics:

The new set is 8, 9, and 10 and the summary is 9, 9, 1, 1. The mean and median of the new set are just 3 more than the original, and the variance and the standard deviation are unchanged.

Multiply each by 4 and find the summary statistics:

The new set is 20, 24, and 28 and the summary is 24, 24, 4, 16. The mean and median of the new set are 4 times the original, the standard deviation is 4 times, and the variance is 4^2 (= 16) times.

Transformation of °F to °C

The relationship (which is a linear transformation) is

C = (5/9) x (F – 32)

C = -(160/9) + (5/9) F

Applying what we learned earlier,

Mean in °C = -(160/9) + (5/9) x 86.6 = 30.3
Median in °C = -(160/9) + (5/9) x 87.3 = 30.7
Standard deviation in °C = (5/9) x 5.2 = 2.89
Variance in °C^2 = (5/9)^2 x 5.2^2 = 8.35
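The rules can be checked numerically. The sketch below uses a hypothetical helper, `summarise_set`, to recompute the three-number exercises and then applies the same rules to the temperature summary.

```r
# Sketch: summary statistics under shifting and scaling
summarise_set <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), var = var(v))
}

x <- c(5, 6, 7)
base    <- summarise_set(x)       # original set
shifted <- summarise_set(x + 3)   # adding a constant moves mean and median only
scaled  <- summarise_set(4 * x)   # multiplying scales sd by 4 and variance by 16
print(rbind(base, shifted, scaled))

# Fahrenheit to Celsius: C = a + b * F with a = -160/9 and b = 5/9
a <- -160 / 9
b <- 5 / 9
celsius <- c(mean = a + b * 86.6, median = a + b * 87.3,
             sd = b * 5.2, var = b^2 * 5.2^2)
print(round(celsius, 2))
```

The additive constant a affects only the mean and median, while the multiplier b enters the standard deviation once and the variance squared.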

Linear Transformations: jbstatistics
