July 2023

Principal Component Analysis – Building the Case

Do you remember the “mtcars” dataset? It was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and ten aspects of automobile design and performance for 32 automobiles (1973–74 models). We’ll use it to explain the concept of principal component analysis, or PCA.

If we measure only one aspect, we can present the data on a line plot:

You can see that the Toyota Corolla, Fiat 128 etc. are similar to each other and have relatively high mileage values, whereas the Cadillac Fleetwood and Lincoln Continental have lower ones.

If we measure two properties, we can present the data in a 2-D graph.

If we measure one more property, we would add one more axis to the graph for a 3-D plot. But what happens if we have four or more parameters? PCA can take four or more measurements and make a 2-D PCA plot.
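As a minimal sketch of that idea, base R's prcomp() can project all eleven mtcars columns onto two principal components. Scaling the variables first is an assumption made here because the columns are measured in very different units; the names pca, scores and var_explained are illustrative.

```r
# mtcars ships with base R: 32 cars, 11 numeric variables
pca <- prcomp(mtcars, scale. = TRUE)  # centre and scale each variable

# Scores of every car on the first two principal components
scores <- pca$x[, 1:2]
head(scores)

# Proportion of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 2)
```

Plotting the two columns of scores against each other, for example with plot(scores), gives the 2-D PCA plot.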


Population Inflection

The news that India has overtaken China as the most populous country in the world sparked a flurry of debate in public discourse. As usual, much of it instilled fear and was aimed at demonising specific communities. But, as we have seen before, the data is not as bad as one would imagine.

And the reason is visible in the following plot. You may see an inflection point, where the growth rate stops rising and starts falling (the population itself is still growing). The location of the inflection is estimated using R with the help of the package “inflection”.

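library(inflection)  # provides ese()

# in_pop: data frame with the columns Year and All (total population)
x <- in_pop$Year[1:72]
y <- in_pop$All[1:72] / 1e6  # population in millions

plot(x, y, type = "l", lwd = 3, col = "blue", ylim = c(0, 1500),
     xlab = "Year", ylab = "Population in Millions")
grid()

# Extremum Surface Estimator; the third column is the estimated inflection point
bb <- ese(x, y, 0)
pese <- bb[, 3]

# Mark the inflection year with a dashed red line
abline(v = pese, col = "red", lwd = 2, lty = 2)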

And this will lead to an eventual peak and a further decline, as per projections.

In the following plot, you will see what happens to the different age groups. The under-25 group (green) has already peaked, the 25-65 group (brown) will peak a couple of decades from now, and the over-65 group (white) will flatten out by the end of this century.

India’s population growth will come to an end: Our World in Data


The Lost Diamond of Bayes

Here is a problem that combines combinations with Bayes’s rule. A card is lost from the 52-card deck. Two cards are drawn from the deck and found to be both diamonds. What is the probability that the lost card is a diamond?

Let’s write down Bayes’ equation first.

P(L_D|2_D) = \frac{P(2_D|L_D)*P(L_D)}{P(2_D|L_D)*P(L_D) + P(2_D|L_{nD})*P(L_{nD})}

P(L_D|2_D) = the probability that the lost card is a diamond, given that two diamonds are drawn.
P(2_D|L_D) = the probability of drawing two diamonds if the lost card is a diamond.
P(L_D) = the probability of losing a diamond.
P(2_D|L_nD) = the probability of drawing two diamonds if the lost card is not a diamond.
P(L_nD) = the probability of losing a card other than a diamond.

Evaluating each term:
As there are 13 diamonds in a pack of 52 cards, P(L_D) is 13 in 52 (13/52 = 1/4), and P(L_nD) is 52-13 in 52 (39/52 = 3/4).
P(2_D|L_D), or the probability of drawing two diamonds from a deck with a missing diamond, is 12C2 / 51C2 = 12 x 11 / (51 x 50).
P(2_D|L_nD), or the probability of drawing two diamonds from a deck with a missing non-diamond, is 13C2 / 51C2 = 13 x 12 / (51 x 50).

P(L_D|2_D) = \frac{\frac{12*11}{51*50}*\frac{1}{4}}{\frac{12*11}{51*50}*\frac{1}{4} + \frac{13*12}{51*50}*\frac{3}{4}} = \frac{12*11*(1/4)}{12*11*(1/4) + 13*12*(3/4)} = \frac{11}{50}

P(LD|2D) = 11/50 = 22%
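The same arithmetic can be checked in R with choose(). This is only a sketch; the variable names are illustrative and not part of the original post.

```r
# Priors: the lost card is a diamond / not a diamond
p_ld  <- 13 / 52
p_lnd <- 39 / 52

# Likelihoods: two diamonds drawn from the remaining 51 cards
p_2d_ld  <- choose(12, 2) / choose(51, 2)  # a diamond is missing
p_2d_lnd <- choose(13, 2) / choose(51, 2)  # a non-diamond is missing

# Bayes' rule
posterior <- (p_2d_ld * p_ld) / (p_2d_ld * p_ld + p_2d_lnd * p_lnd)
posterior
```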


Committee of Couples

From a group of five married couples, how many committees of four or five people can be formed if no two people on the committee may be married to each other?

4-member committee

There are 5C4 ways to choose four couples. Then there are 2C1 ways to pick one person from each couple.

5C4 x 2C1 x 2C1 x 2C1 x 2C1 = 5 x 2 x 2 x 2 x 2 = 80

5-member committee

5C5 x 2C1 x 2C1 x 2C1 x 2C1 x 2C1 = 1 x 2 x 2 x 2 x 2 x 2 = 32

The required combinations (OR = union) = 80 + 32 = 112

Without those restrictions, there could have been 10C4 + 10C5 = 210 + 252 = 462 possibilities.
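The counts can be cross-checked by brute force: list every committee with combn() and keep those with no married pair. This is a sketch under an assumed labelling in which people 1 and 2 form the first couple, 3 and 4 the second, and so on; count_valid is a hypothetical helper.

```r
# Couple id for each of the ten people (1,1,2,2,...,5,5)
couple <- rep(1:5, each = 2)

# Number of k-member committees with no two members from the same couple
count_valid <- function(k) {
  committees <- combn(10, k)  # one committee per column
  sum(apply(committees, 2, function(members) !any(duplicated(couple[members]))))
}

four_member  <- count_valid(4)
five_member  <- count_valid(5)
unrestricted <- choose(10, 4) + choose(10, 5)

c(four_member, five_member, four_member + five_member, unrestricted)
```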


Rearranging Mississippi

How many distinct ways can all the letters in MISSISSIPPI be arranged to form a new word?

Before we answer this, let’s do something simpler: count the ways of arranging the letters of the word CAT. They can form CAT, CTA, TCA, TAC, ACT, and ATC, which is six ways.

We can also use the permutation formula to arrive at the same. Why permutation? Well, the order matters here, or else it would have been only one combination possible. So, 3P3 = 3!/0! = 3! = 3 x 2 x 1 = 6.

MISSISSIPPI

There are 11 letters in the word MISSISSIPPI, so if all of them were different, there would be 11! arrangements. But some of the letters are the same: there are four Is, four Ss and two Ps. You don’t want to count arrangements that differ only by swapping identical letters more than once. The way to avoid it is to divide the original permutations (11!) by the permutations of each repeated letter. So the required value is

11!/(4!4!2!) = 11 x 10 x 9 x 8 x 7 x 6 x 5 x 4! /(4! x 4 x 3 x 2 x 1 x 2 x 1)

= 11 x 10 x 9 x 7 x 5 = 34650.
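A short R sketch of the same division, counting the letter frequencies directly from the word (the variable names are illustrative):

```r
# Split the word into letters and count how often each one occurs
letters_in_word <- strsplit("MISSISSIPPI", "")[[1]]
letter_counts <- as.vector(table(letters_in_word))

# All orderings divided by the orderings of each repeated letter
arrangements <- factorial(length(letters_in_word)) / prod(factorial(letter_counts))
arrangements
```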


In a 5-card hand – Counting

We evaluated three card probabilities in the previous post. It is worth verifying those calculations by simulation: draw five cards from a shuffled deck a million times and count how often each hand occurs. But first, build the deck:

suits <- c("Diamonds", "Spades", "Hearts", "Clubs")
face <- c("Jack", "Queen", "King")
numb <- c("Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten")
face_card <- expand.grid(Face = face, Suit = suits)
face_card <- paste(face_card$Face, face_card$Suit)

numb_card <- expand.grid(Numb = numb, Suit = suits)
numb_card <- paste(numb_card$Numb, numb_card$Suit)

Aces <- paste("Ace", suits) 

deck <- c(Aces, numb_card, face_card)

Four face cards

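library(stringr)  # provides str_detect()

itr <- 1000000

# Draw 5 cards itr times; record 1 when exactly four are face cards
shuff <- replicate(itr, {
  draw <- sample(deck, 5, replace = FALSE)
  dr <- sum(str_detect(draw, "Queen|King|Jack"))
  if (dr == 4) 1 else 0
})

# Fraction of hands with exactly four face cards
mean(shuff)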

The answer turns out to be: 0.007548

Three cards are kings

itr <- 1000000

# Record 1 when exactly three of the five cards are kings
shuff <- replicate(itr, {
  draw <- sample(deck, 5, replace = FALSE)
  dr <- sum(str_detect(draw, "King"))
  if (dr == 3) 1 else 0
})

mean(shuff)
0.001717

All five cards are hearts

itr <- 1000000

# Record 1 when all five cards are hearts
shuff <- replicate(itr, {
  draw <- sample(deck, 5, replace = FALSE)
  dr <- sum(str_detect(draw, "Hearts"))
  if (dr == 5) 1 else 0
})

mean(shuff)
0.00048


In a 5-card hand

In a 5-card hand, what is the probability of getting four face cards?

It is a 52-card deck, and it has 12 face cards. That means there are 40 non-face cards. The required hand has five cards, of which four are face cards and one is a non-face card.

Since the order in which they come doesn’t matter, we use combinations. Out of the 12 face cards, we choose four, and out of the 40 other cards, we choose one; we then divide by all possible combinations, i.e. out of the 52 cards, choose 5.

12C4 x 40C1 / 52C5 = 0.00762

Three cards are kings

Out of the 4 kings, we choose three, and out of the 48 other cards, we choose two non-kings.

P = 4C3 x 48C2 / 52C5 = 0.001736

All five cards are hearts

P = 13C5 / 52C5 = 0.000495
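The three probabilities can be evaluated in R with choose(). This is a minimal sketch; the variable names are illustrative.

```r
total_hands <- choose(52, 5)  # all possible 5-card hands

# Exactly four face cards and one non-face card
p_four_face <- choose(12, 4) * choose(40, 1) / total_hands

# Exactly three kings and two non-kings
p_three_kings <- choose(4, 3) * choose(48, 2) / total_hands

# All five cards are hearts
p_all_hearts <- choose(13, 5) / total_hands

c(p_four_face, p_three_kings, p_all_hearts)
```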


Summary Statistics of Linear Transformations

Here are the summary statistics for 31 daily high temperatures of a location in degrees Fahrenheit. What are the corresponding numbers in degrees Celsius?

Mean: 86.6 °F
Median: 87.3 °F
Standard deviation: 5.2 °F
Variance: 27.04 °F²

Central tendency and variability during transformations

A few exercises before we try to estimate the answer. Consider three numbers: 5, 6 and 7. The mean, median, standard deviation and variance of the collection are 6, 6, 1 and 1.

Now add 3 to each and find the summary statistics:

The new set is 8, 9, and 10 and the summary is 9, 9, 1, 1. The mean and median of the new set are just 3 more than the original, and the variance and the standard deviations are unchanged.

Multiply each by 4 and the summary statistics:

The new set is 20, 24, and 28 and the summary is 24, 24, 4, 16. The mean and median of the new set are 4 times the original, the standard deviation is 4 times and the variance is 4² = 16 times.
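The two exercises can be reproduced in R. The helper summarise_set is a hypothetical name introduced for this sketch; sd() and var() use the sample (n - 1) definitions, which give the values quoted above.

```r
# Mean, median, standard deviation and variance of a numeric vector
summarise_set <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), var = var(v))
}

x <- c(5, 6, 7)

original   <- summarise_set(x)
shifted    <- summarise_set(x + 3)  # adding a constant
multiplied <- summarise_set(4 * x)  # multiplying by a constant

rbind(original, shifted, multiplied)
```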

Transformation of °F to °C

The relationship, which is a linear transformation, is

C = (5/9) x (F – 32)

C = -(160/9) + (5/9) F

Applying what we learned earlier,

Mean in °C = -(160/9) + (5/9) x 86.6 = 30.3
Median in °C = -(160/9) + (5/9) x 87.3 = 30.7
Standard deviation in °C = (5/9) x 5.2 = 2.89
Variance in °C² = (5/9)² x 5.2² = 8.35
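A small R sketch of the conversion, using only the four Fahrenheit summary values given above (the names intercept and slope are illustrative):

```r
# C = intercept + slope * F
intercept <- -160 / 9
slope <- 5 / 9

mean_c   <- intercept + slope * 86.6  # location measures shift and scale
median_c <- intercept + slope * 87.3
sd_c     <- slope * 5.2               # spread is only scaled
var_c    <- slope^2 * 5.2^2           # variance scales with the square

round(c(mean = mean_c, median = median_c, sd = sd_c, var = var_c), 2)
```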

Linear Transformations: jbstatistics


Geometric Distribution

One in five cars in the city is green. What is the probability that the fifth car is the first green car?

We already know we can solve this problem using the negative binomial distribution function. But there is a special distribution for problems of this type, where the question is on which trial the first success arrives: the geometric distribution. The probability that the first success occurs on the k-th of a series of independent trials, each with success probability p, is

P(X = k) = p * (1-p)^{k-1}

To answer the question at the beginning, we substitute p = 0.20 (one in five) and k = 5; the required probability is (0.2)*(1-0.2)^4 = 0.08192

The R code for the same calculation is below. Note that dgeom takes the number of failures before the first success, which is 4 here.

dgeom(4, prob = 0.2, log = FALSE)

The geometric distribution chart below shows the probability of seeing the first green car at precisely the 1st, 2nd, 3rd, etc. car, up to the 30th.
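The values behind such a chart can be computed as follows. This is a sketch that evaluates the formula directly and cross-checks it against dgeom(); the names k and prob_first_green are illustrative.

```r
p <- 0.2   # probability that any given car is green
k <- 1:30  # position of the first green car

# P(X = k) = p * (1 - p)^(k - 1)
prob_first_green <- p * (1 - p)^(k - 1)

# dgeom() is parameterised by the number of failures, k - 1
stopifnot(isTRUE(all.equal(prob_first_green, dgeom(k - 1, prob = p))))

round(prob_first_green[1:5], 5)
```

Passing prob_first_green to barplot() with names.arg = k would draw the chart.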


Discrete Probability Distributions 

We have seen a few discrete probability distributions by now. Today we summarise them and find the relationships and the differences. The following are considered here:

Bernoulli distribution

The Bernoulli distribution is the distribution of the number of successes on a single Bernoulli trial. In a Bernoulli trial, you either get a success (1) or a failure (0). Therefore, a Bernoulli random variable can take either zero or one. E.g., if a coin is tossed once, what is the probability that it comes up heads?

Binomial distribution

When you carry out multiple independent Bernoulli trials and count the successes, you get a binomial distribution. E.g., if I toss a coin ten times, what is the probability of getting exactly four heads? So, you can already conclude that the Bernoulli distribution is a special case of the binomial distribution with one trial.

Geometric distribution

The geometric distribution is the distribution of the number of Bernoulli trials to get the first success. E.g., if a coin is tossed repeatedly, what is the probability that the first head comes on the fifth trial?

Negative binomial distribution

A general case of the geometric distribution is the negative binomial distribution. It is the distribution of the number of trials needed to get a certain number of successes in repeated independent Bernoulli trials. E.g., if a coin is tossed repeatedly, what is the probability that the third head comes on the tenth trial?

Hypergeometric distribution

The hypergeometric distribution is similar to the binomial distribution, but the sampling is done without replacement, so the trials are not independent. E.g., if five cards are drawn from a deck without replacement, what is the probability of getting two spades?

Poisson distribution

It is the distribution of the number of events in a given duration if those events occur randomly and independently. E.g., what is the probability of having exactly three shark attacks on a particular beach this year? The binomial distribution is well approximated by the Poisson distribution if the number of trials is large and the probability of success is small.
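Each example question above maps onto a density function in base R. The sketch below assumes a fair coin (p = 0.5) for the coin questions and, purely for illustration, an average of two shark attacks per year; neither value comes from the post.

```r
p <- 0.5  # assumed: fair coin

p_bernoulli <- dbinom(1, size = 1, prob = p)   # one toss, heads
p_binomial  <- dbinom(4, size = 10, prob = p)  # exactly 4 heads in 10 tosses
p_geometric <- dgeom(4, prob = p)              # first head on toss 5 (4 failures first)
p_negbinom  <- dnbinom(7, size = 3, prob = p)  # third head on toss 10 (7 failures)
p_hyper     <- dhyper(2, m = 13, n = 39, k = 5)  # 2 spades in 5 cards
p_poisson   <- dpois(3, lambda = 2)            # 3 events, assumed rate of 2 per year

# Binomial with many trials and a small probability is close to Poisson
p_binom_large_n <- dbinom(3, size = 1000, prob = 0.002)

c(p_bernoulli, p_binomial, p_geometric, p_negbinom, p_hyper, p_poisson, p_binom_large_n)
```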

Reference

Overview of Some Discrete Probability Distributions: jbstatistics
