Data & Statistics

Ecological fallacy

Much of the data we know describes the general trends of a region or group rather than those of its individual members. E.g. the crime rate of a city is estimated by dividing the total number of crimes by the number of people. It is not calculated from surveys of every individual member. A city can have a high crime rate, yet 99% of its residents may be under no threat to their life or property. In other words, there may be a few pockets in the city that experience disproportionately more crime than the rest.

Ecological fallacy describes the logical error when we take a statistic meant to represent an area or group and apply it to the individuals or objects inside. It gets the name because the data was meant to describe the system, the environment or the ecology.

A lot of stereotypes arise out of ecological fallacy. A well-known example is racial profiling, in which a person is discriminated against or stereotyped on the basis of their ethnicity, religion or nationality. Simpson’s paradox, something we had discussed in the past, is a special case of ecological fallacy.

A classic case is the 1950 paper published by Robinson in the American Sociological Review. At the individual level, he found a positive correlation between being a migrant (born outside the US) and illiteracy. Yet, at the state level, he found a negative correlation (-0.53) between illiteracy and the number of people born outside the US. This was counterintuitive. One possible explanation is that while migrants tended to be more illiterate, they tended to settle in regions that were, on average, more literate, such as big cities.

Robinson, W. S.; American Sociological Review, 1950, 15 (3), 351-357.
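The sign flip between individual-level and group-level correlations can be reproduced with made-up data. The following is only a sketch in R: every number in it (50 regions, 2000 people per region, the illiteracy probabilities) is an assumption chosen for illustration and has nothing to do with Robinson's census data. Migrants are given a higher chance of being illiterate, but regions with more migrants are given a lower baseline illiteracy.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

n_regions <- 50    # hypothetical number of regions
n_per     <- 2000  # hypothetical number of people per region

# Assumed share of migrants in each region (2% to 30%)
share  <- runif(n_regions, 0.02, 0.30)
region <- rep(seq_len(n_regions), each = n_per)

# Individual migrant status, drawn from the regional share
migrant <- rbinom(n_regions * n_per, 1, share[region])

# Assumed illiteracy probability: lower baseline in migrant-heavy regions,
# plus a fixed penalty for being a migrant
p_illiterate <- 0.15 - 0.4 * share[region] + 0.08 * migrant
illiterate   <- rbinom(length(migrant), 1, p_illiterate)

# Correlation across individuals
cor_individual <- cor(migrant, illiterate)

# Correlation across regions (migrant share vs illiteracy rate)
cor_region <- cor(as.numeric(tapply(migrant, region, mean)),
                  as.numeric(tapply(illiterate, region, mean)))

c(individual = cor_individual, regional = cor_region)
```

Under these assumptions the individual-level correlation comes out positive and the regional one negative, which is the pattern described above. Reading the regional number as a statement about individuals would be the ecological fallacy.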


p-value Fallacy

We have seen the p-value before. A higher p-value suggests that the sample results are consistent with the null hypothesis. Correspondingly, to reject the null hypothesis, you would like to have a lower p-value.

A p-value is the probability of observing a sample statistic at least this extreme when the null hypothesis is correct. For example, if 0.05 is the p-value of a study to test the effectiveness of a drug, then you should understand that even if the medicine has no effect, 5% of such studies would give results at least as extreme as the ones you obtained.
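This reading of the p-value can be checked with a small simulation. The sketch below assumes a made-up trial design (two arms of 30 patients, normally distributed outcomes, a Welch t-test); none of these details come from a real study. Both arms are drawn from the same distribution, so the null hypothesis is true by construction.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n_studies <- 10000  # hypothetical number of repeated studies
n_per_arm <- 30     # hypothetical patients per arm

p_values <- replicate(n_studies, {
  placebo <- rnorm(n_per_arm)
  drug    <- rnorm(n_per_arm)  # same distribution: the drug does nothing
  t.test(drug, placebo)$p.value
})

# Fraction of studies reaching p <= 0.05 although there is no effect
false_positive_rate <- mean(p_values <= 0.05)
false_positive_rate
```

The fraction should land close to 0.05: about one in twenty studies of an ineffective drug crosses the threshold by chance alone.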

It doesn’t stop here. People then take that 5% to be the error rate of the test, i.e., the chance that the rejected null hypothesis was in fact true. This is termed the p-value fallacy. The error rate associated with a particular p-value is estimated to be much higher than the p-value itself.


Counting Cards

Blackjack is a special kind of casino game as its hands are not independent of one another: a card, once it has left the deck, cannot come back until the deck is reshuffled. It means that the cards already dealt affect the probability of the next card, and therefore, card counting is a possible means of estimating future win potential. Consider this:

The probability of drawing a natural from a full 52-card deck is

P(\text{Natural}) = 2*\frac{4}{52}*\frac{16}{51} \approx 0.0483 \approx \frac{1}{21}

Suppose an ace is removed from the deck. The updated probability for the natural is:

P(\text{Natural}) = 2*\frac{3}{51}*\frac{16}{50} \approx 0.0376 \approx \frac{1}{27}

Or if a three was removed,

P(\text{Natural}) = 2*\frac{4}{51}*\frac{16}{50} \approx 0.0502 \approx \frac{1}{20}
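The three cases above follow one pattern, which can be written as a small R helper. The function name natural_prob and its arguments are assumptions made for this sketch, not part of any package.

```r
# Probability of a two-card natural (ace + ten-count card, in either order)
# given how many aces, ten-count cards and total cards remain in the deck
natural_prob <- function(aces, tens, total) {
  2 * (aces / total) * (tens / (total - 1))
}

full_deck     <- natural_prob(aces = 4, tens = 16, total = 52)
ace_removed   <- natural_prob(aces = 3, tens = 16, total = 51)
three_removed <- natural_prob(aces = 4, tens = 16, total = 51)

round(c(full_deck, ace_removed, three_removed), 4)
```

Removing an ace lowers the chance of a natural, while removing a low card such as a three raises it slightly, which is the information a card counter keeps track of.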


The Blackjack

What is the probability of being dealt a natural, or blackjack (a two-card total of 21), from a well-shuffled 5-deck shoe?


Getting exactly 21 from two cards means you must get an ace (an ace can count as 1 or 11 at the player’s discretion) and a ten-count card (a 10 or a face card). We will calculate the probability by two different methods.

Method 1

With cards from n decks, the probability of getting an ace in the first deal followed by a ten-count card in the second is given by:

\frac{4n}{52n}*\frac{16n}{52n-1}

Similarly, the probability of having a 10-card followed by an ace is:

\frac{16n}{52n}*\frac{4n}{52n-1}

Combining the two, you get the total probability of natural.

P(\text{Natural}) = \frac{4n}{52n}*\frac{16n}{52n-1} + \frac{16n}{52n}*\frac{4n}{52n-1} = \frac{32n}{13(52n-1)}

For a 5-deck shoe, substitute n = 5, and you get 0.0475 or 4.75%.

Method 2

We will use the familiar combinations formula to get the same.

P(\text{Natural}) = \frac{_{4n}C_1*_{16n}C_1}{_{52n}C_2} = \frac{20*80}{33670} = 0.04752 (\text{ for n = 5})
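Both methods can be checked side by side in R. This is a minimal sketch; the variable names method_1 and method_2 are chosen here for illustration.

```r
n <- 5  # number of decks in the shoe

# Method 1: ace then ten-count card, plus ten-count card then ace
method_1 <- (4 * n) / (52 * n) * (16 * n) / (52 * n - 1) +
            (16 * n) / (52 * n) * (4 * n) / (52 * n - 1)

# Method 2: one ace and one ten-count card out of all two-card hands
method_2 <- choose(4 * n, 1) * choose(16 * n, 1) / choose(52 * n, 2)

c(method_1 = method_1, method_2 = method_2)
all.equal(method_1, method_2)
```

Changing n shows how the probability moves with the number of decks; for n = 1 it gives the single-deck value of about 0.0483.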



Card Games – Continued

Five cards are dealt from a standard 52-card deck. What is the probability of drawing four face cards and an Ace? First, we do it analytically.

Combinations

We are choosing four face cards from the total possible 12 (and without replacement). Since the order of the deal doesn’t matter, it is a combination problem of the following form.

\\ _{12}C_4 = \frac{12!}{(12-4)!4!} = \frac{12!}{8!4!} \\ \\ = \frac{12*11*10*9}{4*3*2*1} = 495

Now, calculate the ways of choosing one ace card from a total of four.

\\ _{4}C_1 = \frac{4!}{(4-1)!1!} = \frac{4!}{3!} = 4

To get the required probability, we have to divide the product of the two by the number of all possible combinations (of selecting five cards from the deck of 52).

\\ _{52}C_5 = \frac{52!}{(52-5)!5!} = \frac{52!}{47!5!}  \\ \\ = \frac{52*51*50*49*48}{5*4*3*2*1} = 2598960

Finally, the probability is calculated by dividing the number of ways of getting four face cards and one ace by the number of ways of getting any five cards from the deck.

\\ P = \frac{_{12}C_4 * _4C_1} {_{52}C_5} \\ \\ P = \frac{\frac{12!}{8!4!}*\frac{4!}{3!1!}}{\frac{52!}{47!5!}} \\ \\ P = \frac{495 * 4}{2598960} = 0.0007618432

We can execute the whole calculation using the following R code:

choose(12,4)*choose(4,1)/choose(52,5)

Monte Carlo

Let’s start building the deck as we did last time.

suits <- c("Diamonds", "Spades", "Hearts", "Clubs")
face <- c("Jack", "Queen", "King")
numb <- c("Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten")

face_card <- expand.grid(Face = face, Suit = suits)
face_card <- paste(face_card$Face, face_card$Suit)

numb_card <- expand.grid(Numb = numb, Suit = suits)
numb_card <- paste(numb_card$Numb, numb_card$Suit)

Aces <- paste("Ace", suits) 

deck <- c(Aces, numb_card, face_card)

Now draw five cards at random and check the hand. Repeat that a million times, count the number of times you got what you wanted, and divide the count by a million.

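B <- 1000000  # number of simulated hands

results <- replicate(B, {
  # deal five distinct cards from the 52-card deck
  hand <- sample(1:52, 5, replace = FALSE)
  deal <- deck[hand]

  # count the face cards and the aces in the hand
  match_1 <- sum(face_card %in% deal)
  match_2 <- sum(Aces %in% deal)

  # TRUE when the hand holds exactly four face cards and one ace
  match_1 == 4 && match_2 == 1
})

sum(results) / B  # estimated probability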


Card Games

Let’s play some card games. Today we will create a deck of cards using R programming.

There are 52 cards in a deck, and they fall into four suits – Diamonds, Spades, Hearts and Clubs. Each suit has nine number cards (2 – 10), three face cards (Jack, Queen and King) and an Ace. For example, one card can be the Ace of Clubs, another the four of Hearts, etc.

suits <- c("Diamonds", "Spades", "Hearts", "Clubs")
numbers <- c("Ace", "Deuce", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King")

We will use the expand.grid function in R to create a data frame with all combinations of suits and numbers.

deck <- expand.grid(Number = numbers, Suit = suits)

If you print the output from the previous command (e.g. with as_tibble(deck) from the tibble package), you get a data frame with 52 rows. The top five rows are shown below.

Number <fctr>    Suit <fctr>
Ace              Diamonds
Deuce            Diamonds
Three            Diamonds
Four             Diamonds
Five             Diamonds

It still doesn’t look like cards, as the number and the suit sit in separate columns of the data frame. So we will paste them together and create one deck of 52 entries, as follows.

deck <- paste(deck$Number, deck$Suit)
“Ace Diamonds” “Deuce Diamonds” “Three Diamonds” “Four Diamonds” “Five Diamonds” “Six Diamonds” “Seven Diamonds” “Eight Diamonds” “Nine Diamonds” “Ten Diamonds” “Jack Diamonds” “Queen Diamonds” “King Diamonds” “Ace Spades” “Deuce Spades” “Three Spades” “Four Spades” “Five Spades” “Six Spades” “Seven Spades” “Eight Spades” “Nine Spades” “Ten Spades” “Jack Spades” “Queen Spades” “King Spades” “Ace Hearts” “Deuce Hearts” “Three Hearts” “Four Hearts” “Five Hearts” “Six Hearts” “Seven Hearts” “Eight Hearts” “Nine Hearts” “Ten Hearts” “Jack Hearts” “Queen Hearts” “King Hearts” “Ace Clubs” “Deuce Clubs” “Three Clubs” “Four Clubs” “Five Clubs” “Six Clubs” “Seven Clubs” “Eight Clubs” “Nine Clubs” “Ten Clubs” “Jack Clubs” “Queen Clubs” “King Clubs”

We will do the first two calculations now. How many kings are in the deck, and what is the probability of drawing a king from the deck?

kings <- paste("King", suits) 
deck %in% kings
sum(deck %in% kings) # output is 4
mean(deck %in% kings) # output is 0.07692308


# paste("King", suits) creates all four kings (output: "King Diamonds" "King Spades" "King Hearts" "King Clubs")
# The operator %in% checks each element of 'deck' against 'kings' and returns TRUE where it matches and FALSE otherwise.
# sum adds them up: each TRUE counts as 1, and each FALSE as 0.
# mean gives the average: sum/total count, i.e. 4/52


Permutations and Combinations Continued

Five strong contenders are running a race. How many ways can the gold, silver and bronze be awarded? We have five possible athletes to choose from, but only three at a time. The order does matter here as these are first, second and third places. Also, one person cannot be both first and second, i.e., repetition is not allowed. So it is a permutation problem.

\\ _nP_r = \frac{n!}{(n-r)!}  \\ _5P_3 = \frac{5!}{(5-3)!} = \frac{5!}{2!} = \frac{5 * 4 * 3 * 2 * 1}{2 * 1} = 60

Now, suppose you have five toppings to choose from and want to make a pizza with three of them. How many distinct pizzas can you make? The first thing to notice here is the lack of order – your selection of pepperoni, onions and mushrooms is no different from onions, pepperoni and mushrooms or mushrooms, onions, pepperoni etc. It becomes a combination problem.

\\ _nC_r = \frac{n!}{(n-r)! r!} \\ _{5}C_3 = \frac{5!}{(5-3)!3!} = \frac{5!}{2!3!} = \frac{5*4*3*2*1}{(2*1)(3*2*1)} = \frac{5*4}{2} = 10

Note that this problem does not allow you to choose the same topping twice, which real shops may permit!
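The two counts can be verified in R. The sketch below uses factorial and choose for the formulas and combn to list the pizzas one by one; the last two topping names (olives, peppers) are made up for the example.

```r
# Medals: ordered selection of 3 out of 5 athletes
medal_orders <- factorial(5) / factorial(5 - 3)

# Pizzas: unordered selection of 3 out of 5 toppings
pizza_count <- choose(5, 3)

# Enumerate the pizzas explicitly (one column per pizza)
toppings <- c("pepperoni", "onions", "mushrooms", "olives", "peppers")
pizzas   <- combn(toppings, 3)

c(medal_orders = medal_orders, pizza_count = pizza_count,
  enumerated = ncol(pizzas))
```

Each of the ten pizzas can be arranged in 3! = 6 orders, which is why the permutation count (60) is six times the combination count (10).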


Permutations and Combinations

At a party of 25 people, how many handshakes happen if each person shakes hands with every other? Is this a permutation problem or a combination?

Before that, what is a permutation, and what is a combination? They both represent different ways of arranging things from an available list of options. For example, how many unique passcodes are possible for a combination lock of 4 wheels? Let’s count. There are ten possibilities for the first wheel (0-9), another 10 for the second, and the same for the third and fourth. Total possibilities = 10 x 10 x 10 x 10 = 10^4 = 10000. Some people will say there are 9999 ways, as one is always available to start!

Permutations

How many four-digit numbers can you make from 4, 6, 7 and 8? It is different from the combination lock problem as you cannot use a digit again once you have used it up. So let’s do the counting: in the first place, you have four possibilities; in the second place, since one of them is used up, you have three, then two and finally, the remaining one. So the total permutations are 4 x 3 x 2 x 1 = 24.

The number of permutations of n available options taken r at a time is nPr. In the digit case, it was 4P4.

\\ _nP_r = \frac{n!}{(n-r)!} \\ \\ _4P_4 = \frac{4!}{(4-4)!} = \frac{4!}{0!} = \frac{4 * 3 * 2 * 1}{1} = 24

Combinations

A combination is a selection in which the order does not matter. The handshake problem is a combination case: when two people shake hands, only one handshake happens, whichever of the two you count it from. In other words, it is a permutation problem with the double-counting divided out. The count is called combinations, or nCr.

\\ _nC_r = \frac{n!}{(n-r)! r!} \\ \\ _{25}C_2 = \frac{25!}{(25-2)!2!} = \frac{25!}{23!2!} = \frac{25*24*23*22*....*1}{(23*22*21*....*1)(2*1)} = \frac{25*24}{2} = 300

300 unique handshakes will happen.
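The counts in this post can be reproduced in a few lines of R. This is only a sketch; the variable names are chosen here for illustration.

```r
# Combination lock: 4 wheels, 10 digits each, repetition allowed
lock_codes <- 10^4

# Four-digit numbers from 4, 6, 7 and 8 without reusing a digit
digit_permutations <- factorial(4) / factorial(4 - 4)

# Handshakes among 25 people: unordered pairs
handshakes <- choose(25, 2)

# Cross-check by listing every pair of guests (one column per handshake)
pairs <- combn(25, 2)

c(lock_codes = lock_codes, digit_permutations = digit_permutations,
  handshakes = handshakes, listed_pairs = ncol(pairs))
```

Counting ordered pairs instead would give 25 x 24 = 600, which is exactly double, since it counts each handshake once from each side.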



Outliers

An outlier is an anomalous value in the dataset. Consider the following dataset.

1.97    2.1     0.9     1.8     2.2
1.4     1.85    1.31    1.92    1.8
1.54    10.7    1.33    1.71    2.4
1.62    1.22    1.7     1.63    1.6
1.79    1.52    1.83    1.8     1.69

Sort

Can you spot the outlier here? The easiest way is to sort the data in ascending order.

0.9     1.22    1.31    1.33    1.4
1.52    1.54    1.6     1.62    1.63
1.69    1.7     1.71    1.79    1.8
1.8     1.8     1.83    1.85    1.92
1.97    2.1     2.2     2.4     10.7

The value at the bottom right appears suspicious. The average of the set with the last value is 2.05, and that without is 1.69.
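The sorting and the two averages can be reproduced in R. The sketch below types in the 25 values from the dataset; the variable names are chosen here for illustration.

```r
x <- c(1.97, 2.1,  0.9,  1.8,  2.2,
       1.4,  1.85, 1.31, 1.92, 1.8,
       1.54, 10.7, 1.33, 1.71, 2.4,
       1.62, 1.22, 1.7,  1.63, 1.6,
       1.79, 1.52, 1.83, 1.8,  1.69)

sorted_x <- sort(x)  # ascending order; the suspicious value ends up last

mean_with    <- mean(x)               # average including the largest value
mean_without <- mean(x[x != max(x)])  # average with the largest value dropped

round(c(mean_with, mean_without), 2)
```

A single value out of 25 shifts the mean by about 0.36, which is why it is worth looking at such points before summarising the data.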

Plot

Another way to identify an outlier is to plot the data.

Histogram
Boxplot
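Both plots can be drawn with base R. This is a sketch only: the bin count and the labels are assumptions, so the output will not necessarily match the original figures. The last line uses boxplot.stats, which applies the usual 1.5 x box-length rule; that rule may also flag milder points near the edges of the data, so treat its output as candidates rather than a verdict.

```r
x <- c(1.97, 2.1,  0.9,  1.8,  2.2,
       1.4,  1.85, 1.31, 1.92, 1.8,
       1.54, 10.7, 1.33, 1.71, 2.4,
       1.62, 1.22, 1.7,  1.63, 1.6,
       1.79, 1.52, 1.83, 1.8,  1.69)

# Histogram: the outlier shows up as an isolated bar far to the right
hist(x, breaks = 20, main = "Histogram", xlab = "Value")

# Boxplot: points beyond the whiskers are drawn individually
boxplot(x, horizontal = TRUE, main = "Boxplot")

# Values the boxplot rule treats as outlier candidates
flagged <- boxplot.stats(x)$out
flagged
```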


The Magic Pill for India’s Population Explosion

We have discussed this in the past – the total fertility rate of Indian women slipped below the magic number of 2.1 a couple of years ago. It is no surprise to those who have followed the history; it is the continuation of a decline that started sometime around 1960 (look at the constant slope of decline until recently). So, are you saying that the overall population in India is declining? Well, I did not say that, and it will not happen for another 20-25 years due to the increase in life expectancy and the need to fill the gaps in the age funnel in the coming years.

Then, how do you interpret the recent proposal for a population control law in India? To those who are confused: it is not a law to encourage people to have more children, as the data-backed decision-maker in you may be thinking! It is about the opposite – more likely, a rule to restrict the number of children per woman to a fixed number (perhaps two).

Let’s consider the possible reasons behind such a move by the government.

Irrationality of mind

Start with our favourite logical fallacy, i.e., availability bias. Just picture a Muslim mother with seven children walking on the street. Doesn’t it fit the stereotype? A Pew Research report showed that this is far from the truth. The fertility rate of Indian Muslim women reached 2.6 in 2015 and has been declining faster than that of any other religious group! You may be wondering why I used the image of a Muslim woman. You will see why at the end. To those who want a more neutral example, how about this: the picture of a million people getting out at the landmark of Mumbai, the Chhatrapati Shivaji Terminus?

The Claim Instinct 

We have seen the fallacy of the much-celebrated one-child policy of China. The story is no different. If you missed clicking the earlier link, this is your second chance to click on the all-important plot at the Gapminder website. Almighty leaders like to leave such legacies; population control offers one occasion.

The realpolitik

It is typical for far-right politics to find an enemy in their territory and marginalise them based on their state of living. That is their tried and tested model of survival among their supporters. In India, it is the Muslims that fit the bill.

The solution

Enforcement of child control is not a solution to the population problem in a democratic modern society. If you think a community is lagging, bring them into the mainstream rather than alienating them further.

National Family Health Survey, India

Factfulness, Hans Rosling
