
Outliers

June 2, 2022

An outlier is an anomalous value in the dataset. Consider the following dataset.

1.97	2.1	0.9	1.8	2.2
1.4	1.85	1.31	1.92	1.8
1.54	10.7	1.33	1.71	2.4
1.62	1.22	1.7	1.63	1.6
1.79	1.52	1.83	1.8	1.69

Sort

Can you spot the outlier here? The easiest way is to sort the data in ascending order.

0.9	1.22	1.31	1.33	1.4
1.52	1.54	1.6	1.62	1.63
1.69	1.7	1.71	1.79	1.8
1.8	1.8	1.83	1.85	1.92
1.97	2.1	2.2	2.4	10.7

The value at the bottom right appears suspicious. The average of the set with the last value is 2.05, and that without is 1.69.
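The sorting and the two averages can be reproduced with a few lines of R. This is a minimal sketch; the variable names are only illustrative.

```r
# The 25 values from the table above.
x <- c(1.97, 2.1, 0.9, 1.8, 2.2,
       1.4, 1.85, 1.31, 1.92, 1.8,
       1.54, 10.7, 1.33, 1.71, 2.4,
       1.62, 1.22, 1.7, 1.63, 1.6,
       1.79, 1.52, 1.83, 1.8, 1.69)

sorted_x <- sort(x)                     # ascending order; the suspect lands last
mean_with <- mean(x)                    # average of all 25 values
mean_without <- mean(x[x != max(x)])    # average after dropping the largest value

round(c(with = mean_with, without = mean_without), 2)   # 2.05 and 1.69
```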

Plot

Another way to identify an outlier is to plot.
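One possible plot is a box plot. The sketch below assumes that choice and uses base R's boxplot together with boxplot.stats, which lists the points lying beyond the whiskers (more than 1.5 times the interquartile range from the hinges). Note that this rule can flag milder extremes along with the obvious one.

```r
# The 25 values from the table above.
x <- c(1.97, 2.1, 0.9, 1.8, 2.2,
       1.4, 1.85, 1.31, 1.92, 1.8,
       1.54, 10.7, 1.33, 1.71, 2.4,
       1.62, 1.22, 1.7, 1.63, 1.6,
       1.79, 1.52, 1.83, 1.8, 1.69)

# Draw the box plot; points beyond the whiskers appear as separate dots.
boxplot(x, horizontal = TRUE, main = "Box plot of the 25 values")

# List the values that the box-plot rule treats as outliers.
flagged <- boxplot.stats(x)$out
flagged
```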


The Magic Pill for India’s Population Explosion

June 1, 2022

We have discussed this in the past: the total fertility rate of Indian women slipped below the magic number of 2.1 a couple of years ago. It is not a surprise to those who followed history; it is the continuation of a decline that started sometime around 1960 (look at the constant slope of decline until recently). So, are you saying that the overall population in India is declining? Well, I did not say that, and it will not happen for another 20-25 years due to the increase in life expectancy and the need to fill the gaps in the age funnel in the coming years.

Then, how do you interpret the recent proposal for a population control law in India? To those who are confused: it is not a law to encourage people to have more children, as the data-backed decision-maker in you may be thinking! It is about the opposite: more likely, a rule to restrict the number of children per woman to a fixed number (perhaps two).

Let’s consider the possible reasons behind such a move by the government.

Irrationality of mind

Start with our favourite cognitive bias, i.e., availability bias. Just picture a Muslim mother with seven children walking on the street. Doesn't it fit the stereotype? A Pew Research report showed this is far from the truth. The fertility of Indian Muslim women reached 2.6 in 2015 and has been declining faster than that of any other religious group! You may be wondering why I used the image of a Muslim woman. You will see why at the end. To those who want a more neutral example, how about this: the picture of a million people getting out at the landmark of Mumbai, the Chhatrapati Shivaji Terminus?

The Claim Instinct

We have seen the fallacy of the much-celebrated one-child policy of China. The story is no different. If you missed clicking the earlier link, this is your second chance to click on the all-important plot at the Gapminder website. Almighty leaders like to leave such legacies; population control offers one occasion.

The realpolitik

It is typical for far-right politics to find an enemy in their territory and marginalise them based on their state of living. That is their tried and tested model of survival among their supporters. In India, it is the Muslims that fit the bill.

The solution

Enforcement of child control is not a solution to the population problem in a democratic modern society. If you think a community is lagging, bring them into the mainstream rather than alienating them further.

National Family Health Survey, India

Factfulness, Hans Rosling


One-Tailed Or Two-Tailed?

May 31, 2022

So how do you decide whether to choose one-tailed or two-tailed? It is not as straightforward as it may sound. Let’s look at the distributions that we have seen in the last post. So, first, the two-tailed test.

The shaded area represents the probability that a value will fall within the range. The smaller this probability, the more confident you can be in rejecting the default, i.e., the null hypothesis. In this case, I have calculated the sum of the two regions to be 0.05. I guess you know what it means? It represents the alpha (significance level) of 5%.

Mean salary of 30k

So, if the null hypothesis (H₀) was that the mean salary of engineers is exactly 30k, you can reject it if you find a sample mean of more than 35.9 or less than 24.1. Mathematically, it is:

$\\ H_0: \mu = 30 \\ \\ H_A: \mu \neq 30$

So far, so good.

What will you do if you decide to prove only the higher side of the claim? i.e., you want to establish that the salary is more than 30k. That will mean the following shaded area of the distribution.

$\\ H_0: \mu = 30 \\ \\ H_A: \mu > 30$

You can now see the problem: with a one-tailed test, you can claim you have achieved the same 5% alpha at a sample mean of about 35k, a lower bar than the 35.9k that the two-tailed test demands.

In case you are wondering where I got these 35.9, 24.1 and 35 from, try typing the following R code.

pnorm(34.95, 30, 3, lower.tail = FALSE) # gives an answer 0.049 for one-tailed. 

1 - pnorm(35.9, 30, 3) + pnorm(24.1, 30, 3) # gives 0.049 for two-tailed. 
#the above code is equivalent to 
pnorm(35.9, 30, 3, lower.tail = FALSE) + pnorm(24.1, 30, 3, lower.tail = TRUE)

pnorm is the cumulative distribution function of a normal distribution, here with mean = 30 and standard deviation = 3.
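You can also go the other way and ask R for the cut-offs directly. The sketch below uses qnorm, the inverse of pnorm, with the same mean of 30 and standard deviation of 3; the variable names are only illustrative.

```r
mu <- 30       # mean under the null hypothesis
sigma <- 3     # standard deviation used in the post
alpha <- 0.05  # significance level

# Two-tailed: split alpha equally between the lower and upper tails.
two_tailed <- qnorm(c(alpha / 2, 1 - alpha / 2), mean = mu, sd = sigma)

# One-tailed (upper): put the whole alpha in the right-hand tail.
one_tailed <- qnorm(1 - alpha, mean = mu, sd = sigma)

round(two_tailed, 1)  # 24.1 35.9
round(one_tailed, 1)  # 34.9
```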


One-Tailed and Two-Tailed test

May 30, 2022

We have seen the significance level alpha as the threshold probability for rejecting the null hypothesis. Let us illustrate it graphically. Consider this hypothesis: the average waiting time at the ticket counter is more than 30 minutes. The null and alternate hypotheses are:

H₀: average less than or equal to 30
H_A: average more than 30

To establish your theory, you need to show that the mean is greater than 30 (H_A) beyond reasonable doubt. In the following illustration, the right-hand tail is the region in which your measured waiting time needs to fall. The area of that tail is alpha, the probability of landing there if H₀ is valid.

Now consider this: the starting monthly salary of computer engineers is $30k. The alternate hypothesis is that this is not the case; the true number may be lower or higher (two extremes, or tails).

H₀: average = $30k
H_A: average not equal to $30k
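In R, the choice between the two kinds of test is made through the alternative argument of t.test. The sketch below uses a made-up sample of ten measurements (hypothetical data, not from this post) and tests it against 30 both ways.

```r
# Hypothetical sample of ten measurements; illustrative only.
measurements <- c(31, 35, 29, 38, 33, 36, 30, 34, 37, 32)

# One-tailed: H_A says the mean is greater than 30.
one_sided <- t.test(measurements, mu = 30, alternative = "greater")

# Two-tailed: H_A says the mean is not equal to 30.
two_sided <- t.test(measurements, mu = 30, alternative = "two.sided")

c(one_sided = one_sided$p.value, two_sided = two_sided$p.value)
```

Because the sample mean lies above 30, the one-tailed p-value is exactly half of the two-tailed one.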


Setting the Evidentiary Standard

May 29, 2022

We will continue with the basic terms of hypothesis testing. The first one is the significance level. Alpha, as it is popularly known, is set by the person in charge of the testing and signifies the strength of the evidence required to establish the tester’s proposition (the alternative hypothesis). We are familiar with an alpha of 0.05 (5%). But how does one choose the right level?

A famous analogy is the court case. For civil cases (which deal with personal rights), scholars put the bar at about 51% of the evidence supporting a claim. On the other hand, criminal cases may require far more, say, more than 90% of the evidence, for a verdict against the suspect. It may go up to 99% when the potential punishment for the guilty is severe. You may look back at an older post to see the significance of where the judges draw their lines.

In the same way, the analyst may decide on a stricter significance level, i.e., a lower probability of wrongly rejecting the null hypothesis, if the stakes are high. Putting it differently, an alpha of 0.01 means a 1% probability that the test will produce a statistically significant result when the null hypothesis is in fact correct.
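A quick simulation can make this reading of alpha concrete. The sketch below is only an illustration under assumed numbers (a population mean of 30, a known standard deviation of 3, and samples of 25): it draws many samples for which the null hypothesis is true and counts how often a two-tailed z-test still comes out significant at alpha = 0.01.

```r
set.seed(1)       # for reproducibility
alpha <- 0.01     # significance level
n_sims <- 10000   # number of simulated studies (assumed)
n <- 25           # sample size per study (assumed)

# Every row is one sample drawn from a population where H0 (mean = 30) is true.
samples <- matrix(rnorm(n_sims * n, mean = 30, sd = 3), nrow = n_sims)

# Two-tailed z-test of each sample mean against 30, with known sd = 3.
z <- (rowMeans(samples) - 30) / (3 / sqrt(n))
p_values <- 2 * pnorm(-abs(z))

# Share of studies that reject a true null hypothesis; should be close to alpha.
false_positive_rate <- mean(p_values < alpha)
false_positive_rate
```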

Reference

Jim Frost, “Hypothesis Testing: An Intuitive Guide for Making Data Driven Decisions”


Hypothesis Testing

May 28, 2022

We have done it several times in the past. The objective of hypothesis testing is to assess, using sample data, two mutually exclusive theories about the properties of a population. Please see my earlier post for the definitions of sample and population. The two theories are the null hypothesis and the alternative hypothesis.

The null hypothesis (H₀) typically represents the default state or the state of “no effect“. For example, you compare the means of two groups, such as people who took a particular drug and people who received the placebo. As a drug researcher, your objective is to find the effectiveness of the medicine. And that lays the foundation for your alternative hypothesis (H_A or H₁) – that the drug has a non-zero effect. The default state (H₀) assumes the drug has no impact. To be specific, H₀ assumes the difference between two means equals zero.

H₁ states that the population parameter value does not equal the H₀ value. Notice the words population and parameter. The ambition of the test is to make statements about the whole space and not just about the sample itself. And if the sample contains sufficient evidence (we will see later what counts as sufficient), you will reject the null hypothesis in favour of the alternative.
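As an illustration of the drug-versus-placebo comparison, here is a minimal sketch using simulated data. The group sizes, means and standard deviation are all invented for the example, and Welch's two-sample t-test (the default of t.test) is just one reasonable choice of test.

```r
set.seed(42)  # for reproducibility

# Simulated outcomes (hypothetical): the drug group is given a true effect of 3 units.
placebo <- rnorm(50, mean = 0, sd = 2)
drug <- rnorm(50, mean = 3, sd = 2)

# H0: the difference between the two population means is zero.
# H1: the difference is not zero (two-sided).
result <- t.test(drug, placebo)

result$estimate   # the two sample means
result$p.value    # a small value counts as evidence against H0
```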


Examples of Exponential Distribution

May 27, 2022

Let us work out some problems using exponential distributions. In reliability theory, it is common to assume that the lifespans of machines and components are random variables. If the failures arrive as Poisson events, the time between failures is exponentially distributed.

The time to failure of a tool follows an exponential distribution with the mean time between failures (MTBF) of 500 days. Calculate the probability this tool will fail before 500 days.

$P(X < t) = F(t) = 1 - e^{-\lambda t}$

Since the mean time (to failure) is 500, the parameter lambda is 1/500. Substituting for lambda and time, the probability becomes 1 - exp(-(1/500) * 500) = 1 - exp(-1) = 0.63 or 63%. The following R code also gives the same result.

pexp(500, 1/500)

What is the probability that the tool will not fail for 1000 days?

$P(X > 1000) = 1 - F(1000) = e^{-(1/500) * 1000} = e^{-2}$

The answer is 0.1353 or 13.53%
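Both probabilities can be checked in R with pexp. In this sketch, lower.tail = FALSE asks for the survival probability P(X > t) directly; the variable names are only illustrative.

```r
mtbf <- 500          # mean time between failures, in days
lambda <- 1 / mtbf   # rate parameter of the exponential distribution

# Probability that the tool fails before 500 days: F(500) = 1 - exp(-1).
p_fail_before_500 <- pexp(500, rate = lambda)

# Probability that the tool survives beyond 1000 days: 1 - F(1000) = exp(-2).
p_survive_1000 <- pexp(1000, rate = lambda, lower.tail = FALSE)

round(c(p_fail_before_500, p_survive_1000), 4)  # 0.6321 0.1353
```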


Memoryless Distribution

May 26, 2022

The exponential distribution describes the time between two Poisson events. You may consider the exponential distribution as the inverse of the Poisson distribution. If the Poisson distribution gives the number of events per time duration, the exponential gives the time per event. Since Poisson events are independent of each other, it should not be difficult to accept that the exponential distribution is called memoryless.

The following two plots may explain the inverse relationship. First is the Poisson PMF for events with parameter lambda = 5.

The next plot is the corresponding exponential distribution for the same lambda (5).
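Plots of this kind can be drawn with base R. The sketch below is one way to do it; the ranges on the two axes are my own choice and not necessarily those of the original figures.

```r
lambda <- 5  # events per unit time

# Poisson PMF: probability of seeing k events in one unit of time.
k <- 0:15
poisson_pmf <- dpois(k, lambda)
barplot(poisson_pmf, names.arg = k,
        xlab = "Number of events", ylab = "Probability",
        main = "Poisson PMF, lambda = 5")

# Exponential PDF: density of the waiting time between two events.
wait <- seq(0, 2, by = 0.01)
exp_pdf <- dexp(wait, rate = lambda)
plot(wait, exp_pdf, type = "l",
     xlab = "Time between events", ylab = "Density",
     main = "Exponential PDF, lambda = 5")
```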

Memorylessness

We will look at the formal derivation of this memorylessness. First, what does memoryless mean? It means that the past segment of action has no impact on the subsequent segment. For example, the time required for a person to roll a “one” on a die (to enter a snakes and ladders game) doesn’t depend on the previous 5 minutes that she has waited. Mathematically it means:

$P(X > t + s | X > t) = P(X > s) \text { ; } t \text{ is forgotten!}$

Let’s apply the Conjunction Rule,

$\\ P(A \cap B) = P(A) * P(B | A) \\ \\ P(B | A) = \frac{P(A \cap B)}{P(A)} \\ \\ \therefore P(X > t + s | X > t) = \frac{P(X > t + s \text{ } \cap \text{ } X > t) }{P(X > t)} \\ \\ \text{If } X > s + t \text{ , then } X > t \text{ is redundant} \\ \\ P(X > t + s | X > t) = \frac{P(X > t + s) }{P(X > t)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda(t)}} = e^{-\lambda s} \\ \\ \text{This is the definition of } P(X > s) \text{!}$
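The result is easy to check numerically. The sketch below picks arbitrary values for lambda, t and s (my own choices) and compares the two sides of the identity using pexp with lower.tail = FALSE, which returns P(X > x).

```r
lambda <- 5   # rate parameter (arbitrary choice)
t <- 0.3      # time already waited (arbitrary choice)
s <- 0.2      # additional waiting time (arbitrary choice)

# Left-hand side: P(X > t + s | X > t) = P(X > t + s) / P(X > t).
lhs <- pexp(t + s, rate = lambda, lower.tail = FALSE) /
  pexp(t, rate = lambda, lower.tail = FALSE)

# Right-hand side: P(X > s).
rhs <- pexp(s, rate = lambda, lower.tail = FALSE)

all.equal(lhs, rhs)  # TRUE: the time t already waited is forgotten
```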


Irrational Faith in Guns

May 25, 2022

If having a gun increases the risk of gun-related violent death in the home, why do people choose to own guns?
^{Pierre, J. M., “The psychology of guns: risk, fear, and motivated reasoning”, Palgrave Communications, 5, 2019}

My thoughts are with the children and teachers of the Robb Elementary School in Texas, and their family members.

I will start with my viewpoint on this debate of whether guns kill people vs people kill people – people attack, and they use readily available weapons to cause harm to “the other“. The more lethal the tool used, the deadlier the injury, with death as the endpoint. In other words, if stones are accessible, the outraged may throw and hurt the other, and if guns are accessible, they may kill a few; in a more barbarian society, replace guns with bombs! Only the scale changes. It is as simple as that.

There are statistics, and so are beliefs

In one of the previous posts, we discovered that suicides dominate gun-related deaths. Study after study reports that homicides are largely committed by family members and acquaintances, not strangers.

Yet society, the US in this context, supports and takes great pride in the possession of guns! The proponents of guns have several reasons (excuses) to support their position, ranging from individual freedom (we have seen it in Covid-19 mask mandates!) to what some studies call the knowledge deficit model.

But one theory that became the most prominent among them points to the aspect of human decision making – i.e., irrationality, controlled by cognitive biases (cherry-picking, motivated reasoning, availability heuristics, status quo bias). As per Metzl, this behaviour stems from the notion of the cultural heritage of gun owners. And it does not come as a surprise that other social cancers (a.k.a. resistance to progress) – religiosity, racism, sexism, nationalism – too originate from similar backgrounds.

The Second Amendment

The story goes back to the second amendment of the US constitution that states, “A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.”

First, you need to remember that this followed centuries-old practices of England (the English Bill of Rights of 1689), which were embraced by the US constitution when the amendment was ratified in 1791. And the reason to carry this baggage of the past? The answer is complicated.

Social scientists have been approaching this American love of guns through the lenses of gender, masculinity and race. On top of these, there are the thriving forces of fear of “bad guys, thugs and carjackers“, amply fostered by the ever-powerful National Rifle Association (NRA).

Uncertain future

A solution based on rational, data-based arguments is unlikely to reap any rewards against motivated reasoning. The issue is deep-rooted in American society as a national identity, symbol of resistance, and a collective history of race, gender and socioeconomic status. And as always, such diseases require long term care to heal.

Exponential Distribution

May 24, 2022

The exponential distribution is a continuous distribution that runs on a single parameter, lambda. The probability density function (PDF) of this distribution is given by

$f(x) = \lambda e^{-\lambda x}, \text{ } \lambda > 0$

What does the PDF look like? We use the R function dexp for that. The following is the PDF of an exponential distribution with parameter lambda = 0.01.
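A plot like this can be produced with a few lines of base R. This is a minimal sketch; the range of x is my own choice.

```r
lambda <- 0.01               # rate parameter
x <- seq(0, 500, by = 1)     # values at which to evaluate the density

# dexp returns the exponential density lambda * exp(-lambda * x).
pdf_values <- dexp(x, rate = lambda)

plot(x, pdf_values, type = "l",
     xlab = "x", ylab = "f(x)",
     main = "Exponential PDF, lambda = 0.01")
```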

The cumulative distribution function (CDF) of the exponential distribution is given by

$F(x) = 1- e^{-\lambda x}$

Mean and variance of exponential distribution

The expected value (mean) and variance of the distribution are given by

$\\ E(X) = \frac{1}{\lambda} \\ \\ var(X) = \frac{1}{\lambda^2}$
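These two formulas can be checked roughly by simulation. The sketch below draws random values with rexp for lambda = 0.01 and compares the sample mean and variance with 1/lambda = 100 and 1/lambda^2 = 10000; the sample size and the seed are arbitrary choices.

```r
set.seed(123)     # for reproducibility
lambda <- 0.01    # rate parameter
n <- 100000       # number of simulated values (arbitrary choice)

draws <- rexp(n, rate = lambda)

sample_mean <- mean(draws)   # should be close to 1 / lambda = 100
sample_var <- var(draws)     # should be close to 1 / lambda^2 = 10000

c(mean = sample_mean, variance = sample_var)
```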
