Picking Candies

A jar contains 60 candies, 10 reds, 20 blues and 30 yellows. If one takes out candies one by one, what is the probability that there is at least one yellow and one blue left after all the red candies have been taken out?

The solution is a combination of two mutually exclusive probabilities.

1) Probability that yellow is the 60th candy and blue is the last candy among the bunch of 10 reds and 20 blues.
OR
2) Probability that blue is the 60th candy and yellow is the last candy among the bunch of 10 reds and 30 yellows.

1a. The Probability that one of the 30 yellows is the last candy among 60 candies = 30/60
1b. The Probability that one of the 20 blues is the last candy among 30 candies = 20/30
2a. The Probability that one of the 20 blues is the last candy among 60 candies = 20/60
2b. The Probability that one of the 30 yellows is the last candy among 40 candies = 30/40

The first probability is a joint probability (‘AND’ rule) of 1a and 1b
The second probability is a joint probability of 2a and 2b
The final probability is the sum (‘OR’ rule) of the two.

(30/60) x (20/30) + (20/60) x (30/40) = 0.58

Picking Candies Read More »

Infinite Banana

A monkey reached a banana farm. As a rational monkey, it wants to let a dice decide the number of bananas to eat. The rules of the game are:

If the die finds 1 to 5, it eats that many bananas
If the die gets 6, it eats 5, tosses again, and the game continues.

What is the expected number of bananas that the monkey eats?

The exepcted value is: (1/6) x 1 + (1/6) x 2 + (1/6) x 3 + (1/6) x 4 + (1/6) x 5 + (1/6) x [5 + (1/6) x 1 + (1/6) x 2 + …]
= [20/6] + (1/6)[20/6] + (1/62)[20/6] + …
= [20/6] x [1 + 1/6 + 1/62 + 1/63 + …]
= [20/6] x [1/(1-1/6)]
= [20/6] x [6/5] = 20/5 = 4

Note we used the relationship for the infinite series 1 + x + x2 + x3 + … = 1/(1-x).

Infinite Banana Read More »

Turn of the knob

Came across one of the finest videos on YouTube about our past, present and future life, Yuval Noah Harari’s talk to youngsters and teachers, which triggered the idea of this post.

Turning the knob

Knowledge is like turning the knob. When it turns, you see things in a new light; until then, no matter how hard you try, you don’t get it out of ‘common knowledge’. Unfortunately, the common knowledge is almost always wrong!

The hyperpigmented on the equator

Take the favourite example of pigmentation of humans living in the equatorial region. For a moment, let’s ignore the people who believe that people of colour are of a separate species. We are dealing with more reasonable people here. If the narrative is that people in sunny regions have become dark-skinned because of heat and light, it’s an easier narrative to sell. It fits with the common knowledge – we all know what happens when we fry things; a little too much and it turns black.

Unfortunately, that’s not how things evolve. The theory of evolution switch needs to turn on. What about this: a group of people (perhaps dominated by the light-skinned) reach a sunny region. A few of them got skin cancer due to their lack of protective pigmentation and died maybe a few years earlier than their accidentally darker companions. That raised (by a small margin) the probability of darker parents, their children and their children having the advantage, and wow, after 10,000 years, there was a complete dominance of the dark. So, will that happen in Australia after 10,000 years? We’ll answer that in the end.

Humans of Flores

There used to be a pack of humans living in Flores, an island in Indonesia (until they were extinct about 50,000 years ago). They were humans as they shared the homo family. They were different humans because we are homo sapiens, and they were not. They were pretty short – about 1 m. tall – people. Not just them but the animals of that island as well. A simple convincing argument is that the animals got trapped on the island, became resource-constrained, and to survive, they had to consume less food. And they became smaller. It’s convincing because 1) it gives a feeling that one bunch of people after starvation has shrunk, or 2) they passed a genetic code to the children and made them shrink.

Turn the switch, and you get it: big humans reached the island. Once they got disconnected from the mainland due to sea level rise, the larger ones faced a more significant disadvantage due to food shortage, and the smaller ones survived better. In the next generation, there were disproportionally smaller kids from the surviving parents (the new group has larger ones too). Turn a few pages, centuries and generations: the island is full of smaller humans. This narrative is difficult to fathom without the switch as it is against the common knowledge. First, how can more miniature humans be fitter? That doesn’t conform very well with the stereotypes! Second, something forcing people (in one lifetime) to become smaller is easier to imagine than this chance game of smaller ones surviving (in a hundred lifetimes).

The future evolutions

That naturally begs the question. Will the Australians (the white Australians) turn back after 10,000 years? Even the broader question: What will be the next evolution of humans? The answer to the first question is a no, and the answer to the second question is impossible to predict.

The code lies in the knowledge paradox we are in. Australian whites won’t turn black because they know why it happens and what to do against death from skin cancer. It could be as simple as using sunscreen (or deciding not to venture out in the UV-intense part of the day). And this will translate to other things as well. If we know something gives us a disadvantage, we will engineer means to counter it. It has to be a disadvantage that gave the survivors the chance to survive, and we are closing those weaknesses!

Must watch video

Yuval Noah Harari Speaks to Young Readers & Teachers: Yuval Noah Harari

Turn of the knob Read More »

Logistic Regression of The Heart Failure Data

Let’s work out a few more matrices to continue with the heart data. First, let’s recall the data using the str() command.

str(h_data)
'data.frame':	299 obs. of  13 variables:
 $ Age      : num  75 55 65 50 65 90 75 60 65 80 ...
 $ Anaemia  : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
 $ Cr_Ph    : int  582 7861 146 111 160 47 246 315 157 123 ...
 $ Diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
 $ Ej_fr    : int  20 38 20 20 20 40 15 60 65 35 ...
 $ BP       : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
 $ Platelets: num  26.5 26.3 16.2 21 32.7 ...
 $ Ser_Cr   : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ Ser_Na   : int  130 136 129 137 116 132 137 131 138 133 ...
 $ Sex      : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
 $ Smoking  : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
 $ Time     : int  4 6 7 7 8 8 10 10 10 10 ...
 $ Death    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Logistic Regression

mod2 <- glm(Death ~ ., data = h_data, family = 'binomial')
summary(mod2)

Call:
glm(formula = Death ~ ., family = "binomial", data = h_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1848  -0.5706  -0.2401   0.4466   2.6668  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) 10.1849290  5.6565703   1.801 0.071774 .  
Age          0.0474191  0.0158006   3.001 0.002690 ** 
Anaemia1    -0.0074705  0.3604891  -0.021 0.983467    
Cr_Ph        0.0002222  0.0001779   1.249 0.211684    
Diabetes1    0.1451498  0.3511886   0.413 0.679380    
Ej_fr       -0.0766625  0.0163291  -4.695 2.67e-06 ***
BP1         -0.1026794  0.3587069  -0.286 0.774688    
Platelets   -0.0119962  0.0188906  -0.635 0.525404    
Ser_Cr       0.6660933  0.1814926   3.670 0.000242 ***
Ser_Na      -0.0669811  0.0397351  -1.686 0.091855 .  
Sex1        -0.5336580  0.4139180  -1.289 0.197299    
Smoking1    -0.0134922  0.4126178  -0.033 0.973915    
Time        -0.0210446  0.0030144  -6.981 2.92e-12 ***
---

Observe the p-values (Pr(>|z|)) for the regression coefficients, and we find that only ‘Age’ and ‘Ser_Cr’ have significant contributions to the response variable, ”Death. Therefore, we can already do a good job by fitting only those two variables.

Logistic Regression of The Heart Failure Data Read More »

LOC of Heart Failure Data

Here is a popular dataset taken from Kaggle on patient data on heart failures.

'data.frame':	299 obs. of  13 variables:
 $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
 $ anaemia                 : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
 $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
 $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
 $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
 $ high_blood_pressure     : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
 $ platelets               : num  26.5 26.3 16.2 21 32.7 ...
 $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
 $ sex                     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
 $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
 $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
 $ DEATH_EVENT             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

LOC of Heart Failure Data Read More »

Lorenz Curve and Gini Coefficient

The plot of cumulative household income vs cumulative households is called a Lorenz curve. Let’s build a hypothetical one using R.

Step 1: the households are organised according to their incomes, from the lowest to the highest.

income <- c(100, 200, 400, 500, 800)
household <- c(85, 65, 40, 25, 10) 

Step 2: The cumulative proportions of incomes and households are calculated.

cum_income <- cumsum(income)*100/sum(income)
cum_household <- cumsum(household)*100/sum(household)

Step 3: Plot the cumulative proportion of households against the cumulative proportion of income to get the Lorenz curve.

If every household has the same income, the Lorenz curve passes through the diagonal. And it is called the perfect equality line.

On the other extreme, if one household has all the income, you get the perfect inequality line.

Gini Coefficient

Let the area between the perfect equality line and the Lorenz curve be A, and the area under the Lorenz curve be B.
Gini coefficient = Area A / (Area A + Area B)

Case 1: If the Lorenz curve coincides with the perfect equality line
Gini coefficient = 0 / (0 + Area B) = 0; perfect equality
Case 2: If the Lorenz curve coincides with the perfect inequality line
Gini coefficient = Area A / (Area A + 0) = 1; perfect inequality

Lorenz Curve and Gini Coefficient Read More »

ROC and AUC

We’ll demonstrate the concept of ROC (Receiver Operating Characteristics) and AUC (Area Under Curve) with the help of (simulated) weight data and using R codes. Here are the first ten rows of the data.

weight       obese
86.48505	0			
88.04764	0			
111.50064	0			
112.69730	0			
121.53974	0			
122.34533	0			
126.53330	0			
129.34565	0			
129.46268	1			
130.17398	1

Here ‘obese’ is the outcome variable that takes one of the two values, 0 (not obese) or 1 (obese). The ‘weight’ is the independent variable, also known as the predictor.

Now, we’ll do logistic regression of the data using the generalised linear model (‘glm’), store the output in a variable and plot.

plot(weight, obese, col = "blue", cex  = 1.5,  cex.axis = 1.5, cex.lab = 1.6)
glm.fit <- glm(obese ~ weight, family = "binomial")
lines(weight, glm.fit$fitted.values, lwd = 3)

Estimation of ROC and AUM requires the package, ‘pROC’.

par(bg = "antiquewhite1", pty = "s")
roc(obese, glm.fit$fitted.values, plot = TRUE, legacy.axes = TRUE, col = "brown", lwd = 3, print.auc = TRUE, auc.polygon = TRUE)

We used the following options to get the final plot.
par(pty = “s”); for plotting the graph as a square
plot = TRUE; for plotting the graph
legacy.axes = TRUE; for plotting 1- specificity on the x-axis instead of the default specificity
print.auc = TRUE; to print the value of AUC on the graph
auc.polygon = TRUE; to present AUC as a shaded area.

ROC and AUC Read More »

Logistic Regression: Receiver Operating Characteristics

Receiver Operating Characteristics (ROC) curve is used to estimate the worth of a logistic regression model. It is a plot between the true positive rate on the Y – axis and the false positive rate on the X – axis.

The ROC curve is a representation of the performance of a binary classification model. The diagonal line represents the case of not using a model. The area under the curve (AUC) of the ROC is a significant parameter. A model with a higher area under the curve (AUC) is preferred.

Logistic Regression: Receiver Operating Characteristics Read More »

Logistic Regression and Accuracy Paradox

The next step is to assign a cut-off probability value for decision-making. Based on that, the decision-maker will classify the observation as either class 1 or 0. If Pc is that cut-off value, the following condition occurs.

Y_i =     \begin{cases}       0, & \text{if } P(Y_i = 1) < P_c \\       1, & \text{if } P(Y_i = 1) \ge P_c     \end{cases}

The following classification table is obtained if the cut-off value is 0.5.

\\ ln(\frac{P(Y = 1)}{1 - P(Y = 1)}) = \beta_0 + \beta_1 X_i \text{ OR} \\ \\ P(Y = 1) = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}} \\ \\ P(Y = 1) = \frac{e^{15.2968142 - 0.2360207 X_i}}{1 + e^{15.2968142 - 0.2360207 X_i}}

Launch
T (F)
Actual
Damage
Prediction
(Pc = 0.5)
6600
7010
6900
8000
6800
6700
7200
7300
7000
5711
6311
7010
7800
6700
5311
6700
7500
7000
8100
7600
7900
7510
7600
5811

The accuracy of the prediction may be obtained by counting the actual and predicted values or by using the function, ‘confusionMatrix’ of the ‘caret’ library.

True Positive (TP) = Damage = 1 & Prediction = 1
False Negative (FN) = Damage = 1 & Prediction = 0
False Positive (FP) = Damage = 0 & Prediction = 1
True Negative (TN) = Damage = 0 & Prediction = 0

Sensitivity = TP / (TP + FN) =  0.57 (57%)
Specificity  = TN / (TN + FP) = 1 (100%)
overall accuracy = (TP + TN) / (TP + TN + FN + FP) = 0.875

Although the overall accuracy is at an impressive 87.5%, the true positive rate or the failure estimation rate is pretty average (57%). Considering the significance of the decision, one way to deal with it is to increase the sensitivity by reducing the cut-off probability to 0.2. That leads to the following.

Sensitivity = TP / (TP + FN) =  0.857 (85.7%)
Specificity  = TN / (TN + FP) = 0.53 (53%)
overall accuracy = (TP + TN) / (TP + TN + FN + FP) = 0.625

Accuracy Paradox

As you can see, the sensitivity of the second case, the lower cut-off value, is higher, but the overall accuracy of the prediction is poorer. And this is a key step of decision making – choosing higher accuracy of predicting failures (positive values) over the overall classification accuracy.

Logistic Regression and Accuracy Paradox Read More »

Logistic Regression and Challenger

Remember the post on the O-ring failure of the Challenger disaster?

Here, we do a step-by-step logistic regression of the data. The following table shows the O-ring damages and launch temperatures of 24 previous shuttle launches. Damage = 1 represents failure, and 0 denotes no failure.

Flight
#
Launch
T (F)
O-ring
Damage
STS 1660
STS 2701
STS 3690
STS 4800
STS 5680
STS 6670
STS 7720
STS 8730
STS 9700
STS 41B571
STS 41C631
STS 41D701
STS 41G780
STS 51A670
STS 51C531
STS 51D670
STS 51B750
STS 51G700
STS 51F810
STS 51I760
STS 51J790
STS 61A751
STS 61B760
STS 61C581

The logit function is:

ln(\frac{P(Y = 1)}{1 - P(Y = 1)}) = Z = \beta_0 + \beta_1 X_i

The beta values (the intercept and slope) are obtained by the ‘generalised linear model’, glm function in R.

logic <- glm(Damange ~ ., data = challenge, family = "binomial")
summary(logic)
Call:
glm(formula = Damange ~ ., family = "binomial", data = challenge)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0608  -0.7372  -0.3712   0.3948   2.2321  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  15.2968     7.3286   2.087   0.0369 *
Temp         -0.2360     0.1074  -2.198   0.0279 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28.975  on 23  degrees of freedom
Residual deviance: 20.371  on 22  degrees of freedom
AIC: 24.371

Number of Fisher Scoring iterations: 5

The following plot is obtained by substituting the values 15.2968142 and -0.2360207 in the function.

Having developed the probabilities, now it’s time for decision-making. That is next.

Logistic Regression and Challenger Read More »