Data & Statistics

Infinite Banana

A monkey reached a banana farm. As a rational monkey, it wants to let a dice decide the number of bananas to eat. The rules of the game are:

If the die finds 1 to 5, it eats that many bananas
If the die gets 6, it eats 5, tosses again, and the game continues.

What is the expected number of bananas that the monkey eats?

The exepcted value is: (1/6) x 1 + (1/6) x 2 + (1/6) x 3 + (1/6) x 4 + (1/6) x 5 + (1/6) x [5 + (1/6) x 1 + (1/6) x 2 + …]
= [20/6] + (1/6)[20/6] + (1/62)[20/6] + …
= [20/6] x [1 + 1/6 + 1/62 + 1/63 + …]
= [20/6] x [1/(1-1/6)]
= [20/6] x [6/5] = 20/5 = 4

Note we used the relationship for the infinite series 1 + x + x2 + x3 + … = 1/(1-x).

Infinite Banana Read More »

Logistic Regression of The Heart Failure Data

Let’s work out a few more matrices to continue with the heart data. First, let’s recall the data using the str() command.

str(h_data)
'data.frame':	299 obs. of  13 variables:
 $ Age      : num  75 55 65 50 65 90 75 60 65 80 ...
 $ Anaemia  : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
 $ Cr_Ph    : int  582 7861 146 111 160 47 246 315 157 123 ...
 $ Diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
 $ Ej_fr    : int  20 38 20 20 20 40 15 60 65 35 ...
 $ BP       : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
 $ Platelets: num  26.5 26.3 16.2 21 32.7 ...
 $ Ser_Cr   : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ Ser_Na   : int  130 136 129 137 116 132 137 131 138 133 ...
 $ Sex      : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
 $ Smoking  : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
 $ Time     : int  4 6 7 7 8 8 10 10 10 10 ...
 $ Death    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Logistic Regression

mod2 <- glm(Death ~ ., data = h_data, family = 'binomial')
summary(mod2)

Call:
glm(formula = Death ~ ., family = "binomial", data = h_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1848  -0.5706  -0.2401   0.4466   2.6668  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) 10.1849290  5.6565703   1.801 0.071774 .  
Age          0.0474191  0.0158006   3.001 0.002690 ** 
Anaemia1    -0.0074705  0.3604891  -0.021 0.983467    
Cr_Ph        0.0002222  0.0001779   1.249 0.211684    
Diabetes1    0.1451498  0.3511886   0.413 0.679380    
Ej_fr       -0.0766625  0.0163291  -4.695 2.67e-06 ***
BP1         -0.1026794  0.3587069  -0.286 0.774688    
Platelets   -0.0119962  0.0188906  -0.635 0.525404    
Ser_Cr       0.6660933  0.1814926   3.670 0.000242 ***
Ser_Na      -0.0669811  0.0397351  -1.686 0.091855 .  
Sex1        -0.5336580  0.4139180  -1.289 0.197299    
Smoking1    -0.0134922  0.4126178  -0.033 0.973915    
Time        -0.0210446  0.0030144  -6.981 2.92e-12 ***
---

Observe the p-values (Pr(>|z|)) for the regression coefficients, and we find that only ‘Age’ and ‘Ser_Cr’ have significant contributions to the response variable, ”Death. Therefore, we can already do a good job by fitting only those two variables.

Logistic Regression of The Heart Failure Data Read More »

LOC of Heart Failure Data

Here is a popular dataset taken from Kaggle on patient data on heart failures.

'data.frame':	299 obs. of  13 variables:
 $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
 $ anaemia                 : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
 $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
 $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
 $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
 $ high_blood_pressure     : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
 $ platelets               : num  26.5 26.3 16.2 21 32.7 ...
 $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
 $ sex                     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
 $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
 $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
 $ DEATH_EVENT             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

LOC of Heart Failure Data Read More »

Lorenz Curve and Gini Coefficient

The plot of cumulative household income vs cumulative households is called a Lorenz curve. Let’s build a hypothetical one using R.

Step 1: the households are organised according to their incomes, from the lowest to the highest.

income <- c(100, 200, 400, 500, 800)
household <- c(85, 65, 40, 25, 10) 

Step 2: The cumulative proportions of incomes and households are calculated.

cum_income <- cumsum(income)*100/sum(income)
cum_household <- cumsum(household)*100/sum(household)

Step 3: Plot the cumulative proportion of households against the cumulative proportion of income to get the Lorenz curve.

If every household has the same income, the Lorenz curve passes through the diagonal. And it is called the perfect equality line.

On the other extreme, if one household has all the income, you get the perfect inequality line.

Gini Coefficient

Let the area between the perfect equality line and the Lorenz curve be A, and the area under the Lorenz curve be B.
Gini coefficient = Area A / (Area A + Area B)

Case 1: If the Lorenz curve coincides with the perfect equality line
Gini coefficient = 0 / (0 + Area B) = 0; perfect equality
Case 2: If the Lorenz curve coincides with the perfect inequality line
Gini coefficient = Area A / (Area A + 0) = 1; perfect inequality

Lorenz Curve and Gini Coefficient Read More »

ROC and AUC

We’ll demonstrate the concept of ROC (Receiver Operating Characteristics) and AUC (Area Under Curve) with the help of (simulated) weight data and using R codes. Here are the first ten rows of the data.

weight       obese
86.48505	0			
88.04764	0			
111.50064	0			
112.69730	0			
121.53974	0			
122.34533	0			
126.53330	0			
129.34565	0			
129.46268	1			
130.17398	1

Here ‘obese’ is the outcome variable that takes one of the two values, 0 (not obese) or 1 (obese). The ‘weight’ is the independent variable, also known as the predictor.

Now, we’ll do logistic regression of the data using the generalised linear model (‘glm’), store the output in a variable and plot.

plot(weight, obese, col = "blue", cex  = 1.5,  cex.axis = 1.5, cex.lab = 1.6)
glm.fit <- glm(obese ~ weight, family = "binomial")
lines(weight, glm.fit$fitted.values, lwd = 3)

Estimation of ROC and AUM requires the package, ‘pROC’.

par(bg = "antiquewhite1", pty = "s")
roc(obese, glm.fit$fitted.values, plot = TRUE, legacy.axes = TRUE, col = "brown", lwd = 3, print.auc = TRUE, auc.polygon = TRUE)

We used the following options to get the final plot.
par(pty = “s”); for plotting the graph as a square
plot = TRUE; for plotting the graph
legacy.axes = TRUE; for plotting 1- specificity on the x-axis instead of the default specificity
print.auc = TRUE; to print the value of AUC on the graph
auc.polygon = TRUE; to present AUC as a shaded area.

ROC and AUC Read More »

Logistic Regression: Receiver Operating Characteristics

Receiver Operating Characteristics (ROC) curve is used to estimate the worth of a logistic regression model. It is a plot between the true positive rate on the Y – axis and the false positive rate on the X – axis.

The ROC curve is a representation of the performance of a binary classification model. The diagonal line represents the case of not using a model. The area under the curve (AUC) of the ROC is a significant parameter. A model with a higher area under the curve (AUC) is preferred.

Logistic Regression: Receiver Operating Characteristics Read More »

Logistic Regression and Accuracy Paradox

The next step is to assign a cut-off probability value for decision-making. Based on that, the decision-maker will classify the observation as either class 1 or 0. If Pc is that cut-off value, the following condition occurs.

Y_i =     \begin{cases}       0, & \text{if } P(Y_i = 1) < P_c \\       1, & \text{if } P(Y_i = 1) \ge P_c     \end{cases}

The following classification table is obtained if the cut-off value is 0.5.

\\ ln(\frac{P(Y = 1)}{1 - P(Y = 1)}) = \beta_0 + \beta_1 X_i \text{ OR} \\ \\ P(Y = 1) = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}} \\ \\ P(Y = 1) = \frac{e^{15.2968142 - 0.2360207 X_i}}{1 + e^{15.2968142 - 0.2360207 X_i}}

Launch
T (F)
Actual
Damage
Prediction
(Pc = 0.5)
6600
7010
6900
8000
6800
6700
7200
7300
7000
5711
6311
7010
7800
6700
5311
6700
7500
7000
8100
7600
7900
7510
7600
5811

The accuracy of the prediction may be obtained by counting the actual and predicted values or by using the function, ‘confusionMatrix’ of the ‘caret’ library.

True Positive (TP) = Damage = 1 & Prediction = 1
False Negative (FN) = Damage = 1 & Prediction = 0
False Positive (FP) = Damage = 0 & Prediction = 1
True Negative (TN) = Damage = 0 & Prediction = 0

Sensitivity = TP / (TP + FN) =  0.57 (57%)
Specificity  = TN / (TN + FP) = 1 (100%)
overall accuracy = (TP + TN) / (TP + TN + FN + FP) = 0.875

Although the overall accuracy is at an impressive 87.5%, the true positive rate or the failure estimation rate is pretty average (57%). Considering the significance of the decision, one way to deal with it is to increase the sensitivity by reducing the cut-off probability to 0.2. That leads to the following.

Sensitivity = TP / (TP + FN) =  0.857 (85.7%)
Specificity  = TN / (TN + FP) = 0.53 (53%)
overall accuracy = (TP + TN) / (TP + TN + FN + FP) = 0.625

Accuracy Paradox

As you can see, the sensitivity of the second case, the lower cut-off value, is higher, but the overall accuracy of the prediction is poorer. And this is a key step of decision making – choosing higher accuracy of predicting failures (positive values) over the overall classification accuracy.

Logistic Regression and Accuracy Paradox Read More »

Logistic Regression and Challenger

Remember the post on the O-ring failure of the Challenger disaster?

Here, we do a step-by-step logistic regression of the data. The following table shows the O-ring damages and launch temperatures of 24 previous shuttle launches. Damage = 1 represents failure, and 0 denotes no failure.

Flight
#
Launch
T (F)
O-ring
Damage
STS 1660
STS 2701
STS 3690
STS 4800
STS 5680
STS 6670
STS 7720
STS 8730
STS 9700
STS 41B571
STS 41C631
STS 41D701
STS 41G780
STS 51A670
STS 51C531
STS 51D670
STS 51B750
STS 51G700
STS 51F810
STS 51I760
STS 51J790
STS 61A751
STS 61B760
STS 61C581

The logit function is:

ln(\frac{P(Y = 1)}{1 - P(Y = 1)}) = Z = \beta_0 + \beta_1 X_i

The beta values (the intercept and slope) are obtained by the ‘generalised linear model’, glm function in R.

logic <- glm(Damange ~ ., data = challenge, family = "binomial")
summary(logic)
Call:
glm(formula = Damange ~ ., family = "binomial", data = challenge)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0608  -0.7372  -0.3712   0.3948   2.2321  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  15.2968     7.3286   2.087   0.0369 *
Temp         -0.2360     0.1074  -2.198   0.0279 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28.975  on 23  degrees of freedom
Residual deviance: 20.371  on 22  degrees of freedom
AIC: 24.371

Number of Fisher Scoring iterations: 5

The following plot is obtained by substituting the values 15.2968142 and -0.2360207 in the function.

Having developed the probabilities, now it’s time for decision-making. That is next.

Logistic Regression and Challenger Read More »

Breast Cancer Diagnostic Data Set

The dataset, known as the Breast Cancer Wisconsin (Diagnostic) Dataset, was obtained from Kaggle. It was built by Dr Wolberg, who used fluid samples from patients with solid breast masses. It provides ten features of cells from each sample – the mean value, extreme value and standard error of 10 features for the image returning 30 variables. Those ten components are:

  1. radius 
  2. texture 
  3. perimeter
  4. area
  5. smoothness 
  6. compactness 
  7. concavity 
  8. concave points
  9. symmetry
  10. fractal dimension

The objective is to match the outcome, and diagnosis, which takes two values vz benign (B) or malignant (M). The following plot gives the overall summary of how the various features compare between benign (B) and malignant (M)

or a density plot

We do a correlation plot next:

corr_mat <- cor(b_data[,2:ncol(b_data)])
corrplot(corr_mat)

Breast Cancer Diagnostic Data Set Read More »

Logistic Regression

We know linear regression, which allows us to find the relationship between two variables and let us predict a dependent variable from an independent variable.

In this example, the function associated with the red dotted line lets us estimate the fat% if a BMI value is known.

But what happens if the data is available, like the following?

Here, the survey gives either a YES or NO as the answer (1 = YES, 0 = NO). The linear regression and the subsequent equation are meaningless here. In such cases, we resort to logistic regression.

The objective of the logistic regression is not to get the value of Y but the probability. E.g., if the X value is 9, there is a 50% chance of getting a YES. On the other hand, X = 2 has a higher probability of getting a NO.

The plot tells you that the data is best suited for classification. Ys with < 50% chance to occur will be classified as the YES category, and < 50% is in NO.

Logistic Regression Read More »