August 2023

Lorenz Curve and Gini Coefficient

The plot of cumulative household income vs cumulative households is called a Lorenz curve. Let’s build a hypothetical one using R.

Step 1: The households are organised according to their incomes, from the lowest to the highest.

income    <- c(100, 200, 400, 500, 800)   # income of each group, lowest to highest
household <- c(85, 65, 40, 25, 10)        # number of households in each group

Step 2: The cumulative proportions of incomes and households are calculated.

cum_income <- cumsum(income)*100/sum(income)
cum_household <- cumsum(household)*100/sum(household)

Step 3: Plot the cumulative proportion of households against the cumulative proportion of income to get the Lorenz curve.
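A minimal sketch of that plot, using the vectors computed above:

# Prepend (0, 0) so the Lorenz curve starts at the origin
plot(c(0, cum_household), c(0, cum_income), type = "b",
     xlab = "Cumulative % of households", ylab = "Cumulative % of income")
abline(0, 1, lty = 2)   # the diagonal (perfect equality line)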

If every household has the same income, the Lorenz curve coincides with the diagonal, which is called the perfect equality line.

At the other extreme, if one household has all the income, you get the perfect inequality line.

Gini Coefficient

Let the area between the perfect equality line and the Lorenz curve be A, and the area under the Lorenz curve be B.
Gini coefficient = Area A / (Area A + Area B)

Case 1: If the Lorenz curve coincides with the perfect equality line
Gini coefficient = 0 / (0 + Area B) = 0; perfect equality
Case 2: If the Lorenz curve coincides with the perfect inequality line
Gini coefficient = Area A / (Area A + 0) = 1; perfect inequality
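For the hypothetical data above, here is a rough numerical sketch of the Gini coefficient using the trapezoidal rule (packages such as 'ineq' also provide ready-made functions for this):

# Work on the 0-1 scale with (0, 0) prepended
x <- c(0, cum_household) / 100
y <- c(0, cum_income) / 100
area_B <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)   # area under the Lorenz curve
gini <- 1 - 2 * area_B   # since A + B = 0.5, Gini = A / (A + B) = 2A = 1 - 2B
gini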


ROC and AUC

We’ll demonstrate the concept of ROC (Receiver Operating Characteristics) and AUC (Area Under Curve) with the help of (simulated) weight data, using R code. Here are the first ten rows of the data.

weight      obese
 86.48505   0
 88.04764   0
111.50064   0
112.69730   0
121.53974   0
122.34533   0
126.53330   0
129.34565   0
129.46268   1
130.17398   1

Here ‘obese’ is the outcome variable that takes one of the two values, 0 (not obese) or 1 (obese). The ‘weight’ is the independent variable, also known as the predictor.

Now, we’ll do logistic regression of the data using the generalised linear model (‘glm’), store the output in a variable and plot.

plot(weight, obese, col = "blue", cex = 1.5, cex.axis = 1.5, cex.lab = 1.6)   # raw data: 0/1 outcome vs weight
glm.fit <- glm(obese ~ weight, family = "binomial")                            # logistic regression
lines(weight, glm.fit$fitted.values, lwd = 3)                                  # fitted probability curve

Estimation of the ROC and AUC requires the ‘pROC’ package.

library(pROC)

par(bg = "antiquewhite1", pty = "s")
roc(obese, glm.fit$fitted.values, plot = TRUE, legacy.axes = TRUE, col = "brown", lwd = 3, print.auc = TRUE, auc.polygon = TRUE)

We used the following options to get the final plot.
par(pty = "s"): to plot the graph as a square
plot = TRUE: to plot the curve
legacy.axes = TRUE: to plot 1 - specificity on the x-axis instead of the default specificity
print.auc = TRUE: to print the value of the AUC on the graph
auc.polygon = TRUE: to show the AUC as a shaded area
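To pull the numbers out rather than just the plot, the roc object can be stored and queried. A minimal sketch, assuming the ‘obese’ and ‘glm.fit’ objects above:

roc_obj <- roc(obese, glm.fit$fitted.values)     # store the ROC object
auc(roc_obj)                                     # numeric AUC
# Cut-off that best balances sensitivity and specificity
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))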


Logistic Regression: Receiver Operating Characteristics

The Receiver Operating Characteristics (ROC) curve is used to assess the worth of a logistic regression model. It is a plot of the true positive rate on the Y-axis against the false positive rate on the X-axis.

The ROC curve is a representation of the performance of a binary classification model. The diagonal line represents the case of not using a model at all, i.e., random guessing. The area under the curve (AUC) of the ROC is a key summary measure; a model with a higher AUC is preferred.
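As a rough sketch of how such a curve is traced (the scores below are invented purely for illustration), sweep a cut-off over the predicted probabilities and compute the two rates at each value:

# Hypothetical predicted probabilities and true classes
score  <- c(0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.85, 0.9)
actual <- c(0,   0,   1,    0,   1,    0,   1,   1,   1,    1)

thresholds <- seq(0, 1, by = 0.05)
tpr <- sapply(thresholds, function(t) sum(score >= t & actual == 1) / sum(actual == 1))
fpr <- sapply(thresholds, function(t) sum(score >= t & actual == 0) / sum(actual == 0))

plot(fpr, tpr, type = "b", xlab = "False positive rate (1 - specificity)",
     ylab = "True positive rate (sensitivity)")
abline(0, 1, lty = 2)   # the no-model diagonal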


Logistic Regression and Accuracy Paradox

The next step is to assign a cut-off probability value for decision-making. Based on that, the decision-maker will classify the observation as either class 1 or class 0. If Pc is that cut-off value, the following rule applies:

Y_i = \begin{cases} 0, & \text{if } P(Y_i = 1) < P_c \\ 1, & \text{if } P(Y_i = 1) \ge P_c \end{cases}

The probabilities come from the logistic model fitted in the Challenger post:

ln(\frac{P(Y = 1)}{1 - P(Y = 1)}) = \beta_0 + \beta_1 X_i \text{, or } P(Y = 1) = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}} = \frac{e^{15.2968142 - 0.2360207 X_i}}{1 + e^{15.2968142 - 0.2360207 X_i}}

The following classification table is obtained if the cut-off value is 0.5.

Launch T (F)    Actual Damage    Prediction (Pc = 0.5)
66              0                0
70              1                0
69              0                0
80              0                0
68              0                0
67              0                0
72              0                0
73              0                0
70              0                0
57              1                1
63              1                1
70              1                0
78              0                0
67              0                0
53              1                1
67              0                0
75              0                0
70              0                0
81              0                0
76              0                0
79              0                0
75              1                0
76              0                0
58              1                1

The accuracy of the prediction may be obtained by counting the actual and predicted values or by using the ‘confusionMatrix’ function from the ‘caret’ package.

True Positive (TP) = Damage = 1 & Prediction = 1
False Negative (FN) = Damage = 1 & Prediction = 0
False Positive (FP) = Damage = 0 & Prediction = 1
True Negative (TN) = Damage = 0 & Prediction = 0

Sensitivity = TP / (TP + FN) = 0.57 (57%)
Specificity = TN / (TN + FP) = 1 (100%)
Overall accuracy = (TP + TN) / (TP + TN + FN + FP) = 0.875

Although the overall accuracy is at an impressive 87.5%, the true positive rate or the failure estimation rate is pretty average (57%). Considering the significance of the decision, one way to deal with it is to increase the sensitivity by reducing the cut-off probability to 0.2. That leads to the following.

Sensitivity = TP / (TP + FN) = 0.857 (85.7%)
Specificity = TN / (TN + FP) = 0.53 (53%)
Overall accuracy = (TP + TN) / (TP + TN + FN + FP) = 0.625
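As a hedged sketch, the classification counts at both cut-offs can be reproduced in R, assuming the ‘challenge’ data frame and the fitted model ‘logic’ from the Challenger post:

prob <- predict(logic, type = "response")          # P(Y = 1) for each launch

pred_0.5 <- ifelse(prob >= 0.5, 1, 0)              # classify at Pc = 0.5
table(Actual = challenge$Damage, Predicted = pred_0.5)

pred_0.2 <- ifelse(prob >= 0.2, 1, 0)              # classify at Pc = 0.2
table(Actual = challenge$Damage, Predicted = pred_0.2)

# Or, using the caret package:
# library(caret)
# confusionMatrix(factor(pred_0.2), factor(challenge$Damage), positive = "1")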

Accuracy Paradox

As you can see, the sensitivity in the second case (the lower cut-off value) is higher, but the overall accuracy of the prediction is poorer. And this is a key step of decision-making: choosing a higher accuracy of predicting failures (the positives) over the overall classification accuracy.


Logistic Regression and Challenger

Remember the post on the O-ring failure of the Challenger disaster?

Here, we do a step-by-step logistic regression of the data. The following table shows the O-ring damages and launch temperatures of 24 previous shuttle launches. Damage = 1 represents failure, and 0 denotes no failure.

Flight #    Launch T (F)    O-ring Damage
STS 1       66              0
STS 2       70              1
STS 3       69              0
STS 4       80              0
STS 5       68              0
STS 6       67              0
STS 7       72              0
STS 8       73              0
STS 9       70              0
STS 41B     57              1
STS 41C     63              1
STS 41D     70              1
STS 41G     78              0
STS 51A     67              0
STS 51C     53              1
STS 51D     67              0
STS 51B     75              0
STS 51G     70              0
STS 51F     81              0
STS 51I     76              0
STS 51J     79              0
STS 61A     75              1
STS 61B     76              0
STS 61C     58              1

The logit function is:

ln(\frac{P(Y = 1)}{1 - P(Y = 1)}) = Z = \beta_0 + \beta_1 X_i

The beta values (the intercept and slope) are obtained using the ‘generalised linear model’ function, glm, in R.

logic <- glm(Damage ~ ., data = challenge, family = "binomial")
summary(logic)

Call:
glm(formula = Damage ~ ., family = "binomial", data = challenge)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0608  -0.7372  -0.3712   0.3948   2.2321  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  15.2968     7.3286   2.087   0.0369 *
Temp         -0.2360     0.1074  -2.198   0.0279 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28.975  on 23  degrees of freedom
Residual deviance: 20.371  on 22  degrees of freedom
AIC: 24.371

Number of Fisher Scoring iterations: 5

The following plot is obtained by substituting the values 15.2968142 and -0.2360207 into the function.
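A hedged sketch of how such a plot can be drawn, assuming the ‘challenge’ data frame used above:

# Observed damages with the fitted logistic curve overlaid
plot(challenge$Temp, challenge$Damage, col = "blue",
     xlab = "Launch temperature (F)", ylab = "Probability of O-ring damage")
temp_grid <- seq(50, 85, by = 0.5)
p_hat <- exp(15.2968142 - 0.2360207 * temp_grid) /
         (1 + exp(15.2968142 - 0.2360207 * temp_grid))
lines(temp_grid, p_hat, lwd = 3)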

Having developed the probabilities, now it’s time for decision-making. That is next.


Breast Cancer Diagnostic Data Set

The dataset, known as the Breast Cancer Wisconsin (Diagnostic) Dataset, was obtained from Kaggle. It was built by Dr Wolberg, who used fluid samples from patients with solid breast masses. For each sample image, ten features of the cells are computed, and the mean value, extreme value and standard error of each are reported, giving 30 variables per image. Those ten features are:

  1. radius 
  2. texture 
  3. perimeter
  4. area
  5. smoothness 
  6. compactness 
  7. concavity 
  8. concave points
  9. symmetry
  10. fractal dimension

The objective is to predict the outcome, diagnosis, which takes one of two values, viz. benign (B) or malignant (M). The following plot gives an overall summary of how the various features compare between benign (B) and malignant (M) samples.

Or the same comparison as a density plot:
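One possible way to make such a density plot, assuming the data frame is called ‘b_data’, the outcome column is named ‘diagnosis’ and the id column has already been dropped:

library(ggplot2)
library(tidyr)

# Reshape to long format: one row per (feature, value) pair
b_long <- pivot_longer(b_data, -diagnosis, names_to = "feature", values_to = "value")

ggplot(b_long, aes(x = value, fill = diagnosis)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ feature, scales = "free")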

We do a correlation plot next:

library(corrplot)

# Correlation matrix of the numeric feature columns (the first column, the outcome, is excluded)
corr_mat <- cor(b_data[, 2:ncol(b_data)])
corrplot(corr_mat)


Logistic Regression

We know linear regression, which finds the relationship between two variables and lets us predict a dependent variable from an independent variable.

In this example, the function associated with the red dotted line lets us estimate the fat% if a BMI value is known.

But what happens if the data is available, like the following?

Here, the survey gives either a YES or NO as the answer (1 = YES, 0 = NO). The linear regression and the subsequent equation are meaningless here. In such cases, we resort to logistic regression.

The objective of logistic regression is not to get the value of Y but its probability. E.g., if the X value is 9, there is a 50% chance of getting a YES. On the other hand, X = 2 has a higher probability of getting a NO.

The plot tells you that the data is best suited for classification: Ys with a ≥ 50% chance of occurring are classified into the YES category, and those below 50% into NO.
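A toy sketch of the idea (the survey numbers below are invented for illustration):

# Hypothetical survey: x is the predictor, y is the YES (1) / NO (0) answer
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
y <- c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1)

fit  <- glm(y ~ x, family = "binomial")
prob <- predict(fit, type = "response")    # probability of YES at each x
pred <- ifelse(prob >= 0.5, "YES", "NO")   # classify with a 50% cut-off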


Logistic Regression – Cleveland Data

Let’s do a logistic regression of health data. Experiments with the Cleveland database focus on distinguishing the presence of heart disease (values 1, 2, 3, 4) from its absence (value 0). The featured health parameters are

Age
Sex
CP: chest pain
Trestbps: resting blood pressure (mm Hg)
Chol: serum cholesterol (mg/dl)
Fbs: fasting blood sugar > 120 mg/dl
Restecg: Rest ECG
Thalach: maximum heart rate achieved during the thallium stress test
Exang: exercise-induced angina
Oldpeak: ST depression induced by exercise relative to rest
Slope: the slope of the peak exercise ST segment
Ca: number of major vessels (0-3) coloured by fluoroscopy
Thal: thallium stress test result (3 = normal, 6 = fixed defect, 7 = reversible defect)
Hd: diagnosis of heart disease

After cleaning up and conditioning, the data looks like this:

297 obs. of  14 variables:
 $ Age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ Sex     : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
 $ CP      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
 $ Trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ Chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ Fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ Restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ Thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ Exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
 $ Oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ Slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
 $ Ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
 $ Thal    : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
 $ Hd      : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...
# With a factor response, the binomial glm treats the first level ("0", no disease)
# as failure and all other levels (1-4) as success
logic <- glm(Hd ~ ., data = heart_data, family = "binomial")
predicted.data <- data.frame(Prob.HD = logic$fitted.values, HD = heart_data$Hd)
par(cex = 0.8, mai = c(0.7, 0.7, 0.2, 0.5), bg = "antiquewhite1")
plot(x = predicted.data$HD, y = predicted.data$Prob.HD)

An even fancier plot can be made using the following code:

library(ggplot2)

logic <- glm(Hd ~ ., data = heart_data, family = "binomial")
predicted.data <- data.frame(Prob.HD = logic$fitted.values, HD = heart_data$Hd)
predicted.data <- predicted.data[order(predicted.data$Prob.HD, decreasing = FALSE),]
predicted.data$Rank <- 1:nrow(predicted.data)
  
ggplot(data = predicted.data, aes(x = Rank, y = Prob.HD)) +
  geom_point(aes(color = HD), alpha = 1, shape = 4, stroke = 2) +
  xlab("Index") +
  ylab("Predicted Probability of Getting Heart Disease") +
  theme(text = element_text(color = "white"), 
        panel.background = element_rect(fill = "black"), 
        plot.background = element_rect(fill = "black"),
        panel.grid = element_blank(),
        legend.text = element_text(color = "black"),
        legend.title = element_text(color = "black"),
        axis.text = element_text(color = "white"),
        axis.ticks = element_line(color = "white")) 


Shortcuts to Accidents

We saw cognitive reflection problems, where our mind (brain) wants us to lock in – what it believes to be – a ‘timely’ answer that it gets via mental shortcuts. Here is one such question:

Road      Major Accidents    Minor Accidents
Road 1    2000               16
Road 2    1000               ?

Fill in the box with the question mark so that the accident records of the two roads become equivalent.

Studies have shown that a high proportion of people answer 8, presumably trying to maintain the same ratio (2000:16 :: 1000:8). But the question asks for the number of minor accidents required for the road with fewer major accidents to become equivalent to the one with more major accidents. Since a major accident outweighs a minor one, compensating for the shortfall of 1000 major accidents on Road 2 naturally requires much more than 1000 minor accidents.

Cars and workers

Another famous trick puzzle has the following form:

It takes 7 workers to make 7 cars in 7 days. How many days would it take 5 workers to make 5 cars?

Park your instinct to answer 5 (so that 5-5-5 matches 7-7-7!) for a while. Try this one first:
If 7 workers can build 4 cars in 3 days, how many days would it take 8 workers to build 6 cars?
I assume more people answer the second one correctly because it shows fewer obvious patterns and may slow you down.

Answer: cars per worker per day = (4/7)/3 = 4/21. So, 8 workers can make 32/21 cars in a day. But we want 6 cars: (32/21) x X (days) = 6, so X = (21 x 6)/32 ≈ 3.9 days.

In the same way, the first question is answered as follows:
(7/7)/7 = 1/7 car per worker per day. 5 workers can make 5/7 cars in a day. For making 5 cars, one needs (5/7) x X (days) = 5 or X = 35/5 = 7 days.


Time Series Analysis – Decomposition

Here is another time series, namely, the air passengers.

A key task of the time series analysis is to break down the data into signal and noise. In R, there is a function called decompose to do the job.

AP <- AirPassengers   # assuming AP is R's built-in monthly air passenger series
decom_AP <- decompose(AP, type = "additive")
plot(decom_AP)

Note that the data is already in a time series format. If it is a regular data frame, convert it with the ‘ts’ function before attempting the decompose function.
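As a hedged sketch of that conversion, assuming a plain data frame ‘df’ with a numeric column ‘passengers’ of monthly counts starting in January 1949:

# Turn the plain column into a monthly time series object
AP <- ts(df$passengers, start = c(1949, 1), frequency = 12)
decom_AP <- decompose(AP, type = "additive")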

Here is the illustration – the data (blue circles) compared with the seasonal component alone.

Here is the data compared with seasonality + trend.

And finally, the data compared with the sum of all three: seasonality + trend + random.
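A hedged sketch of such a comparison, overlaying the reconstructed components on the data:

plot(AP, type = "p", col = "blue", ylab = "Passengers")                               # the data
lines(decom_AP$seasonal + decom_AP$trend, lwd = 2)                                    # seasonality + trend
lines(decom_AP$seasonal + decom_AP$trend + decom_AP$random, col = "red", lwd = 2)     # all three components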
