Logistic Regression – Cleveland Data

Let’s do a logistic regression of health data. Experiments with the Cleveland database focused on distinguishing the presence (value: 1,2,3,4) from the absence (value 0). The featured health parameters are

Age
Sex
CP: chest pain
Trestbps: resting blood pressure (mm Hg)
Chol: serum cholesterol (mg/dl)
Fbs: fasting blood sugar > 120 mg/dl
Restecg: Rest ECG
Thalach: maximum heart rate achieved during the thallium stress test
Exang: exercise-induced angina
Oldpeak: ST depression induced by exercise relative to rest
Slope: the slope of the peak exercise ST segment
Ca: number of major vessels (0-3) coloured by fluoroscopy
Thal:
Hd: diagnosis of heart disease

After cleaning up and conditioning, the data looks like this:

297 obs. of  14 variables:
 $ Age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ Sex     : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
 $ CP      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
 $ Trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ Chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ Fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ Restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ Thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ Exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
 $ Oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ Slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
 $ Ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
 $ Thal    : Factor w/ 3 levels "3","6","7": 2 1 3 1 1 1 1 1 3 3 ...
 $ Hd      : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...
logic <- glm(Hd ~ ., data = heart_data, family = "binomial")
predicted.data <- data.frame(Prob.HD = logic$fitted.values, HD = heart_data$Hd)
par(cex=0.8, mai=c(0.7,0.7,0.2,0.5), bg = "antiquewhite1")
plot(x = predicted.data$HD, y = predicted.data$Prob.HD)

An even fancier plot can be made using the following code:

logic <- glm(Hd ~ ., data = heart_data, family = "binomial")
predicted.data <- data.frame(Prob.HD = logic$fitted.values, HD = heart_data$Hd)
predicted.data <- predicted.data[order(predicted.data$Prob.HD, decreasing = FALSE),]
predicted.data$Rank <- 1:nrow(predicted.data)
  
ggplot(data = predicted.data, aes(x = Rank, y = Prob.HD)) +
  geom_point(aes(color = HD), alpha = 1, shape = 4, stroke = 2) +
  xlab("Index") +
  ylab("Predicted Probability of Getting Heart Disease") +
  theme(text = element_text(color = "white"), 
        panel.background = element_rect(fill = "black"), 
        plot.background = element_rect(fill = "black"),
        panel.grid = element_blank(),
        legend.text = element_text(color = "black"),
        legend.title = element_text(color = "black"),
        axis.text = element_text(color = "white"),
        axis.ticks = element_line(color = "white"))