Continuing from the previous post, this time we will include all the predictors in the regression. The notation is the response variable, followed by ~ and ‘.’ (a dot), where the dot stands for all the remaining columns in the data frame.
fit3 <- lm(medv~., Boston)
summary(fit3)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.1304 -2.7673 -0.5814 1.9414 26.2526
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
crim -0.121389 0.033000 -3.678 0.000261 ***
zn 0.046963 0.013879 3.384 0.000772 ***
indus 0.013468 0.062145 0.217 0.828520
chas 2.839993 0.870007 3.264 0.001173 **
nox -18.758022 3.851355 -4.870 1.50e-06 ***
rm 3.658119 0.420246 8.705 < 2e-16 ***
age 0.003611 0.013329 0.271 0.786595
dis -1.490754 0.201623 -7.394 6.17e-13 ***
rad 0.289405 0.066908 4.325 1.84e-05 ***
tax -0.012682 0.003801 -3.337 0.000912 ***
ptratio -0.937533 0.132206 -7.091 4.63e-12 ***
lstat -0.552019 0.050659 -10.897 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.798 on 493 degrees of freedom
Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
Two obvious questions arise at this point:
- Was this a helpful exercise?
- Which variables are better at predicting the response?
To answer the first question, we look at two numbers in the output: the R-squared and the F-statistic. Both compare the variance of the residuals around the fitted model with the variance around the mean of the response: R-squared is the fraction of that variance the model explains, and the F-statistic tests whether the model explains significantly more than the intercept alone. An R-squared of 0.7343 and an F-statistic of 113.5 (p-value < 2.2e-16) are good enough to justify this exercise.
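To make these two numbers concrete, here is a minimal sketch computing them by hand from the residual and total sums of squares. It uses the copy of Boston that ships with the MASS package, which carries one more predictor than the data used above, so the exact values differ slightly; the formulas are the same.

```r
library(MASS)  # for the Boston data set

fit3 <- lm(medv ~ ., data = Boston)

rss <- sum(residuals(fit3)^2)                     # variance left around the model
tss <- sum((Boston$medv - mean(Boston$medv))^2)   # variance around the mean
r2  <- 1 - rss / tss                              # fraction of variance explained

p <- length(coef(fit3)) - 1                       # number of predictors
n <- nrow(Boston)                                 # number of observations
f <- ((tss - rss) / p) / (rss / (n - p - 1))      # the F-statistic

c(r.squared = r2, f.statistic = f)
```

Both values match what summary(fit3) reports, which is all the summary is doing behind the scenes.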
The clues to the second question are also in the results (the stars at the end of each coefficient row). They show that two variables, ‘indus’ and ‘age’, have very low significance (high p-values). So we remove these least useful predictors from the model and refit using the ‘update’ function.
fit4 <- update(fit3, ~.-age-indus)
summary(fit4)
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.1814 -2.7625 -0.6243 1.8448 26.3920
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.451747 4.903283 8.454 3.18e-16 ***
crim -0.121665 0.032919 -3.696 0.000244 ***
zn 0.046191 0.013673 3.378 0.000787 ***
chas 2.871873 0.862591 3.329 0.000935 ***
nox -18.262427 3.565247 -5.122 4.33e-07 ***
rm 3.672957 0.409127 8.978 < 2e-16 ***
dis -1.515951 0.187675 -8.078 5.08e-15 ***
rad 0.283932 0.063945 4.440 1.11e-05 ***
tax -0.012292 0.003407 -3.608 0.000340 ***
ptratio -0.930961 0.130423 -7.138 3.39e-12 ***
lstat -0.546509 0.047442 -11.519 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.789 on 495 degrees of freedom
Multiple R-squared: 0.7342, Adjusted R-squared: 0.7289
F-statistic: 136.8 on 10 and 495 DF, p-value: < 2.2e-16
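To confirm that dropping ‘age’ and ‘indus’ cost the model essentially nothing, we can compare the two nested fits with a partial F-test via anova. As before this sketch uses MASS’s copy of Boston, so the exact figures differ slightly from the output above.

```r
library(MASS)  # for the Boston data set

fit3 <- lm(medv ~ ., data = Boston)           # full model
fit4 <- update(fit3, ~ . - age - indus)       # reduced model

# A large p-value here means the dropped predictors added nothing useful,
# so the smaller model is adequate.
anova(fit4, fit3)
```

This is the same logic the summary hints at: the adjusted R-squared barely moves (0.7278 to 0.7289) while the F-statistic improves, because we gave up two degrees of freedom that were buying us almost no explanatory power.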
Reference: An Introduction to Statistical Learning — James, Witten, Hastie, Tibshirani, Taylor