Continuing from the previous post, this time we will include all the predictors in the regression. The notation is the response variable, followed by ~ and ‘.’ (a dot), where the dot stands for all the remaining columns in the data frame.
fit3 <- lm(medv~., Boston)
summary(fit3)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.1304 -2.7673 -0.5814 1.9414 26.2526
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
crim -0.121389 0.033000 -3.678 0.000261 ***
zn 0.046963 0.013879 3.384 0.000772 ***
indus 0.013468 0.062145 0.217 0.828520
chas 2.839993 0.870007 3.264 0.001173 **
nox -18.758022 3.851355 -4.870 1.50e-06 ***
rm 3.658119 0.420246 8.705 < 2e-16 ***
age 0.003611 0.013329 0.271 0.786595
dis -1.490754 0.201623 -7.394 6.17e-13 ***
rad 0.289405 0.066908 4.325 1.84e-05 ***
tax -0.012682 0.003801 -3.337 0.000912 ***
ptratio -0.937533 0.132206 -7.091 4.63e-12 ***
lstat -0.552019 0.050659 -10.897 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.798 on 493 degrees of freedom
Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
Two obvious questions arise at this point:
- Was this a helpful exercise?
- Which variables are better at predicting the response?
To answer the first question, we look at two numbers in the output: the R-squared and the F-statistic. Both compare the variance of the residuals around the fitted model with the variance around the mean of the response: R-squared is the fraction of that variance the model explains, and the F-statistic tests whether the model explains significantly more than the intercept alone. An R-squared of 0.7343 and an F-statistic of 113.5 (p-value < 2.2e-16) are good enough to justify this exercise.
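To make these two numbers concrete, here is a minimal sketch computing them by hand from the residual and total sums of squares. It uses the copy of Boston that ships with the MASS package, which carries one more predictor than the data used above, so the exact values differ slightly; the formulas are the same.

```r
library(MASS)  # for the Boston data set

fit3 <- lm(medv ~ ., data = Boston)

rss <- sum(residuals(fit3)^2)                     # variance left around the model
tss <- sum((Boston$medv - mean(Boston$medv))^2)   # variance around the mean
r2  <- 1 - rss / tss                              # fraction of variance explained

p <- length(coef(fit3)) - 1                       # number of predictors
n <- nrow(Boston)                                 # number of observations
f <- ((tss - rss) / p) / (rss / (n - p - 1))      # the F-statistic

c(r.squared = r2, f.statistic = f)
```

Both values match what summary(fit3) reports, which is all the summary is doing behind the scenes.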
The clues to the second question are also in the results (the stars at the end of each coefficient row). They show that two variables, ‘indus’ and ‘age’, have very low significance (high p-values). So we remove these least useful predictors from the model and refit using the ‘update’ function.
fit4 <- update(fit3, ~.-age-indus)
summary(fit4)
Call:
lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
tax + ptratio + lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.1814 -2.7625 -0.6243 1.8448 26.3920
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.451747 4.903283 8.454 3.18e-16 ***
crim -0.121665 0.032919 -3.696 0.000244 ***
zn 0.046191 0.013673 3.378 0.000787 ***
chas 2.871873 0.862591 3.329 0.000935 ***
nox -18.262427 3.565247 -5.122 4.33e-07 ***
rm 3.672957 0.409127 8.978 < 2e-16 ***
dis -1.515951 0.187675 -8.078 5.08e-15 ***
rad 0.283932 0.063945 4.440 1.11e-05 ***
tax -0.012292 0.003407 -3.608 0.000340 ***
ptratio -0.930961 0.130423 -7.138 3.39e-12 ***
lstat -0.546509 0.047442 -11.519 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.789 on 495 degrees of freedom
Multiple R-squared: 0.7342, Adjusted R-squared: 0.7289
F-statistic: 136.8 on 10 and 495 DF, p-value: < 2.2e-16
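To confirm that dropping ‘age’ and ‘indus’ cost the model essentially nothing, we can compare the two nested fits with a partial F-test via anova. As before this sketch uses MASS’s copy of Boston, so the exact figures differ slightly from the output above.

```r
library(MASS)  # for the Boston data set

fit3 <- lm(medv ~ ., data = Boston)           # full model
fit4 <- update(fit3, ~ . - age - indus)       # reduced model

# A large p-value here means the dropped predictors added nothing useful,
# so the smaller model is adequate.
anova(fit4, fit3)
```

This is the same logic the summary hints at: the adjusted R-squared barely moves (0.7278 to 0.7289) while the F-statistic improves, because we gave up two degrees of freedom that were buying us almost no explanatory power.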
Reference: An Introduction to Statistical Learning — James, Witten, Hastie, Tibshirani, Taylor