13 Inference in Regression

13.3 Multiple Regression

OBJECTIVES By the end of this section, I will be able to …

Find the multiple regression equation, interpret the multiple regression coefficients, and use the multiple regression equation to make predictions.
Calculate and interpret the adjusted coefficient of determination.
Perform the $F$ test for the overall significance of the multiple regression.
Conduct $t$ tests for the significance of individual predictor variables.
Explain the use and effect of dummy variables in multiple regression.
Apply the strategy for building a multiple regression model.

1 Finding the Multiple Regression Equation, Interpreting the Coefficients, and Making Predictions

Thus far, we have examined the relationship between the response variable $y$ and a single predictor variable $x$ . In our data-filled world, however, we often encounter situations where we can use more than one $x$ variable to predict the $y$ variable. This is called multiple regression.

Regression analysis using a single $y$ variable and a single $x$ variable is called simple linear regression.

Multiple regression describes the linear relationship between one response variable $y$ and more than one predictor variable, $x_{1}, x_{2}, x_{3}, \dots$ . The multiple regression equation is an extension of the regression equation

$\hat{y} = b_{0} + b_{1} x_{1} + b_{2} x_{2} + \dots + b_{k} x_{k}$

where $k$ represents the number of $x$ variables in the equation, and $b_{0}, b_{1}, b_{2}, \dots, b_{k}$ represent the multiple regression coefficients.
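The examples in this section obtain the coefficients from statistical software. As a rough sketch of what such software does, the code below computes $b_{0}, b_{1}, \dots, b_{k}$ by least squares, solving the normal equations with plain Python. The function name `fit_multiple_regression` and the tiny data set are made up for illustration; they are not part of the Breakfast Cereals data.

```python
def fit_multiple_regression(xs, y):
    """Return [b0, b1, ..., bk] that minimize the sum of squared residuals.

    xs: list of rows, one per observation, each holding the k predictor values.
    y:  list of response values, one per observation.
    """
    # Prepend a 1 to every row so that b0 acts as the y intercept.
    rows = [[1.0] + [float(v) for v in row] for row in xs]
    p = len(rows[0])

    # Normal equations: (X'X) b = X'y, stored as an augmented matrix.
    m = []
    for i in range(p):
        left = [sum(r[i] * r[j] for r in rows) for j in range(p)]
        right = sum(r[i] * yi for r, yi in zip(rows, y))
        m.append(left + [right])

    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            factor = m[r][col] / m[col][col]
            for j in range(col, p + 1):
                m[r][j] -= factor * m[col][j]

    # Back substitution.
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(m[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (m[i][p] - tail) / m[i][i]
    return b


# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2 (illustration only).
xs = [[0, 0], [1, 0], [0, 1], [2, 1], [3, 2], [1, 3]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in xs]
b0, b1, b2 = fit_multiple_regression(xs, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the made-up responses follow the equation exactly, the fitted coefficients recover the values 1, 2, and $-3$ up to rounding error. With real data, a statistical package is the better choice, since it also reports the standard errors and $p$-values used later in this section.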

The interpretation of the regression coefficients is similar to the interpretation of the slope $b_{1}$ in simple linear regression, except that we also state that the other $x$ variables are held constant. The interpretation of the $y$ intercept $b_{0}$ is similar to the simple linear regression case. The next example illustrates the multiple regression equation, and shows how to interpret the multiple regression coefficients.

EXAMPLE 11 Multiple regression equation, coefficients, and prediction

breakfastcereals3

The data set Breakfast Cereals includes several predictor variables and one response variable, $y = (nutritional) rating$ .

a. Use technology to find the multiple regression equation for predicting $y = rating$ , using $x_{1} = fiber$ and $x_{2} = sugar$ . State the equation in a sentence.
b. State the values of the multiple regression coefficients.
c. Interpret the multiple regression coefficients for $x_{1} = fiber$ and $x_{2} = sugar$ .
d. Use the multiple regression equation to predict the rating of a breakfast cereal with 5 grams of fiber and 10 grams of sugar.

When we perform a multiple regression of one variable on (or against or versus) a set of other variables, the first variable is always the $y$ variable, and the set of variables following the word on are the $x$ variables.

Solution

Using the instructions in the Step-by-Step Technology Guide at the end of this section, we open the Breakfast Cereals data set and perform a multiple regression of $y = rating$ on $x_{1} = fiber$ and $x_{2} = sugar$ . Note that evaluating the equation at zero grams of fiber and zero grams of sugar, as we do when interpreting the $y$ intercept, does not represent extrapolation, as there are cereals in the data set that have either zero grams of fiber (such as Cap’n Crunch) or zero grams of sugar (such as Cream of Wheat).

A partial Minitab printout is shown in Figure 20. A partial SPSS printout is in Figure 21. The multiple regression equation is

$\begin{matrix} \hat{y} & = b_{0} + b_{1} x_{1} + b_{2} x_{2} \\ & = 52.22 + 2.869 (fiber) - 2.246 (sugar) \end{matrix}$

The estimated nutritional rating equals 52.22 points plus 2.869 times the number of grams of fiber minus 2.246 times the number of grams of sugar.

FIGURE 20 Multiple regression equation in Minitab.

FIGURE 21 Multiple regression equation in SPSS.
The values of the multiple regression coefficients are $b_{0} = 52.22$ , $b_{1} = 2.869$ , and $b_{2} = - 2.246$ .
The multiple regression coefficients are interpreted as follows:
- $b_{0} = 52.22$ (y intercept). The estimated nutritional rating when there are zero grams of fiber and zero grams of sugar is 52.22.
- $b_{1} = 2.869$ . For every increase of one gram of fiber, the estimated increase in nutritional rating is 2.869 points, when the amount of sugar is held constant.
- $b_{2} = - 2.246$ . For every increase of one gram of sugar, the estimated decrease in nutritional rating is 2.246 points, when the amount of fiber is held constant.
When making predictions in multiple regression, beware of the pitfalls of extrapolation, just like those for simple linear regression. Further, in multiple regression, the values for all predictor variables must lie within their respective ranges. Otherwise, the prediction represents extrapolation, and it may be misleading.

To find the predicted rating for a breakfast cereal with $x_{1} = fiber = 5$ and $x_{2} = sugar = 10$ , we plug these values into the multiple regression equation from part (a):

$\hat{y} = 52.22 + 2.869 (5) - 2.246 (10) = 44.105$

The predicted nutritional rating for a breakfast cereal with 5 grams of fiber and 10 grams of sugar is 44.105.
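The prediction above is simple arithmetic, so it can be checked with a few lines of code. The sketch below hard-codes the coefficients reported in this example; the function name `predict_rating` is made up for illustration.

```python
# Coefficients from the regression of rating on fiber and sugar in this example.
B0, B_FIBER, B_SUGAR = 52.22, 2.869, -2.246


def predict_rating(fiber, sugar):
    """Estimated nutritional rating for the given fiber and sugar values."""
    return B0 + B_FIBER * fiber + B_SUGAR * sugar


prediction = predict_rating(fiber=5, sugar=10)
print(round(prediction, 3))
```

The function does not check whether the inputs lie within the ranges of fiber and sugar seen in the data, so guarding against extrapolation remains the user's responsibility.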

NOW YOU CAN DO

Exercises 9–16.

2 The Adjusted Coefficient of Determination

Recall from Section 4.3 that we measure the goodness of a regression equation using the coefficient of determination $r^{2} = SSR / SST$ . In multiple regression, we use the same formula for the coefficient of determination (though the letter $r$ is promoted to a capital $R$ ).

Multiple Coefficient of Determination $R^{2}$

The multiple coefficient of determination is given by:

$R^{2} = \frac{SSR}{SST}, \quad 0 \leq R^{2} \leq 1$

where SSR is the sum of squares regression, and SST is the total sum of squares. The multiple coefficient of determination $R^{2}$ represents the proportion of the variability in the response $y$ that is explained by the multiple regression equation.

Unfortunately, when a new $x$ variable is added to the multiple regression equation, the value of $R^{2}$ never decreases and almost always increases, even when the variable is not useful for predicting $y$ . So, we need a way to adjust the value of $R^{2}$ as a penalty for having too many unhelpful $x$ variables in the equation. This is the adjusted coefficient of determination, $R_{adj}^{2}$ .

Adjusted Coefficient of Determination $R_{adj}^{2}$

The adjusted coefficient of determination is given by the formula

$R_{adj}^{2} = 1 - (1 - R^{2}) (\frac{n - 1}{n - k - 1})$

where $n$ is the number of observations, $k$ is the number of $x$ variables, and $R^{2}$ is the multiple coefficient of determination. $R_{adj}^{2}$ is always smaller than $R^{2}$ .

$R_{adj}^{2}$ is preferable to $R^{2}$ as a measure of the goodness of a regression equation, because $R_{adj}^{2}$ will decrease if an unhelpful $x$ variable is added to the regression equation. The interpretation of $R_{adj}^{2}$ is similar to $R^{2}$ .

EXAMPLE 12 Calculating and interpreting the adjusted coefficient of determination $R_{adj}^{2}$

For the multiple regression in Example 11, we have $SSR = 12{,}251.6$ and $SST = 14{,}996.8$ .

Calculate the multiple coefficient of determination $R^{2}$ .
There are $n = 77$ observations. Compute and interpret the adjusted coefficient of determination $R_{adj}^{2}$ .

Solution

$R^{2} = \frac{SSR}{SST} = \frac{12{,}251.6}{14{,}996.8} \approx 0.8169$
$R_{adj}^{2} = 1 - (1 - R^{2}) (\frac{n - 1}{n - k - 1}) = 1 - (1 - 0.8169) (\frac{77 - 1}{77 - 2 - 1}) \approx 0.8120$

Therefore, the multiple regression equation accounts for 81.2% of the variability in nutritional rating.
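The two formulas used in this example translate directly into code. The sketch below reuses the values $SSR$, $SST$, $n$, and $k$ given above; the function names are made up for illustration.

```python
def r_squared(ssr, sst):
    """Multiple coefficient of determination R^2 = SSR / SST."""
    return ssr / sst


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictor variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


r2 = r_squared(ssr=12251.6, sst=14996.8)
r2_adj = adjusted_r_squared(r2, n=77, k=2)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```

Note that the penalty factor $(n - 1)/(n - k - 1)$ grows with $k$, which is why adding an unhelpful predictor can lower $R_{adj}^{2}$ even though $R^{2}$ does not fall.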

NOW YOU CAN DO

Exercises 17–20.

Thus far, in this section, we have used descriptive methods for multiple regression. Next, we learn how to perform inference in multiple regression.

3 The $F$ Test for the Overall Significance of the Multiple Regression

The multiple regression model is an extension of the regression model from Section 13.1, and it approximates the relationship between $y$ and the collection of $x$ variables.

Multiple Regression Model

The population multiple regression equation is

$y = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{k} x_{k} + ε$

where $β_{0}, β_{1}, β_{2}, \dots, β_{k}$ are the parameters of the population regression equation, $k$ is the number of $x$ variables, and $ε$ represents the random error term that follows a normal distribution with mean 0 and constant variance $σ^{2}$ .

The population parameters $β_{0}, β_{1}, β_{2}, \dots, β_{k}$ are unknown, so we must perform inference to learn about them. We begin by asking, is our multiple regression useful? To answer this question, we perform the $F$ test for the overall significance of the multiple regression.

Our multiple regression will not be useful if all of the population parameters $β_{1}, β_{2}, \dots, β_{k}$ equal zero, for this will result in no linear relationship at all between the $y$ variable and the set of $x$ variables. Thus, the hypotheses for the $F$ test are

$H_{0} : β_{1} = β_{2} = \dots = β_{k} = 0$ . No linear relationship exists between $y$ and any of the $x$ variables. The overall multiple regression is not significant.
$H_{a}$ : At least one of the $β$'s $\neq 0$ . A linear relationship exists between $y$ and at least one of the $x$ variables. The overall multiple regression is significant.

The $F$ test is not valid if there is strong evidence that the regression assumptions have been violated. The multiple regression assumptions are similar to those for simple linear regression. We illustrate the steps for the $F$ test with the following example.

$H_{a}$ states that at least one of the $β$'s $\neq 0$ , not that all of the $β$'s $\neq 0$ .

EXAMPLE 13 $F$ test for the overall significance of the multiple regression

For the breakfast cereal data, we are interested in determining whether a linear relationship exists between $y = (nutritional) rating$ , and $x_{1} = vitamins$ and $x_{2} = sodium$ .

Determine whether the regression assumptions have been violated.
Perform the $F$ test for the overall significance of the multiple regression of rating on vitamins and sodium, using level of significance $α = 0.01$ .

Solution

Figure 22, a scatterplot of the residuals versus fitted values, contains no strong evidence of unhealthy patterns. Although the two cereals, All-Bran with Extra Fiber and Product 19, are unusual, the vast majority of the data suggest that the independence, constant variance, and zero-mean assumptions are satisfied. Figure 23 indicates that the normality assumption is satisfied.
The Minitab multiple regression results are provided in Figure 24.

FIGURE 22 Scatterplots of residuals versus fitted values.

FIGURE 23 Normal probability plot of the residuals.

FIGURE 24 Minitab results for regression of rating on vitamins and sodium.

Step 1 State the hypotheses and the rejection rule. We have $k = 2$ $x$ variables, so the hypotheses are
- $H_{0} : β_{1} = β_{2} = 0$ . No linear relationship exists between rating and either vitamins or sodium. The overall multiple regression is not significant.
- $H_{a}$ : At least one of $β_{1}$ and $β_{2}$ is $\neq 0$ . A linear relationship exists between rating and at least one of vitamins and sodium. The overall multiple regression is significant.
Reject $H_{0}$ if the $p$-value $\leq α = 0.01$ .
Step 2 Find the $F$ statistic and the $p$-value. These are located in the ANOVA table portion of the printout, denoted Analysis of Variance. From Figure 24, we have $F = 7.66$ and $p$-value $= 0.001$ . The $p$-value represents $P (F > 7.66)$ .
Step 3 Conclusion and interpretation. The $p$-value $= 0.001$ is $\leq α = 0.01$ , so we reject $H_{0}$ . There is evidence, at level of significance $α = 0.01$ , for a linear relationship between rating and at least one of vitamins and sodium. The overall multiple regression is significant.
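The printout reports the $F$ statistic directly. As a hedged illustration of where such a number comes from, the sketch below applies the standard ANOVA definition $F = MSR / MSE$ , with $MSR = SSR / k$ and $MSE = SSE / (n - k - 1)$ . This formula is not stated in this section and is assumed here. The sums of squares are those given in Example 12 for the regression of rating on fiber and sugar, not the vitamins and sodium model of this example, whose sums of squares appear only in Figure 24. The function names are made up for illustration.

```python
def f_statistic(ssr, sst, n, k):
    """F = MSR / MSE for a regression with k predictors and n observations."""
    sse = sst - ssr            # sum of squares error
    msr = ssr / k              # mean square regression
    mse = sse / (n - k - 1)    # mean square error
    return msr / mse


def reject_h0(p_value, alpha):
    """Rejection rule used throughout this section: reject H0 if p-value <= alpha."""
    return p_value <= alpha


# Sums of squares from Example 12 (rating on fiber and sugar), n = 77, k = 2.
f_fiber_sugar = f_statistic(ssr=12251.6, sst=14996.8, n=77, k=2)
print(round(f_fiber_sugar, 2))

# Decision for this example, using the p-value read from the printout.
print(reject_h0(p_value=0.001, alpha=0.01))
```

Turning an $F$ statistic into a $p$-value requires the $F$ distribution with $k$ and $n - k - 1$ degrees of freedom, which is why the examples read the $p$-value from the software output rather than computing it by hand.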

The ANOVA table is a convenient way to organize a set of statistics, which can be used to perform multiple regression as well as ANOVA.

NOW YOU CAN DO

Exercises 21, 22, 27, and 28.

Once we find that the overall multiple regression is significant, we may ask: Which of the individual $x$ variables have a significant linear relationship with the response variable $y$ ?

4 The $t$ Test for the Significance of Individual Predictor Variables

To determine whether a particular $x$ variable has a significant linear relationship with the response variable $y$ , we perform the $t$ test that was used in Section 13.1 to test for the significance of that $x$ variable. Assuming the overall $F$ test is significant, we may perform one such $t$ test for each of the $k$ $x$ variables in the model. We illustrate the steps for performing a set of $t$ tests using the following example.

EXAMPLE 14 Performing a set of $t$ tests for the significance of a set of individual $x$ variables

Using the results from Example 13, do the following, using level of significance $α = 0.05$ :

Test 1: Test whether a significant linear relationship exists between rating and vitamins.
Test 2: Test whether a significant linear relationship exists between rating and sodium.

Solution

The regression assumptions were verified in Example 13.

Step 1 For each hypothesis test, state the hypotheses and the rejection rule. Vitamins is $x_{1}$ , and sodium is $x_{2}$ , so the hypotheses are
- Test 1:
  - $H_{0} : β_{1} = 0$ . No linear relationship exists between rating and vitamins.
  - $H_{a} : β_{1} \neq 0$ . A linear relationship exists between rating and vitamins.
  - Reject $H_{0}$ if the $p$-value $\leq α = 0.05$ .
- Test 2:
  - $H_{0} : β_{2} = 0$ . No linear relationship exists between rating and sodium.
  - $H_{a} : β_{2} \neq 0$ . A linear relationship exists between rating and sodium.
  - Reject $H_{0}$ if the $p$-value $\leq α = 0.05$ .
Step 2 For each hypothesis test, find the $t$ statistic and the $p$ -value. Figure 25 is an excerpt of the Minitab results from Figure 24, with the relevant $t$ statistics and $p$ -values highlighted. For Test 1, the $p$ -value represents $P (t < - 0.97) + P (t > 0.97)$ , because it is a two-tailed test. For Test 2, the $p$ -value represents $P (t < - 3.19) + P (t > 3.19)$ .
- Test 1: $t = - 0.97$ , with $p$ -value 0.336.
- Test 2: $t = - 3.19$ , with $p$ -value 0.002.
FIGURE 25 $t$ statistics and $p$ -values for the $t$ test for the significance of the $x$ variables vitamins and sodium.
Step 3 For each hypothesis test, state the conclusion and interpretation.
- Test 1: The $p$-value $= 0.336$ , which is not $\leq α = 0.05$ . Therefore, we do not reject $H_{0}$ . There is insufficient evidence of a linear relationship between rating and vitamins when sodium is held constant. Perhaps surprisingly, the $x$ variable vitamins is not significant, meaning that the amount of vitamins in the breakfast cereal is not helpful in predicting nutritional rating.
- Test 2: The $p$-value $= 0.002$ , which is $\leq α = 0.05$ . Therefore, we reject $H_{0}$ . There is evidence of a linear relationship between rating and sodium when vitamins is held constant. The $x$ variable sodium is significant, meaning that the amount of sodium in the breakfast cereal is helpful in predicting nutritional rating.

NOW YOU CAN DO

Exercises 23 and 30.

So far, all of our $x$ variables have been continuous. But what if we want to include a categorical variable as a predictor?

5 Dummy Variables in Multiple Regression

The data set Pulse and Temp contains the heart rate, body temperature, and sex of 130 men and women. We want to use $x_{1} = heart rate$ and $x_{2} = sex$ to predict $y = body temperature$ . However, sex is a categorical variable, so we must recode the values of $x_{2}$ as follows:

$x_{2}$ : Let "female" $= 1$ and let "male" $= 0$

The variable $x_{2}$ is called a dummy variable, because it recodes the values of the binary (categorical) variable sex into values of 0 and 1.

A dummy variable is a predictor variable used to recode a binary categorical variable in regression; it takes the values 0 or 1.

This recoding will provide us with two different regression equations, one for the females $(x_{2} = 1)$ and one for the males $(x_{2} = 0)$ , shown here:

Females: $\hat{y} = b_{0} + b_{1} x_{1} + b_{2} x_{2} = b_{0} + b_{1} x_{1} + b_{2} (1) = (b_{0} + b_{2}) + b_{1} x_{1}$
Males: $\hat{y} = b_{0} + b_{1} x_{1} + b_{2} x_{2} = b_{0} + b_{1} x_{1} + b_{2} (0) = b_{0} + b_{1} x_{1}$

Note that these two regression equations have the same slope $b_{1}$ , but different $y$ intercepts. The females have $y$ intercept $(b_{0} + b_{2})$ , whereas the males have $y$ intercept $b_{0}$ . See Figure 29 on page 750. The difference in $y$ intercepts is $(b_{0} + b_{2}) - b_{0} = b_{2}$ , which is the coefficient of the dummy variable $x_{2}$ . Let us illustrate with an example.

EXAMPLE 15 Dummy variables in multiple regression

Verify that the regression assumptions are met.
Perform a multiple regression of $y = body temperature$ on $x_{1} = heart rate$ and $x_{2} = sex$ , using level of significance $α = 0.05$ . Find the two regression equations, one for females and the other for males.
Construct a scatterplot of $y = body temperature$ versus $x_{1} = heart rate$ , using different-shaped points to show the different sexes. Place the two regression equations on the scatterplot.
Interpret the coefficient of the dummy variable $x_{2} = sex$ .

Solution

Figures 26 and 27 contain no evidence of unhealthy patterns. We therefore conclude that the regression assumptions are verified.

FIGURE 26 Scatterplot of residuals versus fitted values.

FIGURE 27 Normal probability plot of the residuals.

Page 750
Figure 28 contains the multiple regression results. The $p$ -value for the $F$ test is 0.001, which is $\leq α = 0.05$ , so we conclude that the overall regression is significant. The regression results tell us that $b_{0} = 96.251$ , $b_{1} = 0.02526$ , and $b_{2} = 0.270$ . Thus, our two regression equations are:
- Females: $\begin{array}{rl} \hat{y} & = (b_{0} + b_{2}) + b_{1} x_{1} \\ & = (96.251 + 0.270) + 0.02526 x_{1} \\ & = 96.521 + 0.02526 x_{1} \end{array}$
- Males: $\hat{y} = b_{0} + b_{1} x_{1} = 96.251 + 0.02526 x_{1}$
FIGURE 28 Results for multiple regression of $y = body temperature$ on $x_{1} = heart rate$ and $x_{2} = sex$ .

FIGURE 29 Scatterplot showing parallel regression lines when using dummy variables.
Figure 29 contains the scatterplot of $y = body temperature$ versus $x_{1} = heart rate$ , with the orange dots representing females and the blue dots representing males. The regression lines are shown, orange for females, blue for males. Note that the regression lines are parallel, because they each have the same slope $b_{1} = 0.02526$ . So the only difference between the lines is in their $y$ intercepts.
- For females, the $y$ intercept is $b_{0} + b_{2} = 96.251 + 0.27 = 96.521$ .
- For males, the $y$ intercept is simply $b_{0} = 96.251$ .
The vertical distance between the parallel regression lines equals $b_{2} = 0.27$ , as shown in Figure 29. Thus, we interpret the coefficient $b_{2}$ of the dummy variable $x_{2}$ as the estimated increase in $y = body temperature$ for those observations with $x_{2} = 1$ (females), as compared to those with $x_{2} = 0$ (males), when heart rate is held constant. That is, for the same heart rate, females have a body temperature that is higher than that of males, by an estimated 0.27 degrees.
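The effect of the dummy variable can be seen by evaluating the fitted equation for a female and a male with the same heart rate. The sketch below uses the coefficients reported in this example; the function name `predict_temperature` and the heart rate of 70 are made up for illustration.

```python
# Coefficients from the regression of body temperature on heart rate and sex.
B0, B_HEART_RATE, B_SEX = 96.251, 0.02526, 0.270


def predict_temperature(heart_rate, female):
    """Estimated body temperature; female is the dummy variable (1 or 0)."""
    return B0 + B_HEART_RATE * heart_rate + B_SEX * female


# Same (illustrative) heart rate for both sexes.
female_temp = predict_temperature(heart_rate=70, female=1)
male_temp = predict_temperature(heart_rate=70, female=0)

# The gap between the two parallel lines is the dummy coefficient b2.
print(round(female_temp - male_temp, 3))
```

Changing the heart rate shifts both predictions by the same amount, so the difference stays equal to $b_{2}$ at every heart rate. This is what makes the two regression lines parallel.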

NOW YOU CAN DO

Exercise 29.

6 Strategy for Building a Multiple Regression Model

In order to bring together all you have learned of multiple regression, we now present a general strategy for building a multiple regression model.

Strategy for Building a Multiple Regression Model

Step 1 The $F$ Test.

Construct the multiple regression equation using all relevant predictor variables. Apply the $F$ test for the significance of the overall regression, in order to make sure that a linear relationship exists between the response $y$ and at least one of the predictor variables.

Step 2 The $t$ Tests.

Perform the $t$ tests for the individual predictors. If at least one of the predictors is not significant (that is, its $p$ -value is greater than level of significance $α$ ), then eliminate the $x$ variable with the largest $p$ -value from the model. Ignore the $p$ -value of $β_{0}$ . Repeat Step 2 until all remaining predictors are significant.

We eliminate only one variable at a time. It may happen that eliminating one nonsignificant variable will nudge a second, formerly nonsignificant, variable into significance.
Step 3 Verify the Assumptions.

For your final model, verify the regression assumptions.
Step 4 Report and Interpret Your Final Model.
1. Provide the multiple regression equation for your final model.
2. Interpret the multiple regression coefficients so that a nonstatistician could understand.
3. Report and interpret the standard error of the estimate $s$ and the adjusted coefficient of determination $R_{adj}^{2}$ .
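The elimination loop in Step 2 can be sketched in code. In the sketch below, `fit_p_values` stands for whatever routine refits the regression and returns one $p$-value per predictor; it is a hypothetical placeholder, not a function from any particular package. The stand-in `made_up_p_values` returns invented numbers purely so that the loop can be run.

```python
def backward_eliminate(predictors, fit_p_values, alpha=0.05):
    """Repeat Step 2: drop the predictor with the largest p-value, one at a time."""
    remaining = list(predictors)
    while remaining:
        p_values = fit_p_values(remaining)          # refit with current predictors
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break                                   # every predictor is significant
        remaining.remove(worst)                     # eliminate only one variable
    return remaining


def made_up_p_values(current):
    """Hypothetical p-values (NOT from any figure in this section)."""
    if "x3" in current:
        return {"x1": 0.001, "x2": 0.08, "x3": 0.40}
    # After x3 is removed, x2 becomes significant in this invented scenario.
    return {"x1": 0.001, "x2": 0.03}


final_predictors = backward_eliminate(["x1", "x2", "x3"], made_up_p_values)
print(final_predictors)
```

In the invented scenario, `x2` is not significant while `x3` is in the model but becomes significant once `x3` is removed, which is why the loop refits after every single elimination instead of dropping all nonsignificant variables at once.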

We illustrate this strategy, known as backward stepwise regression, in the following example.

EXAMPLE 16 Strategy for building a multiple regression model

baseball2013

The author of this book first became interested in the field of statistics through the enjoyment of sports statistics, especially baseball, which is packed with interesting statistics. Today, professional sports teams are seeking competitive advantage through the analysis of data and statistics, such as Sabermetrics (Society for American Baseball Research, www.sabr.org), as shown in the motion picture Moneyball.

Suppose a baseball researcher is interested in predicting $y = runs scored$ , using the data set Baseball 2013 and the following predictor variables:

$x_{1}$ = Hits, the number of hits (of all kinds) the player makes
$x_{2}$ = Doubles, the number of doubles the player makes
$x_{3}$ = Triples, the number of triples the player makes
$x_{4}$ = Home Runs, the number of home runs the player makes
$x_{5}$ = RBIs, the number of runs batted in (runs scored by other players, but caused by this player)
$x_{6}$ = Walks, the number of walks issued to the player
$x_{7}$ = Batting Average, the number of hits divided by the number of at-bats
$x_{8}$ = Red Sox, a dummy variable equal to 1 if the player plays for the Boston Red Sox, and 0 otherwise

Use the Strategy for Building a Multiple Regression Model to build the best multiple regression model for predicting the number of runs scored using these predictor variables, at level of significance $α = 0.05$ .

Solution

The data set Baseball 2013 contains the batting statistics of the $n = 448$ players in Major League Baseball who had at least 100 at-bats during the 2013 season (Source: www.seanlahman.com/baseball-archive/statistics).

Step 1 The $F$ Test. Figure 30 shows the Minitab results of a regression of $y = runs scored$ on the set of predictor variables $x_{1}, x_{2}, x_{3}, \dots, x_{8}$ . The $p$ -value for the $F$ test is significant, so we know that a linear relationship exists between $y = runs scored$ and at least one of the $x$ variables.
Step 2 The $t$ test (the first time). In Figure 30, the $p$ -value for Batting Average is greater than level of significance $α = 0.05$ . We therefore eliminate the Batting Average from the model. Perhaps surprisingly, a player's batting average is evidently not helpful in predicting the number of runs that player will score when all other predictors are held constant.

FIGURE 30 Step 1: $F$ test is significant.

FIGURE 31 All $x$ variables are significant; we have our final model.
Step 2 The $t$ test (the second time). We repeat Step 2 as long as there are $x$ variables with $p$-values greater than level of significance $α = 0.05$ . Figure 31 shows the results of performing the multiple regression of $y = Runs Scored$ on all the $x$ variables except Batting Average. No remaining variables have $p$-values above 0.05; therefore, no further variables are excluded from the model. In other words, we have our final model.
Step 3 Verify the assumptions. For our final model, we now verify the regression assumptions. Figures 32 and 33 show no patterns for the bulk of the data that would indicate a violation of the regression assumptions. We therefore conclude that the regression assumptions are verified.

FIGURE 32 Scatterplot of residuals versus fitted values.

FIGURE 33 Normal probability plots of the residuals.
Step 4 Report and interpret your final model.
1. The multiple regression equation for the final model is shown here.
  
  $\begin{array}{rl} \hat{y} = & - 1.283 + 0.3182 (Hits) + 0.2105 (Doubles) + 1.429 (Triples) + 0.6833 (Home Runs) \\ & - 0.1059 (RBIs) + 0.2319 (Walks) + 5.08 (Red Sox) \end{array}$
  
2. We interpret the coefficient for Hits, and leave to the exercises the interpretation of the other multiple regression coefficients. “For each additional hit that a player makes, the estimated increase in the number of runs that player will score is 0.3182, when all the other $x$ variables are held constant.”
3. The standard error of the estimate for the final model is $s = 6.4952 \approx 6.5$ . That is, using the multiple regression equation in part 1, the size of the typical prediction error will be about 6.5 runs. The value of the adjusted coefficient of determination is $R_{adj}^{2} = 93.07\%$ . In other words, 93.07% of the variability in the number of runs scored is accounted for by this multiple regression equation.

NOW YOU CAN DO

Exercises 24–26 and 31–33.