OBJECTIVES By the end of this section, I will be able to …
1 Finding the Multiple Regression Equation, Interpreting the Coefficients, and Making Predictions
Thus far, we have examined the relationship between the response variable y and a single predictor variable x. In our data-filled world, however, we often encounter situations where we can use more than one x variable to predict the y variable. This is called multiple regression.
Regression analysis using a single x variable and a single y variable is called simple linear regression.
Multiple regression describes the linear relationship between one response variable y and more than one predictor variable, x1, x2, …, xk. The multiple regression equation is an extension of the simple linear regression equation ŷ = b0 + b1x:

ŷ = b0 + b1x1 + b2x2 + ⋯ + bkxk

where k represents the number of x variables in the equation, and b1, b2, …, bk represent the multiple regression coefficients.
The interpretation of the regression coefficients is similar to the interpretation of the slope in simple linear regression, except that we also state that the other x variables are held constant. The interpretation of the y intercept b0 is similar to the simple linear regression case. The next example illustrates the multiple regression equation and shows how to interpret the multiple regression coefficients.
EXAMPLE 11 Multiple regression equation, coefficients, and prediction
breakfastcereals
The data set Breakfast Cereals includes several predictor variables and one response variable, the nutritional rating of the cereal.
When we perform a multiple regression of one variable on (or against, or versus) a set of other variables, the first variable is always the y variable, and the set of variables following the word on are the x variables.
Solution
Using the instructions in the Step-by-Step Technology Guide at the end of this section, we open the Breakfast Cereals data set and perform a multiple regression of nutritional rating on fiber and sugars. Note that this does not represent extrapolation, as there are cereals in the data set that have either zero grams of fiber (such as Cap’n Crunch) or zero grams of sugar (such as Cream of Wheat).
A partial Minitab printout is shown in Figure 20. A partial SPSS printout is in Figure 21. The multiple regression equation is

ŷ = 52.22 + 2.869(fiber) − 2.246(sugars)
The estimated nutritional rating equals 52.22 points plus 2.869 times the number of grams of fiber minus 2.246 times the number of grams of sugar.
When making predictions in multiple regression, beware of the pitfalls of extrapolation, just like those for simple linear regression. Further, in multiple regression, the values for all predictor variables must lie within their respective ranges. Otherwise, the prediction represents extrapolation, and it may be misleading.
To find the predicted rating for a breakfast cereal with fiber = 5 grams and sugars = 10 grams, we plug these values into the multiple regression equation from part (a):

ŷ = 52.22 + 2.869(5) − 2.246(10) = 44.105

The predicted nutritional rating for a breakfast cereal with 5 grams of fiber and 10 grams of sugar is 44.105.
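The prediction above is easy to check with a few lines of code. This sketch hard-codes the fitted coefficients from the printout; the function name is our own, not part of any software package.

```python
# Prediction from the multiple regression equation of Example 11:
#   rating-hat = 52.22 + 2.869(fiber) - 2.246(sugars)
def predict_rating(fiber, sugars):
    """Predicted nutritional rating for the given grams of fiber and sugar."""
    return 52.22 + 2.869 * fiber - 2.246 * sugars

print(round(predict_rating(5, 10), 3))  # 44.105
```

Remember that both fiber and sugars must lie within their observed ranges for this prediction to avoid extrapolation.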
NOW YOU CAN DO
Exercises 9–16.
2 The Adjusted Coefficient of Determination
Recall from Section 4.3 that we measure the goodness of a regression equation using the coefficient of determination r². In multiple regression, we use the same formula for the coefficient of determination (though the letter r is promoted to a capital R).
Multiple Coefficient of Determination
The multiple coefficient of determination R² is given by

R² = SSR / SST

where SSR is the sum of squares regression, and SST is the total sum of squares. The multiple coefficient of determination R² represents the proportion of the variability in the response y that is explained by the multiple regression equation.
Unfortunately, when a new x variable is added to the multiple regression equation, the value of R² always increases, even when the variable is not useful for predicting y. So, we need a way to adjust the value of R² as a penalty for having too many unhelpful x variables in the equation. This adjusted value is the adjusted coefficient of determination, R²adj.
Adjusted Coefficient of Determination
The adjusted coefficient of determination is given by the formula

R²adj = 1 − (1 − R²) · (n − 1) / (n − k − 1)

where n is the number of observations, k is the number of x variables, and R² is the multiple coefficient of determination. R²adj is always smaller than R².
R²adj is preferable to R² as a measure of the goodness of a regression equation, because R²adj will decrease if an unhelpful x variable is added to the regression equation. The interpretation of R²adj is similar to that of R².
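The adjustment formula is simple to compute directly. The values of n and R² below are illustrative assumptions, not the actual printout values from Example 12.

```python
def adjusted_r_squared(r2, n, k):
    """R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values: n = 77 observations, k = 2 predictors, R^2 = 0.816
print(round(adjusted_r_squared(0.816, 77, 2), 3))  # 0.811
```

Note that the penalty factor (n − 1)/(n − k − 1) grows as k grows, so adding predictors can drive R²adj down even while R² creeps up.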
EXAMPLE 12 Calculating and interpreting the adjusted coefficient of determination
For the multiple regression in Example 11, we have the values of n and R² from the regression printout.
Solution
Therefore, the multiple regression equation accounts for 81.2% of the variability in nutritional rating.
NOW YOU CAN DO
Exercises 17–20.
Thus far, in this section, we have used descriptive methods for multiple regression. Next, we learn how to perform inference in multiple regression.
3 The Test for the Overall Significance of the Multiple Regression
The multiple regression model is an extension of the regression model from Section 13.1, and it approximates the relationship between y and the collection of x variables.
Multiple Regression Model
The population multiple regression equation is

y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε

where β0, β1, …, βk are the parameters of the population regression equation, k is the number of x variables, and ε represents the random error term, which follows a normal distribution with mean 0 and constant variance σ².
The population parameters β0, β1, …, βk are unknown, so we must perform inference to learn about them. We begin by asking: Is our multiple regression useful? To answer this question, we perform the F test for the overall significance of the multiple regression.
Our multiple regression will not be useful if all of the population parameters β1, β2, …, βk equal zero, for this will result in no relationship at all between the y variable and the set of x variables. Thus, the hypotheses for the F test are

H0: β1 = β2 = ⋯ = βk = 0
Ha: at least one βi ≠ 0
The F test is not valid if there is strong evidence that the regression assumptions have been violated. The multiple regression assumptions are similar to those for simple linear regression. We illustrate the steps for the F test with the following example.
Ha states that at least one of the βi ≠ 0, not that all of the βi ≠ 0.
EXAMPLE 13 F test for the overall significance of the multiple regression
For the breakfast cereal data, we are interested in determining whether a linear relationship exists between the nutritional rating, and vitamins and sodium.
Solution
Step 1 State the hypotheses and the rejection rule. We have k = 2 x variables, so the hypotheses are

H0: β1 = β2 = 0
Ha: at least one βi ≠ 0

Reject H0 if the p-value is less than the level of significance α.
The ANOVA table is a convenient way to organize the set of statistics used to perform multiple regression as well as ANOVA.
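The F statistic reported in the ANOVA table is built from the sums of squares. This is a minimal sketch with made-up sums of squares, not the values from Example 13.

```python
def f_statistic(ssr, sse, n, k):
    """F = MSR / MSE, where MSR = SSR / k and MSE = SSE / (n - k - 1)."""
    msr = ssr / k            # mean square regression
    mse = sse / (n - k - 1)  # mean square error
    return msr / mse

# Hypothetical sums of squares: SSR = 900, SSE = 100, with n = 53 and k = 2
print(f_statistic(900, 100, 53, 2))  # 225.0
```

The p-value is then the area to the right of this F value under the F distribution with k and n − k − 1 degrees of freedom, which software reports directly in the ANOVA table.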
NOW YOU CAN DO
Exercises 21, 22, 27, and 28.
Once we find that the overall multiple regression is significant, we may ask: Which of the individual x variables have a significant linear relationship with the response variable y?
4 The t Test for the Significance of Individual Predictor Variables
To determine whether a particular x variable has a significant linear relationship with the response variable y, we perform the t test that was used in Section 13.1 to test for the significance of that variable. One may perform as many such t tests as there are x variables in the model, assuming the overall F test is significant. We illustrate the steps for performing a set of t tests using the following example.
EXAMPLE 14 Performing a set of t tests for the significance of a set of individual x variables
Using the results from Example 13, do the following, using level of significance α:
Solution
The regression assumptions were verified in Example 13.
Step 1 For each hypothesis test, state the hypotheses and the rejection rule. Vitamins is x1, and sodium is x2, so the hypotheses are

For x1 (vitamins): H0: β1 = 0 versus Ha: β1 ≠ 0
For x2 (sodium): H0: β2 = 0 versus Ha: β2 ≠ 0
NOW YOU CAN DO
Exercises 23 and 30.
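Each t statistic is simply the estimated coefficient divided by its standard error. The coefficient, standard error, and critical value below are illustrative assumptions, not printout values from Example 14.

```python
def t_statistic(b, se):
    """t = b_i / SE(b_i) for testing H0: beta_i = 0 versus Ha: beta_i != 0."""
    return b / se

def reject_h0(t, t_crit):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t) > t_crit

# Hypothetical coefficient b = 2.5 with standard error 0.5; t_crit would
# come from the t table with n - k - 1 degrees of freedom.
t = t_statistic(2.5, 0.5)
print(t, reject_h0(t, 2.0))  # 5.0 True
```

In practice, software reports the t statistic and its p-value for every coefficient, so only the comparison against α remains.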
So far, all of our variables have been continuous. But what if we want to include a categorical variable as a predictor?
5 Dummy Variables in Multiple Regression
The data set Pulse and Temp contains the heart rate, body temperature, and sex of 130 men and women. We want to use body temperature x1 and sex to predict heart rate y. However, sex is a categorical variable, so we must recode the values of sex as follows: x2 = 0 if the subject is female, and x2 = 1 if the subject is male.
The variable x2 is called a dummy variable, because it recodes the values of the binomial (categorical) variable sex into values of 0 and 1.
A dummy variable is a predictor variable used to recode a binomial categorical variable in regression, taking the values 0 or 1.
This recoding will provide us with two different regression equations, one for the females (x2 = 0) and one for the males (x2 = 1), shown here:

Females: ŷ = b0 + b1x1
Males: ŷ = (b0 + b2) + b1x1

Note that these two regression equations have the same slope b1, but different intercepts. The females have intercept b0, whereas the males have intercept b0 + b2. See Figure 29. The difference in intercepts is b2, which is the coefficient of the dummy variable x2. Let us illustrate with an example.
EXAMPLE 15 Dummy variables in multiple regression
Solution
Figures 26 and 27 contain no evidence of unhealthy patterns. We therefore conclude that the regression assumptions are verified.
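A minimal sketch of how a dummy variable produces two parallel regression lines. The coefficients here are made up for illustration, not fitted from the Pulse and Temp data.

```python
# Hypothetical fitted coefficients: b0 (intercept), b1 (temperature slope),
# b2 (dummy-variable coefficient for sex, coded 0 = female, 1 = male)
B0, B1, B2 = -160.0, 2.3, -4.0

def predict_heart_rate(temp, male):
    """y-hat = b0 + b1*temp + b2*dummy, where dummy is 0 (female) or 1 (male)."""
    return B0 + B1 * temp + B2 * male

# Same slope for both groups; the intercepts differ by exactly b2:
female = predict_heart_rate(98.6, 0)
male = predict_heart_rate(98.6, 1)
print(round(female - male, 1))  # 4.0  (the gap equals -b2 at any temperature)
```

Because the slope b1 is shared, the vertical gap between the two lines is b2 everywhere, which is why the fitted lines in a plot like Figure 29 are parallel.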
NOW YOU CAN DO
Exercise 29.
6 Strategy for Building a Multiple Regression Model
In order to bring together all you have learned about multiple regression, we now present a general strategy for building a multiple regression model.
Strategy for Building a Multiple Regression Model
Step 1 The F Test.
Construct the multiple regression equation using all relevant predictor variables. Apply the F test for the significance of the overall regression, in order to make sure that a linear relationship exists between the response y and at least one of the predictor variables.
Step 2 The t Tests.
Perform the t tests for the individual predictors. If at least one of the predictors is not significant (that is, its p-value is greater than the level of significance α), then eliminate the x variable with the largest p-value from the model. Ignore the p-value of the constant term. Repeat Step 2 until all remaining predictors are significant.
We eliminate only one variable at a time. It may happen that eliminating one nonsignificant variable will nudge a second, formerly nonsignificant, variable into significance.
Step 3 Verify the Assumptions.
For your final model, verify the regression assumptions.
Step 4 Report and Interpret Your Final Model.
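Steps 1 and 2 amount to a loop that drops the weakest predictor until everything left is significant. This sketch fakes the p-values with a lookup table; in a real analysis they would come from refitting the model at each pass, and the predictor names here are hypothetical.

```python
# Hypothetical p-values returned by refitting the model on each predictor set.
# Real p-values change every time a variable is removed; this table fakes that.
PVALUES = {
    frozenset({"hits", "walks", "batting_avg"}):
        {"hits": 0.001, "walks": 0.03, "batting_avg": 0.62},
    frozenset({"hits", "walks"}):
        {"hits": 0.001, "walks": 0.01},
}

def backward_elimination(predictors, alpha=0.05):
    """Drop the predictor with the largest p-value until all are significant."""
    current = set(predictors)
    while current:
        pvals = PVALUES[frozenset(current)]  # stand-in for refitting the model
        worst = max(pvals, key=pvals.get)
        if pvals[worst] <= alpha:            # all predictors significant: done
            return sorted(current)
        current.remove(worst)                # eliminate one variable, then refit
    return []

print(backward_elimination(["hits", "walks", "batting_avg"]))  # ['hits', 'walks']
```

Note that only one variable is removed per pass, matching the caution above: eliminating one nonsignificant variable can change the p-values of the rest.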
We illustrate this strategy, known as backward stepwise regression, in the following example.
EXAMPLE 16 Strategy for building a multiple regression model
baseball2013
The author of this book first became interested in the field of statistics through the enjoyment of sports statistics, especially baseball, which is packed with interesting statistics. Today, professional sports teams seek a competitive advantage through the analysis of data and statistics, such as sabermetrics (Society for American Baseball Research, www.sabr.org), as shown in the motion picture Moneyball.
Suppose a baseball researcher is interested in predicting the number of runs scored, using the data set Baseball 2013 and the following predictor variables:
Use the Strategy for Building a Multiple Regression Model to build the best multiple regression model for predicting the number of runs scored using these predictor variables, at level of significance α.
Solution
The data set Baseball 2013 contains the batting statistics of the players in Major League Baseball who had at least 100 at-bats during the 2013 season (Source: www.seanlahman.com/baseball-archive/statistics).
Step 2 The t test (the first time). In Figure 30, the p-value for Batting Average is greater than the level of significance α. We therefore eliminate Batting Average from the model. Perhaps surprisingly, a player's batting average is evidently not helpful in predicting the number of runs that player will score when all other predictors are held constant.
The multiple regression equation for the final model is shown here.
NOW YOU CAN DO
Exercises 24–26 and 31–33.