11 Multiple Regression

SECTION 11.3 Exercises

For Exercises 11.59 to 11.61, see page 568; for 11.62 and 11.63, see page 571; for 11.64 to 11.66, see pages 574–575; for 11.67 and 11.68, see page 577; for 11.69 to 11.72, see page 580.

Question 11.73

11.73 Quadratic models.

Sketch each of the following quadratic equations for values of $x$ between 0 and 5. Then describe the relationship between $μ_{y}$ and $x$ in your own words.

(a) $μ_{y} = 6 + 3 x + x^{2}$.
(b) $μ_{y} = 6 - 3 x + x^{2}$.
(c) $μ_{y} = 6 + 3 x - x^{2}$.
(d) $μ_{y} = 6 - 3 x - x^{2}$.

11.73

(a) The relationship is curved; as $x$ increases, $μ_{y}$ also increases, and at larger values of $x$, $μ_{y}$ increases more rapidly. (b) The relationship is curved; as $x$ increases, $μ_{y}$ decreases at first and then starts to increase slowly; at larger values of $x$, $μ_{y}$ increases more rapidly. (c) The relationship is curved; as $x$ increases, $μ_{y}$ increases at first and then starts to decrease slowly; at larger values of $x$, $μ_{y}$ decreases more rapidly. (d) The relationship is curved; as $x$ increases, $μ_{y}$ decreases, and at larger values of $x$, $μ_{y}$ decreases more rapidly.
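As a rough check on the sketches, the four equations can be tabulated at integer values of $x$ from 0 to 5. The following is a minimal Python sketch; the names `models`, `xs`, and `table` are illustrative and do not come from the text.

```python
# Evaluate each quadratic mean function from parts (a)-(d) on x = 0, 1, ..., 5.
models = {
    "a": lambda x: 6 + 3 * x + x**2,
    "b": lambda x: 6 - 3 * x + x**2,
    "c": lambda x: 6 + 3 * x - x**2,
    "d": lambda x: 6 - 3 * x - x**2,
}
xs = [0, 1, 2, 3, 4, 5]

# One row of mean responses per equation.
table = {label: [f(x) for x in xs] for label, f in models.items()}

for label, values in table.items():
    print(label, values)
```

Reading along each row shows the pattern described in the answer: the values in (b) dip before rising, and those in (c) rise before falling.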

Question 11.74

11.74 Models with indicator variables.

Suppose that $x$ is an indicator variable with the value 0 for Group A and 1 for Group B. The following equations describe relationships between the value of $μ_{y}$ and membership in Group A or B. For each equation, give the value of the mean response $μ_{y}$ for Group A and for Group B.

(a) $μ_{y} = 10 + 5 x$.
(b) $μ_{y} = 5 + 10 x$.
(c) $μ_{y} = 5 + 100 x$.

Question 11.75

11.75 Differences in means.

Verify that the coefficient of $x$ in each part of the previous exercise is equal to the mean for Group B minus the mean for Group A. Do you think that this will be true in general? Explain your answer.

11.75

(a) $15 - 10 = 5$, which is the slope for $x$. (b) $15 - 5 = 10$, which is the slope for $x$. (c) $105 - 5 = 100$, which is the slope for $x$. Yes, it is true in general as long as $x$ is an indicator variable with values 0 and 1.
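The general claim follows from substituting $x = 0$ and $x = 1$ into $μ_{y} = β_{0} + β_{1} x$. A small Python sketch of that substitution is given here; the helper name `group_means` is hypothetical and chosen only for illustration.

```python
def group_means(b0, b1):
    """Return (mean for Group A, mean for Group B) for mu_y = b0 + b1 * x."""
    mean_a = b0 + b1 * 0  # Group A is coded x = 0
    mean_b = b0 + b1 * 1  # Group B is coded x = 1
    return mean_a, mean_b

# (intercept, coefficient of x) for the three equations of Exercise 11.74.
for b0, b1 in [(10, 5), (5, 10), (5, 100)]:
    mean_a, mean_b = group_means(b0, b1)
    print(f"A: {mean_a}, B: {mean_b}, B - A: {mean_b - mean_a}, coefficient: {b1}")
```

Because the Group A mean is always `b0` and the Group B mean is always `b0 + b1`, the difference equals `b1` for any choice of coefficients.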

Question 11.76

11.76 Models with interactions.

Suppose that $x_{1}$ is an indicator variable with the value 0 for Group A and 1 for Group B, and $x_{2}$ is a quantitative variable. Each of the following models describes a relationship between $μ_{y}$ and the explanatory variables $x_{1}$ and $x_{2}$. For each model, substitute the value 0 for $x_{1}$, and write the resulting equation for $μ_{y}$ in terms of $x_{2}$ for Group A. Then substitute $x_{1} = 1$ to obtain the equation for Group B, and sketch the two equations on the same graph. Describe in words the difference in the relationship for the two groups.

(a) $μ_{y} = 40 + 30 x_{1} + 2 x_{2} + 4 x_{1} x_{2}$.
(b) $μ_{y} = 40 + 30 x_{1} + 4 x_{2} + 2 x_{1} x_{2}$.
(c) $μ_{y} = 30 + 40 x_{1} - 2 x_{2} + 4 x_{1} x_{2}$.

Question 11.77

11.77 Differences in slopes and intercepts.

Refer to the previous exercise. Verify that the coefficient of $x_{1} x_{2}$ is equal to the slope for Group B minus the slope for Group A in each of these cases. Also, verify that the coefficient of $x_{1}$ is equal to the intercept for Group B minus the intercept for Group A in each of these cases. Do you think these two results will be true in general? Explain your answer.

11.77

(a) $6 - 2 = 4$, which is the coefficient of $x_{1} x_{2}$. $70 - 40 = 30$, which is the coefficient of $x_{1}$. (b) $6 - 4 = 2$, which is the coefficient of $x_{1} x_{2}$. $70 - 40 = 30$, which is the coefficient of $x_{1}$. (c) $2 - (- 2) = 4$, which is the coefficient of $x_{1} x_{2}$. $70 - 30 = 40$, which is the coefficient of $x_{1}$. These results will be true in general as long as $x_{1}$ is an indicator variable with values 0 and 1.
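The same substitution can be sketched in Python for a model of the form $μ_{y} = β_{0} + β_{1} x_{1} + β_{2} x_{2} + β_{3} x_{1} x_{2}$. The function name `group_lines` and the dictionary `models` are hypothetical names used only for this illustration.

```python
def group_lines(b0, b1, b2, b3):
    """Return ((intercept, slope) for Group A, (intercept, slope) for Group B)
    for mu_y = b0 + b1*x1 + b2*x2 + b3*x1*x2, where x1 is a 0/1 indicator."""
    line_a = (b0, b2)            # x1 = 0: mu_y = b0 + b2*x2
    line_b = (b0 + b1, b2 + b3)  # x1 = 1: mu_y = (b0 + b1) + (b2 + b3)*x2
    return line_a, line_b

# Coefficients (b0, b1, b2, b3) for parts (a)-(c) of Exercise 11.76.
models = {"a": (40, 30, 2, 4), "b": (40, 30, 4, 2), "c": (30, 40, -2, 4)}

for label, coefs in models.items():
    (int_a, slope_a), (int_b, slope_b) = group_lines(*coefs)
    print(label, "intercept difference:", int_b - int_a, "slope difference:", slope_b - slope_a)
```

For every model the intercept difference is `b1` and the slope difference is `b3`, which is the general result stated in the answer.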

Question 11.78

11.78 Write the model.

For each of the following situations write a model for $μ_{y}$ of the form

$μ_{y} = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p}$

where $p$ is the number of explanatory variables. Be sure to give the value of $p$ and, if necessary, explain how each of the $x$’s is coded.

(a) A model where the explanatory variable is a categorical variable with three possible values.
(b) A model where there are four explanatory variables. One of these is categorical with two possible values; another is categorical with four possible values. Include a term that would model an interaction of the first categorical variable and the third (quantitative) explanatory variable.
(c) A cubic regression, where terms up to and including the third power of an explanatory variable are included in the model.

Question 11.79

CASE 11.2

11.79 Predicting movie revenue.

A plot of theater count versus box office revenue suggests that the relationship may be slightly curved.

movies

(a) Examine this question by running a regression to predict the box office revenue using the theater count and the square of the theater count. Report the relevant test statistic with its degrees of freedom and $P$-value, and summarize your conclusion.
(b) Now view this analysis in the framework of testing a hypothesis about a collection of regression coefficients, which you studied in Section 11.2 (page 559). The first model includes theater count and the square of theater count, while the second includes only theater count. Run both regressions and find the value of $R^{2}$ for each. Find the $F$ statistic for comparing the models based on the difference in the values of $R^{2}$. Carry out the test and report your conclusion.
(c) Verify that the square of the $t$ statistic that you found in part (a) for testing the coefficient of the quadratic term is equal to the $F$ statistic that you found in part (b).

11.79

(a) $H_{0} : β_{2} = 0$. $H_{a} : β_{2} \neq 0$. $t = 3.03$, $df = 40$, $0.002 < P$-value $< 0.005$. The quadratic term for theaters, ${Theaters}^{2}$, is significant and should be included in the model already containing Theaters. (b) For the first model, $R^{2} = 0.5125$; for the second model, $R^{2} = 0.4009$. $F = 9.16$, $df$ are 1 and 40, $0.001 < P$-value $< 0.01$. ${Theaters}^{2}$ should be included in a model that already contains Theaters. (c) ${3.03}^{2} = 9.18 \approx 9.16$; the small difference is due to rounding.
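The arithmetic in parts (b) and (c) can be reproduced with a short Python sketch. It assumes the usual form of the statistic for comparing nested models, $F = \frac{(R_{1}^{2} - R_{2}^{2}) / q}{(1 - R_{1}^{2}) / (n - p - 1)}$, with $q = 1$ coefficient tested and 40 error degrees of freedom as reported in the answer. The function name `f_from_r2` is hypothetical.

```python
def f_from_r2(r2_full, r2_reduced, q, df_error):
    """F statistic for comparing nested regression models from their R^2 values.

    q        -- number of coefficients set to zero in the reduced model
    df_error -- error degrees of freedom (n - p - 1) of the full model
    """
    return ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_error)

# R^2 values and degrees of freedom reported in the answer to Exercise 11.79.
f_stat = f_from_r2(0.5125, 0.4009, q=1, df_error=40)
t_squared = 3.03 ** 2

print(round(f_stat, 2))     # 9.16
print(round(t_squared, 2))  # 9.18
```

The two values differ only because $t$ and the $R^{2}$ values were rounded before the calculation.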

Question 11.80

CASE 11.2

11.80 Assessing collinearity in the movie revenue model.

Many software packages will calculate VIF values for each explanatory variable. In this exercise, calculate the VIF values using several multiple regressions, and then use them to see if there is collinearity among the movie explanatory variables.

movies

(a) Use statistical software to estimate the multiple regression model for predicting Budget based on Opening and Theaters. Calculate the VIF value for Budget using $R^{2}$ from this model and the formula
$VIF = \frac{1}{1 - R^{2}}$
(b) Use statistical software to estimate the multiple regression model for predicting Opening based on Budget and Theaters. Calculate the VIF value for Opening using $R^{2}$ from this model and the formula from part (a).
(c) Use statistical software to estimate the multiple regression model for predicting Theaters based on Budget and Opening. Calculate the VIF value for Theaters using $R^{2}$ from this model and the formula from part (a).
(d) Do any of the calculated VIF values indicate severe collinearity among the explanatory variables? Explain your response.
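Once each auxiliary regression has been run, the VIF formula is a one-line calculation. The Python sketch below shows only that step; the function name `vif` is hypothetical, and the $R^{2}$ values in the loop are made-up illustrations, not results from the movies data.

```python
def vif(r_squared):
    """Variance inflation factor from the R^2 of an auxiliary regression."""
    if not 0 <= r_squared < 1:
        raise ValueError("R^2 must satisfy 0 <= R^2 < 1")
    return 1.0 / (1.0 - r_squared)

# Illustrative R^2 values only (not computed from the movies data set).
for r2 in (0.0, 0.5, 0.9):
    print(r2, round(vif(r2), 2))
```

The VIF equals 1 when an explanatory variable is unrelated to the others and grows without bound as $R^{2}$ approaches 1.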

Question 11.81

CASE 11.2

11.81 Predicting movie revenue, continued.

Refer to Exercise 11.79. Although a quadratic relationship between total U.S. revenue and theater count provides a better fit than the linear model, it does not make sense that box office revenue would again increase for very low budgeted movies (unless you are the Syfy Channel). An alternative approach to describe the relationship between theater count and box office revenue is to consider a piecewise linear equation.

movies

(a) It appears the relationship between theater count and U.S. revenue changes around a count of 2800 theaters. Create a new variable equal to $\max(0, Theaters - 2800)$. This is simply the difference between the theater count and 2800, with all negative differences set to 0.
(b) Fit the model with theater count and the variable you created in part (a). Report the relevant test statistic with its degrees of freedom and $P$-value, and summarize your conclusion.
(c) Obtain the fitted values from this model, and plot them versus theater count. Use this diagram to explain why this is called a piecewise linear model.
(d) Compare the results of this model with the quadratic fit of Exercise 11.79. Which model do you prefer? Explain your answer.

11.81

(b) $t = 3.21$, $df = 40$, $P$-value $= 0.0026$. The new variable is significant and should be included in the model already containing Theaters. (It should be noted that Theaters is no longer significant in this model and could be removed.) (c) As shown in the plot, it is called a piecewise linear model because we are only measuring linearity for a piece of the variable Theaters (greater than 2800). (d) The results for the quadratic model and the results for the piecewise linear model are very similar; both models required that we retain the additional variable (quadratic or new). Answers will vary for preference; both models add some complexity for interpretation.
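The construction of the new variable, and the reason the fitted relationship bends at 2800, can be sketched in Python. The theater counts and the coefficients `b0`, `b1`, `b2` below are hypothetical values chosen for easy arithmetic; they are not estimates from the movies data.

```python
def hinge(theaters, knot=2800):
    """max(0, Theaters - knot): zero up to the knot, then the excess over it."""
    return max(0, theaters - knot)

def piecewise_mean(theaters, b0, b1, b2, knot=2800):
    """Mean response for the model b0 + b1*Theaters + b2*max(0, Theaters - knot)."""
    return b0 + b1 * theaters + b2 * hinge(theaters, knot)

# Hypothetical theater counts and coefficients, for illustration only.
counts = [1500, 2800, 3200, 4000]
hinged = [hinge(t) for t in counts]
print(hinged)

b0, b1, b2 = 0, 1, 2
slope_below = piecewise_mean(2000, b0, b1, b2) - piecewise_mean(1999, b0, b1, b2)
slope_above = piecewise_mean(3001, b0, b1, b2) - piecewise_mean(3000, b0, b1, b2)
print(slope_below, slope_above)
```

Below the knot the slope is `b1`; above it the slope is `b1 + b2`. The two line segments meet at 2800, so the fitted curve is continuous but changes direction there.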

Question 11.82

CASE 11.2

11.82 Predicting movie revenue: Model selection.

Refer to the data set on movie revenue in Case 11.2 (page 550). In addition to the movie’s budget, opening-weekend revenue, and opening-weekend theater count, the data set also includes a column named Sequel. Sequel is 1 if the corresponding movie is a sequel, and Sequel is 0 if the movie is not a sequel. Assuming opening-weekend revenue (Opening) is in the model, there are eight possible regression models. For example, one model just includes Opening; another model includes Opening and Theaters; and another model includes Opening, Sequel, and Theaters. Run these eight regressions and make a table giving the regression coefficients, the value of $R^{2}$, and the value of $s$ for each regression. (If an explanatory variable is not included in a particular regression, enter a value 0 for its coefficient in the table.) Mark coefficients that are statistically significant at the 5% level with an asterisk (*). Summarize your results and state which model you prefer.

movies
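The count of eight models comes from including or excluding each of the three optional variables while always keeping Opening. A minimal Python sketch that lists them is shown here; the variable names `optional` and `models` are illustrative.

```python
from itertools import combinations

# Opening is always in the model; each other variable is either in or out.
optional = ["Budget", "Theaters", "Sequel"]

models = [
    ["Opening", *subset]
    for k in range(len(optional) + 1)
    for subset in combinations(optional, k)
]

for terms in models:
    print(" + ".join(terms))

print(len(models))  # 8
```

Each printed line is the set of explanatory variables for one of the regressions to run.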

Question 11.83

CASE 11.2

11.83 Effect of an outlier.

In Exercise 11.50 (page 563), we identified a movie that had much higher revenue than predicted. Remove this movie and repeat the previous exercise. Does the removal of this movie change which model you prefer?

movies

11.83

			Regression Coefficients
Number of variables	$R^{2}$	$s$	Intercept	Opening	Budget	Theaters	Sequel
1	0.716	41.368	18.04207	2.4782*	0	0	0
2	0.7785	36.997	6.14917	2.14815*	0.34102*	0	0
2	0.7486	39.414	21.26068	2.65651*	0	0	−31.97239*
2	0.7434	39.821	−55.80153	2.09021*	0	0.02787*	0
3	0.7919	36.334	9.88794	2.31116*	0.29527*	0	−21.28865
3	0.7831	37.092	−61.61398	2.23857*	0	0.03141*	−35.44187*
3	0.7821	37.176	−22.31507	2.03073*	0.30005*	0.01128	0
4	0.8008	36.026	−35.76104	2.15636*	0.21796	0.01843	−26.12162

Five models have all terms significant: Opening alone, with Budget, with Sequel, with Theaters, or with Theaters and Sequel. Clearly, the model with both Theaters and Sequel is better than those with just Sequel or just Theaters. Likewise, the models with two variables are better than the model with Opening alone. That leaves two potentially good models: Opening with Budget, or Opening with Theaters and Sequel. Both have very similar $R^{2}$ and $s$ values, so arguments for either model could be made.