Chapter 11: Multiple Regression

SECTION 11.2 EXERCISES

For Exercise 11.13, see page 621; and for Exercise 11.14, see page 625.

Question 11.15

11.15 Refining the GPA model using all six explanatory variables: Residual checks. Figure 11.11 (page 631) provides a list of the top models based on R². Let’s look more closely at the four models listed with p = 3 and p = 4. Fit each of these models to the data and obtain the residuals. Do the data, at least approximately, meet the conditions of the multiple regression model? Provide some plots to support your opinion.

Question 11.16

11.16 Refining the GPA model using all six explanatory variables: Inference. Refer to the previous exercise. For each of the four models considered in the previous exercise, report the least-squares equation, the estimated model standard deviation s, and the P-values for each of the individual coefficients. Based on these results and the residual checks of the previous exercise, which model do you think provides the “best” fit? Explain your answer.

Question 11.17

11.17 A mechanistic explanation of popularity. In Exercise 10.65 (page 605), correlations between an adolescent’s “popularity,” expression of a serotonin receptor gene, and rule-breaking behaviors were assessed. An additional portion of the analysis looked at the relationship between the gene expression level and popularity, after adjusting for rule-breaking (RB) behaviors. This adjustment was necessary because RB is positively associated both with this gene expression and with popularity in adolescents. The following summarizes these regression analyses using the composite (questionnaire and video) RB score. A total of 202 individuals were included in this analysis.

	b	s(b)
Model 1
Gene expression	0.204	0.066
Model 2
Gene expression	0.161	0.066
RB.composite	0.100	0.030

For all analyses, use the 0.05 significance level.

(a) What are the error degrees of freedom for Model 1 and Model 2?
(b) Test the null hypothesis that the coefficient of serotonin receptor gene expression is equal to 0 in Model 1. State the test statistic and P-value.
(c) Perform both individual-variable t tests for Model 2. Again state the test statistics and P-values.
(d) Is there still a positive relationship between the serotonin receptor gene expression level and popularity after adjusting for RB? If yes, compare the increase in popularity for a unit increase in gene expression (while RB remains unchanged) in the two models.

Results such as these suggest not only that adolescents with high serotonin receptor gene expression are predisposed to increased RB behaviors, but also that such behaviors are socially advantageous.
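
The individual-coefficient tests in parts (b) and (c) all use the same arithmetic: t = b / s(b), referred to a t distribution with n − p − 1 error degrees of freedom. The following is a minimal Python sketch of that arithmetic, not a worked solution. The function name `coef_t_test` and the input values are made up for illustration, and the P-value uses a Normal approximation, which is reasonable only when the error degrees of freedom are large.

```python
from statistics import NormalDist


def coef_t_test(b, se_b, n, p):
    """Return (t, error_df, approx_two_sided_p) for H0: coefficient = 0.

    b and se_b are the estimate and its standard error, n is the sample
    size, and p is the number of explanatory variables in the model.
    """
    t = b / se_b
    error_df = n - p - 1
    # Assumption: Normal approximation to the t(error_df) distribution.
    # For small error_df, use the exact t distribution from software.
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, error_df, p_value


# Hypothetical inputs, not the values from the table in this exercise.
t, error_df, p_value = coef_t_test(b=0.50, se_b=0.20, n=100, p=2)
print(f"t = {t:.2f}, error df = {error_df}, approximate P = {p_value:.4f}")
```

Most statistical packages report the exact t-based P-value directly, so a helper like this is mainly useful for checking a published table that gives only b and s(b).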

Question 11.18

11.18 Predicting college debt: Multiple regression. Refer to Exercises 10.12 (page 579) and 10.17 (page 580) for a description of the problem. Let’s now consider fitting a model using Admit, GradRate, InCostAid, and OutCostAid as the explanatory variables.

(a) Write out the statistical model for this analysis, making sure to specify all assumptions.
(b) Run the multiple regression model and specify the fitted regression equation.
(c) Obtain the residuals from part (b) and check assumptions. Is Baruch College still an unusual case? Provide a brief summary.
(d) Run the same multiple regression model but this time without Baruch College. Again comment on the residuals.
(e) Should we proceed with inference using the entire data set? Or the data set without Baruch College? Explain your reasoning.

Question 11.19

11.19 Predicting college debt: Inference. Refer to the previous exercise. Let’s proceed using the data set without Baruch College.

(a) Report the least-squares equation using all four variables.
(b) What percent of the variability in average debt is explained by this model?
(c) Report the F statistic, its degrees of freedom, and the P-value. What do you conclude based on this test result?
(d) Using this F test and the individual parameter t tests, write a one-paragraph summary of this model’s fit to the data.

Question 11.20

11.20 Testing a collection of variables. Refer to the previous exercise. For the model that included all p = 4 explanatory variables, only InCostAid is found significant using the individual parameter t tests. This raises the question of whether the other three variables contribute further to the prediction of average debt, given that in-state cost is already in the model.

In this chapter, we discussed the F test for a collection of regression coefficients. In most cases, this capability is provided by the software. When it is not, the test can be performed using the R²-values from the larger (full) and smaller (reduced) models. The test statistic is

F = ((n − p − 1)/q) × ((R₁² − R₂²)/(1 − R₁²))

with q and n − p − 1 degrees of freedom. R₁² is the value for the full model, and R₂² is the value for the reduced model. Here n = 24 schools, p = 4 variables in the full model, and q = 3 variables were removed to form the reduced model. Plug in the values of R² from part (b) of the previous exercise and the R² value from Figure 10.13 (page 580). Compute the test statistic and P-value. Do Admit, GradRate, and OutCostAid combined add any significant predictive information beyond what is already contained in InCostAid?
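
The calculation described in this exercise can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `partial_f` is hypothetical, and the two R² values below are placeholders, not the values from the previous exercise or from Figure 10.13. Only n = 24, p = 4, and q = 3 come from the text.

```python
def partial_f(r2_full, r2_reduced, n, p, q):
    """F statistic for testing q coefficients at once, from two R^2 values.

    r2_full is R^2 for the full model with p explanatory variables,
    r2_reduced is R^2 after q of those variables are removed, and n is
    the number of cases. Returns (F, numerator_df, denominator_df).
    """
    if not 0 <= r2_reduced <= r2_full < 1:
        raise ValueError("expected 0 <= r2_reduced <= r2_full < 1")
    df_num = q
    df_den = n - p - 1
    f_stat = (df_den / df_num) * (r2_full - r2_reduced) / (1 - r2_full)
    return f_stat, df_num, df_den


# Placeholder R^2 values chosen only to show the arithmetic.
f_stat, df_num, df_den = partial_f(r2_full=0.60, r2_reduced=0.50, n=24, p=4, q=3)
print(f"F = {f_stat:.3f} with {df_num} and {df_den} degrees of freedom")

# The P-value is the upper-tail area of the F(df_num, df_den) distribution.
# If SciPy is available, scipy.stats.f.sf(f_stat, df_num, df_den) gives it.
```

A small F here means the removed variables explain little additional variation relative to what the full model leaves unexplained.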

Question 11.21

11.21 Comparison of prediction intervals. Refer to the previous exercise. Another way to compare these two models is in terms of prediction. The Ohio State University has Admit = 56, GradRate = 59, InCostAid = 12,103, and OutCostAid = 28,603. Use statistical software to construct:

(a) a 95% prediction interval based on the model with all p = 4 predictors.
(b) a 95% prediction interval based on the model using just InCostAid.
(c) Compare the two intervals. Do the two models give similar predictions? Which provides a narrower prediction interval?

Question 11.22

11.22 Consider the sex of the students. Refer to Exercises 11.15 and 11.16. The seventh explanatory variable provided in the GPA data set is a sex indicator variable. This variable (SEX) takes the value 0 for males and 1 for females. If we include it in our model involving the other six variables, it allows the intercept to differ for the two sexes. Using b₇ to represent the fitted coefficient for the SEX variable, the estimated male intercept is b₀ + b₇(0) = b₀ and the estimated female intercept is b₀ + b₇(1) = b₀ + b₇. The difference between these two intercept estimates is (b₀ + b₇) − b₀ = b₇, so the coefficient is also an estimate of the difference in intercepts.

(a) Include the variable SEX with the other six explanatory variables and refit the model. Compare the fit of this model, using R² and s, with the model in Figure 11.10 (pages 627–629).
(b) Does this indicator variable appear to contribute to our explanation of GPA? Report the test results.
(c) Does the coefficient suggest males or females have higher GPA scores? Explain your answer.
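
The intercept arithmetic for the SEX indicator is easiest to see in a stripped-down case with no other explanatory variables. In that special case the least-squares intercept equals the male mean and the indicator coefficient equals the female mean minus the male mean. The sketch below uses a tiny made-up data set (not the GPA data) and hypothetical variable names. With the six other variables in the model, the coefficient is instead a difference in intercepts after adjusting for those variables.

```python
from statistics import mean

# Made-up GPA values for illustration only; SEX is 0 for males, 1 for females.
gpa = [2.8, 3.0, 3.2, 3.1, 3.5, 3.3]
sex = [0, 0, 0, 1, 1, 1]

# Least-squares fit of GPA = b0 + b_sex * SEX using the usual formulas.
sex_bar, gpa_bar = mean(sex), mean(gpa)
sxy = sum((s - sex_bar) * (g - gpa_bar) for s, g in zip(sex, gpa))
sxx = sum((s - sex_bar) ** 2 for s in sex)
b_sex = sxy / sxx
b0 = gpa_bar - b_sex * sex_bar

male_mean = mean([g for g, s in zip(gpa, sex) if s == 0])
female_mean = mean([g for g, s in zip(gpa, sex) if s == 1])

print(f"male intercept   b0         = {b0:.3f} (male mean {male_mean:.3f})")
print(f"female intercept b0 + b_sex = {b0 + b_sex:.3f} (female mean {female_mean:.3f})")
print(f"difference       b_sex      = {b_sex:.3f}")
```

The sign of the indicator coefficient therefore shows which group has the higher fitted intercept, given how the 0 and 1 codes were assigned.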

Question 11.23

11.23 Predicting energy-drink consumption. Energy-drink advertising consistently emphasizes a physically active lifestyle and often features extreme sports and risk taking. Are these typical characteristics of an energy-drink consumer? A researcher decided to examine the links between energy-drink consumption, sport-related (jock) identity, and risk taking.⁵ She invited more than 1500 undergraduate students enrolled in large introductory-level courses at a public university to participate. Each participant had to complete a 45-minute anonymous questionnaire. From this questionnaire, jock identity and risk-taking scores were obtained, where the higher the score, the stronger the trait. She ended up with 795 respondents. The following table summarizes the results of a multiple regression analysis using the frequency of energy-drink consumption in the past 30 days as the response variable:

Explanatory variable	b
Age	−0.02
Sex (1 = female, 0 = male)	−0.11**
Race (1 = nonwhite, 0 = white)	−0.02
Ethnicity (1 = Hispanic, 0 = non-Hispanic)	0.10**
Parental education	0.02
College GPA	−0.01
Jock identity	0.05
Risk taking	0.19***

A superscript of ** means that the individual coefficient t test had a P-value less than 0.01, and a superscript of *** means that the test had a P-value less than 0.001. All other P-values were greater than 0.05.

(a) The overall F statistic is reported to be 8.11. What are the degrees of freedom associated with this statistic?
(b) R is reported to be 0.28. What percent of the variation in energy-drink consumption is explained by the model? Is this a highly predictive model? Explain.
(c) Interpret each of the regression coefficients that are significant.
(d) The researcher states, “Controlling for gender, age, race, ethnicity, parental educational achievement, and college GPA, each of the predictors (risk taking and jock identity) was positively associated with energy-drink consumption frequency.” Explain what is meant by “controlling for” these variables and how this helps strengthen her assertion that jock identity and risk taking are positively associated with energy-drink consumption.

Question 11.24

11.24 Is the number of tornadoes increasing? In Exercise 10.19, data on the number of tornadoes in the United States between 1953 and 2014 were analyzed to see if there was a linear trend over time. Some argue that it’s not the number of tornadoes increasing over time, but rather the probability of sighting them because there are more people living in the United States. Let’s investigate this by including the U.S. census count as an additional explanatory variable.

(a) Using numerical and graphical summaries, describe the relationship between each pair of variables.
(b) Perform a multiple regression using both year and census count as explanatory variables. Write down the fitted model.
(c) Obtain the residuals from part (b). Plot them versus the two explanatory variables and generate a Normal quantile plot. What do you conclude?
(d) Test the hypothesis that there is a linear increase over time. State the null and alternative hypotheses, test statistic, and P-value. What is your conclusion?
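
Parts (b) and (c) would normally be done in statistical software. For readers who want to see the mechanics, here is a rough Python sketch that fits a two-variable regression by solving the normal equations and then lists the residuals next to each explanatory variable. Everything in it is an assumption for illustration: the helper `fit_ols`, the variable names, and the eight made-up data rows (years since the start of the series, population in arbitrary units, and a count). It is not the tornado data.

```python
def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    X = [[1.0, *row] for row in rows]  # prepend a column of 1s for b0
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up data: (years since start, population in arbitrary units).
rows = [(0, 1.0), (1, 1.3), (2, 1.4), (3, 1.9),
        (4, 2.1), (5, 2.2), (6, 2.8), (7, 2.9)]
y = [5.1, 5.9, 6.2, 7.4, 7.6, 8.3, 9.0, 9.8]

b0, b_year, b_pop = fit_ols(rows, y)
fitted = [b0 + b_year * yr + b_pop * pop for yr, pop in rows]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"fitted model: y-hat = {b0:.3f} + {b_year:.3f} year + {b_pop:.3f} pop")
for (yr, pop), res in zip(rows, residuals):
    print(f"year = {yr}, pop = {pop:.1f}, residual = {res:+.3f}")
```

In a model with an intercept, the residuals sum to zero and are uncorrelated with each explanatory variable, so the residual plots are read for curvature, changing spread, and outliers rather than for an overall trend.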