For Exercise 11.13, see page 621; and for Exercise 11.14, see page 625.
11.15 Refining the GPA model using all six explanatory variables: Residual checks. Figure 11.11 (page 631) provides a list of the top models based on R². Let’s look more closely at the four models listed with p = 3 and p = 4. Fit each of these models to the data and obtain the residuals. Do the data, at least approximately, meet the conditions of the multiple regression model? Provide some plots to support your opinion.
11.16 Refining the GPA model using all six explanatory variables: Inference. Refer to the previous exercise. For each of the four models considered in the previous exercise, report the least-
11.17 A mechanistic explanation of popularity. In Exercise 10.65 (page 605), correlations between an adolescent’s “popularity,” expression of a serotonin receptor gene, and rule-
| | b | s(b) |
|---|---|---|
| Model 1 | | |
| Gene expression | 0.204 | 0.066 |
| Model 2 | | |
| Gene expression | 0.161 | 0.066 |
| RB.composite | 0.100 | 0.030 |
For all analyses, use the 0.05 significance level.
(a) What are the error degrees of freedom for Model 1 and Model 2?
(b) Test the null hypothesis that the serotonin gene receptor coefficient is equal to 0 in Model 1. State the test statistic and P-value.
(c) Perform both individual-
(d) Is there still a positive relationship between the serotonin gene receptor expression level and popularity after adjusting for RB? If yes, compare the increase in popularity for a unit increase in gene expression (while RB remains unchanged) in the two models.
Results such as these suggest not only that adolescents with high serotonin receptor gene expression are predisposed to increased RB behaviors, but also that such behaviors are socially advantageous.
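Parts (a) and (b) rest on two standard facts about multiple regression: the error degrees of freedom are n − p − 1, and the t statistic for an individual coefficient is the estimate divided by its standard error. The following is a minimal Python sketch of that arithmetic. The function names `error_df` and `t_statistic` and all numbers in the usage lines are hypothetical, chosen only for illustration, and are not taken from the table above.

```python
def error_df(n, p):
    """Error (residual) degrees of freedom for a model with p explanatory variables."""
    return n - p - 1


def t_statistic(b, se_b):
    """t statistic for H0: beta = 0, computed as estimate / standard error."""
    return b / se_b


# Hypothetical sample size, number of predictors, estimate, and standard error.
n, p = 50, 2
b, se_b = 0.30, 0.12

t = t_statistic(b, se_b)
df = error_df(n, p)
print(f"t = {t:.2f} on {df} error degrees of freedom")
```

The P-value would then come from the t distribution with the stated error degrees of freedom, using software or a table.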
11.18 Predicting college debt: Multiple regression. Refer to Exercises 10.12 (page 579) and 10.17 (page 580) for a description of the problem. Let’s now consider fitting a model using Admit, GradRate, InCostAid, and OutCostAid as the explanatory variables.
(a) Write out the statistical model for this analysis, making sure to specify all assumptions.
(b) Run the multiple regression model and specify the fitted regression equation.
(c) Obtain the residuals from part (b) and check assumptions. Is Baruch College still an unusual case? Provide a brief summary.
(d) Run the same multiple regression model but this time without Baruch College. Again comment on the residuals.
(e) Should we proceed with inference using the entire data set? Or the data set without Baruch College? Explain your reasoning.
11.19 Predicting college debt: Inference. Refer to the previous exercise. Let’s proceed using the data set without Baruch College.
(a) Report the least-
(b) What percent of the variability in average debt is explained by this model?
(c) Report the F statistic, its degrees of freedom, and the P-value. What do you conclude based on this test result?
(d) Using this F test and the individual parameter t tests, write a one-
11.20 Testing a collection of variables. Refer to the previous exercise. For the model that included all p = 4 explanatory variables, only InCostAid is found significant using the individual parameter t tests. This raises the question whether these other three variables further contribute to the prediction of average debt given in-
In this chapter, we discussed the F test for a collection of regression coefficients. In most cases, this capability is provided by the software. When it is not, the test can be performed using the R² values from the larger (full) and smaller (reduced) models. The test statistic is

$$F = \left(\frac{n - p - 1}{q}\right)\left(\frac{R_1^2 - R_2^2}{1 - R_1^2}\right)$$

with q and n − p − 1 degrees of freedom. $R_1^2$ is the value for the full model, and $R_2^2$ is the value for the reduced model. Here n = 24 schools, p = 4 variables in the full model, and q = 3 variables were removed to form the reduced model. Plug in the value of R² from part (b) of the previous exercise and the R² value from Figure 10.13 (page 580). Compute the test statistic and P-value. Do Admit, GradRate, and OutCostAid combined add any significant predictive information beyond what is already contained in InCostAid?
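When software does not report this test directly, the arithmetic can be scripted. The following is a minimal Python sketch of the F statistic computed from the full-model and reduced-model R² values, using the standard form F = ((n − p − 1)/q)(R²_full − R²_reduced)/(1 − R²_full). The function name `partial_f_statistic` is hypothetical, and the R² values in the usage lines are placeholders for illustration only, not results from the college debt data.

```python
def partial_f_statistic(r2_full, r2_reduced, n, p, q):
    """F statistic for testing q coefficients removed from a p-variable model.

    Returns (F, numerator df, denominator df), where
    F = ((n - p - 1) / q) * (r2_full - r2_reduced) / (1 - r2_full).
    """
    if not 0 <= r2_reduced <= r2_full < 1:
        raise ValueError("expected 0 <= r2_reduced <= r2_full < 1")
    df_num = q
    df_den = n - p - 1
    if df_num <= 0 or df_den <= 0:
        raise ValueError("degrees of freedom must be positive")
    f_stat = (df_den / df_num) * (r2_full - r2_reduced) / (1 - r2_full)
    return f_stat, df_num, df_den


# Placeholder R-squared values (assumptions, not the exercise's answers);
# n, p, and q match the setup described in the exercise.
f_stat, df_num, df_den = partial_f_statistic(
    r2_full=0.60, r2_reduced=0.50, n=24, p=4, q=3
)
print(f"F = {f_stat:.3f} with {df_num} and {df_den} degrees of freedom")
```

The P-value is the upper-tail area of the F distribution with the returned degrees of freedom, which would be obtained from statistical software or an F table.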
11.21 Comparison of prediction intervals. Refer to the previous exercise. Another way to compare these two models is in terms of prediction. The Ohio State University has Admit = 56, GradRate = 59, InCostAid = 12,103, and OutCostAid = 28,603. Use statistical software to construct:
(a) a 95% prediction interval based on the model with all p = 4 predictors.
(b) a 95% prediction interval based on the model using just InCostAid.
(c) Compare the two intervals. Do the two models give similar predictions? Which provides a narrower prediction interval?
11.22 Consider the sex of the students. Refer to Exercises 11.15 and 11.16. The seventh explanatory variable provided in the GPA data set is a sex indicator variable. This variable (SEX) takes the value 0 for males and 1 for females. If we include it in our model involving the other six variables, it allows the intercept to differ for the two genders. Using b7 to represent the fitted coefficient for the SEX variable, the estimated male intercept is b0 + b7(0) = b0 and the estimated female intercept is b0 + b7(1) = b0 + b7. The difference between these two intercept estimates is (b0 + b7) − b0 = b7, so the coefficient is also an estimate of the difference in intercepts.
(a) Include the variable SEX with the other six explanatory variables and refit the model. Compare the fit of this model, using R2 and s, with the model in Figure 11.10 (pages 627–
(b) Does this indicator variable appear to contribute to our explanation of GPA? Report the test results.
(c) Does the coefficient suggest males or females have higher GPA scores? Explain your answer.
11.23 Predicting energy-
Explanatory variable | b |
---|---|
Age | −0.02 |
Sex (1 = female, 0 = male) | −0.11** |
Race (1 = nonwhite, 0 = white) | −0.02 |
Ethnicity (1 = Hispanic, 0 = non-Hispanic) | 0.10** |
Parental education | 0.02 |
College GPA | −0.01 |
Jock identity | 0.05 |
Risk taking | 0.19*** |
A superscript of ** means that the individual coefficient t test had a P-value less than 0.01, and a superscript of *** means that the test had a P-value less than 0.001. All other P-values were greater than 0.05.
(a) The overall F statistic is reported to be 8.11. What are the degrees of freedom associated with this statistic?
(b) R is reported to be 0.28. What percent of the variation in energy-
(c) Interpret each of the regression coefficients that are significant.
(d) The researcher states, “Controlling for gender, age, race, ethnicity, parental educational achievement, and college GPA, each of the predictors (risk taking and jock identity) was positively associated with energy-
11.24 Is the number of tornadoes increasing? In Exercise 10.19, data on the number of tornadoes in the United States between 1953 and 2014 were analyzed to see if there was a linear trend over time. Some argue that it’s not the number of tornadoes increasing over time, but rather the probability of sighting them because there are more people living in the United States. Let’s investigate this by including the U.S. census count as an additional explanatory variable.
(a) Using numerical and graphical summaries, describe the relationship between each pair of variables.
(b) Perform a multiple regression using both year and census count as explanatory variables. Write down the fitted model.
(c) Obtain the residuals from part (b). Plot them versus the two explanatory variables and generate a Normal quantile plot. What do you conclude?
(d) Test the hypothesis that there is a linear increase over time. State the null and alternative hypotheses, test statistic, and P-value. What is your conclusion?