Chapter 11: Multiple Regression

CHAPTER 11 EXERCISES

Question 11.25

pabmi

11.25 Checking for a polynomial relationship. When looking at the residuals from the simple linear model of BMI versus physical activity (PA), Figure 10.5 (page 566) suggested a possible curvilinear relationship. Let’s investigate this further. Multiple regression can be used to fit the polynomial curve of degree q, y = β₀ + β₁x + β₂x² + … + β_qx^q, through the creation of additional explanatory variables x², x³, etc. Let’s investigate a quadratic fit (q = 2) for the physical activity problem.

(a) It is often best to subtract the sample mean $\bar{x}$ before creating the necessary explanatory variables. In this case, the average number of steps per day is 8.614. Create new explanatory variables x₁ = (PA − 8.614) and x₂ = (PA − 8.614)² and run a multiple regression for BMI using the explanatory variables x₁ and x₂. Write down the fitted regression line.
(b) The regression model that included only PA had R² = 14.9%. What is R² with the inclusion of this quadratic term?
(c) Obtain the residuals from part (a) and check the multiple regression assumptions. Are there any remaining patterns in the data? Are the residuals approximately Normal? Explain.
(d) Test the hypothesis that the coefficient of the variable (PA − 8.614)² is equal to 0. Report the t statistic, degrees of freedom, and P-value. Does the quadratic term contribute significantly to the fit? Explain your answer.

11.25 (a) F = 10.44, P-value < 0.0001; $\hat{y}$ = 23.39556 − 0.68175x₁ + 0.10195x₂. (b) 17.71%. (c) No violations. (d) H₀: β₂ = 0, Hₐ: β₂ ≠ 0; t = 1.83, P-value = 0.0696.
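One way to carry out parts (a), (b), and (d) in software is sketched below in Python with NumPy. This is an illustration under stated assumptions only: the function name `fit_centered_quadratic` is hypothetical, and the data are simulated stand-ins rather than the PABMI data file, so the numbers it prints will not match the answer above.

```python
import numpy as np

def fit_centered_quadratic(x, y):
    """Least-squares fit of y on (x - xbar) and (x - xbar)^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x1 = x - x.mean()              # centered explanatory variable
    x2 = x1 ** 2                   # quadratic term built from the centered variable
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    n, k = X.shape
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    s2 = sse / (n - k)             # estimate of sigma^2
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return {"coef": coef, "se": se, "t": coef / se,
            "r2": 1 - sse / sst, "df": n - k}

# Simulated data (an assumption for illustration, NOT the PABMI file).
rng = np.random.default_rng(0)
pa = rng.uniform(2, 15, size=100)
centered = pa - pa.mean()
bmi = 24 - 0.7 * centered + 0.1 * centered ** 2 + rng.normal(0, 3, size=100)

fit = fit_centered_quadratic(pa, bmi)
print("coefficients:", fit["coef"])
print("R^2:", fit["r2"])
print("t for quadratic term:", fit["t"][2], "with df =", fit["df"])
```

The t statistic for the quadratic term is `fit["t"][2]`, with `fit["df"]` degrees of freedom; the P-value would then come from a t table or statistical software.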

Question 11.26

arch

11.26 Architectural firm billings. A summary of firms engaged in commercial architecture in the Indianapolis, Indiana, area provides firm characteristics, including total annual billing in the current year, total annual billing in the previous year, the number of architects, the number of engineers, and the number of staff employed in the firm.⁶ Consider developing a model to predict current total billing using the other four variables.

(a) Using numerical and graphical summaries, describe the distribution of current and past year total billing and the number of architects, engineers, and staff.
(b) For each of the 10 pairs of variables, use graphical and numerical summaries to describe the relationship.
(c) Carry out a multiple regression. Report the fitted regression equation and the value of the regression standard error s.
(d) Analyze the residuals from the multiple regression. Are there any concerns?
(e) A firm did not report its current total billing but had $1 million in billing last year and employs three architects, one engineer, and 17 staff members. What is the predicted total billing for this firm?
(f) This analysis utilized the data from all commercial firms in the Indianapolis area that responded to the survey. Provide justification for the use of inference under this setting.

The following six exercises use the MOVIES data file. This data set contains an SRS of 43 movies released four to five years ago to guarantee they are no longer in the theaters. This sample was collected from the Internet Movie Database (IMDb) to see if information available soon after a movie’s theatrical release can successfully predict total U.S. revenue.⁷ All dollar amounts are measured in millions of U.S. dollars.

Question 11.27

movies

11.27 Predicting movie revenue—preliminary analysis. The response variable is a movie’s total U.S. revenue (USRevenue). Let’s consider as explanatory variables the movie’s budget (Budget); opening-weekend revenue (Opening); the number of theaters (Theaters) the movie was in for the opening weekend; and the movie’s IMDb rating (Ratings), which is on a 1 to 10 scale (10 being best). While this rating is updated continuously, we’ll assume that the current rating is the rating at the end of the first week.

(a) Using numerical and graphical summaries, describe the distribution of each explanatory variable. Are there any unusual observations that should be monitored?
(b) Using numerical and/or graphical summaries, describe the relationship between each pair of explanatory variables.

11.27 (a) Budget and Opening are right-skewed. Theaters and Ratings are left-skewed. (b) The correlations are 0.403, 0.570, 0.625, 0.281, 0.151, and −0.022. Budget, Opening, and Theaters have the largest correlations among them (first three listed); Ratings is not highly correlated with any of the other three (last three listed).

Question 11.28

movies

11.28 Predicting movie revenue—simple linear regressions. Now let’s look at the response variable and its relationship with each explanatory variable.

(a) Using numerical and graphical summaries, describe the distribution of the response variable, USRevenue.
(b) This variable is not Normally distributed. Does this violate one of the key model assumptions? Explain.
(c) Generate scatterplots of each explanatory variable and USRevenue. Do all these relationships look linear? Explain what you see.

Question 11.29

movies

11.29 Predicting movie revenue—multiple linear regression. Now consider fitting a model using all the explanatory variables.

(a) Write out the statistical model for this analysis, making sure to specify all assumptions.
(b) Run the multiple regression model and specify the fitted regression equation.
(c) Obtain the residuals from part (b) and check assumptions. Comment on any unusual residuals or patterns in the residuals.
(d) What percent of the variability in USRevenue is explained by this model?

11.29 (a) F = 32.28, P-value < 0.0001. USRevenue = β₀ + β₁Budget + β₂Opening + β₃Theaters + β₄Ratings + ϵ, where ϵ ~ N(0, σ) and independent. (b) $\hat{y}$ = −170.40874 + 0.09252Budget + 1.91600Opening + 0.02961Theaters + 17.29923Ratings. (c) The residual plot shows a slight downward trend, suggesting another model may be more appropriate. (d) R² = 77.26%.
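The F statistic and R² for a multiple regression are tied together by F = (R²/p) / ((1 − R²)/(n − p − 1)), where p is the number of explanatory variables. The short Python sketch below checks this against the values above, using n = 43 movies and p = 4; the helper name `overall_f` is hypothetical.

```python
def overall_f(r2, n, p):
    """Overall F statistic from R^2, sample size n, and p explanatory variables."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# R^2 = 77.26% with n = 43 and p = 4, as reported above.
f_stat = overall_f(0.7726, n=43, p=4)
print(round(f_stat, 2))  # about 32.28, matching the reported F up to rounding
```

The statistic has p and n − p − 1 degrees of freedom, here 4 and 38.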

Question 11.30

movies

11.30 A simpler model. In the multiple regression analysis using all four explanatory variables, Theaters and Budget appear to be the least helpful (given that the other two explanatory variables are in the model).

(a) Perform a new analysis using only the movie’s opening-weekend revenue and IMDb rating. Give the estimated regression equation for this analysis.
(b) What percent of the variability in USRevenue is explained by this model?
(c) Test the null hypothesis that Theaters and Budget combined add no additional predictive information beyond what is already contained in Opening and Ratings.
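The test in part (c) compares nested models. With q coefficients set to zero under the null hypothesis, a standard form of the statistic is F = ((R²_full − R²_reduced)/q) / ((1 − R²_full)/(n − p − 1)), referred to an F(q, n − p − 1) distribution, where p counts the explanatory variables in the full model. The Python sketch below uses a hypothetical helper `partial_f`; the full-model R² is the 77.26% reported for Exercise 11.29, while the reduced-model R² of 0.75 is a made-up placeholder, not the answer to part (b).

```python
def partial_f(r2_full, r2_reduced, n, p_full, q):
    """F statistic for testing that q coefficients of the full model are all zero."""
    numerator = (r2_full - r2_reduced) / q
    denominator = (1.0 - r2_full) / (n - p_full - 1)
    return numerator / denominator

# r2_reduced = 0.75 is a placeholder for illustration only.
f_partial = partial_f(r2_full=0.7726, r2_reduced=0.75, n=43, p_full=4, q=2)
print(f_partial, "on", 2, "and", 43 - 4 - 1, "degrees of freedom")
```

Replace the placeholder with the R² from your own fit of the two-variable model before drawing any conclusion.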

Question 11.31

movies

11.31 Predicting U.S. movie revenue. The movie Kick-Ass was released during this same time period. It had a budget of $30.0 million and was shown in 3065 theaters, grossing $19.83 million during the first weekend. Use software to construct the following.

(a) A 95% prediction interval based on the model with all three explanatory variables.
(b) A 95% prediction interval based on the model using only opening-weekend revenue and budget.
(c) Compare the two intervals. Do the models give similar predictions and standard errors?

11.31 (a) (−$25,514,500, $158,200,600). (b) (−$28,098,300, $154,239,900). (c) The intervals are similar.
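For readers curious about what the software computes, a prediction interval has the form ŷ ± t*·s·√(1 + x₀ᵀ(XᵀX)⁻¹x₀), where x₀ holds a leading 1 followed by the new case's explanatory values. The Python sketch below is a minimal illustration, assuming NumPy; `prediction_interval` is a hypothetical helper, the critical value t* is passed in by hand (about 2.023 for 39 degrees of freedom), and the data are simulated rather than taken from the MOVIES file.

```python
import numpy as np

def prediction_interval(X, y, x_new, t_star):
    """Prediction interval for a new response at x_new from a least-squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(X.shape[0]), X])   # add intercept column
    x0 = np.concatenate([[1.0], np.atleast_1d(np.asarray(x_new, dtype=float))])
    xtx_inv = np.linalg.inv(design.T @ design)
    b = xtx_inv @ design.T @ y                           # least-squares estimates
    resid = y - design @ b
    df = design.shape[0] - design.shape[1]               # n - p - 1
    s = np.sqrt(resid @ resid / df)                      # regression standard error
    y_hat = float(x0 @ b)
    se_pred = float(s * np.sqrt(1.0 + x0 @ xtx_inv @ x0))
    return y_hat - t_star * se_pred, y_hat + t_star * se_pred

# Simulated stand-in data: 43 cases, 3 explanatory variables (NOT the MOVIES file).
rng = np.random.default_rng(1)
X = rng.normal(size=(43, 3))
y = 50 + X @ np.array([5.0, 20.0, 1.0]) + rng.normal(scale=10.0, size=43)

lo, hi = prediction_interval(X, y, x_new=[0.2, -0.1, 0.5], t_star=2.023)
print(f"95% prediction interval: ({lo:.1f}, {hi:.1f})")
```

Because the "1 +" term under the square root reflects the variability of a single new observation, the interval stays wide even when the coefficients are estimated precisely.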

Question 11.32

movies

11.32 Considering the log transformation. Refer to Exercise 11.29. Variables like income often have very skewed distributions. This can result in certain cases strongly influencing the fit of the model. A common remedy is to take the log before analysis. Create a new response variable by taking the log of U.S. Revenue and fit the model using all four predictors. Obtain the residuals and assess the model conditions. Do these data fit the linear regression model better than the untransformed data? Explain your answer.

The following three exercises use the RANKINGS data file. Since 2004, The Times Higher Education Supplement has provided an annual ranking of the world universities. A total score for each university is calculated based on the scores for the following explanatory variables: Teaching (30%), Research (30%), Citations (30%), Industry Income (2.5%), and International Outlook (7.5%). The percents represent the contributions of each score to the total. For our purposes, we will assume that these weights are unknown and will focus on the development of a model for the total score based on the first three explanatory variables. The report includes a table for the top 200 universities.⁸ The RANKINGS data file contains a random sample of 55 of these universities. This is not a random sample of all universities, but for our purposes here, we will consider it to be.

Question 11.33

rankings

11.33 Annual ranking of world universities. Let’s consider developing a model to predict total score (Overall) based on the teaching, research, and citations scores.

(a) Using numerical and graphical summaries, describe the distribution of each explanatory variable.
(b) Using numerical and graphical summaries, describe the relationship between each pair of explanatory variables.

11.33 (a) Teaching and research are both right-skewed. Citations is left-skewed. (b) Teaching and research are very strongly linearly related (r = 0.8993). Citations does not look related to either teaching or research (r = 0.1878 and 0.0691, respectively).

Question 11.34

rankings

11.34 Looking at the simple linear regressions. Now let’s look at the relationship between each explanatory variable and the total score.

(a) Generate scatterplots for each explanatory variable and the total score. Do these relationships all look linear?
(b) Compute the correlation between each explanatory variable and the total score. Are certain explanatory variables more strongly associated with the total score?

Question 11.35

rankings

11.35 Multiple linear regression model. Now consider a regression model using all three explanatory variables.

(a) Write out the statistical model for this analysis, making sure to specify all assumptions.
(b) Run the multiple regression model and specify the fitted regression equation.
(c) Generate a 95% confidence interval for each coefficient. Should any of these intervals contain 0? Explain.
(d) What percent of the variation in total score is explained by this model? What is the estimate for σ?

11.35 (a) Overall = β₀ + β₁Teaching + β₂Research + β₃Citations + ε, where ε ~ N(0, σ) and independent. (b) F = 736.38, P-value < 0.0001, $\hat{y}$ = 8.16814 + 0.26432Teaching + 0.32800Research + 0.27513Citations. (c) For Teaching: (0.19280, 0.33583); for Research: (0.26790, 0.38810); for Citations: (0.23850, 0.31176). (d) R² = 97.74%; s = 1.72272.

Question 11.36

sevengr

11.36 Predicting GPA of seventh-graders. Refer to the educational data for 78 seventh-grade students given in Table 1.3 (page 26). We view GPA as the response variable. IQ, gender, and self-concept are the explanatory variables.

(a) Find the correlation between GPA and each of the explanatory variables. What percent of the total variation in student GPAs can be explained by the straight-line relationship with each of the explanatory variables?
(b) The importance of IQ in explaining GPA is not surprising. The purpose of the study is to assess the influence of self-concept on GPA. So we will include IQ in the regression model and ask, “How much does self-concept contribute to explaining GPA after the effect of IQ on GPA is taken into account?” Give a model that can be used to answer this question.
(c) Run the model and report the fitted regression equation. What percent of the variation in GPA is explained by the explanatory variables in your model?
(d) Translate the question of interest into appropriate null and alternative hypotheses about the model parameters. Give the value of the test statistic and its P-value. Write a short summary of your analysis with an emphasis on your conclusion.

The following three exercises use the HAPPY data file. The World Database of Happiness is an online registry of scientific research on the subjective appreciation of life. It is available at worlddatabaseofhappiness.eur.nl, and the project is directed by Dr. Ruut Veenhoven, Erasmus University, Rotterdam. One inventory presents the “average happiness” score for various nations. This average is based on individual responses from numerous general population surveys to a general life satisfaction (well-being) question. Scores range from 0 (dissatisfied) to 10 (satisfied). The Nation-Master website, www.nationmaster.com, contains a collection of statistics associated with various nations. For our analysis, we will consider the GINI index, which measures the degree of inequality in the distribution of income (higher score = greater inequality), the degree of corruption in government (higher score = less corruption), average life expectancy, and the degree of democracy (higher score = more civil and political liberties).

Question 11.37

happy

11.37 Predicting a nation’s “average happiness” score. Consider the five statistics for each nation: LSI, the average life-satisfaction score; GINI, the GINI index; CORRUPT, the degree of government corruption; LIFE, the average life expectancy; and DEMOCRACY, a measure of civil and political liberties.

(a) Using numerical and graphical summaries, describe the distribution of each variable.
(b) Using numerical and graphical summaries, describe the relationship between each pair of variables.

11.37 (a) GINI and CORRUPT are skewed to the right; the other three are skewed to the left. CORRUPT, DEMOCRACY, and LIFE have the most skewness. (b) LSI seems moderately correlated with CORRUPT, DEMOCRACY, and LIFE (r = 0.6974, 0.6092, and 0.7219) but is not related to GINI much at all (r = −0.0503). Among the others, only CORRUPT seems to be moderately related to both DEMOCRACY and LIFE (r = 0.7474 and 0.6503); other relationships appear weak.

Question 11.38

happy

11.38 Building a multiple linear regression model. Let’s now build a model to predict the life-satisfaction score, LSI.

(a) Consider a simple linear regression using GINI as the explanatory variable. Run the regression and summarize the results. Be sure to check assumptions.
(b) Now consider a model using GINI and LIFE. Run the multiple regression and summarize the results. Again be sure to check assumptions.
(c) Now consider a model using GINI, LIFE, and DEMOCRACY. Run the multiple regression and summarize the results. Again be sure to check assumptions.
(d) Now consider a model using all four explanatory variables. Again summarize the results and check assumptions.

Question 11.39

happy

11.39 Selecting from among several models. Refer to the results from the previous exercise.

(a) Make a table giving the estimated regression coefficients, standard errors, t statistics, and P-values.
(b) Describe how the coefficients and P-values change for the four models.
(c) Based on the table of coefficients, suggest another model. Run that model, summarize the results, and compare it with the other ones. Which model would you choose to explain LSI? Explain.

11.39 (a) Refer to your regression output. (b) For example, the t statistic for the GINI coefficient grows from t = −0.42 (P = 0.675) to t = 4.25 (P < 0.0005). The DEMOCRACY t is 3.53 in the third model (P < 0.0005) but drops to 0.71 (P = 0.479) in the fourth model. (c) A good choice is to use GINI, LIFE, and CORRUPT. All three coefficients are significant, and R² = 70%.

The following six exercises use the BIOMARK data file. Healthy bones are continually being renewed by two processes. Through bone formation, new bone is built; through bone resorption, old bone is removed. If one or both of these processes are disturbed (by disease, aging, or space travel, for example), bone loss can be the result. The variables VO+ and VO− measure bone formation and bone resorption, respectively. Osteocalcin (OC) is a biochemical marker for bone formation: higher levels of bone formation are associated with higher levels of OC. A blood sample is used to measure OC, and it is much less expensive to obtain than direct measures of bone formation. The units are milligrams of OC per milliliter of blood (mg/ml). Similarly, tartrate-resistant acid phosphatase (TRAP) is a biochemical marker for bone resorption that is also measured in blood. It is measured in units per liter (U/l). These variables were measured in a study of 31 healthy women aged 11 to 32 years.⁹ Variables with the first letter “L” are the logarithms of the measured variables.

Question 11.40

biomark

11.40 Bone formation and resorption. Consider the following four variables: VO+, a measure of bone formation; VO−, a measure of bone resorption; OC, a biomarker of bone formation; and TRAP, a biomarker of bone resorption.

(a) Using numerical and graphical summaries, describe the distribution of each of these variables.
(b) Using numerical and graphical summaries, describe the relationship between each pair of variables.

Question 11.41

biomark

11.41 Predicting bone formation. Let’s use regression methods to predict VO+, the measure of bone formation.

(a) Because OC is a biomarker of bone formation, we start with a simple linear regression using OC as the explanatory variable. Run the regression and summarize the results. Be sure to include an analysis of the residuals.
(b) Because the processes of bone formation and bone resorption are highly related, it is possible that there is some information in the bone resorption variables that can tell us something about bone formation. Use a model with both OC and TRAP, the biomarker of bone resorption, to predict VO+. Summarize the results. In the context of this model, it appears that TRAP is a better predictor of bone formation, VO+, than the biomarker of bone formation, OC. Is this view consistent with the pattern of relationships that you described in the previous exercise? One possible explanation is that, although all these variables are highly related, TRAP is measured with more precision than OC.

11.41 (a) F = 22.34, P-value < 0.0001, $\hat{y}$ = 334.03439 + 19.50471OC. The residual plot shows a possible outlier. (b) F = 21.62, P-value < 0.0001, $\hat{y}$ = 57.70419 + 6.41466OC + 53.39331TRAP. TRAP is much more significant (t = 3.50, P-value = 0.0016) in this model than OC (t = 1.25, P-value = 0.2210).

Question 11.42

biomark

11.42 More on predicting bone formation. Now consider a regression model for predicting VO+ using OC, TRAP, and VO−.

(a) Write out the statistical model for this analysis including all assumptions.
(b) Run the multiple regression to predict VO+ using OC, TRAP, and VO−. Summarize the results.
(c) Make a table giving the estimated regression coefficients, standard errors, and t statistics with P-values for this analysis and for the two that you ran in the previous exercise. Describe how the coefficients and the P-values differ for the three analyses.
(d) Give the percent of variation in VO+ explained by each of the three models and the estimate of σ. Give a short summary.
(e) The results you found in part (b) suggest another model. Run that model, summarize the results, and compare them with the results in part (b).

Question 11.43

biomark

11.43 Predicting bone formation using transformed variables. Because the distributions of VO+, VO−, OC, and TRAP tend to be skewed, it is common to work with logarithms rather than the measured values. Using the questions in the previous three exercises as a guide, analyze the log data.

11.43 All variables are Normal when log transformed. All pairs are positively associated: strongest between LVO+ and LVO− (r = 0.8396) and weakest between LOC and LVO− (r = 0.5545). Using logOC: $\hat{y}$ = 4.39 + 0.71logOC, t = 6.57, P < 0.0001, R² = 59.83%, s = 0.36. Using logOC and logTRAP: $\hat{y}$ = 4.26 + 0.43logOC + 0.42logTRAP (logOC: t = 2.56, P = 0.0162; logTRAP: t = 2.06, P = 0.0484), R² = 65.14%, s = 0.34. Using all three: $\hat{y}$ = 0.87 + 0.39logOC + 0.03logTRAP + 0.67logVO− (logOC: t = 3.40, P = 0.0021; logTRAP: t = 0.17, P = 0.8624; logVO−: t = 5.71, P < 0.0001), R² = 84.21%, s = 0.23. The best model uses only logOC and logVO−: $\hat{y}$ = 0.83298 + 0.40589logOC + 0.68159logVO−, R² = 84.19%, s = 0.23.

Question 11.44

biomark

11.44 Predicting bone resorption. Refer to Exercises 11.40, 11.41, and 11.42. Answer these questions with the roles of VO+ and VO− reversed; that is, run models to predict VO−, with VO+ as an explanatory variable.

Question 11.45

biomark

11.45 Predicting bone resorption using transformed variables. Refer to the previous exercise. Rerun using logs.

11.45 Using logOC: $\hat{y}$ = 5.21 + 0.44logOC, t = 3.59, P = 0.0012, R² = 30.75%, s = 0.41. Using logOC and logTRAP: $\hat{y}$ = 5.04 + 0.06logOC + 0.59logTRAP (logOC: t = 0.31, P = 0.7618; logTRAP: t = 2.61, P = 0.0144), R² = 44.30%, s = 0.37. Using all three: $\hat{y}$ = 1.57 − 0.29logOC + 0.24logTRAP + 0.81logVO+ (logOC: t = −2.08, P = 0.0468; logTRAP: t = 1.47, P = 0.1523; logVO+: t = 5.71, P < 0.0001), R² = 74.77%, s = 0.26. The best model uses logVO+ alone: $\hat{y}$ = 1.75657 + 0.7305logVO+, R² = 70.49%, s = 0.27.

The following 11 exercises use the PCB data file. Polychlorinated biphenyls (PCBs) are a collection of synthetic compounds, called congeners, that are particularly toxic to fetuses and young children. Although PCBs are no longer produced in the United States, they are still found in the environment. Because human exposure to these PCBs is primarily through the consumption of fish, the Environmental Protection Agency (EPA) monitors the PCB levels in fish. Unfortunately, there are 209 different congeners and measuring all of them in a fish specimen is an expensive and time-consuming process. You’ve been asked to see if the total amount of PCBs in a specimen can be estimated with only a few, easily quantifiable congeners.¹⁰ If this can be done, costs can be greatly reduced.

Question 11.46

pcb

11.46 Relationships among PCB congeners. Consider the following variables: PCB (the total amount of PCB) and four congeners: PCB52, PCB118, PCB138, and PCB180.

(a) Using numerical and graphical summaries, describe the distribution of each of these variables.
(b) Using numerical and graphical summaries, describe the relationship between each pair of variables.

Question 11.47

pcb

11.47 Predicting the total amount of PCB. Use the four congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict PCB.

(a) Write the statistical model for this analysis. Include all assumptions.
(b) Run the regression and summarize the results.
(c) Examine the residuals. Do they appear to be approximately Normal? When you plot them versus each of the explanatory variables, are any patterns evident?

11.47 (a) PCB = β₀ + β₁PCB52 + β₂PCB118 + β₃PCB138 + β₄PCB180 + ϵ, where ϵ ~ N(0, σ) and independent. (b) F = 1456.18, P-value < 0.0001, $\hat{y}$ = 0.93692 + 11.87270PCB52 + 3.76107PCB118 + 3.88423PCB138 + 4.18230PCB180. All individual predictors are significant. (c) The residual plot shows a possible violation of constant variance. The residuals are Normal, except for two possible outliers.

Question 11.48

pcb

11.48 Adjusting the analysis for potential outliers. The examination of the residuals in part (c) of the previous exercise suggests that there may be two outliers, one with a high residual and one with a low residual.

(a) Because of safety issues, we are more concerned about underestimating PCB in a specimen than about overestimating. Give the specimen number for each of the two suspected outliers. Which one corresponds to an overestimate of PCB?
(b) Rerun the analysis with the two suspected outliers deleted, summarize these results, and compare them with those you obtained in the previous exercise.

Question 11.49

pcb

11.49 More on predicting the total amount of PCB. Run a regression to predict PCB using the variables PCB52, PCB118, and PCB138. Note that this is similar to the analysis that you did in Exercise 11.47, with the change that PCB180 is not included as an explanatory variable.

(a) Summarize the results.
(b) In this analysis, the regression coefficient for PCB118 is not statistically significant. Give the estimate of the coefficient and the associated P-value.
(c) Find the estimate of the coefficient for PCB118 and the associated P-value for the model analyzed in Exercise 11.47.
(d) Using the results in parts (b) and (c), write a short paragraph explaining how the inclusion of other variables in a multiple regression can have an effect on the estimate of a particular coefficient and the results of the associated significance test.

11.49 (a) F = 786.71, P-value < 0.0001, $\hat{y}$ = −1.01840 + 12.64419PCB52 + 0.31311PCB118 + 8.25459PCB138. (b) b₁₁₈ = 0.31311, P-value = 0.7083. (c) b₁₁₈ = 3.76107, P-value < 0.0001. (d) When we add PCB180 to the model, it makes PCB118 useful for prediction.

Question 11.50

pcb

11.50 Multiple regression model for total TEQ. Dioxins and furans are other classes of chemicals that can cause undesirable health effects similar to those caused by PCB. The three types of chemicals are combined using toxic equivalent scores (TEQs), which attempt to measure the health effects on a common scale. The PCB data file contains TEQs for PCB, dioxins, and furans. The variables are called TEQPCB, TEQDIOXIN, and TEQFURAN. The data file also includes the total TEQ, defined to be the sum of these three variables.

(a) Consider using a multiple regression to predict TEQ using the three components TEQPCB, TEQDIOXIN, and TEQFURAN as explanatory variables. Write the multiple regression model in the form
$TEQ = β_{0} + β_{1} TEQPCB + β_{2} TEQDIOXIN + β_{3} TEQFURAN + ϵ$
Give numerical values for the parameters β₀, β₁, β₂, and β₃.
(b) The multiple regression model assumes that the ϵ’s are Normal with mean zero and standard deviation σ. What is the numerical value of σ?
(c) Use software to run this regression and summarize the results.

Question 11.51

pcb

11.51 Multiple regression model for total TEQ, continued. The information summarized in TEQ is used to assess and manage risks from these chemicals. For example, the World Health Organization (WHO) has established the tolerable daily intake (TDI) as one to four TEQs per kilogram of body weight per day. Therefore, it would be very useful to have a procedure for estimating TEQ using just a few variables that can be measured cheaply. Use the four PCB congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict TEQ. Give a description of the model and assumptions, summarize the results, examine the residuals, and write a summary of what you have found.

11.51 TEQ = β₀ + β₁PCB52 + β₂PCB118 + β₃PCB138 + β₄PCB180 + ϵ, where ϵ ~ N(0, σ) and independent. F = 33.53, P-value < 0.0001. Only PCB118 tests significant individually. The residual plot shows a couple of potential outliers, which are also causing a slight right-skew in the Normal quantile plot.

Question 11.52

pcb

11.52 Predicting total amount of PCB using transformed variables. Because distributions of variables such as PCB, the PCB congeners, and TEQ tend to be skewed, researchers frequently analyze the logarithms of the measured variables. Create a data set that has the logs of each of the variables in the PCB data file. Note that zero is a possible value for PCB126; most software packages will eliminate these cases when you request a log transformation.

(a) If you do not do anything about the 16 zero values of PCB126, what does your software do with these cases? Is there an error message of some kind?
(b) If you attempt to run a regression to predict the log of PCB using the log of PCB126 and the log of PCB52, are the cases with the zero values of PCB126 eliminated? Do you think that this is a good way to handle this situation?
(c) The smallest nonzero value of PCB126 is 0.0052. One common practice when taking logarithms of measured values is to replace the zeros by one-half of the smallest observed value. Create a logarithm data set using this procedure; that is, replace the 16 zero values of PCB126 by 0.0026 before taking logarithms. Use numerical and graphical summaries to describe the distributions of the log variables.
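The replacement rule in part (c) can be sketched in a few lines of Python. This is an illustration only: `log_with_half_min` is a hypothetical helper, natural logarithms are assumed (the exercise does not say which base to use), and apart from 0.0052 the sample values are made up rather than taken from the PCB data file.

```python
import math

def log_with_half_min(values):
    """Natural logs of nonnegative values, with zeros replaced by half the smallest positive value."""
    positive = [v for v in values if v > 0]
    if not positive:
        raise ValueError("need at least one positive value")
    floor = min(positive) / 2          # e.g. 0.0052 / 2 = 0.0026
    return [math.log(v if v > 0 else floor) for v in values]

# Made-up values for illustration; only 0.0052 comes from the exercise.
sample = [0.0, 0.0052, 0.0, 0.0310, 0.0127]
logs = log_with_half_min(sample)
print(logs)
```

Every zero maps to the same value, log(0.0026), so the replaced cases form a visible stack at the low end of the transformed variable.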

Question 11.53

pcb

11.53 Predicting total amount of PCB using transformed variables, continued. Refer to the previous exercise.

(a) Use numerical and graphical summaries to describe the relationships between each pair of log variables.
(b) Compare these summaries with the summaries that you produced in Exercise 11.46 for the measured variables.

11.53 (a) The correlations are all positive; the largest correlation is 0.956 (LPCB and LPCB138); the smallest 0.227 (LPCB28 and LPCB180). There is one outlier (specimen 39) in LPCB28; the latter stands out because of the “stack” of values in the LPCB126 data set that arose from the adjustment of the zero terms. (b) All correlations are higher with the transformed data.

Question 11.54

pcb

11.54 Even more on predicting total amount of PCB using transformed variables. Use the log data set that you created in Exercise 11.52 to find a good multiple regression model for predicting the log of PCB. Use only log PCB variables for this analysis. Write a report summarizing your results.

Question 11.55

pcb

11.55 Predicting total TEQ using transformed variables. Use the log data set that you created in Exercise 11.52 to find a good multiple regression model for predicting the log of TEQ. Use only log PCB variables for this analysis. Write a report summarizing your results and comparing them with the results that you obtained in the previous exercise.

11.55 A good model includes logPCB28, logPCB118, and logPCB126; R² = 0.7764. Adding more variables doesn’t increase R² much.

Question 11.56

pcb

11.56 Interpretation of coefficients in log PCB regressions. Use the results of your analysis of the log PCB data in Exercise 11.54 to write an explanation of how regression coefficients, standard errors of regression coefficients, and tests of significance for explanatory variables can change depending on what other explanatory variables are included in the multiple regression analysis.

The following nine exercises use the CHEESE data file. As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. The variable “Case” is used to number the observations from 1 to 30. “Taste” is the response variable of interest. The taste scores were obtained by combining the scores from several tasters. Three of the chemicals whose concentrations were measured were acetic acid, hydrogen sulfide, and lactic acid. For acetic acid and hydrogen sulfide, (natural) log transformations were taken. Thus, the explanatory variables are the transformed concentrations of acetic acid (“Acetic”) and hydrogen sulfide (“H2S”) and the untransformed concentration of lactic acid (“Lactic”).¹¹

Question 11.57

cheese

11.57 Describing the explanatory variables. For each of the four variables in the CHEESE data file, find the mean, median, standard deviation, and interquartile range. Display each distribution by means of a stemplot and use a Normal quantile plot to assess Normality of the data. Summarize your findings. Note that when doing regressions with these data, we do not assume that these distributions are Normal. Only the residuals from our model need to be (approximately) Normal. The careful study of each variable to be analyzed is, nonetheless, an important first step in any statistical analysis.

11.57 Mean, median, standard deviation, and interquartile range for each variable: Taste: 24.53, 20.95, 16.26, 23.9. Acetic: 5.50, 5.43, 0.57, 0.66. H2S: 5.94, 5.33, 2.13, 3.69. Lactic: 1.44, 1.45, 0.30, 0.43. None of the variables show striking deviations from Normality in the quantile plots. Taste and H2S are slightly right-skewed, and Acetic has an irregular shape. There are no outliers.

Question 11.58

cheese

11.58 Pairwise scatterplots of the explanatory variables. Make a scatterplot for each pair of variables in the CHEESE data file (you will have six plots). Describe the relationships. Calculate the correlation for each pair of variables and report the P-value for the test of zero population correlation in each case.
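For the test of zero population correlation, the sample correlation r from n observations is converted to t = r√(n − 2)/√(1 − r²), which is compared with the t(n − 2) distribution. The sketch below computes r and t for every pair of columns; the column names and values are hypothetical placeholders, and the P-value itself would still come from software or a t table.

```python
import math
from itertools import combinations

def correlation(x, y):
    """Pearson sample correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_for_zero_correlation(r, n):
    """t statistic for H0: population correlation = 0, with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical placeholder columns; four variables give six pairs.
columns = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 5, 4, 5],
    "C": [5, 3, 4, 1, 2],
    "D": [3, 1, 4, 1, 5],
}

results = {}
for (name1, x), (name2, y) in combinations(columns.items(), 2):
    r = correlation(x, y)
    results[(name1, name2)] = (r, t_for_zero_correlation(r, len(x)))

for pair, (r, t) in results.items():
    print(pair, round(r, 3), round(t, 3))
```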

Question 11.59

cheese

11.59 Simple linear regression model of Taste. Perform a simple linear regression analysis using Taste as the response variable and Acetic as the explanatory variable. Be sure to examine the residuals carefully. Summarize your results. Include a plot of the data with the least-squares regression line. Plot the residuals versus each of the other two chemicals. Are any patterns evident? (The concentrations of the other chemicals are lurking variables for the simple linear regression.)

11.59 F = 12.11, P-value = 0.0017, $\hat{y}$ = −61.49861 + 15.64777 Acetic. R² = 30.20%. The residuals are Normally distributed, but the scatterplots show that the residuals are linearly related to both H2S and Lactic.
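The quantities in a simple linear regression summary (intercept, slope, residuals, R², s, and F) all follow from a few sums of squares. The following is a minimal sketch in plain Python; the function name and the data are hypothetical placeholders rather than the CHEESE values, and statistical software would normally be used instead.

```python
import math

def simple_linear_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x with basic summary statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)        # error sum of squares
    sst = sum((b - my) ** 2 for b in y)         # total sum of squares
    mse = sse / (n - 2)
    return {
        "b0": b0,
        "b1": b1,
        "residuals": residuals,
        "r2": 1 - sse / sst,
        "s": math.sqrt(mse),
        "f": (sst - sse) / mse,                 # model df = 1
    }

# Hypothetical placeholder data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fit = simple_linear_regression(x, y)
print(fit["b0"], fit["b1"], fit["r2"], fit["f"])
```

The returned residuals can then be plotted against a variable left out of the model, which is how a lurking variable shows up as a pattern.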

Question 11.60

cheese

11.60 Another simple linear regression model of Taste. Repeat the analysis of Exercise 11.59 using Taste as the response variable and H2S as the explanatory variable.

Question 11.61

cheese

11.61 The final simple linear regression model of Taste. Repeat the analysis of Exercise 11.59 using Taste as the response variable and Lactic as the explanatory variable.

11.61 F = 27.55, P-value < 0.0001, $\hat{y}$ = −29.85883 + 37.71995 Lactic. R² = 49.59%. The residuals are Normally distributed, but the scatterplots show that the residuals are linearly related to both H2S and Lactic.

Question 11.62

cheese

11.62 Comparing the simple linear regression models. Compare the results of the regressions performed in the three previous exercises. Construct a table with values of the F statistic, its P-value, R², and the estimate s of the standard deviation for each model. Report the three regression equations. Why are the intercepts in these three equations different?
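When building the comparison table, it helps to know that in simple linear regression F and R² carry the same information: since F = (SSM/1)/(SSE/(n − 2)) and R² = SSM/SST, it follows that F = (n − 2)R²/(1 − R²). The sketch below applies this identity with n = 30 (the CHEESE cases are numbered 1 to 30) and the R² values reported in the answers to Exercises 11.59 and 11.61; the function name is a hypothetical choice.

```python
def f_from_r2(r2, n):
    """F statistic of a simple linear regression, recovered from R^2 and n."""
    return (n - 2) * r2 / (1 - r2)

n = 30  # number of cheese samples
f_acetic = f_from_r2(0.3020, n)  # R^2 reported for Taste on Acetic
f_lactic = f_from_r2(0.4959, n)  # R^2 reported for Taste on Lactic
print(round(f_acetic, 2), round(f_lactic, 2))
```

Both results agree with the reported F statistics (12.11 and 27.55) up to the rounding of R².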

Question 11.63

cheese

11.63 Multiple regression model of Taste. Carry out a multiple regression using Acetic and H2S to predict Taste. Summarize the results of your analysis. Compare the statistical significance of Acetic in this model with its significance in the model with Acetic alone as a predictor (Exercise 11.59). Which model do you prefer? Give a simple explanation for the fact that Acetic alone appears to be a good predictor of Taste, but with H2S in the model, it is not.

11.63 $\hat{y}$ = −26.94 + 3.801 Acetic + 5.146 H2S with s = 10.89 and R² = 0.582. For Acetic: t = 0.84 (P-value = 0.406). This two-variable model is not much better than the model with H2S alone (which explained 57.1% of the variation in Taste).
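The small gain in R² and the small t statistic for Acetic are two views of the same fact. When a single variable is added to a model, the square of its t statistic equals the partial F statistic, F = (R²_full − R²_reduced)/((1 − R²_full)/(n − p − 1)), where p is the number of explanatory variables in the full model. The following sketch applies this to the values reported above (n = 30 cases, p = 2); the function name is hypothetical, and because the R² values are rounded the result is only approximate.

```python
import math

def partial_f_one_variable(r2_full, r2_reduced, n, p_full):
    """Partial F for adding one variable to a model, computed from R^2 values."""
    error_df = n - p_full - 1
    return (r2_full - r2_reduced) / ((1 - r2_full) / error_df)

n = 30  # number of cheese samples
f_acetic_given_h2s = partial_f_one_variable(0.582, 0.571, n, p_full=2)
t_acetic = math.sqrt(f_acetic_given_h2s)  # magnitude of the t statistic
print(round(f_acetic_given_h2s, 2), round(t_acetic, 2))
```

The recovered magnitude is close to the reported t = 0.84, so Acetic adds little once H2S is already in the model.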

Question 11.64

cheese

11.64 Another multiple regression model of Taste. Carry out a multiple regression using H2S and Lactic to predict Taste. When we compare the results of this analysis with the simple linear regressions using each of these explanatory variables alone, it is evident that a better result is obtained by using both predictors in a model. Support this statement with explicit information obtained from your analysis.

Question 11.65

cheese

11.65 The final multiple regression model of Taste. Use the three explanatory variables Acetic, H2S, and Lactic in a multiple regression to predict Taste. Write a short summary of your results, including an examination of the residuals. Based on all the regression analyses you have carried out on these data, which model do you prefer and why?

11.65 $\hat{y}$ = −28.88 + 0.328 Acetic + 3.912 H2S + 19.671 Lactic with s = 10.13. R² = 65.2%. Acetic is not significant (P-value = 0.942); there is no gain in adding Acetic to the model with H2S and Lactic. Residuals appear to be Normally distributed and show no patterns in scatterplots with explanatory variables. It appears that the H2S/Lactic model is best.

Question 11.66

11.66 Finding a multiple regression model on the Internet. Search the Internet to find an example of the use of multiple regression. Give the setting of the example, describe the data, give the model, and summarize the results. Explain why the use of multiple regression in this setting was appropriate or inappropriate.