11.25 Checking for a polynomial relationship. When looking at the residuals from the simple linear model of BMI versus physical activity (PA), Figure 10.5 (page 566) suggested a possible curvilinear relationship. Let’s investigate this further. Multiple regression can be used to fit the polynomial curve of degree q, $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_q x^q$, through the creation of additional explanatory variables $x^2$, $x^3$, etc. Let’s investigate a quadratic fit (q = 2) for the physical activity problem.
(a) It is often best to subtract the sample mean before creating the necessary explanatory variables. In this case, the average number of steps per day is 8.614. Create new explanatory variables $x_1 = (\text{PA} - 8.614)$ and $x_2 = (\text{PA} - 8.614)^2$ and run a multiple regression for BMI using the explanatory variables $x_1$ and $x_2$. Write down the fitted regression line.
(b) The regression model that included only PA had an $R^2$ of 14.9%. What is $R^2$ with the inclusion of this quadratic term?
(c) Obtain the residuals from part (a) and check the multiple regression assumptions. Are there any remaining patterns in the data? Are the residuals approximately Normal? Explain.
(d) Test the hypothesis that the coefficient of the variable $(\text{PA} - 8.614)^2$ is equal to 0. Report the t statistic, degrees of freedom, and P-value. Does the quadratic term contribute significantly to the fit? Explain your answer.
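The steps in this exercise can be sketched in code. The following is a minimal illustration, assuming NumPy is available and using simulated values rather than the textbook's data file; the variable names (`pa`, `bmi`) and every number in the simulation are hypothetical. It centers the explanatory variable, builds the quadratic term, fits the model by least squares, and computes R^2 and the t statistic for the quadratic coefficient. The P-value would come from the t distribution with n − 3 degrees of freedom, which your software reports.

```python
import numpy as np

# Simulated stand-in data (hypothetical; NOT the textbook's PA/BMI file).
rng = np.random.default_rng(0)
pa = rng.uniform(3.0, 14.0, size=100)   # physical activity, thousands of steps
bmi = 24.0 - 0.7 * (pa - 8.5) + 0.1 * (pa - 8.5) ** 2 + rng.normal(0.0, 3.0, size=100)

center = pa.mean()          # the exercise uses 8.614 for the real data
x1 = pa - center            # centered linear term
x2 = x1 ** 2                # centered quadratic term

# Centering typically weakens the correlation between the linear and squared terms.
corr_raw = np.corrcoef(pa, pa ** 2)[0, 1]
corr_centered = np.corrcoef(x1, x2)[0, 1]

# Least-squares fit of bmi on an intercept, x1, and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, bmi, rcond=None)

resid = bmi - X @ b
n, k = X.shape              # k = 3 columns, counting the intercept
df = n - k                  # error degrees of freedom
s2 = resid @ resid / df     # estimate of sigma^2
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))   # standard errors of b0, b1, b2
t_quad = b[2] / se[2]       # t statistic for H0: beta2 = 0

sst = np.sum((bmi - bmi.mean()) ** 2)
r2 = 1.0 - resid @ resid / sst

print(f"fit: BMI-hat = {b[0]:.3f} + {b[1]:.3f} x1 + {b[2]:.3f} x2")
print(f"R^2 = {r2:.3f}, t = {t_quad:.2f} on {df} df")
```

Plotting `resid` against `x1` and in a Normal quantile plot would complete the assumption checks asked for in part (c).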
11.26 Architectural firm billings. A summary of firms engaged in commercial architecture in the Indianapolis, Indiana, area provides firm characteristics, including total annual billing in the current year, total annual billing in the previous year, the number of architects, the number of engineers, and the number of staff employed in the firm.6 Consider developing a model to predict current total billing using the other four variables.
(a) Using numerical and graphical summaries, describe the distribution of current and past year total billing and the number of architects, engineers, and staff.
(b) For each of the 10 pairs of variables, use graphical and numerical summaries to describe the relationship.
(c) Carry out a multiple regression. Report the fitted regression equation and the value of the regression standard error s.
(d) Analyze the residuals from the multiple regression. Are there any concerns?
(e) A firm did not report its current total billing but had $1 million in billing last year and employs three architects, one engineer, and 17 staff members. What is the predicted total billing for this firm?
(f) This analysis utilized the data from all commercial firms in the Indianapolis area that responded to the survey. Provide justification for the use of inference under this setting.
The following six exercises use the MOVIES data file. This data set contains an SRS of 43 movies released four to five years ago to guarantee they are no longer in the theaters. This sample was collected from the Internet Movie Database (IMDb) to see if information available soon after a movie’s theatrical release can successfully predict total U.S. revenue.7 All dollar amounts are measured in millions of U.S. dollars.
11.27 Predicting movie revenue—
(a) Using numerical and graphical summaries, describe the distribution of each explanatory variable. Are there any unusual observations that should be monitored?
(b) Using numerical and/or graphical summaries, describe the relationship between each pair of explanatory variables.
11.28 Predicting movie revenue—
(a) Using numerical and graphical summaries, describe the distribution of the response variable, USRevenue.
(b) This variable is not Normally distributed. Does this violate one of the key model assumptions? Explain.
(c) Generate scatterplots of each explanatory variable and USRevenue. Do all these relationships look linear? Explain what you see.
11.29 Predicting movie revenue—
(a) Write out the statistical model for this analysis, making sure to specify all assumptions.
(b) Run the multiple regression model and specify the fitted regression equation.
(c) Obtain the residuals from part (b) and check assumptions. Comment on any unusual residuals or patterns in the residuals.
(d) What percent of the variability in USRevenue is explained by this model?
11.30 A simpler model. In the multiple regression analysis using all four explanatory variables, Theaters and Budget appear to be the least helpful (given that the other two explanatory variables are in the model).
(a) Perform a new analysis using only the movie’s opening-
(b) What percent of the variability in USRevenue is explained by this model?
(c) Test the null hypothesis that Theaters and Budget combined add no additional predictive information beyond what is already contained in Opening and Opinion.
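A test that a collection of coefficients are all zero compares the error sums of squares of the reduced and full models. The sketch below illustrates one standard form of that F statistic, assuming NumPy and simulated values in place of the MOVIES data file; the helper `sse` and all simulated numbers are hypothetical. With q = 2 dropped variables and n = 43 cases, the statistic is referred to the F(2, 38) distribution.

```python
import numpy as np

def sse(X, y):
    """Error sum of squares from a least-squares fit (X includes the intercept column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

# Simulated stand-in data (hypothetical; NOT the MOVIES file).
rng = np.random.default_rng(1)
n = 43
opening, opinion, theaters, budget = rng.normal(size=(4, n))
revenue = 50.0 + 20.0 * opening + 5.0 * opinion + rng.normal(scale=10.0, size=n)

ones = np.ones(n)
X_reduced = np.column_stack([ones, opening, opinion])
X_full = np.column_stack([ones, opening, opinion, theaters, budget])

sse_reduced = sse(X_reduced, revenue)
sse_full = sse(X_full, revenue)

q = X_full.shape[1] - X_reduced.shape[1]   # number of coefficients tested (2)
df_error = n - X_full.shape[1]             # n - p - 1 = 43 - 4 - 1 = 38

# F = [(SSE_reduced - SSE_full) / q] / [SSE_full / (n - p - 1)]
F = ((sse_reduced - sse_full) / q) / (sse_full / df_error)
print(f"F = {F:.3f} on ({q}, {df_error}) df")
```

The same value results from the R^2 form of the statistic, since each SSE equals (1 − R^2) times the common total sum of squares.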
11.31 Predicting U.S. movie revenue. The movie Kick-
(a) A 95% prediction interval based on the model with all four explanatory variables.
(b) A 95% prediction interval based on the model using only opening-
(c) Compare the two intervals. Do the models give similar predictions and standard errors?
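For reference, a prediction interval for a new case has the form ŷ ± t* SE, where the standard error combines the regression standard error s with the uncertainty in the fitted coefficients. The following is a rough sketch, assuming NumPy and simulated data; the function `prediction_interval`, the new case `x0`, and all simulated values are hypothetical. The critical value t* is passed in by hand (about 2.024 for 95% confidence with 38 degrees of freedom, as read from a t table); statistical software supplies it automatically.

```python
import numpy as np

def prediction_interval(X, y, x0, t_star):
    """Return (fit, lower, upper) for a new case x0; X and x0 both include the leading 1."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    df = X.shape[0] - X.shape[1]                  # n - p - 1
    s = np.sqrt(resid @ resid / df)               # regression standard error
    # SE for predicting one new observation: s * sqrt(1 + x0' (X'X)^{-1} x0)
    se_pred = s * np.sqrt(1.0 + x0 @ np.linalg.inv(X.T @ X) @ x0)
    fit = float(x0 @ b)
    return fit, fit - t_star * se_pred, fit + t_star * se_pred

# Simulated stand-in data with four explanatory variables (hypothetical).
rng = np.random.default_rng(2)
n = 43
Z = rng.normal(size=(n, 4))
y = 40.0 + Z @ np.array([15.0, 6.0, 1.0, 0.5]) + rng.normal(scale=12.0, size=n)
X = np.column_stack([np.ones(n), Z])

x0 = np.array([1.0, 0.5, 1.0, 0.2, -0.3])         # hypothetical new case
fit, lower, upper = prediction_interval(X, y, x0, t_star=2.024)
print(f"prediction {fit:.1f}, 95% PI ({lower:.1f}, {upper:.1f})")
```

Running the function a second time on a design matrix with fewer columns (and the matching t* for its degrees of freedom) gives the interval for a smaller model, so the two can be compared side by side.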
11.32 Considering the log transformation. Refer to Exercise 11.29. Variables like income often have very skewed distributions. This can result in certain cases strongly influencing the fit of the model. A common remedy is to take the log before analysis. Create a new response variable by taking the log of U.S. Revenue and fit the model using all four predictors. Obtain the residuals and assess the model conditions. Do these data fit the linear regression model better than the untransformed data? Explain your answer.
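One way to see the effect of the log transformation is to compare the residuals from the two fits. The sketch below is an illustration only, assuming NumPy, a single hypothetical predictor `x` instead of the four in the exercise, and a simulated right-skewed response. It uses the sample skewness of the residuals as a rough numerical summary; a Normal quantile plot of each set of residuals remains the more informative check.

```python
import numpy as np

def residual_skewness(X, y):
    """Sample skewness of least-squares residuals (their mean is 0 when X has an intercept)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(np.mean(r ** 3) / np.mean(r ** 2) ** 1.5)

# Simulated right-skewed response (hypothetical; NOT the MOVIES file).
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
revenue = np.exp(3.0 + 0.5 * x + rng.normal(scale=0.8, size=n))

X = np.column_stack([np.ones(n), x])
skew_raw = residual_skewness(X, revenue)           # fit to the untransformed response
skew_log = residual_skewness(X, np.log(revenue))   # fit to the log of the response

print(f"residual skewness, raw response: {skew_raw:.2f}")
print(f"residual skewness, log response: {skew_log:.2f}")
```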
The following three exercises use the RANKINGS data file. Since 2004, The Times Higher Education Supplement has provided an annual ranking of the world universities. A total score for each university is calculated based on the scores for the following explanatory variables: Teaching (30%), Research (30%), Citations (30%), Industry Income (2.5%), and International Outlook (7.5%). The percents represent the contributions of each score to the total. For our purposes, we will assume that these weights are unknown and will focus on the development of a model for the total score based on the first three explanatory variables. The report includes a table for the top 200 universities.8 The RANKINGS data file contains a random sample of 55 of these universities. This is not a random sample of all universities, but for our purposes here, we will consider it to be.
11.33 Annual ranking of world universities. Let’s consider developing a model to predict total score (Overall) based on the teaching, research, and citations scores.
(a) Using numerical and graphical summaries, describe the distribution of each explanatory variable.
(b) Using numerical and graphical summaries, describe the relationship between each pair of explanatory variables.
11.34 Looking at the simple linear regressions. Now let’s look at the relationship between each explanatory variable and the total score.
(a) Generate scatterplots for each explanatory variable and the total score. Do these relationships all look linear?
(b) Compute the correlation between each explanatory variable and the total score. Are certain explanatory variables more strongly associated with the total score?
11.35 Multiple linear regression model. Now consider a regression model using all three explanatory variables.
(a) Write out the statistical model for this analysis, making sure to specify all assumptions.
(b) Run the multiple regression model and specify the fitted regression equation.
(c) Generate a 95% confidence interval for each coefficient. Should any of these intervals contain 0? Explain.
(d) What percent of the variation in total score is explained by this model? What is the estimate for σ?
11.36 Predicting GPA of seventh-
(a) Find the correlation between GPA and each of the explanatory variables. What percent of the total variation in student GPAs can be explained by the straight-
(b) The importance of IQ in explaining GPA is not surprising. The purpose of the study is to assess the influence of self-
(c) Run the model and report the fitted regression equation. What percent of the variation in GPA is explained by the explanatory variables in your model?
(d) Translate the question of interest into appropriate null and alternative hypotheses about the model parameters. Give the value of the test statistic and its P-value. Write a short summary of your analysis with an emphasis on your conclusion.
The following three exercises use the HAPPY data file. The World Database of Happiness is an online registry of scientific research on the subjective appreciation of life. It is available at worlddatabaseofhappiness.eur.nl, and the project is directed by Dr. Ruut Veenhoven, Erasmus University, Rotterdam. One inventory presents the “average happiness” score for various nations. This average is based on individual responses from numerous general population surveys to a general life satisfaction (well-
11.37 Predicting a nation’s “average happiness” score. Consider the five statistics for each nation: LSI, the average life-
(a) Using numerical and graphical summaries, describe the distribution of each variable.
(b) Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.38 Building a multiple linear regression model. Let’s now build a model to predict the life-
(a) Consider a simple linear regression using GINI as the explanatory variable. Run the regression and summarize the results. Be sure to check assumptions.
(b) Now consider a model using GINI and LIFE. Run the multiple regression and summarize the results. Again be sure to check assumptions.
(c) Now consider a model using GINI, LIFE, and DEMOCRACY. Run the multiple regression and summarize the results. Again be sure to check assumptions.
(d) Now consider a model using all four explanatory variables. Again summarize the results and check assumptions.
11.39 Selecting from among several models. Refer to the results from the previous exercise.
(a) Make a table giving the estimated regression coefficients, standard errors, t statistics, and P-values.
(b) Describe how the coefficients and P-values change for the four models.
(c) Based on the table of coefficients, suggest another model. Run that model, summarize the results, and compare it with the other ones. Which model would you choose to explain LSI? Explain.
The following six exercises use the BIOMARK data file. Healthy bones are continually being renewed by two processes. Through bone formation, new bone is built; through bone resorption, old bone is removed. If one or both of these processes are disturbed—
11.40 Bone formation and resorption. Consider the following four variables: VO+, a measure of bone formation; VO−, a measure of bone resorption; OC, a biomarker of bone formation; and TRAP, a biomarker of bone resorption.
(a) Using numerical and graphical summaries, describe the distribution of each of these variables.
(b) Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.41 Predicting bone formation. Let’s use regression methods to predict VO+, the measure of bone formation.
(a) Because OC is a biomarker of bone formation, we start with a simple linear regression using OC as the explanatory variable. Run the regression and summarize the results. Be sure to include an analysis of the residuals.
(b) Because the processes of bone formation and bone resorption are highly related, it is possible that there is some information in the bone resorption variables that can tell us something about bone formation. Use a model with both OC and TRAP, the biomarker of bone resorption, to predict VO+. Summarize the results. In the context of this model, it appears that TRAP is a better predictor of bone formation, VO+, than the biomarker of bone formation, OC. Is this view consistent with the pattern of relationships that you described in the previous exercise? One possible explanation is that, although all these variables are highly related, TRAP is measured with more precision than OC.
11.42 More on predicting bone formation. Now consider a regression model for predicting VO+ using OC, TRAP, and VO−.
(a) Write out the statistical model for this analysis including all assumptions.
(b) Run the multiple regression to predict VO+ using OC, TRAP, and VO−. Summarize the results.
(c) Make a table giving the estimated regression coefficients, standard errors, and t statistics with P-values for this analysis and for the two that you ran in the previous exercise. Describe how the coefficients and the P-values differ for the three analyses.
(d) Give the percent of variation in VO+ explained by each of the three models and the estimate of σ. Give a short summary.
(e) The results you found in part (b) suggest another model. Run that model, summarize the results, and compare them with the results in part (b).
11.43 Predicting bone formation using transformed variables. Because the distributions of VO+, VO−, OC, and TRAP tend to be skewed, it is common to work with logarithms rather than the measured values. Using the questions in the previous three exercises as a guide, analyze the log data.
11.44 Predicting bone resorption. Refer to Exercises 11.40, 11.41, and 11.42. Answer these questions with the roles of VO+ and VO− reversed; that is, run models to predict VO−, with VO+ as an explanatory variable.
11.45 Predicting bone resorption using transformed variables. Refer to the previous exercise. Rerun using logs.
The following 11 exercises use the PCB data file. Polychlorinated biphenyls (PCBs) are a collection of synthetic compounds, called congeners, that are particularly toxic to fetuses and young children. Although PCBs are no longer produced in the United States, they are still found in the environment. Because human exposure to these PCBs is primarily through the consumption of fish, the Environmental Protection Agency (EPA) monitors the PCB levels in fish. Unfortunately, there are 209 different congeners and measuring all of them in a fish specimen is an expensive and time-
11.46 Relationships among PCB congeners. Consider the following variables: PCB (the total amount of PCB) and four congeners: PCB52, PCB118, PCB138, and PCB180.
(a) Using numerical and graphical summaries, describe the distribution of each of these variables.
(b) Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.47 Predicting the total amount of PCB. Use the four congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict PCB.
(a) Write the statistical model for this analysis. Include all assumptions.
(b) Run the regression and summarize the results.
(c) Examine the residuals. Do they appear to be approximately Normal? When you plot them versus each of the explanatory variables, are any patterns evident?
11.48 Adjusting the analysis for potential outliers. The examination of the residuals in part (c) of the previous exercise suggests that there may be two outliers, one with a high residual and one with a low residual.
(a) Because of safety issues, we are more concerned about underestimating PCB in a specimen than about overestimating. Give the specimen number for each of the two suspected outliers. Which one corresponds to an overestimate of PCB?
(b) Rerun the analysis with the two suspected outliers deleted, summarize these results, and compare them with those you obtained in the previous exercise.
11.49 More on predicting the total amount of PCB. Run a regression to predict PCB using the variables PCB52, PCB118, and PCB138. Note that this is similar to the analysis that you did in Exercise 11.47, with the change that PCB180 is not included as an explanatory variable.
(a) Summarize the results.
(b) In this analysis, the regression coefficient for PCB118 is not statistically significant. Give the estimate of the coefficient and the associated P-value.
(c) Find the estimate of the coefficient for PCB118 and the associated P-value for the model analyzed in Exercise 11.47.
(d) Using the results in parts (b) and (c), write a short paragraph explaining how the inclusion of other variables in a multiple regression can have an effect on the estimate of a particular coefficient and the results of the associated significance test.
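The phenomenon behind part (d) can be reproduced with simulated data. The following sketch, assuming NumPy and entirely hypothetical variables `x1`, `x2`, and `y` (not the PCB congeners), builds two strongly correlated explanatory variables where only `x2` drives the response. The coefficient of `x1` is large when `x1` is the only explanatory variable and close to zero once `x2` enters the model.

```python
import numpy as np

def coefficients(X, y):
    """Least-squares coefficients (X includes the intercept column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

# Simulated data (hypothetical): x2 is strongly correlated with x1,
# and the response depends on x2 only.
rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
y = 2.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
b_alone = coefficients(np.column_stack([ones, x1]), y)      # y on x1 only
b_both = coefficients(np.column_stack([ones, x1, x2]), y)   # y on x1 and x2

slope_x1_alone = b_alone[1]   # absorbs the effect of the omitted x2
slope_x1_both = b_both[1]     # effect of x1 with x2 held fixed

print(f"coefficient of x1 alone:   {slope_x1_alone:.2f}")
print(f"coefficient of x1 with x2: {slope_x1_both:.2f}")
```

Each coefficient in a multiple regression describes the contribution of its variable given the other variables in the model, so both the estimate and its significance test can change when the set of explanatory variables changes.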
11.50 Multiple regression model for total TEQ. Dioxins and furans are other classes of chemicals that can cause undesirable health effects similar to those caused by PCB. The three types of chemicals are combined using toxic equivalent scores (TEQs), which attempt to measure the health effects on a common scale. The PCB data file contains TEQs for PCB, dioxins, and furans. The variables are called TEQPCB, TEQDIOXIN, and TEQFURAN. The data file also includes the total TEQ, defined to be the sum of these three variables.
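(a) Consider using a multiple regression to predict TEQ using the three components TEQPCB, TEQDIOXIN, and TEQFURAN as explanatory variables. Write the multiple regression model in the form
$\text{TEQ}_i = \beta_0 + \beta_1 \text{TEQPCB}_i + \beta_2 \text{TEQDIOXIN}_i + \beta_3 \text{TEQFURAN}_i + \epsilon_i$
Give numerical values for the parameters $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$.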
(b) The multiple regression model assumes that the ϵ’s are Normal with mean zero and standard deviation σ. What is the numerical value of σ?
(c) Use software to run this regression and summarize the results.
11.51 Multiple regression model for total TEQ, continued. The information summarized in TEQ is used to assess and manage risks from these chemicals. For example, the World Health Organization (WHO) has established the tolerable daily intake (TDI) as one to four TEQs per kilogram of body weight per day. Therefore, it would be very useful to have a procedure for estimating TEQ using just a few variables that can be measured cheaply. Use the four PCB congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict TEQ. Give a description of the model and assumptions, summarize the results, examine the residuals, and write a summary of what you have found.
11.52 Predicting total amount of PCB using transformed variables. Because distributions of variables such as PCB, the PCB congeners, and TEQ tend to be skewed, researchers frequently analyze the logarithms of the measured variables. Create a data set that has the logs of each of the variables in the PCB data file. Note that zero is a possible value for PCB126; most software packages will eliminate these cases when you request a log transformation.
(a) If you do not do anything about the 16 zero values of PCB126, what does your software do with these cases? Is there an error message of some kind?
(b) If you attempt to run a regression to predict the log of PCB using the log of PCB126 and the log of PCB52, are the cases with the zero values of PCB126 eliminated? Do you think that this is a good way to handle this situation?
(c) The smallest nonzero value of PCB126 is 0.0052. One common practice when taking logarithms of measured values is to replace the zeros by one-
11.53 Predicting total amount of PCB using transformed variables, continued. Refer to the previous exercise.
(a) Use numerical and graphical summaries to describe the relationships between each pair of log variables.
(b) Compare these summaries with the summaries that you produced in Exercise 11.46 for the measured variables.
11.54 Even more on predicting total amount of PCB using transformed variables. Use the log data set that you created in Exercise 11.52 to find a good multiple regression model for predicting the log of PCB. Use only log PCB variables for this analysis. Write a report summarizing your results.
11.55 Predicting total TEQ using transformed variables. Use the log data set that you created in Exercise 11.52 to find a good multiple regression model for predicting the log of TEQ. Use only log PCB variables for this analysis. Write a report summarizing your results and comparing them with the results that you obtained in the previous exercise.
11.56 Interpretation of coefficients in log PCB regressions. Use the results of your analysis of the log PCB data in Exercise 11.54 to write an explanation of how regression coefficients, standard errors of regression coefficients, and tests of significance for explanatory variables can change depending on what other explanatory variables are included in the multiple regression analysis.
The following nine exercises use the CHEESE data file. As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. The variable “Case” is used to number the observations from 1 to 30. “Taste” is the response variable of interest. The taste scores were obtained by combining the scores from several tasters. Three of the chemicals whose concentrations were measured were acetic acid, hydrogen sulfide, and lactic acid. For acetic acid and hydrogen sulfide, (natural) log transformations were taken. Thus, the explanatory variables are the transformed concentrations of acetic acid (“Acetic”) and hydrogen sulfide (“H2S”) and the untransformed concentration of lactic acid (“Lactic”).11
11.57 Describing the explanatory variables. For each of the four variables in the CHEESE data file, find the mean, median, standard deviation, and interquartile range. Display each distribution by means of a stemplot and use a Normal quantile plot to assess Normality of the data. Summarize your findings. Note that when doing regressions with these data, we do not assume that these distributions are Normal. Only the residuals from our model need to be (approximately) Normal. The careful study of each variable to be analyzed is, nonetheless, an important first step in any statistical analysis.
11.58 Pairwise scatterplots of the explanatory variables. Make a scatterplot for each pair of variables in the CHEESE data file (you will have six plots). Describe the relationships. Calculate the correlation for each pair of variables and report the P-value for the test of zero population correlation in each case.
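The numerical part of this exercise can be organized as a loop over the six pairs. Below is a minimal sketch, assuming NumPy and simulated columns that borrow the variable names of the CHEESE file but not its values (every number here is hypothetical). For each pair it computes the correlation r and the statistic t = r√(n − 2)/√(1 − r²), which is compared with the t(n − 2) distribution to test zero population correlation; the P-value itself is left to your software.

```python
import numpy as np
from itertools import combinations

# Simulated columns using the CHEESE variable names (values are hypothetical).
rng = np.random.default_rng(4)
n = 30
acetic = rng.normal(5.5, 0.6, size=n)
h2s = rng.normal(6.0, 2.0, size=n)
lactic = rng.normal(1.4, 0.3, size=n)
taste = -20.0 + 5.0 * h2s + 10.0 * lactic + rng.normal(0.0, 10.0, size=n)
data = {"Taste": taste, "Acetic": acetic, "H2S": h2s, "Lactic": lactic}

results = []
for a, b in combinations(data, 2):                 # the six unordered pairs
    r = np.corrcoef(data[a], data[b])[0, 1]
    t = r * np.sqrt(n - 2) / np.sqrt(1.0 - r ** 2) # test statistic, df = n - 2
    results.append((a, b, r, t))

for a, b, r, t in results:
    print(f"{a:>6} vs {b:<6}  r = {r:6.3f}  t = {t:7.3f}  (df = {n - 2})")
```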
11.59 Simple linear regression model of Taste. Perform a simple linear regression analysis using Taste as the response variable and Acetic as the explanatory variable. Be sure to examine the residuals carefully. Summarize your results. Include a plot of the data with the least-
11.60 Another simple linear regression model of Taste. Repeat the analysis of Exercise 11.59 using Taste as the response variable and H2S as the explanatory variable.
11.61 The final simple linear regression model of Taste. Repeat the analysis of Exercise 11.59 using Taste as the response variable and Lactic as the explanatory variable.
11.62 Comparing the simple linear regression models. Compare the results of the regressions performed in the three previous exercises. Construct a table with values of the F statistic, its P-value, $R^2$, and the estimate s of the standard deviation for each model. Report the three regression equations. Why are the intercepts in these three equations different?
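The comparison table can be assembled by running the same summary on each explanatory variable in turn. The sketch below assumes NumPy and simulated stand-ins for the three explanatory variables (the values are hypothetical, and the helper `simple_fit` is not from the text). It reports the intercept, slope, R^2, s, and the F statistic for each simple linear regression; the P-value for F, on 1 and n − 2 degrees of freedom, would be read from software.

```python
import numpy as np

def simple_fit(x, y):
    """Summary of a simple linear regression of y on x."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sse = float(resid @ resid)
    sst = float(np.sum((y - y.mean()) ** 2))
    mse = sse / (n - 2)                      # error mean square, df = n - 2
    return {"b0": float(b[0]), "b1": float(b[1]),
            "R2": 1.0 - sse / sst,
            "s": float(np.sqrt(mse)),
            "F": (sst - sse) / mse}          # model mean square has 1 df

# Simulated stand-in data (hypothetical; NOT the CHEESE file).
rng = np.random.default_rng(6)
n = 30
predictors = {"Acetic": rng.normal(5.5, 0.6, size=n),
              "H2S": rng.normal(6.0, 2.0, size=n),
              "Lactic": rng.normal(1.4, 0.3, size=n)}
taste = (-20.0 + 5.0 * predictors["H2S"] + 10.0 * predictors["Lactic"]
         + rng.normal(0.0, 10.0, size=n))

table = {name: simple_fit(x, taste) for name, x in predictors.items()}
for name, row in table.items():
    print(f"{name:<7} Taste-hat = {row['b0']:.2f} + {row['b1']:.2f} {name}; "
          f"F = {row['F']:.2f}, R^2 = {row['R2']:.3f}, s = {row['s']:.2f}")
```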
11.63 Multiple regression model of Taste. Carry out a multiple regression using Acetic and H2S to predict Taste. Summarize the results of your analysis. Compare the statistical significance of Acetic in this model with its significance in the model with Acetic alone as a predictor (Exercise 11.59). Which model do you prefer? Give a simple explanation for the fact that Acetic alone appears to be a good predictor of Taste, but with H2S in the model, it is not.
11.64 Another multiple regression model of Taste. Carry out a multiple regression using H2S and Lactic to predict Taste. When we compare the results of this analysis with the simple linear regressions using each of these explanatory variables alone, it is evident that a better result is obtained by using both predictors in a model. Support this statement with explicit information obtained from your analysis.
11.65 The final multiple regression model of Taste. Use the three explanatory variables Acetic, H2S, and Lactic in a multiple regression to predict Taste. Write a short summary of your results, including an examination of the residuals. Based on all the regression analyses you have carried out on these data, which model do you prefer and why?
11.66 Finding a multiple regression model on the Internet. Search the Internet to find an example of the use of multiple regression. Give the setting of the example, describe the data, give the model, and summarize the results. Explain why the use of multiple regression in this setting was appropriate or inappropriate.