CHAPTER 11 EXERCISES

Question 11.25

11.25 Checking for a polynomial relationship. When looking at the residuals from the simple linear model of BMI versus physical activity (PA), Figure 10.5 (page 566) suggested a possible curvilinear relationship. Let’s investigate this further. Multiple regression can be used to fit the polynomial curve of degree q, y = β0 + β1x + β2x^2 + … + βqx^q, through the creation of additional explanatory variables x^2, x^3, etc. Let’s investigate a quadratic fit (q = 2) for the physical activity problem.

  1. (a) It is often best to subtract the sample mean before creating the necessary explanatory variables. In this case, the average number of steps per day is 8.614. Create new explanatory variables x1 = (PA − 8.614) and x2 = (PA − 8.614)^2 and run a multiple regression for BMI using the explanatory variables x1 and x2. Write down the fitted regression line.

  2. (b) The regression model that included only PA had R^2 = 14.9%. What is R^2 with the inclusion of this quadratic term?

  3. (c) Obtain the residuals from part (a) and check the multiple regression assumptions. Are there any remaining patterns in the data? Are the residuals approximately Normal? Explain.

  4. (d) Test the hypothesis that the coefficient of the variable (PA − 8.614)^2 is equal to 0. Report the t statistic, degrees of freedom, and P-value. Does the quadratic term contribute significantly to the fit? Explain your answer.
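
The centering and squaring steps in part (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up to follow an exact quadratic (they are not the PA and BMI data), and the helper names `solve_linear_system` and `fit_centered_quadratic` are hypothetical. In practice, statistical software would also report the standard errors and t statistics needed for part (d).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations update both sides together.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_centered_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*(x - xbar) + b2*(x - xbar)^2."""
    xbar = sum(x) / len(x)
    # Each row holds the intercept column, x1 = x - xbar, and x2 = x1^2.
    rows = [[1.0, xi - xbar, (xi - xbar) ** 2] for xi in x]
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    coefs = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    ybar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return xbar, coefs, 1.0 - sse / sst


# Made-up data (assumption): an exact quadratic, used only to check the arithmetic.
pa = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
bmi = [30.0 - 1.5 * p + 0.05 * p ** 2 for p in pa]
xbar, coefs, r2 = fit_centered_quadratic(pa, bmi)
print(xbar, coefs, r2)
```

Centering does not change the fitted curve. It only changes how the intercept and linear coefficient are expressed, and it reduces the correlation between the linear and quadratic terms.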

Question 11.26

11.26 Architectural firm billings. A summary of firms engaged in commercial architecture in the Indianapolis, Indiana, area provides firm characteristics, including total annual billing in the current year, total annual billing in the previous year, the number of architects, the number of engineers, and the number of staff employed in the firm.6 Consider developing a model to predict current total billing using the other four variables.

  1. (a) Using numerical and graphical summaries, describe the distribution of current and past year total billing and the number of architects, engineers, and staff.

  2. (b) For each of the 10 pairs of variables, use graphical and numerical summaries to describe the relationship.

  3. (c) Carry out a multiple regression. Report the fitted regression equation and the value of the regression standard error s.

  4. (d) Analyze the residuals from the multiple regression. Are there any concerns?

  5. (e) A firm did not report its current total billing but had $1 million in billing last year and employs three architects, one engineer, and 17 staff members. What is the predicted total billing for this firm?

  6. (f) This analysis utilized the data from all commercial firms in the Indianapolis area that responded to the survey. Provide justification for the use of inference under this setting.

The following six exercises use the MOVIES data file. This data set contains an SRS of 43 movies released four to five years ago to guarantee they are no longer in the theaters. This sample was collected from the Internet Movie Database (IMDb) to see if information available soon after a movie’s theatrical release can successfully predict total U.S. revenue.7 All dollar amounts are measured in millions of U.S. dollars.


Question 11.27

11.27 Predicting movie revenue—preliminary analysis. The response variable is a movie’s total U.S. revenue (USRevenue). Let’s consider as explanatory variables the movie’s budget (Budget); opening-weekend revenue (Opening); the number of theaters (Theaters) the movie was in for the opening weekend; and the movie’s IMDb rating (Ratings), which is on a 1 to 10 scale (10 being best). While this rating is updated continuously, we’ll assume that the current rating is the rating at the end of the first week.

  1. (a) Using numerical and graphical summaries, describe the distribution of each explanatory variable. Are there any unusual observations that should be monitored?

  2. (b) Using numerical and/or graphical summaries, describe the relationship between each pair of explanatory variables.

Question 11.28

11.28 Predicting movie revenue—simple linear regressions. Now let’s look at the response variable and its relationship with each explanatory variable.

  1. (a) Using numerical and graphical summaries, describe the distribution of the response variable, USRevenue.

  2. (b) This variable is not Normally distributed. Does this violate one of the key model assumptions? Explain.

  3. (c) Generate scatterplots of each explanatory variable and USRevenue. Do all these relationships look linear? Explain what you see.

Question 11.29

11.29 Predicting movie revenue—multiple linear regression. Now consider fitting a model using all the explanatory variables.

  1. (a) Write out the statistical model for this analysis, making sure to specify all assumptions.

  2. (b) Run the multiple regression model and specify the fitted regression equation.

  3. (c) Obtain the residuals from part (b) and check assumptions. Comment on any unusual residuals or patterns in the residuals.

  4. (d) What percent of the variability in USRevenue is explained by this model?

Question 11.30

11.30 A simpler model. In the multiple regression analysis using all four explanatory variables, Theaters and Budget appear to be the least helpful (given that the other two explanatory variables are in the model).

  1. (a) Perform a new analysis using only the movie’s opening-weekend revenue and IMDb rating. Give the estimated regression equation for this analysis.

  2. (b) What percent of the variability in USRevenue is explained by this model?

  3. (c) Test the null hypothesis that Theaters and Budget combined add no additional predictive information beyond what is already contained in Opening and Ratings.
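
One standard way to carry out the test in part (c) is the F test that compares the full four-variable model with the reduced two-variable model. The sketch below shows only the arithmetic. The function name `partial_f_statistic` and the sums of squares are hypothetical, chosen to illustrate the calculation; they are not results from the MOVIES data.

```python
def partial_f_statistic(sse_reduced, sse_full, q, n, p):
    """F statistic for testing that q coefficients in the full model are all 0.

    sse_reduced: error sum of squares for the model without the q variables
    sse_full:    error sum of squares for the model with all p variables
    n:           number of observations
    """
    numerator = (sse_reduced - sse_full) / q
    denominator = sse_full / (n - p - 1)
    return numerator / denominator


# Hypothetical sums of squares (assumption), with n = 43 movies,
# p = 4 explanatory variables in the full model, and q = 2 dropped.
f_stat = partial_f_statistic(sse_reduced=120.0, sse_full=100.0, q=2, n=43, p=4)
degrees_of_freedom = (2, 43 - 4 - 1)
print(f_stat, degrees_of_freedom)
```

The P-value comes from the F distribution with q and n − p − 1 degrees of freedom, here 2 and 38, and is usually obtained from software.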

Question 11.31

11.31 Predicting U.S. movie revenue. The movie Kick-Ass was released during this same time period. It had a budget of $30.0 million and was shown in 3065 theaters, grossing $19.83 million during the first weekend. Use software to construct the following.

  1. (a) A 95% prediction interval based on the model with the three explanatory variables given above (Budget, Theaters, and Opening).

  2. (b) A 95% prediction interval based on the model using only opening-weekend revenue and budget.

  3. (c) Compare the two intervals. Do the models give similar predictions and standard errors?

Question 11.32

11.32 Considering the log transformation. Refer to Exercise 11.29. Variables like income often have very skewed distributions. This can result in certain cases strongly influencing the fit of the model. A common remedy is to take the log before analysis. Create a new response variable by taking the log of USRevenue and fit the model using all four predictors. Obtain the residuals and assess the model conditions. Do these data fit the linear regression model better than the untransformed data? Explain your answer.

The following three exercises use the RANKINGS data file. Since 2004, The Times Higher Education Supplement has provided an annual ranking of the world universities. A total score for each university is calculated based on the scores for the following explanatory variables: Teaching (30%), Research (30%), Citations (30%), Industry Income (2.5%), and International Outlook (7.5%). The percents represent the contributions of each score to the total. For our purposes, we will assume that these weights are unknown and will focus on the development of a model for the total score based on the first three explanatory variables. The report includes a table for the top 200 universities.8 The RANKINGS data file contains a random sample of 55 of these universities. This is not a random sample of all universities, but for our purposes here, we will consider it to be.


Question 11.33

11.33 Annual ranking of world universities. Let’s consider developing a model to predict total score (Overall) based on the teaching, research, and citations scores.

  1. (a) Using numerical and graphical summaries, describe the distribution of each explanatory variable.

  2. (b) Using numerical and graphical summaries, describe the relationship between each pair of explanatory variables.

Question 11.34

11.34 Looking at the simple linear regressions. Now let’s look at the relationship between each explanatory variable and the total score.

  1. (a) Generate scatterplots for each explanatory variable and the total score. Do these relationships all look linear?

  2. (b) Compute the correlation between each explanatory variable and the total score. Are certain explanatory variables more strongly associated with the total score?

Question 11.35

11.35 Multiple linear regression model. Now consider a regression model using all three explanatory variables.

  1. (a) Write out the statistical model for this analysis, making sure to specify all assumptions.

  2. (b) Run the multiple regression model and specify the fitted regression equation.

  3. (c) Generate a 95% confidence interval for each coefficient. Should any of these intervals contain 0? Explain.

  4. (d) What percent of the variation in total score is explained by this model? What is the estimate for σ?
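
The two quantities asked for in part (d) both come from the sums of squares in the ANOVA table. The following sketch shows the arithmetic with hypothetical values: the sums of squares are invented for illustration and are not results from the RANKINGS data, and the function names are assumptions.

```python
import math


def r_squared(sse, sst):
    """Proportion of the variation in the response explained by the model."""
    return 1.0 - sse / sst


def regression_standard_error(sse, n, p):
    """Estimate s of sigma, using n - p - 1 degrees of freedom for error."""
    return math.sqrt(sse / (n - p - 1))


# Hypothetical sums of squares (assumption), with n = 55 universities
# and p = 3 explanatory variables.
r2 = r_squared(sse=204.0, sst=5100.0)
s = regression_standard_error(sse=204.0, n=55, p=3)
print(r2, s)
```

With 55 observations and three explanatory variables, the error degrees of freedom are 55 − 3 − 1 = 51.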

Question 11.36

11.36 Predicting GPA of seventh-graders. Refer to the educational data for 78 seventh-grade students given in Table 1.3 (page 26). We view GPA as the response variable. IQ, gender, and self-concept are the explanatory variables.

  1. (a) Find the correlation between GPA and each of the explanatory variables. What percent of the total variation in student GPAs can be explained by the straight-line relationship with each of the explanatory variables?

  2. (b) The importance of IQ in explaining GPA is not surprising. The purpose of the study is to assess the influence of self-concept on GPA. So we will include IQ in the regression model and ask, “How much does self-concept contribute to explaining GPA after the effect of IQ on GPA is taken into account?” Give a model that can be used to answer this question.

  3. (c) Run the model and report the fitted regression equation. What percent of the variation in GPA is explained by the explanatory variables in your model?

  4. (d) Translate the question of interest into appropriate null and alternative hypotheses about the model parameters. Give the value of the test statistic and its P-value. Write a short summary of your analysis with an emphasis on your conclusion.

The following three exercises use the HAPPY data file. The World Database of Happiness is an online registry of scientific research on the subjective appreciation of life. It is available at worlddatabaseofhappiness.eur.nl, and the project is directed by Dr. Ruut Veenhoven, Erasmus University, Rotterdam. One inventory presents the “average happiness” score for various nations. This average is based on individual responses from numerous general population surveys to a general life satisfaction (well-being) question. Scores range from 0 (dissatisfied) to 10 (satisfied). The Nation-Master website, www.nationmaster.com, contains a collection of statistics associated with various nations. For our analysis, we will consider the GINI index, which measures the degree of inequality in the distribution of income (higher score = greater inequality), the degree of corruption in government (higher score = less corruption), average life expectancy, and the degree of democracy (higher score = more civil and political liberties).

Question 11.37

11.37 Predicting a nation’s “average happiness” score. Consider the five statistics for each nation: LSI, the average life-satisfaction score; GINI, the GINI index; CORRUPT, the degree of government corruption; LIFE, the average life expectancy; and DEMOCRACY, a measure of civil and political liberties.

  1. (a) Using numerical and graphical summaries, describe the distribution of each variable.

  2. (b) Using numerical and graphical summaries, describe the relationship between each pair of variables.

Question 11.38

11.38 Building a multiple linear regression model. Let’s now build a model to predict the life-satisfaction score, LSI.

  1. (a) Consider a simple linear regression using GINI as the explanatory variable. Run the regression and summarize the results. Be sure to check assumptions.

  2. (b) Now consider a model using GINI and LIFE. Run the multiple regression and summarize the results. Again be sure to check assumptions.

  3. (c) Now consider a model using GINI, LIFE, and DEMOCRACY. Run the multiple regression and summarize the results. Again be sure to check assumptions.

  4. (d) Now consider a model using all four explanatory variables. Again summarize the results and check assumptions.


Question 11.39

11.39 Selecting from among several models. Refer to the results from the previous exercise.

  1. (a) Make a table giving the estimated regression coefficients, standard errors, t statistics, and P-values.

  2. (b) Describe how the coefficients and P-values change for the four models.

  3. (c) Based on the table of coefficients, suggest another model. Run that model, summarize the results, and compare it with the other ones. Which model would you choose to explain LSI? Explain.

The following six exercises use the BIOMARK data file. Healthy bones are continually being renewed by two processes. Through bone formation, new bone is built; through bone resorption, old bone is removed. If one or both of these processes are disturbed—by disease, aging, or space travel, for example—bone loss can be the result. The variables VO+ and VO− measure bone formation and bone resorption, respectively. Osteocalcin (OC) is a biochemical marker for bone formation: higher levels of bone formation are associated with higher levels of OC. A blood sample is used to measure OC, and it is much less expensive to obtain than direct measures of bone formation. The units are milligrams of OC per milliliter of blood (mg/ml). Similarly, tartrate-resistant acid phosphatase (TRAP) is a biochemical marker for bone resorption that is also measured in blood. It is measured in units per liter (U/l). These variables were measured in a study of 31 healthy women aged 11 to 32 years.9 Variables with the first letter “L” are the logarithms of the measured variables.

Question 11.40

11.40 Bone formation and resorption. Consider the following four variables: VO+, a measure of bone formation; VO−, a measure of bone resorption; OC, a biomarker of bone formation; and TRAP, a biomarker of bone resorption.

  1. (a) Using numerical and graphical summaries, describe the distribution of each of these variables.

  2. (b) Using numerical and graphical summaries, describe the relationship between each pair of variables.

Question 11.41

11.41 Predicting bone formation. Let’s use regression methods to predict VO+, the measure of bone formation.

  1. (a) Because OC is a biomarker of bone formation, we start with a simple linear regression using OC as the explanatory variable. Run the regression and summarize the results. Be sure to include an analysis of the residuals.

  2. (b) Because the processes of bone formation and bone resorption are highly related, it is possible that there is some information in the bone resorption variables that can tell us something about bone formation. Use a model with both OC and TRAP, the biomarker of bone resorption, to predict VO+. Summarize the results. In the context of this model, it appears that TRAP is a better predictor of bone formation, VO+, than the biomarker of bone formation, OC. Is this view consistent with the pattern of relationships that you described in the previous exercise? One possible explanation is that, although all these variables are highly related, TRAP is measured with more precision than OC.

Question 11.42

11.42 More on predicting bone formation. Now consider a regression model for predicting VO+ using OC, TRAP, and VO−.

  1. (a) Write out the statistical model for this analysis including all assumptions.

  2. (b) Run the multiple regression to predict VO+ using OC, TRAP, and VO−. Summarize the results.

  3. (c) Make a table giving the estimated regression coefficients, standard errors, and t statistics with P-values for this analysis and for the two that you ran in the previous exercise. Describe how the coefficients and the P-values differ for the three analyses.

  4. (d) Give the percent of variation in VO+ explained by each of the three models and the estimate of σ. Give a short summary.

  5. (e) The results you found in part (b) suggest another model. Run that model, summarize the results, and compare them with the results in part (b).

Question 11.43

11.43 Predicting bone formation using transformed variables. Because the distributions of VO+, VO−, OC, and TRAP tend to be skewed, it is common to work with logarithms rather than the measured values. Using the questions in the previous three exercises as a guide, analyze the log data.

Question 11.44

11.44 Predicting bone resorption. Refer to Exercises 11.40, 11.41, and 11.42. Answer these questions with the roles of VO+ and VO− reversed; that is, run models to predict VO−, with VO+ as an explanatory variable.

Question 11.45

11.45 Predicting bone resorption using transformed variables. Refer to the previous exercise. Rerun using logs.

The following 11 exercises use the PCB data file. Polychlorinated biphenyls (PCBs) are a collection of synthetic compounds, called congeners, that are particularly toxic to fetuses and young children. Although PCBs are no longer produced in the United States, they are still found in the environment. Because human exposure to these PCBs is primarily through the consumption of fish, the Environmental Protection Agency (EPA) monitors the PCB levels in fish. Unfortunately, there are 209 different congeners and measuring all of them in a fish specimen is an expensive and time-consuming process. You’ve been asked to see if the total amount of PCBs in a specimen can be estimated with only a few, easily quantifiable congeners.10 If this can be done, costs can be greatly reduced.


Question 11.46

11.46 Relationships among PCB congeners. Consider the following variables: PCB (the total amount of PCB) and four congeners: PCB52, PCB118, PCB138, and PCB180.

  1. (a) Using numerical and graphical summaries, describe the distribution of each of these variables.

  2. (b) Using numerical and graphical summaries, describe the relationship between each pair of variables.

Question 11.47

11.47 Predicting the total amount of PCB. Use the four congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict PCB.

  1. (a) Write the statistical model for this analysis. Include all assumptions.

  2. (b) Run the regression and summarize the results.

  3. (c) Examine the residuals. Do they appear to be approximately Normal? When you plot them versus each of the explanatory variables, are any patterns evident?

Question 11.48

11.48 Adjusting the analysis for potential outliers. The examination of the residuals in part (c) of the previous exercise suggests that there may be two outliers, one with a high residual and one with a low residual.

  1. (a) Because of safety issues, we are more concerned about underestimating PCB in a specimen than about overestimating. Give the specimen number for each of the two suspected outliers. Which one corresponds to an overestimate of PCB?

  2. (b) Rerun the analysis with the two suspected outliers deleted, summarize these results, and compare them with those you obtained in the previous exercise.

Question 11.49

11.49 More on predicting the total amount of PCB. Run a regression to predict PCB using the variables PCB52, PCB118, and PCB138. Note that this is similar to the analysis that you did in Exercise 11.47, with the change that PCB180 is not included as an explanatory variable.

  1. (a) Summarize the results.

  2. (b) In this analysis, the regression coefficient for PCB118 is not statistically significant. Give the estimate of the coefficient and the associated P-value.

  3. (c) Find the estimate of the coefficient for PCB118 and the associated P-value for the model analyzed in Exercise 11.47.

  4. (d) Using the results in parts (b) and (c), write a short paragraph explaining how the inclusion of other variables in a multiple regression can have an effect on the estimate of a particular coefficient and the results of the associated significance test.

Question 11.50

11.50 Multiple regression model for total TEQ. Dioxins and furans are other classes of chemicals that can cause undesirable health effects similar to those caused by PCB. The three types of chemicals are combined using toxic equivalent scores (TEQs), which attempt to measure the health effects on a common scale. The PCB data file contains TEQs for PCB, dioxins, and furans. The variables are called TEQPCB, TEQDIOXIN, and TEQFURAN. The data file also includes the total TEQ, defined to be the sum of these three variables.

  1. (a) Consider using a multiple regression to predict TEQ using the three components TEQPCB, TEQDIOXIN, and TEQFURAN as explanatory variables. Write the multiple regression model in the form

     TEQ = β0 + β1 TEQPCB + β2 TEQDIOXIN + β3 TEQFURAN + ϵ

     Give numerical values for the parameters β0, β1, β2, and β3.

  2. (b) The multiple regression model assumes that the ϵ’s are Normal with mean zero and standard deviation σ. What is the numerical value of σ?

  3. (c) Use software to run this regression and summarize the results.

Question 11.51

11.51 Multiple regression model for total TEQ, continued. The information summarized in TEQ is used to assess and manage risks from these chemicals. For example, the World Health Organization (WHO) has established the tolerable daily intake (TDI) as one to four TEQs per kilogram of body weight per day. Therefore, it would be very useful to have a procedure for estimating TEQ using just a few variables that can be measured cheaply. Use the four PCB congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict TEQ. Give a description of the model and assumptions, summarize the results, examine the residuals, and write a summary of what you have found.

Question 11.52

11.52 Predicting total amount of PCB using transformed variables. Because distributions of variables such as PCB, the PCB congeners, and TEQ tend to be skewed, researchers frequently analyze the logarithms of the measured variables. Create a data set that has the logs of each of the variables in the PCB data file. Note that zero is a possible value for PCB126; most software packages will eliminate these cases when you request a log transformation.

  1. (a) If you do not do anything about the 16 zero values of PCB126, what does your software do with these cases? Is there an error message of some kind?

  2. (b) If you attempt to run a regression to predict the log of PCB using the log of PCB126 and the log of PCB52, are the cases with the zero values of PCB126 eliminated? Do you think that this is a good way to handle this situation?

  3. (c) The smallest nonzero value of PCB126 is 0.0052. One common practice when taking logarithms of measured values is to replace the zeros by one-half of the smallest observed value. Create a logarithm data set using this procedure; that is, replace the 16 zero values of PCB126 by 0.0026 before taking logarithms. Use numerical and graphical summaries to describe the distributions of the log variables.
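
The replacement rule in part (c) can be sketched as follows. The list of values is made up for illustration, apart from the smallest nonzero value 0.0052 quoted above. The function name `log_with_half_min` is hypothetical, and the natural log is an assumption, since the exercise does not say which base to use.

```python
import math


def log_with_half_min(values):
    """Take logs after replacing zeros by half the smallest nonzero value."""
    smallest_nonzero = min(v for v in values if v > 0)
    replacement = smallest_nonzero / 2
    return [math.log(v if v > 0 else replacement) for v in values]


# Made-up PCB126 values (assumption); only 0.0052 comes from the exercise.
pcb126 = [0.0, 0.0052, 0.031, 0.0, 0.12]
log_pcb126 = log_with_half_min(pcb126)
print(log_pcb126)
```

Every case is kept under this rule, whereas calling `math.log(0.0)` directly raises a `ValueError` in Python.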


Question 11.53

11.53 Predicting total amount of PCB using transformed variables, continued. Refer to the previous exercise.

  1. (a) Use numerical and graphical summaries to describe the relationships between each pair of log variables.

  2. (b) Compare these summaries with the summaries that you produced in Exercise 11.46 for the measured variables.

Question 11.54

11.54 Even more on predicting total amount of PCB using transformed variables. Use the log data set that you created in Exercise 11.52 to find a good multiple regression model for predicting the log of PCB. Use only log PCB variables for this analysis. Write a report summarizing your results.

Question 11.55

11.55 Predicting total TEQ using transformed variables. Use the log data set that you created in Exercise 11.52 to find a good multiple regression model for predicting the log of TEQ. Use only log PCB variables for this analysis. Write a report summarizing your results and comparing them with the results that you obtained in the previous exercise.

Question 11.56

11.56 Interpretation of coefficients in log PCB regressions. Use the results of your analysis of the log PCB data in Exercise 11.54 to write an explanation of how regression coefficients, standard errors of regression coefficients, and tests of significance for explanatory variables can change depending on what other explanatory variables are included in the multiple regression analysis.

The following nine exercises use the CHEESE data file. As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. The variable “Case” is used to number the observations from 1 to 30. “Taste” is the response variable of interest. The taste scores were obtained by combining the scores from several tasters. Three of the chemicals whose concentrations were measured were acetic acid, hydrogen sulfide, and lactic acid. For acetic acid and hydrogen sulfide, (natural) log transformations were taken. Thus, the explanatory variables are the transformed concentrations of acetic acid (“Acetic”) and hydrogen sulfide (“H2S”) and the untransformed concentration of lactic acid (“Lactic”).11

Question 11.57

11.57 Describing the explanatory variables. For each of the four variables in the CHEESE data file, find the mean, median, standard deviation, and interquartile range. Display each distribution by means of a stemplot and use a Normal quantile plot to assess Normality of the data. Summarize your findings. Note that when doing regressions with these data, we do not assume that these distributions are Normal. Only the residuals from our model need to be (approximately) Normal. The careful study of each variable to be analyzed is, nonetheless, an important first step in any statistical analysis.

Question 11.58

11.58 Pairwise scatterplots of the explanatory variables. Make a scatterplot for each pair of variables in the CHEESE data file (you will have six plots). Describe the relationships. Calculate the correlation for each pair of variables and report the P-value for the test of zero population correlation in each case.
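
The correlation and the test statistic for zero population correlation follow a short calculation, sketched here with made-up numbers (not the CHEESE data). The function names are hypothetical. The P-value is not computed in this sketch; it comes from the t distribution with n − 2 degrees of freedom and is usually read from software.

```python
import math


def pearson_r(x, y):
    """Sample correlation between two lists of equal length."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def t_for_zero_correlation(r, n):
    """t statistic for testing zero population correlation (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Made-up data (assumption), used only to illustrate the arithmetic.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)
t = t_for_zero_correlation(r, len(x))
print(r, t)
```

For the 30 cases in the CHEESE data file, each of these tests would use 30 − 2 = 28 degrees of freedom.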

Question 11.59

11.59 Simple linear regression model of Taste. Perform a simple linear regression analysis using Taste as the response variable and Acetic as the explanatory variable. Be sure to examine the residuals carefully. Summarize your results. Include a plot of the data with the least-squares regression line. Plot the residuals versus each of the other two chemicals. Are any patterns evident? (The concentrations of the other chemicals are lurking variables for the simple linear regression.)

Question 11.60

11.60 Another simple linear regression model of Taste. Repeat the analysis of Exercise 11.59 using Taste as the response variable and H2S as the explanatory variable.

Question 11.61

11.61 The final simple linear regression model of Taste. Repeat the analysis of Exercise 11.59 using Taste as the response variable and Lactic as the explanatory variable.

Question 11.62

11.62 Comparing the simple linear regression models. Compare the results of the regressions performed in the three previous exercises. Construct a table with values of the F statistic, its P-value, R^2, and the estimate s of the standard deviation for each model. Report the three regression equations. Why are the intercepts in these three equations different?


Question 11.63

11.63 Multiple regression model of Taste. Carry out a multiple regression using Acetic and H2S to predict Taste. Summarize the results of your analysis. Compare the statistical significance of Acetic in this model with its significance in the model with Acetic alone as a predictor (Exercise 11.59). Which model do you prefer? Give a simple explanation for the fact that Acetic alone appears to be a good predictor of Taste, but with H2S in the model, it is not.

Question 11.64

11.64 Another multiple regression model of Taste. Carry out a multiple regression using H2S and Lactic to predict Taste. When we compare the results of this analysis with the simple linear regressions using each of these explanatory variables alone, it is evident that a better result is obtained by using both predictors in a model. Support this statement with explicit information obtained from your analysis.

Question 11.65

11.65 The final multiple regression model of Taste. Use the three explanatory variables Acetic, H2S, and Lactic in a multiple regression to predict Taste. Write a short summary of your results, including an examination of the residuals. Based on all the regression analyses you have carried out on these data, which model do you prefer and why?

Question 11.66

11.66 Finding a multiple regression model on the Internet. Search the Internet to find an example of the use of multiple regression. Give the setting of the example, describe the data, give the model, and summarize the results. Explain why the use of multiple regression in this setting was appropriate or inappropriate.