SECTION 10.1 EXERCISES

For Exercise 10.1, see page 560; for Exercise 10.2, see page 565; for Exercises 10.3 and 10.4, see page 570; for Exercises 10.5 and 10.6, see pages 573–574.

Question 10.7

10.7 What’s wrong? For each of the following, explain what is wrong and why.

  1. (a) The parameters of the simple linear regression model are b0, b1, and s.

  2. (b) To test H0: b1 = 0, use a t test.

  3. (c) For a particular value of the explanatory variable x, the confidence interval for the mean response will be wider than the prediction interval for a future observation.

Question 10.8

10.8 What’s wrong? For each of the following, explain what is wrong and why.

  1. (a) The slope describes the change in x for a change in y.

  2. (b) The population regression line is y = b0 + b1x.

  3. (c) A 95% confidence interval for the mean response is the same width regardless of x.


Question 10.9

10.9 Importance of Normal model deviations. A general form of the central limit theorem tells us that the sampling distributions of b0 and b1 will be approximately Normal even if the model deviations are not Normally distributed. Using this fact, explain why the Normal distribution assumption is much more important for a prediction interval than for the confidence interval of the mean response at x = x*.
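
For reference, the standard errors behind the two intervals can be written side by side. This is a sketch in the usual simple linear regression notation, where s is the regression standard error, n is the sample size, and x-bar is the mean of the observed x values; these symbols are assumed here rather than defined in the exercise.

```latex
\[
SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}},
\qquad
SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}
\]
```

The two expressions differ only by the extra 1 under the second square root, which corresponds to the deviation of a single future observation from the line. That term is the one to focus on when answering this exercise.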

Question 10.10

10.10 Complete check of the residuals. In Example 10.12 (page 574), we checked model assumptions using a scatterplot (Figure 10.9). Let’s consider assessing the model assumptions using the residuals.

  1. (a) Fit the (EDUC, INC) data using least-squares regression and obtain the residuals. Write down the least-squares regression line.

  2. (b) Generate a plot of the residuals versus EDUC and comment on the pattern. Does a linear fit appear reasonable? Does there appear to be constant variance? Are there any unusual observations? Explain your answers.

  3. (c) Construct a histogram and a Normal quantile plot of the residuals. Do the residuals appear Normal? Explain your answer.

  4. (d) Analysis of the residuals is typically done because patterns in the residuals are easier to see. Do you think the plots in parts (b) and (c) magnify the violations of assumptions better than the scatterplot in Figure 10.9? Write a short paragraph comparing the scatterplot with the residual plots.
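
The residual checks in parts (a) through (c) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the textbook's software output. The `educ` and `inc` values below are hypothetical placeholders (the real EDUC and INC data are not reproduced in this section), and the function names are made up for this sketch. The plotting positions (i - 0.5)/n are one common convention for Normal scores; statistical packages may use a slightly different one.

```python
from statistics import NormalDist, mean

def fit_least_squares(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residuals(x, y, b0, b1):
    """Observed y minus predicted y for each case."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def normal_quantile_points(resid):
    """Pair Normal scores with the sorted residuals (points of a Normal quantile plot)."""
    n = len(resid)
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, sorted(resid)))

# Hypothetical values for illustration only; substitute the real EDUC and INC columns.
educ = [8, 10, 12, 12, 14, 16, 16, 18]
inc = [21, 30, 34, 41, 47, 62, 55, 90]

b0, b1 = fit_least_squares(educ, inc)
resid = residuals(educ, inc, b0, b1)
points = normal_quantile_points(resid)

print(f"y-hat = {b0:.2f} + {b1:.2f} x")
for xi, ri in zip(educ, resid):
    print(xi, round(ri, 2))  # plot these pairs for part (b)
```

Plotting `resid` against `educ` gives the residual plot for part (b), and plotting the pairs in `points` gives the Normal quantile plot for part (c).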

Question 10.11

10.11 Complete check of the residuals, continued. Refer to the previous exercise. In Example 10.13 (page 575), we checked model assumptions using a scatterplot (Figure 10.10) after log transforming the response variable.

  1. (a) Repeat parts (a) through (c) of the previous exercise using LOGINC and EDUC.

  2. (b) Do you think we can comfortably perform inference using the log transformed y? Explain your answer.

Question 10.12

10.12 College debt versus adjusted in-state costs. Kiplinger’s “Best Values in Public Colleges” provides a ranking of U.S. public colleges based on a combination of various measures of academics and affordability.5 Let’s focus on the relationship between the average debt in dollars at graduation (AveDebt) and the in-state cost per year after need-based aid (InCostAid). A scatterplot with least-squares regression line is shown in Figure 10.12 for a sample of 25 colleges from Kiplinger’s 2015 report.

  1. (a) Does a linear relationship between InCostAid and AveDebt seem reasonable? Explain your answer.

  2. (b) Are there any unusual cases in this sample? If yes, state which ones they are and how they may be affecting the least-squares model fit.

FIGURE 10.12 Scatterplot with least-squares regression line for a sample of 25 colleges from Kiplinger’s 2015 report, Exercise 10.12.

Question 10.13

10.13 Can we consider this an SRS? Refer to the previous exercise. The report states that Kiplinger’s rankings focus on traditional four-year public colleges with broad-based curricula. Each year, they start with more than 500 schools and then narrow the list down to roughly 120 based on academic quality before ranking them. The data set in the previous exercise is an SRS from Kiplinger’s published list of 100 schools. As far as investigating the relationship between the average debt and the in-state cost after adjusting for need-based aid, is it reasonable to consider this to be an SRS from the population of more than 500 schools? Write a short paragraph explaining your answer.


FIGURE 10.13 JMP output for the simple linear regression, Exercise 10.14.

Question 10.14

10.14 Predicting college debt. Refer to Exercise 10.12. Baruch College has substantially less average debt compared to the other schools with similar in-state costs. Figure 10.13 contains JMP output for the simple linear regression of AveDebt on InCostAid with this case removed.

  1. (a) State the least-squares regression line.

  2. (b) The University of North Florida is one school in this sample. It has an in-state cost of $11,421 and average debt of $17,617. What is the residual?

  3. (c) Construct a 95% confidence interval for the slope. What does this interval tell you about the change in average debt for a $1000 change in the in-state cost?

  4. (d) Penn State University is reported to have an adjusted in-state cost of $23,053. Discuss the appropriateness of using this data set to predict the average debt for this university.

Question 10.15

10.15 More on predicting college debt. Refer to the previous exercise. Appalachian State University has an in-state cost of $7372, and Texas A&M University has an in-state cost of $10,566.

  1. (a) Using your answer to part (a) of the previous exercise, what is the predicted average debt for a student at Appalachian State University?

  2. (b) What is the predicted average debt for a student at Texas A&M University?

  3. (c) Without doing any calculations, would the 95% margin of error for the predicted average debt be larger for Appalachian State University or Texas A&M University? Explain your answer.

Question 10.16

10.16 Impact of an unusual observation. Refer to Exercise 10.14. Baruch College was removed from this analysis because it was deemed an outlier. Let’s investigate its impact on the fit.

  1. (a) Refit the model using the entire sample of 25 schools. Create a table that summarizes the model estimates with and without this case.

  2. (b) Describe the impact this observation has on the fit of the linear regression model.

  3. (c) If you were writing a report for publication, would you include the fit with or without this case? Explain your answer.

Question 10.17

10.17 Predicting college debt: Other measures. Refer to Exercise 10.12. Let’s look at AveDebt and its relationship with the other explanatory variables in the data set. In addition to the in-state cost after aid (InCostAid), there is the admittance rate (Admit), the four-year graduation rate (GradRate), and out-of-state cost after aid (OutCostAid).

  1. (a) Generate scatterplots of each explanatory variable and AveDebt. Do all these relationships look linear? Describe what you see. Does Baruch College still look unusual?

  2. (b) Fit each of the explanatory variables separately and create a table that lists the explanatory variable, estimated model standard deviation s, and the P-value for the test of a linear association. For each analysis, make sure to specify whether you removed Baruch College or not.

  3. (c) Which variable do you think is the best single explanatory variable of average debt? Explain your answer.


Question 10.18

10.18 Are the two fuel efficiency measurements similar? Refer to Exercise 7.32 (page 429). In addition to the computer calculating miles per gallon (mpg), the driver also measured mpg by dividing the miles driven by the number of gallons at fill-up. The driver wants to determine if these calculations are similar.

Fill-up       1     2     3     4     5     6     7     8     9    10
Computer   41.5  50.7  36.6  37.3  34.2  45.0  48.0  43.2  47.7  42.2
Driver     36.5  44.2  37.2  35.6  30.5  40.5  40.0  41.0  42.8  39.2

Fill-up      11    12    13    14    15    16    17    18    19    20
Computer   43.2  44.6  48.4  46.4  46.8  39.2  37.3  43.5  44.3  43.3
Driver     38.8  44.5  45.4  45.3  45.7  34.2  35.2  39.8  44.9  47.5

  1. (a) Consider the driver’s mpg calculations as the explanatory variable. Plot the data and describe the relationship. Are there any outliers or unusual values? Does a linear relationship seem reasonable?

  2. (b) Run the simple linear regression and state the least-squares regression line.

  3. (c) Summarize the results. Does it appear that the computer and driver calculations are the same? Explain your answer.
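
One way to carry out part (b) without a statistics package is sketched below, using the 20 fill-ups from the table above with the driver's mpg as x and the computer's mpg as y. The function name `slope_inference` is hypothetical, and the value t* = 2.101 is the usual table value for 95% confidence with n - 2 = 18 degrees of freedom. The sketch reports estimates only; interpreting them is left to the exercise.

```python
from math import sqrt
from statistics import mean

# Data from the table in this exercise (fill-ups 1 through 20).
driver = [36.5, 44.2, 37.2, 35.6, 30.5, 40.5, 40.0, 41.0, 42.8, 39.2,
          38.8, 44.5, 45.4, 45.3, 45.7, 34.2, 35.2, 39.8, 44.9, 47.5]
computer = [41.5, 50.7, 36.6, 37.3, 34.2, 45.0, 48.0, 43.2, 47.7, 42.2,
            43.2, 44.6, 48.4, 46.4, 46.8, 39.2, 37.3, 43.5, 44.3, 43.3]

def slope_inference(x, y, t_star):
    """Least-squares estimates, s, SE of the slope, t statistic, and a CI for the slope."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))      # regression standard error
    se_b1 = s / sqrt(sxx)        # standard error of the slope
    return {
        "b0": b0,
        "b1": b1,
        "s": s,
        "se_b1": se_b1,
        "t": b1 / se_b1,         # statistic for H0: beta1 = 0
        "ci": (b1 - t_star * se_b1, b1 + t_star * se_b1),
    }

T_STAR_DF18 = 2.101  # 95% confidence, 18 degrees of freedom (t table)
result = slope_inference(driver, computer, T_STAR_DF18)
for name, value in result.items():
    print(name, value)
```

Note that the t statistic returned here tests a slope of 0. Whether the two measurements agree is a different question, so part (c) calls for looking at the fitted intercept and slope together.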

Question 10.19

10.19 Is the number of tornadoes increasing? The Storm Prediction Center of the National Oceanic and Atmospheric Administration maintains a database of tornadoes, floods, and other weather phenomena. Table 10.1 summarizes the annual number of tornadoes in the United States between 1953 and 2014.6

  1. (a) Make a plot of the total number of tornadoes by year. Does a linear trend over years appear reasonable? Are there any outliers or unusual patterns? Explain your answer.

  2. (b) Run the simple linear regression and report the least-squares regression line.

  3. (c) A friend of yours thinks you made a mistake fitting the model because b0 is a large negative value. Explain to him why this is not a mistake.

  4. (d) Obtain the residuals and plot them versus year. Are there any unusual patterns or cases that you did not discuss in part (a)? If so, comment on them.

    TABLE 10.1 Annual Number of Tornadoes in the United States Between 1953 and 2014

    Year  Tornadoes   Year  Tornadoes   Year  Tornadoes   Year  Tornadoes
    1953        421   1969        608   1985        684   2001       1215
    1954        550   1970        653   1986        764   2002        934
    1955        593   1971        888   1987        656   2003       1374
    1956        504   1972        741   1988        702   2004       1817
    1957        856   1973       1102   1989        856   2005       1265
    1958        564   1974        947   1990       1133   2006       1103
    1959        604   1975        920   1991       1132   2007       1096
    1960        616   1976        835   1992       1298   2008       1692
    1961        697   1977        852   1993       1176   2009       1156
    1962        657   1978        788   1994       1082   2010       1282
    1963        464   1979        852   1995       1235   2011       1691
    1964        704   1980        866   1996       1173   2012        938
    1965        906   1981        783   1997       1148   2013        907
    1966        585   1982       1046   1998       1449   2014        888
    1967        926   1983        931   1999       1340
    1968        660   1984        907   2000       1075

  5. (e) Are the residuals approximately Normal? Justify your answer.

  6. (f) Based on these residual checks, are you confident proceeding with inference? Explain your answer.
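
The behavior of the intercept raised in part (c) can be explored numerically. The sketch below fits the same line twice: once with the calendar year as x and once with years since 1953 as x. It uses only the first four rows of Table 10.1 to keep the arithmetic small, so the estimates are illustrative and are not the answers for the full data set. The function name is hypothetical.

```python
from statistics import mean

def fit_least_squares(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# First four rows of Table 10.1 only (illustration, not the full analysis).
years = [1953, 1954, 1955, 1956]
counts = [421, 550, 593, 504]

# x = calendar year: b0 is the fitted value at year 0, far outside the data.
b0_raw, b1_raw = fit_least_squares(years, counts)

# x = years since 1953: b0 is the fitted value for 1953.
years_since_1953 = [yr - 1953 for yr in years]
b0_shift, b1_shift = fit_least_squares(years_since_1953, counts)

print("calendar year:    b0 =", b0_raw, " b1 =", b1_raw)
print("years since 1953: b0 =", b0_shift, " b1 =", b1_shift)
```

Shifting x leaves the slope and every fitted value unchanged; only the point at which the intercept is evaluated moves.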


Question 10.20

10.20 Annual increase? Refer to the previous exercise. Let’s proceed with inference.

  1. (a) Do these data support a linear trend in the number of tornadoes? Justify your answer.

  2. (b) Construct a 95% confidence interval for the average annual increase in the number of tornadoes. Explain how this interval can be used to justify your response in part (a).

  3. (c) What is the predicted number of tornadoes in 2015?

  4. (d) Provide an interval that should contain the actual count 95% of the time.
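
Parts (c) and (d) ask for a prediction and a 95% prediction interval. A minimal sketch of the computation follows, assuming the standard simple linear regression formulas and a t* value read from a table. The function name `intervals_at` and the small data set are hypothetical and serve only to show the mechanics. For the tornado data, n = 62 years gives 60 degrees of freedom, for which the 95% table value is t* = 2.000.

```python
from math import sqrt
from statistics import mean

def intervals_at(x, y, x_star, t_star):
    """Return (y_hat, CI for the mean response, prediction interval) at x = x_star."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))                    # regression standard error
    y_hat = b0 + b1 * x_star
    h = 1 / n + (x_star - x_bar) ** 2 / sxx    # shared term under both square roots
    se_mean = s * sqrt(h)                      # SE for the mean response
    se_pred = s * sqrt(1 + h)                  # SE for a single future observation
    ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
    pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
    return y_hat, ci, pi

# Hypothetical data for illustration; n = 6, so df = 4 and t* = 2.776 for 95%.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
y_hat, ci, pi = intervals_at(x, y, x_star=7, t_star=2.776)

print("prediction:", y_hat)
print("95% CI for mean response:", ci)
print("95% prediction interval:", pi)
```

To apply this to the exercise, pass the years and counts from Table 10.1 with `x_star=2015` and `t_star=2.000`.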

Question 10.21

10.21 Computer memory. The capacity of memory commonly available at retail has increased rapidly over time.7

  1. (a) Make a scatterplot of the data. The growth is much faster than linear.

  2. (b) Compute the logarithm of capacity and plot it against year. Are these points closer to a straight line?

  3. (c) Fit the simple linear regression model with logarithm of capacity as the response and year as the explanatory variable. Give a 90% confidence interval for the slope of the population regression line.

  4. (d) Write a brief summary describing the change in memory capacity over time using the confidence interval from part (c).
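
For part (d), it helps to see what a slope on the log scale means on the original scale. The relation below is a sketch that assumes natural logarithms (for base-10 logarithms, replace e with 10); the symbols L and U for the interval endpoints are introduced here for illustration.

```latex
\[
\log \hat{y} = b_0 + b_1 x
\quad\Longleftrightarrow\quad
\hat{y} = e^{b_0}\,\bigl(e^{b_1}\bigr)^{x}
\]
\[
\text{If } (L,\, U) \text{ is a confidence interval for } \beta_1,
\text{ then } \bigl(e^{L},\, e^{U}\bigr)
\text{ is the corresponding interval for the yearly growth factor } e^{\beta_1}.
\]
```

A constant additive change in log capacity per year therefore corresponds to a constant multiplicative change in capacity per year.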

Question 10.22

10.22 Alternative model. Refer to Exercise 10.19. The number of tornadoes in 2004 is much larger than expected and the number in 2014 is much smaller than expected. In fact, most of the large positive and negative deviations occur later in time. This suggests there may not be constant variance. Because the response variable is a count, one can argue the variance is not constant (for example, see the Poisson distribution, page 329).

  1. (a) Take the natural logarithm of the count and refit the model. What is the least-squares regression line?

  2. (b) Check the residuals of this model. Does the linear regression model fit these data? Explain your answer.

  3. (c) When the response y is on the log scale, the slope approximates the percent change in y for a unit increase in x. Construct an approximate 95% confidence interval for the annual percent change.

  4. (d) Does this model also support the hypothesis that tornadoes have increased over time? Explain your answer.

  5. (e) Construct a prediction interval for the predicted number of tornadoes in 2015 and compare it with the interval from part (d) of Exercise 10.20. (Note: An approximate interval can be constructed by first obtaining a prediction interval for log y and then taking the antilog (inverse function of log) of each interval endpoint.)

  6. (f) Which of the two models (and prediction) do you prefer? Explain why.
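
The back-transformations used in parts (c) and (e) can be sketched as follows, assuming the natural logarithm was used as in part (a). The interval endpoints below are hypothetical placeholders, not results from the tornado data, and the function names are made up for this illustration.

```python
from math import exp

def back_transform_interval(log_lo, log_hi):
    """Antilog of each endpoint of an interval computed on the natural-log scale."""
    return exp(log_lo), exp(log_hi)

def percent_change(slope):
    """Exact percent change in y per unit increase in x when ln(y) is the response."""
    return 100 * (exp(slope) - 1)

# Hypothetical endpoints for illustration only.
slope_ci = (0.010, 0.016)   # CI for the slope on the log scale
log_pi = (6.5, 7.5)         # prediction interval for ln(count)

approx_pct = tuple(100 * b for b in slope_ci)          # the approximation in part (c)
exact_pct = tuple(percent_change(b) for b in slope_ci)
count_pi = back_transform_interval(*log_pi)

print("approximate annual percent change:", approx_pct)
print("exact annual percent change:", exact_pct)
print("prediction interval on the count scale:", count_pi)
```

For small slopes the approximate and exact percent changes are close, which is why part (c) treats the slope as an approximation. The back-transformed prediction interval is not symmetric about the back-transformed prediction.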