CHAPTER 11 Review Exercises

Question 11.84

11.84 Alternate movie revenue model.

CASE 11.2 Refer to the data set on movie revenue in Case 11.2 (page 550). The variables Budget, Opening, and USRevenue all have distributions with long tails. For this problem, let’s consider building a model using the logarithm transformation of these variables.

movies

  1. Run the multiple regression to predict the logarithm of USRevenue using the logarithm of Budget, the logarithm of Opening, and Theaters, and obtain the residuals. Examine the residuals graphically. Does the distribution appear approximately Normal? Explain your answer.
  2. State the regression equation and note which coefficients are statistically significant at the 5% level.
  3. In Exercise 11.37 (page 555), you were asked to predict the revenue of a particular movie. Using the results from this new model, construct a 95% prediction interval for the movie’s log USRevenue.
  4. The movie The Hangover has the largest residual. Remove this movie and refit the model in part (a). Compare these results with the results in part (b). Does it appear this case is an influential observation? Explain your answer.
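The workflow in part 1 can be sketched in pure Python. This is a minimal illustration with invented numbers (the budgets, openings, theater counts, and revenues below are hypothetical, not the MOVIES data); the `ols` helper solves the least-squares normal equations directly:

```python
import math

def ols(X, y):
    """Least-squares fit: solve (X'X) b = X'y by Gaussian elimination.
    Each row of X should start with a 1 for the intercept."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)] for r in range(k)]
    v = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)]
    for col in range(k):                       # forward elimination with pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):             # back substitution
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, k))) / A[r][r]
    return b

# Hypothetical mini data set: budget and opening in $ millions, theaters in thousands.
budget = [30, 60, 100, 150, 40, 80]
opening = [10, 25, 50, 90, 12, 35]
theaters = [2.5, 3.0, 3.5, 4.0, 2.8, 3.2]
usrev = [35, 80, 160, 300, 45, 110]

X = [[1.0, math.log(b), math.log(o), t] for b, o, t in zip(budget, opening, theaters)]
y = [math.log(r) for r in usrev]
coef = ols(X, y)
resid = [yi - sum(c * xi for c, xi in zip(coef, row)) for row, yi in zip(X, y)]
print("coefficients:", [round(c, 3) for c in coef])
print("residual sum:", round(sum(resid), 10))
```

A Normal quantile plot of `resid` (or its software equivalent) would then address the Normality question in part 1; with an intercept in the model, the residuals always sum to (essentially) zero, which the last line confirms.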

Question 11.85

11.85 Education and income.

CASE 10.1 Recall Case 10.1 (pages 485–486), which looked at the relationship between an entrepreneur’s log income and level of education. In addition to the level of education, the entrepreneur’s age and a measure of his or her perceived control of the environment (locus of control) were also obtained. The larger the locus of control, the more in control one feels.

entre1

  1. Write the model that you would use for a multiple regression to predict log income from education, locus of control, and age.
  2. What are the parameters of your model?
  3. Run the multiple regression and give the estimates of the model parameters.
  4. Find the residuals and examine their distribution. Summarize what you find.
  5. Plot the residuals versus each of the explanatory variables. Describe the plots. Does your analysis suggest that the model assumptions may not be reasonable for this problem?

11.85

(a) .
(b) The parameters are: , and .
(c) .
(d) The Normal quantile plot shows the residuals are Normally distributed.
(e) All three residual plots look good (random). Both linearity and constant variance are valid.

Question 11.86

11.86 Education and income, continued.

CASE 10.1 Refer to the previous exercise. Provided the data meet the requirements of the multiple regression model, we can now perform inference.

entre1

  1. Test the hypothesis that the coefficients for education, locus of control, and age are all zero. Give the test statistic with degrees of freedom and the P-value. What do you conclude?
  2. What is the value of R² for this model and data? Interpret what this numeric summary means to someone unfamiliar with it.
  3. Give the results of the hypothesis test for the coefficient for education. Include the test statistic, degrees of freedom, and the P-value. Do the same for the other two variables. Summarize your conclusions from these three tests.

Question 11.87

11.87 Compare regression coefficients.

Again refer to Exercise 11.85.

entre1

  1. In Example 10.5 (page 497), parameter estimates for the model that included just EDUC were obtained. Compare those parameter estimates with the ones obtained from the full model that also includes age and locus of control. Describe any changes.
  2. Consider a 36-year-old entrepreneur with 12 years of education and a locus of control of −0.25. Compare the predicted log incomes based on the full model and the model that includes only education level.
  3. In Example 10.12 (pages 521–522), we computed R² for the model that included only education level. It was 0.0573. Use this value and the R² from the model with all three explanatory variables to test whether age and locus of control together are helpful predictors, given EDUC is already in the model.
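Part 3 is the usual F test for comparing nested models. A small helper makes the arithmetic explicit; 0.0573 comes from the exercise, while the full-model R² and sample size below are placeholders that must come from your own fit of the entre1 data:

```python
def f_change(r2_full, r2_reduced, n, num_added, p_full):
    """F statistic for testing whether predictors added to a reduced model help.
    Degrees of freedom are num_added and n - p_full - 1."""
    return ((r2_full - r2_reduced) / num_added) / ((1 - r2_full) / (n - p_full - 1))

r2_educ = 0.0573   # R^2 for the EDUC-only model (Example 10.12)
r2_full = 0.10     # placeholder: use the R^2 from your three-predictor fit
n = 100            # placeholder: use the actual entre1 sample size
print(round(f_change(r2_full, r2_educ, n, num_added=2, p_full=3), 2))
```

Compare the result with an F(2, n − 4) distribution; a large value means age and locus of control add explanatory power beyond EDUC.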

11.87

(a) For the model with EDUC: . For the model with all three: . With just education, the intercept was larger and the effect of each year of education on LogIncome was larger, 0.1126. Once we account for both locus of control and age, the intercept isn’t quite as large, and the effect of each year of education drops to 0.08542.
(b) For the full model: . For the EDUC model: . The predictions don’t seem too different unless we undo the log transformation; the predicted incomes are $17,940.21 and $14,849.18, which seems like a substantial difference.
(c) are 2 and . Locus of control and Age are helpful predictors in explaining LogIncome when Education is already in the model.

Question 11.88

11.88 Business-to-business (B2B) marketing.

A group of researchers were interested in determining the likelihood that a business currently purchasing office supplies via a catalog would switch to purchasing from the website of the same supplier. To do this, they performed an online survey using the business clients of a large Australian-based stationery provider with both a catalog and a Web-based business.19 Results from 1809 firms, all currently purchasing via the catalog, were obtained. The following table summarizes the regression model.


Variable                                         b      t
Staff interpersonal contact with catalog       −0.08   3.34
Trust of supplier                               0.11   4.66
Web benefits (access and accuracy)              0.08   3.92
Previous Web purchases                          0.18   8.20
Previous Web information search                 0.08   3.47
Key catalog benefits (staff, speed, security)  −0.08   3.96
Web benefits (speed and ease of use)            0.36   3.97
Problems with Web ordering and delivery        −0.06   2.65
  1. The F statistic is reported to be 78.15. What degrees of freedom are associated with this statistic?
  2. This statistic can be expressed in terms of R² as

     F = (R²/k) / ((1 − R²)/(n − k − 1))

     where k is the number of explanatory variables. Use this relationship to determine R².

  3. The coefficients listed above are standardized coefficients. These are obtained when each variable is standardized (subtract its mean, divide by its standard deviation) prior to fitting the regression model. These coefficients then represent the change in standard deviations of y for a one standard deviation change in x. This typically allows one to determine which independent variables have the greatest effect on the dependent variable. Using this idea, what are the top two variables in this analysis?
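For part 2, the standard identity F = (R²/k) / ((1 − R²)/(n − k − 1)) can be rearranged to recover R² from the reported F; n = 1809 firms and k = 8 explanatory variables come from the exercise:

```python
F, k, n = 78.15, 8, 1809
df2 = n - k - 1               # error degrees of freedom, also the answer to part 1
r2 = F * k / (F * k + df2)    # algebraic rearrangement of the F identity
print("df:", k, "and", df2)
print("R^2 =", round(r2, 3))
```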

Exercises 11.89 through 11.92 use the PPROMO data set shown in Table 11.7.

Question 11.89

11.89 Discount promotions at a supermarket.

How does the frequency that a supermarket product is promoted at a discount affect the price that customers expect to pay for the product? Does the percent reduction also affect this expectation? These questions were examined by researchers in a study that used 160 subjects. The treatment conditions corresponded to the number of promotions (one, three, five, or seven) that were described during a 10-week period and the percent that the product was discounted (10%, 20%, 30%, and 40%). Ten students were randomly assigned to each of the treatments.20

ppromo

TABLE 11.7 Expected price data
Number of promotions   Percent discount   Expected price ($)
1 40 4.10 4.50 4.47 4.42 4.56 4.69 4.42 4.17 4.31 4.59
1 30 3.57 3.77 3.90 4.49 4.00 4.66 4.48 4.64 4.31 4.43
1 20 4.94 4.59 4.58 4.48 4.55 4.53 4.59 4.66 4.73 5.24
1 10 5.19 4.88 4.78 4.89 4.69 4.96 5.00 4.93 5.10 4.78
3 40 4.07 4.13 4.25 4.23 4.57 4.33 4.17 4.47 4.60 4.02
3 30 4.20 3.94 4.20 3.88 4.35 3.99 4.01 4.22 3.70 4.48
3 20 4.88 4.80 4.46 4.73 3.96 4.42 4.30 4.68 4.45 4.56
3 10 4.90 5.15 4.68 4.98 4.66 4.46 4.70 4.37 4.69 4.97
5 40 3.89 4.18 3.82 4.09 3.94 4.41 4.14 4.15 4.06 3.90
5 30 3.90 3.77 3.86 4.10 4.10 3.81 3.97 3.67 4.05 3.67
5 20 4.11 4.35 4.17 4.11 4.02 4.41 4.48 3.76 4.66 4.44
5 10 4.31 4.36 4.75 4.62 3.74 4.34 4.52 4.37 4.40 4.52
7 40 3.56 3.91 4.05 3.91 4.11 3.61 3.72 3.69 3.79 3.45
7 30 3.45 4.06 3.35 3.67 3.74 3.80 3.90 4.08 3.52 4.03
7 20 3.89 4.45 3.80 4.15 4.41 3.75 3.98 4.07 4.21 4.23
7 10 4.04 4.22 4.39 3.89 4.26 4.41 4.39 4.52 3.87 4.70


  1. Plot the expected price versus the number of promotions. Do the same for expected price versus discount. Summarize the results.
  2. These data come from a designed experiment with an equal number of observations for each promotion by discount combination. Find the means and standard deviations for expected price for each of these combinations. Describe any patterns that are evident in these summaries.
  3. Using your summaries from part (b), make a plot of the mean expected price versus the number of promotions for the 10% discount condition. Connect these means with straight lines. On the same plot, add the means for the other discount conditions. Summarize the major features of this plot.
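As a spot check on part (b), one treatment cell's summary statistics can be computed directly from Table 11.7; shown here for the row with one promotion at a 10% discount (the full summary repeats this for all 16 cells):

```python
from statistics import mean, stdev

# Expected prices for 1 promotion at a 10% discount (Table 11.7).
cell = [5.19, 4.88, 4.78, 4.89, 4.69, 4.96, 5.00, 4.93, 5.10, 4.78]
print(round(mean(cell), 2))    # 4.92
print(round(stdev(cell), 4))   # about 0.1520
```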

11.89

(a) As the number of promotions increases, the expected price goes down. For discount, the expected prices for 10% and 20% seem similar, as do those for 30% and 40%, which are lower than those for 10% and 20%.
(b) and (c) The drop in expected price is fairly consistent with an increase in promotions. Similarly, the drop in price is fairly consistent with increasing percent discount; however, the 40% discount consistently yields higher expected prices than the 30% discount.

Promotions Discount Mean Std Dev
1 10 4.92 0.1520234
20 4.689 0.2330689
30 4.225 0.3856092
40 4.423 0.1847551
3 10 4.756 0.2429083
20 4.524 0.2707274
30 4.097 0.2346179
40 4.284 0.2040261
5 10 4.393 0.2685372
20 4.251 0.2648459
30 3.89 0.1628906
40 4.058 0.1759924
7 10 4.269 0.2699156
20 4.094 0.2407488
30 3.76 0.2617887
40 3.78 0.2143725

Question 11.90

11.90 Run the multiple regression.

Refer to the previous exercise. Run a multiple regression using promotions and discount to predict expected price. Write a summary of your results.

ppromo

Question 11.91

11.91 Residuals and other models.

Refer to the previous exercise. Analyze the residuals from your analysis, and investigate the possibility of using quadratic and interaction terms as predictors. Write a report recommending a final model for this problem with a justification for your recommendation.

ppromo

11.91

The Normal quantile plot shows a slight left skew in the residuals. The residual plot for Promotions looks good (random). The residual plot for Discount shows a slight curve and suggests a possible quadratic model. Investigating a quadratic term for Discount and possible interaction terms shows that none of the interaction terms test significant. After removing these, the quadratic term for Discount is significant. This model has an R² that is somewhat better than the 56.62% for the model without the quadratic term. It is possible to leave out the quadratic term to simplify interpretation; otherwise, the model with this term seems best in terms of prediction.

Question 11.92

11.92 Can we generalize the results?

The subjects in this experiment were college students at a large Midwest university who were enrolled in an introductory management course. They received the information about the promotions during a 10-week period during their course. Do you think that these facts about the data would influence how you would interpret and generalize the results? Write a summary of your ideas regarding this issue.

ppromo

Question 11.93

11.93 Determinants of innovation capability.

A study of 367 Australian small/medium enterprise (SME) firms looked at the relationship between perceived innovation marketing capability and two marketing support capabilities, market orientation and management capability. All three variables were measured on the same scale such that a higher score implies a more positive perception.21 Given the relatively large sample size, the researchers grouped the firms into three size categories (micro, small, and medium) and analyzed each separately. The following table summarizes the results.

                        Micro          Small          Medium
Explanatory variable    b      SE      b      SE      b      SE
Market orientation      0.69   0.08    0.47   0.06    0.37   0.12
Management capability   0.14   0.08    0.39   0.06    0.38   0.12
F statistic             87.6           117.7          37.2
  1. For each firm size, test if these two explanatory variables together are helpful in predicting the perceived level of innovation capability. Make sure to specify degrees of freedom.
  2. Using the table, test if each explanatory variable is a helpful predictor given that the other variable is already in the model.
  3. Using the table and your results to parts (a) and (b), summarize the relationship between innovation capability and the two explanatory variables. Are there any differences in this relationship across different-sized firms?
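For part (b), each individual test statistic is t = b/SE using the table's values; for samples this large the 5% two-sided cutoff is roughly 1.96. A sketch for the Micro column (the other firm sizes work the same way):

```python
# Micro firms: b and SE taken from the table.
t_market = 0.69 / 0.08   # market orientation
t_mgmt = 0.14 / 0.08     # management capability
print(round(t_market, 2), round(t_mgmt, 2))
print(abs(t_market) > 1.96, abs(t_mgmt) > 1.96)
```

Market orientation clears the cutoff easily while management capability does not, which matches the answer below for Micro firms.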

11.93

(a) For Micro: are 2 and . For Small: are 2 and . For Medium: are 2 and . The two explanatory variables are helpful in predicting the perceived level of innovation capability for each firm size.
(b) For Micro: , and Market Orientation is significant. , and Management Capability is not significant. For Small: , and Market Orientation is significant. , and Management Capability is significant. For Medium: , and Market Orientation is significant. , and Management Capability is significant.
(c) For all three sizes, the overall model was very significant. However, for the Micro size, the Management Capability was not needed and was not significant given Market Orientation is in the model. For the other two sizes, Small and Medium, both variables tested significant at the 5% level, and both were useful in predicting perceived level of innovation capability.

Question 11.94

11.94 Are separate analyses needed?

Refer to the previous exercise. Suppose you wanted to generate a similar table but have it based on results from only one multiple regression rather than on three.

  1. Describe what additional explanatory variables you would need to include in your regression model and write out the model.
  2. In the actual table, the importance (b coefficient) of market orientation appears to decrease as the firm size increases. Based on your model in part (a), describe an F test to see if the market orientation coefficient differs across the three firm sizes.
  3. Explain why this test is more appropriate than using t tests to compare each pair of coefficients.

Question 11.95

11.95 Impact of word of mouth.

Word of mouth (WOM) is informal advice passed among consumers that may have a quick and powerful influence on consumer behavior. Word of mouth may be positive (PWOM), encouraging choice of a certain brand, or negative (NWOM), discouraging that choice. A study investigated the impact of WOM on brand purchase probability.22 Multiple regression was used to assess the effect of six variables on brand choice. These were pre-WOM probability of purchase (PPP), strength of expression of WOM, WOM about main brand, closeness of the communicator, whether advice was sought, and amount of WOM given. The following table summarizes the results for 903 participants who received NWOM.


Variable                         b       SE
PPP                            −0.37   0.022
Strength of expression of WOM  −0.22   0.065
WOM about main brand            0.21   0.164
Closeness of communicator      −0.06   0.121
Whether advice was sought      −0.04   0.140
Amount of WOM given            −0.08   0.022

In addition, the value of R² for this model is reported.

  1. What percent of the variation in change in brand purchase probability is explained by these explanatory variables?
  2. State which of these variables are statistically significant at the 5% level.
  3. The PPP result implies that the more uncertain someone is about purchasing, the more negative the impact of NWOM. Explain what the "strength of expression of WOM" result implies.
  4. The variable "WOM about main brand" is an indicator variable. It is equal to 1 when the NWOM is about the receiver’s main brand and 0 when it is about another brand. Explain the meaning of this result.
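Part 2's screening can be done mechanically: with n = 903, a coefficient is significant at the 5% level when |b/SE| exceeds roughly 1.96. Applying that rule to the table:

```python
# b and SE from the table; with n = 903 the 5% two-sided cutoff is about |t| = 1.96.
coefs = {
    "PPP": (-0.37, 0.022),
    "Strength of expression of WOM": (-0.22, 0.065),
    "WOM about main brand": (0.21, 0.164),
    "Closeness of communicator": (-0.06, 0.121),
    "Whether advice was sought": (-0.04, 0.140),
    "Amount of WOM given": (-0.08, 0.022),
}
significant = [name for name, (b, se) in coefs.items() if abs(b / se) > 1.96]
print(significant)
```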

11.95

(a) 20%.
(b) To be significant at the 5% level, |b/SE| must exceed about 1.96; PPP, Strength of expression of WOM, and Amount of WOM given are significant.
(c) The stronger the expression of WOM, the more negative the impact of NWOM.
(d) The regression coefficient for WOM about main brand is 0.21, meaning there is a 0.21 difference in impact between receiving NWOM about the receiver’s main brand versus about another brand; the NWOM effect is much larger when it concerns the receiver’s main brand.

Question 11.96

11.96 Correlations may not be a good way to screen for multiple regression predictors.

We use a constructed data set in this problem to illustrate this point.

dseta

  1. Find the correlations between the response variable and each of the explanatory variables and . Plot the data and run the two simple linear regressions to verify that no evidence of a relationship is found by this approach. Some researchers would conclude at this point that there is no point in further exploring the possibility that and could be useful in predicting .
  2. Analyze the data using and in a multiple regression to predict . The fit is quite good. Summarize the results of this analysis.
  3. What do you conclude about an analytical strategy that first looks at one candidate predictor at a time and selects from these candidates for a multiple regression based on some threshold level of significance?

Question 11.97

11.97 The multiple regression results do not tell the whole story.

We use a constructed data set in this problem to illustrate this point.

dsetb

  1. Run the multiple regression using and to predict . The test and the significance tests for the coefficients of the explanatory variables fail to reach the 5% level of significance. Summarize these results.
  2. Now run the two simple linear regressions using each of the explanatory variables in separate analyses. The coefficients of the explanatory variables are statistically significant at the 5% level in each of these analyses. Verify these conclusions with plots and correlations.
  3. What do you conclude about an analytical strategy that looks only at multiple regression results?

11.97

(a) The multiple regression equation is: . Likewise, neither predictor tests significant when added last: . The data do not show a significant multiple linear regression between and the predictors and .
(b) For and : . For and . Both and are significant in predicting in a simple linear regression.
(c) An insignificant overall multiple regression F test doesn’t necessarily imply that the predictors are useless; we should explore other strategies and/or tests to check whether any of the predictors are useful in different models or settings. In this case, the two predictors are highly correlated, so their individual t tests will likely be insignificant when both are used in the same model together.

Exercises 11.98 through 11.104 use the CROPS data file, which contains the U.S. yield (bushels/acre) of corn and soybeans from 1957–2013.23

Question 11.98

11.98 Corn yield varies over time.

Run the simple linear regression using year to predict corn yield.

crops

  1. Summarize the results of your analysis, including the significance test results for the slope and for this model.
  2. Analyze the residuals with a Normal quantile plot. Is there any indication in the plot that the residuals are not Normal?
  3. Plot the residuals versus soybean yield. Does the plot indicate that soybean yield might be useful in a multiple linear regression with year to predict corn yield? Explain your answer.

Question 11.99

11.99 Can soybean yield predict corn yield?

Run the simple linear regression using soybean yield to predict corn yield.

crops

  1. Summarize the results of your analysis, including the significance test results for the slope and for this model.
  2. Analyze the residuals with a Normal quantile plot. Is there any indication in the plot that the residuals are not Normal?
  3. Plot the residuals versus year. Does the plot indicate that year might be useful in a multiple linear regression with soybean yield to predict corn yield? Explain your answer.

11.99

.
(a) . There is a significant simple linear regression between corn yield and soybean yield; soybean yield can significantly predict corn yield.
(b) The Normal quantile plot shows that the residuals are mostly Normal but have a slight right-skew.
(c) There is somewhat of a relationship between the residuals and year, suggesting that it might be useful in the model with soybean yield to predict corn yield.

Question 11.100

11.100 Use both predictors.

From the previous two exercises, we conclude that year and soybean yield may be useful together in a model for predicting corn yield. Run this multiple regression.

crops

  1. Explain the results of the ANOVA test. Give the null and alternative hypotheses, the test statistic with degrees of freedom, and the P-value. What do you conclude?
  2. What percent of the variation in corn yield is explained by these two variables? Compare it with the percent explained in the simple linear regression models of the previous two exercises.
  3. Give the fitted model. Why do the coefficients for year and soybean yield differ from those in the previous two exercises?


  4. Summarize the significance test results for the regression coefficients for year and soybean yield.
  5. Give a 95% confidence interval for each of these coefficients.
  6. Plot the residuals versus year and versus soybean yield. What do you conclude?
  7. There is one case that is not predicted well with this model. What year is it? Remove this case and refit the model. Compare the estimated parameters with the results from part (c). Does this case appear to be influential? Explain your answer.

Question 11.101

11.101 Try a quadratic.

We need a new variable to model the curved relation that we see between corn yield and year in the residual plot of the last exercise. Let year2 = (year − mean of year)². (When adding a squared term to a multiple regression model, we sometimes subtract the mean of the variable being squared before squaring. This eliminates the correlation between the linear and quadratic terms in the model and thereby reduces collinearity.)

crops

  1. Run the multiple linear regression using year, year2, and soybean yield to predict corn yield. Give the fitted regression equation.
  2. Give the null and alternative hypotheses for the ANOVA test. Report the results of this test, giving the test statistic, degrees of freedom, P-value, and conclusion.
  3. What percent of the variation in corn yield is explained by this multiple regression? Compare this with the model in the previous exercise.
  4. Summarize the results of the significance tests for the individual regression coefficients.
  5. Analyze the residuals and summarize your conclusions.

11.101

(a) .
(b) H0: β1 = β2 = β3 = 0; Ha: at least one βj ≠ 0. The degrees of freedom are 3 and 53. There is a significant multiple linear regression between corn yield and the predictors Year, Year2, and SoyBeanYield. Together, the predictors can significantly predict corn yield.
(c) , up from 93.82%.
(d) For Year: . Year is significant in predicting corn yield in a model already containing Year2 and SoyBeanYield. For Year2: . Year2 is significant in predicting corn yield in a model already containing Year and SoyBeanYield. For SoyBeanYield: . SoyBeanYield is significant in predicting corn yield in a model already containing Year and Year2.
(e) The Normal quantile plot shows a roughly Normal distribution; there is one observation with a fairly high residual. The residual plots all look good (random); the residual plot for Year is much better and doesn’t have the rising and falling that the previous plot had. Overall, the model fit is much better using the quadratic term for Year than without.

Question 11.102

11.102 Compare models.

Run the model to predict corn yield using year and the squared term year2 defined in the previous exercise.

crops

  1. Summarize the significance test results.
  2. The coefficient for year2 is not statistically significant in this run, but it was highly significant in the model analyzed in the previous exercise. Explain how this can happen.
  3. Obtain the fitted values for each year in the data set, and use these to sketch the curve on a plot of the data. Plot the least-squares line on this graph for comparison. Describe the differences between the two regression functions. For what years do they give very similar fitted values? For what years are the differences between the two relatively large?

Question 11.103

11.103 Do a prediction.

Use the simple linear regression model with corn yield as the response variable and year as the explanatory variable to predict the corn yield for the year 2014, and give the 95% prediction interval. Also, use the multiple regression model where year and year2 are both explanatory variables to find another predicted value with the 95% interval. Explain why these two predicted values are so different. The actual yield for 2014 was 167.4 bushels per acre. How well did your models predict this value?

crops

11.103

For Year alone, the 95% prediction interval is (138.8, 181.6901). For Year and Year2, the 95% prediction interval is (134.3030, 179.0220). The two predicted values differ because we are near the edge of the data for Year, and as we saw in the previous exercise, this is where the quadratic term causes the greatest differences. The actual yield of 167.4 is not predicted very well by either model but is closer to the prediction from the linear model than from the quadratic model.

Question 11.104

11.104 Predict the yield for another year.

Repeat the previous exercise doing the prediction for 2020. Compare the results of this exercise with the previous one. Also explain why the predicted values are beginning to differ more substantially.

crops

Question 11.105

11.105 Predicting U.S. movie revenue.

Refer to Case 11.2 (page 550). The data set MOVIES contains several other explanatory variables that are available at the time of release that we did not consider in the examples and exercises. These include

  • Hype: A numeric value that describes the interest in the movie at the time of release. The smaller, the more interest.
  • Minutes: The length of the movie in minutes.
  • Rating: A variable indicating the Motion Picture Association of America film rating. This variable has three categories, so two indicator variables are required.
  • Sequel: A variable indicating if the movie is a sequel or not.

Using these explanatory variables and Opening, Budget, and Theaters, determine the best model for predicting U.S. revenue.

movies

Question 11.106

11.106 Price-fixing litigation.

Multiple regression is sometimes used in litigation. In the case of Cargill, Inc. v. Hardin, the prosecution charged that the cash price of wheat was manipulated in violation of the Commodity Exchange Act. In a statistical study conducted for this case, a multiple regression model was constructed to predict the price of wheat using three supply-and-demand explanatory variables.24 Data for 14 years were used to construct the regression equation, and a prediction for the suspect period was computed from this equation. The value of R² was 0.989.

  1. The fitted model gave the predicted value $2.136 with standard error $0.013. Express the prediction as an interval. (The degrees of freedom were large for this analysis, so use 100 as the df to determine t*.)
  2. The actual price for the period in question was $2.13. The judge decided that the analysis provided evidence that the price was not artificially depressed, and the opinion was sustained by the court of appeals. Write a short summary of the results of the analysis that relate to the decision and explain why you agree or disagree with it.
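The arithmetic for part 1 is just ŷ ± t*·SE; with df = 100, the 95% critical value is t* ≈ 1.984:

```python
yhat, se, t_star = 2.136, 0.013, 1.984   # t* for 95% confidence, df = 100
lo, hi = yhat - t_star * se, yhat + t_star * se
print(round(lo, 4), round(hi, 4))        # about (2.1102, 2.1618)
print(lo < 2.13 < hi)                    # True: the actual price falls inside
```

This is the calculation behind the judge's conclusion in part 2: the observed $2.13 lies comfortably inside the prediction interval.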


Question 11.107

11.107 Predicting CO2 emissions.

The data set CO2MPG contains an SRS of 200 passenger vehicles sold in Canada in 2014. There appears to be a quadratic relationship between CO2 emissions and miles per gallon highway (MPGHWY).

co2mpg

  1. Create two new centered variables, MPG and its square MPG2, and fit a quadratic regression for each fuel type (FUELTYPE). Create a table of parameter estimates and comment on the similarities and differences in the coefficients across fuel types.
  2. Create three indicator variables for fuel type, three interaction variables between MPG and each of the indicators, and three interaction variables between MPG2 and each of the indicator variables. Fit this model to the entire data set. Use the estimated coefficients to construct the quadratic equation for each of the fuel types. How do they compare to the equations in part (a)?

11.107

(a)

Regression Coefficients
Type Intercept mpg mpg2
D 267.3823 −5.42585 0.04619
E 160.84557 −3.89582 0.30631
X 235.16637 −7.18033 0.12751
Z 243.75987 −7.88188 0.13832

Types X and Z are very similar and show very few differences in all of the coefficients. Types D and E are very different. Type E has a much smaller slope for MPG than all the other types, and the MPG2 effect is quite large, more than double all the rest. Type D also has a slightly smaller slope for MPG than X and Z, but it has an extremely small slope for MPG2.

(b)

Parameter Estimate
Intercept 243.75987
X1 23.62243
X2 −82.91430
X3 −8.59350
mpg −7.88188
MPGX1 2.45603
MPGX2 3.98607
MPGX3 0.70155
mpg2 0.13832
MPG2X1 −0.09214
MPG2X2 0.16798
MPG2X3 −0.01081


Answers will vary depending on how the indicator variables were created. Setting Z as the default type, the parameter estimates are in the table shown, so the estimates for the Intercept, MPG, and MPG2 match Type Z’s estimates exactly. To recover the others, we set X1 = 1 (and X2 = X3 = 0) for Type D, yielding an intercept of 243.75987 + 23.62243, a slope for MPG of −7.88188 + 2.45603, and a slope for MPG2 of 0.13832 − 0.09214, and similarly for Types E and X. This yields the same equations as part (a).
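The bookkeeping can be verified with the two tables: assuming X1 is the indicator for Type D (chosen here because the sums match part (a)), adding the X1, MPGX1, and MPG2X1 estimates to the baseline (Type Z) estimates reproduces Type D's row:

```python
# Baseline (Type Z) coefficients and the Type D shifts, from the part (b) table.
baseline = {"intercept": 243.75987, "mpg": -7.88188, "mpg2": 0.13832}
d_shift = {"intercept": 23.62243, "mpg": 2.45603, "mpg2": -0.09214}

type_d = {k: baseline[k] + d_shift[k] for k in baseline}
print(type_d)   # matches Type D's row in the part (a) table, up to rounding
```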

Question 11.108

11.108 Prices of homes.

Consider the data set used for Case 11.3 (page 566). This data set includes information for several other zip codes. Pick a different zip code and analyze the data. Compare your results with what we found for zip code 47904 in Section 11.3.

homes
