11 Multiple Regression

SECTION 11.1 Exercises

For Exercises 11.1 and 11.2, see page 533; for 11.3 and 11.4, see page 535; for 11.5 and 11.6, see page 537; for 11.7 and 11.8, see page 538; for 11.9 and 11.10, see page 541; for 11.11 to 11.14, see pages 543–544; and for 11.15 and 11.16, see page 544.

Question 11.17

11.17 Describing a multiple regression.

As part of a study, data from 282 students majoring in accounting at the College of Business Studies in Kuwait were obtained through a survey.⁵ The researchers were interested in finding determinants of academic performance measured by the student’s major grade point average (MGPA). They considered gender, high school major, age, frequency of doing homework, participation in class, and number of days studying before an exam.

What is the response variable?
What is n, the number of cases?
What is p, the number of explanatory variables?
What are the explanatory variables?

11.17

(a) Major grade point average. (b) n = 282. (c) p = 6. (d) Gender, high school major, age, frequency of doing homework, participation in class, and number of days studying before an exam.

Question 11.18

11.18 Understanding the fitted regression line.

The fitted regression equation for a multiple regression is

If and , what is the predicted value of y?
For the answer to part (a) to be valid, is it necessary that the values and correspond to a case in the data set? Explain why or why not.
If you hold at a fixed value, what is the effect of an increase of two units in on the predicted value of ?

Question 11.19

11.19 Predicting the price of tablets: Individual variables.

Suppose your company needs to buy some tablets. To help in the purchasing decision, you decide to develop a model to predict the selling price. You decide to obtain price and product characteristic information on 20 tablets from Consumer Reports.⁶ The characteristics are screen size, battery life, weight (pounds), ease of use, display, and versatility. The latter three are scored on a 1 to 5 scale.

tablts

Make a table giving the mean, median, and standard deviation of each variable.
Use stemplots or histograms to make graphical summaries of each distribution.
Describe these distributions. Are there any unusual observations that may affect a multiple regression? Explain your answer.
The screen size distribution appears bimodal. Is this lack of Normality necessarily a problem? Explain your answer.

11.19

(a)

Variable	Mean	Median	Std Dev
Price	395.50	400.00	119.76
Size	9.15	9.90	1.21
Battery	11.11	10.55	2.57
Weight	1.07	1.10	0.29
Ease	4.55	5.00	0.51
Display	4.30	4.00	0.47
Versatility	3.80	4.00	0.41

(c) Price is roughly Normal. Size has a bimodal distribution. Battery is right-skewed. Ease, Display, and Versatility each take only two distinct values, even though they were rated on a 1 to 5 scale. No observations stand out as unusual enough to affect the regression analysis. (d) No. Multiple regression makes no assumption about the distribution of the explanatory variables, so a bimodal screen size is not a problem.

Question 11.20

11.20 Predicting the price of tablets: Pairs of variables.

Refer to the tablet data described in Exercise 11.19.

tablts

Examine the relationship between each pair of variables using correlation and a scatterplot.
Which characteristic is most strongly correlated with price? Is any pair of characteristics strongly correlated?
Summarize the relationships. Are there any unusual or outlying cases?

Question 11.21

11.21 Predicting the price of tablets: Multiple regression equation.

Refer to the tablet data described in Exercise 11.19.

tablts

Run a multiple regression to predict price using the six product characteristics. Give the equation for predicted price.
What is the value of the regression standard error s? Verify that this value is the square root of the sum of squares of residuals divided by the degrees of freedom for the residuals.
Obtain the residuals and use graphical summaries to describe the distribution.
Observation 11 is much higher priced than the model predicts. Remove this observation and repeat parts (a), (b), and (c). Comment on the differences between the two model fits.
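The check requested in part (b) can be written out as a formula. This is a sketch in the usual notation, assuming e_i denotes the residual for case i, n the number of cases, and p the number of explanatory variables (symbols assumed here, since the exercise text does not display them):

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} e_i^{2}}{n - p - 1}},
\qquad e_i = y_i - \hat{y}_i .
```

With all 20 tablets and six explanatory variables, the residual degrees of freedom are 20 − 6 − 1 = 13. After Observation 11 is removed in part (d), they drop to 19 − 6 − 1 = 12.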

11.21

(a)
. (b) . (c) A Normal quantile plot shows a potential outlier. (d)
.
A Normal quantile plot shows that the residuals are much closer to a Normal distribution without the outlier; however, the tails still appear slightly heavy. This model is likely much better than the original model. Before, only Size was significant; now Battery and Display are also significant at the 5% level. The regression standard error is much smaller for the second model as well.

Question 11.22

11.22 Predicting the price of a tablet.

Refer to the previous exercise. Let’s use the model with Observation 11 removed.

tablts

What is the predicted price for the second tablet? The characteristics are , , , , , and .
The stated price for this tablet is $400. Is the predicted price above or below the stated price? Should you consider buying it? Explain your answer.
Explain how you could use the residuals to help determine which tablet to buy.
Consumer Reports names Tablets 4, 8, 12, and 20 as "Best Buys." Based on your regression model, do you agree with this assessment? What tablets would you recommend?

Question 11.23

11.23 Data analysis: Individual variables.

Table 11.3 gives data on the current fast-food market share, along with the number of franchises, number of company-owned stores, annual sales ($ million) from three years ago, and whether it is a burger restaurant.⁷ Market share is expressed in percents, based on current U.S. sales.

ffood

Make a table giving the mean, the standard deviation, and the five-number summary for each of these variables.
Use stemplots or histograms to make graphical summaries of the five distributions.
Describe the distributions. Are there any unusual observations?

11.23

(a)

Variable	Mean	Std Dev	Minimum	Lower Quartile	Median	Upper Quartile	Maximum
Share	4.94	5.07	1.72	2.33	3.28	5.48	22.69
Franchises	5525.56	5754.47	0.00	1983.00	4406.50	6563.00	23850.00
Company	1116.31	1565.77	0.00	452.50	826.50	1194.50	6707.00
Sales	6.99	7.23	1.80	3.20	5.05	7.95	32.40
Burger	0.31	0.48	0.00	0.00	0.00	1.00	1.00

(c) McDonald’s is an outlier for Share and Sales, Subway is an outlier for Franchises, and Starbucks is an outlier for Company. Otherwise, it is hard to judge the distributions for the other restaurants because the outliers compress the remaining values into a few bars of each histogram. Burger has only two possible values.

Question 11.24

11.24 Data analysis: Pairs of variables.

Refer to the previous exercise.

ffood

Plot market share versus each of the explanatory variables.
Summarize these relationships. Are there any influential observations?
Find the correlation between each pair of variables.
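One way to carry out the correlation step is sketched below. It assumes the values are typed in from Table 11.3 (restaurants in the table's row order) and uses a small hand-written helper, `correlation`, which is a name chosen for this illustration and not something the text provides. Only market share against each quantitative explanatory variable is shown.

```python
from math import sqrt

# Columns copied from Table 11.3, in the table's row order.
share = [22.69, 7.71, 6.76, 5.48, 5.48, 4.78, 4.02, 3.63,
         2.93, 2.87, 2.49, 2.42, 2.23, 1.98, 1.91, 1.72]
franchises = [12477, 23850, 4424, 5182, 6380, 4389, 6746, 7083,
              1461, 4275, 791, 3117, 4479, 1250, 2505, 0]
company = [1550, 0, 6707, 1394, 873, 1245, 26, 459,
           76, 780, 662, 455, 450, 956, 1144, 1084]
sales = [32.4, 10.6, 7.6, 8.3, 8.6, 6.9, 6.0, 5.4,
         3.6, 4.7, 3.1, 3.6, 3.3, 2.9, 3.0, 1.8]


def correlation(x, y):
    """Pearson correlation r between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


for name, column in [("Franchises", franchises),
                     ("Company", company),
                     ("Sales", sales)]:
    print(f"r(Share, {name}) = {correlation(share, column):.3f}")
```

Because McDonald’s sits far from the other restaurants on both Share and Sales, that single case contributes heavily to the correlation between those two variables, which is worth keeping in mind when judging influence.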

Question 11.25

11.25 Multiple regression equation.

Refer to the fast-food data in Exercise 11.23. Run a multiple regression to predict market share using all four explanatory variables.

ffood

Give the equation for predicted market share.
What is the value of the regression standard error s?
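A minimal sketch of how this fit could be run, assuming the data are typed in from Table 11.3 and that NumPy is available. The text does not prescribe any software, so `numpy.linalg.lstsq` is simply one convenient least-squares routine, and the variable names are choices made for this illustration.

```python
import numpy as np

# Rows copied from Table 11.3:
# market share, franchises, company-owned stores, sales, burger indicator
data = np.array([
    [22.69, 12477, 1550, 32.4, 1],   # McDonald's
    [7.71,  23850,    0, 10.6, 0],   # Subway
    [6.76,   4424, 6707,  7.6, 0],   # Starbucks
    [5.48,   5182, 1394,  8.3, 1],   # Wendy's
    [5.48,   6380,  873,  8.6, 1],   # Burger King
    [4.78,   4389, 1245,  6.9, 0],   # Taco Bell
    [4.02,   6746,   26,  6.0, 0],   # Dunkin' Donuts
    [3.63,   7083,  459,  5.4, 0],   # Pizza Hut
    [2.93,   1461,   76,  3.6, 0],   # Chick-fil-A
    [2.87,   4275,  780,  4.7, 0],   # KFC
    [2.49,    791,  662,  3.1, 0],   # Panera Bread
    [2.42,   3117,  455,  3.6, 1],   # Sonic
    [2.23,   4479,  450,  3.3, 0],   # Domino's
    [1.98,   1250,  956,  2.9, 1],   # Jack in the Box
    [1.91,   2505, 1144,  3.0, 0],   # Arby's
    [1.72,      0, 1084,  1.8, 0],   # Chipotle
])

y = data[:, 0]                                       # response: market share
X = np.column_stack([np.ones(len(y)), data[:, 1:]])  # intercept + 4 predictors

# Least-squares coefficients b0, b1, ..., b4
b, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ b
n = len(y)             # number of cases
p = X.shape[1] - 1     # number of explanatory variables
df = n - p - 1         # residual degrees of freedom
s = np.sqrt(np.sum(residuals ** 2) / df)   # regression standard error

names = ["Intercept", "Franchises", "Company", "Sales", "Burger"]
for name, coef in zip(names, b):
    print(f"{name:>10}: {coef: .6f}")
print(f"s = {s:.4f} on {df} degrees of freedom")
```

Deleting the McDonald's and Starbucks rows from `data` and rerunning gives the comparison fit without those two restaurants.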

11.25

(a) . (b)

Question 11.26

11.26 Residuals.

Refer to the fast-food data in Exercise 11.23. Find the residuals for the multiple regression used to predict market share based on the four explanatory variables.

ffood

Give a graphical summary of the distribution of the residuals. Are there any outliers in this distribution?
Plot the residuals versus the number of franchises. Describe the plot and any unusual cases.
Repeat part (b) with number of company-owned stores in place of number of franchises.
Repeat part (b) with previous sales in place of number of franchises.

Your analyses in Exercises 11.23 through 11.26 point to two restaurants, McDonald’s and Starbucks, as unusual in several respects. How influential are these restaurants? The following four exercises provide answers.

TABLE 11.3 Market share data for Exercise 11.23

Restaurant	Market share	Franchises	Company	Sales	Burger
McDonald’s	22.69	12,477	1550	32.4	1
Subway	7.71	23,850	0	10.6	0
Starbucks	6.76	4424	6707	7.6	0
Wendy’s	5.48	5182	1394	8.3	1
Burger King	5.48	6380	873	8.6	1
Taco Bell	4.78	4389	1245	6.9	0
Dunkin’ Donuts	4.02	6746	26	6.0	0
Pizza Hut	3.63	7083	459	5.4	0
Chick-fil-A	2.93	1461	76	3.6	0
KFC	2.87	4275	780	4.7	0
Panera Bread	2.49	791	662	3.1	0
Sonic	2.42	3117	455	3.6	1
Domino’s	2.23	4479	450	3.3	0
Jack in the Box	1.98	1250	956	2.9	1
Arby’s	1.91	2505	1144	3.0	0
Chipotle	1.72	0	1084	1.8	0
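The numerical summaries asked for in Exercise 11.23(a) can be reproduced from this table with the Python standard library. The sketch below assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, a convention that matches the summary values printed in the answers. The helper names `five_number` and `summarize` are chosen for this illustration.

```python
from statistics import mean, median, stdev

# Rows copied from Table 11.3:
# (restaurant, share, franchises, company, sales, burger)
rows = [
    ("McDonald's", 22.69, 12477, 1550, 32.4, 1),
    ("Subway", 7.71, 23850, 0, 10.6, 0),
    ("Starbucks", 6.76, 4424, 6707, 7.6, 0),
    ("Wendy's", 5.48, 5182, 1394, 8.3, 1),
    ("Burger King", 5.48, 6380, 873, 8.6, 1),
    ("Taco Bell", 4.78, 4389, 1245, 6.9, 0),
    ("Dunkin' Donuts", 4.02, 6746, 26, 6.0, 0),
    ("Pizza Hut", 3.63, 7083, 459, 5.4, 0),
    ("Chick-fil-A", 2.93, 1461, 76, 3.6, 0),
    ("KFC", 2.87, 4275, 780, 4.7, 0),
    ("Panera Bread", 2.49, 791, 662, 3.1, 0),
    ("Sonic", 2.42, 3117, 455, 3.6, 1),
    ("Domino's", 2.23, 4479, 450, 3.3, 0),
    ("Jack in the Box", 1.98, 1250, 956, 2.9, 1),
    ("Arby's", 1.91, 2505, 1144, 3.0, 0),
    ("Chipotle", 1.72, 0, 1084, 1.8, 0),
]
VARIABLES = ["Share", "Franchises", "Company", "Sales", "Burger"]


def five_number(values):
    """Minimum, lower quartile, median, upper quartile, maximum.

    Quartiles are medians of the lower and upper halves of the sorted
    values (the overall median is left out when the count is odd).
    """
    v = sorted(values)
    half = len(v) // 2
    return v[0], median(v[:half]), median(v), median(v[-half:]), v[-1]


def summarize(table):
    """Map each variable to (mean, std dev, five-number summary...)."""
    result = {}
    for j, name in enumerate(VARIABLES, start=1):
        column = [row[j] for row in table]
        result[name] = (mean(column), stdev(column), *five_number(column))
    return result


full = summarize(rows)
# Same summaries with McDonald's and Starbucks set aside.
trimmed = summarize([r for r in rows
                     if r[0] not in ("McDonald's", "Starbucks")])

for label, summary in [("All 16 restaurants", full),
                       ("Without McDonald's and Starbucks", trimmed)]:
    print(label)
    for name, stats in summary.items():
        print(f"  {name:<11}" + "".join(f"{x:>11.2f}" for x in stats))
```

Other quartile conventions, such as interpolated percentiles, can give slightly different lower and upper quartiles for the same data, so small discrepancies with other software are expected.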

Question 11.27

11.27

Rerun Exercise 11.23 without the data for McDonald’s and Starbucks. Compare your results with what you obtained in that exercise.

ffood

11.27

(a)

Variable	Mean	Std Dev	Minimum	Lower Quartile	Median	Upper Quartile	Maximum
Share	3.55	1.75	1.72	2.23	2.90	4.78	7.71
Franchises	5107.71	5848.92	0.00	1461.00	4332.00	6380.00	23850.00
Company	686.00	458.93	0.00	450.00	721.00	1084.00	1394.00
Sales	5.13	2.62	1.80	3.10	4.15	6.90	10.60
Burger	0.29	0.47	0.00	0.00	0.00	1.00	1.00

Taking out the two outliers fixed many of the outlier problems we saw earlier in the histograms. Subway still shows up as an outlier in the Franchises histogram; otherwise, we can now see the distributions of the other variables much better. (c) Share is somewhat right-skewed, but it has an outlier, Subway. Subway is also a huge outlier for Franchises, making it hard to judge the distribution of Franchises. Company is roughly uniformly distributed. Sales looks roughly Normal with a slight right skew. Burger has only two possible values.

Question 11.28

11.28

Rerun Exercise 11.24 without the data for McDonald’s and Starbucks. Compare your results with what you obtained in that exercise.

ffood

Question 11.29

11.29

Rerun Exercise 11.25 without the data for McDonald’s and Starbucks. Compare your results with what you obtained in that exercise.

ffood

11.29

Taking out the two outliers changed the model somewhat and did give us less error overall. (a)
. (b) .

Question 11.30

11.30

Rerun Exercise 11.26 without the data for McDonald’s and Starbucks. Compare your results with what you obtained in that exercise.

ffood

Question 11.31

11.31 Predicting retail sales.

Daily sales at a secondhand shop are recorded over a 25-day period.⁸ The daily gross sales and total number of items sold are broken down into items paid for by check, cash, and credit card. The owners expect that the daily numbers of cash items, check items, and credit card items sold will accurately predict gross sales.

retail

Describe the distribution of each of these four variables using both graphical and numerical summaries. Briefly summarize what you find and note any unusual observations.
Use plots and correlations to describe the relationships between each pair of variables. Summarize your results.
Run a multiple regression and give the least-squares equation.
Analyze the residuals from this multiple regression. Are there any patterns of interest?
One of the owners is troubled by the equation because the intercept is not zero (that is, no items sold should result in $0 gross sales). Explain to this owner why this isn’t a problem.

11.31

(a) All four variables are somewhat right-skewed. There is a potential outlier for gross sales. (b) All three explanatory variables look linearly related to gross sales, but each scatterplot has a few semi-outlying observations that could be influential. From the correlation matrix, we can see that both cash items and check items have quite strong linear relationships with gross sales, but they are also correlated with each other. (c) . (d) The Normal quantile plot shows a roughly Normal distribution with no outliers. The three residual plots all look reasonably random but show the couple of semi-outlying observations identified earlier. (e) The intercept is not significantly different from 0; .

Question 11.32

11.32 Architectural firm billings.

A summary of firms engaged in commercial architecture in the Indianapolis, Indiana, area provides firm characteristics, including total annual billing in the current year, total annual billing in the previous year, the number of architects, the number of engineers, and the number of staff employed in the firm.⁹ Consider developing a model to predict current total billing using the other four variables.

arch

Using numerical and graphical summaries, describe the distribution of current and past year total billing and the number of architects, engineers, and staff.
For each of the 10 pairs of variables, use graphical and numerical summaries to describe the relationship.
Carry out a multiple regression. Report the fitted regression equation and the value of the regression standard error s.
Analyze the residuals from the multiple regression. Are there any concerns?
A firm did not report its current total billing but had $1 million in billing last year and employs three architects, one engineer, and 17 staff members. What is the predicted total billing for this firm?