Examining Relationships

CHAPTER OUTLINE

  • 2.1 Scatterplots
  • 2.2 Correlation
  • 2.3 Least-Squares Regression
  • 2.4 Cautions about Correlation and Regression
  • 2.5 Relations in Categorical Data

INTRODUCTION

Our topic in this chapter is relationships between two variables. We measure both variables on the same cases. Often, we take the view that one of the variables explains or influences the other.

Statistical summaries of relationships are used to inform decisions in business and economics in many different settings.

  • United Airlines wants to know how well numbers of customers flying different segments this year will predict the numbers for next year.
  • How can Visa use characteristics of potential customers to decide who should receive promotional material?
  • IKEA wants to know how its number of Facebook followers relates to the company’s sales. Should it invest in increasing its Facebook presence?

Response Variable, Explanatory Variable

A response variable measures an outcome of a study. An explanatory variable explains or influences changes in a response variable.

You will often find explanatory variables called independent variables and response variables called dependent variables. The idea behind this language is that the response variable depends on the explanatory variable. Because the words “independent” and “dependent” have other meanings in statistics that are unrelated to the explanatory–response distinction, we prefer to avoid those words.

It is easiest to identify explanatory and response variables when we actually control the values of one variable to see how it affects another variable.

EXAMPLE 2.1

The Best Price?

Price is important to consumers and, therefore, to retailers. Sales of an item typically increase as its price falls, except for some luxury items, where high price suggests exclusivity. The seller’s profits for an item often increase as the price is reduced, due to increased sales, until the point at which lower profit per item cancels rising sales. Thus, a retail chain introduces a new TV that can respond to voice commands at several different price points and monitors sales. The chain wants to discover the price at which its profits are greatest. Price is the explanatory variable, and total profit from sales of the TV is the response variable.

When we just observe the values of both variables, there may or may not be explanatory and response variables. Whether there are such variables depends on how we plan to use the data.

EXAMPLE 2.2

Inventory and Sales

Emily is a district manager for a retail chain. She wants to know how the average monthly inventory and monthly sales for the stores in her district are related to each other. Emily doesn’t think that either inventory level or sales explains the other. She has two related variables, and neither is an explanatory variable.

Zachary manages another district for the same chain. He asks, “Can I predict a store’s monthly sales if I know its inventory level?” Zachary is treating the inventory level as the explanatory variable and the monthly sales as the response variable.

In Example 2.1, price differences actually cause differences in profits from sales of TVs. There is no cause-and-effect relationship between inventory levels and sales in Example 2.2. Because inventory and sales are closely related, we can nonetheless use a store’s inventory level to predict its monthly sales. We will learn how to do the prediction in Section 2.3. Prediction requires that we identify an explanatory variable and a response variable. Some other statistical techniques ignore this distinction. Remember that calling one variable “explanatory” and the other “response” doesn’t necessarily mean that changes in one cause changes in the other.

Most statistical studies examine data on more than one variable. Fortunately, statistical analysis of several-variable data builds on the tools we used to examine individual variables. The principles that guide our work also remain the same:

APPLY YOUR KNOWLEDGE

Question 2.1

Relationship between worker productivity and sleep

A study is designed to examine the relationship between how effectively employees work and how much sleep they get. Think about making a data set for this study.

  1. What are the cases?
  2. Would your data set have a label variable? If yes, describe it.
  3. What are the variables? Are they quantitative or categorical?
  4. Is there an explanatory variable and a response variable? Explain your answer.

Question 2.2

Price versus size

You visit a local Starbucks to buy a Mocha Frappuccino®. The barista explains that this blended coffee beverage comes in three sizes and asks if you want a Tall, a Grande, or a Venti. The prices are $3.75, $4.45, and $4.95, respectively.

  1. What are the variables and cases?
  2. Which variable is the explanatory variable? Which is the response variable? Explain your answers.
  3. The Tall contains 12 ounces of beverage, the Grande contains 16 ounces, and the Venti contains 20 ounces. Answer parts (a) and (b) with ounces in place of the names for the sizes.

SECTION 2.3 Exercises

For Exercises 2.44 and 2.45, see page 82; for 2.46, see page 84; for 2.47, see page 86; for 2.48 and 2.49, see page 88; for 2.50, see page 90; for 2.51, see page 90; for 2.52, see page 91; and for 2.53, see page 94.

Question 2.54

What is the equation for the selling price?

You buy items at a cost of x and sell them for y. Assume that your selling price includes a profit of 12% plus a fixed cost of $25.00. Give an equation that can be used to determine y from x.

Question 2.55

Production costs for cell phone batteries

A company manufactures batteries for cell phones. The overhead expenses of keeping the factory operational for a month (even if no batteries are made) total $500,000. Batteries are manufactured in lots (1000 batteries per lot), each costing $7000 to make. In this scenario, $500,000 is the fixed cost associated with producing cell phone batteries and $7000 is the marginal (or variable) cost of producing each lot of batteries. The total monthly cost y of producing x lots of cell phone batteries is given by the equation

  y = 500,000 + 7000x

  1. Draw a graph of this equation. (Choose two values of x, such as 0 and 20, to draw the line and a third for a check. Compute the corresponding values of y from the equation. Plot these two points on graph paper and draw the straight line joining them.)
  2. What will it cost to produce 15 lots of batteries (15,000 batteries)?
  3. If each lot cost $10,000 instead of $7000 to produce, what is the equation that describes total monthly cost for x lots produced?

Question 2.56

Inventory of Blu-Ray players

A local consumer electronics store sells exactly eight Blu-Ray players of a particular model each week. The store expects no more shipments of this particular model, and they have 96 such units in their current inventory.

  1. Give an equation for the number of Blu-Ray players of this particular model in inventory after x weeks. What is the slope of this line?
  2. Draw a graph of this line between now (Week 0) and Week 10.
  3. Would you be willing to use this line to predict the inventory after 25 weeks? Do the prediction and think about the reasonableness of the result.

Question 2.57

Compare the cell phone payment plans

A cellular telephone company offers two plans. Plan A charges $30 a month for up to 120 minutes of airtime and $0.55 per minute above 120 minutes. Plan B charges $35 a month for up to 200 minutes and $0.50 per minute above 200 minutes.

  1. Draw a graph of the Plan A charge against minutes used from 0 to 250 minutes.
  2. How many minutes a month must the user talk in order for Plan B to be less expensive than Plan A?

Question 2.58

Companies of the world

Refer to Exercise 1.118 (page 61), where we examined data collected by the World Bank on the numbers of companies that are incorporated and listed on their country’s stock exchange at the end of the year. In Exercise 2.10, you examined the relationship between these numbers for 2012 and 2002, and in Exercise 2.27, you found the correlation between these two variables.

  1. Find the least-squares regression equation for predicting the 2012 numbers using the 2002 numbers.
  2. Sweden had 332 companies in 2012 and 278 companies in 2002. Use the least-squares regression equation to find the predicted number of companies in 2012 for Sweden.
  3. Find the residual for Sweden.

Question 2.59

Companies of the world

Refer to the previous exercise and to Exercise 2.11 (page 72). Answer parts (a), (b), and (c) of the previous exercise for 2012 and 1992. Compare the results you found in the previous exercise with the ones you found in this exercise. Explain your findings in a short paragraph.

Question 2.60

A product for lab experiments

In Exercise 2.17 (page 73), you described the relationship between time and count for an experiment examining the decay of barium. In Exercise 2.29 (page 78), you found the correlation between these two variables.

  1. Find the least-squares regression equation for predicting count from time.
  2. Use the equation to predict the count at one, three, five, and seven minutes.
  3. Find the residuals for one, three, five, and seven minutes.
  4. Plot the residuals versus time.
  5. What does this plot tell you about the model you used to describe this relationship?

Question 2.61

Use a log for the radioactive decay

Refer to the previous exercise. Also see Exercise 2.18 (page 73), where you transformed the counts with a logarithm, and Exercise 2.30 (pages 78–79), where you found the correlation between time and the log of the counts. Answer parts (a) to (e) of the previous exercise for the transformed counts and compare the results with those you found in the previous exercise.

Question 2.62

Fuel efficiency and CO₂ emissions

In Exercise 2.37 (page 79), you examined the relationship between highway MPG and city MPG for 1067 vehicles for the model year 2014.

  1. Use the city MPG to predict the highway MPG. Give the equation of the least-squares regression line.
  2. The Lexus 350h AWD gets 42 MPG for city driving and 38 MPG for highway driving. Use your equation to find the predicted highway MPG for this vehicle.
  3. Find the residual.

Question 2.63

Fuel efficiency and CO₂ emissions

Refer to the previous exercise.

  1. Make a scatterplot of the data with highway MPG as the response variable and city MPG as the explanatory variable. Include the least-squares regression line on the plot. There is an unusual pattern for the vehicles with high city MPG. Describe it.
  2. Make a plot of the residuals versus city MPG. Describe the major features of this plot. How does the unusual pattern noted in part (a) appear in this plot?
  3. The Lexus 350h AWD that you examined in parts (b) and (c) of the previous exercise is in the group of unusual cases mentioned in parts (a) and (b) of this exercise. It is a hybrid vehicle that uses a conventional engine and an electric motor that is powered by a battery that can recharge when the vehicle is driven. The conventional engine also turns off when the vehicle is stopped in traffic. As a result of these features, hybrid vehicles are unusually efficient for city driving, but they do not have a similar advantage when driven at higher speeds on the highway. How do these facts explain the residual for this vehicle?
  4. Several Toyota vehicles are also hybrids. Use the residuals to suggest which vehicles are in this category.

Question 2.64

Consider the fuel type

Refer to the previous two exercises and to Figure 2.6 (page 71), where different colors are used to distinguish four different types of fuels used by these vehicles. In Exercise 2.38, you examined the relationship between Highway MPG and City MPG for each of the four different fuel types used by these vehicles. Using the previous two exercises as a guide, analyze these data separately for each of the four fuel types. Write a summary of your findings.

Question 2.65

Predict one characteristic of a product using another characteristic

In Exercise 2.12 (page 72), you used a scatterplot to examine the relationship between calories per 12 ounces and percent alcohol in 175 domestic brands of beer. In Exercise 2.31 (page 79), you calculated the correlation between these two variables.

  1. Find the equation of the least-squares regression line for these data.
  2. Make a scatterplot of the data with the least-squares regression line.

Question 2.66

Predicted values and residuals

Refer to the previous exercise.

  1. New Belgium Fat Tire is 5.2 percent alcohol and has 160 calories per 12 ounces. Find the predicted calories for New Belgium Fat Tire.
  2. Find the residual for New Belgium Fat Tire.

Question 2.67

Predicted values and residuals

Refer to the previous two exercises.

  1. Make a plot of the residuals versus percent alcohol.
  2. Interpret the plot. Is there any systematic pattern? Explain your answer.
  3. Examine the plot carefully and determine the approximate location of New Belgium Fat Tire. Is there anything unusual about this case? Explain why or why not.

Question 2.68

Carbohydrates and alcohol in beer revisited

Refer to Exercise 2.65. The data that you used to compute the least-squares regression line includes a beer with a very low alcohol content that might be considered to be an outlier.

  1. Remove this case and recompute the least-squares regression line.
  2. Make a graph of the regression lines with and without this case.
  3. Do you think that this case is influential? Explain your answer.

Question 2.69

Monitoring the water quality near a manufacturing plant

Manufacturing companies (and the Environmental Protection Agency) monitor the quality of the water near manufacturing plants. Measurements of pollutants in water are indirect—a typical analysis involves forming a dye by a chemical reaction with the dissolved pollutant, then passing light through the solution and measuring its “absorbance.” To calibrate such measurements, the laboratory measures known standard solutions and uses regression to relate absorbance to pollutant concentration. This is usually done every day. Here is one series of data on the absorbance for different levels of nitrates. Nitrates are measured in milligrams per liter of water.10

  Nitrates (mg/L)   Absorbance
          50            7.0
          50            7.5
         100           12.8
         200           24.0
         400           47.0
         800           93.0
        1200          138.0
        1600          183.0
        2000          230.0
        2000          226.0

  1. Chemical theory says that these data should lie on a straight line. If the correlation is not at least 0.997, something went wrong and the calibration procedure is repeated. Plot the data and find the correlation. Must the calibration be done again?
  2. What is the equation of the least-squares line for predicting absorbance from concentration? If the lab analyzed a specimen with 500 milligrams of nitrates per liter, what do you expect the absorbance to be? Based on your plot and the correlation, do you expect your predicted absorbance to be very accurate?

Question 2.70

Data generated by software

The following 20 observations on y and x were generated by a computer program.

      y       x         y       x
    34.38   22.06     27.07   17.75
    30.38   19.88     31.17   19.96
    26.13   18.83     27.74   17.87
    31.85   22.09     30.01   20.20
    26.77   17.19     29.61   20.65
    29.00   20.72     31.78   20.32
    28.92   18.10     32.93   21.37
    26.30   18.01     30.29   17.31
    29.49   18.69     28.57   23.50
    31.36   18.05     29.80   22.02

  1. Make a scatterplot and describe the relationship between y and x.
  2. Find the equation of the least-squares regression line and add the line to your plot.
  3. Plot the residuals versus x.
  4. What percent of the variability in y is explained by x?
  5. Summarize your analysis of these data in a short paragraph.

Question 2.71

Add an outlier

Refer to the previous exercise. Add an additional case with y = 60 and x = 32 to the data set. Repeat the analysis that you performed in the previous exercise and summarize your results, paying particular attention to the effect of this outlier.

Question 2.72

Add a different outlier

Refer to the previous two exercises. Add an additional case with y = 60 and x = 18 to the original data set.

  1. Repeat the analysis that you performed in the first exercise and summarize your results, paying particular attention to the effect of this outlier.
  2. In this exercise and in the previous one, you added an outlier to the original data set and reanalyzed the data. Write a short summary of the changes in correlations that can result from different kinds of outliers.

Question 2.73

Influence on correlation

The Correlation and Regression applet at the text website allows you to create a scatterplot and to move points by dragging with the mouse. Click to create a group of 12 points in the lower-left corner of the scatterplot with a strong straight-line pattern (correlation about 0.9).

  1. Add one point at the upper right that is in line with the first 12. How does the correlation change?
  2. Drag this last point down until it is opposite the group of 12 points. How small can you make the correlation? Can you make the correlation negative? You see that a single outlier can greatly strengthen or weaken a correlation. Always plot your data to check for outlying points.

Question 2.74

Influence in regression

As in the previous exercise, create a group of 12 points in the lower-left corner of the scatterplot with a strong straight-line pattern (correlation at least 0.9). Click the “Show least-squares line” box to display the regression line.

  1. Add one point at the upper right that is far from the other 12 points but exactly on the regression line. Why does this outlier have no effect on the line even though it changes the correlation?
  2. Now drag this last point down until it is opposite the group of 12 points. You see that one end of the least-squares line chases this single point, while the other end remains near the middle of the original group of 12. What about the last point makes it so influential?

Question 2.75

Employee absenteeism and raises

Data on number of days of work missed and annual salary increase for a company’s employees show that, in general, employees who missed more days of work during the year received smaller raises than those who missed fewer days. Number of days missed explained 49% of the variation in salary increases. What is the numerical value of the correlation between number of days missed and salary increase?

Question 2.76

Always plot your data!

Four sets of data prepared by the statistician Frank Anscombe illustrate the dangers of calculating without first plotting the data.11

  1. Without making scatterplots, find the correlation and the least-squares regression line for all four data sets. What do you notice? Use the regression line to predict y for x = 10.
  2. Make a scatterplot for each of the data sets, and add the regression line to each plot.
  3. In which of the four cases would you be willing to use the regression line to describe the dependence of y on x? Explain your answer in each case.

2.4 Cautions about Correlation and Regression

Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, you must be aware of their limitations, beginning with the fact that correlation and regression describe only linear relationships. Also remember that the correlation r and the least-squares regression line are not resistant. One influential observation or incorrectly entered data point can greatly change these measures. Always plot your data before interpreting regression or correlation. Here are some other cautions to keep in mind when you apply correlation and regression or read accounts of their use.
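To make the point about resistance concrete, here is a minimal Python sketch. The data are made up for illustration (ten points close to the line y = x, plus one hypothetical mis-entered case), and the helper name pearson_r is our own choice, not something defined in the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: ten points that stay within 0.5 of the line y = x.
xs = list(range(1, 11))
ys = [x + (0.5 if x % 2 == 0 else -0.5) for x in xs]
r_clean = pearson_r(xs, ys)

# Add one hypothetical case far from the pattern, such as a data-entry error.
r_with_outlier = pearson_r(xs + [20], ys + [1.0])

print(f"r without the extra case: {r_clean:.3f}")
print(f"r with the extra case:    {r_with_outlier:.3f}")
```

With these invented numbers, a single added case pulls a correlation near 1 down to well under 0.5, which is why plotting the data first matters.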

Extrapolation

Associations for variables can be trusted only for the range of values for which data have been collected. Even a very strong relationship may not hold outside the data’s range.

EXAMPLE 2.22

Predicting the Number of Target Stores in 2008 and 2014

Here are data on the number of Target stores in operation at the end of each year in the early 1990s, in 2008, and in 2014:12

  Year (x)      1990   1991   1992   1993   2008   2014
  Stores (y)     420    463    506    554   1682   1916

A plot of these data is given in Figure 2.19. The data for 1990 through 1993 lie almost exactly on a straight line, which we calculated using only the data from 1990 to 1993. The equation of this line is y = −88,136 + 44.5x and r² = 0.9992. We know that 99.92% of the variation in stores is explained by year for these years. The equation predicts 1220 stores for 2008, but the actual number of stores is much higher, 1682. It predicts 1487 for 2014, also an underestimate by a large amount. The predictions are very poor because the very strong linear trend evident in the 1990 to 1993 data did not continue to the years 2008 and 2014.

FIGURE 2.19 Plot of the number of Target stores versus year with the least-squares regression line calculated using data from 1990, 1991, 1992, and 1993, Example 2.22. The poor fits to the numbers of stores in 2008 and 2014 illustrate the dangers of extrapolation.
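As a rough check on the numbers in Example 2.22, the following Python sketch refits the line using only the 1990 to 1993 data and then extrapolates to 2008 and 2014. The store counts come from the example; the function name least_squares and the variable names are our own, chosen for illustration.

```python
# Data from Example 2.22: Target stores in operation at the end of each year.
fit_years = [1990, 1991, 1992, 1993]
fit_stores = [420, 463, 506, 554]
later_actual = {2008: 1682, 2014: 1916}


def least_squares(xs, ys):
    """Return (intercept, slope, r_squared) for the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


intercept, slope, r_squared = least_squares(fit_years, fit_stores)
print(f"y = {intercept:,.0f} + {slope:.1f}x, r^2 = {r_squared:.4f}")

# Extrapolate far beyond the 1990-1993 range and compare with the actual counts.
predictions = {year: intercept + slope * year for year in later_actual}
for year, actual in later_actual.items():
    shortfall = actual - predictions[year]
    print(f"{year}: predicted {predictions[year]:.0f}, actual {actual}, "
          f"underestimate of {shortfall:.0f} stores")
```

The fit within 1990 to 1993 is nearly perfect, yet the extrapolated predictions fall hundreds of stores short of the actual counts, so a high r² says nothing about how the line behaves outside the range of the data.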

Predictions made far beyond the range for which data have been collected can’t be trusted. Few relationships are linear for all values of x. It is risky to stray far from the range of x-values that actually appear in your data.

Extrapolation

Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable x that you used to obtain the line. Such predictions are often not accurate.

In general, extrapolation involves using a mathematical relationship beyond the range of the data that were used to estimate the relationship. The scenario described in the previous example is typical: we try to use a least-squares relationship to make predictions for values of the explanatory variable that are much larger than the values in the data that we have. We can encounter the same difficulty when we attempt predictions for values of the explanatory variable that are much smaller than the values in the data that we have.

Careful judgment is needed when making predictions. If the prediction is for values that are within the range of the data that you have, or are not too far above or below, then your prediction can be reasonably accurate. Beyond that, you are in danger of making an inaccurate prediction.

Correlations based on averaged data

Many regression and correlation studies work with averages or other measures that combine information from many cases. You should note this carefully and resist the temptation to apply the results of such studies to individual cases. Correlations based on averages are usually higher than correlations based on individual cases. This is another reminder that it is important to note exactly what variables are measured in a statistical study.
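The effect of averaging can be seen in a small Python sketch. The data below are hypothetical: five groups of three cases each, where every group's mean lies exactly on a line but individual cases scatter around it. The helper pearson_r and all variable names are assumptions made for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: each group shares one x value; its three cases have
# y values that scatter by -3, 0, and +3 around the line y = 2x.
group_x = [1, 2, 3, 4, 5]
offsets = [-3, 0, 3]

individual_x = [x for x in group_x for _ in offsets]
individual_y = [2 * x + d for x in group_x for d in offsets]

# Averaging within each group removes the case-to-case scatter.
average_y = [sum(2 * x + d for d in offsets) / len(offsets) for x in group_x]

r_individual = pearson_r(individual_x, individual_y)
r_averages = pearson_r(group_x, average_y)

print(f"r for individual cases: {r_individual:.3f}")
print(f"r for group averages:   {r_averages:.3f}")
```

In this constructed example the averages fall exactly on a line, so their correlation is 1, while the individual cases show a noticeably weaker correlation. Real data will be less extreme, but the direction of the effect is the same.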

Lurking variables

Correlation and regression describe the relationship between two variables. Often, the relationship between two variables is strongly influenced by other variables. We try to measure potentially influential variables. We can then use more advanced statistical methods to examine all the relationships revealed by our data. Sometimes, however, the relationship between two variables is influenced by other variables that we did not measure or even think about. Variables lurking in the background—measured or not—often help explain statistical associations.

Lurking Variable

A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

A lurking variable can falsely suggest a strong relationship between x and y, or it can hide a relationship that is really there. Here is an example of a negative correlation that is due to a lurking variable.

EXAMPLE 2.23

Gas and Electricity Bills

A single-family household receives bills for gas and electricity each month. The 12 observations for a recent year are plotted with the least-squares regression line in Figure 2.20. We have arbitrarily chosen to put the electricity bill on the x axis and the gas bill on the y axis. There is a clear negative association. Does this mean that a high electricity bill causes the gas bill to be low, and vice versa?

To understand the association in this example, we need to know a little more about the two variables. In this household, heating is done by gas and cooling by electricity. Therefore, in the winter months, the gas bill will be relatively high and the electricity bill will be relatively low. The pattern is reversed in the summer months. The association that we see in this example is due to a lurking variable: time of year.

FIGURE 2.20 Scatterplot with the least-squares regression line for predicting monthly charges for gas using monthly charges for electricity for a household, Example 2.23.

APPLY YOUR KNOWLEDGE

Question 2.77

Education and income

There is a strong positive correlation between years of education and income for economists employed by business firms. In particular, economists with a doctorate earn more than economists with only a bachelor’s degree. There is also a strong positive correlation between years of education and income for economists employed by colleges and universities. But when all economists are considered, there is a negative correlation between education and income. The explanation for this is that business pays high salaries and employs mostly economists with bachelor’s degrees, while colleges pay lower salaries and employ mostly economists with doctorates. Sketch a scatterplot with two groups of cases (business and academic) illustrating how a strong positive correlation within each group and a negative overall correlation can occur together.

Association is not causation

When we study the relationship between two variables, we often hope to show that changes in the explanatory variable cause changes in the response variable. But a strong association between two variables is not enough to draw conclusions about cause and effect. Sometimes, an observed association really does reflect cause and effect. Natural gas consumption in a household that uses natural gas for heating will be higher in colder months because cold weather requires burning more gas to stay warm. In other cases, an association is explained by lurking variables, and the conclusion that x causes y is either wrong or not proved. Here is an example.

EXAMPLE 2.24

Does Television Extend Life?

Measure the number of television sets per person x and the average life expectancy y for the world’s nations. There is a high positive correlation: nations with many TV sets have higher life expectancies.

The basic meaning of causation is that by changing x, we can bring about a change in y. Could we lengthen the lives of people in Rwanda by shipping them TV sets? No. Rich nations have more TV sets than poor nations. Rich nations also have longer life expectancies because they offer better nutrition, clean water, and better health care. There is no cause-and-effect tie between TV sets and length of life.

Correlations such as that in Example 2.24 are sometimes called “nonsense correlations.” The correlation is real. What is nonsense is the conclusion that changing one of the variables causes changes in the other. A lurking variable—such as national wealth in Example 2.24—that influences both x and y can create a high correlation, even though there is no direct connection between x and y.
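A small Python sketch can show how a lurking variable produces this kind of correlation. All numbers here are invented: 16 hypothetical nations at four wealth levels, where wealth raises both TV ownership and life expectancy but, among nations at the same wealth level, the two are unrelated by construction. The names pearson_r, wealth_levels, and wiggles are our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: four wealth levels, four nations at each level.
wealth_levels = [1, 2, 3, 4]
# Within a wealth level, TV and life-expectancy deviations are balanced
# so that they are uncorrelated with each other.
wiggles = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# Each nation is (wealth, TV sets per 100 people, life expectancy in years).
nations = [
    (w, 10 * w + 2 * a, 50 + 8 * w + 2 * b)
    for w in wealth_levels
    for a, b in wiggles
]

tv = [nation[1] for nation in nations]
life = [nation[2] for nation in nations]
r_overall = pearson_r(tv, life)

# Hold the lurking variable fixed: correlate within each wealth level.
r_within = {}
for w in wealth_levels:
    group = [nation for nation in nations if nation[0] == w]
    r_within[w] = pearson_r([g[1] for g in group], [g[2] for g in group])

print(f"overall r between TV sets and life expectancy: {r_overall:.3f}")
for w, r in r_within.items():
    print(f"wealth level {w}: r = {r:.3f}")
```

Across all the invented nations the correlation is high, but it disappears once wealth is held fixed. That is the pattern to expect when a lurking variable drives both x and y.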

APPLY YOUR KNOWLEDGE

Question 2.78

How’s your self-esteem?

People who do well tend to feel good about themselves. Perhaps helping people feel good about themselves will help them do better in their jobs and in life. For a time, raising self-esteem became a goal in many schools and companies. Can you think of explanations for the association between high self-esteem and good performance other than “Self-esteem causes better work”?

Question 2.79

Are big hospitals bad for you?

A study shows that there is a positive correlation between the size of a hospital (measured by its number of beds x) and the median number of days y that patients remain in the hospital. Does this mean that you can shorten a hospital stay by choosing a small hospital? Why?

Question 2.80

Do firefighters make fires worse?

Someone says, “There is a strong positive correlation between the number of firefighters at a fire and the amount of damage the fire does. So sending lots of firefighters just causes more damage.” Explain why this reasoning is wrong.

These and other examples lead us to the most important caution about correlation, regression, and statistical association between variables in general.

Association Does Not Imply Causation

An association between an explanatory variable x and a response variable y (even if it is very strong) is not, by itself, good evidence that changes in x actually cause changes in y.

The best way to get good evidence that x causes y is to do an experiment in which we change x and keep lurking variables under control. We will discuss experiments in Chapter 3. When experiments cannot be done, finding the explanation for an observed association is often difficult and controversial. Many of the sharpest disputes in which statistics plays a role involve questions of causation that cannot be settled by experiment. Does gun control reduce violent crime? Does cell phone usage cause brain tumors? Has increased free trade widened the gap between the incomes of more-educated and less-educated American workers? All of these questions have become public issues. All concern associations among variables. And all have this in common: they try to pinpoint cause and effect in a setting involving complex relations among many interacting variables.

BEYOND THE BASICS: Data Mining

Chapters 1 and 2 of this book are devoted to the important aspect of statistics called exploratory data analysis (EDA). We use graphs and numerical summaries to examine data, searching for patterns and paying attention to striking deviations from the patterns we find. In discussing regression, we advanced to using the pattern we find (in this case, a linear pattern) for prediction.

Suppose now that we have a truly enormous database, such as all purchases recorded by the cash register scanners of our retail chain during the past week. Surely this mass of data contains patterns that might guide business decisions. If we could clearly see the types of activewear preferred in large California cities and compare the preferences of small Midwest cities—right now, not at the end of the season—we might improve profits in both parts of the country by matching stock with demand. This sounds much like EDA, and indeed it is. Exploring very large databases in the hope of finding useful patterns is called data mining. Here are some distinctive features of data mining:

All of these features point to the need for sophisticated computer science as a basis for data mining. Indeed, data mining is often thought of as a part of computer science. Yet many statistical ideas and tools—mostly tools for dealing with multidimensional data, not the sort of thing that appears in a first statistics course—are very helpful. Like many modern developments, data mining crosses the boundaries of traditional fields of study.

Do remember that the perils we encounter with blind use of correlation and regression are yet more perilous in data mining, where the fog of an immense database prevents clear vision. Extrapolation, ignoring lurking variables, and confusing association with causation are traps for the unwary data miner.



SECTION 2.4 Summary



SECTION 2.4 Exercises

For Exercises 2.77 to 2.79, see page 101; and for 2.80, see page 102.

Question 2.81

What’s wrong?

Each of the following statements contains an error. Describe each error and explain why the statement is wrong.

  1. A negative relationship is always due to causation.
  2. A lurking variable is always a quantitative variable.
  3. If the residuals are all negative, this implies that there is a negative relationship between the response variable and the explanatory variable.

Question 2.82

What’s wrong?

Each of the following statements contains an error. Describe each error and explain why the statement is wrong.

  1. An outlier will always have a large residual.
  2. If we have data at values of x equal to 1, 2, 3, 4, and 5, and we try to predict the value of y at x = 2.5 using a least-squares regression line, we are extrapolating.
  3. High correlation implies causation.

Question 2.83

Predict the sales

You analyzed the past 10 years of sales data for your company, and the data fit a straight line very well. Do you think the equation you found would be useful for predicting next year’s sales? Would your answer change if the prediction was for sales five years from now? Give reasons for your answers.

Question 2.84

Older workers and income

The effect of a lurking variable can be surprising when cases are divided into groups. Explain how, as a nation’s population grows older, mean income can go down for workers in each age group but still go up for all workers.

Question 2.85

Marital status and income

Data show that married, divorced, and widowed men earn quite a bit more than men the same age who have never been married. This does not mean that a man can raise his income by getting married because men who have never been married are different from married men in many ways other than marital status. Suggest several lurking variables that might help explain the association between marital status and income.

104

Question 2.86

Sales at a farmers’ market

You sell fruits and vegetables at your local farmers’ market, and you keep track of your weekly sales. A plot of the data from May through August suggests an increase over time that is approximately linear, so you calculate the least-squares regression line. Your partner likes the plot and the line and suggests that you use it to estimate sales for the rest of the year. Explain why this is probably a very bad idea.

Question 2.87

Does your product have an undesirable side effect?

People who use artificial sweeteners in place of sugar tend to be heavier than people who use sugar. Does this mean that artificial sweeteners cause weight gain? Give a more plausible explanation for this association.

Question 2.88

Does your product help nursing-home residents?

A group of college students believes that herbal tea has remarkable powers. To test this belief, they make weekly visits to a local nursing home, where they visit with the residents and serve them herbal tea. The nursing-home staff reports that, after several months, many of the residents are healthier and more cheerful. We should commend the students for their good deeds but doubt that herbal tea helped the residents. Identify the explanatory and response variables in this informal study. Then explain what lurking variables account for the observed association.

Question 2.89

Education and income

There is a strong positive correlation between years of schooling completed x and lifetime earnings y for American men. One possible reason for this association is causation: more education leads to higher-paying jobs. But lurking variables may explain some of the correlation. Suggest some lurking variables that would explain why men with more education earn more.

Question 2.90

Do power lines cause cancer?

It has been suggested that electromagnetic fields of the kind present near power lines can cause leukemia in children. Experiments with children and power lines are not ethical. Careful studies have found no association between exposure to electromagnetic fields and childhood leukemia.13 Suggest several lurking variables that you would want information about in order to investigate the claim that living near power lines is associated with cancer.



2.5 Relations in Categorical Data

We have concentrated on relationships in which at least the response variable is quantitative. Now we shift to describing relationships between two or more categorical variables. Some variables—such as gender, race, and occupation—are categorical by nature. Other categorical variables are created by grouping values of a quantitative variable into classes. Published data often appear in grouped form to save space. To analyze categorical data, we use the counts or percents of cases that fall into various categories.

CASE 2.2

Does the Right Music Sell the Product?

Market researchers know that background music can influence the mood and the purchasing behavior of customers. One study in a supermarket in Northern Ireland compared three treatments: no music, French accordion music, and Italian string music. Under each condition, the researchers recorded the numbers of bottles of French, Italian, and other wine purchased.14 Here is the two-way table that summarizes the data:

                     Music
Wine       None   French   Italian   Total
French       30       39        30      99
Italian      11        1        19      31
Other        43       35        35     113
Total        84       75        84     243
Table 2.8: Counts for wine and music

105

two-way table row and column variables

The data table for Case 2.2 is a two-way table because it describes two categorical variables. The type of wine is the row variable because each row in the table describes the data for one type of wine. The type of music played is the column variable because each column describes the data for one type of music. The entries in the table are the counts of bottles of wine of the particular type sold while the given type of music was playing. The two variables in this example, wine and music, are both categorical variables.

This two-way table is a 3 × 3 table, to which we have added the marginal totals obtained by summing across rows and columns. For example, the first-row total is 30 + 39 + 30 = 99. The grand total, the number of bottles of wine in the study, can be computed by summing the row totals, 99 + 31 + 113 = 243, or the column totals, 84 + 75 + 84 = 243. It is a good idea to do both as a check on your arithmetic.
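The arithmetic check described here is easy to automate. The following is a minimal Python sketch, with variable names of our own choosing, that computes the marginal totals for Table 2.8 and confirms that the grand total is the same whether we sum the row totals or the column totals.

```python
# Counts from Table 2.8: each row is a wine type, and the three
# entries in a row correspond to the music types listed below.
music_types = ["None", "French", "Italian"]
counts = {
    "French": [30, 39, 30],
    "Italian": [11, 1, 19],
    "Other": [43, 35, 35],
}

# Marginal row totals: sum across each row.
row_totals = {wine: sum(row) for wine, row in counts.items()}

# Marginal column totals: sum down each column.
col_totals = {
    music: sum(row[j] for row in counts.values())
    for j, music in enumerate(music_types)
}

# Compute the grand total both ways as a check on the arithmetic.
grand_from_rows = sum(row_totals.values())
grand_from_cols = sum(col_totals.values())
assert grand_from_rows == grand_from_cols

print("Row totals:", row_totals)
print("Column totals:", col_totals)
print("Grand total:", grand_from_rows)
```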

Marginal distributions

marginal row totals

marginal column totals

How can we best grasp the information contained in the wine and music table? First, look at the distribution of each variable separately. The distribution of a categorical variable says how often each outcome occurred. The “Total” column at the right margin of the table contains the totals for each of the rows. These are called marginal row totals. They give the numbers of bottles of wine sold by the type of wine: 99 bottles of French wine, 31 bottles of Italian wine, and 113 bottles of other types of wine. Similarly, the marginal column totals are given in the “Total” row at the bottom margin of the table. These are the numbers of bottles of wine that were sold while different types of music were being played: 84 bottles when no music was playing, 75 bottles when French music was playing, and 84 bottles when Italian music was playing.

marginal distribution

Percents are often more informative than counts. We can calculate the distribution of wine type in percents by dividing each row total by the table total. This distribution is called the marginal distribution of wine type.

Marginal Distributions

To find the marginal distribution for the row variable in a two-way table, divide each row total by the total number of entries in the table. Similarly, to find the marginal distribution for the column variable in a two-way table, divide each column total by the total number of entries in the table.

Although the usual definition of a distribution is in terms of proportions, we often multiply these by 100 to convert them to percents. You can describe a distribution either way as long as you clearly indicate which format you are using.

EXAMPLE 2.25

Calculating a Marginal Distribution

CASE 2.2 Let’s find the marginal distribution for the types of wine sold. The counts that we need for these calculations are in the margin at the right of the table:

Wine Total
French 99
Italian 31
Other 113
Total 243

106

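The percent of bottles of French wine sold is

$$\frac{\text{bottles of French wine sold}}{\text{total bottles sold}} = \frac{99}{243} = 0.4074 = 40.74\%$$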

Similar calculations for Italian wine and other wine give the following distribution in percents:

Wine French Italian Other
Percent 40.74 12.76 46.50

The total should be 100% because each bottle of wine sold is classified into exactly one of these three categories. In this case, the total is exactly 100%. Small deviations from 100% can occur due to roundoff error.
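The calculation in Example 2.25 can be sketched in a few lines of Python. This is an illustration only, not output from statistical software; the function name `marginal_percents` is our own. It divides each marginal row total by the table total and converts the result to a percent.

```python
# Marginal row totals for wine type, taken from Table 2.8.
row_totals = {"French": 99, "Italian": 31, "Other": 113}


def marginal_percents(totals):
    """Divide each marginal total by the table total; return percents."""
    table_total = sum(totals.values())
    return {category: 100 * count / table_total
            for category, count in totals.items()}


wine_percents = marginal_percents(row_totals)

for wine, percent in wine_percents.items():
    print(f"{wine}: {percent:.2f}%")

# Every bottle falls into exactly one category, so the percents
# should total 100 apart from roundoff.
print(f"Total: {sum(wine_percents.values()):.2f}%")
```

The same function applies unchanged to the marginal column totals if you want the marginal distribution of the column variable.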

As usual, we prefer to display numerical summaries using a graph. Figure 2.21 is a bar graph of the distribution of wine type sold. In a two-way table, we have two marginal distributions, one for each of the variables that defines the table.

Figure 2.21: Marginal distribution of type of wine sold, Example 2.25.

APPLY YOUR KNOWLEDGE

Question 2.91

Marginal distribution for type of music

CASE 2.2 Find the marginal distribution for the type of music. Display the distribution using a graph.

In working with two-way tables, you must calculate lots of percents. Here’s a tip to help you decide what fraction gives the percent you want. Ask, “What group represents the total that I want a percent of?” The count for that group is the denominator of the fraction that leads to the percent. In Example 2.25, we wanted percents “of bottles of the different types of wine sold,” so the table total is the denominator.

APPLY YOUR KNOWLEDGE

Question 2.92

Construct a two-way table

Construct your own 2 × 3 table. Add the marginal totals and find the two marginal distributions.

Question 2.93

Fields of study for college students

The following table gives the number of students (in thousands) graduating from college with degrees in several fields of study for seven countries:15

107

Field of study                      Canada  France  Germany  Italy  Japan  U.K.  U.S.
Social sciences, business, law          64     153       66    125    259   152   878
Science, mathematics, engineering       35     111       66     80    136   128   355
Arts and humanities                     27      74       33     42    123   105   397
Education                               20      45       18     16     39    14   167
Other                                   30     289       35     58     97    76   272

  1. Calculate the marginal totals, and add them to the table.
  2. Find the marginal distribution of country, and give a graphical display of the distribution.
  3. Do the same for the marginal distribution of field of study.

Conditional distributions

The 3 × 3 table for Case 2.2 contains much more information than the two marginal distributions. We need to do a little more work to describe the relationship between the type of music playing and the type of wine purchased. Relationships among categorical variables are described by calculating appropriate percents from the counts given.

Conditional Distributions

To find the conditional distribution of the column variable for a particular value of the row variable in a two-way table, divide each count in the row by the row total. Similarly, to find the conditional distribution of the row variable for a particular value of the column variable in a two-way table, divide each count in the column by the column total.

EXAMPLE 2.26

Wine Purchased When No Music Was Playing

CASE 2.2 What types of wine were purchased when no music was playing? To answer this question, we find the conditional distribution of wine type for the value of music equal to none. The counts we need are in the first column of our table:

Music
Wine None
French 30
Italian 11
Other 43
Total 84

What percent of the wine sold when no music was playing was French wine? To answer this question, we divide the number of bottles of French wine sold when no music was playing by the total number of bottles of wine sold when no music was playing:

$$\frac{30}{84} = 0.357 = 35.7\%$$

108

In the same way, we calculate the percents for Italian and other types of wine. Here are the results:

Wine type: French Italian Other
Percent when no music is playing: 35.7 13.1 51.2

Other wine was the most popular choice when no music was playing, but French wine had a reasonably large share. Notice that these percents sum to 100%. There is no roundoff error here. The distribution is displayed in Figure 2.22.
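As a hedged illustration of the calculation in Example 2.26, the Python sketch below conditions on one value of the column variable. The function name `conditional_on_music` and the layout of the data are our own assumptions. Each count in the chosen column is divided by that column's total.

```python
# Counts from Table 2.8: each row is a wine type, and the three
# entries in a row correspond to the music types listed below.
music_types = ["None", "French", "Italian"]
counts = {
    "French": [30, 39, 30],
    "Italian": [11, 1, 19],
    "Other": [43, 35, 35],
}


def conditional_on_music(table, columns, music):
    """Conditional distribution of wine type, in percents, for one music type."""
    j = columns.index(music)
    column = {wine: row[j] for wine, row in table.items()}
    column_total = sum(column.values())
    return {wine: 100 * count / column_total
            for wine, count in column.items()}


no_music = conditional_on_music(counts, music_types, "None")

for wine, percent in no_music.items():
    print(f"{wine}: {percent:.1f}%")
```

Note that the denominator is the column total (84 bottles), not the table total (243 bottles). That is the only difference between this calculation and the marginal distribution.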

Figure 2.22: Conditional distribution of types of wine sold when no music is playing, Example 2.26.

APPLY YOUR KNOWLEDGE

Question 2.94

CASE 2.2: Conditional distribution when French music was playing

  1. Write down the column of counts that you need to compute the conditional distribution of the type of wine sold when French music was playing.
  2. Compute this conditional distribution.
  3. Display this distribution graphically.
  4. Compare this distribution with the one in Example 2.26. Was there an increase in sales of French wine when French music was playing rather than no music?

Question 2.95

CASE 2.2: Conditional distribution when Italian music was playing

  1. Write down the column of counts that you need to compute the conditional distribution of the type of wine sold when Italian music was playing.
  2. Compute this conditional distribution.
  3. Display this distribution graphically.
  4. Compare this distribution with the one in Example 2.26. Was there an increase in sales of Italian wine when Italian music was playing rather than no music?

Question 2.96

CASE 2.2: Compare the conditional distributions

In Example 2.26, we found the distribution of sales by wine type when no music was playing. In Exercise 2.94, you found the distribution when French music was playing, and in Exercise 2.95, you found the distribution when Italian music was playing. Examine these three conditional distributions carefully, and write a paragraph summarizing the relationship between sales of different types of wine and the music played.

109

For Case 2.2, we examined the relationship between sales of different types of wine and the music that was played by studying the three conditional distributions of type of wine sold, one for each music condition. For these computations, we used the counts from the 3 × 3 table, one column at a time. We could also have computed conditional distributions using the counts for each row. The result would be the three conditional distributions of the type of music played for each of the three wine types. For this example, we think that conditioning on the type of music played gives us the most useful data summary. Comparing conditional distributions can be particularly useful when the column variable is an explanatory variable.

The choice of which conditional distribution to use depends on the nature of the data and the questions that you want to ask. Sometimes you will prefer to condition on the column variable, and sometimes you will prefer to condition on the row variable. Occasionally, both sets of conditional distributions will be useful. Statistical software will calculate all of these quantities. You need to select the parts of the output that are needed for your particular questions. Don’t let computer software make this choice for you.

APPLY YOUR KNOWLEDGE

Question 2.97

Fields of study by country for college students

In Exercise 2.93, you examined data on fields of study for graduating college students from seven countries.

  1. Find the seven conditional distributions giving the distribution of graduates in the different fields of study for each country.
  2. Display the conditional distributions graphically.
  3. Write a paragraph summarizing the relationship between field of study and country.

Question 2.98

Countries by fields of study for college students

Refer to the previous exercise. Answer the same questions for the conditional distribution of country for each field of study.

Question 2.99

Compare the two analytical approaches

In the previous two exercises, you examined the relationship between country and field of study in two different ways.

  1. Compare these two approaches.
  2. Which do you prefer? Give a reason for your answer.
  3. What kinds of questions are most easily answered by each of the two approaches? Explain your answer.

Mosaic plots and software output

mosaic plot

Statistical software will compute all of the quantities that we have discussed in this section. Included in some output is a very useful graphical summary called a mosaic plot. Here is an example.

EXAMPLE 2.27

Software Output for Wine and Music

CASE 2.2 Output from JMP statistical software for the wine and music data is given in Figure 2.23. The mosaic plot is given in the top part of the display. Here, we think of music as the explanatory variable and wine as the response variable, so music is displayed across the x axis in the plot. The conditional distributions of wine for each type of music are displayed in the three columns. Note that when French music is playing, 52% of the wine sold is French wine. The red bars display the percents of French wine sold for each type of music. Similarly, the green and blue bars display the corresponding percents for Italian wine and other wine, respectively. The widths of the three sets of bars display the marginal distribution of music. We can see that the proportions are approximately equal, but slightly fewer bottles were sold while French music was playing than under either of the other two music conditions.
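To see what a mosaic plot encodes, the Python sketch below computes the two sets of proportions that determine its geometry, using the counts in Table 2.8. This is not JMP's algorithm, only the arithmetic implied by the description above, and the names `widths` and `heights` are our own. The width of each column is a marginal proportion for music, and the height of each bar within a column is a conditional proportion for wine.

```python
# Counts from Table 2.8: each row is a wine type, and the three
# entries in a row correspond to the music types listed below.
music_types = ["None", "French", "Italian"]
counts = {
    "French": [30, 39, 30],
    "Italian": [11, 1, 19],
    "Other": [43, 35, 35],
}

col_totals = {
    music: sum(counts[wine][j] for wine in counts)
    for j, music in enumerate(music_types)
}
grand_total = sum(col_totals.values())

# Column widths: the marginal distribution of music.
widths = {music: col_totals[music] / grand_total for music in music_types}

# Bar heights within each column: the conditional distribution
# of wine type for that type of music.
heights = {
    music: {wine: counts[wine][j] / col_totals[music] for wine in counts}
    for j, music in enumerate(music_types)
}

for music in music_types:
    shares = ", ".join(
        f"{wine} {100 * share:.0f}%" for wine, share in heights[music].items()
    )
    print(f"{music} music: width {100 * widths[music]:.1f}%; {shares}")
```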

110

Figure 2.23: Output from JMP for the wine and music data, Example 2.27.

Simpson’s paradox

As is the case with quantitative variables, the effects of lurking variables can change or even reverse relationships between two categorical variables. Here is an example that demonstrates the surprises that can await the unsuspecting user of data.

EXAMPLE 2.28

Which Customer Service Representative Is Better?

A customer service center has a goal of resolving customer questions in 10 minutes or less. Here are the records for two representatives:

Representative
Goal met Ashley Joshua
Yes 172 118
No 28 82
Total 200 200

Ashley has met the goal 172 times out of 200, a success rate of 86%. For Joshua, the success rate is 118 out of 200, or 59%. Ashley clearly has the better success rate.

111

Let’s look at the data in a little more detail. The data summarized come from two different weeks in the year.

EXAMPLE 2.29

Let’s Look at the Data More Carefully

Here are the counts broken down by week:

                 Week 1              Week 2
Goal met    Ashley   Joshua     Ashley   Joshua
Yes            162       19         10       99
No              18        1         10       81
Total          180       20         20      180

For Week 1, Ashley met the goal 90% of the time (162/180), while Joshua met the goal 95% of the time (19/20). Joshua had the better performance in Week 1. What about Week 2? Here, Ashley met the goal 50% of the time (10/20), while the success rate for Joshua was 55% (99/180). Joshua again had the better performance. How does this analysis compare with the analysis that combined the counts for the two weeks? That analysis clearly showed that Ashley had the better performance, 86% versus 59%.

These results can be explained by a lurking variable related to week. The first week was during a period when the product had been in use for several months. Most of the calls to the customer service center concerned problems that had been encountered before. The representatives were trained to answer these questions and usually had no trouble in meeting the goal of resolving the problems quickly. On the other hand, the second week occurred shortly after the release of a new version of the product. Most of the calls during this week concerned new problems that the representatives had not yet encountered. Many more of these questions took longer than the 10-minute goal to resolve.

Look at the totals in the bottom row of the detailed table. During the first week, when calls were easy to resolve, Ashley handled 180 calls and Joshua handled 20. The situation was exactly the opposite during the second week, when calls were difficult to resolve. There were 20 calls for Ashley and 180 for Joshua.

The original two-way table, which did not take account of week, was misleading. This example illustrates Simpson’s paradox.

Simpson’s Paradox

An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.

The lurking variables in Simpson’s paradox are categorical. That is, they break the cases into groups, as when calls are classified by week. Simpson’s paradox is just an extreme form of the fact that observed associations can be misleading when there are lurking variables.
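The reversal in Examples 2.28 and 2.29 can be verified directly. The following Python sketch, with a data layout and names of our own choosing, computes each representative's success rate within each week and then for the two weeks combined.

```python
# (times goal met, total calls) for each representative, by week,
# taken from Example 2.29.
calls = {
    "Week 1": {"Ashley": (162, 180), "Joshua": (19, 20)},
    "Week 2": {"Ashley": (10, 20), "Joshua": (99, 180)},
}


def success_rate(met, total):
    """Percent of calls for which the 10-minute goal was met."""
    return 100 * met / total


# Success rates within each week.
weekly_rates = {
    week: {rep: success_rate(*pair) for rep, pair in reps.items()}
    for week, reps in calls.items()
}

# Success rates after combining the two weeks.
combined_rates = {}
for rep in ("Ashley", "Joshua"):
    met = sum(calls[week][rep][0] for week in calls)
    total = sum(calls[week][rep][1] for week in calls)
    combined_rates[rep] = success_rate(met, total)

for week, rates in weekly_rates.items():
    print(week, {rep: round(rate, 1) for rep, rate in rates.items()})
print("Combined", {rep: round(rate, 1) for rep, rate in combined_rates.items()})
```

Joshua has the higher rate within each week, yet Ashley has the higher combined rate, because most of her calls came in the easy week and most of his came in the hard week.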

APPLY YOUR KNOWLEDGE

Question 2.100

Which hospital is safer?

Insurance companies and consumers are interested in the performance of hospitals. The government releases data about patient outcomes in hospitals that can be useful in making informed health care decisions. Here is a two-way table of data on the survival of patients after surgery in two hospitals. All patients undergoing surgery in a recent time period are included. “Survived” means that the patient lived at least six weeks following surgery.

112

Hospital A Hospital B
Died 63 16
Survived 2037 784
Total 2100 800

What percent of Hospital A patients died? What percent of Hospital B patients died? These are the numbers one might see reported in the media.

Question 2.101

Patients in “poor” or “good” condition

Not all surgery cases are equally serious, however. Patients are classified as being in either “poor” or “good” condition before surgery. Here are the data broken down by patient condition. Check that the entries in the original two-way table are just the sums of the “poor” and “good” entries in this pair of tables.

Good Condition
            Hospital A   Hospital B
Died                 6            8
Survived           594          592
Total              600          600

Poor Condition
            Hospital A   Hospital B
Died                57            8
Survived          1443          192
Total             1500          200

  1. Find the percent of Hospital A patients classified as “poor” before surgery who died. Do the same for Hospital B. In which hospital do “poor” patients fare better?
  2. Repeat part (a) for patients classified as “good” before surgery.
  3. What is your recommendation to someone facing surgery and choosing between these two hospitals?
  4. How can Hospital A do better in both groups, yet do worse overall? Look at the data and carefully explain how this can happen.

three-way table

aggregation

The data in Example 2.28 can be given in a three-way table that reports counts for each combination of three categorical variables: week, representative, and whether or not the goal was met. In Example 2.29, we constructed two two-way tables for representative by goal, one for each week. The original table, the one that we showed in Example 2.28, can be obtained by adding the corresponding counts for the two tables in Example 2.29. This process is called aggregating the data. When we aggregated data in Example 2.28, we ignored the variable week, which then became a lurking variable. Conclusions that seem obvious when we look only at aggregated data can become quite different when the data are examined in more detail.
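Aggregation can be sketched as summing a three-way table over one of its variables. The Python example below is an illustration under our own choice of data structure: a dictionary keyed by (week, representative, goal met). Summing over week recovers the two-way table of Example 2.28.

```python
from collections import defaultdict

# Three-way table of counts keyed by (week, representative, goal met),
# built from the two tables in Example 2.29.
three_way = {
    ("Week 1", "Ashley", "Yes"): 162,
    ("Week 1", "Ashley", "No"): 18,
    ("Week 1", "Joshua", "Yes"): 19,
    ("Week 1", "Joshua", "No"): 1,
    ("Week 2", "Ashley", "Yes"): 10,
    ("Week 2", "Ashley", "No"): 10,
    ("Week 2", "Joshua", "Yes"): 99,
    ("Week 2", "Joshua", "No"): 81,
}

# Aggregate over week: add the counts that share the same
# representative and the same goal outcome.
two_way = defaultdict(int)
for (week, rep, goal), count in three_way.items():
    two_way[(rep, goal)] += count

for (rep, goal), count in sorted(two_way.items()):
    print(f"{rep}, goal met = {goal}: {count}")
```

Once week has been summed out, nothing in `two_way` records which calls came from the easy week and which from the hard week. That lost information is exactly what makes week a lurking variable in the aggregated table.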



SECTION 2.5 Summary

113



SECTION 2.5 Exercises

For Exercise 2.91, see page 106; for 2.92 and 2.93, see pages 106–107; for 2.94 to 2.96, see page 108; for 2.97 to 2.99, see page 109; and for 2.100 and 2.101, see pages 111–112.

Question 2.102

Remote deposit capture

The Federal Reserve has called remote deposit capture (RDC) “the most important development the [U.S.] banking industry has seen in years.” This service allows users to scan checks and to transmit the scanned images to a bank for posting.16 In its annual survey of community banks, the American Bankers Association asked banks whether or not they offered this service.17 Here are the results classified by the asset size (in millions of dollars) of the bank:

                              Offer RDC
Asset size ($ in millions)    Yes     No
Under $100                     63    309
$101 to $200                   59    132
$201 or more                  112     85

Summarize the results of this survey question numerically and graphically. Write a short paragraph explaining the relationship between the size of a bank, measured by assets, and whether or not RDC is offered.

Question 2.103

How does RDC vary across the country?

The survey described in the previous exercise also classified community banks by region. Here is the 6 × 2 table of counts:18

Offer RDC
Region Yes No
Northeast 28 38
Southeast 57 61
Central 53 84
Midwest 63 181
Southwest 27 51
West 61 76

Summarize the results of this survey question numerically and graphically. Write a short paragraph explaining the relationship between the location of a bank, measured by region, and whether or not remote deposit capture is offered.

Question 2.104

Exercise and adequate sleep

A survey of 656 boys and girls, ages 13 to 18, asked about adequate sleep and other health-related behaviors. The recommended amount of sleep is six to eight hours per night.19 In the survey, 54% of the respondents reported that they got less than this amount of sleep on school nights. The researchers also developed an exercise scale that was used to classify the students as above or below the median in how much they exercised. Here is the 2 × 2 table of counts with students classified as getting or not getting adequate sleep and by the exercise variable:

114

Exercise
Enough sleep High Low
Yes 151 115
No 148 242
  1. Find the distribution of adequate sleep for the high exercisers.
  2. Do the same for the low exercisers.
  3. If you have the appropriate software, use a mosaic plot to illustrate the marginal distribution of exercise and your results in parts (a) and (b).
  4. Summarize the relationship between adequate sleep and exercise using the results of parts (a) and (b).

Question 2.105

Adequate sleep and exercise

Refer to the previous exercise.

  1. Find the distribution of exercise for those who get adequate sleep.
  2. Do the same for those who do not get adequate sleep.
  3. Write a short summary of the relationship between adequate sleep and exercise using the results of parts (a) and (b).
  4. Compare this summary with the summary that you obtained in part (d) of the previous exercise. Which do you prefer? Give a reason for your answer.

Question 2.106

Full-time and part-time college students

The Census Bureau provides estimates of numbers of people in the United States classified in various ways.20 Let’s look at college students. The following table gives us data to examine the relation between age and full-time or part-time status. The numbers in the table are expressed as thousands of U.S. college students.

Status
Age Full-time Part-time
15–19 3388 389
20–24 5238 1164
25–34 1703 1699
35 and over 762 2045
  1. Find the distribution of age for full-time students.
  2. Do the same for the part-time students.
  3. Use the summaries in parts (a) and (b) to describe the relationship between full- or part-time status and age. Write a brief summary of your conclusions.

Question 2.107

Condition on age

Refer to the previous exercise.

  1. For each age group, compute the percent of students who are full-time and the percent of students who are part-time.
  2. Make a graphical display of the results that you found in part (a).
  3. If you have the appropriate software, make a mosaic plot.
  4. In a short paragraph, describe the relationship between age and full- or part-time status using your numerical and graphical summaries.
  5. Explain why you need only the percents of students who are full-time for your summary in part (b).
  6. Compare this way of summarizing the relationship between these two variables with what you presented in part (c) of the previous exercise.

Question 2.108

Lying to a teacher

One of the questions in a survey of high school students asked about lying to teachers.21 The accompanying table gives the numbers of students who said that they lied to a teacher about something significant at least once during the past year, classified by gender.

Gender
Lied at least once Male Female
Yes 6067 5966
No 4145 5719
  1. Add the marginal totals to the table.
  2. Calculate appropriate percents to describe the results of this question.
  3. Summarize your findings in a short paragraph.

Question 2.109

Trust and honesty in the workplace

The students surveyed in the study described in the previous exercise were also asked whether they thought trust and honesty were essential in business and the workplace. Here are the counts classified by gender:

Gender
Trust and honesty are essential Male Female
Agree 9,097 10,935
Disagree 685 423

Answer the questions given in the previous exercise for this survey question.

115

Question 2.110

Class size and course level

College courses taught at lower levels often have larger class sizes. The following table gives the number of classes classified by course level and class size.22 For example, there were 202 first-year level courses with between one and nine students.

                                  Class size
Course level   1–9   10–19   20–29   30–39   40–49   50–99   100 or more
1              202     659     917     241      70      99           123
2              190     370     486     307      84     109           134
3              150     387     314     115      96     186            53
4              146     256     190      83      67      64            17

  1. Fill in the marginal totals in the table.
  2. Find the marginal distribution for the variable course level.
  3. Do the same for the variable class size.
  4. For each course level, find the conditional distribution of class size.
  5. Summarize your findings in a short paragraph.

Question 2.111

Hiring practices

A company has been accused of age discrimination in hiring for operator positions. Lawyers for both sides look at data on applicants for the past three years. They compare hiring rates for applicants younger than 40 years and those 40 years or older.

Age               Hired   Not hired
Younger than 40      82        1160
40 or older           2         168

  1. Find the two conditional distributions of hired/not hired—one for applicants who are less than 40 years old and one for applicants who are not less than 40 years old.
  2. Based on your calculations, make a graph to show the differences in distribution for the two age categories.
  3. Describe the company’s hiring record in words. Does the company appear to discriminate on the basis of age?
  4. What lurking variables might be involved here?

Question 2.112

Nonresponse in a survey of companies

A business school conducted a survey of companies in its state. It mailed a questionnaire to 200 small companies, 200 medium-sized companies, and 200 large companies. The rate of nonresponse is important in deciding how reliable survey results are. Here are the data on response to this survey:

Small Medium Large
Response 124 80 41
No response 76 120 159
Total 200 200 200
  1. What was the overall percent of nonresponse?
  2. Describe how nonresponse is related to the size of the business. (Use percents to make your statements precise.)
  3. Draw a bar graph to compare the nonresponse percents for the three size categories.

Question 2.113

Demographics and new products

Companies planning to introduce a new product to the market must define the “target” for the product. Who do we hope to attract with our new product? Age and gender are two of the most important demographic variables. The following two-way table describes the age and marital status of American women.23 The table entries are in thousands of women.

                               Marital status
Age (years)   Never married   Married   Widowed   Divorced
18 to 24             12,112     2,171        23        164
25 to 39              9,472    18,219       177      2,499
40 to 64              5,224    35,021     2,463      8,674
≥ 65                    984     9,688     8,699      2,412

  1. Find the sum of the entries for each column.
  2. Find the marginal distributions.
  3. Find the conditional distributions.
  4. If you have the appropriate software, make a mosaic plot.
  5. Write a short description of the relationship between marital status and age for women.

Question 2.114

2.114

Demographics, continued

  1. Using the data in the previous exercise, compare the conditional distributions of marital status for women aged 18 to 24 and women aged 40 to 64. Briefly describe the most important differences between the two groups of women, and back up your description with percents.
  2. Your company is planning a magazine aimed at women who have never been married. Find the conditional distribution of age among never-married women, and display it in a bar graph. What age group or groups should your magazine aim to attract?

Question 2.115

2.115

Demographics and new products—men

Refer to Exercises 2.113 and 2.114. Here are the corresponding counts for men:


                           Marital status
Age (years)   Never married   Married   Widowed   Divorced
18 to 24             13,509     1,245         6         63
25 to 39             12,685    16,029        78      1,790
40 to 64              6,869    34,650       760      6,647
≥ 65                    685    12,514     2,124      1,464

Answer the questions from Exercises 2.113 and 2.114 for these counts.

Question 2.116

2.116

Discrimination?

Wabash Tech has two professional schools, business and law. Here are two-way tables of applicants to both schools, categorized by gender and admission decision. (Although these data are made up, similar situations occur in reality.)

Business
         Admit   Deny
Male       480    120
Female     180     20

Law
         Admit   Deny
Male        10     90
Female     100    200
  1. Make a two-way table of gender by admission decision for the two professional schools together by summing entries in these tables.
  2. From the two-way table, calculate the percent of male applicants who are admitted and the percent of female applicants who are admitted. Wabash admits a higher percent of male applicants.
  3. Now compute separately the percents of male and female applicants admitted by the business school and by the law school. Each school admits a higher percent of female applicants.
  4. This is Simpson’s paradox: both schools admit a higher percent of the women who apply, but overall, Wabash admits a lower percent of female applicants than of male applicants. Explain carefully, as if speaking to a skeptical reporter, how it can happen that Wabash appears to favor males when each school individually favors females.
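The percents described in this exercise can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original exercise; the helper names `admit_percent` and `combine` are hypothetical, and the counts are copied from the two tables above.

```python
# Counts copied from the Business and Law tables above: (admitted, denied).
business = {"Male": (480, 120), "Female": (180, 20)}
law = {"Male": (10, 90), "Female": (100, 200)}


def admit_percent(admit, deny):
    """Percent admitted among all applicants in one row of a two-way table."""
    return 100 * admit / (admit + deny)


def combine(*tables):
    """Sum the (admit, deny) counts for each gender across several tables."""
    totals = {}
    for table in tables:
        for gender, (admit, deny) in table.items():
            a, d = totals.get(gender, (0, 0))
            totals[gender] = (a + admit, d + deny)
    return totals


overall = combine(business, law)

# Print the percent admitted for each school and for the combined table.
for name, table in [("Business", business), ("Law", law), ("Combined", overall)]:
    for gender, (admit, deny) in table.items():
        print(f"{name:9s} {gender:6s} {admit_percent(admit, deny):5.1f}% admitted")
```

Comparing the printed rows shows the reversal: the female percent is higher within each school, yet the male percent is higher in the combined table, because most men applied to the school that admits most of its applicants.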

Question 2.117

2.117

Obesity and health

Recent studies have shown that earlier reports underestimated the health risks associated with being overweight. The error was due to lurking variables. In particular, smoking tends both to reduce weight and to lead to earlier death. Illustrate Simpson’s paradox by a simplified version of this situation. That is, make up tables of overweight (yes or no) by early death (yes or no) by smoker (yes or no) such that

  • Overweight smokers and overweight nonsmokers both tend to die earlier than those not overweight.
  • But when smokers and nonsmokers are combined into a two-way table of overweight by early death, persons who are not overweight tend to die earlier.

Question 2.118

2.118

Find the table

Here are the row and column totals for a two-way table with two rows and two columns:

 a    b    60
 c    d    60
70   50   120

Find two different sets of counts a, b, c, and d for the body of the table that give these same totals. This shows that the relationship between two variables cannot be obtained from the two individual distributions of the variables.



CHAPTER 2 Review Exercises

Question 2.119

2.119

Companies of the world with logs

In Exercises 2.10 (page 72), 2.27 (page 78), and 2.58 (pages 95–96), you examined the relationship between the numbers of companies that are incorporated and are listed on their country’s stock exchange at the end of the year using data collected by the World Bank.24 In this exercise, you will explore the relationship between the numbers for 2012 and 2002 using logs.

  1. Which variable do you choose to be the explanatory variable, and which do you choose to be the response variable? Explain your answer.
  2. Plot the data with the least-squares regression line. Summarize the major features of your plot.
  3. Give the equation of the least-squares regression line.
  4. Find the predicted value and the residual for Sweden.
  5. Find the correlation between the two variables.
  6. Compare the results found in this exercise with those you found in Exercises 2.10, 2.27, and 2.58. Do you prefer the analysis with the original data or the analysis using logs? Give reasons for your answer.


Question 2.120

2.120

Residuals for companies of the world with logs

Refer to the previous exercise.

  1. Use a histogram to examine the distribution of the residuals.
  2. Make a Normal quantile plot of the residuals.
  3. Summarize the distribution of the residuals using the graphical displays that you created in parts (a) and (b).
  4. Repeat parts (a), (b), and (c) for the original data, and compare these results with those you found in parts (a), (b), and (c). Which do you prefer? Give reasons for your answer.

Question 2.121

2.121

Dwelling permits and sales for 21 European countries

The Organization for Economic Cooperation and Development (OECD) collects data on Main Economic Indicators (MEIs) for many countries. Each variable is recorded as an index, with the year 2000 serving as a base year. This means that the variable for each year is reported as a ratio of the value for the year divided by the value for 2000. Use of indices in this way makes it easier to compare values for different countries.25

  1. Make a scatterplot with sales as the response variable and permits issued for new dwellings as the explanatory variable. Describe the relationship. Are there any outliers or influential observations?
  2. Find the least-squares regression line and add it to your plot.
  3. What is the predicted value of sales for a country that has an index of 160 for dwelling permits?
  4. The Netherlands has an index of 160 for dwelling permits. Find the residual for this country.
  5. What percent of the variation in sales is explained by dwelling permits?

Question 2.122

2.122

Dwelling permits and production

Refer to the previous exercise.

  1. Make a scatterplot with production as the response variable and permits issued for new dwellings as the explanatory variable. Describe the relationship. Are there any outliers or influential observations?
  2. Find the least-squares regression line and add it to your plot.
  3. What is the predicted value of production for a country that has an index of 160 for dwelling permits?
  4. The Netherlands has an index of 160 for dwelling permits. Find the residual for this country.
  5. What percent of the variation in production is explained by dwelling permits? How does this value compare with the value you found in the previous exercise for the percent of variation in sales that is explained by dwelling permits?

Question 2.123

2.123

Sales and production

Refer to the previous two exercises.

  1. Make a scatterplot with sales as the response variable and production as the explanatory variable. Describe the relationship. Are there any outliers or influential observations?
  2. Find the least-squares regression line and add it to your plot.
  3. What is the predicted value of sales for a country that has an index of 125 for production?
  4. Finland has an index of 125 for production. Find the residual for this country.
  5. What percent of the variation in sales is explained by production? How does this value compare with the percents of variation that you calculated in the two previous exercises?

Question 2.124

2.124

Salaries and raises

For this exercise, we consider a hypothetical employee who starts working in Year 1 at a salary of $50,000. Each year her salary increases by approximately 5%. By Year 20, she is earning $126,000. The following table gives her salary for each year (in thousands of dollars):

Year  Salary    Year  Salary    Year  Salary    Year  Salary
   1      50       6      63      11      81      16     104
   2      53       7      67      12      85      17     109
   3      56       8      70      13      90      18     114
   4      58       9      74      14      93      19     120
   5      61      10      78      15      99      20     126
  1. Figure 2.24 is a scatterplot of salary versus year with the least-squares regression line. Describe the relationship between salary and year for this person.
  2. The value of r² for these data is 0.9832. What percent of the variation in salary is explained by year? Would you say that this is an indication of a strong linear relationship? Explain your answer.

FIGURE 2.24 Plot of salary versus year, with the least-squares regression line, for an individual who receives approximately a 5% raise each year for 20 years, Exercise 2.124.

Question 2.125

2.125

Look at the residuals

Refer to the previous exercise. Figure 2.25 is a plot of the residuals versus year.

  1. Interpret the residual plot.
  2. Explain how this plot highlights the deviations from the least-squares regression line that you can see in Figure 2.24.
FIGURE 2.25 Plot of residuals versus year for an individual who receives approximately a 5% raise each year for 20 years, Exercise 2.125.


Question 2.126

2.126

Try logs

Refer to the previous two exercises. Figure 2.26 is a scatterplot with the least-squares regression line for log salary versus year. For this model, r² = 0.9995.

  1. Compare this plot with Figure 2.24. Write a short summary of the similarities and the differences.
  2. Figure 2.27 is a plot of the residuals for the model using year to predict log salary. Compare this plot with Figure 2.25 and summarize your findings.

FIGURE 2.26 Plot of log salary versus year, with the least-squares regression line, for an individual who receives approximately a 5% raise each year for 20 years, Exercise 2.126.

FIGURE 2.27 Plot of residuals, based on log salary, versus year for an individual who receives approximately a 5% raise each year for 20 years, Exercise 2.126.
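To see why a log scale suits a salary that grows by a steady percent, here is a minimal Python sketch. It assumes a hypothetical salary path with exactly 5% growth each year starting from 50 (thousand dollars); the table in Exercise 2.124 grows only approximately 5% per year, so its values differ slightly from these.

```python
import math

# Hypothetical salary path: exactly 5% growth per year, starting at 50
# (thousand dollars). This is an assumption, not the table from Exercise 2.124.
START, GROWTH, YEARS = 50.0, 0.05, 20
salaries = [START * (1 + GROWTH) ** (year - 1) for year in range(1, YEARS + 1)]

# Year-to-year changes in salary keep growing, which gives a curved pattern ...
raw_steps = [b - a for a, b in zip(salaries, salaries[1:])]
# ... but year-to-year changes in log salary are constant, which gives a line.
log_steps = [math.log(b) - math.log(a) for a, b in zip(salaries, salaries[1:])]

print(f"first and last raw step: {raw_steps[0]:.2f}, {raw_steps[-1]:.2f}")
print(f"first and last log step: {log_steps[0]:.4f}, {log_steps[-1]:.4f}")
```

Under this assumption every step in log salary equals log(1.05), so log salary is exactly linear in year, while the steps in salary itself increase over time.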

Question 2.127

2.127

Predict some salaries

The individual whose salary we have been studying in Exercises 2.124 through 2.126 wants to do some financial planning. Specifically, she would like to predict her salary five years into the future, that is, for Year 25. She is willing to assume that her employment situation will be stable for the next five years and that it will be similar to the last 20 years.

  1. Use the least-squares regression equation constructed to predict salary from year to predict her salary for Year 25.
  2. Use the least-squares regression equation constructed to predict log salary from year to predict her salary for Year 25. Note that you will need to convert the predicted log salary back to the predicted salary. Many calculators have a function that will perform this operation.
  3. Which prediction do you prefer? Explain your answer.
  4. Someone looking at the numerical summaries, and not the plots, for these analyses says that because both models have very high values of r², they should perform equally well in doing this prediction. Write a response to this comment.
  5. Write a short paragraph about the value of graphical summaries and the problems of extrapolation using what you have learned from studying these salary data.


Question 2.128

2.128

Faculty salaries

Data on the salaries of a sample of professors in a business department at a large university are given below. The salaries are for the academic years 2014–2015 and 2015–2016.

2014–2015    2015–2016    2014–2015    2015–2016
salary ($)   salary ($)   salary ($)   salary ($)
   145,700      147,700      136,650      138,650
   112,700      114,660      132,160      134,150
   109,200      111,400       74,290       76,590
    98,800      101,900       74,500       77,000
   112,000      113,000       83,000       85,400
   111,790      113,800      141,850      143,830
   103,500      105,700      122,500      124,510
   149,000      150,900      115,100      117,100
  1. Construct a scatterplot with the 2015–2016 salaries on the vertical axis and the 2014–2015 salaries on the horizontal axis.
  2. Comment on the form, direction, and strength of the relationship in your scatterplot.
  3. What proportion of the variation in 2015–2016 salaries is explained by 2014–2015 salaries?

Question 2.129

2.129

Find the line and examine the residuals

Refer to the previous exercise.

  1. Find the least-squares regression line for predicting 2015–2016 salaries from 2014–2015 salaries.
  2. Analyze the residuals, paying attention to any outliers or influential observations. Write a summary of your findings.

Question 2.130

2.130

Bigger raises for those earning less

Refer to the previous two exercises. The 2014–2015 salaries do an excellent job of predicting the 2015–2016 salaries. Is there anything more that we can learn from these data? In this department, there is a tradition of giving higher-than-average percent raises to those whose salaries are lower. Let’s see if we can find evidence to support this idea in the data.


  1. Compute the percent raise for each faculty member. Take the difference between the 2015–2016 salary and the 2014–2015 salary, divide by the 2014–2015 salary, and then multiply by 100. Make a scatterplot with the raise as the response variable and the 2014–2015 salary as the explanatory variable. Describe the relationship that you see in your plot.
  2. Find the least-squares regression line and add it to your plot.
  3. Analyze the residuals. Are there any outliers or influential cases? Make a graphical display and include it in a short summary of what you conclude.
  4. Is there evidence in the data to support the idea that greater percentage raises are given to those with lower salaries? Summarize your findings and include numerical and graphical summaries to support your conclusion.
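The percent raise calculation described in part 1 can be written as a short function. This is a minimal Python sketch; the function name `percent_raise` is hypothetical, and only two salary pairs from the table in Exercise 2.128 are used for illustration.

```python
def percent_raise(old_salary, new_salary):
    """Percent raise: (new - old) / old * 100, as described in part 1."""
    return (new_salary - old_salary) / old_salary * 100


# Two (2014-2015, 2015-2016) salary pairs copied from the table in Exercise 2.128.
pairs = [(145_700, 147_700), (74_290, 76_590)]

for old, new in pairs:
    print(f"{old:>7,} -> {new:>7,}: {percent_raise(old, new):.2f}% raise")
```

Applying the same function to all 16 faculty members gives the response variable for the scatterplot in part 1.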

Question 2.131

2.131

Marketing your college

Colleges compete for students, and many students do careful research when choosing a college. One source of information is the rankings compiled by U.S. News & World Report. One of the factors used to evaluate undergraduate programs is the proportion of incoming students who graduate. This quantity, called the graduation rate, can be predicted by other variables such as the SAT or ACT scores and the high school records of the incoming students. One of the components in U.S. News & World Report rankings is the difference between the actual graduation rate and the rate predicted by a regression equation.26 In this chapter, we call this quantity the residual. Explain why the residual is a better measure to evaluate college graduation rates than the raw graduation rate.

Question 2.132

2.132

Planning for a new product

The editor of a statistics text would like to plan for the next edition. A key variable is the number of pages that will be in the final version. Text files are prepared by the authors using a word processor called LaTeX, and separate files contain figures and tables. For the previous edition of the text, the number of pages in the LaTeX files can easily be determined, as well as the number of pages in the final version of the text. Here are the data:

Chapter        1   2   3   4   5   6   7   8   9  10  11  12  13
LaTeX pages   77  73  59  80  45  66  81  45  47  43  31  46  26
Text pages    99  89  61  82  47  68  87  45  53  50  36  52  19
  1. Plot the data and describe the overall pattern.
  2. Find the equation of the least-squares regression line, and add the line to your plot.
  3. Find the predicted number of pages for the next edition if the number of LaTeX pages for a chapter is 62.
  4. Write a short report for the editor explaining to her how you constructed the regression equation and how she could use it to estimate the number of pages in the next edition of the text.

Question 2.133

2.133

Points scored in women’s basketball games

Use the Internet to find the scores for the past season’s women’s basketball team at a college of your choice. Is there a relationship between the points scored by your chosen team and the points scored by their opponents? Summarize the data and write a report on your findings.

Question 2.134

2.134

Look at the data for men

Refer to the previous exercise. Analyze the data for the men’s team from the same college, and compare your results with those for the women.

Question 2.135

2.135

Circular saws

The following table gives the weight (in pounds) and amps for 19 circular saws. Saws with higher amp ratings tend to also be heavier than saws with lower amp ratings. We can quantify this fact using regression.

Weight  Amps    Weight  Amps    Weight  Amps
    11    15         9    10        11    13
    12    15        11    15        13    14
    11    15        12    15        10    12
    11    15        12    14        11    12
    12    15        10    10        11    12
    11    15        12    13        10    12
    13    15
  1. We will use amps as the explanatory variable and weight as the response variable. Give a reason for this choice.
  2. Make a scatterplot of the data. What do you notice about the weight and amp values?
  3. Report the equation of the least-squares regression line along with the value of r².
  4. Interpret the value of the estimated slope.
  5. How much of an increase in amps would you expect to correspond to a one-pound increase in the weight of a saw, on average, when comparing two saws?
  6. Create a residual plot for the model in part (c). Does the plot indicate curvature in the data?


Question 2.136

2.136

Circular saws

The table in the previous exercise gives the weight (in pounds) and amps for 19 circular saws. The data contain only five different amp ratings among the 19 saws.

  1. Calculate the correlation between the weights and the amps of the 19 saws.
  2. Calculate the average weight of the saws for each of the five amp ratings.
  3. Calculate the correlation between the average weights and the amps. Is the correlation between average weights and amps greater than, less than, or equal to the correlation between individual weights and amps?

Question 2.137

2.137

What correlation does and doesn’t say

Construct a set of data with two variables that have different means and correlation equal to one. Use your example to illustrate what correlation does and doesn’t say.

Question 2.138

2.138

Simpson’s paradox and regression

Simpson’s paradox occurs when a relationship between variables within groups of observations reverses when all of the data are combined. The phenomenon is usually discussed in terms of categorical variables, but it also occurs in other settings. Here is an example:

   y    x  Group       y    x  Group
10.1    1      1    18.3    6      2
 8.9    2      1    17.1    7      2
 8.0    3      1    16.2    8      2
 6.9    4      1    15.1    9      2
 6.1    5      1    14.3   10      2
  1. Make a scatterplot of the data for Group 1. Find the least-squares regression line and add it to your plot. Describe the relationship between y and x for Group 1.
  2. Do the same for Group 2.
  3. Make a scatterplot using all 10 observations. Find the least-squares line and add it to your plot.
  4. Make a plot with all of the data using different symbols for the two groups. Include the three regression lines on the plot. Write a paragraph about Simpson’s paradox for regression using this graphical display to illustrate your description.
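The reversal can also be checked numerically before plotting. The following is a minimal Python sketch that computes least-squares slopes from the usual sums of products of deviations; the function name `least_squares` is hypothetical, and the data are copied from the table above.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line for predicting y from x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Data copied from the table above.
x_1, y_1 = [1, 2, 3, 4, 5], [10.1, 8.9, 8.0, 6.9, 6.1]
x_2, y_2 = [6, 7, 8, 9, 10], [18.3, 17.1, 16.2, 15.1, 14.3]

intercept_1, slope_1 = least_squares(x_1, y_1)
intercept_2, slope_2 = least_squares(x_2, y_2)
intercept_all, slope_all = least_squares(x_1 + x_2, y_1 + y_2)

print(f"Group 1 slope:  {slope_1:.2f}")
print(f"Group 2 slope:  {slope_2:.2f}")
print(f"Combined slope: {slope_all:.2f}")
```

The slope is negative within each group and positive when the groups are combined, which is the pattern the plots in parts 1 through 4 display.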

Question 2.139

2.139

Wood products

A wood product manufacturer is interested in replacing solid-wood building material by less-expensive products made from wood flakes.27 The company collected the following data to examine the relationship between the length (in inches) and the strength (in pounds per square inch) of beams made from wood flakes:

Length      5    6    7    8    9   10   11   12   13   14
Strength  446  371  334  296  249  254  244  246  239  234
  1. Make a scatterplot that shows how the length of a beam affects its strength.
  2. Describe the overall pattern of the plot. Are there any outliers?
  3. Fit a least-squares line to the entire set of data. Graph the line on your scatterplot. Does a straight line adequately describe these data?
  4. The scatterplot suggests that the relation between length and strength can be described by two straight lines, one for lengths of 5 to 9 inches and another for lengths of 9 to 14 inches. Fit least-squares lines to these two subsets of the data, and draw the lines on your plot. Do they describe the data adequately? What question would you now ask the wood experts?

Question 2.140

2.140

Aspirin and heart attacks

Does taking aspirin regularly help prevent heart attacks? “Nearly five decades of research now link aspirin to the prevention of stroke and heart attacks.” So says the Bayer Aspirin website, bayeraspirin.com. The most important evidence for this claim comes from the Physicians’ Health Study. The subjects were 22,071 healthy male doctors at least 40 years old. Half the subjects, chosen at random, took aspirin every other day. The other half took a placebo, a dummy pill that looked and tasted like aspirin. Here are the results.28 (The row for “None of these” is left out of the two-way table.)

                      Aspirin group   Placebo group
Fatal heart attacks              10              26
Other heart attacks             129             213
Strokes                         119              98
Total                        11,037          11,034

What do the data show about the association between taking aspirin and heart attacks and stroke? Use percents to make your statements precise. Include a mosaic plot if you have access to the needed software. Do you think the study provides evidence that aspirin actually reduces heart attacks (cause and effect)?


Question 2.141

2.141

More smokers live at least 20 more years!

You can see the headlines “More smokers than nonsmokers live at least 20 more years after being contacted for study!” A medical study contacted randomly chosen people in a district in England. Here are data on the 1314 women contacted who were either current smokers or who had never smoked. The tables classify these women by their smoking status and age at the time of the survey and whether they were still alive 20 years later.29

          Age 18 to 44      Age 45 to 64      Age 65+
          Smoker    Not     Smoker    Not     Smoker    Not
Dead          19     13         78     52         42    165
Alive        269    327        167    147          7     28
  1. From these data, make a two-way table of smoking (yes or no) by dead or alive. What percent of the smokers stayed alive for 20 years? What percent of the nonsmokers survived? It seems surprising that a higher percent of smokers stayed alive.
  2. The age of the women at the time of the study is a lurking variable. Show that within each of the three age groups in the data, a higher percent of nonsmokers remained alive 20 years later. This is another example of Simpson’s paradox.
  3. The study authors give this explanation: “Few of the older women (over 65 at the original survey) were smokers, but many of them had died by the time of follow-up.” Compare the percent of smokers in the three age groups to verify the explanation.

Question 2.142

2.142

Recycled product quality

Recycling is supposed to save resources. Some people think recycled products are lower in quality than other products, a fact that makes recycling less practical. People who actually use a recycled product may have different opinions from those who don’t use it. Here are data on attitudes toward coffee filters made of recycled paper among people who do and don’t buy these filters:30

            Think the quality of the recycled product is:
            Higher   The same   Lower
Buyers          20          7       9
Nonbuyers       29         25      43
  1. Find the marginal distribution of opinion about quality. Assuming that these people represent all users of coffee filters, what does this distribution tell us?
  2. How do the opinions of buyers and nonbuyers differ? Use conditional distributions as a basis for your answer. Include a mosaic plot if you have access to the needed software. Can you conclude that using recycled filters causes more favorable opinions? If so, giving away samples might increase sales.


2.1 Scatterplots

CASE 2.1

Education Expenditures and Population: Benchmarking

We expect that states with larger populations would spend more on education than states with smaller populations.1 What is the nature of this relationship? Can we use this relationship to evaluate whether some states are spending more than we expect or less than we expect? This type of exercise is called benchmarking. The basic idea is to compare processes or procedures of an organization with those of similar organizations.

benchmarking

The data file EDSPEND gives

  • the state name
  • state spending on education ($ billion)
  • local government spending on education ($ billion)
  • spending (total of state and local) on education ($ billion)
  • gross state product ($ billion)
  • growth in gross state product (percent)
  • population (million)

for each of the 50 states in the United States.

APPLY YOUR KNOWLEDGE

Question 2.3

2.3

Classify the variables

Use the EDSPEND data set for this exercise. Classify each variable as categorical or quantitative. Is there a label variable in the data set? If there is, identify it.

Question 2.4

2.4

Describe the variables

Refer to the previous exercise.

  1. Use graphical and numerical summaries to describe the distribution of spending.
  2. Do the same for population.
  3. Write a short paragraph summarizing your work in parts (a) and (b).

The most common way to display the relation between two quantitative variables is a scatterplot.


FIGURE 2.1

Scatterplot of spending on education (in billions of dollars) versus population (in millions), Example 2.3.


EXAMPLE 2.3

Spending and population

CASE 2.1 A state with a larger number of people needs to spend more money on education. Therefore, we think of population as an explanatory variable and spending on education as a response variable. We begin our study of this relationship with a graphical display of the two variables.

Figure 2.1 is a scatterplot that displays the relationship between the response variable, spending, and the explanatory variable, population. The data appear to cluster around a line with relatively small variation about this pattern. The relationship is positive: states with larger populations generally spend more on education than states with smaller populations. There are three or four states that are somewhat extreme in both population and spending on education, but their values still appear to be consistent with the overall pattern.

Scatterplot

A scatterplot shows the relationship between two quantitative variables measured on the same cases. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each case in the data appears as the point in the plot fixed by the values of both variables for that case.

Always plot the explanatory variable, if there is one, on the horizontal axis (the x axis) of a scatterplot. As a reminder, we usually call the explanatory variable x and the response variable y. If there is no explanatory–response distinction, either variable can go on the horizontal axis. The time plots in Section 1.2 (page 19) are special scatterplots where the explanatory variable x is a measure of time.

APPLY YOUR KNOWLEDGE

Question 2.5

2.5

Make a scatterplot

  1. Make a scatterplot similar to Figure 2.1 for the education spending data.
  2. Label the four points with high population and high spending with the names of these states.


Question 2.6

2.6

Change the units

  1. Create a spreadsheet with the education spending data with education spending expressed in millions of dollars and population in thousands. In other words, multiply education spending by 1000 and multiply population by 1000.
  2. Make a scatterplot for the data coded in this way.
  3. Describe how this scatterplot differs from Figure 2.1.

Interpreting scatterplots

To interpret a scatterplot, apply the strategies of data analysis learned in Chapter 1.

REMINDER

examining a distribution, p. 18

Examining a Scatterplot

In any graph of data, look for the overall pattern and for striking deviations from that pattern.

You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship.

An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.

The scatterplot in Figure 2.1 shows a clear form: the data lie in a roughly straight-line, or linear, pattern. To help us see this linear relationship, we can use software to put a straight line through the data. (We will show how this is done in Section 2.3.)

linear relationship

EXAMPLE 2.4

Scatterplot with a Straight Line

CASE 2.1 Figure 2.2 plots the education spending data along with a fitted straight line. This plot confirms our initial impression about these data. The overall pattern is approximately linear and there are a few states with relatively high values for both variables.


FIGURE 2.2

Scatterplot of spending on education (in billions of dollars) versus population (in millions) with a fitted straight line, Example 2.4.

The relationship in Figure 2.2 also has a clear direction: states with larger populations generally spend more on education than states with smaller populations. This is a positive association between the two variables.


Positive Association, Negative Association

Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.

Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.
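These definitions can be turned into a rough tally. The following is a minimal Python sketch, assuming small made-up data sets; the function name `association_direction` is hypothetical, and the tally is only an informal illustration of the definitions, not a formal measure of association such as the correlation in Section 2.2.

```python
def association_direction(xs, ys):
    """Tally cases by whether x and y fall on the same side of their means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    same_side = sum(1 for p in products if p > 0)      # both above or both below average
    opposite_side = sum(1 for p in products if p < 0)  # one above, the other below
    if same_side > opposite_side:
        return "positive"
    if opposite_side > same_side:
        return "negative"
    return "unclear"


# Made-up illustrative data (not from the EDSPEND file).
population = [1, 2, 4, 6, 10]
spending = [2, 3, 7, 9, 17]
price = [10, 12, 15, 18, 20]
units_sold = [95, 80, 70, 50, 45]

print(association_direction(population, spending))
print(association_direction(price, units_sold))
```

In the first made-up data set, above-average populations accompany above-average spending, so the tally reports a positive association; in the second, above-average prices accompany below-average sales, so it reports a negative one.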

The strength of a relationship in a scatterplot is determined by how closely the points follow a clear form. The relationship in Figure 2.1 is fairly strong.

Software is a powerful tool that can help us to see the pattern in a set of data. Many statistical packages have procedures for fitting smooth curves to data measured on a pair of quantitative variables. Here is an example.

EXAMPLE 2.5

Smooth Relationship for Education Spending

Figure 2.3 is a scatterplot of education spending versus population for the 50 states in the United States with a smooth curve generated by software. The smooth curve follows the data very closely and is somewhat bumpy. We can adjust the extent to which the relationship is smoothed by changing the smoothing parameter. Figure 2.4 shows the result of using more smoothing. Here we see that the smooth curve is very close to the line plotted in Figure 2.2. In this way, we have confirmed our view that we can summarize this relationship with a line.

smoothing parameter


FIGURE 2.3

Scatterplot of spending on education (in billions of dollars) versus population (in millions) with a smooth curve, Example 2.5. This smooth curve fits the data too well and does not provide a good summary of the relationship.

FIGURE 2.4

Scatterplot of spending on education (in billions of dollars) versus population (in millions) with a better smooth curve, Example 2.5. This smooth curve fits the data well and provides a good summary of the relationship. It shows that the relationship is approximately linear.
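The role of a smoothing parameter in Example 2.5 can be illustrated with a very simple smoother. The following is a minimal Python sketch that uses a centered running mean, where the window width plays the role of the smoothing parameter. This is a stand-in technique chosen for simplicity and is not necessarily the method used by the software that produced Figures 2.3 and 2.4; the function names and the data are made up for illustration.

```python
def running_mean_smooth(xs, ys, window):
    """Smooth y against x with a centered running mean over `window` neighbors."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    sorted_x = [xs[i] for i in order]
    sorted_y = [ys[i] for i in order]
    half = window // 2
    smoothed = []
    for i in range(len(sorted_y)):
        lo = max(0, i - half)                  # window is cut short at the ends
        hi = min(len(sorted_y), i + half + 1)
        smoothed.append(sum(sorted_y[lo:hi]) / (hi - lo))
    return sorted_x, smoothed


def roughness(values):
    """Sum of absolute second differences: larger means a bumpier curve."""
    return sum(abs(a - 2 * b + c) for a, b, c in zip(values, values[1:], values[2:]))


# Made-up data: a roughly linear trend with some wiggle.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.0, 4.5, 5.0, 8.5, 9.0, 12.5, 13.0, 16.5, 17.0, 20.5]

_, bumpy = running_mean_smooth(x, y, window=1)     # no smoothing: follows every point
_, smoother = running_mean_smooth(x, y, window=5)  # more smoothing: closer to a line

print(f"roughness with window=1: {roughness(bumpy):.2f}")
print(f"roughness with window=5: {roughness(smoother):.2f}")
```

A small window follows the data too closely, much like the curve in Figure 2.3, while a larger window gives a curve that summarizes the overall pattern.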

The log transformation

In many business and economic studies, we deal with quantitative variables that take only positive values and are skewed toward high values. In Exercise 2.4 (page 65), you observed this situation for spending and population size in our education spending data set. One way to make skewed distributions more Normal looking is to transform the data in some way.

log transformation

The most important transformation that we will use is the log transformation. This transformation can be used only for variables that have positive values. Occasionally, we use it when there are zeros, but, in this case, we first replace the zero values by some small value, often one-half of the smallest positive value in the data set.

You have probably encountered logarithms in one of your high school mathematics courses as a way to do certain kinds of arithmetic. Usually, these are base 10 logarithms. Logarithms are a lot more fun when used in statistical analyses. For our statistical applications, we will use natural logarithms. Statistical software and statistical calculators generally provide easy ways to perform this transformation.
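The transformation, including the replacement of zeros described above, can be sketched in a few lines. This is a minimal Python sketch using natural logarithms; the function name `log_transform` and the sample values are hypothetical.

```python
import math


def log_transform(values):
    """Natural-log transform; zeros are first replaced by half the smallest positive value."""
    if any(v < 0 for v in values):
        raise ValueError("the log transformation requires nonnegative values")
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("at least one positive value is needed")
    replacement = min(positives) / 2  # stand-in for zero values
    return [math.log(v if v > 0 else replacement) for v in values]


# Made-up values, one of which is zero.
sample = [0, 2, 8, 40]
print([round(v, 3) for v in log_transform(sample)])
```

Here the zero is replaced by 1 (half of the smallest positive value, 2) before logs are taken, so every case keeps a finite transformed value.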

APPLY YOUR KNOWLEDGE

Question 2.7

2.7

Transform education spending and population

Refer to Exercise 2.4 (page 65). Transform the education spending and population variables using logs, and describe the distributions of the transformed variables. Compare these distributions with those described in Exercise 2.4.

In this chapter, we are concerned with relationships between pairs of quantitative variables. There is no requirement that either or both of these variables should be Normal. However, let’s examine the effect of the transformations on the relationship between education spending and population.

EXAMPLE 2.6

Education Spending and Population with Logarithms

Figure 2.5 is a scatterplot of the log of education spending versus the log of population for the 50 states in the United States. The line on the plot fits the data well, and we conclude that the relationship is linear in the transformed variables.

FIGURE 2.5

Scatterplot of log spending on education versus log population with a fitted straight line, Example 2.6.

Notice how the data are more evenly spread throughout the range of the possible values. The three or four high values no longer appear to be extreme. We now see them as the high end of a distribution.

In Example 2.6, the transformations of the two quantitative variables maintained the linearity of the relationship. Sometimes we transform one of the variables to change a nonlinear relationship into a linear one.
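Here is a small sketch of a case where transforming one variable turns a curved relationship into a straight line. The response below is hypothetical: it shrinks by the same percentage at each time step. A relationship is linear when equal steps in x produce equal steps in y, so the code simply compares the step sizes before and after taking logs.

```python
import math

# Hypothetical response that shrinks by the same percentage each time step.
t = list(range(9))
y = [500 * math.exp(-0.25 * ti) for ti in t]
log_y = [math.log(v) for v in y]


def steps(values):
    """Change in the response for each one-unit step in t."""
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]


def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


raw_steps = steps(y)        # steps shrink over time: the plot of y against t is curved
log_steps = steps(log_y)    # steps are essentially constant: log y against t is a line

print(spread(raw_steps), spread(log_steps))
```

The steps in y vary a great deal, while the steps in log y are the same up to rounding error. A scatterplot of log y against t would therefore show a straight line.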

The interpretation of scatterplots, including knowing to use transformations, is an art that requires judgment and knowledge about the variables that we are studying. Always ask yourself if the relationship that you see makes sense. If it does not, then additional analyses are needed to understand the data.

Many statistical procedures work very well with data that are Normal and relationships that are linear. However, there is no requirement that we must have Normal data and linear relationships for everything that we do. In fact, with advances in statistical software, we now have many statistical techniques that work well in a wide range of settings. See Chapters 16 and 17 for examples.

Adding categorical variables to scatterplots

In Example 1.28 (page 38), we examined the fuel efficiency, measured as miles per gallon (MPG) for highway driving, for 1067 vehicles for the model year 2014. The data file (CANFUEL) that we used there also gives carbon dioxide (CO2) emissions and several other variables related to the type of vehicle. One of these is the type of fuel used. Four types are given: diesel, ethanol, regular gasoline, and premium gasoline.

Although much of our focus in this chapter is on linear relationships, many interesting relationships are more complicated. Our fuel efficiency data provide us with an example.

EXAMPLE 2.7

Fuel Efficiency and CO2 Emissions

Let’s look at the relationship between highway MPG and CO2 emissions, two quantitative variables, while also taking into account the type of fuel, a categorical variable. The JMP statistical software was used to produce the plot in Figure 2.6. We see that there is a negative relationship between the two quantitative variables. Better (higher) MPG is associated with lower CO2 emissions. The relationship is curved, however, not linear.

FIGURE 2.6

Scatterplot of CO2 emissions versus highway MPG for 1067 vehicles for the model year 2014 using JMP software. Colors correspond to the type of fuel used: blue for diesel, red for ethanol, green for regular gasoline, and purple for premium gasoline, Example 2.7.

The legend on the right side of the figure identifies the colors used to plot the four types of fuel, our categorical variable. The vehicles that use regular gasoline (green) and premium gasoline (purple) appear to be mixed together. The diesel-burning vehicles (blue) are close to the gasoline-burning vehicles, but they tend to have higher values for both MPG and emissions. On the other hand, the vehicles that burn ethanol (red) are clearly separated from the other vehicles.

Careful judgment is needed in applying this graphical method. Don’t be discouraged if your first attempt is not very successful. To discover interesting things in your data, you will often produce several plots before you find the one that is most effective in describing the data.2


SECTION 2.1 Summary


SECTION 2.1 Exercises

For Exercises 2.1 and 2.2, see pages 64–65; for 2.3 and 2.4, see page 65; for 2.5 and 2.6, see pages 66–67; and for 2.7, see page 68.

Question 2.8

2.8

What’s wrong?

Explain what is wrong with each of the following:

  1. If two variables are negatively associated, then low values of one variable are associated with low values of the other variable.
  2. A stemplot can be used to examine the relationship between two variables.
  3. In a scatterplot, we put the response variable on the x axis and the explanatory variable on the y axis.

Question 2.9

2.9

Make some sketches

For each of the following situations, make a scatterplot that illustrates the given relationship between two variables.

  1. No apparent relationship.
  2. A weak negative linear relationship.
  3. A strong positive relationship that is not linear.
  4. A more complicated relationship. Explain the relationship.

Question 2.10

2.10

Companies of the world

In Exercise 1.118 (page 61), you examined data collected by the World Bank on the numbers of companies that are incorporated and are listed in their country’s stock exchange at the end of the year for 2012. In Exercise 1.119, you did the same for the year 2002.3 In this exercise, you will examine the relationship between the numbers for these two years.

  1. Which variable would you choose as the explanatory variable, and which would you choose as the response variable? Give reasons for your answers.
  2. Make a scatterplot of the data.
  3. Describe the form, the direction, and the strength of the relationship.
  4. Are there any outliers? If yes, identify them by name.

Question 2.11

2.11

Companies of the world

Refer to the previous exercise. Using the questions there as a guide, describe the relationship between the numbers for 2012 and 1992. Do you expect this relationship to be stronger or weaker than the one you described in the previous exercise? Give a reason for your answer.

Question 2.12

2.12

Brand-to-brand variation in a product

Beer100.com advertises itself as “Your Place for All Things Beer.” One of their “things” is a list of 175 domestic beer brands with the percent alcohol, calories per 12 ounces, and carbohydrates (in grams).4 In Exercises 1.56 through 1.58 (page 36), you examined the distribution of alcohol content and the distribution of calories for these beers.

  1. Give a brief summary of what you learned about these variables in those exercises. (If you did not do them when you studied Chapter 1, do them now.)
  2. Make a scatterplot of calories versus percent alcohol.
  3. Describe the form, direction, and strength of the relationship.
  4. Are there any outliers? If yes, identify them by name.

Question 2.13

2.13

More beer

Refer to the previous exercise. Repeat the exercise for the relationship between carbohydrates and percent alcohol. Be sure to include summaries of the distributions of the two variables you are studying.

Question 2.14

2.14

Marketing in Canada

Many consumer items are marketed to particular age groups in a population. To plan such marketing strategies, it is helpful to know the demographic profile for different areas. Statistics Canada provides a great deal of demographic data organized in different ways.5

  1. Make a scatterplot of the percent of the population over 65 versus the percent of the population under 15.
  2. Describe the form, direction, and strength of the relationship.

Question 2.15

2.15

Compare the provinces with the territories

Refer to the previous exercise. The three Canadian territories are the Northwest Territories, Nunavut, and the Yukon Territories. All of the other entries in the data set are provinces.

  1. Generate a scatterplot of the Canadian demographic data similar to the one that you made in the previous exercise but with the points labeled “P” for provinces and “T” for territories (or some other way if that is easier to do with your software).
  2. Use your new scatterplot to write a new summary of the demographics for the 13 Canadian provinces and territories.

Question 2.16

2.16

Sales and time spent on web pages

You have collected data on 1000 customers who visited the web pages of your company last week. For each customer, you recorded the time spent on your pages and the total amount of their purchases during the visit. You want to explore the relationship between these two variables.

  1. What is the explanatory variable? What is the response variable? Explain your answers.
  2. Are these variables categorical or quantitative?
  3. Do you expect a positive or negative association between these variables? Why?
  4. How strong do you expect the relationship to be? Give reasons for your answer.

Question 2.17

2.17

A product for lab experiments

Barium-137m is a radioactive form of the element barium that decays very rapidly. It is easy and safe to use for lab experiments in schools and colleges.6 In a typical experiment, the radioactivity of a sample of barium-137m is measured for one minute. It is then measured for three additional one-minute periods, separated by two minutes. So data are recorded at one, three, five, and seven minutes after the start of the first counting period. The measurement units are counts. Here are the data for one of these experiments:7

Time 1 3 5 7
Count 578 317 203 118
  1. Make a scatterplot of the data. Give reasons for the choice of which variables to use on the x and y axes.
  2. Describe the overall pattern in the scatterplot.
  3. Describe the form, direction, and strength of the relationship.
  4. Identify any outliers.
  5. Is the relationship approximately linear? Explain your answer.

Question 2.18

2.18

Use a log for the radioactive decay

Refer to the previous exercise. Transform the counts using a log transformation. Then repeat parts (a) through (e) for the transformed data, and compare your results with those from the previous exercise.

Question 2.19

2.19

Time to start a business

Case 1.2 (page 23) uses the World Bank data on the time required to start a business in different countries. For Example 1.21 and several other examples that follow we used data for a subset of the countries for 2013. Data are also available for times to start in 2008. Let’s look at the data for all 189 countries to examine the relationship between the times to start in 2013 and the times to start in 2008.

  1. Why should you use the time for 2008 as the explanatory variable and the time for 2013 as the response variable?
  2. Make a scatterplot of the two variables.
  3. How many points are in your plot? Explain why there are not 189 points.
  4. Describe the form, direction, and strength of the relationship.
  5. Identify any outliers.
  6. Is the relationship approximately linear? Explain your answer.

Question 2.20

2.20

Use 2003 to predict 2013

Refer to the previous exercise. The data set also has times for 2003. Use the 2003 times as the explanatory variable and the 2013 times as the response variable.

  1. Answer the questions in the previous exercise for this setting.
  2. Compare the strength of this relationship (between the 2013 times and the 2003 times) with the strength of the relationship in the previous exercise (between the 2013 times and the 2008 times). Interpret this finding.

Question 2.21

2.21

Fuel efficiency and CO2 emissions

Refer to Example 2.7 (pages 70–71), where we examined the relationship between CO2 emissions and highway MPG for 1067 vehicles for the model year 2014. In that example, we used MPG as the explanatory variable and CO2 as the response variable. Let’s see if the relationship differs if we change our measure of fuel efficiency from highway MPG to city MPG. Make a scatterplot of the fuel efficiency for city driving, city MPG, versus CO2 emissions. Write a summary describing the relationship between these two variables. Compare your summary with what we found in Example 2.7.

Question 2.22

2.22

Add the type of fuel to the plot

Refer to the previous exercise. As we did in Figure 2.6 (page 71), add the categorical variable, type of fuel, to your plot. (If your software does not have this capability, make separate plots for each fuel type. Use the same range of values for the y axis and for the x axis to make the plots easier to compare.) Summarize what you have found in this exercise, and compare your results with what we found in Example 2.7 (pages 70–71).


SECTION 2.2 Correlation

A scatterplot displays the form, direction, and strength of the relationship between two quantitative variables. Linear relationships are particularly important because a straight line is a simple pattern that is quite common. We say a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line. Our eyes are not good judges of how strong a linear relationship is.

The two scatterplots in Figure 2.7 depict exactly the same data, but the lower plot is drawn smaller in a large field. The lower plot seems to show a stronger linear relationship. Our eyes are often fooled by changing the plotting scales or the amount of white space around the cloud of points in a scatterplot.8 We need to follow our strategy for data analysis by using a numerical measure to supplement the graph. Correlation is the measure we use.

FIGURE 2.7

Two scatterplots of the same data. The straight-line pattern in the lower plot appears stronger because of the surrounding open space.

The correlation r

Correlation

The correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r.

Suppose that we have data on variables x and y for n cases. The values for the first case are $x_1$ and $y_1$, the values for the second case are $x_2$ and $y_2$, and so on. The means and standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values. The correlation r between x and y is

$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

As always, the summation sign Σ means “add these terms for all cases.” The formula for the correlation r is a bit complex. It helps us to see what correlation is, but in practice you should use software or a calculator that finds r from keyed-in values of two variables x and y.

REMINDER

standardizing, p. 45

The formula for r begins by standardizing the data. Suppose, for example, that x is height in centimeters and y is weight in kilograms and that we have height and weight measurements for n people. Then $\bar{x}$ and $s_x$ are the mean and standard deviation of the n heights, both in centimeters. The value

$$\frac{x_i - \bar{x}}{s_x}$$

is the standardized height of the ith person. The standardized height says how many standard deviations above or below the mean a person’s height lies. Standardized values have no units—in this example, they are no longer measured in centimeters. Similarly, the standardized weights obtained by subtracting $\bar{y}$ and dividing by $s_y$ are no longer measured in kilograms. The correlation r is an average of the products of the standardized height and the standardized weight for the n people.
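The calculation just described can be sketched in a few lines of code. The heights and weights below are made up for illustration, and the function names are our own. The sketch assumes the usual convention for a sample correlation: standard deviations use the divisor n − 1, and the products of standardized values are also summed and divided by n − 1.

```python
from statistics import mean, stdev


def standardize(values):
    """Return (value - mean) / standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, divisor n - 1
    return [(v - m) / s for v in values]


def correlation(x, y):
    """Correlation r as an average of products of standardized values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same number of cases, at least 2")
    zx = standardize(x)
    zy = standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Made-up heights (cm) and weights (kg) for five people.
heights = [160, 165, 170, 175, 180]
weights = [55, 62, 60, 72, 76]

r = correlation(heights, weights)
print(round(r, 3))
```

In practice, you would rely on software or a calculator rather than code like this, but writing the steps out shows that r is built entirely from standardized values.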

APPLY YOUR KNOWLEDGE

Question 2.23

2.23

Spending on education

CASE 2.1 In Example 2.3 (page 66), we examined the relationship between spending on education and population for the 50 states in the United States. Compute the correlation between these two variables.

Question 2.24

2.24

Change the units

CASE 2.1 Refer to Exercise 2.6 (page 67), where you changed the units to millions of dollars for education spending and to thousands for population.

  1. Find the correlation between spending on education and population using the new units.
  2. Compare this correlation with the one that you computed in the previous exercise.
  3. Generally speaking, what effect, if any, did changing the units in this way have on the correlation?

Facts about correlation

The formula for correlation helps us see that r is positive when there is a positive association between the variables. Height and weight, for example, have a positive association. People who are above average in height tend to be above average in weight. Both the standardized height and the standardized weight are positive. People who are below average in height tend to have below-average weight. Then both standardized height and standardized weight are negative. In both cases, the products in the formula for r are mostly positive, so r is positive. In the same way, we can see that r is negative when the association between x and y is negative. More detailed study of the formula gives more detailed properties of r. Here is what you need to know to interpret correlation.

  1. Correlation makes no distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y in calculating the correlation.
  2. Correlation requires that both variables be quantitative, so it makes sense to do the arithmetic indicated by the formula for r. We cannot calculate a correlation between the incomes of a group of people and what city they live in because city is a categorical variable.
  3. Because r uses the standardized values of the data, r does not change when we change the units of measurement of x, y, or both. Measuring height in inches rather than centimeters and weight in pounds rather than kilograms does not change the correlation between height and weight. The correlation r itself has no unit of measurement; it is just a number.
  4. Positive r indicates positive association between the variables, and negative r indicates negative association.
  5. The correlation r is always a number between −1 and 1. Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward either −1 or 1. Values of r close to −1 or 1 indicate that the points in a scatterplot lie close to a straight line. The extreme values r = −1 and r = 1 occur only in the case of a perfect linear relationship, when the points lie exactly along a straight line.
  6. Correlation measures the strength of only a linear relationship between two variables. Correlation does not describe curved relationships between variables, no matter how strong they are.
  7. Like the mean and standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot.

REMINDER

resistant, p. 25

The scatterplots in Figure 2.8 illustrate how values of r closer to 1 or −1 correspond to stronger linear relationships. To make the meaning of r clearer, the standard deviations of both variables in these plots are equal, and the horizontal and vertical scales are the same. In general, it is not so easy to guess the value of r from the appearance of a scatterplot. Remember that changing the plotting scales in a scatterplot may mislead our eyes, but it does not change the correlation.
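Several of the facts about correlation can be checked numerically. The sketch below uses made-up heights and weights, and the conversion factors are rounded. It looks at three of the facts: swapping x and y, changing the units of measurement, and adding a single outlying case.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Correlation r computed from standardized values (divisor n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)


# Made-up heights (cm) and weights (kg) for five people.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 60, 72, 76]

r = correlation(heights_cm, weights_kg)

# Fact 1: it makes no difference which variable we call x and which we call y.
r_swapped = correlation(weights_kg, heights_cm)

# Fact 3: changing units (cm to inches, kg to pounds) leaves r unchanged.
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.2046 for w in weights_kg]
r_units = correlation(heights_in, weights_lb)

# Fact 7: r is not resistant. One unusual (hypothetical) person changes it a lot.
r_outlier = correlation(heights_cm + [150], weights_kg + [120])

print(round(r, 3), round(r_swapped, 3), round(r_units, 3), round(r_outlier, 3))
```

The first three values agree, while the single added case is enough to change the sign of the correlation in this small data set.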

FIGURE 2.8

How the correlation measures the strength of a linear relationship. Patterns closer to a straight line have correlations closer to 1 or −1.

Remember that correlation is not a complete description of two-variable data, even when the relationship between the variables is linear. You should give the means and standard deviations of both x and y along with the correlation. (Because the formula for correlation uses the means and standard deviations, these measures are the proper choice to accompany a correlation.) Conclusions based on correlations alone may require rethinking in the light of a more complete description of the data.

EXAMPLE 2.8

Forecasting Earnings

Stock analysts regularly forecast the earnings per share (EPS) of companies they follow. EPS is calculated by dividing a company’s net income for a given time period by the number of common stock shares outstanding. We have two analysts’ EPS forecasts for a computer manufacturer for the next six quarters. How well do the two forecasts agree? The correlation between them is r = 0.9, but the mean of the first analyst’s forecasts is $3 per share lower than the second analyst’s mean.

These facts do not contradict each other. They are simply different kinds of information. The means show that the first analyst predicts lower EPS than the second. But because the first analyst’s EPS predictions are about $3 per share lower than the second analyst’s for every quarter, the correlation remains high. Adding or subtracting the same number to all values of either x or y does not change the correlation. The two analysts agree on which quarters will see higher EPS values. The high r shows this agreement, despite the fact that the actual predicted values differ by $3 per share.
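The point of this example can be illustrated with a short calculation. The six quarterly forecasts below are hypothetical; they are not the analysts' actual forecasts. They are chosen so that the means differ by $3 per share. The code reports the two means and then checks that adding $3 to every forecast of the first analyst leaves the correlation unchanged.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Correlation r computed from standardized values (divisor n - 1)."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)]
    return sum(products) / (len(x) - 1)


# Hypothetical EPS forecasts (dollars per share) for six quarters.
first_analyst = [1.0, 1.4, 1.2, 1.9, 2.1, 2.6]
second_analyst = [4.1, 4.3, 4.4, 4.8, 5.2, 5.4]

mean_gap = mean(second_analyst) - mean(first_analyst)
r = correlation(first_analyst, second_analyst)

# Adding the same number to every value of one variable does not change r.
first_shifted = [value + 3 for value in first_analyst]
r_shifted = correlation(first_shifted, second_analyst)

print(round(mean_gap, 2), round(r, 3), round(r_shifted, 3))
```

The means tell us about the level of the forecasts, and the correlation tells us whether the two analysts agree on which quarters will be high or low. Reporting both gives a more complete description.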

APPLY YOUR KNOWLEDGE

Question 2.25

2.25

Strong association but no correlation

Here is a data set that illustrates an important point about correlation:

x 20 30 40 50 60
y 10 30 50 30 10
  1. Make a scatterplot of y versus x.
  2. Describe the relationship between y and x. Is it weak or strong? Is it linear?
  3. Find the correlation between y and x.
  4. What important point about correlation does this exercise illustrate?

Question 2.26

2.26

Brand names and generic products

  1. If a store always prices its generic “store brand” products at exactly 90% of the brand name products’ prices, what would be the correlation between these two prices? (Hint: Draw a scatterplot for several prices.)
  2. If the store always prices its generic products $1 less than the corresponding brand name products, then what would be the correlation between the prices of the brand name products and the store brand products?

SECTION 2.2 Summary


SECTION 2.2 Exercises

For Exercises 2.23 and 2.24, see page 75; and for 2.25 and 2.26, see pages 77–78.

Question 2.27

2.27

Companies of the world

Refer to Exercise 1.118 (page 61), where we examined data collected by the World Bank on the numbers of companies that are incorporated and are listed on their country’s stock exchange at the end of the year. In Exercise 2.10 (page 71), you examined the relationship between these numbers for 2012 and 2002.

  1. Find the correlation between these two variables.
  2. Do you think that the correlation you computed gives a good numerical summary of the strength of the relationship between these two variables? Explain your answer.

Question 2.28

2.28

Companies of the world

Refer to the previous exercise and to Exercise 2.11 (page 72). Answer parts (a) and (b) for 2012 and 1992. Compare the correlation you found in the previous exercise with the one you found in this exercise. Why do they differ in this way?

Question 2.29

2.29

A product for lab experiments

In Exercise 2.17 (page 73), you described the relationship between time and count for an experiment examining the decay of barium.

  1. Is the relationship between these two variables strong? Explain your answer.
  2. Find the correlation.
  3. Do you think that the correlation you computed gives a good numerical summary of the strength of the relationship between these two variables? Explain your answer.

Question 2.30

2.30

Use a log for the radioactive decay

Refer to the previous exercise and to Exercise 2.18 (page 73), where you transformed the counts with a logarithm.

  1. Is the relationship between time and the log of the counts strong? Explain your answer.
  2. Find the correlation between time and the log of the counts.
  3. Do you think that the correlation you computed gives a good numerical summary of the strength of the relationship between these two variables? Explain your answer.
  4. Compare your results here with those you found in the previous exercise. Was the correlation useful in explaining the relationship before the transformation? After? Explain your answers.
  5. Using your answer in part (d), write a short explanation of what these analyses show about the use of a correlation to explain the strength of a relationship.

Question 2.31

2.31

Brand-to-brand variation in a product

In Exercise 2.12 (page 73), you examined the relationship between percent alcohol and calories per 12 ounces for 175 domestic brands of beer.

  1. Compute the correlation between these two variables.
  2. Do you think that the correlation you computed gives a good numerical summary of the strength of the relationship between these two variables? Explain your answer.

Question 2.32

2.32

Alcohol and carbohydrates in beer revisited

Refer to the previous exercise. Delete any outliers that you identified in Exercise 2.12.

  1. Recompute the correlation without the outliers.
  2. Write a short paragraph about the possible effects of outliers on the correlation, using this example to illustrate your ideas.

Question 2.33

2.33

Marketing in Canada

In Exercise 2.14 (page 73), you examined the relationship between the percent of the population over 65 and the percent under 15 for the 13 Canadian provinces and territories.

  1. Make a scatterplot of the two variables if you do not have your work from Exercise 2.14.
  2. Find the value of the correlation r.
  3. Does this numerical summary give a good indication of the strength of the relationship between these two variables? Explain your answer.

Question 2.34

2.34

Nunavut

Refer to the previous exercise.

  1. Do you think that Nunavut is an outlier? Explain your answer.
  2. Find the correlation without Nunavut. Using your work from the previous exercise, summarize the effect of Nunavut on the correlation.

Question 2.35

2.35

Education spending and population with logs

In Example 2.3 (page 66), we examined the relationship between spending on education and population, and in Exercise 2.23 (page 75), you found the correlation between these two variables. In Example 2.6 (page 69), we examined the relationship between the variables transformed by logs.

  1. Compute the correlation between the variables expressed as logs.
  2. How does this correlation compare with the one you computed in Exercise 2.23? Discuss this result.

Question 2.36

2.36

Are they outliers?

Refer to the previous exercise. Delete the four states with high values.

  1. Find the correlation between spending on education and population for the remaining 46 states.
  2. Do the same for these variables expressed as logs.
  3. Compare your results in parts (a) and (b) with the correlations that you computed with the full data set in Exercise 2.23 and in the previous exercise. Discuss these results.

Question 2.37

2.37

Fuel efficiency and CO2 emissions

In Example 2.7 (pages 70–71), we examined the relationship between highway MPG and CO2 emissions for 1067 vehicles for the model year 2014. Let’s examine the relationship between the two measures of fuel efficiency in the data set, highway MPG and city MPG.

  1. Make a scatterplot with city MPG on the x axis and highway MPG on the y axis.
  2. Describe the relationship.
  3. Calculate the correlation.
  4. Does this numerical summary give a good indication of the strength of the relationship between these two variables? Explain your answer.

Question 2.38

2.38

Consider the fuel type

Refer to the previous exercise and to Figure 2.6 (page 71), where different colors are used to distinguish four different types of fuels used by these vehicles.

  1. Make a figure similar to Figure 2.6 that allows us to see the categorical variable, type of fuel, in the scatterplot. If your software does not have this capability, make different scatterplots for each fuel type.
  2. Discuss the relationship between highway MPG and city MPG, taking into account the type of fuel. Compare this view with what you found in the previous exercise where you did not make this distinction.
  3. Find the correlation between highway MPG and city MPG for each type of fuel. Write a short summary of what you have found.

Question 2.39

2.39

Match the correlation

The Correlation and Regression applet at the text website allows you to create a scatterplot by clicking and dragging with the mouse. The applet calculates and displays the correlation as you change the plot. You will use this applet to make scatterplots with 10 points that have correlation close to 0.7. The lesson is that many patterns can have the same correlation. Always plot your data before you trust a correlation.

  1. Stop after adding the first two points. What is the value of the correlation? Why does it have this value?
  2. Make a lower-left to upper-right pattern of 10 points with correlation about r = 0.7. (You can drag points up or down to adjust r after you have 10 points.) Make a rough sketch of your scatterplot.
  3. Make another scatterplot with nine points in a vertical stack at the right of the plot. Add one point far to the left and move it until the correlation is close to 0.7. Make a rough sketch of your scatterplot.
  4. Make yet another scatterplot with 10 points in a curved pattern that starts at the lower left, rises to the right, then falls again at the far right. Adjust the points up or down until you have a quite smooth curve with correlation close to 0.7. Make a rough sketch of this scatterplot also.

Question 2.40

2.40

Stretching a scatterplot

Changing the units of measurement can greatly alter the appearance of a scatterplot. Consider the following data:

x −4 −4 −3 3 4 4
y 0.5 −0.6 −0.5 0.5 0.5 −0.6
  1. Draw x and y axes each extending from −6 to 6. Plot the data on these axes.
  2. Calculate the values of new variables x* = x/10 and y* = 10y, starting from the values of x and y. Plot y* against x* on the same axes using a different plotting symbol. The two plots are very different in appearance.
  3. Find the correlation between x and y. Then find the correlation between x* and y*. How are the two correlations related? Explain why this isn’t surprising.

Question 2.41

2.41

CEO compensation and stock market performance

An academic study concludes, “The evidence indicates that the correlation between the compensation of corporate CEOs and the performance of their company’s stock is close to zero.” A business magazine reports this as “A new study shows that companies that pay their CEOs highly tend to perform poorly in the stock market, and vice versa.” Explain why the magazine’s report is wrong. Write a statement in plain language (don’t use the word “correlation”) to explain the study’s conclusion.

Question 2.42

2.42

Investment reports and correlations

Investment reports often include correlations. Following a table of correlations among mutual funds, a report adds, “Two funds can have perfect correlation, yet different levels of risk. For example, Fund A and Fund B may be perfectly correlated, yet Fund A moves 20% whenever Fund B moves 10%.” Write a brief explanation, for someone who does not know statistics, of how this can happen. Include a sketch to illustrate your explanation.

Question 2.43

2.43

Sloppy writing about correlation

Each of the following statements contains a blunder. Explain in each case what is wrong.

  1. “The correlation between y and x is r = 0.5 but the correlation between x and y is r = −0.5.”
  2. “There is a high correlation between the color of a smartphone and the age of its owner.”
  3. “There is a very high correlation (r = 1.2) between the premium you would pay for a standard automobile insurance policy and the number of accidents you have had in the last three years.”

2.3 Least-Squares Regression

Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables. If a scatterplot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatterplot. A regression line summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other. That is, regression describes a relationship between an explanatory variable and a response variable.

Regression Line

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

EXAMPLE 2.9

World Financial Markets

The World Economic Forum studies data on many variables related to financial development in the countries of the world. They rank countries on their financial development based on a collection of factors related to economic growth.9 Two of the variables studied are gross domestic product per capita and net assets per capita. Here are the data for 15 countries that ranked high on financial development:

| Country | GDP | Assets |
|---|---|---|
| United Kingdom | 43.8 | 199 |
| Australia | 47.4 | 166 |
| United States | 47.9 | 191 |
| Singapore | 40.0 | 168 |
| Canada | 45.4 | 170 |
| Switzerland | 67.4 | 358 |
| Netherlands | 52.0 | 242 |
| Japan | 38.6 | 176 |
| Denmark | 62.6 | 224 |
| France | 46.0 | 149 |
| Germany | 44.7 | 145 |
| Belgium | 47.1 | 167 |
| Sweden | 52.8 | 169 |
| Spain | 35.3 | 152 |
| Ireland | 61.8 | 214 |

In this table, GDP is gross domestic product per capita in thousands of dollars and assets is net assets per capita in thousands of dollars. Figure 2.9 is a scatterplot of the data. The correlation is r = 0.76. The scatterplot includes a regression line drawn through the points.

FIGURE 2.9

Scatterplot of GDP per capita and net assets per capita for 15 countries that rank high on financial development, Example 2.9. The dashed line indicates how to use the regression line to predict net assets per capita for a country with a GDP per capita of 50.

prediction

Suppose we want to use this relationship between GDP per capita and net assets per capita to predict the net assets per capita for a country that has a GDP per capita of $50,000. To predict the net assets per capita (in thousands of dollars), first locate 50 on the x axis. Then go “up and over” as in Figure 2.9 to find the net assets per capita y that corresponds to x = 50. We predict that a country with a GDP per capita of $50,000 will have net assets per capita of about $200,000.

The least-squares regression line

Different people might draw different lines by eye on a scatterplot. We need a way to draw a regression line that doesn’t depend on our guess as to where the line should be. We will use the line to predict y from x, so the prediction errors we make are errors in y, the vertical direction in the scatterplot. If we predict net assets per capita of 177 and the actual net assets per capita are 170, our prediction error is

\[\text{error} = \text{observed } y - \text{predicted } y = 170 - 177 = -7\]

The error is −$7,000.

APPLY YOUR KNOWLEDGE

Question 2.44

Find a prediction error

Use Figure 2.9 to estimate the net assets per capita for a country that has a GDP per capita of $40,000. If the actual net assets per capita are $170,000, find the prediction error.

Question 2.45

Positive and negative prediction errors

Examine Figure 2.9 carefully. How many of the prediction errors are positive? How many are negative?

No line will pass exactly through all the points in the scatterplot. We want the vertical distances of the points from the line to be as small as possible.

EXAMPLE 2.10

The Least-Squares Idea

Figure 2.10 illustrates the idea. This plot shows the data, along with a line. The vertical distances of the data points from the line appear as vertical line segments.

FIGURE 2.10

The least-squares idea. For each observation, find the vertical distance of each point from a regression line. The least-squares regression line makes the sum of the squares of these distances as small as possible.

There are several ways to make the collection of vertical distances “as small as possible.” The most common is the least-squares method.

Least-Squares Regression Line

The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
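To make the definition concrete, here is a minimal Python sketch that tries a coarse grid of candidate lines for the Example 2.9 data and keeps the one with the smallest sum of squared vertical distances. The grid ranges, step sizes, and function names are assumptions chosen for illustration; statistical software finds the line from a formula rather than by searching.

```python
# Data from Example 2.9 (both variables in thousands of dollars per capita).
gdp = [43.8, 47.4, 47.9, 40.0, 45.4, 67.4, 52.0, 38.6,
       62.6, 46.0, 44.7, 47.1, 52.8, 35.3, 61.8]
assets = [199, 166, 191, 168, 170, 358, 242, 176,
          224, 149, 145, 167, 169, 152, 214]


def sum_of_squares(intercept, slope, x, y):
    """Sum of squared vertical distances of the points from the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def grid_search(x, y):
    """Try every candidate line on a coarse grid; return the best (intercept, slope).

    Assumed grid (illustration only): intercepts -100..100 in steps of 1,
    slopes 0.0..10.0 in steps of 0.1.
    """
    candidates = ((b0, k / 10) for b0 in range(-100, 101) for k in range(0, 101))
    return min(candidates, key=lambda line: sum_of_squares(line[0], line[1], x, y))


best_intercept, best_slope = grid_search(gdp, assets)
print(f"best line on the grid: y-hat = {best_intercept} + {best_slope}x")
print(f"sum of squares: {sum_of_squares(best_intercept, best_slope, gdp, assets):.1f}")
```

A finer grid would move the answer closer to the exact least-squares coefficients, at the cost of evaluating more candidate lines.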

One reason for the popularity of the least-squares regression line is that the problem of finding the line has a simple solution. We can give the recipe for the least-squares line in terms of the means and standard deviations of the two variables and their correlation.

Equation of the Least-Squares Regression Line

We have data on an explanatory variable x and a response variable y for n cases. From the data, calculate the means \(\bar{x}\) and \(\bar{y}\) and the standard deviations \(s_x\) and \(s_y\) of the two variables and their correlation r. The least-squares regression line is the line

\[\hat{y} = b_0 + b_1 x\]

with slope

\[b_1 = r \frac{s_y}{s_x}\]

and intercept

\[b_0 = \bar{y} - b_1 \bar{x}\]

We write \(\hat{y}\) (read “y hat”) in the equation of the regression line to emphasize that the line gives a predicted response \(\hat{y}\) for any x. Because of the scatter of points about the line, the predicted response will usually not be exactly the same as the actually observed response y. In practice, you don’t need to calculate the means, standard deviations, and correlation first. Statistical software or your calculator will give the slope \(b_1\) and intercept \(b_0\) of the least-squares line from keyed-in values of the variables x and y. You can then concentrate on understanding and using the regression line. Be warned: different software packages and calculators label the slope and intercept differently in their output, so remember that the slope is the value that multiplies x in the equation.
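As a rough illustration of the recipe, the following Python sketch computes the slope and intercept for the Example 2.9 data from the means, standard deviations, and correlation. The function names are hypothetical, and the correlation is computed with the usual n − 1 divisor; this is a sketch of the arithmetic, not a description of what any particular software package does internally.

```python
from statistics import mean, stdev

# Data from Example 2.9 (both variables in thousands of dollars per capita).
gdp = [43.8, 47.4, 47.9, 40.0, 45.4, 67.4, 52.0, 38.6,
       62.6, 46.0, 44.7, 47.1, 52.8, 35.3, 61.8]
assets = [199, 166, 191, 168, 170, 358, 242, 176,
          224, 149, 145, 167, 169, 152, 214]


def correlation(x, y):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    x_bar, y_bar, s_x, s_y = mean(x), mean(y), stdev(x), stdev(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (len(x) - 1)


def least_squares_line(x, y):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    b1 = correlation(x, y) * stdev(y) / stdev(x)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1


r = correlation(gdp, assets)
b0, b1 = least_squares_line(gdp, assets)
print(f"r = {r:.2f}")
print(f"y-hat = {b0:.2f} + {b1:.2f}x")
```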

EXAMPLE 2.11

The Equation for Predicting Net Assets

The line in Figure 2.9 is in fact the least-squares regression line for predicting net assets per capita from GDP per capita. The equation of this line is

\[\hat{y} = -27.17 + 4.50x\]

slope

The slope of a regression line is almost always important for interpreting the data. The slope is the rate of change, the amount of change in \(\hat{y}\) when x increases by 1. The slope \(b_1 = 4.5\) in this example says that each additional $1000 of GDP per capita is associated with an additional $4500 in net assets per capita.

intercept

The intercept of the regression line is the value of \(\hat{y}\) when x = 0. Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero. In our example, x = 0 occurs when a country has zero GDP. Such a situation would be very unusual, and we would not include it within the framework of our analysis.

EXAMPLE 2.12

Predict Net Assets

prediction

The equation of the regression line makes prediction easy. Just substitute a value of x into the equation. To predict the net assets per capita for a country that has a GDP per capita of $50,000, we use x = 50:

\[\hat{y} = -27.17 + 4.50(50) = 197.83\]

Rounded to the nearest thousand dollars, the predicted net assets per capita is $198,000.

plot the line

To plot the line on the scatterplot, you can use the equation to find \(\hat{y}\) for two values of x, one near each end of the range of x in the data. Plot each \(\hat{y}\) above its x, and draw the line through the two points. As a check, it is a good idea to compute \(\hat{y}\) for a third value of x and verify that this point is on your line.
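The prediction and the plotting recipe can be sketched in a few lines of Python. The coefficients below are the values quoted from the Excel output in this section; the function name and the particular x values chosen for plotting (35, 68, and 50) are assumptions made for illustration.

```python
# Coefficients as reported in the Excel output quoted in this section.
INTERCEPT = -27.16823305
SLOPE = 4.4998956


def predict_assets(gdp):
    """Predicted net assets per capita (thousands of dollars) for a GDP per capita."""
    return INTERCEPT + SLOPE * gdp


# Example 2.12: prediction for a GDP per capita of $50,000 (x = 50).
prediction = predict_assets(50)
print(f"x = 50: y-hat = {prediction:.2f}, or about ${round(prediction) * 1000:,}")

# Two points for drawing the line, one near each end of the observed GDP range
# (the data run from 35.3 to 67.4), plus a third point used only as a check.
x_low, x_high, x_check = 35, 68, 50
y_low, y_high, y_check = (predict_assets(x) for x in (x_low, x_high, x_check))

# The check point lies on the line through the two end points if the line
# through the ends, evaluated at x_check, reproduces y_check.
slope_between_ends = (y_high - y_low) / (x_high - x_low)
on_line = abs(y_low + slope_between_ends * (x_check - x_low) - y_check) < 1e-9
print(f"plot ({x_low}, {y_low:.1f}) and ({x_high}, {y_high:.1f}); check point on line: {on_line}")
```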

APPLY YOUR KNOWLEDGE

Question 2.46

A regression line

A regression equation is y = 15 + 30x.

  1. What is the slope of the regression line?
  2. What is the intercept of the regression line?
  3. Find the predicted values of y for x = 10, for x = 20, and for x = 30.
  4. Plot the regression line for values of x between 0 and 50.

EXAMPLE 2.13

GDP and Assets Results Using Software

Figure 2.11 displays the selected regression output for the world financial markets data from JMP, Minitab, and Excel. The complete outputs contain many other items that we will study in Chapter 10.

FIGURE 2.11

Selected least-squares regression output for the world financial markets data. (a) JMP. (b) Minitab. (c) Excel.

Coefficient

Let’s look at the Minitab output first. A table gives the regression intercept and slope under the heading “Coefficients.” Coefficient is a generic term that refers to the quantities that define a regression equation. Note that the intercept is labeled “Constant,” and the slope is labeled with the name of the explanatory variable. In the table, Minitab reports the intercept as −27.2 and the slope as 4.50 followed by the regression equation.

Excel provides the same information in a slightly different format. Here the intercept is reported as −27.16823305, and the slope is reported as 4.4998956. Check the JMP output to see how the regression coefficients are reported there.

How many digits should we keep in reporting the results of statistical calculations? The answer depends on how the results will be used. For example, if we are giving a description of the equation, then rounding the coefficients and reporting the equation as y = −27 + 4.5x would be fine. If we will use the equation to calculate predicted values, we should keep a few more digits and then round the resulting calculation as we did in Example 2.12.

APPLY YOUR KNOWLEDGE

Question 2.47

Predicted values for GDP and assets

Refer to the world financial markets data in Example 2.9.

  1. Use software to compute the coefficients of the regression equation. Indicate where to find the slope and the intercept on the output, and report these values.
  2. Make a scatterplot of the data with the least-squares line.
  3. Find the predicted value of assets for each country.
  4. Find the difference between the actual value and the predicted value for each country.

Facts about least-squares regression

Regression as a way to describe the relationship between a response variable and an explanatory variable is one of the most common statistical methods, and least squares is the most common technique for fitting a regression line to data. Here are some facts about least-squares regression lines.

Fact 1. There is a close connection between correlation and the slope of the least-squares line. The slope is

\[b_1 = r \frac{s_y}{s_x}\]

This equation says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. When the variables are perfectly correlated (r = 1 or r = −1), the change in the predicted response \(\hat{y}\) is the same (in standard deviation units) as the change in x. Otherwise, because −1 ≤ r ≤ 1, the change in \(\hat{y}\) is less than the change in x. As the correlation grows less strong, the prediction \(\hat{y}\) moves less in response to changes in x.

Fact 2. The least-squares regression line always passes through the point \((\bar{x}, \bar{y})\) on the graph of y against x. So the least-squares regression line of y on x is the line with slope \(r s_y / s_x\) that passes through the point \((\bar{x}, \bar{y})\). We can describe regression entirely in terms of the basic descriptive measures \(\bar{x}\), \(s_x\), \(\bar{y}\), \(s_y\), and r.

Fact 3. The distinction between explanatory and response variables is essential in regression. Least-squares regression looks at the distances of the data points from the line only in the y direction. If we reverse the roles of the two variables, we get a different least-squares regression line.
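The three facts can be checked numerically. The sketch below uses the Example 2.9 data as a stand-in; the helper name `fit` and the variable names are hypothetical, and the checks illustrate the facts rather than prove them.

```python
from statistics import mean, stdev

# Data from Example 2.9 (both variables in thousands of dollars per capita).
gdp = [43.8, 47.4, 47.9, 40.0, 45.4, 67.4, 52.0, 38.6,
       62.6, 46.0, 44.7, 47.1, 52.8, 35.3, 61.8]
assets = [199, 166, 191, 168, 170, 358, 242, 176,
          224, 149, 145, 167, 169, 152, 214]


def fit(x, y):
    """Least-squares line for predicting y from x: returns (b0, b1, r)."""
    n = len(x)
    x_bar, y_bar, s_x, s_y = mean(x), mean(y), stdev(x), stdev(y)
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    return y_bar - b1 * x_bar, b1, r


b0, b1, r = fit(gdp, assets)

# Fact 1: increasing x by one s_x changes y-hat by r standard deviations of y.
change_in_sd_units = b1 * stdev(gdp) / stdev(assets)
print(f"change in y-hat per s_x, in units of s_y: {change_in_sd_units:.4f} (r = {r:.4f})")

# Fact 2: the line passes through (x-bar, y-bar).
y_hat_at_mean = b0 + b1 * mean(gdp)
print(f"y-hat at x-bar: {y_hat_at_mean:.3f}; y-bar: {mean(assets):.3f}")

# Fact 3: reversing the roles gives a different line. The second fit predicts
# GDP from assets; drawn on the same axes (GDP horizontal), its slope is 1 / c1.
c0, c1, _ = fit(assets, gdp)
reversed_slope_on_same_axes = 1 / c1
print(f"slope of assets-on-GDP line: {b1:.2f}")
print(f"slope of GDP-on-assets line, drawn on the same axes: {reversed_slope_on_same_axes:.2f}")
```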

EXAMPLE 2.14

Education Spending and Population

CASE 2.1 Figure 2.12 is a scatterplot of the education spending data described in Case 2.1 (page 65). There is a positive linear relationship.

FIGURE 2.12

Scatterplot of spending on education versus the population. The two lines are the least-squares regression lines: using population to predict spending on education (solid) and using spending on education to predict population (dashed), Example 2.14.

The two lines on the plot are the two least-squares regression lines. The regression line for using population to predict education spending is solid. The regression line for using education spending to predict population is dashed. The two regressions give different lines. In the regression setting, you must choose one variable to be explanatory.

Interpretation of \(r^2\)

The square of the correlation, \(r^2\), describes the strength of a straight-line relationship. Here is the basic idea. Think about trying to predict a new value of y. With no other information than our sample of values of y, a reasonable choice is \(\bar{y}\).

Now consider how your prediction would change if you had an explanatory variable. If we use the regression equation for the prediction, we would use \(\hat{y} = b_0 + b_1 x\). This prediction takes into account the value of the explanatory variable x.

Let’s compare our two choices for predicting y. With the explanatory variable x, we use \(\hat{y}\); without this information, we use \(\bar{y}\), the sample mean of the response variable. How can we compare these two choices? When we use \(\bar{y}\) to predict, our prediction error is \(y - \bar{y}\). If, instead, we use \(\hat{y}\), our prediction error is \(y - \hat{y}\). The use of x in our prediction changes our prediction error from \(y - \bar{y}\) to \(y - \hat{y}\). The difference is \(\hat{y} - \bar{y}\). Our comparison uses the sums of squares of these differences, \(\sum (\hat{y} - \bar{y})^2\) and \(\sum (y - \bar{y})^2\). The ratio of these two quantities is the square of the correlation:

\[r^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}\]

The numerator represents the variation in y that is explained by x, and the denominator represents the total variation in y.

Percent of Variation Explained by the Least-Squares Equation

To find the percent of variation explained by the least-squares equation, square the value of the correlation and express the result as a percent.

EXAMPLE 2.15

Using \(r^2\)

The correlation between GDP per capita and net assets per capita in Example 2.12 (pages 83–84) is r = 0.76312, so \(r^2 = 0.58234\). GDP per capita explains about 58% of the variability in net assets per capita.
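The ratio-of-sums-of-squares interpretation can be verified directly. This Python sketch, with hypothetical variable names, fits the least-squares line to the Example 2.9 data, computes the explained and total sums of squares, and compares their ratio with the square of the correlation.

```python
# Data from Example 2.9 (both variables in thousands of dollars per capita).
gdp = [43.8, 47.4, 47.9, 40.0, 45.4, 67.4, 52.0, 38.6,
       62.6, 46.0, 44.7, 47.1, 52.8, 35.3, 61.8]
assets = [199, 166, 191, 168, 170, 358, 242, 176,
          224, 149, 145, 167, 169, 152, 214]

n = len(gdp)
x_bar, y_bar = sum(gdp) / n, sum(assets) / n

# Sums of squares and cross products about the means.
s_xx = sum((x - x_bar) ** 2 for x in gdp)
s_yy = sum((y - y_bar) ** 2 for y in assets)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gdp, assets))

# Least-squares line (s_xy / s_xx is algebraically the same as r * s_y / s_x).
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in gdp]

explained = sum((y_hat - y_bar) ** 2 for y_hat in fitted)  # variation explained by x
total = s_yy                                               # total variation in y
r = s_xy / (s_xx * s_yy) ** 0.5

print(f"explained / total = {explained / total:.4f}")
print(f"r squared         = {r ** 2:.4f}")
print(f"percent of variation explained: {100 * explained / total:.0f}%")
```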

When you report a regression, give \(r^2\) as a measure of how successful the regression was in explaining the response. The software outputs in Figure 2.11 include \(r^2\), either in decimal form or as a percent. When you see a correlation (often listed as R or Multiple R in outputs), square it to get a better feel for the strength of the association.

APPLY YOUR KNOWLEDGE

Question 2.48

The “January effect.”

Some people think that the behavior of the stock market in January predicts its behavior for the rest of the year. Take the explanatory variable x to be the percent change in a stock market index in January and the response variable y to be the change in the index for the entire year. We expect a positive correlation between x and y because the change during January contributes to the full year’s change. Calculation based on 38 years of data gives

  1. What percent of the observed variation in yearly changes in the index is explained by a straight-line relationship with the change during January?
  2. What is the equation of the least-squares line for predicting the full-year change from the January change?
  3. The mean change in January is \(\bar{x} = 1.75\%\). Use your regression line to predict the change in the index in a year in which the index rises 1.75% in January. Why could you have given this result (up to roundoff error) without doing the calculation?

Question 2.49

Is regression useful?

In Exercise 2.39 (pages 79–80), you used the Correlation and Regression applet to create three scatterplots having correlation about r = 0.7 between the horizontal variable x and the vertical variable y. Create three similar scatterplots again, after clicking the “Show least-squares line” box to display the regression line. Correlation r = 0.7 is considered reasonably strong in many areas of work. Because there is a reasonably strong correlation, we might use a regression line to predict y from x. In which of your three scatterplots does it make sense to use a straight line for prediction?

Residuals

A regression line is a mathematical model for the overall pattern of a linear relationship between an explanatory variable and a response variable. Deviations from the overall pattern are also important. In the regression setting, we see deviations by looking at the scatter of the data points about the regression line. The vertical distances from the points to the least-squares regression line are as small as possible in the sense that they have the smallest possible sum of squares. Because they represent “leftover” variation in the response after fitting the regression line, these distances are called residuals.

Residuals

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,

\[\text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y}\]

EXAMPLE 2.16

Education Spending and Population

CASE 2.1 Figure 2.13 is a scatterplot showing education spending versus the population for the 50 states that we studied in Case 2.1 (page 65). Included on the scatterplot is the least-squares line. The points for the states with large values for both variables—California, Texas, Florida, and New York—are marked individually.

The equation of the least-squares line has the form \(\hat{y} = b_0 + b_1 x\), where \(\hat{y}\) represents education spending and x represents the population of the state.

Let’s look carefully at the data for California, y = 110.1 and x = 38.7. The predicted education spending for a state with 38.7 million people is

\[\hat{y} = b_0 + b_1(38.7) = 115.83\]

The residual for California is the difference between the observed spending (y) and this predicted value.

\[\text{residual} = y - \hat{y} = 110.1 - 115.83 = -5.73\]

FIGURE 2.13

Scatterplot of spending on education versus the population for 50 states, with the least-squares line and selected points labeled, Example 2.16.

California spends $5.73 million less on education than the least-squares regression line predicts. On the scatterplot, the residual for California is shown as a dashed vertical line between the actual spending and the least-squares line.

APPLY YOUR KNOWLEDGE

Question 2.50

Residual for Texas

Refer to Example 2.16 (page 89). Texas spent $90.5 million on education and has a population of 26.8 million people.

  1. Find the predicted education spending for Texas.
  2. Find the residual for Texas.
  3. Which state, California or Texas, has a greater deviation from the regression line?

There is a residual for each data point. Finding the residuals with a calculator is a bit unpleasant, because you must first find the predicted response for every x. Statistical software gives you the residuals all at once.

Because the residuals show how far the data fall from our regression line, examining the residuals helps us assess how well the line describes the data. Although residuals can be calculated from any model fitted to data, the residuals from the least-squares line have a special property: the mean of the least-squares residuals is always zero.
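The education spending data are not reproduced in this section, so the following Python sketch uses the Example 2.9 data instead to show how software produces all the residuals at once and to check that they sum to zero. The function and variable names are hypothetical.

```python
# Data from Example 2.9 (both variables in thousands of dollars per capita).
countries = ["United Kingdom", "Australia", "United States", "Singapore", "Canada",
             "Switzerland", "Netherlands", "Japan", "Denmark", "France",
             "Germany", "Belgium", "Sweden", "Spain", "Ireland"]
gdp = [43.8, 47.4, 47.9, 40.0, 45.4, 67.4, 52.0, 38.6,
       62.6, 46.0, 44.7, 47.1, 52.8, 35.3, 61.8]
assets = [199, 166, 191, 168, 170, 358, 242, 176,
          224, 149, 145, 167, 169, 152, 214]


def least_squares_line(x, y):
    """Return (intercept, slope); this slope form is equivalent to r * s_y / s_x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


b0, b1 = least_squares_line(gdp, assets)

# residual = observed y - predicted y, one per case
residuals = {c: y - (b0 + b1 * x) for c, x, y in zip(countries, gdp, assets)}
for country, e in residuals.items():
    print(f"{country:15s} {e:8.2f}")

total = sum(residuals.values())
rounded_total = sum(round(e, 2) for e in residuals.values())
print(f"sum of residuals: {total:.10f}")
print(f"sum of residuals rounded to two decimals: {rounded_total:.2f}")
```

The sum of the unrounded residuals is zero up to floating-point error. The sum of the rounded residuals can differ slightly from zero because each rounding step discards a small amount.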

APPLY YOUR KNOWLEDGE

Question 2.51

Sum the education spending residuals

The residuals in the EDSPEND data file have been rounded to two places after the decimal. Find the sum of these residuals. Is the sum exactly zero? If not, explain why.

As usual, when we perform statistical calculations, we prefer to display the results graphically. We can do this for the residuals.

Residual Plots

A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.

EXAMPLE 2.17

Residual Plot for Education Spending

CASE 2.1 Figure 2.14 gives the residual plot for the education spending data. The horizontal line at zero in the plot helps orient us.

FIGURE 2.14

Residual plot for the education spending data, Example 2.17.

APPLY YOUR KNOWLEDGE

Question 2.52

Identify the four states

In Figure 2.13, four states are identified by name: California, Texas, Florida, and New York. The dashed lines in the plot represent the residuals.

  1. Sketch a version of Figure 2.14 or generate your own plot using the EDSPEND data file. Write in the names of the states California, Texas, Florida, and New York on your plot.
  2. Explain how you were able to identify these four points on your sketch.

If the regression line captures the overall relationship between x and y, the residuals should have no systematic pattern. The residual plot will look something like the pattern in Figure 2.15(a). That plot shows a scatter of points about the fitted line, with no unusual individual observations or systematic change as x increases. Here are some things to look for when you examine a residual plot:

FIGURE 2.15

Idealized patterns in plots of least-squares residuals. Plot (a) indicates that the regression line fits the data well. The data in plot (b) have a curved pattern, so a straight line fits poorly. The response variable y in plot (c) has more spread for larger values of the explanatory variable x, so prediction will be less accurate when x is large.
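A curved pattern like the one in plot (b) is easy to reproduce. The sketch below uses made-up data (y = x² for x = 0, 1, ..., 10, an assumption for illustration only), fits a least-squares line, and prints the sign of each residual in order of x.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Hypothetical curved data: the true relationship is quadratic, not linear.
x = list(range(11))
y = [xi ** 2 for xi in x]

b0, b1 = least_squares_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# A crude text version of a residual plot: residual against x, with its sign.
for xi, e in zip(x, residuals):
    print(f"x = {xi:2d}  residual = {e:6.1f}  {'+' if e > 0 else '-'}")
```

The residuals are positive at both ends of the x range and negative in the middle. That systematic run of signs is the signature of a curved relationship, even though the residuals still sum to zero.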

The distribution of the residuals

REMINDER

Normal quantile plots, p. 51

When we compute the residuals, we are creating a new quantitative variable for our data set. Each case has a value for this variable. It is natural to ask about the distribution of this variable. We already know that the mean is zero. We can use the methods we learned in Chapter 1 to examine other characteristics of the distribution. We will see in Chapter 10 that a question of interest with respect to residuals is whether or not they are approximately Normal. Recall that we used Normal quantile plots to address this issue.

EXAMPLE 2.18

Are the Residuals Approximately Normal?

CASE 2.1 Figure 2.16 gives the Normal quantile plot for the residuals in our education spending example. The distribution of the residuals is not Normal. Most of the points are close to a line in the center of the plot, but there appear to be five outliers—one with a negative residual and four with positive residuals.

FIGURE 2.16

Normal quantile plot of the residuals for the education spending regression, Example 2.18.

Take a look at the plot of the data with the least-squares line in Figure 2.2 (page 67). Note that you can see the same four points in this plot. If we eliminated these states from our data set, the remaining residuals would be approximately Normal. On the other hand, there is nothing wrong with the data for these four states. A complete analysis of the data should include a statement that they are somewhat extreme relative to the distribution of the other states.

Influential observations

influential

In the scatterplot of spending on education versus population in Figure 2.12 (page 87), California, Texas, Florida, and New York have somewhat higher values for both variables than the other 46 states. This could be of concern if these cases distort the least-squares regression line. A case that has a big effect on a numerical summary is called influential.

EXAMPLE 2.19

Is California Influential?

CASE 2.1 To answer this question, we compare the regression lines with and without California. The result is in Figure 2.17. The two lines are very close, so we conclude that California is not influential with respect to the least-squares slope and intercept.

FIGURE 2.17

Two least-squares lines for the education spending data, Example 2.19. The solid line is calculated using all of the data. The dashed line leaves out the data for California. The two lines are very similar, so we conclude that California is not influential.

Let’s think about a situation in which California would be influential on the least-squares regression line. California’s spending on education is $110.1 million. This case is close to both least-squares regression lines in Figure 2.17. Suppose California’s spending was much less than $110.1 million. Would this case then become influential?

EXAMPLE 2.20

Suppose California Spent Half as Much?

CASE 2.1 What would happen if California spent about half of what was actually spent, say, $55 million? Figure 2.18 shows the two regression lines, with and without California. Here we see that the regression line changes substantially when California is removed. Therefore, in this setting we would conclude that California is very influential.

FIGURE 2.18

Two least-squares lines for the education spending data with the California education spending changed to $55 million, Example 2.20. The solid line is calculated using all of the data. The dashed line leaves out the data for California, which is influential here. California pulls the least-squares regression line toward it.

Outliers and Influential Cases in Regression

An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction of a scatterplot have large regression residuals, but other outliers need not have large residuals.

A case is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are extreme in the x direction of a scatterplot are often influential for the least-squares regression line.

APPLY YOUR KNOWLEDGE

Question 2.53

The influence of Texas

CASE 2.1 Make a plot similar to Figure 2.17 giving regression lines with and without Texas. Summarize what this plot describes.

California, Texas, Florida, and New York are somewhat unusual and might be considered outliers. However, these cases are not influential with respect to the least-squares regression line.

Influential cases may have small residuals because they pull the regression line toward themselves. That is, you can’t always rely on residuals to point out influential observations. Influential observations can change the interpretation of data. For a linear regression, we compute a slope, an intercept, and a correlation. An individual observation can be influential for one or more of these quantities.

EXAMPLE 2.21

Effects on the Correlation

CASE 2.1 The correlation between the spending on education and population for the 50 states is r = 0.98. If we drop California, it decreases to 0.97. We conclude that California is not influential on the correlation.
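The with-and-without comparison used in Examples 2.19 through 2.21 is straightforward to automate. Because the education spending data are not reproduced here, this Python sketch applies the same idea to the Example 2.9 data, leaving out Switzerland (the case with the largest GDP per capita). The choice of case and the helper names are assumptions made for illustration.

```python
from statistics import mean, stdev

# Data from Example 2.9 (both variables in thousands of dollars per capita).
countries = ["United Kingdom", "Australia", "United States", "Singapore", "Canada",
             "Switzerland", "Netherlands", "Japan", "Denmark", "France",
             "Germany", "Belgium", "Sweden", "Spain", "Ireland"]
gdp = [43.8, 47.4, 47.9, 40.0, 45.4, 67.4, 52.0, 38.6,
       62.6, 46.0, 44.7, 47.1, 52.8, 35.3, 61.8]
assets = [199, 166, 191, 168, 170, 358, 242, 176,
          224, 149, 145, 167, 169, 152, 214]


def fit(x, y):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar, s_x, s_y = mean(x), mean(y), stdev(x), stdev(y)
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    return y_bar - slope * x_bar, slope, r


def drop(values, index):
    """Copy of the list with one case removed."""
    return values[:index] + values[index + 1:]


i = countries.index("Switzerland")
intercept_all, slope_all, r_all = fit(gdp, assets)
intercept_without, slope_without, r_without = fit(drop(gdp, i), drop(assets, i))

print(f"all 15 cases:        y-hat = {intercept_all:.2f} + {slope_all:.2f}x, r = {r_all:.2f}")
print(f"without Switzerland: y-hat = {intercept_without:.2f} + {slope_without:.2f}x, r = {r_without:.2f}")
```

Comparing the two printed lines shows how much a single case that is extreme in the x direction can move the slope, the intercept, and the correlation.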

The best way to grasp the important idea of influence is to use an interactive animation that allows you to move points on a scatterplot and observe how correlation and regression respond. The Correlation and Regression applet on the text website allows you to do this. Exercises 2.73 and 2.74 later in the chapter guide the use of this applet.


SECTION 2.3 Summary
