OBJECTIVES By the end of this section, I will be able to …
So far, most of our work has looked at ways to describe only one quantitative variable at a time. But there may exist a relationship between two quantitative variables (for example, height and weight) that we want to graph or quantify. We may also want to use the value of one variable, say, height, to predict the value of the other variable, weight. In Section 4.1, we explore scatterplots, which are graphs of the relationship between two quantitative variables, and we learn about correlation, which quantifies this relationship.
1 Scatterplots
Whenever you are examining the relationship between two quantitative variables, your best bet is to start with a scatterplot. A scatterplot is used to summarize the relationship between two quantitative variables that have been measured on the same element. An example of a scatterplot is given in Figure 1.
Note: The predictor variable and response variable are sometimes referred to as the independent variable and dependent variable, respectively. This textbook avoids this terminology because it may be confused with the definition of independent and dependent events and variables in probability (Chapter 5) and categorical data analysis (Chapter 11).
A scatterplot is a graph of points (x, y), each of which represents one observation from the data set. One of the variables is measured along the horizontal axis and is called the x variable. The other variable is measured along the vertical axis and is called the y variable.
Often, the value of the x variable can be used to predict or estimate the value of the y variable. For this reason, the x variable is referred to as the predictor variable, and the y variable is called the response variable. We also say that the value of the response variable depends on the value of the predictor variable.
EXAMPLE 1 Predictor variables and response variables
For the following pairs of variables, identify which is the predictor variable and which is the response variable:
Solution
NOW YOU CAN DO
Exercises 9–12.
YOUR TURN #1
For the following variables, identify which is the predictor variable and which is the response variable: the number of hours spent studying for an exam, and the grade on the exam.
(The solution is shown in Appendix A.)
EXAMPLE 2 Constructing a scatterplot
sqrfootsale
Suppose you are interested in moving to Glen Ellyn, Illinois, and want to purchase a lot upon which to build a new house. Table 1 contains a random sample of eight lots for sale in Glen Ellyn, with their square footage and prices.
Lot | Square footage (100s of sq ft) | Sales price ($1000s) |
---|---|---|
Harding St. | 75 | 155 |
Newton Ave. | 125 | 210 |
Stacy Ct. | 125 | 290 |
Eastern Ave. | 175 | 360 |
Second St. | 175 | 250 |
Sunnybrook Rd. | 225 | 450 |
Ahlstrand Rd. | 225 | 530 |
Eastern Ave. | 275 | 635 |
Note: The square footage is expressed in 100s of square feet, so that “90” represents 9,000 square feet. Similarly, the sales price is expressed in $1000s, so that, for example, the Harding St. price of “155” represents $155,000.
Solution
From this scatterplot, we can see that larger lots tend to have higher prices. This is not the case for each observation. For example, the Second Street property is larger than the Stacy Court property, but it has a lower price. Nevertheless, the overall tendency remains.
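The claim that the tendency holds overall, though not for every pair of lots, can be checked directly against Table 1. The following is a minimal Python sketch, not part of the textbook: the variable names and the helper `discordant_pairs` are illustrative assumptions, and the "(2)" label on the second Eastern Ave. lot is added only to tell the two lots apart. It lists every pair of lots in which the larger lot has the lower price.

```python
from itertools import combinations

# Table 1: (lot, square footage in 100s of sq ft, sales price in $1000s).
lots = [
    ("Harding St.", 75, 155),
    ("Newton Ave.", 125, 210),
    ("Stacy Ct.", 125, 290),
    ("Eastern Ave.", 175, 360),
    ("Second St.", 175, 250),
    ("Sunnybrook Rd.", 225, 450),
    ("Ahlstrand Rd.", 225, 530),
    ("Eastern Ave. (2)", 275, 635),
]

def discordant_pairs(data):
    """Return (larger lot, smaller lot) pairs in which the larger lot is cheaper."""
    pairs = []
    for a, b in combinations(data, 2):
        if a[1] == b[1]:
            continue  # same size: the pair says nothing about the tendency
        # Order the two lots so that `big` has the larger square footage.
        big, small = (a, b) if a[1] > b[1] else (b, a)
        if big[2] < small[2]:
            pairs.append((big[0], small[0]))
    return pairs

exceptions = discordant_pairs(lots)
print(exceptions)
```

With these eight lots, the only exception found is the Second St. and Stacy Ct. pair mentioned above; every other pair of differently sized lots follows the overall tendency.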
NOW YOU CAN DO
Exercises 13a–20a.
YOUR TURN #2
Measuring the Human Body
Table 2 contains the heights in inches and weights in pounds of the first eight women in the body_females Case Study data set. Do the following:
Height (in.) | Weight (lb) |
---|---|
63.5 | 113.8 |
65.9 | 130.1 |
62.8 | 108.5 |
61.8 | 138.9 |
61.3 | 118.2 |
66.9 | 130.1 |
62.6 | 104.9 |
65.4 | 153.9 |
(The solutions are shown in Appendix A.)
Developing Your Statistical Sense
Scatterplot Terminology
Note the terminology in the caption to Figure 1. When describing a scatterplot, always indicate the y variable first, and then use the term versus (vs.) or against the x variable. This terminology reinforces the notion that the y variable depends on the x variable.
The relationship between two quantitative variables can take many different forms. We illustrate four of the most common relationships.
Note the phrase, “as x increases in value …”. When interpreting scatterplots, we always move from left to right.
No apparent relationship. Figure 4 shows that no apparent relationship exists between the height of people who purchase used cars and the cost of the used car.
Nonlinear relationship. Figure 5 illustrates an example of a nonlinear relationship. When there is either too little or too much salad dressing, the tastiness of a salad can be lower than when a moderate amount of salad dressing is used. Thus, in this case, as the amount of salad dressing increases, at first the tastiness also tends to increase, but then it tends to decrease as too much salad dressing is applied. This is only one example of many different types of nonlinear relationships.
EXAMPLE 3 Characterize the relationship between two variables using a scatterplot
Using Figure 1 on page 189, characterize the relationship between lot square footage and lot price.
Solution
The scatterplot in Figure 1 most resembles Figure 2 on page 191, where a positive linear relationship exists between the variables. Thus, smaller lot sizes tend to be associated with lower prices, and larger lot sizes tend to be associated with higher prices. Put another way, as the lot size increases, the lot price also tends to increase.
NOW YOU CAN DO
Exercises 13b–20b and 21–26.
YOUR TURN #3
Measuring the Human Body
Characterize the relationship between height and weight, using the scatterplot you constructed from Table 2.
(The solution is shown in Appendix A.)
2 Correlation Coefficient
Scatterplots provide a visual description of the relationship between two quantitative variables. The correlation coefficient is a numerical measure for quantifying the linear relationship between two quantitative variables. Table 3 contains the low and high temperatures in degrees Fahrenheit (°F) for five American cities on a particular day. The variables are $x$ = low temperature and $y$ = high temperature. Applying what we have just learned, we construct a scatterplot of the data set, which is presented in Figure 6.
Figure 6 shows us that a positive relationship exists between the high temperature and the low temperature of a city. That is, colder low temperatures are associated with colder high temperatures. Warmer low temperatures are associated with warmer high temperatures. In this section, we seek to quantify this relationship between two numerical variables, using the correlation coefficient $r$. The correlation coefficient (sometimes known as the Pearson product moment correlation coefficient) measures the strength and direction of the linear relationship between two variables. By linear, we mean straight line. The correlation coefficient does not measure the strength of a curved relationship between two variables.
City | Low temperature $x$ (°F) | High temperature $y$ (°F) |
---|---|---|
Boston | 30 | 50 |
Chicago | 35 | 55 |
Philadelphia | 40 | 70 |
Washington, DC | 45 | 65 |
Dallas | 50 | 80 |
The correlation coefficient $r$ measures the strength and direction of the linear relationship between two variables. The correlation coefficient is

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

where $s_x$ is the sample standard deviation of the $x$ data values, and $s_y$ is the sample standard deviation of the $y$ data values.
EXAMPLE 4 Calculating the correlation coefficient
highlowtemp
Find the value of the correlation coefficient $r$ for the temperature data in Table 3.
Solution
We will outline the steps used in calculating the value of $r$ using the temperature data.
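City | $x$ | $y$ | $x - \bar{x}$ | $(x - \bar{x})^2$ | $y - \bar{y}$ | $(y - \bar{y})^2$ | $(x - \bar{x})(y - \bar{y})$ |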
---|---|---|---|---|---|---|---|
Boston | 30 | 50 | −10 | 100 | −14 | 196 | 140 |
Chicago | 35 | 55 | −5 | 25 | −9 | 81 | 45 |
Philadelphia | 40 | 70 | 0 | 0 | 6 | 36 | 0 |
Washington, DC | 45 | 65 | 5 | 25 | 1 | 1 | 5 |
Dallas | 50 | 80 | 10 | 100 | 16 | 256 | 160 |
Note on Rounding: Whenever you calculate a quantity that will be needed for later calculations, do not round. Round only when you arrive at the final answer. Here, because the quantities $s_x$ and $s_y$ are used to calculate the correlation coefficient $r$, neither of them is rounded until the end of the calculation.
The correlation coefficient for the high and low temperatures is 0.9272.
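The calculation in Example 4 can be reproduced in a few lines of Python. This is a sketch, assuming $r$ is computed from the definition formula: the sum of the products of the deviations, divided by $(n - 1) s_x s_y$. The function name `correlation` and the variable names are hypothetical. In keeping with the Note on Rounding, nothing is rounded until the final step.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Correlation coefficient r computed from the definition formula."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    # Numerator: sum of the products of deviations (last column of Table 4).
    sum_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # stdev() is the sample standard deviation (it divides by n - 1).
    return sum_products / ((n - 1) * stdev(x) * stdev(y))

low = [30, 35, 40, 45, 50]    # x: low temperatures (°F), Table 3
high = [50, 55, 70, 65, 80]   # y: high temperatures (°F), Table 3

r = correlation(low, high)
print(round(r, 4))  # 0.9272
```

The sum of the last column of Table 4 is 350, and the sums of the squared deviations are 250 and 570, so the same value can be checked by hand as $350 / \sqrt{250 \cdot 570}$.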
NOW YOU CAN DO
Exercises 13c–20c.
YOUR TURN #4
Measuring the Human Body
Use Steps 1–4 to calculate the correlation coefficient between height and weight for the data in Table 2.
(The solution is shown in Appendix A.)
What Does This Formula Mean?
The Correlation Coefficient
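Let's analyze the definition formula for the correlation coefficient $r$. When would $r$ be positive, and when would it be negative? We see that the formula

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

consists of a ratio. Because the denominator $(n - 1)\, s_x s_y$ is always positive, the sign of $r$ is determined by the numerator, the sum of the products $(x - \bar{x})(y - \bar{y})$.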
Four cases (or regions, which are illustrated in Figure 7) describe when the product $(x - \bar{x})(y - \bar{y})$ will be positive or negative, as shown in Table 5.
Region | $x - \bar{x}$ | $y - \bar{y}$ | Product $(x - \bar{x})(y - \bar{y})$ |
---|---|---|---|
1 | Positive | Positive | Positive |
2 | Negative | Positive | Negative |
3 | Negative | Negative | Positive |
4 | Positive | Negative | Negative |
Let's explore how our high and low temperature data fit into the above framework. The mean low temperature is $\bar{x} = 40$, whereas the mean high temperature is $\bar{y} = 64$. We find the point $(\bar{x}, \bar{y}) = (40, 64)$ in our scatterplot of the high and low temperatures, draw the lines $x = 40$ and $y = 64$, and mark out our four regions, as shown in Figure 7. Note that four of the five data points fall in Regions 1 and 3, with the fifth falling exactly on a boundary line. Therefore, we expect the value of $r$ for this data set to be positive, which is indeed the case, because we observed $r = 0.9272$ in Example 4.
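The region-by-region reasoning can be checked with a short Python sketch. The function name `region` and the convention of returning 0 for a point on a boundary line are assumptions made for illustration. The region numbers follow Table 5, reading its first sign column as the deviation of the low temperature from its mean and its second as the deviation of the high temperature.

```python
from statistics import mean

def region(x, y, x_bar, y_bar):
    """Region number (1-4, as in Table 5) of the point (x, y); 0 if on a boundary."""
    dx, dy = x - x_bar, y - y_bar
    if dx == 0 or dy == 0:
        return 0   # on one of the two mean lines: the product is zero
    if dx > 0 and dy > 0:
        return 1   # both deviations positive, product positive
    if dx < 0 and dy > 0:
        return 2   # product negative
    if dx < 0 and dy < 0:
        return 3   # both deviations negative, product positive
    return 4       # dx > 0 and dy < 0, product negative

# Table 3: city -> (low temperature, high temperature) in °F.
cities = {
    "Boston": (30, 50),
    "Chicago": (35, 55),
    "Philadelphia": (40, 70),
    "Washington, DC": (45, 65),
    "Dallas": (50, 80),
}

x_bar = mean([low for low, _ in cities.values()])    # mean low temperature
y_bar = mean([high for _, high in cities.values()])  # mean high temperature
regions = {city: region(low, high, x_bar, y_bar)
           for city, (low, high) in cities.items()}
print(x_bar, y_bar)
print(regions)
```

Boston and Chicago land in Region 3, Washington, DC and Dallas land in Region 1, and Philadelphia sits on a boundary line because its low temperature equals the mean low temperature. No point contributes a negative product, which is consistent with a positive correlation coefficient.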
Next, we outline the properties of the correlation coefficient $r$.
If your calculations give you a value of $r$ outside the range from −1 to 1, try it again.
Properties of the correlation coefficient
The correlation coefficient $r$ always takes on values between −1 and 1, inclusive.
That is, $-1 \le r \le 1$.
As $x$ increases, $y$ tends to increase.
Figure 9 repeats Figure 1, the scatterplot of sales price versus square footage, for which . Figure 10 repeats Figure 2, the scatterplot of height and weight of middle school children, for which .
As $x$ increases, $y$ tends to decrease.
Figure 12 repeats Figure 3, the scatterplot of the cost of used cars versus their age, for which . Figure 13 shows a scatterplot of , and , for a trip combining city and highway travel. The correlation is .
Figure 14 repeats Figure 4, the scatterplot of and , for which .
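A few small data sets make these properties concrete. The sketch below uses the definition formula for the correlation coefficient; the three data sets and the function name `correlation` are invented for illustration and do not come from the textbook. Points lying exactly on an increasing line give a correlation of 1, points lying exactly on a decreasing line give −1, and a symmetric curved relationship can give 0 even though one variable is completely determined by the other.

```python
from math import isclose
from statistics import mean, stdev

def correlation(x, y):
    """Correlation coefficient r computed from the definition formula."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sum_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sum_products / ((n - 1) * stdev(x) * stdev(y))

x = [1, 2, 3, 4, 5]
rising = [2 * xi + 1 for xi in x]      # points exactly on an increasing line
falling = [10 - 3 * xi for xi in x]    # points exactly on a decreasing line
curved = [(xi - 3) ** 2 for xi in x]   # symmetric U shape centered at x = 3

r_rising = correlation(x, rising)
r_falling = correlation(x, falling)
r_curved = correlation(x, curved)

# Compare with a tolerance, because floating-point results may differ
# from 1 and -1 in the last digit.
print(isclose(r_rising, 1.0), isclose(r_falling, -1.0), r_curved)
```

The curved data set is a reminder that a correlation near 0 signals the absence of a linear relationship, not the absence of any relationship.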
EXAMPLE 5 Interpreting the correlation coefficient
Interpret the value of the correlation coefficient $r$ found in Example 4.
Solution
In Example 4, we found the correlation coefficient for the relationship between high and low temperatures to be $r = 0.9272$. This value of $r$ is very close to the maximum value $r = 1$. We would therefore say that high and low temperatures for these five American cities are strongly positively correlated. As low temperature increases, high temperature also tends to increase.
NOW YOU CAN DO
Exercises 13d–20d.
YOUR TURN #5
Measuring the Human Body
Interpret the value of the correlation coefficient you found for the data in Table 2.
(The solution is shown in Appendix A.)
Developing Your Statistical Sense
Note: The Correlation and Regression applet allows you to insert your own data values and see how the regression line changes.
Correlation Is Not Causation
If we conclude that two variables are correlated, it does not necessarily follow that one variable causes the other to occur. For example, in the late 1940s, before the development of a vaccine for the disease polio, analysts noticed a strong correlation between the amount of ice cream consumed nationwide and the number of new polio cases. Some doctors went on to recommend eliminating ice cream as a way to fight polio. But did ice cream really cause polio? No. Ice cream consumption and polio outbreaks both peaked in the hot summer months, and so were correlated seasonally. Ice cream did not cause polio. After the development of the polio vaccine by Jonas Salk in the 1950s, the disease disappeared from most countries in the world.