4.1 Scatterplots and Correlation

OBJECTIVES By the end of this section, I will be able to …

  1. Construct and interpret scatterplots for two quantitative variables.
  2. Calculate and interpret the correlation coefficient.

So far, most of our work has looked at ways to describe only one quantitative variable at a time. But there may exist a relationship between two quantitative variables (for example, height and weight) that we want to graph or quantify. We may also want to use the value of one variable, say, height, to predict the value of the other variable, weight. In Section 4.1, we explore scatterplots, which are graphs of the relationship between two quantitative variables, and we learn about correlation, which quantifies this relationship.

1 Scatterplots

Whenever you are examining the relationship between two quantitative variables, your best bet is to start with a scatterplot. A scatterplot is used to summarize the relationship between two quantitative variables that have been measured on the same element. An example of a scatterplot is given in Figure 1.

A scatterplot is a graph of points (x, y), each of which represents one observation from the data set. One of the variables is measured along the horizontal axis and is called the x variable. The other variable is measured along the vertical axis and is called the y variable.

Often, the value of the x variable can be used to predict or estimate the value of the y variable. For this reason, the x variable is referred to as the predictor variable, and the y variable is called the response variable. We also say that the value of the response variable depends on the value of the predictor variable.

Note: The predictor variable and response variable are sometimes referred to as the independent variable and dependent variable, respectively. This textbook avoids this terminology because it may be confused with the definition of independent and dependent events and variables in probability (Chapter 5) and categorical data analysis (Chapter 11).

EXAMPLE 1 Predictor variables and response variables

For the following pairs of variables, identify which is the predictor (x) variable and which is the response (y) variable:

  1. The cost of an engagement ring, and the size of the diamond (in carats)
  2. The heights of primary school children, and their ages

Solution

  1. The cost of an engagement ring depends in part on the size of the diamond. We can use the size of the diamond to predict the cost of the ring. Thus, the diamond size is the predictor (x) variable, and the cost is the response (y) variable.
  2. The response variable depends on the predictor variable, but a child's age depends on nothing but the calendar, so age cannot be the response variable. Age must therefore be the predictor (x) variable, with height as the response (y) variable.

NOW YOU CAN DO

Exercises 9–12.

Page 189

YOUR TURN #1

For the following variables, identify which is the predictor (x) variable and which is the response (y) variable: the number of hours spent studying for an exam, and the grade on the exam.

(The solution is shown in Appendix A.)

EXAMPLE 2 Constructing a scatterplot

Data set: sqrfootsale

Suppose you are interested in moving to Glen Ellyn, Illinois, and want to purchase a lot upon which to build a new house. Table 1 contains a random sample of eight lots for sale in Glen Ellyn, with their square footage and prices.

  1. Identify the predictor variable and the response variable.
  2. Construct a scatterplot.

Table 1 Lot square footage and sales price
Lot               x = square footage (100s of sq. ft.)    y = sales price ($1000s)
Harding St.       75                                      155
Newton Ave.       125                                     210
Stacy Ct.         125                                     290
Eastern Ave.      175                                     360
Second St.        175                                     250
Sunnybrook Rd.    225                                     450
Ahlstrand Rd.     225                                     530
Eastern Ave.      275                                     635

Note: The square footage is expressed in 100s of square feet, so that “90” represents 90×100=9000 square feet. Similarly, the sales price is expressed in $1000s, so that “200” represents 200×$1000=$200,000.

Solution

  1. It is reasonable to expect that the price of a new lot depends in part on the size of the lot. Thus, we define our predictor variable x to be x=square footage and our response variable y to be y=sales price.
  2. Next, we construct the scatterplot using the data from Table 1. Draw the horizontal axis so that it can contain all the values of the predictor (x) variable, and similarly for the vertical axis. Then, at each data point (x,y), draw a dot. For example, for the Harding Street lot, move along the x axis to 75, then go up until you reach a spot level with y=155, at which point you draw a dot. Proceed similarly for the other seven properties. The result should look similar to the scatterplot in Figure 1.
    FIGURE 1 Scatterplot of sales price versus square footage.
Page 190

From this scatterplot, we can see that larger lots tend to have higher prices. This is not the case for each observation. For example, the Second Street property is larger than the Stacy Court property, but it has a lower price. Nevertheless, the overall tendency remains.
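
The construction in Example 2 can also be sketched in code. The following Python snippet is a minimal illustration, not part of the textbook's method: it places one asterisk per lot on a coarse text grid, using the data from Table 1. The function name text_scatterplot and the grid size are assumptions chosen for illustration, and the sketch assumes the data contain at least two distinct x values and two distinct y values.

```python
# Table 1 data: (lot, square footage in 100s of sq. ft., sales price in $1000s)
lots = [
    ("Harding St.", 75, 155),
    ("Newton Ave.", 125, 210),
    ("Stacy Ct.", 125, 290),
    ("Eastern Ave.", 175, 360),
    ("Second St.", 175, 250),
    ("Sunnybrook Rd.", 225, 450),
    ("Ahlstrand Rd.", 225, 530),
    ("Eastern Ave.", 275, 635),
]


def text_scatterplot(points, width=41, height=13):
    """Return a list of text rows with one '*' per (x, y) point.

    Hypothetical helper: x is scaled onto the columns (left to right)
    and y onto the rows (bottom to top).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        # Row 0 is the top of the plot, so large y values get small row numbers.
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return ["".join(row) for row in grid]


points = [(x, y) for _, x, y in lots]
rows = text_scatterplot(points)

print("y = sales price ($1000s)")
for row in rows:
    print("|" + row)
print("+" + "-" * len(rows[0]))
print("x = square footage (100s of sq. ft.)")
```

The asterisks rise from the lower left to the upper right, which matches the tendency described above: larger lots tend to have higher prices.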

NOW YOU CAN DO

Exercises 13a–20a.

YOUR TURN #2

Measuring the Human Body

Table 2 contains the heights in inches and weights in pounds of the first eight women in the body_females Case Study data set. Do the following:

  1. Identify the predictor variable and the response variable.
  2. Construct a scatterplot.

Table 2 Heights and weights of eight women
x = Height (inches)    y = Weight (pounds)
63.5                   113.8
65.9                   130.1
62.8                   108.5
61.8                   138.9
61.3                   118.2
66.9                   130.1
62.6                   104.9
65.4                   153.9

(The solutions are shown in Appendix A.)

Developing Your Statistical Sense

Scatterplot Terminology

Note the terminology in the caption to Figure 1. When describing a scatterplot, always name the y variable first, followed by the term versus (vs.) or against, and then the x variable. This terminology reinforces the notion that the y variable depends on the x variable.

The relationship between two quantitative variables can take many different forms. We illustrate four of the most common relationships.

Note the phrase, “as x increases in value …”. When interpreting scatterplots, we always move from left to right.

EXAMPLE 3 Characterize the relationship between two variables using a scatterplot

Using Figure 1 on page 189, characterize the relationship between lot square footage and lot price.

Solution

The scatterplot in Figure 1 most resembles Figure 2 on page 191, where a positive linear relationship exists between the variables. Thus, smaller lot sizes tend to be associated with lower prices, and larger lot sizes tend to be associated with higher prices. Put another way, as the lot size increases, the lot price also tends to increase.

NOW YOU CAN DO

Exercises 13b–20b and 21–26.

YOUR TURN #3

Measuring the Human Body

Characterize the relationship between height and weight, using the scatterplot you constructed from Table 2.

(The solution is shown in Appendix A.)

2 Correlation Coefficient r

Scatterplots provide a visual description of the relationship between two quantitative variables. The correlation coefficient is a numerical measure for quantifying the linear relationship between two quantitative variables. Table 3 contains the low and high temperatures in degrees Fahrenheit (°F) for five American cities on a particular day. The variables are x=low temperature and y=high temperature. Applying what we have just learned, we construct a scatterplot of the data set, which is presented in Figure 6.

Figure 6 shows us that a positive relationship exists between the high temperature and the low temperature of a city. That is, colder low temperatures are associated with colder high temperatures. Warmer low temperatures are associated with warmer high temperatures. In this section, we seek to quantify this relationship between two numerical variables, using the correlation coefficient r. The correlation coefficient r (sometimes known as the Pearson product moment correlation coefficient) measures the strength and direction of the linear relationship between two variables. By linear, we mean straight line. The correlation coefficient does not measure the strength of a curved relationship between two variables.

Page 193
Table 3 Low and high temperatures, in degrees Fahrenheit, of five American cities
City              x = low temperature    y = high temperature
Boston            30                     50
Chicago           35                     55
Philadelphia      40                     70
Washington, DC    45                     65
Dallas            50                     80
FIGURE 6 Scatterplot of high versus low temperatures for five American cities.

The correlation coefficient r measures the strength and direction of the linear relationship between two variables. The correlation coefficient r is

\[ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\,s_x s_y} \]

where \(s_x\) is the sample standard deviation of the x data values, and \(s_y\) is the sample standard deviation of the y data values.
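
As an aside, the same formula can be read as a sum of products of standardized values. The symbols \(z_x\) and \(z_y\) below are introduced here only for illustration and are assumed to mean \((x-\bar{x})/s_x\) and \((y-\bar{y})/s_y\); they are not part of the definition above.

```latex
r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\,s_x s_y}
  = \frac{1}{n-1} \sum \left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)
  = \frac{1}{n-1} \sum z_x z_y
```

Read this way, an observation adds to r when its x and y values lie on the same side of their respective means, and subtracts from r when they lie on opposite sides.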

EXAMPLE 4 Calculating the correlation coefficient r

Data set: highlowtemp

Find the value of the correlation coefficient r for the temperature data in Table 3.

Solution

We will outline the steps used in calculating the value of r using the temperature data.

  • Step 1 Calculate the respective sample means, \(\bar{x}\) and \(\bar{y}\).

    \[ \bar{x} = \frac{\sum x}{n} = 40, \qquad \bar{y} = \frac{\sum y}{n} = 64 \]

  • Step 2 Construct a table, as shown here in Table 4.
    Page 194
    Table 4 Calculation table for the correlation coefficient r
    City            x   y   \((x-\bar{x})\)    \((x-\bar{x})^2\)    \((y-\bar{y})\)    \((y-\bar{y})^2\)    \((x-\bar{x})(y-\bar{y})\)
    Boston          30  50  −10                100                  −14                196                  140
    Chicago         35  55  −5                 25                   −9                 81                   45
    Philadelphia    40  70  0                  0                    6                  36                   0
    Washington, DC  45  65  5                  25                   1                  1                    5
    Dallas          50  80  10                 100                  16                 256                  160
    Sums: \(\sum (x-\bar{x})^2 = 250\), \(\sum (y-\bar{y})^2 = 570\), \(\sum (x-\bar{x})(y-\bar{y}) = 350\)

    Note on Rounding: Whenever you calculate a quantity that will be needed for later calculations, do not round. Round only when you arrive at the final answer. Here, because the quantities \(s_x\) and \(s_y\) are used to calculate the correlation coefficient r, neither of them is rounded until the end of the calculation.

  • Step 3 Calculate the respective sample standard deviations \(s_x\) and \(s_y\). Using the sums calculated from Table 4, we have

    \[ s_x = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} = \sqrt{\frac{250}{5-1}} \approx 7.90569415 \quad \text{and} \quad s_y = \sqrt{\frac{\sum (y-\bar{y})^2}{n-1}} = \sqrt{\frac{570}{5-1}} \approx 11.93733639 \]

  • Step 4 Put these values all together in the formula for the correlation coefficient r:

    \[ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\,s_x s_y} = \frac{350}{(4)(7.90569415)(11.93733639)} \approx 0.92717265 \approx 0.9272 \]

The correlation coefficient r for the high and low temperatures is 0.9272.
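
The four steps of Example 4 can be sketched in Python as follows. This is a minimal illustration rather than the textbook's own procedure; the variable names (low, high, sum_dxdy, and so on) are assumptions chosen for readability. In keeping with the Note on Rounding, nothing is rounded until the final line.

```python
from math import sqrt

# Table 3 data: x = low temperature, y = high temperature (degrees Fahrenheit)
low = [30, 35, 40, 45, 50]
high = [50, 55, 70, 65, 80]
n = len(low)

# Step 1: sample means
x_bar = sum(low) / n
y_bar = sum(high) / n

# Step 2: the three sums from the calculation table (Table 4)
sum_dx2 = sum((x - x_bar) ** 2 for x in low)
sum_dy2 = sum((y - y_bar) ** 2 for y in high)
sum_dxdy = sum((x - x_bar) * (y - y_bar) for x, y in zip(low, high))

# Step 3: sample standard deviations, left unrounded
s_x = sqrt(sum_dx2 / (n - 1))
s_y = sqrt(sum_dy2 / (n - 1))

# Step 4: correlation coefficient, rounded only for display
r = sum_dxdy / ((n - 1) * s_x * s_y)
print(round(r, 4))  # prints 0.9272
```

The intermediate values x_bar, y_bar, and the three sums can be compared directly against Step 1 and Table 4 to check a hand calculation.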

NOW YOU CAN DO

Exercises 13c–20c.

YOUR TURN #4

Measuring the Human Body

Use Steps 1–4 to calculate the correlation coefficient r between height and weight for the data in Table 2.

(The solution is shown in Appendix A.)

What Does This Formula Mean?

The Correlation Coefficient r

Let's analyze the definition formula for the correlation coefficient r. When would r be positive, and when would it be negative? We see that the formula

\[ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\,s_x s_y} \]

consists of a ratio.

  • Note that the denominator can never be negative because it is the product of three non-negative values (standard deviations can never be negative). Therefore, the numerator determines whether r will be positive or negative.
  • We know that \(x-\bar{x}\) is positive whenever the data value x is greater than \(\bar{x}\), and it is negative when x is less than \(\bar{x}\). This relationship is similar for \(y-\bar{y}\).
Page 195

Four cases (or regions, which are illustrated in Figure 7) describe when the product \((x-\bar{x})(y-\bar{y})\) will be positive or negative, as shown in Table 5.

Table 5 When the product \((x-\bar{x})(y-\bar{y})\) will be positive or negative
Region    \((x-\bar{x})\)    \((y-\bar{y})\)    \((x-\bar{x})(y-\bar{y})\)
1         Positive           Positive           Positive
2         Negative           Positive           Negative
3         Negative           Negative           Positive
4         Positive           Negative           Negative

  • If most of the data values fall in Regions 1 and 3, then r will tend to be positive.
  • If most of the data values fall in Regions 2 and 4, then r will tend to be negative.

Let's explore how our high and low temperature data fit into the above framework. The mean low temperature is \(\bar{x} = 40\)°F, whereas the mean high temperature is \(\bar{y} = 64\)°F. We find the point \((\bar{x}, \bar{y}) = (40, 64)\) in our scatterplot of the high and low temperatures, draw the lines \(x = \bar{x} = 40\) and \(y = \bar{y} = 64\), and mark out our four regions, as shown in Figure 7. Note that four of the five data points fall in Regions 1 and 3, with the fifth falling exactly on a boundary line. Therefore, we expect the value of r for this data set to be positive, which is indeed the case, because we observed r=0.9272 in Example 4.

FIGURE 7 Nearly all of the temperature data points lie in Regions 1 and 3, making r positive.
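
The region argument can be checked with a short Python sketch. It assigns each city in Table 3 to one of the four regions of Table 5 by looking at the signs of the two deviations. The function name region and the label "boundary" for points lying on one of the lines through the means are assumptions made for this illustration.

```python
# Table 3 data: (city, low temperature, high temperature), in degrees Fahrenheit
temps = [
    ("Boston", 30, 50),
    ("Chicago", 35, 55),
    ("Philadelphia", 40, 70),
    ("Washington, DC", 45, 65),
    ("Dallas", 50, 80),
]

x_bar = sum(low for _, low, _ in temps) / len(temps)    # mean low temperature
y_bar = sum(high for _, _, high in temps) / len(temps)  # mean high temperature


def region(x, y):
    """Return the region number (1 to 4) from Table 5, or 'boundary'."""
    dx, dy = x - x_bar, y - y_bar
    if dx == 0 or dy == 0:
        return "boundary"  # the product of the deviations is zero here
    if dx > 0:
        return 1 if dy > 0 else 4
    return 2 if dy > 0 else 3


regions = {city: region(low, high) for city, low, high in temps}
for city, where in regions.items():
    print(city, where)
```

Boston and Chicago land in Region 3, Washington, DC and Dallas land in Region 1, and Philadelphia sits on the boundary because its low temperature equals the mean low temperature.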

Next, we outline the properties of the correlation coefficient r.

Page 196

Properties of the correlation coefficient r

  1. The correlation coefficient r always takes on values between −1 and 1, inclusive.

    That is, \(-1 \le r \le 1\). If your calculations give you a value of r outside this range, try it again.

  2. When r=+1, a perfect positive relationship exists between x and y. Figure 8 illustrates the perfect positive relationship between x=number of hours worked at a part-time job, and y=the income from that job at $15 per hour.
    FIGURE 8 Perfect positive relationship between x=hours worked and y=income.
  3. Positive values of r indicate a positive relationship between x and y (Figures 9 and 10):
    • The closer r gets to +1, the stronger the evidence for a positive relationship.
    • The variables are said to be positively correlated.
    • As x increases, y tends to increase.

      Figure 9 repeats Figure 1, the scatterplot of sales price versus square footage, for which r=0.943. Figure 10 repeats Figure 2, the scatterplot of height and weight of middle school children, for which r=0.597.

    FIGURE 9 r=0.943 for x=square footage and y=sales price.
    FIGURE 10 r=0.597 for x=height and y=weight.
  4. When r=−1, a perfect negative relationship exists between x and y. Figure 11 illustrates the perfect negative relationship between x=the number of $100 ATM withdrawals from a bank account, and y=the account balance.
    Page 197
    FIGURE 11 Perfect negative relationship between x=number of withdrawals and y=account balance.
  5. Negative values of r indicate a negative relationship between x and y (Figures 12 and 13):
    • The closer r gets to −1, the stronger the evidence for a negative relationship.
    • The variables are said to be negatively correlated.
    • As x increases, y tends to decrease.

      Figure 12 repeats Figure 3, the scatterplot of the cost of used cars versus their age, for which r=−0.732. Figure 13 shows a scatterplot of x=number of miles traveled on a tank of gas, and y=number of gallons of gas remaining, for a trip combining city and highway travel. The correlation is r=−0.998.

    FIGURE 12 r=−0.732 for x=vehicle age and y=cost.
    FIGURE 13 r=−0.998 for x=miles traveled and y=gas remaining.

  6. Values of r near 0 indicate that no linear relationship exists between x and y (Figure 14):
    • The closer r gets to 0, the weaker the evidence for a linear relationship.
    • The variables are not linearly correlated.
    • A nonlinear relationship may exist between x and y.
Page 198

Figure 14 repeats Figure 4, the scatterplot of x=the heights of car purchasers and y=the vehicle price, for which r=0.023.

FIGURE 14 r=0.023 for x=height of car purchaser, and y=cost of vehicle.
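
Several of these properties can be illustrated with a short Python sketch. The helper pearson_r is a hypothetical name that simply applies the formula for r given earlier. The hours worked, the starting balance of $1000, and the curved data set y = x² are made-up values chosen for illustration; only the lot data come from Table 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r computed from the definition formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return numerator / ((n - 1) * s_x * s_y)


# Property 2: income at $15 per hour is a perfect positive relationship.
hours = [5, 10, 15, 20]                      # assumed hours worked
income = [15 * h for h in hours]
r_income = pearson_r(hours, income)

# Property 3: the lot data from Table 1 are positively correlated.
sqft = [75, 125, 125, 175, 175, 225, 225, 275]
price = [155, 210, 290, 360, 250, 450, 530, 635]
r_lots = pearson_r(sqft, price)

# Property 4: each $100 withdrawal lowers the balance by the same amount.
withdrawals = [0, 1, 2, 3, 4]
balance = [1000 - 100 * w for w in withdrawals]  # assumed $1000 starting balance
r_balance = pearson_r(withdrawals, balance)

# Property 6: r can be 0 even though y is completely determined by x.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

for label, value in [("income", r_income), ("lots", r_lots),
                     ("balance", r_balance), ("curve", r_curve)]:
    print(label, round(value, 3))
```

The last case is worth noting: the curved data follow an exact rule, yet r is 0, because r measures only the linear relationship between the two variables.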

EXAMPLE 5 Interpreting the correlation coefficient

Interpret the value of the correlation coefficient found in Example 4.

Solution

In Example 4, we found the correlation coefficient for the relationship between high and low temperatures to be r=0.9272. This value of r is very close to the maximum value r=1. We would therefore say that high and low temperatures for these five American cities are strongly positively correlated. As low temperature increases, high temperature also tends to increase.

NOW YOU CAN DO

Exercises 13d–20d.

YOUR TURN #5

Measuring the Human Body

Interpret the value of the correlation coefficient you found for the data in Table 2.

(The solution is shown in Appendix A.)

Developing Your Statistical Sense

Note: The Correlation and Regression applet allows you to insert your own data values and see how the regression line changes.

Correlation Is Not Causation

If we conclude that two variables are correlated, it does not necessarily follow that one variable causes the other. For example, in the late 1940s, before the development of a vaccine for polio, analysts noticed a strong correlation between the amount of ice cream consumed nationwide and the number of new polio cases. Some doctors went on to recommend eliminating ice cream as a way to fight polio. But did ice cream really cause polio? No. Ice cream consumption and polio outbreaks both peaked in the hot summer months, and so were correlated seasonally. Ice cream did not cause polio. After the development of the polio vaccine by Jonas Salk in the 1950s, the disease disappeared from most countries in the world.

Page 199