2.3 Correlation

When you complete this section, you will be able to:

  • Use a correlation to describe the direction and strength of a linear relationship between two quantitative variables.

  • Interpret the sign of a correlation.

  • Identify situations in which the correlation is not a good measure of association between two quantitative variables.

  • Identify a linear pattern in a scatterplot.

  • For describing the relationship between two quantitative variables, identify the roles of the correlation, a numerical summary, and the scatterplot, a graphical summary.

A scatterplot displays the form, direction, and strength of the relationship between two quantitative variables. Linear (straight-line) relations are particularly important because a straight line is a simple pattern that is quite common. We say a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line. Our eyes are not good judges of how strong a relationship is. The two scatterplots in Figure 2.14 depict exactly the same data, but the plot on the right is drawn smaller in a large field. The plot on the right seems to show a stronger relationship.

Figure 2.14 Two scatterplots of the same data. The linear pattern in the plot on the right appears stronger because of the surrounding space.

Our eyes can be fooled by changing the plotting scales or the amount of white space around the cloud of points in a scatterplot.16 We need to follow our strategy for data analysis by using a numerical measure to supplement the graph. Correlation is the measure we use.

The correlation r

We have data on variables x and y for n individuals. Think, for example, of measuring height and weight for n people. Then $x_1$ and $y_1$ are your height and your weight, $x_2$ and $y_2$ are my height and my weight, and so on. For the ith individual, height $x_i$ goes with weight $y_i$. Here is the definition of correlation.

CORRELATION

The correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r.

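Suppose that we have data on variables x and y for n individuals. The means and standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values. The correlation r between x and y is

$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$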

As always, the summation sign means “add these terms for all the individuals.” The formula for the correlation r is a bit complex. It helps us see what correlation is but is not convenient for actually calculating r. In practice, you should use software or a calculator that computes r from the values of x and y pairs.

standardize, p. 59

The formula for r begins by standardizing the observations. Suppose, for example, that x is height in centimeters and y is weight in kilograms and that we have height and weight measurements for n people. Then $\bar{x}$ and $s_x$ are the mean and standard deviation of the n heights, both in centimeters. The value

$$\frac{x_i - \bar{x}}{s_x}$$

is the standardized height of the ith person. The standardized height says how many standard deviations above or below the mean a person’s height lies. Standardized values have no units; in this example, they are no longer measured in centimeters. You can standardize the weights also. The correlation r is an average (dividing by n − 1 rather than n) of the products of the standardized height and the standardized weight for the n people.
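The steps just described (standardize each variable, multiply the standardized values pair by pair, then divide the sum by n − 1) can be sketched in Python. This is a minimal illustration, not a replacement for statistical software; the function names `standardize` and `correlation` and the five height and weight values are hypothetical, chosen only to show the arithmetic.

```python
from statistics import mean, stdev


def standardize(values):
    """Return (value - mean) / standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divisor n - 1)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Sum of products of standardized values, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    zx = standardize(xs)
    zy = standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)


# Hypothetical heights (cm) and weights (kg) for five people; illustrative only.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 66, 70, 80]

r = correlation(heights_cm, weights_kg)
print(f"r = {r:.3f}")
```

Because the standardized values carry no units, the same function gives the same r whether heights are recorded in centimeters or in meters.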

USE YOUR KNOWLEDGE

2.38 Laundry detergents. Example 2.8 (page 85) describes data on the rating and price per load for 53 laundry detergents. Use these data to compute the correlation between rating and the price per load.

LAUNDRY

2.39 Change the units. Refer to the previous exercise. Express the price per load in dollars.

  (a) Is the transformation from cents to dollars a linear transformation? Explain your answer.

  (b) Compute the correlation between rating and price per load expressed in dollars.

  (c) How does the correlation that you computed in part (b) compare with the one you computed in the previous exercise?

  (d) What can you say in general about the effect of changing units using linear transformations on the size of the correlation?

LAUNDRY

Properties of correlation

The formula for correlation helps us see that r is positive when there is a positive association between the variables. Height and weight, for example, have a positive association. People who are above average in height tend to also be above average in weight. Both the standardized height and the standardized weight for such a person are positive. People who are below average in height tend also to have below-average weight. Then both standardized height and standardized weight are negative. In both cases, the products in the formula for r are mostly positive, so r is positive. In the same way, we can see that r is negative when the association between x and y is negative. More detailed study of the formula gives more detailed properties of r.
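To see the argument about signs in action, the sketch below prints the product of standardized height and standardized weight for each person. The data are hypothetical and the helper name `z_scores` is an assumption made for this illustration. One person is slightly above average in height but below average in weight, so that product is negative, yet the products are mostly positive and so is r.

```python
from statistics import mean, stdev

# Hypothetical heights (cm) and weights (kg); illustrative values only.
heights_cm = [158, 163, 171, 176, 182]
weights_kg = [52, 64, 61, 74, 82]


def z_scores(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# One product per individual: standardized height times standardized weight.
products = [zx * zy for zx, zy in zip(z_scores(heights_cm), z_scores(weights_kg))]
n_positive = sum(p > 0 for p in products)
r = sum(products) / (len(products) - 1)

for h, w, p in zip(heights_cm, weights_kg, products):
    print(f"height={h} cm  weight={w} kg  product={p:+.2f}")
print(f"{n_positive} of {len(products)} products are positive; r = {r:.2f}")
```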

Here is what you need to know to interpret correlation:

standardize, p. 30

image

The scatterplots in Figure 2.15 illustrate how values of r closer to 1 or −1 correspond to stronger linear relationships. To make the essential meaning of r clear, the standard deviations of both variables in these plots are equal, and the horizontal and vertical scales are the same. In general, it is not so easy to guess the value of r from the appearance of a scatterplot. Remember that changing the plotting scales in a scatterplot may mislead our eyes, but it does not change the standardized values of the variables and, therefore, cannot change the correlation. To explore how extreme observations can influence r, use the Correlation and Regression applet available on the text website. Also, see Exercises 2.56 and 2.57 (page 106).
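Two of the points in this paragraph can be checked numerically: rescaling a variable leaves r unchanged, while a single extreme observation can change r a great deal. The following is a rough sketch under stated assumptions; the data are made up for illustration, and the `correlation` helper is a hypothetical name, not part of the text or of any applet.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r computed from standardized values (divisor n - 1)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical data with a strong positive linear pattern.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]

r_original = correlation(x, y)

# Change the scale of x (for example, a change of units): r is unaffected.
r_rescaled = correlation([10 * v for v in x], y)

# Add one extreme observation far from the linear pattern: r drops sharply.
r_with_outlier = correlation(x + [7.0], y + [0.5])

print(f"original:       r = {r_original:.3f}")
print(f"x rescaled:     r = {r_rescaled:.3f}")
print(f"with one point: r = {r_with_outlier:.3f}")
```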

Figure 2.15 How the correlation r measures the direction and strength of a linear association.

Finally, remember that correlation is not a complete description of two-variable data, even when the relationship between the variables is linear. You should give the means and standard deviations of both x and y along with the correlation. (Because the formula for correlation uses the means and standard deviations, these measures are the proper choices to accompany a correlation.) Conclusions based on correlations alone may require rethinking in the light of a more complete description of the data.

EXAMPLE 2.18

Scoring of figure skating in the Olympics. Until a scandal at the 2002 Olympics brought change, figure skating was scored by judges on a scale from 0.0 to 6.0. The scores were often controversial. We have the scores awarded by two judges, Pierre and Elena, to many skaters. How well do they agree? We calculate that the correlation between their scores is r = 0.9. But the mean of Pierre’s scores is 0.8 point lower than Elena’s mean.

These facts in the example above do not contradict each other. They are simply different kinds of information. The mean scores show that Pierre awards lower scores than Elena. But because Pierre gives every skater a score about 0.8 point lower than Elena, the correlation remains high. Adding the same number to all values of either x or y does not change the correlation. If both judges score the same skaters, the competition is scored consistently because Pierre and Elena agree on which performances are better than others. The high r shows their agreement. But if Pierre scores some skaters and Elena others, we must add 0.8 point to Pierre’s scores to arrive at a fair comparison.
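The point that a constant shift changes the mean but not the correlation can be illustrated with a short Python sketch. The scores below are invented for illustration and are not the data behind the example, so the value of r will not equal 0.9; the `correlation` helper is likewise a hypothetical name.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r computed from standardized values (divisor n - 1)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical scores for six skaters; Pierre runs about 0.8 point below Elena.
elena = [5.6, 5.2, 4.9, 5.8, 5.4, 5.0]
pierre = [4.9, 4.3, 4.2, 4.9, 4.7, 4.1]

mean_gap = mean(elena) - mean(pierre)
r = correlation(elena, pierre)

# Add 0.8 to every one of Pierre's scores: the means now agree, r is unchanged.
pierre_adjusted = [score + 0.8 for score in pierre]
r_adjusted = correlation(elena, pierre_adjusted)

print(f"difference in means: {mean_gap:.2f}")
print(f"r before adjustment: {r:.3f}")
print(f"r after adjustment:  {r_adjusted:.3f}")
```

The correlation reports how well the two judges agree on the ordering and spacing of performances; the means report whether one judge is systematically harsher. Both summaries are needed.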