A scatterplot displays the form, direction, and strength of the relationship between two quantitative variables. Linear relationships are particularly important because a straight line is a simple pattern that is quite common. We say a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line. Our eyes are not good judges of how strong a linear relationship is.
The two scatterplots in Figure 2.7 depict exactly the same data, but the lower plot is drawn smaller in a large field. The lower plot seems to show a stronger linear relationship. Our eyes are often fooled by changing the plotting scales or the amount of white space around the cloud of points in a scatterplot.8 We need to follow our strategy for data analysis by using a numerical measure to supplement the graph. Correlation is the measure we use.
The correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r.
Suppose that we have data on variables x and y for n cases. The values for the first case are $x_1$ and $y_1$, the values for the second case are $x_2$ and $y_2$, and so on. The means and standard deviations of the two variables are $\bar{x}$ and $s_x$ for the x-values, and $\bar{y}$ and $s_y$ for the y-values. The correlation r between x and y is

$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
As always, the summation sign Σ means “add these terms for all cases.” The formula for the correlation r is a bit complex. It helps us to see what correlation is, but in practice you should use software or a calculator that finds r from keyed-in values of two variables x and y.
The formula for r begins by standardizing the data. Suppose, for example, that x is height in centimeters and y is weight in kilograms and that we have height and weight measurements for n people. Then $\bar{x}$ and $s_x$ are the mean and standard deviation of the n heights, both in centimeters. The value

$$\frac{x_i - \bar{x}}{s_x}$$

is the standardized height of the ith person. The standardized height says how many standard deviations above or below the mean a person's height lies. Standardized values have no units; in this example, they are no longer measured in centimeters. Similarly, the standardized weights obtained by subtracting $\bar{y}$ and dividing by $s_y$ are no longer measured in kilograms. The correlation r is an average of the products of the standardized height and the standardized weight for the n people.
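The standardize-then-average recipe described here can be sketched in a few lines of Python. This is only an illustration: the function names `standardize` and `correlation` and the height and weight values are made up for the example and do not come from any data set in this chapter. The sketch assumes the sample standard deviation (divisor n − 1) and averages the products by dividing by n − 1.

```python
from statistics import mean, stdev


def standardize(values):
    """Return standardized values: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, divisor n - 1
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of cases")
    n = len(xs)
    zx = standardize(xs)
    zy = standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)


# Hypothetical measurements for five people (illustration only).
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 66, 74, 80]

r = correlation(heights_cm, weights_kg)
print(round(r, 3))
```

In practice, statistical software or a calculator does this work; writing it out once simply makes visible that r is built from unit-free standardized values.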
standardizing, p. 45
2.23 Spending on education
CASE 2.1 In Example 2.3 (page 66), we examined the relationship between spending on education and population for the 50 states in the United States. Compute the correlation between these two variables.
2.24 Change the units
CASE 2.1 Refer to Exercise 2.6 (page 67), where you changed the units to millions of dollars for education spending and to thousands for population.
The formula for correlation helps us see that r is positive when there is a positive association between the variables. Height and weight, for example, have a positive association. People who are above average in height tend to be above average in weight, so both their standardized height and their standardized weight are positive. People who are below average in height tend to have below-average weight, so both their standardized height and their standardized weight are negative. In both cases, the products in the formula for r are mostly positive, so r is positive. In the same way, we can see that r is negative when the association between x and y is negative. More detailed study of the formula gives more detailed properties of r. Here is what you need to know to interpret correlation.
resistant, p. 25
The scatterplots in Figure 2.8 illustrate how values of r closer to 1 or −1 correspond to stronger linear relationships. To make the meaning of r clearer, the standard deviations of both variables in these plots are equal, and the horizontal and vertical scales are the same. In general, it is not so easy to guess the value of r from the appearance of a scatterplot. Remember that changing the plotting scales in a scatterplot may mislead our eyes, but it does not change the correlation.
Remember that correlation is not a complete description of two-variable data, even when the relationship between the variables is linear. You should give the means and standard deviations of both x and y along with the correlation. (Because the formula for correlation uses the means and standard deviations, these measures are the proper choice to accompany a correlation.) Conclusions based on correlations alone may require rethinking in the light of a more complete description of the data.
EXAMPLE 2.8 Forecasting Earnings
Stock analysts regularly forecast the earnings per share (EPS) of companies they follow. EPS is calculated by dividing a company’s net income for a given time period by the number of common stock shares outstanding. We have two analysts’ EPS forecasts for a computer manufacturer for the next six quarters. How well do the two forecasts agree? The correlation between them is r = 0.9, but the mean of the first analyst’s forecasts is $3 per share lower than the second analyst’s mean.
These facts do not contradict each other. They are simply different kinds of information. The means show that the first analyst predicts lower EPS than the second. But because the first analyst’s EPS predictions are about $3 per share lower than the second analyst’s for every quarter, the correlation remains high. Adding or subtracting the same number to all values of either x or y does not change the correlation. The two analysts agree on which quarters will see higher EPS values. The high r shows this agreement, despite the fact that the actual predicted values differ by $3 per share.
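The claim that shifting every value by the same amount leaves r unchanged can be checked with a short Python sketch. The forecasts below are invented for illustration; they are not the analysts' actual forecasts from this example and are not meant to reproduce r = 0.9. The helper name `correlation` is likewise just a label chosen for this sketch.

```python
import math
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r computed from standardized values (divisor n - 1)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)


# Made-up EPS forecasts for six quarters (illustration only).
analyst_1 = [1.2, 1.5, 1.1, 1.9, 2.3, 2.0]
analyst_2 = [4.0, 4.6, 4.3, 4.8, 5.5, 4.9]

# Add $3 per share to every one of the first analyst's forecasts.
shift = 3.0
analyst_1_shifted = [v + shift for v in analyst_1]

r_original = correlation(analyst_1, analyst_2)
r_shifted = correlation(analyst_1_shifted, analyst_2)

# The mean moves by the shift, but the correlation does not.
print(round(mean(analyst_1_shifted) - mean(analyst_1), 2))
print(math.isclose(r_original, r_shifted, rel_tol=1e-9))
```

The shift cancels when the mean is subtracted during standardization, which is why only the mean changes and r stays the same.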
2.25 Strong association but no correlation
Here is a data set that illustrates an important point about correlation:
x | 20 | 30 | 40 | 50 | 60 |
y | 10 | 30 | 50 | 30 | 10 |
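As a check on this data set, the correlation can be computed directly from the standardized values. The sketch below is one way to do that in Python; the helper name `correlation` is chosen for this illustration, and the sample standard deviation (divisor n − 1) is assumed.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r computed from standardized values (divisor n - 1)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)


# The five cases from the table above.
x = [20, 30, 40, 50, 60]
y = [10, 30, 50, 30, 10]

r = correlation(x, y)

# y rises and then falls as x increases, so the positive and negative
# products of standardized values cancel; r is zero up to rounding error.
print(abs(r) < 1e-9)
```

The relationship here is strong but curved, and correlation measures only the linear part of a relationship.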
2.26 Brand names and generic products