13.1 The Meaning of Correlation

A correlation is exactly what its name suggests: a co-relation between two variables. Lots of everyday observations are co-related: junk food eaten and body fat, miles driven and the wear on tires, air conditioner usage and the electric bill. If you can measure any two variables on a scale, you can calculate the degree to which they are co-related.

The Characteristics of Correlation

  • A correlation coefficient is a statistic that quantifies a relation between two variables.

A correlation coefficient is a statistic that quantifies a relation between two variables. In this chapter, we learn how to quantify a relation—that is, we learn to calculate a correlation coefficient—when the data are linearly related. A linear relation means that the data form an overall pattern through which it would make sense to draw a straight line—that is, the dots on a scatterplot are roughly clustered around a line, rather than, say, a curve. You can actually see—and understand—the data story with just a glance. There are three main characteristics of the correlation coefficient.

  1. The correlation coefficient can be either positive or negative.

  2. The correlation coefficient always falls between −1.00 and 1.00.

  3. It is the strength (also called the magnitude) of the coefficient, not its sign, that indicates how large it is.

  • A positive correlation is an association between two variables such that participants with high scores on one variable tend to have high scores on the other variable as well, and those with low scores on one variable tend to have low scores on the other variable.

The first important characteristic of the correlation coefficient is that it can be either positive or negative. A positive correlation has a positive sign (e.g., +0.32, or more typically, just 0.32), and a negative correlation has a negative sign (e.g., −0.32). A positive correlation is an association between two variables such that participants with high scores on one variable tend to have high scores on the other variable as well, and those with low scores on one variable tend to have low scores on the other variable.

Contrary to what some people think, when participants with low scores on one variable tend to have low scores on the other, it is not a negative correlation. A positive correlation describes a situation in which participants tend to have similar scores, with respect to the mean and spread, on both variables—whether both scores are low, medium, or high. The line that summarizes a scatterplot with a positive correlation slopes upward and to the right.

MASTERING THE CONCEPT

13-1: The sign indicates the direction of the correlation, positive or negative. A positive correlation occurs when people who are high on one variable tend to be high on the other as well, and people who are low on one variable tend to be low on the other. A negative correlation occurs when people who are high on one variable tend to be low on the other.

EXAMPLE 13.1

The scatterplot in Figure 13-1 shows a positive correlation between Scholastic Aptitude Test (SAT) score and college grade point average (GPA). For example, the second dot from the left is for a person with a 980 on the SAT and a 2.2 GPA; this person is lower than average on both scores. The upper-right dot is for a person with a 1360 on the SAT and a 3.8 GPA; this person is higher than average on both scores. This makes sense, because we would expect people with higher SAT scores to get better grades, on average.

FIGURE 13-1
A Positive Correlation
These data points depict a positive correlation between SAT score and college GPA. Those with higher SAT scores tend to have higher GPAs, and those with lower SAT scores tend to have lower GPAs.
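The idea of being on the same side of the mean on both variables can be sketched in a few lines of Python. This is only an illustration: apart from the two people described in Example 13.1 (980 and 2.2; 1360 and 3.8), the SAT and GPA values are made up, and the helper name pearson_r is our own. It assumes the Pearson correlation coefficient, the usual choice for linearly related data.

```python
from math import sqrt

# Hypothetical data: only (980, 2.2) and (1360, 3.8) come from Example 13.1.
sat = [920, 980, 1050, 1120, 1200, 1280, 1360]
gpa = [2.4, 2.2, 2.8, 3.2, 3.1, 3.5, 3.8]

mean_sat = sum(sat) / len(sat)
mean_gpa = sum(gpa) / len(gpa)

# A person counts as "same side" when both scores are above their means
# or both are below, so the product of the two deviations is positive.
same_side = sum(
    1 for s, g in zip(sat, gpa)
    if (s - mean_sat) * (g - mean_gpa) > 0
)


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(sat, gpa)
print(f"{same_side} of {len(sat)} people fall on the same side of both means")
print(f"r = {r:.2f}")
```

In this made-up sample, most but not all people fall on the same side of both means, which is what "tend to" means in the definition of a positive correlation.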

EXAMPLE 13.2

  • A negative correlation is an association between two variables in which participants with high scores on one variable tend to have low scores on the other variable.

The scatterplot in Figure 13-2 shows the negative correlation of −0.43 between cheating and final exam grade for the MIT study. A negative correlation is an association between two variables in which participants with high scores on one variable tend to have low scores on the other variable. The line that summarizes a scatterplot with a negative correlation slopes downward and to the right. Each dot represents one person’s values on both variables. The proportion of homework copied during the semester is on the horizontal x-axis, and the final exam grade (converted to standardized z scores) is on the vertical y-axis. For example, the dot in the green diamond indicates a student who copied less than 0.2, or 20%, of the homework, and scored almost 2 standard deviations above the mean on the final exam. The dot in the red diamond indicates a student who copied almost 80% of the homework and scored more than 3 standard deviations below the mean on the final exam. Even though most dots are not as extreme as the pattern of the two students we just described, the overall trend is for students who copied more to perform more poorly on the final—a linear relation.

FIGURE 13-2
A Negative Correlation
In this negative correlation from the MIT study, those who cheat more tend to have a lower final grade; those who cheat less tend to have a higher final grade.
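Because the final exam grades in Figure 13-2 are expressed as z scores, it may help to see how standardizing works alongside a negative correlation. The sketch below uses invented numbers, not the MIT data, and the function name z_scores is our own. It relies on one standard way of writing the Pearson coefficient: the average product of paired z scores, with each z score based on the population standard deviation.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw scores to z scores using the population standard deviation."""
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]


# Hypothetical data: proportion of homework copied and final exam score.
copied = [0.0, 0.1, 0.2, 0.4, 0.6, 0.8]
exam = [85, 90, 80, 75, 70, 50]

z_copied = z_scores(copied)
z_exam = z_scores(exam)

# Average product of paired z scores gives the Pearson correlation.
r = sum(zx * zy for zx, zy in zip(z_copied, z_exam)) / len(copied)

for c, z in zip(copied, z_exam):
    print(f"copied {c:.0%} of homework -> exam z score {z:+.2f}")
print(f"r = {r:.2f}")
```

Students who copied more have lower exam z scores in this invented sample, so most products of paired z scores are negative and the coefficient comes out negative.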

MASTERING THE CONCEPT

13-2: A correlation coefficient always falls between −1.00 and 1.00. The size of the coefficient, not its sign, indicates how large it is.

A second important characteristic of the correlation coefficient is that it always falls between −1.00 and 1.00. Both −1.00 and 1.00 are perfect correlations. If we calculate a coefficient that is outside this range, we have made a mistake in the calculations. A correlation coefficient of 1.00 indicates a perfect positive correlation; every point on the scatterplot falls on one line, as seen in the imaginary relation between absences and exam grades depicted in Figure 13-3. Higher scores on one variable are associated with higher scores on the other variable, and lower scores on one variable are associated with lower scores on the other variable. When a correlation coefficient is either −1.00 or 1.00, knowing somebody’s score on one variable tells you exactly what that person’s score is on the other variable. They are perfectly related.

FIGURE 13-3
A Perfect Positive Correlation
Every dot falls exactly on a straight line that moves up and to the right. This perfect, positive correlation is not real. More absences almost certainly don’t lead to higher grades—and certainly they don’t for every student.

A correlation coefficient of −1.00 indicates a perfect negative correlation. Every point on the scatterplot falls on one line, as seen in the imaginary relation between absences and exam grades depicted in Figure 13-4, but now higher scores on one variable go with lower scores on the other variable. As with a perfect positive correlation, knowing somebody’s score on one variable tells you that person’s exact score on the other variable. A correlation of 0.00 falls right in the middle of the two extremes and indicates no correlation—no association between the two variables.

FIGURE 13-4
A Perfect Negative Correlation
When every pair of scores falls on the same line on a scatterplot and higher scores on one variable are associated with lower scores on the other variable, there is a perfect negative correlation of −1.00, a situation that almost never happens in real life.
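The two extremes and the midpoint can be checked with a short sketch. The data are imaginary, like the absences and grades in Figures 13-3 and 13-4, and the helper pearson_r is our own name for a standard Pearson calculation. Because of floating-point rounding, the perfect cases may print as values extremely close to 1 and −1 rather than exactly those numbers.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


absences = [0, 1, 2, 3, 4]

# Imaginary grades that fall exactly on a rising line, then a falling line.
grades_up = [60 + 5 * a for a in absences]
grades_down = [100 - 5 * a for a in absences]

r_perfect_pos = pearson_r(absences, grades_up)
r_perfect_neg = pearson_r(absences, grades_down)

# Four points at the corners of a square: knowing x tells us nothing about y.
r_none = pearson_r([1, 1, 2, 2], [1, 2, 1, 2])

for label, r in [("rising line", r_perfect_pos),
                 ("falling line", r_perfect_neg),
                 ("square of points", r_none)]:
    print(f"{label}: r = {r:.2f}")
```

A result outside the range from −1.00 to 1.00 would signal a calculation error, so a range check is a reasonable sanity test for any hand calculation.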

The third useful characteristic of the correlation coefficient is that its sign—positive or negative—indicates only the direction of the association, not the strength or size of the association. So a correlation coefficient of −0.35 is the same size as one of 0.35. A correlation coefficient of −0.67 is larger than one of 0.55. Don’t be fooled by a negative sign; the sign indicates the direction of the relation, not the strength.
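One way to avoid being fooled by the sign is to compare absolute values. The short sketch below uses the four coefficients mentioned in the paragraph above; the variable names are our own.

```python
coefficients = [-0.35, 0.35, -0.67, 0.55]

# Strength is the absolute value; the sign only gives the direction.
same_size = abs(-0.35) == abs(0.35)
strongest = max(coefficients, key=abs)

for r in coefficients:
    direction = "negative" if r < 0 else "positive"
    print(f"r = {r:+.2f}: direction {direction}, strength {abs(r):.2f}")
print(f"strongest relation: r = {strongest:+.2f}")
```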

The strength of the correlation is determined by how close to “perfect” the data points are. The closer the data points are to the imaginary line that one could draw through them, the closer the correlation is to being perfect (either −1.00 or 1.00), and the stronger the relation between the two variables. The farther the points are from this imaginary line, the farther the correlation is from being perfect (so, closer to 0.00), and the weaker the relation between the two variables.
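The link between scatter around the line and the size of the coefficient can be sketched with made-up data. Here each point starts on the line y = x and is then pushed alternately above and below it by a fixed distance. The function name scattered and the distances chosen are assumptions for illustration only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = list(range(10))


def scattered(distance):
    """Points on the line y = x, moved alternately up and down by `distance`."""
    return [x + distance * (1 if i % 2 == 0 else -1) for i, x in enumerate(xs)]


r_on_line = pearson_r(xs, scattered(0))   # every point exactly on the line
r_close = pearson_r(xs, scattered(1))     # points close to the line
r_far = pearson_r(xs, scattered(5))       # points far from the line

for label, r in [("on the line", r_on_line),
                 ("close to the line", r_close),
                 ("far from the line", r_far)]:
    print(f"{label}: r = {r:.2f}")
```

As the points move farther from the line, the coefficient drops from perfect toward 0.00, even though the underlying line has not changed.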

The Teeter-Tottering Negative Correlation
When two variables are negatively correlated, a high score on one variable indicates a likely low score on the other variable, just like children on a teeter-totter. (Photo: © Ole Graf/zefa/Corbis)

The scores in a positive correlation move up and down together, the same way the mercury rises or falls in a thermometer as the temperature goes up or down. The scores in a negative correlation move up and down in opposition to each other, as though on a teeter-totter. Knowing the direction of a correlation allows us to use a person’s score on one variable to predict his or her score on another variable. Fortunately, the correlation statistic lets us identify both the direction and the strength of the relation between two variables.

How big does a correlation coefficient have to be to be considered important? As he did for effect sizes, Jacob Cohen (1988) published standards, shown in Table 13-1, to help us interpret the correlation coefficient. Very few findings in the behavioral sciences have correlation coefficients of 0.50 or larger because any particular outcome is likely influenced by many variables. A student’s exam grade, for example, is likely influenced by absences from class, attention level, hours of studying, interest in the subject matter, IQ, and many other variables. So, the correlation of −0.43 between cheating and exam grades found among MIT students, which approaches that 0.50 mark, is a sizable correlation for the behavioral sciences.

TABLE 13-1
Cohen’s (1988) standards for interpreting the size of a correlation coefficient
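Cohen’s standards can be turned into a small lookup. The cutoffs below (0.10 for small, 0.30 for medium, 0.50 for large) are the commonly cited values from Cohen (1988) and are an assumption here, so check them against Table 13-1. The function name describe_strength, the label for values under 0.10, and the choice to treat each benchmark as the lower edge of a band are our own simplifications.

```python
def describe_strength(r):
    """Label the size of a correlation using Cohen's commonly cited benchmarks."""
    if not -1.0 <= r <= 1.0:
        # A coefficient outside this range signals a calculation error.
        raise ValueError(f"r = {r} is outside the range -1.00 to 1.00")
    size = abs(r)  # the sign gives direction only, so it is ignored here
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "smaller than small"


for r in [-0.67, 0.32, 0.12, 0.05]:
    print(f"r = {r:+.2f}: {describe_strength(r)}")
```

Treating the benchmarks as hard cutoffs is a convenience. A coefficient that falls between two benchmarks is often better described as lying between them.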

Correlation Is Not Causation

FIGURE 13-5
Three Possible Causal Explanations for a Correlation
Any correlation can be explained in one of several ways. The first variable, (A), might cause the second variable, (B). Or the reverse could be true—the second variable, (B), could cause the first variable, (A). Finally, a third variable, (C), could cause both (A) and (B). In fact, there could be many “third” variables.

You need to understand what correlations do not reveal about the relation between variables. Correlations only provide clues to causality; they do not demonstrate or test for causality; they only quantify the strength and direction of the relation between variables. Your appreciation for what correlations do not reveal suggests that you are thinking scientifically. For example, we know that there was a strong negative correlation in the MIT study between cheating and final exam grade; it is not unreasonable to think that cheating causes bad grades. However, there are three possible reasons for this observed correlation.

First, variable A (cheating) could cause variable B (poor grades). Second, variable B (poor grades) could cause variable A (cheating). Third, variable C (some other influence) could be causing the correlation between variable A (cheating) and variable B (poor grades). You can think of these three possibilities as the A-B-C model (Figure 13-5).
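The third possibility can be demonstrated with a small simulation. Everything here is invented: a hypothetical third variable, labeled time pressure, raises cheating and lowers exam grades, while cheating itself never appears in the formula that generates the grades. This is not the MIT data, and the names and noise levels are assumptions chosen only to make the point visible.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 1000

# Variable C: a hypothetical third variable (time pressure), in z-score units.
pressure = [rng.gauss(0, 1) for _ in range(n)]

# Variable A (cheating) rises with C; variable B (exam grade) falls with C.
# Neither A nor B is used to generate the other.
cheating = [c + 0.5 * rng.gauss(0, 1) for c in pressure]
exam = [-c + 0.5 * rng.gauss(0, 1) for c in pressure]

r = pearson_r(cheating, exam)
print(f"r between cheating and exam grade = {r:.2f}")
```

The simulated correlation is strongly negative even though, by construction, cheating has no effect on grades. The coefficient alone cannot tell this situation apart from one in which A causes B or B causes A.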

Knowing that correlation does not imply causation coaxes our brains into thinking of alternative explanations. The researchers found that physics and math ability did not correlate with cheating, so that is an unlikely explanation. But we also mentioned working, anxiety, and other time commitments. You can probably think of even more possibilities. Never confuse correlation with causation.

MASTERING THE CONCEPT

13-3: Just because two variables are related doesn’t mean one causes the other. It could be that the first causes the second, the second causes the first, or a third variable causes both. Correlation does not indicate causation.

CHECK YOUR LEARNING

Reviewing the Concepts
  • A correlation coefficient is a statistic that quantifies a relation between two variables.

  • The correlation coefficient always falls between −1.00 and 1.00.

  • When two variables are related such that people with high scores on one tend to have high scores on the other and people with low scores on one tend to have low scores on the other, we describe the variables as positively correlated.

  • When two variables are related such that people with high scores on one tend to have low scores on the other, we describe the variables as negatively correlated.

  • When two variables are not related, there is no correlation and they have a correlation coefficient close to 0.

  • The strength of a correlation, captured by the number value of the coefficient, is independent of its sign. Cohen established standards for evaluating the strength of association.

  • Correlation is not equivalent to causation. In fact, a correlation does not help us decide the merits of different causal explanations.

  • When two variables are correlated, this association might occur because the first variable, (A), causes the second, (B); or because the second variable, (B), causes the first, (A). Alternately, a third variable, (C), could cause both of the correlated variables, (A) and (B).

Clarifying the Concepts
13-1 There are three main characteristics of the correlation coefficient. What are they?
13-2 Why doesn’t correlation indicate causation?
Calculating the Statistics
13-3 Use Cohen’s guidelines to describe the strength of the following coefficients:
  1. −0.60

  2. 0.35

  3. 0.04

13-4 Draw a hypothetical scatterplot to depict the following correlation coefficients:
  1. −0.60

  2. 0.35

  3. 0.04

Applying the Concepts
13-5 A writer for Runner’s World magazine debated the merits of running while listening to music (Seymour, 2006). The writer, an avid iPod user, interviewed a clinical psychologist, whose response to the debate about whether to listen to music while running was: “I like to do what the great ones do and try to emulate that. What are the Kenyans doing?”
Let’s say a researcher conducted a study in which he determined the correlation between the percentage of a country’s marathon runners who train while using a portable music device and the average marathon finishing time for that country’s runners. (Note that in this case the participants are countries, not people.) Let’s say the researcher finds a strong positive correlation. That is, the more of a country’s runners who train with music, the longer the average marathon finishing time. Remember, in a marathon, a longer time is bad. So this fictional finding is that training with music is associated with slower marathon finishing times; the United States, for example, would have a higher percentage of music use and higher (slower) finishing times than Kenya.
Using the A-B-C model, provide three possible explanations for this finding.

Solutions to these Check Your Learning questions can be found in Appendix D.
