As we learned when we looked at the research on gender differences in mathematical reasoning ability, “statistically significant” does not mean that the findings from a study represent a meaningful difference. “Statistically significant” only means that those findings are unlikely to occur if in fact the null hypothesis is true. Researcher Geoff Cumming, a vocal advocate for the use of the procedures taught in this chapter, points out that hypothesis testing “relies on strange backward logic and can’t give us direct information about what we want to know”: the size of an effect.
The almost completely overlapping curves in Figure 8-1 were “statistically significant” because the sample size was so big. Increasing sample size always increases the test statistic if all else stays the same. For example, psychology test scores on the Graduate Record Examination (GRE) had a mean of 603 and a standard deviation of 101 during a recent multiyear testing period beginning in 2005. Imagine that a sample of 90 students who had taken the GRE had a mean score of 622.
The test statistic calculated from these numbers was:

$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{101}{\sqrt{90}} = 10.646$

$z = \frac{M - \mu_M}{\sigma_M} = \frac{622 - 603}{10.646} = 1.78$
What would happen if we increased the sample size to 200? We’d have to recalculate the standard error to reflect the larger sample, and then recalculate the test statistic to reflect the smaller standard error:

$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{101}{\sqrt{200}} = 7.142$

$z = \frac{622 - 603}{7.142} = 2.66$
What if we increased the sample size to 1000?

$\sigma_M = \frac{101}{\sqrt{1000}} = 3.194$

$z = \frac{622 - 603}{3.194} = 5.95$
What if we increased it to 100,000?

$\sigma_M = \frac{101}{\sqrt{100{,}000}} = 0.319$

$z = \frac{622 - 603}{0.319} = 59.56$
Notice that each time we increased the sample size, the standard error decreased and the test statistic increased. The original test statistic, 1.78, was not beyond the critical values of 1.96 and −1.96. However, the remaining test statistics (2.66, 5.95, and 59.56) were increasingly extreme relative to the positive critical value. In their study of gender differences in mathematics performance, researchers studied 10,000 participants, a very large sample (Benbow & Stanley, 1980). It is not surprising, then, that such a small difference emerged as statistically significant.
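To see this pattern concretely, here is a minimal Python sketch of the calculations above; the population parameters and the sample mean of 622 come from the GRE example, and any discrepancy in the last decimal place reflects rounding of intermediate steps.

```python
import math

mu, sigma = 603, 101   # GRE population mean and standard deviation (from the text)
sample_mean = 622      # sample mean used in the example above

for n in (90, 200, 1000, 100_000):
    standard_error = sigma / math.sqrt(n)      # sigma_M shrinks as n grows
    z = (sample_mean - mu) / standard_error    # so z grows, all else being equal
    print(f"n = {n:>7,}: standard error = {standard_error:7.3f}, z = {z:6.2f}")
```

Only the sample size changes from one line to the next; the 19-point difference between the means is identical throughout.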
Let’s consider, logically, why it makes sense that a large sample should allow us to reject the null hypothesis more readily than a small sample. If we randomly selected 5 people among all those who had taken the GRE and they had scores well above the national average, we might say, “It could be chance.” But if we randomly selected 1000 people with GRE scores well above the national average, it is very unlikely that we just happened to choose 1000 people with high scores.
But just because a real difference exists does not mean it is a large, or meaningful, difference. The difference we found with 5 people might be the same as the difference we found with 1000 people. As we demonstrated with multiple z tests with different sample sizes, we might fail to reject the null hypothesis with a small sample but then reject the null hypothesis for the same-size difference between means with a large sample.
Cohen (1990) used the small but statistically significant correlation between height and IQ to explain the difference between statistical significance and practical importance. The sample size was big: 14,000 children. Imagining that height and IQ were causally related, Cohen calculated that a person would have to grow by 3.5 feet to increase her IQ by 30 points (two standard deviations). Or, to increase her height by 4 inches, she would have to increase her IQ by 233 points! Height may have been statistically significantly related to IQ, but there was no practical real-world importance to this relation.
Language Alert! When you come across the term statistical significance, do not interpret this as an indication of practical importance.
Effect size can tell us whether a statistically significant difference is also an important difference. Effect size indicates the size of a difference and is unaffected by sample size; it tells us how much two populations do not overlap. Simply put, the less overlap, the bigger the effect size.
The amount of overlap between two distributions can be decreased in two ways. First, as shown in Figure 8-6, overlap decreases and effect size increases when means are further apart. Second, as shown in Figure 8-7, overlap decreases and effect size increases when variability within each distribution of scores is smaller.
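To connect overlap to effect size numerically, here is a small Python sketch. It assumes two normal distributions with equal standard deviations, in which case the overlapping proportion works out to 2Φ(−|d|/2), where Φ is the standard normal cumulative distribution function and d is the distance between the means in standard-deviation units:

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal curve."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def overlap(d: float) -> float:
    """Proportion of overlap between two equal-variance normal curves
    whose means sit |d| standard deviations apart: 2 * Phi(-|d| / 2)."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)

for d in (0.07, 0.2, 0.5, 0.8, 2.0):
    print(f"d = {d:4.2f}: the two curves overlap by {overlap(d):.1%}")
```

Moving the means farther apart or shrinking the variability within each distribution both increase d, and larger values of d produce less overlap.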
When we discussed gender differences in mathematical reasoning ability, you may have noticed that we described the size of the findings as “small” (Hyde, 2005). Because effect size is a standardized measure based on scores rather than means, we can compare the effect sizes of different studies with one another, even when the studies have different sample sizes.
195
Figure 8-8 demonstrates why we use scores instead of means to calculate effect size. First, assume that each of these distributions is based on the same underlying population. Second, notice that all means represented by the vertical lines are identical. The differences are due only to the spread of the distributions. The small degree of overlap in the tall, skinny distributions of means in Figure 8-8a is the result of a large sample size. The greater degree of overlap in the somewhat wider distributions of means in Figure 8-8b is the result of a smaller sample size. By contrast, the distributions of scores in Figures 8-8c and 8-8d represent scores rather than means for these two studies. Because these flatter, wider distributions include actual scores, sample size is not an issue in making comparisons.
196
In this case, the amounts of real overlap in Figures 8-8c and 8-8d are identical. We can directly compare the amount of overlap and see that they have the same effect size.
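A brief sketch, again reusing the GRE numbers from earlier in the chapter, makes the contrast explicit: the test statistic depends on sample size through the standard error, but the effect size (the Cohen’s d introduced next) divides by the standard deviation of the scores and so does not change:

```python
import math

mu, sigma, sample_mean = 603, 101, 622   # GRE example values from earlier

for n in (90, 1000, 100_000):
    z = (sample_mean - mu) / (sigma / math.sqrt(n))  # based on the distribution of means
    d = (sample_mean - mu) / sigma                   # based on the distribution of scores
    print(f"n = {n:>7,}: z = {z:6.2f}, effect size d = {d:.2f}")
```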
Cohen’s d is a measure of effect size that assesses the difference between two means in terms of standard deviation, not standard error.
There are many different effect-size statistics; the one we calculate here is Cohen’s d.
Let’s calculate Cohen’s d for the same data with which we constructed a confidence interval. We simply substitute standard deviation for standard error. When we calculated the test statistic for the 1000 customers at Starbucks with posted calories, we first calculated standard error:

$\sigma_M = \frac{\sigma}{\sqrt{N}} = \frac{201}{\sqrt{1000}} = 6.356$
The formula for Cohen’s d is:

$\text{Cohen’s } d = \frac{M - \mu}{\sigma}$
It is the same formula as for the z statistic, except we divide by the population standard deviation rather than by standard error.
We calculated the z statistic using the population mean of 247 and the sample mean of 232:

$z = \frac{M - \mu_M}{\sigma_M} = \frac{232 - 247}{6.356} = -2.36$
To calculate Cohen’s d, we simply use the formula for the z statistic, substituting σ for σM (and µ for µM, even though these means are always the same). So we use 201 instead of 6.356 in the denominator. Cohen’s d is based on the spread of the distribution of scores, rather than the distribution of means.
$\text{Cohen’s } d = \frac{M - \mu}{\sigma} = \frac{232 - 247}{201} = -0.07$
Now that we have the effect size, often written in shorthand as d = −0.07, we ask: What does it mean? First, we know that the sample mean and the population mean are 0.07 standard deviation apart, which doesn’t sound like a big difference, and it isn’t. Jacob Cohen (1988) published conventions for interpreting the size of an effect: a d of 0.2 is considered a small effect, 0.5 a medium effect, and 0.8 a large effect.
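Here is a minimal Python version of this calculation, using the Starbucks values from the text:

```python
def cohens_d(sample_mean: float, mu: float, sigma: float) -> float:
    """Cohen's d: the distance between two means in standard-deviation units.
    Same form as the z statistic, but divides by sigma rather than by
    the standard error."""
    return (sample_mean - mu) / sigma

# Values from the Starbucks example in the text
d = cohens_d(sample_mean=232, mu=247, sigma=201)
print(f"Cohen's d = {d:.2f}")   # about -0.07, below even a 'small' effect of 0.2
```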
Based on these numbers, the effect size for the study of Starbucks customers (−0.07) is not even at the level of a small effect. As we pointed out in Chapter 7, however, the researchers hypothesized that even a small effect might spur eateries to provide more low-calorie options.
A meta-analysis is a study that involves the calculation of a mean effect size from the individual effect sizes of many studies.
Many researchers consider meta-analysis to be one of the most important developments in research methodology in recent decades.
The logic of the meta-analysis process is surprisingly simple. It involves just four steps:
Step 1: Select the topic of interest, and decide exactly how to proceed before beginning to track down studies.
Step 2: Locate every study that has been conducted and meets the criteria.
Step 3: Calculate an effect size, often Cohen’s d, for every study.
Step 4: Calculate statistics, ideally a mean effect size for all of the studies.
In Step 4, researchers calculate a mean effect size for all studies, the central goal of a meta-analysis.
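As a toy illustration of Step 4, the sketch below averages a set of hypothetical effect sizes; the d values are made up for the example, and real meta-analyses typically weight each study, for instance by its sample size.

```python
# Hypothetical effect sizes (Cohen's d) collected from individual studies in Step 3
study_effect_sizes = [0.12, 0.30, -0.05, 0.22, 0.18]

# Step 4, in its simplest form: the unweighted mean effect size.
# Real meta-analyses usually weight studies (e.g., by sample size or precision).
mean_d = sum(study_effect_sizes) / len(study_effect_sizes)
print(f"Mean effect size across {len(study_effect_sizes)} studies: d = {mean_d:.2f}")
```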
Reviewing the Concepts

Distinguish statistical significance and practical importance.

What is effect size?

Calculating the Statistics

Using IQ as a variable, where we know the mean is 100 and the standard deviation is 15, calculate Cohen’s d for an observed mean of 105.

Applying the Concepts

In Check Your Learning 8-
Solutions to these Check Your Learning questions can be found in Appendix D.