8.2 Effect Size

As we learned when we looked at the research on gender differences in mathematical reasoning ability, “statistically significant” does not mean that the findings from a study represent a meaningful difference. “Statistically significant” only means that those findings are unlikely to occur if in fact the null hypothesis is true. Calculating an effect size moves us a little closer to what we are most interested in: Is the pattern in a data set meaningful or important?

The Effect of Sample Size on Statistical Significance

Misinterpreting Statistical Significance Statistical significance that is achieved by merely collecting a large sample can make a research finding appear to be far more important than it really is, just as a curved mirror can exaggerate a person’s size.

The almost completely overlapping curves in Figure 8-1 were “statistically significant” because the sample size was so big. Increasing sample size always increases the test statistic if all else stays the same. For example, psychology test scores on the Graduate Record Examination (GRE) had a mean of 603 and a standard deviation of 101 during the years 2005–2008 (http://www.ets.org/Media/Tests/GRE/pdf/gre_0809_interpretingscores.pdf). In a fictional study (Example 7.4 in Chapter 7), we reported that 90 graduating seniors had a mean of 622. Based on the sample size of 90, we reported the mean and standard error for the distribution of means as:

μ_M = μ = 603;  σ_M = σ/√N = 101/√90 = 10.646

The test statistic calculated from these numbers was:

z = (M − μ_M)/σ_M = (622 − 603)/10.646 = 1.78
What would happen if we increased the sample size to 200? We’d have to recalculate the standard error to reflect the larger sample, and then recalculate the test statistic to reflect the smaller standard error:

σ_M = 101/√200 = 7.142;  z = (622 − 603)/7.142 = 2.66

What if we increased the sample size to 1000?

σ_M = 101/√1000 = 3.194;  z = (622 − 603)/3.194 = 5.95

What if we increased it to 100,000?

σ_M = 101/√100,000 = 0.319;  z = (622 − 603)/0.319 = 59.56

Larger Samples Give Us More Confidence in Our Conclusions Stephen, a British student studying in the United States, is told that he won’t be able to find his favorite candy bar, Yorkie. He tests this hypothesis in 3 stores and finds no Yorkie bars. Another British student, Victoria, also warned by her friends, looks for her favorite, Curly Wurly. She tests her hypothesis in 25 U.S. stores and finds no Curly Wurly bars. Both conclude that their friends were right. Whose conclusion do you have more confidence in, Stephen’s or Victoria’s?

MASTERING THE CONCEPT

8.2: As sample size increases, so does the test statistic (if all else stays the same). Because of this, a small difference might not be statistically significant with a small sample but might be statistically significant with a large sample.

Notice that each time we increased the sample size, the standard error decreased and the test statistic increased. The original test statistic, 1.78, was not beyond the critical values of 1.96 and −1.96. However, the remaining test statistics (2.66, 5.95, and 59.56) were increasingly extreme relative to the positive critical value. In their study of gender differences in mathematics performance, researchers studied 10,000 participants, a very large sample (Benbow & Stanley, 1980). It is not surprising, then, that a small difference would be statistically significant.
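
For readers who want to verify this pattern by computer, here is a minimal Python sketch that repeats the calculations above for each sample size, using the GRE figures from this example (a population mean of 603, a population standard deviation of 101, and a sample mean of 622). The sample sizes and the resulting z statistics are the ones reported in the text; the code itself is simply an illustration.

```python
import math

mu, sigma, sample_mean = 603, 101, 622  # GRE population mean, population SD, and sample mean

for n in (90, 200, 1000, 100000):
    standard_error = sigma / math.sqrt(n)      # sigma_M = sigma / sqrt(N)
    z = (sample_mean - mu) / standard_error    # z = (M - mu_M) / sigma_M
    print(f"N = {n:>6}: standard error = {standard_error:6.3f}, z = {z:5.2f}")

# The difference between the means (622 - 603 = 19 points) never changes,
# yet z grows from about 1.78 at N = 90 to nearly 60 at N = 100,000.
```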

Let’s consider, logically, why it makes sense that a large sample should allow us to reject the null hypothesis more readily than a small sample. If we randomly selected 5 people among all those who had taken the GRE and they had scores well above the national average, we might say, “It could be chance.” But if we randomly selected 1000 people with GRE scores well above the national average, it is very unlikely that we just happened to choose 1000 people with high scores.

But just because a real difference exists does not mean it is a large, or meaningful, difference. The difference we found with 5 people might be the same as the difference we found with 1000 people. As we demonstrated with multiple z tests with different sample sizes, we might fail to reject the null hypothesis with a small sample but then reject the null hypothesis for the same-size difference between means with a large sample.

Cohen (1990) used the small but statistically significant correlation between height and IQ to explain the difference between statistical significance and practical importance. The sample size was big: 14,000 children. Imagining that height and IQ were causally related, Cohen calculated that a person would have to grow by 3.5 feet to increase his or her IQ by 30 points (two standard deviations). Alternatively, to gain 4 inches in height, a person would have to increase his or her IQ by 233 points! Height may have been statistically significantly related to IQ, but there was no practical real-world application. A larger sample size should increase our confidence that the story is true, but it shouldn’t increase our confidence that the story is important.

Language Alert! When you come across the term statistical significance, do not interpret this as an indication of practical importance.

What Effect Size Is

Effect size indicates the size of a difference and is unaffected by sample size.

Effect size can tell us whether a statistically significant difference might also be an important difference. Effect size indicates the size of a difference and is unaffected by sample size. Effect size tells us how much two populations do not overlap. Simply put, the less overlap, the bigger the effect size.

The amount of overlap between two distributions can be decreased in two ways. First, as shown in Figure 8-6, overlap decreases and effect size increases when means are further apart. Second, as shown in Figure 8-7, overlap decreases and effect size increases when variability within each distribution of scores is smaller.

Figure 8-6

Effect Size and Mean Differences When two population means are farther apart, as in (b), the overlap of the distributions is less and the effect size is bigger.

Figure 8-7

Effect Size and Standard Deviation When two population distributions decrease their spread, as in (b), the overlap of the distributions is less and the effect size is bigger.

When we discussed gender differences in mathematical reasoning ability, you may have noticed that we described the size of the findings as “small” (Hyde, 2005). Because effect size is a standardized measure based on scores rather than means, we can compare the effect sizes of different studies with one another, even when the studies have different sample sizes.

EXAMPLE 8.3

Figure 8-8 demonstrates why we use scores instead of means to calculate effect size. First, assume that each of these distributions is based on the same underlying population. Second, notice that all means represented by the vertical lines are identical. The differences are due only to the spread of the distributions. The small degree of overlap in the tall, skinny distributions of means in Figure 8-8a is the result of a large sample size. The greater degree of overlap in the somewhat wider distributions of means in Figure 8-8b is the result of a smaller sample size. By contrast, the distributions of scores in Figures 8-8c and 8-8d represent scores rather than means for these two studies. Because these flatter, wider distributions include actual scores, sample size is not an issue in making comparisons.

Figure 8-8

Making Fair Comparisons The top two pairs of curves (a and b) depict two studies, study 1 and study 2. The first study (a) compared two samples with very large sample sizes, so each curve is very narrow. The second study (b) compared two samples with much smaller sample sizes, so each curve is wider. The first study has less overlap, but that doesn’t mean it has a bigger effect than study 2; we simply can’t compare the effects based on distributions of means. The bottom two pairs of curves (c and d) depict the same two studies, but the curves are now based on the standard deviation of the individual scores rather than the standard error. Now the studies are comparable, and we see that they have the same amount of overlap and therefore the same effect size.

In this case, the amounts of real overlap in Figures 8-8c and 8-8d are identical. We can directly compare the amount of overlap and see that they have the same effect size.

Cohen’s d

Cohen’s d is a measure of effect size that assesses the difference between two means in terms of standard deviation, not standard error.

There are many different effect-size statistics, but they all neutralize the influence of sample size. When we conduct a z test, the effect-size statistic is typically Cohen’s d, developed by Jacob Cohen (Cohen, 1988). Cohen’s d is a measure of effect size that assesses the difference between two means in terms of standard deviation, not standard error. In other words, Cohen’s d allows us to measure the difference between means using standard deviations, much like a z statistic. We accomplish this by using standard deviation in the denominator (rather than using standard error).

EXAMPLE 8.4

Let’s calculate Cohen’s d for the situation for which we constructed a confidence interval. We simply substitute standard deviation for standard error. When we calculated the test statistic for the 1000 customers at Starbucks with posted calories, we first calculated standard error:

σ_M = σ/√N = 201/√1000 = 6.356

We calculated the z statistic using the population mean of 247 and the sample mean of 232:

z = (M − μ_M)/σ_M = (232 − 247)/6.356 = −2.36

To calculate Cohen’s d, we simply use the formula for the z statistic, substituting σ for σ_M (and μ for μ_M, even though these means are always the same). So we use 201 instead of 6.356 in the denominator:

Cohen’s d = (M − μ)/σ = (232 − 247)/201 = −0.07

The Cohen’s d is now based on the spread of the distribution of scores, rather than the distribution of means.

MASTERING THE FORMULA

8-2: The formula for Cohen’s d for a z statistic is: Cohen’s d = (M − μ)/σ. It is the same formula as for the z statistic, except we divide by the population standard deviation rather than by the standard error.

Now that we have the effect size, often written in shorthand as d = −0.07, what does it mean? First, we know that the sample mean and the population mean are 0.07 standard deviation apart, which doesn’t sound like a big difference—and it isn’t. Cohen developed guidelines for what constitutes a small effect (0.2), a medium effect (0.5), or a large effect (0.8). Table 8-1 displays these guidelines, along with the amount of overlap between two curves that is indicated by an effect of that size. No sign is provided because it is the magnitude of an effect size that matters; an effect size of −0.5 is the same size as one of 0.5.
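
For readers who like to check these numbers by computer, here is a minimal Python sketch that computes both the z statistic and Cohen’s d for the Starbucks example, using the values given above (a population mean of 247 calories, a population standard deviation of 201, a sample mean of 232, and a sample of 1000 customers). The calculations mirror Formula 8-2; the code itself is only an illustration.

```python
import math

mu, sigma = 247, 201        # population mean and standard deviation (calories)
sample_mean, n = 232, 1000  # customers at Starbucks locations with posted calories

standard_error = sigma / math.sqrt(n)    # 201 / sqrt(1000) = 6.356
z = (sample_mean - mu) / standard_error  # about -2.36: beyond -1.96, so statistically significant
d = (sample_mean - mu) / sigma           # Cohen's d: about -0.07

print(f"z = {z:.2f}, Cohen's d = {d:.2f}")
# By Cohen's conventions (0.2 small, 0.5 medium, 0.8 large), a d of -0.07
# does not even reach the level of a small effect.
```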

TABLE 8-1 Cohen’s Conventions for Effect Sizes: d
Jacob Cohen published guidelines (or conventions), based on the overlap between two distributions, to help researchers determine whether an effect is small, medium, or large. These numbers are not cutoffs; they are merely rough guidelines to help researchers interpret results.

Effect Size    Convention    Overlap
Small          0.2           85%
Medium         0.5           67%
Large          0.8           53%
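
The overlap percentages in Table 8-1 can be reproduced approximately if we assume two normal distributions with equal standard deviations whose means are d standard deviations apart. The short Python sketch below uses Cohen’s U1 measure of non-overlap to do so; treating the table’s values as based on U1 is our assumption, but the sketch yields figures very close to Cohen’s conventions.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal curve."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def percent_overlap(d):
    """Percent overlap of two equal-variance normal curves whose means are
    d standard deviations apart, computed as 1 minus Cohen's U1."""
    p = normal_cdf(abs(d) / 2)
    u1 = (2 * p - 1) / p      # Cohen's U1: proportion of non-overlap
    return 100 * (1 - u1)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: roughly {percent_overlap(d):.0f}% overlap")
# Prints approximately 85%, 67%, and 53%, close to the values in Table 8-1.
```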

Based on these numbers, the effect size for the study of Starbucks customers (−0.07) is not even at the level of a small effect. As we pointed out in Chapter 7, however, the researchers hypothesized that even a small effect might spur eateries to provide more low-calorie choices. Sometimes a small effect can be meaningful.

MASTERING THE CONCEPT

8.3: Because a statistically significant effect might not be an important one, we should calculate effect size in addition to conducting a hypothesis test. We can then report whether a statistically significant effect is small, medium, or large.

Next Steps

Meta-Analysis

A meta-analysis is a study that involves the calculation of a mean effect size from the individual effect sizes of many studies.

Many researchers consider meta-analysis to be the most important recent advancement in social science research (e.g., Newton & Rudestam, 1999). A meta-analysis is a study that involves the calculation of a mean effect size from the individual effect sizes of many studies. Meta-analysis provides added statistical power by considering many studies simultaneously and helps to resolve debates fueled by contradictory research findings (Lam & Kennedy, 2005).

The logic of the meta-analysis process is surprisingly simple. There are just four steps:

STEP 1: Select the topic of interest, and decide exactly how to proceed before beginning to track down studies.

Here are some of the considerations to keep in mind:

  1. Make sure the necessary statistical information, either effect sizes or the summary statistics necessary to calculate effect sizes, is available.
  2. Consider selecting only studies in which participants meet certain criteria, such as age, gender, or geographic location.
  3. Consider eliminating studies based on the research design—for example, because they were not experimental in nature.

For example, British researcher Fidelma Hanrahan and her colleagues conducted a meta-analysis to examine the effectiveness of cognitive therapy in reducing levels of worrying in people with generalized anxiety disorder (Hanrahan, Field, Jones, & Davey, 2013). Before they began their meta-analysis, they developed criteria for the studies they would include; for example, they decided to include only studies that were true experiments and only studies in which the participants were between the ages of 18 and 65.

STEP 2: Locate every study that has been conducted and meets the criteria.

Obvious places to start are PsycINFO, Google Scholar, and other electronic databases. For example, these researchers searched several databases using terms such as “generalized anxiety disorder,” “cognitive,” “therapy,” and “anxiety” (Hanrahan et al., 2013). A key part of meta-analysis, however, is finding any studies that have been conducted but have not been published (Conn, Valentine, Cooper, & Rantz, 2003). Much of this “fugitive literature” (Rosenthal, 1995, p. 184) or “gray literature” (Lam & Kennedy, 2005) is unpublished simply because the studies did not find a significant difference. Omitting these studies tends to inflate the apparent mean effect size. We find these studies by using other sources—for example, by reading the proceedings of relevant conferences or contacting the primary researchers in the field to obtain any relevant unpublished findings. Hanrahan and her colleagues emailed the authors of the studies they located by using databases and asked whether they had any unpublished data.

Meta-Analysis and Electronic Databases Researchers who conduct a meta-analysis use many tools and strategies to gather all of the findings in a particular research area. Among the most useful are electronic databases such as PsycINFO.

STEP 3: Calculate an effect size, often Cohen’s d, for every study.

When the effect size has not been reported, the researcher must calculate it from summary statistics that were reported. These researchers were able to calculate 19 effect sizes from the 15 studies that met their criteria (some studies reported more than one effect) (Hanrahan et al., 2013).

STEP 4: Calculate statistics—ideally, summary statistics, a hypothesis test, a confidence interval, and a visual display of the effect sizes (Rosenthal, 1995).

Most importantly, researchers calculate a mean effect size for all studies. In fact, we can apply all of the statistical insights we’ve learned: means, medians, standard deviations, confidence intervals and hypothesis testing, and visual displays such as box plots or stem-and-leaf plots.
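
To make Step 4 concrete, here is a minimal Python sketch that computes the summary statistics a meta-analyst typically reports: the mean effect size, its standard deviation, and a rough 95% confidence interval for the mean. The effect sizes in the sketch are hypothetical (they are not the values from Hanrahan et al., 2013), and real meta-analyses usually weight each study by its sample size or precision; this unweighted version is meant only to illustrate the logic.

```python
import statistics

# Hypothetical Cohen's d values from the studies located for a meta-analysis
effect_sizes = [1.2, 0.9, 1.6, 0.4, 1.1, 0.8, 1.5, 0.7]

k = len(effect_sizes)
mean_d = statistics.mean(effect_sizes)
sd_d = statistics.stdev(effect_sizes)
standard_error = sd_d / (k ** 0.5)

# Rough 95% confidence interval for the mean effect size (normal approximation)
lower = mean_d - 1.96 * standard_error
upper = mean_d + 1.96 * standard_error

print(f"k = {k} studies, mean d = {mean_d:.2f}, SD = {sd_d:.2f}")
print(f"95% CI for the mean effect size: [{lower:.2f}, {upper:.2f}]")
# If the confidence interval does not include 0, we can reject the null
# hypothesis that the mean effect size is 0.
```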

In their meta-analysis of cognitive therapy for worry, Hanrahan and colleagues (2013) calculated several mean effect sizes. For example, the mean Cohen’s d for the comparison of cognitive therapy with no therapy was 1.81. The confidence interval did not include 0, and the researchers were able to reject the null hypothesis that the effect size was 0. The mean Cohen’s d for the comparison of cognitive therapy with other types of therapy was 0.63. The researchers were again able to reject the null hypothesis; however, they found an outlier that had a large effect size. When outliers were omitted, the mean dropped from 0.63 to 0.45, a smaller but still statistically significant effect. The researchers also included a graph—a box plot of the confidence intervals. Based on this meta-analysis, they concluded that cognitive therapy seems to be an effective treatment for worry.

Much of the fugitive literature of unpublished studies exists because studies with null results are less likely to appear in press (e.g., Begg, 1994). Twenty percent of the studies included in the meta-analysis conducted by Hanrahan and colleagues (2013) were unpublished at the time the meta-analysis was conducted, but there may have been other studies that these researchers were unable to locate. This has been called “the file drawer problem,” and we will describe two solutions to it.

A file drawer analysis is a statistical calculation, following a meta-analysis, of the number of studies with null results that would have to exist so that a mean effect size would no longer be statistically significant.

The first solution involves additional analyses. The most common follow-up analysis was proposed by Robert Rosenthal (1991), and became aptly known as a file drawer analysis, a statistical calculation, following a meta-analysis, of the number of studies with null results that would have to exist so that a mean effect size would no longer be statistically significant. If just a few studies could render a mean effect size nonsignificant—that is, no longer statistically significantly different from zero—then the mean effect size should be viewed as likely to be an inflated estimate. If it would take several hundred studies in researchers’ “file drawers” to render the effect nonsignificant, then it is safe to conclude that there really is a significant effect. For most research topics, it is not likely that there are hundreds of unpublished studies.
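
Rosenthal’s version of this calculation is often called a fail-safe N. One common textbook formulation combines the z values from the included studies and asks how many unpublished studies averaging z = 0 would be needed to pull the combined result below the one-tailed .05 cutoff. The Python sketch below illustrates that formulation with hypothetical z values; treat it as a sketch of the idea rather than the exact procedure used in any particular meta-analysis.

```python
# Hypothetical z values from the studies included in a meta-analysis
z_values = [2.1, 2.8, 1.9, 2.4, 3.0, 2.2]

k = len(z_values)
sum_z = sum(z_values)
z_criterion = 1.645  # one-tailed critical z at alpha = .05

# Number of "file drawer" studies averaging z = 0 that would be needed to
# drag the combined (Stouffer) z below the .05 cutoff
fail_safe_n = (sum_z / z_criterion) ** 2 - k
print(f"Fail-safe N = {fail_safe_n:.0f} unpublished null studies")
# A fail-safe N in the hundreds suggests the overall effect is unlikely to be
# an artifact of publication bias; a small fail-safe N urges caution.
```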

There are other variants of the file drawer analysis, including analyses that allow researchers to examine their findings as if there were publication bias—that is, as if there were many studies out there with null results. The meta-analysis described here used a sensitivity analysis developed by Vevea and Woods (2005). Hanrahan and her colleagues concluded that the sensitivity analysis gave them “confidence that the estimated population effect sizes have not been severely inflated by unpublished studies not in the meta-analysis” (p. 126).

A second solution is particularly exciting because it creates new opportunities for undergraduates. Psi Chi, the International Honor Society in Psychology, has partnered with the Open Science Collaboration to test the reproducibility of well-known experiments. Both faculty members and graduate students can serve as mentors in the Reproducibility Project. The Psi Chi faculty advisor in your department can find out how to participate; if you do not have a Psi Chi chapter, you might ask your department chairperson to help you start one. Think of the Reproducibility Project as a way to crowd-source small pieces of very important work. You get to play a small part in something very big.

CHECK YOUR LEARNING

Reviewing the Concepts

  • As sample size increases, the test statistic becomes more extreme and it becomes easier to reject the null hypothesis.
  • A statistically significant result is not necessarily one with practical importance.
  • Effect sizes are calculated with respect to scores, rather than means, so they are not contingent on sample size.
  • The size of an effect is based on the difference between two group means and the amount of variability within each group.
  • Effect size for a z test is measured with Cohen’s d, which is calculated much like a z statistic, but using standard deviation instead of standard error.
  • A meta-analysis is a study of studies that provides a more objective measure of an effect size than an individual study does.
  • A researcher conducting a meta-analysis chooses a topic, decides on guidelines for a study’s inclusion, tracks down every study on a given topic, and calculates an effect size for each. A mean effect size is calculated and reported, often along with a standard deviation, median, significance testing, confidence interval, and appropriate graphs.

Clarifying the Concepts

  • 8-4 Distinguish statistical significance and practical importance.
  • 8-5 What is effect size?

Calculating the Statistics

  • 8-6 Using IQ as a variable, where we know the mean is 100 and the standard deviation is 15, calculate Cohen’s d for an observed mean of 105.

Applying the Concepts

  • 8-7 In Check Your Learning 8-3, we calculated a confidence interval based on CFC data. The population mean CFC score was 3.20, with a standard deviation of 0.70. The mean for the sample of 45 students who joined a career discussion group was 3.45.
    1. Calculate the appropriate effect size for this study.
    2. Citing Cohen’s conventions, explain what this effect size tells us.
    3. Based on the effect size, does this finding have any consequences or implications for anyone’s life?

Solutions to these Check Your Learning questions can be found in Appendix D.