13.2 Beyond Hypothesis Testing for the One-Way Within-Groups ANOVA

Hypothesis testing with the one-way within-groups ANOVA can tell us whether people can, on average, distinguish between types of beer based on price category—that is, whether people give beers different mean ratings based on price. Effect sizes help us figure out whether these differences are large enough to matter. The Tukey HSD test can tell us exactly which means are statistically significantly different from each other.

R2, the Effect Size for ANOVA

MASTERING THE FORMULA

13-8: The formula for effect size for a one-way within-groups ANOVA is: R2 = SSbetween/(SStotal − SSsubjects). We divide the between-groups sum of squares by the difference between the total sum of squares and the subjects sum of squares. We remove the subjects sum of squares so we can determine the variability explained only by between-groups differences.

The calculations for R2 for a one-way within-groups ANOVA and a one-way between-groups ANOVA are similar. As before, the numerator is a measure of the variability that takes into account just the differences among means, SSbetween. The denominator, however, takes into account the total variability, SStotal, but removes the variability caused by differences among participants, SSsubjects. This enables us to determine the variability explained only by between-groups differences. The formula is:

R2 = SSbetween/(SStotal − SSsubjects)
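As a quick sketch, this calculation can be written in a few lines of Python; the function name and the sum-of-squares values in the example call are hypothetical, chosen only to illustrate the arithmetic:

```python
def r_squared_within(ss_between, ss_total, ss_subjects):
    """Effect size for a one-way within-groups ANOVA.

    Subtracting SSsubjects from SStotal removes the variability caused
    by differences among participants, so R^2 reflects only the
    variability explained by between-groups differences.
    """
    return ss_between / (ss_total - ss_subjects)

# Hypothetical sums of squares, for illustration only:
print(r_squared_within(ss_between=100.0, ss_total=180.0, ss_subjects=30.0))
# 100 / (180 - 30) ≈ 0.667
```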

EXAMPLE 13.3

Let’s apply this to the ANOVA we just conducted. We can use the statistics in the source table above to calculate R2:

R2 = SSbetween/(SStotal − SSsubjects) = 0.79

The conventions for R2 are the same as those shown in Table 12-12. This effect size of 0.79 is a very large effect: 79% of the variability in ratings of beer is explained by price.

Tukey HSD

EXAMPLE 13.4

We use the same procedure that we used for a one-way between-groups ANOVA, the Tukey HSD test: We calculate an HSD for each pair of means by first calculating the standard error:

sM = √(MSwithin/N) = 2.720, where N = 5 is the number of participants

The standard error allows us to calculate HSD for each pair of means.

Cheap beer (34.4) versus mid-range beer (34.6):

HSD = (34.4 − 34.6)/2.720 = −0.074

Cheap beer (34.4) versus high-end beer (52.6):

HSD = (34.4 − 52.6)/2.720 = −6.691


Mid-range beer (34.6) versus high-end beer (52.6):

HSD = (34.6 − 52.6)/2.720 = −6.618

Now we look up the critical value in the q table in Appendix B. For a comparison of three means with within-groups degrees of freedom of 8 and a p level of 0.05, the cutoff q is 4.04. As before, the sign of each HSD does not matter.
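A short Python sketch can mirror these calculations. The group means and the q cutoff come from this example; MSwithin ≈ 37.0 and N = 5 are assumptions (N follows from a within-groups df of 8 with three groups, and 37.0 is the value implied by the HSDs reported for this example), so the computed HSDs should land within rounding of −6.691 and −6.618:

```python
from itertools import combinations
from math import sqrt

means = {"cheap": 34.4, "mid-range": 34.6, "high-end": 52.6}
ms_within = 37.0   # assumption: value implied by this example's HSDs
n = 5              # participants; df_within = (3 - 1)(5 - 1) = 8
q_cutoff = 4.04    # q cutoff for 3 means, df_within = 8, p level of 0.05

# Standard error shared by every pairwise comparison
s_m = sqrt(ms_within / n)

for (name_a, mean_a), (name_b, mean_b) in combinations(means.items(), 2):
    hsd = (mean_a - mean_b) / s_m
    significant = abs(hsd) > q_cutoff  # the sign of the HSD does not matter
    print(f"{name_a} vs. {name_b}: HSD = {hsd:.3f} (significant: {significant})")
```

This loop only mirrors the hand calculation; for a real analysis, a statistics library’s Tukey HSD routine is preferable.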

Within-Groups Designs in Everyday Life We often use a within-groups design without even knowing it. A bride might use a within-groups design when she has all of her bridesmaids (the participants) try on several possible dresses (the levels of the study). She would then choose the dress that is, on average, the most flattering on her bridesmaids. We even have an innate understanding of order effects. A bride, for example, might ask her bridesmaids to try on the dress that she prefers either first or last (but not in the middle) so they’ll remember it better and be more likely to prefer it!

Two of the HSDs exceed the critical value of 4.04 in magnitude and are therefore statistically significant: −6.691 and −6.618. It appears that high-end beers elicit higher average ratings than cheap beers; high-end beers also elicit higher average ratings than mid-range beers. No statistically significant difference is found between cheap beers and mid-range beers.

What might explain these differences? It’s not surprising that expensive beers came out ahead of cheap and mid-range beers, but Fallows was surprised that no statistically significant average difference was found between cheap and mid-range beers, which led him to offer this advice to his beer-drinking colleagues: Buy high-end beer “when [you] want an individual glass of lager to be as good as it can be,” but buy cheap beer “at all other times, since it gives the maximum taste and social influence per dollar invested.” The mid-range beers? Not worth the money.

How much faith can we have in these findings? As behavioral scientists, we critically examine the design and procedures. Did the darker color of Sam Adams (the beer that received the highest average ratings) give it away as a high-end beer? The beers were labeled with letters (Budweiser was labeled with F). Yet, in line with many academic grading systems, the letter A has a positive connotation and F has a negative one. Were there order effects? Did the testers get more lenient (or critical) with every swallow? The panel of tasters was mostly Microsoft employees and was all men. Would we get different results for non-tech employees or with female participants? Science is a slow but sure way of knowing that depends on replication of experiments.

Next Steps

Matched Groups

So far, we’ve learned two hypothesis tests that we can use when we have a within-groups design. In Chapter 10, we introduced the paired-samples t test, and in this chapter, we introduced the within-groups ANOVA. Previously, we stated that a within-groups design requires that every participant experience every level of the independent variable, but there is one important exception: The matched-groups design has different people in each group, but they are similar (matched) on characteristics important to our study. A matched design has more statistical power because it allows us to analyze the data as if the same people were in each group.

This research design is particularly useful when participants can’t be in two groups at the same time. For example, a matched-groups design works when we want to learn whether psychology majors or history majors are more interested in current events. We can’t randomly assign students to their majors, of course, and we want to control for confounding variables such as time spent reading newspapers that might differ systematically between psychology majors and history majors. We want to know that it is their major, not some other variable, that is associated with any mean difference in interest in current events, so we would match students on these other variables.


Let’s look at a published example in the social science literature. Researchers in the state of Indiana in the United States compared depression levels of elderly Mexican American caregivers with elderly Mexican American noncaregivers (Hernandez & Bigatti, 2010). Sixty-five people who cared for individuals with Alzheimer’s disease or a disability were matched with 65 noncaregivers on variables that the researchers knew to be related to depression—age, gender, socioeconomic status, physical health, and level of acculturation to the United States. For example, a caregiver who was female, 68 years old, healthy, and well acculturated to the United States would be matched with a noncaregiver who shared these characteristics. In this way, the researchers could know that these matched variables were not responsible for any differences between groups, making it more likely that caregiver status caused any mean difference in depression between the two groups. In this study, the researchers found that caregivers were more likely to be depressed than noncaregivers, on average.

Using matched groups increases statistical power the same way that a within-groups design has more statistical power than a between-groups design. However, there are two main problems, and they may already have occurred to you. First, we might not be aware of all of the important variables of interest. For example, social support is related to depression, and caregivers may be so busy caring for others that they have little time to develop their own network of social support. If we did not match the groups on level of social support, then social support might account for any mean differences in the dependent variable.

Second, if one of the people in a matched pair decides not to complete the study, then we must also discard the data for that person’s match. This makes for less-than-efficient research. In the study comparing caregivers and noncaregivers, data from 8 caregivers had to be discarded because they had not provided enough of the data needed to complete the data set. Because of this, the researchers also had to discard the data from the 8 noncaregivers who were matched to these participants, even though those noncaregivers had completed most of the measures. So only 57 of the 65 matched pairs remained in the final data set. If these problems can be addressed, however, matched groups allow researchers to harness the increased statistical power of a within-groups design.

CHECK YOUR LEARNING

Reviewing the Concepts

  • It is recommended, as it is for other hypothesis tests, that we calculate a measure of effect size, R2, for a one-way within-groups ANOVA.
  • As with one-way between-groups ANOVA, if we are able to reject the null hypothesis with a one-way within-groups ANOVA, we’re not finished. We must conduct a post hoc test, such as a Tukey HSD test, to determine exactly which pairs of means are significantly different from one another.
  • Matched pairs and matched groups allow us to use within-groups designs even if different participants experience each level of the independent variable. Rather than using the same participants, we match different participants on possible confounding variables.


Clarifying the Concepts

  • 13-6 How does the calculation of the effect size R2 differ between the one-way within-groups ANOVA and the one-way between-groups ANOVA?
  • 13-7 How does the calculation of the Tukey HSD differ between the one-way within-groups ANOVA and the one-way between-groups ANOVA?

Calculating the Statistics

  • 13-8 A researcher measured the reaction time of six participants at three different times and found the mean reaction time at time 1 (M1 = 155.833), time 2 (M2 = 206.833), and time 3 (M3 = 251.667). The researcher rejected the null hypothesis after performing a one-way within-groups ANOVA. For the ANOVA, dfbetween = 2, dfwithin = 10, and MSwithin = 771.256.
    1. Calculate the HSD for each of the three mean comparisons.
    2. What is the critical value of q for this Tukey HSD test?
    3. For which comparisons do we reject the null hypothesis?
  • 13-9 Use the following source table to calculate the effect size R2 for the one-way within-groups ANOVA.
    Source      SS           df    MS           F
    Between     27,590.486    2    13,795.243   17.887
    Subjects    16,812.189    5     3,362.438    4.360
    Within       7,712.436   10       771.244
    Total       52,115.111   17

Applying the Concepts

  • 13-10 In Check Your Learning 13-4 and 13-5, we conducted an analysis of driver-experience ratings following test drives.
    1. Calculate R2 for this ANOVA, and state what size effect this is.
    2. What follow-up tests are needed for this ANOVA, if any?

Solutions to these Check Your Learning questions can be found in Appendix D.