
12.2 Multiple Comparisons

OBJECTIVES By the end of this section, I will be able to …

  1. Perform multiple comparisons tests using the Bonferroni method.
  2. Use Tukey's test to perform multiple comparisons.
  3. Use confidence intervals to perform multiple comparisons for Tukey's test.

Recall Example 5, where we rejected the null hypothesis that the population mean time spent in the open-ended sections of a maze was the same for three groups of genetically altered mice. But so far, we have not tested to find out which pairs of population means are significantly different.

FIGURE 21 Summary statistics for three groups of mice.

Figure 21 indicates that the sample mean time for Group 0, x̄_Group0 = 19.387, was much larger than the sample means of the other groups, x̄_Group1 = 8.660 and x̄_Group2 = 8.620. Because x̄_Group0 > x̄_Group1, and because the ANOVA test produced evidence that the three population means are not equal, we are tempted to conclude that μ_Group0 > μ_Group1. However, we cannot formally draw such a conclusion based on the one-way ANOVA results alone. Instead, we need to perform multiple comparisons.

Multiple Comparisons

Once an ANOVA result has been found significant (that is, the null hypothesis is rejected), multiple comparisons procedures seek to determine which pairs of population means are significantly different. Multiple comparisons are not performed if the ANOVA null hypothesis has not been rejected.

We will learn three multiple comparisons procedures: the Bonferroni method, Tukey's test, and Tukey's test using confidence intervals.

1 Performing Multiple Comparisons Tests Using the Bonferroni Method

In Section 10.2, we learned about the independent sample t test for determining whether pairs of population means were significantly different. We will do something similar here, except that (a) the formula for test statistic tdata is different from the one in Section 10.2, and (b) we need to apply the Bonferroni adjustment to the p-value.

Denote the number of population means as k. In general, there are

c = kC2 = k! / [2!(k − 2)!]

possible pairs of means to compare; that is, there are c pairwise comparisons. For k = 3, there are c = 3C2 = 3! / [2!(3 − 2)!] = 3 comparisons, and for k = 4 there are c = 4C2 = 4! / [2!(4 − 2)!] = 6 comparisons. We rejected the null hypothesis in Example 5, so we are interested in which pairs of population means are significantly different. There are c = 3 pairwise hypothesis tests, which are stated in Step 1 of Example 7.
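The count of pairwise comparisons can be checked directly with Python's built-in combinations function (a quick sketch; `math.comb` is part of the standard library):

```python
from math import comb

# Number of pairwise comparisons c = kC2 for k population means
print(comb(3, 2))  # k = 3 gives c = 3 comparisons
print(comb(4, 2))  # k = 4 gives c = 6 comparisons
```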

Suppose each of these three pairwise hypothesis tests is carried out using a level of significance α = 0.05. Then the experimentwise error rate, that is, the probability of making at least one Type I error in these three hypothesis tests, is

α_EW = 1 − (1 − α)³ = 1 − (0.95)³ = 0.142625

which is approximately three times larger than α = 0.05. The Bonferroni adjustment corrects for this as follows.
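As a quick check of the arithmetic above (a minimal sketch in Python):

```python
# Experimentwise error rate for c tests, each at level of significance alpha
alpha, c = 0.05, 3
alpha_ew = 1 - (1 - alpha) ** c
print(round(alpha_ew, 6))  # 0.142625, roughly c times alpha
```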

Recall that a Type I error is rejecting the null hypothesis when it is true.

The Bonferroni Adjustment

  • When performing multiple comparisons, the experimentwise error rate α_EW is the probability of making at least one Type I error in the set of hypothesis tests.
  • α_EW is always greater than the comparison level of significance α, by a factor approximately equal to the number of comparisons being made.
  • Thus, the Bonferroni adjustment corrects for the experimentwise error rate by multiplying the p-value of each pairwise hypothesis test by the number of comparisons being made. If the Bonferroni-adjusted p-value is greater than 1, then set the adjusted p-value equal to 1.

For example, when we test H0: μ_Group0 = μ_Group1 versus Ha: μ_Group0 ≠ μ_Group1, the Bonferroni adjustment says to multiply the resulting p-value by c = 3. Example 7 shows how to use the Bonferroni method of multiple comparisons.

EXAMPLE 7 Bonferroni method of multiple comparisons

Use the Bonferroni method of multiple comparisons to determine which pairs of population mean times differ, for the mice in Groups 0, 1, and 2 in Example 5. Use level of significance α=0.01.

Solution

The Bonferroni method requires that

  • the requirements for ANOVA have been met, and
  • the null hypothesis that the population means are all equal has been rejected.

In Example 5, we verified both requirements.

  • Step 1 For each of the c hypothesis tests, state the hypotheses and the rejection rule. There are k = 3 means, so there will be c = 3 hypothesis tests. Our hypotheses are

    • Test 1: H0: μ_Group0 = μ_Group1 versus Ha: μ_Group0 ≠ μ_Group1
    • Test 2: H0: μ_Group0 = μ_Group2 versus Ha: μ_Group0 ≠ μ_Group2
    • Test 3: H0: μ_Group1 = μ_Group2 versus Ha: μ_Group1 ≠ μ_Group2

    where μ_i represents the population mean time spent in the open-ended sections of the maze for the ith group. For each hypothesis test, reject H0 if the Bonferroni-adjusted p-value ≤ α = 0.01.

  • Step 2 Calculate t_data for each hypothesis test. From Figure 11 on page 676, we have the mean square error from the original ANOVA, MSE = 52.9485079, and from Figure 21 we get the sample means and the sample sizes. Thus,

    • Test 1:

      t_data = (x̄_Group0 − x̄_Group1) / √(MSE(1/n_Group0 + 1/n_Group1)) = (19.387 − 8.660) / √(52.9485079(1/15 + 1/15)) ≈ 4.037

    • Test 2:

      t_data = (x̄_Group0 − x̄_Group2) / √(MSE(1/n_Group0 + 1/n_Group2)) = (19.387 − 8.620) / √(52.9485079(1/15 + 1/15)) ≈ 4.052

    • Test 3:

      t_data = (x̄_Group1 − x̄_Group2) / √(MSE(1/n_Group1 + 1/n_Group2)) = (8.660 − 8.620) / √(52.9485079(1/15 + 1/15)) ≈ 0.015

    When the requirements are met, t_data follows a t distribution with n_t − k = 45 − 3 = 42 degrees of freedom, where n_t represents the total sample size.

    FIGURE 22 Unadjusted p-values from Excel.
  • Step 3 Find the Bonferroni-adjusted p-value for each hypothesis test. Figure 22 shows the unadjusted p-values for the values of t_data from Step 2, using the Excel function TDIST(t_data, df, 2), where df = 42 and the 2 represents a two-tailed test. Then the Bonferroni-adjusted p-value = c(p-value) = 3(p-value), for each hypothesis test.
    • Test 1: Bonferroni-adjusted p-value = 3(0.000225) = 0.000675.
    • Test 2: Bonferroni-adjusted p-value = 3(0.000215) = 0.000645.
    • Test 3: Bonferroni-adjusted p-value = 3(0.988103) = 2.964309, but this value exceeds 1, so we set this p-value equal to 1.
  • Step 4 For each hypothesis test, state the conclusion and the interpretation.
    • Test 1: The adjusted p-value = 0.000675, which is ≤0.01; therefore, reject H0. There is evidence at the 0.01 level of significance that the population mean time spent in the open-ended part of the maze differs between Group 0 and Group 1.
    • Test 2: The adjusted p-value = 0.000645, which is ≤0.01; therefore, reject H0. There is evidence at the 0.01 level of significance that the population mean time differs between Group 0 and Group 2.
    • Test 3: The adjusted p-value = 1, which is not ≤0.01; therefore, do not reject H0. There is insufficient evidence at the 0.01 level of significance that the population mean time differs between Group 1 and Group 2.
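The Bonferroni calculations in Steps 2–4 can be sketched in Python (a non-authoritative sketch; the summary statistics are taken from Figures 11 and 21, and `scipy` is assumed to be available):

```python
from math import sqrt
from scipy import stats

mse, df = 52.9485079, 42                 # MSE from the ANOVA table; df = n_t - k
groups = {"Group 0": (19.387, 15),       # (sample mean, sample size)
          "Group 1": (8.660, 15),
          "Group 2": (8.620, 15)}
pairs = [("Group 0", "Group 1"), ("Group 0", "Group 2"), ("Group 1", "Group 2")]
c = len(pairs)                           # number of pairwise comparisons

results = {}
for a, b in pairs:
    (xa, na), (xb, nb) = groups[a], groups[b]
    t = (xa - xb) / sqrt(mse * (1 / na + 1 / nb))
    p = 2 * stats.t.sf(abs(t), df)       # unadjusted two-tailed p-value
    results[(a, b)] = min(c * p, 1.0)    # Bonferroni adjustment, capped at 1
    print(a, "vs", b, "adjusted p-value:", round(results[(a, b)], 6))
```

Comparing each adjusted p-value against α = 0.01 reproduces the conclusions of Step 4.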

NOW YOU CAN DO

Exercises 9–18.

2 Tukey's Test for Multiple Comparisons

We may also use Tukey's test to determine which pairs of population means are significantly different. Tukey's test was developed by John Tukey, whom we met earlier as the developer of the stem-and-leaf display. We illustrate the steps for Tukey's method using an example.

EXAMPLE 8 Tukey's test for multiple comparisons

In the Case Study on page 678, we tested whether the population mean student motivation scores were equal for the three types of professor self-disclosure on Facebook: high, medium, and low. Figure 18 on page 678 contains the ANOVA results, for which we rejected the null hypothesis of equal population mean scores. Use Tukey's method to determine which pairs of population means are significantly different, using level of significance α=0.05.

Solution

Tukey's method has the same requirements as the Bonferroni method:

  • the requirements for ANOVA have been met, and
  • the null hypothesis that the population means are all equal has been rejected.

In the Case Study, both requirements were verified.

  • Step 1 For each of the c hypothesis tests, state the hypotheses. There are k=3 means, so there will be c=3 hypothesis tests. Our hypotheses are:

    • Test 1: H0: μ_High = μ_Medium versus Ha: μ_High ≠ μ_Medium
    • Test 2: H0: μ_High = μ_Low versus Ha: μ_High ≠ μ_Low
    • Test 3: H0: μ_Medium = μ_Low versus Ha: μ_Medium ≠ μ_Low

    where μ_i represents the population mean score for the ith category.

  • Step 2 Find the Tukey critical value q_crit and state the rejection rule. The total sample size is n_t = 43 + 44 + 43 = 130. Use experimentwise error rate α_EW = 0.05, degrees of freedom df = n_t − k = 130 − 3 = 127, and k = number of population means = 3. Using the table of Tukey critical values (Table G in the Appendix), we seek df = 127 on the left; when we don't find it, we conservatively choose df = 120. Then, in the column for k = 3, we find the Tukey critical value q_crit = 3.356 (Figure 23). The rejection rule for the Tukey method is "Reject H0 if q_data ≥ q_crit," that is, reject H0 if q_data ≥ 3.356.

    FIGURE 23 Finding the Tukey critical value q_crit.
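When the printed table doesn't list the exact degrees of freedom, the critical value can also be computed from the studentized range distribution (a sketch assuming SciPy 1.7 or later, which provides `scipy.stats.studentized_range`):

```python
from scipy.stats import studentized_range

# q_crit for alpha_EW = 0.05 with k = 3 groups and df = 120
# (the conservative stand-in for df = 127 used in Step 2)
q_crit = studentized_range.ppf(0.95, 3, 120)
print(round(q_crit, 3))  # close to the tabled value 3.356
```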
  • Step 3 Calculate the Tukey test statistic q_data for each hypothesis test. From Figure 18 on page 678, we get the sample means, the sample sizes, and the mean square error MSE = 168. Thus,
    • Test 1:

      q_data = (x̄_High − x̄_Medium) / √((MSE/2)(1/n_High + 1/n_Medium)) = (81.09 − 79.36) / √((168/2)(1/43 + 1/44)) ≈ 0.880

    • Test 2:

      q_data = (x̄_High − x̄_Low) / √((MSE/2)(1/n_High + 1/n_Low)) = (81.09 − 70.63) / √((168/2)(1/43 + 1/43)) ≈ 5.292

    • Test 3:

      q_data = (x̄_Medium − x̄_Low) / √((MSE/2)(1/n_Medium + 1/n_Low)) = (79.36 − 70.63) / √((168/2)(1/44 + 1/43)) ≈ 4.442

  • Step 4 For each hypothesis test, state the conclusion and the interpretation.
    • Test 1: q_data = 0.880, which is not ≥ q_crit = 3.356; therefore, do not reject H0. There is insufficient evidence at the 0.05 level of significance that the population mean student motivation scores differ between professors having high and medium self-disclosure on Facebook.
    • Test 2: q_data = 5.292, which is ≥ q_crit = 3.356; therefore, reject H0. There is evidence at the 0.05 level of significance that the population mean scores differ between high and low professor self-disclosure on Facebook.
    • Test 3: q_data = 4.442, which is ≥ q_crit = 3.356; therefore, reject H0. There is evidence at the 0.05 level of significance that the population mean scores differ between medium and low professor self-disclosure on Facebook.

This set of three hypothesis tests has an experimentwise error rate α_EW = 0.05.

When calculating the numerator of q_data for each pairwise comparison, be sure to subtract the smaller value of x̄ from the larger value of x̄, so that the value of q_data is positive.
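The q_data calculations in Steps 3 and 4 can be sketched in Python (a minimal sketch; the sample means, sample sizes, and MSE = 168 are read from Figure 18, and q_crit = 3.356 from Table G):

```python
from math import sqrt

mse, q_crit = 168.0, 3.356
groups = {"High": (81.09, 43), "Medium": (79.36, 44), "Low": (70.63, 43)}

results = {}
for a, b in [("High", "Medium"), ("High", "Low"), ("Medium", "Low")]:
    (xa, na), (xb, nb) = groups[a], groups[b]
    diff = abs(xa - xb)  # larger mean minus smaller mean keeps q_data positive
    q = diff / sqrt((mse / 2) * (1 / na + 1 / nb))
    results[(a, b)] = q
    print(a, "vs", b, ": q_data =", round(q, 3), "reject H0:", q >= q_crit)
```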

NOW YOU CAN DO

Exercises 19–30.

3 Using Confidence Intervals to Perform Tukey's Test

Tukey's test for multiple comparisons may also be performed using confidence intervals and technology. Recall that when using confidence intervals for hypothesis tests, H0 is rejected if the hypothesized value of the population mean does not fall inside the confidence interval.

Rejection Rule for Using Confidence Intervals to Perform Tukey's test

If a 100(1 − α)% confidence interval for μ1 − μ2 contains zero, then at level of significance α, we do not reject the null hypothesis H0: μ1 = μ2. If the interval does not contain zero, then we do reject H0.


We illustrate the concept of using confidence intervals to perform Tukey's test with an example using the Facebook data.

EXAMPLE 9 Using confidence intervals to perform Tukey's test

Use the 95% confidence intervals for the differences in population means provided by Minitab to perform Tukey's test for multiple comparisons on the Facebook data.

Solution

We use the steps in the Step-by-Step Technology Guide provided at the end of this section. Figure 24 contains the output from Minitab showing 95% confidence intervals for the differences in population means for the high, medium, and low professor disclosure levels. The output states that "Group = Low" is being subtracted from the other two groups, meaning that the first two confidence intervals are for μ_Medium − μ_Low and μ_High − μ_Low. Below that, "Group = Medium" is subtracted from the high group, indicating a confidence interval for μ_High − μ_Medium. The column headings "Lower" and "Upper" represent the lower and upper bounds of the confidence interval. Figure 25 shows the output from JMP, including 95% confidence intervals for the differences in population means. The output states that the second level listed is subtracted from the first, meaning that the first two confidence intervals are for μ_High − μ_Low and μ_Medium − μ_Low. The columns "Lower CL" and "Upper CL" represent the lower and upper bounds of each confidence interval.

FIGURE 24 Using Minitab confidence intervals to perform Tukey's test.

FIGURE 25 Using JMP confidence intervals to perform Tukey's test.

Thus, for our c = 3 hypothesis tests, we have

  • Test 1: H0: μ_Medium = μ_Low versus Ha: μ_Medium ≠ μ_Low

    The 95% confidence interval for μ_Medium − μ_Low is (2.14, 15.33), which does not contain zero, so we reject H0: μ_Medium = μ_Low at level of significance α = 0.05.

  • Test 2: H0: μ_High = μ_Low versus Ha: μ_High ≠ μ_Low

    The 95% confidence interval for μ_High − μ_Low is (3.84, 17.09), which does not contain zero, so we reject H0: μ_High = μ_Low at level of significance α = 0.05.

  • Test 3: H0: μ_High = μ_Medium versus Ha: μ_High ≠ μ_Medium

    The 95% confidence interval for μ_High − μ_Medium is (−4.86, 8.32), which does contain zero, so we do not reject H0: μ_High = μ_Medium at level of significance α = 0.05.

Note that these conclusions are exactly the same as the conclusions from Example 8.
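The confidence-interval decision rule can be sketched directly from the interval endpoints (a minimal sketch; the intervals are the Minitab values reported in Figure 24):

```python
# 95% Tukey confidence intervals for differences in population means
intervals = {
    "Medium - Low": (2.14, 15.33),
    "High - Low": (3.84, 17.09),
    "High - Medium": (-4.86, 8.32),
}

decisions = {}
for pair, (lower, upper) in intervals.items():
    decisions[pair] = not (lower <= 0 <= upper)  # reject H0 when 0 lies outside
    print(pair, "-> reject H0" if decisions[pair] else "-> do not reject H0")
```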

NOW YOU CAN DO

Exercises 31 and 32.
