6.4 Power and Inference as a Decision

When you complete this section, you will be able to:

  • Define what is meant by the power of a test.

  • Determine the power of a test to detect an alternative for a given sample size n.

  • Describe the two types of possible errors when performing a test that focuses on deciding between two hypotheses.

  • Relate the two errors to the significance level and power of the test.

Although we prefer to use P-values rather than the reject-or-not view of the level α significance test, the latter view is very important for planning studies and for understanding statistical decision theory. We will discuss these two topics in this section.

Power

Level α significance tests are closely related to confidence intervals—in fact, we saw that a two-sided test can be carried out directly from a confidence interval (pages 353–354). The significance level, like the confidence level, says how reliable the method is in repeated use. If we use 5% significance tests repeatedly when H0 is, in fact, true, we will be wrong (the test will reject H0) 5% of the time and right (the test will fail to reject H0) 95% of the time.

The ability of a test to detect that H0 is false is measured by the probability that the test will reject H0 when an alternative is true. The higher this probability is, the more sensitive the test is.


POWER

The probability that a level α significance test will reject H0 when a particular alternative value of the parameter is true is called the power of the test to detect that alternative.

EXAMPLE 6.29

The power of a TBBMC significance test. Can a six-month exercise program increase the total body bone mineral content (TBBMC) of young women? A team of researchers is planning a study to examine this question. Based on the results of a previous study, they are willing to assume that σ = 2 for the percent change in TBBMC over the six-month period. They also believe that a change in TBBMC of 1% is important, so they would like to have a reasonable chance of detecting a change this large or larger. Is 25 subjects a large enough sample for this project?

We will answer this question by calculating the power of the significance test that will be used to evaluate the data to be collected. The calculation consists of three steps:

  1. State H0, Ha (the particular alternative we want to detect), and the significance level α.

  2. Find the values of x̄ that will lead us to reject H0.

  3. Calculate the probability of observing these values of x̄ when the alternative is true.

Step 1. The null hypothesis is that the exercise program has no effect on TBBMC. In other words, the mean percent change is zero. The alternative is that exercise is beneficial; that is, the mean change is positive. Formally, we have

H0: μ = 0

Ha: μ > 0

The alternative of interest is μ = 1% increase in TBBMC. A 5% test of significance will be used.

Step 2. The z test rejects H0 at the α = 0.05 level whenever

z = (x̄ − 0)/(σ/√n) = x̄/(2/√25) ≥ 1.645

Be sure you understand why we use 1.645. Rewrite this in terms of x̄: reject H0 when

x̄ ≥ 1.645 × 2/√25 = 0.658

Because the significance level is α = 0.05, this event has probability 0.05 of occurring when the population mean μ is 0.

Step 3. The power to detect the alternative μ = 1% is the probability that H0 will be rejected when in fact μ = 1%. We calculate this probability by standardizing x̄, using the value μ = 1, the population standard deviation σ = 2, and the sample size n = 25. The power is

P(x̄ ≥ 0.658 when μ = 1) = P((x̄ − 1)/(2/√25) ≥ (0.658 − 1)/(2/√25))

= P(Z ≥ −0.855) = 0.80

Figure 6.16 The sampling distributions of x̄ when μ = 0 and when μ = 1, Example 6.29. The power is the probability that the test rejects H0 when the alternative is true.

Figure 6.16 illustrates the power with the sampling distribution of x̄ when μ = 1. This significance test rejects the null hypothesis that exercise has no effect on TBBMC 80% of the time if the true effect of exercise is a 1% increase in TBBMC. If the true effect of exercise is a greater percent increase, the test will have greater power; it will reject with a higher probability.
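The three-step calculation is easy to reproduce with software. Here is a minimal Python sketch (the scipy calls and variable names are our own, not part of the example) that mirrors the hand calculation:

```python
from math import sqrt

from scipy.stats import norm

# Study design of Example 6.29.
sigma, n, alpha = 2.0, 25, 0.05
mu0, mu_alt = 0.0, 1.0   # null mean and the alternative we want to detect

# Step 2: the one-sided z test rejects H0 when xbar >= cutoff.
se = sigma / sqrt(n)                     # standard deviation of xbar
cutoff = mu0 + norm.ppf(1 - alpha) * se  # 1.645 * 2/sqrt(25) = 0.658

# Step 3: power = P(xbar >= cutoff), computed under the alternative mean.
power = 1 - norm.cdf((cutoff - mu_alt) / se)
print(f"cutoff = {cutoff:.3f}, power = {power:.2f}")  # cutoff = 0.658, power = 0.80
```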

Here is another example of a power calculation, this time for a two-sided z test.

EXAMPLE 6.30

Power of the lead concentration test. Example 6.17 (page 375) presented a test of

H0: μ = 15.00

Ha: μ ≠ 15.00

at the 1% level of significance. What is the power of this test against the specific alternative μ = 15.50?

The test rejects H0 when |z| ≥ 2.576. The test statistic is

z = (x̄ − 15.00)/(0.25/√3)


Some arithmetic shows that the test rejects when either of the following is true:

z ≥ 2.576  (in other words, x̄ ≥ 15.37)

z ≤ −2.576 (in other words, x̄ ≤ 14.63)

These are disjoint events, so the power is the sum of their probabilities, computed assuming that the alternative μ = 15.50 is true. We find that

P(x̄ ≥ 15.37 when μ = 15.50) = P(Z ≥ (15.37 − 15.50)/(0.25/√3)) = P(Z ≥ −0.90) = 0.8159

P(x̄ ≤ 14.63 when μ = 15.50) = P(Z ≤ (14.63 − 15.50)/(0.25/√3)) = P(Z ≤ −6.03) ≐ 0

Figure 6.17 illustrates this calculation. With a power of about 0.82, we are quite confident that the test will reject H0 when this alternative is true.

Figure 6.17 The power, Example 6.30. Unlike Figure 6.16, only the sampling distribution under the alternative is shown.
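The same two-tail calculation can be scripted. A minimal Python sketch, assuming the σ = 0.25 and n = 3 that the rejection limits above imply:

```python
from math import sqrt

from scipy.stats import norm

# Example 6.30: two-sided z test of H0: mu = 15.00 at the 1% level.
sigma, n, alpha = 0.25, 3, 0.01   # assumed from the rejection limits above
mu0, mu_alt = 15.00, 15.50

se = sigma / sqrt(n)
zstar = norm.ppf(1 - alpha / 2)   # about 2.576

upper = mu0 + zstar * se          # about 15.37
lower = mu0 - zstar * se          # about 14.63

# The two rejection regions are disjoint, so the power is the sum of the
# tail probabilities computed under the alternative mu = 15.50.
power = (1 - norm.cdf((upper - mu_alt) / se)) + norm.cdf((lower - mu_alt) / se)
print(f"power = {power:.4f}")     # about 0.81 (0.8159 by hand with rounded cutoffs)
```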

High power is desirable. Along with 95% confidence intervals and 5% significance tests, 80% power is becoming a standard. Many U.S. government agencies that provide research funds require that the sample size for the funded studies be sufficient to detect important results 80% of the time using a 5% test of significance.

EXAMPLE 6.31

Constructing a power curve. Example 6.30 considered one specific alternative, μ = 15.50. Often, it is helpful to consider the power for a range of alternatives. Fortunately, most statistical software saves us from having to do these calculations manually. Figure 6.18 shows Minitab output for the power over the range 15.00 ppm to 15.80 ppm. The power calculation of Example 6.30 is represented by a dot on the curve at a difference of 15.50 − 15.00 = 0.50. This curve is very informative. We see that with a sample size of three, the power is greater than 80% only for differences larger than about 0.48. If it is important to detect differences less than this, the Deely Laboratory needs to consider ways to increase the power.


Figure 6.18 Minitab output (a power curve) for the one-sample power calculation, Example 6.31.
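Software such as Minitab draws this curve directly, but the numbers behind it are just repeated versions of the Example 6.30 calculation. A short Python sketch under the same assumed σ = 0.25 and n = 3:

```python
import numpy as np
from scipy.stats import norm

# Power of the two-sided 1% test over a range of alternatives.
sigma, n, alpha, mu0 = 0.25, 3, 0.01, 15.00
se = sigma / np.sqrt(n)
zstar = norm.ppf(1 - alpha / 2)

alternatives = np.linspace(15.00, 15.80, 81)
power = (norm.sf((mu0 + zstar * se - alternatives) / se)
         + norm.cdf((mu0 - zstar * se - alternatives) / se))

# Print a few points along the curve rather than plotting it.
for mu, p in zip(alternatives[::20], power[::20]):
    print(f"mu = {mu:.2f}  power = {p:.3f}")
```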

Increasing the power

Suppose that you have performed a power calculation and found that the power is too small. What can you do to increase it? Here are four ways; note the similarity between these and the choices to reduce the margin of error (page 352). A short numerical check follows the list.

  • Increase α. A test at a larger significance level rejects H0 more readily and so has greater power.

  • Increase the sample size n. More observations reduce the variability of x̄, making departures from H0 easier to detect.

  • Consider a particular alternative that is farther from μ0. Large effects are easier to detect than small ones.

  • Decrease σ. Reducing the variability of the measurements has the same effect as increasing the sample size; improving the measurement process can sometimes achieve this.
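Here is that check: a small Python helper (our own construction, built on the one-sided setting of Example 6.29) showing how each of the four levers raises the power:

```python
from math import sqrt

from scipy.stats import norm

def power_one_sided(mu_alt, mu0=0.0, sigma=2.0, n=25, alpha=0.05):
    """Power of the one-sided z test of H0: mu = mu0 against Ha: mu > mu0."""
    se = sigma / sqrt(n)
    cutoff = mu0 + norm.ppf(1 - alpha) * se
    return 1 - norm.cdf((cutoff - mu_alt) / se)

print(power_one_sided(1.0))              # baseline from Example 6.29: 0.80
print(power_one_sided(1.0, alpha=0.10))  # larger alpha: 0.89
print(power_one_sided(1.0, n=50))        # larger n: 0.97
print(power_one_sided(1.5))              # more distant alternative: 0.98
print(power_one_sided(1.0, sigma=1.5))   # smaller sigma: 0.95
```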

Power calculations are important in planning studies. Using a significance test with low power makes it unlikely that you will find a significant effect even if the truth is far from the null hypothesis. A null hypothesis that is, in fact, false can become widely believed if repeated attempts to find evidence against it fail because of low power. The following example illustrates this point.


EXAMPLE 6.32

Are stock markets efficient? The “efficient market hypothesis” for the time series of stock prices says that future stock prices (when adjusted for inflation) show only random variation. No information available now will help us predict stock prices in the future because the efficient working of the market has already incorporated all available information in the present price. Many studies have tested the claim that one or another kind of information is helpful. In these studies, the efficient market hypothesis is H0, and the claim that prediction is possible is Ha. Almost all the studies have failed to find good evidence against H0. As a result, the efficient market hypothesis is quite popular. But an examination of the significance tests employed finds that the power is generally low. Failure to reject H0 when using tests of low power is not evidence that H0 is true. As one expert says, “The widespread impression that there is strong evidence for market efficiency may be due just to a lack of appreciation of the low power of many statistical tests.”30

Inference as decision

We have presented tests of significance as methods for assessing the strength of evidence against the null hypothesis. This assessment is made by the P-value, which is a probability computed under the assumption that H0 is true. The alternative hypothesis (the statement we seek evidence for) enters the test only to help us see what outcomes count against the null hypothesis.

There is another way to think about these issues. Sometimes, we are really concerned about making a decision or choosing an action based on our evaluation of the data. Acceptance sampling is one such circumstance. A producer of bearings and a skateboard manufacturer agree that each carload lot of bearings shall meet certain quality standards. When a carload arrives, the manufacturer chooses a sample of bearings to be inspected. On the basis of the sample outcome, the manufacturer will either accept or reject the carload. Let’s examine how the idea of inference as a decision changes the reasoning used in tests of significance.

Two types of error

Tests of significance concentrate on H0, the null hypothesis. If a decision is called for, however, there is no reason to single out H0. There are simply two hypotheses, and we must accept one and reject the other. It is convenient to call the two hypotheses H0 and Ha, but H0 no longer has the special status (the statement we try to find evidence against) that it had in tests of significance. In the acceptance sampling problem, we must decide between

H0: the lot of bearings meets standards

Ha: the lot does not meet standards

on the basis of a sample of bearings.

We hope that our decision will be correct, but sometimes it will be wrong. There are two types of incorrect decisions. We can accept a bad lot of bearings, or we can reject a good lot. Accepting a bad lot injures the consumer, while rejecting a good lot hurts the producer. To help distinguish these two types of error, we give them specific names.


Figure 6.19 The two types of error in testing hypotheses.

Figure 6.20 The two types of error in the acceptance sampling setting.

TYPE I AND TYPE II ERRORS

If we reject H0 (accept Ha) when in fact H0 is true, this is a Type I error. If we accept H0 (reject Ha) when in fact Ha is true, this is a Type II error.

The possibilities are summed up in Figure 6.19. If H0 is true, our decision either is correct (if we accept H0) or is a Type I error. If Ha is true, our decision either is correct or is a Type II error. Only one error is possible at one time. Figure 6.20 applies these ideas to the acceptance sampling example.

Error probabilities

Any rule for making decisions is assessed in terms of the probabilities of the two types of error. This is in keeping with the idea that statistical inference is based on probability. We cannot (short of inspecting the whole lot) guarantee that good lots of bearings will never be rejected and bad lots never be accepted. But by random sampling and the laws of probability, we can say what the probabilities of both kinds of error are.

Significance tests with fixed level α give a rule for making decisions because the test either rejects H0 or fails to reject it. If we adopt the decision-making way of thought, failing to reject H0 means deciding that H0 is true. We can then describe the performance of a test by the probabilities of Type I and Type II errors.

EXAMPLE 6.33


Outer diameter of a skateboard bearing. The mean outer diameter of a skateboard bearing is supposed to be 22.000 millimeters (mm). The outer diameters vary Normally with standard deviation σ = 0.010 mm. When a lot of the bearings arrives, the skateboard manufacturer takes an SRS of five bearings from the lot and measures their outer diameters. The manufacturer rejects the bearings if the sample mean diameter is significantly different from 22 mm at the 5% significance level.

This is a test of the hypotheses

H0: μ = 22

Ha: μ ≠ 22


To carry out the test, the manufacturer computes the z statistic:

z = (x̄ − 22)/(0.010/√5)

and rejects H0 if

z < −1.96  or  z > 1.96

A Type I error is to reject H0 when in fact μ = 22.

What about Type II errors? Because there are many values of μ in Ha, we will concentrate on one value. The producer and the manufacturer agree that a lot of bearings with mean 0.015 mm away from the desired mean 22.000 should be rejected. So a particular Type II error is to accept H0 when in fact μ = 22.015.

Figure 6.21 shows how the two probabilities of error are obtained from the two sampling distributions of x̄, for μ = 22 and for μ = 22.015. When μ = 22, H0 is true and to reject H0 is a Type I error. When μ = 22.015, accepting H0 is a Type II error. We will now calculate these error probabilities.

Figure 6.21 The two error probabilities, Example 6.33. The probability of a Type I error (yellow area) is the probability of rejecting H0: μ = 22 when, in fact, μ = 22. The probability of a Type II error (blue area) is the probability of accepting H0 when, in fact, μ = 22.015.

The probability of a Type I error is the probability of rejecting H0 when it is really true. In Example 6.33, this is the probability that |z| ≥ 1.96 when μ = 22. But this is exactly the significance level of the test. The critical value 1.96 was chosen to make this probability 0.05, so we do not have to compute it again. The definition of “significant at level 0.05” is that sample outcomes this extreme will occur with probability 0.05 when H0 is true.

SIGNIFICANCE AND TYPE I ERROR

The significance level α of any fixed level test is the probability of a Type I error. That is, α is the probability that the test will reject the null hypothesis H0 when H0 is in fact true.
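This fact can also be checked by simulation. A minimal Monte Carlo sketch (our own construction) using the setting of Example 6.33:

```python
import numpy as np

# Simulate the test of Example 6.33 when H0 is true (mu = 22, sigma = 0.010,
# n = 5) and estimate how often it rejects: the Type I error rate.
rng = np.random.default_rng(0)
mu0, sigma, n, reps = 22.0, 0.010, 5, 100_000

xbar = rng.normal(mu0, sigma, size=(reps, n)).mean(axis=1)
z = (xbar - mu0) / (sigma / np.sqrt(n))
print((np.abs(z) >= 1.96).mean())  # close to alpha = 0.05
```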


The probability of a Type II error for the particular alternative μ = 22.015 in Example 6.33 is the probability that the test will fail to reject H0 when μ has this alternative value. The power of the test to detect the alternative μ = 22.015 is just the probability that the test does reject H0. By following the method of Example 6.30, we can calculate that the power is about 0.92. The probability of a Type II error is therefore 1 − 0.92, or 0.08.
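A short Python sketch (our own), following the method of Example 6.30, reproduces these two numbers:

```python
from math import sqrt

from scipy.stats import norm

# Example 6.33: power against mu = 22.015 and the Type II error probability.
mu0, sigma, n, mu_alt = 22.0, 0.010, 5, 22.015
se = sigma / sqrt(n)

upper = mu0 + 1.96 * se   # the test rejects H0 when xbar falls beyond these limits
lower = mu0 - 1.96 * se

power = (1 - norm.cdf((upper - mu_alt) / se)) + norm.cdf((lower - mu_alt) / se)
print(f"power = {power:.2f}, Type II error = {1 - power:.2f}")  # 0.92 and 0.08
```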

POWER AND TYPE II ERROR

The power of a fixed level test to detect a particular alternative is 1 minus the probability of a Type II error for that alternative.

The two types of error and their probabilities give another interpretation of the significance level and power of a test. The distinction between tests of significance and tests as rules for deciding between two hypotheses does not lie in the calculations but in the reasoning that motivates the calculations. In a test of significance, we focus on a single hypothesis (H0) and a single probability (the P-value). The goal is to measure the strength of the sample evidence against H0. Calculations of power are done to check the sensitivity of the test. If we cannot reject H0, we conclude only that there is not sufficient evidence against H0, not that H0 is actually true.

If the same inference problem is thought of as a decision problem, we focus on two hypotheses and give a rule for deciding between them based on the sample evidence. Therefore, we must focus equally on two probabilities, the probabilities of the two types of error. We must choose one hypothesis and cannot abstain on grounds of insufficient evidence.

The common practice of testing hypotheses

Such a clear distinction between the two ways of thinking is helpful for understanding. In practice, however, the two approaches often merge. We continue to call one of the hypotheses in a decision problem H0. The common practice of testing hypotheses mixes the reasoning of significance tests and decision rules as follows:

  1. State H0 and Ha just as in a test of significance.

  2. Think of the problem as a decision problem, so that the probabilities of Type I and Type II errors are relevant.

  3. Because of Step 1, Type I errors are more serious. So choose an α (significance level) and consider only tests with probability of a Type I error no greater than α.

  4. Among these tests, select one that makes the probability of a Type II error as small as possible (that is, power as large as possible). If this probability is too large, you will have to take a larger sample to reduce the chance of an error.

Testing hypotheses may seem to be a hybrid approach. It was, historically, the effective beginning of decision-oriented ideas in statistics. An impressive mathematical theory of hypothesis testing was developed between 1928 and 1938 by Jerzy Neyman and Egon Pearson. The decision-making approach came later (1940s). Because decision theory in its pure form leaves you with two error probabilities and no simple rule on how to balance them, it has been used less often than either tests of significance or tests of hypotheses. Decision ideas have been applied in testing problems mainly by way of the Neyman-Pearson hypothesis-testing theory. That theory asks you first to choose α, and the influence of Fisher has often led users of hypothesis testing comfortably back to α = 0.05 or α = 0.01. Fisher, who was exceedingly argumentative, violently attacked the Neyman-Pearson decision-oriented ideas, and the argument still continues.