Chapter 23: Use and Abuse of Statistical Inference

Using inference wisely

In previous chapters, we have met the two major types of statistical inference: confidence intervals and significance tests. We have, however, seen only two inference methods of each type, one designed for inference about a population proportion p and the other designed for inference about a population mean μ. There are libraries of both books and software filled with methods for inference about various parameters in various settings. The reasoning of confidence intervals and significance tests remains the same, regardless of the method. The first step in using inference wisely is to understand your data and the questions you want to answer and fit the method to its setting. Here are some tips on inference, adapted to the settings we are familiar with.

The design of the data production matters. “Where do the data come from?” remains the first question to ask in any statistical study. Any inference method is intended for use in a specific setting. For our confidence interval and test for a proportion p:

• The data must be a simple random sample (SRS) from the population of interest. When you use these methods, you are acting as if the data are an SRS. In practice, it is often not possible to actually choose an SRS from the population. Your conclusions may then be open to challenge.
• These methods are not correct for sample designs more complex than an SRS, such as stratified samples. There are other methods that fit these settings.
• There is no correct method for inference from data haphazardly collected with bias of unknown size. Fancy formulas cannot rescue badly produced data.
• Other sources of error, such as dropouts and nonresponse, are important. Remember that confidence intervals and tests use the data you collect and ignore these errors.

EXAMPLE 1 The psychologist and the women’s studies professor

A psychologist is interested in how our visual perception can be fooled by optical illusions. Her subjects are students in Psychology 101 at her university. Most psychologists would agree that it’s safe to treat the students as an SRS of all people with normal vision. There is nothing special about being a student that changes visual perception.

A professor at the same university uses students in Women’s Studies 101 to examine attitudes toward violence against women and reproductive rights. Students as a group are younger than the adult population as a whole. Even among young people, students as a group come from more prosperous and better-educated homes. Even among students, this university isn’t typical of all campuses. Even on this campus, students in a women’s studies course may have opinions that are quite different from those of students who do not take Women’s Studies 101. The professor can’t reasonably act as if these students are a random sample from any population of interest other than students taking Women’s Studies 101 at this university during this term.

Know how confidence intervals behave. A confidence interval estimates the unknown value of a parameter and also tells us how uncertain the estimate is. All confidence intervals share these behaviors:

• The confidence level says how often the method catches the true parameter when sampling many times. We never know whether this specific data set gives us an interval that contains the true value of the parameter. All we can say is that “we got this result from a method that works 95% of the time.” This data set might be one of the 5% that produce an interval that misses the true value of the parameter. If that risk is too high for you, use a 99% confidence interval.
• High confidence is not free. A 99% confidence interval will be wider than a 95% confidence interval based on the same data. To be more confident, we must have more values to be confident about. There is a trade-off between how closely we can pin down the true value of the parameter (the precision of the confidence interval) and how confident we are that we have captured its true value.
• Larger samples give narrower intervals. If we want high confidence and a narrow interval, we must take a larger sample. The width of our confidence interval for p is proportional to 1 over the square root of the sample size, so multiplying the sample size by 4 divides the width by only 2. To cut the width of the interval in half, we must take four times as many observations. This is typical of many types of confidence intervals.
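
These three behaviors can be checked numerically. The sketch below is only an illustration, not a method taken from this chapter's examples: it assumes the usual large-sample interval for a proportion, $\hat{p} \pm z^* \sqrt{\hat{p}(1-\hat{p})/n}$, and the sample values (a sample proportion of 0.6, samples of 400 and 1600) and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, level=0.95):
    """Large-sample confidence interval for a population proportion."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def width(interval):
    low, high = interval
    return high - low


def coverage(true_p, n, level, repetitions, seed=1):
    """Fraction of simulated SRSs whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = proportion_ci(successes / n, n, level)
        hits += low <= true_p <= high
    return hits / repetitions


# Hypothetical sample: p_hat = 0.6 from n = 400 observations.
w95 = width(proportion_ci(0.6, 400, 0.95))
w99 = width(proportion_ci(0.6, 400, 0.99))     # same data, higher confidence
w95_4n = width(proportion_ci(0.6, 1600, 0.95))  # four times the sample size

# How often does the 95% method catch a (hypothetical) true p of 0.6?
cov95 = coverage(0.6, 400, 0.95, repetitions=2000)

print(f"95% width, n = 400:  {w95:.4f}")
print(f"99% width, n = 400:  {w99:.4f}")
print(f"95% width, n = 1600: {w95_4n:.4f}")
print(f"simulated coverage of the 95% method: {cov95:.3f}")
```

The 99% interval comes out wider than the 95% interval from the same data, and the interval from 1600 observations is half as wide as the one from 400. The simulated coverage should land close to 0.95, though not exactly on it: the confidence level describes the method over many samples, not any single interval.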

Dropping out. An experiment found that weight loss is significantly more effective than exercise for reducing high cholesterol and high blood pressure. The 170 subjects were randomly assigned to a weight-loss program, an exercise program, or a control group. Only 111 of the 170 subjects completed their assigned treatment, and the analysis used data from these 111. Did the dropouts create bias? Always ask about details of the data before trusting inference.

Know what statistical significance says. Many statistical studies hope to show that some claim is true. A clinical trial compares a new drug with a standard drug because the doctors hope that the health of patients given the new drug will improve. A psychologist studying gender differences suspects that women will do better than men (on the average) on a test that measures social-networking skills. The purpose of significance tests is to weigh the evidence that the data give in favor of such claims. That is, a test helps us know if we found what we were looking for.

To do this, we ask what would happen if the claim were not true. That’s the null hypothesis: no difference between the two drugs, no difference between women and men. A significance test answers only one question: “How strong is the evidence that the null hypothesis is not true?” A test answers this question by giving a P-value. The P-value tells us how likely data as or more extreme than ours would be if the null hypothesis were true. Data that would be very unlikely if the null hypothesis were true, and so have a small P-value, are good evidence that the null hypothesis is not true. We usually don’t know whether the hypothesis is true for this specific population. All we can say, for a P-value of 0.05, is that “data as or more extreme than these would occur only 5% of the time if the hypothesis were true.”
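
To make the idea of a P-value concrete, here is a minimal sketch of a one-sided test for a proportion. It assumes the large-sample Normal approximation for the sample proportion. The data (60 successes in 100 trials), the null value 0.5, and the function name are hypothetical choices made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(successes, n, p0):
    """P-value for testing H0: p = p0 against Ha: p > p0.

    Uses the Normal approximation to the sampling distribution of the
    sample proportion, computed as if the null hypothesis were true.
    """
    p_hat = successes / n
    # Standardize p_hat using the standard deviation it would have under H0.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # Probability of a result as large as or larger than ours when H0 is true.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical data: 60 successes in an SRS of 100, null hypothesis p = 0.5.
z, p_value = one_sided_p_value(60, 100, 0.5)
print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```

Here the sample proportion sits two standard deviations above the null value, and the P-value is about 0.023. A result this far above 0.5 would occur a little over 2% of the time if the null hypothesis were true. That is evidence against the null hypothesis, not a statement about the probability that the hypothesis itself is true.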

This kind of indirect evidence against the null hypothesis (and for the effect we hope to find) is less straightforward than a confidence interval. We will say more about tests in the next section.

Know what your methods require. Our significance test and confidence interval for a population proportion p require that the population size be much larger than the sample size. They also require that the sample size itself be reasonably large so that the sampling distribution of the sample proportion $\hat{p}$ is close to Normal. We have said little about the specifics of these requirements because the reasoning of inference is more important. Just as there are inference methods that fit stratified samples, there are methods that fit small samples and small populations. If you plan to use statistical inference in practice, you will need help from a statistician (or need to learn lots more statistics) to manage the details.

Most of us read about statistical studies more often than we actually work with data ourselves. Concentrate on the big issues, not on the details of whether the authors used exactly the right inference methods. Does the study ask the right questions? Where did the data come from? Do the results make sense? Does the study report confidence intervals so you can see both the estimated values of important parameters and how uncertain the estimates are? Does it report P-values to help convince you that findings are not just good luck?