7.7 Inference: From Sample to Population

A USA Today/Pew Research Center poll conducted in 2014 interviewed a random sample of 1500 adults. The result was that 60% agreed with the following statement: “Most people who want to get ahead can make it if they’re willing to work hard.” That 60% describes the 1500 people in the sample. But what is the truth about the 230 million American adults who make up the population? Because the sample was chosen at random, it’s reasonable to think that these 1500 people represent the entire population fairly well. So the researchers turn the fact that 60% of the sample believe that people can get ahead if they’re willing to work hard into an estimate that about 60% of all adults feel this way. That’s a fundamental operation in statistics: Use a fact about a sample to estimate the truth about the whole population. We call this statistical inference.

Statistical Inference DEFINITION

Statistical inference refers to methods for drawing conclusions about an entire population on the basis of data from a sample.

If the selected individuals were chosen at random, we think that they fairly represent the population and inference makes sense. However, if we have data from only a convenience sample or a voluntary response sample, the data do not represent the population and we can’t use them for inference. Statistical inference works only if the data come from a random sample or randomized comparative experiment. That’s why this chapter starts with producing reliable data before moving on to inference from the data to a larger population.

To think about inference, we must keep straight whether a number describes a sample or a population.


Parameter DEFINITION

A parameter is a fixed (usually unknown) number that describes some characteristic of a population.

Statistic DEFINITION

A statistic is a number that describes some characteristic of a sample. The value of a statistic is known when we have taken a sample, but it can change from sample to sample. We often use a statistic to estimate an unknown parameter.

To avoid confusing these terms, remember Parameters are for Populations and Statistics are for Samples. We can’t determine the true value of a parameter unless we examine the entire population, which isn’t usually possible. However, we can estimate the unknown parameter based on information from a sample statistic.

EXAMPLE 14 Working Hard and Getting Ahead: A Sample Statistic

The actual results from the poll discussed at the start of this section were that 885 of the 1484 people who answered the question agreed with the statement “Most people who want to get ahead can make it if they’re willing to work hard.” (Notice that not all of the 1500 people surveyed answered this question.) The proportion of the respondents who agreed was

$$\hat{p} = \frac{885}{1484} \approx 0.60$$

Algebra Review Appendix

Fractions, Percents, and Percentages

The symbol $\hat{p}$ is read “p-hat.” The ^ symbol here tells us that a quantity has been estimated, just as the use of $\hat{y}$ in Chapter 6 (page 253) told us that a value was estimated by using a regression-line model. The number $\hat{p} \approx 0.60$ is a statistic. The corresponding parameter is the proportion (call it $p$) of all adult U.S. residents who would have responded “Agree” if questioned about the same statement. We don’t know the value of the parameter $p$, so we use the statistic $\hat{p}$ to estimate it.

From Example 14, we have an estimate of $\hat{p} \approx 0.60$ for the population proportion $p$. But how good is our estimate? If the Pew Research Center took a second random sample of 1500 adults, the new sample would have different people in it. It is almost certain that there would not be exactly 885 responses in agreement with the statement. That is, the value of the statistic $\hat{p}$ will vary from sample to sample. If the variation when we take repeat samples from the same population is too great, then we can’t trust the results of any one sample.

In practice, it is too expensive to take many samples from a large population, such as all adult U.S. residents. But we can use a computer to imitate drawing many samples at random from a population that we specify. This is called simulation. Example 15 explores what happens when we do this.
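The kind of simulation described here is easy to sketch in code. The following is a minimal, hypothetical sketch (the names and the use of NumPy are our own choices, not from the text): it draws 1000 simple random samples of size 100 from a population in which 60% would respond “Agree,” treating each sample count as a binomial outcome, and records the sample proportion for each.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed so the sketch is reproducible

p = 0.6        # population proportion of "Agree" responses (known, because we set it)
n = 100        # sample size
trials = 1000  # number of simulated samples

# For a large population, the count of "Agree" responses in an SRS of size n
# is well approximated by a binomial(n, p) draw; dividing by n gives p-hat.
counts = rng.binomial(n, p, size=trials)
p_hats = counts / n

print(p_hats[:5])               # a few simulated sample proportions
print(round(p_hats.mean(), 3))  # close to 0.6
```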

EXAMPLE 15 What Happens in Many Samples?

We start with a scenario in which we know that the population proportion responding “Agree” to some statement is $p = 0.6$. Using computer simulation, we repeatedly take random samples, first of size 100 and later of size 1500. For each sample, we calculate the sample proportion $\hat{p}$.


Results from Samples of Size 100

A histogram of the 1000 values of $\hat{p}$ from our computer-simulated data appears in Figure 7.6. This histogram gives us an idea of the shape, center, and spread of the distribution of the sample proportion $\hat{p}$ for samples of size 100 drawn from a population in which $p = 0.6$.

Figure 7.6 Draw 1000 SRSs of size 100 from a population with proportion $p = 0.6$ of “Agrees.” The histogram shows the distribution of the 1000 sample proportions $\hat{p}$.

The USA Today/Pew Research Center poll interviewed around 1500 people, not just 100. Again, we use computer simulation to generate 1000 samples of size 1500 and record the value of the sample proportion $\hat{p}$ for each sample. A histogram of the $\hat{p}$-values based on these 1000 samples is shown in Figure 7.7.

Figure 7.7 Draw 1000 SRSs of size 1500 from a population with proportion $p = 0.6$ of “Agrees.” The histogram shows the distribution of the 1000 sample proportions $\hat{p}$.


For comparative purposes, Figures 7.6 and 7.7 are drawn using the same horizontal scale. This allows us to compare what happens when we increase the size of our samples from 100 to 1500. These histograms display the sampling distribution of the statistic $\hat{p}$ for two sample sizes. Notice that for both situations, the histograms are centered at $p = 0.6$, the known value of the parameter. The histograms are single-peaked and roughly symmetric—what we would expect for a normal distribution (refer to Section 5.8, page 209). However, the variability is much smaller for the situation in which the sample size is 1500 compared to only 100.
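The contrast between Figures 7.6 and 7.7 can also be checked numerically. Extending the hypothetical sketch above, the loop below simulates 1000 samples at each of the two sample sizes and prints the mean and standard deviation of the simulated proportions; the center stays near 0.6 while the spread shrinks for the larger samples.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
p, trials = 0.6, 1000

for n in (100, 1500):
    p_hats = rng.binomial(n, p, size=trials) / n
    # Same center, much smaller spread when n = 1500 (compare Figures 7.6 and 7.7).
    print(f"n = {n:4d}: mean = {p_hats.mean():.4f}, sd = {p_hats.std():.4f}")
```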

Self Check 8

  a. Under the scenario described in Example 15, what are the possible values for $\hat{p}$ if the sample size is 2?
  b. Suppose a computer simulation is used to generate 1000 samples of size 2 from a population in which $p = 0.6$. Which of the values for $\hat{p}$ listed in part (a) do you expect to occur least frequently?

Sampling Distribution DEFINITION

The sampling distribution of a statistic is the distribution of values taken on by the statistic in all possible samples of the same size from the same population.

Strictly speaking, the sampling distribution is the ideal pattern that would emerge if we looked at all possible samples of the same size from our population. A distribution obtained from a fixed number of trials, like the 1000 trials in Figures 7.6 and 7.7, is only an approximation of the sampling distribution. However, the results from our simulations support a general theoretical result, which is based on probability theory (introduced in Chapter 8). We now turn to probability theory to learn the mathematical facts that lie behind the simulations. We’ll use the word success for whatever we are counting, such as “Agree” responses in the USA Today/Pew Research Center poll. Note that success does not necessarily have the positive (or negative) association it does in real life, but is simply a convenient way to identify an outcome. For example, in a cancer study, success might unfortunately signify that a person developed cancer.

Sampling Distribution of a Sample Proportion THEOREM

Choose an SRS of size $n$ from a large population that contains population proportion $p$ of successes. Let $\hat{p}$ be the sample proportion of successes, expressed as

$$\hat{p} = \frac{\text{number of successes in the sample}}{n}$$

Then:

  • Shape: For large sample sizes $n$, the sampling distribution of $\hat{p}$ is approximately normal.
  • Center: The mean of the sampling distribution of $\hat{p}$ is $p$.
  • Variability: The standard deviation of the sampling distribution of $\hat{p}$ is

$$\sqrt{\frac{p(1-p)}{n}}$$


Figure 7.8 Repeat many times the process of selecting an SRS of size $n$ from a population in which the proportion $p$ are successes. The values of the sample proportion of successes $\hat{p}$ have this normal sampling distribution.

Figure 7.8 summarizes these facts in a form that reminds us that a sampling distribution describes the results of lots of samples from the same population.
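The center and variability named in the theorem can be computed directly for any $p$ and $n$. Here is a small sketch (the function name is ours, for illustration only) of the standard deviation formula $\sqrt{p(1-p)/n}$.

```python
import math

def sampling_sd(p: float, n: int) -> float:
    """Standard deviation of the sampling distribution of p-hat: sqrt(p(1 - p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# The mean of the sampling distribution is simply p; the spread shrinks as n grows.
print(round(sampling_sd(0.6, 100), 4))   # about 0.0490
print(round(sampling_sd(0.6, 1500), 4))  # about 0.0126
```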

EXAMPLE 16 Comparing Simulation Results to Theory

Return to the scenario in which the population proportion who would respond “Agree” to some statement is $p = 0.6$. In Example 15, we used computer simulation to generate 1000 samples of size 1500 from this population and record the sample proportion $\hat{p}$ of “Agree” responses for each sample. Figure 7.7 (page 316) shows one histogram of these $\hat{p}$-values. However, the histogram in Figure 7.9 (based on the same data) gives a better sense of the overall shape of the data. We also computed the mean and standard deviation of the $\hat{p}$-values: mean $= 0.59982$ and standard deviation $= 0.01255$.

Figure 7.9 Histogram of the same data used for Figure 7.7. The horizontal scale has been changed to better show the normal shape of the data.

Next, we look at what the theorem tells us about the sampling distribution of $\hat{p}$. The distribution of $\hat{p}$ in many samples

  • Is close to normal
  • Has mean 0.6
  • Has standard deviation 0.0126

To show our work for the last number, note that $\frac{(0.6)(0.4)}{1500} = 0.00016$, and the square root of 0.00016 is approximately 0.0126:

$$\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{(0.6)(1-0.6)}{1500}} = \sqrt{0.00016} \approx 0.0126$$


Finally, we compare the simulation results to the results from the theorem. First, the normal curve in Figure 7.9 does a reasonable job of summarizing the shape of the histogram. Second, the mean of 0.60 and standard deviation of 0.0126 from the mathematics are very close to the mean of 0.59982 and standard deviation of 0.01255 we observed in our simulation data. If the simulation used more than 1000 trials, the results would be still closer to the mathematical theory.

Self Check 9

Return to Example 16. Suppose now that 70% of all adults in a population would reply “Agree” to some statement. Take simple random samples of 1500 adults.

  a. What is the shape of the distribution of $\hat{p}$ in many samples?
  b. What is the mean of the distribution?
  c. What is its standard deviation?
  d. How did the increase in the value of the population proportion from 0.6 to 0.7 change the distribution of $\hat{p}$?

Look back at Figure 7.9. Notice that most of the $\hat{p}$-values lie close to the actual population proportion of $p = 0.6$. Hence, the sampling distribution shows why we can trust the results of a large random sample—a high percentage of such samples give results (values of $\hat{p}$) that are close to the truth about the population.

EXAMPLE 17 The 68–95–99.7 Rule Again

In Example 16, the population parameter, the proportion of adults who agreed with some statement, is $p = 0.6$. If we take SRSs of size 1500, the sample proportions $\hat{p}$ follow the normal distribution with a mean of 0.6 and a standard deviation of 0.0126. The “95” part of the 68–95–99.7 rule from Section 5.9 (page 216) says that 95% of all samples give a $\hat{p}$ within 2 standard deviations of the truth about the population. So in this example, 95% of all samples have $\hat{p}$ within $2 \times 0.0126 = 0.0252$ of 0.60, that is, between 0.57 and 0.63 (rounded to two decimal places). Figure 7.10 illustrates this use of the 68–95–99.7 rule.

Algebra Review Appendix

Rounding Numbers

Figure 7.10 The sampling distribution of $\hat{p}$ for Example 16. By the 68–95–99.7 rule, 95% of all samples have a sample proportion $\hat{p}$ within 0.0252 of the true population proportion $p = 0.6$.


We can repeat this reasoning for any value of the parameter $p$ and the sample size $n$. Using the 68–95–99.7 rule, it is always true that 95% of all samples give a sample proportion $\hat{p}$ within 2 standard deviations of the population proportion $p$. Now, suppose a sample is one of the 95% of all samples for which $\hat{p}$ lies within 2 standard deviations of $p$, as shown in Figure 7.11. Then we can turn things around and say that the interval from $\hat{p} - 2$(standard deviation) to $\hat{p} + 2$(standard deviation) contains $p$.

Figure 7.11 If $\hat{p}$ is within 2 standard deviations of $p$ (lower interval), then $p$ is within 2 standard deviations of $\hat{p}$ (upper interval).

That means 95% of all samples catch $p$ in the interval extending 2 standard deviations on either side of $\hat{p}$, which is the interval

$$\hat{p} \pm 2\sqrt{\frac{p(1-p)}{n}}$$

This formula tells us how close the unknown parameter $p$ lies to the observed statistic $\hat{p}$ in 95% of all samples. But there is one problem: We can’t calculate the interval from the data because the standard deviation involves the population proportion $p$, and in practice we don’t know $p$.

What to do? The standard deviation of the statistic $\hat{p}$ does depend on the parameter $p$, but it doesn’t change a lot when $p$ changes. We can go back to Example 16 and redo the calculation for other values of $p$ when $n = 1500$. The results appear in Table 7.5. (You can fill in the value for $p = 0.7$ from your results to Self Check 9c.)

Table 7.5 Standard Deviation of $\hat{p}$ for Different Values of $p$ (samples of size $n = 1500$)

Value of $p$:           0.4      0.5      0.6      0.7      0.8
Standard deviation:     0.0126   0.0129   0.0126   ______   0.0103
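The entries in Table 7.5 come from applying the theorem’s standard deviation formula with $n = 1500$ and different values of $p$; a short loop such as the hypothetical sketch below reproduces them (and also produces the value asked for in Self Check 9c).

```python
import math

n = 1500
for p in (0.4, 0.5, 0.6, 0.7, 0.8):
    sd = math.sqrt(p * (1 - p) / n)  # standard deviation of p-hat for this p
    print(f"p = {p}: standard deviation = {sd:.4f}")
# Every printed value rounds to about 0.01.
```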

The standard deviations are all 0.01 when rounded to the hundredths place. You see that if we guess a value of $p$ that is reasonably close to the true value, the standard deviation found from the guessed value will be about right. We know that when we take a large random sample, the statistic $\hat{p}$ is almost always close to the parameter $p$. So we will use $\hat{p}$ as the guessed value of the unknown $p$. Now we have an interval estimate for $p$ that we can calculate from the sample data. We call it a confidence interval.


Confidence Interval DEFINITION

A 95% confidence interval is an interval obtained from the sample data by a method in which 95% of all samples will produce an interval containing the true population parameter.

Confidence Interval for Population Proportion FORMULA

Choose an SRS of size $n$ from a large population that contains an unknown proportion $p$ of successes. A 95% confidence interval for $p$ is approximately

$$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

The ± sign is read “plus or minus,” so, for example, $0.5 \pm 0.2$ yields two numbers: $0.5 - 0.2 = 0.3$ and $0.5 + 0.2 = 0.7$. This can be written as an interval: (0.3, 0.7).

This formula is only approximately correct but is quite accurate when the sample size $n$ is large. Here, $\hat{p}$ is the sample proportion of successes, and $2\sqrt{\hat{p}(1-\hat{p})/n}$, the expression to the right of the ± sign, is the margin of error. When results of polls are reported in the news, the margin of error is commonly reported along with the percentage estimate for the population proportion.

Margin of Error DEFINITION

The margin of error is equal to half of the width of a confidence interval. For a 95% confidence interval, it equals about 2 standard deviations of the sampling distribution of the estimated parameter. If you conducted a very large number of polls, about 95% of the time the difference between a particular poll’s result and the true value of the population parameter would be within the margin of error.
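Putting the formula and the margin of error together, a 95% confidence interval can be computed from just the sample proportion and the sample size. The helper below is a minimal hypothetical sketch (the function name is ours, not from the text).

```python
import math

def confidence_interval_95(p_hat: float, n: int) -> tuple:
    """Approximate 95% confidence interval for a population proportion:
    p_hat +/- 2 * sqrt(p_hat * (1 - p_hat) / n).
    Returns (lower, upper, margin_of_error)."""
    margin = 2 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

# Example: a sample proportion of 0.60 from an SRS of 1500 people.
low, high, moe = confidence_interval_95(0.60, 1500)
print(round(low, 3), round(high, 3), round(moe, 3))  # about 0.575, 0.625, 0.025
```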

EXAMPLE 18 Americans’ Concern About Climate Change

Are Americans concerned about climate change, and if so, what would they be willing to sacrifice? In 2014, the Bloomberg National Poll surveyed 1005 U.S. adults on climate change and other topics. Of those surveyed:

  • 462 viewed climate change as a major threat.
  • 623 indicated that they would be willing to pay more for energy if air pollution from carbon emissions could be reduced.

Algebra Review Appendix

Powers and Roots

The sample proportion who viewed climate change as a major threat is

$$\hat{p} = \frac{462}{1005} \approx 0.4597$$

A 95% confidence interval for the proportion $p$ of all U.S. adults who view climate change as a major threat is

$$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.4597 \pm 2\sqrt{\frac{(0.4597)(0.5403)}{1005}} \approx 0.4597 \pm 0.0314$$

that is, the interval from 0.4283 to 0.4911.


A report of these calculations might say, “The study found that approximately 46% of U.S. adults viewed climate change as a major threat. The margin of error for this result is 3.1%.”
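As a check on the arithmetic in Example 18, the same interval can be reproduced with a few lines (a hypothetical sketch; the rounding shown is ours):

```python
import math

p_hat = 462 / 1005                                         # about 0.4597
margin = 2 * math.sqrt(p_hat * (1 - p_hat) / 1005)         # about 0.0314, i.e., 3.1%
print(round(p_hat - margin, 4), round(p_hat + margin, 4))  # about 0.4283 and 0.4911
```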


Self Check 10

  a. Using the information from Example 18, determine a 95% confidence interval for the percentage of U.S. adults who would be willing to pay more for energy if air pollution from carbon emissions could be reduced.
  b. Compare the margin of error of the confidence interval you calculated in part (a) to the one calculated in Example 18.

We got the interval in Example 18 by using a formula that catches the true unknown population proportion $p$ in 95% of all samples. The shorthand for this is: We are 95% confident that the true percentage of adults in the United States who view climate change as a major threat lies between 42.83% and 49.11%.

Keep in mind that the 95% confidence level refers to the track record of using the confidence interval formula. This formula results in an interval that contains the true unknown population proportion in 95% of all samples. However, that also means the true value of the population proportion lies outside the calculated interval in 5% of all samples. We’ll never know whether our particular interval contains $p$ or not. (To gain a better understanding of the meaning of confidence level, check out Applet Exercise 4 on page 339.)

The length of a confidence interval depends on how confident we want to be that the interval does capture the true parameter value. It is common to use 95% confidence, but you can ask for higher or lower confidence if you want. Our 95% confidence interval was based on the middle 95% of a normal distribution. A 99% confidence interval requires the middle 99% of the distribution and therefore is wider (has a larger margin of error), as can be seen from Figure 7.12.


Figure 7.12 Determining margins of error for different confidence levels.

The length of a 95% confidence interval also depends on the size of the sample. Larger samples give shorter intervals because of the $n$ in the denominator of the margin of error. But the interval does not depend on the size of the population. This is true so long as the population is much larger than the sample. The confidence interval in Example 18 works for a sample of 1005 from a city with 100,000 adults as well as it does for a sample of 1005 from a nation of 308 million. What matters is how many people we interview, not what percentage of the population we contact.
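This dependence on the sample size, and the lack of dependence on the population size, is easy to see by computing the margin of error for several values of $n$ while holding the sample proportion fixed. A small hypothetical sketch:

```python
import math

p_hat = 0.46  # roughly the sample proportion from Example 18
for n in (100, 1005, 4000, 10000):
    margin = 2 * math.sqrt(p_hat * (1 - p_hat) / n)
    # The population size never enters the calculation; only n matters.
    print(f"n = {n:5d}: margin of error = {margin:.3f}")
```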

EXAMPLE 19 Understanding the News

Here’s what the TV news announcer says: “A new Gallup poll on American exercise habits finds that 45% of adults are not engaging in vigorous sports or physical activities. The margin of error for the poll was 3 percentage points.” Plus or minus 3%, starting at 45%, is 42% to 48%. People with minimal statistics knowledge may think that the truth about the entire population must be in that interval, but now we know better!

This is the full background Gallup actually gives: “For results based on this sample, one can say with 95% confidence that the maximum error attributable to sampling and other random effects is 3 percentage points. In addition to sampling error, question wording and practical difficulties in conducting surveys can introduce error or bias into the findings of public opinion polls.” That is, Gallup tells us that the margin of error works only for 95% of all its samples; “95% confidence” is shorthand for that longer fact. The news report left out the “95% confidence.” In fact, almost all margins of error in the news are for 95% confidence. If you don’t see the confidence level in a scientific poll, it’s usually safe to assume 95%.


Gallup’s mention of “question wording and practical difficulties” in Example 19 takes us back to our cautions about sample surveys in Section 7.4. The margin of error does not address nonresponse and other practical difficulties. The margin of error in a confidence interval comes from the sampling distribution of the statistic. The sampling distribution describes the variation of the statistic due to chance in repeated random samples. This random variation is the only source of error covered by the margin of error.

Real-life samples also suffer from undercoverage and nonresponse. Errors from these practical difficulties are usually more serious and harder to quantify than random sampling error. The actual error in sample surveys may be much larger than the announced margin of error. What is worse is that we can’t say how much larger. Statistical conclusions are approximations of a complicated truth, not mathematical results that are simply true. As we will see in Spotlight 7.4, responsible polling organizations tell the public something about both the precision and limitations of their poll results.

Truth in Polling Spotlight 7.4

College student newspapers may not have the resources to conduct polls using random sampling, so it is refreshing when the polls they publish from voluntary response samples are accompanied by a disclaimer, such as this one that has been used by The Prospector, the student newspaper of The University of Texas at El Paso:

This poll is not scientific and reflects the opinions of only those internet users who have chosen to participate. The results cannot be assumed to represent the opinions of internet users in general, nor the public as a whole.

Because of this limitation, The Prospector simply reports the breakdown of responses given but without any margin of error, since sampling error cannot be quantified from a (voluntary response) sample that is not probability-based.

The Harris Poll accompanies its polls with the following disclaimer:

All sample surveys and polls, whether or not they use probability sampling, are subject to multiple sources of error which are most often not possible to quantify or estimate, including sampling error, coverage error, error associated with nonresponse, error associated with question wording and response options, and post-survey weighting and adjustments. Therefore, Harris Interactive avoids the words “margin of error” as they are misleading. All that can be calculated are different possible sampling errors with different probabilities for pure, unweighted, random samples with 100 percent response rates. These are only theoretical because no published polls come close to this ideal.