3.2 Designing Samples

Samsung and O2 want to know how much time smartphone users spend on their smartphones. An automaker hires a market research firm to learn what percent of adults aged 18 to 35 recall seeing television advertisements for a new sport utility vehicle. Government economists inquire about average household income. In all these cases, we want to gather information about a large group of people. We will not, as in an experiment, impose a treatment in order to observe the response. Also, time, cost, and inconvenience forbid contacting every person. In such cases, we gather information about only part of the group—a sample—in order to draw conclusions about the whole. Sample surveys are an important kind of observational study.

sample survey

Population and Sample

The entire group of cases that we want to study is called the population.

A sample is a subset of the population for which we collect data.

Notice that “population” is defined in terms of our desire for knowledge. If we wish to draw conclusions about all U.S. college students, that group is our population—even if only local students are available for questioning. The sample is the part from which we draw conclusions about the whole. The design of a sample survey refers to the method used to choose the sample from the population.

sample design

EXAMPLE 3.4 Can We Compete Globally?

A lack of reading skills has been cited as one factor that limits our ability to compete in the global economy.7 Various efforts have been made to improve this situation. One of these is the Reading Recovery (RR) program. RR has specially trained teachers work one-on-one with at-risk first-grade students to help them learn to read. A study was designed to examine the relationship between the RR teachers’ beliefs about their ability to motivate students and the progress of the students whom they teach.8 The National Data Evaluation Center (NDEC) website (ndec.us) says that there are 6112 RR teachers. The researchers send a questionnaire to a random sample of 200 of these. The population consists of all 6112 RR teachers, and the sample is the 200 that were randomly selected.

Unfortunately, our idealized framework of population and sample does not exactly correspond to the situations that we face in many cases. In Example 3.4, the list of teachers was prepared at a particular time in the past. It is very likely that some of the teachers on the list are no longer working as RR teachers today. New teachers have been trained in RR methods and are not on the list. A list of items to be sampled is often called a sampling frame. For our example, we view this list as the population. We may have out-of-date addresses for some who are still working as RR teachers, and some teachers may choose not to respond to our survey questions.

sampling frame


In reporting the results of a sample survey, it is important to include all details regarding the procedures used. The proportion of the original sample who actually provide usable data is called the response rate and should be reported for all surveys. If only 150 of the teachers who were sent questionnaires provided usable data, the response rate would be 150/200, or 75%. Follow-up mailings or phone calls to those who do not initially respond can help increase the response rate.
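The response rate arithmetic above can be written as a small helper. This is only a sketch; the function name `response_rate` is chosen here for illustration and does not come from the text.

```python
def response_rate(usable_responses, sample_size):
    """Proportion of the original sample that provided usable data."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= usable_responses <= sample_size:
        raise ValueError("usable_responses must be between 0 and sample_size")
    return usable_responses / sample_size


# The Reading Recovery illustration: 150 usable replies from 200 questionnaires.
rate = response_rate(150, 200)
print(f"{rate:.0%}")  # 75%
```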

response rate

Apply Your Knowledge

Question 3.17

3.17 Taxes and forestland usage.

A study was designed to assess the impact of taxes on forestland usage in part of the Upper Wabash River Watershed in Indiana.9 A survey was sent to 772 forest owners from this region, and 348 were returned. Consider the population, the sample, and the response rate for this study. Describe these based on the information given, and indicate any additional information that you would need to give a complete answer.

3.17

The population is all forest owners in this region. The sample is the 772 to whom the survey was sent. The response rate is 45.1%.

Question 3.18

3.18 Job satisfaction.

A research team wants to examine the relationship between employee participation in decision making and job satisfaction in a company. They plan to randomly select 300 employees from a list of 2500 employees in the company. The Job Descriptive Index (JDI) will be used to measure job satisfaction, and the Conway Adaptation of the Alutto-Belasco Decisional Participation Scale will be used to measure decision participation. Describe the population and the sample for this study. Can you determine the response rate? Explain your answer.

Poor sample designs can produce misleading conclusions. Here is an example.

EXAMPLE 3.5 Sampling Product in a Steel Mill

A mill produces large coils of thin steel for use in manufacturing home appliances. The quality engineer wants to submit a sample of 5-centimeter squares to detailed laboratory examination. She asks a technician to cut a sample of 10 such squares. Wanting to provide “good” pieces of steel, the technician carefully avoids the visible defects in the coil material when cutting the sample. The laboratory results are wonderful, but the customers complain about the material they are receiving.

In Example 3.5, the samples were selected in a manner that guaranteed that they would not be representative of the entire population. This sampling scheme displays bias, or systematic error, in favoring some parts of the population over others. Online opinion polls are particularly vulnerable to bias because the people who respond are not representative of the population at large. Online polls use voluntary response samples, a particularly common form of biased sample.

Voluntary Response Sample

A voluntary response sample consists of people who choose themselves by responding to a general appeal. Voluntary response samples are biased because people with strong opinions, especially negative opinions, are most likely to respond.

The remedy for bias in choosing a sample is to allow impersonal chance to do the choosing so that there is neither favoritism by the sampler nor voluntary response. Random selection of a sample eliminates bias by giving all cases an equal chance to be chosen.
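A small simulation can make the contrast between voluntary response and random selection concrete. Everything in this sketch is an assumption made for illustration: the population of 20,000 ratings, the response probabilities, and the seed are invented, not taken from any study.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20,000 ratings from 1 (very negative) to 10 (very positive).
population = [rng.randint(1, 10) for _ in range(20_000)]


def chance_of_responding(rating):
    # Assumed behavior, not data: people with strongly negative opinions
    # answer a general appeal far more often than everyone else.
    return 0.30 if rating <= 3 else 0.02


# Voluntary response: each person decides whether to take part.
voluntary_sample = [r for r in population if rng.random() < chance_of_responding(r)]

# Simple random sample: chance alone decides who is included.
srs = rng.sample(population, 1000)


def mean(values):
    return sum(values) / len(values)


population_mean = mean(population)
voluntary_mean = mean(voluntary_sample)
srs_mean = mean(srs)

print(f"population mean:         {population_mean:.2f}")
print(f"voluntary response mean: {voluntary_mean:.2f}")
print(f"SRS mean:                {srs_mean:.2f}")
```

Under these assumed response probabilities, the voluntary response mean falls well below the population mean, while the SRS mean stays close to it.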


Voluntary response is one common type of bad sample design. Another is convenience sampling, which chooses the cases easiest to reach. Here is an example of convenience sampling.

convenience sampling

EXAMPLE 3.6 Interviewing Customers at the Mall

Manufacturers and advertising agencies often use interviews at shopping malls to gather information about the habits of consumers and the effectiveness of ads. A sample of mall customers is fast and cheap. But people contacted at shopping malls are not representative of the entire U.S. population. They are richer, for example, and more likely to be teenagers or retired. Moreover, mall interviewers tend to select neat, safe-looking subjects from the stream of customers. Decisions based on mall interviews may not reflect the preferences of all consumers.

Both voluntary response sampling and convenience sampling produce samples that are almost guaranteed not to represent the entire population. These sampling methods display bias in favoring some parts of the population over others.

Bias

The design of a study is biased if it systematically favors certain outcomes.

Big data involves extracting useful information from large and complex data sets. There are exciting developments in this field and opportunities for new uses of data are widespread. Some have suggested that there are potential biases in the results obtained from some big data sets.10 Here is an example:

EXAMPLE 3.7 Bias and Big Data


A study used Twitter and Foursquare data on coffee, food, nightlife, and shopping activity to describe the disruptive effects of Hurricane Sandy.11 However, the data are dominated by tweets and smartphone activity from Manhattan. Relatively little data are from areas such as Breezy Point, where the effects of the hurricane were most severe.

Apply Your Knowledge

Question 3.19

3.19 What is the population?

For each of the following sampling situations, identify the population as exactly as possible. That is, indicate what kind of cases the population consists of and exactly which cases fall in the population. If the information given is not sufficient, complete the description of the population in a reasonable way.

  (a) Each week, the Gallup Poll questions a sample of about 1500 adult U.S. residents to determine national opinion on a wide variety of issues.
  (b) The 2000 census tried to gather basic information from every household in the United States. Also, a “long form” requesting additional information was sent to a sample of about 17% of households.
  (c) A machinery manufacturer purchases voltage regulators from a supplier. There are reports that variation in the output voltage of the regulators is affecting the performance of the finished products. To assess the quality of the supplier’s production, the manufacturer sends a sample of five regulators from the last shipment to a laboratory for study.

3.19

(a) The population consists of all adult U.S. residents. (b) The population consists of all households in the United States. (c) The population consists of all voltage regulators from the supplier.


Question 3.20

3.20 Market segmentation and movie ratings.

You wonder if that new “blockbuster” movie is really any good. Some of your friends like the movie, but you decide to check the Internet Movie Database (imdb.com) to see others’ ratings. You find that 2497 people chose to rate this movie, with an average rating of only 3.7 out of 10. You are surprised that most of your friends liked the movie, while many people gave low ratings to the movie online. Are you convinced that a majority of those who saw the movie would give it a low rating? What type of sample are your friends? What type of sample are the raters on the Internet Movie Database? Discuss this example in terms of market segmentation (see, for example, businessplans.org/Segment.html).

Simple random samples

The simplest sampling design amounts to placing names in a hat (the population) and drawing out a handful (the sample). This is simple random sampling.

Simple Random Sample

A simple random sample (SRS) of size $n$ consists of $n$ cases from the population chosen in such a way that every set of $n$ cases has an equal chance to be the sample actually selected.

We select an SRS by labeling all the cases in the population and using software or a table of random digits to select a sample of the desired size. Notice that an SRS not only gives each case an equal chance to be chosen (thus avoiding bias in the choice), but gives every possible sample an equal chance to be chosen. There are other random sampling designs that give each case, but not each sample, an equal chance. One such design, systematic random sampling, is described later in Exercise 3.36 (pages 141–142).
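To see what "every possible sample has an equal chance" means, it can help to list the possible samples for a tiny population. The sketch below uses a made-up population of four cases labeled A to D; the labels and sizes are assumptions for illustration only.

```python
from itertools import combinations
from math import comb

cases = ["A", "B", "C", "D"]  # hypothetical population of 4 cases
sample_size = 2

# Every possible sample of size 2. An SRS gives each of these the same chance.
samples = list(combinations(cases, sample_size))
print(len(samples), "possible samples:", samples)

# Each case appears in the same number of samples, so each case
# also has the same chance of being chosen.
appearances = {case: sum(case in s for s in samples) for case in cases}
print(appearances)  # {'A': 3, 'B': 3, 'C': 3, 'D': 3}

# The count grows quickly: the number of possible samples of 10 from 100 cases.
print(comb(100, 10))
```

With six equally likely samples, each sample has chance 1/6, and each case has chance 3/6 of being included.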

Thinking about random digits helps you to understand randomization even if you will use software in practice. Table B at the back of the book is a table of random digits.

Random Digits

A table of random digits is a list of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 that has the following properties:

  1. The digit in any position in the list has the same chance of being any one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
  2. The digits in different positions are independent in the sense that the value of one has no influence on the value of any other.

You can think of Table B as the result of asking an assistant (or a computer) to mix the digits 0 to 9 in a hat, draw one, then replace the digit drawn, mix again, draw a second digit, and so on. The assistant’s mixing and drawing saves us the work of mixing and drawing when we need to randomize. Table B begins with the digits 19223950340575628713. To make the table easier to read, the digits appear in groups of five and in numbered rows. The groups and rows have no meaning; the table is just a long list of digits having the properties 1 and 2 described in the preceding box.
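A computer can play the role of the assistant. The sketch below produces a row of digits laid out like a line of a random digit table, in groups of five. It assumes that Python's pseudorandom generator is an acceptable stand-in for drawing from a hat; the function name `random_digit_row` is made up for this illustration, and the output will not match Table B.

```python
import random


def random_digit_row(rng, n_digits=40, group=5):
    """Draw digits 0-9 independently, with replacement, and group them for readability."""
    digits = "".join(rng.choice("0123456789") for _ in range(n_digits))
    return " ".join(digits[i:i + group] for i in range(0, n_digits, group))


rng = random.Random()  # pass a seed, e.g. random.Random(1), for a repeatable row
print(random_digit_row(rng))
```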

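Our goal is to use random digits to select random samples. We need the following facts about random digits, which are consequences of the basic properties 1 and 2: any pair of random digits has the same chance of being any of the 100 possible pairs 00, 01, 02, ..., 98, 99; any triple of random digits has the same chance of being any of the 1000 possible triples 000, 001, 002, ..., 998, 999; and the same holds for groups of four or more random digits.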


EXAMPLE 3.8 Brands

brands

A brand is a symbol or an image that is associated with a company. An effective brand identifies the company and its products. Using a variety of measures, dollar values for brands can be calculated. In Exercise 1.53 (page 36), you examined the distribution of the values of the top 100 brands.

Suppose that you want to write a research report on some of the characteristics of the companies in this elite group. You decide to look carefully at the websites of 10 companies from the list. One way to select the companies is to use a simple random sample. Here are some details about how to do this using Table B.

We start with a list of the companies with the top 100 brands. This is given in the data file BRANDS. Next, we need to label the companies. In the data file, they are listed with their ranks, 1 to 100. Let’s assign the labels 01 to 99 to the first 99 companies and 00 to the company with rank 100. With these labels, we can use Table B to select the SRS.

Let’s start with line 156 of Table B. This line has the entries 55494 67690 88131 81800 11188 28552 25752 21953. These are grouped in sets of five digits, but we need to use sets of two digits for our randomization. Here is line 156 of Table B in sets of two digits: 55 49 46 76 90 88 13 18 18 00 11 18 82 85 52 25 75 22 19 53.

Using these random digits, we select Kraft (55), Accenture (49), Fox (46), Starbucks (76), Ericsson (90), Chase (88), Oracle (13), Disney (18; we skip the second 18 because we have already selected Disney to be in our SRS), Estee Lauder (00; recoded from rank 100), and BMW (11).
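The reading of line 156 can be checked with a short script. This sketch hard-codes the digits quoted in the example; the function name `select_labels` and its parameters are made up for the illustration.

```python
def select_labels(digits, sample_size, label_width, valid_labels):
    """Read fixed-width labels from a string of random digits,
    skipping labels that are not in use and labels already chosen."""
    digits = digits.replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - label_width + 1, label_width):
        label = digits[i:i + label_width]
        if label in valid_labels and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen


line_156 = "55494 67690 88131 81800 11188 28552 25752 21953"
valid = {f"{k:02d}" for k in range(100)}  # labels 00 to 99, with 00 standing for rank 100

print(select_labels(line_156, 10, 2, valid))
# ['55', '49', '46', '76', '90', '88', '13', '18', '00', '11']
```

The second 18 is skipped automatically, matching the selection in the example.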

Most statistical software will select an SRS for you, eliminating the need for Table B. The Simple Random Sample applet on the text website is another convenient way to automate this task.

Excel and other spreadsheet software can do the job. There are four steps:

  1. Create a data set with all the elements of the population in the first column.
  2. Assign a random number to each element of the population; put these in the second column.
  3. Sort the data set by the random number column.
  4. The simple random sample is obtained by taking elements in order from the sorted list until the desired sample size is reached.

We illustrate the procedure with a simplified version of Example 3.8.

EXAMPLE 3.9 Select a Random Sample

Figure 3.2(a) gives the spreadsheet with the company names in column B. Only the first 12 of the 100 companies in the top 100 brands list are shown.

The random numbers generated by the RAND() function are given in the next column in Figure 3.2(b). The sorted data set is given in Figure 3.2(c). The 10 brands selected for our random sample are Danone, Disney, Boeing, Home Depot, Nescafe, Mastercard, Gucci, Nintendo, Apple, and Credit Suisse.
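The four spreadsheet steps carry over directly to a script. The sketch below is one way to do it in Python, using placeholder names "Company 1" to "Company 100" because the BRANDS data file is not reproduced here; the function name `srs_by_sorting` is hypothetical.

```python
import random


def srs_by_sorting(population, sample_size, seed=None):
    """Select an SRS by attaching a random number to each case and sorting."""
    rng = random.Random(seed)
    # Steps 1 and 2: pair every element of the population with a random number.
    keyed = [(rng.random(), item) for item in population]
    # Step 3: sort by the random number.
    keyed.sort(key=lambda pair: pair[0])
    # Step 4: take elements in order until the sample size is reached.
    return [item for _, item in keyed[:sample_size]]


population = [f"Company {rank}" for rank in range(1, 101)]  # placeholder labels
print(srs_by_sorting(population, 10))
```

In practice, Python's built-in `random.sample(population, 10)` gives an SRS in one call; the longer version is shown only because it mirrors the spreadsheet procedure.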


FIGURE 3.2 Selection of a simple random sample of brands using Excel, Example 3.9: (a) labels; (b) random numbers; (c) randomly sorted labels.

Apply Your Knowledge

Question 3.21

3.21 Ringtones for cell phones.

You decide to change the ringtones for your cell phone by choosing two from a list of the 10 most popular ringtones.12 Here is the list:

Fancy, Happy, Turn Down for What, Rude, Problem,
Bottoms Up, All of Me, Crise, Beachin’, Wiggle

Select your two ringtones using a simple random sample.

Question 3.22

3.22 Listen to three songs.

The walk to your statistics class takes about 10 minutes, about the amount of time needed to listen to three songs on your iPod. You decide to take a simple random sample of songs from the top 10 songs listed on the Billboard Top Heatseekers Songs.13 Here is the list:

Studio, Habits (Stay High), Leave the Night On, I’m Ready,
Ready Set Roll, All About That Bass, Riptide, Cool Kids,
v.3005, Hope You Get Lonely Tonight

Select the three songs for your iPod using a simple random sample.

Stratified samples

The general framework for designs that use chance to choose a sample is a probability sample.

Probability Sample

A probability sample is a sample chosen by chance. We must know what samples are possible and what chance, or probability, each possible sample has.

Some probability sampling designs (such as an SRS) give each member of the population an equal chance to be selected. This may not be true in more elaborate sampling designs. In every case, however, the use of chance to select the sample is the essential principle of statistical sampling.


Designs for sampling from large populations spread out over a wide area are usually more complex than an SRS. For example, it is common to sample important groups within the population separately, then combine these samples. This is the idea of a stratified sample.

Stratified Random Sample

To select a stratified random sample, first divide the population into groups of similar cases, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.

Choose the strata based on facts known before the sample is taken. For example, a population of election districts might be divided into urban, suburban, and rural strata. A stratified design can produce more exact information than an SRS of the same size by taking advantage of the fact that cases in the same stratum are similar to one another. Think of the extreme case in which all cases in each stratum are identical: just one case from each stratum is then enough to completely describe the population.
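A stratified random sample can be sketched as a separate SRS within each stratum. The district labels and stratum sizes below are invented for illustration, and the function name `stratified_sample` is hypothetical.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Take a separate SRS from each stratum and return the samples by stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(cases, sizes[name]) for name, cases in strata.items()}


# Hypothetical election districts, divided into strata known before sampling.
strata = {
    "urban": [f"U{i:03d}" for i in range(1, 41)],
    "suburban": [f"S{i:03d}" for i in range(1, 31)],
    "rural": [f"R{i:03d}" for i in range(1, 21)],
}
sizes = {"urban": 4, "suburban": 3, "rural": 2}

sample = stratified_sample(strata, sizes)
for name, chosen in sample.items():
    print(name, chosen)
```

Combining the three lists gives the full sample. Every stratum is guaranteed to be represented, which an SRS of the same total size does not guarantee.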

EXAMPLE 3.10 Fraud against Insurance Companies

A dentist is suspected of defrauding insurance companies by describing some dental procedures incorrectly on claim forms and overcharging for them. An investigation begins by examining a sample of his bills for the past three years. Because there are five suspicious types of procedures, the investigators take a stratified sample. That is, they randomly select bills for each of the five types of procedures separately.

Multistage samples

multistage sample

Another common means of restricting random selection is to choose the sample in stages. This is common practice for national samples of households or people. For example, data on employment and unemployment are gathered by the government’s Current Population Survey, which conducts interviews in about 60,000 households each month. The cost of sending interviewers to the widely scattered households in an SRS would be too high. Moreover, the government wants data broken down by states and large cities. The Current Population Survey, therefore, uses a multistage sampling design. The final sample consists of clusters of nearby households that an interviewer can easily visit. Most opinion polls and other national samples are also multistage, though interviewing in most national samples today is done by telephone rather than in person, eliminating the economic need for clustering. The Current Population Survey sampling design is roughly as follows:14

Stage 1. Divide the United States into 2007 geographical areas called primary sampling units, or PSUs. PSUs do not cross state lines. Select a sample of 754 PSUs. This sample includes the 428 PSUs with the largest populations and a stratified sample of 326 of the others.

Stage 2. Divide each PSU selected into smaller areas called “blocks.” Stratify the blocks using ethnic and other information, and take a stratified sample of the blocks in each PSU.

Stage 3. Sort the housing units in each block into clusters of four nearby units. Interview the households in a probability sample of these clusters.
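The idea of sampling in stages can be sketched with a much simpler two-stage design: first an SRS of areas, then an SRS of clusters within each chosen area. This is not the Current Population Survey procedure, which stratifies at each stage and always includes the largest PSUs; the areas, cluster counts, and function name below are assumptions for illustration.

```python
import random


def two_stage_sample(areas, n_areas, n_clusters_per_area, seed=None):
    """Stage 1: choose areas at random. Stage 2: choose clusters within each chosen area."""
    rng = random.Random(seed)
    chosen_areas = rng.sample(sorted(areas), n_areas)
    return {area: rng.sample(areas[area], n_clusters_per_area) for area in chosen_areas}


# Hypothetical frame: 10 areas, each containing 20 clusters of nearby housing units.
areas = {
    f"PSU {i}": [f"PSU {i} / cluster {j}" for j in range(1, 21)]
    for i in range(1, 11)
}

for area, clusters in two_stage_sample(areas, 3, 2).items():
    print(area, clusters)
```

Interviewers then visit only the chosen clusters, which keeps travel costs down compared with an SRS of scattered households.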

Analysis of data from sampling designs more complex than an SRS takes us beyond basic statistics. But the SRS is the building block of more elaborate designs, and analysis of other designs differs more in complexity of detail than in fundamental concepts.


Apply Your Knowledge

Question 3.23

3.23 Who goes to the market research workshop?

wshop

A small advertising firm has 30 junior associates and 10 senior associates. The junior associates are

Abel Fisher Huber Miranda Reinmann
Chen Ghosh Jimenez Moskowitz Santos
Cordoba Griswold Jones Neyman Shaw
David Hein Kim O’Brien Thompson
Deming Hernandez Klotz Pearl Utts
Elashoff Holland Lorenz Potter Varga

The senior associates are

Andrews Fernandez Kim Moore West
Besicovitch Gupta Lightman Vicario Yang

The firm will send four junior associates and two senior associates to a workshop on current trends in market research. It decides to choose those who will go by random selection. Use Table B to choose a stratified random sample of four junior associates and two senior associates. Start at line 141 to choose your sample.

3.23

Using line 141: The junior associates chosen are 23, 29, 12, 16. Continuing from line 141, the senior associates chosen are 02, 08 (or 5 and 1 using 0–9 numbering).

Question 3.24

3.24 Sampling by accountants.

Accountants use stratified samples during audits to verify a company’s records of such things as accounts receivable. The stratification is based on the dollar amount of the item and often includes 100% sampling of the largest items. One company reports 5000 accounts receivable. Of these, 100 are in amounts over $50,000; 500 are in amounts between $1000 and $50,000; and the remaining 4400 are in amounts under $1000. Using these groups as strata, you decide to verify all of the largest accounts and to sample 5% of the midsize accounts and 1% of the small accounts. How would you label the two strata from which you will sample? Use Table B, starting at line 125, to select only the first five accounts from each of these strata.

Cautions about sample surveys

Random selection eliminates bias in the choice of a sample from a list of the population. Sample surveys of large human populations, however, require much more than a good sampling design. To begin, we need an accurate and complete list of the population. Because such a list is rarely available, most samples suffer from some degree of undercoverage. A sample survey of households, for example, will miss not only homeless people, but prison inmates and students in dormitories as well. An opinion poll conducted by telephone will miss the 6% of American households without residential phones. Thus, the results of national sample surveys have some bias if the people not covered—who most often are poor people—differ from the rest of the population.

A more serious source of bias in most sample surveys is nonresponse, which occurs when a selected case cannot be contacted or refuses to cooperate. Nonresponse to sample surveys often reaches 50% or more, even with careful planning and several callbacks. Because nonresponse is higher in urban areas, most sample surveys substitute other people in the same area to avoid favoring rural areas in the final sample. If the people contacted differ from those who are rarely at home or who refuse to answer questions, some bias remains.


Undercoverage and Nonresponse

Undercoverage occurs when some groups in the population are left out of the process of choosing the sample.

Nonresponse occurs when a case chosen for the sample cannot be contacted or does not cooperate.

EXAMPLE 3.11 Nonresponse in the Current Population Survey

How bad is nonresponse? The Current Population Survey (CPS) has the lowest nonresponse rate of any poll we know: only about 4% of the households in the CPS sample refuse to take part, and another 3% or 4% can’t be contacted. People are more likely to respond to a government survey such as the CPS, and the CPS contacts its sample in person before doing later interviews by phone.

The General Social Survey (Figure 3.3) is the nation’s most important social science research survey. The GSS also contacts its sample in person, and it is run by a university. Despite these advantages, its most recent survey had a 30% rate of nonresponse.15

FIGURE 3.3 The General Social Survey (GSS) assesses attitudes on a variety of topics, Example 3.11.


What about polls done by the media and by market research and opinion-polling firms? We don’t know their rates of nonresponse because they won’t say. That in itself is a bad sign.

EXAMPLE 3.12 Change in Nonresponse in Pew Surveys

The Pew Research Center conducts research using surveys on a variety of issues, attitudes, and trends.16 A study by the center examined the decline in the response rates to their surveys over time. The changes are dramatic, and there is a consistent pattern over time. Here are some data from the report:17

Year               1997   2000   2003   2006   2009   2012
Nonresponse rate   64%    72%    75%    79%    85%    91%

The center is devising alternative methods that show some promise of improving the response rates of their surveys.

Most sample surveys, and almost all opinion polls, are now carried out by telephone or online. This and other details of the interview method can affect the results. When presented with several options for a reply—such as completely agree, mostly agree, mostly disagree, and completely disagree—people tend to be a little more likely to respond to the first one or two options presented.

response bias

The behavior of the respondent or of the interviewer can cause response bias in sample results. Respondents may lie, especially if asked about illegal or unpopular behavior. The race or gender of the interviewer can influence responses to questions about race relations or attitudes toward feminism. Answers to questions that ask respondents to recall past events are often inaccurate because of faulty memory.

wording of questions

The wording of questions is the most important influence on the answers given to a sample survey. Confusing or leading questions can introduce strong bias, and even minor changes in wording can change a survey’s outcome. Here is an example.

EXAMPLE 3.13 The Form of the Question Is Important

In response to the question “Are you heterosexual, homosexual, or bisexual?” in a social science research survey, one woman answered, “It’s just me and my husband, so bisexual.” The issue is serious, even if the example seems silly: reporting about sexual behavior is difficult because people understand and misunderstand sexual terms in many ways.

Apply Your Knowledge

Question 3.25

3.25 Random digit dialing.

The list of cases from which a sample is actually selected is called the sampling frame. Ideally, the frame should include every case in the population, but in practice this is often difficult. A frame that leaves out part of the population is a common source of undercoverage.

  (a) Suppose that a sample of households in a community is selected at random from the telephone directory. What households are omitted from this frame? What types of people do you think are likely to live in these households? These people will probably be underrepresented in the sample.
  (b) It is usual in telephone surveys to use random digit dialing equipment that selects the last four digits of a telephone number at random after being given the exchange (the first three digits). Which of the households that you mentioned in your answer to part (a) will be included in the sampling frame by random digit dialing?

3.25

(a) Households not listed in the telephone directory are omitted. They may not own a phone or choose to have their number unlisted. (b) Random digit dialing would include those that are unlisted but own a phone.


The statistical design of sample surveys is a science, but this science is only part of the art of sampling. Because of nonresponse, response bias, and the difficulty of posing clear and neutral questions, you should hesitate to fully trust reports about complicated issues based on surveys of large human populations. Insist on knowing the exact questions asked, the rate of nonresponse, and the date and method of the survey before you trust a poll result.

Beyond the Basics: Capture-Recapture Sampling

Pacific salmon return to reproduce in the river where they were hatched three or four years earlier. How many salmon made it back this year? The answer will help determine quotas for commercial fishing on the west coast of Canada and the United States. Biologists estimate the size of animal populations with a special kind of repeated sampling, called capture-recapture sampling. More recently, capture-recapture methods have been used on human populations as well.

capture-recapture sampling

EXAMPLE 3.14 Sampling for a Major Industry in British Columbia

The old method of counting returning salmon involved placing a “counting fence” in a stream and counting all the fish caught by the fence. This is expensive and difficult. For example, fences are often damaged by high water.

Repeat sampling using small nets is more practical. During this year’s spawning run in the Chase River in British Columbia, Canada, you net 200 coho salmon, tag the fish, and release them. Later in the week, your nets capture 120 coho salmon in the river, of which 12 have tags.

The proportion of your second sample that have tags should estimate the proportion in the entire population of returning salmon that are tagged. So if $N$ is the unknown number of coho salmon in the Chase River this year, we should have approximately

$$\frac{12}{120} = \frac{200}{N}$$

Solve for $N$ to estimate that the total number of salmon in this year’s spawning run in the Chase River is approximately

$$N = 200 \times \frac{120}{12} = 2000$$

The capture-recapture idea extends the use of a sample proportion to estimate a population proportion. The idea works well if both samples are SRSs from the population and the population remains unchanged between samples. In practice, complications arise. For example, some tagged fish might be caught by bears or otherwise die between the first and second samples.
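The estimate in Example 3.14 sets the tagged proportion in the second sample equal to the tagged proportion in the whole population and solves for the population size. A minimal sketch of that calculation follows; the function and parameter names are chosen here for illustration. This simple form is often called the Lincoln-Petersen estimator, and it relies on the same idealized conditions described above.

```python
def capture_recapture_estimate(tagged_first, caught_second, tagged_in_second):
    """Estimate population size from (tagged in second) / (caught second) = (tagged first) / N."""
    if tagged_in_second <= 0:
        raise ValueError("the second sample must contain at least one tagged case")
    if tagged_in_second > min(tagged_first, caught_second):
        raise ValueError("tagged_in_second cannot exceed either sample size")
    return tagged_first * caught_second / tagged_in_second


# Example 3.14: 200 salmon tagged, 120 caught later, 12 of those carry tags.
print(capture_recapture_estimate(200, 120, 12))  # 2000.0
```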

Variations on capture-recapture samples are widely used in wildlife studies and are now finding other applications. One way to estimate the census undercount in a district is to consider the census as “capturing and marking” the households that respond. Census workers then visit the district, take an SRS of households, and see how many of those counted by the census show up in the sample. Capture-recapture estimates the total count of households in the district. As with estimating wildlife populations, there are many practical pitfalls. Our final word is as before: the real world is less orderly than statistics textbooks imply.