2.3 Random Samples

In the first two sections of this chapter, we discussed the importance of randomization in observational studies and experiments. In this section, we will consider how to achieve this randomization. In particular, we will look at the process of selecting a simple random sample.

2.3.1 Types of Random Samples

Are the students who live in one dormitory on a college campus pretty much like those in another? Maybe—or maybe not. A researcher wanting to design an experiment involving students who live on campus may want to group subjects based on where they live. Do students living in single-sex dormitories respond differently than those in coed dorms? Are results different for freshman housing as compared to upper class dormitories?

Photo Credit: © Aerial Archives / Alamy

While we use “sampling frame” to mean the list of individuals in the population, the term is sometimes used to describe how the sample selected relates to the underlying population. Researchers investigating burnout in athletic trainers used the membership list of the National Athletic Trainers’ Association to select a stratified random sample. The strata used were gender, type of institution, and years of experience. Their sampling frame displays the numbers and percentages of individuals selected.

When selecting a sample to use in a study, we use randomization because it provides us the best chance of obtaining a sample that is representative of the characteristics of the population. If you were asked to describe what would make a sample random, you might say that every individual in the population should have an equal chance of being selected. That is certainly a good place to start.

Suppose that a statistics professor gets the wild idea to give A’s to four students randomly selected from the 28 individuals enrolled in her class. She might write each person’s name on a slip of paper, put the names in a hat, and have an impartial observer draw four names from the hat. The students whose names are selected will receive the A’s, no work required!

A sample selected in this fashion satisfies the criterion that every individual has an equal chance of being selected. In addition, this method allows every possible combination of four students to be selected—and each combination is equally likely to be selected. A sample selected in such a way that every sample of the desired size is equally likely to be chosen is called a simple random sample (SRS).
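The names-in-a-hat procedure can be sketched in a few lines of Python. This is only an illustration: the roster of 28 placeholder names and the seed value are assumptions, not part of the example above. The standard library function random.sample draws without replacement, which mirrors pulling slips from a hat.

```python
import random
from math import comb

# Hypothetical roster: 28 placeholder names standing in for the class list.
roster = [f"Student {i:02d}" for i in range(1, 29)]

# A seeded generator makes the draw repeatable; the seed value is arbitrary.
rng = random.Random(2023)

# Draw four names without replacement, like pulling slips from a hat.
srs = rng.sample(roster, 4)
print(srs)

# Number of distinct samples of size 4 that could be drawn from 28 students.
n_possible = comb(28, 4)
print(n_possible)  # 20475
```

Each of the 20,475 possible groups of four students has the same chance of being the one drawn, which is what makes this a simple random sample.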

The statistical procedures that we will consider later in this course require that our samples be simple random samples, so we are primarily concerned with them. But there are other methods that also generate random samples; we now present several of the simpler ones.

Let’s return to the rogue professor giving out A’s, and look at a way to select the sample of four students systematically. Since there are 28 students in the class, and a sample of four is required, she could choose one of the first seven students from her alphabetical roll list, and then every seventh student thereafter. The randomization comes in by choosing where to start the selection; as long as the first student is chosen at random, each student has an equal chance of being selected. This method does not, however, produce a simple random sample, because not all samples of size 4 are equally likely. Some samples are impossible to obtain, such as those with two names that are alphabetically adjacent in the list. (Systematic sampling does not require that the sample size divides evenly into the population size, but since we will not be selecting samples by this method, we will omit the details of such selections here.)
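A minimal sketch of the systematic selection just described, assuming a hypothetical roll of 28 placeholder names and an arbitrary seed. The function name systematic_sample is ours, introduced only for illustration.

```python
import random

def systematic_sample(frame, step, rng):
    """Pick a random start among the first `step` items, then every `step`-th item."""
    start = rng.randrange(step)  # one of 0, 1, ..., step - 1
    return frame[start::step]

# Hypothetical alphabetical roll of 28 students.
roster = [f"Student {i:02d}" for i in range(1, 29)]

rng = random.Random(7)  # arbitrary seed, used only to make the run repeatable
chosen = systematic_sample(roster, 7, rng)
print(chosen)

# Only seven samples can ever occur, one for each starting position.
all_possible = [roster[s::7] for s in range(7)]
print(len(all_possible))  # 7
```

Listing all_possible makes the limitation concrete: only 7 samples can occur, compared with the 20,475 groups of four that a simple random sample could produce, and no sample contains two students who are adjacent on the roll.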

Now consider a situation in which the population consists of individuals that can be divided into groups according to one or more characteristics; a representative sample of such a population should include individuals from each subgroup, in approximately the same proportion that they appear in the population. A stratified random sample is obtained by first dividing the population into groups (strata) defined by one or more variables, and then selecting a simple random sample from each group. For example, suppose a community college offers classes during the day, in the evening, and on the weekend. The director of the campus tutoring center wants to conduct a survey to determine the best hours to be open in order to serve the student population. He should use a stratified random sample to assure that students who attend classes at different times are adequately represented.
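The tutoring center survey might be sketched as follows. The enrollment counts (600 day, 300 evening, and 100 weekend students), the 5% sampling fraction, and the function name stratified_sample are all assumptions made for illustration; none of them come from the example itself.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Take a simple random sample of the same fraction from each stratum."""
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical enrollment lists for the three class schedules.
strata = {
    "day": [f"D{i}" for i in range(600)],
    "evening": [f"E{i}" for i in range(300)],
    "weekend": [f"W{i}" for i in range(100)],
}

rng = random.Random(11)  # arbitrary seed
picked = stratified_sample(strata, 0.05, rng)
print({name: len(ids) for name, ids in picked.items()})
```

Because the same fraction is taken from every stratum, each group appears in the sample in the same proportion as in the population.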

Pollsters often also use a stratified random sample when they are interested in predicting the winner of a presidential race. Typically, national surveys involve approximately 1000 people (we’ll see why this is the case later) so one might think that the 1000 voters are randomly selected from the entire population of eligible voters. In fact, pollsters do not randomly pick 1000 phone numbers because voting patterns vary by region and state.

Instead, they will often use previous elections to determine the percentage of people who voted in each geographical region. For example, in a previous election, if 23% of votes came from the East, 26% from the South, 31% from the Great Lakes/Central region, and 20% from the West, they would make 23% of the 1000 calls (or 230) to the East, 260 to the South, 310 to the central region, and 200 to the West. In addition, voter turnout by state is also used to make sure that each state in a particular region is getting the right number of calls.
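The allocation arithmetic in this example is simple enough to check directly. The short sketch below uses the regional percentages given above; the variable names are ours.

```python
# Regional vote shares from the example, in percent.
shares = {"East": 23, "South": 26, "Great Lakes/Central": 31, "West": 20}
total_calls = 1000

# Integer arithmetic avoids floating-point rounding surprises.
calls = {region: total_calls * pct // 100 for region, pct in shares.items()}
print(calls)
print(sum(calls.values()))  # 1000
```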

Unlike a stratified random sample, in which the groups differ from one another, a cluster sample first divides the population into groups (clusters) that are similar to one another. A set of clusters is then randomly chosen, and a census of the individuals in each selected cluster is conducted. If the director of food service at a large university wants to gauge student opinion about menu choice in dormitories, she could select some of the university’s dormitories at random, and then survey each student residing in those dorms. The clusters here are determined by geography; all students residing in a particular dorm constitute a cluster.
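A rough sketch of the dormitory example, assuming a hypothetical campus with eight dormitories of 40 residents each, of which three are selected. The numbers and names are invented for illustration.

```python
import random

# Hypothetical campus: 8 dormitories, each with 40 made-up residents.
dorms = {f"Dorm {c}": [f"{c}-{i}" for i in range(1, 41)] for c in "ABCDEFGH"}

rng = random.Random(5)  # arbitrary seed

# Stage 1: choose whole clusters (dormitories) at random.
selected_dorms = rng.sample(sorted(dorms), 3)

# Stage 2: take a census of every resident in each selected cluster.
surveyed = [student for d in selected_dorms for student in dorms[d]]

print(selected_dorms)
print(len(surveyed))  # 120
```

Note that the random selection is applied to the dormitories, not to the students; once a dormitory is chosen, everyone living in it is surveyed.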

Cluster samples are often used in market research studies, and are particularly useful in the situation where a complete list of people in the population is not available. Cluster samples can also be used to reduce costs associated with polling; if the population is concentrated in clusters such as neighborhoods or schools, several interviews can be conducted much more quickly than interviews performed in spread out areas.

When very large samples are involved, organizations typically use a combination of techniques, often using several stages and types of random sampling. These samples are called multistage samples. Because the census of American households occurs only every ten years, the Census Bureau conducts the American Community Survey (ACS) yearly in order to provide updates about how communities are changing. The ACS is used to inform decisions about policies and programs, and to determine how billions of dollars of federal funds are distributed. The ACS Design and Methodology Paper describes in great detail the multistage sampling procedure used in the survey. Merely reading the table of contents of the report reveals many of the issues about sample surveys that we have raised in this chapter.

The random digit dialing typically used in social research polls and surveys also involves multistage sampling. One stage selects the first six or eight digits of the phone number, and the final four or two digits of the number are dialed at random.
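The two stages of random digit dialing can be illustrated with a small sketch. The list of six-digit prefixes is hypothetical, and choosing a prefix with equal probability is an assumption; the text does not say how survey organizations actually select prefixes.

```python
import random

# Hypothetical six-digit prefixes (area code plus exchange); not real sampling data.
prefixes = ["212555", "312555", "415555"]

def random_digit_dial(prefixes, rng):
    """Two-stage selection: pick a prefix, then append four random digits."""
    prefix = rng.choice(prefixes)
    suffix = f"{rng.randrange(10_000):04d}"  # 0000 through 9999
    return prefix + suffix

rng = random.Random(1)  # arbitrary seed
numbers = [random_digit_dial(prefixes, rng) for _ in range(5)]
print(numbers)
```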

2.3.2 Selecting a Simple Random Sample Using a Table

While our names-in-the-hat method produces a simple random sample, it is not an efficient way to choose an SRS if the population is large. Choosing a simple random sample in this case requires a list of all individuals in the population (the sampling frame) and a way to pick them out of the list at random. One way to do this is to attach a different numerical label to each individual. A random number finder, either a table or statistical software, is then used to select as many numbers as needed for the desired sample size.

When using a random number table, the sampling frame is numbered so that each individual has a unique label with the same number of digits. For example, to label a population of 150 individuals, we would probably use 001 to 150, although any set of 150 consecutive whole numbers would work, even something that seems as strange as 5033 through 5182.

Below is a portion of a random digits table.

123 54580 81507 27102 56027 55892 33063 41842 81868
124 71035 09001 43367 49497 72719 96758 27611 91596
125 96746 12149 37823 71868 18442 35119 62103 39244
126 96927 19931 36809 74192 77567 88741 48409 41903
127 43909 99477 25330 64359 40085 16925 85117 36071
128 15689 14227 06565 14374 13352 49367 81982 87209
129 36759 58984 68288 22913 18638 54303 00795 08727
130 69051 64817 87174 09517 84534 06489 87201 97245
131 05007 16632 81194 14873 04197 85576 45195 96565
132 68732 55259 84292 08796 43165 93739 31685 97150
Table 2.3: Random Digits

In the leftmost column are the line numbers that provide a way to identify our starting point. The remainder of the table consists of a long list of random digits. The digits are separated into rows and columns only for ease of reading them. Each digit 0 to 9 is equally likely to appear in any space, and there is no relationship between any of the entries.

Once we have a starting place, we read the table from left to right to obtain consecutive digits of the length needed. Starting in the third column of row 127, we find the first ten numbers of these lengths:

2 digits: 25, 33, 06, 43, 59, 40, 08, 51, 69, 25
3 digits: 253, 306, 435, 940, 085, 169, 258, 511, 736, 071
4 digits: 2533, 0643, 5940, 0851, 6925, 8511, 7360, 7115, 6891, 4227

Notice that we ignore any row or column boundaries in selecting the numbers, and we do not use the line labels.
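The reading procedure can be checked with a short sketch that treats the table as one long stream of digits. Only rows 127 and 128 of Table 2.3 are typed in here, and the helper names digits_from and chunks are ours.

```python
# Rows 127 and 128 of Table 2.3, with the line labels dropped.
rows = {
    127: "43909 99477 25330 64359 40085 16925 85117 36071",
    128: "15689 14227 06565 14374 13352 49367 81982 87209",
}

def digits_from(rows, start_row, start_col):
    """Return one string of digits beginning at a 1-based column of start_row."""
    stream = ""
    for line in sorted(rows):
        if line < start_row:
            continue
        groups = rows[line].split()
        if line == start_row:
            groups = groups[start_col - 1:]
        stream += "".join(groups)  # row and column boundaries are ignored
    return stream

def chunks(stream, width, count):
    """Split the digit stream into `count` labels of `width` digits each."""
    return [stream[i * width:(i + 1) * width] for i in range(count)]

stream = digits_from(rows, 127, 3)
print(chunks(stream, 2, 10))
print(chunks(stream, 3, 10))
print(chunks(stream, 4, 10))
```

The three printed lists reproduce the 2-digit, 3-digit, and 4-digit numbers listed above, including the 4-digit values that run past the end of row 127 into row 128.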

Suppose we want to choose 10 individuals from our population of 150. If we labeled the individuals from 001 to 150, we must search the table until we have 10 numbers that fall in this interval, so we must use the 3-digit row. The only two numbers in that row that are between 001 and 150 are 085 and 071, so we would have to continue searching the random number table until we had 8 additional unique numbers. Of course, these numbers are not themselves the sample; rather, they are the labels that identify the individuals who will make up the sample.
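The search for usable labels can also be sketched in code. The function select_labels is a hypothetical helper that skips labels outside the interval and labels already chosen; here it is applied only to the digits of row 127 from the third column onward.

```python
def select_labels(stream, width, low, high, n):
    """Read `width`-digit labels from a digit stream, keeping the first `n`
    distinct labels that fall between low and high inclusive."""
    chosen = []
    for i in range(0, len(stream) - width + 1, width):
        label = stream[i:i + width]
        if low <= int(label) <= high and label not in chosen:
            chosen.append(label)
            if len(chosen) == n:
                break
    return chosen

# Digits of row 127 of Table 2.3, starting in the third column.
row_127 = "253306435940085169258511736071"

found = select_labels(row_127, 3, 1, 150, 10)
print(found)  # ['085', '071']
```

The function returns only two labels because the supplied digits run out; supplying more rows of the table would let the search continue until 10 labels are found.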

Question 2.17

The table below gives the first names of 12 individuals in a book club.

Alicia Barb Carolyn Ebony Ginny Harriet
Jaclyn Jean Jill Linda LouAnn Martha
Table: Book Club Members

Assign numerical labels 01 to 12 to this population, making the numerical order coincide with the alphabetical order. Then, starting in the third column of line 128 of the table below, select 2 random numbers to identify the individuals for the sample.

128 15689 14227 06565 14374 13352 49367 81982 87209
129 36759 58984 68288 22913 18638 54303 00795 08727
130 69051 64817 87174 09517 84534 06489 87201 97245
131 05007 16632 81194 14873 04197 85576 45195 96565
132 68732 55259 84292 08796 43165 93739 31685 97150
Table: Random Numbers

1. Give the name of the first individual selected:

2. Give the name of the second individual selected:

Answer: The two individuals selected are Harriet and Jill.

2.3.3 Selecting a Simple Random Sample Using Software

Using a table of random digits is a higher-tech way to generate random numbers than picking individual names from a hat. Perhaps even easier is to use statistical software that allows you to generate a set of random numbers in a certain interval or, better yet, to select a sample directly from the sampling frame. The table below shows a book club list, similar to the one in Question 2.17, stored in column 1 as var1, and a randomly selected sample of size 4 stored in column 2 as Sample(var1).

var1 Sample(var1)
Ann Ginny
Barb Jean
Carolyn Carolyn
Elaine Kitty
Ginny
Jackie
Jean
Jill
Kitty
Linda
LouAnn
Martha
Table 2.4: Book Club List

If your sampling frame is already stored in a spreadsheet, this is a very nice way to select a sample.

How random are the numbers generated by statistical software? Random enough for our purposes. Technically, the numbers obtained from most random number generators are called pseudo-random numbers, because they are obtained from a mathematical formula that has a certain starting value (called a seed). So if you have the formula and the seed, you can reproduce the selection of numbers. There are sources on the Internet that generate true random numbers; the website http://www.random.org has a true random number generator, and a nice discussion of randomness.
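The role of the seed is easy to demonstrate. The sketch below uses Python's standard random module as a stand-in for statistical software and the names from column var1 of Table 2.4; the seed value 42 is arbitrary.

```python
import random

# The twelve names listed in column var1 of Table 2.4.
var1 = ["Ann", "Barb", "Carolyn", "Elaine", "Ginny", "Jackie",
        "Jean", "Jill", "Kitty", "Linda", "LouAnn", "Martha"]

# Two generators started from the same seed produce the same "random" sample.
first = random.Random(42).sample(var1, 4)
second = random.Random(42).sample(var1, 4)

print(first)
print(first == second)  # True
```

Fixing the seed is useful when a selection needs to be documented or repeated; leaving it unset gives a different sample on each run.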

Any process used to select a simple random sample can also be applied to assigning individuals to treatment groups in experiments. Suppose that the book club discussed earlier decides that, instead of all reading the same book as they usually do, half of the club will read one book, and half another. The members decide to assign individuals to one of the two books at random. Since six members will read each book, we can choose an SRS of size 6 from the club, and assign them to the first book. The remaining six members would then read the second book.
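A minimal sketch of this random assignment, using the twelve names from Question 2.17 and an arbitrary seed:

```python
import random

# Book club members from Question 2.17.
members = ["Alicia", "Barb", "Carolyn", "Ebony", "Ginny", "Harriet",
           "Jaclyn", "Jean", "Jill", "Linda", "LouAnn", "Martha"]

rng = random.Random(3)  # arbitrary seed

# An SRS of size 6 reads the first book; everyone else reads the second.
book_one = rng.sample(members, 6)
book_two = [m for m in members if m not in book_one]

print(book_one)
print(book_two)
```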

This same method works regardless of the number of individuals involved in the experiment, or the number of treatments administered. If the individuals need to be randomized into more than two treatments, those to be given the first treatment are selected, and the process is repeated for assignment to additional treatments. In order to make the selection of subsequent groups easier, individuals already assigned to a treatment group should be removed from the sampling frame.
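The repeated selection described here might be sketched as follows. The function assign_treatments, the twelve placeholder subjects, and the three groups of four are assumptions made for illustration.

```python
import random

def assign_treatments(individuals, group_sizes, rng):
    """Select each treatment group as an SRS from those not yet assigned."""
    remaining = list(individuals)
    groups = []
    for size in group_sizes:
        group = rng.sample(remaining, size)
        groups.append(group)
        # Remove assigned individuals from the sampling frame.
        remaining = [p for p in remaining if p not in group]
    return groups

# Hypothetical experiment: 12 subjects, three treatments, four subjects each.
subjects = [f"S{i:02d}" for i in range(1, 13)]
groups = assign_treatments(subjects, [4, 4, 4], random.Random(9))

for number, group in enumerate(groups, start=1):
    print(f"Treatment {number}: {group}")
```

Removing each selected group from the frame before the next draw guarantees that no individual is assigned to two treatments.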

As we proceed through this course, we will see the critical role that simple random samples play in the types of inference we will study. Although we will not generally select the samples, it is important to understand how such selections are made.