In the first two sections of this chapter, we discussed the importance of randomization in observational studies and experiments. In this section, we will consider how to achieve this randomization. In particular, we will look at the process of selecting a simple random sample.
While we use “sampling frame” to mean the list of individuals in the population, the term is sometimes used to describe how the sample selected relates to the underlying population. Researchers investigating burnout in athletic trainers used the membership list of the National Athletic Trainers’ Association to select a stratified random sample. The strata used were gender, type of institution, and years of experience. Their sampling frame displays the numbers and percentages of individuals selected.
When selecting a sample to use in a study, we use randomization because it provides us the best chance of obtaining a sample that is representative of the characteristics of the population. If you were asked to describe what would make a sample random, you might say that every individual in the population should have an equal chance of being selected. That is certainly a good place to start.
Suppose that a statistics professor gets the wild idea to give A’s to four students randomly selected from the 28 individuals enrolled in her class. She might write each person’s name on a slip of paper, put the names in a hat, and have an impartial observer draw four names from the hat. The students whose names are selected will receive the A’s, no work required!
A sample selected in this fashion satisfies the criterion that every individual has an equal chance of being selected. In addition, this method allows every possible combination of four students to be selected—and each combination is equally likely to be selected. A sample selected in such a way that every sample of the desired size is equally likely to be chosen is called a simple random sample (SRS).
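The names-in-a-hat draw can be imitated in software. The following is a minimal sketch, assuming a made-up roster of 28 placeholder names (the class list itself is not given in the text). It counts how many different samples of size 4 are possible and then draws one of them.

```python
import math
import random

# Hypothetical roster: 28 placeholder names standing in for the class list.
roster = [f"Student {i:02d}" for i in range(1, 29)]

# Number of distinct samples of size 4 (the order of the draw does not matter).
n_possible_samples = math.comb(len(roster), 4)

# random.sample draws without replacement, so every set of 4 students
# is as likely to be chosen as any other set of 4.
srs = random.sample(roster, 4)

print(n_possible_samples)  # 20475
print(srs)
```

Each of the 20,475 possible groups of four has the same chance of being drawn, which is exactly the defining property of a simple random sample.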
The statistical procedures that we will consider later in this course require that our samples be simple random samples, so we are primarily concerned with them. But there are other methods that also generate random samples; we now present several of the simpler ones.
Let’s return to the rogue professor giving out A’s, and look at a way to select the sample of four students systematically. Since there are 28 students in the class, and a sample of four is required, she could choose one of the first seven students from her alphabetical roll list, and then every seventh student thereafter. The randomization comes in by choosing where to start the selection; as long as the first student is chosen at random, each student has an equal chance of being selected. This method does not, however, produce a simple random sample, because not all samples of size 4 are equally likely. Some samples are impossible to obtain, such as those with two names that are alphabetically adjacent in the list. (Systematic sampling does not require that the sample size divide evenly into the population size, but since we will not be selecting samples by this method, we will omit the details of such selections here.)
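To see why a systematic sample is not a simple random sample, it may help to list every sample the method can produce. The sketch below assumes a hypothetical alphabetical roster of 28 placeholder names; the function name systematic_sample is ours, not part of any library.

```python
import random

# Hypothetical alphabetical roster of 28 placeholder names.
roster = [f"Student {i:02d}" for i in range(1, 29)]
sample_size = 4
step = len(roster) // sample_size  # every 7th student

def systematic_sample(items, step, start):
    """Take the item at position `start`, then every `step`-th item after it."""
    return items[start::step]

# The only random choice is the starting position among the first 7 students.
start = random.randrange(step)
sample = systematic_sample(roster, step, start)

# Listing every possible starting position shows how few samples can occur.
all_samples = [systematic_sample(roster, step, s) for s in range(step)]

print(sample)
print(len(all_samples))  # 7
```

Every student appears in exactly one of the 7 possible samples, so each student has a 1 in 7 chance of selection. A simple random sample of the same class could be any of 20,475 different groups, so most groups of four can never be chosen systematically.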
Now consider a situation in which the population consists of individuals that can be divided into groups according to one or more characteristics; a representative sample of such a population should include individuals from each subgroup, in approximately the same proportion that they appear in the population. A stratified random sample is obtained by first dividing the population into groups (strata) defined by one or more variables, and then selecting a simple random sample from each group. For example, suppose a community college offers classes during the day, in the evening, and on the weekend. The director of the campus tutoring center wants to conduct a survey to determine the best hours to be open in order to serve the student population. He should use a stratified random sample to assure that students who attend classes at different times are adequately represented.
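The tutoring center survey could be sketched as follows. The enrollment figures, the ID format, and the 5% sampling rate are all assumptions made for illustration; none of them come from the text. The point is only that a separate simple random sample is drawn inside each stratum.

```python
import random

# Hypothetical sampling frame: student IDs grouped by when they attend class.
strata = {
    "day": [f"D{i:03d}" for i in range(600)],
    "evening": [f"E{i:03d}" for i in range(300)],
    "weekend": [f"W{i:03d}" for i in range(100)],
}
sampling_percent = 5  # assumed: survey 5% of each stratum

def stratified_sample(strata, percent, rng):
    """Draw a simple random sample of `percent`% from every stratum."""
    sample = {}
    for name, members in strata.items():
        k = len(members) * percent // 100  # integer arithmetic avoids rounding surprises
        sample[name] = rng.sample(members, k)
    return sample

rng = random.Random()
sample = stratified_sample(strata, sampling_percent, rng)
for name, chosen in sample.items():
    print(name, len(chosen))
```

Because the same percentage is taken from each group, the day, evening, and weekend students appear in the sample in the same proportions as in the assumed population.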
Pollsters also often use a stratified random sample when they are interested in predicting the winner of a presidential race. Typically, national surveys involve approximately 1000 people (we’ll see why this is the case later), so one might think that the 1000 voters are randomly selected from the entire population of eligible voters. In fact, pollsters do not randomly pick 1000 phone numbers because voting patterns vary by region and state.
Instead, they will often use previous elections to determine the percentage of people who voted in each geographical region. For example, in a previous election, if 23% of votes came from the East, 26% from the South, 31% from the Great Lakes/Central region, and 20% from the West, they would make 23% of the 1000 calls (or 230) to the East, 260 to the South, 310 to the Great Lakes/Central region, and 200 to the West. In addition, voter turnout by state is used to make sure that each state in a particular region is getting the right number of calls.
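The allocation of calls is simple proportional arithmetic. A short sketch, using the regional percentages from the example above (the variable names are ours):

```python
# Share of the vote from each region in a previous election, in percent
# (figures taken from the example in the text).
regional_vote_share = {
    "East": 23,
    "South": 26,
    "Great Lakes/Central": 31,
    "West": 20,
}
total_calls = 1000

# Allocate calls to each region in proportion to its share of the vote.
calls = {
    region: total_calls * percent // 100
    for region, percent in regional_vote_share.items()
}

print(calls)
print(sum(calls.values()))  # 1000
```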
Unlike a stratified random sample, in which the groups are different, a cluster sample first divides the population into groups (clusters) that are similar. A set of clusters is then randomly chosen, and a census of the individuals in each selected cluster is conducted. If the director of food service at a large university wants to gauge student opinion about menu choice in dormitories, she could select some of the university’s dormitories at random, and then survey each student residing in those dorms. The clusters here are determined by geography; all students residing in a particular dorm constitute a cluster.
Cluster samples are often used in market research studies, and are particularly useful in the situation where a complete list of people in the population is not available. Cluster samples can also be used to reduce costs associated with polling; if the population is concentrated in clusters such as neighborhoods or schools, several interviews can be conducted much more quickly than interviews performed in spread out areas.
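The dormitory survey can be sketched in a few lines. The dorm names and resident IDs below are invented for illustration, and cluster_sample is a hypothetical helper, not a library function. Notice that the random choice is made among the dorms, not among the students.

```python
import random

# Hypothetical clusters: each dormitory with a made-up list of resident IDs.
dorms = {
    "Dorm A": ["A1", "A2", "A3"],
    "Dorm B": ["B1", "B2", "B3", "B4"],
    "Dorm C": ["C1", "C2"],
    "Dorm D": ["D1", "D2", "D3"],
    "Dorm E": ["E1", "E2", "E3", "E4", "E5"],
}

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random, then include everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

rng = random.Random()
surveyed = cluster_sample(dorms, 2, rng)
for dorm, residents in surveyed.items():
    print(dorm, residents)
```

Only a list of dorms is needed to make the selection; a full list of residents is required only for the dorms that are chosen.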
When very large samples are involved, organizations typically use a combination of techniques, often using several stages and types of random sampling. These samples are called multistage samples. Because the census of American households occurs only every ten years, the Census Bureau conducts the American Community Survey (ACS) yearly in order to provide updates about how communities are changing. The ACS is used to inform decisions about policies and programs, and to determine how billions of dollars of federal funds are distributed. The ACS Design and Methodology Paper describes in great detail the multistage sampling procedure used in the survey. Merely reading the table of contents of the report reveals many of the issues about sample surveys that we have raised in this chapter.
The random digit dialing typically used in social research polls and surveys also involves multistage sampling. One stage selects the first six or eight digits of the phone number, and the final four or two digits of the number are dialed at random.
While our names-in-the-hat method produces a simple random sample, it is not an efficient way to choose an SRS if the population is large. Choosing a simple random sample in this case requires a list of all individuals in the population (the sampling frame) and a way to pick them out of the list at random. One way to do this is to attach a different numerical label to each individual. A random number finder, either a table or statistical software, is then used to select as many numbers as needed for the desired sample size.
When using a random number table, the sampling frame is numbered so that each individual has a unique label with the same number of digits. For example, to label a population of 150 individuals, we would probably use 001 to 150, although any set of 150 consecutive whole numbers would work, even something that seems as strange as 5033 through 5182.
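If the sampling frame is stored electronically, labels of equal length can be produced by padding with leading zeros. A small sketch, assuming a population of 150 as in the example:

```python
population_size = 150

# Use as many digits as the largest label needs (3 digits for 150).
width = len(str(population_size))
labels = [f"{i:0{width}d}" for i in range(1, population_size + 1)]

# Any run of 150 consecutive whole numbers also works, such as 5033 to 5182.
alt_labels = [str(i) for i in range(5033, 5183)]

print(labels[0], labels[-1])          # 001 150
print(alt_labels[0], alt_labels[-1])  # 5033 5182
print(len(labels), len(alt_labels))   # 150 150
```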
Below is a portion of a random digits table.
123 | 54580 | 81507 | 27102 | 56027 | 55892 | 33063 | 41842 | 81868 |
124 | 71035 | 09001 | 43367 | 49497 | 72719 | 96758 | 27611 | 91596 |
125 | 96746 | 12149 | 37823 | 71868 | 18442 | 35119 | 62103 | 39244 |
126 | 96927 | 19931 | 36809 | 74192 | 77567 | 88741 | 48409 | 41903 |
127 | 43909 | 99477 | 25330 | 64359 | 40085 | 16925 | 85117 | 36071 |
128 | 15689 | 14227 | 06565 | 14374 | 13352 | 49367 | 81982 | 87209 |
129 | 36759 | 58984 | 68288 | 22913 | 18638 | 54303 | 00795 | 08727 |
130 | 69051 | 64817 | 87174 | 09517 | 84534 | 06489 | 87201 | 97245 |
131 | 05007 | 16632 | 81194 | 14873 | 04197 | 85576 | 45195 | 96565 |
132 | 68732 | 55259 | 84292 | 08796 | 43165 | 93739 | 31685 | 97150 |
In the leftmost column are the line numbers that provide a way to identify our starting point. The remainder of the table consists of a long list of random digits. The digits are separated into rows and columns only for ease of reading them. Each digit 0 to 9 is equally likely to appear in any space, and there is no relationship between any of the entries.
Once we have a starting place, we read the table from left to right to obtain consecutive digits of the length needed. Starting in the third column of row 127, we find the first ten numbers of these lengths:
2 digits: 25, 33, 06, 43, 59, 40, 08, 51, 69, 25
3 digits: 253, 306, 435, 940, 085, 169, 258, 511, 736, 071
4 digits: 2533, 0643, 5940, 0851, 6925, 8511, 7360, 7115, 6891, 4227
Notice that we ignore any row or column boundaries in selecting the numbers, and we do not use the line labels.
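The reading procedure can be checked with a short program. The sketch below copies rows 127 and 128 from the table above; the helper names digit_stream and read_numbers are our own. It joins the digits into one long string, starting at the third column of row 127, and then cuts that string into pieces of the required length.

```python
# Rows 127 and 128 of the random digits table, copied from the text.
rows = {
    127: "43909 99477 25330 64359 40085 16925 85117 36071",
    128: "15689 14227 06565 14374 13352 49367 81982 87209",
}

def digit_stream(rows, start_row, start_col):
    """Join the digits from the starting row and column onward,
    ignoring row and column boundaries (columns are counted from 1)."""
    digits = ""
    for line in sorted(rows):
        if line < start_row:
            continue
        groups = rows[line].split()
        if line == start_row:
            groups = groups[start_col - 1:]
        digits += "".join(groups)
    return digits

def read_numbers(digits, width, count):
    """Cut the stream into `count` consecutive numbers of `width` digits."""
    return [digits[i:i + width] for i in range(0, width * count, width)]

stream = digit_stream(rows, 127, 3)
two_digit = read_numbers(stream, 2, 10)
three_digit = read_numbers(stream, 3, 10)
four_digit = read_numbers(stream, 4, 10)

print(two_digit)
print(three_digit)
print(four_digit)
```

The three printed lists match the 2-digit, 3-digit, and 4-digit rows shown above. The numbers are kept as text so that leading zeros such as 06 and 085 are preserved.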
Suppose we want to choose 10 individuals from our population of 150. If we labeled the individuals from 001 to 150, we must search the table until we have 10 numbers that fall in this interval, so we must use the 3-digit row. The only two numbers in that row that are between 001 and 150 are 085 and 071, so we would have to continue searching the random number table until we had 8 additional unique numbers. Of course, these numbers are not themselves the sample; rather, they are the labels that identify the individuals who will make up the sample.
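The search for usable labels can be sketched the same way. The function select_labels below is a hypothetical helper: it reads 3-digit numbers from a string of random digits, keeps those that fall between 1 and 150, skips repeats, and stops once enough labels have been found. Here it is applied only to the 30 digits that make up the 3-digit row above.

```python
def select_labels(digits, width, low, high, needed):
    """Read consecutive `width`-digit numbers and keep the unique ones
    between `low` and `high`, stopping once `needed` labels are found."""
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        value = int(digits[i:i + width])
        if low <= value <= high and value not in chosen:
            chosen.append(value)
            if len(chosen) == needed:
                break
    return chosen

# The 30 digits behind the ten 3-digit numbers listed in the text.
first_thirty = "253306435940085169258511736071"
found = select_labels(first_thirty, 3, 1, 150, 10)

print(found)            # [85, 71]
print(10 - len(found))  # 8 labels still to be found
```

With a longer string of digits the same function would keep reading until all 10 labels had been collected.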
The table below gives the first names of 12 individuals in a book club.
Alicia | Barb | Carolyn | Ebony | Ginny | Harriet |
Jaclyn | Jean | Jill | Linda | LouAnn | Martha |
Use the table below to assign numerical labels 01 to 12 to this population, making the numerical order coincide with the alphabetical order. Starting in the third column of line 128, select 2 random numbers to identify the individuals to select for a sample.
128 | 15689 | 14227 | 06565 | 14374 | 13352 | 49367 | 81982 | 87209 |
129 | 36759 | 58984 | 68288 | 22913 | 18638 | 54303 | 00795 | 08727 |
130 | 69051 | 64817 | 87174 | 09517 | 84534 | 06489 | 87201 | 97245 |
131 | 05007 | 16632 | 81194 | 14873 | 04197 | 85576 | 45195 | 96565 |
132 | 68732 | 55259 | 84292 | 08796 | 43165 | 93739 | 31685 | 97150 |
1. Give the name of the first individual selected:
2. Give the name of the second individual selected:
Using a table of random digits is a higher-tech way to generate random numbers than picking individual names from a hat. Perhaps even easier is to use statistical software that allows you to generate a set of random numbers in a certain interval, or, better yet, to select a sample directly from the sampling frame. The table below shows a book club list of 12 names, like the one in the Now Try This! example, stored in column 1 as var1, and then a randomly selected sample of size 4 stored in column 2 as Sample(var1).
var1 | Sample(var1) |
---|---|
Ann | Ginny |
Barb | Jean |
Carolyn | Carolyn |
Elaine | Kitty |
Ginny | |
Jackie | |
Jean | |
Jill | |
Kitty | |
Linda | |
LouAnn | |
Martha | |
If your sampling frame is already stored in a spreadsheet, this is a very nice way to select a sample.
How random are the numbers generated by statistical software? Random enough for our purposes. Technically, the numbers obtained from most random number generators are called pseudo-random numbers, because they are obtained from a mathematical formula that has a certain starting value (called a seed). So if you have the formula and the seed, you can reproduce the selection of numbers. There are sources on the Internet that generate true random numbers; the website http://www.random.org has a true random number generator, and a nice discussion of randomness.
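The role of the seed is easy to demonstrate. The sketch below uses Python's standard random module rather than the statistical software shown above, and the seed value 2024 is an arbitrary choice. It draws a sample of size 4 from the var1 list twice with the same seed.

```python
import random

# The sampling frame from column var1 of the table above.
var1 = ["Ann", "Barb", "Carolyn", "Elaine", "Ginny", "Jackie",
        "Jean", "Jill", "Kitty", "Linda", "LouAnn", "Martha"]

def draw_sample(frame, size, seed):
    """Draw a sample using a pseudo-random generator started from `seed`."""
    rng = random.Random(seed)
    return rng.sample(frame, size)

first_draw = draw_sample(var1, 4, seed=2024)   # 2024 is an arbitrary seed
second_draw = draw_sample(var1, 4, seed=2024)

print(first_draw)
print(first_draw == second_draw)  # True: same seed, same selection
```

Using the same seed reproduces the same sample, which is what makes the numbers pseudo-random rather than truly random. It is also useful in practice when a selection needs to be documented or checked later.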
Any process used to select a simple random sample can also be applied to assigning individuals to treatment groups in experiments. Suppose that the book club discussed earlier decides that, instead of all reading the same book as they usually do, half of the club will read one book, and half another. The members decide to assign individuals to one of the two books at random. Since six members will read each book, we can choose an SRS of size 6 from the club, and assign them to the first book. The remaining six members would then read the second book.
This same method works regardless of the number of individuals involved in the experiment, or the number of treatments administered. If the individuals need to be randomized into more than two treatments, those to be given the first treatment are selected, and the process is repeated for assignment to additional treatments. In order to make the selection of subsequent groups easier, individuals already assigned to a treatment group should be removed from the sampling frame.
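The assignment procedure just described can be sketched as follows. The function assign_treatments is a hypothetical helper, and the treatment labels "Book 1" and "Book 2" are ours. It assumes the names are unique and that the group divides evenly among the treatments. It selects an SRS for each treatment in turn, removing those already assigned from the sampling frame.

```python
import random

# Book club members from the earlier example.
members = ["Alicia", "Barb", "Carolyn", "Ebony", "Ginny", "Harriet",
           "Jaclyn", "Jean", "Jill", "Linda", "LouAnn", "Martha"]

def assign_treatments(individuals, treatments, rng):
    """Select an SRS for each treatment in turn, removing the individuals
    already assigned; the last treatment gets whoever remains."""
    if len(individuals) % len(treatments) != 0:
        raise ValueError("individuals must divide evenly among treatments")
    remaining = list(individuals)
    group_size = len(remaining) // len(treatments)
    groups = {}
    for treatment in treatments[:-1]:
        chosen = rng.sample(remaining, group_size)
        groups[treatment] = chosen
        remaining = [person for person in remaining if person not in chosen]
    groups[treatments[-1]] = remaining
    return groups

rng = random.Random()
groups = assign_treatments(members, ["Book 1", "Book 2"], rng)
for book, readers in groups.items():
    print(book, readers)
```

Passing three or more treatment labels works the same way, provided the number of individuals divides evenly among them.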
As we proceed through this course, we will see the critical role that simple random samples play in the types of inference we will study. Although we will not generally select the samples, it is important to understand how such selections are made.