Observational studies are used widely to investigate conditions and characteristics of populations, to estimate values of parameters of individual groups, to compare changes in parameters over time, or to compare parameters of two or more groups. Recall that an observational study is one in which the researchers impose no treatment on the individuals being studied.
Oceana is a nonprofit organization dedicated to improving the earth’s oceans. Among the organization’s campaigns is one to stop seafood contamination. Of particular concern to Oceana is the level of mercury in fish. Mercury is a toxin that is particularly dangerous to children; a woman’s exposure to mercury during pregnancy can affect her unborn child. The FDA warns women of child-bearing age and children against consuming fish with high-mercury levels, and recommends limiting the amount of lower-level mercury fish eaten as well.
As part of their campaign, Oceana conducted a mercury sampling project in which volunteers purchased fish samples from grocery stores and sushi restaurants in 26 American cities; the mercury content in each one was analyzed. In addition, the volunteers questioned grocery store workers about government advice on mercury in seafood for women thinking about having children.
This study is clearly an observational one. The intent is to report on the mercury level in purchased fish, and to determine whether grocery store clerks know the FDA advisory. No treatment was applied to either the fish or the clerks; information from each group was recorded, summarized, and reported.
Let’s take a closer look at the design of this study. Volunteers collected fish samples from grocery stores and sushi restaurants in 26 American cities. They were asked to purchase swordfish and tuna steaks from the grocery stores; if either fish was unavailable, tilapia was substituted. At the sushi restaurants, volunteers were to purchase samples containing tuna and mackerel.
To assess seafood counter clerks’ knowledge of FDA recommendations, volunteers asked “What is the government advice on mercury in seafood for women who are thinking about having kids?” in 40 stores in 38 cities.
In this instance, there are really three samples: the fish purchased from grocery stores, the fish purchased from sushi restaurants, and the answers given by the clerks about the government’s advice on mercury. For this (or any other) study, we are interested in two questions. How random were these samples, and do they accurately reflect the populations of interest? These questions are difficult to answer in this case. There is no discussion in the report of how the cities were chosen or the grocery stores selected, and the restaurants were chosen by the volunteers, not by a random process.
Randomization is an important feature of statistical design. Randomization is the use of impersonal chance either to select individuals to study or to assign individuals to treatment groups. Statisticians use randomization to avoid bias in study results. Bias is the predisposition of a study toward certain results; it is caused by problems in the study design or sample selection. (For example, asking bicyclists whether road improvement funds should be used to create dedicated bike lanes is likely to produce a higher rate of affirmative answers than asking the same question of the general public.)
Whenever a study depends on volunteers (and studies frequently do), it is important to remember that their responses (or choices, in this case) may introduce some amount of bias into the results. Here there may be nothing wrong: the samples of fish selected may be every bit as good as ones selected randomly. And that is really the point of the second question: do they accurately reflect the situation across the country? What we are asking is not whether the sample is random (it isn’t), but rather whether it is representative of what the average consumer would purchase in his or her favorite grocery store or sushi restaurant.
Again, we just don’t know. Random selection of individuals gives us the best chance of getting a sample that represents the population. When a study does not include a random sample, we wonder about the results. Some people would say the samples in the Oceana report are fine, and the results can be trusted. Others would want to know more about the selection method. Statistics is about making decisions at all stages of the process. As a consumer of statistics, you get to decide; our goal is to help you make informed decisions.
When an observational study is a poll or survey in which human beings answer questions, study design becomes even more complicated. Professional polling and market research organizations design surveys to minimize any factors that might introduce bias in the results, and use random selection to produce samples that are representative of the population of interest.
A major concern in all survey design is response bias, any aspect of the design that influences the participants’ responses. Response bias comes from a number of sources: the topics covered in the survey, the kind of questions asked, how they are worded, the behavior of the interviewers (if there are any), and the behavior of the respondents.
If the survey addresses sensitive topics, such as behavior considered immoral or illegal, respondents are inclined to answer questions as they believe they should, whether those answers are entirely truthful or not. People also tend to overestimate behavior perceived as good. Questions such as “Did you vote in the last election?” typically produce a higher percentage of “voters” than actual returns show. How the respondent perceives an interviewer, or how the respondent believes the interviewer perceives him or her, can also affect answers.
In designing a survey, researchers must decide whether to use open or closed questions. An open question is one that allows the respondent to answer as he or she chooses. A closed question is one in which the respondent must choose an answer from a set supplied by the interviewer. Closed questions limit the participants’ choices and may not include an answer that reflects a respondent’s true opinion. Therefore, the results may overestimate certain opinions.
The particular wording of a question can change the results as well. The Pew Research Center for the People and the Press frequently creates two versions of a survey, with the respondents randomly assigned to one of the two forms. Since randomization creates two groups of individuals that are essentially the same, any difference in response can be attributed to the difference in wording.
The scholarly article Why the 1936 Literary Digest Poll Failed discusses problems with the Literary Digest poll. Similarly, US Election 1948: The First Great Controversy about Polls, Media, and Social Science points out how errors in the 1948 polls led to improvements in the polling process.
To illustrate this, the Pew Center’s website describes a survey taken in January 2003. On Form 1, they asked the question, "Would you favor or oppose taking military action in Iraq to end Saddam Hussein's rule?" On Form 2, they asked the question "Would you favor or oppose taking military action in Iraq to end Saddam Hussein's rule even if it meant that U.S. forces might suffer thousands of casualties?" As you might expect, more people who answered the Form 1 question favored military action (68%) than those who answered the Form 2 question (43%).
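The split-form design depends on assigning respondents to forms by chance rather than by any choice of the interviewer or respondent. A minimal Python sketch of that step follows, assuming respondents are identified by simple ID numbers; the function name `assign_forms`, the ID range, and the seed are illustrative and are not taken from the Pew Center's actual procedure.

```python
import random

def assign_forms(respondent_ids, seed=None):
    """Randomly split respondents into two groups of (nearly) equal size."""
    rng = random.Random(seed)          # seeded only so the example is repeatable
    shuffled = list(respondent_ids)
    rng.shuffle(shuffled)              # impersonal chance decides the order
    half = len(shuffled) // 2
    return {"form_1": shuffled[:half], "form_2": shuffled[half:]}

# Hypothetical pool of 1,000 respondent IDs
groups = assign_forms(range(1, 1001), seed=2003)
print(len(groups["form_1"]), len(groups["form_2"]))
```

Because chance alone determines who sees which wording, the two groups should be similar in age, politics, and other characteristics, so a difference in answers can reasonably be attributed to the wording.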
The NPR story Death-Penalty Opinion Varies With the Question discusses the effect of question wording on survey results.
Selecting a representative sample begins with identifying the sampling frame, a list of the individuals that make up the population. In some cases this can be an actual list (all 2013 graduates of your college), but frequently it is a more theoretical one (all adult Americans). For many decades, in order to survey adult Americans, polling organizations used households with landline telephones as the sampling frame. The sample households actually contacted were then selected using various forms of random digit dialing.
In selecting the sample, researchers must address the issues of undercoverage and nonresponse. Undercoverage is underrepresentation in the sample of a group or groups of individuals in the population. A sampling frame that uses households with landline telephones to select a sample omits households without a landline telephone. Historically, these households were less affluent, because not having a landline meant not having a telephone at all.
A Literary Digest poll in 1936 used the opinions of over 2.3 million individuals, chosen primarily from phone directories and car registries, to predict that Alf Landon would defeat Franklin Roosevelt in that year’s presidential election. In depression-era America, having a phone or a car was characteristic of wealthier households, with individuals more likely to vote for the Republican candidate Landon. Many statisticians believe that this undercoverage of less wealthy Americans is responsible, at least in part, for the incorrect prediction.
In recent years, however, households without landlines include those that use only cell phones. The persons in cell-phone-only households are different in several important characteristics from those in landline households; age is certainly one of those characteristics, as cell-phone-only households tend to be younger.
Studies have been done to compare the results of surveys that include cell-phone-only individuals with those that do not, and to correct for undercoverage when cell-phone-only users are not part of the sampling frame. In 2007 an entire special issue of Public Opinion Quarterly was devoted to cell phone numbers and telephone surveys. Pollsters have more recently tested surveys that include a combination of landline, cell phone, text messaging, and Internet answers to avoid undercoverage.
The NPR story Capturing Cell Phone-Only Users in Political Polls discusses the characteristics of cell phone-only users and their effects on political polls.
Another factor that contributed to the prediction error in the 1936 Literary Digest poll was nonresponse, the failure of individuals selected for the sample to actually participate in the study. More than 10 million straw ballots were mailed out, but fewer than 3 million were returned. It is believed that those who failed to respond favored Roosevelt more than Landon.
Today researchers use statistical methods to decide on appropriate sample sizes, and they make multiple attempts to contact each individual in order to minimize nonresponse. Pollsters have found that they need to call more people when surveying cell-phone-only users because there is a higher refusal rate with this group. This may be because people are potentially busier when contacted on a cell phone than on a home landline: they may be driving, in class, working, or socializing when the call comes in.
Why all this fuss about choosing the sample? Wouldn’t it be easier to just put your survey on the Internet, and let whoever feels like it reply? Easier, yes; more useful, no. Statisticians are particularly wary of voluntary response samples, in which individuals choose to respond to a particular question. Without randomization, there is no way to judge how closely the respondents match the desired population in important characteristics, or to measure the error in any predictions based on the results. Voluntary response surveys can be fun (the popularity of American Idol attests to that), but if you want to rely on the numbers you get, you should start with a properly chosen random sample.
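A small simulation can make the danger of voluntary response concrete. The sketch below is built entirely on invented numbers: a hypothetical population in which 30% hold a "yes" opinion, and an assumed tendency for "yes" holders to respond to an open online survey five times as often as everyone else. None of these figures come from a real survey.

```python
import random

rng = random.Random(1)  # seeded only so the example is repeatable

# Hypothetical population of 100,000 people; True means the person holds the "yes" opinion.
population = [rng.random() < 0.30 for _ in range(100_000)]

# Random sample: every person has the same chance of being chosen.
random_sample = rng.sample(population, 1000)
random_estimate = sum(random_sample) / len(random_sample)

# Voluntary response: ASSUMED response probabilities of 10% for "yes" holders
# and 2% for everyone else (people with strong views reply more often).
volunteers = [opinion for opinion in population
              if rng.random() < (0.10 if opinion else 0.02)]
voluntary_estimate = sum(volunteers) / len(volunteers)

print(f"Random sample estimate:      {random_estimate:.1%}")
print(f"Voluntary response estimate: {voluntary_estimate:.1%}")
```

Under these assumptions the random sample lands near the true 30%, while the voluntary response estimate is pulled toward roughly two thirds, since 0.30 × 0.10 / (0.30 × 0.10 + 0.70 × 0.02) ≈ 0.68. In a real voluntary survey the response tendencies are unknown, which is exactly why the error cannot be measured.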
The Pew Research Center reported on its findings from a telephone poll on racial attitudes taken from September 3 to October 6, 2007. Their results showed that many African Americans believed that the gap between the values of middle class and poor blacks was widening. One-fifth of blacks said that things were better for blacks than five years earlier; whites were nearly twice as likely as blacks to see black gains in the past five years.
The Pew Research Center is a well-respected social research organization whose reports are widely quoted in the media. Let’s take a look at the methods used in this poll to minimize bias and produce reliable results.
Survey Design: The survey consisted of 42 questions, some with follow-up questions. Most of the questions were closed; a typical question was “Do you think white students and black students should go to the same schools or to separate schools?”
Because the order in which choices are presented may influence respondents’ answers, the order of the choices was rotated for some questions. For example, in a question asking about the seriousness of various local problems, the choices crime, high school students dropping out, the number of children born to unmarried mothers, the lack of good paying jobs, the quality of the public schools, and illegal immigration were given in different orders to different individuals.
Questions about racial attitudes are often sensitive; people may hesitate to answer honestly, particularly if they perceive that the interviewer is a person of a different race. In this survey, care was taken to have African-American interviewers question African-American respondents; 82% of African-Americans were interviewed by black interviewers. Similarly, 76% of white respondents were interviewed by non-black interviewers.
Sample Selection: The sampling frame here consisted of adult Americans (African-American, Hispanic and white). The survey interviewed 3,086 adults living in telephone households in the continental United States. Interviews were conducted in English or in Spanish. Two separate samples were used. One sample of 2,522 households was selected by a random digit dialing method in which more numbers were chosen from areas with higher concentrations of African-American and Hispanic households. A second sample of 564 households consisted of those screened for (but not used for) a previous survey; each of these households contained an adult African-American.
Telephone numbers were called up to ten times, and at different times of day, in an attempt to reach all numbers. In each contacted household, interviewers asked to speak with the youngest male adult currently at home. If no male was available, the interviewers asked to speak with the youngest adult female at home. This methodology produces samples that are more representative of the population in terms of age and sex, as compared to interviewing whatever adult answers the phone.
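The within-household selection rule just described can be written as a short function. This is only a sketch: the representation of household members as dictionaries with `age` and `sex` keys, and the function name `select_respondent`, are assumptions made for illustration and not part of the survey's documentation.

```python
def select_respondent(adults_at_home):
    """Apply the rule: youngest adult male at home, else youngest adult female.

    adults_at_home: list of dicts such as {"age": 23, "sex": "M"} (hypothetical format),
    assumed to contain only adults who are currently at home.
    """
    males = [a for a in adults_at_home if a["sex"] == "M"]
    females = [a for a in adults_at_home if a["sex"] == "F"]
    pool = males if males else females     # fall back to females only if no male is home
    if not pool:
        return None                        # nobody eligible is available
    return min(pool, key=lambda a: a["age"])

household = [{"age": 52, "sex": "F"}, {"age": 23, "sex": "M"}, {"age": 49, "sex": "M"}]
print(select_respondent(household))
```

A fixed rule like this takes the choice of respondent away from whoever happens to answer the phone, which is how it helps balance the sample on age and sex.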
Data Reporting and Analysis: The statisticians who analyze data for professional polling organizations employ methods much more complicated than those we will study. In this particular survey, researchers used a technique that adjusts for effects of study design and implementation that might introduce bias.
But some basic principles that will be important to us apply here. Because such a study yields sample statistics that are used to estimate population parameters, researchers typically report not only summarized data, but also approximate margins of error attached to these estimates. A margin of error is a number that estimates how far the parameter (which is a specific, but unknown, number) might be from the reported statistic. It establishes a range of values higher and lower than the statistic within which the parameter is likely to fall.
For this study, researchers reported margins of error as indicated in the table below.
Group | Sample Size | Margin of Error (percentage points) |
---|---|---|
Entire Sample | 3,086 | 2.5 |
Non-Hispanic Whites | 1,536 | 3.5 |
Non-Hispanic African-Americans | 1,007 | 4 |
Hispanics | 388 | 7 |
Source: Pew Research Center
The survey found that 15% of the entire sample believed that illegal immigration was a very big problem in their local communities; the margin of error for the entire sample was 2.5 percentage points. This means that the true percentage of all adult Americans who held this belief was likely to be between 12.5% and 17.5%.
It is easy to see that the larger sample sizes had smaller margins of error. In fact, researchers often select a sample of a certain size in order to have a desired margin of error.
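The interval arithmetic, and the link between sample size and margin of error, can be sketched in a few lines of Python. The function `srs_margin_of_error` uses a common textbook approximation for a proportion from a simple random sample at roughly 95% confidence; that formula is an assumption brought in for illustration and is not the method Pew used. The margins Pew reported are larger than this formula gives, plausibly because of the adjustments for study design mentioned earlier.

```python
import math

def interval(statistic, margin):
    """Range of plausible parameter values: statistic minus and plus the margin."""
    return statistic - margin, statistic + margin

def srs_margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error, in percentage points, for a proportion
    from a simple random sample of size n (ASSUMED formula, about 95% confidence)."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

low, high = interval(15.0, 2.5)
print(f"Likely range: {low}% to {high}%")   # 12.5% to 17.5%

# Sample sizes from the table; larger samples give smaller margins.
for group, n in [("Entire Sample", 3086), ("Non-Hispanic Whites", 1536),
                 ("Non-Hispanic African-Americans", 1007), ("Hispanics", 388)]:
    print(f"{group}: n = {n}, simple-random-sample margin = {srs_margin_of_error(n):.1f}")
```

Because the margin shrinks in proportion to the square root of the sample size, cutting the margin in half under this approximation requires about four times as many respondents.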
There is always a margin of error in sample surveys. The sample does not contain all members of the population, and so is unlikely to match its characteristics exactly. Also, the samples themselves vary; if the Pew researchers selected another sample of 3,086 people, even of the same racial composition, the percentages reporting particular answers would seldom be exactly the same.
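Sample-to-sample variation is easy to see in a simulation. The sketch below assumes a hypothetical population in which exactly 15% hold a given belief and draws 20 independent simple random samples of 3,086 people each; both the 15% figure and the simple random sampling are illustrative assumptions, not a reconstruction of the Pew design.

```python
import random

def sample_percentage(true_p, n, rng):
    """Percentage of 'yes' answers in one simulated random sample of size n."""
    hits = sum(rng.random() < true_p for _ in range(n))
    return 100 * hits / n

rng = random.Random(42)  # seeded only so the example is repeatable
results = [sample_percentage(0.15, 3086, rng) for _ in range(20)]

print("Sample percentages:", [round(r, 1) for r in results])
print(f"Smallest: {min(results):.1f}%  Largest: {max(results):.1f}%")
```

The percentages cluster around 15% but differ from one sample to the next, which is the variation a margin of error is meant to describe.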
In addition to reporting summary statistics, sample sizes and margins of error, researchers often indicate the response rate for the survey. The response rate for this Pew Research Center Survey was 24%. This represents the percentage of the households initially selected who actually completed an interview. Interestingly, this is approximately the same response rate as the Literary Digest poll, a response rate cited as a possible cause of the incorrect prediction.
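The comparison with the Literary Digest poll can be checked with quick arithmetic. The sketch below uses the rounded figures given earlier in this section (about 10 million ballots mailed and about 2.3 million responses used), so the result is only approximate; the function name `response_rate` is illustrative.

```python
def response_rate(completed, selected):
    """Percentage of selected units that actually took part."""
    return 100 * completed / selected

# Rounded figures from the text; the true counts were somewhat larger.
literary_digest = response_rate(2_300_000, 10_000_000)
print(f"Literary Digest (approx.): {literary_digest:.0f}%")   # 23%
print("Pew survey (as reported):  24%")
```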
Why is the Pew Research Center willing to publish a poll with such a low response rate? As polls have become more numerous, Americans busier, and caller ID more common, response rates for national polls have declined. The Pew Research Center itself conducted experiments which compared usual polling techniques with more rigorous ones designed to obtain a higher response rate. These and other studies indicate that carefully designed polls do obtain representative samples and accurate results. The article Gauging the Impact of Growing Nonresponse on Estimates from a National RDD Telephone Survey in the journal Public Opinion Quarterly reports on such an experiment.
When the Supreme Court struck down Washington, D.C.’s ban on handguns in June 2008, it reignited the debate over the effect of such legislation. A study by University of Maryland researchers examined the effects of restricting access to handguns on gun-related homicides and suicides. The study investigated homicides and suicides committed from 1968 through 1987, classifying them by location (D.C. or adjacent metropolitan area without the ban), cause (homicide or suicide), method (firearm or other), and time of occurrence (before or after the ban).
Average monthly totals for before the ban and after the ban were calculated. The study found that in Washington, D.C., after the ban, the numbers of both homicides and suicides by firearms were reduced by more than 20% (by 3.3 per month for homicides and 0.6 for suicides). In adjacent areas without the ban, no such decreases were found.
Here we have two variables that seem to be related. The number of homicides and suicides by firearms is the outcome that we are studying; we wonder if the number of firearm deaths responded to the presence of the ban. For this reason, we call the number of such deaths the response variable. The response variable in a study is the outcome that we are investigating. Another way of phrasing our research question is “Does the presence of the handgun ban ‘explain’ the number of deaths?” Thus, we refer to the presence of the ban as the explanatory variable. The explanatory variable in a study is the variable that explains or predicts the values of the response variable.
Do these results show that handguns cause homicides and suicides? Or that the ban stops them? Unfortunately, it’s not that simple. What the results show is that after the ban, firearm deaths declined in Washington, D.C. The fact that the decrease followed the ban does not mean that the ban caused it. Perhaps factors that were not studied (such as changes in gang activity or employment rates) contributed to the decrease. We refer to these other factors as lurking variables. A lurking variable is a characteristic of the sample that is not investigated as part of the study, but which may influence the results. The inability to distinguish between the effects of explanatory and lurking variables on the response variable is called confounding.
In order to establish a cause-and-effect relationship, an experiment must be performed. But in many cases, such as this one, an experiment is not possible. Controversies arise when different observational studies point to different conclusions; in fact, this has been the case with gun control studies. In the case of the link between smoking and lung cancer, many, many observational studies were conducted over many, many years before scientists were willing to state definitively that smoking is a cause of lung cancer.
Former and current drug users are at increased risk for contracting the Hepatitis C virus (HCV) because the virus can be transmitted through shared needles. In a California study, seventy-one recovering drug users on methadone maintenance who had HCV were treated with interferon plus ribavirin for 24 or 48 weeks. About a third of the subjects used marijuana while being treated; according to the researchers, marijuana use was neither endorsed nor prohibited. At the end of the treatment, the individuals’ HCV viral load was measured. Sixty-four percent of marijuana users had undetectable viral load, while 47% of non-users did.
While observational studies, including polls and surveys, can be valuable tools in gathering information, they do require care in their design, implementation, and interpretation. We will continue to return to issues discussed here as we present additional concepts and procedures. In Section 2.2, we will look at experiments and how they can be used to link cause and effect.