When you complete this section, you will be able to:
• Identify the key characteristics of a data set to be used to explore a relationship between two variables.
• Categorize variables as response variables or explanatory variables.
In Chapter 1 (page 2), we discussed the key characteristics of a data set. Cases are the objects described by a set of data, and a variable is a characteristic of a case. We also learned to categorize variables as categorical or quantitative. For Chapter 2, we focus on data sets that have pairs of variables that we want to study together. Here is an example.
EXAMPLE 2.1
College students cope with stress. Stress is a common problem for college students. Exploring factors that are associated with stress may lead to strategies that will help students to relieve some of the stress that they experience. A recent study found that students who experienced greater stress had less access to resources that would help them to cope with their stress.1 The two variables involved in the relationship here are perceived stress and resources to cope. The cases are the 97 students who are the subjects for a particular study.
When we study relationships between two variables, it is not sufficient to collect data on the two variables. A key idea for this chapter is that both variables must be measured on the same cases.
USE YOUR KNOWLEDGE
2.1 Facebook friends. Do people who have more Facebook friends spend more time on Facebook? In an introductory statistics class of 38 students, there were 32 users of Facebook. Each of these students was asked to report how many Facebook friends they had and the average amount of time that they spent on Facebook per week.
(a) Who are the cases for this study?
(b) What are the variables?
(c) Are the variables quantitative or categorical? Explain your answer.
We use the term associated to describe the relationship between two variables, such as stress and access to resources to cope in Example 2.1. Here is another example where two variables are associated.
EXAMPLE 2.2
Size and price of a coffee beverage. You visit a local Starbucks to buy a Mocha Frappuccino®. The barista explains that this blended coffee beverage comes in three sizes and asks if you want a Tall, a Grande, or a Venti. The prices are $3.95, $4.45, and $4.95, respectively. There is a clear association between the size of the Mocha Frappuccino® and its price.
ASSOCIATION BETWEEN VARIABLES
Two variables measured on the same cases are associated if knowing the values of one of the variables tells you something about the values of the other variable.
In the Mocha Frappuccino® example, knowing the size tells you the exact price, so the association here is very strong. Many statistical associations, however, are simply overall tendencies that allow exceptions. For example, it’s likely that some students in Example 2.1 are highly stressed and have a high level of resources to cope. Others experience little stress and have a low level of resources to cope. The association in that example is much weaker than the one in the Mocha Frappuccino® example.
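The contrast between an exact association and an overall tendency can be sketched in a few lines of Python. This is only an illustration: the sizes and prices come from Example 2.2, but the list of purchases, the student records, and the helper name `determines_exactly` are assumptions made up for this sketch and are not data from the study.

```python
from collections import defaultdict


def determines_exactly(pairs):
    """Return True if each value of the first variable always occurs
    with one and the same value of the second variable."""
    seen = defaultdict(set)
    for first, second in pairs:
        seen[first].add(second)
    return all(len(values) == 1 for values in seen.values())


# Hypothetical purchases (the cases); sizes and prices are those of Example 2.2.
purchases = [
    ("Tall", 3.95),
    ("Grande", 4.45),
    ("Venti", 4.95),
    ("Grande", 4.45),
    ("Tall", 3.95),
]

# Hypothetical students recorded as (resources to cope, perceived stress).
# These values are invented to show a tendency that allows exceptions.
students = [
    ("low", "high"),
    ("low", "high"),
    ("high", "low"),
    ("high", "high"),  # exception: high resources but high stress
    ("low", "low"),    # exception: low resources but low stress
]

print(determines_exactly(purchases))  # size fixes the price
print(determines_exactly(students))   # resources do not fix the stress level
```

An association does not have to pass this strict check to be real. The check only separates the special case in which one variable tells you the other exactly from the more common case of a tendency with exceptions.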
Examining relationships
To examine the relationship between two or more variables, we first need to know some basic characteristics of the data.
EXAMPLE 2.3
Stress and resources to cope. Refer to Example 2.1. The study asked 97 first-year college students about their stress (perceived stress) and the availability of resources to deal with stress (resources to cope).2 Perceived stress is based on responses to 10 questions that are summarized in a single variable. Therefore, we will treat the perceived stress as a quantitative variable. Resources to cope is constructed in a similar way summarizing the responses to 20 questions. We treat resources to cope as a quantitative variable also.
In many situations, we measure a collection of categorical variables and then combine them in a scale that can be viewed as a quantitative variable. The perceived stress and resources to cope are examples. We can also turn the tables in the other direction. Here is an example.
EXAMPLE 2.4
Hemoglobin and anemia. Hemoglobin is a measure of iron in the blood. The units are grams of hemoglobin per deciliter of blood (g/dl). Typical values depend on age and sex. Adult women typically have values between 12 and 16 g/dl.
Anemia is a major problem in developing countries, and many studies have been designed to address the problem. In these studies, computing the mean hemoglobin is not particularly useful. For studies like these, it is more appropriate to use a definition of severe anemia (a hemoglobin of less than 8 g/dl). Thus, for example, researchers can compare the proportions of subjects who are severely anemic for two treatments rather than the difference in the mean hemoglobin levels. In this situation, the categorical variable, severely anemic or not, is much more useful than the quantitative variable, hemoglobin.
When analyzing data to draw conclusions, it is important to carefully consider the best way to summarize the data. Just because a variable is measured as a quantitative variable, it does not necessarily follow that the best summary is based on the mean (or the median). As the previous example illustrates, converting a quantitative variable to a categorical variable is a very useful option to keep in mind.
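As a minimal sketch of this idea, the Python code below converts hemoglobin measurements into the categorical variable "severely anemic or not" and compares the proportions for two treatments. The 8 g/dl cutoff is the definition given in Example 2.4. The hemoglobin values, the group names, and the function names are hypothetical and serve only to illustrate the conversion.

```python
SEVERE_ANEMIA_CUTOFF = 8.0  # g/dl, the definition used in Example 2.4


def is_severely_anemic(hemoglobin_g_dl, cutoff=SEVERE_ANEMIA_CUTOFF):
    """Categorical variable: True if hemoglobin is below the cutoff."""
    return hemoglobin_g_dl < cutoff


def proportion_severely_anemic(values, cutoff=SEVERE_ANEMIA_CUTOFF):
    """Proportion of subjects whose hemoglobin is below the cutoff."""
    flags = [is_severely_anemic(v, cutoff) for v in values]
    return sum(flags) / len(flags)


# Hypothetical hemoglobin values (g/dl) for two treatments; not real data.
treatment_a = [7.2, 9.5, 10.1, 7.8, 11.3, 8.4, 12.0, 6.9]
treatment_b = [9.1, 10.4, 8.6, 7.5, 11.8, 12.2, 9.9, 10.7]

prop_a = proportion_severely_anemic(treatment_a)
prop_b = proportion_severely_anemic(treatment_b)

print(f"Treatment A: {prop_a:.3f} severely anemic")
print(f"Treatment B: {prop_b:.3f} severely anemic")
```

The summary for each group is now a proportion rather than a mean, which matches the comparison described in Example 2.4.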
USE YOUR KNOWLEDGE
2.2 Create a categorical variable from a quantitative variable. Consider the study described in Example 2.3. Suppose that we order the students based on the values of resources to cope from smallest to largest. Then, we define three resource groups: low resources, the first 32 students; medium resources, the next 33 students; and high resources, the remaining 32 students. If we compare the perceived stress of the three resource groups, are we using resource group as a quantitative variable or as a categorical variable? Explain your answer and describe some advantages to using the groups versus the original variable in explaining the results of a study such as this.
2.3 Replace names by ounces. In the Mocha Frappuccino® example, the variable size is categorical, with Tall, Grande, and Venti as the possible values. Suppose that you converted these values to the number of ounces: Tall is 12 ounces, Grande is 16 ounces, and Venti is 24 ounces. For studying the relationship between ounces and price, describe the cases and the variables and state whether each is quantitative or categorical.
When you examine the relationship between two variables, a new question becomes important:
Is your purpose simply to explore the nature of the relationship, or do you hope to show that one of the variables can explain variation in the other? In other words, is one of the variables a response variable and the other an explanatory variable?
RESPONSE VARIABLE, EXPLANATORY VARIABLE
A response variable measures an outcome of a study. An explanatory variable explains or causes changes in the response variable.
EXAMPLE 2.5
Stress and resources to cope. Refer to the study of stress and resources to cope in Example 2.3. Here, the explanatory variable is resources to cope and the response variable is perceived stress.
USE YOUR KNOWLEDGE
2.4 Stress and resources or resources and stress? Consider the scenario described in the previous example. Note that the variable, resources to cope, is constructed by summarizing the responses to 20 questions that include items measuring the skills that the student has developed to reduce stress. Make an argument for treating stress as the explanatory variable and resources to cope as the response variable.
In some studies, it is easy to identify explanatory and response variables. The following example illustrates one situation where this is true: when we actually set values of one variable to see how it affects another variable.
EXAMPLE 2.6
How much calcium do you need? Adolescence is a time when bones are growing very actively. If young people do not have enough calcium, their bones will not grow properly. How much calcium is enough? Research designed to answer this question has been performed for many years at events called “Camp Calcium.”3 At these camps, subjects eat controlled diets that are identical except for the amount of calcium. The amount of calcium retained by the body is the major response variable of interest. Because the amount of calcium consumed is controlled by the researchers, this variable is the explanatory variable.
When you don’t set the values of either variable but just observe both variables, there may or may not be explanatory and response variables. Whether the distinction applies depends on how you plan to use the data.
EXAMPLE 2.7
Student loans. A college student aid officer looks at the findings of the National Student Loan Survey. She notes data on the amount of debt of recent graduates, their current income, and how stressed they feel about college debt. She isn’t interested in predictions but is simply trying to understand the situation of recent college graduates.
A sociologist looks at the same data with an eye to using amount of debt and income, along with other variables, to explain the stress caused by college debt. Now, amount of debt and income are explanatory variables, and stress level is the response variable.
In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. But many explanatory-response relationships do not involve direct causation. The SAT scores of high school students help predict the students’ future college grades, but high SAT scores certainly don’t cause high college grades.
KEY CHARACTERISTICS OF DATA FOR RELATIONSHIPS
A description of the key characteristics of a data set that will be used to explore a relationship between two variables should include
• Cases. Identify the cases and how many there are in the data set.
• Categorical or quantitative. Classify each variable as categorical or quantitative.
• Values. Identify the possible values for each variable.
• Explanatory or response. If appropriate, classify each variable as explanatory or response.
• Label. Identify what is used as a label variable if one is present.
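One way to keep such a description organized is to record it in a small data structure. The Python sketch below does this for the stress study of Examples 2.3 and 2.5. The class and field names are hypothetical choices for this illustration. The text does not state the possible values of the two scores or mention a label variable, so those entries are left unspecified rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariableDescription:
    name: str
    kind: str                 # "categorical" or "quantitative"
    values: Optional[str]     # possible values, or None if not stated
    role: Optional[str]       # "explanatory", "response", or None


@dataclass
class DatasetDescription:
    cases: str
    n_cases: int
    variables: List[VariableDescription]
    label: Optional[str] = None  # label variable, if one is present


# Description of the study in Examples 2.3 and 2.5.
description = DatasetDescription(
    cases="first-year college students",
    n_cases=97,
    variables=[
        VariableDescription(
            name="perceived stress",
            kind="quantitative",
            values=None,  # range of the 10-question score is not given
            role="response",
        ),
        VariableDescription(
            name="resources to cope",
            kind="quantitative",
            values=None,  # range of the 20-question score is not given
            role="explanatory",
        ),
    ],
)

for variable in description.variables:
    print(f"{variable.name}: {variable.kind}, {variable.role}")
```

Leaving a field as `None` makes a gap in the description visible, which is a useful prompt to go back to the study and find the missing detail.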
Some of the statistical techniques in this chapter require us to distinguish explanatory from response variables; others make no use of this distinction. You will often see explanatory variables called independent variables and response variables called dependent variables. These terms express mathematical ideas; they are not statistical terms. The concept that underlies this language is that the response depends on explanatory variables. Because the words “independent” and “dependent” have other meanings in statistics that are unrelated to the explanatory-response distinction, we prefer to avoid those words.
Most statistical studies examine data on more than one variable. Fortunately, statistical analysis of several-variable data builds on the tools used for examining individual variables. The principles that guide our work also remain the same:
• Start with a graphical display of the data.
• Look for overall patterns and deviations from those patterns.
• Based on what you see, use numerical summaries to describe specific aspects of the data.