16.3 Multiple Regression

An orthogonal variable is an independent variable that makes a separate and distinct contribution in the prediction of a dependent variable, as compared with the contributions of another variable.

In regression analysis, we explain more of the variability in the dependent variable if we can discover genuine predictors that are separate and distinct. This involves orthogonal variables, independent variables that make separate and distinct contributions in the prediction of a dependent variable, as compared with the contributions of other variables. Orthogonal variables do not overlap each other. For example, the study we discussed earlier explored whether the amount of Facebook use predicted social capital. It is likely that a person’s personality also predicts social capital; for example, we would expect extroverted, or outgoing, people to be more likely to have this kind of social capital than introverted, or shy, people. It would be useful to separate the effects of the amount of Facebook use and extroversion on social capital.

The statistical technique we consider next is a way of quantifying (1) whether multiple pieces of evidence really are better than one, and (2) precisely how much better each additional piece of evidence actually is.

Understanding the Equation

Multiple regression is a statistical technique that includes two or more predictor variables in a prediction equation.

Just as a regression equation using one independent variable is a better predictor than the mean, a regression equation using more than one independent variable is likely to be an even better predictor. This makes sense in the same way that knowing a baseball player’s historical batting average plus knowing that the player continues to suffer from a serious injury is likely to improve our ability to predict the player’s future performance. So it is not surprising that multiple regression is far more common than simple linear regression. Multiple regression is a statistical technique that includes two or more predictor variables in a prediction equation.

Let’s examine an equation that might be used to predict a final exam grade from two variables, number of absences and score on the mathematics portion of the SAT. Table 16-8 repeats the data from Table 16-1, with the added variable of SAT score. (Note that although the scores on number of absences and final exam grade are real-life data from our statistics classes, the SAT scores are fictional.)

TABLE 16-8 Predicting Exam Grade from Two Variables
Multiple regression allows us to develop a regression equation that predicts a dependent variable from two or more independent variables. Here, we will use these data to develop a regression equation that predicts exam grade from number of absences and SAT score.
Student   Absences   SAT   Exam Grade
1         4          620   82
2         2          750   98
3         2          500   76
4         3          520   68
5         1          540   84
6         0          690   99
7         4          590   67
8         8          490   58
9         7          450   50
10        3          560   78

MASTERING THE CONCEPT

16.5: Multiple regression predicts scores on a single dependent variable from scores on more than one independent variable. Because behavior tends to be influenced by many factors, multiple regression allows us to better predict a given outcome.

The computer gives us the printout seen in Figure 16-7. The column in which we’re interested is the one labeled “B,” under “Unstandardized Coefficients.” The first number, across from “(Constant),” is the intercept. The intercept is called the constant because it does not change; it is not multiplied by any value of an independent variable. The intercept here is 33.422. The second number is the slope for the first independent variable, number of absences. Number of absences is negatively correlated with final exam grade, so the slope, −3.340, is negative. The third number in this column is the slope for the second independent variable, SAT score. As we might guess, SAT score and final exam grade are positively correlated; a student with a higher SAT score tends to have a higher final exam grade. So the slope, 0.094, is positive. We can put these numbers into a regression equation, with X₁ representing the first independent variable, number of absences, and X₂ representing the second independent variable, SAT score:

Ŷ = 33.422 − 3.340(X₁) + 0.094(X₂)

Figure 16-7

Software Output for Regression Computer software provides the information necessary for the multiple regression equation. All necessary coefficients are in column B under “Unstandardized Coefficients.” The constant, 33.422, is the intercept; the number next to “Number of absences,” −3.340, is the slope for that independent variable; and the number next to “SAT,” 0.094, is the slope for that independent variable.
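
If you would like to verify the coefficients in Figure 16-7 yourself, here is a minimal sketch in Python using NumPy (our own illustration, not the software that produced the printout), fitting ordinary least squares to the Table 16-8 data:

```python
import numpy as np

# Data from Table 16-8
absences = np.array([4, 2, 2, 3, 1, 0, 4, 8, 7, 3])
sat = np.array([620, 750, 500, 520, 540, 690, 590, 490, 450, 560])
grade = np.array([82, 98, 76, 68, 84, 99, 67, 58, 50, 78])

# Design matrix: a column of 1s (for the intercept), then the two predictors
X = np.column_stack([np.ones(len(grade)), absences, sat])

# Ordinary least squares fit: [intercept, slope for absences, slope for SAT]
coefs, *_ = np.linalg.lstsq(X, grade, rcond=None)
print(coefs)  # approximately [33.422, -3.340, 0.094]
```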

Once we develop the multiple regression equation, we can input raw scores on number of absences and mathematics SAT score to determine a student’s predicted score on Y. Imagine that our student, Allie, scored 600 on the mathematics portion of the SAT. We already know she planned to miss two classes this semester. What would we predict her final exam grade to be?
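
Substituting her two planned absences and her SAT score of 600 into the equation:

Ŷ = 33.422 − 3.340(2) + 0.094(600) = 33.422 − 6.680 + 56.400 = 83.142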

Based on these two variables, we predict a final exam grade of 83.142 for Allie. How good is this multiple regression equation? Using software, we calculated that the proportionate reduction in error for this equation is a whopping 0.93. By using a multiple regression equation with the independent variables of number of absences and SAT score, we have eliminated 93% of the error that would result from predicting the mean of 76 for everyone.

When we calculate the proportionate reduction in error for a multiple regression, the symbol changes slightly: it is now R² instead of r². The capitalization of this statistic indicates that the proportionate reduction in error is based on more than one independent variable.
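
Continuing the NumPy sketch above (and reusing its X, coefs, and grade), R² can be computed directly by comparing the error around the regression equation with the error around the mean:

```python
predicted = X @ coefs                           # predictions from the equation
ss_error = np.sum((grade - predicted) ** 2)     # squared error around the equation
ss_total = np.sum((grade - grade.mean()) ** 2)  # squared error around the mean (76)
print(1 - ss_error / ss_total)                  # R squared, approximately 0.93
```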

Stepwise Multiple Regression and Hierarchical Multiple Regression

Stepwise multiple regression is a type of multiple regression in which a computer program determines the order in which independent variables are included in the equation.

Researchers have several options to choose from when they conduct statistical analyses using multiple regression. One common approach is stepwise multiple regression, a type of multiple regression in which a computer program determines the order in which independent variables are included in the equation. Stepwise multiple regression is used frequently by researchers because it is the default in many computer programs.

When the researcher conducts a stepwise multiple regression, the program implements a series of steps. In the first step, it identifies the independent variable responsible for the most variance in the dependent variable—that is, the independent variable with the highest R². In other words, the program examines each independent variable as if it were the only predictor of the dependent variable; the one with the highest R² “wins.” If the winning independent variable is not a statistically significant predictor of the dependent variable, then the process stops. After all, if the independent variable that explains the most variance in the dependent variable is not significant, then the other independent variables won’t be significant either.

If the first independent variable is a statistically significant predictor of the dependent variable, then the program continues to the next step: choosing the second independent variable that, in conjunction with the one already chosen, is responsible for the largest amount of variance in the dependent variable. If the R² of both independent variables together represents a statistically significant increase over the R² of just the first independent variable alone, then the program continues to the next step: choosing the independent variable responsible for the next largest amount of variance, and so on. So, at each step, the program assesses whether the change in R², after adding another independent variable, is statistically significant. If the inclusion of an additional independent variable does not lead to a statistically significant increase in R² at any step, then the program stops.
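
To make this procedure concrete, here is a minimal sketch of forward stepwise selection in Python. It is our own simplified illustration of the logic described above, not the algorithm used by any particular statistics package; the function names are ours, and it assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

def r_squared(X, y):
    """R squared for an OLS fit of y on X (X already includes an intercept column)."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(predictors, y, alpha=0.05):
    """predictors maps names to 1-D arrays; returns the names in order of entry."""
    n = len(y)
    chosen, remaining, r2_old = [], dict(predictors), 0.0

    def r2_with(name):  # R squared of the model with `name` added to those chosen
        cols = [np.ones(n)] + [predictors[c] for c in chosen + [name]]
        return r_squared(np.column_stack(cols), y)

    while remaining:
        best = max(remaining, key=r2_with)       # candidate adding the most R squared
        r2_new = r2_with(best)
        p = len(chosen) + 1                      # predictors in the candidate model
        # F test for the change in R squared when this one predictor is added
        F = (r2_new - r2_old) / ((1 - r2_new) / (n - p - 1))
        if stats.f.sf(F, 1, n - p - 1) > alpha:  # increase not significant: stop
            break
        chosen.append(best)
        del remaining[best]
        r2_old = r2_new
    return chosen
```

With the Table 16-8 data, for example, forward_stepwise({"absences": absences, "SAT": sat}, grade) would enter the predictors one at a time, stopping as soon as an addition fails to produce a statistically significant increase in R².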

Overlapping Independent Variables
When two independent variables measure similar characteristics and are highly predictive of the dependent variable, stepwise regression is not the best choice. For example, researchers might explore whether hours spent playing violent video games and hours spent watching violent television shows—independent variables that are strongly related to each other—predict aggression levels in children. Stepwise regression will likely show that one of these variables—say, video game playing—is a strong predictor of aggression; the second variable, watching violent TV shows, won’t explain any additional variability because it overlaps so much with the first. The first variable, video game playing, gets credit, so to speak, for all of the variance it contributes itself as well as all of the variance it shares with watching violent TV shows. The regression will falsely indicate that only violent video game playing predicts aggression.

The strength of using stepwise regression is its reliance on data, rather than theory—especially when a researcher is not certain of what to expect in a study. The results can generate hypotheses that the researcher can then test. That strength is also a weakness when a researcher is working with nonorthogonal, overlapping variables.

For example, imagine that both depression and anxiety are very strong negative predictors of the quality of one’s romantic relationship. Also imagine that there is a great deal of overlap in the predictive ability of depression and anxiety. That is, once depression is accounted for, anxiety doesn’t add much to the equation; similarly, once anxiety is accounted for, depression doesn’t add much to the equation. It is perhaps the negative affect (or mood) shared by both clusters of symptoms that predicts the quality of one’s relationship.

Now imagine that in one sample, depression turns out to be a slightly better predictor of relationship quality than is anxiety, but just barely. The computer would choose depression as the first independent variable. Because of the overlap between the two independent variables, the addition of anxiety would not be statistically significant. A stepwise regression would pinpoint depression, but not anxiety, as a predictor of relationship quality. That finding would suggest that anxiety is not a good predictor of the quality of one’s romantic relationship even though it could be extremely important.

Now imagine that in a second sample, anxiety is a slightly better predictor of relationship quality than is depression, but just barely. This time, the computer would choose anxiety as the first independent variable. Now the addition of depression would not be statistically significant. This time, a stepwise regression would pinpoint anxiety, but not depression, as a predictor of relationship quality. So the problem with using stepwise multiple regression is that two samples with very similar data can, and sometimes do, lead to drastically different conclusions.

Hierarchical multiple regression is a type of multiple regression in which the researcher adds independent variables into the equation in an order determined by theory.

That is why another common approach is hierarchical multiple regression, a type of multiple regression in which the researcher adds independent variables into the equation in an order determined by theory. A researcher might want to know the degree to which depression predicts relationship quality but knows that there are other independent variables that also affect relationship quality. Based on a reading of the literature, that researcher might decide to enter other independent variables into the equation before adding depression.

For example, the researcher might add age, a measure of social skills, and the number of years the relationship has lasted. After adding these independent variables, the researcher would add depression. If the addition of depression leads to a statistically significant increase in R², then the researcher has evidence that depression predicts relationship quality over and above those other independent variables. As with stepwise multiple regression, we’re interested in how much each additional independent variable adds to the overall variance explained, so we look at the increase in R² with the inclusion of each new independent variable or variables. The difference is that we predetermine the order (or hierarchy) in which variables are entered into a hierarchical regression equation.
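
Here is a minimal sketch of the hierarchical approach, reusing the r_squared helper from the stepwise sketch above. The predictors dictionary, the outcome array y, and the variable names (age, social_skills, years_together, depression) are all hypothetical, matching the example in the text.

```python
import numpy as np

# Assumes: predictors is a dict mapping names to 1-D arrays, y is the outcome
# array, and r_squared is the helper defined in the stepwise sketch above.
blocks = [
    ["age", "social_skills", "years_together"],  # theory-chosen controls first
    ["depression"],                              # does it add anything beyond them?
]

cols, r2_old = [np.ones(len(y))], 0.0
for block in blocks:
    cols += [predictors[name] for name in block]  # enter the next block
    r2_new = r_squared(np.column_stack(cols), y)
    # The change in R squared from each block is what we test for significance
    print(block, "change in R squared:", round(r2_new - r2_old, 3))
    r2_old = r2_new
```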

MASTERING THE CONCEPT

16.6: In multiple regression, we determine whether each added independent variable increases the amount of variance in the dependent variable that we can explain. In stepwise multiple regression, the computer program determines the order in which independent variables are added, whereas in hierarchical multiple regression, the researcher chooses the order. In both cases, however, we report the increase in R² with the inclusion of each new independent variable or variables.

The strength of hierarchical multiple regression is that it is grounded in theory that we can test. In addition, we’re less likely to identify a statistically significant predictor just by chance (a Type I error) because a well-established theory would already have identified most of the predictors as statistically significant. There is a serious weakness associated with hierarchical multiple regression, but it has nothing to do with the technique, its mathematics, or its concept. The problem is the researcher. Sometimes researchers haven’t really thought through which variables are probably at work and why. However, if a researcher has enough information to develop a specific hypothesis, then a hierarchical multiple regression should be used instead of a stepwise multiple regression.

Multiple Regression in Everyday Life

With the development of increasingly powerful computers and the availability of ever-larger amounts of electronic data, tools based on multiple regression have proliferated. Now the general public can access many of them online (Darlin, 2006). Bing Travel (formerly Farecast.com) predicts the price of an airline ticket for specific routes, travel dates, and, most important, purchase dates. Using the same data available to travel agents, along with additional independent variables such as the weather and even which sports teams’ fans might be traveling to a championship game, Bing Travel mimics the regression equations used by the airlines. Airlines predict how much money potential travelers are willing to pay on a given date for a given flight and use these predictions to adjust their fares so they can earn the most money.

Bing Travel is an attempt at an end run, using mathematical prediction tools, to help savvy airline consumers either beat or wait out the airlines’ price hikes. In 2007, Bing Travel’s precursor, Farecast.com, claimed a 74.5% accuracy rate for its predictions. Zillow.com does for real estate what Bing Travel does for airline tickets. Using archival land records, Zillow.com predicts U.S. housing prices and claims to be accurate to within 10% of the actual selling price of a given home.

Another company, Inrix, predicts the dependent variable, traffic, using the independent variables of the weather, traveling speeds of vehicles that have been outfitted with Global Positioning Systems (GPS), and information about events such as rock concerts. It even suggests, via cell phone or in-car navigation systems, alternative routes for gridlocked drivers. Like the future of visual displays of data, the future of the regression equation is limited only by the creativity of the rising generation of behavioral scientists and statisticians.

Next Steps

Structural Equation Modeling (SEM)

Structural equation modeling (SEM) is a statistical technique that quantifies how well sample data “fit” a theoretical model that hypothesizes a set of relations among multiple variables.

A statistical (or theoretical) model is a hypothesized network of relations, often portrayed graphically, among multiple variables.

We’re going to introduce an approach to data analysis that is far more flexible and more visually expressive than multiple regression. Structural equation modeling (SEM) is a statistical technique that quantifies how well sample data “fit” a theoretical model that hypothesizes a set of relations among multiple variables. Here, we are discussing “fit” in much the same way you might try on some new clothes and say, “That’s a good fit” or “That really doesn’t fit!” Statisticians who use SEM refer to the “model” that they are testing. In this case, a statistical (or theoretical) model is a hypothesized network of relations, often portrayed graphically, among multiple variables.

Instead of thinking of variables as “independent” variables or “dependent” variables, SEM encourages researchers to think of variables as a series of connections. Consider an independent variable such as the number of hours spent studying. What predicts how many hours a person will study? We answer that by thinking of hours studying as a dependent variable with its own set of independent variables. An independent variable in one study can become a dependent variable in another study. SEM quantifies a network of relations, so some of the variables are independent variables that predict other variables later in the network, as well as dependent variables that are predicted by other variables earlier in the network. This is why we will refer to variables without the usual adjectives of independent or dependent as we discuss SEM.

Path is the term that statisticians use to describe the connection between two variables in a statistical model.

Path analysis is a statistical method that examines a hypothesized model, usually by conducting a series of regression analyses that quantify the paths between variables at each succeeding step in the model.

Language Alert! In the historical development of SEM, the analyses based on this kind of diagram were called path analyses for the fairly obvious reason that the arrows represented “paths”—factors that lead to whatever the next variable in the model happened to be. Path is the term that statisticians use to describe the connection between two variables in a statistical model. Path analysis is a statistical method that examines a hypothesized model, usually by conducting a series of regression analyses that quantify the paths at each succeeding step in the model.

Path analysis is rarely used these days because the more powerful technique of SEM can better quantify the relations among variables in a model. But we still find the term path to be a more intuitive way to describe the flow of behavior through a network of variables, and the word path continues to be used in structural equation models. SEM uses a statistic much like the correlation coefficient to indicate the relation between any two variables. Like a path through a forest, a path could be small and barely discernible (close to 0) or large and easy to follow (closer to −1.00 or 1.00).
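
As a toy illustration of reading paths, here is a sketch with simulated data (not data from any actual study; the variable names merely anticipate the example discussed below). With standardized variables and a single predictor per step, each path coefficient is simply the correlation between the two variables:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Simulated chain of relations: parenting -> identity -> adjustment
parenting = rng.standard_normal(n)
identity = 0.4 * parenting + rng.standard_normal(n)
adjustment = 0.8 * identity + rng.standard_normal(n)

# Each simple path coefficient here equals the Pearson correlation
print(np.corrcoef(parenting, identity)[0, 1])   # a moderate positive path
print(np.corrcoef(identity, adjustment)[0, 1])  # a strong positive path
```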

Manifest variables are the variables in a study that we can observe and that are measured.

Latent variables are the ideas that we want to research but cannot directly measure.

In SEM, we start with measurements called manifest variables, the variables in a study that we can observe and that are measured. We assess something that we can observe in an attempt to understand the underlying main idea in which we’re interested. In SEM, these underlying ideas are called latent variables, the ideas that we want to research but cannot directly measure. For example, we cannot actually see the latent variable we call “shyness,” but we still try to measure shyness through manifest variables such as self-report scales, naturalistic observations, and reports by others about shy behavior. We use these manifest variables and latent variables to create a model that is depicted in a visual diagram of variables.

Let’s examine one published SEM study. In a longitudinal study, researchers examined whether receiving good parenting at age 17 predicted emotional adjustment at age 26 (Dumas, Lawford, Tieu, & Pratt, 2009). They explored the relations among four latent variables in a sample of 100 adolescents in Ontario, Canada. The latent variables were the following:

  1. Positive parenting, an estimate of whether an adolescent received good parenting as assessed by three self-report manifest variables at age 17
  2. Story ending (a variable called narrative coherent positive resolution by the researchers), an estimate based on stories that participants were asked to tell at age 26 about their most difficult life experience and that included the manifest variables of the positivity of the story’s ending, the negativity of the story’s ending, whether the ending was coherent, and whether the story had a resolution
  3. Identity maturity with respect to religion, politics, and career at age 26, quantified via the manifest variables of achievement (strong identity in these areas), moratorium (still deciding with respect to these areas), and diffusion (weak identity in these areas)
  4. Emotional adjustment, assessed at age 26 by the manifest variables of optimism, depression, and general well-being

Note that higher scores on all four latent variables indicate a more positive outcome. Figure 16-8 depicts the researchers’ model.

Figure 16-8

A SEM Diagram This diagram depicts a structural equation model (SEM). By “reading” the numbers connecting variables, we can begin to understand the story this graph tells. For example, the path between the left-hand circle, “Positive Parenting,” and the upper-right-hand circle, “Identity Maturity,” has a coefficient of 0.41. This indicates a strong, positive predictive relation between the type of parenting an adolescent received at age 17 and the maturity of her or his identity formation at age 26. As another example, the path between the two circles on the right, “Identity Maturity” and “Emotional Adjustment,” has a coefficient of 0.81. This indicates a very strong positive relation between the maturity of one’s identity formation and one’s emotional adjustment at age 26. The two-headed arrow indicates that these constructs are theorized to predict each other.

To understand the story of this model, we need to understand only three components—the four circles, the three or four squares attached to each circle, and the arrows linking the circles. The circles represent the underlying ideas of interest, the latent variables. These are latent variables because positive parenting cannot be directly measured, nor can emotional adjustment be directly “seen.”

The squares above or below each circle represent the measurement tools used to operationalize each latent variable. These are the manifest variables, the ones that can be observed. Let’s look at “Story ending,” the latent variable related to the stories that the 26-year-old participants told about their worst life experiences. There are four boxes above it: “Positivity,” “Negativity,” “Resolution,” and “Coherence.” These refer to the four different measures of the story’s ending that we described above.

Once we understand the latent variables (as measured by the three or more manifest variables), all that’s left for us to understand is the arrows that represent each path. The numbers on each path give us a sense of the relation between each pair of variables. Similar to correlation coefficients, the sign of the number indicates the direction of the relation, either negative or positive, and the value of the number indicates the strength of the relation. Although this is a simplification, these basic rules will allow you to “read” the story in the diagram of the model.

So how does this SEM diagram address the research question about the effects of good parenting? We’ll start on the left and read across. Notice the two paths that lead from positive parenting to story ending and to identity maturity. The numbers are 0.31 and 0.41, respectively. These are fairly large positive numbers. They suggest that receiving positive parenting at age 17 leads to a healthier interpretation of a negative life experience and to a well-established identity at age 26. Characteristics of the narrative that participants told also positively predict identity maturity at a level of 0.35.

Now notice the two-headed vertical arrow between identity maturity and emotional adjustment at age 26 with the number 0.81. This shows that identity maturity and emotional adjustment are strong predictors of each other at age 26. Positive parenting does indeed seem to predict later emotional adjustment (through the variables of how participants interpret negative events in their lives and the degree of maturity in their identity formation). It seems that good parenting helps an adolescent to be more mature, which leads to good emotional adjustment.

When you encounter a structural equation model, follow these basic steps: First, figure out what variables the researcher is studying. Then, look at the numbers to see which variables are related, and look at the signs of the numbers to see the direction of each relation.

CHECK YOUR LEARNING

Reviewing the Concepts

  • Multiple regression is used to predict a dependent variable from more than one independent variable. Ideally, these variables are distinct from one another in such a way that they contribute uniquely to the predictions.
  • We can develop a multiple regression equation and input specific scores for each independent variable to determine the predicted score on the dependent variable.
  • Multiple regression is the backbone of many online tools that we can use for predicting everyday variables such as traffic or home prices.
  • In stepwise multiple regression, a computer program determines the order in which independent variables are tested; in hierarchical multiple regression, the researcher determines the order.
  • Structural equation modeling (SEM) allows us to examine the “fit” of a sample’s data to a hypothesized model of the relations among multiple variables, including latent variables that we hypothesize to exist but cannot directly measure.

Clarifying the Concepts

  • 16-10 What is multiple regression, and what are its benefits over simple linear regression?

Calculating the Statistics

  • 16-11 Write the equation for the line of prediction using the following output from a multiple regression analysis:
  • 16-12 Use the equation for the line you created in Check Your Learning 16-11 to make predictions for each of the following:
    1. X₁ = 40, X₂ = 14
    2. X₁ = 101, X₂ = 39
    3. X₁ = 76, X₂ = 20

Applying the Concepts

  • 16-13 The accompanying computer printout shows a regression equation that predicts GPA from three independent variables: hours slept per night, hours studied per week, and admiration for Pamela Anderson, the B-level actress whom many view as tacky. The data are from some of our statistics classes. (Note: Hypothesis testing shows that all three independent variables are statistically significant predictors of GPA!)
    1. Based on these data, what is the regression equation?
    2. If someone reports that he typically sleeps 6 hours a night, studies 20 hours per week, and has a Pamela Anderson admiration level of 4 (on a scale of 1–7, with 7 indicating the highest level of admiration), what would you predict for his GPA?
    3. What does the negative sign in the slope for the independent variable, level of admiration for Pamela Anderson, tell you about this variable’s predictive association with GPA?

Solutions to these Check Your Learning questions can be found in Appendix D.