Chapter Specifics
• Regression is the name for statistical methods that fit some model to data in order to predict a response variable from one or more explanatory variables.
• The simplest kind of regression fits a straight line on a scatterplot for use in predicting y from x. The most common way to fit a line is the least-squares method, which finds the line that makes the sum of the squared vertical distances of the data points from the line as small as possible.
• The squared correlation r² tells us what fraction of the variation in the responses is explained by the straight-line tie between y and x.
• Extrapolation, or prediction outside the range of the data, is risky because the pattern may be different there. Beware of extrapolation!
• A strong relationship between two variables is not always evidence that changes in one variable cause changes in the other. Lurking variables can create relationships through common response or confounding.
• If we cannot do experiments, it is often difficult to get convincing evidence for causation.
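The least-squares ideas in the bullets above can be sketched in a few lines of code. This is a minimal illustration, not part of the chapter; the x and y values are made up, and the formulas are the standard least-squares slope, intercept, and r² computed from sums of squared deviations.

```python
# Hypothetical example data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# The least-squares slope and intercept minimize the sum of the
# squared vertical distances of the data points from the line.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# r-squared: the fraction of the variation in the responses y
# explained by the straight-line tie between y and x.
syy = sum((yi - y_bar) ** 2 for yi in y)
r_squared = sxy ** 2 / (sxx * syy)

def predict(x_new):
    """Predict y from x using the fitted line."""
    return intercept + slope * x_new
```

An r² near 1 (as it is for these invented points) means the line explains almost all of the variation in y; an r² near 0 warns that predictions from the line are not likely to be accurate.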
In Chapter 14, we used scatterplots and the correlation to explore and describe the relationship between two quantitative variables. In this chapter, we looked carefully at fitting a straight line to data in a scatterplot when there appears to be a straight-line trend, and then we used this line to predict the response from the explanatory variable. In doing this, we have used data to draw conclusions. We assume that the straight line that we fit to our data describes the actual relationship between the response and the explanatory variable and, thus, that conclusions (predictions) about additional values of the response based on other values of the explanatory variable are valid.
Are these conclusions (predictions) justified? The squared correlation provides information about the likelihood of a successful prediction. Small values of the squared correlation suggest that our predictions are not likely to be accurate. Extrapolation is another setting in which our predictions are not likely to be accurate.
Finally, when there is a strong relationship between two variables, it is tempting to draw an additional conclusion: namely, that changes in one variable cause changes in another. However, the case for causation requires more than a strong relationship. Unless our data are produced by a proper experiment, the case for causation is difficult to prove.
CASE STUDY EVALUATED What should we conclude about the Super Bowl Indicator described in the Case Study at the beginning of this chapter? To evaluate the Super Bowl Indicator, answer the following questions.
1. We wrote this Case Study on March 4, 2016, the year in which the Broncos won the Super Bowl. The Super Bowl Indicator predicts stocks should go down in 2016. Did they go down?
2. Stocks went down only 12 times in the 49 years between 1967 and 2015. If you simply predicted “up” every year, how would you have performed?
3. There are 19 original NFL and NFC teams and only 13 AFC teams. How often would you expect “NFL wins” to occur if one assumes that the chance of winning is proportional to the number of teams? How does this compare with simply predicting “up” every year?
4. Write a paragraph, in language that someone who knows no statistics would understand, explaining why the association between the Super Bowl Indicator and stock prices is not surprising and why it would be incorrect to conclude that the Super Bowl outcome causes changes in stock prices.
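Questions 2 and 3 are back-of-envelope calculations, and one way to check your arithmetic is to compute the two baseline rates directly from the counts given above (49 Super Bowl years, 12 down years, 19 original-NFL/NFC teams versus 13 AFC teams). This sketch only computes the baselines; it does not answer the questions for you.

```python
# Counts taken from the Case Study questions above.
years = 49
down_years = 12
up_years = years - down_years

# Question 2 baseline: how often "up every year" is correct.
always_up_accuracy = up_years / years

# Question 3 baseline: expected fraction of "NFL wins" if each
# team's chance of winning is proportional to the number of teams.
nfl_teams, afc_teams = 19, 13
expected_nfl_win_rate = nfl_teams / (nfl_teams + afc_teams)

print(f"'Up' every year is right {always_up_accuracy:.0%} of the time")
print(f"Expected 'NFL wins' rate: {expected_nfl_win_rate:.0%}")
```

Comparing these two rates is the heart of the argument you are asked to make in question 4: a naive rule can look impressive simply because both events it links are individually common.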
Online Resources
• The StatClips video Regression—Introduction and Motivation describes many of the topics in this chapter in the context of an example about hair growth.
• The StatBoards video Beware Extrapolation! discusses the dangers of extrapolation in the context of several examples.