In the previous two sections we were interested in describing the relationship between two quantitative variables using graphical and numerical methods. When graphical and numerical methods suggest that there is a sufficiently strong linear association between the two variables, our goal will be to find a regression line to model the relationship. We will then use this line to predict values of the response variable for given values of the explanatory variable.
Before we look at finding models for our data, let's recall a few key ideas about lines that you learned in algebra. First, every non-vertical line can be written in the form y = a + bx, where a is the y-intercept and b is the slope. Slope is a measure of the steepness and direction of the line, and the y-intercept is the y-value where the line intersects the y-axis.
With real-world problems, the data that we are modeling are almost never perfectly linear, so we will be interested in finding an equation of a line which, in some sense, best represents the data set. In this section, we are interested in finding such a line, one called a regression line. A regression line predicts values of the response variable, y, for given values of the explanatory variable, x. We'll let ŷ (read "y-hat") represent the predicted value of y to distinguish it from the observed value of y. Therefore, we'll write the equation of our regression line in the form ŷ = a + bx.
In section 4.1 we looked at average April temperatures (in degrees Fahrenheit) and latitude data for cities in the northern hemisphere. Figure 4.17 displays a scatter plot of the data, along with a table of values.
This scatter plot shows that there is a very strong negative linear association between latitude and average April temperature for cities in the northern hemisphere, and a correlation coefficient of r= –0.9603 confirms this fact. Because there is such a strong linear relationship between these two variables, we are interested in finding a regression line to model the relationship. Since "average April temperature" is the response variable (y) and “latitude” is the explanatory variable (x), we will find an equation of a line that predicts values of y for given x-values.
Since the points don’t fall on a line, it is not possible to find an equation of a line which passes through all of the data points. We’re trying to do the next best thing, that is, get close to as many points as possible. We could pick two points from the data set, and use the techniques of algebra to determine the line containing those points. Then we could adjust the slope and y-intercept of such a line until we believe that we have obtained a line that represents the pattern of the data well. However, this can be a time-consuming (and not very rewarding) process.
When we refer to "the" regression line for a set of data with a linear trend, we mean the line that "best fits" the data. For statisticians, there is only one "best-fitting" line, and it satisfies a specific mathematical criterion. Finding this line can be done manually, but in practice, the regression line is found using statistical software, such as CrunchIt! For now, we’ll focus on finding the regression line using software, and then interpreting its meaning. Later in the section we’ll describe what "best fitting" means in this context, and look at the mathematics behind the formulas for slope and y-intercept.
For the set of data displayed in Figure 4.17, CrunchIt! reports the regression line as: Avg. April Temp = 96.83 - 1.118 * Latitude.
Notice that CrunchIt! uses the actual names of the explanatory and response variables (the column names in the CrunchIt! table) rather than x and ŷ. This is useful because it serves as a reminder of what we have chosen as explanatory and response. Unfortunately, CrunchIt! doesn't put the little hat over the response variable, so you have to keep reminding yourself that this model is used to find predicted y-values. It does not report (except in the rare and lucky instance when a data point lies on the regression line) the observed y-value for an x-value in the data set.
We will follow the CrunchIt! practice of using the actual variable names in our regression equations, distinguishing the variables themselves with italics. Thus, we can see that this line is indeed of the form ŷ = a + bx, with Avg. April Temp playing the role of ŷ, Latitude playing the role of x, a = 96.83, and b = –1.118. Therefore the y-intercept (Avg. April Temp-intercept) for this line is 96.83 and the slope is –1.118.
The line is graphed on the scatter plot shown in Figure 4.18.
This line fits the data pretty well. Notice that, in this instance, the line doesn’t go through any of the points, but is pretty close to all of them. Let’s use the model to make some predictions.
Toronto, Canada's latitude is 43.667°N. According to the regression model, Avg. April Temp = 96.83 – 1.118*Latitude, what is Toronto's predicted average April temperature? Figure 4.19 shows the scatter plot and regression line, with the red line indicating Toronto's latitude of 43.667°N. Our question, then, is: what is the predicted average April temperature for the point on the regression line whose first coordinate is 43.667?
The regression equation predicts that Avg. April Temp = 96.83 – 1.118*43.667 = 48.0103 degrees Fahrenheit, which is actually a pretty good estimate since the average April temperature in Toronto is 44 degrees Fahrenheit.
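If you would like to reproduce this arithmetic, here is a minimal Python sketch (the variable names are ours, chosen for readability):

```python
# Plugging Toronto's latitude into the regression equation reported above.
intercept = 96.83
slope = -1.118
toronto_latitude = 43.667   # degrees north

predicted_temp = intercept + slope * toronto_latitude
print(round(predicted_temp, 4))   # 48.0103 degrees Fahrenheit
```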
What do the values of the slope and y-intercept tell us about the relationship between latitude and temperature? Since a y-intercept is the value of y when x = 0, in the context of this problem, this means that if a city lies on the equator, the model predicts that its average April temperature will be 96.82503 degrees Fahrenheit.
The slope, b, of any line y = a + bx can be interpreted as follows: for every one-unit increase in x, y changes by b units. In the context of this problem, this means that for every one-degree increase in latitude, the average April temperature changes by –1.118 units. Put more naturally, for each additional degree of latitude, the average April temperature decreases by 1.118 degrees Fahrenheit. So for roughly every 4 degrees north you travel, the average April temperature drops by about 4.5 degrees Fahrenheit.
In any model the slope can (and should) be interpreted in the context of the problem. Often, though, the y-intercept interpretation is not useful. We'll see why shortly.
When working with "real world" data it is extremely unlikely that the data points fall in a perfectly linear pattern. Therefore, when we use a linear regression equation to model the data, prediction error will be present. For a given data value of x, we define its residual to be the difference between its observed value y and its predicted value ŷ; that is, residual = y − ŷ.
Let’s look again at the graph of the data and the regression line that we saw previously in Figure 4.18.
The latitude for Ithaca, NY is 42.44°N. The actual average temperature for April is 45 degrees Fahrenheit. The model predicts that the average temperature is 49.37 degrees Fahrenheit, which is an overestimate. The residual for Ithaca, NY is therefore 45 − 49.37 = −4.37 degrees Fahrenheit.
Next locate the data point for Anchorage, Alaska (latitude 61.167°N) in Figure 4.20. Since the regression line lies below this point, we see that the model is underestimating the average April temperature in Anchorage. The model predicts that the average April temperature will be 28.43 degrees Fahrenheit. The actual temperature is 36 degrees, so the residual for this data point is 36 − 28.43 = 7.57 degrees Fahrenheit.
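Both residual calculations can be reproduced with a short Python sketch; the latitudes and observed temperatures come from the discussion above, and the coefficients are CrunchIt!'s full-precision values:

```python
# Residual = observed value - predicted value, using the full-precision
# coefficients reported by CrunchIt! (96.82503 and -1.118155).
def predicted_april_temp(latitude):
    return 96.82503 - 1.118155 * latitude

for city, latitude, observed in [("Ithaca, NY", 42.44, 45),
                                 ("Anchorage, AK", 61.167, 36)]:
    residual = observed - predicted_april_temp(latitude)
    print(city, round(residual, 2))
# Ithaca, NY -4.37
# Anchorage, AK 7.57
```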
The following graph, called a residuals plot, displays the residuals for each of the ten data points, with each x-coordinate being the latitude of the city, and the corresponding y-coordinate the prediction error.
The residuals are of varying size; some are positive, while others are negative. The sum of the residuals is zero, so the mean residual is also zero. The residual plot shows how far above or how far below the regression line a particular data point lies. If a data point is above the regression line, its corresponding point in the residual plot lies above the y = 0 line, and by exactly the same vertical distance (and similarly for points lying below the regression line).
In the latitude/temperature example, we saw that CrunchIt! used Avg. April Temp = 96.82503 - 1.118155*Latitude as the regression line. Why this line? The software reported what is known as the "least squares regression equation."
Recall that we are looking for the line that gets "closest to" the (not perfectly linear) data points. Since the sum of the residuals is always 0, negative residuals are "counteracting" positive residuals. To measure how far off our line is, we look at the square of each residual instead.
This accomplishes two goals. First, these squares are all non-negative, so their sum will not be zero. (A particular residual can be zero, if the data point happens to lie on the line, but since the data are not perfectly linear, the residuals cannot all be zero.) Second, if a point is close to the regression line, its residual will be small in absolute value, and its square will be relatively small. On the other hand, a residual with a large absolute value will produce a large squared value. For our purposes, a line that is close to lots of points is better than one that goes exactly through a couple, but misses others by large vertical distances.
For any data set, we can compute the sum of the squared residuals for any particular line. But we don't want just any line; rather, we want the line which minimizes the sum of the squared residuals, Σ(y − ŷ)². This is the line that makes "least" the "squares," and the line we will call the best-fitting line.
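To make the criterion concrete, here is a minimal Python sketch of a function that computes the sum of squared residuals for any candidate line; the three data points in it are made up purely for illustration:

```python
# Sum of squared residuals for a candidate line y-hat = a + b*x.
def sum_of_squared_residuals(points, a, b):
    return sum((y - (a + b * x)) ** 2 for x, y in points)

# Hypothetical points, purely for illustration:
points = [(1, 3), (2, 5), (3, 6)]
print(sum_of_squared_residuals(points, a=1.5, b=1.5))   # 0.25
```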
Let’s examine this idea with a very small data set, one with nice values. Figure 4.22 gives the data and its scatter plot.
Suppose that we find a regression equation using the first two data points. Then we get the model ŷ = 4x, whose graph is shown on the scatter plot in Figure 4.23.
In Table 4.6, we calculate the squared residuals for this line.
So the sum of the squared residuals is 256. It is clear from the plot that this line is too steep, and misses the third point by a good distance. We certainly should be able to find a smaller sum of the squared residuals. The line that goes through the first and third data points is ŷ = 2x + 4. Its graph and the corresponding table of squared residuals are shown in Figure 4.24.
We see that the sum of the squared residuals, 36, is much smaller than that of the first line we used. The line shown in Figure 4.24 seems to have a pretty good slope; perhaps if we moved it up a bit, we would decrease the sum of the squared residuals even further. Table 4.7 shows several lines, each with slope 2 but different y-intercepts, along with each line's sum of squared residuals.
The best of these three lines has the smallest sum of squared residuals we have found so far, but how do we know we can't find a line whose sum is even smaller? The beauty of the least squares regression line is that it guarantees us the smallest possible sum of squared residuals. And, as we previously indicated, we'll use software exclusively to find the least squares linear model.
For the (very small) data set for which we have been creating models, the least squares regression equation is ŷ = 6.6939 + 1.8776x, with slope and y-intercept each rounded to 4 decimal places. The sum of squared residuals for this model is 23.5102. While a, b, and the sum of squared residuals are all close to those of the best line we found above, we would be unlikely to stumble across the least squares model by trial and error. So we are happy to rely on software to report the equation.
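Because Figure 4.22 is not reproduced here, the Python sketch below uses three points that we have reconstructed to be consistent with every value quoted in this example (the correlation of 0.9113 and the sums of squared residuals 256, 36, and 23.5102); treat them as illustrative rather than as the official data table.

```python
# Reconstructed stand-ins for the data in Figure 4.22; they reproduce
# the numbers quoted in the text (r = 0.9113, SSE of 256, 36, 23.5102).
points = [(2, 8), (5, 20), (10, 24)]

def sse(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)

print(sse(a=0, b=4))                      # 256 -- line through the first two points
print(sse(a=4, b=2))                      # 36  -- line through the first and third points
print(sse(a=6, b=2))                      # 24  -- a slope-2 line moved up a bit does better
print(round(sse(a=6.6939, b=1.8776), 4))  # 23.5102 -- the least squares line
```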
However, in case you are wondering how the line is found, here are the formulas that determine a and b for the least squares regression equation ŷ = a + bx. The slope is b = r·(s_y/s_x), and the y-intercept is a = ȳ − b·x̄, where x̄ and s_x are the mean and standard deviation of the x-values, ȳ and s_y are the mean and standard deviation of the y-values, and r is the correlation coefficient.
For the example above, we find that r = 0.9113, x̄ = 5.667, s_x = 4.0415, ȳ = 17.3333, and s_y = 8.3267. Then the appropriate values of b and a are
b = 0.9113·(8.3267/4.0415) ≈ 1.87755, which rounds to 1.8776, and
a = 17.3333 − (1.87755)(5.667) ≈ 6.6932,
which agree to three decimal places with the values found by software. (The difference in the fourth decimal place for a can be explained by our using rounded values in our calculations.)
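Here is the same calculation as a minimal Python sketch, using only the summary statistics listed above:

```python
# Least squares slope and intercept from the summary statistics
# quoted above (correlation, means, standard deviations).
r = 0.9113
x_bar, s_x = 5.667, 4.0415     # mean and standard deviation of x
y_bar, s_y = 17.3333, 8.3267   # mean and standard deviation of y

b = r * s_y / s_x              # slope
a = y_bar - b * x_bar          # y-intercept
print(round(b, 4), round(a, 4))
# 1.8776 6.6932 -- the intercept differs slightly from the software
# value (6.6939) because the inputs above are rounded.
```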
In general, the reason for using linear regression is to predict y-values in situations where the actual y-values are unknown, and it is sensible to wonder how good our linear model is at predicting these unknown y-values. A logical starting place to investigate this question is to examine how well the model predicts the y-values that we do know.
We have already seen that the correlation coefficient measures the strength and the direction of the linear relationship between our two variables. It turns out that the square of the correlation coefficient, r² (sometimes written as R²), gives us more specific information about how well the least squares equation predicts our known y-values.
r² tells us the fraction of the variability in the observed y-values that can be explained by the least squares regression. Because r is a number between –1 and 1 inclusive, r² is between 0 and 1 inclusive. And because r² represents a fraction of the total variability in the y's, we generally convert it to a percent. So another way to interpret r² is to say that it represents the percent of the variability in the y-values that is explained by the variability of the x's, according to our linear regression model.
The language here is a little tricky, and the meaning of r² can be somewhat confusing. Consider a set of data that is perfectly linear and with positive slope. It has a correlation coefficient of 1, so r² is also 1. This means that all of the variability in the y-values (100% of it) can be explained by the linear model. That is, the model predicts exactly how the y-value changes for a given change in x.
Recalling that for our latitude/temperature example, r was –0.9603, we determine that r² = 0.9222. How would we interpret this value in the context of the situation? We would say that about 92% of the variation in the average April temperatures can be explained by the linear regression model. Or, we could say that for our least squares regression equation, about 92% of the variability in the average April temperatures can be explained by the variability in the cities' latitudes.
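As a quick Python check of this arithmetic (the variable names are ours):

```python
# Squaring the correlation coefficient from the latitude/temperature example.
r = -0.9603
r_squared = r ** 2
print(round(r_squared, 4))   # 0.9222
print(f"{r_squared:.0%}")    # 92% of the variability in the temperatures explained
```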
That's quite a large percentage to be accounted for by the model, and we can't always expect such good results from real data. Depending on the situation, a researcher might consider even an r² smaller than 50% as indicating a linear regression model worth using.
The importance of making a scatter plot of the data can be seen when we examine the effect of outliers on a linear regression model. The scatter plot shown in Fig. 4.25 displays a weak positive linear relationship (r = 0.2365), and we see quite a bit of scatter around the regression line. In this case, the linear relationship accounts for only about 5% of the variability in the y's (since r² = 0.0559).
What happens if we add an outlier to this scatter plot? As we see in Fig. 4.26, the outlier makes the data appear more linear, and indeed the correlation coefficient is now 0.6715. This makes it seem as though a linear model is much more appropriate in this case. But is it really? If this were a set of real data, we would want to check and see whether that point represents actual data values or errors in data collection or entry. And even if the values are correct, linear regression may not be appropriate.
Figures 4.27 and 4.28 show two additional scatter plots. In Fig. 4.27, the outlier is in the y-direction only. While it decreases the correlation coefficient to 0.1631, it has little effect on the original regression line. In Fig. 4.28, the outlier is in the x-direction only. Here r is 0.2359, similar to the original value, and once again, the regression line is close to the original.
It is critical to remember that plotting the data is essential. It is not sufficient to rely on the correlation coefficient alone to determine whether linear regression is appropriate. In statistics it is always wise to "look before you leap."
As we have indicated, the usefulness of linear regression lies in our ability to use the model to make predictions when we don't have observed values of x and y. Recall that we substituted Toronto's latitude (which we did know) into our linear model to predict a value for its average April temperature (which we didn't know). We obtained a reasonable estimate for this temperature (about 48°F) because Toronto's latitude is similar to the latitudes of the other cities upon which we built our model.
Making predictions using values of the explanatory variable that lie within the range of the given x-values generally produces y-values that are appropriate estimates. On the other hand, making predictions using values of the explanatory variable outside the range of the given data is called extrapolation. The linear regression model is a mathematical equation into which you can substitute any number you like. However, the model assumes that the trend displayed in the given data continues indefinitely, which may not be the case at all.
We could use our roller coaster regression model Speed = 26.2547 + 0.2326*Drop to predict the speed of a roller coaster whose drop is 800 feet (or 8000 feet for that matter). For a drop of 800 feet, we would get a predicted speed of about 212 miles per hour. But 800 feet is nearly twice as large a drop as that for Kingda Ka, which has the largest drop (418 feet) in our data set. There is no reason to believe that our linear model is reliable for that large a drop, or that it has produced a reasonable estimate. This is the same issue we faced when we tried to interpret the y-intercept for this linear model. A vertical drop of 0 is outside the range of the given data, so the y-intercept value does not have a sensible practical meaning.
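A short Python sketch makes the contrast numerically vivid; the drops of 418, 800, and 8,000 feet all come from the discussion above:

```python
# Extrapolating with the roller coaster model quoted in the text.
# Only drops up to 418 feet (Kingda Ka) appear in the data set.
def predicted_speed(drop_feet):
    return 26.2547 + 0.2326 * drop_feet

for drop in (418, 800, 8000):
    print(drop, round(predicted_speed(drop), 1))
# 418 123.5   -- within the range of the data
# 800 212.3   -- extrapolation; not to be trusted
# 8000 1887.1 -- extrapolation taken to absurdity
```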
It is common, even if somewhat dangerous, to extrapolate using the linear regression equation. However, as you use x-values farther away from your data values, your estimates become less reliable. Use caution when extrapolating—don’t go very far beyond the given x-values, and exercise a bit of skepticism when reading the results of others’ extrapolation.
In addition, remember that the predictions in the model go only one way—we use values of x to predict values of y, not vice versa. While correlation remains the same when you switch the roles of the explanatory and response variables, the regression model changes when you make the same switch. It is important to decide before you start which variable you want to predict, and make that variable the response variable.
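To see numerically that the fitted line changes when the roles are switched, here is a small Python sketch using the summary statistics from the three-point example earlier in this section; note that the slope for predicting x from y is not simply the reciprocal of the slope for predicting y from x:

```python
# Switching explanatory and response changes the regression line.
r = 0.9113
x_bar, s_x = 5.667, 4.0415
y_bar, s_y = 17.3333, 8.3267

slope_y_on_x = r * s_y / s_x       # predicts y from x: about 1.878
slope_x_on_y = r * s_x / s_y       # predicts x from y: about 0.442
print(round(slope_y_on_x, 3), round(slope_x_on_y, 3))
print(round(1 / slope_y_on_x, 3))  # 0.533 -- not the same as slope_x_on_y
```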
Finally, even if everything goes smoothly—you have a reasonably strong linear relationship, your scatter plot indicates that linear regression is appropriate, and you are not extrapolating (or at least not too far out)—you still must be careful about the conclusions that you draw. In the best of circumstances, correlation and regression suggest that a relationship between two variables exists. They do not imply that the relationship involves cause and effect.
Linear regression is a widely used and effective statistical technique. Linear models are easy to understand, and making predictions with them is straightforward. It is important to remember, however, that linear regression is not appropriate in all situations. You must be careful about when you use linear regression, and how you interpret the results.