4 Correlation and Regression

4.3 Further Topics in Regression Analysis

OBJECTIVES By the end of this section, I will be able to …

Calculate the sum of squares error (SSE), and use the standard error of the estimate s as a measure of a typical prediction error.
Describe how total variability, prediction error, and improvement are related to the total sum of squares (SST), the sum of squares error (SSE), and the sum of squares regression (SSR).
Explain the meaning of the coefficient of determination $r^{2}$ as a measure of the usefulness of the regression.

In Section 4.2, we were introduced to regression analysis, which uses an equation to approximate the linear relationship between two quantitative variables. Here in Section 4.3, we learn some further topics that will enable us to better apply the tools of regression analysis for a deeper understanding of our data.


1 Sum of Squares Error (SSE) and Standard Error of the Estimate $s$

Table 6 shows the results for 10 student subjects who were given a set of short-term memory tasks to perform within a certain amount of time. These tasks included memorizing nonsense words and random patterns. Later, the students were asked to repeat the words and patterns, and the students were scored according to the number of words and patterns memorized and the quality of their memories. Partially remembered words and patterns were given partial credit, so the score was a continuous variable. Figure 31 displays the scatterplot of $y = \text{score}$ versus $x = \text{time}$, together with the regression line $\hat{y} = 2 x + 7$, as calculated by Minitab.


FIGURE 31 Scatterplot with regression line.

Minitab regression results (excerpt).

Table 6 Results of short-term memory test

Student	Time to memorize (in minutes) $(x)$	Short-term memory score $(y)$
1	1	9
2	1	10
3	2	11
4	3	12
5	3	13
6	4	14
7	5	19
8	6	17
9	7	21
10	8	24

In Section 4.2, we learned that the difference $y - \hat{y}$ represents the prediction error, or residual, between the actual data value $y$ and the predicted value $\hat{y}$. For example, for a student who is given $x = 5$ minutes to study, the predicted score is $\hat{y} = 2(\text{time}) + 7 = 2(5) + 7 = 17$.

For Student 7, who was given 5 minutes to study and got a score of 19, the prediction error is $y - \hat{y} = 19 - 17 = 2$ .

We can calculate the prediction errors for every student who was tested. If we wish to use the regression to make useful predictions, we want to keep all our prediction errors small. To measure the prediction errors, we calculate the sum of squared prediction errors, or more simply, the sum of squares error (SSE):

Sum of Squares Error (SSE)

$SSE = \sum {(y - \hat{y})}^{2} = \sum {(\text{prediction error})}^{2} = \sum {(\text{residual})}^{2}$

We want our prediction errors to be small; therefore, we want SSE to be as small as possible.

Least-Squares Criterion

The least-squares criterion states that the regression line will be the line for which the SSE is minimized. That is, out of all possible straight lines, the least-squares criterion chooses the line with the smallest SSE to be the regression line.
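
The criterion can be illustrated numerically. The following is a minimal Python sketch, not part of the original Minitab analysis: it computes the SSE of the line $\hat{y} = 2 x + 7$ for the Table 6 data and compares it with a few competing lines. The helper name `sse` and the list `alternatives` are hypothetical, chosen only for illustration.

```python
# Memory-test data from Table 6
x = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]        # time to memorize (minutes)
y = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]  # short-term memory score

def sse(slope, intercept):
    """Sum of squared residuals for the line y-hat = slope * x + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# SSE for the regression line reported in the text
sse_regression = sse(2, 7)
print("regression line:", sse_regression)

# A few competing lines (hypothetical, for illustration only)
alternatives = [(2, 6), (2, 8), (1.9, 7), (2.1, 7), (1.5, 9)]
for slope, intercept in alternatives:
    print(f"slope={slope}, intercept={intercept}: SSE={sse(slope, intercept):.2f}")
```

Each competing line should report a larger SSE than the regression line, which is what the least-squares criterion leads us to expect. Checking a handful of lines is only an illustration, not a proof.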


EXAMPLE 12 Calculating SSE, the sum of squares error

a. Construct a scatterplot of the memory score data, indicating each residual.
b. Calculate SSE for the memory score data.

Solution

a. The brackets (}) in the scatterplot in Figure 32 indicate the residual for each student’s score. The quantities represented by these brackets are the residuals $y - \hat{y}$.
b. Table 7 shows the $\hat{y}$-values and residuals for the data in Table 6. The SSE is then found by squaring each residual and taking the sum. Thus,

$SSE = \sum {(y - \hat{y})}^{2} = 12$

We know that $\hat{y} = 2 x + 7$ is the regression line, according to the least-squares criterion, so no other possible straight line would result in a smaller SSE.

FIGURE 32 Scatterplot showing the prediction errors or residuals $y - \hat{y}$ .

Table 7 Calculation of the SSE for the short-term memory test example

Student	Time $(x)$	Actual score $(y)$	Predicted score $(\hat{y} = 2 x + 7)$	Residual $(y - \hat{y})$	(Residual)² ${(y - \hat{y})}^{2}$
1	1	9	9	0	0
2	1	10	9	1	1
3	2	11	11	0	0
4	3	12	13	−1	1
5	3	13	13	0	0
6	4	14	15	−1	1
7	5	19	17	2	4
8	6	17	19	−2	4
9	7	21	21	0	0
10	8	24	23	1	1
				$SSE = Σ {(y - \hat{y})}^{2} = 12$
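
The columns of Table 7 can be reproduced with a short Python sketch. It assumes the regression line $\hat{y} = 2 x + 7$ from the text; the variable names are illustrative.

```python
# Memory-test data from Table 6
x = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]
y = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]

# Predicted scores from the regression line y-hat = 2x + 7
predicted = [2 * xi + 7 for xi in x]

# Residual = actual score minus predicted score
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]
squared_residuals = [r ** 2 for r in residuals]

# SSE is the sum of the squared residuals
sse = sum(squared_residuals)

print("x\ty\ty-hat\tresid\tresid^2")
for row in zip(x, y, predicted, residuals, squared_residuals):
    print(*row, sep="\t")
print("SSE =", sse)
```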

NOW YOU CAN DO

Exercises 11a–22a.

A useful interpretive statistic is $s$ , the standard error of the estimate. The formula for $s$ follows.


Standard Error of the Estimate $s$

$s = \sqrt{\frac{SSE}{n - 2}}$

Don’t confuse this use of the $s$ notation for the standard error of the estimate with the use of the $s$ notation for the sample standard deviation.

The standard error of the estimate gives a measure of the typical residual. That is, $s$ is a measure of the size of the typical prediction error, which is the typical difference between the predicted value of $y$ and the actual observed value of $y$ . If the typical prediction error is large, then the regression line may not be useful.

EXAMPLE 13 Calculating and interpreting $s$ , the standard error of the estimate

Calculate and interpret the standard error of the estimate $s$ for the memory score data.

Solution

$SSE = 12$ and $n = 10$ , so

$s = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{12}{8}} \approx 1.2247$

Thus, the typical error in prediction is 1.2247 points. In other words, if we know the amount of time $(x)$ a given student spent memorizing, then our estimate of the student’s score on the short-term memory test will typically differ from the student’s actual score by only 1.2247 points.

Note: Here, we are rounding $s = 1.2247$ for reporting purposes. However, when we use $s$ for calculating other quantities later, we will not round until the last calculation.
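
The calculation in Example 13 can be checked with a few lines of Python. This is a minimal sketch that takes $SSE = 12$ and $n = 10$ as given and, in keeping with the note above, rounds only when printing.

```python
import math

sse = 12   # sum of squares error from Example 12
n = 10     # number of students

# Standard error of the estimate: s = sqrt(SSE / (n - 2))
s = math.sqrt(sse / (n - 2))

# Keep full precision in s; round only for reporting
print(f"s = {s:.4f}")
```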

NOW YOU CAN DO

Exercises 11b–22b.

2 SST, SSR, and SSE

The least-squares criterion guarantees that the value of $SSE = 12$ that we found in Example 12 is the smallest possible value for SSE, given the data in Table 6. However, this guarantee in itself does not tell us that the regression is useful. For the regression to be useful, the prediction error (and therefore SSE) must be small. But we cannot yet tell whether the value of $SSE = 12$ is indeed small, because we have nothing to compare it to.

Suppose for a moment that we want to estimate short-term memory scores, but we have no knowledge of the amount of time $(x)$ for memorization. Then the best estimate for $y$ is simply $\bar{y} = 15$ , the mean of the sample of short-term memory test scores. The graph of $\bar{y} = 15$ is the horizontal line in Figure 33.

FIGURE 33 Comparing $(y - \hat{y})$ and $(y - \bar{y})$.


In general, the data points are closer to the regression line than they are to the horizontal line $\bar{y} = 15$ , indicating that the errors in prediction are smaller when using the regression equation. Consider Student 10, who had a short-term memory score of $y = 24$ after memorizing for $x = 8$ minutes. Using $\bar{y} = 15$ as the estimate, the error for Student 10 is

$(y - \bar{y}) = 24 - 15 = 9$

This error is shown in Figure 33 as the vertical distance $(y - \bar{y})$.

Suppose we found this value $(y - \bar{y})$ for every student in the data set and summed the squared $(y - \bar{y})$ , just as we did for the $(y - \hat{y})$ when finding SSE. The resulting statistic is called the total sum of squares (SST) and is a measure of the total variability in the values of the $y$ variable:

$SST = \sum {(y - \bar{y})}^{2}$

Developing Your Statistical Sense

Relationship Between SST and the Variance of the $y$ 's

Note that SST ignores the presence of the $x$ information; it is simply a measure of the variability in $y$. Recall (see page 133) that the variance of a sample of $y$-values is given by $s^{2} = \sum {(y - \bar{y})}^{2} / (n - 1)$. Thus,

$SST = (n - 1) s^{2}$

Thus, SST is proportional to the variance of the $y$ ’s and, as such, is a measure of the variability in the $y$ data.

EXAMPLE 14 Calculating SST, the total sum of squares, in two ways

Calculate SST, the total sum of squares, for the memory score data in two ways:

a. By using Table 8
b. By using the fact that the sample variance of the score data (the $y$-values) equals $25 \frac{1}{3}$

Solution

a. Table 8 shows the values for $(y - \bar{y}) = (y - 15)$ for the data in Table 7. Thus, $SST = \sum {(y - \bar{y})}^{2} = 228$.
b. When we are given the variance of $y$, we may calculate SST as follows:

$SST = (n - 1) s^{2} = (10 - 1) (25 \frac{1}{3}) = 228$

Table 8 Calculation of SST

Student	Score $(y)$	$(y - \bar{y})$	${(y - \bar{y})}^{2}$
1	9	−6	36
2	10	−5	25
3	11	−4	16
4	12	−3	9
5	13	−2	4
6	14	−1	1
7	19	4	16
8	17	2	4
9	21	6	36
10	24	9	81
		$SST = Σ {(y - \bar{y})}^{2} = 228$
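
Both routes to SST in Example 14 can be checked in Python. The sketch below uses the standard library's `statistics.variance`, which computes the sample variance with the $n - 1$ divisor; the variable names are illustrative.

```python
import statistics

# Memory scores (the y-values) from Table 6
y = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]
n = len(y)
y_bar = statistics.mean(y)   # mean score, 15

# Route 1: sum the squared deviations from the mean, as in Table 8
sst_direct = sum((yi - y_bar) ** 2 for yi in y)

# Route 2: SST = (n - 1) * s^2, using the sample variance of y
sample_variance = statistics.variance(y)
sst_from_variance = (n - 1) * sample_variance

print("SST (direct)        =", sst_direct)
print("sample variance     =", round(sample_variance, 4))
print("SST (from variance) =", round(sst_from_variance, 4))
```

The second route goes through floating-point arithmetic, so the two results should agree up to rounding rather than digit for digit.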

NOW YOU CAN DO

Exercises 11c–22c.


Consider Figure 33 once again. For Student 10, note that the error in prediction when ignoring the $x$ data is $(y - \bar{y}) = 9$ , while the error in prediction when using the regression equation is $(y - \hat{y}) = 1$ . (Recall that $\hat{y} = 2 (8) + 7 = 23$ because Student 10’s time is $x = 8$ .) The amount of improvement (that is, the amount by which the prediction error is diminished) is the difference between $\hat{y}$ and $\bar{y}$ :

$(\hat{y} - \bar{y}) = 23 - 15 = 8$

Once again, we can find $(\hat{y} - \bar{y})$ for each observation in the data set, square them, and sum the squared results to obtain $\sum {(\hat{y} - \bar{y})}^{2}$. The resulting statistic is SSR, the sum of squares regression.

$SSR = \sum {(\hat{y} - \bar{y})}^{2}$

SSR measures the amount of improvement in the accuracy of our estimates when using the regression equation compared with relying only on the $y$ -values and ignoring the $x$ information. Note in Figure 33 that the distance $(y - \bar{y})$ is the same as the sum of the distances $(\hat{y} - \bar{y})$ and $(y - \hat{y})$ . It can be shown, by using algebra, that the following also holds true.

Relationship Among SST, SSR, and SSE

$SST = SSR + SSE$

Note: None of these sums of squares can ever be negative.
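
The algebra behind this relationship can be sketched as follows. The sketch assumes that $\hat{y} = b_{0} + b_{1} x$ is the least-squares regression line (the symbol $b_{0}$ for the intercept is introduced here for illustration); the relationship does not hold for an arbitrary line. Write each deviation as $(y - \bar{y}) = (\hat{y} - \bar{y}) + (y - \hat{y})$, square both sides, and sum over the data set:

$\sum {(y - \bar{y})}^{2} = \sum {(\hat{y} - \bar{y})}^{2} + \sum {(y - \hat{y})}^{2} + 2 \sum (\hat{y} - \bar{y}) (y - \hat{y})$

For the least-squares line, the residuals satisfy $\sum (y - \hat{y}) = 0$ and $\sum x (y - \hat{y}) = 0$, and the intercept satisfies $b_{0} = \bar{y} - b_{1} \bar{x}$, so that $(\hat{y} - \bar{y}) = b_{1} (x - \bar{x})$. The cross-product term is therefore

$2 \sum (\hat{y} - \bar{y}) (y - \hat{y}) = 2 b_{1} \left[ \sum x (y - \hat{y}) - \bar{x} \sum (y - \hat{y}) \right] = 0$

which leaves $SST = SSR + SSE$.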

If any two of these sums of squares are known, the third can also be calculated, as shown in the following example.

EXAMPLE 15 Using SST and SSE to find SSR

Use SST and SSE to find the value of SSR for the data from Examples 12–14.

Solution

From Example 12, we have $SSE = 12$ , and from Example 14 we have $SST = 228$ . That leaves us with just one unknown in the equation $SST = SSR + SSE$ , so we can solve for the unknown SSR:

$SSR = SST - SSE = 228 - 12 = 216$
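
As a check on Example 15, SSR can also be computed directly from its definition and compared with $SST - SSE$. The following Python sketch assumes the regression line $\hat{y} = 2 x + 7$ from the text; the variable names are illustrative.

```python
# Memory-test data from Table 6
x = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]
y = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]

n = len(y)
y_bar = sum(y) / n                      # mean score
predicted = [2 * xi + 7 for xi in x]    # y-hat = 2x + 7

# Total variability, prediction error, and improvement
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - yhat) ** 2 for yi, yhat in zip(y, predicted))
ssr_direct = sum((yhat - y_bar) ** 2 for yhat in predicted)

# SSR obtained by subtraction, as in Example 15
ssr_by_subtraction = sst - sse

print("SST =", sst, " SSE =", sse)
print("SSR (direct)         =", ssr_direct)
print("SSR (by subtraction) =", ssr_by_subtraction)
```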

NOW YOU CAN DO

Exercises 11d–22d.

3 Coefficient of Determination $r^{2}$

SSR represents the amount of variability in the response variable that is accounted for by the regression equation, that is, by the linear relationship between $y$ and $x$. SSE represents the amount of variability in $y$ that is left unexplained after accounting for the relationship between $x$ and $y$ (including random error). We know that SST represents the sum of SSR and SSE; therefore, it makes sense to consider the ratio of SSR and SST, which is called the coefficient of determination $r^{2}$.

The coefficient of determination $r^{2} = SSR/SST$ measures the goodness of fit of the regression equation to the data. We interpret $r^{2}$ as the proportion of the variability in $y$ that is accounted for by the linear relationship between $y$ and $x$ . The values that $r^{2}$ can take are $0 \leq r^{2} \leq 1$ . Note that the coefficient of determination $r^{2}$ is the square of the correlation coefficient $r$ . Thus, $\pm \sqrt{r^{2}} = r$ , the correlation coefficient.


EXAMPLE 16 Calculating and interpreting the coefficient of determination $r^{2}$

Calculate and interpret the value of the coefficient of determination $r^{2}$ for the memory score data.

Solution

From Example 14, we have $SST = 228$, and from Example 15 we have $SSR = 216$. Thus,

$r^{2} = \frac{SSR}{SST} = \frac{216}{228} \approx 0.9474$

Therefore, 94.74% of the variability in the memory test score $(y)$ is accounted for by the linear relationship between score $(y)$ and the time given for study $(x)$ .
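
The same arithmetic can be sketched in Python, taking $SST = 228$ and $SSR = 216$ as given from the earlier examples. The name `unexplained` is illustrative; it is the share of variability left over, $SSE/SST$.

```python
sst = 228          # total sum of squares (Example 14)
ssr = 216          # sum of squares regression (Example 15)
sse = sst - ssr    # sum of squares error, since SST = SSR + SSE

# Coefficient of determination: proportion of variability in y
# accounted for by the linear relationship with x
r_squared = ssr / sst

# Proportion of variability left unexplained
unexplained = sse / sst

print(f"r^2         = {r_squared:.4f}")
print(f"unexplained = {unexplained:.4f}")
```

The two proportions sum to 1, which mirrors the relationship $SST = SSR + SSE$.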

NOW YOU CAN DO

Exercises 11e–22e.

What Does This Number Mean?

What does the value of $r^{2} \approx 0.9474$ mean? Consider that the memory test scores have a certain amount of variability: some scores are higher than others. In addition to the amount of time $(x)$ given for memorizing, there may be several other factors that might account for variability in the scores, such as the memorizing ability of the students, how much sleep the students had, and so on. However, $r^{2} \approx 0.9474$ indicates that 94.74% of this variability in memory scores $(y)$ is explained by the single factor “amount of time given for study” $(x)$. All other factors, including factors such as amount of sleep, account for only $100\% - 94.74\% = 5.26\%$ of the variability in the memory test scores.

Suppose that the regression equation was a perfect fit to the data, so that every observation lies exactly on the regression line. No errors in prediction would occur; therefore, SSE would equal 0, which would imply that

$SST = SSR + 0 = SSR$

In this case, because $SST = SSR$, we have

$r^{2} = \frac{SSR}{SST} = \frac{SST}{SST} = 1$

Conversely, if $SSR = 0$, then no improvement at all is gained by using the regression equation. That is, the regression equation accounts for no variability at all, and $r^{2} = 0 / SST = 0$.

The closer the value of $r^{2}$ is to 1, the better the fit of the regression equation to the data set. A value near 1 indicates that the regression equation fits the data extremely well. A value near 0 indicates that the regression equation fits the data extremely poorly.

Recall from Section 4.1 that the correlation coefficient $r$ is given by

$r = \frac{\sum (x - \bar{x}) (y - \bar{y})}{(n - 1) s_{x} s_{y}}$

where $s_{x}$ and $s_{y}$ represent the sample standard deviation of the $x$ data and the $y$ data, respectively. We can express the correlation coefficient $r$ as

$r = \pm \sqrt{r^{2}}$

where $r^{2}$ is the coefficient of determination. The correlation coefficient $r$ takes the same sign as the slope $b_{1}$ . If the slope $b_{1}$ of the regression equation is positive, then $r = \sqrt{r^{2}}$ ; if the slope $b_{1}$ of the regression equation is negative, then $r = - \sqrt{r^{2}}$ .


EXAMPLE 17 Calculate the correlation coefficient using $r^{2}$

Use $r^{2}$ to calculate the value of the correlation coefficient $r$ for the memory score data.

Solution

The slope $b_{1} = 2$ , which is positive, tells us that the sign of the correlation coefficient $r$ is positive. Thus,

$r = \sqrt{r^{2}} = \sqrt{0.9474} \approx 0.9733$

Therefore, student scores on the short-term memory test are strongly positively correlated with the amount of time allowed for memorization.
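
The two routes to $r$ can be compared in Python. The sketch below computes $r$ from the Section 4.1 formula using the standard library's `statistics.stdev` (the sample standard deviation), and again as the square root of $r^{2} = 216/228$ with the sign of the slope $b_{1} = 2$ attached. The variable names are illustrative.

```python
import math
import statistics

# Memory-test data from Table 6
x = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]
y = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Route 1: correlation coefficient from the Section 4.1 formula
cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
r_formula = cross_products / ((n - 1) * s_x * s_y)

# Route 2: square root of r^2, taking the sign of the slope b1
r_squared = 216 / 228
slope = 2
r_from_r_squared = math.copysign(math.sqrt(r_squared), slope)

print(f"r (formula)  = {r_formula:.4f}")
print(f"r (from r^2) = {r_from_r_squared:.4f}")
```

`math.copysign` attaches the sign of the slope to the square root, so the same code would return a negative $r$ for a regression line with a negative slope.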

NOW YOU CAN DO

Exercises 11f–22f.