OBJECTIVES By the end of this section, I will be able to …
1 The Regression Model and the Regression Assumptions
Before we learn about the regression model and assumptions, let us review the correlation and regression topics that we learned in Chapter 4. Recall that the regression line approximates the linear relationship between two continuous variables and is described by the regression equation $\hat{y} = b_1 x + b_0$, where $b_1$ is the slope of the regression line, $b_0$ is the intercept, $x$ represents the predictor variable, $y$ represents the response variable, and $\hat{y}$ represents the estimated or predicted $y$-value.
EXAMPLE 1 Review of regression topics
textms
The Nielsen company has reported that the number of text messages that a person sends tends to decrease with age. Table 1 contains a random sample of 10 people, along with their age and the number of text messages they sent on the previous day.
Age | Text messages |
---|---|
18 | 35 |
20 | 29 |
22 | 27 |
24 | 28 |
26 | 19 |
28 | 16 |
30 | 19 |
32 | 12 |
34 | 8 |
36 | 8 |
You may want to refer to Section 4.1 for (a) and (b), and Section 4.2 for (c) and (d).
Solution
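Figure 2 shows that $b_0 = 60.6$ and $b_1 = -1.5$, and thus the regression equation is $\hat{y} = -1.5x + 60.6$.
We can interpret $b_0$ and $b_1$ as follows: the intercept $b_0 = 60.6$ is the estimated number of text messages when age equals 0, a value far outside the range of the data; the slope $b_1 = -1.5$ means that, for each additional year of age, the estimated number of text messages decreases by 1.5.
For a 20-year-old person, the estimated number of daily text messages is $\hat{y} = -1.5(20) + 60.6 = 30.6$.
The actual number of text messages sent by our 20-year-old in Table 1 is $y = 29$. Our prediction from (c) is $\hat{y} = 30.6$. Thus, our prediction error (or residual) is: $y - \hat{y} = 29 - 30.6 = -1.6$. Our 20-year-old sent slightly fewer text messages than expected.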
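The review calculations in Example 1 can be sketched in a few lines of Python. This is only an illustration using the Table 1 data; the names `ages`, `texts`, and `fit_line` are hypothetical and the least-squares formulas are the standard ones, not necessarily the steps used by the calculator in Figure 2.

```python
# Age (x) and text messages sent on the previous day (y), from Table 1.
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
texts = [35, 29, 27, 28, 19, 16, 19, 12, 8, 8]

def fit_line(x, y):
    """Return (b0, b1): the least-squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_line(ages, texts)
y_hat_20 = b0 + b1 * 20        # predicted messages for a 20-year-old
residual_20 = 29 - y_hat_20    # actual value minus predicted value

print(f"b0 = {b0:.1f}, b1 = {b1:.1f}")
print(f"prediction = {y_hat_20:.1f}, residual = {residual_20:.1f}")
```

A negative residual means the observed value lies below the regression line.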
YOUR TURN #1
The table contains the age and score in a video game for a random sample of five young people.
(The solutions are shown in Appendix A.)
Age | Score |
---|---|
14 | 80 |
16 | 90 |
18 | 90 |
20 | 90 |
22 | 100 |
Example 1 and our work in Chapter 4 on regression represented descriptive statistics. Next, we turn to learning about inference in regression.
Note that the regression equation depends on the sample. It is likely that a second sample will differ from the first, giving us a different regression line and different values for $b_0$ and $b_1$. In fact, for every different sample, $b_0$ and $b_1$ take different values because $b_0$ and $b_1$ are sample statistics. However, every sample comes from a population. We do not have data on the entire population, so we are not able to calculate the population regression equation. The intercept $\beta_0$ and slope $\beta_1$ of the population regression equation are unknown population parameters, just as $\mu$ and $p$ are parameters in other contexts. The values of $\beta_0$ and $\beta_1$ are unknown, so we need to perform inference to learn about them.
The regression model may be used to approximate the relationship between the predictor variable $x$ and the response variable $y$ for the entire population of $(x, y)$ pairs.
Note that there is no “hat” on the $y$ in the population regression equation because the equation represents a model of the relationship between the actual values of $x$ and $y$, not an estimate of $y$.
Regression Model
The population regression equation is defined as
$$y = \beta_1 x + \beta_0 + \varepsilon$$
where $\beta_0$ is the intercept of the population regression line, $\beta_1$ is the slope of the population regression line, and $\varepsilon$ is the error term.
The 20-year-old in Table 1 sent 29 text messages. Suppose another 20-year-old sent 30 messages, so that both texters had age $x = 20$, but different values of $y$: 29 and 30. Then it would be impossible to draw a single regression line to pass through both $(20, 29)$ and $(20, 30)$. Thus, any linear approximation of the true relationship between $x$ and $y$ will introduce a certain amount of error. This is why the error term $\varepsilon$ is needed.
Regression Model Assumptions
The regression model operates under a set of four assumptions that must be valid in order to perform the inference in this section.
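Regression Model Assumptions
1. Zero mean: the error term $\varepsilon$ is a random variable with mean equal to 0.
2. Constant variance: the variance of $\varepsilon$ is the same for every value of $x$.
3. Independence: the values of $\varepsilon$ are independent.
4. Normality: the error term $\varepsilon$ is normally distributed.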
To summarize, for each value of $x$, the values of $y$ come from a normally distributed population with a mean on the population regression line and constant standard deviation $\sigma$. Figure 3 illustrates how $y$ is distributed for each value of $x$. Note that each normal curve has the same shape, indicating constant variance for each $x$.
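To make the summary concrete, here is a small simulation sketch of the regression model. The population parameters are unknown in practice, so the values of `BETA_0`, `BETA_1`, and `SIGMA` below are purely hypothetical choices for illustration, as are all the names. For each value of $x$, the simulated $y$ values should center on the line and show roughly the same spread.

```python
import random

# Hypothetical population parameters, chosen only for illustration.
BETA_0, BETA_1, SIGMA = 60.6, -1.5, 2.4

def simulate_y(x, n, rng):
    """Draw n values of y = BETA_1 * x + BETA_0 + epsilon, epsilon ~ Normal(0, SIGMA)."""
    return [BETA_1 * x + BETA_0 + rng.gauss(0, SIGMA) for _ in range(n)]

rng = random.Random(1)  # fixed seed so the run is repeatable
for x in (20, 30):
    ys = simulate_y(x, 2000, rng)
    mean_y = sum(ys) / len(ys)
    sd_y = (sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1)) ** 0.5
    line = BETA_1 * x + BETA_0  # mean of y on the population regression line
    print(f"x = {x}: line = {line:.1f}, mean of y = {mean_y:.2f}, sd of y = {sd_y:.2f}")
```

The sample mean at each $x$ lands near the line and the sample standard deviation near `SIGMA`, which is what the zero-mean and constant-variance assumptions describe.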
Verifying the Regression Assumptions
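To check the regression model assumptions, we construct two graphs: a scatterplot of the residuals versus the fitted values, and a normal probability plot of the residuals.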
Figure 4 shows four types of patterns that might be observed in the residuals versus fitted values plots.
Developing Your Statistical Sense
Verifying the Regression Assumptions
With small data sets, it is difficult to ascertain whether or not patterns really exist. Be wary of seeing patterns where none exist. If one or more regression assumptions are violated, we should not proceed with inferential methods such as hypothesis tests or confidence intervals. However, even if one or more regression assumptions are violated, we can still report and interpret the descriptive regression statistics that we learned in Sections 4.2 and 4.3.
EXAMPLE 2 Calculating the residuals and verifying the regression assumptions
For the data in Example 1, do the following:
Solution
Age | Text messages | Fitted (predicted) values | Residuals |
---|---|---|---|
18 | 35 | 33.6 | 1.4 |
20 | 29 | 30.6 | −1.6 |
22 | 27 | 27.6 | −0.6 |
24 | 28 | 24.6 | 3.4 |
26 | 19 | 21.6 | −2.6 |
28 | 16 | 18.6 | −2.6 |
30 | 19 | 15.6 | 3.4 |
32 | 12 | 12.6 | −0.6 |
34 | 8 | 9.6 | −1.6 |
36 | 8 | 6.6 | 1.4 |
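The fitted values and residuals in the table above can be reproduced with a short sketch. It assumes the intercept 60.6 and slope −1.5 implied by the fitted values in the table (33.6 at age 18, dropping by 3 for every 2 years); the variable names are hypothetical.

```python
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
texts = [35, 29, 27, 28, 19, 16, 19, 12, 8, 8]
b0, b1 = 60.6, -1.5  # intercept and slope implied by the fitted values

fitted = [b0 + b1 * x for x in ages]                         # y-hat for each age
residuals = [y - y_hat for y, y_hat in zip(texts, fitted)]   # y minus y-hat

for x, y, y_hat, e in zip(ages, texts, fitted, residuals):
    print(f"{x:>3} {y:>3} {y_hat:>6.1f} {e:>6.1f}")

# Least-squares residuals sum to zero, up to floating-point rounding.
print(f"|sum of residuals| = {abs(sum(residuals)):.10f}")
```

Plotting `residuals` against `fitted` gives the scatterplot used to check the assumptions.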
NOW YOU CAN DO
Exercises 7–14.
YOUR TURN #2
For the data in the Your Turn #1 on page 717, do the following:
(The solutions are shown in Appendix A.)
Once the regression assumptions have been verified, we may (a) perform hypothesis tests, and (b) construct confidence intervals for the population slope $\beta_1$.
2 Hypothesis Tests for Slope
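Suppose for a moment that, for the population regression equation $y = \beta_1 x + \beta_0 + \varepsilon$, the slope $\beta_1$ equals zero. Then the population regression equation would be
$$y = (0)x + \beta_0 + \varepsilon$$
That is,
$$y = \beta_0 + \varepsilon$$
In other words, when $\beta_1 = 0$, the value of $y$ does not depend on $x$, so no linear relationship exists between $x$ and $y$. This idea forms the basis for our inference in this section. To test whether a relationship exists between $x$ and $y$, we begin with the hypothesis test to determine whether or not $\beta_1$ equals 0. The hypotheses are
$H_0: \beta_1 = 0$: No linear relationship exists between $x$ and $y$.
$H_a: \beta_1 \neq 0$: A linear relationship exists between $x$ and $y$.
Assuming $H_0$ is true, the test statistic for this hypothesis test takes the following form.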
Test Statistic
$$t_{\text{data}} = \frac{b_1}{s / \sqrt{\sum (x - \bar{x})^2}}$$
where $b_1$ represents the slope of the regression line, $s = \sqrt{\dfrac{\text{SSE}}{n - 2}}$ represents the standard error of the estimate (from Section 4.3), and $\sum (x - \bar{x})^2$ is related to the sample variance of the $x$ data (see page 229 in Section 4.3).
Here, $s$ refers to the standard error of the estimate, not the sample standard deviation.
$t_{\text{data}}$ consists of three quantities: $b_1$, $s$, and $\sum (x - \bar{x})^2$. The next example shows how to calculate $t_{\text{data}}$ by finding these three quantities.
EXAMPLE 3 Calculating $t_{\text{data}}$
Use the following steps to calculate the test statistic $t_{\text{data}}$ for the data in Table 2:
Solution
All calculations up to the final result are expressed to nine decimal places.
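Recall from Section 4.3 (page 228) that
$$s = \sqrt{\frac{\text{SSE}}{n - 2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$
is the standard error of the estimate. Squaring each residual from Table 2 gives us the squared residuals in Table 3, and the sum of squared residuals, or sum of squares error, equal to
$$\text{SSE} = \sum (y - \hat{y})^2 = 46.4$$
Then the standard error of the estimate is
$$s = \sqrt{\frac{46.4}{10 - 2}} = \sqrt{5.8} \approx 2.408318916$$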
Residuals | Squared residuals |
---|---|
1.4 | 1.96 |
−1.6 | 2.56 |
−0.6 | 0.36 |
3.4 | 11.56 |
−2.6 | 6.76 |
−2.6 | 6.76 |
3.4 | 11.56 |
−0.6 | 0.36 |
−1.6 | 2.56 |
1.4 | 1.96 |
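To compute $\sum (x - \bar{x})^2$, we note from page 110 in Chapter 3 that the sample variance of $x$ is
$$s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Multiplying each side of the equation by $n - 1$, we obtain an equation for the quantity $\sum (x - \bar{x})^2$:
$$\sum (x - \bar{x})^2 = (n - 1) s_x^2$$
The TI-83/84 output from Figure 7 shows that $s_x = 6.055300708$, and, because $n = 10$,
$$\sum (x - \bar{x})^2 = (10 - 1)(6.055300708)^2 \approx 330$$
Therefore,
$$t_{\text{data}} = \frac{b_1}{s / \sqrt{\sum (x - \bar{x})^2}} = \frac{-1.5}{2.408318916 / \sqrt{330}} \approx -11.31$$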
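The three quantities in Example 3 can be checked with a short sketch. It takes the residuals from Table 2 and the slope −1.5 implied by the fitted values there; the variable names are hypothetical, and $\sum (x - \bar{x})^2$ is computed directly from the ages rather than from the calculator's $s_x$.

```python
import math

ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
residuals = [1.4, -1.6, -0.6, 3.4, -2.6, -2.6, 3.4, -0.6, -1.6, 1.4]  # Table 2
b1 = -1.5  # slope implied by the fitted values in Table 2

n = len(ages)
sse = sum(e ** 2 for e in residuals)          # sum of squares error
s = math.sqrt(sse / (n - 2))                  # standard error of the estimate
x_bar = sum(ages) / n
sxx = sum((x - x_bar) ** 2 for x in ages)     # sum of squared deviations of x
t_data = b1 / (s / math.sqrt(sxx))            # test statistic

print(f"SSE = {sse:.1f}, s = {s:.9f}")
print(f"sum of (x - x_bar)^2 = {sxx:.1f}, t_data = {t_data:.4f}")
```

Computing `sxx` from the raw data or from $(n - 1)s_x^2$ gives the same value; the second route is simply more convenient on a calculator.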
Now that we have $t_{\text{data}}$, we can perform the hypothesis test for the slope $\beta_1$, as the next example shows using the critical-value method.
EXAMPLE 4 Hypothesis test for slope using the critical-value method
Test whether a linear relationship exists between age and text messages, using the data from Table 1 at level of significance $\alpha$.
Solution
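The regression assumptions were shown to be valid in Example 2. We may thus proceed with the hypothesis test.
Step 1 State the hypotheses. $H_0: \beta_1 = 0$ (no linear relationship exists between age and text messages) versus $H_a: \beta_1 \neq 0$ (a linear relationship exists between age and text messages).
Step 2 Find the critical value and the rejection rule. To find $t_{\text{crit}}$, use the $t$ distribution table (Table D in the Appendix) for a two-tailed test and degrees of freedom $df = n - 2$. The rejection rule for this two-tailed test is: reject $H_0$ if $t_{\text{data}} \geq t_{\text{crit}}$ or $t_{\text{data}} \leq -t_{\text{crit}}$.
Here, $n = 10$, so $df = n - 2 = 8$. For level of significance $\alpha$, the table gives us the critical value $t_{\text{crit}}$ (for 8 degrees of freedom, $t_{\text{crit}} = 2.306$ at $\alpha = 0.05$ and $t_{\text{crit}} = 3.355$ at $\alpha = 0.01$). We will reject $H_0$ if $t_{\text{data}} \geq t_{\text{crit}}$ or $t_{\text{data}} \leq -t_{\text{crit}}$.
Step 3 Calculate $t_{\text{data}}$. From Example 3, we have
$$t_{\text{data}} = \frac{b_1}{s / \sqrt{\sum (x - \bar{x})^2}} = \frac{-1.5}{2.408318916 / \sqrt{330}} \approx -11.31$$
Step 4 State the conclusion and the interpretation. Because $t_{\text{data}} \approx -11.31 \leq -t_{\text{crit}}$, we reject $H_0$. There is evidence, at level of significance $\alpha$, that $\beta_1 \neq 0$ and that a linear relationship exists between age and text messages.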
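The rejection rule in Example 4 can be sketched as a one-line comparison. The critical values below are the standard $t$-table entries for a two-tailed test with 8 degrees of freedom; the dictionary, the function name, and the rounded test statistic are assumptions made for this illustration.

```python
# Two-tailed critical values for 8 degrees of freedom, as printed in a standard t table.
T_CRIT_DF8 = {0.10: 1.860, 0.05: 2.306, 0.01: 3.355}

def reject_h0(t_data, t_crit):
    """Two-tailed rule: reject H0 when t_data >= t_crit or t_data <= -t_crit."""
    return abs(t_data) >= t_crit

t_data = -11.3145  # test statistic for the age and text-message data, rounded

for alpha, t_crit in T_CRIT_DF8.items():
    decision = "reject H0" if reject_h0(t_data, t_crit) else "do not reject H0"
    print(f"alpha = {alpha}: t_crit = {t_crit}, {decision}")
```

Because the test statistic is so far from zero, the decision is the same at each of these levels of significance.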
NOW YOU CAN DO
Exercises 15–18.
The next example illustrates the steps for performing the hypothesis test for the slope $\beta_1$ using the $p$-value method.
EXAMPLE 5 Hypothesis test for the slope $\beta_1$ using the $p$-value method and technology
shortmemory
Time | Score |
---|---|
1 | 9 |
1 | 10 |
2 | 11 |
3 | 12 |
3 | 13 |
4 | 14 |
5 | 19 |
6 | 17 |
7 | 21 |
8 | 24 |
In Section 4.3, we considered a study on short-term memory. Ten subjects were given a set of nonsense words to memorize within a certain amount of time and were later scored on the number of words they could remember. The results are repeated here in Table 4. Use the $p$-value method and technology to test, using level of significance $\alpha$, whether a linear relationship exists between time and score.
Solution
We begin by verifying the regression assumptions. The scatterplot of the residuals versus the fitted values in Figure 8 shows no strong evidence that the independence assumption, the constant variance assumption, or the zero-mean assumption is violated. Also, the normal probability plot of the residuals in Figure 9 offers evidence of the normality of the residuals. Therefore, we conclude that the regression assumptions are verified, and proceed with the hypothesis test.
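Step 1 State the hypotheses and the rejection rule.
$H_0: \beta_1 = 0$: No linear relationship exists between time and score.
$H_a: \beta_1 \neq 0$: A linear relationship exists between time and score.
The rejection rule is: reject $H_0$ if the $p$-value $\leq \alpha$.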
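Step 2 Calculate $t_{\text{data}}$.
From page 226 in Section 4.3, we have $b_1 = 2$. From Example 13 in Chapter 4 on page 228, we have
$$s = \sqrt{\frac{\text{SSE}}{n - 2}} = \sqrt{\frac{12}{8}} \approx 1.224744871$$
From the TI-83/84 summary statistics, we have the standard deviation of the $x$ (time) data to be $s_x = 2.449489743$. Thus, using the relationship we learned in Example 3:
$$\sum (x - \bar{x})^2 = (n - 1) s_x^2 = (9)(2.449489743)^2 \approx 54$$
Therefore,
$$t_{\text{data}} = \frac{b_1}{s / \sqrt{\sum (x - \bar{x})^2}} = \frac{2}{1.224744871 / \sqrt{54}} \approx 12$$
Step 3 Find the $p$-value. For instructions, see the Step-by-Step Technology Guide on page 730. The regression results (including the $p$-value) for the TI-83/84, Excel, Minitab, and CrunchIt! are shown in Figures 10, 11, 12, and 13. (Differing results are due to rounding.)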
Step 4 The $p$-value of about 0 is less than $\alpha$, so we reject $H_0$. Evidence exists, at level of significance $\alpha$, for a linear relationship between time and score.
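In Example 5 the $p$-value comes from technology. As a rough stand-in that uses only the Python standard library, the sketch below recomputes $t_{\text{data}}$ from Table 4 and approximates the two-tailed $p$-value by numerically integrating the $t$ density. The functions `t_pdf` and `two_tailed_p` are hypothetical helpers written for this illustration, not part of any calculator or package named in the text.

```python
import math

def t_pdf(u, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, panels=20000):
    """Approximate 2 * P(T >= |t|) with the midpoint rule.

    The substitution u = |t| + z / (1 - z) maps the infinite tail onto 0 <= z < 1.
    """
    t = abs(t)
    h = 1.0 / panels
    total = 0.0
    for i in range(panels):
        z = (i + 0.5) * h
        u = t + z / (1 - z)
        total += t_pdf(u, df) / (1 - z) ** 2
    return 2 * total * h

times = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]              # Table 4
scores = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]
n = len(times)
x_bar, y_bar = sum(times) / n, sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in times)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(times, scores)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(times, scores))
s = math.sqrt(sse / (n - 2))
t_data = b1 / (s / math.sqrt(sxx))
p_value = two_tailed_p(t_data, n - 2)

print(f"t_data = {t_data:.4f}, p-value = {p_value:.2e}")
```

The resulting $p$-value is a tiny positive number, which software output often displays as 0.000 after rounding.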
NOW YOU CAN DO
Exercises 19–22.
YOUR TURN #3
Recall the age and score data from the Your Turn #1 on page 717. Test, using level of significance , whether a linear relationship exists between age and score.
(The solution is shown in Appendix A.)
3 Confidence Interval for Slope
Recall that in Chapter 8 we constructed a confidence interval estimate for a population parameter, consisting of an interval of numbers that contains the parameter with a certain confidence level. Similarly, we can construct a confidence interval for the slope $\beta_1$ of the population regression equation.
Confidence Interval for $\beta_1$
When the regression assumptions are met, a $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is given by
$$b_1 \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$
where $b_1$ is the point estimate of the slope $\beta_1$ of the population regression equation, $s$ is the standard error of the estimate, and $t_{\alpha/2}$ has $n - 2$ degrees of freedom.
Margin of Error
The margin of error for a $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is given by
$$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$
Thus, the confidence interval for $\beta_1$ takes the form $b_1 \pm E$.
EXAMPLE 6 Confidence interval for the slope
Construct a 95% confidence interval for the slope $\beta_1$ of the population regression equation for the memory-test data in Example 5.
Solution
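The regression assumptions were verified in Example 5, where we found: $b_1 = 2$, $s = 1.224744871$, and $\sum (x - \bar{x})^2 = 54$.
From the $t$ table (Appendix Table D), we find that, for 95% confidence, $t_{\alpha/2}$ for $n - 2 = 8$ degrees of freedom is 2.306. So, our margin of error is
$$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = (2.306) \cdot \frac{1.224744871}{\sqrt{54}} \approx 0.3843$$
The 95% confidence interval for $\beta_1$ is then given by
$$b_1 \pm E = 2 \pm 0.3843 = (1.6157, 2.3843)$$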
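The interval in Example 6 can be reproduced from the Table 4 data with the sketch below. The helper `slope_ci` is a hypothetical name, and the critical value 2.306 is the standard $t$-table entry for 95% confidence with 8 degrees of freedom, typed in by hand rather than computed.

```python
import math

def slope_ci(b1, s, sxx, t_crit):
    """Return (lower, upper) for b1 +/- t_crit * s / sqrt(sxx)."""
    margin = t_crit * s / math.sqrt(sxx)
    return b1 - margin, b1 + margin

times = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]              # Table 4
scores = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]
n = len(times)
x_bar, y_bar = sum(times) / n, sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in times)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(times, scores)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(times, scores))
s = math.sqrt(sse / (n - 2))

lower, upper = slope_ci(b1, s, sxx, t_crit=2.306)   # 95% confidence, df = 8
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

Swapping in a different table value for `t_crit` gives the interval at another confidence level.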
NOW YOU CAN DO
Exercises 23–30.
What Do These Numbers Mean?
4 Using Confidence Intervals to Perform the Test for the Slope
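As in earlier sections, we may use a confidence interval for the slope $\beta_1$ to perform the $t$ test for $\beta_1$, which is a two-tailed test.
Equivalence of a Two-Tailed $t$ Test About $\beta_1$ and a Confidence Interval for $\beta_1$
If a $100(1 - \alpha)\%$ confidence interval for $\beta_1$ does not contain zero, we reject $H_0: \beta_1 = 0$ at level of significance $\alpha$. If the interval contains zero, we do not reject $H_0$.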
EXAMPLE 7 Using confidence intervals to perform the t test for the slope
textms
Solution
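From the $t$ table, we find that, for 99% confidence, $t_{\alpha/2}$ for $n - 2 = 8$ degrees of freedom is 3.355. So, our margin of error is
$$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = (3.355) \cdot \frac{2.408318916}{\sqrt{330}} \approx 0.4448$$
The 99% confidence interval for $\beta_1$ is then given by
$$b_1 \pm E = -1.5 \pm 0.4448 = (-1.9448, -1.0552)$$
We are 99% confident that the interval (−1.9448, −1.0552) captures the slope $\beta_1$ of the population regression line. That is, we are 99% confident that, for each additional year of age, the decrease in the number of text messages lies between 1.0552 and 1.9448.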
The hypotheses are
$H_0: \beta_1 = 0$: No linear relationship exists between age and text messages.
$H_a: \beta_1 \neq 0$: A linear relationship exists between age and text messages.
The confidence interval from (a) does not contain zero, so we may conclude that a linear relationship exists between age and text messages, at level of significance $\alpha = 0.01$.
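The reasoning in Example 7 amounts to checking whether zero lies inside the interval. The sketch below rebuilds the 99% interval from the summary values for the age and text-message data (slope −1.5, SSE of 46.4, and sum of squared deviations 330) and applies that check; the function name and the hand-entered table value 3.355 are assumptions for illustration.

```python
import math

def ci_rejects_zero_slope(lower, upper):
    """Reject H0: beta_1 = 0 when the interval (lower, upper) excludes zero."""
    return not (lower <= 0 <= upper)

b1 = -1.5                      # sample slope for the age and text-message data
s = math.sqrt(46.4 / 8)        # standard error of the estimate, SSE / (n - 2)
sxx = 330                      # sum of squared deviations of age
t_crit = 3.355                 # t table: 99% confidence, 8 degrees of freedom

margin = t_crit * s / math.sqrt(sxx)
lower, upper = b1 - margin, b1 + margin

print(f"99% CI: ({lower:.4f}, {upper:.4f})")
print("reject H0 at alpha = 0.01:", ci_rejects_zero_slope(lower, upper))
```

A 99% interval pairs with a two-tailed test at $\alpha = 0.01$; a 95% interval would pair with $\alpha = 0.05$.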
NOW YOU CAN DO
Exercises 31–38.
Letter | Rel. freq. in English language | Frequency in Scrabble | Point value in Scrabble | Letter | Rel. freq. in English language | Frequency in Scrabble | Point value in Scrabble |
---|---|---|---|---|---|---|---|
A | 0.073 | 9 | 1 | N | 0.078 | 6 | 1 |
B | 0.009 | 2 | 3 | O | 0.074 | 8 | 1 |
C | 0.030 | 2 | 3 | P | 0.027 | 2 | 3 |
D | 0.044 | 4 | 2 | Q | 0.003 | 1 | 10 |
E | 0.130 | 12 | 1 | R | 0.077 | 6 | 1 |
F | 0.028 | 2 | 4 | S | 0.063 | 4 | 1 |
G | 0.016 | 3 | 2 | T | 0.093 | 6 | 1 |
H | 0.035 | 2 | 4 | U | 0.027 | 4 | 1 |
I | 0.074 | 9 | 1 | V | 0.013 | 2 | 4 |
J | 0.002 | 1 | 8 | W | 0.016 | 2 | 4 |
K | 0.003 | 1 | 5 | X | 0.005 | 1 | 8 |
L | 0.035 | 4 | 1 | Y | 0.019 | 2 | 4 |
M | 0.025 | 2 | 3 | Z | 0.001 | 1 | 10 |
How Fair Is the Scoring in Scrabble?
scrabble
In this Case Study, we consider the frequency and point values of Scrabble tiles. Table 5 shows the relative frequency in the English language, the frequency (number of tiles) in Scrabble, and the point value in Scrabble.
First of all, what is the relationship between the tile frequencies in Scrabble and the letter frequencies in the English language? Figure 14 shows a scatterplot of the tile frequencies in Scrabble against the letter frequencies in the English language. A positive relationship appears to exist between the two variables. That is, as the English frequencies increase, game frequencies also tend to increase.
Note that the letters above the regression line occur “too frequently” in the game, whereas the letters below the line occur “not frequently enough.” Playing typical English words during a game of Scrabble would tend to leave you with a rack of letters similar to those above the regression line. Note that S is one of the letters that is rarer in the game than in the language.
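Figure 15 displays the Minitab results from a regression of the tile frequencies against the English language relative frequencies. The regression equation is approximately $\hat{y} = 81.5x + 0.636$, where $x$ is the relative frequency of a letter in English and $\hat{y}$ is the predicted number of tiles.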
The slope is positive, which concurs with the scatterplot in Figure 14.
Next, we turn to the hypothesis test:
$H_0: \beta_1 = 0$: No linear relationship exists between the game frequencies and the English frequencies.
$H_a: \beta_1 \neq 0$: A linear relationship exists between the game frequencies and the English frequencies.
The 0.000 in red represents the $p$-value for the $t$ test. This $p$-value is smaller than any reasonable level of significance $\alpha$, so we reject the null hypothesis that no linear relationship exists between the game frequencies and the English frequencies. Does the model fit the data well? The coefficient of determination $r^2$ is 0.855, which is good, and the correlation coefficient $r$ equals 0.924, which indicates that the variables are positively correlated.
But the fit really could be better. Look at the value of $s$, the standard error of the estimate: it is greater than 1. This means that, given the English language frequency of a letter, the estimate of the tile frequency will typically differ from the actual tile frequency by more than one tile.
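The summary statistics discussed above can be recomputed from Table 5 with a short sketch. It is an independent least-squares calculation, not the Minitab output in Figure 15, and the dictionary `tiles` and the other names are hypothetical. Run on the Table 5 data, it should give a coefficient of determination of roughly 0.855 and a standard error of the estimate above one tile.

```python
import math

# (relative frequency in English, number of Scrabble tiles) per letter, from Table 5.
tiles = {
    "A": (0.073, 9), "B": (0.009, 2), "C": (0.030, 2), "D": (0.044, 4),
    "E": (0.130, 12), "F": (0.028, 2), "G": (0.016, 3), "H": (0.035, 2),
    "I": (0.074, 9), "J": (0.002, 1), "K": (0.003, 1), "L": (0.035, 4),
    "M": (0.025, 2), "N": (0.078, 6), "O": (0.074, 8), "P": (0.027, 2),
    "Q": (0.003, 1), "R": (0.077, 6), "S": (0.063, 4), "T": (0.093, 6),
    "U": (0.027, 4), "V": (0.013, 2), "W": (0.016, 2), "X": (0.005, 1),
    "Y": (0.019, 2), "Z": (0.001, 1),
}

x = [freq for freq, _ in tiles.values()]
y = [count for _, count in tiles.values()]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx                     # slope
b0 = y_bar - b1 * x_bar            # intercept
sse = syy - b1 * sxy               # sum of squares error
r_squared = 1 - sse / syy          # coefficient of determination
s = math.sqrt(sse / (n - 2))       # standard error of the estimate

print(f"slope = {b1:.2f}, intercept = {b0:.3f}")
print(f"r^2 = {r_squared:.3f}, s = {s:.3f}")
```

Letters whose residual `count - (b0 + b1 * freq)` is positive sit above the regression line, which is the "too frequently" group described earlier.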
Next, what is the relationship between the Scrabble point values and the English relative frequencies? Figure 16 shows a scatterplot of these two variables. The first thing you might notice about this relationship is that it is not linear. Therefore, it would not be appropriate to perform linear regression on this data set.
We can, nevertheless, make some descriptive remarks.