13.1 Inference About the Slope of the Regression Line


OBJECTIVES By the end of this section, I will be able to …

  1. Explain the regression model and the regression model assumptions.
  2. Perform the hypothesis test for the slope of the population regression equation.
  3. Construct confidence intervals for the slope $\beta_1$.
  4. Use confidence intervals to perform the hypothesis test for the slope $\beta_1$.

1 The Regression Model and the Regression Assumptions

Before we learn about the regression model and assumptions, let us review the correlation and regression topics that we learned in Chapter 4. Recall that the regression line approximates the linear relationship between two continuous variables and is described by the regression equation $\hat{y} = b_1 x + b_0$, where $b_1$ is the slope of the regression line, $b_0$ is the intercept, $x$ represents the predictor variable, $y$ represents the response variable, and $\hat{y}$ represents the estimated or predicted $y$-value.

EXAMPLE 1 Review of regression topics

textms

The Nielsen company has reported that the number of text messages that a person sends tends to decrease with age. Table 1 contains a random sample of 10 people, along with their age and the number of text messages they sent on the previous day.

  1. Construct and interpret a scatterplot of the response variable $y$ versus the predictor variable $x$.
  2. Calculate and interpret the correlation coefficient $r$.
  3. Compute the regression equation $\hat{y} = b_1 x + b_0$. Interpret the meaning of the intercept and the slope of the regression equation.
  4. Predict the number of text messages sent by a 20-year-old person, and calculate the prediction error (residual).
Table 1 Age and number of text messages
Age ($x$)  Messages ($y$)
18  35
20  29
22  27
24  28
26  19
28  16
30  19
32  12
34  8
36  8

You may want to refer to Section 4.1 for (a) and (b), and Section 4.2 for (c) and (d).

Solution

  1. The number of messages depends on age, and not vice versa, so the predictor variable $x$ is age and the response variable $y$ is messages. Also, note that in (d) we are trying to predict the number of text messages, which tells us that messages is the response variable because we never try to predict the known value of $x$. The TI-83/84 scatterplot is shown in Figure 1. As age increases, the number of messages tends to decrease.


    FIGURE 1 TI-83/84 scatterplot of messages versus age.
  2. Figure 2 shows the correlation coefficient $r \approx -0.9701$, calculated by the TI-83/84. Age and messages are negatively correlated. An increase in age is associated with a decrease in the number of messages.
  3. Figure 2 shows that $b_0 = 60.6$ and $b_1 = -1.5$, and thus the regression equation is

    $$\hat{y} = -1.5x + 60.6$$

    We can interpret $b_0$ and $b_1$ as follows:

    • The intercept $b_0 = 60.6$ is the estimated number of text messages sent by someone aged $x = 0$, which does not make sense because this value lies far below the minimum value of $x$ and therefore represents extreme extrapolation.
    • The slope $b_1 = -1.5$ means there is an estimated decrease of 1.5 in the number of text messages for each additional year of age.
    FIGURE 2 TI-83/84 correlation and regression results.
  4. For a 20-year-old person, the estimated number of daily text messages is

    $$\hat{y} = -1.5(20) + 60.6 = 30.6$$

    The actual number of text messages sent by our 20-year-old in Table 1 is $y = 29$. Our prediction from (c) is $\hat{y} = 30.6$. Thus, our prediction error (or residual) is: $y - \hat{y} = 29 - 30.6 = -1.6$. Our 20-year-old sent slightly fewer text messages than expected.
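
The calculator steps in Example 1 can also be sketched in a few lines of Python. This is only an illustration using the standard least-squares formulas; the function name `least_squares` and the variable names are assumptions introduced here, not part of the text.

```python
import math

# Table 1: age (x) and number of text messages sent the previous day (y)
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
messages = [35, 29, 27, 28, 19, 16, 19, 12, 8, 8]

def least_squares(x, y):
    """Return (b0, b1, r) for the least-squares line y-hat = b1*x + b0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                    # slope
    b0 = y_bar - b1 * x_bar           # intercept
    r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
    return b0, b1, r

b0, b1, r = least_squares(ages, messages)

# Part (d): prediction and residual for the 20-year-old, who sent 29 messages
y_hat_20 = b1 * 20 + b0
residual_20 = 29 - y_hat_20

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, r = {r:.4f}")
print(f"prediction = {y_hat_20:.4f}, residual = {residual_20:.4f}")
```

The sketch computes the slope first and then the intercept from the two sample means, which is the same order the hand formulas in Chapter 4 would use.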

YOUR TURN #1

The table contains the age and score in a video game for a random sample of five young people.

  1. Construct and interpret a scatterplot of the response variable $y$ versus the predictor variable $x$.
  2. Calculate and interpret the correlation coefficient $r$.
  3. Compute the regression equation $\hat{y} = b_1 x + b_0$. Interpret the meaning of the intercept and the slope of the regression equation.
  4. Predict the score for a 22-year-old person, and calculate the prediction error (residual).

(The solutions are shown in Appendix A.)

Age Score
14 80
16 90
18 90
20 90
22 100

Example 1 and our work in Chapter 4 on regression represented descriptive statistics. Next, we turn to learning about inference in regression.

Note that the regression equation depends on the sample. It is likely that a second sample will differ from the first, giving us a different regression line and different values for $b_0$ and $b_1$. In fact, for every different sample, $b_0$ and $b_1$ take different values because $b_0$ and $b_1$ are sample statistics. However, every sample comes from a population. We do not have data on the entire population, so we are not able to calculate the population regression equation. The intercept $\beta_0$ and slope $\beta_1$ of the population regression equation are unknown population parameters, just as $\mu$ and $p$ are parameters in other contexts. The values of $\beta_0$ and $\beta_1$ are unknown, so we need to perform inference to learn about them.

The regression model may be used to approximate the relationship between the predictor variable $x$ and the response variable $y$ for the entire population of $(x, y)$ pairs.


Note that there is no “hat” on the $y$ in the population regression equation because the equation represents a model of the relationship between the actual values of $x$ and $y$, not an estimate of $y$.

Regression Model

The population regression equation is defined as

$$y = \beta_1 x + \beta_0 + \varepsilon$$

where $\beta_0$ is the intercept of the population regression line, $\beta_1$ is the slope of the population regression line, and $\varepsilon$ is the error term.

The 20-year-old in Table 1 sent 29 text messages. Suppose another 20-year-old sent 30 messages, so that both texters had age $x = 20$, but different values of $y = 29$ and $y = 30$. Then it would be impossible to draw a single regression line to pass through both $(20, 29)$ and $(20, 30)$. Thus, any linear approximation of the true relationship between $x$ and $y$ will introduce a certain amount of error. This is why the error term $\varepsilon$ is needed.

Regression Model Assumptions

The regression model operates under a set of four assumptions that must be valid in order to perform the inference in this section.

Regression Model Assumptions

  1. Zero-mean assumption. The error term $\varepsilon$ is a random variable, with a mean of 0. That is, the expected value of the random variable is $E(\varepsilon) = 0$.
  2. Constant variance assumption. The variance of $\varepsilon$, which is denoted as $\sigma^2$, is the same regardless of the value of $x$.
  3. Independence assumption. The values of $\varepsilon$ are independent of each other.
  4. Normality assumption. The error term $\varepsilon$ is a normal random variable.

To summarize, for each value of $x$, the values of $y$ come from a normally distributed population with a mean on the population regression line and constant standard deviation $\sigma$. Figure 3 illustrates how $y$ is distributed for each value of $x$. Note that each normal curve has the same shape, indicating constant variance for each $x$.

FIGURE 3 Illustrating the regression assumptions.


Verifying the Regression Assumptions

To check the regression model assumptions, we construct two graphs:

  1. Scatterplot of the residuals (prediction errors, $y - \hat{y}$) against the fitted values (fitted values refers to the predicted values, $\hat{y}$)
  2. Normal probability plot of the residuals

Figure 4 shows four types of patterns that might be observed in the residuals versus fitted values plots.

Developing Your Statistical Sense

Verifying the Regression Assumptions

With small data sets, it is difficult to ascertain whether or not patterns really exist. Be wary of seeing patterns where none exist. If one or more regression assumptions are violated, we should not proceed with inferential methods such as hypothesis tests or confidence intervals. However, even if one or more regression assumptions are violated, we can still report and interpret the descriptive regression statistics that we learned in Sections 4.2 and 4.3.


EXAMPLE 2 Calculating the residuals and verifying the regression assumptions

For the data in Example 1, do the following:

  1. Calculate the residuals $y - \hat{y}$.
  2. Verify the regression assumptions.

Solution

  1. Table 2 contains the $x$ and $y$ data from Table 1, the fitted (predicted) values $\hat{y} = -1.5x + 60.6$, and the residuals $y - \hat{y}$.
    Table 2 Calculating the residuals
    $x$  $y$  Fitted (predicted) values $\hat{y}$  Residuals $y - \hat{y}$
    18 35 33.6 1.4
    20 29 30.6 −1.6
    22 27 27.6 −0.6
    24 28 24.6 3.4
    26 19 21.6 −2.6
    28 16 18.6 −2.6
    30 19 15.6 3.4
    32 12 12.6 −0.6
    34 8 9.6 −1.6
    36 8 6.6 1.4
  2. The scatterplot in Figure 5 of the residuals versus fitted values shows no strong evidence of the unhealthy patterns shown in Figure 4. Thus, the independence assumption, the constant variance assumption, and the zero-mean assumption are verified. Also, the normal probability plot of the residuals in Figure 6 indicates no evidence of departures from normality in the residuals. Therefore, we conclude that the regression assumptions are verified.
    FIGURE 5 Scatterplot of residuals versus fitted values.
    FIGURE 6 Normal probability plot of the residuals.
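
As a rough companion to Example 2, the fitted values and residuals in Table 2 can be reproduced with a short Python sketch. The helper name `fit_line` is an assumption made for this illustration. The residual plots themselves are not drawn here; the code only produces the numbers that would be plotted.

```python
# Table 1: age (x) and number of text messages (y)
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
messages = [35, 29, 27, 28, 19, 16, 19, 12, 8, 8]

def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b1*x + b0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

b0, b1 = fit_line(ages, messages)

# Fitted values y-hat and residuals y - y-hat, as in Table 2
fitted = [b1 * x + b0 for x in ages]
residuals = [y - f for y, f in zip(messages, fitted)]

for x, y, f, e in zip(ages, messages, fitted, residuals):
    print(f"{x:>3} {y:>3} {f:>6.1f} {e:>6.1f}")

# For a least-squares line with an intercept, the residuals sum to zero
# (up to floating-point rounding).
print(f"sum of residuals = {sum(residuals):.6f}")
```

Plotting `residuals` against `fitted` would give the scatterplot described in Figure 5. The zero sum is a property of the least-squares fit itself and does not by itself verify the zero-mean assumption.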

NOW YOU CAN DO

Exercises 7–14.


YOUR TURN #2

For the data in the Your Turn #1 on page 717, do the following:

  1. Calculate the residuals $y - \hat{y}$ for the regression of score on age.
  2. Verify the regression assumptions.

(The solutions are shown in Appendix A.)

Once the regression assumptions have been verified, we may (a) perform hypothesis tests, and (b) construct confidence intervals for the population slope $\beta_1$.

2 Hypothesis Tests for Slope

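Suppose for a moment that, for the population regression equation $y = \beta_1 x + \beta_0 + \varepsilon$, the slope $\beta_1$ equals zero. Then the population regression equation would be

$$y = (0)x + \beta_0 + \varepsilon$$

That is,

$$y = \beta_0 + \varepsilon$$

In other words, when $\beta_1 = 0$, the value of $y$ does not depend on $x$, so no linear relationship exists between $x$ and $y$. This idea forms the basis for our inference in this section. To test whether a relationship exists between $x$ and $y$, we begin with the hypothesis test to determine whether or not $\beta_1$ equals 0. The hypotheses are

$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_a: \beta_1 \neq 0$$

Assuming $H_0$ is true, the test statistic for this hypothesis test takes the following form.

Test Statistic $t_{\text{data}}$

$$t_{\text{data}} = \frac{b_1}{s / \sqrt{\sum (x - \bar{x})^2}}$$

where $b_1$ represents the slope of the regression line, $s = \sqrt{\text{SSE}/(n - 2)}$ represents the standard error of the estimate (from Section 4.3), and $\sum (x - \bar{x})^2$ is related to the sample variance of the $x$ data (see page 229 in Section 4.3).

Here, $s$ refers to the standard error of the estimate, not the sample standard deviation.

$t_{\text{data}}$ consists of three quantities: $b_1$, $s$, and $\sum (x - \bar{x})^2$. The next example shows how to calculate $t_{\text{data}}$ by finding these three quantities.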

EXAMPLE 3 Calculating $t_{\text{data}}$

Use the following steps to calculate the test statistic $t_{\text{data}}$ for the data in Table 2:

  1. Find $b_1$, the slope of the regression line.
  2. Calculate $s$, the standard error of the estimate.
  3. Compute $\sum (x - \bar{x})^2$, the numerator of the sample variance of the $x$ data.


Solution

All calculations up to the final result are expressed to nine decimal places.

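  1. From Example 1, the slope of the regression line is $b_1 = -1.5$.
  2. Recall from Section 4.3 (page 228) that

    $$s = \sqrt{\frac{\text{SSE}}{n - 2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$

    is the standard error of the estimate. Squaring each residual from Table 2 gives us the squared residuals in Table 3, and the sum of squared residuals, or sum of squares error, equal to

    $$\text{SSE} = \sum (y - \hat{y})^2 = 46.4$$

    Then the standard error of the estimate is

    $$s = \sqrt{\frac{46.4}{10 - 2}} = \sqrt{5.8} \approx 2.408318916$$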

    Table 3 Calculating SSE
    Residuals $y - \hat{y}$  Squared residuals $(y - \hat{y})^2$
    1.4 1.96
    −1.6 2.56
    −0.6 0.36
    3.4 11.56
    −2.6 6.76
    −2.6 6.76
    3.4 11.56
    −0.6 0.36
    −1.6 2.56
    1.4 1.96
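  3. To compute $\sum (x - \bar{x})^2$, we note from page 110 in Chapter 3 that the sample variance of $x$ is

    $$s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

    Multiplying each side of the equation by $n - 1$, we obtain an equation for the quantity $\sum (x - \bar{x})^2$:

    $$\sum (x - \bar{x})^2 = (n - 1) s_x^2$$

    The TI-83/84 output from Figure 7 shows that $s_x \approx 6.055300708$, and, because $n = 10$,

    $$\sum (x - \bar{x})^2 = (10 - 1)(6.055300708)^2 \approx 330$$

    Therefore,

    $$t_{\text{data}} = \frac{b_1}{s / \sqrt{\sum (x - \bar{x})^2}} = \frac{-1.5}{2.408318916 / \sqrt{330}} \approx -11.3145$$

    FIGURE 7 Summary statistics for the $x$ (age) data.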
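
The three quantities in Example 3 can be checked with a minimal Python sketch. The variable names are assumptions made for this illustration, and the slope and residuals are taken as given from Example 1 and Table 2 rather than recomputed.

```python
import math

# x (age) data from Table 1 and residuals y - y-hat from Table 2
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
residuals = [1.4, -1.6, -0.6, 3.4, -2.6, -2.6, 3.4, -0.6, -1.6, 1.4]

n = len(ages)
b1 = -1.5  # (a) slope of the regression line, from Example 1

# (b) standard error of the estimate: s = sqrt(SSE / (n - 2))
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

# (c) sum of squared deviations of x, computed directly ...
x_bar = sum(ages) / n
sxx = sum((x - x_bar) ** 2 for x in ages)

# ... and via the sample variance, (n - 1) * s_x^2, as in the text
s_x = math.sqrt(sxx / (n - 1))
sxx_from_variance = (n - 1) * s_x ** 2

# Test statistic for the slope
t_data = b1 / (s / math.sqrt(sxx))

print(f"SSE = {sse:.4f}, s = {s:.9f}")
print(f"s_x = {s_x:.9f}, sum of squares = {sxx:.4f}")
print(f"t_data = {t_data:.4f}")
```

Computing $\sum (x - \bar{x})^2$ directly and through $(n - 1)s_x^2$ gives the same value, which is why the text can rely on the calculator's $s_x$ output.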

Now that we have $t_{\text{data}}$, we can perform the hypothesis test for the slope $\beta_1$, as the next example shows using the critical-value method.


EXAMPLE 4 Hypothesis test for slope using the critical-value method

Test whether a linear relationship exists between age and text messages, using the data from Table 1 at level of significance $\alpha$.

Solution

The regression assumptions were shown to be valid in Example 2. We may thus proceed with the hypothesis test.

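  • Step 1 State the hypotheses.
    • $H_0: \beta_1 = 0$. No linear relationship exists between age and text messages.
    • $H_a: \beta_1 \neq 0$. A linear relationship exists between age and text messages.
  • Step 2 Find the critical value $t_{\text{crit}}$ and the rejection rule. To find $t_{\text{crit}}$, use the $t$ distribution table (Table D in the Appendix) for a two-tailed test and degrees of freedom $df = n - 2$. The rejection rule for this two-tailed test is

    Reject $H_0$ if $t_{\text{data}} \leq -t_{\text{crit}}$ or $t_{\text{data}} \geq t_{\text{crit}}$.

    Here, $n = 10$, so $df = n - 2 = 8$. For level of significance $\alpha$, the $t$ table gives us $t_{\text{crit}}$ for $df = 8$. We will reject $H_0$ if $t_{\text{data}} \leq -t_{\text{crit}}$ or $t_{\text{data}} \geq t_{\text{crit}}$.

  • Step 3 Calculate $t_{\text{data}}$. From Example 3, we have

    $$t_{\text{data}} = \frac{b_1}{s / \sqrt{\sum (x - \bar{x})^2}} = \frac{-1.5}{2.408318916 / \sqrt{330}} \approx -11.3145$$

  • Step 4 State the conclusion and the interpretation. Because $t_{\text{data}} \approx -11.3145 \leq -t_{\text{crit}}$, we reject $H_0$. There is evidence, at level of significance $\alpha$, that $\beta_1 \neq 0$ and that a linear relationship exists between age and text messages.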
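
The critical-value method in Example 4 can be sketched as follows. The critical values are an assumption of this sketch: they are the two-tailed values for 8 degrees of freedom as printed in a standard $t$ table, standing in for Table D, which is not reproduced here. The function names are likewise hypothetical.

```python
import math

# Table 1: age (x) and number of text messages (y)
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
messages = [35, 29, 27, 28, 19, 16, 19, 12, 8, 8]

# Two-tailed critical values for df = 8 from a standard t table
# (assumption: stand-ins for the values in Appendix Table D).
T_CRIT_DF8 = {0.10: 1.860, 0.05: 2.306, 0.01: 3.355}

def slope_t_statistic(x, y):
    """Return (t_data, df) for testing H0: beta_1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b1 * xi + b0)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    return b1 / (s / math.sqrt(sxx)), n - 2

def reject_h0(t_data, t_crit):
    """Two-tailed rejection rule: reject if t_data <= -t_crit or t_data >= t_crit."""
    return abs(t_data) >= t_crit

t_data, df = slope_t_statistic(ages, messages)
for alpha, t_crit in T_CRIT_DF8.items():
    decision = "reject H0" if reject_h0(t_data, t_crit) else "do not reject H0"
    print(f"alpha = {alpha}: t_crit = {t_crit}, t_data = {t_data:.4f}, {decision}")
```

Because $|t_{\text{data}}|$ is far larger than each of these critical values, the decision to reject $H_0$ is the same at all three levels of significance.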

NOW YOU CAN DO

Exercises 15–18.

The next example illustrates the steps for performing the hypothesis test for the slope $\beta_1$ using the $p$-value method.

EXAMPLE 5 Hypothesis test for the slope $\beta_1$ using the $p$-value method and technology

shortmemory

Table 4
Time ($x$)  Score ($y$)
1 9
1 10
2 11
3 12
3 13
4 14
5 19
6 17
7 21
8 24

In Section 4.3, we considered a study on short-term memory. Ten subjects were given a set of nonsense words to memorize within a certain amount of time and were later scored on the number of words they could remember. The results are repeated here in Table 4. Use the $p$-value method and technology to test, using level of significance $\alpha$, whether a linear relationship exists between time and score.

Solution

We begin by verifying the regression assumptions. The scatterplot of the residuals versus the fitted values in Figure 8 shows no strong evidence that the independence assumption, the constant variance assumption, or the zero-mean assumption is violated. Also, the normal probability plot of the residuals in Figure 9 offers evidence of the normality of the residuals. Therefore, we conclude that the regression assumptions are verified, and proceed with the hypothesis test.

  • Step 1 State the hypotheses and the rejection rule.

    • $H_0: \beta_1 = 0$. No linear relationship exists between time and score.
    • $H_a: \beta_1 \neq 0$. A linear relationship exists between time and score.

    FIGURE 8 Residuals versus fitted values plot.
    FIGURE 9 Normal probability plot of the residuals.

    The rejection rule is: reject $H_0$ if the $p$-value $\leq \alpha$.

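  • Step 2 Calculate $t_{\text{data}}$.

    From page 226 in Section 4.3, we have $b_1 = 2$. From Example 13 in Chapter 4 on page 228, we have

    $$s = \sqrt{\frac{\text{SSE}}{n - 2}} = \sqrt{\frac{12}{8}} \approx 1.224744871$$

    From the TI-83/84 summary statistics, we have the standard deviation of the $x$ (time) data to be $s_x \approx 2.449489743$. Thus, using the relationship we learned in Example 3:

    $$\sum (x - \bar{x})^2 = (n - 1) s_x^2 = (9)(2.449489743)^2 \approx 54$$

    Therefore,

    $$t_{\text{data}} = \frac{b_1}{s / \sqrt{\sum (x - \bar{x})^2}} = \frac{2}{1.224744871 / \sqrt{54}} \approx 12$$

    TI-83/84 summary statistics for $x$ (time) data.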
  • Step 3 Find the $p$-value. For instructions, see the Step-by-Step Technology Guide on page 730. The regression results (including the $p$-value) for the TI-83/84, Excel, Minitab, and CrunchIt! are shown in Figures 10, 11, 12, and 13. (Differing results are due to rounding.)

    FIGURE 10 TI-83/84 regression results.
    FIGURE 11 Excel regression results.
    FIGURE 12 Minitab regression results.
    FIGURE 13 CrunchIt! regression results.
  • Step 4 State the conclusion and the interpretation. The $p$-value of about 0.000002 is less than $\alpha$, so we reject $H_0$. Evidence exists, at level of significance $\alpha$, for a linear relationship between time and score.
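
For readers without a calculator or statistics package at hand, the $p$-value in Example 5 can be approximated with a short, dependency-free Python sketch. The approach is an assumption of this sketch, not the method used by the technology in the text: it integrates the $t$ density numerically with Simpson's rule. The function names `t_pdf`, `two_tailed_p`, and `slope_test` are hypothetical.

```python
import math

# Table 4: memorization time (x) and memory-test score (y)
times = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]
scores = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]

def t_pdf(u, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_tailed_p(t_data, df, steps=20000):
    """Approximate P(|T| >= |t_data|) using Simpson's rule on [0, |t_data|]."""
    upper = abs(t_data)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t_data|)
    return max(0.0, 1 - 2 * central)

def slope_test(x, y):
    """Return (b1, s, t_data, p_value) for testing H0: beta_1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b1 * xi + b0)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    t_data = b1 / (s / math.sqrt(sxx))
    return b1, s, t_data, two_tailed_p(t_data, n - 2)

b1, s, t_data, p_value = slope_test(times, scores)
print(f"b1 = {b1:.4f}, s = {s:.9f}, t_data = {t_data:.4f}, p-value = {p_value:.2e}")
```

Numerical integration is used only to keep the sketch self-contained. A statistics library's $t$ distribution function would normally be the more convenient choice.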

NOW YOU CAN DO

Exercises 19–22.

YOUR TURN #3

Recall the age and score data from the Your Turn #1 on page 717. Test, using level of significance $\alpha$, whether a linear relationship exists between age and score.

(The solution is shown in Appendix A.)


3 Confidence Interval for Slope

Recall that in Chapter 8 we constructed a confidence interval estimate for a population parameter, consisting of an interval of numbers that contains the parameter with a certain confidence level. Similarly, we can construct a confidence interval for the slope $\beta_1$ of the population regression equation.

Confidence Interval for $\beta_1$

When the regression assumptions are met, a $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is given by

$$b_1 \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$

where $b_1$ is the point estimate of the slope $\beta_1$ of the population regression equation, $s$ is the standard error of the estimate, and $t_{\alpha/2}$ has $n - 2$ degrees of freedom.

Margin of Error

The margin of error for a $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is given by

$$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$

Thus, the confidence interval for $\beta_1$ takes the form $b_1 \pm E$.

EXAMPLE 6 Confidence interval for the slope

Construct a 95% confidence interval for the slope $\beta_1$ of the population regression equation for the memory-test data in Example 5.

Solution

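The regression assumptions were verified in Example 5, where we found:

  • $b_1 = 2$,
  • $s \approx 1.224744871$, and
  • $\sum (x - \bar{x})^2 = 54$.

From the $t$ table (Appendix Table D), we find that, for 95% confidence, $t_{\alpha/2}$ for $n - 2 = 8$ degrees of freedom is 2.306. So, our margin of error is

$$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = 2.306 \cdot \frac{1.224744871}{\sqrt{54}} \approx 0.3843$$

The 95% confidence interval for $\beta_1$ is then given by

$$b_1 \pm E = 2 \pm 0.3843 = (1.6157, 2.3843)$$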

NOW YOU CAN DO

Exercises 23–30.

What Do These Numbers Mean?

  • The margin of error $E = 0.3843$ means that, when we repeatedly take samples from this population, most of the time the sample estimate $b_1$ will be within $E = 0.3843$ of the unknown value of the slope $\beta_1$ of the population regression line.
  • We are 95% confident that the interval (1.6157, 2.3843) captures the slope $\beta_1$ of the population regression line.
  • Because $\beta_1$ is the increase in memory-test score per added minute of memorization, we are 95% confident that, for each additional minute of memorization, the increase in memory-test score will lie between 1.6157 and 2.3843 points.
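
The interval in Example 6 can be reproduced with the following Python sketch. The critical value 2.306 is entered by hand as the two-tailed $t$ table value for 95% confidence and 8 degrees of freedom, and the function name `slope_confidence_interval` is an assumption of this illustration.

```python
import math

# Table 4: memorization time (x) and memory-test score (y)
times = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]
scores = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]

T_CRIT_95_DF8 = 2.306  # t table value for 95% confidence, df = 8

def slope_confidence_interval(x, y, t_crit):
    """Return (b1, margin_of_error, (lower, upper)) for the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b1 * xi + b0)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    margin = t_crit * s / math.sqrt(sxx)  # E = t * s / sqrt(sum (x - x_bar)^2)
    return b1, margin, (b1 - margin, b1 + margin)

b1, margin, (lower, upper) = slope_confidence_interval(times, scores, T_CRIT_95_DF8)
print(f"b1 = {b1:.4f}, E = {margin:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```

Passing the critical value in as an argument keeps the function usable for other confidence levels, provided the matching table value for $n - 2$ degrees of freedom is supplied.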


4 Using Confidence Intervals to Perform the Test for the Slope

As in earlier sections, we may use a $100(1 - \alpha)\%$ confidence interval for the slope $\beta_1$ to perform the $t$ test for $\beta_1$, which is a two-tailed test.

Equivalence of a Two-Tailed $t$ Test About $\beta_1$ and a $100(1 - \alpha)\%$ Confidence Interval for $\beta_1$

  • If a $100(1 - \alpha)\%$ confidence interval for $\beta_1$ does not contain zero, then we would reject $H_0: \beta_1 = 0$ for level of significance $\alpha$, and conclude that a linear relationship exists between $x$ and $y$.
  • If a $100(1 - \alpha)\%$ confidence interval for $\beta_1$ does contain zero, then we would not reject $H_0: \beta_1 = 0$ for level of significance $\alpha$.

EXAMPLE 7 Using confidence intervals to perform the t test for the slope

  1. Construct and interpret a 99% confidence interval for the slope $\beta_1$ for the text messaging data in Table 1.
  2. Use the confidence interval in (a) to test whether a linear relationship exists between age and text messages, using level of significance $\alpha = 0.01$.

textms

Solution

  1. The regression assumptions were verified in Example 2. Also,
    • In Example 1, we found $b_1 = -1.5$.
    • In Example 3, we calculated $s \approx 2.408318916$, and $\sum (x - \bar{x})^2 = 330$.

    From the $t$ table, we find that, for 99% confidence, $t_{\alpha/2}$ for $n - 2 = 8$ degrees of freedom is 3.355. So, our margin of error is

    $$E = t_{\alpha/2} \cdot \frac{s}{\sqrt{\sum (x - \bar{x})^2}} = 3.355 \cdot \frac{2.408318916}{\sqrt{330}} \approx 0.4448$$

    The 99% confidence interval for $\beta_1$ is then given by

    $$b_1 \pm E = -1.5 \pm 0.4448 = (-1.9448, -1.0552)$$

    We are 99% confident that the interval (−1.9448, −1.0552) captures the slope $\beta_1$ of the population regression line. That is, we are 99% confident that, for each additional year of age, the decrease in the number of text messages lies between 1.0552 and 1.9448.

  2. The hypotheses are

    $H_0: \beta_1 = 0$. No linear relationship exists between age and text messages.

    $H_a: \beta_1 \neq 0$. A linear relationship exists between age and text messages.

    The confidence interval from (a) does not contain zero, so we reject $H_0$ and conclude that a linear relationship exists between age and text messages, at level of significance $\alpha = 0.01$.
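
The equivalence used in Example 7 can be sketched in Python as a check of whether zero lies inside the interval. The summary values are taken from Examples 1 and 3, the critical value 3.355 is the $t$ table value for 99% confidence and 8 degrees of freedom, and the function name `ci_test_for_slope` is hypothetical.

```python
import math

def ci_test_for_slope(b1, s, sxx, t_crit):
    """Build b1 +/- E and report whether H0: beta_1 = 0 is rejected.

    H0 is rejected exactly when the interval does not contain zero.
    """
    margin = t_crit * s / math.sqrt(sxx)
    lower, upper = b1 - margin, b1 + margin
    contains_zero = lower <= 0 <= upper
    return (lower, upper), not contains_zero

# Text-message data: b1 from Example 1; s and the sum of squares from Example 3
b1 = -1.5
s = math.sqrt(46.4 / 8)   # standard error of the estimate, sqrt(SSE / (n - 2))
sxx = 330                 # sum of (x - x_bar)^2
T_CRIT_99_DF8 = 3.355     # t table value for 99% confidence, df = 8

(lower, upper), reject = ci_test_for_slope(b1, s, sxx, T_CRIT_99_DF8)
print(f"99% CI = ({lower:.4f}, {upper:.4f}); reject H0 at alpha = 0.01: {reject}")
```

Note that the confidence level and the level of significance must match: a 99% interval corresponds to a two-tailed test at $\alpha = 0.01$.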

NOW YOU CAN DO

Exercises 31–38.


Table 5 Frequency in English, frequency in Scrabble, and Scrabble point value of the letters in the alphabet
Letter  Rel. freq. in English language  Frequency in Scrabble  Point value in Scrabble
A  0.073  9   1
B  0.009  2   3
C  0.030  2   3
D  0.044  4   2
E  0.130  12  1
F  0.028  2   4
G  0.016  3   2
H  0.035  2   4
I  0.074  9   1
J  0.002  1   8
K  0.003  1   5
L  0.035  4   1
M  0.025  2   3
N  0.078  6   1
O  0.074  8   1
P  0.027  2   3
Q  0.003  1   10
R  0.077  6   1
S  0.063  4   1
T  0.093  6   1
U  0.027  4   1
V  0.013  2   4
W  0.016  2   4
X  0.005  1   8
Y  0.019  2   4
Z  0.001  1   10

How Fair Is the Scoring in Scrabble?

scrabble

In this Case Study, we consider the frequency and point values of Scrabble tiles. Table 5 shows the relative frequency in the English language, the frequency (number of tiles) in Scrabble, and the point value in Scrabble.

First of all, what is the relationship between the tile frequencies in Scrabble and the letter frequencies in the English language? Figure 14 shows a scatterplot of the tile frequencies in Scrabble against the letter frequencies in the English language. A positive relationship appears to exist between the two variables. That is, as the English frequencies increase, game frequencies also tend to increase.

FIGURE 14 Scatterplot of Scrabble frequency versus English relative frequency of letters, with regression line.

Note that the letters above the regression line occur “too frequently” in the game, whereas the letters below the line occur “not frequently enough.” Playing typical English words during a game of Scrabble would tend to leave you with a rack of letters similar to those above the regression line. Note that S is one of the letters that is rarer in the game than in the language.


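Figure 15 displays the Minitab results from a regression of the tile frequencies against the English language relative frequencies. Computed from the data in Table 5, the regression equation is approximately

$$\hat{y} = 81.46x + 0.636$$

where $x$ is the English relative frequency of a letter and $\hat{y}$ is the predicted number of Scrabble tiles.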

The slope is positive, which concurs with the scatterplot in Figure 14.

FIGURE 15 Minitab regression output.

Next, we turn to the hypothesis test:

  • $H_0: \beta_1 = 0$. No linear relationship exists between Scrabble frequency and English relative frequency.
  • $H_a: \beta_1 \neq 0$. A linear relationship exists between Scrabble frequency and English relative frequency.

The 0.000 in red represents the $p$-value for the test. This $p$-value is smaller than any commonly used level of significance $\alpha$, so we reject the null hypothesis that no linear relationship exists between the game frequencies and the English frequencies. Does the model fit the data well? The coefficient of determination $r^2$ is 0.855, which is good, and the correlation coefficient $r$ equals 0.924, which indicates that the variables are positively correlated.

But the fit really could be better. Look at the value of $s$, the standard error of the estimate: $s \approx 1.16$. This means that, given the English language frequency of a letter, the estimate of the tile frequency will typically differ from the actual tile frequency by more than one tile.
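
The summary figures quoted in this Case Study can be checked against Table 5 with a short Python sketch. It is an illustration only, computed from the rounded relative frequencies printed in the table, so small differences from the Minitab output are possible. The names `TABLE_5` and `regression_summary` are assumptions of this sketch.

```python
import math

# Table 5: letter -> (relative frequency in English, number of Scrabble tiles)
TABLE_5 = {
    "A": (0.073, 9), "B": (0.009, 2), "C": (0.030, 2), "D": (0.044, 4),
    "E": (0.130, 12), "F": (0.028, 2), "G": (0.016, 3), "H": (0.035, 2),
    "I": (0.074, 9), "J": (0.002, 1), "K": (0.003, 1), "L": (0.035, 4),
    "M": (0.025, 2), "N": (0.078, 6), "O": (0.074, 8), "P": (0.027, 2),
    "Q": (0.003, 1), "R": (0.077, 6), "S": (0.063, 4), "T": (0.093, 6),
    "U": (0.027, 4), "V": (0.013, 2), "W": (0.016, 2), "X": (0.005, 1),
    "Y": (0.019, 2), "Z": (0.001, 1),
}

def regression_summary(x, y):
    """Return (b0, b1, r_squared, s) for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b1 * xi + b0)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - sse / syy
    s = math.sqrt(sse / (n - 2))  # standard error of the estimate
    return b0, b1, r_squared, s

english = [freq for freq, _ in TABLE_5.values()]
tiles = [count for _, count in TABLE_5.values()]

b0, b1, r_squared, s = regression_summary(english, tiles)
print(f"y-hat = {b1:.2f}x + {b0:.3f}")
print(f"r^2 = {r_squared:.3f}, r = {math.sqrt(r_squared):.3f}, s = {s:.3f}")
```

Taking the positive square root of $r^2$ to obtain $r$ is valid here only because the slope is positive.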

Next, what is the relationship between the Scrabble point values and the English relative frequencies? Figure 16 shows a scatterplot of these two variables. The first thing you might notice about this relationship is that it is not linear. Therefore, it would not be appropriate to perform linear regression on this data set.

FIGURE 16 Scatterplot of Scrabble point value versus English relative frequency. Linear regression would not be appropriate.


We can, nevertheless, make some descriptive remarks.

  • What is a “good” Scrabble tile to pick up? In the best case, it would be a letter with high English frequency worth lots of Scrabble game points. Unfortunately, the two do not go together. The high-frequency letters such as E and T have low point values, and the high point-value letters such as Q and Z have low frequencies. But we can still make comparisons.
  • Which would you rather pick up, a D or a G? It would seem that D would be preferable because it has the same point value as G with much higher English frequency.
  • Which do you prefer between J and X? They are worth the same points, but X has a higher frequency in English, so it is easier to make words with it in the game. The letter H seems to have a good combination of high points and moderate frequency.