13 Inference in Regression

13.1 Inference About the Slope of the Regression Line

OBJECTIVES By the end of this section, I will be able to …

Explain the regression model and the regression model assumptions.
Perform the hypothesis test for the slope $β_{1}$ of the population regression equation.
Construct confidence intervals for the slope $β_{1}$ .
Use confidence intervals to perform the hypothesis test for the slope $β_{1}$ .

1 The Regression Model and the Regression Assumptions

Before we learn about the regression model and assumptions, let us review the correlation and regression topics that we learned in Chapter 4. Recall that the regression line approximates the linear relationship between two continuous variables and is described by the regression equation $\hat{y} = b_{1} x + b_{0}$ , where $b_{1}$ is the slope of the regression line, $b_{0}$ is the $y$ intercept, $x$ represents the predictor variable, $y$ represents the response variable, and $\hat{y}$ represents the estimated or predicted $y$ -value.

EXAMPLE 1 Review of regression topics

textms

The Nielsen company has reported that the number of text messages that a person sends tends to decrease with age. Table 1 contains a random sample of 10 people, along with their age and the number of text messages they sent on the previous day.

(a) Construct and interpret a scatterplot of the response variable $y$ versus the predictor variable $x$ .
(b) Calculate and interpret the correlation coefficient $r$ .
(c) Compute the regression equation $\hat{y} = b_{1} x + b_{0}$ . Interpret the meaning of the $y$ intercept $b_{0}$ and the slope $b_{1}$ of the regression equation.
(d) Predict the number of text messages sent by a 20-year-old person, and calculate the prediction error (residual).

Table 1 Age and number of text messages

$x$ = Age	$y$ = Text messages	$x$ = Age	$y$ = Text messages
18	35	28	16
20	29	30	19
22	27	32	12
24	28	34	8
26	19	36	8

You may want to refer to Section 4.1 for (a) and (b), and Section 4.2 for (c) and (d).

Solution

The number of messages depends on age, and not vice versa, so the predictor variable $x$ is age and the response variable $y$ is messages. Also, note that in (d) we are trying to predict the number of text messages, which tells us that messages is the response variable $y$ because we never try to predict the known value of $x$ . The TI-83/84 scatterplot is shown in Figure 1. As age increases, the number of messages tends to decrease.
Page 717

FIGURE 1 TI-83/84 scatterplot of messages versus age.
Figure 2 shows the correlation coefficient $r \approx - 0.9701$ , calculated by the TI-83/84. Age and messages are negatively correlated. An increase in age is associated with a decrease in the number of messages.
Figure 2 shows that $a = b_{1} = - 1.5$ and $b = b_{0} = 60.6$ , and thus the regression equation is

$\hat{y} = b_{1} x + b_{0} = (- 1.5) (age) + 60.6$

We can interpret $b_{0}$ and $b_{1}$ as follows:
- The $y$ intercept $b_{0} = 60.6$ is the estimated number of text messages sent by someone aged $x = 0$ , which does not make sense because this value $x = 0$ lies far below the minimum value of $x$ and therefore represents extreme extrapolation.
- The slope $b_{1} = - 1.5$ means there is an estimated decrease of 1.5 in the number of text messages for each additional year of age.
FIGURE 2 TI-83/84 correlation and regression results.
For a 20-year-old person, the estimated number of daily text messages is

$\hat{y} = b_{1} x + b_{0} = (- 1.5) (20) + 60.6 = 30.6$

The actual number of text messages sent by our 20-year-old in Table 1 is $y = 29$ . Our prediction from (c) is $\hat{y} = 30.6$ . Thus, our prediction error (or residual) is: $(y - \hat{y}) = (29 - 30.6) = - 1.6$ . Our 20-year-old sent slightly fewer text messages than expected.
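The calculator results in this example can also be reproduced from the defining formulas. The following is a minimal sketch in Python, not part of the original solution; the variable names (`sxx`, `syy`, `sxy`, and so on) are our own labels for the sums of squares and cross products of the Table 1 data.

```python
from math import sqrt

# Table 1 data: age (x) and number of text messages (y)
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
texts = [35, 29, 27, 28, 19, 16, 19, 12, 8, 8]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(texts) / n

# Sums of squares and cross products about the means
sxx = sum((x - x_bar) ** 2 for x in ages)
syy = sum((y - y_bar) ** 2 for y in texts)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, texts))

r = sxy / sqrt(sxx * syy)   # correlation coefficient
b1 = sxy / sxx              # slope of the regression line
b0 = y_bar - b1 * x_bar     # y intercept of the regression line

# Prediction and residual for the 20-year-old, who actually sent 29 messages
y_hat_20 = b1 * 20 + b0
residual_20 = 29 - y_hat_20

print("r  =", round(r, 4))
print("b1 =", round(b1, 4), " b0 =", round(b0, 4))
print("prediction =", round(y_hat_20, 1), " residual =", round(residual_20, 1))
```

The values should agree with Figure 2 and with the residual found in part (d), up to rounding.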

YOUR TURN#1

The table contains the age $(x)$ and score in a video game $(y)$ for a random sample of five young people.

Construct and interpret a scatterplot of the response variable $y$ versus the predictor variable $x$ .
Calculate and interpret the correlation coefficient $r$ .
Compute the regression equation $\hat{y} = b_{1} x + b_{0}$ . Interpret the meaning of the $y$ intercept $b_{0}$ and the slope $b_{1}$ of the regression equation.
Predict the score for a 22-year-old person, and calculate the prediction error (residual).

(The solutions are shown in Appendix A.)

Age $(x)$	Score $(y)$
14	80
16	90
18	90
20	90
22	100

Example 1 and our work in Chapter 4 on regression represented descriptive statistics. Next, we turn to learning about inference in regression.

Note that the regression equation $\hat{y} = b_{1} x + b_{0} = (- 1.5) (age) + 60.6$ depends on the sample. It is likely that a second sample will differ from the first, giving us a different regression line and different values for $b_{0}$ and $b_{1}$ . In fact, for every different sample, $b_{0}$ and $b_{1}$ take different values because $b_{0}$ and $b_{1}$ are sample statistics. However, every sample comes from a population. We do not have data on the entire population, so we are not able to calculate the population regression equation. The $y$ intercept $β_{0}$ and slope $β_{1}$ of the population regression equation are unknown population parameters, just as $μ$ and $p$ are parameters in other contexts. The values of $β_{0}$ and $β_{1}$ are unknown, so we need to perform inference to learn about them.

The regression model may be used to approximate the relationship between the predictor variable $x$ and the response variable $y$ for the entire population of $(x, y)$ pairs.

Page 718

Note that there is no “hat” on the $y$ in the population regression equation because the equation represents a model of the relationship between the actual values of $x$ and $y$ , not an estimate of $y$ .

Regression Model

The population regression equation is defined as

$y = β_{1} x + β_{0} + ε$

where $β_{0}$ is the $y$ intercept of the population regression line, $β_{1}$ is the slope of the population regression line, and $ε$ is the error term.

The 20-year-old in Table 1 sent 29 text messages. Suppose another 20-year-old sent 30 messages, so that both texters had age $x = 20$ , but different values of $y : y = 29$ and $y = 30$ . Then it would be impossible to draw a single regression line to pass through both $(x = 20, y = 29)$ and $(x = 20, y = 30)$ . Thus, any linear approximation of the true relationship between $x$ and $y$ will introduce a certain amount of error. This is why the error term $ε$ is needed.

Regression Model Assumptions

The regression model operates under a set of four assumptions that must be valid in order to perform the inference in this section.

Regression Model Assumptions

Zero-mean assumption. The error term $ε$ is a random variable, with a mean of 0. That is, the expected value of the random variable $ε$ is $0 : E (ε) = 0$ .
Constant variance assumption. The variance of $ε$ , which is denoted as $σ^{2}$ , is the same regardless of the value of $x$ .
Independence assumption. The values of $ε$ are independent of each other.
Normality assumption. The error term $ε$ is a normal random variable.

To summarize, for each value of $x$ , the values of $y$ come from a normally distributed population with a mean on the population regression line $E (y) = β_{1} x + β_{0}$ and constant variance $σ^{2}$ . Figure 3 illustrates how $y$ is distributed for each value of $x$ . Note that each normal curve has the same shape, indicating constant variance for each $x$ .
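To see what the model says about sampling variability, we can simulate it. The sketch below is illustrative only: the population values `BETA1`, `BETA0`, and `SIGMA` are hypothetical numbers that we chose (in practice they are unknown), and the ages are borrowed from Table 1. Each simulated sample satisfies all four assumptions, yet each gives a different sample slope $b_{1}$ .

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population parameters (assumed for illustration only)
BETA1, BETA0, SIGMA = -1.5, 60.6, 2.4
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]


def sample_slope():
    """Draw one sample from the model and return its least-squares slope b1."""
    # y = beta1 * x + beta0 + epsilon, with epsilon ~ Normal(0, SIGMA)
    ys = [BETA1 * x + BETA0 + random.gauss(0, SIGMA) for x in ages]
    x_bar, y_bar = mean(ages), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, ys))
    sxx = sum((x - x_bar) ** 2 for x in ages)
    return sxy / sxx


slopes = [sample_slope() for _ in range(2000)]
print("smallest b1:", round(min(slopes), 2))
print("largest b1: ", round(max(slopes), 2))
print("average b1: ", round(mean(slopes), 2))
```

The individual slopes scatter around the hypothetical population slope, which is why inference about $β_{1}$ must account for sample-to-sample variation.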

FIGURE 3 Illustrating the regression assumptions.

Page 719

Verifying the Regression Assumptions

To check the regression model assumptions, we construct two graphs:

Scatterplot of the residuals (prediction errors, $y - \hat{y}$ ) against the fitted values (fitted values refers to the predicted values, $\hat{y}$ )
Normal probability plot of the residuals

Figure 4 shows four types of patterns that might be observed in the residuals versus fitted values plots.

Plot (a) is a “healthy” plot, displaying no noticeable patterns.
Plot (b) is a curve, which indicates a violation of the independence assumption. Independence implies that knowing the value of a particular $y$ does not help to predict the value of a different $y$ . However, a curve suggests that knowing the value of a previous $y$ helps in knowing the value of the next $y$ .
Plot (c) shows a “funnel” pattern, which contradicts the constant variance assumption. The residuals on the left are close together vertically (small variability), whereas the residuals on the right are far apart vertically (large variability).
Plot (d) shows an increasing pattern that violates the zero-mean assumption. The residuals on the left are all below the midline, so $E (y) < β_{1} x + β_{0}$ , whereas the residuals on the right are all above the midline, so $E (y) > β_{1} x + β_{0}$ .

FIGURE 4 Patterns in the residuals versus predicted plots.

Developing Your Statistical Sense

Verifying the Regression Assumptions

With small data sets, it is difficult to ascertain whether or not patterns really exist. Be wary of seeing patterns where none exist. If one or more regression assumptions are violated, we should not proceed with inferential methods such as hypothesis tests or confidence intervals. However, even if one or more regression assumptions are violated, we can still report and interpret the descriptive regression statistics that we learned in Sections 4.2 and 4.3.

Page 720

EXAMPLE 2 Calculating the residuals and verifying the regression assumptions

For the data in Example 1, do the following:

Calculate the residuals $y - \hat{y}$ .
Verify the regression assumptions.

Solution

Table 2 contains the $x$ and $y$ data from Table 1, the fitted (predicted) values $\hat{y}$ , and the residuals $y - \hat{y}$ .

Table 2 Calculating the residuals

$x$ = Age	$y$ = Text messages	Fitted (predicted) values $\hat{y} = (- 1.5) (age) + 60.6$	Residuals $y - \hat{y}$
18	35	33.6	1.4
20	29	30.6	−1.6
22	27	27.6	−0.6
24	28	24.6	3.4
26	19	21.6	−2.6
28	16	18.6	−2.6
30	19	15.6	3.4
32	12	12.6	−0.6
34	8	9.6	−1.6
36	8	6.6	1.4

The scatterplot in Figure 5 of the residuals versus fitted values shows no strong evidence of the unhealthy patterns shown in Figure 4. Thus, the independence assumption, the constant variance assumption, and the zero-mean assumption are verified. Also, the normal probability plot of the residuals in Figure 6 indicates no evidence of departures from normality in the residuals. Therefore, we conclude that the regression assumptions are verified.
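The fitted values and residuals in Table 2 can be generated with a few lines of code. This is a minimal sketch that takes the slope and intercept from Example 1 as given; it reproduces the table but does not draw the plots in Figures 5 and 6, which would require a plotting tool.

```python
# Table 1 data and the regression coefficients found in Example 1
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
texts = [35, 29, 27, 28, 19, 16, 19, 12, 8, 8]
b1, b0 = -1.5, 60.6

fitted = [b1 * x + b0 for x in ages]                        # y-hat for each age
residuals = [y - y_hat for y, y_hat in zip(texts, fitted)]  # y minus y-hat

print("age  texts  fitted  residual")
for x, y, y_hat, e in zip(ages, texts, fitted, residuals):
    print(f"{x:>3} {y:>6} {y_hat:>7.1f} {e:>9.1f}")

# Least-squares residuals sum to zero apart from rounding. This checks the
# arithmetic only; it says nothing about whether the assumptions hold.
total = sum(residuals)
print("sum of residuals is essentially zero:", abs(total) < 1e-9)
```

Plotting `residuals` against `fitted` gives the scatterplot used to look for the patterns in Figure 4.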

FIGURE 5 Scatterplot of residuals versus fitted values.

FIGURE 6 Normal probability plot of the residuals.

NOW YOU CAN DO

Exercises 7–14.

Page 721

YOUR TURN#2

For the data in the Your Turn #1 on page 717, do the following:

Calculate the residuals $y - \hat{y}$ for the regression of score on age.
Verify the regression assumptions.

(The solutions are shown in Appendix A.)

Once the regression assumptions have been verified, we may (a) perform hypothesis tests, and (b) construct confidence intervals for the population slope $β_{1}$ .

2 Hypothesis Tests for Slope $β_{1}$

Suppose for a moment that, for the population regression equation $y = β_{1} x + β_{0} + ε$ , the slope $β_{1}$ equals zero. Then the population regression equation would be

$y = (0) x + β_{0} + ε = β_{0} + ε$

That is,

If $β_{1}$ equals zero, then no relationship exists between $x$ and $y$ , because changing $x$ in the equation $y = β_{0} + ε$ does not affect $y$ .
If $β_{1}$ equals any other value, then a linear relationship does exist between $x$ and $y$ .

This idea forms the basis for our inference in this section. To test whether a relationship exists between $x$ and $y$ , we begin with the hypothesis test to determine whether or not $β_{1}$ equals 0. The hypotheses are

$H_{0} : β_{1} = 0$ No linear relationship exists between $x$ and $y$ .
$H_{a} : β_{1} \neq 0$ A linear relationship exists between $x$ and $y$ .

Assuming $H_{0} : β_{1} = 0$ is true, the test statistic $t_{data}$ for this hypothesis test takes the following form.

Test Statistic $t_{data}$

$t_{data} = \frac{b_{1} - β_{1}}{s / \sqrt{\sum {(x - \bar{x})}^{2}}} = \frac{b_{1} - 0}{s / \sqrt{\sum {(x - \bar{x})}^{2}}} = \frac{b_{1}}{s / \sqrt{\sum {(x - \bar{x})}^{2}}}$

where $b_{1}$ represents the slope of the regression line, $s = \sqrt{\frac{SSE}{n - 2}}$ represents the standard error of the estimate (from Section 4.3), and $\sqrt{\sum {(x - \bar{x})}^{2}}$ is related to the sample variance of the $x$ data (see page 229 in Section 4.3).

Here, $s$ refers to the standard error of the estimate, not the sample standard deviation.

$t_{data}$ consists of three quantities: $b_{1}$ , $s$ , and $\sqrt{\sum {(x - \bar{x})}^{2}}$ . The next example shows how to calculate $t_{data}$ by finding these three quantities.

EXAMPLE 3 Calculating $t_{data}$

Use the following steps to calculate the test statistic $t_{data} = \frac{b_{1}}{s / \sqrt{\sum {(x - \bar{x})}^{2}}}$ for the data in Table 2:

Find $b_{1}$ , the slope of the regression line.
Calculate $s$ , the standard error of the estimate.
Compute $\sqrt{\sum {(x - \bar{x})}^{2}}$ , the square root of the numerator of the sample variance of the $x$ data.

Page 722

Solution

All calculations up to the final result are expressed to nine decimal places.

From Example 1, the slope of the regression line is $b_{1} = - 1.5$ .

Recall from Section 4.3 (page 228) that

$s = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum {(y - \hat{y})}^{2}}{n - 2}} = \sqrt{\frac{\sum {(residual)}^{2}}{n - 2}}$

is the standard error of the estimate. Squaring each residual from Table 2 gives us the squared residuals in Table 3, and the sum of squared residuals, or sum of squares error, equal to

$SSE = \sum {(y - \hat{y})}^{2} = 46.4$

Then the standard error of the estimate is

$s = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{46.4}{8}} \approx 2.408318916.$

Table 3 Calculating SSE

Residuals $y - \hat{y}$	Squared residuals ${(y - \hat{y})}^{2}$
1.4	1.96
−1.6	2.56
−0.6	0.36
3.4	11.56
−2.6	6.76
−2.6	6.76
3.4	11.56
−0.6	0.36
−1.6	2.56
1.4	1.96
	$Sum = 46.4$

To compute $\sum {(x - \bar{x})}^{2}$ , we note from page 110 in Chapter 3 that the sample variance of $x$ is

$s_{x}^{2} = \frac{\sum {(x - \bar{x})}^{2}}{n - 1}$

Multiplying each side of the equation by $n - 1$ , we obtain an equation for the quantity $\sum {(x - \bar{x})}^{2}$ :

$\sum {(x - \bar{x})}^{2} = (n - 1) \cdot s_{x}^{2}$

The TI-83/84 output from Figure 7 shows that $s_{x} = 6.055300708$ , and, because $n = 10$ ,

$\sum {(x - \bar{x})}^{2} = (n - 1) \cdot s_{x}^{2} = (9) {(6.055300708)}^{2} = 330$

Therefore,

$t_{data} = \frac{b_{1}}{s / \sqrt{\sum {(x - \bar{x})}^{2}}} = \frac{- 1.5}{2.408318916 / \sqrt{330}} \approx - 11.3$
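The three quantities in this example can be computed directly from the data. The following minimal sketch finds $\sum {(x - \bar{x})}^{2}$ from its definition rather than from the calculator's $s_{x}$ ; the two routes are equivalent, and the variable names are our own.

```python
from math import sqrt

# Table 1 data: age (x) and number of text messages (y)
ages = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
texts = [35, 29, 27, 28, 19, 16, 19, 12, 8, 8]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(texts) / n

sxx = sum((x - x_bar) ** 2 for x in ages)   # sum of (x - x_bar)^2
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, texts))

b1 = sxy / sxx            # slope of the regression line
b0 = y_bar - b1 * x_bar   # y intercept

# SSE is the sum of squared residuals; s is the standard error of the estimate
sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(ages, texts))
s = sqrt(sse / (n - 2))

t_data = b1 / (s / sqrt(sxx))

print("b1 =", round(b1, 4))
print("SSE =", round(sse, 4), " s =", round(s, 9))
print("sum of squared x deviations =", round(sxx, 4))
print("t_data =", round(t_data, 1))
```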

FIGURE 7 Summary statistics for the $x$ (age) data.

Now that we have $t_{data}$ , we can perform the hypothesis test for the slope $β_{1}$ , as the next example shows using the critical-value method.

Page 723

EXAMPLE 4 Hypothesis test for slope $β_{1}$ using the critical-value method

Test whether a linear relationship exists between age and text messages, using the data from Table 1 at level of significance $α = 0.01$ .

Solution

The regression assumptions were shown to be valid in Example 2. We may thus proceed with the hypothesis test.

Step 1 State the hypotheses.
- $H_{0} : β_{1} = 0$ No linear relationship exists between age and text messages.
- $H_{a} : β_{1} \neq 0$ A linear relationship exists between age and text messages.
Step 2 Find the $t$ critical value $t_{crit}$ and the rejection rule. To find $t_{crit}$ , use the $t$ distribution table (Table D in the Appendix) for a two-tailed test and degrees of freedom $df = n - 2$ . The rejection rule for this two-tailed test is

Reject $H_{0}$ if $t_{data} \geq t_{crit}$ or $t_{data} \leq - t_{crit}$

Here, $n = 10$ , so $df = 8$ . For level of significance $α = 0.01$ , the $t$ table gives us $t_{crit} = 3.355$ . We will reject $H_{0}$ if $t_{data} \geq 3.355$ or $t_{data} \leq - 3.355$ .
Step 3 Calculate $t_{data}$ . From Example 3, we have

$t_{data} = \frac{b_{1}}{s / \sqrt{\sum {(x - \bar{x})}^{2}}} \approx - 11.3$
Step 4 State the conclusion and the interpretation. Because $t_{data} \approx - 11.3 \leq - 3.355$ , we reject $H_{0}$ . There is evidence, at level of significance $α = 0.01$ , that $β_{1} \neq 0$ and that a linear relationship exists between age and text messages.
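If a $t$ table is not at hand, the critical value can be approximated numerically. The sketch below is one possible approach, not the textbook's method: it integrates the $t$ density with Simpson's rule and then searches by bisection for the value whose upper-tail area is $α / 2$ . The function names are our own, and the result should match the table value 3.355 to about three decimal places.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3   # area right of 0 is 0.5 by symmetry


def t_critical(alpha, df):
    """Two-tailed critical value: the t whose upper-tail area is alpha / 2."""
    lo, hi = 0.0, 100.0
    for _ in range(50):          # bisection
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha / 2:
            lo = mid             # tail area still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2


alpha, n = 0.01, 10
t_crit = t_critical(alpha, df=n - 2)
t_data = -11.3                   # from Example 3

reject_h0 = t_data >= t_crit or t_data <= -t_crit
print("t_crit =", round(t_crit, 3))
print("reject H0:", reject_h0)
```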

NOW YOU CAN DO

Exercises 15–18.

The next example illustrates the steps for performing the hypothesis test for the slope $β_{1}$ using the $p$ -value method.

EXAMPLE 5 Hypothesis test for the slope $β_{1}$ using the $p$ -value method and technology

shortmemory

Table 4 Time and score for the short-term memory study

Time $(x)$	Score $(y)$
1	9
1	10
2	11
3	12
3	13
4	14
5	19
6	17
7	21
8	24

In Section 4.3, we considered a study on short-term memory. Ten subjects were given a set of nonsense words to memorize within a certain amount of time and were later scored on the number of words they could remember. The results are repeated here in Table 4. Use the $p$ -value method and technology to test, using level of significance $α = 0.01$ , whether a linear relationship exists between time and score.

Solution

We begin by verifying the regression assumptions. The scatterplot of the residuals versus the fitted values in Figure 8 shows no strong evidence that the independence assumption, the constant variance assumption, or the zero-mean assumption is violated. Also, the normal probability plot of the residuals in Figure 9 indicates no evidence of departures from normality in the residuals. Therefore, we conclude that the regression assumptions are verified, and proceed with the hypothesis test.

Step 1 State the hypotheses and the rejection rule.
- $H_{0} : β_{1} = 0$ No linear relationship exists between time and score.
- $H_{a} : β_{1} \neq 0$ A linear relationship exists between time and score.
Page 724

FIGURE 8 Residuals versus fitted values plot.

FIGURE 9 Normal probability plot of the residuals.

The rejection rule is: reject $H_{0}$ if the $p$ -value $\leq 0.01$ .
Step 2 Calculate $t_{data}$ .

$t_{data} = \frac{b_{1}}{s / \sqrt{\sum {(x - \bar{x})}^{2}}}$

From page 226 in Section 4.3, we have $b_{1} = 2$ . From Example 13 in Chapter 4 on page 228, we have

$s = \sqrt{\frac{12}{8}} \approx 1.224744871$

From the TI-83/84 summary statistics, we have the standard deviation of the $x$ (time) data to be $s_{x} = 2.449489743$ . Thus, using the relationship we learned in Example 3:

$\sum {(x - \bar{x})}^{2} = (n - 1) \cdot s_{x}^{2} = (9) {(2.449489743)}^{2} = 54$

Therefore,

$t_{data} = \frac{b_{1}}{s / \sqrt{\sum {(x - \bar{x})}^{2}}} \approx \frac{2}{1.224744871 / \sqrt{54}} = 12$

TI-83/84 summary statistics for $x$ (time) data.
Step 3 Find the $p$ -value. For instructions, see the Step-by-Step Technology Guide on page 730. The regression results (including the $p$ -value) for the TI-83/84, Excel, Minitab, and CrunchIt! are shown in Figures 10, 11, 12, and 13. (Differing results are due to rounding.)

FIGURE 10 TI-83/84 regression results.

Page 725

FIGURE 11 Excel regression result.

FIGURE 12 Minitab regression results.

FIGURE 13 CrunchIt! regression results.
Step 4 The $p$ -value of about 0.000 is $\leq α = 0.01$ , so we reject $H_{0}$ . Evidence exists, at level of significance $α = 0.01$ , for a linear relationship between time and score.
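The software figures report the $p$ -value directly. For readers who want to see where such a number comes from, here is a rough sketch that computes $t_{data}$ from Table 4 and then approximates the two-tailed $p$ -value by numerically integrating the $t$ density with 8 degrees of freedom. The helper names are ours, and the numerical integration is only an approximation of what statistical software does.

```python
from math import gamma, pi, sqrt

# Table 4 data: memorization time (x) and score (y)
times = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]
scores = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]

n = len(times)
x_bar = sum(times) / n
y_bar = sum(scores) / n
sxx = sum((x - x_bar) ** 2 for x in times)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(times, scores))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(times, scores))
s = sqrt(sse / (n - 2))
t_data = b1 / (s / sqrt(sxx))


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


alpha = 0.01
p_value = 2 * t_upper_tail(abs(t_data), n - 2)   # two-tailed test
reject_h0 = p_value <= alpha

print("t_data =", round(t_data, 4))
print(f"p-value is roughly {p_value:.1e}")
print("reject H0:", reject_h0)
```

The $p$ -value is far below 0.01, which is why the software displays it as 0.000 or in scientific notation.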

NOW YOU CAN DO

Exercises 19–22.

YOUR TURN#3

Recall the age and score data from the Your Turn #1 on page 717. Test, using level of significance $α = 0.05$ , whether a linear relationship exists between age and score.

(The solution is shown in Appendix A.)

Page 726

3 Confidence Interval for Slope $β_{1}$

Recall that in Chapter 8 we constructed a confidence interval estimate for a population parameter, consisting of an interval of numbers that contains the parameter with a certain confidence level. Similarly, we can construct a confidence interval for the slope $β_{1}$ of the population regression equation.

Confidence Interval for $β_{1}$

When the regression assumptions are met, a $100 (1 - α) %$ confidence interval for $β_{1}$ is given by

$b_{1} \pm t_{α / 2} \cdot \frac{s}{\sqrt{\sum {(x - \bar{x})}^{2}}}$

where $b_{1}$ is the point estimate of the slope $β_{1}$ of the population regression equation, $s$ is the standard error of the estimate, and $t_{α / 2}$ has $n - 2$ degrees of freedom.

Margin of Error $E$

The margin of error for a $100 (1 - α) %$ confidence interval for $β_{1}$ is given by

$E = t_{α / 2} \cdot \frac{s}{\sqrt{\sum {(x - \bar{x})}^{2}}}$

Thus, the confidence interval for $β_{1}$ takes the form $b_{1} \pm E$ .

EXAMPLE 6 Confidence interval for the slope $β_{1}$

Construct a 95% confidence interval for the slope $β_{1}$ of the population regression equation for the memory-test data in Example 5.

Solution

The regression assumptions were verified in Example 5, where we found:

$b_{1} = 2$ ,
$s = 1.224744871$ , and
$\sum {(x - \bar{x})}^{2} = 54$ .

From the $t$ table (Appendix Table D), we find that, for 95% confidence, $t_{α / 2}$ for $n - 2 = 10 - 2 = 8$ degrees of freedom is $t_{α / 2} = 2.306$ . So, our margin of error $E$ is

$E = t_{α / 2} \cdot \frac{s}{\sqrt{\sum {(x - \bar{x})}^{2}}} = (2.306) (\frac{1.224744871}{\sqrt{54}}) \approx 0.3843$

The 95% confidence interval for $β_{1}$ is then given by

$b_{1} \pm E = 2 \pm 0.3843 = (1.6157, 2.3843)$
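The arithmetic for this interval is short enough to check in code. In the minimal sketch below, the value `t_half_alpha = 2.306` is copied from the $t$ table rather than computed, and the other inputs come from Example 5.

```python
from math import sqrt

b1 = 2.0               # slope from Example 5
s = sqrt(12 / 8)       # standard error of the estimate from Example 5
sxx = 54.0             # sum of (x - x_bar)^2 from Example 5
t_half_alpha = 2.306   # t table value for 95% confidence, df = 8

margin = t_half_alpha * s / sqrt(sxx)   # margin of error E
lower, upper = b1 - margin, b1 + margin

print("E =", round(margin, 4))
print("95% CI for the slope:", (round(lower, 4), round(upper, 4)))
```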

NOW YOU CAN DO

Exercises 23–30.

What Do These Numbers Mean?

The margin of error $E = 0.3843$ means that, when we repeatedly take samples from this population, most of the time the sample estimate $b_{1}$ will be within $E = 0.3843$ of the unknown value of the slope $β_{1}$ of the population regression line.
We are 95% confident that the interval $(1.6157, 2.3843)$ captures the slope $β_{1}$ of the population regression line.
Because $β_{1}$ is the increase in memory-test score per added minute of memorization, we are 95% confident that, for each additional minute of memorization, the increase in memory-test score will lie between 1.6157 and 2.3843 points.

Page 727

4 Using Confidence Intervals to Perform the $t$ Test for the Slope $β_{1}$

As in earlier sections, we may use a $100 (1 - α) % t$ confidence interval for the slope $β_{1}$ to perform the $t$ test for $β_{1}$ , which is a two-tailed test.

Equivalence of a Two-Tailed $t$ Test About $β_{1}$ and a $t$ Confidence Interval for $β_{1}$

If a $100 (1 - α) % t$ confidence interval for $β_{1}$ does not contain zero, then we would reject $H_{0} : β_{1} = 0$ for level of significance $α$ , and conclude that a linear relationship exists between $x$ and $y$ .
If a $100 (1 - α) % t$ confidence interval for $β_{1}$ does contain zero, then we would not reject $H_{0} : β_{1} = 0$ for level of significance $α$ .

EXAMPLE 7 Using confidence intervals to perform the t test for the slope $β_{1}$

(a) Construct and interpret a 99% confidence interval for the slope $β_{1}$ for the text messaging data in Table 1.
(b) Use the confidence interval in (a) to test whether a linear relationship exists between age and text messages, using level of significance $α = 0.01$ .

textms

Solution

The regression assumptions were verified in Example 2. Also,
- In Example 1, we found $b_{1} = - 1.5$ .
- In Example 3, we calculated $s = 2.408318916$ , and $\sum {(x - \bar{x})}^{2} = 330$ .
From the $t$ table, we find that, for 99% confidence, $t_{α / 2}$ for $n - 2 = 10 - 2 = 8$ degrees of freedom is $t_{α / 2} = 3.355$ . So, our margin of error $E$ is

$E = t_{α / 2} \cdot \frac{s}{\sqrt{\sum {(x - \bar{x})}^{2}}} = (3.355) (\frac{2.408318916}{\sqrt{330}}) \approx 0.4448$

The 99% confidence interval for $β_{1}$ is then given by

$b_{1} \pm E = - 1.5 \pm 0.4448 = (- 1.9448, - 1.0552)$

We are 99% confident that the interval (−1.9448, −1.0552) captures the slope $β_{1}$ of the population regression line. That is, we are 99% confident that, for each additional year of age, the decrease in the number of text messages lies between 1.0552 and 1.9448.
The hypotheses are

$H_{0} : β_{1} = 0$ No linear relationship exists between age and text messages.

$H_{a} : β_{1} \neq 0$ A linear relationship exists between age and text messages.

The confidence interval from (a) does not contain zero, so we may conclude that a linear relationship exists between age and text messages, at level of significance $α = 0.01$ .
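The equivalence between the interval and the two-tailed test can be expressed as a simple check of whether zero lies inside the interval. The sketch below uses two small helper functions whose names are our own; the $t$ value 3.355 is again taken from the $t$ table.

```python
from math import sqrt


def slope_confidence_interval(b1, s, sxx, t_half_alpha):
    """Return (lower, upper) for b1 plus or minus t * s / sqrt(sxx)."""
    margin = t_half_alpha * s / sqrt(sxx)
    return b1 - margin, b1 + margin


def interval_rejects_zero(lower, upper):
    """Reject H0: beta1 = 0 exactly when zero is outside the interval."""
    return not (lower <= 0 <= upper)


# Text messaging data: values from Examples 1 and 3, t from the table (df = 8)
lower, upper = slope_confidence_interval(
    b1=-1.5, s=2.408318916, sxx=330, t_half_alpha=3.355
)
reject_h0 = interval_rejects_zero(lower, upper)

print("99% CI for the slope:", (round(lower, 4), round(upper, 4)))
print("reject H0 at alpha = 0.01:", reject_h0)
```

Because the same $t_{α / 2}$ appears in both procedures, this decision always agrees with the two-tailed test at the matching level of significance.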

NOW YOU CAN DO

Exercises 31–38.

Page 728

Table 5 Frequency in English, frequency in Scrabble, and Scrabble point value of the letters in the alphabet

Letter	Rel. freq. in English language	Frequency in Scrabble	Point value in Scrabble	Letter	Rel. freq. in English language	Frequency in Scrabble	Point value in Scrabble
A	0.073	9	1	N	0.078	6	1
B	0.009	2	3	O	0.074	8	1
C	0.030	2	3	P	0.027	2	3
D	0.044	4	2	Q	0.003	1	10
E	0.130	12	1	R	0.077	6	1
F	0.028	2	4	S	0.063	4	1
G	0.016	3	2	T	0.093	6	1
H	0.035	2	4	U	0.027	4	1
I	0.074	9	1	V	0.013	2	4
J	0.002	1	8	W	0.016	2	4
K	0.003	1	5	X	0.005	1	8
L	0.035	4	1	Y	0.019	2	4
M	0.025	2	3	Z	0.001	1	10

How Fair Is the Scoring in Scrabble?

scrabble

In this Case Study, we consider the frequency and point values of Scrabble tiles. Table 5 shows the relative frequency in the English language, the frequency (number of tiles) in Scrabble, and the point value in Scrabble.

First of all, what is the relationship between the tile frequencies in Scrabble and the letter frequencies in the English language? Figure 14 shows a scatterplot of the tile frequencies in Scrabble against the letter frequencies in the English language. A positive relationship appears to exist between the two variables. That is, as the English frequencies increase, game frequencies also tend to increase.

FIGURE 14 Scatterplot of Scrabble frequency versus English relative frequency of letters, with regression line.

Note that the letters above the regression line occur “too frequently” in the game, whereas the letters below the line occur “not frequently enough.” Playing typical English words during a game of Scrabble would tend to leave you with a rack of letters similar to those above the regression line. Note that S is one of the letters that is rarer in the game than in the language.

Figure 15 displays the Minitab results from a regression of the tile frequencies against the English language relative frequencies. The regression equation is

Page 729

$\hat{y} = 81.5 (\text{relative frequency in English}) + 0.636$

The slope is positive, which concurs with the scatterplot in Figure 14.

FIGURE 15 Minitab regression output.

Next, we turn to the hypothesis test:

$H_{0} : β_{1} = 0$ No linear relationship exists between Scrabble frequency and English relative frequency.
$H_{a} : β_{1} \neq 0$ A linear relationship exists between Scrabble frequency and English relative frequency.

The 0.000 in red represents the $p$ -value for the $t$ test. This $p$ -value is smaller than any $α$ , so we reject the null hypothesis that no linear relationship exists between the game frequencies and the English frequencies. Does the model fit the data well? The coefficient of determination $r^{2}$ is 0.855, which is good, and the correlation coefficient $r$ equals 0.924, which indicates that the variables are positively correlated.

But the fit really could be better. Look at the value of $s$ , the standard error of the estimate: $s \approx 1.16$ . This means that, given the English language frequency of a letter, the estimate of the tile frequency will typically differ from the actual tile frequency by more than one tile.

Next, what is the relationship between the Scrabble point values and the English relative frequencies? Figure 16 shows a scatterplot of these two variables. The first thing you might notice about this relationship is that it is not linear. Therefore, it would not be appropriate to perform linear regression on this data set.

FIGURE 16 Scatterplot of Scrabble point value versus English relative frequency. Linear regression would not be appropriate.

Page 730

We can, nevertheless, make some descriptive remarks.

What is a “good” Scrabble tile to pick up? In the best case, it would be a letter with high English frequency worth lots of Scrabble game points. Unfortunately, the two do not go together. The high-frequency letters such as E and T have low point values, and the high point-value letters such as Q and Z have low frequencies. But we can still make comparisons.
Which would you rather pick up, a D or a G? It would seem that D would be preferable because it has the same point value as G with much higher English frequency.
Which do you prefer between J and X? They are worth the same points, but X has a higher frequency in English, so it is easier to make words with it in the game. The letter H seems to have a good combination of high points and moderate frequency.