Simple linear regression studies the relationship between a response variable $y$ and an explanatory variable $x$. We expect that different values of $x$ are associated with different mean responses for $y$. We encountered a similar but simpler situation in Chapter 7 when we discussed methods for comparing two population means. Figure 10.1 illustrates a statistical model for comparing the items per hour entered by two groups of financial clerks using new data entry software. Group 2 received some training in the software while Group 1 did not. Entries per hour is the response variable. The treatment (training or not) is the explanatory variable. The model has two important parts:

- The mean entries per hour may differ in the two groups.
- Individual clerks' entries per hour vary within each group according to a Normal distribution, and the two Normal curves have the same spread.
Statistical model for simple linear regression
Now imagine giving different lengths of training to different groups of subjects. We can think of these groups as belonging to subpopulations, one for each possible value of $x$. Each subpopulation consists of all individuals in the population having the same value of $x$. If we gave one length of training to some subjects, a second length to others, and a third length to still others, these three groups of subjects would be considered samples from the corresponding three subpopulations.
subpopulation
The statistical model for simple linear regression also assumes that, for each value of $x$, the response variable $y$ is Normally distributed with a mean that depends on $x$. We use $\mu_y$ to represent these means. In general, the means $\mu_y$ can change as $x$ changes according to any sort of pattern. In simple linear regression, we assume that the means all lie on a line when plotted against $x$. To summarize, this model also has two important parts:

- The mean response $\mu_y$ has a straight-line relationship with $x$: $\mu_y = \beta_0 + \beta_1 x$, where the slope $\beta_1$ and intercept $\beta_0$ are unknown parameters.
- For any fixed value of $x$, the responses $y$ vary according to a Normal distribution about their mean $\mu_y$, with the same standard deviation $\sigma$ for all values of $x$.
This statistical model is pictured in Figure 10.2. The line $\mu_y = \beta_0 + \beta_1 x$ describes how the mean response changes with $x$. This is the population regression line. The three Normal curves show how the response $y$ will vary for three different values of the explanatory variable $x$. Each curve is centered at its mean response $\mu_y$. All three curves have the same spread, measured by their common standard deviation $\sigma$.
population regression line
From data analysis to inference
The data for a regression problem are the observed values of $x$ and $y$. The model takes each $x$ to be a fixed known quantity, like the hours of training a worker has received.1 The response $y$ for a given $x$ is a Normal random variable. The model describes the mean and standard deviation of this random variable. This model is not appropriate if there is error in measuring $x$ and that error is large relative to the spread of the $x$'s. In these situations, more advanced inference methods are needed.
We use Case 10.1 to explain the fundamentals of simple linear regression. Because regression calculations in practice are always done by software, we rely on computer output for the arithmetic. Later in the chapter, we show formulas for doing the calculations. These formulas are useful in understanding analysis of variance (see Section 10.3) and multiple regression (see Chapter 11).
CASE 10.1 The Relationship between Income and Education for Entrepreneurs
Numerous studies have shown that better-educated employees have higher incomes. Is this also true for entrepreneurs? Do more years of formal education translate into higher incomes? And if so, is the return for an additional year of education the same for entrepreneurs and employees? One study explored these questions using the National Longitudinal Survey of Youth (NLSY), which followed a large group of individuals aged 14 to 22 for roughly 10 years.2 They looked at both employees and entrepreneurs, but we just focus on entrepreneurs here.
The researchers defined entrepreneurs to be those who were self-employed or who were the owner/director of an incorporated business. For each of these individuals, they recorded the education level and income. The education level (EDUC) was defined as the years of completed schooling prior to starting the business. The income level was the average annual total earnings (INC) since starting the business.
We consider a random sample of 100 entrepreneurs. Figure 10.3 is a scatterplot of the data with a fitted smoothed curve. The explanatory variable is the entrepreneur’s education level. The response variable is the income level.
Let’s briefly review some of the ideas from Chapter 2 regarding least-squares regression. We start with a plot of the data, as in Figure 10.3, to verify that the relationship is approximately linear with no outliers. Always start with a graphical display of the data. There is no point in fitting a linear model if the relationship does not, at least approximately, appear linear. In this case, the distribution of income is skewed to the right (at each education level, there are many small incomes and just a few large incomes). Although the smoothed curve is roughly linear, the curve is being pulled toward the very large incomes, suggesting these observations could be influential.
Reminder
least-squares regression, p. 80
A common remedy for a strongly skewed variable such as income is to consider transforming the variable prior to fitting a model. Here, the researchers considered the natural logarithm of income (LOGINC). Figure 10.4 is a scatterplot of these transformed data with a fitted smoothed curve in black and the least-squares regression line in green. The smoothed curve is almost linear, and the observations in the $y$ direction are more equally dispersed above and below this curve than they are about the curve in Figure 10.3. Also, those four very large incomes no longer appear to be influential. Given these results, we continue our discussion of least-squares regression using the transformed data.
Reminder
log transformation, p. 68
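If you want to try this transformation yourself, the short Python sketch below shows the idea. The data are simulated, since the NLSY sample itself is not reproduced in the text; the variable names EDUC and INC mirror the study's, but the numbers are ours.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data, right-skewed in income like most income samples
rng = np.random.default_rng(1)
educ = rng.integers(8, 21, size=100)                 # years of schooling
inc = np.exp(9.0 + 0.11 * educ + rng.normal(0, 0.9, size=100))

loginc = np.log(inc)                                 # natural log transform

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(educ, inc, s=12)                         # skewed: a few huge incomes
ax1.set(xlabel="EDUC (years)", ylabel="INC")
ax2.scatter(educ, loginc, s=12)                      # more even vertical spread
ax2.set(xlabel="EDUC (years)", ylabel="LOGINC")
plt.show()
```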
EXAMPLE 10.1 Prediction of Log Income from Education Level
CASE 10.1 The green line in Figure 10.4 is the least-squares regression line for predicting log income from years of formal schooling. The equation of this line is

$$\widehat{\text{LOGINC}} = b_0 + b_1 \times \text{EDUC}$$

where the numerical estimates $b_0$ and $b_1$ are read from the software output shown in Figure 10.5 (page 492).
Reminder
residuals, p. 88
We can use the least-squares regression equation to find the predicted log income $\hat{y}$ corresponding to a given value of EDUC. The difference between the observed and predicted value is the residual. For example, Entrepreneur 4 has 15 years of formal schooling and a log income of 10.2274. We predict that this person will have a log income of

$$\hat{y} = b_0 + b_1 \times 15$$

so the residual is

$$e = \text{observed } y - \hat{y} = 10.2274 - \hat{y}$$
Recall that the least-squares line is the line that minimizes the sum of the squares of the residuals. The least-squares regression line also always passes through the point $(\bar{x}, \bar{y})$. These are helpful facts to remember when considering the fit of this line to a data set.
In Section 2.2 (pages 74–77), we discussed the correlation as a measure of association between two quantitative variables. In Section 2.3, we learned to interpret the square of the correlation, $r^2$, as the fraction of the variation in $y$ that is explained by $x$ in a simple linear regression.
Reminder
interpretation of $r^2$, p. 87
EXAMPLE 10.2 Correlation between Log Income and Education Level
CASE 10.1 For Case 10.1, the correlation between LOGINC and EDUC is $r = 0.24$. Because the squared correlation is $r^2 = 0.057$, the change in log income along the regression line as years of education increases explains only 5.7% of the variation. The remaining 94.3% is due to other differences among these entrepreneurs. The entrepreneurs in this sample live in different parts of the United States; some are single and others are married; and some may have had a difficult upbringing. All these factors could be associated with log income and thus add to the variability if not included in the model.
Apply Your Knowledge
10.1 Predict the log income.
In Case 10.1, Entrepreneur 3 has an EDUC of 14 years and a log income of 10.9475. Using the least-squares regression equation in Example 10.1, find the predicted log income and the residual for this individual.
10.2 Understanding a linear regression model.
Consider a linear regression model with mean response $\mu_y = \beta_0 + \beta_1 x$ and standard deviation $\sigma$.
Having reviewed the basics of least-squares regression, we are now ready to proceed with a discussion of inference for regression. Here’s what is new in this chapter:
Our statistical model assumes that the responses $y$ are Normally distributed with a mean $\mu_y$ that depends upon $x$ in a linear way. Specifically, the population regression line

$$\mu_y = \beta_0 + \beta_1 x$$

describes the relationship between the mean log income $\mu_y$ and the number of years of formal education $x$ in the population. The slope $\beta_1$ is the mean increase in log income for each additional year of education. The intercept $\beta_0$ is the mean log income when an entrepreneur has $x = 0$ years of formal education. This parameter, by itself, is not meaningful in this example because $x = 0$ years of education would be extremely rare.
Because the means $\mu_y$ lie on the line $\mu_y = \beta_0 + \beta_1 x$, they are all determined by $\beta_0$ and $\beta_1$. Thus, once we have estimates of $\beta_0$ and $\beta_1$, the linear relationship determines the estimates of $\mu_y$ for all values of $x$. Linear regression allows us to do inference not only for subpopulations for which we have data, but also for those corresponding to $x$'s not present in the data. These $x$-values can be both within and outside the range of observed $x$'s. However, extreme caution must be taken when performing inference for an $x$-value outside the range of the observed $x$'s because there is no assurance that the same linear relationship between $\mu_y$ and $x$ holds.
We cannot observe the population regression line because the observed responses $y$ vary about their means. In Figure 10.4 we see the least-squares regression line that describes the overall pattern of the data, along with the scatter of individual points about this line. The statistical model for linear regression makes the same distinction. This was displayed in Figure 10.2 with the line $\mu_y = \beta_0 + \beta_1 x$ and three Normal curves. The population regression line describes the on-the-average relationship, and the Normal curves describe the variability in $y$ for each value of $x$.
Think of the model in the form

$$\text{DATA} = \text{FIT} + \text{RESIDUAL}$$

The FIT part of the model consists of the subpopulation means, given by the expression $\beta_0 + \beta_1 x$. The RESIDUAL part represents deviations of the data from the line of population means. The model assumes that these deviations are Normally distributed with standard deviation $\sigma$. We use $\epsilon$ (the lowercase Greek letter epsilon) to stand for the RESIDUAL part of the statistical model. A response $y$ is the sum of its mean and a chance deviation $\epsilon$ from the mean. The deviations $\epsilon$ represent "noise," variation in $y$ due to other causes that prevents the observed $(x, y)$-values from forming a perfectly straight line on the scatterplot.
Simple Linear Regression Model
Given $n$ observations of the explanatory variable $x$ and the response variable $y$,

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

the statistical model for simple linear regression states that the observed response $y_i$ when the explanatory variable takes the value $x_i$ is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

Here, $\mu_y = \beta_0 + \beta_1 x_i$ is the mean response when $x = x_i$. The deviations $\epsilon_i$ are independent and Normally distributed with mean 0 and standard deviation $\sigma$.

The parameters of the model are $\beta_0$, $\beta_1$, and $\sigma$.
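One way to internalize this box is to read it as a recipe for generating data. The Python sketch below does exactly that; the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, and $\sigma = 1$ are arbitrary choices for illustration, not values from any example in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 2.0, 0.5, 1.0     # population parameters (illustrative)
x = np.repeat([1.0, 2.0, 3.0], 50)      # fixed, known values of the explanatory variable

# DATA = FIT + RESIDUAL: mean response plus Normal(0, sigma) deviations
eps = rng.normal(0.0, sigma, size=x.size)
y = beta0 + beta1 * x + eps

# Within each subpopulation (fixed x), y varies Normally about its mean
for xv in np.unique(x):
    sub = y[x == xv]
    print(f"x={xv}: mean near {beta0 + beta1 * xv:.2f}, observed {sub.mean():.2f}; "
          f"sd near {sigma:.2f}, observed {sub.std(ddof=1):.2f}")
```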
The simple linear regression model can be justified in a wide variety of circumstances. Sometimes, we observe the values of two variables, and we formulate a model with one of these as the response variable and the other as the explanatory variable. This was the setting for Case 10.1, where the response variable was log income and the explanatory variable was the number of years of formal education. In other settings, the values of the explanatory variable are chosen by the persons designing the study. The scenario illustrated by Figure 10.2 is an example of this setting. Here, the explanatory variable is training time, which is set at a few carefully selected values. The response variable is the number of entries per hour.
For the simple linear regression model to be valid, one essential assumption is that the relationship between the means of the response variable for the different values of the explanatory variable is approximately linear. This is the FIT part of the model. Another essential assumption concerns the RESIDUAL part of the model. It states that the deviations $\epsilon_i$ behave like an SRS from a Normal distribution with mean zero and standard deviation $\sigma$. If the data are collected through some sort of random sampling, this assumption is often easy to justify. This is the case in our two scenarios, in which both variables are observed in a random sample from a population or the response variable is measured at predetermined values of the explanatory variable.
In many other settings, particularly in business applications, we analyze all of the data available and there is no random sampling. Here, we often justify the use of inference for simple linear regression by viewing the data as coming from some sort of process. The line gives a good description of the relationship, the fit, and we model the deviations from the fit, the residuals, as coming from a Normal distribution.
EXAMPLE 10.3 Retail Sales and Floor Space
It is customary in retail operations to assess the performance of stores partly in terms of their annual sales relative to their floor area (square feet). We might expect sales to increase linearly as stores get larger, with, of course, individual variation among stores of the same size. The regression model for a population of stores says that

$$\text{SALES} = \beta_0 + \beta_1 \times \text{AREA} + \epsilon$$

The slope $\beta_1$ is, as usual, a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space. The intercept $\beta_0$ is needed to describe the line but has no statistical importance because no stores have area close to zero. Floor space does not completely determine sales. The term $\epsilon$ in the model accounts for differences among individual stores with the same floor space. A store's location, for example, could be important but is not included in the FIT part of the model. In Chapter 11, we consider moving variables like this out of the RESIDUAL part of the model by allowing more than one explanatory variable in the FIT part.
Apply Your Knowledge
10.3 U.S. versus overseas stock returns.
Returns on common stocks in the United States and overseas appear to be growing more closely correlated as economies become more interdependent. Suppose that the following population regression line connects the total annual returns (in percent) on two indexes of stock prices:

$$\text{mean overseas return} = -0.1 + 0.15 \times \text{U.S. return}$$

(a) What is the mean overseas return when the U.S. market is flat (0% return)? (b) What does the slope tell us? (c) What part of the model allows overseas returns to vary when the U.S. return remains the same?
10.3
(a) −0.1. When the U.S. market is flat, the mean overseas return will be −0.1%. (b) 0.15. For each unit increase in U.S. return, the mean overseas return will increase by 0.15. (c) $\epsilon$. The term $\epsilon$ allows overseas returns to vary when U.S. returns remain the same.
10.4 Fixed and variable costs.
In some mass production settings, there is a linear relationship between the number of units of a product in a production run and the total cost of making these units.
Estimating the regression parameters
The method of least squares presented in Chapter 2 fits the least-squares line to summarize a relationship between the observed values of an explanatory variable and a response variable. Now we want to use this line as a basis for inference about a population from which our observations are a sample. We can do this only when the statistical model for regression is reasonable. In that setting, the slope $b_1$ and intercept $b_0$ of the least-squares line

$$\hat{y} = b_0 + b_1 x$$

estimate the slope $\beta_1$ and the intercept $\beta_0$ of the population regression line.
Recalling the formulas from Chapter 2, the slope of the least-squares line is

$$b_1 = r \frac{s_y}{s_x}$$

and the intercept is

$$b_0 = \bar{y} - b_1 \bar{x}$$

Here, $r$ is the correlation between the observed values of $y$ and $x$, $s_y$ is the standard deviation of the sample of $y$'s, and $s_x$ is the standard deviation of the sample of $x$'s. Notice that if the estimated slope $b_1$ is 0, so is the correlation, and vice versa. We discuss this relationship more later in this chapter.
Reminder
correlation, p. 74
The remaining parameter to be estimated is $\sigma$, which measures the variation of $y$ about the population regression line. More precisely, $\sigma$ is the standard deviation of the Normal distribution of the deviations $\epsilon_i$ in the regression model. However, we don't observe these $\epsilon_i$, so how can we estimate $\sigma$?
Recall that the vertical deviations of the points in a scatterplot from the fitted regression line are the residuals. We use $e_i$ for the residual of the $i$th observation:

$$e_i = \text{observed response} - \text{predicted response} = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i$$
Reminder
residuals, p. 88
The residuals $e_i$ are the observable quantities that correspond to the unobservable model deviations $\epsilon_i$. The $e_i$ sum to 0, and the $\epsilon_i$ come from a population with mean 0. Because we do not observe the $\epsilon_i$, we use the residuals to estimate $\sigma$ and to check the model assumptions about the $\epsilon_i$.
To estimate $\sigma$, we work first with the variance and take the square root to obtain the standard deviation. For simple linear regression, the estimate of $\sigma^2$ is the average squared residual

$$s^2 = \frac{1}{n-2} \sum e_i^2 = \frac{1}{n-2} \sum (y_i - \hat{y}_i)^2$$

We average by dividing the sum by $n - 2$ in order to make $s^2$ an unbiased estimator of $\sigma^2$. The sample variance of $n$ observations uses the divisor $n - 1$ for the same reason. The $n$ residuals are not $n$ separate quantities: when any $n - 2$ of the residuals are known, we can find the other two. The quantity $n - 2$ is the degrees of freedom of $s^2$. The estimate of the model standard deviation $\sigma$ is given by

$$s = \sqrt{s^2}$$
Reminder
sample variance, p. 31
model standard deviation
We call $s$ the regression standard error.
Estimating the Regression Parameters
In the simple linear regression setting, we use the slope $b_1$ and intercept $b_0$ of the least-squares regression line to estimate the slope $\beta_1$ and intercept $\beta_0$ of the population regression line.

The standard deviation $\sigma$ in the model is estimated by the regression standard error

$$s = \sqrt{\frac{1}{n-2} \sum (y_i - \hat{y}_i)^2}$$
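The formulas in this box are simple enough that what the software computes can be sketched in a few lines. Here is a minimal Python version; the toy data are invented for illustration and are not the Case 10.1 data.

```python
import numpy as np

# Toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.4, 4.5, 4.6, 5.9])
n = len(x)

r = np.corrcoef(x, y)[0, 1]              # sample correlation
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1 * xbar

resid = y - (b0 + b1 * x)                # residuals e_i
s = np.sqrt(np.sum(resid**2) / (n - 2))  # regression standard error, n - 2 df

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, s = {s:.4f}")
```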
In practice, we use software to calculate $b_0$, $b_1$, and $s$ from data on $x$ and $y$. Here are the results for the income example of Case 10.1.
EXAMPLE 10.4 Log Income and Years of Education
CASE 10.1 Figure 10.5 displays Excel output for the regression of log income (LOGINC) on years of education (EDUC) for our sample of 100 entrepreneurs in the United States. In this output, we find the correlation and the squared correlation that we used in Example 10.2, along with the intercept and slope of the least-squares line. The regression standard error $s$ is labeled simply "Standard Error."
The three parameter estimates, the intercept $b_0$, the slope $b_1$, and the regression standard error $s$, are read directly from the output in Figure 10.5. After rounding, they give the fitted regression line

$$\widehat{\text{LOGINC}} = b_0 + b_1 \times \text{EDUC}$$
As usual, we ignore the parts of the output that we do not yet need. We will return to the output for additional information later.
Figure 10.6 shows the regression output from two other software packages. Although the formats differ, you should be able to find the results you need. Once you know what to look for, you can understand statistical output from almost any software.
Apply Your Knowledge
10.5 Research and development spending.
The National Science Foundation collects data on the research and development spending by universities and colleges in the United States.3 Here are the data for the years 2008–2011:
| Year | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|
| Spending (billions of dollars) | 51.9 | 54.9 | 58.4 | 62.0 |
(Comment: These are time series data. Simple regression is often a good fit to time series data over a limited span of time. See Chapter 13 for methods designed specifically for use with time series.)
10.5
(a) The spending is increasing linearly over time. (b) $\hat{y} = -6735.31 + 3.38x$, where $x$ is the year.
(c) The residuals are 0.17, −0.21, −0.09, 0.13; they sum to 0.
(d) $\mu_y = \beta_0 + \beta_1 x$. The estimate for $\beta_0$ is −6735.31, the estimate for $\beta_1$ is 3.38, and the estimate for $\sigma$ is $s = 0.22$. (e) 68.63 billion dollars.
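Because all four data points appear in the table above, the numbers in this answer can be checked with a short script. A minimal Python check:

```python
import numpy as np

year = np.array([2008, 2009, 2010, 2011], dtype=float)
spend = np.array([51.9, 54.9, 58.4, 62.0])

# Least-squares slope and intercept from the textbook formulas
b1 = np.sum((year - year.mean()) * (spend - spend.mean())) / np.sum((year - year.mean())**2)
b0 = spend.mean() - b1 * year.mean()
resid = spend - (b0 + b1 * year)
s = np.sqrt(np.sum(resid**2) / (len(year) - 2))

print(round(b1, 2))               # 3.38
print(round(b0, 2))               # -6735.31
print(np.round(resid, 2))         # [ 0.17 -0.21 -0.09  0.13]
print(round(s, 2))                # 0.22
print(round(b0 + b1 * 2013, 2))   # 68.63, the prediction for 2013
```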
Conditions for regression inference
You can fit a least-squares line to any set of explanatory-response data when both variables are quantitative. The simple linear regression model, which is the basis for inference, imposes several conditions on this fit. We should always verify these conditions before proceeding to inference. There is no point in trying to do statistical inference if the data do not, at least approximately, meet the conditions that are the foundation for the inference.
The conditions concern the population, but we can observe only our sample. Thus, in doing inference, we act as if the sample is an SRS from the population. For the study described in Case 10.1, the researchers used a national survey. Participants were chosen to be a representative sample of the United States, so we can treat this sample as an SRS. The potential for bias should always be considered, especially when obtaining volunteers.
The next condition is that there is a linear relationship in the population, described by the population regression line. We can't observe the population line, so we check this condition by asking if the sample data show a roughly linear pattern in a scatterplot. We also check for any outliers or influential observations that could affect the least-squares fit. The model also says that the standard deviation of the responses about the population line is the same for all values of the explanatory variable. In practice, the spread of observations above and below the least-squares line should be roughly the same as $x$ varies.
Reminder
outliers and influential observations, p. 94
Plotting the residuals against the explanatory variable $x$ or against the predicted (or fitted) values $\hat{y}$ is a helpful and frequently used visual aid for checking these conditions. This is better than the scatterplot because a residual plot magnifies patterns. The residual plot in Figure 10.7 for the data of Case 10.1 looks satisfactory. There is no curved pattern or data points that seem out of the ordinary, and the data appear equally dispersed above and below zero throughout the range of $x$.
The final condition is that the response varies Normally about the population regression line. In that case, we expect the residuals to also be Normally distributed.4 A Normal quantile plot of the residuals (Figure 10.8) shows no serious deviations from a Normal distribution. The data give no reason to doubt the simple linear regression model, so we proceed to inference.
Reminder
Normal quantile plot, p. 51
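Both plots are easy to produce with standard software. Here is a hedged Python sketch: because the entrepreneur data are not reproduced in the text, it simulates stand-in data, and the plotting calls shown are one common way, not the only way, to make these displays.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Stand-in data so the sketch runs; substitute your own x and y arrays.
rng = np.random.default_rng(2)
x = rng.uniform(8, 20, 100)
y = 9.0 + 0.11 * x + rng.normal(0, 0.9, 100)

b1, b0 = np.polyfit(x, y, 1)            # least-squares slope and intercept
resid = y - (b0 + b1 * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, resid, s=12)             # residuals vs. explanatory variable
ax1.axhline(0, color="gray")
ax1.set(xlabel="x", ylabel="residual")  # look for curvature or changing spread
stats.probplot(resid, plot=ax2)         # Normal quantile plot of the residuals
plt.show()
```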
There is no condition that requires Normality for the distributions of the response or explanatory variables. The Normality condition applies only to the distribution of the model deviations, which we assess using the residuals. For the entrepreneur problem, we transformed the response to get a more linear relationship as well as residuals that appear Normal with constant variance. The fact that the marginal distribution of the transformed response is more Normal is purely a coincidence.
Confidence intervals and significance tests
Chapter 7 presented confidence intervals and significance tests for means and differences in means. In each case, inference rested on the standard errors of estimates and on $t$ distributions. Inference for the slope and intercept in linear regression is similar in principle. For example, the confidence intervals have the form

$$\text{estimate} \pm t^* \text{SE}_{\text{estimate}}$$

where $t^*$ is a critical value of a $t$ distribution. It is the formulas for the estimate and standard error that are different.
Confidence intervals and tests for the slope and intercept are based on the sampling distributions of the estimates $b_0$ and $b_1$. Here are some important facts about these sampling distributions:

- If the simple linear regression model is true, each of $b_0$ and $b_1$ has a Normal sampling distribution.
- The means of these sampling distributions are $\beta_0$ and $\beta_1$; that is, $b_0$ and $b_1$ are unbiased estimators.
- The standard deviations of these sampling distributions are multiples of the model standard deviation $\sigma$.
Reminder
unbiased estimator, p. 279
Normality of $b_0$ and $b_1$ is a consequence of Normality of the individual deviations $\epsilon_i$ in the regression model. If the $\epsilon_i$ are not Normal, a general form of the central limit theorem tells us that the distributions of $b_0$ and $b_1$ will be approximately Normal when we have a large sample. Regression inference is robust against moderate lack of Normality. On the other hand, outliers and influential observations can invalidate the results of inference for regression.
Reminder
central limit theorem, p. 294
Because $b_0$ and $b_1$ have Normal sampling distributions, standardizing these estimates gives standard Normal $z$ statistics. The standard deviations of these estimates are multiples of $\sigma$. Because we do not know $\sigma$, we estimate it by $s$, the variability of the data about the least-squares line. When we do this, we get $t$ distributions with $n - 2$ degrees of freedom, the degrees of freedom of $s$. We give formulas for the standard errors $\text{SE}_{b_1}$ and $\text{SE}_{b_0}$ in Section 10.3. For now, we concentrate on the basic ideas and let software do the calculations.
Inference for Regression Slope
A level $C$ confidence interval for the slope $\beta_1$ of the population regression line is

$$b_1 \pm t^* \text{SE}_{b_1}$$

In this expression, $t^*$ is the value for the $t(n-2)$ density curve with area $C$ between $-t^*$ and $t^*$. The margin of error is $m = t^* \text{SE}_{b_1}$.

To test the hypothesis $H_0\colon \beta_1 = 0$, compute the $t$ statistic

$$t = \frac{b_1}{\text{SE}_{b_1}}$$

The degrees of freedom are $n - 2$. In terms of a random variable $T$ having the $t(n-2)$ distribution, the $P$-value for a test of $H_0$ against

- $H_a\colon \beta_1 > 0$ is $P(T \geq t)$
- $H_a\colon \beta_1 < 0$ is $P(T \leq t)$
- $H_a\colon \beta_1 \neq 0$ is $2P(T \geq |t|)$
Formulas for confidence intervals and significance tests for the intercept $\beta_0$ are exactly the same, replacing $b_1$ and $\text{SE}_{b_1}$ by $b_0$ and its standard error $\text{SE}_{b_0}$. Although computer outputs often include a test of $H_0\colon \beta_0 = 0$, this information usually has little practical value. From the equation for the population regression line, $\mu_y = \beta_0 + \beta_1 x$, we see that $\beta_0$ is the mean response corresponding to $x = 0$. In many practical situations, this subpopulation does not exist or is not interesting.
On the other hand, the test of $H_0\colon \beta_1 = 0$ is quite useful. When we substitute $\beta_1 = 0$ in the model, the $x$ term drops out and we are left with

$$\mu_y = \beta_0$$
This model says that the mean of $y$ does not vary with $x$. In other words, all the $y$'s come from a single population with mean $\beta_0$, which we would estimate by $\bar{y}$. The hypothesis $H_0\colon \beta_1 = 0$, therefore, says that there is no straight-line relationship between $y$ and $x$ and that linear regression of $y$ on $x$ is of no value for predicting $y$.
EXAMPLE 10.5 Does Log Income Increase with Education?
CASE 10.1 The Excel regression output in Figure 10.5 (page 492) for the entrepreneur problem contains the information needed for inference about the regression coefficients. You can see that the slope of the least-squares line is $b_1 = 0.1126$ and the standard error of this statistic is $\text{SE}_{b_1} = 0.0461$.
Given that the response is on the log scale, this slope approximates the percent change in $y$ for a unit change in $x$ (see Example 13.10 [pages 661–662] for more details). In this case, one extra year of education is associated with an approximate 11.3% increase in income.
The $t$ statistic and $P$-value for the test of $H_0\colon \beta_1 = 0$ against the two-sided alternative appear in the columns labeled "t Stat" and "P-value." The $t$ statistic for the significance of the regression is

$$t = \frac{b_1}{\text{SE}_{b_1}} = \frac{0.1126}{0.0461} = 2.44$$

and the $P$-value for the two-sided alternative is 0.0164. If we expected beforehand that income rises with education, our alternative hypothesis would be one-sided, $H_a\colon \beta_1 > 0$. The $P$-value for this $H_a$ is one-half the two-sided value given by Excel; that is, $P = 0.0082$. In both cases, there is strong evidence that the mean log income level increases as education increases.
A 95% confidence interval for the slope $\beta_1$ of the regression line in the population of all entrepreneurs in the United States is

$$b_1 \pm t^* \text{SE}_{b_1} = 0.1126 \pm (1.990)(0.0461) = 0.1126 \pm 0.0917$$

or 0.0209 to 0.2043. This interval contains only positive values, suggesting an increase in log income for an additional year of schooling. We're 95% confident that the average increase in income for one additional year of education is between 2.1% and 20.4%.
The $t$ distribution for this problem has $n - 2 = 100 - 2 = 98$ degrees of freedom. Table D has no entry for 98 degrees of freedom, so we use the table entry for 80 degrees of freedom. As a result, our confidence interval agrees only approximately with the more accurate software result. Note that using the next lower degrees of freedom in Table D makes our interval a bit wider than we actually need for 95% confidence. Use this conservative approach when you don't know $t^*$ for the exact degrees of freedom.
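A short script can reproduce the arithmetic of this example, using the slope and standard error quoted above (0.1126 and 0.0461) and the exact $t(98)$ distribution instead of Table D's conservative 80-degrees-of-freedom row.

```python
from scipy import stats

b1, se_b1, df = 0.1126, 0.0461, 98      # slope, standard error, n - 2

t = b1 / se_b1                          # t statistic for H0: beta1 = 0
p_two = 2 * stats.t.sf(abs(t), df)      # two-sided P-value
tstar = stats.t.ppf(0.975, df)          # exact 95% critical value (about 1.984)

print(f"t = {t:.2f}, two-sided P = {p_two:.4f}, one-sided P = {p_two / 2:.4f}")
print(f"95% CI: {b1 - tstar * se_b1:.4f} to {b1 + tstar * se_b1:.4f}")
```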
In this example, we can discuss percent change in income for a unit change in education because the response variable is on the log scale and the explanatory variable is not. In business and economics, we often encounter models in which both variables are on the log scale. In these cases, the slope approximates the percent change in $y$ for a 1% change in $x$. This is known as elasticity, a very important concept in economic theory.
elasticity
Apply Your Knowledge
Treasury bills and inflation.
When inflation is high, lenders require higher interest rates to make up for the loss of purchasing power of their money while it is loaned out. Table 10.1 displays the annualized return on six-month Treasury bills and the rate of inflation as measured by the change in the government's Consumer Price Index in the same year.5 An inflation rate of 5% means that the same set of goods and services costs 5% more. The data cover 56 years, from 1958 to 2013. Figure 10.9 is a scatterplot of these data. Figure 10.10 shows Excel regression output for predicting T-bill return from inflation rate. Exercises 10.6 through 10.8 ask you to use this information.
| Year | T-bill percent | Inflation percent | Year | T-bill percent | Inflation percent | Year | T-bill percent | Inflation percent |
|---|---|---|---|---|---|---|---|---|
1958 | 3.01 | 1.76 | 1977 | 5.52 | 6.70 | 1996 | 5.08 | 3.32 |
1959 | 3.81 | 1.73 | 1978 | 7.58 | 9.02 | 1997 | 5.18 | 1.70 |
1960 | 3.20 | 1.36 | 1979 | 10.04 | 13.20 | 1998 | 4.83 | 1.61 |
1961 | 2.59 | 0.67 | 1980 | 11.32 | 12.50 | 1999 | 4.75 | 2.68 |
1962 | 2.90 | 1.33 | 1981 | 13.81 | 8.92 | 2000 | 5.90 | 3.39 |
1963 | 3.26 | 1.64 | 1982 | 11.06 | 3.83 | 2001 | 3.34 | 1.55 |
1964 | 3.68 | 0.97 | 1983 | 8.74 | 3.79 | 2002 | 1.68 | 2.38 |
1965 | 4.05 | 1.92 | 1984 | 9.78 | 3.95 | 2003 | 1.05 | 1.88 |
1966 | 5.06 | 3.46 | 1985 | 7.65 | 3.80 | 2004 | 1.58 | 3.26 |
1967 | 4.61 | 3.04 | 1986 | 6.02 | 1.10 | 2005 | 3.39 | 3.42 |
1968 | 5.47 | 4.72 | 1987 | 6.03 | 4.43 | 2006 | 4.81 | 2.54 |
1969 | 6.86 | 6.20 | 1988 | 6.91 | 4.42 | 2007 | 4.44 | 4.08 |
1970 | 6.51 | 5.57 | 1989 | 8.03 | 4.65 | 2008 | 1.62 | 0.09 |
1971 | 4.52 | 3.27 | 1990 | 7.46 | 6.11 | 2009 | 0.28 | 2.72 |
1972 | 4.47 | 3.41 | 1991 | 5.44 | 3.06 | 2010 | 0.20 | 1.50 |
1973 | 7.20 | 8.71 | 1992 | 3.54 | 2.90 | 2011 | 0.10 | 2.96 |
1974 | 7.95 | 12.34 | 1993 | 3.12 | 2.75 | 2012 | 0.13 | 1.74 |
1975 | 6.10 | 6.94 | 1994 | 4.64 | 2.67 | 2013 | 0.09 | 1.50 |
1976 | 5.26 | 4.86 | 1995 | 5.56 | 2.54 |
10.6 Look at the data.
Give a brief description of the form, direction, and strength of the relationship between the inflation rate and the return on Treasury bills. What is the equation of the least-squares regression line for predicting T-bill return?
10.7 Is there a relationship?
What are the slope $b_1$ of the fitted line and its standard error? Use these numbers to test by hand the hypothesis that there is no straight-line relationship between inflation rate and T-bill return against the alternative that the return on T-bills increases as the rate of inflation increases. State the hypotheses, give both the $t$ statistic and its degrees of freedom, and use Table D to approximate the $P$-value. Then compare your results with those given by Excel. (Excel's $P$-value 3.04E-09 is shorthand for 0.00000000304. We would report this as "$P < 0.001$.")
10.7
$H_0\colon \beta_1 = 0$ versus $H_a\colon \beta_1 > 0$; the $t$ statistic has $n - 2 = 54$ degrees of freedom. The results are the same as Excel's.
10.8 Estimating the slope.
Using Excel’s values for and its standard error, find a 95% confidence interval for the slope of the population regression line. Compare your result with Excel’s 95% confidence interval. What does the confidence interval tell you about the change in the T-bill return rate for a 1% increase in the inflation rate?
The word “regression”
To “regress” means to go backward. Why are statistical methods for predicting a response from an explanatory variable called “regression”? Sir Francis Galton (1822–1911) was the first to apply regression to biological and psychological data. He looked at examples such as the heights of children versus the heights of their parents. He found that the taller-than-average parents tended to have children who were also taller than average, but not as tall as their parents. Galton called this fact “regression toward mediocrity,” and the name came to be applied to the statistical method. Galton also invented the correlation coefficient and named it “correlation.”
Why are the children of tall parents shorter on the average than their parents? The parents are tall in part because of their genes. But they are also tall in part by chance. Looking at tall parents selects those in whom chance produced height. Their children inherit their genes, but not their good luck. As a group, the children are taller than average (genes), but their heights vary by chance about the average, some upward and some downward. The children, unlike the parents, were not selected because they were tall and thus, on average, are shorter. A similar argument can be used to describe why children of short parents tend to be taller than their parents.
Here’s another example. Students who score at the top on the first exam in a course are likely to do less well on the second exam. Does this show that they stopped studying? No—they scored high in part because they knew the material but also in part because they were lucky. On the second exam, they may still know the material but be less lucky. As a group, they will still do better than average but not as well as they did on the first exam. The students at the bottom on the first exam will tend to move up on the second exam, for the same reason.
The regression fallacy is the assertion that regression toward the mean shows that there is some systematic effect at work: students with top scores now work less hard, or managers of last year’s best-performing mutual funds lose their touch this year, or heights get less variable with each passing generation as tall parents have shorter children and short parents have taller children. The Nobel economist Milton Friedman says, “I suspect that the regression fallacy is the most common fallacy in the statistical analysis of economic data.”6 Beware.
regression fallacy
Apply Your Knowledge
10.9 Hot funds?
Explain carefully to a naive investor why the mutual funds that had the highest returns this year will as a group probably do less well relative to other funds next year.
10.9
The mutual funds that had the highest returns this year ranked high in part because of skill but also in part because of luck. Next year, their skill may persist, but their luck may not, so as a group we expect them to do well again but probably not as well relative to other funds as this year.
10.10 Mediocrity triumphant?
In the early 1930s, a man named Horace Secrist wrote a book titled The Triumph of Mediocrity in Business. Secrist found that businesses that did unusually well or unusually poorly in one year tended to be nearer the average in profitability at a later year. Why is it a fallacy to say that this fact demonstrates an overall movement toward “mediocrity”?
Inference about correlation
The correlation between log income and level of education for the 100 entrepreneurs is $r = 0.24$. This value appears in the Excel output in Figure 10.5 (page 492), where it is labeled "Multiple R."7 We might expect a positive correlation between these two measures in the population of all entrepreneurs in the United States. Is the sample result convincing evidence that this is true?
This question concerns a new population parameter, the population correlation. This is the correlation between log income and level of education when we measure these variables for every member of the population. We call the population correlation $\rho$, the Greek letter rho. To assess the evidence that $\rho > 0$ in the population, we must test the hypotheses

$$H_0\colon \rho = 0$$
$$H_a\colon \rho > 0$$
population correlation
It is natural to base the test on the sample correlation $r$. Table G in the back of the book shows the one-sided critical values of $r$. To use software for the test, we exploit the close link between correlation and the regression slope. The population correlation $\rho$ is zero, positive, or negative exactly when the slope $\beta_1$ of the population regression line is zero, positive, or negative. In fact, the $t$ statistic for testing $H_0\colon \beta_1 = 0$ also tests $H_0\colon \rho = 0$. What is more, this statistic can be written in terms of the sample correlation $r$.
Test for Zero Population Correlation
To test the hypothesis that the population correlation is 0, compare the sample correlation $r$ with critical values in Table G or use the $t$ statistic for the regression slope.

The $t$ statistic for the slope can be calculated from the sample correlation $r$:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

This statistic has $n - 2$ degrees of freedom.
EXAMPLE 10.6 Correlation between Log Income and Years of Education
CASE 10.1 The sample correlation between log income and education level is $r = 0.24$ from a sample of size $n = 100$. We can use Table G to test

$$H_0\colon \rho = 0 \quad \text{versus} \quad H_a\colon \rho > 0$$

For the row $n = 100$, we find that the $P$-value for $r = 0.24$ lies between 0.005 and 0.01.
We can get a more accurate result from the Excel output in Figure 10.5 (page 492). In the "EDUC" line, we see that $t = 2.44$ with two-sided $P$-value 0.0164. That is, $P = 0.0082$ for our one-sided alternative.
Finally, we can calculate $t$ directly from $r$ as follows:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.24\sqrt{98}}{\sqrt{1-(0.24)^2}} = 2.45$$

This agrees with the software value up to the roundoff in $r$.
If we are not using software, we can compare $t$ with critical values from the $t$ table (Table D) with 80 degrees of freedom (the largest row less than or equal to $n - 2 = 98$).
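The conversion from $r$ to $t$ takes one line of code. A minimal Python sketch, using the rounded $r = 0.24$ (so the result differs from Excel's $t = 2.44$ only in roundoff):

```python
import math
from scipy import stats

r, n = 0.24, 100                                  # sample correlation and sample size

t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)    # t statistic with n - 2 df
p_one = stats.t.sf(t, n - 2)                      # one-sided P-value for rho > 0

print(f"t = {t:.2f}, one-sided P = {p_one:.4f}")  # about t = 2.45, P near 0.008
```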
The alternative formula for the test statistic is convenient because it uses only the sample correlation $r$ and the sample size $n$. Remember that correlation, unlike regression, does not require the distinction between explanatory and response variables. For variables $x$ and $y$, there are two regressions ($y$ on $x$ and $x$ on $y$) but just one correlation. Both regressions produce the same $t$ statistic.
The distinction between the regression setting and correlation is important only for understanding the conditions under which the test for 0 population correlation makes sense. In the regression model, we take the values of the explanatory variable $x$ as given. The values of the response $y$ are Normal random variables, with means that are a straight-line function of $x$. In the model for testing correlation, we think of the setting where we obtain a random sample from a population and measure both $x$ and $y$. Both are assumed to be Normal random variables. In fact, they are taken to be jointly Normal. This implies that the conditional distribution of $y$ for each possible value of $x$ is Normal, just as in the regression model.
jointly Normal
Apply Your Knowledge
10.11 T-bills and inflation.
We expect the interest rates on Treasury bills to rise when the rate of inflation rises and fall when inflation falls. That is, we expect a positive correlation between the return on T-bills and the inflation rate.
10.11
(a) There is a significant positive correlation between T-bill return and the inflation rate. (b) The $t$ statistic computed from $r$ agrees with the regression output; the results are the same.
10.12 Two regressions.
CASE 10.1 We have regressed the log income of entrepreneurs on their years of education, with the results appearing in Figures 10.5 and 10.6. Use software to regress years of education on log income for the same data.