Simple linear regression studies the relationship between a response variable $y$ and an explanatory variable $x$. We expect that different values of $x$ are associated with different mean responses for $y$. We encountered a similar but simpler situation in Chapter 7 when we discussed methods for comparing two population means. Figure 10.1 illustrates a statistical model for comparing the items per hour entered by two groups of financial clerks using new data entry software. Group 2 received some training in the software while Group 1 did not. Entries per hour is the response variable. The treatment (training or not) is the explanatory variable. The model has two important parts:

- The mean entries per hour may differ in the two groups.
- Individual clerks' entries per hour vary within each group according to a Normal distribution, and the two Normal curves have the same spread.
Statistical model for simple linear regression
Now imagine giving different lengths of training to different groups of subjects. We can think of these groups as belonging to subpopulations, one for each possible value of $x$. Each subpopulation consists of all individuals in the population having the same value of $x$. If we gave one length of training to some subjects, a second length to others, and a third length to still others, these three groups of subjects would be considered samples from the corresponding three subpopulations.
subpopulation
The statistical model for simple linear regression also assumes that, for each value of $x$, the response variable $y$ is Normally distributed with a mean that depends on $x$. We use $\mu_y$ to represent these means. In general, the means $\mu_y$ can change as $x$ changes according to any sort of pattern. In simple linear regression, we assume that the means all lie on a line when plotted against $x$. To summarize, this model also has two important parts:

- The mean response $\mu_y$ has a straight-line relationship with $x$: $\mu_y = \beta_0 + \beta_1 x$, where the slope $\beta_1$ and intercept $\beta_0$ are unknown parameters.
- For any fixed value of $x$, the responses $y$ vary according to a Normal distribution about their mean $\mu_y$, with the same standard deviation $\sigma$ for all values of $x$.
This statistical model is pictured in Figure 10.2. The line $\mu_y = \beta_0 + \beta_1 x$ describes how the mean response changes with $x$. This is the population regression line. The three Normal curves show how the response $y$ will vary for three different values of the explanatory variable $x$. Each curve is centered at its mean response $\mu_y$. All three curves have the same spread, measured by their common standard deviation $\sigma$.
population regression line
From data analysis to inference
The data for a regression problem are the observed values of $x$ and $y$. The model takes each $x$ to be a fixed known quantity, like the hours of training a worker has received.1 The response $y$ for a given $x$ is a Normal random variable. The model describes the mean and standard deviation of this random variable. This model is not appropriate if there is error in measuring $x$ and that error is large relative to the spread of the $x$'s. In these situations, more advanced inference methods are needed.
We use Case 10.1 to explain the fundamentals of simple linear regression. Because regression calculations in practice are always done by software, we rely on computer output for the arithmetic. Later in the chapter, we show formulas for doing the calculations. These formulas are useful in understanding analysis of variance (see Section 10.3) and multiple regression (see Chapter 11).
CASE 10.1 The Relationship between Income and Education for Entrepreneurs
Numerous studies have shown that better-educated employees have higher incomes. Is this also true for entrepreneurs? Do more years of formal education translate into higher incomes? And if so, is the return for an additional year of education the same for entrepreneurs and employees? One study explored these questions using the National Longitudinal Survey of Youth (NLSY), which followed a large group of individuals aged 14 to 22 for roughly 10 years.2 They looked at both employees and entrepreneurs, but we just focus on entrepreneurs here.
The researchers defined entrepreneurs to be those who were self-employed or who were the owner/director of an incorporated business. For each of these individuals, they recorded the education level and income. The education level (EDUC) was defined as the years of completed schooling prior to starting the business. The income level was the average annual total earnings (INC) since starting the business.
We consider a random sample of 100 entrepreneurs. Figure 10.3 is a scatterplot of the data with a fitted smoothed curve. The explanatory variable is the entrepreneur’s education level. The response variable is the income level.
Let’s briefly review some of the ideas from Chapter 2 regarding least-squares regression. We start with a plot of the data, as in Figure 10.3, to verify that the relationship is approximately linear with no outliers. Always start with a graphical display of the data. There is no point in fitting a linear model if the relationship does not, at least approximately, appear linear. In this case, the distribution of income is skewed to the right (at each education level, there are many small incomes and just a few large incomes). Although the smoothed curve is roughly linear, the curve is being pulled toward the very large incomes, suggesting these observations could be influential.
Reminder
least-squares regression, p. 80
A common remedy for a strongly skewed variable such as income is to consider transforming the variable prior to fitting a model. Here, the researchers considered the natural logarithm of income (LOGINC). Figure 10.4 is a scatterplot of these transformed data with a fitted smoothed curve in black and the least-squares regression line in green. The smoothed curve is almost linear, and the observations in the $y$ direction are more equally dispersed above and below this curve than they are about the curve in Figure 10.3. Also, those four very large incomes no longer appear to be influential. Given these results, we continue our discussion of least-squares regression using the transformed data.
Reminder
log transformation, p. 68
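If you want to try this transformation yourself, the short Python sketch below shows the idea. The data are simulated, since the NLSY sample itself is not reproduced in the text; the variable names EDUC and INC mirror the study's, but the numbers are ours.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data, right-skewed in income like most income samples
rng = np.random.default_rng(1)
educ = rng.integers(8, 21, size=100)                 # years of schooling
inc = np.exp(9.0 + 0.11 * educ + rng.normal(0, 0.9, size=100))

loginc = np.log(inc)                                 # natural log transform

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(educ, inc, s=12)                         # skewed: a few huge incomes
ax1.set(xlabel="EDUC (years)", ylabel="INC")
ax2.scatter(educ, loginc, s=12)                      # more even vertical spread
ax2.set(xlabel="EDUC (years)", ylabel="LOGINC")
plt.show()
```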
EXAMPLE 10.1 Prediction of Log Income from Education Level
CASE 10.1 The green line in Figure 10.4 is the least-squares regression line for predicting log income from years of formal schooling. The equation of this line is

$$\widehat{\text{LOGINC}} = b_0 + b_1 \times \text{EDUC}$$

where the numerical estimates $b_0$ and $b_1$ are read from the software output shown in Figure 10.5 (page 492).
Reminder
residuals, p. 88
We can use the least-squares regression equation to find the predicted log income $\hat{y}$ corresponding to a given value of EDUC. The difference between the observed and predicted value is the residual. For example, Entrepreneur 4 has 15 years of formal schooling and a log income of 10.2274. We predict that this person will have a log income of

$$\hat{y} = b_0 + b_1 \times 15$$

so the residual is

$$e = \text{observed } y - \hat{y} = 10.2274 - \hat{y}$$
Recall that the least-squares line is the line that minimizes the sum of the squares of the residuals. The least-squares regression line also always passes through the point $(\bar{x}, \bar{y})$. These are helpful facts to remember when considering the fit of this line to a data set.
In Section 2.2 (pages 74–77), we discussed the correlation as a measure of association between two quantitative variables. In Section 2.3, we learned to interpret the square of the correlation, $r^2$, as the fraction of the variation in $y$ that is explained by $x$ in a simple linear regression.
Reminder
interpretation of $r^2$, p. 87
EXAMPLE 10.2 Correlation between Log Income and Education Level
CASE 10.1 For Case 10.1, the correlation between LOGINC and EDUC is $r = 0.24$. Because the squared correlation is $r^2 = 0.057$, the change in log income along the regression line as years of education increases explains only 5.7% of the variation. The remaining 94.3% is due to other differences among these entrepreneurs. The entrepreneurs in this sample live in different parts of the United States; some are single and others are married; and some may have had a difficult upbringing. All these factors could be associated with log income and thus add to the variability if not included in the model.
Apply Your Knowledge
10.1 Predict the log income.
In Case 10.1, Entrepreneur 3 has an EDUC of 14 years and a log income of 10.9475. Using the least-squares regression equation in Example 10.1, find the predicted log income and the residual for this individual.
10.2 Understanding a linear regression model.
Consider a linear regression model with mean response $\mu_y = \beta_0 + \beta_1 x$ and standard deviation $\sigma$.
Having reviewed the basics of least-squares regression, we are now ready to proceed with a discussion of inference for regression. Here’s what is new in this chapter:
Our statistical model assumes that the responses $y$ are Normally distributed with a mean $\mu_y$ that depends upon $x$ in a linear way. Specifically, the population regression line

$$\mu_y = \beta_0 + \beta_1 x$$

describes the relationship between the mean log income $\mu_y$ and the number of years of formal education $x$ in the population. The slope $\beta_1$ is the mean increase in log income for each additional year of education. The intercept $\beta_0$ is the mean log income when an entrepreneur has $x = 0$ years of formal education. This parameter, by itself, is not meaningful in this example because $x = 0$ years of education would be extremely rare.
Because the means $\mu_y$ lie on the line $\mu_y = \beta_0 + \beta_1 x$, they are all determined by $\beta_0$ and $\beta_1$. Thus, once we have estimates of $\beta_0$ and $\beta_1$, the linear relationship determines the estimates of $\mu_y$ for all values of $x$. Linear regression allows us to do inference not only for subpopulations for which we have data, but also for those corresponding to $x$'s not present in the data. These $x$-values can be both within and outside the range of observed $x$'s. However, extreme caution must be taken when performing inference for an $x$-value outside the range of the observed $x$'s because there is no assurance that the same linear relationship between $\mu_y$ and $x$ holds.
We cannot observe the population regression line because the observed responses $y$ vary about their means. In Figure 10.4 we see the least-squares regression line that describes the overall pattern of the data, along with the scatter of individual points about this line. The statistical model for linear regression makes the same distinction. This was displayed in Figure 10.2 with the line $\mu_y = \beta_0 + \beta_1 x$ and three Normal curves. The population regression line describes the on-the-average relationship, and the Normal curves describe the variability in $y$ for each value of $x$.
Think of the model in the form

$$\text{DATA} = \text{FIT} + \text{RESIDUAL}$$

The FIT part of the model consists of the subpopulation means, given by the expression $\beta_0 + \beta_1 x$. The RESIDUAL part represents deviations of the data from the line of population means. The model assumes that these deviations are Normally distributed with standard deviation $\sigma$. We use $\epsilon$ (the lowercase Greek letter epsilon) to stand for the RESIDUAL part of the statistical model. A response $y$ is the sum of its mean and a chance deviation $\epsilon$ from the mean. The deviations $\epsilon$ represent "noise," variation in $y$ due to other causes that prevents the observed $(x, y)$-values from forming a perfectly straight line on the scatterplot.
Simple Linear Regression Model
Given $n$ observations of the explanatory variable $x$ and the response variable $y$,

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

the statistical model for simple linear regression states that the observed response $y_i$ when the explanatory variable takes the value $x_i$ is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

Here, $\mu_y = \beta_0 + \beta_1 x_i$ is the mean response when $x = x_i$. The deviations $\epsilon_i$ are independent and Normally distributed with mean 0 and standard deviation $\sigma$.

The parameters of the model are $\beta_0$, $\beta_1$, and $\sigma$.
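One way to internalize this box is to read it as a recipe for generating data. The Python sketch below does exactly that; the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, and $\sigma = 1$ are arbitrary choices for illustration, not values from any example in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma = 2.0, 0.5, 1.0     # population parameters (illustrative)
x = np.repeat([1.0, 2.0, 3.0], 50)      # fixed, known values of the explanatory variable

# DATA = FIT + RESIDUAL: mean response plus Normal(0, sigma) deviations
eps = rng.normal(0.0, sigma, size=x.size)
y = beta0 + beta1 * x + eps

# Within each subpopulation (fixed x), y varies Normally about its mean
for xv in np.unique(x):
    sub = y[x == xv]
    print(f"x={xv}: mean near {beta0 + beta1 * xv:.2f}, observed {sub.mean():.2f}; "
          f"sd near {sigma:.2f}, observed {sub.std(ddof=1):.2f}")
```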
The simple linear regression model can be justified in a wide variety of circumstances. Sometimes, we observe the values of two variables, and we formulate a model with one of these as the response variable and the other as the explanatory variable. This was the setting for Case 10.1, where the response variable was log income and the explanatory variable was the number of years of formal education. In other settings, the values of the explanatory variable are chosen by the persons designing the study. The scenario illustrated by Figure 10.2 is an example of this setting. Here, the explanatory variable is training time, which is set at a few carefully selected values. The response variable is the number of entries per hour.
For the simple linear regression model to be valid, one essential assumption is that the relationship between the means of the response variable for the different values of the explanatory variable is approximately linear. This is the FIT part of the model. Another essential assumption concerns the RESIDUAL part of the model. It states that the deviations $\epsilon_i$ behave like an SRS from a Normal distribution with mean zero and standard deviation $\sigma$. If the data are collected through some sort of random sampling, this assumption is often easy to justify. This is the case in our two scenarios, in which both variables are observed in a random sample from a population or the response variable is measured at predetermined values of the explanatory variable.
In many other settings, particularly in business applications, we analyze all of the data available and there is no random sampling. Here, we often justify the use of inference for simple linear regression by viewing the data as coming from some sort of process. The line gives a good description of the relationship, the fit, and we model the deviations from the fit, the residuals, as coming from a Normal distribution.
EXAMPLE 10.3 Retail Sales and Floor Space
It is customary in retail operations to assess the performance of stores partly in terms of their annual sales relative to their floor area (square feet). We might expect sales to increase linearly as stores get larger, with, of course, individual variation among stores of the same size. The regression model for a population of stores says that

$$\text{SALES} = \beta_0 + \beta_1 \times \text{AREA} + \epsilon$$

The slope $\beta_1$ is, as usual, a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space. The intercept $\beta_0$ is needed to describe the line but has no statistical importance because no stores have area close to zero. Floor space does not completely determine sales. The term $\epsilon$ in the model accounts for differences among individual stores with the same floor space. A store's location, for example, could be important but is not included in the FIT part of the model. In Chapter 11, we consider moving variables like this out of the RESIDUAL part of the model by allowing more than one explanatory variable in the FIT part.
Apply Your Knowledge
10.3 U.S. versus overseas stock returns.
Returns on common stocks in the United States and overseas appear to be growing more closely correlated as economies become more interdependent. Suppose that the following population regression line connects the total annual returns (in percent) on two indexes of stock prices:

$$\text{mean overseas return} = -0.1 + 0.15 \times \text{U.S. return}$$

(a) What is the mean overseas return when the U.S. market is flat (0% return)? (b) What does the slope tell us? (c) What part of the model allows overseas returns to vary when the U.S. return remains the same?
10.3
(a) −0.1. When the U.S. market is flat, the mean overseas return will be −0.1%. (b) 0.15. For each unit increase in U.S. return, the mean overseas return will increase by 0.15. (c) $\epsilon$. The term $\epsilon$ allows overseas returns to vary when U.S. returns remain the same.
10.4 Fixed and variable costs.
In some mass production settings, there is a linear relationship between the number of units of a product in a production run and the total cost of making these units.
Estimating the regression parameters
The method of least squares presented in Chapter 2 fits the least-squares line to summarize a relationship between the observed values of an explanatory variable and a response variable. Now we want to use this line as a basis for inference about a population from which our observations are a sample. We can do this only when the statistical model for regression is reasonable. In that setting, the slope $b_1$ and intercept $b_0$ of the least-squares line

$$\hat{y} = b_0 + b_1 x$$

estimate the slope $\beta_1$ and the intercept $\beta_0$ of the population regression line.
Recalling the formulas from Chapter 2, the slope of the least-squares line is

$$b_1 = r \frac{s_y}{s_x}$$

and the intercept is

$$b_0 = \bar{y} - b_1 \bar{x}$$

Here, $r$ is the correlation between the observed values of $y$ and $x$, $s_y$ is the standard deviation of the sample of $y$'s, and $s_x$ is the standard deviation of the sample of $x$'s. Notice that if the estimated slope $b_1$ is 0, so is the correlation, and vice versa. We discuss this relationship more later in this chapter.
Reminder
correlation, p. 74
The remaining parameter to be estimated is $\sigma$, which measures the variation of $y$ about the population regression line. More precisely, $\sigma$ is the standard deviation of the Normal distribution of the deviations $\epsilon_i$ in the regression model. However, we don't observe these $\epsilon_i$, so how can we estimate $\sigma$?
Recall that the vertical deviations of the points in a scatterplot from the fitted regression line are the residuals. We use $e_i$ for the residual of the $i$th observation:

$$e_i = \text{observed response} - \text{predicted response} = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i$$
Reminder
residuals, p. 88
The residuals $e_i$ are the observable quantities that correspond to the unobservable model deviations $\epsilon_i$. The $e_i$ sum to 0, and the $\epsilon_i$ come from a population with mean 0. Because we do not observe the $\epsilon_i$, we use the residuals to estimate $\sigma$ and to check the model assumptions about the $\epsilon_i$.
To estimate $\sigma$, we work first with the variance and take the square root to obtain the standard deviation. For simple linear regression, the estimate of $\sigma^2$ is the average squared residual

$$s^2 = \frac{1}{n-2} \sum e_i^2 = \frac{1}{n-2} \sum (y_i - \hat{y}_i)^2$$

We average by dividing the sum by $n - 2$ in order to make $s^2$ an unbiased estimator of $\sigma^2$. The sample variance of $n$ observations uses the divisor $n - 1$ for the same reason. The $n$ residuals are not $n$ separate quantities: when any $n - 2$ of the residuals are known, we can find the other two. The quantity $n - 2$ is the degrees of freedom of $s^2$. The estimate of the model standard deviation $\sigma$ is given by

$$s = \sqrt{s^2}$$
Reminder
sample variance, p. 31
model standard deviation
We call $s$ the regression standard error.
Estimating the Regression Parameters
In the simple linear regression setting, we use the slope $b_1$ and intercept $b_0$ of the least-squares regression line to estimate the slope $\beta_1$ and intercept $\beta_0$ of the population regression line.

The standard deviation $\sigma$ in the model is estimated by the regression standard error

$$s = \sqrt{\frac{1}{n-2} \sum (y_i - \hat{y}_i)^2}$$
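The formulas in this box are simple enough that what the software computes can be sketched in a few lines. Here is a minimal Python version; the toy data are invented for illustration and are not the Case 10.1 data.

```python
import numpy as np

# Toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.4, 4.5, 4.6, 5.9])
n = len(x)

r = np.corrcoef(x, y)[0, 1]              # sample correlation
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1 * xbar

resid = y - (b0 + b1 * x)                # residuals e_i
s = np.sqrt(np.sum(resid**2) / (n - 2))  # regression standard error, n - 2 df

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, s = {s:.4f}")
```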
In practice, we use software to calculate $b_0$, $b_1$, and $s$ from data on $x$ and $y$. Here are the results for the income example of Case 10.1.
EXAMPLE 10.4 Log Income and Years of Education
CASE 10.1 Figure 10.5 displays Excel output for the regression of log income (LOGINC) on years of education (EDUC) for our sample of 100 entrepreneurs in the United States. In this output, we find the correlation and the squared correlation that we used in Example 10.2, along with the intercept and slope of the least-squares line. The regression standard error $s$ is labeled simply "Standard Error."
The three parameter estimates, the intercept $b_0$, the slope $b_1$, and the regression standard error $s$, are read directly from the output in Figure 10.5. After rounding, they give the fitted regression line

$$\widehat{\text{LOGINC}} = b_0 + b_1 \times \text{EDUC}$$
As usual, we ignore the parts of the output that we do not yet need. We will return to the output for additional information later.
Figure 10.6 shows the regression output from two other software packages. Although the formats differ, you should be able to find the results you need. Once you know what to look for, you can understand statistical output from almost any software.
Apply Your Knowledge
10.5 Research and development spending.
The National Science Foundation collects data on the research and development spending by universities and colleges in the United States.3 Here are the data for the years 2008–2011:
| Year | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|
| Spending (billions of dollars) | 51.9 | 54.9 | 58.4 | 62.0 |
(Comment: These are time series data. Simple regression is often a good fit to time series data over a limited span of time. See Chapter 13 for methods designed specifically for use with time series.)
10.5
(a) The spending is increasing linearly over time. (b) $\hat{y} = -6735.31 + 3.38x$, where $x$ is the year.
(c) The residuals are 0.17, −0.21, −0.09, 0.13; they sum to 0.
(d) $\mu_y = \beta_0 + \beta_1 x$. The estimate for $\beta_0$ is −6735.31, the estimate for $\beta_1$ is 3.38, and the estimate for $\sigma$ is $s = 0.22$. (e) 68.63 billion dollars.
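Because all four data points appear in the table above, the numbers in this answer can be checked with a short script. A minimal Python check:

```python
import numpy as np

year = np.array([2008, 2009, 2010, 2011], dtype=float)
spend = np.array([51.9, 54.9, 58.4, 62.0])

# Least-squares slope and intercept from the textbook formulas
b1 = np.sum((year - year.mean()) * (spend - spend.mean())) / np.sum((year - year.mean())**2)
b0 = spend.mean() - b1 * year.mean()
resid = spend - (b0 + b1 * year)
s = np.sqrt(np.sum(resid**2) / (len(year) - 2))

print(round(b1, 2))               # 3.38
print(round(b0, 2))               # -6735.31
print(np.round(resid, 2))         # [ 0.17 -0.21 -0.09  0.13]
print(round(s, 2))                # 0.22
print(round(b0 + b1 * 2013, 2))   # 68.63, the prediction for 2013
```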
Conditions for regression inference
You can fit a least-squares line to any set of explanatory-response data when both variables are quantitative. The simple linear regression model, which is the basis for inference, imposes several conditions on this fit. We should always verify these conditions before proceeding to inference. There is no point in trying to do statistical inference if the data do not, at least approximately, meet the conditions that are the foundation for the inference.
The conditions concern the population, but we can observe only our sample. Thus, in doing inference, we act as if the sample is an SRS from the population. For the study described in Case 10.1, the researchers used a national survey. Participants were chosen to be a representative sample of the United States, so we can treat this sample as an SRS. The potential for bias should always be considered, especially when obtaining volunteers.
The next condition is that there is a linear relationship in the population, described by the population regression line. We can't observe the population line, so we check this condition by asking if the sample data show a roughly linear pattern in a scatterplot. We also check for any outliers or influential observations that could affect the least-squares fit. The model also says that the standard deviation of the responses about the population line is the same for all values of the explanatory variable. In practice, the spread of observations above and below the least-squares line should be roughly the same as $x$ varies.
Reminder
outliers and influential observations, p. 94
Plotting the residuals against the explanatory variable $x$ or against the predicted (or fitted) values $\hat{y}$ is a helpful and frequently used visual aid for checking these conditions. This is better than the scatterplot because a residual plot magnifies patterns. The residual plot in Figure 10.7 for the data of Case 10.1 looks satisfactory. There is no curved pattern or data points that seem out of the ordinary, and the data appear equally dispersed above and below zero throughout the range of $x$.
The final condition is that the response varies Normally about the population regression line. In that case, we expect the residuals to also be Normally distributed.4 A Normal quantile plot of the residuals (Figure 10.8) shows no serious deviations from a Normal distribution. The data give no reason to doubt the simple linear regression model, so we proceed to inference.
Reminder
Normal quantile plot, p. 51
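Both plots are easy to produce with standard software. Here is a hedged Python sketch: because the entrepreneur data are not reproduced in the text, it simulates stand-in data, and the plotting calls shown are one common way, not the only way, to make these displays.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Stand-in data so the sketch runs; substitute your own x and y arrays.
rng = np.random.default_rng(2)
x = rng.uniform(8, 20, 100)
y = 9.0 + 0.11 * x + rng.normal(0, 0.9, 100)

b1, b0 = np.polyfit(x, y, 1)            # least-squares slope and intercept
resid = y - (b0 + b1 * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, resid, s=12)             # residuals vs. explanatory variable
ax1.axhline(0, color="gray")
ax1.set(xlabel="x", ylabel="residual")  # look for curvature or changing spread
stats.probplot(resid, plot=ax2)         # Normal quantile plot of the residuals
plt.show()
```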
There is no condition that requires Normality for the distributions of the response or explanatory variables. The Normality condition applies only to the distribution of the model deviations, which we assess using the residuals. For the entrepreneur problem, we transformed the response to get a more linear relationship as well as residuals that appear Normal with constant variance. The fact that the marginal distribution of the transformed response is more Normal is purely a coincidence.
Confidence intervals and significance tests
Chapter 7 presented confidence intervals and significance tests for means and differences in means. In each case, inference rested on the standard errors of estimates and on $t$ distributions. Inference for the slope and intercept in linear regression is similar in principle. For example, the confidence intervals have the form

$$\text{estimate} \pm t^* \text{SE}_{\text{estimate}}$$

where $t^*$ is a critical value of a $t$ distribution. It is the formulas for the estimate and standard error that are different.
Confidence intervals and tests for the slope and intercept are based on the sampling distributions of the estimates $b_0$ and $b_1$. Here are some important facts about these sampling distributions:

- If the simple linear regression model is true, each of $b_0$ and $b_1$ has a Normal sampling distribution.
- The means of these sampling distributions are $\beta_0$ and $\beta_1$; that is, $b_0$ and $b_1$ are unbiased estimators.
- The standard deviations of these sampling distributions are multiples of the model standard deviation $\sigma$.
Reminder
unbiased estimator, p. 279
Normality of $b_0$ and $b_1$ is a consequence of Normality of the individual deviations $\epsilon_i$ in the regression model. If the $\epsilon_i$ are not Normal, a general form of the central limit theorem tells us that the distributions of $b_0$ and $b_1$ will be approximately Normal when we have a large sample. Regression inference is robust against moderate lack of Normality. On the other hand, outliers and influential observations can invalidate the results of inference for regression.
Reminder
central limit theorem, p. 294
Because $b_0$ and $b_1$ have Normal sampling distributions, standardizing these estimates gives standard Normal $z$ statistics. The standard deviations of these estimates are multiples of $\sigma$. Because we do not know $\sigma$, we estimate it by $s$, the variability of the data about the least-squares line. When we do this, we get $t$ distributions with $n - 2$ degrees of freedom, the degrees of freedom of $s$. We give formulas for the standard errors $\text{SE}_{b_1}$ and $\text{SE}_{b_0}$ in Section 10.3. For now, we concentrate on the basic ideas and let software do the calculations.
Inference for Regression Slope
A level $C$ confidence interval for the slope $\beta_1$ of the population regression line is

$$b_1 \pm t^* \text{SE}_{b_1}$$

In this expression, $t^*$ is the value for the $t(n-2)$ density curve with area $C$ between $-t^*$ and $t^*$. The margin of error is $m = t^* \text{SE}_{b_1}$.

To test the hypothesis $H_0\colon \beta_1 = 0$, compute the $t$ statistic

$$t = \frac{b_1}{\text{SE}_{b_1}}$$

The degrees of freedom are $n - 2$. In terms of a random variable $T$ having the $t(n-2)$ distribution, the $P$-value for a test of $H_0$ against

- $H_a\colon \beta_1 > 0$ is $P(T \geq t)$
- $H_a\colon \beta_1 < 0$ is $P(T \leq t)$
- $H_a\colon \beta_1 \neq 0$ is $2P(T \geq |t|)$
Formulas for confidence intervals and significance tests for the intercept $\beta_0$ are exactly the same, replacing $b_1$ and $\text{SE}_{b_1}$ by $b_0$ and its standard error $\text{SE}_{b_0}$. Although computer outputs often include a test of $H_0\colon \beta_0 = 0$, this information usually has little practical value. From the equation for the population regression line, $\mu_y = \beta_0 + \beta_1 x$, we see that $\beta_0$ is the mean response corresponding to $x = 0$. In many practical situations, this subpopulation does not exist or is not interesting.
On the other hand, the test of $H_0\colon \beta_1 = 0$ is quite useful. When we substitute $\beta_1 = 0$ in the model, the $x$ term drops out and we are left with

$$\mu_y = \beta_0$$
This model says that the mean of $y$ does not vary with $x$. In other words, all the $y$'s come from a single population with mean $\beta_0$, which we would estimate by $\bar{y}$. The hypothesis $H_0\colon \beta_1 = 0$, therefore, says that there is no straight-line relationship between $y$ and $x$ and that linear regression of $y$ on $x$ is of no value for predicting $y$.
EXAMPLE 10.5 Does Log Income Increase with Education?
CASE 10.1 The Excel regression output in Figure 10.5 (page 492) for the entrepreneur problem contains the information needed for inference about the regression coefficients. You can see that the slope of the least-squares line is $b_1 = 0.1126$ and the standard error of this statistic is $\text{SE}_{b_1} = 0.0461$.
Given that the response is on the log scale, this slope approximates the percent change in $y$ for a unit change in $x$ (see Example 13.10 [pages 661–662] for more details). In this case, one extra year of education is associated with an approximate 11.3% increase in income.
The $t$ statistic and $P$-value for the test of $H_0\colon \beta_1 = 0$ against the two-sided alternative appear in the columns labeled "t Stat" and "P-value." The $t$ statistic for the significance of the regression is

$$t = \frac{b_1}{\text{SE}_{b_1}} = \frac{0.1126}{0.0461} = 2.44$$

and the $P$-value for the two-sided alternative is 0.0164. If we expected beforehand that income rises with education, our alternative hypothesis would be one-sided, $H_a\colon \beta_1 > 0$. The $P$-value for this $H_a$ is one-half the two-sided value given by Excel; that is, $P = 0.0082$. In both cases, there is strong evidence that the mean log income level increases as education increases.
A 95% confidence interval for the slope $\beta_1$ of the regression line in the population of all entrepreneurs in the United States is

$$b_1 \pm t^* \text{SE}_{b_1} = 0.1126 \pm (1.990)(0.0461) = 0.1126 \pm 0.0917$$

or 0.0209 to 0.2043. This interval contains only positive values, suggesting an increase in log income for an additional year of schooling. We're 95% confident that the average increase in income for one additional year of education is between 2.1% and 20.4%.
The $t$ distribution for this problem has $n - 2 = 100 - 2 = 98$ degrees of freedom. Table D has no entry for 98 degrees of freedom, so we use the table entry for 80 degrees of freedom. As a result, our confidence interval agrees only approximately with the more accurate software result. Note that using the next lower degrees of freedom in Table D makes our interval a bit wider than we actually need for 95% confidence. Use this conservative approach when you don't know $t^*$ for the exact degrees of freedom.
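A short script can reproduce the arithmetic of this example, using the slope and standard error quoted above (0.1126 and 0.0461) and the exact $t(98)$ distribution instead of Table D's conservative 80-degrees-of-freedom row.

```python
from scipy import stats

b1, se_b1, df = 0.1126, 0.0461, 98      # slope, standard error, n - 2

t = b1 / se_b1                          # t statistic for H0: beta1 = 0
p_two = 2 * stats.t.sf(abs(t), df)      # two-sided P-value
tstar = stats.t.ppf(0.975, df)          # exact 95% critical value (about 1.984)

print(f"t = {t:.2f}, two-sided P = {p_two:.4f}, one-sided P = {p_two / 2:.4f}")
print(f"95% CI: {b1 - tstar * se_b1:.4f} to {b1 + tstar * se_b1:.4f}")
```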
In this example, we can discuss percent change in income for a unit change in education because the response variable is on the log scale and the explanatory variable is not. In business and economics, we often encounter models in which both variables are on the log scale. In these cases, the slope approximates the percent change in $y$ for a 1% change in $x$. This is known as elasticity, a very important concept in economic theory.
elasticity
Apply Your Knowledge
Treasury bills and inflation.
When inflation is high, lenders require higher interest rates to make up for the loss of purchasing power of their money while it is loaned out. Table 10.1 displays the annualized return on six-month Treasury bills and the rate of inflation as measured by the change in the government's Consumer Price Index in the same year.5 An inflation rate of 5% means that the same set of goods and services costs 5% more. The data cover 56 years, from 1958 to 2013. Figure 10.9 is a scatterplot of these data. Figure 10.10 shows Excel regression output for predicting T-bill return from inflation rate. Exercises 10.6 through 10.8 ask you to use this information.
| Year | T-bill percent | Inflation percent | Year | T-bill percent | Inflation percent | Year | T-bill percent | Inflation percent |
|---|---|---|---|---|---|---|---|---|
1958 | 3.01 | 1.76 | 1977 | 5.52 | 6.70 | 1996 | 5.08 | 3.32 |
1959 | 3.81 | 1.73 | 1978 | 7.58 | 9.02 | 1997 | 5.18 | 1.70 |
1960 | 3.20 | 1.36 | 1979 | 10.04 | 13.20 | 1998 | 4.83 | 1.61 |
1961 | 2.59 | 0.67 | 1980 | 11.32 | 12.50 | 1999 | 4.75 | 2.68 |
1962 | 2.90 | 1.33 | 1981 | 13.81 | 8.92 | 2000 | 5.90 | 3.39 |
1963 | 3.26 | 1.64 | 1982 | 11.06 | 3.83 | 2001 | 3.34 | 1.55 |
1964 | 3.68 | 0.97 | 1983 | 8.74 | 3.79 | 2002 | 1.68 | 2.38 |
1965 | 4.05 | 1.92 | 1984 | 9.78 | 3.95 | 2003 | 1.05 | 1.88 |
1966 | 5.06 | 3.46 | 1985 | 7.65 | 3.80 | 2004 | 1.58 | 3.26 |
1967 | 4.61 | 3.04 | 1986 | 6.02 | 1.10 | 2005 | 3.39 | 3.42 |
1968 | 5.47 | 4.72 | 1987 | 6.03 | 4.43 | 2006 | 4.81 | 2.54 |
1969 | 6.86 | 6.20 | 1988 | 6.91 | 4.42 | 2007 | 4.44 | 4.08 |
1970 | 6.51 | 5.57 | 1989 | 8.03 | 4.65 | 2008 | 1.62 | 0.09 |
1971 | 4.52 | 3.27 | 1990 | 7.46 | 6.11 | 2009 | 0.28 | 2.72 |
1972 | 4.47 | 3.41 | 1991 | 5.44 | 3.06 | 2010 | 0.20 | 1.50 |
1973 | 7.20 | 8.71 | 1992 | 3.54 | 2.90 | 2011 | 0.10 | 2.96 |
1974 | 7.95 | 12.34 | 1993 | 3.12 | 2.75 | 2012 | 0.13 | 1.74 |
1975 | 6.10 | 6.94 | 1994 | 4.64 | 2.67 | 2013 | 0.09 | 1.50 |
1976 | 5.26 | 4.86 | 1995 | 5.56 | 2.54 |
10.6 Look at the data.
Give a brief description of the form, direction, and strength of the relationship between the inflation rate and the return on Treasury bills. What is the equation of the least-squares regression line for predicting T-bill return?
10.7 Is there a relationship?
What are the slope $b_1$ of the fitted line and its standard error? Use these numbers to test by hand the hypothesis that there is no straight-line relationship between inflation rate and T-bill return against the alternative that the return on T-bills increases as the rate of inflation increases. State the hypotheses, give both the $t$ statistic and its degrees of freedom, and use Table D to approximate the $P$-value. Then compare your results with those given by Excel. (Excel's $P$-value 3.04E-09 is shorthand for 0.00000000304. We would report this as "$P < 0.001$.")
10.7
$H_0\colon \beta_1 = 0$ versus $H_a\colon \beta_1 > 0$; the $t$ statistic has $n - 2 = 54$ degrees of freedom. The results are the same as Excel's.
10.8 Estimating the slope.
Using Excel’s values for and its standard error, find a 95% confidence interval for the slope of the population regression line. Compare your result with Excel’s 95% confidence interval. What does the confidence interval tell you about the change in the T-bill return rate for a 1% increase in the inflation rate?
The word “regression”
To “regress” means to go backward. Why are statistical methods for predicting a response from an explanatory variable called “regression”? Sir Francis Galton (1822–1911) was the first to apply regression to biological and psychological data. He looked at examples such as the heights of children versus the heights of their parents. He found that the taller-than-average parents tended to have children who were also taller than average, but not as tall as their parents. Galton called this fact “regression toward mediocrity,” and the name came to be applied to the statistical method. Galton also invented the correlation coefficient and named it “correlation.”
Why are the children of tall parents shorter on the average than their parents? The parents are tall in part because of their genes. But they are also tall in part by chance. Looking at tall parents selects those in whom chance produced height. Their children inherit their genes, but not their good luck. As a group, the children are taller than average (genes), but their heights vary by chance about the average, some upward and some downward. The children, unlike the parents, were not selected because they were tall and thus, on average, are shorter. A similar argument can be used to describe why children of short parents tend to be taller than their parents.
Here’s another example. Students who score at the top on the first exam in a course are likely to do less well on the second exam. Does this show that they stopped studying? No—they scored high in part because they knew the material but also in part because they were lucky. On the second exam, they may still know the material but be less lucky. As a group, they will still do better than average but not as well as they did on the first exam. The students at the bottom on the first exam will tend to move up on the second exam, for the same reason.
The regression fallacy is the assertion that regression toward the mean shows that there is some systematic effect at work: students with top scores now work less hard, or managers of last year’s best-performing mutual funds lose their touch this year, or heights get less variable with each passing generation as tall parents have shorter children and short parents have taller children. The Nobel economist Milton Friedman says, “I suspect that the regression fallacy is the most common fallacy in the statistical analysis of economic data.”6 Beware.
regression fallacy
Apply Your Knowledge
10.9 Hot funds?
Explain carefully to a naive investor why the mutual funds that had the highest returns this year will as a group probably do less well relative to other funds next year.
10.9
The mutual funds that had the highest returns this year ranked high in part because of skill but also in part because of luck. Next year, their skill may persist, but their luck may not, so as a group we expect them to do well again but probably not as well relative to other funds as this year.
10.10 Mediocrity triumphant?
In the early 1930s, a man named Horace Secrist wrote a book titled The Triumph of Mediocrity in Business. Secrist found that businesses that did unusually well or unusually poorly in one year tended to be nearer the average in profitability at a later year. Why is it a fallacy to say that this fact demonstrates an overall movement toward “mediocrity”?
Inference about correlation
The correlation between log income and level of education for the 100 entrepreneurs is $r = 0.24$. This value appears in the Excel output in Figure 10.5 (page 492), where it is labeled "Multiple R."7 We might expect a positive correlation between these two measures in the population of all entrepreneurs in the United States. Is the sample result convincing evidence that this is true?
This question concerns a new population parameter, the population correlation. This is the correlation between log income and level of education when we measure these variables for every member of the population. We call the population correlation $\rho$, the Greek letter rho. To assess the evidence that $\rho > 0$ in the population, we must test the hypotheses

$$H_0\colon \rho = 0$$
$$H_a\colon \rho > 0$$
population correlation
It is natural to base the test on the sample correlation $r$. Table G in the back of the book shows the one-sided critical values of $r$. To use software for the test, we exploit the close link between correlation and the regression slope. The population correlation $\rho$ is zero, positive, or negative exactly when the slope $\beta_1$ of the population regression line is zero, positive, or negative. In fact, the $t$ statistic for testing $H_0\colon \beta_1 = 0$ also tests $H_0\colon \rho = 0$. What is more, this statistic can be written in terms of the sample correlation $r$.
Test for Zero Population Correlation
To test the hypothesis that the population correlation is 0, compare the sample correlation $r$ with critical values in Table G or use the $t$ statistic for the regression slope.

The $t$ statistic for the slope can be calculated from the sample correlation $r$:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

This statistic has $n - 2$ degrees of freedom.
EXAMPLE 10.6 Correlation between Log Income and Years of Education
CASE 10.1 The sample correlation between log income and education level is $r = 0.24$ from a sample of size $n = 100$. We can use Table G to test

$$H_0\colon \rho = 0 \quad \text{versus} \quad H_a\colon \rho > 0$$

For the row $n = 100$, we find that the $P$-value for $r = 0.24$ lies between 0.005 and 0.01.
We can get a more accurate result from the Excel output in Figure 10.5 (page 492). In the "EDUC" line, we see that $t = 2.44$ with two-sided $P$-value 0.0164. That is, $P = 0.0082$ for our one-sided alternative.
Finally, we can calculate $t$ directly from $r$ as follows:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.24\sqrt{98}}{\sqrt{1-(0.24)^2}} = 2.45$$

This agrees with the software value up to the roundoff in $r$.
If we are not using software, we can compare $t$ with critical values from the $t$ table (Table D) with 80 degrees of freedom (the largest row less than or equal to $n - 2 = 98$).
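The conversion from $r$ to $t$ takes one line of code. A minimal Python sketch, using the rounded $r = 0.24$ (so the result differs from Excel's $t = 2.44$ only in roundoff):

```python
import math
from scipy import stats

r, n = 0.24, 100                                  # sample correlation and sample size

t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)    # t statistic with n - 2 df
p_one = stats.t.sf(t, n - 2)                      # one-sided P-value for rho > 0

print(f"t = {t:.2f}, one-sided P = {p_one:.4f}")  # about t = 2.45, P near 0.008
```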
The alternative formula for the test statistic is convenient because it uses only the sample correlation $r$ and the sample size $n$. Remember that correlation, unlike regression, does not require the distinction between explanatory and response variables. For variables $x$ and $y$, there are two regressions ($y$ on $x$ and $x$ on $y$) but just one correlation. Both regressions produce the same $t$ statistic.
The distinction between the regression setting and correlation is important only for understanding the conditions under which the test for 0 population correlation makes sense. In the regression model, we take the values of the explanatory variable $x$ as given. The values of the response $y$ are Normal random variables, with means that are a straight-line function of $x$. In the model for testing correlation, we think of the setting where we obtain a random sample from a population and measure both $x$ and $y$. Both are assumed to be Normal random variables. In fact, they are taken to be jointly Normal. This implies that the conditional distribution of $y$ for each possible value of $x$ is Normal, just as in the regression model.
jointly Normal
Apply Your Knowledge
10.11 T-bills and inflation.
We expect the interest rates on Treasury bills to rise when the rate of inflation rises and fall when inflation falls. That is, we expect a positive correlation between the return on T-bills and the inflation rate.
10.11
(a) There is a significant positive correlation between T-bill return and the inflation rate. (b) The $t$ statistic computed from $r$ agrees with the regression output; the results are the same.
10.12 Two regressions.
CASE 10.1 We have regressed the log income of entrepreneurs on their years of education, with the results appearing in Figures 10.5 and 10.6. Use software to regress years of education on log income for the same data.