EXAMPLE 16 Strategy for building a multiple regression model

baseball2013

The author of this book first became interested in statistics through the enjoyment of sports statistics, especially those of baseball. Today, professional sports teams seek competitive advantage through the analysis of data and statistics, such as Sabermetrics (Society for American Baseball Research, www.sabr.org), as shown in the motion picture Moneyball.

Suppose a baseball researcher is interested in predicting $y$ = the number of runs scored, using the data set Baseball 2013 and the following predictor variables:

Use the Strategy for Building a Multiple Regression Model to build the best multiple regression model for predicting the number of runs scored using these predictor variables, at level of significance $\alpha = 0.05$.

Solution

The data set Baseball 2013 contains the batting statistics of the players in Major League Baseball who had at least 100 at-bats during the 2013 season (Source: www.seanlahman.com/baseball-archive/statistics).

  • Step 1 The $F$ test. Figure 30 shows the Minitab results of a regression of $y$ on the full set of predictor variables. The $p$-value for the $F$ test is significant, so we know that a linear relationship exists between $y$ and at least one of the $x$ variables.
  • Step 2 The $t$ test (the first time). In Figure 30, the $p$-value for Batting Average is greater than the level of significance $\alpha = 0.05$. We therefore eliminate Batting Average from the model. Perhaps surprisingly, a player's batting average is evidently not helpful in predicting the number of runs that player will score when all other predictors are held constant.

    FIGURE 30 Step 1: $F$ test is significant.
    FIGURE 31 All variables are significant; we have our final model.
  • Step 2 The $t$ test (the second time). We repeat Step 2 as long as there are variables with $p$-values greater than the level of significance $\alpha = 0.05$. Figure 31 shows the results of performing the multiple regression of $y$ on all the variables except Batting Average. No further variables have $p$-values above 0.05; therefore, no further variables are excluded from the model. In other words, we have our final model.
  • Step 3 Verify the assumptions. For our final model, we now verify the regression assumptions. Figures 32 and 33 show no patterns for the bulk of the data that would indicate a violation of the regression assumptions. We therefore conclude that the regression assumptions are verified.
    FIGURE 32 Scatterplot of residuals versus fitted values.
    FIGURE 33 Normal probability plot of the residuals.
  • Step 4 Report and interpret your final model.
    1. The multiple regression equation for the final model is shown here.

    2. We interpret the coefficient for Hits, and leave to the exercises the interpretation of the other multiple regression coefficients. “For each additional hit that a player makes, the estimated increase in the number of runs that player will score is 0.3182, when all the other variables are held constant.”
    3. The standard error of the estimate for the final model is $s \approx 6.5$. That is, using the multiple regression equation in part 1, the size of the typical prediction error will be about 6.5 runs. The value of the adjusted coefficient of determination is $R^2_{\text{adj}} = 0.9307$. In other words, 93.07% of the variability in the number of runs scored is accounted for by this multiple regression equation.
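
Steps 1, 2, and 4 of the strategy can also be sketched in code. The following is a minimal Python sketch under stated assumptions, not the Minitab procedure behind Figures 30 and 31. It assumes that each pass of Step 2 drops the single predictor with the largest $p$-value. The function names (`betainc`, `fit`, `build_model`) and the tiny data set (`x1`, `x2`, `y`) are hypothetical, invented for illustration, and are not taken from Baseball 2013. The $p$-values for the $F$ and $t$ tests are computed from the regularized incomplete beta function so that only the standard library is needed. Step 3 (the residual plots) is not covered.

```python
import math

def betainc(a, b, x):
    """Regularized incomplete beta I_x(a, b), by a continued fraction (Lentz)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x > (a + 1.0) / (a + b + 2.0):      # symmetry keeps convergence fast
        return 1.0 - betainc(b, a, 1.0 - x)
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x)) / a
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 500):
        terms = (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                 -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1)))
        for term in terms:
            d = 1.0 + term * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + term / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < 1e-14:
            break
    return front * h

def fit(columns, y):
    """Least squares fit of y on the named columns plus an intercept."""
    names = list(columns)
    n, k = len(y), len(names)
    p = k + 1
    rows = [[1.0] + [columns[v][i] for v in names] for i in range(n)]
    # Augmented matrix [X'X | I]; Gauss-Jordan turns it into [I | (X'X)^-1].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [float(i == j) for j in range(p)] for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(p):
            if r != col:
                f = M[r][col]
                M[r] = [u - f * w for u, w in zip(M[r], M[col])]
    inv = [row[p:] for row in M]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    sse = sum((yi - sum(b * v for b, v in zip(coef, r))) ** 2
              for r, yi in zip(rows, y))
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    df = n - k - 1
    mse = sse / df
    f_stat = ((sst - sse) / k) / mse
    p_f = betainc(df / 2, k / 2, df / (df + k * f_stat))   # upper tail of F
    p_t = {}
    for j, v in enumerate(names, start=1):
        t = coef[j] / math.sqrt(mse * inv[j][j])
        p_t[v] = betainc(df / 2, 0.5, df / (df + t * t))   # two-sided t test
    return {"coef": dict(zip(["const"] + names, coef)), "p_f": p_f, "p_t": p_t,
            "s": math.sqrt(mse), "adj_r2": 1 - mse / (sst / (n - 1))}

def build_model(columns, y, alpha=0.05):
    kept = dict(columns)
    model = fit(kept, y)
    if model["p_f"] > alpha:               # Step 1: overall F test
        return None
    while True:                            # Step 2: t tests, repeated
        worst = max(model["p_t"], key=model["p_t"].get)
        if model["p_t"][worst] <= alpha:
            return model                   # every remaining predictor is significant
        del kept[worst]                    # assumption: drop the largest p-value
        if not kept:
            return None
        model = fit(kept, y)

# Hypothetical toy data: y depends on x1 only; x2 is unrelated to y.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
y = [1 + 2 * a + e for a, e in zip(x1, noise)]
final = build_model({"x1": x1, "x2": x2}, y)
print(final["coef"], final["s"], final["adj_r2"])   # Step 4: report the model
```

In this toy run, `x2` plays the role that Batting Average plays in the example: it is removed in Step 2, and the model is refit without it. The returned `s` and `adj_r2` correspond to the standard error of the estimate and the adjusted coefficient of determination reported in Step 4.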

NOW YOU CAN DO

Exercises 24–26 and 31–33.