
17.1 The Logistic Regression Model

In general, the data for simple logistic regression are n independent cases, each consisting of a value of the explanatory variable x and either a success or a failure for that trial. For example, x may be the salary amount offered, and “success” means that the applicant accepted the job offer. Every observation may have a different value of x.

To introduce logistic regression, however, it is convenient to start with the special case in which the explanatory variable x is also a yes-or-no variable. The data then contain a number of outcomes (success or failure) for each of the two values of x. There are also just two values of p, one for each value of x. Assuming the count of successes for each value of x has a binomial distribution, we are on familiar ground as described in Chapters 5 and 8. Here is an example.


CASE 17.1 Clothing Color and Tipping

What are the factors that affect a customer's tipping behavior? Studies have shown that a server's gender and various aspects of a server's appearance have an effect on tipping, unrelated to the quality of service. Some of these same studies have shown that the effect is different for male and female customers.

Because the color red has been shown to increase the physical attractiveness of women, a group of researchers decided to see if the color of clothing a female server wears has an effect on the tipping behavior.2 Although they considered both male and female customers, we focus on the 418 male customers in the study.

The response variable is whether or not the male customer left a tip. The explanatory variable is whether the female server wore a red top or not. Let's express this condition numerically using an indicator variable,

x = \begin{cases} 1 & \text{if the server wore a red top} \\ 0 & \text{if the server wore a different colored top} \end{cases}

The female servers in the study wore a red top for 69 of the customers and wore a different colored top for the other 349.

The probability that a randomly chosen customer will tip has two values, p1 for those whose server wore a red top and p0 for those whose server wore a different colored top. The number of customers who tipped a server wearing a red top has the binomial distribution B(69, p1). The number of customers who tipped a server wearing a different color top has the B(349, p0) distribution.

Binomial distributions and odds

We begin with a review of some ideas associated with binomial distributions.

EXAMPLE 17.1 Proportion of Tippers


CASE 17.1 In Chapter 8, we used sample proportions to estimate population proportions. For this study, 40 of the 69 male customers tipped a server who was wearing red and 130 of the 349 customers tipped a server who was wearing a different color. Our estimates of the two population proportions are

\text{red: } \hat{p}_1 = \frac{40}{69} = 0.5797


and

\text{not red: } \hat{p}_0 = \frac{130}{349} = 0.3725

That is, we estimate that 58.0% of the male customers will tip if the server wears red, and 37.3% of the male customers will tip if the server wears a different color.

Reminder


odds, p. 582

Logistic regression works with odds rather than proportions. The odds are the ratio of the proportions for the two possible outcomes. If p is the probability of a success, then 1 − p is the probability of a failure, and

\text{odds} = \frac{p}{1-p} = \frac{\text{probability of success}}{\text{probability of failure}}

A similar formula for the sample odds is obtained by substituting p̂ for p in this expression.

EXAMPLE 17.2 Odds of Tipping

CASE 17.1 The proportion of tippers among male customers who have a server wearing red is p̂1 = 0.5797, so the proportion of male customers who are not tippers when their server wears red is

1 - \hat{p}_1 = 1 - 0.5797 = 0.4203

The estimated odds of a male customer tipping when the server wears red are, therefore,

\text{odds} = \frac{\hat{p}_1}{1 - \hat{p}_1} = \frac{0.5797}{1 - 0.5797} = 1.3793

For the case when the server does not wear red, the odds are

\text{odds} = \frac{\hat{p}_0}{1 - \hat{p}_0} = \frac{0.3725}{1 - 0.3725} = 0.5936

When people speak about odds, they often round to integers or fractions. Because 1.3793 is approximately 7/5, we could say that the odds that a male customer tips when the server wears red are 7 to 5. In a similar way, we could describe the odds that a male customer does not tip when the server wears red as 5 to 7.
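
These calculations are easy to reproduce. Here is a minimal Python sketch that recomputes the sample proportions and odds from the counts in Case 17.1 (the variable names are ours):

# Sketch: sample proportions and odds for the tipping data in Case 17.1.
n_red, tips_red = 69, 40          # customers whose server wore red, and how many of them tipped
n_other, tips_other = 349, 130    # customers whose server wore a different color, and how many tipped

def odds(p):
    """Odds of success when the probability of success is p."""
    return p / (1 - p)

p_red = tips_red / n_red          # 0.5797
p_other = tips_other / n_other    # 0.3725
print(round(odds(p_red), 4))      # about 1.3793
print(round(odds(p_other), 4))    # about 0.5936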

Apply Your Knowledge

Question 17.1

17.1 Energy drink commercials.

A study was designed to compare Red Bull energy drink commercials. Each participant was shown the commercials, A and B, in random order and asked to select the better one. There were 140 women and 130 men who participated in the study. Commercial A was selected by 65 women and by 67 men. Find the odds of selecting Commercial A for the men. Do the same for the women.

17.1

For men: the proportion who chose Commercial A is 0.5154; the proportion who chose Commercial B is 0.4846. The odds are 1.0636. For women: the proportion who chose Commercial A is 0.4643; the proportion who chose Commercial B is 0.5357. The odds are 0.8667.
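
A quick numeric check of these values, as a Python sketch using the counts given in the exercise:

# Sketch: odds of choosing Commercial A, from the counts in Exercise 17.1.
p_men = 67 / 130      # 0.5154
p_women = 65 / 140    # 0.4643
print(round(p_men / (1 - p_men), 4))      # about 1.0635 (1.0636 above comes from rounding the proportions first)
print(round(p_women / (1 - p_women), 4))  # about 0.8667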


Question 17.2

17.2 Use of audio/visual sharing through social media.

In Case 8.3 (page 438), we studied data on large and small food and beverage companies and the use of audio/visual sharing through social media. Here are the data:

Observed numbers of companies

              Use A/V sharing
Size         Yes     No   Total
Small        150     28     178
Large         27     25      52
Total        177     53     230

What proportion of the small companies use audio/visual sharing? What proportion of the large companies use audio/visual sharing? Convert each of these proportions to odds.

Model for logistic regression

In Chapter 8, we learned how to compare the proportions of two groups (such as large and small companies) using z tests and confidence intervals. Simple logistic regression is another way to make this comparison, but it extends to more general settings with a success-or-failure response variable.

In simple linear regression we modeled the mean μ of the response variable y as a linear function of the explanatory variable: μ = β0 + β1x. When y is just 1 or 0 (success or failure), the mean is the probability p of a success. Simple logistic regression models the mean p in terms of an explanatory variable x. We might try to relate p and x as in simple linear regression: p = β0 + β1x. Unfortunately, this is not a good model. Whenever β1 ≠ 0, extreme values of x will give values of β0 + β1x that fall outside the range of possible values of p, 0 ≤ p ≤ 1.
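
To see the difficulty concretely, here is a small Python sketch; the intercept and slope used are made-up values chosen only for illustration:

# Sketch: a straight-line model for p can leave the interval from 0 to 1.
b0, b1 = 0.3, 0.1    # hypothetical intercept and slope, for illustration only
for x in (-5, 0, 5, 10):
    p = b0 + b1 * x
    print(x, p)      # x = -5 gives p = -0.2 and x = 10 gives p = 1.3, which are impossible probabilities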

log odds or logit

The logistic regression model removes this difficulty by working with the natural logarithm of the odds. We use the term log odds or logit for this transformation of p. We model the log odds as a linear function of the explanatory variable:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x

As p moves from 0 to 1, the log odds move through all negative and positive numerical values. Here is a summary of the logistic regression model.

Simple Logistic Regression Model

The statistical model for simple logistic regression is

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x

where p is a binomial proportion and x is the explanatory variable. The parameters of the logistic model are β0 and β1.

Figure 17.1 graphs the relationship between p and x for some different values of β0 and β1. The logistic regression model uses natural logarithms. There are tables of natural logarithms and most calculators have a built-in function for the natural logarithm, often labeled “ln.”

FIGURE 17.1 Plot of p versus x for selected values of β0 and β1.
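
Curves like those in Figure 17.1 are easy to compute: solving the model equation for p gives p = e^(β0 + β1x) / (1 + e^(β0 + β1x)), which always stays between 0 and 1. A minimal Python sketch, with arbitrary parameter values chosen only for illustration:

import math

# Sketch: the S-shaped curves of Figure 17.1.
# Solving log(p / (1 - p)) = b0 + b1*x for p gives the inverse logit transform below.
def inverse_logit(b0, b1, x):
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

for b0, b1 in [(-4.0, 2.0), (0.0, 1.0)]:      # arbitrary parameter values, for illustration only
    curve = [round(inverse_logit(b0, b1, x), 3) for x in range(-4, 5)]
    print(b0, b1, curve)                      # every value lies strictly between 0 and 1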

Returning to the tipping study, for servers wearing red we have

\log(\text{odds}) = \log(1.3793) = 0.3216

and for servers not wearing red we have

\log(\text{odds}) = \log(0.5936) = -0.5215

Verify these results with your calculator, remembering that “log” in these equations is the natural logarithm.
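
The same check can be done in Python, where math.log is the natural logarithm:

import math

# Sketch: log odds for the tipping data (natural logarithm).
print(round(math.log(1.3793), 4))   # about 0.3216, servers wearing red
print(round(math.log(0.5936), 4))   # about -0.5215, servers wearing a different color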

Apply Your Knowledge

Question 17.3

17.3 Log odds choosing Commercial A.

Refer to Exercise 17.1. Find the log odds for the men and the log odds for the women choosing Commercial A.

17.3

For men: 0.06166. For women: -0.14306.

Question 17.4

17.4 Log odds for use of audio/visual sharing.

Refer to Exercise 17.2. Find the log odds for the small and large companies.

Fitting and interpreting the logistic regression model

We must now fit the logistic regression model to data. In general, the data consist of n observations on the explanatory variable x, each with a success-or-failure response. Our tipping example has an indicator (0 or 1) explanatory variable. Logistic regression with an indicator explanatory variable is a special case but is important in practice. We use this special case to understand a little more about the model.

EXAMPLE 17.3 Logistic model for Tipping Behavior

CASE 17.1 In the tipping example, there are n=418 observations. The explanatory variable is whether the server wore a red top or not, which we coded using an indicator variable with values x=1 for servers wearing red and x=0 for servers wearing a different color. There are 69 observations with x=1 and 349 observations with x=0. The response variable is also an indicator variable: y=1 if the customer left a tip and y=0 if not. The model says that the probability p of leaving a tip depends on the color of the server's top (x=1 or x=0). There are two possible values for p—say, p1 for servers wearing red and p0 for servers wearing a different color. The model says that for servers wearing red

\log\left(\frac{p_1}{1-p_1}\right) = \beta_0 + \beta_1


and for servers wearing a different color

\log\left(\frac{p_0}{1-p_0}\right) = \beta_0

Note that there is a β1 term in the equation for servers wearing red because x=1, but it is missing in the equation for servers wearing a different color because x=0.

In general, the calculations needed to find the estimates b0 and b1 for the parameters β0 and β1 are complex and require the use of software. When the explanatory variable has only two possible values, however, we can easily find the estimates. This simple framework also provides a setting where we can learn what the logistic regression parameters mean.

EXAMPLE 17.4 Parameter Estimates for Tipping Behavior

CASE 17.1 For the tipping example, we found the log odds for servers wearing red,

\log\left(\frac{\hat{p}_1}{1-\hat{p}_1}\right) = 0.3216

and for servers wearing a different color,

\log\left(\frac{\hat{p}_0}{1-\hat{p}_0}\right) = -0.5215

To find estimates b0 and b1 of the model parameters β0 and β1 we match the two model equations in Example 17.3 with the corresponding data equations. Because

\log\left(\frac{p_0}{1-p_0}\right) = \beta_0 \quad\text{and}\quad \log\left(\frac{\hat{p}_0}{1-\hat{p}_0}\right) = -0.5215

the estimate b0 of the intercept is simply the log(odds) for servers wearing a different color,

b_0 = -0.5215

Similarly, the estimated slope is the difference between the log(odds) for servers wearing red and the log(odds) for servers wearing a different color,

b_1 = 0.3216 - (-0.5215) = 0.8431

The fitted logistic regression model is

\log(\text{odds}) = -0.5215 + 0.8431x

The slope in this logistic regression model is the difference between the log(odds) for servers wearing red and the log(odds) for servers wearing a different color. Most people are not comfortable thinking in the log(odds) scale, so interpretation of the results in terms of the regression slope is difficult.
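
For this indicator-variable special case, the estimates of Example 17.4 can be reproduced directly from the group counts. Here is a minimal Python sketch, assuming only the counts from Case 17.1:

import math

# Sketch: logistic regression estimates when x is an indicator variable.
def log_odds(successes, n):
    p_hat = successes / n
    return math.log(p_hat / (1 - p_hat))

logit_red = log_odds(40, 69)        # about  0.3216
logit_other = log_odds(130, 349)    # about -0.5215

b0 = logit_other                    # intercept: the log odds when x = 0
b1 = logit_red - logit_other        # slope: the difference in log odds
print(round(b0, 4), round(b1, 4))   # about -0.5215 and 0.8431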

EXAMPLE 17.5 Transforming Estimates to the Odds Scale

CASE 17.1 To get to the odds scale, we take the exponential of the log(odds). Based on the parameter estimates in Example 17.4,

\text{odds} = e^{-0.5215 + 0.8431x} = e^{-0.5215} \times e^{0.8431x}

Page 17-7

From this, the ratio of the odds for a server wearing red (x=1) and for a server wearing a different color (x=0) is

\frac{\text{odds}_{\text{red}}}{\text{odds}_{\text{other}}} = e^{0.8431} = 2.324

odds ratio

The transformation e^0.8431 undoes the natural logarithm and transforms the logistic regression slope into an odds ratio, in this case, the comparison of the odds that a male customer tips when a server is wearing red to the odds that a male customer tips when a server is wearing a different color. In other words, we can multiply the odds of tipping when a server wears a different color by the odds ratio to obtain the odds of tipping for a server wearing red:

\text{odds}_{\text{red}} = 2.324 \times \text{odds}_{\text{other}}

In this case, the odds of tipping when a server wears red are about 2.3 times the odds when a server wears a different color.

Notice that we have chosen the coding for the indicator variable so that the regression slope is positive. This will give an odds ratio that is greater than 1. Had we coded servers wearing a different color as 1 and servers wearing red as 0, the sign of the slope would be reversed, the fitted equation would be log(odds) = 0.3216 − 0.8431x, and the odds ratio would be e^−0.8431 = 0.430. The odds of tipping for servers wearing a different color are roughly 43% of the odds for servers wearing red.
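
Both odds ratios can be checked with one line each of Python (a sketch, using the slope estimated above):

import math

# Sketch: exponentiating the slope gives the odds ratio; reversing the coding inverts it.
b1 = 0.8431
print(round(math.exp(b1), 3))    # about 2.324, the odds ratio for red versus other colors
print(round(math.exp(-b1), 3))   # about 0.430, the odds ratio when the coding is reversed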

Of course, it is often the case that the explanatory variable is quantitative rather than an indicator variable. We must then use software to fit the logistic regression model. Here is an example.

EXAMPLE 17.6 Will a Movie Be Profitable?


The MOVIES data set (described on page 550) includes both the movie's budget and the total U.S. revenue. For this example, we classify each movie as “profitable” (y=1) if U.S. revenue is larger than the budget and nonprofitable (y=0) otherwise. This is our response variable.

FIGURE 17.2 Scatterplot of the movie profit data with a scatterplot smoother, Example 17.6. The smoother suggests the upper half of an S-shape similar to those shown in Figure 17.1.

The data set contains several explanatory variables, but we focus here on the natural logarithm of the opening-weekend revenue, LOpening. Figure 17.2 is a scatterplot of the data with a scatterplot smoother (page 68). The probability that a movie is profitable increases with the log opening weekend revenue. Because an S-shaped curve like those in Figure 17.1 is suggested by the smoother, we fit the logistic regression model

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x

where p is the probability that the movie is profitable and x is the log opening-weekend revenue. The model for estimated log odds fitted by software is

\log(\text{odds}) = b_0 + b_1 x = 1.41 + 0.781x

The estimated odds ratio is e^b1 = e^0.781 = 2.184. This means that if opening-weekend revenue were roughly e^1 ≈ 2.72 times larger (for example, $18.1 million to $49.2 million), so that x increases by 1, the odds that the movie will be profitable are multiplied by about 2.2.
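
With a quantitative explanatory variable such as LOpening, the fit must come from software. Here is a sketch of one way to do it in Python with pandas and statsmodels; the file name movies.csv and the column names profitable and LOpening are placeholders for however the MOVIES data are stored on your system.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Sketch: fitting the simple logistic regression of Example 17.6.
# "movies.csv", "profitable" (coded 0/1), and "LOpening" are hypothetical names for the MOVIES data.
movies = pd.read_csv("movies.csv")

fit = smf.logit("profitable ~ LOpening", data=movies).fit()
print(fit.params)           # intercept b0 and slope b1 on the log odds scale
print(np.exp(fit.params))   # exponentiating the slope gives the estimated odds ratio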

Apply Your Knowledge

Question 17.5

17.5 Fitted model for energy drink commercials.

Refer to Exercises 17.1 and 17.3. Find the estimates b0 and b1 and give the fitted logistic model. What is the estimated odds ratio for a male participant choosing Commercial A (x = 1) versus a female participant choosing Commercial A (x = 0)?

17.5

b0 = -0.14306, b1 = 0.20472. log(odds) = -0.14306 + 0.20472x. The odds ratio is e^0.20472 = 1.227.

Question 17.6

17.6 Fitted model for use of audio/visual sharing.

Refer to Exercises 17.2 and 17.4. Find the estimates b0 and b1 and give the fitted logistic model. What is the estimated odds ratio for small (x = 1) versus large (x = 0) companies?

Question 17.7

17.7 Interpreting an odds ratio.

If we apply the exponential function to the fitted model in Example 17.6, we get

\text{odds} = e^{1.41 + 0.781x} = e^{1.41} \times e^{0.781x}

Show that for any value of the quantitative explanatory variable x, the odds ratio for increasing x by 1,

\frac{\text{odds}_{x+1}}{\text{odds}_x}

is e^0.781 = 2.184. This justifies the interpretation given at the end of Example 17.6.

17.7

\frac{\text{odds}_{x+1}}{\text{odds}_x} = \frac{e^{1.41 + 0.781(x+1)}}{e^{1.41 + 0.781x}} = \frac{e^{1.41} \times e^{0.781x} \times e^{0.781}}{e^{1.41} \times e^{0.781x}} = e^{0.781} = 2.184
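
The algebra can also be confirmed numerically for a few values of x; a short Python sketch using the estimates printed in Example 17.6:

import math

# Sketch: the ratio of the odds at x + 1 to the odds at x does not depend on x.
b0, b1 = 1.41, 0.781   # estimates as printed in Example 17.6
for x in (1.0, 2.5, 4.0):
    ratio = math.exp(b0 + b1 * (x + 1)) / math.exp(b0 + b1 * x)
    print(x, round(ratio, 3))   # about 2.184 each time, that is, e**0.781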

The odds ratio interpretation of the estimated slope parameter is a very attractive feature of the logistic regression model. The health sciences, for example, have used this model extensively to identify risk factors for disease and illness. There are other statistical models, such as probit regression, that describe binary responses, but none of them has this interpretation.

probit regression
