We have concentrated on relationships in which at least the response variable is quantitative. Now we shift to describing relationships between two or more categorical variables. Some variables—such as gender, race, and occupation—are categorical by nature. Other categorical variables are created by grouping values of a quantitative variable into classes. Published data often appear in grouped form to save space. To analyze categorical data, we use the counts or percents of cases that fall into various categories.
Market researchers know that background music can influence the mood and the purchasing behavior of customers. One study in a supermarket in Northern Ireland compared three treatments: no music, French accordion music, and Italian string music. Under each condition, the researchers recorded the numbers of bottles of French, Italian, and other wine purchased.^{14} Here is the twoway table that summarizes the data:
Music  

Wine  None  French  Italian  Total 
French  30  39  30  99 
Italian  11  1  19  31 
Other  43  35  35  113 
Total  84  75  84  243 
105
The data table for Case 2.2 is a twoway table because it describes two categorical variables. The type of wine is the row variable because each row in the table describes the data for one type of wine. The type of music played is the column variable because each column describes the data for one type of music.The entries in the table are the counts of bottles of wine of the particular type sold while the given type of music was playing. The two variables in this example, wine and music, are both categorical variables.
twoway table row and column variables
This twoway table is a 3 × 3 table, to which we have added the marginal totals obtained by summing across rows and columns. For example, the firstrowtotal is 30 + 39 + 30 = 99. The grand total, the number of bottles of wine in the study, can be computed by summing the row totals, 99 + 31 + 113 = 243, or the column totals, 84 + 75 + 84 = 243. It is a good idea to do both as a check on your arithmetic.
How can we best grasp the information contained in the wine and music table? First, look at the distribution of each variable separately. The distribution of a categorical variable says how often each outcome occurred. The “Total” column at the right margin of the table contains the totals for each of the rows. These are called marginal row totals. They give the numbers of bottles of wine sold by the type of wine: 99 bottles of French wine, 31 bottles of Italian wine, and 113 bottles of other types of wine. Similarly, the marginal column totals are given in the “Total” row at the bottom margin of the table. These are the numbers of bottles of wine that were sold while different types of music were being played: 84 bottles when no music was playing, 75 bottles when French music was playing, and 84 bottles when Italian music was playing.
marginal row totals
marginal column totals
Percents are often more informative than counts. We can calculate the distribution of wine type in percents by dividing each row total by the table total. This distribution is called the marginal distribution of wine type.
marginal distribution
To find the marginal distribution for the row variable in a twoway table, divide each row total by the total number of entries in the table. Similarly, to find the marginal distribution for the column variable in a twoway table, divide each column total by the total number of entries in the table.
Although the usual definition of a distribution is in terms of proportions, we often multiply these by 100 to convert them to percents. You can describe a distribution either way as long as you clearly indicate which format you are using.
CASE 2.2 Let’s find the marginal distribution for the types of wine sold. The counts that we need for these calculations are in the margin at the right of the table:
Wine  Total 

French  99 
Italian  31 
Other  113 
Total  243 
106
The percent of bottles of French wine sold is
Similar calculations for Italian wine and other wine give the following distribution in percents:
Wine  French  Italian  Other 
Percent  40.74  12.76  46.50 
The total should be 100% because each bottle of wine sold is classified into exactly one of these three categories. In this case, the total is exactly 100%. Small deviations from 100% can occur due to roundoff error.
As usual, we prefer to display numerical summaries using a graph. Figure 2.21 is a bar graph of the distribution of wine type sold. In a twoway table, we have two marginal distributions, one for each of the variables that defines the table.
CASE 2.2 Find the marginal distribution for the type of music. Display the distribution using a graph.
In working with twoway tables, you must calculate lots of percents. Here’s a tip to help you decide what fraction gives the percent you want. Ask, “What group represents the total that I want a percent of?” The count for that group is the denominator of the fraction that leads to the percent. In Example 2.25, we wanted percents “of bottles of the different types of wine sold,” so the table total is the denominator.
Construct your own 2 × 3 table. Add the marginal totals and find the two marginal distributions.
The following table gives the number of students (in thousands) graduating from college with degrees in several fields of study for seven countries:^{15}
107
Field of study  Canada  France  Germany  Italy  Japan  U.K.  U.S. 

Social sciences, business, law  64  153  66  125  259  152  878 
Science, mathematics, engineering  35  111  66  80  136  128  355 
Arts and humanities  27  74  33  42  123  105  397 
Education  20  45  18  16  39  14  167 
Other  30  289  35  58  97  76  272 
The 3 × 3 table for Case 2.2 contains much more information than the two marginal distributions. We need to do a little more work to describe the relationship between the type of music playing and the type of wine purchased. Relationships among categorical variables are described by calculating appropriate percents from the counts given.
To find the conditional distribution of the column variable for a particular value of the row variable in a twoway table, divide each count in the row by the row total. Similarly, to find the conditional distribution of the row variable for a particular value of the column variable in a twoway table, divide each count in the column by the column total.
CASE 2.2 What types of wine were purchased when no music was playing? To answer this question, we find the marginal distribution of wine type for the value of music equal to none. The counts we need are in the first column of our table:
Music  

Wine  None 
French  30 
Italian  11 
Other  43 
Total  84 
What percent of French wine was sold when no music was playing? To answer this question, we divide the number of bottles of French wine sold when no music was playing by the total number of bottles of wine sold when no music was playing:
108
In the same way, we calculate the percents for Italian and other types of wine. Here are the results:
Wine type:  French  Italian  Other 
Percent when no music is playing:  35.7  13.1  51.2 
Other wine was the most popular choice when no music was playing, but French wine has a reasonably large share. Notice that these percents sum to 100%. There is no roundoff error here. The distribution is displayed in Figure 2.22.
In Example 2.26, we found the distribution of sales by wine type when no music was playing. In Exercise 2.94, you found the distribution when French music was playing, and in Exercise 2.95, you found the distribution when Italian music was playing. Examine these three conditional distributions carefully, and write a paragraph summarizing the relationship between sales of different types of wine and the music played.
109
For Case 2.2, we examined the relationship between sales of different types of wine and the music that was played by studying the three conditional distributions of type of wine sold, one for each music condition. For these computations, we used the counts from the 3 × 3 table, one column at a time. We could also have computed conditional distributions using the counts for each row. The result would be the three conditional distributions of the type of music played for each of the three wine types. For this example, we think that conditioning on the type of music played gives us the most useful data summary. Comparing conditional distributions can be particularly useful when the column variable is an explanatory variable.
The choice of which conditional distribution to use depends on the nature of the data and the questions that you want to ask. Sometimes you will prefer to condition on the column variable, and sometimes you will prefer to condition on the row variable. Occasionally, both sets of conditional distributions will be useful. Statistical software will calculate all of these quantities. You need to select the parts of the output that are needed for your particular questions. Don’t let computer software make this choice for you.
In Exercise 2.93, you examined data on fields of study for graduating college students from seven countries.
Refer to the previous exercise. Answer the same questions for the conditional distribution of country for each field of study.
In the previous two exercises, you examined the relationship between country and field of study in two different ways.
Statistical software will compute all of the quantities that we have discussed in this section. Included in some output is a very useful graphical summary called a mosaic plot. Here is an example.
mosaic plot
CASE 2.2 Output from JMP statistical software for the wine and music data is given in Figure 2.23. The mosaic plot is given in the top part of the display. Here, we think of music as the explanatory variable and wine as the response variable, so music is displayed across the x axis in the plot. The conditional distributions of wine for each type of music are displayed in the three columns. Note that when French is playing, 52% of the wine sold is French wine. The red bars display the percents of French wine sold for each type of music. Similarly, the green and blue bars display the correspondence to Italian wine and other wine, respectively. The widths of the three sets of bars display the marginal distribution of music. We can see that the proportions are approximately equal, but the French wine sold a little less than the other two categories of wine.
As is the case with quantitative variables, the effects of lurking variables can change or even reverse relationships between two categorical variables. Here is an example that demonstrates the surprises that can await the unsuspecting user of data.
A customer service center has a goal of resolving customer questions in 10 minutes or less. Here are the records for two representatives:
Representative  

Goal met  Ashley  Joshua 
Yes  172  118 
No  28  82 
Total  200  200 
Ashley has met the goal 172 times out of 200, a success rate of 86%. For Joshua, the success rate is 118 out of 200, or 59%. Ashley clearly has the better success rate.
111
Let’s look at the data in a little more detail. The data summarized come from two different weeks in the year.
Here are the counts broken down by week:
Week 1  Week 2  

Goal met  Ashley  Joshua  Ashley  Joshua 
Yes  162  19  10  99 
No  18  1  10  81 
Total  180  20  20  180 
For Week 1, Ashley met the goal 90% of the time (162/180), while Joshua met the goal 95% of the time (19/20). Joshua had the better performance in Week 1. What about Week 2? Here, Ashley met the goal 50% of the time (10/20), while the success rate for Joshua was 55% (99/180). Joshua again had the better performance. How does this analysis compare with the analysis that combined the counts for the two weeks? That analysis clearly showed that Ashley had the better performance, 86% versus 59%.
These results can be explained by a lurking variable related to week. The first week was during a period when the product had been in use for several months. Most of the calls to the customer service center concerned problems that had been encountered before. The representatives were trained to answer these questions and usually had no trouble in meeting the goal of resolving the problems quickly. On the other hand, the second week occurred shortly after the release of a new version of the product. Most of the calls during this week concerned new problems that the representatives had not yet encountered. Many more of these questions took longer than the 10minute goal to resolve.
Look at the total in the bottom row of the detailed table. During the first week, when calls were easy to resolve, Ashley handled 180 calls and Joshua handled 20. The situation was exactly the opposite during the second week, when calls were difficult to resolve. There were 20 calls for Ashley and 180 for Joshua.
The original twoway table, which did not take account of week, was misleading. This example illustrates Simpson’s paradox.
An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.
The lurking variables in Simpson’s paradox are categorical. That is, they break the cases into groups, as when calls are classified by week. Simpson’s paradox is just an extreme form of the fact that observed associations can be misleading when there are lurking variables.
Insurance companies and consumers are interested in the performance of hospitals. The government releases data about patient outcomes in hospitals that can be useful in making informed health care decisions. Here is a twoway table of data on the survival of patients after surgery in two hospitals. All patients undergoing surgery in a recent time period are included. “Survived” means that the patient lived at least six weeks following surgery.
112
Hospital A  Hospital B  

Died  63  16 
Survived  2037  784 
Total  2100  800 
What percent of Hospital A patients died? What percent of Hospital B patients died? These are the numbers one might see reported in the media.
Not all surgery cases are equally serious, however. Patients are classified as being in either “poor” or “good” condition before surgery. Here are the data broken down by patient condition. Check that the entries in the original twoway table are just the sums of the “poor” and “good” entries in this pair of tables.
Good Condition  

Hospital A  Hospital B  
Died  6  8 
Survived  594  592 
Total  600  600 
Poor Condition  
Hospital A  Hospital B  

Died  57  8 
Survived  1443  192 
Total  1500  200 
The data in Example 2.28 can be given in a threeway table that reports counts for each combination of three categorical variables: week, representative, and whether or not the goal was met. In Example 2.29, we constructed two twoway tables for representative by goal, one for each week. The original table, the one that we showed in Example 2.28, can be obtained by adding the corresponding counts for the two tables in Example 2.29. This process is called aggregating the data. When we aggregated data in Example 2.28, we ignored the variable week, which then became a lurking variable. Conclusions that seem obvious when we look only at aggregated data can become quite different when the data are examined in more detail.
threeway table
aggregation
For Exercise 2.91, see page 106; for 2.92 and 2.93, see pages 106–107; for 2.94 to 2.96, see page 108; for 2.97 to 2.99, see page 109; and for 2.100 and 2.101, see pages 111–112.
The Federal Reserve has called remote deposit capture (RDC) “the most important development the [U.S.] banking industry has seen in years.” This service allows users to scan checks and to transmit the scanned images to a bank for posting.^{16} In its annual survey of community banks, the American Bankers Association asked banks whether or not they offered this service.^{17} Here are the results classified by the asset size (in millions of dollars) of the bank:
Offer RDC  

Asset size ($ in millions) 
Yes  No 
Under $100  63  309 
$101 to $200  59  132 
$201 or more  112  s85 
Summarize the results of this survey question numerically and graphically. Write a short paragraph explaining the relationship between the size of a bank, measured by assets, and whether or not RDC is offered.
The survey described in the previous exercise also classified community banks by region. Here is the 6 × 2 table of counts:^{18}
Offer RDC  

Region  Yes  No 
Northeast  28  38 
Southeast  57  61 
Central  53  84 
Midwest  63  181 
Southwest  27  51 
West  61  76 
Summarize the results of this survey question numerically and graphically. Write a short paragraph explaining the relationship between the location of a bank, measured by region, and whether or not remote deposit capture is offered.
A survey of 656 boys and girls, ages 13 to 18, asked about adequate sleep and other healthrelated behaviors. The recommended amount of sleep is six to eight hours per night.^{19} In the survey, 54% of the respondents reported that they got less than this amount of sleep on school nights. The researchers also developed an exercise scale that was used to classify the students as above or below the median in how much they exercised. Here is the 2 × 2 table of counts with students classified as getting or not getting adequate sleep and by the exercise variable:
114
Exercise  

Enough sleep  High  Low 
Yes  151  115 
No  148  242 
Refer to the previous exercise.
The Census Bureau provides estimates of numbers of people in the United States classified in various ways.^{20} Let’s look at college students. The following table gives us data to examine the relation between age and fulltime or parttime status. The numbers in the table are expressed as thousands of U.S. college students.
Status  

Age  Fulltime  Parttime 
15–19  3388  389 
20–24  5238  1164 
25–34  1703  1699 
35 and over  762  2045 
Refer to the previous exercise.
One of the questions in a survey of high school students asked about lying to teachers.^{21} The accompanying table gives the numbers of students who said that they lied to a teacher about something significant at least once during the past year, classified by gender.
Gender  

Lied at least once  Male  Female 
Yes  6067  5966 
No  4145  5719 
The students surveyed in the study described in the previous exercise were also asked whether they thought trust and honesty were essential in business and the workplace. Here are the counts classified by gender:
Gender  

Trust and honesty are essential  Male  Female 
Agree  9,097  10,935 
Disagree  685  423 
Answer the questions given in the previous exercise for this survey question.
115
College courses taught at lower levels often have larger class sizes. The following table gives the number of classes classified by course level and class size.^{22} For example, there were 202 firstyear level courses with between one and nine students.
Class size  

Course level 
1–9  10–19  20–29  30–39  40–49  50–99  100 or more 
1  202  659  917  241  70  99  123 
2  190  370  486  307  84  109  134 
3  150  387  314  115  96  186  53 
4  146  256  190  83  67  64  17 
A company has been accused of age discrimination in hiring for operator positions. Lawyers for both sides look at data on applicants for the past three years. They compare hiring rates for applicants younger than 40 years and those 40 years or older.
Age  Hired  Not hired 

Younger than 40  82  1160 
40 or older  2  168 
A business school conducted a survey of companies in its state. It mailed a questionnaire to 200 small companies, 200 mediumsized companies, and 200 large companies. The rate of nonresponse is important in deciding how reliable survey results are. Here are the data on response to this survey:
Small  Medium  Large  

Response  124  80  41 
No response  76  120  159 
Total  200  200  200 
Companies planning to introduce a new product to the market must define the “target” for the product. Who do we hope to attract with our new product? Age and gender are two of the most important demographic variables. The following twoway table describes the age and marital status of American women.^{23} The table entries are in thousands of women.
Marital status  

Age (years)  Never married 
Married  Widowed  Divorced 
18 to 24  12,112  2,171  23  164 
25 to 39  9,472  18,219  177  2,499 
40 to 64  5,224  35,021  2,463  8,674 
≥ 65  984  9,688  8,699  2,412 
Refer to Exercises 2.113 and 2.114. Here are the corresponding counts for men:
116
Marital status  

Age (years)  Never married 
Married  Widowed  Divorced 
18 to 24  13,509  1,245  6  63 
25 to 39  12,685  16,029  78  1,790 
40 to 64  6,869  34,650  760  6,647 
≥ 65  685  12,514  2,124  1,464 
Answer the questions from Exercises 2.113 and 2.114 for these counts.
Wabash Tech has two professional schools, business and law. Here are twoway tables of applicants to both schools, categorized by gender and admission decision. (Although these data are made up, similar situations occur in reality.)
Business  

Admit  Deny  
Male  480  120 
Female  180  20 
Law  

Admit  Deny  
Male  10  90 
Female  100  200 
Recent studies have shown that earlier reports underestimated the health risks associated with being overweight. The error was due to lurking variables. In particular, smoking tends both to reduce weight and to lead to earlier death. Illustrate Simpson’s paradox by a simplified version of this situation. That is, make up tables of overweight (yes or no) by early death (yes or no) by smoker (yes or no) such that
Here are the row and column totals for a twoway table with two rows and two columns:
a  b  60 
c  d  60 
70  50  120 
Find two different sets of counts a, b, c, and d for the body of the table that give these same totals. This shows that the relationship between two variables cannot be obtained from the two individual distributions of the variables.
In Exercises 2.10 (page 72), 2.27 (page 78), and 2.58 (pages 95–96), you examined the relationship between the numbers of companies that are incorporated and are listed on their country’s stock exchange at the end of the year using data collected by the World Bank.^{24} In this exercise, you will explore the relationship between the numbers for 2012 and 2002 using logs.
117
Refer to the previous exercise.
The Organization for Economic Cooperation and Development (OECD) collects data on Main Economic Indicators (MEIs) for many countries. Each variable is recorded as an index, with the year 2000 serving as a base year. This means that the variable for each year is reported as a ratio of the value for the year divided by the value for 2000. Use of indices in this way makes it easier to compare values for different countries.^{25}
Refer to the previous exercise.
Refer to the previous two exercises.
For this exercise, we consider a hypothetical employee who starts working in Year 1 at a salary of $50,000. Each year her salary increases by approximately 5%. By Year 20, she is earning $126,000. The following table gives her salary for each year (in thousands of dollars):
Year  Salary  Year  Salary  Year  Salary  Year  Salary 

1  50  6  63  11  81  16  104 
2  53  7  67  12  85  17  109 
3  56  8  70  13  90  18  114 
4  58  9  74  14  93  19  120 
5  61  10  78  15  99  20  126 
Refer to the previous exercise. Figure 2.25 is a plot of the residuals versus year.
118
Refer to the previous two exercises. Figure 2.26 is a scatterplot with the leastsquares regression line for log salary versus year. For this model, r^{2} = 0.9995.
The individual whose salary we have been studying in Exercises 2.122 through 2.124 wants to do some financial planning. Specifically, she would like to predict her salary five years into the future, that is, for Year 25. She is willing to assume that her employment situation will be stable for the next five years and that it will be similar to the last 20 years.
119
Data on the salaries of a sample of professors in a business department at a large university are given below. The salaries are for the academic years 2014–2015 and 2015–2016.
2014–2015 salary ($) 
2015–2016 salary ($) 
2014–2015 salary ($) 
2015–2016 salary ($) 

145,700  147,700  136,650  138,650 
112,700  114,660  132,160  134,150 
109,200  111,400  74,290  76,590 
98,800  101,900  74,500  77,000 
112,000  113,000  83,000  85,400 
111,790  113,800  141,850  143,830 
103,500  105,700  122,500  124,510 
149,000  150,900  115,100  117,100 
Refer to the previous exercise.
Refer to the previous two exercises. The 2014–2015 salaries do an excellent job of predicting the 2015–2016 salaries. Is there anything more that we can learn from these data? In this department, there is a tradition of giving higherthanaverage percent raises to those whose salaries are lower. Let’s see if we can find evidence to support this idea in the data.
120
Colleges compete for students, and many students do careful research when choosing a college. One source of information is the rankings compiled by U.S. News & World Report. One of the factors used to evaluate undergraduate programs is the proportion of incoming students who graduate. This quantity, called the graduation rate, can be predicted by other variables such as the SAT or ACT scores and the high school records of the incoming students. One of the components in U.S. News & World Report rankings is the difference between the actual graduation rate and the rate predicted by a regression equation.^{26} In this chapter, we call this quantity the residual. Explain why the residual is a better measure to evaluate college graduation rates than the raw graduation rate.
The editor of a statistics text would like to plan for the next edition. A key variable is the number of pages that will be in the final version. Text files are prepared by the authors using a word processor called LaTeX, and separate files contain figures and tables. For the previous edition of the text, the number of pages in the LaTeX files can easily be determined, as well as the number of pages in the final version of the text. Here are the data:
Chapter  

1  2  3  4  5  6  7  8  9  10  11  12  13  
LaTeX pages 
77  73  59  80  45  66  81  45  47  43  31  46  26 
Text pages 
99  89  61  82  47  68  87  45  53  50  36  52  19 
Use the Internet to find the scores for the past season’s women’s basketball team at a college of your choice. Is there a relationship between the points scored by your chosen team and the points scored by their opponents? Summarize the data and write a report on your findings.
Refer to the previous exercise. Analyze the data for the men’s team from the same college, and compare your results with those for the women.
The following table gives the weight (in pounds) and amps for 19 circular saws. Saws with higher amp ratings tend to also be heavier than saws with lower amp ratings. We can quantify this fact using regression.
Weight  Amps  Weight  Amps  Weight  Amps 

11  15  9  10  11  13 
12  15  11  15  13  14 
11  15  12  15  10  12 
11  15  12  14  11  12 
12  15  10  10  11  12 
11  15  12  13  10  12 
13  15 
121
The table in the previous exercise gives the weight (in pounds) and amps for 19 circular saws. The data contain only five different amp ratings among the 19 saws.
Construct a set of data with two variables that have different means and correlation equal to one. Use your example to illustrate what correlation does and doesn’t say.
Simpson’s paradox occurs when a relationship between variables within groups of observations reverses when all of the data are combined. The phenomenon is usually discussed in terms of categorical variables, but it also occurs in other settings. Here is an example:
y  x  Group  y  x  Group 

10.1  1  1  18.3  6  2 
8.9  2  1  17.1  7  2 
8.0  3  1  16.2  8  2 
6.9  4  1  15.1  9  2 
6.1  5  1  14.3  10  2 
A wood product manufacturer is interested in replacing solidwood building material by lessexpensive products made from wood flakes.^{27} The company collected the following data to examine the relationship between the length (in inches) and the strength (in pounds per square inch) of beams made from wood flakes:
Length  5  6  7  8  9  10  11  12  13  14 
Strength  446  371  334  296  249  254  244  246  239  234 
Does taking aspirin regularly help prevent heart attacks? “Nearly five decades of research now link aspirin to the prevention of stroke and heart attacks.” So says the Bayer Aspirin website, bayeraspirin.com. The most important evidence for this claim comes from the Physicians’ Health Study. The subjects were 22,071 healthy male doctors at least 40 years old. Half the subjects, chosen at random, took aspirin every other day. The other half took a placebo, a dummy pill that looked and tasted like aspirin. Here are the results.^{28} (The row for “None of these” is left out of the twoway table.)
Aspirin group 
Placebo group 


Fatal heart attacks  10  26 
Other heart attacks  129  213 
Strokes  119  98 
Total  11,037  11,034 
What do the data show about the association between taking aspirin and heart attacks and stroke? Use percents to make your statements precise. Include a mosaic plot if you have access to the needed software. Do you think the study provides evidence that aspirin actually reduces heart attacks (cause and effect)?
122
You can see the headlines “More smokers than nonsmokers live at least 20 more years after being contacted for study!” A medical study contacted randomly chosen people in a district in England. Here are data on the 1314 women contacted who were either current smokers or who had never smoked. The tables classify these women by their smoking status and age at the time of the survey and whether they were still alive 20 years later.^{29}
Age 18 to 44  Age 45 to 64  Age 65+  

Smoker  Not  Smoker  Not  Smoker  Not  
Dead  19  13  78  52  42  165 
Alive  269  327  167  147  7  28 
Recycling is supposed to save resources. Some people think recycled products are lower in quality than other products, a fact that makes recycling less practical. People who actually use a recycled product may have different opinions from those who don’t use it. Here are data on attitudes toward coffee filters made of recycled paper among people who do and don’t buy these filters:^{30}
Think the quality of the recycled product is: 


Higher  The same  Lower  
Buyers  20  7  9 
Nonbuyers  29  25  43 