12

Describing Distributions with Numbers

image
FRANCES M. ROBERTS/Newscom

CASE STUDY Does education pay? We are told that people with more education earn more, on average, than people with less education. How much more? How can we answer this question?

Data on income can be found at the Census Bureau website. The data are estimates, for the year 2013, of the total incomes of 136,641,000 people aged 25 and over with earnings and are based on the results of the Current Population Survey in 2014. The website gives the income distribution for each of several education categories. In particular, it gives the number of people in each of several education categories who earned between $1 and $2499, between $2500 and $4999, up to between $97,500 and $99,999, and $100,000 and over. That is a lot of information. A histogram could be used to display the data, but are there simple ways to summarize the information with just a few numbers that allow us to make sensible comparisons?

In this chapter, we will learn several ways to summarize large data sets with a few numbers. By the end of the chapter, you will be able to use these methods to answer whether education really pays.

Baseball has a rich tradition of using statistics to summarize and characterize the performance of players. We begin by investigating ways to summarize the performance of the greatest home-run hitters of all time.

In the summer of 2007, Barry Bonds shattered the career home run record, breaking the previous record set by Hank Aaron. Here are his home run counts for the years 1986 (his rookie year) to 2007 (his final season):

Year       1986  1987  1988  1989  1990  1991  1992  1993  1994  1995  1996
Home runs    16    25    24    19    33    25    34    46    37    33    42

Year       1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007
Home runs    40    37    34    49    73    46    45    45     5    26    28

image
Figure 12.1 Stemplot of the number of home runs hit by Barry Bonds in his 22-year career.

The stemplot in Figure 12.1 displays the data. The shape of the distribution is a bit irregular, but we see that it has one high outlier, and if we ignore this outlier, we might describe it as slightly skewed to the left with a single peak. The outlier is, of course, Bonds’s record season in 2001.
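A stemplot like the one described here can be rebuilt directly from the listed home run counts. The following is a minimal Python sketch, assuming the tens digit serves as the stem and the ones digit as the leaf; the function name `stemplot` is illustrative, and the printed layout may differ in detail from Figure 12.1.

```python
from collections import defaultdict

# Barry Bonds's home run counts, 1986-2007, as listed in the text.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
             40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

def stemplot(values):
    """Return stemplot lines with tens digits as stems and ones digits as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(d) for d in leaves[stem])
        lines.append(f"{stem} | {row}".rstrip())
    return lines

print("\n".join(stemplot(home_runs)))
```

In the output, the empty stems 5 and 6 separate the 73-home-run season from the rest of the data, which is what marks it as a high outlier.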

A graph and a few words give a good description of Barry Bonds’s home run career. But words are less adequate to describe, for example, the incomes of people with a high school education. We need numbers that summarize the center and variability of a distribution.