Chapter 1. Numerical Data Descriptions: Mean and Standard Deviation

1.1 Introduction

When the data we are interested in are quantitative, we commonly summarize the data not only graphically but also numerically. As with our graphical descriptions, we would like our numerical descriptions to address the important characteristics of the data set—its shape, center and spread. Consider once again the data table that gave information about some of the 25 top-grossing movies in the United States in November 2010.

Movie                                          Year  Box Office             Genre      Running Time  Academy Awards  MPAA Rating
                                                     (millions of dollars)
E.T.: The Extra-Terrestrial                    1982  435                    Family     115           4               PG
Star Wars: Episode I - The Phantom Menace      1999  431                    Sci-Fi     133           0               PG
Pirates of the Caribbean: Dead Man's Chest     2006  423                    Adventure  130           1               PG13
Toy Story 3                                    2010  415                    Animation  103           *               G
Spider-Man                                     2002  404                    Action     121           0               PG13
Transformers: Revenge of the Fallen            2009  402                    Action     150           0               PG13
Star Wars: Episode III - Revenge of the Sith   2005  380                    Sci-Fi     140           0               PG13
The Lord of the Rings: The Return of the King  2003  377                    Fantasy    201           11              PG13

*as of February 2011
Source: The Internet Movie Database

In reviewing this table we see, in particular, that the third column and the fifth column each contain values of a quantitative variable. In order to get a “picture” of these data, we might want to determine how much, on average, a top-grossing film earned, or whether the running time for The Lord of the Rings: The Return of the King was unusually long. These are questions that can be answered using numerical descriptive measures.

1.2 Locating the Center: Mean

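We will begin our discussion by considering a numerical description with which you are probably familiar, the mean. The mean of a set of quantitative data is simply its arithmetic average, the very same average you learned to calculate in elementary school. The mean is found by summing all data values, then dividing the result by the number of values. We indicate this symbolically using formulas:

\[ \mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \text{(population mean)} \]

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \text{(sample mean)} \]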

You are probably thinking that these two formulas are remarkably similar; indeed, they are nearly identical. That is because what we do is exactly the same; what we say depends on whether we are working with a population or a sample.

In working with a population, we denote the mean by µ, the Greek letter mu. The upper case N is the population size (the number of values we are adding), and the x’s are the individual data values, labeled from 1 to N. If we are calculating the mean of a sample, we indicate the result by x̄, which we call (not very creatively) x-bar, with the lower case n the sample size, and the x’s labeled from 1 to n.
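
As a small illustration, the sketch below computes a sample mean in Python for the eight films listed in the table in Section 1.1, treating them as a sample of size n = 8 drawn from the 25 top-grossing films. The function name `mean` and the variable names are illustrative choices, not notation from this chapter, and the running times are assumed to be in minutes, since the table does not state a unit.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Values copied from the table of eight films in Section 1.1.
box_office = [435, 431, 423, 415, 404, 402, 380, 377]    # millions of dollars
running_time = [115, 133, 130, 103, 121, 150, 140, 201]  # assumed to be minutes

n = len(box_office)  # sample size, n = 8

print("n =", n)
print("x-bar (box office)   =", mean(box_office))    # 3267 / 8 = 408.375
print("x-bar (running time) =", mean(running_time))  # 1093 / 8 = 136.625
```

Had these eight values been an entire population rather than a sample, the arithmetic would be identical; only the symbols would change, with µ and N in place of x̄ and n.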

While this might seem unnecessarily “picky” at first glance, the notation indicates two important distinctions that we must keep in mind:

  • We denote the mean µ of a population of size N and the mean x̄ of a sample of size n differently, because the information they provide is different in nature. A population has only one mean, and if we calculate it correctly, we have the mean itself rather than an estimate of it. When we calculate a sample mean, we are generally interested in using it to estimate the population mean. So while each sample has only one mean, if we select a different sample, we are likely to get a different sample mean. Recall that when we calculate a numerical descriptive measure for a population, we are finding a parameter; if we calculate a numerical descriptive measure for a sample, we are finding a statistic. It is easy to remember which is which, because the first letters match: calculating from a population, you have a parameter; calculating from a sample, you have a statistic. Further, we use Greek letters for parameters and English ones for statistics. So µ is a parameter, while x̄ is a statistic.