Chapter 1. Numerical Data Descriptions: Mean and Standard Deviation

1.1 Introduction

When the data we are interested in are quantitative, we commonly summarize the data not only graphically but also numerically. As with our graphical descriptions, we would like our numerical descriptions to address the important characteristics of the data set—its shape, center and spread. Consider once again the data table that gave information about some of the 25 top-grossing movies in the United States in November 2010.

Title	Year	Box Office (millions of dollars)	Genre	Running Time (minutes)	Academy Awards	MPAA Rating
E.T.: The Extra-Terrestrial	1982	435	Family	115	4	PG
Star Wars: Episode I - The Phantom Menace	1999	431	Sci-Fi	133	0	PG
Pirates of the Caribbean: Dead Man's Chest	2006	423	Adventure	130	1	PG13
Toy Story 3	2010	415	Animation	103	*	G
Spider-Man	2002	404	Action	121	0	PG13
Transformers: Revenge of the Fallen	2009	402	Action	150	0	PG13
Star Wars: Episode III - Revenge of the Sith	2005	380	Sci-Fi	140	0	PG13
The Lord of the Rings: The Return of the King	2003	377	Fantasy	201	11	PG13
* As of February 2011. Source: The Internet Movie Database

In reviewing this table we see, in particular, that the third column and the fifth column each contain values of a quantitative variable. In order to get a “picture” of these data, we might want to determine how much, on average, a top-grossing film earned, or whether the running time for The Lord of the Rings: The Return of the King was unusually long. These are questions that can be answered using numerical descriptive measures.

1.2 Locating the Center: Mean

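We will begin our discussion by considering a numerical description with which you are probably familiar, the mean. The mean of a set of quantitative data is simply its arithmetic average, the very same average you learned to calculate in elementary school. The mean is found by summing all data values, then dividing the result by the number of values. We indicate this symbolically using two formulas, one for a population and one for a sample:

\[ \mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i \]

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]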

You are probably thinking that these two formulas are remarkably similar; indeed, they are nearly identical. That is because what we do is exactly the same; what we say depends on whether we are working with a population or a sample.

In working with a population, we denote the mean by µ, the Greek letter mu. The upper case N is the population size (the number of values we are adding), and the x’s are the individual data values, labeled from 1 to N. If we are calculating the mean of a sample, we denote the result by x̄, which we call (not very creatively) x-bar; the lower case n is the sample size, and the x’s are labeled from 1 to n.
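To make the calculation concrete, here is a minimal sketch in Python that finds the mean of each quantitative column for the eight films listed in the table. It treats those eight films as a sample (n = 8) of the top-grossing movies and assumes the running times are in minutes. The names box_office, running_time and sample_mean are ours, chosen only for illustration.

```python
# Box office receipts (millions of dollars) and running times (assumed to be
# in minutes) for the eight films listed in the table.
box_office = [435, 431, 423, 415, 404, 402, 380, 377]
running_time = [115, 133, 130, 103, 121, 150, 140, 201]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    n = len(values)         # sample size, n
    return sum(values) / n  # x-bar = (x_1 + x_2 + ... + x_n) / n


x_bar_box_office = sample_mean(box_office)
x_bar_running_time = sample_mean(running_time)

print(f"Mean box office:   {x_bar_box_office} million dollars")  # 408.375
print(f"Mean running time: {x_bar_running_time} minutes")        # 136.625
```

The eight box office values sum to 3267, so the sample mean is 3267/8 = 408.375 million dollars. The running times sum to 1093, giving a mean of 136.625 minutes, well below the 201 minutes of The Lord of the Rings: The Return of the King.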

While this might seem unnecessarily “picky” at first glance, the notation indicates two important distinctions that we must keep in mind:

We denote the mean µ of a population of size N and the mean x̄ of a sample of size n differently, because the information they provide is different in nature. A population has only one mean, and if we calculate it correctly, we have the mean. When we calculate a sample mean, we are generally interested in using it to estimate the population mean. So while each sample has only one mean, if we select a different sample, we are likely to get a different sample mean. Recall that when we calculate a numerical descriptive measure for a population, we are finding a parameter; if we calculate a numerical descriptive measure for a sample, we are finding a statistic. It is easy to remember which is which: population and parameter both begin with p, while sample and statistic both begin with s. Further, we use Greek letters for parameters and English ones for statistics. So µ is a parameter, while x̄ is a statistic.
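The difference between a fixed parameter and a varying statistic can be seen in a small sketch. Purely for illustration, it pretends that the eight box office values from the table form an entire population (N = 8), then computes the mean of every possible sample of size n = 3. The names population, mu and sample_means are our own choices.

```python
from itertools import combinations

# Illustration only: pretend these eight box office values (millions of
# dollars) are an entire population, so that N = 8.
population = [435, 431, 423, 415, 404, 402, 380, 377]


def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


mu = mean(population)  # the one and only population mean (a parameter)

# Every possible sample of size n = 3, each with its own sample mean
# (a statistic).
n = 3
sample_means = [mean(sample) for sample in combinations(population, n)]

print(f"Population mean mu: {mu}")                                    # 408.375
print(f"Number of possible samples of size {n}: {len(sample_means)}")  # 56
print(f"Smallest sample mean: {min(sample_means):.1f}")               # 386.3
print(f"Largest sample mean:  {max(sample_means):.1f}")               # 429.7
```

There is a single value of µ, but the 56 possible samples produce sample means ranging from about 386.3 to about 429.7. Which value of x̄ we obtain depends entirely on which sample we happen to select.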