3.3 Quantitative Data Descriptions: Median and Quartiles

We began Section 3.2 by looking once again at the top-grossing film data, and asking two questions. First, how much, on average, did a top-grossing film earn, and second, whether the running time for The Lord of the Rings: The Return of the King was unusually long. We answered the first question using the mean; on average, a top-grossing film earned $405 million. We now turn to numerical descriptive measures that will help us answer the second question.

3.3.1 Locating the Center: Median

Finding the average, or mean, of a data set is one way to locate the center of the distribution of values. The median, the middle value when the data are arranged in size order, is another measure of the distribution’s center. If the number of measurements is odd, there is exactly one measurement in the middle; the median is this measurement. If the number of measurements is even, there are two measurements in the middle, and the median is the mean (average) of these two measurements.

Consider once again the January gasoline price data we examined previously. Arranging the values from smallest to largest we have 1.09, 1.11, 1.26, 1.38, 1.41, 1.49, 1.75, 2.24, 2.30, 3.09. If we attempt to locate the middle number here, we see that there is not a single value in the middle, but rather, the two values 1.41 and 1.49. The median is then the mean of these two values or $1.45 per gallon.

The pth percentile is a number such that p% of the measurements fall at or below that number so the median is always the 50th percentile since 50% of the measurements fall below that number. In the January gasoline price data, the number 1.75 is the 70% percentile since 70% of the measurements fall at or below 1.75.

When we use the median as the measure of center, we use quartiles to describe the variability of the data. The median divides the data into its lower 50% and its upper 50%. We call the median of the lower 50% of the data Q1 and the median of the upper 50% of the data Q3. (What happened to Q2? Q2 is actually the median.) Q1, the median, and Q3 divide the set of data values into fourths, or quartiles. Therefore Q1 is the 25th percentile and Q3 is the 75% percentile. Along with the minimum and maximum values, Q1, the median, and Q3 constitute the five-number summary.

When we look at the five-number summary graphically, as indicated in Figure 3.24, we can see various ways to describe how much variability is in the data set:

Figure 3.24: What the Five-Number Summary Tells Us

Here is the five-number summary for the gasoline data:

Minimum: 0.91
Q1: 1.26
Median: 1.45
Q3: 2.24
Maximum: 3.09

So we can conclude that

Question 3.18

Use statistical software to find the five-number summary for the running times of the 25 top-grossing films. Between what two values does the middle 50% of running times lie?

The middle 50% of running times lie between fugjWOzhVo4BiBlh and qCzQBS4v9I8=.

2
Try again.

Incorrect. CrunchIt! reports the statistics for Running Time in the accompanying table. Notice that the values reported include sample size n, sample mean , and sample standard deviation s, in addition to the five-number summary values we want.

n Sample
Mean
Median Standard
Deviation
Max Min Q1 Q3
Running
Time
(minutes)
25 132.6 127 29.41 201 89 111.5 147

The five-number summary is

  • Minimum: 89
  • Q1: 111.5
  • Median: 127
  • Q3: 147
  • Maximum: 201

Thus, the middle 50% of running times lie between 111.5 and 147 minutes.

Correct. CrunchIt! reports the statistics for Running Time in the accompanying table. Notice that the values reported include sample size n, sample mean , and sample standard deviation s, in addition to the five-number summary values we want.

n Sample
Mean
Median Standard
Deviation
Max Min Q1 Q3
Running
Time
(minutes)
25 132.6 127 29.41 201 89 111.5 147

The five-number summary is

  • Minimum: 89
  • Q1: 111.5
  • Median: 127
  • Q3: 147
  • Maximum: 201

Thus, the middle 50% of running times lie between 111.5 and 147 minutes.

3.3.2 Outliers

Let’s return now to the question we asked earlier—was the running time for The Lord of the Rings: The Return of the King unusually long? When we looked at histograms and stem plots, we attempted to identify outliers, data values that were unusual compared to the rest of the data. We looked for gaps at the left-hand or right-hand side in these graphical displays, and tried to judge whether these gaps were significant enough to make us question any data values lying beyond those points. Now we are ready to establish a numerical criterion by which we can determine outliers, and we will illustrate the procedure using the running times of the 25 top-grossing movies.

The procedure statisticians have agreed upon to find outliers is:

  1. Calculate the interquartile range (IQR), which is Q3 – Q1.
  2. Multiply the IQR by 1.5.
  3. Calculate the lower fence, which is Q1 – 1.5IQR.
  4. Calculate the upper fence, which is Q3 + 1.5IQR.
  5. Outliers are any data values that lie outside the fences; that is, below the lower fence or above the upper fence.

Following these steps for the movie data, we find the fences.

  1. IQR = Q3 – Q1 = 147 – 111.5 = 35.5.
  2. IQR × 1.5 = 1.5 × 35.5 = 53.25.
  3. Lower fence = Q1 – 1.5IQR = 111.5 – 53.25 = 58.25.
  4. Upper fence = Q3 + 1.5IQR = 147 + 53.35 = 200.25.

Figure 3.25 shows these values on a number line—the lower fence lies 53.25 units below Q1, while the upper fence is 53.25 units above Q3.

Figure 3.25: Determining Outliers

Since the minimum running time of 89 minutes is not outside the lower fence, there are no low outliers. The data value 201 lies above the upper fence (although just barely), so the running time of 201 minutes is a high outlier. This example makes clear how the criterion works; if a data value lies outside the fences, we will call it an outlier. Having a strict rule allows us to agree on outliers, rather than having to take differences in opinion into account.

Hence, we see that the 201-minute running time for The Lord of the Rings: The Return of the King was indeed unusually long compared to the rest of these films. On a revenue-per-running-minute basis, Shrek 2, which had a running time of 93 minutes and earned $436 million was a better bargain than The Lord of the Rings: The Return of the King, which generated $377 million dollars. However, fans of The Lord of the Rings trilogy might argue that The Return of the King had more artistic merit, as evidenced by its earning 11 Academy Awards, as compared to none for Shrek 2.

Why do we care about outliers? Often, they are interesting just because they are different. A person who is 7 feet tall is much more likely to be noticed than one who is 6 feet tall. Like such a person, an outlier “stands out” from the crowd. If we are sure that the data values are correct (as in the movie running times), an outlier is a curiosity that may help or hinder the point we are trying to make. (Just for the record, it is not valid to remove an outlier from the data just because we don’t like what it does to our results.)

On the other hand, outliers sometimes occur as the result of data collection or data entry errors. If it can be determined that an outlier is the result of such an error, the error should be noted, and the value eliminated from statistical calculations. In a statistics class, the instructor collected student height data, asking for the value in inches. One student reported a height of “6.” Since no one in the class was only 6 inches tall, the error was pointed out, and that value was not used in analyzing the data set. While we might guess that the student was actually 6 feet tall, we cannot assume that this was the mistake and substitute the equivalent 72 inches. The best we can do is to explain why we have chosen to ignore the value in our calculations.

3.3.3 Box Plots

A handy way to display the information from the five-number summary is in a box plot (sometimes called a box-and-whisker plot). A box plot consists of two parts:

Does this sound confusing? A look at the TI-84 calculator box plot for the running times of the movies (in Figure 3.26) will assure you that it is a graph of the five-number summary.

Figure 3.26: Box Plot for the Movie Running Times

With a horizontal axis starting at 80 and marked off by tens, the box indicates that the median lies between 120 and 130, with Q1 close to 110, and Q3 between 140 and 150. The whiskers extend to about 90 on the left and 200 on the right. This indeed corresponds to the five-number summary we found previously: minimum 89, Q1 111.5, median 127, Q3 147, and maximum 201.

Recall that we used the information from the five-number summary to determine outliers. Most statistical software allows us to use the fences to determine outliers, and to identify them as separate points beyond the whiskers. This type of box plot is sometimes called a modified box plot. CrunchIt! displays modified box plots; in Figure 3.27 we see the CrunchIt! plot for the running time of the movies, showing the outlier we found previously, 201 minutes. Note also that CrunchIt! displays the plot vertically, rather than horizontally.

Figure 3.27: Modified Box Plot for the Movie Running Times

While different software packages display box plots in different ways, they are all show the same “box,” which indicates where the middle 50% of the data lie. For the movie data, the middle 50% of the running times lie between 111.5 and 147. While these values (Q1 and Q3) are not labeled on the graph, we can see that the bottom and top of the box are at these values, respectively

Question 3.19

Use a modified boxplot to determine whether there are any outliers in the January gas price data. Verify your result using the 1.5IQR criterion.

Ya9m9AHKNJmu8EMWJvwrgrffMxYfPZaQzC7xFeXMIkbhr6IenyPBSz0nsiOB3FvVFx2KNp8hVZQz3mkxrvSS0w==

Using the descriptive statistics calculated by CrunchIt!, IQR = 1.538 – 1.257 = 0.281, so 1.5IQR = 0.4215. The lower fence is 1.257 – 0.4215 = 0.8355; since the smallest gas price ($0.955) is larger than the lower fence, there are no low outliers. The upper fence is 1.538 + 0.4215 = 1.9595. Since the highest gas price ($1.729) is smaller than the upper fence, there are no high outliers. Thus the 1.5IQR criterion confirms what we see in the modified box plot—that none of the average gas prices was unusually small or large with respect to the rest of the data set.

3.3.4 Mean or Median?

How do we decide which measure of center is more appropriate for a given data set? Let’s consider a couple of examples.

Figure 3.28 shows the distribution of normal body temperatures for 134 individuals, as both a histogram and a box plot.

Figure 3.28: Graphical Descriptions of Body Temperature Data

We notice that the distribution is very symmetric; the mean for this data set is 98.55, and its median is 98.60. In this situation, you would probably agree that it does not matter whether we use the mean or the median to measure the center of the distribution.

The accompanying table gives the total compensation in millions of dollars for a simple random sample of heads of America’s 500 biggest companies.

31,825 3.203 6.301 1.750 21.903 19.706 6.979 1.046 3.234 1.158
9.458 1.882 11.007 1.200 1.996 4.810 10.312 6.504 3.946 5.944

Figure 3.29 shows a histogram and a box plot of these data. In this case, we see that the distribution is skewed to the right, with an upper outlier.

Figure 3.29: Two Views of a Right-Skewed Distribution

When we calculate the mean and the median, we find that the mean (7.708 million dollars) is quite a bit larger than the median (5.377 million dollars). Which value better represents the center of the distribution? The three largest values (including the outlier) are much larger than the rest, and they cause the mean to be larger than a “typical” measurement. In fact, the mean is larger than 14 of the 20 measurements—not exactly what you would think of as the “center.” So we conclude that here the median is the better measure of the center.

Income distributions, like the one for the sample of executive compensations, are frequently right-skewed. The NPR Story The Income of the “Average American” looks at the relationship between mean and median incomes.

Click here to discuss this story.

For a skewed distribution such as this, we say that the mean is “pulled away” from the median in the direction of the skewness. So here, the mean is to the right of—that is, larger than—the median. In a left-skewed distribution, the mean is influenced by low values (whether or not they are outliers), and thus is pulled to the left of the median.

When a distribution is symmetric, the mean is a good measure of the center. When we use the mean to measure the center, we typically use the standard deviation to measure the variability. We recall that the standard deviation, which measures variability by considering deviations from the mean, is also affected by extreme values. So like the mean, it is not very useful when the data are skewed. For a skewed distribution, the median is the preferred measure of the center, and we then use the quartiles to measure variability.

One additional measure of center that is sometimes used is the mode, the measurement (or measurements) that occur most frequently. A distribution can have one or more modes; however, if all data values occur the same number of times, we say that the distribution has no mode. Recall that we referred to the mode in Section 3.1 when we were describing the shape of histograms. We called a distribution bimodal if its graph had two non-adjacent tall peaks (of roughly equal height), . Bimodal distributions occur more frequently than you might think. A histogram showing the distribution of heights for a class of students (like Figure 3.30) might well be bimodal, with one peak reflecting a common height of females, and the second indicating a common height of males.

Figure 3.30: A Bimodal Distribution

The videos StatClips: Summaries of Quantitative Data and StatClips: Exploratory Pictures for Quantitative Data provide examples and comparisons of numerical and graphical measures to describe quantitative data. Note that the videos refer to "location" rather than "center."

Reporting the center and the variability of a set of numerical data gives important information about a distribution. Before choosing a particular combination of measure of center and variability, it is useful to graph the data set, using a histogram, a stem-and-leaf plot, or a box plot.

If the distribution is symmetric, we typically choose the mean to measure center, and the standard deviation to measure variability. If the distribution is not symmetric, the median is the preferred measure of center, with quartiles used to measure variability.

We also use quartiles in the procedure for determining outliers. Outliers can occur because of natural variation in data, or as a result of data collection or entry errors. When it is possible to characterize an outlier as a genuine error, it can be removed from the data set prior to any numerical analysis. However, such removal (and the reason for it) must be noted.

While it may seem that we concentrate on numerical techniques as we proceed with statistical analysis, it is important to remember that graphs can help us select the numerical measures and techniques that are most appropriate for a particular data set.