Chapter 1: Looking at Data—Distributions

CHAPTER OUTLINE

  • 1.1 Data

  • 1.2 Displaying Distributions with Graphs

  • 1.3 Describing Distributions with Numbers

  • 1.4 Density Curves and Normal Distributions

Introduction

Statistics is the science of learning from data. Data are numerical or qualitative descriptions of the objects that we want to study. In this chapter, we will master the art of examining data.

We begin in Section 1.1 with some basic ideas about data. We will learn about the different types of data that are collected and how data sets are organized.

Section 1.2 starts our process of learning from data by looking at graphs. These visual displays give us a picture of the overall patterns in a set of data. We have excellent software tools that help us make these graphs. However, it takes a little experience and a lot of judgment to study the graphs carefully and to explain what they tell us about our data.

Section 1.3 continues our process of learning from data by computing numerical summaries. These sets of numbers describe key characteristics of the patterns that we saw in our graphical summaries.

The final section in this chapter helps us make the transition from data summaries to statistical models that are used to draw conclusions and to make predictions. Specifically, we learn about using density curves to describe a set of data and are introduced to the Normal distributions. These distributions can be used to describe many sets of data that we will encounter. They also play a fundamental role in many of the methods of statistical analysis.

1.1 Data

When you complete this section, you will be able to:

  • Give examples of cases in a data set.

  • Identify the variables in a data set.

  • Demonstrate how a label can be used as a variable in a data set.

  • Identify the values of a variable.

  • Classify variables as categorical or quantitative.

  • Describe the key characteristics of a set of data.

  • Explain how a rate is the result of adjusting one variable to create another.

A statistical analysis starts with a set of data. We construct a set of data by first deciding what cases, or units, we want to study. For each case, we record information about characteristics that we call variables.

CASES, LABELS, VARIABLES, AND VALUES

Cases are the objects described by a set of data. Cases may be customers, companies, subjects in a study, units in an experiment, or other objects.

A label is a special variable used in some data sets to distinguish the different cases.

A variable is a characteristic of a case.

Different cases can have different values of the variables.

EXAMPLE 1.1

Restaurant discount coupons. A website offers coupons that can be used to get discounts for various items at local restaurants. Coupons for food are very popular. Figure 1.1 gives information for seven restaurant coupons that were available for a recent weekend. These are the cases. Data for each coupon are listed on a different line, and the first column has the coupons numbered from 1 to 7. The remaining columns give the type of restaurant, the name of the restaurant, the item being discounted, the regular price, and the discount price.

COUPONS

Figure 1.1 Spreadsheet of food discount coupons, Example 1.1.

Some variables, like the type of restaurant, the name of the restaurant, and the item, simply place coupons into categories. The regular price and discount price columns have numerical values for which we can do arithmetic. It makes sense to give an average of the regular prices, but it does not make sense to give an “average” type of restaurant. We can, however, do arithmetic to compare the regular prices classified by type of restaurant.

CATEGORICAL AND QUANTITATIVE VARIABLES

A categorical variable places a case into one of several groups or categories.

A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.

EXAMPLE 1.2

Categorical and quantitative variables for coupons. The restaurant discount coupon file has six variables: coupon number, type of restaurant, name of restaurant, item, regular price, and discount price. The two price variables are quantitative variables. Coupon number, type of restaurant, name of restaurant, and item are categorical variables.

COUPONS

An appropriate label for your cases should be chosen carefully. In our food coupon example, a natural choice of a label would be the name of the restaurant. However, if there are two or more coupons available for a particular restaurant, or if a restaurant is a chain with different discounts offered at different locations, then the name of the restaurant would not uniquely label each of the coupons. In the restaurant discount coupon file, the first variable, ID, is a unique label for each coupon.

The display in Figure 1.1 is from an Excel spreadsheet. Spreadsheets are very useful for doing the kind of simple computations that you will do in Exercise 1.2. You can type in a formula and have the same computation performed for each row.

Note that the names we have chosen for the variables in our spreadsheet do not have spaces. For example, instead of “Restaurant Name” for the name of the restaurant, we simply use Name. In some statistical software packages, spaces are not allowed in variable names. For this reason, when creating spreadsheets for eventual use with statistical software, it is best to avoid spaces in variable names. Another convention is to use an underscore (_) where you would normally use a space. For our data set, we could have used Regular_Price and Discount_Price for the two price variables.
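As a small illustration of the underscore convention, the Python sketch below turns column headings that contain spaces into names without spaces. The function name to_variable_name and the sample headings are our own choices for illustration; they are not part of the coupon data file.

```python
def to_variable_name(heading: str) -> str:
    """Replace each run of whitespace in a column heading with one underscore."""
    # split() with no arguments drops leading and trailing whitespace and
    # treats any run of spaces as a single separator.
    return "_".join(heading.split())


# Illustrative headings only; "Name" already has no spaces and is unchanged.
headings = ["Regular Price", "Discount Price", "Name"]
print([to_variable_name(h) for h in headings])
```

A name produced this way is only a starting point. Some software packages place further restrictions on variable names, such as limits on length or on special characters.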

USE YOUR KNOWLEDGE

Question 1.1

1.1 Read the spreadsheet. Refer to Figure 1.1. Give the regular price and the discount price for the Smokey Grill ribs coupon.

Question 1.2

1.2 How much is the discount worth? Refer to Example 1.1. Consider adding another column to the spreadsheet that gives the coupon savings. Explain how you would compute the entries in this column. Does the new column contain values for a categorical variable or for a quantitative variable? Explain your answer.

Another important part of the description of any quantitative variable is its unit of measurement. For both RegPrice and DiscPrice, the unit of measurement is clearly dollars. In other settings, it may not be as obvious. For example, if we were measuring heights of children, we might choose to use either inches or centimeters. The units of measurement are an important part of the description of a quantitative variable.

Key characteristics of a data set

In practice, any set of data is accompanied by background information that helps us understand the data. When you plan a statistical study or explore data from someone else’s work, ask yourself the following questions:

  1. Who? What cases do the data describe? How many cases does the data set contain?

  2. What? How many variables do the data contain? What are the exact definitions of these variables? What are the units of measurement for each quantitative variable?

  3. Why? What purpose do the data have? Do we hope to answer some specific questions? Do we want to draw conclusions about cases other than the ones we actually have data for? Are the variables that are recorded suitable for the intended purpose?

EXAMPLE 1.3

Statistics class data. Suppose that you are a teaching assistant for a statistics class and one of your jobs is to keep track of the grades for students in two sections of the course. The cases are the students in the class. There are weekly homework assignments, two exams during the semester, and a final exam. Each of these components is given a numerical score, and the components are added to get a total score that can range from 0 to 1000. Cutoffs of 900, 800, 700, etc., are used to assign letter grades of A, B, C, etc.

The spreadsheet for this course will have seven variables:

  • An identifier for each student.

  • The number of points earned for homework.

  • The number of points earned for the first exam.

  • The number of points earned for the second exam.

  • The number of points earned for the final exam.

  • The total number of points earned.

  • The letter grade earned.

The student identifier is a label and the letter grade earned is a categorical variable. All the other variables are measured in “points.” Because we can do arithmetic with their values, these variables are quantitative variables.
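The step from the quantitative total score to the categorical letter grade can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the example gives the cutoffs 900, 800, and 700 followed by "etc.", so the cutoff of 600 for a D is our assumption, as is the rule that a score exactly at a cutoff earns the higher grade. The totals in the loop are hypothetical.

```python
# Cutoffs from Example 1.3: 900, 800, and 700 for A, B, and C.
# ASSUMPTION: the text says "etc."; we take 600 as the cutoff for a D.
CUTOFFS = [(900, "A"), (800, "B"), (700, "C"), (600, "D")]


def letter_grade(total_points: int) -> str:
    """Return the letter grade for a total score between 0 and 1000."""
    if not 0 <= total_points <= 1000:
        raise ValueError("total score must be between 0 and 1000")
    # ASSUMPTION: a score at or above a cutoff earns that grade.
    for cutoff, grade in CUTOFFS:
        if total_points >= cutoff:
            return grade
    return "F"


# Hypothetical totals, used only to exercise the function.
for total in (935, 800, 799, 642, 410):
    print(total, letter_grade(total))
```

Notice that the function takes a quantitative value and returns a categorical one. Arithmetic makes sense for its input but not for its output.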

In our example of statistics class data, the possible values for the grade variable are A, B, C, D, and F. When computing grade point averages, many colleges and universities translate these letter grades into numbers using A = 4, B = 3, C = 2, D = 1, and F = 0. The transformed variable with numeric values is considered to be quantitative because we can average the numerical values across different courses to obtain a grade point average.
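Here is a minimal Python sketch of this translation. It uses the numerical equivalents given above and averages them across courses. The list of grades is hypothetical, and the sketch treats every course equally; many colleges weight each course by its credit hours, which we ignore here.

```python
# Numerical equivalents given in the text.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def grade_point_average(letter_grades):
    """Average the numerical equivalents of a list of letter grades.

    Simplification: every course counts equally (no credit-hour weights).
    """
    if not letter_grades:
        raise ValueError("need at least one grade")
    points = [GRADE_POINTS[grade] for grade in letter_grades]
    return sum(points) / len(points)


# Hypothetical grades for four courses: (4 + 3 + 3 + 2) / 4 = 3.0
print(grade_point_average(["A", "B", "B", "C"]))
```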

Sometimes, experts argue about numerical scales such as this. They ask whether or not the difference between an A and a B is the same as the difference between a D and an F. Similarly, many questionnaires ask people to respond on a 1 to 5 scale, with 1 representing strongly agree, 2 representing agree, etc. Again we could ask whether or not the five possible values for this scale are equally spaced in some sense. From a practical point of view, the averages that can be computed when we convert categorical scales such as these to numerical values frequently provide a very useful way to summarize data.

EXAMPLE 1.4

Who, what, and why for the statistics class data. The data set in Example 1.3 was constructed to keep track of the grades for students in an introductory statistics course. The cases are the students in the class. There are seven variables in this data set. These include a label for each student and scores for the various course requirements. There are no units for the label and grade. The other variables all have “points” as the unit.

USE YOUR KNOWLEDGE

Question 1.3

1.3 Who, what, and why? For the restaurant discount coupon data of Example 1.1 (page 2), what cases do the data describe? How many cases are there? How many variables are there? What are their definitions and units of measurement? What purpose do the data have?

EXAMPLE 1.5

Statistics class data for a different purpose. Suppose that the data for the students in the introductory statistics class were also to be used to study relationships between student characteristics and success in the course. Here, we have decided to focus on the TotalPoints and Grade as the outcomes of interest. Other variables of interest would have been included—for example, Sex, PrevStat (whether or not the student has taken a statistics course previously), and Year (student classification as first, second, third, or fourth year). ID is a categorical variable, TotalPoints is a quantitative variable, and the remaining variables are all categorical.

USE YOUR KNOWLEDGE

Question 1.4

1.4 Apartment rentals. A data set lists apartments available for students to rent. Information provided includes the monthly rent, whether or not cable is included free of charge, whether or not pets are allowed, the number of bedrooms, and the distance to the campus. Describe the cases in the data set, give the number of variables, and specify whether each variable is categorical or quantitative.

Often, the variables in a statistical study are easy to understand: height in centimeters, study time in minutes, and so on. But each area of work also has its own special variables. A psychologist uses the Minnesota Multiphasic Personality Inventory (MMPI), and a physical fitness expert measures “VO2 max” (the volume of oxygen consumed per minute while exercising at your maximum capacity). Both of these variables are measured with special instruments. VO2 max is measured by exercising while breathing into a mouthpiece connected to an apparatus that measures oxygen consumed. Scores on the MMPI are based on a long questionnaire, which is also called an instrument.

Part of mastering your field of work is learning what variables are important and how they are best measured. Because details of particular measurements usually require knowledge of the particular field of study, we will say little about them.

Be sure that each variable really does measure what you want it to. A poor choice of variables can lead to misleading conclusions. Often, for example, the rate at which something occurs is a more meaningful measure than a simple count of occurrences.

EXAMPLE 1.6

Comparing colleges based on graduates. Think about comparing colleges based on the numbers of graduates. This view tells you something about the relative sizes of different colleges. However, if you are interested in how well colleges succeed at graduating students they admit, it would be better to use a rate. For example, you can find data on the Internet on the six-year graduation rates of different colleges. These rates are computed by examining the progress of first-year students who enroll in a given year. Suppose that at College A there were 1000 first-year students in a particular year, and 800 graduated within six years. The graduation rate is

$$\frac{800}{1000} = 0.80$$

or 80%. College B has 2000 students who entered in the same year, and 1200 graduated within six years. The graduation rate is

$$\frac{1200}{2000} = 0.60$$

or 60%. How do we compare these two colleges? College B has more graduates but College A has a better graduation rate.

In Example 1.6, when we computed the graduation rate, we used the total number of students to adjust the number of graduates. We constructed a new variable by dividing the number of graduates by the total number of students. Computing a rate is just one of several ways of adjusting one variable to create another. We often divide one variable by another to compute a more meaningful variable to study. Example 1.20 (page 20) is another type of adjustment.
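The adjustment in Example 1.6 can be written as a short Python sketch. The function name graduation_rate and the layout of the dictionary are our own choices; the counts are the ones given for College A and College B.

```python
def graduation_rate(graduates: int, entering_students: int) -> float:
    """Adjust a count of graduates by the number of students who entered."""
    if entering_students <= 0:
        raise ValueError("number of entering students must be positive")
    return graduates / entering_students


# (graduates within six years, first-year students) from Example 1.6.
colleges = {"College A": (800, 1000), "College B": (1200, 2000)}

for name, (graduates, entered) in colleges.items():
    rate = graduation_rate(graduates, entered)
    print(f"{name}: {graduates} graduates, rate = {rate:.2f} ({rate:.0%})")
```

Sorting the colleges by the count puts College B first, while sorting by the rate puts College A first. Which ordering is more useful depends on the question being asked.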

USE YOUR KNOWLEDGE

Question 1.5

1.5 How should you express the change? Between the first exam and the second exam in your statistics course, you increased the amount of time that you spent working exercises. Which of the following three ways would you choose to express the results of your increased work: (a) give the grades on the two exams, (b) give the ratio of the grade on the second exam divided by the grade on the first exam, (c) take the difference between the grade on the second exam and the grade on the first exam, and express this as a percent of the grade on the first exam. Give reasons for your answer.

Question 1.6

1.6 Which variable would you choose? Refer to Example 1.6 on colleges and their graduates.

  (a) Give a setting in which you would prefer to evaluate the colleges based on the numbers of graduates. Give a reason for your choice.

  (b) Give a setting in which you would prefer to evaluate the colleges based on the graduation rates. Give a reason for your choice.

Exercises 1.5 and 1.6 illustrate an important point about presenting the results of your statistical calculations. Always consider how best to communicate your results to a general audience. For example, the numbers produced by your calculator or by statistical software frequently contain more digits than are needed. Be sure that you do not include extra information generated by software that will distract from a clear explanation of what you have found.

SECTION 1.1 SUMMARY

  • A data set contains information on a number of cases. Cases may be customers, companies, subjects in a study, units in an experiment, or other objects.

  • For each case, the data give values for one or more variables. A variable describes some characteristic of a case, such as a person’s height, gender, or salary. Variables can have different values for different cases.

  • A label is a special variable used to identify cases in a data set.

  • Some variables are categorical and others are quantitative. A categorical variable places each individual into a category, such as male or female. A quantitative variable has numerical values that measure some characteristic of each case, such as height in centimeters or annual salary in dollars.

  • The key characteristics of a data set answer the questions Who?, What?, and Why?

SECTION 1.1 EXERCISES

For Exercises 1.1 and 1.2, see page 3; for Exercise 1.3, see page 5; for Exercise 1.4, see page 5; and for Exercises 1.5 and 1.6, see page 6.

Question 1.7

1.7 How do you do online research? A study of 552 first-year college students asked about their favorite choice for doing online research. Possible choices were “Google or Google Scholar,” “Library database or website,” “Wikipedia or online encyclopedia,” and “Other.” Names of the students were not recorded, but the students were numbered from 1 to 552 in the data file. The researchers also recorded age, sex, and major area of study for each student.

  (a) What are the cases?

  (b) Identify the variables and their possible values.

  (c) Classify each variable as categorical or quantitative. Be sure to include at least one of each.

  (d) Was a label used? Explain your answer.

  (e) Summarize the key characteristics of your data set.

Question 1.8

1.8 Summer jobs. You are collecting information about summer jobs that are available for college students in your area. Describe a data set that you could use to organize the information that you collect.

  (a) What are the cases?

  (b) Identify the variables and their possible values.

  (c) Classify each variable as categorical or quantitative. Be sure to include at least one of each.

  (d) Use a label and explain how you chose it.

  (e) Summarize the key characteristics of your data set.

Question 1.9

1.9 Employee application data. The personnel department keeps records on all employees in a company. Here is the information that they keep in one of their data files: employee identification number, last name, first name, middle initial, department, number of years with the company, salary, education (coded as high school, some college, or college degree), and age.

  (a) What are the cases for this data set?

  (b) Describe each type of information as a label, a quantitative variable, or a categorical variable.

  (c) Set up a spreadsheet that could be used to record the data. Give appropriate column headings and five sample cases.

Question 1.10

1.10 How would you rank cities? Various organizations rank cities and produce lists of the 10 or the 100 best based on various measures. Create a list of criteria that you would use to rank cities. Include at least eight variables, and give reasons for your choices. Say whether each variable is quantitative or categorical.

Question 1.11

1.11 Survey of students. A survey of students in an introductory statistics class asked the following questions: (1) age; (2) do you like to sing? (Yes, No); (3) can you play a musical instrument (not at all, a little, pretty well); (4) how much did you spend on food last week (in dollars); (5) height.

  (a) Classify each of these variables as categorical or quantitative and give reasons for your answers.

  (b) For each variable give the possible values.

Question 1.12

1.12 What questions would you ask? Refer to the previous exercise. Make up your own survey with at least six questions. Include at least two categorical variables and at least two quantitative variables. Tell which variables are categorical and which are quantitative. Give reasons for your answers. For each variable, give the possible values.

Question 1.13

1.13 How would you rate colleges? Popular magazines rank colleges and universities on their “academic quality” in serving undergraduate students. Describe five variables that you would like to see measured for each college if you were choosing where to study. Give reasons for each of your choices.

Question 1.14

1.14 Attending college in your state or in another state. The U.S. Census Bureau collects a large amount of information concerning higher education.1 For example, the bureau provides a table that includes the following variables: state, number of students from the state who attend college, number of students who attend college in their home state.

  (a) What are the cases for this set of data?

  (b) Is there a label variable? If yes, what is it?

  (c) Identify each variable as categorical or quantitative.

  (d) Explain how you might use each of the quantitative variables to explain something about the states.

  (e) Consider a variable computed as the number of students in each state who attend college in the state divided by the total number of students from the state who attend college. Explain how you would use this variable to explain something about the states.

Question 1.15

1.15 Alcohol-impaired driving fatalities. A report on drunk-driving fatalities in the United States gives the number of alcohol-impaired driving fatalities for each state.2 Discuss at least three different ways that these numbers could be converted to rates. Give the advantages and disadvantages of each.

1.2 Displaying Distributions with Graphs

When you complete this section, you will be able to:

  • Analyze the distribution of a categorical variable using a bar graph.

  • Analyze the distribution of a categorical variable using a pie chart.

  • Analyze the distribution of a quantitative variable using a stemplot.

  • Analyze the distribution of a quantitative variable using a histogram.

  • Examine the distribution of a quantitative variable with respect to the overall pattern of the data and deviations from that pattern.

  • Identify the shape, center, and spread of the distribution of a quantitative variable.

  • Identify and describe any outliers in the distribution of a quantitative variable.

  • Use a time plot to describe the distribution of a quantitative variable that is measured over time.

Statistical tools and ideas help us examine data to describe their main features. This examination is called exploratory data analysis. Like an explorer crossing unknown lands, we want first to simply describe what we see. Here are two basic strategies that help us organize our exploration of a set of data:

  • Begin by examining each variable by itself. Then move on to study the relationships among the variables.

  • Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.

We follow these principles in organizing our learning. This chapter presents methods for describing a single variable. We will study relationships among several variables in Chapter 2. Within each chapter, we will begin with graphical displays, then add numerical summaries for a more complete description.

Categorical variables: Bar graphs and pie charts

The values of a categorical variable are labels for the categories, such as “yes” and “no.” The distribution of a categorical variable lists the categories and gives either the count or the percent of cases that fall in each category. An alternative to the percent is the proportion, the count divided by the sum of the counts. Note that the percent is simply the proportion times 100.

EXAMPLE 1.7

How do you do online research? A study of 552 first-year college students asked about their preferences for online resources. One question asked them to pick their favorite.3 Here are the results:

Resource                            Count (n)
Google or Google Scholar                  406
Library database or website                75
Wikipedia or online encyclopedia           52
Other                                      19
Total                                     552

Resource is the categorical variable in this example, and the values are the names of the online resources.

ONLINE

Note that the last value of the variable resource is “Other,” which includes all other online resources that were given as selection options. For data sets that have a large number of values for a categorical variable, we often create a category such as this that includes categories that have relatively small counts or percents. Careful judgment is needed when doing this. You don’t want to cover up some important piece of information contained in the data by combining data in this way.

EXAMPLE 1.8

ONLINE

Favorites as percents. When we look at the online resources data set, we see that Google is the clear winner. We see that 406 reported Google or Google Scholar as their favorite. To interpret this number, we need to know that the total number of students polled was 552. When we say that Google is the winner, we can describe this win by saying that 73.6% (406 divided by 552, expressed as a percent) of the students reported Google as their favorite. Here is a table of the preference percents:

Resource                            Percent (%)
Google or Google Scholar                   73.6
Library database or website                13.6
Wikipedia or online encyclopedia            9.4
Other                                       3.4
Total                                     100.0
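The arithmetic behind these percents can be sketched in Python. The counts are those reported in Example 1.7; the variable names are our own. Each proportion is a count divided by the sum of the counts, and each percent is the proportion times 100, rounded to one decimal place.

```python
# Counts from Example 1.7.
counts = {
    "Google or Google Scholar": 406,
    "Library database or website": 75,
    "Wikipedia or online encyclopedia": 52,
    "Other": 19,
}

total = sum(counts.values())  # number of students in the study

# Proportion = count / sum of counts; percent = proportion * 100.
proportions = {resource: n / total for resource, n in counts.items()}
percents = {resource: round(100 * p, 1) for resource, p in proportions.items()}

for resource, n in counts.items():
    print(f"{resource:<34}{n:>5}{proportions[resource]:>8.3f}{percents[resource]:>7.1f}")
```

Rounded percents do not always add to exactly 100. Here they happen to do so.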

The use of graphical methods allows us to see this information and other characteristics of the data easily. We now examine two types of graphs.

EXAMPLE 1.9

Bar graph for the online resource preference data. Figure 1.2 displays the online resource preference data using a bar graph. The heights of the four bars show the percents of the students who reported each of the resources as their favorite.

ONLINE

Figure 1.2 Bar graph for the online resource preference data, Example 1.9.

The categories in a bar graph can be put in any order. In Figure 1.2, we ordered the resources based on their preference percents. For other data sets, an alphabetical ordering or some other arrangement might produce a more useful graphical display.

You should always consider the best way to order the values of the categorical variable in a bar graph. Choose an ordering that will be useful to you. If you have difficulty, ask a friend if your choice communicates what you expect. Note that a bar graph using counts will look the same as a bar graph using percents. A pie chart naturally uses percents.
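To see how the choice of ordering plays out, here is a rough Python sketch that draws a bar graph as plain text, with the categories ordered from the largest percent to the smallest. It is not how Figure 1.2 was produced; the function text_bar_graph and the choice of one "#" per two percentage points are our own assumptions for illustration. The percents come from Example 1.8 and are deliberately entered out of order.

```python
# Percents from Example 1.8, entered in no particular order.
percents = {
    "Library database or website": 13.6,
    "Other": 3.4,
    "Google or Google Scholar": 73.6,
    "Wikipedia or online encyclopedia": 9.4,
}


def text_bar_graph(values, scale=2.0):
    """Return one text line per category, largest value first.

    Each '#' stands for `scale` units (an arbitrary choice for display).
    """
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    width = max(len(name) for name in values)
    lines = []
    for name, value in ordered:
        bar = "#" * round(value / scale)
        lines.append(f"{name:<{width}} | {bar} {value}")
    return lines


print("\n".join(text_bar_graph(percents)))
```

Replacing the sort key with the category name would give an alphabetical ordering instead, which may be more useful for other data sets.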

EXAMPLE 1.10

Pie chart for the online resource preference data. The pie chart in Figure 1.3 helps us see what part of the whole each group forms. Here it is very easy to see that Google is the favorite for about three-quarters of the students.

ONLINE

Figure 1.3 Pie chart for the online resource preference data, Example 1.10.

USE YOUR KNOWLEDGE

Question 1.16

1.16 Compare the bar graph with the pie chart. Refer to the bar graph in Figure 1.2 and the pie chart in Figure 1.3 for the online resource preference data. Which graphical display does a better job of describing the data? Give reasons for your answer.

ONLINE

To make a pie chart, you must include all the categories that make up a whole. A category such as “Other” in this example can be used, but the sum of the percents for all the categories should be 100%. This constraint makes bar graphs more flexible.

Quantitative variables: Stemplots and histograms

A stemplot (also called a stem-and-leaf plot) gives a quick picture of the shape of a distribution while including the actual numerical values in the graph. Stemplots work best for small numbers of observations that are all greater than 0.

STEMPLOT

To make a stemplot,

  1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.

  2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.

  3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

EXAMPLE 1.11

Soluble corn fiber and calcium. Soluble corn fiber (SCF) has been promoted for various health benefits. One study examined the effect of SCF on the absorption of calcium of adolescent boys and girls. Calcium absorption is expressed as a percent of calcium in the diet. Here are the data for the condition where subjects consumed 12 grams per day (g/d) of SCF.4

SCF

50 43 43 44 50 44 35 49 54 76 31 48
61 70 62 47 42 45 43 59 53 53 73

To make a stemplot of these data, use the first digits as stems and the second digits as leaves. Figure 1.4 shows the steps in making the plot. Figure 1.4(a) shows the stems, which have values 3, 4, 5, 6, and 7. The first entry in our data set is 50. This appears in Figure 1.4(b) on the 5 stem with a leaf of 0. Similarly, the second value, 43, appears in the 4 stem with a leaf of 3. The stemplot is completed in Figure 1.4(c), where the leaves are ordered from smallest to largest.

The center of the distribution is in the 40s, and the data are more stretched out toward high values than low values (the highest value is 76, while the lowest is 31). In the plot, we do not see any extreme values that lie far from the remaining data.
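The three steps for making a stemplot can also be sketched in Python. This is a minimal illustration that assumes whole-number observations that are all at least 0, so that the final digit is the leaf. The function name stemplot is our own, and the data are the 23 calcium absorption values from Example 1.11.

```python
from collections import defaultdict

# Calcium absorption (percent) for the 12 g/d SCF condition, Example 1.11.
scf = [50, 43, 43, 44, 50, 44, 35, 49, 54, 76, 31, 48,
       61, 70, 62, 47, 42, 45, 43, 59, 53, 53, 73]


def stemplot(values):
    """Return the rows of a stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    for value in values:
        # Step 1: the stem is all but the final digit; the leaf is the final digit.
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)

    rows = []
    # Step 2: stems in a column, smallest at the top. Stems with no leaves
    # are kept so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: leaves in increasing order out from the stem.
        row = " ".join(str(leaf) for leaf in sorted(leaves[stem]))
        rows.append(f"{stem} | {row}".rstrip())
    return rows


print("\n".join(stemplot(scf)))
```

The printed rows correspond to the completed plot: the 3 stem has leaves 1 and 5, and the 4 stem is the longest, with 10 leaves.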

Figure 1.4 Making a stemplot of the data in Example 1.11. (a) Write the stems. (b) Go through the data and write each leaf on the proper stem. For example, the values on the 3-stem are 35 and 31 in the order given in the display for the example. (c) Arrange the leaves on each stem in order out from the stem. The 3-stem now has leaves 1 and 5.

USE YOUR KNOWLEDGE

Question 1.17

1.17 Make a stemplot. Here are the scores on the first exam in an introductory statistics course for 30 students in one section of the course:

82 73 92 82 75 98 94 57 80 90 92 80 87 91 65
73 70 85 83 61 70 90 75 75 59 68 85 78 80 94

STAT

Use these data to make a stemplot. Then use the stemplot to describe the distribution of the first-exam scores for this course.

When you wish to compare two related distributions, a back-to-back stemplot with common stems is useful. The leaves on each side are ordered out from the common stem.

EXAMPLE 1.12

Soluble corn fiber and calcium. Refer to Example 1.11, which gives the data for subjects consuming 12 g/d of SCF. Here are the data for subjects under control conditions (0 g/d of SCF):

42 33 41 49 42 47 48 47 53 72 47 63
68 59 35 46 43 55 38 49 51 51 66

SCF

Figure 1.5 gives the back-to-back stemplot for the SCF and control conditions. The values on the left give absorption for the control condition, while the values on the right give absorption when SCF was consumed. The values for SCF appear to be somewhat higher than the controls.
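A back-to-back stemplot can be sketched in Python as well. The sketch below is our own illustration, not the software used for Figure 1.5. It assumes whole-number observations that are at least 0, places the control values on the left and the SCF values on the right, and orders the leaves on each side out from the common stem.

```python
# Control condition (0 g/d of SCF), Example 1.12.
control = [42, 33, 41, 49, 42, 47, 48, 47, 53, 72, 47, 63,
           68, 59, 35, 46, 43, 55, 38, 49, 51, 51, 66]
# SCF condition (12 g/d of SCF), Example 1.11.
scf = [50, 43, 43, 44, 50, 44, 35, 49, 54, 76, 31, 48,
       61, 70, 62, 47, 42, 45, 43, 59, 53, 53, 73]


def leaves_by_stem(values):
    """Group the final digits (leaves) by the remaining digits (stems)."""
    table = {}
    for value in values:
        stem, leaf = divmod(value, 10)
        table.setdefault(stem, []).append(leaf)
    return table


def back_to_back(left_values, right_values):
    """Return rows of the form 'left leaves | stem | right leaves'."""
    left, right = leaves_by_stem(left_values), leaves_by_stem(right_values)
    stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)
    # Left leaves are written in decreasing order so that, read outward
    # from the stem, they increase just as the right leaves do.
    left_text = {
        s: " ".join(map(str, sorted(left.get(s, []), reverse=True)))
        for s in stems
    }
    width = max(len(text) for text in left_text.values())
    rows = []
    for s in stems:
        right_text = " ".join(map(str, sorted(right.get(s, []))))
        rows.append(f"{left_text[s]:>{width}} | {s} | {right_text}".rstrip())
    return rows


print("\n".join(back_to_back(control, scf)))
```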

Figure 1.5 A back-to-back stemplot to compare the distributions of calcium absorption under control and SCF conditions, Example 1.12.

There are two modifications of the basic stemplot that can be helpful in different situations. You can double the number of stems in a plot by splitting each stem into two: one with leaves 0 to 4 and the other with leaves 5 through 9. When the observed values have many digits, it is often best to trim the numbers by removing the last digit or digits before making a stemplot. If you are using software, you can round the data, which is what was done for the data given in Example 1.11.

You must use your judgment in deciding whether to split stems and whether to trim or round, though statistical software will often make these choices for you. Remember that the purpose of a stemplot is to display the shape of a distribution. If there are many stems with no leaves or only one leaf, trimming will reduce the number of stems. Let’s take a look at the effect of splitting the stems for our SCF data.

EXAMPLE 1.13

Stemplot with split stems for SCF. Figure 1.6 presents the data from Example 1.12 in a stemplot with split stems.

SCF

Figure 1.6: A back-to-back stemplot with split stems to compare the distributions of calcium absorption under control and SCF conditions, Example 1.13.
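The effect of splitting stems can also be sketched in code. The short Python function below is our own illustration, not the software used to draw the figures. It handles one group at a time, so it shows only the control values from Example 1.12 rather than a back-to-back display. The name `stemplot` and its `split` argument are assumptions made for this sketch.

```python
from collections import defaultdict


def stemplot(values, split=False):
    """Return the lines of a stemplot for nonnegative whole numbers.

    Stems are the tens digits and leaves are the ones digits.
    With split=True each stem appears twice: first for leaves 0-4,
    then for leaves 5-9.
    """
    parts = 2 if split else 1
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = leaf // 5 if split else 0
        rows[(stem, half)].append(leaf)

    lowest, highest = min(rows), max(rows)
    lines = []
    stem, half = lowest
    # Walk through every stem in order so that empty stems keep their place.
    while (stem, half) <= highest:
        leaves = "".join(str(leaf) for leaf in rows.get((stem, half), []))
        lines.append(f"{stem:>2} | {leaves}")
        half += 1
        if half == parts:
            stem, half = stem + 1, 0
    return lines


# Control condition (0 g/d of SCF), Example 1.12
control = [42, 33, 41, 49, 42, 47, 48, 47, 53, 72, 47, 63,
           68, 59, 35, 46, 43, 55, 38, 49, 51, 51, 66]

print("\n".join(stemplot(control)))
print()
print("\n".join(stemplot(control, split=True)))
```

Because the function steps through every stem between the smallest and the largest, a stem with no leaves still gets its own line. That keeps gaps in the data visible.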


USE YOUR KNOWLEDGE

Question 1.18

1.18 Which stemplot do you prefer? Look carefully at the stemplots for the SCF data in Figures 1.5 and 1.6. Which do you prefer? Give reasons for your answer.

Question 1.19

1.19 Why should you keep the space? Suppose that you had a data set similar to the one given in Example 1.12, but in which the control values of 66 and 68 were both changed to 64.

  1. (a) Make a stemplot of these data using split stems.

  2. (b) Should you use one stem or two stems for the 60s? Give a reason for your answer. (Hint: How would your choice reveal or conceal a potentially important characteristic of the data?)

Histograms

Stemplots display the actual values of the observations. This feature makes stemplots awkward for large data sets. Moreover, the picture presented by a stemplot divides the observations into groups (stems) determined by the number system rather than by judgment.

Histograms do not have these limitations. A histogram breaks the range of values of a variable into classes and displays only the count or percent of the observations that fall into each class. You can choose any convenient number of classes, but you should choose classes of equal width.

Making a histogram by hand requires more work than a stemplot. Histograms do not display the actual values observed. For these reasons, we prefer stemplots for small data sets.

The construction of a histogram is best shown by example. Most statistical software packages will make a histogram for you.

EXAMPLE 1.14

Distribution of IQ scores. You have probably heard that the distribution of scores on IQ tests is supposed to be roughly “bell-shaped.” Let’s look at some actual IQ scores. Table 1.1 displays the IQ scores of 60 fifth-grade students chosen at random from one school.

  1. Divide the range of the data into classes of equal width. Let’s use

    75 ≤ IQ score < 85

    85 ≤ IQ score < 95

    ⋮

    145 ≤ IQ score < 155

    TABLE 1.1 IQ Test Scores for 60 Randomly Chosen Fifth-Grade Students
    145 139 126 122 125 130 96 110 118 118
    101 142 134 124 112 109 134 113 81 113
    123 94 100 136 109 131 117 110 127 124
    106 124 115 133 116 102 127 117 109 137
    117 90 103 114 139 101 122 105 97 89
    102 108 110 128 114 112 114 102 82 101


    Be sure to specify the classes precisely so that each individual falls into exactly one class. A student with IQ 84 would fall into the first class, but IQ 85 falls into the second.

  2. Count the number of individuals in each class. These counts are called frequencies, and a table of frequencies for all classes is a frequency table.

    Class                    Count
     75 ≤ IQ score <  85       2
     85 ≤ IQ score <  95       3
     95 ≤ IQ score < 105      10
    105 ≤ IQ score < 115      16
    115 ≤ IQ score < 125      13
    125 ≤ IQ score < 135      10
    135 ≤ IQ score < 145       5
    145 ≤ IQ score < 155       1

  3. Draw the histogram. First, on the horizontal axis mark the scale for the variable whose distribution you are displaying. That’s the IQ score. The scale runs from 75 to 155 because that is the span of the classes we chose. The vertical axis contains the scale of counts. Each bar represents a class. The base of the bar covers the class, and the bar height is the class count. There is no horizontal space between the bars unless a class is empty, in which case its bar has height zero. Figure 1.7 is our histogram. It does look roughly “bell-shaped.”

Figure 1.7: Histogram of the IQ scores of 60 fifth-grade students, Example 1.14.
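The first two steps of Example 1.14 can be checked with a few lines of code. The following Python sketch is an illustration of the counting step under the classes chosen above, not the output of any particular statistical package. The function name `frequency_table` is our own. The rows of asterisks give a rough sideways picture of the histogram.

```python
# IQ scores from Table 1.1
iq_scores = [
    145, 139, 126, 122, 125, 130,  96, 110, 118, 118,
    101, 142, 134, 124, 112, 109, 134, 113,  81, 113,
    123,  94, 100, 136, 109, 131, 117, 110, 127, 124,
    106, 124, 115, 133, 116, 102, 127, 117, 109, 137,
    117,  90, 103, 114, 139, 101, 122, 105,  97,  89,
    102, 108, 110, 128, 114, 112, 114, 102,  82, 101,
]


def frequency_table(values, start, width, n_classes):
    """Count the values in equal-width classes [low, low + width)."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the chosen classes")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


table = frequency_table(iq_scores, start=75, width=10, n_classes=8)
for low, high, count in table:
    print(f"{low:>3} <= IQ score < {high:<3}  {count:>2}  {'*' * count}")
```

Each score lands in exactly one class because the lower endpoint is included and the upper endpoint is not, so a score of 85 is counted in the second class.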

Large sets of data are often reported in the form of frequency tables when it is not practical to publish the individual observations. In addition to the frequency (count) for each class, we may be interested in the fraction or percent of the observations that fall in each class. A histogram of percents looks just like a frequency histogram such as Figure 1.7. Simply relabel the vertical scale to read in percents. Use histograms of percents for comparing several distributions that have different numbers of observations.
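Converting counts to percents is simple arithmetic: divide each class count by the total number of observations and multiply by 100. Here is a minimal Python sketch using the class counts from Example 1.14. The variable names are our own.

```python
# Class counts from the frequency table in Example 1.14
class_lows = [75, 85, 95, 105, 115, 125, 135, 145]
counts = [2, 3, 10, 16, 13, 10, 5, 1]

total = sum(counts)
percents = [100 * c / total for c in counts]

for low, count, percent in zip(class_lows, counts, percents):
    print(f"{low:>3} to under {low + 10:<3}  count {count:>2}  percent {percent:5.1f}")
```

Every count is divided by the same total, so the bars keep the same relative heights. Only the vertical scale changes.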


USE YOUR KNOWLEDGE

Question 1.20

1.20 Make a histogram. Refer to the first-exam scores from Exercise 1.17 (page 12). Use these data to make a histogram with classes 50 to 59, 60 to 69, etc. Compare the histogram with the stemplot as a way of describing this distribution. Which do you prefer for these data?

STAT

Our eyes respond to the area of the bars in a histogram. Because the classes are all the same width, area is determined by height and all classes are fairly represented. There is no one right choice of the classes in a histogram. Too few classes will give a “skyscraper” graph, with all values in a few classes with tall bars. Too many will produce a “pancake” graph, with most classes having one or no observations. Neither choice will give a good picture of the shape of the distribution. You must use your judgment in choosing classes to display the shape. Statistical software will choose the classes for you. The software’s choice is often a good one, but you can change it if you want.


You should be aware that the appearance of a histogram can change when you change the classes. The histogram function in the One-Variable Statistical Calculator applet on the text website allows you to change the number of classes by dragging with the mouse, so that it is easy to see how the choice of classes affects the histogram.

USE YOUR KNOWLEDGE

Question 1.21

1.21 Change the classes in the histogram. Refer to the first-exam scores from Exercise 1.17 (page 12) and the histogram that you produced in Exercise 1.20. Now make a histogram for these data using classes 40 to 59, 60 to 79, and 80 to 100. Compare this histogram with the one that you produced in Exercise 1.20. Which do you prefer? Give a reason for your answer.

Question 1.22

1.22 Use smaller classes. Repeat the previous exercise using classes 55 to 59, 60 to 64, 65 to 69, etc. Of the three histograms, which do you prefer? Give reasons for your answer.

STAT

Although histograms resemble bar graphs, their details and uses are distinct. A histogram shows the distribution of counts or percents among the values of a single variable. A bar graph compares the counts or percents of different items. The horizontal axis of a bar graph need not have any measurement scale but simply identifies the items being compared.

Draw bar graphs with blank space between the bars to separate the items being compared. Draw histograms with no space, to indicate that all values of the variable are covered. Some spreadsheet programs, which are not primarily intended for statistics, will draw histograms as if they were bar graphs, with space between the bars. Often, you can tell the software to eliminate the space to produce a proper histogram.


Data analysis in action: Don’t hang up on me

Many businesses operate call centers to serve customers who want to place an order or make an inquiry. Customers want their requests handled thoroughly. Businesses want to treat customers well, but they also want to avoid wasted time on the phone. They therefore monitor the length of calls and encourage their representatives to keep calls short.

TABLE 1.2 Service Times (Seconds) for Calls to a Customer Service Center
77 289 128 59 19 148 157 203
126 118 104 141 290 48 3 2
372 140 438 56 44 274 479 211
179 1 68 386 2631 90 30 57
89 116 225 700 40 73 75 51
148 9 115 19 76 138 178 76
67 102 35 80 143 951 106 55
4 54 137 367 277 201 52 9
700 182 73 199 325 75 103 64
121 11 9 88 1148 2 465 25

EXAMPLE 1.15

How long are customer service center calls? We have data on the lengths of all 31,492 calls made to the customer service center of a small bank in a month. Table 1.2 displays the lengths of the first 80 calls.5

CALLS80

Take a look at the data in Table 1.2. In this data set, the cases are calls made to the bank’s call center. The variable recorded is the length of each call. The units are seconds. We see that the call lengths vary a great deal. The longest call lasted 2631 seconds, almost 44 minutes. More striking is that 8 of these 80 calls lasted less than 10 seconds.
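These observations about Table 1.2 are easy to verify by computer. The Python sketch below is our own illustration. It counts the calls shorter than 10 seconds and converts the longest call to minutes and seconds.

```python
# Service times in seconds for the first 80 calls (Table 1.2)
calls = [
     77, 289, 128,  59,   19, 148, 157, 203,
    126, 118, 104, 141,  290,  48,   3,   2,
    372, 140, 438,  56,   44, 274, 479, 211,
    179,   1,  68, 386, 2631,  90,  30,  57,
     89, 116, 225, 700,   40,  73,  75,  51,
    148,   9, 115,  19,   76, 138, 178,  76,
     67, 102,  35,  80,  143, 951, 106,  55,
      4,  54, 137, 367,  277, 201,  52,   9,
    700, 182,  73, 199,  325,  75, 103,  64,
    121,  11,   9,  88, 1148,   2, 465,  25,
]

short_calls = [t for t in calls if t < 10]
longest = max(calls)
minutes, seconds = divmod(longest, 60)

print(f"Number of calls: {len(calls)}")
print(f"Calls under 10 seconds: {len(short_calls)}")
print(f"Longest call: {longest} seconds ({minutes} min {seconds} s)")
```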

We started our study of the customer service center data by examining a few cases, the ones displayed in Table 1.2. It would be very difficult to examine all 31,492 cases in this way. How can we get a picture of the full data set? Let’s try a histogram.

EXAMPLE 1.16

Histogram for customer service center call lengths. Figure 1.8 is a histogram of the lengths of all 31,492 calls. We did not plot the few lengths greater than 1200 seconds (20 minutes). As expected, the graph shows that most calls last between about 1 and 5 minutes, with some lasting much longer when customers have complicated problems. More striking is the fact that 7.6% of all calls are no more than 10 seconds long.

CALLS

Figure 1.8: The distribution of call lengths for 31,492 calls to a bank’s customer service center, Example 1.16. The data show a surprising number of very short calls. These are mostly due to representatives deliberately hanging up in order to bring down their average call length.

It turned out that the bank penalized representatives whose average call length was too long, so some representatives just hung up on customers to bring their average length down. Neither the customers nor the bank was happy about this. The bank changed its policy, and later data showed that calls under 10 seconds had almost disappeared.

The extreme values of a distribution are in the tails of the distribution. The high values are in the upper, or right, tail and the low values are in the lower, or left, tail. The overall pattern in Figure 1.8 is made up of the many moderate call lengths and the long right tail of more lengthy calls. The striking deviation from the overall pattern is the surprising number of very short calls in the left tail.

Our examination of the call center data illustrates some important principles:

  • After you understand the background of your data (cases, variables, units of measurement), the first thing to do is plot your data.

  • When you look at a plot, look for an overall pattern and also for any striking deviations from the pattern.

Examining distributions

Making a statistical graph is not an end in itself. The purpose of the graph is to help us understand the data. After you make a graph, always ask, “What do I see?” Once you have displayed a distribution, you can see its important features as follows.

EXAMINING A DISTRIBUTION

In any graph of data, look for the overall pattern and for striking deviations from that pattern.

You can describe the overall pattern of a distribution by its shape, center, and spread.

An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.

In Section 1.3, we will learn how to describe center and spread numerically. For now, we can describe the center of a distribution by its midpoint, the value with roughly half the observations taking smaller values and half taking larger values. We can describe the spread of a distribution by giving the smallest and largest values. Stemplots and histograms display the shape of a distribution in the same way. Just imagine a stemplot turned on its side so that the larger values lie to the right.

Some things to look for in describing shape are

  • Does the distribution have one or several major peaks, called modes? A distribution with one major peak is called unimodal.

  • Is it approximately symmetric or is it skewed in one direction? A distribution is symmetric if the patterns of values smaller and larger than its midpoint are mirror images of each other. It is skewed to the right if the right tail (larger values) is much longer than the left tail (smaller values).

Some variables commonly have distributions with predictable shapes. Many biological measurements on specimens from the same species and sex—lengths of bird bills, heights of young women—have symmetric distributions. Money amounts, on the other hand, usually have right-skewed distributions. There are many moderately priced houses, for example, but the few very expensive mansions give the distribution of house prices a strong right-skew.

EXAMPLE 1.17

Examine the histogram of IQ scores. What does the histogram of IQ scores (Figure 1.7, page 15) tell us?

IQ

Shape: The distribution is roughly symmetric with a single peak in the center. We don’t expect real data to be perfectly symmetric, so in judging symmetry, we are satisfied if the two sides of the histogram are roughly similar in shape and extent.

Center: You can see from the histogram that the midpoint is not far from 110. Looking at the actual data shows that the midpoint is 114.

Spread: The histogram has a spread from 75 to 155. Looking at the actual data shows that the spread is from 81 to 145. There are no outliers or other strong deviations from the symmetric, unimodal pattern.
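The midpoint and spread quoted from the actual data can be reproduced directly. This is a small Python sketch of our own, using the standard library's `statistics.median` to find the midpoint of the 60 scores in Table 1.1.

```python
from statistics import median

# IQ scores from Table 1.1
iq_scores = [
    145, 139, 126, 122, 125, 130,  96, 110, 118, 118,
    101, 142, 134, 124, 112, 109, 134, 113,  81, 113,
    123,  94, 100, 136, 109, 131, 117, 110, 127, 124,
    106, 124, 115, 133, 116, 102, 127, 117, 109, 137,
    117,  90, 103, 114, 139, 101, 122, 105,  97,  89,
    102, 108, 110, 128, 114, 112, 114, 102,  82, 101,
]

midpoint = median(iq_scores)  # half the scores are smaller, half are larger
smallest, largest = min(iq_scores), max(iq_scores)

print(f"Midpoint: {midpoint}")
print(f"Spread: from {smallest} to {largest}")
```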

EXAMPLE 1.18

Examine the histogram of call lengths. The distribution of call lengths in Figure 1.8 (page 17), on the other hand, is strongly skewed to the right. The midpoint, the length of a typical call, is about 115 seconds, or just under 2 minutes. The spread is very large, from 1 second to 28,739 seconds.

The longest few calls are outliers. They stand apart from the long right tail of the distribution, though we can’t see this from Figure 1.8, which omits the largest observations. The longest call lasted almost 8 hours—that may well be due to equipment failure rather than an actual customer call.

USE YOUR KNOWLEDGE

Question 1.23

1.23 Describe the first-exam scores. Refer to the first-exam scores from Exercise 1.17 (page 12). Use your favorite graphical display to describe the shape, the center, and the spread of these data. Are there any outliers?

STAT

Dealing with outliers


In data sets smaller than the service call data, you can spot outliers by looking for observations that stand apart (either high or low) from the overall pattern of a histogram or stemplot. Identifying outliers is a matter for judgment. Look for points that are clearly apart from the body of the data, not just the most extreme observations in a distribution. You should search for an explanation for any outlier. Sometimes outliers point to errors made in recording the data. In other cases, the outlying observation may be caused by equipment failure or other unusual circumstances.


EXAMPLE 1.19

College students. How does the number of undergraduate college students vary by state? Figure 1.9 is a histogram of the numbers of undergraduate students in each of the states.6 Notice that more than 50% of the states are included in the first bar of the histogram. These states have fewer than 300,000 undergraduates. The next bar includes another 30% of the states. These have between 300,000 and 600,000 students. The bar at the far right of the histogram corresponds to the state of California, which has 2,685,893 undergraduates. California certainly stands apart from the other states for this variable. It is an outlier.

COLLEGE

Figure 1.9: The distribution of the numbers of undergraduate college students for the 50 states, Example 1.19.

The state of California is an outlier in the previous example because it has a very large number of undergraduate students. California has the largest population of all the states, so we might expect it to have a large number of college students. Let’s look at these data in a different way.

EXAMPLE 1.20

College students per 1000. To account for the fact that there is large variation in the populations of the states, for each state we divide the number of undergraduate students by the population and then multiply by 1000. This gives the undergraduate college enrollment expressed as the number of students per 1000 people in each state. Figure 1.10 gives a stemplot of the distribution. California has 60 undergraduate students per 1000 people. This is one of the higher values in the distribution, but it is clearly not an outlier.

COLLEGE

Figure 1.10: Stemplot of the numbers of undergraduate college students per 1000 people in each of the 50 states, Example 1.20.
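The adjustment in Example 1.20 is a single calculation per state: divide the count by the population and multiply by 1000. The Python sketch below illustrates it. The function name `per_thousand` is our own, and the two regions and their numbers are hypothetical values chosen only to show the arithmetic. They are not taken from the COLLEGE data file.

```python
def per_thousand(count, population):
    """Express a count as a rate per 1000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000 * count / population


# Hypothetical (students, population) pairs, for illustration only
regions = {
    "Region A": (2_400_000, 40_000_000),
    "Region B": (45_000, 600_000),
}

for name, (students, population) in regions.items():
    rate = per_thousand(students, population)
    print(f"{name}: {students:>9,} students, {rate:.0f} per 1000 people")
```

In this made-up comparison, Region A has far more students, but Region B has the higher rate. The two measures answer different questions.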

USE YOUR KNOWLEDGE

Question 1.24

1.24 Four states with large populations. There are four states with populations greater than 15 million.

  1. (a) Examine the data file and report the names of these four states.

  2. (b) Find these states in the distribution of number of undergraduate students per 1000 people. To what extent do these four states influence the distribution of number of undergraduate students per 1000 people?

COLLEGE

In Example 1.19, we looked at the distribution of the number of undergraduate students, while in Example 1.20, we adjusted these data by expressing the counts as number per 1000 people in each state. Which way is correct? The answer depends upon why you are examining the data.

If you are interested in marketing a product to undergraduate students, the unadjusted numbers would be of interest because you want to reach the most people. On the other hand, if you are interested in comparing states with respect to how well they provide opportunities for higher education to their residents, the population-adjusted values would be more suitable. Always think about why you are doing a statistical analysis, and this will guide you in choosing an appropriate analytic strategy.


Here is an example with a different kind of outlier.

EXAMPLE 1.21

Healthy bones and PTH. Bones are constantly being built up (bone formation) and torn down (bone resorption). Young people who are growing have more formation than resorption. When we age, resorption increases to the point where it exceeds formation. (The same phenomenon occurs when astronauts travel in space.) The result is osteoporosis, a disease associated with fragile bones that are more likely to break. The underlying mechanisms that control these processes are complex and involve a variety of substances. One of these is parathyroid hormone (PTH). Here are the values of PTH measured on a sample of 29 boys and girls aged 12 to 15 years:7

39 59 30 48 71 31 25 31 71 50 38 63 49 45 31
33 28 40 127 49 59 50 64 28 46 35 28 19 29

PTH

Figure 1.11: Stemplot of the values of PTH, Example 1.21.

The data are measured in picograms per milliliter (pg/ml) of blood. The original data were recorded with one digit after the decimal point. They have been rounded to simplify our presentation here. Figure 1.11 gives a stemplot of the data.

The observation 127 clearly stands out from the rest of the distribution. A PTH measurement on this individual taken on a different day was similar to the rest of the values in the data set. We conclude that this outlier was caused by a laboratory error or a recording error, and we are confident in discarding it for any additional analysis.
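One way to see how far 127 stands from the other values is to sort the data and look at the gaps between neighboring observations. The Python sketch below is our own aid to judgment. It is not a formal rule for declaring outliers, and it does not replace the search for an explanation described above.

```python
# PTH values (pg/ml) from Example 1.21
pth = [39, 59, 30, 48, 71, 31, 25, 31, 71, 50, 38, 63, 49, 45, 31,
       33, 28, 40, 127, 49, 59, 50, 64, 28, 46, 35, 28, 19, 29]

ordered = sorted(pth)

# Gap between each observation and the next larger one
gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
largest_gap, below, above = max(gaps)
second_largest_gap = sorted(gaps)[-2][0]

print(f"Largest gap: {largest_gap} (between {below} and {above})")
print(f"Next largest gap: {second_largest_gap}")

# The data with the suspect observation set aside, as in the example
trimmed = [v for v in pth if v != 127]
print(f"Largest remaining value: {max(trimmed)}")
```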

Time plots

Whenever data are collected over time, it is a good idea to plot the observations in time order. Displays of the distribution of a variable that ignore time order, such as stemplots and histograms, can be misleading when there is systematic change over time.


TIME PLOT

A time plot of a variable plots each observation against the time at which it was measured. Always put time on the horizontal scale of your plot and the variable you are measuring on the vertical scale.

EXAMPLE 1.22

Seasonal variation in vitamin D. Although we get some of our vitamin D from food, most of us get about 75% of what we need from the sun. Cells in the skin make vitamin D in response to sunlight. If people do not get enough exposure to the sun, they can become deficient in vitamin D, resulting in weakened bones and other health problems. The elderly, who need more vitamin D than younger people, and people who live in northern areas, where there is relatively little sunlight in the winter, are particularly vulnerable to these problems.

VITDS

Figure 1.12 is a plot of the serum levels of vitamin D versus time of year for samples of subjects from Switzerland.8 Vitamin D levels are measured in nanomoles per liter (nmol/l) of blood. The observations are grouped into periods of two months for the plot. Means are marked by filled-in circles and are connected by a line in the plot. The effect of the lack of sunlight in the winter months on vitamin D levels is clearly evident in the plot.

Figure 1.12: Plot of vitamin D versus months of the year, Example 1.22.

The data described in the preceding example come from a subset of the 248 subjects in a study. The researchers were particularly concerned about subjects whose levels were deficient, defined as a serum vitamin D level of less than 50 nmol/l. They found that there was a 3.8-fold higher deficiency rate in February–March than in August–September: 91.2% versus 24.3%. To ensure that individuals from this population have adequate levels of vitamin D, some form of supplementation is needed, particularly during certain times of the year.
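The 3.8-fold figure is the ratio of the two deficiency rates. A short Python check of the arithmetic follows. The variable names are our own.

```python
# Deficiency rates (percent of subjects below 50 nmol/l) reported in the study
rate_feb_mar = 91.2
rate_aug_sep = 24.3

fold = rate_feb_mar / rate_aug_sep
print(f"Deficiency was about {fold:.1f} times as common in February-March")
```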

SECTION 1.2 SUMMARY


  • Exploratory data analysis uses graphs and numerical summaries to describe the variables in a data set and the relations among them.

  • The distribution of a variable tells us what values it takes and how often it takes these values.

  • Bar graphs and pie charts display the distributions of categorical variables. These graphs use the counts or percents of the categories.

  • Stemplots and histograms display the distributions of quantitative variables. Stemplots separate each observation into a stem and a one-digit leaf. Histograms plot the frequencies (counts) or the percents of equal-width classes of values.

  • When examining a distribution, look for shape, center, and spread and for clear deviations from the overall shape.

  • Some distributions have simple shapes, such as symmetric or skewed. The number of modes (major peaks) is another aspect of overall shape. Not all distributions have a simple overall shape, especially when there are few observations.

  • Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.

  • When observations on a variable are taken over time, make a time plot that graphs time horizontally and the values of the variable vertically. A time plot can reveal changes over time.

SECTION 1.2 EXERCISES

For Exercise 1.16, see page 11; for Exercise 1.17, see page 12; for Exercises 1.18 and 1.19, see page 14; for Exercise 1.20, see page 16; for Exercises 1.21 and 1.22, see page 16; for Exercise 1.23, see page 19; and for Exercise 1.24, see page 21.

Question 1.25

1.25 Your Facebook app can generate a million dollars a month. A report on Facebook suggests that Facebook apps can generate large amounts of money, as much as $1 million a month.9 The following table gives the numbers of Facebook users by country for the top 10 countries based on the number of users:10

FACEBK

Country          Facebook users (in millions)
Brazil            29.30
India             37.38
Mexico            29.80
Germany           21.46
France            23.19
Philippines       26.87
Indonesia         40.52
United Kingdom    30.39
United States    155.74
Turkey            30.63
  1. (a) Use a bar graph to describe the numbers of users in these countries.

  2. (b) Do you think that the United States is an outlier in this data set? Explain your answer.

  3. (c) Describe the major features of your graph in a short paragraph.

Question 1.26

1.26 Facebook use increases by country. Refer to the previous exercise. The report also gave the increases in the number of Facebook users for a one-month period for the same countries:

FACEBK

Country          Increase in users (in millions)
Brazil           2.47
India            1.75
Mexico           0.84
Germany          0.51
France           0.38
Philippines      0.38
Indonesia        0.37
United Kingdom   0.22
United States    0.65
Turkey           0.09

  1. (a) Use a bar graph to describe the increase in users in these countries.

  2. (b) Describe the major features of your graph in a short paragraph.

  3. (c) Do you think a stemplot would be a better graphical display for these data? Give reasons for your answer.

  4. (d) Write a short paragraph about possible business opportunities suggested by the data you described in this exercise and the previous one.

Question 1.27

1.27 The Titanic and class. On April 15, 1912, on her maiden voyage, the Titanic collided with an iceberg and sank. The ship was luxurious but did not have enough lifeboats for the 2224 passengers and crew. As a result of the collision, 1502 people died.11 The ship had three classes of passengers. The level of luxury and the price of the ticket varied with the class, with first class being the most luxurious. There were 323 passengers in first class, 277 in second class, and 709 in third class.12

TITANIC

  1. (a) Make a bar graph of these data.

  2. (b) Give a short summary of how the number of passengers varied with class.

  3. (c) If you made a bar graph of the percent of passengers in each class, would the general features of the graph differ from the one you made in part (a)? Explain your answer.

Question 1.28

1.28 Another look at the Titanic and class. Refer to the previous exercise.

TITANIC

  1. (a) Make a pie chart to display the data.

  2. (b) Compare the pie chart with the bar graph. Which do you prefer? Give reasons for your answer.

Question 1.29

1.29 Who survived? Refer to the two previous exercises. The number of first-class passengers who survived was 200. For second and third class, the numbers were 119 and 181, respectively. Create a graphical summary that shows how the survival of passengers depended on class.

TITANIC

Question 1.30

1.30 Potassium from potatoes. The 2015 Dietary Guidelines for Americans13 notes that the average potassium (K) intake for U.S. adults is about half of the recommended amount. A major source of potassium is potatoes. Nutrients in the diet can have different absorption depending on the source. One study looked at absorption of potassium from different sources. Participants ate a controlled diet for five days, and the amount of potassium absorbed was measured. Data for a diet that included 40 milliequivalents (mEq) of potassium were collected from 27 adult subjects.14

KPOT40

  1. (a) Make a stemplot of the data.

  2. (b) Describe the pattern of the distribution.

  3. (c) Are there any outliers? If yes, describe them and explain why you have declared them to be outliers.

  4. (d) Describe the shape, center, and spread of the distribution.

Question 1.31

1.31 Potassium from a supplement. Refer to the previous exercise. Data were also recorded for 29 subjects who received a potassium salt supplement with 40 mEq of potassium. Answer the questions in the previous exercise for the supplemented subjects.

KSUP40

Question 1.32

1.32 Energy consumption. The U.S. Energy Information Administration reports data summaries of various energy statistics. Let’s look at the total amount of energy consumed, in quadrillions of British thermal units (Btu), for each month in a recent year. Here are the data:15

ENERGY

Month      Energy (quadrillion Btu)   Month       Energy (quadrillion Btu)
January    9.58                       July        8.23
February   8.46                       August      8.21
March      8.56                       September   7.64
April      7.56                       October     7.78
May        7.66                       November    8.19
June       7.79                       December    8.82
  1. (a) Look at the table and describe how the energy consumption varies from month to month.

  2. (b) Make a time plot of the data and describe the patterns.

  3. (c) Suppose you wanted to communicate information about the month-to-month variation in energy consumption. Which would be more effective, the table of the data or the graph? Give reasons for your answer.

Question 1.33

1.33 Energy consumption in a different year. Refer to the previous exercise. Here are the data for the previous year:

ENERGY

Month      Energy (quadrillion Btu)   Month       Energy (quadrillion Btu)
January    8.99                       July        8.27
February   8.02                       August      8.17
March      8.38                       September   7.64
April      7.52                       October     7.72
May        7.62                       November    8.14
June       7.72                       December    9.08

  1. (a) Analyze these data using the questions in the previous exercise as a guide.

  2. (b) Compare the patterns across the two years. Describe any similarities and differences.

Question 1.34

1.34 Favorite colors. What is your favorite color? One survey produced the following summary of responses to that question: blue, 42%; green, 14%; purple, 14%; red, 8%; black, 7%; orange, 5%; yellow, 3%; brown, 3%; gray, 2%; and white, 2%.16 Make a bar graph of the percents and write a short summary of the major features of your graph.

FAVCOL

Question 1.35

1.35 Least-favorite colors. Refer to the previous exercise. The same study also asked people about their least-favorite color. Here are the results: orange, 30%; brown, 23%; purple, 13%; yellow, 13%; gray, 12%; green, 4%; white, 4%; red, 1%; black, 0%; and blue, 0%. Make a bar graph of these percents and write a summary of the results.

LFAVCOL

Question 1.36

1.36 Garbage. The formal name for garbage is “municipal solid waste.” In the United States, approximately 250 million tons of garbage are generated in a year. Following is a breakdown of the materials that made up American municipal solid waste in 2012:17

GARBAGE

Material            Weight (million tons)   Percent of total
Food scraps          36.4                    14.5
Glass                11.6                     4.6
Metals               22.4                     8.9
Paper, paperboard    68.6                    27.4
Plastics             31.7                    12.7
Rubber, leather       7.5                     3.0
Textiles             14.3                     5.7
Wood                 15.8                     6.3
Yard trimmings       34.0                    13.5
Other                 8.5                     3.4
Total               250.9                   100.0
  1. (a) Add the weights. The sum is not exactly equal to the value of 250.9 million tons given in the table. Why?

  2. (b) Make a bar graph of the percents. The graph gives a clearer picture of the main contributors to garbage if you order the bars from tallest to shortest.

  3. (c) Also make a pie chart of the percents. Comparing the two graphs, notice that it is easier to see the small differences among “Food scraps,” “Plastics,” and “Yard trimmings” in the bar graph.

Question 1.37

1.37 Vehicle colors. Vehicle colors differ among regions of the world. Here are data on the most popular colors for vehicles in North America:18

VCOLOR

Color Percent
White 24
Black 19
Silver 16
Gray 15
Red 10
Blue 7
Brown 5
Other 4
  1. (a) Describe these data with a bar graph.

  2. (b) Describe these data with a pie chart.

  3. (c) Which graphical summary do you prefer? Give reasons for your answer.

Question 1.38

1.38 Sketch a skewed distribution. Sketch a histogram for a distribution that is skewed to the left. Suppose that you and your friends emptied your pockets of coins and recorded the year marked on each coin. The distribution of dates would be skewed to the left. Explain why.

Question 1.39

1.39 Grades and self-concept. Table 1.3 presents data on 78 seventh-grade students in a rural midwestern school.19 The researcher was interested in the relationship between the students’ “self-concept” and their academic performance. The data we give here include each student’s grade point average (GPA), score on a standard IQ test, and gender, taken from school records. Gender is coded as F for female and M for male. The students are identified only by an observation number (OBS). The missing OBS numbers show that some students dropped out of the study. The final variable is each student’s score on the Piers-Harris Children’s Self-Concept Scale, a psychological test administered by the researcher.

SEVENGR

  1. (a) How many variables does this data set contain? Which are categorical variables and which are quantitative variables?

  2. (b) Make a stemplot of the distribution of GPA, after rounding to the nearest tenth of a point.

  3. (c) Describe the shape, center, and spread of the GPA distribution. Identify any suspected outliers from the overall pattern.

  4. (d) Make a back-to-back stemplot of the rounded GPAs for female and male students. Write a brief comparison of the two distributions.

Question 1.40

1.40 Describe the IQ scores. Make a graph of the distribution of IQ scores for the seventh-grade students in Table 1.3. Describe the shape, center, and spread of the distribution, as well as any outliers. IQ scores are usually said to be centered at 100. Is the midpoint for these students close to 100, clearly above, or clearly below?

SEVENGR


TABLE 1.3 Educational Data for 78 Seventh-Grade Students
OBS GPA IQ Gender Self-concept OBS GPA IQ Gender Self-concept
001 7.940 111 M 67 043 10.760 123 M 64
002 8.292 107 M 43 044 9.763 124 M 58
003 4.643 100 M 52 045 9.410 126 M 70
004 7.470 107 M 66 046 9.167 116 M 72
005 8.882 114 F 58 047 9.348 127 M 70
006 7.585 115 M 51 048 8.167 119 M 47
007 7.650 111 M 71 050 3.647 97 M 52
008 2.412 97 M 51 051 3.408 86 F 46
009 6.000 100 F 49 052 3.936 102 M 66
010 8.833 112 M 51 053 7.167 110 M 67
011 7.470 104 F 35 054 7.647 120 M 63
012 5.528 89 F 54 055 0.530 103 M 53
013 7.167 104 M 54 056 6.173 115 M 67
014 7.571 102 F 64 057 7.295 93 M 61
015 4.700 91 F 56 058 7.295 72 F 54
016 8.167 114 F 69 059 8.938 111 F 60
017 7.822 114 F 55 060 7.882 103 F 60
018 7.598 103 F 65 061 8.353 123 M 63
019 4.000 106 M 40 062 5.062 79 M 30
020 6.231 105 F 66 063 8.175 119 M 54
021 7.643 113 M 55 064 8.235 110 M 66
022 1.760 109 M 20 065 7.588 110 M 44
024 6.419 108 F 56 068 7.647 107 M 49
026 9.648 113 M 68 069 5.237 74 F 44
027 10.700 130 F 69 071 7.825 105 M 67
028 10.580 128 M 70 072 7.333 112 F 64
029 9.429 128 M 80 074 9.167 105 M 73
030 8.000 118 M 53 076 7.996 110 M 59
031 9.585 113 M 65 077 8.714 107 F 37
032 9.571 120 F 67 078 7.833 103 F 63
033 8.998 132 F 62 079 4.885 77 M 36
034 8.333 111 F 39 080 7.998 98 F 64
035 8.175 124 M 71 083 3.820 90 M 42
036 8.000 127 M 59 084 5.936 96 F 28
037 9.333 128 F 60 085 9.000 112 F 60
038 9.500 136 M 64 086 9.500 112 F 70
039 9.167 106 M 71 087 6.057 114 M 51
040 10.140 118 F 72 088 6.057 93 F 21
041 9.999 119 F 54 089 6.938 106 M 56

Question 1.41

1.41 Describe the self-concept scores. Based on a suitable graph, briefly describe the distribution of self-concept scores for the students in Table 1.3. Be sure to identify any suspected outliers.

SEVENGR


Question 1.42

1.42 The Boston Marathon. Women were allowed to enter the Boston Marathon in 1972. Here are the times (in minutes, rounded to the nearest minute) for the winning women from 1972 to 2015.

Make a graph that shows change over time. What overall pattern do you see? Have times stopped improving in recent years? If so, when did improvement end?

MARATH

Year Time Year Time Year Time Year Time
1972 190 1983 143 1994 142 2005 145
1973 186 1984 149 1995 145 2006 143
1974 167 1985 154 1996 147 2007 149
1975 162 1986 145 1997 146 2008 145
1976 167 1987 146 1998 143 2009 152
1977 168 1988 145 1999 143 2010 146
1978 165 1989 144 2000 146 2011 142
1979 155 1990 145 2001 144 2012 151
1980 154 1991 144 2002 141 2013 146
1981 147 1992 144 2003 145 2014 139
1982 150 1993 145 2004 144 2015 145

1.3 1.3 Describing Distributions with Numbers

When you complete this section, you will be able to:

  • Describe the center of a distribution by using the mean.

  • Describe the center of a distribution by using the median.

  • Compare the mean and the median as measures of center for a particular set of data.

  • Describe the spread of a distribution by using quartiles.

  • Describe a distribution by using the five-number summary.

  • Describe a distribution by using a boxplot.

  • Compare one or more sets of data measured on the same variable by using side-by-side boxplots.

  • Identify outliers by using the 1.5 × IQR rule.

  • Describe the spread of a distribution by using the standard deviation.

  • Choose measures of center and spread for a particular set of data.

  • Compute the effects of a linear transformation on the mean, the median, the standard deviation, and the interquartile range.

We can begin our data exploration with graphs, but numerical summaries make our analysis more specific. For categorical variables, numerical summaries are the counts or percents that we use to construct pie charts or bar graphs. In this section, we focus on numerical summaries for quantitative variables. A brief description of the distribution of a quantitative variable should include its shape and numbers describing its center and spread. We describe the shape of a distribution based on inspection of a histogram or a stemplot. Now we will learn specific ways to use numbers to measure the center and spread of a distribution. We can calculate these numerical measures for any quantitative variable. But to interpret measures of center and spread, and to choose among the several measures we will learn, you must think about the shape of the distribution and the meaning of the data. The numbers, like graphs, are aids to understanding, not “the answer” in themselves.


EXAMPLE 1.23

The distribution of business start times. An entrepreneur faces many bureaucratic and legal hurdles when starting a new business. The World Bank collects information about starting businesses throughout the world. They have determined the time, in days, to complete all the procedures required to start a business.20 Data for 189 countries are included in the data set, TTS. For this section, we examine data, rounded to integers, for a sample of 24 of these countries. Here are the data:

16 4 5 6 5 7 12 19 10 2 25 19
38 5 24 8 6 5 53 32 13 49 11 17

TTS24

The stemplot in Figure 1.13 shows us the shape, center, and spread of the business start times. The stems are tens of days and the leaves are days. The distribution is skewed to the right with a very long tail of high values. All but six of the times are less than 20 days. The center appears to be about 10 days, and the values range from 2 days to 53 days. There do not appear to be any outliers.

Figure 1.13 Stemplot for the sample of 24 business start times, Example 1.23.

Measuring center: The mean

Numerical description of a distribution begins with a measure of its center or average. The two common measures of center are the mean and the median. The mean is the “average value” and the median is the “middle value.” These are two different ideas for “center,” and the two measures behave differently. We need precise recipes for the mean and the median.

THE MEAN x¯

To find the mean x¯ of a set of observations, add their values and divide by the number of observations. If the n observations are x1, x2, . . . , xn, their mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

or, in more compact notation,

$$\bar{x} = \frac{1}{n} \sum x_i$$

The Σ (capital Greek sigma) in the formula for the mean is short for “add them all up.” The bar over the x indicates the mean of all the x-values. Pronounce the mean x¯ as “x-bar.” This notation is so common that writers who are discussing data use x¯, y¯, etc., without additional explanation. The subscripts on the observations xi are a way of keeping the n observations separate.

EXAMPLE 1.24

Mean time to start a business. The mean time to start a business is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{16 + 4 + \cdots + 17}{24} = \frac{391}{24} = 16.292$$

TTS24

The mean time to start a business for the 24 countries in our data set is 16.3 days. Note that we have rounded the answer. Our goal in using the mean to describe the center of a distribution is not to demonstrate that we can compute with great accuracy. The additional digits do not provide any additional useful information. In fact, they distract our attention from the important digits that are meaningful. Do you think it would be better to report the mean as 16 days?

The value of the mean will not necessarily be equal to the value of one of the observations in the data set. Our example of time to start a business illustrates this fact.

In practice, you can key the data into your calculator and hit the Mean key. You don’t have to actually add and divide. But you should know that this is what the calculator is doing.
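For readers who use software rather than a calculator, here is a minimal Python sketch of the same recipe applied to the 24 start times from Example 1.23. The names `start_times`, `mean`, and `x_bar` are illustrative choices, not part of the text.

```python
# Start times (days) for the 24 sampled countries in Example 1.23.
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]


def mean(values):
    """Add the observations and divide by the number of observations."""
    return sum(values) / len(values)


x_bar = mean(start_times)
print(sum(start_times), len(start_times))  # 391 24
print(round(x_bar, 3))                     # 16.292
```

The sum and the count are printed separately so that the two steps of the recipe, adding and dividing, stay visible.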

USE YOUR KNOWLEDGE

Question 1.43

1.43 Include the outlier. For Example 1.23, a random sample of 24 countries was selected from a data set that included 189 countries. The South American country of Suriname, where the start time is 208 days, was not included in the random sample. Consider the effect of adding Suriname to the original set. Show that the mean for the new sample of 25 countries has increased to 24 days. (This is a rounded number. You should report the mean with two digits after the decimal to show that you have performed this calculation.)

TTS25

Question 1.44

1.44 Find the mean. Here are the scores on the first exam in an introductory statistics course for 10 students:

83 74 93 85 75 97 93 55 92 81

STAT

Find the mean first-exam score for these students.


Exercise 1.43 illustrates an important weakness of the mean as a measure of center: the mean is sensitive to the influence of a few extreme observations. These may be outliers, but a skewed distribution that has no outliers will also pull the mean toward its long tail. Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center.

A measure that is resistant does more than limit the influence of outliers. Its value does not respond strongly to changes in a few observations, no matter how large those changes may be. The mean fails this requirement because we can make the mean as large as we wish by making a large enough increase in just one observation. A resistant measure is sometimes called a robust measure.

Measuring center: The median

We used the midpoint of a distribution as an informal measure of center in Section 1.2. The median is the formal version of the midpoint, with a specific rule for calculation.

THE MEDIAN M

The median M is the midpoint of a distribution. Half the observations are smaller than the median and the other half are larger than the median. Here is a rule for finding the median:

  1. Arrange all observations in order of size, from smallest to largest.

  2. If the number of observations n is odd, the median M is the center observation in the ordered list. Find the location of the median by counting (n+1)/2 observations up from the bottom of the list.

  3. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list. The location of the median is again (n+1)/2 from the bottom of the list.

Note that the formula (n+1)/2 does not give the median, just the location of the median in the ordered list. Medians require little arithmetic, so they are easy to find by hand for small sets of data. Arranging even a moderate number of observations in order is tedious, however, so that finding the median by hand for larger sets of data is unpleasant. Even simple calculators have an x¯ button, but you will need computer software or a graphing calculator to automate finding the median.

EXAMPLE 1.25

Median time to start a business. To find the median time to start a business for our 24 countries, we first arrange the data in order from smallest to largest:

2 4 5 5 5 5 6 6 7 8 10 11
12 13 16 17 19 19 24 25 32 38 49 53

TTS24


The count of observations n=24 is even. The median, then, is the average of the two center observations in the ordered list. To find the location of the center observations, we first compute

$$\text{location of } M = \frac{n+1}{2} = \frac{25}{2} = 12.5$$

Therefore, the center observations are the 12th and 13th observations in the ordered list. The median is

$$M = \frac{11 + 12}{2} = 11.5$$

Note that you can use the stemplot in Figure 1.13 (page 28) directly to compute the median. In the stemplot the cases are already ordered and you simply need to count from the top or the bottom to the desired location.
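The rule for the median can be sketched in a few lines of Python. This is one possible translation of the three steps, using the location (n+1)/2 counted from 1 at the bottom of the ordered list; the function name `median` and the variable names are illustrative.

```python
# Start times (days) for the 24 sampled countries in Example 1.23.
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]


def median(values):
    """Median by the ordered-list rule; (n + 1) / 2 is its location."""
    ordered = sorted(values)            # step 1: arrange in order
    n = len(ordered)
    location = (n + 1) / 2              # counted from 1, not from 0
    if n % 2 == 1:                      # step 2: odd n, one center value
        return ordered[int(location) - 1]
    # step 3: even n, average the two observations around the location
    lower = ordered[int(location - 0.5) - 1]
    upper = ordered[int(location + 0.5) - 1]
    return (lower + upper) / 2


print((len(start_times) + 1) / 2)  # 12.5
print(median(start_times))         # 11.5
```

The subtraction of 1 in each index is needed only because Python lists count positions from 0, whereas the rule counts from 1.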

USE YOUR KNOWLEDGE

Question 1.45

1.45 Include the outlier. Include Suriname, where the start time is 208 days, in the data set, and show that the median is 12 days. Note that with this case included, the sample size is now 25 and the median is the 13th observation in the ordered list. Write out the ordered list and circle the outlier. Describe the effect of the outlier on the median for this set of data.

TTS25

Question 1.46

1.46 Calls to a customer service center. The service times for 80 calls to a customer service center are given in Table 1.2 (page 17). Use these data to compute the median service time.

CALLS80

Question 1.47

1.47 Find the median. Here are the scores on the first exam in an introductory statistics course for 10 students:

83 74 93 85 75 97 93 55 92 81

STAT

Find the median first-exam score for these students.

Mean versus median

Exercises 1.43 and 1.45 illustrate an important difference between the mean and the median. Suriname is an outlier. It pulls the mean time to start a business up from 16 days to 24 days. The median increased slightly, from 11.5 days to 12 days.

The median is more resistant than the mean. If the largest start time in the data set was 1200 days, the median for all 25 countries would still be 12 days. The largest observation just counts as one observation above the center, no matter how far above the center it lies. The mean uses the actual value of each observation and so will chase a single large observation upward.
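The comparison can be checked numerically. The sketch below uses Python's standard `statistics` module on three versions of the data: the original 24 countries, the 25 countries including Suriname at 208 days, and a hypothetical version in which that largest value is 1200 days. The list names are illustrative.

```python
from statistics import mean, median

# Start times (days) for the 24 sampled countries in Example 1.23.
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]

with_suriname = start_times + [208]   # Suriname added (Exercise 1.43)
with_extreme = start_times + [1200]   # hypothetical, even larger value

for label, data in [("24 countries", start_times),
                    ("with 208 days", with_suriname),
                    ("with 1200 days", with_extreme)]:
    print(label, round(mean(data), 2), median(data))
# 24 countries 16.29 11.5
# with 208 days 23.96 12
# with 1200 days 63.64 12
```

The mean follows the extreme value upward, while the median stays at 12 days in both enlarged data sets.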


The best way to compare the response of the mean and median to extreme observations is to use an interactive applet that allows you to place points on a line and then drag them with your computer’s mouse. Exercises 1.83, 1.84, and 1.85 use the Mean and Median applet on the website for this text to compare the mean and the median.


The median and mean are the most common measures of the center of a distribution. The mean and median of a symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is farther out in the long tail than is the median.

The endowment for a college or university is money set aside and invested. The income from the endowment is usually used to support various programs. The distribution of the sizes of the endowments of colleges and universities is strongly skewed to the right. Most institutions have modest endowments, but a few are very wealthy. The median endowment of colleges and universities in a recent year was $93 million—but the mean endowment was $498 million.21 The few wealthy institutions pull the mean up but do not affect the median. Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median.


We can now give a better answer to the question of how to deal with outliers in data. First, look at the data to identify outliers and investigate their causes. You can then correct outliers if they are wrongly recorded, delete them for good reason, or otherwise give them individual attention. The outlier in Example 1.21 (page 21) can be dropped from the data once we discover that it is an error. If you have no clear reason to drop outliers, you may want to use resistant measures in your analysis, so that outliers have little influence over your conclusions. The choice is often a matter for judgment.

Measuring spread: The quartiles

A measure of center alone can be misleading. Two countries with the same median family income are very different if one has extremes of wealth and poverty and the other has little variation among families. A drug manufactured with the correct mean concentration of active ingredient is dangerous if some batches are much too high and others much too low.

We are interested in the spread or variability of incomes and drug potencies as well as their centers. The simplest useful numerical description of a distribution consists of both a measure of center and a measure of spread.

We can describe the spread or variability of a distribution by giving several percentiles. The median divides the data in two; half of the observations are above the median and half are below the median. We could call the median the 50th percentile. The upper quartile is the median of the upper half of the data. Similarly, the lower quartile is the median of the lower half of the data. With the median, the quartiles divide the data into four equal parts; 25% of the data are in each part.

We can do a similar calculation for any percent. The pth percentile of a distribution is the value that has p percent of the observations fall at or below it. To calculate a percentile, arrange the observations in increasing order and count up the required percent from the bottom of the list.

Our definition of percentiles is a bit inexact because there is not always a value with exactly p percent of the data at or below it. We will be content to take the nearest observation for most percentiles, but the quartiles are important enough to require an exact rule.


THE QUARTILES Q1 AND Q3

To calculate the quartiles:

  1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.

  2. The first quartile Q1 is the median of the observations whose positions in the ordered list are to the left of the location of the overall median.

  3. The third quartile Q3 is the median of the observations whose positions in the ordered list are to the right of the location of the overall median.

Here is an example.

EXAMPLE 1.26

Finding the quartiles. Here is the ordered list of the times to start a business in our sample of 24 countries:

2 4 5 5 5 5 6 6 7 8 10 11
12 13 16 17 19 19 24 25 32 38 49 53

TTS24

The count of observations n=24 is even, so the median is at position (24+1)/2=12.5, that is, between the 12th and the 13th observation in the ordered list. There are 12 cases above this position and 12 below it. The first quartile is the median of the first 12 observations, and the third quartile is the median of the last 12 observations. Check that Q1=5.5 and Q3=21.5.

Notice that the quartiles are resistant. For example, Q3 would have the same value if the highest start time was 530 days rather than 53 days.


Be careful when several observations take the same numerical value. Write down all the observations and apply the rules just as if they all had distinct values.
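The quartile rule in this section can be written as a short Python function. This is a sketch of that specific rule, in which the observations to the left and to the right of the median's location are split off and their medians taken; the function name `quartiles` is illustrative, and `statistics.median` supplies the median of each half.

```python
from statistics import median

# Start times (days) for the 24 sampled countries in Example 1.23.
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]


def quartiles(values):
    """Q1 and Q3 as medians of the halves on either side of the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # positions left of the median
    upper = ordered[half + n % 2:]      # positions right of the median
    return median(lower), median(upper)


print(quartiles(start_times))  # (5.5, 21.5)

# Resistance check: replace the largest time, 53 days, by 530 days.
changed = [530 if x == 53 else x for x in start_times]
print(quartiles(changed))      # (5.5, 21.5)
```

When n is odd, the `n % 2` term skips the center observation, so it belongs to neither half. Tied values such as the four 5s need no special handling, because every observation keeps its own position in the ordered list.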

USE YOUR KNOWLEDGE

Question 1.48

1.48 Find the quartiles. Here are the scores on the first exam in an introductory statistics course for 10 students:

83 74 93 85 75 97 93 55 92 81

STAT

Find the quartiles for these first-exam scores.


There are several rules for calculating quartiles, which often give slightly different values. The differences are generally small. For describing data, just report the values that your software gives.


The five-number summary and boxplots

In Section 1.2, we used the smallest and largest observations to indicate the spread of a distribution. These single observations tell us little about the distribution as a whole, but they give information about the tails of the distribution that is missing if we know only Q1, M, and Q3. To get a quick summary of both center and spread, use all five numbers.

THE FIVE-NUMBER SUMMARY

The five-number summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. In symbols, the five-number summary is

Minimum   Q1   M   Q3   Maximum

EXAMPLE 1.27

Service center call lengths. Table 1.2 (page 17) gives the service center call lengths for the sample of 80 calls that we discussed in Example 1.15. The five-number summary for these data is 1.0, 54.5, 103.5, 200, and 2631. The distribution is highly skewed. The mean is 197 seconds, a value that is very close to the third quartile.

CALLS80
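As an illustration of how the five numbers fit together, the following Python sketch computes a five-number summary using the quartile rule from this section. Because Table 1.2 is not reproduced here, the sketch uses the 24 business start times from Example 1.23 instead of the call lengths; the function name `five_number_summary` is an illustrative choice.

```python
from statistics import median

# Start times (days) for the 24 sampled countries in Example 1.23.
start_times = [16, 4, 5, 6, 5, 7, 12, 19, 10, 2, 25, 19,
               38, 5, 24, 8, 6, 5, 53, 32, 13, 49, 11, 17]


def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum) using this section's rules."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])             # median of the lower half
    q3 = median(ordered[half + n % 2:])     # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]


print(five_number_summary(start_times))  # (2, 5.5, 11.5, 21.5, 53)
```

In this output the distance from the median to the maximum is much larger than the distance from the median to the minimum, which agrees with the right skew seen in the stemplot.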

USE YOUR KNOWLEDGE

Question 1.49

1.49 Verify the calculations. Refer to the five-number summary and the mean for service center call lengths given in Example 1.27. Verify these results. Do not use software for this exercise and be sure to show all your work.

CALLS80

Question 1.50

1.50 Find the five-number summary. Here are the scores on the first exam in an introductory statistics course for 10 students:

83 74 93 85 75 97 93 55 92 81

STAT

Find the five-number summary for these first-exam scores.

The five-number summary leads to another visual representation of a distribution, the boxplot.

BOXPLOT

A boxplot is a graph of the five-number summary.

  • A central box spans the quartiles Q1 and Q3.

  • A line in the box marks the median M.

  • Lines extend from the box out to the smallest and largest observations.


The lines extending to the smallest and largest observations are sometimes called whiskers, and boxplots are sometimes called box-and-whisker plots. Software provides many varieties of boxplots, some of which use different choices for the placement of the whiskers.

When you look at a boxplot, first locate the median, which marks the center of the distribution. Then look at the spread. The quartiles show the spread of the middle half of the data, and the extremes (the smallest and largest observations) show the spread of the entire data set.

EXAMPLE 1.28

IQ scores. In Example 1.14 (page 14), we used a histogram to examine the distribution of a sample of 60 IQ scores. A boxplot for these data is given in Figure 1.14. Note that the mean is marked with a “+” and appears very close to the median. The two quartiles are each approximately the same distance from the median, and the two whiskers are approximately the same distance from the corresponding quartiles. All these characteristics are consistent with a symmetric distribution, as illustrated by the histogram in Figure 1.7.

IQ

Figure 1.14 Boxplot for sample of 60 IQ scores, Example 1.28.

USE YOUR KNOWLEDGE

Question 1.51

1.51 Make a boxplot. Here are the scores on the first exam in an introductory statistics course for 10 students:

83 74 93 85 75 97 93 55 92 81

STAT

Make a boxplot for these first-exam scores.

The 1.5 × IQR rule for suspected outliers

If we look at the data in Table 1.2 (page 17), we can spot a clear outlier, a call lasting 2631 seconds, more than twice the length of any other call. How can we describe the spread of this distribution? The smallest and largest observations are extremes that do not describe the spread of the majority of the data. The distance between the quartiles (the range of the center half of the data) is a more resistant measure of spread than the range. This distance is called the interquartile range.


THE INTERQUARTILE RANGE IQR

The interquartile range IQR is the distance between the first and third quartiles,

IQR = Q3 − Q1

EXAMPLE 1.29

IQR for service center call length data. In Exercise 1.49 (page 34) you verified that the five-number summary for our data on service center call lengths was 1.0, 54.5, 103.5, 200, and 2631. Therefore, we calculate

IQR = Q3 − Q1 = 200 − 54.5 = 145.5

The quartiles and the IQR are not affected by changes in either tail of the distribution. They are resistant, therefore, because changes in a few data points have no further effect once these points move outside the quartiles.


However, no single numerical measure of spread, such as IQR, is very useful for describing skewed distributions. The two sides of a skewed distribution have different spreads, so one number can’t summarize them. We can often detect skewness from the five-number summary by comparing how far the first quartile and the minimum are from the median (left tail) with how far the third quartile and the maximum are from the median (right tail). The interquartile range is mainly used as the basis for a rule of thumb for identifying suspected outliers.

THE 1.5 × IQR RULE FOR OUTLIERS

Call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

EXAMPLE 1.30

Suspected outliers for call length data. For the call length data in Table 1.2 (page 17),

1.5 × IQR = 1.5 × 145.5 = 218.25

CALLS80

Any values below 54.5 − 218.25 = −163.75 or above 200 + 218.25 = 418.25 are flagged as possible outliers. There are no low outliers, but the eight longest calls are flagged as possible high outliers. Their lengths are

438 465 479 700 700 951 1148 2631

It is difficult to imagine calls lasting this long.
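The arithmetic of the rule can be sketched in Python. The function below takes the two quartiles and returns the cutoffs beyond which an observation is flagged; it is then applied to the quartiles of the call length data and to the eight longest calls listed above. The function name `outlier_cutoffs` and the variable names are illustrative.

```python
def outlier_cutoffs(q1, q3):
    """Lower and upper cutoffs for the 1.5 x IQR rule."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return q1 - step, q3 + step


# Quartiles of the call lengths, from the five-number summary.
q1, q3 = 54.5, 200
low, high = outlier_cutoffs(q1, q3)
print(low, high)  # -163.75 418.25

# The eight longest calls listed in Example 1.30.
longest_calls = [438, 465, 479, 700, 700, 951, 1148, 2631]
flagged = [x for x in longest_calls if x < low or x > high]
print(flagged == longest_calls)  # True
```

A call length cannot be negative, so the lower cutoff of −163.75 seconds can never flag anything in these data.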


USE YOUR KNOWLEDGE

Question 1.52

1.52 Find the IQR. Here are the scores on the first exam in an introductory statistics course for 10 students:

83 74 93 85 75 97 93 55 92 81

STAT

Find the interquartile range and use the 1.5 × IQR rule to check for outliers. How low would the lowest score need to be for it to be an outlier according to this rule?

Two variations on the basic boxplot can be very useful. The first, called a modified boxplot, uses the 1.5 × IQR rule. The lines that extend out from the quartiles are terminated in whiskers that are 1.5 × IQR in length. Points beyond the whiskers are plotted individually and are classified as outliers according to the 1.5 × IQR rule.

The other variation is to use two or more boxplots in the same graph to compare groups measured on the same variable. These are called side-by-side boxplots. The following example illustrates these two variations.

EXAMPLE 1.31

Do poets die young? According to William Butler Yeats, “She is the Gaelic muse, for she gives inspiration to those she persecutes. The Gaelic poets die young, for she is restless, and will not let them remain long on earth.” One study designed to investigate this issue examined the age at death for writers from different cultures and genders.22

POETS

Three categories of writers examined were novelists, poets, and nonfiction writers. We examine the ages at death for female writers in these categories from North America. Figure 1.15 shows modified side-by-side boxplots for the three categories of writers.

Displaying the boxplots for the three categories of writers lets us compare the three distributions. We see that nonfiction writers tend to live the longest, followed by novelists. The poets do appear to die young! There is one outlier among the nonfiction writers, which is plotted individually along with the value of its label. This writer died at the age of 40, young for a nonfiction writer, but not for a novelist or a poet!

Figure 1.15 Modified side-by-side boxplots for the data on writers’ age at death, for Example 1.31.


Measuring spread: The standard deviation

The five-number summary is not the most common numerical description of a distribution. That distinction belongs to the combination of the mean to measure center and the standard deviation to measure spread, or variability. The standard deviation measures spread by looking at how far the observations are from their mean.

THE STANDARD DEVIATION s

The variance s² of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations x1, x2, . . . , xn is

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

or, in more compact notation,

$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

The standard deviation s is the square root of the variance s²:

$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$

The idea behind the variance and the standard deviation as measures of spread is as follows: The deviations xi − x¯ display the spread of the values xi about their mean x¯. Some of these deviations will be positive and some negative because some of the observations fall on each side of the mean. In fact, the sum of the deviations of the observations from their mean will always be zero. Squaring the deviations makes the negative deviations positive so that observations far from the mean in either direction have large positive squared deviations. The variance is the average squared deviation. Therefore, s² and s will be large if the observations are widely spread about their mean and small if the observations are all close to the mean.

EXAMPLE 1.32

Metabolic rate. A person’s metabolic rate is the rate at which the body consumes energy. Metabolic rate is important in studies of weight gain, dieting, and exercise. Here are the metabolic rates of seven men who took part in a study of dieting. (The units are calories per 24 hours. These are the same calories used to describe the energy content of foods.)

1792 1666 1362 1614 1460 1867 1439

METABOL

Enter these data into your calculator or software and verify that

x¯=1600 calories  s=189.24 calories

Figure 1.16 plots these data as dots on the calorie scale, with their mean marked by an asterisk (*). The arrows mark two of the deviations from the mean. If you were calculating s by hand, you would find the first deviation as

$$x_1 - \bar{x} = 1792 - 1600 = 192$$


Figure 1.16 Metabolic rates for seven men, with the mean (*) and the deviations of two observations from the mean, Example 1.32.

Exercise 1.80 asks you to calculate the seven deviations from Example 1.32, square them, and find s² and s directly from the deviations. Working one or two short examples by hand helps you understand how the standard deviation is obtained. In practice, you will use either software or a calculator that will find s.
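The hand calculation can be mirrored step by step in software. The Python sketch below follows the definition directly for the seven metabolic rates in Example 1.32: deviations from the mean, their squares, the variance with divisor n − 1, and finally the square root. All variable names are illustrative.

```python
import math

# Metabolic rates (calories per 24 hours) of the seven men in Example 1.32.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
x_bar = sum(rates) / n                       # mean
deviations = [x - x_bar for x in rates]      # each observation minus the mean
squared = [d ** 2 for d in deviations]       # squared deviations
variance = sum(squared) / (n - 1)            # divide by n - 1, not n
s = math.sqrt(variance)                      # standard deviation

print(x_bar)             # 1600.0
print(deviations[0])     # 192.0
print(sum(deviations))   # 0.0
print(round(s, 2))       # 189.24
```

The third printed line confirms that the deviations add to zero, as the definition requires.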

USE YOUR KNOWLEDGE

Question 1.53

1.53 Find the variance and the standard deviation. Here are the scores on the first exam in an introductory statistics course for 10 students:

83 74 93 85 75 97 93 55 92 81

STAT

Find the variance and the standard deviation for these first-exam scores.

The idea of the variance is straightforward: it is the average of the squares of the deviations of the observations from their mean. The details we have just presented, however, raise some questions.

Why do we square the deviations?

  • First, the sum of the squared deviations of any set of observations from their mean is the smallest that the sum of squared deviations from any number can possibly be. This is not true of the unsquared distances. So squared deviations point to the mean as center in a way that distances do not.

  • Second, the standard deviation turns out to be the natural measure of spread for a particularly important class of symmetric unimodal distributions, the Normal distributions. We will meet the Normal distributions in the next section.
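The first point can be made concrete with a short derivation, offered here as a supplementary sketch in which c stands for any candidate center. Writing each deviation from c as a deviation from the mean plus a constant gives

```latex
\begin{aligned}
\sum (x_i - c)^2
  &= \sum \bigl[(x_i - \bar{x}) + (\bar{x} - c)\bigr]^2 \\
  &= \sum (x_i - \bar{x})^2
     + 2(\bar{x} - c)\sum (x_i - \bar{x})
     + n(\bar{x} - c)^2 \\
  &= \sum (x_i - \bar{x})^2 + n(\bar{x} - c)^2 .
\end{aligned}
```

The middle term vanishes because the deviations from the mean always sum to zero. The remaining extra term n(x¯ − c)² is never negative and equals zero only when c = x¯, so the sum of squared deviations is smallest when it is taken about the mean.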

Why do we emphasize the standard deviation rather than the variance?

  • One reason is that s, not s², is the natural measure of spread for Normal distributions, which are introduced in the next section.

  • There is also a more general reason to prefer s to s². Because the variance involves squaring the deviations, it does not have the same unit of measurement as the original observations. The variance of the metabolic rates, for example, is measured in squared calories. Taking the square root gives us a description of the spread of the distribution in the original measurement units.

Why do we average by dividing by n − 1 rather than n in calculating the variance?

  • Because the sum of the deviations is always zero, the last deviation can be found once we know the other n − 1. So we are not averaging n unrelated numbers. Only n − 1 of the squared deviations can vary freely, and we average by dividing the total by n − 1.

  • The number n − 1 is called the degrees of freedom of the variance or standard deviation. Many calculators offer a choice between dividing by n and dividing by n − 1, so be sure to use n − 1.
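Both points can be seen in a small Python sketch using the metabolic rates from Example 1.32. It first recovers the last deviation from the other n − 1, and then compares the two divisors. In Python's standard `statistics` module, `stdev` divides by n − 1 and `pstdev` divides by n; the variable names are illustrative.

```python
from statistics import pstdev, stdev

# Metabolic rates (calories per 24 hours) of the seven men in Example 1.32.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

x_bar = sum(rates) / len(rates)
deviations = [x - x_bar for x in rates]

# Because the deviations sum to zero, the last one is fixed by the others.
last_from_others = -sum(deviations[:-1])
print(last_from_others, deviations[-1])  # -161.0 -161.0

# Divisor n - 1 (the s of this section) versus divisor n.
print(round(stdev(rates), 2))            # 189.24
print(stdev(rates) > pstdev(rates))      # True
```

Dividing by n always gives the smaller value, so checking which divisor a calculator or program uses matters most for small data sets.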

Properties of the standard deviation

Here are the basic properties of the standard deviation s as a measure of spread.

PROPERTIES OF THE STANDARD DEVIATION

  • s measures spread about the mean and should be used only when the mean is chosen as the measure of center.

  • s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger.

  • s, like the mean x̄, is not resistant. A few outliers can make s very large.

USE YOUR KNOWLEDGE

Question 1.54

1.54 A standard deviation of zero. Construct a data set with 6 cases that has a variable with s = 0.

The use of squared deviations renders s even more sensitive than x̄ to a few extreme observations. For example, when we add Suriname to our sample of 24 countries for the analysis of the time to start a business (Exercise 1.43, page 29, and Exercise 1.45, page 31), we increase the standard deviation from 14.2 to 40.8! Distributions with outliers and strongly skewed distributions have standard deviations that do not give much helpful information about such distributions.
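The same effect can be seen in a small sketch. The numbers below are hypothetical and are not the time-to-start-a-business data. We add one extreme observation and compare how much the mean and the standard deviation grow.

```python
from statistics import mean, stdev

# Hypothetical sample, invented for illustration.
without_outlier = [5, 7, 8, 10, 12, 14]
with_outlier = without_outlier + [100]  # add one extreme observation

# Factor by which each measure grows when the outlier is included.
mean_ratio = mean(with_outlier) / mean(without_outlier)
s_ratio = stdev(with_outlier) / stdev(without_outlier)

print(f"the mean grows by a factor of {mean_ratio:.1f}")
print(f"s grows by a factor of {s_ratio:.1f}")
```

In this sketch, the single outlier inflates s by a much larger factor than it inflates the mean.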

USE YOUR KNOWLEDGE

Question 1.55

1.55 Effect of an outlier on the IQR. Find the IQR for the time to start a business with and without Suriname. What do you conclude about the sensitivity of this measure of spread to the inclusion of an outlier?

TTS24, TTS25

Choosing measures of center and spread

How do we choose between the five-number summary and x̄ and s to describe the center and spread of a distribution? Because the two sides of a strongly skewed distribution have different spreads, no single number such as s describes the spread well. The five-number summary, with its two quartiles and two extremes, does a better job.

CHOOSING A SUMMARY

The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use x̄ and s for reasonably symmetric distributions that are free of outliers.
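To illustrate this advice, here is a minimal sketch that computes both kinds of summary for a hypothetical right-skewed data set. The function `five_number_summary` is our own helper. It assumes the quartile rule in which each quartile is the median of the observations on one side of the location of the overall median. Software packages may use slightly different rules.

```python
from statistics import mean, median, stdev

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumes each quartile is the median of the observations to one side
    of the location of the overall median; needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, the middle observation belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical right-skewed observations, invented for illustration.
skewed = [1, 2, 2, 3, 3, 4, 5, 7, 12, 31]

minimum, q1, m, q3, maximum = five_number_summary(skewed)
print("five-number summary:", minimum, q1, m, q3, maximum)
print("mean:", mean(skewed), "| s:", round(stdev(skewed), 2))
```

Here the third quartile lies farther from the median than the first quartile does, which shows the unequal spread of the two sides. The mean is pulled up by the long right tail until it equals the third quartile.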

Remember that a graph gives the best overall picture of a distribution. Numerical measures of center and spread report specific facts about a distribution, but they do not describe its shape. Numerical summaries do not disclose the presence of multiple modes or gaps, for example. Always plot your data.

EXAMPLE 1.33

Results from software. We prefer to examine the numerical summaries and graphical summaries together. Figure 1.17 gives (a) a boxplot, (b) a histogram, and (c) numerical summaries for the time to start a business from Example 1.23 (page 28) using Minitab. Similar displays are given for SPSS in Figure 1.18 (a), (b), and (c) and for JMP in Figure 1.19. Examine and compare the outputs carefully. Notice that they give different numbers of significant digits for some of these numerical summaries. There are also variations in how they make the boxplots and how they define classes for the histograms.

TTS24

Figure 1.17 Graphical and numerical summaries from Minitab: (a) boxplot, (b) histogram, and (c) numerical summaries for the time to start a business, Example 1.33.

Figure 1.18 Graphical and numerical summaries from SPSS: (a) boxplot, (b) histogram, and (c) numerical summaries for the time to start a business, Example 1.33.

Figure 1.19 Graphical and numerical summaries from JMP for the time to start a business, Example 1.33.

Changing the unit of measurement

The same variable can be recorded in different units of measurement. Americans commonly record distances in miles and temperatures in degrees Fahrenheit, while the rest of the world measures distances in kilometers and temperatures in degrees Celsius. Fortunately, it is easy to convert numerical descriptions of a distribution from one unit of measurement to another. This is true because a change in the measurement unit is a linear transformation of the measurements.

LINEAR TRANSFORMATIONS

A linear transformation changes the original variable x into the new variable xnew given by an equation of the form

xnew = a + bx

Adding the constant a shifts all values of x upward or downward by the same amount. In particular, such a shift changes the origin (zero point) of the variable. Multiplying by the positive constant b changes the size of the unit of measurement.

EXAMPLE 1.34

Change the units.

  1. (a) If a distance x is measured in kilometers, the same distance in miles is

    xnew = 0.62x

    For example, a 10-kilometer race covers 6.2 miles. This transformation changes the units without changing the origin—a distance of 0 kilometers is the same as a distance of 0 miles.

  2. (b) A temperature x measured in degrees Fahrenheit must be reexpressed in degrees Celsius to be easily understood by the rest of the world. The transformation is

    xnew = (5/9)(x − 32) = −160/9 + (5/9)x

    Thus, the high of 95°F on a hot American summer day translates into 35°C. In this case,

    a = −160/9 and b = 5/9

    This linear transformation changes both the unit size and the origin of the measurements. The origin in the Celsius scale (0°C, the temperature at which water freezes) is 32° in the Fahrenheit scale.
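The two conversions in Example 1.34 can be written as short functions. This is only a sketch. The function names are our own, and the factor 0.62 is the rounded value used in the example.

```python
from math import isclose

def km_to_miles(km):
    """Part (a): xnew = 0.62x, so a = 0 and b = 0.62."""
    return 0.62 * km

def fahrenheit_to_celsius(f):
    """Part (b): xnew = (5/9)(x - 32)."""
    return (5 / 9) * (f - 32)

def linear_transform(x, a, b):
    """General form xnew = a + bx."""
    return a + b * x

print("10 km in miles:", km_to_miles(10))
print("95 F in Celsius:", round(fahrenheit_to_celsius(95), 1))

# The same temperature conversion written as a + bx with a = -160/9, b = 5/9.
same_value = isclose(linear_transform(95, -160 / 9, 5 / 9), fahrenheit_to_celsius(95))
print("a + bx form agrees:", same_value)
```

Writing the temperature conversion both ways confirms that (5/9)(x − 32) and −160/9 + (5/9)x describe the same transformation.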

Linear transformations do not change the shape of a distribution. If measurements on a variable x have a right-skewed distribution, any new variable xnew obtained by a linear transformation xnew = a + bx (for b > 0) will also have a right-skewed distribution. If the distribution of x is symmetric and unimodal, the distribution of xnew remains symmetric and unimodal.

Although a linear transformation preserves the basic shape of a distribution, the center and spread will change. Because linear changes of measurement scale are common, we must be aware of their effect on numerical descriptive measures of center and spread. Fortunately, the changes follow a simple pattern.

EXAMPLE 1.35

Use scores to find the points. In an introductory statistics course, homework counts for 300 points out of a total of 1000 possible points for all course requirements. During the semester, there were 12 homework assignments, and each was given a grade on a scale of 0 to 100. The maximum total score for the 12 homework assignments is therefore 1200. To convert the homework scores to final grade points, we need to convert the scale of 0 to 1200 to a scale of 0 to 300. We do this by multiplying the homework scores by 300/1200. In other words, we divide the homework scores by 4. Here are the homework scores and the corresponding final grade points for five students:

Student   1      2      3      4      5
Score     1056   1080   900    1164   1020
Points    264    270    225    291    255

These two sets of numbers measure the same performance on homework for the course. Because we obtained the points by dividing the scores by 4, the mean of the points will be the mean of the scores divided by 4. Similarly, the standard deviation of points will be the standard deviation of the scores divided by 4.
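We can confirm these statements with the five scores from Example 1.35. The sketch below uses Python's standard `statistics` module. The variable names are our own.

```python
from math import isclose
from statistics import mean, stdev

# Homework scores for the five students in Example 1.35.
scores = [1056, 1080, 900, 1164, 1020]

# Convert the 0-to-1200 scale to the 0-to-300 scale by dividing by 4.
points = [score / 4 for score in scores]

print("points:", points)
print("mean of scores:", mean(scores), "| mean of points:", mean(points))
print("s of scores:", round(stdev(scores), 2), "| s of points:", round(stdev(points), 2))
print("ratio of standard deviations:", round(stdev(scores) / stdev(points), 6))
```

Both the mean and the standard deviation of the points equal the corresponding values for the scores divided by 4.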

USE YOUR KNOWLEDGE

Question 1.56

1.56 Calculate the points for a student. Use the setting of Example 1.35 to find the points for a student whose score is 950.

Here is a summary of the rules for linear transformations:

EFFECT OF A LINEAR TRANSFORMATION

To see the effect of a linear transformation on measures of center and spread, apply these rules:

  • Multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and standard deviation) by b.

  • Adding the same number a (either positive or negative) to each observation adds a to measures of center and to quartiles and other percentiles but does not change measures of spread.

In Example 1.35, when we converted from score to points, we described the transformation as dividing by 4. The multiplication part of the summary of the effect of a linear transformation applies to this case because division by 4 is the same as multiplication by 0.25. Similarly, the second part of the summary applies to subtraction as well as addition because subtraction is simply the addition of a negative number.

The measures of spread IQR and s do not change when we add the same number a to all the observations because adding a constant changes the location of the distribution but leaves the spread unaltered. You can find the effect of a linear transformation xnew = a + bx by combining these rules. For example, if x has mean x̄, the transformed variable xnew has mean a + bx̄.
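For the mean and the standard deviation, the rules follow directly from the definitions. The short derivation below is a sketch in LaTeX notation. It assumes the definition of the variance with divisor n − 1 and a positive multiplier b.

```latex
\begin{align*}
\bar{x}_{\text{new}} &= \frac{1}{n}\sum_{i=1}^{n}(a + b x_i) = a + b\,\bar{x} \\
s_{\text{new}}^{2} &= \frac{1}{n-1}\sum_{i=1}^{n}\bigl((a + b x_i) - (a + b\,\bar{x})\bigr)^{2}
  = b^{2}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^{2} \\
s_{\text{new}} &= b\,s \qquad (b > 0)
\end{align*}
```

The constant a cancels inside every deviation, which is why adding a leaves the spread unchanged.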

SECTION 1.3 SUMMARY

  • A numerical summary of a distribution should report its center and its spread or variability.

  • The mean x̄ and the median M describe the center of a distribution in different ways. The mean is the arithmetic average of the observations, and the median is their midpoint.

  • When you use the median to describe the center of a distribution, describe its spread by giving the quartiles. The first quartile Q1 has one-fourth of the observations below it, and the third quartile Q3 has three-fourths of the observations below it.

  • The interquartile range is the difference between the quartiles. It is the spread of the center half of the data. The 1.5 × IQR rule flags observations more than 1.5 × IQR beyond the quartiles as possible outliers.

  • The five-number summary consisting of the median, the quartiles, and the smallest and largest individual observations provides a quick overall description of a distribution. The median describes the center, and the quartiles and extremes show the spread.

  • Boxplots based on the five-number summary are useful for comparing several distributions. The box spans the quartiles and shows the spread of the central half of the distribution. The median is marked within the box. Lines extend from the box to the extremes and show the full spread of the data. In a modified boxplot, points identified by the 1.5 × IQR rule are plotted individually. Side-by-side boxplots can be used to display boxplots for more than one group on the same graph.

  • The variance s² and especially its square root, the standard deviation s, are common measures of spread about the mean as center. The standard deviation s is zero when there is no spread and gets larger as the spread increases.

  • A resistant measure of any aspect of a distribution is relatively unaffected by changes in the numerical value of a small proportion of the total number of observations, no matter how large these changes are. The median and quartiles are resistant, but the mean and the standard deviation are not.

  • The mean and standard deviation are good descriptions for symmetric distributions without outliers. They are most useful for the Normal distributions introduced in the next section. The five-number summary is a better exploratory description for skewed distributions.

  • Linear transformations have the form xnew = a + bx. A linear transformation changes the origin if a ≠ 0 and changes the size of the unit of measurement if b > 0. Linear transformations do not change the overall shape of a distribution. A linear transformation multiplies a measure of spread by b and changes a percentile or measure of center m into a + bm.

  • Numerical measures of particular aspects of a distribution, such as center and spread, do not report the entire shape of most distributions. In some cases, particularly distributions with multiple peaks and gaps, these measures may not be very informative.

SECTION 1.3 EXERCISES

For Exercises 1.43 and 1.44, see page 29; for Exercises 1.45 to 1.47, see page 31; for Exercise 1.48, see page 33; for Exercises 1.49 and 1.50, see page 34; for Exercise 1.51, see page 35; for Exercise 1.52, see page 37; for Exercise 1.53, see page 39; for Exercise 1.54, see page 40; for Exercise 1.55, see page 40; and for Exercise 1.56, see page 45.

Question 1.57

1.57 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days.

KPOT40

  1. (a) Compute the mean for these data.

  2. (b) Compute the median for these data.

  3. (c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)

Question 1.58

1.58 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days.

KSUP40

  1. (a) Compute the mean for these data.

  2. (b) Compute the median for these data.

  3. (c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)

Question 1.59

1.59 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days.

KPOT40

  1. (a) Compute the standard deviation for these data.

  2. (b) Compute the quartiles for these data.

  3. (c) Give the five-number summary and explain the meaning of each of the five numbers.

  4. (d) Which numerical summaries do you prefer for describing the distribution, the mean and the standard deviation or the five-number summary? Explain your answer. (You may include a graphical summary as part of your explanation.)

Question 1.60

1.60 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days.

KSUP40

  1. (a) Compute the standard deviation for these data.

  2. (b) Compute the quartiles for these data.

  3. (c) Give the five-number summary and explain the meaning of each of the five numbers.

  4. (d) Which numerical summaries do you prefer for describing the distribution, the mean and the standard deviation or the five-number summary? Explain your answer. (You may include a graphical summary as part of your explanation.)

Question 1.61

1.61 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days. In Exercise 1.30, you used a stemplot to examine the distribution of the potassium absorption.

KPOT40

  1. (a) Make a histogram and use it to describe the distribution of potassium absorption.

  2. (b) Make a boxplot and use it to describe the distribution of potassium absorption.

  3. (c) Compare the stemplot, the histogram, and the boxplot as graphical summaries of this distribution. Which do you prefer? Give reasons for your answer.

Question 1.62

1.62 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days. In Exercise 1.31, you used a stemplot to examine the distribution of the potassium absorption.

KSUP40

  1. (a) Make a histogram and use it to describe the distribution of potassium absorption.

  2. (b) Make a boxplot and use it to describe the distribution of potassium absorption.

  3. (c) Compare the stemplot, the histogram, and the boxplot as graphical summaries of this distribution. Which do you prefer? Give reasons for your answer.

Question 1.63

1.63 Compare the potatoes with the supplement. Refer to Exercises 1.30 and 1.31 (page 24). Use a back-to-back stemplot to display the data for the two sources of potassium. Use the stemplot to compare the two distributions and write a short summary of your findings.

KPS40

Question 1.64

1.64 Potassium sources. Refer to Exercises 1.30 and 1.31 (page 24). Use side-by-side boxplots to describe the distributions.

KPS40

  1. (a) Summarize what you see in the boxplots and compare it with what you saw in the stemplots.

  2. (b) For comparing these two distributions, do you prefer back-to-back stemplots or side-by-side boxplots? Give reasons for your answer.

Question 1.65

1.65 Gosset’s data on double stout sales. William Sealy Gosset worked at the Guinness Brewery in Dublin and made substantial contributions to the practice of statistics.23 In his work at the brewery, he collected and analyzed a great deal of data. Archives with Gosset’s handwritten tables, graphs, and notes have been preserved at the Guinness Storehouse in Dublin.24 In one study, Gosset examined the change in the double stout market before and after World War I (1914–1918). For various regions in England and Scotland, he calculated the ratio of sales in 1925, after the war, as a percent of sales in 1913, before the war. Here are the data:

STOUT

Bristol          94     Glasgow             66
Cardiff          112    Liverpool           140
English Agents   78     London              428
English O        68     Manchester          190
English P        46     Newcastle-on-Tyne   118
English R        111    Scottish            24
  1. (a) Compute the mean for these data.

  2. (b) Compute the median for these data.

  3. (c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)

Question 1.66

1.66 Measures of spread for the double stout data. Refer to the previous exercise.

STOUT

  1. (a) Compute the standard deviation for these data.

  2. (b) Compute the quartiles for these data.

  3. (c) Which measure do you prefer for describing the spread of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)

Question 1.67

1.67 Are there outliers in the double stout data? Refer to the previous two exercises.

STOUT

  1. (a) Find the IQR for these data.

  2. (b) Use the 1.5 × IQR rule to identify and name any outliers.

  3. (c) Make a boxplot for these data and describe the distribution using only the information in the boxplot.

  4. (d) Make a modified boxplot for these data and describe the distribution using only the information in the boxplot.

  5. (e) Make a stemplot for these data.

  6. (f) Compare the boxplot, the modified boxplot, and the stemplot. Evaluate the advantages and disadvantages of each graphical summary for describing the distribution of the double stout data.

Question 1.68

1.68 Smolts. Smolts are young salmon at a stage when their skin becomes covered with silvery scales and they start to migrate from freshwater to the sea. The reflectance of a light shined on a smolt’s skin is a measure of the smolt’s readiness for the migration. Here are the reflectances, in percents, for a sample of 50 smolts:25

SMOLTS

57.6 54.8 63.4 57.0 54.7 42.3 63.6 55.5 33.5 63.3
58.3 42.1 56.1 47.8 56.1 55.9 38.8 49.7 42.3 45.6
69.0 50.4 53.0 38.3 60.4 49.3 42.8 44.5 46.4 44.3
58.9 42.1 47.6 47.9 69.2 46.6 68.1 42.8 45.6 47.3
59.6 37.8 53.9 43.2 51.4 64.5 43.8 42.7 50.9 43.8
  1. (a) Find the mean reflectance for these smolts.

  2. (b) Find the median reflectance for these smolts.

  3. (c) Do you prefer the mean or the median as a measure of center for these data? Give reasons for your preference.

Question 1.69

1.69 Measures of spread for smolts. Refer to the previous exercise.

SMOLTS

  1. (a) Find the standard deviation of the reflectance for these smolts.

  2. (b) Find the quartiles of the reflectance for these smolts.

  3. (c) Do you prefer the standard deviation or the quartiles as a measure of spread for these data? Give reasons for your preference.

Question 1.70

1.70 Are there outliers in the smolt data? Refer to the previous two exercises.

SMOLTS

  1. (a) Find the IQR for the smolt data.

  2. (b) Use the 1.5 × IQR rule to identify any outliers.

  3. (c) Make a boxplot for the smolt data and describe the distribution using only the information in the boxplot.

  4. (d) Make a modified boxplot for these data and describe the distribution using only the information in the boxplot.

  5. (e) Make a stemplot for these data.

  6. (f) Compare the boxplot, the modified boxplot, and the stemplot. Evaluate the advantages and disadvantages of each graphical summary for describing the distribution of the smolt reflectance data.

Question 1.71

1.71 Potatoes. A quality product is one that is consistent and has very little variability in its characteristics. Controlling variability can be more difficult with agricultural products than with those that are manufactured. The following table gives the weights, in ounces, of the 25 potatoes sold in a 10-pound bag:

POTATO

7.6 7.9 8.0 6.9 6.7 7.9 7.9 7.9 7.6 7.8 7.0 4.7 7.6
6.3 4.7 4.7 4.7 6.3 6.0 5.3 4.3 7.9 5.2 6.0 3.7
  1. (a) Summarize the data graphically and numerically. Give reasons for the methods you chose to use in your summaries.

  2. (b) Do you think that your numerical summaries do an effective job of describing these data? Why or why not?

  3. (c) There appear to be two distinct clusters of weights for these potatoes. Divide the sample into two subsamples based on the clustering. Give the mean and standard deviation for each subsample. Do you think that this way of summarizing these data is better than a numerical summary that uses all the data as a single sample? Give a reason for your answer.

Question 1.72

1.72 The alcohol content of beer. Brewing beer involves a variety of steps that can affect the alcohol content. A website gives the percent alcohol for 159 domestic brands of beer.26

BEER

  1. (a) Use graphical and numerical summaries of your choice to describe the data. Give reasons for your choice.

  2. (b) The data set contains an outlier. Explain why this particular beer is unusual.

  3. (c) For the outlier, give a short description of how you think this particular beer should be marketed.

Question 1.73

1.73 Outlier for alcohol content of beer. Refer to the previous exercise.

BEER

  1. (a) Calculate the mean with and without the outlier. Do the same for the median. Explain how these values change when the outlier is excluded.

  2. (b) Calculate the standard deviation with and without the outlier. Do the same for the quartiles. Explain how these values change when the outlier is excluded.

  3. (c) Write a short paragraph summarizing what you have learned in this exercise.

Question 1.74

1.74 Calories in beer. Refer to the previous two exercises. The data set also lists calories per 12 ounces of beverage.

BEER

  1. (a) Analyze the data and summarize the distribution of calories for these 159 brands of beer.

  2. (b) In the previous exercise, you identified one brand of beer as an outlier. To what extent is this brand an outlier in the distribution of calories? Explain your answer.

  3. (c) Does the distribution of calories suggest marketing strategies for this brand of beer? Describe some marketing strategies.

Question 1.75

1.75 Median versus mean for net worth. A report on the assets of American households says that the median net worth of U.S. families is $81,200. The mean net worth of these families is $534,600.27 What explains the difference between these two measures of center?

Question 1.76

1.76 Create a data set. Create a data set with seven observations for which the median would change by a large amount if the smallest observation were deleted.

Question 1.77

1.77 Mean versus median. A small accounting firm pays each of its seven clerks $55,000, three junior accountants $80,000 each, and the firm’s owner $650,000. What is the mean salary paid at this firm? How many of the employees earn less than the mean? What is the median salary?

Question 1.78

1.78 Be careful about how you treat the zeros. In computing the median income of any group, some federal agencies omit all members of the group who had no income. Give an example to show that the reported median income of a group can go down even though the group becomes economically better off. Is this also true of the mean income?

Question 1.79

1.79 How does the median change? The firm in Exercise 1.77 gives no raises to the clerks and junior accountants, while the owner’s take increases to $500,000. How does this change affect the mean? How does it affect the median?

Question 1.80

1.80 Metabolic rates. Calculate the mean and standard deviation of the metabolic rates in Example 1.32 (page 38), showing each step in detail. First find the mean x̄ by summing the seven observations and dividing by 7. Then find each of the deviations xᵢ − x̄ and their squares. Check that the deviations have sum 0. Calculate the variance as an average of the squared deviations (remember to divide by n − 1). Finally, obtain s as the square root of the variance.

METABOL

Question 1.81

1.81 Earthquakes. Each year there are about 900,000 earthquakes of magnitude 2.5 or less that are usually not felt. In contrast, there are about 10 of magnitude 7.0 that cause serious damage.28 Explain why the average magnitude of earthquakes is not a good measure of their impact.

Question 1.82

1.82 IQ scores. Many standard statistical methods that you will study in Part II of this book are intended for use with distributions that are symmetric and have no outliers. These methods start with the mean and standard deviation, x̄ and s. For example, standard methods would typically be used for the IQ and GPA data in Table 1.3 (page 26).

IQGPA

  1. (a) Find x̄ and s for the IQ data. In large populations, IQ scores are standardized to have mean 100 and standard deviation 15. In what way does the distribution of IQ among these students differ from the overall population?

  2. (b) Find the median IQ score. It is, as we expect, close to the mean.

  3. (c) Find the mean and median for the GPA data. The two measures of center differ a bit. What feature of the data (see your stemplot in Exercise 1.39 or make a new stemplot) explains the difference?

Question 1.83

1.83 Mean and median for two observations. The Mean and Median applet allows you to place observations on a line and see their mean and median visually. Place two observations on the line by clicking below it. Why does only one arrow appear?

Question 1.84

1.84 Mean and median for three observations. In the Mean and Median applet, place three observations on the line by clicking below it, two close together near the center of the line and one somewhat to the right of these two.

  1. (a) Pull the single rightmost observation out to the right. (Place the cursor on the point, hold down a mouse button, and drag the point.) How does the mean behave? How does the median behave? Explain briefly why each measure acts as it does.

  2. (b) Now drag the rightmost point to the left as far as you can. What happens to the mean? What happens to the median as you drag this point past the other two (watch carefully)?

Question 1.85

1.85 Mean and median for seven observations. Place seven observations on the line in the Mean and Median applet by clicking below it.

  1. (a) Add one additional observation without changing the median. Where is your new point?

  2. (b) Use the applet to convince yourself that when you add yet another observation (there are now nine in all), the median does not change no matter where you put the ninth point. Explain why this must be true.

Question 1.86

1.86 Imputation. Various problems with data collection can cause some observations to be missing. Suppose a data set has 20 cases. Here are the values of the variable x for 10 of these cases:

IMPUTE

17 6 12 14 20 23 9 12 16 21

The values for the other 10 cases are missing. One way to deal with missing data is called imputation. The basic idea is that missing values are replaced, or imputed, with values that are based on an analysis of the data that are not missing. For a data set with a single variable, the usual choice of a value for imputation is the mean of the values that are not missing. The mean for this data set is 15.

  1. (a) Verify that the mean is 15 and find the standard deviation for the 10 cases for which x is not missing.

  2. (b) Create a new data set with 20 cases by setting the values for the 10 missing cases to 15. Compute the mean and standard deviation for this data set.

  3. (c) Summarize what you have learned about the possible effects of this type of imputation on the mean and the standard deviation.

Question 1.87

1.87 A standard deviation contest. This is a standard deviation contest. You must choose four numbers from the whole numbers 10 to 20, with repeats allowed.

  1. (a) Choose four numbers that have the smallest possible standard deviation.

  2. (b) Choose four numbers that have the largest possible standard deviation.

  3. (c) Is more than one choice possible in either part (a) or part (b)? Explain.

Question 1.88

1.88 Longleaf pine trees. The Wade Tract in Thomas County, Georgia, is an old-growth forest of longleaf pine trees (Pinus palustris) that has survived in a relatively undisturbed state since before the settlement of the area by Europeans. A study collected data on 584 of these trees.29 One of the variables measured was the diameter at breast height (DBH). This is the diameter of the tree at 4.5 feet and the units are centimeters (cm). Only trees with DBH greater than 1.5 cm were sampled. Here are the diameters of a random sample of 40 of these trees:

PINES

10.5 13.3 26.0 18.3 52.2 9.2 26.1 17.6 40.5 31.8
47.2 11.4 2.7 69.3 44.4 16.9 35.7 5.4 44.2 2.2
4.3 7.8 38.1 2.2 11.4 51.5 4.9 39.7 32.6 51.8
43.6 2.3 44.6 31.5 40.3 22.3 43.3 37.5 29.1 27.9
  1. (a) Find the five-number summary for these data.

  2. (b) Make a boxplot.

  3. (c) Make a histogram.

  4. (d) Write a short summary of the major features of this distribution. Do you prefer the boxplot or the histogram for these data?

Question 1.89

1.89 Weight gain. A study of diet and weight gain deliberately overfed 15 volunteers for eight weeks. The mean increase in fat was x̄ = 2.41 kilograms, and the standard deviation was s = 1.25 kilograms. What are x̄ and s in pounds? (A kilogram is 2.2 pounds.)

Question 1.90

1.90 Changing units from inches to centimeters. Changing the unit of length from inches to centimeters multiplies each length by 2.54 because there are 2.54 centimeters in an inch. This change of units multiplies our usual measures of spread by 2.54. This is true of IQR and the standard deviation. What happens to the variance when we change units in this way?

Question 1.91

1.91 A different type of mean. The trimmed mean is a measure of center that is more resistant than the mean but uses more of the available information than the median. To compute the 10% trimmed mean, discard the highest 10% and the lowest 10% of the observations and compute the mean of the remaining 80%. Trimming eliminates the effect of a small number of outliers. Compute the 10% trimmed mean of the service time data in Table 1.2 (page 17). Then compute the 20% trimmed mean. Compare the values of these measures with the median and the ordinary untrimmed mean.
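The trimming procedure described in Exercise 1.91 can be sketched as a short function. The helper `trimmed_mean` and the data below are hypothetical and are not the service time data. When the trimming percent does not correspond to a whole number of observations, this sketch rounds the number dropped from each end down, which is only one of several conventions.

```python
from math import isclose
from statistics import mean, median

def trimmed_mean(values, percent):
    """Mean after discarding the lowest and highest `percent` of the values.

    `percent` is a whole number such as 10 or 20. The count dropped from
    each end is rounded down (an assumption of this sketch).
    """
    ordered = sorted(values)
    n = len(ordered)
    k = (n * percent) // 100  # observations to drop from each end
    return mean(ordered[k:n - k])

# Hypothetical observations with one high outlier, invented for illustration.
data = [1, 4, 5, 5, 6, 7, 7, 8, 9, 48]

print("ordinary mean:", mean(data))
print("10% trimmed mean:", trimmed_mean(data, 10))
print("20% trimmed mean:", round(trimmed_mean(data, 20), 3))
print("median:", median(data))
```

For these hypothetical values, both trimmed means fall close to the median, while the ordinary mean is pulled upward by the single large observation.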

Question 1.92

1.92 Changing units from centimeters to inches. Refer to Exercise 1.88 (page 50). Change the measurements from centimeters to inches by multiplying each value by 0.39. Answer the questions from that exercise and explain the effect of the transformation on these data.

1.4 Density Curves and Normal Distributions

When you complete this section, you will be able to:

  • Compare the mean and the median for symmetric and skewed distributions.

  • Sketch a Normal distribution for any given mean and standard deviation.

  • Apply the 68–95–99.7 rule to find proportions of observations within one, two, and three standard deviations of the mean for any Normal distribution.

  • Transform values of a variable from a general Normal distribution to the standard Normal distribution.

  • Compute areas under a Normal curve using software or Table A.

  • Perform inverse Normal calculations to find values of a Normal variable corresponding to various areas.

  • Assess the extent to which the distribution of a set of data can be approximated by a Normal distribution.

We now have a kit of graphical and numerical tools for describing distributions. What is more, we have a clear strategy for exploring data on a single quantitative variable:

  1. Always plot your data: make a graph, usually a stemplot or a histogram.

  2. Look for the overall pattern and for striking deviations such as outliers.

  3. Calculate an appropriate numerical summary to briefly describe center and spread.

Technology has expanded the set of graphs that we can choose for Step 1. It is possible, though painful, to make histograms by hand. Using software, clever algorithms can describe a distribution in a way that is not feasible by hand, by fitting a smooth curve to the data in addition to or instead of a histogram. The curves used are called density curves. Before we examine density curves in detail, here is an example of what software can do.


EXAMPLE 1.36

Density curves for times to start a business and Titanic passenger ages. Figure 1.20 illustrates the use of density curves along with histograms to describe distributions. Figure 1.20(a) shows the distribution of the times to start a business for 189 countries (see Example 1.23, page 28). The outlier, Suriname, described in Exercise 1.43 (page 29) has been deleted from the data set. The distribution is highly skewed to the right. Most of the data are in the first two classes, with 40 or fewer days to start a business.

TTS

Exercise 1.27 (page 24) describes data on the class of the ticket of the Titanic passengers, and Figure 1.20(b) shows the distribution of the ages of these passengers. It has a single mode, a long right tail, and a relatively short left tail.

TITANIC

Figure 1.20 (a) The distribution of the time to start a business, Example 1.36. The distribution is pictured with both a histogram and a density curve. (b) The distribution of the ages of the Titanic passengers, Example 1.36. These distributions have a single mode with tails of two different lengths.


Figure 1.21 (a) The distribution of Iowa Test vocabulary scores for Gary, Indiana, seventh-graders, Example 1.37. The shaded bars in the histogram represent scores less than or equal to 6.0. (b) The shaded area under the Normal density curve also represents scores less than or equal to 6.0. This area is 0.293, close to the true 0.303 for the actual data.

A smooth density curve is an idealization that gives the overall pattern of the data but ignores minor irregularities. We first discuss density curves in general and then focus on a special class of density curves, the bell-shaped Normal curves.

Density curves

One way to think of a density curve is as a smooth approximation to the irregular bars of a histogram. Figure 1.21 shows a histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills. Scores of many students on this national test have a very regular distribution. The histogram is symmetric, and both tails fall off quite smoothly from a single center peak. There are no large gaps or obvious outliers. The curve drawn through the tops of the histogram bars in Figure 1.21 is a good description of the overall pattern of the data.

EXAMPLE 1.37

Vocabulary scores. In a histogram, the areas of the bars represent either counts or proportions of the observations. In Figure 1.21(a), we shaded the bars that represent students with vocabulary scores 6.0 or lower. There are 287 such students, who make up the proportion 287/947 = 0.303 of all Gary seventh-graders. The shaded bars in Figure 1.21(a) make up proportion 0.303 of the total area under all the bars. If we adjust the scale so that the total area of the bars is 1, the area of the shaded bars will also be 0.303.


In Figure 1.21(b), we shaded the area under the curve to the left of 6.0. If we adjust the scale so that the total area under the curve is exactly 1, areas under the curve will then represent proportions of the observations. That is, area = proportion. The curve is then a density curve. The shaded area under the density curve in Figure 1.21(b) represents the proportion of students with score 6.0 or lower. This area is 0.293, only 0.010 away from the histogram result. You can see that areas under the density curve give quite good approximations of areas given by the histogram.

DENSITY CURVE

A density curve is a curve that

  • Is always on or above the horizontal axis.

  • Has area exactly 1 underneath it.

A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values is the proportion of all observations that fall in that range.
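The two defining properties, and the "area = proportion" reading, can be checked numerically. The following is a minimal sketch, not a method from the text: the curve `density` is a hypothetical right-skewed density on the interval from 0 to 1, and `area_under` is an illustrative helper that approximates areas with the trapezoid rule.

```python
def density(x):
    """A hypothetical right-skewed density curve on the interval from 0 to 1."""
    return 2.0 * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def area_under(curve, lo, hi, steps=10_000):
    """Approximate the area under `curve` between lo and hi (trapezoid rule)."""
    width = (hi - lo) / steps
    total = 0.5 * (curve(lo) + curve(hi))
    for i in range(1, steps):
        total += curve(lo + i * width)
    return total * width


# Property 1: the curve is never below the horizontal axis (checked on a grid).
never_negative = all(density(i / 1000) >= 0 for i in range(-500, 1501))

# Property 2: the total area under the curve is 1.
total_area = area_under(density, 0.0, 1.0)

# The area above a range of values is the proportion of observations in that range.
proportion_0_to_half = area_under(density, 0.0, 0.5)

print("never negative:", never_negative)
print("total area:", round(total_area, 4))
print("proportion between 0 and 0.5:", round(proportion_0_to_half, 4))
```

For this particular curve, three-quarters of the area lies above the interval from 0 to 0.5, so a distribution described by it would have about 75% of its observations in that range.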

The density curve in Figure 1.21 is a Normal curve. Density curves, like distributions, come in many shapes. Figure 1.22 shows two density curves, a symmetric Normal density curve and a right-skewed curve.

We will discuss Normal density curves in detail in this section because of the important role that they play in statistics. There are, however, many applications where the use of other families of density curves is essential.

A density curve of an appropriate shape is often an adequate description of the overall pattern of a distribution. Outliers, which are deviations from the overall pattern, are not described by the curve.

Measuring center and spread for density curves

Our measures of center and spread apply to density curves as well as to actual sets of observations, but only some of these measures are easily seen from the curve. A mode of a distribution described by a density curve is a peak point of the curve, the location where the curve is highest. Because areas under a density curve represent proportions of the observations, the median is the point with half the total area on each side. You can roughly locate the quartiles by dividing the area under the curve into quarters as accurately as possible by eye. The IQR is the distance between the first and third quartiles. There are mathematical ways of calculating areas under curves. These allow us to locate the median and quartiles exactly on any density curve.

Figure 1.22 (a) A symmetric Normal density curve with its mean and median marked. (b) A right-skewed density curve with its mean and median marked.

Figure 1.23 The mean of a density curve is the point at which it would balance.

What about the mean and standard deviation? The mean of a set of observations is their arithmetic average. If we think of the observations as weights strung out along a thin rod, the mean is the point at which the rod would balance. This fact is also true of density curves. The mean is the point at which the curve would balance if it were made out of solid material. Figure 1.23 illustrates this interpretation of the mean.

A symmetric curve, such as the Normal curve in Figure 1.22(a), balances at its center of symmetry. Half the area under a symmetric curve lies on either side of its center, so this is also the median.

For a right-skewed curve, such as those shown in Figures 1.22(b) and 1.23, the small area in the long right tail tips the curve more than the same area near the center. The mean (the balance point), therefore, lies to the right of the median. It is hard to locate the balance point by eye on a skewed curve. There are mathematical ways of calculating the mean for any density curve, so we are able to mark the mean as well as the median in Figure 1.22(b). The standard deviation can also be calculated mathematically, but it can’t be located by eye on most density curves.

MEDIAN AND MEAN OF A DENSITY CURVE

The median of a density curve is the equal-areas point, the point that divides the area under the curve in half.

The mean of a density curve is the balance point, at which the curve would balance if made of solid material.

The median and mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail.
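To see the equal-areas point and the balance point side by side, here is a small sketch under stated assumptions. The curve `density` is a hypothetical right-skewed density on the interval from 0 to 1, and the helper names are illustrative. The median is found by bisection on the area to the left, and the mean by the standard calculus definition, the area under x times the height of the curve.

```python
def density(x):
    """A hypothetical right-skewed density curve on the interval from 0 to 1."""
    return 2.0 * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def area_under(curve, lo, hi, steps=2_000):
    """Approximate the area under `curve` between lo and hi (trapezoid rule)."""
    width = (hi - lo) / steps
    total = 0.5 * (curve(lo) + curve(hi))
    for i in range(1, steps):
        total += curve(lo + i * width)
    return total * width


def median_of(curve, lo, hi, iterations=50):
    """Equal-areas point: bisect until half the area lies to the left."""
    left, right = lo, hi
    for _ in range(iterations):
        mid = (left + right) / 2
        if area_under(curve, lo, mid) < 0.5:
            left = mid
        else:
            right = mid
    return (left + right) / 2


def mean_of(curve, lo, hi):
    """Balance point: the area under x * curve(x)."""
    return area_under(lambda x: x * curve(x), lo, hi)


median = median_of(density, 0.0, 1.0)
mean = mean_of(density, 0.0, 1.0)

print(f"median = {median:.4f}")
print(f"mean   = {mean:.4f}")
print("mean lies to the right of the median:", mean > median)
```

For this right-skewed curve the mean comes out larger than the median, which is the pull toward the long tail described above.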

A density curve is an idealized description of a distribution of data. For example, the density curve in Figure 1.21 is exactly symmetric, but the histogram of vocabulary scores is only approximately symmetric. We therefore need to distinguish between the mean and standard deviation of the density curve and the numbers x̄ and s computed from the actual observations. The usual notation for the mean of an idealized distribution is μ (the Greek letter mu). We write the standard deviation of a density curve as σ (the Greek letter sigma). In Chapter 5, we refer to x̄ and s as statistics associated with a sample and to μ and σ as parameters associated with a population.


Normal distributions

One particularly important class of density curves has already appeared in Figures 1.21 and 1.22(a). These density curves are symmetric, unimodal, and bell-shaped. They are called Normal curves, and they describe Normal distributions. All Normal distributions have the same overall shape.

The exact density curve for a particular Normal distribution is specified by giving the distribution’s mean μ and its standard deviation σ. The mean is located at the center of the symmetric curve and is the same as the median. Changing μ without changing σ moves the Normal curve along the horizontal axis without changing its spread.

The standard deviation σ controls the spread of a Normal curve. Figure 1.24 shows two Normal curves with different values of σ. The curve with the larger standard deviation is more spread out.

The standard deviation σ is the natural measure of spread for Normal distributions. Not only do μ and σ completely determine the shape of a Normal curve, but we can locate σ by eye on the curve. Here’s how. As we move out in either direction from the center μ, the curve changes from falling ever more steeply to falling ever less steeply. The points at which this change of curvature takes place are located at distance σ on either side of the mean μ. You can feel the change as you run your finger along a Normal curve, and so find the standard deviation. Remember that μ and σ alone do not specify the shape of most distributions, and that the shape of density curves in general does not reveal σ. These are special properties of Normal distributions.

Figure 1.24 Two Normal curves, showing the mean μ and the standard deviation σ.

There are other symmetric bell-shaped density curves that are not Normal. The Normal density curves are specified by a particular equation. The height of the density curve at any point x is given by

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

We will not make direct use of this fact, although it is the basis of mathematical work with Normal distributions. Notice that the equation of the curve is completely determined by the mean μ and the standard deviation σ.
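Although we will not work with the equation directly, a short sketch shows what it encodes. The function below evaluates the height of a Normal curve, and a numerical second derivative confirms that the change of curvature happens at distance σ from the mean. The values of `MU` and `SIGMA` are hypothetical, chosen only for illustration, and `curvature` is an illustrative helper based on a central difference.

```python
import math


def normal_density(x, mu, sigma):
    """Height of the Normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def curvature(x, mu, sigma, h=1e-3):
    """Numerical second derivative of the Normal density at x (central difference)."""
    step = h * sigma
    f = lambda t: normal_density(t, mu, sigma)
    return (f(x + step) - 2 * f(x) + f(x - step)) / step ** 2


MU, SIGMA = 50.0, 10.0  # hypothetical values, for illustration only

inside = curvature(MU + 0.5 * SIGMA, MU, SIGMA)    # closer to the mean than sigma
at_sigma = curvature(MU + SIGMA, MU, SIGMA)        # exactly sigma from the mean
outside = curvature(MU + 1.5 * SIGMA, MU, SIGMA)   # farther from the mean than sigma

print(f"curvature inside sigma : {inside:.2e}  (negative: falling ever more steeply)")
print(f"curvature at sigma     : {at_sigma:.2e}  (essentially zero)")
print(f"curvature outside sigma: {outside:.2e}  (positive: falling ever less steeply)")
```

The sign of the curvature flips at μ + σ, and by symmetry at μ − σ as well.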

Why are the Normal distributions important in statistics? Here are three reasons.

  1. Normal distributions are good descriptions for some distributions of real data. Distributions that are often close to Normal include scores on tests taken by many people (such as the Iowa Test of Figure 1.21, page 53), repeated careful measurements of the same quantity, and characteristics of biological populations (such as lengths of baby pythons and yields of corn).

  2. Normal distributions are good approximations to the results of many kinds of chance outcomes, such as tossing a coin many times.

  3. Many statistical inference procedures based on Normal distributions work well for other roughly symmetric distributions.

However, even though many sets of data follow a Normal distribution, many do not. Most income distributions, for example, are skewed to the right and so are not Normal. Non-Normal data, like nonnormal people, not only are common but are also sometimes more interesting than their Normal counterparts.

The 68–95–99.7 rule

Although there are many Normal curves, they all have common properties. Here is one of the most important.

THE 68–95–99.7 RULE

In the Normal distribution with mean μ and standard deviation σ:

  • Approximately 68% of the observations fall within σ of the mean μ.

  • Approximately 95% of the observations fall within 2σ of μ.

  • Approximately 99.7% of the observations fall within 3σ of μ.

Figure 1.25 illustrates the 68–95–99.7 rule. By remembering these three numbers, you can think about Normal distributions without constantly making detailed calculations.

Figure 1.25 The 68–95–99.7 rule for Normal distributions.

EXAMPLE 1.38

Heights of young women. The distribution of heights of young women aged 18 to 24 is approximately Normal with mean μ = 64.5 inches and standard deviation σ = 2.5 inches. Figure 1.26 shows what the 68–95–99.7 rule says about this distribution.

Two standard deviations equals five inches for this distribution. The 95 part of the 68–95–99.7 rule says that the middle 95% of young women are between 64.5 − 5 and 64.5 + 5 inches tall, that is, between 59.5 and 69.5 inches. This fact is exactly true for an exactly Normal distribution. It is approximately true for the heights of young women because the distribution of heights is approximately Normal.

The other 5% of young women have heights outside the range from 59.5 to 69.5 inches. Because the Normal distributions are symmetric, half of these women are on the tall side. So the tallest 2.5% of young women are taller than 69.5 inches.
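The percentages in the rule are rounded values, and software can report them to more decimal places. The sketch below is one way to do this with Python's standard library, using the fact that the proportion of a Normal distribution within k standard deviations of the mean equals erf(k/√2). The function name `proportion_within` is illustrative; the mean and standard deviation are those of Example 1.38.

```python
import math


def proportion_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


MU, SIGMA = 64.5, 2.5  # heights of young women in inches, Example 1.38

for k in (1, 2, 3):
    low, high = MU - k * SIGMA, MU + k * SIGMA
    print(f"within {k} sd: {low:.1f} to {high:.1f} inches, "
          f"proportion {proportion_within(k):.4f}")

# By symmetry, half of what lies outside 2 standard deviations is in the upper tail.
upper_tail = (1 - proportion_within(2)) / 2
print(f"proportion taller than {MU + 2 * SIGMA:.1f} inches: {upper_tail:.4f}")
```

The computed proportions agree with 68%, 95%, and 99.7% after rounding, and the upper tail beyond 69.5 inches is close to the 2.5% found in the example.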

Figure 1.26 The 68–95–99.7 rule applied to the heights of young women, Example 1.38.

Because we will mention Normal distributions often, a short notation is helpful. We abbreviate the Normal distribution with mean μ and standard deviation σ as N(μ, σ). For example, the distribution of young women’s heights is N(64.5, 2.5).

USE YOUR KNOWLEDGE

Question 1.93

1.93 Test scores. Many states assess the skills of their students in various grades. One program that is available for this purpose is the National Assessment of Educational Progress (NAEP).³⁰ One of the tests provided by the NAEP assesses the reading skills of 12th-grade students. In a recent year, the national mean score was 288 and the standard deviation was 38. Assuming that these scores are approximately Normally distributed, N(288, 38), use the 68–95–99.7 rule to give a range of scores that includes 95% of these students.

Question 1.94

1.94 Use the 68–95–99.7 rule. Refer to the previous exercise. Use the 68–95–99.7 rule to give a range of scores that includes 99.7% of these students.

Standardizing observations

As the 68–95–99.7 rule suggests, all Normal distributions share many properties. In fact, all Normal distributions are the same if we measure in units of size σ about the mean μ as center. Changing to these units is called standardizing. To standardize a value, subtract the mean of the distribution and then divide by the standard deviation.

STANDARDIZING AND z-SCORES

If x is an observation from a distribution that has mean μ and standard deviation σ, the standardized value of x is

$$z = \frac{x - \mu}{\sigma}$$

A standardized value is often called a z-score.

A z-score tells us how many standard deviations the original observation falls away from the mean, and in which direction. Observations larger than the mean are positive when standardized, and observations smaller than the mean are negative.

To compare scores based on different measures, z-scores can be very useful. For example, see Exercise 1.124 (page 73), where you are asked to compare an SAT score with an ACT score.

EXAMPLE 1.39

Find some z-scores. The heights of young women are approximately Normal with μ = 64.5 inches and σ = 2.5 inches. The z-score for height is

$$z = \frac{\text{height} - 64.5}{2.5}$$

A woman’s standardized height is the number of standard deviations by which her height differs from the mean height of all young women. A woman 68 inches tall, for example, has z-score

$$z = \frac{68 - 64.5}{2.5} = 1.4$$

or 1.4 standard deviations above the mean. Similarly, a woman 5 feet (60 inches) tall has z-score

$$z = \frac{60 - 64.5}{2.5} = -1.8$$

or 1.8 standard deviations less than the mean height.
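The same arithmetic is easy to reproduce in software. This is a minimal sketch; the function name `z_score` is illustrative, and the mean and standard deviation are those given in the example.

```python
def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma


MU, SIGMA = 64.5, 2.5  # heights of young women in inches

z_tall = z_score(68, MU, SIGMA)   # a woman 68 inches tall
z_short = z_score(60, MU, SIGMA)  # a woman 5 feet (60 inches) tall

print(round(z_tall, 2), round(z_short, 2))  # prints: 1.4 -1.8
```

The sign carries the direction: a positive z-score is above the mean, and a negative z-score is below it.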

USE YOUR KNOWLEDGE

Question 1.95

1.95 Find the z-score. Consider the NAEP scores (see Exercise 1.93, page 59), which we assume are approximately Normal, N(288, 38). Give the z-score for a student who received a score of 350.

Question 1.96

1.96 Find another z-score. Consider the NAEP scores, which we assume are approximately Normal, N(288, 38). Give the z-score for a student who received a score of 240. Explain why your answer is negative even though all the test scores are positive.

We need a way to write variables, such as “height” in Example 1.38, that follow a theoretical distribution such as a Normal distribution. We use capital letters near the end of the alphabet for such variables. If X is the height of a young woman, we can then shorten “the height of a young woman is less than 68 inches” to “X < 68.” We will use lowercase x to stand for any specific value of the variable X.

We often standardize observations from symmetric distributions to express them in a common scale. We might, for example, compare the heights of two children of different ages by calculating their z-scores. The standardized heights tell us where each child stands in the distribution for his or her age group.

Standardizing is a linear transformation that transforms the data into the standard scale of z-scores. We know that a linear transformation does not change the shape of a distribution, and that the mean and standard deviation change in a simple manner. In particular, the standardized values for any distribution always have mean 0 and standard deviation 1.
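A quick check of this claim on a small data set: the sketch below standardizes a list of values using their own mean and standard deviation and then recomputes both summaries. The data values are hypothetical, made up for illustration, and the sample statistics x̄ and s stand in for the mean and standard deviation of the distribution.

```python
import statistics

# Hypothetical data, for illustration only.
data = [61.0, 63.5, 64.0, 66.5, 70.0]

xbar = statistics.mean(data)
s = statistics.stdev(data)

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = [(x - xbar) / s for x in data]

z_mean = statistics.mean(z_scores)
z_sd = statistics.stdev(z_scores)

print(f"mean of z-scores = {z_mean:.6f}")
print(f"standard deviation of z-scores = {z_sd:.6f}")
```

Up to rounding error, the standardized values have mean 0 and standard deviation 1, whatever the shape of the original data.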

If the variable we standardize has a Normal distribution, standardizing does more than give a common scale. It makes all Normal distributions into a single distribution, and this distribution is still Normal. Standardizing a variable that has any Normal distribution produces a new variable that has the standard Normal distribution.

THE STANDARD NORMAL DISTRIBUTION

The standard Normal distribution is the Normal distribution N(0, 1) with mean 0 and standard deviation 1.

If a variable X has any Normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable

$$Z = \frac{X - \mu}{\sigma}$$

has the standard Normal distribution.

Figure 1.27 The cumulative proportion for a value x is the proportion of all observations from the distribution that are less than or equal to x. This is the area to the left of x under the Normal curve.

Normal distribution calculations

Areas under a Normal curve represent proportions of observations from that Normal distribution. There is no formula for areas under a Normal curve. Calculations use either software that calculates areas or a table of areas. The table and most software calculate one kind of area: cumulative proportions. A cumulative proportion is the proportion of observations in a distribution that lie at or below a given value. When the distribution is given by a density curve, the cumulative proportion is the area under the curve to the left of a given value. Figure 1.27 shows the idea more clearly than words do.

The key to calculating Normal proportions is to match the area you want with areas that represent cumulative proportions. Then get areas for cumulative proportions either from software or (with an extra step) from a table. The following examples show the method in pictures.

EXAMPLE 1.40

NCAA eligibility for competition. To be eligible to compete in their first year of college, the National Collegiate Athletic Association (NCAA) requires Division I athletes to meet certain academic standards. These are based on their grade point average (GPA) in certain courses and combined scores on the SAT Critical Reading and Mathematics sections or the ACT composite score.³¹


For a student with a 3.0 GPA, the combined SAT score must be 800 or higher. Based on the distribution of SAT scores for college-bound students, we assume that the distribution of the combined Critical Reading and Mathematics scores is approximately Normal with mean 1010 and standard deviation 225.³² What proportion of college-bound students have SAT scores of 800 or more?

Here is the calculation in pictures: the proportion of scores above 800 is the area under the curve to the right of 800. That’s the total area under the curve (which is always 1) minus the cumulative proportion up to 800.

image

area right of 800 = total area − area left of 800

0.8247 = 1 − 0.1753


That is, the proportion of college-bound SAT takers with a 3.0 GPA who are eligible to compete is 0.8247, or about 82%.

There is no area under a smooth curve that is exactly over the point 800. Consequently, the area to the right of 800 (the proportion of scores > 800) is the same as the area at or to the right of this point (the proportion of scores ≥ 800). The actual data may contain a student who scored exactly 800 on the SAT. That the proportion of scores exactly equal to 800 is 0 for a Normal distribution is a consequence of the idealized smoothing of Normal distributions for data.

EXAMPLE 1.41

NCAA eligibility for aid and practice. The NCAA has a category of eligibility in which a first-year student may not compete but is still eligible to receive an athletic scholarship and to practice with the team. The requirements for this category are a 3.0 GPA and combined SAT Critical Reading and Mathematics scores of at least 620.

What proportion of college-bound students who take the SAT would be eligible to receive an athletic scholarship and to practice with the team but would not be eligible to compete? That is, what proportion have scores between 620 and 800? Here are the pictures:

image

area between 620 and 800 = area left of 800 − area left of 620

0.1338 = 0.1753 − 0.0415

About 13% of college-bound students with a 3.0 GPA have SAT scores between 620 and 800.

How do we find the numerical values of the areas in Examples 1.40 and 1.41? If you use software, just plug in mean 1010 and standard deviation 225. Then ask for the cumulative proportions for 800 and for 620. (Your software will probably refer to these as “cumulative probabilities.” We will learn in Chapter 4 why the language of probability fits.) Sketches of the areas that you want similar to the ones in Examples 1.40 and 1.41 are very helpful in making sure that you are doing the correct calculations.
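As one illustration of the software route, the sketch below uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); its `cdf` method returns cumulative proportions. Other statistical software will have its own function names. The variable names here are illustrative.

```python
from statistics import NormalDist

# Combined SAT scores, assumed approximately Normal with mean 1010 and sd 225.
sat = NormalDist(mu=1010, sigma=225)

left_of_800 = sat.cdf(800)  # cumulative proportion up to 800
left_of_620 = sat.cdf(620)  # cumulative proportion up to 620

# Example 1.40: area right of 800 = total area - area left of 800
at_least_800 = 1 - left_of_800

# Example 1.41: area between 620 and 800 = area left of 800 - area left of 620
between_620_and_800 = left_of_800 - left_of_620

print(f"area left of 800        = {left_of_800:.4f}")
print(f"area left of 620        = {left_of_620:.4f}")
print(f"area right of 800       = {at_least_800:.4f}")
print(f"area between 620 and 800 = {between_620_and_800:.4f}")
```

Each line of code mirrors one of the picture equations in the examples, which is a useful habit: write the calculation in terms of cumulative proportions first, then let the software supply the numbers.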


You can use the Normal Curve applet on the text website to find Normal proportions. The applet is more flexible than most software—it will find any Normal proportion, not just cumulative proportions. The applet is an excellent way to understand Normal curves. But, because of the limitations of web browsers, the applet is not as accurate as statistical software.

If you are not using software, you can find cumulative proportions for Normal curves from a table. That requires an extra step, as we now explain.


Using the standard Normal table

The extra step in finding cumulative proportions from a table is that we must first standardize to express the problem in the standard scale of z-scores. This allows us to get by with just one table, a table of standard Normal cumulative proportions. Table A in the back of the book gives standard Normal probabilities. The picture at the top of the table reminds us that the entries are cumulative proportions, areas under the curve to the left of a value z.

EXAMPLE 1.42

Find the proportion from z. What proportion of observations on a standard Normal variable Z take values less than 1.47? We need to find the area to the left of 1.47; locate 1.4 in the left-hand column of Table A and then locate the remaining digit 7 as .07 in the top row. The entry opposite 1.4 and under .07 is 0.9292. This is the cumulative proportion we seek. Figure 1.28 illustrates this area.

Figure 1.28 The area under a standard Normal curve to the left of the point z = 1.47 is 0.9292, Example 1.42.

Now that you see how Table A works, let’s redo the NCAA Examples 1.40 and 1.41 using the table.

EXAMPLE 1.43

Find the proportion from x. What proportion of college-bound students who take the SAT have scores of at least 800? The picture that leads to the answer is exactly the same as in Example 1.40. The extra step is that we first standardize to read cumulative proportions from Table A. If X is SAT score, we want the proportion of students for which X ≥ x, where x = 800.

  1. Standardize. Subtract the mean, then divide by the standard deviation, to transform the problem about X into a problem about a standard Normal Z:

    X ≥ 800

    $$\frac{X - 1010}{225} \ge \frac{800 - 1010}{225}$$

    Z ≥ −0.93

  2. Use the table. Look at the pictures in Example 1.40. From Table A, we see that the proportion of observations less than −0.93 is 0.1762. The area to the right of −0.93 is therefore 1 − 0.1762 = 0.8238. This is about 82%.

The area from the table in Example 1.43 (0.8238) is slightly less accurate than the area from software in Example 1.40 (0.8247) because we must round z to two places when we use Table A. The difference is rarely important in practice.
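The effect of rounding z can be reproduced directly. This sketch mimics the table lookup by rounding z to two decimal places before asking for the standard Normal cumulative proportion, and compares the result with the unrounded calculation. It uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative, and the code does not round the table entry itself to four places as the printed Table A does.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # N(0, 1)

z_exact = (800 - 1010) / 225   # about -0.9333
z_table = round(z_exact, 2)    # -0.93, the value we can look up in a two-decimal table

table_style = 1 - standard_normal.cdf(z_table)     # as in Example 1.43
software_style = 1 - standard_normal.cdf(z_exact)  # as in Example 1.40

print(f"z rounded to two places: {z_table}")
print(f"area right of 800 using rounded z  = {table_style:.4f}")
print(f"area right of 800 using unrounded z = {software_style:.4f}")
```

The two answers differ only in the third decimal place, which is why the difference rarely matters in practice.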

EXAMPLE 1.44

Eligibility for aid and practice. What proportion of all students who take the SAT would be eligible to receive athletic scholarships and to practice with the team but would not be eligible to compete in the eyes of the NCAA? That is, what proportion of students have SAT scores between 620 and 800? First, sketch the areas, exactly as in Example 1.41. We again use X as shorthand for an SAT score.

  1. Standardize.

    620 ≤ X < 800

    $$\frac{620 - 1010}{225} \le \frac{X - 1010}{225} < \frac{800 - 1010}{225}$$

    −1.73 ≤ Z < −0.93

  2. Use the table.

    area between −1.73 and −0.93 = (area left of −0.93) − (area left of −1.73)

    = 0.1762 − 0.0418 = 0.1344

As in Example 1.41, about 13% of students would be eligible to receive athletic scholarships and to practice with the team.

Sometimes we encounter a value of z more extreme than those appearing in Table A. For example, the area to the left of z = −4 is not given in the table. The z-values in Table A leave only area 0.0002 in each tail unaccounted for. For practical purposes, we can act as if there is zero area outside the range of Table A.

USE YOUR KNOWLEDGE

Question 1.97

1.97 Find the proportion. Consider the NAEP scores, which are approximately Normal, N(288, 38). Find the proportion of students who have scores less than 350. Find the proportion of students who have scores greater than or equal to 350. Sketch the relationship between these two calculations using pictures of Normal curves similar to the ones given in Example 1.40 (page 61).

Question 1.98

1.98 Find another proportion. Consider the NAEP scores, which are approximately Normal, N(288, 38). Find the proportion of students who have scores between 300 and 350. Use pictures of Normal curves similar to the ones given in Example 1.41 (page 62) to illustrate your calculations.

Inverse Normal calculations

Examples 1.40 to 1.44 illustrate the use of Normal distributions to find the proportion of observations in a given event, such as “SAT score between 620 and 800.” We may instead want to find the observed value corresponding to a given proportion.


Statistical software will do this directly. Without software, use Table A backward, finding the desired proportion in the body of the table and then reading the corresponding z from the left column and top row.

EXAMPLE 1.45

How high for the top 10%? Scores for college-bound students on the SAT Critical Reading test in recent years follow approximately the N(500, 120) distribution.³³ How high must a student score to place in the top 10% of all students taking the SAT?

Again, the key to the problem is to draw a picture. Figure 1.29 shows that we want the score x with an area of 0.10 above it. That’s the same as area below x equal to 0.90.

Figure 1.29 Locating the point on a Normal curve with area 0.10 to its right, Example 1.45.

Statistical software has a function that will give you the x for any cumulative proportion you specify. The function often has a name such as “inverse cumulative probability.” Plug in mean 500, standard deviation 120, and cumulative proportion 0.9. The software tells you that x = 653.786. We see that a student must score at least 654 to place in the highest 10%.

Without software, first find the standard score z with cumulative proportion 0.9, then “unstandardize” to find x. Here is the two-step process:

  1. Use the table. Look in the body of Table A for the entry closest to 0.9. It is 0.8997. This is the entry corresponding to z = 1.28. So z = 1.28 is the standardized value with area 0.9 to its left.

  2. Unstandardize to transform the solution from z back to the original x scale. We know that the standardized value of the unknown x is z = 1.28. So x itself satisfies

    $$\frac{x - 500}{120} = 1.28$$

    Solving this equation for x gives

    x = 500 + (1.28)(120) = 653.6

    This equation should make sense: it finds the x that lies 1.28 standard deviations above the mean on this particular Normal curve. That is the “unstandardized” meaning of z = 1.28. The general rule for unstandardizing a z-score is

    x = μ + zσ
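Both routes for Example 1.45 can be sketched in a few lines. The inverse calculation uses the `inv_cdf` method of `NormalDist` in Python's standard `statistics` module (Python 3.8 and later), which plays the role of the "inverse cumulative probability" function. The helper `unstandardize` is an illustrative name for the rule x = μ + zσ.

```python
from statistics import NormalDist

MU, SIGMA = 500, 120  # SAT Critical Reading scores, approximately N(500, 120)

# Software route: ask directly for the score with cumulative proportion 0.90.
x_software = NormalDist(mu=MU, sigma=SIGMA).inv_cdf(0.90)


def unstandardize(z, mu, sigma):
    """Convert a z-score back to the original scale: x = mu + z * sigma."""
    return mu + z * sigma


# Table route: z = 1.28 is the Table A value with area closest to 0.9 to its left.
x_table = unstandardize(1.28, MU, SIGMA)

print(f"software: x = {x_software:.3f}")
print(f"table:    x = {x_table:.1f}")
```

The small gap between the two answers comes from reading z to only two decimal places in the table. Either way, a score of 654 or higher places a student in the top 10%.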

USE YOUR KNOWLEDGE

Question 1.99

1.99 What score is needed to be in the top 20%? Consider the NAEP scores, which are approximately Normal, N(288, 38). How high a score is needed to be in the top 20% of students who take this exam?

Question 1.100

1.100 Find the score that 75% of students will exceed. Consider the NAEP scores, which are approximately Normal, N(288, 38). Seventy-five percent of the students will score above x on this exam. Find x.

Normal quantile plots

The Normal distributions provide good descriptions of some distributions of real data, such as the Iowa Test vocabulary scores. The distributions of some other common variables are usually skewed and therefore distinctly non-Normal. Examples include economic variables such as personal income and gross sales of business firms, the survival times of cancer patients after treatment, and the service lifetime of mechanical or electronic components. While experience can suggest whether or not a Normal distribution is plausible in a particular case, it is risky to assume that a distribution is Normal without actually inspecting the data.

A histogram or stemplot can reveal distinctly non-Normal features of a distribution, such as outliers, pronounced skewness, or gaps and clusters. If the stemplot or histogram appears roughly symmetric and unimodal, however, we need a more sensitive way to judge the adequacy of a Normal model. The most useful tool for assessing Normality is another graph, the Normal quantile plot.

Here is the basic idea of a Normal quantile plot. The graphs produced by software use more sophisticated versions of this idea. It is not practical to make Normal quantile plots by hand.

  1. Arrange the observed data values from smallest to largest. Record what percentile of the data each value occupies. For example, the smallest observation in a set of 20 is at the 5% point, the second smallest is at the 10% point, and so on.

  2. Do Normal distribution calculations to find the values of z corresponding to these same percentiles. For example, z = −1.645 is the 5% point of the standard Normal distribution, and z = −1.282 is the 10% point. We call these values of z Normal scores.

  3. Plot each data point x against the corresponding Normal score. If the data distribution is close to any Normal distribution, the plotted points will lie close to a straight line.

Any Normal distribution produces a straight line on the plot because standardizing turns any Normal distribution into a standard Normal distribution. Standardizing is a linear transformation that can change the slope and intercept of the line in our plot but cannot turn a line into a curved pattern.
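The steps above can be sketched in code to show where the plotted points come from. This is an illustration of the basic idea only, not the algorithm any particular package uses. One assumption is labeled in the code: the i-th smallest of n observations is assigned the percentile (i − 0.5)/n, one common convention that avoids placing the largest observation at the 100% point, which has no finite Normal score. The data values are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # N(0, 1)


def normal_scores(n):
    """Normal scores for n ordered observations.

    Assumption: the i-th smallest value sits at percentile (i - 0.5) / n.
    Software packages may use slightly different conventions.
    """
    return [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


# The percentiles quoted in step 2.
z_5_percent = standard_normal.inv_cdf(0.05)
z_10_percent = standard_normal.inv_cdf(0.10)
print(f"5% point: {z_5_percent:.3f}, 10% point: {z_10_percent:.3f}")

# Hypothetical data, for illustration only.
data = [101, 96, 118, 109, 89, 104, 112, 99, 106, 93]

# Step 1: sort the data. Step 2: find the Normal scores. Step 3: pair them up.
points = list(zip(normal_scores(len(data)), sorted(data)))

for z, x in points:
    print(f"z = {z:6.3f}   x = {x}")
```

Plotting x against z for these pairs gives the Normal quantile plot. If the printed pairs rise at a roughly constant rate, the points will lie close to a straight line.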

USE OF NORMAL QUANTILE PLOTS

If the points on a Normal quantile plot lie close to a straight line, the plot indicates that the data are Normal. Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot. An optional line can be drawn on the plot that corresponds to the Normal distribution with mean equal to the mean of the data and standard deviation equal to the standard deviation of the data.
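As a rough sketch of how the optional line could be computed: a Normal variable with mean μ and standard deviation σ satisfies x = μ + σz, so the line has height x̄ + s·z at Normal score z. The helper name `reference_line` and the data values below are hypothetical.

```python
from statistics import mean, stdev


def reference_line(data, z):
    """Height of the reference line at Normal score z: mean + sd * z."""
    return mean(data) + stdev(data) * z


# Hypothetical data, for illustration only.
data = [98, 105, 87, 112, 101, 94, 120, 108]

# Two points are enough to draw the line across the usual z range.
left = reference_line(data, -3)
right = reference_line(data, 3)
print(f"line runs from (-3, {left:.1f}) to (3, {right:.1f})")
```

The slope of this line is the standard deviation of the data, and its height at z = 0 is the mean.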

Figures 1.30 and 1.31 are Normal quantile plots for data we have met earlier. The data x are plotted vertically against the corresponding standard Normal z-score plotted horizontally. The z-score scale generally extends from −3 to 3 because almost all of a standard Normal curve lies between these values. These figures show how Normal quantile plots behave.

EXAMPLE 1.46

IQ scores are approximately Normal. Figure 1.30 is a Normal quantile plot of the 60 fifth-grade IQ scores from Table 1.1 (page 14). The points lie very close to the straight line drawn on the plot. We conclude that the distribution of IQ data is approximately Normal.

IQ

Figure 1.30 Normal quantile plot of IQ scores, Example 1.46. This distribution is approximately Normal.

EXAMPLE 1.47

Times to start a business are skewed. Figure 1.31 is a Normal quantile plot of the data on times to start a business from Example 1.23. We have excluded Suriname, the outlier that you examined in Exercise 1.43 (page 29). The line drawn on the plot shows clearly that the plot of the data is curved. We conclude that these data are not Normally distributed. The shape of the curve is what we typically see with a distribution that is strongly skewed to the right.

TIME

Figure 1.31 Normal quantile plot of 188 times to start a business, with the outlier, Suriname, excluded, Example 1.47. This distribution is highly skewed.

Real data often show some departure from the theoretical Normal model. When you examine a Normal quantile plot, look for shapes that show clear departures from Normality. Don’t overreact to minor wiggles in the plot. When we discuss statistical methods that are based on the Normal model, we are interested in whether or not the data are sufficiently Normal for these procedures to work properly. We are not concerned about minor deviations from Normality. Many common methods work well as long as the data are approximately Normal and outliers are not present.

BEYOND THE BASICS

Density Estimation

A density curve gives a compact summary of the overall shape of a distribution. Many distributions do not have the Normal shape. There are other families of density curves that are used as mathematical models for various distribution shapes. Modern software offers more flexible options. A density estimator does not start with any specific shape, such as the Normal shape. It looks at the data and draws a density curve that describes the overall shape of the data. Density estimators join stemplots and histograms as useful graphical tools for exploratory data analysis.

Density estimates can capture other unusual features of a distribution. Here is an example.

EXAMPLE 1.48

StubHub! StubHub! is a website where fans can buy and sell tickets to sporting events. Ticket holders wanting to sell their tickets provide the location of their seats and the selling price. People wanting to buy tickets can choose from among the tickets offered for a given event.34

STUBHUB

Tickets for the 2015 NCAA women’s basketball tournament were available from StubHub! in a package deal that included the semifinal games and the championship game. On June 28, 2014, StubHub! listed 518 tickets for sale. A histogram of the distribution of ticket prices with a density estimate is given in Figure 1.32. The distribution has three peaks: one around $700, another around $2800, and the third around $4650. This is the identifying characteristic of a trimodal distribution. There appear to be three types of tickets. How would you name the three types?

Figure 1.32 Histogram of StubHub! price per seat for tickets to the 2015 NCAA Women’s Semifinal and Championship games, with a density estimate, Example 1.48.

Many distributions that we have met have a single peak, or mode. The distribution described in Example 1.48 has three modes and is called a trimodal distribution. A distribution that has two modes is called a bimodal distribution.

The previous example reminds us of a continuing theme for data analysis. We looked at a histogram and a density estimate and saw something interesting. This led us to speculation. Additional data on the type and location of the seats may explain more about the prices than we see in Figure 1.32.
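To make the idea of a density estimator concrete, here is a minimal sketch of one common kind, a kernel density estimate with Normal kernels. The text does not say which method produced Figure 1.32, so this is an illustration of the general idea rather than a reconstruction of that figure. The function name `kernel_density`, the bandwidth of 150, and the price values are all assumptions made up for the example; they are not the StubHub! data.

```python
from statistics import NormalDist


def kernel_density(data, bandwidth):
    """Return a function that estimates the density at any point x.

    Each observation contributes a small Normal curve centered on it;
    the estimate is the average of those curves, so its total area is 1.
    """
    kernels = [NormalDist(mu=value, sigma=bandwidth) for value in data]

    def density(x):
        return sum(k.pdf(x) for k in kernels) / len(kernels)

    return density


# Hypothetical prices with two clusters; NOT the StubHub! data.
prices = [650, 700, 720, 690, 710, 2750, 2800, 2900, 2820]
estimate = kernel_density(prices, bandwidth=150)  # bandwidth chosen arbitrarily

# Evaluate the curve on a grid; the estimate is highest near the clusters.
for x in range(0, 3501, 500):
    print(f"{x:5d}  {estimate(x):.6f}")
```

A smaller bandwidth follows the data more closely and shows more bumps; a larger one gives a smoother curve that can hide separate peaks.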

SECTION 1.4 SUMMARY

  • The overall pattern of a distribution can often be described compactly by a density curve. A density curve has total area 1 underneath it. Areas under a density curve give proportions of observations for the distribution.

  • The mean μ (balance point), the median (equal-areas point), and the quartiles can be approximately located by eye on a density curve. The standard deviation σ cannot be located by eye on most density curves. The mean and median are equal for symmetric density curves, but the mean of a skewed curve is located farther toward the long tail than is the median.

  • The Normal distributions are described by bell-shaped, symmetric, unimodal density curves. The mean μ and standard deviation σ completely specify the Normal distribution N(μ, σ). The mean is the center of symmetry, and σ is the distance from μ to the change-of-curvature points on either side. All Normal distributions satisfy the 68–95–99.7 rule.

  • To standardize any observation x, subtract the mean of the distribution and then divide by the standard deviation. The resulting z-score z = (x − μ)/σ says how many standard deviations x lies from the distribution mean. All Normal distributions are the same when measurements are transformed to the standardized scale.

  • If X has the N(μ, σ) distribution, then the standardized variable Z = (X − μ)/σ has the standard Normal distribution N(0, 1). Proportions for any Normal distribution can be calculated by software or from the standard Normal table (Table A), which gives the cumulative proportions of Z < z for many values of z.

  • The adequacy of a Normal model for describing a distribution of data is best assessed by a Normal quantile plot, which is available in most statistical software packages. A pattern on such a plot that deviates substantially from a straight line indicates that the data are not Normal.
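The standardizing and cumulative-proportion calculations in this summary can be sketched with Python's standard library in place of Table A. The helper name `z_score` and the values x = 65, μ = 50, σ = 10 are hypothetical, chosen only to illustrate the steps.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from mu."""
    return (x - mu) / sigma


std_normal = NormalDist(0, 1)

# Hypothetical example: x = 65 in an N(50, 10) distribution.
z = z_score(65, mu=50, sigma=10)
print(z)                  # 1.5
print(std_normal.cdf(z))  # cumulative proportion of Z < 1.5

# Areas within 1, 2, and 3 standard deviations of the mean,
# for comparison with the 68-95-99.7 rule.
for k in (1, 2, 3):
    area = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} sd: {area:.4f}")
```

The proportion below 65 in the N(50, 10) distribution equals the proportion below z = 1.5 in the standard Normal distribution, which is why one table serves every Normal distribution.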

SECTION 1.4 EXERCISES

For Exercises 1.93 and 1.94, see page 59; for Exercises 1.95 and 1.96, see page 60; for Exercises 1.97 and 1.98, see page 64; and for Exercises 1.99 and 1.100, see page 66.

Question 1.101

1.101 Means and medians.

  1. (a) Sketch a symmetric distribution that is not Normal. Mark the location of the mean and the median.

  2. (b) Sketch a distribution that is skewed to the left. Mark the location of the mean and the median.

Question 1.102

1.102 The effect of changing the standard deviation.

  1. (a) Sketch a Normal curve that has mean 30 and standard deviation 8.

  2. (b) On the same x axis, sketch a Normal curve that has mean 30 and standard deviation 12.

  3. (c) How does the Normal curve change when the standard deviation is varied but the mean stays the same?

Question 1.103

1.103 The effect of changing the mean.

  1. (a) Sketch a Normal curve that has mean 30 and standard deviation 8.

  2. (b) On the same x axis, sketch a Normal curve that has mean 40 and standard deviation 8.

  3. (c) How does the Normal curve change when the mean is varied but the standard deviation stays the same?

Question 1.104

1.104 NAEP music scores. In Exercise 1.93 (page 59) we examined the distribution of NAEP scores for the 12th-grade reading skills assessment. For eighth-grade students, the average music score is approximately Normal with mean 150 and standard deviation 35.

  1. (a) Sketch this Normal distribution.

  2. (b) Make a table that includes values of the scores corresponding to plus or minus one, two, and three standard deviations from the mean. Mark these points on your sketch along with the mean.

  3. (c) Apply the 68–95–99.7 rule to this distribution. Give the ranges of music score values that are within one, two, and three standard deviations of the mean.

Question 1.105

1.105 NAEP U.S. history scores. Refer to the previous exercise. The scores for 12th-grade students on the U.S. history assessment are approximately N(288, 32). Answer the questions in the previous exercise for this assessment.

Question 1.106

1.106 Standardize some NAEP music scores. The NAEP music assessment scores for eighth-grade students are approximately N(150,35). Find z-scores by standardizing the following scores: 150, 140, 100, 180, 230.

Question 1.107

1.107 Compute the percentile scores. Refer to the previous exercise. When scores such as the NAEP assessment scores are reported for individual students, the actual values of the scores are not particularly meaningful. Usually, they are transformed into percentile scores. The percentile score is the proportion of students who would score less than or equal to the score for the individual student. Compute the percentile scores for the five scores in the previous exercise. State whether you used software or Table A for these computations.

Question 1.108

1.108 Are the NAEP U.S. history scores approximately Normal? In Exercise 1.105, we assumed that the NAEP U.S. history scores for 12th-grade students are approximately Normal with the reported mean and standard deviation, N(288, 32). Let’s check that assumption. In addition to means and standard deviations, you can find selected percentiles for the NAEP assessments (see previous exercise). For the 12th-grade U.S. history scores, the following percentiles are reported:

Percentile Score
10% 246
25% 276
50% 290
75% 311
90% 328

Use these percentiles to assess whether or not the NAEP U.S. history scores for 12th-grade students are approximately Normal. Write a short report describing your methods and conclusions.

Question 1.109

1.109 Are the NAEP mathematics scores approximately Normal? Refer to the previous exercise. For the NAEP mathematics scores for 12th-graders, the mean is 153 and the standard deviation is 34. Here are the reported percentiles:

Percentile Score
10% 110
25% 130
50% 154
75% 177
90% 197

Is the N(153,34) distribution a good approximation for the NAEP mathematics scores? Write a short report describing your methods and conclusions.

Question 1.110

1.110 Do women talk more? Conventional wisdom suggests that women are more talkative than men. One study designed to examine this stereotype collected data on the speech of 42 women and 37 men in the United States.35

TALK

  1. (a) The mean number of words spoken per day by the women was 14,297 with a standard deviation of 6441. Use the 68–95–99.7 rule to describe this distribution.

  2. (b) Do you think that applying the rule in this situation is reasonable? Explain your answer.

  3. (c) The men averaged 14,060 words per day with a standard deviation of 9056. Answer the questions in parts (a) and (b) for the men.

  4. (d) Do you think that the data support the conventional wisdom? Explain your answer. Note that in Section 7.2 we will learn formal statistical methods to answer this type of question.

Question 1.111

1.111 Data from Mexico. Refer to the previous exercise. A similar study in Mexico was conducted with 31 women and 20 men. The women averaged 14,704 words per day with a standard deviation of 6215. For men the mean was 15,022 and the standard deviation was 7864.

TALKM

  1. (a) Answer the questions from the previous exercise for the Mexican study.

  2. (b) The means for both men and women are higher for the Mexican study than for the U.S. study. What conclusions can you draw from this observation?

Question 1.112

1.112 A uniform distribution. If you ask a computer to generate “random numbers” between 0 and 1, you will get observations from a uniform distribution. Figure 1.33 graphs the density curve for a uniform distribution. Use areas under this density curve to answer the following questions.

  1. (a) Why is the total area under this curve equal to 1?

  2. (b) What proportion of the observations lie above 0.44?

  3. (c) What proportion of the observations lie between 0.44 and 0.70?

Question 1.113

1.113 Use a different range for the uniform distribution. Many random number generators allow users to specify the range of the random numbers to be produced. Suppose that you specify that the outcomes are to be distributed uniformly between 0 and 4. Then the density curve of the outcomes has constant height between 0 and 4, and height 0 elsewhere.

  1. (a) What is the height of the density curve between 0 and 4? Draw a graph of the density curve.

  2. (b) Use your graph from part (a) and the fact that areas under the curve are proportions of outcomes to find the proportion of outcomes that are more than 1.

  3. (c) Find the proportion of outcomes that lie between 1.5 and 2.5.

Figure 1.33 The density curve of a uniform distribution, Exercise 1.112.

Figure 1.34 Three density curves, Exercise 1.115.

Question 1.114

1.114 Find the mean, the median, and the quartiles. What are the mean and the median of the uniform distribution in Figure 1.33? What are the quartiles?

Question 1.115

1.115 Three density curves. Figure 1.34 displays three density curves, each with three points marked on it. At which of these points on each curve do the mean and the median fall?

Question 1.116

1.116 Use the Normal Curve applet. Use the Normal Curve applet for the standard Normal distribution to say how many standard deviations above and below the mean the quartiles of any Normal distribution lie.

Question 1.117

1.117 Use the Normal Curve applet. The 68–95–99.7 rule for Normal distributions is a useful approximation. You can use the Normal Curve applet on the text website to see how accurate the rule is. Drag one flag across the other so that the applet shows the area under the curve between the two flags.

  1. (a) Place the flags one standard deviation on either side of the mean. What is the area between these two values? What does the 68–95–99.7 rule say this area is?

  2. (b) Repeat for locations two and three standard deviations on either side of the mean. Again compare the 68–95–99.7 rule with the area given by the applet.

Question 1.118

1.118 Find some proportions. Using either Table A or your calculator or software, find the proportion of observations from a standard Normal distribution that satisfies each of the following statements. In each case, sketch a standard Normal curve and shade the area under the curve that is the answer to the question.

  1. (a) Z > 1.75

  2. (b) Z < 1.75

  3. (c) Z > −0.80

  4. (d) −0.80 < Z < 1.75

Question 1.119

1.119 Find more proportions. Using either Table A or your calculator or software, find the proportion of observations from a standard Normal distribution for each of the following events. In each case, sketch a standard Normal curve and shade the area representing the proportion.

  1. (a) Z ≤ −1.4

  2. (b) Z ≥ −1.4

  3. (c) Z > 2.0

  4. (d) −1.4 < Z < 2.0

Question 1.120

1.120 Find some values of z. Find the value z of a standard Normal variable Z that satisfies each of the following conditions. (If you use Table A, report the value of z that comes closest to satisfying the condition.) In each case, sketch a standard Normal curve with your value of z marked on the axis.

  1. (a) 38% of the observations fall below z

  2. (b) 70% of the observations fall above z

Question 1.121

1.121 Find more values of z. The variable Z has a standard Normal distribution.

  1. (a) Find the number z that has cumulative proportion 0.88.

  2. (b) Find the number z such that the event Z > z has proportion 0.12.

Question 1.122

1.122 Find some values of z. The Wechsler Adult Intelligence Scale (WAIS) is the most common IQ test. The scale of scores is set separately for each age group, and the scores are approximately Normal with mean 100 and standard deviation 15. People with WAIS scores below 70 are considered developmentally disabled when, for example, applying for Social Security disability benefits. What percent of adults are developmentally disabled by this criterion?

Question 1.123

1.123 High IQ scores. The Wechsler Adult Intelligence Scale (WAIS) is the most common IQ test. The scale of scores is set separately for each age group, and the scores are approximately Normal with mean 100 and standard deviation 15. The organization MENSA, which calls itself “the high-IQ society,” requires a WAIS score of 130 or higher for membership. What percent of adults would qualify for membership?

There are two major tests of readiness for college, the ACT and the SAT. ACT scores are reported on a scale from 1 to 36. The distribution of ACT scores is approximately Normal with mean μ = 21.5 and standard deviation σ = 5.4. SAT scores are reported on a scale from 600 to 2400. The distribution of SAT scores is approximately Normal with mean μ = 1498 and standard deviation σ = 316. Exercises 1.124 through 1.133 are based on this information.

Question 1.124

1.124 Compare an SAT score with an ACT score. Jessica scores 1830 on the SAT. Ashley scores 27 on the ACT. Assuming that both tests measure the same thing, who has the higher score? Report the z-scores for both students.

Question 1.125

1.125 Make another comparison. Joshua scores 16 on the ACT. Anthony scores 1050 on the SAT. Assuming that both tests measure the same thing, who has the higher score? Report the z-scores for both students.

Question 1.126

1.126 Find the ACT equivalent. Jorge scores 2090 on the SAT. Assuming that both tests measure the same thing, what score on the ACT is equivalent to Jorge’s SAT score?

Question 1.127

1.127 Find the SAT equivalent. Alyssa scores 30 on the ACT. Assuming that both tests measure the same thing, what score on the SAT is equivalent to Alyssa’s ACT score?

Question 1.128

1.128 Find an SAT percentile. Reports on a student’s ACT or SAT results usually give the percentile as well as the actual score. The percentile is just the cumulative proportion stated as a percent: the percent of all scores that were lower than or equal to this one. Renee scores 2050 on the SAT. What is her percentile?

Question 1.129

1.129 Find an ACT percentile. Reports on a student’s ACT or SAT results usually give the percentile as well as the actual score. The percentile is just the cumulative proportion stated as a percent: the percent of all scores that were lower than or equal to this one. Joshua scores 19 on the ACT. What is his percentile?

Question 1.130

1.130 How high is the top 12%? What SAT scores make up the top 12% of all scores?

Question 1.131

1.131 How low is the bottom 12%? What SAT scores make up the bottom 12% of all scores?

Question 1.132

1.132 Find the ACT quintiles. The quintiles of any distribution are the values with cumulative proportions 0.20, 0.40, 0.60, and 0.80. What are the quintiles of the distribution of ACT scores?

Question 1.133

1.133 Find the SAT quartiles. The quartiles of any distribution are the values with cumulative proportions 0.25 and 0.75. What are the quartiles of the distribution of SAT scores?

Question 1.134

1.134 Do you have enough “good cholesterol?” High-density lipoprotein (HDL) is sometimes called the “good cholesterol” because low values are associated with a higher risk of heart disease. According to the American Heart Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL cholesterol.36 U.S. women aged 20 and over have a mean HDL of 55 mg/dl with a standard deviation of 15.5 mg/dl. Assume that the distribution is Normal.

  1. (a) What percent of women have low values of HDL (40 mg/dl or less)?

  2. (b) HDL levels of 60 mg/dl and higher are believed to protect people from heart disease. What percent of women have protective levels of HDL?

  3. (c) Women with more than 40 mg/dl but less than 60 mg/dl of HDL are in the intermediate range, neither very good nor very bad. What proportion are in this category?

Question 1.135

1.135 Men and HDL cholesterol. HDL cholesterol levels for men have a mean of 46 mg/dl with a standard deviation of 13.6 mg/dl. Answer the questions given in the previous exercise for the population of men.

Question 1.136

1.136 Diagnosing osteoporosis. Osteoporosis is a condition in which the bones become brittle due to loss of minerals. To diagnose osteoporosis, an elaborate apparatus measures bone mineral density (BMD). BMD is usually reported in standardized form. The standardization is based on a population of healthy young adults. The World Health Organization (WHO) criterion for osteoporosis is a BMD 2.5 standard deviations below the mean for young adults. BMD measurements in a population of people similar in age and sex roughly follow a Normal distribution.

  1. (a) What percent of healthy young adults have osteoporosis by the WHO criterion?

  2. (b) Women aged 70 to 79 are of course not young adults. The mean BMD in this age group is about −2 on the standard scale for young adults. Suppose that the standard deviation is the same as for young adults. What percent of this older population has osteoporosis?

Question 1.137

1.137 Deciles of Normal distributions. The deciles of any distribution are the 10th, 20th, . . ., 90th percentiles. The first and last deciles are the 10th and 90th percentiles, respectively.

  1. (a) What are the first and last deciles of the standard Normal distribution?

  2. (b) The weights of 9-ounce potato chip bags are approximately Normal with mean 9.12 ounces and standard deviation 0.15 ounce. What are the first and last deciles of this distribution?

Question 1.138

1.138 Quartiles for Normal distributions. The quartiles of any distribution are the values with cumulative proportions 0.25 and 0.75.

  1. (a) What are the quartiles of the standard Normal distribution?

  2. (b) Using your numerical values from part (a), write an equation that gives the quartiles of the N(μ, σ) distribution in terms of μ and σ.

Question 1.139

1.139 IQR for Normal distributions. Continue your work from the previous exercise. The interquartile range IQR is the distance between the first and third quartiles of a distribution.

  1. (a) What is the value of the IQR for the standard Normal distribution?

  2. (b) There is a constant c such that IQR = cσ for any Normal distribution N(μ, σ). What is the value of c?

Question 1.140

1.140 Outliers for Normal distributions. Continue your work from the previous two exercises. The percent of the observations that are suspected outliers according to the 1.5 × IQR rule is the same for any Normal distribution. What is this percent?

Question 1.141

1.141 Deciles of HDL cholesterol. The deciles of any distribution are the 10th, 20th, . . . , 90th percentiles. Refer to Exercise 1.134 where we assumed that the distribution of HDL cholesterol in U.S. women aged 20 and over is Normal with mean 55 mg/dl and standard deviation 15.5 mg/dl. Find the deciles for this distribution.

Question 1.142

1.142 Longleaf pine trees. Exercise 1.88 (page 50) gives the diameter at breast height (DBH) for 40 longleaf pine trees from the Wade Tract in Thomas County, Georgia. Make a Normal quantile plot for these data and write a short paragraph interpreting what it describes.

PINES

Question 1.143

1.143 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you used a stemplot to examine the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days. In Exercise 1.61 (page 47), you compared the stemplot, the histogram, and the boxplot as graphical summaries of this distribution.

KPOT40

  1. (a) Generate these three graphical summaries.

  2. (b) Make a Normal quantile plot and interpret it.

Question 1.144

1.144 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you used a stemplot to examine the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days. In Exercise 1.62 (page 47), you compared the stemplot, the histogram, and the boxplot as graphical summaries of this distribution.

KSUP40

  1. (a) Generate these three graphical summaries.

  2. (b) Make a Normal quantile plot and interpret it.

CHAPTER 1 EXERCISES

Question 1.145

1.145 Sources of energy consumed. Energy consumed in the United States can be classified as coming from one of three sources: fossil fuels, nuclear and electric power, and renewable energy. In 2014, the energy from these three sources was 80.3, 8.3, and 9.6 quadrillion Btu, respectively. In 2004, the corresponding amounts were 85.8, 8.2, and 6.1.37 Write a description of the changes from 2004 to 2014 expressed in these data. Illustrate your summary with appropriate graphical summaries. Be sure to discuss both the amounts of energy from each source as well as the percents.

Question 1.146

1.146 CO2 emissions in vehicles. Natural Resources Canada tests new vehicles each year and reports several variables related to fuel consumption for vehicles in different classes.38 For 2015, it provides data for 526 vehicles that use regular fuel. Two variables reported are carbon dioxide (CO2) emissions and highway fuel consumption. CO2 is measured in grams per kilometer (g/km), and highway fuel consumption is measured in liters per 100 kilometers (L/100 km). Use graphical and numerical summaries to describe the distribution of CO2 emissions for these vehicles. Be sure to justify your choice of summaries.

CANFREG

Question 1.147

1.147 Highway fuel consumption. Refer to the previous exercise. Use graphical and numerical summaries to describe the distribution of highway fuel consumption for these vehicles. Be sure to justify your choice of summaries.

CANFREG

Question 1.148

1.148 Jobs for business majors. What types of jobs are available for students who graduate with a business degree? The website careerbuilder.com lists job opportunities classified in a variety of ways. A recent posting had 25,120 jobs. The following table gives types of jobs and the numbers of postings listed under the classification “business administration” on a recent day:39

BUSJOBS

Type Number
Management 10916
Sales 5981
Information technology 4605
Customer service 4116
Marketing 3821
Finance 2339
Health care 2231
Accounting 2175
Human resources 1685

Describe these data using the methods you learned in this chapter, and write a short summary about jobs that are available for those who have a business degree. Include comments on the limitations that should be kept in mind when interpreting this particular set of data.

Question 1.149

1.149 Flopping in the 2014 World Cup. Soccer players are often accused of spending an excessive amount of time dramatically falling to the ground followed by other activities, suggesting that a possible injury is very serious. It has been suggested that these tactics are often designed to influence the call of a referee or to take extra time off the clock. Recordings of the first 32 games of the 2014 World Cup were analyzed, and there were 302 times when the referee interrupted the match because of a possible injury. The number of injuries and the total time, in minutes, spent flopping for each of the 32 teams that participated in these matches were recorded.40 Here are the data:

FLOPS

Country Injuries Time
Brazil 17 3.30
Chile 16 6.97
Honduras 15 7.67
Nigeria 15 6.42
Mexico 15 3.97
Costa Rica 13 3.80
USA 12 6.40
Ecuador 12 4.55
France 10 7.32
South Korea 10 4.52
Algeria 10 4.05
Iran 9 5.43
Russia 9 5.27
Ivory Coast 9 4.63
Croatia 9 4.32
Colombia 9 4.32
Uruguay 9 4.12
Greece 9 2.65
Cameroon 8 3.15
Germany 8 1.97
Spain 8 1.82
Belgium 7 3.38
Japan 7 2.08
Italy 7 1.60
Switzerland 7 1.35
England 7 3.13
Argentina 6 2.80
Ghana 6 1.85
Australia 6 1.83
Portugal 4 1.82
Netherlands 4 1.65
Bosnia and Herzegovina 2 0.40

Describe these data using the methods you learned in this chapter, and write a short summary about flopping in the 2014 World Cup based on your analysis.

Question 1.150

1.150 Twitter accounts. Twitter has more than 52,900,000 users in the United States. A study of Twitter accounts classified users by age. Here are the numbers of users (in millions) for six age groups:41

TWIT

Age Number
18–24 11.7
25–34 13.3
35–44 8.7
45–54 6.7
55–64 4.1
65 and over 2.7

Describe these data using the methods you learned in this chapter, and write a short summary about the age distribution of Twitter users based on your analysis.

Question 1.151

1.151 What graph would you use? What type of graph or graphs would you plan to make in a study of each of the following issues?

  1. (a) What makes of cars do students drive? How old are their cars?

  2. (b) How many hours per week do students study? How does the number of study hours change during a semester?

  3. (c) Which radio stations are most popular with students?

  4. (d) When many students measure the concentration of the same solution for a chemistry course laboratory assignment, do their measurements follow a Normal distribution?

Question 1.152

1.152 Canadian international trade. The government organization Statistics Canada provides data on many topics related to Canada’s population, resources, economy, society, and culture. Go to the web page statcan.gc.ca/start-debut-eng.html. Under the “Subject” tab, choose “International trade.” Pick some data from the resources listed and use the methods that you learned in this chapter to create graphical and numerical summaries. Write a report summarizing your findings that includes supporting evidence from your analyses.

Question 1.153

1.153 Travel and tourism in Canada. Refer to the previous exercise. Under the “Subject” tab, choose “Travel and tourism.” Pick some data from the resources listed and use the methods that you learned in this chapter to create graphical and numerical summaries. Write a report summarizing your findings that includes supporting evidence from your analyses.

Question 1.154

1.154 Leisure time for college students. You want to measure the amount of “leisure time” that college students enjoy. Write a brief discussion of two issues:

  1. (a) How will you define “leisure time”?

  2. (b) Once you have defined leisure time, how will you measure Sally’s leisure time this week?

Question 1.155

1.155 How much vitamin C do you need? The Food and Nutrition Board of the Institute of Medicine, working in cooperation with scientists from Canada, has used scientific data to answer this question for a variety of vitamins and minerals.42 Their methodology assumes that needs, or requirements, follow a distribution. They have produced guidelines called dietary reference intakes for different gender-by-age combinations. For vitamin C, there are three dietary reference intakes: the estimated average requirement (EAR), which is the mean of the requirement distribution; the recommended dietary allowance (RDA), which is the intake that would be sufficient for 97% to 98% of the population; and the tolerable upper level (UL), the intake that is unlikely to pose health risks. For women aged 19 to 30 years, the EAR is 60 milligrams per day (mg/d), the RDA is 75 mg/d, and the UL is 2000 mg/d.43

  (a) The researchers assumed that the distribution of requirements for vitamin C is Normal. The EAR gives the mean. From the definition of the RDA, let’s assume that its value is the 97.72 percentile. Use this information to determine the standard deviation of the requirement distribution.

  (b) Sketch the distribution of vitamin C requirements for 19- to 30-year-old women. Mark the EAR, the RDA, and the UL on your plot.
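The calculation asked for in part (a) can be sketched in a few lines of Python. This is only an illustration under the exercise's stated assumptions (a Normal requirement distribution with the RDA at the 97.72 percentile); the function name `sd_from_percentile` is made up for this sketch and is not part of the exercise.

```python
from statistics import NormalDist

def sd_from_percentile(mean, value, percentile):
    """Standard deviation of a Normal distribution with the given mean
    whose `percentile` (a proportion such as 0.9772) equals `value`."""
    # z-score of the percentile on the standard Normal scale
    z = NormalDist().inv_cdf(percentile)
    # value = mean + z * sd, so solve for sd
    return (value - mean) / z

# Values for women aged 19 to 30 from the exercise: EAR = 60 mg/d, RDA = 75 mg/d.
sd_women = sd_from_percentile(mean=60, value=75, percentile=0.9772)
print(round(sd_women, 2))
```

The same function applies to any other age and gender group once its EAR and RDA are substituted.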

Question 1.156

1.156 How much vitamin C do men need? Refer to the previous exercise. For men aged 19 to 30 years, the EAR is 75 milligrams per day (mg/d), the RDA is 90 mg/d, and the UL is 2000 mg/d. Answer the questions in the previous exercise for this population.

Question 1.157

1.157 How much vitamin C do women consume? To evaluate whether or not the intake of a vitamin or mineral is adequate, comparisons are made between the intake distribution and the requirement distribution. Here is some information about the distribution of vitamin C intake, in milligrams per day, for women aged 19 to 30 years:44

        Percentile (mg/d)
Mean    1st   5th   19th  25th  50th  75th  90th  95th  99th
84.1    31    42    48    61    79    102   126   142   179

  (a) Use the 5th, the 50th, and the 95th percentiles of this distribution to estimate the mean and standard deviation of this distribution assuming that the distribution is Normal. Explain your method for doing this.

  (b) Sketch your Normal intake distribution on the same graph with a sketch of the requirement distribution that you produced in part (b) of Exercise 1.155.

  (c) Do you think that many women aged 19 to 30 years are getting the amount of vitamin C that they need? Explain your answer.
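One possible approach to part (a) is sketched below in Python; it is not the only reasonable method. It assumes the distribution is Normal, takes the 50th percentile as the mean (mean and median coincide for a symmetric distribution), and uses the fact that the 5th and 95th percentiles lie the same number of standard deviations on either side of the mean. The function name `normal_from_percentiles` is hypothetical.

```python
from statistics import NormalDist

def normal_from_percentiles(p5, p50, p95):
    """Estimate (mean, sd) of a Normal distribution from three percentiles."""
    # z-score of the 95th percentile; the 5th percentile sits at -z95
    z95 = NormalDist().inv_cdf(0.95)
    mean = p50                      # median equals mean under Normality
    sd = (p95 - p5) / (2 * z95)     # p95 - p5 spans 2 * z95 standard deviations
    return mean, sd

# Percentiles for women aged 19 to 30 from the table in this exercise.
mean_w, sd_w = normal_from_percentiles(p5=42, p50=79, p95=142)
print(mean_w, round(sd_w, 1))
```

The midpoint of the 5th and 95th percentiles is another candidate for the mean; comparing it with the 50th percentile gives a rough check on how well the symmetry assumption holds.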

Question 1.158

1.158 How much vitamin C do men consume? To evaluate whether or not the intake of a vitamin or mineral is adequate, comparisons are made between the intake distribution and the requirement distribution. Here is some information about the distribution of vitamin C intake, in milligrams per day, for men aged 19 to 30 years:

        Percentile (mg/d)
Mean    1st   5th   19th  25th  50th  75th  90th  95th  99th
122.2   39    55    65    85    114   150   190   217   278

  (a) Use the 5th, the 50th, and the 95th percentiles of this distribution to estimate the mean and standard deviation of this distribution assuming that the distribution is Normal. Explain your method for doing this.

  (b) Sketch your Normal intake distribution on the same graph with a sketch of the requirement distribution that you produced in Exercise 1.156.

  (c) Do you think that many men aged 19 to 30 years in the United States are getting the amount of vitamin C that they need? Explain your answer.

Question 1.159

1.159 Time spent studying. Do women study more than men? We asked the students in a large first-year college class how many minutes they studied on a typical weeknight. Here are the responses of random samples of 30 women and 30 men from the class:

STUDY

Women                        Men
170  120  180  360  240      80   120  30   90   200
120  180  120  240  170      90   45   30   120  75
150  120  180  180  150      150  120  60   240  300
200  150  180  150  180      240  60   120  60   30
120  60   120  180  180      30   230  120  95   150
90   240  180  115  120      0    200  120  120  180

  (a) Examine the data. Why are you not surprised that most responses are multiples of 10 minutes? We eliminated one student who claimed to study 30,000 minutes per night. Are there any other responses that you consider suspicious?

  (b) Make a back-to-back stemplot of these data. Report the approximate midpoints of both groups. Does it appear that women study more than men (or at least claim that they do)?

  (c) Make side-by-side boxplots of these data. Compare the boxplots with the stemplot you made in part (b). Which do you prefer? Give reasons for your answer.
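The numbers behind the side-by-side boxplots in part (c) are the two five-number summaries. The Python sketch below computes them, assuming that the first five columns of the table hold the women's responses and the last five the men's, and that quartiles are taken as the medians of the lower and upper halves of the ordered data. Software packages often use slightly different quartile rules, so their output may not match exactly.

```python
from statistics import median

# Study times in minutes, read row by row from the table
# (assumption: first five columns are women, last five are men).
women = [170, 120, 180, 360, 240, 120, 180, 120, 240, 170,
         150, 120, 180, 180, 150, 200, 150, 180, 150, 180,
         120, 60, 120, 180, 180, 90, 240, 180, 115, 120]
men = [80, 120, 30, 90, 200, 90, 45, 30, 120, 75,
       150, 120, 60, 240, 300, 240, 60, 120, 60, 30,
       30, 230, 120, 95, 150, 0, 200, 120, 120, 180]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For an odd count, the overall median is left out of both halves.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

print("Women:", five_number_summary(women))
print("Men:  ", five_number_summary(men))
```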

Question 1.160

1.160 Product preference. Product preference depends in part on the age, income, and gender of the consumer. A market researcher selects a large sample of potential car buyers. For each consumer, she records gender, age, household income, and automobile preference. Which of these variables are categorical and which are quantitative?

Question 1.161

1.161 Two distributions. If two distributions have exactly the same mean and standard deviation, must their histograms have the same shape? If they have the same five-number summary, must their histograms have the same shape? Explain.

Question 1.162

1.162 Spam filters. A university department installed a spam filter on its computer system. During a 21-day period, 6693 messages were tagged as spam. How much spam you get depends on what your online habits are. Here are the counts for some students and faculty in this department (with log-in IDs changed, of course):

ID   Count    ID   Count    ID   Count    ID   Count
AA   1818     BB   1358     CC   442      DD   416
EE   399      FF   389      GG   304      HH   251
II   251      JJ   178      KK   158      LL   103

All other department members received fewer than 100 spam messages. How many did the others receive in total? Make a graph and comment on what you learn from these data.

SPAM
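The arithmetic for the first question in Exercise 1.162 can be sketched in Python as follows. The counts are copied from the table; the variable names are chosen for this illustration only.

```python
# Spam counts for the twelve listed log-in IDs, from the table in the exercise.
spam_counts = {
    "AA": 1818, "BB": 1358, "CC": 442, "DD": 416,
    "EE": 399, "FF": 389, "GG": 304, "HH": 251,
    "II": 251, "JJ": 178, "KK": 158, "LL": 103,
}
total_tagged = 6693  # messages tagged as spam during the 21-day period

listed_total = sum(spam_counts.values())
others_total = total_tagged - listed_total  # everyone not listed, combined

print("Listed IDs:", listed_total)
print("All others:", others_total)
```

Adding the combined count as an "other" category is one way to make sure a bar graph or Pareto chart of these counts accounts for all 6693 messages.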

Question 1.163

1.163 Phish. One of the most favored songs of the band Phish is “Divided Sky.” The band plays this song at many of their concerts. Frequently, after the main theme, Trey, the guitarist, pauses before playing the resolving note.45 The data file PHISH gives the date of each concert where “Divided Sky” was played, the venue, and the length of the pause for 366 concerts. Analyze the data and write a report summarizing what you have found. Be sure to include graphical and numerical summaries. Include the rationale for decisions that you made in performing your analysis. For example, did you give any consideration to the relatively large number of zeros?

PHISH

Question 1.164

1.164 Visits to a help room for statistics. A help room staffed by graduate students provides assistance to students taking statistics courses. To justify the cost of providing this service, extensive records are kept. Each time a student visits the help room, the student signs a sheet that records several variables. These include the date of the visit, the number of the course that the student is taking, the time the student arrived at the room, and the time the student left the room. The length of time that each student spent in the help room is computed from the two time variables. Data for 1268 visits are given in the file HELP.46 Analyze the data and write a report summarizing what you have found. Be sure to include graphical and numerical summaries. Include the rationale for the methods that you chose for your analysis. There are some missing course numbers. How did you handle these?

HELP
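As an illustration of how a visit length can be computed from the two time variables in Exercise 1.164, here is a minimal Python sketch. The layout of the HELP file is not described in the exercise, so the "HH:MM" 24-hour time format and the function name `visit_minutes` are assumptions made for this example; the sketch also assumes that a visit does not run past midnight.

```python
from datetime import datetime

def visit_minutes(arrived, left, fmt="%H:%M"):
    """Length of a visit in minutes, given arrival and departure times
    as strings in the (assumed) format `fmt`."""
    start = datetime.strptime(arrived, fmt)
    end = datetime.strptime(left, fmt)
    return (end - start).total_seconds() / 60

# Hypothetical visit: arrived at 14:05, left at 14:50.
print(visit_minutes("14:05", "14:50"))
```

A negative or implausibly long result from a calculation like this would be a signal to check the recorded times before summarizing the data.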

Question 1.165

1.165 Blueberries and anthocyanins. Anthocyanins are compounds that have been associated with health benefits for the heart, bones, and brain. Blueberries are a good source of many different anthocyanins. Researchers at the Piedmont Research Station of North Carolina State University have assembled a database giving the concentrations of 18 different anthocyanins for 267 varieties of blueberries.47 Four of the anthocyanins measured are delphinidin-3-arabinoside, malvidin-3-arabinoside, cyanidin-3-galactoside, and delphinidin-3-glucoside, all measured in units of mg/100g of berries. In the data file, we have simplified the names of these anthocyanins to Antho1, Antho2, Antho3, and Antho4. Figure 1.35 gives graphical and numeric summaries from JMP for Antho1. Use this output to write a summary of the distribution of Antho1 using the methods and ideas that you learned in this chapter.

BERRIES

Question 1.166

1.166 Blueberries and anthocyanins, Antho2. Refer to the previous exercise. Generate your own output for the analysis of Antho2 and use your output to write a summary of the distribution of Antho2 using the methods and ideas that you learned in this chapter.

BERRIES

Question 1.167

1.167 Blueberries and anthocyanins, Antho3. Refer to Exercise 1.165. Figure 1.36 gives the JMP output for Antho3. Use this output to write a summary of the distribution of Antho3 using the methods and ideas that you learned in this chapter.

BERRIES

Question 1.168

1.168 Blueberries and anthocyanins, Antho4. Refer to Exercise 1.165. Generate your own output for the analysis of Antho4 and use your output to write a summary of the distribution of Antho4 using the methods and ideas that you learned in this chapter.

BERRIES

Figure 1.35 JMP descriptive statistics for Antho1, Exercise 1.165.

Figure 1.36 JMP descriptive statistics for Antho3, Exercise 1.167.