For Exercises 1.43 and 1.44, see page 29; for Exercises 1.45, 1.46, and 1.47, see page 31; for Exercise 1.48, see page 33; for Exercises 1.49 and 1.50, see page 34; for Exercise 1.51, see page 35; for Exercise 1.52, see page 37; for Exercise 1.53, see page 39; for Exercise 1.54, see page 40; for Exercise 1.55, see page 40; and for Exercise 1.56, see page 45.
1.57 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days.
(a) Compute the mean for these data.
(b) Compute the median for these data.
(c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.57 (a) x̄ = 3208.44. (b) M = 3130.37. (c) Because the distribution is right-skewed with a potential outlier, the median is a better measure of center.
1.58 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days.
(a) Compute the mean for these data.
(b) Compute the median for these data.
(c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.59 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days.
(a) Compute the standard deviation for these data.
(b) Compute the quartiles for these data.
(c) Give the five-number summary and explain the meaning of each of the five numbers.
(d) Which numerical summaries do you prefer for describing the distribution, the mean and the standard deviation or the five-number summary? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.59 (a) s = 306.68. (b) Q1 = 3027.64, Q3 = 3286.95. (c) Min = 2664.38 (this is the smallest value), Q1 = 3027.64 (this value has 25% of the observations below it), M = 3130.37 (this is the middle observation, or has 50% of the observations below or above it), Q3 = 3286.95 (this value has 75% of the observations below it), Max = 4213.49 (this is the largest value). (d) The five-number summary would be better for this distribution because it is right-skewed with a potential outlier.
1.60 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days.
(a) Compute the standard deviation for these data.
(b) Compute the quartiles for these data.
(c) Give the five-number summary and explain the meaning of each of the five numbers.
(d) Which numerical summaries do you prefer for describing the distribution, the mean and the standard deviation or the five-number summary? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.61 Potassium from potatoes. Refer to Exercise 1.30 (page 24) where you examined the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days. In Exercise 1.30, you used a stemplot to examine the distribution of the potassium absorption.
(a) Make a histogram and use it to describe the distribution of potassium absorption.
(b) Make a boxplot and use it to describe the distribution of potassium absorption.
(c) Compare the stemplot, the histogram, and the boxplot as graphical summaries of this distribution. Which do you prefer? Give reasons for your answer.
1.61 (a) The distribution is right-skewed with a potential outlier. (b) The distribution is right-skewed. (c) Preference will vary. The only advantage of the stemplot is that it preserves the data; otherwise, the histogram is likely better. The boxplot is also fine but hides some of the details that the histogram shows.
1.62 Potassium from a supplement. Refer to Exercise 1.31 (page 24) where you examined the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days. In Exercise 1.31, you used a stemplot to examine the distribution of the potassium absorption.
(a) Make a histogram and use it to describe the distribution of potassium absorption.
(b) Make a boxplot and use it to describe the distribution of potassium absorption.
(c) Compare the stemplot, the histogram, and the boxplot as graphical summaries of this distribution. Which do you prefer? Give reasons for your answer.
1.63 Compare the potatoes with the supplement. Refer to Exercises 1.30 and 1.31 (page 24). Use a back-to-back stemplot to display the data for the two sources of potassium. Use the stemplot to compare the two distributions and write a short summary of your findings.
1.63 The KPOT values are right-skewed, whereas the KSUP values are fairly symmetric. The center for KSUP is higher than the center for KPOT. Also, the KPOT values are more spread out than the KSUP values.
1.64 Potassium sources. Refer to Exercises 1.30 and 1.31 (page 24). Use side-by-side boxplots to describe the distributions.
(a) Summarize what you see in the boxplots and compare it with what you saw in the stemplots.
(b) For comparing these two distributions, do you prefer back-to-back stemplots or side-by-side boxplots? Give reasons for your answer.
1.65 Gosset’s data on double stout sales. William Sealy Gosset worked at the Guinness Brewery in Dublin and made substantial contributions to the practice of statistics.23 In his work at the brewery, he collected and analyzed a great deal of data. Archives with Gosset’s handwritten tables, graphs, and notes have been preserved at the Guinness Storehouse in Dublin.24 In one study, Gosset examined the change in the double stout market before and after World War I (1914–1918). For various regions in England and Scotland, he calculated the ratio of sales in 1925, after the war, as a percent of sales in 1913, before the war. Here are the data:
Bristol | 94 | Glasgow | 66 |
Cardiff | 112 | Liverpool | 140 |
English Agents | 78 | London | 428 |
English O | 68 | Manchester | 190 |
English P | 46 | Newcastle-on-Tyne | 118 |
English R | 111 | Scottish | 24 |
(a) Compute the mean for these data.
(b) Compute the median for these data.
(c) Which measure do you prefer for describing the center of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.65 (a) x̄ = 122.9. (b) M = 102.5. (c) The data set is right-skewed with an outlier, so the median is a better measure of center.
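The values in the answer to Exercise 1.65 can be checked with a few lines of Python. This is only a sketch using the standard library; the dictionary name `stout` is an illustrative choice, and the figures are copied from the table in the exercise.

```python
from statistics import mean, median

# Sales in 1925 as a percent of 1913 sales, from the table in Exercise 1.65.
stout = {
    "Bristol": 94, "Cardiff": 112, "English Agents": 78, "English O": 68,
    "English P": 46, "English R": 111, "Glasgow": 66, "Liverpool": 140,
    "London": 428, "Manchester": 190, "Newcastle-on-Tyne": 118, "Scottish": 24,
}

values = sorted(stout.values())
stout_mean = mean(values)      # sum of 1475 divided by n = 12
stout_median = median(values)  # average of the 6th and 7th sorted values

print(round(stout_mean, 1), stout_median)  # 122.9 102.5
```

London's value of 428 pulls the mean well above the median, which is why the median is the safer description of center here.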
1.66 Measures of spread for the double stout data. Refer to the previous exercise.
(a) Compute the standard deviation for these data.
(b) Compute the quartiles for these data.
(c) Which measure do you prefer for describing the spread of this distribution? Explain your answer. (You may include a graphical summary as part of your explanation.)
1.67 Are there outliers in the double stout data? Refer to the previous two exercises.
(a) Find the IQR for these data.
(b) Use the 1.5 × IQR rule to identify and name any outliers.
(c) Make a boxplot for these data and describe the distribution using only the information in the boxplot.
(d) Make a modified boxplot for these data and describe the distribution using only the information in the boxplot.
(e) Make a stemplot for these data.
(f) Compare the boxplot, the modified boxplot, and the stemplot. Evaluate the advantages and disadvantages of each graphical summary for describing the distribution of the double stout data.
1.67 (a) IQR = 62. (b) Outliers are below −26 or above 222. London is confirmed as an outlier. (c) The first three quarters are about equal in length, and the last is extremely long. (d) The main part of the distribution is relatively symmetric; there is one extreme high outlier. The minimum is about 25, the first quartile is about 70, the median is about 100, and the third quartile is about 125. There is a gap in the data from roughly 200 to about 425.
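The 1.5 × IQR rule in Exercise 1.67 can be sketched as follows. The helper `quartiles` is a hypothetical name; it assumes the convention of taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data, which reproduces the IQR of 62 given in the answer. Statistical software often uses other quartile conventions and may give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the
    sorted data; when n is odd the overall median is left out of both halves."""
    xs = sorted(data)
    half = len(xs) // 2
    return median(xs[:half]), median(xs[-half:])

# Double stout data from Exercise 1.65.
stout = {
    "Bristol": 94, "Cardiff": 112, "English Agents": 78, "English O": 68,
    "English P": 46, "English R": 111, "Glasgow": 66, "Liverpool": 140,
    "London": 428, "Manchester": 190, "Newcastle-on-Tyne": 118, "Scottish": 24,
}

q1, q3 = quartiles(stout.values())
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # values below this are suspected outliers
high_fence = q3 + 1.5 * iqr   # values above this are suspected outliers
outliers = {k: v for k, v in stout.items() if v < low_fence or v > high_fence}

print(q1, q3, iqr, low_fence, high_fence, outliers)
```

With these data the fences are −26 and 222, so London is the only region flagged.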
1.68 Smolts. Smolts are young salmon at a stage when their skin becomes covered with silvery scales and they start to migrate from freshwater to the sea. The reflectance of a light shined on a smolt’s skin is a measure of the smolt’s readiness for the migration. Here are the reflectances, in percents, for a sample of 50 smolts:25
57.6 | 54.8 | 63.4 | 57.0 | 54.7 | 42.3 | 63.6 | 55.5 | 33.5 | 63.3 |
58.3 | 42.1 | 56.1 | 47.8 | 56.1 | 55.9 | 38.8 | 49.7 | 42.3 | 45.6 |
69.0 | 50.4 | 53.0 | 38.3 | 60.4 | 49.3 | 42.8 | 44.5 | 46.4 | 44.3 |
58.9 | 42.1 | 47.6 | 47.9 | 69.2 | 46.6 | 68.1 | 42.8 | 45.6 | 47.3 |
59.6 | 37.8 | 53.9 | 43.2 | 51.4 | 64.5 | 43.8 | 42.7 | 50.9 | 43.8 |
(a) Find the mean reflectance for these smolts.
(b) Find the median reflectance for these smolts.
(c) Do you prefer the mean or the median as a measure of center for these data? Give reasons for your preference.
1.69 Measures of spread for smolts. Refer to the previous exercise.
(a) Find the standard deviation of the reflectance for these smolts.
(b) Find the quartiles of the reflectance for these smolts.
(c) Do you prefer the standard deviation or the quartiles as a measure of spread for these data? Give reasons for your preference.
1.69 (a) s = 8.80. (b) With n = 50, the positions of Q1 and Q3 will be at 13 and 38. We find Q1 = 43.79 and Q3 = 57.02.
1.70 Are there outliers in the smolt data? Refer to the previous two exercises.
(a) Find the IQR for the smolt data.
(b) Use the 1.5 × IQR rule to identify any outliers.
(c) Make a boxplot for the smolt data and describe the distribution using only the information in the boxplot.
(d) Make a modified boxplot for these data and describe the distribution using only the information in the boxplot.
(e) Make a stemplot for these data.
(f) Compare the boxplot, the modified boxplot, and the stemplot. Evaluate the advantages and disadvantages of each graphical summary for describing the distribution of the smolt reflectance data.
1.71 Potatoes. A quality product is one that is consistent and has very little variability in its characteristics. Controlling variability can be more difficult with agricultural products than with those that are manufactured. The following table gives the weights, in ounces, of the 25 potatoes sold in a 10-pound bag:
7.6 | 7.9 | 8.0 | 6.9 | 6.7 | 7.9 | 7.9 | 7.9 | 7.6 | 7.8 | 7.0 | 4.7 | 7.6 |
6.3 | 4.7 | 4.7 | 4.7 | 6.3 | 6.0 | 5.3 | 4.3 | 7.9 | 5.2 | 6.0 | 3.7 |
(a) Summarize the data graphically and numerically. Give reasons for the methods you chose to use in your summaries.
(b) Do you think that your numerical summaries do an effective job of describing these data? Why or why not?
(c) There appear to be two distinct clusters of weights for these potatoes. Divide the sample into two subsamples based on the clustering. Give the mean and standard deviation for each subsample. Do you think that this way of summarizing these data is better than a numerical summary that uses all the data as a single sample? Give a reason for your answer.
1.71 (a) Because weight is quantitative and has a decent number of observations (n = 25), a histogram is a good choice. Mean and standard deviation are a good starting point for numerical summaries. (b) Now that we see the distribution is left-skewed, the mean and standard deviation were not a good choice. The median and quartiles would have been better.
1.72 The alcohol content of beer. Brewing beer involves a variety of steps that can affect the alcohol content. A website gives the percent alcohol for 159 domestic brands of beer.26
(a) Use graphical and numerical summaries of your choice to describe the data. Give reasons for your choice.
(b) The data set contains an outlier. Explain why this particular beer is unusual.
(c) For the outlier, give a short description of how you think this particular beer should be marketed.
1.73 Outlier for alcohol content of beer. Refer to the previous exercise.
(a) Calculate the mean with and without the outlier. Do the same for the median. Explain how these values change when the outlier is excluded.
(b) Calculate the standard deviation with and without the outlier. Do the same for the quartiles. Explain how these values change when the outlier is excluded.
(c) Write a short paragraph summarizing what you have learned in this exercise.
1.73 (a) With the outlier: x̄ = 5.235, M = 4.90. Without the outlier: x̄ = 5.265, M = 4.905. The values are nearly identical with and without the outlier. (b) With the outlier: s = 1.406, Q1 = 4.40, Q3 = 5.60. Without the outlier: s = 1.356, Q1 = 4.430, Q3 = 5.600. The values are nearly identical with and without the outlier. (c) Even though there is one outlier, its removal changes the numerical summaries very little. This is partly due to the large sample and partly due to the fact that this outlier is not too far from the other observations, so removing it doesn’t have a huge effect on the analysis.
1.74 Calories in beer. Refer to the previous two exercises. The data set also lists calories per 12 ounces of beverage.
(a) Analyze the data and summarize the distribution of calories for these 159 brands of beer.
(b) In the previous exercise, you identified one brand of beer as an outlier. To what extent is this brand an outlier in the distribution of calories? Explain your answer.
(c) Does the distribution of calories suggest marketing strategies for this brand of beer? Describe some marketing strategies.
1.75 Median versus mean for net worth. A report on the assets of American households says that the median net worth of U.S. families is $81,200. The mean net worth of these families is $534,600.27 What explains the difference between these two measures of center?
1.75 Some people, such as celebrities and business executives (Bill Gates of Microsoft, Warren Buffett, Oprah Winfrey, etc.), have very large net worths. These pull the mean upward, making it much larger than the median.
1.76 Create a data set. Create a data set with seven observations for which the median would change by a large amount if the smallest observation were deleted.
1.77 Mean versus median. A small accounting firm pays each of its seven clerks $55,000, three junior accountants $80,000 each, and the firm’s owner $650,000. What is the mean salary paid at this firm? How many of the employees earn less than the mean? What is the median salary?
1.77 The mean is $115,909.09. Ten of the employees make less than the mean. M = $55,000.
1.78 Be careful about how you treat the zeros. In computing the median income of any group, some federal agencies omit all members of the group who had no income. Give an example to show that the reported median income of a group can go down even though the group becomes economically better off. Is this also true of the mean income?
1.79 How does the median change? The firm in Exercise 1.77 gives no raises to the clerks and junior accountants, while the owner’s take increases to $900,000. How does this change affect the mean? How does it affect the median?
1.79 The median doesn’t change, but the mean increases to $138,636.36.
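The salary figures in Exercises 1.77 and 1.79 illustrate how a single extreme value moves the mean but not the median. A minimal sketch (the function name `salaries` is hypothetical):

```python
from statistics import mean, median

def salaries(owner_pay):
    """Seven clerks, three junior accountants, and the owner."""
    return [55_000] * 7 + [80_000] * 3 + [owner_pay]

before = salaries(650_000)   # Exercise 1.77
after = salaries(900_000)    # Exercise 1.79

mean_before, mean_after = mean(before), mean(after)
median_before, median_after = median(before), median(after)
below_mean = sum(1 for pay in before if pay < mean_before)

print(round(mean_before, 2), median_before, below_mean)
print(round(mean_after, 2), median_after)
```

The median is the sixth of the eleven sorted salaries, a clerk's pay, so it is unaffected by anything that happens to the owner's take.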
1.80 Metabolic rates. Calculate the mean and standard deviation of the metabolic rates in Example 1.32 (page 38), showing each step in detail. First find the mean x̄ by summing the seven observations and dividing by 7. Then find each of the deviations xᵢ − x̄ and their squares. Check that the deviations have sum 0. Calculate the variance as an average of the squared deviations (remember to divide by n − 1). Finally, obtain s as the square root of the variance.
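The steps listed in Exercise 1.80 correspond to the usual formulas for the sample mean, variance, and standard deviation. They are written here in general form for n observations; the data themselves are in Example 1.32 and are not repeated.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sum_{i=1}^{n} (x_i - \bar{x}) = 0, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
```

For the seven observations in the exercise, n = 7 and the divisor in the variance is n − 1 = 6.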
1.81 Earthquakes. Each year there are about 900,000 earthquakes of magnitude 2.5 or less that are usually not felt. In contrast, there are about 10 of magnitude 7.0 that cause serious damage.28 Explain why the average magnitude of earthquakes is not a good measure of their impact.
1.81 The average would be 2.5 or less (an earthquake that isn’t usually felt). These do little or no damage.
1.82 IQ scores. Many standard statistical methods that you will study in Part II of this book are intended for use with distributions that are symmetric and have no outliers. These methods start with the mean and standard deviation, x̄ and s. For example, standard methods would typically be used for the IQ and GPA data in Table 1.3 (page 26).
(a) Find x̄ and s for the IQ data. In large populations, IQ scores are standardized to have mean 100 and standard deviation 15. In what way does the distribution of IQ among these students differ from the overall population?
(b) Find the median IQ score. It is, as we expect, close to the mean.
(c) Find the mean and median for the GPA data. The two measures of center differ a bit. What feature of the data (see your stemplot in Exercise 1.39 or make a new stemplot) explains the difference?
1.83 Mean and median for two observations. The Mean and Median applet allows you to place observations on a line and see their mean and median visually. Place two observations on the line by clicking below it. Why does only one arrow appear?
1.83 For n = 2, the median is also the average of the two values.
1.84 Mean and median for four observations. In the Mean and Median applet, place four observations on the line by clicking below it, three close together near the center of the line and one somewhat to the right of these three.
(a) Pull the single rightmost observation out to the right. (Place the cursor on the point, hold down a mouse button, and drag the point.) How does the mean behave? How does the median behave? Explain briefly why each measure acts as it does.
(b) Now drag the rightmost point to the left as far as you can. What happens to the mean? What happens to the median as you drag this point past the other three (watch carefully)?
1.85 Mean and median for seven observations. Place seven observations on the line in the Mean and Median applet by clicking below it.
(a) Add one additional observation without changing the median. Where is your new point?
(b) Use the applet to convince yourself that when you add yet another observation (there are now nine in all), the median does not change no matter where you put the ninth point. Explain why this must be true.
1.85 (a) The median of seven (sorted) points is the fourth, while the median of eight points is the average of the fourth and fifth. If these are to be the same, the added point must be equal to the fourth point of the original seven, so that the fourth and fifth points are now the same. (b) Regardless of the configuration of the first seven points, if the eighth point is added so as to leave the median unchanged, then in that (sorted) set of eight, the fourth and fifth points must be the same. Once we add a ninth point, one of these two points will be the new middle (fifth) point, so the median will not change.
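The argument in the answer to Exercise 1.85 can be tried numerically without the applet. The seven starting values below are hypothetical, chosen only for illustration; the eighth point is placed on the current median as in part (a), and several positions are tried for the ninth.

```python
from statistics import median

# Hypothetical starting configuration of seven points.
seven = [1, 3, 4, 6, 8, 9, 12]
m7 = median(seven)               # the fourth sorted value

# Part (a): add an eighth point equal to the current median.
eight = seven + [m7]
m8 = median(eight)               # average of the fourth and fifth sorted values

# Part (b): add a ninth point at several very different positions.
ninth_choices = [-100, 0, 6, 7, 1000]
medians9 = [median(eight + [x]) for x in ninth_choices]

print(m7, m8, medians9)
```

Every entry of `medians9` equals the original median, because the fourth and fifth of the eight sorted points are equal and one of them always ends up in the fifth (middle) position of the nine.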
1.86 Imputation. Various problems with data collection can cause some observations to be missing. Suppose a data set has 20 cases. Here are the values of the variable x for 10 of these cases:
17 | 6 | 12 | 14 | 20 | 23 | 9 | 12 | 16 | 21 |
The values for the other 10 cases are missing. One way to deal with missing data is called imputation. The basic idea is that missing values are replaced, or imputed, with values that are based on an analysis of the data that are not missing. For a data set with a single variable, the usual choice of a value for imputation is the mean of the values that are not missing. The mean for this data set is 15.
(a) Verify that the mean is 15 and find the standard deviation for the 10 cases for which x is not missing.
(b) Create a new data set with 20 cases by setting the values for the 10 missing cases to 15. Compute the mean and standard deviation for this data set.
(c) Summarize what you have learned about the possible effects of this type of imputation on the mean and the standard deviation.
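The effect of mean imputation described in Exercise 1.86 can be explored with a short sketch. It uses the ten observed values from the exercise; the variable names are illustrative.

```python
from statistics import mean, stdev

# The ten non-missing values of x from Exercise 1.86.
observed = [17, 6, 12, 14, 20, 23, 9, 12, 16, 21]

fill = mean(observed)               # value used to replace each missing case
imputed = observed + [fill] * 10    # 20 cases after mean imputation

mean_obs, sd_obs = mean(observed), stdev(observed)
mean_imp, sd_imp = mean(imputed), stdev(imputed)

print(mean_obs, round(sd_obs, 2))
print(mean_imp, round(sd_imp, 2))
```

The imputed cases contribute nothing to the sum of squared deviations, while the divisor n − 1 grows from 9 to 19, so the mean stays the same and the standard deviation shrinks.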
1.87 A standard deviation contest. This is a standard deviation contest. You must choose four numbers from the whole numbers 10 to 20, with repeats allowed.
(a) Choose four numbers that have the smallest possible standard deviation.
(b) Choose four numbers that have the largest possible standard deviation.
(c) Is more than one choice possible in either part (a) or part (b)? Explain.
1.87 (a) Picking the same number for all four observations results in a standard deviation of 0. (b) Picking 10, 10, 20, and 20 results in the largest standard deviation = 5.77. (c) For part (a), you may pick any number as long as all observations are the same. For part (b), only one choice provides the largest standard deviation.
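The answer to Exercise 1.87 can be confirmed by brute force, since there are only 1001 ways to choose four whole numbers from 10 to 20 when order is ignored and repeats are allowed. A minimal sketch using the standard library:

```python
from itertools import combinations_with_replacement
from statistics import stdev

# Every unordered choice of four whole numbers from 10 to 20, repeats allowed.
choices = list(combinations_with_replacement(range(10, 21), 4))
spreads = {c: stdev(c) for c in choices}

smallest = min(spreads.values())
largest = max(spreads.values())
min_sets = [c for c, s in spreads.items() if s == smallest]
max_sets = [c for c, s in spreads.items() if s == largest]

print(smallest, len(min_sets))       # eleven ways to get s = 0
print(round(largest, 2), max_sets)   # a single set attains the maximum
```

The search agrees with part (c): any four equal numbers give s = 0, but only 10, 10, 20, 20 gives the largest value.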
1.88 Longleaf pine trees. The Wade Tract in Thomas County, Georgia, is an old-growth forest of longleaf pine trees (Pinus palustris) that has survived in a relatively undisturbed state since before the settlement of the area by Europeans. A study collected data on 584 of these trees.29 One of the variables measured was the diameter at breast height (DBH). This is the diameter of the tree at 4.5 feet and the units are centimeters (cm). Only trees with DBH greater than 1.5 cm were sampled. Here are the diameters of a random sample of 40 of these trees:
10.5 | 13.3 | 26.0 | 18.3 | 52.2 | 9.2 | 26.1 | 17.6 | 40.5 | 31.8 |
47.2 | 11.4 | 2.7 | 69.3 | 44.4 | 16.9 | 35.7 | 5.4 | 44.2 | 2.2 |
4.3 | 7.8 | 38.1 | 2.2 | 11.4 | 51.5 | 4.9 | 39.7 | 32.6 | 51.8 |
43.6 | 2.3 | 44.6 | 31.5 | 40.3 | 22.3 | 43.3 | 37.5 | 29.1 | 27.9 |
(a) Find the five-number summary for these data.
(b) Make a boxplot.
(c) Make a histogram.
(d) Write a short summary of the major features of this distribution. Do you prefer the boxplot or the histogram for these data?
1.89 Weight gain. A study of diet and weight gain deliberately overfed 15 volunteers for eight weeks. The mean increase in fat was x̄ = 2.41 kilograms, and the standard deviation was s = 1.25 kilograms. What are x̄ and s in pounds? (A kilogram is 2.2 pounds.)
1.89 x̄ = 5.302 pounds and s = 2.75 pounds.
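The conversion in Exercise 1.89 follows from how the mean and standard deviation respond when every observation is multiplied by a constant b. The subscripts "new" and "lb" below are notation introduced here for illustration.

```latex
\begin{aligned}
x_{\text{new}} = b\,x \;&\Longrightarrow\; \bar{x}_{\text{new}} = b\,\bar{x}, \qquad s_{\text{new}} = |b|\,s \\
\bar{x}_{\text{lb}} &= 2.2 \times 2.41 = 5.302, \qquad s_{\text{lb}} = 2.2 \times 1.25 = 2.75
\end{aligned}
```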
1.90 Changing units from inches to centimeters. Changing the unit of length from inches to centimeters multiplies each length by 2.54 because there are 2.54 centimeters in an inch. This change of units multiplies our usual measures of spread by 2.54. This is true of IQR and the standard deviation. What happens to the variance when we change units in this way?
1.91 A different type of mean. The trimmed mean is a measure of center that is more resistant than the mean but uses more of the available information than the median. To compute the 10% trimmed mean, discard the highest 10% and the lowest 10% of the observations and compute the mean of the remaining 80%. Trimming eliminates the effect of a small number of outliers. Compute the 10% trimmed mean of the service time data in Table 1.2 (page 17). Then compute the 20% trimmed mean. Compare the values of these measures with the median and the ordinary untrimmed mean.
1.91 Full data set: x̄ = 196.575 and M = 103.5 minutes. The 10% and 20% trimmed means are x̄ = 127.734 and x̄ = 111.917 minutes. While still larger than the median of the original data set, they are much closer to the median than the ordinary untrimmed mean.
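The trimmed mean described in Exercise 1.91 can be sketched as a small function. The name `trimmed_mean` and the sample below are hypothetical (the service times in Table 1.2 are not reproduced here), and the sketch assumes the number of observations trimmed from each end is rounded down to a whole number.

```python
from statistics import mean, median

def trimmed_mean(data, percent):
    """Mean after discarding the lowest and highest `percent`% of the data.

    Assumption: the count trimmed from each end is rounded down.
    """
    xs = sorted(data)
    k = len(xs) * percent // 100      # observations to drop from each end
    return mean(xs[k:len(xs) - k])

# Hypothetical sample with one large outlier (not the Table 1.2 data).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

full_mean = mean(sample)
trim10 = trimmed_mean(sample, 10)
trim20 = trimmed_mean(sample, 20)
sample_median = median(sample)

print(full_mean, trim10, trim20, sample_median)  # 14.5 5.5 5.5 5.5
```

As in the answer above, trimming removes the outlier's pull and brings the result much closer to the median than the untrimmed mean.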
1.92 Changing units from centimeters to inches. Refer to Exercise 1.88 (page 50). Change the measurements from centimeters to inches by multiplying each value by 0.39. Answer the questions from that exercise and explain the effect of the transformation on these data.