How do retail companies that fail differ from those that succeed? An accounting professor compares two samples of retail companies: one sample of failed retail companies and one of retail companies that are still active. Which of two incentive packages will lead to higher use of a bank’s credit cards? The bank designs an experiment where credit card customers are assigned at random to receive one or the other incentive package. Two-sample problems such as these are among the most common situations encountered in statistical practice.
Two-Sample Problems
You must carefully distinguish two-sample problems from the matched pairs designs studied earlier. In two-sample problems, there is no matching of the units in the two samples, and the two samples may be of different sizes. As a result, inference procedures for two-sample data differ from those for matched pairs.
We can present two-sample data graphically with a back-to-back stemplot for small samples (page 17) or with side-by-side boxplots for larger samples (page 29). Now we will apply the ideas of formal inference in this setting. When both population distributions are symmetric, and especially when they are at least approximately Normal, a comparison of the mean responses in the two populations is most often the goal of inference.
We have two independent samples, from two distinct populations (such as failed companies and active companies). We measure the same quantitative response variable (such as the cash flow margin) in both samples. We will call the variable \(x_1\) in the first population and \(x_2\) in the second because the variable may have different distributions in the two populations. Here is the notation that we will use to describe the two populations:
Population | Variable | Mean | Standard deviation |
---|---|---|---|
1 | \(x_1\) | \(\mu_1\) | \(\sigma_1\) |
2 | \(x_2\) | \(\mu_2\) | \(\sigma_2\) |
We want to compare the two population means, either by giving a confidence interval for \(\mu_1 - \mu_2\) or by testing the hypothesis of no difference, \(H_0: \mu_1 = \mu_2\). We base inference on two independent SRSs, one from each population. Here is the notation that describes the samples:
Population | Sample size | Sample mean | Sample standard deviation |
---|---|---|---|
1 | \(n_1\) | \(\bar{x}_1\) | \(s_1\) |
2 | \(n_2\) | \(\bar{x}_2\) | \(s_2\) |
Throughout this section, the subscripts 1 and 2 show the population to which a parameter or a sample statistic refers.
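The two-sample \(z\) statistic
The natural estimator of the difference \(\mu_1 - \mu_2\) is the difference between the sample means, \(\bar{x}_1 - \bar{x}_2\). If we are to base inference on this statistic, we must know its sampling distribution. Here are some facts:
The mean of the difference \(\bar{x}_1 - \bar{x}_2\) is the difference of the population means, \(\mu_1 - \mu_2\).
Reminder: rules for means, p. 226
The variance of the difference \(\bar{x}_1 - \bar{x}_2\) is
\[ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \]
Because the samples are independent, their sample means \(\bar{x}_1\) and \(\bar{x}_2\) are independent random variables. The addition rule for variances says that the variance of the difference of two independent random variables is the sum of their variances.
Reminder: rules for variances, p. 231
If the two population distributions are both Normal, then the distribution of \(\bar{x}_1 - \bar{x}_2\) is also Normal.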
Because any Normal random variable has the \(N(0, 1)\) distribution when standardized, we have arrived at a new \(z\) statistic. The two-sample \(z\) statistic
\[ z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
has the standard Normal \(N(0, 1)\) sampling distribution and would be used in inference when the two population standard deviations \(\sigma_1\) and \(\sigma_2\) are known.
In practice, however, \(\sigma_1\) and \(\sigma_2\) are not known. We estimate them by the sample standard deviations \(s_1\) and \(s_2\) from our two samples. Following the pattern of the one-sample case, we substitute the standard errors for the standard deviations in the two-sample \(z\) statistic. The result is the two-sample \(t\) statistic:
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]
Unfortunately, this statistic does not have a \(t\) distribution. A \(t\) distribution replaces an \(N(0, 1)\) distribution only when a single standard deviation (\(\sigma\)) is replaced by an estimate (\(s\)). In this case, we replaced two standard deviations (\(\sigma_1\) and \(\sigma_2\)) by their estimates (\(s_1\) and \(s_2\)).
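Nonetheless, we can approximate the distribution of the two-sample \(t\) statistic by using the \(t(k)\) distribution with an approximation for the degrees of freedom \(k\). We use these approximations to find approximate values of \(t^*\) for confidence intervals and to find approximate \(P\)-values for significance tests. There are two procedures used in practice:
1. Use the Satterthwaite approximation to calculate a value of \(k\) from the data. In general, this \(k\) is not a whole number.
2. Use degrees of freedom \(k\) equal to the smaller of \(n_1 - 1\) and \(n_2 - 1\).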
The choice of approximation rarely makes a difference in our conclusion. Most statistical software uses the first option to approximate the \(t(k)\) distribution unless the user requests another method. Use of this approximation without software is a bit complicated [18].
If you are not using software, we recommend the second approximation. This approximation is appealing because it is conservative [19]. That is, margins of error for confidence intervals are a bit wider than they need to be, so the true confidence level is larger than \(C\). For significance testing, the true \(P\)-values are a bit smaller than those we obtain from the approximation; thus, for tests at a fixed significance level, we are a little less likely to reject \(H_0\) when it is true.
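The two degrees-of-freedom choices can be compared numerically. The sketch below assumes the usual form of the Satterthwaite approximation, \(k = (v_1 + v_2)^2 / \left(v_1^2/(n_1 - 1) + v_2^2/(n_2 - 1)\right)\) with \(v_i = s_i^2/n_i\), which this section does not print. The function names and the input values are hypothetical and chosen only for illustration.

```python
def satterthwaite_df(s1, n1, s2, n2):
    """Degrees of freedom k computed from the data (assumed standard formula)."""
    v1 = s1**2 / n1  # estimated variance of the first sample mean
    v2 = s2**2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


def conservative_df(n1, n2):
    """Conservative choice: the smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)


# Hypothetical summary statistics, not taken from the text.
s1, n1 = 4.0, 20
s2, n2 = 9.0, 12

k_software = satterthwaite_df(s1, n1, s2, n2)
k_conservative = conservative_df(n1, n2)
print(f"Satterthwaite k = {k_software:.2f}, conservative k = {k_conservative}")
```

The Satterthwaite value is never smaller than the conservative value, which is why the conservative choice gives wider intervals and larger \(P\)-values.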
The two-sample \(t\) confidence interval
We now apply the basic ideas about \(t\) procedures to the problem of comparing two means when the standard deviations are unknown. We start with confidence intervals.
Two-Sample \(t\) Confidence Interval
Draw an SRS of size \(n_1\) from a Normal population with unknown mean \(\mu_1\) and an independent SRS of size \(n_2\) from another Normal population with unknown mean \(\mu_2\). The confidence interval for \(\mu_1 - \mu_2\) given by
\[ (\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]
has confidence level at least \(C\) no matter what the population standard deviations may be. The margin of error is
\[ t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]
Here, \(t^*\) is the value for the \(t(k)\) density curve with area \(C\) between \(-t^*\) and \(t^*\). The value of the degrees of freedom \(k\) is approximated by software or is the smaller of \(n_1 - 1\) and \(n_2 - 1\).
EXAMPLE 7.10 Smart Shopping Carts and Spending
smrtcrt
Smart shopping carts are shopping carts equipped with scanners that track the total price of the items in the cart. While both consumers and retailers have expressed interest in the use of this technology, actual implementation has been slow. One reason for this is uncertainty in how real-time spending feedback affects shopping. Retailers do not want to adopt a technology that is going to lower sales.
To help understand the smart shopping cart’s influence on spending behavior, a group of researchers designed a study to compare spending with and without real-time feedback. Each participant was asked to shop at an online grocery store for items on a common grocery list. The goal was to keep spending around a budget of $35. Half the participants were randomly assigned to receive real-time feedback: specifically, the names of the products currently in their cart and the total price. The non-feedback participants only saw the total price when they completed their shopping.
Figure 7.9 shows side-by-side boxplots of the data [20]. There appears to be a slight skewness in the total price, but no obvious outliers in either group. Given these results and the large sample sizes, we feel confident in using the \(t\) procedures.
In general, the participants with real-time feedback appear to have spent more than those without feedback. The summary statistics are
Group | \(n\) | \(\bar{x}\) | \(s\) |
---|---|---|---|
With feedback | 49 | 33.137 | 6.568 |
Without feedback | 48 | 30.315 | 6.846 |
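We’d like to estimate the difference in the two means and provide an estimate of the precision. Plugging in these summary statistics, the 95% confidence interval for the difference in means is
\[ (\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = (33.137 - 30.315) \pm t^* \sqrt{\frac{6.568^2}{49} + \frac{6.846^2}{48}} = 2.822 \pm t^*(1.363) \]
Using software, the degrees of freedom are 94.63 and \(t^* = 1.985\). This approximation gives
\[ 2.822 \pm (1.985)(1.363) = 2.822 \pm 2.706 = (0.12, 5.53) \]
The conservative approach would use the smaller of
\[ n_1 - 1 = 49 - 1 = 48 \quad \text{and} \quad n_2 - 1 = 48 - 1 = 47 \]
Table D does not supply a row for \(t(47)\) but gives \(t^* = 2.021\) for \(t(40)\). We use \(k = 40\) because it is the closest value of \(k\) in the table that is less than 47. With this approximation we have
\[ 2.822 \pm (2.021)(1.363) = 2.822 \pm 2.755 = (0.07, 5.58) \]
df = 40 | | |
---|---|---|
\(t^*\) | 1.684 | 2.021 |
\(C\) | 90% | 95% |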
The conservative approach does give a wider interval than the more accurate approximation used by software. However, the difference is very small (just a nickel at each end). We estimate the mean difference in spending to be $2.82 with a margin of error of slightly more than $2.70. The data do not provide a very precise estimate of this difference.
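The interval arithmetic in this example can be checked with a few lines of code. This is a minimal sketch, not the software used in the text: the helper name `two_sample_ci` is hypothetical, the summary statistics are those in the table above, \(t^* = 2.021\) is the Table D value for 40 degrees of freedom, and \(t^* = 1.985\) is an assumed critical value for roughly 94.63 degrees of freedom.

```python
import math


def two_sample_ci(xbar1, s1, n1, xbar2, s2, n2, t_star):
    """Return (estimate, margin, lower, upper) for mu1 - mu2."""
    estimate = xbar1 - xbar2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    margin = t_star * se
    return estimate, margin, estimate - margin, estimate + margin


feedback = (33.137, 6.568, 49)     # mean, standard deviation, sample size
no_feedback = (30.315, 6.846, 48)

# t* = 1.985 is an assumed software value; t* = 2.021 comes from Table D (df = 40).
software = two_sample_ci(*feedback, *no_feedback, t_star=1.985)
conservative = two_sample_ci(*feedback, *no_feedback, t_star=2.021)

for label, (est, m, lo, hi) in [("software", software), ("conservative", conservative)]:
    print(f"{label}: {est:.3f} +/- {m:.3f} -> ({lo:.2f}, {hi:.2f})")
```

The two margins of error differ by about five cents, which is the "nickel at each end" noted above.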
Apply Your Knowledge
7.40 How to assemble a new machine.
You ran a two-sample study to compare two sets of instructions on how to assemble a new machine. You randomly assign each employee to one of the instructions and measure the time (in minutes) it takes to assemble. Assume that and . Find a 95% confidence interval for the average difference in time using the second approximation for degrees of freedom.
7.41 Another two-sample confidence interval.
Refer to the previous exercise. Suppose instead your study results were and . Find a 95% confidence interval for the average difference using the second approximation for degrees of freedom. Compare this interval with the one in the previous exercise.
7.41
. With the smaller sample sizes, the interval got wider.
The two-sample \(t\) significance test
The same ideas that we used for the two-sample \(t\) confidence intervals also apply to two-sample \(t\) significance tests. We can use either software or the conservative approach with Table D to approximate the \(P\)-value.
Two-Sample \(t\) Significance Test
Draw an SRS of size \(n_1\) from a Normal population with unknown mean \(\mu_1\) and an independent SRS of size \(n_2\) from another Normal population with unknown mean \(\mu_2\). To test the hypothesis \(H_0: \mu_1 = \mu_2\), compute the two-sample \(t\) statistic
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]
and use \(P\)-values or critical values for the \(t(k)\) distribution, where the degrees of freedom \(k\) are either approximated by software or are the smaller of \(n_1 - 1\) and \(n_2 - 1\).
EXAMPLE 7.11 Does Real-time Feedback Influence Spending?
smrtcrt
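For the grocery spending study described in Example 7.10, we want to see if there is a difference in average spending between the group of participants that had real-time feedback and the group that did not. For a formal significance test, the hypotheses are
\[ H_0: \mu_1 = \mu_2 \qquad H_a: \mu_1 \neq \mu_2 \]
The two-sample \(t\) test statistic is
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{33.137 - 30.315}{\sqrt{\dfrac{6.568^2}{49} + \dfrac{6.846^2}{48}}} = \frac{2.822}{1.363} = 2.07 \]
The \(P\)-value for the two-sided test is \(2P(T \geq 2.07)\). Software gives the approximate \(P\)-value as 0.0410 and uses 94.63 as the degrees of freedom.
For the second approximation, the degrees of freedom \(k\) are equal to 47. Because there is no row for \(k = 47\), we use the closest value of \(k\) in the table that is less than 47. Comparing \(t = 2.07\) with the entries in Table D for 40 degrees of freedom, we see that the two-sided \(P\)-value lies between 0.04 and 0.05. The data do suggest that consumers on a budget will spend more when provided with real-time feedback (\(t = 2.07\), \(0.04 < P < 0.05\)).
df = 40 | | |
---|---|---|
\(p\) | 0.025 | 0.02 |
\(t^*\) | 2.021 | 2.123 |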
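To see where a software \(P\)-value of this kind comes from, the following sketch computes the test statistic from the summary statistics and then integrates the \(t\) density numerically. It is an illustration only: the function names are hypothetical, and Simpson's rule stands in for whatever method a statistics package actually uses.

```python
import math


def t_pdf(x, k):
    """Density of the t distribution with k degrees of freedom."""
    log_c = math.lgamma((k + 1) / 2) - math.lgamma(k / 2) - 0.5 * math.log(k * math.pi)
    return math.exp(log_c - (k + 1) / 2 * math.log1p(x * x / k))


def two_sided_p(t, k, steps=2000):
    """2 * P(T >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, k) + t_pdf(b, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, k)
    central = total * h / 3  # P(0 <= T <= |t|)
    return 2 * (0.5 - central)


# Summary statistics from Example 7.10 (with feedback, without feedback).
se = math.sqrt(6.568**2 / 49 + 6.846**2 / 48)
t = (33.137 - 30.315) / se

p_software_df = two_sided_p(t, 94.63)  # degrees of freedom reported by software
p_conservative_df = two_sided_p(t, 47)  # smaller of n1 - 1 and n2 - 1
print(f"t = {t:.2f}, P (df 94.63) = {p_software_df:.4f}, P (df 47) = {p_conservative_df:.4f}")
```

The conservative degrees of freedom give a slightly larger \(P\)-value, in line with the bracket obtained from Table D.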
Apply Your Knowledge
7.42 How to assemble a new machine, continued.
Refer to Exercise 7.40 (page 382). Perform a significance test to see if there is a difference between the two sets of instructions using . Make sure to specify the hypotheses, test statistic, and its \(P\)-value, and state your conclusion.
7.43 Another two-sample \(t\)-test.
Refer to Exercise 7.41 (page 382).
7.43
(a) . (b) . The data are not significant at the 5% level, and there is not enough evidence to show a difference between the two sets of instructions. (b) Because the confidence interval contains 0, we fail to reject \(H_0\).
Robustness of the two-sample procedures
The two-sample \(t\) procedures are more robust than the one-sample \(t\) methods. When the sizes of the two samples are equal and the distributions of the two populations being compared have similar shapes, probability values from the \(t\) table are quite accurate for a broad range of distributions when the sample sizes are as small as \(n_1 = n_2 = 5\) [21]. When the two population distributions have different shapes, larger samples are needed. The guidelines given on page 372 for the use of one-sample \(t\) procedures can be adapted to two-sample procedures by replacing “sample size” with the “sum of the sample sizes” \(n_1 + n_2\).
These guidelines are rather conservative, especially when the two samples are of equal size. In planning a two-sample study, you should usually choose equal sample sizes. The two-sample \(t\) procedures are most robust against non-Normality in this case, and the conservative probability values are most accurate.
Here is an example with large sample sizes that are almost equal. Even if the distributions are not Normal, we are confident that the sample means will be approximately Normal. The two-sample \(t\) procedures are very robust in this case.
EXAMPLE 7.12 Wheat Prices
The U.S. Department of Agriculture (USDA) uses sample surveys to produce important economic estimates [22]. One pilot study estimated wheat prices in July and in January using independent samples of wheat producers in the two months. Here are the summary statistics, in dollars per bushel:
Month | \(n\) | \(\bar{x}\) | \(s\) |
---|---|---|---|
January | 45 | $6.66 | $0.24 |
July | 50 | $6.93 | $0.27 |
The July prices are higher on the average. But we have data from only a limited number of producers each month. Can we conclude that national average prices in July and January are not the same? Or are these differences merely what we would expect to see due to random variation?
Because we did not specify a direction for the difference before looking at the data, we choose a two-sided alternative. The hypotheses are
\[ H_0: \mu_1 = \mu_2 \qquad H_a: \mu_1 \neq \mu_2 \]
Because the samples are moderately large, we can confidently use the \(t\) procedures even though we lack the detailed data and so cannot verify the Normality condition.
The two-sample \(t\) statistic is
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{6.93 - 6.66}{\sqrt{\dfrac{0.27^2}{50} + \dfrac{0.24^2}{45}}} = 5.16 \]
The conservative approach finds the \(P\)-value by comparing 5.16 to critical values for the \(t(44)\) distribution because the smaller sample has 45 observations. We must double the table tail area \(p\) because the alternative is two-sided.
df = 40 | |
---|---|
\(p\) | 0.0005 |
\(t^*\) | 3.551 |
Table D does not have entries for 44 degrees of freedom. When this happens, we use the next smaller degrees of freedom. Our calculated value of \(t\) is larger than the \(p = 0.0005\) entry in the table. Doubling 0.0005, we conclude that the \(P\)-value is less than 0.001. The data give conclusive evidence that the mean wheat prices were higher in July than they were in January (\(t = 5.16\), \(P < 0.001\)).
In this example, the exact \(P\)-value is very small because \(t = 5.16\) says that the observed difference in means is more than five standard errors above the hypothesized difference of zero. The difference in mean prices is not only highly significant but large enough (27 cents per bushel) to be important to producers.
In this and other examples, we can choose which population to label 1 and which to label 2. After inspecting the data, we chose July as Population 1 because this choice makes the \(t\) statistic a positive number. This avoids any possible confusion from reporting a negative value for \(t\). Choosing the population labels is not the same as choosing a one-sided alternative after looking at the data. Choosing hypotheses after seeing a result in the data is a violation of sound statistical practice.
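The effect of the labeling choice is easy to verify. The sketch below uses the summary statistics from the table in this example; the helper name `two_sample_t` is hypothetical. Swapping the labels changes only the sign of the statistic, so a two-sided \(P\)-value is unaffected.

```python
import math


def two_sample_t(xbar1, s1, n1, xbar2, s2, n2):
    """Unpooled two-sample t statistic for H0: mu1 = mu2."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (xbar1 - xbar2) / se


july = (6.93, 0.27, 50)     # mean, standard deviation, sample size
january = (6.66, 0.24, 45)

t_july_first = two_sample_t(*july, *january)
t_january_first = two_sample_t(*january, *july)
print(f"July as Population 1: t = {t_july_first:.2f}")
print(f"January as Population 1: t = {t_january_first:.2f}")
```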
Inference for small samples
Small samples require special care. We do not have enough observations to examine the distribution shapes, and only extreme outliers stand out. The power of significance tests tends to be low, and the margins of error of confidence intervals tend to be large. Despite these difficulties, we can often draw important conclusions from studies with small sample sizes. If the size of an effect is as large as it was in the preceding wheat price example, it should still be evident even if the n’s are small.
EXAMPLE 7.13 More about Wheat Prices
wheat
In the setting of Example 7.12, a quick survey collects prices from only five producers each month. The data are
Month | Price ($/bushel) | ||||
---|---|---|---|---|---|
January | 6.6125 | 6.4775 | 6.3500 | 6.7525 | 6.7625 |
July | 6.7350 | 6.9000 | 6.6475 | 7.2025 | 7.0550 |
The prices are reported to the nearest quarter of a cent. First, examine the distributions with a back-to-back stemplot after rounding each price to the nearest cent.
January | Stem | July |
---|---|---|
5 | 6.3 | |
8 | 6.4 | |
 | 6.5 | |
1 | 6.6 | 5 |
65 | 6.7 | 4 |
 | 6.8 | |
 | 6.9 | 0 |
 | 7.0 | 6 |
 | 7.1 | |
 | 7.2 | 0 |
The pattern is reasonably clear. Although there is variation among prices within each month, the top three prices are all from July and the three lowest prices are from January.
A significance test can confirm that the difference between months is too large to easily arise just by chance. We test the same hypotheses as in Example 7.12:
\[ H_0: \mu_1 = \mu_2 \qquad H_a: \mu_1 \neq \mu_2 \]
The price is higher in July. The difference in sample means is 31.7 cents.
Figure 7.10 gives outputs for this analysis from several software systems. Although the formats and labels differ, the basic information is the same. All report the sample sizes, the sample means and standard deviations (or variances), the \(t\) statistic, and its \(P\)-value. All agree that the \(P\)-value is very small, though some give more detail than others. Excel and JMP outputs, for example, provide both one-sided and two-sided \(P\)-values. Some software (SAS, SPSS, and Minitab) labels the groups in alphabetical order. In this example, January is then the first population, and the reported \(t\) statistic is the negative of our result. Always check the means first and report the statistic (you may need to change the sign) in an appropriate way. Be sure to also mention the size of the effect you observed, such as “The sample mean price for July was 31.7 cents higher than in January.”
SAS and SPSS report the results of two procedures: a special procedure that assumes that the two population variances are equal and the general two-sample procedure that we have just studied. This “equal-variances” procedure is most helpful when the sample sizes \(n_1\) and \(n_2\) are small and it is reasonable to assume equal variances.
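For readers without access to the packages shown in Figure 7.10, the unpooled analysis can be sketched directly from the ten prices listed in this example. The code below is illustrative rather than a reproduction of any package's output: `welch_summary` is a hypothetical helper, and the degrees of freedom use the standard Satterthwaite formula, which is assumed here because the section does not print it.

```python
import math
from statistics import mean, variance

# Prices in dollars per bushel, copied from the table in Example 7.13.
january = [6.6125, 6.4775, 6.3500, 6.7525, 6.7625]
july = [6.7350, 6.9000, 6.6475, 7.2025, 7.0550]


def welch_summary(x1, x2):
    """Return (difference in means, unpooled t, Satterthwaite df)."""
    n1, n2 = len(x1), len(x2)
    v1 = variance(x1) / n1  # s1^2 / n1, using the sample variance
    v2 = variance(x2) / n2  # s2^2 / n2
    diff = mean(x1) - mean(x2)
    t = diff / math.sqrt(v1 + v2)
    k = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return diff, t, k


diff, t, k = welch_summary(july, january)        # July as Population 1
diff_rev, t_rev, k_rev = welch_summary(january, july)  # alphabetical labeling
print(f"difference = {diff:.3f}, t = {t:.2f}, df = {k:.2f}")
print(f"with January first: t = {t_rev:.2f}")
```

The difference of 0.317 dollars matches the 31.7 cents quoted above, and reversing the group order changes only the sign of \(t\).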
The pooled two-sample \(t\) procedures
There is one situation in which a \(t\) statistic for comparing two means is not approximately \(t\) distributed but has exactly a \(t\) distribution. Suppose that the two Normal population distributions have the same standard deviation. In this case, we need substitute only a single standard error in a \(z\) statistic, and the resulting \(t\) statistic has a \(t\) distribution. We will develop the \(z\) statistic first, as usual, and from it the \(t\) statistic.
Call the common (but still unknown) standard deviation of both populations \(\sigma\). Both sample variances \(s_1^2\) and \(s_2^2\) estimate \(\sigma^2\). The best way to combine these two estimates is to average them with weights equal to their degrees of freedom. This gives more weight to the information from the larger sample. The resulting estimator of \(\sigma^2\) is
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]
This is called the pooled estimator of \(\sigma^2\) because it combines the information in both samples.
When both populations have variance \(\sigma^2\), the addition rule for variances says that \(\bar{x}_1 - \bar{x}_2\) has variance equal to the sum of the individual variances, which is
\[ \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2} = \sigma^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \]
The standardized difference of means in this equal-variance case is, therefore,
\[ z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \]
This is a special two-sample \(z\) statistic for the case in which the populations have the same \(\sigma\). Replacing the unknown \(\sigma\) by the estimate \(s_p\) gives a \(t\) statistic. The degrees of freedom are \(n_1 + n_2 - 2\), the sum of the degrees of freedom of the two sample variances. This statistic is the basis of the pooled two-sample \(t\) inference procedures.
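Pooled Two-Sample \(t\) Procedures
Draw an SRS of size \(n_1\) from a Normal population with unknown mean \(\mu_1\) and an independent SRS of size \(n_2\) from another Normal population with unknown mean \(\mu_2\). Suppose that the two populations have the same unknown standard deviation. A level \(C\) confidence interval for \(\mu_1 - \mu_2\) is
\[ (\bar{x}_1 - \bar{x}_2) \pm t^* s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]
Here \(t^*\) is the value for the \(t(n_1 + n_2 - 2)\) density curve with area \(C\) between \(-t^*\) and \(t^*\).
To test the hypothesis \(H_0: \mu_1 = \mu_2\), compute the pooled two-sample \(t\) statistic
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \]
and use \(P\)-values from the \(t(n_1 + n_2 - 2)\) distribution.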
CASE 7.2 Active versus Failed Retail Companies
In what ways are companies that fail different from those that continue to do business? To answer this question, one study compared various characteristics of active and failed retail firms [23]. One of the variables was the cash flow margin. Roughly speaking, this is a measure of how efficiently a company converts its sales dollars to cash and is a key profitability measure. The higher the percent, the more profitable the company. The data for 101 companies appear in Table 7.3.
Active firms | | | | | | Failed firms | | |
---|---|---|---|---|---|---|---|---|
−15.57 | 4.13 | −19.37 | 17.27 | 32.29 | −1.44 | 23.87 | 49.07 | −7.53 |
23.43 | −8.75 | −1.35 | 34.55 | 1.70 | −0.67 | −23.91 | 7.29 | −14.81 |
3.17 | 11.62 | 9.38 | 13.40 | 2.20 | −22.26 | −5.12 | −24.34 | −38.27 |
−0.35 | −27.78 | 0.65 | −40.82 | 23.55 | 24.45 | 7.71 | −28.79 | −38.35 |
−9.65 | −16.01 | 36.31 | −27.71 | 9.73 | 40.48 | 9.88 | −7.99 | −18.91 |
3.37 | 5.80 | −15.60 | −3.58 | 8.46 | 8.83 | −46.38 | −41.30 | 0.37 |
40.25 | −13.39 | 15.86 | −2.25 | 12.97 | 28.21 | 1.41 | −25.56 | 5.28 |
11.02 | 30.00 | 4.84 | 30.60 | 6.57 | −20.31 | −15.13 | 8.48 | 15.72 |
27.97 | 3.72 | −0.71 | −16.46 | 7.76 | −4.20 | −11.00 | 1.27 | 14.23 |
13.08 | −9.31 | 20.21 | −10.45 | 21.39 | ||||
−22.10 | −24.55 | 28.93 | 35.83 | 21.02 | ||||
12.28 | 0.43 | 22.49 | −8.54 | −30.46 | ||||
−1.89 | 27.92 | 32.79 | −0.52 | 6.35 |
As usual, we first examine the data. Histograms for the two groups of firms are given in Figure 7.11. Normal curves with mean and standard deviation equal to the sample values are superimposed on the histograms. The distribution for the active firms looks more Normal than the distribution for the failed firms. However, there are no outliers or strong departures from Normality that will prevent us from using the \(t\) procedures for these data. Let’s compare the mean cash flow margin for the two groups of firms using a significance test.
EXAMPLE 7.14 Does the Cash Flow Margin Differ?
cmps
CASE 7.2 Take Group 1 to be the firms that were active and Group 2 to be those that failed. The question of interest is whether or not the mean cash flow margin is different for the two groups. We therefore test
\[ H_0: \mu_1 = \mu_2 \qquad H_a: \mu_1 \neq \mu_2 \]
Here are the summary statistics:
Group | Firms | \(n\) | \(\bar{x}\) | \(s\) |
---|---|---|---|---|
1 | Active | 74 | 5.42 | 18.80 |
2 | Failed | 27 | −7.14 | 21.67 |
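The sample standard deviations are fairly close. A difference this large is not particularly unusual even in samples this large. We are willing to assume equal population standard deviations. The pooled sample variance is
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(73)(18.80)^2 + (26)(21.67)^2}{74 + 27 - 2} = 383.94 \]
so that
\[ s_p = \sqrt{383.94} = 19.59 \]
The pooled two-sample \(t\) statistic is
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{5.42 - (-7.14)}{19.59 \sqrt{\dfrac{1}{74} + \dfrac{1}{27}}} = 2.85 \]
The \(P\)-value is \(2P(T \geq 2.85)\), where \(T\) has the \(t(99)\) distribution.
In Table D, we have entries for 80 and 100 degrees of freedom. We will use the entries for 100 because \(k = 99\) is so close. Our calculated value of \(t\) is between the \(p = 0.005\) and \(p = 0.0025\) entries in the table. Doubling these, we conclude that the two-sided \(P\)-value is between 0.005 and 0.01. Statistical software gives a more exact value within this range. There is strong evidence that the average cash flow margins are different.
df = 100 | | |
---|---|---|
\(p\) | 0.005 | 0.0025 |
\(t^*\) | 2.626 | 2.871 |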
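The pooled calculation for this example can be reproduced from the summary statistics alone. The following is a minimal sketch; the helper name `pooled_t` is hypothetical, and the inputs are the values in the summary table for active and failed firms.

```python
import math


def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """Return (pooled variance, pooled sd, t statistic, degrees of freedom)."""
    df = n1 + n2 - 2
    # Weight each sample variance by its own degrees of freedom.
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    sp = math.sqrt(sp2)
    t = (xbar1 - xbar2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return sp2, sp, t, df


# Group 1 = active firms, Group 2 = failed firms.
sp2, sp, t, df = pooled_t(5.42, 18.80, 74, -7.14, 21.67, 27)
print(f"pooled variance = {sp2:.2f}, pooled sd = {sp:.2f}, t = {t:.2f}, df = {df}")
```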
Of course, a \(P\)-value is rarely a complete summary of a statistical analysis. To make a judgment regarding the size of the difference between the two groups of firms, we need a confidence interval.
EXAMPLE 7.15 How Different Are Cash Flow Margins?
cmps
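CASE 7.2 The difference in mean cash flow margins for active versus failed firms is
\[ \bar{x}_1 - \bar{x}_2 = 5.42 - (-7.14) = 12.56 \]
For a 95% margin of error, we will use the critical value \(t^* = 1.984\) from the \(t(100)\) distribution. The margin of error is
\[ t^* s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = (1.984)(19.59) \sqrt{\frac{1}{74} + \frac{1}{27}} = 8.74 \]
We report that the active firms have current cash flow margins that average 12.56% higher than failed firms, with margin of error 8.74% for 95% confidence. Alternatively, we are 95% confident that the difference is between 3.82% and 21.30%.
df = 100 | | |
---|---|---|
\(t^*\) | 1.660 | 1.984 |
\(C\) | 90% | 95% |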
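The interval in this example follows the same pattern as the pooled test. The sketch below recomputes the pooled standard deviation from the summary statistics and applies the Table D critical value for 100 degrees of freedom; the variable names are illustrative.

```python
import math

# Summary statistics for active (1) and failed (2) firms.
xbar1, s1, n1 = 5.42, 18.80, 74
xbar2, s2, n2 = -7.14, 21.67, 27
t_star = 1.984  # Table D, 95% confidence, df = 100

# Pooled standard deviation: variances weighted by their degrees of freedom.
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

estimate = xbar1 - xbar2
margin = t_star * sp * math.sqrt(1 / n1 + 1 / n2)
lower, upper = estimate - margin, estimate + margin
print(f"{estimate:.2f} +/- {margin:.2f} -> ({lower:.2f}, {upper:.2f})")
```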
The pooled two-sample \(t\) procedures are anchored in statistical theory and have long been the standard version of the two-sample \(t\) in textbooks. But they require the condition that the two unknown population standard deviations are equal. This condition is hard to verify. We discuss methods to assess this condition in Chapter 14.
The pooled procedures are, therefore, a bit risky. They are reasonably robust against both non-Normality and unequal standard deviations when the sample sizes are nearly the same. When the samples are quite different in size, the pooled procedures become sensitive to unequal standard deviations and should be used with caution unless the samples are large. Unequal standard deviations are quite common. In particular, it is common for the spread of data to increase when the center moves up. We recommend regular use of the unpooled procedures, particularly when software automates the Satterthwaite approximation.
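The sensitivity to unequal sample sizes can be illustrated with the Case 7.2 summary statistics, where the smaller sample has the larger standard deviation. This comparison is not made in the text; it is a sketch under the assumption that the unpooled degrees of freedom follow the standard Satterthwaite formula, and the variable names are hypothetical.

```python
import math

# Active (1) and failed (2) firms: mean, standard deviation, sample size.
xbar1, s1, n1 = 5.42, 18.80, 74
xbar2, s2, n2 = -7.14, 21.67, 27
diff = xbar1 - xbar2

# Pooled analysis: one common variance estimate, df = n1 + n2 - 2.
df_pooled = n1 + n2 - 2
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df_pooled)
t_pooled = diff / (sp * math.sqrt(1 / n1 + 1 / n2))

# Unpooled analysis: separate variances, Satterthwaite df (assumed formula).
v1, v2 = s1**2 / n1, s2**2 / n2
t_unpooled = diff / math.sqrt(v1 + v2)
df_unpooled = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled:   t = {t_pooled:.2f}, df = {df_pooled}")
print(f"unpooled: t = {t_unpooled:.2f}, df = {df_unpooled:.1f}")
```

Here the unpooled statistic is somewhat smaller and has far fewer degrees of freedom, so the unpooled procedure is the more cautious of the two for these data.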
Apply Your Knowledge
7.44 Using software.
Figure 7.10 (pages 387–388) gives the outputs from five software systems for comparing prices received by wheat producers in July and January for small samples of five producers in each month. Some of the software reports both pooled and unpooled analyses. Which outputs give the pooled results? What is the pooled test statistic and its \(P\)-value?
7.45 Wheat prices revisited.
Example 7.12 (pages 384–385) gives summary statistics for the price of wheat in January and July. The two sample standard deviations are relatively close, so we may be willing to assume equal population standard deviations. Calculate the pooled test statistic and its degrees of freedom from the summary statistics. Use Table D to assess significance. How do your results compare with the unpooled analysis in the example?
7.45
. The data give evidence that there is a difference between the January and July wheat prices. The results are nearly identical to the unpooled analysis.