Data-driven companies—both in manufacturing and service—gather data on various aspects of their businesses in order to draw conclusions about their own performance and about their markets.
These are all examples in which statistical inference (drawing conclusions about a population or process from sample data) would be used. By taking into account the natural variability in the sample data, inference provides a statement of how much confidence we can place in our conclusions. Although there are numerous methods for inference, there are only a few general types of statistical inference. This chapter introduces the two most common types: confidence intervals and tests of significance.
Because the underlying reasoning for these two types of inference remains the same across different settings, this chapter considers just one simple setting: inference about the mean of a large population whose standard deviation is known. This setting, although unrealistic, allows us the opportunity to focus on the underlying rationale of these types of statistical inference rather than the calculations.
Later chapters will present inference methods to use in most of the settings we met in learning to explore data. In fact, there are libraries—both of books and of computer software—full of more elaborate statistical techniques. Informed use of any of these methods, however, requires a firm understanding of the underlying reasoning. That is the goal of this chapter. A computer or calculator will do the arithmetic, but you must still exercise sound judgment based on understanding.
Overview of inference
In drawing conclusions about a population from data, statistical inference substantiates these conclusions with probability calculations, because probability accounts for chance variation in the sample data. We have already examined data and arrived at conclusions many times. How do we move from summarizing a single data set to formal inference involving probability calculations?
The foundation for this was described in Section 5.3. There, we not only discussed the use of statistics as estimates of population parameters, but we also described the chance variation of a statistic when the data are produced by random sampling or randomized experimentation.
Reminder
parameters and statistics, p. 276
There are a variety of statistics used to summarize data. In the previous chapter, we focused on categorical data for which counts and proportions are the most common statistics used. We now shift our focus to quantitative data. The sample mean, percentiles, and standard deviation are all examples of statistics based on quantitative data. In this chapter, we concentrate on the sample mean. Because sample means are just averages of observations, they are among the most frequently used statistics.
The sample mean $\bar{x}$ from a sample or an experiment is an estimate of the mean $\mu$ of the underlying population, just as the sample proportion $\hat{p}$ is an estimate of a population proportion $p$. In Section 5.3, we learned that when data are produced by random sampling or randomized experimentation, a statistic is a random variable and its sampling distribution shows how the statistic would vary in repeated data productions. To study inference about a population mean $\mu$, we must first understand the sampling distribution of the sample mean $\bar{x}$.
Reminder
sampling distribution, p. 277
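The idea that the sample mean varies from one data production to the next can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be Normal, and the values used (mean 50, standard deviation 10, samples of size 25, 2000 repetitions) and the function name `simulate_sample_means` are hypothetical choices that do not come from the text.

```python
import random
import statistics


def simulate_sample_means(population_mean, population_sd, sample_size,
                          num_samples, seed=0):
    """Draw repeated random samples and return the mean of each one.

    Assumption: the population is Normal with the given mean and
    standard deviation. This is an illustrative choice only.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = []
    for _ in range(num_samples):
        # One "data production": a simple random sample of the given size.
        sample = [rng.gauss(population_mean, population_sd)
                  for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


# Hypothetical population and sampling plan (not taken from the text).
means = simulate_sample_means(population_mean=50.0, population_sd=10.0,
                              sample_size=25, num_samples=2000)

center = statistics.fmean(means)   # center of the simulated sample means
spread = statistics.stdev(means)   # sample-to-sample variability of the mean

print(f"average of the sample means: {center:.2f}")
print(f"standard deviation of the sample means: {spread:.2f}")
```

Under these assumptions, the simulated sample means should cluster near the population mean of 50, and their standard deviation should be close to the population standard deviation divided by the square root of the sample size, here 10/√25 = 2. The collection of simulated means is an approximation to the sampling distribution of the sample mean.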