In this chapter we studied several graphical and numerical methods to summarize both categorical and quantitative variables. Remember that depending on whether the data comes from a categorical or quantitative variable, different numerical and graphical methods are applied.
Data that arises from categorical variables is summarized numerically with a frequency table or a relative frequency table. This table contains the possible measurements of the categorical variable along with the corresponding frequency (or relative frequency) of each measurement. The two most common types of graphs appropriate for a categorical variable are a bar graph and a pie chart. A bar graph uses the heights of the bars to indicate the frequency or relative frequency of measurements that fall into each category. A pie chart is a circle graph that is used when the categorical data consists of all parts of a single whole and when each measurement belongs to only one category.
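As a small illustration of the frequency and relative frequency tables described above, the following Python sketch tallies a made-up list of responses. The data and the helper name `frequency_table` are assumptions chosen for this example only.

```python
from collections import Counter

# Hypothetical sample: favorite color reported by 10 students.
colors = ["red", "blue", "blue", "green", "red",
          "blue", "green", "blue", "red", "blue"]


def frequency_table(values):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(values)
    n = len(values)
    return {category: (count, count / n) for category, count in counts.items()}


table = frequency_table(colors)

# Print one row per category: name, frequency, relative frequency.
for category, (freq, rel) in sorted(table.items()):
    print(f"{category:<6} {freq:>3} {rel:.2f}")
```

The relative frequencies add up to 1 because every measurement falls into exactly one category, which is also the condition under which a pie chart is appropriate.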
Histograms, stem and leaf plots, boxplots and time plots are graphs used to display quantitative data. A histogram displays the possible values of the quantitative variable in intervals along the horizontal axis in numerical order. The frequencies (or relative frequencies) of the values are displayed on the vertical axis. A stem and leaf plot separates each data point into two pieces: “the stem” which consists of all digits except the final digit and “the leaf” which is the final digit. A time plot shows how a quantitative variable changes over time. The horizontal axis displays the time period in intervals and the vertical axis displays the possible values of the quantitative variable. Typically we will want to describe the distribution of quantitative data according to characteristics such as shape, center, variability and whether or not there appear to be unusual values. A distribution’s shape is often described as symmetric, left-skewed, or right-skewed, and we can further describe its shape by how many peaks it has: unimodal, bimodal or multi-modal. Finally, if there are any outliers they should be noted. Outliers are data values that are quite different from the rest of the data values.
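The stem and leaf construction described above can be sketched in a few lines of Python. This is a minimal version that assumes non-negative whole-number data; the exam scores and the function name `stem_and_leaf` are made up for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (all digits but the last)
    and leaf (the last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 83 -> stem 8, leaf 3
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical exam scores.
scores = [62, 75, 78, 81, 83, 83, 90, 94, 67, 71]
plot = stem_and_leaf(scores)

# Print every stem from smallest to largest, including empty ones,
# so that gaps in the data remain visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

For these scores the rows are `6 | 27`, `7 | 158`, `8 | 133` and `9 | 04`. Turned on its side, the display resembles a histogram with intervals of width 10.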
When we summarize quantitative data we should include a measure of center and a measure of variability. Loosely, the center is a value of the quantitative variable such that roughly half of the data is smaller than this value. One common measure of center is the mean, which is the arithmetic average. When we talk about the variability of a data set we are simply measuring how far the data values are spread out, typically from the mean. One measure of variability is the range, which is the difference between the maximum and minimum data values. A more commonly used measure of variability is the standard deviation, which is a number that describes, on average, how far the data is spread out from its mean. A statistic is resistant to outliers if it is not influenced by extremely low or high data values. Neither the mean nor the standard deviation is resistant to outliers.
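To see how these summaries respond to an extreme value, the sketch below computes them for a small made-up data set, then again after one unusually large value is added. It uses Python's standard `statistics` module; the data and the helper name `summarize` are assumptions for this example, and `statistics.stdev` is the sample standard deviation (dividing by n - 1).

```python
import statistics

# Hypothetical data set, and the same data with one extreme value added.
data = [4, 5, 5, 6, 7, 7, 8]
with_outlier = data + [60]


def summarize(values):
    """Return common measures of center and variability."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


before = summarize(data)
after = summarize(with_outlier)

for name in before:
    print(f"{name:<6} before: {before[name]:6.2f}   after: {after[name]:6.2f}")
```

Adding the single value 60 moves the mean from 6 to 12.75 and the range from 4 to 56, while the median only moves from 6 to 6.5.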
The Empirical Rule tells us that when our data is roughly bell-shaped, about 68% of all measurements lie within one standard deviation of the mean, about 95% of the measurements lie within two standard deviations of the mean, and about 99.7% of all of the measurements lie within three standard deviations of the mean. In practice we often want to compare measurements between two different distributions to determine which measurement is more unusual with respect to its distribution. A z-score measures how far an observation is from the mean of its distribution in standard deviation units: it is the observation minus the mean, divided by the standard deviation.
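As a minimal sketch of comparing measurements from two different distributions, the Python below computes a z-score for a score on each of two tests. The test names, means and standard deviations are invented for the example and do not come from real data.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical: a score of 85 on Test A (mean 75, standard deviation 5)
# and a score of 620 on Test B (mean 500, standard deviation 100).
z_a = z_score(85, mean=75, sd=5)
z_b = z_score(620, mean=500, sd=100)

print(f"Test A z-score: {z_a:.1f}")
print(f"Test B z-score: {z_b:.1f}")

# The observation with the larger absolute z-score is the more unusual one.
more_unusual = "Test A" if abs(z_a) > abs(z_b) else "Test B"
print(f"More unusual relative to its distribution: {more_unusual}")
```

Here the Test A score is 2 standard deviations above its mean and the Test B score is 1.2 standard deviations above its mean, so the Test A score is the more unusual of the two, even though 620 is the larger raw number.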
If there are outliers or severe skewness in the data set, instead of reporting the mean and standard deviation, we commonly report the median as a measure of center and the IQR (interquartile range) as a measure of variability, which are both resistant to outliers. The median is the middle data value in an ordered list and the pth percentile is the number such that p% of the measurements fall at or below that number. Q1 is therefore the 25th percentile and Q3 is the 75th percentile. The IQR is the difference between the third and first quartiles, Q3 - Q1. A boxplot is an additional type of graph which displays the 5-number summary: minimum, Q1, median, Q3, and maximum. Outliers can be identified by looking for measurements that are either more than 1.5*IQR below the first quartile (less than Q1 - 1.5*IQR) or more than 1.5*IQR above the third quartile (greater than Q3 + 1.5*IQR). One additional measure of center that is sometimes used is the mode, the measurement (or measurements) that occurs most frequently.
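The 5-number summary and the 1.5*IQR rule can be sketched in Python as follows. A common form of the rule flags any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The data are made up, and the quartiles come from `statistics.quantiles` with `method="inclusive"`; textbooks and software use slightly different quartile conventions, so other tools may give slightly different values of Q1 and Q3.

```python
import statistics

# Hypothetical data set with one suspiciously large value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 30]

# Quartiles: the three cut points that split the ordered data into four parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (min(data), q1, median, q3, max(data))

# 1.5*IQR rule: values beyond these fences are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("5-number summary:", five_number_summary)
print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Potential outliers:", outliers)
```

With this data Q1 = 5, the median is 7 and Q3 = 9, so the IQR is 4 and the fences are -1 and 15. Only the value 30 lies outside the fences and is flagged.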