EXAMPLE 2 Identification of Outliers
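Consider two datasets: a swimmer's times (in seconds) in 14 races, with the time for race 10 missing, and the IQ and reading test scores of 14 students.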
Race number | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Time (seconds) | 60.81 | 66.11 | 47.32 | 42.69 | 43.40 | 44.82 | 42.67 |
Race number | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
Time (seconds) | 45.17 | 41.20 | missing | 42.47 | 41.74 | 40.40 | 42.90 |
IQ test score | 100 | 102 | 110 | 115 | 118 | 123 | 124 |
Reading test score | 40 | 65 | 55 | 70 | 75 | 95 | 45 |
IQ test score | 125 | 126 | 130 | 135 | 140 | 143 | 147 |
Reading test score | 70 | 85 | 90 | 75 | 95 | 85 | 95 |
Scatterplots of these datasets appear in Figure 6.3 and Figure 6.4, respectively. Outliers have been circled. The scatterplot in Figure 6.3 shows two circled outliers: the points with the highest values of the response variable, time, which are also the points with the lowest values of the explanatory variable, race number. Whenever possible, look for an explanation for the presence of outliers. In this case, the swimmer had just learned the butterfly, which explains why her times in the first two races (when she was worried about being disqualified) were unusually slow.
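The text does not say how the two points in Figure 6.3 were chosen. As one possible check, the sketch below applies a common rule of thumb (an assumption here, not the book's stated method): flag any time more than 1.5 interquartile ranges below the first quartile or above the third quartile. The function name `iqr_outliers` and the multiplier `k` are illustrative choices, and the missing time for race 10 is simply skipped.

```python
from statistics import quantiles

# Race times in seconds from the first dataset; race 10 has no recorded time.
times = {
    1: 60.81, 2: 66.11, 3: 47.32, 4: 42.69, 5: 43.40, 6: 44.82, 7: 42.67,
    8: 45.17, 9: 41.20, 10: None, 11: 42.47, 12: 41.74, 13: 40.40, 14: 42.90,
}

def iqr_outliers(times_by_race, k=1.5):
    """Return race numbers whose times fall outside the k * IQR fences.

    Hypothetical helper: the 1.5 * IQR rule is assumed, not taken from the text.
    """
    # Drop races with a missing time before computing quartiles.
    observed = {race: t for race, t in times_by_race.items() if t is not None}
    q1, _median, q3 = quantiles(list(observed.values()), n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return sorted(
        race for race, t in observed.items() if t < low_fence or t > high_fence
    )

outlier_races = iqr_outliers(times)
print(outlier_races)  # prints [1, 2]
```

Under this rule, only races 1 and 2 are flagged, which agrees with the two circled points. Other quartile conventions or multipliers could give different fences, so the result should be read as a check on the plot rather than a definition of an outlier.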
The point circled in Figure 6.4 was flagged as an outlier by a statistical program. In this case, the outlier does not correspond to the minimum or maximum value of either the response or the explanatory variable. Instead, the point (124, 45) represents a reading test score that is low compared with the reading test scores of other students with IQ test scores close to 124. Without additional information about this student, we don’t have an explanation for the presence of this outlier.
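The text does not name the statistical program or the rule it used. One plausible approach, sketched below under that caveat, is to fit a least-squares line to the data and flag any point whose residual is more than two residual standard deviations from zero. The function name `flag_large_residuals` and the cutoff of 2 are assumptions made for illustration.

```python
from math import sqrt

# Second dataset: IQ test scores (explanatory) and reading test scores (response).
iq = [100, 102, 110, 115, 118, 123, 124, 125, 126, 130, 135, 140, 143, 147]
reading = [40, 65, 55, 70, 75, 95, 45, 70, 85, 90, 75, 95, 85, 95]

def flag_large_residuals(x, y, threshold=2.0):
    """Return (x, y) points whose least-squares residual exceeds threshold * s.

    Hypothetical helper: the cutoff of 2 residual standard deviations is an
    assumed convention, not the rule used by the program in the text.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope and intercept for y = intercept + slope * x.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed response minus predicted response.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [
        (xi, yi)
        for xi, yi, r in zip(x, y, residuals)
        if abs(r) > threshold * s
    ]

flagged = flag_large_residuals(iq, reading)
print(flagged)  # prints [(124, 45)]
```

With these data the fitted line predicts a reading score of about 74 for an IQ score of 124, so the observed score of 45 lies roughly 29 points below the line, a little over two residual standard deviations. No other point exceeds the cutoff, which is consistent with the single circled point in Figure 6.4.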