4.2 Significant Differences

4-2 How do we know whether an observed difference can be generalized to other populations?

Data are “noisy.” The average score in one group (children who were breast-fed as babies, for example) could conceivably differ from that in another group (children who were bottle-fed as babies) not because of any real difference but merely because of chance fluctuations in the people sampled. How confidently, then, can we infer that an observed difference is not just a fluke—a chance result from the research sample? For guidance, we can ask how reliable and significant the differences are. These inferential statistics help us determine if results can be generalized to a larger population.

When Is an Observed Difference Reliable?

In deciding when it is safe to generalize from a sample, we should keep three principles in mind:

  1. Representative samples are better than biased samples. The best basis for generalizing is not from the exceptional and memorable cases one finds at the extremes but from a representative sample of cases. Research never randomly samples the whole human population. Thus, it pays to keep in mind what population a study has sampled.
  2. Less-variable observations are more reliable than those that are more variable. As we noted earlier in the example of the basketball player whose game-to-game points were consistent, an average is more reliable when it comes from scores with low variability.
  3. More cases are better than fewer. An eager prospective student visits two university campuses, each for a day. At the first, the student randomly attends two classes and discovers both instructors to be witty and engaging. At the next campus, the two sampled instructors seem dull and uninspiring. Returning home, the student (discounting the small sample size of only two teachers at each institution) tells friends about the “great teachers” at the first school, and the “bores” at the second. Again, we know it but we ignore it: Averages based on many cases are more reliable (less variable) than averages based on only a few cases.

The point to remember: Smart thinkers are not overly impressed by a few anecdotes. Generalizations based on a few unrepresentative cases are unreliable.
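Principles 2 and 3 can be checked with a small simulation. The sketch below is only an illustration, and its assumptions are ours rather than the text's: scores are normally distributed around a mean of 100, the standard deviations (15 and 5) and sample sizes (2 and 50) are made-up values, and the helper name `spread_of_sample_means` is hypothetical. It draws many samples and measures how much the sample averages bounce around from one sample to the next.

```python
import random
import statistics


def spread_of_sample_means(pop_sd, n, trials=2000, seed=0):
    """Draw `trials` samples of size `n` from a normal population
    (mean 100, standard deviation `pop_sd`) and return the standard
    deviation of the resulting sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(trials):
        sample = [rng.gauss(100, pop_sd) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# Hypothetical settings chosen only for illustration.
few_cases = spread_of_sample_means(pop_sd=15, n=2)        # two instructors per campus
many_cases = spread_of_sample_means(pop_sd=15, n=50)      # many cases
low_variability = spread_of_sample_means(pop_sd=5, n=2)   # less-variable scores

print(f"n=2,  sd=15: averages vary by about {few_cases:.1f}")
print(f"n=50, sd=15: averages vary by about {many_cases:.1f}")
print(f"n=2,  sd=5:  averages vary by about {low_variability:.1f}")
```

In theory the spread of a sample average equals the population standard deviation divided by the square root of the sample size, so the three settings should come out near 10.6, 2.1, and 3.5. Averages based on two cases swing widely; averages based on many cases, or on less-variable scores, are steadier.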


When Is an Observed Difference Significant?

Perhaps you’ve compared men’s and women’s scores on a laboratory test of aggression, and found a gender difference. But individuals differ. How likely is it that the difference you observed was just a fluke? Statistical testing can estimate that.

Here is the underlying logic: When averages from two samples are each reliable measures of their respective populations (as when each is based on many observations that have small variability), then their difference is likely to be reliable as well. (Example: The less the variability in women’s and in men’s aggression scores, the more confidence we would have that any observed gender difference is reliable.) And when the difference between the sample averages is large, we have even more confidence that the difference between them reflects a real difference in their populations.

In short, when sample averages are reliable, and when the difference between them is relatively large, we say the difference has statistical significance. This means that the observed difference is probably not due to chance variation between the samples.
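One way to make this logic concrete is a permutation test, sketched below. This is our choice of illustration, not a method the text names, and the scores are invented. The hypothetical function `permutation_p_value` shuffles the group labels many times and counts how often chance alone produces a difference in averages at least as large as the one observed. Both pairs of groups have the same averages (14.1 and 11.0); only the variability differs.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, shuffles=5000, seed=1):
    """Estimate how often a difference in means at least as large as the
    observed one arises when group labels are assigned purely by chance."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)  # reassign scores to the two groups at random
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / shuffles


# Invented scores: same averages in both comparisons, different variability.
consistent_a = [12, 14, 15, 13, 16, 14, 15, 13, 14, 15]
consistent_b = [10, 11, 12, 10, 11, 12, 11, 10, 12, 11]
noisy_a = [2, 25, 9, 30, 5, 22, 11, 18, 1, 18]
noisy_b = [20, 3, 15, 1, 24, 6, 13, 2, 19, 7]

p_consistent = permutation_p_value(consistent_a, consistent_b)
p_noisy = permutation_p_value(noisy_a, noisy_b)

print(f"Low-variability groups:  chance produces this gap in {p_consistent:.1%} of shuffles")
print(f"High-variability groups: chance produces this gap in {p_noisy:.1%} of shuffles")
```

With consistent scores, random relabeling almost never reproduces the 3.1-point gap, so the difference would count as statistically significant. With highly variable scores, the same gap turns up often by chance alone.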

For a 9.5-minute video synopsis of psychology’s scientific research strategies, visit LaunchPad’s Video: Research Methods.

In judging statistical significance, psychologists are conservative. They are like juries who must presume innocence until guilt is proven. For most psychologists, proof beyond a reasonable doubt means not making much of a finding unless the probability of obtaining a result at least that large by chance, if no real effect exists, is less than 5 percent.

When reading about research, you should remember that, given large enough or homogeneous enough samples, a difference between them may be “statistically significant” yet have little practical significance. For example, comparisons of intelligence test scores among hundreds of thousands of firstborn and later-born individuals indicate a highly significant tendency for firstborn individuals to have higher average scores than their later-born siblings (Kristensen & Bjerkedal, 2007; Zajonc & Markus, 1975). But because the scores differ by only one to three points, the difference has little practical importance.
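A short calculation shows how a small difference can be highly significant yet practically minor. The sketch below rests on assumptions that are ours, not the cited studies': a 2-point difference (within the one-to-three-point range mentioned above), a standard deviation of 15 (the conventional scaling for intelligence test scores), hypothetical group sizes, and a simple two-group normal-approximation test. The function name `compare_groups` is likewise hypothetical.

```python
import math
from statistics import NormalDist


def compare_groups(mean_difference, sd, n_per_group):
    """Return (p_value, effect_size) for a two-group comparison, assuming
    equal group sizes and a known, shared standard deviation `sd`."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    # Two-sided probability of a difference at least this large under chance.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Effect size: the difference expressed in standard deviation units.
    effect_size = mean_difference / sd
    return p_value, effect_size


# Hypothetical group sizes; the 2-point gap and sd of 15 are assumptions.
p_large, d_large = compare_groups(mean_difference=2, sd=15, n_per_group=100_000)
p_small, d_small = compare_groups(mean_difference=2, sd=15, n_per_group=20)

print(f"100,000 per group: p < .05? {p_large < 0.05}; effect size = {d_large:.2f}")
print(f"20 per group:      p < .05? {p_small < 0.05}; effect size = {d_small:.2f}")
```

The effect size is the same in both cases, about 0.13 standard deviations. Only the sample size changes whether the difference passes the 5 percent threshold, which is why a significant result still needs to be judged for its practical importance.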

The point to remember: Statistical significance indicates how likely it is that a result at least this large would occur by chance if no real effect existed. It says nothing about the importance of the result.

RETRIEVAL PRACTICE

  • Can you solve this puzzle?

The registrar’s office at the University of Michigan has found that usually about 100 students in Arts and Sciences have perfect marks at the end of their first term at the University. However, only about 10 to 15 students graduate with perfect marks. What do you think is the most likely explanation for the fact that there are more perfect marks after one term than at graduation (Jepson et al., 1983)?

Averages based on fewer courses are more variable, so extremely high (and extremely low) averages are more common at the end of the first term than at graduation.

  • _____________ statistics summarize data, while _____________ statistics determine if data can be generalized to other populations.

Descriptive; inferential
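The registrar puzzle above can also be explored with a quick simulation. The sketch below makes deliberately simple assumptions that do not come from the text: every student has the same 60 percent chance of a top mark in each course, courses are independent, the first term has 4 courses, and a degree has 32. The function name `count_perfect_records` is hypothetical.

```python
import random


def count_perfect_records(students=10_000, p_top_mark=0.6,
                          first_term_courses=4, total_courses=32, seed=2):
    """Count students with all top marks after the first term and after
    all courses, under the simplifying assumptions described above."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    perfect_first_term = 0
    perfect_at_graduation = 0
    for _ in range(students):
        # True means a top mark in that course.
        marks = [rng.random() < p_top_mark for _ in range(total_courses)]
        if all(marks[:first_term_courses]):
            perfect_first_term += 1
        if all(marks):
            perfect_at_graduation += 1
    return perfect_first_term, perfect_at_graduation


after_one_term, at_graduation = count_perfect_records()
print(f"Perfect after one term: {after_one_term}")
print(f"Perfect at graduation:  {at_graduation}")
```

A record built from only a few courses reaches the extreme far more often than one built from many courses, so the number of perfect records shrinks sharply by graduation even though no student's ability has changed.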
