Describe the role of statistical significance in psychological research.
Compare null hypothesis significance testing with the use of effect size and confidence intervals.
Review
Select the NEXT button to continue with the Review.
1. After researchers have collected data, they typically calculate descriptive statistics, such as the mean and standard deviation. These statistics provide a summary of the central tendency and variability of each distribution of scores. Next, they calculate inferential statistics to determine whether the results support their hypotheses about behavior. For example, if researchers are comparing an experimental group with a control group, the descriptive statistics indicate whether or not the experimental treatment changed performance, and the inferential statistics indicate how much confidence the researchers should have in the results.
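As a minimal sketch, the descriptive statistics mentioned above can be computed with Python's standard library; the scores below are hypothetical:

```python
from statistics import mean, stdev

# Hypothetical test scores for one group of participants
scores = [5, 6, 7, 8, 9]

# Central tendency: the arithmetic mean of the distribution
print(mean(scores))                # 7

# Variability: the sample standard deviation (scatter around the mean)
print(round(stdev(scores), 2))     # 1.58
```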
2. The questions that inferential statistics can answer about a hypothesis usually involve a comparison. Did one group of participants perform differently than another group? On average, did participants perform differently in one condition of an experiment than in other conditions? Does the correlation coefficient for two variables differ from zero? (A correlation of zero indicates no relationship.)
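The correlation coefficient can be computed directly from its definition, as in this sketch with hypothetical paired scores:

```python
from math import sqrt
from statistics import mean

# Hypothetical paired scores for two variables
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)

# Pearson's r: covariance of the two variables divided by the
# product of their variabilities; always falls between -1.0 and +1.0
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
r = num / den

print(round(r, 2))   # 0.77
```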
3. If researchers find a difference among groups of participants, they want to know whether or not it is a “real” difference—a reliable, repeatable result—rather than a fluke due to random errors in measurement, random fluctuations in people’s behavior, or failure of random assignment to control for preexisting differences between the groups. Roughly speaking, we consider a result to be statistically significant if there is a low probability that the result was due to chance factors.
4. One standard method of determining statistical significance is to pose a null hypothesis—typically claiming that there is no difference—and then testing the likelihood that the observed results would have occurred if the null hypothesis were true. This method is called null hypothesis significance testing (NHST). For example, if researchers are comparing an experimental group to a control group, the null hypothesis is that the two groups do not differ. Assuming that the null hypothesis is true, the likelihood of obtaining results equal to or greater than the observed results is called the probability level, or p-level. Typically, if the p-level is less than 0.05 (5 percent), the result is considered statistically significant.
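A minimal sketch of this logic in Python, using a normal (z) approximation in place of the t-test researchers would typically run; the group scores are hypothetical:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical scores for a control group and an experimental group
control      = [5, 6, 7, 8, 9]
experimental = [7, 8, 9, 10, 11]

# Standard error of the difference between the two group means
se = sqrt(stdev(control) ** 2 / len(control) +
          stdev(experimental) ** 2 / len(experimental))

# Test statistic: the observed difference relative to its standard error
z = (mean(experimental) - mean(control)) / se

# Two-sided p-level: probability of a difference at least this large,
# in either direction, assuming the null hypothesis is true
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 1), p < 0.05)   # 2.0 True -- statistically significant
```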
5. There is a long-standing controversy within psychology as to whether the NHST approach is an appropriate way to evaluate research results. The p-level takes into account the size of the sample; the larger the sample, the easier it is to obtain a statistically significant result (with p < .05), even with a very small difference between groups. Also, the p-level is frequently misinterpreted as the probability that the null hypothesis is true. That interpretation is actually backwards: assuming that the null hypothesis is true, the p-level indicates the probability of obtaining a difference equal to or greater than the difference actually observed in the results.
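The influence of sample size on the p-level can be illustrated with the same normal approximation; the difference, standard deviation, and group sizes below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def p_level(diff, sd, n):
    """Two-sided p-level for a difference between two group means,
    using a normal approximation with equal group sizes and SDs."""
    se = sqrt(2 * sd ** 2 / n)
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same small difference (0.2 points, SD = 1.0) at two sample sizes
print(p_level(0.2, 1.0, 25) > 0.05)      # True: not significant with 25 per group
print(p_level(0.2, 1.0, 2500) < 0.001)   # True: highly significant with 2500 per group
```

The identical raw difference crosses the p < .05 threshold purely because the sample grew, which is one of the criticisms of relying on NHST alone.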
6. In addition to calculating the p-level, researchers usually calculate the effect size of a difference, which indicates the magnitude of the difference in a way that is not influenced by the size of the sample. Instead of using the NHST approach, some researchers prefer to calculate the confidence interval (margin of error) for each statistic, and then use those numbers to evaluate the reliability and importance of the finding. Researchers typically calculate the confidence interval for a mean in a way that provides a 95 percent probability that the actual mean of the population falls within the interval.
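Both ideas can be sketched with Python's standard library. Cohen's d is used here as one common effect-size measure, the confidence interval uses a normal approximation (z = 1.96), and all scores are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical scores for two groups
group_a = [5, 6, 7, 8, 9]
group_b = [7, 8, 9, 10, 11]

# Cohen's d: difference between means in pooled-standard-deviation
# units; unlike the p-level, it does not grow with sample size
pooled_sd = sqrt((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2)
d = (mean(group_b) - mean(group_a)) / pooled_sd
print(round(d, 2))   # 1.26

# 95% confidence interval for one mean (normal approximation)
scores = [10, 12, 14, 16, 18]
se = stdev(scores) / sqrt(len(scores))
lo, hi = mean(scores) - 1.96 * se, mean(scores) + 1.96 * se
print(round(lo, 2), round(hi, 2))   # 11.23 16.77
```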
7. On a graph of the research results, the mean scores often have error bars indicating the 95 percent confidence intervals. This allows the viewer to estimate the likelihood that the observed difference is a meaningful difference. In the left graph, the confidence intervals for females and males overlap, suggesting that the difference in mean scores is not reliable. In the right graph, the non-overlapping confidence intervals suggest that the measured age gap in performance could be a real difference.
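The overlap check described above can be sketched as follows; the group scores are hypothetical stand-ins for the data behind the two graphs:

```python
from math import sqrt
from statistics import mean, stdev

def ci95(scores):
    """95% confidence interval for a mean (normal approximation)."""
    se = stdev(scores) / sqrt(len(scores))
    return mean(scores) - 1.96 * se, mean(scores) + 1.96 * se

def overlap(a, b):
    """True if two confidence intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical scores for the two comparisons
females, males = [70, 72, 74, 76, 78], [72, 74, 76, 78, 80]
younger, older = [70, 72, 74, 76, 78], [80, 82, 84, 86, 88]

print(overlap(ci95(females), ci95(males)))   # True: difference may not be reliable
print(overlap(ci95(younger), ci95(older)))   # False: difference is likely real
```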
8. Finally, it is important to remember that a statistically significant result (with p < .05) is not necessarily an important result, even if the p-level is very small (such as p < .01 or p < .001). A small difference may be reliable and repeatable, but not have much impact on people’s behavior. For example, in the study illustrated in this graph, headache sufferers who had a pain level of 10 experienced a 40 percent decrease in pain after taking one aspirin tablet. Those who took two tablets experienced a 46 percent improvement. The difference was statistically significant, but most people would not notice the difference in their daily life.
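This distinction can be sketched numerically. With hypothetical summary statistics from two very large groups, the p-level comes out tiny even though the effect size (Cohen's d) is trivial:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics: a tiny improvement measured in
# two very large groups (e.g., one vs. two aspirin tablets)
n, sd = 10_000, 1.0
mean_one, mean_two = 4.00, 4.06   # hypothetical pain-reduction scores

se = sqrt(2 * sd ** 2 / n)
z = (mean_two - mean_one) / se
p = 2 * (1 - NormalDist().cdf(abs(z)))
d = (mean_two - mean_one) / sd    # Cohen's d

print(p < 0.001)    # True: highly "significant"
print(round(d, 2))  # 0.06 -- a trivially small effect
```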
Practice 1: Exploring Research Statistics
Roll over each term about statistics to see a brief description in the context of interpreting research results.
population
sample
mean
standard deviation
correlation coefficient (r)
descriptive statistics
inferential statistics
statistical significance
null hypothesis
p-level
null hypothesis significance testing (NHST)
confidence interval
effect size
a group of people (or animals) whose behavior is of interest to researchers; from this group, one or more samples are selected for measurement
a group of people (or animals) whose behavior is measured; this group is drawn from a larger population, and the sample results are usually generalized to the population
a measure of central tendency calculated by adding all scores and then dividing by the number of scores
a measure of variability, indicating how tightly the scores are clustered around the mean
a statistic that indicates the precise numerical relationship between two variables; r can range from -1.0 to +1.0
numbers calculated from a distribution of scores, indicating the central tendency (average) and the variability (amount of scatter around the average)
numbers calculated from a distribution of scores to provide evidence supporting or opposing a hypothesis
whether a research result differs sufficiently from what would be expected from chance alone, due to random variations in behavior
a statistical assumption about the absence of an effect (no difference between two values)
probability of finding a difference that is equal to or greater than what was actually measured, assuming that the null hypothesis is true
way of evaluating results by comparing the observed outcome to what would be expected if the null hypothesis is true
a range of scores calculated such that there is a specific probability (usually .95) that the value of interest actually falls within that range
a way of measuring the strength of a result, yielding a number that indicates the magnitude of the difference between two values; not affected by sample size
Practice 2: Interpreting Research Results
Roll over each statement to see whether the conclusion is accurate or unjustified.
If a statistical test yields a p-level of .12, researchers using the null hypothesis significance testing (NHST) approach would accept the result as a statistically significant effect.
If a statistical test yields a p-level of .02, researchers using the null hypothesis significance testing (NHST) approach could conclude that the results are statistically significant, because there is only a 2 percent probability that the outcome was due to chance.
After the mean performance for two different groups has been calculated, if the 95 percent confidence intervals for those two means do not overlap, researchers could conclude that the difference between the groups was probably a reliable effect—a real difference.
FALSE. Researchers who use NHST would look for a p-level of less than .05 before they would claim statistical significance.
FALSE. Researchers who use NHST would indeed conclude that the results were statistically significant, but the stated reason misinterprets the p-level: it is the probability of getting a difference that large (or larger) if the null hypothesis were true, not the probability that the outcome was due to chance.
TRUE. Researchers who use confidence intervals would be delighted to have no overlap, because that would indicate that the means of the two groups were reliably different.
Quiz 1
Match the terms to their descriptions by dragging each colored circle to the appropriate gray circle. When all the circles have been placed, select the CHECK ANSWER button.
Quiz 2
For each statement, select one of the buttons to indicate whether the statement is True or False. When responses have been chosen for all the statements, select the CHECK ANSWER button.
Statement | True | False
---|---|---
Researchers consider a result to be statistically significant if there is a low probability that the outcome was due to chance. | |
A p-level of less than .05 indicates that the probability of the outcome being due to chance is less than 5 percent. | |
Many researchers do not believe that null hypothesis significance testing (NHST) is an appropriate way to evaluate research outcomes. | |
If the mean score for Group A is at least 2 or 3 points higher than the mean of Group B, we can conclude that the difference between the groups is statistically significant. | |