9 Inference for Categorical Data

SECTION 9.1 Summary

The null hypothesis for $r \times c$ tables of count data is that there is no relationship between the row variable and the column variable.
Expected cell counts under the null hypothesis are computed using the formula
$\text{expected count} = \frac{\text{row total} \times \text{column total}}{n}$
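As a minimal sketch of this formula, the following Python function computes every expected count from the row and column totals. The function name `expected_counts` and the $2 \times 3$ table of counts are hypothetical, chosen only for illustration.

```python
def expected_counts(observed):
    """Expected cell counts under H0 for a two-way table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)  # total number of observations in the table
    # expected count = (row total x column total) / n, cell by cell
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical counts (illustrative data, not from the text).
observed = [[20, 30, 50],
            [30, 30, 40]]
expected = expected_counts(observed)
for row in expected:
    print(row)
```

Because each expected count is built from the margins, the expected table has the same row and column totals as the observed table.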
The null hypothesis is tested by the chi-square statistic, which compares the observed counts with the expected counts:
$X^{2} = \sum \frac{(\text{observed} - \text{expected})^{2}}{\text{expected}}$
Under the null hypothesis, $X^{2}$ has approximately the chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom. The P-value for the test is
$P(\chi^{2} \geq X^{2})$

where $\chi^{2}$ is a random variable having the $\chi^{2}(df)$ distribution with $df = (r - 1)(c - 1)$.
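The statistic, its degrees of freedom, and the P-value can be sketched in Python using only the standard library. The function names and the table of counts are hypothetical. The upper-tail probability uses the standard finite-series forms of the chi-square tail for integer degrees of freedom; in practice, statistical software would normally supply this probability.

```python
import math

def chi_square_statistic(observed):
    """Return (X2, df) for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df

def chi_square_upper_tail(x, df):
    """P(chi-square >= x) for a positive integer df, via finite series."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal-tail term plus a finite correction series.
    root = math.sqrt(x)
    total, term = 0.0, root
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(root / math.sqrt(2))
            + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total)

# Hypothetical counts (illustrative data, not from the text).
observed = [[20, 30, 50],
            [30, 30, 40]]
x2, df = chi_square_statistic(observed)
p_value = chi_square_upper_tail(x2, df)
print(f"X2 = {x2:.3f}, df = {df}, P-value = {p_value:.4f}")
```

Large values of $X^{2}$ indicate observed counts far from what the null hypothesis predicts, which is why the P-value is an upper-tail probability.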
The chi-square approximation is adequate for practical use when the average expected cell count is 5 or greater and all individual expected counts are 1 or greater. The exception is $2 \times 2$ tables, where all four expected counts should be 5 or greater.
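These guidelines translate directly into a small check on the table of expected counts. The sketch below assumes the expected counts have already been computed; the function name `approximation_is_adequate` and the example tables are hypothetical.

```python
def approximation_is_adequate(expected):
    """Apply the expected-count guidelines to a table of expected counts."""
    cells = [e for row in expected for e in row]
    n_rows, n_cols = len(expected), len(expected[0])
    if n_rows == 2 and n_cols == 2:
        # 2 x 2 tables: all four expected counts should be 5 or greater.
        return all(e >= 5 for e in cells)
    # Larger tables: average expected count at least 5, and every cell at least 1.
    average = sum(cells) / len(cells)
    return average >= 5 and all(e >= 1 for e in cells)

# Hypothetical expected counts (illustrative only).
print(approximation_is_adequate([[25.0, 30.0, 45.0], [25.0, 30.0, 45.0]]))
print(approximation_is_adequate([[6.0, 4.0], [10.0, 10.0]]))
```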
To analyze a two-way table, first compute percents or proportions that describe the relationship between the row and column variables. Then calculate expected counts, the chi-square statistic, and the P-value.
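The descriptive first step can be sketched as follows: for each column, compute the percent of that column's observations falling in each row. The function name `column_percents` and the counts are hypothetical, and treating the columns as the groups to compare is an assumption; row percents would be computed analogously.

```python
def column_percents(observed):
    """Percent of each column total that falls in each row."""
    col_totals = [sum(col) for col in zip(*observed)]
    return [[100 * count / total for count, total in zip(row, col_totals)]
            for row in observed]

# Hypothetical counts (illustrative data, not from the text).
observed = [[20, 30, 50],
            [30, 30, 40]]
percents = column_percents(observed)
for row in percents:
    print([round(p, 1) for p in row])
```

If the percents within a row are similar across columns, the data show little evidence of a relationship; large differences suggest that the formal test is worth examining closely.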
Two different models for generating $r \times c$ tables lead to the chi-square test. In the first model, independent SRSs are drawn from each of $c$ populations, and each observation is classified according to a categorical variable with $r$ possible values. The null hypothesis is that the distributions of the row categorical variable are the same for all $c$ populations. In the second model, a single SRS is drawn from a population, and observations are classified according to two categorical variables having $r$ and $c$ possible values. In this model, $H_{0}$ states that the row and column variables are independent.
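The difference between the two models shows up in which totals are fixed in advance. The following simulation is a sketch under stated assumptions: the function names, sample sizes, probabilities, and seed are all hypothetical, and both functions generate data for which the null hypothesis is true.

```python
import random

def table_from_separate_samples(sample_sizes, row_probs, rng):
    """Model 1: one independent sample per column population.
    Under H0, every population shares the same row distribution."""
    n_rows = len(row_probs)
    table = [[0] * len(sample_sizes) for _ in range(n_rows)]
    for j, size in enumerate(sample_sizes):
        for i in rng.choices(range(n_rows), weights=row_probs, k=size):
            table[i][j] += 1
    return table

def table_from_single_sample(n, row_probs, col_probs, rng):
    """Model 2: one sample of size n classified on two variables.
    Under H0, the row and column variables are independent."""
    n_rows, n_cols = len(row_probs), len(col_probs)
    table = [[0] * n_cols for _ in range(n_rows)]
    rows = rng.choices(range(n_rows), weights=row_probs, k=n)
    cols = rng.choices(range(n_cols), weights=col_probs, k=n)
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table

rng = random.Random(2024)  # arbitrary seed so the run is reproducible
model1 = table_from_separate_samples([50, 60, 90], [0.4, 0.6], rng)
model2 = table_from_single_sample(200, [0.4, 0.6], [0.25, 0.30, 0.45], rng)
print("Model 1 column totals:", [sum(col) for col in zip(*model1)])
print("Model 2 column totals:", [sum(col) for col in zip(*model2)])
```

In the first model the column totals are fixed by the sampling design, while in the second only the overall total $n$ is fixed and the column totals vary from sample to sample. The same chi-square statistic applies to a table produced by either model.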