e-Statistics

Goodness of Fit

In the pea-breeding experiment, Mendel's theory predicts the probabilities of occurrence for each type of progeny, say ``round yellow'', ``wrinkled yellow'', ``round green'', and ``wrinkled green.'' Here we want to test whether the data from $ n$ observations are consistent with his theory; this is a test of goodness of fit.

The model probabilities

$\displaystyle p_1,\ldots,p_k
$

are specified (usually in the column Probability or Percentage) for $ k$ categories or ``cells.'' Each of the $ n$ observations is classified into exactly one of the $ k$ cells, and the expected cell frequencies

$\displaystyle E_1, \ldots, E_k
$

are calculated from the model probabilities by

$\displaystyle E_i = n \times p_i,
i = 1,\ldots,k.
$
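As a worked illustration, suppose the model is the 9:3:3:1 ratio usually attributed to Mendel's dihybrid cross, with $ n = 556$ plants observed. Both figures are assumptions made for this example and are not given above. Then

$\displaystyle E_1 = 556 \times \frac{9}{16} = 312.75, \quad
E_2 = E_3 = 556 \times \frac{3}{16} = 104.25, \quad
E_4 = 556 \times \frac{1}{16} = 34.75.
$

The expected frequencies need not be whole numbers, but they always add up to $ n$ because the model probabilities add up to 1.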

The observed cell frequencies

$\displaystyle X_1, \ldots, X_k
$

sum to the total sample size, $ n = X_1 + \cdots + X_k$. The goodness of fit to the model can then be assessed by comparing the observed cell frequencies with the expected cell frequencies. The null hypothesis is ``the model is valid.'' The discrepancy between the data and the model can be measured by Pearson's chi-square statistic

$ \chi^2 = \displaystyle\sum_{i=1}^k \frac{(X_i - E_i)^2}{E_i}.$
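The statistic is simple to compute by hand or in a few lines of code. The following is a minimal Python sketch, not part of the original page. The observed counts are the figures commonly quoted for Mendel's dihybrid cross and the 9:3:3:1 model probabilities are likewise an assumption for illustration; the function name `pearson_chi_square` is hypothetical.

```python
# Observed counts X_1, ..., X_k in the order: round yellow, wrinkled yellow,
# round green, wrinkled green. These are the figures commonly quoted for
# Mendel's experiment (an assumption; they do not appear in the text above).
observed = [315, 101, 108, 32]

# Model probabilities p_1, ..., p_k (assumed 9:3:3:1 ratio).
model_probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]


def pearson_chi_square(observed, probs):
    """Return (chi-square statistic, expected frequencies E_i = n * p_i)."""
    n = sum(observed)                      # n = X_1 + ... + X_k
    expected = [n * p for p in probs]      # E_i = n * p_i
    stat = sum((x - e) ** 2 / e for x, e in zip(observed, expected))
    return stat, expected


chi2, expected = pearson_chi_square(observed, model_probs)
print("expected frequencies:", expected)
print("chi-square statistic:", round(chi2, 4))
```

With these assumed counts the statistic comes out at roughly 0.47, a small discrepancy relative to the number of cells.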

Under the null hypothesis (that is, assuming that the model probabilities are correct), the distribution of Pearson's chi-square $ \chi^2$ is approximated, for large $ n$, by the chi-square distribution with $ df = k-1$ degrees of freedom. Therefore, if we observe $ \chi^2 = x$ and $ x > \chi^2_{\alpha,df}$, where $ \chi^2_{\alpha,df}$ is the upper-$ \alpha$ critical value of that distribution, we reject the null hypothesis at significance level $ \alpha$, casting doubt on the validity of the model. Or equivalently, by computing the $ p$-value

$ p^* = P(Z > \chi^2)$

with a random variable $ Z$ having the chi-square distribution with $ (k-1)$ degrees of freedom, we reject the null hypothesis if $ p^* < \alpha$.
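The $ p$-value rule can be sketched in Python using only the standard library. This is an illustrative sketch, not the page's own calculator. The helper `chi_square_sf` is a hypothetical name. It evaluates the upper tail probability for integer degrees of freedom through the recurrence $ Q(a+1, y) = Q(a, y) + y^a e^{-y} / \Gamma(a+1)$ for the regularized upper incomplete gamma function, starting from $ Q(1/2, y) = \mathrm{erfc}(\sqrt{y})$ or $ Q(1, y) = e^{-y}$. The statistic value 0.470, $ k = 4$, and $ \alpha = 0.05$ are assumed inputs chosen for illustration.

```python
import math


def chi_square_sf(x, df):
    """Upper tail P(Z > x) for Z chi-square distributed with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y) = erfc(sqrt(y))
    # Step the shape parameter a up to df / 2 using
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


chi2_stat = 0.470   # assumed observed value of the statistic (illustrative)
k = 4               # number of cells (assumed)
alpha = 0.05        # significance level (assumed)

p_value = chi_square_sf(chi2_stat, k - 1)
reject = p_value < alpha
print("p-value:", round(p_value, 4))
print("reject the null hypothesis:", reject)
```

With these assumed inputs the $ p$-value is far above $ \alpha$, so the null hypothesis is not rejected: the data give no reason to doubt the model.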