Analysis of Variance

The one-way layout model specifies the population mean $ \mu_i = \mu + \alpha_i$ for each factor level $ i = 1,\ldots,k$. Here the parameter $ \mu$ is the overall average, and $ \alpha_i$ is known as the $ i$-th factor effect. The hypothesis testing problem of detecting ``some effect'' of the factor levels becomes

$\displaystyle H_0:\: \alpha_1 = \cdots = \alpha_k = 0$    versus $\displaystyle \quad
H_A:\: \alpha_i \neq 0$    for some $ i$.

Equivalently we can write the hypothesis testing problem as follows:

$\displaystyle H_0:\: \mu_1 = \cdots = \mu_k$    versus $\displaystyle \quad
H_A:\: \mu_i \neq \mu_j$    for some $ i \neq j$.

The data from $ k$ groups are arranged either (a) in a single variable, with a second categorical variable indicating the factor level of each observation, or (b) in multiple columns, each of which holds the observations for one factor level. The statistical inference begins with the calculation of the sample mean $ \displaystyle
\bar{X}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}$ within each group $ i = 1,\ldots,k$, which is the point estimate of $ \mu_i$. It is also useful to obtain the sample standard deviation within each group, that is, the square root of $ \displaystyle\frac{1}{n_i-1} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\cdot})^2$.
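The two data layouts and the within-group summaries can be sketched in Python as follows. This is a minimal illustration using only the standard library; the observations and the level names `A`, `B`, `C` are made up for the example and do not come from any real data set.

```python
from collections import defaultdict
from statistics import mean, stdev

# Layout (a): one response variable plus a categorical variable of factor levels.
# These numbers are hypothetical and serve only as an illustration.
values = [4, 5, 6, 6, 7, 8, 7, 8, 9, 10]
levels = ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C"]

# Layout (b): one column (here, one list) per factor level.
groups = defaultdict(list)
for x, level in zip(values, levels):
    groups[level].append(x)

# Within-group sample mean (point estimate of mu_i) and
# sample standard deviation (statistics.stdev uses the divisor n_i - 1).
group_means = {level: mean(xs) for level, xs in groups.items()}
group_sds = {level: stdev(xs) for level, xs in groups.items()}

for level, xs in groups.items():
    print(level, len(xs), group_means[level], round(group_sds[level], 4))
```

Converting layout (a) into layout (b) first keeps the later per-group sums simple; either layout carries the same information.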

We then compute the analysis of variance table (AOV table), which summarizes the degrees of freedom (df), the sums of squares (SS), and the mean squares (MS).

  1. $ \displaystyle
SS_{\mbox{group}} = \sum_{i=1}^k
n_i (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$ is the sum of squares between groups, having $ df_{\mbox{group}} = k - 1$ degrees of freedom. Thus, the mean square is given by

    $\displaystyle MS_{\mbox{group}} = \displaystyle\frac{SS_{\mbox{group}}}{k-1}$

  2. $ \displaystyle
SS_{\mbox{error}} = \sum_{i=1}^k \: \sum_{j=1}^{n_i}
(X_{ij} - \bar{X}_{i\cdot})^2$ is the sum of squares within groups, having $ df_{\mbox{error}} = n - k$ degrees of freedom. Thus, the mean square is given by

    $\displaystyle MS_{\mbox{error}} = \displaystyle\frac{SS_{\mbox{error}}}{n-k}$

  3. $ \displaystyle
SS_{\mbox{total}} = \sum_{i=1}^k \: \sum_{j=1}^{n_i}
(X_{ij} - \bar{X}_{\cdot\cdot})^2$ is the total sum of squares, having $ df_{\mbox{total}} = n - 1$ degrees of freedom. It can be decomposed into

    $\displaystyle SS_{\mbox{total}} = SS_{\mbox{group}} + SS_{\mbox{error}}$


$\displaystyle \bar{X}_{\cdot\cdot}
= \frac{1}{n} \sum_{i=1}^k \: \sum_{j=1}^{n_i} X_{ij}
= \frac{1}{n} \sum_{i=1}^k n_i \bar{X}_{i\cdot}$

is the overall sample mean with the total sample size $ n = n_1 + \cdots + n_k$, and represents the point estimate of $ \mu$. The statistical model assumes that (i) all groups share the same variance $ \sigma^2$, and (ii) the observations $ X_{ij} \sim N(\mu_i, \sigma^2)$, $ j = 1,\ldots,n_i$, are independent normal random variables for each level $ i = 1,\ldots,k$. Under these assumptions the mean square $ MS_{\mbox{error}}$ within groups is the mean square error (MSE), and becomes the point estimate of $ \sigma^2$.
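As a rough sketch of how the AOV table entries could be computed, the following Python function evaluates the three sums of squares, their degrees of freedom, and the two mean squares directly from the formulas above. The function name `aov_table` and the small data set are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
from statistics import mean

# Hypothetical data in layout (b): one list of observations per factor level.
groups = {"A": [4, 5, 6], "B": [6, 7, 8, 7], "C": [8, 9, 10]}


def aov_table(groups):
    """Return the one-way AOV quantities for a dict mapping level -> observations."""
    k = len(groups)                                   # number of factor levels
    n = sum(len(xs) for xs in groups.values())        # total sample size
    grand_mean = sum(sum(xs) for xs in groups.values()) / n

    # Between-group, within-group, and total sums of squares.
    ss_group = sum(len(xs) * (mean(xs) - grand_mean) ** 2 for xs in groups.values())
    ss_error = sum((x - mean(xs)) ** 2 for xs in groups.values() for x in xs)
    ss_total = sum((x - grand_mean) ** 2 for xs in groups.values() for x in xs)

    return {
        "grand_mean": grand_mean,
        "df_group": k - 1, "df_error": n - k, "df_total": n - 1,
        "ss_group": ss_group, "ss_error": ss_error, "ss_total": ss_total,
        "ms_group": ss_group / (k - 1),
        "ms_error": ss_error / (n - k),   # point estimate of sigma^2
    }


table = aov_table(groups)
for name, value in table.items():
    print(f"{name:>10}: {value}")
```

For this made-up data the group means are 5, 7, and 9 with grand mean 7, so the decomposition reads 30 = 24 + 6, which gives a quick check that the total sum of squares equals the sum of the other two.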

Under the null hypothesis $ H_0$ the test statistic

$\displaystyle F = \frac{MS_{\mbox{group}}}{MS_{\mbox{error}}}$

has the $ F$-distribution with $ (k-1, n-k)$ degrees of freedom. By $ F_{\alpha,m_1,m_2}$ we denote the critical point of the $ F$-distribution with $ (m_1,m_2)$ degrees of freedom satisfying $ P(X > F_{\alpha,m_1,m_2}) = \alpha$, where $ X$ is an $ F$-distributed random variable. In the hypothesis testing problem for the one-way layout we reject $ H_0$ at significance level $ \alpha$ when the observed value $ F = x$ satisfies $ x > F_{\alpha,k-1,n-k}$. Equivalently, we can compute the p-value $ p^* = P(X > x)$ and reject $ H_0$ when $ p^* < \alpha$.
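The two equivalent decision rules can be illustrated with SciPy, assuming it is installed. In this sketch `stats.f.ppf` gives the critical point and `stats.f.sf` gives the upper-tail probability; the raw data, the summary values, and the choice $ \alpha = 0.05$ are hypothetical. `stats.f_oneway` is included only as a cross-check of the hand-computed statistic.

```python
from scipy import stats

# Hypothetical data: k = 3 factor levels, n = 10 observations in total.
a, b, c = [4, 5, 6], [6, 7, 8, 7], [8, 9, 10]
k, n = 3, 10

# Mean squares for this data, worked out by hand from the formulas in the text:
# SS_group = 24 with k - 1 = 2 df, SS_error = 6 with n - k = 7 df.
ms_group = 24.0 / (k - 1)
ms_error = 6.0 / (n - k)

alpha = 0.05                      # significance level (an arbitrary choice here)
f_obs = ms_group / ms_error       # observed value x of the test statistic F

# Critical point F_{alpha, k-1, n-k}: the upper-alpha quantile of the F-distribution.
critical = stats.f.ppf(1 - alpha, k - 1, n - k)

# p-value p* = P(X > x), the survival function evaluated at the observed value.
p_value = stats.f.sf(f_obs, k - 1, n - k)

reject_by_critical = f_obs > critical
reject_by_p_value = p_value < alpha

# Cross-check against SciPy's built-in one-way ANOVA.
check = stats.f_oneway(a, b, c)

print(f"F = {f_obs:.4f}, critical point = {critical:.4f}, p-value = {p_value:.6f}")
print("reject H0:", reject_by_critical, reject_by_p_value)
print(f"f_oneway: F = {check.statistic:.4f}, p-value = {check.pvalue:.6f}")
```

Both rules always lead to the same decision, because $ x > F_{\alpha,k-1,n-k}$ holds exactly when $ P(X > x) < \alpha$.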