Summary Statistics

Summary statistics are used when a variable takes numerical values

$\displaystyle X_1, X_2, \ldots, X_n$

such as 0.223 or 152.7. Here $ n$ denotes the sample size. The summary lists the measures of central location (mean and median) and the measures of dispersion (standard deviation, quartiles).

  1. Mean. The sample mean

    $\displaystyle \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$

    is the ``average'' value of the observations. Extreme values, often regarded as outliers, strongly affect the sample mean, so ``trimming'' (discarding a fixed share of the smallest and largest observations before averaging) can be used to reduce their influence.

  2. Median. The sample median is the value of the ``middle'' data point when the observations are sorted. When the sample size $ n$ is odd, the median is simply the middle value; for example, the median of ``2, 4, 7'' is 4. When $ n$ is even, the median is the mean of the two middle values; thus the median of ``2, 4, 7, 12'' is (4+7)/2 = 5.5. The sample median is less affected by extreme measurements than the mean.

  3. Standard deviation (S.D.). The sample variance

    $\displaystyle S^2 = \frac{(X_1-\bar{X})^2 + (X_2-\bar{X})^2 + \cdots + (X_n-\bar{X})^2}{n-1}$

    is, up to the divisor $ n-1$ in place of $ n$, the average squared deviation of the observed values from the sample mean $ \bar{X}$. The calculation is often carried out in the simpler form

    $\displaystyle S^2 = \frac{1}{n-1} \left( X_1^2 + X_2^2 + \cdots + X_n^2 - n \bar{X}^2 \right).$

    The square root $ S = \sqrt{S^2}$ of the sample variance is referred to as the sample standard deviation; it indicates the ``scatteredness'' of the data. When the sample distribution is symmetric and unimodal, the following ``empirical rule'' applies: approximating the distribution by a normal density function, about 68%, 95%, and 99.7% of the data fall within the intervals $ (\bar{X}-S,\bar{X}+S)$, $ (\bar{X}-2S,\bar{X}+2S)$, and $ (\bar{X}-3S,\bar{X}+3S)$, respectively.

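    The coefficient of variation (CV)

    CV$ = \displaystyle\frac{S}{\vert\bar{X}\vert}$

    expresses the standard deviation relative to the size of the mean. Because it is unit-free, it can be used to compare variability between data recorded in different units of measurement.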

  4. Lower and upper quartiles. The 25th sample percentile is the value below which 25% of the observations fall. The 50th percentile, the 75th percentile, and so on are defined similarly. Note that the 50th percentile is the median. We call the 25th percentile the lower quartile and the 75th percentile the upper quartile. The interquartile range (IQR) is then defined as the difference

       IQR $\displaystyle =$   (Upper quartile) $\displaystyle -$   (Lower quartile)

    between them.
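The four items above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure any particular package uses: the function name `summarize` is hypothetical, the data are the numbers ``2, 4, 7, 12'' from the median example, and the quartiles follow the default (``exclusive'') convention of the standard library's `statistics.quantiles`, which may differ from the quartiles reported by other software. The sketch assumes $ n \geq 2$ and a nonzero mean.

```python
import math
import statistics


def summarize(data):
    """Summary statistics for a list of numbers (a sketch, not a reference)."""
    n = len(data)
    mean = sum(data) / n
    # Sample variance by definition: squared deviations divided by n - 1.
    var_def = sum((x - mean) ** 2 for x in data) / (n - 1)
    # The simpler computational form: (sum of squares - n * mean^2) / (n - 1).
    var_short = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)
    sd = math.sqrt(var_def)
    # Quartile conventions differ between packages; this sketch assumes the
    # default ("exclusive") method of statistics.quantiles.
    lower, _, upper = statistics.quantiles(data, n=4)
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(data),
        "variance": var_def,
        "variance_shortcut": var_short,
        "sd": sd,
        "cv": sd / abs(mean),
        "lower_quartile": lower,
        "upper_quartile": upper,
        "iqr": upper - lower,
    }


stats = summarize([2, 4, 7, 12])
for name, value in stats.items():
    print(f"{name}: {value}")
```

Both variance entries agree for these data. In floating-point arithmetic the simpler form can lose accuracy when the values are large relative to their spread, so the definitional form is the safer choice in code.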

The data file may consist of either a single column or multiple columns. Each column is identified by the variable name at the top, followed by the sample data. The sample size $ n$ may vary from variable to variable if a column contains blank entries (NA's).
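To see how the sample size can differ between variables, the following sketch reads a small multi-column file and counts the usable entries per column. Everything specific here is an assumption made for illustration: the comma-separated layout, the variable names `height` and `weight`, the made-up values, and the helper name `read_columns`. Blank cells and the literal text `NA` are both treated as missing.

```python
import csv
import io

# Hypothetical data file: variable names in the first row, data below.
RAW = """height,weight
152.7,48.1
160.2,NA
149.5,51.3
155.0,
"""


def read_columns(text):
    """Map each variable name to its list of non-missing numeric values."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            cell = (cell or "").strip()
            if cell and cell != "NA":  # skip blank entries and NA's
                columns[name].append(float(cell))
    return columns


columns = read_columns(RAW)
# The sample size n is counted separately for each variable.
sizes = {name: len(values) for name, values in columns.items()}
print(sizes)
```

Here `height` keeps all four observations while `weight` keeps only two, so any summary statistic for `weight` is computed with $ n = 2$.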