Summary Statistics
Summary statistics are used when a variable takes numerical values
such as 0.223 or 152.7. Here the sample consists of $n$ observations $X_1, \ldots, X_n$.
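- Mean.
The sample mean
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
is the ``average'' value of the observations $X_1, \ldots, X_n$. Extreme values, often regarded as outliers, can strongly affect the sample mean, so ``trimming'' (discarding a fixed proportion of the smallest and largest observations before averaging) can be considered to reduce their influence.
- Median.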
The sample median is the value of the ``middle'' data point
when the observations are arranged in increasing order.
When the sample size $n$
is an odd number, the median is simply the middle value;
for example, the median of ``2, 4, and 7'' is 4.
When the sample size $n$ is an even number,
the median is the mean of the two middle values.
Thus, the median of the numbers ``2, 4, 7, 12'' is (4+7)/2 = 5.5.
The sample median is known to be
less affected by extreme measurements than the mean.
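- Standard deviation (S.D.).
The sample variance
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
is the average squared deviation of each observed value $X_i$ from the sample mean $\bar{X}$
(with divisor $n-1$ rather than $n$).
The calculation is often carried out in the simpler form
\[ S^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right). \]
The square root $S = \sqrt{S^2}$
of the sample variance
is referred to as the sample standard deviation,
indicating the ``scatteredness'' of the data.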
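When the shape of the sample distribution is symmetric and unimodal,
the following rule, known as the ``empirical rule,'' applies:
when the distribution is approximated by a normal density function,
about 68%, 95%, and 99.7% of the data fall within the
intervals
$(\bar{X} - S, \bar{X} + S)$,
$(\bar{X} - 2S, \bar{X} + 2S)$,
and
$(\bar{X} - 3S, \bar{X} + 3S)$,
respectively, where $\bar{X}$ is the sample mean and $S$ is the sample standard deviation.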
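The coefficient of variation (CV)
\[ \mathrm{CV} = \frac{S}{\bar{X}}, \]
the ratio of the sample standard deviation $S$ to the sample mean $\bar{X}$,
can be used to compare variability between data measured in different units.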
- Lower and upper quartiles.
The 25th sample percentile is the value
below which
25% of the observations fall.
Similarly, we can define the 50th percentile, the 75th percentile, and so on.
Note that the 50th percentile is the median.
We call the 25th percentile the lower quartile
and the 75th percentile the upper quartile.
The interquartile range (IQR)
is then defined as the difference between them:
\[ \mathrm{IQR} = (\mbox{Upper quartile}) - (\mbox{Lower quartile}). \]
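The statistics listed above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the helper names `summarize` and `trimmed_mean` are hypothetical, the variance is assumed to use the divisor $n-1$, and the quartiles follow the default convention of `statistics.quantiles`, which may differ slightly from other software.

```python
import math
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping a proportion of the sorted values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number dropped from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


def summarize(values):
    """Compute the summary statistics listed above for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    # Sample variance with divisor n - 1 (assumed convention), term by term.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    # The same quantity via the shortcut form: sum of squares minus n * mean^2.
    shortcut = (sum(x * x for x in values) - n * mean ** 2) / (n - 1)
    sd = math.sqrt(variance)
    # Quartile conventions differ between packages; this uses the default
    # ("exclusive") method of statistics.quantiles.
    lower, middle, upper = statistics.quantiles(values, n=4)
    # Fraction of observations within one S.D. of the mean (empirical rule check).
    within_1sd = sum(1 for x in values if abs(x - mean) < sd) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "variance": variance,
        "variance_shortcut": shortcut,
        "sd": sd,
        "cv": sd / mean,
        "lower_quartile": lower,
        "upper_quartile": upper,
        "iqr": upper - lower,
        "within_1sd": within_1sd,
    }


data = [2, 4, 7, 12]       # the even-sized example from the text
stats = summarize(data)
print(stats["median"])     # 5.5, matching (4+7)/2
print(stats["mean"])       # 6.25

# One extreme value pulls the mean far more than the median or trimmed mean.
with_outlier = [2, 4, 7, 12, 500]
print(statistics.mean(with_outlier),
      statistics.median(with_outlier),
      trimmed_mean(with_outlier, 0.2))
```

The two variance entries agree up to rounding, which illustrates that the simpler form is only a rearrangement of the definition. The last line shows why the median and a trimmed mean are preferred when outliers are present.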
The data file may consist of either a single column or multiple columns.
Each column is identified by the variable name at the top,
followed by the sample data.
The sample size $n$
may vary with the choice of variable
if a column contains blank entries (NA's), since such entries are not counted as observations.
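How the sample size can differ from one variable to another may be illustrated as follows. The sketch assumes a comma-separated file, which the text does not specify; the column names `height` and `weight`, the data values, and the helper `read_columns` are all hypothetical.

```python
import csv
import io

# Hypothetical file contents: variable names at the top, then the data.
# The second column has one "NA" entry and one blank entry.
SAMPLE = """height,weight
152.7,0.223
160.1,NA
149.8,0.310
155.0,
"""


def read_columns(text):
    """Return {variable name: list of numeric values}, skipping blanks and NA's."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            cell = (cell or "").strip()
            if cell and cell != "NA":
                columns[name].append(float(cell))
    return columns


columns = read_columns(SAMPLE)
# The sample size is counted separately for each variable.
sizes = {name: len(values) for name, values in columns.items()}
print(sizes)
```

Here the first variable keeps all four observations while the second keeps only two, so any summary statistic for the second variable is based on a smaller sample.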