Statistics/Summary/Variance

Measure of Scale
When describing data it is helpful (and in some cases necessary) to determine the spread of a distribution. One way of measuring this spread is by calculating the variance or the standard deviation of the data.

In describing a complete population, the data represent all the elements of the population. As a measure of the "spread" in the population, one wants to know how far the data typically lie from the population mean. There are several ways to do so. One is to measure the average absolute value of the deviations. Another, called the variance, measures the average square of these deviations.

A clear distinction should be made between dealing with the population or with a sample from it. When dealing with the complete population the (population) variance is a constant, a parameter which helps to describe the population. When dealing with a sample from the population the (sample) variance is actually a random variable, whose value differs from sample to sample. Its value is only of interest as an estimate for the population variance.

Population variance and standard deviation
Let the population consist of the N elements $$x_1, \ldots, x_N$$. The (population) mean is:


 * $$\mu = \frac 1N \sum_{i=1}^Nx_i$$.

The (population) variance $$\sigma^2$$ is the average of the squared deviations from the mean, where each squared deviation $$(x_i - \mu)^2$$ is the square of a value's distance from the distribution's mean.


 * $$\sigma^2 = \frac 1N \sum_{i=1}^N (x_i - \mu)^2$$.

Because of the squaring, the variance is expressed in squared units and so is not directly comparable with the mean or the data themselves. The square root of the variance is called the standard deviation $$\sigma$$. Note that $$\sigma$$ is the root mean square of the differences between the data points and the mean.
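The two population formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the small data set are made up for the example and are treated here as a complete population.

```python
import math

def population_variance(values):
    """Average squared deviation from the population mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))

# Hypothetical population of N = 8 values with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0
```

The standard deviation (2.0) is in the same units as the data, which is why it is easier to compare with the mean than the variance (4.0).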

Sample variance and standard deviation
Let the sample consist of the n elements $$x_1, \ldots, x_n$$, taken from the population. The (sample) mean is:


 * $$\bar{x} = \frac 1n \sum_{i=1}^nx_i$$.

The sample mean serves as an estimate for the population mean $$\mu$$.

The (sample) variance $$s^2$$ is a kind of average of the squared deviations from the (sample) mean, in which the sum is divided by n - 1 rather than n:


 * $$s^2 = \frac 1{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$$.

As with the population, we take the square root to obtain the (sample) standard deviation $$s$$.
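As a rough sketch, the sample formulas can be written out directly and checked against Python's standard `statistics` module, whose `variance` function also divides by n - 1. The helper name and the data are assumptions made for this example.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical sample of n = 8 observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(sample)   # 32 / 7
s = math.sqrt(s2)              # sample standard deviation

# The hand-written version should agree with the standard library.
print(math.isclose(s2, statistics.variance(sample)))  # True
print(math.isclose(s, statistics.stdev(sample)))      # True
```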

A common question at this point is "why do we square the deviations?" One answer is: to get rid of the negative signs. Values fall both above and below the mean and, since the variance is meant to measure distance, it would be counterproductive if the positive and negative deviations cancelled each other out.
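The cancellation is easy to see numerically. The short sketch below (with made-up data) compares the plain signed deviations with their absolute values and their squares.

```python
# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)

signed = [x - mean for x in data]

# Signed deviations cancel, so their sum says nothing about spread.
print(sum(signed))                               # 0.0

# Absolute values or squares remove the signs before averaging.
print(sum(abs(d) for d in signed) / len(data))   # 1.5
print(sum(d * d for d in signed) / len(data))    # 4.0
```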

Example
When rolling a fair die, the population consists of the 6 possible outcomes 1 to 6. A sample may consist instead of the outcomes of 1000 rolls of the die.

The population mean is:


 * $$\mu = \frac 16 (1+2+3+4+5+6) = 3.5$$,

and the population variance:


 * $$\sigma^2=\frac 16 \sum_{i=1}^6 (i-3.5)^2=\frac 16(6.25+2.25+0.25+0.25+2.25+6.25)=\frac {35}{12} \approx 2.917$$

The population standard deviation is:
 * $$\sigma = \sqrt{\frac {35}{12}} \approx 1.708$$.

Notice that this standard deviation lies between the smallest and largest possible absolute deviations from the mean (0.5 and 2.5).

So if we were working with one six-sided die, X = {1, 2, 3, 4, 5, 6}, then $$\sigma^2 \approx 2.917$$. We will talk more about why the population and sample formulas differ later on, but for the moment assume that you should use the equation for the sample variance unless you see something that would indicate otherwise.
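The die calculation can be reproduced exactly with Python's `fractions` module, and a simulated sample of 1000 rolls shows the sample variance acting as an estimate of the population value. The seed and variable names are arbitrary choices for this sketch, and the simulated figure will differ from run to run if the seed is changed.

```python
import math
import random
import statistics
from fractions import Fraction

# Population: the six equally likely faces of a fair die.
faces = range(1, 7)
mu = Fraction(sum(faces), 6)
var = sum((Fraction(i) - mu) ** 2 for i in faces) / 6
sd = math.sqrt(var)

print(mu)            # 7/2
print(var)           # 35/12
print(round(sd, 3))  # 1.708

# Sample: 1000 simulated rolls (seed fixed only for repeatability).
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(1000)]
s2 = statistics.variance(rolls)  # divides by n - 1
print(s2)  # an estimate, expected to be near 35/12 but not equal to it
```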

Note that none of the above formulae is ideal for computing the estimate in practice, since each can introduce rounding errors when evaluated in finite-precision arithmetic. Specialized statistical software packages use more complicated algorithms that take a second pass over the data in order to correct for these errors. Therefore, if it matters that your estimate of standard deviation is accurate, specialized software should be used. If you are using non-specialized software, such as some popular spreadsheet packages, you should find out how the software does the calculations and not just assume that a sophisticated algorithm has been implemented.
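To give a feel for the kind of rounding error involved, the sketch below compares two ways of computing the sample variance in floating point. The first is a one-pass shortcut (sum of squares minus the squared sum over n), which is not given in the text above and is included here only as an assumed example of a fragile method. The second makes a second pass over the data and subtracts a small correction term, which is one common form of a corrected two-pass approach; it is not claimed to be what any particular package implements.

```python
def naive_sample_variance(xs):
    """One-pass shortcut: (sum of x^2 - (sum of x)^2 / n) / (n - 1).

    Subtracting two huge, nearly equal numbers loses most of the digits.
    """
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

def two_pass_sample_variance(xs):
    """First pass: the mean. Second pass: squared deviations from it.

    The correction term would be exactly zero in exact arithmetic; in
    floating point it compensates for rounding error in the mean.
    """
    n = len(xs)
    mean = sum(xs) / n
    deviations = [x - mean for x in xs]
    correction = sum(deviations) ** 2 / n
    return (sum(d * d for d in deviations) - correction) / (n - 1)

# Hypothetical data: small spread sitting on a very large offset.
# The deviations from the mean are -6, -3, 3, 6, so the true value is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print(two_pass_sample_variance(data))  # 30.0
print(naive_sample_variance(data))     # badly wrong on this data
```

The data were chosen to make the problem obvious. With values near zero both functions would agree closely, which is why the issue is easy to overlook.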

For Normal Distributions
The empirical rule states that approximately 68 percent of the data in a normally distributed dataset is contained within one standard deviation of the mean, approximately 95 percent of the data is contained within 2 standard deviations, and approximately 99.7 percent of the data falls within 3 standard deviations.

As an example, the verbal or math portion of the SAT has a mean of 500 and a standard deviation of 100. This means that about 68% of test-takers scored between 400 and 600, about 95% scored between 300 and 700, and about 99.7% scored between 200 and 800, assuming a completely normal distribution (which isn't quite the case, but it makes a good approximation).
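The percentages in the empirical rule can be checked from the normal distribution itself. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)); the sketch below evaluates this with the standard library and repeats the first figure using the SAT numbers from the example. The function name is an assumption for this illustration.

```python
import math
from statistics import NormalDist

def fraction_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# Same idea with the SAT figures: mean 500, standard deviation 100.
sat = NormalDist(mu=500, sigma=100)
share_400_600 = sat.cdf(600) - sat.cdf(400)
print(round(share_400_600, 4))  # 0.6827
```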

Robust Estimators
For a normal distribution the relationship between the standard deviation and the interquartile range is roughly: SD = IQR/1.35.
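The factor 1.35 is the interquartile range of a standard normal distribution, which can be recovered from its quartiles. The sketch below computes the factor and uses it to estimate the standard deviation of simulated normal data. It assumes Python 3.8 or later for `statistics.NormalDist` and `statistics.quantiles`; the helper name, sample size and seed are arbitrary.

```python
from statistics import NormalDist, quantiles

# IQR of the standard normal: distance between its 25th and 75th percentiles.
std_normal = NormalDist()
iqr_factor = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)
print(round(iqr_factor, 3))  # 1.349

def sd_from_iqr(values):
    """Estimate the SD as IQR / 1.349, assuming roughly normal data."""
    q1, _median, q3 = quantiles(values, n=4)
    return (q3 - q1) / iqr_factor

# Simulated normal data with a true standard deviation of 2.
simulated = NormalDist(mu=0, sigma=2).samples(10_000, seed=1)
estimate = sd_from_iqr(simulated)
print(estimate)  # expected to be close to 2
```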

For data that are non-normal, the standard deviation can be a terrible estimator of scale. For example, in the presence of a single outlier, the standard deviation can grossly overestimate the variability of the data. The result is that confidence intervals are too wide and hypothesis tests lack power. In some (or most) fields, it is uncommon for data to be normally distributed and outliers are common.
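A small made-up example shows how strongly one outlier can inflate the standard deviation. The data values are hypothetical.

```python
import statistics

# Hypothetical measurements clustered around 10.
clean = [10, 11, 9, 10, 12, 8, 10]
# The same measurements plus one wild value.
with_outlier = clean + [100]

sd_clean = statistics.stdev(clean)
sd_outlier = statistics.stdev(with_outlier)

print(round(sd_clean, 2))    # 1.29
print(round(sd_outlier, 2))  # 31.84
```

Seven of the eight values lie within 2 of the median, yet the standard deviation of the contaminated data is more than twenty times that of the clean data.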

One robust estimator of scale is the "average absolute deviation", or aad. As the name implies, the mean of the absolute deviations about some estimate of location is used. This method of estimation of scale has the advantage that the contribution of outliers is not squared, as it is in the standard deviation, and therefore outliers contribute less to the estimate. This method has the disadvantage that a single large outlier can completely overwhelm the estimate of scale and give a misleading description of the spread of the data.

Another robust estimator of scale is the "median absolute deviation", or mad. As the name implies, the estimate is calculated as the median of the absolute deviations from an estimate of location. Often, the median of the data is used as the estimate of location, but it is not necessary that this be so. Note that if the data are non-normal, the mean is unlikely to be a good estimate of location.

It is necessary to scale both of these estimators in order for them to be comparable with the standard deviation when the data are normally distributed. It is typical for the terms aad and mad to be used to refer to the scaled version. The unscaled versions are rarely used.
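The two estimators and their scaling can be sketched as follows. The scale factors are the usual ones for normally distributed data: sqrt(pi/2), about 1.2533, for the average absolute deviation, and 1 divided by the 75th percentile of the standard normal, about 1.4826, for the median absolute deviation. The function names, the default choice of the median as the estimate of location, and the data are assumptions made for this example.

```python
import math
import statistics
from statistics import NormalDist

AAD_SCALE = math.sqrt(math.pi / 2)           # about 1.2533
MAD_SCALE = 1 / NormalDist().inv_cdf(0.75)   # about 1.4826

def scaled_aad(values, center=None):
    """Mean absolute deviation about `center`, scaled to match the SD."""
    c = statistics.median(values) if center is None else center
    deviations = [abs(x - c) for x in values]
    return AAD_SCALE * statistics.mean(deviations)

def scaled_mad(values, center=None):
    """Median absolute deviation about `center`, scaled to match the SD."""
    c = statistics.median(values) if center is None else center
    deviations = [abs(x - c) for x in values]
    return MAD_SCALE * statistics.median(deviations)

# Hypothetical data: values near 10 plus a single large outlier.
data = [8, 9, 10, 10, 10, 11, 12, 100]

print(round(statistics.stdev(data), 2))  # 31.84
print(round(scaled_aad(data), 2))        # 15.04
print(round(scaled_mad(data), 2))        # 1.48
```

On this data the outlier still dominates the average absolute deviation, though less than it dominates the standard deviation, while the median absolute deviation is almost unaffected.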