R Programming/Descriptive Statistics

In this section, we present descriptive statistics, i.e. a set of tools to describe and explore data. This mainly includes univariate and bivariate statistical tools.

Generic Functions
We introduce some functions to describe a dataset.


 * <tt>names</tt> gives the name of each variable.
 * <tt>str</tt> gives the structure of the dataset.
 * <tt>summary</tt> gives the mean, median, min, max, 1st and 3rd quartile of each variable in the data.
 * <tt>describe</tt> (Hmisc package) gives more details than <tt>summary</tt>.
 * <tt>contents</tt> (Hmisc package)
 * <tt>dims</tt> in the Zelig package.
 * <tt>descr</tt> in the descr package gives min, max, mean and quartiles for continuous variables, frequency tables for factors and length for character vectors.
 * <tt>whatis</tt> (YaleToolkit) gives a good description of a dataset.
 * <tt>detail</tt> in the SciencesPo package gives a broad range of statistics for continuous variables, frequency tables for factors and length for character vectors.
 * <tt>describe</tt> in the psych package also provides summary statistics.
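
As a minimal sketch, the base R functions above can be tried on a small data frame. The data frame `df` and its values are made up for illustration and are not part of the original text.

```r
# Small illustrative data frame (made-up values)
df <- data.frame(
  age    = c(23, 35, 41, 29, 52),
  income = c(1800, 2500, 3200, 2100, 4000),
  group  = factor(c("a", "b", "a", "b", "a"))
)

names(df)    # variable names
str(df)      # type and first values of each variable
summary(df)  # min, quartiles, median, mean, max (counts for factors)
```

The package functions in the list (`describe`, `contents`, `descr` and so on) are called in the same way once the corresponding package is loaded.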

Moments

 * the mean : <tt>mean</tt>.
 * the variance : <tt>var</tt>.
 * the standard deviation : <tt>sd</tt>.
 * the skewness : <tt>skewness</tt> (fUtilities, moments or e1071).
 * the kurtosis : <tt>kurtosis</tt> (fUtilities, moments or e1071).
 * all the moments : <tt>moment</tt> and <tt>all.moments</tt> (moments).
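
The following sketch computes the first moments of a simulated sample with base R only. The helper `central_moment` is a hypothetical name introduced here; it uses the plain moment-based definitions (denominator n), whereas the package functions listed above may apply different corrections.

```r
set.seed(1)        # arbitrary seed, for reproducibility
x <- rexp(1000)    # right-skewed sample

m <- mean(x)
v <- var(x)        # sample variance (denominator n - 1)
s <- sd(x)         # square root of var(x)

# Hypothetical helper: k-th central moment with denominator n
central_moment <- function(x, k) mean((x - mean(x))^k)

skew <- central_moment(x, 3) / central_moment(x, 2)^1.5
kurt <- central_moment(x, 4) / central_moment(x, 2)^2
c(mean = m, var = v, sd = s, skewness = skew, kurtosis = kurt)
```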

Order statistics

 * the range, the minimum and the maximum : <tt>range</tt> returns the minimum and the maximum of a vector, <tt>min</tt> the minimum and <tt>max</tt> the maximum.
 * <tt>IQR</tt> computes the interquartile range. <tt>median</tt> computes the median and <tt>mad</tt> the median absolute deviation.
 * <tt>quantile</tt>, <tt>hdquantile</tt> in the Hmisc package and <tt>kuantile</tt> in the quantreg package compute the sample quantiles of a continuous vector. <tt>kuantile</tt> may be more efficient when the sample size is large.
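
A short illustration of the base R order statistics on a toy vector (the vector `1:9` is chosen here only because the results are easy to check by hand):

```r
x <- 1:9

range(x)                     # 1 9
median(x)                    # 5
quantile(x, c(0.25, 0.75))   # 3 and 7 (default quantile type 7)
IQR(x)                       # 4
mad(x)                       # 2.9652, i.e. 1.4826 * median(|x - median(x)|)
```

Note that `mad` rescales the raw median absolute deviation by the constant 1.4826 by default, so that it is comparable to the standard deviation for normal data.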

Inequality Index

 * The Gini coefficient : <tt>Gini</tt> (ineq) and <tt>gini</tt> (reldist).
 * <tt>ineq</tt> (ineq) computes several inequality indices, selected with its <tt>type</tt> argument.
 * Concentration index : <tt>conc</tt> (ineq)
 * Poverty index : <tt>pov</tt> (ineq)
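
To make the Gini coefficient concrete, here is a base R sketch of the usual sample formula without any small-sample correction. The function name `gini_by_hand` is hypothetical; the package functions listed above may offer corrections or weights that this sketch does not.

```r
# Hypothetical helper: uncorrected sample Gini coefficient
# G = sum_i (2i - n - 1) x_(i) / (n * sum(x)), with x sorted increasingly
gini_by_hand <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

gini_by_hand(c(1, 1, 1, 1))   # 0: perfect equality
gini_by_hand(c(0, 0, 0, 1))   # 0.75: one unit holds everything
```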

Plotting the distribution
We can plot the distribution using a box plot (<tt>boxplot</tt>), a histogram (<tt>hist</tt>), a kernel density estimate (<tt>plot</tt> with <tt>density</tt>) or the empirical cumulative distribution function (<tt>plot</tt> with <tt>ecdf</tt>). See the Nonparametric section to learn more about histograms and kernel density estimators. <tt>qqnorm</tt> produces a normal QQ plot and <tt>qqline</tt> adds a line to the QQ plot which passes through the first and the third quartile.


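 * A box plot is a graphical representation of the minimum, the first quartile, the median, the third quartile and the maximum. By default, <tt>boxplot</tt> draws points lying more than 1.5 times the interquartile range away from the box separately, as outliers.
 * <tt>stripchart</tt> and <tt>stem</tt> are also available.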
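
The plotting functions above can be tried on a simulated sample. This is only a sketch: the sample, the seed and the plot titles are arbitrary choices made here.

```r
set.seed(1)
x <- rnorm(200)   # simulated sample

boxplot(x, main = "Box plot")
h <- hist(x, main = "Histogram")             # h also stores breaks and counts
plot(density(x), main = "Kernel density estimate")
Fn <- ecdf(x)                                # Fn is a function: Fn(q) = share of x <= q
plot(Fn, main = "Empirical CDF")
qqnorm(x)                                    # normal QQ plot
qqline(x)                                    # line through the 1st and 3rd quartiles
stripchart(x, method = "jitter")
stem(x)                                      # text stem-and-leaf display
```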

Goodness of fit tests
Kolmogorov-Smirnov test :

The KS test is a one-sample goodness-of-fit test, available in R as <tt>ks.test</tt>. The test statistic is the maximum absolute difference between the empirical cumulative distribution function and the theoretical cumulative distribution function. <tt>KSd</tt> (sfsmisc) gives the critical values for the KS statistic. As an example, we draw a sample from a Beta(2,2) distribution and test whether it fits a Beta(2,2), a Beta(1,1) and a uniform distribution (the last two coincide on [0, 1]).
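
The example described above can be sketched as follows. The sample size, the seed and the variable names are assumptions made here; the last lines recompute the statistic by hand to show that it is the largest gap between the empirical and the theoretical distribution functions.

```r
set.seed(1)                # arbitrary seed
y <- rbeta(1000, 2, 2)     # sample from a Beta(2,2)

fit_true <- ks.test(y, "pbeta", 2, 2)   # the distribution the data came from
fit_b11  <- ks.test(y, "pbeta", 1, 1)   # Beta(1,1) ...
fit_unif <- ks.test(y, "punif")         # ... is the uniform on [0, 1]

fit_true$p.value
fit_unif$p.value

# KS statistic by hand: the ECDF jumps from (i - 1)/n to i/n at the
# i-th sorted observation, so both sides of each jump are checked
n  <- length(y)
Fy <- pbeta(sort(y), 2, 2)
D  <- max(seq_len(n) / n - Fy, Fy - (seq_len(n) - 1) / n)
D
```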

Some tests are specific to the normal distribution. The Lilliefors test is an extension of the KS test for the case where the parameters are unknown. It is implemented by <tt>lillie.test</tt> in the nortest package. <tt>shapiro.test</tt> implements the Shapiro-Wilk normality test.
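
As an illustration, the Shapiro-Wilk test from base R can be applied to a normal and to a clearly non-normal sample. The samples and variable names below are made up for this sketch; <tt>lillie.test</tt> is called in the same way once nortest is loaded.

```r
set.seed(1)
x_norm <- rnorm(200)   # normal sample
x_exp  <- rexp(200)    # strongly skewed sample

sw_norm <- shapiro.test(x_norm)
sw_exp  <- shapiro.test(x_exp)

sw_norm$statistic; sw_norm$p.value
sw_exp$statistic;  sw_exp$p.value    # small p-value: normality is rejected
```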

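 * Anderson-Darling test : <tt>ad.test</tt> (nortest). See also the package ADGofTest for another version of this test.
 * Shapiro-Francia normality test : <tt>sf.test</tt> (nortest)
 * Pearson chi-square normality test : <tt>pearson.test</tt> (nortest)
 * Cramer-von Mises normality test : <tt>cvm.test</tt> (nortest)
 * Jarque-Bera test : <tt>jarque.bera.test</tt> (tseries)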

Discrete variable
We generate a discrete variable using <tt>sample</tt> and we tabulate it using <tt>table</tt>. We can plot using a pie chart (<tt>pie</tt>), a bar chart (<tt>barplot</tt> or <tt>barchart</tt> (lattice)) or a dot chart (<tt>dotchart</tt> or <tt>dotplot</tt> (lattice)).


 * <tt>freq</tt> (descr) prints the frequencies and percentages and produces a bar plot. It supports weights.
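
A minimal sketch of the steps above with base R only. The categories, probabilities and sample size are arbitrary choices made for illustration.

```r
set.seed(1)
x <- sample(c("A", "B", "C"), size = 100, replace = TRUE,
            prob = c(0.5, 0.3, 0.2))

tab <- table(x)     # frequency table
tab
prop.table(tab)     # relative frequencies

pie(tab)
barplot(tab)
dotchart(as.numeric(tab), labels = names(tab))
```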

Continuous variables

 * Covariance : <tt>cov</tt>.
 * Pearson's linear correlation : <tt>cor</tt>.
 * Pearson's correlation test : <tt>cor.test</tt>.
 * Spearman's rank correlation : <tt>cor</tt> with <tt>method = "spearman"</tt>, or <tt>spearman</tt> (Hmisc).
 * Spearman's rank correlation test : <tt>spearman2</tt> and <tt>spearman.test</tt> (Hmisc); <tt>spearman.test</tt> (pspearman package) performs the test with a precomputed exact null distribution for n <= 22.
 * Kendall's correlation : <tt>cor</tt> with <tt>method = "kendall"</tt>. See also the Kendall package.
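
The difference between the three correlation measures shows up on a monotone but nonlinear relationship. The toy data below are chosen here for illustration: the rank-based measures equal 1, while Pearson's coefficient is below 1.

```r
x <- 1:10
y <- x^2            # increasing but not linear in x

cov(x, y)
r_pearson  <- cor(x, y)                        # below 1
r_spearman <- cor(x, y, method = "spearman")   # 1: ranks agree perfectly
r_kendall  <- cor(x, y, method = "kendall")    # 1: all pairs are concordant

ct <- cor.test(x, y)    # Pearson's correlation test
ct$estimate
ct$p.value
```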

Discrete variables

 * <tt>table</tt>, <tt>xtabs</tt> and <tt>prop.table</tt> for contingency tables. <tt>ftable</tt> (stats package) for a flat (nested) table.
 * <tt>assocplot</tt> and <tt>mosaicplot</tt> for graphical display of contingency tables.
 * <tt>CrossTable</tt> (descr) is similar to SAS Proc Freq. It returns a contingency table with Chi square and Fisher independence tests.
 * <tt>my.table.NA</tt> and <tt>my.table.margin</tt> (cwhmisc)
 * <tt>chisq.detail</tt> (TeachingDemos)
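
A small sketch of the base R contingency-table functions. The data frame `d` and its two variables are made up for illustration.

```r
# Illustrative data: two categorical variables
d <- data.frame(
  sex    = c("F", "F", "M", "M", "F", "M", "F", "M"),
  smoker = c("no", "yes", "no", "no", "no", "yes", "yes", "no")
)

tab <- table(d$sex, d$smoker)             # two-way contingency table
xt  <- xtabs(~ sex + smoker, data = d)    # same table via a formula

tab
prop.table(tab, margin = 1)   # row proportions (each row sums to 1)
ftable(xt)                    # flat display
mosaicplot(xt)                # graphical display
```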

Discrete and Continuous variables

 * <tt>bystats</tt> (Hmisc) computes statistics by category.
 * <tt>summaryBy</tt> (doBy)
 * Multiple box plots : <tt>plot</tt> or <tt>boxplot</tt>
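
As a sketch of the same idea without extra packages, the base R functions <tt>tapply</tt> and <tt>aggregate</tt> (not mentioned above, used here as stand-ins for <tt>bystats</tt> and <tt>summaryBy</tt>) compute statistics by group, and the formula interface of <tt>boxplot</tt> draws one box per group. The data are made up.

```r
# Illustrative data: a continuous variable y observed in two groups
d <- data.frame(
  g = factor(rep(c("a", "b"), each = 4)),
  y = c(1, 2, 3, 4, 11, 12, 13, 14)
)

means <- tapply(d$y, d$g, mean)            # a: 2.5, b: 12.5
means
aggregate(y ~ g, data = d, FUN = median)   # one row per group

boxplot(y ~ g, data = d)                   # one box plot per group
```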


 * Equality of two sample means : <tt>t.test</tt> and <tt>wilcox.test</tt>; equality of variances : <tt>var.test</tt>; equality of two distributions : <tt>ks.test</tt>.
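
The four tests can be compared on two simulated samples that differ in mean but not in variance. The samples, seed and variable names are assumptions made for this sketch.

```r
set.seed(1)
a <- rnorm(100, mean = 0, sd = 1)
b <- rnorm(100, mean = 1, sd = 1)   # shifted mean, same variance

tt <- t.test(a, b)        # Welch two-sample t test by default
wt <- wilcox.test(a, b)   # rank-based test of a location shift
vt <- var.test(a, b)      # F test for the ratio of variances
kt <- ks.test(a, b)       # two-sample Kolmogorov-Smirnov test

c(t = tt$p.value, wilcoxon = wt$p.value, variance = vt$p.value, ks = kt$p.value)
```

Note that <tt>t.test</tt> does not assume equal variances unless <tt>var.equal = TRUE</tt> is set.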