Data Mining Algorithms In R/Clustering/Expectation Maximization

Introduction
In statistics an optimization problem is used to maximize or minimize a function, and its variables in a specific space. As these optimization problems may assume many different types, each one with its own characteristics, many techniques have been developed to solve some of these.

One of these techniques is the Maximum Likelihood and its main goal is to adjust a statistic model with specific data set, estimating its unknown parameters so the function can describe all the parameters in the dataset. In other words, the method will adjust some variables of a statistical model from a dataset or a known distribution, so the model can “describe” each data sample and estimate others. This technique is very important in data mining and knowledge discovery area as it can be used as basis for more complex and powerful methods.

The Expectation-Maximization method is one of the methods developed from Maximum Likelihood, trying to estimate the likelihood in problems which some variables are unobserved. This method was first documented in 1977 by, although the technique was informally proposed in literature, as suggested by the author. So, this work was its first formalization in the literature.

The function can be calculated approximating the parameters in an iterative process, which steps called Expectation (E-step) and Maximization (M-step) are executed to find the unobserved variables and re-estimate the parameters for the maximum likehood, respectively. The next section will describe the method with more details, presenting an algorithm with all steps of the calculation of the function.

A Brief Overview
The expectation-maximization algorithm described in is an unsupervised clustering method which doesn’t require a training step based on a density mixture. It uses a sub-optimal iterative approach to find the probability distribution parameters to the attributes with maximum likelihood.

The algorithm’s input are the data set (x), the total number of clusters (M), the tolerance threshold (E) and the maximum number of iterations. For each iteration, the E-step (expectation) e executed at first to estimate each class probability, followed by the M-step (maximization) to estimate the parameters array for the probability distribution of each class for the next iteration. The method will end when the distribution parameters falls into the tolerance threshold or the maximum number of iterations is reached.

Initialization
Each cluster j of M contains a parameters array(tetha), which has a mean (uu) and a covariance matrix (Sigmaj), in order to represent the Gaussian probability distribution details. This distribution is used to bound the elements of dataset x.

$$ \theta\ (t) = \{ \mu_{j},\sum_{j} (t) \} j = 1..M  $$

In the first iteration, the method generates random initial values to mean and covariance, in order to approximate the parameters array (tetha) to its real distribution.

E-Step
The main goal of the E-step is to estimate the probability of each element belong the each class $$(P (C_{j}|xk))$$.

$$ P(C_{j}|x) = \frac{\left | \sum_{j}(t)\right \vert ^{-\frac{1}{2}} * exp^{n_{j}*P_{j}(t)}}{\sum_{k=1}^M \left | \sum_{j}(t)\right \vert ^{-\frac{1}{2}} * exp^{\sigma_{k}*P_{k}(t)} } $$

Packages
The expectation-maximization in algorithm in R, proposed in , will use the package mclust. This package contains crucial methods for the execution of the clustering algorithm, including functions for the E-step and M-step calculation. The package manual explains all of its functions, including simple examples. This manual can be found in.

Executing the Algoritm
The function “em” can be used for the expectation-maximization method, as it implements the method for parameterized Gaussian Mixture Models (GMM), starting in the E-step. This function uses the following parameters:


 * model-name: the name of the model used;
 * data: all the collected data, which must be all numerical. If the data format is represent by a matrix, the rows will represent the samples (observations) and the columns the variables;
 * parameters: model parameters, which can assume the following values: pro, mean, variance and Vinv, corresponding to the mixture proportion for the components of mixture, mean of each component, parameters variance and the estimate hypervolume of the data region, respectively.
 * others: less relevant parameters which wont be described here. More details can be found in the package manual.

After the execution, the function will return:


 * model-name: the name of the model;
 * z: a matrix whose the element in position [I,k] presents the conditional probability of the ith sample belongs to the kth mixture component;
 * parameters: same as the input;
 * others: other metrics which wont be discussed here. More details can be found in the package manual.

A simple example
In order to demonstrate how to use the R to execute the expectation-Maximization method, the following algorithm presents a simple example for a test dataset. This example can also be found in the package manual.

> modelName = ``EEE'' > data = iris[,-5] > z = unmap(iris[,5]) > msEst <- mstep(modelName, data, z) > names(msEst) > modelName = msEst$modelName > parameters = msEst$parameters > em(modelName, data, parameters)

The first line executes the M-step so the parameters used in the em function can be generated. This function is called mstep and its inputs are model name, as “EEE”, the dataset, as flower study dataset and finally, the z matrix, which contains the conditional probability of each class contains each data sample. This z matrix is generated by the unmap function.

After the M-step, the algorithm will show (line 2) the attributes of the object returned by this function. The third line will start the clustering process using some of the result of the M-step method as input.

The clustering method will return the parameters estimated in the process and the conditional probability of each sample falls in each class. These parameters include mean and variance, and this last one corresponds to the use of the mclustVariance method.

Visualization
After the execution of the clustering algorithm, a chart can be plotted with the results. In R, this visualization can be done by using the coorProj function. The following algorithm shows how to use it:

> data = iris[,5] > dimens = c(2,4) > what = "errors" > parameters = msEst$parameters > truth = iris[,5] > coordProj( data, dimens, what, parameters, z, truth)