Data Mining Algorithms In R/Clustering/Partitioning Around Medoids (PAM)

= Introduction =

Clustering is an unsupervised machine learning algorithm that groups entities, from a dataset, that have high degree of similarity in the same cluster. Nowadays lots of areas are using these kinds of algorithms to separate datasets into groups in an automated way, and still have a good quality result.

The clustering process is not a universal process because there are many groups of datasets, for some of them the kind of metric used is relevant, for others the entities that represent each cluster are more interesting. Like dataset groups there are many clustering algorithms each one tries to take advantage of data type, so this way each one of them is more suited for a specific kind of data.

This section will explain a little more about the Partitioning Around Medoids (PAM) Algorithm, showing how the algorithm works, what are its parameters and what they mean, an example of a dataset, how to execute the algorithm, and the result of that execution with the dataset as input.

= The Partitioning Around Medoids (PAM) Algorithm =

Algorithm
The PAM algorithm was developed by Leonard Kaufman and Peter J. Rousseeuw, and this algorithm is very similar to K-means, mostly because both are partitional algorithms, in other words, both break the dataset into groups (clusters), and both work by trying to minimize the error, but PAM works with Medoids, that are an entity of the dataset that represent the group in which it is inserted, and K-means works with Centroids, that are artificially created entity that represent its cluster.

The PAM algorithm partitions the dataset of n objects into k clusters, where both the dataset and the number k is an input of the algorithm. This algorithm works with a matrix of dissimilarity, whose goal is to minimize the overall dissimilarity between the representants of each cluster and its members. The algorithm uses the following model to solve the problem: $$ F(x) = \text{minimize}\sum_{i=1}^n\sum_{j=1}^n d(i, j)z_{ij} $$

Subject to:

$$ \begin{array}{lll} \sum_{i=1}^n z_{ij} = 1, j=1,2,..,n\\ z_{ij} \leq y_{i}, i,j = 1,2,..,n\\ \sum_{i=1}^n y_{i} = k, k = \text{number of clusters}\\ y_{i}, z_{ij} \in \{0,1\}, i,j = 1,2,..,n \end{array} $$

Where F(x) is the main function to minimize, d(i,j) is the dissimilarity measurement between the entities i and j, and zij is a variable that ensures that only the dissimilarity between entities from the same cluster will be computed in the main function. The others expressions are constraints that have the following functions: (1.) ensures that every single entity is assigned to one cluster and only one cluster, (2.) ensures that the entity is assigned to its medoid that represents the cluster, (3.) ensures that there are exactly k clusters and (4.) lets the decision variables assume just the values of 0 or 1.

The PAM algorithm can work over two kinds of input, the first is the matrix representing every entity and the values of its variables, and the second is the dissimilarity matrix directly, in the latter the user can provide the dissimilarity directly as an input to the algorithm, instead of the data matrix representing the entities. Either way the algorithm reaches a solution to the problem, in a general analysis the algorithm proceeds this way:

Build phase:
 * 1. Choose k entities to become the medoids, or in case these entities were provided use them as the medoids;
 * 2. Calculate the dissimilarity matrix if it was not informed;
 * 3. Assign every entity to its closest medoid;

Swap phase:
 * 4. For each cluster search if any of the entities of the cluster lower the average dissimilarity coefficient, if it does select the entity that lowers this coefficient the most as the medoid for this cluster;
 * 5. If at least one medoid has changed go to (3), else end the algorithm.

Implementation
THIS IS NOT DESCRIBING THE "PAM" ALGORITHM. This is a k-means variant using medoids, but not PAM with its charateristic BUILD and SWAP phases.

The pseudocode of PAM algorithm is shown below:

 Algorithm 1: PAM Algorithm  Input: E = {e1,e2,...en} (dataset to be clustered or matrix of dissimilarity)
 * k (number of clusters)
 * metric (kind of metric to use on dissimilarity matrix)
 * diss (flag indicating that E is the matrix of dissimilarity or not)

Output: M = {m1,m2,...,mk} (vector of clusters medoids)
 * L = {l(e) | e = 1,2,...,n} (set of cluster labels of E)

foreach mi € M do
 * mi ← ej € E; (e.g. random selection)

end if diss ≠ true
 * Dissimilarity ← CalculateDissimilarityMatrix(E, metric);

else
 * Dissimilarity ← E;

end repeat
 * foreach ei € E do
 * l(ei) ← argminDissimilarity(ei, Dissimilarity, M);
 * end
 * changed ← false;
 * foreach mi € M do
 * Mtmp ← SelectBestClusterMedoids(E, Dissimilarity, L);
 * end
 * if Mtmp ≠ M
 * M ← Mtmp;
 * changed ← true;
 * end

until changed = true;

In the R programming language, the PAM algorithm is available in the cluster package and can be called by the following command:

pam(x, k, diss, metric, medoids, stand, cluster.only, do.swap, keep.diss, keep.data, trace.lev)

Where the parameters are:

x: numerical data matrix representing the dataset entities, or can be the dissimilarity matrix, it depends on the value of the diss parameter. In case x is a data matrix each row is an entity and each column is an variable, and in this case missing values are allowed as long as every pair of entities has at least one case not missing. In case x is a dissimilarity matrix it is not allowed to have missing values.

k: number of clusters that the dataset will be partitioned where 0 < k < n, where n is the number of entities.

diss: logical flag, if it is TRUE x is used as the dissimilarity matrix, if it is FALSE, then x will be considered as a data matrix.

metric: an string specifying each of the two metrics will be used to calculate the dissimilarity matrix, the metric variable can be “euclidean” to use the Euclidean distance, or can be “manhattan” to use the Manhattan distance.

stand: logical flag, if it is TRUE then the measurements in x will be standardized before calculating the dissimilarities. Measurements are standardized for each column, by subtracting the column's mean value and dividing by the variable's mean absolute deviation. If x is a dissimilarity matrix then this parameter is ignored.

cluster.only: logical flag, if it is TRUE, only the clustering will be computed and returned.

do.swap: logical flag, indicates if the swap phase should happen (TRUE) or not (FALSE).

keep.diss: logical flag indicating if the dissimilarities should (TRUE) or not (FALSE) be kept in the result.

keep.data: logical flag indicating if the input data x should (TRUE) or not (FALSE) be kept in the result.

trace.lev: an numeric parameters specifying a trace level for printing diagnostics during the build and swap phase of the algorithm. Default 0 does not print anything.

The PAM algorithm return a pam object that contains the information about the result of the execution of the algorithm.

Visualization
In R there are two ways of seeing the result of the PAM algorithm, the first one is to print the object that the algorithm returns, and the second one is to plot the data from the object creating a graphic of the result. The first way of visualizing the information is a bit more complicated to understand but it gives a more complete and accurate information, but the second way is a lot more easy to understand and lets the user have a better view of the information and to add information that would be relevant for him.

To view the data of the result of the execution of PAM algorithm in a textual way there are two ways one more simple that gives a more summarized information about the object, and another one that gives you a more complete information about it. In the two commands listed below the first one prints the information in a summarized way, while the second one prints it in a more complete way.

print (result) summary (result)

The other way of visualizing the data from the result of the execution of the algorithm is using graphics and that can be done by using the following command:

plot (result)

Example: To show an example of use of the algorithm and a result from its execution a simple dataset with few entities and few dimension was used, as it is shown in the following table:

Table 1: Simple dataset

As we can see the data is separated in two clusters, so we will use an k = 2. The PAM algorithm can be executed as follows:

x <- read.table(“table.txt”) result <- pam(x, 2, FALSE, "euclidean") summary(result) plot(result$data, col = result$clustering) points(result$medoids, col = 1:2, pch = 4)
 * 1) load the table from a file
 * 1) execute the pam algorithm with the dataset created for the example
 * 1) print the results data in the screen
 * 1) plot a graphic showing the clusters and the medoids of each cluster

Printing the result form the execution gives you:

Medoids: ID x y 4  4  2 2 6 6 11 5 Clustering vector: 1 2 3 4 5 6 7 8 9 1 1 1 1 2 2 2 2 2  Objective function: build    swap 1.255618 0.915849 Numerical information per cluster: size max_diss  av_diss diameter separation [1,]   4 1.414214 0.8535534 2.236068   8.062258 [2,]    5 1.414214 0.9656854 2.236068   8.062258 Isolated clusters: L-clusters: character(0) L*-clusters: [1] 1 2 Silhouette plot information: cluster neighbor sil_width 3      1        2 0.8898942 4       1        2 0.8788422 1       1        2 0.8549629 2       1        2 0.8297000 6       2        1 0.8790384 9       2        1 0.8631441 8       2        1 0.8425790 7       2        1 0.8232848 5       2        1 0.7747713 Average silhouette width per cluster: [1] 0.8633498 0.8365635 Average silhouette width of total data set: [1] 0.8484685 36 dissimilarities, summarized : Min. 1st Qu. Median   Mean 3rd Qu. Max. 1.0000 1.4142  8.3951  6.1559  9.9362 11.7050  Metric :  euclidean Number of objects : 9 Available components: [1] "medoids"   "id.med"     "clustering" "objective"  "isolation"  "clusinfo" "silinfo"   "diss"       "call" [10] "data"

While plotting gives you:



Case Study
In this section we will see a case study using PAM.

Scenario
In this case study a part of the database iris available in the R package datasets was used. This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Using the data that this dataset provides us, it is natural to think of verifying if the flowers of each one of three species of iris are really similar to the others from the same species, so in this case study length and width of both petal and sepal will be used to cluster the dataset into 3 groups and then verify if the clusters really match with the flowers species.

The dataset that was used into this case study consist of the following columns:


 * Flower: An id of the flower;
 * Sepal.Length: A numeric value of the length of the sepal in centimeters;
 * Sepal.Width: A numeric value of the width of the sepal in centimeters;
 * Petal.Length: A numeric value of the length of the petal in centimeters;
 * Petal.Width: A numeric value of the width of the petal in centimeters;
 * Species: A text identifying the species of the flower.

Input Data
The input data is a table consisting of 50% (75 entities) of the original iris dataset that have 150 flowers and 5 attributes each. So the dataset used in this case study is represented by the following table:

Table 2: Sample from iris dataset

Execution
The process was done as follows:

data <- read.table(“sampleiris.txt”) result <- pam(data[1:4], 3, FALSE, “euclidean”) summary(result) plot (data, col = result$clustering) points(result$medoids, col = 1:3, pch = 4)
 * 1) import data
 * 1) execution
 * 1) print results
 * 1) plot clusters
 * 1) add the medoids to the plot

Output
The following data was printed as result of the execution:

Medoids: ID Sepal.Length Sepal.Width Petal.Length Petal.Width 8   8          5.0         3.4          1.5         0.2 64  39          6.1         2.9          4.7         1.4 103 53          7.1         3.0          5.9         2.1 Clustering vector: 1  2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22   23  24  25  51  52  53  54  55  56    1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1    1   1   1   2   2   2   2   2   2   57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75 101 102 103  104 105 106 107 108 109 110 111 112    2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   3   2   3    3   3   3   2   3   3   3   2   2  113 114 115 116 117 118 119 120 121 122 123 124 125    3   2   2   3   3   3   3   2   3   2   3   2   3  Objective function: build     swap 0.7148339 0.6990539 Numerical information per cluster: size max_diss  av_diss diameter separation [1,]  25 1.236932 0.5137400 2.042058  1.9000000 [2,]   34 1.951922 0.8085343 2.727636  0.3741657 [3,]   16 1.284523 0.7559609 2.147091  0.3741657  Isolated clusters: L-clusters: [1] 1 L*-clusters: character(0) Silhouette plot information: cluster neighbor  sil_width 1        1        2  0.84941732 5         1        2  0.84830238 8         1        2  0.84812593 18        1        2  0.84784555 12        1        2  0.83221128 22        1        2  0.82890349 20        1        2  0.82456328 3         1        2  0.82337894 7         1        2  0.81910409 10        1        2  0.81662688 11        1        2  0.80769429 2         1        2  0.80592613 13        1        2  0.80278163 4         1        2  0.79810574 23        1        2  0.79482977 24        1        2  0.78999596 17        1        2  0.78539723 21        1        2  0.78454015 25        1        2  0.77452963 6         1        2  0.75995941 9         1        2  0.74605493 14        1        2  0.74277337 19        1        2  0.72082914 15        1        2  0.71581750 16        1        2  0.66155611 68        2        3  0.60036142 56        2        3  0.59753885 62        2        3  0.59698924 72        2        3  0.59691421 70        2        3  0.59514179 54        2        3  0.58507022 67        2        3  0.56989428 60        2        1  0.56350914 63        2        3  0.55592514 75        2        3  0.54720666 74        2        3  0.53971473 64        2        3  0.53757677 69        2        3  0.51098390 65        2        1  0.50762488 107       2        3  0.48295375 55        2        3  0.46851074 52        2        3  0.46827948 59        2        3  0.44164146 66        2        3  0.42147865 71        2        3  0.41421605 73        2        3  0.41282512 122       2        3  0.40891392 120       2        3  0.40207904 57        2        3  0.39510378 114       2        3  0.37176468 124       2        3  0.34854822 102       2        3  0.33532624 61        2        1  0.32662688 58        2        1  0.20142024 51        2        3  0.19024422 115       2        3  0.16320750 53        2        3  0.11554863 112       2        3 -0.07433144 111       2        3 -0.07748205 103       3        2  0.59622203 106       3        2  0.59241159 108       3        2  0.58027197 110       3        2  0.56716967 123       3        2  0.56182697 121       3        2  0.55568135 119       3        2  0.53242285 118       3        2  0.52551154 125       3        2  0.51206488 105       3        2  0.49243542 101       3        2  0.45749953 113       3        2  0.44409513 109       3        2  0.37181492 117       3        2  0.26375026 116       3        2  0.21777715 104       3        2  0.21412781 Average silhouette width per cluster: [1] 0.7931708 0.4153331 0.4678177 Average silhouette width of total data set: [1] 0.5524757 2775 dissimilarities, summarized : Min. 1st Qu. Median   Mean 3rd Qu. Max. 0.1000 1.1136  2.5080  2.6329  3.9006  7.0852  Metric :  euclidean Number of objects : 75 Available components: [1] "medoids"   "id.med"     "clustering" "objective"  "isolation"  "clusinfo" "silinfo"   "diss"       "call" [10] "data"

And the following graphic was generated as well:



Analysis
After the execution of the algorithm and the analysis of the data it was possible to tell that the clusters were well grouped and correlated with the species of each flower. In the data there was a total of 75 elements, 25 from Setosa species, 25 from Versicolor species and 25 from Virginica species, and the algorithm clustered the elements from Setosa as cluster 1, the ones from Versicolor as cluster 2 and the ones from Virginica as cluster 3. After verifying the results we find that out of 75 elements, 66 were correctly clustered, giving an error margin of 12%—which is a very good result.

= References =


 * 1) The R Development Core Team, R: A Language and Environment for Statistical Computing.
 * 2) Kaufman, L., Rousseeuw, P. J., Clustering by Means of Medoids.