Data Mining Algorithms In R/Classification/SVM

Introduction
Support Vector Machines (SVMs) are supervised learning methods used for classification and regression tasks that originated from statistical learning theory. As a classification method, SVM is a global classification model that generates non-overlapping partitions and usually employs all attributes. The entity space is partitioned in a single pass, so that flat and linear partitions are generated. SVMs are based on maximum margin linear discriminants, and are similar to probabilistic approaches, but do not consider the dependencies among attributes.

Traditional neural network approaches have suffered difficulties with generalization, producing models that overfit the data as a consequence of the optimization algorithms used for parameter selection and the statistical measures used to select the best model. SVMs have been gaining popularity due to many attractive features and promising empirical performance. They are based on the Structural Risk Minimization (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimization (ERM) principle employed by conventional neural networks. ERM minimizes the error on the training data, while SRM minimizes an upper bound on the expected risk. This gives SRM greater generalization ability, which is the goal in statistical learning. SVMs rely on preprocessing the data to represent patterns in a high dimension, typically much higher than the original feature space. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two categories can always be separated by a hyperplane.

A classification task usually involves training and test sets which consist of data instances. Each instance in the training set contains one target value (class label) and several attributes (features). The goal of a classifier is to produce a model able to predict target values of data instances in the test set, for which only the attributes are known. Without loss of generality, the classification problem can be viewed as a two-class problem in which the objective is to separate the two classes by a function induced from the available examples. The goal is to produce a classifier that generalizes well, i.e. that works well on unseen examples. Figure 1 shows a situation in which various linear classifiers can separate the data. However, only one of them maximizes the distance between itself and the nearest example of each class (i.e. the margin), and it is therefore called the optimal separating hyperplane. It is intuitively expected that this classifier generalizes better than the other options. The SVM classifier is built on this idea: it chooses the hyperplane that has the maximum margin.



Figure 1: Example of separating hyperplanes

Algorithm
Let D be a classification dataset with n points in a d-dimensional space D = {(xi, yi)}, with i = 1, 2, ..., n and let there be only two class labels such that yi is either +1 or -1. A hyperplane h(x) gives a linear discriminant function in d dimensions and splits the original space into two half-spaces:


 * $$h(x) = w^Tx + b = w_1x_1 + w_2x_2 + ... + w_dx_d + b\,$$,


where w is a d-dimensional weight vector and b is a scalar bias. Points on the hyperplane have h(x) = 0, i.e. the hyperplane is defined by all points for which $$w^Tx = -b$$.

If the dataset is linearly separable, a separating hyperplane can be found such that h(x) < 0 for all points with label -1 and h(x) > 0 for all points with label +1. In this case, h(x) serves as a linear classifier or linear discriminant that predicts the class of any point. Moreover, the weight vector w is orthogonal to the hyperplane and therefore gives the direction that is normal to it, whereas the bias b fixes the offset of the hyperplane in the d-dimensional space.
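As a small illustration of the discriminant described above, the following base R sketch evaluates h(x) for a hand-picked hyperplane and uses its sign as the predicted label. The weight vector, the bias, and the helper names `h` and `predict_label` are illustrative assumptions and not part of any package.

```r
# Linear discriminant h(x) = w^T x + b for a single point x.
h <- function(x, w, b) sum(w * x) + b

# Predicted class: +1 if h(x) > 0, otherwise -1.
predict_label <- function(x, w, b) ifelse(h(x, w, b) > 0, 1, -1)

# Hypothetical hyperplane in d = 2 dimensions: x1 + x2 - 3 = 0.
w <- c(1, 1)
b <- -3

h(c(3, 2), w, b)              # 2, positive side
h(c(0, 1), w, b)              # -2, negative side
predict_label(c(3, 2), w, b)  # 1
predict_label(c(0, 1), w, b)  # -1
```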

Given a separating hyperplane h(x) = 0, it is possible to calculate the distance between each point xi and the hyperplane by:


 * $$\delta_i = \tfrac{y_ih(x_i)}{||w||}$$

The margin of the linear classifier is defined as the minimum distance of all n points to the separating hyperplane.


 * $$\delta^*=\min_{x_i}\{\tfrac{y_ih(x_i)}{||w||}\}$$

All points (vectors x*i) that achieve this minimum distance are called the support vectors for the linear classifier. In other words, a support vector is a point that lies precisely on the margin of the classifying hyperplane.
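The distance, margin, and support vector definitions above can be checked numerically. The sketch below uses a made-up four-point dataset and a hand-picked separating hyperplane (both are assumptions for illustration; the hyperplane is not claimed to be the optimal one).

```r
# Toy dataset (assumed): four points in 2 dimensions with labels +1 / -1.
X <- rbind(c(2, 2), c(3, 3), c(0, 0), c(-1, 0))
y <- c(1, 1, -1, -1)

# Hypothetical separating hyperplane: x1 + x2 - 2 = 0.
w <- c(1, 1)
b <- -2

# h(x_i) for every row, then the distances delta_i = y_i * h(x_i) / ||w||.
hx    <- as.vector(X %*% w + b)
delta <- y * hx / sqrt(sum(w^2))

# The margin is the smallest distance; the points attaining it are
# the support vectors of this particular hyperplane.
margin  <- min(delta)
support <- which(abs(delta - margin) < 1e-12)

delta    # 1.414214 2.828427 1.414214 2.121320
margin   # 1.414214
support  # 1 3
```

All distances are positive, which confirms that this hyperplane separates the two classes; points 1 and 3 lie closest to it.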

In a canonical representation of the hyperplane, for each support vector x*i with label y*i we have that $$y^*_ih(x^*_i) = 1$$. Similarly, for any point that is not a support vector, we have that $$y_ih(x_i) > 1\,$$, since, by definition, it must be farther from the hyperplane than a support vector. Therefore we have that $$y_ih(x_i) \geq 1,\ \forall x_i \in D$$.

The fundamental idea behind SVMs is to choose the hyperplane with the maximum margin, i.e. the optimal canonical hyperplane. To do this, one needs to find the weight vector w and the bias b that yield the maximum margin among all possible separating hyperplanes, that is, the hyperplane that maximizes $$\tfrac{1}{||w||}$$. The problem then becomes a convex minimization problem with linear constraints (instead of maximizing the margin $$\tfrac{1}{||w||}$$, one can equivalently minimize $$||w||$$ or, more conveniently, $$\tfrac{||w||^2}{2}$$), as follows:


 * Objective Function : $$\min_{w,b} \tfrac{||w||^2}{2}$$


 * Linear Constraints : $$y_ih(x_i) \geq 1,\ \forall x_i \in D$$


This minimization problem can be solved using the Lagrange multiplier method, which introduces a Lagrange multiplier αi for each constraint:


 * $$\alpha_i(y_ih(x_i)-1) = 0\ \text{with}\ \alpha_i \geq 0$$

This method states that αi = 0 for all points that are at a distance larger than $$\tfrac{1}{||w||}$$ from the hyperplane, and only for those points that are exactly at the margin, i.e. the support vectors, αi > 0. The weight vector of the classifier is obtained as a linear combination of the support vectors, while the bias is the average of the biases obtained from each support vector.
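The relation between the multipliers and the classifier can be verified on a tiny problem. The sketch below relies on the standard expressions $$w = \sum_i \alpha_i y_i x_i$$ and $$b_i = y_i - w^Tx_i$$ for each support vector, which the paragraph above describes in words. The two-point dataset and the multiplier values (0.25 each, worked out by hand for this example) are assumptions for illustration.

```r
# Toy problem (assumed): one point per class, both are support vectors.
X <- rbind(c(1, 1), c(-1, -1))
y <- c(1, -1)

# Lagrange multipliers for this toy problem, worked out by hand.
alpha <- c(0.25, 0.25)

# Weight vector: w = sum_i alpha_i * y_i * x_i (row i of X is scaled by alpha_i * y_i).
w <- colSums(alpha * y * X)

# Bias: average of b_i = y_i - w^T x_i over the support vectors (alpha_i > 0).
sv <- alpha > 0
b  <- mean(y[sv] - as.vector(X[sv, , drop = FALSE] %*% w))

w                           # 0.5 0.5
b                           # 0
y * as.vector(X %*% w + b)  # 1 1, so both constraints hold with equality
```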

SVMs can handle linearly non-separable points, where the classes overlap to some extent so that a perfect separation is not possible, by introducing slack variables εi for each point xi in D. If 0 ≤ εi < 1, the point is still correctly classified. Otherwise, if εi > 1, the point is misclassified. So the goal of the classification becomes that of finding the hyperplane (w and b) with the maximum margin that also minimizes the sum of slack variables. A methodology similar to that described above is necessary to find the weight vector w and the bias b.
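To make the role of the slack variables concrete, the following sketch computes them for a fixed hyperplane using the usual definition $$\epsilon_i = \max(0,\ 1 - y_ih(x_i))$$, and evaluates the common soft-margin objective $$\tfrac{||w||^2}{2} + C\sum_i \epsilon_i$$. The data, the hyperplane, and the value C = 1 are assumptions chosen for illustration; the formulas are the standard ones and are not spelled out in the text above.

```r
# Toy data (assumed): the last two points violate the margin.
X <- rbind(c(2, 2), c(0, 0), c(1.5, 1), c(1, 0.5))
y <- c(1, -1, 1, 1)

# Hypothetical hyperplane: x1 + x2 - 2 = 0.
w <- c(1, 1)
b <- -2

# Slack of each point: zero when y_i * h(x_i) >= 1, positive otherwise.
hx    <- as.vector(X %*% w + b)
slack <- pmax(0, 1 - y * hx)

slack             # 0.0 0.0 0.5 1.5
which(slack > 1)  # 4, the only misclassified point

# Soft-margin objective for an assumed cost C = 1.
C <- 1
objective <- sum(w^2) / 2 + C * sum(slack)
objective         # 3
```

Point 3 has a slack between 0 and 1, so it lies inside the margin but is still on the correct side of the hyperplane, while point 4 has a slack above 1 and is misclassified.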

SVMs can also solve problems with non-linear decision boundaries. The main idea is to map the original d-dimensional space into a d’-dimensional space (d’ > d), where the points can possibly be linearly separated. Given the original dataset D = {xi, yi} with i = 1,...,n and the transformation function Φ, a new dataset is obtained in the transformation space DΦ = {Φ(xi), yi} with i = 1,...,n. After the linear decision surface is found in the d’-dimensional space, it is mapped back to the non-linear surface in the original d-dimensional space. To obtain w and b, Φ(x) needn't be computed in isolation. The only operation required in the transformed space is the inner product Φ(xi)TΦ(xj), which is defined with the kernel function (K) between xi and xj. Kernels commonly used with SVMs include:


 * the polynomial kernel:
 * $$K(x_i,x_j) = (x_i^Tx_j+1)^q$$, where $$q$$ is the degree of the polynomial


 * the gaussian kernel:
 * $$K(x_i,x_j) = e^{-\frac{||x_i-x_j||^2}{2\sigma^2}}$$, where $$\sigma$$ is the spread or standard deviation.


 * the gaussian radial basis function (RBF):
 * $$K(x_i,x_j)=e^{-\gamma||x_i-x_j||^2},\, \gamma\geq 0$$


 * the Laplace Radial Basis Function (RBF) kernel:
 * $$K(x_i,x_j)=e^{-\gamma||x_i-x_j||},\, \gamma\geq 0$$


 * the hyperbolic tangent kernel:
 * $$K(x_i,x_j)=\tanh(x_i^Tx_j+\text{offset})$$


 * the sigmoid kernel:
 * $$K(x_i,x_j)=\tanh(ax_i^Tx_j+\text{offset})$$


 * the Bessel function of the first kind kernel:
 * $$K(x_i,x_j)=\left( \tfrac{Bessel_{v+1}^n(\sigma||x_i-x_j||)}  {(||x_i-x_j||)^{-n(v+1)}}  \right)$$


 * the ANOVA radial basis kernel:
 * $$K(x_i,x_j)= \left( \sum_{k=1}^n e^{-\sigma(x_i^k - x_j^k)^2}     \right)^d$$


 * the linear splines kernel in one dimension:
 * $$K(x_i,x_j)=1+x_ix_j \min(x_i,x_j)-\tfrac{x_i+x_j}{2}\min(x_i,x_j)^2 + \tfrac{\min(x_i,x_j)^3}{3}$$

The Gaussian and Laplace RBF and Bessel kernels are general-purpose kernels used when there is no prior knowledge about the data. The linear kernel is useful when dealing with large sparse data vectors, as is usually the case in text categorization. The polynomial kernel is popular in image processing, and the sigmoid kernel is mainly used as a proxy for neural networks. The splines and ANOVA RBF kernels typically perform well in regression problems.
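A few of the kernels listed above are easy to write directly in base R. The sketch below also checks the key property of the kernel approach for one concrete case: for 2-dimensional inputs, the degree-2 polynomial kernel equals the inner product of an explicit 6-dimensional feature map, so Φ(x) never has to be computed during training. The function names and the test points are illustrative assumptions.

```r
# Polynomial kernel of degree q.
poly_kernel <- function(x, z, q = 2) (sum(x * z) + 1)^q

# Gaussian RBF kernel with parameter gamma.
rbf_kernel <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))

# Laplace RBF kernel with parameter gamma.
laplace_kernel <- function(x, z, gamma = 1) exp(-gamma * sqrt(sum((x - z)^2)))

# Explicit feature map whose inner product equals the degree-2
# polynomial kernel for 2-dimensional inputs.
phi <- function(x) {
  c(1, sqrt(2) * x[1], sqrt(2) * x[2], x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
}

x <- c(1, 2)
z <- c(3, -1)

poly_kernel(x, z, q = 2)  # 4
sum(phi(x) * phi(z))      # 4, up to floating-point rounding
rbf_kernel(x, x)          # 1, a point has maximal similarity with itself
```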

Available Implementations in R
R is a language and environment for statistical computing and graphics. There are five packages that implement SVM in R :


 * e1071
 * kernlab
 * klaR
 * svmpath
 * shogun

This documentation will focus on the e1071 package because it is the most intuitive. For information on the others, see the references cited above.

e1071 package
The e1071 package was the first implementation of SVM in R. The svm function provides an interface to libsvm, complemented by visualization and tuning functions. libsvm is a fast and easy-to-use implementation of the most popular SVM formulation of classification (C and $$\nu$$ ), and includes the most common kernels (linear, polynomial, RBF, and sigmoid). Multi-class classification is provided using the one-against-one voting scheme. It also includes the computation of decision and probability values for predictions, shrinking heuristics during the fitting process, class weighting in the classification mode, handling of sparse data, and cross-validation.

The R implementation is based on the S3 class mechanisms. It basically provides a training function with standard and formula interfaces, and a predict method. In addition, a plot method for visualizing data, support vectors, and decision boundaries is provided. Hyperparameter tuning is done using the tune framework, which performs a grid search over specified parameter ranges.

Installing and Starting the e1071 Package
To install the e1071 package in R, type:

install.packages('e1071', dependencies = TRUE)

and to start using the package, type:

library(e1071)

Main Functions in the e1071 Package for Training, Testing, and Visualizing
Some e1071 package functions are very important in any classification process using SVM in R, and thus will be described here.

The first function is svm, which is used to train a support vector machine. Some important parameters include:


 * data: an optional data frame containing the variables in the model. It is used together with a formula; in that case, the parameters x and y described below are not needed;


 * x: a data matrix, a vector, or a sparse matrix that represents the instances of the dataset and their respective properties. Rows represent the instances and columns represent the properties;


 * y: a response vector with one label for each row (instance) of x;


 * type: sets how svm will work. The possible values for classification are: C-classification, nu-classification, and one-classification (for novelty detection);


 * kernel: defines the kernel used in training and prediction. The options are: linear, polynomial, radial (radial basis), and sigmoid;


 * degree: parameter needed if the kernel is polynomial (default: 3);


 * gamma: parameter needed for all types of kernels except linear (default: 1/(data dimension));


 * coef0: parameter needed for polynomial and sigmoid kernels (default: 0);


 * cost: cost of constraint violation (default: 1). This is the ‘C’-constant of the regularization term in the Lagrange formulation;


 * cross: if an integer k > 0 is given, a k-fold cross-validation is performed on the training data to assess the quality of the model (the accuracy rate for classification);


 * probability: logical indicating whether the model should allow for probability predictions.

An example of svm usage is given below:

library(MASS)
data(cats)
model <- svm(Sex~., data = cats)

The first two commands load the cats dataset, which contains 144 instances, 2 numerical attributes for each instance ("Bwt" and "Hwt"), and the class of each instance (attribute "Sex"). The class can be "F", for female, or "M", for male. In the third command, the formula "Sex~." specifies that the attribute (column) Sex holds the instance classes and that all remaining columns are used as predictors.
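The parameter list above also describes an x and y interface. As a hedged sketch, the call below fits a model on the same cats data through that interface, with the documented defaults written out explicitly; it assumes the e1071 and MASS packages are installed, and the variable name `model_xy` is only illustrative. It is expected to correspond to the formula call shown earlier, although that equivalence is not verified here.

```r
library(e1071)
library(MASS)   # provides the cats dataset
data(cats)

# Attributes as a numeric matrix, class labels as a factor.
x <- as.matrix(cats[, c("Bwt", "Hwt")])
y <- cats$Sex

# x / y interface with the default settings spelled out.
model_xy <- svm(x, y,
                type   = "C-classification",
                kernel = "radial",
                gamma  = 1 / ncol(x),   # default: 1 / (data dimension)
                cost   = 1)

model_xy$tot.nSV   # total number of support vectors
```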

For information on the parameters of the model and on the number of support vectors, type:

print(model)
summary(model)

The result of the summary command is shown below:

Call:
svm(formula = Sex ~ ., data = cats)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.5

Number of Support Vectors:  84
 ( 39 45 )

Number of Classes:  2

Levels:
 F M

To see the built model with a scatter plot of the input, the plot function can be used. This function optionally draws a filled contour plot of the class regions. The main parameters of this function are listed below:


 * model: an object of class svm, as returned by the svm function;


 * data: the data to visualize. It should be the same data used for building the model in the svm function;


 * symbolPalette, svSymbol, dataSymbol, and colorPalette: these parameters control the colors and symbols used to represent support vectors and the other data points.

The following command produces a graph in which support vectors are shown as ‘X’, true classes are indicated by symbol color, and predicted class regions are shown as a colored background.

plot(model, cats)



The predict function predicts values based on a model trained by svm. For a classification problem, it returns a vector of predicted labels. Detailed information about its usage can be obtained with the following command.

help(predict.svm)

Let us first divide the cats dataset into a train and a test set:

index <- 1:nrow(cats)
testindex <- sample(index, trunc(length(index)/3))
testset <- cats[testindex,]
trainset <- cats[-testindex,]

Now we run the model again using the train set and predict classes using the test set in order to verify if the model has good generalization.

model <- svm(Sex~., data = trainset)
prediction <- predict(model, testset[,-1])

The -1 is because the dependent variable, Sex, is in column number 1.

A cross-tabulation of the true versus the predicted values yields the confusion matrix:

tab <- table(pred = prediction, true = testset[,1])

Typing tab displays the confusion matrix, as shown below:

    true
pred  F  M
   F 10  8
   M  6 24

With this information, it is possible to compute the sensitivity, the specificity, and the precision of the model on the test set.

Model accuracy rates can be computed using the classAgreement function:

classAgreement(tab)
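The overall accuracy can also be computed by hand, which helps when interpreting the output of classAgreement (its diag component should correspond to the same quantity). The sketch below rebuilds the confusion matrix shown in the text as a plain matrix, so it runs in base R without a trained model; the variable names are illustrative.

```r
# Confusion matrix from the text (rows: predicted, columns: true).
tab <- matrix(c(10, 6, 8, 24), nrow = 2,
              dimnames = list(pred = c("F", "M"), true = c("F", "M")))

# Overall accuracy: correctly classified instances over all instances.
accuracy <- sum(diag(tab)) / sum(tab)
accuracy   # 0.7083333

# Per-class recall: fraction of each true class that was predicted correctly.
recall <- diag(tab) / colSums(tab)
recall     # F: 0.625, M: 0.75
```

Because the test set is drawn at random, a new run of the split will generally produce a different table and different rates.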

The tune function can be used to tune hyperparameters of statistical methods using a grid search over the supplied parameter ranges.

tuned <- tune.svm(Sex~., data = trainset, gamma = 10^(-6:-1), cost = 10^(1:2))
summary(tuned)

These commands will list the best parameters, the best performance, and details of the tested parameter values, as shown below.

Parameter tuning of `svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
   0.1  100

- best performance: 0.1566667

- Detailed performance results:
   gamma cost     error dispersion
1  1e-06   10 0.2600000  0.1095195
2  1e-05   10 0.2600000  0.1095195
3  1e-04   10 0.2600000  0.1095195
4  1e-03   10 0.2600000  0.1095195
5  1e-02   10 0.2833333  0.1230890
6  1e-01   10 0.1788889  0.1359264
7  1e-06  100 0.2600000  0.1095195
8  1e-05  100 0.2600000  0.1095195
9  1e-04  100 0.2600000  0.1095195
10 1e-03  100 0.2833333  0.1230890
11 1e-02  100 0.1788889  0.1359264
12 1e-01  100 0.1566667  0.1014909

Case Study
In this section we apply svm to a breast cancer diagnostic dataset. The resulting model should be able to discriminate between benign and malignant tumors.

The DataSet
The dataset is available for download as the file wdbc.data. It contains 569 instances with 32 attributes each. The first attribute is the instance identifier, and the second is the class label, which can be M (malignant tumor) or B (benign tumor). The following 30 attributes are real-valued input features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. There are 357 benign instances and 212 malignant instances in the dataset.

In order to read the dataset, after downloading it and saving it, type in R:

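dataset <- read.csv('/home/myprofile/wdbc.data', header = FALSE, stringsAsFactors = TRUE)

'/home/myprofile/' is the path where the dataset was saved. The argument stringsAsFactors = TRUE turns the class label column into a factor, which svm expects for classification; since R 4.0.0, character columns are no longer converted to factors by default.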

Preparing the DataSet
Let us now randomly divide the dataset into two subsets, one with about 70% of the instances for training and another with the remaining 30% or so for testing:

index <- 1:nrow(dataset)

testindex <- sample(index, trunc(length(index)*30/100))

testset <- dataset[testindex,]

trainset <- dataset[-testindex,]
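Because sample draws a different test set on every run, the numbers reported in the rest of this case study will vary between runs. One way to make the split repeatable is to fix the random seed first. The helper below is a hypothetical sketch (the name `split_data`, its arguments, and the seed value are assumptions), shown on a synthetic data frame with the same number of rows as the dataset.

```r
# Hypothetical helper: split a data frame into training and test sets.
split_data <- function(df, test_fraction = 0.3, seed = 42) {
  set.seed(seed)                       # makes the random split repeatable
  n_test    <- trunc(nrow(df) * test_fraction)
  testindex <- sample(seq_len(nrow(df)), n_test)
  list(train = df[-testindex, , drop = FALSE],
       test  = df[testindex, , drop = FALSE])
}

# Synthetic stand-in with the same number of rows as the dataset in the text.
toy   <- data.frame(id = 1:569, value = rnorm(569))
parts <- split_data(toy)

nrow(parts$test)    # 170
nrow(parts$train)   # 399
```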

Choosing Parameters
Now we use the tune function to perform a grid search over the supplied parameter ranges (C - cost, $$\gamma$$ - gamma) on the training set. The gamma parameter ranges from 0.000001 to 0.1, and the cost parameter from 0.1 to 10.

It is important to understand the influence of these two parameters, because the accuracy of an SVM model depends largely on their selection. For example, if C is too large, non-separable points incur a high penalty, and the model may store many support vectors and overfit. If C is too small, the model may underfit.
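To see what a grid search over these ranges actually enumerates, the candidate values can be expanded into every (gamma, cost) pair with base R. This is only a sketch of the search space, not of the tuning itself; the variable names are illustrative.

```r
# Candidate values described in the text.
gamma_values <- 10^(-6:-1)   # 1e-06, 1e-05, ..., 0.1
cost_values  <- 10^(-1:1)    # 0.1, 1, 10

# Every (gamma, cost) combination that a grid search would evaluate.
grid <- expand.grid(gamma = gamma_values, cost = cost_values)

nrow(grid)     # 18
head(grid, 3)  # first three combinations
```

Each of the 18 combinations requires a separate cross-validated fit, so the cost of the search grows with the product of the number of candidate values per parameter.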

Notice that the columns (attributes) in the dataset have no names, so R assigns default names to them: V1 for the first column, V2 for the second, and so on. This can be checked by typing:

names(dataset)

Since the class label is the second column of the dataset, the formula passed to tune.svm uses V2 as the response:

tuned <- tune.svm(V2~., data = trainset, gamma = 10^(-6:-1), cost = 10^(-1:1))

The results are shown with the following command:

summary(tuned)

Parameter tuning of `svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
 0.001   10

- best performance: 0.02006410

- Detailed performance results:
   gamma cost      error dispersion
1  1e-06  0.1 0.36333333 0.05749396
2  1e-05  0.1 0.36333333 0.05749396
3  1e-04  0.1 0.36333333 0.05749396
4  1e-03  0.1 0.30064103 0.06402773
5  1e-02  0.1 0.06256410 0.04283663
6  1e-01  0.1 0.08512821 0.05543939
7  1e-06  1.0 0.36333333 0.05749396
8  1e-05  1.0 0.36333333 0.05749396
9  1e-04  1.0 0.28314103 0.05862576
10 1e-03  1.0 0.05506410 0.04373139
11 1e-02  1.0 0.02756410 0.02188268
12 1e-01  1.0 0.03256410 0.02896982
13 1e-06 10.0 0.36333333 0.05749396
14 1e-05 10.0 0.28314103 0.05862576
15 1e-04 10.0 0.05500000 0.04684490
16 1e-03 10.0 0.02006410 0.01583519
17 1e-02 10.0 0.02256410 0.01845738
18 1e-01 10.0 0.05532051 0.04110686

Training The Model
To build an svm model to predict breast cancer using C=10 and gamma=0.001, which were the best values according to the tune function run above, type:

model <- svm(V2~., data = trainset, kernel = "radial", gamma = 0.001, cost = 10)

To inspect the resulting model, including the number of support vectors, type:

summary(model)

The result follows:

Call:
svm(formula = V2 ~ ., data = trainset, kernel = "radial", gamma = 0.001, cost = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  10
      gamma:  0.001

Number of Support Vectors:  79

 ( 39 40 )

Number of Classes:  2

Levels:
 B M

Testing the Model
Now we run the model against the test set to predict classes.

prediction <- predict(model, testset[,-2])

The -2 is needed because the class label, V2, is in the second column.

To produce the confusion matrix, type:

tab <- table(pred = prediction, true = testset[,2])

The confusion matrix is:

    true
pred   B   M
   B 103   6
   M   0  61

This means that there are 103 benign instances in the test set, and all of them were predicted as benign. On the other hand, of the 67 malignant instances in the test set, 61 were predicted correctly and 6 were predicted as benign.

Let:


 * TP: true positives, i.e. malignant instances predicted correctly
 * FP: false positives, i.e. benign instances predicted as malignant
 * TN: true negatives, i.e. benign instances predicted correctly
 * |N|: total number of benign instances
 * |P|: total number of malignant instances

$$ \text{sensitivity}=\frac{TP}{|P|} $$

$$ \text{specificity}=\frac{TN}{|N|} $$

$$ \text{precision}=\frac{TP}{TP+FP} $$

For this problem we have:

$$ \text{sensitivity}=\frac{61}{61+6}\approx 0.91 $$

$$ \text{specificity}=\frac{103}{103}=1 $$

$$ \text{precision}=\frac{61}{61+0}=1 $$

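With a sensitivity of about 0.91 and a specificity and precision of 1 on this test split, the classifier performs well, although 6 of the 67 malignant tumors were classified as benign.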
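The same figures can be reproduced in R directly from the confusion matrix. The sketch below rebuilds the table reported above as a plain matrix, treating M (malignant) as the positive class as in the definitions given earlier; the overall accuracy is added for completeness, and the variable names are illustrative.

```r
# Confusion matrix from the case study (rows: predicted, columns: true).
tab <- matrix(c(103, 0, 6, 61), nrow = 2,
              dimnames = list(pred = c("B", "M"), true = c("B", "M")))

TP <- tab["M", "M"]   # malignant predicted as malignant
FP <- tab["M", "B"]   # benign predicted as malignant
TN <- tab["B", "B"]   # benign predicted as benign
FN <- tab["B", "M"]   # malignant predicted as benign

sensitivity <- TP / (TP + FN)      # TP / |P|
specificity <- TN / (TN + FP)      # TN / |N|
precision   <- TP / (TP + FP)
accuracy    <- (TP + TN) / sum(tab)

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, accuracy = accuracy), 4)
# sensitivity specificity   precision    accuracy
#      0.9104      1.0000      1.0000      0.9647
```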