Data Mining Algorithms In R/Classification/adaboost

Boosting is one of the most important developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. This seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion.

Algorithm
While boosting has evolved somewhat over the years, we describe the most commonly used version of the AdaBoost procedure (Freund and Schapire - 1996) which we call Discrete AdaBoost. This is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting.We have training data $$(x_1,y_1), ..., (x_n,y_n)$$ with $$x_i$$ a vector valued feature and $$y_i = -1$$ or 1. We define $$F(x) = \sum_{1}^{M} c_mf_m$$ where each $$f_m(x)$$ is a classifier producing values plus or minus 1 and $$c_m$$ are constants; the corresponding prediction is sign $$(F(x))$$. The AdaBoost trains the classifiers $$f_m(x)$$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined to be a linear combination of the classifiers from each stage.

Implementation
Adaboost is part of ada package. In this section you find more information about installing and using it on R Environment.

Type the following commands in R console to install and load the ada package:

install.packages("ada") library("rpart") library("ada")

The function used to execute the algorithm adaboost is:

ada(x, y,test.x,test.y=NULL, loss=c("exponential","logistic"), type=c("discrete", "real", "gentle"), iter=50, nu=0.1, bag.frac=model.coef=TRUE, bag.shift=FALSE, max.iter=20, delta=10^(-10), verbose=...,na.action=na.rpart)

The arguments are:

x: matrix of descriptors.

Y: vector of responses. ‘y’ may have only two unique values.

test.x: testing matrix of descriptors (optional)

test.y: vector of testing responses (optional)

loss: loss="exponential", "ada","e" or any variation corresponds to the default boosting under exponential loss. loss="logistic","l2","l" provides boosting under logistic loss.

type: type of boosting algorithm to perform. “discrete” performs discrete Boosting (default). “real” performs Real Boost. “gentle” performs Gentle Boost.

Iter: number of boosting iterations to perform. Default = 50.

Nu: shrinkage parameter for boosting, default taken as 1.

bag.frac: sampling fraction for samples taken out-of-bag. This allows one to use random permutation which improves performance.

model.coef: flag to use stageweights in boosting. If FALSE then the procedure corresponds to epsilon-boosting.

bag.shift: flag to determine whether the stageweights should go to one as nu goes to zero. This only makes sense if bag.frac is small. The rationale behind this parameter is discussed in (Culp et al., 2006).

max.iter: number of iterations to perform in the newton step to determine the coefficient.

delta: tolerance for convergence of the newton step to determine the coefficient.

Verbose: print the number of iterations necessary for convergence of a coefficient.

Formula: a symbolic description of the model to be fit.

data: an optional data frame containing the variables in the model.

Subset: an optional vector specifying a subset of observations to be used in the fitting process.

na.action: a function that indicates how to process ‘NA’ values. Default=na.rpart.

...: arguments passed to rpart.control. For stumps, use rpart.control(maxdepth=1,cp=- 1,minsplit=0,xval=0). maxdepth controls the depth of trees, and cp controls the complexity of trees. The priors should also be fixed through the parms argument as discussed in the second reference.

Type the following command to show the result from this algorithm:

summary(AdaObject) varplot(VariableImportanceObject)

When using usage 'ada(x,y)': x data can take the form data.frame or as.matrix. y data can take form data.frame, as.factor, as.matrix, as.array, or as.table. missing values must be removed from the data prior to execution.

When using usage 'ada(y~.)': data must be in a data frame. Response can have factor or numeric values. missing values can be present in the descriptor data, whenever na.action is set to any option other than na.pass.

After the model is fit, 'ada' prints a summary of the function call, the method used for boosting, the number of iterations, the final confusion matrix (observed classification vs predicted classification; labels for classes are same as in response), the error for the training set, and testing, training, and kappa estimates of the appropriate number of iterations.

A summary of this information can also be obtained with the command ‘print(x)’. Corresponding functions (Use help with summary.ada, predict.ada, . . . varplot for additional information on these commands): summary : function to print a summary of the original function call, method used for boosting, number of iterations, final confusion matrix, accuracy, and kappa statistic (a measure of agreement between the observed classification and predicted classification). ‘summary’ can be used for training, testing, or validation data.

predict : function to predict the response for any data set (train, test, or validation)

plot : function to plot performance of the algorithm across boosting iterations. Default plot is iteration number (x-axis) versus prediction error (y-axis) for the data set used to build the model. Function can also simultaneously produce an error plot for an external test set and a kappa plot for training and test sets.

pairs : function to produce pairwise plots of descriptors. Descriptors are arranged by decreasing frequency of selection by boosting (upper left = most frequently chosen). The color of the marker in the plot represents class membership; the Size of the marker represents predicted class probability. The larger the marker, the higher the probability of classification.

varplot : plot of variables ordered by the variable importance measure (based on improvement).

addtest : add a testing data set to the ada object, therefore the testing errors only have to be computed once.

update : add more trees to the ada object.

Scenario
A data set that contains information about compounds used in drug discovery. Specifically, this data set consists of 5631 compounds on which an in-house solubility screen (ability of a compound to dissolve in a water/solvent mixture) was performed. Based on this screen, compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. Of these descriptors, one contained missing values for approximately 14% (n=787) of the observations. The objective of the analysis is to model the relationship between the structural descriptors and the solubility class. The data will be called soldat.

Data
Input format: x1 a numeric vector x2 a numeric vector x3 a numeric vector x4 a numeric vector x5 a numeric vector x6 a numeric vector x7 a numeric vector x8 a numeric vector x9 a numeric vector x10 a numeric vector x11 a numeric vector x12 a numeric vector x13 a numeric vector x14 a numeric vector x15 a numeric vector x16 a numeric vector x17 a numeric vector x18 a numeric vector x19 a numeric vector x20 a numeric vector . . . x72 a numeric vector with missing data y a numeric vector

Execution
data("soldat") n <- nrow(soldat) set.seed(100) ind <- sample(1:n) trainval <- ceiling(n * .5) testval <- ceiling(n * .3) train <- soldat[ind[1:trainval],] test <- soldat[ind[(trainval + 1):(trainval + testval)],] valid <- soldat[ind[(trainval + testval + 1):n],]

control <- rpart.control(cp = -1, maxdepth = 14,maxcompete = 1,xval = 0) gen1 <- ada(y~., data = train, test.x = test[,-73], test.y = test[,73], type = "gentle", control = control, iter = 70) gen1 <- addtest(gen1, valid[,-73], valid[,73]) summary(gen1) varplot(gen1)

Output
Loss: exponential Method: gentle Iteration: 70 Training Results Accuracy: 0.987 Kappa: 0.972 Testing Results Accuracy: 0.765 Kappa: 0.487



Analysis
Testing accuracy rates are printed in the order they are entered so the accuracy on the testing set is 0.765 and on the validation set 0.781. For this type of early drug discovery data, the Gentle AdaBoost algorithm performs adequately with test set accuracy of 76.5% (kappa is approximately 0.5). In order to enhance our understanding regarding the relationship between descriptors and the response, the varplot function was employed.