Data Mining Algorithms In R/Frequent Pattern Mining/arulesNBMiner

Introduction
The technique to be discussed in this chapter is used in frequent itemset mining. There are several situations which people are interested in co-occurrence of two or more items of a set. It is important to establish which items co-occur, since association rules can be extracted from frequent itemsets [1]. A typical example of an application is a supermarket where one discovers customers who buy meat and beer also tend to buy coal. Thus, a frequent item set would be meat-beer-coal and an association rule would be that, in general, customers who buy meat and beer have more chances to buy coal.

Many works have dealt with the problem of frequent itemset mining. Most of them show the necessity of a min_support threshold, which is an itemset minimum frequency in the data and in general defined by the miner user. In addition, these studies have as goal to mine a complete set of frequent itemsets that satisfy min_support [2]. The application of a minimum support results in several assumptions which are rarely discussed or verified. One assumption is that items occur in the database following a possibly unknown but stable process, and that the items occur in the database with roughly similar frequencies. Nevertheless, in the real world, transactions data have a frequency distribution highly skewed with most items occurring infrequently, while just some of them occur with high frequency. In a database where this phenomenon happens, interesting patterns are not found since some of the associated items are too infrequent to satisfy the user-specified minimum support [3].

Some algorithms such as TFP were developed in such a way that a user does not need to determine a min_support. However, he needs to inform the minimum size of itemsets (min_l) and the number of itemsets (k) he wishes to mine. Furthermore, the TFP algorithms only mine frequent closed itemsets.[2]. Again, the user has a parameter (min_l) which he should specify before mining the data, which is a subtle decision.

Therefore, this chapter presents an algorithm, which is implemented in an R-package and uses a simple stochastic model (Negative Binomal model or NB-model) to estimate a minimum support, utilizing knowledge of the process which generates transaction data, and allows for highly skewed frequency distributions. The name of the package is arulesNBMiner, which is a Java implementation of a depth first search algorithm to mine NB-frequent itemsets of NB-precise rules [4]. The algorithm utilizes the information contained in its own data structure to estimate the minimum support. It then uses a precision limit to estimate the min_support for each k-itemset, plus one extension it calculates at a different minimum support.

Algorithm
Function NBSelect$$\left(l,\ L,\ n,\ \tilde{k},\ \tilde{a}',\ \pi\right)$$ 1. rmax = max(c.count) in L 2. rrescale = sum(c.count) in L 3. for each tuple [c,c.count] ∈ L do nobs [c.count]++ 4. do 5. $$\text{precision}=\frac{(1-\vert l\vert)P[R_1\ge p\;\vert\;k=\tilde k,\;a=r_\text{rescale}\cdot\tilde a']}{\sum_{r=p}^{r_\text{max} } n_{ob} s[r]}$$ 6. while (precision ≥ π ∧ (p−−) > 0) 7. return {c| [c,c.count] ∈ L ∧ c.count > p} where l = the pattern for which candidate items are selected L = a data structure which holds count information for items which co-occur with pattern l; we use tuples c, c.count, where c represents a candidate  item and c.count the count. n = the total number of available items $$\tilde{k}\text{ and }\tilde{a}'$$ = estimated parameters for the data set π = user-specified precision threshold

Implementation (description of used modules and packages)
The first thing one has to do to use NBMiner is installing the NBMiner package. This package has three dependencies packages: arules (http://cran.r-project.org/web/packages/arules/index.html) which provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules), Matrix (http://cran.r-project.org/web/packages/Matrix/index.html) which is classes and methods for dense and sparse matrices and operations on them using Lapack and SuiteSparse, and lettice (http://cran.r-project.org/web/packages/lattice/index.html) which is a framework for data visualization developed at the Bell Labs. Besides, the user has to install on computer the Sun Java(TM) Development Kit (JDK) 6, since the NBMiner uses a package called rJava which is part of JDK. This must be installed in the user operational system.

Installation of the dependencies packages can be performed within the R environment using the function "install.packages(“package name”)". The name of NBMiner package is arulesNBMiner.

install.packages(“arulesNBMiner”)

After the user installed the necessaries packages, he must load them. This can be done using the function "library(package name)".

library(arulesNBMiner)

However, before we show how to use the NBMiner package, it is necessary to show how to load the data. Here, we are using a data on csv format. In this format each line is a transaction like the example below:

Using the function "read.table" on R environment, we are able to load the data. This function creates a "data.frame" object. Its syntax is: object name<-read.table(“filename”, header=TRUE/FALSE,sep=”,”) The argument header  indicates the data has or does not a header while the sep indicates which character is being used to separate the elements. Later, the user has to transform this data frame object in a transaction object. For that, he can use the function "as". Its syntax is: object name<-as(data frame object,”transactions”) It is important to mention the user has to load the NBMiner package before he uses the function as with parameter transaction. Furthermore, there is another data format that some R functions can read and transform it in a transaction object. The user can utilize a binary format like the example below:

Here, each line represents a transaction and each column represents a term of the database. The number 1 means presence of that term in that transaction, while 0 means no presence. To read this data, user can first utilize the function "scan". It creates a vector object, and its syntax is:

object name<-scan(“file name”,sep=”,”)

After that, it is necessary to transform the vector object in a matrix object. This is performed using the function "dim". Its syntax is: dim(vector object)<-c(dim1,dim2)

Dim1 and Dim2 are the dimensions of the matrix which the user want creating. If the user wants, he can name each row and column of the matrix using the function "dimnames". Its syntax is: dimnames(matrix object)<-list(pridim=c(names),segdim=c(names)) Where pridim is a vector with first dimension names, and segdim is a vector with second dimension names. Finally, this object can be transformed in transaction object using the function "as". Its syntax is:

object name<-as(matrix object,”ItemMatrix”)

Soon after the user loaded the package and the data, it is necessary to create the parameters object using the function of NBMiner called "NBMinerParameters". This function reads the data and extracts from them some parameters which will be used in the next step. The syntax of this function is: object name<-NBMinerParameters(data object, pi=#, theta=#, maxlen=#, minlen=#, trim=#, verb=TRUE/FALSE, plot=TRUE/FALSE, rules=TRUE/FALSE) where: data object = the data as an object of class transaction pi = precision threshold theta = pruning parameter maxlen = maximal number of items in found itemsets (default = 5) minlen = minimum number of items in found itemsets (default = 1) trim = fraction of incidences to trim off the tail of the frequency distribution of the data verbose = use verbose output for the estimation procedure plot = plot the model rules = mine NB-precise rules instead of NB-frequent itemsets
 * 1) = numbers

After the user perform this procedure, he is able to run the algorithm to mine his data. For that, one will use the function called "NBMiner". Its syntax is:

object name<-NBMiner(data object, parameter=object parameter, control = list(verb=TRUE, debug=FALSE))

If the user uses the option rules=TRUE when he created the object parameter, the algorithm will mine the NB-precise rules. Otherwise, the same algorithm will mine the NB-frequent itemsets. Finally, we summarized the necessaries commands to use the NBMiner package for mining tasks using the first data structure mentioned above, these are below:

library(package name) object name<-read.table(“filename”, header=TRUE/FALSE,sep=”,”) object name<-as(data frame object,”transaction”) object name<-NBMinerParameters(data object, pi=#, theta=#, maxlen=#, minlen=#, trim=#, verb=TRUE/FALSE, plot=TRUE/FALSE, rules=TRUE/FALSE) object name<-NBMiner(data object, parameter=object parameter, control = list(verb=TRUE, debug=FALSE

Visualization
To visualize data information contained in a “data.frame” object just use the object name. Here, we use table as object name:

table = read.table("data",sep=",",header=TRUE) table

However, when the same principle is used in “transactions” format, we have just the number of transactions and items generated by the "as" function:

tableT = as(table,"transactions") tableT transactions in sparse format with 6 transactions (rows) and 18 items (columns)

Use “inspect” function to visualize the result of the conversion to the “transactions” format:

inspect(tableT)

To summarize the data information in "transaction" format use "summary".

summary(tableT) transactions as itemMatrix in sparse format with 6 rows (elements/itemsets/transactions) and 18 columns (items) and a density of 0.3333333 most frequent items: Employment=Private   Sex=Male    Accounts=USA 6               5            5  Marital=Civil    Education=College  (Other) 3               2            15  element (itemset/transaction) length distribution: sizes 6 6    Min. 1st Qu. Median Mean   3rd Qu. Max. 6      6       6       6       6       6  includes extended item information - examples: labels			variables   	levels 1 	Employment=Private 	Employment   	Private 2 	Education=Associate 	Education 	Associate 3 	Education=Bachelor  	Education  	Bachelor includes extended transaction information - examples: transaction		ID 1           		1 2             		2 3             		3

To see labels of the items generated in the conversion to transaction format use “itemInfo”:

itemInfo(tableT) labels 		variables		levels 1	Employment=Private 	Employment     	Private 2   	Education=Associate  	Education    		Associate 3    	Education=Bachelor  	Education     		Bachelor 4	Education=College 	Education      		College 5	Education=HSgrad 	Education       	HSgrad 6	Marital=Civil   	Marital        		Civil 7	Marital=Divorced 	Marital    		Divorced 8	Marital=Separated   	Marital    		Separated 9	Marital=Unmarried   	Marital    		Unmarried 10  	Occupation=Clerical 	Occupation     		Clerical 11 	Occupation=Executive 	Occupation    		Executive 12	Occupation=Repair 	Occupation      	Repair 13   	Occupation=Service 	Occupation      	Service 14	Occupation=Transport 	Occupation   		Transport 15	Sex=Female      	Sex       		Female 16	Sex=Male       	Sex         		Male 17	Accounts=Jamaica  	Accounts      		Jamaica 18	Accounts=USA  		Accounts 		USA

Use “labels” to see only the labels of the itens and transactions:

labels(tableT) $items [1]	"Employment=Private"	"Education=Associate"		"Education=Bachelor" [4]	"Education=College"    "Education=HSgrad"      	"Marital=Civil" [7]	"Marital=Divorced"     "Marital=Separated"  		"Marital=Unmarried" [10]	"Occupation=Clerical"  "Occupation=Executive"  	"Occupation=Repair" [13]	"Occupation=Service"   "Occupation=Transport"  	"Sex=Female" [16]	"Sex=Male"             "Accounts=Jamaica"      	"Accounts=USA" $transactionID [1] "1" "2" "3" "4" "5" "6"

To see NB mined results use the same commands above mentioned to “transactions” data:

paramA <- NBMinerParameters(tableT, trim = 0.01, pi = 0.999, theta = 0.8, rules = TRUE, plot = FALSE, verbose = FALSE, minlen=3, maxlen=5) tableNB <- NBMiner(tableT, parameter = paramA, control = list(verb = FALSE, debug=FALSE)) inspect(head(tableNB)) lhs                    		rhs		                precision 1	{Education=HSgrad, Sex=Male}      	=> {Employment=Private}    	0.9991467 2	{Sex=Male, Accounts=USA} 		=> {Marital=Civil}		0.9999421 3	{Sex=Male, Accounts=USA} 		=> {Education=HSgrad}     	0.9977636 4	{Education=College, Accounts=USA} 	=> {Employment=Private}		0.9982934 5	{Marital=Civil, Accounts=USA} 		=> {Sex=Male}             	0.9999752 6	{Employment=Private, Sex=Male}		=> {Accounts=USA}		0.9999948

Where "lhs" is the antecedent of the rules and de "rhs" is the consequent of the rules.

We can also show the data distribution in the space with “image” function:

image(tableNB)

Scenario
It contains artificial data created by Graham Williams (developer of Rattle) and is supplied with Rattle. To quote from the Rattle documentation: "It consists of 2,000 fictional clients who have been audited, perhaps for compliance with regard to the amount of a tax refund that is being claimed. For each case an outcome is recorded (whether the taxpayer's claims had to be adjusted or not) and any amount of adjustment that resulted is also recorded."

Input data no longer exists to download
Available on http://cs.anu.edu.au/Student/comp3420/mining/audit.csv.

Execution
table = read.table("audit.csv",sep=",",header=TRUE) tableT = as(table,"transactions") paramA = NBMinerParameters(tableT, trim = 0.01, pi = 0.999, theta = 0.8, rules = TRUE, plot = FALSE, verbose = FALSE, minlen=3, maxlen=5) transNB = NBMiner(tableT, parameter = paramA, control = list(verb = FALSE, debug=FALSE))

Output
transNB set of 18158 rules

tableNB = as(transNB,"data.frame") write.table(tableNB,file="auditNB.csv",sep=",")

Analysis
Based on the data above we can note that when the Marital state is “Civil” and the "Occupation" is "Professional" we have an "Chinese" account.