Data Mining Algorithms In R/Clustering/K-Cores

Introduction
In social networks analysis one of the major concerns is identification of cohesive subgroups os actores within a network. Friendship relation, publications citation, and many other more. Many studies and researches are focused on social network analysis, including in data mining. It is really important to find patterns in behavior of large online social networks, so the firms behind are able to create better mechanism to handle all that information with lower cost.

Online services such as Orkut, Facebook, Twitter and so on, have millions of users using their services simultaneously and interacting with others. Even in different services, the behavior of the network is similar.

People tend to interact in the same way as they do in real life, in a structure called “small world”, where people in a social network can reach any other person with less than seven steps. Such behavior can be studied to prevent disease propagation or to predict how fast an information can flow in society.

Several notions were introduced to formally describe cohesive groups: cliques, n–cliques, n–clans, n–clubs, k–plexes, k–cores, lambda sets,. . . For most of them it turns out that they are algorithmically difficult, classified as NP hard. However, for cores very efficient algorithm exists.

The Technique
There are several tasks when handling with social networks. There are some of them listed above.

Link-based object classification – We can classify the object based on its links.

Object type prediction – We can predict the object type based on the objects linked to it.

Link type prediction – Here we want to predict the link type based on the objects linked by it.

Predicting link existence – Now we don't want to know nothing about the link, but it's existence. How to predict when a link exists between two objects? That's the point here.

Link cardinality estimation – There are several ways to estimate the cardinality of a link, such as counting the number of links of an object, or counting the numbers of smallest paths that pass through an object.

Object reconciliation – The task is to check whether two objects are the same, based on the links and attributes. Remove duplicate instances are useful in many applications.

Group detection – A clustering task. Here we want to know when a group of objects belong to the same group.

Subgraph detection - Subgraph identification is to find characteristic subgraphs in a network. This is a form of search in graphs.

Metadata mining – Metadata is data about data.

The technique that will be explained in this text is about group detection in social networks, based on the degree of each node in the network.

The Algorithm
K-cores in graph theory were introduced by Seidman in 1983 and by Bollobas in 1984 as a method of (destructively) simplifying graph topology to aid in analysis and visualization. They have been more recently defined as the following by Batagelj et al.: given a graph G = {V,E}> with vertices set V and edges set E, the k-core is computed by pruning all the vertices (with their respective edges) with degree less than k. That means that if a vertex u has degree du, and it has n neighbors with degree less than k, then u's degree becomes du − n, and it will be also pruned if k > du − n.

This operation can be useful to filter or to study some properties of the graphs. For instance, when you compute the 2-core of graph G, you are cutting all the vertices which are in a tree part of graph. (A tree is a graph with no loops).

Note that, there is a refinement (possibly empty) of the k-core of a graph G, for which there exists at least k paths between any two pairs of vertices of G. This concept is called structural cohesion in sociology[1] and vertex-connectivity in graph theory, and is equivalent via the Menger theorem to a k-component, a maximal graph that cannot be disconnected by fewer than k nodes.

Cores
The notion of core is presented in Butts (2010) as following:

Let G = (V, E) be a graph, and let f (v, S, G) for v ∈ V, S ⊆ V be a real-valued vertex property function (in the language of Batagelj and Zaversnik). Then some set H ⊆ V is a generalized k-core for f if H is a maximal set such that f (v, H, G) ≥ k for all v ∈ H. Typically, f is chosen to be a degree measure with respect to S (e.g., the number of ties to vertices in S). In this case, the resulting k-cores have the intuitive property of being maximal sets such that every set member is tied (in the appropriate manner) to at least k others within the set.

Degree-based k-cores are a simple tool for identifying well-connected structures within large graphs. Let the core number of vertex v be the value of the highest-value core containing v. Then, intuitively, vertices with high core numbers belong to relatively well-connected sets (in the sense of sets with high minimum internal degree). It is important to note that, while a given k-core need not be connected, it is composed of subsets which are themselves well-connected; thus, the k-cores can be thought of as unions of relatively cohesive subgroups. As k-cores are nested, it is also natural to think of each k-core as representing a “slice” through a hypothetical “cohesion surface” on G. (Indeed, k-cores are often visualized in exactly this manner.)

The kcores function produces degree-based k-cores, for various degree measures (with or without edge values). The return value is the vector of core numbers for V, based on the selected degree measure. Missing (i.e., NA) edge are removed for purposes of the degree calculation.

= Implementation in R =

Package sna
Butts (2010) presents a series of implementations in R about algorithms to analysis  in social networks. The implementations uses the package "sna" as follow:


 * Package: ‘sna’
 * Version: 2.0-1
 * Date: 2009-06-07
 * Title: Tools for Social Network Analysis
 * Author: Carter T. Butts 
 * Maintainer: Carter T. Butts 
 * Depends: R (>= 2.0.0), utils
 * Suggests: network, rgl, numDeriv, SparseM, statnet
 * Description: A range of tools for social network analysis, including node and graph-level indices, structural distance and covariance methods, structural equivalence detection, p* modeling, random graph generation, and 2D/3D network visualization.
 * License: GPL (>= 2)
 * URL: http://erzuli.ss.uci.edu/R.stuff
 * Repository: CRAN
 * Date/Publication: 2009-06-08 07:08:51

To install the package "sna" see http://erzuli.ss.uci.edu/R.stuff

Description
kcores calculates the k-core structure of the input network, using the centrality measure indicated in cmode.

Usage
kcores(dat, mode = "digraph", diag = FALSE, cmode = "freeman", ignore.eval = FALSE)

Arguments
one or more (possibly valued) graphs. dat "digraph" for directed data, otherwise "graph". mode logical; should self-ties be included in the degree calculations? diag the degree centrality mode to use when constructing the cores. cmode logical; should edge values be ignored when computing degree? ignore.eval

Value
A vector containing the maximum core membership for each vertex.

 function (dat, mode = "digraph", diag = FALSE, cmode = "freeman",    ignore.eval = FALSE) {   dat <- as.edgelist.sna(dat, as.digraph = TRUE, suppress.diag = TRUE) if (is.list(dat)) return(lapply(dat, kcores, dat = dat, mode = mode, diag = diag, cmode = cmode, ignore.eval = ignore.eval)) if (mode == "graph") cmode <- "indegree" n <- attr(dat, "n") m <- NROW(dat) corevec <- 1:n dtype <- switch(cmode, indegree = 0, outdegree = 1, freeman = 2) if (!(cmode %in% c("indegree", "outdegree", "freeman"))) stop("Illegal cmode in kcores.\n") solve <- .C("kcores_R", as.double(dat), as.integer(n), as.integer(m),        cv = as.double(corevec), as.integer(dtype), as.integer(diag),         as.integer(ignore.eval), NAOK = TRUE, PACKAGE = "sna") if (is.null(attr(dat, "vnames"))) names(solve$cv) <- 1:n else names(solve$cv) <- attr(dat, "vnames") solve$cv }

Case Study
To illustrate the k-cores algorithm a simple case study will be shown.

Scenario
Imagine a group of students dining. Each of them may choose another one to sit with him. Each student is a node and each possibility of sitting together is an edge. We want to group them as friends they are separately.

Input Data
The input data is an adjacency matrix representing the network with students and friendship.

 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 1  0  1  2  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0 2   1  0  0  2  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0 3   0  0  0  0  0  0  0  0  1   0   2   0   0   0   0   0   0   0   0   0   0 4   0  0  0  0  1  0  0  2  0   0   0   0   0   0   0   0   0   0   0   0   0 5   0  0  0  1  0  0  0  0  0   0   0   0   0   0   2   0   0   0   0   0   0 6   0  0  0  0  0  0  0  0  2   0   0   0   0   0   0   0   0   0   0   1   0 7   0  0  0  0  0  2  0  0  0   0   0   0   0   0   1   0   0   0   0   0   0 8   0  0  0  0  2  0  0  0  0   0   0   0   0   0   1   0   0   0   0   0   0 9   0  0  0  0  0  1  0  0  0   0   0   0   0   2   0   0   0   0   0   0   0 10  0  0  0  0  0  0  0  0  0   0   0   0   0   0   2   0   0   1   0   0   0 11  0  0  2  0  0  0  0  0  1   0   0   0   0   0   0   0   0   0   0   0   0 12  0  0  0  0  0  0  0  0  0   0   0   0   1   0   0   0   0   0   0   2   0 13  0  0  0  0  0  0  0  0  0   0   0   2   0   0   0   0   0   0   0   0   0 14  0  0  0  0  0  0  0  0  2   0   0   0   0   0   1   0   0   0   0   0   0 15  0  0  0  0  0  0  0  0  2   1   0   0   0   0   0   0   0   0   0   0   0 16  0  0  0  0  0  0  0  0  0   0   0   0   2   0   0   0   0   0   1   0   0 17  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   2   0   0   1 18  0  0  0  0  0  0  0  0  2   0   0   0   0   1   0   0   0   0   0   0   0 19  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   1   0   0   2 20  0  0  0  0  0  0  0  0  0   1   2   0   0   0   0   0   0   0   0   0   0 21  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   1   0   2   0   0 22  0  0  0  0  0  0  0  0  0   0   0   0   2   0   0   0   1   0   0   0   0 23  0  0  0  0  2  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0 24  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   2   0   0   1   0 25  0  0  0  0  0  0  0  0  0   0   0   0   0   0   1   0   2   0   0   0   0 26  0  0  0  0  0  0  0  0  0   0   0   0   1   0   0   0   0   0   0   0   0   V22 V23 V24 V25 V26 1   0   0   0   0   0 2    0   0   0   0   0 3    0   0   0   0   0 4    0   0   0   0   0 5    0   0   0   0   0 6    0   0   0   0   0 7    0   0   0   0   0 8    0   0   0   0   0 9    0   0   0   0   0 10   0   0   0   0   0 11   0   0   0   0   0 12   0   0   0   0   0 13   1   0   0   0   0 14   0   0   0   0   0 15   0   0   0   0   0 16   0   0   0   0   0 17   0   0   0   0   0 18   0   0   0   0   0 19   0   0   0   0   0 20   0   0   0   0   0 21   0   0   0   0   0 22   0   0   0   0   0 23   0   0   1   0   0 24   0   0   0   0   0 25   0   0   0   0   0 26   0   0   2   0   0

When there's value 1 it means the first choice and 2 the second one.

Execution
To determine the maximum k-core for each vertex from the network, we have:

 kcores(people)

Where "people" is the matrix.

Output
A vector containing the maximum core membership for each vertex.

 1 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2

Analysis
As it is a small network, the maximum k-core is for k=2, so there aren't clear group definition.

Package RBGL
BOOST library.
 * Package: ‘RBGL’
 * Version: 1.26.0
 * Date: 2010-10-26
 * Title: An interface to the BOOST graph library
 * Author: Vince Carey, Li Long, R. Gentleman
 * Maintainer: Li Long 
 * Depends: graph, methods
 * Suggests: Rgraphviz
 * Description: A fairly extensive and comprehensive interface to the graph algorithms contained in the
 * License: Artistic-2.0
 * URL: http://www.bioconductor.org
 * Repository: CRAN
 * Date/Publication: 2010-10-26 12:22:53

Usage
kCores(g, EdgeType=c("in", "out"))

Arguments

 * g: an instance of the graph class’
 * EdgeType: what types of edges to be considered when g is directed

Details
The implementation is based on the algorithm by V. Batagelj and M. Zaversnik, 2002. The example snacoreex.gxl is in the paper by V. Batagelj and M. Zaversnik, 2002.

Value
A vector of the core numbers for all the nodes in g.

Examples
library(RBGL) con1 <- file(system.file("XML/snacoreex.gxl",package="RBGL")) kcoex <- fromGXL(con1) close(con1) kCores(kcoex) A C B E F D G H J K I L M N O P Q R S T U 1 2 1 2 3 3 3 3 3 3 3 3 2 2 1 1 2 2 2 2 0

con2 <- file(system.file("XML/conn2.gxl",package="RBGL")) kcoex2 <- fromGXL(con2) close(con2) kCores(kcoex2) A B C D E G H F 3 3 3 3 3 3 3 3

kCores(kcoex2, "in") A B C D E G H F 0 0 0 0 0 0 0 0

kCores(kcoex2, "out") A B C D E G H F 0 0 0 0 0 0 0 0

=References= BATAGELJ, V. & MRVAR, A. (2002). Generalized Cores. Journal of ACM, v. 5, 2002.

BATAGELJ, V. & MRVAR, A. (2003). An O(m) Algorithm for Cores Decomposition of Networks.

CAREY, V. & LONG, L. & GENTLEMAN, R. (2010). An Interface to the BOOST graph library. http://www.bioconductor.org/packages/release/bioc/html/RBGL.html.

BUTTS, C. T. (2010). Tools for Social Networks Analysis. http://erzuli.ss.uci.edu/R.stuff/.