Practical DevOps for Big Data/Anomaly Detection

Introduction
In anomaly detection the nature of the data is a key issue. Data can be of different types, such as binary, categorical or continuous. In DICE we deal mostly with continuous data, although categorical or even binary values can be present. Most metrics relate to computational resource consumption, execution time and so on, but there can be instances of categorical data denoting the status/state of a certain job, or binary data in the form of boolean values. This makes the creation of the data sets on which anomaly detection runs a crucial aspect of ADT, because some anomaly detection methods do not work on categorical or binary attributes.

It is important to note that most, if not all, anomaly detection techniques and tools deal with point data, meaning that no relationship is assumed between data instances. This assumption is not always valid, as there can be spatial, temporal or even sequential relationships between data instances. It is on this latter assumption that we base ADT in the DICE context.

All data used by the anomaly detection techniques are queried from the monitoring platform (DMon). This means that some basic statistical operations (such as aggregations or medians) can already be integrated into the DMon query, which in some instances reduces the size of the dataset on which anomaly detection has to run.
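The effect of such query-side aggregation can be sketched locally with pandas. The metric values and the 60-second median window below are illustrative only, not data or parameters taken from DMon:

```python
import pandas as pd

# Hypothetical per-second CPU metrics, as a DMon query might return them
metrics = pd.DataFrame({
    "timestamp": pd.date_range("2017-01-01", periods=120, freq="s"),
    "cpu_usage": range(120),
})

# Aggregating to 60-second medians, as DMon can do at query time,
# shrinks the dataset before anomaly detection ever sees it
reduced = metrics.set_index("timestamp").resample("60s").median()
print(len(metrics), "->", len(reduced))
```

Here 120 raw samples collapse to 2 aggregated rows, which is exactly the kind of reduction that makes downstream anomaly detection cheaper.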

An extremely important aspect of detecting anomalies in any problem domain is the definition of the anomaly types that can be handled by the proposed method or tool. In the next paragraphs, we will give a short definition of the classification of anomalies in relation to the DICE context.

First, we have point anomalies, the simplest type of anomaly, represented by data instances that can be considered anomalous with respect to the rest of the data. Because this type of anomaly is simple to define and check, a large part of the research effort has been directed towards finding them. Our intention was to investigate these types of anomalies and include them in DICE ADT. However, as there are already many existing solutions on the market, this is not the main focus of ADT; instead, we use the Watcher solution from the ELK stack to detect point anomalies. A more interesting type of anomaly in relation to DICE is the so-called contextual anomaly. Contextual anomalies are considered anomalous only in a certain context and not otherwise. The context is a result of the structure of the data set, and thus has to be specified as part of the problem formulation.

When defining the context, we consider contextual attributes, represented by the neighbours of each instance, and behavioural attributes, which describe the value itself. In short, anomalous behaviour is determined using the values of the behavioural attributes within the specified context.
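A minimal sketch of this idea, assuming a simple rolling-window detector in which the neighbouring values form the context (the window size and threshold here are illustrative choices, not ADT's actual parameters):

```python
import numpy as np

def contextual_anomalies(values, window=5, threshold=3.0):
    """Flag points whose behavioural attribute (the value itself) deviates
    strongly from its contextual attributes (the neighbouring window)."""
    flags = []
    for i, v in enumerate(values):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        context = np.delete(values[lo:hi], i - lo)  # neighbours only
        mu, sigma = context.mean(), context.std()
        flags.append(sigma > 0 and abs(v - mu) > threshold * sigma)
    return flags

series = np.array([10.0, 11, 10, 12, 11, 50, 10, 11, 12, 10])
print([i for i, f in enumerate(contextual_anomalies(series)) if f])
```

The value 50 is only anomalous relative to its neighbours; in a series that routinely reached 50 it would pass unflagged, which is precisely what distinguishes contextual from point anomalies.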

The last type of anomaly is the collective anomaly. These occur when a collection of related data instances is anomalous with respect to the entire data set, even though the individual data instances are not anomalous by themselves. Typically, collective anomalies are related to sequence data and can only occur if data instances are related.
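One way to sketch this is to score whole windows of a sequence rather than single points. In the hypothetical series below, every value lies in the ordinary range 9-12, yet the sustained run of 12s is collectively unusual (the window size and threshold are again illustrative):

```python
import numpy as np

def collective_anomalies(values, window=4, threshold=2.0):
    """Flag windows whose collective level deviates from the global mean,
    even when every individual point lies within the normal range."""
    mu, sigma = values.mean(), values.std()
    hits = []
    for i in range(len(values) - window + 1):
        w = values[i:i + window]
        # each point alone is unremarkable, but a sustained run is not
        if abs(w.mean() - mu) > threshold * sigma / np.sqrt(window):
            hits.append((i, i + window))
    return hits

values = np.array([10, 11, 9, 10, 11, 9, 10, 11, 9, 10,
                   12, 12, 12, 12,
                   9, 10, 11, 9, 10, 11], dtype=float)
hits = collective_anomalies(values)
print(hits)
```

No single 12 would be flagged by a point detector, but the windows covering the run stand out against the rest of the sequence.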

Motivations
During the development phases of an application, performance bottlenecks as well as unwanted or undocumented behaviours are commonplace. A simple monitoring solution is not enough, as interpreting the metrics usually requires expert knowledge of the underlying platform architecture. This is especially true in DevOps environments. The DICE anomaly detection tool contains both supervised and unsupervised methods that are able to flag undesired application behaviours.

The unsupervised methods are useful during the initial stages of a newly developed application, when developers have a hard time deciding whether a particular performance profile is normal for the application. Supervised methods, on the other hand, can be used by developers to detect contextual anomalies between application versions. In contrast to the unsupervised methods, they require a training set so that they can be trained to detect different anomalous instances.

Existing solutions
There is a wide range of anomaly detection methods currently in use. These can be split into two distinct categories based on how they are trained. First, there are the supervised methods. In essence, these can be treated as classification problems in which the goal is to train a categorical classifier that is able to output a hypothesis about the anomalousness of any given data instance. These classifiers can be trained to distinguish between normal and anomalous data instances in a given feature space. Such methods make no assumptions about the generative distribution of the event data; they are purely data driven. Because of this, the quality of the data is extremely important.

For supervised learning methods, labelled anomalies from application data instances are a prerequisite. The frequency of false positives is high in some instances; this can be mitigated by comprehensive validation and testing, although the computational complexity of validation and testing can be substantial and represents a significant challenge, which has been taken into consideration in the ADT tool. Methods used for supervised anomaly detection include, but are not limited to: Neural Networks, Neural Trees, ART1, Radial Basis Functions, SVM, Association Rules and Deep Learning based techniques.

In unsupervised anomaly detection methods, the base assumption is that normal data instances are grouped in clusters in the data, while anomalies do not belong to any cluster. This assumption is used in most clustering based methods, such as DBSCAN, ROCK, SNN, FindOut and WaveCluster. A second assumption, on which K-Means, SOM and Expectation Maximization (EM) algorithms are based, is that normal data instances belong to large and dense clusters, while anomalies belong to small and sparse ones. It is easy to see that the effectiveness of each of these unsupervised, or clustering based, methods largely depends on how well the individual algorithm captures the structure of the normal data instances.
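The first clustering assumption can be illustrated with scikit-learn's DBSCAN, which labels points belonging to no dense cluster as noise. The data, `eps` and `min_samples` values below are purely illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense clusters of normal instances plus one far-away point
rng = np.random.default_rng(0)
normal = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
data = np.vstack([normal, [[10.0, 10.0]]])

# DBSCAN marks points that fit in no dense cluster as noise (label -1);
# under the clustering assumption, those are the anomaly candidates
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(data)
anomalies = np.where(labels == -1)[0]
print(anomalies)
```

Note that DBSCAN never sees an "anomaly" label; flagging the noise points is a by-product of clustering, which is exactly the caveat raised below.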

It is important to note that these types of methods were not designed with anomaly detection in mind; the detection of anomalies is more often than not a by-product of clustering. Also, the computational complexity of clustering based techniques can be a serious issue, and careful selection of the distance measure used is a key factor.

How the tool works
The ADT is made up of a series of interconnected components controlled from a simple command line interface. This interface is meant only for the initial version of the tool; future versions will feature a more user friendly interface. The full architecture can be viewed in the architecture figure from this section. In total, there are 8 components that make up ADT. The general architecture is meant to encompass each of the main functionalities and requirements identified in the requirements deliverables.

First, we have the data-connector component, which is used to connect to DMon. It is able to query the monitoring platform and also send it new data. This data can be detected anomalies or learned models. For each of these types of data, data-connector creates a different index inside DMon. For anomalies, it creates an index of the form anomaly-UTC, where UTC stands for Unix time, similarly to how the monitoring platform deals with metrics and their indices. This means that the index is rotated every 24 hours.
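The daily rotation implies that each anomaly index is named after a fixed point of its day. A hypothetical sketch of such a naming scheme, assuming the index is keyed by the Unix time of that day's UTC midnight (the text only states the `anomaly-UTC` pattern and the 24-hour rotation; the truncation rule below is our assumption):

```python
def anomaly_index(ts):
    """Hypothetical naming sketch: one index per UTC day, identified by
    the Unix time of that day's midnight (rotated every 24 hours)."""
    day_start = ts - ts % 86400  # truncate seconds-since-epoch to UTC midnight
    return "anomaly-%d" % day_start

print(anomaly_index(1500000000))  # an instant on 2017-07-14 UTC
```

Any two timestamps falling on the same UTC day map to the same index name, so all of a day's anomalies land in one index.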

After the monitoring platform is queried, the resulting dataset can be in JSON, CSV or RDF/XML. However, in some situations, additional formatting is required. This is done by the data formatter component, which is able to normalize the data, filter different features from the dataset or even window the data. The type of formatting a dataset may or may not need is highly dependent on the anomaly detection method used. The feature selection component is used to reduce the dimensionality of the dataset.
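Two of these formatting steps, windowing and normalization, can be sketched as follows. This is an illustrative sketch, not the data formatter's actual implementation:

```python
import numpy as np

def window(data, size, step=1):
    """Slice a metric series into fixed-size, possibly overlapping windows."""
    return np.array([data[i:i + size]
                     for i in range(0, len(data) - size + 1, step)])

def normalize(data):
    """Scale each feature column to [0, 1]; constant features stay at 0."""
    span = data.max(axis=0) - data.min(axis=0)
    span[span == 0] = 1  # avoid division by zero on constant columns
    return (data - data.min(axis=0)) / span

series = np.arange(10.0)
print(window(series, size=4, step=2).shape)
```

Which of these steps a dataset needs depends, as noted above, on the anomaly detection method that will consume it: distance-based methods typically want normalized features, while sequence-oriented methods want windows.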

Not all features of a dataset may be needed to train a predictive model for anomaly detection, so in some situations it is important to have a mechanism that allows the selection of only those features that have a significant impact on the performance of the anomaly detection methods. Currently, only two types of feature selection are supported: Principal Component Analysis (from Weka) and Wrapper Methods.
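The effect of PCA-based reduction can be sketched with scikit-learn (ADT itself uses Weka's implementation; the synthetic data and the 95% variance cutoff below are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 200 samples with 6 metric features, but only 3 directions of real variance
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 6))

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(X)
print(X.shape, "->", reduced.shape)
```

The redundant dimensions are dropped, which both speeds up training and removes features with no significant impact on detection performance.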

The next two components are used for training and then validating predictive models for anomaly detection. For training, a user must first select the desired type of method. The dataset is then split into training and validation subsets, which are later used for cross validation. The ratio of validation to training size can be set during this phase, as can the parameters of each method. Validation is handled by a specialized component which minimizes the risk of overfitting the model and ensures that out-of-sample performance is adequate. It does this by using cross validation and by comparing the performance of the current model with past ones.
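The split-then-cross-validate flow can be sketched with scikit-learn. The classifier, split ratio and fold count are illustrative stand-ins, not ADT's actual defaults:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a labelled anomaly dataset (10% anomalous)
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Hold out a validation subset; the ratio is configurable, as in ADT
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Cross-validation on the training subset guards against overfitting
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X_train, y_train, cv=5)
print("validation accuracy: %.2f, cv mean: %.2f"
      % (model.score(X_val, y_val), scores.mean()))
```

A large gap between the cross-validation mean and the held-out score is the overfitting signal the validation component watches for.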

Once validation is complete, the model exporter component transforms the current model into a serialized, loadable form. We use the PMML format wherever possible in order to ensure compatibility with as many machine learning frameworks as possible, which makes the use of ADT in a production-like environment much easier. Not all models are currently exportable in this format; in particular, neural network based models are not compatible. The resulting model can be fed into DMon. In fact, the core services of DMon (specifically Elasticsearch) have the role of a serving layer in a lambda architecture. Both detected anomalies and trained models are stored in DMon and can be queried directly from the monitoring platform. In essence, this means that other tools from the DICE toolchain only need to know the DMon endpoint in order to see what anomalies have been detected.

Furthermore, the training and validation scenarios are in fact the batch layer, while unsupervised methods and/or loaded predictive models form the speed layer. Both of these scenarios can be accomplished by ADT. This integration is an open challenge detailed in the next section.

The last component is the anomaly detection engine, which is responsible for detecting anomalies. It is important to note that, while it is able to detect anomalies, it is unable to communicate them to the serving layer (i.e. DMon) by itself; it uses the dmon-connector component to accomplish this. The anomaly detection engine is also able to handle unsupervised learning methods. We can see from the architecture figure that the anomaly detection engine is in some ways a subcomponent of the model selector, which selects both pre-trained predictive models and unsupervised methods.

Open Challenges
Currently, the anomaly detection tool relies on state-of-the-art techniques for classification and anomaly detection. However, as this field is constantly evolving, the integration of new algorithms will be required. This integration is made easy by the internal architecture of the tool. Furthermore, several methods dealing with extremely unbalanced datasets still have to be explored. Some of the future improvements/challenges are:


 * Inclusion of Oversampling and Undersampling techniques such as SMOTE and ADASYN
 * Distributed hyperparameter optimization (i.e. running on Spark)
 * Bayesian hyperparameter optimization
 * Stacking and Bagging meta learning algorithms/methods
 * Usage of deep learning techniques (we currently use the Keras library with a TensorFlow backend for training neural networks, but no deep learning topologies have been tested yet)

Application domains
Because of its close integration with the monitoring platform (DMon), the anomaly detection tool can be applied to any platform or application supported by it. For a Storm based DIA, the anomaly detection tool queries DMon for all performance metrics. These metrics can be queried per deployed Storm topology. The workflow for utilising the tool is as follows:
 * Query the data
 * Aggregate different data sources (for example system metrics with Storm metrics)
 * Filter the data
 * Specify the anomaly detection method to use
 * Set training or detection mode (pre-trained model has to be selected by user)
 * For supervised methods a training data set has to be supplied by the user (including target values)
 * If no target value is provided, ADT considers the last column as target.
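The last two steps of this workflow can be sketched with pandas and scikit-learn: given a training set whose final column is the target, split off that column and train a supervised model. The metric names, values and the decision-tree choice below are illustrative, not ADT internals:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# A training set as ADT might receive it: metric columns plus a final
# column marking each instance as normal (0) or anomalous (1)
df = pd.DataFrame({
    "cpu":     [0.2, 0.3, 0.25, 0.9, 0.28, 0.95],
    "latency": [10,  12,  11,   80,  10,   90],
    "target":  [0,   0,   0,    1,   0,    1],
})

# No explicit target supplied: take the last column as the label
X, y = df.iloc[:, :-1], df.iloc[:, -1]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

suspect = pd.DataFrame([[0.92, 85]], columns=["cpu", "latency"])
print(model.predict(suspect))
```

Once trained, the model can be switched into detection mode and applied to freshly queried, identically formatted metrics.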

The resulting anomalies are stored inside DMon in a separate index. Consequently, anomalies can be queried and visualised like any other metric inside DMon. As mentioned before, because of the tight integration with DMon and the metric agnostic nature of ADT, applying it to other tools supported by DMon and DICE requires the same steps as for Storm.

It is also worth mentioning that ADT constructs the training and testing sets based on the structure from DMon. This means that as long as DMon has a flat representation of the metrics, ADT can use them.

Conclusion
The Anomaly detection tool developed during DICE is able to use both supervised and unsupervised methods. In conjunction with the DMon monitoring platform, it forms a lambda architecture that is able to both detect potential anomalies as well as continuously train new predictive models (both classifiers and clusterers). It is able to successfully detect performance related anomalies, thanks in part to the integrated preprocessing, training and validation modules.

It is also important to point out that the anomaly detection tool is in essence technology agnostic. It doesn't require a priori knowledge about the underlying technologies; the only context it is given is the one inherent in the structure of the training/validation dataset. This makes it ideal for novice users developing DIAs.