Practical DevOps for Big Data/Configuration Optimisation

Introduction
Finding optimal configurations for data-intensive applications is a challenging problem due to the large number of parameters that can influence their performance and the lack of analytical models to anticipate the effect of a change. To tackle this issue, DevOps can incorporate tuning methods where an experimenter is given a limited budget of experiments and needs to carefully allocate this budget to find optimal configurations. In this section we discuss a methodology that could be used to implement configuration optimization in DevOps.

Motivation
Once a data-intensive system approaches its final stages of deployment for shipping to the end users, it becomes increasingly important to tune its performance and reliability. Tuning can also be important in initial prototypes, to determine bounds on achievable performance. In either case, this is a time-consuming operation, since the number of candidate configurations can grow very large. For example, optimizing 10 configuration options that each take ON/OFF values would require, in the worst case, 2^10 = 1024 experiments to determine the globally optimal configuration. As data-intensive applications may feature multiple technologies simultaneously, for example a data analysis pipeline consisting of Apache Spark and Kafka, it becomes increasingly difficult to configure such systems optimally. Sub-optimal configurations imply that a technology is more expensive to operate and may not deliver the intended performance, forcing the service provider to supply additional hardware resources. Configuration optimization tools are therefore sought in DevOps to guide this configuration phase and quickly find a configuration that is reasonably close to optimal.

Existing Solutions
A number of mathematical methods and approaches can currently be used in the search for an optimal configuration.

A classic family of mathematical methods consists of design of experiments (DoE) methods, which fit a nonlinear polynomial to response data to predict system response in unexplored configurations. Such methods provide a way to reduce the number of experiments needed to fit the response model in particular cases, such as those where configuration options are binary valued (e.g., ON/OFF). It is however difficult to consider tens of parameters with these methods, which are often limited to much smaller dimensionality (8-10).

Related methods are those developed in statistics and machine learning to address the multi-armed bandit problem. This is a general problem that abstracts the allocation of a finite set of resources under incomplete information about the rewards that can be obtained from each decision. Such situations arise commonly in configuration optimization, where it is not clear in advance how changing the level of a configuration option will impact the system. The crucial trade-off is deciding how much time to spend on exploitation versus exploration: the former refers to focusing on the option with the largest expected payoff, while the latter refers to improving the knowledge of the other options to refine the estimates of their expected payoffs. The method proposed in this chapter, called Bayesian optimization, belongs to this class of solution techniques and offers flexibility in choosing this trade-off.
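The exploration/exploitation trade-off can be illustrated with a minimal sketch of the UCB1 bandit rule. This is a generic textbook rule, not the method used by the tool discussed in this chapter, and the three options with their noisy rewards are hypothetical.

```python
import math
import random

def ucb1(reward_fns, budget, seed=0):
    """Allocate `budget` experiments across options using the UCB1 rule."""
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    counts = [0] * n_arms      # experiments spent on each option
    totals = [0.0] * n_arms    # sum of observed rewards per option
    for t in range(1, budget + 1):
        if t <= n_arms:
            arm = t - 1        # try every option once first
        else:
            # mean reward (exploitation) + uncertainty bonus (exploration)
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        counts[arm] += 1
        totals[arm] += reward_fns[arm](rng)
    return counts

# Hypothetical options: each returns a noisy reward clipped to [0, 1].
def make_option(mean):
    return lambda rng: min(1.0, max(0.0, rng.gauss(mean, 0.1)))

options = [make_option(0.2), make_option(0.5), make_option(0.9)]
counts = ucb1(options, budget=300)
print(counts)  # number of experiments spent on each option
```

The bonus term shrinks as an option is tested more often, so the budget gradually concentrates on the option with the highest observed payoff while the others are still sampled occasionally.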

Several commercial services for performance optimization exist in the market, e.g.:
 * PragmaticWorks
 * Oracle
 * HP
 * Centaurea

These services depend in part on the expertise and skills of consultants. The solution pursued here is algorithmic, and very few comparable methods exist for Big Data frameworks. For example, http://unraveldata.com/ offers a Big Data management solution involving auto-tuning for Big Data applications. No public information exists on the underpinning auto-tuning methods, and the website is rather vague, suggesting that this is not the main feature of the product. Conversely, the solution discussed here is entirely tailored to configuration and features innovative methods based on Gaussian processes that have been shown experimentally to be more effective than the scientific state of the art.

How the tool works
Configuration optimization (CO) uses a DevOps toolchain to iteratively run tests on the Big Data application. At each iteration, a new configuration is tested using a load generator and performance data is collected. This data is used to train machine learning models to decide the configuration to evaluate the system at the next iteration. Using this decision, the testing cycle is iterated, until a configuration that achieves performance objectives is found.
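The testing cycle described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `fake_load_test` and `first_untested` are hypothetical stand-ins for the load generator and for the machine learning model that picks the next configuration, and the latency formula is synthetic.

```python
def run_tuning_cycle(candidates, run_load_test, choose_next, target, budget):
    """Test configurations one by one until the target is met or the budget ends."""
    history = {}  # configuration -> measured latency
    while len(history) < budget:
        config = choose_next(candidates, history)
        if config is None:          # nothing left to try
            break
        history[config] = run_load_test(config)
        if history[config] <= target:
            break                   # performance objective reached
    best = min(history, key=history.get)
    return best, history

# Hypothetical stand-in for deploying a configuration and running a load test.
def fake_load_test(config):
    workers, batch = config
    return 200.0 / workers + abs(batch - 64) * 0.5   # synthetic latency in ms

# Hypothetical placeholder for the learned model: picks the first untested point.
def first_untested(candidates, history):
    return next((c for c in candidates if c not in history), None)

candidates = [(w, b) for w in (1, 2, 4, 8) for b in (16, 64, 256)]
best, history = run_tuning_cycle(candidates, fake_load_test, first_untested,
                                 target=30.0, budget=20)
print(best, history[best], len(history))
```

Replacing the placeholder selection rule with a model-driven one is what reduces the number of tests needed before the objective is reached.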

The challenge for a testing system is to define an efficient search algorithm, which can automatically decide what experiments to run in order to quickly tune the system. The fewer experiments needed to tune the system, the more cost-effective the process will be. The main issue is that the system behavior is hard to predict: the influence of a configuration parameter on performance may not be known until load tests are carried out on the application after that parameter is changed.

BO4CO is an implementation of CO produced in DICE that optimizes configurations using a technique known as Bayesian optimization. Bayesian optimization is a machine learning methodology for black-box global optimization. In this methodology, the unknown response of a system to a new configuration is modeled using a Gaussian process (GP), an important class of machine learning models. GPs are used to predict the response of the platform to changes in configuration. GPs can take into account the mean and confidence intervals associated with measurements, predict system behavior in unexplored configurations, and be re-trained quickly to accommodate new data. More importantly, the Bayesian optimization method can be much faster on real systems than existing auto-tuning techniques.
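To make the role of the GP concrete, the following is a minimal sketch of GP prediction in plain Python. It is not the BO4CO implementation: the zero-mean prior, the squared-exponential kernel, the single scalar configuration option and the normalized latency values are all assumptions made for illustration.

```python
import math

def rbf(x1, x2, length=1.0, variance=1.0):
    """Squared-exponential covariance between two scalar configurations."""
    return variance * math.exp(-0.5 * ((x1 - x2) / length) ** 2)

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def gp_predict(xs, ys, x_new, noise=1e-6, length=1.0):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    k = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new, length) for a in xs]
    alpha = solve(k, ys)        # K^-1 y
    v = solve(k, k_star)        # K^-1 k*
    mean = sum(ks * al for ks, al in zip(k_star, alpha))
    var = rbf(x_new, x_new, length) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, max(var, 0.0)

# Hypothetical normalized latency measured at three levels of one option.
xs, ys = [1.0, 2.0, 4.0], [0.5, -0.2, 0.3]
mean_seen, var_seen = gp_predict(xs, ys, 2.0)    # an already tested level
mean_far, var_far = gp_predict(xs, ys, 50.0)     # far from all observations
print(mean_seen, var_seen, mean_far, var_far)
```

At a tested level the prediction reproduces the measurement with almost no uncertainty, while far from the data the variance returns to its prior value. This uncertainty estimate is what guides the search towards unexplored configurations.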

CO methodological steps
The logic of the CO tool is iterative. At each iteration, the configuration optimizer uses the BO4CO algorithm to select the configuration to test next. BO4CO estimates the response surface of the system from the observed performance data and then uses this estimate to search for points that have a good chance of being the optimum configuration. This is illustrated on the left-hand side of the figure below, which shows these steps as defining the configuration optimization process.
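The selection step can be sketched with a lower confidence bound (LCB) criterion, a common choice in Bayesian optimization when the objective is minimized. The text above does not state which criterion BO4CO uses, so LCB is an assumption here, and the configuration names and predicted values are hypothetical.

```python
import math

def lower_confidence_bound(mean, std, kappa=2.0):
    """Optimistic latency estimate: small when the prediction is low or uncertain."""
    return mean - kappa * std

def select_next(candidates, predictions, tested, kappa=2.0):
    """Return the untested configuration with the smallest LCB."""
    best, best_score = None, math.inf
    for config in candidates:
        if config in tested:
            continue
        mean, std = predictions[config]
        score = lower_confidence_bound(mean, std, kappa)
        if score < best_score:
            best, best_score = config, score
    return best

# Hypothetical predicted latency (mean, std) in ms for four configurations.
predictions = {
    ("workers=2", "ackers=1"): (98.0, 1.0),
    ("workers=4", "ackers=1"): (95.0, 4.0),
    ("workers=4", "ackers=2"): (100.0, 20.0),
    ("workers=8", "ackers=2"): (140.0, 10.0),
}
tested = {("workers=4", "ackers=1")}
candidates = list(predictions)

explore = select_next(candidates, predictions, tested, kappa=2.0)
exploit = select_next(candidates, predictions, tested, kappa=0.0)
print(explore, exploit)
```

With `kappa=2.0` the uncertain configuration is preferred because it might turn out to be much better than predicted. With `kappa=0.0` the choice falls back to the lowest predicted mean. The parameter therefore controls the balance between exploration and exploitation.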



The right-hand side of the above figure recalls the important fact that the response models based on GPs provide the ability to query the performance of the application in untested configurations, thus providing a mechanism to guide the CO search. In particular, after making a new test it is possible to assess the accuracy of the model and supply a new observation to it, to refine its accuracy for future use. In the DICE implementation of the methodology, there is tool support to automatically perform training and prediction with GPs, in addition to a GUI-based specification of the boundaries of the configuration options.

Open challenges
A methodology such as CO could be applied in principle also to migration problems, but at present it has only been used in the context of optimizing a new data-intensive application. Many companies and public sector organizations are progressively migrating their applications to emerging Big Data frameworks such as Spark, Cassandra, Storm, and Hadoop. Three research challenges may be envisioned:


 * 1) Auto-tuning methods based on machine learning are not customized for individual Big Data technologies, leading to sub-optimal experimental time and cost, a problem that worsens when multiple technologies are used together.
 * 2) Performance auto-tuning lacks validation in industrial Big Data contexts, since current research focuses on academic and lab testbeds.
 * 3) Big Data systems require continuous auto-tuning, since workload intensities and data volumes grow continuously.

Storm configuration
The CO methodology has been tested against different Apache Storm topologies. Compared with the default values, the configurations found by CO improve the end latency (the difference between the time a job arrives at the topology and the time it has been processed and leaves the topology) by up to 3 orders of magnitude. The study shows that the tool finds the optimum configuration within the first 100 iterations. The full factorial combination of the configuration parameters in this experiment is 3840, so 100 experiments correspond to about 2.6% of the total. Note that, in order to measure how far the configuration found by the configuration optimization tool is from the optimum, we performed the full factorial experiments (i.e., 3840 experiments of 8 minutes each, for a total of 3840*8/60/24 ≈ 21 days). Further application to a real-world social media platform can be found in.
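The experiment budget can be checked with a few lines of arithmetic, using the 3840 configurations, 8 minutes per experiment and 100 iterations stated above. The variable names are chosen here for illustration.

```python
total_configs = 3840            # full factorial size of the Storm experiment
minutes_per_experiment = 8
tested_by_co = 100              # iterations needed by the CO tool

full_factorial_days = total_configs * minutes_per_experiment / 60 / 24
co_hours = tested_by_co * minutes_per_experiment / 60
fraction_tested = tested_by_co / total_configs

print(f"full factorial: {full_factorial_days:.1f} days")
print(f"CO budget:      {co_hours:.1f} hours")
print(f"share tested:   {fraction_tested:.1%}")
```

Under these figures, an exhaustive search takes roughly three weeks, whereas 100 iterations fit in well under a day of testing.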

Cassandra configuration
A validation study has also been performed on optimizing Cassandra read latency. A variant of the BO4CO algorithm called TL4CO has also been considered. TL4CO integrates into the Bayesian optimization method a technique called transfer learning, which allows historical data from past configuration optimization cycles to be re-used to accelerate new configuration cycles. More details about TL4CO can be found in. The experiments collected measurements of latency versus throughput for both read and write operations in a 20-parameter configuration space. The configurations that are found by the CO tool (with TL4CO and BO4CO algorithms), the default settings and the one prescribed by experts are annotated. The results have shown that the configuration that the CO tool with TL4CO initialization finds after only 20 iterations results in a slightly lower latency but much higher throughput compared to the one suggested by the experts.

Conclusion
This chapter has discussed the problem of automatically optimizing the configuration of data-intensive applications. Technologies such as Storm exemplify the challenge, as they require jointly optimizing several tens of configuration parameters without a good understanding of how each parameter level interacts with the other parameters. The CO DevOps tool illustrates a possible solution, in which iterative experiments are conducted until an optimal configuration is found. Bayesian optimization, a black-box optimization technique, can considerably accelerate the configuration task compared to standard experimental methods from statistics, such as response surfaces and experimental designs.