Practical DevOps for Big Data/Fault Injection

Introduction
The operation of data intensive applications almost always requires dealing with various failures. During the development of an application, tests therefore have to be made to assess the reliability and resilience of the system. These tests examine the ability of a system to cope with faults and highlight any vulnerable areas. The fault-injection tool (FIT) allows users to generate faults on their Virtual Machines, giving them a means to test the resiliency of their installation. Using this approach, designers can apply robust testing and identify vulnerable areas to inspect before the application reaches a commercial environment. Users or application owners can test and understand how their application design or deployment behaves in the event of a cloud failure or outage, which allows risk to be mitigated in advance of a cloud based deployment.

Motivation
Current and projected growth in the big data market provides three distinct targets for the tool. Data Centre owners, cloud service providers and application owners are all potential beneficiaries due to their data intensive requirements. The resilience of the underlying infrastructure is crucial to these areas. Data Centre owners can gauge the stress levels of different parts of their infrastructure and thus offer advice to their customers, address bottlenecks or even adapt the pricing of various levels of assurances. For developers FIT provides the missing and essential service of evaluating the resiliency and dependability of their applications, which can only be demonstrated in the application’s runtime by deliberately introducing faults. By designing the FIT to be a lightweight and versatile tool it is trivial to use it during Continuous Integration or within another tool for running complex failure scenarios. Used in conjunction with other tools not within the scope of this report, FIT could monitor and evaluate the effect of various faults on an application and provide feedback to the developers on application design.

Existing solutions

 * DOCTOR (IntegrateD SOftware Fault InjeCTiOn EnviRonment) allows injection of memory and register faults, as well as network communication faults. It uses a combination of time-out, trap and code modification. Time-out triggers inject transient memory faults and traps inject transient emulated hardware failures, such as register corruption. Code modification is used to inject permanent faults.


 * Orchestra is a script driven fault injector which is based around Network Level Fault Injection. Its primary use is the evaluation and validation of the fault-tolerance and timing characteristics of distributed protocols. Orchestra was initially developed for the Mach Operating System and uses certain features of this platform to compensate for latencies introduced by the fault injector. It has also been successfully ported to other operating systems.


 * Xception is designed to take advantage of the advanced debugging features available on many modern processors. It is written to require no modification of system source and no insertion of software traps, since the processor's exception handling capabilities trigger fault injection. These triggers are based around accesses to specific memory locations. Such accesses could be either for data or fetching instructions. It is therefore possible to accurately reproduce test runs because triggers can be tied to specific events, instead of timeouts.


 * Grid-FIT (Grid – Fault Injection Technology) is a dependability assessment method and tool for assessing Grid services by fault injection. Grid-FIT is derived from an earlier fault injector WS-FIT which was targeted towards Java Web Services implemented using Apache Axis transport. Grid-FIT utilises a novel fault injection mechanism that allows network level fault injection to be used to give a level of control similar to Code Insertion fault injection whilst being less invasive.


 * LFI (Library-level Fault Injector) is an automatic testing tool suite, used to simulate in a controlled testing environment, exceptional situations that programs need to handle at runtime but that are not easy to check via input testing alone. LFI automatically identifies the errors exposed by shared libraries, finds potentially buggy error recovery code in program binaries and injects the desired faults at the boundary between shared libraries and applications.

Design
The FIT has been designed as part of a best-practice strategy for ensuring that data intensive applications are reliable, allowing rigorous testing of applications both during development and after deployment. It gives cloud platform owners and application VM owners a means to test the resiliency of a cloud installation and its applications by generating faults at the cloud level. The DICE FIT has also been designed and developed in a modular fashion, which allows any function that carries out faults to be replaced and the tool to be extended as required. The FIT was designed to be as lightweight on the VMs as possible, so only existing, well-tested tools are used for causing faults. The FIT downloads, installs and configures only what is required at the time, so no unnecessary tools or dependencies are installed. FIT allows VM admins, application owners and cloud admins to generate various faults. It runs independently of, and externally to, any target environment. Figure 1 illustrates the architecture.

Operation
The FIT is designed to work from the command line or through a Graphical User Interface. The user can invoke actions which connect to the target VM and automatically install any required tools and dependencies or invoke the required APIs. The command line switches and parameters allow users to select a specific fault and its parameters, such as the amount of RAM to use or which services to stop. An example command line call to connect to a node using SSH and cause memory stress with 2GB is as follows:

This call was run on a real cloud system. The tool connected via SSH and determined the OS version by checking /etc/*-release, which reported Ubuntu in this case. It then selected the memory stress tool suitable for Ubuntu, which is Memtester. Finally the FIT called Memtester to saturate memory on the target node. Figure 2 shows available memory on the target node before (left) and during FIT's invocation (right) as detected by a monitoring tool, where it can be seen that nearly all 2GB of available RAM had been saturated.
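The sequence described above (detect the OS, pick a matching memory stress tool, then run it) can be sketched in Python. This is only an illustration and not the FIT's own code: the function names, the install commands and the assumption that Memtester is available from the configured package repository are all hypothetical. The only external interface relied on is Memtester's documented `memtester <size> [loops]` invocation.

```python
# Illustrative sketch only: names and the install-command mapping are
# assumptions, not taken from the FIT source code.

def parse_os_release(text):
    """Parse KEY=value lines in the style of /etc/os-release into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info

# Assumed install command per distribution ID (hypothetical mapping).
INSTALL_COMMANDS = {
    "ubuntu": "sudo apt-get install -y memtester",
    "centos": "sudo yum install -y memtester",
}

def memory_stress_commands(os_release_text, megabytes, loops=1):
    """Return the shell commands to install and then run Memtester."""
    distro = parse_os_release(os_release_text).get("ID", "").lower()
    if distro not in INSTALL_COMMANDS:
        raise ValueError("unsupported distribution: %r" % distro)
    return [INSTALL_COMMANDS[distro], "memtester %dM %d" % (megabytes, loops)]

if __name__ == "__main__":
    sample = 'NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="14.04"\n'
    for command in memory_stress_commands(sample, 2048):
        print(command)
```

Keeping the distribution lookup in a table mirrors the modular intent described in the Design section: supporting another OS would only require one more table entry.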

To access the VM level and issue commands, the DICE FIT uses JSCH to SSH to the Virtual Machines. By using JSCH the tool is able to connect to any VM that has SSH enabled and can then issue commands as a pre-defined user. This allows greater flexibility of commands as well as the installation of tools and dependencies. The DICE FIT is released under the permissive Apache Licence 2.0 and supports the OS configurations Ubuntu (tested with versions 14.04 and 15.10) and CentOS with set Repo configured and wget installed (tested on version 7).
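The FIT itself uses the Java library JSCH for this step. As a rough analogue, the following Python sketch issues a command on a remote VM by calling the system's OpenSSH client. The function names, the example address and the key file name are assumptions made for illustration, and the sketch assumes an `ssh` binary is installed locally.

```python
import subprocess

def build_ssh_command(user, host, remote_command, key_file=None, port=22):
    """Build the argument list for a non-interactive OpenSSH call."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    argv = ["ssh", "-p", str(port), "-o", "BatchMode=yes"]
    if key_file:
        argv += ["-i", key_file]
    argv += ["%s@%s" % (user, host), remote_command]
    return argv

def run_remote(user, host, remote_command, key_file=None, timeout=60):
    """Run a command on the VM and return (exit code, standard output)."""
    argv = build_ssh_command(user, host, remote_command, key_file)
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.returncode, result.stdout

if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address; vm.key is a placeholder.
    print(build_ssh_command("ubuntu", "192.0.2.10", "cat /etc/os-release", "vm.key"))
```

Separating command construction from execution makes the first part easy to check without needing a reachable VM.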

Graphical User Interface
The GUI provides the same functionality as the command line version of the tool. The GUI provides users with a visual way of interacting with the tool which can make the tool more accessible for a range of users. The user can select from the available actions from a home screen, as seen in Figure 3. Each button leads to a page where a user can enter the relevant inputs and then the fault can be executed. These inputs are the equivalent of the command line parameters.

On the CPU overload page, the user enters the details of the VM and the amount of time to run the overload for. In place of the password the user can also upload an SSH key from a file. Any feedback and output from running the fault is shown at the bottom of the page. Using the GUI is a straightforward process of selecting the desired fault and providing the VM details. In the example below a high CPU usage fault is chosen, and the address, username and password of the VM are entered. The number of CPUs on the machine is then entered, followed by the amount of time to overload the CPU for, in this case 30 seconds, as shown in Figure 4.
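The two CPU overload inputs (number of CPUs and duration) map naturally onto the standard `stress` utility, which the Validation section names as the command issued by the FIT. The sketch below shows one plausible way to turn those inputs into a command line. The function name and the exact flags chosen are assumptions, not the FIT's documented behaviour.

```python
def cpu_overload_command(cpu_count, seconds):
    """Build a `stress` command that loads cpu_count workers for a fixed time."""
    if cpu_count < 1 or seconds < 1:
        raise ValueError("cpu_count and seconds must be positive")
    # --cpu N starts N busy workers; --timeout Ns stops them after N seconds.
    return "stress --cpu %d --timeout %ds" % (cpu_count, seconds)

if __name__ == "__main__":
    # Two CPUs overloaded for 30 seconds, matching the GUI example.
    print(cpu_overload_command(2, 30))
```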

The GUI successfully complements the objectives achieved by the command line tool. It makes the fault injection tool highly accessible, allowing anyone to find vulnerabilities and test the resiliency of their systems.

Installation
The source code can be found in the DICE GitHub repository. The repository contains the source code and a WAR file so that the tool can be deployed on a server, such as Apache Tomcat. Once the WAR file is deployed on the server, the tool is immediately available and ready to use.

Open Challenges
Future work will involve a more detailed classification of faults in conjunction with an analysis of the cause, effect and response of various fault scenarios. From a design and operational perspective, an investigation will be undertaken to decide how best to integrate with related services. There is a requirement for robust monitoring, graphing and analysis to operate in conjunction with the FIT. This could mean logically packaging a single solution, using a suite of tools or building new inherent features. A first result will be a wider scope of technologies that can be stressed with workloads in a similar way to MongoDB, such as Storm, Spark and Cassandra. Further control over the timing and duration of faults will also be investigated.

Integration


A major step forward in the development of the Fault Injection Tool is the integration with other DICE tools. This work incorporates the FIT deeper within the DICE toolset, and provides further useful functionality for users of the FIT. One tool with which the FIT has been fully integrated is the DICE Deployment Service, developed by XLAB, which is shown in Figure 5. This new feature allows faults to be caused on all of the VMs which make up a deployment in a DICE virtual deployment container. The integration works through the GUI version of the FIT. First, the user must acquire a token from the deployment service in order to authenticate with the API. From there, the GUI offers an option to list all containers running on the DICE deployment service. The user can then choose the container they wish to cause faults on. A JSON file can also be uploaded to further customise the type of faults to be caused on certain VMs within the deployment. This is accomplished by matching a fault with the name of the component type that is associated with (e.g., hosted on) the VM in the application’s deployment blueprint. After these attributes are provided, the desired faults are automatically caused on all of the selected VMs inside the container, to simulate the faults occurring at an application level. As a result, the user needs to fill in the form only once for a virtual deployment container, and can then use the same information for all subsequent (re)deployments in the container. The next chapter shows the integration of FIT with the new dmon-gen utility to automate the generation of anomalies.
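The matching of faults to component types can be pictured with a small Python sketch. The JSON layout, the component type names and the `stresscpu` fault name below are all hypothetical (only `stressmem` is named elsewhere in this chapter), since the actual file format accepted by the FIT is not reproduced here.

```python
import json

def plan_faults(mapping_json, deployment_vms):
    """Pair each VM with the fault configured for its component type."""
    mapping = json.loads(mapping_json)
    plan = []
    for vm in deployment_vms:
        fault = mapping.get(vm["component_type"])
        if fault is not None:  # VMs without a matching entry are skipped
            plan.append((vm["address"], fault))
    return plan

if __name__ == "__main__":
    # Hypothetical mapping file: component type name -> fault name.
    mapping = '{"storm_worker": "stresscpu", "cassandra_seed": "stressmem"}'
    # Hypothetical VM list, as it might be derived from a deployment blueprint.
    vms = [
        {"address": "192.0.2.11", "component_type": "storm_worker"},
        {"address": "192.0.2.12", "component_type": "zookeeper"},
        {"address": "192.0.2.13", "component_type": "cassandra_seed"},
    ]
    for address, fault in plan_faults(mapping, vms):
        print(address, fault)
```

Because the plan is keyed on component type rather than on VM address, the same mapping file would remain valid across redeployments in which the addresses change.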

Additionally, there is the integration of FIT with the new dmon-gen utility to automate the generation of anomalies. Using this we can also specify FIT specific parameters for each experiment. For example we can use this to stress the CPU and memory of all of the hosts on a given platform or to use exactly 50% of the available resources.
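To make the "exactly 50% of the available resources" case concrete, the following sketch derives stress parameters from a host's CPU count and the `MemTotal` line of `/proc/meminfo`. It does not show the dmon-gen interface. The function name and the returned field names are assumptions made for illustration.

```python
def fraction_of_resources(meminfo_text, cpu_count, fraction=0.5):
    """Compute how many CPUs and how much memory (MB) a given fraction covers."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            total_kb = int(line.split()[1])  # /proc/meminfo reports kB
            break
    else:
        raise ValueError("MemTotal not found")
    return {
        "cpus": max(1, int(cpu_count * fraction)),
        "memory_mb": int(total_kb * fraction) // 1024,
    }

if __name__ == "__main__":
    sample = "MemTotal:        2097152 kB\nMemFree:          524288 kB\n"
    print(fraction_of_resources(sample, cpu_count=4))
```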

Another area in which integration is being explored is the incorporation of the monitoring software developed by IeAT into the FIT GUI. If this is feasible, it will provide users with detailed statistics and analysis of how the VMs are performing before, during and after faults have been caused. This could provide a deeper understanding of the effects of the faults caused by the FIT application, and of how the VMs and applications perform under the strain of the different faults being simulated.

The FIT can generate VM faults for use by application owners and VM admins. The tool is designed to run independently and externally to any target environment.

Validation


To validate the impact of FIT usage, the common task manager program ‘top’, found in many Unix-like operating systems, was used. Running the Linux ‘top’ command on the target VM shows the current state of its resources. Regarding CPU saturation, the %Cpu usage is measured before running the CPU overload fault. While the CPU overload is running, the %Cpu quickly rises to near 100%. The stress command issued by the FIT would typically account for around 99.2% of the available CPU capacity. Regarding memory saturation, the FIT ‘stressmem’ feature is called with designated parameters. The tool first connects via SSH to the VM and determines the OS version by checking /etc/*-release (Ubuntu in our case). It then looks for the memory stress tool suitable for Ubuntu, for example Memtester. If the tool is not found, the DICE FIT first installs it along with its dependencies. Finally, the FIT calls Memtester to saturate memory on the target node. Again, a standard monitoring tool such as ‘top’ (in the %MEM column) will show the 2GB of RAM available to the VM (or whatever was specified in the memory size parameter of stressmem) being saturated. The use of standard measurement tools already bundled with the OS makes it easy for users to confirm that the injected fault is having the desired effect.

In the following example, using the Linux ‘top’ command on the target VM the current state of its resources can be seen. Before running the CPU overload fault, the %Cpu usage is at 0.7% as seen in Figure 6. While running the CPU overload the %Cpu quickly rises to 100%. In the processes below we can see that the stress command is using 99.2% of the CPU as shown in Figure 7.
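The same check can be automated by reading the summary line that `top` prints in batch mode (for example via `top -bn1`). The sketch below is a minimal parser, assuming the procps-style `%Cpu(s):` line format with a period as the decimal separator. The function name is hypothetical, and the sample line is constructed to mirror the 0.7% figure quoted above rather than copied from a real run.

```python
import re

def cpu_usage_from_top(top_output):
    """Return total CPU usage in percent, computed as 100 minus the idle share."""
    for line in top_output.splitlines():
        if line.startswith("%Cpu"):
            match = re.search(r"([\d.]+)\s*id", line)
            if match:
                return round(100.0 - float(match.group(1)), 1)
    raise ValueError("no %Cpu summary line found")

if __name__ == "__main__":
    # Constructed sample, not captured output.
    before = "%Cpu(s):  0.7 us,  0.0 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st"
    print(cpu_usage_from_top(before))
```

Comparing the value returned before and during a fault gives a simple pass/fail signal that the injected load is actually reaching the target VM.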

Conclusion
The main advantages of the FIT in comparison to other solutions are its availability as an open source solution, a command line version that makes it easy to integrate with other tools, a GUI that makes it easy to use for the non-expert user, and well documented instructions. An important differentiator is the cloud-agnostic nature of the tool and its ability to be used consistently on multiple target environments. Future work will focus on the difficulties and possibilities of extensibility by external users, on investigating limitations, on considering different topologies, operating systems and vendor-agnostic cloud provider infrastructure, and on evaluating the overhead of operation. Containerised environments will also be considered as future FIT targets, to help understand the effect on microservices of injecting faults into the underlying host as well as the integrity of the containerised deployment. In the longer term, other target cloud APIs could be added to the FIT and related academic work could be investigated.