Practical DevOps for Big Data/Quality Verification

Introduction
The analysis of system correctness is fundamental to produce systems whose runtime behavior is guaranteed to be correct. However, the notion of correctness is general and needs to be refined to suit the specific scenario that embeds such a system. To this end, appropriate criteria has to be considered to define what correctness is, according to the kind of system that is taken into account and to the user requirements that the implemented system should exhibit at runtime. Verification in DICE aims to define a possible meaning of correctness for DIAs and to provide tools supporting the formal analysis.

The tool that is available in DICE for the safety analysis of DIAs is D-VerT. Safety verification in DICE is performed to check, in the DIA model, the reachability of unwanted configurations, i.e., a malfunction which consists of behaviors that do not conform to some non-functional requirements specified by the Quality Engineer.

Motivations
Verification is intended to focus on the analysis of the effect of an incorrect design of timing constraints which might cause a latency in the processing of data. Therefore, the causes of the undesired behaviors that reflect in practice on the outcome of the verification can be inferred by the designers using the analysis tool.

The design of timing constraints is an aspect of the software design which is quite relevant in applications such as those considered in DICE. The unforeseen delay can actually lead the system to incorrect behaviors that might appear at runtime in various form depending on the kind of application under development. Underestimating the computational power of nodes affects the time that is needed by the node to process the input data, which, in turn, might have an effect on the whole time span of the application, as the processing delay might trigger a chain reaction. For instance, in a streaming application, the slowness of a node might cause accumulation of messages in the queues and lead possibly to memory saturation, if no procedures take action to deal with the anomaly. In a batch application, an unpredicted delay might affect the total processing time and alter the application behavior which, in turn, violates the Service Level Agreement with the customers.

Existing solutions
To the best of authors' knowledge, there are no available tools offering such peculiarities that are specifically developed for the DICE reference technologies.

How the tool works
DICE adopts model checking techniques to verify DIA specified in DTSM diagrams. A verification problem is specified by a formal description of the DIA (an annotated DTSM models that contain suitable information about the timing features of the application undergoing analysis) and a logical formula representing the property that its executions must satisfy.

The verification process in DICE relies on a fully automatic procedure that is based on dense-time temporal logic and it is realized in accordance with the bounded model-checking approach. It is designed to be carried out in an agile way: the DIA designer performs verification by using a lightweight approach. More precisely, D-VerT fosters an approach whereby formal verification is launched through interfaces that hide the complexity of the underlying models and engines. These interfaces allow the user to easily produce the formal model to be verified along with the properties to be checked and eliminates the need for experts of the formal verification techniques.

The outcome of the verification is used for refining the model of the application at design time, in case anomalies are detected by D-VerT (see use case).

The verification process
The verification process consists of the following steps. The designer draws a class diagram in the DICE IDE representing the DTSM model of the DIA and, afterward, provides all the annotations required for the analysis, based on the selected technology that she employs to implement the application. When the user starts the analysis, the annotated DTSM model is converted into a formal model that represents the abstracted behavior of the application at runtime. Based on the kind of system to implement - either a Storm application or a Spark application - the tool selects the class of properties to verify and performs the analysis. Finally, when the outcome is available, the user requests D-VerT to show the result in the DICE IDE to see if the property is fulfilled or not and, in the negative case, the trace of the system that violates it. Figure 1 shows the main steps of the D-VerT workflow: the creation through the DICE IDE of the UML DTSM model of the application, the automatic transformation from the UML model to the formal model and the actual run verification task.

The architecture
As shown in Figure 2, D-VerT has a client-server architecture: the client component is an Eclipse plugin that is fully integrated with the DICE IDE and the server component is a RESTful web server. The client component manages the transformation from the DTSM model defined by the user to an intermediate JSON object which is then used to invoke the server. The server component, based on the content of the JSON file generates the formal model which is then fed to the core satisfiability/model checking tool. This tool is in charge of verifying if the property holds for the provided set of formulae, i.e., the formal model of the system.



Open challenges
The usability of the tool is subordinated to the availability of a numerical estimation of some non-functional metrics that are needed by the users of D-VerT for annotating DTSM models that undergo verification. These values can be either collected by means of a monitoring platform tracking events from a running application, whose DTSM model is elicited from the implemented solution, or they can be estimated through designers experience or by means of historical data previously collected from other applications that have similarities with the one in analysis.

Application domain
The current version of D-VerT supports
 * bottleneck analysis for Storm topologies and
 * feasibility and boundedness analysis for Spark jobs.

In the case of Storm topologies, the bottleneck analysis helps designers to discover bottleneck nodes as their timing features are not correctly designed. A wrong design of timing constraints might actually cause anomalies such as (i) latency in processing tuples and (ii) monotonic growth of queues occupancy.

In the case of the Spark jobs, the feasibility analysis aims at checking if there exist an execution of the system whose duration is lower than a specific deadline, therefore witnessing the feasibility of such execution, while the boundedness analysis checks (making strong assumptions on the idle time of the application) whether all the possible executions of the system are below a certain threshold.

Most of the streaming and batch reference technologies can describe applications by means of graphs representing the computation performed by the application. While, in some cases, such graphs can be derived from the application defined in a declarative way (e.g. in Apache Spark and Apache Flink), other technologies, such as Apache Storm and Apache Samza, allow to directly specify the topology of the application, hence to define applications in a compositional way by combining multiple blocks of computation.

Storm
A Storm application is specified at DTSM level, by means of the topology which implements the application. A topology is a graph which is composed of two classes of nodes. In addition, a topology defines exactly the connections among the nodes, being the inter node communication based on message exchange (a message is called tuple). Therefore, for any node n, it is statically defined at design time (DTSM model) both the list of nodes subscribing to n - i.e., receiving its emitted messages - and the list of nodes that n is subscribed to.
 * 1) Input nodes (spouts) are sources of information and they do not have any role in the definition of the logic of the application. They can be described through the information related to the stream of data they inject in the topology, and the features of their computation is not relevant to define the final outcome of the topology.
 * 2) Computational nodes (bolts) elaborate input data and produce results which, in turn, are emitted towards other nodes of the topology. The bottleneck analysis focuses on bolts only.

The following notions (and parameters) are needed by the D-VerT user to perform verification.

The behavior of a bolt is defined by a sequence of five different states idle, execute, take, emit, fail with the following meaning:
 * idle: no tuples are currently processed in the bolt.
 * execute: at least one but at most Takemax tuples are currently elaborated in the bolt.
 * emit: the bolt emits tuples towards all the bolts that are subscribed.
 * take: the bolt takes at least one but at most limited number of tuples (defined by the following parallelism value) are removed from the queue and initializes a proper number of concurrent thread to process them all.
 * fail: the bolt is failed and any of the previous states cannot occur.

Each bolt has the following parameters:
 * σ is a positive real value which abstracts the functionality that is computed by a bolt. If a bolt filters tuples, i.e., the number of incoming tuples is greater than the number of outgoing tuples then 0 < σ < 1, otherwise its value is σ ≥ 0. It expresses a ratio between the size of the input stream and the outgoing stream.
 * parallelism is the maximum numbers of concurrent threads that may be instantiated in a spout/bolt. Therefore, an active spout/bolt can emit/remove a number of tuples from its queue at the same time which is equal to parallelism value.
 * α is a positive real value which represents the amount of time that a bolt requires to elaborate one tuple.
 * rebootTime (min/max) is a positive real value which represents the amount of time that a bolt requires to restore its functionality after a failure.
 * idleTime (max) is a positive real value which represents the amount of time that a bolt may be in idle state.
 * timeToFailure (min) is a positive real value which represents the amount of time between two consequent failures.

Each spout is described by the average emit rate of the tuples that are injected into the topology.

Spark
A Spark application is specified at DTSM level, by means of a Directed Acyclic Graph (DAG) of operations transforming data that are aggregated in the nodes, called stages.

The fundamental data structure in Spark is the so-called Resilient Distributed Dataset (RDD). RDDs support two types of operations:
 * Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD.
 * Actions are operations (such as reduce, count, first, and so on) that return a value obtained by executing a computation on an RDD.

Spark arranges the transformations to maximize the number of operations executed in parallel by scheduling their operations in a proper way. A stage is a sequence of transformations that are performed in parallel over many partitions of the data and that are generally concluded by a shuffle operation. Each stage is a computational entity that produces a result as soon as all its constituting operations are completed. Each stage consists of many tasks that carry out the transformations of the stage; a task is a unit of computation that is executed on a single partition of data.

The computation realized by a DAG is called job. Hence, a DAG defines the functionality of a Spark job by means of an operational workflow that specifies the dependencies among the stages manipulating RDDs. The dependency between two stages is a precedence relation; therefore, a stage can be executed only if all its predecessors have finished their computation.

At runtime, the actual computation is realized through workers, executors and tasks over a cluster of nodes. A worker is a node that can run application code in the cluster and that contains some executors, i.e., processes that run tasks and possibly store data on a worker node.

The following notions (and parameters) are needed by the D-VerT user to perform verification.
 * Latency is an estimation of the duration of the task per data partition, and with
 * Tot_cores is the number of CPU cores executing the task.
 * Tot_tasks is the total number of tasks to be executed
 * deadline is the maximum duration of the Spark application.

The analyses that can be performed are the following:
 * feasibility analysis of Spark jobs verifies whether there exist an execution of the system whose duration is lower than the deadline;
 * boundedness analysis checks whether all the possible executions of the system are below the deadline.

A test case - DigitalPebble web crawler
D-VerT is an Eclipse plugin that can be executed within the DICE IDE. To carry out formal verification tasks, Storm topologies and Spark jobs have to be specified by means of UML class diagrams and activity diagrams, respectively, conveniently annotated with the DICE::Storm and DICE::Spark profiles. The diagrams can easily be drawn by dragging and dropping Eclipse items that are available in the DICE toolbars whereas the annotations can be specified in the bottom panel of the DICE IDE.

Figure 3 depicts the Storm topology implementing the DigitalPebble web crawler. Each node, either spout or bolt, has been annotated with the parameter values that are required to perform the bottleneck analysis.



D-VerT can be set through a Run Configuration window allowing users to launch the analysis. The parameters that the user can specify are the following:
 * the file containing the model undergoing the analysis,
 * the time bound limiting the length of the executions that the solver considers in the analysis of the topologies. Being D-VerT based on bounded model checking approach, the time bound is the maximum number of distinct and consecutive events that the solver can use to produce an execution of the topology.
 * The solver plugin to be used in the analysis,
 * The set of monitored bolts that are analyzed, and
 * The IP address and the port where the D-VerT server is running.

The analysis allows for the detection of possible runs of the system leading to an unbounded growth of at least one of the bolts’ queues. If D-VerT detects a problem, it returns a trace witnessing the existence of such a run of the system—i. e., the counterexample violating the boundedness property. The trace is returned to the user both in a textual format (i.e., the bare output of underlying solver) and in a graphical format. Figure 4 shows an example of such output trace. It represents the evolution of the number of tuples in a bolt's queue over time. The trace is composed by a prefix and a suffix: the latter, highlighted by the gray background, captures the growing trend of the queue size, as it corresponds to a series of operations in the system that can be repeated infinitely many times. When no trace is detected, the result is UNSAT.



Figure 5 shows a Spark DAG implementing the WordCount application.



The Run Configuration dialog, which D-VerT shows to users, allows them to choose the kind of analysis to perform, either feasibility or boundedness, beside the parameters defining the model to be used and the server IP and port.

Spark analysis returns a boolean outcome (yes/no) which specifies whether the property is violated or not. The negative response is supplied with a counterexample proving the violation.

Conclusion
The main achievement of verification in DICE has been the development of abstract models to support safety verification of Apache Storm and Spark applications. The models have been built by means of a logical approach. They were designed to be configurable and were provided with a layered structure, which facilitates the integration with the DICE framework by decoupling the verification layer from the modeling DTSM diagrams. Ad-hoc model-to-model transformations were developed to obtain verifiable models from the user design, the latter realized through the verification toolkit in the DICE IDE.