Practical DevOps for Big Data/Review of UML Diagrams Useful for Big Data

Introduction
Documenting Big Data architectures can reuse classical notations for software architecture description, augmented with notations that isolate and identify the data-intensive nature of Big Data applications. In this vein, the DICE ecosystem offers a range of ready-to-use tools and notations that address a variety of quality issues (performance, reliability, correctness, privacy-by-design, etc.). To benefit from these tools, users have to adopt the explicit notation we have defined to support their scenario. That notation consists of specific UML diagrams enriched with specific profiles, the standard UML mechanism for designing domain-specific extensions. In our case, the mechanism was used to define the stereotypes and tagged values of the DICE Profiles, which model data-intensive constructs, features, and characteristics.

The DICE profiles tailor the UML meta-model to the domain of data-intensive applications (DIAs). For example, the generic concept of Class can be made more specific, i.e., given more semantics, by mapping it to one or more concrete Big Data notions and technical characteristics, such as compute and storage nodes (from a more abstract perspective) or Storm Nimbus nodes. Besides this expressive power, the consistency of the models behind the DICE profile is guaranteed by the meta-models and the relations among them, which we defined using the UML standard. In essence, the role of these diagrams and their respective profiles is twofold:
 * 1) Provide a high level of abstraction of concepts specific to the Big Data domain (e.g., clusters, nodes…) and to Big Data technologies (e.g., Cassandra, Spark…);
 * 2) Define a set of technical (low level) properties to be checked/evaluated by tools.

Methodological Overview
The methodological steps entailed by the activities above encompass at least the following modelling and documentation activities:


 * 1) Elaborate a component-based representation of a high-level structural architecture view of the data-intensive application (i.e., a DPIM Component Diagram). In the scope of DICE, this is done using the simple and familiar notations of a UML profile, from which the user draws the stereotypes and constructs necessary to specify the nodes of the Data-Intensive Application (source node, compute node, storage node, etc.);
 * 2) Augment the component-based representation with the property and non-functional specifications concerning that representation;
 * 3) Refine that very same component-based representation with technological decisions - the decisions themselves represent the choice of which technology shall realise which data-intensive application node. For example, a <> conceptual stereotype is associated with a <> in the DPIM architecture view;
 * 4) Associate several data-intensive technology-specific diagrams representing the technological structure and properties of each of the data-intensive nodes. These diagrams essentially “explode” the technological nodes and contain information specific to them. For example, a <> in the DPIM architecture representation can become a <> in its DTSM counterpart; the DTSM layer will then feature yet another diagram, namely a data model for the Cassandra cluster. These separate technology-specific “images” allow data-intensive application analysis and verification;
 * 5) Elaborate a deployment-specific component deployment diagram where the several technology-specific diagrams fall into place with respect to their infrastructure needs. This diagram belongs to the DDSM layer and contains all the abstractions and properties necessary to build a deployable and analysable TOSCA blueprint. Following our <> example, at this level the DTSM <> node (refined from the previous DPIM <> construct) is finally associated with a DDSM diagram where the configuration of the cluster is fully specified (i.e., type and number of VMs, allocation of software components to VMs, etc.);
 * 6) Finally, once the data-intensive deployment-specific component diagram is available, DICE deployment modelling and the connected generative technology (DICE Deployment Modelling) can be used to realise a TOSCA blueprint for that diagram.
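
The layered refinement described in the steps above can be pictured with a minimal Python sketch. It is only an illustration of the idea that each layer adds stereotypes and properties to the same node; it is not the DICE tooling or its API. The class `ModelNode`, the method `refined`, the stereotype names (`StorageNode`, `CassandraCluster`, `DeployedCluster`) and all property names and values are hypothetical placeholders, not identifiers taken from the DICE profiles.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelNode:
    """Hypothetical stand-in for a stereotyped UML element."""
    name: str
    stereotypes: List[str] = field(default_factory=list)
    properties: Dict[str, object] = field(default_factory=dict)

    def refined(self, stereotype: str, **properties) -> "ModelNode":
        """Return a new node carrying one more stereotype and extra properties."""
        return ModelNode(
            self.name,
            self.stereotypes + [stereotype],
            {**self.properties, **properties},
        )


# Steps 1-2: DPIM node with a non-functional annotation (placeholder value).
dpim_node = ModelNode("results-store", ["StorageNode"], {"max_latency_ms": 50})

# Steps 3-4: technological decision and technology-specific details (DTSM).
dtsm_node = dpim_node.refined("CassandraCluster", keyspace="example_keyspace")

# Step 5: deployment-specific configuration (DDSM), e.g. VM type and number.
ddsm_node = dtsm_node.refined("DeployedCluster", vm_type="medium", vm_count=3)

print(ddsm_node.stereotypes)
```

Each call to `refined` returns a new object, so the more abstract layers stay available unchanged while the more concrete ones accumulate detail, which mirrors how the DPIM, DTSM and DDSM views coexist.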

Existing Solutions and UML Modelling Summary
Model Driven Development (MDD) is a well-known approach that has been widely exploited in many areas of software engineering. Examples include web and mobile application development (see, for instance, the WebRatio approach) and the development of multi-cloud applications (see, for instance, the MODAClouds project, which offers a modelling approach called MODACloudsML to specify the constructs and concepts needed to model and deploy cloud applications and their infrastructure needs, e.g., VMs, resources, etc.).

Recently, a number of works in the literature have also attempted to take advantage of MDD concepts and technologies in the context of Big Data applications. One interesting approach aims to enable MDD of Hadoop MR applications. After defining a meta-model for a Hadoop MR application, which can be used to define a model of the application, the approach offers an automatic code generation mechanism. The output is a complete code scaffold, which the developer of the data-intensive application has to complete by implementing the main application-level Hadoop MR components, the Map and Reduce functions, at the placeholders in the generated code. The main goal is to demonstrate how MDD can dramatically reduce the accidental complexity of developing a Hadoop MR application. Similar support is offered by Stormgen, which aims to provide a DSL for defining Storm-based architectures, called topologies.

Stormgen exploits Ecore for building the metamodel and Xtext for generating the grammar of the language. It also provides automatic code generation using the Xtend language, a dialect of Java. In this case too, the user has to specify the desired implementation for each element of a Storm topology (Bolts and Spouts), since the main focus is on designing the topology. The authors of this approach also plan to couple the textual DSL with a graphical one using Eclipse GMF (Graphical Modeling Framework).

While these approaches provide initial evidence of the utility of MDD in the context of data-intensive applications, each of them is tied to a single underlying technology. Moreover, they focus on the development phase and do not target the deployment aspects, which would require the development and operation teams to reason about the platform nodes supporting the execution of technological components and about their allocation to concrete computational and storage resources.

By way of contrast, in the DICE technical space, Designers exploiting UML modelling for their Data-Intensive applications will be required to produce (at least) one component diagram for their architectural structure view (DICE DPIM) and two (or more) diagrams for their technology-specific structure and behavior view (DICE DTSM), taking care to produce exactly two diagrams (a structural and a behavioral view) for every technological node in their architectural structure view (DICE DPIM) that requires analysis. DICE UML modelling does not encourage the proliferation of diagrams, e.g., for the purpose of re-documentation; the focus of DICE is on quality-aware design and analysis of Data-Intensive applications. Therefore, DICE UML modelling promotes the modelling of all and only the technological nodes that require specific analytical attention and quality-awareness. Finally, Designers will be required to refine their architectural structure view with deployment-specific constructs and decisions (DICE DDSM).

Quick Reference Scenario
Consider, for example, a simple WordCount application featuring a single Source Node, a single Compute Node and a single Storage Node, all three requiring specific analysis and quality improvement. Designers are then required to produce (in the DICE IDE) a total of 7 diagrams: (1) one architectural structure view of the general application, containing three nodes (Storage, Compute and Source) along with their properties and QoS/QoD annotations; (2) a structural and a behavioral technology-specific view for every technology that requires analysis, i.e., six further diagrams if we assume a class diagram and an activity diagram for each of the Storage, Compute and Source Node technologies. Finally, the diagram produced in (1) is required to be refined with appropriate deployment-specific constructs, mappings and annotations.
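
The diagram count in this scenario follows a simple rule: one architectural view plus two technology-specific views per node that requires analysis. The small Python sketch below only restates that arithmetic; the function name `required_diagrams` and its parameters are hypothetical and not part of any DICE tool.

```python
def required_diagrams(analysed_nodes: int, views_per_node: int = 2) -> int:
    """One architectural view (DPIM) plus a structural and a behavioral
    view (DTSM) for every node that requires analysis."""
    return 1 + views_per_node * analysed_nodes


# WordCount scenario: Source, Compute and Storage nodes all need analysis.
print(required_diagrams(3))  # 1 + 2 * 3 = 7
```

The deployment-specific refinement does not add to the count here because, as stated above, it refines the architectural diagram produced in (1) rather than introducing a separate one.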

The next section provides a realistic usage scenario of the above modelling procedure for the purpose of clarifying the DICE modelling process.

For more details on the DICE profile and the connected technologies (tutorials, documentation, examples, etc.), the reader may consult the DICE Knowledge Repository.

DICE UML Modelling in Action: A Sample Scenario
As a toy example, we refer to a simple Storm application of our own devising called WikiStats, which takes as input a compressed 20 GB stream of web pages in XML containing snapshots of all the articles in Wikipedia. The application then processes the stream to derive article statistics. Let us assume that we are initially interested in deploying our application as soon as possible rather than in analysing its behavior.

Step a: DPIM Model
The DPIM model for our toy example is a component-based aggregation of two nodes: a compute node (entrusted with processing wiki pages) and a storage node (entrusted with storing and presenting results). For the sake of space we do not present this simplistic DPIM layer.

Step b and c: DPIM Model Refinement
At this point the DPIM model is used as a basis to refine DIA modelling with appropriate technological decisions: the components of the DPIM component diagram that represent DIA nodes are given additional technology-specific stereotypes. In our case, the only compute node in the DPIM receives the <> stereotype, which signifies that the component is established to be a Storm Compute Node. Similarly, the component representing the storage node receives the <> stereotype, which signifies that the component is established to be a Cassandra cluster.

Step d: DTSM Model Creation
At this point, we need to “explode” the two nodes in our DPIM that were refined with technological decisions. All we need to do is create a new class diagram and elaborate further on the technical internals of both nodes (e.g., Storm topology details for the <> and schemas for the <>). Consequently, we prepare a new class diagram in which a new class is created with the <> stereotype and immediately associated with the bolts and spouts required in WikiStats; similarly, data schemas are prepared for the bolts and linked to a <> class, for which we assume no further internal details are needed.
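
To give an idea of the kind of information such a DTSM class diagram captures, the following Python sketch represents a Storm topology as spouts plus bolts with their input subscriptions, and checks that every bolt reads from a known component. The source does not list the actual WikiStats components, so `Topology`, `dangling_inputs` and the component names (`page_reader`, `xml_parser`, `stats_counter`, `cassandra_writer`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Topology:
    """Hypothetical, simplified view of a Storm topology."""
    spouts: List[str]
    bolts: Dict[str, List[str]]  # bolt name -> components it reads from

    def dangling_inputs(self) -> List[str]:
        """List 'bolt<-source' pairs whose source is not a declared component."""
        known = set(self.spouts) | set(self.bolts)
        return sorted(
            f"{bolt}<-{source}"
            for bolt, sources in self.bolts.items()
            for source in sources
            if source not in known
        )


# Illustrative WikiStats-like topology (component names are made up).
wikistats = Topology(
    spouts=["page_reader"],
    bolts={
        "xml_parser": ["page_reader"],
        "stats_counter": ["xml_parser"],
        "cassandra_writer": ["stats_counter"],
    },
)

print(wikistats.dangling_inputs())  # empty list when the wiring is consistent
```

A check of this kind is one simple example of what a structural technology-specific view makes possible before any deployment takes place.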

Step e: DDSM Model Creation
At this point, the technologies used in the DTSM are mapped to physical resources and automated rollout is applied to obtain a deployable TOSCA blueprint (see the DICER tool and the DICE Delivery Service for additional deployment features). DDSM creation at this step involves creating or refining a UML Deployment Diagram with DDSM Profile stereotypes. Continuous OCL-assisted modelling can be used to refine the UML Deployment Diagram in a semi-automated fashion. In a typical scenario, the DICE user picks any technology from the DTSM diagram and instantiates a Deployment Node to which that technology stereotype is applied. Subsequently, the DICE user can check the diagram against the DICE-DDSM OCL constraints, addressing any missing dependencies for that technology as well as any missing deployment specifications (e.g., additional nodes, firewalls, missing characteristics and attributes, etc.). The DICE user repeats the same process until all the technologies in the DTSM are modelled at the DDSM level as well. Finally, a deployment artifact representing the runnable DIA instance itself concludes the modelling at the DICE DDSM layer. Automated TOSCA blueprint creation then follows, using the 1-click deployment feature of the DICER tool: this feature activates a DICER + Deployment Service pipeline on the prepared DDSM artifacts so that deployment can follow immediately.
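
The iterative "pick a technology, check constraints, fix what is missing" loop described above can be sketched in Python as follows. This is not the DICE-DDSM OCL constraint set nor the DICER tool: the `REQUIRES` table is a hypothetical stand-in for such constraints, containing only the well-known fact that Storm relies on ZooKeeper, and the function names are assumptions.

```python
from typing import Dict, List, Set

# Hypothetical dependency table standing in for DDSM constraints.
REQUIRES: Dict[str, Set[str]] = {
    "storm": {"zookeeper"},
    "cassandra": set(),
    "zookeeper": set(),
}


def missing_dependencies(deployed: Set[str]) -> Set[str]:
    """Technologies required by something deployed but not yet modelled."""
    return {
        dep
        for tech in deployed
        for dep in REQUIRES.get(tech, set())
        if dep not in deployed
    }


def model_ddsm(dtsm_technologies: List[str]) -> Set[str]:
    """Add each DTSM technology, then repeat the check until nothing is missing."""
    deployed: Set[str] = set()
    for tech in dtsm_technologies:
        deployed.add(tech)
        missing = missing_dependencies(deployed)
        while missing:
            deployed |= missing
            missing = missing_dependencies(deployed)
    return deployed


print(sorted(model_ddsm(["storm", "cassandra"])))
```

In the real workflow the user resolves each reported violation by hand in the diagram; the sketch automates that step only to keep the example short.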

Conclusion
DICE UML modelling is aimed at providing a coherent and complete overview of quality-aware and data-intensive applications using standard UML diagrams to fully express and articulate the needed details. In these details we include: (a) a structural view of the data-intensive architecture; (b) a behavioral view of the same architecture; (c) a structural and behavioral view of technology-specific constructs for the architecture; (d) the infrastructure design for said architecture.

The above models are intended to both document and support the quality-aware design and operation of data-intensive applications.