Practical DevOps for Big Data/Platform-Independent Modelling

Introduction

DICE provides Software Architects with a set of core concepts, at the DPIM layer, for specifying the fundamental architecture elements that constitute a Data-Intensive Application (DIA) during the DIA design phase. Designers may use these core architecture elements to quickly put together the structural view of their Big-Data application, highlighting and tackling concerns such as data flow, essential high-level processing properties (e.g., rate, properties provided and required by each component), and key data processing needs (e.g., batch, streaming).

DPIM Profile

DPIM includes all concepts that are relevant to structuring a DIA. At the DPIM level we define the high-level topology of the application and its QoS requirements. Elements of the DPIM meta-model fall into two categories:
 * 1) Active DIA elements, which process the data, such as computation nodes;
 * 2) Passive DIA elements, which store and visualize the data, such as storage nodes.

More particularly, the DICE DPIM Profile meta-model shows that DIA elements are essentially aggregates of two sets of components. The first is the "ComputationNode", which is responsible for carrying out a computational task such as map or reduce in MapReduce. An important attribute of ComputationNode is "computationType", which indicates the processing type, i.e., batch processing or stream processing. The ComputationNode further specializes into "SourceNode" and "Visualization" nodes. The role of the SourceNode is to provide data for processing: it represents the source of the data entering the application. The attribute "sourceType" further specifies the characteristics of the source. The ultimate goal of a Big-Data application is to process data of high volume and velocity, so the SourceNode and ComputationNode belong in DPIM because they are essential parts of each and every DIA: the SourceNode is the entry point of data into the application, and the ComputationNode is where the data are processed. Visualization means rendering the data so as to represent the extracted knowledge more intuitively and effectively, using different graphs computed through data-intensive means. Even though the visualization of Big Data could be done by a separate application, here we consider Visualization a specialization of ComputationNode, since visualization is ultimately a data-intensive computation task. Another specialization of ComputationNode is the FilterNode, whose role is to perform any kind of pre-processing or post-processing of the data, if needed.

The second key element in the DICE profile is the "StorageNode". As its name suggests, the StorageNode represents the element responsible for storing the data, either long or short term. It is associated with the "Channel", which represents a communication channel in the application. The Channel specification also captures the restrictions and constraints of a channel, as well as characteristics related to the transfer of data, such as information rate and taps. The concept of StorageNode in DPIM mainly corresponds to a "database" in the model; in some cases it could also be a "filesystem". The Channel in DPIM is a representation of "Governance and Data Integration", which mainly includes the technologies responsible for transferring the data, such as message-broker systems. The remaining elements in the model are "DataSpecification" and "QoSRequiredProperty", annotation stubs for specifying the type and format of the data and the QoS requirements of the system and its evaluation, respectively. These annotations are inherited from MODACloudML. Appendix A specifies the DICE Profile in greater detail. Table 1 summarizes the current list of stereotypes of the DICE Profile at the DPIM level.
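To make the element hierarchy described above concrete, it can be sketched as a small class hierarchy. The following is an illustrative Python sketch: the class and attribute names mirror the DPIM stereotypes, but the code itself is our own construction, not part of the DICE profile or its tooling.

```python
from dataclasses import dataclass
from enum import Enum


class ComputationType(Enum):
    """Processing style captured by the 'computationType' attribute."""
    BATCH = "batch"
    STREAM = "stream"


@dataclass
class DIAElement:
    """Common base of all DPIM elements."""
    name: str


@dataclass
class ComputationNode(DIAElement):
    """Active element carrying out a computational task (e.g., map, reduce)."""
    computation_type: ComputationType = ComputationType.BATCH


@dataclass
class SourceNode(ComputationNode):
    """Entry point of data into the application."""
    source_type: str = ""  # 'sourceType': characteristics of the source


@dataclass
class VisualizationNode(ComputationNode):
    """Renders processed data as graphs; still a data-intensive task."""


@dataclass
class FilterNode(ComputationNode):
    """Pre- or post-processing of the data."""


@dataclass
class StorageNode(DIAElement):
    """Passive element storing data, short- or long-term."""
    kind: str = "database"  # could also be "filesystem"


@dataclass
class Channel(DIAElement):
    """Communication channel between elements (e.g., a message broker)."""
    information_rate: float = 0.0  # e.g., messages per second


# A toy DPIM topology: a streaming source feeding a computation via a channel.
source = SourceNode(name="vessel-positions", computation_type=ComputationType.STREAM)
channel = Channel(name="broker", information_rate=1000.0)
compute = ComputationNode(name="route-analysis", computation_type=ComputationType.STREAM)
store = StorageNode(name="history", kind="database")
```

The point of the sketch is the specialization structure: SourceNode, VisualizationNode, and FilterNode all inherit from ComputationNode, while StorageNode and Channel sit beside it as passive elements, matching the two categories listed earlier.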

DPIM Example: The Maritime Operations Case Study

In this section, we describe a UML-based design (i.e., an Activity Diagram) that is annotated using the DPIM profile. In particular, input parameters are assigned to the mean durations of the action steps (i.e., hostDemand tagged values) and to the data stream arrival rate (i.e., arrivalRate tagged value). We show the modelling of a portion of the Maritime Operations case study.



As previously explained in Introduction to Modelling, the DPIM profile relies on the standard MARTE and DAM profiles. DAM is a profile specialized in dependability and reliability analysis, while MARTE offers the GQAM sub-profile, a complete framework for quantitative analysis; together they match our purpose perfectly: the quality assessment of data-intensive applications. Moreover, MARTE offers the NFP and VSL sub-profiles. The NFP sub-profile describes the non-functional properties of a system, performance in our case. The VSL sub-profile provides a concrete textual language for specifying the values of metrics, constraints, properties, and parameters related to performance. VSL expressions are used in DPIM-profiled models with two main goals: (i) to specify the input parameters of the model and (ii) to specify the performance metric(s) that will be computed for the model (i.e., the output results). An example of a VSL expression for a host demand tagged value of type NFP_Duration is:

expr=$parse (1), unit=ms (2), statQ=mean (3), source=est (4)

This expression specifies that the parsing step (the yellow box in the figure) demands $parse (1) milliseconds (2) of processing time, whose mean value (3) will be obtained by estimation on the real system (4). $parse is a variable that can be bound to concrete values during the analysis of the model.
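Flat VSL tagged values of this shape are straightforward to handle programmatically. The following is a minimal, hypothetical Python sketch (`parse_vsl` and `bind_variables` are our own helper names, not part of any MARTE or DICE tooling), showing how the tagged value above could be read and how its $-variable could be bound at analysis time:

```python
def parse_vsl(expression: str) -> dict:
    """Split a flat VSL tagged value such as
    'expr=$parse, unit=ms, statQ=mean, source=est'
    into a key/value dictionary. Nested VSL expressions
    (tuples, intervals) are deliberately not handled.
    """
    fields = {}
    for pair in expression.split(","):
        key, sep, value = pair.strip().partition("=")
        if not sep or not key:
            raise ValueError(f"malformed VSL pair: {pair!r}")
        fields[key] = value.strip()
    return fields


def bind_variables(fields: dict, bindings: dict) -> dict:
    """Replace a $-variable in 'expr' (e.g. $parse) with a concrete
    value supplied during the analysis of the model."""
    bound = dict(fields)
    expr = bound.get("expr", "")
    if expr.startswith("$"):
        bound["expr"] = bindings[expr[1:]]
    return bound


host_demand = parse_vsl("expr=$parse, unit=ms, statQ=mean, source=est")
concrete = bind_variables(host_demand, {"parse": 4.2})  # 4.2 ms is illustrative
```

This mirrors the two goals mentioned above: the parsed dictionary captures the input parameter ($parse, its unit, and its statistical qualifier), and the binding step is where an analysis tool would supply the concrete value.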

Conclusion

DICE UML-based application modelling centres on DPIM, a refined UML profile for specifying the fundamental architecture elements that constitute a Data-Intensive Application (DIA).