Practical DevOps for Big Data/News and Media

Use Case Description
The news and media domain is a highly demanding sector when it comes to handling large data streams coming from social media. ATC SA (Athens Technology Center), an SME located in Greece and one of the leading brands around the world in ICT applications for the news and media domain, has developed NewsAsset, a commercial product positioned in this domain. The NewsAsset suite is an innovative management solution for handling large volumes of information, offering a complete and secure electronic environment for the storage, management and delivery of sensitive information in the news production environment. The platform provides a distributed multi-tier architecture engine for managing data storage composed of media items such as text, images, reports, articles and videos.
Innovative software engineering practices such as Big Data technologies, Model-Driven Engineering (MDE) techniques, Cloud Computing processes and Service-Oriented methods have penetrated the media domain. News agencies are already feeling the impact of the capabilities these technologies offer (e.g. processing power, transparent distribution of information, sophisticated analytics and quick responses), which facilitate the development of the next generation of products, applications and services. Especially for the interesting media items and burst events that are out there in the digital world, these technologies can offer efficient processing and provide added value to journalists. At the same time, heterogeneous sources such as social networks, sensor networks and several other initiatives connected to the Internet are continuously feeding the Internet with a variety of real data at a tremendous pace: media items describing burst events, traffic speed on roads, slipperiness index for roads receiving rain or snowfall, air pollution levels by location, and so on.
As more of those sources enter the digital world, journalists will be able to access data from more and more of them, which helps not only in disaster coverage but in all manner of news stories. As that trend plays out, when a disaster is happening somewhere in the world, people turn to social networks such as Twitter, Facebook and Instagram to follow the news ecosystem and learn what damage has occurred where and what conditions exist in real time. Many eyewitnesses will snap a photo of the disaster and post it, explaining what is going on. News agencies have consequently realized that social media content is becoming increasingly useful for disaster news coverage, and that they can benefit from this trend only if they adopt the aforementioned innovative technologies. Thus, the challenge for NewsAsset is to catch up with this evolution and provide services that can handle the new situation developing in the media industry.
In addition, during the DICE project we identified a great business opportunity in the “Fake News” sector. More specifically, we used DICE tools to develop specific modules (part of the News Orchestrator application) for a new and innovative product that is connected via an API to our NewsAsset suite and can also be sold as a standalone solution. This new product is called TruthNest (www.truthnest.com). TruthNest is a service ATC has implemented for assessing the trustworthiness of information found in social media. TruthNest users can capture streams from social networks, then analyse a single post and gain insights along several dimensions of a verification process.

Use Case Scenarios
TruthNest is a comprehensive online tool that can promptly and accurately discover, analyse and verify the credibility and truthfulness of reported events, news and multimedia content emerging in social media in near real time. The end user can verify the credibility of a single post within seconds by activating, with a single click, a series of analyses. More specifically, TruthNest users can bring in streams from social networks, which they can then analyse to gain insights along several dimensions of the verification process. They can also create and monitor new “smart” streams from within TruthNest. An important TruthNest module, developed from scratch, is the “Trend Topic Detector”. It provides the end user with a visualisation environment showing prominent trends that derive from social media and, more specifically, from Twitter. It is critical to note that only the “Trend Topic Detector” module has been developed using DICE tools; the rest of TruthNest’s components have been developed with the conventional tools and methodologies used by ATC’s engineering and development team.

“Trend Topic Detector” Detailed Description
The Trend Topic Detector is centered around a clustering module that creates clusters of tweets relating to the submitted search criteria. The clusters are formed by grouping the tweets found according to their common terms. Because of the limitations imposed by the Twitter streaming API, the module tracks only a percentage of the tweets posted from the moment the search starts. Although it is currently restricted to the Twitter streaming API, the module is designed to take input from multiple social media (YouTube, Flickr and others); this has not been implemented yet.
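As a rough illustration of grouping tweets by their common terms, the following Python sketch assigns each tweet to the existing cluster with which it shares the most terms, or opens a new cluster when nothing is similar enough. It is only a minimal sketch and not the module's actual algorithm: the tokenisation rule, the stopword list, the Jaccard similarity measure, the `threshold` value and all function names are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "to", "is", "at"}


def terms(text):
    """Lower-case a tweet and return its set of non-stopword terms."""
    tokens = re.findall(r"[a-z0-9#@]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two term sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_common_terms(tweets, threshold=0.3):
    """Greedily group tweets whose term sets overlap enough.

    Each cluster is a dict with the union of its terms and its tweets.
    """
    clusters = []
    for tweet in tweets:
        tweet_terms = terms(tweet)
        best, best_score = None, 0.0
        for cluster in clusters:
            score = jaccard(tweet_terms, cluster["terms"])
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best["tweets"].append(tweet)
            best["terms"] |= tweet_terms
        else:
            clusters.append({"terms": set(tweet_terms), "tweets": [tweet]})
    return clusters
```

Calling `cluster_by_common_terms` on a list of tweet texts returns the clusters in the order in which they were opened. A greedy single pass like this is cheap but order-dependent, which is one reason a production pipeline would likely use a more robust scheme.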

Architecture
The main pipeline of the clustering module is implemented as a Storm topology in which sequential bolts perform specific operations in the crawling procedure. These bolts include entity extraction (using the Stanford NER classifier) and MinHash computation to estimate how similar the formed term sets are. The terms of each tweet are extracted and indexed in a running Solr instance. The most frequent terms are computed over a rolling window of 5 minutes, and 20 clusters are formed by default. A label (typically the text of a tweet) is assigned to each cluster. The results are stored in a MongoDB database. The module is highly configurable and computes clusters in near real time.
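To make the MinHash step more concrete, the sketch below shows in plain Python how MinHash signatures can estimate the Jaccard similarity of two term sets without comparing the sets directly. It is a sketch under stated assumptions rather than the code of the Storm bolt: the seeded MD5 hash family, the default of 128 hash functions and the function names are all choices made for this illustration.

```python
import hashlib


def _hash(term, seed):
    """Deterministic 64-bit hash of a term under a given seed (illustrative choice)."""
    data = f"{seed}:{term}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(terms, num_hashes=128):
    """Return the MinHash signature of a non-empty collection of terms.

    For each seeded hash function, keep the minimum hash value over all terms.
    """
    terms = set(terms)
    if not terms:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [min(_hash(t, seed) for t in terms) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

Two signatures built with the same `num_hashes` can be compared with `estimate_jaccard`. The estimate becomes more accurate as the number of hash functions grows, at the cost of longer signatures.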

How to use the clustering module:

 * 1) The user sets search terms through the user interface. The default language is English; other languages are not officially supported. Some settings for the trending topic cluster computation are also available (e.g. the computation window).
 * 2) The process is initialized and the user is informed of the number of tweets analysed so far. A line chart, refreshed every few seconds, shows the progress of the stream. After 10 minutes, if tweets have been found, the first trending topic clusters are presented. A set of 20 trending topic clusters is shown. The trending topic clusters are clickable and the user can view the items they consist of.
 * 3) The trending topic clusters are re-computed every five minutes and their content is updated. The user can view details (e.g. the evolution of the clusters over time).
 * 4) The user can search the trending topic clusters and tag the most important ones. A favourite filter is also available.
 * 5) The user can start, stop, edit and delete a trending topic cluster. The user can also save a trending topic cluster and restart it at a later time.
 * 6) There are limits on the number of trending topic clusters created and their activity period. Typically, the clusters are stopped after 24h.
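The periodic recomputation mentioned in step 3 relies on counting the most frequent terms over a rolling time window. The Python sketch below illustrates one simple way to keep such counts. It assumes that timestamps are given in seconds and arrive in non-decreasing order; the class name, method names and eviction rule are hypothetical and are not taken from the actual module, which runs as a Storm topology.

```python
from collections import Counter, deque


class RollingTermCounter:
    """Counts terms seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = deque()    # (timestamp, terms) in arrival order
        self._counts = Counter()  # term -> occurrences inside the window

    def add(self, timestamp, terms):
        """Record the terms of one tweet observed at `timestamp`."""
        terms = list(terms)
        self._events.append((timestamp, terms))
        self._counts.update(terms)

    def _evict(self, now):
        """Drop tweets that are older than the window relative to `now`."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_terms = self._events.popleft()
            self._counts.subtract(old_terms)
        # Unary plus removes entries whose count fell to zero or below.
        self._counts = +self._counts

    def top_terms(self, now, k=20):
        """Return up to `k` (term, count) pairs for the current window."""
        self._evict(now)
        return self._counts.most_common(k)
```

With the default of 300 seconds, calling `top_terms` once per recomputation cycle yields the terms from which a new set of clusters could be derived. The value `k=20` simply mirrors the default number of clusters mentioned in the text.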