Quantitative Analysis of Apache Storm Applications: The NewsAsset Case Study
José I. Requeno · José Merseguer · Simona Bernardi · Diego Perez-Palacin · Giorgos Giotis · Vasilis Papanikolaou

José Ignacio Requeno, José Merseguer, Simona Bernardi and Diego Perez-Palacin
Dpto. de Informática e Ingeniería de Sistemas
Universidad de Zaragoza (Spain)
E-mail: fnrequeno,jmerse,simonab,[email protected]

Giorgos Giotis and Vasilis Papanikolaou
Athens Technology Center, ATC (Greece)
E-mail: fg.giotis, [email protected]

Abstract

The development of Information Systems today faces the era of Big Data. Large volumes of information need to be processed in real time, for example, for Facebook or Twitter analysis. This paper addresses the redesign of NewsAsset, a commercial product that helps journalists by providing services that analyze millions of media items from social networks in real time. Technologies like Apache Storm can help enormously in this context. We have quantitatively analyzed the new design of NewsAsset to assess whether the introduction of Apache Storm can meet the demanding performance requirements of this media product. Our assessment approach, guided by the Unified Modeling Language (UML), takes advantage, for performance analysis, of the software designs already used for development. In addition, we converted UML into a domain-specific modeling language (DSML) for Apache Storm, thus creating a profile for Storm. Later, we transformed said DSML into an appropriate language for performance evaluation, specifically, stochastic Petri nets. The assessment ended with a successful software design that certainly met the scalability requirements of NewsAsset.

Keywords Apache Storm · UML · Petri nets · Software Performance · Software Reuse

1 Introduction

Innovative practices for Information Systems development, like Big Data technologies, Model-Driven Engineering techniques or Cloud Computing processes, have penetrated the media domain. News agencies are already feeling the impact of these technologies (e.g., transparent distribution of information, sophisticated analytics or processing power) for facilitating the development of the next generation of applications. Especially for the interesting media and burst events found in the digital world, these technologies can offer very efficient processing capabilities and can provide an added value to journalists.

Apache Storm (Apache, 2017a) is a free and open source distributed real-time computation system that can process a million tuples per second per node. Storm helps to improve real-time analysis, news and advertisements, the customization of searches, and the optimization of a wide range of online services that require low-latency processing. Today, the volume of information on the Internet increases exponentially, especially that of interest for the media. For example, in the case of natural disasters, social or sporting events, the traffic of tweets or messages may rise to 10 or even 100 times the number of messages in a normal situation (Ranjan, 2014). Hence, applications developed using Apache Storm must meet very demanding performance and reliability requirements.

This paper addresses, using Apache Storm, the redesign of NewsAsset, a commercial product developed by the Athens Technology Center (ATC, 2018). To this end, we apply a quality-driven methodology, which we already introduced in (Requeno et al., 2017), for the performance assessment of Apache Storm applications. For ATC, the redesign also means reusing code from the current version of NewsAsset, so that changes affect only the stream processing that leverages Apache Storm. The simulation-based approach that we apply here is useful for predicting the behavior of the application under future demands, and the impact of stress situations on some performance parameters (e.g., application response time, throughput or device utilization). Consequently, ATC gets, before reimplementation and deployment of the full application, valuable feedback that saves coding effort and money.

In particular, this paper extends the approach in (Requeno et al., 2017) with respect to the quality-driven methodology in different aspects. First, we improve the UML profile of the methodology by introducing a reliability characterization of Storm. Consequently, we convert UML into a DSML (Domain Specific Modeling Language) for performance and reliability of Apache Storm applications. Second, we propose new transformations, into stochastic Petri nets (SPN), for some performance parameters of Storm not already addressed in (Requeno et al., 2017). Moreover, we introduce the computation of reliability metrics by means of the UML profile. Consequently, our approach enables the performance and reliability assessment of Apache Storm applications. Finally, the application of the methodology to the NewsAsset case study has been useful to validate the approach in a real scenario and to advise ATC on the scalability of the application.

On the modeling side, our DSML allows working with the Apache Storm performance and reliability parameters in the very same model used for the workflow and deployment specifications. Moreover, the developer takes advantage of all the facilities provided by a UML software development environment. These reasons favor modeling in UML rather than directly with SPNs, which can simply be obtained by transformation.

Regarding the related work, (Ranjan, 2014) discusses the role of modeling and simulation in the era of big data applications and argues that they can empower practitioners and academics in conducting "what-if" analyses. (Singhal and Verma, 2016) develop a framework for efficiently setting up heterogeneous MapReduce environments and (Nalepa et al., 2015a,b) address the need for modeling and performance assessment in stream applications. More specifically, a generic profile for modeling big data applications is defined for the Palladio Component Model (Kroß et al., 2015). In (Kroß and Krcmar, 2016), the authors model and simulate Apache Spark streaming applications. Mathematical models for predicting the performance of Spark applications are introduced in (Wang and Khan, 2015). Some of these works use variants of Petri nets, but they are applied in a generic context for stream processing (Nalepa et al., 2015b) or distributed systems (Samolej and Rak, 2009; Rak, 2015). Generalized stochastic Petri nets (Chiola et al., 1993), the formalism for performance analysis that we adopt here, have already been used for the performance assessment of Apache Hadoop MapReduce (et al., 2016). A recent publication uses fluid Petri nets for the modeling and performance evaluation of Apache Spark applications (et al., 2017). However, the work in (Requeno et al., 2017) was the first entirely devoted to Apache Storm performance evaluation, combining a genuine UML profile and GSPNs, and the present work validates and extends it as described above.

The rest of the paper is organized as follows. Section 2 introduces the NewsAsset case study. Section 3 recalls the basics of Apache Storm for performance and reliability and defines the DSML. Section 4 presents our performance modeling approach with focus on the case study. Section 5 is devoted to the performance analysis of NewsAsset. Finally, Section 6 draws a conclusion. Appendix A details the transformation to get performance models. Appendix B explains the computation of reliability metrics in a Storm design. Appendix C recalls basic notions of Generalized stochastic Petri nets (Chiola et al., 1993).

2 A Case Study in the Media Domain

Heterogeneous sources like social or sensor networks are continuously feeding the Internet with a variety of real data at a tremendous pace: media items describing burst events, traffic speed on roads, or air pollution levels by location. Journalists are able to access these data, which aid them in all manner of news stories. It is social networks like Twitter, Facebook or Instagram that people are using to watch the news ecosystem and try to learn what conditions exist in real time. Subsequently, news agencies have realized that social-media content is becoming increasingly useful for news coverage and that they can benefit from this trend only if they adopt current innovative technologies that effectively manage such a volume of information. Thus, the challenge is to catch up with this evolution and provide services that can handle the new situation in the media industry.

NewsAsset is a commercial product positioned in the news and media domain, branded by Athens Technology Center (ATC), an SME (small and medium-sized enterprise) located in Greece. The NewsAsset suite constitutes an innovative management solution for handling large volumes of information, offering a complete and secure electronic environment for storage, management and delivery of sensitive information in the news production environment. The platform proposes a distributed multi-tier architecture engine for managing data storage composed of media items such as text, images, reports, articles or videos.

the system. The goal is to optimize the existing processing time by means of not only minimizing the time slot duration to reflect real time processing but also by maximizing the