Imperial Journal of Interdisciplinary Research (IJIR) Vol-3, Issue-4, 2017 ISSN: 2454-1362, http://www.onlinejournal.in

Cloud Computing Environment – A Big Data Dash

Rajendra Kaple, SGBAU, Amravati

Abstract: The rapid growth of internet applications such as social media, online shopping and banking services leads to the processing of a huge amount of diverse data, which continues to increase quickly. Effective management and analysis of large-scale data poses an interesting but critical challenge. Recently, big data has attracted a lot of attention from academia, industry and government. In this paper we cover several big data processing techniques from both the system and the application side. From the view of cloud data management and big data processing mechanisms, we concentrate on the key issues of big data processing, including the cloud computing platform, cloud architecture, cloud databases and data storage systems. We then introduce MapReduce optimization strategies and applications reported in the literature. Lastly, we focus on the open issues and challenges, and discuss research opportunities in big data processing in cloud computing environments.

1. Introduction: In the last twenty years, the continuous increase of computational power has produced a vast flow of data. Big data is not only becoming more accessible, but also more intelligible to computers. For example, the social website Facebook serves around six hundred billion page views per month, stores three billion new photos every month, and manages twenty-five billion pieces of content. Google's search and advertising business, as well as Facebook, Flickr, YouTube and LinkedIn, rely on AI techniques that require examining vast quantities of data and making decisions almost instantly. Multimedia data mining platforms make it easy to achieve these goals with a minimum amount of effort in terms of software, CPU and network resources. Big data and cloud computing are both fast-growing technologies: cloud computing is associated with a new standard for providing computing infrastructure and big data processing methods for all types of resources. Moreover, some new cloud-based technologies have to be adopted, because dealing with big data through parallel processing is otherwise difficult. Then what is big data? In a 2008 publication of the journal Science, big data was described as representing the progress of human cognitive processes, and it usually includes data sets with sizes beyond the ability of current technology, methods and theory to capture, manage, and process within an acceptable elapsed time. Big data has also been defined as high-volume, high-velocity and high-variety information resources that require new forms of processing to enable better decision making, insight discovery and process optimization. According to Wikipedia, big data is a collection of data sets so large and complex that they become difficult to process using traditional database management tools. The purpose of this paper is to offer an overview of big data studies and related works, aiming to provide a general view of big data management technologies and applications.

2. Big Data Management System: Many researchers have suggested that commercial Database Management Systems are not suitable for processing extremely large amounts of data. The main failure point of the classic architecture is the database server when faced with peak workloads: a single database server is limited in scalability and cost, which are two important requirements of big data processing. In order to accommodate various large data processing models, D. Kossmann et al. presented four different architectures based on the classic multi-tier database application architecture, namely partitioning, replication, distributed control and caching architectures [1]. It is clear that the alternative providers have different business models and target different kinds of applications: Google seems to be more interested in small applications with light workloads, whereas Azure is currently the most realistic service for medium to large services. Most recent cloud service providers utilize a hybrid architecture that is capable of fulfilling their actual service needs. In this section, we discuss big data architecture from three key aspects: the distributed file system, non-structural and semi-structured data storage, and open-source cloud platforms.

A. Distributed File System: The Google File System (GFS) is a chunk-based distributed file system that supports fault tolerance through data partitioning and replication. As a fundamental storage layer of Google's cloud computing platform, it is used to read input and store the output of MapReduce.

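The chunk-based design used by GFS (and by HDFS after it) can be sketched as follows: a file is split into fixed-size chunks, and each chunk is replicated on several distinct servers, so that losing one server loses no data. The sketch below is a minimal illustration in Python, not GFS code; the 64 MB chunk size and 3-way replication are the defaults reported for GFS, while the server names and the hash-based placement policy are invented for this example.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # GFS's reported default chunk size (64 MB)
REPLICAS = 3                   # default replication factor

def place_chunks(file_size, servers, replicas=REPLICAS):
    """Split a file of `file_size` bytes into fixed-size chunks and
    assign each chunk to `replicas` distinct chunkservers."""
    assert len(servers) >= replicas, "need at least `replicas` servers"
    n_chunks = max(1, -(-file_size // CHUNK_SIZE))  # ceiling division
    placement = {}
    for chunk_id in range(n_chunks):
        # Deterministic pseudo-random start offset per chunk; taking
        # `replicas` consecutive servers guarantees distinct replicas.
        h = int(hashlib.md5(str(chunk_id).encode()).hexdigest(), 16)
        start = h % len(servers)
        placement[chunk_id] = [servers[(start + i) % len(servers)]
                               for i in range(replicas)]
    return placement

# Example: a 200 MB file on five chunkservers -> 4 chunks, 3 copies each.
servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]
placement = place_chunks(200 * 1024 * 1024, servers)
```

A real chunkserver placement policy also accounts for rack topology and disk utilization; the hash here only stands in for "spread chunks across the cluster".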

Similarly, Hadoop has a distributed file system as its data storage layer, called the Hadoop Distributed File System (HDFS) [2], which is an open-source counterpart of GFS. GFS and HDFS are user-level file systems that do not implement POSIX (Portable Operating System Interface) semantics, and both are heavily optimized for the case of large files (measured in gigabytes). Amazon Simple Storage Service (S3) is an online public storage web service offered by Amazon Web Services; this file system targets clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure, and it aims to provide scalability, high availability, and low latency at commodity cost. ES2 is the elastic storage system of epiC, which is designed to support both functionalities within the same storage system; it provides efficient data loading from different sources, a flexible data partitioning scheme, indexing, and parallel sequential scan. In addition, there are general-purpose file systems not covered here, such as the Moose File System (MFS) and the Kosmos Distributed Filesystem (KFS).

B. Non-structural and Semi-structured Data Storage: With the success of the web, more and more IT companies need to store and analyze ever-growing web data, such as search logs, crawled web content, and click streams, usually in the range of petabytes (1000 terabytes), collected from different web services. However, web data sets are usually non-relational or less structured, and processing such semi-structured data sets at scale poses another challenge. Moreover, the simple distributed file systems mentioned above cannot satisfy service providers like Google, Yahoo!, Microsoft and Amazon; all of these providers persevere in serving potential users and have built their own state-of-the-art big data management systems for cloud environments. Bigtable [3] is a distributed storage system of Google for managing structured data, designed to scale to a very large size, say petabytes of data, across thousands of commodity servers. Bigtable does not support a full relational data model; however, it provides clients with a simple data model that supports dynamic control over data layout and format. PNUTS [4] is a massive-scale hosted database system designed to support Yahoo!'s web applications. The main focus of the system is on data serving for web applications, rather than complex queries; using PNUTS, new applications can be built very easily, and the overhead of creating and maintaining these applications is minimal. Dynamo [5] is a highly available and scalable distributed key/value data store built for supporting Amazon's internal applications, and it provides a simple primary-key-only interface to meet the requirements of these applications. Differing from these key-value storage systems, Facebook proposed the design of a new cluster-based data warehouse system, Llama [6], a hybrid data management system which combines the features of row-wise and column-wise database systems. They also describe a new column-wise file format for Hadoop, called CFile, which provides better performance than other file formats in data analysis.

C. Open Source Cloud Platform: The main idea behind the data center is to leverage virtualization technology to maximize the utilization of computing resources; it therefore provides the basic ingredients, such as storage, CPUs, and network bandwidth, as commodities offered by specialized service providers at low unit cost. To reach the goals of big data management, most research institutions and enterprises bring virtualization into their cloud architectures. Amazon Web Services (AWS), Eucalyptus, OpenNebula, CloudStack and OpenStack are the most popular cloud management platforms for infrastructure as a service (IaaS). AWS is not free, but it is widely used as an elastic platform; it is very easy to use and billed pay-as-you-go. Eucalyptus [7] works in IaaS as an open-source platform and uses virtual machines to control and manage resources. Since Eucalyptus is the earliest cloud management platform for IaaS, it signed an API-compatibility agreement with AWS, and it holds a leading position in the private cloud market within the AWS ecosystem. OpenNebula [8] integrates with various environments; it can offer the richest features, flexible ways of working, and better interoperability to build private, public or hybrid clouds. OpenNebula is not a Service-Oriented Architecture (SOA) design and has weak decoupling of its independent computing, storage and network components. CloudStack is an open-source cloud computing platform which delivers public cloud computing similar to Amazon EC2, but using the users' own hardware; CloudStack users can take full advantage of cloud computing to deliver higher efficiency, limitless scale and faster deployment of new services and systems to the end user. At present, CloudStack is one of the Apache open-source projects. OpenStack is a collection of open-source software projects aiming to build an open-source community of researchers, developers and enterprises. People in this community share a common goal: to create a cloud that is simple to deploy, massively scalable and full of rich features. At present, OpenStack has a good community and ecosystem; however, it still has some shortcomings, such as incomplete functionality and a lack of commercial support.
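To close this section, the primary-key-only interface surveyed in part B above (the style of access a store like Dynamo exposes) can be made concrete with a toy sketch: the client sees only put/get on a key, while the store internally hashes the key onto one of several partitions, the way a distributed store routes a key to a node. This is an illustrative sketch, not Dynamo's implementation; the partition count and the hashing scheme are invented for the example.

```python
import hashlib

class PrimaryKeyStore:
    """Toy key-value store with a Dynamo-like primary-key-only interface.

    Keys are hashed onto a fixed number of partitions; no range scans or
    secondary indexes are offered, only put/get by primary key."""

    def __init__(self, n_partitions=4):
        self.partitions = [dict() for _ in range(n_partitions)]

    def _partition(self, key):
        # Hash the key to pick a partition (stands in for node routing).
        digest = hashlib.sha1(key.encode()).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition(key)][key] = value

    def get(self, key, default=None):
        return self.partitions[self._partition(key)].get(key, default)

store = PrimaryKeyStore()
store.put("cart:alice", ["book", "pen"])
store.put("cart:bob", ["lamp"])
```

The deliberately narrow interface is the point: by refusing complex queries, such stores can partition and replicate freely across thousands of machines.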



3. Applications and Optimization:

A. Application: In this age of data explosion, parallel processing is essential for handling a huge volume of data in a timely manner, and the use of parallelization techniques and algorithms is the key to achieving better scalability and performance for big data processing. At present there are many popular parallel processing models, including MPI, General-Purpose GPU (GPGPU), MapReduce, and MapReduce-like models. MapReduce, proposed by Google, is a very popular big data processing model that has rapidly been studied and applied by both industry and academia. MapReduce has two major advantages: the model hides details related to data storage, distribution, replication, load balancing and so on; furthermore, it is so simple that programmers specify only two functions, a map function and a reduce function, to process big data. Existing MapReduce applications can be divided into three categories: partitioning sub-space, decomposing sub-processes, and approximate overlapping calculations. While MapReduce is referred to as a new approach to processing big data in cloud computing environments, it has also been criticized as "a major step backwards" compared with Database Management Systems (DBMS) [9]. MapReduce is schema-free and index-free; thus, the MapReduce framework must parse each record when reading input. As the argument continued, the eventual conclusion was that neither is good at what the other does well, and the two technologies are complementary [10]. HadoopDB is a hybrid system that efficiently combines the scalability of MapReduce with the performance of a DBMS; its results show that HadoopDB improves the task processing times of Hadoop by a large factor, matching shared-nothing DBMSs. Lately, J. Dittrich et al. proposed a new type of system named Hadoop++ [11] and pointed out that HadoopDB also has severe drawbacks, including forcing users to use a DBMS and changing the interface to SQL. There are also papers adapting the inverted index, a simple but practical index structure appropriate for MapReduce processing of big data, such as [12]. MapReduce has received a lot of attention in many fields, including data mining, information retrieval, image retrieval, machine learning, and pattern recognition. For example, Mahout is an Apache project that aims at building scalable machine learning libraries, all implemented on Hadoop. However, as the amount of data to be processed grows, many data processing methods have become unsuitable or limited, and many recent research efforts have exploited the MapReduce framework to solve challenging data processing problems on large-scale datasets in different domains. MapDupReducer [13] is a MapReduce-based system capable of detecting near-duplicates over massive datasets efficiently. In addition, C. Ranger et al. [14] implemented the MapReduce framework on multiple processors in a single machine and obtained good performance.

B. Optimization: In this section, let us look at approaches for improving the performance of big data processing with MapReduce.

1) Data Transfer Difficulties: It is a big challenge for cloud users to minimize the cost of data transmission; consequently, researchers have proposed a variety of approaches. Map-Reduce-Merge [15] is a new model that adds a Merge phase after the Reduce phase, combining the reduced outputs of two different MapReduce jobs into one; it can efficiently merge data that is already partitioned and sorted (or hashed) by the map and reduce modules. Map-Join-Reduce [16] is a system that extends and improves the MapReduce runtime framework by adding a Join stage before the Reduce stage to perform complex data analysis tasks on large clusters. Its authors present a new data processing strategy which runs filtering-join-aggregation tasks as two consecutive MR jobs, adopting a one-to-many shuffling scheme to avoid frequent checkpointing and shuffling of intermediate results. Moreover, different jobs often perform similar work, and sharing that work reduces the overall amount of data transferred between jobs: MRShare [17] is a sharing framework proposed by T. Nykiel et al. that transforms a batch of queries into a new batch that can be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Data skew is also an important factor affecting data transfer cost. To overcome this deficiency, a method has been proposed [18] that divides a MapReduce job into two phases: a sampling MapReduce job and an expected MapReduce job. The first phase samples the input data, gathers the inherent distribution of key frequencies, and produces a good partition scheme in advance; in the second phase, the expected MapReduce job applies this partition scheme in every mapper to group the intermediate keys quickly.

2) Iterative Optimization: MapReduce is also a prevalent platform in which the dataflow takes the form of a directed acyclic graph of operators; however, it requires a lot of I/O and unnecessary computation when solving iterative problems. Twister [19], proposed by J. Ekanayake et al., is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently by adding an extra Combine stage after the Reduce stage, so that the output of the Combine stage flows into the next iteration's Map stage. It thereby avoids instantiating workers repeatedly across iterations; previously instantiated workers are reused for the next iteration with different inputs.
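The loop-invariant caching idea behind runtimes such as Twister and HaLoop can be illustrated with a toy driver: the static input is loaded once and reused across iterations, and only the small iteration-varying state flows from one round of map/reduce into the next, instead of re-reading everything from storage each time. The sketch below is an illustrative simulation in Python, not the Twister or HaLoop runtime; the function names are invented, and a simple 1-D 2-means clustering stands in for a real iterative analysis.

```python
from collections import defaultdict

def run_iterative_mapreduce(records, state, map_fn, reduce_fn, iterations):
    """Toy iterative MapReduce driver in the spirit of Twister/HaLoop:
    `records` is the loop-invariant input, kept cached across iterations;
    only the small varying `state` is fed back into the next round."""
    for _ in range(iterations):
        # Map phase: emit (key, value) pairs; records are reused, not reloaded.
        groups = defaultdict(list)
        for record in records:
            key, value = map_fn(record, state)
            groups[key].append(value)
        # Reduce phase: one reducer call per key; its output is the next state.
        state = [reduce_fn(key, values) for key, values in sorted(groups.items())]
    return state

# Example workload: 1-D 2-means clustering, a classic iterative MR job.
def assign(point, centers):
    # Key = index of the nearest center; value = the point itself.
    return min(range(len(centers)), key=lambda i: abs(point - centers[i])), point

def recenter(_key, points):
    return sum(points) / len(points)

data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
final = run_iterative_mapreduce(data, [0.0, 5.0], assign, recenter, iterations=5)
# final -> [1.5, 10.5]: the two cluster means.
```

In a plain MapReduce chain, each of those five iterations would be a fresh job that re-reads `data` from the file system; caching the invariant input is exactly the I/O that Twister-style runtimes save.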


HaLoop [20] is similar to Twister: it is a revised version of the MapReduce framework that supports iterative applications by adding loop control, and it also allows caching of both stages' inputs and outputs to save I/O across iterations. Many iterations also arise in graph data processing. Pregel [21] implements a programming model driven by the Bulk Synchronous Parallel (BSP) model, in which each node has its own input and transfers to other nodes only those messages required for the next iteration.

3) Online: Some jobs need to be processed online, which the original MapReduce cannot do very well. MapReduce Online [22] is designed to support online aggregation and continuous queries in MapReduce. Its authors observe that frequent checkpointing and shuffling of intermediate results limit pipelined processing; they therefore modify the MapReduce framework so that mappers periodically push data held in local storage to the reducers within the same MR job, and they use map-side pre-aggregation to reduce communication. The Hadoop Online Prototype (HOP) [23], proposed by Tyson Condie, is similar to MapReduce Online: HOP is a modified version of the MapReduce framework that allows users to get early returns from a job as it is being computed. It also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing, while retaining the fault-tolerance properties of Hadoop. D. Jiang et al. [24] found that the merge sort in MapReduce costs a lot of I/O and seriously affects MapReduce's performance. In their study, results are hashed and pushed to hash tables held by the reducers as soon as each map task outputs its intermediate results; the reducers then perform aggregation on the values in each bucket. Since each bucket in the hash table holds all the values corresponding to a distinct key, no grouping is required; in addition, the reducers can perform aggregation on the fly, even before all mappers have completed.

4) Join Query Optimization: The join query is a common problem in the big data area; however, a join needs two or more inputs, while MapReduce is devised for processing a single input. R. Vernica et al. [25] proposed a 3-stage approach for end-to-end set-similarity joins; they efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. Wei Lu et al. investigate how to perform the kNN join using MapReduce [26]: mappers cluster objects into groups, and reducers then perform the kNN join on each group of objects separately. To reduce shuffling and computational costs, they design an effective mapping mechanism that exploits pruning rules for distance filtering; in addition, two approximate algorithms minimize the number of replicas to reduce the shuffling cost.

4. Challenge: The top seven big data drivers are science data, Internet data, finance data, mobile device data, sensor data, RFID data and streaming data. Together with recent advances in machine learning and reasoning, as well as rapid growth in computing power and storage, these are transforming our ability to make sense of increasingly large, diverse, noisy and incomplete datasets collected from a variety of sources. We consider three important aspects of the problems faced in processing big data, and present our points of view in detail as follows.

Big Data Storage and Management: We need to design hierarchical storage architectures. Besides, previous computer algorithms are not able to store effectively the data that is directly acquired from the real world, because of the heterogeneity of big data, even though they perform excellently on homogeneous data. Therefore, how to reorganize data is one big problem in big data management. Virtual server technology can intensify the problem, raising the prospect of overcommitted resources, especially if communication is poor between the application, server and storage administrators. We also need to solve the bottleneck problems of highly concurrent I/O and of the single named node in the present master-slave system model.

Big Data Computation and Analysis: When processing a query over big data, speed is a significant demand [27]; however, the process may take time, because it mostly cannot traverse all the related data in the whole database in a short time. In this case, an index is an optimal choice. Application parallelization and divide-and-conquer are natural computational paradigms for approaching big data problems, but obtaining additional computational resources is not as simple as upgrading to a bigger and more powerful machine on the fly. The traditional serial algorithm is inefficient for big data; if there is enough data parallelism in the application, users can take advantage of the cloud's reduced cost model to use hundreds of computers for a short time.

Big Data Security: By using online big data applications, many companies can greatly reduce their IT costs. However, security and privacy concerns affect the entire big data storage and processing pipeline, since there is massive use of third-party services and infrastructures to host important data or to perform critical operations.


Besides, current technologies for privacy protection are mainly based on static data sets, while real data always changes dynamically, including changes in data patterns, variation of attributes, and the addition of new data. Thus, it is a challenge to implement effective privacy protection in this complex circumstance. In addition, legal and regulatory issues also need attention.

5. Conclusion: In this paper we have discussed a systematic flow of study on big data processing in the cloud computing environment. The key issues, including cloud storage and computing architecture, parallel processing frameworks, and the major applications and optimizations of MapReduce, were targeted. Big data is not a new concept, but due to ever-increasing Internet data it becomes very challenging; there is a need for accessible storage indexes and a distributed approach to retrieve the required results in near real time. Big data will be multifaceted and will exist continuously, which is a big opportunity for us. In the future, significant challenges need to be tackled by industry and academia, and there is a crucial need for computer scientists and social science scholars to cooperate closely, in order to ensure the long-term success of cloud computing and to collectively explore new territory.

References:

[1] D. Kossmann, T. Kraska, and S. Loesing, "An evaluation of alternative architectures for transaction processing in the cloud," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 579–590.
[2] D. Borthakur, "The hadoop distributed file system: Architecture and design," Hadoop Project Website, vol. 11, 2007.
[3] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A distributed structured data storage system," in 7th OSDI, 2006, pp. 305–314.
[4] B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, "Pnuts: Yahoo!'s hosted data serving platform," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1277–1288, 2008.
[5] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: amazon's highly available key-value store," in ACM SIGOPS Operating Systems Review, vol. 41, no. 6. ACM, 2007, pp. 205–220.
[6] Y. Lin, D. Agrawal, C. Chen, B. Ooi, and S. Wu, "Llama: leveraging columnar storage for scalable join processing in the mapreduce framework," in Proceedings of the 2011 international conference on Management of data. ACM, 2011, pp. 961–972.
[7] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The eucalyptus open-source cloud-computing system," in Cluster Computing and the Grid, 2009. CCGRID'09. 9th IEEE/ACM International Symposium on. IEEE, 2009, pp. 124–131.
[8] P. Sempolinski and D. Thain, "A comparison and critique of eucalyptus, opennebula and nimbus," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on. IEEE, 2010, pp. 417–426.
[9] D. DeWitt and M. Stonebraker, "Mapreduce: A major step backwards," The Database Column, vol. 1, 2008.
[10] M. Stonebraker, D. Abadi, D. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "Mapreduce and parallel dbmss: friends or foes," Communications of the ACM, vol. 53, no. 1, pp. 64–71, 2010.
[11] J. Dittrich, J. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 515–529, 2010.
[12] D. Logothetis and K. Yocum, "Ad-hoc data processing in the cloud," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1472–1475, 2008.
[13] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li, "Mapdupreducer: detecting near duplicates over massive datasets," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 1119–1122.
[14] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating mapreduce for multi-core and multiprocessor systems," in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on. IEEE, 2007, pp. 13–24.
[15] H. Yang, A. Dasdan, R. Hsiao, and D. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," in Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007, pp. 1029–1040.
[16] D. Jiang, A. Tung, and G. Chen, "Map-Join-Reduce: Toward scalable and efficient data analysis on large clusters," Knowledge and Data Engineering, IEEE Transactions on, vol. 23, no. 9, pp. 1299–1311, 2011.
[17] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas, "Mrshare: Sharing across multiple queries in mapreduce," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 494–505, 2010.
[18] Y. Xu, P. Zou, W. Qu, Z. Li, K. Li, and X. Cui, "Sampling-based partitioning in mapreduce for skewed data," in ChinaGrid, 2012 Seventh ChinaGrid Annual Conference on. IEEE, 2012, pp. 1–8.
[19] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, and G. Fox, "Twister: a runtime for


iterative mapreduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 810–818.
[20] Y. Bu, B. Howe, M. Balazinska, and M. Ernst, "Haloop: Efficient iterative data processing on large clusters," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 285–296, 2010.
[21] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 135–146.
[22] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears, "Mapreduce online," in Proceedings of the 7th USENIX conference on Networked systems design and implementation, 2010, pp. 21–21.
[23] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears, "Online aggregation and continuous query support in mapreduce," in ACM SIGMOD, 2010, pp. 1115–1118.
[24] D. Jiang, B. Ooi, L. Shi, and S. Wu, "The performance of mapreduce: An in-depth study," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 472–483, 2010.
[25] R. Vernica, M. Carey, and C. Li, "Efficient parallel set-similarity joins using mapreduce," in SIGMOD Conference, 2010, pp. 495–506.
[26] C. Zhang, F. Li, and J. Jestes, "Efficient parallel knn joins for large data in mapreduce," in Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012, pp. 38–49.
[27] X. Zhou, J. Lu, C. Li, and X. Du, "Big data challenge in the management perspective," Communications of the CCF, vol. 8, pp. 16–20, 2012.
[28] C. Ji, Y. Li, W. Qiu, U. Awada, and K. Li, "Big data processing in cloud computing environments," in 2012 International Symposium on Pervasive Systems, Algorithms and Networks.
