INTERNATIONAL JOURNAL OF SCIENTIFIC PROGRESS AND RESEARCH (IJSPR) ISSN: 2349-4689, Issue 92, Volume 33, Number 01, 2017

Data Compression Technique on Hadoop's Next Generation: YARN

Vanisha Mavi 1, Nidhi Tyagi 2
1 M.Tech student, CSE, AKTU Lucknow
2 Professor, MIET, Meerut

Abstract - Data compression is a strategy of encoding data that permits a considerable reduction in the total number of bits needed to store or transmit a document [1]. A wide range of data compression techniques is available, for both online and offline use, such that it becomes genuinely difficult to choose the technique that serves best. In this paper we present a number of techniques to compress and decompress text data and summarize the design, development, and current state of deployment of Hadoop. We also give an overview of YARN concepts, related issues and challenges, and tools and techniques.

Keywords - Big Data, YARN, Data compression.

1.1 INTRODUCTION

Big data is described as a collection of structured, unstructured, and semi-structured data [1].

2.1 Hadoop

Hadoop is a distributed system infrastructure researched and developed by the Apache Foundation [2]. Users can develop distributed applications without knowing the lower distributed layers, so they can make full use of the power of the cluster to perform high-speed computing and storage. The core technologies of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce.

3.1 Compression

Compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation [3]. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost. Lossy compression reduces bits by identifying unnecessary information and removing it. The process of reducing the size of a data file is referred to as data compression [4]. In the context of data transmission, it is called source coding (encoding done at the source of the data before it is stored or transmitted), as opposed to channel coding.

Figure 1: Types of compression (lossy and lossless).

3.2 What to compress?

Compressing input files:
If the input file is compressed, the bytes read in from HDFS are reduced, which means less time to read data. This time saving is beneficial to the performance of job execution. Compressed input files are decompressed automatically as they are read by MapReduce, which uses the filename extension to determine which codec to use. For example, a file ending in .gz is identified as a gzip-compressed file and is therefore read with GzipCodec.

Compressing output files:
Often we need to store the output as history files. If the amount of output per day is extensive, and we often need to store historical results for future use, these accumulated results take up an extensive amount of HDFS space. However, such history files may not be used very frequently, resulting in a waste of HDFS space. It is therefore worthwhile to compress the output before storing it on HDFS.

Compressing map output:
Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, using a fast compressor such as LZO or Snappy can yield performance gains simply because the volume of data to transfer is reduced.
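The extension-driven codec lookup described above is exposed through Hadoop's CompressionCodecFactory class. The following standalone sketch is our illustration rather than part of the paper; the class name is arbitrary and a 4 KB copy buffer is assumed. It resolves a codec from a file's extension (GzipCodec for .gz) and streams the decompressed bytes to standard output:

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class DecompressByExtension {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(args[0]); // e.g. an HDFS path ending in .gz

            // Map the file extension to a codec, as MapReduce does for input files.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(path); // null if the extension is unknown
            if (codec == null) {
                System.err.println("No codec found for " + path);
                return;
            }

            // Wrap the raw HDFS stream in a decompressing stream and copy it to stdout.
            try (InputStream in = codec.createInputStream(fs.open(path))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }

Run with the file to inspect as the only argument; a .bz2 extension would select BZip2Codec in the same way (and .snappy would select SnappyCodec where that codec is registered).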
3.3 Common input formats

Compression format | Tool  | Algorithm | File extension | Splittable
gzip               | gzip  | DEFLATE   | .gz            | No
bzip2              | bzip2 | bzip2     | .bz2           | Yes
LZO                | lzop  | LZO       | .lzo           | Yes, if indexed
Snappy             | n/a   | Snappy    | .snappy        | No

Table 1: Common input formats

3.4 Choosing a Data Compression Format

Whether to compress your data and which compression formats to use can have a significant impact on performance. The two most important places to consider data compression are MapReduce jobs and data stored in HBase; for the most part, the principles are similar for each [5].

You need to balance the processing capacity required to compress and decompress the data, the disk I/O required to read and write the data, and the network bandwidth required to send the data across the network. The correct balance of these factors depends upon the characteristics of your cluster and your data, as well as your usage patterns.

Compression is not recommended if your data is already compressed (such as images in JPEG format). In fact, the resulting file can actually be larger than the original.

GZip compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZip is often a good choice for cold data, which is accessed infrequently; Snappy or LZO are a better choice for hot data, which is accessed frequently.

BZip2 can also produce more compression than GZip for some types of files, at the cost of some speed when compressing and decompressing. HBase does not support BZip2 compression.

Snappy often performs better than LZO. It is worth running tests to see whether you detect a significant difference.

For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not (as Table 1 notes, LZO requires an index, and Snappy is splittable only inside a container format such as SequenceFile). Splittability is not relevant to HBase data.

For MapReduce, you can compress the intermediate data, the output, or both; adjust the parameters you provide for the MapReduce job accordingly. The following examples compress both the intermediate data and the output. MRv2 [6] is shown first, followed by MRv1 [7].

MRv2 Coding:

    hadoop jar hadoop-examples-.jar sort \
      "-Dmapreduce.compress.map.output=true" \
      "-Dmapreduce.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
      "-Dmapreduce.output.compress=true" \
      "-Dmapreduce.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
      -outKey org.apache.hadoop.io.Text \
      -outValue org.apache.hadoop.io.Text \
      input output

MRv1 Coding:

    hadoop jar hadoop-examples-.jar sort \
      "-Dmapred.compress.map.output=true" \
      "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
      "-Dmapred.output.compress=true" \
      "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
      -outKey org.apache.hadoop.io.Text \
      -outValue org.apache.hadoop.io.Text \
      input output
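The same settings can also be applied programmatically in a job driver rather than via -D options. The driver below is our sketch, not from the paper: it assumes the Hadoop 2.x Java MapReduce API, in which the canonical map-output property names are mapreduce.map.output.compress and mapreduce.map.output.compress.codec. Following the hot/cold guidance above, it compresses the short-lived intermediate data with Snappy and the stored output with gzip; the class name is arbitrary, and the default identity mapper and reducer are used so the job simply passes its input through:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedPassThrough {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress the intermediate map output with a fast codec.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed pass-through");
            job.setJarByClass(CompressedPassThrough.class);
            // No mapper/reducer set: the identity defaults copy input to output.

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Compress the final output with gzip for compact long-term storage.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }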
4.1 YARN

YARN (Yet Another Resource Negotiator) is a key element of the Hadoop data processing architecture that provides different data handling mechanisms, including interactive SQL (Structured Query Language) and batch processing. In YARN, the Resource Manager is the resource distributor, while the Application Master is responsible for communication with the Resource Manager and cooperates with the NodeManager to complete the tasks [8]. YARN can be considered the operating system of the Hadoop ecosystem: it improves the performance of data processing in Hadoop by separating the resource management and scheduling capabilities of MapReduce from its data processing components. YARN combines a central resource manager, which reconciles the way applications use Hadoop system resources, with node manager agents, which monitor the processing operations of individual cluster nodes. Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that cannot wait for batch jobs to finish.

Given the limitations of MapReduce, the main purpose of YARN is to divide the duties of the JobTracker: in YARN, resources are managed by the Resource Manager, jobs are traced by the Application Master, and the Task Tracker has become the NodeManager. Hence, the global Resource Manager and the local NodeManagers compose the data computing framework, with the Application Master communicating with the Resource Manager and cooperating with the NodeManager to complete the tasks [9].

4.2 YARN architecture

Compared with the old MapReduce architecture, it is easy to see that YARN is more structured and simpler. There are four core components of the YARN architecture [10]; a minimal client-side sketch of their interplay follows Section 4.3.

ResourceManager

According to its different functions, the designers have divided the ResourceManager into two lower-level components: the Scheduler and the Application Manager. On the one hand, the Scheduler assigns resources to the different running applications based on the cluster size, queues, and resource constraints; it is responsible only for resource allocation, not for monitoring application execution or task failure. On the other hand, the Application Manager is in charge of receiving jobs and redistributing containers for failed objects.

Application Master

The Application Master cooperates with the NodeManager to place tasks in suitable containers, run them, and monitor them. When a container has errors, the Application Master applies to the Scheduler for another resource so that processing can continue.

Container

In YARN, the Container is the unit into which a node's available resources are split. Instead of the fixed Map and Reduce slot pools of MapReduce, the Application Master can apply for any number of Containers. Because all Containers have the same properties, they can be exchanged during task execution to improve efficiency.

4.3 Difference between MapReduce and YARN (Yet Another Resource Negotiator)

The drawbacks of MapReduce become the benefits of YARN. Here are the drawbacks of MapReduce: the MapReduce model is unable to process streaming data and in-memory interactive operations. Some of its limitations are as follows:

SCALABILITY

AVAILABILITY
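To make the ResourceManager/ApplicationMaster handshake of Section 4.2 concrete, the sketch below uses the Hadoop 2.x YarnClient API. It is our illustration, not part of the paper, and it deliberately stops short of submitting an application (which would additionally require a ContainerLaunchContext describing the ApplicationMaster process); the application name and the 1 GB / 1 vcore container request are arbitrary:

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnHandshakeSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the ResourceManager, the global resource distributor.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ApplicationId appId = ctx.getApplicationId();

            // Declare the uniform Container unit the ApplicationMaster itself will run in.
            ctx.setApplicationName("handshake-sketch");
            ctx.setResource(Resource.newInstance(1024, 1)); // 1024 MB, 1 vcore

            System.out.println("ResourceManager granted application id: " + appId);
            yarnClient.stop();
        }
    }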