Efficient Data Searching and Retrieval Using Block Level Deduplication
1Supriya More, 2Kailas Devadkar
1Department of Computer Engineering, Sardar Patel Institute of Technology, Mumbai, India.
2Deputy Head of Department, Sardar Patel Institute of Technology, Mumbai, India.
E-mail: [email protected], [email protected]

Abstract - We live in the era of the digital world, and the amount of data is growing tremendously. This increased growth rate imposes many challenges on storage systems. Deduplication is an emerging technology that not only reduces storage space but also eliminates redundant data. This paper proposes an efficient deduplication method for distributed systems that uses block level chunking to eliminate redundant data. A traditional deduplication system keeps a single global index table for storing unique hash values; in our system there is a separate index table for each distributed data server, which reduces searching and retrieval time. Each incoming block is checked for duplicates in all of the index tables in parallel. Comparative results against a traditional deduplication system show that our system improves processing time and storage utilization.

Keywords - Data Deduplication, Index Management, Distributed System, Storage.

I. INTRODUCTION

Nowadays, due to the increase in the data growth rate, there is huge pressure on storage systems. It is very important to use storage effectively in order to keep a large amount of data in minimum space. Research has shown that almost fifty percent of data exists in duplicate form [4], so there is little reason to waste storage on duplicate copies. Storage is expensive not only for enterprise organizations but also for basic home users. With new technologies such as the Internet of Things and cloud computing, millions of data items are generated over the network every second. Most of this data is dynamic in nature, i.e. it is repeatedly changed or modified by users. Deduplication is a solution to this problem: it is a technique that effectively eliminates duplicate data and stores only the unique, original data.

Deduplication methods differ in their chunking type, i.e. file level chunking and block level chunking [5]. When a user uploads a file for backup, the first stage is to generate a hash value for that file; the generated hash is then compared with the hash values already stored in the index table. If a match is found, the same data already exists in storage, so the new data is discarded and only a reference pointer to the matched data is kept. In this way duplicate data is eliminated. In file level chunking the entire file is treated as one chunk, so only one hash is generated per file. In block level chunking each file is divided into fixed size blocks and a hash value is generated for each block. A file may contain redundant data within itself; in such a scenario file level deduplication fails to eliminate the duplication, whereas block level deduplication removes it easily. We therefore use the block level deduplication technique in our proposed system.
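As a minimal sketch (not the authors' implementation), the following Python snippet illustrates the point made above: a single file level fingerprint cannot detect redundancy inside a file, while fixed size block level chunking with MD5 fingerprints can. The 4 KB block size and the in-memory index table are illustrative assumptions.

import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; the paper does not fix a specific value here

def file_fingerprint(data: bytes) -> str:
    # File level chunking: one MD5 hash for the whole file.
    return hashlib.md5(data).hexdigest()

def block_fingerprints(data: bytes, block_size: int = BLOCK_SIZE):
    # Block level chunking: split into fixed size blocks and hash each block.
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        yield hashlib.md5(block).hexdigest(), block

def deduplicate_blocks(data: bytes, index: dict):
    # Store a block only if its fingerprint is not already in the index table;
    # otherwise keep just a reference to the existing copy.
    stored, references = 0, 0
    for fp, block in block_fingerprints(data):
        if fp in index:
            references += 1        # duplicate block: keep a reference pointer only
        else:
            index[fp] = block      # unique block: store it together with its fingerprint
            stored += 1
    return stored, references

if __name__ == "__main__":
    # A file made of the same 4 KB block repeated eight times: file level hashing
    # sees one unique file, but block level deduplication stores the block only once.
    data = b"A" * BLOCK_SIZE * 8
    print(file_fingerprint(data))          # one fingerprint for the whole file
    print(deduplicate_blocks(data, {}))    # -> (1, 7): 1 stored block, 7 references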
Data searching and retrieval is one of the important operations affecting the overall deduplication system: its performance depends on the searching time required to find a matching index entry [1]. It is very challenging to build reliable index management for cluster deduplication, since the number of block index entries increases as the amount of data grows. In the traditional deduplication technique, a single global index table is maintained for storing the unique hash values, i.e. the unique fingerprints. Frequent queries burden this single index table, and the resulting overhead degrades the overall searching process. Another problem is that a complete data block has to be transferred to the data server, even if it is a duplicate, before deduplication is performed, which increases the network bandwidth overhead.

In this paper we propose a system that performs parallel deduplication with improved processing time. It contains one metadata server that holds a separate index table for each data server in the cluster, and the data is backed up across the cluster in a distributed manner. When a new data block arrives for backup, it is searched for in every index table in parallel. The deduplication work is split: hashing and blocking are done at the application server level, while matching is done at the metadata server level. Only unique blocks are transferred to the data servers for storage, which reduces the network bandwidth. Blocks are placed according to the number of hash entries in the index tables on the metadata server: a data block is stored at the data node whose index table has the fewest entries, so the system also achieves load balancing. A sketch of this parallel lookup is given below.
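This is a minimal illustration under our own assumptions (in-memory dictionaries standing in for the per-data-node index tables, and a thread pool for the parallel search); it is not the actual metadata server implementation.

import hashlib
from concurrent.futures import ThreadPoolExecutor

class MetadataServer:
    """Keeps one index table (fingerprint -> block location) per data node."""

    def __init__(self, node_names):
        self.index_tables = {name: {} for name in node_names}

    def _lookup(self, node, fingerprint):
        # Search a single node's index table for the fingerprint.
        return node if fingerprint in self.index_tables[node] else None

    def find_duplicate(self, fingerprint):
        # Query every index table in parallel; return the node that already
        # holds the block, or None if the fingerprint is unknown everywhere.
        with ThreadPoolExecutor(max_workers=len(self.index_tables)) as pool:
            results = pool.map(lambda n: self._lookup(n, fingerprint),
                               self.index_tables)
        return next((node for node in results if node), None)

    def least_loaded_node(self):
        # Load balancing: the data node whose index table has the fewest entries.
        return min(self.index_tables, key=lambda n: len(self.index_tables[n]))

    def register_block(self, node, fingerprint, location):
        self.index_tables[node][fingerprint] = location

# Example: three data nodes, one registered block, one duplicate lookup.
meta = MetadataServer(["node1", "node2", "node3"])
fp = hashlib.md5(b"some block").hexdigest()
meta.register_block("node2", fp, "node2:/blocks/0001")
print(meta.find_duplicate(fp))      # -> node2 (the block is already stored there)
print(meta.least_loaded_node())     # -> node1 (its index table has no entries yet)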
In the rest of the paper, Section II describes related work on deduplication systems, Section III gives the methodology and system architecture of the proposed system, Section IV covers the evaluation and results using different test cases, and finally Section V concludes the paper.

II. RELATED WORK

Q. Liu et al. [1] present Halodedu, a new and more scalable Hadoop based deduplication system. They perform parallel deduplication using MapReduce and HDFS. Each data node has a separate local database for storing hash identifiers, which keeps the index table manageable and effectively increases the speed of fingerprint searching. In MapReduce, only the Map stage is used, in order to decrease the processing time of the system. At the metadata server they use HBase.

N Kumar et al. [2] introduced a bucket based deduplication method to achieve reliable data deduplication. The input data file is divided into fixed size chunks, and the MD5 algorithm is used to generate a unique hash identifier for each chunk. These hash values are compared with the hash values stored in a bucket, using a MapReduce technique, to check whether a block is a duplicate; if a match is found, the block is considered a duplicate and can be discarded.

J. Yong et al. [3] proposed a cloud storage system named Dedu, a data deduplication system used for effective management of duplicate data. They use HDFS and HBase, with HBase providing faster search efficiency. Deduplication runs as a front-end application, and a mass storage system in the cloud acts as the back end; HDFS is used for the mass storage system and VMware for cloud simulation.

A. Venish and K. Siva Sanka [4] and R. Vikraman [5] cover different algorithms for data deduplication and compare different chunking algorithms. Manogar [6] examined and compared different data deduplication methods and concluded that variable size deduplication performs better than the other deduplication techniques.

R-S Chang et al. [7] proposed a deduplication decision system with two thresholds: data is split into cold data and hot data according to low and high access frequency. They propose a dynamic deduplication decision to improve the storage utilization of data nodes, using HDFS as the file system. The proposed system can be seen as a proper deduplication strategy for efficient storage utilization under limited storage requirements.

Yunfeng Zhu [8] examined the load balance problem in distributed deduplication storage systems; by balancing the load on the data nodes, effective and reliable deduplication can be achieved, although chunking significantly slows down the deduplication process and affects the performance of the retrieval operation. Bhaskar et al. [9] focus on deduplication methods, analyzing chunk level and file level deduplication. Amanpreet Kaur and Sonia Sharma [10] analyze deduplication for cloud based systems, including the methods used to achieve cost effective storage and effective bandwidth usage through deduplication.

III. PROPOSED METHODOLOGY

Data searching and retrieval is one of the important operations affecting the overall deduplication system, and the overall performance depends on the searching time required to find a matching index entry [1]. In our proposed system, to eliminate the issue of the global index table, we create an index table for each data node, which results in parallel deduplication and faster data retrieval operations.

Based on the survey and the issues identified, we propose the following deduplication strategy (an illustrative sketch of the end-to-end flow appears at the end of this section):
1) The client registers, logs in and uploads the input file to be backed up.
2) Chunking phase: the input file is divided into fixed length chunks.
3) After chunking, a hash value is calculated for each data block using the MD5 hash algorithm. The hash is then searched in the metadata server, in parallel across the per-data-node index tables, to check whether this data block already exists on any data node.
4) If a match is found, only a reference pointer is sent to the respective data node's index table; otherwise the whole data block, along with its hash identifier, is sent to a data node in encrypted format.
5) On the data node side, the data block is decrypted, the hash identifier is stored in the index table and the data block is stored. The data node sends metadata to the metadata server regularly.

Figure 1: Flow diagram of proposed methodology

ALGORITHM
A. Deduplication (file f)
B. Split the file f into fixed size blocks
... the data node stores the block together with its respective hash identifier received from the application server, and meta-data for each block is transferred regularly to the metadata server.
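The Algorithm listing above is truncated in the source, so the following Python sketch only illustrates how the chunking, lookup and placement of steps 2) to 5) fit together, under the same assumptions as the earlier snippets (fixed 4 KB blocks, in-memory index tables, and a toy XOR cipher standing in for real encryption); it is not the authors' implementation.

import hashlib
import os

BLOCK_SIZE = 4096                                          # assumed fixed block size

index_tables = {"node1": {}, "node2": {}, "node3": {}}     # one index table per data node
block_store  = {"node1": {}, "node2": {}, "node3": {}}     # simulated data node storage

def xor_encrypt(block: bytes, key: bytes) -> bytes:
    # Placeholder cipher so the sketch stays self-contained; a real deployment
    # would use a proper encryption scheme before transferring the block.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(block))

def backup_file(path: str, key: bytes = b"demo-key"):
    with open(path, "rb") as f:
        data = f.read()
    for offset in range(0, len(data), BLOCK_SIZE):                       # step 2: fixed size chunking
        block = data[offset:offset + BLOCK_SIZE]
        fp = hashlib.md5(block).hexdigest()                              # step 3: MD5 hash per block
        owner = next((n for n, t in index_tables.items() if fp in t), None)
        if owner:                                                        # step 4: duplicate found,
            index_tables[owner][fp].append(offset)                       #   record a reference pointer only
        else:                                                            # step 4: unique block
            target = min(index_tables, key=lambda n: len(index_tables[n]))  # least loaded data node
            cipher = xor_encrypt(block, key)                             #   transferred in encrypted form
            block_store[target][fp] = xor_encrypt(cipher, key)           # step 5: data node decrypts and stores
            index_tables[target][fp] = [offset]                          #   hash identifier kept in its index table
    return {n: len(t) for n, t in index_tables.items()}

if __name__ == "__main__":
    with open("demo.bin", "wb") as f:                                    # a file with repeated content
        f.write(os.urandom(BLOCK_SIZE) * 4)
    print(backup_file("demo.bin"))   # -> {'node1': 1, 'node2': 0, 'node3': 0}: the block is stored once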