EFFICIENT DATA SEARCHING AND RETRIEVAL USING BLOCK LEVEL DEDUPLICATION

1SUPRIYA MORE, 2KAILAS DEVADKAR

1Department of Computer Engineering, Sardar Patel Institute of Technology, Mumbai, India. 2Deputy Head of Department, Sardar Patel Institute of Technology, Mumbai, India. E-mail: [email protected], [email protected]

Abstract - As we are living in the era of the digital world, there is tremendous growth in the amount of data. The increased data growth rate imposes many challenges on storage systems. Deduplication is an emerging technology which not only reduces the required storage space but also eliminates redundant data. This paper proposes an efficient deduplication method for distributed systems which uses block level chunking to eliminate redundant data. In a traditional deduplication system there is only one global index table for storing unique hash values, but in our system there is a separate index table for each of the distributed data servers, which reduces the searching and retrieval time. Each incoming block is checked for duplicates in all of the index tables in parallel. Comparative results with a traditional deduplication system show that our system improves the processing time and storage utilization.

Keywords - Data Deduplication, Index Management, Distributed System, Storage.

I. INTRODUCTION

Nowadays, due to the increase in the data growth rate, there is huge pressure on storage systems. It is very important to use the storage system effectively in order to store a large amount of data in minimum storage space. Research has shown that almost fifty percent of data exists in duplicate form [4], so the question is why waste storage just to keep duplicate data. Today storage is expensive not only for enterprise organizations but also for basic home users. Due to new technologies like the Internet of Things and cloud computing, millions of data items are generated over the network every second. Most of this data is dynamic in nature, i.e. it keeps on changing or is modified by users repeatedly. Deduplication is a solution for such a problem: it is a technique which effectively eliminates duplicate data and stores only the unique, i.e. original, data.

There are different methods of deduplication depending on the chunking type, i.e. file level chunking and block level chunking [5]. When a user uploads any file for backup, the first stage is to generate the hash value of that particular file; next, the generated hash is compared with the hash values already stored in the index table. If a match is found, the same data already exists in storage, hence the incoming data is discarded and only a reference pointer to the matched data is kept. In this way duplicate data is eliminated. In file level chunking the entire data file is considered as one whole chunk, hence only one hash is generated per file. In block level chunking each file is divided into fixed size blocks and a hash value is generated for each block. There can be a case where one file contains redundant data within itself; in such a scenario file level deduplication will fail to eliminate the duplication, but block level deduplication can easily eliminate it. We therefore use the block level deduplication technique in our proposed system.

Data searching and retrieval is one of the important operations which can affect the overall deduplication system. The overall performance of the system depends upon the searching time required to find a matching index [1]. It is very challenging to build reliable index management for cluster deduplication, since the number of block index entries increases as the data amount increases. In the traditional deduplication technique, one global index table is maintained for storing the unique hash values, i.e. the unique fingerprints. Frequent queries burden this single index table and put overhead on it, which in turn affects the overall searching process. Another problem is that the complete data block needs to be transferred to the data server even if it is a duplicate, and only then is deduplication performed, which increases the network bandwidth overhead.

In this paper we propose a system which performs parallel deduplication with improved processing time. It contains one metadata server which holds a different index table for each data server in the cluster, where all the data is backed up in a distributed manner. When a new data block arrives for backup, it is searched in each index table in parallel. Here the deduplication technique is split: hashing and blocking are done at the application server level, while matching is done at the metadata server level. Only unique blocks are transferred to the data servers for storage, which reduces the network bandwidth. Blocks are distributed depending upon the number of hash entries in each index table in the metadata server: a data block is stored at the data node which has fewer entries in its index table, and in this way load balancing is achieved.

In the rest of the paper, section 2 describes related work done on deduplication systems, section 3 gives the methodology and system architecture of the proposed system, section 4 covers the evaluation and results using different test cases, and finally section 5 concludes the paper.

II. RELATED WORK

Q. Liu et al. [1] present Halodedu, a new and more scalable Hadoop based cluster deduplication system. They perform parallel deduplication using MapReduce and HDFS. In each data node there is a separate local database to store hash identifiers, which properly manages the index table performance. This method effectively increases the speed of fingerprint searching. In MapReduce they use only the Map stage in order to decrease the processing time of the system. At the metadata server they use HBase.

N Kumar et al. [2] introduce a bucket based deduplication method to achieve reliable data deduplication. The input data file is divided into fixed size chunks, and the MD5 algorithm is used to generate a unique hash identifier for each chunk. These hash values are compared with the hash values stored in a bucket, using the MapReduce technique, to check whether a block is a duplicate. If a match is found, the block is considered a duplicate and can be discarded.

J. Yong et al. [3] propose a cloud storage system, Dedu, for effective management of duplicate data. They use HDFS and HBase for the deduplication system; HBase is used for faster search efficiency. Deduplication is used as a frontend application, and the second major component is a mass storage system in the cloud as a backend, for which HDFS is used. VMware is used for cloud simulation.

A. Venish and K. Siva Sankar [4] and R. Vikraman [5] cover different algorithms for data deduplication and compare different chunking algorithms. Manogar [6] examines and compares different data deduplication methods and concludes that variable size deduplication is better than the other deduplication techniques.

R-S Chang et al. [7] propose a deduplication decision system with two thresholds; it splits the data into cold data and hot data depending on low and high access frequency. They propose a dynamic deduplication decision to improve the storage utilization of data nodes, using HDFS as the file system. The proposed system can be seen as a proper deduplication strategy for efficient storage utilization under limited storage requirements.

Yunfeng Zhu [8] examines the load balance problem in distributed deduplication storage systems; by balancing the load on the data nodes, effective and reliable deduplication can be achieved. Due to chunking, the deduplication process slows down significantly, which affects the performance of the retrieval operation. Bhaskar et al. [9] focus on deduplication methods; chunk level and file level deduplication are analyzed. Amanpreet Kaur and Sonia Sharma [10] analyse deduplication for cloud based systems; they include the methods that are used to achieve cost effective storage and effective bandwidth usage by deduplication.

III. PROPOSED METHODOLOGY

Data searching and retrieval is one of the important operations which can affect the overall deduplication system. The overall performance of the system depends upon the searching time required to find a matching index [1]. In our proposed system, to eliminate the issue of the global index table, we create an index table for each node, which in turn results in parallel deduplication and faster data retrieval operations.

Depending on the survey and the issues identified, we have proposed an efficient deduplication strategy as follows:

1) The client registers, logs in and uploads the input file which is to be backed up.

2) Chunking phase - The input file is divided into fixed length chunks.

3) After chunking, a hash value is calculated for each data block using the MD5 hash algorithm (see the sketch after Figure 1). The hash is then searched in the metadata server in parallel across the per-data-node index tables, in order to check whether this data block already exists on any data node.

4) If a match is found, only a reference pointer is sent to the respective data node's index table; otherwise the whole data block with its hash identifier is sent to the data node in encrypted format.

5) On the data node side, the data block is decrypted, the hash identifier is stored in the index table and the data block is stored. The data node sends metadata regularly to the metadata server.

Figure 1: Flow diagram of proposed methodology
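Steps 2 and 3 above (fixed size chunking and per-block MD5 fingerprinting) can be illustrated with the minimal sketch below. It is not the exact code of our prototype: the class name and input file name are illustrative, the 25 KB block size is the one used later in the evaluation, and error handling is kept to a minimum.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.List;

    public class Chunker {

        // Fixed block size used in our experiments (25 KB); any value could be substituted.
        static final int BLOCK_SIZE = 25 * 1024;

        // Reads the file in fixed size blocks and returns the MD5 fingerprint of each block.
        static List<String> blockFingerprints(String path) throws IOException, NoSuchAlgorithmException {
            List<String> fingerprints = new ArrayList<>();
            try (FileInputStream in = new FileInputStream(path)) {
                byte[] buffer = new byte[BLOCK_SIZE];
                int read;
                while ((read = in.read(buffer)) > 0) {
                    MessageDigest md5 = MessageDigest.getInstance("MD5");
                    md5.update(buffer, 0, read);          // the last block may be shorter than 25 KB
                    fingerprints.add(toHex(md5.digest()));
                }
            }
            return fingerprints;
        }

        // Converts a digest to the hexadecimal string stored in the index table.
        static String toHex(byte[] digest) {
            return String.format("%032x", new BigInteger(1, digest));
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical input file; in the real system this is the file uploaded by the client.
            for (String fp : blockFingerprints("backup-input.txt")) {
                System.out.println(fp);
            }
        }
    }

Hashing each block separately is what allows redundancy inside a single file to be detected, which file level chunking (one hash for the whole file) cannot do.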

ALGORITHM

A. Deduplication (file f)
B.   Split the file f into fixed size blocks
C.   for each block b do
D.     H = getHashValue(b)
E.     Match = search H in the index table
F.     if Match = true then
G.       Send the reference pointer and update the metadata table
H.     else add a new record to the index table
I.       Update the information in the metadata table
J.     end if
K.   end for
L. end Deduplication

SYSTEM DESIGN AND ARCHITECTURE

As shown in figure 2, there is one metadata server which contains a different index table for each of the data servers in the cluster, where all the data is backed up in a distributed manner. When a new data block arrives for backup, it is searched in each index table in parallel. Here the deduplication technique is split: hashing and blocking are done at the application server level, and matching is done at the metadata server level. Only unique blocks are transferred to the data servers for storage, which reduces the network bandwidth.

Figure 2: Block diagram of proposed deduplication technique

In our proposed system, there are the following components:

1. Application Server - This is the client side system which performs various tasks: blocking the input file into small data blocks, applying the MD5 hashing algorithm on each data block to generate the unique hash value of the block, checking the hash value of the data block in the metadata server, encrypting the data blocks, and distributing the data blocks to the different data nodes.

2. Data Nodes - There are different data nodes which act as storage servers and save all the data that needs to be backed up. A data node receives a unique data block with its respective hash identifier from the application server and stores it. Metadata for each block is transferred regularly to the metadata server. Each data node contains an index table to store the hash identifiers of the stored blocks.

3. Metadata Server - The metadata server contains an index table for each of the data servers where the actual data is stored. The metadata server stores all the information about the data blocks. Information is stored separately for each data node, hence searching can be done in parallel here. When any block arrives for backup, its hash identifier is searched in all index tables in parallel. If the block is a duplicate it is simply discarded and a reference to the existing similar block is sent; if it is unique it is stored in the data node which has fewer hash entries, so that load balancing is also achieved.

Index table: The index table is the most important component of the whole system. Every new incoming block is searched in the index table for duplicates. The index table has the following fields: file id, file name, block offset and hash value of the block. Here we take the unique hash value as the primary key in order to search for a duplicate block. There is a separate index table for each data node, hence the searching is much faster.
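The Deduplication procedure above can be sketched in runnable form as follows. This is a simplified, single-process illustration, not our exact implementation: the index table is stood in for by an in-memory map keyed by the block fingerprint, whereas in the proposed system the index entries live in per-data-node tables on the metadata server, and sending the reference pointer is represented by a console message.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class SimpleDeduplicator {

        static final int BLOCK_SIZE = 25 * 1024;                  // fixed chunk size, as in the evaluation

        // Index table stand-in: fingerprint -> location of the stored block (node id + block number).
        private final Map<String, String> index = new HashMap<>();
        private int storedBlocks = 0;

        // Deduplication(file f): split into fixed size blocks, hash each block, store only unique blocks.
        public void deduplicate(byte[] file) throws Exception {
            for (int offset = 0; offset < file.length; offset += BLOCK_SIZE) {
                byte[] block = Arrays.copyOfRange(file, offset, Math.min(offset + BLOCK_SIZE, file.length));
                String h = md5Hex(block);                         // D. H = getHashValue(b)
                String match = index.get(h);                      // E. search H in the index table
                if (match != null) {                              // F. duplicate found
                    System.out.println("duplicate block -> reference to " + match);
                } else {                                          // H. unique block
                    String location = "node-1/block-" + (storedBlocks++);
                    index.put(h, location);                       // I. add a new record to the index table
                    System.out.println("stored unique block at " + location);
                }
            }
        }

        static String md5Hex(byte[] data) throws Exception {
            return String.format("%032x", new BigInteger(1, MessageDigest.getInstance("MD5").digest(data)));
        }

        public static void main(String[] args) throws Exception {
            byte[] sample = "hypothetical file content".getBytes(StandardCharsets.UTF_8);
            new SimpleDeduplicator().deduplicate(sample);
        }
    }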

IV. EVALUATION

The proposed technique is implemented on a Lenovo machine with 8 GB RAM running a 64-bit Windows 10 operating system. Random data files have been used as the input data set. The prototype is built with the Spring Boot framework, in which the MD5 hashing and fixed size blocking algorithms are implemented. In order to store data files on multiple storage systems we have used OpenKM. OpenKM is a document management system which uses an embedded database called HSQLDB; this database is integrated into JBoss and offers good performance and low hardware requirements. Starting with OpenKM 5, the databases can be created automatically by configuring the hibernate.dialect and hibernate.hbm2ddl properties in OpenKM.cfg.

For the distributed index system, MySQL is used. MySQL is an open source, lightweight relational database which is a reliable and powerful tool for parallel indexing. In our experiment we have used two OpenKM storage systems for storing blocks, so there are two index tables in the metadata server, one for each storage system. For the centralized index method, the hash identifiers are stored in a single MySQL index table.
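A sketch of how the parallel lookup over per-node MySQL index tables might look is given below. It is a simplified illustration rather than our exact implementation: the JDBC URL, the credentials and the index_node1/index_node2 table and hash_value column names are hypothetical, and the lookup simply reports whether the fingerprint exists in any of the tables.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelIndexLookup {

        // Hypothetical per-data-node index tables held on the metadata server.
        private static final String[] INDEX_TABLES = {"index_node1", "index_node2"};
        private static final String JDBC_URL = "jdbc:mysql://localhost:3306/dedup_metadata";

        // Returns true if the fingerprint is already present in any of the per-node index tables.
        public static boolean isDuplicate(String fingerprint) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(INDEX_TABLES.length);
            try {
                List<Future<Boolean>> results = new ArrayList<>();
                for (String table : INDEX_TABLES) {
                    // One lookup task per index table, executed concurrently.
                    Callable<Boolean> task = () -> {
                        String sql = "SELECT 1 FROM " + table + " WHERE hash_value = ? LIMIT 1";
                        try (Connection con = DriverManager.getConnection(JDBC_URL, "user", "password");
                             PreparedStatement ps = con.prepareStatement(sql)) {
                            ps.setString(1, fingerprint);
                            try (ResultSet rs = ps.executeQuery()) {
                                return rs.next();
                            }
                        }
                    };
                    results.add(pool.submit(task));
                }
                for (Future<Boolean> r : results) {
                    if (r.get()) {
                        return true;          // a match in any table means the block is a duplicate
                    }
                }
                return false;
            } finally {
                pool.shutdown();
            }
        }
    }

In this arrangement each index table stays small, so the individual queries remain cheap even as the total number of fingerprints grows.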

The following performance measures have been used for system analysis and to compare the proposed system with traditional single server deduplication:

1) File size after deduplication - This is the most important parameter; it gives the total amount of data remaining after the deduplication process and shows how much unique data is present.

2) Deduplication ratio - The deduplication ratio indicates how much duplicate data has been removed from the files. It is defined as the ratio of the deduplicated file size to the original file size.

Deduplication ratio = Deduplicated file size / Original file size
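For example, with hypothetical figures, if a 100 MB input file occupies 40 MB after deduplication, the deduplication ratio as defined above is 40 / 100 = 0.4, meaning that 60% of the original data was duplicate.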

3) Searching time - The time required to search for a similar hash identifier in the index table.

Table 1: Deduplication results

Figure 3: Deduplication result

i) File level deduplication vs. block level deduplication
Here a comparison between traditional file level deduplication and block level deduplication is analyzed. In file level chunking the entire file is considered as one whole chunk, hence for each file there is only one unique hash identifier. In block level chunking we took 25 kb as the fixed chunk size; accordingly, the file is divided into blocks, and the last block may be smaller than 25 kb. Fig. 4 illustrates the performance of both deduplication techniques and shows that block level deduplication eliminates more redundant data.

Figure 4: File level vs. Block level deduplication

ii) Block size vs. deduplication ratio
We have also examined the relationship between block size, deduplication ratio and processing time. The smaller the block size, the more duplicate data can be eliminated; as the block size increases, the deduplication ratio decreases. But there is a trade-off between block size and processing time: a smaller block size increases the total number of blocks and hence puts more overhead on the searching and hashing process, which in turn degrades the system performance. Fig. 5 illustrates the relation between block size and deduplication ratio.
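To make this trade-off concrete with hypothetical figures: for a 1 GB file, a 25 KB block size produces roughly 1 GB / 25 KB ≈ 41,900 blocks, and therefore about 41,900 fingerprints to hash and look up, whereas a 100 KB block size produces only about 10,500. The smaller block size can detect finer-grained duplicates but roughly quadruples the hashing and index searching work.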

Figure 5: Deduplication ratio vs. Block size

iii) Processing time in traditional vs. proposed system
In our proposed system we used three MySQL index tables to store the hash identifiers of the blocks stored in the distributed data servers. When a new incoming block arrives, its hash identifier is searched in parallel in the three index tables, which reduces the searching and retrieval time. If a block identifier is unique, it is stored in the index table which has fewer hash entries; in this way load balancing is also achieved (a small sketch of this choice is given below). Fig. 6 illustrates the improvement in processing time of the proposed parallel deduplication system.
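As an illustration of that placement rule, the following sketch chooses the index table with the fewest entries for a new, unique block. The table names and JDBC connection details are the same hypothetical ones used in the earlier lookup sketch, not the actual configuration of our prototype.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LoadBalancedPlacement {

        // Hypothetical per-node index tables, as in the parallel lookup sketch.
        private static final String[] INDEX_TABLES = {"index_node1", "index_node2", "index_node3"};
        private static final String JDBC_URL = "jdbc:mysql://localhost:3306/dedup_metadata";

        // Picks the index table (and hence data node) that currently holds the fewest block entries.
        public static String leastLoadedTable() throws Exception {
            String best = INDEX_TABLES[0];
            long bestCount = Long.MAX_VALUE;
            try (Connection con = DriverManager.getConnection(JDBC_URL, "user", "password");
                 Statement st = con.createStatement()) {
                for (String table : INDEX_TABLES) {
                    try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + table)) {
                        rs.next();
                        long count = rs.getLong(1);
                        if (count < bestCount) {       // fewer entries -> less loaded data node
                            bestCount = count;
                            best = table;
                        }
                    }
                }
            }
            return best;
        }
    }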

Figure 6: Processing time in traditional vs. proposed system

CONCLUSION

Data deduplication is a technique by which one can improve storage space utilization by eliminating duplicate data. Traditional deduplication based on a centralized index table puts more overhead on the system and increases the processing time. In our proposed system we used a parallel index searching technique, which keeps a separate index table for each data node in the metadata server. The incoming input file is divided into blocks using a fixed size blocking algorithm, the MD5 algorithm is used to calculate the hash of each block, and finally the hash identifier is searched in parallel against the block hashes already stored in the index tables. The comparative experiment shows that our proposed system reduces the processing time and hence improves the system performance. In this paper we used text documents as the input data set; in future, more work can be done on images, videos, etc.

REFERENCES

[1] Q. Liu, Y. Fu, G. Ni, and R. Hou, "Hadoop Based Scalable Cluster Deduplication for Big Data," 2016 IEEE 36th International Conference on Distributed Computing Systems Workshops, 2016.
[2] N. Kumar, R. Rawat, and S. C. Jain, "Bucket Based Data Deduplication Technique," 5th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), Noida, 2016, pp. 267-271.
[3] Z. Sun, J. Shen, and J. Yong, "A novel approach to data deduplication over the engineering-oriented cloud systems," Integrated Computer-Aided Engineering, vol. 20, no. 1, pp. 45-57, 2013.
[4] A. Venish and K. Siva Sankar, "Study of Chunking Algorithm in Data Deduplication," Springer India, 2016.
[5] R. Vikraman and A. S, "A Study on Various Data Deduplication Systems," International Journal of Computer Applications, vol. 94, no. 4, pp. 35-40, 2014.
[6] E. Manogar and S. Abirami, "A Study on Data Deduplication Techniques for Optimized Storage," Sixth International Conference on Advanced Computing (ICoAC), IEEE, 2014, pp. 161-166.
[7] R-S Chang, C-S Liao, K-Z Fan, and C-M Wu, "Dynamic Deduplication Decision in a Hadoop Distributed File System," International Journal of Distributed Sensor Networks, pp. 1-14, April 2014.
[8] Min Xu, Yunfeng Zhu, Patrick P. C. Lee, Yinlong, "Even Data Placement for Load Balance in Reliable Distributed Deduplication Storage Systems," in Proc. of IEEE International Symposium on Quality of Service (IWQoS), pp. 349-358, 2015.
[9] Deepu S, Bhaskar, and Shylaja, "Performance Comparison of Deduplication Techniques for Storage in Cloud Computing Environment," Asian Journal of Computer Science and Information Technology, vol. 4, no. 5, pp. 42-46, 2014.
[10] Amanpreet Kaur and Sonia Sharma, "An Efficient Framework and Techniques of Data Deduplication in Cloud Computing," IJCST, vol. 8, April-June 2017.
[11] Shengmei Luo, Guangyan Zhang, and Chengwen Wu, "Boafft: Distributed Deduplication for Big Data Storage in the Cloud," IEEE Transactions on Cloud Computing, vol. 4, 2016.
[12] Deepavali Bhagwat, Kave Eshghi, and Darrell D. E. Long, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup," in Proc. IEEE Int. Symp. Modell. Anal. Simulation Comput. Telecommun. Syst., 2009, pp. 1-9.
[13] C. Liu, Y. Lu, C. Shi, et al., "ADMAD: Application-driven metadata aware deduplication archival storage system," in Proc. 5th IEEE Int. Workshop Storage Netw. Archit. Parallel I/Os, 2008, pp. 29-35.
[14] Jyoti Malhotra and Jagdish Bakal, "A Survey and Comparative Study of Data Deduplication Techniques," 2015 International Conference on Pervasive Computing (ICPC), 2015.
[15] Y. Fu, N. Xiao, and F. Liu, "Research and development on key techniques of data deduplication," Journal of Computer Research and Development, vol. 49, no. 1, pp. 12-20, 2012.


