SECURE MULTI KEYWORD FUZZY WITH SEMANTIC EXPANSION BASED SEARCH OVER ENCRYPTED CLOUD DATA

ARFA BAIG

Dept of Computer Science & Engineering B.N.M Institute of Technology, Bangalore, India E-mail: [email protected]

Abstract— The advent of cloud computing has eased access to Internet-based computing, and the cloud is commonly used for web servers or development systems where there are security and compliance requirements. Nevertheless, confidential information has to be encrypted to guard against intrusion. This paper presents a semantic-expansion-based multi-keyword fuzzy search over encrypted cloud data that uses the locality-sensitive hashing (LSH) technique. The solution returns not only the exactly matched files but also the files containing terms semantically related to the query keyword. In the proposed scheme, fuzzy matching is achieved through algorithmic design rather than by expanding the index files. The scheme also eliminates the need for a predefined dictionary and supports multi-keyword fuzzy search without increasing the index or search complexity. The indexes are built using LSH, and the result files are returned in order of their total relevance score.

Index Terms—Multi-keyword fuzzy search, Locality-Sensitive Hashing, Secure Semantic Expansion.

I. INTRODUCTION

Cloud computing is a form of computing that relies on shared computing resources rather than on local servers or personal devices to run applications. Public cloud deployments are commonly used for web servers or development systems where the security and compliance requirements of larger organizations and their customers are not an issue. In cloud computing, scalable and elastic storage and computation resources are provisioned as measured services over the Internet. Cloud computing lets customers enjoy on-demand, high-quality applications and services from a centralized pool of configurable computing resources. It relieves them of the burden of storage management, allows universal data access independent of geographical location, and avoids capital spending on hardware, software, and personnel maintenance.

To reduce the risk of data leakage to cloud service providers, the straightforward way to enforce data privacy is to encrypt sensitive data before it is outsourced. Unfortunately, data encryption, if not carried out appropriately, may reduce the efficiency of data utilization. Typically, a user retrieves files of interest through keyword search instead of retrieving all the files; such keyword-based search is used extensively in daily life, e.g. Google plaintext keyword search. These techniques, however, no longer work once the keywords are encrypted.

Fuzzy search using symmetric encryption has been a challenge: earlier schemes supported only a single exact keyword and relied on inverted indexes, which were not efficient. To preserve the privacy of the query keywords, cosine similarity measurement has been used for multi-keyword search, but it did not support fuzzy search and also required a predefined dictionary, which lacked scalability and flexibility when the data was modified or updated. These drawbacks create the need for a new multi-keyword fuzzy search technique.

Semantic-expansion-based multi-keyword fuzzy search improves system usability by returning both the exactly matched files and the files containing terms semantically related to the query keyword, which increases search flexibility.

Fuzzy searching finds a word even if it is misspelled. For instance, a fuzzy search for "apple" will also find "appple". It is therefore useful for searching text that may contain typographical errors. The multi-keyword search scheme removes the requirement of a predefined keyword dictionary through several designs based on locality-sensitive hashing, and it is secure, efficient, and accurate.

II. RELATED WORK

A. Privacy preserving multi-keyword text search in the cloud supporting similarity based ranking

With the increasing popularity of cloud computing, a huge number of documents are outsourced to the cloud for reduced management cost and ease of access. Although encryption helps protect the confidentiality of user data, it makes well-functioning yet practically efficient secure search over encrypted data a challenging problem. The paper presents a privacy-preserving multi-keyword text search (MTS) scheme with similarity-based ranking to address this problem. To support multi-keyword search and search result ranking, a search index based on term frequency and the vector space model is proposed, along with the cosine similarity measure, to achieve higher search result accuracy. To improve the

Proceedings of 14th IRF International Conference, Bengaluru, India, 31st May 2015, ISBN: 978-93-85465-25-3

search efficiency, a tree-based index structure and various adaptation methods for the multi-dimensional (MD) algorithm are proposed, so that the practical search efficiency is much better than that of linear search. To further enhance search privacy, two secure index schemes are used to meet the stringent privacy requirements under strong threat models, i.e., the known ciphertext model and the known background model. Finally, the effectiveness and efficiency of the proposed scheme are demonstrated through extensive experimental evaluation.

B. Public encryption with keyword search

The paper studies the problem of searching on data that is encrypted using a public key system. Consider user Bob, who sends email to user Alice encrypted under Alice's public key. An email gateway wants to test whether the email contains the keyword "urgent" so that it can route the email accordingly. Alice, on the other hand, does not wish to give the gateway the ability to decrypt all her messages. The paper defines and constructs a mechanism that enables Alice to provide a key to the gateway that lets the gateway test whether the word "urgent" is a keyword in the email without learning anything else about the email. The paper refers to this mechanism as Public Key Encryption with Keyword Search. As another example, consider a mail server that stores various messages publicly encrypted for Alice by others. Using the mechanism, Alice can send the mail server a key that enables the server to identify all messages containing some specific keyword, but learn nothing else. The paper defines the concept of public key encryption with keyword search and gives several constructions.

C. Approximate Keyword-based Search over Encrypted Cloud Data

To protect privacy, users have to encrypt their sensitive data before outsourcing it to the cloud. However, traditional encryption schemes are inadequate because they make indexing and searching much more challenging. Accordingly, searchable encryption systems have been developed to conduct search operations over a set of encrypted data. Unfortunately, these systems only allow their clients to perform exact search, not approximate search, which is an important need for current information retrieval systems. Recently, increased attention has been paid to approximate searchable encryption systems, which find keywords that approximately match the submitted queries. This work focuses on constructing a flexible secure index that allows the cloud server to perform approximate search operations without revealing the content of the query trapdoor or the index. Specifically, a recently proposed technique, order-preserving symmetric encryption (OPSE), is employed to protect the keywords. The proposed scheme divides the search operation into two steps. The first step finds the candidate list in terms of secure pruning codes; two methods are developed to construct these pruning codes. The second step uses a semi-honest third party to determine the best matching keyword according to a secure similarity function. The scheme aims to reveal as little information as possible to that third party, and the authors expect that such a system will improve the utilization of information retrieval systems and make them more user-friendly.

III. EXISTING SYSTEM

It is desirable to store data on data storage servers, such as mail servers and file servers, in encrypted form to reduce security and privacy risks. But this generally implies that one has to sacrifice functionality for security. For example, if a client wishes to retrieve only documents containing certain words, it was not previously known how to let the data storage server perform the search and answer the query without loss of data confidentiality. The existing system applies cryptographic schemes to the problem of searching on encrypted data and provides proofs of security for the resulting schemes. This technique has a number of essential advantages. It is provably secure: the scheme provides secrecy for encryption, such that the untrusted server cannot learn anything about the plaintext when given only the ciphertext. It provides query isolation for searches, which means that the untrusted server cannot learn anything more about the plaintext than the search result. It provides controlled searching, so that the untrusted server cannot search for an arbitrary word without the user's authorization. It also supports hidden queries, such that the user may ask the untrusted server to search for a secret word without revealing the word to the server. However, this scheme does not use an index, so the search operation has to scan the entire file.

Fig 1: Working of the existing system

IV. PROPOSED SYSTEM

A. Architecture

This is a multi-keyword fuzzy search solution based on semantic query expansion while supporting


similarity ranking. The fuzzy search improves system usability by returning the exactly matched files and the files containing terms semantically related to the query keyword. As shown in Fig 2, the data owner constructs a secure searchable index for the file set and then uploads the encrypted files, along with the secure index, to the cloud server. To search over the encrypted files, an authorized user generates the trapdoor, i.e., the "encrypted" version of the query keyword(s), and sends it to the cloud server. Once the trapdoor is received, the cloud server executes the search algorithm over the secure indexes and returns the matched files to the user as the search result. File indexes are created using locality-sensitive hashing (LSH) in a Bloom filter; this scheme finds documents with matching keywords efficiently. Finally, the matched files are returned in order of their total relevance score.

Fig 2: Search over encrypted data in cloud computing

The security model used here is the "honest-but-curious" model for the cloud server. It assumes that the cloud server honestly follows the designated protocols and procedures to fulfill its role as a service provider, while it may analyze the information stored and processed on the server in order to learn additional information about its customers.

The use of LSH functions in constructing the per-file Bloom-filter-based index is the key to enabling fuzzy search. The indexes and queries are represented as vectors instead of words. Each keyword is first transformed into its bigram vector representation and then inserted into the Bloom filter by LSH functions. The query is produced in the same way, by inserting the multiple keywords to be searched into a query Bloom filter. If a document contains the keyword(s) in the query, the corresponding bits in both vectors will be 1, so the inner product will yield a high value. This simple inner product is thus a good measure of the number of matching keywords.

B. Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure that is used to check whether an element is a member of a set. In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed.

Algorithm 1: Working of the Bloom Filter

1) Start with a bit array of m bits.
2) Initially set every bit to 0.
3) Let the set be $S = \{a_1, a_2, \ldots, a_n\}$.
4) Choose hash functions from $H = \{h_i \mid h_i : S \to [1, m],\ 1 \le i \le \ell\}$.
5) The set S uses $\ell$ independent hash functions.
6) To insert an element $a \in S$:
7) set the bit at each position $h_i(a)$ to 1.
8) To check whether an element q is present in S:
9) feed q to the $\ell$ hash functions to get $\ell$ array positions.
10) If the bit at any of these positions is 0
11) then
12) $q \notin S$
13) else
14) $q \in S$ (possibly, since false positives can occur)
15) End

C. Locality-Sensitive Hashing

Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The essential idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). The hashing used in LSH differs from conventional hash functions, such as those used in cryptography, because in LSH the goal is to maximize the probability of "collision" of similar items rather than to avoid collisions.

Algorithm 2: Working of Locality-Sensitive Hashing (LSH)

1) Let d be a distance metric.
2) A hash function family H is $(r_1, r_2, p_1, p_2)$-sensitive
3) if any two points s, t and any $h \in H$ satisfy:
   if $d(s, t) \le r_1$ then $\Pr[h(s) = h(t)] \ge p_1$;
   if $d(s, t) \ge r_2$ then $\Pr[h(s) = h(t)] \le p_2$.
4) For a p-stable LSH function $h_{a,b}(v)$:
5) $h_{a,b}(v) = \lfloor (a \cdot v + b) / w \rfloor$
6) Get the nearest word.
7) End

Here d(s, t) is the distance between the point s and the point t.
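Algorithm 2 can be illustrated with a small Python sketch of a p-stable LSH function $h_{a,b}(v) = \lfloor (a \cdot v + b)/w \rfloor$. The sketch assumes Gaussian (2-stable) projections, which correspond to the Euclidean distance; the function names, the bucket width w = 4.0, the vector dimension, and the seeds are illustrative choices and are not taken from the scheme. It estimates the collision probability empirically, so that a nearby pair of vectors can be compared with a distant pair.

```python
import math
import random


def make_lsh(dim, w, rng):
    """Draw one p-stable LSH function h_{a,b}(v) = floor((a.v + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian (2-stable) projection vector
    b = rng.uniform(0.0, w)                        # random real offset in [0, w]

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / w)

    return h


def collision_rate(s, t, dim, w, trials=2000, seed=0):
    """Estimate Pr[h(s) = h(t)] over randomly drawn hash functions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_lsh(dim, w, rng)
        hits += h(s) == h(t)
    return hits / trials


if __name__ == "__main__":
    dim, w = 8, 4.0
    s = [1.0] * dim
    near = [1.0] * (dim - 1) + [2.0]   # Euclidean distance 1 from s
    far = [9.0] * dim                  # much farther from s
    print("near pair collision rate:", collision_rate(s, near, dim, w))
    print("far pair collision rate: ", collision_rate(s, far, dim, w))
```

Under these assumptions the nearby pair should collide far more often than the distant pair, which is the $(r_1, r_2, p_1, p_2)$-sensitivity property stated in step 3.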


In the p-stable LSH function $h_{a,b}(v)$ of Algorithm 2, a and v are vectors, and b and w are real numbers.

D. Multi Keyword Fuzzy Search

This scheme constructs the index on a per-file basis: I_D is the index for file D. The index I_D is an m-bit Bloom filter that contains all the keywords in D. To support fuzzy and multi-keyword search, each keyword is converted into a bigram vector, and LSH functions are used instead of standard hash functions to insert the keywords into the Bloom filter I_D.

One crucial step in building the index is the keyword transformation. A keyword is first transformed into a bigram set, which contains all the contiguous two-letter sequences appearing in the keyword. A keyword can be misspelled in many different ways and still be represented by a vector that is very close to the correct one; this closeness (distance) is measured by the Euclidean distance, the well-known metric for distance between vector-type data items. LSH functions hash inputs whose similarity is within a certain threshold into the same output with high probability. The use of LSH functions in building the per-file Bloom-filter-based index is the key to enabling fuzzy search. The final secure index for each file is a Bloom filter that contains all the keywords in the file. The query is produced in the same way, by inserting the multiple keywords to be searched into a query Bloom filter. The search is then done by quantifying the relevance of the query to each file, through a simple inner product of the index vector and the query vector. If a document contains the keyword(s) in the query, the corresponding bits in both vectors will be 1, so the inner product will return a high value.

This scheme is based on symmetric cryptography and comprises the following six polynomial-time algorithms:

[1] KeyGen(m):
a) Take a security parameter m as input.
b) Output the secret key SK.

[2] Index Enc(SK; I):
a) Split the index I into vectors.
b) Encrypt the vectors using the secret key.
c) Output the secure index Enc_SK(I).

[3] Query Enc(SK; Q):
a) Split the query Q into vectors.
b) Encrypt the vectors using the secret key.
c) Output the trapdoor Enc_SK(Q).

[4] BuildIndex(D; SK; ℓ):
a) Choose ℓ independent LSH functions from the LSH family.
b) Construct an m-bit Bloom filter I_D as the index for each file D.
c) Extract the keyword set W_D from D.
d) Insert each keyword into the index.
e) Encrypt the index using Index Enc(SK; I_D) and output Enc_SK(I_D).

[5] Trapdoor(Q; SK):
a) Generate an m-bit Bloom filter for the query Q.
b) Insert every search keyword into the Bloom filter using the LSH functions.
c) Encrypt Q using Query Enc(SK; Q) and output Enc_SK(Q).

[6] Search(Enc_SK(Q); Enc_SK(I_D)):
a) Compute the inner product of Enc_SK(I_D) and Enc_SK(Q).
b) The number of matching bits indicates whether the query keywords exist in the document.
c) Return the output to the end user.

Fig 3: Example of the multi keyword fuzzy search

V. WORKING AND ANALYSIS

The service provider here provides the service for accessing, using, or participating in the Internet. The data owner constructs a secure searchable index for the file set and then uploads the encrypted files, together with the secure index, to the cloud server. The cloud server runs the search algorithm over the secure indexes and returns the matched files to the end user as the search result.

Fig 4: The cloud server side
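The index construction, trapdoor generation, and search steps of Section IV-D can be sketched in plaintext as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the encryption steps (Index Enc, Query Enc) are deliberately left out, keywords are encoded as binary vectors over the 26 x 26 possible letter bigrams, the LSH functions are Gaussian p-stable functions, each LSH bucket is folded into a Bloom filter position with SHA-256, and the parameters (m = 512, four LSH functions, w = 2.0) are untuned placeholders. All function names are hypothetical.

```python
import hashlib
import math
import random
import string

ALPHABET = string.ascii_lowercase
DIM = len(ALPHABET) ** 2          # one slot per possible letter bigram (assumed encoding)


def bigram_vector(word):
    """Binary vector marking which contiguous two-letter pairs occur in the word."""
    vec = [0] * DIM
    letters = [c for c in word.lower() if c in ALPHABET]
    for x, y in zip(letters, letters[1:]):
        vec[ALPHABET.index(x) * len(ALPHABET) + ALPHABET.index(y)] = 1
    return vec


def make_lsh_family(num_funcs, w, seed):
    """Draw num_funcs p-stable functions h(v) = floor((a.v + b) / w)."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(DIM)], rng.uniform(0.0, w), w)
            for _ in range(num_funcs)]


def positions(word, family, m):
    """Map a keyword to Bloom filter positions, one per LSH function."""
    v = bigram_vector(word)
    out = []
    for i, (a, b, w) in enumerate(family):
        bucket = math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
        digest = hashlib.sha256(f"{i}:{bucket}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)  # fold bucket id into [0, m)
    return out


def bloom(keywords, family, m):
    """Build an m-bit Bloom filter (as a 0/1 list) holding all given keywords."""
    bits = [0] * m
    for kw in keywords:
        for pos in positions(kw, family, m):
            bits[pos] = 1
    return bits


def score(index, query):
    """Relevance score: inner product of the index and query bit vectors."""
    return sum(i * q for i, q in zip(index, query))


if __name__ == "__main__":
    family = make_lsh_family(num_funcs=4, w=2.0, seed=7)
    m = 512
    files = {
        "doc1": ["apple", "network", "cloud"],
        "doc2": ["apple", "music"],
        "doc3": ["garden", "river"],
    }
    indexes = {name: bloom(kws, family, m) for name, kws in files.items()}
    query = bloom(["appple", "network"], family, m)   # "appple" is a typo for "apple"
    ranked = sorted(indexes, key=lambda n: score(indexes[n], query), reverse=True)
    for name in ranked:
        print(name, score(indexes[name], query))
```

In this sketch "appple" and "apple" have the same bigram set, so they map to identical Bloom filter positions and the misspelled query still matches. A file that contains every query keyword always attains the maximum possible score, which is the number of bits set in the query filter. Unrelated keywords may still share positions by chance, so a real deployment would need tuned LSH parameters.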

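Index Enc and Query Enc in Section IV-D are described only as splitting a vector and encrypting the parts with the secret key. One well-known construction with this shape is a matrix-based secure inner product, in which the key is a split indicator and two invertible matrices; whether the scheme uses exactly this construction is an assumption. The Python sketch below uses small integer matrices with exact integer inverses so that the arithmetic can be checked by hand; the matrix generation, the random ranges, and all function names are illustrative and are not secure parameter choices.

```python
import random


def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def random_invertible(n, rng, steps=40):
    """Return an integer matrix M and its exact inverse, built from row operations."""
    m, m_inv = identity(n), identity(n)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        k = rng.choice([-2, -1, 1, 2])
        for c in range(n):              # M <- E M: row i += k * row j
            m[i][c] += k * m[j][c]
        for r in range(n):              # M^-1 <- M^-1 E^-1: column j -= k * column i
            m_inv[r][j] -= k * m_inv[r][i]
    return m, m_inv


def keygen(n, rng):
    """Secret key: split indicator s and two invertible matrices with inverses."""
    s = [rng.randint(0, 1) for _ in range(n)]
    m1, m1_inv = random_invertible(n, rng)
    m2, m2_inv = random_invertible(n, rng)
    return {"s": s, "m1": m1, "m1_inv": m1_inv, "m2": m2, "m2_inv": m2_inv}


def split(vec, s, split_when, rng):
    """Split vec into two shares; entries where s == split_when get random shares."""
    p, q = [], []
    for x, bit in zip(vec, s):
        if bit == split_when:
            r = rng.randint(-50, 50)
            p.append(r)
            q.append(x - r)             # the two shares sum to x
        else:
            p.append(x)
            q.append(x)                 # both shares are copies of x
    return p, q


def index_enc(key, index, rng):
    p, q = split(index, key["s"], 1, rng)
    return mat_vec(transpose(key["m1"]), p), mat_vec(transpose(key["m2"]), q)


def query_enc(key, query, rng):
    p, q = split(query, key["s"], 0, rng)
    return mat_vec(key["m1_inv"], p), mat_vec(key["m2_inv"], q)


def search(enc_index, enc_query):
    """Inner product of the ciphertexts; equals the plaintext inner product."""
    return (sum(a * b for a, b in zip(enc_index[0], enc_query[0]))
            + sum(a * b for a, b in zip(enc_index[1], enc_query[1])))


if __name__ == "__main__":
    rng = random.Random(1)
    key = keygen(8, rng)
    index = [1, 0, 1, 1, 0, 0, 1, 0]    # toy Bloom filter for one file
    query = [1, 0, 0, 1, 0, 1, 0, 0]    # toy query Bloom filter
    print(search(index_enc(key, index, rng), query_enc(key, query, rng)))  # prints 2
```

The index is split where the indicator bit is 1 and the query where it is 0, so each coordinate contributes its plaintext product exactly once, and the two matrices cancel against their inverses in the inner product. The server can therefore rank files by score without seeing the plaintext Bloom filters.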

The data owner constructs the secure indexes using the secret key and then uploads the indexes along with the data files to the cloud server. The cloud server stores the encrypted data, and the data can be updated efficiently because the scheme does not need a pre-defined global dictionary for search and every document is indexed individually. Consequently, dataset updates, such as adding, deleting, and modifying files, can be carried out easily, involving only the indexes of the files to be modified, without affecting any other files.

Fig 5: Uploading file at the data owner side

At the cloud server side, the index is built using LSH and inserted into the Bloom filter. The server receives the trapdoor, which is the secured query, from the end user; the search process is carried out using the search algorithm, and the result is generated using the inner product.

Fig 6: Keyword search at the data user side

The ranked result received by the user is in encrypted form; it is decrypted using the same secret key, and the plaintext result is provided to the end user.

Fig 7: Keyword search result display

CONCLUSION

The approach of leveraging LSH functions to construct the file index is novel. This project provides an efficient solution to the secure fuzzy search of multiple keywords. Euclidean distance is adopted to capture the similarity between keywords. The secure inner product computation is used to calculate the similarity score so as to enable result ranking. The scheme does not require a pre-defined dictionary and can therefore support dataset updates efficiently. It is privacy-preserving and secure. The scheme tolerates misspelled keywords.

ACKNOWLEDGMENT

This project is supported by the B. N. M. Institute of Technology under the Visvesvaraya Technological University, Belgaum.

REFERENCES

[1] Ibrahim A, Jin H, Yassin AA, Zou D (2012) Approximate Keyword-based Search over Encrypted Cloud Data. In: IEEE Ninth International Conference on e-Business Engineering (ICEBE). IEEE, Hangzhou, China.
[2] Boneh D, Di Crescenzo G, Ostrovsky R, Persiano G (2004) Public key encryption with keyword search. In: Advances in Cryptology-Eurocrypt 2004. Springer, Berlin/Heidelberg.
[3] Curtmola R, Garay J, Kamara S, Ostrovsky R (2006) Searchable symmetric encryption: improved definitions and efficient constructions. In: Proceedings of the 13th ACM conference on Computer and communications security. ACM, Alexandria, VA, USA.
[4] Liu C, Zhu L, Li L, Tan Y (2011) Fuzzy keyword search on encrypted cloud storage data with small index. In: IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS). IEEE, Beijing, China.
[5] Li J, Wang Q, Wang C, Cao N, Ren K, Lou W (2010) Fuzzy keyword search over encrypted data in cloud computing. In: Proceedings of IEEE INFOCOM. IEEE, San Diego, CA, USA.
[6] Chang Y-C, Mitzenmacher M (2005) Privacy preserving keyword searches on remote encrypted data. In: Applied Cryptography and Network Security. Springer, Berlin/Heidelberg.
[7] Sun W, Wang B, Cao N, Li M, Lou W, Hou T, Li H (2013) Privacy preserving multi-keyword text search in the cloud supporting similarity based ranking. In: ASIACCS 2013.
[8] Cao N, Wang C, Li M, Ren K, Lou W (2011) Privacy-preserving multi-keyword ranked search over encrypted cloud data. In: INFOCOM 2011.
[9] Kuzu M, Islam MS, Kantarcioglu M (2012) Efficient similarity search over encrypted data. In: 28th International Conference on Data Engineering.
[10] Yang C, Zhang W, Xu J, Xu J, Yu N (2012) A Fast Privacy-Preserving Multi-keyword Search Scheme on Cloud Data. In: International Conference on Cloud and Service Computing (CSC). IEEE, Shanghai, China.


