Provable Ownership of Encrypted Files in De-duplication

Chao Yang†‡, Jianfeng Ma† and Jian Ren‡
†School of CS, Xidian University, Xi'an, Shaanxi, 710071. Email: {chaoyang, jfma}@mail.xidian.edu.cn
‡Department of ECE, Michigan State University, East Lansing, MI 48824. Email: {chaoyang, renjian}@msu.edu

Abstract—The rapid adoption of cloud storage services has created an issue that many duplicated copies of files are stored at the remote storage servers, which not only wastes communication bandwidth for duplicated file uploading, but also increases the cost of secure data management. To solve this problem, client-side deduplication was introduced to avoid the client uploading files that already exist at the remote servers. However, the existing scheme was recently found to be vulnerable to security attacks in that, by learning a small piece of information related to the file, such as the hash value of the file, the attacker may be able to get full access to the entire file; moreover, the confidentiality of the data may be vulnerable to "honest-but-curious" attacks. In this paper, to solve the problems mentioned above, we propose a cryptographically secure and efficient scheme to support cross-user client-side deduplication over encrypted files. Our scheme utilizes the technique of spot checking, in which the client only needs to access small portions of the original file, dynamic coefficients, randomly chosen indices of the original files, and a subtle approach to distribute the file encrypting key among clients to satisfy security requirements. Our extensive security analysis shows that the proposed scheme can generate provable ownership of the encrypted file (POEF) in the presence of the curious server, and maintain a high detection probability of client misbehavior. Both performance analysis and simulation results demonstrate that our proposed scheme is much more efficient than the existing schemes, especially in reducing the burden of the client.

Index Terms—Cloud storage, Deduplication, Encrypted File, Provable Ownership, Spot-checking

I. INTRODUCTION

With the rapid adoption of Cloud services, a large volume of data is stored at remote servers, so techniques to save disk space and network bandwidth are needed. A key concept in this context is deduplication, in which the server stores only a single copy of a file, regardless of how many clients want to store that file. All clients possessing that file only use the link to the single copy of the file stored at the server. Furthermore, if the server already has a copy of the file, then clients do not have to upload it again to the server, which saves bandwidth as well as storage capacity; this is called client-side deduplication [1] and is extensively employed by Dropbox [2] and Wuala [3]. It is reported that business applications can achieve deduplication ratios from 1:10 to as much as 1:500, resulting in disk and bandwidth savings of more than 90% [4].

However, client-side deduplication introduces some new security problems. Harnik et al. found that when a server tells a client that it does not have to upload the file, it means that some other clients have the same file, which could be sensitive information [5]. More seriously, Halevi et al. recently found some new attacks on the client-side deduplication system [6]. In these attacks, by learning just a small piece of information about the file, namely its hash value, an attacker is able to get the entire file from the server. These attacks are not just theoretical. Some similar attacks that were implemented against Dropbox were also discovered by Mulazzani et al. [7], recently. The root cause of all these attacks is that there is a very short piece of information that represents the file, and an attacker that learns it can get access to the entire file.

Furthermore, the confidentiality of users' sensitive data against the cloud storage server in client-side deduplication is another serious security problem. There are two kinds of straightforward methods: i) The user's sensitive data can be encrypted by the cloud storage server, who will choose and maintain the encrypting key. But it is reported that Dropbox, a famous cloud storage server, mistakenly kept all user data open to the public for almost 4 hours due to a new bug in its software [8]. It is also reported that a bug was discovered in Twitter's client software that allows an adversary to access users' private data [9]. ii) If users' data are encrypted on the client side and the encrypting key is kept away from the cloud storage server, then there will be no such failure of privacy protection of sensitive data, even if the cloud storage server makes such mistakes or is hacked.

However, the second kind of straightforward client-side encryption with a randomly chosen encrypting key will stop deduplication [5]. The reason is twofold: 1) The cloud storage server does not possess the original file in plaintext anymore, so it is hard for the server to authenticate whether a new client has the proof of ownership of the original file. 2) Encryptions of the same file by different users with different encrypting keys will result in different ciphertexts, which will prevent deduplication across multiple users from happening.

Recently there are only a few solutions to the new security problems mentioned above. Mulazzani et al. [7] discovered and implemented a new attack against Dropbox's deduplication system and proposed a preliminary and simple revision to the communication protocol of Dropbox; Halevi et al. [6] put forward the proof-of-ownership (POW) model, in which, based on Merkle Hash Trees and error-control encoding, a client can prove to a server that it indeed has a copy of a file without actually uploading it. However, neither of the two methods above is able to tackle the problem of the confidentiality of users' sensitive data against the cloud storage server. To overcome this deficiency, Jia et al. [10] recently proposed a solution to support cross-user client-side deduplication proof over encrypted data. Actually, they proposed a method to distribute a randomly chosen per-file encrypting key to all owners of the same file, and combined it with the POW proof method [6] to form a new scheme. However, this combination makes their scheme inherit the drawbacks of the POW proof method: the scheme cannot guarantee the freshness of the proof in every challenge and has to build a Merkle Hash Tree on the original file, which is inherently inefficient. Moreover, their scheme fails to provide enough security protection against key exposure, because it encrypts the file encrypting key only with a static and constant hash value of the original file in all key distribution processes. As a result, the applicability of these schemes in the scenario of client-side deduplication over encrypted data is greatly limited.

In this paper, to solve the problem in the scenario of client-side deduplication over encrypted data mentioned above, we propose a cryptographically secure and efficient solution where a client proves to the server that it indeed has the encrypted file, which is called a Provable Ownership of Encrypted File (POEF). We achieve the efficiency goal by relying on spot checking [20], in which the client only needs to access small portions of the original file to generate the proof of possessing the original file correctly, thus greatly reducing the burden of computation on the client and minimizing the I/O between the client and the server. At the same time, by utilizing dynamic coefficients and randomly chosen indices of the original files, our scheme mixes the sampled portions of the original file with the dynamic coefficients together to generate a unique proof in every challenge. Furthermore, our scheme proposes a subtle approach to encrypt the file encrypting key with dynamic data totally different in each challenge, and combines the key distribution with the proof process in an efficient piggyback way. These techniques help to meet the key security requirements and achieve the provable-security goal.

The rest of the paper is organized as follows. Section II introduces the system model, adversary model and notations. Then we provide the detailed description of our scheme in Section III. Section IV and Section V give the security analysis and performance evaluations respectively, followed by Section VI which overviews the related work. Finally, Section VII gives the concluding remarks of the whole paper.

II. PRELIMINARIES AND PROBLEM STATEMENT

A. Requirements

Some important requirements that constrain the solution in our setting are discussed as follows:

Bandwidth requirements. The proof protocol run between the server and client should be bandwidth efficient. Specifically, it must consume much less bandwidth than the size of the file; otherwise the client could just upload the file itself.

Computation requirements. The server typically has to handle a large number of files concurrently. So the solution should not impose too much burden on the server, even though it has more powerful computation capability. On the other hand, the client has limited storage as well as computation resources, and it is the leading actor in the deduplication scenario who has to prove to the server that it possesses exactly the same file already stored at the server. So, the design of the solution should pay more attention to reducing the burden on the client in terms of computation and storage resources and, at the same time, keep the burden on the server at a relatively low level.

Security requirements. Although the server has only the encrypted data without the file encrypting key, the solution must insist that the verification and proof be based on the availability of the original data in its original form, instead of any stored message authentication code (MAC) or previously used verification results. Moreover, the requested parts of the original file should be randomly chosen every time and the generated proof should be totally different in each challenge, so that it is infeasible for anybody to forge or prepare the proof in advance to satisfy the verification challenge. Furthermore, the file encrypting key should be encrypted with fresh and different keys in every process of key distribution between clients, minimizing the risk of key exposure.

B. System Model

A typical network architecture for cloud data storage is illustrated in Figure 1. There are two entities as follows:

Storage Server. It provides cloud storage service to all kinds of users. Its computation and storage capability (CPU, I/O, network, etc.) is stronger than that of each single user. The Storage Server will maintain the integrity of users' data, regardless of plaintext or ciphertext, and the availability of the cloud service.

Client Users. There are many Client Users who create user accounts and passwords at the Storage Server. Each of them can log into the server with their accounts, and then upload files to or retrieve files from the server. In particular, the client users intend to protect the confidentiality of their sensitive data against not only outside attackers but also the curious cloud storage server. So they will encrypt the files or data without leaking any information about the file encrypting key to the server, and upload the encrypted data onto the server.

[Fig. 1: Cloud Data Storage with Deduplication over Encrypted File or Data]

In the deduplication scenario, the server keeps a single copy of the encrypted original file, regardless of the number of clients that request to store the file. All client users that possess the original file only use the link to the single copy of the encrypted original file stored at the server. Specifically, a client user first sends the hash value of the original file to the server, who checks whether the hash value already exists in its database. If the hash value is the same as an existing hash value stored at the server, the server will challenge the client to ask for the proof of possession of the original file. Upon a successful challenge and proof, the client does not have to upload the file again to the server and may delete the original file locally. At the same time, the server marks the client as an owner of the original file and can help transmit the file encrypting key in ciphertext to it from other owners of the file. From then on, there is no difference between the client and the users who uploaded the encrypted original file. Thus, the deduplication process saves bandwidth as well as storage capacity.
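As a hedged illustration of this handshake (our sketch, not part of the paper's formal scheme), the following minimal Python fragment models the client side. The server object and its methods (has_file, request_challenge, verify_proof, upload) are hypothetical placeholders, and generate_proof stands for the POEF proof generation of Section III:

import hashlib

def client_store(data: bytes, server, generate_proof):
    """Client-side deduplication handshake (illustrative only; 'server' is any
    object exposing the hypothetical methods used below)."""
    file_hash = hashlib.sha256(data).hexdigest()
    if not server.has_file(file_hash):
        server.upload(file_hash, data)           # first copy: full upload
        return "uploaded"
    chal = server.request_challenge(file_hash)   # hash already known: prove ownership
    proof = generate_proof(data, chal)
    if server.verify_proof(file_hash, proof):
        return "deduplicated"                    # no upload; client marked as an owner
    raise PermissionError("ownership proof failed")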

C. Adversary Model

First of all, it is supposed that there is a pre-shared symmetric key between any client and the storage server. In other words, the communication between them is protected by the pre-shared key.

Then, we consider four kinds of security threats to storage systems that carry out client-side deduplication across multiple users. First, the deduplication system may use a standard hash function, e.g., SHA256. In that case, it is probable that an attacker could obtain the hash value of a file owned by other clients, who may have published the hash value as the signature of the file or used the hash value elsewhere. The weak adversary achieves this kind of attack by public means. Second, taking a strong adversary into consideration, an attacker may be able to compromise a server temporarily, getting access to its stored data, which includes the hash values of the files stored on it. Holding such a proxy for a file, the attacker can download the original file, which may include confidential data or information. The attacker may also be able to publish these stolen hash values and enable anyone else to obtain these private files. Third, it is plausible that similar attacks would happen on the client side. For example, an attacker could embed some malicious software into a client's machine. The malicious software can use a low-bandwidth covert channel to send out the hash values of interesting files, and enable others to get the private files from the cloud server. Last but not least, the storage server is considered honest but curious: it is curious to access users' sensitive data by all possible means; for example, the server may hold the encrypting key of users' data when these data are encrypted on the server side, or the server could steal the encrypting key when it is transmitted from one client to another.

D. Notation and Preliminaries

• F - the original file;
• KF - the encrypting key chosen by the first client who encrypted the original file and uploaded it to the server;
• CF - the encrypted original file F under the encrypting key KF;
• f - the number of blocks of the original file;
• (b1, ..., bf) - the collection of all blocks of the original file;
• (Cb1, ..., Cbf) - the collection of all encrypted blocks of the original file;
• α - a pseudo-random function (PRF), defined as α : {0,1}* × key → {0,1}^µ, with µ as the security parameter;
• β - a pseudo-random permutation (PRP), defined as β : {0,1}* × key → {0,1}*;
• hk - a keyed cryptographic hash function with key k applied to the input;
• c - the number of blocks requested in a single challenge;
• S1, S2 - random seeds used to form the challenge set in every challenge: S1 ←R {0,1}*, S2 ←R {0,1}*.

III. PROVABLE OWNERSHIP OF ENCRYPTED FILE

In this section, we present our proposed cryptographically secure remote ownership verification scheme and protocol for encrypted files in two different cases: the online case and the offline case.

Online case: In the online case, either the first client uploading the encrypted original file to the server, or one of the clients who has already been proved an owner of the original file, is online. This means one of these clients, denoted by OC, can be chosen to act as a verifier to initiate a challenge process to verify whether a new client, denoted by NC, indeed possesses the original file. As a reward, the cloud storage provider would give these OCs some discounts on using the storage space of the cloud servers, which is a feasible business model. Moreover, the more NCs become owners of the original file, the higher the probability that an OC will be available (online).

Offline case: In contrast to the online case, in the offline case there is no OC available (online) who can act as a verifier to initiate a proof process to verify whether a NC indeed has the original file. So, we postulate that there is a Trusted Third Party (TTP) who can act as a proxy for the OC to initiate a challenge process. Specifically, every time a NC finishes the ownership proof of a specific encrypted file successfully, it must immediately run a special protocol with the TTP to generate some proofs in advance about this file to form a "proof database" kept secretly at the TTP, each entry of which can be used to verify whether a new NC possesses the original file. Furthermore, it is supposed that the communications between NCs and the TTP are protected by a pre-shared symmetric key. The offline case is a common situation in practice, and also a feasible business model.

A. Definitions

Definition 1. Provable Ownership of Encrypted File (POEF). A POEF scheme is a collection of four polynomial-time algorithms (OnlineProofRecover, OfflineProofRecover, ProofGen, and ProofCheck) such that:

OnlineProofRecover(KF, K, (Cbi1, ..., Cbic), Chal) → (ReVd, Authk) is run by the OC to decrypt the encrypted file blocks according to Chal and generate the corresponding proof of the original file. It takes as input the file encrypting key KF, a temporary session key K = Ks shared among NC, OC and S, a collection of encrypted file blocks (Cbi1, ..., Cbic) and the challenge set Chal. It returns a recovered proof ReVd of ownership of the original file and an authentication value Authk of the file encrypting key.

OfflineProofRecover(Chal, F, K) → (ReVd, Authk) is run by a NC who has just successfully proved its ownership of the file and has not yet deleted the file. It takes as input the original file F, the challenge set Chal and the random session key Ks chosen by the TTP. It returns a recovered proof ReVd of ownership of the original file and an authentication value Authk of the file encrypting key.

ProofGen(Chal, F′, K, ReVd) → V is run by the new NC in order to generate a proof of ownership of the original file and derive the file encrypting key KF. It takes as input a temporary session key Ks, the collection of blocks of the file F′ which the NC possesses, a recovered proof of ownership of the original file ReVd and a challenge set Chal. It returns a proof V of ownership of the original file for the blocks determined by the challenge set Chal, and retrieves the file encrypting key KF.

ProofCheck(Authk, V) → {'True', 'False'} is run by the OC or the TTP in order to verify the proof of ownership of the original file sent from the new NC. It takes as input an authentication value Authk of the file encrypting key and the corresponding proof V. It returns 'True' or 'False' according to whether V is a correct proof of ownership of the original file for the blocks determined by Chal.

Online POEF protocol:
Online Setup: The server S possesses the encrypted original file CF, and divides it into blocks to store. When the NC wants to start a new process of proof, the OC randomly chooses a session key Ks and sends it to the S and NC.
Online Challenge: The OC generates a challenge set Chal, randomly retrieves the corresponding encrypted file blocks from the S and runs OnlineProofRecover(KF, Ks, (Cbi1, ..., Cbic), Chal) → (ReVd, Authk), and then sends the recovered proof ReVd to the NC through the S. After that, the NC runs ProofGen(Chal, F′, Ks, ReVd) → V and sends the proof V back to the OC with the help of the S. Finally, the OC checks the validity of the proof V by running ProofCheck(Authk, V) → {'True', 'False'} and informs the S of the result.

Offline POEF protocol:
Offline Setup: The server S possesses the encrypted original file CF, and divides it into blocks to store. When a NC successfully proves ownership of the file and has not yet deleted it, the TTP will form several challenge sets, for example three sets, and send them to the NC, who then runs OfflineProofRecover(Chal, F, Ks) → (ReVd, Authk) three times and sends the output proofs back to the TTP.
Offline Challenge: When a new NC claims ownership of a specific file, indexed by the hash value of the file, the TTP randomly selects a prepared proof from the "proof database" and sends the corresponding (Chal, ReVd, Ks) to the new NC, who then runs ProofGen(Chal, F′, Ks, ReVd) → V and sends the proof V back to the TTP through the S. Finally, the TTP checks the validity of the proof V by running ProofCheck(Authk, V) → {'True', 'False'} and informs the S of the result.

The steps above can be executed an unlimited number of times in order to confirm whether the NC possesses the original file and to transmit the file encrypting key KF to it.

We state the security of the POEF protocol using a security game that captures the ability of the adversary. Intuitively, the POEF Game captures the fact that an adversary cannot successfully construct a valid proof or retrieve the file encrypting key without possessing all the blocks corresponding to a given challenge, unless it guesses all the missing blocks.

POEF Game.
Setup. Let the file F be any file from the domain {0,1}* and let Ks be the temporary session key shared between the challenger and the adversary. The challenger randomly generates a file encrypting key KF, which is used to encrypt F into the ciphertext CF. Finally, the challenger sends CF and Ks to the adversary and keeps KF secret. In the online case the challenger has no original file, but in the offline case the challenger does have the original file F.
Query. The adversary makes proof queries adaptively: it selects two random numbers S1, S2 and the requested number of blocks c to form (c, S1, S2) = Chal, and then sends it to the challenger, together with the corresponding encrypted file blocks in the online case or without them in the offline case. The challenger runs OnlineProofRecover(KF, Ks, (Cbi1, ..., Cbic), Chal) → (ReVd, Authk) or OfflineProofRecover(Chal, F, Ks) → (ReVd, Authk) and sends the output proof back to the adversary. The adversary continues to query the challenger on the Chals of its choice. The adversary then stores all the proofs as a collection (pf1, pf2, ..., pfn), together with the corresponding Chals.
Challenge. The challenger generates a new challenge Chal, computes the corresponding proof (ReVd, Authk) and sends them to the adversary in order to get back a proof V of ownership of the original file from the adversary.
Attack. The adversary computes a proof of ownership V′ for the Chal and ReVd without possessing the original file and sends it back to the challenger. If the output of ProofCheck(Authk, V′) run by the challenger is 'True', the adversary has won the POEF Game (it means the adversary could cheat the ownership verification and retrieve the file encrypting key).

Definition 2 (POEF Security Definition). A POEF protocol built on a POEF scheme is a secure proof of ownership of the original encrypted file if, for any probabilistic polynomial-time (PPT) adversary A, the probability Pwin[A(V′, ProofCheck(Authk, V′)) = 'True'] that A wins the POEF Game on a specific challenge set Chal is negligibly close to the probability that A can successfully perform a strong collision attack on the hash function.
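One piece of the offline protocol above deserves a concrete picture: the TTP's "proof database" can be viewed as a per-file pool of prepared (Chal, ReVd, Authk, Ks) tuples. The sketch below is our own simplification (class and method names are hypothetical); entries are treated as one-shot and consumed on use, an assumption we make to honor the freshness requirement:

import secrets
from collections import defaultdict

class ProofDatabase:
    """Illustrative TTP-side 'proof database' for the offline POEF case."""
    def __init__(self):
        self._entries = defaultdict(list)   # file_hash -> prepared proof tuples

    def deposit(self, file_hash: str, chal, rev_d: bytes, auth_k: bytes, ks: bytes):
        """An NCC deposits a prepared proof for a file it just proved it owns."""
        self._entries[file_hash].append((chal, rev_d, auth_k, ks))

    def draw(self, file_hash: str):
        """Randomly pick (and consume) a prepared proof for a new ownership claim."""
        pool = self._entries[file_hash]
        if not pool:
            raise LookupError("no prepared proofs; an owner must replenish the pool")
        return pool.pop(secrets.randbelow(len(pool)))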

B. Efficient and Secure POEF Scheme

In this section, we elaborate on the POEF scheme presented in Algorithm 1. Both the Online and the Offline POEF protocols are constructed from the POEF scheme. Both of them include two phases: Setup and Challenge. The detailed protocols are described in Algorithm 2 (online) and Algorithm 3 (offline) respectively.

Algorithm 1 Provable Ownership of Encrypted File (POEF)

OnlineProofRecover(KF, K, (Cb1, ..., Cbc), Chal):
1: Let (c, S1, S2) = Chal, K = Ks and let (Cb1, ..., Cbc) = (Cbi1, ..., Cbic), where 1 ≤ c ≤ f;
2: Decrypt the encrypted file blocks with the file encrypting key: (bi1, bi2, ..., bic) = DKF(Cb1, ..., Cbc);
3: Compute a temporary key: k2 = hK(S2);
4: For 1 ≤ τ ≤ c, compute the dynamic coefficients: δτ = αk2(τ);
5: Compute ReVd = KF ⊕ hK[hK(bi1, δ1) ∥ hK(bi2, δ2) ∥ · · · ∥ hK(bic, δc)], where ∥ denotes concatenation;
6: Compute the authentication value of the file encrypting key: Authk = h(K ∥ KF);
7: Output ReV = (ReVd, Authk).

OfflineProofRecover(Chal, F, K):
1: Let F = (b1, b2, ..., bf), K = Ks and Chal = (c, r1, r11), where 1 ≤ c ≤ f;
2: Compute temporary keys: k1 = hK(r1), k2 = hK(r11);
3: For 1 ≤ τ ≤ c:
   • compute the indices of the blocks for which the proof is generated: iτ = βk1(τ);
   • compute the dynamic coefficients: δτ = αk2(τ);
4: Compute ReVd = KF ⊕ hK[hK(bi1, δ1) ∥ hK(bi2, δ2) ∥ · · · ∥ hK(bic, δc)], where ∥ denotes concatenation;
5: Compute the authentication value of the file encrypting key: Authk = h(K ∥ KF);
6: Output ReV = (ReVd, Authk).

ProofGen(Chal, F′, K, ReVd):
1: Let F′ = (b′1, b′2, ..., b′f), K = Ks and Chal = (c, S1, S2), where 1 ≤ c ≤ f;
2: Compute temporary keys: k1 = hK(S1), k2 = hK(S2);
3: For 1 ≤ τ ≤ c:
   • compute the indices of the blocks for which the proof is generated: iτ = βk1(τ);
   • compute the dynamic coefficients: δτ = αk2(τ);
4: Compute H′ = hK[hK(b′i1, δ1) ∥ hK(b′i2, δ2) ∥ · · · ∥ hK(b′ic, δc)], where ∥ denotes concatenation;
5: Compute K′F = H′ ⊕ ReVd and keep it secret;
6: Output V = h(K ∥ K′F).

ProofCheck(Authk, V):
1: If V = Authk, then output 'True';
2: Otherwise output 'False'.

Algorithm 2 Online POEF Protocol: A Protocol of Provable Ownership of Encrypted File

Setup:
1: The server S decomposes the encrypted file CF into f blocks, (Cb1, ..., Cbf). The f blocks of the file can be stored in f logically different locations.
2: The OC randomly chooses a temporary session key Ks ←R {0,1}* and sends it to the S who, as a proxy, forwards the session key Ks to the NC. They all keep Ks secret as the current session key.

Challenge:
1: The NC claims the ownership of a specific file stored at the S and requests to start a proof challenge;
2: The OC chooses two random numbers S1 ←R {0,1}* and S2 ←R {0,1}* as the random seeds;
3: Then, the OC prepares to get the proof of ownership for c distinct blocks of the specific file F, where 1 ≤ c ≤ f:
   • Computes a temporary key: k1 = hKs(S1);
   • For 1 ≤ τ ≤ c, computes the indices of the blocks for which the proof is generated: iτ = βk1(τ);
   • Retrieves the corresponding encrypted file blocks (Cbi1, ..., Cbic) from the server S according to the file indices iτ;
4: Next, the OC forms the challenge set (c, S1, S2) = Chal and runs OnlineProofRecover(KF, Ks, (Cbi1, ..., Cbic), Chal) → (ReVd, Authk);
5: Then, the OC sends the recovered proof ReVd and the challenge set (c, S1, S2) = Chal to the S, who then forwards them to the NC;
6: The NC runs ProofGen(Chal, F′, Ks, ReVd) → V and sends back to S the proof of ownership of the original file V; the S then forwards the proof V to the OC;
7: The OC checks the validity of the proof V by running ProofCheck(Authk, V) and then informs the S of the result.
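To make Algorithm 1 concrete, here is a minimal, self-contained Python sketch of the POEF primitives. It is a sketch under stated assumptions rather than the paper's reference implementation: h is instantiated with SHA256/HMAC-SHA256, the PRF α and PRP β are realized by keyed hashing (coeff and indices below), and the decryption of the challenged blocks at the OC is elided (proof_recover receives plaintext blocks):

import hmac, hashlib, secrets

def H(key: bytes, *parts: bytes) -> bytes:
    """Keyed hash h_K(...): HMAC-SHA256 over the concatenated parts."""
    return hmac.new(key, b"|".join(parts), hashlib.sha256).digest()

def h(data: bytes) -> bytes:
    """Unkeyed hash h(...)."""
    return hashlib.sha256(data).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def indices(k1: bytes, c: int, f: int) -> list:
    """Role of the PRP beta: select c distinct block indices out of 0..f-1
    by ranking all indices under a keyed hash (an illustrative realization)."""
    return sorted(range(f), key=lambda i: H(k1, i.to_bytes(8, "big")))[:c]

def coeff(k2: bytes, tau: int) -> bytes:
    """Role of the PRF alpha: dynamic coefficient delta_tau."""
    return H(k2, tau.to_bytes(8, "big"))

def blocks_digest(ks: bytes, blocks, k2: bytes) -> bytes:
    """h_K[h_K(b_i1, delta_1) || ... || h_K(b_ic, delta_c)]."""
    inner = [H(ks, b, coeff(k2, t + 1)) for t, b in enumerate(blocks)]
    return H(ks, *inner)

def proof_recover(kf: bytes, ks: bytes, blocks, chal):
    """OnlineProofRecover with the challenged blocks already decrypted:
    ReVd = KF xor digest; Authk = h(Ks || KF)."""
    c, s1, s2 = chal
    rev_d = xor(kf, blocks_digest(ks, blocks, H(ks, s2)))
    return rev_d, h(ks + kf)

def proof_gen(chal, file_blocks, ks: bytes, rev_d: bytes) -> bytes:
    """ProofGen at the new client: rebuild the digest from its own copy,
    recover K'F = H' xor ReVd, and output V = h(Ks || K'F)."""
    c, s1, s2 = chal
    k1, k2 = H(ks, s1), H(ks, s2)
    chosen = [file_blocks[i] for i in indices(k1, c, len(file_blocks))]
    kf_prime = xor(blocks_digest(ks, chosen, k2), rev_d)
    return h(ks + kf_prime)

def proof_check(auth_k: bytes, v: bytes) -> bool:
    """ProofCheck: V must equal Authk."""
    return hmac.compare_digest(auth_k, v)

# A toy end-to-end run (8 blocks of 16 bytes, challenge size c = 4):
f = 8
blocks = [secrets.token_bytes(16) for _ in range(f)]
kf, ks = secrets.token_bytes(32), secrets.token_bytes(32)
chal = (4, secrets.token_bytes(16), secrets.token_bytes(16))
idx = indices(H(ks, chal[1]), chal[0], f)              # verifier picks the indices
rev_d, auth_k = proof_recover(kf, ks, [blocks[i] for i in idx], chal)
v = proof_gen(chal, blocks, ks, rev_d)                 # prover answers from its copy
assert proof_check(auth_k, v)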

Algorithm 3 Offline POEF Protocol: A Protocol of Provable Ownership of Encrypted File

Setup:
1: The server S decomposes the encrypted file CF into f blocks, (Cb1, ..., Cbf). The f blocks of the file can be stored in f logically different locations;
2: After a NC successfully proves to S that it owns the original file F, and before deleting it, the NC, now called NCC, immediately informs the TTP;
3: The TTP prepares to obtain some proofs of ownership for c distinct blocks of the specific file F, where 1 ≤ c ≤ f;
4: The TTP chooses three groups of random numbers and forms three challenge sets: (c, r1, r11) = Chal1, (c, r2, r22) = Chal2 and (c, r3, r33) = Chal3; then the TTP chooses a random session key Ks ←R {0,1}* and sends all of them to the NCC;
5: The NCC runs OfflineProofRecover(Chal, F, Ks) → (ReVd, Authk) three times with Chal1, Chal2, Chal3 and Ks respectively, and sends the three outputs to the TTP, which gradually forms the "proof database".

Challenge:
1: A new NC claims the ownership of the specific file F stored at the S and requests to start a proof process;
2: After the S informs the TTP, the TTP randomly selects a prepared proof from the "proof database", and sends the corresponding (Chal, ReVd, Ks) to the S, who then forwards it to the new NC;
3: The NC runs ProofGen(Chal, F′, Ks, ReVd) → V and sends back to S the proof of ownership of the original file V; the S then forwards the proof V to the TTP;
4: The TTP checks the validity of the proof V by running ProofCheck(Authk, V) and then informs the S of the result.

IV. SECURITY ANALYSIS

A. Security Proof

One of the major design goals is to prevent the client from producing an authenticator for the remote server by simply accessing a fingerprint of the original file to pass the verification sent from the remote server. Our proposed scheme and protocols make it infeasible for a cheating remote client, without possessing a specific file, to convince the remote server that it has the ownership of the file. In fact, we have the following theorem.

Theorem 1. Suppose the hash function in the POEF scheme is a cryptographic hash function with the properties of collision resistance and pre-image resistance. Then the POEF scheme and protocols constructed in Algorithms 1, 2 and 3 are secure proofs of ownership of the original file as per Definition 2.

Proof: If there is an adversary A that can win the POEF Game with a non-negligible probability, in other words, if A can cheat the ownership verification and retrieve the file encrypting key with non-negligible probability, then we show how to use this adversary to perform a strong collision attack on the cryptographic hash function with a non-negligible probability.

In order to cheat the ownership verification and retrieve the file encrypting key with non-negligible probability without accessing the original data, A must be able to produce a quantity y such that h(y) = h(K ∥ KF). In this case, either y = K ∥ KF, or y ≠ K ∥ KF. When y ≠ K ∥ KF, y is a strong collision of h(K ∥ KF). When y = K ∥ KF, there are two possible ways to get the values:

I) Predict the file encrypting key KF and the session key K. However, KF and K are randomly chosen, so the probability of success in this case is 1/2^(|KF|+|K|), which is negligible.

II) Generate the correct proof material H′ to derive the correct file encrypting key KF. In this case, for simplification, we suppose A has the session key K. So, in order to successfully generate H′ without accessing the original data, A must be able to produce a quantity x such that h(x) = hK[hK(bi1, δ1) ∥ hK(bi2, δ2) ∥ · · · ∥ hK(bic, δc)] without using the original data components (bi1, bi2, ..., bic). In this case, either x = hK(bi1, δ1) ∥ hK(bi2, δ2) ∥ · · · ∥ hK(bic, δc), or x ≠ hK(bi1, δ1) ∥ hK(bi2, δ2) ∥ · · · ∥ hK(bic, δc). When x ≠ hK(bi1, δ1) ∥ · · · ∥ hK(bic, δc), x is a strong collision of hK(bi1, δ1) ∥ · · · ∥ hK(bic, δc). When x = hK(bi1, δ1) ∥ · · · ∥ hK(bic, δc) and x is derived without accessing the original data components (bi1, bi2, ..., bic), it means that the remote client can derive hK(bi1, δ1) ∥ hK(bi2, δ2) ∥ · · · ∥ hK(bic, δc) without accessing the original data components. There are two possible ways to get these quantities: i) predict these numbers without using (bi1, bi2, ..., bic). In this case hK(biτ, δτ) looks random and cannot be predicted; the prediction probability is 1/|hK(biτ, δτ)|, which is negligible. ii) compute hK(biτ, δτ) from a previously stored hK(biτ) and δτ. In this case, essentially, the remote client needs to find a b′iτ such that hK(b′iτ) = hK(biτ, δτ), where b′iτ ≠ biτ ∥ δτ. In other words, b′iτ is a strong collision of the cryptographic hash function hK(∗).

To sum up the arguments above, an adversary A able to win the POEF Game with non-negligible probability could perform a strong collision attack on the cryptographic hash function with non-negligible probability. However, this is in sharp conflict with the premise. This completes the proof.

B. Security Characteristics Analysis

As mentioned above, there are also some important security requirements that a new solution for encrypted data ownership proof should meet. Here we elaborate on these requirements, and analyze and compare our proposed POEF scheme with a typical scheme, Leakage-Resilient Client-side Deduplication of Encrypted Data (LR-DED) [10], in terms of the security requirements they meet.

First of all, when the challenger (OC or TTP in our POEF scheme, S in the LR-DED scheme) asks the client for the proof of ownership of the original file, the indices of the original file serving as the challenge content should be randomly generated, so that the client cannot predict the requested blocks and forge or prepare the proof in advance. We call this security requirement Random Index.

Secondly, when the client generates the proof of ownership, the original file blocks must be involved in calculating the ownership proof in every challenge sent from the challenger. In this way, the client cannot cheat the server by providing just some small pieces of information about the original file blocks, for example some hash values of file blocks, to pass the challenge test. We call this security requirement Calculated With Original File.

Thirdly, when new clients carry out the proof protocol, the generated proof in each challenge should be totally different from any other proof. In other words, in each challenge, a 'unique' and 'fresh' proof should be generated and checked to determine whether the client can pass the challenge test. This mechanism can be used to protect the proof schemes against the replay attack. We call this security requirement Dynamic Proof.

Last but not least, the file encrypting key should be encrypted with different fresh keys every time it is distributed to a new client who has proved ownership of the original file. In this way, the file encrypting key can be securely stored and transmitted against key exposure. We call this security requirement Secure Encrypting Key.

The comparison results for the security characteristics mentioned above between our POEF scheme and the LR-DED scheme are depicted in Table I. Because our POEF scheme compels clients to combine the original file blocks with the dynamic random coefficients to calculate the ownership proof, it satisfies the security requirements of "Calculated With Original File" and "Dynamic Proof". On the contrary, due to using the pre-generated static hash values of the Merkle Hash Tree's leaves, the LR-DED scheme cannot satisfy these two security requirements. Furthermore, our POEF scheme proposes a subtle approach to encrypt the file encrypting key with the dynamically and freshly generated ownership proof in each challenge, which not only satisfies the "Secure Encrypting Key" security requirement, but also greatly enhances the efficiency of the scheme. However, the LR-DED scheme fails to meet this security requirement because it encrypts the file encrypting key with a static and constant original file hash value, which will definitely increase the risk of key exposure. To sum up, our proposed POEF scheme is not only provably secure but also satisfies these important security requirements, with big advantages over the typical LR-DED scheme.

TABLE I: Security Characteristics and Comparison

Scheme | Random Index | Calculated With Original File | Dynamic Proof | Secure Encrypting Key
POEF   | √            | √                             | √             | √
LR-DED | √            | NONE                          | NONE          | NONE

C. Detection Probability Analysis

Next, we analyze the guarantee that our proposed POEF scheme offers. On one hand, we pay attention to the probability that a client succeeds in generating the proof of ownership of the original file to the challenger (OC or TTP in the online and offline cases respectively). Suppose that the client claims it stores an f-block file, out of which it may actually miss x blocks, and the challenger asks for c blocks in one challenge intending to detect the client's misbehavior. Let X be a discrete random variable representing the number of missed blocks that have been detected, let Px be the probability that at least one missed block has been detected, and let (1 − Px) be the probability that no missed block has been detected. So, we have:

Px = P{X ≥ 1} = 1 − C(f−x, c)/C(f, c) = 1 − ∏_{i=0}^{c−1} (f−x−i)/(f−i),

where C(n, k) denotes the binomial coefficient. It follows that:

(1 − x/(f−(c−1)))^c ≤ 1 − Px ≤ (1 − x/f)^c.

Since c − 1 ≪ f, the difference between the left-hand side and the right-hand side is very small. Therefore, we have

1 − Px ≈ (1 − x/f)^c, i.e., Px ≈ 1 − (1 − x/f)^c.

In this way, we can get the approximate minimum c required for each challenge to detect at least one missed block, which can be expressed as:

c = ⌈log_{(1−x/f)}(1 − Px)⌉. (1)

[Fig. 2: The detection probability Px (f = 1000)]
[Fig. 3: The detection probability Px (f = 30000)]

First, we fix the number of file blocks f and show Px as a function of c for four values of x (x = 1%, 5%, 10%, 15% of f), which is shown in Figures 2 and 3. In order to achieve a high detection probability, e.g., Px = 99%, the challenger has to request 315, 83, 42, 28 blocks in one challenge respectively for x = 1%, 5%, 10%, 15% of f, where f = 1000 blocks. When f = 30000 blocks (a typical DVD file contains 30G bytes of data and can be divided into 30000 blocks of 1M bytes each), the numbers of blocks which the challenger should request to achieve the same detection probability are 452, 90, 44, 29 blocks respectively in one challenge.

Fixing the detection probability Px, it can be seen that the increase of the number of missed blocks x makes the number of requested blocks c in one challenge decrease rapidly. At the same time, it also can be seen that when the number of missed blocks x is relatively small (e.g., x ≤ 1%), the increase of the number of the entire file blocks has a considerable impact on the number of blocks requested in one challenge. But when the number of missed blocks x is relatively large (e.g., x ≥ 15%), the increase of f has a marginal impact on c.
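The detection probability and Equation (1) are easy to check numerically; the following short sketch (ours, not the paper's) computes the exact Px and the approximate minimum challenge size, reproducing the c = ⌈89.781⌉ = 90 value worked out later in this subsection:

import math

def p_detect(f: int, x: int, c: int) -> float:
    """Exact detection probability: Px = 1 - C(f-x, c) / C(f, c)."""
    return 1.0 - math.comb(f - x, c) / math.comb(f, c)

def min_challenge_size(f: int, x: int, p_target: float) -> int:
    """Approximate minimum c from Eq. (1): c = ceil(log_{1-x/f}(1 - Px))."""
    return math.ceil(math.log(1.0 - p_target) / math.log(1.0 - x / f))

print(min_challenge_size(30000, 1500, 0.99))     # -> 90 blocks for x = 5% of f
print(f"{p_detect(30000, 1500, 90):.4f}")        # -> 0.9902, the exact Px at c = 90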

Next, we fix the number of missed blocks x (x = 5% of f) and show Px as a function of c for four different values of f (f = 1000, 3000, 5000, 30000). Then we fix x = 15% of f and draw the curve of Px as a function of c in the same figure, which is shown in Figure 4.

[Fig. 4: The detection probability Px (x = 5% and 15% of f)]

From Figure 4 we can see that, with the number of missed blocks x fixed, the increase of the number of file blocks has a minimal impact on the functional relationship between the detection probability Px and the number of blocks c verified in one challenge. But if we fix the number of entire file blocks f, it can be seen that the increase of x makes c decrease rapidly, which is the same as the results from Figures 2 and 3.

On the other hand, in a typical client-side deduplication scenario over encrypted files, if a client possesses the majority of the original file, e.g., 95% of the original file, we consider that this is enough to tell that the client has full ownership of the original file, although some bits of the original file are missed by the client. This is because, in many situations, it is not necessary to demand that the client have every bit of the original file exactly. For example, the client may want to claim ownership of a movie DVD with different subtitles, or of a slightly different version of a free software package. So, next we study the relationship between the detection probability and the number of challenges under two conditions.

1) The challenger challenges the client NC only once to ask for the proof of ownership of the original file.

Suppose the result of the challenge is that NONE of the missed blocks has been detected. In this case, we want to calculate how many blocks the challenger has to request in a single challenge to ensure that the client NC possesses at least 95% of the data of the original file (i.e., the number of blocks missed by the client is x ≤ 5% of f) with a high detection probability Px = 99%. The reason for choosing a relatively high detection probability (Px ≥ 99%) is that it guarantees more accurately that the client possesses at least 95% of the data of the original file. Because the number of blocks of the entire file has a minimal impact, we might as well fix f = 30000 blocks.

Let x = 5% of f and Px = 99%; substituting these concrete values into formula (1) above, it follows that c = ⌈89.781⌉, so c = 90 blocks. So, it can be concluded that:

The challenger challenges the client NC only once for the proof of ownership of the original file. Fixing f = 30000, if the number of blocks requested in the challenge is equal to or greater than 90, then it can be ensured that the client NC possesses at least 95% of the data of the original file with a high detection probability of Px = 99%. Actually, a more visual conclusion can be drawn from Figure 4, in which the blue curve represents the case of x ≤ 5% of f when f = 30000: all the c values represented by points (Px, c) in the area below the blue curve can ensure that the number of missed blocks is x ≤ 5% of f with different detection probabilities.

2) The challenger challenges the client NC a total of T times to ask for the proof of ownership of the original file.

In the T challenges, suppose that there are γ times (γ ≤ T) in which NONE of the missed blocks has been detected and T − γ times in which at least one missed block has been detected. The two kinds of possible results are called 'failure' and 'success' respectively. Then, we want to know what probability of γ failures in T challenges can ensure that the number of missed blocks is x ≤ 5% of f. In other words, what probability of γ out of T can give us enough confidence to tell that the client NC possesses the majority of the original file (95% of the data of the original file), i.e., that the client approximately possesses all of the original file.

For simplicity, consider the T challenges as T independent Bernoulli experiments. The outcome of each experiment is either 'failure', with probability (1 − Px), meaning that no missed block has been detected, or 'success', with probability Px, meaning that at least one missed block has been detected. Let Y be the discrete random variable defined as the number of 'failures' and let Py be the probability that there are γ challenges out of the total T in which NONE of the missed blocks has been detected. So, we have:

Py = P{Y = γ} = C(T, γ)(1 − Px)^γ(Px)^{T−γ} = [T!/(γ!(T−γ)!)] · [(1−λ)^c]^γ · [1 − (1−λ)^c]^{T−γ},

where we let λ = x/f.
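The binomial expression for Py can be evaluated directly; this short sketch (our illustration) reproduces Examples 1 and 2 below:

from math import comb

def p_gamma_failures(T: int, gamma: int, c: int, lam: float) -> float:
    """Py = C(T, gamma) * [(1-lam)^c]^gamma * [1 - (1-lam)^c]^(T-gamma)."""
    fail = (1.0 - lam) ** c        # probability that one challenge detects nothing
    return comb(T, gamma) * fail**gamma * (1.0 - fail) ** (T - gamma)

# f = 30000, x = 5% of f (lam = 0.05), T = 10, gamma = 2:
print(f"{p_gamma_failures(10, 2, 50, 1500 / 30000):.4f}")   # -> 0.1404 (Example 1)
print(f"{p_gamma_failures(10, 2, 30, 1500 / 30000):.4f}")   # -> 0.3000 (Example 2)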

Assuming T, γ, c are all constant, it can be seen that Py is a decreasing function of the independent variable λ.

Example 1. Let f = 30000, x = 5% of f, c = 50, T = 10, γ = 2. We have

Py ≈ [10!/(2! × 8!)] · [(1 − 1500/30000)^50]^2 · [1 − (1 − 1500/30000)^50]^8 = 0.1404 = 14.04%.

Example 2. Let f = 30000, x = 5% of f, c = 30, T = 10, γ = 2. We have

Py ≈ [10!/(2! × 8!)] · [(1 − 1500/30000)^30]^2 · [1 − (1 − 1500/30000)^30]^8 = 0.3000 = 30.00%.

So, from the analysis above we can conclude that:

The challenger challenges the client NC 10 times for the proof of ownership of the original file. If the probability of TWO failures in TEN challenges is equal to or greater than 14.04%, then this result can ensure that the number of missed blocks is x ≤ 5% of f; in other words, the client approximately has full ownership of the original file. From the formula above, different results can be drawn from different initial conditions.

V. PERFORMANCE EVALUATION

First of all, we carry out a theoretical analysis of our scheme's performance and compare it with the typical scheme, Leakage-Resilient Client-side Deduplication of Encrypted Data (LR-DED [10]). Then, a practical simulation is implemented to evaluate the performance of the two schemes.

A. Theoretical analysis

Phase 1 Setup: In our online POEF scheme, during the Setup phase the OC only needs to randomly generate a temporary session key and transmit it to the S and NC, the cost of which can be ignored.

In our offline POEF scheme, the TTP will form several groups of challenge sets and a temporary session key and send them to the NCC, the cost of which can be ignored. The NCC will carry out the offline proof recover algorithm once and send the output back to the TTP. The offline proof recover algorithm in our scheme includes c + 6 hash functions and one XOR computation. The computational complexity of a cryptographic hash function is O(r ∗ u) = O(log(r)log(u)), if the hash function maps {0,1}^log(r) → {0,1}^log(u), and the XOR's computational complexity is O(w) in the case of two w-digit numbers. So, the whole cost for the NCC will be (c + 6)O(log(r)log(u)) + O(w).

In the LR-DED scheme, the client who first uploads the file will carry out their custom-designed pairwise independent hash function with large output size. In this function, all f blocks of the original file are first hashed into f blocks of different block size; then, after the order of each block in the original file is reversed, the reversed output is hashed again in the same style. After that, the client runs a mixing process to XOR each output block of the non-reversed stage with that of the reversed stage, and outputs an l-block digest. Afterwards, the client builds a Merkle Hash Tree over the output l-block digest, which requires the client to compute l + (1 + 2 + · · · + l/2) = (l² + 10l)/8 hash functions to prepare the values of all nodes. At last, the client performs two hash functions to generate the hash values of the plaintext and the ciphertext of the original file, and one XOR operation to encrypt the file encrypting key with the hash value of the original file. So the whole cost of the client who first uploads the file is (2f + (l² + 10l)/8 + 2)O(log(r)log(u)) + [f + 1]O(w). At the same time, the server only needs to store the corresponding encrypted file and some hash values, the cost of which can be ignored.

Phase 2 Challenge: In our online POEF scheme, the OC will run hash functions twice to form a challenge set and the online proof recover algorithm once, and then send the output to the NC. The online proof recover algorithm in our scheme includes c + 4 hash functions and one XOR computation. So, the whole cost for the OC will be (c + 6)O(log(r)log(u)) + O(w). At the same time, the NC will carry out the proof generation algorithm once, which includes c + 6 hash functions and one XOR computation. So, the whole cost for the NC will be (c + 6)O(log(r)log(u)) + O(w).

In our offline POEF scheme, the TTP randomly selects a prepared proof and challenge set from the "proof database" to challenge the NC, the cost of which can be ignored. The NC will carry out the proof generation algorithm once, which includes c + 6 hash functions and one XOR computation. So, the whole cost for the NC will be (c + 6)O(log(r)log(u)) + O(w).

In the LR-DED scheme, the server only needs to choose c leaves of the Merkle Hash Tree as requested blocks to challenge the client and then check the correctness of the proof sent back from the client, so the computational complexity of the server is very low and can be ignored. On the other hand, the new client has to re-compute the custom-designed pairwise independent hash function with large output size on the original file and re-build a Merkle Hash Tree over the l-block digest output. After that, the client performs one XOR operation to extract the file encrypting key and one hash function to prove to the server that it has successfully restored the file encrypting key. So the whole cost of the new client is (2f + (l² + 10l)/8 + 1)O(log(r)log(u)) + [f + 1]O(w).

Table II summarizes the computational complexity of the schemes in the different phases. In client-side deduplication, it is the client that proves to the challenger that it indeed possesses the original file. Since the client usually has less computation capability and storage capacity, we mainly focus on the computational complexity of the client.

TABLE II: Theoretical Analysis and Comparison of Performance

Scheme | OC / NCC-TTP / First client / Server | New Client
POEF-online | Setup: Zero; Challenge: (c + 6)O(log(r)log(u)) + O(w) | Setup: Zero; Challenge: (c + 6)O(log(r)log(u)) + O(w)
POEF-offline | Setup: (c + 6)O(log(r)log(u)) + O(w); Challenge: Zero | Setup: Zero; Challenge: (c + 6)O(log(r)log(u)) + O(w)
LR-DED | Setup: (2f + (l² + 10l)/8 + 2)O(log(r)log(u)) + [f + 1]O(w); Challenge: Zero | Setup: Zero; Challenge: (2f + (l² + 10l)/8 + 1)O(log(r)log(u)) + [f + 1]O(w)

From the results in Table II, we can see that, during the Setup phase, the computational complexity of the new client in both our POEF scheme and the LR-DED scheme can be ignored. During the Challenge phase, the new client in our POEF scheme computes the hash function c + 6 times, but the new client in LR-DED has to compute the hash function (2f + (l² + 10l)/8 + 1) times. Although the sizes l and f of the file blocks in LR-DED are constant, the number of requested blocks c is much smaller than l and f. In summary, our POEF scheme has a great advantage over the LR-DED scheme in terms of performance.
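To put the entries of Table II in perspective, one can count hash evaluations for the new client in the Challenge phase under both cost models; the digest length l = 128 below is an arbitrary assumption chosen purely for illustration:

def poef_client_hashes(c: int) -> int:
    return c + 6                                  # POEF new client, Challenge phase

def lrded_client_hashes(f: int, l: int) -> int:
    return 2 * f + (l * l + 10 * l) // 8 + 1      # LR-DED new client, Challenge phase

# e.g., a 30000-block file with c = 90 requested blocks:
print(poef_client_hashes(90))                     # -> 96
print(lrded_client_hashes(30000, 128))            # -> 62209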

B. Simulation analysis

We implemented our proposed POEF scheme and the typical LR-DED scheme [10] to measure and compare their performance. We ran the protocols on random files of sizes 16KByte through 1GByte. The experiments were conducted on a 3.00 GHz Intel Core 2 Duo system with 64 KB cache, a 1988 MHz EPCI bus, and 2048 MB of RAM. The system runs Ubuntu 10.04, kernel version 2.6.34. We used C++ for the implementation. We also used the SHA256 implementation from Crypto++ version 0.9.8b [23]. The files are stored on an ext4 file system on a Seagate Barracuda 7200.7 (ST23250310AS) 250GB Ultra ATA/100 drive. All experimental results represent the mean of 10 trials. These results are depicted in Figures 5, 6 and 7, and the full set of numbers is given in Table III.

[Fig. 5: New Client Computation Time (POEF & LR-DED)]

New Client Computation Time: For our proposed POEF scheme, the new client computation time includes (i) the time to read the requested portions of the original file from the disk according to the randomly chosen indices given by the OC or TTP; and (ii) the time to execute Algorithm 2 presented above for the POEF scheme (because the new client's computational complexity in the online case is almost the same as that in the offline case, we only measure the computation time of the new client in Algorithm 2). For the LR-DED scheme, the new client computation time includes (i) the time to read the whole file from the disk; (ii) the time to perform the custom-designed pairwise independent hash function; and (iii) the time to compute the Merkle Hash Tree over the output digest [10]. For simplicity, the implementation of LR-DED in this paper only includes the time used to build the Merkle Hash Tree; nevertheless, our POEF scheme costs less time than LR-DED.

For POEF, the measurements show that as the size of the whole file increases, the time to read the requested portions of the original file from the disk increases accordingly and sometimes fluctuates considerably. At the same time, we assume the number of blocks missed by the client is at most 5%, while the detection rate is at least 99% in this simulation. It can be found that the number of blocks requested by the OC or the TTP is relatively small compared to the whole number of file blocks, and only changes slightly. In this way, the running time of Algorithm 2 in our POEF scheme is comparatively short and has little variation.

For LR-DED, the measurements show that both the disk reading time and the Merkle Hash Tree building time increase linearly with the size of the original file. The time increase is significant especially for relatively large files (e.g., ≥ 64MB). Moreover, building the Merkle Hash Tree takes much longer than running Algorithm 2 in our scheme. This is because the new client in the LR-DED scheme has to build a Merkle Hash Tree on all blocks of the whole file in order to answer the challenges from the server, which is very inefficient.

Server Computation Time: For our POEF scheme, the server computation time (including OC, TTP and S) is approximately the same as the new client computation time, which is comparatively short as a whole. For the LR-DED scheme, this is the time spent by the first client who uploads the file to build the Merkle Hash Tree, and by the server to check the Merkle Hash Tree authentication signatures. Although the overhead of proof verification is very low, building the Merkle Hash Tree has high computational complexity. In this way, the time spent at the server side (including the first client and the server) is relatively high.
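The measurement methodology (mean of 10 trials) can be sketched as follows; this is illustrative Python rather than the authors' C++/Crypto++ harness, and the routine under test (e.g., a proof generation call) is passed in as a placeholder:

import time, statistics

def mean_runtime(fn, trials: int = 10) -> float:
    """Mean wall-clock time of fn() over several trials, as in the experiments."""
    samples = []
    for _ in range(trials):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples)

# e.g.: mean_runtime(lambda: proof_gen(chal, blocks, ks, rev_d))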

TABLE III: Performance Measurements and Comparison (New Client Computation Time)

Size (MB) | LR-DED Disk Read (ms) | LR-DED Merkle Tree (ms) | LR-DED Total (ms) | POEF-online Disk Read (ms) | POEF-online Algorithm 2 (ms) | POEF-online Total (ms)
0.015625 | 0.10 | 0.76 | 0.86 | 0.13 | 0.28 | 0.41
0.03125 | 0.09 | 1.25 | 1.34 | 0.10 | 0.25 | 0.35
0.0625 | 0.14 | 2.58 | 2.72 | 0.10 | 0.25 | 0.35
0.125 | 0.18 | 5.02 | 5.20 | 0.10 | 0.25 | 0.35
0.25 | 0.25 | 10.01 | 10.26 | 0.10 | 0.26 | 0.36
0.5 | 0.50 | 20.02 | 20.52 | 0.11 | 0.24 | 0.35
1 | 1.04 | 40.03 | 41.07 | 0.11 | 0.25 | 0.36
2 | 2.08 | 80.27 | 82.35 | 0.27 | 0.17 | 0.44
4 | 4.41 | 141.72 | 146.13 | 0.28 | 0.16 | 0.44
8 | 8.88 | 307.59 | 316.47 | 2.10 | 0.22 | 2.32
16 | 18.45 | 599.99 | 618.44 | 3.35 | 0.24 | 3.59
32 | 38.39 | 1232.97 | 1271.36 | 2.13 | 0.22 | 2.35
64 | 76.06 | 2464.01 | 2540.07 | 2.72 | 0.23 | 2.95
128 | 154.44 | 5026.66 | 5181.10 | 3.36 | 0.22 | 3.58
256 | 302.99 | 9991.60 | 10294.59 | 2.47 | 0.22 | 2.69
512 | 498.35 | 19595.84 | 20094.19 | 2.77 | 0.24 | 3.01
1024 | 1146.77 | 37678.37 | 38825.14 | 1.71 | 0.21 | 1.92

Network Transmission Time: This is an estimate of the time required to transfer the protocol data. The overall size of the messages transmitted by the protocols is less than 1KByte in each challenge for both schemes. Thus, the network transmission time is negligible too.

From the measurements above, it can be seen that the simulation results are consistent with the theoretical analysis of the performance. Therefore, it can be concluded that our proposed POEF scheme is much more efficient than the LR-DED scheme, especially in the aspect of new client computation time.

Fig. 6: New Client Computation Time (LR-DED)
Fig. 7: New Client Computation Time (POEF)

VI. RELATED WORK

Juels et al. [11] presented a formal proof of retrievability (PoR) model to check remote data integrity. Their scheme uses disguised blocks, called sentinels, hidden among the original file blocks, together with error-correcting codes, to detect data corruption and to ensure both possession and retrievability of files at remote cloud servers. Based on this model, Shacham et al. [12] constructed a homomorphic authenticator based on random linear functions, which enables an unlimited number of checks and greatly decreases the communication overhead. Wang et al. [13] improved the PoR model [12] by introducing the Merkle Hash Tree construction for forming the authentication block tags, and addressed the problem of how to provide public verifiability and support for dynamic data operations in remote data integrity checking. Ateniese et al. [14], [15] defined the provable data possession (PDP) model for ensuring possession of files on untrusted storage. Their scheme utilizes public-key-based homomorphic tags for authenticating the data file and supports public verifiability. However, there is a big difference between their scenario and ours: in their scenario, the server proves to the client that it stores the file in its entirety, whereas in our scenario of client-side deduplication the roles are reversed, and it is the client that proves to the server that it indeed possesses the complete original file. This role reversal is significant, and the PoR and PDP schemes are not applicable in our scenario. The reason is that, in the PoR and PDP schemes, the input file must be pre-processed by the client to embed some secrets into it; this makes it possible for the server to prove possession of the file by replying to client queries with answers that are consistent with these secrets. In the client-side deduplication scenario, a new client has only the original file itself, so it is impossible to insert secrets into the file in advance for the purpose of proof. This excludes many similar solutions [11], [12], [14], [16], [21], [22].

Recently, several security problems in client-side deduplication systems have been identified. Harnik et al. found that deduplication systems can be used as a side channel to reveal sensitive information about the contents of files, and also as a covert channel by which malicious software can communicate with the outside world, regardless of the firewall settings of the compromised machine [5]. They sought to decrease the security risks of deduplication by setting a random threshold for every file and performing deduplication only if the number of copies of the file exceeds this threshold. However, their solution only reduces the security risks rather than eliminating them. Mulazzani et al. [7] discovered and implemented a new attack against Dropbox's deduplication system, in which an attacker can obtain the original file from the server after acquiring only a small piece of information about the file. They also proposed a preliminary and simple revision of Dropbox's communication protocol to address the problem, but their scheme has limited security features and was not thoroughly evaluated. At the same time, Halevi et al. found similar attacks on client-side deduplication systems in their work [6]. They put forward the proof-of-ownership (POW) model, in which a client can prove to a server, based on a Merkle Hash Tree and error-control encoding, that it indeed has a copy of a file without actually uploading it. However, their scheme cannot guarantee the freshness of the proof in every challenge. Furthermore, their scheme has to build the Merkle Hash Tree on the encoded data, which is inherently inefficient.

More recently, Jia et al. identified another serious security problem in the client-side deduplication scenario: how to retain the confidentiality of users' sensitive data against the cloud storage server during the process of deduplication [10]. This problem has received little attention in the current literature and, moreover, makes the schemes in [6] and [7] inapplicable to client-side deduplication over encrypted data. They therefore proposed a solution to support cross-user client-side deduplication proofs over encrypted data. Specifically, they proposed a method to distribute a randomly chosen per-file encrypting key to all owners of the same file, and combined it with the POW proof method [6] to form the new scheme. However, this combination makes their scheme inherit the drawbacks of the POW proof method: a high possibility of reusing the same proofs in different challenges, and the inherently inefficient building of Merkle Hash Trees on files. Moreover, their scheme fails to provide sufficient security protection against exposure of the file encrypting key, because it encrypts the file encrypting key only with a static and constant hash value of the original file in all key distribution processes.

Another well-studied, related problem is how to verify the integrity of memory contents in computing platforms and ensure that no memory corruption affects the performed computations [17]. Many of these schemes utilize the Merkle Hash Tree [18] for memory and data authentication.

VII. CONCLUSIONS

Client-side deduplication is a new technology that saves the bandwidth of uploading copies of existing files to the server in cloud storage services. However, this technology introduces new security problems that have not been well understood. In this paper, we propose a cryptographically secure and efficient scheme over encrypted data, called Provable Ownership of Encrypted Files (POEF), in which a client proves to the server that it indeed possesses the entire original file, rather than merely partial information about it. The scheme aims to solve two main problems: i) by learning just a small piece of information about a file, namely its hash value, an attacker is able to obtain the entire file from the server in the client-side deduplication scenario; ii) the "honest-but-curious" server could damage the confidentiality of users' sensitive data during the process of client-side deduplication.

We achieve the efficiency goal by relying on dynamic spot checking, in which the client needs to access only small portions of the original file to generate the proof of possession, thus greatly reducing the computational burden on the client and minimizing the I/O between the client and the server, while maintaining a high probability of detecting client misbehavior. At the same time, by utilizing dynamic coefficients and randomly chosen indices of the original file, our scheme mixes the randomly sampled portions of the original file with the dynamic coefficients to generate a unique proof in every challenge. Furthermore, our scheme proposes a subtle approach that encrypts the file encrypting key with the dynamically generated ownership proof in each challenge in order to securely distribute the file encrypting key among clients, and combines the key distribution with the proof process in an efficient piggyback manner. These techniques help to meet the key security requirements and make it infeasible for a cheating remote client, without possessing a specific file, to convince the remote server that it owns the file, or for the curious server to obtain private information about users' sensitive data. Finally, rigorous security proof and extensive performance simulation and analysis are conducted, and the results show that the proposed POEF scheme is not only provably secure but also highly efficient, especially in reducing the burden of the client.
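To make the two mechanisms summarized above concrete, the following minimal Python sketch shows their general shape: per-challenge random indices and coefficients are mixed with the sampled file blocks to produce a fresh proof, and the file encrypting key is then wrapped under a key derived from that proof, so key distribution piggybacks on the proof exchange. The primitives used here (SHA-1, HMAC, an XOR-based key wrap) and all names are illustrative assumptions, not the exact construction defined earlier in the paper.

    import hashlib
    import hmac
    import os
    import random

    def gen_challenge(num_blocks, c):
        # Server side: a fresh random seed and fresh block indices per
        # challenge, so no two challenges accept the same replayed proof.
        seed = os.urandom(16)
        indices = random.sample(range(num_blocks), c)
        return seed, indices

    def gen_proof(blocks, seed, indices):
        # Client side: mix every sampled block with a dynamic coefficient
        # derived from the challenge seed and the block index.
        mac = hmac.new(seed, digestmod=hashlib.sha1)
        for i in indices:
            coeff = hashlib.sha1(seed + i.to_bytes(8, "big")).digest()
            mac.update(coeff + blocks[i])
        return mac.digest()

    def wrap_key(file_key, proof):
        # Piggybacked key distribution: derive a key-encryption key from
        # the current proof and XOR it onto the file encrypting key.
        kek = hashlib.sha1(b"kek" + proof).digest()[:len(file_key)]
        return bytes(a ^ b for a, b in zip(file_key, kek))

    if __name__ == "__main__":
        blocks = [os.urandom(4096) for _ in range(1024)]  # toy 4 MB file
        file_key = os.urandom(16)                         # per-file key
        seed, idx = gen_challenge(len(blocks), 460)
        proof = gen_proof(blocks, seed, idx)              # fresh each time
        wrapped = wrap_key(file_key, proof)               # relayed by server
        assert wrap_key(wrapped, proof) == file_key       # owner unwraps

Because the seed and the indices are drawn fresh for each challenge, both the proof and the wrapped key differ every time; this per-challenge freshness is precisely what a key wrapped under a static hash of the file, as in [10], cannot provide.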

REFERENCES

[1] Wikipedia: Comparison of online backup services. http://en.wikipedia.org/wiki/Remote backup service.
[2] Dropbox Cloud Service: http://www.dropbox.com/.
[3] Wuala: http://www.wuala.com/.
[4] M. Dutch and L. Freeman. Understanding data de-duplication ratios. SNIA, February 2009. http://www.snia.org/.
[5] D. Harnik, B. Pinkas, and A. Shulman-Peleg. Side channels in cloud services: the case of deduplication in cloud storage. IEEE Security and Privacy Magazine, special issue on Cloud Security, 2010.
[6] S. Halevi, D. Harnik, B. Pinkas, and A. Shulman-Peleg. Proofs of ownership in remote storage systems. In ACM CCS '11: ACM Conference on Computer and Communications Security, pages 491-500, 2011.
[7] M. Mulazzani, S. Schrittwieser, M. Leithner, M. Huber, and E. Weippl. Dark clouds on the horizon: Using cloud storage as attack vector and online slack space. In USENIX Security, August 2011.
[8] Dropbox left user accounts unlocked for 4 hours Sunday. http://www.wired.com/threatlevel/2011/06/dropbox/.
[9] Twitter: TweetDeck. http://money.cnn.com/2012/03/30/technology/tweetdeck-bug-twitter/.
[10] Jia Xu, Ee-Chien Chang, and Jianying Zhou. Leakage-resilient client-side deduplication of encrypted data in cloud storage. Cryptology ePrint Archive, Report 2011/538, 2011. http://eprint.iacr.org/.
[11] A. Juels and B. S. Kaliski, Jr. PORs: proofs of retrievability for large files. In ACM CCS '07, pages 584-597. ACM, 2007.
[12] H. Shacham and B. Waters. Compact proofs of retrievability. In ASIACRYPT '08, pages 90-107. Springer-Verlag, 2008.
[13] Q. Wang, C. Wang, J. Li, K. Ren, and W. Lou. Enabling public verifiability and data dynamics for storage security in cloud computing. In ESORICS '09, pages 355-370. Springer-Verlag, 2009.
[14] G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song. Provable data possession at untrusted stores. In ACM CCS '07, pages 598-609. ACM, 2007.
[15] G. Ateniese, R. Burns, R. Curtmola, J. Herring, O. Khan, L. Kissner, Z. Peterson, and D. Song. Remote data checking using provable data possession. ACM Transactions on Information and System Security (TISSEC), 14(1):12, 2011.
[16] K. Bowers, A. Juels, and A. Oprea. HAIL: a high-availability and integrity layer for cloud storage. In ACM CCS '09, pages 187-198. ACM, 2009.
[17] R. Elbaz, D. Champagne, C. Gebotys, R. B. Lee, N. Potlapally, and L. Torres. Hardware mechanisms for memory authentication: A survey of existing techniques and engines. Transactions on Computational Science IV, LNCS, pages 1-22, March 2009.
[18] R. C. Merkle. A certified digital signature. In Proceedings on Advances in Cryptology, CRYPTO '89, pages 218-238, New York, NY, USA, 1989. Springer-Verlag.
[19] Lionel Biard and Dominique Noguet. Reed-Solomon codes for low power communications. Journal of Communications, 3(2):13-21, April 2008.
[20] Funda Ergün, Sampath Kannan, S. Ravi Kumar, Ronitt Rubinfeld, and Mahesh Viswanathan. Spot-checkers. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 259-268, Dallas, Texas, United States, May 1998.
[21] K. D. Bowers, A. Juels, and A. Oprea. Proofs of retrievability: Theory and implementation. Cryptology ePrint Archive, Report 2008/175, 2008. http://eprint.iacr.org/.
[22] R. Curtmola, O. Khan, R. Burns, and G. Ateniese. MR-PDP: Multiple-replica provable data possession. In Proc. of ICDCS '08, pages 411-420, 2008.
[23] W. Dai. Crypto++ Library 5.6.1, January 2011. http://www.cryptopp.com/.