
Scalable and Efficient Provable Data Possession

Giuseppe Ateniese1, Roberto Di Pietro2, Luigi V. Mancini3, and Gene Tsudik4

1 The Johns Hopkins University. E-mail: [email protected]
2 Università di Roma Tre. E-mail: [email protected]
3 Università di Roma "La Sapienza". E-mail: [email protected]
4 University of California, Irvine. E-mail: [email protected]

Abstract

Storage outsourcing is a rising trend which prompts a number of interesting security issues, many of which have been extensively investigated in the past. However, Provable Data Possession (PDP) is a topic that has only recently appeared in the research literature. The main issue is how to frequently, efficiently and securely verify that a storage server is faithfully storing its client's (potentially very large) outsourced data. The storage server is assumed to be untrusted in terms of both security and reliability. (In other words, it might maliciously or accidentally erase hosted data; it might also relegate it to slow or off-line storage.) The problem is exacerbated by the client being a small computing device with limited resources. Prior work has addressed this problem using either public key cryptography or requiring the client to outsource its data in encrypted form.

In this paper, we construct a highly efficient and provably secure PDP technique based entirely on symmetric key cryptography, while not requiring any bulk encryption. Also, in contrast with its predecessors, our PDP technique allows outsourcing of dynamic data, i.e., it efficiently supports operations such as block modification, deletion and append.

1. Introduction

In recent years, the concept of third-party data warehousing and, more generally, data outsourcing has become quite popular. Outsourcing of data essentially means that the data owner (client) moves its data to a third-party provider (server) which is supposed to – presumably for a fee – faithfully store the data and make it available to the owner (and perhaps others) on demand. Appealing features of outsourcing include reduced costs from savings in storage, maintenance and personnel, as well as increased availability and transparent up-keep of data.

A number of security-related research issues in data outsourcing have been studied in the past decade. Early work concentrated on data authentication and integrity, i.e., how to efficiently and securely ensure that the server returns correct and complete results in response to its clients' queries [1, 2]. Later research focused on outsourcing encrypted data (placing even less trust in the server) and the associated difficult problems, mainly having to do with efficient querying over an encrypted domain [3, 4, 5, 6].

More recently, however, the problem of Provable Data Possession (PDP) – also sometimes referred to as Proof of Data Retrievability (POR) – has appeared in the research literature. The central goal in PDP is to allow a client to efficiently, frequently and securely verify that a server – which purportedly stores the client's potentially very large amount of data – is not cheating the client. In this context, cheating means that the server might delete some of the data, or it might not store all data in fast storage, e.g., place it on CDs or other tertiary off-line media. It is important to note that a storage server might not be malicious; instead, it might simply be unreliable and lose or inadvertently corrupt hosted data. An effective PDP technique must be equally applicable to malicious and unreliable servers. The problem is further complicated by the fact that the client might be a small device (e.g., a PDA or a cell-phone) with limited CPU, battery power and communication facilities. Hence the need to minimize bandwidth and local computation overhead for the client in performing each verification.

Two recent results, PDP [7] and POR [8], have highlighted the importance of the problem and suggested two very different approaches. The first [7] is a public-key-based technique allowing any verifier (not just the client) to query the server and obtain an interactive proof of data possession. This property is called public verifiability. The interaction can be repeated any number of times, each time resulting in a fresh proof. The POR scheme [8] uses special blocks (called sentinels) hidden among other blocks in the data. During the verification phase, the client asks for randomly picked sentinels and checks whether they are intact. If the server modifies or deletes parts of the data, then sentinels would also be affected with a certain probability. However, sentinels must be indistinguishable from regular blocks; this implies that blocks must be encrypted. Thus, unlike the PDP scheme in [7], POR cannot be used for public databases, such as libraries, repositories, or archives. In other words, its use is limited to confidential data. In addition, the number of queries is limited and fixed a priori, because sentinels, and their positions within the database, must be revealed to the server at each query – a revealed sentinel cannot be reused.

Motivation: As pointed out in [9], data generation is currently outpacing storage availability; hence, there will be more and more need to outsource data.

A particularly interesting application for PDP is distributed data store systems. Consider, for example, the Freenet network [10], or the Cooperative Internet Backup Scheme [11]. Such systems rely and thrive on free storage. One reason to misbehave in such systems would be if storage space had real monetary cost attached to it. Moreover, in a distributed backup system, a user who grants usage of some of its own space to store other users' backup data is normally granted usage of a proportional amount of space somewhere else in the network, for his own backup. Hence, a user might try to obtain better redundancy for his own data. Furthermore, users will only place trust in such a system as long as the storage can be consistently relied upon, which is difficult in the event of massive cheating. A PDP scheme could act as a powerful deterrent to cheating, thus increasing trust in the system and helping spread its popularity and usage. Finally, note that the same considerations apply to alternative service models, such as peer-to-peer data archiving [12], where new forms of assurance of data integrity and data accessibility are needed. (Though simple replication offers one path to higher-assurance data archiving, it also involves unnecessary, and sometimes unsustainable, overhead.)

Another application of PDP schemes is in the context of censorship-resistance. There are cases where a duly authorized third party (censor) may need to modify a document in some controlled and limited fashion [13]. However, an important problem is to prevent unauthorized modifications. In this case, there are some well-known solutions, such as [14, 15] and [16]. But they either require data replication or on-demand computation of a function (e.g., a hash) over the entire outsourced data. Though the latter might seem doable, it does not scale to terabytes and petabytes of data. In contrast, a well-designed PDP scheme would be, at the same time, secure and scalable/efficient.

A potentially interesting new angle for motivating secure and efficient PDP techniques is the outsourcing of personal digital content. (For example, consider the rapidly increasing popularity of personal content outsourcing services, such as .MAC, GMAIL, PICASA and OFOTO.) Consider the following scenario: Alice wants to outsource her life-long collection of digital content (e.g., photos, videos, art) to a third party, giving read access to her friends and family. The combined content is precious to Alice and, whether or not the outsourcing service is free, Alice wants to make sure that her data is faithfully stored and readily available. To verify data possession, Alice could use a resource-constrained personal device, e.g., a PDA or a cell-phone. In this realistic setting, our two design requirements – (1) outsourcing data in cleartext and (2) bandwidth and computation efficiency – are very important.

Contributions: This paper's contribution is two-fold:

1. Efficiency and Security: the proposed PDP scheme, as [8], relies only on efficient symmetric-key operations in both the setup (performed once) and verification phases. However, our scheme is more efficient than POR, as it requires no bulk encryption of outsourced data and no data expansion due to additional sentinel blocks – see Section 5 for details. Our scheme provides probabilistic assurance that untampered data is being stored on the server, with the exact probability fixed at setup, as in [7, 8]. Furthermore, our scheme is provably secure in the random oracle model (ROM).

2. Dynamic Data Support: unlike both prior results [7, 8], the new scheme supports secure and efficient dynamic operations on outsourced data blocks, including modification, deletion and append (these operations involve slightly higher computational and bandwidth costs with respect to our basic scheme). Supporting such operations is an important step toward practicality, since many envisaged applications are not limited to data warehousing, i.e., dynamic operations need to be provided.

Roadmap: The next section introduces our approach to provable data possession and discusses its effectiveness and security; Section 3 extends and enhances it to support dynamic outsourced data (i.e., operations such as block update, append and delete). Then, Section 4 discusses some features of our proposal, followed by Section 5, which overviews related work. Finally, Section 6 includes a discussion of our results and outlines avenues for future work.

2. Proposed PDP Scheme

In this section we describe the proposed scheme. It consists of two phases: setup and verification (also called challenge in the literature).

2.1. Notation

• D – outsourced data. We assume that D can be represented as a single contiguous file of d equal-sized blocks: D[1],...,D[d]. The actual bit-length of a block is not germane to the scheme.

• OWN – the owner of the data.

• SRV – the server, i.e., the entity that stores outsourced data on the owner's behalf.

• H(·) – cryptographic hash function. In practice, we use standard hash functions, such as SHA-1, SHA-2, etc.

• AE_key(·) – an authenticated encryption scheme that provides both privacy and authenticity. In practice, privacy and authenticity can be achieved by encrypting the message first and then computing a Message Authentication Code (MAC) on the result. However, a less expensive alternative is to use a mode of operation for the cipher that provides authenticity in addition to privacy in a single pass, such as OCB, XCBC, IAPM [17].

• AE_key^{-1}(·) – decryption operation for the scheme introduced above.

• f_key(·) – pseudo-random function (PRF) indexed on some (usually secret) key. In practice, a "good" block cipher acts as a PRF; in particular, we could use AES, where keys, input blocks, and outputs are all of 128 bits (AES is a good pseudo-random permutation). Other even more practical constructions of PRFs deployed in standards use "good" MAC functions, such as HMAC [18].

• g_key(·) – pseudo-random permutation (PRP) indexed under key. AES is assumed to be a good PRP. To restrict the PRP output to sequence numbers in a certain range, one could use the techniques proposed by Black and Rogaway [19].

2.2. General Idea

Our scheme is based entirely on symmetric-key cryptography. The main idea is that, before outsourcing, OWN pre-computes a certain number of short possession verification tokens, each token covering some set of data blocks. The actual data is then handed over to SRV. Subsequently, when OWN wants to obtain a proof of data possession, it challenges SRV with a set of random-looking block indices. In turn, SRV must compute a short integrity check over the specified blocks (corresponding to the indices) and return it to OWN. For the proof to hold, the returned integrity check must match the corresponding value pre-computed by OWN. However, in our scheme OWN has the choice of either keeping the pre-computed tokens locally or outsourcing them – in encrypted form – to SRV. Notably, in the latter case, OWN's storage overhead is constant regardless of the size of the outsourced data. Our scheme is also very efficient in terms of computation and bandwidth.

2.3. Setup Phase

We start with a database D divided into d blocks. We want to be able to challenge the storage server t times. We use a pseudo-random function f and a pseudo-random permutation g, defined as:

f : {0,1}^c × {0,1}^k → {0,1}^L

g : {0,1}^l × {0,1}^L → {0,1}^l

In our case l = log d, since we use g to permute indices. The output of f is used to generate the key for g, and c = log t. We note that both f and g can be built out of standard block ciphers, such as AES; in this case L = 128. We use the PRF f with two master secret keys W and Z, both of k bits. The key W is used to generate session permutation keys, while Z is used to generate challenge nonces.

During the Setup phase, the owner OWN generates in advance t possible random challenges and the corresponding answers. These answers are called tokens. To produce the i-th token, the owner generates a set of r indices as follows:

1. First, generate a permutation key ki = f_W(i) and a challenge nonce ci = f_Z(i).

2. Then, compute the set of indices {I_j ∈ [1,...,d] | 1 ≤ j ≤ r}, where I_j = g_ki(j).

3. Finally, compute the token: vi = H(ci, D[I_1],..., D[I_r]).

Basically, each token vi is the answer we expect to receive from the storage server whenever we challenge it on the randomly selected data blocks D[I_1],..., D[I_r]. The challenge nonce ci is needed to prevent potential pre-computations performed by the storage server. Notice that each token is the output of a cryptographic hash function, so its size is small.

Once all tokens are computed, the owner outsources the entire set to the server, along with the file D, by encrypting each token with an authenticated encryption function. (Note that these values could instead be stored by the owner itself if the number of queries is not too large; this would require the owner to store just a short hash value per query.) The setup phase is shown in detail in Algorithm 1.

Algorithm 1: Setup phase
begin
    Choose parameters c, l, k, L and functions f, g;
    Choose the number t of tokens;
    Choose the number r of indices per verification;
    Generate random master keys W, Z, K ∈ {0,1}^k;
    for (i ← 1 to t) do
        begin Round i
            1: Generate ki = f_W(i) and ci = f_Z(i)
            2: Compute vi = H(ci, D[g_ki(1)],..., D[g_ki(r)])
            3: Compute v'i = AE_K(i, vi)
        end
    Send to SRV: (D, {[i, v'i] for 1 ≤ i ≤ t})
end

2.4. Verification Phase

To perform the i-th proof of possession verification, OWN begins by re-generating the i-th token key ki as in step 1 of Algorithm 1. Note that OWN only needs to store the master keys W, Z, and K, plus the current token index i. It also re-computes ci as above. Then, OWN sends SRV both ki and ci (step 2 in Algorithm 2). Having received the message from OWN, SRV computes:

z = H(ci, D[g_ki(1)],..., D[g_ki(r)])

SRV then retrieves v'i and returns [z, v'i] to OWN who, in turn, computes v = AE_K^{-1}(v'i) and checks whether v = (i, z). If the check succeeds, OWN assumes that SRV is storing all of D with a certain probability.

Algorithm 2: Verification phase
begin Challenge i
    1: OWN computes ki = f_W(i) and ci = f_Z(i)
    2: OWN sends {ki, ci} to SRV
    3: SRV computes z = H(ci, D[g_ki(1)],..., D[g_ki(r)])
    4: SRV sends {z, v'i} to OWN
    5: OWN extracts v from v'i. If decryption fails or v ≠ (i, z), then REJECT.
end

2.5. Detection Probability

We now consider the probability P_esc that a run of our verification protocol completes successfully even though SRV has either deleted, or tampered with, m data blocks. Note that SRV avoids detection only if none among the r data blocks involved in the i-th token verification process are deleted or modified. Thus, we have:

P_esc = (1 − m/d)^r        (1)

For example, if the fraction of corrupted blocks (m/d) is 1% and r = 512, then the probability of avoiding detection is below 0.6%. This is in line with the detection probability expressed in [7].

We stress that, as in [7], we are considering economically-motivated adversaries that are willing to modify or delete a percentage of the file. In this context, deleting or modifying a few bits won't provide any financial benefit. If detection of any modification/deletion of small parts of the file is important, then erasure codes could be used: the file is first encoded, and then the PDP protocol is run on the encoded file. Erasure codes in the context of checking storage integrity were used, for example, in OceanStore [20], PASIS [21], and by Schwarz and Miller [22] (see also references therein). Erasure codes can help detect small changes to the file and possibly recover the entire file from a subset of its blocks (as in [8]). However, erasure codes (even the near-optimal ones) are very costly and impractical for files with a very large number of blocks (the type of files we are considering here). In addition, encoding will (1) significantly expand the original file and (2) render impractical any operations on dynamic files, such as updating, deleting, and appending file blocks.

Notice that our discussion above concerns erasure codes applied to the file by the client and not by the server. Obviously, the server may (and should) use erasure codes independently to prevent data corruption resulting from unreliable storage. Encoding the file via appropriate redundancy codes is a topic orthogonal to PDP and we do not consider it further.

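The numeric example accompanying Equation (1) can be checked directly; this toy calculation assumes nothing beyond the formula itself.

```python
# Equation (1): P_esc = (1 - m/d)^r, the probability that a challenge of r
# blocks misses all m corrupted blocks.
def p_esc(m_over_d, r):
    return (1 - m_over_d) ** r

# The example from the text: 1% of blocks corrupted, r = 512 indices per token.
print(p_esc(0.01, 512))  # ~0.0058, i.e., below 0.6%
```

Doubling r to 1024 drives the escape probability below 0.004%, which illustrates the tradeoff between per-challenge cost and assurance.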
We point out that there is almost no cost for OWN to perform a verification. It only needs to re-generate the appropriate [ki, ci] pair (two PRF invocations) and perform one decryption in order to check the reply from SRV. Furthermore, the bandwidth consumed by the verification phase is constant (in both steps 2 and 4 of Algorithm 2). This represents truly minimal overhead. The computation cost for SRV, though slightly higher (r PRP invocations on short inputs, and one hash), is still very reasonable.

2.6. Security Analysis

We follow the security definitions in [7, 8], and in particular the one in [7], which is presented in the form of a security game. In practice, we need to prove that our protocol is a type of proof of knowledge of the queried blocks, i.e., if the adversary passes the verification phase, then we can extract the queried blocks. The most generic (extraction-based) definition that does not depend on the specifics of the protocol is introduced in [8]. Another formal definition, for the related concept of sub-linear authentication, is provided in [23].

We have a challenger that plays a security game (experiment) with the adversary A. There is a Setup phase where A is allowed to select data blocks D[i] for 1 ≤ i ≤ d. In the Verification phase, the challenger selects a nonce and r random indices and sends them to A. Then the adversary generates a proof of possession P for the blocks queried by the challenger. If P passes the verification, then the adversary has won the game.

As in [7], we say that our scheme is a Provable Data Possession scheme if, for any probabilistic polynomial-time adversary A, the probability that A wins the PDP security game on a set of blocks is negligibly close to the probability that the challenger can extract those blocks. In this type of proof of knowledge, a "knowledge extractor" can extract the blocks from A via several queries, even by rewinding the adversary, which does not keep state.

At first, it may seem that we do not need to model the hash function as a random oracle, and that we might just need its collision-resistance property (assuming we are using a "good" standard hash function and not some contrived hash scheme). However, the security definition for PDP is an extraction-type definition, since we are required to extract the blocks that were challenged. So we need to use the random oracle model, but not really in a fundamental way: we just need the ability to see the inputs to the random oracle, and we are not using its programmability feature (i.e., we are not "cooking up" values as outputs of the random oracle). We need to see the inputs so that we can extract the queried data blocks. Since we are using just a random oracle, our proof does not rely on any cryptographic assumption but is information-theoretic.

We show the following:

Theorem 1. Our scheme is a secure Provable Data Possession scheme in the random oracle model.

Proof sketch. We are assuming that AE_K(·) is an ideal authenticated encryption. This implies that, given AE_K(X), the adversary cannot see or alter X; thus we can assume that X is stored directly by the challenger (i.e., there is no need for the adversary to send X to the challenger, and we can remove AE_K(X) from the picture).

More formally, our game is equivalent to the following:

• A simulator S sets up a PDP system and chooses its security parameters.

• The adversary A selects values x1, ..., xn and sends them to the simulator S.

• The adversary can query the random oracle at any point in time. For each input to the random oracle, the simulator replies with a random value and stores the input and the corresponding output in a table (to be consistent with its replies).

• At the challenge phase, the simulator challenges A on the i-th value, xi, and sends a random value ci to A.

• A replies with a string P.

Note that, in our original game, the value xi corresponds to the ordered sequence of r concatenated blocks which are queried during the i-th challenge. In addition, the simulator can query any xi only once. Clearly, if the adversary wins the game (P = H(ci, xi)) with non-negligible probability, then xi is extracted with non-negligible probability. This is because the best the adversary can do to impede the extraction of xi is to guess the correct output of the random oracle or to find a collision (a random oracle is trivially collision-resistant). Both these events happen with negligible probability.

3. Supporting Dynamic Outsourced Data

Thus far, we assumed that D represents static or warehoused data. Although this is the case for many envisaged settings, such as libraries, there are many potential scenarios where outsourced data is dynamic. This leads us to consider various data block operations (e.g., update, delete, append and insert) and the implications for our scheme which stem from supporting each operation.

One obvious and trivial solution to all dynamic operations is (for each operation) for OWN to download from SRV the entire outsourced data D and to re-run the setup phase. This would clearly be secure, but highly inefficient. We now show how to amend the basic scheme to efficiently handle dynamic operations on outsourced data.

3.1. Block Update

We assume that OWN needs to modify the n-th data block which is currently stored on SRV, from its current value D[n] to a new version, denoted D'[n]. In order to amend the remaining verification tokens, OWN needs to factor out every occurrence of D[n] and replace it with D'[n]. However, one subtle aspect is that OWN cannot disclose to SRV which (if any) verification tokens include the n-th block. The reason is simple: if SRV knows which tokens include the n-th block, it can simply discard D[n] if and when it knows that no remaining tokens include it. Another way to see the same problem is to consider the probability of a given block being included in any verification token: since there are r blocks in each token and t tokens, the probability of a given block being included in any token is (r·t)/d. Thus, if this probability is small, a single block update is unlikely to involve any verification token update. If SRV is allowed to see when a block update causes no changes in verification tokens, it can safely modify/delete the new block, since it knows (with absolute certainty) that it will not be demanded in future verifications.

Based on the above discussion, we require OWN to modify all remaining verification tokens. We also need to amend the token structure as follows (refer to line 2 in Alg. 1 and line 3 in Alg. 2):

from:
    vi = H(ci, D[g_ki(1)],..., D[g_ki(r)])
    v'i = AE_K(i, vi)

to:
    vi = H(ci, 1, D[g_ki(1)]) ⊕ ... ⊕ H(ci, r, D[g_ki(r)])
    v'i = AE_K(ctr, i, vi)
    % where ctr is an integer counter

We use ctr to keep track of the latest version of the token vi (in particular, to avoid replay attacks) and to change the encryption of unmodified tokens. Alternatively, we could avoid storing ctr by just replacing the key K of the authenticated encryption with a fresh key.

The modifications described above are needed to allow the owner to efficiently perform the token update. The basic scheme does not allow it since, in it, a token is computed by serially hashing the r selected blocks; hence, it is difficult to factor out any given block and replace it by a new version. (Of course, it can be done trivially by asking SRV to supply all blocks covered by a given token; but this would leak information and make the scheme insecure.)

One subtle but important change in the computation of vi above is that now each block is hashed along with an index. This was not necessary in the original scheme, since blocks were ordered before being hashed together. Now, since blocks are hashed individually, we need to take into account the possibility of two identical blocks (with the same content but different indices) being selected as part of the same verification token. If we computed each block hash as H(ci, D[g_ki(j)]) instead of H(ci, j, D[g_ki(j)]), then two blocks with identical content would produce the same hash and would cancel each other out due to the exclusive-or operation. Including a unique index in each block hash addresses this issue. We note that our XOR construction is basically the same as the XOR MAC defined by Bellare et al. [24]. In the random oracle model, H(ci, ...) is a PRF under the key ci. However, we do not need a randomized block as in XMACR [24] or a counter as in XMACC [24], since we use the key ci only once (it changes at each challenge).

The update operation is illustrated in Algorithm 3. As shown at line 7, OWN modifies all tokens regardless of whether they contain the n-th block. Hence, SRV cannot discern tokens covering the n-th block. However, the crucial part of the algorithm is line 6, where OWN simultaneously replaces the hash of the block's old version, H(ci, j, D[n]), with the new one, H(ci, j, D'[n]). This is possible due to the basic properties of the ⊕ operator.

Algorithm 3: Update
% Assume that block D[n] is being modified to D'[n],
% and assume that no tokens have been used up thus far.
begin
    1: Send to SRV: UPDATE REQUEST
    2: Receive from SRV: {[i, v'i] | 1 ≤ i ≤ t}
    3: ctr = ctr + 1
    for (i ← 1 to t) do
        4: z = AE_K^{-1}(v'i)
           % if decryption fails, exit
           % if z is not prefixed by (ctr − 1), i, exit
        5: extract vi from z
           ki = f_W(i)
           ci = f_Z(i)
           for (j ← 1 to r) do
               if (g_ki(j) == n) then
        6:         vi = vi ⊕ H(ci, j, D[n]) ⊕ H(ci, j, D'[n])
        7: v'i = AE_K(ctr, i, vi)
    8: Send to SRV: {n, D'[n], {[i, v'i] | 1 ≤ i ≤ t}}
end

We now consider the costs. The bandwidth consumed is minimal given our security goals: since OWN does not store verification tokens, it needs to obtain them in line 2. Consequently, and for the same reason, it also needs to upload the new tokens to SRV at line 8. Computation costs for OWN are similar to those of the setup phase, except that no hashing is involved. There is no computation cost for SRV. As far as storage, OWN must have enough to accommodate all tokens, as in the setup phase.

Security Analysis. The security of the Update algorithm in the random oracle model follows directly from the proof of the static case. The only difference is in the way we compute vi: rather than hashing the concatenation of blocks, the adversary hashes each single block and XORs the resulting outputs. This does not change the ability of the simulator to extract the blocks queried during the challenge. Indeed, as in [24], we include a unique index in the hash (from 1 to r) that "forces" the adversary to query the random oracle on each single block indicated by the challenge. In addition, the analysis of the collision-resistance property of the XOR-ed hashes comes directly from [24], so we omit it here.
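The XOR-ed token structure and the in-place patch at line 6 of Algorithm 3 can be sketched as follows. This is again a toy illustration, not the authors' code: SHA-256 stands in for H, the nonce is a fixed toy value, and the AE layer and ctr bookkeeping are omitted. It also shows why each block is hashed with its index.

```python
import hashlib

def h(ci, j, block):
    # H(c_i, j, block): each block is hashed together with its index j, so
    # equal-content blocks at different positions do not cancel under XOR.
    return hashlib.sha256(ci + j.to_bytes(4, "big") + block).digest()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

ci = b"challenge-nonce-i"                    # toy nonce; AE/ctr omitted
blocks = {1: b"AAA", 2: b"BBB", 3: b"AAA"}   # blocks 1 and 3 have equal content
sel = [1, 2, 3]                              # indices picked by g_ki for this token

def token(bs):
    # v_i = H(c_i, 1, D[I_1]) xor ... xor H(c_i, r, D[I_r])
    v = bytes(32)
    for j, n in enumerate(sel, start=1):
        v = xor(v, h(ci, j, bs[n]))
    return v

v = token(blocks)

# Update D[3] to a new value: XOR out the old per-block hash, XOR in the new
# one (line 6 of Algorithm 3), without touching the other blocks.
j = sel.index(3) + 1
v = xor(xor(v, h(ci, j, blocks[3])), h(ci, j, b"CCC"))
blocks[3] = b"CCC"

# The patched token equals one recomputed from scratch over the updated data.
assert v == token(blocks)
```

Note that without the per-block index, the two "AAA" blocks would contribute identical hashes and cancel each other out of the token entirely.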
coded into the scheme – we pre-compute all answers to all challenges. However, rather than considering a 3.2. Block Deletion single contiguous file of d as an array of equal-sized blocks, we could consider a logical bi-dimensional After being outsourced, certain data blocks might structure of the outsourced data, and append a new need to be deleted. We assume that the number of block to one of the original blocks D[1],...,D[d] in blocks to be deleted is small relatively to the size of a round-robin fashion. In other words, assume that the file. This is to avoid affecting the detection proba- OWN has the outsourced data D[1],...,D[d], and that bility appreciably. If large portions of the file have to it wants to append the blocks D[d + 1],...,D[d + k]. be deleted then the best approach in this case is to re- Then a logical matrix is built by column as follows: run the basic PDP scheme on the new file. Supporting 0 block deletion turns out to be very similar to the update D [1] = D[1] , D[d + 1] operation. Indeed, deleted blocks can be replaced by D0[2] = D[2] , D[d + 2] a predetermined special block in their respective posi- ... tions via the update procedure. Line 6 of Algorithm 3 D0[k] = D[k] , D[d + k] can be modified as follows: ... from: D0[d] = D[d] 0 vi = vi ⊕ H(ci, j, D[n]) ⊕ H(ci, j, D [n]) to: For the index j in the i-th challenge, the server will have vi = vi ⊕ H(ci, j, D[n]) ⊕ H(ci, j, DBlock) to include in the computation of the XOR-ed hashes vi any blocks linked to D[g ( j)], i.e., the entire row g ( j) where DBlock is a fixed special block that represents ki ki in the logical matrix above. In particular, will deleted blocks. This small change results in the deleted SRV include: block being replaced in all verification tokens where it has occurred. Moreover, as in the update case, all to- H(ci, j, D[gki ( j)]) ⊕ H(ci, d + j, D[gki ( j) + d]) ⊕ kens are modified such that SRV remains oblivious as ... 
⊕ H(ci, αd + j, D[gki ( j) + αd]) far as the inclusion of the special block. The costs as- sociated to the delete operation are identical to those of where α is the length of the row gki ( j) of the logical the update. matrix. Clearly, OWN will have to update the XOR- ed hashes in a similar fashion by invoking the Update 3.3. Batching Updates and Deletions algorithm whenever a certain block must be appended. The advantage of this solution is that we can just Although the above demonstrates that the proposed run the Update operation to append blocks so we can technique can accommodate certain operations on out- even batch several appends. Notice also that the prob- sourced data, it is clear that the cost of updating all re- ability for any block to be selected during a challenge maining verification tokens for a single block update is still 1/d (this is because intuitively we consider d or deletion is not negligible for OWN . Fortunately, rows of several blocks combined together). The draw- it is both easy and natural to batch multiple operations. back is that the storage server will have to access more Specifically, any number of block updates and deletes blocks per query and this may become increasingly ex- can be performed at the cost of a single update or delete. pensive for SRV as the number of blocks appended to To do this, we need to modify the for-loop in Algorithm the database increases. 3 to take care of both deletions and updates at the same time. Since the modifications are straight-forward, we Insert: A logical insert operation corresponds to an do not illustrate them in a separate algorithm. append coupled with maintaining a data structure con- taining a logical-to-physical block number mapping for 3.4. Single-Block Append each “inserted” block. 
A true (physical) block insert op- eration is difficult to support since it requires bulk block In some cases, the owner might want to increase the renumbering, i.e., inserting a block D[ j] corresponds to size of the outsourced database. We anticipate that the shifting by one slot all blocks starting with index j + 1. most common operation is block append, e.g., if the This affects many (if not all) rows in the logical matrix described above and requires a substantial number of would involve high overhead, as each append re- computations. quires OWN to fetch all unused tokens, which is a bandwidth-intensive activity. 3.5. Bulk Append 4. Discussion Another variant of the append operation is bulk ap- pend which takes place whenever OWN needs to up- In this section we discuss some features of the pro- load a large number of blocks at one time. There posed scheme. are many practical scenarios where such operations are common. For example, in a typical business environ- 4.1. Bandwidth-Storage Tradeoff ment, a company might accummulate electronic docu- ments (e.g., sales receipts) for a fixed term and might Our basic scheme, as presented in Section 2, in- be unable to outsource them immediately for tax and volves OWN outsourcing to SRV not only data D, instant availability reasons. However, as the fiscal year but also a collection of encrypted verification tokens: rolls over, the accummulated batch would need to be 0 {[i,vi] for 1 ≤ i ≤ t}. An important advantage of this uploaded to the server in bulk. method is that OWN remains virtually stateless, i.e., To support the bulk append operation, we sketch out it only maintains the “master-key” K. The storage bur- a slightly different protocol version. In this context, we den is shifted entirely to SRV . The main disadvantage require for OWN to anticipate the maximum size (in of this approach is that dynamic operations described blocks) for its data. Let dmax denote this value. 
How- in Section 3 require SRV to send all unused tokens ever, in the beginning, OWN only outsources d blocks back to OWN resulting in the bandwidth overhead of where d < dmax. (t × r × |AEkey|) bits. Of course, OWN must subse- The idea behind bulk append is simple: OWN bud- quently send all updated tokens back to SRV ; however, gets for the maximum anticipated data size, including as discussed earlier, this is unavoidable in order to pre- (and later verifying) not r but rmax = dr ∗ (dmax/d)e serve security. blocks per verification token. However, in the begin- One simple way to reduce bandwidth overhead is for ning, when only d blocks exist, OWN , as part of setup, OWN to store some (or all) verification tokens locally. computes each token vi in a slightly different fashion 0 Then, OWN simply checks them against the responses (the computation of vi remains unchanged): received from SRV . We stress that each token is a hash function computed on a set of blocks, thus will rmax OWN M have to store, in the worst case, a single hash per chal- vi = j where: j=1 lenge. Another benefit of this variant is that it obviates the need for encryption (i.e., AEkey is no longer needed)  H(c , j, D[g ( j)]) if g ( j) ≤ d for tokens stored locally. Also, considering that SRV Q = i ki ki j 0|H(·)| otherwise is a busy entity storing data for a multitude of owners, storing tokens at OWN has a further advantage of sim- The important detail in this formula is that, whenever plifying update, append, delete and other dynamic op- the block index function gki ( j) > d (i.e., it exceeds the erations by cutting down on the number of messages. current number of outsourced blocks), OWN simply skips it. If we assume that dmax/d is an integer, we know 4.2. Limited Number of Verifications that, on average, r indices will fall into the range of ex- isting d blocks. 
Since both OWN and SRV agree on One potentially glaring drawback of our scheme is the number of currently outsourced blocks, SRV will the pre-fixed (at setup time) number of verifications t. follow exactly the same procedure when re-computing For OWN to increase this number would require to re- each token upon OWN ’s verification request. running setup which, in turn, requires obtaining D from Now, when OWN is ready to append a number (say, SRV in its entirety. However, we argue that, in prac- d) of new blocks, it first fetches all unused tokens from tice, this is a not an issue. Assume that OWN wants to the server, as in the block update or delete operations periodically (every M minutes) obtain a proof of posses- described above. Then, OWN re-computes each un- sion and wants the tokens to last Y years. The number used token by updating the counter (as in Equation 2). of verification tokens required by our scheme is thus: The only difference is that now, OWN factors in (does (Y × 365 × 24 × 60)/M. The graph in Figure 1 shows not skip) H(ci, j, D[gki ( j)]) whenever gki ( j) falls in the the storage requirements (for verification tokens) for a ]d,2d] interval. (It continues to skip, however, all block range of Y and M values. The graph clearly illustrates indices exceeding 2d.) that this overhead is quite small. For example, assuming It is easy to see that this approach works but only AEK(·) block size of 256 bits, with less than 128 Mbits with sufficiently large append batches; small batches (16 Mbytes), outsourced data can be verified each hour Figure 1. Token Storage Overhead (in bits): X-axis shows verification frequency (in minutes) and Y-axis shows desired longevity (in years).

for 64 years, or, equivalently, every 15 minutes for 16 • Each data block is 4-KB (|b| = 212) and d = years. 237/212 = 225. Furthermore, the number of tokens is independent from the number of data blocks. Thus, whenever out- • OWN needs to perform one verification daily for sourced data is extremely large, extra storage required the next 32 years, i.e., t = 32 × 365 = 11,680. by our approach is negligible. For instance, the Web • OWN requires the detection probability of 99% Capture project of the Library of Congress has stored, 4 with 1% of the blocks being missing or corrupted; as of May 2007, about 70 Terabytes of data. As shown hence r = 29 (see Equation 1). above, checking this content every 15 minutes for the next 16 years would require only 1 Mbyte of extra stor- Based on the above, each hash, computed over roughly age per year, which arguably can be even stored directly 221 = 212 ×29 bytes, would take – on a low-end comput- at OWN . ing device with a 1-GHz CPU, less than 0.04 seconds [25]. The total number of hashes is 11,680. Thus, setup 4.3. Computation Overhead time is about 11,680 × 0.04 = 467 seconds, which is less than 8 minutes. Note that setup also involves t × r Another important variable is the computation over- PRP, 2t PRF, as well as t AEK invocations. However, head of the setup phase in the basic scheme. Recall that these operations are performed over short inputs (e.g., at setup OWN needs to compute t hashes, each over 128 bits) and their costs are negligible with respect to a string of size r ∗ |b| size, where |b| is the block size. those of hashing. (In the modified scheme, the overhead is similar: t × r hashes over roughly one block each.) 5. Related Work To assess the overhead, we assume that to process a single byte, a hash function (e.g., SHA), requires just 20 Initial solutions to the PDP problem were provided machine cycles [25], which is a conservative assump- by Deswarte et al. [26] and Filho et al. [27]. Both use tion. 
Furthermore, we assume the following realistic RSA-based hash functions to hash the entire file at ev- parameters: ery challenge. This is clearly prohibitive for the server whenever the file is large. • OWN outsources 237 bytes of data, i.e., the con- Similarly, Sebe et al. [28] give a protocol for remote tent of a typical hard drive – 128-GB. file integrity checking, based on the Diffie-Hellman problem in composite-order groups. However, the 4See http://www.loc.gov/webcapture/faq.html server must access the entire file and in addition the client is forced to store several bits per file block, so evaluate our scheme, for the same amount of data, we storage at the client is linear with respect to the number assume t tokens, each based on r verifiers. of file blocks as opposed to constant. For our scheme, the probability of no corrupted Schwarz and Miller [22] propose a scheme that al- block being detected after all t tokens are consumed is: lows a client to verify storage integrity of data across d−η multiple servers. However, even in this case, the server m (2) must access a linear number of file blocks per challenge. d  Golle et al. [29] first proposed the concept of en- m forcement of storage complexity and provided efficient Note that η accounts for the number of different blocks schemes. Unfortunately the guarantee they provide is used by at least one token. Indeed, to avoid detec- weaker than the one provided by PDP schemes since it tion, SRV cannot corrupt any block used to compute only ensures that the server is storing something at least any token. It can be shown that, for a suitably large as large as the original file but not necessarily the file d, for every t, it holds that: η ≥ min{tr/2,d/2} with itself. overwhelming probability. 
Hence, the upper bound for Provable data possession is a form of memory check- Equation 2, when tr/2 ≤ d/2, is given by: ing [30, 23] and in particular it is related to the concept tr d− 2  of sub-linear authentication introduced by Naor and m (3) Rothblum [23]. However, schemes that provide sub- d  linear authentication are quite inefficient. m Compared to the PDP scheme in [7], our scheme is In [8], SRV can avoid detection if it does not corrupt significantly more efficient in both setup and verifica- any sentinel block. Since a total of s × t sentinels are tion phases since it relies only on symmetric-key cryp- added to the outsourced data, the probability of avoiding tography. The scheme in [7] allows unlimited verifica- detection is: d  tions and offers public verifiability (which we do not m (4) achieve). However, we showed that limiting the num- d+(s×t) ber of challenges is not a concern in practice. In ad- m dition, our scheme supports dynamic operations, while The above equations show that, in our scheme, for a the PDP scheme in [7] works only for static databases. given t, increasing r also increases the tamper-detection Compared to the POR scheme [8], our scheme pro- probability. However, r has no effect on the total num- vides better performance on the client side, requires ber of tokens, i.e., does not influence storage overhead. much less storage space, and uses less bandwidth (sizes As for POR, for a given t, increasing tamper detection of challenges and responses in our scheme is a small probability requires increasing in s, which, in turn, in- constant, less than the size of a single block). To appre- creases storage overhead. ciate the difference in storage overhead, consider that, POR needs to store s sentinels per verification, where 6. Conclusions each sentinel is one data block in size (hence, a total of s × t sentinels). 
In contrast, our scheme needs a single We developed and presented a step-by-step design of encrypted value (256 bits) per verification. Note that, in a very light-weight and provably secure PDP scheme. It POR, for a detection probability of around 93%, where surpasses prior work on several counts, including stor- at most 1% of the blocks have been corrupted, the sug- age, bandwidth and computation overheads as well as gested value s is on the order of one thousand [8]. Like the support for dynamic operations. However, since it is [7], POR is limited to static data (in fact, [8] considers based upon symmetric key cryptography, it is unsuitable supporting dynamic data an open problem). In sum- for public (third-party) verification. A natural solution mary, we provide more features than POR by consum- to this would be a hybrid scheme combining elements ing considerably less resources 5. of [7] and our scheme. To further compare the two schemes we assess their To summarize, the work described in this paper rep- respective tamper detection probabilities. In [8], the resents an important step forward towards practical PDP outsourced data is composed of d blocks and there are techniques. We expect that the salient features of our s sentinels. We compute the probability that, m being scheme (very low cost and support for dynamic out- the number of corrupted blocks, the system consumes sourced data) make it attractive for realistic applica- all sentinels and corruption is undetected. Similarly, to tions.
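The tamper-detection bounds of Equations (2)-(4) are easy to check numerically. The following Python sketch is our own illustration (not part of the protocol): it evaluates the escape probability for our scheme, via the bound η ≥ min{tr/2, d/2}, and for POR's sentinel-based detection; the parameter values are arbitrary examples, not the paper's.

```python
from math import comb

def escape_prob_ours(d, m, t, r):
    """Eq. (3): probability that m corrupted blocks (out of d) avoid every
    block covered by the t tokens, using eta >= min(t*r/2, d/2)."""
    eta = min(t * r // 2, d // 2)
    return comb(d - eta, m) / comb(d, m)

def escape_prob_por(d, m, s, t):
    """Eq. (4): probability that none of the m corrupted blocks is one of
    the s*t sentinels hidden among the outsourced blocks."""
    return comb(d, m) / comb(d + s * t, m)

d = 2**15          # number of data blocks (toy example)
m = d // 100       # 1% of the blocks corrupted
p_few_tokens  = escape_prob_ours(d, m, t=10,  r=2**9)
p_many_tokens = escape_prob_ours(d, m, t=100, r=2**9)
```

As Section 5 notes, increasing r (or t) shrinks the escape probability in our scheme without adding tokens, whereas in POR shrinking the escape probability requires more sentinels and thus more storage.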

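As a back-of-the-envelope check of the figures in Sections 4.2 and 4.3, the short sketch below (our own illustration; the 256-bit token size, the 0.04 s/hash estimate and the parameter choices are taken from the text) computes the token count and the setup time:

```python
def num_tokens(years, interval_minutes):
    # Section 4.2: one token per verification, (Y * 365 * 24 * 60) / M tokens.
    return (years * 365 * 24 * 60) // interval_minutes

# "verified each hour for 64 years or, equivalently, every 15 minutes
# for 16 years": both schedules consume the same number of tokens.
hourly_64y = num_tokens(64, 60)
quarter_hourly_16y = num_tokens(16, 15)

# Section 4.3: one verification daily for 32 years, at roughly 0.04 s
# per token hash, reproduces the quoted setup time of about 467 s.
t = 32 * 365
setup_seconds = t * 0.04
```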
5 We have recently learned that a MAC-based variant of our first basic scheme was mentioned by Ari Juels during a presentation at CCS '07 in November 2007 ([31]). This is an independent and concurrent discovery. Indeed, an early version of our paper was submitted to NDSS 2008 in September 2007.

References

[1] P. Devanbu, M. Gertz, C. Martel, and S. Stubblebine, "Authentic third-party data publication," in IFIP DBSec'03; also in Journal of Computer Security, vol. 11, no. 3, pp. 291-314, 2003.
[2] E. Mykletun, M. Narasimha, and G. Tsudik, "Authentication and integrity in outsourced databases," in ISOC NDSS'04, 2004.
[3] H. Hacigümüş, B. Iyer, C. Li, and S. Mehrotra, "Executing SQL over encrypted data in the database-service-provider model," in ACM SIGMOD'02, 2002.
[4] D. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in IEEE S&P'00, 2000.
[5] D. Boneh, G. di Crescenzo, R. Ostrovsky, and G. Persiano, "Public key encryption with keyword search," in EUROCRYPT'04, 2004.
[6] P. Golle, J. Staddon, and B. Waters, "Secure conjunctive keyword search over encrypted data," in ACNS'04, 2004.
[7] G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, and D. Song, "Provable data possession at untrusted stores," in ACM CCS'07, 2007. Full paper available on e-print (2007/202).
[8] A. Juels and B. Kaliski, "PORs: Proofs of retrievability for large files," in ACM CCS'07, 2007. Full paper available on e-print (2007/243).
[9] J. F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. McArthur, S. Minton, I. Xheneti, A. Toncheva, and A. Manfrediz, "The expanding digital universe: A forecast of worldwide information growth through 2010. IDC white paper, sponsored by EMC," Tech. Rep., March 2007. http://www.emc.com/about/destination/digital_universe/.
[10] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong, "Freenet: A distributed anonymous information storage and retrieval system," in International Workshop on Designing Privacy Enhancing Technologies, New York, NY, USA, 2001, pp. 46-66, Springer-Verlag New York, Inc.
[11] M. Lillibridge, S. Elnikety, A. Birrell, M. Burrows, and M. Isard, "A cooperative internet backup scheme," in USENIX'03, 2003.
[12] B. Cooper and H. Garcia-Molina, "Peer-to-peer data trading to preserve information," ACM ToIS, vol. 20, no. 2, pp. 133-170, 2002.
[13] G. Ateniese, D. Chou, B. de Medeiros, and G. Tsudik, "Sanitizable signatures," in ESORICS'05, 2005.
[14] M. Waldman and D. Mazières, "TANGLER: A censorship-resistant publishing system based on document entanglements," in ACM CCS'01, 2001.
[15] J. Aspnes, J. Feigenbaum, A. Yampolskiy, and S. Zhong, "Towards a theory of data entanglement," in Proc. of European Symposium on Research in Computer Security, 2004.
[16] M. Waldman, A. Rubin, and L. Cranor, "PUBLIUS: A robust, tamper-evident, censorship-resistant web publishing system," in USENIX SEC'00, 2000.
[17] P. Rogaway, M. Bellare, and J. Black, "OCB: A block-cipher mode of operation for efficient authenticated encryption," ACM TISSEC, vol. 6, no. 3, 2003.
[18] M. Bellare, R. Canetti, and H. Krawczyk, "Keying hash functions for message authentication," in CRYPTO'96, 1996.
[19] J. Black and P. Rogaway, "Ciphers with arbitrary finite domains," in Proc. of CT-RSA, vol. 2271 of LNCS, pp. 114-130, Springer-Verlag, 2002.
[20] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An architecture for global-scale persistent storage," SIGPLAN Not., vol. 35, no. 11, pp. 190-201, 2000.
[21] G. R. Goodson, J. J. Wylie, G. R. Ganger, and M. K. Reiter, "Efficient byzantine-tolerant erasure-coded storage," in DSN'04: Proceedings of the 2004 International Conference on Dependable Systems and Networks, Washington, DC, USA, 2004, p. 135, IEEE Computer Society.
[22] T. S. J. Schwarz and E. L. Miller, "Store, forget, and check: Using algebraic signatures to check remotely administered storage," in Proceedings of ICDCS'06, IEEE Computer Society, 2006.
[23] M. Naor and G. N. Rothblum, "The complexity of online memory checking," in Proc. of FOCS, 2005. Full version appears as ePrint Archive Report 2006/091.
[24] M. Bellare, R. Guérin, and P. Rogaway, "XOR MACs: New methods for message authentication using finite pseudorandom functions," in CRYPTO'95, 1995.
[25] B. Preneel et al., "NESSIE D21 - Performance of optimized implementations of the primitives."
[26] Y. Deswarte, J.-J. Quisquater, and A. Saidane, "Remote integrity checking," in Proc. of Conference on Integrity and Internal Control in Information Systems (IICIS'03), November 2003.
[27] D. L. G. Filho and P. S. L. M. Barreto, "Demonstrating data possession and uncheatable data transfer," IACR ePrint archive, Report 2006/150, 2006. http://eprint.iacr.org/2006/150.
[28] F. Sebe, A. Martinez-Balleste, Y. Deswarte, J. Domingo-Ferrer, and J.-J. Quisquater, "Time-bounded remote file integrity checking," Tech. Rep. 04429, LAAS, July 2004.
[29] P. Golle, S. Jarecki, and I. Mironov, "Cryptographic primitives enforcing communication and storage complexity," in Financial Cryptography, 2002, pp. 120-135.
[30] M. Blum, W. Evans, P. Gemmell, S. Kannan, and M. Naor, "Checking the correctness of memories," in Proc. of FOCS'95, 1995.
[31] A. Juels and B. S. Kaliski, http://www.rsa.com/rsalabs/staff/bios/ajuels/presentations/Proofs_of_Retrievability_CCS.ppt.