
© ACM, 2013. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 18th ACM symposium on Access Control Models and Technologies, 2013. http://doi.acm.org/10.1145/2462410.2462420

A Storage-Efficient Cryptography-Based Access Control Solution for Subversion

Dominik Leibenger, University of Paderborn, Warburger Straße 100, 33098 Paderborn, Germany, [email protected]

Christoph Sorge, University of Paderborn, Warburger Straße 100, 33098 Paderborn, Germany, [email protected]

ABSTRACT

Version control systems are widely used in software development and document management. Unfortunately, versioning confidential files is not normally supported: Existing solutions encrypt the transport channel, but store data in plaintext within a repository. We come up with an access control solution that allows secure versioning of confidential files even in the presence of a malicious server administrator. Using convergent encryption as a building block, we enable space-efficient storage of version histories despite secure encryption. We describe an implementation of our concept for the Subversion (SVN) system, and evaluate storage efficiency and runtime of this implementation. Our implementation is compatible with existing SVN versions without requiring changes to the storage backend.

Categories and Subject Descriptors

K.6.5 [Management of Computing and Information Systems]: Security and Protection; E.3 [Data Encryption]

General Terms

Security, Design, Algorithms, Experimentation, Performance

Keywords

Access control; version control systems; subversion; convergent encryption; confidentiality; efficient storage

1. INTRODUCTION

Version control systems (VCS) have become an invaluable tool for software development and are also used during the creation of all kinds of documents. As their main feature, they allow restoring any previous version of a file. Most VCS also facilitate the collaboration of authors working together on a project. Authors obtain the latest copy from the VCS, work on that copy, and then commit an updated version to the VCS. A VCS will not normally store complete copies of each version of a file: changes might only affect a small fraction of that file, so only differences between two versions are stored.

One of the most common VCS is Subversion (SVN), which relies on a central server to store the history of versioned files. Like other VCS, SVN is not well-suited for working with confidential files. The transport channel is usually protected, e.g. using TLS. However, once on the server, each user with read access to the file system, as well as the server's administrators, can read the file. SVN allows access rules to be defined, but their enforcement requires a completely trustworthy server. Users can encrypt the content using external applications; unfortunately, this makes it impossible for the VCS to save only space-saving differences between file versions: even with only one byte changed in a long document, the encrypted versions will be completely different. Moreover, key management does not tie in with the versioning system. Even assuming there was a public-key infrastructure (PKI) in place, granting users read access to a file (and revoking that right later on) is a complex and error-prone task. If a user is supposed to get access to more than one file version, access rights must be granted (e.g., the file must be encrypted with the respective user's public key) each time a new version is uploaded to the SVN repository.

Our contribution is to define a solution for securing access to files in an SVN repository with the following properties:

• Rights can be managed individually for each file.
• Enforcing read access control through encryption allows protecting the confidentiality of files even against attacks by repository administrators.
• Despite encryption, space-saving differences are stored instead of entire copies of each file version. To achieve this, we allow attackers to see the positions of changes made between different versions of the same file.
• We retain full compatibility with old software versions, though not all features will work if either the client or the server does not support our solution.
• Necessary key exchanges tie in with the SVN architecture. They are authenticated without use of a PKI; a shared secret between a pair of users is sufficient.
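The storage argument made in this introduction (independently encrypted versions share nothing, whereas content-derived keys keep unchanged parts identical) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `toy_encrypt` is a SHA-256 keystream stand-in rather than a real cipher, chunks have a fixed size instead of being context-sensitive, and none of the function names come from the authors' implementation.

```python
import hashlib
import os


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode (illustration only, not a vetted cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR the data with the keystream; applying it twice with the same key decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def encrypt_with_random_key(data: bytes) -> bytes:
    """Conventional external encryption: a fresh random key for every version."""
    return toy_encrypt(os.urandom(32), data)


def encrypt_convergent(data: bytes, chunk_size: int = 64) -> list:
    """Convergent encryption per chunk: each chunk's key is the hash of its content."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [toy_encrypt(hashlib.sha256(c).digest(), c) for c in chunks]


# Two versions of a 1024-byte "document" that differ in a single byte.
version1 = bytes(range(256)) * 4
version2 = bytearray(version1)
version2[500] ^= 0xFF
version2 = bytes(version2)

random1 = encrypt_with_random_key(version1)
random2 = encrypt_with_random_key(version2)
conv1 = encrypt_convergent(version1)
conv2 = encrypt_convergent(version2)

# Only the chunk containing byte 500 changes under convergent encryption.
unchanged = sum(a == b for a, b in zip(conv1, conv2))
print(f"identical convergent chunks: {unchanged} of {len(conv1)}")
# prints: identical convergent chunks: 15 of 16
```

With random keys, the two ciphertexts are unrelated, so a server-side binary diff has nothing to reuse. With content-derived keys, 15 of the 16 chunk ciphertexts are byte-identical. Fixed-size chunks would lose this property as soon as bytes are inserted or removed, which is one reason a context-sensitive chunking procedure is used in the scheme itself.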
The remainder of this paper is structured as follows: We discuss related work in Sec. 2 and outline our goals in Sec. 3. Sec. 4 presents the system design. The encryption scheme that allows for storage efficiency is worked out in Sec. 5; a security analysis follows in Sec. 6. Sec. 7 describes the implementation, which we analyze concerning storage / runtime overhead in Appendix A. Sec. 8 concludes this paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SACMAT'13, June 12–14, 2013, Amsterdam, The Netherlands. Copyright 2013 ACM 978-1-4503-1950-8/13/06 ...$15.00.

2. PRELIMINARIES & RELATED WORK

While the concepts developed in this work might be applied to other VCS, we develop our access control solution for the SVN [19] system. SVN uses a client/server architecture. Each client stores a working copy, which contains both the versions of the files last obtained from the server and the versions edited by the user. After performing changes to files, the client performs a commit operation to add the changes to the repository; an update gets the latest contents from the repository. Communication with the server can be based on WebDAV [7], or can use a specific SVN protocol [4]. In both cases, the client only sends (or receives) the changes with respect to the previously stored version.

The server stores the whole revision history of the files and folders added to the repository as well as of property lists containing metadata that can be associated to them. After storing a revision, a user cannot change or delete contents of already stored data. Several storage backends exist. The most common one is FSFS [15, 3]; it stores the oldest version of each file completely. Once a file is changed, FSFS only stores the (binary) difference compared to a (not necessarily the latest) previous version. This way, SVN provides data deduplication between different revisions of individual files. Data deduplication between different files is not provided.

Our goal to develop a secure and storage-efficient access control solution for this system comprises two main challenges: A solution to allow convenient access rights management that includes key distribution, and a concept for data deduplication despite secure encryption. Work on secure file systems faces similar challenges; however, the majority of existing work focusses on the key distribution problem.

The secure file system SiRiUS [10] supports per-file access right management by encrypting each file using a randomly-generated key that is encrypted with the authorized users' keys and stored as part of the file's metadata. It supports separate read / write access rights.

Plutus [12] also distinguishes between read and write access rights, but groups files with equal access rights into file groups and stores their keys in an encrypted, shared lockbox to reduce key distribution costs. The authors introduce a method called key rotation: On access rights changes, keys are changed in a way that allows the computation of previous keys from the latest one, which prevents the need for immediate re-encryption of file contents on key changes. Key regression [9] fixes some security flaws of key rotation.

In contrast, in the Wuala¹ distributed file system, Grolimund et al. [11] introduce a data structure called Cryptree to explicitly store dependencies between arbitrary keys.

We use a simple, SiRiUS-like approach for key distribution as it fits best into the SVN architecture: The other solutions depend on relations between different files which are not guaranteed to be available in SVN setups (for example, a user might check out only a single file or directory).

Despite those systems that focus on key management, few secure file systems provide data deduplication.

Farsite introduces a concept called convergent encryption (CE) which we adapt in Section 5: They use a cryptographic hash of a file as a key to encrypt that file, so identical files can be recognized and deduplicated despite secure encryption. Similar to the SiRiUS approach, that key is then encrypted with the authorized users' keys to grant them access rights.

The concept has been extended in [18]: The authors suggest applying CE to chunks of a file, which are determined using a context-sensitive chunking procedure, to allow data deduplication even between similar files. Rashid et al. [17] recently re-used this concept to build a privacy-preserving data deduplication framework for cloud environments. We describe and further extend the procedure of [18] in Sec. 5.

The Tahoe filesystem [20], in contrast, uses CE for deduplication of entire files only. The authors introduce a convergence secret, so the equality of files can only be detected by users sharing that secret. This concept is similar to the obfuscator we introduce to hide equality of chunks between different files.

None of these approaches have been applied to VCSs. So far, besides using secure channels for communication, existing VCSs only cope with integrity and authenticity. Git² and Monotone³ ensure integrity of the version history (if a trustworthy copy of the current version is available) and integrate the use of signatures. To the best of our knowledge, we are the first to provide a fine-grained, storage-efficient access control solution for a VCS that is resistant to a server compromise and thus able to provide true confidentiality.

¹ http://www.wuala.com/
² http://git-scm.com/about
³ http://www.monotone.ca/

3. GOALS, THREATS AND ASSUMPTIONS

The goal of this work is to protect the confidentiality, integrity and authenticity of specific file versions stored in an SVN repository by providing a fine-grained, user-definable access control solution. In more detail, our goals are:

• Compatibility: We limit ourselves to changes that only extend the SVN architecture, without restricting interoperability with existing SVN versions. This especially requires independence of our solution from the used storage backend.
• Fine-Grained Access Control and Convenient Usability: Our solution shall allow users of an SVN repository to protect their own files stored in the repository by defining access rights for them. Users shall be able to individually grant read and write permissions for a specific file, as well as permissions to manage the file's access rights, to other specific users. Access rights management shall be convenient insofar as key exchanges between users shall be performed automatically when access rights are managed (we only require a one-time manual key exchange per user for authentication).
• Integrity and Authenticity: If access rights have been associated to a file, our solution shall protect the file's integrity and authenticity: every authorized user shall be able to verify if the file's contents have been written by an authorized user according to the currently granted access rights on the file. This shall hold even against active attackers with full read/write access to the repository server. Note that since the SVN architecture expects only the server to store the whole revision history, we cannot provide protection against malicious deletion (including rollbacks that correspond to deleting revisions) by the server administrator.
• Confidentiality and Storage Efficiency: We want to provide confidentiality of files whose access rights have been restricted. Thus, attackers with full read/write access to the repository server shall not be able to read those files' contents. This shall be ensured through encryption, using an encryption scheme that is believed to provide IND-CCA-secure encryption.

However, we want to retain storage efficiency, which SVN achieves by deduplicating data between different file revisions. As SVN's storage backend is transparent to the clients, data deduplication has to be done at the server side, i.e. without knowledge of file contents. This leads to a clear conflict with the confidentiality goal. We resolve this conflict by allowing the security requirements to be softened when committing further revisions of a specific file. Allowing attackers to see the extent and positions of changes to a file (which includes information about how file contents are moved between revisions) enables generation of ciphertexts suitable for SVN's server-side data deduplication techniques. Users may decide for each file revision if this information leakage is acceptable.

For files whose access rights change frequently, users may even decide to allow data deduplication between file revisions with different access rights, again by softening the security requirements. In this case, some additional information is leaked only to users who have or had legitimate rights to read some revision of a file. They are able to decrypt any parts of other revisions of the same file which were contained in an already-known version and might also be able to verify the existence of single plaintext fragments. Further, they are able to derive very limited information from chunk boundaries of other revisions of that file (cf. Section 5).

Note that this information leakage is limited to information leaked between revisions of a single file. We always hide similarities between different files as SVN does not perform data deduplication between files anyway.

We achieve these goals under some assumptions: As authentication is password-based, we assume that attackers do not know (and cannot break) these passwords. We further assume that no attacker can break the Diffie-Hellman key exchange, and that used hash functions provide resistance against preimage attacks as well as strong collision resistance. Moreover, we assume the existence of a pseudorandom permutation (PRP) which we use as encryption primitive. Users are considered authorized for specific actions w.r.t. a file version if they are granted the corresponding access rights by a user with GRANT rights (i.e. a user allowed to manage access rights) for that file version, as described in Section 4.1. A file owner and other GRANT-authorized users (as well as their SVN clients) are considered trustworthy w.r.t. the specific file.

4. SYSTEM DESIGN

Our access control solution consists of two building blocks: First, we design and integrate a suitable access control concept for the SVN architecture. Second, we adapt an encryption scheme that achieves confidentiality without preventing the use of data deduplication techniques such as SVN's difference calculations. This section deals with the former.

4.1 Overview

The idea of our access control solution is to let the owner of a file (i.e. the user adding that file to the SVN) decide on its precise access rights. The following rights can be granted:

• READ: Allow reading the contents of the specific file.
• WRITE: Allow modifying a specific file's contents.
• GRANT: Allow granting arbitrary rights to and revoking arbitrary rights from other users.

To this point, this definition of access rights seems rather straightforward. However, the semantics of access rights granted for a specific file are not that clear when rights can be changed over time. While WRITE and GRANT rights can be granted to as well as revoked from a user effectively at any time, the impact of a READ right revocation is less certain: The user could have made copies of the file revisions he had access to, thus revoking the READ right would only be effective regarding future but not previous revisions. To make the effective rights clear and comprehensible, we tie our access control mechanism to single file revisions instead of whole files. In particular, we store all access-rights-related information as metadata for a specific file revision. This leads to minor limitations: According to the revision concept of the SVN architecture, granted rights of archived file revisions are immutable; a user is only able to change these rights for future revisions of the file (so each access right modification requires a commit). However, although rights are tied to revisions, modifying a file or its access rights requires the WRITE or GRANT right for the latest revision, thus these rights can be revoked with immediate effect. Solely a READ right for an old revision can be exercised by a user even if he does not hold the right for the latest revision.

4.2 Access Rights

As described before, we distinguish between the rights READ, WRITE and GRANT. They are enforced as follows:

• The READ right is enforced by encrypting the file contents using a key triple $(K_R, K_O, K_I)$ (for details see Sec. 5) that is generated randomly by the file owner or any GRANT-authorized user and only made available to READ-authorized users. All cryptographic operations are performed at the client side and the keys are unknown to the repository, so the server can only see the ciphertexts. Thus, access control regarding this right can be resistant to a server compromise. While the repository does not know the file contents, server-side efficient storage of similar file revisions is retained by using a special encryption scheme that is described in Sec. 5.
• GRANT access control is performed using a shared secret key $K_G$ (again generated randomly by a GRANT-authorized user like the file owner) that is only known to GRANT-authorized users. This key is used to prove / verify the integrity / authenticity of access rights, trust relationships and cryptographic keys.
• The WRITE right is enforced by a simple server-side ACL, since the repository administrator can modify or delete ciphertexts anyway due to his physical access to the storage.

Whenever access rights are changed, the affected keys are replaced by new keys that are chosen randomly by the GRANT-authorized user's client that performs the change.

4.3 Authentication and Key Exchange

To allow a convenient exertion of access rights, we do not require the users to exchange those cryptographic keys manually. Instead, every user $u$ has to remember a single password $CP(u)$ (which must differ from the one used to authenticate to the repository) that allows him to exercise all of his access rights. We use the repository as transport channel for communication between users, so we need to minimize the number of communication rounds. This works as follows:

1. An individual key $K_{DH}(f, u)$ is negotiated for every user $u$ of a file $f$ using a Diffie-Hellman key exchange [6]. This key is made available both to the specific user $u$ as well as to every GRANT-authorized user of file $f$. For this, we regard the group of GRANT-authorized users of a specific file $f$ as one party of the DH key exchange and the user $u$ as the other party. The private DH value of the GRANT group is chosen by any GRANT-authorized user (e.g. the file owner); the private DH value of $u$ is chosen by $u$. We store the public DH values in the file's metadata. The private values of the GRANT group and of $u$ are stored encrypted with $K_G$ and $CP(u)$, respectively, so that both $u$ and GRANT-authorized users are able to determine $K_{DH}(f, u)$ later on.
2. As the DH key exchange does not provide authentication, we authenticate the negotiated DH keys separately. The idea is to build bidirectional trust chains from the owner of file $f$ (who is assumed to be trusted and serves as trust anchor) to every user $u$, i.e. every user $u$ is authenticated by the file owner over a chain of trusted GRANT-authorized users and vice versa. For this, every user $u$ negotiates a secret passphrase with an arbitrary GRANT-authorized user using a separate channel (e.g. telephone), which is used by them to authenticate each other. We use a symmetric HMAC-based authentication mechanism; details of this protocol are omitted due to space restrictions.
3. The access rights granted to a user $u$ are stored in the file's metadata and authenticated using an HMAC keyed by $K_G$ so that modifications can be detected by GRANT-authorized users. To allow exertion of access rights, the keys associated to an access right are stored within the file's metadata, encrypted with the key $K_{DH}(f, u)$ for every authorized user $u$. Since $K_{DH}(f, u)$ is known to all GRANT-authorized users, changed keys can be distributed easily to the users by simply encrypting the new keys with $K_{DH}(f, u)$.

5. ENCRYPTION SCHEME

In principle, we could use any secure encryption scheme for encrypting file contents. However, traditional schemes would prevent data deduplication between similar file revisions. To achieve storage efficiency, we use a special encryption scheme that we describe in this section.

Unfortunately, data deduplication and secure encryption are conflicting goals as the former has to depend on similarities between contents which the latter requires to hide. This leads to an inevitable trade-off. The benefit of our encryption scheme lies in the fact that it allows this trade-off to be adjusted: The encryption of single file revisions achieves the same security properties as the underlying encryption function (e.g. AES). However, when further revisions are stored, the user may choose to leak a limited amount of information to allow data deduplication.

Towards this goal, our encryption scheme requires a triple of keys $(K_R, K_O, K_I)$ instead of a single encryption key. $K_R$ corresponds to the classic encryption key that is used to ensure confidentiality. It may be used for an arbitrary amount of encryptions, but has to be changed whenever access rights of the encrypted content change. $K_I$ is used similarly to ensure integrity / authenticity. $K_O$ (the obfuscator) serves as a kind of secret initialization vector (IV). Using a unique IV for every encryption achieves the strongest security guarantees. However, in contrast to other schemes, using the same IV for multiple file revisions is secure insofar as it only leaks a controlled amount of information: Fixing $K_O$ only results in deterministic encryption, both regarding a whole plaintext as well as variable-length chunks within a plaintext. By revealing which parts have changed between two revisions, this allows data deduplication between ciphertexts of similar file revisions. Thus, a user may achieve storage efficiency by re-using $K_O$ of an ancestor when creating a new file revision.

To achieve this, we adapt the concept proposed by Storer et al. [18]. Their idea is to split a plaintext $m$ into so-called chunks using a context-sensitive chunking procedure before encrypting each chunk deterministically using convergent encryption (see Section 2). This way, identical chunks result in identical ciphertexts. The concatenation of those encrypted chunks forms the ciphertext of $m$. Unfortunately, this solution would have some shortcomings in our usage scenario: First of all, it would not only leak similarities between different revisions of a file, but also between different files, as contents are encrypted independent of the encryption key. Furthermore, due to the context-sensitive chunking, chunk boundaries leak some information about the plaintext $m$. Even worse, deterministic encryption of chunks could also leak information about the structure of $m$, as repeating contents would lead to repeating chunks and thus to repetitions in the ciphertext.

To overcome these shortcomings, we modify their concept in two points: First, we make the chunking procedure and the computation of chunk-specific keys depend on $K_O$ to limit CE's security implications to the scope of a specific value of $K_O$. Secondly, we introduce appearance counters which are involved in chunk boundary determination and chunk encryption to prevent potential security risks regarding files with repeating contents.

We continue with a formal description of our scheme.

5.1 Formal Description

In this section, we describe our encryption and authentication scheme $\Pi = (\mathrm{Gen}, \mathrm{EncMac}, \mathrm{Dec})$ following the notations defined in [13]. We later analyze our encryption scheme's security in the random oracle model [1], so we assume the existence of a random oracle $H$. As we focus on the encryption scheme, we refer to the contents of files and file revisions as messages in the remainder.

Requirements

We build upon an existing pseudorandom permutation (PRP) $F$ (with $F_k(x)$ denoting $F(k, x)$) as defined in [13, Section 3.6.3] analogous to [13, Definition 3.23]:

Definition. Let $F : \{0,1\}^* \times \{0,1\}^* \to \{0,1\}^*$ be an efficient, length-preserving, keyed function. We say that $F$ is a pseudorandom permutation if for all probabilistic polynomial-time distinguishers $D$, there exists a negligible function $\mathrm{negl}$ such that:

$$\left| \Pr[D^{F_k(\cdot)}(1^n) = 1] - \Pr[D^{f(\cdot)}(1^n) = 1] \right| \le \mathrm{negl}(n),$$

where $k \leftarrow \{0,1\}^n$ is chosen uniformly at random and $f$ is chosen uniformly at random from the set of permutations mapping $n$-bit strings to $n$-bit strings.

Further, we build upon an existing MAC function $\Pi_M = (\mathrm{Gen}_M, \mathrm{Mac}, \mathrm{Vrfy})$ with unique tags. Again, we expect $\mathrm{Gen}_M$ to return a uniform random number in $\{0,1\}^n$.

Utilizing the random oracle $H$, we define two different hash functions $H_R, H_S : \{0,1\}^n \times \{0,1\}^n \times \{0,1\}^* \to \{0,1\}^n$.

Definition of $\Pi$

We construct $\Pi$ from a to-be-defined private-key encryption scheme $\widehat{\Pi} = (\widehat{\mathrm{Gen}}, \widehat{\mathrm{Enc}}, \widehat{\mathrm{Dec}})$ and the MAC function $\Pi_M$ according to [13, Construction 4.19]:

• $\mathrm{Gen}$, on input $1^n$, runs $\widehat{\mathrm{Gen}}$ and $\mathrm{Gen}_M$ to obtain keys $K$ and $K_I$, respectively.
• $\mathrm{EncMac}$, on input key $(K, K_I)$ and message $m$, computes $c \leftarrow \widehat{\mathrm{Enc}}_K(m)$, $t \leftarrow \mathrm{Mac}_{K_I}(c)$ and outputs $(c, t)$.
• $\mathrm{Dec}$, on input a key $(K, K_I)$ and a ciphertext $(c, t)$, outputs $\widehat{\mathrm{Dec}}_K(c)$ if $\mathrm{Vrfy}_{K_I}(c, t) = 1$; otherwise, $\bot$ is returned.

Definition of $\widehat{\Pi}$

The key-generation function $\widehat{\mathrm{Gen}}$ is defined as follows: On input $1^n$, it outputs a key tuple $(K_R, K_O)$ with $K_R, K_O$ each being chosen uniformly at random from $\{0,1\}^n$.

The encryption function $\widehat{\mathrm{Enc}}$ is more complex: When getting a key $K = (K_R, K_O)$ and a message $m$ as input, it first splits the message into chunks, before performing separate encryptions for the individual chunks. The phases of $\widehat{\mathrm{Enc}}$ are now described in detail.

5.1.1 Chunking

Chunking is performed in a similar way as introduced in [14] and presented in the convergent encryption scenario in [18]: A sliding window (fixed size: $w$ bytes) is moved over the message $m$ and the windows' contents $m_{p,w}$ are used as input for the hash function $H_R$ to compute a pseudorandom fingerprint at every byte position $p$ (the authors of [14] and [18] use a rolling hash function to compute those fingerprints). The positions whose fingerprints are below a specific threshold $T$ are used as chunk boundaries. We make three changes to the original procedure: First, we calculate the fingerprints using $H_R(k, i, x)$ with $k := K_O$, so that chunking depends on $K_O$. Secondly, we prevent repeating window contents from resulting in the same fingerprint by including an appearance counter $i$ in the fingerprint calculation. Thirdly, we introduce a minimum chunk size of $l$ bytes. More formally, we create a chunk boundary at every byte position $p$ that meets the following condition:

$$p - l \;\ge\; \max_{p',\, p' \ldots} \{\, p' = 0 \,\lor\, p' \text{ is a chunk boundary} \,\} \quad \ldots \tag{1}$$

[...]

The key $K_{m_j}$ is then used to encrypt the chunk contents using the pseudorandom permutation $F$. For technical reasons, we prefix the chunk content with its length before encrypting it. $K_{m_j}$ is then encrypted (using the key $K_R$) and stored within the encrypted chunk representation. To summarize, every chunk $m_j$ is encrypted as follows (with $\Vert$ being the concatenation operator):

$$c_j = \left( F_{K_R}(K_{m_j}),\; F_{K_{m_j}}(\mathrm{length}(m_j) \,\Vert\, m_j) \right) \tag{3}$$

5.1.3 Finalization

When all chunks have been encrypted, their ciphertexts $c_1, \ldots, c_q$ are concatenated to build the ciphertext $c$ of the whole message:

$$\widehat{\mathrm{Enc}}_{(K_R, K_O)}(m) = \big\Vert_{j=1}^{q}\, c_j \tag{4}$$

The decryption function $\widehat{\mathrm{Dec}}$ is straightforward: On input a key $K = (K_R, K_O)$ and a ciphertext $c$, it sequentially decrypts the chunks $c_1, \ldots, c_q$: For every chunk $c_j = (c_j^k, c_j^c)$, it first determines the chunk key $K_{m_j} := F^{-1}_{K_R}(c_j^k)$ and then decrypts the chunk content and length by invoking $l_j \,\Vert\, m_j := F^{-1}_{K_{m_j}}(c_j^c)$. Finally, $\big\Vert_{j=1}^{q}\, m_j$ is output. Note that decryption depends only on $K_R$, not on $K_O$.

5.2 Advantages

If every encryption application uses a random $K_O$, our scheme achieves CCA-secure encryptions, as we prove in Sec. 6.1. However, it can achieve storage efficiency within the SVN scenario when used as follows:

• Assign random keys $K_R, K_O, K_I$ to every file; change them only on access rights changes.

As long as appearance counters are not affected, using the same $K_O$ for multiple file revisions results in a deterministic, context-sensitive chunking, which achieves identical chunks within unchanged parts between those revisions, even if the positions of those unchanged parts change between revisions [...]

[...] $H$ with SHA-256; queries prefixed by a key are performed using SHA-256-HMAC (with the key used as key instead of a prefix). The pseudorandom permutation $F$ is instantiated with AES-256-CBC [5], with the IVs being chosen as follows to achieve deterministic encryption:

For encryption of the chunk content, we set IV = 0. While this is considered to be insecure in general use cases, our content-based key generation guarantees that we never encrypt different data using the same key. This makes the zero vector unique in the scope of the used encryption key.

For encryption of the chunk key, the situation is not that simple: Since the same key $K_R$ is used to encrypt different chunk keys, we cannot use a constant IV here. We solve this by using the first block of the ciphertext of a chunk as IV for the encryption of its corresponding chunk key⁵. Since no two chunks have the same key, these ciphertext blocks differ for different chunks, resulting in a unique IV. Effectively, this makes the first block of the chunk key encryption and the second block of the chunk content encryption depend on the same IV, but since different keys are used for these encryptions, the uniqueness property is retained.

⁵ This also saves us the need for storing the IV separately.

6. SECURITY ANALYSIS

Our encryption scheme's security guarantees depend on its usage: We achieve IND-CCA-secure encryption assuming the existence of a PRP and a random oracle when random values $K_O$ are used for each encryption application⁶. Otherwise, some controlled amount of information is leaked to allow data deduplication. We analyze both variants' security guarantees (regarding attackers that neither know $K_R$, $K_O$ nor $K_I$) separately in this section. The security analysis follows the definitions and proofs given in [13].

⁶ This always applies to files with only a single revision.

6.1 IND-CCA with random $K_O$

To formalize the usage scenario where a random value $K_O$ is used for each encryption application, we consider a slightly modified version $\Pi^*$ of our encryption and authentication scheme $\Pi$, which replaces the encryption part $\widehat{\Pi}$ of the scheme with $\widetilde{\Pi} = (\widetilde{\mathrm{Gen}}, \widetilde{\mathrm{Enc}}, \widetilde{\mathrm{Dec}})$:

• $\widetilde{\mathrm{Gen}}(n)$ invokes $(K_R, K_O) \leftarrow \widehat{\mathrm{Gen}}(n)$ and returns $K_R$.
• $\widetilde{\mathrm{Enc}}_{K_R}(m)$ invokes $(\cdot, K_O) \leftarrow \widehat{\mathrm{Gen}}(n)$ to generate a random $K_O$ and returns $\widehat{\mathrm{Enc}}_{(K_R, K_O)}(m)$. (The $\cdot$ denotes that the first component is ignored.)
• $\widetilde{\mathrm{Dec}}_{K_R}(c)$ simply invokes $m \leftarrow \widehat{\mathrm{Dec}}_{(K_R, \bot)}(c)$ and returns $m$ (note that $K_O$ is not used in decryption).

The only difference between $\Pi$ and $\Pi^*$ is that the latter does not include a static value $K_O$ in the key material, but chooses a random value $K_O$ during each encryption.

We want to show that $\widetilde{\Pi}$ and $\Pi^*$ are CPA-secure and CCA-secure, respectively. For this, we define some variants $\widetilde{\Pi}_i$ of the encryption scheme $\widetilde{\Pi}$. In addition to $F, H_R, H_S$, we let $\widetilde{\Pi}_i$ have access to $2^n$ different truly random permutations $f_0(\cdot), \ldots, f_{2^n - 1}(\cdot)$ (one truly random permutation corresponds to each key of the PRP $F$), and define it as follows:

[...]

⁷ This means the first $i$ queries during $\widetilde{\Pi}_i$'s whole lifetime.

[...] $> 0$, $\widetilde{\Pi}_i$ does not represent a correct encryption scheme, as $\widetilde{\mathrm{Enc}}_i$ can produce ciphertexts that $\widetilde{\mathrm{Dec}}_i$ cannot decrypt. We can ignore this since the $\widetilde{\Pi}_i$'s are only auxiliary constructions and our proof does not depend on the existence of $\widetilde{\mathrm{Dec}}_i$.

Theorem. If $H_R, H_S$ are random oracles, $\widetilde{\Pi}$ achieves IND-CPA-secure encryption.

Proof. Our first goal is to show that $\widetilde{\Pi}$ achieves IND-CPA-secure encryption if all PRP invocations are replaced by invocations of truly random functions. We denote this variant by $\widetilde{\Pi}_\infty$ to indicate that $i \to \infty$.

By [13, Definition 3.21], "a private-key encryption scheme $\widetilde{\Pi}_i$ is IND-CPA-secure if for all probabilistic polynomial-time adversaries $\mathcal{A}$ there exists a negligible function $\mathrm{negl}$ such that $\Pr[\mathrm{PrivK}^{\mathrm{cpa}}_{\mathcal{A}, \widetilde{\Pi}_i}(n) = 1] \le \frac{1}{2} + \mathrm{negl}(n)$, where the probability is taken over the random coins used by $\mathcal{A}$, as well as the random coins used in the experiment." The experiment $\mathrm{PrivK}^{\mathrm{cpa}}_{\mathcal{A}, \widetilde{\Pi}_i}(n)$ is defined as follows in [13, Section 3.5] (for convenience, we adapt the presentation to our notations):

1. A key $K$ is generated by running $\widetilde{\mathrm{Gen}}_i(1^n)$.
2. The adversary $\mathcal{A}$ is given input $1^n$ and oracle access to $\widetilde{\mathrm{Enc}}_{i,K}(\cdot)$, and outputs a pair of messages $m_0, m_1$ of the same length.
3. A random bit $b \leftarrow \{0,1\}$ is chosen, and then a ciphertext $c \leftarrow \widetilde{\mathrm{Enc}}_{i,K}(m_b)$ is computed and given to $\mathcal{A}$. We call $c$ the challenge ciphertext.
4. The adversary $\mathcal{A}$ continues to have oracle access to $\widetilde{\mathrm{Enc}}_{i,K}(\cdot)$, and outputs a bit $b'$.
5. The output of the experiment is defined to be 1 if $b' = b$, and 0 otherwise. (In case $\mathrm{PrivK}^{\mathrm{cpa}}_{\mathcal{A}, \widetilde{\Pi}_i}(n) = 1$, we say that $\mathcal{A}$ succeeded.)

Now let $\mathcal{A}$ be any adversary solving this experiment in polynomial time $q(n)$ (with $q(n)$ being the number of steps $\mathcal{A}$ may perform, including queries to $\widetilde{\mathrm{Enc}}_{i,K}$, $H_R$, $H_S$ and $f_0, \ldots, f_{2^n - 1}$). The intuition of our proof is as follows: $\mathcal{A}$ can only learn something about the challenge if he sees any of the (random) values output by $H_R, H_S, f_k$ during encryption of $m_b$ at any other point of time. We show that this happens with negligible probability.

Let $HQ$ denote the event that any query to $H_R$ or $H_S$ during encryption of $m_b$ in step 2 is also made at any other point of time during the experiment. Similarly, let $FQ$ denote the event that a specific query $f_k(x)$ during encryption of $m_b$ occurs at any other point of time. If all of those queries are unique (i.e. $\overline{HQ \lor FQ}$), $\mathcal{A}$ has never seen any of the values output by $H_R$ (cf. Eq. 1), $H_S$ (cf. Eq. 2) or any $f$ (cf. Eq. 3) that are used to produce the ciphertext of $m_b$. As $H_R, H_S$ are modelled using a random oracle and as $f_k$ produces truly random outputs, those values are independent from $m_0, m_1$ in $\mathcal{A}$'s view. Thus, $\mathcal{A}$'s chance to succeed is $\frac{1}{2}$:

$$\Pr\left[\mathrm{PrivK}^{\mathrm{cpa}}_{\mathcal{A}, \widetilde{\Pi}_\infty}(n) = 1 \;\middle|\; \overline{HQ \lor FQ}\right] = \frac{1}{2} \tag{5}$$

Repetitions of queries to $H_R, H_S$ within a single encryption are excluded due to the appearance counters. As each

e query to HR and HS includes the key KO, which is cho- PRP. As there is only a polynomial amount of invocations, sen uniformly at random during encryption, a repetition of we conclude that Π is also CPA-secure using only a PRP. some query to HR or HS requires that the same value KO Regard any polynomially bounded adversary with run- was either chosen by chance during another encryption ap- time q(n). Our goal is to show that Pr[PrivKA cpa (n) = e ,Π plication or that it was guessed by during his runtime. 1 cpa A A 1] + negl(n). We know that Pr[PrivK (n) = 1] The chance for this to happen within runtime q(n) is: ≤ 2 ,Π e ≤ 1 A ∞ q(n) 2 + negl(n). Due to his runtime q(n), can make at most Pr[HQ] (6) A e n ≤ 2 q(n) queries to EnciK , with each queried message having a maximum length of q(n). As the encryption operation We now analyze the probability of FQ under HQ (i.e. the performs at most one query to F and f , respectively, for case where all H ,H queries are unique). A repetition of a g k k R S every byte of its input, a maximum of 2q(n)2 queries to query to some f during encryption of m can occur either k b F /f may occur within the whole experiment. Within this during other encryptions or due to direct queries by . Both k k A runtime, Π and Π2q(n)2 behave equally, so we have: cases are analyzed separately in the following paragraphs: ∞ cpa cpa Enci queries the truly random permutations by fK (j) Pr[PrivK (n)=1] = Pr[PrivK (n)= 1] K ,Π ,Π e 2q(ne)2 and fj ( ) for every output j produced by HS (i.e. for each A A ∞ 1 chunkg key, cf. Eq. 3). Thus, if all outputs j of HS are unique, e + negl(n)e (9) all queries fK (j), fj ( ) are unique. 
As we assume HQ, no ≤ 2 Next we show that cannot distinguish the experiments repetition of a query to HS occurred, so a repetition of a PrivKcpa (n) and PrivKAcpa (n) with non-negligible prob- query to any fk during encryption requires that at least two ,Πi ,Πi 1 A A − different queries to HS produced equal outputs during the ability. We do this by defining a polynomial-time distin- e F ( ) e experiment. As has runtime q(n), he can only query the guisher D ′ as follows: A encryption of messages of length at most q(n). During en- n 1. Run (1 ) and simulate the oracle EnciK ( ) by exe- cryption of a message of length q(n), at most q(n) queries A 2 cuting the algorithm EnciK to answer ’s oracle que- are made to HS, resulting in a maximum number of q(n) ries—with one modification: When the—overall—gA i-th queries to H during the whole experiment. As H gener- S S query to a (pseudo-)randomg permutation (F or f) is ates outputs of length n bits, the collision probability of HS 2 made, use F ′ instead of fK (x). q(n)2 during the experiment is bounded by 2 2n . 2. When outputs messages m0,m1, choose b 0, 1 A ← { } A repetition of a query to fk might also occur due to an randomly and return EnciK (mb). explicit query by . However, if no repetition of a query to 3. Continue answering oracle queries by as in step 1. H occurred, thisA requires that correctly guesses at least A S 4. When outputs b′, outputg 1 if b′ = b, otherwise 0. A A one chunk key j for which a query fK (j) or fj ( ) is made by If D is instantiated with the PRP F ′, the only difference cpa EnciK during encryption of mb. As has runtime q(n) and between D and experiment PrivK (n) is that F ′(x) is A ,Πi 1 as there are at most q(n) chunk keys each having a length A − queried instead of Fj (x) at any point of time. As there might g q(n)2 e of n bits, the chance for this to happen is bounded by 2n . 
be other queries to Fj during the experiment, two situations We conclude: might lead to inconsistent behavior: Fj (x) might be queried q(n)22 q(n)2 q(n)4 + 2 q(n)2 Pr[FQ HQ] + = (7) at any other point of time during the experiment and pro- n n n+1 | ≤ 2 2 2 2 duce different output than F ′(x), or F ′(x) might produce an Combining Equations 5, 6 and 7, we get: output that is also output by Fj (y),y = x at any other point of time during the experiment. As the experiment allows at Pr[PrivKcpa (n)=1] 2 ,Π most 2q(n) invocations of Fj , the probability of the latter is A ∞ 2 cpa 2q(n) = Pr[PrivK (n) = 1 HQ FQ] Pr[HQ FQ] upper-bounded by 2n . Further, as all invocations Fj (x) ,Πe | ∨ ∨ A ∞ done by Enc include an n-bit number chosen uniformly +Pr[PrivKcpa (n) = 1 HQ FQ] Pr[HQ FQ] iK e,Π | ∨ ∨ A ∞ at random (cf. Equation 3), a repetition of the invocation cpa Pr[PrivK (n) = 1 HQ FQ]+ Pr[HQ FQ] Fj (x) requiresg that this n-bit value has been chosen again ≤ ,Π e | ∨ ∨ A ∞ by chance or has been guessed by . The probability for 1 A 2 + Pr[HQe]+ Pr[FQ] this to happen is upper-bounded by 2q(n) , too. ≤ 2 2n F ( ) 1 If neither of these situations occur, we can think of D ′ = + Pr[HQ]+ Pr[FQ HQ] Pr[HQ] +Pr[FQ HQ] Pr[HQ] 2 | | as using a slightly modified PRP F (for the whole experi- 1 1 2q(n) q(n)4 + 2q(n)2 ment) that is defined exactly like F , but switches two values + 2Pr[HQ] +Pr[FQ HQ] + + n n+1 ≤ 2 | ≤ 2 2 2 so that its outputs are consistent toe F ′: Fj (x)= F ′(x) and 1 q(n)4 + 2 q(n)2 + 4 q(n) 1 1 = + + negl(n) (8) Fj (Fj− (F ′(x))) = Fj (x). As F is a PRP, the view of 2 2n+1 ≤ 2 when run by D is distributed identicallye to the view of A cpa A ine experiment PrivK (n).e As the probability for this This proves that Π is IND-CPA-secure. ,Πi 1 ∞ A − 4q(n)2 situation is at least 1 e n and D outputs 1 whenever Theorem. If F is a pseudorandom permutation and HR,HS 2 e succeeds in the experiment,− we know that: A are random oracles, Π achieves CPA-secure encryption. 
2 F ( ) n 4q(n) cpa Proof. We already know that Π achieves CPA-secure Pr[D ′ (1 )= 1] 1 Pr[PrivK (n) = 1] (10) e ≥ − 2n ,Πi 1 encryption if only truly random permutations are used. We   A − will show that ’s advantage is negligiblee if only single invo- If D is instantiated with a truly random permutatione f ′, A the only difference between D and experiment PrivKcpa (n) cations of some truly random permutation are replaced by a ,Π A i e is that f ′(x) is queried instead of fj (x) at any point of time. This is a rather specific use case which prevents the ap- We can use the same arguments as before to show that the plication of classic security properties like ciphertext indis- view of when run by D is distributed identically to the tinguishability. However, we want to show that re-using KO view of A in experiment PrivKcpa (n) with probability at retains some basic notion of security. While we leak informa- A ,Πi 2 A tion about identical contents in different messages, we can least 1 4q(n) . Assuming that succeeds in any other 2n e prove that we do not leak any information about different case, we− can give an upper bound:A contents that are encrypted using the same value KO. f ( ) n Pr[D ′ (1 )=1] 6.2.1 Security Guarantees for Different Plaintexts 4q(n)2 4q(n)2 1 Pr[PrivKcpa (n)=1]+ (11) To formalize this, we introduce a restricted variant of n ,Π n cpa ≤ − 2 A i 2 the PrivK (n) experiment seen in the previous section:   ,Π e A As we know (by assumption) that F ′ is a PRP, we get: the chosen different plaintext attack (CDPA) experiment. b cdpad cpa F ( ) n f ( ) n PrivK (n) is defined exactly like PrivK (n), with the negl′(n) Pr[D ′ (1 )= 1] Pr[D ′ (1 )= 1] ,Π ,Π ≥ − A A 2 difference that is only allowed to ask for encryptions of 4q(n) cpa b A b 1 Pr[PrivK (n)=1] messages which do not have an overlapping fragment with ≥ − 2n ,Πi 1   A − more than d bytes size. 
Thus, all substrings occuring in more 2 2 4q(n) e 4q(n) 1 Pr[PrivKcpa (n)= 1] than one message (including both queries makes to EncK − − 2n ,Πi − 2n A   A as well as the messages m0,m1) must have a maximum size We can solve this to get: e of d bytes. If does not adhere to these constraints,d the output of the experimentA is defined to be 0. 4q(n)2 1 negl (n) negl (n)+ Analogous to the previous definition, we say an encryption ′′ ′ n 2 ≥ 2 1 4q(n) scheme Π is CDPA -secure if for all probabilistic polynomial-   − 2n d Pr[PrivKcpa (n)= 1] Pr[PrivKcpa (n) = 1] (12) time adversaries there exists a negligible function negl ≥ ,Πi 1 − ,Πi Acdpad 1 A − A such thatb Pr[PrivK (n) = 1] + negl(n). Again, the ,Π ≤ 2 Combining Equationse 9 and 12, we conclude:e probability is takenA over the random coins used by and the random coins usedb in the experiment. A Pr[PrivKcpa (n)=1] = Pr[PrivKcpa (n) =1] We emphasize that this is a significantly weaker security ,Π ,Π0 A A model than CPA security, since we cannot hope to show Pr[PrivKcpa (n)= 1] ,Πe e ≤ A 2q(n)2 strong security properties when intentionally leaking infor- e mation to an attacker. The security proof shall merely pro- + Pr[PrivKcpa (n) =1] Pr[PrivKcpa (n)= 1] ,Π0 − ,Π 2 vide an intuition that our scheme does not leak information A A 2q(n) in situations where we do not explicitly require it to do so. 1 e e + negl(n) ≤ 2 Theorem. If Π achieves CPA-secure encryption, Π achieves 2q(n)2 CDPAd-secure encryption for d = min w,l 1. + Pr[PrivKcpa (n)=1] Pr[PrivKcpa (n)= 1] { }− ,Πi 1 − ,Πi e b i=1 A − A Proof. To show this, we need to define another variant of X 1 2e 1 e our encryption scheme Π first. Let Πi =(Geni, Enci, Deci) + negl (n) + 2q(n) negl′′(n) + negl′′′(n) (13) ≤ 2 ≤ 2 be defined as follows: Geni = Gen, Decbi = Dec b d d d This proves that Π is CPA-secure. • Enci, on input key (KR, KO), invokes ( , K ) Geni(n) • O′ ← tod obtaind a randomd K d. Then it acts like Enc, but Theorem. 
If Π achievese IND-CPA-secure encryption, Π∗ O′ d d achieves IND-CCA-secure encryption. uses KO′ instead of KO for the—overall—first i queries e to HX , X R,S that it makes during its lifetime.d Proof. As Π∗ is an instantiation of [13, Construction ∈ { } 4.19] (see Sec. 5.1), we can apply [13, Theorem 4.20], whose Now regard the sequence of encryption schemes Π0, Π1,... Surely, Π behaves exactly like Π, so we have: proof can be found in [13, Chapter 4.8]: “If ΠE is a CPA- 0 b b secure private-key encryption scheme and ΠM is a secure Pr[PrivKcdpad (n)=1] = Pr[PrivKcdpad (n) = 1] (14) ,Π ,Π message authentication code with unique tags, then Con- b A 0 b A struction 4.19 is a CCA-secure private-key encryption sche- We can furtherb see that—up to the bi-th oracle query— me.” Since Π is CPA-secure and ΠM is a secure message au- Enci behaves exactly like Enc, since it uses a fresh, ran- thentication code with unique tags, Π∗ is CCA-secure. domly generated key KO′ for each encryption application e insteadd of the supplied valuegKO. K 6.2 Security Guarantees with fixed O Now let q(n) be the maximum runtime of in experi- A If KO is re-used for multiple encryptions (in our system, ment PrivKcdpad (n). In each step, can make at most one ,Π this can be the case for different revisions of the same file), A A we intentionally leak some information to allow data dedu- query to the encryptione oracle, so Enc can be executed at plication. We allow a possible attacker to see relations be- most q(n) times during that experiment. During encryp- tween different messages that are encrypted with the same tion of a message m, Enc makes atg most m queries to HR | | KO, so he might recognize the positions and the size of plain- and at most m queries to HS, respectively. As—due to | | text fragments that are identical between different messages. 
its runtime— cannotg generate plaintexts of length greater This shall even be true if equal plaintext fragments appear than q(n), eachA encryption query results in at most 2q(n) at different positions within different messages. queries to HX , X R,S , resulting in a maximum total ∈ { } 2 number of 2q(n) queries to HX during the whole experi- Combining both theorems, we have proven that Π achieves ment. As Enc and Enc are identical up to the i-th query CDPAmin w,l 1-secure encryption. Intuitively, this means, i { }− to HX , this implies: that an arbitrary amount of plaintexts may be encrypted using the same (KR, KO, KI ) key triple without leaking any dcdpa g cdpa Pr[PrivK d (n)=1] = Pr[PrivK d (n) = 1] (15) information to an attacker, as long as none of those plain- ,Π 2 ,Π A 2q(n) A texts share a substring of length min w,l bytes. We show thatb an adversary cannot distinguishe the experi- { } ments PrivKcdpad (n) and PrivKcdpad (n) with non-negligible ,Πi ,Πi 1 6.2.2 Security Guarantees for Similar Plaintexts probability:A The only differenceA between− both experiments b b While the model presented in the previous section pro- is that HX (K ,x,y) is queried instead of HX (KO,x,y) at O′ vides strong security guarantees under certain conditions, any point of time within the former experiment. we have to intentionally leak some information to achieve Per definition, can only succeed if all queries made to A storage efficiency. In other words, the conditions required the encryption oracle during the experiment are different in for the proof cannot always be met. We therefore describe the sense that no min w,l -byte window content occurs in { } what information an attacker can gain in the general case. more than one query. Thus, different success probabilities It is easy to see that identical plaintexts are always en- for those experiments imply that adheres to those con- A crypted to the same ciphertext (if KR, KO and KI are un- straints. 
As w-byte window contents are used as input to changed), so in this case, an attacker can see nothing more HR by the encryption scheme (see Equation 1) and repeti- than how often a specific plaintext was encrypted. tions of the same w-byte window within a single encryption The encryption of different, but similar plaintexts (that application result in different inputs to HR due to the ap- share a substring of length at least min w,l bytes) using pearance counter, this implies that all inputs to HR during the same key is the most common use{ case.} We analyze the whole experiment are different. Similarly, as the encryp- what information is leaked in this case to provide an intu- tion scheme does not generate chunks smaller than l bytes, ition that our design decisions are reasonable. The analysis all inputs to HS (see Equation 2) are unique, too. is performed in two steps: First we focus on security im- If did not query any of the values HX (K ,x,y) or A O′ plications of our chunk encryption scheme, later we analyze HX (KO,x,y) directly to the oracle, both values are—due to the chunking procedure. the random oracle—independent from both all inputs and all other outputs of the random oracle, so cannot distinguish between those values. Different successA probabilities in both Security Implications of Chunk Encryption experiments thus require that did query any of those val- Imagine all chunk boundaries have been determined com- ues within the experiment. AsA this requires a correct guess pletely randomly. Let m1,...,mM be the list of all messages of either KO or KO′ within ’s runtime, the probability for that have been encrypted with Π using the same key and let 2qA(n) 1 M j this to happen is at most 2n : c ,...,c be their respective ciphertexts. Let further mi j j 2q(n) denote the i-th chunk of messageb m and let c denote its Pr[PrivKcdpad (n)= 1] Pr[PrivKcdpad (n)= 1] (16) i ,Πi − ,Πi 1 ≤ 2n encrypted representation. 
Clearly, our deterministic encryp- A A − 8 tion mechanism leaks information about equality of chunks . Combiningb Equations 14, 15 andb 16, we get: But what information is leaked beyond equality? Pr[PrivKcdpad (n) =1] Pr[PrivKcdpad (n) =1] To analyze this, regard the sequence of all different chunks ,Π ,Π A − A m1′ ,...,mr′ that have ever been encrypted using the same 3 2q(n) q(n) key and let C′ denote the maximum number of occurrences e 2q(n)2b = (17) i n n 2 ≤ 2 2 − of mi′ within any single message. From those chunks, we can r C′ cdpad cpa i The only difference between PrivK (n) and PrivK (n) build a plaintext m = i=1 j=1 mi′ . If we apply a modi- ,Π ,Π A A is that is less restricted in the latter. We conclude: fied version of Π, that achieves exactly that chunking, to m, A e e f f 3 cdpa cdpa q(n) we get a ciphertext c that contains all encrypted chunks that Pr[PrivK d (n) =1] Pr[PrivK d (n)=1]+ 1 M 1 M ,Π ,Π n 2 are contained ine c ,...,c . Thus, c ,...,c can be recon- A ≤ A 2 − q(n)3 1 q(n)3 structed from c just using knowledge about which substrings Pr[PrivKb cpa (n)=1]+ e + negl(n)+ ,Π n 2 n 2 have to be concatenated in which order, so all non-positional ≤ A 2 − ≤ 2 2 − 1 M 1 information leaked by c ,...,c is leaked by c, too. e + negl′(n) (18) ≤ 2 In this setup, however, we have only a single application of our encryption scheme that uses a random value K , which with the last inequality being true because q(n) is polyno- O is CPA-secure as shown in Section 6.1. We conclude that mially bounded. This shows that Π is CDPAd-secure. chunk encryptions do not leak information beyond positions and sizes of contents. Theorem. If Π achieves CDPAd-secureb encryption, Π also If an attacker knows KO (e.g. because he has read access achieves CDPAd-secure encryption. to another revision that uses the same KO), he can encrypt b Proof. 
The only difference between Π, which we have plaintexts (and also determine chunk boundaries) himself, allowing a verification of guessed plaintexts. Therefore, the proven to be CDPAd-secure, and Π is that the latter appends default behaviour of our implementation is to change K on a MAC tag generated by ΠM to each ciphertext.b As the O calculation of this tag does only depend on the ciphertext access rights changes. and an independent key, its presence surely cannot reduce 8 the security properties provided by Π. We conclude that Π This applies across revisions of the same file with equal KO, is CDPAd-secure, too. not within one revision due to the appearance counters. b Security Implications of Chunking and at the client side. Clients not supporting our solution We have already seen that an encrypted chunk for itself would just be unable to access confidential files—just like does not leak any information about its underlying plain- new SVN clients that are not granted rights on those files. text. However, due to the context-sensitive chunking proce- If the server runs an old version, only server-side checks like dure, we can think of each chunk as being annotated with write access control would be disabled, so other clients would some information about its plaintext visible to an adversary. be able to delete confidential files. However, such files could To see which information is leaked, regard an arbitrary but still be restored due to the version history provided by SVN. fixed chunk ci. From the chunking mechanism, an attacker gets to know the following facts: 8. CONCLUSION AND OUTLOOK If ci occurs within any message at any position other We have presented a security solution for the version con- • than its beginning, the attacker knows that the first trol system SVN, which enables secure and storage-efficient w bytes of the chunk fulfill Condition 1. 
versioning of documents—even in case of a malicious repos- The attacker knows that no w-bytes substring within itory server administrator. The main restrictions of the sys- • the last ci l bytes of ci fulfill Condition 1 within any tem are the attacker’s capability of seeing the positions of | |− yet encrypted message that contains ci. changes in a document, and of verifying whether chunks We emphasize that an adversary cannot evaluate Condi- from a previous version (to which the attacker had been tion 1 for any plaintext without knowledge about KO, so granted access) are still contained in a new version. Both this information does not obviously allow to draw conclu- attacks can be prevented by changing the file’s obfuscator, sions about the plaintext. However, when two revisions of but this comes at the cost of storage efficiency. Our imple- a file are encrypted using the same KO, limited structural mentation does not require changes to SVN’s storage back- information about that file might be leaked: Chunking may end, and is compatible with previous SVN servers (with the reveal the positions of changes more precisely than the en- exception of write access control) and clients (though old cryption procedure itself, if a change affects a chunk bound- clients cannot access encrypted files). Authentication cur- ary. In addition, changed chunk boundaries might affect rently relies on passwords only, as we wanted to avoid the chunk appearance counters and thus the encryption of later administrative overhead of a PKI. For corporate environ- equal chunks, so that those chunk ciphertexts might appear ments, certificate-based authentication may still be a viable to move between two revisions. However, both potential is- alternative, which we aim at supporting in the future. sues are in line with our security claims, which allow reveal- ing the positions and extent of changes if KO is unchanged. 9. ACKNOWLEDGMENTS We thank Gennadij Liske for his valuable comments on 7. 
SYSTEM IMPLEMENTATION the proof (Sec. 6). We further thank Ronald Petrlic, Se- We have implemented the full concept, as described in bastian Seitz and the anonymous reviewers for their helpful the previous sections, into the SVN library source code9, comments. The work is funded by the German Research extended the SVN command line application, and evaluated Foundation (DFG) under GRK 1479. the performance of our implementation. We wanted to retain full compatibility to other SVN ver- sions to enable a quick deployment of our extension. We 10. REFERENCES have therefore realized nearly all parts of our solution on [1] M. Bellare and P. Rogaway. Random oracles are the client side—namely the working copy library—of the practical: a paradigm for designing efficient protocols. SVN architecture without introducing any new data struc- In Proceedings of CCS ’93, pages 62–73. ACM, 1993. tures that would have to be handled by the repository. For [2] B. H. Bloom. Space/time trade-offs in hash coding this, we store all data in so-called properties—a versioned with allowable errors. Commun. ACM, 13(7):422–426, file-related meta-data mechanism provided by the SVN ar- July 1970. chitecture. All actions provided by our solution (e.g. en- [3] CollabNet, Inc. Skip-Deltas in Subversion. cryption/decryption of files, permission administration) are http://svn.apache.org/repos/asf/subversion/tru designed to only affect a file’s representation or its proper- nk/notes/skip-deltas, Nov. 2005. ties within the user’s working copy, while the synchroniza- [4] CollabNet, Inc. The Subversion protocol. tion between a working copy and the repository (e.g. com- http://svn.apache.org/repos/asf/subversion/tru mit/update operations) stays unchanged. nk/subversion/libsvn_ra_svn/protocol, Sept. 2011. Key management is implemented as follows: All keys are [5] J. Daemen and V. Rijmen. AES Proposal: Rijndael. generated randomly when a confidential file is created or ac- 1999. cess rights are changed. 
To allow data deduplication, we do [6] W. Diffie and M. Hellman. New directions in not automatically change a file’s KO when new revisions are cryptography. IEEE Transactions on Information generated. A simple command line option allows to achieve Theory, 22(6):644–654, 1976. stronger security guarantees by explicitly changing KO. [7] L. Dusseault. HTTP Extensions for Web Distributed Regarding compatibility, we support arbitrary combina- Authoring and Versioning. RFC 4918, 2007. tions between old/new server (repository) versions and old/ [8] D. E. Eastlake 3rd and T. Hansen. US Secure Hash new client versions. If a client supports our extension, it Algorithms (SHA and SHA-based HMAC and can securely use our access control solution on some files no HKDF). RFC 6234, 2011. matter what other SVN versions are involved at the server [9] K. Fu, S. Kamara, and T. Kohno. Key regression: 9We extended revision 1152561 of the repository’s (ht Enabling efficient key distribution for secure tps://svn.apache.org/repos/asf/subversion/trunk) distributed storage. In Proceedings of NDSS, 2006. [10] E. Goh, H. Shacham, N. Modadugu, and D. Boneh. terval [32, 4096], which resulted in an average deviation of Sirius: Securing remote untrusted storage. In 0.07 percentage points and a maximum deviation of 1.24 Proceedings of NDSS, 2003. percentage points from the expected relative overhead. As [11] D. Grolimund, L. Meisser, S. Schmid, and our security guarantees depend on min w,l (see Sec. 6.2), { } R. Wattenhofer. Cryptree: A folder tree structure for we set l = w, which in addition guarantees a maximum stor- cryptographic file systems. In 25th IEEE Symposium age overhead of z + 32 bytes. With S 256, we produce an ≥ on Reliable Distributed Systems (SRDS), pages overhead of < 20%, which we consider an acceptable over- 189–198, 2006. head that we expect to be compensated by the savings due [12] M. Kallahalla, E. Riedel, R. Swaminathan, Q. 
Wang, to difference-based efficient storage of multiple file revisions. and K. Fu. Plutus: Scalable secure file sharing on untrusted storage. In Proceedings of the 2nd USENIX A.2 Storage Efficiency Conference on File and Storage Technologies, pages While we have shown that an appropriate parameter choice 29–42, 2003. allows storing single file versions with little overhead, our [13] J. Katz and Y. Lindell. Introduction to Modern main goal is the efficient storage of whole repositories con- Cryptography: Principles and Protocols. Chapman & taining multiple (and similar) revisions of several files. Our Hall/CRC, 2008. system requires storage overhead at several locations. The [14] A. Muthitacharoen, B. Chen, and D. Mazi`eres. A main overhead—as described before—is caused by the meta- low-bandwidth network file system. In Proceedings of data generated by our encryption scheme and depends on SOSP ’01, pages 174–187. ACM, 2001. the file’s size as well as the parameter value S. In addition, [15] C. Pilato, B. Collins-Sussman, and B. W. Fitzpatrick. since we store all access-control-relevant information (such Version Control with Subversion. O’Reilly, 2008. as cryptographic material) in the confidential file’s proper- [16] M. O. Rabin. Fingerprinting by Random Polynomials. ties, every confidential file requires some storage for these Technical Report TR-15-81, Department of Computer properties. The storage requirement for this can be quanti- Science, Harvard University, 1981. fied with about 1.5 KiB constant overhead per file and 2 KiB [17] F. Rashid, A. Miri, and I. Woungang. A secure data constant overhead per authorized user of this file. Note, deduplication framework for cloud environments. In however, that thanks to differential storage this is only gen- Tenth Annual International Conference on Privacy, erated once per access right change (at least once per file), Security and Trust (PST), 2012, pages 81–87, 2012. not once per revision. 
[18] M. W. Storer, K. Greenan, D. D. E. Long, and E. L. Miller. Secure Data Deduplication. In Proceedings of StorageSS ’08, pages 1–10. ACM, 2008.
[19] The Apache Software Foundation. http://subversion.apache.org/, Apr. 2012.
[20] Z. Wilcox-O’Hearn and B. Warner. Tahoe: the least-authority filesystem. In Proceedings of StorageSS ’08, pages 21–26. ACM, 2008.

APPENDIX

A. PERFORMANCE EVALUATION

In this section, we evaluate our solution’s performance regarding memory and time efficiency. We first discuss an appropriate choice of the parameter values and then analyze their impact on storage efficiency to show our savings in comparison to usual encryption schemes. Finally, we briefly evaluate our algorithm’s memory and time requirements.

A.1 Choice of Parameters

As mentioned in Section 5.1.1, our encryption scheme depends on the parameter values w (window size), l (minimum chunk size) and S (target chunk size). We set w = 48 bytes according to the evaluation results provided by Muthitacharoen et al. [14]. For a random file of size z bytes (and with parameter value l = 1 byte), we calculated an average encryption overhead of $\frac{43.5z}{S} + 32$ bytes for storing chunk metadata (length and key), padding and the integrity value. We confirmed that this formula also holds for non-random data by encrypting the 50 most popular ebooks (47.9 MiB in total) from Project Gutenberg¹⁰, each as a UTF-8-encoded text file, with randomly chosen keys with values of S in in-

¹⁰ http://www.gutenberg.org/ebooks/search.html/?sort_order=downloads, visited on 2012-03-13

Despite that, we also generate a small overhead when storing different file revisions: while for unencrypted files only the contents that actually changed (plus a negligible overhead) have to be stored, our solution requires storage for the full chunk(s) surrounding each change.

To evaluate the storage efficiency we achieve for similar file revisions, we first studied our algorithm’s performance in a kind of best-case scenario, namely a repository that contains the version history of a single file, with only small changes made between each of its revisions. We simulated this situation with the following experiment: First, we committed an empty, confidential file to a fresh SVN repository. Afterwards, we iterated the following sequence: We chose a position within that file uniformly at random and inserted a random number of random bytes there (the number being chosen uniformly at random from the interval [64, 192], i.e. 128 bytes on average). The resulting extended file version was then committed to the repository, so the file and its version history grew with each iteration.

Using the generated repository, we compared our encryption scheme to other solutions: For this purpose, we generated several fresh repositories and re-enacted the previously described repository’s version history for each of them with individual configurations. In the first configuration, we achieved confidentiality by encrypting each file revision using a traditional encryption scheme (AES-CBC with a fixed key, but randomly chosen IVs); in the remaining ones, we used our access control solution with different parameter values S.

The results of this experiment are shown in Fig. 1. Each line represents the development of the total storage requirement of a specific repository when storing the first i revisions. The black line (+ markers) shows the unencrypted repository, which unsurprisingly has the least storage consumption. The violet line (× markers) shows the repository whose content is encrypted with a traditional scheme. As expected, its storage consumption rises rapidly with an increasing number of revisions due to the need for storing the whole ciphertext of each file version. While the storage efficiency of our encryption scheme varies for different values of S, its savings are significant for every value. S = 256 yields the best performance and requires about twice as much storage as the unencrypted configuration. This is in line with our expectation that small (e.g. about-128-byte) changes result in about S bytes of difference in ciphertext on average.

Figure 1: Development of repository size when a versioned confidential file is extended systematically

These results could suggest that a lower S results in lower storage consumption in general, so we repeated the experiment with changes of average length 4 096 bytes instead of 128 bytes between revisions. In this setup, our solution produced less overhead and our savings compared to traditional encryption were more significant. However, savings through small chunk sizes (i.e. small differences between ciphertexts) were outweighed by overhead for storing chunk metadata. S = 512 yields the best results in that experiment.

While these results show that our encryption scheme allows for efficient storage in a hypothetical best-case scenario, we still have to verify whether they carry over into practice. In fact, we identified two problematic situations for our access control solution. The first is a repository consisting of a huge number of very small files: Since we have to store metadata for each of these files separately, our system would generate a lot of overhead, which might not be compensated by further savings if the file sizes are not considerably greater than our chunk sizes. The second case is a repository consisting of arbitrary files which do not contain any change history (i.e. each file has only a single revision) or only changes that affect whole file contents. In this case, our encryption scheme would generate chunking-related metadata overhead that is not compensated by space savings due to similar file revisions.

To verify whether these drawbacks are of practical relevance, we evaluated our solution using real-life data. For this, we re-enacted the version history of the trunk of the open source project ispCP¹¹, which consists of many small (≈ 10 KiB) source code files as well as rarely-changed files such as pictures. Thus, this example combines elements from both problematic scenarios discussed above. Analogously to the previous experiments, the results are shown in Fig. 2. Since the repository starts with a kind of worst-case situation (at the beginning, 2 768 small (≈ 10 KiB) files are added), the storage efficiency of our solution initially seems to be worse than that of traditional encryption. With an increasing number of revisions, however, our solution’s savings compared to traditional encryption become significant again.

Figure 2: Development of ispCP’s repository size

¹¹ http://isp-control.net; repository: http://www.isp-control.net:800/ispcp_svn/trunk, requested on 2012-03-26

A.3 Memory and Time Requirements

We have seen that our access control solution allows efficient storage of encrypted repositories. We now consider the amount of memory and time needed to en-/decrypt confidential files. Besides the specific implementation, there are conceptual aspects critical to memory / time consumption. A trivial implementation has two main drawbacks:

1. Encryption consumes memory in the order of about 48 times the file size as it has to count the appearances of each 48-byte window content (see Sec. 5.1.1).
2. As HMAC computations are time-consuming, computing a rolling HMAC (one computation for every byte of the file size) significantly slows down encryption.

To find a suitable trade-off between memory / time efficiency and security, we evaluated 4 variants of our algorithm:

• HMAC, no repetitions: This is a trivial reference implementation of our concept, which implements appearance counters using hash tables. This version is memory-consuming, but adheres strictly to the description in Sec. 5.
• HMAC, no repetitions, bloom filter: This variant is similar to the first, but implements the rolling hash appearance counter using a bloom filter [2], achieving memory consumption of about 3 times the file size by allowing false positives (i.e. repetitions could be detected by mistake). As false negatives are excluded, this only affects storage efficiency, not security (rolling hash values might change, but repetitions are still excluded).
• HMAC: This variant ignores repetitions of rolling hash values, so periodical file contents could lead to periodical chunk boundaries visible in ciphertexts. This has a slightly negative effect on the security properties when applied to files with repeating contents, but, if this limitation is acceptable, significantly reduces memory consumption.
• Rabin fingerprints: By using an efficient rolling hash function (i.e. Rabin fingerprints [16]) for rolling hash computation (and still ignoring repetitions), encryption is sped up dramatically. However, while we expect its impact on security to be negligible, we do not recommend this variant as it might leak information about chunk boundaries.

The experiments, performed on an Intel(R) Core(TM) i5-2500K machine with 16 GiB RAM, confirmed our expectations: With 3.3 seconds for encrypting a 16 MiB file, the Rabin fingerprint variant performs about 10 times faster than the corresponding HMAC variant, but leads to security drawbacks. The significant decrease (factor 10) of memory consumption achieved by the bloom filter comes at the expense of computing time (factor 2). When considering only the two provably secure variants, we consider the bloom filter variant the best trade-off. Thus, this variant is our default and has been used for the other experiments.
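The repetition detection behind the bloom filter variant can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this appendix. The names BloomFilter and repeated_window_offsets, the salted SHA-256 hash family, the choice of 4 hash functions and the sizing of 24 bits per input byte (picked only to mirror the "about 3 times the file size" figure) are assumptions made for illustration. For simplicity, the sketch tracks raw 48-byte window contents rather than keyed rolling hash values.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: may yield false positives, never false negatives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Hypothetical hash family: SHA-256 salted with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True only if every bit position derived from the item is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def repeated_window_offsets(data: bytes, window: int = 48,
                            bits_per_byte: int = 24) -> list:
    """Return offsets whose window content was (probably) seen earlier."""
    # Assumed sizing: 24 bits (3 bytes) of filter per input byte.
    seen = BloomFilter(num_bits=max(8, bits_per_byte * len(data)), num_hashes=4)
    repeats = []
    for offset in range(len(data) - window + 1):
        content = data[offset:offset + window]
        if content in seen:
            repeats.append(offset)  # true repetition or false positive
        else:
            seen.add(content)
    return repeats
```

Since a lookup can never miss an item that was added, every true repetition of a window is reported. A false positive only adds an extra report, which matches the observation above that the filter costs storage efficiency but does not let a repetition slip through.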