A Privacy Preserving ML System

Alexander Wu, UC Berkeley
Hank O'Brien, UC Berkeley

Abstract

It is generally agreed that machine learning models may contain intellectual property which should not be shared with users, while at the same time there should be mechanisms in place to prevent the abuse of sensitive user data. We propose a machine learning inference system which provides end-to-end preservation of privacy for both the machine learning model developer and the user. Our system aims to minimize its constraints on the expressiveness and accuracy of machine learning models. It achieves this by utilizing trusted computation, with a trust-performance trade off which extends to a cryptographic proof that data is not tampered with.

1 Introduction

Deep learning inference often requires the use of sensitive user data. Because computation must be performed on this data, traditional cryptographic techniques used to protect data at rest cannot be easily applied to machine learning inference. The model weights, architecture, and code involved in inference are also often sensitive intellectual property which is not suitable for distribution to client devices. This creates a natural conflict of interest: application developers do not want to share their models, and users do not want to share their data. Traditional solutions to the problem of privacy preserving machine learning typically fall into 3 categories:

1. Cryptographic: Cryptographic approaches to privacy preserving machine learning typically leverage homomorphic cryptography. Homomorphic encryption encrypts the user's data x with a key k while preserving the structure of certain operations f [20]:

   Dk(fθ(Ek(x)))

   Here Ek is typically a symmetric homomorphic cipher. The user data x is encrypted on a client machine. In the context of deep learning, these schemes typically only support addition and multiplication, thus fθ is typically limited to models with activations that can be approximated by polynomials [12][3]. Once the data is encrypted, the ciphertext can be sent to a remote server, where fθ(Ek(x)) is calculated. This calculation is typically computationally expensive, with latencies greater than 1 second, even with hardware acceleration [3].

2. Differential Privacy: In differential privacy, a random perturbation is added to user data before it is used in a model. This perturbation is typically parameterized by ε and δ. In particular, a randomized algorithm M with domain N^|X| is (ε, δ)-differentially private if for all S ⊆ Range(M) and for all x, y ∈ N^|X| such that ||x − y||₁ ≤ 1:

   Pr[M(x) ∈ S] ≤ exp(ε) Pr[M(y) ∈ S] + δ   [9]

   In the context of deep learning, the parameters of differential privacy methods greatly affect the accuracy of models and, in some cases, the architecture [1].

3. Hardware Enclaves: Hardware based approaches to privacy preserving machine learning typically rely upon secure enclaves such as Intel Software Guard Extensions (SGX). SGX utilizes software attestation to provide a cryptographic proof that a specific piece of software is running in a secure container on an otherwise untrusted machine. On modern chips, Intel SGX maintains 128MB of secure Processor Reserve Memory (PRM), of which 90MB is the Enclave Page Cache (EPC) [8]. Current Intel SGX implementations contain timing side channels due to speculative execution vulnerabilities [23]. In this work, we assume that a patch or a similar enclave could provide the same guarantees that SGX advertises, without changing the programming interface.

Notably, the 90MB EPC is smaller than many typical deep learning models. For example, a typical image classifier such as ResNet-50v2 [14] is 102MB. Applications with a larger resident set size require cryptographically secure demand paging mechanisms [11]. In order to improve enclave performance, works exist which distribute untrusted machine learning applications [15] and which sandbox untrusted models [16]. Other works improve the performance of model inference by using untrusted accelerators, maintaining integrity but trading privacy for performance [22].

Current works do not address how an application developer can go about using user data without requiring the user to give that information to the developer, in the same way that an application developer can use payment information to make a transaction without the user providing a credit card number, for example, to the developer.

2 Motivation

Previous works have focused on providing a mechanism for performing machine learning inference on data in a privacy preserving manner. These works have typically had to sacrifice performance and/or use non-standard cryptography to ensure privacy. Nearly all existing works we are aware of make the trust assumption that all code run on the end user's device is trusted.

To motivate our system, consider a typical web application in which a user uploads an image on which a service performs remote inference. There are real world instances of applications violating the assumption that code run on a user's device is trusted [6]. Such applications receive and upload user data in a non-privacy preserving way, perform inference, then save the data, which can then be potentially abused.

It is also generally accepted that not all code run on a user's device should be trusted. Web browsers and even operating systems make the assumption that code run on a user's device is not necessarily trustworthy. While we focus our motivation and work on the case of a web application running on a desktop browser, we believe the motivation is equally applicable in the context of mobile applications.

The existing machine learning inference trust model in web applications is roughly all-or-nothing. In order to utilize machine learning inference with a web application, a user must either reject using the application altogether, or trust it with all of the user's information.

While this naive model sounds reasonable, it is in opposition to a layered trust approach. Consider existing web applications such as an online marketplace. While a user may trust the site with some sensitive information, such as a name, address, etc., there are some degrees of sensitive information which users do not entrust most websites with. For example, while users may typically trust an online marketplace with their address, they typically do not trust the marketplace with payment information. Instead, users typically trust only a specific set of payment processors to handle their payment information.

Figure 1: The existing model. The application developer is untrusted by the user, yet the user must trust the untrusted code where the user is entering her data. The user must also trust that the computing environment chosen by the app developer is secure, but she has no way of validating this because she has no control over what computing environment the app developer has chosen. These two points are the essence of our motivation and work.

3 Design Goals

In order to enable applications to use user data without having direct access to it, we would like to design a system which meets the following goals:

1. Scalability: the proposed system should not limit the scale of web applications which seek to use it.

2. Flexibility: there is a clear trade off in current approaches between performance and privacy. Not all applications have the same privacy concerns/requirements. For some applications, running inference on a trusted infrastructure provider (e.g. AWS, GCP, Azure, etc.) may be a sufficient amount of trust. Other sensitive applications, such as mobile banking or healthcare, may have stricter requirements and may require cryptographic proof of execution on trusted, tamper-free hardware. Applications which have lower privacy requirements should be able to utilize higher performance techniques which may be less secure, and should not have to suffer the same performance degradation as applications which require high privacy.

3. Usability: application developers should not be restricted in how they design applications. In particular, they should not be restricted in the user interfaces they design. Using a privacy preserving system should also ideally require minimal changes to existing applications.

4. Robustness: a privacy preserving inference system should have minimal limitations on the types of models an application developer should be able to support. There may be a trade off between the types of models and the privacy mode, but it should be minimized. Robustness should also apply to model accuracy: privacy preservation should avoid degrading accuracy whenever possible.

4 Proposed System

4.1 Overview

Our proposed system fundamentally treats machine learning inference as a pure function in the mathematical sense. In particular, we make the assumption/constraint that machine learning inference should not generate any side effects. Any state it outputs must be explicit.

Our system consists of four main trust domains.

1. User/User Trusted Environment (UTE): the UTE is the code which the user trusts, and it is treated as an acceptable place to store user information.

2. Application/Application Developer: From the user's perspective, the application developer should be treated as a malicious actor. In particular, they have the ability to execute code on the user's machine in a sandboxed environment (javascript in the browser). The application developer may also send data to arbitrary servers.

3. Key Value Stores: Key value stores are untrusted components which are assumed to provide best effort availability. KV Stores provide a level of optimization and are also typically easy components to scale.

4. Temporary Trusted Containers: Temporary Trusted Containers (TTCs) are assumed to be provided by third party infrastructure providers. In a performance optimized case, TTCs may simply be a service operated by a cloud service provider which the user and developer both trust. Under more stringent privacy requirements, the TTC may provide cryptographic attestation that it is running software which the user and developer can audit.

4.2 UTE

The UTE consists of code which the user trusts. The UTE is implemented via browser iframes. Iframes provide a convenient trusted compute environment. Iframes are run in separate sandboxes and explicitly isolated from untrusted applications via browser isolation policy (e.g. same-origin policy, cookie policy, etc). Iframes are hosted on a different source than the application. For example, if

https://stanford.edu/application.html

is a web application, it would contain an iframe whose src points to a different origin (a widely trusted UTE provider).

The core logic of the UTE is located within a 1x1 pixel transparent iframe. The iframe serves no graphical purpose, but must be rendered to create an execution environment.

The UTE also consists of graphical components which are separate iframes. Because they are from a different host than the application, the application cannot eavesdrop on the interaction between the user and the UTE.

UTE graphical components are designed to follow the same usage pattern as regular html/javascript graphical components. For example, an application developer could implement a file-upload box with a UTE upload iframe (a sketch of this pattern appears at the end of this subsection). Notice that interaction follows the same pattern as regular HTML components (i.e. file upload is handled by an onselect handler). Like other HTML components, this can either be set graphically, or via code: elem.addEventListener(...). When a user drags and drops a file, the iframe stores the file in memory, then creates a resource descriptor before passing the event to the application via a dispatch mechanism. The application only receives an event containing the resource's descriptor, which it can use to reference, but not read, the data.

The application can then upload the data via publish() and initiate inference via serve(). Finally, the application can call render(result_descriptor), which will display the data in another iframe:

dispatch_to_render_frame("render",
    {message : {responseUID: responseUID } },
    onreturn);
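As a concrete illustration of the pattern above, the following is a minimal sketch of an application page embedding the UTE core and an upload component. The origins, element ids, and message shapes here are hypothetical placeholders, not the system's actual interface.

<!-- Application page (untrusted). The UTE iframes come from a separate,
     widely trusted origin, so the same-origin policy prevents the
     application from reading anything inside them. -->
<iframe src="https://ute.example.org/core.html"
        width="1" height="1" style="border:none"></iframe>
<iframe id="upload-box" src="https://ute.example.org/upload.html"></iframe>
<div id="outbox"></div>

<script>
  // The application never sees the file bytes; it only receives a
  // resource descriptor dispatched by the UTE upload iframe.
  window.addEventListener("message", (event) => {
    if (event.origin !== "https://ute.example.org") return; // ignore other senders
    if (event.data && event.data.type === "onselect") {
      const descriptor = event.data.dataDescriptor;
      // The descriptor can later be passed to publish(), serve(), and render().
      console.log("user selected data:", descriptor);
    }
  });
</script>

The application holds on to the descriptor and hands it to the publish/serve/render calls described above; it never handles the underlying data.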

Figure 2: Our System. In contrast to the existing model in Figure 1, the user does not need to trust untrusted code from the app developer when uploading their data, and the user also does not need to trust the computing environment not to leak data.

4.2.1 UTE Interface

From the UTE's perspective, render can simply modify the HTML element:

function render(arg){
    const responseUID = arg.message.responseUID;
    document.getElementById("outbox").innerHTML =
        localStorage[responseUID];
}

4.3 Application Developer

The application developer is able to run code, both within a local sandbox and on remote servers. Applications executed locally should be contained within the browser sandbox. The browser should be able to prevent applications from accessing sensitive data (such as by preventing file uploads). The user does not run any trusted code in the application's execution context. Similar to a system call or remote procedure call, we provide a set of javascript stubs for interacting with the UTE, rather than working with inter-iframe message passing primitives.
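A sketch of what such stubs could look like from the application's side follows. The names publish, serve, and render mirror the calls used above; uteFrame and the dispatch helper are assumptions, with dispatch corresponding to the message dispatch mechanism described in the next subsection.

// Hypothetical javascript stubs wrapping the inter-iframe message passing.
const uteFrame = document.getElementById("ute-core"); // assumed core UTE iframe

async function publish(dataDescriptor, keyId, kvStoreId) {
  // Ask the UTE to encrypt the data and place it in the chosen KV Store.
  return dispatch(uteFrame.contentWindow, "publish",
                  { dataDescriptor, keyId, kvStoreId });
}

async function serve(dataDescriptor, ttcUrl) {
  // The serve request goes to whichever TTC the app developer chooses and
  // returns a DataHandle for the (still encrypted) inference result.
  const response = await fetch(ttcUrl, {
    method: "POST",
    body: JSON.stringify({ dataDescriptor }),
  });
  return response.json();
}

async function render(resultDescriptor) {
  // Ask the UTE to download, decrypt, and display the result in its iframe.
  return dispatch(uteFrame.contentWindow, "render", { resultDescriptor });
}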

Figure 3: An example of what a user might see when using the system

4.4 Message Dispatch

The UTE and application communicate via the browser's postMessage() API and message events. The message passing system borrows heavily from interprocess communication ideas in microkernel architectures. In particular, message passing is asynchronous. Notably, unlike many microkernels, multiplexing and dispatching function calls is done in "userspace": the application and UTE are responsible for (optionally) maintaining the state of calls and for filtering/ignoring invalid calls. Call state is maintained as a random 32 bit integer, which can be discarded/reused when an asynchronous call finishes. When the UTE passes events to the application, the application developer can search the DOM and directly invoke UI events. This is implemented in the provided message passing interface. Message passing is low bandwidth, infrequent, and low latency on the scale of human interface times.
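A minimal sketch of this call-state bookkeeping on the application side is shown below; the dispatch name and message shape are assumptions rather than the system's actual wire format.

// Hypothetical "userspace" dispatcher built on postMessage.
const UTE_ORIGIN = "https://ute.example.org";   // assumed trusted UTE origin
const pendingCalls = new Map();                 // callId -> resolve callback

function dispatch(target, method, args) {
  return new Promise((resolve) => {
    // Call state is a random 32 bit integer, discarded when the call finishes.
    const callId = crypto.getRandomValues(new Uint32Array(1))[0];
    pendingCalls.set(callId, resolve);
    target.postMessage({ method, args, callId }, UTE_ORIGIN);
  });
}

window.addEventListener("message", (event) => {
  // Filter/ignore invalid calls: unexpected origin or unknown call id.
  if (event.origin !== UTE_ORIGIN) return;
  const resolve = pendingCalls.get(event.data && event.data.callId);
  if (!resolve) return;
  pendingCalls.delete(event.data.callId);
  resolve(event.data.result);
});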

4.5 KV Stores

Key Value Stores are assumed to be publicly accessible stores where encrypted models and user data are stored. Each piece of data is identified by a tuple (location, uid), where the location specifies which store to look in (i.e. S3, Azure, etc.) and the uid uniquely identifies the object within the store.

KV Stores also provide an important optimization for scalability. Scaling and maintaining fault tolerant, highly available distributed hash tables is a well understood problem with many options. Separating the data from the computation also provides for faster recovery and reuse of models and arguments. Providing an abstraction between the model and the inference device is also a valuable optimization for increasing inference performance. TTC instances can be spawned "near" replicas of the model or near clients, and similarly, copies of the model can be replicated near the client or TTC instances. This allows for caching performance and additional, longer term optimization.

4.6 TTC

The TTC provides a secure execution environment for a single serve request. In particular, when a serve request occurs, the request is processed by a TTC. In their current state, machine learning inference frameworks are not robust enough to efficiently operate and return properly formatted output. For example, an image classifier would need to read an image file, preprocess it, run inference, then post-process the output (e.g. the softmax output into text labels).

In order to provide robust model serving capabilities, a model must essentially be able to describe any arbitrary program. Our current TTC defines a model as a zip file containing a model.py file and any additional resources which might be needed. Since a model is defined as an arbitrary program, it is the responsibility of the TTC to ensure that models are purely functional and to provide security and isolation in the event that a model misbehaves.

Model output is also saved in the KV store. This allows the KV store to function as the only persistent datastore, whereas the TTCs can be truly temporary. This also ensures that it is easy to trace global state and to prevent inadvertent side effects.

The TTC is also responsible for partially providing flexible tradeoffs between validation, security, and performance. In the limit of maximizing performance, the TTC should create a container and begin executing the model, potentially with a GPU or extra accelerators if possible. In the case of remote attestation, the TTC must run within the secure enclave. Here, we can still treat a model as an arbitrary program, with the slight caveat that programs must be statically linked and compiled (as a restriction of SGX).

Property      System component
Scalability   KV Stores and serverless functions
Flexibility   TTCs have different capabilities
Usability     Iframe design follows standard HTML patterns; model inference written as a typical program
Robustness    Models written and executed as general purpose programs

Figure 4: A summary of design goals and system features

4.7 Rationale

Figure 4 summarizes how our proposed system meets its design goals. Our system achieves scalability by using well understood scalable components. We rely upon distributed key value stores and serverless functionality. These techniques are well understood, as is how to avoid bottlenecks. They provide the added benefit of being location independent, allowing for optimization in dynamically relocating components to minimize latency.

Flexibility is achieved by containing changes in privacy requirements to the TTC. The system design for scalability minimizes overhead and focuses on maintaining low latency. Thus, only changes to the TTC which increase privacy will decrease performance, with minimal additional overhead.

From an application developer's perspective our system is relatively simple to use. As was demonstrated in Section 4.2, the UTE interface is very similar to typical HTML graphical components. Not only should this be familiar to developers, but it should also allow for relative ease of integration into higher level frameworks such as React, Angular, or jQuery.
4.8 Timeline

See Figure 2 to follow this timeline.

1. The App Developer creates a new ML model and publishes it to one or more KV Stores.

   • The App Developer encrypts their model, following the protocol in Section 6, with the public keys for each TTC they trust with this model, and stores each copy in the KV Store under a different key.

2. The App Developer creates a new application and embeds a UTE inside of it.

   • In the case of a new website or web-app, this UTE would be an iframe where the src attribute points to a widely trusted UTE provider.

   • This and the preceding step can happen at any point in time. The remaining steps are part of the critical path for inference and hence they must be done only after the user has uploaded data.

3. The User then visits this site and uploads some user data to the UTE.

   • Because the data is uploaded straight to the UTE (inside an iframe), the app cannot see or modify this data.

4. The UTE then sends a message to the parent application informing it that there was data added. This message also includes a "data descriptor" which the application can use to refer to this piece of data uniquely in the future, without having control of the data itself.

5. The application then uses this handle to send a "publish" message to the UTE.

   • This publish message includes an identifier which tells the UTE with which key to encrypt the data.

   • It also includes an identifier corresponding to the KV Store in which to place the encrypted data.

6. The UTE receives the publish message, encrypts the data corresponding to the received DataDescriptor, and sends this blob to the specified key value store, keyed by the DataDescriptor. When this finishes, the UTE sends a message back to the application notifying it that the data is prepared to be served.

7. Once the app receives this message, it sends a serve request to whatever TTC the app developer chooses (which should be the same as in the publish message).

8. When the TTC receives this request, it issues a GET request to the KV Store where the model is stored, and another to the KV Store for the encrypted user data. (We emphasize that these need not be the same KV Store.)

9. The TTC decrypts the data and model, then performs inference with this model on this data. It then re-encrypts the inference result, computes a DataHandle for this result, stores the encrypted result in any KV Store, and returns the response DataHandle to the application (steps 8 and 9 are sketched after this list).

10. The Application then receives this response and sends it via message to the UTE, asking it to render the result.

11. The UTE downloads the inference result from the KV Store specified in the DataHandle, decrypts it, and renders it.

    • The UTE as written supports rendering inference data in a wide variety of formats, but the render message will have to be modified to let the application indicate to the UTE what format it should render in. To preserve privacy at this stage, we think it makes sense to have a white list of pre-approved formats (text, images, etc.), but this does present a trade off in the freedom of the application developer to render their inference result.
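As a rough illustration of steps 8 and 9, the TTC's handling of a single serve request might look like the sketch below. Every helper named here (kvGet, kvPut, decrypt, encrypt, runModel, newDataHandle) and the sessionKeys value are hypothetical stand-ins for the real KV-store client, the Signal-derived session keys, and the model runtime.

// Hypothetical TTC-side handling of one serve request (steps 8-9 above).
async function handleServe(request) {
  const { modelHandle, dataHandle } = request;

  // Step 8: fetch the encrypted model and the encrypted user data,
  // possibly from two different KV Stores.
  const encryptedModel = await kvGet(modelHandle.location, modelHandle.uid);
  const encryptedData  = await kvGet(dataHandle.location, dataHandle.uid);

  // Step 9: decrypt both, run inference, then re-encrypt the result.
  const model  = await decrypt(encryptedModel, sessionKeys);
  const input  = await decrypt(encryptedData, sessionKeys);
  const output = await runModel(model, input);

  const resultHandle = newDataHandle();   // fresh (location, uid) for the result
  await kvPut(resultHandle.location, resultHandle.uid,
              await encrypt(output, sessionKeys));

  // The response DataHandle goes back to the application, which forwards it
  // to the UTE so the UTE can download, decrypt, and render the result.
  return resultHandle;
}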

5 Experiments

5.1 Baseline

The currently accepted way to serve ML models with user data is to simply send the user data in a web request to a server under the control of the app developer, without any further uploading or encryption. We ran an identical test to this for our latency baseline.

5.2 End to end

In our end-to-end test, we first upload a 1.4KB image to a KV store (see the Cryptography section for more information about the encryption process). We then issue a serve request to our TTC and wait for it to complete. We used ResNet-50v2 [14] as our ML model, which is 90MB compressed (when it is sent to the TTC). The inference result is 34 bytes of text, which is not compressed before being stored in the KV store. We measured the following quantities:

1. the overall time from start to finish for one request

2. the time it took to upload the inference argument to the KV Store

3. the time taken to download the model onto the TTC

4. inference time

5. upload time for the inference result to the KV store

In the following graphs, the "misc" time represents things we did not directly measure; it was computed by subtracting items 2-5 above from the total time per request.

Figure 5: End-to-end timing; see Figure 6 for exact data

Figure 5 shows the different permutations of where we store our data. Recall that our design supports using a different key-value store for the model and for the user's data, so this chart shows the end-to-end latency of each of those permutations. We ran our TTC on a shared-use computer with the following specs:

• 1x Intel S2600WT board

• 2x Intel E5-2670 2.6GHz Haswell

• 256GB 2133MHz DDR4 RAM

• 8x Hitachi 4TB SAS 12Gb/s HDD

• 2x Intel DC S3510 Series 240GB SSD

• 1x Titan V (unused)

For our "Local" storage option, we used an in-memory key To highlight this further, Figure 5.2 is the same plot as Fig- value store located on the same machine to represent the ure 5.2 but without the sections for inference time or argument limiting case of the data being in the same datacenter as the upload. Note that the baseline here is 0, as the baseline is just TTC, and for the remote case we used Azure Blob inference and argument upload, so the height of the bar here storage, to represent data living offsite in another datacenter. is exactly the added latency of our system. We believe that We can see here that overall latency is dominated by two the most common case will be close to the "model local, argu- factors: actual inference (in yellow) and in the remote-model ment remote" case, as this corresponds to a frequently-served case, model-download (in orange). We anticipate inference model which is local to a datacenter but with newly created times getting over an order-of-magnitude faster when we use user-data (for instance, photos) being generated at the edge GPUs, which just makes the model-download times an even and stored in a variety of key-value stores, many of which more significant factor in our overall latency. will have a somewhat high latency to the TTC. Under these One key observation here is the distinction between the conditions, Figure 5.2 shows that we add only 0.83 seconds model-remote and model-local tests. When the model is of latency 2Because we didn’t upload to a KV store and instead "uploaded" the argument to the TTC in the body of the serve request, argument upload is actually part of misc for the baseline To explore this further, we synthetically added latency to 4 The baseline data was taken on a different day on the same computer, but both the GET and PUT operation in the KV Store to see this computer was experiencing moderate CPU congestion during our test, which we think is why the Local/Local test actually falls below the baseline, how this affected our system’s end-to-end latency, as seen in even accounting for error bars Figure8.

                     Model: Remote      Model: Remote      Model: Local       Model: Local
                     Argument: Remote   Argument: Local    Argument: Remote   Argument: Local    Baseline
Argument Upload      0.377              0.303              0.299              0.303              N/A (2)
Model Download       13.860             14.185             0.160              0.157              N/A
Argument Download    0.347              0.004              0.334              0.004              N/A
Inference            13.268             13.264             3.362              3.346              4.157
Result Upload        0.144              0.003              0.140              0.004              N/A
Misc                 0.205              0.204              0.196              0.200              0.354
Total                27.824             27.661             4.191              3.710              4.510
Std. Dev. (Total)    1.334              1.161              0.059              0.037              0.116

Figure 6: End-to-end timing data; each point is an average over 40 trials. All data are in seconds and all were taken on the same day with very low CPU congestion, except for the baseline (4).
(2) Because we didn't upload to a KV store and instead "uploaded" the argument to the TTC in the body of the serve request, argument upload is actually part of misc for the baseline.
(4) The baseline data was taken on a different day on the same computer, but that computer was experiencing moderate CPU congestion during our test, which we think is why the Local/Local test actually falls below the baseline, even accounting for error bars.

                     50ms added   100ms added   500ms added   1000ms added
Argument Upload      0.055        0.105         0.506         1.004
Model Download       0.153        0.105         0.155         0.149
Argument Download    0.054        0.104         0.505         1.005
Result Upload        0.053        0.104         0.504         1.005
Misc                 0.082        0.135         0.129         0.108
Total                0.397        0.554         1.799         3.272
Standard Deviation   0.067        0.052         0.173         0.120

Figure 9: Overall latency with synthetic latency added to the user-data KV Store

Figure 8: Non-inference response time compared to a variety of simulated added latencies on the user-data KV Store (data in Figure 9)

Note the X axis in this plot: although latencies here appear to grow at a superlinear rate, in actuality the non-inference response time is 3.272 seconds with 1000ms of added latency, a mere factor of 6 larger than the non-inference response time of 0.553 seconds with 100ms of added latency. This indicates that overall system latency is actually growing sublinearly as we consider KV Stores with higher and higher latencies (the added latency grew by a factor of 10), even when discounting the inference time. We chose to ignore inference time here because we wanted to highlight the scaling factor of our overhead, which this represents. Recall that in the baseline case we send the user data in the serve request itself, as there is no user-data KV Store, so by increasing the latency of the user-data KV store we are able to increase the latency of only those components which are unique to our system.

We also analyzed the number of lines of code it took us to make each component of this system. We would like to highlight, however, that this is likely a very low bound on how much it would take in practice, as we did not write much of the boilerplate necessary in a productionized application. This also ignores all files not tracked by git, and all logs and data files have also been excluded. The Cython/Signal implementation is not included.

Component    Lines of Code
Testing      524
KV Store     69
TTC          491
UTE          589
Other        404

Figure 10: Lines of Code

6 Cryptography

Cryptographic communication primarily relies upon the Signal Protocol, a well-established asynchronous double ratchet protocol. In particular, we rely heavily upon Signal's Extended Triple Diffie-Hellman key exchange (X3DH) [18]. X3DH uses 3 sets of keys, which can be adapted to achieve scale:

• Identity Key: The identity key corresponds to each TTC target. Note, this does not necessarily correspond to a single device. Instead, if multiple TTCs are behind a load balancer, they will share an Identity Key.

• Pre-Signed Key: The pre-signed key is periodically rotated. We use a single pre-signed key per identity key at any point in time.

• One-Time Pre Keys: One-time pre keys allow load balancing. Rather than a traditional load balancer assigning tasks to TTCs, TTCs are effectively assigned whichever requests are encrypted using their corresponding prekeys.

In the case of non-secure-enclave based TTCs, the public portion of the identity key can be distributed using traditional key signing/certificate authority techniques. For example, an infrastructure provider could sign the identity key with its SSL key, which is already part of a chain of trust. In the case of secure enclaves, the identity key should be distributed with an attestation proof that the key was generated by a program which anyone can verify securely manages the key. Note this program would be part of the TTC or sandbox, but not part of the application developer's model.

After a session is established, the TTC can use the session to publish an encrypted output of the model that can only be decrypted by the original client UTE.

7 Related Work

The topic of privacy preserving machine learning inference has been studied from several directions. Early theoretical works established the basis for homomorphic cryptography [20]. Modern implementations of fully homomorphic encryption (FHE) which build on these works typically have high performance overhead, in the seconds to minutes [3][12]. The cryptographic ciphers used in homomorphic encryption include YASHE' [5] and the scheme of "Somewhat Practical Fully Homomorphic Encryption" [10]. Other cryptographic approaches to privacy preserving machine learning include secure multi-party computation (SMPC), which preserves privacy. These typically rely upon a subset of Shamir's Secret Sharing [21], Garbled Circuits [25], and the Goldreich-Micali-Wigderson (GMW) protocol [13]. Like FHE, SMPC is computationally expensive; inference takes roughly the same order of magnitude of time [19]. These cryptographic approaches all face a similar computational bottleneck on deep learning tasks.

There are numerous differential privacy [9] based approaches to machine learning. Notable early works in differential privacy in machine learning explored the privacy-accuracy tradeoff involved in deep learning [1]. Other works explore the challenges of differential privacy in production systems, in particular data reuse, or the lack thereof [17].

Finally, CPU features such as Intel SGX [8] and ARM TrustZone [2] enable hardware-based approaches. Works such as Ryoan [16] provide process isolation for enclave code. Inference within the enclave is demonstrated in works like Chiron [15]. Riverbed uses remote attestation for Information Flow Control (IFC) to prevent information leakage, but uses relaxed security constraints and enforces less isolation [24].

We rely on established systems for encryption and model inference. In particular, our system is designed to use the Signal protocol for asynchronous key exchange [18]. Signal is widely used and is generally considered secure [7]. We rely on the ONNX Runtime for non-secure-enclave based model inference [4].

8 Future Work

1. Finish integrating Signal: We ran into many issues connecting the open-source libsignal to OpenSSH, but having this encryption here is very important to prevent data from being leaked to the KV stores.

2. Properly link ONNX Runtime with cuDNN: We had trouble linking ONNX Runtime (our execution environment) with the system drivers needed to use the hardware accelerator on the machine. This should be a very simple problem to solve, and could also be solved by switching to a different compute provider like AWS.

3. Add support for multiple-input-single-output ML models: consider a model which merges two faces; this must operate on multiple pieces of data. Our system design supports this, but we need to modify a handful of functions to finish this.

4. Right now, our system is designed to return the inference result to the party which issued the request. However, we think it should be possible to modify this protocol slightly such that the inference result can be encrypted with the keys for a separate user. We would need to add a mechanism to notify this other user, and we are concerned about how this would work with key rotation in the Signal protocol, so this feature might require that, in order for the inference result to be decrypted, the recipient can only be offline for some maximum length of time; more work is necessary here.

5. Create a different UTE for mobile and desktop applications. We designed this system to have a consistent interface between the UTE and the app developer, so porting this to alternate platforms should be easy from that perspective; when an application is moved to a new environment, our system should not get in the way of that (see our Usability goal).

However, because our system depends on isolation guarantees that the platform must provide to us, creating a new UTE will likely be a significant endeavour.

6. One common question is how a user can determine if they can trust what they think is a UTE on a website. While this generally reduces to a problem of phishing/social engineering, and so overall is out of scope of this paper, we propose a browser extension that maintains a list of widely trusted UTE origins. Then, when a user attempts to upload anything into a page whose origin does not match this list, it would send a browser alert warning the user that they might be sharing their data unsafely. This would at least inform the user, and they could hopefully make an informed decision.

   • This list would be maintained in much the same way as a browser's certificate revocation list, although here we have a default-deny strategy; because the user can override this deny, we think it's better to warn more frequently here.

7. There are numerous "real-world" usability improvements we plan to make as this system gets more mature. For instance, we should give the app developer more control over how their inference result gets rendered (iframe size, whether it is a photo or text, etc.).

8.1 Enclave Computing

Our demo has shown our results under the looser trust assumptions we described earlier, where both parties trust a large cloud computing company. However, we would like to get data on the other end of that spectrum, which is where one or both of these parties does not trust the company and we must use secure enclaves. This would mean modifying our TTC to support enclaves, but also, in a more macro sense, to our knowledge there are no hardware accelerators for ML models that provide the same privacy guarantees that secure enclaves do. For this reason, it is likely that more work on secure enclaves which provide good hardware acceleration is necessary before they are viable for widespread use in ML. However, if there were to be a performant enclave offered, our design would further generalize to support a TTC on the "Edge." This would offer latency advantages, as data may not have to travel very far (if the model is loaded onto the edge device) and this device would presumably be quite close to the issuer of the serve request. Furthermore, the trust issues of running compute on some arbitrary edge device would be solved by the enclave, as it would provide a mechanism with which to prove the user's data wasn't leaked.

8.2 Improving latency

As it stands, it's not ideal to increase each inference by 0.83 seconds, but it is acceptable in many cases. However, adding 30 seconds of latency when the model is not co-located is significantly less acceptable. We've identified a handful of areas in which we can address this:

1. Right now, we fetch the data and the model in serial. In the near term, we can change this to a parallel fetch, although as we can see in Section 5.2 the data fetch takes a minimal amount of time compared to the model fetch, so this offers only limited gains.

2. We haven't tested under these conditions yet, but as demand on a TTC increases, we anticipate that we will be under-utilizing the compute resources available by running the model on each input one at a time. We plan to implement a batching feature which will batch up requests to the same model in the same TTC if that TTC and model are experiencing significant load.

3. Add something like a CDN for models, which should ensure that models which are frequently used in a particular area get copied closer to that region, reducing model download latencies in the process.

4. We recognized that certain models could be re-used across many app developers, if these happened to be public models. (Note that while the trust assumption of the App Developer not trusting the user with their model may not apply, it still may be infeasible to serve the model on the user's device, perhaps for power consumption or available compute reasons.) In this case, TTC providers could advertise a list of "onsite" models which they guarantee to support with a certain SLA on the inference time, similar to how certain virtual machine images are advertised by cloud hosting providers.

5. Finally, we intend to integrate our system with the Global Data Plane (GDP), which will abstract away the notion of where our models are stored, and ideally will employ dynamic replication to place copies of frequently used models near high-activity TTCs, which would silently reduce the overall model-serving latency without any other party having to take any action.

   • GDP allows us to think of data storage as a service, and it lets us simply put and get data from each of our other nodes without having to worry about which KV store is more efficient for which particular TTC; at the same time it lets the user dictate where they will trust their data to be stored and where they won't. This meshes well with the trust model that underlies this system.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 308-318, New York, NY, USA, 2016. ACM.

[2] ARM. Security technology: building a secure system using TrustZone technology (white paper). ARM Limited, 2009.

[3] Ahmad Al Badawi, Jin Chao, Jie Lin, Chan Fook Mun, Sim Jun Jie, Benjamin Hong Meng Tan, Xiao Nan, Khin Mi Mi Aung, and Vijay Ramaseshan Chandrasekhar. The AlexNet moment for homomorphic encryption: HCNN, the first homomorphic CNN on encrypted data with GPUs. arXiv preprint arXiv:1811.00778, 2018.

[4] Junjie Bai, Fang Lu, Ke Zhang, et al. ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx, 2019.

[5] Joppe W. Bos, Kristin Lauter, Jake Loftus, and Michael Naehrig. Improved security for a ring-based fully homomorphic encryption scheme. In IMA International Conference on Cryptography and Coding, pages 45-64. Springer, 2013.

[6] Thomas Brewster. FaceApp: Is the Russian face-aging app a danger to your privacy?, Jul 2019.

[7] Katriel Cohn-Gordon, Cas Cremers, Benjamin Dowling, Luke Garratt, and Douglas Stebila. A formal security analysis of the Signal messaging protocol. In 2017 IEEE European Symposium on Security and Privacy (EuroS&P), pages 451-466. IEEE, 2017.

[8] Victor Costan and Srinivas Devadas. Intel SGX explained. IACR Cryptology ePrint Archive, 2016(086):1-118, 2016.

[9] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

[10] Junfeng Fan and Frederik Vercauteren. Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive, 2012:144, 2012.

[11] Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel, and Bryan Parno. Komodo: Using verification to disentangle secure-enclave hardware from software. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 287-305. ACM, 2017.

[12] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pages 201-210, 2016.

[13] Oded Goldreich, Silvio Micali, and Avi Wigderson. How to play any mental game. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pages 218-229. ACM, 1987.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630-645. Springer, 2016.

[15] Tyler Hunt, Congzheng Song, Reza Shokri, Vitaly Shmatikov, and Emmett Witchel. Chiron: Privacy-preserving machine learning as a service. arXiv preprint arXiv:1803.05961, 2018.

[16] Tyler Hunt, Zhiting Zhu, Yuanzhong Xu, Simon Peter, and Emmett Witchel. Ryoan: A distributed sandbox for untrusted computation on secret data. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 533-549, Savannah, GA, November 2016. USENIX Association.

[17] Mathias Lecuyer, Riley Spahn, Kiran Vodrahalli, Roxana Geambasu, and Daniel Hsu. Privacy accounting and quality control in the Sage differentially private ML platform. SIGOPS Oper. Syst. Rev., 53(1):75-84, July 2019.

[18] Moxie Marlinspike and Trevor Perrin. The X3DH key agreement protocol. Open Whisper Systems, 2016.

[19] M. Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M. Songhori, Thomas Schneider, and Farinaz Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Proceedings of the 2018 Asia Conference on Computer and Communications Security, pages 707-721. ACM, 2018.

[20] Ronald L. Rivest and Michael L. Dertouzos. On data banks and privacy homomorphisms. 1978.

[21] Adi Shamir. How to share a secret. Commun. ACM, 22(11):612-613, November 1979.

[22] Florian Tramer and Dan Boneh. Slalom: Fast, verifiable and private execution of neural networks in trusted hardware. arXiv preprint arXiv:1806.03287, 2018.

[23] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution. In 27th USENIX Security Symposium (USENIX Security 18), pages 991-1008, 2018.

[24] Frank Wang, Ronny Ko, and James Mickens. Riverbed: Enforcing user-defined privacy constraints in distributed web services.

[25] Andrew Chi-Chih Yao. How to generate and exchange secrets. In 27th Annual Symposium on Foundations of Computer Science (SFCS 1986), pages 162-167. IEEE, 1986.
