Providing Enhanced Functionality for Data Store Clients

Providing Enhanced Functionality for Data Store Clients Arun Iyengar IBM T.J. Watson Research Center Yorktown Heights, NY 10598 Email: [email protected] Abstract—Data stores are typically accessed using a client- Object Storage server paradigm wherein the client runs as part of an application Cloudant Server Server Cassandra Server process which is trying to access the data store. This paper OpenStack API presents the design and implementation of enhanced data store clients having the capability of caching data for reducing the latency for data accesses, encryption for providing confidentiality before sending data to the server, and compression for reducing Object the size of data sent to the server. We support multiple approaches Cassandra Cloudant Storage for caching data as well as multiple different types of caches. Client client client We also present a Universal Data Store Manager (UDSM) which allows an application to access multiple different data Fig. 1: Data stores are typically accessed using clients. This stores and provides a common interface to each data store. The work focuses on enhancing the functionality of clients. UDSM provides both synchronous and asynchronous interfaces to each data store that it supports. An asynchronous interface allows an application program to access a data store and continue execution before receiving a response from the data store. The because a data store lacks encryption features or data is not UDSM also can monitor the performance of different data stores. securely transmitted between the application and data store A workload generator allows users to easily determine and compare the performance of different data stores. server. For certain applications, encryption at the client is a ne- cessity as the data store provider simply cannot be trusted to be The paper examines the key design issues in developing both secure. There have been many serious data breaches in recent the enhanced data store clients and the UDSM. It also looks years in which confidential data from hundreds of millions at important issues in implementing client-side caching. The of people have been stolen. Some of the widely publicized enhanced data store clients and UDSM are used to determine data breaches have occurred at the US government, Yahoo!, the performance of different data stores and to quantify the performance gains that can be achieved via caching. Anthem, Democratic National Committee, eBay, Home Depot, JP Morgan Chase, Sony, and Target. I. INTRODUCTION Data stores are often implemented using a client-server paradigm in which a client associated with an application A broad range of data stores are currently available includ- program communicates with one or more servers using a ing SQL (relational) databases, NoSQL databases, caches, and protocol such as HTTP (Figure 1). Clients provide interfaces file systems. An increasing number of data stores are available for application programs to access servers. This paper focuses on the cloud and through open source software. There clearly is on providing multiple data store options, improving perfor- a need for software which can easily provide access to multiple mance, and providing data confidentiality by enhancing data data stores as well as compare their performance. One of the store clients. No changes are required to servers. That way, goals of this work is to address this need. our techniques can be used by a broad range of data stores. A second goal of this work is to improve data store Requiring changes to the server would entail significantly performance. Latencies for accessing data stores are often high. higher implementation costs and would seriously limit the Poor data store performance can present a critical bottleneck number of data stores our techniques could be applied to. for users. For cloud-based data stores where the data is stored at a location which is distant from the application, the added We present architectures and implementations of data store latency for communications between the data store server clients which provide enhanced functionality such as caching, and the applications further increases data store latencies [1], encryption, compression, asynchronous (nonblocking) inter- [2]. Techniques for improving data store performance such as faces, and performance monitoring. We also present a universal caching are thus highly desirable. A related issue is that there data store manager (UDSM) which gives application programs are benefits to keeping data sizes small; compression can be a access to a broad range of data store options along with the key component for achieving this. enhanced functionality for each data store. Caching, encryption, compression, and asynchronous (nonblocking) interfaces A third goal of this work is to provide data confidentiality are essential; users would benefit considerably if they become as it is critically important to many users and applications. standard features of data store clients. Unfortunately, that is Giving users the means to encrypt data may be essential either not the case today. Research in data stores often focuses on server features with inadequate attention being paid to client UDSM is designed to allow new clients for the same data features. store to replace older ones as the clients evolve over time. The use of caching at the client for reducing latency is A key feature of the UDSM is a common key-value particularly important when data stores are remote from the interface which is implemented for each data store supported applications accessing them. This is often the case when data by the UDSM. If the UDSM is used, the application program is being stored in the cloud. The network latency for accessing will have access to both the common key-value interface for data at geographically distant locations can be substantial [3]. each data store as well as customized features of that data store Client caching can dramatically reduce latency in these sit- that go beyond the key-value interface, such as SQL queries uations. With the proliferation of cloud data stores that is for a relational database. If an application uses the key-value now taking place, caching becomes increasingly important for interface, it can use any data store supported by the UDSM improving performance. since all data stores implement the interface. Different data stores can be substituted for the key-value interface as needed. Client encryption capabilities are valuable for a number of The UDSM provides a synchronous (blocking) interface to reasons. The server might not have the ability to encrypt data. data stores for which an application will block while waiting Even if the server has the ability to encrypt data, the user might for a response to a data store request. It also provides an not trust the server to properly protect its data. There could be asynchronous (nonblocking) interface to data stores wherein malicious parties with the ability to breach the security of the an application program can make a request to a data store servers and steal information. and not wait for the request to return a response before Another reason for using client-side encryption is to pre- continuing execution. The asynchronous interface is important serve confidentiality of data exchanged between the client and for applications which do not need to wait for all data server. Ideally, the client and server should be communicating store operations to complete before continuing execution and via a secure channel which encrypts all data passed between can often considerably reduce the completion time for such the client and server. Unfortunately, this will not always be applications. the case, and some servers and clients will communicate over Most existing data store clients only provide a synchronous unencrypted channels. interface and do not offer asynchronous operations on the Compression can reduce the memory consumed within data store. A key advantage to our UDSM is that it provides a data store. Client-based compression is important since an asynchronous interface to all data stores it supports, even not all servers support compression. Even if servers have if a data store does not provide a client with asynchronous efficient compression capabilities, client-side compression can operations on the data store. still improve performance by reducing the number of bytes that The UDSM also provides monitoring capabilities as well need to be transmitted between the client and server. In cloud as a workload generator which allows users to easily determine environments, a data service might charge based on the size the performance of different data stores and compare them to of objects sent to the server. Compressing data at the client pick the best option. before sending the data to the server can save clients money in this type of situation. While caching can significantly improve performance, the optimal way to implement caching is not straightforward. Our enhanced data store clients and UDSM are architected There are multiple types of caches currently available with in a modular way which allows a wide variety of data stores, different performance trade-offs and features [7], [8], [9], [10], caches, encryption algorithms, and compression algorithms. [11]. Our enhanced clients can make use of multiple caches to Widely used data stores such as Cloudant (built on top of offer the best performance and functionality. We are not tied to CouchDB), OpenStack Object Storage, and Cassandra have a specific cache implementation. As we describe in Section III, existing clients implemented in commonly used programming it is important to have implementations of both an in-process languages. Popular language choices for data store clients cache as well as a remote process cache like Redis [7] or mem- are Java, Python, and Node.js (which is actually a JavaScript cached [8] as the two approaches are applicable to different runtime built on Chrome’s V8 JavaScript engine). These clients scenarios and have different performance characteristics.

Load more