
A Distributed Cache for Hadoop Distributed File System in Real-time Cloud Services

Jing Zhang1, Gongqing Wu1, Xuegang Hu1, Xindong Wu1, 2

1 Department of Computer Science, Hefei University of Technology, Hefei, 230039, China
2 Department of Computer Science, University of Vermont, Burlington, VT 05405, U.S.A.
[email protected], [email protected], [email protected]

Abstract—The improvement of file access performance is a great challenge in real-time cloud services. In this paper, we analyze the preconditions of dealing with this problem, considering the aspects of requirements, hardware, software, and network environments in the cloud. We then describe the design and implementation of a novel distributed layered cache system built on top of the Hadoop Distributed File System, named the HDFS-based Distributed Cache System (HDCache). The cache system consists of a client library and multiple cache services. The cache services are designed with three access layers: an in-memory cache, a snapshot of the local disk, and the actual disk view as provided by HDFS. Files loaded from HDFS are cached in shared memory which can be directly accessed by a client library. Multiple applications integrated with the client library can access a cache service simultaneously. Cache services are organized in P2P style using a distributed hash table. Every cached file has three replicas in different cache service nodes in order to improve robustness and alleviate the workload. Experimental results show that the novel cache system can store files with a wide range of sizes and delivers millisecond-level access performance in highly concurrent environments.

Keywords—distributed cache system; cloud storage; HDFS; real-time file access; in-memory cloud

I. INTRODUCTION

Apache Hadoop [1] is a well-known project that includes open source implementations of a distributed file system [2] and a MapReduce parallel processing framework, inspired by Google's GFS [3] and MapReduce [4] projects. The emergence of the open source Hadoop system eliminates the technical barrier to cloud computing. Several rising international IT companies are dedicated to making contributions to the Hadoop community, as well as deploying and using this system to build their own cloud computing systems. After several years of development, Hadoop has gradually formed a cloud computing ecosystem consisting of a set of technical solutions, including the HBase distributed database, the Hive distributed data warehouse, the ZooKeeper coordination service for distributed applications, etc. All the components are built on top of low-cost commodity hardware with extensive availability and fault tolerance, which has gradually made Hadoop the mainstream commercial implementation of cloud computing technology.

One of the significant design features of the Hadoop system is high throughput, which makes it extremely suitable for handling large-scale data analysis and processing problems. This original design gives Hadoop outstanding performance in off-line processing of massive, petabyte-scale data sources. In recent years, with the continuous development of broadband networks, Internet applications with real-time interactive features have increased significantly. This type of real-time cloud computing environment has the following characteristics. (1) Personalized service: one of the goals of cloud computing is to provide a user-adaptive virtual information service system. A personalized service is built on the analysis of personal historical information. (2) A shorter period between users generating data and users consuming data: although the personalized models of individuals are derived from user-generated content, the period of data recreation is shortening substantially. Continuous optimizations of the system take place at any time, which lets end users feel smooth improvement as their frequency of use increases. (3) Real-time dynamics: in order to support real-time personalized features, the cloud services must schedule enough resources affiliated with a specific person in several seconds or less. The total resource pool may contain model data for hundreds of millions of users. (4) Differentiated management of personal data: in cloud services, the personal data of a user can be classified into several categories, such as access logs, profile, uploaded (or generated) materials, and business model. Generally speaking, some information for a user's profile and business model can be extracted from access logs and uploaded materials using data mining techniques. The cloud services manage these categories of data in different manners. User access log data are simply stored in a backup storage system. A user's uploaded data are stored in the networked storage system and loaded into memory when the user explicitly reads or modifies them. The user's profile and business model data are critical in real-time services, as they determine the business category and the quality of service; they are generally loaded implicitly when the user logs into the system and cannot be scheduled out until the user logs out. While the user visits real-time cloud services, this kind of data may be accessed many times. The size of this kind of data is typically not large, due to the way it is generated, and we assume it is about 10MB. Thus, the real system must schedule this kind of data (about 10MB) for a specific user out of several million model packages (about 100TB), then compute the results and send them back to the user. All these procedures must be completed within 2 seconds, so the resource scheduling procedure must be completed at the millisecond level.

The Hadoop Distributed File System (HDFS) meets the requirements of massive data storage, but lacks consideration of real-time file access. In HDFS, reading a file may involve several interactions with the NameNode and DataNodes, which dramatically decreases access performance when the system is under a heavy data burden and a concurrent workload. Thus, how to improve HDFS file access performance (especially file reading performance) is a key issue in real-time cloud services.

In this paper, we present a novel distributed cache system named HDCache, built on top of HDFS, which has rich features and high access performance. The cache system provides general-purpose storage for variable workloads including both huge and small files. In the rest of the paper, we describe the details of this cache system. In Section II, related work is reviewed. In Section III, we describe the prerequisites, considerations and motivations for designing a new cache system. Section IV provides the design and implementation details of the cache system. Section V evaluates the performance and the hit ratio of the cache system. Section VI concludes this paper with some future work.

II. RELATED WORK

The pitfalls of the Hadoop Distributed File System have been studied widely since the project was launched by Yahoo!. J. Shafer analyzed the performance of HDFS thoroughly and concluded that one of the major causes of its performance bottlenecks is the tradeoff between portability and performance [5]. The most mentioned shortcoming of HDFS is its weak performance when dealing with small files. Some applications solve this problem on top of HDFS by combining abundant small files into large ones and building an index for each small file, in order to reduce the file count in the system [6]. Others try to modify the HDFS I/O features and the DataNode meta-data management implementation in order to provide better performance [7][8]. Many approaches [9][10][11] can be classified into these two categories.

These two methods cannot fundamentally improve the system performance. (1) Combining small files into a large file is a typically time-consuming operation, so the key objective of this method is to improve system throughput rather than response time. HBase [12] systematically adopts this idea to solve the small-file storage problem by introducing a Google Bigtable [13] like key-value distributed database that makes file combination and retrieval transparent to end users. K. Dana evaluated the performance of HBase and found that reads slow down as the number of rows written increases, which indicates it may not meet the needs of real-time access [14]. (2) Modifying HDFS I/O features and altering the DataNode meta-data management implementation are comparatively dangerous. Generally speaking, without completely re-designing the system, enhancing certain aspects of a system is bound to damage others.

NoSQL databases, such as HBase, Cassandra [33], and MongoDB [34], are considered a good solution for randomly reading and writing binary data on persistent storage devices. However, evaluations of these systems have shown that their performance still cannot meet the needs of real-time access to big data [35][36]. According to the YCSB benchmark [35], the read latency of both HBase and Cassandra becomes unacceptable when throughput exceeds 8000 ops/sec. HDCache is not a NoSQL database; it only draws on the key-value storage strategy and the data replication mechanism used in NoSQL databases. HDCache simplifies the access model to the granularity of files instead of tables, which improves performance dramatically.

Another idea for addressing real-time access is to provide an in-memory storage system. J. Ousterhout et al. put forward a tentative plan to build cloud computing systems in DRAM [15][16]. The DRAM cloud is a great ambition in the long run; however, it currently still faces many challenges and cannot be practical in the short term. A practical, simple alternative to the RAM cloud is Memcached [17], which provides general-purpose key-value storage entirely in DRAM and is widely used to offload back-end database systems [18]. Memcached can be used to accelerate MapReduce tasks on a Hadoop cluster [19] and to promote read throughput in massive small-file storage systems [20]. The Memcached and Redis [21] in-memory key-value storage systems are thought to be more feasible in production environments and are widely used in many IT companies, such as Facebook [22] and Twitter [23]. Memcached and Redis are well-implemented open source systems; however, they still have some shortcomings that we will analyze in the next section.

Our HDCache can be viewed as a first-step attempt to realize a DRAM cloud system. The HDCache system tries to cache the content that will be accessed in the near future, which is the same design principle as Memcached and Redis. However, HDCache overcomes their defects, such as failures in dealing with large files and the lack of replication and persistent serialization.

III. PREREQUISITES AND DESIGN CONSIDERATIONS

The goal of our work is to design and implement a distributed cache system on top of HDFS that can accelerate person-specific data access in large-scale real-time cloud services. Our novel HDCache system is based on the following factors, prerequisites and design considerations.

On-the-top Method rather than Built-in Method

As mentioned in Section II, many systems attempt to modify HDFS features to improve performance. These practices violate the original design principles, and it is definitely difficult to get good results. Reducing the HDFS system workload is conducive to performance improvement, and therefore building a distributed cache system on top of HDFS is a better choice. The latter makes the cache and HDFS systems independent of each other, which is essential for maintaining compatibility of the entire system in large-scale deployments and system upgrades. From the perspective of software engineering, a loose association between HDCache and HDFS is a better design choice for the independent evolution of both systems. With the open source community promoting HDFS, the performance of HDCache will improve simultaneously. Meanwhile, any change to HDCache will not affect the performance of the underlying HDFS system.

Network I/O rather than Disk I/O

Cloud computing systems are usually built on top of low-cost commodity hardware connected by Gigabit Ethernet. In practice, the network I/O rate is about 100MB/s, which is approximately equal to the disk I/O rate. On one hand, a real-time cloud computing system stores large amounts of data; on the other hand, data access usually arrives in sudden, random bursts, which evidently slows down disk scheduling performance, resulting in a read efficiency of no more than 50MB/s. Consequently, accessing data over the Ethernet is usually a better choice than reading them from an HDFS DataNode disk. If the cloud computing system is deployed on top of high-speed networks such as 10-Gigabit Ethernet, InfiniBand or Myrinet, network I/O obviously has huge advantages compared to disk I/O.

Layered Data Accessing Model

There are three data access layers in the system when building a cache on top of HDFS. The first layer is the in-memory cache, in which the data access rate is approximately equal to the memory access rate (ignoring OS memory swap). The second layer is the local disk snapshot and remote in-memory caches, with a data access rate of about 50~100MB/s. The bottom layer is HDFS, where all data are stored in DataNodes with an access rate influenced by many factors such as data load, thread concurrency and network traffic.

Applications using the distributed cache first retrieve the desired file from the DRAM cache; if it misses, the cache service contacts another cache service for the file, or loads it from a local disk snapshot if one exists. If this procedure still cannot obtain the desired file, the cache service requested by the client loads the file from HDFS. The details of this process are discussed in the next section; the sketch below previews the order in which the three layers are consulted.
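As an illustration only, the layered read path could be coded as follows. This is a minimal sketch under our own assumptions, not the actual HDCache code: the FileData alias, every helper name and the stub bodies are hypothetical placeholders for the shared-memory index, the peer cache services, the snapshot directory and libhdfs.

#include <optional>
#include <string>
#include <vector>

using FileData = std::vector<char>;  // bytes of a cached file (illustrative)

// Stub lookups for the three layers; real implementations would consult the
// shared-memory index, remote cache services, the local snapshot directory
// and HDFS via libhdfs. All of these names are hypothetical.
std::optional<FileData> lookup_local_dram(const std::string&)     { return std::nullopt; }
std::optional<FileData> lookup_peer_cache(const std::string&)     { return std::nullopt; }
std::optional<FileData> lookup_local_snapshot(const std::string&) { return std::nullopt; }
FileData load_from_hdfs(const std::string&)                       { return FileData(10, 'x'); }
void cache_in_dram(const std::string&, const FileData&)           {}

// Layer 1: local DRAM; layer 2: a peer cache service, then the local disk
// snapshot; layer 3: HDFS DataNodes, after which the file is cached locally.
FileData layered_read(const std::string& name) {
    if (auto hit = lookup_local_dram(name)) return *hit;
    if (auto hit = lookup_peer_cache(name))     { cache_in_dram(name, *hit); return *hit; }
    if (auto hit = lookup_local_snapshot(name)) { cache_in_dram(name, *hit); return *hit; }
    FileData data = load_from_hdfs(name);
    cache_in_dram(name, data);
    return data;
}

int main() { return layered_read("/user/alice/profile.bin").empty() ? 1 : 0; }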
Motivations of Designing a New Cache System

Can we modify an existing cache system such as Memcached to meet the requirements of real-time services? Although declared to be most suitable for large-scale Internet applications, the following defects make Memcached invalid in real-time cloud services.

(1) The Memcached system is designed for caching data that are stored in a database; it is not a typical cloud storage system. Memcached uses a pre-allocated memory pool called a slab to manage memory. One slab contains multiple chunks, which are the basic memory allocation units. Slabs are divided into several groups according to chunk sizes. This memory management strategy fits database data well (SQL query results are usually small) but does not fit online personal data, whose size ranges from several KB to dozens of MB. Using this strategy will result in an enormous management overhead and a large number of memory fragments.

(2) Memcached has no local serialization or snapshot mechanism. When a cache server crashes, the cached contents are lost, and the cost of reconstructing them is extremely expensive. The concept proposed by Google of building a cloud computing system on inexpensive commodity hardware [37] has been widely accepted; thus, a persistent serialization mechanism is a requirement under the condition of unstable hardware and software.

(3) The servers of the Memcached system are independent of one another. The distributed function is provided by the client: the client algorithm decides which server to connect with, and requests are sent no matter whether the server status is normal or malfunctioning. This leaves complex management issues to the user, who must design and implement a central management service to coordinate multiple Memcached servers and clients.

(4) Memcached has a simple consistency checking process based on setting an expiration time. This may cause a burst of network traffic when a large number of cached items expire simultaneously. A better way to solve this problem is to disperse the consistency checks over a certain duration, based on the access frequency of the expired data (a sketch of this dispersal idea follows below).

Our novel HDCache system overcomes the shortcomings mentioned above, which makes it more suitable for real-time cloud services.
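As an illustration of the dispersal remedy for defect (4), expiration-driven checks can be scheduled with a bounded random delay. The sketch below is ours; the window sizes and names are assumptions chosen purely for illustration.

#include <chrono>
#include <cstdio>
#include <random>

using Clock = std::chrono::steady_clock;

// Instead of revalidating an entry at the instant it expires, add a bounded
// random delay so that a batch of simultaneous expirations does not cause a
// burst of validation traffic. Hot (frequently accessed) files draw from a
// smaller window so they stay fresher; both window sizes are assumptions.
Clock::time_point next_check(Clock::time_point expire_at, bool hot, std::mt19937& rng) {
    const int window_ms = hot ? 250 : 2000;
    std::uniform_int_distribution<int> jitter(0, window_ms);
    return expire_at + std::chrono::milliseconds(jitter(rng));
}

int main() {
    std::mt19937 rng(std::random_device{}());
    auto when = next_check(Clock::now(), /*hot=*/true, rng);
    std::printf("validation deferred by %lld ms\n",
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(
                    when - Clock::now()).count());
}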

IV. DESIGN AND IMPLEMENTATION

Our novel HDCache system is built on top of the Hadoop Distributed File System. The cache system and HDFS are loosely coupled, and the system can be viewed as a C/S (client/server) architecture. The only thing third-party applications need to do is integrate with a client-side dynamic library; they then use the cache to access data stored in HDFS transparently and with very high performance. This section describes the key design and implementation issues of the cache system.

Architecture

The HDCache system currently aims at deployment in intranet environments within an organization, isolated from the outside by firewalls. Although security issues cannot be ignored in a real cloud system, in this paper we assume the cache system runs in secure circumstances. Figure 1 describes a simple example deployment of our system. The HDCache system can be deployed on the HDFS NameNode, DataNodes and any other application systems (such as web servers) that can access HDFS through the network and need cache functions, no matter whether the OS of these systems is Windows or Linux. The cache system contains two parts: a client dynamic library (libcache) and a service daemon. Users only need to integrate libcache into their applications, and they can then access cache services on the same machine or connected through the network. One cache service can serve multiple applications simultaneously. Figure 2 describes the internal architecture of the service daemon and the libcache client library.

Figure 1. Typical network topology of deployment

HDCache Service Internal

A cache service runs as a daemon on the host. The HDFS access module uses the HDFS API to load file contents into pre-allocated shared memory. The Shared Memory Manager (SMM) is the core module that bridges the service and the client library. The serialization module periodically writes the meta-data and all cached files into the local OS file system, forming a series of snapshots which can be used to re-establish the cache after a system crash (a sketch of this idea follows this subsection). Another function of the serialization module is to provide swap space when the cache is deployed on hosts with small RAM. The client library uses sockets (over TCP) to communicate with a cache service and exchange control messages. Although shared memory is highly efficient, we choose sockets as the inter-process communication facility for control messages because libcache can then contact both local and remote cache services in the same way on heterogeneous systems. The consistency of the cache is guaranteed by a validator with comprehensive validation rules, in order to balance resource consumption and timeliness.
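Conceptually, the snapshot side of the serialization module can be pictured with the following sketch, which dumps one length-prefixed (name, size, bytes) record per cached file. The record format, names and in-memory cache type are our own assumptions; the real module also serializes meta-data and runs periodically.

#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Write one record per cached file; replaying the records after a crash
// rebuilds the cache contents without touching HDFS.
void write_snapshot(const std::map<std::string, std::vector<char>>& cache,
                    const std::string& path) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    for (const auto& [name, bytes] : cache) {
        const uint32_t nlen = static_cast<uint32_t>(name.size());
        const uint64_t blen = bytes.size();
        out.write(reinterpret_cast<const char*>(&nlen), sizeof nlen);
        out.write(name.data(), nlen);
        out.write(reinterpret_cast<const char*>(&blen), sizeof blen);
        out.write(bytes.data(), static_cast<std::streamsize>(blen));
    }
}

int main() {
    write_snapshot({{"/user/alice/profile.bin", std::vector<char>(16, 'x')}},
                   "cache.snapshot");
}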

Figure 2. Architecture of HDCache system

libcache Library

The HDCache system provides a client library called libcache, which is integrated into upper-layer applications. libcache consists of two major components: Communication & Control, and Shared Memory Access. The Communication & Control module undertakes the tasks of (1) interoperating with the HDCache service on the same host, (2) communicating with ZooKeeper [28] servers remotely, and (3) calculating hash values of desired files and locating a specific cached file.

Calculating the hash value of a file is a time-consuming operation. To avoid calculating the hash value of the same file anywhere else in the system, when a client calculates a hash value it stores the value in the ZooKeeper servers. The ZooKeeper service can be viewed as a database that stores information as a tree in memory; ZooKeeper is discussed further in the System Management section below. During startup, the libcache initialization procedure contacts the ZooKeeper servers, fetches all files' hash values from ZooKeeper and stores them, ordered, in its memory. Because the size of a hash value is usually several bytes, storing these values on the client consumes only a little memory. When accessing a particular file, libcache looks up the hash value first instead of calculating it every time (sketched below).

HDCache is designed for a write-once-read-many file access model, which approximately holds in a cloud computing environment. Thus, when a libcache client opens a file, the file can be shared with other clients running on the same host through the same file descriptor. In this manner, even a file name string comparison is avoided, which improves file access efficiency.
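As an illustration, the client-side hash index could be sketched as an ordered map preloaded from ZooKeeper and consulted before any hash is computed. The class below is ours; in particular, ketama_stub stands in for the real Ketama function, and publishing new values back to ZooKeeper is omitted.

#include <cstdint>
#include <functional>
#include <map>
#include <string>

class HashIndex {
public:
    // 'preloaded' models the (file name, hash) pairs fetched from ZooKeeper
    // during libcache initialization.
    explicit HashIndex(std::map<std::string, uint32_t> preloaded)
        : hashes_(std::move(preloaded)) {}

    uint32_t hash_of(const std::string& file) {
        if (auto it = hashes_.find(file); it != hashes_.end())
            return it->second;                 // fast path: no hashing needed
        const uint32_t h = ketama_stub(file);  // slow path: compute once
        hashes_.emplace(file, h);              // cache locally; the real client
        return h;                              // would also record it in ZooKeeper
    }

private:
    static uint32_t ketama_stub(const std::string& s) {
        // Placeholder only, NOT the real Ketama hash function.
        return static_cast<uint32_t>(std::hash<std::string>{}(s));
    }
    std::map<std::string, uint32_t> hashes_;   // kept ordered, as in the paper
};

int main() {
    HashIndex idx({{"/user/alice/profile.bin", 0x1234u}});
    return idx.hash_of("/user/alice/profile.bin") == 0x1234u ? 0 : 1;
}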

Compared with the Memcached system, in which client applications copy the cached content from the service process into their own process memory space, our cache system chooses shared memory, which theoretically has higher performance. Figure 3 gives an overview of the Shared Memory Manager (SMM) module.

The shared memory is divided into pages with a fixed size (typically 4KB, which can be reset by users). The page is the basic memory allocation unit in our system. The last 4 bytes (32-bit OS) or 8 bytes (64-bit OS) of every page store the number of the next page of the same file, or '0x0' standing for the file end; i.e., all pages of a file are organized as a linked list. The first page number of a file is stored in the FirstP field of the Meta Info Map, a data structure storing the cached file name, hash value and other information about the file. The client library fetches the first page number of a desired file from the cache server and then directly accesses the shared memory for the file content. The Mem Bitmap data structure is used for the management of allocated and free pages.

Figure 3. Shared memory management of HDCache service

When a client opens a cached file, libcache records the first page address of this file and returns a file descriptor to the upper-layer application. The file descriptor is related to the meta information of the file, such as the read/write pointer possessed by the client. The client uses this file descriptor to read and write the content of the file. Although a cache service can serve multiple clients simultaneously, the clients maintain their read/write pointers respectively. When multiple clients share one cached file, the cache service maintains a reference counter for each opened file, whose value is the number of clients sharing this file. The client also maintains a simple Meta Info Map structure, usually called the Client Open Files Table.
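The page chain described above can be read back with a loop like the following sketch. It assumes a 64-bit build (8-byte links), treats page 0 as reserved so that 0x0 can terminate a chain, and fakes the shared-memory segment with an ordinary buffer; constants and names are ours, not the SMM's.

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kPageSize = 4096;
constexpr size_t kLink     = sizeof(uint64_t);   // trailing next-page number
constexpr size_t kPayload  = kPageSize - kLink;

std::vector<char> read_chain(const char* shm, uint64_t first, size_t file_size) {
    std::vector<char> out;
    out.reserve(file_size);
    for (uint64_t page = first; page != 0 && out.size() < file_size; ) {
        const char* p = shm + page * kPageSize;
        const size_t take = std::min(kPayload, file_size - out.size());
        out.insert(out.end(), p, p + take);       // copy payload bytes
        std::memcpy(&page, p + kPayload, kLink);  // follow link to next page
    }
    return out;
}

int main() {
    std::vector<char> shm(3 * kPageSize, 'A');    // pages 0 (reserved), 1, 2
    uint64_t next = 2, end = 0;
    std::memcpy(&shm[1 * kPageSize + kPayload], &next, kLink);  // page 1 -> 2
    std::memcpy(&shm[2 * kPageSize + kPayload], &end,  kLink);  // page 2 -> EOF
    return read_chain(shm.data(), 1, 5000).size() == 5000 ? 0 : 1;
}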
When a client requests a file and the cache misses, the cache service fetches the file from other cache services, local snapshots or HDFS. When the cache service loads the fetched file into its shared memory, if the memory has not enough free pages to allocate, the service uses the LRU (Least Recently Used) algorithm to eliminate files that have been unused for a long time. The statistics for the LRU algorithm are also stored in the Meta Info Map. When processing LRU in memory, an eliminated file can be swapped to the local disk; if the local disk does not have enough space, LRU can also be applied to the local disk space to delete unused files.

Each file loaded into the cache system has an expiration time that can be set by the client. When the expiration time fires, the consistency validator decides whether to process a consistency check according to network traffic, system workload and the file access frequency. Infrequently accessed files do not undergo a consistency check until they are accessed the next time, while frequently accessed files do this check in the background in order to keep the latest version. When a large number of files expire at the same time, a random delay is added to the consistency checkpoint in order to avoid an unexpected burst of network traffic. While the HDCache service processes consistency validation, the client can still access the file from the cache; however, the latest version of the file will not be seen until the validation process completes. The client library also provides functions to conduct immediate validation, either for the whole cache service or for a particular cached file.

Distributed Storage Process

HDFS introduces replicas to enhance system robustness. By default, each file stored in HDFS has three replicas stored on different machines, which minimizes the impact of a machine crash. Our cache system also adopts this design: every cached file has three replicas.

Compared with the single-NameNode design scheme of HDFS, our cache chooses a DHT (Distributed Hash Table) instead. DHT is widely used in P2P system design and has also been introduced into cloud storage; a typical example is Amazon's Dynamo [24], which is an important part of Amazon's Elastic Compute Cloud. The key concept of DHT is consistent hashing, which maps a key to a position on a 0 to 2^32 continuum. There are quite a few hash functions that can be used to implement consistent hashing, such as FNV [25], CRC hash [26] and Ketama hash [27]. We choose the Ketama hash function not only because it has an open-source implementation but also because it balances computing performance, hit ratio and dispersion. Figure 4 describes the distributed file storage scheme.

Algorithm: Cache_File
Input: file name, local node IP address
Output: the file stored in specific cache services
Procedure: Cache_File (filename, ipaddr)
1. Store the file in NodeLocal.
2. File_HV := KetamaFunc(filename)
   NodeLocal_HV := KetamaFunc(ipaddr)
   Find NodeB whose hash value is clockwise equal to or greater than File_HV
   If NodeB_HV = NodeLocal_HV
     Then find NodeB' whose hash value is clockwise greater than NodeLocal_HV, and Node2 := NodeB'
     Else Node2 := NodeB
   Store the file in Node2
3. Node2C_HV := 2^32 - Node2_HV
   If no node sits at Node2C_HV, or Node2C_HV = Node2_HV
     Then find NodeC whose hash value is clockwise equal to or greater than Node2C_HV, and Node3 := NodeC
     Else Node3 := the node at Node2C_HV
   Store the file in Node3
END
Note: *_HV means the hash value of *

Figure 4. Algorithm of loading a file to the distributed cache
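The placement rule of Figure 4 maps naturally onto an ordered map used as the hash ring, where "clockwise" becomes a lower_bound lookup with wrap-around. The sketch below mirrors steps 2 and 3 under our own naming and is illustrative only; on very small rings the three choices can collapse onto the same node, a degenerate case the sketch does not handle.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Ring = std::map<uint32_t, std::string>;  // node hash -> node id (non-empty)

// First node at or clockwise after 'pos', wrapping past 2^32 - 1.
const std::string& clockwise(const Ring& ring, uint32_t pos) {
    auto it = ring.lower_bound(pos);
    return (it == ring.end() ? ring.begin() : it)->second;
}

std::vector<std::string> place_replicas(const Ring& ring, uint32_t file_hv,
                                        uint32_t local_hv, const std::string& local) {
    std::vector<std::string> nodes{local};             // replica 1: local node
    const std::string& n2 = clockwise(ring, file_hv);  // replica 2: successor of the
    nodes.push_back(n2 != local ? n2                   // file hash, skipping the local
                                : clockwise(ring, local_hv + 1));  // node if they coincide
    uint32_t node2_hv = 0;                             // recover the hash of node 2
    for (const auto& [hv, id] : ring)
        if (id == nodes[1]) node2_hv = hv;
    // Replica 3: the node at (or clockwise after) the complement 2^32 - hv;
    // unsigned wrap-around computes the complement for free.
    nodes.push_back(clockwise(ring, 0u - node2_hv));
    return nodes;
}

int main() {
    Ring ring{{100u, "A"}, {2000u, "B"}, {3000000000u, "C"}};
    return place_replicas(ring, 1500u, 100u, "A").size() == 3 ? 0 : 1;
}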

The procedure Cache_File is called when the cache service receives a file access request and cannot find any replica in the distributed cache system. The cache service calculates the Ketama hash value of the file and finds the other two counterparts on the continuum, then stores the file in these cache nodes as three replicas.

Figure 5 demonstrates the algorithm in a typical circumstance. When Node A caches File 1, one replica is stored in the local cache service on Node A; another replica is stored in Node B, whose hash value is clockwise equal to or greater than the hash value of File 1 on the continuum; and the last replica is stored in the Complement Node of B if it exists, or else in the node that is clockwise greater than B's Complement (Cache X). When a file access request arrives at a cache service, the service retrieves the file from its DRAM; if it misses, the cache service contacts the other two counterparts for the file, or loads it from a local disk snapshot if one exists. If this procedure still cannot obtain the desired file, the cache service loads the file from HDFS and tells the counterparts to store the replicas if possible.

Figure 5. Novel DHT-based distributed file storage scheme

System Management & User Interface

The coordination and management of multiple processes has always been a great challenge in distributed system design and implementation, although the issue has been studied for several decades. Our system chooses ZooKeeper [28][29] as the distributed system management infrastructure. ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. Compared with its counterpart, Google's Chubby [30] distributed lock service, ZooKeeper provides a richer set of features. ZooKeeper is based on the Zab [31] atomic broadcast protocol, which by default uses simple majority quorums to decide on a proposal; thus ZooKeeper works as long as a majority of its servers are correct (i.e., with 2f + 1 servers we can tolerate f failures). When a cache service starts, it registers itself with a ZooKeeper server, recording its IPs, the Ketama hash values of its IPs, and other information. Any component integrated with the ZooKeeper client library can contact the ZooKeeper servers to acquire global information, perceive other members' status, and coordinate with other processes. As mentioned above, all Ketama hash values of visited files are stored in the ZooKeeper servers; thus, in distributed storage, most hash value calculations become unnecessary once the system has been running long enough. Introducing ZooKeeper into the HDCache system is therefore not only beneficial to system component management but also conducive to performance improvement.

HDFS provides a Posix-like C/C++ library (libhdfs) [32] for easy integration with third-party applications. In order to eliminate differences in API usage, the user interface of the cache system is designed as Posix-like file system operations, even though internally the data are stored in a key-value fashion. Table I lists some typical APIs of the client library and their meanings.

TABLE I. IMPORTANT APIS OF CLIENT LIBRARY

API                               Description
cache_connect()                   connect libcache to the local HDCache service
cache_disconnect()                disconnect from the local HDCache service
cache_fopen(filename, mod)        open a cached file; if not cached, the cache service loads it into RAM
cache_fread(fd, buffer, size)     read a cached file
cache_fwrite(fd, buffer, size)    write a cached file
cache_fclose(fd)                  close a cached file
cache_finfo(filename)             descriptive information of a cached file
cache_status()                    current status of the cache service process
cache_validate()                  manually trigger cache-wide consistency validation
cache_fvalidate(filename)         manually trigger consistency validation of one file
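Table I gives only the names and arguments of the client APIs, so the following usage sketch fills in return and parameter types by analogy with the C stdio convention. Those signatures, the error handling and the file path are our assumptions, and the program naturally needs the real libcache to link.

#include <cstdio>

// Hypothetical declarations matching Table I; the actual libcache header
// may type these differently.
extern "C" {
    int  cache_connect();
    void cache_disconnect();
    int  cache_fopen(const char* filename, const char* mod);
    long cache_fread(int fd, void* buffer, long size);
    int  cache_fclose(int fd);
}

int main() {
    if (cache_connect() != 0) return 1;        // attach to the local service
    int fd = cache_fopen("/user/alice/profile.bin", "r");  // loads from HDFS on miss
    char buf[4096];
    long n = cache_fread(fd, buf, (long)sizeof buf);  // served from shared memory
    std::printf("read %ld bytes\n", n);
    cache_fclose(fd);
    cache_disconnect();
    return 0;
}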

V. EVALUATION

We implemented the novel cache system on both Linux (64-bit) and Windows (64-bit). We chose 64-bit operating systems because they are currently the mainstream in commercial environments and provide a large memory address space. Taking performance into account, we implemented the system in the C++ programming language.

We set up a test-bed consisting of twenty-one servers running Ubuntu 11.04 64-bit, each with four Intel Xeon E5620 2.6GHz CPUs (16 cores), 16GB of DRAM, four 7200 RPM 2TB SATA hard disks and a 1000 Mbps Ethernet card. Twenty of these computers are configured as DataNode servers storing files sized in the range of 4KB to 10GB, and the remaining one is configured as the NameNode server. The storage capacity of the cluster is thus about 160TB, and the in-use data volume is about 20TB. On every DataNode we deploy a cache service. The cache client (a multi-threaded tester integrated with the libcache library) is also deployed on the DataNodes.

Basic Performance Benchmark

The first test concerns the performance of the cache APIs. We use a client with only one thread, running each API many times. In real-time cloud services, an individual's personal data seldom exceed 10MB; thus we cache files sized 4KB to 10MB in this test. In order to compare with other systems, we also benchmark the performance of Redis and Memcached. Table II shows the typical performance of the important cache operations for different file sizes.

In the basic performance benchmark, we omit hash value calculation and network traffic overheads for the following reasons. (1) When a file is cached, its hash value is also cached, so the value does not need to be calculated twice. (2) Because Memcached and Redis cannot schedule cached content over the network, we only benchmark performance under the condition that files are already cached. (3) When we benchmark large-file storage performance, if the file is missing, the bottleneck is network I/O.

TABLE II. BASIC PERFORMANCE OF CACHE SYSTEM

API                   Time in milliseconds (operations per second)
cache_fopen           0.023 (43,478)
cache_fclose          0.021 (47,619)
cache_fread (10MB)    7.69 (130)
cache_fwrite (10MB)   16.45 (61)
cache_fread (4KB)     0.004 (250,000)
cache_fwrite (4KB)    0.012 (83,333)
cache_finfo           0.16 (6,250)

The cache_fread operation reads cache content into a client application; the cache_fopen, cache_fread, cache_fclose operation sequence is equivalent to a Redis GET operation. The cache_fwrite operation modifies cached file content; the cache_fopen, cache_fwrite, cache_fclose operation sequence is equivalent to a Redis SET operation.
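Numbers like those in Table II come from a single-threaded loop that repeats one operation and divides the elapsed time. A generic harness of that kind is sketched below; the dummy workload stands in for a cache_fopen/cache_fread/cache_fclose sequence, and the iteration count is arbitrary.

#include <chrono>
#include <cstdio>
#include <functional>

// Run 'op' iters times and report mean latency plus operations per second.
void benchmark(const char* name, const std::function<void()>& op, int iters) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) op();
    const auto t1 = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%s: %.4f ms/op (%.0f ops/sec)\n",
                name, ms / iters, iters * 1000.0 / ms);
}

int main() {
    volatile long sink = 0;  // dummy workload in place of the cache API calls
    benchmark("noop", [&] { sink = sink + 1; }, 1000000);
}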
During the benchmark we found that Memcached does not support files sized 10MB, so we chose 4KB files for the Memcached benchmark. Table III shows the experimental results for Redis and Memcached.

TABLE III. PERFORMANCE OF REDIS AND MEMCACHED

System & Operation      Time in milliseconds (operations per second)
Redis GET (10MB)        12.61 (79)
Redis SET (10MB)        17.13 (58)
Redis GET (4KB)         0.039 (25,641)
Redis SET (4KB)         0.051 (19,608)
Memcached GET (4KB)     0.034 (29,412)
Memcached SET (4KB)     0.038 (26,316)

Compared with Table II, the performance of the write (SET) operation of the three systems is at the same level (about 0.04ms for 4KB files, about 17ms for 10MB files). The performance of the read (GET) operation of the three systems differs. For 10MB files, our cache sustains 123 GET operations (each including the cache_fopen, cache_fread, and cache_fclose sequence) per second, which is 56% faster than Redis (79 operations per second). The results indicate that when processing small files the three systems perform almost equally, while when dealing with big files our cache gains better performance.

HBase is considered an alternative for situations that need high-performance GET/PUT operations. In order to compare HDCache with HBase, we benchmark HBase in a tiny cluster that contains only one HMaster and one Region Server. This configuration guarantees that all operations complete on the local host if we run the benchmark tools on the Region Server. Even under this strong constraint, network traffic is still unavoidable, because the HMaster is involved in the process of reading and writing data. Table IV shows the experimental results for HBase.

The HBase client provides a cache function: when data are queried from a table, the client caches the data for further usage, which is semantically consistent with our cache system. Thus, in the experiment, the data are cached first.

TABLE IV. PERFORMANCE OF HBASE IN TINY CLUSTER

Operation                   Time in milliseconds (operations per second)
HBase Cached GET (4KB)      0.337 (2,967)
HBase PUT (4KB)             0.034 (29,411)
HBase Cached GET (10MB)     204.8 (4.88)
HBase PUT (10MB)            985.6 (1.015)

The experimental results show the following. (1) For 4KB files, our HDCache GET operation (the cache_fopen, cache_fread and cache_fclose operation sequence, about 0.048 ms) is about 7 times faster than the HBase cached GET operation. The reason may be that HBase must extract the 4KB of data from a cached column of the table. (2) For 4KB files, the HBase PUT (similar to SET) operation is a bit faster than the HDCache SET operation (the cache_fopen, cache_fwrite and cache_fclose operation sequence, about 0.056 ms). The results suggest that the data are stored in the HBase client memory buffer before streaming to the database table. (3) For 10MB files, HBase performance declines dramatically: our HDCache is 26 times faster for the GET operation and 60 times faster for the SET operation.

There are quite a few NoSQL databases, such as MongoDB [34], with the same design principle as the HBase system. They tend to combine small data fragments with sizes from several KB to several hundred KB. When dealing with data spanning a wider size range, these systems may perform worse.

Throughput on Concurrency

We evaluate the impact on throughput when a multi-threaded client accesses the cache system concurrently. In this test case, the cache service caches 2GB of files of different sizes. The number of concurrent threads is set from 1 to 200.

Figure 6 shows the read efficiency under concurrency. The read efficiency declines as the number of client threads increases. However, when the number of threads exceeds 90, the efficiency fluctuates within a constant range. The read efficiency stays above 1GB/s even when the number of client threads is up to 200.

Figure 6. Read efficiency for different numbers of concurrent threads

Figure 7 shows the cache_fopen operation performance under concurrency. The response time of the cache_fopen operation is prolonged as the number of client threads increases, and fluctuates within a constant range once the threads exceed 30. The cache_fopen operation stays within 10ms even when the number of client threads is up to 200. The concurrency tests show that our cache system maintains very high performance in both the service and the client library in multi-threaded environments.

Figure 7. cache_fopen operation response time for different numbers of concurrent threads

Simulation of Hit Ratio

Our cache system aims at storing personal data in real-time cloud services. We assume that one user has 10MB of personalized data, because in real-world applications a user cannot manage large amounts of data under network bandwidth and timing constraints. In an industrial cloud computing environment, in order to improve the efficiency of the services and to reduce the workload on the system, we always try to avoid scheduling personal data from one server to another, hoping that the user always logs onto a specific server as long as it works. We assume that the critical personal data are the profile and business model, with a size of about 10MB. Considering a normal commercial server with 48GB of RAM, a configuration in which one cache service holds 2,000 users' data is a sound choice. Theoretically, if access were uniformly distributed, cache memory size and hit ratio would be approximately in a linear relationship: to serve 2,000 users, the cache would need 20GB of memory to achieve a 100% hit ratio. However, user access in cloud services has its own characteristics. We use the following scenario to simulate a user access model in real-time cloud services (a sketch of such a simulation follows the list).

a) On a single cache service node, 2,000 users access their personal data in a day, and the total access count is 100,000 (on average 50 per user per day).

b) Every user's access count is a random number between 5 and 500; in a real system, a user's access count is seldom outside this range.

c) The time points at which a specific user accesses the cache are random. Once a user begins to access the cache system, the user's next access time point is near the last access time, within a random short delay; we simulate this delay by a random number between 1 and 250. That is, the access sequence of a specific user is neither uniform nor consecutive.
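A Monte-Carlo version of this scenario can be sketched as below: it draws per-user access counts and clustered access times in the spirit of rules b) and c), then replays the merged sequence against an LRU cache whose capacity is counted in 10MB user blobs (so 240 users correspond to roughly 2.4GB). All constants, the unnormalized totals and the eviction policy are our illustrative choices, not the paper's exact simulator.

#include <algorithm>
#include <cstdio>
#include <list>
#include <random>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::mt19937 rng(42);
    const int kUsers = 2000;
    std::vector<std::pair<long, int>> events;           // (time, user)
    std::uniform_int_distribution<int>  count(5, 500);  // accesses per user, rule b)
    std::uniform_int_distribution<long> start(0, 1000000), gap(1, 250);  // rule c)
    for (int u = 0; u < kUsers; ++u) {
        long t = start(rng);
        for (int k = count(rng); k > 0; --k) { events.push_back({t, u}); t += gap(rng); }
    }
    std::sort(events.begin(), events.end());            // merge all users' accesses

    for (int capacity : {240, 500, 1000, 2000}) {       // user blobs held in RAM
        std::list<int> lru;                             // front = most recently used
        std::unordered_map<int, std::list<int>::iterator> where;
        long hits = 0;
        for (const auto& [t, u] : events) {
            if (auto it = where.find(u); it != where.end()) {
                ++hits;
                lru.splice(lru.begin(), lru, it->second);   // refresh on hit
            } else {
                if ((int)lru.size() == capacity) {          // evict the coldest user
                    where.erase(lru.back());
                    lru.pop_back();
                }
                lru.push_front(u);
                where[u] = lru.begin();
            }
        }
        std::printf("capacity %4d users (~%.1f GB): hit ratio %.1f%%\n",
                    capacity, capacity * 0.01, 100.0 * hits / events.size());
    }
}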

Figure 8. Hit ratio of a single node for different cache memory sizes

Figure 8 shows the relationship between cache size and hit ratio in this scenario. We find that one cache service needs only 2.4GB of memory (12% of 20GB) to achieve a 90% hit ratio.

We also extend the simulation to a 20-node distributed cache system. In this scenario, the number of users is set to 40,000 (twenty times that of the test mentioned above) and the file access count is set to 200,000. In this simulation, we study the relationship between the hit ratio and the file replica number. First, we use a one-replica scheme: users arrive at the cloud system in a random manner, and all personal data requests are randomly routed to the caches. Because the requests are dispersed across the whole system, the hit ratio declines dramatically. We then repeat the test but increase the replica number to 3. Figure 9 shows that, using the 3-replica file placement algorithm introduced in this paper, the hit ratio increases by about 10% compared with the 1-replica method. The 3-replica method decreases the probability of cache misses and fits the personal data access model of a real-time cloud service.

Figure 9. Effect of the file replica number on hit ratio

VI. CONCLUSIONS

More and more real-time services are coming forth on the Internet under cloud computing paradigms, and industry tends to use mature technologies such as Hadoop to build real-time cloud computing systems. To overcome the performance shortcomings of the Hadoop Distributed File System (HDFS), this paper describes a novel distributed cache system built on top of HDFS, named HDCache. The HDCache system uses shared memory as its infrastructure, which on one hand can deal with files with a wide range of sizes and on the other hand still delivers very high performance, even when shared by a large number of client application threads. We have introduced a DHT into the design of the distributed cache and improved data storage, so that every cached file has three replicas stored in different cache nodes. This improvement makes most file operations complete in local memory or through network I/O between the cache services, which greatly reduces the frequency of access to HDFS, alleviates the workload of the HDFS NameNode, and improves the performance of the entire cloud computing system.

Currently, this work is still a first step toward bridging high-throughput cloud computing and real-time cloud computing, and many further studies are ongoing in different aspects. The HDCache system is still based on the classic write-once-read-many data access model in the cloud; however, many real-time services have more complex and dynamic data access models, and how to introduce transactional functions needs in-depth study. There are many other techniques in cloud computing, such as NoSQL databases, the MapReduce framework and distributed data warehouses. For these techniques, how to analyze their real-time characteristics and improve them for real-time usage is still full of challenges.

ACKNOWLEDGMENTS

This work is supported by the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA011005, the National Natural Science Foundation of China (NSFC) under Grants No. 61005044 and 60975034, and the Fundamental Research Funds for the Central Universities under Grant No. 2011HGZY0003.

REFERENCES

[1] Apache Hadoop. Available at http://hadoop.apache.org.
[2] Apache Hadoop Distributed File System. Available at http://hadoop.apache.org/hdfs.
[3] S. Ghemawat, H. Gobioff and S. T. Leung, "The Google File System", in Proc. of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP'03), Lake George, New York, 2003, pp. 29-43.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI'04), Berkeley, CA, USA, 2004, pp. 137-150.
[5] J. Shafer and S. Rixner, "The Hadoop distributed filesystem: balancing portability and performance", in 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS 2010), White Plains, NY, March 2010, pp. 122-133.
[6] J. Han, Y. Zhong, C. Han and X. He, "Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS", in IEEE International Conference on Cluster Computing and Workshops (CLUSTER'09), New Orleans, LA, 2009, pp. 1-8.
[7] L. Jiang, B. Li and M. Song, "The optimization of HDFS based on small files", in 3rd IEEE International Conference on Broadband Network and Multimedia Technology (IC-BNMT 2010), Beijing, 2010, pp. 912-915.
[8] G. Mackey, S. Sehrish and J. Wang, "Improving metadata management for small files in HDFS", in 2009 IEEE International Conference on Cluster Computing and Workshops (CLUSTER'09), New Orleans, Sept. 2009, pp. 1-4.
[9] J. Xie, S. Yin, et al., "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters", in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPSW), Atlanta, April 2010, pp. 1-9.
[10] B. Dong, J. Qiu, et al., "A novel approach to improving the efficiency of storing and accessing small files on Hadoop: a case study by PowerPoint files", in 2010 IEEE International Conference on Services Computing (SCC), Miami, July 2010, pp. 65-72.
[11] H. Zhang, Y. Han, F. Chen and J. Wen, "A Novel Approach in Improving I/O Performance of Small Meteorological Files on HDFS", Applied Mechanics and Materials, vol. 117-119, Oct. 2011, pp. 1759-1765.
[12] Apache HBase. Available at http://hbase.apache.org.
[13] F. Chang, J. Dean, et al., "Bigtable: A Distributed Storage System for Structured Data", ACM Transactions on Computer Systems, vol. 26(2), 2008, pp. 205-218.
[14] K. Dana, "Hadoop HBase Performance Evaluation", unpublished. Available at http://www.cs.duke.edu/~kcd/hadoop/kcdhadoop-report.pdf.
[15] J. Ousterhout, P. Agrawal, et al., "The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM", ACM SIGOPS Operating Systems Review, vol. 43(4), Jan. 2010, pp. 92-105.
[16] J. Ousterhout, P. Agrawal, et al., "The case for RAMCloud", Communications of the ACM, vol. 54(7), July 2011, pp. 121-130.
[17] Memcached: a distributed memory object caching system. Available at http://www.danga.com/memcached.
[18] J. Petrovic, "Using Memcached for Data Distribution in Industrial Environment", in Third International Conference on Systems (ICONS 08), Cancun, 2008, pp. 358-372.
[19] S. Zhang, J. Han, Z. Liu and K. Wang, "Accelerating MapReduce with Distributed Memory Cache", in 15th International Conference on Parallel and Distributed Systems (ICPADS 09), Shenzhen, 2009, pp. 472-478.
[20] C. Xu, X. Huang, N. Wu, P. Xu and G. Yang, "Using Memcached to Promote Read Throughput in Massive Small-File Storage System", in 9th International Conference on Grid and Cooperative Computing (GCC), Nanjing, 2010, pp. 24-29.
[21] Redis. Available at http://code.google.com/p/redis.
[22] D. Borthakur, et al., "Apache Hadoop goes realtime at Facebook", in Proceedings of the 2011 International Conference on Management of Data (SIGMOD'11), New York, 2011.
[23] S. Ekanayake, J. Mitchell, Y. Sun and J. Qiu, "Memcached Integration with Twister", unpublished. Available at http://salsahpc.org/CloudCom2010/EPoster/cloudcom2010_submission_264.pdf.
[24] G. DeCandia, D. Hastorun, M. Jampani, et al., "Dynamo: Amazon's highly available key-value store", in Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, Stevenson, Washington, USA, October 2007, pp. 205-220.
[25] G. Fowler, L. Noll, K. Vo and D. Eastlake, "The FNV Non-Cryptographic Hash Algorithm", IETF draft, 2011. Available at http://tools.ietf.org/html/draft-eastlake-fnv-03.
[26] W. Ehrhardt, "CRC Hash". Available at http://www.wolfgang-ehrhardt.de/crchash_en.html.
[27] Ketama Hash. Available at http://www.audioscrobbler.net/development/ketama.
[28] P. Hunt, M. Konar, F. P. Junqueira and B. Reed, "ZooKeeper: wait-free coordination for internet-scale systems", in Proceedings of the 2010 USENIX Annual Technical Conference, Boston, MA, June 23-25, 2010.
[29] Hadoop ZooKeeper. Available at http://zookeeper.apache.org/.
[30] M. Burrows, "The Chubby lock service for loosely-coupled distributed systems", in Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI'06), 2006, pp. 335-350.
[31] B. Reed and F. P. Junqueira, "A simple totally ordered broadcast protocol", in LADIS'08: Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, New York, NY, USA, 2008, pp. 1-6.
[32] Hadoop Distributed File System C/C++ APIs Document. Available at http://hadoop.apache.org/common/docs/current/libhdfs.html.
[33] The Apache Cassandra Project. Available at http://cassandra.apache.org/.
[34] The Mongo Database Project. Available at http://www.mongodb.org/.
[35] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan and R. Sears, "Benchmarking cloud serving systems with YCSB", in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10), Indianapolis, IN, 2010, pp. 143-154.
[36] R. Hecht and S. Jablonski, "NoSQL Evaluation: A Use Case Oriented Survey", in 2011 International Conference on Cloud and Service Computing (CSC 2011), Hong Kong, pp. 336-341.
[37] L. A. Barroso, J. Dean and U. Holzle, "Web search for a planet: The Google cluster architecture", IEEE Micro, vol. 23(2), April 2003, pp. 22-28.

