Evaluation of NoSQL and Array Databases for Scientific Applications

Lavanya Ramakrishnan, Pradeep K. Mantha, Yushu Yao, Richard S. Canon

Lawrence Berkeley National Lab, Berkeley, CA 94720
[lramakrishnan,pkmantha,yyao,scanon]@lbl.gov

ABSTRACT

Scientific users are increasingly considering the use of NoSQL and array databases for storing metadata and data. These databases offer various advantages including support for real-time changing schema and performance optimizations for specific operations. However, there is a limited understanding of the strengths and weaknesses of these databases for scientific applications. In this paper, we present an evaluation using standard benchmarks (the Yahoo! Cloud Serving Benchmark) and two scientific data sets. Our results indicate that a careful understanding of the distribution of the workload, as well as aspects such as client side tools and parameters, needs to be considered to get the optimal performance from these databases.

1. INTRODUCTION

Scientific discoveries for important challenges facing our society require federation or integration of data from multiple research groups and disciplines. The data infrastructure must be able to accommodate diverse data types and formats from diverse experiment facilities, models and sources. Today, data and metadata are primarily stored in file systems and in rare cases in relational databases [14]. Scalable file systems are critical as an underlying software layer, but these are not sufficient for user-level data management and storage, nor do they offer an adequate interface for queries and data analyses. Also, it is often difficult to attach schema, or meaning, to the data beforehand, and the schema and metadata tend to evolve over time. Thus, datastores need to accommodate real-time changes to the structure of the data while making the data searchable on any attribute.

Cloud applications are increasingly using schema-less (also referred to as Not Only SQL or NoSQL) datastores, which enable capture of semi-structured data and allow attribute-based searches. Increasingly, scientific applications are investigating the use of NoSQL databases for their needs (e.g., the Materials Project [9], data from the Advanced Light Source [12]) and/or array databases. NoSQL databases address many of the needs of a data store for storing less structured and evolving data schemas. Current implementations of schema-less databases are based on the ideas proposed by Amazon (Dynamo) [6] and Google (Bigtable) [3]. Similarly, array databases provide performance optimizations for array operations. However, there is a limited understanding of the strengths and weaknesses of this new class of databases. In this paper, we provide an evaluation of existing cloud schema-less and array databases for scientific data use. For our initial evaluation, we have selected MongoDB [11], Cassandra [2], HBase [10], and SciDB [1, 13]. Specifically, our evaluation includes:

• A comparison of Cassandra, HBase and MongoDB using the Yahoo! Cloud Serving Benchmark (YCSB)
• A comparison of Cassandra, SciDB and PostgreSQL using two scientific data sets
• An investigation of the impact of various factors on performance in the context of Cassandra

The rest of the paper is organized as follows. In Section 2, we discuss the methodology of our evaluation and detail results in Section 3. We discuss related work and conclusions in Sections 4 and 5 respectively.

2. METHODOLOGY

In this section, we describe our experiment setup, including the databases, the workloads and the data sets.

2.1 NoSQL and Array Databases

Schema-less databases are typically classified as document-oriented, key-value, and graph-based. A document-oriented database stores, retrieves and manages semi-structured or document-oriented information.
Key-value stores use key-value pairs. HBase, part of the Hadoop ecosystem, and Cassandra, initially developed by Facebook, are key-value stores. MongoDB, developed and supported by 10gen, is a document-oriented store. We use Cassandra 1.1.2, MongoDB 2.4.5 and HBase 0.20.3 for our evaluation.

SciDB is an array-based database that has recently gained traction for scientific data that can be represented in an array model. Our selection criteria are based on applicability to scientific data, feasibility, active development, and community support. We use SciDB 13.3 in our evaluation.

The goal of our evaluation is not necessarily to determine the best product, but rather to understand the state of the art and the application of each class of schema-less database to scientific data.

2.2 Workloads and Data Sets

First, we perform a comparison using the Yahoo! Cloud Serving Benchmark (YCSB) framework. YCSB facilitates performance comparisons of the new generation of cloud data serving systems. The benchmark supports a number of different workloads, from update heavy, read heavy, and read only to read latest and short ranges (Table 1). The workload consists of 100 million records where each record is 1 KB (10 fields of 100 bytes).

Table 1: The table shows the configuration of the YCSB workloads.
Workload A (update heavy workload): mix of 50/50 reads and writes; Zipfian distribution.
Workload B (read mostly workload): 95/5 read/write mix; Zipfian distribution.
Workload C (read only): 100% read; Zipfian distribution.
Workload D (read latest workload): new records are inserted, and the most recently inserted records are the most popular (read/update/insert ratio: 95/0/5); Zipfian distribution.
Workload E (short ranges): short ranges of records are queried instead of individual records (read/update/insert ratio: 95/0/5); Uniform distribution.
Workload F (read-modify-write): the client reads a record, modifies it, and writes back the changes (read/read-modify-write ratio: 50/50); Zipfian distribution.
Workload G (custom workload): 100% write.
Workload NR (custom workload): record size matching the bioinformatics workload; read only, similar to Workload C.

Second, we use two scientific data sets to compare the performance of Cassandra and SciDB with PostgreSQL. The data sets are from bioinformatics and astronomy.

The first data set consists of non-redundant (nr) protein sequences from bioinformatics. The database had 16 million entries where each entry had the following fields: sequence id, sequence, and checksum. We used the sequence id as the key. The query used for our evaluation fetches all fields based on a sequence id.

The second data set is from astronomy. The data set contains a collection of 55K spectra produced from simulations of different supernovae with different input parameters. Each spectrum contains 2500 wavelength steps and two record types. The data set contains 275 million records. The primary key is a compound key consisting of spectrum id, wavelength id and record type. The query used calculates the Chi-square distance between the given observed spectrum and every spectrum in the set, and returns the ID of the spectrum that has the minimal Chi-square distance.
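To make the astronomy query concrete, the sketch below shows the nearest-spectrum search in Python/NumPy. It assumes each spectrum is a dense array of 2500 flux values keyed by spectrum id, and it uses one common form of the chi-square distance; the exact weighting and data layout used in our runs are not reproduced here, so treat this as an illustration rather than the actual implementation.

```python
import numpy as np

def chi_square_distance(observed, model):
    # One common form of the chi-square statistic: sum((O - E)^2 / E).
    # Zero-valued bins are not handled; the exact weighting used in the
    # evaluation is an assumption.
    return np.sum((observed - model) ** 2 / model)

def best_matching_spectrum(observed, spectra):
    """Return the spectrum id with the minimal chi-square distance.

    observed: 1-D array of 2500 wavelength steps.
    spectra:  dict mapping spectrum id -> 1-D array of the same length.
    """
    return min(spectra, key=lambda sid: chi_square_distance(observed, spectra[sid]))
```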
2.3 Machines

All experiments were run on the Jesup testbed at the National Energy Research Scientific Computing Center (NERSC). Jesup is a 160-node testbed cluster used to explore emerging hardware and software technologies. The nodes consist of quad-core Intel Xeon X5550 ("Nehalem") 2.67 GHz processors (eight cores/node) with 24 GB of memory per node.
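Returning to the data sets in Section 2.2: the bioinformatics query is a single-key lookup, which maps naturally onto the key-value stores evaluated here. A minimal sketch using PyCassa (the Python client library discussed in Section 3.3) is shown below; the keyspace and column family names are ours, not the actual schema used in the experiments.

```python
import pycassa

# Host list and keyspace/column family names are illustrative.
pool = pycassa.ConnectionPool('nr', server_list=['node1:9160'])
sequences = pycassa.ColumnFamily(pool, 'sequences')

def fetch_sequence(sequence_id):
    """Fetch all fields (sequence, checksum) stored under a sequence id."""
    try:
        return sequences.get(sequence_id)   # OrderedDict of column name -> value
    except pycassa.NotFoundException:
        return None                         # missing reads are examined in Section 3.3
```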

Figure 1: The figure shows the comparison of performance (transactions per second) with the different workload types (Table 1) across the different NoSQL databases.

Figure 2: The figure shows the comparison of Cassandra and SciDB with the bioinformatics data set. Cassandra provides higher transactions/second than SciDB at one node, and the difference is significant when we scale to 12 nodes.

Figure 3: The figure shows the comparison of Cassandra, SciDB and PostgreSQL for the astronomy data set. The array-based queries give superior performance with SciDB.

3. RESULTS

We evaluate the various NoSQL stores and SciDB using the YCSB benchmarks and two scientific data sets.

3.1 YCSB

Figure 1 shows the comparison of MongoDB, HBase and Cassandra using the standard YCSB benchmarks. MongoDB does better than Cassandra and HBase for the read-only workload (workload C). Cassandra shows over 8× the performance of the other databases for Workload A, which has 50% reads and 50% writes. Cassandra shows between 3× and 5× the performance of the other databases with workloads B and D, where reads dominate the workload. Cassandra does only slightly better than the other databases with short-range queries and read-modify-write queries (workloads E and F). Cassandra does about 2× better than the other databases with the write-only workload (workload G).

In addition, we also configured YCSB with a data set that is similar to our bioinformatics sequence data (marked as NR in the figure). We matched the record size, fields and record count to be identical to our bioinformatics data set. The workload shows similar trends as workload C since our workload focused only on the queries/reads.
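The workloads in Table 1 map directly onto YCSB's core workload properties. As an illustration, a property file approximating the read-only Workload C over the 100-million-record data set of Section 2.2 (10 fields of 100 bytes) might look as follows; the property names come from YCSB's CoreWorkload, but the file itself is our sketch rather than the exact configuration used in these experiments.

```properties
# Illustrative YCSB core-workload properties approximating Workload C
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=100000000
operationcount=100000000
fieldcount=10
fieldlength=100
readproportion=1.0
updateproportion=0
insertproportion=0
requestdistribution=zipfian
```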

Figure 6: The figure compares the effects of varying database size with the Cassandra-Cli tools. Panels (a)-(d) correspond to 25%, 50%, 75% and 100% of the entire data set (3.84, 7.68, 11.53 and 15.38 million reads respectively); each panel plots query time in seconds against the percentage of queries (0.5% of the input data per thread/client) for 1, 6 and 12 data nodes.

Figure 4: The figure compares the impact of different client side tools (YCSB, PyCassa and Cassandra-Cli) in a case where the number of records in the database and the records queried are identical. The plot shows transactions per second against the number of client threads (32 to 512) at 12 data nodes.

Figure 7: The figure shows the performance of Cassandra when querying for missing reads. The plot shows client time in milliseconds at 1 and 12 data nodes; the time for a query increases with the number of data nodes when querying for missing records.

Figure 5: The figure compares the loading performance (transactions per second) with YCSB, PyCassa and the bulk loader as the number of data nodes is varied from 1 to 12. In all cases 16 million operations are performed and 32 client threads/node are used.

Figure 8: The figure shows the effect of increasing data nodes (1 to 12) as the number of client threads grows from 32 to 512. The effective throughput from adding additional nodes is evident when there are a large number of concurrent client threads.

3.2 Scientific Data Sets

Bioinformatics data. Figure 2 shows a comparison of Cassandra and SciDB with the NR sequence data, where the Y-axis is in log scale. SciDB is able to provide only about 100 transactions/second for such queries, and thus Cassandra provides significantly higher performance.

Astronomy data. Figure 3 shows the comparison of Cassandra, SciDB and PostgreSQL for the astronomy data set. SciDB is specifically designed to handle array processing, and hence SciDB gives the best performance at one node and is over 1.5× faster than Cassandra at 12 nodes.

3.3 Miscellaneous

From our early experiments, Cassandra provided the best performance for a wide range of workload types. Thus, we performed additional experiments with Cassandra to understand a number of other characteristics of NoSQL databases.

Client tools. We initially started out using Cassandra-Cli, which is a very basic interactive command line interface. However, we noticed that the performance we were getting was well below the advertised performance. We shifted to PyCassa, the Python client library for Cassandra. Figure 4 shows the comparison of the client side tools at 12 data nodes with the bioinformatics data set. PyCassa performs better than the Cli and is almost 1.5× faster at 512 client threads. Finally, we used the YCSB Java client tool to generate data sets similar to the bioinformatics workload, and the throughput achieved through this is even higher, especially at an increased number of client threads. This is probably due to known drawbacks of Python-based tools, such as the Global Interpreter Lock, compared to Java.

Database loading. Figure 5 shows our experience with database loading using various tools with varying data nodes. In each case, we performed 16 million operations using 32 client threads on a single node. On a single data node, the PyCassa client performs slightly better than the YCSB client tools (i.e., Java-based client tools). However, as the number of data nodes increases, the YCSB client tools perform better than PyCassa. The bulk loader tool does not support multiple threads at this time and performed slower than the other tools.

Database size. Figure 6 compares the impact of varying the database size and the number of client threads. Each client thread queries 0.5% of the input data set. In this experiment, we vary the number of data nodes, the number of clients and the size of the database. Figure 6(a) shows that as the number of client threads is increased on a single-node setup, queries take longer. This trend is also seen in the other three graphs (Figure 6(b-d)). Additionally, as the size of the database grows, it impacts the query times. For example, when only 3.84 million sequence reads are loaded, the queries complete in the 20-40 second range. However, at 15.38 million records, query times start at 60 seconds and are sometimes as high as 90 seconds. The impact of additional data nodes is a little less clear in this experiment.

Missing reads. In addition to queries that return successful records, it is important to understand how datastores perform when querying for items that may not be present in the database. Figure 7 shows the effect on the performance of Cassandra, using PyCassa, when querying for items that are not present in the database. In this experiment, we used a single client with a single thread on 100% of the data set and queried for keys that are not present in the data set. In this case, while queries return in less than one millisecond, the additional data nodes do impact the performance when querying for something that is not present in the database.

Data node scaling. Figure 8 shows the effect of adding additional data nodes. In this case we run 32 threads per client node. As the number of client threads increases, we see that the effective throughput increases with the additional data nodes. For example, at 512 client threads, we see a performance improvement of over 3× when the number of data nodes is increased from one to 12.
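The scaling experiments above drive Cassandra from many concurrent client threads. A minimal sketch of such a multi-threaded read client using PyCassa is shown below; the thread count, key sampling and keyspace/column family names are illustrative rather than the actual harness we used, and, as noted under Client tools, a Python client is ultimately limited by the Global Interpreter Lock.

```python
import random
import threading
import time

import pycassa

def run_read_client(keys, num_threads=32, duration=60):
    """Issue random point reads from num_threads threads and report throughput."""
    # Keyspace, column family and host list are illustrative.
    pool = pycassa.ConnectionPool('nr', server_list=['node1:9160'],
                                  pool_size=num_threads)
    cf = pycassa.ColumnFamily(pool, 'sequences')
    counts = [0] * num_threads
    stop = time.time() + duration

    def worker(idx):
        while time.time() < stop:
            try:
                cf.get(random.choice(keys))
            except pycassa.NotFoundException:
                pass                    # missing reads are measured separately
            counts[idx] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts) / float(duration)   # transactions per second for this client
```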
4. RELATED WORK

Cooper et al. [5] describe the YCSB benchmark and compare different NoSQL and relational databases (Cassandra, HBase, Yahoo!'s PNUTS [4], and MySQL). Dory et al. [8] study the elastic scalability of MongoDB, HBase and Cassandra on a cloud infrastructure. These studies do not consider the performance achievable specifically for scientific applications using these frameworks.

Previous work has compared MongoDB and Hive to the relational database SQL Server PDW using the YCSB and TPC-H DSS [15] benchmarks. Previous works have evaluated both MongoDB and Hadoop MapReduce performance for an evolutionary genetic algorithm and other data-intensive applications [16, 7].

However, there is no prior work that characterizes the performance of NoSQL and array databases with real scientific applications.

5. CONCLUSIONS

In this paper, we present an evaluation of NoSQL and array databases using standard benchmarks and scientific data sets. Our results demonstrate that it is important to understand the query load distribution and carefully determine the right parameters to get the optimal performance. Our results also indicate that the choice of client side tools might significantly impact performance and must be considered when making choices.

It is important to note that many of these products are evolving and the performance obtained is improving. The performance is also closely tied to the specific hardware and/or software setup. The choice of a database for a particular project might also depend on qualitative factors such as the nature of the data and the availability of support for the software.

6. ACKNOWLEDGMENTS

This work was supported by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors would like to thank Elif Dede.

7. REFERENCES

[1] P. G. Brown. Overview of SciDB: large scale array storage, processing and analysis. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pages 963-968, New York, NY, USA, 2010. ACM.
[2] Cassandra website. http://cassandra.apache.org/.
[3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26:4:1-4:26, June 2008.
[4] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2):1277-1288, Aug. 2008.
[5] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 143-154, New York, NY, USA, 2010. ACM.
[6] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 205-220, New York, NY, USA, 2007. ACM.
[7] E. Dede, M. Govindaraju, D. Gunter, R. Canon, and L. Ramakrishnan. Semi-structured data analysis using MongoDB and MapReduce: A performance evaluation. In Proceedings of the 4th International Workshop on Scientific Cloud Computing, 2013.
[8] T. Dory, B. Mejías, P. Van Roy, and N.-L. Tran. Measuring elasticity for cloud databases. In CLOUD COMPUTING 2011, The Second International Conference on Cloud Computing, GRIDs, and Virtualization, pages 154-160, 2011.
[9] D. Gunter, S. Cholia, A. Jain, M. Kocher, K. Persson, L. Ramakrishnan, S. P. Ong, and G. Ceder. Community accessible datastore of high-throughput calculations: Experiences from the Materials Project. In 5th Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS), 2012.
[10] HBase website. http://hbase.apache.org/.
[11] MongoDB website. http://www.mongodb.org/.
[12] L. Ramakrishnan and R. S. Canon. Experiences in building a data packaging pipeline for tomography beamline. In Big Data Analytics: Challenges and Opportunities (BDAC-13) Workshop, held in conjunction with Supercomputing, 2013.
[13] SciDB website. http://www.scidb.org/.
[14] A. Shoshani and D. Rotem, editors. Scientific Data Management: Challenges, Technology, and Deployment. CRC Press, 2009.
[15] The TPC-H benchmark. http://www.tpc.org/tpch/.
[16] A. Verma, X. Llora, S. Venkataraman, D. Goldberg, and R. Campbell. Scaling eCGA model building via data-intensive computing. In Evolutionary Computation (CEC), 2010 IEEE Congress on, pages 1-8, July 2010.