Quantitative Analysis of Scalable NoSQL Databases
QUANTITATIVE ANALYSIS OF SCALABLE NOSQL DATABASES

by

SURYA NARAYANAN SWAMINATHAN

Presented to the Faculty of the Graduate School of
The University of Texas at Arlington
in Partial Fulfillment of the Requirements
for the Degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE

THE UNIVERSITY OF TEXAS AT ARLINGTON

December 2015

Copyright © by SURYA NARAYANAN SWAMINATHAN 2015
All Rights Reserved

To my professor Dr. Ramez Elmasri

ACKNOWLEDGEMENTS

I express my sincere gratitude to my advisor, Dr. Ramez Elmasri. I would also like to thank Dr. Leonidas Fegaras for his support. Special thanks to Professor David Levine for taking a special interest in my research beyond the call of duty.

I am grateful to Professor Ronald Cross for his support, motivation and guidance.

I would also like to thank my research mates Mr. Mohammadhani Fouladgar, Mr. Upa Gupta, Mr. Neelabh Pant and Mr. Vivek Sharma for their support and camaraderie.

Finally, I extend my gratitude towards everyone in the Computer Science department for the memorable experiences I had during my master's.

November 20, 2015

ABSTRACT

QUANTITATIVE ANALYSIS OF SCALABLE NOSQL DATABASES

SURYA NARAYANAN SWAMINATHAN, M.S.
The University of Texas at Arlington, 2015

Supervising Professor: Ramez Elmasri

NoSQL databases are rapidly becoming the customary data platform for big data applications. These databases offer alternative approaches to data management outside traditional relational databases and are characterized by efficient horizontal scalability, a schema-less approach to data modeling, high-performance data access, and limited querying capabilities. Because transactional semantics vary among NoSQL databases, the application itself must determine which consistency model is appropriate. It is therefore essential to examine methodically, and in detail, the performance of different databases under different workload conditions.
In this work, three of the most commonly used NoSQL databases are evaluated: MongoDB, Cassandra, and HBase. The Yahoo! Cloud Serving Benchmark (YCSB), a popular benchmarking tool, was used to compare the performance of the different NoSQL databases. The databases are deployed on a cluster, and experiments are performed with different numbers of nodes to assess the impact of cluster size. A benchmark suite is presented that evaluates each database's capacity to scale horizontally, as well as each database's performance under various types of workload operations (create, read, write, scan) on varying dataset sizes.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES

Chapter

1. INTRODUCTION
2. BACKGROUND
   2.1 Distributed databases
   2.2 Neologism
   2.3 Evolution
   2.4 Brewer's Conjecture
       2.4.1 CAP theorem
       2.4.2 CAP in NoSQL systems
       2.4.3 Issues in CAP
       2.4.4 CAP - an alternate view
3. SQL vs. NOSQL SYSTEMS
   3.1 ACID vs. BASE
   3.2 Scalability
       3.2.1 Vertical scalability
       3.2.2 Horizontal scalability
       3.2.3 Elastic scalability
   3.3 Speed vs. Features
       3.3.1 SQL region
       3.3.2 NoSQL region
   3.4 Data Replication
       3.4.1 Types of replication
       3.4.2 Replication strategies
       3.4.3 Replication in SQL and NoSQL systems
   3.5 Fragmentation
       3.5.1 Fragmentation schemes
       3.5.2 Fragmentation in SQL systems
       3.5.3 Fragmentation in NoSQL systems
   3.6 Query Languages
       3.6.1 Query languages in relational databases
       3.6.2 Query languages in NoSQL databases
4. CATEGORIES OF NOSQL SYSTEMS
   4.1 Document-store
       4.1.1 MongoDB
       4.1.2 CouchDB
   4.2 Key-value stores
       4.2.1 DynamoDB
       4.2.2 Redis
   4.3 Tabular stores
       4.3.1 HBase
   4.4 Graph stores
   4.5 Heterogeneous data stores
       4.5.1 DynamoDB
       4.5.2 OrientDB
       4.5.3 Cassandra
   4.6 Other NoSQL databases
       4.6.1 Object databases
       4.6.2 XML databases
5. EXPERIMENTAL SETUP
   5.1 Implementation details
   5.2 Database configurations
       5.2.1 MongoDB
       5.2.2 Cassandra
       5.2.3 HBase
   5.3 Benchmarking Tool
       5.3.1 Architecture
       5.3.2 Workload properties
       5.3.3 Workloads definition
   5.4 Test cases
       5.4.1 Size of the database
       5.4.2 Operation count
       5.4.3 Cluster size
       5.4.4 Comprehensive view
6. RESULTS
   6.1 MongoDB
       6.1.1 50% Read - 50% Write
       6.1.2 100% Read
       6.1.3 100% Blind Write
       6.1.4 100% Read-Modify-Write
       6.1.5 100% Scan
   6.2 Cassandra
       6.2.1 50% Read - 50% Write
       6.2.2 100% Read
       6.2.3 100% Blind Write
       6.2.4 100% Read-Modify-Write
       6.2.5 100% Scan
   6.3 HBase
       6.3.1 50% Read - 50% Write
       6.3.2 100% Read
       6.3.3 100% Blind Write
       6.3.4 100% Read-Modify-Write
       6.3.5 100% Scan
   6.4 MongoDB vs. Cassandra vs. HBase
       6.4.1 50% Read - 50% Write
       6.4.2 100% Read
       6.4.3 100% Blind Write
       6.4.4 100% Read-Modify-Write
       6.4.5 100% Scan
       6.4.6 Summary
7. CONCLUSION AND FUTURE WORK
   7.1 Conclusion
   7.2 Future Work

REFERENCES
BIOGRAPHICAL STATEMENT

LIST OF ILLUSTRATIONS

Figure

3.1 Vertical scalability vs. Horizontal scalability
3.2 Scaling with Single server (scaling-up) and Multiple servers (scaling-out)
3.3 Speed and Scalability vs. Features
5.1 Architecture of YCSB client, adapted from [3]
5.2 Sample data generated by YCSB in Cassandra
5.3 Scope of the operation count and definition of throughput
5.4 Scope of the cluster size
5.5 Comprehensive view of the scope of the experiments
6.1 MongoDB: 50% Read - 50% Write
6.2 MongoDB: 100% Read