An Evaluation of Key-Value Stores in Scientific Applications
A Thesis Presented to the Faculty of the Department of Computer Science, University of Houston
In Partial Fulfillment of the Requirements for the Degree Master of Science
By Sonia Shirwadkar
May 2017

APPROVED:
Dr. Edgar Gabriel, Chairman, Dept. of Computer Science, University of Houston
Dr. Weidong Shi, Dept. of Computer Science, University of Houston
Dr. Dan Price, Honors College, University of Houston
Dean, College of Natural Sciences and Mathematics

Acknowledgments

"No one who achieves success does so without the help of others. The wise acknowledge this help with gratitude." - Alfred North Whitehead

Although I have a long way to go before I am wise, I would like to take this opportunity to express my deepest gratitude to all the people who have helped me in this journey.

First and foremost, I would like to thank Dr. Gabriel for being a great advisor. I appreciate the time, effort, and ideas that you have invested to make my graduate experience productive and stimulating. The joy and enthusiasm you have for research were contagious and motivational for me, even during tough times. You have been an inspiring teacher and mentor, and I would like to thank you for the patience, kindness, and humor that you have shown. Thank you for guiding me at every step and for the incredible understanding you showed when I came to you with my questions. It has indeed been a privilege working with you.

I would like to thank Dr. Shi and Dr. Price for agreeing to serve as my committee members. I truly appreciate the time and effort you spent in reviewing my thesis and providing valuable feedback.

A special thanks to my PSTL lab-mates Shweta, Youcef, Tanvir, and Raafat. You have contributed immensely to my personal and professional time at the University of Houston.
The last nine months have been a joy mainly because of the incredible work environment in the lab. Thank you for being great friends and for all the encouragement that you have given me.

A big thanks to Hope Queener and Jason Marsack at the College of Optometry for teaching me the value of teamwork and work ethics. I truly enjoyed working with you.

I have been extremely fortunate to have the constant support, guidance, and faith of my friends. A big thank you to all my friends in India for constantly motivating me to follow my dreams. Thank you for the late-night calls, care packages, and all the love that you have given me in the time that I have been away from home. I would like to thank my friends Omkar, Tejus, Sneha, Sonal, Aditya, and Shweta for being my family away from home. I will forever be grateful for the constant assurance and encouragement that you gave me. I would also like to thank my friends, classmates, and roommates here in Houston for all their help and support.

A special thanks to all my teachers. I would not be here if not for the wisdom that you have shared. You have empowered me to chase my dreams. Each one of you has taught me important life lessons that have always guided me. I will be eternally grateful to have been your student.

Last but by no means least, I would like to thank my family for always being there for me. I would like to start by thanking my Mom and Dad for their unconditional love and support. A very big thank you to Kaka and Kaku for all their love, concern, and advice. You all have taught me the beauty of hard work and perseverance, and this thesis would never have been possible without you. Finally, I would like to thank Parikshit for being my greatest source of motivation. You inspire me every day to be a better version of myself, and I would never have made it without you.

Abstract

Big data analytics is a rapidly evolving multidisciplinary field that involves the use of computing capacity, tools, techniques, and theories to solve scientific and engineering problems. With the big data boom, scientific applications now have to analyze huge volumes of data. NoSQL [1] databases are gaining popularity for these types of applications due to their scalability and flexibility. There are various types of NoSQL databases available in the market today [2], including key-value databases. Key-value databases [3] are the simplest NoSQL databases, in which every item is stored as a key-value pair. In-memory key-value stores are specialized key-value databases that maintain data in main memory instead of on disk. Hence, they are well-suited for applications with frequent alternating read and write cycles.

The focus of this thesis is to analyze popular in-memory key-value stores and compare their performance. We performed the comparisons based on parameters such as in-memory caching support, supported programming languages, scalability, and utilization from parallel applications. Based on the initial comparisons, we evaluated two key-value stores in detail, namely Memcached [4] and Redis [5]. To analyze these two data stores extensively, a set of micro-benchmarks was developed and run against both Memcached and Redis. Tests were performed to evaluate scalability, responsiveness, and data load handling capacity, and Redis outperformed Memcached in all test cases.
To further analyze the in-memory caching ability of Redis, we integrated it as a caching layer into an air quality simulation [6] based on Hadoop [7] MapReduce [8], which calculates the eight-hour rolling average of ozone concentration at various sites in Houston, TX. Our aim was to compare the performance of the original air-quality application, which uses the disk for data storage, with that of our application, which uses in-memory caching. Initial results show no performance gain from integrating Redis as a caching layer. Further optimization and configuration of the code are reserved for future work.

Contents

1 Introduction
  1.1 Brief Overview of Key-Value Data Stores
  1.2 Goals of this Thesis
  1.3 Organization of this Document
2 Background
  2.1 In-memory Key-value Stores
    2.1.1 Redis
    2.1.2 Memcached
    2.1.3 Riak
    2.1.4 Hazelcast
    2.1.5 MICA (Memory-store with Intelligent Concurrent Access)
      2.1.5.1 Parallel Data Access
      2.1.5.2 Network Stack
      2.1.5.3 Key-value Data Structures
    2.1.6 Aerospike
    2.1.7 Comparison of Key-Value Stores
  2.2 Brief Overview of Message Passing Interface (MPI)
  2.3 Brief Overview of MapReduce Programming and Hadoop Eco-system
    2.3.1 Integration of Key-Value Stores in Hadoop
3 Analysis and Results
  3.1 MPI Micro-benchmark
    3.1.1 Description of the Micro-benchmark Applications
      3.1.1.1 Technical Data
    3.1.2 Comparison of Memcached and Redis using our Micro-benchmark
      3.1.2.1 Varying the Number of Client Processes
        3.1.2.1.1 Using Values of Size 1 KB
        3.1.2.1.2 Using Values of Size 32 KB
      3.1.2.2 Varying the Number of Server Instances
      3.1.2.3 Varying the Size of the Value
      3.1.2.4 Observations and Final Conclusions
  3.2 Air-quality Simulation Application
  3.3 Integration of Redis in Hadoop
    3.3.1 Technical Data
  3.4 Results and Comparison
4 Conclusions and Outlook
Bibliography

List of Figures

1.1 Key-value pairs
2.1 Redis Cluster
2.2 Redis in a Master-Slave Architecture
2.3 Memcached Architecture
2.4 Riak Ring Architecture
2.5 Hazelcast In-memory Computing Architecture
2.6 Hazelcast Architecture
2.7 MICA Approach
2.8 Aerospike Architecture
2.9 Word Count Using Hadoop MapReduce
3.1 Time Taken to Store and Retrieve Data When the Number of Client Processes is Varied.
3.2 Time Taken to Retrieve Data When the Number of Client Processes is Varied.
3.3 Time Taken to Store and Retrieve Data When the Number of Servers is Varied.
3.4 Time Taken to Store and Retrieve Data when the Value Size is Varied.
3.5 Customized RecordWriter to Read in Data from Redis
3.6 Customized RecordReader to Write Data to Redis
3.7 Comparison of Execution Times (in minutes) for Air-quality Applications Using HDFS and Redis.

List of Tables

2.1 Summary of features of key-value stores
3.1 Time taken to store and retrieve data when number of client processes is varied.
3.2 Time taken to store and retrieve data when number of client processes is varied.
3.3 Time taken to store and retrieve data when the number of servers is varied.
3.4 Time taken to store and retrieve data when the size of the value is varied.
3.5 Time taken to execute original air-quality application