
A COMPARATIVE STUDY ON DIFFERENT BIG DATA TOOLS

A Paper Submitted to the Graduate Faculty of the North Dakota State University of Agriculture and Applied Science

By

Sifat Ibtisum

In Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE

Major Department: Computer Science

September 2020

Fargo, North Dakota

North Dakota State University Graduate School

Title

A COMPARATIVE STUDY ON DIFFERENT BIG DATA TOOLS

By

Sifat Ibtisum

The Supervisory Committee certifies that this disquisition complies with

North Dakota State University’s regulations and meets the accepted standards for the degree of

MASTER OF SCIENCE

SUPERVISORY COMMITTEE:

Kendall Nygard Chair

Pratap Kotala

Chad Ulven

Approved:

October 15, 2020                                    Simone Ludwig
Date                                                Department Chair

ABSTRACT

Big data has long been a topic of fascination for computer science enthusiasts around the world, and it has gained even more prominence in recent times with the continuous explosion of data resulting from the likes of social media and tech giants' quest for deeper analysis. This paper discusses various tools in big data technology and conducts a comparison among them. Different tools, namely Hadoop, MapReduce, Sqoop, Hive, Spark, and many more, are included. Various datasets are used for the experiments, and a comparative study is made to determine which tools work faster and more efficiently than the others, and to explain the reasons behind this.


ACKNOWLEDGMENTS

First and foremost, praises and thanks to the almighty Allah for His blessings throughout everything and for finally allowing me to finish my paper, which is the last requirement of my Master's program.

I would like to express my deep and sincere gratitude to my research supervisor, Dr.

Kendall E Nygard, Professor and Head of the Department of Computer Science at North Dakota

State University, for giving me the opportunity to do research and providing invaluable guidance throughout this research. His dynamism, vision, sincerity, and motivation have deeply inspired me. He taught me the methodology to carry out the research and to present the research work as clearly as possible. It was a great privilege and honor to work and study under his guidance. I am extremely grateful for what he has offered me. I would also like to thank him for his friendship, empathy, and great sense of humor. Later, in his absence, Dr. Simone Ludwig took over as my chair, and the way she guided me is truly praiseworthy. Her sincerity, hard work, and punctuality were the key reasons why I was able to finish my paper so early. I would also like to extend my gratitude to my co-supervisors, Dr. Ulven and Dr. Kotala, for extending their support in the best possible way.

My heartiest thanks go to the person who brought me here, my mother, Dr. Ferdouse

Islam. My father passed away when I was 13; since then, it was my mom who did everything possible to bring me to where I am now. Without her unconditional support, I could never have become the person I am today. I can still remember those early days when all the odds were against us, but it was she who single-handedly made things possible. I cannot put into words how proud, honored, and blessed I am to be her son. May the Almighty grant her a healthy and long life. Moreover, my little


sister Prioty, who made countless sacrifices so that I could achieve my dreams. I wish I could ever put into words how thankful I am to you, my little angel.

I would also like to extend my gratitude to my family and friends who believed in my dreams and helped me to progress. Among my family, I am truly thankful to Jes and Mr. Rony for their exceptional support. And among my friends, Mr. Rafsan and Mr. Opu are the two people who acted like my elder brothers and did everything they possibly could so that I could achieve my dream. I am truly honored to have you in my life.

Lastly, I want to thank everyone in the Computer Science Department, the Graduate School, and the International Office of North Dakota State University for guiding me throughout this journey, and last but not least, every member of NDSU for making my journey so exceptional and memorable.


DEDICATION

I dedicate this work to the most valuable person of my life, my mom.


TABLE OF CONTENTS

ABSTRACT ...... iii

ACKNOWLEDGMENTS ...... iv

DEDICATION ...... vi

LIST OF TABLES ...... x

LIST OF FIGURES ...... xi

1. INTRODUCTION ...... 1

1.1. Characteristics of Big Data ...... 3

1.2. Benefits of Big Data and Data Analytics ...... 4

1.3. What is Big Data ...... 5

1.4. Importance of Big Data Processing ...... 8

1.5. Application of Big Data ...... 10

1.6. Big Data and Cloud Computing Challenges ...... 24

2. TARGET AUDIENCE OF THE BIG DATA TOOLS ...... 26

2.1. Knowing your Current and Future Customers ...... 26

2.2. Targeting your Customers ...... 27

2.3. Improving the Customer Experience ...... 27

2.4. Being Proactive ...... 28

3. LIST OF TOOLS USED IN BIG DATA ...... 29

3.1. Sqoop ...... 29

3.2. Apache Flume ...... 29

3.3. Apache Kafka ...... 30

3.4. Hive ...... 31

3.5. Hbase ...... 31

3.6. Hadoop Distributed File System (HDFS) ...... 32

3.7. ...... 32

3.8. MapReduce ...... 33

3.9. Pig ...... 33

3.10. Yarn ...... 34

3.11. Zookeeper ...... 34

3.12. ...... 34

4. TRANSFORMATION TO NOSQL FROM RDMS ...... 35

4.1. Relational ...... 35

4.2. ACID Properties ...... 36

4.3. NoSql DataBase ...... 36

4.4. NoSQL Database Types ...... 37

4.4.1. Document Store Database ...... 37

4.4.2. Key Value Store Database ...... 38

4.4.3. Graph Stores Database ...... 40

4.4.4. Wide column stores database ...... 41

4.4.5. Performance of NoSql and Sql Database for Big Data ...... 44

5. DESCRIPTION OF DATASETS ...... 46

5.1. Iris Data Set ...... 46

5.2. Wine Data Set ...... 46

5.3. Liver Disorders Data Set ...... 47

5.4. Heart Disease Data Set ...... 47

5.5. Solar power Generated Data ...... 48

6. EXPLANATION OF THE EXPERIMENT ...... 49

6.1. Gathering data ...... 49

6.2. Cleaning ...... 49

6.3. Language ...... 49

6.4. Platform ...... 50

6.5. Framework ...... 51

6.6. Features of ...... 51

7. COMPARATIVE STUDY AMONG VARIOUS BIG DATA TOOLS...... 52

7.1. Tez vs MapReduce ...... 52

7.1.1. Experiment ...... 53

7.1.2. MapReduce Code ...... 54

7.2. Sqoop Vs Flume ...... 61

7.3. Why Spark is Faster Than MapReduce ...... 67

7.4. Experiment ...... 69

7.4.1. Result ...... 70

7.5. Hive VS PIG ...... 70

8. CONCLUSION ...... 78

REFERENCES ...... 79

LIST OF TABLES

Table Page

1: Summarized view of SQL and NoSQL ...... 43

2: Execution time between MR and Tez on sample queries ...... 61

3: Execution time taken by Hive with Spark ...... 69

4: Execution time taken by Hive with MapReduce ...... 69

5: Comparison between Hive and Pig ...... 74

6: Execution time taken by Hive vs Pig...... 77

LIST OF FIGURES

Figure Page

1: 5 Vs of Big Data [33] ...... 8

2: Sqoop architecture [2] ...... 29

3: Flume architecture [2] ...... 30

4: Hbase architecture [3] ...... 31

5: HDFS architecture [2] ...... 32

6: Document store NoSQL database [30] ...... 38

7: Key value store NoSQL database [29] ...... 39

8: Graph store NoSQL database [12] ...... 41

9: Wide column store NoSQL database [29] ...... 42

10: Hive- MR architecture [1] ...... 53

11: Hive-Tez architecture [1] ...... 53

12: Time taken by MR vs Tez...... 61

13: Time in seconds taken by Spark vs MR ...... 70

14: Hive data model ...... 73

15: Pig Latin model ...... 73

16: Time taken by Hive vs Pig ...... 77


1. INTRODUCTION

Almost 2.5 quintillion bytes of data are generated each day, and around 90% of all existing data was created in just the last two years. Data at this point accumulates from everywhere, including climate-sensing instruments, social media websites, videos and pictures, transaction records (e.g., banking records), cell phone calls, GPS signals, etc.

In this paper, all the major tools of big data are introduced. The prime focus of this paper is to make a comparison among the tools in big data technology, explaining their pros and cons. Different experiments are then run using data sets of various sizes to validate the study and explain the results. Graphical representations are used to visually show how one tool outperforms the others for certain types of data.

With the ever-increasing use of the internet, sensors, and heavy machinery generating information at a very high rate, data with sheer volume, velocity, variety, and veracity is termed Big Data. Data is everywhere, in every industry, in the form of numbers, images, videos, and text. As data continues to grow, it becomes difficult for computing systems to manage big data due to the immense speed and volume at which it is generated. As the data is enormous and complex, it is stored in a distributed file system architecture. Analyzing the complex data is a risky and time-consuming task, as it involves big distributed file systems, which should be fault tolerant, flexible, and scalable. The process of capturing or collecting big data is known as 'datafication'.

Big data is 'datafied' so that it can be used productively. Big Data cannot be made useful by simply organizing it; rather, its usefulness lies in determining what we can do with it [1].

Big Data is engendered by almost everything around us, such as social networks, government, healthcare, and education, at an alarming volume with high velocity and variety. To pull out meaningful assessments from this enormous data, the best possible processing power, analytical capability, and skills are necessary.

As big data is large in size, the selection of the right data within the larger data set for analyzing the whole is crucial. Predictive analytics and data mining solutions for enterprises are currently available from a number of companies, such as the Predictive Analytics Suite, IBM SPSS Statistics, and Microsoft Dynamics CRM Analytics Foundation [1].

Software on big data platforms and big data analytics focuses on providing efficient analysis over enormously large datasets. Industries like banking, automobiles, healthcare, telecom, government, transportation, and travel will see a major impact through Big Data analytics (BDA). Its purpose is to assist industries in turning data into high-quality information for in-depth insight into their organization's status [1].

The expansion of data can by no means be stopped or restricted. The IDC Digital Universe study, published in 2011, stated that nearly 130 exabytes of data were produced and stored in 2005. This quantity grew dramatically to 1,227 exabytes in 2010 and was projected to grow at 45.2% annually to 7,910 exabytes in 2015. This data can do wonders by extracting the buried assets of information that reside inside it. In 2004, Google introduced MapReduce to the world, which later laid the foundation for Hadoop and other related technologies. The goal of Hadoop was to index the entire WWW (World Wide Web), and today the open-source Hadoop technology is being used by many organizations to munch through large volumes of data [2].

The global phenomenon of using Big Data to gain business value and competitive advantage will only continue to grow, as will the opportunities associated with it. As per MGI and McKinsey's Business Technology Office research, the utilization of Big Data is likely to become a key basis of competition for individual firms, supporting success, growth, consumer surplus, productivity growth, and innovation. The means of using Big Data is by selecting an appropriate Big Data analytics platform and tools, which is very critical for an organization [3].

This paper is a handbook of big data. It provides all the necessary details of big data, including its characteristics and types, its 5 Vs, the various structures of data, why processing big data is so important, the applications of big data, and the different tools used in big data for analyzing datasets. Moreover, various experiments were performed using different datasets to show which tools perform better than others. A comparative study of the data interpretation is shown for a better understanding of the topic.

1.1. Characteristics of Big Data

The origin of large data sets dates to the 1960s and '70s, with the introduction of the first data centers and the development of the relational database.

Around 2005, with the introduction of various social media like Facebook, YouTube, and other online services, people started to realize how much data was being generated by users. In the same year, the open-source framework Hadoop was created specifically to store and analyze big data sets. NoSQL databases, which provide a mechanism for storage and retrieval of data modeled in means other than the tabular relations used in relational databases, were introduced around the same time.

Hadoop made a breakthrough as an open-source framework since it made big data easier to work with and cheaper to store. In the next few years, the volume of big data skyrocketed.

And at this point, it is not just humans who generate data but devices as well, which gives us many crucial insights.


Now, with the help of IoT (Internet of Things), more devices, including phones, watches, fans, and lights, are connected to the internet. They provide lots of detail on performance and user experience, which lets us understand user behavior and patterns. The introduction of machine learning has also produced lots of data.

While big data has come so far, its usefulness is only just increasing. Cloud computing has expanded big data possibilities even further. Cloud offers truly elastic scalability, where developers can simply spin up ad hoc clusters to test a subset of data.

1.2. Benefits of Big Data and Data Analytics

 Big data makes it possible for you to gain more complete answers because you have more information.

 More complete answers mean more confidence in the data—which means a completely different approach to tackling problems.

 Big data gives you new insights that open new opportunities and business models.

Getting started involves three key actions:

a. Integrate: Big data brings together data from many disparate sources and applications. Traditional data integration mechanisms, such as ETL (extract, transform, and load), generally are not up to the task. It requires new strategies and technologies to analyze big data sets at terabyte, or even petabyte, scale. During integration, you need to bring in the data, process it, and make sure it is formatted and available in a form that your business analysts can get started with (a minimal sketch of such a step follows this list).

b. Manage: Big data requires storage. Your storage solution can be in the cloud, on premises, or both. You can store your data in any form you want and bring your desired processing requirements and necessary process engines to those data sets on an on-demand basis. Many people choose their storage solution according to where their data is currently residing. The cloud is gradually gaining popularity because it supports your current compute requirements and enables you to spin up resources as needed.

c. Analyze: Your investment in big data pays off when you analyze and act on your data. Get new clarity with a visual analysis of your varied data sets. Explore the data further to make new discoveries. Share your findings with others. Build data models with machine learning and artificial intelligence. Put your data to work.
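To make the "integrate" step concrete, the following is a minimal, illustrative Java sketch of bringing in raw data, cleaning it, and writing out an analyst-ready file. The file names, the three-column layout, and the cleaning rules are assumptions made purely for this example, not features of any particular big data tool.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class SimpleIntegration {
    public static void main(String[] args) throws Exception {
        // Raw export from one source system (placeholder file name).
        List<String> raw = Files.readAllLines(Path.of("raw_transactions.csv"));

        // Process and format: keep only well-formed three-column rows and
        // normalize case so the output is ready for an analyst or a load step.
        List<String> cleaned = raw.stream()
                .filter(line -> line.split(",", -1).length == 3)
                .map(String::toLowerCase)
                .collect(Collectors.toList());

        Files.write(Path.of("cleaned_transactions.csv"), cleaned);
    }
}

In practice, this kind of step is performed at terabyte scale by the distributed tools compared later in this paper rather than by a single program reading one file, but the shape of the work (ingest, clean, format, hand off) is the same.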

1.3. What is Big Data

Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using conventional data mining techniques and tools. The overall goal of the big data analytics is to extract useful information from a huge data set and transform it into an understandable structure for further use. The major processes of big data include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

Recently the importance of this field has attracted enormous attention because it gives businesses useful information and better insight into both structured and unstructured data, which may lead to better-informed decision-making. In a business context, big data analytics is the process of examining "big data" sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. Today's advances in technology, combined with the recent developments in data analytics algorithms and approaches, have made it possible for organizations to take advantage of big data analytics. Some of the major issues in applying big data analytics successfully include data quality, storage, visualization, and processing. Some business examples of big data are social media content, mobile phone details,


transactional data, health records, financial documents, Internet of things and weather information.

There are many definitions of Big Data framed differently in the past by various researchers, but all of them revolve around the five characteristics of Big Data. These 5 V's of

Big Data are:

1) Variety: The first characteristic of Big Data is Variety, which addresses the various

sources which are generating this Big Data. They are classified into three categories as:

a) Structured Data: Structured data concerns all data which can be stored in table with

rows and columns. These data are considered to be the most organized data, but it

accounts for only 5-10% of the total data available.

b) Semi structured data: Semi-structured data is the information that does not reside in

tables but they possess some properties which make them convertible into structured

data. These are the data coming from web server logs, XML documents etc.

Comparatively less organized than structured data, they also make only 5-10% of data

available.

c) Unstructured data: Unstructured data constitutes the biggest source of Big Data that is

80 - 90%. It includes data in the form of text, images, video, voices, web pages,

emails, word documents and all other multimedia content. These data are very

difficult to store into database. These types of data are both machine and human

generated just like structured and semi structured data.

2) Volume: Volume is the characteristic which makes data Big Data. It denotes the large amount of data which is generated every second. The range of data has greatly increased, crossing from terabytes to peta-, exa-, and now zettabytes. Big data can be measured in terms of:

 Records per Area

 Transactions

 Table

3) Velocity: Data is coming from multiple sources in huge amounts, as explained earlier.

Also, Velocity is one of the characteristics of Big Data which talks about the high data

rate at which it is being generated. Various applications based on data rate are:

 Batch: Batch means running the query in a scheduled and sequential way without any

intervention. Execution is on a batch of input.

 Real Time: Real-time data is defined as the information which is delivered

immediately after its collection. There is no delay in the timeliness of information

provided.

 Interactive: Interactive means executing tasks which require frequent user interaction.

Streaming: The method of processing the data as it comes in is called streaming. The

insight into the data is required as it arrives.

4) Value: It is necessary to fetch meaningful information or patterns from this huge amount

of Big Data which can be used for analysis or determining results on application of

queries. Thus, Value is the characteristic which denotes fetching meaning from Big Data.

The value can be extracted from Big Data as:

 Statistical

 Events

 Correlation

 Hypothetical

5) Veracity: The fifth V of Big data ensures the correctness and accuracy of information.

When dealing with Big Data, along with maintaining its privacy and security, it is also

important to take care of Data quality, data governance and metadata management.

Factors which should be considered are:

 Trustworthiness

 Authenticity

 Accountability

 Availability

Fig 1: 5 Vs of Big Data [33]

1.4. Importance of Big Data Processing

Big data processing is the method of probing big data to uncover hidden patterns, correlations and other useful information that can be used to make improved decisions. With big data analytics, data scientists and others can analyze massive volumes of data that usual analytics and business intelligence solutions cannot tap. Consider that your business could build up


billions of rows of data with millions of data combinations in numerous data stores and plentiful formats. High-performance analytics is essential to process that much data in order to figure out what is significant and what is not. For most organizations, big data analysis is a challenge. Consider the total volume of data and the different formats of data collected across the entire business, and the many ways the data can be pooled, contrasted, and analyzed to uncover patterns and other valuable business information.

There are generally four approaches to data analytics, and each of them falls under either the reactive or the proactive category:

 Reactive (business intelligence). In this category, business intelligence provides standard business reports, ad hoc reports, OLAP, and even alerts and notifications that are based on analytics. Here, ad hoc analysis looks at the static past, which has its use in a limited number of situations.

 Reactive – big data BI. When reporting pulls from huge data sets, we can say this is performing big data BI. But decisions based on these two methods are still reactionary.

 Proactive (big analytics). Creating forward-looking, proactive decisions needs proactive big analytics such as optimization, predictive modelling, text mining, and forecasting. They allow us to recognize trends, mark weaknesses, and determine circumstances for making decisions about the future. But though it is proactive, big analytics still cannot be performed on big data, since traditional storage techniques and processing times cannot keep up.

 Proactive (big data analytics). Using big data analytics, we can mine only the appropriate information from mountains of data, and then analyze it to improve our business decisions for a better future. Becoming proactive with big data analytics is not a one-shot attempt; it is a culture change, a new way of working that frees our analysts and decision makers to face the future with appropriate knowledge and insight.

It becomes very difficult for traditional data analysis, processing, and storage tools to deal with the five characteristics of data simultaneously. Since big data is a recent, emerging technology in the market which can bring huge benefits to business organizations, it is necessary that the various challenges and issues related to it be taken care of and resolved [5].

1.5. Application of Big Data

In today's world, the application of big data is everywhere. Here is a list of the biggest contributors generating such a huge bulk of data:

 Healthcare: Healthcare industry has now shifted from single-physician offices to multi-

provider groups by digitizing, combining and making effective use of data. In many

cases, the multi-provider groups have shifted to large hospital networks as well, where an accountable care unit stands to realize significant benefits and transparency [2].

One potential benefit of big data could be detecting diseases at earlier stages, when they can be treated more easily and effectively; managing the health portal to monitor specific individuals; and detecting health care fraud more quickly and efficiently. Certain

prediction or estimation can be made based on vast historical data, such as length of stay

(LOS), patients who will choose elective surgery; patients who likely will not benefit

from surgery; complications; patients at risk for medical complications; patients at risk

for sepsis, MRSA, C. difficile, or other hospital-acquired illness; illness/disease

progression; patients at risk for advancement in disease states; causal factors of

illness/disease progression; and possible co-morbid conditions (EMC Consulting).

McKinsey estimates that big data analytics are capable of saving $300 billion per year in the U.S. Clinical operations and R&D are two of the largest areas for potential

savings with $165 billion and $108 billion in waste respectively [3].

 Public sector administration: As of today, there are no broad implementations of big data

in the public sector. Compared to other sectors, the public sector has not been

traditionally using data mining technologies intensively. However, there is a growing

interest in the public sector on the potentials of big data for improvement in the current

financial environment.

Some examples of the global growing awareness are the Joint Industry/Government Task

Force to drive development of big data in Ireland, announced by the Irish Minister for Jobs,

Enterprise and Innovation in June 2013 (Government of Ireland 2013), or the announcement made by the Obama administration (The White House 2012), on the “Big Data Research and

Development Initiative" where six Federal departments and agencies announced more than $200 million in new commitments to greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data [6].

 Retail: A lot of people are under the impression that great marketing is an art, but of late,

big data has introduced a scientific element to marketing campaigns. Smart marketers are

now relying on data more than ever to inform, test, and devise their strategies. And

though data and analytics will never replace the creative minds behind the best marketing

campaigns, it can provide the marketers with the tools to help perform better. Consumers

have 24-hour access to abundant product information, which has revolutionized the retail

sector. With digital technology becoming ubiquitous, shoppers can make informed

decisions using online data and content to discover, compare, and buy products from

anywhere and at any time. For brands and retailers, information is also a game-changer.


Retail data analytics can help companies stay abreast of the shopping trends by applying

customer analytics to uncover, interpret, and act on meaningful data insights, including

online shopper and in-store patterns. The retailers—both offline and online—are

adopting the data-first strategy towards understanding the buying behavior of their

customers, mapping them to products, and planning marketing strategies to sell their

products to register increased profits. Today, retailers attempt to find innovative ways to

draw insights from the ever-increasing amount of structured and unstructured information

available about their consumer’s behavior. Big Data analytics is now being applied at

every step of the retail process - right from predicting the popular products to identifying

the customers who are likely to be interested in these products and what to sell them next.

 Manufacturing: The manufacturing industry is currently in the midst of a data-driven

revolution, which promises to transform traditional manufacturing facilities into highly

optimized smart manufacturing facilities. These smart facilities are focused on creating

manufacturing intelligence from real-time data to support accurate and timely decision-

making that can have a positive impact across the entire organization. To realize these

efficiencies emerging technologies such as Internet of Things (IoT) and Cyber Physical

Systems (CPS) will be embedded in physical processes to measure and monitor real-time

data from across the factory, which will ultimately give rise to unprecedented levels of

data production. Therefore, manufacturing facilities must be able to manage the demands

of exponential increase in data production, as well as possessing the analytical techniques

needed to extract meaning from these large datasets. More specifically, organizations

must be able to work with big data technologies to meet the demands of smart

manufacturing. However, as big data is a relatively new phenomenon and potential


applications to manufacturing activities are wide-reaching and diverse, there has been an

obvious lack of secondary research undertaken in the area. Without secondary research, it

is difficult for researchers to identify gaps in the field, as well as aligning their work with

other researchers to develop strong research themes. In this study, we use the formal

research methodology of systematic mapping to provide a breadth-first review of big data

technologies in manufacturing [6].

 Personal location data: Geospatial data has always been big data. In these days, big data

analytics for geospatial data is receiving considerable attention to allow users to analyze

huge amounts of geospatial data. Geospatial big data typically refers to spatial data sets

exceeding capacity of current computing systems. McKinsey Global Institute says that

the pool of personal location data was in the level of 1 PB in 2009 and is growing at a

rate of 20% per year [1]. This estimation did not include the data from RFID sensors and

those stored in private archives. According to the estimation by United Nations Initiative

on Global Geospatial Information Management (UN-GGIM), 2.5 quintillion bytes of data

is being generated every day, and a large portion of the data is location-aware. Also, in

Google, about 25 PB of data is being generated per day, and a significant portion of the

data falls into the realm of spatio-temporal data. This trend will be even accelerated since

the world becomes more and more mobile in these days [7].

 Cost reduction - Big data technologies like Hadoop and cloud-based analytics can

provide substantial cost advantages. There is a rising trend of equipping vehicles from company fleets with IoT (Internet of Things) devices through which management can keep track of the vehicle and the driver's health. Research showed this practice can reduce the cost of fuel consumption and CO2 emissions. Again, since companies like to go through frequent testing of the product and the market before launching their product, analytics platforms make tests less time-consuming and thereby not as expensive. Moreover, cyber-attacks can disrupt website functionality, erode customer trust, and eventually destroy a company's reputation. To stop that, companies need data analytics platforms for cybersecurity purposes that can check network traffic continually and give notification of suspicious activities. Of course, these platforms cost far less than the amount companies lose due to cyber-attacks.

 Faster, better decision making - Analytics has always involved attempts to improve

decision making, and big data does not change that. Following the Big data analytics

really makes the business managers good decision makers. Large organizations are

seeking both faster and better decisions with big data, and they are finding them. Driven

by the speed of Hadoop and in-memory analytics, several companies are focused on

speeding up existing decisions.

 Fraud Detection - High-performance analytics is not just another technology fad. It

represents a revolutionary change in the way organizations harness data. With new

distributed computing options like in-memory processing on commodity hardware,

businesses can have access to a flexible and scalable real-time big data analytics solution

at a reasonable cost. This is sure to change the way insurance companies manage big data

across their business – especially in detecting fraud.

 Fact based decision making: The term “big data” is being labelled as the next big thing as

far as innovation is concerned [8]. The implications of big data for business solutions are

far reaching, extending to all domains including organized retail [35]. Organizations in

this sector are increasingly collecting, storing, and analyzing substantial granular data and


information – acquired through systematic processes and systems – which are essentially about products or services meant for sale (along with all tagged information), such as people (both customers and employees), and transactions (primarily the interactions and sale closings between employee-customer pairs). This data-linked practice to boost consumer purchases has seen more engagement than ever before. The standard collaboration tools include cloud services, email storage, POS data tracker, mobile devices and other similar gadgets required to conduct business and interact with suppliers, customers and other stakeholders [8]. Such applications create, receive and collect machine and/or sensor-generated data messages at very high volumes, which then drive business processes. As such, detailed insights and patterns of data can be identified, which tend to reveal, for example, a customer’s tastes, purchase preferences, levels of spending; and also more importantly, calculated enticement becomes possible. For instance, if User 1 buys Item 1 and wishes for Item 2, whilst User 2 simply purchases

Item 1, then it is highly likely that when strategically focused and pushed, User 2 will also end up wishing for Item 2, and all of this can be managed in real-time. To date, the extant literature and academic discussions have mostly dealt with the aspect of generating more revenues through the sale of, or adding value to, or modifying an already created

Item 2. However, it is plausible that a newer Item 2 can be created from scratch, starting with ideation at the POS itself when a derivative is obtained from interacting knowledge fragments of the employees and the customers. That said, there is a pressing need for education, training and awareness for this, which can also be achieved during these processes. The organizations' expected objective will be to create newer cost-effective products that not only induce a spike in customers' interest (as well as revenues) but also


allow greater diversification in the portfolio of products for competitive advantage.

Decision-making related to creating these cost-effective products will be supplemented by evidence derived from big data. Big data helps organizations become resourceful enough to tackle diverse business challenges [8] and build organizational capability [8]. It is implied that relevant organizational knowledge can be extracted from prodigious volumes of big data available to retail organizations. Along with that, the professionals'/personnel's personal knowledge will also be at those organizations' disposal. Such knowledge may combine and recombine to give rise to an organizational process of knowledge co-creation, involving other stakeholders (customers) through the use of information communication tools (ICT) for their positive impact on SECI

(socialization, externalization, combination and internalization) [9]. Furthermore, the knowledge co-creation process will be catalyzed by inducing factors such as intention, autonomy and fluctuation. In the end, the entire process will lead to possible assistance in evidence-based decision-making. The aforementioned form of decision-making in essence allows efficient and effective decisions to be made. These decisions are likely to generate value for the business [9]. At the same time, during the entire organizational process of knowledge co-creation, the elements of relevant organizational knowledge will be defined by individual knowledge and group knowledge, while personal knowledge will be defined by explicit knowledge and tacit knowledge. In this context, knowledge will refer to knowledge of customers [owing to the nature of the selected sample]. It has been posited that no organization can exist in isolation; they will regularly interact with their environment (which itself changes dynamically), thereby dispersing data, information and knowledge by various means, building the organizations


in that manner as well as helping the organizational members understand their

organizations and their requirements [10]. Thus, a retail organization can rely on efficient

and effective evidence-based decision-making for generating business value through

knowledge co-created in the organization – which carries the potential of relevant

organizational knowledge (at individual and group level) as extracted from big data and

personal knowledge of employees (both explicit and tacit). Evidence-based decision-

making can prove to be a core competency for the organization. In order to reap the

benefits, the organization will be making different attempts for advancing in the forward-

looking direction [10].

 Improved customer experience: Customer sentiments are unreliable and uncertain due to

subjectivity of human opinions. Statistical tools and techniques have been developed to

deal with uncertainty and unreliability of big data with specified confidence levels or

intervals. SAS added two additional dimensions to big data: variability and complexity.

Variability refers to the variation in data flow rates. In addition to the increasing velocity

and variety of data, data flows can fluctuate with unpredictable peaks and troughs.

Unpredictable event-triggered peak data are challenging to manage with limited

computing resources. On the other hand, investment in resources to meet the peak-level

computing demand will be costly due to overall underutilization of the resources.

Complexity refers to the number of data sources. Big data are collected from numerous

data sources. Complexity makes it difficult to collect, cleanse, store, and process

heterogeneous data. It is necessary to reduce the complexity with open sources, standard

platforms, and real-time processing of streaming data. Oracle introduced value as an

additional dimension of big data. Firms need to understand the importance of using big


data to increase revenue, decrease operational costs, and serve customers better; at the same time, they must consider the investment cost of a big data project. Data would be low value in their original form, but data analytics will transform the data into a high- value strategic asset. IT professionals need to assess the benefits and costs of collecting and/or generating big data, choose high-value data sources, and build analytics capable of providing value-added information to managers. Social media analytics support social media content mining, usage mining, and structure mining activities. Social media analytics analyze and interpret human behaviors at social media sites, providing insights and drawing conclusions from a consumer’s interests, web browsing patterns, friend lists, sentiments, profession, and opinions. By understanding customers better using social media analytics, firms develop effective relationship marketing campaigns for targeted customer segments and tailor products and services to customers’ needs and interests. For example, major U.S. banks analyze clients’ comments on social media sites about their service experiences and satisfaction levels. Unlike web analytics used mainly for structured data, social media analytics are used for the analysis of data likely to be natural language, unstructured, and context dependent. The worldwide social media analytics market is growing rapidly from $1.6 billion in 2015 to an estimated $5.4 billion by 2020 at a compound annual growth rate of 27.6%. This growth is attributable to advanced analytics and the increase in the number of social media users [10]. Some social media analytics software programs are provided as cloud-based services with flexible fee options, such as monthly subscription or pay-as-you-go pricing. Social media analytics focus on two types of analysis: sentiment analysis and social network analysis. Sentiment analysis uses text analysis, natural language processing, and computational linguistics to


identify and extract user sentiments or opinions from text materials. Sentiment analysis can be performed at multiple levels, such as entity level, sentence level, and document level. An entity-level analysis identifies and analyzes individual entity’s opinions contained in a document. A sentence-level analysis identifies and analyzes sentiments expressed in sentences. A document-level analysis identifies and analyzes an overarching sentiment expressed in the entire document. However, sentiment analysis can be flawed.

Sampling biases in the data can skew results as in situations where satisfied customers remain silent while those with more extreme positions express their opinions [10]. By exploiting big data from multiple sources, firms can deliver personalized product/service recommendations, coupons, and other promotional offers. Major retailers such as Macy’s and Target use big data to analyze shoppers’ preferences and sentiments and improve their shopping experience. Innovative fintech firms have already started using social media data to assess the credit risk and financing needs of potential clients and provide new types of financial products for them. Banks are analyzing big data to increase revenue, boost retention of clients, and serve clients better. U.S. Bank, a major commercial bank in the U.S., deployed data analytics that integrate data from online and offline channels and provide a unified view of clients to enhance customer relation management. As a result, the bank’s lead conversion rate has improved by over 100% and clients have been able to receive more personalized experiences [10]. Harnessing big data collected from customer interactions allows firms to price appropriately and reap the rewards [10]. Sears uses big data to help set prices and give loyalty shoppers customized coupons. Sears deployed one of the largest Hadoop clusters in the retail industry and now utilizes open source technologies to keep the cost of big data low. Sears analyzes massive


amounts of data, from product availability in its stores to prices at other retailers to local weather conditions, in order to set prices dynamically. eBay also uses open source

Hadoop technology and data analytics to optimize prices and customer satisfaction. To achieve the highest price possible for items sellers place for auction, eBay examines all data related to items sold before (e.g., a relationship between video quality of auction items and bidding prices) and suggests ways to maximize results to sellers. Big data analytics can integrate data from multiple communication channels (e.g., phone, email, instant message) and assist customer service personnel in understanding the context of customer problems holistically and addressing problems quickly. Big data analytics can also be used to analyze transaction activities in real time, detect fraudulent activities, and notify clients of potential issues promptly. Insurance claim representatives can serve clients proactively based on the correlation analysis of weather data and certain types of claims submitted on stormy or snowy days. Hertz, a car rental company in the U.S., uses big data to improve customer satisfaction. Hertz gathers data on its customers from emails, text messages, and online surveys to drive operational improvement. For example, Hertz discovered that return delays were occurring during specific hours of a day at an office in Philadelphia and was able to add staff during the peak activity hours to make sure that any issues were resolved promptly. Southwest Airlines uses speech analytics to extract business intelligence from conversations between customers and the company’s service personnel. The airline also uses social media analytics to delve into customers’ social media data for a better understanding of customer intent and better service offerings [11].


 Improved sales: Modern-day marketing has changed drastically over the past decade and

that is all due to big data. It has completely changed business models, making them more convenient for customers and eventually bringing more profit. According to a BARC

research report, businesses that use big data saw a profit increase of 8 percent, and a 10

percent reduction in overall cost. There are many ways big data can be used for the

business growth and advertising. Here are examples of how companies are using big data

today, and how your company can use it to boost sales. How often have you looked at

your Amazon recommendations and thought, “Wow, I could really use that!” Chances

are, that reaction happens fairly often considering that Amazon uses big data to figure out

exactly the type of products you will want to buy in the future[12].

 The retail giant also gives its customers insight as to what factors go into determining those recommended products. In this context, Amazon cites a variety of data points to figure out what its customers want. Those factors include:

1. When customers make purchases

2. How customers rate their purchases

3. What customers with similar buying habits are purchasing

Obviously, the last factor is the most important one as it pertains to big data. Amazon is able to correctly determine what kind of products a customer wants to buy based on the customer's buying habits.

Similarly, this kind of data can be used to make predictions for your own customers. When you see a sales increase, you will start to notice trends. For example, Amazon noticed that people who buy TVs tend to also purchase a TV mount, which the retailer began to upsell in the hope that customers would buy them together; a small illustrative sketch of this kind of co-purchase counting follows.
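As a rough illustration of how co-buying habits such as the TV and TV-mount example can be mined, the Java sketch below counts how often other items appear in the same order as a given product. The order data and item names are invented for this example; a real retailer would run the same idea over millions of orders with a distributed tool.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CoPurchaseCount {
    public static void main(String[] args) {
        // Each inner list is one customer order; the data is made up for illustration.
        List<List<String>> orders = List.of(
                List.of("tv", "tv-mount", "hdmi-cable"),
                List.of("tv", "tv-mount"),
                List.of("tv", "soundbar"),
                List.of("laptop", "mouse"));

        // Count how often other items appear in the same order as "tv".
        Map<String, Integer> boughtWithTv = new HashMap<>();
        for (List<String> order : orders) {
            if (!order.contains("tv")) {
                continue;
            }
            for (String item : order) {
                if (!item.equals("tv")) {
                    boughtWithTv.merge(item, 1, Integer::sum);
                }
            }
        }

        // Prints e.g. {soundbar=1, tv-mount=2, hdmi-cable=1}: the mount is the
        // strongest co-purchase, so it is the natural item to upsell with a TV.
        System.out.println(boughtWithTv);
    }
}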


Now, since we are connected with each other one way or another, an entire business can be compromised by just a few keystrokes [13].

Operational risk is significantly higher in financial institutions. Scammers are constantly evolving schemes to take advantage of both people and companies. As big data has evolved, however, financial institutions have realized that they can use this information to stop scam artists in their tracks.

Banks are now using big data to monitor their transactions on a "front-to-back" basis to help eliminate fraud at all levels. They track information about who is sending and receiving money, how often those people engage in this behavior, where they live, and how much money they are sending [14].

These types of technologies do not just help banks; any business can use them. As data is collected, trends emerge, and anything that deviates from "business as usual" triggers a digital sticky note on that transaction. This enables companies to track every detail and reduce the risk to the business. A minimal sketch of such a deviation check appears below.
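The following Java sketch illustrates the idea of flagging a transaction that deviates from "business as usual." The three-standard-deviation rule and the sample amounts are assumptions chosen only for illustration; production fraud systems use far richer features and models.

import java.util.List;

public class TransactionMonitor {

    // Flags an amount that deviates sharply from the account's usual pattern.
    static boolean isSuspicious(List<Double> pastAmounts, double newAmount) {
        double mean = pastAmounts.stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = pastAmounts.stream()
                .mapToDouble(a -> (a - mean) * (a - mean)).average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        // "Business as usual" here means within three standard deviations of the mean.
        return stdDev > 0 && Math.abs(newAmount - mean) > 3 * stdDev;
    }

    public static void main(String[] args) {
        List<Double> history = List.of(25.0, 40.0, 32.0, 28.0, 35.0);
        System.out.println(isSuspicious(history, 30.0));    // false: ordinary purchase
        System.out.println(isSuspicious(history, 5000.0));  // true: flag for review
    }
}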

 New product innovation: Big data is currently a common problem faced by many industries, and it brings grand challenges to these industries' digitization and informationization.

Research on common problems of big data, especially on breakthroughs of core technologies, will enable industries to harness the complexity induced by data interconnection and to master uncertainties caused by redundancy and/or shortage of data. Everyone hopes to mine from big data demand-driven information, knowledge and even intelligence and ultimately taking full advantage of the big value of big data. This means that data is no longer a byproduct of the industrial sector but has become a key nexus of all aspects[15]. In this sense, the study of common problems and core technologies of big data will be the focus of the new generation of


IT and its applications. It will not only be the new engine to sustain the high growth of the information industry, but also the new tool for industries to improve their competitiveness. For example, in recent years, cloud computing has rapidly evolved from a vague concept in the beginning to a mature hot technology. Many big companies, including Google, Microsoft,

Amazon, Facebook, Alibaba, Baidu, Tencent, and other IT giants, are working on cloud computing technologies and cloud-based computing services. Big data and cloud computing is seen as two sides of a coin: big data is a killer application of cloud computing, whereas cloud computing provides the IT infrastructure to big data[16]. The tightly coupled big data and cloud computing nexus are expected to change the ecosystem of Internet, and even affect the pattern of the entire information industry. Big data technologies and the corresponding fundamental research have become a research focus in academia. An emerging interdisciplinary discipline called data science [17] has been gradually coming into place. This takes big data as its research object and aims at generalizing the extraction of knowledge from data. It spans across many disciplines, including information science, mathematics, social science, network science, system science, psychology, and economics [17,18]. It employs various techniques and theories from many fields, including signal processing, probability theory, machine learning, statistical learning, computer programming, data engineering, pattern recognition, visualization, uncertainty modeling, data warehousing, and high-performance computing. Many research centers/institutes on big data have been established in recent years in different universities throughout the world (such as Tsinghua University, the University of California at Berkeley,

Columbia University, New York University, Eindhoven University of Technology, and Chinese

University of Hong Kong). Lots of universities and research institutes have even set up


undergraduate and/or postgraduate courses on data analytics for cultivating talents, including data scientists and data engineers.

1.6. Big Data and Cloud Computing Challenges

The fact that the valuable enterprise data will reside outside the corporate firewall raises serious concerns. Some of the most common challenges are discussed below:

 Data Storage - Storing and analyzing large volumes of data that is crucial for a company

to work requires a vast and complex hardware infrastructure. With the continuous growth

of data, data storage device is becoming increasingly more important, and many cloud

companies pursue big capacity of storage to be competitive [19].

 Data Quality - Accuracy and timely availability of data is crucial for decision-making.

Big data is only helpful when an information management process is implemented to

guarantee data quality [20].

 Security and Privacy - Security is one of the major concerns with big data. To make more

sense from the big data, organizations would need to start integrating parts of their

sensitive data into the bigger data. To do this, companies would need to start establishing

security policies which are self-configurable: these policies must leverage existing trust

relationships and promote data and resource sharing within the organizations, while

ensuring that data analytics are optimized and not limited because of such policies.

Hacking and various attacks to cloud infrastructure would affect multiple clients even if

only one site is attacked. These risks can be mitigated by using security applications,

encrypted file systems, data loss software, and buying security hardware to track unusual

behavior across servers [21].


 Service Delivery and Billing - It is difficult to assess the costs involved due to the on-

demand nature of the services. Budgeting and assessment of the cost will be very difficult

unless the provider has some good and comparable benchmarks to offer. The service-

level agreements (SLAs) of the provider are not adequate to guarantee the availability

and scalability. Businesses will be reluctant to switch to cloud without a strong service

quality guarantee [22].

 Interoperability and Portability - Businesses should have the flexibility of migrating in and

out of the cloud and switching providers whenever they want, and there should be no

lock-in period. Cloud computing services should have the capability to integrate

smoothly with the on-premise IT [23].

 Reliability and Availability - Cloud providers still lack round-the-clock service; this

results in frequent outages. It is important to monitor the service being provided using

internal or third-party tools. It is vital to have plans to supervise usage, SLAs,

performance, robustness, and business dependency of these services [24].

 Performance and Bandwidth Cost - Businesses can save money on hardware but they

must spend more for the bandwidth. This can be a low cost for smaller applications but

can be significantly high for the data-intensive applications. Delivering intensive and

complex data over the network requires sufficient bandwidth [25].

All these challenges should not be considered as roadblocks in the pursuit of cloud computing. It is rather important to consider these issues and the possible ways out before adopting the technology.


2. TARGET AUDIENCE OF THE BIG DATA TOOLS

Nearly everything we do in our day-to-day lives leaves a digital trail. Big data is the collection of all of this digital information in complex datasets; it can be analyzed to identify trends and applied to improve services. Big data has permeated professional sectors from business and tech to agriculture and health – and for good reason. When used effectively, big data can help companies better target customers, identify relationships between clients and their needs, and foster more strategic decision-making. Essentially, big data allows companies to know their customers in order to improve their marketing strategies and the customer experience

[26].

2.1. Knowing your Current and Future Customers

Big data provides crucial insight about the customers, including demographics, geographic location, and how they interact with the company in real-time. Through analysis of this data, you can develop profiles for your customers to better define your audience and their interests. Perhaps most importantly, data provide the opportunity for you to learn more about your customers’ attitudes and behaviors in order to predict behavior patterns and better target your customers.

In addition to gaining valuable insights into your current customers to continue meeting their needs, big data also allows you to profile future customers. Using data to identify characteristics of current valuable customers; applying predictive marketing; and using data from such sources as website analytics and social media metrics, among others, will produce further data that lends to developing new target customers [27].


2.2. Targeting your Customers

Knowing who your current and future customers are allows you to develop more personalized marketing campaigns to target customers, a strategy that is central to the account- based marketing (ABM) approach. ABM applies a focused, personalized marketing strategy with messaging that is tailored to the customers’ specific needs and traits as gleaned through data analysis. Yielding higher engagement with target customers due to this personalized approach,

ABM offers a higher return on investment than other B2B marketing strategies [28].

2.3. Improving the Customer Experience

Hand-in-hand with knowing and targeting your customers is then using that knowledge to improve the customer experience. Big data allows you to gather real-time information about your customers’ behavior, preferences, and product reactions to better meet their needs. Such information allows you to identify patterns and improve what you offer to the customers (e.g.,

Amazon's personal suggestions, and similar recommendations from streaming and music services, based on their users' shopping/viewing/listening history). Similarly, data can show you what is not working for your customers. If data reveal that customers are not using a certain service or product, you can use that information to adjust your targeting strategy [29].

Improving the customer experience is not all about what is happening between customers and your company. Use the data to gain a pulse on what your competitors are doing and where they are finding success. Anything from tweets to reports can hold important insight into what your customers are looking for and help you meet the expectations that will keep them coming back.


2.4. Being Proactive

Big data holds significant potential for expanding your company by enabling you to more strategically target your customer base and improve their experience interacting with your company. However, simply gathering the data isn’t enough. You must be proactive in analyzing your data for trends and patterns and applying these insights to target customers. Big data provides a steady stream of data points; it is up to you to commit to using them [30].


3. LIST OF TOOLS USED IN BIG DATA

3.1. Sqoop

Sqoop is a tool designed to transfer data between Hadoop and relational database servers (RDBMS). It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS or related Hadoop ecosystems like Hive and HBase, and to export data from the Hadoop file system back to relational databases [31].

Following is the Sqoop architecture that is used to transfer data between relational databases (MySQL, Oracle, PostgreSQL, DB2) and the Hadoop File System (HDFS, Hive, HBase). Fig. 2 shows the Sqoop architecture.

Fig 2: Sqoop architecture [2]

3.2. Apache Flume

Apache Flume is a tool / data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store. Flume is a highly reliable, distributed and configurable tool whose job is to copy streaming log data from various web servers to HDFS. Flume can be used to store data into any centralized store like HBase or Hive. Along with the log files, Flume can be used to import the huge volume of event data produced by social networking sites like Facebook and Twitter or e-commerce websites like Amazon. Fig. 3 shows the Apache Flume architecture [32].

Fig 3: Flume architecture [2]

In Fig. 3, events generated by an external source like a web server are consumed by the Flume Data Source. The external source sends events to the Flume source in a format that is recognized by the target source. The source then receives an event and stores it into one or more channels. The channel acts as a store which keeps the event until it is consumed by the Flume sink, and the channel may use the local file system to store these events. Lastly, the Flume sink removes the event from a channel and stores it into an external repository like HDFS [32].

3.3. Apache Kafka

Apache Kafka is a distributed event streaming platform optimized for ingesting and processing streaming data in real-time. Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records simultaneously. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data. For example, if you want to create a data pipeline that takes in user activity data to track how people use your website in real-time, Kafka would be used to ingest and store streaming data while serving reads for the applications powering the data pipeline. Kafka is also often used as a message broker solution, which is a platform that processes and mediates communication between two applications [33].
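To make the idea of a streaming pipeline concrete, the following minimal sketch uses the third-party kafka-python client to publish one user-activity event; the broker address, topic name, and event fields are hypothetical placeholders and are not part of the experiments in this paper:

from kafka import KafkaProducer
import json

# Hypothetical broker address; point this at a real Kafka cluster.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one user-activity event to a hypothetical "user_activity" topic.
producer.send("user_activity", {"user": "u123", "page": "/home", "action": "click"})
producer.flush()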


3.4. Hive

Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. A data warehouse provides a central store of information that can easily be analyzed to make informed, data-driven decisions. Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and process large datasets [33].

Fig 4: Hbase architecture [3]

3.5. Hbase

HBase is a distributed column-oriented database built on top of the Hadoop file system (Fig. 4). It is an open-source project and is horizontally scalable. HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS). It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System. One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access [34]. Fig. 4 shows the HBase architecture.

3.6. Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. As shown in Fig. 5, it employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. HDFS is a key part of the many Hadoop ecosystem technologies, as it provides a reliable means for managing pools of big data and supporting related big data analytics applications [34].

Fig 5: HDFS architecture [2]

3.7. Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.


Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

3.8. MapReduce

MapReduce is a processing technique and a program model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many to use the MapReduce model.
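As a purely illustrative sketch (plain Python running on a single machine, not on Hadoop), the map, shuffle and reduce steps can be mimicked on a made-up list of (year, temperature) readings to find the maximum temperature per year:

from itertools import groupby
from operator import itemgetter

# Toy records: (year, temperature) readings, invented for illustration.
records = [(2019, 31), (2019, 35), (2020, 28), (2020, 33), (2020, 30)]

# Map: emit (key, value) pairs -- here simply (year, temperature).
mapped = [(year, temp) for year, temp in records]

# Shuffle/sort: bring together all values that share the same key.
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: combine each key's values into a smaller result (the maximum).
reduced = {year: max(temp for _, temp in pairs) for year, pairs in grouped}
print(reduced)   # {2019: 35, 2020: 33}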

3.9. Pig

Pig is a high-level platform useful for analyzing large data sets. It is an abstraction over MapReduce. Pig is a tool/platform which is used to analyze larger sets of data, representing them as data flows. Pig is generally used for data manipulation operations in Hadoop.


To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data [35].

3.10. Yarn

Yarn stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck on the Job Tracker which was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing [36].

3.11. Zookeeper

Zookeeper is a distributed coordination service to manage a large set of hosts. Coordinating and managing a service in a distributed environment is a complicated process. ZooKeeper solves this issue with its simple architecture and API. ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application [36].

The ZooKeeper framework was originally built at Yahoo! for accessing their applications in an easy and robust manner. Later, Apache ZooKeeper became a standard for organized services used by Hadoop, HBase, and other distributed frameworks. For example, Apache HBase uses ZooKeeper to track the status of distributed data [36].

3.12. Apache Oozie

Apache Oozie is a scheduler system used to run and manage Hadoop jobs in a distributed environment. Oozie supports combining multiple complex jobs that run in a particular order for accomplishing a more significant task. With Oozie, within a particular set of tasks, two or more jobs can be programmed to run in parallel [36].


4. TRANSFORMATION TO NOSQL FROM RDMS

Now more data than ever before is being created, distributed and harnessed to make business decisions. In 2013, IBM said that 90% of the world’s data had been created in the last 2 years alone [37].

The boom in unstructured data that the world has seen in the last few years is one of the main reasons relational databases are no longer sufficient for many companies' needs. One reason we are seeing this boom in unstructured data is global access to the Internet. Contributing to this boom is the ubiquity of social media, wherein everybody wants to let others know what is happening to them as and when it takes place. As more than a fifth of the population follows such behavioral patterns, not only will data storage and retrieval requirements become hugely important, but this also requires increased storage for various types of data like audio, video, images and textual data [37].

4.1. Relational Database

Data was originally stored in documents. However, as the quantity of information increased, accessing the information using files was not easy; it was a method that was slow and inefficient. As the quantity of information grew, keeping the information and retrieving any record became very hard. Hierarchical and network databases were intended as mechanisms for storage, but they did not provide a normal technique for data access. SQL came into being with the need to handle information and the desire for a normal technique of accessing information [37].


4.2. ACID Properties

When a transaction system makes any transaction, the system has to ensure that the transaction meets certain characteristics. Following are some properties that must be fulfilled when a transaction is made (a small illustration follows the list):

 Atomicity: Every transaction is atomic, meaning that if one part of the transaction fails, the entire transaction fails.

 Consistency: Every transaction is subject to a set of rules.

 Isolation: No transaction interferes with another transaction.

 Durability: Once a transaction is committed, every other user sees the same committed data [37].
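As a minimal sketch of these guarantees, the following plain Python example uses the built-in sqlite3 module; the accounts table, the balances, and the no-negative-balance rule are invented purely for illustration:

import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Both updates form one atomic transaction: either both apply or neither does.
    conn.execute("UPDATE accounts SET balance = balance - 170 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 170 WHERE name = 'bob'")
    # Consistency rule for this example: no account may go negative.
    if conn.execute("SELECT COUNT(*) FROM accounts WHERE balance < 0").fetchone()[0]:
        raise ValueError("consistency rule violated")
    conn.commit()      # durability: committed data is visible to every later reader
except ValueError:
    conn.rollback()    # atomicity: the partial transfer is undone

# The rule was violated, so both balances are unchanged.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())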

4.3. NoSql DataBase

As the technological environment transforms and faces new difficulties, companies progressively recognize that new methods and databases need to be evaluated to handle their information and to support changing business needs and increasing complexity and growth.

The Relational Database Management System (RDBMS) was the dominant model for database administration. But non-relational, cloud or "NoSQL" databases are now emerging as an alternative model for database management [37].

The primary motives behind this strategy are simpler design, simpler "horizontal" scaling to clusters of machines, which is an issue for relational databases, and better control over availability. The data structures used in NoSQL databases (e.g., key-value, graph, or document) are slightly different from those used in relational databases by default, making some activities quicker in NoSQL. The data structures used in NoSQL databases are also sometimes regarded as "more flexible" than relational database tables. However, their total capabilities are still not fully explored [37].

In Big Data and real-time web applications, NoSQL databases are increasingly being used [37]. To emphasize that they can support SQL-like query languages, NoSQL systems are also sometimes called "Not only SQL".

4.4. NoSQL Database Types

Many NoSQL databases are available, but they fall within the four data models outlined in [3, 12, 37]. Each category has its own particular characteristics, but the distinct information models overlap to some extent. All NoSQL databases are generally designed for distribution and horizontal scaling, do not expose a SQL interface, and may be open source. NoSQL databases vary in their performance depending on their data model [37].

4.4.1. Document Store Database

Document stores are databases in which information is stored in the form of documents. Document stores deliver excellent efficiency and options for horizontal scalability. Documents within a document-oriented database are somewhat comparable to records in relational databases, but are much more flexible because they are less schematic. Standard document formats are XML, PDF, JSON, and so on [37]. In relational databases, every record within the same table has the same data fields and unused data fields are kept empty, but each document may hold similar and dissimilar data in the case of document stores. A unique key that represents the document is used to address documents in the database. These keys can be a simple string, a URI, or a path string. In comparison with key-value stores, document stores are a little more complex because they allow key-value pairs to be embedded in documents; these are also known as key-document pairs.


Document-oriented databases are suitable for content management systems and blog applications. Examples of providers of document-oriented databases are 10gen's MongoDB, Apache CouchDB, Azure's DocumentDB and AWS DynamoDB. MongoDB, developed by 10gen in C++, is a cross-platform document-oriented database. Its GridFS file system is used to store large files, such as images and videos, in binary JSON format. It delivers high efficiency, consistency and persistence, but is not very reliable and is resource hungry. Fig. 6 shows the document store NoSQL database architecture [37].

Fig 6: Document store NoSQL database [30]
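As a small illustration of the schema-less document model, the following Python sketch uses the third-party pymongo client; the connection string, database, collection, and document contents are hypothetical and assume a locally running MongoDB instance:

from pymongo import MongoClient

# Hypothetical connection string; adjust to a real MongoDB deployment.
client = MongoClient("mongodb://localhost:27017")
posts = client["blog_db"]["posts"]

# Two documents in the same collection need not share the same fields.
posts.insert_one({"title": "Hello", "author": "sifat", "tags": ["big data"]})
posts.insert_one({"title": "Schema-less", "views": 42})

# Documents are addressed by field lookups on the stored keys.
print(posts.find_one({"title": "Hello"}))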

4.4.2. Key Value Store Database

Key-value data stores are very simple, but they are quietly effective and powerful. The application program interface (API) is easy to use. The user can save data in a schema-less manner using a key-value data store. The data is generally a programming-language type or an object. The information consists of two components: a string which depicts the key and the real value, producing a key-value pair. The data stores are like hash tables in which keys are used for indexing, making them faster than an RDBMS. The data model is therefore simple: a map or a dictionary that allows the user to request values based on specified keys. In modern data stores, scalability is preferred over consistency. Therefore, ad-hoc querying and analytical characteristics such as joins and aggregates were overlooked [38].

Key-value stores provide high concurrency, fast lookups and mass storage options. One of the weaknesses of a key-value data store is the absence of a schema with which to create customized views of the data.
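The data model can be sketched in a few lines of plain Python; the class and the session record below are invented solely to illustrate how keys index opaque values:

class KeyValueStore:
    """A toy in-memory key-value store: keys index an opaque value, like a hash table."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # no schema: the value can be any object

    def get(self, key, default=None):
        return self._data.get(key, default)

# A shopping-cart style session addressed by a session-id key.
store = KeyValueStore()
store.put("session:42", {"items": ["book", "pen"], "total": 12.50})
print(store.get("session:42"))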

Such key-value databases may be used as online shopping carts and to build forums and websites that store customer sessions. Amazon's DynamoDB, Cassandra, and Azure Table Storage (ATS) are some remarkable examples. For internet-scale apps, Amazon provides DynamoDB as a fully managed NoSQL store service [39]. It is a distributed key-value storage facility which, with its replication function, offers quick, safe, economical access to information as well as high availability and durability. Fig. 7 shows the key-value store NoSQL database architecture.

Fig 7: Key value store NoSQL database [29]


4.4.3. Graph Stores Database

Graph databases are databases that store information as graphs, as shown in Fig. 8. The graph contains nodes and edges, where the edges maintain the relationships between the nodes and the items. The graph also includes node-related characteristics. It utilizes an index-free adjacency method, which means that each node contains a direct pointer to its neighboring nodes. This method allows millions of records to be traversed quickly. The primary focus is on the association between pieces of information [39]. Graph databases provide less effective storage for schematic and semi-structured data. Queries are articulated as traversals, thus increasing the speed of graph databases over relational databases. They are simple to scale and easy to sketch on a whiteboard. Graph databases comply with ACID and support rollback.
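Index-free adjacency can be sketched with an ordinary Python dictionary of adjacency lists; the tiny social graph below is made up for illustration:

# Each node stores direct pointers to its neighbours (index-free adjacency).
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["carol"],
    "carol": ["dave"],
    "dave":  [],
}

def friends_of_friends(node):
    # Traverse two hops by following the stored pointers -- no join needed.
    result = set()
    for friend in graph.get(node, []):
        result.update(graph.get(friend, []))
    return result - {node}

print(friends_of_friends("alice"))   # {'carol', 'dave'}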

These databases are designed for the development of social networking apps, bioinformatics, content management systems and cloud management services. Notable graph databases are Neo4j, OrientDB, and Titan. Fig. 8 shows the graph store NoSQL database architecture.


Fig 8: Graph store NoSQL database [12]

4.4.4. Wide Column Stores Database

The NoSQL column stores are hybrid row/column stores, as opposed to purely relational databases. Although column-by-column data storage and columnar additions to row-based databases are shared, column stores do not store database information in rows of lists but in massively distributed architectures. In column stores, each row key has one or more columns (attributes) associated with it. A column store lays out its information so that it can be read quickly with less I/O activity. It provides strong scalability for data storage. The information saved in the database is stored in the sort order of the column family.

Wide-column databases are perfect for data mining and Big Data analytics apps. Examples of column-oriented store suppliers include Cassandra (Facebook's high-performance store), Apache HBase, and Google's Bigtable. Bigtable is a wide-column, high-performance database able to handle large amounts of information. It was created in C/C++ on the Google File System (GFS). It is used by several Google applications, such as YouTube and Gmail, that have different database latency requirements [39]. Besides its use in the Google App Engine, it is not distributed outside of Google. Bigtable is designed for simple scalability across thousands of computers, so it is tolerant of hardware failures. Fig. 9 shows the wide column store NoSQL database architecture.

Fig 9: Wide column store NoSQL database [29]

4.4.5. Comparison between RDBMS and NoSQL Databases


The main reasons why one would need to move to NoSQL databases are the need for huge data storage (also called Big Data), scalability, and performance. The following table displays the differences in terminology and data operations between a NoSQL database and an RDBMS (SQL).

We have made a high-level comparison between SQL (relational) and NoSQL (non-relational) databases on the basis of the characteristics of each database type as recently reported in the literature [40].

Table 1: Summarized view of SQL and NoSQL

SQL | MongoDB
Based on ACID transactional properties such as atomicity, consistency, isolation and durability. | Supports ACID transactions; the CAP theorem of distributed systems governs consistency of data across all nodes of a NoSQL database.
It has vertical scaling. | It has horizontal scaling.
Structured Query Language (SQL) is used to query the data efficiently. | Object-oriented APIs are used to manipulate the data.
Based on pre-defined foreign-key relationships between tables in a database schema; strict definition of schemas and data types is required before inserting the data. | Dynamic database schema; does not force an explicit schema definition in advance, and different data can be stored together as required.
Software that uses this database model includes Oracle, MySQL and SQL Server. | MongoDB, Couchbase, Cassandra.

NoSQL benefits over RDBMS:

 Provides a wide selection of data models

 Easily scalable

 Administrators of the database are not necessary

 Some NoSQL DB suppliers such as Riak and Cassandra can manage hardware failure

 Faster, more efficient and flexible

 Has developed very rapidly

 Used for Big data applications

Though ACID gives strong guarantees to relational databases, one drawback of this type is that these databases were built to work on a single server, which means the only way of enhancing capacity is to upgrade the server. Another researcher, Antro Salminen, notes in his seminar that for an RDBMS the only way to scale up is by adding hardware processing power [12]; an RDBMS also has a physical storage limit [3]. Hence, a database was required that supports current-generation web, mobile, and other applications operating at any scale. As necessity is the mother of invention, when everyone was facing the problem of Big Data, NoSQL was introduced by researchers.

NoSQL gives a time benefit over relational databases' join queries by providing simple graph-traversal operations instead. With its horizontal scaling property, we can store frequently required information together in one table in NoSQL. NoSQL is also able to handle table joins at the application level. Another feature of NoSQL is that data repetition is acceptable; this helps to improve the efficiency and execution speed of queries. However, tables are not related in NoSQL, so we need to be very careful and synchronize the data when updating the tables [40].

4.4.6. Performance of NoSQL and SQL Databases for Big Data

The main reason to move to NoSQL from a relational database is the demand for performance improvements. Choi et al. discovered that a NoSQL database such as MongoDB offered quicker and more stable results at the cost of information coherence. Testing was performed on the basis of an open-source project for an internal blog system; MongoDB was found to store posts 85% faster than a SQL database. NoSQL was proposed for settings that prioritize information accessibility over consistency.

The usage of MongoDB in mobile apps is described by Fotache & Cogean [40]. Various updating operations, such as upsert, are simpler and quicker with NoSQL than with a SQL database. The use of cloud computing and NoSQL improves the performance of mobile platforms, especially in the data layer.

Using a Triple Store based on the Resource Description Framework (RDF) as the NoSQL database, Ullah [40] contrasted the results of a relational database management system (RDBMS) and NoSQL. Reading a great deal of data is very intensive in the database, and because the NoSQL database is unstructured, storing thousands of records takes a huge amount of space while the RDBMS uses less storage [40]. For example, a search for "white hat" took 5255 ms in the NoSQL database and only 165.43 ms in the RDBMS.

The Yahoo Cloud Serving Benchmark (YCSB) experiment was carried out by Floratou et al. [40] on an RDBMS and MongoDB. They tested the SQL client-shared database against MongoDB client-shared databases. Tests showed that most of the benchmarks achieved higher throughput and reduced latency with the SQL client-shared databases. The reason for the higher performance is that most read requests were served from pages in the buffer pool, whereas the NoSQL databases tend to read shards on different nodes [40]. The research showed that the RDBMS, with the same processing power as NoSQL, can deal with greater workloads.


5. DESCRIPTION OF DATASETS

Since this paper is about analyzing tools and making a comparative study between them, I needed datasets that are distinct from each other and large enough to work with. There are many online sources from which we can collect data. I collected data from two popular data sources: US DATA.GOV (https://www.data.gov/) and Kaggle (https://www.kaggle.com/).

Below, I tried to provide a small description of some of the datasets that were used:

5.1. Iris Data Set

This is perhaps the best-known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day (see Duda & Hart, for example). The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

 Predicted attribute: class of iris plant.

This is an exceedingly simple domain (source: https://archive.ics.uci.edu/ml/datasets/Iris).
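As a sketch of how such a dataset can be read into Python, the snippet below loads the Iris data with pandas; the download URL is the standard UCI location and the column names are assigned manually, so treat both as assumptions rather than the exact loading code used in the experiments:

import pandas as pd

# Standard UCI path for the Iris data (assumed); columns named manually.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv(url, header=None, names=cols)

print(iris["class"].value_counts())   # 50 instances per class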

5.2. Wine Data Set

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines (source: https://archive.ics.uci.edu/ml/datasets/Wine).


5.3. Liver Disorders Data Set

The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the dataset constitutes the record of a single male individual.

Important note: The 7th field (selector) has been widely misinterpreted in the past as a dependent variable representing presence or absence of a liver disorder. This is incorrect [40].

The 7th field was created by BUPA researchers as a train/test selector. It is not suitable as a dependent variable for classification. The dataset does not contain any variable representing presence or absence of a liver disorder. Researchers who wish to use this dataset as a classification benchmark should follow the method used in experiments by the donor (Forsyth & Rada, 1986, Machine learning: applications in expert systems and information retrieval) and others (e.g. Turney, 1995, Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm), who used the 6th field (drinks), after dichotomising, as a dependent variable for classification. Because of widespread misinterpretation in the past, researchers should take care to state their method clearly (source: https://archive.ics.uci.edu/ml/datasets/liver+disorders).

5.4. Heart Disease Data Set

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).


The names and social security numbers of the patients were recently removed from the database, replaced with dummy values. One file has been "processed", the one containing the Cleveland database. All four unprocessed files also exist in this directory (source: https://archive.ics.uci.edu/ml/datasets/Heart+Disease).

5.5. Solar Power Generation Data

This data has been gathered at two solar power plants in India over a 34-day period. It has two pairs of files; each pair has one power generation dataset and one sensor readings dataset. The power generation datasets are gathered at the inverter level; each inverter has multiple lines of solar panels attached to it. The sensor data is gathered at the plant level, from a single array of sensors optimally placed at the plant (source: https://www.kaggle.com/anikannal/solar-power-generation-data).


6. EXPLANATION OF THE EXPERIMENT

My experiment involves a few steps, which are mentioned below:

6.1. Gathering data

Collecting data was the first step involved in this experiment. I went through various sources namely DATA.GOV (www.data.gov), Amazon public data sets (registry.opendata.aws), reddit (www.reddit.com/r/datasets/), Kaggle (kaggle.com) in order to gather the right dataset.

6.2. Cleaning

Once the data is collected, the next step is to clean it. This is critical because, in order to analyze the data, it needs to be in the right form. There are lots of tools available these days to clean datasets, but the one I preferred was Microsoft Power BI. So, the Power BI tool was used to clean the datasets.
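Although the cleaning in this work was done in Power BI, an equivalent cleaning step could be scripted in Python with pandas; the file name and column names below are assumptions made purely for illustration:

import pandas as pd

# Hypothetical raw export; the actual cleaning was performed in Power BI.
df = pd.read_csv("solar_power_raw.csv")

df = df.drop_duplicates()                              # remove duplicate rows
df = df.dropna(subset=["DC_POWER", "AC_POWER"])        # drop rows missing key readings
df.columns = [c.strip().lower() for c in df.columns]   # normalize column names

df.to_csv("solar_power_clean.csv", index=False)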

6.3. Language

I used Python as the programming language. Python programs involve fewer lines of code compared to other languages available for programming, and Python automatically offers assistance to identify and associate data types.

Python programming follows an indentation-based nesting structure. The language can process lengthy tasks within a short span of time. As there is no limitation on data processing, you can compute data on commodity machines, laptops, the cloud, and desktops.

 Earlier, Python was considered to be a slower language in comparison to some of its counterparts like Java and Scala, but the scenario has changed now: the advent of the Anaconda platform has given the language a great boost in speed. This is why Python has become one of the most popular options for big data in the industry.

6.4. Platform

For better performance of the research, I needed a cluster computer. A computer cluster is a set of connected computers (nodes) that work together as if they are a single (much more powerful) machine. Unlike grid computers, where each node performs a different task, computer clusters assign the same task to each node. Nodes in a cluster are usually connected to each other through high-speed local area networks. Each node runs its own instance of an operating system.

A computer cluster may range from a simple two-node system connecting two personal computers to a supercomputer with a cluster architecture. Computer clusters are often used for cost-effective high performance computing (HPC) and high availability (HA) by businesses of all sizes. If a single component fails in a computer cluster, the other nodes continue to provide uninterrupted processing. Compared to a single computer, a computer cluster can provide faster processing speed, larger storage capacity, better data integrity, greater reliability and wider availability of resources. Computer clusters are usually dedicated to specific functions, such as load balancing, high availability, high performance or large-scale processing. Compared to a mainframe computer, the amount of power and processing speed produced by a cluster is more cost effective. The networked nodes in a cluster also create an efficient, distributed infrastructure that prevents bottlenecks, thus improving performance [41].

Hence, I used a lab named ITVersity Labs that provides the perfect environment to learn essential skills in various Big Data technologies. They have a cluster with 12 nodes (computers) that helped me to run the testing at a quite fast pace with greater accuracy. Moreover, their technical team was also supportive in taking care of the technical issues.


6.5. Framework

I used Apache Hadoop as the framework. It is an open-source data platform or framework developed in Java, dedicated to storing and analyzing large sets of unstructured data [42].

6.6. Features of Apache Hadoop

 Allows multiple concurrent tasks to run from single to thousands of servers without any delay.

 Consists of a distributed file system that allows transferring data and files in split seconds between different nodes.

 Able to process efficiently even if a node fails [42].


7. COMPARATIVE STUDY AMONG VARIOUS BIG DATA TOOLS

7.1. Tez vs MapReduce

Distributed processing is the base of Hadoop. Hive relies on MapReduce for distributed processing. However, MapReduce is batch oriented, so it is not suitable for interactive queries. Apache Tez is the alternative for interactive query processing. Tez improves over MapReduce by using Hadoop containers efficiently, allowing multiple reduce phases without map phases, and making effective use of HDFS.

MapReduce always requires a map phase before the reduce phase. It stores temporary data in HDFS after every map and reduce phase, which is slower due to the HDFS access after each phase. A single map phase may feed multiple reduce phases. MapReduce programs can be written in different programming and scripting languages. It is a framework which helps in writing programs for processing data in parallel across thousands of machines.

Tez represents the MapReduce paradigm in a more powerful framework based on expressing computations as a dataflow graph. Tez enables developers to build end-user applications with better performance and flexibility. Hadoop is a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop to address these use cases.

Tez is a highly customizable framework that meets the needs of a broad spectrum of users. Moreover, projects such as Hive and Pig are seeing significant improvements in response times when they use Tez instead of MapReduce as the backbone for data processing. Tez is built on top of YARN, which is also the new resource-management framework for Hadoop [43].

Fig. 10 and Fig. 11 show the Hive-MR and Hive-Tez architectures, respectively.

Fig 10: Hive-MR architecture [1]

Fig 11: Hive-Tez architecture [1]

7.1.1. Experiment

In order to compare the performance of MapReduce and Tez, I read 15 datasets using Python programming and applied MapReduce and Tez to them.


7.1.2. MapReduce Code

 Map:

import sys

# Mapper: read lines from standard input and emit (word, 1) pairs
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

 Reduce:

from operator import itemgetter
import sys

# Reducer: aggregate the counts emitted by the mapper, one total per word
word2count = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # ignore lines whose count is not a number
        pass

sorted_word2count = sorted(word2count.items(), key=itemgetter(0))
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))

 Tez Code:

@InterfaceAudience.Private

public abstract class TezExampleBase extends Configured implements Tool {

private static final Logger LOG = LoggerFactory.getLogger(TezExampleBase.class);

private TezClient tezClientInternal;

protected static final String DISABLE_SPLIT_GROUPING = "disableSplitGrouping";

protected static final String LOCAL_MODE = "local";

protected static final String COUNTER_LOG = "counter";

protected static final String GENERATE_SPLIT_IN_CLIENT = "generateSplitInClient";

protected static final String LEAVE_AM_RUNNING = "leaveAmRunning";

protected static final String RECONNECT_APP_ID = "reconnectAppId";

private boolean disableSplitGrouping = false;

private boolean isLocalMode = false;

private boolean isCountersLog = false;

private boolean generateSplitInClient = false;


private boolean leaveAmRunning = false;

private String reconnectAppId;

private HadoopShim hadoopShim;

protected boolean isCountersLog() {

return isCountersLog;

}

protected boolean isDisableSplitGrouping() {

return disableSplitGrouping;

}

protected boolean isGenerateSplitInClient() {

return generateSplitInClient;

}

private Options getExtraOptions() {

Options options = new Options();

options.addOption(LOCAL_MODE, false, "run it as local mode");

options.addOption(DISABLE_SPLIT_GROUPING, false , "disable split grouping");

options.addOption(COUNTER_LOG, false , "print counter log");

options.addOption(GENERATE_SPLIT_IN_CLIENT, false, "whether generate split in client"); options.addOption(LEAVE_AM_RUNNING, false, "whether client should stop session");

options.addOption(RECONNECT_APP_ID, true, "appId for client reconnect");

return options;

}

@Override

public final int run(String[] args) throws Exception {

Configuration conf = getConf();

GenericOptionsParser optionParser = new GenericOptionsParser(conf, getExtraOptions(), args); String[] otherArgs = optionParser.getRemainingArgs();

if (optionParser.getCommandLine().hasOption(LOCAL_MODE)) {

isLocalMode = true;

}

if (optionParser.getCommandLine().hasOption(DISABLE_SPLIT_GROUPING)) {

disableSplitGrouping = true;

}

if (optionParser.getCommandLine().hasOption(COUNTER_LOG)) {

isCountersLog = true;

}

if (optionParser.getCommandLine().hasOption(GENERATE_SPLIT_IN_CLIENT)) {

generateSplitInClient = true;

}

if (optionParser.getCommandLine().hasOption(LEAVE_AM_RUNNING)) {

leaveAmRunning = true;

}

if (optionParser.getCommandLine().hasOption(RECONNECT_APP_ID)) {

reconnectAppId = optionParser.getCommandLine().getOptionValue(RECONNECT_APP_ID); }

hadoopShim = new HadoopShimsLoader(conf).getHadoopShim();

return _execute(otherArgs, null, null);

}

/**

* Utility method to use the example from within code or a test.

*

* @param conf the tez configuration instance which will be used to crate the DAG and * possible the Tez Client.

* @param args arguments to the example

* @param tezClient an existing running {@link org.apache.tez.client.TezClient} instance if one * exists. If no TezClient is specified (null), one will be created based on the * provided configuration. If TezClient is specified, local mode option can not been * specified in arguments, it takes no effect.

* @return Zero indicates success, non-zero indicates failure

* @throws Exception

*/

public int run(TezConfiguration conf, String[] args, @Nullable TezClient tezClient) throws Exception {

setConf(conf);

hadoopShim = new HadoopShimsLoader(conf).getHadoopShim();

GenericOptionsParser optionParser = new GenericOptionsParser(conf, getExtraOptions(), args); if (optionParser.getCommandLine().hasOption(LOCAL_MODE)) {

isLocalMode = true;

if (tezClient != null) {

throw new RuntimeException("can't specify local mode when TezClient is created, it takes no effect"); }

}

if (optionParser.getCommandLine().hasOption(DISABLE_SPLIT_GROUPING)) {

disableSplitGrouping = true;

}

if (optionParser.getCommandLine().hasOption(COUNTER_LOG)) {

isCountersLog = true;

}


if (optionParser.getCommandLine().hasOption(GENERATE_SPLIT_IN_CLIENT)) {

generateSplitInClient = true;

}

String[] otherArgs = optionParser.getRemainingArgs();

return _execute(otherArgs, conf, tezClient);

}

/**

* @param dag the dag to execute

* @param printCounters whether to print counters or not

* @param logger the logger to use while printing diagnostics

* @return Zero indicates success, non-zero indicates failure

* @throws TezException

* @throws InterruptedException

* @throws IOException

*/ public int runDag(DAG dag, boolean printCounters, Logger logger) throws TezException,

InterruptedException, IOException {

tezClientInternal.waitTillReady();

CallerContext callerContext = CallerContext.create("TezExamples",

"Tez Example DAG: " + dag.getName());

ApplicationId appId = tezClientInternal.getAppMasterApplicationId();

if (hadoopShim == null) {

Configuration conf = (getConf() == null ? new Configuration(false) : getConf());

hadoopShim = new HadoopShimsLoader(conf).getHadoopShim();

}

if (appId != null) {

TezUtilsInternal.setHadoopCallerContext(hadoopShim, appId);

callerContext.setCallerIdAndType(appId.toString(), "TezExampleApplication");

}

dag.setCallerContext(callerContext);

DAGClient dagClient = tezClientInternal.submitDAG(dag);

Set getOpts = Sets.newHashSet();

if (printCounters) {

getOpts.add(StatusGetOpts.GET_COUNTERS);

}

DAGStatus dagStatus;

dagStatus = dagClient.waitForCompletionWithStatusUpdates(getOpts);

if (dagStatus.getState() != DAGStatus.State.SUCCEEDED) {

logger.info("DAG diagnostics: " + dagStatus.getDiagnostics());

return -1;

}

return 0;

}

private int _validateArgs(String[] args) {

int res = validateArgs(args);

if (res != 0) {

_printUsage();

return res;

}

return 0;

}

private int _execute(String[] otherArgs, TezConfiguration tezConf, TezClient tezClient) throws Exception {

int result = _validateArgs(otherArgs);

if (result != 0) {

return result;

}

if (tezConf == null) {

tezConf = new TezConfiguration(getConf());

}

if (isLocalMode) {

LOG.info("Running in local mode...");

tezConf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, true);

tezConf.set("fs.defaultFS", "file:///");

tezConf.setBoolean(

TezRuntimeConfiguration.TEZ_RUNTIME_OPTIMIZE_LOCAL_FETCH, true);

}

UserGroupInformation.setConfiguration(tezConf);

boolean ownTezClient = false;

if (tezClient == null) {

ownTezClient = true;

tezClientInternal = createTezClient(tezConf);

} else {

tezClientInternal = tezClient;

}

try {

return runJob(otherArgs, tezConf, tezClientInternal);

} finally {

if (ownTezClient && tezClientInternal != null && !leaveAmRunning) {


tezClientInternal.stop();

}

}

}

private TezClient createTezClient(TezConfiguration tezConf) throws IOException, TezException { TezClient tezClient = TezClient.create("TezExampleApplication", tezConf);

if(reconnectAppId != null) {

ApplicationId appId = TezClient.appIdfromString(reconnectAppId);

tezClient.getClient(appId);

} else {

tezClient.start();

}

return tezClient;

}

private void _printUsage() {

printUsage();

System.err.println();

printExtraOptionsUsage(System.err);

System.err.println();

ToolRunner.printGenericCommandUsage(System.err);

}

/**

* Print usage instructions for this example

*/

protected abstract void printUsage();

protected void printExtraOptionsUsage(PrintStream ps) {

ps.println("Tez example extra options supported are");

ps.println("-" + LOCAL_MODE + "\t\trun it in tez local mode, "

+ " run it in distributed mode without this option");

ps.println("-" + DISABLE_SPLIT_GROUPING + "\t\t disable split grouping for MRInput," + " enable split grouping without this option.");

ps.println("-" + COUNTER_LOG + "\t\t to print counters information");

ps.println("-" + GENERATE_SPLIT_IN_CLIENT + "\t\tgenerate input split in client");

ps.println("-" + LEAVE_AM_RUNNING + "\t\twhether client should stop session");

ps.println("-" + RECONNECT_APP_ID + "\t\tappId for client reconnect");

ps.println();

ps.println("The Tez example extra options usage syntax is ");

ps.println("example_name [extra_options] [example_parameters]");

}


/**

* Validate the arguments

*

* @param otherArgs arguments, if any

* @return Zero indicates success, non-zero indicates failure

*/

protected abstract int validateArgs(String[] otherArgs);

/**

* Create and execute the actual DAG for the example

*

* @param args arguments for execution

* @param tezConf the tez configuration instance to be used while processing the DAG * @param tezClient the tez client instance to use to run the DAG if any custom monitoring is * required. Otherwise the utility method {@link #runDag(org.apache.tez.dag..DAG, * boolean, org.slf4j.Logger)} should be used

* @return Zero indicates success, non-zero indicates failure

* @throws IOException

* @throws TezException

*/

protected abstract int runJob(String[] args, TezConfiguration tezConf,

TezClient tezClient) throws Exception;

@Private

@VisibleForTesting

public ApplicationId getAppId() {

if (tezClientInternal == null) {

LOG.warn("TezClient is not initialized, return null for AppId");

return null;

}

return tezClientInternal.getAppMasterApplicationId();

}

}


The execution times taken by MR and Tez for the sample queries are listed below:

Table 2: Execution time between MR and Tez on sample queries

Query | MR (time in sec) | Tez (time in sec)
Q1 | 0.99 | 0.64
Q2 | 1.44 | 0.55
Q3 | 0.82 | 0.33
Q4 | 0.87 | 0.65
Q5 | 0.97 | 0.3
Q6 | 1.09 | 0.9
Q7 | 1.32 | 0.5
Q8 | 0.77 | 0.23
Q9 | 0.69 | 0.5
Q10 | 0.94 | 0.77
Q11 | 0.80 | 0.57
Q12 | 0.96 | 0.56
Q13 | 0.78 | 0.66
Q14 | 0.87 | 0.44
Q15 | 0.45 | 0.26

Fig 12: Time taken by MR vs Tez
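For reference, a grouped bar chart like Fig. 12 can be reproduced from Table 2 with matplotlib; the snippet below is only a sketch of how such a figure might be generated, not the exact script used for the experiment:

import matplotlib.pyplot as plt

queries = ['Q%d' % i for i in range(1, 16)]
mr_times = [0.99, 1.44, 0.82, 0.87, 0.97, 1.09, 1.32, 0.77, 0.69, 0.94, 0.80, 0.96, 0.78, 0.87, 0.45]
tez_times = [0.64, 0.55, 0.33, 0.65, 0.30, 0.90, 0.50, 0.23, 0.50, 0.77, 0.57, 0.56, 0.66, 0.44, 0.26]

x = range(len(queries))
plt.bar([i - 0.2 for i in x], mr_times, width=0.4, label='MapReduce')
plt.bar([i + 0.2 for i in x], tez_times, width=0.4, label='Tez')
plt.xticks(list(x), queries)
plt.ylabel('Time (sec)')
plt.legend()
plt.show()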

7.2. Sqoop vs Flume

Big Data is unquestionably synonymous with Apache Hadoop because of its cost-effectiveness and also for its virtues like scalability to process humongous loads of data. Getting the data that needs to be analyzed onto the Hadoop clusters is one of the most critical activities in any Big Data deployment. Data ingestion is critical because it is required to load humongous loads of data, on the order of petabytes and exabytes.

Apache Sqoop and Apache Flume are two different technologies from the Hadoop ecosystem which can be put to use to gather data from various kinds of data sources and finally load that data into a traditional HDFS system. Apache Sqoop in Hadoop is used to fetch structured data from RDBMS systems like Teradata, Oracle, MySQL, MSSQL and PostgreSQL, while Apache Flume is used to fetch data that is stored on various sources such as the log files on a web server or an application server.

Apache Sqoop, which can be comfortably read as SQL-to-Hadoop, is a lifesaver for any individual who experiences difficulties in moving data from data warehouses to orthodox Hadoop environments. It is a very efficient and effective Hadoop tool that can be used to import data from a traditional RDBMS onto HBase, Hive or HDFS. Apache Sqoop can also be used for the reverse use case, that is, to export data from HDFS back to a traditional RDBMS system.

Apache Sqoop is an effective Hadoop-related tool for non-programmers who need to look at the RDBMS tables to be imported into HDFS. Once the input is identified by Apache Sqoop, the metadata for the table can be read and a class definition is created for the input requirements. Apache Sqoop can also be restricted to obtain only the columns that are required, instead of importing the whole input, which saves a great amount of time in the process [43].

The most important features of Apache Sqoop are provided below; let us take a look at them:


 Apache Sqoop supports bulk import

 Sqoop allows parallel data transfers for optimal utilization of system resources and also to ensure faster performance

 Sqoop is made to increase the data analysis efficiency by a great deal

 Sqoop helps in mitigating excessive loads on external systems

 Sqoop provides interaction with the data programmatically by generating Java classes

Apache Flume can be explained as a service that is designed specifically to stream logs into Hadoop's environment. Apache Flume is a distributed and reliable service to collect and aggregate large amounts of log data. Apache Flume's architecture is based on streaming data flows, which is quite simple and makes it easy to use. Apache Flume provides many tunable reliability, recovery and failover mechanisms that come to our rescue at the right time.

Apache Flume has a very simple event-driven approach with three important roles: Source, Channel and Sink.

 A Source is defined as the point from where the data comes (e.g., a message queue or a file)

 A Sink is defined as the destination to which data pipelined from various sources is delivered

 A Channel is defined as the pipe that establishes the connection between Sources and Sinks

Apache Flume works on two major concepts, as discussed below:

 The Master acts as a reliable configuration service that is used by nodes to retrieve their specific configurations

 A change in the configuration for a particular node on the Master is dynamically propagated by the Master itself


A node is generally an event pipe in Apache Hadoop Flume that reads from a Source and writes to a Sink. The characteristics and the roles of an Apache Flume node can be determined by the behavior of Sources and Sinks. Apache Flume was developed in such a manner that if the available options of Sources and Sinks do not match the requirements, then custom Sources and Sinks can be written to answer the needs.

The most important features of Apache Flume are provided below; let us take a look at them:

 Apache Flume is a flexible tool that enables scalability in the environment

 Flume provides very high throughput at very low latency

 Flume has a nice way of declarative configuration, along with ease of extensibility

 Flume in Hadoop is known to be fault tolerant, linearly scalable and also stream-oriented

Big Data systems, in general, are very popular and are known to be able to process huge amounts of unstructured and structured data from various kinds of data sources. The complexity of a big data system increases with the number of data sources available, and data from these diverse sources can be produced continuously and on a large scale.

Apache Sqoop is a lifesaver in moving data from the data warehouse into the Hadoop environment; interestingly, its name comes from SQL-to-Hadoop. Basically, it is used for importing data from RDBMSs like MySQL and Oracle into HBase, Hive or HDFS, and it is an effective Hadoop tool. A user can also export data from HDFS into an RDBMS through Sqoop. In addition, Sqoop is a command-line tool, and its interpreter executes Sqoop commands one at a time.


To be specific, Sqoop is used for parallel data transfer; for this reason, the output can be in multiple files. It has a connector-based architecture, so connectors know how to connect to the respective data source and fetch the data. HDFS is the destination for data imported using Sqoop, and Sqoop is not event driven. In order to import data from structured data sources, one has to use Sqoop, because its connectors know how to interact with structured data sources and fetch data from them. Sqoop is primarily used for copying data faster and then using it for generating analytical outcomes. Sqoop reduces excessive storage and processing loads by transferring them to other systems, and has fast performance [43].

Sample Sqoop commands to import data into a warehouse directory and into a target directory:

 Sqoop import into a warehouse directory:

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table order_items \
  --warehouse-dir /user/dgadiraju/sqoop_import/retail_db

 Sqoop import into a target directory:

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password ******* \
  --table order_items \
  --target-dir /user/dgadiraju/sqoop_import/retail_db/order_items

Apache Flume is used for streaming logs into the Hadoop environment. Apache Flume is a service designed for collecting and aggregating huge amounts of log data, JMS messages, directory contents, crash reports, etc. Flume is a distributed and reliable service. Moreover, it has a very simple and easy-to-use architecture, based on streaming data flows. Flume has an agent-based architecture.

Here, a piece of code is written (called an 'agent') which takes care of fetching data. Flume also has tunable reliability mechanisms and several recovery and failover mechanisms. In Flume, data flows to HDFS through zero or more channels, and the data load can be driven by events.

Generally, Flume is used to pull data when companies want to analyze patterns, root causes or sentiment using logs and social media. Lastly, it is fault tolerant, robust and has tunable reliability mechanisms for failover and recovery.

Sample code for importing data from log server:

# wshdfs.conf
# To get the data from web server logs to HDFS
wh.sources = ws
wh.sinks = hd
wh.channels = mem

# Describe/configure the source
wh.sources.ws.type = exec
wh.sources.ws.command = tail -F /opt/gen_logs/logs/access.log

# Describe the sink
wh.sinks.hd.type = hdfs
wh.sinks.hd.hdfs.path = hdfs://nn01.itversity.com:8020/user/dgadiraju/flume_demo
wh.sinks.hd.hdfs.filePrefix = FlumeDemo
wh.sinks.hd.hdfs.fileSuffix = .txt
wh.sinks.hd.hdfs.rollInterval = 120
wh.sinks.hd.hdfs.rollSize = 1048576
wh.sinks.hd.hdfs.rollCount = 100
wh.sinks.hd.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
wh.channels.mem.type = memory
wh.channels.mem.capacity = 1000
wh.channels.mem.transactionCapacity = 100

# Bind the source and sink to the channel
wh.sources.ws.channels = mem
wh.sinks.hd.channel = mem


7.3. Why Spark is Faster Than MapReduce

Big data analytics is one of the most active research areas with a lot of challenges and needs for new innovations that affect a wide range of industries. To fulfill the computational requirements of massive data analysis, an efficient framework is essential to design, implement and manage the required pipelines and algorithms. In this regard, Apache Spark has emerged as a unified engine for large-scale data analysis across a variety of workloads. It has introduced a new approach for data science and engineering where a wide range of data problems can be solved using a single processing engine with general-purpose languages. Following its advanced programming model, Apache Spark has been adopted as a fast and scalable framework in both academia and industry. It has become the most active big data open source project and one of the most active projects in the Apache Software Foundation.

As an evolving project in the big data community, having good references is key to getting the most out of Apache Spark and contributing effectively to its progress. While the official programming guide is the most up-to-date source about Apache Spark, several books have been published to show how Apache Spark can be used to solve big data problems. In addition, Databricks, the company founded by the creators of Apache Spark, has developed a set of reference applications to demonstrate how Apache Spark can be used for different workloads. Other good sources are the official blog at Databricks and Spark Hub, where you can find Spark's news, events, resources, etc. However, the rapid adoption and development of Apache Spark, coupled with increasing research on using it for big data analytics, make it difficult for beginners to comprehend the full body of development and research behind it. To our knowledge, there is no comprehensive summary on big data analytics using Apache Spark.


In order to fill this gap, help in getting started with Apache Spark and follow such an active project, the goal of this paper is to provide a concise source of information about the key features of Apache Spark. Specifically, we focus on how Apache Spark can enable efficient large-scale machine learning, graph analysis and stream processing.

Apache Spark processes data in random access memory (RAM), while Hadoop MapReduce persists data back to the disk after a map or reduce action. In theory, then, Spark should outperform Hadoop MapReduce. Spark can do more than plain data processing: it can also process graphs, and it includes the MLlib machine learning library. Thanks to its high performance, Spark can do real-time processing as well as batch processing.

Hadoop MapReduce is great for batch processing. If a user wants a real-time option, another platform such as Impala is needed, and for graph processing Apache Giraph can be used. MapReduce used to have Apache Mahout for machine learning, but Mahout has since been ditched in favor of Spark.

Spark needs a lot of memory. Much like standard databases, Spark loads a process into memory and keeps it there until further notice for the sake of caching. If you run Spark on Hadoop YARN with other resource-demanding services, or if the data is too big to fit entirely into memory, then Spark can suffer performance degradation. MapReduce, on the other hand, kills its processes as soon as a job is done, so it can easily run alongside other services with minor performance differences.
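To illustrate the in-memory advantage, the following PySpark sketch caches an RDD so that a second action reuses the data from memory instead of re-reading it from disk; the application name and the HDFS path are hypothetical and this is not the exact code used in the experiments:

from pyspark.sql import SparkSession

# Hypothetical application name; assumes a working Spark installation.
spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical log file on HDFS.
lines = sc.textFile("hdfs:///user/demo/access.log")
words = lines.flatMap(lambda line: line.split())

# cache() keeps the RDD in memory, so the second action below reuses it
# instead of recomputing from disk, which is the MapReduce-style path.
words.cache()

print(words.count())                                 # first action: builds and caches
print(words.filter(lambda w: w == "GET").count())    # second action: served from memory

spark.stop()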

Spark is a lightning-fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop. It makes this possible by reducing the number of read/write cycles to disk and storing intermediate data in memory, while MapReduce reads from and writes to disk, and this back and forth slows down the processing speed. Now, we will run an experiment to show that Spark works faster than MapReduce [43].

7.4. Experiment

Now, the listed queries have been run with Hive on both platforms, i.e., MapReduce and Spark, and the execution time has been noted. The amount of time consumed to execute each user query for finding records is listed below:

Table 3: Execution time taken by Hive with Spark

Query | Time in sec
Q1 | 1.95
Q2 | 1.65
Q3 | 1.48
Q4 | 1.9
Q5 | 2.00
Q6 | 4.5
Q7 | 1.9
Q8 | 2.6
Q9 | 3.7
Q10 | 4.6
Q11 | 8.7
Q12 | 6.3
Q13 | 3.5
Q14 | 0.95
Q15 | 2.3

Table 4: Execution time taken by Hive with MapReduce

Query | Time in sec
Q1 | 9.65
Q2 | 6.95
Q3 | 7.58
Q4 | 13.9
Q5 | 10.90
Q6 | 6.7
Q7 | 3.3
Q8 | 4.2
Q9 | 4.7
Q10 | 7.9
Q11 | 11.9
Q12 | 7
Q13 | 3.00
Q14 | 1.9
Q15 | 4.6

Fig 13: Time in seconds taken by Spark vs MR

7.4.1. Result

It is pretty evident that Spark performs the query execution a lot faster than MapReduce, since Spark does in-memory processing whereas MapReduce needs to read from and write to disk; hence, the processing speeds differ. The graph in Fig. 13 gives a better picture of the difference in execution speed between Spark and MapReduce.

7.5. Hive vs Pig

Hive is an integral part of the Hadoop ecosystem and is used for structured data. A data structure (schema) therefore needs to be defined before a Hive table can be populated. Because Hive is very similar to SQL, Hive queries can be optimized in much the same way as SQL queries. Moreover, Hive offers additional features such as partitioning and bucketing, which make data analysis easier and quicker.
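As an illustration of partitioning and bucketing (the table and column names here are hypothetical, not the schema used in the experiments), a Hive table can declare both in its DDL:

CREATE TABLE patient_records (
  id   INT,
  age  INT,
  chol INT
)
PARTITIONED BY (admit_year INT)    -- one directory of data per partition value
CLUSTERED BY (id) INTO 8 BUCKETS   -- rows hashed into 8 bucket files within each partition
STORED AS ORC;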

Hive was developed by Facebook and later became a top-level Apache project. It gives the user flexibility by requiring less code: it converts queries into MapReduce execution, so developers do not have to worry much about the backend. Hive uses a query language very similar to SQL, known as HQL (Hive Query Language).


Hive also gives the option to create UDFs (user-defined functions) when a needed function is not built in; a sketch of registering one is shown after the summary list below. In short, Hive can be summarized as follows:

• Hive is a data warehouse infrastructure.

• Hive uses a language called HQL, which is quite similar to SQL.

• It offers various tools for easy extraction, transformation, and loading of data.

• Hive allows users to define custom mappers and reducers.

• It is the preferred tool for data analytics and reporting work.
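As mentioned above, when a built-in function is missing, a custom UDF can be packaged in a JAR and registered in the session. The JAR path and Java class below are hypothetical examples, not artifacts from this study:

ADD JAR /tmp/custom_udfs.jar;
CREATE TEMPORARY FUNCTION normalize_name AS 'com.example.hive.udf.NormalizeName';
SELECT normalize_name(name) FROM banklist LIMIT 10;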

Pig was developed in 2006 by Yahoo to reduce the code complexity of MapReduce. It uses a simple language called Pig Latin, a high-level data-flow language used specifically for data manipulation and queries.

Moreover, no schema needs to be created before storing data in Pig; files can be loaded directly and used right away.

More specifically, Pig acts as a kind of ETL (extract-transform-load) tool for big data. It is quite useful and can handle large datasets. Moreover, it allows developers a multi-query approach, which reduces the number of data-scan iterations. In addition, it supports nested data types such as maps, tuples, and bags, and operations such as Filter, Join, and Ordering.

Many companies use Pig for the majority of their MapReduce-related work. In short, Pig can be summarized as follows:

• Pig provides a high-level language called Pig Latin.

• Programmers who are familiar with scripting languages generally prefer Pig.

• There is no need to create a schema in order to store data.

• Pig translates Pig Latin scripts into sequences of MapReduce programs.


Since Hadoop uses MapReduce to process and analyze big data, processing takes more time than with traditional methods. MapReduce was primarily written in Java, and programmers had to write lengthy, complex code to process data. This was a disadvantage for users who are not programmers. To overcome this issue, Hive and Pig were introduced.

Initially, it was hard for Facebook to process and analyze big data, since not all employees were well versed in high-level programming. They needed a language similar to SQL, which is significantly easier than a high-level programming language like Java. Hence, Hive was developed with a vision to include concepts such as tables and columns, just like SQL.

Similarly, Yahoo also found it hard to process and analyze big data using MapReduce, as not all employees were well versed in writing complex Java code. There was therefore a need to process data using a language that was easier than Java, and Yahoo researchers developed Pig, which made it possible to process data quickly and easily.

Hive uses a query language called HiveQL to process and analyze data. It is a declarative language very similar to SQL, and it works on structured data [44].

Pig Latin is the procedural data-flow language used in Pig to analyze data; it resembles SQL in places but differs from it considerably. It is used for structured, semi-structured, and unstructured data.


Fig 14: Hive data model

Tables in Hive are similar to those in an RDBMS. Tables are divided into partitions, which group the same kind of data based on a partition key, and partitions are further divided into buckets for better query processing.
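For example, on the hypothetical partitioned table sketched earlier, a query that filters on the partition key only needs to scan the matching partition, and the existing partitions can be listed directly:

SHOW PARTITIONS patient_records;                               -- list the partitions that exist
SELECT COUNT(*) FROM patient_records WHERE admit_year = 2019;  -- reads only that partition's data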

Fig 15: Pig Latin model

• An Atom is a single value of a primitive data type such as int, float, or string. It is always stored as a string.

• A Tuple is a sequence of fields that can be of any data type; it corresponds to a row in an RDBMS.

• A Bag is a collection of tuples, corresponding to a table in an RDBMS. It is represented by '{}'.

• A Map is a set of key-value pairs. Keys are of chararray type and values can be of any type. It is represented by '[]'.

A brief comparison between Hive and Pig is listed below:


Table 5: Comparison between Hive and Pig

Hive | Pig
Used by analysts. | Used by programmers and researchers.
HiveQL is the language used. | Pig Latin is the language used.
Works only on structured data. | Works on structured, semi-structured, and unstructured data.
Does not support Avro. | Supports Avro.
Supports partitions. | Does not support partitions, although filtering is available.
Has a web interface. | Does not have a web interface.

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and it serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format and a wire format for communication between Hadoop nodes; these services can be used together or independently. Avro uses a schema to structure the data being encoded, and it has two schema languages: one intended for human editing (Avro IDL) and another, more machine-readable one based on JSON.
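A minimal example of the machine-readable, JSON-based schema form (the record and field names here are hypothetical) might look like this:

{
  "type": "record",
  "name": "Patient",
  "fields": [
    {"name": "id",   "type": "int"},
    {"name": "chol", "type": "int"}
  ]
}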

The amount of time consumed from submitting a user query to retrieving the matching records in Hive is referred to here as the query execution time. In order to measure it, the queries listed below were written in the Hive interface and their performance was measured. The following ten queries were expressed in both HiveQL and Pig Latin form to compare execution times [45].

Queries:

1. Select the age of the patients whose cholesterol level is over 200.

2. Select the number of patients who are women.

3. Select the number of patients who are men.


4. Select the list of banks that went bankrupt.

5. List of products with price in an ascending order.

6. Select the number of patients who are 45 or older.

7. List the number of drug victims based on their race.

8. List the number of drug victims based on their gender.

9. List of states with failed banks.

10. List of products with price in a descending order.

Hive SQL:

• SELECT age FROM Cleveland WHERE chol > 200;

• SELECT COUNT(ID), Sex FROM Cleveland WHERE Sex = 'Female' GROUP BY Sex;

• SELECT COUNT(ID), Sex FROM Cleveland WHERE Sex = 'Male' GROUP BY Sex;

• SELECT name, ST FROM banklist WHERE status = 'bankrupt';

• SELECT product_name, price FROM products ORDER BY price ASC;

• SELECT COUNT(ID) FROM Patient WHERE age >= 45;

• SELECT COUNT(ID), Race FROM Ace_Drug GROUP BY Race;

• SELECT COUNT(ID), Sex FROM Ace_Drug GROUP BY Sex;

• SELECT COUNT(ID), ST FROM banklist GROUP BY ST;

• SELECT Product_Name, Price FROM products ORDER BY Price DESC;

Pig Latin:

1. X = FILTER Cleveland BY chol > 200;
   Y = FOREACH X GENERATE age;
   DUMP Y;

2. X = FILTER Cleveland BY Sex == 'Female';
   Y = GROUP X ALL;
   Z = FOREACH Y GENERATE COUNT(X);
   DUMP Z;

3. X = FILTER Cleveland BY Sex == 'Male';
   Y = GROUP X ALL;
   Z = FOREACH Y GENERATE COUNT(X);
   DUMP Z;

4. X = FILTER banklist BY status == 'bankrupt';
   Y = FOREACH X GENERATE name, ST;
   DUMP Y;

5. X = FOREACH Products GENERATE ProductName, Price;
   Y = ORDER X BY Price ASC;
   DUMP Y;

6. X = FILTER Patient BY Age >= 45;
   Y = GROUP X ALL;
   Z = FOREACH Y GENERATE COUNT(X);
   DUMP Z;

7. X = FOREACH Ace_Drug GENERATE ID, Race;
   Y = GROUP X BY Race;
   Z = FOREACH Y GENERATE group AS Race, COUNT(X);
   DUMP Z;

8. X = FOREACH Ace_Drug GENERATE ID, Sex;
   Y = GROUP X BY Sex;
   Z = FOREACH Y GENERATE group AS Sex, COUNT(X);
   DUMP Z;

9. X = FOREACH banklist GENERATE ID, ST;
   Y = GROUP X BY ST;
   Z = FOREACH Y GENERATE group AS ST, COUNT(X);
   DUMP Z;

10. X = FOREACH Products GENERATE ProductName, Price;
    Y = ORDER X BY Price DESC;
    DUMP Y;

Table 6: Execution time taken by Hive vs Pig

Query    Hive (sec)    Pig (sec)
Q1       8.3           13.49
Q2       3.47          5.33
Q3       6.49          9.47
Q4       5.44          8.97
Q5       4.30          7.86
Q6       3.6           6.5
Q7       4.8           5.6
Q8       8.7           13.5
Q9       9.6           14.22
Q10      5.6           8.3

Fig 16: Time taken by Hive vs Pig


8. CONCLUSION

In this paper, I discussed big data in detail, including its types, applications, uses, and importance, along with the tools available in big data technology. Since this technology deals with data of all sizes, and especially very large datasets, I used several big datasets to run the experiments. I compared tools such as Spark vs MapReduce, Sqoop vs Flume, Tez vs MapReduce, and Pig vs Hive. This paper also addresses the critical question of why each of these tools is faster than its counterpart; to answer it, I ran different experiments using different kinds of datasets, and the comparative study was presented graphically for a better understanding of the differences. In the future, I want to continue working on big data to provide answers with greater precision.


REFERENCES

1. S. K. Sahu, . M. Jacintha and A. P. Singh, "Comparative study of tools for big data analytics: An analytical study," 2017 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, 2017, pp. 37-41, doi: 10.1109/CCAA.2017.8229827.

2. Ikhlaq, S. & Ikhlaq, Sheikh. (2017). A comparative study of big data computational approaches. International Journal of Applied Engineering Research. 12. 8131-8136.

3. Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2, 3. https://doi.org/10.1186/2047-2501-2-3

4. Munné R. (2016) Big Data in the Public Sector. In: Cavanillas J., Curry E., Wahlster W. (eds) New Horizons for a Data-Driven Economy. Springer, Cham. https://doi.org/10.1007/978-3-319-21569-3_11

5. Silva, E., Hassani, H., & Madsen, D. (2019). Big Data in fashion: transforming the retail sector. Journal of Business Strategy, ahead-of-print.

6. O’Donovan, P., Leahy, K., Bruton, K. et al. Big data in manufacturing: a systematic mapping study. Journal of Big Data 2, 20 (2015). https://doi.org/10.1186/s40537-015-0028-x

7. Lee, J.G., & Minseo, K. (2015). Geospatial Big Data: Challenges and Opportunities. Big Data Research, 2.

8. Diego Matricano, Framing the entrepreneurship phenomenon, Entrepreneurship Trajectories, 10.1016/B978-0-12-818650-3.00001-5, (1-31), (2020).

9. Lee, I. (2017). Big data: Dimensions, evolution, impacts, and challenges. Business Horizons, 60.

10. Balachandran, B., & Prasad, S. (2017). Challenges and Benefits of Deploying Big Data Analytics in the Cloud for Business Intelligence. Procedia Computer Science, 112, 1112-1122.

11. Big data defined. (n.d.). www.oracle.com. Retrieved November 4, 2020, from https://www.oracle.com/big-data/what-is-big-data.html

12. Balachandran, Bala M., and Shivika Prasad. “Challenges and Benefits of Deploying Big Data Analytics in the Cloud for Business Intelligence.” Procedia Computer Science, vol. 112, Jan. 2017, pp. 1112–22. ScienceDirect, doi:10.1016/j.procs.2017.08.138.

13. Savaram, R. (n.d.). Apache Sqoop vs Apache Flume. mindmajix.com. Retrieved November 4, 2020, from https://mindmajix.com/apache-sqoop-vs-apache-flume.


14. Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016a). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3–4), 145–164. https://doi.org/10.1007/s41060-016-0027-9

15. Farhan, Md. Nowraj, Habib, Md. Ahsan, & Ali, Arshad. (2018). A Study and Performance Comparison of MapReduce and Apache Spark on Twitter Data on Hadoop Cluster. International Journal of Information Technology and Computer Science, 10, 61-70. 10.5815/ijitcs.2018.07.07.

16. Ali, W., Shafique, M. U., Majeed, M. A., & Raza, A. (2019). Comparison between SQL and NoSQL Databases and Their Relationship with Big Data Analytics. Asian Journal of Research in Computer Science, 1–10. https://doi.org/10.9734/ajrcos/2019/v4i230108

17. Vyawahare, H. R., Karde, D. P. P., & Thakare, D. V. M. (2017). Brief Review on SQL and NoSQL. International Journal of Trend in Scientific Research and Development, Volume-2(Issue-1), 968–971. https://doi.org/10.31142/ijtsrd7105

18. Ali, W., Majeed, M., Raza, A., & Shafique, M. (2019). Comparison between SQL and NoSQL Databases and Their Relationship with Big Data Analytics. Asian Journal of Computer Science and Information Technology, 4, 1-10.

19. Moniruzzaman, A., & Hossain, S. (2013). NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison. ArXiv, abs/1307.0191.

20. Băzăr, C., & Iosif, C.S. (2014). The Transition from RDBMS to NoSQL. A Comparative Analysis of Three Popular Non-Relational Solutions: Cassandra, MongoDB and Couchbase. Database Systems Journal, 5, 49-59.

21. Thomas Mason, R. (2015). NoSQL Databases and Data Modeling for a Document-oriented NoSQL Database. Proceedings of the 2015 InSITE Conference, 260–267. https://doi.org/10.28945/2245.

22. Pothuganti, A. (2015). Big Data Analytics : Hadoop-Map Reduce & NoSQL Databases.

23. Marr, B. (2016). Big Data in Practice: How 45 Successful Companies Used Big Data Analytics to Deliver Extraordinary Results (1st ed.). Wiley.

24. Ali, W., Shafique, M. U., Majeed, M. A., & Raza, A. (2019). Comparison between SQL and NoSQL Databases and Their Relationship with Big Data Analytics. Asian Journal of Research in Computer Science, 1–10. https://doi.org/10.9734/ajrcos/2019/v4i230108

25. Nayak, A. (2013). Type of NOSQL Databases and its Comparison with Relational Databases.

26. A. Gupta, S. Tyagi, N. Panwar, S. Sachdeva, & U. Saxena (2017). NoSQL databases: Critical analysis and comparison. In 2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN) (pp. 293-299).


27. Tudorică, B., & Bucur, C. (2011). A comparison between several NoSQL databases with comments and notes. 2011 RoEduNet International Conference 10th Edition: Networking in Education and Research, 1-5.

28. Cherifi, D., Radji, N., & Nait-Ali, A. (2011). “Effect of Noise Blur and Motion on Global Appearance Face Recognition based Methods Performance.” International Journal of Computer Applications, 16(6), 4–13. https://doi.org/10.5120/2019-2723.

29. Horne, J. (2018). Visualizing Big Data From a Philosophical Perspective. Advances in Data Mining and Database Management, 809–852. https://doi.org/10.4018/978-1-5225-3142-5.ch028.

30. Jha, Meena & Jha, Sanjay & O'Brien, Liam. (2016). Combining big data analytics with business process using reengineering. 1-6. 10.1109/RCIS.2016.7549307.

31. Akter, S., & Wamba, S. F. (2016). Big data analytics in E-commerce: a systematic review and agenda for future research. Electronic Markets, 26(2), 173–194. https://doi.org/10.1007/s12525-016-0219-0

32. Aswani, R., Kar, A. K., Ilavarasan, P. V., & Dwivedi, Y. K. (2018). Search engine marketing is not all gold: Insights from Twitter and SEOClerks. International Journal of Information Management, 38(1), 107–116. https://doi.org/10.1016/j.ijinfomgt.2017.07.005.

33. Wang, Y., & Hajli, N. (2017). Exploring the path to big data analytics success in healthcare. Journal of Business Research, 70, 287–299. https://doi.org/10.1016/j.jbusres.2016.08.002

34. Karaboga, T. (2019). Big Data Analytics And Firm Innovativeness: The Moderating Effect Of Data-Driven Culture. Big Data Analytics And Firm Innovativeness: The Moderating Effect Of Data-Driven Culture, 526–535. https://doi.org/10.15405/epsbs.2019.01.02.44

35. Lopez-Nicolas, C., & Soto-Acosta, P. (2010). Analyzing ICT adoption and use effects on knowledge creation: An empirical investigation in SMEs. International Journal of Information Management, 30(6), 521–528. https://doi.org/10.1016/j.ijinfomgt.2010.03.004

36. Rehman, M. H., Chang, V., Batool, A., & Wah, T. Y. (2016). Big data reduction framework for value creation in sustainable enterprises. International Journal of Information Management, 36(6), 917–928. https://doi.org/10.1016/j.ijinfomgt.2016.05.013

37. Grant, R. M. (2013). Nonaka’s ‘Dynamic Theory of Knowledge Creation’ (1994): Reflections and an Exploration of the ‘Ontological Dimension.’ Towards Organizational Knowledge, 77–95. https://doi.org/10.1057/9781137024961_5


38. Ruths, D., & Pfeffer, J. (2014). Social media for large studies of behavior. Science, 346(6213), 1063–1064. https://doi.org/10.1126/science.346.6213.1063

39. Fan, W., & Gordon, M. D. (2014). The power of social media analytics. Communications of the ACM, 57(6), 74–81. https://doi.org/10.1145/2602574

40. Fischer, M., & Himme, A. (2017). The financial brand value chain: How brand investments contribute to the financial health of firms. International Journal of Research in Marketing, 34(1), 137–153. https://doi.org/10.1016/j.ijresmar.2016.05.004

41. Lawrence, J. M., Crecelius, A. T., Scheer, L. K., & Patil, A. (2019). Multichannel Strategies for Managing the Profitability of Business-to-Business Customers. Journal of Marketing Research, 56(3), 479–497. https://doi.org/10.1177/0022243718816952
