A Case Study on Apache Hbase
Total Page:16
File Type:pdf, Size:1020Kb
A case study on Apache HBase A Master’s Project Presented to Department of Computer and Information Sciences SUNY Polytechnic Institute Utica, New York In Partial Fulfillment Of the Requirements for the Master of Science Degree by Rohit Reddy Nalla May 2015 © Rohit Reddy Nalla 2015 Abstract Introduction: Apache HBase is an open-source, non-relational and a distributed data base system built on top of HDFS (Hadoop Distributed File system). HBase was designed post Google’s Big table and it is written in Java. It was developed as a part of Apache’s Hadoop Project. It provides a kind of fault – tolerant mechanism to store minor amounts of non-zero items caught within large amounts of empty items. HBase is used when we require real-time read/write access to huge data bases. HBase project was started by the end of 2006 by Chad Walters and Jim Kellerman at Powerset.[2] The main purpose of HBase is to process large amounts of data. Mike Cafarella worked on code of the working system initially and later Jim Kellerman carried it to the next stage. HBase was first released as a part of Hadoop 0.15.0 in October 2007[2]. The project goal was holding of very large tables like billions of rows X millions of columns. In May 2010, HBase advanced to a major project and it became an Apache Top Level Project. Several applications like Adobe, Twitter, Yahoo, Trend Micro etc. use this data base. Social networking sites like Facebook have implemented its messenger application using HBase. This document helps us to understand how HBase works and how is it different from other data bases. This document highlights about the current challenges in data security and a couple of models have been proposed towards the security and levels of data access to overcome the challenges. This document also discusses the workload challenges and techniques to overcome. Also an overview has been given on how HBase has been implemented in real time application Facebook messenger app. 3 Acknowledgment I, Rohit Reddy Nalla want to express my gratitude to Dr. Sam Sengupta, the advisor of this project has been a constant inspiration and has offered guidance in every possible way one could imagine. I would also like to thank my family and friends for their encouragement and support. 4 Table of Contents 1. Introduction to Database systems .……………………………………………………….8 1.1 Introduction ………………………………………………………………………….8 2. Introduction to HBase …………………………...………………………………………11 2.1 Background ………………………………………………………………………….11 2.2 Data model …………………………………..………………………………………11 2.3 HBase Implementation and Operation ………………………………………………14 2.4 Running HBase ...……………………………………………………………………17 2.4.1 Understanding of basic queries for table creation and updates………………...17 2.4.2 Connecting HBase with Java ………………………………………………….21 3. Current Data Vulnerability and Workload challenges ……..…………………………………27 3.1 Current issues in Data Security ……….…………………………………………….27 3.2 Security model based on Kerberos Authentication and Access Control Lists……...27 3.2.1 Introduction………………………………………………………………..27 3.2.2 Kerberos protocol………………………………………………………….28 3.3 HBase cell level security model……………………………………………………..32 3.3.1 Cell tags ……………………………………………………………...……32 5 3.3.2 Cell ACL’s …..…………………………………………………………….33 3.3.3 Visibility labels (cell labels) …………………………………………...….33 3.3.4 HFile version 3……………………………………………………………..37 3.4 Running multiple workloads on a single HBase cluster……………………………..37 3.4.1 Throttle………………………………………………………………….…38 3.4.2 Decreasing the priority of long running requests………………………..…39 3.5 Data Backup and Disaster Recovery…………………………………………………40 3.5.1 Snapshots…………………………………………………………………..40 3.5.2 HBase Replication……………………………………………...………….41 3.5.3 HBase Export……………………………………………………………....41 3.5.4 Copy Table ….………………………………………………………..……42 4. HBase in Real time …………………………………………………………………………...43 4.1 Background ………………………………………………………………………….43 4.2 Workload challenges ………………………………………………………………...43 4.3 Implementation of HDFS ……………………………………………………………46 4.3.1 Name Node (Master) and Data nodes …………………………………….…...46 4.3.2 Avatar Node …………………………………………………………………...48 4.3.3 HDFS Transaction logging ……………………………………………………49 4.3.4 Distributed Avatar File System …………………...……………………………50 6 4.3.5 Enhancements for Real time Work Load ……...……..…………………....50 4.4 Implementation of HBase at Facebook ………………………………………….….52 4.4.1 ACID Properties……………………………………………………………..52 4.4.2 Availability Enhancements …………………………………………………..53 4.4.3 Performance Improvements…………………………………………………..54 4.5 Production Deployment and challenges …………………………………………….56 4.5.1 Testing ……………………………………………………………………….56 4.5.2 Region Splitting (Manual v/s Automatic)…………………………………….57 4.5.3 Schema changes……………………………………………...……………….57 4.5.4 Data Imports………………………………………………………………..…58 4.5.5 Network IO traffic…………………………………………………………....58 4.6 Future implementations at Facebook………………………………………………...58 5. HBase vs RDBMS ………………………………………………………………………….60 5.1 Background ...………………………………………………………………………..61 5.2 Analysis ……………………………………………………………………………...61 6 Future work …...……………………………………………………………...………………..63 References ……………………………………………………………………………………….64 7 Chapter 1 Introduction to Database systems 1.1 Introduction A Data base is a place where data is stored in an organized way. The stored data is used across various aspects of an application which helps in processing the requests efficiently. In the recent times, usage of internet has brought some revolutionary changes across the world and many online websites and applications have been developed to give users more benefit. While developing an application (or website) maintaining a data base has become relatively important. Maintaining the customer’s data and catalogue has become a challenging task since then. That is when Database management systems (DBMS) came into picture. Database management systems (DBMS) are kind of software applications which allow implementing CURD operations i.e. create, update, read and delete the data in tables. It interacts with end users, third party applications, and internal applications within the system and with the database itself to capture and process the data efficiently. MySQL, PostgreSQL, NoSQL, Oracle etc. are some of the famous data bases. Generally depending on the database models they support DBMSs are classified. Most of the well-known data bases have supported relational model which has been represented by SQL since 1980’s. Initially in 1960’s some General-purpose DBMSs have typically served the needs of many applications and they were cost effective too. However sometimes it has introduced some unnecessary overhead and later there was a realization that it was not a favorable one to use. Then special-purpose DBMSs came into picture which served the purposes. To make use of email system in the most effective way, special-purpose DBMS were used to handle email systems. As days passed on there were lots of advancements in terms of 8 technology perspective. In the areas of computer storage, networks and memory the performance expectations of the DBMS have been started growing. In mid 1960’s “Data Base Task Group” has come up with an navigational approach (CODASYL approach) where records in the data base can be found by navigating relationships from one set of records to another set of records and they were searched in a sequential order. However in further validations, it was proved to be a difficult procedure and it requires rigorous workout in order to serve applications. There was a growing concern on handling large data banks. An efficient service should be given to the users in such a way that their activities should not be affected though there is any change in the internal or external representations of their data. In this context, in 1970’s Edgar Codd wrote numerous papers that outlined a new approach to the database construction that found a break through by A Relational Model of Data for Large Shared Data Bank[5]. He described a new approach for maintaining large data bases and for storing data in it. Unlike navigational approach where records are stored in linked lists, he came up with storing records in tables of fixed length, where each table is considered as one unit. For example, to create a table with user account information such as user ID, password, contact no., date of birth etc., user ID should be unique so that all the other fields are stored by putting the user ID as reference key or primary key. Whenever if we want to collect this information, we can search for it with the help of a primary key so that information stored in the optional tables can be found with the help of this key. With the help of Tuple Calculus (branch of mathematics) he has developed a system which can maintain the operations of a regular database like CURD operations and also it can sort the datasets in one single operation. Based on the Codd’s theory IBM had started working on a modal named System R. Their idea was to develop multi-table systems, in which all the data which is stored for record could be split instead of storing the whole data at one large place. 9 They implemented Codd’s concepts successfully and they have created a product of System R known as SQL or DS. Later it became Database 2(DB2). In later 1970’s based on IBM’s research Larry Ellison also came up with Oracle database. These Relational Database systems (RDBMS) have become popular and were used for a small scale to medium scale applications. SQL language is used for writing queries in order to implement the basic CURD operations. It has powerful features and query languages like MySQL and PostgreSQL. Secondary indexes can be easily created and it performs several inner and outer joins, count, sorting across the rows and columns in data base tables. However if we need to measure or scale up with large amounts of data and random read/write access, then we’ll find difficulty in performing these actions with RDBMS. It makes distribution of data complex with ACID properties and Codd’s 12 rules. To overcome these drawbacks then came the next generation databases known as NoSQL databases. These databases are kind of document-oriented databases depend upon data attributes in the XML document while querying.