How HBase Works

*This study guide was created from lecture videos to help you gain an understanding of how HBase works.

HBase Foundations

Hadoop was developed largely at Yahoo and is modeled on designs that Google published for its distributed file system and for MapReduce. HDFS stands for Hadoop Distributed File System, and it spreads data across machines, called nodes, in its cluster.

The data does not have a schema, as it is just documents and files. HDFS is schemaless, distributed and fault tolerant.

MapReduce is focused on data processing, and MapReduce jobs are typically written in Java. A MapReduce job finds the data, lists the tasks it needs to execute, and then executes them: map tasks (mappers) process the input, and reduce tasks (reducers) combine the results. A downside is that it is batch oriented, which means you have to read the entire file of data even if you only want a small portion of it. Batch-oriented processing is slow for that kind of lookup.
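
To make the map and reduce steps concrete, here is a minimal sketch of a word count in plain Python. It is not Hadoop code (real jobs are usually written in Java against the Hadoop API); the function names map_phase, shuffle and reduce_phase are made up for this illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group the emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    # Every line is read, even if the caller only cares about one word.
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    sample = ["hbase stores data", "hadoop stores files"]
    print(word_count(sample))
```

Notice that word_count has to pass over all of the input before it can answer anything, which is the batch-oriented behavior described above.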

Hadoop handles semi-structured and unstructured data. It offers no random access and no transaction support.

HBase is also called the Hadoop database and, unlike Hadoop or HDFS, it has a schema. An in-memory feature lets you read information quickly, and you can isolate the data you want to analyze. HBase supports random access.

HBase allows for CRUD: Create adds a new document, Read brings the information into an application or process, Update changes a value, and Delete removes data from the system.

mind movement machine. apache hbase. | 2

NoSQL Databases

CAP Theorem- Consistency: every read from a database receives the most recent data. Availability: every request receives a non-error response. Partition tolerance: the system continues to operate even when an arbitrary number of messages are dropped by the network between the nodes in a database cluster.

A distributed database can guarantee ONLY 2 out of the 3 at the same time.

Relational databases are consistent and available.

Databases such as Cassandra or CouchDB are available and partition tolerant. Databases such as HBase or MongoDB are consistent and partition tolerant.

HBase can support real-time applications like messaging or transactions because it is consistent.

Visit InfoWorld to learn more about NoSQL databases.

NoSQL Database Structure

In a key-value store, the data structure is made up of two columns: the Key, which identifies the element in the row, and the Value, which is a single item or an array of items.

Document stores help with reliability because if one database or node in the cluster goes down, other stored copies of the data can be used to keep operations running.

Graph databases are made up of nodes, edges and properties. Neo4j is an example of a graph database.

Column families are how HBase stores its data. Every row has a key, and related columns are grouped together into what are called column families.
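
One way to picture this layout is as nested dictionaries: row key, then column family, then column, then value. The sketch below is only a toy model in Python, not how HBase lays out data on disk, and the row keys, family names and helper functions are invented for the example.

```python
# Toy model: row key -> column family -> column -> value.
table = {
    "emp1": {
        "personal": {"name": "Ada", "city": "London"},
        "work": {"title": "Engineer"},
    },
    "emp2": {
        "personal": {"name": "Grace"},
        "work": {"title": "Admiral", "dept": "Navy"},
    },
}


def read_family(table, row_key, family):
    """Return a copy of one column family's columns for one row."""
    return dict(table.get(row_key, {}).get(family, {}))


def read_column(table, row_key, family, column):
    """Return a single value, or None when the cell does not exist."""
    return table.get(row_key, {}).get(family, {}).get(column)


if __name__ == "__main__":
    print(read_family(table, "emp1", "work"))
    print(read_column(table, "emp2", "personal", "city"))
```

Because columns are grouped by family, a reader that only needs the work columns never has to touch the personal ones.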

Relational Database Management Systems

RDBMS are made up of tables and views, rows and columns, primary keys and foreign keys, and data types, and they use SQL to develop and analyze the database.

RDBMS are rigid: you must define a schema that specifies the structure of the columns, data types and relationships before using the database. This can slow application development because it is time-consuming.

RDBMS store data in a way that eliminates redundancy, which prevents anomalies, that is, data that doesn’t match between copies of a record. This is the definition of normalization. It makes the data slow to analyze because the data is spread out across many different tables.

The solution to this was to store the data in a centralized location, known as a data warehouse, which brings all the data together in a denormalized format.
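
The difference between normalized and denormalized data can be sketched with Python's built-in sqlite3 module. The customers and orders tables and their contents are made up for the example; the point is only that the normalized form needs a join before it can be analyzed, while the denormalized result repeats the customer name on every row.

```python
import sqlite3


def build_database():
    """Create a small normalized schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            item TEXT NOT NULL
        );
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES
            (10, 1, 'keyboard'), (11, 1, 'monitor'), (12, 2, 'laptop');
        """
    )
    return conn


def denormalized_orders(conn):
    """Join the two tables into one flat, warehouse-style result set."""
    cursor = conn.execute(
        """
        SELECT o.id, c.name, o.item
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        ORDER BY o.id
        """
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for row in denormalized_orders(build_database()):
        print(row)
```

In the normalized tables each customer name is stored once; in the flat result it appears once per order, which is the redundancy a data warehouse accepts in exchange for simpler analysis.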

RDBMS do not scale well because the data cannot easily be distributed across multiple data centers or different regions.

If you do not need the ability to scale, or you only need to support hundreds of users, RDBMS are a good choice.

NoSQL databases differ in that they do not require a rigid schema that defines everything up front.

NoSQL databases are schemaless, distributed and fault tolerant.

Visit Codecademy for a further understanding of Relational Database Management Systems.

HBase Interfaces

HBase Shell allows you to perform CRUD operations interactively, or you can store CRUD commands in a file to run later. HBase Shell is built on JRuby, so you can assign table references to variables and use JRuby to interact with data in HBase.

The Java API is best used for tasks more complex than the HBase Shell can handle, and when your application is built on Java.

Apache Phoenix is an interface that enables OLTP and operational analytics on top of HBase. It offers SQL and JDBC interfaces and full ACID transactions. It can give you a regular database running on top of HBase and can work with other platforms like Spark, Hive, Pig, Flume and MapReduce.

Column Table Layout

Columnar tables allow you to scan a smaller dataset for faster results. They are optimized for storage and speed, and they give flexibility in application development.

Sparsely populated tables- if you do not have a value for an attribute, then nothing is stored in the database for it.
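
Sparse storage can be illustrated with a small Python sketch in which a missing attribute simply has no entry. This is a toy model, not HBase's storage format, and put_row and stored_cells are hypothetical helper names.

```python
def put_row(table, row_key, **attributes):
    """Store only the attributes that actually have a value."""
    table[row_key] = {
        name: value for name, value in attributes.items() if value is not None
    }


def stored_cells(table):
    """Count the cells that are physically present across all rows."""
    return sum(len(cells) for cells in table.values())


if __name__ == "__main__":
    people = {}
    put_row(people, "r1", name="Ada", phone=None, email=None)
    put_row(people, "r2", name="Grace", phone="555-0100", email="grace@example.com")
    print(people)
    print(stored_cells(people))
```

A fixed-width relational table would reserve a phone and email column for every row; here row r1 holds one cell and takes no space for the attributes it lacks.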

HBase Terminology

A table is a collection of rows, a row is identified by what is called a row key, and inside a row there are different column families.

In HBase, a namespace is a logical grouping of tables, and you can refer to it in a similar way to a database in a relational database system.

Versioning occurs at the individual value level and HBase has built-in versions, which allow you to look back in time.
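
The idea of versions on a single value can be sketched as a cell that keeps several timestamped values. The class below is a toy Python model; the name VersionedCell, the integer timestamps and the limit of three versions are assumptions made for the example, not HBase settings.

```python
class VersionedCell:
    """Keeps the newest max_versions values of one cell, keyed by timestamp."""

    def __init__(self, max_versions=3):
        # Assumes max_versions is at least 1.
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, value, timestamp):
        self._versions[timestamp] = value
        # Drop the oldest timestamps once the limit is exceeded.
        for old in sorted(self._versions)[:-self.max_versions]:
            del self._versions[old]

    def get(self, as_of=None):
        """Return the newest value, or the newest one at or before as_of."""
        candidates = [t for t in self._versions if as_of is None or t <= as_of]
        return self._versions[max(candidates)] if candidates else None

    def history(self):
        """Return (timestamp, value) pairs, newest first."""
        return [(t, self._versions[t]) for t in sorted(self._versions, reverse=True)]


if __name__ == "__main__":
    title = VersionedCell()
    title.put("Intern", timestamp=1)
    title.put("Engineer", timestamp=2)
    print(title.get())
    print(title.get(as_of=1))
    print(title.history())
```

Reading with as_of is the "look back in time" behavior: the cell answers with whatever value was current at that timestamp, as long as that version has not been dropped.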

How is data stored in HBase?

HBase Namespaces

Run these commands from the HBase shell. Run create_namespace 'people' to create a namespace called people. Run create 'people:employees', 'work', 'demo' to create a table called employees in the people namespace with two column families, work and demo.

Run drop_namespace 'people' to remove a namespace from HBase. The namespace must be empty before it can be dropped.

Run alter_namespace 'people', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'} to change a property of the namespace.

If you do not specify a namespace, HBase places the table in the default namespace.
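
The namespace:table naming used in the commands above can be modeled in a few lines of Python. This is only a sketch of the bookkeeping, not the HBase client API; Catalog and split_table_name are hypothetical names.

```python
DEFAULT_NAMESPACE = "default"


def split_table_name(qualified_name):
    """Split 'namespace:table'; bare names fall into the default namespace."""
    namespace, separator, table = qualified_name.partition(":")
    if not separator:
        return DEFAULT_NAMESPACE, qualified_name
    return namespace, table


class Catalog:
    """Toy registry mapping namespace -> table -> list of column families."""

    def __init__(self):
        self.namespaces = {DEFAULT_NAMESPACE: {}}

    def create_namespace(self, name):
        self.namespaces.setdefault(name, {})

    def create_table(self, qualified_name, *families):
        namespace, table = split_table_name(qualified_name)
        if namespace not in self.namespaces:
            raise KeyError(f"unknown namespace: {namespace}")
        self.namespaces[namespace][table] = list(families)

    def drop_namespace(self, name):
        # Mirrors the rule that a namespace must be empty to be dropped.
        if self.namespaces.get(name):
            raise ValueError(f"namespace {name} still contains tables")
        self.namespaces.pop(name, None)


if __name__ == "__main__":
    catalog = Catalog()
    catalog.create_namespace("people")
    catalog.create_table("people:employees", "work", "demo")
    catalog.create_table("notes", "text")
    print(catalog.namespaces)
```

The table created without a prefix ends up under default, which matches what happens when no namespace is given.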

Data Model Operations in HBase

Get command is used to get a specific row from an HBase table

Ex. get 'tablename', '1' to get row 1 from table called tablename

Scan command will get data from the entire table

Ex. scan 'tablename' to get all rows from tablename

Put command will insert data if it does not exist, or update the row if it does exist

Ex. put 'tablename', 'rowID', 'columnfamily:column', 'value'

Delete command removes data from a table

Ex. delete 'tablename', 'rowID', 'columnfamily:column', timestamp
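
The four commands can be mimicked with a small in-memory table to show how they relate to each other. This is a toy Python model under simple assumptions (one value per cell, no versions), not the HBase client API; ToyTable and its method names are made up.

```python
class ToyTable:
    """In-memory stand-in for a table: row key -> {"family:column": value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        """Insert the cell if it is absent, overwrite it if it is present."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Return a copy of one row's cells, or an empty dict if missing."""
        return dict(self.rows.get(row_key, {}))

    def scan(self):
        """Yield every row in sorted row-key order."""
        for row_key in sorted(self.rows):
            yield row_key, dict(self.rows[row_key])

    def delete(self, row_key, column):
        """Remove one cell; drop the row entirely once it has no cells left."""
        cells = self.rows.get(row_key, {})
        cells.pop(column, None)
        if not cells:
            self.rows.pop(row_key, None)


if __name__ == "__main__":
    employees = ToyTable()
    employees.put("2", "work:title", "Admiral")
    employees.put("1", "work:title", "Intern")
    employees.put("1", "work:title", "Engineer")  # same cell, so an update
    print(employees.get("1"))
    print(list(employees.scan()))
    employees.delete("2", "work:title")
    print(list(employees.scan()))
```

Put covers both the Create and Update parts of CRUD, Get and Scan cover Read, and Delete removes a cell.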