Apache HBase
How HBase Works
*This is a study guide created from lecture videos, used to help you gain an understanding of how HBase works.

HBase Foundations
Hadoop came out of work at Yahoo that implemented designs Google had published: a distributed file system (the basis of HDFS) and the MapReduce programming model. HDFS stands for Hadoop Distributed File System, and it spreads data across what are called nodes in its cluster. The data does not have a schema; it is just documents/files. HDFS is schemaless, distributed and fault tolerant.

MapReduce is focused on data processing, and jobs written for MapReduce are in Java. A MapReduce job finds the data, lists the tasks it needs to execute, and then executes them: map tasks process the data, and reduce tasks combine the results. A downside is that MapReduce is batch oriented, which means you have to read the entire file of data even if you only want a small portion. Batch processing is slow. Hadoop handles semistructured and unstructured data; there is no random access in Hadoop and no transaction support.

HBase is also called the Hadoop database and, unlike Hadoop/HDFS, it has a schema. An in-memory feature gives you the ability to read information quickly, and you can isolate the data you want to analyze. HBase supports random access. HBase allows for CRUD: Create a new record, Read the information into an application or process, Update to change a value, and Delete to remove data from the system.

NoSQL Databases
CAP theorem. Consistency: every read from the database receives the most recent data. Availability: every request receives a non-error response. Partition tolerance: the system continues to operate despite arbitrary messages being dropped on the network between the nodes in a database cluster. Each database platform can only guarantee 2 of the 3.
Relational databases are consistent and available. Cassandra and CouchDB are examples of databases that are available and partition tolerant. HBase and MongoDB are consistent and partition tolerant. HBase can support real-time applications like messaging or transactions because it is consistent. Visit InfoWorld to learn more about NoSQL databases.

NoSQL Database Structure
A key-value data structure is made up of two parts: the key, which identifies the element in the row, and the value, which is a single item or an array of items. Document stores help with reliability because if one database node in the cluster goes down, there are other copies of the data that can be used in its place.

Graph databases are made up of nodes, edges and properties. Neo4j is an example of a graph database.

Column families are how HBase stores its data. Every row has a key, and the columns are grouped together into what are called column families.

Relational Database Management Systems
RDBMS are made up of tables and views, rows and columns, primary keys and foreign keys, and data types, and they use SQL to develop and analyze the database. RDBMS are rigid: you must define a schema that specifies the columns, data types and relationships before using the database. This can slow application development because it is time-consuming.

RDBMS store data so as to eliminate redundancies, which prevents anomalies, i.e. data that doesn't match between copies of a record. This is the definition of normalization. It also makes the data slow to analyze, because the data has been spread out across many different tables. The solution was a centralized location to store the data, known as a data warehouse, which brings all the data together in a denormalized format.
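To make the column-family idea concrete, here is a minimal sketch in Python of how a row in a column-family store groups values: each row key maps to families, and each family maps to qualifier/value pairs. The names used ("emp1", "work", "demo") are made-up examples; this illustrates the layout only, not HBase's actual implementation.

```python
# Toy column-family layout: row key -> column family -> qualifier -> value.
# Row keys and family names here are hypothetical, not HBase defaults.
table = {}

def put(row_key, family, qualifier, value):
    """Store a value under a column family and qualifier for a row."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

put("emp1", "work", "title", "engineer")
put("emp1", "work", "dept", "data")
put("emp1", "demo", "city", "Austin")

# A row is just its populated families; nothing is stored for absent columns.
print(table["emp1"]["work"]["title"])  # -> engineer
```

Note that the row holds only the families and qualifiers that were actually written, which is exactly the grouping the column-family description above refers to.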
RDBMS do not scale well because the data cannot be distributed across multiple data centers or different regions. If you do not need the ability to scale or to support hundreds of users, an RDBMS is a good choice. NoSQL databases differ in that they do not have a rigid schema that has to define everything. NoSQL databases are schemaless, distributed and fault tolerant. Visit Codecademy for a further understanding of relational database management systems.

HBase Interfaces
The HBase Shell allows you to perform CRUD operations, or you can store CRUD commands in a file to run later. The HBase Shell uses JRuby-style references to tables, so you can store tables in variables and use JRuby to interact with data in HBase.

The Java API is best used for things more complex than the HBase Shell handles, and when your application is built in Java.

Apache Phoenix is an interface that enables OLTP and operational analytics on top of HBase. It offers SQL and JDBC interfaces and full ACID transactions. It can give you a regular database experience running on top of HBase and can work with other Apache Hadoop components like Spark, Hive, Pig, Flume and MapReduce.

Column Table Layout
Columnar tables allow you to scan a smaller dataset for faster results. The layout is optimized for storage, speed and flexibility in application development. Tables are sparsely populated: if you do not have a value for an attribute, nothing is stored in the database.

HBase Terminology
A table is a collection of rows; a row is identified by what is called a row key; and inside a row there are different column families. In HBase, namespaces are logical groupings of tables, and you can refer to them in a similar way to databases in a relational system. Versioning occurs at the individual value level; HBase has built-in versioning, which allows you to look back in time. How is data stored in HBase?
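The sparsely-populated point can be illustrated with a small comparison: a relational table reserves a slot (often NULL) for every column in every row, while a column-family store keeps only the cells that actually have values. A rough Python sketch with made-up contact data:

```python
# Relational-style rows: every row carries every column, absent values as None.
dense_rows = [
    {"name": "ann", "phone": "555-0100", "fax": None, "pager": None, "twitter": None},
    {"name": "bob", "phone": None, "fax": None, "pager": None, "twitter": "@bob"},
]

# HBase-style rows: only populated cells exist at all.
sparse_rows = [
    {"name": "ann", "phone": "555-0100"},
    {"name": "bob", "twitter": "@bob"},
]

dense_cells = sum(len(row) for row in dense_rows)    # 10 slots reserved
sparse_cells = sum(len(row) for row in sparse_rows)  # 4 cells stored
print(dense_cells, sparse_cells)  # -> 10 4
```

The gap widens quickly with wide tables where most attributes are empty, which is why sparse storage matters for the columnar layout described above.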
HBase Namespaces
Run these commands from the shell:
create_namespace 'people' creates a namespace called people.
create 'people:employees', 'work', 'demo' creates an employees table in the people namespace.
drop_namespace 'people' removes the namespace from HBase.
alter_namespace 'people', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'} changes a namespace property.
There is a default namespace that is used if you do not create one within HBase.

Data Model Operations in HBase
The get command retrieves a specific row from an HBase table. Ex. get 'tablename', '1' to get row 1 from a table called tablename.
The scan command retrieves data from the entire table. Ex. scan 'tablename' to get all rows from tablename.
The put command inserts data if it does not exist, or updates the row if it does. Ex. put 'tablename', 'rowID', 'columnfamily:column', 'value'.
The delete command removes data within a table. Ex. delete '<table name>', '<row>', '<column name>', '<time stamp>'.

Versioning in HBase
Versioning is used for historical tracking of the changes made to your data. Each value within a column has a version associated with it. Timestamps in HBase are Unix timestamps: the difference in milliseconds between the current time and midnight of January 1, 1970 (UTC).

HBase Architecture
The HBase architecture is split between the Master and the Slaves. The Master holds the coordinating functions, and the Slaves respond to requests from the Master parts of the system. The NameNode is part of the Master side and maintains information about where the data lives within the HBase cluster. ZooKeeper (which is not specific to HBase) sits on the Master side and coordinates the other parts of the system. ZooKeeper manages the server state of the cluster and keeps it all connected; it is a coordinator that talks with the HMaster component.
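The versioning behavior above can be sketched with a toy cell store: each put records a (timestamp, value) pair, a get returns the newest version, and the default timestamp is Unix milliseconds (milliseconds since midnight, January 1, 1970 UTC). This is a minimal Python illustration of the idea, not HBase's API; the row and column names are hypothetical.

```python
import time

# (row key, "family:qualifier") -> list of (timestamp_ms, value) pairs.
cell_versions = {}

def put(row, column, value, ts=None):
    """Write a new version of a cell, stamped with Unix milliseconds."""
    ts = ts if ts is not None else int(time.time() * 1000)
    cell_versions.setdefault((row, column), []).append((ts, value))

def get(row, column):
    """Read the most recent version of a cell (older versions remain)."""
    versions = cell_versions.get((row, column), [])
    return max(versions)[1] if versions else None

put("row1", "work:title", "analyst", ts=1000)
put("row1", "work:title", "engineer", ts=2000)  # newer version shadows the old
print(get("row1", "work:title"))  # -> engineer
```

Both versions are still stored, which is what lets HBase "look back in time" at earlier values of the same cell.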
The HMaster handles all of the region servers; it can reassign regions and load-balance the cluster. Unless you are using Apache Phoenix, it is likely you are going through the HMaster when you interact with an HBase interface. The Slave side holds the region servers, which is where the regions live, and the HDFS DataNodes. All data in HBase is stored on HDFS DataNodes, so ensure the HDFS DataNodes are colocated with the region servers.

Regions and Region Servers
Tables are split into regions, regions live inside region servers, and tables are scaled horizontally by splitting on key ranges. Every row in HBase has a unique row key; all the other values are in columns contained within column families.

A region server can host about 1,000 regions. Region servers are the administrators of their regions: they can split regions or issue compactions, which combine data into as few files as needed to save time when retrieving data.

Regions are the basic element of organization in HBase. You want 20 to 200 large regions rather than many small ones, because every region requires memory, and a lot of small regions adds overhead just for keeping track of where the data is. One region holds roughly 5 to 20 GB. Regions are used to load-balance data across different region servers after a major compaction. Ensure you have a good region design from the start.

Coprocessors
Coprocessors let you make changes and update data as events occur, rather than having to pull the data across different nodes in a cluster. You can place custom code on a region server. Coprocessors are an efficient way of processing data in an HBase cluster. Observer coprocessors watch for specific changes to the system and execute code when the change occurs. Endpoint coprocessors are tied to an endpoint you invoke explicitly, rather than to specified events.
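Key-range partitioning across regions can be sketched as follows: regions are bounded by sorted split keys, and a row key belongs to the region whose range contains it. A simplified Python illustration with hypothetical split points (real HBase creates and moves splits automatically):

```python
import bisect

# Hypothetical split points dividing the row-key space into three regions:
# region 0: keys before "g"; region 1: "g" up to (not including) "p"; region 2: "p" onward.
split_keys = ["g", "p"]

def region_for(row_key):
    """Return the index of the region whose key range holds row_key."""
    return bisect.bisect_right(split_keys, row_key)

print(region_for("apple"))  # -> 0
print(region_for("grape"))  # -> 1
print(region_for("zebra"))  # -> 2
```

Because lookup is just a binary search over sorted split keys, a client can route any row key straight to the region server that owns it, which is what makes HBase's random access fast.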