A Case Study on Apache HBase


A Master's Project Presented to the Department of Computer and Information Sciences, SUNY Polytechnic Institute, Utica, New York, in Partial Fulfillment of the Requirements for the Master of Science Degree. By Rohit Reddy Nalla, May 2015. © Rohit Reddy Nalla 2015

Abstract

Introduction: Apache HBase is an open-source, non-relational, distributed database system built on top of HDFS (the Hadoop Distributed File System). HBase was designed after Google's Bigtable and is written in Java; it was developed as part of Apache's Hadoop project. It provides a fault-tolerant way of storing small amounts of non-empty data within large amounts of empty data, and it is used when we require real-time read/write access to huge databases. The HBase project was started at the end of 2006 by Chad Walters and Jim Kellerman at Powerset [2]. The main purpose of HBase is to process large amounts of data. Mike Cafarella initially worked on the code of the working system, and Jim Kellerman later carried it to the next stage. HBase was first released as part of Hadoop 0.15.0 in October 2007 [2]. The project's goal was the hosting of very large tables, on the order of billions of rows by millions of columns. In May 2010, HBase advanced to become an Apache Top-Level Project. Several companies, such as Adobe, Twitter, Yahoo and Trend Micro, use this database, and social networking sites like Facebook have implemented their messenger applications using HBase. This document explains how HBase works and how it differs from other databases. It highlights the current challenges in data security, and a couple of models are proposed for security and levels of data access to overcome those challenges. It also discusses workload challenges and techniques to overcome them, and it gives an overview of how HBase has been implemented in a real-time application, the Facebook Messenger app.

Acknowledgment

I, Rohit Reddy Nalla, want to express my gratitude to Dr. Sam Sengupta, the advisor of this project, who has been a constant inspiration and has offered guidance in every possible way one could imagine. I would also like to thank my family and friends for their encouragement and support.

Table of Contents

1. Introduction to Database Systems
   1.1 Introduction
2. Introduction to HBase
   2.1 Background
   2.2 Data Model
   2.3 HBase Implementation and Operation
   2.4 Running HBase
       2.4.1 Understanding of basic queries for table creation and updates
       2.4.2 Connecting HBase with Java
3. Current Data Vulnerability and Workload Challenges
   3.1 Current Issues in Data Security
   3.2 Security Model Based on Kerberos Authentication and Access Control Lists
       3.2.1 Introduction
       3.2.2 Kerberos Protocol
   3.3 HBase Cell-Level Security Model
       3.3.1 Cell Tags
       3.3.2 Cell ACLs
       3.3.3 Visibility Labels (Cell Labels)
       3.3.4 HFile Version 3
   3.4 Running Multiple Workloads on a Single HBase Cluster
       3.4.1 Throttle
       3.4.2 Decreasing the Priority of Long-Running Requests
   3.5 Data Backup and Disaster Recovery
       3.5.1 Snapshots
       3.5.2 HBase Replication
       3.5.3 HBase Export
       3.5.4 Copy Table
4. HBase in Real Time
   4.1 Background
   4.2 Workload Challenges
   4.3 Implementation of HDFS
       4.3.1 Name Node (Master) and Data Nodes
       4.3.2 Avatar Node
       4.3.3 HDFS Transaction Logging
       4.3.4 Distributed Avatar File System
       4.3.5 Enhancements for Real-Time Workload
   4.4 Implementation of HBase at Facebook
       4.4.1 ACID Properties
       4.4.2 Availability Enhancements
       4.4.3 Performance Improvements
   4.5 Production Deployment and Challenges
       4.5.1 Testing
       4.5.2 Region Splitting (Manual vs. Automatic)
       4.5.3 Schema Changes
       4.5.4 Data Imports
       4.5.5 Network IO Traffic
   4.6 Future Implementations at Facebook
5. HBase vs. RDBMS
   5.1 Background
   5.2 Analysis
6. Future Work
References

Chapter 1: Introduction to Database Systems

1.1 Introduction

A database is a place where data is stored in an organized way. The stored data is used across various parts of an application, which helps in processing requests efficiently. In recent times, the internet has brought revolutionary changes across the world, and many online websites and applications have been developed to give users more benefit. While developing an application (or website), maintaining a database has become critically important, and maintaining customer data and catalogues has become a challenging task. That is where database management systems (DBMS) came into the picture. A DBMS is a software application that implements the CRUD operations, i.e., create, read, update and delete, on the data in tables. It interacts with end users, third-party applications, internal applications within the system, and the database itself to capture and process data efficiently. MySQL, PostgreSQL and Oracle are some well-known databases. DBMSs are generally classified by the database models they support.
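To make the CRUD idea concrete, here is a minimal sketch of the four operations expressed through JDBC, the standard Java database API. The connection URL, table name, and column names are illustrative assumptions, not part of this project; a matching driver (here MySQL's) would need to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class CrudExample {
        public static void main(String[] args) throws SQLException {
            // Hypothetical URL and credentials for a local database.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/appdb", "appuser", "secret")) {

                // CREATE: insert a new row.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO users (user_id, email) VALUES (?, ?)")) {
                    ps.setString(1, "user001");
                    ps.setString(2, "rohit@example.com");
                    ps.executeUpdate();
                }

                // READ: look the row up again by its key.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT email FROM users WHERE user_id = ?")) {
                    ps.setString(1, "user001");
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) System.out.println(rs.getString("email"));
                    }
                }

                // UPDATE: change a column value in place.
                try (PreparedStatement ps = conn.prepareStatement(
                        "UPDATE users SET email = ? WHERE user_id = ?")) {
                    ps.setString(1, "rohit.new@example.com");
                    ps.setString(2, "user001");
                    ps.executeUpdate();
                }

                // DELETE: remove the row.
                try (PreparedStatement ps = conn.prepareStatement(
                        "DELETE FROM users WHERE user_id = ?")) {
                    ps.setString(1, "user001");
                    ps.executeUpdate();
                }
            }
        }
    }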
Most of the well-known databases have supported the relational model, represented by SQL, since the 1980s. Initially, in the 1960s, general-purpose DBMSs typically served the needs of many applications and were cost-effective too. However, they sometimes introduced unnecessary overhead, and there was later a realization that they were not always a favorable choice. Special-purpose DBMSs then came into the picture to serve specific purposes; for example, to make use of email systems in the most effective way, special-purpose DBMSs were used to handle them. As time passed there were many advancements in technology, and with improvements in computer storage, networks and memory, the performance expectations of the DBMS started growing.

In the mid-1960s the "Data Base Task Group" came up with a navigational approach (the CODASYL approach), in which records in the database were found by navigating relationships from one set of records to another and were searched in sequential order. In further validation, however, this proved to be a difficult procedure that required rigorous effort to serve applications. There was a growing concern about handling large data banks: an efficient service should be given to users in such a way that their activities are not affected by any change in the internal or external representations of their data. In this context, in the 1970s, Edgar Codd wrote numerous papers that outlined a new approach to database construction, which found its breakthrough in "A Relational Model of Data for Large Shared Data Banks" [5]. He described a new approach for maintaining large databases and for storing data in them. Unlike the navigational approach, where records are stored in linked lists, he proposed storing records in tables of fixed length, where each table is considered one unit. For example, in a table with user account information such as user ID, password, contact number and date of birth, the user ID should be unique, so all the other fields are stored with the user ID as the reference key, or primary key. Whenever we want to retrieve this information, we can search for it with the help of the primary key, and information stored in the related tables can be found with the help of this key as well. With the help of tuple calculus (a branch of mathematics), he developed a system that could carry out the operations of a regular database, the CRUD operations, and could also sort datasets in one single operation.

Based on Codd's theory, IBM started working on a system named System R. Their idea was to develop multi-table systems, in which the data stored for a record could be split up instead of storing the whole record in one large place. They implemented Codd's concepts successfully and turned System R into a product known as SQL/DS, which later became Database 2 (DB2). In the late 1970s, building on IBM's research, Larry Ellison also came up with the Oracle database. These relational database management systems (RDBMS) became popular and were used for small- to medium-scale applications. The SQL language is used for writing queries in order to implement the basic CRUD operations. RDBMSs such as MySQL and PostgreSQL have powerful features and query languages: secondary indexes can be easily created, and they perform inner and outer joins, counts, and sorting across the rows and columns of database tables.
However, if we need to scale up to large amounts of data with random read/write access, we run into difficulty performing these actions with an RDBMS: ACID properties and Codd's 12 rules make distribution of the data complex. To overcome these drawbacks came the next generation of databases, known as NoSQL databases. Many of these are document-oriented databases that query on data attributes within the document (for example, an XML document).
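As a contrast to the JDBC sketch above, here is a minimal sketch of random read/write access through the HBase Java client (the Connection/Table API that HBase has shipped since its 1.0 releases, and which Chapter 2.4.2 works with). The table name "users" and the column family "info" are illustrative assumptions; connection details come from hbase-site.xml on the classpath.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Random write: one row keyed by user ID, one cell in family "info".
                Put put = new Put(Bytes.toBytes("user001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                              Bytes.toBytes("rohit@example.com"));
                table.put(put);

                // Random read: fetch the same row back by its key.
                Result result = table.get(new Get(Bytes.toBytes("user001")));
                byte[] email = result.getValue(Bytes.toBytes("info"),
                                               Bytes.toBytes("email"));
                System.out.println(Bytes.toString(email));
            }
        }
    }

Note the design difference this exposes: rows are addressed by key only, and the secondary indexes and joins of the relational world are absent by design.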
Recommended publications
  • Introduction to HBase Schema Design
    Introduction to HBase Schema Design. Amandeep Khurana.

    Amandeep Khurana is a Solutions Architect at Cloudera and works on building solutions using the Hadoop stack. He is also a co-author of HBase in Action. Prior to Cloudera, Amandeep worked at Amazon Web Services, where he was part of the Elastic MapReduce team and built the initial versions of their hosted HBase product. [email protected]

    The number of applications that are being developed to work with large amounts of data has been growing rapidly in the recent past. To support this new breed of applications, as well as scaling up old applications, several new data management systems have been developed. Some call this the big data revolution. A lot of these new systems that are being developed are open source and community driven, deployed at several large companies. Apache HBase [2] is one such system. It is an open source distributed database, modeled around Google Bigtable [5], and is becoming an increasingly popular database choice for applications that need fast random access to large amounts of data. It is built atop Apache Hadoop [1] and is tightly integrated with it. HBase is very different from traditional relational databases like MySQL, PostgreSQL, Oracle, etc. in how it's architected and the features that it provides to the applications using it. HBase trades off some of these features for scalability and a flexible schema. This also translates into HBase having a very different data model. Designing HBase tables is a different ballgame as compared to relational database systems. I will introduce you to the basics of HBase table design by explaining the data model and build on that by going into the various concepts at play in designing HBase tables through an example.
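    As a sketch of what such table design looks like in code, the snippet below creates a table with two column families through the HBase 1.x Admin API. It is a minimal illustration, not taken from the article: the table name "webtable", the family names, and the version setting are assumptions. In HBase, schema design is mostly the choice of row key and column families; individual columns are created on write.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.HColumnDescriptor;
        import org.apache.hadoop.hbase.HTableDescriptor;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Admin;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;

        public class CreateTableExample {
            public static void main(String[] args) throws IOException {
                Configuration conf = HBaseConfiguration.create();
                try (Connection conn = ConnectionFactory.createConnection(conf);
                     Admin admin = conn.getAdmin()) {
                    // Declare the table and its column families up front.
                    HTableDescriptor desc =
                        new HTableDescriptor(TableName.valueOf("webtable"));
                    HColumnDescriptor contents = new HColumnDescriptor("contents");
                    contents.setMaxVersions(3); // keep three versions of each cell
                    desc.addFamily(contents);
                    desc.addFamily(new HColumnDescriptor("anchor"));
                    admin.createTable(desc);
                }
            }
        }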
  • The Network Data Storage Model Based on the HTML
    Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013). The Network Data Storage Model Based on the HTML. Lei Li (Network Information Center), Jianping Zhou (Network Information Center), Runhua Wang (Teaching Quality and Assessment Office), Yi Tang (School of Electronic and Information Engineering), Chongqing University of Science and Technology, Chongqing, China. [email protected], [email protected], [email protected], [email protected]

    Abstract: In this paper, massive HTML data refers to HTML data sets large enough that they can no longer be handled with traditional methods of processing. In the past, web search engine creators were the first to bear the brunt of this problem. Today, social networks, mobile applications, and sensors across scientific fields create petabytes of HTML data daily. To deal with the challenge of processing massive HTML data, Google created MapReduce, and Google and Yahoo created Hadoop, hatching an entire ecosystem of tools for massive HTML data processing. This article describes the SMAQ stack model and the tools that may today be included in that model for massive HTML data processing.

    II. MAPREDUCE: MapReduce was created by Google for building its web page index. The MapReduce framework has become the workhorse of most massive HTML data processing today. The key idea of MapReduce is to divide a query over a collection of HTML data and then execute it in parallel on multiple nodes.
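    The query-division idea is easiest to see in code. Below is a minimal word-count job against Hadoop's Java MapReduce API: the map step runs in parallel over splits of the input, and the reduce step aggregates per key. This is a generic sketch of the framework, not code from the paper; the input and output paths are supplied on the command line.

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {
            public static class TokenMapper
                    extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();
                @Override
                protected void map(LongWritable key, Text value, Context ctx)
                        throws IOException, InterruptedException {
                    // Each mapper sees one split of the input, in parallel.
                    StringTokenizer it = new StringTokenizer(value.toString());
                    while (it.hasMoreTokens()) {
                        word.set(it.nextToken());
                        ctx.write(word, ONE);
                    }
                }
            }

            public static class SumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) sum += v.get();
                    ctx.write(key, new IntWritable(sum)); // aggregate per word
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenMapper.class);
                job.setReducerClass(SumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }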
  • Projects – Other Than Hadoop! Created by Samarjit Mahapatra [email protected]
    Projects – other than Hadoop! Created by Samarjit Mahapatra [email protected]

    Mostly compatible with Hadoop/HDFS:
    • Apache Drill - provides low latency ad-hoc queries to many different data sources, including nested data. Inspired by Google's Dremel, Drill is designed to scale to 10,000 servers and query petabytes of data in seconds.
    • Apache Hama - a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS for massive scientific computations such as matrix, graph and network algorithms.
    • Akka - a toolkit and runtime for building highly concurrent, distributed, and fault tolerant event-driven applications on the JVM.
    • ML-Hadoop - Hadoop implementation of machine learning algorithms.
    • Shark - a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones.
    • Apache Crunch - a Java library that provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
    • Azkaban - a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.
    • Apache Mesos - a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.
  • Apache HBase, the Scaling Machine. Jean-Daniel Cryans, Software Engineer at Cloudera, @jdcryans
    Apache HBase, the Scaling Machine. Jean-Daniel Cryans, Software Engineer at Cloudera, @jdcryans. (Slides dated Tuesday, June 18, 2013.)

    Agenda: Introduction to Apache HBase; HBase at StumbleUpon; overview of other use cases.

    About le Moi: At Cloudera since October 2012. At StumbleUpon for 3 years before that. Committer and PMC member for Apache HBase since 2008. Living in San Francisco. From Québec, Canada.

    What is Apache HBase: Apache HBase is an open source, distributed, scalable, consistent, low latency, random access, non-relational database built on Apache Hadoop.

    Inspiration: Google BigTable (2006). Goal: low latency, consistent, random read/write access to massive amounts of structured data. It was the data store for Google's crawler web table, Gmail, Analytics, Earth, Blogger, and more.

    HBase is in Production: Inbox, Storage, Web, Search, Analytics, Monitoring.

    HBase is Open Source: Apache 2.0 License. A community project with committers and contributors from diverse organizations: Facebook, Cloudera, Salesforce.com, Huawei, eBay, HortonWorks, Intel, Twitter, and others. The code license means anyone can modify and use the code.

    So why use HBase? Old School Scaling: find a scaling problem.
  • Final Report CS 5604: Information Storage and Retrieval
    Final Report. CS 5604: Information Storage and Retrieval. Solr Team: Abhinav Kumar, Anand Bangad, Jeff Robertson, Mohit Garg, Shreyas Ramesh, Siyu Mi, Xinyue Wang, Yu Wang. January 16, 2018. Instructed by Professor Edward A. Fox. Virginia Polytechnic Institute and State University, Blacksburg, VA 24061.

    Abstract: The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets and millions of webpages for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects [6]. We are using a 21-node Cloudera Hadoop cluster to store and retrieve this information. One goal of this project is to expand the data collection to include more web archives and geospatial data beyond what previously had been collected. Another important part of this project is optimizing the current system to analyze and allow access to the new data. To accomplish these goals, this project is separated into 6 parts with corresponding teams: Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Front-end (FE), and SOLR. This report describes the work completed by the SOLR team, which improves the current searching and storage system. We include the general architecture and an overview of the current system. We present the part that Solr plays within the whole system in more detail. We talk about our goals, procedures, and conclusions on the improvements we made to the current Solr system. This report also describes how we coordinate with other teams to accomplish the project at a higher level. Additionally, we provide manuals for future readers who might need to replicate our experiments.
  • Apache HBase: the Hadoop Database. Yuanru Qian, Andrew Sharp, Jiuling Wang
    Apache HBase: the Hadoop Database. Yuanru Qian, Andrew Sharp, Jiuling Wang.

    Agenda: Motivation; Data Model; The HBase Distributed System; Data Operations; Access APIs; Architecture.

    Motivation: Hadoop is a framework that supports operations on a large amount of data. Hadoop includes the Hadoop Distributed File System (HDFS). HDFS does a good job of storing large amounts of data, but lacks quick random read/write capability. That's where Apache HBase comes in.

    Introduction: HBase is an open source, sparse, consistent, distributed, sorted map modeled after Google's BigTable. It began as a project by Powerset to process massive amounts of data for natural language processing. It was developed as part of Apache's Hadoop project and runs on top of the Hadoop Distributed File System.

    Big Picture: MapReduce, Hive/Pig, a Thrift/REST gateway, your Java application, and the Java client all sit above ZooKeeper and HBase, which in turn run on HDFS.

    An Example Operation. The Job: a MapReduce job needs to operate on a series of webpages matching *.cnn.com. The Table:

        row key         | column 1 | column 2
        "com.cnn.world" | 13       | 4.5
        "com.cnn.tech"  | 46       | 7.8
        "com.cnn.money" | 44       | 1.2

    The HBase Data Model. Groups of tables: where an RDBMS has a database containing tables, Apache HBase has a namespace containing tables. Single table in an RDBMS: columns col1..col4 against rows row1..row3. Single table in Apache HBase: columns are grouped into column families, e.g. fam1:col1, fam1:col2, fam2:col1, fam2:col2.

    Sparse example:

        Row Key       | fam1:contents               | fam1:anchor
        "com.cnn.www" | contents:html = "<html>..." |
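    The slides' example job can also be expressed as a plain client-side scan: because HBase keeps rows sorted by key, the reversed-domain row keys make all cnn.com pages one contiguous key range. The sketch below is an illustration under that assumption, using the HBase 1.x client API and the slides' "webtable" naming; the stop-row trick relies on '/' being the ASCII character immediately after '.'.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.util.Bytes;

        public class CnnScan {
            public static void main(String[] args) throws IOException {
                Configuration conf = HBaseConfiguration.create();
                try (Connection conn = ConnectionFactory.createConnection(conf);
                     Table table = conn.getTable(TableName.valueOf("webtable"))) {
                    // Rows sorted by key: all "com.cnn.*" rows lie in the
                    // half-open range ["com.cnn.", "com.cnn/").
                    Scan scan = new Scan();
                    scan.setStartRow(Bytes.toBytes("com.cnn."));
                    scan.setStopRow(Bytes.toBytes("com.cnn/"));
                    try (ResultScanner scanner = table.getScanner(scan)) {
                        for (Result row : scanner) {
                            System.out.println(Bytes.toString(row.getRow()));
                        }
                    }
                }
            }
        }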
  • Database Software Market: The Long-Awaited Shake-up
    Equity Research | Technology, Media, & Communications | Enterprise and Cloud Infrastructure. March 22, 2019. Industry Report.

    Database Software Market: The Long-Awaited Shake-up.

    Jason Ader, +1 617 235 7519, [email protected]. Billy Fitzsimmons, +1 312 364 5112, [email protected]. Naji, +1 212 245 6508, [email protected].

    Please refer to important disclosures on pages 70 and 71. Analyst certification is on page 70. William Blair or an affiliate does and seeks to do business with companies covered in its research reports. As a result, investors should be aware that the firm may have a conflict of interest that could affect the objectivity of this report. This report is not intended to provide personal investment advice. The opinions and recommendations herein do not take into account individual client circumstances, objectives, or needs and are not intended as recommendations of particular securities, financial instruments, or strategies to particular clients. The recipient of this report must make its own independent decisions regarding any securities or financial instruments mentioned herein. William Blair.

    Contents: Key Findings; Introduction; Database Market History; Market Definitions.
  • HDP 3.1.4 Release Notes Date of Publish: 2019-08-26
    HDP 3.1.4 Release Notes. Date of Publish: 2019-08-26. https://docs.hortonworks.com

    Contents: HDP 3.1.4 Release Notes; Component Versions; Descriptions of New Features; Deprecation Notices (Terminology; Removed Components and Product Capabilities); Testing Unsupported Features (Descriptions of the Latest Technical Preview Features); Upgrading to HDP 3.1.4; Behavioral Changes; Apache Patch Information (Accumulo, ...).
  • Apache Hadoop Goes Realtime at Facebook
    Apache Hadoop Goes Realtime at Facebook. Dhruba Borthakur, Joydeep Sen Sarma, Jonathan Gray, Kannan Muthukkaruppan, Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, Dmytro Molkov, Aravind Menon, Samuel Rash, Rodrigo Schmidt, Amitanand Aiyer. Facebook. {dhruba,jssarma,jgray,kannan,nicolas,hairong,kranganathan,dms,aravind.menon,rash,rodrigo,amitanand.s}@fb.com

    ABSTRACT: Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems, such as Apache Cassandra and Voldemort, and discusses the application's requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies.

    1. INTRODUCTION: Apache Hadoop [1] is a top-level Apache project that includes open source implementations of a distributed file system [2] and MapReduce that were inspired by Google's GFS [5] and MapReduce [6] projects. The Hadoop ecosystem also includes projects like Apache HBase [4], which is inspired by Google's BigTable, Apache Hive [3], a data warehouse built on top of Hadoop, and Apache ZooKeeper [8], a coordination service for distributed systems. At Facebook, Hadoop has traditionally been used in conjunction with Hive for storage and analysis of large data sets. Most of this analysis occurs in offline batch jobs and the emphasis has been on
  • Big Data: A Survey. The New Paradigms, Methodologies and Tools
    Big Data: A Survey. The New Paradigms, Methodologies and Tools. Enrico Giacinto Caldarola1,2 and Antonio Maria Rinaldi1,3. 1Department of Electrical Engineering and Information Technologies, University of Naples Federico II, Napoli, Italy. 2Institute of Industrial Technologies and Automation, National Research Council, Bari, Italy. 3IKNOS-LAB Intelligent and Knowledge Systems, University of Naples Federico II, LUPT 80134, via Toledo, 402-Napoli, Italy. Keywords: Big Data, NoSQL, NewSQL, Big Data Analytics, Data Management, Map-reduce.

    Abstract: For several years we have been living in the era of information. Since any human activity is carried out by means of information technologies and tends to be digitized, it produces a humongous stack of data that becomes more and more attractive to different stakeholders such as data scientists, entrepreneurs or just private individuals. All of them are interested in the possibility of gaining a deep understanding about people and things by accurately and wisely analyzing the gold mine of data they produce. The reason for such interest derives from the competitive advantage and the increase in revenues expected from this deep understanding. In order to help analysts reveal the insights hidden behind data, new paradigms, methodologies and tools have emerged in recent years. There has been such an explosion of technological solutions that a review of the current state of the art in the Big Data technologies scenario is needed. Thus, after a characterization of the new paradigm under study, this work aims at surveying the most widespread technologies under the Big Data umbrella, through a qualitative analysis of their characterizing features.
  • HBase: The Definitive Guide
    HBase: The Definitive Guide. Lars George. Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo.

    HBase: The Definitive Guide, by Lars George. Copyright © 2011 Lars George. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or [email protected].

    Editors: Mike Loukides and Julie Steele. Production Editor: Jasmine Perez. Copyeditor: Audrey Doyle. Proofreader: Jasmine Perez. Indexer: Angela Howard. Cover Designer: Karen Montgomery. Interior Designer: David Futato. Illustrator: Robert Romano.

    Printing History: September 2011: First Edition. Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. HBase: The Definitive Guide, the image of a Clydesdale horse, and related trade dress are trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

    ISBN: 978-1-449-39610-7. For my wife Katja, my daughter Laura, and son Leon.
  • TR-4744: Secure Hadoop Using Apache Ranger with NetApp In-Place Analytics Module
    Technical Report. Secure Hadoop using Apache Ranger with NetApp In-Place Analytics Module: Deployment Guide. Karthikeyan Nagalingam, NetApp. February 2019 | TR-4744.

    Abstract: This document introduces the NetApp® In-Place Analytics Module for Apache Hadoop and Spark with Ranger. The topics covered in this report include the Ranger configuration, underlying architecture, integration with Hadoop, and benefits of Ranger with NetApp In-Place Analytics Module using Hadoop with NetApp ONTAP® data management software.

    Table of Contents: 1 Introduction (1.1 Overview; 1.2 Deployment Options; 1.3 NetApp In-Place Analytics Module 3.0.1 Features); 2 Ranger (2.1 Components Validated with Ranger); 3 NetApp In-Place Analytics Module Design with Ranger.