Hortonworks Data Platform Data Access (December 15, 2017)

Total Page:16

File Type:pdf, Size:1020Kb

Hortonworks Data Platform Data Access (December 15, 2017) Hortonworks Data Platform Data Access (December 15, 2017) docs.cloudera.com Hortonworks Data Platform December 15, 2017 Hortonworks Data Platform: Data Access Copyright © 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects including YARN, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process and installation and configuration tools have also been included. Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain, free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs. Except where otherwise noted, this document is licensed under Creative Commons Attribution ShareAlike 4.0 License. http://creativecommons.org/licenses/by-sa/4.0/legalcode ii Hortonworks Data Platform December 15, 2017 Table of Contents 1. What's New in Data Access for HDP 2.6 ....................................................................... 1 1.1. What's New in Apache Hive .............................................................................. 1 1.2. What's New in Apache Tez ............................................................................... 2 1.3. What's New in Apache HBase ........................................................................... 2 1.4. What's New in Apache Phoenix ......................................................................... 2 1.5. Druid ................................................................................................................ 3 2. Data Warehousing with Apache Hive ........................................................................... 4 2.1. Content Roadmap ............................................................................................. 4 2.2. Features Overview ............................................................................................ 6 2.2.1. Temporary Tables ................................................................................... 6 2.2.2. Optimized Row Columnar (ORC) Format ................................................ 7 2.2.3. SQL Optimization ................................................................................... 7 2.2.4. SQL Compliance and ACID-Based Transactions ...................................... 10 2.2.5. Streaming Data Ingestion ..................................................................... 25 2.2.6. Query Vectorization ............................................................................. 25 2.2.7. Beeline versus Hive CLI ......................................................................... 27 2.2.8. Hive JDBC and ODBC Drivers ................................................................ 29 2.3. Moving Data into Apache Hive ....................................................................... 32 2.3.1. Using an External Table ........................................................................ 33 2.3.2. Using Sqoop ......................................................................................... 35 2.3.3. Incrementally Updating a Table ............................................................ 38 2.4. Queries on Data Stored in Remote Clusters ..................................................... 42 2.4.1. Query Capability on Remote Clusters .................................................... 43 2.5. Configuring HiveServer2 .................................................................................. 44 2.5.1. Configuring HiveServer2 for Transactions (ACID Support) ...................... 45 2.5.2. Configuring HiveServer2 for LDAP and for LDAP over SSL ...................... 46 2.6. Securing Apache Hive ...................................................................................... 50 2.6.1. Authorization Using Apache Ranger Policies ......................................... 52 2.6.2. SQL Standard-Based Authorization ....................................................... 53 2.6.3. Required Privileges for Hive Operations ................................................ 55 2.6.4. Storage-Based Authorization ................................................................ 57 2.6.5. Configuring Storage-Based Authorization ............................................. 58 2.6.6. Permissions for Apache Hive Operations ............................................... 59 2.6.7. Row-Level Filtering and Column Masking .............................................. 60 2.7. Troubleshooting .............................................................................................. 60 2.7.1. JIRAs .................................................................................................... 63 3. Enabling Efficient Execution with Apache Pig and Apache Tez .................................... 64 4. Accessing Apache Hive with HCatalog and WebHCat ................................................. 66 4.1. HCatalog ......................................................................................................... 66 4.1.1. HCatalog Community Information ........................................................ 66 4.2. WebHCat ........................................................................................................ 67 4.2.1. WebHCat Community Information ....................................................... 67 4.2.2. Security for WebHCat ........................................................................... 68 5. Persistent Read/Write Data Access with Apache HBase .............................................. 69 5.1. Content Roadmap ........................................................................................... 69 5.2. Deploying Apache HBase ................................................................................ 71 5.2.1. Installation and Setup .......................................................................... 72 5.2.2. Cluster Capacity and Region Sizing ....................................................... 73 iii Hortonworks Data Platform December 15, 2017 5.2.3. Enabling Multitenancy with Namepaces ............................................... 77 5.2.4. Security Features Available in Technical Preview .................................... 79 5.3. Managing Apache HBase Clusters ................................................................... 79 5.3.1. Monitoring Apache HBase Clusters ....................................................... 79 5.3.2. Optimizing Apache HBase I/O .............................................................. 79 5.3.3. Importing Data into HBase with Bulk Load ........................................... 88 5.3.4. Using Snapshots ................................................................................... 89 5.4. Backing up and Restoring Apache HBase Datasets ........................................... 91 5.4.1. Planning a Backup-and-Restore Strategy for Your Environment ............. 91 5.4.2. Best Practices for Backup-and-Restore ................................................... 93 5.4.3. Running the Backup-and-Restore Utility ................................................ 94 5.5. Medium Object (MOB) Storage Support in Apache HBase .............................. 102 5.5.1. Enabling MOB Storage Support .......................................................... 102 5.5.2. Testing the MOB Storage Support Configuration ................................ 103 5.5.3. Tuning MOB Storage Cache Properties ............................................... 103 5.6. HBase Quota Management ........................................................................... 104 5.6.1. Setting Up Quotas .............................................................................. 105 5.6.2. Throttle Quotas .................................................................................. 106 5.6.3. Space Quotas ..................................................................................... 107 5.6.4. Quota Enforcement ............................................................................ 108 5.6.5. Quota Violation Policies ...................................................................... 108 5.6.6. Impact of Quota Violation Policy ........................................................ 109 5.6.7. Number-of-Tables Quotas ................................................................... 111 5.6.8. Number-of-Regions Quotas ................................................................. 112 5.7. HBase Best Practices ...................................................................................... 112 6. Orchestrating SQL and APIs with Apache Phoenix ...................................................
Recommended publications
  • Combined Documents V2
    Outline: Combining Brainstorming Deliverables Table of Contents 1. Introduction and Definition 2. Reference Architecture and Taxonomy 3. Requirements, Gap Analysis, and Suggested Best Practices 4. Future Directions and Roadmap 5. Security and Privacy - 10 Top Challenges 6. Conclusions and General Advice Appendix A. Terminology Glossary Appendix B. Solutions Glossary Appendix C. Use Case Examples Appendix D. Actors and Roles 1. Introduction and Definition The purpose of this outline is to illustrate how some initial brainstorming documents might be pulled together into an integrated deliverable. The outline will follow the diagram below. Section 1 introduces a definition of Big Data. An extended terminology Glossary is found in Appendix A. In section 2, a Reference Architecture diagram is presented followed by a taxonomy describing and extending the elements of the Reference Architecture. Section 3 maps requirements from use case building blocks to the Reference Architecture. A description of the requirement, a gap analysis, and suggested best practice is included with each mapping. In Section 4 future improvements in Big Data technology are mapped to the Reference Architecture. An initial Technology Roadmap is created on the requirements and gap analysis in Section 3 and the expected future improvements from Section 4. Section 5 is a placeholder for an extended discussion of Security and Privacy. Section 6 gives an example of some general advice. The Appendices provide Big Data terminology and solutions glossaries, Use Case Examples, and some possible Actors and Roles. Big Data Definition - “Big Data refers to the new technologies and applications introduced to handle increasing Volumes of data while enhancing data utilization capabilities such as Variety, Velocity, Variability, Veracity, and Value.” The key attribute is the large Volume of data available that forces horizontal scalability of storage and processing and has implications for all the other V-attributes.
    [Show full text]
  • Administration and Configuration Guide
    Red Hat JBoss Data Virtualization 6.4 Administration and Configuration Guide This guide is for administrators. Last Updated: 2018-09-26 Red Hat JBoss Data Virtualization 6.4 Administration and Configuration Guide This guide is for administrators. Red Hat Customer Content Services Legal Notice Copyright © 2018 Red Hat, Inc. This document is licensed by Red Hat under the Creative Commons Attribution-ShareAlike 3.0 Unported License. If you distribute this document, or a modified version of it, you must provide attribution to Red Hat, Inc. and provide a link to the original. If the document is modified, all Red Hat trademarks must be removed. Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law. Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux ® is the registered trademark of Linus Torvalds in the United States and other countries. Java ® is a registered trademark of Oracle and/or its affiliates. XFS ® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. MySQL ® is a registered trademark of MySQL AB in the United States, the European Union and other countries. Node.js ® is an official trademark of Joyent. Red Hat Software Collections is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.
    [Show full text]
  • Apache Sentry
    Apache Sentry Prasad Mujumdar [email protected] [email protected] Agenda ● Various aspects of data security ● Apache Sentry for authorization ● Key concepts of Apache Sentry ● Sentry features ● Sentry architecture ● Integration with Hadoop ecosystem ● Sentry administration ● Future plans ● Demo ● Questions Who am I • Software engineer at Cloudera • Committer and PPMC member of Apache Sentry • also for Apache Hive and Apache Flume • Part of the the original team that started Sentry work Aspects of security Perimeter Access Visibility Data Authentication Authorization Audit, Lineage Encryption, what user can do data origin, usage Kerberos, LDAP/AD Masking with data Data access Access ● Provide user access to data Authorization ● Manage access policies what user can do ● Provide role based access with data Agenda ● Various aspects of data security ● Apache Sentry for authorization ● Key concepts of Apache Sentry ● Sentry features ● Sentry architecture ● Integration with Hadoop ecosystem ● Sentry administration ● Future plans ● Demo ● Questions Apache Sentry (Incubating) Unified Authorization module for Hadoop Unlocks Key RBAC Requirements Secure, fine-grained, role-based authorization Multi-tenant administration Enforce a common set of policies across multiple data access path in Hadoop. Key Capabilities of Sentry Fine-Grained Authorization Permissions on object hierarchie. Eg, Database, Table, Columns Role-Based Authorization Support for role templetes to manage authorization for a large set of users and data objects Multi Tanent Administration
    [Show full text]
  • Synthesis and Development of a Big Data Architecture for the Management of Radar Measurement Data
    1 Faculty of Electrical Engineering, Mathematics & Computer Science Synthesis and Development of a Big Data architecture for the management of radar measurement data Alex Aalbertsberg Master of Science Thesis November 2018 Supervisors: dr. ir. Maurice van Keulen (University of Twente) prof. dr. ir. Mehmet Aks¸it (University of Twente) dr. Doina Bucur (University of Twente) ir. Ronny Harmanny (Thales) University of Twente P.O. Box 217 7500 AE Enschede The Netherlands Approval Internship report/Thesis of: …………………………………………………………………………………………………………Alexander P. Aalbertsberg Title: …………………………………………………………………………………………Synthesis and Development of a Big Data architecture for the management of radar measurement data Educational institution: ………………………………………………………………………………..University of Twente Internship/Graduation period:…………………………………………………………………………..2017-2018 Location/Department:.…………………………………………………………………………………435 Advanced Development, Delft Thales Supervisor:……………………………………………………………………………R. I. A. Harmanny This report (both the paper and electronic version) has been read and commented on by the supervisor of Thales Netherlands B.V. In doing so, the supervisor has reviewed the contents and considering their sensitivity, also information included therein such as floor plans, technical specifications, commercial confidential information and organizational charts that contain names. Based on this, the supervisor has decided the following: o This report is publicly available (Open). Any defence may take place publicly and the report may be included in public libraries and/or published in knowledge bases. • o This report and/or a summary thereof is publicly available to a limited extent (Thales Group Internal). tors . It will be read and reviewed exclusively by teachers and if necessary by members of the examination board or review ? committee. The content will be kept confidential and not disseminated through publication or inclusion in public libraries and/or knowledge bases.
    [Show full text]
  • Talend Open Studio for Big Data Release Notes
    Talend Open Studio for Big Data Release Notes 6.0.0 Talend Open Studio for Big Data Adapted for v6.0.0. Supersedes previous releases. Publication date July 2, 2015 Copyleft This documentation is provided under the terms of the Creative Commons Public License (CCPL). For more information about what you can and cannot do with this documentation in accordance with the CCPL, please read: http://creativecommons.org/licenses/by-nc-sa/2.0/ Notices Talend is a trademark of Talend, Inc. All brands, product names, company names, trademarks and service marks are the properties of their respective owners. License Agreement The software described in this documentation is licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.html. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. This product includes software developed at AOP Alliance (Java/J2EE AOP standards), ASM, Amazon, AntlR, Apache ActiveMQ, Apache Ant, Apache Avro, Apache Axiom, Apache Axis, Apache Axis 2, Apache Batik, Apache CXF, Apache Cassandra, Apache Chemistry, Apache Common Http Client, Apache Common Http Core, Apache Commons, Apache Commons Bcel, Apache Commons JxPath, Apache
    [Show full text]
  • Realization of Big Data Ana- Lytics Tool for Optimization Processes Within the Finnish Engineering Company
    OPINNÄYTETYÖ - AMMATTIKORKEAKOULUTUTKINTO TEKNIIKAN JA LIIKENTEEN ALA REALIZATION OF BIG DATA ANA- LYTICS TOOL FOR OPTIMIZATION PROCESSES WITHIN THE FINNISH ENGINEERING COMPANY A u t h o r / s : Karapetyan Karina SAVONIA UNIVERSITY OF APPLIED SCIENCES THESIS Abstract Field of Study Technology, Communication and Transport Degree Programme Degree Programme in Information Technology Author(s) Karapetyan Karina Title of Thesis Realization of Big Data Analytics Tool for optimization processes within the Finnish engineering company Date 23.05.2016 Pages/Appendices 54 Supervisor(s) Mr. Arto Toppinen, Principal Lecturer at Savonia University of Applied Sciences, Mr. Anssi Suhonen, Lecturer at Savonia University of Applied Sciences Client Organisation /Partners Hydroline Oy Abstract Big Data Analytics Tool offers an entire business picture for making both operational and strategic deci- sions from selecting the product price to establishing the priorities for the further vendor’s enhancement. The purpose of the thesis was to explore the industrial system of Hydroline Oy and provide a software solution for the elaboration of the manufacture, due to the internal analyzing within the company. For the development of Big Data Analytics Tool, several software programs and tools were employed. Java-written server controls all components in the project and visualizes the processed data via a user- friendly client web application. The SQL Server maintains data, observed from the ERP system. Moreo- ver, it is responsible for the login and registration procedure to enforce the information security. In the Hadoop environment, two research methods were implemented. The Overall Equipment Effectiveness model investigated the production data to obtain daily, monthly and annual efficiency indices of equip- ment utilization, employees’ workload, resource management, quality degree, among others.
    [Show full text]
  • Mapreduce Service
    MapReduce Service Troubleshooting Issue 01 Date 2021-03-03 HUAWEI TECHNOLOGIES CO., LTD. Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd. Trademarks and Permissions and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders. Notice The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied. The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied. Issue 01 (2021-03-03) Copyright © Huawei Technologies Co., Ltd. i MapReduce Service Troubleshooting Contents Contents 1 Account Passwords.................................................................................................................. 1 1.1 Resetting
    [Show full text]
  • Using Apache Phoenix to Store and Access Data Date Published: 2020-02-29 Date Modified: 2020-07-28
    Cloudera Runtime 7.2.1 Using Apache Phoenix to Store and Access Data Date published: 2020-02-29 Date modified: 2020-07-28 https://docs.cloudera.com/ Legal Notice © Cloudera Inc. 2021. All rights reserved. The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein. Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release. Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information. Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs. Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera. Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners. Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH.
    [Show full text]
  • HDP 3.1.4 Release Notes Date of Publish: 2019-08-26
    Release Notes 3 HDP 3.1.4 Release Notes Date of Publish: 2019-08-26 https://docs.hortonworks.com Release Notes | Contents | ii Contents HDP 3.1.4 Release Notes..........................................................................................4 Component Versions.................................................................................................4 Descriptions of New Features..................................................................................5 Deprecation Notices.................................................................................................. 6 Terminology.......................................................................................................................................................... 6 Removed Components and Product Capabilities.................................................................................................6 Testing Unsupported Features................................................................................ 6 Descriptions of the Latest Technical Preview Features.......................................................................................7 Upgrading to HDP 3.1.4...........................................................................................7 Behavioral Changes.................................................................................................. 7 Apache Patch Information.....................................................................................11 Accumulo...........................................................................................................................................................
    [Show full text]
  • Apache Hbase. | 1
    apache hbase. | 1 how hbase works *this is a study guide that was created from lecture videos and is used to help you gain an understanding of how hbase works. HBase Foundations Yahoo released the Hadoop data storage system and Google added HDFS programming interface. HDFS stands for Hadoop Distributed File System and it spreads data across what are called nodes in it’s cluster. The data does not have a schema as it is just document/files. HDFS is schemaless, distributed and fault tolerant. MapReduce is focused on data processing and jobs to write to MapReduce are in Java. The operations of a MapReduce job is to find the data and list tasks that it needs to execute and then execute them. The action of executing is called reducers. A downside is that it is batch oriented, which means you would have to read the entire file of data even if you would like to read a small portion of data. Batch oriented is slow. Hadoop is semistructured data and unstructured data, there is no random access for Hadoop and no transaction support. HBase is also called the Hadoop database and unlike Hadoop or HDFS, it has a schema. There is an in-memory feature that gives you the ability to read information quickly. You can isolate the data you want to analyze. HBase is random access. HBase allows for CRUD, which is Creating a new document, Reading the information into an application or process, Update which will allow you to change the value and Delete where mind movement machine.
    [Show full text]
  • SAS 9.4 Hadoop Configuration Guide for Base SAS And
    SAS® 9.4 Hadoop Configuration Guide for Base SAS® and SAS/ACCESS® Second Edition SAS® Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS® 9.4 Hadoop Configuration Guide for Base SAS® and SAS/ACCESS®, Second Edition. Cary, NC: SAS Institute Inc. SAS® 9.4 Hadoop Configuration Guide for Base SAS® and SAS/ACCESS®, Second Edition Copyright © 2015, SAS Institute Inc., Cary, NC, USA All rights reserved. Produced in the United States of America. For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc. For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication. The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated. U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S.
    [Show full text]
  • Hortonworks Data Platform for Enterprise Data Lakes Delivers Robust, Big Data Analytics That Accelerate Decision Making and Innovation
    IBM United States Software Announcement 218-187, dated March 20, 2018 Hortonworks Data Platform for Enterprise Data Lakes delivers robust, big data analytics that accelerate decision making and innovation Table of contents 1 Overview 5 Publications 2 Key prerequisites 5 Technical information 2 Planned availability date 6 Ordering information 2 Description 7 Terms and conditions 5 Program number 9 Prices 10 Corrections Overview Hortonworks Data Platform is an enterprise ready open source Apache Hadoop distribution based on a centralized architecture supported by YARN. Hortonworks Data Platform is designed to address the needs of data at rest, power real-time customer applications, and deliver big data analytics that can help accelerate decision making and innovation. The official Apache versions for Hortonworks Data Platform V2.6.4 include: • Apache Accumulo 1.7.0 • Apache Atlas 0.8.0 • Apache Calcite 1.2.0 • Apache DataFu 1.3.0 • Apache Falcon 0.10.0 • Apache Flume 1.5.2 • Apache Hadoop 2.7.3 • Apache HBase 1.1.2 • Apache Hive 1.2.1 • Apache Hive 2.1.0 • Apache Kafka 0.10.1 • Apache Knox 0.12.0 • Apache Mahout 0.9.0 • Apache Oozie 4.2.0 • Apache Phoenix 4.7.0 • Apache Pig 0.16.0 • Apache Ranger 0.7.0 • Apache Slider 0.92.0 • Apache Spark 1.6.3 • Apache Spark 2.2.0 • Apache Sqoop 1.4.6 • Apache Storm 1.1.0 • Apache TEZ 0.7.0 • Apache Zeppelin 0.7.3 IBM United States Software Announcement 218-187 IBM is a registered trademark of International Business Machines Corporation 1 • Apache ZooKeeper 3.4.6 IBM(R) clients can download this new offering from Passport Advantage(R).
    [Show full text]