Cloudera Runtime 7.1.4

Integrating Apache Hive with Spark and BI
Date published: 2020-10-13
Date modified:

https://docs.cloudera.com/

Legal Notice

© Cloudera Inc. 2021. All rights reserved. The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein. Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release. Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information. Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs. Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera. Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners.
Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS. WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED ON COURSE OF DEALING OR USAGE IN TRADE.

Contents

Introduction to HWC
  Introduction to HWC execution modes
  Spark Direct Reader mode
  JDBC execution mode
  Automating mode selection
  Configuring Spark Direct Reader mode
  Configuring JDBC execution mode
  Kerberos configurations for HWC
  Configuring external file authorization
  Reading managed tables through HWC
  Writing managed tables through HWC
  API operations
  HWC supported types mapping
  Catalog operations
  Read and write operations
  Commit transaction in Spark Direct Reader mode
  Close HiveWarehouseSession operations
  Use HWC for streaming
  HWC API Examples
  Hive Warehouse Connector Interfaces
  Submit a Scala or Java application
  Submit a Python app

Apache Hive-Kafka integration
  Create a table for a Kafka stream
  Querying Kafka data
  Query live data from Kafka
  Perform ETL by ingesting data from Kafka into Hive
  Writing data to Kafka
  Write transformed Hive data to Kafka
  Set consumer and producer properties as table properties
  Kafka storage handler and table properties

Connecting Hive to BI tools using a JDBC/ODBC driver
  Getting the JDBC driver
  Integrating Hive and a BI tool
  Specify the JDBC connection string
  JDBC connection string syntax

Using JdbcStorageHandler to query RDBMS
  Set up JDBCStorageHandler for Postgres

Introduction to HWC

You need to understand Hive Warehouse Connector (HWC) to query Apache Hive tables from Apache Spark. Examples of supported APIs, such as Spark SQL, show some operations you can perform, including how to write to a Hive ACID table or write a DataFrame from Spark.
HWC is software for securely accessing Hive tables from Spark. You need to use HWC if you want to access Hive managed tables from Spark. You explicitly use HWC by calling the HiveWarehouseConnector API to write to managed tables. You might use HWC without even realizing it: HWC implicitly reads tables when you run a Spark SQL query on a Hive managed table.
You do not need HWC to read or write Hive external tables, but you might want to use HWC to purge external table files. From Spark, using HWC you can read Hive external tables in ORC or Parquet formats. From Spark, using HWC you can write Hive external tables in ORC format only.
Creating an external table stores only the metadata in HMS. If you use HWC to create the external table, HMS keeps track of the location of table names and columns. Dropping an external table deletes the metadata from HMS. You can set an option to also drop the actual data in files, or not, from the file system. If you do not use HWC, dropping an external table deletes only the metadata from HMS. If you do not have permission to access the file system, and you want to purge table data in addition to metadata, you need to use HWC.
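For example, a minimal sketch of purging external table data through HWC, assuming a hypothetical external table sales_ext in a database named mydb; the dropTable arguments (table, ifExists, purge) come from the HiveWarehouseSession interface described later in this document:

scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()
scala> hive.setDatabase("mydb")
scala> // ifExists = true, purge = true removes the table data files as well as the HMS metadata
scala> hive.dropTable("sales_ext", true, true)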

Supported APIs
• Spark SQL
Supports native Spark SQL query read (only) patterns. Output conforms to native Spark conventions.
• HWC
Supports HiveWarehouseSession API operations using the HWC sql API.
• DataFrames
Supports accessing a Hive ACID table from Scala or PySpark directly using DataFrames. Use the short name HiveAcid. Direct reads and writes from the file are not supported.

Spark SQL Example

$ spark-shell
scala> sql("select * from managedTable").show
scala> spark.read.table("managedTable").show

HWC API Example

scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()
scala> hive.executeQuery("select * from emp_acid").show


scala> hive.executeQuery("select e.emp_id, e.first_name, d.name department from emp_acid e join dept_ext d on e.dept_id = d.id").show

DataFrames Example
Hive ACID tables are tables in the Hive metastore; read them using DataFrames as follows:
Syntax:

format("HiveAcid").option("table", "<database name>.<table name>")

Example:

scala> val df = spark.read.format("HiveAcid").options(Map("table" -> "default.acidtbl")).load()
scala> df.collect()

HWC Limitations
• You cannot write data using Spark Direct Reader.
• Transaction semantics of Spark RDDs are not ensured when using Spark Direct Reader to read ACID tables.
• HWC supports reading tables in any format, but currently supports writing tables in ORC format only.
• The Spark thrift server is not supported.
• Table stats (basic stats and column stats) are not generated when you write a DataFrame to Hive.
• The Hive Union types are not supported.
• When the HWC API save mode is overwrite, writes are limited. You cannot read from and overwrite the same table. If your query accesses only one table and you try to overwrite that table using an HWC API write method, a deadlock state might occur. Do not attempt this operation.
Example: Operation Not Supported

scala> val df = hive.executeQuery("select * from t1") scala> df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseC onnector"). \ mode("overwrite").option("table", "t1").save

Workaround for using the Hive Warehouse Connector with Oozie Spark action
Hive and Spark use different Thrift versions and are incompatible with each other. Upgrading Thrift in Hive is complicated and may not be resolved in the near future. Therefore, Thrift packages are shaded inside the HWC JAR to make the Hive Warehouse Connector work with Spark and Oozie Spark action. See the workaround in the Cloudera Oozie documentation (link below).

Supported applications and operations
The Hive Warehouse Connector supports the following applications:
• Spark shell
• PySpark
• The spark-submit script
The following list describes a few of the operations supported by the Hive Warehouse Connector:
• Describing a table
• Creating a table in ORC using .createTable() or in any format using .executeUpdate()
• Writing to a table in ORC format
• Selecting Hive data and retrieving a DataFrame
• Writing a DataFrame to a Hive-managed ORC table in batch
• Executing a Hive update statement


• Reading table data, transforming it in Spark, and writing it to a new Hive table
• Writing a DataFrame or Spark stream to Hive using HiveStreaming
• Partitioning data when writing a DataFrame
Related Information
HMS storage
Orc vs Parquet
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Using Hive Warehouse Connector with Oozie Spark Action
Union Types

Introduction to HWC execution modes
A comparison of each execution mode helps you make HWC configuration choices. You can see, graphically, how the configuration affects the query authorization process and your security. You read about which configuration provides fine-grained access control, such as column masking.
In CDP Public Cloud, HWC is available by default in provisioned clusters. In CDP Private Cloud Base, you need to configure an HWC execution mode. HWC executes reads in the modes shown in the following table:

Table 1:

Capabilities                                       JDBC mode         Spark Direct Reader mode
Ranger integration (fine-grained access control)   ✓                 N/A
Hive ACID reads                                     ✓                 ✓
Workloads handled                                   Small datasets    ETL without fine-grained access control

These read modes require connections to different Hive components:
• Spark Direct Reader mode: Connects to Hive Metastore (HMS)
• JDBC execution mode: Connects to HiveServer (HS2)
The read execution mode determines the type of query authorization for reads. Ranger authorizes access to Hive tables from Spark through HiveServer (HS2) or the Hive metastore API (HMS API). To write ACID managed tables from Spark to Hive, use HWC. To write external tables from Spark to Hive, use native Spark.
The following diagram shows the typical read authorization process:


The following diagram shows the typical write authorization process:


You need to use HWC to read or write managed tables from Spark. Spark Direct Reader mode does not support writing to managed tables. Managed table queries go through HiveServer, which is integrated with Ranger. External table queries go through the HMS API, which is also integrated with Ranger. In Spark Direct Reader mode, SparkSQL queries read managed table metadata directly from the HMS, but only if you have permission to access files on the file system. If you do not use HWC, the Hive metastore (HMS) API, integrated with Ranger, authorizes external table access. HMS API-Ranger integration enforces the Ranger Hive ACL in this case. When you use HWC, queries such as DROP TABLE affect file system data as well as metadata in HMS.

Managed tables
A Spark job impersonates the end user when attempting to access an Apache Hive managed table. As an end user, you do not have permission to access the secure, managed files in the Hive warehouse. Managed tables have default file system permissions that disallow end user access, including Spark user access. As Administrator, you set permissions in Ranger to access the managed tables in JDBC mode. You can fine-tune Ranger to protect specific data. For example, you can mask data in certain columns, or set up tag-based access control.

In Spark Direct Reader mode, you cannot use Ranger. You must set read access to the file system location for managed tables. You must have Read and Execute permissions on the Hive warehouse location (hive.metastore.warehouse.dir).

External tables
Ranger authorization of external table reads and writes is supported. You need to configure a few properties in Cloudera Manager for authorization of external table writes. You must be granted file system permissions on external table files to allow Spark direct access to the actual table data instead of just the table metadata. For example, to purge actual data you need access to the file system.

Spark Direct Reader mode vs JDBC mode
As Spark allows users to run arbitrary code, fine-grained access control, such as row-level filtering or column-level masking, is not possible within Spark itself. This limitation extends to data read in Spark Direct Reader mode. To restrict data access at a fine-grained level, consider using Ranger and HWC in JDBC execution mode if your datasets are small. If you do not require fine-grained access, consider using HWC Spark Direct Reader mode. For example, use Spark Direct Reader mode for ETL use cases. Spark Direct Reader mode is the recommended read mode for production. Using HWC is the recommended write mode for production.
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
HMS storage
Apache Hive Wiki: JDBC URL information
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Spark Direct Reader mode
A detailed description of Spark Direct Reader mode includes how the Hive Warehouse Connector (HWC) transparently connects to the Apache Hive metastore (HMS) to get transaction information, and then reads the data directly from the managed table location using the transaction snapshot. The properties you need to set, and when you need to set them, in the context of the Apache Spark session help you successfully work in this mode.

Requirements and recommendations
Spark Direct Reader mode requires a connection to the Hive metastore. Neither a HiveServer (HS2) nor a HiveServer Interactive (HSI) connection is needed. Spark Direct Reader for reading Hive ACID, transactional tables from Spark is supported for production use. Use Spark Direct Reader mode if your ETL jobs do not require authorization and run as super user.

Component interaction
The following diagram shows component interaction in HWC Spark Direct Reader mode.


Spark Direct Reader Mode configuration
In configuration/spark-defaults.conf, or using the --conf option in spark-submit/spark-shell, set the following properties:

Name: spark.sql.extensions
Value: com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension
Required for using Spark SQL in auto-translate direct reader mode. Set before creating the Spark session.

Name: spark.kryo.registrator
Value: com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator
Set before creating the Spark session. Required if serialization = kryo.

Name: spark.sql.hive.hwc.execution.mode
Value: spark
Required only if you are using the HWC API for execution. Cannot be any other value.

Name: spark.hadoop.hive.metastore.uris
Value: thrift://<metastore host>:<metastore port>
Hive metastore URI.

Name: --jars
Value: HWC jar
Pass the HWC jar to spark-shell or spark-submit using the --jars option while launching the application. For example, launch spark-shell as follows.

Example: Launch a spark-shell

spark-shell --jars \
/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf "spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension" \
--conf "spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator" \
--conf "spark.hadoop.hive.metastore.uris=<metastore uri>"

Unsupported functionality
Spark Direct Reader does not support the following functionality:
• Writes
• Streaming inserts
• CTAS statements

Limitations
• Does not enforce authorization; hence, you must configure read access to the HDFS, or other, location for managed tables. You must have Read and Execute permissions on the Hive warehouse location (hive.metastore.warehouse.dir).
• Supports only single-table transaction consistency. The direct reader does not guarantee that multiple tables referenced in a query read the same snapshot of data.
• Does not auto-commit transactions submitted by rdd APIs. Explicitly close transactions to release locks.
• Requires read and execute access on the hive-managed table locations.
• Does not support Ranger column masking and fine-grained access control.
• Blocks compaction on open read transactions.
The way Spark handles null and empty strings can cause a discrepancy between metadata and actual data when writing the data read by Spark Direct Reader to a CSV file.

Spark Direct Reader Security
Ranger Resource Mapping Server (RMS) translates Hadoop SQL policies in Apache Ranger to HDFS ACLs. The RMS ACL Sync technical preview periodically synchronizes the policies and ACLs. You can use RMS ACL Sync to prevent Apache Spark user access to storage locations in HDFS that back Apache Hive tables and to apply fine-grained security, such as column masking and tagging, defined for Hive tables.
For example, you can create a Hive table, and then, in Ranger, give your Spark users SELECT permission on the table. Even though users have SELECT permission, they will be denied access, and queries will fail, under the following conditions:
• A Ranger tag-based policy denies access to one of the columns in the table.
• Ranger column masking gives users read access to only one of the columns in the table.
• Ranger row filtering gives users access to only a subset of rows.
To allow certain users to query the table, in Ranger you assign those users the ALL privilege on the table. When allowed to access the entire table, their queries successfully execute. For more information about RMS ACL Sync, including set up information, see the links below.
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
Ranger RMS Overview
Ranger RMS Configuration

JDBC execution mode
You need to understand how JDBC mode interacts with Apache Hive components to read Hive tables from Spark through HWC. Where your queries are executed affects configuration. Understanding execution locations and recommendations helps you configure JDBC execution mode for your use case.

Component Interaction
JDBC mode creates only one JDBC connection to HiveServer (HS2) or HiveServer Interactive (HSI), a potential bottleneck in data transfer to Spark. The following diagram shows interaction in JDBC mode with Hive metastore (HMS), TEZ, and HDFS.

HWC does not use JDBC mode during a write. HWC writes to an intermediate location from Spark, and then executes a LOAD DATA query to write the data. Using HWC to write data is recommended for production.

Configuration
In JDBC mode, execution takes place in these locations:
• Driver: Using the Hive JDBC URL, connects to Hive and executes the query on the driver side.
• Cluster: From Spark executors, connects to Hive through JDBC and executes the query.
Authorization occurs on the server.
JDBC mode runs in the client or cluster:
• Client (Driver)
In client mode, any failures to connect to HiveServer (HS2) are not retried.
• Cluster (Executor), recommended
In cluster mode, any failures to connect to HS2 are retried automatically.
JDBC mode is recommended for production reads of workloads having a data size of 1 GB or less. Larger workloads are not recommended due to slow performance when reading huge data sets. Where your queries are executed affects the Kerberos configurations for HWC.
In configuration/spark-defaults.conf, or using the --conf option in spark-submit/spark-shell, set the following properties:

Name: spark.datasource.hive.warehouse.read.jdbc.mode
Value: client or cluster
Configures the driver location.

Name: spark.sql.hive.hiveserver2.jdbc.url
Value: The JDBC endpoint for HiveServer. For more information, see the Apache Hive Wiki (link below). For Knox, provide the HiveServer, not Knox, endpoint.

Name: spark.datasource.hive.warehouse.load.staging.dir
Value: Temporary staging location required by HWC. Set the value to a file system location where the HWC user has write permission.

Name: spark.hadoop.hive.zookeeper.quorum
Value: The ZooKeeper quorum that HiveServer (HS2) uses for service discovery, for example <zookeeper host1>:2181,<zookeeper host2>:2181,<zookeeper host3>:2181.

JDBC Mode Limitations
• If you configured Auto Translate, run JDBC in cluster mode.
• JDBC mode, which is used for reads only, is recommended for production workloads having a data size of 1 GB or less. In larger workloads, bottlenecks develop in data transfer to Spark. Writes through HWC of any size are recommended for production. Writes do not use JDBC mode.
Related Information
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Automating mode selection
You need to know the prerequisites for using Auto Translate to select an execution mode transparently, based on your query. In a single step, you configure Auto Translate and submit an application.

About this task
You configure the spark.sql.extensions property to enable auto translation. When you enable Auto Translate, Spark implicitly selects HWC, or native Apache Spark, to run your query. Spark selects HWC when you query an Apache Hive managed (ACID) table and falls back to native Spark for reading external tables. You can use the same Spark APIs to access either managed or external tables.

Before you begin
• Configure Spark Direct Reader mode and JDBC execution mode.
• Configure Kerberos.


Procedure
1. Submit the Spark application, including the spark.sql.extensions property to enable Auto Translate.
2. If you use the kryo serializer, also include --conf spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator.
For example:

sudo -u hive spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar \
--conf "spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension" \
--conf spark.kryo.registrator="com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator"

3. Read and view employee data in table emp_acid.

scala> spark.sql("select * from emp_acid").show(1000, false)

+------+----------+--------------------+-------------+------------+-----+-----+-------+
|emp_id|first_name|              e_mail|date_of_birth|        city|state|  zip|dept_id|
+------+----------+--------------------+-------------+------------+-----+-----+-------+
|677509|      Lois| lois.walker@hotma… |    3/29/1981|      Denver|   CO|80224|      4|
|940761|    Brenda|brenda.robinson@g...|    7/31/1970|   Stonewall|   LA|71078|      5|
|428945|       Joe| joe.robinson@gmai… |    6/16/1963|Michigantown|   IN|46057|      3|
……….
……….
……….

You do not need to specify an execution mode. You simply submit the query. If you use the HWC API, use hive.execute to execute a read. This command processes queries through HWC in either JDBC or Spark Direct Reader mode.
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Configuring Spark Direct Reader mode
In a two-step procedure, you see how to configure Apache Spark to connect to the Apache Hive metastore. An example shows how to configure Spark Direct Reader mode while launching the Spark shell.

About this task
This procedure assumes you are not using Auto Translate and do not require serialization.

Before you begin
Set Kerberos configurations for HWC, or for an unsecured cluster, set spark.security.credentials.hiveserver2.enabled=false.


Procedure
1. In Cloudera Manager, in Hosts > Roles, if Hive Metastore appears in the list of roles, copy the host name or IP address. You use the host name or IP address in the next step to set the host value.
2. Launch the Spark shell and include the configuration of the spark.hadoop.hive.metastore.uris property set to thrift://<metastore host>:<metastore port>. For example:

spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar \
--conf "spark.hadoop.hive.metastore.uris=thrift://172.27.74.137:9083" ...

If you use the HWC API, configure spark.sql.hive.hwc.execution.mode=spark.

Configuring JDBC execution mode
In two steps, you configure Apache Spark to connect to HiveServer (HS2). An example shows how to configure this mode while launching the Spark shell.

Before you begin
• Accept the default and recommended spark.datasource.hive.warehouse.read.jdbc.mode=cluster for the location of query execution.
• Accept the default spark.datasource.hive.warehouse.load.staging.dir for the temporary staging location required by HWC.
• Check that spark.hadoop.hive.zookeeper.quorum is configured.
• Set Kerberos configurations for HWC, or for an unsecured cluster, set spark.security.credentials.hiveserver2.enabled=false.

Procedure
1. Find the HiveServer (HS2) JDBC URL in /etc/hive/conf.cloudera.HIVE_ON_TEZ-1/beeline-site.xml.
The value of beeline.hs2.jdbc.url.HIVE_ON_TEZ-1 is the HS2 JDBC URL in this sample file.

...
<property>
  <name>beeline.hs2.jdbc.url.default</name>
  <value>HIVE_ON_TEZ-1</value>
</property>
<property>
  <name>beeline.hs2.jdbc.url.HIVE_ON_TEZ-1</name>
  <value>jdbc:hive2://nightly7x-unsecure-1.nightly7x-unsecure.root.hwx.site:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;retries=5</value>
</property>
...

2. Set the Spark property to the value of the HS2 JDBC URL. For example, in /opt/cloudera/parcels/CDH-7.2.1-1.cdh7.2.1.p0.4847773/etc/spark/conf.dist/spark-defaults.conf, add the JDBC URL:

...


spark.sql.hive.hiveserver2.jdbc.url jdbc:hive2://nightly7x-unsecure-1.nightly7x-unsecure.root.hwx.site:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;retries=5

Kerberos configurations for HWC
You learn how to configure, and which parameters to set for, a Kerberos-secure HWC connection for querying the Hive metastore from Spark.
The Hive Warehouse Connector (HWC) must connect to HiveServer (HS2) to execute writes or to execute reads in read modes other than Direct Reader. You need to set the following configuration properties to connect HWC to a Kerberos-enabled HiveServer:
• Property: spark.sql.hive.hiveserver2.jdbc.url.principal
Value: Set this value to the value of "hive.server2.authentication.kerberos.principal".
• Property: spark.security.credentials.hiveserver2.enabled
Value: Set this value to "true".
You do not need to explicitly provide other authentication configurations, such as auth type and principal. When Spark opens a secure connection to the Hive metastore, Spark automatically picks the authentication configurations from the hive-site.xml that is present on the Spark app classpath. For example, to execute queries in Direct Reader mode through HWC, Spark opens a secure connection to the Hive metastore and this authentication occurs automatically. You can set the properties using the spark-submit/spark-shell --conf option.

Configuring external file authorization
As Administrator, you need to know how to configure properties in Cloudera Manager for read and write authorization to Apache Hive external tables from Apache Spark. You also need to configure file level permissions on tables for users.

About this task
You set the following properties and values for HMS API-Ranger integration:

hive.metastore.pre.event.listeners
Value: org.apache.hadoop.hive.ql.security.authorization.plugin.metastore.HiveMetaStoreAuthorizer
Configures HMS writes.

hive.security.authenticator.manager
Value: org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator

Add properties to hive-site.xml using the Cloudera Manager Safety Valve as described in the next section.

Procedure
1. In Cloudera Manager, to configure Hive Metastore properties, click Clusters > Hive-1 > Configuration.
2. Search for hive-site.


3. In Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml, click +.

4. Add a property name and value.
5. Repeat steps to add other properties.
6. Save changes.
7. Configure file level permissions on tables for users.
Only users who have file level permissions on external tables can access external tables.

Reading managed tables through HWC
A step-by-step procedure walks you through choosing one mode or another, starting the Apache Spark session, and executing a read of Apache Hive ACID, managed tables.

Before you begin
• Configure Spark Direct Reader mode or JDBC execution mode.
• Set Kerberos for HWC.

Procedure
1. Choose a configuration based on your execution mode.
• Spark Direct Reader mode:

--conf spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension

• JDBC mode:

--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
--conf spark.datasource.hive.warehouse.read.via.llap=false

Also set a location for running the application in JDBC mode. For example, set the recommended cluster location:

spark.datasource.hive.warehouse.read.jdbc.mode=cluster

2. Start the Spark session using the execution mode you chose in the last step. For example, start the Spark session using Spark Direct Reader mode and configure for kryo serialization:

sudo -u hive spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar \
--conf "spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension" \
--conf spark.kryo.registrator="com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator"

For example, start the Spark session using JDBC execution mode:

sudo -u hive spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar \
--conf spark.sql.hive.hwc.execution.mode=spark \
--conf spark.datasource.hive.warehouse.read.via.llap=false

You must start the Spark session after setting Spark Direct Reader mode, so include the configurations in the launch string.
3. Read Apache Hive managed tables. For example:

scala> sql("select * from managedTable").show
scala> spark.read.table("managedTable").show

Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Writing managed tables through HWC
A step-by-step procedure walks you through connecting to HiveServer (HS2) to write tables from Spark, which is recommended for production. You launch the Spark session, and write ACID, managed tables to Apache Hive.

Before you begin
• Accept the default spark.datasource.hive.warehouse.load.staging.dir for the temporary staging location required by HWC.
• Check that spark.hadoop.hive.zookeeper.quorum is configured.
• Set Kerberos configurations for HWC, or for an unsecured cluster, set spark.security.credentials.hiveserver2.enabled=false.

About this task
Limitation: Only the ORC format is supported for writes.
The way data is written from HWC is not impacted by the read modes configured for HWC. For write operations, HWC writes to an intermediate location (as defined by the value of the config spark.datasource.hive.warehouse.load.staging.dir) from Spark, followed by executing a "LOAD DATA" query in Hive via JDBC. Exception: writing to dynamic partitions creates an intermediate temporary external table. Using HWC to write data is recommended for production in CDP.

Procedure
1. Start the Apache Spark session and include the URL for HiveServer.

spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar \
--conf spark.sql.hive.hiveserver2.jdbc.url=<JDBC URL for HiveServer> ...


2. Include in the launch string a configuration of the intermediate location to use as a staging directory. Example syntax:

...
--conf spark.sql.hive.hwc.execution.mode=spark \
--conf spark.datasource.hive.warehouse.read.via.llap=false \
--conf spark.datasource.hive.warehouse.load.staging.dir=<path to staging directory>

3. Write a Hive managed table. For example, in Scala:

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._

val hive = HiveWarehouseSession.session(spark).build();
hive.setDatabase("tpcds_bin_partitioned_orc_1000");
val df = hive.executeQuery("select * from web_sales");
df.createOrReplaceTempView("web_sales");
hive.setDatabase("testDatabase");

hive.createTable("newTable").ifNotExists()
  .column("ws_sold_time_sk", "bigint")
  .column("ws_ship_date_sk", "bigint")
  .create();

sql("SELECT ws_sold_time_sk, ws_ship_date_sk FROM web_sales WHERE ws_sold_time_sk > 80000")
  .write.format(HIVE_WAREHOUSE_CONNECTOR)
  .mode("append")
  .option("table", "newTable")
  .save();

HWC internally fires the following query to Hive through JDBC:

LOAD DATA INPATH '<path to temporary data>' INTO TABLE tpcds_bin_partitioned_orc_1000.newTable

4. Write to a statically partitioned, Hive managed table named t1 having two partitioned columns c1 and c2.

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("partiti on", "c1='val1',c2='val2'").option("table", "t1").save();

HWC internally fires the following query to Hive through JDBC after writing data to a temporary location.

LOAD DATA INPATH '' [O VERWRITE] INTO TABLE db.t1 PARTITION (c1='val1',c2='val2');

5. Write to a dynamically partitioned table named t1 having two partitioned cols c1 and c2.

df.write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("partiti on", "c1='val1',c2").option("table", "t1").save();

HWC internally fires the following query to Hive through JDBC after writing data to a temporary location.

CREATE TEMPORARY EXTERNAL TABLE db.job_id_table(cols....) STORED AS ORC LOCATION '';

19 Cloudera Runtime Introduction to HWC

INSERT INTO TABLE t1 PARTITION (c1='val1',c2) SELECT FROM db.job _id_table;

where should have comma separated list of columns in the table with dynamic partition columns being the last in the list and in the same order as the partition definition. Related Information Configuring Spark Direct Reader mode Configuring JDBC execution mode Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

API operations
As an Apache Spark developer, you learn the code constructs for executing Apache Hive queries using the HiveWarehouseSession API. In Spark source code, you see how to create an instance of HiveWarehouseSession.

Import statements and variables
The following string constants are defined by the API:
• HIVE_WAREHOUSE_CONNECTOR
• DATAFRAME_TO_STREAM
• STREAM_TO_STREAM

Assuming spark is running in an existing SparkSession, use this code for imports: • Scala

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()

• Java

import com.hortonworks.hwc.HiveWarehouseSession;
import static com.hortonworks.hwc.HiveWarehouseSession.*;
HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build();

• Python

from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()

Executing queries
HWC supports three methods for executing queries:
• .sql()
  • Executes queries in any HWC mode.
  • Consistent with the Spark sql interface.
  • Masks the internal implementation based on the cluster type you configured, either JDBC_CLIENT or JDBC_CLUSTER.
• .execute()
  • Required for executing queries if spark.datasource.hive.warehouse.read.mode=JDBC_CLUSTER.
  • Uses a driver side JDBC connection.
  • Provided for backward compatibility where the method defaults to reading in JDBC client mode irrespective of the value of JDBC client or cluster mode configuration.
  • Recommended for catalog queries.
• .executeQuery()
  • Executes queries, except catalog queries, in LLAP mode (spark.datasource.hive.warehouse.read.via.llap=true).
  • If LLAP is not enabled in the cluster, .executeQuery() does not work. CDP Data Center does not support LLAP.
  • Provided for backward compatibility.
Results are returned as a DataFrame to Spark. A short example contrasting these methods follows the related information links below.
Related Information
HMS storage
Orc vs Parquet
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
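For example, a minimal sketch contrasting the three methods, assuming an HWC-enabled Spark shell and the tpcds_bin_partitioned_orc_1000.web_sales table used elsewhere in this document:

scala> val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()
scala> hive.setDatabase("tpcds_bin_partitioned_orc_1000")
scala> hive.sql("select count(*) from web_sales").show()       // runs in whichever HWC read mode is configured
scala> hive.execute("describe extended web_sales").show()      // driver-side JDBC connection; suited to catalog queries
scala> hive.executeQuery("select * from web_sales").show(10)   // legacy path; requires LLAP, which CDP Data Center does not support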

HWC supported types mapping
To create HWC API apps, you must know how the Hive Warehouse Connector maps Apache Hive types to Apache Spark types, and vice versa. Awareness of a few unsupported types helps you avoid problems.

Spark-Hive supported types mapping
The following types are supported by the HiveWarehouseConnector library:

Spark Type         Hive Type
ByteType           TinyInt
ShortType          SmallInt
IntegerType        Integer
LongType           BigInt
FloatType          Float
DoubleType         Double
DecimalType        Decimal
StringType*        String, Varchar*
BinaryType         Binary
BooleanType        Boolean
TimestampType**    Timestamp**
DateType           Date
ArrayType          Array
StructType         Struct

Notes:
* StringType (Spark) and String, Varchar (Hive)
A Hive String or Varchar column is converted to a Spark StringType column. When a Spark StringType column has maxLength metadata, it is converted to a Hive Varchar column; otherwise, it is converted to a Hive String column.
** Timestamp (Hive)


The Hive Timestamp column loses submicrosecond precision when converted to a Spark TimestampType column because a Spark TimestampType column has microsecond precision, while a Hive Timestamp column has nanosecond precision. Hive timestamps are interpreted as UTC. When reading data from Hive, timestamps are adjusted according to the local timezone of the Spark session. For example, if Spark is running in the America/New_York timezone, a Hive timestamp 2018-06-21 09:00:00 is imported into Spark as 2018-06-21 05:00:00 due to the 4-hour time difference between America/New_York and UTC.
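The following sketch illustrates the mapping from the Spark side. It assumes an existing HiveWarehouseSession named hive and a hypothetical table customer_dim with a string column c_name; attaching maxLength as standard Spark column metadata is an assumption about how to supply the metadata that the connector looks for:

scala> val df = hive.sql("select * from customer_dim")
scala> df.printSchema()   // Hive string and varchar columns both appear as Spark StringType

scala> import org.apache.spark.sql.functions.col
scala> import org.apache.spark.sql.types.MetadataBuilder
scala> // Attach maxLength metadata so a write through HWC produces varchar(100) rather than string
scala> val meta = new MetadataBuilder().putLong("maxLength", 100).build()
scala> val withVarchar = df.withColumn("c_name", col("c_name").as("c_name", meta))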

Spark-Hive unsupported types

Spark Type             Hive Type
CalendarIntervalType   Interval
N/A                    Char
MapType                Map
N/A                    Union
NullType               N/A
TimestampType          Timestamp With Timezone

Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Catalog operations
Short descriptions and the syntax of catalog operations, which include creating, dropping, and describing an Apache Hive database and table from Apache Spark, help you write HWC API apps.

Catalog operations
Three methods of executing catalog operations are supported: .sql (recommended), .execute() (spark.datasource.hive.warehouse.read.jdbc.mode = client), or .executeQuery() for backward compatibility.
• Set the current database for unqualified Hive table references
hive.setDatabase(<database>)
• Execute a catalog operation and return a DataFrame
hive.execute("describe extended web_sales").show()
• Show databases
hive.showDatabases().show(100)
• Show tables for the current database
hive.showTables().show(100)
• Describe a table
hive.describeTable(<table_name>).show(100)
• Create a database
hive.createDatabase(<database_name>, <ifNotExists>)


• Create an ORC table

hive.createTable("web_sales").ifNotExists().column("sold_time_sk", "bigint").column("ws_ship_date_sk", "bigint").create()

See the CreateTableBuilder interface section below for additional table creation options. You can also create Hive tables using hive.executeUpdate.
• Drop a database
hive.dropDatabase(<databaseName>, <ifExists>, <cascade>)
• Drop a table
hive.dropTable(<tableName>, <ifExists>, <purge>)
A combined example of these operations follows the related information links below.
Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables
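For example, a combined sketch of these catalog operations, assuming an existing HiveWarehouseSession named hive and hypothetical database and table names:

hive.createDatabase("reporting_db", true)        // ifNotExists = true
hive.setDatabase("reporting_db")
hive.createTable("sales_summary").ifNotExists()
  .column("sale_id", "bigint")
  .column("amount", "double")
  .create()
hive.showTables().show(100)
hive.describeTable("sales_summary").show(100)
hive.dropTable("sales_summary", true, false)     // ifExists = true, purge = false
hive.dropDatabase("reporting_db", true, true)    // ifExists = true, cascade = true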

Read and write operations
Brief descriptions of HWC API operations and examples cover how to read and write Apache Hive tables from Apache Spark. You learn how to execute update statements and write DataFrames to partitioned Hive tables, perform batch writes, and use HiveStreaming.

Read operations
Execute a Hive SELECT query and return a DataFrame.
hive.sql("select * from web_sales")
HWC supports push-downs of DataFrame filters and projections applied to .sql(). Alternatively, you can use .execute or .executeQuery as previously described.

Execute a Hive update statement
Execute CREATE, UPDATE, DELETE, INSERT, and MERGE statements in this way:
hive.executeUpdate("ALTER TABLE old_name RENAME TO new_name")

Write a DataFrame to Hive in batch
This operation uses LOAD DATA INTO TABLE.
Java/Scala:

df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", ).s ave()

Python:

df.write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("tab le", &tableName>).save()

Write a DataFrame to Hive, specifying partitions
HWC follows Hive semantics for overwriting data with and without partitions and is not affected by the setting of spark.sql.sources.partitionOverwriteMode to static or dynamic. This behavior mimics the latest Spark Community trend reflected in SPARK-20236 (link below).

Java/Scala:

df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).option("partition", <partition_spec>).save()

Python:

df.write.format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option("table", <tableName>).option("partition", <partition_spec>).save()

Where <partition_spec> is in one of the following forms:
• option("partition", "c1='val1',c2=val2") // static
• option("partition", "c1='val1',c2") // static followed by dynamic
• option("partition", "c1,c2") // dynamic
Depending on the partition spec, HWC generates queries in one of the following forms for writing data to Hive:
• No partitions specified = LOAD DATA
• Only static partitions specified = LOAD DATA...PARTITION
• Some dynamic partition present = CREATE TEMP TABLE + INSERT INTO/OVERWRITE query.
Note: Writing static partitions is faster than writing dynamic partitions.
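For example, a minimal sketch of a fully dynamic partition write, assuming the imports shown in the API operations section, a managed table t1 partitioned by columns c1 and c2, and a DataFrame df that contains both partition columns:

// Both partition values come from the c1 and c2 columns of df, so HWC
// creates a temporary external table and runs an INSERT INTO ... PARTITION query
df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .mode("append")
  .option("table", "t1")
  .option("partition", "c1,c2")
  .save()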

Write a DataFrame to Hive using HiveStreaming
When using HiveStreaming to write a DataFrame to Hive or a Spark Stream to Hive, you need to escape any commas in the stream, as shown in Use the Hive Warehouse Connector for Streaming (link below).
Java/Scala:

//Using dynamic partitioning
df.write.format(DATAFRAME_TO_STREAM).option("table", <tableName>).save()

//Or, writing to a static partition
df.write.format(DATAFRAME_TO_STREAM).option("table", <tableName>).option("partition", <partition>).save()

Python:

//Using dynamic partitioning
df.write.format(HiveWarehouseSession().DATAFRAME_TO_STREAM).option("table", <tableName>).save()

//Or, writing to a static partition
df.write.format(HiveWarehouseSession().DATAFRAME_TO_STREAM).option("table", <tableName>).option("partition", <partition>).save()

Write a Spark Stream to Hive using HiveStreaming
Java/Scala:

stream.writeStream.format(STREAM_TO_STREAM).option("table", "web_sales").start()

Python:

stream.writeStream.format(HiveWarehouseSession().STREAM_TO_STREAM).option("table", "web_sales").start()

Related Information
HMS storage
SPARK-20236
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Commit transaction in Spark Direct Reader mode
In Spark Direct Reader mode, you need to know how to commit or abort transactions.

About this task
A SQL listener normally handles this task automatically when a DataFrame operation or Spark SQL query finishes. In some cases, when .explain(), .rdd(), or .cache() are invoked on a DataFrame, the transaction is not automatically closed. In Spark Direct Reader mode, commit or abort a transaction as follows:

scala> com.qubole.spark.hiveacid.transaction.HiveAcidTxnManagerObject.commitTxn(spark)

Or, if you are using the Hive Warehouse Connector with Direct Reader mode enabled, you can invoke the following API to commit the transaction:

scala> hive.commitTxn

Close HiveWarehouseSession operations
You need to know how to release locks that Apache Spark operations put on Apache Hive resources. An example shows how and when to release these locks.

About this task
Spark can invoke operations, such as cache(), persist(), and rdd(), on a DataFrame you obtain from running a HiveWarehouseSession .table() or .sql() (or alternatively, .execute() or .executeQuery()). The Spark operations can lock Hive resources. You can release any locks and resources by calling the HiveWarehouseSession close(). Calling close() invalidates the HiveWarehouseSession instance and you cannot perform any further operations on the instance.

Procedure
Call close() when you finish running all other operations on the instance of HiveWarehouseSession.

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("tpcds_bin_partitioned_orc_1000")
val df = hive.sql("select * from web_sales")
. . . //Any other operations
hive.close()

You can also call close() at the end of an iteration if the application is designed to run in a microbatch, or iterative, manner that does not need to share previous states. No more operations can occur on the DataFrame obtained by table() or sql() (or alternatively, .execute() or .executeQuery()).
Related Information
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables


Use HWC for streaming
When using HiveStreaming to write a DataFrame to Apache Hive or an Apache Spark Stream to Hive, you need to know how to escape any commas in the stream because the Hive Warehouse Connector uses the commas as the field delimiter.

Procedure
Change the value of the default delimiter property escape.delim to a backslash that the Hive Warehouse Connector uses to write streams to mytable.

ALTER TABLE mytable SET TBLPROPERTIES ('escape.delim' = '\\');

Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

HWC API Examples
Examples of using the HWC API include how to create the DataFrame from any data source and include an option to write the DataFrame to an Apache Hive table.

Write a DataFrame from Spark to Hive example
You specify one of the following Spark SaveMode modes to write a DataFrame to Hive:
• Append
• ErrorIfExists
• Ignore
• Overwrite
In Overwrite mode, HWC does not explicitly drop and recreate the table. HWC queries Hive to overwrite an existing table using LOAD DATA...OVERWRITE or INSERT OVERWRITE... When you write the DataFrame, the Hive Warehouse Connector creates the Hive table if it does not exist. The following example uses Append mode; an Overwrite sketch follows the example.

df = //Create DataFrame from any source

val hive = com.hortonworks.hwc.HiveWarehouseSession.session(spark).build()

df.write.format(HIVE_WAREHOUSE_CONNECTOR) .mode("append") .option("table", "my_Table") .save()
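A corresponding sketch for Overwrite mode, under the same assumptions; per the limitation described earlier, the query that produced df must not read from my_Table itself:

df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  .mode("overwrite")
  .option("table", "my_Table")
  .save()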

ETL example (Scala)
Read table data from Hive, transform it in Spark, and write to a new Hive table.

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("tpcds_bin_partitioned_orc_1000")
val df = hive.sql("select * from web_sales")
df.createOrReplaceTempView("web_sales")
hive.setDatabase("testDatabase")
hive.createTable("newTable")
  .ifNotExists()
  .column("ws_sold_time_sk", "bigint")
  .column("ws_ship_date_sk", "bigint")


.create() sql("SELECT ws_sold_time_sk, ws_ship_date_sk FROM web_sales WHERE ws_sold_ time_sk > 80000) .write.format(HIVE_WAREHOUSE_CONNECTOR) .mode("append") .option("table", "newTable") .save()

Related Information HMS storage Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Hive Warehouse Connector Interfaces The HiveWarehouseSession, CreateTableBuilder, and MergeBuilder interfaces present available HWC operations.

HiveWarehouseSession interface

package com.hortonworks.hwc;

public interface HiveWarehouseSession {

//Execute Hive SELECT query and return DataFrame (recommended)
Dataset<Row> sql(String sql);

//Execute Hive SELECT query and return DataFrame in JDBC client mode
//Execute Hive catalog-browsing operation and return DataFrame
Dataset<Row> execute(String sql);

//Execute Hive SELECT query and return DataFrame in LLAP mode (not available in this release)
Dataset<Row> executeQuery(String sql);

//Execute Hive update statement
boolean executeUpdate(String sql);

//Reference a Hive table as a DataFrame
Dataset<Row> table(String sql);

//Return the SparkSession attached to this HiveWarehouseSession
SparkSession session();

//Set the current database for unqualified Hive table references
void setDatabase(String name);

/**
 * Helpers: wrapper functions over execute or executeUpdate
 */

//Helper for show databases
Dataset<Row> showDatabases();

//Helper for show tables
Dataset<Row> showTables();

//Helper for describeTable
Dataset<Row> describeTable(String table);

//Helper for create database
void createDatabase(String database, boolean ifNotExists);

//Helper for create table stored as ORC
CreateTableBuilder createTable(String tableName);

//Helper for drop database
void dropDatabase(String database, boolean ifExists, boolean cascade);

//Helper for drop table
void dropTable(String table, boolean ifExists, boolean purge);

//Helper for merge query
MergeBuilder mergeBuilder();

//Closes the HWC session. Session cannot be reused after being closed.
void close();

// Closes the transaction started by the direct reader. The transaction is not committed if the user
// uses rdd APIs.
void commitTxn();
}

CreateTableBuilder interface

package com.hortonworks.hwc;

public interface CreateTableBuilder {

//Silently skip table creation if table name exists
CreateTableBuilder ifNotExists();

//Add a column with the specific name and Hive type
//Use more than once to add multiple columns
CreateTableBuilder column(String name, String type);

//Specify a column as table partition
//Use more than once to specify multiple partitions
CreateTableBuilder partition(String name, String type);

//Add a table property
//Use more than once to add multiple properties
CreateTableBuilder prop(String key, String value);

//Make table bucketed, with given number of buckets and bucket columns
CreateTableBuilder clusterBy(long numBuckets, String ... columns);

//Creates ORC table in Hive from builder instance
void create();
}
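The following sketch shows how these builder methods might be combined from Scala, assuming an existing HiveWarehouseSession named hive; the table, column, and property names are hypothetical:

hive.createTable("web_sales_by_date")
  .ifNotExists()
  .column("ws_order_number", "bigint")
  .column("ws_quantity", "int")
  .partition("ws_sold_date_sk", "bigint")       // partition column
  .prop("created.by", "hwc-example")            // arbitrary table property
  .clusterBy(4, "ws_order_number")              // 4 buckets on ws_order_number
  .create()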

MergeBuilder interface

package com.hortonworks.hwc;

public interface MergeBuilder {

//Specify the target table to merge
MergeBuilder mergeInto(String targetTable, String alias);

//Specify the source table or expression, such as (select * from some_table)
//Enclose expression in braces if specified.
MergeBuilder using(String sourceTableOrExpr, String alias);

//Specify the condition expression for merging
MergeBuilder on(String expr);

//Specify fields to update for rows affected by merge condition and matchExpr
MergeBuilder whenMatchedThenUpdate(String matchExpr, String... nameValuePairs);

//Delete rows affected by the merge condition and matchExpr
MergeBuilder whenMatchedThenDelete(String matchExpr);

//Insert rows into target table affected by merge condition and matchExpr
MergeBuilder whenNotMatchedInsert(String matchExpr, String... values);

//Execute the merge operation
void merge();
}
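The following sketch shows one way the merge builder might be used from Scala, assuming an existing HiveWarehouseSession named hive and hypothetical target_acid and updates_stage tables; the string forms of the name-value pairs and insert values are assumptions based only on the signatures above:

hive.mergeBuilder()
  .mergeInto("target_acid", "t")
  .using("(select * from updates_stage)", "u")   // source can be a table name or an enclosed expression
  .on("t.id = u.id")
  .whenMatchedThenUpdate("u.op = 'U'", "name = u.name", "state = u.state")
  .whenMatchedThenDelete("u.op = 'D'")
  .whenNotMatchedInsert("u.op = 'I'", "u.id", "u.name", "u.state")
  .merge()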

Related Information
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Submit a Scala or Java application
A step-by-step procedure shows you how to submit an app based on the HiveWarehouseConnector library to run on Apache Spark Shell.

Procedure
1. Choose an execution mode, for example the HWC JDBC execution mode, for your application and check that you meet the configuration requirements, described earlier.
2. Configure a Spark-HiveServer connection, described earlier, or include the appropriate --conf in your app submission in step 4.
3. Locate the hive-warehouse-connector-assembly jar in the /hive_warehouse_connector/ directory. For example, find hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar in the following location:

/opt/cloudera/parcels/CDH/jars

4. Add the connector jar and configurations to the app submission using the --jars option. Example syntax:

spark-shell --jars <path>/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf <configuration properties>

5. Add the path to the app you wrote based on the HiveWarehouseConnector API. For example:

spark-shell --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar \
--conf spark.sql.hive.hwc.execution.mode=spark \
--conf spark.datasource.hive.warehouse.read.via.llap=false \
--conf spark.datasource.hive.warehouse.load.staging.dir=<staging directory> \
/home/myapps/myapp.jar

PySpark and spark-submit are also supported.
Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables

Submit a Python app
A step-by-step procedure shows you how to submit a Python app based on the HiveWarehouseConnector library by submitting an application, and then adding a Python package.

Procedure
1. Choose an execution mode, for example the HWC JDBC execution mode, for your application and check that you meet the configuration requirements, described earlier.
2. Configure a Spark-HiveServer connection, described earlier, or include the appropriate --conf in your app submission in step 4.
3. Locate the hive-warehouse-connector-assembly jar in the /hive_warehouse_connector/ directory. For example, find hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar in the following location:

/opt/cloudera/parcels/CDH/jars

4. Add the connector jar and configurations to the app submission using the --jars option. Example syntax:

pyspark --jars <path>/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar \
--conf <configuration properties>

5. Locate the pyspark_hwc zip package in the /hive_warehouse_connector/ directory.
6. Add the Python package for the connector to the app submission. Example syntax:

--py-files <path>/hive_warehouse_connector/pyspark_hwc-<version>.zip

Example submission in JDBC execution mode:

pyspark --jars /opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.2.1.0-327.jar \
--conf spark.sql.hive.hwc.execution.mode=spark \
--conf spark.datasource.hive.warehouse.read.via.llap=false \
--conf spark.datasource.hive.warehouse.load.staging.dir=<staging directory> \
--py-files /opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/pyspark_hwc-1.0.0.7.2.1.0-327.zip

Related Information
Configuring Spark Direct Reader mode
Configuring JDBC execution mode
HMS storage
Blog: Enabling high-speed Spark direct reader for Apache Hive ACID tables


Apache Hive-Kafka integration

As an Apache Hive user, you can connect to, analyze, and transform data in Apache Kafka from Hive. You can offload data from Kafka to the Hive warehouse. Using Hive-Kafka integration, you can perform actions on real-time data and incorporate streamed data into your application.
You connect to Kafka data from Hive by creating an external table that maps to a Kafka topic. The table definition includes a reference to a Kafka storage handler that connects to Kafka. On the external table, Hive-Kafka integration supports ad hoc queries, such as questions about data changes in the stream over a period of time. You can transform Kafka data in the following ways:
• Perform data masking
• Join dimension tables or any stream
• Aggregate data
• Change the SerDe encoding of the original stream
• Create a persistent stream in a Kafka topic
You can achieve data offloading by controlling its position in the stream. The Hive-Kafka connector supports the following serialization and deserialization formats:
• JsonSerDe (default)
• OpenCSVSerde
• AvroSerDe
Related Information
Apache Kafka Documentation

Create a table for a Kafka stream
You can create an external table in Apache Hive that represents an Apache Kafka stream to query real-time data in Kafka. You use a storage handler and table properties that map the Hive database to a Kafka topic and broker. If the Kafka data is not in JSON format, you alter the table to specify a serializer-deserializer for another format.

Procedure
1. Get the name of the Kafka topic you want to query to use as a table property. For example: "kafka.topic" = "wiki-hive-topic"
2. Construct the Kafka broker connection string. For example: "kafka.bootstrap.servers"="kafka.hostname.com:9092"
3. Create an external table named kafka_table by using 'org.apache.hadoop.hive.kafka.KafkaStorageHandler', as shown in the following example:

CREATE EXTERNAL TABLE kafka_table
  (`timestamp` timestamp, `page` string, `newPage` boolean, added int, deleted bigint, delta double)
  STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
  TBLPROPERTIES
  ("kafka.topic" = "test-topic", "kafka.bootstrap.servers"="node2:9092");


4. If the default JSON serializer-deserializer is incompatible with your data, choose another format in one of the following ways:
• Alter the table to use another supported serializer-deserializer. For example, if your data is in Avro format, use the Kafka serializer-deserializer for Avro:

ALTER TABLE kafka_table SET TBLPROPERTIES ("kafka.serde.class"="org.apache.hadoop.hive.serde2.avro.AvroSerDe");

• Create an external table that specifies the table in another format. For example, create a table named kafka_t_avro that specifies the Avro format in the table definition:

CREATE EXTERNAL TABLE kafka_t_avro
  (`timestamp` timestamp, `page` string, `newPage` boolean, added int, deleted bigint, delta double)
  STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
  TBLPROPERTIES
  ("kafka.topic" = "test-topic",
  "kafka.bootstrap.servers"="node2:9092",
  -- STORE AS AVRO IN KAFKA
  "kafka.serde.class"="org.apache.hadoop.hive.serde2.avro.AvroSerDe");

Related Information
Apache Kafka Documentation

Querying Kafka data

You can get useful information, including Kafka record metadata, from a table of Kafka data by using typical Hive queries.

Each Kafka record consists of a user payload key (byte[]) and value (byte[]), plus the following metadata fields:
• Partition int32
• Offset int64
• Timestamp int64

The Hive row represents the dual composition of Kafka data:
• The user payload serialized in the value byte array
• The metadata: key byte array, partition, offset, and timestamp fields
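For example, the following query (a minimal sketch that assumes the kafka_table created in the previous topic) returns the metadata fields, described in more detail below, alongside a payload column:

SELECT CAST(`__key` AS string) AS record_key,  -- key byte array cast for readability
  `__partition`, `__offset`, `__timestamp`,    -- Kafka metadata columns
  `page`, added                                -- payload columns from the value byte array
FROM kafka_table
LIMIT 5;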

In the Hive representation of the Kafka record, the key byte array is called __key and is of type binary. You can cast __key at query time. Hive appends __key after the last column derived from the value byte array, followed by the partition, offset, and timestamp metadata columns, which are named __partition, __offset, and __timestamp.

Related Information
Apache Kafka Documentation

Query live data from Kafka

You can get useful information from a table of Kafka data by running typical queries, such as counting the number of records streamed within an interval of time or defining a view of streamed data over a period of time.

Before you begin

This task requires Kafka 0.11 or later to support time-based lookups and prevent full stream scans.


About this task

This task assumes you created a table named kafka_table for a Kafka stream.

Procedure
1. List the table properties and all the partition or offset information for the topic.

   DESCRIBE EXTENDED kafka_table;

2. Count the number of Kafka records that have timestamps within the past 10 minutes.

SELECT COUNT(*) FROM kafka_table
  WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '10' MINUTES);

Such a time-based seek requires Kafka 0.11 or later, whose brokers support time-based lookups; otherwise, this query leads to a full stream scan.

3. Define a view of data consumed within the past 15 minutes and mask specific columns.

CREATE VIEW last_15_minutes_of_kafka_table AS
  SELECT `timestamp`, `user`, delta, added
  FROM kafka_table
  WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '15' MINUTES);

4. Create a dimension table.

CREATE TABLE user_table (`user` string, `first_name` string, age int,
  gender string, comments string) STORED AS ORC;

5. Join the view of the stream over the past 15 minutes to user_table, group by gender, and compute aggregates over metrics from the fact table and dimension table.

SELECT SUM(added) AS added, SUM(deleted) AS deleted, AVG(delta) AS delta,
  AVG(age) AS avg_age, gender
  FROM last_15_minutes_of_kafka_table
  JOIN user_table ON `last_15_minutes_of_kafka_table`.`user` = `user_table`.`user`
  GROUP BY gender
  LIMIT 10;

6. Perform a classical user retention analysis over the Kafka stream, consisting of a stream-to-stream join that runs ad hoc queries on a view defined over the past 15 minutes.

-- Stream join over the view itself
-- Assuming l15min_wiki is a view of the last 15 minutes
SELECT COUNT(DISTINCT activity.`user`) AS active_users,
  COUNT(DISTINCT future_activity.`user`) AS retained_users
FROM l15min_wiki AS activity
LEFT JOIN l15min_wiki AS future_activity
  ON activity.`user` = future_activity.`user`
  AND activity.`timestamp` = future_activity.`timestamp` - interval '5' minutes;

-- Stream-to-stream join
-- Assuming wiki_kafka_hive is the entire stream.
SELECT floor_hour(activity.`timestamp`),
  COUNT(DISTINCT activity.`user`) AS active_users,
  COUNT(DISTINCT future_activity.`user`) AS retained_users
FROM wiki_kafka_hive AS activity
LEFT JOIN wiki_kafka_hive AS future_activity
  ON activity.`user` = future_activity.`user`
  AND activity.`timestamp` = future_activity.`timestamp` - interval '1' hour


GROUP BY floor_hour(activity.`timestamp`);

Related Information Apache Kafka Documentation

Perform ETL by ingesting data from Kafka into Hive

You can extract, transform, and load a Kafka record into Hive in a single transaction.

Procedure
1. Create a table to represent source Kafka record offsets.

CREATE TABLE kafka_table_offsets(partition_id int, max_offset bigint,
  insert_time timestamp);

2. Initialize the table.

INSERT OVERWRITE TABLE kafka_table_offsets
  SELECT `__partition`, min(`__offset`) - 1, CURRENT_TIMESTAMP
  FROM wiki_kafka_hive
  GROUP BY `__partition`, CURRENT_TIMESTAMP;

3. Create the destination table.

CREATE TABLE orc_kafka_table (partition_id int, koffset bigint,
  ktimestamp bigint, `timestamp` timestamp, `page` string, `user` string,
  `diffurl` string, `isrobot` boolean, added int, deleted int, delta bigint)
  STORED AS ORC;

4. Insert Kafka data into the ORC table.

FROM wiki_kafka_hive ktable
JOIN kafka_table_offsets offset_table
  ON (ktable.`__partition` = offset_table.partition_id
  AND ktable.`__offset` > offset_table.max_offset)
INSERT INTO TABLE orc_kafka_table
  SELECT `__partition`, `__offset`, `__timestamp`,
    `timestamp`, `page`, `user`, `diffurl`, `isrobot`, added, deleted, delta
INSERT OVERWRITE TABLE kafka_table_offsets
  SELECT `__partition`, max(`__offset`), CURRENT_TIMESTAMP
  GROUP BY `__partition`, CURRENT_TIMESTAMP;

5. Check the insertion.

SELECT MAX(`koffset`) FROM orc_kafka_table LIMIT 10;

SELECT COUNT(*) AS c FROM orc_kafka_table GROUP BY partition_id, koffset HAVING c > 1;

6. Repeat step 4 periodically until all the data is loaded into Hive.

Writing data to Kafka

You can extract, transform, and load a Hive table to a Kafka topic for real-time streaming of a large volume of Hive data. You need some understanding of write semantics and the metadata columns required for writing data to Kafka.


Write semantics

The Hive-Kafka connector supports the following write semantics:
• At least once (default)
• Exactly once

At least once (default)
The default semantic. At least once is the most common write semantic used by streaming engines. The internal Kafka producer retries on errors. If a message is not delivered, the exception is raised to the task level, which causes a restart, and more retries. The at-least-once semantic leads to one of the following conclusions:
• If the job succeeds, each record is guaranteed to be delivered at least once.
• If the job fails, some of the records might be lost and some might not be sent. In this case, you can retry the query, which eventually leads to the delivery of each record at least once.

Exactly once
Following the exactly-once semantic, the Hive job ensures that either every record is delivered exactly once, or nothing is delivered. Only Kafka brokers that support the Transaction API (0.11.0.x or later) can be used. To use this semantic, you must set the table property "kafka.write.semantic"="EXACTLY_ONCE".
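For example, assuming a Kafka-backed table such as the kafka_table created earlier in this chapter (and brokers that support the Transaction API), a statement like the following switches writes on that table to the exactly-once semantic:

-- Illustrative sketch: requires Kafka brokers 0.11.0.x or later
ALTER TABLE kafka_table SET TBLPROPERTIES ("kafka.write.semantic"="EXACTLY_ONCE");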

Metadata columns

In addition to the user row payload, the insert statement must include values for the following extra columns:

__key
Although you can set the value of this metadata column to null, using a meaningful key value to avoid unbalanced partitions is recommended. Any binary value is valid.

__partition
Use null unless you want to route the record to a particular partition. Using a nonexistent partition value results in an error.

__offset
You cannot set this value, which is fixed at -1.

__timestamp
You can set this value to a meaningful timestamp, represented as the number of milliseconds since epoch. Optionally, you can set this value to null or -1, which means that the Kafka broker strategy sets the timestamp column.

Related Information
Apache Kafka Documentation

Write transformed Hive data to Kafka

You can change streaming data and include the changes in a stream. You extract a Kafka input topic, transform the record in Hive, and load a Hive table back into a Kafka record.

About this task

This task assumes that you already queried live data from Kafka. When you transform the record in the Hive execution engine, you compute a moving average over a window of one minute. You write the resulting records back to another Kafka topic named moving_avg_wiki_kafka_hive.


Procedure
1. Create an external table to represent the Hive data that you want to load into Kafka.

CREATE EXTERNAL TABLE moving_avg_wiki_kafka_hive
  (`channel` string, `namespace` string, `page` string,
  `timestamp` timestamp, avg_delta double)
  STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
  TBLPROPERTIES
  ("kafka.topic" = "moving_avg_wiki_kafka_hive",
  "kafka.bootstrap.servers"="kafka.hostname.com:9092",
  -- STORE AS AVRO IN KAFKA
  "kafka.serde.class"="org.apache.hadoop.hive.serde2.avro.AvroSerDe");

2. Insert data that you select from the Kafka topic back into the Kafka record.

INSERT INTO TABLE moving_avg_wiki_kafka_hive
  SELECT `channel`, `namespace`, `page`, `timestamp`,
    AVG(delta) OVER (ORDER BY `timestamp` ASC ROWS BETWEEN 60 PRECEDING AND CURRENT ROW) AS avg_delta,
    null AS `__key`, null AS `__partition`, -1 AS `__offset`,
    to_epoch_milli(CURRENT_TIMESTAMP) AS `__timestamp`
  FROM l15min_wiki;

The timestamps of the selected data are converted to milliseconds since epoch for clarity.

Related Information
Query live data from Kafka

Set consumer and producer properties as table properties

You can use Kafka consumer and producer properties in the TBLPROPERTIES clause of a Hive query. By prefixing a Kafka property key with kafka.consumer or kafka.producer, you set that property as a table property.

Procedure
For example, to have the Kafka consumer return at most 5000 records per poll, use the following syntax.

ALTER TABLE kafka_table SET TBLPROPERTIES ("kafka.consumer.max.poll.records"="5000");
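Similarly, you can prefix a Kafka producer property with kafka.producer. The following statement is an illustrative sketch (choose values that suit your workload) that requires acknowledgment from all in-sync replicas before a write is considered successful:

-- Sets the standard Kafka producer property acks through the kafka.producer prefix
ALTER TABLE kafka_table SET TBLPROPERTIES ("kafka.producer.acks"="all");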

Kafka storage handler and table properties

You use the Kafka storage handler and table properties to specify the query connection and configuration.

Kafka storage handler

You specify 'org.apache.hadoop.hive.kafka.KafkaStorageHandler' in queries to connect to, and transform a Kafka topic into, a Hive table. In the definition of an external table, the storage handler creates a view over a single Kafka topic. For example, to use the storage handler to connect to a topic, the following table definition specifies the storage handler and required table properties: the topic name and broker connection string.

CREATE EXTERNAL TABLE kafka_table
  (`timestamp` timestamp, `page` string, `newPage` boolean,
  added int, deleted bigint, delta double)
  STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
  TBLPROPERTIES
  ("kafka.topic" = "test-topic", "kafka.bootstrap.servers"="localhost:9092");

You set the following table properties with the Kafka storage handler:

kafka.topic
The Kafka topic to connect to

kafka.bootstrap.servers
The broker connection string

Storage handler-based optimizations

The storage handler can optimize reads using a filter push-down when you run a query such as the following time-based lookup supported on Kafka 0.11 or later:

SELECT COUNT(*) FROM kafka_table
  WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '10' MINUTES);

The Kafka consumer supports seeking on the stream based on an offset, which the storage handler leverages to push down filters over metadata columns. The storage handler in the example above performs seeks based on the Kafka record __timestamp to read only recently arrived data.

The following logical operators and predicate operators are supported in the WHERE clause:
• Logical operators: OR, AND
• Predicate operators: <, <=, >=, >, =

The storage handler reader optimizes seeks by performing partition pruning to go directly to a particular partition offset used in the WHERE clause:

SELECT COUNT(*) FROM kafka_table
  WHERE (`__offset` < 10 AND `__offset` > 3 AND `__partition` = 0)
  OR (`__partition` = 0 AND `__offset` < 105 AND `__offset` > 99)
  OR (`__offset` = 109);

The storage handler scans partition 0 only, and then reads only records between offsets 4 and 109.

Kafka metadata

In addition to the user-defined payload schema, the Kafka storage handler appends the following columns to the table, which you can use to query the Kafka metadata fields:

__key
Kafka record key (byte array)

__partition
Kafka record partition identifier (int 32)

__offset
Kafka record offset (int 64)

__timestamp
Kafka record timestamp (int 64)

The partition identifier, record offset, and record timestamp plus a key-value pair constitute a Kafka record. Because the key-value pair consists of two byte arrays, you must use SerDe classes to transform the value byte array into a set of columns.

Table Properties

You use certain properties in the TBLPROPERTIES clause of a Hive query that specifies the Kafka storage handler.


Property: kafka.topic
Description: Kafka topic name to map the table to
Required: Yes
Default: null

Property: kafka.bootstrap.servers
Description: Kafka broker connection string
Required: Yes
Default: null

Property: kafka.serde.class
Description: Serializer and Deserializer class implementation
Required: No
Default: org.apache.hadoop.hive.serde2.JsonSerDe

Property: hive.kafka.poll.timeout.ms
Description: Kafka consumer poll timeout period, in milliseconds. (This is independent of internal Kafka consumer timeouts.)
Required: No
Default: 5000 (5 seconds)

Property: hive.kafka.max.retries
Description: Number of retries for Kafka metadata fetch operations
Required: No
Default: 6

Property: hive.kafka.metadata.poll.timeout.ms
Description: Number of milliseconds before consumer timeout on fetching Kafka metadata
Required: No
Default: 30000 (30 seconds)

Property: kafka.write.semantic
Description: Writer semantic, with allowed values of NONE, AT_LEAST_ONCE, EXACTLY_ONCE
Required: No
Default: AT_LEAST_ONCE
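The following statement is a minimal sketch showing how several of these properties fit together in one table definition. The table name kafka_table_tuned, the topic, and the broker address are hypothetical, and the property values are illustrative rather than recommended:

CREATE EXTERNAL TABLE kafka_table_tuned
  (`timestamp` timestamp, `page` string, delta double)
  STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
  TBLPROPERTIES
  ("kafka.topic" = "test-topic",
  "kafka.bootstrap.servers" = "localhost:9092",
  "kafka.serde.class" = "org.apache.hadoop.hive.serde2.JsonSerDe",  -- explicit default SerDe
  "hive.kafka.poll.timeout.ms" = "10000",                           -- raise the 5000 ms poll timeout
  "kafka.write.semantic" = "EXACTLY_ONCE");                         -- requires Kafka 0.11.0.x or later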

Connecting Hive to BI tools using a JDBC/ODBC driver

To query, analyze, and visualize data stored within CDP Private Cloud Base using drivers provided by Cloudera, you connect Apache Hive to Business Intelligence (BI) tools.

About this task

How you connect to Hive depends on a number of factors: the location of Hive inside or outside the cluster, the HiveServer deployment, the type of transport, transport-layer security, and authentication. HiveServer is the server interface that enables remote clients to execute queries against Hive and retrieve the results using a JDBC or ODBC connection.

Before you begin
• Choose a Hive authorization model.
• Configure authenticated users for querying Hive through the JDBC or ODBC driver. For example, set up a Ranger policy.

Getting the JDBC driver

You learn how to download the Cloudera Hive and Impala JDBC drivers to give clients outside the cluster access to your SQL engines.

Procedure
1. Download the latest Hive JDBC driver for CDP from the Hive JDBC driver download page.
2. Go to the Impala JDBC driver page, and download the latest Impala JDBC driver.
3. Follow the JDBC driver installation instructions on the download page.


Integrating Hive and a BI tool

After you download a Cloudera ODBC or JDBC driver, you provide your client with the information that the BI tool requires to connect to Hive in CDP.

Before you begin

You have downloaded the JDBC or ODBC driver, or otherwise placed it, on your HiveServer cluster.

Procedure
1. Depending on the type of driver you obtain, proceed as follows:
• ODBC driver: follow the instructions on the ODBC driver download site, and skip the rest of the steps in this procedure.
• JDBC driver: add the driver to the classpath of your JDBC client, such as Tableau. For example, check the client documentation about where to put the driver.
2. In Cloudera Manager (CM), click Clusters > Hive, click Actions, and select Download Client Configuration.

3. Unpack hive_on_tez-clientconfig.zip, open beeline-site.xml, and copy the value of beeline.hs2.jdbc.url.hive_on_tez. This value is the JDBC URL. For example

jdbc:hive2://my_hiveserver.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2

4. In the BI tool, such as Tableau, configure the JDBC connection using the JDBC URL and driver class name, org.apache.hive.jdbc.HiveDriver.

Specify the JDBC connection string

You construct a JDBC URL to connect Hive to a BI tool.

About this task

In CDP Private Cloud Base, if HiveServer runs within the Hive client (embedded mode), not as a separate process, the URL in the connection string does not need a host or port number to make the JDBC connection. If HiveServer does not run within your Hive client, the URL must include a host and port number because HiveServer runs as a separate process on the host and port you specify. The JDBC client and HiveServer interact using remote procedure calls using the Thrift protocol. If HiveServer is configured in remote mode, the JDBC client and HiveServer can use either HTTP or TCP-based transport to exchange RPC messages.

Procedure
1. Create a minimal JDBC connection string for connecting Hive to a BI tool.
• Embedded mode: Create the JDBC connection string for connecting to Hive in embedded mode.
• Remote mode: Create a JDBC connection string for making an unauthenticated connection to the Hive default database on the localhost port 10000.

   Embedded mode: jdbc:hive2://
   Remote mode: jdbc:hive2://myserver:10000/default

2. Modify the connection string to change the transport mode from TCP (the default) to HTTP using the transportMode and httpPath session configuration variables.

   jdbc:hive2://myserver:10000/default;transportMode=http;httpPath=myendpoint.com;

   You need to specify httpPath when using the HTTP transport mode; the path must correspond to an HTTP endpoint configured in hive-site.xml.

3. Add parameters to the connection string for Kerberos authentication.

   jdbc:hive2://myserver:10000/default;principal=hive/[email protected]

JDBC connection string syntax

The JDBC connection string for connecting to a remote Hive client requires a host, port, and Hive database name. You can optionally specify a transport type and authentication.

jdbc:hive2://<host>:<port>/<dbName>;<sessionConfs>?<hiveConfs>#<hiveVars>

Connection string parameters

The following table describes the parameters for specifying the JDBC connection.

JDBC Parameter: host
Description: The cluster node hosting HiveServer.
Required: yes

JDBC Parameter: port
Description: The port number to which HiveServer listens.
Required: yes

JDBC Parameter: dbName
Description: The name of the Hive database to run the query against.
Required: yes

JDBC Parameter: sessionConfs
Description: Optional configuration parameters for the JDBC/ODBC driver in the following format: <key1>=<value1>;<key2>=<value2>...;
Required: no

JDBC Parameter: hiveConfs
Description: Optional configuration parameters for Hive on the server in the following format: <key1>=<value1>;<key2>=<value2>; ... The configurations last for the duration of the user session.
Required: no

JDBC Parameter: hiveVars
Description: Optional configuration parameters for Hive variables in the following format: <key1>=<value1>;<key2>=<value2>; ... The configurations last for the duration of the user session.
Required: no


TCP and HTTP Transport

The following table shows variables for use in the connection string when you configure HiveServer. The JDBC client and HiveServer can use either HTTP or TCP-based transport to exchange RPC messages. Because the default transport is TCP, there is no need to specify transportMode=binary if TCP transport is desired.

transportMode Variable Value: http
Description: Connect to HiveServer2 using HTTP transport.

transportMode Variable Value: binary
Description: Connect to HiveServer2 using TCP transport.

The syntax for using these parameters is:

jdbc:hive2://<host>:<port>/<dbName>;transportMode=http;httpPath=<http_endpoint>; \
  <otherSessionConfs>?<hiveConfs>#<hiveVars>

User Authentication

If configured in remote mode, HiveServer supports Kerberos, LDAP, Pluggable Authentication Modules (PAM), and custom plugins for authenticating the JDBC user connecting to HiveServer. The format of the JDBC connection URL for authentication with Kerberos differs from the format for other authentication models. The following table shows the variables for Kerberos authentication.

User Authentication Variable: principal
Description: A string that uniquely identifies a Kerberos user.

User Authentication Variable: saslQop
Description: Quality of protection for the SASL framework. The level of quality is negotiated between the client and server during authentication. Used by Kerberos authentication with TCP transport.

User Authentication Variable: user
Description: Username for non-Kerberos authentication model.

User Authentication Variable: password
Description: Password for non-Kerberos authentication model.

The syntax for using these parameters is:

jdbc:hive2://<host>:<port>/<dbName>;principal=<HiveServer2_kerberos_principal>;<otherSessionConfs>?<hiveConfs>#<hiveVars>

Transport Layer Security

HiveServer2 supports SSL and Sasl QOP for transport-layer security. The format of the JDBC connection string for SSL uses these variables:

SSL Variable: ssl
Description: Specifies whether to use SSL.

SSL Variable: sslTrustStore
Description: The path to the SSL TrustStore.

SSL Variable: trustStorePassword
Description: The password to the SSL TrustStore.

The syntax for using the authentication parameters is:

jdbc:hive2://<host>:<port>/<dbName>; \
  ssl=true;sslTrustStore=<ssl_truststore_path>;trustStorePassword=<truststore_password>; \
  <otherSessionConfs>?<hiveConfs>#<hiveVars>


When using TCP for transport and Kerberos for security, HiveServer2 uses Sasl QOP for encryption rather than SSL.

Sasl QOP Variable: principal
Description: A string that uniquely identifies a Kerberos user.

Sasl QOP Variable: saslQop
Description: The level of protection desired. For authentication, checksum, and encryption, specify auth-conf. The other valid values do not provide encryption.

The JDBC connection string for Sasl QOP uses these variables.

jdbc:hive2://fqdn.example.com:10000/default;principal=hive/[email protected];saslQop=auth-conf

The _HOST is a wildcard placeholder that gets automatically replaced with the fully qualified domain name (FQDN) of the server running the HiveServer daemon process.

Using JdbcStorageHandler to query RDBMS

Using the JdbcStorageHandler, you can connect Hive to a MySQL, PostgreSQL, Oracle, DB2, or Derby data source. You can then create an external table to represent the data, and query the table.

About this task

This task assumes you are a CDP Private Cloud Base user. You create an external table that uses the JdbcStorageHandler to connect to and read a local JDBC data source.

Procedure
1. Load data into a supported SQL database, such as MySQL, on a node in your cluster, or familiarize yourself with the existing data in your database.
2. Create an external table using the JdbcStorageHandler and table properties that specify the minimum information: database type, driver, database connection string, user name and password for querying Hive, table name, and number of active connections to Hive.

CREATE EXTERNAL TABLE mytable_jdbc(
  col1 string,
  col2 int,
  col3 double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "hive",
  "hive.sql.table" = "MYTABLE",
  "hive.sql.dbcp.maxActive" = "1"
);

3. Query the external table.

SELECT * FROM mytable_jdbc WHERE col2 = 19;


Set up JDBCStorageHandler for Postgres

If you use Enterprise PostgreSQL as the backend HMS database, you need to put the JDBCStorageHandler JAR in a central place.

About this task

The Postgres Enterprise server comes with its own JDBC driver. The driver file is installed in the Hive lib directory. When you execute a query as a YARN application, a class-not-found exception is thrown on worker nodes. The YARN container cannot include the JAR file in the classpath unless you place the JAR in a central location. Place the JAR in aux jars or provide the path to aux jars.

Procedure
1. In CDP Private Cloud Base, click Cloudera Manager > Clusters and select the Hive service, for example, HIVE.
2. Click Configuration and search for Hive Auxiliary JARs Directory.
3. Specify a directory value for the Hive Aux JARs property if necessary, or make a note of the path.
4. Upload the JAR to the specified directory on all HiveServer instances.
