Cloudera Runtime 7.2.9

Apache Impala Reference
Date published: 2019-11-01
Date modified: 2021-04-26

https://docs.cloudera.com/

Legal Notice

© Cloudera Inc. 2021. All rights reserved. The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein. Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release.

Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information.

Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs.

Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera.

Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners.

Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS. WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED ON COURSE OF DEALING OR USAGE IN TRADE.

Contents

Performance Considerations...... 5
    Performance Best Practices...... 5
    Query Join Performance...... 7
    Table and Column Statistics...... 8
    Generating Table and Column Statistics...... 20
    Runtime Filtering...... 23
    Partitioning...... 26
    Partition Pruning for Queries...... 29
    HDFS Caching...... 32
    HDFS Block Skew...... 36
    Understanding Performance using EXPLAIN Plan...... 37
    Understanding Performance using SUMMARY Report...... 38
    Understanding Performance using Query Profile...... 39

Scalability Considerations...... 40
    Scaling Limits and Guidelines...... 45
    Dedicated Coordinator...... 46

Hadoop File Formats Support...... 50
    Using Text Data Files...... 51
    Using Parquet Data Files...... 54
    Using ORC Data Files...... 62
    Using Avro Data Files...... 63
    Using RCFile Data Files...... 67
    Using SequenceFile Data Files...... 67

Storage Systems Supports...... 68
    Impala with HDFS...... 68
    Impala with Kudu...... 70
        Configuring for Kudu Tables...... 71
        Impala DDL for Kudu...... 71
        Impala DML for Kudu Tables...... 76
    Impala with HBase...... 78
    Impala with Azure Data Lake Store (ADLS)...... 81
    Impala with Amazon S3...... 84
        Specifying Impala Credentials to Access S3...... 88

Ports Used by Impala...... 88

Migration Guide...... 89

Setting up Data Cache for Remote Reads...... 93

Managing Metadata in Impala...... 94
    On-demand Metadata...... 94

SQL transactions in Impala...... 95

Performance Considerations

The following sections explain the factors affecting the performance of Impala features, and procedures for tuning, monitoring, and benchmarking Impala queries and other SQL operations.

Performance Best Practices

Use the performance guidelines and best practices during planning, experimentation, and performance tuning for an Impala-enabled cluster.

Choose the appropriate file format for the data
Typically, for large volumes of data (multiple gigabytes per table or partition), the Parquet file format performs best because of its combination of columnar storage layout, large I/O request size, and compression and encoding.
Note: For smaller volumes of data, a few gigabytes or less for each table or partition, you might not see significant performance differences between file formats. At small data volumes, reduced I/O from an efficient compressed file format can be counterbalanced by reduced opportunity for parallel execution. When planning for a production deployment or conducting benchmarks, always use realistic data volumes to get a true picture of performance and scalability.

Avoid data ingestion processes that produce many small files
When producing data files outside of Impala, prefer either text format or Avro, where you can build up the files row by row. Once the data is in Impala, you can convert it to the more efficient Parquet format and split it into multiple data files using a single INSERT ... SELECT statement. Or, if you have the infrastructure to produce multi-megabyte Parquet files as part of your data preparation process, do that and skip the conversion step inside Impala.
Always use INSERT ... SELECT to copy significant volumes of data from table to table within Impala. Avoid INSERT ... VALUES for any substantial volume of data or performance-critical tables, because each such statement produces a separate tiny data file.
For example, if you have thousands of partitions in a Parquet table, each with less than 256 MB of data, consider partitioning in a less granular way, such as by year / month rather than year / month / day. If an inefficient data ingestion process produces thousands of data files in the same table or partition, consider compacting the data by performing an INSERT ... SELECT to copy all the data to a different table; the data will be reorganized into a smaller number of larger files by this process.
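As a rough sketch of that compaction approach (the table names here are hypothetical), you can create a Parquet copy of a small-file table and rewrite the data in one pass:

-- Hypothetical tables: sales_staging holds many small text files produced by ingestion.
CREATE TABLE sales_parquet LIKE sales_staging STORED AS PARQUET;
INSERT OVERWRITE sales_parquet SELECT * FROM sales_staging;

Impala writes the new Parquet data files in parallel, so the result is a relatively small number of larger files.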

Choose partitioning granularity based on actual data volume
Partitioning is a technique that physically divides the data based on values of one or more columns, such as by year, month, day, region, city, section of a web site, and so on. When you issue queries that request a specific value or range of values for the partition key columns, Impala can avoid reading the irrelevant data, potentially yielding a huge savings in disk I/O.
When deciding which column(s) to use for partitioning, choose the right level of granularity. For example, should you partition by year, month, and day, or only by year and month? Choose a partitioning strategy that puts at least 256 MB of data in each partition, to take advantage of HDFS bulk I/O and Impala distributed queries. Over-partitioning can also cause query planning to take longer than necessary, as Impala prunes the unnecessary partitions. Ideally, keep the number of partitions in the table under 30 thousand.
When preparing data files to go in a partition directory, create several large files rather than many small ones. If you receive data in the form of many small files and have no control over the input format, consider using the INSERT ... SELECT syntax to copy data from one table or partition to another, which compacts the files into a relatively small number (based on the number of nodes in the cluster).


If you need to reduce the overall number of partitions and increase the amount of data in each partition, first look for partition key columns that are rarely referenced or are referenced in non-critical queries (not subject to an SLA). For example, your web site log data might be partitioned by year, month, day, and hour, but if most queries roll up the results by day, perhaps you only need to partition by year, month, and day. If you need to reduce the granularity even more, consider creating “buckets”, computed values corresponding to different sets of partition key values. For example, you can use the TRUNC() function with a TIMESTAMP column to group date and time values based on intervals such as week or quarter.
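For example, the following sketch (hypothetical table and column names) uses TRUNC() to bucket TIMESTAMP values by quarter, so that reports can roll up on the coarse bucket instead of on a fine-grained day partition:

-- Hypothetical: derive a quarter "bucket" from a TIMESTAMP column.
SELECT TRUNC(sale_ts, 'Q') AS sale_quarter, COUNT(*) AS num_sales
  FROM sales_raw
 GROUP BY TRUNC(sale_ts, 'Q');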

Use smallest appropriate integer types for partition key columns
Although it is tempting to use strings for partition key columns, since those values are turned into HDFS directory names anyway, you can minimize memory usage by using numeric values for common partition key fields such as YEAR, MONTH, and DAY. Use the smallest integer type that holds the appropriate range of values, typically TINYINT for MONTH and DAY, and SMALLINT for YEAR. Use the EXTRACT() function to pull out individual date and time fields from a TIMESTAMP value, and CAST() the return value to the appropriate integer type.
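A minimal sketch of this pattern, with hypothetical table and column names, might look like the following; the partition key values are derived from a TIMESTAMP column and cast to narrow integer types:

-- Hypothetical tables; the partition columns use the smallest practical integer types.
CREATE TABLE logs_part (msg STRING)
  PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
  STORED AS PARQUET;
INSERT INTO logs_part PARTITION (year, month, day)
  SELECT msg,
         CAST(EXTRACT(YEAR FROM event_ts) AS SMALLINT),
         CAST(EXTRACT(MONTH FROM event_ts) AS TINYINT),
         CAST(EXTRACT(DAY FROM event_ts) AS TINYINT)
  FROM logs_staging;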

Choose an appropriate Parquet block size
By default, the Impala INSERT ... SELECT statement creates Parquet files with a 256 MB block size. (This default was changed in Impala 2.0. Formerly, the limit was 1 GB, but Impala made conservative estimates about compression, resulting in files that were smaller than 1 GB.)
Each Parquet file written by Impala is a single block, allowing the whole file to be processed as a unit by a single host. As you copy Parquet files into HDFS or between HDFS filesystems, use hadoop distcp -pb to preserve the original block size.
If there are only one or a few data blocks in your Parquet table, or in a partition that is the only one accessed by a query, then you might experience a slowdown for a different reason: not enough data to take advantage of Impala's parallel distributed queries. Each data block is processed by a single core on one of the DataNodes. In a 100-node cluster of 16-core machines, you could potentially process thousands of data files simultaneously. You want to find a sweet spot between “many tiny files” and “single giant file” that balances bulk I/O and parallel processing.
You can set the PARQUET_FILE_SIZE query option before doing an INSERT ... SELECT statement to reduce the size of each generated Parquet file. (Specify the file size as an absolute number of bytes, or in Impala 2.0 and later, in units ending with m for megabytes or g for gigabytes.) Run benchmarks with different file sizes to find the right balance point for your particular data volume.
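For example, the following sketch (hypothetical table names) writes smaller Parquet files for a modest data volume:

-- Write roughly 128 MB Parquet files instead of the 256 MB default.
SET PARQUET_FILE_SIZE=128m;
INSERT OVERWRITE parquet_table SELECT * FROM text_table;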

Gather statistics for all tables used in performance-critical or high-volume join queries
Gather the statistics with the COMPUTE STATS statement.

Minimize the overhead of transmitting results back to the client
Use techniques such as the following (see the example after this list):
• Aggregation. If you need to know how many rows match a condition, the total of values from some column, the lowest or highest matching value, and so on, call aggregate functions such as COUNT(), SUM(), and MAX() in the query rather than sending the result set to an application and doing those computations there. Remember that the size of an unaggregated result set could be huge, requiring substantial time to transmit across the network.
• Filtering. Use all applicable tests in the WHERE clause of a query to eliminate rows that are not relevant, rather than producing a big result set and filtering it using application logic.
• LIMIT clause. If you only need to see a few sample values from a result set, or the top or bottom values from a query using ORDER BY, include the LIMIT clause to reduce the size of the result set rather than asking for the full result set and then throwing most of the rows away.
• Avoid overhead from pretty-printing the result set and displaying it on the screen. When you retrieve the results through impala-shell, use impala-shell options such as -B and --output_delimiter to produce results without special formatting, and redirect output to a file rather than printing to the screen. Consider using INSERT ... SELECT to write the results directly to new files in HDFS.
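The following sketch (hypothetical table and column names) combines the first three techniques so that only a ten-row result set is returned to the client:

-- Aggregate, filter, and limit on the Impala side; only a small result set crosses the network.
SELECT customer_id, COUNT(*) AS num_orders, MAX(total_price) AS biggest_order
  FROM sales
 WHERE sale_date >= '2021-01-01'
 GROUP BY customer_id
 ORDER BY num_orders DESC
 LIMIT 10;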


Verify that your queries are planned in an efficient logical manner
Examine the EXPLAIN plan for a query before actually running it.

Verify performance characteristics of queries
Verify that the low-level aspects of I/O, memory usage, network bandwidth, CPU utilization, and so on are within expected ranges by examining the query profile for a query after running it.

Hotspot analysis
In the context of Impala, a hotspot is defined as “an Impala daemon that for a single query or a workload is spending a far greater amount of time processing data relative to its neighbours”.
Before discussing the options to tackle this issue, some background is first required to understand how this problem can occur.
By default, the scheduling of scan-based plan fragments is deterministic. This means that for multiple queries needing to read the same block of data, the same node will be picked to host the scan. The default scheduling logic does not take into account node workload from prior queries. The complexity of materializing a tuple depends on a few factors, namely: decoding and decompression. If the tuples are densely packed into data pages due to good encoding/compression ratios, there will be more work required when reconstructing the data. Each compression codec offers different performance tradeoffs and should be considered before writing the data. Due to the deterministic nature of the scheduler, single nodes can become bottlenecks for highly concurrent queries that use the same tables.
If, for example, a Parquet-based dataset is tiny, e.g. a small dimension table, such that it fits into a single HDFS block (Impala by default will create 256 MB blocks when Parquet is used, each containing a single row group), then there are a number of options that can be considered to resolve the potential scheduling hotspots when querying this data (see the sketch after this list):
• The scheduler’s deterministic behaviour can be changed using the following query options: REPLICA_PREFERENCE and RANDOM_REPLICA.
• HDFS caching can be used to cache block replicas. This will cause the Impala scheduler to randomly pick a node that is hosting a cached block replica for the scan. Note, although HDFS caching has benefits, it serves only to help with the reading of raw block data and not cached tuple data, but with the right number of cached replicas (by default, HDFS only caches one replica), even load distribution can be achieved for smaller datasets.
• Do not compress the table data. The uncompressed table data spans more nodes and eliminates skew caused by compression.
• Reduce the Parquet file size via the PARQUET_FILE_SIZE query option when writing the table data. Using this approach the data will span more nodes. However, it is not recommended to drop the size below 32 MB.
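As a rough sketch of the query-option and file-size approaches above (the table name is hypothetical, and the exact randomization option may appear as SCHEDULE_RANDOM_REPLICA in some releases, so verify these names against your Impala version):

-- Assumed query-option names; confirm with SET in impala-shell before relying on them.
SET SCHEDULE_RANDOM_REPLICA=TRUE;  -- allow the scheduler to pick a random replica for the scan
SET PARQUET_FILE_SIZE=32m;         -- smallest recommended file size, spreads the table over more nodes
INSERT OVERWRITE small_dim_rewritten SELECT * FROM small_dim;  -- hypothetical rewrite of the hot table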

Query Join Performance

Joins are the main class of queries that you can tune at the SQL level.
Queries involving join operations often require more tuning than queries that refer to only one table. The maximum size of the result set from a join query is the product of the number of rows in all the joined tables. When joining several tables with millions or billions of rows, any missed opportunity to filter the result set, or other inefficiency in the query, could lead to an operation that does not finish in a practical time and has to be cancelled.
The simplest technique for tuning an Impala join query is to collect statistics on each table involved in the join using the COMPUTE STATS statement, and then to let Impala automatically optimize the query based on the size of each table, number of distinct values of each column, and so on. For accurate statistics about each table, issue the COMPUTE STATS statement after loading the data into that table, and again if the amount of data changes substantially due to operations such as INSERT, LOAD DATA, or adding a partition.
If statistics are not available for all the tables in the join query, or if Impala chooses a join order that is not the most efficient, you can override the automatic join order optimization by specifying the STRAIGHT_JOIN keyword immediately after the SELECT and any DISTINCT or ALL keywords. In this case, Impala uses the order the tables appear in the query to guide how the joins are processed.


When you use the STRAIGHT_JOIN technique, you must order the tables in the join query manually instead of relying on the Impala optimizer. The optimizer uses sophisticated techniques to estimate the size of the result set at each stage of the join. For manual ordering, use this heuristic approach to start with, and then experiment to fine-tune the order:
• Specify the largest table first. This table is read from disk by each Impala node and so its size is not significant in terms of memory usage during the query.
• Next, specify the smallest table. The contents of the second, third, and so on tables are all transmitted across the network. You want to minimize the size of the result set from each subsequent stage of the join query. The most likely approach involves joining a small table first, so that the result set remains small even as subsequent larger tables are processed.
• Join the next smallest table, then the next smallest, and so on. For example, if you had tables BIG, MEDIUM, SMALL, and TINY, the logical join order to try would be BIG, TINY, SMALL, MEDIUM.
The terms “largest” and “smallest” refer to the size of the intermediate result set based on the number of rows and columns from each table that are part of the result set. For example, if you join one table sales with another table customers, a query might find results from 100 different customers who made a total of 5000 purchases. In that case, you would specify SELECT ... FROM sales JOIN customers ..., putting customers on the right side because it is smaller in the context of this query.
The Impala query planner chooses between different techniques for performing join queries, depending on the absolute and relative sizes of the tables. Broadcast joins are the default, where the right-hand table is considered to be smaller than the left-hand table, and its contents are sent to all the other nodes involved in the query. The alternative technique is known as a partitioned join (not related to a partitioned table), which is more suitable for large tables of roughly equal size. With this technique, portions of each table are sent to appropriate other nodes where those subsets of rows can be processed in parallel. The choice of broadcast or partitioned join also depends on statistics being available for all tables in the join, gathered by the COMPUTE STATS statement.
To see which join strategy is used for a particular query, issue an EXPLAIN statement for the query. If you find that a query uses a broadcast join when you know through benchmarking that a partitioned join would be more efficient, or vice versa, add a hint to the query to specify the precise join mechanism to use, as in the sketch below.
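A hedged sketch with hypothetical tables: check the chosen strategy with EXPLAIN, then add a hint if benchmarking shows the other strategy is faster.

EXPLAIN SELECT c.name, SUM(s.amount)
  FROM sales s JOIN customers c ON s.customer_id = c.id
 GROUP BY c.name;
-- If the plan shows a broadcast join but a partitioned join benchmarks better:
SELECT c.name, SUM(s.amount)
  FROM sales s JOIN /* +SHUFFLE */ customers c ON s.customer_id = c.id
 GROUP BY c.name;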

How Joins Are Processed when Statistics Are Unavailable

If table or column statistics are not available for some tables in a join, Impala still reorders the tables using the information that is available. Tables with statistics are placed on the left side of the join order, in descending order of cost based on overall size and cardinality. Tables without statistics are treated as zero-size, that is, they are always placed on the right side of the join order.

Overriding Join Reordering with STRAIGHT_JOIN

If an Impala join query is inefficient because of outdated statistics or unexpected data distribution, you can keep Impala from reordering the joined tables by using the STRAIGHT_JOIN keyword immediately after the SELECT and any DISTINCT or ALL keywords. The STRAIGHT_JOIN keyword turns off the reordering of join clauses that Impala does internally, and produces a plan that relies on the join clauses being ordered optimally in the query text.
Note: The STRAIGHT_JOIN hint affects the join order of table references in the query block containing the hint. It does not affect the join order of nested queries, such as views, inline views, or WHERE-clause subqueries. To use this hint for performance tuning of complex queries, apply the hint to all query blocks that need a fixed join order.
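A minimal sketch, using the hypothetical BIG, TINY, SMALL, and MEDIUM tables from the heuristic above:

-- Preserve the written join order instead of letting the optimizer reorder the tables.
SELECT STRAIGHT_JOIN b.key_col, COUNT(*)
  FROM big b
  JOIN tiny t ON b.key_col = t.key_col
  JOIN small s ON b.key_col = s.key_col
  JOIN medium m ON b.key_col = m.key_col
 GROUP BY b.key_col;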

Table and Column Statistics

Impala can do better optimization for complex or multi-table queries when it has access to statistics about the volume of data and how the values are distributed. Impala uses this information to help parallelize and distribute the work for a query. For example, optimizing join queries requires a way of determining if one table is “bigger” than another, which is a function of the number of rows and the average row size for each table.
This topic describes the categories of statistics Impala can work with, and how to produce them and keep them up to date.

Overview of Table Statistics

The Impala query planner can make use of statistics about entire tables and partitions. This information includes physical characteristics such as the number of rows, number of data files, the total size of the data files, and the file format. For partitioned tables, the numbers are calculated per partition, and as totals for the whole table. This metadata is stored in the Metastore database, and can be updated by either Impala or Hive.
If a number is not available, the value -1 is used as a placeholder. Some numbers, such as number and total sizes of data files, are always kept up to date because they can be calculated cheaply, as part of gathering HDFS block metadata.
The following example shows table stats for an unpartitioned Parquet table. The values for the number and sizes of files are always available. Initially, the number of rows is not known, because it requires a potentially expensive scan through the entire table, and so that value is displayed as -1. The COMPUTE STATS statement fills in any unknown table stats values.

show table stats parquet_snappy;
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
| #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats |...
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
| -1    | 96     | 23.35GB | NOT CACHED   | NOT CACHED        | PARQUET | false             |...
+-------+--------+---------+--------------+-------------------+---------+-------------------+...

compute stats parquet_snappy;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+

show table stats parquet_snappy;
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
| #Rows      | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats |...
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
| 1000000000 | 96     | 23.35GB | NOT CACHED   | NOT CACHED        | PARQUET | false             |...
+------------+--------+---------+--------------+-------------------+---------+-------------------+...

To check that table statistics are available for a table, and see the details of those statistics, use the statement SHOW TABLE STATS table_name.

Overview of Column Statistics

The Impala query planner can make use of statistics about individual columns when that metadata is available in the metastore database. This technique is most valuable for columns compared across tables in join queries, to help estimate how many rows the query will retrieve from each table. These statistics are also important for correlated subqueries using the EXISTS or IN operators, which are processed internally the same way as join queries.


The following example shows column stats for an unpartitioned Parquet table. The values for the maximum and average sizes of some types are always available, because those figures are constant for numeric and other fixed-size types. Initially, the number of distinct values is not known, because it requires a potentially expensive scan through the entire table, and so that value is displayed as -1. The same applies to maximum and average sizes of variable-sized types, such as STRING. The COMPUTE STATS statement fills in most unknown column stats values. (It does not record the number of NULL values, because currently Impala does not use that figure for query optimization.)

show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+----------+
| Column      | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+----------+
| id          | BIGINT   | -1               | -1     | 8        | 8        |
| val         | INT      | -1               | -1     | 4        | 4        |
| zerofill    | STRING   | -1               | -1     | -1       | -1       |
| name        | STRING   | -1               | -1     | -1       | -1       |
| assertion   | BOOLEAN  | -1               | -1     | 1        | 1        |
| location_id | SMALLINT | -1               | -1     | 2        | 2        |
+-------------+----------+------------------+--------+----------+----------+

compute stats parquet_snappy;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+

show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+-------------------+
| Column      | Type     | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-------------+----------+------------------+--------+----------+-------------------+
| id          | BIGINT   | 183861280        | -1     | 8        | 8                 |
| val         | INT      | 139017           | -1     | 4        | 4                 |
| zerofill    | STRING   | 101761           | -1     | 6        | 6                 |
| name        | STRING   | 145636240        | -1     | 22       | 13.00020027160645 |
| assertion   | BOOLEAN  | 2                | -1     | 1        | 1                 |
| location_id | SMALLINT | 339              | -1     | 2        | 2                 |
+-------------+----------+------------------+--------+----------+-------------------+

Note: For column statistics to be effective in Impala, you also need to have table statistics for the applicable tables. When you use the Impala COMPUTE STATS statement, both table and column statistics are automatically gathered at the same time, for all columns in the table.
Note: Because Impala does not currently use the NULL count during query planning, Impala speeds up the COMPUTE STATS statement by skipping this NULL counting.

To check whether column statistics are available for a particular set of columns, use the SHOW COLUMN STATS table_name statement, or check the extended EXPLAIN output for a query against that table that refers to those columns.
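For example, a sketch with hypothetical tables; raising EXPLAIN_LEVEL makes the plan output include more detail about the statistics the planner is relying on:

SET EXPLAIN_LEVEL=EXTENDED;
EXPLAIN SELECT s.customer_id
  FROM sales s JOIN customers c ON s.customer_id = c.id;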


How Table and Column Statistics Work for Partitioned Tables

When you use Impala for “big data”, you are highly likely to use partitioning for your biggest tables, the ones representing data that can be logically divided based on dates, geographic regions, or similar criteria. The table and column statistics are especially useful for optimizing queries on such tables. For example, a query involving one year might involve substantially more or less data than a query involving a different year, or a range of several years. Each query might be optimized differently as a result.
The following examples show how table and column stats work with a partitioned table. The table for this example is partitioned by year, month, and day. For simplicity, the sample data consists of 5 partitions, all from the same year and month.
Table stats are collected independently for each partition. (In fact, the SHOW PARTITIONS statement displays exactly the same information as SHOW TABLE STATS for a partitioned table.) Column stats apply to the entire table, not to individual partitions. Because the partition key column values are represented as HDFS directories, their characteristics are typically known in advance, even when the values for non-key columns are shown as -1.

show partitions year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| year  | month | day | #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| 2013  | 12    | 1   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 2   | -1    | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 3   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 4   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 5   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| Total |       |     | -1    | 5      | 12.58MB | 0B           |                   |         |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...

show table stats year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| year  | month | day | #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| 2013  | 12    | 1   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 2   | -1    | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 3   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 4   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 5   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| Total |       |     | -1    | 5      | 12.58MB | 0B           |                   |         |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...

show column stats year_month_day;
+-----------+---------+------------------+--------+----------+----------+
| Column    | Type    | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+----------+
| id        | INT     | -1               | -1     | 4        | 4        |
| val       | INT     | -1               | -1     | 4        | 4        |
| zfill     | STRING  | -1               | -1     | -1       | -1       |
| name      | STRING  | -1               | -1     | -1       | -1       |
| assertion | BOOLEAN | -1               | -1     | 1        | 1        |
| year      | INT     | 1                | 0      | 4        | 4        |
| month     | INT     | 1                | 0      | 4        | 4        |
| day       | INT     | 5                | 0      | 4        | 4        |
+-----------+---------+------------------+--------+----------+----------+

compute stats year_month_day;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 5 partition(s) and 5 column(s). |
+-----------------------------------------+

show table stats year_month_day;
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
| year  | month | day | #Rows  | #Files | Size    | Bytes Cached | Cache Replication | Format  |...
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
| 2013  | 12    | 1   | 93606  | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 2   | 94158  | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 3   | 94122  | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 4   | 93559  | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 5   | 93845  | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| Total |       |     | 469290 | 5      | 12.58MB | 0B           |                   |         |...
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...

show column stats year_month_day;
+-----------+---------+------------------+--------+----------+-------------------+
| Column    | Type    | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-----------+---------+------------------+--------+----------+-------------------+
| id        | INT     | 511129           | -1     | 4        | 4                 |
| val       | INT     | 364853           | -1     | 4        | 4                 |
| zfill     | STRING  | 311430           | -1     | 6        | 6                 |
| name      | STRING  | 471975           | -1     | 22       | 13.00160026550293 |
| assertion | BOOLEAN | 2                | -1     | 1        | 1                 |
| year      | INT     | 1                | 0      | 4        | 4                 |
| month     | INT     | 1                | 0      | 4        | 4                 |
| day       | INT     | 5                | 0      | 4        | 4                 |
+-----------+---------+------------------+--------+----------+-------------------+

If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, Impala can only use the resulting column statistics if the table is unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned table.
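For reference, a sketch of that Hive-side workflow on a hypothetical unpartitioned table, followed by a metadata reload so Impala can see the new statistics:

-- Run in Hive, not Impala.
ANALYZE TABLE unpart_table COMPUTE STATISTICS FOR COLUMNS;
-- Then, in impala-shell, reload the table metadata.
INVALIDATE METADATA unpart_table;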

Detecting Missing Statistics

You can check whether a specific table has statistics using the SHOW TABLE STATS statement (for any table) or the SHOW PARTITIONS statement (for a partitioned table). Both statements display the same information. If a table or a partition does not have any statistics, the #Rows field contains -1. Once you compute statistics for the table or partition, the #Rows field changes to an accurate value.
The following example shows a table that initially does not have any statistics. The SHOW TABLE STATS statement displays different values for #Rows before and after the COMPUTE STATS operation.

[localhost:21000] > create table no_stats (x int);
[localhost:21000] > show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| -1    | 0      | 0B   | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > compute stats no_stats;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| 0     | 0      | 0B   | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+

The following example shows a similar progression with a partitioned table. Initially, #Rows is -1. After a COMPUTE STATS operation, #Rows changes to an accurate value. Any newly added partition starts with no statistics, meaning that you must collect statistics after adding a new partition.

[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint);
[localhost:21000] > show table stats no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > alter table no_stats_partitioned add partition (year=2013);
[localhost:21000] > compute stats no_stats_partitioned;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > alter table no_stats_partitioned add partition (year=2014);
[localhost:21000] > show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| 2013  | 0     | 0      | 0B   | NOT CACHED   | TEXT   | false             |
| 2014  | -1    | 0      | 0B   | NOT CACHED   | TEXT   | false             |
| Total | 0     | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+

Note: Because the default COMPUTE STATS statement creates and updates statistics for all partitions in a table, if you expect to frequently add new partitions, use the COMPUTE INCREMENTAL STATS syntax instead, which lets you compute stats for a single specified partition, or only for those partitions that do not already have incremental stats.
If checking each individual table is impractical, due to a large number of tables or views that hide the underlying base tables, you can also check for missing statistics for a particular query. Use the EXPLAIN statement to preview query efficiency before actually running the query. Use the query profile output available through the PROFILE command in impala-shell or the web UI to verify query execution and timing after running the query. Both the EXPLAIN plan and the PROFILE output display a warning if any tables or partitions involved in the query do not have statistics.

[localhost:21000] > create table no_stats (x int);
[localhost:21000] > explain select count(*) from no_stats;
+-------------------------------------------------------------------------------------+
| Explain String                                                                      |
+-------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1                            |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| incremental_stats.no_stats                                                          |
|                                                                                     |
| 03:AGGREGATE [FINALIZE]                                                             |
| |  output: count:merge(*)                                                           |
| |                                                                                   |
| 02:EXCHANGE [UNPARTITIONED]                                                         |
| |                                                                                   |
| 01:AGGREGATE                                                                        |
| |  output: count(*)                                                                 |
| |                                                                                   |
| 00:SCAN HDFS [incremental_stats.no_stats]                                           |
|    partitions=1/1 files=0 size=0B                                                   |
+-------------------------------------------------------------------------------------+

Because Impala uses the partition pruning technique when possible to only evaluate certain partitions, if you have a partitioned table with statistics for some partitions and not others, whether or not the EXPLAIN statement shows the warning depends on the actual partitions used by the query. For example, you might see warnings or not for different queries against the same table:

-- No warning because all the partitions for the year 2012 have stats.
EXPLAIN SELECT ... FROM t1 WHERE year = 2012;

-- Missing stats warning because one or more partitions in this range
-- do not have stats.
EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;

To confirm if any partitions at all in the table are missing statistics, you might explain a query that scans the entire table, such as SELECT COUNT(*) FROM table_name.

Manually Setting Table Statistics with ALTER TABLE

The most crucial piece of data in all the statistics is the number of rows in the table (for an unpartitioned or partitioned table) and for each partition (for a partitioned table). The COMPUTE STATS statement always gathers statistics about all columns, as well as overall table statistics. If it is not practical to do a full COMPUTE STATS or COMPUTE INCREMENTAL STATS operation after adding a partition or inserting data, or if you can see that Impala would produce a more efficient plan if the number of rows was different, you can manually set the number of rows through an ALTER TABLE statement:

-- Set total number of rows. Applies to both unpartitioned and partitioned tables.
alter table table_name set tblproperties('numRows'='new_value',
  'STATS_GENERATED_VIA_STATS_TASK'='true');

-- Set total number of rows for a specific partition. Applies to partitioned tables only.
-- You must specify all the partition key columns in the PARTITION clause.
alter table table_name partition (keycol1=val1,keycol2=val2...) set
  tblproperties('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK'='true');

This statement avoids re-scanning any data files. For example:

alter table analysis_data set tblproperties('numRows'='1001000000',
  'STATS_GENERATED_VIA_STATS_TASK'='true');


For a partitioned table, update both the per-partition number of rows and the number of rows for the whole table. For example:

alter table partitioned_data partition(year=2009, month=4) set
  tblproperties('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
alter table partitioned_data set tblproperties('numRows'='1030000',
  'STATS_GENERATED_VIA_STATS_TASK'='true');

In practice, the COMPUTE STATS statement, or COMPUTE INCREMENTAL STATS for a partitioned table, should be fast and convenient enough that this technique is only useful for the very largest partitioned tables. Because the column statistics might be left in a stale state, do not use this technique as a replacement for COMPUTE STATS. Only use this technique if all other means of collecting statistics are impractical, or as a low-overhead operation that you run in between periodic COMPUTE STATS or COMPUTE INCREMENTAL STATS operations.

Manually Setting Column Statistics with ALTER TABLE

You can use the SET COLUMN STATS clause of ALTER TABLE to manually set or change column statistics. Only use this technique in cases where it is impractical to run COMPUTE STATS or COMPUTE INCREMENTAL STATS frequently enough to keep up with data changes for a huge table.
You specify a case-insensitive symbolic name for the kind of statistics: numDVs, numNulls, avgSize, maxSize. The key names and values are both quoted. This operation applies to an entire table, not a specific partition. For example:

alter table t1 set column stats x ('numDVs'='2','numNulls'='0');
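Similarly, a hedged sketch that supplies size statistics for a hypothetical STRING column s, which COMPUTE STATS would otherwise gather by scanning the data:

alter table t1 set column stats s ('avgSize'='12.5','maxSize'='40');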

Examples of Using Table and Column Statistics with Impala

The following examples walk through a sequence of SHOW TABLE STATS, SHOW COLUMN STATS, ALTER TABLE, and SELECT and INSERT statements to illustrate various aspects of how Impala uses statistics to help optimize queries.
This example shows table and column statistics for the STORE table used in the TPC-DS benchmarks for decision support systems. It is a tiny table holding data for 12 stores. Initially, before any statistics are gathered by a COMPUTE STATS statement, most of the numeric fields show placeholder values of -1, indicating that the figures are unknown. The figures that are filled in are values that are easily countable or deducible at the physical level, such as the number of files, total data size of the files, and the maximum and average sizes for data types that have a constant size such as INT, FLOAT, and TIMESTAMP.

[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size   | Format |
+-------+--------+--------+--------+
| -1    | 1      | 3.08KB | TEXT   |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.03s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+----------+
| Column             | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+----------+
| s_store_sk         | INT       | -1               | -1     | 4        | 4        |
| s_store_id         | STRING    | -1               | -1     | -1       | -1       |
| s_rec_start_date   | TIMESTAMP | -1               | -1     | 16       | 16       |
| s_rec_end_date     | TIMESTAMP | -1               | -1     | 16       | 16       |
| s_closed_date_sk   | INT       | -1               | -1     | 4        | 4        |
| s_store_name       | STRING    | -1               | -1     | -1       | -1       |
| s_number_employees | INT       | -1               | -1     | 4        | 4        |
| s_floor_space      | INT       | -1               | -1     | 4        | 4        |
| s_hours            | STRING    | -1               | -1     | -1       | -1       |
| s_manager          | STRING    | -1               | -1     | -1       | -1       |
| s_market_id        | INT       | -1               | -1     | 4        | 4        |
| s_geography_class  | STRING    | -1               | -1     | -1       | -1       |
| s_market_desc      | STRING    | -1               | -1     | -1       | -1       |
| s_market_manager   | STRING    | -1               | -1     | -1       | -1       |
| s_division_id      | INT       | -1               | -1     | 4        | 4        |
| s_division_name    | STRING    | -1               | -1     | -1       | -1       |
| s_company_id       | INT       | -1               | -1     | 4        | 4        |
| s_company_name     | STRING    | -1               | -1     | -1       | -1       |
| s_street_number    | STRING    | -1               | -1     | -1       | -1       |
| s_street_name      | STRING    | -1               | -1     | -1       | -1       |
| s_street_type      | STRING    | -1               | -1     | -1       | -1       |
| s_suite_number     | STRING    | -1               | -1     | -1       | -1       |
| s_city             | STRING    | -1               | -1     | -1       | -1       |
| s_county           | STRING    | -1               | -1     | -1       | -1       |
| s_state            | STRING    | -1               | -1     | -1       | -1       |
| s_zip              | STRING    | -1               | -1     | -1       | -1       |
| s_country          | STRING    | -1               | -1     | -1       | -1       |
| s_gmt_offset       | FLOAT     | -1               | -1     | 4        | 4        |
| s_tax_percentage   | FLOAT     | -1               | -1     | 4        | 4        |
+--------------------+-----------+------------------+--------+----------+----------+
Returned 29 row(s) in 0.04s

With the Hive ANALYZE TABLE statement for column statistics, you had to specify each column for which to gather statistics. The Impala COMPUTE STATS statement automatically gathers statistics for all columns, because it reads through the entire table relatively quickly and can efficiently compute the values for all the columns. This example shows how after running the COMPUTE STATS statement, statistics are filled in for both the table and all its columns:

[localhost:21000] > compute stats store;
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 1 partition(s) and 29 column(s). |
+------------------------------------------+
Returned 1 row(s) in 1.88s
[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size   | Format |
+-------+--------+--------+--------+
| 12    | 1      | 3.08KB | TEXT   |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+-------------------+
| Column             | Type      | #Distinct Values | #Nulls | Max Size | Avg Size          |
+--------------------+-----------+------------------+--------+----------+-------------------+
| s_store_sk         | INT       | 12               | -1     | 4        | 4                 |
| s_store_id         | STRING    | 6                | -1     | 16       | 16                |
| s_rec_start_date   | TIMESTAMP | 4                | -1     | 16       | 16                |
| s_rec_end_date     | TIMESTAMP | 3                | -1     | 16       | 16                |
| s_closed_date_sk   | INT       | 3                | -1     | 4        | 4                 |
| s_store_name       | STRING    | 8                | -1     | 5        | 4.25              |
| s_number_employees | INT       | 9                | -1     | 4        | 4                 |
| s_floor_space      | INT       | 10               | -1     | 4        | 4                 |
| s_hours            | STRING    | 2                | -1     | 8        | 7.083300113677979 |
| s_manager          | STRING    | 7                | -1     | 15       | 12                |
| s_market_id        | INT       | 7                | -1     | 4        | 4                 |
| s_geography_class  | STRING    | 1                | -1     | 7        | 7                 |
| s_market_desc      | STRING    | 10               | -1     | 94       | 55.5              |
| s_market_manager   | STRING    | 7                | -1     | 16       | 14                |
| s_division_id      | INT       | 1                | -1     | 4        | 4                 |
| s_division_name    | STRING    | 1                | -1     | 7        | 7                 |
| s_company_id       | INT       | 1                | -1     | 4        | 4                 |
| s_company_name     | STRING    | 1                | -1     | 7        | 7                 |
| s_street_number    | STRING    | 9                | -1     | 3        | 2.833300113677979 |
| s_street_name      | STRING    | 12               | -1     | 11       | 6.583300113677979 |
| s_street_type      | STRING    | 8                | -1     | 9        | 4.833300113677979 |
| s_suite_number     | STRING    | 11               | -1     | 9        | 8.25              |
| s_city             | STRING    | 2                | -1     | 8        | 6.5               |
| s_county           | STRING    | 1                | -1     | 17       | 17                |
| s_state            | STRING    | 1                | -1     | 2        | 2                 |
| s_zip              | STRING    | 2                | -1     | 5        | 5                 |
| s_country          | STRING    | 1                | -1     | 13       | 13                |
| s_gmt_offset       | FLOAT     | 1                | -1     | 4        | 4                 |
| s_tax_percentage   | FLOAT     | 5                | -1     | 4        | 4                 |
+--------------------+-----------+------------------+--------+----------+-------------------+
Returned 29 row(s) in 0.04s

The following example shows how statistics are represented for a partitioned table. In this case, we have set up a table to hold the world's most trivial census data, a single STRING field, partitioned by a YEAR column. The table statistics include a separate entry for each partition, plus final totals for the numeric fields. The column statistics include some easily deducible facts for the partitioning column, such as the number of distinct values (the number of partition subdirectories).

[localhost:21000] > describe census;
+------+----------+---------+
| name | type     | comment |
+------+----------+---------+
| name | string   |         |
| year | smallint |         |
+------+----------+---------+
Returned 2 row(s) in 0.02s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 0     | 1      | 22B  | TEXT    |
| 2012  | -1    | 1      | 22B  | TEXT    |
| 2013  | -1    | 1      | 231B | PARQUET |
| Total | 0     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | -1               | -1     | -1       | -1       |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s

The following example shows how the statistics are filled in by a COMPUTE STATS statement in Impala.

[localhost:21000] > compute stats census;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 2.16s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
| 2013  | 1     | 1      | 231B | PARQUET |
| Total | 9     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | 4                | -1     | 5        | 4.5      |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s

You can see how Impala executes a query differently in each case by observing the EXPLAIN output before and after collecting statistics. Measure the before and after query times, and examine the throughput numbers in before and after SUMMARY or PROFILE output, to verify how much the improved plan speeds up performance.

Generating Table and Column Statistics

Use the COMPUTE STATS statement to collect table and column statistics. The COMPUTE STATS variants offer different tradeoffs between computation cost, staleness, and maintenance workflows.
Important: For a particular table, use either COMPUTE STATS or COMPUTE INCREMENTAL STATS, but never combine the two or alternate between them. If you switch from COMPUTE STATS to COMPUTE INCREMENTAL STATS during the lifetime of a table, or vice versa, drop all statistics by running DROP STATS before making the switch.
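A minimal sketch of that switch, with a hypothetical table name:

-- Drop all existing statistics, then start fresh with the incremental style.
DROP STATS mytable;
COMPUTE INCREMENTAL STATS mytable;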

COMPUTE STATS

The COMPUTE STATS command collects and sets the table-level and partition-level row counts as well as all column statistics for a given table. The collection process is CPU-intensive and can take a long time to complete for very large tables.
To speed up COMPUTE STATS, consider the following options, which can be combined (see the sketch at the end of this subsection):
• Limit the number of columns for which statistics are collected to increase the efficiency of COMPUTE STATS. Queries benefit from statistics for those columns involved in filters, join conditions, group by or partition by clauses. Other columns are good candidates to exclude from COMPUTE STATS. This feature is available since Impala 2.12.
• Set the MT_DOP query option to use more threads within each participating impalad to compute the statistics faster - but not more efficiently. Note that computing stats on a large table with a high MT_DOP value can negatively affect other queries running at the same time if the COMPUTE STATS claims most CPU cycles.
• Consider the experimental extrapolation and sampling features (see below) to further increase the efficiency of computing stats.
COMPUTE STATS is intended to be run periodically, e.g. weekly, or on-demand when the contents of a table have changed significantly. Due to the high resource utilization and long response time of COMPUTE STATS, it is most practical to run it in a scheduled maintenance window where the Impala cluster is idle enough to accommodate the expensive operation. The degree of change that qualifies as “significant” depends on the query workload, but typically, if 30% of the rows have changed then it is recommended to recompute statistics.
If you reload a complete new set of data for a table, but the number of rows and number of distinct values for each column is relatively unchanged from before, you do not need to recompute stats for the table.
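A hedged sketch combining the first two options above, with hypothetical table and column names:

-- Compute stats only for the columns used in joins and filters, with extra threads per impalad.
SET MT_DOP=4;
COMPUTE STATS sales (customer_id, store_id, sale_date);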

Extrapolation and Sampling

Impala supports extrapolation and sampling to alleviate the following common issues for computing and maintaining statistics on very large tables:
• Newly added partitions do not have row count statistics. Table scans that only access those new partitions are treated as not having stats. Similarly, table scans that access both new and old partitions estimate the scan cardinality based on those old partitions that have stats, and the new partitions without stats are treated as having 0 rows.
• The row counts of existing partitions become stale when data is added or dropped.
• Computing stats for tables with 100,000 or more partitions might fail or be very slow due to the high cost of updating the partition metadata in the Hive Metastore.
• With transient compute resources it is important to minimize the time from starting a new cluster to successfully running queries. Since the cluster might be relatively short-lived, users might prefer to quickly collect stats that are "good enough" as opposed to spending a lot of time and resources on computing full-fidelity stats.
For very large tables, it is often wasteful or impractical to run a full COMPUTE STATS to address the scenarios above on a frequent basis. The sampling feature makes COMPUTE STATS more efficient by processing a fraction of the table data, and the extrapolation feature aims to reduce the frequency at which COMPUTE STATS needs to be re-run by estimating the row count of new and modified partitions.
The sampling and extrapolation features are disabled by default. They can be enabled globally or for specific tables, as follows:
• Set the impalad start-up configuration --enable_stats_extrapolation to enable the features globally.
• To enable them only for a specific table, set the impala.enable.stats.extrapolation table property to true for the table. The table-level property overrides the global setting, so it is also possible to enable sampling and extrapolation globally, but disable it for specific tables by setting the table property to false. For example:

ALTER TABLE test_table SET TBLPROPERTIES("impala.enable.stats.extrapolation"="true");

Note: Why are these features experimental? Due to their probabilistic nature it is possible that these features perform pathologically poorly on tables with extreme data/file/size distributions. Since it is not feasible for us to test all possible scenarios we only cautiously advertise these new capabilities. That said, the features have been thoroughly tested and are considered functionally stable. If you decide to give these features a try, please tell us about your experience at [email protected]! We rely on user feedback to guide future improvements in statistics collection.

Stats Extrapolation

The main idea of stats extrapolation is to estimate the row count of new and modified partitions based on the result of the last COMPUTE STATS. Enabling stats extrapolation changes the behavior of COMPUTE STATS, as well as the cardinality estimation of table scans. COMPUTE STATS no longer computes and stores per-partition row counts, and instead, only computes a table-level row count together with the total number of file bytes in the table at that time. No partition metadata is modified. The input cardinality of a table scan is estimated by converting the data volume of relevant partitions to a row count, based on the table-level row count and file bytes statistics. It is assumed that within the same table, different sets of files with the same data volume correspond to a similar number of rows on average. With extrapolation enabled, the scan cardinality estimation ignores per-partition row counts. It only relies on the table-level statistics and the scanned data volume.


The SHOW TABLE STATS and EXPLAIN commands distinguish between row counts stored in the Hive Metastore, and the row counts extrapolated based on the above process.

Sampling

A TABLESAMPLE clause may be added to COMPUTE STATS to limit the percentage of data to be processed. The final statistics are obtained by extrapolating the statistics from the data sample over the entire table. The extrapolated statistics are stored in the Hive Metastore, just as if no sampling was used. The following example runs COMPUTE STATS over a 10 percent data sample.

COMPUTE STATS test_table TABLESAMPLE SYSTEM(10);

We have found that a 10 percent sampling rate typically offers a good tradeoff between statistics accuracy and execution cost. A sampling rate well below 10 percent has shown poor results and is not recommended.
Important: Sampling-based techniques sacrifice result accuracy for execution efficiency, so your mileage may vary for different tables and columns depending on their data distribution. The extrapolation procedure Impala uses for estimating the number of distinct values per column is inherently non-deterministic, so your results may vary between runs of COMPUTE STATS TABLESAMPLE, even if no data has changed.

COMPUTE INCREMENTAL STATS In Impala 2.1.0 and higher, you can use the COMPUTE INCREMENTAL STATS and DROP INCREMENTAL STATS commands. The INCREMENTAL clauses work with incremental statistics, a specialized feature for partitioned tables. When you compute incremental statistics for a partitioned table, by default Impala only processes those partitions that do not yet have incremental statistics. By processing only newly added partitions, you can keep statistics up to date without incurring the overhead of reprocessing the entire table each time. You can also compute or drop statistics for a specified subset of partitions by including a PARTITION clause in the COMPUTE INCREMENTAL STATS or DROP INCREMENTAL STATS statement. Important: When you run COMPUTE INCREMENTAL STATS on a table for the first time, the statistics are computed again from scratch regardless of whether the table already has statistics. Therefore, expect a one-time resource-intensive operation for scanning the entire table when running COMPUTE INCREMENTAL STATS for the first time on a given table. The metadata for incremental statistics is handled differently from the original style of statistics: • Issuing a COMPUTE INCREMENTAL STATS without a partition clause causes Impala to compute incremental stats for all partitions that do not already have incremental stats. This might be the entire table when running the command for the first time, but subsequent runs should only update new partitions. You can force updating a partition that already has incremental stats by issuing a DROP INCREMENTAL STATS before running COMPUTE INCREMENTAL STATS. • The SHOW TABLE STATS and SHOW PARTITIONS statements now include an additional column showing whether incremental statistics are available for each column. A partition could already be covered by the original type of statistics based on a prior COMPUTE STATS statement, as indicated by a value other than -1 under the #Rows column. Impala query planning uses either kind of statistics when available. • COMPUTE INCREMENTAL STATS takes more time than COMPUTE STATS for the same volume of data. Therefore it is most suitable for tables with large data volume where new partitions are added frequently, making it impractical to run a full COMPUTE STATS operation for each new partition. For unpartitioned tables, or partitioned tables that are loaded once and not updated with new partitions, use the original COMPUTE STATS syntax. • COMPUTE INCREMENTAL STATS uses some memory in the catalogd process, proportional to the number of partitions and number of columns in the applicable table. The memory overhead is approximately 400 bytes for
each column in each partition. This memory is reserved in the catalogd daemon, the statestored daemon, and in each instance of the impalad daemon. • In cases where new files are added to an existing partition, issue a REFRESH statement for the table, followed by a DROP INCREMENTAL STATS and COMPUTE INCREMENTAL STATS sequence for the changed partition. • The DROP INCREMENTAL STATS statement operates only on a single partition at a time. To remove statistics (whether incremental or not) from all partitions of a table, issue a DROP STATS statement with no INCREMENTAL or PARTITION clauses. The following considerations apply to incremental statistics when the structure of an existing table is changed (known as schema evolution): • If you use an ALTER TABLE statement to drop a column, the existing statistics remain valid and COMPUTE INCREMENTAL STATS does not rescan any partitions. • If you use an ALTER TABLE statement to add a column, Impala rescans all partitions and fills in the appropriate column-level values the next time you run COMPUTE INCREMENTAL STATS. • If you use an ALTER TABLE statement to change the data type of a column, Impala rescans all partitions and fills in the appropriate column-level values the next time you run COMPUTE INCREMENTAL STATS. • If you use an ALTER TABLE statement to change the file format of a table, the existing statistics remain valid and a subsequent COMPUTE INCREMENTAL STATS does not rescan any partitions.
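
The following is a minimal sketch of the incremental statistics workflow described above, assuming a hypothetical partitioned table named sales_data with partition key columns year and month:

-- Compute stats only for partitions that do not yet have incremental stats.
COMPUTE INCREMENTAL STATS sales_data;

-- Restrict the operation to a single partition.
COMPUTE INCREMENTAL STATS sales_data PARTITION (year=2021, month=4);

-- Force recomputation for a partition that already has incremental stats.
DROP INCREMENTAL STATS sales_data PARTITION (year=2021, month=4);
COMPUTE INCREMENTAL STATS sales_data PARTITION (year=2021, month=4);

-- Remove all statistics, incremental or not, for the entire table.
DROP STATS sales_data;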

Runtime Filtering Runtime filtering is a wide-ranging optimization feature available in Impala. When only a fraction of the data in a table is needed for a query against a partitioned table or to evaluate a join condition, Impala determines the appropriate conditions while the query is running, and broadcasts that information to all the impalad nodes that are reading the table so that they can avoid unnecessary I/O to read partition data, and avoid unnecessary network transmission by sending only the subset of rows that match the join keys across the network. Runtime filtering is primarily used: • To optimize queries against large partitioned tables (under the name dynamic partition pruning) • To optimize joins of large tables The following terms are used in this topic to describe runtime filtering. plan fragment Impala decomposes each query into smaller units of work that are distributed across the cluster. Wherever possible, a data block is read, filtered, and aggregated by plan fragments executing on the same host. For some operations, such as joins and combining intermediate results into a final result set, data is transmitted across the network from one Impala daemon to another. SCAN and HASH JOIN plan nodes In the Impala query plan, a scan node performs the I/O to read from the underlying data files. Although this is an expensive operation from the traditional database perspective, Hadoop clusters and Impala are optimized to do this kind of I/O in a highly parallel fashion. The major potential cost savings come from using the columnar Parquet format (where Impala can avoid reading data for unneeded columns) and partitioned tables (where Impala can avoid reading data for unneeded partitions). Most Impala joins use the hash join mechanism. (It is only fairly recently that Impala started using the nested-loop join technique, for certain kinds of non-equijoin queries.) hash join In a hash join, when evaluating join conditions from two tables, Impala constructs a hash table in memory with all the different column values from the table on one side of the join. Then, for each
row from the table on the other side of the join, Impala tests whether the relevant column values are in this hash table or not. • A hash join node constructs such an in-memory hash table, then performs the comparisons to identify which rows match the relevant join conditions and should be included in the result set (or at least sent on to the subsequent intermediate stage of query processing). Because some of the input for a hash join might be transmitted across the network from another host, it is especially important from a performance perspective to prune out ahead of time any data that is known to be irrelevant. The more distinct values are in the columns used as join keys, the larger the in-memory hash table and thus the more memory required to process the query. broadcast join vs shuffle join In a broadcast join, the table from one side of the join (typically the smaller table) is sent in its entirety to all the hosts involved in the query. Then each host can compare its portion of the data from the other (larger) table against the full set of possible join keys. In a shuffle join, there is no obvious “smaller” table, and so the contents of both tables are divided up, and corresponding portions of the data are transmitted to each host involved in the query. A shuffle join is sometimes referred to in Impala as a partitioned join. build and probe phases When Impala processes a join query, the build phase is where the rows containing the join key columns, typically for the smaller table, are transmitted across the network and built into an in-memory hash table data structure on one or more destination nodes. The probe phase is where data is read locally (typically from the larger table) and the join key columns are compared to the values in the in-memory hash table. The corresponding input sources (tables, subqueries, and so on) for these phases are referred to as the build side and the probe side.

Runtime Filters The filter that is transmitted between plan fragments is essentially a list of values for join key columns. When this list of values is transmitted in time to a scan node, Impala can filter out non-matching values immediately after reading them, rather than transmitting the raw data to another host to compare against the in-memory hash table on that host. Impala supports the following types of filters based on the payload: • Bloom filter: For HDFS-based tables, the Bloom filter uses a probability-based algorithm to determine all possible matching values. The probability-based aspect means that the filter might include some non-matching values, but if so, that does not cause any inaccuracy in the final results. • Min-max filter: The filter is a data structure representing a minimum and maximum value. These filters are passed to Kudu to reduce the number of rows returned to Impala when scanning the probe side of the join. This filter currently only applies to Kudu tables. Based on how filters from all join instances are aggregated, each of the above filters can be categorized as one of the following: • Broadcast filter: A broadcast filter reflects the complete list of relevant values and can be immediately evaluated by a scan node. Broadcast filters are also classified as local or global. With a local broadcast filter, the information in the filter is used by a subsequent query fragment that is running on the same host that produced the filter. A non-local broadcast filter must be transmitted across the network to a query fragment that is running on a different host. Impala designates 3 hosts to each produce non-local broadcast filters, to guard against the possibility of a single slow host taking too long. Depending on the setting of the RUNTIME_FILTER_MODE query option (LOCAL or GLOBAL), Impala either uses a conservative optimization strategy where filters are only consumed on the same host that produced them, or a more aggressive strategy where filters are eligible to be transmitted across the network. The default for runtime filtering is the GLOBAL setting.
• Partitioned filter: A partitioned filter reflects only the values processed by one host in the cluster; all the partitioned filters must be combined into one (by the coordinator node) before the scan nodes can use the results to accurately filter the data as it is read from storage.

File Format Considerations for Runtime Filtering Parquet tables get the most benefit from the runtime filtering optimizations. Runtime filtering can speed up join queries against partitioned or unpartitioned Parquet tables, and single-table queries against partitioned Parquet tables. For other file formats (text, Avro, RCFile, and SequenceFile), runtime filtering speeds up queries against partitioned tables only. Because partitioned tables can use a mixture of formats, Impala produces the filters in all cases, even if they are not ultimately used to optimize the query.

Wait Intervals for Runtime Filters Because it takes time to produce runtime filters, especially for partitioned filters that must be combined by the coordinator node, there is a time interval above which it is more efficient for the scan nodes to go ahead and construct their intermediate result sets, even if that intermediate data is larger than optimal. If it only takes a few seconds to produce the filters, it is worth the extra time if pruning the unnecessary data can save minutes in the overall query time. You can specify the maximum wait time in milliseconds using the RUNTIME_FILTER_WAIT_TIME_MS query option. By default, each scan node waits for up to 1 second (1000 milliseconds) for filters to arrive. If all filters have not arrived within the specified interval, the scan node proceeds, using whatever filters did arrive to help avoid reading unnecessary data. If a filter arrives after the scan node begins reading data, the scan node applies that filter to the data that is read after the filter arrives, but not to the data that was already read. If the cluster is relatively busy and your workload contains many resource-intensive or long-running queries, consider increasing the wait time so that complicated queries do not miss opportunities for optimization. If the cluster is lightly loaded and your workload contains many small queries taking only a few seconds, consider decreasing the wait time to avoid the 1 second delay for each query.
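
For example, you might adjust the wait interval along the lines of the following sketch; the values shown are hypothetical starting points rather than recommendations:

SET RUNTIME_FILTER_WAIT_TIME_MS=10000;  -- busy cluster, long-running queries
SET RUNTIME_FILTER_WAIT_TIME_MS=500;    -- lightly loaded cluster, many short queries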

Query Options for Runtime Filtering The following query options control runtime filtering. • RUNTIME_FILTER_MODE This query option controls how extensively the filters are transmitted between hosts. By default, it is set to the highest level (GLOBAL). • The other query options are tuning knobs that you typically only adjust after doing performance testing, and that you might want to change only for the duration of a single expensive query. • MAX_NUM_RUNTIME_FILTERS • DISABLE_ROW_RUNTIME_FILTERING • RUNTIME_FILTER_MAX_SIZE • RUNTIME_FILTER_MIN_SIZE • RUNTIME_BLOOM_FILTER_SIZE

Runtime Filtering and Query Plans In the same way the query plan displayed by the EXPLAIN statement includes information about predicates used by each plan fragment, it also includes annotations showing whether a plan fragment produces or consumes a runtime filter. • A plan fragment that produces a filter includes an annotation such as runtime filters: filter_id <- table.column • A plan fragment that consumes a filter includes an annotation such as runtime filters: filter_id -> table.column


Setting the query option EXPLAIN_LEVEL=2 adds additional annotations showing the type of the filter: • filter_id[bloom] (for HDFS-based tables) • filter_id[min_max] (for Kudu tables) The query profile (displayed by the PROFILE command in impala-shell) contains both the EXPLAIN plan and more detailed information about the internal workings of the query. The profile output includes the Filter routing table section with information about each filter based on its ID.

Tuning and Troubleshooting Queries that Use Runtime Filtering These tuning and troubleshooting procedures apply to queries that are resource-intensive enough, long-running enough, and frequent enough that you can devote special attention to optimizing them individually. Use the EXPLAIN statement and examine the runtime filters: lines to determine whether runtime filters are being applied to the WHERE predicates and join clauses that you expect. For example, runtime filtering does not apply to queries that use the nested loop join mechanism due to non-equijoin operators. Make sure statistics are up-to-date for all tables involved in the queries. Use the COMPUTE STATS statement after loading data into non-partitioned tables, and COMPUTE INCREMENTAL STATS after adding new partitions to partitioned tables. If join queries involving large tables use unique columns as the join keys, for example joining a primary key column with a foreign key column, the overhead of producing and transmitting the filter might outweigh the performance benefit because not much data could be pruned during the early stages of the query. For such queries, consider setting the query option RUNTIME_FILTER_MODE=OFF.
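
The following sketch illustrates this troubleshooting sequence; the table and column names are hypothetical:

-- Check the runtime filters: lines in the plan to confirm where filters
-- are produced and consumed.
EXPLAIN SELECT c.name FROM customers c JOIN orders o ON c.id = o.customer_id;

-- Keep statistics current so the planner makes good filtering decisions.
COMPUTE STATS customers;
COMPUTE INCREMENTAL STATS orders;  -- assumes orders is a partitioned table

-- If the join keys are nearly unique and little data can be pruned,
-- the filter overhead may outweigh the benefit for this session.
SET RUNTIME_FILTER_MODE=OFF;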

Partitioning Use partitioning to speed up queries that retrieve data based on values from one or more columns. Partitioning is a technique for physically dividing the data during loading based on values from one or more columns. For example, with a school_records table partitioned on a year column, there is a separate data directory for each different year value, and all the data for that year is stored in a data file in that directory. A query that includes a WHERE condition such as YEAR=1966, YEAR IN (1989,1999), or YEAR BETWEEN 1984 AND 1989 can examine only the data files from the appropriate directory or directories, greatly reducing the amount of data to read and test. Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes. You can add, drop, set the expected file format, or set the HDFS location of the data files for individual partitions within an Impala table.
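
A minimal sketch of such a table follows; the non-partition columns are assumptions for illustration:

CREATE TABLE school_records (student_id INT, last_name STRING)
  PARTITIONED BY (year SMALLINT)
  STORED AS PARQUET;

-- Only the directories for the matching year values are read.
SELECT COUNT(*) FROM school_records WHERE year BETWEEN 1984 AND 1989;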

When to Use Partitioned Tables Partitioning is typically appropriate for: • Tables that are very large, where reading the entire data set takes an impractical amount of time. • Tables that are always or almost always queried with conditions on the partitioning columns. In our example of a table partitioned by year, SELECT COUNT(*) FROM school_records WHERE year = 1985 is efficient, only examining a small fraction of the data; but SELECT COUNT(*) FROM school_records has to process a separate data file for each year, resulting in more overall work than in an unpartitioned table. You would probably not partition this way if you frequently queried the table based on last name, student ID, and so on without testing the year. • Columns that have reasonable cardinality (number of different values). If a column only has a small number of values, for example Male or Female, you do not gain much efficiency by eliminating only about 50% of the data to read for each query. If a column has only a few rows matching each value, the number of directories to process can become a limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting data in multi-megabyte blocks. For example, you might partition census
data by year, store sales data by year and month, and web traffic data by year, month, and day. (Some users with high volumes of incoming data might even partition down to the individual hour and minute.) • Data that already passes through an extract, transform, and load (ETL) pipeline. The values of the partitioning columns are stripped from the original data files and represented by directory names, so loading data into a partitioned table involves some sort of transformation or preprocessing.

SQL Statements for Partitioned Tables In terms of Impala SQL syntax, partitioning affects these statements: • CREATE TABLE: You specify a PARTITIONED BY clause when creating the table to identify names and data types of the partitioning columns. These columns are not included in the main list of columns for the table. • CREATE TABLE AS SELECT: Use the PARTITIONED BY clause to create a partitioned table, copy data into it, and create new partitions based on the values in the inserted data. • ALTER TABLE: Add or drop partitions, to work with different portions of a huge data set. You can designate the HDFS directory that holds the data files for a specific partition. With data partitioned by date values, you might “age out” data that is no longer relevant. If you are creating a partition for the first time and specifying its location, for maximum efficiency, use a single ALTER TABLE statement including both the ADD PARTITION and LOCATION clauses, rather than separate statements with ADD PARTITION and SET LOCATION clauses. • INSERT: When you insert data into a partitioned table, you identify the partitioning columns. One or more values from each inserted row are not stored in data files, but instead determine the directory where that row is stored. You can also specify which partition to load a set of data into, with INSERT OVERWRITE statements; you can replace the contents of a specific partition but you cannot append data to a specific partition. By default, if an INSERT statement creates any new subdirectories underneath a partitioned table, those subdirectories are assigned default HDFS permissions for the impala user. To make each subdirectory have the same permissions as its parent directory in HDFS, specify the --insert_inherit_permissions startup option for the Impala Daemon. • Although the syntax of the SELECT statement is the same whether or not the table is partitioned, the way queries interact with partitioned tables can have a dramatic impact on performance and scalability. The mechanism that lets queries skip certain partitions during a query is known as partition pruning. • SHOW PARTITIONS: Displays information about each partition in a table.
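
The following sketch shows a few of these statements against the hypothetical school_records table defined earlier; the staging table name is an assumption:

-- Add a partition and designate its HDFS directory in a single statement.
ALTER TABLE school_records ADD PARTITION (year=2021)
  LOCATION '/warehouse/school_records/year=2021';

-- Replace the contents of a single partition.
INSERT OVERWRITE school_records PARTITION (year=2021)
  SELECT student_id, last_name FROM school_records_staging;

-- List the partitions and their sizes.
SHOW PARTITIONS school_records;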

Static and Dynamic Partitioning Clauses Specifying all the partition columns in a SQL statement is called static partitioning, because the statement affects a single predictable partition. For example, you use static partitioning with an ALTER TABLE statement that affects only one partition, or with an INSERT statement that inserts all values into the same partition:

INSERT INTO t1 PARTITION(x=10, y='a') SELECT c1 FROM some_other_table;

When you specify some partition key columns in an INSERT statement but leave out their values, Impala determines which partition to insert each row into at run time. This technique is called dynamic partitioning.

INSERT INTO t1 PARTITION(x, y='b') SELECT c1, c2 FROM some_other_table;
-- Create new partition if necessary based on variable year, month, and day; insert a single value.
INSERT INTO weather PARTITION (year, month, day) SELECT 'cloudy',2014,4,21;
-- Create new partition if necessary for specified year and month but variable day; insert a single value.
INSERT INTO weather PARTITION (year=2014, month=04, day) SELECT 'sunny',22;

The more key columns you specify in the PARTITION clause, the fewer columns you need in the SELECT list. The trailing columns in the SELECT list are substituted in order for the partition key columns with no specified value.


Refreshing a Single Partition The REFRESH statement is typically used with partitioned tables when new data files are loaded into a partition by some non-Impala mechanism, such as a Hive or Spark job. The REFRESH statement makes Impala aware of the new data files so that they can be used in Impala queries. Because partitioned tables typically contain a high volume of data, the REFRESH operation for a full partitioned table can take significant time. You can include a PARTITION (partition_spec) clause in the REFRESH statement so that only a single partition is refreshed. For example, REFRESH big_table PARTITION (year=2017, month=9, day=30). The partition spec must include all the partition key columns.

Partition Key Columns The columns you choose as the partition keys should be ones that are frequently used to filter query results in important, large-scale queries. Popular examples are some combination of year, month, and day when the data has associated time values, and geographic region when the data is associated with some place. • For time-based data, split out the separate parts into their own columns, because Impala cannot partition based on a TIMESTAMP column. • The data type of the partition columns does not have a significant effect on the storage required, because the values from those columns are not stored in the data files; rather, they are represented as strings inside HDFS directory names. • You can enable the OPTIMIZE_PARTITION_KEY_SCANS query option to speed up queries that only refer to partition key columns, such as SELECT MAX(year). This setting is not enabled by default because the query behavior is slightly different if the table contains partition directories without actual data inside. • All the partition key columns must be scalar types. • When Impala queries data stored in HDFS, it is most efficient to use multi-megabyte files to take advantage of the HDFS block size. For Parquet tables, the block size (and ideal size of the data files) is 256 MB. Therefore, avoid specifying too many partition key columns, which could result in individual partitions containing only small amounts of data. For example, if you receive 1 GB of data per day, you might partition by year, month, and day; while if you receive 5 GB of data per minute, you might partition by year, month, day, hour, and minute. If you have data with a geographic component, you might partition based on postal code if you have many megabytes of data for each postal code, but if not, you might partition by some larger region such as city, state, or country. If you frequently run aggregate functions such as MIN(), MAX(), and COUNT(DISTINCT) on partition key columns, consider enabling the OPTIMIZE_PARTITION_KEY_SCANS query option, which optimizes such queries.
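
A sketch of these guidelines follows, using a hypothetical web_traffic table; because Impala cannot partition on a TIMESTAMP column, the time parts are stored as separate partition key columns:

CREATE TABLE web_traffic (url STRING, ts TIMESTAMP)
  PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
  STORED AS PARQUET;

-- Speed up queries that refer only to partition key columns.
SET OPTIMIZE_PARTITION_KEY_SCANS=1;
SELECT MAX(year) FROM web_traffic;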

Setting Different File Formats for Partitions in a Table Partitioned tables have the flexibility to use different file formats for different partitions. For example, if you originally received data in text format, then received new data in RCFile format, and eventually began receiving data in Parquet format, all that data could reside in the same table for queries. You just need to ensure that the table is structured so that the data files that use different file formats reside in separate partitions. For example, here is how you might switch from text to Parquet data as you receive data for different years:

[localhost:21000] > CREATE TABLE census (name STRING) PARTITIONED BY (year SMALLINT);
[localhost:21000] > ALTER TABLE census ADD PARTITION (year=2012); -- Text format;
[localhost:21000] > ALTER TABLE census ADD PARTITION (year=2013); -- Text format switches to Parquet before data loaded;
[localhost:21000] > ALTER TABLE census PARTITION (year=2013) SET FILEFORMAT PARQUET;

[localhost:21000] > INSERT INTO census PARTITION (year=2012) VALUES ('Smith'),('Jones'),('Lee'),('Singh');
[localhost:21000] > INSERT INTO census PARTITION (year=2013) VALUES ('Flores'),('Bogomolov'),('Cooper'),('Appiah');


At this point, the HDFS directory for year=2012 contains a text-format data file, while the HDFS directory for year=2013 contains a Parquet data file. As always, when loading non-trivial data, you would use INSERT ... SELECT or LOAD DATA to import data in large batches, rather than INSERT ... VALUES which produces small files that are inefficient for real-world queries. For other file types that Impala cannot create natively, you can switch into Hive and issue the ALTER TABLE ... SET FILEFORMAT statements and INSERT or LOAD DATA statements there. After switching back to Impala, issue a REFRESH table_name statement so that Impala recognizes any partitions or new data added through Hive.

Dropping Partitions What happens to the data files when a partition is dropped depends on whether the partitioned table is designated as internal or external. For an internal (managed) table, the data files are deleted. For example, if data in the partitioned table is a copy of raw data files stored elsewhere, you might save disk space by dropping older partitions that are no longer required for reporting, knowing that the original data is still available if needed later. For an external table, the data files are left alone. For example, dropping a partition without deleting the associated files lets Impala consider a smaller set of partitions, improving query efficiency and reducing overhead for DDL operations on the table; if the data is needed again later, you can add the partition again.
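
For example, a sketch of aging out an old partition from the hypothetical school_records table:

-- For an internal (managed) table, this also removes the underlying data files.
ALTER TABLE school_records DROP PARTITION (year=1950);

-- IF EXISTS avoids an error if the partition was already removed.
ALTER TABLE school_records DROP IF EXISTS PARTITION (year=1950);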

Using Partitioning with Kudu Tables Kudu tables use a more fine-grained partitioning scheme than tables containing HDFS data files. You specify a PARTITION BY clause with the CREATE TABLE statement to identify how to divide the values from the partition key columns.
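
A minimal sketch of Kudu-style partitioning follows; the table and column names are assumptions for illustration:

CREATE TABLE kudu_metrics (
  id BIGINT,
  metric STRING,
  val DOUBLE,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;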

Keeping Statistics Up to Date for Partitioned Tables Because the COMPUTE STATS statement can be resource-intensive to run on a partitioned table as new partitions are added, Impala includes a variation of this statement that allows computing statistics on a per-partition basis such that stats can be incrementally updated when new partitions are added. The COMPUTE INCREMENTAL STATS variation computes statistics only for partitions that were added or changed since the last COMPUTE INCREMENTAL STATS statement, rather than the entire table. It is typically used for tables where a full COMPUTE STATS operation takes too long to be practical each time a partition is added or dropped.

Partition Pruning for Queries Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions. If you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer resources and are thus proportionally faster and more scalable. For example, if a table is partitioned by columns YEAR, MONTH, and DAY, then WHERE clauses such as WHERE year = 2013, WHERE year < 2010, or WHERE year BETWEEN 1995 AND 1998 allow Impala to skip the data files in all partitions outside the specified range. Likewise, WHERE year = 2013 AND month BETWEEN 1 AND 3 could prune even more partitions, reading the data files for only a portion of one year. If a view applies to a partitioned table, any partition pruning considers the clauses on both the original query and any additional WHERE predicates in the query that refers to the view. In queries involving both analytic functions and partitioned tables, partition pruning only occurs for columns named in the PARTITION BY clause of the analytic function call. For example, if an analytic function query has a clause such as WHERE year=2016, the way to make the query prune all other YEAR partitions is to include PARTITION BY year in the analytic function call; for example, OVER (PARTITION BY year,other_columns other_analytic_clauses).
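
For example, a sketch using the census table from the surrounding examples; only the YEAR=2010 partition is read because the analytic function also partitions by year:

SELECT name,
       ROW_NUMBER() OVER (PARTITION BY year ORDER BY name) AS rn
FROM census
WHERE year = 2010;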


Checking if Partition Pruning Happens for a Query To check the effectiveness of partition pruning for a query, check the EXPLAIN output for the query before running it. The following example shows a table with 3 partitions, where the query only reads 1 of them. The notation #partitions=1/3 in the EXPLAIN plan confirms that Impala can do the appropriate partition pruning.

[localhost:21000] > INSERT INTO census PARTITION (year=2010) VALUES ('Smith'),('Jones');
[localhost:21000] > INSERT INTO census PARTITION (year=2011) VALUES ('Smith'),('Jones'),('Doe');
[localhost:21000] > INSERT INTO census PARTITION (year=2012) VALUES ('Smith'),('Doe');
[localhost:21000] > EXPLAIN select name from census where year=2010;
+------+
| Explain String |
+------+
| PLAN FRAGMENT 1 |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 1 |
| UNPARTITIONED |
| |
| 0:SCAN HDFS |
| table=predicate_propagation.census #partitions=1/3 size=12B |
+------+

For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the SUMMARY command immediately after running the query. For a more detailed analysis, look at the output of the PROFILE command; it includes this same summary report near the start of the profile output.

Predicate Propagation Impala can do partition pruning in cases where the partition key column is not directly compared to a constant. Using the predicate propagation technique, Impala applies the transitive property to other parts of the WHERE clause. In this example, the census table includes another column indicating when the data was collected, which happens in 10-year intervals. Even though the query does not compare the partition key column (YEAR) to a constant value, Impala can deduce that only the partition YEAR=2010 is required, and again only reads 1 out of 3 partitions.

[localhost:21000] > CREATE TABLE census (name STRING, census_year INT) PARTITIONED BY (year INT);
[localhost:21000] > INSERT INTO census PARTITION (year=2010) VALUES ('Smith',2010),('Jones',2010);
[localhost:21000] > INSERT INTO census PARTITION (year=2011) VALUES ('Smith',2020),('Jones',2020),('Doe',2020);
[localhost:21000] > INSERT INTO census PARTITION (year=2012) VALUES ('Smith',2020),('Doe',2020);
[localhost:21000] >
[localhost:21000] > EXPLAIN select name from census where year = census_year and census_year=2010;
+------+
| Explain String |
+------+
| PLAN FRAGMENT 1 |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 1 |
| UNPARTITIONED |
| |
| 0:SCAN HDFS |
| table=predicate_propagation.census #partitions=1/3 size=22B |
| predicates: census_year = 2010, year = census_year |
+------+

Dynamic Partition Pruning Impala supports two types of partition pruning. • Static partition pruning: The conditions in the WHERE clause are analyzed to determine in advance which partitions can be safely skipped. • Dynamic partition pruning: Information about the partitions is collected during the query execution, and Impala prunes unnecessary partitions. The information is not available in advance before runtime. For example, if partition key columns are compared to literal values in a WHERE clause, Impala can perform static partition pruning during the planning phase to only read the relevant partitions:

-- The query only needs to read 3 partitions whose key values are known ahead of time.
-- That's static partition pruning.
SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);

Dynamic partition pruning involves using information only available at run time, such as the result of a subquery. The following example shows a simple case of dynamic partition pruning.

CREATE TABLE yy (s STRING) PARTITIONED BY (year INT);
INSERT INTO yy PARTITION (year) VALUES ('1999', 1999), ('2000', 2000),
  ('2001', 2001), ('2010', 2010), ('2018', 2018);
COMPUTE STATS yy;

CREATE TABLE yy2 (s STRING, year INT);
INSERT INTO yy2 VALUES ('1999', 1999), ('2000', 2000), ('2001', 2001);
COMPUTE STATS yy2;

-- The following query reads an unknown number of partitions, whose key values
-- are only known at run time. The runtime filters line shows the
-- information used in query fragment 02 to decide which partitions to skip.

EXPLAIN SELECT s FROM yy WHERE year IN (SELECT year FROM yy2);
+------+
| PLAN-ROOT SINK |
| | |
| 04:EXCHANGE [UNPARTITIONED] |
| | |
| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST] |
| | hash predicates: year = year |
| | runtime filters: RF000 <- year |
| | |
| |--03:EXCHANGE [BROADCAST] |
| | | |
| | 01:SCAN HDFS [default.yy2] |
| | partitions=1/1 files=1 size=620B |
| | |
| 00:SCAN HDFS [default.yy] |
| partitions=5/5 files=5 size=1.71KB |
| runtime filters: RF000 -> year |
+------+

SELECT s FROM yy WHERE year IN (SELECT year FROM yy2); -- Returns 3 rows from yy
PROFILE;

In the above example, Impala evaluates the subquery, sends the subquery results to all Impala nodes participating in the query, and then each impalad daemon uses the dynamic partition pruning optimization to read only the partitions with the relevant key values. The output query plan from the EXPLAIN statement shows that runtime filters are enabled. The plan also shows that it expects to read all 5 partitions of the yy table, indicating that static partition pruning will not happen. The Filter summary in the PROFILE output shows that the scan node filtered out files based on the runtime filter produced for dynamic partition pruning.

Filter 0 (1.00 MB):
 - Files processed: 3
 - Files rejected: 1 (1)
 - Files total: 3 (3)

Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the ON clauses of the join predicates might normally require reading data from all partitions of certain tables. If the WHERE clauses of the query refer to the partition key columns, Impala can now often skip reading many of the partitions while evaluating the ON clauses. The dynamic partition pruning optimization reduces the amount of I/O and the amount of intermediate data stored and transmitted across the network during the query. Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries against partitioned tables.

HDFS Caching Impala can use the HDFS caching feature to make more effective use of RAM so that repeated queries can take advantage of data “pinned” in memory regardless of how much data is processed overall. The HDFS caching feature lets you designate a subset of frequently accessed data to be pinned permanently in memory, remaining in the cache across multiple queries and never being evicted. This technique is suitable for tables or partitions that are frequently accessed and are small enough to fit entirely within the HDFS memory cache. For example, you might designate several dimension tables to be pinned in the cache, to speed up many different join queries that reference them. Or in a partitioned table, you might pin a partition holding data from the most recent time period because that data will be queried intensively; then when the next set of data arrives, you could unpin the previous partition and pin the partition holding the new data. The performance gain comes from two aspects: • Reading from RAM instead of disk • Accessing the data straight from the cache area instead of copying from one RAM area to another This yields further performance improvement over the standard OS caching mechanism, which still results in memory-to-memory copying of cached data. Because accessing HDFS cached data avoids a memory-to-memory copy operation, queries involving cached data require less memory on the Impala side than the equivalent queries on uncached data. Due to a limitation of HDFS, zero-copy reads are not supported with encryption. Cloudera recommends not using HDFS caching for Impala data files in encryption zones. The queries fall back to the normal read path during query execution, which might cause some performance overhead. Because this Impala performance feature relies on HDFS infrastructure, it only applies to Impala tables that use HDFS data files. HDFS caching for Impala does not apply to HBase tables, S3 tables, Kudu tables, or Isilon tables.


Using HDFS Caching for Impala Tables and Partitions Begin by choosing which tables or partitions to cache. For example, these might be lookup tables that are accessed by many different join queries, or partitions corresponding to the most recent time period that are analyzed by different reports or ad-hoc queries. In your SQL statements, you specify logical divisions such as tables and partitions to be cached. Impala translates these requests into HDFS-level directives that apply to particular directories and files. For example, given a partitioned table CENSUS with a partition key column YEAR, you could choose to cache all or part of the data as follows.

-- Cache the entire table (all partitions).
ALTER TABLE census SET CACHED IN 'pool_name';

-- Remove the entire table from the cache.
ALTER TABLE census SET UNCACHED;

-- Cache a portion of the table (a single partition).
-- If the table is partitioned by multiple columns (such as year, month, day),
-- the ALTER TABLE command must specify values for all those columns.
ALTER TABLE census PARTITION (year=1960) SET CACHED IN 'pool_name';

-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any
-- single host when the same data block is processed multiple times.
ALTER TABLE census PARTITION (year=1970) SET CACHED IN 'pool_name' WITH REPLICATION = 4;

-- At each stage, check the volume of cached data.
-- For large tables or partitions, the background loading might take some time,
-- so you might have to wait and reissue the statement until all the data
-- has finished being loaded into the cache.
SHOW TABLE STATS census;
+------+------+------+------+------+------+
| year | #Rows | #Files | Size | Bytes Cached | Format |
+------+------+------+------+------+------+
| 1900 | -1 | 1 | 11B | NOT CACHED | TEXT |
| 1940 | -1 | 1 | 11B | NOT CACHED | TEXT |
| 1960 | -1 | 1 | 11B | 11B | TEXT |
| 1970 | -1 | 1 | 11B | NOT CACHED | TEXT |
| Total | -1 | 4 | 44B | 11B | |
+------+------+------+------+------+------+

CREATE TABLE and ALTER TABLE considerations: The HDFS caching feature affects the Impala CREATE TABLE statement as follows: • You can put a CACHED IN 'pool_name' clause and optionally a WITH REPLICATION = number_of_hosts clause at the end of a CREATE TABLE statement to automatically cache the entire contents of the table, including any partitions added later. The pool_name is a pool that you previously set up with the hdfs cacheadmin command in HDFS. • Once a table is designated for HDFS caching through the CREATE TABLE statement, if new partitions are added later through ALTER TABLE ... ADD PARTITION statements, the data in those new partitions is automatically cached in the same pool. • If you want to perform repetitive queries on a subset of data from a large table, and it is not practical to designate the entire table or specific partitions for HDFS caching, you can create a new cached table with just a subset of the data by using CREATE TABLE ... CACHED IN 'pool_name' AS SELECT ... WHERE .... When you are finished with generating reports from this subset of data, drop the table and both the data files and the data cached in RAM are automatically deleted.
• If you have designated a table or partition as cached through the CREATE TABLE or ALTER TABLE statements, subsequent attempts to relocate the table or partition through an ALTER TABLE ... SET LOCATION statement will fail. You must issue an ALTER TABLE ... SET UNCACHED statement for the table or partition first. Otherwise, Impala would lose track of some cached data files and have no way to uncache them later. • The optional WITH REPLICATION clause for CREATE TABLE and ALTER TABLE lets you specify a replication factor, the number of hosts on which to cache the same data blocks. When Impala processes a cached data block, where the cache replication factor is greater than 1, Impala randomly selects a host that has a cached copy of that data block. This optimization avoids excessive CPU usage on a single host when the same cached data block is processed multiple times. Cloudera recommends specifying a value greater than or equal to the HDFS block replication factor. INSERT and LOAD DATA considerations: • When HDFS caching is enabled for a table or partition, new data files are cached automatically when they are added to the appropriate directory in HDFS, without the need for a REFRESH statement in Impala. Impala automatically performs a REFRESH once the new data is loaded into the HDFS cache. • If you perform an INSERT or LOAD DATA through Hive, Impala only recognizes the new data files after a REFRESH table_name statement in Impala. • If the cache pool is entirely full, or becomes full before all the requested data can be cached, the Impala DDL statement returns an error. This is to avoid situations where only some of the requested data could be cached. DROP TABLE and ALTER TABLE DROP PARTITION considerations: The HDFS caching feature interacts with the Impala DROP TABLE and ALTER TABLE ... DROP PARTITION statements as follows: • When you issue a DROP TABLE or ALTER TABLE ... DROP PARTITION for a table that is entirely cached, or has some partitions cached, the DROP TABLE succeeds and all the cache directives Impala submitted for that table are removed from the HDFS cache system. • The underlying data files are removed if the dropped table is an internal table, or the dropped partition is in its default location underneath an internal table. The data files are left alone if the dropped table is an external table, or if the dropped partition is in a non-default location. • If you drop an HDFS cache pool through the hdfs cacheadmin command, all the Impala data files are preserved, just no longer cached. After a subsequent REFRESH, SHOW TABLE STATS reports 0 bytes cached for each associated Impala table or partition. • If you designated the data files as cached through the hdfs cacheadmin command, and the data files are left behind as described in the previous item, the data files remain cached. Impala only removes the cache directives submitted by Impala through the CREATE TABLE or ALTER TABLE statements. • One file can have multiple redundant cache directives. The directives all have unique IDs, and owners so that the system can tell them apart. SHOW TABLE STATS and SHOW PARTITIONS considerations: For each table or partition, the SHOW TABLE STATS or SHOW PARTITIONS statement displays the number of bytes currently cached by the HDFS caching feature. A value of 0, or a smaller number than the overall size of the table or partition, indicates that the cache request has been submitted but the data has not been entirely loaded into memory yet. 
If there are no cache directives in place for that table or partition, the result set displays NOT CACHED. SELECT considerations: The Impala HDFS caching feature interacts with the SELECT statement and query performance as follows: • Impala automatically reads from memory any data that has been designated as cached and actually loaded into the HDFS cache. (It could take some time after the initial request to fully populate the cache for a table with large size or many partitions.)
• Impala queries take advantage of HDFS cached data regardless of whether the cache directive was issued by Impala or externally through the hdfs cacheadmin command, for example for an external table where the cached data files might be accessed by several different Hadoop components. • If your query returns a large result set, the time reported for the query could be dominated by the time needed to print the results on the screen. To measure the time for the underlying query processing, query the COUNT() of the big result set, which does all the same processing but only prints a single line to the screen. • Impala automatically randomizes which host processes a cached HDFS block, to avoid CPU hotspots. For tables where HDFS caching is not applied, Impala designates which host to process a data block using an algorithm that estimates the load on each host. If CPU hotspots still arise during queries, you can enable additional randomization for the scheduling algorithm for non-HDFS cached data by setting the SCHEDULE_RANDOM_REPLICA query option. • If you drop a cache pool with the hdfs cacheadmin command, Impala queries against the associated data files will still work, by falling back to reading the files from disk. After performing a REFRESH on the table, Impala reports the number of bytes cached as 0 for all associated tables and partitions.

Performance Considerations for HDFS Caching with Impala Impala supports efficient reads from data that is pinned in memory through HDFS caching. Impala takes advantage of the HDFS API and reads the data from memory rather than from disk whether the data files are pinned using Impala DDL statements, or using the command-line mechanism where you specify HDFS paths. When you examine the output of the impala-shell SUMMARY command, or look in the metrics report for the impalad daemon, you see how many bytes are read from the HDFS cache. For example, this excerpt from a query profile illustrates that all the data read during a particular phase of the query came from the HDFS cache, because the BytesRead and BytesReadDataNodeCache values are identical.

HDFS_SCAN_NODE (id=0):(Total: 11s114ms, non-child: 11s114ms, % non-child: 100.00%)
 - AverageHdfsReadThreadConcurrency: 0.00
 - AverageScannerThreadConcurrency: 32.75
 - BytesRead: 10.47 GB (11240756479)
 - BytesReadDataNodeCache: 10.47 GB (11240756479)
 - BytesReadLocal: 10.47 GB (11240756479)
 - BytesReadShortCircuit: 10.47 GB (11240756479)
 - DecompressionTime: 27s572ms

For queries involving smaller amounts of data, or in single-user workloads, you might not notice a significant difference in query response time with or without HDFS caching. Even with HDFS caching turned off, the data for the query might still be in the Linux OS buffer cache. The benefits become clearer as data volume increases, and especially as the system processes more concurrent queries. HDFS caching improves the scalability of the overall system. That is, it prevents query performance from declining when the workload outstrips the capacity of the Linux OS cache. • For small amounts of data, the query speedup might not be noticeable in terms of wall clock time. The performance might be roughly the same with HDFS caching turned on or off, due to recently used data being held in the OS cache. The difference is more pronounced with: • Data volumes (for all queries running concurrently) that exceed the size of the OS cache. • A busy cluster running many concurrent queries, where the reduction in memory-to-memory copying and overall memory usage during queries results in greater scalability and throughput. When data is requested to be pinned in memory, that process happens in the background without blocking access to the data while the caching is in progress. Loading the data from disk could take some time. Impala reads each HDFS data block from memory if it has been pinned already, or from disk if it has not been pinned yet.


Memory Considerations The amount of data that you can pin on each node through the HDFS caching mechanism is subject to a quota that is enforced by the underlying HDFS service. Before requesting to pin an Impala table or partition in memory, check that its size does not exceed this quota. Note: Because the HDFS cache consists of combined memory from all the DataNodes in the cluster, cached tables or partitions can be bigger than the amount of HDFS cache memory on any single host.

The Impala HDFS caching feature interacts with the Impala memory limits as follows: • The maximum size of each HDFS cache pool is specified externally to Impala, through the hdfs cacheadmin command. • All the memory used for HDFS caching is separate from the impalad daemon address space and does not count towards the limits of the --mem_limit startup option, MEM_LIMIT query option, or further limits imposed through YARN resource management or the Linux cgroups mechanism.

HDFS Block Skew For best performance of Impala parallel queries, the work is divided equally across hosts in the cluster, and all hosts take approximately equal time to finish their work. If one host takes substantially longer than others, the extra time needed for the slow host can become the dominant factor in query performance. Therefore, one of the first steps in performance tuning for Impala is to detect and correct such conditions. The main cause of uneven performance that you can correct within Impala is skew in the number of HDFS data blocks processed by each host, where some hosts process substantially more data blocks than others. This condition can occur because of uneven distribution of the data values themselves, for example causing certain data files or partitions to be large while others are very small. (Although it is possible to have unevenly distributed data without any problems with the distribution of HDFS blocks.) Block skew could also be due to the underlying block allocation policies within HDFS, the replication factor of the data files, and the way that Impala chooses the host to process each data block. The most convenient way to detect block skew, or slow-host issues in general, is to examine the “executive summary” information from the query profile after running a query: • In impala-shell, issue the SUMMARY command immediately after the query is complete, to see just the summary information. If you detect issues involving skew, you might switch to issuing the PROFILE command, which displays the summary information followed by a detailed performance analysis. • In the Cloudera Manager interface or the Impala debug web UI, click on the Profile link associated with the query after it is complete. The executive summary information is displayed early in the profile output. For each phase of the query, you see an Avg Time and a Max Time value, along with #Hosts indicating how many hosts are involved in that query phase. For all the phases with #Hosts greater than one, look for cases where the maximum time is substantially greater than the average time. Focus on the phases that took the longest, for example, those taking multiple seconds rather than milliseconds or microseconds. If you detect that some hosts take longer than others, first rule out non-Impala causes. One reason that some hosts could be slower than others is if those hosts have less capacity than the others, or if they are substantially busier due to unevenly distributed non-Impala workloads: • For clusters running Impala, keep the relative capacities of all hosts roughly equal. Any cost savings from including some underpowered hosts in the cluster will likely be outweighed by poor or uneven performance, and the time spent diagnosing performance issues. • If non-Impala workloads cause slowdowns on some hosts but not others, use the appropriate load-balancing techniques for the non-Impala components to smooth out the load across the cluster. If the hosts on your cluster are evenly powered and evenly loaded, examine the detailed profile output to determine which host is taking longer than others for the query phase in question. Examine how many bytes are processed during that phase on that host, how much memory is used, and how many bytes are transmitted across the network.


The most common symptom is a higher number of bytes read on one host than others, due to one host being requested to process a higher number of HDFS data blocks. This condition is more likely to occur when the number of blocks accessed by the query is relatively small. For example, if you have a 10-node cluster and the query processes 10 HDFS blocks, each node might not process exactly one block. If one node sits idle while another node processes two blocks, the query could take twice as long as if the data was perfectly distributed. Possible solutions in this case include: • If the query is artificially small, perhaps for benchmarking purposes, scale it up to process a larger data set. For example, if some nodes read 10 HDFS data blocks while others read 11, the overall effect of the uneven distribution is much lower than when some nodes did twice as much work as others. As a guideline, aim for a “sweet spot” where each node reads 2 GB or more from HDFS per query. Queries that process lower volumes than that could experience inconsistent performance that smooths out as queries become more data-intensive. • If the query processes only a few large blocks, so that many nodes sit idle and cannot help to parallelize the query, consider reducing the overall block size. For example, you might adjust the PARQUET_FILE_SIZE query option before copying or converting data into a Parquet table. Or you might adjust the granularity of data files produced earlier in the ETL pipeline by non-Impala components. In Impala 2.0 and later, the default Parquet block size is 256 MB, reduced from 1 GB, to improve parallelism for common cluster sizes and data volumes. • Reduce the amount of compression applied to the data. For text data files, the highest degree of compression (gzip) produces unsplittable files that are more difficult for Impala to process in parallel, and require extra memory during processing to hold the compressed and uncompressed data simultaneously. For binary formats such as Parquet and Avro, compression can result in fewer data blocks overall, but remember that when queries process relatively few blocks, there is less opportunity for parallel execution and many nodes in the cluster might sit idle. Note that when Impala writes Parquet data with the query option COMPRESSION_CODEC=NONE enabled, the data is still typically compact due to the encoding schemes used by Parquet, independent of the final compression step.
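
For example, a sketch of reducing the Parquet block size before rewriting a table; the 128 MB value and the table names are assumptions used to illustrate the idea:

-- 134217728 bytes = 128 MB, half of the default 256 MB Parquet block size.
SET PARQUET_FILE_SIZE=134217728;
INSERT OVERWRITE parquet_table SELECT * FROM text_table;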

Understanding Performance using EXPLAIN Plan To understand the high-level performance considerations for Impala queries, read the output of the EXPLAIN statement for the query. You can get the EXPLAIN plan without actually running the query itself. The EXPLAIN statement gives you an outline of the logical steps that a query will perform, such as how the work will be distributed among the nodes and how intermediate results will be combined to produce the final result set. You can see these details before actually running the query. You can use this information to check that the query will not operate in some very unexpected or inefficient way. Read the EXPLAIN plan from bottom to top: • The last part of the plan shows the low-level details such as the expected amount of data that will be read, where you can judge the effectiveness of your partitioning strategy and estimate how long it will take to scan a table based on total data size and the size of the cluster. • As you work your way up, next you see the operations that will be parallelized and performed on each Impala node. • At the higher levels, you see how data flows when intermediate result sets are combined and transmitted from one node to another. • The EXPLAIN_LEVEL query option lets you customize how much detail to show in the EXPLAIN plan depending on whether you are doing high-level or low-level tuning, dealing with logical or physical aspects of the query.


The example below shows how the standard EXPLAIN output moves from the lowest (physical) level to the higher (logical) levels. The query begins by scanning a certain amount of data; each node performs an aggregation operation (evaluating COUNT(*)) on some subset of data that is local to that node; the intermediate results are transmitted back to the coordinator node (labelled here as the EXCHANGE node); lastly, the intermediate results are summed to display the final result.

[impalad-host:21000] > EXPLAIN select count(*) from customer_address;
+------+
| Explain String |
+------+
| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HDFS [default.customer_address] |
| partitions=1/1 size=5.25MB |
+------+

The EXPLAIN plan is also printed at the beginning of the query PROFILE report, for convenience in examining both the logical and physical aspects of the query side-by-side. The amount of detail displayed in the EXPLAIN output is controlled by the EXPLAIN_LEVEL query option. You can customize how much detail to show in the EXPLAIN plan depending on whether you are doing high-level or low-level tuning, dealing with logical or physical aspects of the query. You typically increase this setting from standard to extended when double-checking the presence of table and column statistics during performance tuning, or when estimating query resource usage in conjunction with the resource management features.
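For example, you might raise the detail level for a session before re-checking a plan during tuning (a brief sketch; customer_address reuses the table from the earlier example):

SET EXPLAIN_LEVEL=extended;
-- The extended output adds per-node resource estimates and row-count estimates for each plan node.
EXPLAIN SELECT COUNT(*) FROM customer_address;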

Understanding Performance using SUMMARY Report
For an overview of the physical performance characteristics for a query, issue the SUMMARY command in impala-shell immediately after executing a query. This condensed information shows which phases of execution took the most time, and how the estimates for memory usage and number of rows at each phase compare to the actual values.
As with the EXPLAIN plan, you can easily spot potential performance bottlenecks in the SUMMARY report. Like the PROFILE output, the SUMMARY report is available after the query is run, and it displays actual timing statistics.
The SUMMARY report is also printed at the beginning of the query profile report described in the PROFILE output, for convenience in examining high-level and low-level aspects of the query side-by-side.
When the MT_DOP query option is set to a value larger than 0, the #Inst column in the output shows the number of fragment instances. Impala decomposes each query into smaller units of work that are distributed across the cluster, and these units are referred to as fragments. When the MT_DOP query option is set to 0, the #Inst column in the output shows the same value as the #Hosts column, since there is exactly one fragment for each host.
For example, here is a query involving an aggregate function on a single node. The different stages of the query and their timings are shown (rolled up for all nodes), along with estimated and actual values used in planning the query. In this case, the AVG() function is computed for a subset of data on each node (stage 01) and then the aggregated


results from all nodes are combined at the end (stage 03). You can see which stages took the most time, and whether any estimates were substantially different than the actual data distribution. (When examining the time values, be sure to consider the suffixes such as us for microseconds and ms for milliseconds, rather than just looking for the largest numbers.)

> SELECT AVG(ss_sales_price) FROM store_sales WHERE ss_coupon_amt = 0;
> SUMMARY;
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+-----------------+
| Operator     | #Hosts | #Inst | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail          |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+-----------------+
| 03:AGGREGATE | 1      | 1     | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB | -1 B          | MERGE FINALIZE  |
| 02:EXCHANGE  | 1      | 1     | 0ns      | 0ns      | 1     | 1          | 0 B      | -1 B          | UNPARTITIONED   |
| 01:AGGREGATE | 1      | 1     | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB | 10.00 MB      |                 |
| 00:SCAN HDFS | 1      | 1     | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB | 432.00 MB     | tpc.store_sales |
+--------------+--------+-------+----------+----------+-------+------------+----------+---------------+-----------------+

Notice how the longest initial phase of the query is measured in seconds (s), while later phases working on smaller intermediate results are measured in milliseconds (ms) or even nanoseconds (ns).

Understanding Performance using Query Profile
To understand the detailed performance characteristics for a query, issue the PROFILE command in impala-shell immediately after executing a query. This low-level information includes physical details about memory, CPU, I/O, and network usage, and thus is only available after the query is actually run.
The PROFILE command, available in impala-shell, produces a detailed low-level report showing how a query was executed. Unlike the EXPLAIN plan, this information is only available after the query has finished. It shows physical details such as the number of bytes read, maximum memory usage, and so on for each node. You can use this information to determine if the query is I/O-bound or CPU-bound, whether some network condition is imposing a bottleneck, whether a slowdown is affecting some nodes but not others, and to check that recommended configuration settings such as short-circuit local reads are in effect.
By default, time values in the profile output reflect the wall-clock time taken by an operation. For values denoting system time or user time, the measurement unit is reflected in the metric name, such as ScannerThreadsSysTime or ScannerThreadsUserTime. For example, a multi-threaded I/O operation might show a small figure for wall-clock time, while the corresponding system time is larger, representing the sum of the CPU time taken by each thread. Or a wall-clock time figure might be larger because it counts time spent waiting, while the corresponding system and user time figures only measure the time while the operation is actively using CPU cycles.
The EXPLAIN plan is also printed at the beginning of the query profile report, for convenience in examining both the logical and physical aspects of the query side-by-side. The EXPLAIN_LEVEL query option also controls the verbosity of the EXPLAIN output printed by the PROFILE command.
The Per Node Profiles section in the profile output includes the following metrics that can be controlled by the RESOURCE_TRACE_RATIO query option.
• CpuIoWaitPercentage
• CpuSysPercentage
• CpuUserPercentage


• HostDiskReadThroughput: All data read by the host as part of the execution of this query (spilling), by the HDFS data node, and by other processes running on the same system.
• HostDiskWriteThroughput: All data written by the host as part of the execution of this query (spilling), by the HDFS data node, and by other processes running on the same system.
• HostNetworkRx: All data received by the host as part of the execution of this query, other queries, and other processes running on the same system.
• HostNetworkTx: All data transmitted by the host as part of the execution of this query, other queries, and other processes running on the same system.
In earlier releases, the impala-shell 'profile' command returned only the profile for the most recent query attempt; there was no option to get the original query profile (the profile of the first query attempt that failed) from impala-shell. This release adds support for returning both the original and retried profiles for a retried query. The 'profile' command in impala-shell now accepts a new set of options. The syntax is now:

PROFILE [ALL | LATEST | ORIGINAL]

If you specify 'ALL', both the latest and original profiles are printed. If you specify 'LATEST', only the latest profile is printed. If you specify 'ORIGINAL', only the original profile is printed. The default behavior is equivalent to specifying 'LATEST'.
Note: Support for this has only been added to HS2, given that Beeswax is being deprecated soon. The new 'profile' options have no effect when the Beeswax protocol is used.
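For example, after a query that Impala transparently retried, you might compare both attempts (a sketch; the query itself is arbitrary and the session is assumed to use the HS2 protocol):

-- Run the query, then ask for both the original and the retried profiles.
SELECT COUNT(*) FROM customer_address;
PROFILE ALL;

PROFILE ORIGINAL would instead print only the profile of the first, failed attempt.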

Scalability Considerations

The size of your cluster and the volume of data influence query performance. Typically, adding more cluster capacity reduces problems due to memory limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of scalability issues, such as a single slow node that causes performance problems for queries.

Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables containing relatively few, large data files. Schemas containing thousands of tables, or tables containing thousands of partitions, can encounter performance issues during startup or during DDL operations such as ALTER TABLE statements.

Scalability Considerations for the Impala StateStore
If it takes a very long time for a cluster to start up and Impala reports the message "This Impala daemon is not ready to accept user requests", the StateStore might be taking too long to send the entire catalog topic to the cluster. In this case, consider setting the Load Catalog in Background field to false in your Catalog Service configuration. This setting stops the StateStore from loading the entire catalog into memory at cluster startup. Instead, metadata for each table is loaded when the table is accessed for the first time.

Effect of Buffer Pool on Memory Usage Most of the memory needed is reserved at the beginning of the query, avoiding cases where a query might run for a long time before failing with an out-of-memory error. The actual memory estimates and memory buffers are typically smaller than before, so that more queries can run concurrently or process larger volumes of data than previously. Increase the MAX_ROW_SIZE query option setting when querying tables with columns containing long strings, many columns, or other combinations of factors that produce very large rows. If Impala encounters rows that are too large to process with the default query option settings, the query fails with an error message suggesting to increase the MAX_ROW_SIZE setting.
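For example, a minimal sketch of raising the row-size limit for a session before querying a hypothetical table with very wide rows:

-- Default MAX_ROW_SIZE is 512 KB; raise it only as far as the widest expected row.
SET MAX_ROW_SIZE=4mb;
SELECT id, very_long_text FROM wide_rows_table;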


SQL Operations that Spill to Disk
Certain memory-intensive operations write temporary data to disk (known as spilling to disk) when Impala is close to exceeding its memory limit on a particular host.
What kinds of queries might spill to disk: Several SQL clauses and constructs require memory allocations that could trigger spilling to disk:
• When a query uses a GROUP BY clause for columns with millions or billions of distinct values, Impala keeps a similar number of temporary results in memory, to accumulate the aggregate results for each value in the group.
• When large tables are joined together, Impala keeps the values of the join columns from one table in memory, to compare them to incoming values from the other table.
• When a large result set is sorted by the ORDER BY clause, each node sorts its portion of the result set in memory.
• The DISTINCT and UNION operators build in-memory data structures to represent all values found so far, to eliminate duplicates as the query progresses.
When the spill-to-disk feature is activated for a join node within a query, Impala does not produce any runtime filters for that join operation on that host. Other join nodes within the query are not affected.
How Impala handles scratch disk space for spilling: Impala uses intermediate files during large sorts, joins, aggregations, or analytic function operations. The files are removed when the operation finishes. By default, intermediate files are stored in the directory /tmp/impala-scratch.
You can specify locations of the intermediate files in one of the following ways:
• By starting the impalad daemon with the --scratch_dirs="path_to_directory" configuration option.
• By specifying a different location in Cloudera Manager in the Impala Daemon Scratch Directories field.
With either option above:
• You can specify a single directory or a comma-separated list of directories.
• You can specify an optional capacity quota per scratch directory, using the colon (:) as the delimiter. A capacity quota of -1 or 0 is the same as no quota for the directory.
• The scratch directories must be on the local filesystem, not in HDFS.
• You might specify different directory paths for different hosts, depending on the capacity and speed of the available storage devices.
If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a warning message to its log. Impala successfully starts (with a warning written to the log) if it cannot create or read and write files in one of the scratch directories.
The following are examples for specifying scratch directories.

Config option Description

--scratch_dirs=/dir1,/dir2 Use /dir1 and /dir2 as scratch directories with no capacity quota.

--scratch_dirs=/dir1,/dir2:25G Use /dir1 and /dir2 as scratch directories, with no capacity quota on /dir1 and a 25 GB quota on /dir2.

--scratch_dirs=/dir1:5MB,/dir2 Use /dir1 and /dir2 as scratch directories with the capacity quota of 5MB on /dir1 and no quota on /dir2.

--scratch_dirs=/dir1:-1,/dir2:0 Use /dir1 and /dir2 as scratch directories with no capacity quota.

Allocation from a scratch directory will fail if the specified limit for the directory is exceeded.


If Impala encounters an error reading or writing files in a scratch directory during a query, Impala logs the error, and the query fails.
Memory usage for SQL operators: The memory required to spill to disk is reserved up front, and you can examine it in the EXPLAIN plan when the EXPLAIN_LEVEL query option is set to 2 or higher. If an operator accumulates more data than can fit in the reserved memory, it can either reserve more memory to continue processing data in memory or start spilling data to temporary scratch files on disk. Thus, operators with spill-to-disk support can adapt to different memory constraints by using however much memory is available to speed up execution, yet tolerate low memory conditions by spilling data to disk. The amount of data depends on the portion of the data being handled by that host, and thus the operator may end up consuming different amounts of memory on different hosts.
Avoiding queries that spill to disk: Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid this situation by using the following steps:
1. Detect how often queries spill to disk, and how much temporary data is written. Refer to the following sources:
• The output of the PROFILE command in the impala-shell interpreter. This data shows the memory usage for each host and in total across the cluster. The WriteIoBytes counter reports how much data was written to disk for each operator during the query.
• In Impala Queries in Cloudera Manager, you can see the peak memory usage for a query, combined across all nodes in the cluster.
• In the Queries tab in the Impala debug web user interface, select the query to examine and click the corresponding Profile link. This data breaks down the memory usage for a single host within the cluster, the host whose web interface you are connected to.
2. Use one or more techniques to reduce the possibility of the queries spilling to disk:
• Increase the Impala memory limit if practical. For example, using the SET MEM_LIMIT SQL statement, increase the available memory by more than the amount of temporary data written to disk on a particular node.
• Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and reduce the amount of memory required on each node.
• On a cluster with resources shared between Impala and other Hadoop components, use resource management features to allocate more memory for Impala.
• If the memory pressure is due to running many concurrent queries rather than a few memory-intensive ones, consider using the Impala admission control feature to lower the limit on the number of concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in memory usage and improve overall response times.
• Tune the queries with the highest memory requirements, using one or more of the following techniques:
• Run the COMPUTE STATS statement for all tables involved in large-scale joins and aggregation queries.
• Minimize your use of STRING columns in join columns. Prefer numeric values instead.
• Examine the EXPLAIN plan to understand the execution strategy being used for the most resource-intensive queries.
• If Impala still chooses a suboptimal execution strategy even with statistics available, or if it is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints to the most resource-intensive queries to select the right execution strategy.
• If your queries experience substantial performance overhead due to spilling, enable the DISABLE_UNSAFE_SPILLS query option. This option prevents queries whose memory usage is likely to be exorbitant from spilling to disk. As you tune problematic queries using the preceding steps, fewer and fewer will be cancelled by this option setting. When to use DISABLE_UNSAFE_SPILLS: The DISABLE_UNSAFE_SPILLS query option is suitable for an environment with ad hoc queries whose performance characteristics and memory usage are not known in advance. It prevents “worst-case scenario” queries


that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while developing new SQL code, even though it is turned off for existing applications. Organizations where table and column statistics are generally up-to-date might leave this option turned on all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline results in a table with no statistics. Turning on DISABLE_UNSAFE_SPILLS lets you “fail fast” in this case and immediately gather statistics or tune the problematic queries. Some organizations might leave this option turned off. For example, you might have tables large enough that the COMPUTE STATS takes substantial time to run, making it impractical to re-run after loading new data. If you have examined the EXPLAIN plans of your queries and know that they are operating efficiently, you might leave DISABLE_UNSAFE_SPILLS turned off. In that case, you know that any queries that spill will not go overboard with their memory consumption.
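The following sketch combines two of the techniques above for a single session while tuning; the table names and the 2 GB figure are illustrative only:

-- Give the query more headroom than the temporary data it previously spilled,
-- and prevent spilling for queries the planner considers unsafe (for example, missing statistics).
SET MEM_LIMIT=2gb;
SET DISABLE_UNSAFE_SPILLS=true;
SELECT c.region, SUM(s.amount)
  FROM sales s JOIN customers c ON s.customer_id = c.customer_id
 GROUP BY c.region;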

Limits on Query Size and Complexity There are hard-coded limits on the maximum size and complexity of queries. Currently, the maximum number of expressions in a query is 2000. You might exceed the limits with large or deeply nested queries produced by business intelligence tools or other query generators. If you have the ability to customize such queries or the query generation logic that produces them, replace sequences of repetitive expressions with single operators such as IN or BETWEEN that can represent multiple values or ranges. For example, instead of a large number of OR clauses:

WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...

use a single IN clause:

WHERE val IN (1,2,6,100,...)

Scalability Considerations for Impala I/O
Impala parallelizes its I/O operations aggressively, therefore the more disks you can attach to each host, the better. Impala retrieves data from disk so quickly using bulk read operations on large blocks that most queries are CPU-bound rather than I/O-bound.
Because the kind of sequential scanning typically done by Impala queries does not benefit much from the random-access capabilities of SSDs, spinning disks typically provide the most cost-effective kind of storage for Impala data, with little or no performance penalty as compared to SSDs.
Resource management features such as YARN, Llama, and admission control typically constrain the amount of memory, CPU, or overall number of queries in a high-concurrency environment. Currently, there is no throttling mechanism for Impala I/O.

Scalability Considerations for Table Layout Due to the overhead of retrieving and updating table metadata in the Metastore database, try to limit the number of columns in a table to a maximum of approximately 2000. Although Impala can handle wider tables than this, the Metastore overhead can become significant, leading to query performance that is slower than expected based on the actual data volume. To minimize overhead related to the Metastore database and Impala query planning, try to limit the number of partitions for any partitioned table to a few tens of thousands. If the volume of data within a table makes it impractical to run exploratory queries, consider using the TABLESAMPLE clause to limit query processing to only a percentage of data within the table. This technique reduces the overhead for query startup, I/O to read the data, and the amount of network, CPU, and memory needed to process intermediate results during the query.
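For example, a sketch of an exploratory query that scans roughly 10 percent of a hypothetical wide table:

-- TABLESAMPLE SYSTEM(10) limits the scan to approximately 10% of the table's data files.
SELECT COUNT(DISTINCT customer_id)
  FROM clickstream_events TABLESAMPLE SYSTEM(10);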


Kerberos-Related Network Overhead for Large Clusters When Impala starts up, or after each kinit refresh, Impala sends a number of simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same time as these Kerberos-related requests. While these authentication requests are being processed, any submitted Impala queries will fail. During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, so other secure services might be affected temporarily.

Avoiding CPU Hotspots for HDFS Cached Data You can use the HDFS caching feature to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions. To avoid hotspots, include the WITH REPLICATION clause with the CREATE TABLE or ALTER TABLE statements for tables that use HDFS caching. This clause allows more than one host to cache the relevant data blocks, so the CPU load can be shared, reducing the load on any one host. The work for HDFS cached data is divided better among all the hosts that have cached replicas for a particular data block. When more than one host has a cached replica for a data block, Impala assigns the work of processing that block to whichever host has done the least work (in terms of number of bytes read) for the current query. If hotspots persist even with this load-based scheduling algorithm, you can enable the query option SCHEDULE_RANDOM_REPLICA=TRUE to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached data block if the scheduling algorithm encounters a tie when deciding which host has done the least work.
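As a sketch, assuming an HDFS cache pool named 'prod_pool' already exists and lookup_dimension is a hypothetical frequently accessed table, the replication clause and the query option look like this:

-- Cache the table's blocks on up to 3 hosts so the scan work can be spread out.
ALTER TABLE lookup_dimension SET CACHED IN 'prod_pool' WITH REPLICATION = 3;

-- Break scheduling ties randomly instead of always picking the same least-loaded host.
SET SCHEDULE_RANDOM_REPLICA=TRUE;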

Scalability Considerations for File Handle Caching One scalability aspect that affects heavily loaded clusters is the load on the metadata layer from looking up the details as each file is opened. On HDFS, that can lead to increased load on the NameNode, and on object stores such as S3 or ABFS, this can lead to an excessive number of metadata requests. For example, a query that does a full table scan on a partitioned table may need to read thousands of partitions, each partition containing multiple data files. Accessing each column of a Parquet file also involves a separate “open” call, further increasing the load on the NameNode. High NameNode overhead can add startup time (that is, increase latency) to Impala queries, and reduce overall throughput for non-Impala workloads that also require accessing HDFS files. You can reduce the number of calls made to your file system's metadata layer by enabling the file handle caching feature. Data files that are accessed by different queries, or even multiple times within the same query, can be accessed without a new “open” call and without fetching the file details multiple times. Impala supports file handle caching for the following file systems: • HDFS The cache_remote_file_handles flag controls local and remote file handle caching for an impalad. It is recommended that you use the default value of true as this caching prevents your NameNode from overloading when your cluster has many remote HDFS reads. • S3 The cache_s3_file_handles impalad flag controls the S3 file handle caching. The feature is enabled by default with the flag set to true. • ABFS The cache_abfs_file_handles impalad flag controls the ABFS file handle caching. The feature is enabled by default with the flag set to true. The feature is enabled by default with 20,000 file handles to be cached. To change the value, set the configuration option Maximum Cached File Handles (max_cached_file_handles) to a non-zero value for each Impala daemon (impalad). From the initial default value of 20000, adjust upward if NameNode request load is still significant, or downward if it is more important to reduce the extra memory usage on each host. Each cache entry


consumes 6 KB, meaning that caching 20,000 file handles requires up to 120 MB on each Impala executor. The exact memory usage varies depending on how many file handles have actually been cached; memory is freed as file handles are evicted from the cache. If a manual operation moves a file to the trashcan while the file handle is cached, Impala still accesses the contents of that file. This is a change from prior behavior. Previously, accessing a file that was in the trashcan would cause an error. This behavior only applies to non-Impala methods of removing files, not the Impala mechanisms such as TRUNCATE TABLE or DROP TABLE. If files are removed, replaced, or appended by operations outside of Impala, the way to bring the file information up to date is to run the REFRESH statement on the table. File handle cache entries are evicted as the cache fills up, or based on a timeout period when they have not been accessed for some time. To evaluate the effectiveness of file handle caching for a particular workload, issue the PROFILE statement in impala-shell or examine query profiles in the Impala Web UI. Look for the ratio of CachedFileHandlesHitCount (ideally, should be high) to CachedFileHandlesMissCount (ideally, should be low). Before starting any evaluation, run several representative queries to “warm up” the cache because the first time each data file is accessed is always recorded as a cache miss. To see metrics about file handle caching for each impalad instance, examine the following fields on the /metrics page in the Impala Web UI: • impala-server.io.mgr.cached-file-handles-miss-count • impala-server.io.mgr.num-cached-file-handles
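As a rough sketch (the values are illustrative, not recommendations), the relevant impalad startup flags might look like the following; in Cloudera Manager the same settings appear as Impala Daemon configuration fields:

--max_cached_file_handles=32768
--cache_remote_file_handles=true
--cache_s3_file_handles=true

The first value raises the cache above the 20,000 default; the other two flags are already true by default and are shown only for completeness.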

Scaling Limits and Guidelines Consider and respect the scalability limitations provided in this topic in order to achieve optimal scalability and performance in Impala. For example, while you might be able to create a table with 2000 columns, you will experience performance problems while querying the table. This topic does not cover functional limitations in Impala. Unless noted otherwise, the limits listed in this topic were tested and certified. The limits noted as "generally safe" are not certified, but recommended as generally safe. A safe range is not a hard limit as unforeseen errors or troubles in your particular environment can affect the range.

Deployment Limits • Number of Impalad Executors: 150 nodes • Number of Impalad Coordinators: 1 coordinator for at most every 50 executors

Data Storage Limits There are no hard limits for the following, but you will experience gradual performance degradation as you increase these numbers. • Number of databases • Number of tables - total, per database • Number of partitions - total, per table • Number of files - total, per table, per table per partition • Number of views - total, per database • Number of user-defined functions - total, per database • Parquet • Number of columns per row group • Number of row groups per block • Number of HDFS blocks per file


Schema Design Limits • Number of columns • 300 for Kudu tables • 1000 for other types of tables

Security Limits • Number of roles: 10,000

Query Limits - Compile Time • Maximum number of columns in a query, included in a SELECT list, INSERT, and in an expression: no limit • Number of tables referenced: no limit • Number of plan nodes: no limit • Number of plan fragments: no limit • Depth of expression tree: 1000 hard limit • Width of expression tree: 10,000 hard limit

• Codegen: Very deeply nested expressions within queries can exceed internal Impala limits, leading to excessive memory usage. Setting the query option disable_codegen=true may reduce the impact, at a cost of longer query runtime.
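For example, a session-level sketch for a machine-generated query that approaches the expression-depth limit:

SET DISABLE_CODEGEN=true;
-- generated_report_query is a hypothetical view produced by a BI tool.
SELECT * FROM generated_report_query;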

Dedicated Coordinator
When scalability bottlenecks occur, you can assign one dedicated role to each Impala daemon host, either as a coordinator or as an executor, to address the issues.
Each host that runs the Impala Daemon acts as both a coordinator and as an executor by default, managing metadata caching, query compilation, and query execution. In this configuration, Impala clients can connect to any Impala daemon and send query requests.
During highly concurrent workloads for large-scale queries, the dual roles can cause scalability issues because:
• The extra work required for a host to act as the coordinator could interfere with its capacity to perform other work for the later phases of the query. For example, coordinators can experience significant network and CPU overhead with queries containing a large number of query fragments. Each coordinator caches metadata for all table partitions and data files, which requires coordinators to be configured with a large JVM heap. Executor-only Impala daemons should be configured with the default JVM heaps, which leaves more memory available to process joins, aggregations, and other operations performed by query executors.
• Having a large number of hosts act as coordinators can cause unnecessary network overhead, or even timeout errors, as each of those hosts communicates with the StateStore daemon for metadata updates.
• The "soft limits" imposed by the admission control feature are more likely to be exceeded when there are a large number of heavily loaded hosts acting as coordinators.
The following factors can further exacerbate the above issues:
• High number of concurrent query fragments due to query concurrency and/or query complexity
• Large metadata topic size related to the number of partitions/files/blocks
• High number of coordinator nodes
• High number of coordinators used in the same resource pool
If such scalability bottlenecks occur, you can assign one dedicated role to each Impala daemon host, either as a coordinator or as an executor, to address the issues.
• All explicit or load-balanced client connections must go to the coordinator hosts. These hosts perform the network communication to keep metadata up-to-date and route query results to the appropriate clients. The dedicated


coordinator hosts do not participate in I/O-intensive operations such as scans, or in CPU-intensive operations such as aggregations.
• The executor hosts perform the intensive I/O, CPU, and memory operations that make up the bulk of the work for each query. The executors do communicate with the StateStore daemon for membership status, but the dedicated executor hosts do not process the final result sets for queries.
Using dedicated coordinators offers the following benefits:
• Reduces memory usage by limiting the number of Impala nodes that need to cache metadata.
• Provides better concurrency by avoiding coordinator bottlenecks.
• Eliminates query over-admission.
• Reduces resource utilization, especially network utilization, on the StateStore daemon by limiting metadata broadcast to a subset of nodes.
• Improves reliability and performance for highly concurrent workloads by reducing workload stress on coordinators. Dedicated coordinators require 50% or fewer connections and threads.
• Reduces the number of explicit metadata refreshes required.
• Improves diagnosability: if a bottleneck or other performance issue arises on a specific host, you can narrow down the cause more easily because each host is dedicated to specific operations within the overall Impala workload.
In this configuration with dedicated coordinators and executors, you cannot connect to the dedicated executor hosts through clients such as impala-shell or business intelligence tools, as only the coordinator nodes support client connections.
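As a sketch of how the role split is typically expressed, each impalad is started with a pair of flags (in Cloudera Manager, these correspond to the Impala Daemon role configuration); the grouping of hosts below is hypothetical:

On the hosts that accept client connections:
--is_coordinator=true --is_executor=false

On all other Impala daemon hosts:
--is_coordinator=false --is_executor=true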

Determining the Optimal Number of Dedicated Coordinators
You should have the smallest number of coordinators that still satisfies your workload requirements in a cluster. A rough estimate is 1 coordinator for every 50 executors.
To maintain a healthy state and optimal performance, it is recommended that you keep the peak utilization of all resources used by Impala, including CPU, the number of threads, the number of connections, and RPCs, under 80%.
Consider the following factors to determine the right number of coordinators in your cluster:
• What is the number of concurrent queries?
• What percentage of the workload is DDL?
• What is the average query resource usage at the various stages (merge, runtime filter, result set size, etc.)?
• How many Impala Daemons (impalad) are in the cluster?
• Is there a high availability requirement?
• Compute/storage capacity reduction factor
Start with the below set of steps to determine the initial number of coordinators:
1. If your cluster has less than 10 nodes, we recommend that you configure one dedicated coordinator. Deploy the dedicated coordinator on a DataNode to avoid losing storage capacity. In most cases, one dedicated coordinator is enough to support all workloads on a cluster.
2. Add more coordinators if the dedicated coordinator CPU or network peak utilization is 80% or higher. You might need 1 coordinator for every 50 executors.
3. If the Impala service is shared by multiple workgroups with a dynamic resource pool assigned, use one coordinator per pool to avoid admission control over-admission.
4. If high availability is required, double the number of coordinators. One set acts as an active set and the other as a backup set.

Advanced Tuning
Use the following guidelines to further tune the throughput and stability.
1. The concurrency of DML statements does not typically depend on the number of coordinators or the size of the cluster. Queries that return large result sets (10,000+ rows) consume more CPU and memory resources on the coordinator. Add one or two coordinators if the workload has many such queries.


2. DDL queries, excluding COMPUTE STATS and CREATE TABLE AS SELECT, are executed only on coordinators. If your workload contains many DDL queries running concurrently, you could add one coordinator.
3. CPU contention on coordinators can slow down query execution when concurrency is high, especially for very short queries (<10s). Add more coordinators to avoid CPU contention.
4. On a large cluster with 50+ nodes, the number of network connections from a coordinator to executors can grow quickly as query complexity increases. The growth is much greater on coordinators than on executors. Add a few more coordinators if workloads are complex, that is, if (average number of fragments * number of impalads) > 500 but memory/CPU usage is low, to share the load. Watch IMPALA-4603 and IMPALA-7213 to track the progress on fixing this issue.
5. When using multiple coordinators for DML statements, divide queries into different groups (number of groups = number of coordinators). Configure a separate dynamic resource pool for each group and direct each group of query requests to a specific coordinator. This is to avoid query over-admission.
6. The front-end connection requirement is not a factor in determining the number of dedicated coordinators. Consider setting up a connection pool at the client side instead of adding coordinators. For a short-term solution, you could increase the value of fe_service_threads on coordinators to allow more client connections.
7. In general, you should have a very small number of coordinators, so storage capacity reduction is not a concern. On a very small cluster (less than 10 nodes), deploy a dedicated coordinator on a DataNode to avoid storage capacity reduction.

Estimating Coordinator Resource Usage

Resource: Memory
Safe range: (Max JVM heap setting + query concurrency * query mem_limit) <= 80% of Impala process memory allocation
Notes / CM tsquery to monitor:
Memory usage: SELECT mem_rss WHERE entityName = "Coordinator Instance ID" AND category = ROLE
JVM heap usage (metadata cache): SELECT impala_jvm_heap_current_usage_bytes WHERE entityName = "Coordinator Instance ID" AND category = ROLE (only in release 5.15 and above)

Resource: TCP Connection
Safe range: Incoming + outgoing < 16K
Notes / CM tsquery to monitor:
Incoming connection usage: SELECT thrift_server_backend_connections_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE
Outgoing connection usage: SELECT backends_client_cache_clients_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE

Resource: Threads
Safe range: < 32K
Notes / CM tsquery to monitor: SELECT thread_manager_running_threads WHERE entityName = "Coordinator Instance ID" AND category = ROLE

Resource: CPU
Safe range: Concurrency (non-DDL query concurrency) <= number of virtual cores allocated to Impala per node
Notes / CM tsquery to monitor:
CPU usage estimation should be based on how many cores are allocated to Impala per node, not on the sum of all cores in the cluster. It is recommended that concurrency not be more than the number of virtual cores allocated to Impala per node.
Query concurrency: SELECT total_impala_num_queries_registered_across_impalads WHERE entityName = "IMPALA-1" AND category = SERVICE

If usage of any of the above resources exceeds the safe range, add one more coordinator.

Monitoring Coordinator Resource Usage Using Cloudera Manager, monitor the coordinator resource usage to understand your workload and adjust the number of coordinators according to the guidelines above. The available options are: • Impala Queries tab: Monitor such attributes as DDL queries and Rows produced. • Custom charts: Monitor activities, such as query complexity which is an average fragment count per query (total fragments / total queries). • tsquery: Build the custom charts to monitor and estimate the amount of resource the coordinator needs. The following are sample queries for common resource usage monitoring. Replace entityName values with your coordinator instance id. Per coordinator tsquery

Resource Usage Tsquery

Memory usage SELECT impala_memory_total_used, mem_tracker_process_limit WHERE entityName = "Coordinator Instance ID" AND category = ROLE

JVM heap usage (metadata cache) SELECT impala_jvm_heap_current_usage_bytes WHERE entityName = "Coordinator Instance ID" AND category = ROLE (only in release 5.15 and above)

CPU usage SELECT cpu_user_rate / getHostFact(numCores, 1) * 100, cpu_system_rate / getHostFact(numCores, 1) * 100 WHERE entityName="Coordinator Instance ID"

Network usage (host level) SELECT total_bytes_receive_rate_across_network_interfaces, total_bytes_transmit_rate_across_network_interfaces WHERE entityName="Coordinator Instance ID"

Incoming connection usage SELECT thrift_server_backend_connections_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE

Outgoing connection usage SELECT backends_client_cache_clients_in_use WHERE entityName = "Coordinator Instance ID" AND category = ROLE

Thread usage SELECT thread_manager_running_threads WHERE entityName = "Coordinator Instance ID" AND category = ROLE

Cluster wide tsquery

Resource usage Tsquery

Front-end connection usage SELECT total_thrift_server_beeswax_frontend_connections_in_use_across_impalads, total_thrift_server_hiveserver2_frontend_connections_in_use_across_impalads

Query concurrency SELECT total_impala_num_queries_registered_across_impalads WHERE entityName = "IMPALA-1" AND category = SERVICE


Related Information Managing Metadata in Impala

Hadoop File Formats Support

Impala supports a number of file formats used in Apache Hadoop. Impala can load and query data files produced by other Hadoop components such as Spark, and data files produced by Impala can be used by other components as well. The following sections discuss the procedures, limitations, and performance considerations for using each file format with Impala.
The file format used for an Impala table has significant performance consequences. Some file formats include compression support that affects the size of data on the disk and, consequently, the amount of I/O and CPU resources required to deserialize data. The amounts of I/O and CPU resources required can be a limiting factor in query performance since querying often begins with moving and decompressing data. To reduce the potential impact of this part of the process, data is often compressed. By compressing data, a smaller total number of bytes are transferred from disk to memory. This reduces the amount of time taken to transfer the data, but a tradeoff occurs when the CPU decompresses the content.
For the file formats that Impala cannot write to, create the table from within Impala whenever possible and insert data using another component such as Hive or Spark. See the table below for specific file formats.
The following table lists the file formats that Impala supports.

File Type: Parquet
Format: Structured
Compression Codecs: Snappy, gzip, zstd, LZ4; currently Snappy by default
Impala Can CREATE? Yes.
Impala Can INSERT? Yes: CREATE TABLE, INSERT, LOAD DATA, and query.

File Type: ORC
Format: Structured
Compression Codecs: gzip, Snappy, LZ4; currently gzip by default
Impala Can CREATE? Yes. By default, ORC reads are enabled in Impala 3.4.0 and higher. To disable, set --enable_orc_scanner to false when starting the cluster.
Impala Can INSERT? No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.

File Type: Text
Format: Unstructured
Compression Codecs: bzip2, deflate, gzip, Snappy, zstd
Impala Can CREATE? Yes: CREATE TABLE with no STORED AS clause. The default file format is uncompressed text, with values separated by ASCII 0x01 characters (typically represented as Ctrl-A).
Impala Can INSERT? Yes if uncompressed: CREATE TABLE, INSERT, LOAD DATA, and query. No if compressed. If other kinds of compression are used, you must load data through LOAD DATA, Hive, or manually in HDFS.

File Type: Avro
Format: Structured
Compression Codecs: Snappy, gzip, deflate
Impala Can CREATE? Yes.
Impala Can INSERT? No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.

File Type: RCFile
Format: Structured
Compression Codecs: Snappy, gzip, deflate, bzip2
Impala Can CREATE? Yes.
Impala Can INSERT? No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.

File Type: SequenceFile
Format: Structured
Compression Codecs: Snappy, gzip, deflate, bzip2
Impala Can CREATE? Yes.
Impala Can INSERT? No. Import data by using LOAD DATA on data files already in the right format, or use INSERT in Hive followed by REFRESH table_name in Impala.

Impala supports the following compression codecs: Snappy Recommended for its effective balance between compression ratio and decompression speed. Snappy compression is very fast, but gzip provides greater space savings. Supported for text, RC, Sequence, and Avro files in Impala 2.0 and higher. Gzip Recommended when achieving the highest level of compression (and therefore greatest disk-space savings) is desired. Supported for text, RC, Sequence and Avro files in Impala 2.0 and higher. Deflate Supported for AVRO, RC, Sequence, and text files. Bzip2 Supported for text, RC, and Sequence files in Impala 2.0 and higher.

Zstd For Parquet and text files only. Lz4 For Parquet files only.

Choosing the File Format for a Table Different file formats and compression codecs work better for different data sets. Choosing the proper format for your data can yield performance improvements. Use the following considerations to decide which combination of file format and compression to use for a particular table. • If you are working with existing files that are already in a supported file format, use the same format for the Impala table if performance is acceptable. If the original format does not yield acceptable query performance or resource usage, consider creating a new Impala table with different file format or compression characteristics, and doing a one-time conversion by rewriting the data to the new table. • Text files are convenient to produce through many different tools and are human-readable for ease of verification and debugging. Those characteristics are why text is the default format for an Impala CREATE TABLE statement. However, when performance and resource usage are the primary considerations, use one of the structured file formats that include metadata and built-in compression. A typical workflow might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data directory, and then using the INSERT ... SELECT syntax to rewrite the data into a table using a different, more compact file format.

Using Text Data Files Impala supports using text files as the storage format for input and output. Text files are a convenient format to use for interchange with other applications or scripts that produce or read delimited text files, such as CSV or TSV with commas or tabs for delimiters.


Text files are flexible in their column definitions. For example, a text file could have more fields than the Impala table, and those extra fields are ignored during queries. Or it could have fewer fields than the Impala table, and those missing fields are treated as NULL values in queries. You could have fields that were treated as numbers or timestamps in a table, then use ALTER TABLE ... REPLACE COLUMNS to switch them to strings, or the reverse.

Creating Text Tables You can create tables with specific separator characters to import text files in familiar formats such as CSV, TSV, or pipe-separated with the FIELDS TERMINATED BY clause preceded by the ROW FORMAT DELIMITED clause. For example:

CREATE TABLE tsv(id INT, s STRING, n INT, t TIMESTAMP, b BOOLEAN) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

You can specify the delimiter character '\0' to use the ASCII 0 (nul) character for text tables. You can also use these tables to produce output data files, by copying data into them through the INSERT ... SELECT syntax and then extracting the data files from the Impala data directory. The data files created by any INSERT statements use the Ctrl-A character (hex 01) as a separator between each column value. Issue a DESCRIBE FORMATTED table_name statement to see the details of how each table is represented internally in Impala.
Complex type considerations: Although you can create tables in this file format using the complex types (ARRAY, STRUCT, and MAP), currently, Impala cannot query these types in text tables.

Data Files for Text Tables
When Impala queries a table with data in text format, it consults all the data files in the data directory for that table, with some exceptions:
• Impala ignores any hidden files, that is, files whose names start with a dot or an underscore.
• Impala queries ignore files with extensions commonly used for temporary work files by Hadoop tools. Any files with extensions .tmp or .copying are not considered part of the Impala table. The suffix matching is case-insensitive, so for example Impala ignores both .copying and .COPYING suffixes.
• Impala uses suffixes to recognize when text data files are compressed text. For Impala to recognize the compressed text files, they must have the appropriate file extension corresponding to the compression codec: .bz2, .gz, .snappy, .zst, or .deflate. The extensions can be in uppercase or lowercase.
• Otherwise, the file names are not significant. When you put files into an HDFS directory through ETL jobs, or point Impala to an existing HDFS directory with the CREATE EXTERNAL TABLE statement, or move data files under external control with the LOAD DATA statement, Impala preserves the original filenames.
An INSERT ... SELECT statement produces one data file from each node that processes the SELECT part of the statement. An INSERT ... VALUES statement produces a separate data file for each statement; because Impala is more efficient querying a small number of huge files than a large number of tiny files, the INSERT ... VALUES syntax is not recommended for loading a substantial volume of data. If you find yourself with a table that is inefficient due to too many small data files, reorganize the data into a few large files by doing INSERT ... SELECT to transfer the data to a new table.
Do not surround string values with quotation marks in text data files that you construct. If you need to include the separator character inside a field value, for example to put a string value with a comma inside a CSV-format data file, specify an escape character on the CREATE TABLE statement with the ESCAPED BY clause, and insert that character immediately before any separator characters that need escaping.
Special values within text data files:


• Impala recognizes the literal strings inf for infinity and nan for “Not a Number”, for FLOAT and DOUBLE columns.
• Impala recognizes the literal string \N to represent NULL. When using Sqoop, specify the options --null-non-string and --null-string to ensure all NULL values are represented correctly in the Sqoop output files. \N needs to be escaped as in the below example:

--null-string '\\N' --null-non-string '\\N'
• By default, Sqoop writes NULL values using the string null, which causes a conversion error when such rows are evaluated by Impala. (A workaround for existing tables and data files is to change the table properties through ALTER TABLE name SET TBLPROPERTIES("serialization.null.format"="null").)
• Impala can optionally skip an arbitrary number of header lines from text input files on HDFS based on the skip.header.line.count value in the TBLPROPERTIES field of the table metadata, as in the sketch below.
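The following is a brief sketch of both table properties mentioned above; header_csv and sqoop_import are hypothetical text tables:

-- Skip one header line in each text data file.
ALTER TABLE header_csv SET TBLPROPERTIES('skip.header.line.count'='1');

-- Workaround for existing Sqoop output that wrote NULLs as the string "null".
ALTER TABLE sqoop_import SET TBLPROPERTIES('serialization.null.format'='null');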

Loading Data into Text Tables To load an existing text file into an Impala text table, use the LOAD DATA statement and specify the path of the file in HDFS. That file is moved into the appropriate Impala data directory. To load multiple existing text files into an Impala text table, use the LOAD DATA statement and specify the HDFS path of the directory containing the files. All non-hidden files are moved into the appropriate Impala data directory. Use the DESCRIBE FORMATTED statement to see the HDFS directory where the data files are stored, then use Linux commands such as hdfs dfs -ls hdfs_directory and hdfs dfs -cat hdfs_file to display the contents of an Impala-created text file. When you create a text file for use with an Impala text table, specify \N to represent a NULL value. If a text file has fewer fields than the columns in the corresponding Impala table, all the corresponding columns are set to NULL when the data in that file is read by an Impala query. If a text file has more fields than the columns in the corresponding Impala table, the extra fields are ignored when the data in that file is read by an Impala query. You can also use manual HDFS operations such as hdfs dfs -put or hdfs dfs -cp to put data files in the data directory for an Impala table. When you copy or move new data files into the HDFS directory for the Impala table, issue a REFRESH table_name statement in impala-shell before issuing the next query against that table, to make Impala recognize the newly added files.
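For example, a sketch using the tsv table created earlier and a hypothetical HDFS staging path:

-- Move all non-hidden files from the staging directory into the table's data directory.
LOAD DATA INPATH '/user/etl/staging/tsv_batch' INTO TABLE tsv;

-- If files were instead copied in with hdfs dfs -put, tell Impala to pick them up.
REFRESH tsv;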

Query Performance for Text Tables
Data stored in text format is relatively bulky, and not as efficient to query as binary formats such as Parquet. For the tables used in your most performance-critical queries, look into using more efficient alternate file formats.
For frequently queried data, you might load the original text data files into one Impala table, then use an INSERT statement to transfer the data to another table that uses the Parquet file format. The data is converted automatically as it is stored in the destination table.
For more compact data, consider using text data compressed in the gzip, bzip2, or Snappy formats. However, these compressed formats are not “splittable” so there is less opportunity for Impala to parallelize queries on them. You also have the choice to convert the data to Parquet using an INSERT ... SELECT statement to copy the original data into a Parquet table.
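A minimal sketch of that conversion, with hypothetical table names:

-- Clone the column definitions, then rewrite the data in Parquet format.
CREATE TABLE web_logs_parquet LIKE web_logs_text STORED AS PARQUET;
INSERT INTO web_logs_parquet SELECT * FROM web_logs_text;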

Using bzip2, deflate, gzip, Snappy-Compressed, or zstd Text Files Impala supports using text data files that employ bzip2, deflate, gzip, Snappy, or zstd compression. These compression types are primarily for convenience within an existing ETL pipeline rather than maximum performance. Although it requires less I/O to read compressed text than the equivalent uncompressed text, files compressed by these codecs are not “splittable” and therefore cannot take full advantage of the Impala parallel query capability. Impala can read compressed text files written by Hive.


As each Snappy-compressed file is processed, the node doing the work reads the entire file into memory and then decompresses it. Therefore, the node must have enough memory to hold both the compressed and uncompressed data from the text file. The memory required to hold the uncompressed data is difficult to estimate in advance, potentially causing problems on systems with low memory limits or with resource management enabled. This memory overhead is reduced for bzip2-, deflate-, gzip-, and zstd-compressed text files. The compressed data is decompressed as it is read, rather than all at once.
To create a table to hold compressed text, create a text table with no special compression options. Specify the delimiter and escape character if required, using the ROW FORMAT clause.
Because Impala can query compressed text files but currently cannot write them, produce the compressed text files outside Impala and use the LOAD DATA statement or manual HDFS commands to move them to the appropriate Impala data directory. (Or, you can use CREATE EXTERNAL TABLE and point the LOCATION attribute at a directory containing existing compressed text files.)
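For example, a sketch of pointing an external text table at a directory of gzip-compressed files produced by an ETL job; the table name and path are hypothetical, and Impala recognizes the compression from the .gz file extension:

CREATE EXTERNAL TABLE etl_events_gz (event_time TIMESTAMP, message STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/user/etl/events_gz';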

Using Parquet Data Files
Impala allows you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries that Impala is best at.
Parquet is suitable for queries scanning particular columns within a table, for example, to query “wide” tables with many columns, or to perform aggregation operations such as SUM() and AVG() that need to process most or all of the values from a column.
Each Parquet data file written by Impala contains the values for a set of rows (referred to as the “row group”). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column. Queries against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.

Creating Parquet Tables To create a table in the Parquet format, use the STORED AS PARQUET clause in the CREATE TABLE statement. For example:

CREATE TABLE parquet_table_name (x INT, y STRING) STORED AS PARQUET;

Or, to clone the column names and data types of an existing table, use LIKE with the STORED AS PARQUET clause. For example:

CREATE TABLE parquet_table_name LIKE other_table_name STORED AS PARQUET;

You can derive column definitions from a raw Parquet data file, even without an existing Impala table. For example, you can create an external table pointing to an HDFS directory, and base the column definitions on one of the files in that directory:

CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
  STORED AS PARQUET
  LOCATION '/user/etl/destination';

Or, you can refer to an existing data file and create a new empty table with suitable column definitions. Then you can use INSERT to create new data files or LOAD DATA to transfer existing data files into the new table.

CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
  STORED AS PARQUET;


In this example, the new table is partitioned by year, month, and day. These partition key columns are not part of the data file, so you specify them in the CREATE TABLE statement:

CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat'
  PARTITIONED BY (year INT, month TINYINT, day TINYINT)
  STORED AS PARQUET;

When you later load data with an INSERT ... SELECT statement, if the Parquet table has a different number of columns or different column names than the source table, specify the names of columns from the source table rather than * in the SELECT statement.

Data Type Considerations for Parquet Tables The Parquet format defines a set of data types whose names differ from the names of the corresponding Impala data types. If you are preparing Parquet files using other Hadoop components such as Pig or MapReduce, you might need to work with the type names defined by Parquet. The following tables list the Parquet-defined types and the equivalent types in Impala. Primitive types

Parquet type      Impala type
BINARY            STRING
BOOLEAN           BOOLEAN
DOUBLE            DOUBLE
FLOAT             FLOAT
INT32             INT
INT64             BIGINT
INT96             TIMESTAMP

Logical types Parquet uses type annotations to extend the types that it can store, by specifying how the primitive types should be interpreted.

Parquet primitive type and annotation                    Impala type
BINARY annotated with the UTF8 OriginalType              STRING
BINARY annotated with the STRING LogicalType             STRING
BINARY annotated with the ENUM OriginalType              STRING
BINARY annotated with the DECIMAL OriginalType           DECIMAL
INT64 annotated with the TIMESTAMP_MILLIS OriginalType   TIMESTAMP or BIGINT (for backward compatibility)
INT64 annotated with the TIMESTAMP_MICROS OriginalType   TIMESTAMP or BIGINT (for backward compatibility)
INT64 annotated with the TIMESTAMP LogicalType           TIMESTAMP or BIGINT (for backward compatibility)

Complex types: Impala only supports queries against the complex types (ARRAY, MAP, and STRUCT) in Parquet tables.

Loading Data into Parquet Tables Choose one of the following processes to load data into Parquet tables, based on whether the original data is already in an Impala table, or exists as raw data files outside Impala. If you already have data in an Impala or Hive table, perhaps in a different file format or partitioning scheme: • Transfer the data to a Parquet table using the Impala INSERT...SELECT statement. For example:

INSERT OVERWRITE TABLE parquet_table_name SELECT * FROM other_table_name;

You can convert, filter, repartition, and do other things to the data as part of this same INSERT statement. When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in the INSERT statement to fine-tune the overall performance of the operation and its resource usage (see the sketch after the following list).

Any INSERT statement for a Parquet table requires enough free space in the HDFS filesystem to write one block. Because Parquet data files use a large block size (256 MB by default), an INSERT might fail (even for a very small amount of data) if your HDFS is running low on space. Avoid the INSERT...VALUES syntax for Parquet tables, because INSERT...VALUES produces a separate tiny data file for each INSERT...VALUES statement, and the strength of Parquet is in its handling of data (compressing, parallelizing, and so on) in large chunks.

If you have one or more Parquet data files produced outside of Impala, you can quickly make the data queryable through Impala by one of the following methods:
• The LOAD DATA statement moves a single data file or a directory full of data files into the data directory for an Impala table. It does no validation or conversion of the data. The original data files must be somewhere in HDFS, not the local filesystem.
• The CREATE TABLE statement with the LOCATION clause creates a table where the data continues to reside outside the Impala data directory. The original data files must be somewhere in HDFS, not the local filesystem. For extra safety, if the data is intended to be long-lived and reused by other applications, you can use the CREATE EXTERNAL TABLE syntax so that the data files are not deleted by an Impala DROP TABLE statement.
• If the Parquet table already exists, you can copy Parquet data files directly into it using the hadoop distcp -pb command, then use the REFRESH statement to make Impala recognize the newly added data. You must preserve the block size of the Parquet data files by using the hadoop distcp -pb command rather than a -put or -cp operation on the Parquet files.
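The following sketch pulls these points together: it converts data from a hypothetical staging table (sales_staging) into a partitioned Parquet table (sales_parquet) with a SHUFFLE hint, then collects statistics. The table and column names are illustrative only.

INSERT OVERWRITE sales_parquet PARTITION (year)
  /* +SHUFFLE */
  SELECT id, amount, year FROM sales_staging;

-- Collect statistics afterward so later queries and inserts are planned well.
COMPUTE STATS sales_parquet;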


Note: Currently, Impala always decodes the column data in Parquet files based on the ordinal position of the columns, not by looking up the position of each column based on its name. Parquet files produced outside of Impala must write column data in the same order as the columns are declared in the Impala table. Any optional columns that are omitted from the data files must be the rightmost columns in the Impala table definition.

If you created compressed Parquet files through some tool other than Impala, make sure that any compression codecs are supported in Parquet by Impala. For example, Impala does not currently support LZO compression in Parquet files. Also double-check that you used any recommended compatibility settings in the other tool, such as spark.sql.parquet.binaryAsString when writing Parquet files through Spark. Recent versions of Sqoop can produce Parquet output files using the --as-parquetfile option.

If the data exists outside Impala and is in some other format, combine both of the preceding techniques. First, use a LOAD DATA or CREATE EXTERNAL TABLE ... LOCATION statement to bring the data into an Impala table that uses the appropriate file format. Then, use an INSERT...SELECT statement to copy the data to the Parquet table, converting to Parquet format as part of the process.

Loading data into Parquet tables is a memory-intensive operation, because the incoming data is buffered until it reaches one data block in size, then that chunk of data is organized and compressed in memory before being written out. The memory consumption can be larger when inserting data into partitioned Parquet tables, because a separate data file is written for each combination of partition key column values, potentially requiring several large chunks to be manipulated in memory at once. When inserting into a partitioned Parquet table, Impala redistributes the data among the nodes to reduce memory consumption. You might still need to temporarily increase the memory dedicated to Impala during the insert operation, or break up the load operation into several INSERT statements, or both.

Query Performance for Parquet Tables Query performance for Parquet tables depends on the number of columns needed to process the SELECT list and WHERE clauses of the query, the way data is divided into large data files with block size equal to file size, the reduction in I/O by reading the data for each column in compressed format, which data files can be skipped (for partitioned tables), and the CPU overhead of decompressing the data for each column. For example, the following is an efficient query for a Parquet table:

SELECT AVG(income) FROM census_data WHERE state = 'CA';

The query processes only 2 columns out of a large number of total columns. If the table is partitioned by the STATE column, it is even more efficient because the query only has to read and decode 1 column from each data file, and it can read only the data files in the partition directory for the state 'CA', skipping the data files for all the other states, which will be physically located in other directories. The following is a relatively inefficient query for a Parquet table:

SELECT * FROM census_data;

Impala would have to read the entire contents of each large data file, and decompress the contents of each column for each row group, negating the I/O optimizations of the column-oriented format. This query might still be faster for a Parquet table than a table with some other file format, but it does not take advantage of the unique strengths of Parquet data files. Impala can optimize queries on Parquet tables, especially join queries, better when statistics are available for all the tables. Issue the COMPUTE STATS statement for each table after substantial amounts of data are loaded into or appended to it. The runtime filtering feature works best with Parquet tables. The per-row filtering aspect only applies to Parquet tables.


Impala queries are optimized for files stored in Amazon S3. For Impala tables that use the Parquet file formats, the PARQUET_OBJECT_STORE_SPLIT_SIZE query option determines how Impala divides the I/O work of reading the data files. By default, this value is 256 MB. Impala parallelizes S3 read operations on the files as if they were made up of 256 MB blocks to match the row group size produced by Impala.

Parquet files written by Impala include embedded metadata specifying the minimum and maximum values for each column, within each row group and each data page within the row group. Impala-written Parquet files typically contain a single row group; a row group can contain many data pages. Impala uses this information (currently, only the metadata for each row group) when reading each Parquet data file during a query, to quickly determine whether each row group within the file potentially includes any rows that match the conditions in the WHERE clause. For example, if the column X within a particular Parquet file has a minimum value of 1 and a maximum value of 100, then a query including the clause WHERE x > 200 can quickly determine that it is safe to skip that particular file, instead of scanning all the associated column values.

This optimization technique is especially effective for tables that use the SORT BY clause for the columns most frequently checked in WHERE clauses, because any INSERT operation on such tables produces Parquet data files with relatively narrow ranges of column values within each file. To disable Impala from writing the Parquet page index when creating Parquet files, set the PARQUET_WRITE_PAGE_INDEX query option to FALSE.
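As a minimal sketch of how those two settings fit together (the table and column names are hypothetical), a table declared with SORT BY keeps narrow min/max ranges per data file, and the page index can be switched off for a session if needed:

CREATE TABLE sensor_readings (device_id STRING, reading_time TIMESTAMP, temperature DOUBLE)
  SORT BY (device_id)
  STORED AS PARQUET;

-- Only if you need to suppress the page index for subsequent INSERTs in this session:
SET PARQUET_WRITE_PAGE_INDEX=FALSE;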

Partitioning for Parquet Tables Partitioning is an important performance technique for Impala generally. This section explains some of the performance considerations for partitioned Parquet tables. The Parquet file format is ideal for tables containing many columns, where most queries only refer to a small subset of the columns. The physical layout of Parquet data files lets Impala read only a small fraction of the data for many queries. The performance benefits of this approach are amplified when you use Parquet tables in combination with partitioning. Impala can skip the data files for certain partitions entirely, based on the comparisons in the WHERE clause that refer to the partition key columns. For example, queries on partitioned tables often analyze data for time intervals based on columns such as YEAR, MONTH, and/or DAY, or for geographic regions.

Because Parquet data files use a large block size, when deciding how finely to partition the data, try to find a granularity where each partition contains 256 MB or more of data, rather than creating a large number of smaller files split among many partitions.

Inserting into a partitioned Parquet table can be a resource-intensive operation, because each Impala node could potentially be writing a separate data file to HDFS for each combination of different values for the partition key columns. The large number of simultaneous open files could exceed the HDFS “transceivers” limit. To avoid exceeding this limit, consider the following techniques:
• Load different subsets of data using separate INSERT statements with specific values for the PARTITION clause, such as PARTITION (year=2010), as shown in the sketch after this list.
• Increase the “transceivers” value for HDFS, sometimes spelled “xcievers” (sic). The property value in the hdfs-site.xml configuration file is dfs.datanode.max.transfer.threads. For example, if you were loading 12 years of data partitioned by year, month, and day, even a value of 4096 might not be high enough.
• Use the COMPUTE STATS statement to collect column statistics on the source table from which data is being copied, so that the Impala query can estimate the number of different values in the partition key columns and distribute the work accordingly.
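The following sketch illustrates the first and third techniques with hypothetical table names (sales_staging, sales_by_day): each statically partitioned INSERT writes a single partition, and statistics on the source table help Impala plan the work.

COMPUTE STATS sales_staging;

INSERT INTO sales_by_day PARTITION (year=2010, month=1, day=1)
  SELECT id, amount
  FROM sales_staging
  WHERE sale_year = 2010 AND sale_month = 1 AND sale_day = 1;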

Enabling Compression for Parquet Tables When Impala writes Parquet data files using the INSERT statement, the underlying compression is controlled by the COMPRESSION_CODEC query option. The allowed values for this query option are snappy (the default), gzip, zstd, lz4, and none, the compression codecs that Impala supports for Parquet.

Snappy: By default, the underlying data files for a Parquet table are compressed with Snappy. The combination of fast compression and decompression makes it a good choice for many data sets.

GZip: If you need more intensive compression (at the expense of more CPU cycles for uncompressing during queries), set the COMPRESSION_CODEC query option to gzip before inserting the data.

Zstd: Zstd is a real-time compression algorithm offering a tradeoff between speed and ratio of compression. Compression levels from 1 up to 22 are supported. The lower the level, the faster the speed at the cost of compression ratio.

Lz4: Lz4 is a lossless compression algorithm providing extremely fast and scalable compression and decompression.

None: If your data compresses very poorly, or you want to avoid the CPU overhead of compression and decompression entirely, set the COMPRESSION_CODEC query option to none before inserting the data.

The actual compression ratios, and relative insert and query speeds, will vary depending on the characteristics of the actual data. Because Parquet data files are typically large, each directory will have a different number of data files and the row groups will be arranged differently. At the same time, the less aggressive the compression, the faster the data can be decompressed. For example, using a table with a billion rows, switching from Snappy to GZip compression shrinks the data by an additional 40% or so, while switching from Snappy compression to no compression expands the data also by about 40%. A query that evaluates all the values for a particular column runs faster with no compression than with Snappy compression, and faster with Snappy compression than with GZip compression. The data files using the various compression codecs are all compatible with each other for read operations. The metadata about the compression format is written into each data file, and can be decoded during queries regardless of the COMPRESSION_CODEC setting in effect at the time.
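For example (the table names parquet_archive and parquet_hot are hypothetical), you might switch the codec for one bulk load and then restore the default; the resulting files remain readable regardless of the current setting:

SET COMPRESSION_CODEC=gzip;
INSERT OVERWRITE TABLE parquet_archive SELECT * FROM parquet_hot;
SET COMPRESSION_CODEC=snappy;  -- back to the default for subsequent INSERTs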

Exchanging Parquet Data Files with Other Cloudera Components You can read and write Parquet data files from other Cloudera components, such as Hive. Impala supports the scalar data types that you can encode in a Parquet data file, but not composite or nested types such as maps or arrays. Impala can query Parquet data files that include composite or nested types, as long as the query only refers to columns with scalar types.

If you copy Parquet data files between nodes, or even between different directories on the same node, make sure to preserve the block size by using the command hadoop distcp -pb. To verify that the block size was preserved, issue the command hdfs fsck -blocks HDFS_path_of_impala_table_dir and check that the average block size is at or near 256 MB (or whatever other size is defined by the PARQUET_FILE_SIZE query option). (The hadoop distcp operation typically leaves some directories behind, with names matching _distcp_logs_*, that you can delete from the destination directory afterward.) Issue the hadoop distcp command with no arguments to see the distcp command syntax.

Impala can query Parquet files that use the PLAIN, PLAIN_DICTIONARY, BIT_PACKED, and RLE encodings. As of CDP 7.2.8, Impala supports decoding RLE_DICTIONARY-encoded pages. This encoding is identical to the already-supported PLAIN_DICTIONARY encoding, but the PLAIN enum value is used for the dictionary pages and the RLE_DICTIONARY enum value is used for the data pages. When creating files outside of Impala for use by Impala, make sure to use one of the supported encodings. In particular, for MapReduce jobs, parquet.writer.version must not be defined (especially as PARQUET_2_0) in the configuration of Parquet MR jobs.


Data written using version 2.0 of the Parquet writer might not be consumable by Impala, due to use of the RLE_DICTIONARY encoding. Use the default version of the Parquet writer and refrain from overriding the default writer version by setting the parquet.writer.version property or via WriterVersion.PARQUET_2_0 in the Parquet API.

How Parquet Data Files Are Organized Although Parquet is a column-oriented file format, Parquet keeps all the data for a row within the same data file, to ensure that the columns for a row are always available on the same node for processing. Parquet sets a large HDFS block size and a matching maximum data file size to ensure that I/O and network transfer requests apply to large batches of data. Within that data file, the data for a set of rows is rearranged so that all the values from the first column are organized in one contiguous block, then all the values from the second column, and so on. Putting the values from the same column next to each other lets Impala use effective compression techniques on the values in that column.

Note: Impala INSERT statements write Parquet data files using an HDFS block size that matches the data file size, to ensure that each data file is represented by a single HDFS block, and the entire file can be processed on a single node without requiring any remote reads. If you create Parquet data files outside of Impala, such as through a MapReduce or Pig job, ensure that the HDFS block size is greater than or equal to the file size, so that the “one file per block” relationship is maintained. Set the dfs.block.size or the dfs.blocksize property large enough that each file fits within a single HDFS block, even if that size is larger than the normal HDFS block size. If the block size is reset to a lower value during a file copy, you will see lower performance for queries involving those files, and the PROFILE statement will reveal that some I/O is being done suboptimally, through remote reads.

When Impala retrieves or tests the data for a particular column, it opens all the data files, but only reads the portion of each file containing the values for that column. The column values are stored consecutively, minimizing the I/O required to process the values within a single column. If other columns are named in the SELECT list or WHERE clauses, the data for all columns in the same row is available within that same data file.

If an INSERT statement brings in less than one Parquet block's worth of data, the resulting data file is smaller than ideal. Thus, if you do split up an ETL job to use multiple INSERT statements, try to keep the volume of data for each INSERT statement to approximately 256 MB, or a multiple of 256 MB.
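As a sketch of the Note above (the file and directory names are hypothetical), one way to keep the "one file per block" relationship when copying a Parquet file into HDFS is to pass the block-size property on the command line:

hdfs dfs -D dfs.blocksize=268435456 -put datafile1.parq /user/etl/destination/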

RLE and Dictionary Encoding for Parquet Data Files Parquet uses some automatic compression techniques, such as run-length encoding (RLE) and dictionary encoding, based on analysis of the actual data values. Once the data values are encoded in a compact form, the encoded data can optionally be further compressed using a compression algorithm. Parquet data files created by Impala can use Snappy, GZip, or no compression; the Parquet spec also allows LZO compression, but currently Impala does not support LZO-compressed Parquet files.

RLE and dictionary encoding are compression techniques that Impala applies automatically to groups of Parquet data values, in addition to any Snappy or GZip compression applied to the entire data files. These automatic optimizations can save you time and planning that are normally needed for a traditional data warehouse. For example, dictionary encoding reduces the need to create numeric IDs as abbreviations for longer string values.

Run-length encoding condenses sequences of repeated data values. For example, if many consecutive rows all contain the same value for a country code, those repeating values can be represented by the value followed by a count of how many times it appears consecutively.

Dictionary encoding takes the different values present in a column, and represents each one in compact 2-byte form rather than the original value, which could be several bytes. (Additional compression is applied to the compacted values, for extra space savings.) This type of encoding applies when the number of different values for a column is less than 2**16 (65,536). It does not apply to columns of data type BOOLEAN, which are already very short.


TIMESTAMP columns sometimes have a unique value for each row, in which case they can quickly exceed the 2**16 limit on distinct values. The 2**16 limit on different values within a column is reset for each data file, so if several different data files each contained 10,000 different city names, the city name column in each data file could still be condensed using dictionary encoding.

Compacting Data Files for Parquet Tables If you reuse existing table structures or ETL processes for Parquet tables, you might encounter a “many small files” situation, which is suboptimal for query efficiency. Here are techniques to help you produce large data files in Parquet INSERT operations, and to compact existing too-small data files:
• When inserting into a partitioned Parquet table, use statically partitioned INSERT statements where the partition key values are specified as constant values. Ideally, use a separate INSERT statement for each partition.
• You might set the NUM_NODES option to 1 briefly, during INSERT or CREATE TABLE AS SELECT statements. Normally, those statements produce one or more data files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a partitioned table, the default behavior could produce many small files when intuitively you might expect only a single output file. SET NUM_NODES=1 turns off the “distributed” aspect of the write operation, making it more likely to produce only one or a few data files.
• Be prepared to reduce the number of partition key columns from what you are used to with traditional analytic database systems.
• Do not expect Impala-written Parquet files to fill up the entire Parquet block size. Impala estimates on the conservative side when figuring out how much data to write to each Parquet file. Typically, the amount of uncompressed data in memory is substantially reduced on disk by the compression and encoding techniques in the Parquet file format. The final data file size varies depending on the compressibility of the data. Therefore, it is not an indication of a problem if 256 MB of text data is turned into 2 Parquet data files, each less than 256 MB.
• If you accidentally end up with a table with many small data files, consider using one or more of the preceding techniques and copying all the data into a new Parquet table, either through CREATE TABLE AS SELECT or INSERT ... SELECT statements.

To avoid rewriting queries to change table names, you can adopt a convention of always running important queries against a view. Changing the view definition immediately switches any subsequent queries to use the new underlying tables:
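A minimal sketch of that view convention, with hypothetical table and view names (sales_v1, sales_v2, sales_current):

CREATE VIEW sales_current AS SELECT * FROM sales_v1;

-- After compacting the data into a new table, repoint the view;
-- queries that reference sales_current pick up the new table immediately.
ALTER VIEW sales_current AS SELECT * FROM sales_v2;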

Schema Evolution for Parquet Tables Schema evolution refers to using the statement ALTER TABLE ... REPLACE COLUMNS to change the names, data type, or number of columns in a table. You can perform schema evolution for Parquet tables as follows:
• The Impala ALTER TABLE statement never changes any data files in the tables. From the Impala side, schema evolution involves interpreting the same data files in terms of a new table definition. Some types of schema changes make sense and are represented correctly. Other types of changes cannot be represented in a sensible way, and produce special result values or conversion errors during queries.
• The INSERT statement always creates data using the latest table definition. You might end up with data files with different numbers of columns or internal data representations if you do a sequence of INSERT and ALTER TABLE ... REPLACE COLUMNS statements.
• If you use ALTER TABLE ... REPLACE COLUMNS to define additional columns at the end, when the original data files are used in a query, these final columns are considered to be all NULL values.
• If you use ALTER TABLE ... REPLACE COLUMNS to define fewer columns than before, when the original data files are used in a query, the unused columns still present in the data file are ignored.


• Parquet represents the TINYINT, SMALLINT, and INT types the same internally, all stored in 32-bit integers.
• That means it is easy to promote a TINYINT column to SMALLINT or INT, or a SMALLINT column to INT. The numbers are represented exactly the same in the data file, and the columns being promoted would not contain any out-of-range values.
• If you change any of these column types to a smaller type, any values that are out-of-range for the new type are returned incorrectly, typically as negative numbers.
• You cannot change a TINYINT, SMALLINT, or INT column to BIGINT, or the other way around. Although the ALTER TABLE succeeds, any attempt to query those columns results in conversion errors.
• Any other type conversion for columns produces a conversion error during queries. For example, INT to STRING, FLOAT to DOUBLE, TIMESTAMP to STRING, DECIMAL(9,0) to DECIMAL(5,2), and so on.

You might find that you have Parquet files where the columns do not line up in the same order as in your Impala table. For example, you might have a Parquet file that was part of a table with columns C1,C2,C3,C4, and now you want to reuse the same Parquet file in a table with columns C4,C2. By default, Impala expects the columns in the data file to appear in the same order as the columns defined for the table, making it impractical to do some kinds of file reuse or schema evolution. The query option PARQUET_FALLBACK_SCHEMA_RESOLUTION=name lets Impala resolve columns by name, and therefore handle out-of-order or extra columns in the data file.
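As a minimal sketch of both ideas (the table name reused_parquet and the column types are hypothetical), the session below switches to by-name column resolution and then redefines a table so it exposes only two of the original four columns:

SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;

-- The data files still contain c1..c4; queries now match columns by name,
-- and the unused c1 and c3 are simply ignored.
ALTER TABLE reused_parquet REPLACE COLUMNS (c4 STRING, c2 INT);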

Using ORC Data Files Impala can read ORC data files.

Disabling ORC reads By default, ORC reads are enabled in Impala 3.4.0. To disable the support of ORC data files:
1. In Cloudera Manager, navigate to Clusters > Impala.
2. In the Configuration tab, set --enable_orc_scanner=false in the Impala Command Line Argument Advanced Configuration Snippet (Safety Valve) field. To enable ORC reads, set the property to true.
3. Restart the cluster.

Creating ORC Tables and Loading Data To create a table in the ORC format, use the STORED AS ORC clause in the CREATE TABLE statement. Because Impala can query ORC tables but cannot currently write to them, after creating ORC tables, use the Hive shell to load the data. After loading data into a table through Hive or another mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, to make Impala recognize the new data. For example, here is how you might create some ORC tables in Impala (by specifying the columns explicitly, or cloning the structure of another table), load data through Hive, and query them through Impala:

$ impala-shell -i localhost
[localhost:21000] default> CREATE TABLE orc_table (x INT) STORED AS ORC;
[localhost:21000] default> CREATE TABLE orc_clone LIKE some_other_table STORED AS ORC;
[localhost:21000] default> quit;

$ hive
hive> INSERT INTO TABLE orc_table SELECT x FROM some_other_table;
3 Rows loaded to orc_table
Time taken: 4.169 seconds


hive> quit;

$ impala-shell -i localhost
[localhost:21000] default> -- Make Impala recognize the data loaded through Hive;
[localhost:21000] default> REFRESH orc_table;
[localhost:21000] default> SELECT * FROM orc_table;

Fetched 3 row(s) in 0.11s

Data Type Considerations for ORC Tables The ORC format defines a set of data types whose names differ from the names of the corresponding Impala data types. If you are preparing ORC files using other Hadoop components such as Pig or MapReduce, you might need to work with the type names defined by ORC. The following table lists the ORC-defined types and the equivalent types in Impala. Primitive types:

ORC type      Impala type
BINARY        STRING
BOOLEAN       BOOLEAN
DOUBLE        DOUBLE
FLOAT         FLOAT
TINYINT       TINYINT
SMALLINT      SMALLINT
INT           INT
BIGINT        BIGINT
TIMESTAMP     TIMESTAMP
DATE          Not supported

Complex types: Impala supports the complex types ARRAY, STRUCT, and MAP in ORC files. Because Impala has better performance on Parquet than ORC, if you plan to use complex types, become familiar with the performance and storage aspects of Parquet first.

Enabling Compression for ORC Tables ORC tables use ZLIB (called Deflate in Impala) compression by default. The supported compression codecs for ORC tables are NONE, ZLIB, SNAPPY, and LZO. Set the compression when you load ORC tables in Hive.
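Since the compression is chosen on the Hive side, one common approach is the orc.compress table property; the following is a sketch in the Hive shell with hypothetical table names:

-- In the Hive shell, not Impala:
CREATE TABLE orc_snappy (x INT) STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');
INSERT INTO TABLE orc_snappy SELECT x FROM some_other_table;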

Using Avro Data Files Impala supports creating and querying Avro tables. You need to use Hive to insert data into Avro tables.

Creating Avro Tables To create a new table using the Avro file format, use the STORED AS AVRO clause in the CREATE TABLE statement. If you create the table through Impala, you must include column definitions that match the fields specified in the Avro schema. With Hive, you can omit the columns and just specify the Avro schema.


The following examples demonstrate creating an Avro table in Impala, using either an inline column specification or one taken from a JSON file stored in HDFS:

[localhost:21000] > CREATE TABLE avro_only_sql_columns > (col1 BOOLEAN, col2 INT) > STORED AS AVRO;

[localhost:21000] > CREATE TABLE impala_avro_table
                  > (col1 BOOLEAN, col2 INT)
                  > STORED AS AVRO
                  > TBLPROPERTIES ('avro.schema.literal'='{
                  >   "name": "my_record",
                  >   "type": "record",
                  >   "fields": [
                  >     {"name":"col1", "type":"boolean"},
                  >     {"name":"col2", "type":"int"}]}');

[localhost:21000] > CREATE TABLE avro_examples_of_all_types (col1 BOOLEAN, col2 INT)
                  > STORED AS AVRO
                  > TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/avro_schemas/alltypes.json');

Each field of the record becomes a column of the table. Note that any other information, such as the record name, is ignored.

Note: For nullable Avro columns, make sure to put the "null" entry before the actual type name. In Impala, all columns are nullable; Impala currently does not have a NOT NULL clause. Any non-nullable property is only enforced on the Avro side.

If you create the table through Hive, switch back to impala-shell and issue an INVALIDATE METADATA table_name statement. Then you can run queries for that table through impala-shell.

In rare instances, a mismatch could occur between the Avro schema and the column definitions in the Metastore database. Impala checks for such inconsistencies during a CREATE TABLE statement and each time it loads the metadata for a table (for example, after INVALIDATE METADATA). Impala uses the following rules to determine how to treat mismatching columns, a process known as schema reconciliation:
• If there is a mismatch in the number of columns, Impala uses the column definitions from the Avro schema.
• If there is a mismatch in column name or type, Impala uses the column definition from the Avro schema. Because a CHAR or VARCHAR column in Impala maps to an Avro STRING, this case is not considered a mismatch and the column is preserved as CHAR or VARCHAR in the reconciled schema.
• An Impala TIMESTAMP column definition maps to an Avro STRING and is presented as a STRING in the reconciled schema, because Avro has no binary TIMESTAMP representation.

Specifying the Avro Schema through JSON: While you can embed a schema directly in your CREATE TABLE statement, as shown above, column width restrictions in the Hive Metastore limit the length of schema you can specify. If you encounter problems with long schema literals, try storing your schema as a JSON file in HDFS instead. Specify the schema using the avro.schema.url property in the TBLPROPERTIES clause, pointing to the HDFS location of the schema file, in the following format:

TBLPROPERTIES ('avro.schema.url'='hdfs://your-name-node:port/path/to/schema.json');

Data Type Considerations for Avro Tables The Avro format defines a set of data types whose names differ from the names of the corresponding Impala data types. If you are preparing Avro files using other Hadoop components such as Pig or MapReduce, you might need to work with the type names defined by Avro. The following table lists the Avro-defined types and the equivalent types in Impala. Primitive types:

Avro type     Impala type
STRING        STRING
STRING        CHAR
STRING        VARCHAR
INT           INT
BOOLEAN       BOOLEAN
LONG          BIGINT
FLOAT         FLOAT
DOUBLE        DOUBLE

The Avro specification allows string values up to 2**64 bytes in length:
• Impala queries for Avro tables use 32-bit integers to hold string lengths.
• Impala truncates CHAR and VARCHAR values in Avro tables to (2**31)-1 bytes.
• If a query encounters a STRING value longer than (2**31)-1 bytes in an Avro table, the query fails.

Logical types:

Avro type           Impala type
BYTES annotated     DECIMAL
INT32 annotated     DATE

Avro types not supported by Impala:
• RECORD
• MAP
• ARRAY
• UNION
• ENUM
• FIXED
• NULL

Impala types not supported by Avro:
• TIMESTAMP

Impala issues warning messages if there are any mismatches between the types specified in the SQL column definitions and the underlying types; for example, any TINYINT or SMALLINT columns are treated as INT in the underlying Avro files, and therefore are displayed as INT in any DESCRIBE or SHOW CREATE TABLE output.

Note: Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro tables, as a workaround you can use a STRING representation of the values, convert the values to BIGINT with the UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using the EXTRACT() function.
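For example (the table and column names are hypothetical), a STRING event-time column in an Avro table can be converted on the fly during queries:

-- event_time is stored as a STRING such as '2021-04-26 10:15:00' in the Avro table.
SELECT UNIX_TIMESTAMP(event_time) AS event_epoch FROM avro_events LIMIT 5;

-- Going the other way, if the value was stored as BIGINT seconds since the epoch:
SELECT FROM_UNIXTIME(event_epoch) FROM avro_events_epoch LIMIT 5;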

Using a Hive-Created Avro Table in Impala If you have an Avro table created through Hive, you can use it in Impala as long as it contains only Impala-compatible data types. It cannot contain Avro types not supported by Impala, such as ENUM and FIXED. Because Impala and Hive share the same metastore database, Impala can directly access the table definitions and data for tables that were created in Hive.

If you create an Avro table in Hive, issue an INVALIDATE METADATA in Impala. This is a one-time operation to make Impala aware of the new table. You can issue the statement while connected to any Impala node, and the catalog service broadcasts the change to all other Impala nodes.

If you load new data into an Avro table through Hive, either through a Hive LOAD DATA or INSERT statement, or by manually copying or moving files into the data directory for the table, issue a REFRESH table_name statement the next time you connect to Impala through impala-shell. If you issue the LOAD DATA statement through Impala, you do not need a REFRESH afterward.

Impala only supports fields of type BOOLEAN, INT, LONG, FLOAT, DOUBLE, and STRING, or unions of these types with null, for example, ["string", "null"]. Unions with null essentially create a nullable type.

Loading Data into Avro Tables Currently, Impala cannot write Avro data files. Therefore, an Avro table cannot be used as the destination of an Impala INSERT statement or CREATE TABLE AS SELECT. To copy data from another table, issue any INSERT statements through Hive. After loading data into a table through Hive or another mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, to make Impala recognize the new data. If you already have data files in Avro format, you can also issue LOAD DATA in either Impala or Hive. Impala can move existing Avro data files into an Avro table; it just cannot create new Avro data files.

Enabling Compression for Avro Tables To enable compression for Avro tables, specify settings in the Hive shell to enable compression and to specify a codec, then issue a CREATE TABLE statement as in the preceding examples. Impala supports the snappy and deflate codecs for Avro tables. For example:

hive> set hive.exec.compress.output=true; hive> set avro.output.codec=snappy;

Handling Avro Schema Evolution Impala can handle Avro data files that employ schema evolution, where different data files within the same table use slightly different type definitions. (You would perform the schema evolution operation by issuing an ALTER TABLE statement in the Hive shell.) The old and new types for any changed columns must be compatible; for example, a column might start as an INT and later change to a BIGINT or FLOAT.

As with any other tables where the definitions are changed or data is added outside of the current impalad node, ensure that Impala loads the latest metadata for the table if the Avro schema is modified through Hive. Issue a REFRESH table_name or INVALIDATE METADATA table_name statement. REFRESH reloads the metadata immediately; INVALIDATE METADATA reloads the metadata the next time the table is accessed.

When Avro data files or columns are not consulted during a query, Impala does not check for consistency. Thus, if you issue SELECT c1, c2 FROM t1, Impala does not return any error if the column c3 changed in an incompatible way. If a query retrieves data from some partitions but not others, Impala does not check the data files for the unused partitions.

In the Hive DDL statements, you can specify an avro.schema.literal table property (if the schema definition is short) or an avro.schema.url property (if the schema definition is long, or to allow convenient editing for the definition).
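A minimal sketch of that workflow, assuming a hypothetical table avro_events and an updated schema file in HDFS:

-- In the Hive shell: point the table at the revised Avro schema.
ALTER TABLE avro_events SET TBLPROPERTIES (
  'avro.schema.url'='hdfs:///avro_schemas/events_v2.json');

-- Back in impala-shell: reload the table metadata before querying.
REFRESH avro_events;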


Query Performance for Avro Tables In general, expect query performance with Avro tables to be faster than with tables using text data, but slower than with Parquet tables.

Using RCFile Data Files Impala supports creating and querying RCFile tables.

Creating RCFile Tables and Loading Data To create a table in the RCFile format, use the STORED AS RCFILE clause in the CREATE TABLE statement. Because Impala can query RCFile tables but cannot currently write to them, after creating RCFile tables, use the Hive shell to load the data. After loading data into a table through Hive or another mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, to make Impala recognize the new data.

Query Performance for RCFile Tables In general, expect query performance with RCFile tables to be faster than with tables using text data, but slower than with Parquet tables.

Enabling Compression for RCFile Tables You may want to enable compression on existing tables. Enabling compression provides performance gains in most cases and is supported for RCFile tables. The compression type is specified in the SET mapred.output.compression.codec command. For example, to enable Snappy compression, you would specify the following additional settings when loading data through the Hive shell.

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Using SequenceFile Data Files Impala supports creating and querying SequenceFile tables.

Creating SequenceFile Tables and Loading Data To create a table in the SequenceFile format, use the STORED AS SEQUENCEFILE clause in the CREATE TABLE statement. Because Impala can query SequenceFile tables but cannot currently write to them, after creating SequenceFile tables, use the Hive shell to load the data. After loading data into a table through Hive or another mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, to make Impala recognize the new data.

Enabling Compression for SequenceFile Tables You may want to enable compression on existing tables. Enabling compression provides performance gains in most cases and is supported for SequenceFile tables.


The compression type is specified in the SET mapred.output.compression.codec command. For example, to enable Snappy compression, you would specify the following additional settings when loading data through the Hive shell.

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Query Performance for SequenceFile Tables In general, expect query performance with SequenceFile tables to be faster than with tables using text data, but slower than with Parquet tables.

Storage Systems Supports

This section describes the storage systems that Impala supports.

Impala with HDFS

Although Impala typically works well with many large files in an HDFS storage system, there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques to minimize space consumption and file duplication.
• Use compact binary file formats where practical. Numeric and time-based data in particular can be stored in more compact form in binary data files. Depending on the file format, various compression and encoding features can reduce file size even further. You can specify the STORED AS clause as part of the CREATE TABLE statement, or ALTER TABLE with the SET FILEFORMAT clause for an existing table or partition within a partitioned table (see the sketch after this list).
• You manage underlying data files differently depending on whether the corresponding Impala table is defined as an internal or external table:
  • Use the DESCRIBE FORMATTED statement to check if a particular table is internal (managed by Impala) or external, and to see the physical location of the data files in HDFS.
  • For Impala-managed (“internal”) tables, use DROP TABLE statements to remove data files.
  • For tables not managed by Impala (“external” tables), use appropriate HDFS-related commands such as hadoop fs, hdfs dfs, or distcp, to create, move, copy, or delete files within HDFS directories that are accessible by the impala user. Issue a REFRESH table_name statement after adding or removing any files from the data directory of an external table.
• Use external tables to reference HDFS data files in their original location. With this technique, you avoid copying the files, and you can map more than one Impala table to the same set of data files. When you drop the Impala table, the data files are left undisturbed.
• Use the LOAD DATA statement to move HDFS files into the data directory for an Impala table from inside Impala, without the need to specify the HDFS path of the destination directory. This technique works for both internal and external tables.
• Make sure that the HDFS trashcan is configured correctly. When you remove files from HDFS, the space might not be reclaimed for use by other files until sometime later, when the trashcan is emptied.
• Drop all tables in a database before dropping the database itself.
• If an INSERT statement encounters an error, and you see a directory named .impala_insert_staging or _impala_insert_staging left behind in the data directory for the table, it might contain temporary data files taking up space in HDFS. You might be able to salvage these data files. For example, delete those files through commands such as hadoop fs or hdfs dfs to reclaim space before re-trying the INSERT. Issue DESCRIBE FORMATTED table_name to see the HDFS path where you can check for temporary files.
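A minimal sketch of the first two points, using a hypothetical table web_logs:

-- Check whether the table is internal or external, and where its files live in HDFS.
DESCRIBE FORMATTED web_logs;

-- Store newly written data for one partition in a compact binary format.
ALTER TABLE web_logs PARTITION (year=2020) SET FILEFORMAT PARQUET;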


Configuring Scratch Space for Spilling to Disk Impala uses intermediate files during large sorts, joins, aggregations, or analytic function operations. The files are removed when the operation finishes. By default, intermediate files are stored in the directory /tmp/impala-scratch.

You can specify locations of the intermediate files in one of the following ways:
• By starting the impalad daemon with the --scratch_dirs="path_to_directory" configuration option.
• By specifying a different location in Cloudera Manager in the Impala Daemon Scratch Directories field.

With either option above:
• You can specify a single directory or a comma-separated list of directories.
• You can specify an optional capacity quota per scratch directory using the colon (:) as the delimiter. A capacity quota of -1 or 0 is the same as no quota for the directory.
• The scratch directories must be on the local filesystem, not in HDFS.
• You might specify different directory paths for different hosts, depending on the capacity and speed of the available storage devices.

If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a warning message to its log. Impala successfully starts (with a warning written to the log) if it cannot create or read and write files in one of the scratch directories. The following are examples for specifying scratch directories.

Config option                          Description
--scratch_dirs=/dir1,/dir2             Use /dir1 and /dir2 as scratch directories with no capacity quota.
--scratch_dirs=/dir1,/dir2:25G         Use /dir1 and /dir2 as scratch directories with no capacity quota on /dir1 and a 25 GB quota on /dir2.
--scratch_dirs=/dir1:5MB,/dir2         Use /dir1 and /dir2 as scratch directories with a capacity quota of 5 MB on /dir1 and no quota on /dir2.
--scratch_dirs=/dir1:-1,/dir2:0        Use /dir1 and /dir2 as scratch directories with no capacity quota.

Allocation from a scratch directory will fail if the specified limit for the directory is exceeded. If Impala encounters an error reading or writing files in a scratch directory during a query, Impala logs the error, and the query fails.

Increasing Scratch Capacity You can compress the data spilled to disk to increase the effective scratch capacity. Compression typically more than doubles the effective capacity and reduces the amount of data spilled to disk. Use the --disk_spill_compression_codec and --disk_spill_punch_holes startup options. The --disk_spill_compression_codec option takes any value supported by the COMPRESSION_CODEC query option. The value is not case-sensitive. A value of ZSTD or LZ4 is recommended (the default is NONE). For example:

--disk_spill_compression_codec=LZ4 --disk_spill_punch_holes=true

If you set --disk_spill_compression_codec to a value other than NONE, you must set --disk_spill_punch_holes to true. The hole punching feature supported by many file systems is used to reclaim space in scratch files during execution of a query that spills to disk. This results in lower scratch space requirements in many cases, especially when combined with disk spill compression. When this option is not enabled, scratch space is still recycled by a query, but less effectively in many cases.

You can specify a compression level for ZSTD only. For example:

--disk_spill_compression_codec=ZSTD:10 --disk_spill_punch_holes=true

Compression levels from 1 up to 22 (default 3) are supported for ZSTD. The lower the compression level, the faster the speed at the cost of compression ratio.

Impala with Kudu You can use Impala to query tables stored by Apache Kudu. This capability allows convenient access to a storage system that is tuned for different kinds of workloads than the default with Impala.

By default, Impala tables are stored on HDFS. HDFS files are ideal for bulk loads (append operations) and queries using full-table scans, but do not support in-place updates or deletes. Kudu is an alternative storage engine used by Impala which can do both in-place updates (for mixed read/write workloads) and fast scans (for data-warehouse/analytic operations). Using Kudu tables with Impala can simplify the ETL pipeline by avoiding extra steps to segregate and reorganize newly arrived data. The underlying storage is managed and organized by Kudu, not represented as HDFS data files.

The combination of Kudu and Impala works best for tables where scan performance is important, but data arrives continuously, in small batches, or needs to be updated without being completely replaced. HDFS-backed tables can require substantial overhead to replace or reorganize data files as new data arrives. Impala can perform efficient lookups and scans within Kudu tables, and Impala can also perform update or delete operations efficiently. You can also use the Kudu Java, C++, and Python APIs to do ingestion or transformation operations outside of Impala, and Impala can query the current data at any time. With HDFS-backed tables, you are typically concerned with the number of DataNodes in the cluster, how many and how large HDFS data files are read during a query, and therefore the amount of work performed by each DataNode and the network communication to combine intermediate results and produce the final result set.

Certain Impala SQL statements and clauses, such as DELETE, UPDATE, UPSERT, and PRIMARY KEY work only with Kudu tables. Other statements and clauses, such as LOAD DATA, TRUNCATE TABLE, and INSERT OVERWRITE, are not applicable to Kudu tables.

Query performance for Kudu tables For queries involving Kudu tables, Impala can delegate much of the work of filtering the result set to Kudu, avoiding some of the I/O involved in full table scans of tables containing HDFS data files. This type of optimization is especially effective for partitioned Kudu tables, where the Impala query WHERE clause refers to one or more primary key columns that are also used as partition key columns. For example, if a partitioned Kudu table uses a HASH clause for col1 and a RANGE clause for col2, a query using a clause such as WHERE col1 IN (1,2,3) AND col2 > 100 can determine exactly which tablet servers contain relevant data, and therefore parallelize the query efficiently.

Impala can push down additional information to optimize join queries involving Kudu tables. If the join clause contains predicates of the form column = expression, after Impala constructs a hash table of possible matching values for the join columns from the bigger table (either an HDFS table or a Kudu table), Impala can “push down” the minimum and maximum matching column values to Kudu, so that Kudu can more efficiently locate matching rows in the second (smaller) table. These min/max filters are affected by the RUNTIME_FILTER_MODE, RUNTIME_FILTER_WAIT_TIME_MS, and DISABLE_ROW_RUNTIME_FILTERING query options; the min/max filters are not affected by the RUNTIME_BLOOM_FILTER_SIZE, RUNTIME_FILTER_MIN_SIZE, RUNTIME_FILTER_MAX_SIZE, and MAX_NUM_RUNTIME_FILTERS query options.
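For illustration, with a hypothetical Kudu table kudu_metrics partitioned by HASH on col1 and by RANGE on col2, a query of the shape described above lets Impala and Kudu prune tablets and push the predicates down to the storage layer:

SELECT COUNT(*)
FROM kudu_metrics
WHERE col1 IN (1, 2, 3) AND col2 > 100;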


The TABLESAMPLE clause of the SELECT statement does not apply to a table reference derived from a view, a subquery, or anything other than a real base table. This clause only works for tables backed by HDFS or HDFS-like data files; therefore, it does not apply to Kudu tables.

Configuring for Kudu Tables The Kudu configuration property must be set for Impala to connect to the appropriate Kudu server.

Procedure
1. In Cloudera Manager, navigate to Clusters > Impala Service.
2. In the Configuration tab, specify the Kudu service you want to use in the Kudu Service field. Typically, the required value for this setting is kudu_host:7051. In a high-availability Kudu deployment, specify the names of multiple Kudu hosts separated by commas.
3. If the Kudu Service field is not set, you can still associate the appropriate value for each table by specifying a TBLPROPERTIES('kudu.master_addresses') clause in the CREATE TABLE statement or changing the TBLPROPERTIES('kudu.master_addresses') value with an ALTER TABLE statement.
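A sketch of the per-table alternative described in step 3 (the host names and table definition are hypothetical):

CREATE TABLE kudu_metrics (id BIGINT PRIMARY KEY, val DOUBLE)
  PARTITION BY HASH (id) PARTITIONS 4
  STORED AS KUDU
  TBLPROPERTIES ('kudu.master_addresses'='kudu-master-1:7051,kudu-master-2:7051');

-- Or repoint an existing table:
ALTER TABLE kudu_metrics
  SET TBLPROPERTIES ('kudu.master_addresses'='kudu-master-3:7051');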

Impala DDL for Kudu You can use the Impala CREATE TABLE and ALTER TABLE statements to create and fine-tune the characteristics of Kudu tables. Impala supports specific features and properties that only apply to Kudu tables. To create a Kudu table, use the STORED AS KUDU clause in the CREATE TABLE statement. The column list in a CREATE TABLE statement can include the following attributes, which only apply to Kudu tables:

PRIMARY KEY | [NOT] NULL | ENCODING codec | COMPRESSION algorithm | DEFAULT constant_expression | BLOCK_SIZE number

PRIMARY KEY Attribute The primary key for a Kudu table is a column, or set of columns, that uniquely identifies every row. The primary key value also is used as the natural sort order for the values from the table. The notion of primary key only applies to Kudu tables. Every Kudu table requires a primary key. Because all of the primary key columns must have non-null values, specifying a column in the PRIMARY KEY clause implicitly adds the NOT NULL attribute to that column. The primary key columns must be the first ones specified in the CREATE TABLE statement. When the primary key is a single column, you can specify the PRIMARY KEY attribute either inline in a single column definition, or as a separate clause at the end of the column list. For example:

CREATE TABLE pk_inline ( col1 BIGINT PRIMARY KEY, col2 STRING, col3 BOOLEAN ) PARTITION BY HASH(col1) PARTITIONS 2 STORED AS KUDU;

CREATE TABLE pk_at_end
( col1 BIGINT, col2 STRING, col3 BOOLEAN, PRIMARY KEY (col1) )
PARTITION BY HASH(col1) PARTITIONS 2 STORED AS KUDU;

If the primary key consists of more than one column, you must specify the primary key using a separate entry in the column list.

CREATE TABLE pk_multiple_columns ( col1 BIGINT, col2 STRING, col3 BOOLEAN, PRIMARY KEY (col1, col2) ) PARTITION BY HASH(col2) PARTITIONS 2 STORED AS KUDU;

The contents of the primary key columns cannot be changed by an UPDATE or UPSERT statement. Including too many columns in the primary key (more than 5 or 6) can reduce the performance of write operations. Therefore, pick the most selective and most frequently tested non-null columns for the primary key specification. If a column must always have a value, but that value might change later, leave it out of the primary key and use a NOT NULL clause for that column instead.

NULL | NOT NULL Attribute For Kudu tables, you can specify which columns can contain nulls or not. This constraint offers an extra level of consistency enforcement for Kudu tables. If an application requires a field to always be specified, include a NOT NULL clause in the corresponding column definition, and Kudu prevents rows from being inserted with a NULL in that column. For example, a table containing geographic information might require the latitude and longitude coordinates to always be specified. Other attributes might be allowed to be NULL. For example, a location might not have a designated place name, its altitude might be unimportant, and its population might be initially unknown, to be filled in later. For non-Kudu tables, Impala allows any column to contain NULL values, because it is not practical to enforce a “not null” constraint on HDFS data files that could be prepared using external tools and ETL processes.

CREATE TABLE required_columns ( id BIGINT PRIMARY KEY, latitude DOUBLE NOT NULL, longitude DOUBLE NOT NULL, place_name STRING, altitude DOUBLE, population BIGINT ) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;

During performance optimization, Kudu can use the knowledge that nulls are not allowed to skip certain checks on each input row, speeding up queries and join operations. Therefore, specify NOT NULL constraints when appropriate. The NULL clause is the default condition for all columns that are not part of the primary key. You can omit it, or specify it to clarify that you have made a conscious design decision to allow nulls in a column. Because primary key columns cannot contain any NULL values, the NOT NULL clause is not required for the primary key columns, but you might still specify it to make your code self-describing.


DEFAULT Attribute You can specify a default value for columns in Kudu tables. The default value can be any constant expression, for example, a combination of literal values, arithmetic and string operations. It cannot contain references to columns or non-deterministic function calls. The following example shows different kinds of expressions for the DEFAULT clause. The requirement to use a constant value means that you can fill in a placeholder value such as NULL, an empty string, 0, -1, 'N/A', and so on, but you cannot reference functions or column names. Therefore, you cannot use DEFAULT to do things such as automatically making an uppercase copy of a string value, storing Boolean values based on tests of other columns, or adding or subtracting one from another column representing a sequence number.

CREATE TABLE default_vals ( id BIGINT PRIMARY KEY, name STRING NOT NULL DEFAULT 'unknown', address STRING DEFAULT upper('no fixed address'), age INT DEFAULT -1, earthling BOOLEAN DEFAULT TRUE, planet_of_origin STRING DEFAULT 'Earth', optional_col STRING DEFAULT NULL ) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;

Note: When designing an entirely new schema, prefer to use NULL as the placeholder for any unknown or missing values, because that is the universal convention among database systems. Null values can be stored efficiently, and easily checked with the IS NULL or IS NOT NULL operators. The DEFAULT attribute is appropriate when ingesting data that already has an established convention for representing unknown or missing values, or where the vast majority of rows have some common non-null value.
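As a quick illustration of the DEFAULT behavior, columns omitted from an INSERT column list receive their declared default values. Using the default_vals table defined above, the row inserted below ends up with name 'unknown', age -1, and planet_of_origin 'Earth':

INSERT INTO default_vals (id) VALUES (1);
SELECT id, name, age, earthling, planet_of_origin FROM default_vals WHERE id = 1;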

ENCODING Attribute Each column in a Kudu table can optionally use an encoding, a low-overhead form of compression that reduces the size on disk, then requires additional CPU cycles to reconstruct the original values during queries. Typically, highly compressible data benefits from the reduced I/O to read the data back from disk. The encoding keywords that Impala recognizes are:
• AUTO_ENCODING: Use the default encoding based on the column type: bitshuffle for numeric columns and dictionary for string columns.
• PLAIN_ENCODING: Leave the value in its original binary format.
• RLE: Compress repeated values (when sorted in primary key order) by including a count.
• DICT_ENCODING: When the number of different string values is low, replace the original string with a numeric ID.
• BIT_SHUFFLE: Rearrange the bits of the values to efficiently compress sequences of values that are identical or vary only slightly based on primary key order. The resulting encoded data is also compressed with LZ4.
• PREFIX_ENCODING: Compress common prefixes in string values; mainly for use internally within Kudu.
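The following sketch shows the column-level ENCODING syntax; the table and column names are hypothetical, and the choices simply follow the guidance above (dictionary for low-cardinality strings, bitshuffle for numeric values, plain for data to be left as-is):

CREATE TABLE encoded_example
(
  id BIGINT PRIMARY KEY,
  country STRING ENCODING DICT_ENCODING,      -- few distinct string values
  hits BIGINT ENCODING BIT_SHUFFLE,           -- numeric values that vary slightly in key order
  raw_payload STRING ENCODING PLAIN_ENCODING  -- stored in its original binary format
)
PARTITION BY HASH (id) PARTITIONS 2
STORED AS KUDU;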

COMPRESSION Attribute You can specify a compression algorithm to use for each column in a Kudu table. This attribute imposes more CPU overhead when retrieving the values than the ENCODING attribute does. Therefore, use it primarily for columns with long strings that do not benefit much from the less-expensive ENCODING attribute. The choices for COMPRESSION are LZ4, SNAPPY, and ZLIB.


Note: Columns that use the BITSHUFFLE encoding are already compressed using LZ4, and so typically do not need any additional COMPRESSION attribute.
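A brief sketch of the column-level COMPRESSION syntax, using hypothetical names and applying ZLIB only to a long free-text column as suggested above:

CREATE TABLE compressed_example
(
  id BIGINT PRIMARY KEY,
  short_code STRING ENCODING DICT_ENCODING,  -- low-cardinality string, encoding alone is enough
  long_comment STRING COMPRESSION ZLIB       -- long strings that do not benefit much from encoding
)
PARTITION BY HASH (id) PARTITIONS 2
STORED AS KUDU;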

BLOCK_SIZE Attribute Although Kudu does not use HDFS files internally, and thus is not affected by the HDFS block size, it does have an underlying unit of I/O called the block size. The BLOCK_SIZE attribute lets you set the block size for any column. The block size attribute is a relatively advanced feature. This is an unsupported feature and is considered experimental.

Kudu Replication Factor By default, Kudu tables created through Impala use a tablet replication factor of 3. To change the replication factor for a Kudu table, specify the replication factor using the TBLPROPERTIES clause in the CREATE TABLE statement as shown below, where n is the replication factor you want to use:

TBLPROPERTIES ('kudu.num_tablet_replicas' = 'n')

The number of replicas for a Kudu table must be odd. Altering the kudu.num_tablet_replicas property after table creation currently has no effect.
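For example, a sketch (with a hypothetical critical_events table) that creates a table with five replicas per tablet instead of the default three:

CREATE TABLE critical_events
(
  event_id BIGINT PRIMARY KEY,
  payload STRING
)
PARTITION BY HASH (event_id) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');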

How Impala Handles Kudu Metadata Much of the metadata for Kudu tables is handled by the underlying storage layer. Kudu tables have less reliance on the Metastore database, and require less metadata caching on the Impala side. For example, information about partitions in Kudu tables is managed by Kudu, and Impala does not cache any block locality metadata for Kudu tables. The REFRESH and INVALIDATE METADATA statements are needed less frequently for Kudu tables than for HDFS-backed tables. Neither statement is needed when data is added to, removed from, or updated in a Kudu table, even if the changes are made directly to Kudu through a client program using the Kudu API. Run REFRESH table_name or INVALIDATE METADATA table_name for a Kudu table only after making a change to the Kudu table schema, such as adding or dropping a column. Because Kudu manages the metadata for its own tables separately from the Metastore database, there is a table name stored in the Metastore database for Impala to use, and a table name on the Kudu side, and these names can be modified independently through ALTER TABLE statements. To avoid potential name conflicts, the prefix impala:: and the Impala database name are encoded into the underlying Kudu table name.
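For example, assuming a hypothetical Kudu table named kudu_events:

-- A schema change made outside Impala (for example, a column added through the Kudu API)
-- becomes visible after refreshing the metadata for that one table:
REFRESH kudu_events;
-- No REFRESH or INVALIDATE METADATA is needed after INSERT, UPDATE, DELETE, or UPSERT,
-- even when those changes are made directly through the Kudu API.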

Partitioning for Kudu Tables Kudu tables use special mechanisms to distribute data among the underlying tablet servers. Although referred to as partitioned tables, they are distinguished from traditional Impala partitioned tables by the different syntax in the CREATE TABLE statement. Kudu tables use PARTITION BY, HASH, RANGE, and range specification clauses rather than the PARTITIONED BY clause for HDFS-backed tables, which specifies only a column name and creates a new partition for each different value. To see the current partitioning scheme for a Kudu table, you can use the SHOW CREATE TABLE statement or the SHOW PARTITIONS statement. The CREATE TABLE syntax displayed by this statement includes all the hash, range, or both clauses that reflect the original table structure plus any subsequent ALTER TABLE statements that changed the table structure. To see the underlying buckets and partitions for a Kudu table, use the SHOW TABLE STATS or SHOW PARTITIONS statement.
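For example, for a hypothetical Kudu table named kudu_events:

SHOW CREATE TABLE kudu_events;  -- displays the hash and range clauses, including later ALTER TABLE changes
SHOW PARTITIONS kudu_events;    -- displays the underlying buckets and ranges
SHOW TABLE STATS kudu_events;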


Hash Partitioning Hash partitioning is the simplest type of partitioning for Kudu tables. For hash-partitioned Kudu tables, inserted rows are divided up between a fixed number of “buckets” by applying a hash function to the values of the columns specified in the HASH clause. Hashing ensures that rows with similar values are evenly distributed, instead of clumping together all in the same bucket. Spreading new rows across the buckets this way lets insertion operations work in parallel across multiple tablet servers. Separating the hashed values can impose additional overhead on queries, where queries with range-based predicates might have to read multiple tablets to retrieve all the relevant values.

-- 1M rows with 50 hash partitions = approximately 20,000 rows per partition.
-- The values in each partition are not sequential, but rather based on a hash function.
-- Rows 1, 99999, and 123456 might be in the same partition.
CREATE TABLE million_rows (id string primary key, s string)
  PARTITION BY HASH(id) PARTITIONS 50 STORED AS KUDU;

-- Because the ID values are unique, we expect the rows to be roughly -- evenly distributed between the buckets in the destination table. INSERT INTO million_rows SELECT * FROM billion_rows ORDER BY id LIMIT 1e6;

The largest number of buckets that you can create with a PARTITIONS clause varies depending on the number of tablet servers in the cluster, while the smallest is 2. For large tables, prefer to use roughly 10 partitions per server in the cluster.

Range Partitioning Range partitioning lets you specify partitioning precisely, based on single values or ranges of values within one or more columns. You add one or more RANGE clauses to the CREATE TABLE statement, following the PARTITION BY clause. The RANGE clause includes a combination of constant expressions, VALUE or VALUES keywords, and comparison operators.

CREATE TABLE t1 (id STRING PRIMARY KEY, s STRING)
  PARTITION BY RANGE (PARTITION 'a' <= VALUES < '{', PARTITION 'A' <= VALUES < '[', PARTITION VALUES = '00000')
  STORED AS KUDU;

Note: When defining ranges, be careful to avoid “fencepost errors” where values at the extreme ends might be included or omitted by accident. For example, in the tables defined in the preceding code listings, the range "a" <= VALUES < "{" ensures that any values starting with z, such as za or zzz or zzz-ZZZ, are all included, by using a less-than operator for the smallest value after all the values starting with z. For range-partitioned Kudu tables, an appropriate range must exist before a data value can be created in the table. Any INSERT, UPDATE, or UPSERT statements fail if they try to create column values that fall outside the specified ranges. The error checking for ranges is performed on the Kudu side. Impala passes the specified range information to Kudu, and passes back any error or warning if the ranges are not valid. (A nonsensical range specification causes an error for a DDL statement, but only a warning for a DML statement.) Partition ranges can be non-contiguous:

PARTITION BY RANGE (year) (PARTITION 1885 <= VALUES <= 1889, PARTITION 1893 <= VALUES <= 1897)


The ALTER TABLE statement with the ADD PARTITION or DROP PARTITION clauses can be used to add or remove ranges from an existing Kudu table.

ALTER TABLE foo ADD PARTITION 30 <= VALUES < 50; ALTER TABLE foo DROP PARTITION 1 <= VALUES < 5;

When a range is added, the new range must not overlap with any of the previous ranges; that is, it can only fill in gaps within the previous ranges. When a range is removed, all the associated rows in the table are deleted regardless of whether the table is internal or external. Kudu tables can also use a combination of hash and range partitioning. For example:

PARTITION BY HASH (school) PARTITIONS 10, RANGE (letter_grade) (PARTITION VALUE = 'A', PARTITION VALUE = 'B', PARTITION VALUE = 'C', PARTITION VALUE = 'D', PARTITION VALUE = 'F')
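Putting the clause above into a complete statement, the following sketch (with hypothetical table and column names) combines hash and range partitioning; every column used in the HASH or RANGE clauses must be part of the primary key:

CREATE TABLE report_cards
(
  school STRING,
  letter_grade STRING,
  student_id BIGINT,
  score DOUBLE,
  PRIMARY KEY (school, letter_grade, student_id)
)
PARTITION BY HASH (school) PARTITIONS 10,
  RANGE (letter_grade)
  (
    PARTITION VALUE = 'A', PARTITION VALUE = 'B', PARTITION VALUE = 'C',
    PARTITION VALUE = 'D', PARTITION VALUE = 'F'
  )
STORED AS KUDU;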

Impala DML for Kudu Tables In DML statements, Impala supports specific features and properties that only apply to Kudu tables. The UPDATE and DELETE statements let you modify data within Kudu tables without rewriting substantial amounts of table data. The UPSERT statement acts as a combination of INSERT and UPDATE, inserting rows where the primary key does not already exist, and updating the non-primary key columns where the primary key does already exist in the table. The INSERT statement for Kudu tables honors the unique and NOT NULL requirements for the primary key columns. Because Impala and Kudu do not support transactions, the effects of any INSERT, UPDATE, or DELETE statement are immediately visible. For example, you cannot do a sequence of UPDATE statements and only make the changes visible after all the statements are finished. Also, if a DML statement fails partway through, any rows that were already inserted, deleted, or changed remain in the table; there is no rollback mechanism to undo the changes. In particular, an INSERT ... SELECT statement that refers to the table being inserted into might insert more rows than expected, because the SELECT part of the statement sees some of the new rows being inserted and processes them again. Query hints: The INSERT or UPSERT operations into Kudu tables automatically add an exchange and a sort node to the plan that partitions and sorts the rows according to the partitioning/primary key scheme of the target table (unless the number of rows to be inserted is small enough to trigger single node execution). Since Kudu partitions and sorts rows on write, pre-partitioning and sorting takes some of the load off of Kudu and helps large INSERT operations to complete without timing out. However, this default behavior may slow down the end-to-end performance of the INSERT or UPSERT operations. You can use the /* +NOCLUSTERED */ and /* +NOSHUFFLE */ hints together to disable partitioning and sorting before the rows are sent to Kudu. Additionally, since sorting may consume a large amount of memory, consider setting the MEM_LIMIT query option for those queries.
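For example, a minimal sketch using hypothetical kudu_accounts and staged_accounts tables:

-- Single-row modifications against a Kudu table:
UPSERT INTO kudu_accounts (id, balance) VALUES (42, 100.0);
UPDATE kudu_accounts SET balance = balance + 10 WHERE id = 42;
DELETE FROM kudu_accounts WHERE id = 42;

-- For a large load, optionally skip the automatic pre-partition/pre-sort step
-- and cap the memory the statement can use:
SET MEM_LIMIT=2g;
INSERT INTO kudu_accounts /* +NOCLUSTERED,NOSHUFFLE */ SELECT id, balance FROM staged_accounts;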

Loading Data into Kudu Tables Kudu tables are well-suited to use cases where data arrives continuously, in small or moderate volumes. To bring data into Kudu tables, use the Impala INSERT and UPSERT statements. The LOAD DATA statement does not support Kudu tables.


Because Kudu manages its own storage layer that is optimized for smaller block sizes than HDFS, and performs its own maintenance to keep data evenly distributed, it is not subject to the “many small files” issue and does not need explicit reorganization and compaction as the data grows over time. The partitions within a Kudu table can be specified to cover a variety of possible data distributions, instead of hardcoding a new partition for each new day, hour, and so on, which can lead to inefficient, hard-to-scale, and hard-to-manage partition schemes with HDFS tables. Your strategy for performing ETL or bulk updates on Kudu tables should take into account the limitations on consistency for DML operations.
• The INSERT, UPDATE, and UPSERT operations should produce an identical result even when executed multiple times.
• If a bulk operation is in danger of exceeding capacity limits due to timeouts or high memory usage, split it into a series of smaller operations.
• Avoid running concurrent ETL operations where the end results depend on precise ordering. In particular, do not rely on an INSERT ... SELECT statement that selects from the same table into which it is inserting, unless you include extra conditions in the WHERE clause to avoid reading the newly inserted rows within the same statement.
• Because relationships between tables cannot be enforced by Impala and Kudu, and cannot be committed or rolled back together, do not expect transactional semantics for multi-table operations.

Consistency Considerations for Kudu Tables Kudu tables have consistency characteristics such as uniqueness, controlled by the primary key columns and non-nullable columns. The emphasis for consistency is on preventing duplicate or incomplete data from being stored in a table. Currently, Kudu does not enforce strong consistency for order of operations, total success or total failure of a multi-row statement, or data that is read while a write operation is in progress. Changes are applied atomically to each row, but not applied as a single unit to all rows affected by a multi-row DML statement. That is, Kudu does not currently have atomic multi-row statements or isolation between statements. If some rows are rejected during a DML operation because of a mismatch with duplicate primary key values, NOT NULL constraints, and so on, the statement succeeds with a warning. Impala still inserts, deletes, or updates the other rows that are not affected by the constraint violation. Consequently, the number of rows affected by a DML operation on a Kudu table might be different than you expect. Because there is no strong consistency guarantee for information being inserted into, deleted from, or updated across multiple tables simultaneously, consider denormalizing the data where practical. That is, if you run separate INSERT statements to insert related rows into two different tables, one INSERT might fail while the other succeeds, leaving the data in an inconsistent state. Even if both inserts succeed, a join query might happen during the interval between the completion of the first and second statements, and the query would encounter incomplete inconsistent data. Denormalizing the data into a single wide table can reduce the possibility of inconsistency due to multi-table operations. Information about the number of rows affected by a DML operation is reported in impala-shell output, and in the PROFILE output, but is not currently reported to HiveServer2 clients such as JDBC or ODBC applications.

Handling Date, Time, or Timestamp Data with Kudu You can include TIMESTAMP columns in Kudu tables. The behavior of TIMESTAMP for Kudu tables has some special considerations:
• Any nanoseconds in the original 96-bit value produced by Impala are not stored, because Kudu represents date/time columns using 64-bit values. The nanosecond portion of the value is rounded, not truncated. Therefore, a TIMESTAMP value that you store in a Kudu table might not be bit-for-bit identical to the value returned by a query.
• The conversion between the Impala 96-bit representation and the Kudu 64-bit representation introduces some performance overhead when reading or writing TIMESTAMP columns. You can minimize the overhead during writes by performing inserts through the Kudu API. Because the overhead during reads applies to each query, you might continue to use a BIGINT column to represent date/time values in performance-critical applications.
• The Impala TIMESTAMP type has a narrower range for years than the underlying Kudu data type. Impala can represent years 1400-9999. If year values outside this range are written to a Kudu table by a non-Impala client, Impala returns NULL by default when reading those TIMESTAMP values during a query. Or, if the ABORT_ON_ERROR query option is enabled, the query fails when it encounters a value with an out-of-range year.
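As an illustration of the BIGINT alternative mentioned above, the following sketch assumes a hypothetical kudu_events_bigint table whose event_time_us column stores microseconds since the epoch:

SELECT from_unixtime(event_time_us DIV 1000000) AS event_time
FROM kudu_events_bigint
WHERE event_time_us >= unix_timestamp('2021-01-01', 'yyyy-MM-dd') * 1000000;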

Impala with HBase You can use Impala to query data residing in HBase tables, a key-value data store where the value consists of multiple fields. The key is mapped to one column in the Impala table, and the various fields of the value are mapped to the other columns in the Impala table. HBase tables are best suited for use with Impala in the following use cases.
• Storing rapidly incrementing counters, such as how many times a web page has been viewed, or on a social network, how many connections a user has or how many votes a post received. The append-only storage mechanism of HBase is efficient for writing each change to disk, and a query always returns the latest value. An application could query specific totals like these from HBase, and combine the results with a broader set of data queried from Impala.
• Storing very wide tables in HBase. Wide tables have many columns, possibly thousands, typically recording many attributes for an important subject such as a user of an online service. These tables are also often sparse, that is, most of the column values are NULL, 0, false, empty string, or other blank or placeholder value. (For example, any particular web site user might have never used some site feature, filled in a certain field in their profile, visited a particular part of the site, and so on.) A typical query against this kind of table is to look up a single row to retrieve all the information about a specific subject, rather than summing, averaging, or filtering millions of rows as in typical Impala-managed tables.
HBase works out of the box with Impala. There is no mandatory configuration needed to use these two components together. For efficient queries, use WHERE clauses to find a single key value or a range of key values wherever practical, by testing the Impala column corresponding to the HBase row key. Avoid queries that do full-table scans, which are efficient for regular Impala tables but inefficient in HBase. To work with an HBase table from Impala, ensure that the impala user has read/write privileges for the HBase table, using the GRANT command in the HBase shell.

Creating HBase Tables for Impala You create the tables on the Impala side using the Hive shell, because the Impala CREATE TABLE statement currently does not support custom SerDes and some other syntax needed for HBase tables.
• You create the new table as an HBase table using the STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' clause on the Hive CREATE TABLE statement.
• You map these specially created tables to corresponding tables that exist in HBase, with the clause TBLPROPERTIES("hbase.table.name" = "table_name_in_hbase") on the Hive CREATE TABLE statement.
• You define the column corresponding to the HBase row key as a string with the #string keyword, or map it to a STRING column.
• After creating a new table through Hive, issue the INVALIDATE METADATA statement in Impala to make Impala aware of the new table.
Because Impala and Hive share the same Metastore database, once you create the table in Hive, you can query or insert into it through Impala.


Supported Data Types for HBase Columns HBase does not enforce any typing for the key or value fields. All the type enforcement is done on the Impala side. HBase row key When creating the table through the Hive shell, use the STRING data type for the column that corresponds to the HBase row key. For best performance of Impala queries against HBase tables, most queries will perform comparisons in the WHERE clause against the column that corresponds to the HBase row key. Impala can translate predicates (through operators such as =, <, and BETWEEN) against this column into fast lookups in HBase, but this optimization (“predicate pushdown”) only works when that column is defined as STRING. HBase numeric column Define HBase numeric columns as the binary data type in the Hive CREATE TABLE statement using the #binary (#b) keyword. Defining numeric columns as binary can reduce the overall data volume in the HBase tables. Impala also supports reading and writing to columns that are defined in the Hive CREATE TABLE statement using binary data types, represented in the Hive table definition using the #binary keyword, often abbreviated as #b.

Loading Data into HBase Tables The Impala INSERT statement supports HBase tables. The INSERT ... VALUES syntax is ideally suited to HBase tables, because inserting a single row is an efficient operation for an HBase table. When you use the INSERT ... SELECT syntax, the result in the HBase table could be fewer rows than you expect. HBase only stores the most recent version of each unique row key, so if an INSERT ... SELECT statement copies over multiple rows containing the same value for the key column, subsequent queries will only return one row with each key column value. Successive INSERT statements using the same value for the key column achieve the same result as UPDATE.

Examples of Querying HBase Tables from Impala The following examples create an HBase table with four column families, create a corresponding table through Hive, then insert and query the table through Impala. In HBase, create a table. Table names are quoted in the CREATE statement in HBase.

hbase(main):001:0> create 'hbasealltypessmall', 'boolsCF', 'intsCF', 'floatsCF', 'stringsCF'

Issue the following CREATE TABLE statement in the Hive shell. This example creates an external table mapped to the HBase table, usable by both Impala and Hive. It is defined as an external table so that when dropped by Impala or Hive, the original HBase table is not touched at all. The WITH SERDEPROPERTIES clause specifies that the first column (ID) represents the row key, and maps the remaining columns of the SQL table to HBase column families. The mapping relies on the ordinal order of the columns in the table, not the column names in the CREATE TABLE statement. The first column is defined to be the lookup key; the STRING data type produces the fastest key-based lookups for HBase tables. Note: For Impala with HBase tables, the most important aspect to ensure good performance is to use a STRING column as the row key, as shown in this example.

hive> CREATE EXTERNAL TABLE hbasestringids (
    >   id string,
    >   bool_col boolean,
    >   tinyint_col tinyint,
    >   smallint_col smallint,
    >   int_col int,
    >   bigint_col bigint,
    >   float_col float,
    >   double_col double,
    >   date_string_col string,
    >   string_col string,
    >   timestamp_col timestamp)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES (
    >   "hbase.columns.mapping" =
    >   ":key,boolsCF:bool_col,intsCF:tinyint_col,intsCF:smallint_col,intsCF:int_col,intsCF:bigint_col,floatsCF:float_col,floatsCF:double_col,stringsCF:date_string_col,stringsCF:string_col,stringsCF:timestamp_col"
    > )
    > TBLPROPERTIES("hbase.table.name" = "hbasealltypessmall");

Once you have established the mapping to an HBase table, you can issue DML statements and queries from Impala. The following example shows a series of INSERT statements followed by a query. The ideal kind of query from a performance standpoint retrieves a row from the table based on a row key mapped to a string column. The INVALIDATE METADATA statement makes the table created through Hive visible to Impala.

[impala] > INVALIDATE METADATA hbasestringids;

[impala] > INSERT INTO hbasestringids VALUES ('0001',true,3.141,9.94,1234567,32768,4000,76,'2014-12-31','Hello world',NOW());

[impala] > INSERT INTO hbasestringids VALUES ('0002',false,2.004,6.196,1500,8000,129,127,'2014-01-01','Foo bar',NOW());

[impala] > SELECT * FROM hbasestringids WHERE id = '0001';
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
| id   | bool_col | double_col | float_col         | bigint_col | int_col | smallint_col | tinyint_col | date_string_col | string_col  | timestamp_col                 |
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+
| 0001 | true     | 3.141      | 9.939999580383301 | 1234567    | 32768   | 4000         | 76          | 2014-12-31      | Hello world | 2015-02-10 16:36:59.764838000 |
+------+----------+------------+-------------------+------------+---------+--------------+-------------+-----------------+-------------+-------------------------------+

Performance Considerations for the Impala-HBase Integration Impala uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase. This querying does not read HFiles directly. The extra communication overhead makes it important to choose what data to store in HBase or in HDFS, and construct efficient queries that can retrieve the HBase data efficiently:
• Use HBase tables for queries that return a single row or a small range of rows, not queries that perform a full table scan of an entire table. (If a query has an HBase table and no WHERE clause referencing that table, that is a strong indicator that it is an inefficient query for an HBase table.)
• HBase may offer acceptable performance for storing small dimension tables where the table is small enough that executing a full table scan for every query is efficient enough. However, Kudu is almost always a superior alternative for storing dimension tables. HDFS tables are also appropriate for dimension tables that do not need to support update queries, delete queries or insert queries with small numbers of rows.
Query predicates are applied to row keys as start and stop keys, thereby limiting the scope of a particular lookup. If row keys are not mapped to string columns, then ordering is typically incorrect and comparison operations do not work. For example, if row keys are not mapped to string columns, evaluating for greater than (>) or less than (<) cannot be completed. Predicates on non-key columns can be sent to HBase to scan as SingleColumnValueFilters, providing some performance gains. In such a case, HBase returns fewer rows than if those same predicates were applied using Impala. While there is some improvement, it is not as great when start and stop rows are used. This is because the number of rows that HBase must examine is not limited as it is when start and stop rows are used. As long as the row key predicate only applies to a single row, HBase will locate and return that row. Conversely, if a non-key predicate is used, even if it only applies to a single row, HBase must still scan the entire table to find the correct result.

Limitations and Restrictions of the Impala and HBase Integration The Impala integration with HBase has the following limitations and restrictions, some inherited from the integration between HBase and Hive, and some unique to Impala:
• If you issue a DROP TABLE for an internal (Impala-managed) table that is mapped to an HBase table, the underlying table is not removed in HBase. The Hive DROP TABLE statement removes the HBase table in this case.
• The INSERT OVERWRITE statement is not available for HBase tables. You can insert new data, or modify an existing row by inserting a new row with the same key value, but not replace the entire contents of the table. You can do an INSERT OVERWRITE in Hive if you need this capability.
• If you issue a CREATE TABLE LIKE statement for a table mapped to an HBase table, the new table is also an HBase table, but inherits the same underlying HBase table name as the original. The new table is effectively an alias for the old one, not a new table with identical column structure. Avoid using CREATE TABLE LIKE for HBase tables, to avoid any confusion.
• Copying data into an HBase table using the Impala INSERT ... SELECT syntax might produce fewer new rows than are in the query result set. If the result set contains multiple rows with the same value for the key column, each row supersedes any previous rows with the same key value. Because the order of the inserted rows is unpredictable, you cannot rely on this technique to preserve the “latest” version of a particular key value.
• The LOAD DATA statement cannot be used with HBase tables.
• The TABLESAMPLE clause of the SELECT statement does not support HBase tables.

Impala with Azure Data Lake Store (ADLS) You can use Impala to query data residing on the Azure Data Lake Store (ADLS) filesystem and Azure Blob File System (ABFS). This capability allows convenient access to a storage system that is remotely managed, accessible from anywhere, and integrated with various cloud-based services. Impala can query files in any supported file format from ADLS. The ADLS storage location can be for an entire table or individual partitions in a partitioned table. The default Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, queries against ADLS data are less performant, making ADLS suitable for holding “cold” data that is only queried occasionally, while more frequently accessed “hot” data resides in HDFS. In a partitioned table, you can set the LOCATION attribute for individual partitions to put some partitions on HDFS and others on ADLS, typically depending on the age of the data. Impala requires that the default filesystem for the cluster be HDFS. You cannot use ADLS as the only filesystem in the cluster. To be able to access ADLS, first set up an Azure account, configure an ADLS store, and configure your cluster with appropriate credentials.


Creating Impala Databases, Tables, and Partitions for Data Stored on ADLS To create a table that resides on ADLS, specify the ADLS details in the LOCATION clause of the CREATE TABLE or ALTER TABLE statement. The syntax for the LOCATION clause is: • For ADLS Gen1:

LOCATION 'adl://account.azuredatalakestore.net/path/file'
• For ADLS Gen2:

LOCATION 'abfs://container@account.dfs.core.windows.net/path/file'

or

LOCATION 'abfss://container@account.dfs.core.windows.net/path/file'

container denotes the parent location that holds the files and folders, which is a container in the Azure Blob Storage service. account is the name given for your storage account. Any reference to an ADLS location must be fully qualified. (This rule applies when ADLS is not designated as the default filesystem.) Once a table or partition is designated as residing on ADLS, the SELECT statement transparently accesses the data files from the appropriate storage layer. ALTER TABLE can also set the LOCATION property for an individual partition so that some data in a table resides on ADLS and other data in the same table resides on HDFS. Note: By default, TLS is enabled both with abfs:// and abfss://. When you set the fs.azure.always.use.https=false property, TLS is disabled with abfs://, and TLS is enabled with abfss://. For a partitioned table, either specify a separate LOCATION clause for each new partition, or specify a base LOCATION for the table and set up a directory structure in ADLS to mirror the way Impala partitioned tables are structured in HDFS. Although, strictly speaking, ADLS filenames do not have directory paths, Impala treats ADLS filenames with / characters the same as HDFS pathnames that include directories. To point a nonpartitioned table or an individual partition at ADLS, specify a single directory path in ADLS, which could be any arbitrary directory. To replicate the structure of an entire Impala partitioned table or database in ADLS requires more care, with directories and subdirectories nested and named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if necessary in HDFS, and recording the complete directory structure so that you can replicate it in ADLS. For example, the following session creates a partitioned table where only a single partition resides on ADLS. The partitions for years 2013 and 2014 are located on HDFS. The partition for year 2015 includes a LOCATION attribute with an adl:// URL, and so refers to data residing on ADLS, under a specific path underneath the store impalademo.

CREATE TABLE mostly_on_hdfs (x INT) PARTITIONED BY (year INT); ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2013); ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2014); ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2015) LOCATION 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/t1';


When working with multiple tables with data files stored in ADLS, you can create a database with the LOCATION attribute pointing to an ADLS path. Specify a URL of the form as shown above. Any tables created inside that database automatically create directories underneath the one specified by the database LOCATION attribute. Use the standard ADLS file upload methods to actually put the data files into the right locations. You can also put the directory paths and data files in place before creating the associated Impala databases or tables, and Impala automatically uses the data from the appropriate location after the associated databases and tables are created. You can switch whether an existing table or partition points to data in HDFS or ADLS. For example, if you have an Impala table or partition pointing to data files in HDFS or ADLS, and you later transfer those data files to the other filesystem, use an ALTER TABLE statement to adjust the LOCATION attribute of the corresponding table or partition to reflect that change. This location-switching technique is not practical for entire databases that have a custom LOCATION attribute. You cannot use the ALTER TABLE ... SET CACHED statement for tables or partitions that are located in ADLS.
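For example, a sketch that creates a database on ADLS, using the same impalademo store as the examples above and a hypothetical path:

CREATE DATABASE adls_db
  LOCATION 'adl://impalademo.azuredatalakestore.net/databases/adls_db';
-- Tables created in this database get their data directories under the database LOCATION.
CREATE TABLE adls_db.t1 (x INT);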

Internal and External Tables Located on ADLS Just as with tables located on HDFS storage, you can designate ADLS-based tables as either internal (managed by Impala) or external, by using the syntax CREATE TABLE or CREATE EXTERNAL TABLE respectively. When you drop an internal table, the files associated with the table are removed, even if they are on ADLS storage. When you drop an external table, the files associated with the table are left alone, and are still available for access by other tools or components. If the data on ADLS is intended to be long-lived and accessed by other tools in addition to Impala, create any associated ADLS tables with the CREATE EXTERNAL TABLE syntax, so that the files are not deleted from ADLS when the table is dropped. If the data on ADLS is only needed for querying by Impala and can be safely discarded once the Impala workflow is complete, create the associated ADLS tables using the CREATE TABLE syntax so that dropping the table also deletes the corresponding data files on ADLS.
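For example, a sketch of an external table on ADLS (the path is hypothetical); dropping it leaves the data files in place:

CREATE EXTERNAL TABLE shared_logs (msg STRING)
  LOCATION 'adl://impalademo.azuredatalakestore.net/shared/logs';
DROP TABLE shared_logs;   -- the files under /shared/logs remain in ADLS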

Loading Data into ADLS for Impala Queries If your ETL pipeline involves moving data into ADLS and then querying through Impala, you can either use Impala DML statements to create, move, or copy the data, or use the same data loading techniques as you would for non-Impala data. Using Impala DML Statements for ADLS Data: The Impala DML statements (INSERT, LOAD DATA, and CREATE TABLE AS SELECT) can write data into a table or partition that resides in the Azure Data Lake Store (ADLS) or ADLS Gen2. Manually Loading Data into Impala Tables on ADLS: You can use the Microsoft-provided methods to bring data files into ADLS for querying through Impala. See the Microsoft ADLS documentation for details. After you upload data files to a location already mapped to an Impala table or partition, or if you delete files in ADLS from such a location, issue the REFRESH statement to make Impala aware of the new set of data files.
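For example, after uploading files into the ADLS directory backing the year=2015 partition of the mostly_on_hdfs table shown earlier:

REFRESH mostly_on_hdfs PARTITION (year=2015);
-- or refresh the whole table:
REFRESH mostly_on_hdfs;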

Running Queries for Data Stored on ADLS Once the appropriate LOCATION attributes are set up at the table or partition level, you query data stored in ADLS the same as data stored on HDFS or in HBase:
• Queries against ADLS data support all the same file formats as for HDFS data.
• Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in ADLS corresponding to the HDFS directories representing partition key values, or use ALTER TABLE ... ADD PARTITION to set up the appropriate paths in ADLS.
• HDFS, Kudu, and HBase tables can be joined to ADLS tables, or ADLS tables can be joined with each other.


• Authorization to control access to databases, tables, or columns works the same whether the data is in HDFS or in ADLS.
• The catalogd daemon caches metadata for both HDFS and ADLS tables. Use REFRESH and INVALIDATE METADATA for ADLS tables in the same situations where you would issue those statements for HDFS tables.
• Queries against ADLS tables are subject to the same kinds of admission control and resource management as HDFS tables.
• Metadata about ADLS tables is stored in the same Metastore database as for HDFS tables.
• You can set up views referring to ADLS tables, the same as for HDFS tables.
• The COMPUTE STATS, SHOW TABLE STATS, and SHOW COLUMN STATS statements support ADLS tables.

Query Performance for ADLS Data Although Impala queries for data stored in ADLS might be less performant than queries against the equivalent data stored in HDFS, you can still do some tuning. Here are techniques you can use to interpret explain plans and profiles for queries against ADLS data, and tips to achieve the best performance possible for such queries. All else being equal, performance is expected to be lower for queries running against data on ADLS rather than HDFS. The actual mechanics of the SELECT statement are somewhat different when the data is in ADLS. Although the work is still distributed across the DataNodes of the cluster, Impala might parallelize the work for a distributed query differently for data on HDFS and ADLS. ADLS does not have the same block notion as HDFS, so Impala uses heuristics to determine how to split up large ADLS files for processing in parallel. Because all hosts can access any ADLS data file with equal efficiency, the distribution of work might be different than for HDFS data, where the data blocks are physically read using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to read the ADLS data might be spread evenly across the hosts of the cluster, the fact that all data is initially retrieved across the network means that the overall query performance is likely to be lower for ADLS data than for HDFS data. Because data files written to ADLS do not have a default block size the way HDFS data files do, any Impala INSERT or CREATE TABLE AS SELECT statements use the PARQUET_FILE_SIZE query option setting to define the size of Parquet data files. (Using a large block size is more important for Parquet tables than for tables that use other file formats.) When optimizing aspects of complex queries such as the join order, Impala treats tables on HDFS and ADLS the same way. In query profile reports, the numbers for BytesReadLocal, BytesReadShortCircuit, BytesReadDataNodeCached, and BytesReadRemoteUnexpected are blank because those metrics come from HDFS. All the I/O for ADLS tables involves remote reads, and they will appear as “remote read” operations in the query profile.
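For example, a sketch (with hypothetical table names) that targets roughly 256 MB Parquet files when writing to an ADLS-backed table:

SET PARQUET_FILE_SIZE=268435456;   -- 256 MB
INSERT OVERWRITE adls_parquet_table SELECT * FROM staging_table;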

Impala with Amazon S3 You can use Impala to query data residing on the Amazon S3 object store. This capability allows convenient access to a storage system that is remotely managed, accessible from anywhere, and integrated with various cloud-based services. Impala can query files in any supported file format from S3. The S3 storage location can be for an entire table, or individual partitions in a partitioned table.

Best Practices for Using Impala with S3 The following guidelines summarize the best practices described in the rest of this topic:
• Any reference to an S3 location must be fully qualified when S3 is not designated as the default storage, for example, s3a://[s3-bucket-name].
• Set the fs.s3a.connection.maximum safety valve setting to 1500 for impalad.


• Set the fs.s3a.block.size safety valve setting to 134217728 (128 MB in bytes) if most Parquet files queried by Impala were written by Hive or ParquetMR jobs.
• Set the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to 268435456 (256 MB in bytes) if most Parquet files queried by Impala were written by Impala.
• DROP TABLE ... PURGE is much faster than the default DROP TABLE. The same applies to ALTER TABLE ... DROP PARTITION PURGE versus the default DROP PARTITION operation. However, due to the eventually consistent nature of S3, the files for that table or partition could remain for some unbounded time when using PURGE. The default DROP TABLE/PARTITION is slow because Impala copies the files to the S3A trashcan, and Impala waits until all the data is moved. DROP TABLE/PARTITION ... PURGE is a fast delete operation, and the Impala statement finishes quickly even though the change might not have propagated fully throughout S3.
• INSERT statements are faster than INSERT OVERWRITE for S3. The S3_SKIP_INSERT_STAGING query option, which is set to true by default, skips the staging step for regular INSERT (but not INSERT OVERWRITE). This makes the operation much faster, but consistency is not guaranteed: if a node fails during execution, the table could end up with inconsistent data. Set this option to false if stronger consistency is required, but this setting will make the INSERT operations slower.
• For Impala-ACID tables, both INSERT and INSERT OVERWRITE operations for S3 are fast, regardless of the setting of S3_SKIP_INSERT_STAGING. Plus, consistency is guaranteed with ACID tables.
• Enable data cache for remote reads.
• Enable S3Guard in your cluster for data consistency.
• Too many files in a table can make metadata loading and updating slow in S3. If too many requests are made to S3, S3 has a back-off mechanism and responds slower than usual.
• If you have many small files due to over-granular partitioning, configure partitions with many megabytes of data so that even a query against a single partition can be parallelized effectively.
• If you have many small files because of many small INSERT queries, use bulk INSERTs so that more data is written to fewer files.
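The following impala-shell sketch applies some of these recommendations; the table name old_s3_table is hypothetical:

SET PARQUET_OBJECT_STORE_SPLIT_SIZE=268435456;  -- 256 MB splits for Impala-written Parquet
SET S3_SKIP_INSERT_STAGING=false;               -- trade INSERT speed for stronger consistency
DROP TABLE old_s3_table PURGE;                  -- fast delete that bypasses the S3A trashcan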

Creating Impala Databases, Tables, and Partitions for Data Stored in S3 To create a table that resides in S3, run the CREATE TABLE or ALTER TABLE statement with the LOCATION clause. ALTER TABLE can set the LOCATION property for an individual partition, so that some data in a table resides in S3 and other data in the same table resides on HDFS. The syntax for the LOCATION clause is:

LOCATION 's3a://bucket_name/path/to/file'

The file system prefix is always s3a://. Impala does not support the s3:// or s3n:// prefixes. For a partitioned table, either specify a separate LOCATION clause for each new partition, or specify a base LOCATION for the table and set up a directory structure in S3 to mirror the way Impala partitioned tables are structured in HDFS. You point a nonpartitioned table or an individual partition at S3 by specifying a single directory path in S3, which could be any arbitrary directory. To replicate the structure of an entire Impala partitioned table or database in S3 requires more care, with directories and subdirectories nested and named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if necessary in HDFS, and recording the complete directory structure so that you can replicate it in S3. When working with multiple tables with data files stored in S3, you can create a database with a LOCATION attribute pointing to an S3 path. Specify a URL of the form s3a://bucket/root/path/for/database for the LOCATION attribute of the database. Any tables created inside that database automatically create directories underneath the one specified by the database LOCATION attribute. The following example creates a table with one partition for the year 2017 residing on HDFS and one partition for the year 2018 residing in S3.


The partition for year 2018 includes a LOCATION attribute with an s3a:// URL, and so refers to data residing in S3, under a specific path underneath the bucket impala-demo.

CREATE TABLE mostly_on_hdfs (x int) PARTITIONED BY (year INT); ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2017); ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2018) LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1';

The following session creates a database and two partitioned tables residing entirely in S3, one partitioned by a single column and the other partitioned by multiple columns. • Because a LOCATION attribute with an s3a:// URL is specified for the database, the tables inside that database are automatically created in S3 underneath the database directory. • To see the names of the associated subdirectories, including the partition key values, use an S3 client tool to examine how the directory structure is organized in S3.

CREATE DATABASE db_on_s3 LOCATION 's3a://impala-demo/dir1/dir2/dir3'; CREATE TABLE partitioned_multiple_keys (x INT) PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT); ALTER TABLE partitioned_multiple_keys ADD PARTITION (year=2015,month=1,day=1); ALTER TABLE partitioned_multiple_keys ADD PARTITION (year=2015,month=1,day=31);

!hdfs dfs -ls -R s3a://impala-demo/dir1/dir2/dir3
2015-03-17 13:56:34          0 dir1/dir2/dir3/
2015-03-17 16:47:13          0 dir1/dir2/dir3/partitioned_multiple_keys/
2015-03-17 16:47:44          0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/
2015-03-17 16:47:50          0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/

The CREATE DATABASE and CREATE TABLE statements create the associated directory paths if they do not already exist. You can specify multiple levels of directories, and the CREATE statement creates all appropriate levels, similar to using mkdir -p. Use the standard S3 file upload methods to put the actual data files into the right locations. You can also put the directory paths and data files in place before creating the associated Impala databases or tables, and Impala automatically uses the data from the appropriate location after the associated databases and tables are created. Use the ALTER TABLE statement with the LOCATION clause to switch whether an existing table or partition points to data in HDFS or S3. For example, if you have an Impala table or partition pointing to data files in HDFS or S3, and you later transfer those data files to the other filesystem, use the ALTER TABLE statement to adjust the LOCATION attribute of the corresponding table or partition to reflect that change.
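For example, a sketch that repoints a partition of the mostly_on_hdfs table after its files were copied to a hypothetical S3 path:

ALTER TABLE mostly_on_hdfs PARTITION (year=2017)
  SET LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1_2017';
REFRESH mostly_on_hdfs;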

Internal and External Tables Located in S3 Just as with tables located on HDFS storage, you can designate S3-based tables as either internal (managed by Impala) or external with the CREATE TABLE or CREATE EXTERNAL TABLE statement respectively. When you drop an internal table, the files associated with the table are removed, even if they are in S3 storage. When you drop an external table, the files associated with the table are left alone, and are still available for access by other tools or components. If the data in S3 is intended to be long-lived and accessed by other tools in addition to Impala, create any associated S3 tables with the CREATE EXTERNAL TABLE statement, so that the files are not deleted from S3 when the table is dropped. If the data in S3 is only needed for querying by Impala and can be safely discarded once the Impala workflow is complete, create the associated S3 tables using the CREATE TABLE statement, so that dropping the table also deletes the corresponding data files in S3.


Loading Data into S3 for Impala Queries If your ETL pipeline involves moving data into S3 and then querying through Impala, you can either use Impala DML statements to create, move, or copy the data, or use the same data loading techniques as you would for non-Impala data. Using Impala DML Statements for S3 Data: Impala DML statements (INSERT, LOAD DATA, and CREATE TABLE AS SELECT) can write data into a table or partition that resides in S3. Because of differences between S3 and traditional filesystems, DML operations for S3 tables can take longer than for tables on HDFS. For example, both the LOAD DATA statement and the final stage of the INSERT and CREATE TABLE AS SELECT statements involve moving files from one directory to another. (In the case of INSERT and CREATE TABLE AS SELECT, the files are moved from a temporary staging directory to the final destination directory.) Because S3 does not support a “rename” operation for existing objects, in these cases, Impala copies the data files from one location to another and then removes the original files. Manually Loading Data into Impala Tables in S3: You can use the Amazon-provided methods to bring data files into S3 for querying through Impala. After you upload data files to a location already mapped to an Impala table or partition, or if you delete files in S3 from such a location outside of Impala, issue the REFRESH statement to make Impala aware of the new set of data files.

Running and Tuning Impala Queries for Data Stored in S3 Once a table or partition is designated as residing in S3, the SELECT statement transparently accesses the data files from the appropriate storage layer.
• Queries against S3 data support all the same file formats as HDFS data.
• Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in S3 based on partition key values, or use ALTER TABLE ... ADD PARTITION to set up the appropriate paths in S3.
• HDFS and HBase tables can be joined to S3 tables, or S3 tables can be joined with each other.
• Authorization to control access to databases, tables, or columns works the same whether the data is in HDFS or in S3.
• The Catalog Server (catalogd) daemon caches metadata for both HDFS and S3 tables.
• Queries against S3 tables are subject to the same kinds of admission control and resource management as HDFS tables.
• Metadata about S3 tables is stored in the same Metastore database as for HDFS tables.
• You can set up views referring to S3 tables.
• The COMPUTE STATS, SHOW TABLE STATS, and SHOW COLUMN STATS statements work for S3 tables.

Query Performance for S3 All else being equal, performance is expected to be lower for queries running against data in S3 than HDFS. The actual mechanics of the SELECT statement are different when the data is in S3. Although the work is still distributed across the DataNodes of the cluster, Impala might parallelize the work for a distributed query differently for data on HDFS and S3. S3 does not have the same block notion as HDFS, so Impala uses heuristics to determine how to split up large S3 files for processing in parallel. Because all hosts can access any S3 data file with equal efficiency, the distribution of work might be different than for HDFS data, where the data blocks are physically read using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to read the S3 data might be spread evenly across the hosts of the cluster, the fact that all data is initially retrieved across the network means that the overall query performance is likely to be lower for S3 data than for HDFS data. Use the PARQUET_OBJECT_STORE_SPLIT_SIZE query option to control the Parquet-specific split size. The default value is 256 MB.


When optimizing aspects of complex queries such as the join order, Impala treats tables on HDFS and S3 the same way. Therefore, follow all the same tuning recommendations for S3 tables as for HDFS ones, such as using the COMPUTE STATS statement to help Impala construct accurate estimates of row counts and cardinality. In query profile reports, the numbers for BytesReadLocal, BytesReadShortCircuit, BytesReadDataNodeCached, and BytesReadRemoteUnexpected are blank because those metrics come from HDFS. By definition, all the I/O for S3 tables involves remote reads.

Restrictions on Impala Support for S3 The following restrictions apply when using Impala with S3:
• Impala does not support the old s3:// block-based and s3n:// filesystem schemes, and it only supports s3a://.
• Although S3 is often used to store JSON-formatted data, the current Impala support for S3 does not include directly querying JSON data. For Impala queries, use data files in one of the file formats listed in Hadoop File Formats Support on page 50. If you have data in JSON format, you can prepare a flattened version of that data for querying by Impala as part of your ETL cycle.
• You cannot use the ALTER TABLE ... SET CACHED statement for tables or partitions that are located in S3.

Specifying Impala Credentials to Access S3 Specify an Impala credential to access data in Amazon S3.

Procedure
1. In Cloudera Manager, navigate to Administration > External Accounts.
2. In the AWS Credentials tab, click Add Access Key Credentials.
3. Enter a Name of your choosing for this account.
4. Enter the AWS Access Key ID.
5. Enter the AWS Secret Key.
6. Click Add.
7. Click Save to finish adding the AWS Credential.
8. Select Cluster Access to S3.

Ports Used by Impala

Impala uses the TCP ports listed in the following table. Before deploying Impala, ensure these ports are open on each system. You can use the corresponding settings in Cloudera Manager to override the default port numbers.

Scope | Cloudera Manager Setting / Startup Flag | Default Port | Comment
Impala Daemon | Impala Daemon Frontend Port / --beeswax_port | 21000 | Port on which Beeswax client requests are served by Impala Daemons.
Impala Daemon | Impala Daemon Frontend Port / --hs2_port | 21050 | Port on which HiveServer2 client requests are served by Impala Daemons.
Impala Daemon | Impala Daemon Backend Port / --be_port | 22000 | Internal use only. Impala daemons use this port to communicate with each other.
Impala Daemon | StateStoreSubscriber Service Port / --state_store_subscriber_port | 23000 | Internal use only. Impala daemons listen on this port for updates from the StateStore daemon.
Catalog Daemon | StateStoreSubscriber Service Port / --state_store_subscriber_port | 23020 | Internal use only. The catalog daemon listens on this port for updates from the StateStore daemon.
Impala Daemon | Impala Daemon HTTP Server Port / --webserver_port | 25000 | Impala debug Web UI for administrators to monitor and troubleshoot.
Impala StateStore Daemon | StateStore HTTP Server Port / --webserver_port | 25010 | StateStore debug Web UI for administrators to monitor and troubleshoot.
Impala Catalog Daemon | Catalog HTTP Server Port / --webserver_port | 25020 | Catalog Server debug Web UI for administrators to monitor and troubleshoot.
Impala StateStore Daemon | StateStore Service Port / --state_store_port | 24000 | Internal use only. The StateStore daemon listens on this port for registration/unregistration requests.
Impala Catalog Daemon | Catalog Service Port / --catalog_service_port | 26000 | Internal use only. The catalog service uses this port to communicate with the Impala daemons.
Impala Daemon | KRPC Port / --krpc_port | 27000 | Internal use only. Impala daemons use this port for KRPC based communication with each other.
Impala Daemon | HiveServer2 HTTP Port / --hs2_http_port | 28000 | Used to transmit commands and receive results by client applications over HTTP via the HiveServer2 protocol.

Related Information Starting and Stopping Impala

Migration Guide

This topic describes the important differences between Impala in CDH and Impala in CDP. These changes were made in CDP to provide optimal interoperability between Hive and Impala and an improved user experience. Review the changes before you migrate your Impala workload from CDH to CDP.

Default Value Changes in Configuration Options

Configuration Option | Scope | Default in CDH 6.x | Default in CDP
DEFAULT_FILE_FORMAT | Query | TEXT | PARQUET
hms_event_polling_interval_s | Catalogd | 0 | 2
ENABLE_ORC_SCANNER | Query | TRUE | FALSE
use_local_catalog | Coordinator / Catalogd | false | true
catalog_topic_mode | Coordinator | full | minimal

New Configuration Options

Configuration Option | Scope | Default Value
default_transactional_type | Coordinator | insert_only
DEFAULT_TRANSACTIONAL_TYPE | Query | insert_only
disable_hdfs_num_rows_estimate | Impalad | false
disconnected_session_timeout | Coordinator | 900
PARQUET_OBJECT_STORE_SPLIT_SIZE | Query | 256 MB
SPOOL_QUERY_RESULTS | Query | FALSE
MAX_RESULT_SPOOLING_MEM | Query | 100 MB
MAX_SPILLED_RESULT_SPOOLING_MEM | Query | 1 GB
FETCH_ROWS_TIMEOUT_MS | Query | 10 s
DISABLE_HBASE_NUM_ROWS_ESTIMATE | Query | FALSE

Default File Format
New default behavior: In CDP, the default file format for tables is Parquet. When you issue a CREATE TABLE statement without the STORED AS clause, Impala creates a Parquet table instead of a Text table as it did in CDH. For example, if you create an external table based on a text file without providing a STORED AS clause and then issue a SELECT query, the query fails in CDP because Impala expects the file to be in the Parquet file format.
Steps to switch to the CDH behavior:
• Add an explicit STORED AS clause to CREATE TABLE statements if the file format is not Parquet.
• Start the coordinator with the default file format set to text for all newly created tables, for example by including default_file_format=text in the --default_query_options startup flag.
• Set the DEFAULT_FILE_FORMAT query option to TEXT to revert to the default of the Text format for one or more CREATE TABLE statements, as shown in the sketch below.
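A minimal sketch of reverting to the text format at the session level (table and column names are illustrative):

SET DEFAULT_FILE_FORMAT=TEXT;
CREATE TABLE t1 (c1 INT);                      -- created as a text table while the option is set
CREATE TABLE t2 (c1 INT) STORED AS TEXTFILE;   -- explicit format, independent of the query option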

Default Insert-only Transactional Tables
New default behavior: By default, managed tables in CDP are transactional tables with the insert_only property.
• You can no longer perform file system modifications (add/remove files) on a managed table in CDP. The directory structure for transactional tables is different from that of non-transactional tables, and any out-of-band files that are added may or may not be picked up by Hive and Impala.
• Insert-only transactional tables cannot currently be altered in Impala. The ALTER TABLE statement on a transactional table currently returns an error.
• Impala does not currently support compaction on transactional tables. Use Hive to compact the tables as needed.
• The SELECT, INSERT, INSERT OVERWRITE, and TRUNCATE statements are supported on insert-only transactional tables.
Steps to switch to the CDH behavior:
• If you do not want transactional tables, set the DEFAULT_TRANSACTIONAL_TYPE query option to NONE so that any newly created managed tables are not transactional by default.
• External tables do not drop the data files when the table is dropped. If you want the data to be purged along with the table when the table is dropped, add external.table.purge = true to the table properties. When external.table.purge is set to true, the data is removed when the DROP TABLE statement is executed. See the sketch after this list.
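A minimal sketch of both workarounds (the table name is illustrative):

SET DEFAULT_TRANSACTIONAL_TYPE=NONE;            -- new managed tables in this session are not transactional
CREATE EXTERNAL TABLE ext_sales (id BIGINT)
  TBLPROPERTIES ('external.table.purge'='true');
DROP TABLE ext_sales;                           -- also removes the underlying data files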

Managed and External Tablespace Directories
New default behavior: In CDP, there are separate HDFS directories for managed and external tables.
• The data files for managed tables are located in the warehouse location specified by the Cloudera Manager configuration setting hive_warehouse_directory.
• The data files for external tables are located in the warehouse location specified by the Cloudera Manager configuration setting hive_warehouse_external_directory.
If you perform file system level operations to add or remove files on a table, you need to consider whether it is an external or a managed table to find the location of the table directory.
Steps to switch to the CDH behavior: Check the output of the DESCRIBE FORMATTED command to find the table location, as in the example below.
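For example (the table name is illustrative):

DESCRIBE FORMATTED sales;    -- the Location field in the output shows the table directory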

Hive Default File Format Interoperability
New default behavior: Managed tables created by Hive use the ORC file format by default and support full transactional capabilities. If you create a table in Hive without specifying the STORED AS clause and load data from Hive, such tables are not readable or writable by Impala.
Steps to switch to the CDH behavior:
• Use the STORED AS PARQUET clause when you create tables in Hive if you want interoperability with Impala on those tables.
• If you want to change this default file format at the system level, set the hive_default_fileformat_managed field to parquet in the Hive_on_Tez service configuration in Cloudera Manager.
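A minimal sketch, run in Hive (for example through Beeline), of creating a managed table that stays interoperable with Impala (names are illustrative):

CREATE TABLE interop_events (id BIGINT, payload STRING) STORED AS PARQUET;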

Statistics Interoperability Between Hive and Impala
New default behavior: Statistics for tables are engine specific (Hive or Impala), so that each engine uses its own statistics and does not overwrite the statistics generated by the other engine. When you issue the COMPUTE STATS statement in Impala, you need to issue the corresponding statement in Hive to ensure both Hive and Impala statistics are accurate. The Impala COMPUTE STATS statement does not overwrite the Hive statistics for the same table.
Steps to switch to the CDH behavior: There is no workaround.
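For example, to keep statistics current in both engines for the same table (the table name is illustrative):

COMPUTE STATS sales;                                  -- in Impala
ANALYZE TABLE sales COMPUTE STATISTICS;               -- in Hive
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;   -- in Hive, for column statistics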

ORC Table Support Disabled
New default behavior: ORC table support is disabled by default. Queries against ORC tables that worked in CDH will fail in CDP.
Steps to switch to the CDH behavior: Set the query option ENABLE_ORC_SCANNER to TRUE to re-enable ORC table support. Note that this option does not work on full transactional ORC tables; those queries still return an error. See the example below.
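For example, before querying a non-transactional ORC table in a session (the table name is illustrative):

SET ENABLE_ORC_SCANNER=TRUE;
SELECT COUNT(*) FROM orc_events;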

Reconnect to HS2 Session
New default behavior: Clients can disconnect from Impala while keeping the session running. Clients can reconnect to the same session by presenting the session_token. By default, disconnected sessions are terminated after 15 minutes.
• Your end users will not notice a difference in behavior.
• If clients are disconnected without the driver explicitly closing the session (for example, because of a network fault), the disconnected sessions and the queries associated with them may remain open and continue consuming resources until the disconnected session is timed out. Administrators may notice these disconnected sessions and/or the associated resource consumption.


Steps to switch to the CDH behavior: You can adjust the --disconnected_session_timeout flag to a lower value so that disconnected sessions are cleaned up more quickly.
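For example, to clean up disconnected sessions after five minutes instead of the default 15 minutes, the coordinator could be started with the following flag (the value is illustrative and is specified in seconds):

--disconnected_session_timeout=300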

Automatic Row Count Estimation when Statistics Are Not Present
New default behavior: If no statistics are available on a table, Impala tries to estimate the table's cardinality by estimating the number of rows from the size of the table's data files. This behavior is turned on by default and should result in better plans for most cases when statistics are not present. For some edge cases, Impala may generate a bad plan (when compared to the same query in CDH) when statistics are not present on the table, which could negatively affect query performance.
Steps to switch to the CDH behavior: Set the DISABLE_HDFS_NUM_ROWS_ESTIMATE query option to TRUE to disable this optimization, as shown below.
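For example (the table name is illustrative):

SET DISABLE_HDFS_NUM_ROWS_ESTIMATE=TRUE;
EXPLAIN SELECT COUNT(*) FROM unanalyzed_events;   -- the plan no longer uses the file-size-based row estimate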

Metadata V2
All catalog metadata improvements are enabled by default in CDP, resulting in many performance and scalability improvements, such as reduced memory footprint on coordinators and automatic cache eviction. See Impala Metadata Management for the details about catalog improvements.

Amazon EC2 Instances
Impala Data Hub clusters in CDP are created using the Data Mart cluster type. The recommended EC2 instance type for Impala Data Mart Data Hub clusters is r5d.2xlarge. This instance type provides a good balance of memory versus CPU and also includes fast ephemeral SSD storage, which is used by the Impala data cache and for spilling. When provisioning a Data Mart Data Hub cluster with the Cluster Definition option in the UI (or the --cluster-definition-name option in the CLI), the r5d.2xlarge instance type is selected by default. If you choose the Custom option to provision a cluster with the Data Mart blueprint, you should explicitly specify the r5d.2xlarge instance type because this is not the default type for Custom clusters. Impala DWX clusters use the r5d.2xlarge instance type for executor instances, and it is not possible to override this.

Authorization Support by Ranger
In CDP, Ranger is the authorization provider for Impala.
New behavior:
• The CREATE ROLE, GRANT ROLE, and SHOW ROLE statements are not supported because Ranger currently does not support roles.
• When a particular resource is renamed, the authorization policy is not currently transferred to the newly renamed resource automatically.
• SHOW GRANT with an invalid user/group does not return an error.
The following table lists the different access type requirements to run SQL statements in Hive and Impala.

SQL Statement | Impala Access Requirement | Hive Access Requirement
DESCRIBE view | VIEW_METADATA on the underlying tables | SELECT on the view
ALTER TABLE RENAME, ALTER VIEW RENAME | ALL on the target table / view | ALTER on the source table / view
SHOW DATABASES, SHOW TABLES | VIEW_METADATA | USE


Governance Support by Atlas
Both CDH and CDP environments support governance functionality for Impala operations. The two environments collect similar information to describe Impala activities, including:
• Audits for Impala access requests
• Metadata describing Impala queries
• Metadata describing any new data assets created or updated by Impala operations
The services that support these operations are different in the two environments. Functionality is distributed across services as follows:

Feature | CDH | CDP
Auditing: access requests | Audit tab in Navigator console | Audit page in Ranger console
Auditing: service operations that create or update metadata catalog entries | Audit tab in Navigator console | Audit tab for each entity in Atlas dashboard
Auditing: service operations in general | Audit tab in Navigator console | No other audits collected.
Metadata catalog: Impala operations (CREATE TABLE AS SELECT, CREATE VIEW, ALTER VIEW AS SELECT, INSERT INTO, INSERT OVERWRITE) | Process and Process Execution entities; column- and table-level lineage | Process and Process Execution entities; column- and table-level lineage

Related Information
On-demand Metadata
Impala Authorization
Impala metadata collection
SQL transactions in Impala

Setting up Data Cache for Remote Reads

When Impala compute nodes and their storage are not co-located, the network bandwidth requirement goes up because the network traffic includes both the data fetches and the shuffle exchange traffic of intermediate results. To reduce the pressure on the network, you can enable the compute nodes to cache the working set read from remote filesystems, such as remote HDFS data nodes, S3, ABFS, and ADLS.

About this task
Enable the remote data cache as follows.

Procedure
1. In Cloudera Manager, navigate to Clusters > Impala Service.
2. In the Configuration tab, select Enable Local Data Cache to enable the local Impala Daemon data cache for caching of remote reads.
3. In Impala Daemon Data Cache Directories, add the directories Impala Daemon will use for caching of remote read data.


4. In Impala Daemon Data Cache Per Directory Capacity, specify the maximum amount of local disk space Impala will use per daemon in each of the configured directories for caching of remote read data.
5. Click Save Changes and restart the Impala service.
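These Cloudera Manager settings configure the Impala data cache on each daemon. As a rough sketch of the resulting daemon-level configuration, the underlying startup flag takes a comma-separated list of directories followed by a per-directory capacity; the paths and quota below are illustrative, and the exact flag syntax should be verified for your release:

--data_cache=/mnt/ssd0/impala-datacache,/mnt/ssd1/impala-datacache:500GB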

Managing Metadata in Impala

This section describes various knobs you can use to control how Impala manages its metadata in order to improve performance and scalability.
On-demand Metadata

With the on-demand metadata feature, the Impala coordinators pull metadata as needed from catalogd and cache it locally. The cached metadata gets evicted automatically under memory pressure. The granularity of on-demand metadata fetches is at the partition level between the coordinator and catalogd. Common use cases like add/drop partitions do not trigger unnecessary serialization/deserialization of large metadata. The feature can be used in either of the following modes.
Metadata on-demand mode
In this mode, all coordinators use the metadata on-demand. Set the following on catalogd:

--catalog_topic_mode=minimal

Set the following on all impalad coordinators:

--use_local_catalog=true

Mixed mode
In this mode, only some coordinators are enabled to use the metadata on-demand. We recommend that you use the mixed mode only for testing local catalog’s impact on heap usage. Set the following on catalogd:

--catalog_topic_mode=mixed

Set the following on impalad coordinators with metadata on-demand:

--use_local_catalog=true

Limitation: HDFS caching is not supported on coordinators running in on-demand metadata mode.
Note: In Impala 3.4.0 and above, the global INVALIDATE METADATA statement is supported when the on-demand metadata feature is enabled.
INVALIDATE METADATA Usage Notes:
To return accurate query results, Impala needs to keep the metadata current for the databases and tables queried. Through "automatic invalidation" or "HMS event polling" support, Impala automatically picks up most changes in metadata from the underlying systems. However, there are some scenarios where you might need to run INVALIDATE METADATA or REFRESH:


• If some other entity modifies information used by Impala in the metastore, the information cached by Impala must be updated via INVALIDATE METADATA or REFRESH.
• If you have "local catalog" enabled without "HMS event polling" and need to pick up metadata changes that were made outside of Impala, in Hive or other Hive clients such as SparkSQL.
• And so on.
Note: Because INVALIDATE METADATA is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, prefer REFRESH rather than INVALIDATE METADATA when possible. See the example below.
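For example (the database and table names are illustrative):

REFRESH sales_db.transactions;               -- pick up new data files added to an existing table outside Impala
INVALIDATE METADATA sales_db.transactions;   -- pick up a table created or heavily changed outside Impala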

SQL transactions in Impala

A transaction is a single logical operation on the data. Impala supports transactions that satisfy a level of consistency that improves the integrity and reliability of the data before and after a transaction.

Specifically, Impala provides atomicity and isolation of insert operations on transactional tables. A single table insert is either committed in full or not committed, and the results of the insert operation are not visible to other query operations until the operation is committed. For a single table, the inserts are ordered, so if Impala does not see a committed insert, it will not see any insert committed after it.
For insert-only transactional tables, you can perform the following statements: CREATE TABLE, DROP TABLE, TRUNCATE, INSERT, SELECT.
All transactions in Impala automatically commit at the end of the statement. Currently, Impala does not support multi-statement transactions.
Insert-only tables must be managed tables based on file formats such as Parquet, Avro, and text. A minimal example follows this note.
Note: Impala does not support changing the transactional properties of tables. For example, you cannot alter a transactional table to a non-transactional table.
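A minimal sketch of single-statement transactional behavior on an insert-only table (names are illustrative):

CREATE TABLE orders (id BIGINT, total DECIMAL(10,2));   -- managed tables default to insert-only transactional in CDP
INSERT INTO orders VALUES (1, 9.99), (2, 19.99);        -- commits automatically at the end of the statement
SELECT SUM(total) FROM orders;                          -- sees only committed inserts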
