Elastic MapReduce EMR Development Guide
Elastic MapReduce
EMR Development Guide
Product Documentation

Copyright Notice
©2013-2019 Tencent Cloud. All rights reserved.
Copyright in this document is exclusively owned by Tencent Cloud. You must not reproduce, modify, copy, or distribute in any way, in whole or in part, the contents of this document without Tencent Cloud's prior written consent.

Trademark Notice
All trademarks associated with Tencent Cloud and its services are owned by Tencent Cloud Computing (Beijing) Company Limited and its affiliated companies. Trademarks of third parties referred to in this document are owned by their respective proprietors.

Service Statement
This document is intended to provide users with general information about Tencent Cloud's products and services only and does not form part of Tencent Cloud's terms and conditions. Tencent Cloud's products or services are subject to change. Specific products and services and the standards applicable to them are exclusively provided for in Tencent Cloud's applicable terms and conditions.

Contents

EMR Development Guide
    Hadoop Development Guide
        HDFS Common Operations
        Submitting MapReduce Tasks
        Automatically Adding Task Nodes Without Assigning ApplicationMasters
        YARN Task Queue Management
        Practices on YARN Label Scheduling
        Hadoop Best Practices
        Using API to Analyze Data in HDFS and COS
        Dumping YARN Job Logs to COS
    Spark Development Guide
        Spark Environment Info
        Using Spark to Analyze Data in COS
        Using Spark Python to Analyze Data in COS
        SparkSQL Tutorial
        Integrating Spark Streaming with Ckafka
        Practices on Dynamic Scheduling of Spark Resources
        Kafka and Spark Versions in Each EMR Version
        Spark Dependencies in Each EMR Version
    HBase Development Guide
        Using HBase Through API
        Using HBase with Thrift
        Spark on HBase
        MapReduce on HBase
    Phoenix on HBase Development Guide
        Phoenix Client Environment Preparation
        Phoenix Client Tutorial
        Phoenix JDBC Tutorial
        Phoenix Best Practices
    Hive Development Guide
        Hive's Support for LLAP
        Basic Hive Operations
        Connecting to Hive Through Java
        Connecting to Hive Through Python
        Mapping HBase Tables
        Hive Best Practices
        Creating Databases Based on COS
        Practices on Loading JSON Data to Hive
        Hive Metadata Management
    Presto Development Guide
        Presto Web UI
        Connector
        Analyzing Data in COS
    Sqoop Development Guide
        Import/Export of Relational Database and HDFS
        Incremental Data Import into HDFS
        Importing and Exporting Data Between Hive and TencentDB for MySQL
    Hue Development Guide
    Flume Development Guide
        Flume Overview
        Storing Kafka Data in Hive Through Flume
        Storing Kafka Data in HDFS or COS Through Flume
    Kerberos Development Guide
        Kerberos HA Support
        Kerberos Overview
        Kerberos Use Instructions
        Accessing Hadoop Secure Cluster
        Sample Connection from Hadoop to Kerberos
    Knox Development Guide
        Knox Development Guide
        Integrating Knox with Tez
            Integrating Knox with Tez v0.8.5
            Integrating Knox with Tez v0.10.0
    Alluxio Development Guide
        Alluxio Development Documentation
        Common Alluxio Commands
        Mounting File System to Unified Alluxio File System
        Using Alluxio in Tencent Cloud
        Support for COS Transparent-URI
    Kylin Development Guide
        Kylin Overview
    Livy Development Guide
        Livy Overview
    Zeppelin Development Guide
        Zeppelin Overview
    Hudi Development Guide
        Hudi Overview
    Superset Development Guide
        Superset Overview
    Impala Development Guide
        Impala Overview
        Impala OPS Manual
        Analyzing Data on COS/CHDFS
    ClickHouse Development Guide
        ClickHouse Overview
        ClickHouse Usage
            ClickHouse SQL Syntax
            Client Overview
            Getting Started
        ClickHouse Data Import
            COS Data Import
            HDFS Data Import
            Kafka Data Import
            MySQL Data Import
        ClickHouse OPS
            Monitoring Configuration Description
            Log Description
            Data Backup
            System Table Description
            Access Permission Control
        ClickHouse Data Migration Guide
    Druid Development Guide
        Druid Overview
        Druid Usage
            Ingesting Data from Hadoop in Batches
            Ingesting Data from Kafka in Real Time
    TensorFlow Development Guide
        TensorFlow Overview
        TensorFlowOnSpark Overview
    Jupyter Development Guide
        Jupyter Notebook Overview
    Kudu Development Guide
    EMR TianQiong Introduction
        Proprietary Spark Materialized Views of EMR TianQiong
        Proprietary Kona JDK of EMR TianQiong
    Ranger Development Guide
        Ranger Overview
        Ranger User Guide
            Integrating HDFS with Ranger
            Integrating YARN with Ranger
            Integrating HBase with Ranger
            Integrating Presto with Ranger
    Doris Development Guide
        Doris Overview
        User Guide
            Creating a User
            Creating a Data Table and Importing Data
            Querying Data
    Kafka Development Guide
        Kafka Overview
        Use Cases
        Kafka Usage
    Iceberg Development Guide

Hadoop Development Guide

HDFS Common Operations
Last updated: 2020-12-21 17:21:45

The Tencent Cloud Elastic MapReduce (EMR) Hadoop service has integrated COS. If you enable COS when purchasing an EMR cluster, you can manipulate the data stored in COS using common Hadoop commands as shown below:

#cat data
hadoop fs -cat /usr/hive/warehouse/hivewithhdfs.db/record/data.txt
#Modify the directory or file permission
hadoop fs -chmod -R 777 /usr
#Change the file or directory owner
hadoop fs -chown -R root:root /usr
#Create a folder
hadoop fs -mkdir <paths>
#Send a local file to HDFS
hadoop fs -put <localsrc> ... <dst>
#Copy a local file to HDFS
hadoop fs -copyFromLocal <localsrc> URI
#View the storage usage of a file or directory
hadoop fs -du URI [URI ...]
#Delete a file
hadoop fs -rm URI [URI ...]
#Set the number of replicas of a directory or file
hadoop fs -setrep [-R] [-w] REP PATH [PATH ...]
#Check for bad blocks of a cluster file
hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]

For more HDFS commands, see the community documentation.

If your cluster is a high-availability cluster (dual-NameNode), you can see which NameNode is active by running the following commands:

#nn1 is the ID of a NameNode; the IDs are usually nn1 and nn2
hdfs haadmin -getServiceState nn1
#View the current cluster report
hdfs dfsadmin -report
#Make the NameNode exit safe mode
hdfs dfsadmin -safemode leave

Submitting MapReduce Tasks
Last updated: 2020-12-15 15:24:37

This operation guide describes:
1. How to perform basic MapReduce job operations in a command-line interface.
2. How to allow MapReduce jobs to access the data stored in COS.
For more information, please see the community documentation.

The job submitted is a wordcount job. To count the words in a file, you need to upload the file to be counted in advance.
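If the file to be counted is kept in COS rather than HDFS, it can help to confirm up front that Hadoop can see the bucket. The commands below are a minimal sketch, not part of the original guide: they assume COS was enabled for the cluster and that the COS integration exposes the cosn:// scheme, and the bucket name examplebucket-125xxxxxxx and the input path are hypothetical placeholders for your own values.

#List the bucket root to confirm that COS is reachable from Hadoop (hypothetical bucket name)
hadoop fs -ls cosn://examplebucket-125xxxxxxx/
#Upload the file to be counted to a hypothetical input path in COS
hadoop fs -put README.txt cosn://examplebucket-125xxxxxxx/wordcount/input/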
The path of relevant software such as Hadoop is /usr/local/service/ . The relevant logs are stored in /data/emr .

1. Preparations for Development
You need to create a bucket in COS for this job.
Confirm that you have activated Tencent Cloud and created an EMR cluster. When creating the EMR cluster, select Enable COS on the basic configuration page and then enter your SecretId and SecretKey. You can find your SecretId and SecretKey at API Key Management. If you don't have a key, click Create Key to create one.

2. Logging in to an EMR Server
You need to log in to a server in the EMR cluster before performing the relevant operations. A master node is recommended for this step. EMR is built on CVM instances running Linux; therefore, using EMR in command-line mode requires logging in to a CVM instance.
After creating the EMR cluster, select Elastic MapReduce in the console. In Cluster Resource > Resource Management, click Master Node to select the resource ID of the master node. Then, you can enter the CVM console and find the instance of the EMR cluster.
For more information on how to log in to a CVM instance, please see Logging in to a Linux Instance. Here, you can choose to log in with WebShell. Click Login on the right of the desired CVM instance to enter the login page. The default username is root, and the password is the one you set when creating the EMR cluster. Once your credentials have been validated, you can access the EMR command-line interface.
All Hadoop operations are performed under the Hadoop user. The root user is logged in by default when you log in to the EMR server, so you need to switch to the Hadoop user. Run the following commands to switch users and go to the Hadoop folder:

[root@172 ~]# su hadoop
[hadoop@172 root]$ cd /usr/local/service/hadoop
[hadoop@172 hadoop]$

3. Data Preparations
You need to prepare a text file for counting. There are two ways to do so: storing the data in an HDFS cluster or storing it in COS.
The first step is to upload a local file to the CVM instance of the EMR cluster with the scp or sftp service. Run the following command on your local command line:

scp $localfile root@<public IP address>:$remotefolder

Here, $localfile is the path and name of your local file, root is the CVM instance username, and $remotefolder is the path where you want to store the file in the CVM instance. You can look up the public IP address in the node information in the EMR or CVM console.
After the upload succeeds, you can check whether the file is in the corresponding folder on the command line of the EMR cluster:

[hadoop@172 hadoop]$ ls -l

Storing data in HDFS
After uploading the data to the CVM instance, you can copy it to the HDFS cluster. The README.txt file in the /usr/local/service/hadoop directory is used here as an example. Copy the file to the Hadoop cluster by running the following command:

[hadoop@172 hadoop]$ hadoop fs -put README.txt /user/hadoop/

After the copy is completed, run the following command to view the copied file:

[hadoop@172 hadoop]$ hadoop fs -ls /user/hadoop

Output:
-rw-r--r-- 3 hadoop supergroup 1366 2018-06-28 11:39 /user/hadoop/README.txt
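With README.txt copied into HDFS, the wordcount job itself can be submitted from the same directory. The commands below are a minimal sketch rather than the guide's own example: they assume the stock Hadoop examples jar shipped under /usr/local/service/hadoop (the exact jar file name varies by Hadoop version), and the output directory /user/hadoop/output is a hypothetical path that must not exist before the job runs.

#Submit the wordcount example job; arguments are the input file and a not-yet-existing output directory
[hadoop@172 hadoop]$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/README.txt /user/hadoop/output
#After the job finishes, the word counts are written to part files under the output directory
[hadoop@172 hadoop]$ hadoop fs -cat /user/hadoop/output/part-r-00000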