AWS Glue Studio User Guide


AWS Glue Studio: User Guide

Copyright © Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents

What is AWS Glue Studio?
  Features of AWS Glue Studio
    Visual job editor
    Job script code editor
    Job performance dashboard
    Support for dataset partitioning
  When should I use AWS Glue Studio?
  Accessing AWS Glue Studio
  Pricing for AWS Glue Studio
Setting up
  Sign up for AWS
  Create an IAM administrator user
  Signing in as an IAM user
  IAM permissions needed for the AWS Glue Studio user
    AWS Glue service permissions
    Amazon CloudWatch permissions
  Job-related permissions
    Data source and data target permissions
    Permissions required for deleting jobs
    AWS Key Management Service permissions
    Additional permissions when using connectors
  Set up IAM permissions for AWS Glue Studio
  Configuring a VPC for your ETL job
  Populate the AWS Glue Data Catalog
Tutorial: Getting started
  Prerequisites
  Step 1: Start the job creation process
  Step 2: Edit the data source node in the job diagram
  Step 3: Edit the transform node of the job
  Step 4: Edit the data target node of the job
  Step 5: View the job script
  Step 6: Specify the job details and save the job
  Step 7: Run the job
  Next steps
Creating jobs
  Start the job creation process
  Create jobs that use a connector
  Next steps for creating a job in AWS Glue Studio
Editing jobs
  Accessing the job diagram editor
  Job editor features
    Using schema previews in the visual job editor
    Using data previews in the visual job editor
    Restrictions when using data previews
  Editing the data source node
    Using Data Catalog tables for the data source
    Using a connector for the data source
    Using files in Amazon S3 for the data source
    Using a streaming data source
  Editing the data transform node
    Overview of mappings and transforms
    Using ApplyMapping to remap data property keys
    Using SelectFields to remove most data property keys
    Using DropFields to keep most data property keys
    Renaming a field in the dataset
    Using Spigot to sample your dataset
    Joining datasets
    Using SplitFields to split a dataset into two
    Overview of SelectFromCollection transform
    Using SelectFromCollection to choose which dataset to keep
    Filtering keys within a dataset
    Find and fill missing values in a dataset
    Using a SQL query to transform data
    Creating a custom transformation
  Configuring data target nodes
    Overview of data target options
    Editing the data target node
  Editing or uploading a job script
    Creating and editing Scala scripts in AWS Glue Studio
    Creating and editing Python shell jobs in AWS Glue Studio ...