Deployment Guide for

AWS Marketplace

Version: 7.1 Doc Build Date: 5/26/2020 Copyright © Trifacta Inc. 2020 - All Rights Reserved. CONFIDENTIAL

These materials (the “Documentation”) are the confidential and proprietary information of Trifacta Inc. and may not be reproduced, modified, or distributed without the prior written permission of Trifacta Inc.

EXCEPT AS OTHERWISE PROVIDED IN AN EXPRESS WRITTEN AGREEMENT, TRIFACTA INC. PROVIDES THIS DOCUMENTATION AS-IS AND WITHOUT WARRANTY AND TRIFACTA INC. DISCLAIMS ALL EXPRESS AND IMPLIED WARRANTIES TO THE EXTENT PERMITTED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE AND UNDER NO CIRCUMSTANCES WILL TRIFACTA INC. BE LIABLE FOR ANY AMOUNT GREATER THAN ONE HUNDRED DOLLARS ($100) BASED ON ANY USE OF THE DOCUMENTATION.

For third-party license information, please select About Trifacta from the User menu.

Contents:

1. Quick Start
   1.1 Install from AWS Marketplace with EMR
   1.2 Upgrade for AWS Marketplace with EMR
2. Deployment
   2.1 Deployment Overview for AWS
   2.2 Security Overview for AWS Marketplace
   2.3 Deployment Maintenance
   2.4 Deployment Process
3. Configure
   3.1 Configure for AWS
      3.1.1 Configure for EC2 Role-Based Authentication
      3.1.2 Enable S3 Access
         3.1.2.1 Create Redshift Connections
      3.1.3 Configure for EMR
4. Contact Support
5. Legal
   5.1 Third-Party License Information

Quick Start

Install from AWS Marketplace with EMR

Contents:

Scenario Description
Install
Pre-requisites
Internet access
Product Limitations
Install
Desktop Requirements
Note about deleting the CloudFormation stack
Verify
Start and Stop the Platform
Verify Operations
Troubleshooting
SELinux
CloudFormation stack fails to deploy with EMR error
Upgrade
Documentation
Related Topics

This documentation applies to installation from a supported Marketplace. Please use the installation instructions provided with your deployment.

If you are installing or upgrading a Marketplace deployment, please use the available PDF content. You must use the install and configuration PDF available through the Marketplace listing.

This step-by-step guide walks through the process of deploying an instance of Trifacta Wrangler Enterprise through the AWS Marketplace, using a CloudFormation template. CloudFormation templates are designed to configure and launch the product as well as the related AWS compute, network, storage, and related services that support the platform. The size of these resources can be configured based on the expected workload of the platform.

Scenario Description

CloudFormation templates enable you to install Trifacta® Wrangler Enterprise with a minimal amount of effort.

After install, customizations can be applied by tweaking the resources that were created by the CloudFormation process. If you have additional requirements or a complex environment, please contact Trifacta Support for assistance with your solution.

Install

The CloudFormation template creates a complete working instance of Trifacta Wrangler Enterprise, including the following:

VPC and all required networking infrastructure (if deploying to a new VPC)
EC2 instance with all supporting policies/roles
S3 bucket
EMR cluster
Configurable autoscaling instance groups
All supporting policies/roles

Pre-requisites

If you are integrating the Trifacta platform with an EMR cluster, you must acquire a Trifacta license first. Additional configuration is required. For more information, please contact [email protected].

Before you begin:

1. Read: Please read this entire document before you begin.
2. EULA: Before you begin, please review the End-User License Agreement. See https://docs.trifacta.com/display/PUB/End-User+License+Agreement+-+Trifacta+Wrangler+Enterprise.
3. Trifacta license file: If you have not done so already, please acquire a Trifacta license file from your Trifacta representative.

Internet access

From AWS, the Trifacta platform requires Internet access for the following services:

NOTE: Depending on your AWS deployment, some of these services may not be required.

AWS S3
Key Management System [KMS] (if sse-kms server-side encryption is enabled)
Secure Token Service [STS] (if temporary credential provider is used)
EMR (if integration with EMR cluster is enabled)

NOTE: If the Trifacta platform is hosted in a VPC where Internet access is restricted, access to S3, KMS and STS services must be provided by creating a VPC endpoint. If the platform is accessing an EMR cluster, a proxy server can be configured to provide access to the AWS ElasticMapReduce regional endpoint.
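As an illustration, a gateway endpoint for S3 and an interface endpoint for KMS can be created with the AWS CLI. The VPC, route table, subnet, and region values below are placeholders, and your endpoint requirements may differ:

aws ec2 create-vpc-endpoint --vpc-id vpc-0abc1234 --service-name com.amazonaws.us-east-1.s3 --route-table-ids rtb-0abc1234

aws ec2 create-vpc-endpoint --vpc-id vpc-0abc1234 --vpc-endpoint-type Interface --service-name com.amazonaws.us-east-1.kms --subnet-ids subnet-0abc1234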

Product Limitations

The EC2 instance, S3 buckets, and any connected Redshift databases must be located in the same Amazon region. Cross-region integrations are not supported at this time.
No support for Hive integration.
No support for secure impersonation or Kerberos.
No support for high availability and failover.
Job cancellation is not supported on EMR.
When publishing single files to S3, you cannot apply an append publishing action.
Spark 2.4 only.

Install

Desktop Requirements

All desktop users of the platform must have the latest stable version of Google Chrome installed on their desktops. All desktop users must be able to connect to the EC2 instance through the enterprise infrastructure.

Steps:

1. In the Marketplace listing, select your deployment option:
   1. Deploy into a new VPC: Creates a new VPC and all the required networking infrastructure to run the Trifacta instance and the EMR cluster.
   2. Deploy into an existing VPC: Creates the Trifacta instance and EMR cluster in a VPC of your choice.
2. Choose a Template: The template path is automatically populated for you.
3. These parameters are common to both deployment options:
   1. Stack Name: The display name of the stack, which is used in the names of resources created by the stack and as an identifier for the stack.

NOTE: Each instance of the Trifacta platform must have a separate name.

2. Trifacta Server: Please select the appropriate instance depending on the number of users and data volumes of your environment.
3. Key Pair: This SSH key pair is used to access the Trifacta instance and the EMR cluster instances.
4. Allowed HTTP Source: This range of addresses is permitted access to the Trifacta instance on ports 80, 443, and 3005.

   1. Port numbers 80 and 443 do not have any services by default, but you may modify the Trifacta configuration to enable access via these ports.
5. Allowed SSH Source: This range of addresses is permitted access to port 22 on the Trifacta instance.
6. EMR Cluster Node Configuration: Allows you to customize the configuration of the deployed EMR nodes.
   1. Reasonable values are used as defaults.
   2. If you do customize these values, you should upsize; avoid downsizing these values.
7. EMR Cluster Autoscaling Configuration: Allows you to customize the autoscaling settings used by the EMR cluster.
   1. Reasonable values are used as defaults.
4. If you are launching into a new VPC, skip this step. If you are launching into an existing VPC, you must enter these additional parameters:

1. VPC: The existing VPC into which you want to launch the resources.
2. Trifacta Subnet: The subnet into which you want to launch your Trifacta instance.

   1. Please verify that this subnet is part of the VPC you selected.
   2. This subnet must be able to communicate with the EMR Subnet (see below).
3. EMR Subnet: The private subnet into which you want to launch your EMR cluster.
   1. Please verify that this subnet is part of the VPC you selected.
   2. Amazon EMR only supports launching clusters in private subnets.
   3. This subnet must be reachable from your Trifacta instance.
   4. The EMR cluster requires an S3 endpoint available in this subnet.
5. Click Next.
6. Options: Specify any options as needed for your environment. None of these is required for installation.
7. Review: Review your installation and configured options.
   1. Select the checkbox at the end of the page.
   2. To launch the stack, click Create Stack.
8. Please wait while the stack creates all required resources.
9. In the Stacks list, select the name of your application. Click the Outputs tab and collect the following information. Instructions on how to use this information are provided later.

Trifacta URL: The URL and port number to which to connect to the Trifacta application. Users must connect to this address and port number for access. By default, the port is set to 3005. The access port can be moved to 80 or 443 if desired. Please contact us for more details.

Trifacta Bucket: The address of the default S3 bucket. This value must be applied through the application after it has been deployed.

Trifacta Instance Id: The identifier for the instance of the platform. This value is the default password for the admin account.

NOTE: You must change this password on the first login to the application.

10. After the Trifacta instance has been created, it is licensed for use for a couple of days only. This temporary license permits you to finish configuration from within the Trifacta application, where you can also upload your permanent license.
11. Please wait while the instance finishes initialization and becomes available for use. When it is ready, navigate to the Trifacta application.
12. When the login screen appears, enter the following:
    1. Username: [email protected]
    2. Password: (the TrifactaInstanceId value)

NOTE: If the password does not work, please verify that you have not copied extra spaces in the password.

13. From the menu bar, select User menu > Admin console > Admin settings.
14. In the Admin Settings page, you can configure many aspects of the platform, including user management tasks, and perform restarts to apply the changes.
15. Add the S3 bucket that was automatically created to store Trifacta metadata and EMR content. Search for:

"aws.s3.bucket.name"

    1. Update the value with the Trifacta Bucket value provided when you created the stack in AWS.
16. Enable the Run in EMR option within the platform. Search for:

"webapp.runinEMR"

    1. Select the checkbox to enable it.
17. Click Save underneath the Platform Settings section.
18. The platform restarts, which can take several minutes.
19. In the Admin Settings page, locate the External Service Settings section.

1. AWS EMR Cluster ID: Paste the value for the EMR Cluster ID for the cluster to which the platform is connecting.

1. Verify that there are no extra spaces in any copied value.

2. AWS Region: Enter the region code where you have deployed the CloudFormation stack.
   1. Example: us-east-1
   2. For a list of available regions, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions.
3. Resource Bucket: You may use the already created Trifacta Bucket.
   1. Verify that there are no extra spaces in any copied value.
4. Resource Path: You should use something like EMRLOGS.
20. Click Save underneath the External Service Settings section.
21. In the Admin Settings page, scroll down to the Upload License section.
22. Click Upload License and upload your license.
    1. When the server comes online for the first time, it is assigned a temporary license. Please be sure to upload your license within 24 hours.
23. Updating the license does not require a restart.

Note about deleting the CloudFormation stack

If you must delete the CloudFormation stack, please be aware of the following.

1. The S3 bucket that was created for the stack is not removed. If you want to delete it, you must empty it first and then delete it.
2. Any EMR security groups created for the stack cannot be deleted, due to circular references. The stack deletion process informs you of the security groups that it failed to delete. To complete the deletion:
   1. Remove all rules from the security groups.
   2. Delete the security groups manually.
   3. Re-run the stack deletion, which should complete successfully.
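As an illustration, the bucket cleanup and the re-run of the deletion can be performed with the AWS CLI; the bucket and stack names below are placeholders:

aws s3 rm s3://my-trifacta-bucket --recursive

aws s3 rb s3://my-trifacta-bucket

aws cloudformation delete-stack --stack-name my-trifacta-stack

aws cloudformation wait stack-delete-complete --stack-name my-trifacta-stack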

Verify

Start and Stop the Platform

Use the following commands to start, stop, and restart the platform from the command line.

Start:

sudo service trifacta start

Stop:

sudo service trifacta stop

Restart:

sudo service trifacta restart

Verify Operations

After you have installed or made changes to the platform, you should verify end-to-end operations.

NOTE: The Trifacta® platform is not operational until it is connected to a supported backend datastore.

Steps:

1. Login to the application as an administrator. See Login.
2. Through the Admin Settings page, run Tricheck, which performs tests on the Trifacta node and any connected cluster. See Admin Settings Page.
3. In the application menu bar, click Library. Click Import Dataset. Select your backend datastore.

4. Navigate your datastore directory structure to locate a small CSV or JSON file.

5. Select the file. In the right panel, click Create and Transform.
   1. Troubleshooting: If the steps so far work, then you have read access to the datastore from the platform. If not, please check permissions for the Trifacta user and its access to the appropriate directories.
   2. See Import Data Page.
6. In the Transformer page, some steps have already been added to your recipe, so you can run the job right away. Click Run Job.
   1. See Transformer Page.

7. In the Run Job page:
   1. For Running Environment, some of these options may not be available. Choose according to the running environment you wish to test.
      1. Photon: Runs the job on the Photon running environment hosted on the Trifacta node. This method of job execution does not utilize any integrated cluster.
      2. Spark: Runs the job on Spark on the integrated cluster.
      3. Databricks: If the platform is integrated with an Azure Databricks cluster, you can test job execution on the cluster.
   2. Select CSV and JSON output.
   3. Select the Profile Results checkbox.
   4. Troubleshooting: At this point, you are able to initiate a job for execution on the selected running environment. Later, you can verify operations by running the same job on other available environments.
   5. See Run Job Page.

8. When the job completes, you should see a success message in the Jobs tab of the Flow View page.
   1. Troubleshooting: Either the Transform job or the Profiling job may break. To localize the problem, mouse over the job listing in the Jobs page. Try re-running the job by deselecting the broken job type or by running the job in a different environment. You can also download the log files to try to identify the problem. See Jobs Page.

9. Click View Results in the Jobs page. In the Profile tab of the Job Details page, you can see a visual profile of the generated results.
   1. See Job Details Page.
10. In the Output Destinations tab, click the CSV and JSON links to download the results to your local desktop. See Import Data Page.

11. Load these results into a local application to verify that the content looks correct.

Troubleshooting

SELinux

By default, Trifacta Wrangler Enterprise is installed on a server with SELinux enabled. Security-enhanced Linux (SELinux) provides a set of security features for, among other things, managing access controls.

Tip: The following may be applied to other deployments of the Trifacta platform on servers where SELinux has been enabled.

In some cases, SELinux can interfere with normal operations of platform software. If you are experiencing connectivity problems related to SELinux, you can do either one of the following:

1. Disable SELinux on the server. For more information, please see the CentOS documentation.
2. Apply the following commands on the server, as root:
   1. Open ports on the server for listening.
      1. By default, the Trifacta application listens on port 3005. The following opens that port when SELinux is enabled:

semanage port -a -t http_port_t -p tcp 3005

      2. Repeat the above step for any other ports that you wish to open on the server.
   2. Permit nginx, the proxy on the Trifacta node, to open websockets:

setsebool -P httpd_can_network_connect 1
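To confirm the changes, you can check the SELinux mode, the ports registered for http_port_t, and the boolean value using standard SELinux utilities:

getenforce

semanage port -l | grep http_port_t

getsebool httpd_can_network_connect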

CloudFormation stack fails to deploy with EMR error

If the stack fails to deploy, you may see an error like the following:

ElasticMapReduce Cluster with Id j-2CFQJ4K8HABCD, is in state TERMINATED_WITH_ERRORS and failed to stabilize due to the following reason: {Code: VALIDATION_ERROR,Message: You cannot specify a ServiceAccessSecurityGroup for a cluster launched in public subnet.}

To solve this issue, select a private subnet into which the EMR cluster can launch.

For more information, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-vpc-subnet.html.
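One way to check whether a candidate subnet is private is to inspect its route table for a route to an internet gateway; the subnet ID below is a placeholder:

aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=subnet-0abc1234 --query "RouteTables[].Routes[].GatewayId"

If the output includes an igw- identifier, the subnet is public and is not suitable for the EMR cluster.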

Upgrade

For more information, see Upgrade for AWS Marketplace with EMR.

Documentation

You can access complete product documentation in online and PDF format. After the platform has been installed, login to the Trifacta application. Select Help menu > Documentation.

Upgrade for AWS Marketplace with EMR

For more information on upgrading Trifacta® Wrangler Enterprise, please contact Trifacta Support.

Deployment

This section provides an overview of deployment architecture and options for Trifacta® Wrangler Enterprise.

Deployment Overview for AWS

Contents:

Overview
Key Capabilities
Typical Deployment
Architecture Diagrams
AWS Network Architecture
Data Flow and Integrations
Planning Guidance
Deployment Options
Sizing
Estimated Costs
Expected Time to Deploy
Security Considerations
Before You Begin
AWS Requirements
General Requirements
Deployment Limitations
Next Steps

Overview

Trifacta® Wrangler Enterprise on AWS is designed to handle data wrangling workloads that need to support data at scale and a larger number of end-users. With native support for AWS EMR, Trifacta Wrangler Enterprise provides organizations with a common platform for analyst teams to more efficiently explore and prepare diverse data while maintaining centralized management of security, governance, and operationalization. Data preparation is the most time-consuming and inefficient part of any data project, taking up over 80% of the time and resources. Trifacta Wrangler Enterprise enables data engineers and analysts to more efficiently explore and prepare the diverse data of today by utilizing machine learning to provide a breakthrough user experience, workflow, and architecture.

Available through AWS Marketplace, Trifacta Wrangler Enterprise utilizes AWS CloudFormation templates to launch an instance of the Trifacta platform, including related AWS resources from within a customer's account.

Key Capabilities

Trifacta Wrangler Enterprise on AWS supports the following:

Scalable and secure platform to address any data preparation initiative
Integration with a wide variety of AWS and non-AWS datasources
Optimized for execution in the AWS cloud
Governance and lineage
Centralized administration and orchestration

Typical Deployment

Trifacta® Wrangler Enterprise can be deployed to Amazon Web Services (AWS) using CloudFormation templates available through the AWS Marketplace. The CloudFormation template deployed from the AWS Marketplace creates a complete working instance of Trifacta Wrangler Enterprise. The typical deployment scenario for Trifacta Wrangler Enterprise creates all resources required so that you can get up and running with minimal effort.

Tip: You can choose from a variety of EC2 instance sizes through the CloudFormation template. For more information on the sizing, see the Install section.

For a typical deployment, these resources are:

1 x VPC with the required subnets and configuration
1 x EC2 instance for the software and databases
1 x S3 bucket
1 x EMR cluster:
   1 x EMR Master node
   2 x EMR Core nodes (autoscaling, configurable)
   1 x EMR Task node (autoscaling, configurable)

Typical EBS storage

For a default r4.4xlarge EC2 instance, the size of the EBS volume is the following in GB:

Default: 200 GB
Minimum: 100 GB

Typical workloads

The above configuration is suitable for most customer workloads. If you need to process larger datasets, you may wish to use more EMR nodes on the cluster.

For more information on EMR sizing guidelines, see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html.

Architecture Diagrams

AWS Network Architecture

The following diagram illustrates the technical fit of the Trifacta deployment into AWS and its resources.

Figure: AWS network and resources

Data Flow and Integrations

The following diagram illustrates general data flow from sources through AWS resources and outputs in AWS and other systems.

Figure: Data flow and integrations in AWS ecosystem

Import: Read in data from local files, JDBC sources, and Amazon-native resources such as S3, Redshift, and AWS Glue.

NOTE: Source data is never modified.

Transform: Imported data is transformed according to user-defined recipes. This process is managed by the EC2 instance.
Amazon EMR: Created during deployment, the Amazon EMR cluster can be used for running larger jobs across more complex datasets.
Default Bucket: Created during deployment, this bucket is used by default for uploads, sampling, and generated results.
Generate Results: The results of your jobs can be written to S3 or Redshift and optionally published to an external, integrated datastore.

Planning Guidance

Deployment Options

Trifacta Wrangler Enterprise for AWS supports the following general deployment options:

Deployment: Where the software is installed. Options: EC2 instance.

Databases: Supported database systems. Options: PostgreSQL hosted locally on the Trifacta instance.

Base Storage Layer: The backend storage system for the product. Options: S3.

Authentication: The supported methods of authentication. Options: IAM roles.

Data Sources: Sources from which data can be read for job execution. Options: S3, Redshift, Glue, Snowflake, and many more.

Running Environment: Environment for running jobs. Options: EMR, Photon.

Sizing

Trifacta Wrangler Enterprise can be deployed using a variety of types of EC2 instance, as well as EBS storage volume sizes. The CloudFormation template is configured with appropriate default options for the following:

EC2 instance type
EBS volume type
Instance size selection for managed AWS services
Size of EMR cluster

Tip: By default, the CloudFormation template creates an EMR cluster with one master node and two worker nodes.

Estimated Costs

The following are estimated monthly costs for deploying required deployment assets in the us-east-1 AWS region for an average-sized deployment:

NOTE: Changes to the count or size of these assets can change the monthly costs associated with your deployment.

1 x Trifacta Wrangler Enterprise instance on r4.4xlarge node: $776.72
1 x 200 GB EBS volume for Trifacta Wrangler Enterprise instance: $20.00
1 x EMR Master node (m4.xlarge): $146.00
2 x EMR Core nodes (m4.2xlarge): $584.00
1 x EMR Task node (m4.2xlarge): $292.00
3 x EBS volumes for EMR nodes: $40.00
TOTAL: $1,858.72

Expected Time to Deploy

Deploying the CloudFormation template and completing the required configuration steps should take less than 1 hour.

Security Considerations

Security is covered later in this document. For more information, see Security Overview for AWS Marketplace.

Before You Begin

AWS Requirements

To deploy Trifacta Wrangler Enterprise from the AWS Marketplace, you must have a valid AWS account with the capabilities to create the resources deployed by the template. By default, CloudFormation uses permissions based on your user credentials to create, modify, or delete resources in the stack. Optionally, when configuring stack parameters, you can choose an IAM role to explicitly define how CloudFormation can manage stack resources. Examples of resources deployed by the template include:

VPC
S3
EC2
EMR
IAM

You should also have a basic understanding of the purpose and function of the following:

CloudFormation. For more information, see https://aws.amazon.com/cloudformation/resources/templates/.
Amazon EMR. For more information, see https://aws.amazon.com/emr/.
EC2
VPC
Security Groups
IAM Roles and Policies

General Requirements

Technical requirements: The CloudFormation template creates all resources required for successful deployment of the product. No additional resources are needed.

User requirements: The deploying user must have appropriate AWS permissions to deploy the CloudFormation template.

Additional technical requirements are covered in the Install section.

Deployment Limitations

High availability deployments using the CloudFormation template are not supported.
The product can be deployed to a single AZ. Multi-AZ or multi-region deployments are not supported.

Next Steps

You should read the other sections of the deployment overview:

Security Overview for AWS Marketplace:
   Authentication (IAM roles and policies)
   Sensitive customer data
Deployment Maintenance:
   Health checks and fault monitoring
   Patches, updates, and upgrades
   Routine and emergency maintenance
   Backup and recovery
Deployment Process:
   Install overview
   Test/Verify

Security Overview for AWS Marketplace

Contents:

Root Access
Supported Methods of Authentication
Security-Related Resources
IAM roles
IAM policies
EC2 Security Groups
VPC NACL
Secure Access
SSL
Client Security
Sensitive Customer Data
Recommended data security practices

This section covers security considerations on deployment of Trifacta® Wrangler Enterprise through the AWS Marketplace.

Root Access

AWS root access: Not required for any step of deploying the product.
EC2 instance root access: After deploying, you must have sudo access to the Trifacta node. Root access enables required command-line operations, such as starting and stopping services.

Supported Methods of Authentication

By default, the product is configured to use the IAM policies and roles created by the CloudFormation template. The CloudFormation template is peer-reviewed and tested internally to verify that it complies with the principle of least access. Customers should verify that any additional customizations do not over-extend these permissions.

NOTE: The default IAM policies and roles follow the principle of least access. If needed, these objects can be modified to grant access to additional resources, such as additional S3 buckets. Additional grants should follow the principle of least access.

Security-Related Resources

NOTE: The defaults of the CloudFormation template do not grant public access to any resources.

The following types of security objects are automatically created through the CloudFormation template:

IAM roles and policies
EC2 security groups
VPC network ACL

NOTE: When the CloudFormation stack is deployed, these resource names are customized to avoid problems with duplicate names. In the deployment guide, the resources are named as they appear in the CloudFormation template for easier location.

IAM roles

Below is a list of the roles created by the CloudFormation template and a brief description of each role's purpose.

Tip: Some of the default AWS EMR roles and policies are not used because they allow much broader access than necessary. Similar policies are created, following the principle of least access. Your Trifacta deployment has access to only the required resources in your account.

TrifactaNodeRole: The Trifacta instance role is used to provide access to the automatically created S3 bucket. Policies in template: TrifactaBucketAccess, TrifactaNodeEmrAccess.

TrifactaEmrAutoscalingRole: Automatic scaling in Amazon EMR requires an IAM role with permissions to add and terminate instances when scaling activities are triggered. Policies in template: arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforAutoScalingRole, or if deployed on GovCloud: arn:aws-us-gov:iam::aws:policy/service-role/AmazonElasticMapReduceforAutoScalingRole. Additional information: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-automatic-scaling.html and https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles.html.

TrifactaEmrEc2Role: This role is attached to EC2 instances within the EMR cluster and allows the EMR service to access the data bucket. Policies in template: TrifactaEmrEc2RolePolicy. Additional information: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-for-ec2.html.

TrifactaEmrServiceRole: This role enables Amazon EMR to call other AWS services on your behalf when provisioning resources and performing service-level actions. This role is required for all clusters. Policies in template: TrifactaEmrServiceRolePolicy. Additional information: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role.html.

IAM policies

TrifactaBucketAccess: This policy allows access to the data bucket from the Trifacta instance. Roles used in: TrifactaNodeRole.

TrifactaEmrEc2RolePolicy: This policy allows access to the data bucket from the EMR EC2 instances. Roles used in: TrifactaEmrEc2Role. Additional information: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-for-ec2.html.

TrifactaEmrServiceRolePolicy: This policy allows access to the data bucket and other AWS services as required by the EMR service. Roles used in: TrifactaEmrServiceRole. Additional information: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role.html.

TrifactaNodeEmrAccess: This policy allows access to EMR from the Trifacta instance. Roles used in: TrifactaNodeRole.

EC2 Security Groups

TrifactaInstanceSecurityGroup

The Trifacta instance security group manages access to the Trifacta instance. During the CloudFormation deployment, you are prompted for an address range in CIDR format from which to allow access.

For security reasons, do not use 0.0.0.0/0 as a source range.

TCP 22. Source: customer-provided CIDR range. This is required for SSH access to the Trifacta instance.

TCP 443. Source: customer-provided CIDR range. The Trifacta platform is commonly configured to run on this port, so a rule is provided for it by default.

TCP 80. Source: customer-provided CIDR range. The Trifacta platform is commonly configured to run on this port, so a rule is provided for it by default.

TCP 3005. Source: customer-provided CIDR range. The default port where the Trifacta platform runs.

TrifactaEMRSecurityGroup

This group is assigned to the EMR master node in addition to the default AWS-provided groups. It allows the Trifacta instance to reach the EMR cluster to get updates on job progress.

TCP 18080. Source: TrifactaInstanceSecurityGroup.

TCP 8088. Source: TrifactaInstanceSecurityGroup.

VPC NACL

NetworkAcl

This network ACL controls access to all subnets in the VPC created by the CloudFormation template.

Inbound, rule 100: ALL TRAFFIC, all protocols, all ports, source 0.0.0.0/0: ALLOW
Inbound, rule *: ALL TRAFFIC, all protocols, all ports, source 0.0.0.0/0: DENY
Outbound, rule 100: ALL TRAFFIC, all protocols, all ports, source 0.0.0.0/0: ALLOW
Outbound, rule *: ALL TRAFFIC, all protocols, all ports, source 0.0.0.0/0: DENY

Secure Access

Security features can be applied to Trifacta Wrangler Enterprise after it has been successfully deployed. Some key features are listed below.

SSL

The Trifacta platform can be configured to use SSL for connections between the client and the platform. This configuration requires the creation and deployment of an SSL certificate. For more information, see Configure Security in the Configuration Guide.

Client Security

As needed, you can enable the use of the following features in the user client:

HTTP Strict-Transport-Security headers
Secure cookies

For more information, see Configure Security in the Configuration Guide.

Sensitive Customer Data

NOTE: Trifacta Wrangler Enterprise never modifies source data.

The solution is deployed within the customer environment. The customer is free to manage data access according to enterprise requirements.

Recommended data security practices

For more information on managing customer data on S3, see Enable S3 Access.

For more information on security options for the EMR cluster, see Configure for EMR.

Deployment Maintenance

Contents:

Health Checks
Health Check Endpoint for Monitoring
On Demand Health Checks
Fault Monitoring
Software Updates
Emergency Maintenance
Routine Maintenance
Backup and Recovery
Deployment Rollback
Support
Support Policy
SLAs
Contacting Support

Trifacta® Wrangler Enterprise provides several tools and processes for managing platform health and applying software updates.

Health Checks

Health Check Endpoint for Monitoring

You can add the Trifacta deployment to your existing monitoring solution using our health check endpoint:

http://<trifacta-host>:3005/healthz
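For example, a monitoring system can poll the endpoint with curl; the hostname below is a placeholder, and a 200 response indicates the application is up:

curl -f http://trifacta.example.com:3005/healthz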

On Demand Health Checks

Periodic health checks can be performed directly through the platform.

Steps:

1. Login to the Trifacta application as an administrator.
2. Select User menu > Admin console > Admin settings.
3. At the bottom of the Admin page, click Run Tricheck.

For more information, see Admin Settings Page.

Fault Monitoring

For monitoring faults, such as AZ, instance, application or storage faults, you should utilize CloudWatch or an existing monitoring solution.

Software Updates

Please contact Trifacta Support for assistance in upgrading your Trifacta installation to ensure the best upgrade experience possible. For more information, see Contact Support.

Emergency Maintenance

Patches may be issued directly to customers in the event of a critical issue that must be resolved outside the typical upgrade cycle.

Routine Maintenance

The only required routine maintenance is to perform periodic backups of the following, based on your enterprise requirements:

Trifacta databases
Configuration files

For more information, see below.

Backup and Recovery

Backups: Application backups can be created using the included backup script. It is the customer's responsibility to create and store these backups in accordance with their enterprise policies.

You should store backups on S3 for ease of recovery.

Tip: You can add additional S3 buckets to the IAM policies attached to the Trifacta instance to allow access to the backups bucket.

Disaster Recovery: To recover from a catastrophic failure of the Trifacta node, you must deploy a new copy of the product and restore the most recent backup that was stored in a safe location.

You can use the included backup script to automate backups of your Trifacta software. Before starting the backup, you should stop the service.

/opt/trifacta/bin/setup-utils/trifacta-backup-config-and-db.sh

This script creates a backup in the /opt/trifacta-backups directory. Store and maintain these backups according to your existing backup processes.
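A minimal backup workflow, assuming the default install location and a placeholder backup bucket, might look like the following:

sudo service trifacta stop

sudo /opt/trifacta/bin/setup-utils/trifacta-backup-config-and-db.sh

aws s3 sync /opt/trifacta-backups s3://my-backups-bucket/trifacta/

sudo service trifacta start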

Deployment Rollback

CloudFormation automatically rolls back if there is a problem during the upgrade process. If you want to roll back after a successful upgrade, you must deploy the previous version using CloudFormation and then restore your backups.

Support

Support Policy

Free community-based support, training and certification can be accessed within the Trifacta Community: https://community.trifacta.com/

General Support Policy information: https://www.trifacta.com/supportpolicy/

SLAs

SLAs and Support Services are tied to the purchased Success Package.

For more details on SLAs and Trifacta Global Support and Services, see https://www.trifacta.com/professional-services/.

Contacting Support

For more information, see Contact Support.

Deployment Process

Contents:

Deployment Process for AWS
Verify AWS Deployment
Maximizing Uptime and Reliability

This section provides an overview of the process for deploying Trifacta® Wrangler Enterprise through the AWS Marketplace.

Deployment Process for AWS

Through the AWS Marketplace, you can purchase and deploy Trifacta Wrangler Enterprise. Below are the high-level steps.

Steps:

1. Please visit the AWS Marketplace listing: https://aws.amazon.com/marketplace/pp/Trifacta-Wrangler-Enterprise/B077MK4J6Z
2. Complete any required steps to license the product.

NOTE: You must acquire a license key file from Trifacta Support. Details on pre-requisites are in the installation materials referenced below.

3. Complete the steps to deploy Trifacta Wrangler Enterprise. For more information, see Install from AWS Marketplace with EMR.

Verify AWS Deployment

After installation, you should verify your deployment on AWS. The basic process:

1. Login
2. Import a dataset
3. Create a simple recipe
4. Run a job with profiling enabled
5. Verify job results

For verification:

Store your dataset in your S3 bucket.
Run your test job on both Photon and EMR.

Detailed instructions are provided at the end of the install process. See Install from AWS Marketplace with EMR.

Maximizing Uptime and Reliability

To maximize the uptime and reliability of your Trifacta deployment, the following steps are recommended:

Utilize your monitoring infrastructure to monitor the application.
Utilize CloudFormation to monitor the instance and associated resources.
Verify backups are made according to your internal policies.

NOTE: Do not store backups on the Trifacta deployment.

Configure

The following topics describe how to configure Trifacta® Wrangler Enterprise for initial deployment and continued use.

Configure for AWS

Contents:

Internet Access
Database Installation
Base AWS Configuration
Base Storage Layer
Configure AWS Region
AWS Authentication
AWS Auth Mode
AWS Credential Provider
AWS Storage
S3 Sources
Redshift Connections
Snowflake Connections
AWS Clusters
EMR
Hadoop

This documentation applies to installation from a supported Marketplace. Please use the installation instructions provided with your deployment.

If you are installing or upgrading a Marketplace deployment, please use the available PDF content. You must use the install and configuration PDF available through the Marketplace listing.

The Trifacta® platform can be hosted within Amazon and supports integrations with multiple services from Amazon Web Services, including combinations of services for hybrid deployments. This section provides an overview of the integration options, as well as links to related configuration topics.

For an overview of AWS deployment scenarios, see Supported Deployment Scenarios for AWS.

Internet Access

From AWS, the Trifacta platform requires Internet access for the following services:

NOTE: Depending on your AWS deployment, some of these services may not be required.

AWS S3
Key Management System [KMS] (if sse-kms server-side encryption is enabled)
Secure Token Service [STS] (if temporary credential provider is used)
EMR (if integration with EMR cluster is enabled)

NOTE: If the Trifacta platform is hosted in a VPC where Internet access is restricted, access to S3, KMS and STS services must be provided by creating a VPC endpoint. If the platform is accessing an EMR cluster, a proxy server can be configured to provide access to the AWS ElasticMapReduce regional endpoint.

Database Installation

The following database scenarios are supported.

Cluster node: By default, the Trifacta databases are installed on PostgreSQL instances on the Trifacta node or another accessible node in the enterprise environment. For more information, see Install Databases.

Amazon RDS: For Amazon-based installations, you can install the Trifacta databases on PostgreSQL instances on Amazon RDS. For more information, see Install Databases on Amazon RDS.

Base AWS Configuration

The following configuration topics apply to AWS in general.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Base Storage Layer

NOTE: The base storage layer must be set during initial configuration and cannot be modified after it is set.

S3: Most of these integrations require use of S3 as the base storage layer, which means that data uploads, default location of writing results, and sample generation all occur on S3. When base storage layer is set to S3, the Trifacta platform can:

read and write to S3 read and write to Redshift connect to an EMR cluster

HDFS: In on-premises installations, it is possible to use S3 as a read-only option for a Hadoop-based cluster when the base storage layer is HDFS. You can configure the platform to read from and write to S3 buckets during job execution and sampling. For more information, see Enable S3 Access.

For more information on setting the base storage layer, see Set Base Storage Layer.

For more information, see Storage Deployment Options.

Configure AWS Region

For Amazon integrations, you can configure the Trifacta node to connect to Amazon datastores located in different regions.

NOTE: This configuration is required under any of the following deployment conditions:

1. The Trifacta node is installed on-premises, and you are integrating with Amazon resources.
2. The EC2 instance hosting the Trifacta node is located in a different AWS region than your Amazon datastores.
3. The Trifacta node or the EC2 instance does not have access to s3.amazonaws.com.

1. In the AWS console, please identify the location of your datastores in other regions. For more information, see the Amazon documentation.
2. Login to the Trifacta application.
3. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
4. Set the value of the following property to the region where your S3 datastores are located:

aws.s3.region
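For example, if your datastores are in the Oregon region, the setting might look like the following (region code for illustration only):

"aws.s3.region": "us-west-2",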

If the above value is not set, then the Trifacta platform attempts to infer the region based on the default S3 bucket location.

5. Save your changes.

AWS Authentication

The following table illustrates the various methods of managing authentication between the platform and AWS. The matrix of options is basically determined by the settings for two key parameters.

Credential provider: the source of credentials. Options: platform (default), instance (EC2 instance only), or temporary.
AWS mode: the method of authentication from platform to AWS. Options: system-wide or by-user.

Credential provider: Default

   AWS mode: system. One system-wide key/secret combination is inserted in the platform for use. Config:

   "aws.credentialProvider": "default",
   "aws.mode": "system",
   "aws.s3.key": "<key>",
   "aws.s3.secret": "<secret>",

   AWS mode: user. Each user provides a key/secret combination. Config:

   "aws.credentialProvider": "default",
   "aws.mode": "user",

   User: Configure Your Access to S3

Credential provider: Instance

   AWS mode: system. Platform uses EC2 instance roles. Config:

   "aws.credentialProvider": "instance",
   "aws.mode": "system",

   AWS mode: user. Users provide EC2 instance roles. Config:

   "aws.credentialProvider": "instance",
   "aws.mode": "user",

Credential provider: Temporary

   AWS mode: system. Temporary credentials are issued based on per-user IAM roles. Config:

   "aws.credentialProvider": "temporary",
   "aws.mode": "system",
   "aws.systemIAMRole": "<role>",

   AWS mode: user. Per-user authentication when using IAM role. Config:

   "aws.credentialProvider": "temporary",
   "aws.mode": "user",

AWS Auth Mode

When connecting to AWS, the platform supports the following basic authentication modes:

system: Access to AWS resources is managed through a single, system account. The account that you specify is based on the credential provider selected below. The instance credential provider ignores this setting. See below. Configuration:

"aws.mode": "system",

user: Authentication must be specified for individual users. Configuration:

"aws.mode": "user",

NOTE: Creation and use of custom dictionaries is not supported in user mode.

Tip: In AWS user mode, Trifacta administrators can manage S3 access for users through the Admin Settings page. See Manage Users.

AWS Credential Provider

The Trifacta platform supports the following methods of providing credentialed access to AWS and S3 resources.

default: This method uses the provided AWS key and secret values to access resources. See below. Configuration:

"aws.credentialProvider": "default",

instance: When you are running the Trifacta platform on an EC2 instance, you can leverage your enterprise IAM roles to manage permissions on the instance for the Trifacta platform. See below. Configuration:

"aws.credentialProvider": "instance",

temporary: Details are below.

Default credential provider

Whether the AWS access mode is set to system or user, the default credential provider for AWS and S3 resources is the Trifacta platform.

System mode: A single AWS key and secret is inserted into platform configuration. This account is used to access all resources and must have the appropriate permissions to do so. Configuration:

"aws.mode": "system",
"aws.s3.key": "<key>",
"aws.s3.secret": "<secret>",

User mode: Each user must specify an AWS key and secret in their account to access resources. For more information on configuring individual user accounts, see Configure Your Access to S3. Configuration:

"aws.mode": "user",

Default credential provider with EMR:

If you are using this method and integrating with an EMR cluster:

Copying the custom credential JAR file must be added as a bootstrap action to the EMR cluster definition. See Configure for EMR.
As an alternative to copying the JAR file, you can use the EMR EC2 instance-based roles to govern access. In this case, you must set the following parameter:

"aws.emr.forceInstanceRole": true,

For more information, see Configure for EC2 Role-Based Authentication.

Instance credential provider

When the platform is running on an EC2 instance, you can manage permissions through pre-defined IAM roles.

NOTE: If the Trifacta platform is connected to an EMR cluster, you can force authentication to the EMR cluster to use the specified IAM instance role. See Configure for EMR.

For more information, see Configure for EC2 Role-Based Authentication.

Temporary credential provider

For even better security, you can enable the use of temporary credentials, provided from your AWS resources based on an IAM role specified per user.

Tip: This method is recommended by AWS.

Set the following properties.

"aws.credentialProvider": If aws.mode = system, set this value to temporary. If aws.mode = user and you are using per-user authentication, then this setting is ignored and should stay as default.

Per-user authentication

Individual users can be configured to provide temporary credentials for access to AWS resources, which is a more secure authentication solution. For more information, see Configure AWS Per-User Authentication.

AWS Storage

S3 Sources

To integrate with S3, additional configuration is required. See Enable S3 Access.

Redshift Connections

You can create connections to one or more Redshift databases, from which you can read database sources and to which you can write job results. Samples are still generated on S3.

NOTE: Relational connections require installation of an encryption key file on the Trifacta node. For more information, see Create Encryption Key File.

For more information, see Create Redshift Connections.

Snowflake Connections

Through your AWS deployment, you can access your Snowflake databases. For more information, see Create Snowflake Connections.

AWS Clusters

Trifacta Wrangler Enterprise can integrate with one instance of either of the following.

NOTE: If Trifacta Wrangler Enterprise is installed through the Amazon Marketplace, only the EMR integration is supported.

EMR

When Trifacta Wrangler Enterprise is installed through AWS, you can integrate with an EMR cluster for Spark-based job execution. For more information, see Configure for EMR.

Hadoop

If you have installed Trifacta Wrangler Enterprise on-premises or directly into an EC2 instance, you can integrate with a Hadoop cluster for Spark-based job execution. See Configure for Hadoop.

Configure for EC2 Role-Based Authentication

Contents:

IAM roles
AWS System Mode
Additional AWS Configuration
Use of S3 Sources

When you are running the Trifacta platform on an EC2 instance, you can leverage your enterprise IAM roles to manage permissions on the instance for the Trifacta platform. When this type of authentication is enabled, Trifacta administrators can apply a role to the EC2 instance where the platform is running. That role's permissions apply to all users of the platform.

IAM roles

Before you begin, your IAM roles should be defined and attached to the associated EC2 instance.

NOTE: The IAM instance role used for S3 access should have access to resources at the bucket level.

For more information, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.
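As an illustration, an existing instance profile can be attached to a running EC2 instance with the AWS CLI; the instance ID and profile name below are placeholders:

aws ec2 associate-iam-instance-profile --instance-id i-0abc1234 --iam-instance-profile Name=my-trifacta-instance-profile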

AWS System Mode

To enable role-based instance authentication, the following parameter must be enabled.

"aws.mode": "system",

Additional AWS Configuration

The following additional parameters must be specified:

aws.credentialProvider: Set this value to instance. The IAM instance role is used for providing access.

aws.hadoopFsUseSharedInstanceProvider: Set this value to true for CDH. The class information is provided below.

Shared instance provider class information

Hortonworks:

"com.amazonaws.auth.InstanceProfileCredentialsProvider",

Pre-Cloudera 6.0.0:

"org.apache.hadoop.fs.s3a.SharedInstanceProfileCredentialsProvider"

Cloudera 6.0.0 and later:

Set the above parameters as follows:

"aws.credentialProvider": "instance", "aws.hadoopFSUseSharedInstanceProvider": false,

Use of S3 Sources

To access S3 for storage, additional configuration for S3 may be required.

NOTE: Do not configure the properties that apply to user mode.

Output sizing recommendations:

Single-file output: If you are generating a single file, you should try to keep its size under 1 GB.
Multi-part output: For multiple-file outputs, each part file should be under 1 GB in size. For more information, see https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html.

For more information, see Enable S3 Access.

Enable S3 Access

Contents:

Base Storage Layer
Limitations
Pre-requisites
Required AWS Account Permissions
Read-only access policies
Write access policies
Other AWS policies for S3
Configuration
Define base storage layer
Enable job output manifest
Enable read access to S3
S3 access modes
System mode - additional configuration
S3 Configuration
Configuration reference
Enable use of server-side encryption
Configure S3 filewriter
Create Redshift Connection
Hadoop distribution-specific configuration
Hortonworks
Additional Configuration for S3
Testing
Troubleshooting
Profiling consistently fails for S3 sources of data
Spark local directory has no space

Below are instructions on how to configure Trifacta® Wrangler Enterprise to point to S3.

Simple Storage Service (S3) is an online data storage service provided by Amazon, which provides low- latency access through web services. For more information, see https://aws.amazon.com/s3/.

NOTE: Please review the limitations before enabling. See Limitations of S3 Integration below.

Base Storage Layer

If base storage layer is S3: you can enable read/write access to S3. If base storage layer is not S3: you can enable read-only access to S3.

Limitations

The Trifacta platform only supports running S3-enabled instances over AWS.
Access to AWS S3 regional endpoints through internet protocol is required. If the machine hosting the Trifacta platform is in a VPC with no internet access, a VPC endpoint enabled for S3 services is required.
The Trifacta platform does not support access to S3 through a proxy server.

Write access requires using S3 as the base storage layer. See Set Base Storage Layer.

NOTE: Spark 2.3.0 jobs may fail on S3-based datasets due to a known incompatibility. For details, see https://github.com/apache/incubator-druid/issues/4456.

If you encounter this issue, please set spark.version to 2.1.0 in platform configuration. For more information, see Admin Settings Page.

Pre-requisites

On the Trifacta node, you must install the Oracle Java Runtime Environment for Java 1.8. Other versions of the JRE are not supported. For more information on the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/index.html.

If IAM instance role is used for S3 access, it must have access to resources at the bucket level.

Required AWS Account Permissions

All access to S3 sources occurs through a single AWS account (system mode) or through an individual user's account (user mode). For either mode, the AWS access key and secret combination must provide access to the default bucket associated with the account.

NOTE: These permissions should be set up by your AWS administrator.

Read-only access polices

NOTE: To enable viewing and browsing of all folders within a bucket, the following permissions are required:

The system account or individual user accounts must have the ListAllMyBuckets access permission for the bucket.
All objects to be browsed within the bucket must have Get access enabled.

The policy statement to enable read-only access to your default S3 bucket should look similar to the following. Replace 3c-my-s3-bucket with the name of your bucket:

{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation" ], "Resource": [ "arn:aws:s3:::3c-my-s3-bucket", "arn:aws:s3:::3c-my-s3-bucket/*", ] } ] }

Write access policies

Write access is enabled by adding the PutObject and DeleteObject actions to the above. Replace 3c-my-s3-bucket with the name of your bucket:

{ "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation", "s3:PutObject", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::3c-my-s3-bucket", "arn:aws:s3:::3c-my-s3-bucket/*", ] } ] }

Other AWS policies for S3

KMS policy

If any accessible bucket is encrypted with KMS-SSE, another policy must be deployed. For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/iam-policies.html.
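A minimal sketch of the additional statement, assuming a placeholder KMS key ARN; consult the linked AWS documentation for the authoritative policy language:

{
    "Effect": "Allow",
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:DescribeKey"
    ],
    "Resource": "arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000"
}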

Configuration

Depending on your S3 environment, you can define:

read access to S3
access to additional S3 buckets
S3 as base storage layer
write access to S3
the S3 bucket that is the default write destination

Define base storage layer

The base storage layer is the default platform for storing results.

Required for:

Write access to S3
Connectivity to Redshift

The base storage layer for your Trifacta instance is defined during initial installation and cannot be changed afterward.

If S3 is the base storage layer, you must also define the default storage bucket to use during initial installation, which cannot be changed at a later time. See Define default S3 write bucket below.

For more information on the various options for storage, see Storage Deployment Options.

For more information on setting the base storage layer, see Set Base Storage Layer.

Enable job output manifest

When the base storage layer is set to S3, you must enable the platform to generate job output manifest files. During job execution, the platform can create a manifest file of all files generated during job execution. When the job results are published, this manifest file ensures proper publication.

NOTE: This feature must be enabled when using S3 as the base storage layer.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following parameter and set it to true:

"feature.enableJobOutputManifest": true,

3. Save your changes and restart the platform.

Enable read access to S3

When read access is enabled, Trifacta users can explore S3 buckets for creating datasets.

NOTE: When read access is enabled, Trifacta users have automatic access to all buckets to which the specified S3 user has access. You may want to create a specific user account for S3 access.

NOTE: Data that is mirrored from one S3 bucket to another might inherit the permissions from the bucket where it is owned.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Set the following property to true:

"aws.s3.enabled": true,

3. Save your changes.
4. In the S3 configuration section, set enabled=true, which allows Trifacta users to browse S3 buckets through the Trifacta application.
5. Specify the AWS key and secret values for the user to access S3 storage.

S3 access modes

The Trifacta platform supports the following modes for accessing S3. You must choose one access mode and then complete the related configuration steps.

NOTE: Avoid switching between user mode and system mode, which can disable user access to data. At install time, you should choose your preferred mode.

System mode

(default) Access to S3 buckets is enabled and defined for all users of the platform. All users use the same AWS access key, secret, and default bucket.

System mode - read-only access

For read-only access, the key, secret, and default bucket must be specified in configuration.

NOTE: Please verify that the AWS account has all required permissions to access the S3 buckets in use. The account must have the ListAllMyBuckets ACL among its permissions.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following parameters:

Parameters Description

aws.s3.key Set this value to the AWS key to use to access S3.

aws.s3.secret Set this value to the secret corresponding to the AWS key provided.

aws.s3.bucket.name Set this value to the name of the S3 bucket from which users may read data.

NOTE: Additional buckets may be specified. See below.

3. Save your changes.
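Taken together, a system-mode, read-only configuration in trifacta-conf.json might look like the following sketch. The key, secret, and bucket values are illustrative placeholders only:

"aws.s3.enabled": true,
"aws.s3.key": "<your-access-key-id>",
"aws.s3.secret": "<your-secret-access-key>",
"aws.s3.bucket.name": "3c-my-s3-bucket",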

User mode

Optionally, access to S3 can be defined on a per-user basis. This mode allows administrators to define access to specific buckets using various key/secret combinations as a means of controlling permissions.

NOTE: When this mode is enabled, individual users must have AWS configuration settings applied to their account, either by an administrator or by themselves. The global settings in this section do not apply in this mode.

To enable:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Please verify that the settings below have been configured:

"aws.s3.enabled": true, "aws.mode": "user",

3. Additional configuration is required for per-user authentication. For more information, see Configure AWS Per-User Authentication.

User mode - Create encryption key file

If you have enabled user mode for S3 access, you must create and deploy an encryption key file. For more information, see Create Encryption Key File.

NOTE: If you have enabled user access mode, you can skip the following sections, which pertain to the system access mode, and jump to the Enable Redshift Connection section below.

System mode - additional configuration

The following sections apply only to system access mode.

Define default S3 write bucket

When S3 is defined as the base storage layer, write access to S3 is enabled. The Trifacta platform attempts to store outputs in the designated default S3 bucket.

NOTE: This bucket must be set during initial installation. Modifying it at a later time is not recommended and can result in inaccessible data in the platform.

NOTE: Bucket names cannot have underscores in them. See http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html.

Steps:

1. Define S3 to be the base storage layer. See Set Base Storage Layer.
2. Enable read access. See Enable read access.
3. Specify a value for aws.s3.bucket.name, which defines the S3 bucket where data is written. Do not include a protocol identifier. For example, if your bucket address is s3://MyOutputBucket, the value to specify is the following:

MyOutputBucket

NOTE: Specify the top-level bucket name only. There should not be any backslashes in your entry.

NOTE: This bucket also appears as a read-access bucket if the specified S3 user has access.
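For example, assuming the bucket address s3://MyOutputBucket from the step above, the corresponding entry in trifacta-conf.json would be:

"aws.s3.bucket.name": "MyOutputBucket",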

Adding additional S3 buckets

When read access is enabled, all S3 buckets of which the specified user is the owner appear in the Trifacta application. You can also add additional S3 buckets from which to read.

NOTE: Additional buckets are accessible only if the specified S3 user has read privileges.

NOTE: Bucket names cannot have underscores in them.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following parameter: aws.s3.extraBuckets:

1. In the Admin Settings page, specify the extra buckets as a comma-separated string of additional S3 buckets that are available for storage. Do not put any quotes around the string. Whitespace between string values is ignored.
2. In trifacta-conf.json, specify the extraBuckets array as a comma-separated list of buckets as in the following:

"extraBuckets": ["MyExtraBucket01","MyExtraBucket02"," MyExtraBucket03"]

NOTE: Specify the top-level bucket name only. There should not be any backslashes in your entry.

3. These values are mapped to the following bucket addresses:

s3://MyExtraBucket01
s3://MyExtraBucket02
s3://MyExtraBucket03

S3 Configuration

Configuration reference

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

"aws.s3.enabled": true, "aws.s3.bucket.name": "" "aws.s3.key": "", "aws.s3.secret": "", "aws.s3.extraBuckets": [""]

Setting Description

enabled: When set to true, the S3 file browser is displayed in the GUI for locating files. For more information, see S3 Browser.

bucket.name: Set this value to the name of the S3 bucket to which you are writing. When webapp.storageProtocol is set to s3, the output is delivered to aws.s3.bucket.name.

key: Access Key ID for the AWS account to use.

NOTE: This value cannot contain a slash (/).

secret: Secret Access Key for the AWS account.

extraBuckets: Add references to any additional S3 buckets to this comma-separated array of values. The S3 user must have read access to these buckets.

Enable use of server-side encryption

You can configure the Trifacta platform to publish data on S3 when a server-side encryption policy is enabled. SSE-S3 and SSE-KMS methods are supported. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html.

Notes:

When encryption is enabled, all buckets to which you are writing must share the same encryption policy. Read operations are unaffected.

To enable, please specify the following parameters.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Server-side encryption method

"aws.s3.serverSideEncryption": "none",

Set this value to the method of encryption used by the S3 server. Supported values:

NOTE: Lower case values are required.

sse-s3
sse-kms
none

Server-side KMS key identifier

When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption.

"aws.s3.serverSideKmsKeyId": "",

Notes:

Access to the key: Access must be provided to the authenticating user. The AWS IAM role must be assigned to this key.
Encrypt/Decrypt permissions for the specified KMS key ID: Permissions must be assigned to the authenticating user. The AWS IAM role must be given these permissions. For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html.

The format for referencing this key is the following:

"arn:aws:kms:::key/"

You can use an AWS alias in the following formats. The format of the AWS-managed alias is the following:

"alias/aws/s3"

The format for a custom alias is the following:

"alias/"

where:

is the name of the alias for the entire key.

Save your changes and restart the platform.
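As a worked example, a completed SSE-KMS configuration might look like the following sketch. The key ARN is a hypothetical placeholder:

"aws.s3.serverSideEncryption": "sse-kms",
"aws.s3.serverSideKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",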

Configure S3 filewriter

The following configuration can be applied through the Hadoop site-config.xml file. If your installation does not have a copy of this file, you can insert the properties listed in the steps below into trifacta-conf.json to configure the behavior of the S3 filewriter.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the filewriter.hadoopConfig block, where you can insert the following Hadoop configuration properties:

"filewriter": { max: 16, "hadoopConfig": { "fs.s3a.buffer.dir": "/tmp", "fs.s3a.fast.upload": "true" }, ... }

Property Description

fs.s3a.buffer.dir: Specifies the temporary directory on the Trifacta node to use for buffering when uploading to S3. If fs.s3a.fast.upload is set to false, this parameter is unused.

NOTE: This directory must be accessible to the Batch Job Runner process during job execution.

fs.s3a.fast.upload: Set to true to enable buffering in blocks. When set to false, buffering in blocks is disabled. For a given file, the entire object is buffered to the disk of the Trifacta node. Depending on the size and volume of your datasets, the node can run out of disk space.

3. Save your changes and restart the platform.

Create Redshift Connection

For more information, see Create Redshift Connections.

Hadoop distribution-specific configuration

Hortonworks

NOTE: If you are using Spark profiling through Hortonworks HDP on data stored in S3, additional configuration is required. See Configure for Hortonworks.

Additional Configuration for S3

The following parameters can be configured through the Trifacta platform to affect the integration with S3. You may or may not need to modify these values for your deployment.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Parameter Description

aws.s3.consistencyTimeout: S3 does not guarantee that at any time the files that have been written to a directory will be consistent with the files available for reading. S3 does guarantee that eventually the files are in sync. This guarantee is important for some platform jobs that write data to S3 and then immediately attempt to read from the written data.

This timeout defines how long the platform waits for this guarantee. If the timeout is exceeded, the job fails. The default value is 120. Depending on your environment, you may need to modify this value.

aws.s3.endpoint: This value should be the S3 endpoint DNS name value.

NOTE: Do not include the protocol identifier.

Example value:

s3.us-east-1.amazonaws.com

If your S3 deployment is either of the following:

located in a region that does not support the default endpoint, or
v4-only signature is enabled in the region

then you can specify this setting to point to the S3 endpoint for Java/Spark services. For more information on this location, see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.
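For illustration, the two parameters above might appear together in trifacta-conf.json as follows, using the example endpoint and the documented default timeout:

"aws.s3.consistencyTimeout": 120,
"aws.s3.endpoint": "s3.us-east-1.amazonaws.com",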

Testing

Restart services. See Start and Stop the Platform.

Try running a simple job from the Trifacta application. For more information, see Verify Operations.

Troubleshooting

Profiling consistently fails for S3 sources of data

If you are executing visual profiles of datasets sourced from S3, you may see an error similar to the following in the batch-job-runner.log file:

01:19:52.297 [Job 3] ERROR com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - BatchFileWriterException: Batch File Writer unknown error: {jobId=3, why=bound must be positive}
01:19:52.298 [Job 3] INFO com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - Notifying monitor for job 3 with status code FAILURE

This issue is caused by improperly configured buffering for jobs that write to S3. The specified local buffer directory cannot be accessed by the batch job running process, so the job fails to write results to S3.

Solution:

You may do one of the following:

Use a valid temp directory when buffering to S3.
Disable buffering to directory completely.

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Locate the following, where you can insert either of the following Hadoop configuration properties:

"filewriter": { max: 16, "hadoopConfig": { "fs.s3a.buffer.dir": "/tmp", "fs.s3a.fast.upload": false }, ... }

Property Description

fs.s3a.buffer.dir: Specifies the temporary directory on the Trifacta node to use for buffering when uploading to S3. If fs.s3a.fast.upload is set to false, this parameter is unused.

fs.s3a.fast.upload: When set to false, buffering is disabled.

Save your changes and restart the platform.

Spark local directory has no space

During execution of a Spark job, you may encounter the following error:

org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.

Solution:

Restart Trifacta services, which may free up some temporary space.
Use the steps in the preceding solution to reassign a temporary directory for Spark to use (fs.s3a.buffer.dir).

Create Redshift Connections

Contents:

Pre-requisites
Limitations
Create Connection
Create through application
Testing

This section provides information on how to enable Redshift connectivity and create one or more connections to Redshift sources.

Amazon Redshift is a hosted data warehouse available through Amazon Web Services. It is frequently used for hosting of datasets used by downstream analytic tools such as Tableau and Qlik. For more information, see https://aws.amazon.com/redshift/.

Pre-requisites

Before you begin, please verify that your Trifacta® environment meets the following requirements:

NOTE: In the Admin Settings page are some deprecated parameters pertaining to Redshift. Please ignore these parameters and their settings. They do not apply to this release.

1. S3 base storage layer: Redshift access requires use of S3 as the base storage layer, which must be enabled. See Set Base Storage Layer.
2. Same region: The Redshift cluster must be in the same region as the default S3 bucket.
3. Integration: Your Trifacta instance is connected to a running environment supported by your product edition.

4. Deployment: Trifacta platform is deployed either on-premises or in EC2.

Limitations

You can publish any specific job once to Redshift through the export window. See Publishing Dialog.

When publishing to Redshift through the Publishing dialog, output must be in Avro or JSON format. This limitation does not apply to direct writing to Redshift.

Management of nulls: Nulls are displayed as expected in the Trifacta application. When Redshift jobs are run, the UNLOAD SQL command in Redshift converts all nulls to empty strings. Null values appear as empty strings in generated results, which can be confusing. This is a known issue with Redshift.

Create Connection

You can create Redshift connections through the following methods.

Tip: SSL connections are recommended. Details are below.

Create through application

Any user can create a Redshift connection through the application.

Steps:

1. Log in to the application.
2. In the menu, click User menu > Preferences > Connections.
3. In the Create Connection page, click the Redshift connection card.
4. Specify the properties for your Redshift database connection. The following parameters are specific to Redshift connections:

Property Description

IAM Role ARN for Redshift-S3 Connectivity (Optional): You can specify an IAM role ARN that enables role-based connectivity between Redshift and the S3 bucket that is used as intermediate storage during Redshift bulk COPY/UNLOAD operations. Example:

arn:aws:iam::1234567890:role/MyRedshiftRole

For more information, see Create Connection Window.

5. Click Save.

Enable SSL connections

To enable SSL connections to Redshift, you must enable them first on your Redshift cluster. For more information, see https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html.

In your connection to Redshift, please add the following string to your Connect String Options:

;ssl=true

Save your changes.

Testing

Import a dataset from Redshift. Add it to a flow, and specify a publishing action. Run a job.

NOTE: When publishing to Redshift through the Publishing dialog, output must be in Avro or JSON format. This limitation does not apply to direct writing to Redshift.

For more information, see Verify Operations.

After you have run your job, you can publish the results to Redshift through the Job Details page. See Publishing Dialog.

Configure for EMR

Contents:

Supported Versions
Supported Spark Versions
Limitations
Create EMR Cluster
Cluster options
Specify cluster roles
Authentication
Set up S3 Buckets
Bucket setup
Set up EMR resources buckets
Access Policies
EC2 instance profile
EMR roles
General configuration for Trifacta platform
Change admin password
Verify S3 as base storage layer
Set up S3 integration
Configure EMR authentication mode
Configure Trifacta platform for EMR
Enable EMR integration
Apply EMR cluster ID
Extract IP address of master node in private sub-net
Configure authentication mode
Configure Spark for EMR
Additional Configuration for EMR
Default Hadoop job results format
Configure Snappy publication
Additional parameters
Optional Configuration
Configure for Redshift
Switch EMR Cluster
Configure Batch Job Runner
Modify Job Tag Prefix
Testing

You can configure your instance of the Trifacta platform to integrate with Amazon Elastic MapReduce (EMR), a highly scalable Hadoop-based execution environment.

NOTE: This section applies only to installations of Trifacta Wrangler Enterprise where a license key file has been acquired from Trifacta and applied to the platform.

Amazon EMR (Elastic MapReduce) provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. For more information on EMR, see http://docs.aws.amazon.com/cli/latest/reference/emr/.

Supported Versions

This section outlines how to create a new EMR cluster and integrate the Trifacta platform with it. The platform can be integrated with existing EMR clusters.

Supported Versions: EMR 5.13 - EMR 5.29.0

NOTE: EMR 5.28.0 is not supported, due to Spark compatibility issues. Please use 5.28.1 or later.

NOTE: EMR 5.20 - EMR 5.29 requires Spark 2.4. For more information, see Configure for Spark.

Supported Spark Versions

Depending on the version of EMR in use, you must configure the Trifacta platform to use the appropriate version of Spark. Please note the appropriate configuration settings below for later use.

NOTE: The version of Spark to use for the platform is defined in the spark.version property. This configuration step is covered later.

EMR versions: EMR 5.13 - EMR 5.19
Spark version:

"spark.version": "2.3.0",

EMR versions: EMR 5.20 - EMR 5.29
Spark version:

"spark.version": "2.4.4",

Additional configuration and notes: For EMR 5.20 and later, please set the following property value:

"spark.useVendorSparkLibraries": true,

Limitations

NOTE: Job cancellation is not supported on an EMR cluster.

The Trifacta platform must be installed on AWS.

Create EMR Cluster

Use the following section to set up your EMR cluster for use with the Trifacta platform.

Via AWS EMR UI: This method is assumed in this documentation.
Via AWS command line interface: For this method, it is assumed that you know the required steps to perform the basic configuration. For custom configuration steps, additional documentation is provided below.

NOTE: It is recommended that you set up your cluster for exclusive use by the Trifacta platform.

Cluster options

In the Amazon EMR console, click Create Cluster. Click Go to advanced options. Complete the sections listed below.

NOTE: Please be sure to read all of the cluster options before setting up your EMR cluster.

NOTE: Please perform your configuration through the Advanced Options workflow.

For more information on setting up your EMR cluster, see http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html.

Advanced Options

In the Advanced Options screen, please select the following:

Software Configuration:

Release: EMR version to select.
Select:
  Hadoop 2.8.3
  Hue 3.12.0
  Ganglia 3.7.2

Tip: Although it is optional, Ganglia is recommended for monitoring cluster performance.

Spark version should be set accordingly. See "Supported Spark Versions" above.
Deselect everything else.
Edit the software settings: Copy and paste the following into Enter Configuration:

[ { "Classification": "capacity-scheduler", "Properties": { "yarn.scheduler.capacity.resource-calculator": "org.apache. hadoop.yarn.util.resource.DominantResourceCalculator" } } ]

Auto-terminate cluster after the last step is completed: Leave this option disabled.

Hardware configuration

NOTE: Please apply the sizing information for your EMR cluster that was recommended for you. If you have not done so, please contact your Trifacta representative.

General Options

Cluster name: Provide a descriptive name.
Logging: Enable logging on the cluster.
S3 folder: Please specify the S3 bucket and path to the logging folder.

NOTE: Please verify that this location is read accessible to all users of the platform. See below for details.

Debugging: Enable.
Termination protection: Enable.
Tags: No options required.
Additional Options:
  EMRFS consistent view: Do not enable. The platform can generate its own job output manifests. See Enable S3 Access.
  Custom AMI ID: None.
  Bootstrap Actions: If you are using a custom credential provider JAR, you must create a bootstrap action.

NOTE: This configuration must be completed before you create the EMR cluster. For more information, see Authentication below.

Security Options

EC2 key pair: Please select a key pair to use if you wish to access EMR nodes via SSH.
Permissions: Set to Custom to reduce the scope of permissions. For more information, see EMR cluster policies below.

NOTE: Default permissions give access to everything in the cluster.

Encryption Options: No requirements.
EC2 Security Groups:

The selected security group for the master node on the cluster must allow TCP traffic from the Trifacta instance on port 8088. For more information, see System Ports.

Create cluster and acquire cluster ID

If you performed all of the configuration, including the sections below, you can create the cluster.

NOTE: You must acquire your EMR cluster ID for use in configuration of the Trifacta platform.

Specify cluster roles

The following cluster roles and their permissions are required. For more information on the specifics of these policies, see EMR cluster policies.

EMR Role:
  Read/write access to log bucket
  Read access to resource bucket
EC2 instance profile:
  If using instance mode: EC2 profile should have read/write access for all users.
  EC2 profile should have same permissions as EC2 Edge node role.
  Read/write access to log bucket
  Read access to resource bucket
Auto-scaling role:
  Read/write access to log bucket
  Read access to resource bucket
  Standard auto-scaling permissions

Authentication

You can use one of two methods for authenticating the EMR cluster:

Role-based IAM authentication (recommended): This method leverages your IAM roles on the EC2 instance.
Custom credential provider JAR file: This method utilizes a JAR file provided with the platform. This JAR file must be deployed to all nodes on the EMR cluster through a bootstrap action script.

Role-based IAM authentication

You can leverage your IAM roles to provide role-based authentication to the S3 buckets.

NOTE: The IAM role that is assigned to the EMR cluster and to the EC2 instances on the cluster must have access to the data of all users on S3.

For more information, see Configure for EC2 Role-Based Authentication.

Specify the custom credential provider JAR file

If you are not using IAM roles for access, you can manage access using either of the following:

AWS key and secret values specified in trifacta-conf.json
AWS user mode

In either scenario, you must use the custom credential provider JAR provided in the installation. This JAR file must be available to all nodes of the EMR cluster.

After you have installed the platform and configured the S3 buckets, please complete the following steps to deploy this JAR file.

NOTE: These steps must be completed before you create the EMR cluster.

NOTE: This section applies if you are using the default credential provider mechanism for AWS and are not using the IAM instance-based role authentication mechanism.

Steps:

1. From the installation of the Trifacta platform, retrieve the following file:

[TRIFACTA_INSTALL_DIR]/aws/credential-provider/build/libs/trifacta-aws-emr-credential-provider.jar

2. Upload this JAR file to an S3 bucket location where the EMR cluster can access it:

1. Via AWS Console S3 UI: See http://docs.aws.amazon.com/cli/latest/reference/s3/index.html.
2. Via AWS command line:

aws s3 cp trifacta-aws-emr-credential-provider.jar s3:///

3. Create a bootstrap action script named configure_emrfs_lib.sh. The contents must be the following:

sudo aws s3 cp s3:///trifacta-aws-emr-credential-provider.jar /usr/share/aws/emr/emrfs/auxlib/

4. This script must be uploaded into S3 in a location that can be accessed from the EMR cluster. Retain the full path to this location.
5. Add the bootstrap action to the EMR cluster configuration.
  1. Via AWS Console S3 UI: Create the bootstrap action to point to the script you uploaded on S3.

  2. Via AWS command line:
    1. Upload the configure_emrfs_lib.sh file to the accessible S3 bucket.
    2. In the command line cluster creation script, add a custom bootstrap action, such as the following:

--bootstrap-actions '[{"Path":"s3:///configure_emrfs_lib.sh","Name":"Custom action"}]'

When the EMR cluster is launched with the above custom bootstrap action, the cluster does one of the following:

Interacts with S3 using the credentials specified in trifacta-conf.json
If aws.mode = user, then the credentials registered by the user are used.

For more information about AWSCredentialsProvider for EMRFS please see:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-credentialsprovider.html https://aws.amazon.com/blogs/big-data/securely-analyze-data-from-another-aws-account-with-emrfs/
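For illustration only, a minimal aws emr create-cluster invocation that attaches such a bootstrap action might look like the following sketch. The cluster name, release label, instance settings, and bucket name are hypothetical placeholders; adjust them to match the cluster options described in this section:

aws emr create-cluster \
  --name "trifacta-emr" \
  --release-label emr-5.27.0 \
  --applications Name=Hadoop Name=Hue Name=Ganglia Name=Spark \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions '[{"Path":"s3://MyResourceBucket/configure_emrfs_lib.sh","Name":"Custom action"}]'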

Set up S3 Buckets

Bucket setup

You must set up S3 buckets for read and write access.

NOTE: Within the Trifacta platform, you must enable use of S3 as the default storage layer. This configuration is described later.

For more information, see Enable S3 Access.

Set up EMR resources buckets

On the EMR cluster, all users of the platform must have access to the following locations:

Location: EMR Resources bucket and path
Description: The S3 bucket and path where resources can be stored by the Trifacta platform for execution of Spark jobs on the cluster. The locations are configured separately in the Trifacta platform.
Required Access: Read/Write

Location: EMR Logs bucket and path
Description: The S3 bucket and path where logs are written for cluster job execution.
Required Access: Read

These locations are configured on the Trifacta platform later.

Access Policies

EC2 instance profile

Trifacta users require the following policies to run jobs on the EMR cluster:

{ "Statement": [ { "Effect": "Allow", "Action": [ "elasticmapreduce:AddJobFlowSteps", "elasticmapreduce:DescribeStep", "elasticmapreduce:DescribeCluster", "elasticmapreduce:ListInstanceGroups" ], "Resource": [ "*" ] }, { "Effect": "Allow", "Action": [ "s3:*" ], "Resource": [ "arn:aws:s3:::__EMR_LOG_BUCKET__", "arn:aws:s3:::__EMR_LOG_BUCKET__/*", "arn:aws:s3:::__EMR_RESOURCE_BUCKET__", "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*" ] }

] }

EMR roles

The following policies should be assigned to the EMR roles listed below for read/write access:

{
  "Effect": "Allow",
  "Action": [
    "s3:*"
  ],
  "Resource": [
    "arn:aws:s3:::__EMR_LOG_BUCKET__",
    "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
    "arn:aws:s3:::__EMR_RESOURCE_BUCKET__",
    "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
  ]
}

General configuration for Trifacta platform

Please complete the following sections to configure the Trifacta platform to communicate with the EMR cluster.

Change admin password

As soon as you have installed the software, you should log in to the application and change the admin password. The initial admin password is the instanceId for the EC2 instance. For more information, see Change Password.

Verify S3 as base storage layer

EMR integration requires use of S3 as the base storage layer.

NOTE: The base storage layer must be set during initial installation and set up of the Trifacta node.

See Set Base Storage Layer.

Set up S3 integration

To integrate with S3, additional configuration is required. See Enable S3 Access.

Configure EMR authentication mode

Authentication to AWS and to EMR supports the following basic modes:

System: A single set of credentials is used to connect to resources.
User: Each user has a separate set of credentials. The user can choose to submit key-secret combinations or role-based authentication.

NOTE: Your method of authentication to AWS should already be configured. For more information, see Configure for AWS.

The authentication mode for your access to EMR can be configured independently from the base authentication mode for AWS, with the following exception:

NOTE: If aws.emr.authMode is set to user , then aws.mode must also be set to user.

Authentication mode configuration matrix:

AWS mode (aws.mode): system / EMR mode (aws.emr.authMode): system
AWS and EMR use a single key-secret combination. Parameters to set:

"aws.s3.key"
"aws.s3.secret"

See Configure for AWS.

AWS mode: user / EMR mode: system
AWS access uses a single key-secret combination. EMR access is governed by per-user credentials. Per-user credentials can be provided from one of several different providers.

NOTE: Per-user access requires additional configuration for EMR. See the following section.

For more information on configuring per-user access, see Configure for AWS.

AWS mode: system / EMR mode: user
Not supported.

AWS mode: user / EMR mode: user
AWS and EMR use the same per-user credentials for access. Per-user credentials can be provided from one of several different providers.

NOTE: Per-user access requires additional configuration for EMR. See the following section.

For more information on configuring per-user access, see Configure AWS Per-User Authentication.

Please apply the following configuration to set the EMR authentication mode:

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following settings and apply the appropriate values. See the table below:

"aws.emr.authMode": "user",

Setting Description

aws.emr.authMode Configure the mode to use to authenticate to the EMR cluster:

system - In system mode, the specified AWS key and secret combination are used to authenticate to the EMR cluster. These credentials are used for all users.

user - In user mode, user configuration is retrieved from the database.

NOTE: User mode for EMR authentication requires that aws.mode be set to user . Additional configuration for EMR is below.

3. Save your changes.
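For example, to use per-user authentication for both AWS and EMR, the two settings would appear together as follows:

"aws.mode": "user",
"aws.emr.authMode": "user",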

EMR per-user authentication for the Trifacta platform

If you have enabled per-user authentication for EMR (aws.emr.authMode=user), you must set the following properties based on the credential provider for your AWS per-user credentials.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Authentication method Properties and values

Use default credential provider for all Trifacta access, including EMR:

"aws.credentialProvider": "default",
"aws.emr.forceInstanceRole": false,

NOTE: This method requires the deployment of a custom credential provider JAR.

Use default credential provider for all Trifacta access. However, EC2 role-based IAM authentication is used for EMR:

"aws.credentialProvider": "default",
"aws.emr.forceInstanceRole": true,

EC2 role-based IAM authentication for all Trifacta access:

"aws.credentialProvider": "instance",

Configure Trifacta platform for EMR

Enable EMR integration

After you have configured S3 to be the base storage layer, you must enable EMR integration.

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

1. Set the following value:

"webapp.runInEMR": true,

2. Set the following values:

"webapp.runWithSparkSubmit": false,

3. Verify the following property values:

"webapp.runInTrifactaServer": true, "webapp.runWithSparkSubmit": false, "webapp.runInDataflow": false,

Apply EMR cluster ID

The Trifacta platform must be aware of the EMR cluster to which to connect.

Steps:

1. Administrators can apply this configuration change through the Admin Settings Page in the application. If the application is not available, the settings are available in trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Under External Service Settings, enter your AWS EMR Cluster ID. Click the Save button below the textbox.

For more information, see Admin Settings Page.

Extract IP address of master node in private sub-net

If you have deployed your EMR cluster on a private sub-net that is accessible outside of AWS, you must enable this property, which permits the extraction of the IP address of the master cluster node through DNS.

NOTE: This feature must be enabled if your EMR is accessible outside of AWS on a private network.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Set the following property to true:

"emr.extractIPFromDNS": false,

3. Save your changes and restart the platform.

Configure authentication mode

You can authenticate to the EMR cluster using either of the following authentication modes:

System: A single set of credentials are used to connect to EMR.

User: Each user has a separate set of credentials.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following settings and apply the appropriate values. See the table below:

"aws.emr.authMode": "user",

Setting Description

aws.emr.authMode Configure the mode to use to authenticate to the EMR cluster:

system - In system mode, the specified AWS key and secret combination are used to authenticate to the EMR cluster. These credentials are used for all users.

user - In user mode, user configuration is retrieved from the database.

NOTE: User mode for EMR authentication requires that aws.mode be set to user .

3. Save your changes.

Configure Spark for EMR

For EMR, you can configure a set of Spark-related properties to manage the integration and its performance.

Configure Spark version

Depending on the version of EMR with which you are integrating, the Trifacta platform must be modified to use the appropriate version of Spark to connect to EMR.

NOTE: You should have already acquired the value to apply. See "Supported Spark Versions" above.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following:

"spark.version": "",

3. Save your changes.
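For example, if you are integrating with an EMR 5.27 cluster, the value from the Supported Spark Versions table above would be:

"spark.version": "2.4.4",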

Use vendor libraries

If you are using EMR 5.20 or later (Spark 2.4 or later), you must configure the vendor libraries provided by the cluster. Please set the following parameter.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following:

"spark.useVendorSparkLibraries": true,

3. Save your changes.

Disable Spark job service

The Spark job service is not used for EMR job execution. Please complete the following to disable it:

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following and set it to false:

"spark-job-service.enabled": false,

3. Locate the following and set it to false:

"spark-job-service.enableHiveSupport": false,

4. Save your changes.

Specify YARN queue for Spark jobs

Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the Trifacta platform are submitted to this queue.

Steps:

1. In platform configuration, locate the following:

"spark.props.spark.yarn.queue"

2. Specify the name of the queue.
3. Save your changes.
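For example, to submit all Spark jobs to a queue named trifacta (a hypothetical queue name):

"spark.props.spark.yarn.queue": "trifacta",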

Allocation properties

The following properties must be passed from the Trifacta platform to Spark for proper execution on the EMR cluster.

To apply this configuration change, log in as an administrator to the Trifacta node. Then, edit trifacta-conf.json. Some of these settings may not be available through the Admin Settings Page. For more information, see Platform Configuration Methods.

NOTE: Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through the Spark configuration block. Ignore any references in trifacta-conf.json to these properties and their settings.

"spark": { ... "props": { "spark.dynamicAllocation.enabled": "true", "spark.shuffle.service.enabled": "true", "spark.executor.instances": "0", "spark.executor.memory": "2048M", "spark.executor.cores": "2", "spark.driver.maxResultSize": "0" } ... }

Property Description Value

spark.dynamicAllocation.enabled: Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors. Value: true

spark.shuffle.service.enabled: Enable the Spark shuffle service, which manages the shuffle data for jobs instead of the executors. Value: true

spark.executor.instances: Default count of executor instances. Value: See Sizing Guide.

spark.executor.memory: Default memory allocation of executor instances. Value: See Sizing Guide.

spark.executor.cores: Default count of executor cores. Value: See Sizing Guide.

spark.driver.maxResultSize: Enable serialized results of unlimited size by setting this parameter to zero (0). Value: 0

Additional Configuration for EMR

Default Hadoop job results format

For smaller datasets, the platform recommends using the Trifacta Photon running environment.

For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.

As needed, you can change this default format. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

"webapp.defaultHadoopFileFormat": "csv",

Accepted values: csv, json, avro, pqt

For more information, see Run Job Page.

Configure Snappy publication

If you are publishing using Snappy compression for jobs run on an EMR cluster, you may need to perform the following additional configuration.

Steps:

1. SSH into EMR cluster (master) node:

ssh

2. Create tarball of native Hadoop libraries:

tar -C /usr/lib/hadoop/lib -czvf emr-hadoop-native.tar.gz native

3. Copy the tarball to the Trifacta EC2 instance, into the /tmp directory:

scp -p emr-hadoop-native.tar.gz :/tmp

4. SSH to Trifacta EC2 instance:

ssh

5. Create path values for libraries:

sudo -u trifacta mkdir -p /opt/trifacta/services/batch-job-runner/build/libs

6. Untar the tarball to the Trifacta installation path:

sudo -u trifacta tar -C /opt/trifacta/services/batch-job-runner/build/libs -xzf /tmp/emr-hadoop-native.tar.gz

7. Verify libhadoop.so* and libsnappy.so* libraries exist and are owned by the Trifacta user:

ls -l /opt/trifacta/services/batch-job-runner/build/libs/native/

8. Verify that the /tmp directory has the proper permissions for publication. For more information, see Supported File Formats.
9. A platform restart is not required.

Additional parameters

You can set the following parameters as needed:

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Property Required Description

aws.emr.resource.bucket Y S3 bucket name where Trifacta executables, libraries, and other resources required for Spark execution can be stored.

aws.emr.resource.path Y S3 path within the bucket where resources can be stored for job execution on the EMR cluster.

NOTE: Do not include leading or trailing slashes for the path value.

aws.emr.proxyUser Y This value defines the user that the Trifacta platform uses to connect to the cluster.

NOTE: Do not modify this value.

aws.emr.maxLogPollingRetries N Configure maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is 5 .

aws.emr.tempfilesCleanupAge N Defines the number of days that temporary files in the /trifacta/tempfiles directory on EMR HDFS are permitted to age.

By default, this value is set to 0, which means that cleanup is disabled.

If needed, you can set this to a positive integer value. During each job run, the platform scans this directory for temp files older than the specified number of days and removes any that are found. This cleanup provides an additional level of system hygiene.

Before enabling this secondary cleanup process, please execute the following command to clear the tempfiles directory:

hdfs dfs -rm -r -skipTrash /trifacta/tempfiles
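As an illustration, the required resource parameters above might appear together in trifacta-conf.json as follows. The bucket name, path, and cleanup age are placeholder values; aws.emr.proxyUser is omitted because it should not be modified:

"aws.emr.resource.bucket": "MyResourceBucket",
"aws.emr.resource.path": "emr-resources",
"aws.emr.tempfilesCleanupAge": 7,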

Optional Configuration

Configure for Redshift

For more information on configuring the platform to integrate with Redshift, see Create Redshift Connections.

Switch EMR Cluster

If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.

Configure Batch Job Runner

Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected. For more information, see Configure Batch Job Runner.

Modify Job Tag Prefix

In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following and modify if needed:

"aws.emr.jobTagPrefix": "TRIFACTA_JOB_",

3. Save your changes and restart the platform.

Testing

1. Load a dataset from the EMR cluster.
2. Perform a few simple steps on the dataset.
3. Click Run Job in the Transformer page.
4. When specifying the job:
  1. Click the Profile Results checkbox.
  2. Select Spark.
5. When the job completes, verify that the results have been written to the appropriate location.

Contact Support

Do you need further assistance? Check out the resources below:

Resources for Support

Search Support: In Trifacta® Wrangler Enterprise, click the Help icon and select Search Help to search our help content. If your question is not answered through search, you can file a support ticket through the Support Portal (see below).

Download logs: When you report your issue, please acquire the relevant logs available in the Trifacta application. Select Help menu > Download logs. See Download Logs Dialog.

Trifacta Community and Support Portal: The Trifacta Community and Support Portal can be reached at: https://community.trifacta.com

Email: [email protected]

Within our Community, you can:

Manage your support cases Get free Wrangler certifications Post questions to the community Search our AI-driven knowledgebase Answer questions to earn points and get on the leaderboard Watch tutorials, access documentation, learn about new features, and more

Legal

Third-Party License Information

Copyright © 2020 Trifacta Inc.

This product also includes the following libraries which are covered by The (MIT AND BSD-3-Clause):

sha.js

This product also includes the following libraries which are covered by The 2-clause BSD License:

double_metaphone

This product also includes the following libraries which are covered by The 3-Clause BSD License:

com.google.protobuf.protobuf-java com.google.protobuf.protobuf-java-util

This product also includes the following libraries which are covered by The ASL:

funcsigs org.json4s.json4s-ast_2.10 org.json4s.json4s-core_2.10 org.json4s.json4s-native_2.10

This product also includes the following libraries which are covered by The ASL 2.0:

pykerberos

This product also includes the following libraries which are covered by The Amazon Redshift ODBC and JDBC Driver License Agreement:

RedshiftJDBC41-1.1.7.1007 com.amazon.redshift.RedshiftJDBC41

This product also includes the following libraries which are covered by The Apache 2.0 License:

com.uber.jaeger.jaeger-b3 com.uber.jaeger.jaeger-core com.uber.jaeger.jaeger-thrift com.uber.jaeger.jaeger-zipkin org.apache.spark.spark-catalyst_2.11 org.apache.spark.spark-core_2.11 org.apache.spark.spark-hive_2.11 org.apache.spark.spark-kvstore_2.11 org.apache.spark.spark-launcher_2.11 org.apache.spark.spark-network-common_2.11 org.apache.spark.spark-network-shuffle_2.11 org.apache.spark.spark-sketch_2.11 org.apache.spark.spark-sql_2.11 org.apache.spark.spark-tags_2.11 org.apache.spark.spark-unsafe_2.11 org.apache.spark.spark-yarn_2.11

This product also includes the following libraries which are covered by The Apache License:

com.chuusai.shapeless_2.11 commons-httpclient.commons-httpclient org.apache.httpcomponents.httpclient org.apache.httpcomponents.httpcore

This product also includes the following libraries which are covered by The Apache License (v2.0):

com.vlkan.flatbuffers

This product also includes the following libraries which are covered by The Apache License 2.0:

@google-cloud/pubsub @google-cloud/resource @google-cloud/storage arrow avro aws-sdk azure-storage bootstrap

browser-request bytebuffer cglib.cglib-nodep com.amazonaws.aws-java-sdk com.amazonaws.aws-java-sdk-bundle com.amazonaws.aws-java-sdk-core com.amazonaws.aws-java-sdk-dynamodb com.amazonaws.aws-java-sdk-emr com.amazonaws.aws-java-sdk-iam com.amazonaws.aws-java-sdk-kms com.amazonaws.aws-java-sdk-s3 com.amazonaws.aws-java-sdk-sts com.amazonaws.jmespath-java com.carrotsearch.hppc com.clearspring.analytics.stream com.cloudera.navigator.navigator-sdk com.cloudera.navigator.navigator-sdk-client com.cloudera.navigator.navigator-sdk-model com.codahale.metrics.metrics-core com.cronutils.cron-utils com.databricks.spark-avro_2.11 com.fasterxml.classmate com.fasterxml.jackson.dataformat.jackson-dataformat-cbor com.fasterxml.jackson.datatype.jackson-datatype-joda com.fasterxml.jackson.jaxrs.jackson-jaxrs-base com.fasterxml.jackson.jaxrs.jackson-jaxrs-json-provider com.fasterxml.jackson.module.jackson-module-jaxb-annotations com.fasterxml.jackson.module.jackson-module-paranamer com.fasterxml.jackson.module.jackson-module-scala_2.11 com.fasterxml.uuid.java-uuid-generator com..nscala-time.nscala-time_2.10 com.github.stephenc.jcip.jcip-annotations com.google.api-client.google-api-client com.google.api-client.google-api-client-jackson2 com.google.api-client.google-api-client-java6 com.google.api.grpc.grpc-google-cloud-bigquerystorage-v1beta1 com.google.api.grpc.grpc-google-cloud-bigtable-admin-v2 com.google.api.grpc.grpc-google-cloud-bigtable-v2 com.google.api.grpc.grpc-google-cloud-pubsub-v1 com.google.api.grpc.grpc-google-cloud-spanner-admin-database-v1 com.google.api.grpc.grpc-google-cloud-spanner-admin-instance-v1 com.google.api.grpc.grpc-google-cloud-spanner-v1 com.google.api.grpc.grpc-google-common-protos com.google.api.grpc.proto-google-cloud-bigquerystorage-v1beta1 com.google.api.grpc.proto-google-cloud-bigtable-admin-v2 com.google.api.grpc.proto-google-cloud-bigtable-v2 com.google.api.grpc.proto-google-cloud-datastore-v1 com.google.api.grpc.proto-google-cloud-monitoring-v3 com.google.api.grpc.proto-google-cloud-pubsub-v1 com.google.api.grpc.proto-google-cloud-spanner-admin-database-v1 com.google.api.grpc.proto-google-cloud-spanner-admin-instance-v1 com.google.api.grpc.proto-google-cloud-spanner-v1 com.google.api.grpc.proto-google-common-protos com.google.api.grpc.proto-google-iam-v1 com.google.apis.google-api-services-bigquery com.google.apis.google-api-services-clouddebugger com.google.apis.google-api-services-cloudresourcemanager com.google.apis.google-api-services-dataflow com.google.apis.google-api-services-iam com.google.apis.google-api-services-oauth2

com.google.apis.google-api-services-pubsub com.google.apis.google-api-services-storage com.google.auto.service.auto-service com.google.auto.value.auto-value-annotations com.google.cloud.bigdataoss.gcsio com.google.cloud.bigdataoss.util com.google.cloud.google-cloud-bigquery com.google.cloud.google-cloud-bigquerystorage com.google.cloud.google-cloud-bigtable com.google.cloud.google-cloud-bigtable-admin com.google.cloud.google-cloud-core com.google.cloud.google-cloud-core-grpc com.google.cloud.google-cloud-core-http com.google.cloud.google-cloud-monitoring com.google.cloud.google-cloud-spanner com.google.code.findbugs.jsr305 com.google.code.gson.gson com.google.errorprone.error_prone_annotations com.google.flogger.flogger com.google.flogger.flogger-system-backend com.google.flogger.google-extensions com.google.guava.failureaccess com.google.guava.guava com.google.guava.guava-jdk5 com.google.guava.listenablefuture com.google.http-client.google-http-client com.google.http-client.google-http-client-apache com.google.http-client.google-http-client-appengine com.google.http-client.google-http-client-jackson com.google.http-client.google-http-client-jackson2 com.google.http-client.google-http-client-protobuf com.google.inject.extensions.guice-servlet com.google.inject.guice com.google.j2objc.j2objc-annotations com.google.oauth-client.google-oauth-client com.google.oauth-client.google-oauth-client-java6 com.googlecode.javaewah.JavaEWAH com.googlecode.libphonenumber.libphonenumber com.hadoop.gplcompression.hadoop-lzo com.jakewharton.threetenabp.threetenabp com.jamesmurty.utils.java-xmlbuilder com.jolbox.bonecp com.mapr.mapr-root com.microsoft.azure.adal4j com.microsoft.azure.azure-core com.microsoft.azure.azure-storage com.microsoft.windowsazure.storage.microsoft-windowsazure-storage-sdk com.nimbusds.lang-tag com.nimbusds.nimbus-jose-jwt com.ning.compress-lzf com.opencsv.opencsv com.squareup.okhttp.okhttp com.squareup.okhttp3.logging-interceptor com.squareup.okhttp3.okhttp com.squareup.okhttp3.okhttp-urlconnection com.squareup.okio.okio com.squareup.retrofit2.adapter-rxjava com.squareup.retrofit2.converter-jackson com.squareup.retrofit2.retrofit com.trifacta.hadoop.cloudera4

com.twitter.chill-java com.twitter.chill_2.11 com.twitter.parquet-hadoop-bundle com.typesafe.akka.akka-actor_2.11 com.typesafe.akka.akka-cluster_2.11 com.typesafe.akka.akka-remote_2.11 com.typesafe.akka.akka-slf4j_2.11 com.typesafe.config com.univocity.univocity-parsers com.zaxxer.HikariCP com.zaxxer.HikariCP-java7 commons-beanutils.commons-beanutils commons-beanutils.commons-beanutils-core commons-cli.commons-cli commons-codec.commons-codec commons-collections.commons-collections commons-configuration.commons-configuration commons-dbcp.commons-dbcp commons-dbutils.commons-dbutils commons-digester.commons-digester commons-el.commons-el commons-fileupload.commons-fileupload commons-io.commons-io commons-lang.commons-lang commons-logging.commons-logging commons-net.commons-net commons-pool.commons-pool dateinfer de.odysseus.juel.juel-api de.odysseus.juel.juel-impl de.odysseus.juel.juel-spi express-opentracing google-benchmark googleapis io.airlift.aircompressor io.dropwizard.metrics.metrics-core io.dropwizard.metrics.metrics-graphite io.dropwizard.metrics.metrics-json io.dropwizard.metrics.metrics-jvm io.grpc.grpc-all io.grpc.grpc-alts io.grpc.grpc-auth io.grpc.grpc-context io.grpc.grpc-core io.grpc.grpc-grpclb io.grpc.grpc-netty io.grpc.grpc-netty-shaded io.grpc.grpc-okhttp io.grpc.grpc-protobuf io.grpc.grpc-protobuf-lite io.grpc.grpc-protobuf-nano io.grpc.grpc-stub io.grpc.grpc-testing io.netty.netty io.netty.netty-all io.netty.netty-buffer io.netty.netty-codec io.netty.netty-codec-http io.netty.netty-codec-http2 io.netty.netty-codec-socks

io.netty.netty-common io.netty.netty-handler io.netty.netty-handler-proxy io.netty.netty-resolver io.netty.netty-tcnative-boringssl-static io.netty.netty-transport io.opentracing.contrib.opentracing-concurrent io.opentracing.contrib.opentracing-globaltracer io.opentracing.contrib.opentracing-web-servlet-filter io.opentracing.opentracing-api io.opentracing.opentracing-noop io.opentracing.opentracing-util io.prometheus.simpleclient io.prometheus.simpleclient_common io.prometheus.simpleclient_servlet io.reactivex.rxjava io.spray.spray-can_2.11 io.spray.spray-http_2.11 io.spray.spray-httpx_2.11 io.spray.spray-io_2.11 io.spray.spray-json_2.11 io.spray.spray-routing_2.11 io.spray.spray-util_2.11 io.springfox.springfox-core io.springfox.springfox-schema io.springfox.springfox-spi io.springfox.springfox-spring-web io.springfox.springfox-swagger-common io.springfox.springfox-swagger-ui io.springfox.springfox-swagger2 io.swagger.swagger-annotations io.swagger.swagger-models io.undertow.undertow-core io.undertow.undertow-servlet io.undertow.undertow-websockets-jsr io.zipkin.java.zipkin io.zipkin.reporter.zipkin-reporter io.zipkin.reporter.zipkin-sender-urlconnection io.zipkin.zipkin2.zipkin it.unimi.dsi.fastutil jaeger-client javax.inject.javax.inject javax.jdo.jdo-api javax.validation.validation-api joda-time.joda-time kerberos less libcuckoo log4j.apache-log4j-extras log4j.log4j long mathjs mx4j.mx4j net.bytebuddy.byte-buddy net.hydromatic.eigenbase-properties net.java.dev.eval.eval net.java.dev.jets3t.jets3t net.jcip.jcip-annotations net.jpountz.lz4.lz4 net.minidev.accessors-smart

net.minidev.json-smart net.sf.opencsv.opencsv net.snowflake.snowflake-jdbc opentracing org.activiti.activiti-bpmn-converter org.activiti.activiti-bpmn-layout org.activiti.activiti-bpmn-model org.activiti.activiti-common-rest org.activiti.activiti-dmn-api org.activiti.activiti-dmn-model org.activiti.activiti-engine org.activiti.activiti-form-api org.activiti.activiti-form-model org.activiti.activiti-image-generator org.activiti.activiti-process-validation org.activiti.activiti-rest org.activiti.activiti-spring org.activiti.activiti-spring-boot-starter-basic org.activiti.activiti-spring-boot-starter-rest-api org.activiti.activiti5-compatibility org.activiti.activiti5-engine org.activiti.activiti5-spring org.activiti.activiti5-spring-compatibility org.apache.arrow.arrow-format org.apache.arrow.arrow-memory org.apache.arrow.arrow-vector org.apache.atlas.atlas-client org.apache.atlas.atlas-typesystem org.apache.avro.avro org.apache.avro.avro-ipc org.apache.avro.avro-mapred org.apache.beam.beam-model-job-management org.apache.beam.beam-model-pipeline org.apache.beam.beam-runners-core-construction-java org.apache.beam.beam-runners-direct-java org.apache.beam.beam-runners-google-cloud-dataflow-java org.apache.beam.beam-sdks-java-core org.apache.beam.beam-sdks-java-extensions-google-cloud-platform-core org.apache.beam.beam-sdks-java-extensions-protobuf org.apache.beam.beam-sdks-java-extensions-sorter org.apache.beam.beam-sdks-java-io-google-cloud-platform org.apache.beam.beam-sdks-java-io-parquet org.apache.beam.beam-vendor-grpc-1_13_1 org.apache.beam.beam-vendor-guava-20_0 org.apache.calcite.calcite-avatica org.apache.calcite.calcite-core org.apache.calcite.calcite-linq4j org.apache.commons.codec org.apache.commons.commons-collections4 org.apache.commons.commons-compress org.apache.commons.commons-configuration2 org.apache.commons.commons-crypto org.apache.commons.commons-csv org.apache.commons.commons-dbcp2 org.apache.commons.commons-email org.apache.commons.commons-exec org.apache.commons.commons-lang3 org.apache.commons.commons-math org.apache.commons.commons-math3 org.apache.commons.commons-pool2

Copyright © 2020 Trifacta Inc. Page #71 org.apache.curator.curator-client org.apache.curator.curator-framework org.apache.curator.curator-recipes org.apache.derby.derby org.apache.directory.api.api-asn1-api org.apache.directory.api.api-util org.apache.directory.server.apacheds-i18n org.apache.directory.server.apacheds-kerberos-codec org.apache.hadoop.avro org.apache.hadoop.hadoop-annotations org.apache.hadoop.hadoop-auth org.apache.hadoop.hadoop-aws org.apache.hadoop.hadoop-azure org.apache.hadoop.hadoop-azure-datalake org.apache.hadoop.hadoop-client org.apache.hadoop.hadoop-common org.apache.hadoop.hadoop-hdfs org.apache.hadoop.hadoop-hdfs-client org.apache.hadoop.hadoop-mapreduce-client-app org.apache.hadoop.hadoop-mapreduce-client-common org.apache.hadoop.hadoop-mapreduce-client-core org.apache.hadoop.hadoop-mapreduce-client-jobclient org.apache.hadoop.hadoop-mapreduce-client-shuffle org.apache.hadoop.hadoop-yarn-api org.apache.hadoop.hadoop-yarn-client org.apache.hadoop.hadoop-yarn-common org.apache.hadoop.hadoop-yarn-registry org.apache.hadoop.hadoop-yarn-server-common org.apache.hadoop.hadoop-yarn-server-nodemanager org.apache.hadoop.hadoop-yarn-server-web-proxy org.apache.htrace.htrace-core org.apache.htrace.htrace-core4 org.apache.httpcomponents.httpmime org.apache.ivy.ivy org.apache.kerby.kerb-admin org.apache.kerby.kerb-client org.apache.kerby.kerb-common org.apache.kerby.kerb-core org.apache.kerby.kerb-crypto org.apache.kerby.kerb-identity org.apache.kerby.kerb-server org.apache.kerby.kerb-simplekdc org.apache.kerby.kerb-util org.apache.kerby.kerby-asn1 org.apache.kerby.kerby-config org.apache.kerby.kerby-pkix org.apache.kerby.kerby-util org.apache.kerby.kerby-xdr org.apache.kerby.token-provider org.apache.logging.log4j.log4j-1.2-api org.apache.logging.log4j.log4j-api org.apache.logging.log4j.log4j-api-scala_2.11 org.apache.logging.log4j.log4j-core org.apache.logging.log4j.log4j-jcl org.apache.logging.log4j.log4j-jul org.apache.logging.log4j.log4j-slf4j-impl org.apache.logging.log4j.log4j-web org.apache.orc.orc-core org.apache.orc.orc-mapreduce org.apache.orc.orc-shims

Copyright © 2020 Trifacta Inc. Page #72 org.apache.parquet.parquet-avro org.apache.parquet.parquet-column org.apache.parquet.parquet-common org.apache.parquet.parquet-encoding org.apache.parquet.parquet-format org.apache.parquet.parquet-hadoop org.apache.parquet.parquet-jackson org.apache.parquet.parquet-tools org.apache.pig.pig org.apache.pig.pig-core-spork org.apache.thrift.libfb303 org.apache.thrift.libthrift org.apache.tomcat.embed.tomcat-embed-core org.apache.tomcat.embed.tomcat-embed-el org.apache.tomcat.embed.tomcat-embed-websocket org.apache.tomcat.tomcat-annotations-api org.apache.tomcat.tomcat-jdbc org.apache.tomcat.tomcat-juli org.apache.xbean.xbean-asm5-shaded org.apache.xbean.xbean-asm6-shaded org.apache.zookeeper.zookeeper org.codehaus.jackson.jackson-core-asl org.codehaus.jackson.jackson-mapper-asl org.codehaus.jettison.jettison org.datanucleus.datanucleus-api-jdo org.datanucleus.datanucleus-core org.datanucleus.datanucleus-rdbms org.eclipse.jetty.jetty-client org.eclipse.jetty.jetty-http org.eclipse.jetty.jetty-io org.eclipse.jetty.jetty-security org.eclipse.jetty.jetty-server org.eclipse.jetty.jetty-servlet org.eclipse.jetty.jetty-util org.eclipse.jetty.jetty-util-ajax org.eclipse.jetty.jetty-webapp org.eclipse.jetty.jetty-xml org.fusesource.jansi.jansi org.hibernate.hibernate-validator org.htrace.htrace-core org.iq80.snappy.snappy org.jboss.jandex org.joda.joda-convert org.json4s.json4s-ast_2.11 org.json4s.json4s-core_2.11 org.json4s.json4s-jackson_2.11 org.json4s.json4s-scalap_2.11 org.liquibase.liquibase-core org.lz4.lz4-java org.mapstruct.mapstruct org.mortbay.jetty.jetty org.mortbay.jetty.jetty-util org.mybatis.mybatis org.objenesis.objenesis org.osgi.org.osgi.core org.parboiled.parboiled-core org.parboiled.parboiled-scala_2.11 org.quartz-scheduler.quartz org.roaringbitmap.RoaringBitmap org.sonatype.oss.oss-parent

Copyright © 2020 Trifacta Inc. Page #73 org.sonatype.sisu.inject.cglib org.spark-project.hive.hive-exec org.spark-project.hive.hive-metastore org.springframework.boot.spring-boot org.springframework.boot.spring-boot-autoconfigure org.springframework.boot.spring-boot-starter org.springframework.boot.spring-boot-starter-aop org.springframework.boot.spring-boot-starter-data-jpa org.springframework.boot.spring-boot-starter-jdbc org.springframework.boot.spring-boot-starter-log4j2 org.springframework.boot.spring-boot-starter-logging org.springframework.boot.spring-boot-starter-tomcat org.springframework.boot.spring-boot-starter-undertow org.springframework.boot.spring-boot-starter-web org.springframework.data.spring-data-commons org.springframework.data.spring-data-jpa org.springframework.plugin.spring-plugin-core org.springframework.plugin.spring-plugin-metadata org.springframework.retry.spring-retry org.springframework.security.spring-security-config org.springframework.security.spring-security-core org.springframework.security.spring-security-crypto org.springframework.security.spring-security-web org.springframework.spring-aop org.springframework.spring-aspects org.springframework.spring-beans org.springframework.spring-context org.springframework.spring-context-support org.springframework.spring-core org.springframework.spring-expression org.springframework.spring-jdbc org.springframework.spring-orm org.springframework.spring-tx org.springframework.spring-web org.springframework.spring-webmvc org.springframework.spring-websocket org.uncommons.maths.uncommons-maths org.wildfly.openssl.wildfly-openssl org.xerial.snappy.snappy-java org.yaml.snakeyaml oro.oro parquet pbr pig-0.11.1-withouthadoop-23 pig-0.12.1-mapr-noversion-withouthadoop pig-0.14.0-core-spork piggybank-amzn-0.3 piggybank-cdh5.0.0-beta-2-0.12.0 python-iptables request requests stax.stax-api thrift tomcat.jasper-compiler tomcat.jasper-runtime xerces.xercesImpl xml-apis.xml-apis zipkin zipkin-transport-http

This product also includes the following libraries which are covered by The Apache License 2.0 + Eclipse Public License 1.0:

spark-assembly-thinner

This product also includes the following libraries which are covered by The Apache License Version 2:

org.mortbay.jetty.jetty-sslengine

This product also includes the following libraries which are covered by The Apache License v2.0:

net.java.dev.jna.jna net.java.dev.jna.jna-platform

This product also includes the following libraries which are covered by The Apache License, version 2.0:

com.nimbusds.oauth2-oidc-sdk org.jboss.logging.jboss-logging

This product also includes the following libraries which are covered by The Apache Software Licenses:

org.slf4j.log4j-over-slf4j

This product also includes the following libraries which are covered by The BSD license:

alabaster antlr.antlr asm.asm-parent babel click com.google.api.api-common com.google.api.gax com.google.api.gax-grpc com.google.api.gax-httpjson com.jcraft.jsch com.thoughtworks.paranamer.paranamer dk.brics.automaton.automaton dom4j.dom4j enum34 flask itsdangerous javolution.javolution jinja2 jline.jline microee mock networkx numpy org.antlr.ST4 org.antlr.antlr-runtime org.antlr.antlr4-runtime org.antlr.stringtemplate org.codehaus.woodstox.stax2-api org.ow2.asm.asm org.scala-lang.jline pandas pluginbase psutil pygments
python-enum34 python-json-logger scipy snowballstemmer sphinx sphinxcontrib-websupport strptime websocket-stream xlrd xmlenc.xmlenc xss-filters

This product also includes the following libraries which are covered by The BSD 2-Clause License:

com.github.luben.zstd-jni

This product also includes the following libraries which are covered by The BSD 3-Clause:

org.scala-lang.scala-compiler org.scala-lang.scala-library org.scala-lang.scala-reflect org.scala-lang.scalap

This product also includes the following libraries which are covered by The BSD 3-Clause "New" or "Revised" License (BSD-3-Clause):

org.abego.treelayout.org.abego.treelayout.core

This product also includes the following libraries which are covered by The BSD 3-Clause License:

org.antlr.antlr4

This product also includes the following libraries which are covered by The BSD 3-clause:

org.scala-lang.modules.scala-parser-combinators_2.11 org.scala-lang.modules.scala-xml_2.11 org.scala-lang.plugins.scala-continuations-library_2.11 org.threeten.threetenbp

This product also includes the following libraries which are covered by The BSD New license:

com.google.auth.google-auth-library-credentials com.google.auth.google-auth-library-oauth2-http

This product also includes the following libraries which are covered by The BSD-2-Clause:

cls-bluebird node-polyglot org.postgresql.postgresql terser uglify-js

This product also includes the following libraries which are covered by The BSD-3-Clause:

@sentry/node d3-dsv datalib datalib-sketch markupsafe md5
node-forge protobufjs qs queue-async sqlite3 werkzeug

This product also includes the following libraries which are covered by The BSD-Style:

com.jsuereth.scala-arm_2.11

This product also includes the following libraries which are covered by The BSD-derived (http://www.repoze.org/LICENSE.txt):

meld3 supervisor

This product also includes the following libraries which are covered by The BSD-like:

dnspython idna org.scala-lang.scala-actors org.scalamacros.quasiquotes_2.10

This product also includes the following libraries which are covered by The BSD-style license:

bzip2

This product also includes the following libraries which are covered by The Boost Software License:

asio boost cpp-netlib-uri expected jsbind poco

This product also includes the following libraries which are covered by The Bouncy Castle Licence:

org.bouncycastle.bcprov-jdk15on

This product also includes the following libraries which are covered by The CDDL:

javax.mail.mail javax.mail.mailapi javax.servlet.jsp-api javax.servlet.jsp.jsp-api javax.servlet.servlet-api javax.transaction.jta javax.xml.stream.stax-api org.glassfish.external.management-api org.glassfish.gmbal.gmbal-api-only org.jboss.spec.javax.annotation.jboss-annotations-api_1.2_spec

This product also includes the following libraries which are covered by The CDDL + GPLv2 with classpath exception:

javax.annotation.javax.annotation-api javax.jms.jms javax.servlet.javax.servlet-api
javax.transaction.javax.transaction-api javax.transaction.transaction-api org.glassfish.grizzly.grizzly-framework org.glassfish.grizzly.grizzly-http org.glassfish.grizzly.grizzly-http-server org.glassfish.grizzly.grizzly-http-servlet org.glassfish.grizzly.grizzly-rcm org.glassfish.hk2.external.aopalliance-repackaged org.glassfish.hk2.external.javax.inject org.glassfish.hk2.hk2-api org.glassfish.hk2.hk2-locator org.glassfish.hk2.hk2-utils org.glassfish.hk2.osgi-resource-locator org.glassfish.javax.el org.glassfish.jersey.core.jersey-common

This product also includes the following libraries which are covered by The CDDL 1.1:

com.sun.jersey.contribs.jersey-guice com.sun.jersey.jersey-client com.sun.jersey.jersey-core com.sun.jersey.jersey-json com.sun.jersey.jersey-server com.sun.jersey.jersey-servlet com.sun.xml.bind.jaxb-impl javax.ws.rs.javax.ws.rs-api javax.xml.bind.jaxb-api org.jvnet.mimepull.mimepull

This product also includes the following libraries which are covered by The CDDL License:

javax.ws.rs.jsr311-api

This product also includes the following libraries which are covered by The CDDL/GPLv2+CE:

com.sun.mail.javax.mail

This product also includes the following libraries which are covered by The CERN:

colt.colt

This product also includes the following libraries which are covered by The COMMON DEVELOPMENT AND DISTRIBUTION LICENSE (CDDL) Version 1.0:

javax.activation.activation javax.annotation.jsr250-api

This product also includes the following libraries which are covered by Douglas Crockford's license, which allows this module to be used for Good but not for Evil:

jsmin

This product also includes the following libraries which are covered by The Dual License:

python-dateutil

This product also includes the following libraries which are covered by The Eclipse Distribution License (EDL), Version 1.0:

org.hibernate.javax.persistence.hibernate-jpa-2.1-api

This product also includes the following libraries which are covered by The Eclipse Public License:

com.github.oshi.oshi-core

This product also includes the following libraries which are covered by The Eclipse Public License - v 1.0:

org.aspectj.aspectjweaver

This product also includes the following libraries which are covered by The Eclipse Public License, Version 1.0:

com.mchange.mchange-commons-java

This product also includes the following libraries which are covered by The GNU General Public License v2.0 only, with Classpath exception:

org.jboss.spec.javax.servlet.jboss-servlet-api_3.1_spec org.jboss.spec.javax.websocket.jboss-websocket-api_1.1_spec

This product also includes the following libraries which are covered by The GNU LGPL:

nose

This product also includes the following libraries which are covered by The GNU Lesser General Public License:

org.hibernate.common.hibernate-commons-annotations org.hibernate.hibernate-core org.hibernate.hibernate-entitymanager

This product also includes the following libraries which are covered by The GNU Lesser General Public License Version 2.1, February 1999:

org.jgrapht.jgrapht-core

This product also includes the following libraries which are covered by The GNU Lesser General Public License, Version 2.1:

com.fasterxml.jackson.core.jackson-annotations com.fasterxml.jackson.core.jackson-core com.fasterxml.jackson.core.jackson-databind com.mchange.c3p0

This product also includes the following libraries which are covered by The GNU Lesser Public License:

com.google.code.findbugs.annotations

This product also includes the following libraries which are covered by The GPLv3:

yamllint

This product also includes the following libraries which are covered by The Google Cloud Software License:

com.google.cloud.google-cloud-storage

This product also includes the following libraries which are covered by The ICU License:

icu

This product also includes the following libraries which are covered by The ISC:

browserify-sign iconify inherits lru-cache request-promise request-promise-native requests-kerberos rimraf sax semver split-ca

This product also includes the following libraries which are covered by The JGraph Ltd - 3 clause BSD license:

org.tinyjee.jgraphx.jgraphx

This product also includes the following libraries which are covered by The LGPL:

chardet com.sun.jna.jna

This product also includes the following libraries which are covered by The LGPL 2.1:

org.codehaus.jackson.jackson-jaxrs org.codehaus.jackson.jackson-xc org.javassist.javassist xmlhttprequest

This product also includes the following libraries which are covered by The LGPLv3 or later:

com.github.fge.json-schema-core com.github.fge.json-schema-validator

This product also includes the following libraries which are covered by The MIT license:

amplitude-js analytics-node args4j.args4j async avsc backbone backbone-forms basic-auth bcrypt bluebird body-parser browser-filesaver buffer-crc32 bufferedstream busboy byline bytes cachetools chai chalk cli-table clipboard codemirror colors com.github.tommyettinger.blazingchain
com.microsoft.azure.azure-data-lake-store-sdk com.microsoft.sqlserver.mssql-jdbc commander common-tags compression console.table cookie cookie-parser cookie-session cronstrue crypto-browserify csrf csurf definitely eventEmitter express express-http-context express-params express-zip forever form-data fs-extra function-rate-limit future fuzzy generic-pool google-auto-auth iconv-lite imagesize int24 is-my-json-valid jade jq jquery jquery-serializeobject jquery.ba-serializeobject jquery.event.drag jquery.form jquery.ui json-stable-stringify jsonfile jsonschema jsonwebtoken jszip keygrip keysim knex less-middleware lodash lunr matic memcached method-override mini-css-extract-plugin minilog mkdirp moment moment-jdateformatparser moment-timezone mongoose
morgan morgan-json mysql net.razorvine.pyrolite netifaces nock nodemailer org.checkerframework.checker-compat-qual org.checkerframework.checker-qual org.mockito.mockito-core org.slf4j.jcl-over-slf4j org.slf4j.jul-to-slf4j org.slf4j.slf4j-api org.slf4j.slf4j-log4j12 pace passport passport-azure-ad passport-http passport-http-bearer passport-ldapauth passport-local passport-saml passport-saml-metadata passport-strategy password-validator pegjs pg pg-hstore pip png-img promise-retry prop-types punycode py-cpuinfo python-crfsuite python-six pytz pyyaml query-string querystring randexp rapidjson react react-day-picker react-dom react-hot-loader react-modal react-router-dom react-select react-switch react-table react-virtualized recursive-readdir redefine requestretry require-jade require-json retry retry-as-promised rotating-file-stream
safe-json-stringify sequelize setuptools simple-ldap-search simplejson singledispatch six slick.core slick.grid slick.headerbuttons slick.headerbuttons.css slick.rowselectionmodel snappy sphinx-rtd-theme split-pane sql stream-meter supertest tar-fs temp to-case tv4 ua-parser-js umzug underscore.string unicode-length universal-analytics urijs uritools url urllib3 user-agent-parser uuid validator wheel winston winston-daily-rotate-file ws yargs zxcvbn

This product also includes the following libraries which are covered by The MIT License; BSD 3-clause License:

antlr4-cpp-runtime

This product also includes the following libraries which are covered by The MIT license:

org.codehaus.mojo.animal-sniffer-annotations

This product also includes the following libraries which are covered by The MIT/X:

vnc2flv

This product also includes the following libraries which are covered by The MIT/X11 license:

optimist

This product also includes the following libraries which are covered by The MPL 2.0:

pathspec

This product also includes the following libraries which are covered by The MPL 2.0 or EPL 1.0:

com.h2database.h2

This product also includes the following libraries which are covered by The MPL-2.0:

certifi

This product also includes the following libraries which are covered by The Microsoft Software License:

com.microsoft.sqlserver.sqljdbc4

This product also includes the following libraries which are covered by The Mozilla Public License, Version 2.0:

org.mozilla.rhino

This product also includes the following libraries which are covered by The New BSD License:

asm.asm backbone-queryparams bloomfilter com.esotericsoftware.kryo-shaded com.esotericsoftware.minlog com.googlecode.protobuf-java-format.protobuf-java-format d3 domReady double-conversion gflags glog gperftools gtest org.antlr.antlr-master org.codehaus.janino.commons-compiler org.codehaus.janino.janino org.hamcrest.hamcrest-core pcre protobuf re2 require-text sha1 termcolor topojson triflow vega websocketpp

This product also includes the following libraries which are covered by The New BSD license:

com.google.protobuf.nano.protobuf-javanano

This product also includes the following libraries which are covered by The Oracle Technology Network License:

com.oracle.ojdbc6

This product also includes the following libraries which are covered by The PSF:

futures heap typing

This product also includes the following libraries which are covered by The PSF license:

functools32 python

This product also includes the following libraries which are covered by The PSF or ZPL:

wsgiref

This product also includes the following libraries which are covered by The Proprietary:

com.trifacta.connect.trifacta-TYcassandra com.trifacta.connect.trifacta-TYdb2 com.trifacta.connect.trifacta-TYgreenplum com.trifacta.connect.trifacta-TYhive com.trifacta.connect.trifacta-TYinformix com.trifacta.connect.trifacta-TYmongodb com.trifacta.connect.trifacta-TYmysql com.trifacta.connect.trifacta-TYopenedgewp com.trifacta.connect.trifacta-TYoracle com.trifacta.connect.trifacta-TYoraclesalescloud com.trifacta.connect.trifacta-TYpostgresql com.trifacta.connect.trifacta-TYredshift com.trifacta.connect.trifacta-TYrightnow com.trifacta.connect.trifacta-TYsforce com.trifacta.connect.trifacta-TYsparksql com.trifacta.connect.trifacta-TYsqlserver com.trifacta.connect.trifacta-TYsybase jsdata

This product also includes the following libraries which are covered by The Public Domain:

aopalliance.aopalliance org.jboss.xnio.xnio-api org.jboss.xnio.xnio-nio org.tukaani.xz protobuf-to-dict pycrypto simple-xml-writer

This product also includes the following libraries which are covered by The Public domain:

net.iharder.base64

This product also includes the following libraries which are covered by The Python Software Foundation License:

argparse backports-abc ipaddress python-setuptools regex

This product also includes the following libraries which are covered by The Tableau Binary Code License Agreement:

com.tableausoftware.common.tableaucommon com.tableausoftware.extract.tableauextract

This product also includes the following libraries which are covered by The Apache License, Version 2.0:

com.fasterxml.woodstox.woodstox-core com.google.cloud.bigtable.bigtable-client-core com.google.cloud.datastore.datastore-v1-proto-client io.opencensus.opencensus-api io.opencensus.opencensus-contrib-grpc-metrics io.opencensus.opencensus-contrib-grpc-util io.opencensus.opencensus-contrib-http-util org.spark-project.spark.unused software.amazon.ion.ion-java

This product also includes the following libraries which are covered by The BSD 3-Clause License:

org.fusesource.leveldbjni.leveldbjni-all

This product also includes the following libraries which are covered by The GNU General Public License, Version 2:

mysql.mysql-connector-java

This product also includes the following libraries which are covered by The Go license:

com.google.re2j.re2j

This product also includes the following libraries which are covered by The MIT License (MIT):

com.microsoft.azure.azure com.microsoft.azure.azure-annotations com.microsoft.azure.azure-client-authentication com.microsoft.azure.azure-client-runtime com.microsoft.azure.azure-keyvault com.microsoft.azure.azure-keyvault-core com.microsoft.azure.azure-keyvault-webkey com.microsoft.azure.azure-mgmt-appservice com.microsoft.azure.azure-mgmt-batch com.microsoft.azure.azure-mgmt-cdn com.microsoft.azure.azure-mgmt-compute com.microsoft.azure.azure-mgmt-containerinstance com.microsoft.azure.azure-mgmt-containerregistry com.microsoft.azure.azure-mgmt-containerservice com.microsoft.azure.azure-mgmt-cosmosdb com.microsoft.azure.azure-mgmt-dns com.microsoft.azure.azure-mgmt-graph-rbac com.microsoft.azure.azure-mgmt-keyvault com.microsoft.azure.azure-mgmt-locks com.microsoft.azure.azure-mgmt-network com.microsoft.azure.azure-mgmt-redis com.microsoft.azure.azure-mgmt-resources com.microsoft.azure.azure-mgmt-search com.microsoft.azure.azure-mgmt-servicebus com.microsoft.azure.azure-mgmt-sql com.microsoft.azure.azure-mgmt-storage com.microsoft.azure.azure-mgmt-trafficmanager com.microsoft.rest.client-runtime

This product also includes the following libraries which are covered by The MIT License (http://www.opensource.org/licenses/mit-license.php):

probableparsing usaddress

This product also includes the following libraries which are covered by The New BSD License:

net.sf.py4j.py4j org.jodd.jodd-core

This product also includes the following libraries which are covered by The Unicode/ICU License:

com.ibm.icu.icu4j

This product also includes the following libraries which are covered by The Unlicense:

moment-range stream-buffers

This product also includes the following libraries which are covered by The WTFPL:

org.reflections.reflections

This product also includes the following libraries which are covered by the license at http://www.apache.org/licenses/LICENSE-2.0:

tornado

This product also includes the following libraries which are covered by The new BSD:

scikit-learn

This product also includes the following libraries which are covered by The new BSD License:

decorator

This product also includes the following libraries which are provided without support or warranty:

org.json.json

This product also includes the following libraries which are covered by The public domain, Python, 2-Clause BSD, GPL 3 (see COPYING.txt):

docutils

This product also includes the following libraries which are covered by The zlib License:

zlib

Copyright © 2020 - Trifacta, Inc. All rights reserved.