Quick Install for AWS EMR

Version: 6.8 Doc Build Date: 01/21/2020 Copyright © Trifacta Inc. 2020 - All Rights Reserved. CONFIDENTIAL

These materials (the “Documentation”) are the confidential and proprietary information of Trifacta Inc. and may not be reproduced, modified, or distributed without the prior written permission of Trifacta Inc.

EXCEPT AS OTHERWISE PROVIDED IN AN EXPRESS WRITTEN AGREEMENT, TRIFACTA INC. PROVIDES THIS DOCUMENTATION AS-IS AND WITHOUT WARRANTY AND TRIFACTA INC. DISCLAIMS ALL EXPRESS AND IMPLIED WARRANTIES TO THE EXTENT PERMITTED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE AND UNDER NO CIRCUMSTANCES WILL TRIFACTA INC. BE LIABLE FOR ANY AMOUNT GREATER THAN ONE HUNDRED DOLLARS ($100) BASED ON ANY USE OF THE DOCUMENTATION.

For third-party license information, please select About Trifacta from the Help menu.

1. Release Notes
  1.1 Changes to System Behavior
    1.1.1 Changes to the Language
    1.1.2 Changes to the APIs
    1.1.3 Changes to Configuration
    1.1.4 Changes to the Object Model
  1.2 Release Notes 6.8
  1.3 Release Notes 6.4
  1.4 Release Notes 6.0
  1.5 Release Notes 5.1
2. Quick Start
  2.1 Install from AWS Marketplace with EMR
  2.2 Upgrade for AWS Marketplace with EMR
3. Configure
  3.1 Configure for AWS
    3.1.1 Configure for EC2 Role-Based Authentication
    3.1.2 Enable S3 Access
      3.1.2.1 Create Redshift Connections
    3.1.3 Configure for EMR
4. Contact Support
5. Legal
  5.1 Third-Party License Information

Release Notes

This section contains release notes for published versions of Trifacta® Wrangler Enterprise.

Changes to System Behavior

The following pages contain information about changes to system features, capabilities, and behaviors in this release.

Changes to the Language

Contents:

Release 6.8
  New Functions
Release 6.6
  New Functions
Release 6.5
  New Functions
Release 6.4
  Improvements to metadata references
Release 6.3
  New Functions
  Optional input formats for DateFormat task
Release 6.2
  New Functions
  ARRAYELEMENTAT function accepts new inputs
Release 6.1
Release 6.0
  New Functions
  Changes to LIST* inputs
  Renamed functions
  FILL Function has new before and after parameters
Release 5.9
  New functions
Release 5.8
  File lineage information using source metadata references
  New math and statistical functions for arrays
Release 5.7
  WEEKNUM function now behaves consistently across running environments
Release 5.6
  URLPARAMS function returns null values
Release 5.1
  Wrangle now supports nested expressions
  SOURCEROWNUMBER function generates null values consistently
  New Functions
Release 5.0.1
  RAND function generates true random numbers
Release 5.0
  Required type parameter
  Deprecated aggregate transform
  New search terms
  Support for <> operator
  ROUND function takes optional number of digits
  New Functions
Release 4.2.1
Release 4.2
  New Filter transform
  New Case transform
  Rename transform now supports multi-column rename
  Delete specified columns or delete the others
  New string comparison functions
  NOW function returns 24-hour time values
  New Transforms
  New Functions

The following changes have been applied to Wrangle in this release of Trifacta® Wrangler Enterprise.

Release 6.8

New Functions

This release introduces the sampling method of calculating statistical functions. The following are now available:

Function Name Description

STDEVSAMP Function Computes the standard deviation across column values of Integer or Decimal type using the sample statistical method.

VARSAMP Function Computes the variance among all values in a column using the sample statistical method. Input column can be of Integer or Decimal. If no numeric values are detected in the input column, the function returns 0.

STDEVSAMPIF Function Generates the standard deviation of values by group in a column that meet a specific condition using the sample statistical method.

VARSAMPIF Function Generates the variance of values by group in a column that meet a specific condition using the sample statistical method.

ROLLINGSTDEVSAMP Function Computes the rolling standard deviation of values forward or backward of the current row within the specified column using the sample statistical method.

ROLLINGVARSAMP Function Computes the rolling variance of values forward or backward of the current row within the specified column using the sample statistical method.
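For example, a minimal sketch of using one of these functions in a pivot step, following the pivot syntax shown in the Release 5.0 examples later in this section (the region and sales columns are hypothetical):

pivot group: region value: STDEVSAMP(sales) limit: 1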

Release 6.6

New Functions

Function Name Description

SINH Function Computes the hyperbolic sine of an input value for a hyperbolic angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

COSH Function Computes the hyperbolic cosine of an input value for a hyperbolic angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

TANH Function Computes the hyperbolic tangent of an input value for a hyperbolic angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

ASINH Function Computes the arcsine of an input value for a hyperbolic angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

ACOSH Function Computes the arccosine of an input value for a hyperbolic angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

ATANH Function Computes the arctangent of an input value for a hyperbolic angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

Release 6.5

New Functions

Function Name Description

SIN Function Computes the sine of an input value for an angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

COS Function Computes the cosine of an input value for an angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

TAN Function Computes the tangent of an input value for an angle measured in radians. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

Cotangent Function See TAN Function.

Secant Function See COS Function.

Cosecant Function See SIN Function.

ASIN Function For input values between -1 and 1 inclusive, this function returns the angle in radians whose sine value is the input. This function is the inverse of the sine function. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

ACOS Function For input values between -1 and 1 inclusive, this function returns the angle in radians whose cosine value is the input. This function is the inverse of the cosine function. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

ATAN Function For input values between -1 and 1 inclusive, this function returns the angle in radians whose tangent value is the input. This function is the inverse of the tangent function. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

Arccotangent Function See ATAN Function.

Arcsecant Function See ACOS Function.

Arccosecant Function See ASIN Function.

Release 6.4

Improvements to metadata references

Broader support for metadata references: For Excel files, $filepath references now return the location of the source Excel file. Sheet names are appended to the end of the reference. See Source Metadata References.

Release 6.3

New Functions

Function Name Description

PARSEDATE Function Evaluates an input against an array of Datetime format strings in their listed order. If the input matches one of the listed formats, the function outputs a Datetime value.
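A minimal sketch of using this function in a derive step (the column name and format strings are hypothetical; see PARSEDATE Function for the exact syntax):

derive type: single value: PARSEDATE(orderDate, ['yyyy-MM-dd', 'MM/dd/yyyy']) as: 'order_dt'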

Optional input formats for DateFormat task

The DateFormat task now supports a new parameter: Input Formats. This parameter specifies the date format to use when attempting to parse the input column.

If the parameter is specified, then its value is used to parse the inputs. If the parameter is not specified (the default), then the following common formats are used to parse the input:

'M/d/yy', 'MM/dd/yy', 'MM-dd-yy', 'M-d-yy', 'MMM d, yyyy', 'MMMM d, yyyy', 'EEEE, MMMM d, yyyy', 'MMM d yyyy', 'MMMM d yyyy', 'MM-dd-yyyy', 'M-d-yyyy', 'yyyy-MM-ddXXX', 'dd/MM/yyyy', 'd/M/yyyy', 'MM/dd/yyyy', 'M/d/yyyy', 'yyyy/M/d', 'M/d/yy h:mm a', 'MM/dd/yy h:mm a', 'MM-dd-yy h:mm a', 'MMM dd yyyy HH.MM.SS xxx', 'M-d-yy h:mm a', 'MMM d, yyyy h:mm:ss a', 'EEEE, MMMM d, yyyy h:mm:ss a X', 'EEE MMM dd HH:mm:ss X yyyy', 'EEE, d MMM yyyy HH:mm:ss X', 'd MMM yyyy HH:mm:ss X', 'MM-dd-yyyy h:mm:ss a', 'M-d-yyyy h:mm:ss a', 'yyyy-MM-dd h:mm:ss a', 'yyyy-M-d h:mm:ss a', 'yyyy-MM-dd HH:mm:ss.S', 'dd/MM/yyyy h:mm:ss a', 'd/M/yyyy h:mm:ss a', 'MM/dd/yyyy h:mm:ss a', 'M/d/yyyy h:mm:ss a', 'MM/dd/yy h:mm:ss a', 'MM/dd/yy H:mm:ss', 'M/d/yy H:mm:ss', 'dd/MM/yyyy h:mm a', 'd/M/yyyy h:mm a', 'MM/dd/yyyy h:mm a', 'M/d/yyyy h:mm a', 'MM-dd-yy h:mm:ss a', 'M-d-yy h:mm:ss a', 'MM-dd-yyyy h:mm a', 'M-d-yyyy h:mm a', 'yyyy-MM-dd h:mm a', 'yyyy-M-d h:mm a', 'MMM.dd.yyyy', 'd/MMM/yyyy H:mm:ss X', 'dd/MMM/yy h:mm a',

These formats are a subset of the date formatting strings supported by the product. For more information, see Datetime Data Type.

Release 6.2

New Functions

Function Name Description

RANK Function Computes the rank of an ordered set of values within groups. Tie values are assigned the same rank, and the next ranking is incremented by the number of tie values.

DENSERANK Function Computes the rank of an ordered set of values within groups. Tie values are assigned the same rank, and the next ranking is incremented by 1.
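As a rough sketch, these functions are typically applied in a window step with grouping and ordering parameters (the columns below are hypothetical; see the RANK Function and DENSERANK Function pages for the exact syntax):

window value: RANK() group: region order: sales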

ARRAYELEMENTAT function accepts new inputs

In previous releases, the ARRAYELEMENTAT function accepted a second input parameter to specify the index value of the element to retrieve. This "at" parameter had to be an Integer literal.

Beginning in this release, the function also accepts for this second "at" parameter:

Names of columns containing Integer values Functions that return Integer values

For more information, see ARRAYELEMENTAT Function.
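For example, a minimal sketch that supplies the index from a column (both column names are hypothetical):

derive type: single value: ARRAYELEMENTAT(myArray, idxCol) as: 'selected_element'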

Release 6.1

None.

Release 6.0

New Functions

Function Name Description

ARRAYINDEXOF Function Computes the index at which a specified element is first found within an array. Indexing is left to right.

ARRAYRIGHTINDEXOF Function Computes the index at which a specified element is first found within an array, when searching right to left. Returned value is based on left-to-right indexing.

ARRAYSLICE Function Returns an array containing a slice of the input array, as determined by starting and ending index parameters.

ARRAYMERGEELEMENTS Function Merges the elements of an array in left to right order into a string. Values are optionally delimited by a provided delimiter.

Changes to LIST* inputs

The following LIST-based functions have been changed to narrow the accepted input data types. In previous releases, any data type was accepted for input, which was not valid for most data types.

In Release 6.0 and later, these functions accept only Array inputs. Inputs can be Array literals, a column of Arrays, or a function returning Arrays.

NOTE: You should review references to these functions in your recipes.

LIST* Functions

LISTAVERAGE Function

LISTMIN Function

LISTMAX Function

LISTMODE Function

LISTSTDEV Function

LISTSUM Function

LISTVAR Function
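For example, a minimal sketch applying one of these functions to a column of Arrays (the scores column name is hypothetical):

derive type: single value: LISTSUM(scores) as: 'score_total'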

Renamed functions

The following functions have been renamed in Release 6.0.

Release 5.9 and earlier Release 6.0 and later

LISTUNIQUE Function UNIQUE Function

FILL Function has new before and after parameters

Prior to Release 6.0, the FILL function replaced empty cells with the most recent non-empty value.

In Release 6.0, before and after function parameters have been added. These parameters define the window of rows before and after the row being tested to search for non-empty values. Within this window, the most recent non-empty value is used.

The default values for these parameters are -1 and 0, respectively, which together perform a search of an unlimited number of preceding rows for a non-empty value.

NOTE: Upon upgrade, the FILL function retains its preceding behavior, as the default values for the new parameters perform the same unlimited row search for non-empty values.

For more information, see FILL Function.
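As a hedged illustration of the new parameters, the following sketch limits the fill to the three preceding rows. It assumes a window transform over a hypothetical sales column ordered by a date column; check the FILL Function page for the exact syntax in your release:

window value: FILL(sales, 3, 0) order: date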

Release 5.9

New functions

The following functions are new in this release.

Function Description

ARRAYSORT Function Sorts array values in the specified column, array literal, or function that returns an array in ascending or descending order.

TRANSLITERATE Function Transliterates Asian script characters from one script form to another. The string can be specified as a column reference or a string literal.

Release 5.8

File lineage information using source metadata references

Beginning in Release 5.8, you can insert the following references into the formulas of your transformations. These source metadata references enable you to continue to track file lineage information from within your datasets as part of your wrangling project.

NOTE: These references apply only to file-based sources. Some additional limitations may apply.

Reference Description

$filepath Returns the full path and filename of the source of the dataset.

$sourcerownumber Returns the row number for the current row from the original source of the dataset.

NOTE: This reference is equivalent to the SOURCEROWNUMBER function, which is likely to be deprecated in a future release. You should begin using this reference in your recipes.

For more information, see Source Metadata References.
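For example, minimal sketches that capture these references into new columns (the output column names are arbitrary):

derive type: single value: $filepath as: 'source_path'

derive type: single value: $sourcerownumber as: 'source_row'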

New math and statistical functions for arrays

The following functions can now be applied directly to arrays to derive meaningful statistics about them.

Function Description

LISTSUM Function Computes the sum of all numeric values found in input array. Input can be an array literal, a column of arrays, or a function returning an array. Input values must be of Integer or Decimal type.

LISTMAX Function Computes the maximum of all numeric values found in input array. Input can be an array literal, a column of arrays, or a function returning an array. Input values must be of Integer or Decimal type.

LISTMIN Function Computes the minimum of all numeric values found in input array. Input can be an array literal, a column of arrays, or a function returning an array. Input values must be of Integer or Decimal type.

LISTAVERAGE Function Computes the average of all numeric values found in input array. Input can be an array literal, a column of arrays, or a function returning an array. Input values must be of Integer or Decimal type.

LISTVAR Function Computes the variance of all numeric values found in input array. Input can be an array literal, a column of arrays, or a function returning an array. Input values must be of Integer or Decimal type.

LISTSTDEV Function Computes the standard deviation of all numeric values found in input array. Input can be an array literal, a column of arrays, or a function returning an array. Input values must be of Integer or Decimal type.

LISTMODE Function Computes the most common value of all numeric values found in input array. Input can be an array literal, a column of arrays, or a function returning an array. Input values must be of Integer or Decimal type.

Release 5.7

WEEKNUM function now behaves consistently across running environments

In Release 5.6 and earlier, the WEEKNUM function treated the first week of the year differently between the Trifacta Photon and Spark running environments:

Trifacta Photon week 1 of the year: The week that contains January 1.
Spark week 1 of the year: The week that contains at least four days in the specified year.

This issue was caused by Spark following the ISO-8601 standard and relying on the Joda-Time DateTimeFormatter.

Beginning in Release 5.7, the WEEKNUM function behaves consistently for both Trifacta Photon and Spark:

Week 1 of the year: The week that contains January 1.

For more information, see WEEKNUM Function.

Release 5.6

URLPARAMS function returns null values

In Release 5.1 and earlier, the URLPARAMS function returned empty Objects when no answer was computed for the function.

In Release 5.6 and later, this function returns null values in the above case.

See URLPARAMS Function.

Release 5.1

Wrangle now supports nested expressions

Beginning in Release 5.1, all Wrangle functions support nested expressions, which can be arithmetic calculations, column references, or other function calls.

NOTE: This feature is enabled by default, as this change does not break any steps created in previous versions of the product. It can be disabled if needed. See Miscellaneous Configuration.

NOTE: This capability represents a powerful enhancement to the language, as you can now use dynamic inputs for all functions.

The following expression is a valid transform in Wrangle. It locates the substring in myString that begins with the @ sign until the end of the string, inclusive:

derive value: substring(myString, find(myString, '@', true, 0), length(myString))

Nested arithmetic expressions:

Suppose you wanted just the value after the @ sign until the end of the string. Prior to Release 5.1, the following generated a validation error:

derive value: substring(myString, find(myString, '@', true, 0) + 1, length(myString))

In the above, the addition of +1 to the second parameter is a nested expression and was not supported. Instead, you had to use multiple steps to generate the string value.

Beginning in Release 5.1, the above single-step transform is supported.

Nested column references:

In addition to arithmetic expressions, you can nest column references. In the following example, the previous step has been modified to replace the static +1 with a reference to a column containing the appropriate value (at_sign_offset):

derive value: substring(myString, find(myString, '@', true, 0) + at_sign_offset, length(myString))

Nested function references:

Now, you can combine multiple function references into a single computation. The following computes the total volume of a cube of length cube_side and then multiplies that volume by the number of cubes (cube_count) to compute the total cube_volume:

derive type: single value: MULTIPLY(POW(cube_side,3),cube_count) as: 'cube_volume'

For more information, see Wrangle Language.

SOURCEROWNUMBER function generates null values consistently

The SOURCEROWNUMBER function returns the row number of the row as it appears in the original dataset. After some operations, such as unions, joins, and aggregations, this row information is no longer available.

In Release 5.0.1 and earlier, the results could be confusing: when source row information was not available, the function was simply not available for use.

In Release 5.1 and later, the behavior of the SOURCEROWNUMBER function is more consistent:

If the source row information is available, it is returned.
If it is not available:
  The function can still be used.
  The function returns null values in all cases.

For more information, see SOURCEROWNUMBER Function.

New Functions

Function Name Description

ARRAYELEMENTAT Function Returns element value of input array for the provided index value.

DOUBLEMETAPHONE Function Returns primary and secondary phonetic spellings of an input string using the Double Metaphone algorithm.

DOUBLEMETAPHONEEQUALS Function Returns true if two strings match phonetic spellings using Double Metaphone algorithm. Tolerance threshold can be adjusted.

UNIQUE Function Generates a new column containing an array of the unique values from a source column.

Release 5.0.1

RAND function generates true random numbers

In Release 5.0 and earlier, the RAND function produced the same set of random numbers within the browser, after browser refresh, and over subsequent runs of a job.

During job execution, a default seed value was inserted as the basis for the function during the execution of the job. In some cases, this behavior is desired.

In Release 5.0.1 and later, the RAND function accepts an optional integer as a parameter. When this new seed value is inserted, the function generates deterministic, pseudo-random values.

This version matches the behavior of the old function.

NOTE: On all upgraded instances of the platform, references to the RAND function have been converted to use a default seed value, so that previous behavior is maintained in the upgraded version.

If no seed value is inserted as a parameter, the RAND function generates true random values within the browser, after browser refresh, and over subsequent job runs.

NOTE: Be aware that modifying your dataset based on the generated values of RAND() may have unpredictable effects later in your recipe and downstream of it.

For more information, see RAND Function.
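For example (sketches only; the seed value 2 is arbitrary):

derive type: single value: RAND() as: 'random_true'

derive type: single value: RAND(2) as: 'random_seeded'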

Release 5.0

Required type parameter

Prior to Release 5.0, the following was a valid Wrangle step:

derive value:colA + colB as:'colC'

Beginning in Release 5.0, the type parameter is required. This parameter defines whether the transform is a single or multi-row formula. In the Transform Builder, this value must be specified.

The following is valid in Release 5.0:

derive type:single value:colA + colB as:'colC'

See Derive Transform.

See Transform Builder.

Deprecated aggregate transform

In Release 4.2.1 and earlier, the aggregate transform could be used to aggregate your datasets using aggregation functions and groupings.

In Release 5.0 and later, this transform has been merged into the pivot transform. The aggregate transform has been deprecated and is no longer available.

NOTE: During upgrade to Release 5.0 and later, recipes that had previously used the aggregate transform are automatically migrated to use the pivot equivalent.

Example 1

Release 4.2.1 and earlier Aggregate:

aggregate value:AVERAGE(Scores)

Release 5.0 and later Pivot:

pivot value: AVERAGE(Scores) limit: 1

The limit parameter defines the maximum number of columns that can be generated by the pivot.

Example 2

Aggregate:

aggregate value:AVERAGE(Scores) group:studentId

Pivot:

pivot group: studentId value: AVERAGE(Scores) limit: 1

For more information, see Pivot Transform.

New search terms

In the new Search panel, you can search for terms that can be used to select transformations for quick population of parameters. In the following table, you can see how Wrangle terminology has changed in Release 5.0 for some common transforms from earlier releases.

Tip: You can paste the Release 5.0 terms in the Search panel to locate the same transformations used in earlier releases.

Release 4.2.1 and earlier transforms Release 5.0 and later search terms

aggregate pivot

keep filter

delete filter

extract on: extractpatterns

extract at: extractpositions

extract before: extractbetweendelimiters

extract after: extractbetweendelimiters

replace on: replacepatterns

replace at: replacepositions

replace before: replacebetweenpatterns

replace after: replacebetweenpatterns

replace from: replacebetweenpatterns

replace to: replacebetweenpatterns

split on: splitpatterns

split delimiters: splitpositions

split every: splitpositions

split positions: splitpositions

split after: splitpatterns

split before: splitpatterns

split from: splitpatterns

split to: splitpatterns

Support for <> operator

Prior to Release 5.0, the following operator was used to test "not equal" comparisons:

!=

Beginning in Release 5.0, the following operator is also supported:

<>

Example:

derive value:IF ((col1 <> col2), 'different','equal') as:'testNotEqual'

Tip: Both of the above operators are supported, although the <> operator is preferred.

For more information, see Comparison Operators.

ROUND function takes optional number of digits

The ROUND function now supports rounding to a specified number of digits. By default, values are rounded to the nearest integer, as before. See ROUND Function.
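For example, a minimal sketch rounding to two digits (values are illustrative):

derive type: single value: ROUND(3.14159, 2) as: 'rounded_two_digits'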

New Functions

Function Name Description

DEGREES Function Generates the value in degrees for an input radians value.

EXACT Function Compares two strings to see if they are exact matches.

FILTEROBJECT Function Filters the keys and values from an Object based on specified keys.

HOST Function Returns the host value from a URL.

ISEVEN Function Returns true if an Integer, function returning an Integer, or a column contains an even value.

ISODD Function Returns true if an Integer, function returning an Integer, or a column contains an odd value.

KTHLARGESTUNIQUE Function Computes the kth-ranked unique value in a set of values.

LCM Function Returns the least common multiple between two input values.

MODE Function Computes the mode (most common) value for a set of values.

MODEIF Function Computes the mode based on a conditional test.

PAD Function Pads the left or right side of a value with a specified character string.

PI Function Generates the value for pi to 15 decimal places.

RADIANS Function Generates the value in radians for an input degrees value.

RANDBETWEEN Function Generates a random Integer in a range between two specified values.

RIGHTFIND Function Locates a substring by searching from the right side of an input value.

ROLLINGCOUNTA Function Computes count of non-null values across a rolling window within a column.

ROLLINGKTHLARGEST Function Computes the kth largest value across a rolling window within a column.

ROLLINGKTHLARGESTUNIQUE Function Computes the kth largest unique value across a rolling window within a column.

ROLLINGLIST Function Computes list of all values across a rolling window within a column.

ROLLINGMAX Function Computes maximum value across a rolling window within a column.

ROLLINGMIN Function Computes minimum value across a rolling window within a column.

ROLLINGMODE Function Computes mode (most common) value across a rolling window within a column.

ROLLINGSTDEV Function Computes standard deviation across a rolling window within a column.

ROLLINGVAR Function Computes variance across a rolling window within a column.

SIGN Function Computes the positive or negative sign of an input value.

TRUNC Function Truncates a value to the nearest integer or a specified number of digits.

URLPARAMS Function Extracts any query parameters from a URL into an Object.

WEEKNUM Function Calculates the week that the date appears during the year (1-52).

Release 4.2.1

None.

Release 4.2

New Filter transform

Perform a variety of predefined row filtrations using the new filter transform, or apply your own custom formula to keep or delete rows from your dataset.

See Remove Data. See Filter Transform.

New Case transform

Beginning in Release 4.2, you can use the Transform Builder to simplify the construction of CASE statements. For each case, specify the conditional and resulting expression in separate textboxes.

See Apply Conditional Transformations. See Case Transform.

Rename transform now supports multi-column rename

Use the rename transform to rename multiple columns in a single transform.

See Rename Columns. See Rename Transform.

Delete specified columns or delete the others

The drop transform now supports the option of deleting all columns except the ones specified in the transform. See Drop Transform.

New string comparison functions

Compare two strings using Latin collation settings. See below.

NOW function returns 24-hour time values

In Release 4.1.1 and earlier, the NOW function returned time values for the specified time zone in 12-hour time, which was confusing.

In Release 4.2 and later, this function returns values in 24-hour time.

New Transforms

Transform Name Documentation

case Case Transform

filter Filter Transform

New Functions

Function Name Documentation

STRINGGREATERTHAN STRINGGREATERTHAN Function

STRINGGREATERTHANEQUAL STRINGGREATERTHANEQUAL Function

STRINGLESSTHAN STRINGLESSTHAN Function

STRINGLESSTHANEQUAL STRINGLESSTHANEQUAL Function

SUBSTITUTE SUBSTITUTE Function

Changes to the APIs

Contents:

Changes for Release 6.8
  Overrides to relational sources and targets
  New flow object definition
  Export and import macros
Changes for Release 6.4
  v4 version of password reset request endpoint
  Changes to awsConfig object
Changes for Release 6.3
  Assign AWSConfigs to a user at create time
Changes for Release 6.0
  Error in Release 6.0.x API docs
  Planned End of Life of v3 API endpoints
Changes for Release 5.9
  Introducing Access Tokens
Changes for Release 5.1
Changes for Release 5.0
  Introducing v4 APIs
Changes for Release 4.2
  Create Hive and Redshift connections via API
  WrangledDataset endpoints are still valid

Review the changes to the publicly available REST APIs for the Trifacta® platform for the current release and past releases.

Changes for Release 6.8

Overrides to relational sources and targets

Through the APIs, you can now apply overrides to relational sources and targets during job execution or deployment import.

jobGroups

When you are running a job, you can override the default publication settings for the job using overrides in the request. For more information, see API Workflow - Run Job.

Deployments

When you import a flow package into a deployment, you may need to remap the source and output of the flow to use production versions of your data. This capability has been present in the product for file-based sources and targets. Now, it's available for relational sources and targets. For more information, see Define Import Mapping Rules.

New flow object definition

Release 6.8 introduces a new version of the flow object definition. This new version will support cross-product and cross-version import and export of flows in the future. For more information see Changes to the Object Model.

NOTE: The endpoints to use to manage flow packages remain unchanged. Similarly, the methods used to define import mapping rules remain unchanged. The API responses that contain flow definitions have changed. See below.

Export and import macros

Beginning in Release 6.8, you can export and import macro definitions via API.

Export: API Macros Package Get v4
Import: API Macros Package Post v4

Changes for Release 6.4

v4 version of password reset request endpoint

To assist in migration from the command-line interface to using the APIs, a v4 version of an API endpoint has been made available to allow administrators to generate password reset codes.

Changes to awsConfig object

NOTE: No action is required.

In Release 6.0, the awsConfig object was introduced to enable the assignment of AWS configurations to individual users (per-user auth) via API. This version of the awsConfig object supported a mapping of a single IAM role to an awsConfig object.

Beginning in Release 6.4, per-user authentication now supports mapping of multiple possible IAM roles to an individual user's configuration. To enable this one-to-many mapping, the awsRoles object was introduced.

An awsRoles object creates a one-to-one mapping between an IAM role and an awsConfig object. An awsConfig object can have multiple awsRoles assigned to it.

Changes to awsConfig object:

The role field in the object has been replaced by activeRoleId, which maps to the active role for the configuration object. For each role reference in the awsConfig objects, a corresponding awsRole object has been created and mapped to it.

Beginning in Release 6.4, you can create, edit, and delete awsRoles objects, which can be used to map an AWS IAM role ARN to a specified AWSConfig object. You can map multiple awsRoles to a single awsConfig.

For more information, see API Workflow - Manage AWS Configurations.

Changes for Release 6.3

Assign AWSConfigs to a user at create time

Beginning in Release 6.3, you can assign an AWSConfig object to a user when you create the object. This shortcut reduces the number of REST calls that you need to make.

NOTE: For security reasons, AWSConfig objects must be assigned to users at the time of creation. Admin users can assign them to other users. Non-admin users are automatically assigned the AWSConfig objects that they create.

Prior to Release 6.3, AWSConfig objects were assigned through the following endpoint. Example:

/v4/people/2/awsConfigs/6

NOTE: This endpoint has been removed from the platform. Please update any scripts that reference the above endpoint to manage AWS configuration assignments through the new method described in the following link.

See API Workflow - Manage AWS Configurations.

Changes for Release 6.0

Error in Release 6.0.x API docs

In Release 6.0 - Release 6.0.2, the online and PDF versions of the documentation referenced the following endpoint: API JobGroups Get Status v4. According to the docs, this endpoint was triggered in this manner:

Method GET

Endpoint /v4/jobGroups/<id>/status

This endpoint exists in v3 of the API endpoints. It does not exist in v4.

Instead, you should monitor the status field for the base GET endpoint for jobGroups. For more information, see API JobGroups Get v4.
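For example, a hedged sketch of checking job status through the base endpoint (the identifier placeholder and the response fields shown are illustrative only):

Method GET

Endpoint /v4/jobGroups/<id>

Response (abbreviated): { "id": ..., "status": "Complete" }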

Planned End of Life of v3 API endpoints

In Release 6.0, the v3 API endpoints are supported.

In the next release of Trifacta Wrangler Enterprise after Release 6.0, the v3 API endpoints will be removed from the product (End of Life).

You must migrate to using the v4 API endpoints before upgrading to the next release after Release 6.0.

Changes for Release 5.9

Introducing Access Tokens

Each request to the API endpoints of the Trifacta platform requires submission of authentication information. In Release 5.1 and earlier:

A request could include clear-text username/password combinations. This method is not secure.
A request could include a browser cookie. This method does not work well for use cases outside of the browser (e.g. scripts).

Beginning in Release 5.9, API users can manage authentication using access tokens. These tokens obscure any personally identifiable information and represent a standards-based method of secure authentication.

NOTE: All previous methods of API authentication are supported in this release. Access tokens are the preferred method of authentication.

The basic process works in the following manner:

1. The API user requests generation of a new token.
  a. This initial request must contain a valid username and password.
  b. The request includes an expiration.
  c. The token value is returned in the response.
2. The token value is inserted into the Authorization header of each request to the platform.
3. The user monitors the current time and the expiration time of the token. At any time, the user can request a new token to be generated using the same endpoint used in the initial request.
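As a rough illustration only (the endpoint path below is an assumption based on the API AccessTokens Create v4 reference; consult that page for the exact request body):

Method POST

Endpoint /v4/apiAccessTokens

Subsequent API requests then include the returned token value in the Authorization header:

Authorization: Bearer <token value>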

Access tokens can be generated via API or the Trifacta application.

NOTE: This feature must be enabled in your instance of the platform. See Enable API Access Tokens.

API: For more information, see API AccessTokens Create v4. Trifacta application: For more information, see Access Tokens Page.

For more information on API authentication methods, see API Authentication.

Changes for Release 5.1

None.

Changes for Release 5.0

Introducing v4 APIs

NOTE: This feature is in Beta release.

Release 5.0 signals the introduction of version 4 of the REST APIs.

NOTE: At this time, a very limited number of v4 REST APIs are publicly available. Where possible, you should continue to use the v3 endpoints. For more information, see v4 Endpoints.

v4 conventions

The following conventions apply to v4 and later versions of the APIs:

Parameter lists are consistently enveloped in the following manner:

{ "data": [ { ... } ] }

Field names are in camelCase and are consistent with the resource name in the URL or with the embed URL parameter. Foreign keys from earlier API versions have been replaced with identifiers like the following:

v3 and earlier v4 and later

"createdBy": 1, "creator": { "id": 1 },

"updatedBy": 2, "updater": { "id": 2 },

The publication endpoint references the database differently. This change makes the publishing endpoint for relational targets more flexible in the future.

v3 and earlier v4 and later

"database": "dbName", "path": ["dbName"],

Changes for Release 4.2

Create Hive and Redshift connections via API

You can create connections of these types via API:

Only one global Hive connection is still supported. You can create multiple Redshift connections.

WrangledDataset endpoints are still valid

In Release 4.1.1 and earlier, the WrangledDataset endpoints enabled creation, modification, and deletion of a wrangled dataset object, which also created the associated recipe object.

In Release 4.2, wrangled datasets have been removed from the application interface. However, the WrangledDataset API endpoints remain, although they now apply to the recipe directly.

The following endpoints and methods are still available:

NOTE: In a future release, these endpoints may be migrated to recipe-based endpoints. API users should review this page for each release.

For more information, see Changes to the Object Model.

Changes to Configuration

Contents:

Release Updates
  Release 6.5
  Release 6.4.1
  Release 6.4
  Release 6.0
Configuration Mapping

To centralize common enablement configuration on a per-workspace basis, a number of configuration properties are being migrated from trifacta-conf.json into the Trifacta® application.

In prior releases, some of these settings may have been surfaced in the Admin Settings page. See Admin Settings Page. For more information on configuration in general, see Platform Configuration Methods.

To assist administrators in managing these settings, this section provides a per-release set of updates and a map of old properties to new settings.

These settings now appear in the Workspace Admin page. To access, from the left menu bar select Settings menu > Settings > Workspace Admin. For more information, see Workspace Admin Page.
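For reference, the dotted property names listed below correspond to nested keys in trifacta-conf.json. A minimal sketch for webapp.session.durationInMins (the value shown is illustrative only):

{
  "webapp": {
    "session": {
      "durationInMins": 480
    }
  }
}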

Release Updates

Release 6.5

Admin Settings or trifacta-conf.json setting Workspace Admin setting Update

n/a Show output directory in profile view This parameter has been hidden, as it is now enabled by default.

n/a Show upload directory in profile view This parameter has been hidden, as it is now enabled by default.

Release 6.4.1

The following parameters were moved in this release:

Admin Settings or trifacta-conf.json setting Workspace Admin setting Update

webapp.connectivity.customSQLQuery.enabled Enable custom SQL query Feature is now enabled at the workspace or tier level. For more information on this feature, see Enable Custom SQL Query.

webapp.enableTypecastOutput Schematized output Feature is now enabled at the workspace or tier level. For more information, see Miscellaneous Configuration.

Release 6.4

Parameters from Release 6.0 that are no longer available in the Workspace Admin page are now enabled by default for all users.

Some new features for this release may be enabled or disabled through the Workspace Admin page.

See Workspace Admin Page.

Release 6.0

Initial release of the Workspace Admin page. See below for configuration mapping.

Configuration Mapping

The following mapping between old trifacta-conf.json settings and new Workspace Admin settings is accurate for the current release.

Admin Settings or trifacta-conf.json setting Workspace Admin setting Update

webapp.walkthrough.enabled Product walkthroughs

webapp.session.durationInMins Session duration

webapp.enableDataDownload Sample downloads

webapp.enableUserPathModification Allow the user to modify their paths

webapp.enableSelfServicePasswordReset Enable self service password reset

webapp.connectivity.customSQLQuery.enabled Enable custom SQL query

outputFormats.Parquet Parquet output format

outputFormats.JSON JSON output format

outputFormats.CSV CSV output format

outputFormats.Avro Avro output format

outputFormats.TDE TDE output format

feature.scheduling.enabled Enable Scheduling feature

feature.scheduling.schedulingManagementUI Scheduling management UI (experimental) Now hidden.

feature.scheduling.upgradeUI Show users a modal to upgrade to a plan with Scheduling Now hidden.

feature.sendFlowToOtherPeople.enabled Allow users to send copies of flows to other users

feature.flowSharing.enabled Enable Flow Sharing feature

feature.flowSharing.upgradeUI Show users a modal to upgrade to a plan with Flow Sharing Now hidden.

feature.enableFlowExport Allow users to export their flows

feature.enableFlowImport Allow users to import flows into the platform

feature.hideAddPublishingActionOptions Forbid users to add non-default publishing actions

feature.hideUnderlyingFileSystem Hide underlying file system to users

feature.publishEnabled Enable publishing

feature.rangeJoin.enabled Enable UI for range join

feature.showDatasourceTabs Show datasource tabs in the application

feature.showFileLocation Show file location

feature.showOutputHomeDirectory Show output directory in profile view

feature.showUploadDirectory Show upload directory in profile view

metadata.branding Branding to use in-product for this deployment Now hidden.

webapp.connectivity.enabled Enable Connectivity feature

webapp.connectivity.upgradeUI Show users a modal to upgrade to a plan with Connectivity

feature.apiAccessTokens.enabled API Access Token

Changes to the Object Model

Contents:

Release 6.8
  Version 2 of flow definition
Release 6.4
  Macros
Release 6.0
Release 5.1
Release 5.0
  Datasets with parameters
Release 4.2
  Wrangled datasets are removed
  Recipes can be reused and chained
  Introducing References
  Introducing Outputs
  Flow View Differences
  Connections as a first-class object

For more information on the objects available in the platform, see Object Overview.

Release 6.8

Version 2 of flow definition

This release introduces a new specification for the flow object.

NOTE: This version of the flow object now supports export and import across products and versions of those products in the future. There is no change to the capabilities and related objects of a flow.

Beginning in Release 6.8:

You can export a flow from one product and import it into another. For example, you can develop a flow in Trifacta Wrangler Enterprise and then import it into Trifacta Wrangler, assuming that the product receiving the import is on the same build or a later one.

NOTE: Cloud-based products, such as free Trifacta Wrangler are updated on a periodic basis, as often as once a month. These products are likely to be on a version that is later than your installed version of Trifacta Wrangler Enterprise. For compatibility reasons, you should develop your flows in your earliest instance of Trifacta Wrangler Enterprise on Release 6.8 or later.

You can export a flow from Release 6.8 or later of Trifacta Wrangler Enterprise and import it into Release 7.0 after upgrading the platform.

NOTE: You cannot import a pre-Release 6.8 flow into a Release 6.8 or later instance of Trifacta Wrangler Enterprise. You should re-import those flows before you upgrade to Release 6.8 or later.

Release 6.4

Macros

This release introduces macros, which are reusable sequences of parameterized steps. These sequences can be saved independently and referenced in other recipes in other flows. See Overview of Macros.

Release 6.0

None.

Release 5.1

None.

Release 5.0

Datasets with parameters

Beginning in Release 5.0, imported datasets can be augmented with parameters, which enables operationalizing sampling and jobs based on date ranges, wildcards, or variables applied to the input path. For more information, see Overview of Parameterization.

Release 4.2

In Release 4.2, the object model has undergone the following revisions to improve flexibility and control over the objects you create in the platform.

Wrangled datasets are removed

In Release 3.2, the object model introduced the concepts of imported datasets, recipes, and wrangled datasets. These objects represented data that you imported, steps that were applied to that data, and data that was modified by those steps.

In Release 4.2, the wrangled dataset object has been removed and replaced by the objects listed below. All of the functionality associated with a wrangled dataset remains, including the following actions. Next to each action is the new object with which the action is associated.

Wrangled Dataset action Release 4.2 object

Run or schedule a job Output object

Preview data Recipe object

Reference to the dataset Reference object

NOTE: At the API level, the wrangledDataset endpoint continues to be in use. In a future release, separate endpoints will be available for recipes, outputs, and references. For more information, see API Reference.

These objects are described below.

Recipes can be reused and chained

Since recipes are no longer tied to a specific wrangled dataset, you can now reuse recipes in your flow. Create a copy with or without inputs and move it to a new flow if needed. Some cleanup may be required.

This flexibility allows you to create, for example, recipes that are applicable to all of your datasets for initial cleanup or other common wrangling tasks.

Additionally, recipes can be created from recipes, which allows you to create chains of recipes. This sequencing allows for more effective management of common steps within a flow.

Introducing References

Before Release 4.2, reference datasets existed and were represented in the user interface. However, these objects existed in the downstream flow that consumes the source. If you had adequate permissions to reference a dataset from outside of your flow, you could pull it in as a reference dataset for use.

In Release 4.2, a reference is a link from a recipe in your flow to other flows. This object allows you to expose your flow's recipe for use outside of the flow. So, from the source flow, you can control whether your recipe is available for use.

This object allows you to have finer-grained control over the availability of data in other flows. It is a dependent object of a recipe.

NOTE: For multi-dataset operations such as union or join, you must now explicitly create a reference from the source flow and then union or join to that object. In previous releases, you could directly join or union to any object to which you had access.

Introducing Outputs

In Release 4.1, outputs became a configurable object that was part of the wrangled dataset. For each wrangled dataset, you could define one or more publishing actions, each with its own output types, locations, and other parameters. For scheduled executions, you defined a separate set of publishing actions. These publishing actions were attached to the wrangled dataset.

In Release 4.2, an output is a defined set of scheduled or ad-hoc publishing actions. With the removal of the wrangled dataset object, outputs are now top-level objects attached to recipes. Each output is a dependent object of a recipe.

Flow View Differences

Below, you can see the same flow as it appears in Release 4.1 and Release 4.2. In each Flow View:

The same datasets have been imported. POS-r01 has been unioned to POS-r02 and POS-r03. POS-r01 has been joined to REF-PROD, and the column containing the duplicate join key in the result has been dropped. In addition to the default CSV publishing action (output), a scheduled one has been created in JSON format and scheduled for weekly execution.

Release 4.1 Flow View

Release 4.2 Flow View

Flow View differences

Wrangled dataset no longer exists.
In Release 4.1, scheduling is managed off of the wrangled dataset. In Release 4.2, it is managed through the new output object.
Outputs are configured in a very similar manner, although in Release 4.2 the tab is labeled "Destinations." No changes to scheduling UI.
Like the output object, the reference object is an externally visible link to a recipe in Flow View. This object just enables referencing the recipe object in other flows.
See Flow View Page.

Other differences

In application pages where you can select tabs to view object types, the available selections are typically: All, Imported Dataset, Recipe, and Reference. Wrangled datasets have been removed from the Dataset Details page, which means that the job cards for your dataset runs have been removed. These cards are still available in the Jobs page when you click the drop-down next to the job entry. The list of jobs for a recipe is now available through the output object in Flow View. Select the object and review the job details through the right panel. In Flow View and the Transformer page, context menu items have changed.

Connections as a first-class object

In Release 4.1.1 and earlier, connections appeared as objects to be created or explored in the Import Data page. Through the left navigation bar, you could create or edit connections to which you had permission to do so. Connections were also selections in the Run Job page.

Only administrators could create public connections. End-users could create private connections.

In Release 4.2, the Connections Manager enables you to manage your personal connections and (if you're an administrator) global connections. Key features:

Connections can be managed like other objects. Connections can be shared, much like flows. When a flow with a connection is shared, its connection is automatically shared. For more information, see Overview of Sharing. Release 4.2 introduces a much wider range of connectivity options. Multiple Redshift connections can be created through this interface. In prior releases, you could only create a single Redshift connection.

NOTE: Beginning in Release 4.2, all connections are initially created as private connections, accessible only to the user who created them. Connections that are available to all users of the platform are called public connections. You can make connections public through the Connections page.

For more information, see Connections Page.

Release Notes 6.8

Contents:

Release 6.8 What's New Changes in System Behavior Key Bug Fixes New Known Issues

Release 6.8

December 6, 2019

Welcome to Release 6.8 of Trifacta® Wrangler Enterprise. This release introduces several key features around operationalizing the platform across the enterprise. Enterprise stakeholders can now receive email notifications when recurring jobs have succeeded or failed, updating data consumers outside of the platform. This release also introduces a generalized webhook interface, which facilitates push notifications to applications such as Slack when jobs have completed. When jobs fail, users can download a much richer support bundle containing configuration files, script files, and a specified set of log files.

Macros have been expanded to now be export- and import-ready across environments. In support of this feature, the Wrangle Exchange is now available through the Trifacta Community, where you can download macros created by others and import them for your own use. Like macros, you can now export and import flows across product editions and releases (Release 6.8 or later only).

In the application, you can now use shortcut keys to navigate around the workspace and the Transformer page. And support for the Firefox browser has arrived. Read on for more goodness added with this release.

What's New

Install:

Support for ADLS Gen2 blob storage. See Enable ADLS Gen2 Access.

Workspace:

Individual users can now enable or disable keyboard shortcuts in the workspace or Transformer page. See User Profile Page.
Configure locale settings at the workspace or user level. See Locale Settings.
You can optionally duplicate the datasets from a source flow when you create a copy of it. See Flow View Page.
Create a copy of your imported dataset. See Library Page.

Browser:

Support for Firefox browser.

NOTE: This feature is in Beta release.

For supported versions, see Desktop Requirements.

Project Management:

Support for export and import of macros. See Macros Page. For more information on macros, see Overview of Macros. Download and use macros available through the Wrangle Exchange. See https://www.trifacta.com/blog/crowdsourcing-macros-trifacta-wrangle-exchange/.

Operationalization:

Create webhook notifications for third-party platforms based on results of your job executions. See Create Flow Webhook Task. Enable and configure email notifications based on the success or failure of job executions.

NOTE: This feature requires access to an SMTP server. See Enable SMTP Email Server Integration.

For more information on enabling, see Workspace Admin Page.

Individual users can opt out of receiving email messages or can configure use of a different email address. See Email Notifications Page. For more information on enabling emails for individual flows, see Manage Flow Notifications Dialog.

Supportability:

Download logs bundle on job success or failure now contains extensive configuration information to assist in debugging. For more information, see Configure Support Bundling.

Connectivity:

Support for integration with EMR 5.8 - 5.27. For more information, see Configure for EMR.

Connect to SFTP servers to read data and write datasets. See Create SFTP Connections.

Create connections to Databricks Tables.

NOTE: This connection is supported only when the Trifacta platform is connected to an Azure Databricks cluster.

For more information, see Create Databricks Tables Connections.

Support for using a non-default database for your Snowflake stage. Support for ingest from read-only Snowflake databases. See Enable Snowflake Connections.

Import:

As of Release 6.8, you can import an exported flow into any edition or release after the build number of the export. See Import Flow. Improved monitoring of long-loading relational sources. See Import Data Page.

NOTE: This feature must be enabled. See Configure JDBC Ingestion.

Transformer Page:

Select columns, functions applied to your source, and constants to replace your current dataset. See Select. Improved Date/Time format selection. See Choose Datetime Format Dialog.

Tip: Datetime formats in card suggestions now factor in the user's locale settings for greater relevance.

Improved matching logic and performance when matching columns through RapidTarget. Align columns based on the data contained in them, in addition to column name. This feature is enabled by default. For more information, see Overview of RapidTarget. Improvements to the Search panel enable faster discovery of transformations, functions, and other objects. See Search Panel.

Job execution:

By default, the Trifacta application permits up to four jobs from the same flow to be executed at the same time. If needed, you can configure the application to execute jobs from the same flow one at a time. See Configure Application Limits.

If you enabled visual profiling for your job, you can download a JSON version of the visual profile. See Job Details Page.

Support for instance pooling in Azure Databricks. See Configure for Azure Databricks.

Language:

New trigonometry and statistical functions. See Changes to the Language.

API:

Apply overrides at time of job execution via API. Define import mapping rules for your deployments that use relational sources or publish to relational targets. Export and import macro definitions. See Changes to the APIs.

Changes in System Behavior

Browser Support Policy:

For supported browsers, at the time of release, the latest stable version and the two previous stable versions are supported.

NOTE: Stable browser versions released after a given release of Trifacta Wrangler Enterprise will NOT be supported for any prior version of Trifacta Wrangler Enterprise. A best effort will be made to support newer versions released during the support lifecycle of the release.

For more information, see Desktop Requirements.

Install:

NOTE: In the next release of Trifacta Wrangler Enterprise after Release 6.8, support for installation on CentOS/RHEL 6.x and Ubuntu 14.04 will be deprecated. You should upgrade the Trifacta node to a supported version of CentOS/RHEL 7.x or Ubuntu 16.04. Before performing the upgrade, please perform a full backup of the Trifacta platform and its databases. See Backup and Recovery.

Support for Spark 2.1 has been deprecated. Please upgrade to a supported version of Spark. Support for EMR 5.6 and EMR 5.7 has also been deprecated. Please upgrade to a supported version of EMR. For more information, see Product Support Matrix. To simplify the installation distribution, the Hadoop dependencies for the recommended version only are included in the software download. For the dependencies for other supported Hadoop distributions, you must download them from the Trifacta FTP site and install them on the Trifacta node. See Install Hadoop Dependencies. The Trifacta node has been upgraded to use Python 3. This instance of Python has no dependencies on any Python version external to the Trifacta node.

Import/Export:

Flows can now be exported and imported across products and versions of products. See Changes to the Object Model.

CLI and v3 endpoints (Release 6.4):

NOTE: Do not attempt to connect to the Trifacta platform using any version of the CLI or the v3 endpoints. They are no longer supported and unlikely to work.

In Release 6.4:

The Command Line Interface (CLI) was deprecated. Customers must use the v4 API endpoints instead. The v3 versions of the API endpoints were deprecated. Customers must use the v4 API endpoints instead. Developer content was provided to assist in migrating to the v4 API endpoints. For more information on acquiring this content, please contact Trifacta Support.

Key Bug Fixes

Ticket Description

TD-40348 When loading a recipe in an imported flow that references an imported Excel dataset, Transformer page displays Input validation failed: (Cannot read property 'filter' of undefined) error, and the screen is blank.

TD-42080 Cannot run flow or deployment that contains more than 10 recipe jobs

New Known Issues

Ticket Description

TD-45923 Publishing a compressed Snappy file to SFTP fails.

TD-45922 You cannot publish TDE format to SFTP destinations.

TD-45492 Publishing to Databricks Tables fails on ADLS Gen1 in user mode.

TD-45273 Artifact Storage Service fails to start on HDP 3.1.

Workaround: The Artifact Storage Service can reference the HDP 2.6 Hadoop bundle JAR.

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following property:

"artifact-storage-service.classpath"

3. Replace this value:

:%(topOfTree)s/%(hadoopBundleJar)s

4. With the following:

:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar

5. Save changes and restart the platform.
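For reference, a sketch of the corrected setting as it might appear when edited directly rather than through the Admin Settings page is shown below. The surrounding layout is illustrative only; the property name and value are taken from the steps above.

"artifact-storage-service.classpath": ":%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/hadoop-deps/hdp-2.6/build/libs/hdp-2.6-bundle.jar"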

TD-45122 API: Re-running a job using only the wrangleDataset identifier fails, even if the original job succeeded, when writeSettings were specified.

Workaround: Use a full jobGroups job specification each time that you run a job.

See API JobGroups Create v4.

TD-44429 Cannot publish outputs to relational targets, receiving Encountered error while processing stream.

Workaround: This issue may be caused by the trifacta service account not having write and execute permissions to the /tmp directory on the Trifacta node. If so, you can do either of the following:

1. Enable write and execute permissions for the account on /tmp.
2. Create a new temporary directory and provide the service account write and execute permissions to it. Then, add the following to data-service.jvmOptions (see the sketch below):

-Dorg.xerial.snappy.tempdir=/new/directory/with/writeexecuteaccess
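A minimal sketch of how this option might be added is shown below. It assumes that data-service.jvmOptions accepts a list of JVM arguments and that /new/directory/with/writeexecuteaccess is the directory you created; verify the exact structure of this property in your own configuration before applying the change.

"data-service.jvmOptions": [
  "-Dorg.xerial.snappy.tempdir=/new/directory/with/writeexecuteaccess"
]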

TD-44427 Cannot publish dataset containing duplicate rows to Teradata. Error message:

Caused by: java.sql.SQLException: [Teradata Database] [TeraJDBC 15.10.00.14] [Error -2802] [SQLState 23000] Duplicate row error in abc_trifacta.tmp_218768523. at

Workaround: This is a known limitation on Teradata. For more information on this limitation, see Enable Teradata Connections.

Release Notes 6.4

Contents:

Release 6.4.2: What's New, Changes in System Behavior, Key Bug Fixes, New Known Issues
Release 6.4.1: What's New, Changes in System Behavior, Key Bug Fixes, New Known Issues
Release 6.4: What's New, Changes in System Behavior, Key Bug Fixes, New Known Issues

Release 6.4.2

November 15, 2019

This release is primarily a bug fix release with the following new features.

What's New

API:

Apply overrides at time of job execution via API. Define import mapping rules for your deployments that use relational sources or publish to relational targets. See Changes to the APIs.

Job execution:

By default, the Trifacta application permits up to four jobs from the same flow to be executed at the same time. If needed, you can configure the application to execute jobs from the same flow one at a time. See Configure Application Limits.

Changes in System Behavior

None.

Key Bug Fixes

Ticket Description

TD-44548 RANGE function returns null values if more than 1000 values in output.

TD-44494 Lists are not correctly updated in Deployment mode

TD-44311 Out of memory error when running a flow with many output objects

TD-44188 Performance is poor for SQL DW connection

TD-43877 Preview after a DATEFORMAT step does not agree with results or profile values

TD-44035 Spark job failure from Excel source

TD-43849 Export flows are broken when recipe includes Standardization or Transform by Example tasks.

NOTE: This Advanced Feature is available in Trifacta Wrangler Enterprise under a separate, additional license. If it is not available under your current license, do not enable it for use. Please contact your Trifacta representative for more information.

New Known Issues

None.

Release 6.4.1

August 30, 2019

This release includes bug fixes and introduces SSO connections for Azure relational sources.

What's New

Connectivity:

You can now leverage your Azure AD SSO infrastructure to create SSO connections to Azure relational databases. For more information, see Enable SSO for Azure Relational Connections.

Changes in System Behavior

Configuration changes:

The parameter to enable custom SQL query has been moved to the Workspace Admin page. The parameter to disable schematized output has been moved to the Workspace Admin page. For more information, see Changes to Configuration.

Key Bug Fixes

Ticket Description

TD-39086 Hive ingest job fails on Microsoft Azure.

New Known Issues

None.

Release 6.4

August 1, 2019

This release of Trifacta® Wrangler Enterprise features broad improvements to the recipe development experience, including multi-step operations and improved copy and paste within the Recipe panel. As a result of the panel's redesign, you can now create user-defined macros, which are sets of sequenced and parameterized steps for easy reuse and adaptation for other recipes. When jobs are executed, detailed monitoring provides enhanced information on progress of the job through each phase of the process. You can also connect to a broader ecosystem of sources and targets, including enhancements to the integration with Tableau Server and AWS Glue. New for this release: read from your Snowflake sources. Read on for additional details on new features and enhancements.

What's New

Transformer Page:

The redesigned Recipe panel enables multi-step operations and more robust copy and paste actions. See Recipe Panel. Introducing user-defined macros, which enable saving and reusing sequences of steps. For more information, see Overview of Macros. Use Transform by Example to generate output values for a column of values. See Transformation by Example Page. For an overview of this feature, see Overview of TBE.

Browse current flow for datasets or recipes to join into the current recipe. See Join Panel. Replace specific cell values. See Replace Cell Values.

Job Execution:

Detailed job monitoring for ingest and publishing jobs. See Overview of Job Monitoring. Parameterize output paths and table and file names. See Run Job Page.

Install:

Support for RHEL/CentOS 7.5 and 7.6 for the Trifacta node. See System Requirements.

Support for deployment of Trifacta platform via Docker image. See Install for Docker.

Connectivity:

Support for integration with Cloudera 6.2.x. See System Requirements.

NOTE: Support for integration with Cloudera 5.15.x and earlier has been deprecated. See End of Life and Deprecated Features.

NOTE: Support for integration with HDP 2.5.x and earlier has been deprecated. See End of Life and Deprecated Features.

Support for Snowflake database connections.

NOTE: This feature is supported only when Trifacta Wrangler Enterprise is installed on customer-managed AWS infrastructure.

For more information, see Enable Snowflake Connections.

Support for direct publishing to Tableau Server. For more information, see Run Job Page. Support for MySQL database timezones. See Install Databases for MySQL.

Enhanced support for AWS Glue integration:

Metadata catalog browsing through the application. See AWS Glue Browser. Per-user authentication to Glue. See Configure AWS Per-User Authentication. See Enable AWS Glue Access.

Import:

Add timestamp parameters to your custom SQL statements to enable data import relative to the job execution time. See Create Dataset with SQL.

Authentication:

Leverage your enterprise's SAML identity provider to pass through a set of IAM roles that Trifacta users can select for access to AWS resources.

NOTE: This authentication method is supported only if SSO authentication has been enabled using the platform-native SAML authentication method. For more information, see Configure SSO for SAML.

For more information, see Configure for AWS SAML Passthrough Authentication.

Support for Azure Managed Identities with Azure Databricks. See Configure for Azure Databricks.

Admin:

Administrators can review, enable, disable, and delete schedules through the application. See Schedules Page.

Sharing:

Share flows and connections with groups of users imported from your LDAP identity provider.

NOTE: This feature is in Beta release.

See Configure Users and Groups.

Logging:

Tracing user information across services for logging purposes. See Configure Logging for Services.

Language:

New functions. See Changes to the Language. Broader support for metadata references. For Excel files, $filepath references now return the location of the source Excel file. Sheet names are appended to the end of the reference. See Source Metadata References.

APIs:

Admins can now generate password reset requests via API. See Changes to the APIs.

Databases:

New databases:

Job Metadata Service database

Changes in System Behavior

NOTE: The Trifacta software must now be installed on an edge node of the cluster. Existing customers who cannot migrate to an edge node will be supported. You will be required to update cluster files on the Trifacta node whenever they change, and cluster upgrades may be more complicated. You should migrate your installation to an edge node if possible. For more information, see System Requirements.

NOTE: The v3 APIs are no longer supported. Please migrate immediately to using the v4 APIs.

NOTE: The command line interface (CLI) is no longer available. Please migrate immediately to using the v4 APIs.

NOTE: The PNaCl browser client extension is no longer supported. Please verify that all users of Trifacta Wrangler Enterprise are using a supported version of Google Chrome, which automatically enables use of WebAssembly. For more information, see Desktop Requirements.

NOTE: Support for Java 7 has been deprecated in the platform. Please upgrade to Java 8 on the Trifacta node and any connected cluster. Some versions of Cloudera may install Java 7 by default.

NOTE: The Chat with us feature is no longer available. For Trifacta Wrangler Enterprise customers, this feature had to be enabled in the product. For more information, see Trifacta Support.

NOTE: The desktop version of Trifacta Wrangler will cease operations on August 31, 2019. If you are still using the product at that time, your data will be lost. Please transition to using the free Cloud version of Trifacta® Wrangler. Automated migration is not available. To register for a free account, please visit https://cloud.trifacta.com.

Workspace:

Configuration for AWS authentication for platform users has been migrated to a new location. See Configure Your Access to S3.

API:

The endpoint used to assign an AWSConfig object to a user has been replaced.

NOTE: If you used the APIs to assign AWSConfig objects in a previous release, you must update your scripts to assign AWS configurations. For more information, see Changes to the APIs.

Documentation:

In prior releases, the documentation listed UTF32-BE and UTF32-LE as supported file formats. These formats are not supported. Documentation has been updated to correct this error. See Supported File Encoding Types.

Key Bug Fixes

Ticket Description

TD-41260 Unable to append Trifacta Decimal type into table with Hive Float type. See Hive Data Type Conversions.

TD-40424 UTF-32BE and UTF-32LE are available as supported file encoding options. They do not work.

NOTE: Although these options are available in the application, they have never been supported in the underlying platform. They have been removed from the interface.

TD-40299 Cloudera Navigator integration cannot locate the database name for JDBC sources on Hive.

TD-40243 API access tokens don't work with native SAML SSO authentication

TD-39513 Import of folder of Excel files as parameterized dataset only imports the first file, and sampling may fail.

TD-39455 HDI 3.6 is not compatible with Guava 26.

TD-39092 $filepath and $sourcerownumber references are not supported for Parquet file inputs.

For more information, see Source Metadata References.

TD-31354 When creating Tableau Server connections, the Test Connection button is missing. See Create Tableau Server Connections.

TD-36145 Spark running environment recognizes numeric values preceded by + as Integer or Decimal data type. Photon running environment does not and types these values as strings.

New Known Issues

Ticket Description

TD-42638 Publishing and ingest jobs that are short in duration cannot be canceled.

Workaround: Allow the job to complete. You can track the progress of the job through these phases in the application. See Job Details Page.

TD-39052 Changes to signout on reverse proxy method of SSO do not take effect after upgrade.

Release Notes 6.0

Contents:

Release 6.0.2: What's New, Changes to System Behavior, Key Bug Fixes, New Known Issues
Release 6.0.1: What's New, Changes to System Behavior, Key Bug Fixes, New Known Issues
Release 6.0: What's New, Changes to System Behavior, Key Bug Fixes, New Known Issues

Release 6.0.2

This release addresses several bug fixes.

What's New

Support for Cloudera 6.2. For more information, see System Requirements.

Changes to System Behavior

NOTE: As of Release 6.0, all new and existing customers must license, download, and install the latest version of the Tableau SDK onto the Trifacta node. For more information, see Create Tableau Server Connections.

Upload:

In previous releases, files that were uploaded to the Trifacta platform that had an unsupported filename extension received a warning before upload. Beginning in this release, files with unsupported extensions are blocked from upload. You can change the list of supported file extensions. For more information, see Miscellaneous Configuration.

Documentation:

In Release 6.0.x documentation, documentation for the API JobGroups Get Status v4 endpoint was mistakenly published. This endpoint does not exist. For more information on the v4 equivalent, see Changes to the APIs.

Key Bug Fixes

Ticket Description

TD-40471 SAM auth: Logout functionality not working

TD-39318 Spark job fails with parameterized datasets sourced from Parquet files

TD-39213 Publishing to Hive table fails

New Known Issues

None.

Release 6.0.1

This release features support for several new Hadoop distributions and numerous bug fixes.

What's New

Connectivity:

Support for integration with CDH 5.16.
Support for integration with CDH 6.1. Version-specific configuration is required. See Supported Deployment Scenarios for Cloudera.
Support for integration with HDP 3.1. Version-specific configuration is required. See Supported Deployment Scenarios for Hortonworks.
Support for Hive 3.0 on HDP 3.0 or HDP 3.1. Version-specific configuration is required. See Configure for Hive.
Support for Spark 2.4.0.

NOTE: There are some restrictions around which running environment distributions support and do not support Spark 2.4.0.

For more information, see Configure for Spark. Support for integration with high availability for Hive.

NOTE: High availability for Hive is supported on HDP 2.6 and HDP 3.0 with Hive 2.x enabled. Other configurations are not currently supported.

For more information, see Create Hive Connections.

Publishing:

Support for automatic publishing of job metadata to Cloudera Navigator.

NOTE: For this release, Cloudera 5.16 only is supported.

For more information, see Configure Publishing to Cloudera Navigator.

Changes to System Behavior

Key Bug Fixes

Ticket Description

TD-39779 MySQL JARs must be downloaded by user.

NOTE: If you are installing the databases in MySQL, you must download a set of JARs and install them on the Trifacta node. For more information, see Install Databases for MySQL.

TD-39694 Tricheck returns status code 200, but there is no response. It does not work through Admin Settings page.

TD-39455 HDI 3.6 is not compatible with Guava 26.

TD-39086 Hive ingest job fails on Microsoft Azure.

New Known Issues

Ticket Description

TD-40299 Cloudera Navigator integration cannot locate the database name for JDBC sources on Hive.

TD-40348 When loading a recipe in an imported flow that references an imported Excel dataset, the Transformer page displays an Input validation failed: (Cannot read property 'filter' of undefined) error, and the screen is blank.

Workaround: In Flow View, select an output object, and run a job. Then, load the recipe in the Transformer page and generate a new sample. For more information, see Import Flow.

TD-39969 On import, some Parquet files cannot be previewed and result in a blank screen in the Transformer page.

Workaround: Parquet format supports row groups, which define the size of data chunks that can be ingested. If the row group size is greater than 10 MB in a Parquet source, preview and initial sampling do not work. To work around this issue, import the dataset and create a recipe for it. In the Transformer page, generate a new sample for it. For more information, see Parquet Data Type Conversions.

Release 6.0

This release of Trifacta® Wrangler Enterprise introduces key features around column management, including multi-select and copy and paste of columns and column values. A new Job Details page captures more detailed information about job execution and enables more detailed monitoring of in-progress jobs. Some relational connections now support publishing to connected databases. This is our largest release yet. Enjoy!

NOTE: This release also announces the deprecation of several features, versions, and supported extensions. Please be sure to review Changes to System Behavior below.

What's New

NOTE: Beginning in this release, the Wrangler Enterprise desktop application requires a 64-bit version of Microsoft Windows. For more information, see Install Desktop Application.

Wrangling:

In the data grid, you can select multiple columns before receiving suggestions and performing transformations on them. For more information, see Data Grid Panel. New Selection Details panel enables selection of values and groups of values within a selected column. See Selection Details Panel. Copy and paste columns and column values through the column menus. See Copy and Paste Columns. Support for importing files in Parquet format. See Supported File Formats. Specify ranges of key values in your joins. See Configure Range Join.

Jobs:

Review details and monitor the status of in-progress jobs through the new Job Details page. See Job Details Page. Filter list of jobs by source of job execution or by date range. See Jobs Page.

Connectivity:

Publishing (writeback) is now supported for relational connections. This feature is enabled by default.

NOTE: After a connection has been enabled for publishing, you cannot disable publishing for that connection. Before you enable, please verify that all user accounts accessing databases of these types have appropriate permissions.

See Enable Relational Connections. The following connection types are natively supported for publishing to relational systems: Oracle Data Type Conversions, Postgres Data Type Conversions, SQL Server Data Type Conversions, Teradata Data Type Conversions.
Import folders of Microsoft Excel workbooks. See Import Excel Data.
Support for integration with CDH 6.0. Version-specific configuration is required. See Supported Deployment Scenarios for Cloudera.
Support for integration with HDP 3.0. Version-specific configuration is required. See Supported Deployment Scenarios for Hortonworks.
Support for Hive 3.0 on HDP 3.0 only. Version-specific configuration is required. See Configure for Hive.

Language:

Track file-based lineage using $filepath and $sourcerownumber references. See Source Metadata References. In addition to directly imported files, the $sourcerownumber reference now works for converted files (such as Microsoft Excel workbooks) and for datasets with parameters. See Source Metadata References.

Workspace:

Organize your flows into folders. See Flows Page.

Publishing:

Users can be permitted to append to Hive tables when they do not have CREATE or DROP permissions on the schema.

NOTE: This feature must be enabled. See Configure for Hive.

Administration:

New Workspace Admin page centralizes many of the most common admin settings. See Changes to System Behavior below. Download system logs through the Trifacta application. See Admin Settings Page.

Supportability:

High availability for the Trifacta node is now generally available. See Install for High Availability.

Authentication:

Integrate SSO authentication with enterprise LDAP-AD using platform-native LDAP support.

NOTE: This feature is in Beta release.

NOTE: In previous releases, LDAP-AD SSO utilized an Apache reverse proxy. While this method is still supported, it is likely to be deprecated in a future release. Please migrate to using the above SSO method. See Configure SSO for AD-LDAP.

Support for SAML SSO authentication. See Configure SSO for SAML.

Changes to System Behavior

NOTE: The Trifacta node requires NodeJS 10.13.0. See System Requirements.

Configuration:

To simplify configuration of the most common feature enablement settings, some settings have been migrated to the new Workspace Admin page. For more information, see Workspace Admin Page.

NOTE: Over subsequent releases, more settings will be migrated to the Workspace Admin page from the Admin Settings page and from trifacta-conf.json. For more information, see Changes to Configuration.

See Platform Configuration Methods.

See Admin Settings Page.

Java 7:

NOTE: In the next release of Trifacta Wrangler Enterprise, support for Java 7 will be end of life. The product will no longer be able to use Java 7 at all. Please upgrade to Java 8 on the Trifacta node and your Hadoop cluster.

Key Bug Fixes

Ticket Description

TD-36332 Data grid can display wrong results if a sample is collected and dataset is unioned.

TD-36192 Canceling a step in recipe panel can result in column menus disappearing in the data grid.

TD-35916 Cannot logout via SSO

TD-35899 A deployment user can see all deployments in the instance.

TD-35780 Upgrade: Duplicate metadata in separate publications causes DB migration failure.

TD-35644 Extractpatterns with "HTTP Query strings" option doesn't work.

TD-35504 Cancel job throws 405 status code error. Clicking Yes repeatedly pops up Cancel Job dialog.

TD-35486 Spark jobs fail on LCM function that uses negative numbers as inputs.

TD-35483 Differences in how WEEKNUM function is calculated in the Trifacta Photon and Spark running environments, due to the underlying frameworks on which the environments are created.

NOTE: Trifacta Photon and Spark jobs now behave consistently. Week 1 of the year is the week that contains January 1.

TD-35481 Upgrade Script is malformed due to SplitRows not having a Load parent transform.

TD-35177 Login screen pops up repeatedly when access permission is denied for a connection.

TD-27933 For multi-file imports lacking a newline in the final record of a file, this final record may be merged with the first one in the next file and then dropped in the Trifacta Photon running environment.

New Known Issues

Ticket Description

TD-39513 Import of folder of Excel files as parameterized dataset only imports the first file, and sampling may fail.

Workaround: Import as separate datasets and union together.

TD-39455 HDI 3.6 is not compatible with Guava 26.

TD-39092 $filepath and $sourcerownumber references are not supported for Parquet file inputs.

Workaround: Upload your Parquet files. Create an empty recipe and run a job to generate an output in a different file format, such as CSV or JSON. Use that output as a new dataset. See Build Sequence of Datasets.

For more information on these references, see Source Metadata References.

TD-39086 Hive ingest job fails on Microsoft Azure.

TD-39053 Cannot read datasets from Parquet files generated by Spark containing nested values.

Workaround: In the source for the job, change the data types of the affected columns to String and re-run the job on Spark.

TD-39052 Signout using reverse proxy method of SSO is not working after upgrade.

TD-38869 Upload of Parquet files does not support nested values, which appear as null values in the Transformer page.

Workaround: Unnest the values before importing into the platform.

TD-37683 Send a copy does not create independent sets of recipes and datasets in new flow. If imported datasets are removed in the source flow, they disappear from the sent version.

Workaround: Create new versions of the imported datasets in the sent flow.

TD-36145 Spark running environment recognizes numeric values preceded by + as Integer or Decimal data type. Trifacta Photon running environment does not and types these values as strings.

TD-35867 v3 publishing API fails when publishing to alternate S3 buckets

Workaround: You can use the corresponding v4 API to perform these publication tasks. For more information on a workflow, see API Workflow - Manage Outputs.

Release Notes 5.1

Contents:

What's New Changes to System Behavior Key Bug Fixes New Known Issues

Welcome to Release 5.1 of Trifacta® Wrangler Enterprise! This release includes a significant expansion in database support and connectivity with more running environment versions, such as Azure Databricks. High availability is now available on the Trifacta platform node itself.

Within the Transformer page, you should see a number of enhancements, including improved toolbars and column menus. Samples can be named.

Regarding operationalization of the platform, datasets with parameters can now accept Trifacta patterns for parameter specification, which simplifies the process of creating complex matching patterns. Additionally, you can swap out a static imported dataset for a dataset with parameters, which enables development on a simpler dataset before expanding to a more complex set of sources. Variable overrides can now be applied to scheduled job executions, and you can specify multiple variable overrides in Flow View.

The underlying language has been improved with a number of transformations and functions, including a set of transformations designed around preparing data for machine processing. Details are below.

Tip: For a general overview of the product and its capabilities, see Product Overview.

What's New

Install:

Support for PostgreSQL 9.6 for Trifacta databases.

NOTE: PostgreSQL 9.3 is no longer supported. PostgreSQL 9.3 is scheduled for end of life (EOL) in September 2018. For more information on upgrading, see Upgrade Databases for PostgreSQL.

Partial support for MySQL 5.7 for hosting the Trifacta databases.

NOTE: MySQL 5.7 is not supported for installation on Amazon RDS. See System Requirements.

Support for high availability on the Trifacta node. See Configure for High Availability.

NOTE: High availability on the platform is in Beta release.

Support for CDH 5.15.

NOTE: Support for CDH 5.12 has been deprecated. See End of Life and Deprecated Features.

Support for Spark 2.3.0 on the Hadoop cluster. See System Requirements. Support for integration with EMR 5.13, EMR 5.14, and EMR 5.15. See Configure for EMR.

NOTE: EMR 5.13 - 5.15 require Spark 2.3.0. See Configure for Spark.

Support for integration with Azure Databricks. See Configure for Azure Databricks. Support for WebAssembly, Google Chrome's standards-compliant native client.

NOTE: This feature is in Beta release.

NOTE: In a future release, use of PNaCl native client is likely to be deprecated.

Use of WebAssembly requires Google Chrome 68+. No additional installation is required. In this release, this feature must be enabled. For more information, see Miscellaneous Configuration.

The Trifacta® platform defaults to using Spark 2.3.0 for Hadoop job execution. See Configure for Spark.

Connectivity:

Enhanced import process for Excel files, including support for import from backend file systems. See Import Excel Data. Support for DB2 connections. See Connection Types. Support for HiveServer2 Interactive (Hive 2.x) on HDP 2.6. See Configure for Hive. Support for Kerberos-delegated relational connections. See Enable SSO for Relational Connections.

NOTE: In this release, only SQL Server connections can use SSO. See Create SQL Server Connections.

Performance caching for JDBC ingestion. See Configure JDBC Ingestion. Enable access to multiple WASB datastores. See Enable WASB Access.

Import:

Support for use of Trifacta patterns in creating datasets with parameters. See Create Dataset with Parameters. Swap a static imported dataset with a dataset with parameters in Flow View. See Flow View Page.

Flow View:

Specify overrides for multiple variables through Flow View. See Flow View Page. Variable overrides can also be applied to scheduled job executions. See Add Schedule Dialog.

Transformer Page:

Join tool is now integrated into the context panel in the Transformer page. See Join Panel. Improved join inference key model. See Join Panel. Patterns are available for review and selection, prompting suggestions, in the context panel. Updated toolbar. See Transformer Toolbar. Enhanced options in the column menu. See Column Menus. Support for a broader range of characters in column names. See Rename Columns.

Sampling:

Samples can be named. See Samples Panel. Variable overrides can now be applied to samples taken from your datasets with parameters. See Samples Panel.

Jobs:

Filter list of jobs by date. See Jobs Page.

Language:

Rename columns using values across multiple rows. See Rename Columns.

Transformation Name Description

Bin column Place numeric values into bins of equal or custom size for the range of values.

Scale column Scale a column's values to a fixed range or to zero mean, unit variance distributions.

One-hot encoding Encode values from one column into separate columns containing 0 or 1, depending on the absence or presence of the value in the corresponding row.

Group By Generate new columns or replacement tables from aggregate functions applied to grouped values from one or more columns.

Publishing:

Export dependencies of a job as a flow. See Flow View Page. Add quotes as CSV file publishing options. See Run Job Page. Specify CSV field delimiters for publishing. See Miscellaneous Configuration. Support for publishing Datetime values to Redshift as timestamps. See Redshift Data Type Conversions.

Execution:

UDFs are now supported for execution on HDInsight clusters. See Java UDFs.

Admin:

Enable deletion of jobs. See Miscellaneous Configuration. Upload an updated license file through the application. See Admin Settings Page.

Changes to System Behavior

Diagnostic Server removed from product

The Diagnostic Server and its application page have been removed from the product. This feature has been superseded by Tricheck, which is available to administrators through the application. For more information, see Admin Settings Page.

Wrangle now supports nested expressions

Wrangle now supports nested expressions within expressions. For more information, see Changes to the Language.

Language changes

The RAND function without parameters now generates true random numbers. When the source information is not available, the SOURCEROWNUMBER function can still be used. It returns null values in all cases. New functions. See Changes to the Language.

Key Bug Fixes

Ticket Description

TD-36332 Data grid can display wrong results if a sample is collected and dataset is unioned.

TD-36192 Canceling a step in recipe panel can result in column menus disappearing in the data grid.

TD-36011 User can import modified exports or exports from a different version, which do not work.

TD-35916 Cannot logout via SSO

TD-35899 A deployment user can see all deployments in the instance.

TD-35780 Upgrade: Duplicate metadata in separate publications causes DB migration failure.

TD-35746 /v4/importedDatasets GET method is failing.

TD-35644 Extractpatterns with "HTTP Query strings" option doesn't work

TD-35504 Cancel job throws 405 status code error. Clicking Yes repeatedly pops up Cancel Job dialog.

TD-35481 After upgrade, recipe is malformed at splitrows step.

TD-35177 Login screen pops up repeatedly when access permission is denied for a connection.

TD-34822 Case-sensitive variations in date range values are not matched when creating a dataset with parameters.

NOTE: Date range parameters are now case-insensitive.

TD-33428 Job execution on recipe with high limit in split transformation due to Java Null Pointer Error during profiling.

NOTE: Avoid creating datasets that are wider than 2500 columns. Performance can degrade significantly on very wide datasets.

TD-31327 Unable to save dataset sourced from multi-line custom SQL on dataset with parameters.

TD-31252 Assigning a target schema through the Column Browser does not refresh the page.

TD-31165 Job results are incorrect when a sample is collected and then the last transform step is undone.

TD-30979 Transformation job on wide dataset fails on Spark 2.2 and earlier due to exceeding Java JVM limit. For details, see https://issues.apache.org/jira/browse/SPARK-18016.

TD-30857 Matching file path patterns in a large directory can be very slow, especially if using multiple patterns in a single dataset with parameters.

NOTE: To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.

TD-30854 When creating a new dataset from the Export Results window from a CSV dataset with Snappy compression, the resulting dataset is empty when loaded in the Transformer page.

TD-30820 Some string comparison functions process leading spaces differently when executed on the Trifacta Photon or the Spark running environment.

TD-30717 No validation is performed for Redshift or SQL DW connections or permissions prior to job execution. Jobs are queued and then fail.

TD-27860 When the platform is restarted or an HA failover state is reached, any running jobs are stuck forever In Progress.

New Known Issues

Ticket Component Description

TD-35714 Installer/Upgrader/Utilities After installing on Ubuntu 16.04 (Xenial), platform may fail to start with "ImportError: No module named pkg_resources" error.

Workaround: Verify installation of python-setuptools package. Install if missing.

TD-35644 Compilation/Execution Extractpatterns for "HTTP Query strings" option doesn't work.

TD-35562 Compilation/Execution When executing Spark 2.3.0 jobs on S3-based datasets, jobs may fail due to a known incompatibility between HTTPClient 4.5.x and aws-java-sdk 1.10.xx. For details, see https://github.com/apache/incubator-druid/issues/4456.

Workaround: Use Spark 2.1.0 instead. In the Admin Settings page, configure the spark.version property to 2.1.0. For more information, see Admin Settings Page.

For additional details on Spark versions, see Configure for Spark.

TD-35504 Compilation/Execution Clicking the Cancel Job button generates a 405 status code error. Clicking the Yes button fails to close the dialog.

Workaround: After you have clicked the Yes button once, you can click the No button. The job is removed from the page.

TD-35486 Compilation/Execution Spark jobs fail on LCM function that uses negative numbers as inputs.

Workaround: If you wrap the negative number input in the ABS function, the LCM function may be computed. You may have to manually check if a negative value for the LCM output is applicable.

TD-35483 Compilation/Execution Differences in how WEEKNUM function is calculated in Trifacta Photon and Spark running environments, due to the underlying frameworks on which the environments are created.

Trifacta Photon week 1 of the year: The week that contains January 1. Spark week 1 of the year: The week that contains at least four days in the specified year.

For more information, see WEEKNUM Function.

TD-35478 Compilation/Execution The Spark running environment does not support use of multi-character delimiters for CSV outputs. For more information on this issue, see https://issues.apache.org/jira/browse/SPARK-24540.

Workaround: You can switch your job to a different running environment or use single-character delimiters.

TD-34840 Transformer Page Platform fails to provide suggestions for transformations when selecting keys from an object that contains many keys.

TD-34119 Compilation/Execution WASB job fails when publishing two successive appends.

TD-30855 Publish Creating dataset from Parquet-only output results in "Dataset creation failed" error.

NOTE: If you generate results in Parquet format only, you cannot create a dataset from it, even if the Create button is present.

TD-30828 Publish You cannot publish ad-hoc results for a job when another publishing job is in progress for the same job.

Workaround: Please wait until the previous job has been published before retrying to publish the failing job.

TD-27933 Connectivity For multi-file imports lacking a newline in the final record of a file, this final record may be merged with the first one in the next file and then dropped in the Trifacta Photon running environment.

Workaround: Verify that you have inserted a new line at the end of every file-based source.

Quick Start

Install from AWS Marketplace with EMR

Contents:

Scenario Description
Install
Pre-requisites
Internet access
SELinux
Product Limitations
Install
Desktop Requirements
Note about deleting the CloudFormation stack
Verify
Start and Stop the Platform
Verify Operations
Upgrade
Documentation
Related Topics

This documentation applies to installation from a supported Marketplace. Please use the installation instructions provided with your deployment.

If you are installing or upgrading a Marketplace deployment, please use the available PDF content. You must use the install and configuration PDF available through the Marketplace listing.

Scenario Description

CloudFormation templates enable you to install Trifacta® Wrangler Enterprise with a minimal amount of effort.

After install, customizations can be applied by tweaking the resources that were created by the CloudFormation process. If you have additional requirements or a complex environment, please contact Trifacta Support for assistance with your solution.

Install

The CloudFormation template creates a complete working instance of Trifacta Wrangler Enterprise, including the following:

VPC and all required networking infrastructure
EC2 instance with all supporting policies/roles
S3 bucket
EMR cluster
Configurable autoscaling instance groups
All supporting policies/roles

Pre-requisites

If you are integrating the Trifacta platform with an EMR cluster, you must acquire a Trifacta license first. Additional configuration is required. For more information, please contact [email protected].

Before you begin:

1. Read: Please read this entire document before you begin.
2. EULA: Before you begin, please review the End-User License Agreement. See https://docs.trifacta.com/display/PUB/End-User+License+Agreement+-+Trifacta+Wrangler+Enterprise.
3. Trifacta license file: If you have not done so already, please acquire a Trifacta license file from your Trifacta representative.

Internet access

From AWS, the Trifacta platform requires Internet access for the following services:

NOTE: Depending on your AWS deployment, some of these services may not be required.

AWS S3
Key Management System [KMS] (if sse-kms server side encryption is enabled)
Secure Token Service [STS] (if temporary credential provider is used)
EMR (if integration with EMR cluster is enabled)

NOTE: If the Trifacta platform is hosted in a VPC where Internet access is restricted, access to S3, KMS and STS services must be provided by creating a VPC endpoint. If the platform is accessing an EMR cluster, a proxy server can be configured to provide access to the AWS ElasticMapReduce regional endpoint.

SELinux

By default, Trifacta Wrangler Enterprise is installed on a server with SELinux enabled. Security-enhanced Linux (SELinux) provides a set of security features for, among other things, managing access controls.

Tip: The following may be applied to other deployments of the Trifacta platform on servers where SELinux has been enabled.

In some cases, SELinux can interfere with normal operations of platform software. If you are experiencing connectivity problems related to SELinux, you can do either one of the following:

1. Disable SELinux on the server. For more information, please see the CentOS documentation.
2. Apply the following commands on the server, as root:
   1. Open ports on the server for listening.
      1. By default, the Trifacta application listens on port 3005. The following opens that port when SELinux is enabled:

semanage port -a -t http_port_t -p tcp 3005

      2. Repeat the above step for any other ports that you wish to open on the server (see the example after these steps).
   2. Permit nginx, the proxy on the Trifacta node, to open websockets:

setsebool -P httpd_can_network_connect 1
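The first command below can be used to confirm that the port was registered with SELinux; the second shows how the same step would be repeated for an additional port. Port 3006 is a hypothetical example only; substitute whichever port you actually need to open.

semanage port -l | grep http_port_t
semanage port -a -t http_port_t -p tcp 3006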

Product Limitations

The EC2 instance, S3 buckets, and any connected Redshift databases must be located in the same Amazon region. Cross-region integrations are not supported at this time.
No support for Hive integration.
No support for secure impersonation or Kerberos.
No support for high availability and failover.
Job cancellation is not supported on EMR.
When publishing single files to S3, you cannot apply an append publishing action.
Spark 2.4 only.

Install

Desktop Requirements

All desktop users of the platform must have the latest stable version of Google Chrome installed on their desktops. All desktop users must be able to connect to the EC2 instance through the enterprise infrastructure.

Steps:

1. In the Marketplace listing, click Deploy into a new VPC.

2. Choose a Template: The template path is automatically populated for you.
3. Specify Details:
   1. Stack Name: The display name of the stack is used in the names of resources created by the stack and as an identifier for the stack.

NOTE: Each instance of the Trifacta platform must have a separate name.

   2. Instance Type: Please select the appropriate instance depending on the number of users and data volumes of your environment. For more information, see the Sizing Guide above.
   3. Key Pair: This SSH key pair is used to access the Trifacta Instance and the EMR cluster instances.
   4. Allowed HTTP Source: This range of addresses is permitted access to the Trifacta Instance on ports 80, 443, and 3005.
      1. Port numbers 80 and 443 do not have any services by default, but you may modify the Trifacta configuration to enable access via these ports.
   5. Allowed SSH Source: This range of addresses is permitted access to port 22 on the Trifacta Instance.
   6. EMR Cluster Node Configuration: Allows you to customize the configuration of the deployed EMR nodes.
      1. Reasonable values are used as defaults.
      2. If you do customize these values, you should upsize. Avoid downsizing these values.
   7. EMR Cluster Autoscaling Configuration: Allows you to customize the autoscaling settings used by the EMR cluster.
      1. Reasonable values are used as defaults.
4. Options: None of these is required for installation. Specify any options as needed for your environment.
5. Review: Review your installation and configured options.
   1. Select the checkbox at the end of the page.
   2. To launch the stack, click Create.
6. Please wait while the stack creates all required resources.
7. In the Stacks list, select the name of your application. Click the Outputs tab and collect the following information. Instructions on how to use this information are provided later.

Parameter: Trifacta URL value
Description: The URL and port number to which to connect to the Trifacta application.
Use: Users must connect to this IP address and port number for access. By default, it is set to 3005. The access port can be moved to 80 or 443 if desired. Please contact us for more details.

Parameter: Trifacta Bucket
Description: The address of the default S3 bucket.
Use: This value must be applied through the application after it has been deployed.

Parameter: Trifacta Instance Id
Description: The identifier for the instance of the platform.
Use: This value is the default password for the admin account.

NOTE: You must change this password on the first login to the application.

8. After the Trifacta instance has been created, you must add a license file before starting the Trifacta service. In the following steps, you SSH into the server, create the license file, paste in the license file content, and update the ownership and permissions of that file:

1. SSH into the server as the CentOS user, using the key you specified.
2. Change to the root user:

sudo su

3. Add your license:

vi /opt/trifacta/license/license.json

4. Into the above file, paste the contents of the license.json file that was provided to you by your Trifacta representative.
5. Verify permissions on the file:

chown trifacta:trifacta /opt/trifacta/license/license.json
chmod 644 /opt/trifacta/license/license.json

9. Start the Trifacta service:

service trifacta start

10. It may take some time for the server to finish coming online. Navigate to the Trifacta application.
11. When the login screen appears, enter the following:
   1. Username: [email protected]
   2. Password: (the TrifactaInstanceId value)

NOTE: After you login as an admin for the first time, you must change the password.

12. From the application menu, select the Settings menu. Then, click Settings > Admin Settings.
13. In the Admin Settings page, you can configure many aspects of the platform, including user management tasks, and perform restarts to apply the changes.
14. Add the S3 bucket that was automatically created to store Trifacta metadata and EMR content. Search for:

"aws.s3.bucket.name"

   1. Update the value with the Trifacta Bucket value provided when you created the stack in AWS.
15. Verify your Spark version. If the cluster was launched from AWS, this value should be set to 2.4.4. Search for:

"spark.version"

   1. Update its value to 2.4.4, if necessary.
16. Enable the Run in EMR option within the platform. Search for:

"webapp.runinEMR"

   1. Select the checkbox to enable it.
17. Click Save underneath the Platform Settings section.
18. In the Admin Settings page, locate the External Service Settings section.

   1. AWS EMR Cluster ID: Paste the value for the EMR Cluster ID for the cluster to which the platform is connecting.

      1. Verify that there are no extra spaces in any copied value.
   2. AWS Region: Enter the region code where you have deployed the CloudFormation stack.
      1. Example: us-east-1
      2. For a list of available regions, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions.
   3. Resource Bucket: You may use the already created Trifacta Bucket.
      1. Verify that there are no extra spaces in any copied value.
   4. Resource Path: You should use something like EMRLOGS.
19. Click Save underneath the External Service Settings section.
20. When the platform restarts, you can begin using the product.

A consolidated sketch of the settings from steps 14 through 16 is shown below.
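The following is how the settings from steps 14 through 16 might look if edited together in trifacta-conf.json. The bucket name is a placeholder for your own Trifacta Bucket value, and representing the Run in EMR checkbox as a boolean is an assumption; where possible, apply these values through the Admin Settings search fields as described above.

"aws.s3.bucket.name": "<your-trifacta-bucket>",
"spark.version": "2.4.4",
"webapp.runinEMR": true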

Note about deleting the CloudFormation stack

If you must delete the CloudFormation stack, please be aware of the following.

1. The S3 bucket that was created for the stack is not removed. If you want to delete it, you must empty it first and then delete it.
2. Any EMR security groups created for the stack cannot be deleted, due to circular references. The stack deletion process informs you of the security groups that it failed to delete. To complete the deletion:
   1. Remove all rules from the security groups.
   2. Delete the security groups manually.
   3. Re-run the stack deletion, which should complete successfully.

Verify

Start and Stop the Platform

Use the following command line commands to start, stop, and restart the platform.

Start:

sudo service trifacta start

Stop:

sudo service trifacta stop

Restart:

sudo service trifacta restart

Verify Operations

After you have installed or made changes to the platform, you should verify end-to-end operations.

NOTE: The Trifacta® platform is not operational until it is connected to a supported backend datastore.

Steps:

1. Login to the application as an administrator. See Login.
2. Through the Admin Settings page, run Tricheck, which performs tests on the Trifacta node and any connected cluster. See Admin Settings Page.
3. In the application menu bar, click Library. Click Import Dataset. Select your backend datastore.

4. Navigate your datastore directory structure to locate a small CSV or JSON file.

5. Select the file. In the right panel, click Create and Transform.
   1. Troubleshooting: If the steps so far work, then you have read access to the datastore from the platform. If not, please check permissions for the Trifacta user and its access to the appropriate directories.
   2. See Import Data Page.
6. In the Transformer page, some steps have already been added to your recipe, so you can run the job right away. Click Run Job.
   1. See Transformer Page.

7. In the Run Job Page:
   1. For Running Environment, some of these options may not be available. Choose according to the running environment you wish to test.
      1. Photon: Runs the job on the Photon running environment hosted on the Trifacta node. This method of job execution does not utilize any integrated cluster.
      2. Spark: Runs the job on Spark on the integrated cluster.
      3. Databricks: If the platform is integrated with an Azure Databricks cluster, you can test job execution on the cluster.

NOTE: Use of Azure Databricks is not supported on Marketplace installs.

   2. Select CSV and JSON output.
   3. Select the Profile Results checkbox.
   4. Troubleshooting: At this point, you are able to initiate a job for execution on the selected running environment. Later, you can verify operations by running the same job on other available environments.
   5. See Run Job Page.

8. When the job completes, you should see a success message in the Jobs tab of the Flow View page. 1. Troubleshooting: Either the Transform job or the Profiling job may break. To localize the problem, mouse over the Job listing in the Jobs page. Try re-running a job by deselecting the broken job type or running the job in a different environment. You can also download the log files to try to identify the problem. See Jobs Page.

9. Click View Results in the Jobs page. In the Profile tab of the Job Details page, you can see a visual profile of the generated results. 1. See Job Details Page.

10. In the Output Destinations tab, click the CSV and JSON links to download the results to your local desktop. See Import Data Page.

11. Load these results into a local application to verify that the content looks ok.

Upgrade

For more information, see Upgrade for AWS Marketplace with EMR.

Documentation

You can access complete product documentation in online and PDF format. After the platform has been installed, select Help menu > Product Docs from the menu in the Trifacta application.

Upgrade for AWS Marketplace with EMR

For more information on upgrading Trifacta® Wrangler Enterprise, please contact Trifacta Support.

Configure

The following topics describe how to configure Trifacta® Wrangler Enterprise for initial deployment.

For more information on administration tasks and admin resources, see Admin.

Configure for AWS

Contents:

Internet Access
Database Installation
Base AWS Configuration
Base Storage Layer
Configure AWS Region
AWS Authentication
AWS Auth Mode
AWS Credential Provider
AWS Storage
S3 Sources
Redshift Connections
Snowflake Connections
AWS Clusters
EMR
Hadoop

This documentation applies to installation from a supported Marketplace. Please use the installation instructions provided with your deployment.

If you are installing or upgrading a Marketplace deployment, please use the available PDF content. You must use the install and configuration PDF available through the Marketplace listing.

The Trifacta® platform can be hosted within Amazon and supports integrations with multiple services from Amazon Web Services, including combinations of services for hybrid deployments. This section provides an overview of the integration options, as well as links to related configuration topics.

For an overview of AWS deployment scenarios, see Supported Deployment Scenarios for AWS.

Internet Access

From AWS, the Trifacta platform requires Internet access for the following services:

NOTE: Depending on your AWS deployment, some of these services may not be required.

AWS S3
Key Management System [KMS] (if sse-kms server side encryption is enabled)
Secure Token Service [STS] (if temporary credential provider is used)
EMR (if integration with EMR cluster is enabled)

NOTE: If the Trifacta platform is hosted in a VPC where Internet access is restricted, access to S3, KMS and STS services must be provided by creating a VPC endpoint. If the platform is accessing an EMR cluster, a proxy server can be configured to provide access to the AWS ElasticMapReduce regional endpoint.

Database Installation

The following database scenarios are supported.

Database Host Description

Cluster node: By default, the Trifacta databases are installed on PostgreSQL instances on the Trifacta node or another accessible node in the enterprise environment. For more information, see Install Databases.

Amazon RDS: For Amazon-based installations, you can install the Trifacta databases on PostgreSQL instances on Amazon RDS. For more information, see Install Databases on Amazon RDS.

Base AWS Configuration

The following configuration topics apply to AWS in general.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Base Storage Layer

NOTE: The base storage layer must be set during initial configuration and cannot be modified after it is set.

S3: Most of these integrations require use of S3 as the base storage layer, which means that data uploads, default location of writing results, and sample generation all occur on S3. When base storage layer is set to S3, the Trifacta platform can:

read and write to S3
read and write to Redshift
connect to an EMR cluster

HDFS: In on-premises installations, it is possible to use S3 as a read-only option for a Hadoop-based cluster when the base storage layer is HDFS. You can configure the platform to read from and write to S3 buckets during job execution and sampling. For more information, see Enable S3 Access.

For more information on setting the base storage layer, see Set Base Storage Layer.

For more information, see Storage Deployment Options.

Configure AWS Region

For Amazon integrations, you can configure the Trifacta node to connect to Amazon datastores located in different regions.

NOTE: This configuration is required under any of the following deployment conditions:

1. The Trifacta node is installed on-premises, and you are integrating with Amazon resources. 2. The EC2 instance hosting the Trifacta node is located in a different AWS region than your Amazon datastores. 3. The Trifacta node or the EC2 instance does not have access to s3.amazonaws.com.

1. In the AWS console, please identify the location of your datastores in other regions. For more information, see the Amazon documentation. 2. Login to the Trifacta application. 3. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 4. Set the value of the following property to the region where your S3 datastores are located:

aws.s3.region

If the above value is not set, then the Trifacta platform attempts to infer the region based on the default S3 bucket location.
5. Save your changes.
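For example, if your S3 datastores are located in us-west-2 (a placeholder region used here for illustration), the corresponding entry in trifacta-conf.json is a single property, as in this minimal sketch:

"aws.s3.region": "us-west-2",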

AWS Authentication

The following table illustrates the various methods of managing authentication between the platform and AWS. The matrix of options is determined by the settings for two key parameters:

credential provider - the source of credentials: platform (default), instance (EC2 instance only), or temporary
AWS mode - the method of authentication from the platform to AWS: system-wide or per-user

Credential Provider: Default

AWS Mode = System: One system-wide key/secret combo is inserted in the platform for use.
Config:
"aws.credentialProvider": "default",
"aws.mode": "system",
"aws.s3.key": "",
"aws.s3.secret": "",

AWS Mode = User: Each user provides a key/secret combo. See Configure Your Access to S3.
Config:
"aws.credentialProvider": "default",
"aws.mode": "user",

Credential Provider: Instance

AWS Mode = System: Platform uses EC2 instance roles.
Config:
"aws.credentialProvider": "instance",
"aws.mode": "system",

AWS Mode = User: Users provide EC2 instance roles.
Config:
"aws.credentialProvider": "instance",
"aws.mode": "user",

Credential Provider: Temporary

AWS Mode = System: Temporary credentials are issued based on per-user IAM roles.
Config:
"aws.credentialProvider": "temporary",
"aws.mode": "system",

AWS Mode = User: Per-user authentication when using an IAM role.
Config:
"aws.credentialProvider": "instance",
"aws.mode": "user",

AWS Auth Mode

When connecting to AWS, the platform supports the following basic authentication modes:

Mode: system
Configuration: "aws.mode": "system",
Description: Access to AWS resources is managed through a single, system account. The account that you specify is based on the credential provider selected below. The instance credential provider ignores this setting. See below.

Mode: user
Configuration: "aws.mode": "user",
Description: Authentication must be specified for individual users.
Tip: In AWS user mode, Trifacta administrators can manage S3 access for users through the Admin Settings page. See Manage Users.

AWS Credential Provider

The Trifacta platform supports the following methods of providing credentialed access to AWS and S3 resources.

Type: default
Configuration: "aws.credentialProvider": "default",
Description: This method uses the provided AWS Key and Secret values to access resources. See below.

Type: instance
Configuration: "aws.credentialProvider": "instance",
Description: When you are running the Trifacta platform on an EC2 instance, you can leverage your enterprise IAM roles to manage permissions on the instance for the Trifacta platform. See below.

Type: temporary
Description: Details are below.

Default credential provider

Whether the AWS access mode is set to system or user, the default credential provider for AWS and S3 resources is the Trifacta platform.

Mode: system
Description: A single AWS Key and Secret is inserted into platform configuration. This account is used to access all resources and must have the appropriate permissions to do so.
Configuration:
"aws.mode": "system",
"aws.s3.key": "",
"aws.s3.secret": "",

Mode: user
Description: Each user must specify an AWS Key and Secret in the account to access resources. For more information on configuring individual user accounts, see Configure Your Access to S3.
Configuration:
"aws.mode": "user",

Default credential provider with EMR:

If you are using this method and integrating with an EMR cluster:

Copying the custom credential JAR file must be added as a bootstrap action to the EMR cluster definition. See Configure for EMR. As an alternative to copying the JAR file, you can use the EMR EC2 instance-based roles to govern access. In this case, you must set the following parameter:

"aws.emr.forceInstanceRole": true,

For more information, see Configure for EC2 Role-Based Authentication.

Instance credential provider

When the platform is running on an EC2 instance, you can manage permissions through pre-defined IAM roles.

NOTE: If the Trifacta platform is connected to an EMR cluster, you can force authentication to the EMR cluster to use the specified IAM instance role. See Configure for EMR.

For more information, see Configure for EC2 Role-Based Authentication.
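As a recap of the authentication matrix above, a minimal configuration sketch for the instance credential provider in system mode consists of the following two properties (values taken from the matrix earlier in this section):

"aws.credentialProvider": "instance",
"aws.mode": "system",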

Temporary credential provider

For even better security, you can enable the use of temporary credentials provided from your AWS resources, based on an IAM role specified per user.

Tip: This method is recommended by AWS.

Set the following properties.

Property: aws.credentialProvider
Description: If aws.mode = system, set this value to temporary. If aws.mode = user and you are using per-user authentication, this setting is ignored and should remain default.

Per-user authentication

Individual users can be configured to provide temporary credentials for access to AWS resources, which is a more secure authentication solution. For more information, see Configure AWS Per-User Authentication.

AWS Storage

S3 Sources

To integrate with S3, additional configuration is required. See Enable S3 Access.

Redshift Connections

You can create connections to one or more Redshift databases, from which you can read database sources and to which you can write job results. Samples are still generated on S3.

NOTE: Relational connections require installation of an encryption key file on the Trifacta node. For more information, see Create Encryption Key File.

For more information, see Create Redshift Connections.

Snowflake Connections

Through your AWS deployment, you can access your Snowflake databases. For more information, see Create Snowflake Connections.

AWS Clusters

Trifacta Wrangler Enterprise can integrate with one instance of either of the following.

NOTE: If Trifacta Wrangler Enterprise is installed through the Amazon Marketplace, only the EMR integration is supported.

EMR

When Trifacta Wrangler Enterprise is installed through AWS, you can integrate with an EMR cluster for Spark-based job execution. For more information, see Configure for EMR.

Hadoop

If you have installed Trifacta Wrangler Enterprise on-premises or directly into an EC2 instance, you can integrate with a Hadoop cluster for Spark-based job execution. See Configure for Hadoop.

Configure for EC2 Role-Based Authentication

Contents:

IAM roles
AWS System Mode
Additional AWS Configuration
Use of S3 Sources

When you are running the Trifacta platform on an EC2 instance, you can leverage your enterprise IAM roles to manage permissions on the instance for the Trifacta platform. When this type of authentication is enabled, Trifacta administrators can apply a role to the EC2 instance where the platform is running. That role's permissions apply to all users of the platform.

IAM roles

Before you begin, your IAM roles should be defined and attached to the associated EC2 instance.

NOTE: The IAM instance role used for S3 access should have access to resources at the bucket level.

For more information, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.

AWS System Mode

To enable role-based instance authentication, the following parameter must be enabled.

"aws.mode": "system",

Additional AWS Configuration

The following additional parameters must be specified:

Parameter Description

aws.credentialProvider Set this value to instance. IAM instance role is used for providing access.

aws.hadoopFsUseSharedInstanceProvider Set this value to true for Hortonworks and for Cloudera releases earlier than 6.0.0; for Cloudera 6.0.0 and later, set it to false. The class information is provided below.

Shared instance provider class information

Hortonworks:

"com.amazonaws.auth.InstanceProfileCredentialsProvider",

Pre-Cloudera 6.0.0:

"org.apache.hadoop.fs.s3a.SharedInstanceProfileCredentialsProvider"

Cloudera 6.0.0 and later:

Set the above parameters as follows:

"aws.credentialProvider": "instance", "aws.hadoopFSUseSharedInstanceProvider": false,

Use of S3 Sources

To access S3 for storage, additional configuration for S3 may be required.

NOTE: Do not configure the properties that apply to user mode.

Output sizing recommendations:

Single-file output: If you are generating a single file, you should try to keep its size under 1 GB. Multi-part output: For multiple-file outputs, each part file should be under 1 GB in size. For more information, see https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html

For more information, see Enable S3 Access.

Enable S3 Access

Contents:

Base Storage Layer
Limitations
Pre-requisites
Required AWS Account Permissions
Read-only access policies
Write access policies
Other AWS policies for S3
Configuration
Define base storage layer
Enable read access to S3
S3 access modes
System mode - additional configuration
S3 Configuration
Configuration reference
Enable use of server-side encryption
Configure S3 filewriter
Create Redshift Connection
Hadoop distribution-specific configuration
Hortonworks
Additional Configuration for S3
Testing
Troubleshooting
Profiling consistently fails for S3 sources of data
Spark local directory has no space

Below are instructions on how to configure Trifacta® Wrangler Enterprise to point to S3.

Simple Storage Service (S3) is an online data storage service provided by Amazon, which provides low-latency access through web services. For more information, see https://aws.amazon.com/s3/.

NOTE: Please review the limitations before enabling. See Limitations of S3 Integration below.

Base Storage Layer

If base storage layer is S3: you can enable read/write access to S3. If base storage layer is not S3: you can enable read-only access to S3.

Limitations

The Trifacta platform only supports running S3-enabled instances over AWS. Access to AWS S3 Regional Endpoints through internet protocol is required. If the machine hosting the Trifacta platform is in a VPC with no internet access, a VPC endpoint enabled for S3 services is required. The Trifacta platform does not support access to S3 through a proxy server.

Write access requires using S3 as the base storage layer. See Set Base Storage Layer.

NOTE: Spark 2.3.0 jobs may fail on S3-based datasets due to a known incompatibility. For details, see https://github.com/apache/incubator-druid/issues/4456.

If you encounter this issue, please set spark.version to 2.1.0 in platform configuration. For more information, see Admin Settings Page.

Pre-requisites

On the Trifacta node, you must install the Oracle Java Runtime Environment for Java 1.8. Other versions of the JRE are not supported. For more information on the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/index.html.

If IAM instance role is used for S3 access, it must have access to resources at the bucket level.

Required AWS Account Permissions

All access to S3 sources occurs through a single AWS account (system mode) or through an individual user's account (user mode). For either mode, the AWS access key and secret combination must provide access to the default bucket associated with the account.

NOTE: These permissions should be set up by your AWS administrator

Read-only access policies

NOTE: To enable viewing and browsing of all folders within a bucket, the following permissions are required:

The system account or individual user accounts must have the ListAllMyBuckets access permission for the bucket. All objects to be browsed within the bucket must have Get access enabled.

The policy statement to enable read-only access to your default S3 bucket should look similar to the following. Replace 3c-my-s3-bucket with the name of your bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::3c-my-s3-bucket",
        "arn:aws:s3:::3c-my-s3-bucket/*"
      ]
    }
  ]
}

Write access policies

Write access is enabled by adding the PutObject and DeleteObject actions to the above. Replace 3c-my-s3-bucket with the name of your bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::3c-my-s3-bucket",
        "arn:aws:s3:::3c-my-s3-bucket/*"
      ]
    }
  ]
}

Other AWS policies for S3

KMS policy

If any accessible bucket is encrypted with KMS-SSE, another policy must be deployed. For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/iam-policies.html.
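The exact policy depends on how your keys are managed. The following is an illustrative sketch only; the key ARN is a placeholder, and the set of KMS actions shown (encrypt, decrypt, data-key generation, and key description) is a typical minimum for reading and writing SSE-KMS encrypted objects:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
    }
  ]
}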

Configuration

Depending on your S3 environment, you can define:

read access to S3
access to additional S3 buckets
S3 as base storage layer
write access to S3
the S3 bucket that is the default write destination

Define base storage layer

The base storage layer is the default platform for storing results.

Required for:

Write access to S3 Connectivity to Redshift

The base storage layer for your Trifacta instance is defined during initial installation and cannot be changed afterward.

If S3 is the base storage layer, you must also define the default storage bucket to use during initial installation, which cannot be changed at a later time. See Define default S3 write bucket below.

For more information on the various options for storage, see Storage Deployment Options.

For more information on setting the base storage layer, see Set Base Storage Layer.

Enable read access to S3

When read access is enabled, Trifacta users can explore S3 buckets for creating datasets.

NOTE: When read access is enabled, Trifacta users have automatic access to all buckets to which the specified S3 user has access. You may want to create a specific user account for S3 access.

NOTE: Data that is mirrored from one S3 bucket to another might inherit the permissions from the bucket where it is owned.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Set the following property to true:

"aws.s3.enabled": true,

3. Save your changes. 4. In the S3 configuration section, set enabled=true, which allows Trifacta users to browse S3 buckets through the Trifacta application. 5. Specify the AWS key and secret values for the user to access S3 storage.

S3 access modes

The Trifacta platform supports the following modes for accessing S3. You must choose one access mode and then complete the related configuration steps.

NOTE: Avoid switching between user mode and system mode, which can disable user access to data. At install time, you should choose your preferred mode.

System mode

(default) Access to S3 buckets is enabled and defined for all users of the platform. All users use the same AWS access key, secret, and default bucket.

System mode - read-only access

For read-only access, the key, secret, and default bucket must be specified in configuration.

NOTE: Please verify that the AWS account has all required permissions to access the S3 buckets in use. The account must have the ListAllMyBuckets ACL among its permissions.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Locate the following parameters:

Parameters Description

aws.s3.key Set this value to the AWS key to use to access S3.

aws.s3.secret Set this value to the secret corresponding to the AWS key provided.

aws.s3.bucket.name Set this value to the name of the S3 bucket from which users may read data.

NOTE: Additional buckets may be specified. See below.

3. Save your changes.
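Taken together, the parameters above might look like the following sketch in trifacta-conf.json; the key, secret, and bucket values are placeholders:

"aws.s3.key": "<your-access-key-id>",
"aws.s3.secret": "<your-secret-access-key>",
"aws.s3.bucket.name": "my-default-bucket",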

User mode

Optionally, access to S3 can be defined on a per-user basis. This mode allows administrators to define access to specific buckets using various key/secret combinations as a means of controlling permissions.

NOTE: When this mode is enabled, individual users must have AWS configuration settings applied to their account, either by an administrator or by themselves. The global settings in this section do not apply in this mode.

To enable:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Please verify that the settings below have been configured:

"aws.s3.enabled": true, "aws.mode": "user",

3. Additional configuration is required for per-user authentication. For more information, see Configure AWS Per-User Authentication.

User mode - Create encryption key file

If you have enabled user mode for S3 access, you must create and deploy an encryption key file. For more information, see Create Encryption Key File.

NOTE: If you have enabled user access mode, you can skip the following sections, which pertain to the system access mode, and jump to the Enable Redshift Connection section below.

System mode - additional configuration

The following sections apply only to system access mode.

Define default S3 write bucket

When S3 is defined as the base storage layer, write access to S3 is enabled. The Trifacta platform attempts to store outputs in the designated default S3 bucket.

NOTE: This bucket must be set during initial installation. Modifying it at a later time is not recommended and can result in inaccessible data in the platform.

NOTE: Bucket names cannot have underscores in them. See http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html.

Steps:

1. Define S3 to be the base storage layer. See Set Base Storage Layer. 2. Enable read access. See Enable read access. 3. Specify a value for aws.s3.bucket.name , which defines the S3 bucket where data is written. Do not include a protocol identifier. For example, if your bucket address is s3://MyOutputBucket, the value to specify is the following:

MyOutputBucket

NOTE: Specify the top-level bucket name only. There should not be any backslashes in your entry.

NOTE: This bucket also appears as a read-access bucket if the specified S3 user has access.

Adding additional S3 buckets

When read access is enabled, all S3 buckets of which the specified user is the owner appear in the Trifacta application. You can also add additional S3 buckets from which to read.

NOTE: Additional buckets are accessible only if the specified S3 user has read privileges.

NOTE: Bucket names cannot have underscores in them.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Locate the following parameter: aws.s3.extraBuckets:

1. In the Admin Settings page, specify the extra buckets as a comma-separated string of additional S3 buckets that are available for storage. Do not put any quotes around the string. Whitespace between string values is ignored. 2. In trifacta-conf.json, specify the extraBuckets array as a comma-separated list of buckets as in the following:

"extraBuckets": ["MyExtraBucket01","MyExtraBucket02","MyExtraBucket03"]

NOTE: Specify the top-level bucket name only. There should not be any backslashes in your entry.


3. These values are mapped to the following bucket addresses:

s3://MyExtraBucket01
s3://MyExtraBucket02
s3://MyExtraBucket03

S3 Configuration

Configuration reference

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

"aws.s3.enabled": true,
"aws.s3.bucket.name": "",
"aws.s3.key": "",
"aws.s3.secret": "",
"aws.s3.extraBuckets": [""]

Setting Description

enabled When set to true , the S3 file browser is displayed in the GUI for locating files. For more information, see S3 Browser.

bucket.name Set this value to the name of the S3 bucket to which you are writing.

When webapp.storageProtocol is set to s3 , the output is delivered to aws.s3.bucket.name.

key Access Key ID for the AWS account to use.

NOTE: This value cannot contain a slash (/ ).

secret Secret Access Key for the AWS account.

extraBuckets Add references to any additional S3 buckets to this comma- separated array of values.

The S3 user must have read access to these buckets.

Enable use of server-side encryption

You can configure the Trifacta platform to publish data on S3 when a server-side encryption policy is enabled. SSE-S3 and SSE-KMS methods are supported. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html.

Notes:

When encryption is enabled, all buckets to which you are writing must share the same encryption policy. Read operations are unaffected.

To enable, please specify the following parameters.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Server-side encryption method

"aws.s3.serverSideEncryption": "none",

Set this value to the method of encryption used by the S3 server. Supported values:

NOTE: Lower case values are required.

sse-s3
sse-kms
none

Server-side KMS key identifier

When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption.

"aws.s3.serverSideKmsKeyId": "",

Notes:

Access to the key: Access must be provided to the authenticating user. The AWS IAM role must be assigned to this key.
Encrypt/Decrypt permissions for the specified KMS key ID: Permissions must be assigned to the authenticating user. The AWS IAM role must be given these permissions. For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html.

The format for referencing this key is the following:

"arn:aws:kms:::key/"

You can use an AWS alias in the following formats. The format of the AWS-managed alias is the following:

"alias/aws/s3"

The format for a custom alias is the following:

"alias/"

where the value following alias/ is the name of the alias for the entire key.
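For example, to enable SSE-KMS using a custom alias, the two properties described in this section might be set as follows; the alias name is a placeholder:

"aws.s3.serverSideEncryption": "sse-kms",
"aws.s3.serverSideKmsKeyId": "alias/my-trifacta-key",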

Save your changes and restart the platform.

Configure S3 filewriter

The following configuration can be applied through the Hadoop site-config.xml file. If your installation does not have a copy of this file, you can insert the properties listed in the steps below into trifacta-conf.json to configure the behavior of the S3 filewriter.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Locate the filewriter.hadoopConfig block, where you can insert the following Hadoop configuration properties:

"filewriter": {
  "max": 16,
  "hadoopConfig": {
    "fs.s3a.buffer.dir": "/tmp",
    "fs.s3a.fast.upload": "true"
  },
  ...
}

Property Description

fs.s3a.buffer.dir Specifies the temporary directory on the Trifacta node to use for buffering when uploading to S3. If fs.s3a.fast.upload is set to false, this parameter is unused.

NOTE: This directory must be accessible to the Batch Job Runner process during job execution.

fs.s3a.fast.upload Set to true to enable buffering in blocks.

When set to false , buffering in blocks is disabled. For a given file, the entire object is buffered to the disk of the Trifacta node. Depending on the size and volume of your datasets, the node can run out of disk space.

3. Save your changes and restart the platform.

Create Redshift Connection

For more information, see Create Redshift Connections.

Hadoop distribution-specific configuration

Hortonworks

NOTE: If you are using Spark profiling through Hortonworks HDP on data stored in S3, additional configuration is required. See Configure for Hortonworks.

Additional Configuration for S3

The following parameters can be configured through the Trifacta platform to affect the integration with S3. You may or may not need to modify these values for your deployment.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Parameter Description

aws.s3.consistencyTimeout S3 does not guarantee that at any time the files that have been written to a directory will be consistent with the files available for reading. S3 does guarantee that eventually the files are in sync.

This guarantee is important for some platform jobs that write data to S3 and then immediately attempt to read from the written data.

This timeout defines how long the platform waits for this guarantee. If the timeout is exceeded, the job is failed. The default value is 120 .

Depending on your environment, you may need to modify this value.

aws.s3.endpoint If your S3 deployment is either of the following:

located in a region that does not support the default endpoint, or v4-only signature is enabled in the region

Then, you can specify this setting to point to the S3 endpoint for Java/Spark services. This value should be the S3 endpoint DNS name value.

For more information on this location, see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.
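As an illustration, a deployment whose buckets are in a v4-signature-only region might set these parameters as follows; the endpoint and timeout values are examples only:

"aws.s3.consistencyTimeout": 120,
"aws.s3.endpoint": "s3.eu-central-1.amazonaws.com",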

Testing

Restart services. See Start and Stop the Platform.

Try running a simple job from the Trifacta application. For more information, see Verify Operations.

Troubleshooting

Profiling consistently fails for S3 sources of data

If you are executing visual profiles of datasets sourced from S3, you may see an error similar to the following in the batch-job-runner.log file:

01:19:52.297 [Job 3] ERROR com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - BatchFileWriterException: Batch File Writer unknown error: {jobId=3, why=bound must be positive}
01:19:52.298 [Job 3] INFO com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - Notifying monitor for job 3 with status code FAILURE

This issue is caused by improperly configured buffering for jobs that write to S3. The specified local buffer directory cannot be accessed by the batch job runner process, and the job fails to write results to S3.

Solution:

You may do one of the following:

Use a valid temp directory when buffering to S3.
Disable buffering to directory completely.

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. Locate the following, where you can insert either of the following Hadoop configuration properties:

"filewriter": {
  "max": 16,
  "hadoopConfig": {
    "fs.s3a.buffer.dir": "/tmp",
    "fs.s3a.fast.upload": false
  },
  ...
}

Property Description

fs.s3a.buffer.dir Specifies the temporary directory on the Trifacta node to use for buffering when uploading to S3. If fs.s3a.fast.upload is set to false, this parameter is unused.

fs.s3a.fast.upload When set to false , buffering is disabled.

Save your changes and restart the platform.

Spark local directory has no space

During execution of a Spark job, you may encounter the following error:

org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.

Solution:

Restart Trifacta services, which may free up some temporary space.
Use the steps in the preceding solution to reassign a temporary directory for Spark to use (fs.s3a.buffer.dir).

Create Redshift Connections

Contents:

Pre-requisites
Limitations
Create Connection
Create through application
Testing

This section provides information on how to enable Redshift connectivity and create one or more connections to Redshift sources.

Amazon Redshift is a hosted data warehouse available through Amazon Web Services. It is frequently used for hosting of datasets used by downstream analytic tools such as Tableau and Qlik. For more information, see https://aws.amazon.com/redshift/.

Pre-requisites

Before you begin, please verify that your Trifacta® environment meets the following requirements:

NOTE: In the Admin Settings page are some deprecated parameters pertaining to Redshift. Please ignore these parameters and their settings. They do not apply to this release.

1. S3 base storage layer: Redshift access requires use of S3 as the base storage layer, which must be enabled. See Set Base Storage Layer. 2. Same region: The Redshift cluster must be in the same region as the default S3 bucket. 3. Integration: Your Trifacta instance is connected to a running environment supported by your product edition.

4. Deployment: Trifacta platform is deployed either on-premises or in EC2.

Limitations

1. When publishing to Redshift through the Publishing dialog, output must be in Avro or JSON format. This limitation does not apply to direct writing to Redshift.
2. You can publish any specific job once to Redshift through the export window. See Publishing Dialog.
3. Management of nulls:
   1. Nulls are displayed as expected in the Trifacta application.
   2. When Redshift jobs are run, the UNLOAD SQL command in Redshift converts all nulls to empty strings. Null values appear as empty strings in generated results, which can be confusing. This is a known issue with Redshift.

Create Connection

You can create Redshift connections through the following methods.

Tip: SSL connections are recommended. Details are below.

Create through application

Any user can create a Redshift connection through the application.

Steps:

1. Login to the application. 2. In the menu, click Settings menu > Settings > Connections. 3. In the Create Connection page, click the Redshift connection card. 4. Specify the properties for your Redshift database connection. The following parameters are specific to Redshift connections:

Property Description

IAM Role ARN for Redshift-S3 Connectivity (Optional) You can specify an IAM role ARN that enables role-based connectivity between Redshift and the S3 bucket that is used as intermediate storage during Redshift bulk COPY/UNLOAD operations. Example:

arn:aws:iam::1234567890:role/MyRedshiftRole

For more information, see Create Connection Window. 5. Click Save.

Enable SSL connections

To enable SSL connections to Redshift, you must enable them first on your Redshift cluster. For more information, see https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-ssl-support.html.

In your connection to Redshift, please add the following string to your Connect String Options:

;ssl=true

Save your changes.

Testing

Import a dataset from Redshift. Add it to a flow, and specify a publishing action. Run a job.

NOTE: When publishing to Redshift through the Publishing dialog, output must be in Avro or JSON format. This limitation does not apply to direct writing to Redshift.

For more information, see Verify Operations.

After you have run your job, you can publish the results to Redshift through the Job Details page. See Publishing Dialog.

Configure for EMR

Contents:

Supported Versions
Limitations
Create EMR Cluster
Cluster options
Specify cluster roles
Authentication
EMRFS consistent view is recommended
Set up S3 Buckets
Bucket setup
Set up EMR resources buckets
Access Policies
EC2 instance profile
EMR roles
EMRFS consistent view policies
Configure Trifacta platform for EMR
Change admin password
Verify S3 as base storage layer
Set up S3 integration
Enable EMR integration
Apply EMR cluster ID
Extract IP address of master node in private sub-net
EMR Authentication for the Trifacta platform
Configure Spark for EMR
Default Hadoop job results format
Additional configuration for EMR
Optional Configuration
Configure for Redshift
Switch EMR Cluster
Configure Batch Job Runner
Modify Job Tag Prefix
Testing

You can configure your instance of the Trifacta platform to integrate with Amazon Elastic MapReduce (EMR), a highly scalable Hadoop-based execution environment.

NOTE: This section applies only to installations of Trifacta Wrangler Enterprise where a license key file has been acquired from Trifacta and applied to the platform.

Amazon EMR (Elastic MapReduce) provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. For more information on EMR, see http://docs.aws.amazon.com/cli/latest/reference/emr/.

Supported Versions

This section outlines how to create a new EMR cluster and integrate the Trifacta platform with it. The platform can be integrated with existing EMR clusters.

Supported Versions: EMR 5.8 - EMR 5.27

NOTE: EMR 5.20 - EMR 5.27 requires Spark 2.4. For more information, see Configure for Spark.

Limitations

NOTE: Job cancellation is not supported on an EMR cluster.

The Trifacta platform must be installed on AWS.

Create EMR Cluster

Use the following section to set up your EMR cluster for use with the Trifacta platform.

Via AWS EMR UI: This method is assumed in this documentation.
Via AWS command line interface: For this method, it is assumed that you know the required steps to perform the basic configuration. For custom configuration steps, additional documentation is provided below.

NOTE: It is recommended that you set up your cluster for exclusive use by the Trifacta platform.

Cluster options

In the Amazon EMR console, click Create Cluster. Click Go to advanced options. Complete the sections listed below.

NOTE: Please be sure to read all of the cluster options before setting up your EMR cluster.

NOTE: Please perform your configuration through the Advanced Options workflow.

For more information on setting up your EMR cluster, see http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html.
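If you are using the AWS command line, the following is a rough sketch of an equivalent create-cluster call reflecting the software and logging settings described below. The cluster name, bucket, key pair, instance type and count, and configurations file are placeholders, and --use-default-roles is a simplification; substitute the custom roles and the sizing recommended for your deployment:

aws emr create-cluster \
  --name "trifacta-emr" \
  --release-label emr-5.15.0 \
  --applications Name=Hadoop Name=Hue Name=Ganglia Name=Spark \
  --instance-type m4.2xlarge --instance-count 3 \
  --use-default-roles \
  --log-uri s3://my-emr-log-bucket/logs/ \
  --termination-protected \
  --configurations file://capacity-scheduler.json \
  --ec2-attributes KeyName=my-key-pair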

Advanced Options

In the Advanced Options screen, please select the following:

Software Configuration:

Release: 5.15
Select:
Hadoop 2.8.3
Hue 3.12.0
Ganglia 3.7.2

Tip: Although it is optional, Ganglia is recommended for monitoring cluster performance.

Spark 2.3.0

NOTE: Additional configuration is required. You must apply the Spark version number in the spark.version property in Admin Settings. See Configure for Spark .

Deselect everything else. Edit the software settings: Copy and paste the following into Enter Configuration:

[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]

Auto-terminate cluster after the last step is completed: Leave this option disabled.

Hardware configuration

NOTE: Please apply the sizing information for your EMR cluster that was recommended for you. If you have not done so, please contact your Trifacta representative.

General Options

Cluster name: Provide a descriptive name.
Logging: Enable logging on the cluster.
S3 folder: Please specify the S3 bucket and path to the logging folder.

NOTE: Please verify that this location is read accessible to all users of the platform. See below for details.

Debugging: Enable.
Termination protection: Enable.
Tags: No options required.
Additional Options:
EMRFS consistent view: You should enable this setting. Doing so may incur additional costs. For more information, see EMRFS consistent view is recommended below.
Custom AMI ID: None.
Bootstrap Actions: If you are using a custom credential provider JAR, you must create a bootstrap action.

NOTE: This configuration must be completed before you create the EMR cluster. For more information, see Authentication below.

Security Options

EC2 key pair: Please select a key pair to use if you wish to access EMR nodes via SSH.
Permissions: Set to Custom to reduce the scope of permissions. For more information, see EMR cluster policies below.

NOTE: Default permissions give access to everything in the cluster.

Encryption Options: No requirements.
EC2 Security Groups:

The selected security group for the master node on the cluster must allow TCP traffic from the Trifacta instance on port 8088. For more information, see System Ports.

Create cluster and acquire cluster ID

If you performed all of the configuration, including the sections below, you can create the cluster.

NOTE: You must acquire your EMR cluster ID for use in configuration of the Trifacta platform.

Specify cluster roles

The following cluster roles and their permissions are required. For more information on the specifics of these policies, see EMR cluster policies.

EMR Role:
  Read/write access to log bucket
  Read access to resource bucket
EC2 instance profile:
  If using instance mode: EC2 profile should have read/write access for all users.
  EC2 profile should have same permissions as EC2 Edge node role.
  Read/write access to log bucket
  Read access to resource bucket
Auto-scaling role:
  Read/write access to log bucket
  Read access to resource bucket
  Standard auto-scaling permissions

Authentication

You can use one of two methods for authenticating the EMR cluster:

Role-based IAM authentication (recommended): This method leverages your IAM roles on the EC2 instance.
Custom credential provider JAR file: This method utilizes a JAR file provided with the platform. This JAR file must be deployed to all nodes on the EMR cluster through a bootstrap action script.

Role-based IAM authentication

You can leverage your IAM roles to provide role-based authentication to the S3 buckets.

NOTE: The IAM role that is assigned to the EMR cluster and to the EC2 instances on the cluster must have access to the data of all users on S3.

For more information, see Configure for EC2 Role-Based Authentication.

Specify the custom credential provider JAR file

If you are not using IAM roles for access, you can manage access using either of the following:

AWS key and secret values specified in trifacta-conf.json
AWS user mode

In either scenario, you must use the custom credential provider JAR provided in the installation. This JAR file must be available to all nodes of the EMR cluster.

After you have installed the platform and configured the S3 buckets, please complete the following steps to deploy this JAR file.

NOTE: These steps must be completed before you create the EMR cluster.

NOTE: This section applies if you are using the default credential provider mechanism for AWS and are not using the IAM instance-based role authentication mechanism.

Steps:

1. From the installation of the Trifacta platform, retrieve the following file:

[TRIFACTA_INSTALL_DIR]/aws/credential-provider/build/libs/trifacta-aws-emr-credential-provider.jar

2. Upload this JAR file to an S3 bucket location where the EMR cluster can access it:

1. Via AWS Console S3 UI: See http://docs.aws.amazon.com/cli/latest/reference/s3/index.html. 2. Via AWS command line:

aws s3 cp trifacta-aws-emr-credential-provider.jar s3:///

3. Create a bootstrap action script named configure_emrfs_lib.sh. The contents must be the following:

sudo aws s3 cp s3:///trifacta-aws-emr-credential-provider.jar /usr/share/aws/emr/emrfs/auxlib/

4. This script must be uploaded into S3 in a location that can be accessed from the EMR cluster. Retain the full path to this location. 5. Add bootstrap action to EMR cluster configuration. 1. Via AWS Console S3 UI: Create the bootstrap action to point to the script you uploaded on S3.

2. Via AWS command line: 1. Upload the configure_emrfs_lib.sh file to the accessible S3 bucket. 2. In the command line cluster creation script, add a custom bootstrap action, such as the following:

--bootstrap-actions '[{"Path":"s3:///configure_emrfs_lib.sh","Name":"Custom action"}]'

When the EMR cluster is launched with the above custom bootstrap action, the cluster does one of the following:

Interacts with S3 using the credentials specified in trifacta-conf.json.
If aws.mode = user, the credentials registered by the individual user are used.

For more information about AWSCredentialsProvider for EMRFS please see:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-credentialsprovider.html https://aws.amazon.com/blogs/big-data/securely-analyze-data-from-another-aws-account-with-emrfs/

EMRFS consistent view is recommended

Although it is not required, you should enable the consistent view feature for EMRFS on your cluster.

During job execution, including profiling jobs, on EMR, the Trifacta platform writes files in rapid succession, and these files are quickly read back from storage for further processing. However, Amazon S3 does not provide a guarantee of a consistent file listing until a later time.

To ensure that the Trifacta platform does not begin reading back an incomplete set of files, you should enable EMRFS consistent view.

NOTE: If EMRFS consistent view is enabled, additional policies must be added for users and the EMR cluster. Details are below.

NOTE: If EMRFS consistent view is not enabled, profiling jobs may not get a consistent set of files at the time of execution. Jobs can fail or generate inconsistent results.

For more information on EMRFS consistent view, see http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html.

DynamoDB

Amazon's DynamoDB is automatically enabled to store metadata for EMRFS consistent view.

NOTE: DynamoDB incurs costs while it is in use. For more information, see https://aws.amazon.com/dynamodb/pricing/.

NOTE: DynamoDB does not automatically purge metadata after a job completes. You should configure periodic purges of the database during off-peak hours.

Enable output job manifest

When EMRFS consistent view is enabled on the cluster, the platform must be configured to use it. During job execution, the platform can use consistent view to create a manifest file of all files generated during job execution. When the job results are published to an external target, this manifest file ensures proper publication.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Locate the following parameter and set it to true:

"feature.enableJobOutputManifest": true,

3. Save your changes and restart the platform.

Set up S3 Buckets

Bucket setup

You must set up S3 buckets for read and write access.

NOTE: Within the Trifacta platform, you must enable use of S3 as the default storage layer. This configuration is described later.

For more information, see Enable S3 Access.

Set up EMR resources buckets

On the EMR cluster, all users of the platform must have access to the following locations:

Location: EMR Resources bucket and path
Description: The S3 bucket and path where resources can be stored by the Trifacta platform for execution of Spark jobs on the cluster. The locations are configured separately in the Trifacta platform.
Required Access: Read/Write

Location: EMR Logs bucket and path
Description: The S3 bucket and path where logs are written for cluster job execution. These locations are configured on the Trifacta platform later.
Required Access: Read

Access Policies

EC2 instance profile

Trifacta users require the following policies to run jobs on the EMR cluster:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:AddJobFlowSteps",
        "elasticmapreduce:DescribeStep",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:ListInstanceGroups"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::__EMR_LOG_BUCKET__",
        "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__",
        "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
      ]
    }
  ]
}

EMR roles

The following policies should be assigned to the EMR roles listed below for read/write access:

{
  "Effect": "Allow",
  "Action": [
    "s3:*"
  ],
  "Resource": [
    "arn:aws:s3:::__EMR_LOG_BUCKET__",
    "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
    "arn:aws:s3:::__EMR_RESOURCE_BUCKET__",
    "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
  ]
}

EMRFS consistent view policies

If EMRFS consistent view is enabled, the following policy must be added for users and the EMR cluster permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "dynamodb:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    }
  ]
}

Configure Trifacta platform for EMR

Please complete the following sections to configure the Trifacta platform to communicate with the EMR cluster.

Change admin password

As soon as you have installed the software, you should login to the application and change the admin password. The initial admin password is the instanceId for the EC2 instance. For more information, see Change Password.

Verify S3 as base storage layer

EMR integration requires use of S3 as the base storage layer.

NOTE: The base storage layer must be set during initial installation and set up of the Trifacta node.

See Set Base Storage Layer.

Set up S3 integration

To integrate with S3, additional configuration is required. See Enable S3 Access.

Enable EMR integration

After you have configured S3 to be the base storage layer, you must enable EMR integration.

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

1. Search for the following setting:

"webapp.runInEMR": false,

2. Set the above value to true. 3. Set the following values:

"webapp.runWithSparkSubmit": false,

4. Verify the following property values:

"webapp.runInTrifactaServer": true, "webapp.runInEMR": true, "webapp.runWithSparkSubmit": false, "webapp.runInDataflow": false, "photon.enabled": true,

Apply EMR cluster ID

The Trifacta platform must be aware of the EMR cluster to which to connect.

Steps:

1. Administrators can apply this configuration change through the Admin Settings Page in the application. If the application is not available, the settings are available in trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Under External Service Settings, enter your AWS EMR Cluster ID. Click the Save button below the textbox.

For more information, see Admin Settings Page.

Extract IP address of master node in private sub-net

If you have deployed your EMR cluster on a private sub-net that is accessible outside of AWS, you must enable this property, which permits the extraction of the IP address of the master cluster node through DNS.

NOTE: This feature must be enabled if your EMR is accessible outside of AWS on a private network.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Locate the following property and set it to true:

"emr.extractIPFromDNS": false,

3. Save your changes and restart the platform.

EMR Authentication for the Trifacta platform

Depending on the authentication method you used, you must set the following properties.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Authentication method: Use default credential provider for all Trifacta access, including EMR.
NOTE: This method requires the deployment of a custom credential provider JAR.
Properties and values:
"aws.credentialProvider": "default",
"aws.emr.forceInstanceRole": false,

Authentication method: Use default credential provider for all Trifacta access. However, EC2 role-based IAM authentication is used for EMR.
Properties and values:
"aws.credentialProvider": "default",
"aws.emr.forceInstanceRole": true,

Authentication method: EC2 role-based IAM authentication for all Trifacta access.
Properties and values:
"aws.credentialProvider": "instance",

Configure Spark for EMR

For EMR, you can configure a set of Spark-related properties to manage the integration and its performance.

Configure Spark version

Depending on the version of EMR with which you are integrating, the Trifacta platform must be modified to use the appropriate version of Spark to connect to EMR. For more information, see Configure for Spark.

Specify YARN queue for Spark jobs

Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the Trifacta platform are submitted to this queue.

Steps:

1. In platform configuration, locate the following:

"spark.props.spark.yarn.queue"

2. Specify the name of the queue. 3. Save your changes.
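For example, if your cluster has a dedicated queue named trifacta (a hypothetical queue name), the setting would look like the following sketch:

"spark.props.spark.yarn.queue": "trifacta",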

Allocation properties

The following properties must be passed from the Trifacta platform to Spark for proper execution on the EMR cluster.

To apply this configuration change, log in as an administrator to the Trifacta node. Then, edit trifacta-conf.json. Some of these settings may not be available through the Admin Settings Page. For more information, see Platform Configuration Methods.

NOTE: Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through the Spark configuration block. Ignore any references in trifacta-conf.json to these properties and their settings.

"spark": {
  ...
  "props": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.executor.instances": "0",
    "spark.executor.memory": "2048M",
    "spark.executor.cores": "2",
    "spark.driver.maxResultSize": "0"
  }
  ...
}

Property: spark.dynamicAllocation.enabled
Description: Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors.
Value: true

Property: spark.shuffle.service.enabled
Description: Enable the Spark shuffle service, which manages the shuffle data for jobs instead of the executors.
Value: true

Property: spark.executor.instances
Description: Default count of executor instances.
Value: See Sizing Guide.

Property: spark.executor.memory
Description: Default memory allocation of executor instances.
Value: See Sizing Guide.

Property: spark.executor.cores
Description: Default count of executor cores.
Value: See Sizing Guide.

Property: spark.driver.maxResultSize
Description: Enable serialized results of unlimited size by setting this parameter to zero (0).
Value: 0

Default Hadoop job results format

For smaller datasets, the platform recommends using the Trifacta Photon running environment.

For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.

As needed, you can change this default format. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

"webapp.defaultHadoopFileFormat": "csv",

Accepted values: csv, json, avro, pqt

For more information, see Run Job Page.

Additional configuration for EMR

You can set the following parameters as needed:

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Property Required Description

aws.emr.resource.bucket Y S3 bucket name where Trifacta executables, libraries, and other resources can be stored that are required for Spark execution.

aws.emr.resource.path Y

S3 path within the bucket where resources can be stored for job execution on the EMR cluster.

NOTE: Do not include leading or trailing slashes for the path value.

aws.emr.proxyUser Y This value defines the user that Trifacta users use for connecting to the cluster.

NOTE: Do not modify this value.

aws.emr.maxLogPollingRetries N Configure maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is 5 .

aws.emr.tempfilesCleanupAge N Defines the number of days that temporary files in the /trifacta/tempfiles directory on EMR HDFS are permitted to age.

By default, this value is set to 0 , which means that batch job files are cleaned up after every job run.

If needed, you can set this to a positive integer value. During each job run, the platform scans this directory for temp files older than the specified number of days and removes any that are found. This cleanup provides an additional level of system hygiene.

Before enabling this secondary cleanup process, please execute the following command to clear the tempfiles directory:

hdfs dfs -rm -r -skipTrash /trifacta/tempfiles
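Putting the parameters above together, a sketch of the corresponding entries in trifacta-conf.json might look like the following; the bucket name, path, and numeric values are placeholders to adjust for your environment:

"aws.emr.resource.bucket": "my-emr-resource-bucket",
"aws.emr.resource.path": "trifacta/resources",
"aws.emr.maxLogPollingRetries": 20,
"aws.emr.tempfilesCleanupAge": 7,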

Optional Configuration

Configure for Redshift

For more information on configuring the platform to integrate with Redshift, see Create Redshift Connections.

Switch EMR Cluster

If needed, you can switch to a different EMR cluster through the application. For example, if the original cluster suffers a prolonged outage, you can switch clusters by entering the cluster ID of a new cluster. For more information, see Admin Settings Page.

Configure Batch Job Runner

Batch Job Runner manages jobs executed on the EMR cluster. You can modify aspects of how jobs are executed and how logs are collected. For more information, see Configure Batch Job Runner.

Modify Job Tag Prefix

In environments where the EMR cluster is shared with other job-executing applications, you can review and specify the job tag prefix, which is prepended to job identifiers to avoid conflicts with other applications.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods. 2. Locate the following and modify if needed:

"aws.emr.jobTagPrefix": "TRIFACTA_JOB_",

3. Save your changes and restart the platform.

Testing

1. Load a dataset from the EMR cluster.
2. Perform a few simple steps on the dataset.
3. Click Run Job in the Transformer page.
4. When specifying the job:
   1. Click the Profile Results checkbox.
   2. Select Spark.
5. When the job completes, verify that the results have been written to the appropriate location.

Contact Support

Do you need further assistance? Check out the resources below:

Email

[email protected]

Search Support

In Trifacta® Wrangler Enterprise, click the Help icon and select Search Help to search our help content.

If your question is not answered through search, you can file a support ticket through the Support Portal (see below).

Trifacta Community and Support Portal

The Trifacta Community and Support Portal can be reached at: https://community.trifacta.com

Within our Community, you can:

Manage your support cases
Get free Wrangler certifications
Post questions to the community
Search our AI-driven knowledgebase
Answer questions to earn points and get on the leaderboard
Watch tutorials, access documentation, learn about new features, and more

Legal

Third-Party License Information

Copyright © 2020 Trifacta Inc.

This product also includes the following libraries which are covered by The (MIT AND BSD-3-Clause):

sha.js

This product also includes the following libraries which are covered by The 2-clause BSD License:

double_metaphone

This product also includes the following libraries which are covered by The 3-Clause BSD License:

com.google.protobuf.protobuf-java com.google.protobuf.protobuf-java-util

This product also includes the following libraries which are covered by The ASL:

funcsigs org.json4s.json4s-ast_2.10 org.json4s.json4s-core_2.10 org.json4s.json4s-native_2.10

This product also includes the following libraries which are covered by The ASL 2.0:

pykerberos

This product also includes the following libraries which are covered by The Amazon Redshift ODBC and JDBC Driver License Agreement:

RedshiftJDBC41-1.1.7.1007 com.amazon.redshift.RedshiftJDBC41

This product also includes the following libraries which are covered by The Apache 2.0 License:

com.uber.jaeger.jaeger-b3 com.uber.jaeger.jaeger-core com.uber.jaeger.jaeger-thrift com.uber.jaeger.jaeger-zipkin org.apache.spark.spark-catalyst_2.11 org.apache.spark.spark-core_2.11 org.apache.spark.spark-hive_2.11 org.apache.spark.spark-kvstore_2.11 org.apache.spark.spark-launcher_2.11 org.apache.spark.spark-network-common_2.11 org.apache.spark.spark-network-shuffle_2.11 org.apache.spark.spark-sketch_2.11 org.apache.spark.spark-sql_2.11 org.apache.spark.spark-tags_2.11 org.apache.spark.spark-unsafe_2.11

org.apache.spark.spark-yarn_2.11

This product also includes the following libraries which are covered by The :

com.chuusai.shapeless_2.11 commons-httpclient.commons-httpclient org.apache.httpcomponents.httpclient org.apache.httpcomponents.httpcore

This product also includes the following libraries which are covered by The Apache License (v2.0):

com.vlkan.flatbuffers

This product also includes the following libraries which are covered by The Apache License 2.0:

@google-cloud/pubsub @google-cloud/resource @google-cloud/storage arrow avro aws-sdk azure-storage bootstrap browser-request bytebuffer cglib.cglib-nodep com.amazonaws.aws-java-sdk com.amazonaws.aws-java-sdk-bundle com.amazonaws.aws-java-sdk-core com.amazonaws.aws-java-sdk-dynamodb com.amazonaws.aws-java-sdk-emr com.amazonaws.aws-java-sdk-iam com.amazonaws.aws-java-sdk-kms com.amazonaws.aws-java-sdk-s3 com.amazonaws.aws-java-sdk-sts com.amazonaws.jmespath-java com.carrotsearch.hppc com.clearspring.analytics.stream com.cloudera.navigator.navigator-sdk com.cloudera.navigator.navigator-sdk-client com.cloudera.navigator.navigator-sdk-model com.codahale.metrics.metrics-core com.cronutils.cron-utils com.databricks.spark-avro_2.11 com.fasterxml.classmate com.fasterxml.jackson.dataformat.jackson-dataformat-cbor com.fasterxml.jackson.datatype.jackson-datatype-joda com.fasterxml.jackson.jaxrs.jackson-jaxrs-base com.fasterxml.jackson.jaxrs.jackson-jaxrs-json-provider com.fasterxml.jackson.module.jackson-module-jaxb-annotations com.fasterxml.jackson.module.jackson-module-paranamer com.fasterxml.jackson.module.jackson-module-scala_2.11 com.fasterxml.uuid.java-uuid-generator com.github.nscala-time.nscala-time_2.10 com.github.stephenc.jcip.jcip-annotations com.google.api-client.google-api-client com.google.api-client.google-api-client-jackson2 com.google.api-client.google-api-client-java6 com.google.api.grpc.grpc-google-cloud-bigquerystorage-v1beta1 com.google.api.grpc.grpc-google-cloud-bigtable-admin-v2

Copyright © 2020 Trifacta Inc. Page #99 com.google.api.grpc.grpc-google-cloud-bigtable-v2 com.google.api.grpc.grpc-google-cloud-pubsub-v1 com.google.api.grpc.grpc-google-cloud-spanner-admin-database-v1 com.google.api.grpc.grpc-google-cloud-spanner-admin-instance-v1 com.google.api.grpc.grpc-google-cloud-spanner-v1 com.google.api.grpc.grpc-google-common-protos com.google.api.grpc.proto-google-cloud-bigquerystorage-v1beta1 com.google.api.grpc.proto-google-cloud-bigtable-admin-v2 com.google.api.grpc.proto-google-cloud-bigtable-v2 com.google.api.grpc.proto-google-cloud-datastore-v1 com.google.api.grpc.proto-google-cloud-monitoring-v3 com.google.api.grpc.proto-google-cloud-pubsub-v1 com.google.api.grpc.proto-google-cloud-spanner-admin-database-v1 com.google.api.grpc.proto-google-cloud-spanner-admin-instance-v1 com.google.api.grpc.proto-google-cloud-spanner-v1 com.google.api.grpc.proto-google-common-protos com.google.api.grpc.proto-google-iam-v1 com.google.apis.google-api-services-bigquery com.google.apis.google-api-services-clouddebugger com.google.apis.google-api-services-cloudresourcemanager com.google.apis.google-api-services-dataflow com.google.apis.google-api-services-iam com.google.apis.google-api-services-oauth2 com.google.apis.google-api-services-pubsub com.google.apis.google-api-services-storage com.google.auto.service.auto-service com.google.auto.value.auto-value-annotations com.google.cloud.bigdataoss.gcsio com.google.cloud.bigdataoss.util com.google.cloud.google-cloud-bigquery com.google.cloud.google-cloud-bigquerystorage com.google.cloud.google-cloud-bigtable com.google.cloud.google-cloud-bigtable-admin com.google.cloud.google-cloud-core com.google.cloud.google-cloud-core-grpc com.google.cloud.google-cloud-core-http com.google.cloud.google-cloud-monitoring com.google.cloud.google-cloud-spanner com.google.code.findbugs.jsr305 com.google.code.gson.gson com.google.errorprone.error_prone_annotations com.google.flogger.flogger com.google.flogger.flogger-system-backend com.google.flogger.google-extensions com.google.guava.failureaccess com.google.guava.guava com.google.guava.guava-jdk5 com.google.guava.listenablefuture com.google.http-client.google-http-client com.google.http-client.google-http-client-apache com.google.http-client.google-http-client-appengine com.google.http-client.google-http-client-jackson com.google.http-client.google-http-client-jackson2 com.google.http-client.google-http-client-protobuf com.google.inject.extensions.guice-servlet com.google.inject.guice com.google.j2objc.j2objc-annotations com.google.oauth-client.google-oauth-client com.google.oauth-client.google-oauth-client-java6 com.googlecode.javaewah.JavaEWAH

Copyright © 2020 Trifacta Inc. Page #100 com.googlecode.libphonenumber.libphonenumber com.hadoop.gplcompression.hadoop-lzo com.jakewharton.threetenabp.threetenabp com.jamesmurty.utils.java-xmlbuilder com.jolbox.bonecp com.mapr.mapr-root com.microsoft.azure.adal4j com.microsoft.azure.azure-core com.microsoft.azure.azure-storage com.microsoft.windowsazure.storage.microsoft-windowsazure-storage-sdk com.nimbusds.lang-tag com.nimbusds.nimbus-jose-jwt com.ning.compress-lzf com.opencsv.opencsv com.squareup.okhttp.okhttp com.squareup.okhttp3.logging-interceptor com.squareup.okhttp3.okhttp com.squareup.okhttp3.okhttp-urlconnection com.squareup.okio.okio com.squareup.retrofit2.adapter-rxjava com.squareup.retrofit2.converter-jackson com.squareup.retrofit2.retrofit com.trifacta.hadoop.cloudera4 com.twitter.chill-java com.twitter.chill_2.11 com.twitter.parquet-hadoop-bundle com.typesafe.akka.akka-actor_2.11 com.typesafe.akka.akka-cluster_2.11 com.typesafe.akka.akka-remote_2.11 com.typesafe.akka.akka-slf4j_2.11 com.typesafe.config com.univocity.univocity-parsers com.zaxxer.HikariCP com.zaxxer.HikariCP-java7 commons-beanutils.commons-beanutils commons-beanutils.commons-beanutils-core commons-cli.commons-cli commons-codec.commons-codec commons-collections.commons-collections commons-configuration.commons-configuration commons-dbcp.commons-dbcp commons-dbutils.commons-dbutils commons-digester.commons-digester commons-el.commons-el commons-fileupload.commons-fileupload commons-io.commons-io commons-lang.commons-lang commons-logging.commons-logging commons-net.commons-net commons-pool.commons-pool dateinfer de.odysseus.juel.juel-api de.odysseus.juel.juel-impl de.odysseus.juel.juel-spi express-opentracing google-benchmark googleapis io.airlift.aircompressor io.dropwizard.metrics.metrics-core io.dropwizard.metrics.metrics-graphite

Copyright © 2020 Trifacta Inc. Page #101 io.dropwizard.metrics.metrics-json io.dropwizard.metrics.metrics-jvm io.grpc.grpc-all io.grpc.grpc-alts io.grpc.grpc-auth io.grpc.grpc-context io.grpc.grpc-core io.grpc.grpc-grpclb io.grpc.grpc-netty io.grpc.grpc-netty-shaded io.grpc.grpc-okhttp io.grpc.grpc-protobuf io.grpc.grpc-protobuf-lite io.grpc.grpc-protobuf-nano io.grpc.grpc-stub io.grpc.grpc-testing io.netty.netty io.netty.netty-all io.netty.netty-buffer io.netty.netty-codec io.netty.netty-codec-http io.netty.netty-codec-http2 io.netty.netty-codec-socks io.netty.netty-common io.netty.netty-handler io.netty.netty-handler-proxy io.netty.netty-resolver io.netty.netty-tcnative-boringssl-static io.netty.netty-transport io.opentracing.contrib.opentracing-concurrent io.opentracing.contrib.opentracing-globaltracer io.opentracing.contrib.opentracing-web-servlet-filter io.opentracing.opentracing-api io.opentracing.opentracing-noop io.opentracing.opentracing-util io.prometheus.simpleclient io.prometheus.simpleclient_common io.prometheus.simpleclient_servlet io.reactivex.rxjava io.spray.spray-can_2.11 io.spray.spray-http_2.11 io.spray.spray-httpx_2.11 io.spray.spray-io_2.11 io.spray.spray-json_2.11 io.spray.spray-routing_2.11 io.spray.spray-util_2.11 io.springfox.springfox-core io.springfox.springfox-schema io.springfox.springfox-spi io.springfox.springfox-spring-web io.springfox.springfox-swagger-common io.springfox.springfox-swagger-ui io.springfox.springfox-swagger2 io.swagger.swagger-annotations io.swagger.swagger-models io.undertow.undertow-core io.undertow.undertow-servlet io.undertow.undertow-websockets-jsr io.zipkin.java.zipkin io.zipkin.reporter.zipkin-reporter

Copyright © 2020 Trifacta Inc. Page #102 io.zipkin.reporter.zipkin-sender-urlconnection io.zipkin.zipkin2.zipkin it.unimi.dsi.fastutil jaeger-client javax.inject.javax.inject javax.jdo.jdo-api javax.validation.validation-api joda-time.joda-time kerberos less libcuckoo .apache-log4j-extras log4j.log4j long mathjs mx4j.mx4j net.bytebuddy.byte-buddy net.hydromatic.eigenbase-properties net.java.dev.eval.eval net.java.dev.jets3t.jets3t net.jcip.jcip-annotations net.jpountz.lz4.lz4 net.minidev.accessors-smart net.minidev.json-smart net.sf.opencsv.opencsv net.snowflake.snowflake-jdbc opentracing org.activiti.activiti-bpmn-converter org.activiti.activiti-bpmn-layout org.activiti.activiti-bpmn-model org.activiti.activiti-common-rest org.activiti.activiti-dmn-api org.activiti.activiti-dmn-model org.activiti.activiti-engine org.activiti.activiti-form-api org.activiti.activiti-form-model org.activiti.activiti-image-generator org.activiti.activiti-process-validation org.activiti.activiti-rest org.activiti.activiti-spring org.activiti.activiti-spring-boot-starter-basic org.activiti.activiti-spring-boot-starter-rest-api org.activiti.activiti5-compatibility org.activiti.activiti5-engine org.activiti.activiti5-spring org.activiti.activiti5-spring-compatibility org.apache.arrow.arrow-format org.apache.arrow.arrow-memory org.apache.arrow.arrow-vector org.apache.atlas.atlas-client org.apache.atlas.atlas-typesystem org.apache.avro.avro org.apache.avro.avro-ipc org.apache.avro.avro-mapred org.apache.beam.beam-model-job-management org.apache.beam.beam-model-pipeline org.apache.beam.beam-runners-core-construction-java org.apache.beam.beam-runners-direct-java org.apache.beam.beam-runners-google-cloud-dataflow-java org.apache.beam.beam-sdks-java-core

Copyright © 2020 Trifacta Inc. Page #103 org.apache.beam.beam-sdks-java-extensions-google-cloud-platform-core org.apache.beam.beam-sdks-java-extensions-protobuf org.apache.beam.beam-sdks-java-extensions-sorter org.apache.beam.beam-sdks-java-io-google-cloud-platform org.apache.beam.beam-sdks-java-io-parquet org.apache.beam.beam-vendor-grpc-1_13_1 org.apache.beam.beam-vendor-guava-20_0 org.apache.calcite.calcite-avatica org.apache.calcite.calcite-core org.apache.calcite.calcite-linq4j org.apache.commons.codec org.apache.commons.commons-collections4 org.apache.commons.commons-compress org.apache.commons.commons-configuration2 org.apache.commons.commons-crypto org.apache.commons.commons-csv org.apache.commons.commons-dbcp2 org.apache.commons.commons-email org.apache.commons.commons-exec org.apache.commons.commons-lang3 org.apache.commons.commons-math org.apache.commons.commons-math3 org.apache.commons.commons-pool2 org.apache.curator.curator-client org.apache.curator.curator-framework org.apache.curator.curator-recipes org.apache.derby.derby org.apache.directory.api.api-asn1-api org.apache.directory.api.api-util org.apache.directory.server.apacheds-i18n org.apache.directory.server.apacheds-kerberos-codec org.apache.hadoop.avro org.apache.hadoop.hadoop-annotations org.apache.hadoop.hadoop-auth org.apache.hadoop.hadoop-aws org.apache.hadoop.hadoop-azure org.apache.hadoop.hadoop-azure-datalake org.apache.hadoop.hadoop-client org.apache.hadoop.hadoop-common org.apache.hadoop.hadoop-hdfs org.apache.hadoop.hadoop-hdfs-client org.apache.hadoop.hadoop-mapreduce-client-app org.apache.hadoop.hadoop-mapreduce-client-common org.apache.hadoop.hadoop-mapreduce-client-core org.apache.hadoop.hadoop-mapreduce-client-jobclient org.apache.hadoop.hadoop-mapreduce-client-shuffle org.apache.hadoop.hadoop-yarn-api org.apache.hadoop.hadoop-yarn-client org.apache.hadoop.hadoop-yarn-common org.apache.hadoop.hadoop-yarn-registry org.apache.hadoop.hadoop-yarn-server-common org.apache.hadoop.hadoop-yarn-server-nodemanager org.apache.hadoop.hadoop-yarn-server-web-proxy org.apache.htrace.htrace-core org.apache.htrace.htrace-core4 org.apache.httpcomponents.httpmime org.apache.ivy.ivy org.apache.kerby.kerb-admin org.apache.kerby.kerb-client org.apache.kerby.kerb-common

Copyright © 2020 Trifacta Inc. Page #104 org.apache.kerby.kerb-core org.apache.kerby.kerb-crypto org.apache.kerby.kerb-identity org.apache.kerby.kerb-server org.apache.kerby.kerb-simplekdc org.apache.kerby.kerb-util org.apache.kerby.kerby-asn1 org.apache.kerby.kerby-config org.apache.kerby.kerby-pkix org.apache.kerby.kerby-util org.apache.kerby.kerby-xdr org.apache.kerby.token-provider org.apache.logging.log4j.log4j-1.2-api org.apache.logging.log4j.log4j-api org.apache.logging.log4j.log4j-api-scala_2.11 org.apache.logging.log4j.log4j-core org.apache.logging.log4j.log4j-jcl org.apache.logging.log4j.log4j-jul org.apache.logging.log4j.log4j-slf4j-impl org.apache.logging.log4j.log4j-web org.apache.orc.orc-core org.apache.orc.orc-mapreduce org.apache.orc.orc-shims org.apache.parquet.parquet-avro org.apache.parquet.parquet-column org.apache.parquet.parquet-common org.apache.parquet.parquet-encoding org.apache.parquet.parquet-format org.apache.parquet.parquet-hadoop org.apache.parquet.parquet-jackson org.apache.parquet.parquet-tools org.apache.pig.pig org.apache.pig.pig-core-spork org.apache.thrift.libfb303 org.apache.thrift.libthrift org.apache.tomcat.embed.tomcat-embed-core org.apache.tomcat.embed.tomcat-embed-el org.apache.tomcat.embed.tomcat-embed-websocket org.apache.tomcat.tomcat-annotations-api org.apache.tomcat.tomcat-jdbc org.apache.tomcat.tomcat-juli org.apache.xbean.xbean-asm5-shaded org.apache.xbean.xbean-asm6-shaded org.apache.zookeeper.zookeeper org.codehaus.jackson.jackson-core-asl org.codehaus.jackson.jackson-mapper-asl org.codehaus.jettison.jettison org.datanucleus.datanucleus-api-jdo org.datanucleus.datanucleus-core org.datanucleus.datanucleus-rdbms org.eclipse.jetty.jetty-client org.eclipse.jetty.jetty-http org.eclipse.jetty.jetty-io org.eclipse.jetty.jetty-security org.eclipse.jetty.jetty-server org.eclipse.jetty.jetty-servlet org.eclipse.jetty.jetty-util org.eclipse.jetty.jetty-util-ajax org.eclipse.jetty.jetty-webapp org.eclipse.jetty.jetty-xml

Copyright © 2020 Trifacta Inc. Page #105 org.fusesource.jansi.jansi org.hibernate.hibernate-validator org.htrace.htrace-core org.iq80.snappy.snappy org.jboss.jandex org.joda.joda-convert org.json4s.json4s-ast_2.11 org.json4s.json4s-core_2.11 org.json4s.json4s-jackson_2.11 org.json4s.json4s-scalap_2.11 org.liquibase.liquibase-core org.lz4.lz4-java org.mapstruct.mapstruct org.mortbay.jetty.jetty org.mortbay.jetty.jetty-util org.mybatis.mybatis org.objenesis.objenesis org.osgi.org.osgi.core org.parboiled.parboiled-core org.parboiled.parboiled-scala_2.11 org.quartz-scheduler.quartz org.roaringbitmap.RoaringBitmap org.sonatype.oss.oss-parent org.sonatype.sisu.inject.cglib org.spark-project.hive.hive-exec org.spark-project.hive.hive-metastore org.springframework.boot.spring-boot org.springframework.boot.spring-boot-autoconfigure org.springframework.boot.spring-boot-starter org.springframework.boot.spring-boot-starter-aop org.springframework.boot.spring-boot-starter-data-jpa org.springframework.boot.spring-boot-starter-jdbc org.springframework.boot.spring-boot-starter-log4j2 org.springframework.boot.spring-boot-starter-logging org.springframework.boot.spring-boot-starter-tomcat org.springframework.boot.spring-boot-starter-undertow org.springframework.boot.spring-boot-starter-web org.springframework.data.spring-data-commons org.springframework.data.spring-data-jpa org.springframework.plugin.spring-plugin-core org.springframework.plugin.spring-plugin-metadata org.springframework.retry.spring-retry org.springframework.security.spring-security-config org.springframework.security.spring-security-core org.springframework.security.spring-security-crypto org.springframework.security.spring-security-web org.springframework.spring-aop org.springframework.spring-aspects org.springframework.spring-beans org.springframework.spring-context org.springframework.spring-context-support org.springframework.spring-core org.springframework.spring-expression org.springframework.spring-jdbc org.springframework.spring-orm org.springframework.spring-tx org.springframework.spring-web org.springframework.spring-webmvc org.springframework.spring-websocket org.uncommons.maths.uncommons-maths

org.wildfly.openssl.wildfly-openssl org.xerial.snappy.snappy-java org.yaml.snakeyaml oro.oro parquet pbr pig-0.11.1-withouthadoop-23 pig-0.12.1-mapr-noversion-withouthadoop pig-0.14.0-core-spork piggybank-amzn-0.3 piggybank-cdh5.0.0-beta-2-0.12.0 python-iptables request requests stax.stax-api thrift tomcat.jasper-compiler tomcat.jasper-runtime xerces.xercesImpl xml-apis.xml-apis zipkin zipkin-transport-http

This product also includes the following libraries which are covered by The Apache License 2.0 + Eclipse Public License 1.0:

spark-assembly-thinner

This product also includes the following libraries which are covered by The Apache License Version 2:

org.mortbay.jetty.jetty-sslengine

This product also includes the following libraries which are covered by The Apache License v2.0:

net.java.dev.jna.jna net.java.dev.jna.jna-platform

This product also includes the following libraries which are covered by The Apache License, version 2.0:

com.nimbusds.oauth2-oidc-sdk org.jboss.logging.jboss-logging

This product also includes the following libraries which are covered by The Apache Software Licenses:

org.slf4j.log4j-over-slf4j

This product also includes the following libraries which are covered by The BSD license:

alabaster antlr.antlr asm.asm-parent babel click com.google.api.api-common com.google.api.gax com.google.api.gax-grpc com.google.api.gax-httpjson com.jcraft.jsch com.thoughtworks.paranamer.paranamer dk.brics.automaton.automaton

dom4j.dom4j enum34 flask itsdangerous javolution.javolution jinja2 jline.jline microee mock networkx numpy org.antlr.ST4 org.antlr.antlr-runtime org.antlr.antlr4-runtime org.antlr.stringtemplate org.codehaus.woodstox.stax2-api org.ow2.asm.asm org.scala-lang.jline pluginbase psutil pygments python-enum34 python-json-logger scipy snowballstemmer sphinx sphinxcontrib-websupport strptime websocket-stream xlrd xmlenc.xmlenc xss-filters

This product also includes the following libraries which are covered by The BSD 2-Clause License:

com.github.luben.zstd-jni

This product also includes the following libraries which are covered by The BSD 3-Clause:

org.scala-lang.scala-compiler org.scala-lang.scala-library org.scala-lang.scala-reflect org.scala-lang.scalap

This product also includes the following libraries which are covered by The BSD 3-Clause "New" or "Revised" License (BSD-3-Clause):

org.abego.treelayout.org.abego.treelayout.core

This product also includes the following libraries which are covered by The BSD 3-Clause License:

org.antlr.antlr4

This product also includes the following libraries which are covered by The BSD 3-clause:

org.scala-lang.modules.scala-parser-combinators_2.11 org.scala-lang.modules.scala-xml_2.11 org.scala-lang.plugins.scala-continuations-library_2.11 org.threeten.threetenbp

This product also includes the following libraries which are covered by The BSD New license:

com.google.auth.google-auth-library-credentials com.google.auth.google-auth-library-oauth2-http

This product also includes the following libraries which are covered by The BSD-2-Clause:

cls-bluebird node-polyglot org.postgresql.postgresql terser uglify-js

This product also includes the following libraries which are covered by The BSD-3-Clause:

@sentry/node d3-dsv datalib datalib-sketch markupsafe md5 node-forge protobufjs qs queue-async sqlite3 werkzeug

This product also includes the following libraries which are covered by The BSD-Style:

com.jsuereth.scala-arm_2.11

This product also includes the following libraries which are covered by The BSD-derived ( http://www.repoze.org/LICENSE.txt):

meld3 supervisor

This product also includes the following libraries which are covered by The BSD-like:

dnspython idna org.scala-lang.scala-actors org.scalamacros.quasiquotes_2.10

This product also includes the following libraries which are covered by The BSD-style license:

bzip2

This product also includes the following libraries which are covered by The Boost Software License:

asio boost cpp-netlib-uri expected jsbind poco

This product also includes the following libraries which are covered by The Bouncy Castle Licence:

org.bouncycastle.bcprov-jdk15on

This product also includes the following libraries which are covered by The CDDL:

javax.mail.mail javax.mail.mailapi javax.servlet.jsp-api javax.servlet.jsp.jsp-api javax.servlet.servlet-api javax.transaction.jta javax.xml.stream.stax-api org.glassfish.external.management-api org.glassfish.gmbal.gmbal-api-only org.jboss.spec.javax.annotation.jboss-annotations-api_1.2_spec

This product also includes the following libraries which are covered by The CDDL + GPLv2 with classpath exception:

javax.annotation.javax.annotation-api javax.jms.jms javax.servlet.javax.servlet-api javax.transaction.javax.transaction-api javax.transaction.transaction-api org.glassfish.grizzly.grizzly-framework org.glassfish.grizzly.grizzly-http org.glassfish.grizzly.grizzly-http-server org.glassfish.grizzly.grizzly-http-servlet org.glassfish.grizzly.grizzly-rcm org.glassfish.hk2.external.aopalliance-repackaged org.glassfish.hk2.external.javax.inject org.glassfish.hk2.hk2-api org.glassfish.hk2.hk2-locator org.glassfish.hk2.hk2-utils org.glassfish.hk2.osgi-resource-locator org.glassfish.javax.el org.glassfish.jersey.core.jersey-common

This product also includes the following libraries which are covered by The CDDL 1.1:

com.sun.jersey.contribs.jersey-guice com.sun.jersey.jersey-client com.sun.jersey.jersey-core com.sun.jersey.jersey-json com.sun.jersey.jersey-server com.sun.jersey.jersey-servlet com.sun.xml.bind.jaxb-impl javax.ws.rs.javax.ws.rs-api javax.xml.bind.jaxb-api org.jvnet.mimepull.mimepull

This product also includes the following libraries which are covered by The CDDL License:

javax.ws.rs.jsr311-api

This product also includes the following libraries which are covered by The CDDL/GPLv2+CE:

com.sun.mail.javax.mail

This product also includes the following libraries which are covered by The CERN:

colt.colt

This product also includes the following libraries which are covered by The COMMON DEVELOPMENT AND DISTRIBUTION LICENSE (CDDL) Version 1.0:

javax.activation.activation javax.annotation.jsr250-api

This product also includes the following libraries which are covered by The Doug Crockford's license that allows this module to be used for Good but not for Evil:

jsmin

This product also includes the following libraries which are covered by The Dual License:

python-dateutil

This product also includes the following libraries which are covered by The Eclipse Distribution License (EDL), Version 1.0:

org.hibernate.javax.persistence.hibernate-jpa-2.1-api

This product also includes the following libraries which are covered by The Eclipse Public License:

com.github.oshi.oshi-core

This product also includes the following libraries which are covered by The Eclipse Public License - v 1.0:

org.aspectj.aspectjweaver

This product also includes the following libraries which are covered by The Eclipse Public License, Version 1.0:

com.mchange.mchange-commons-java

This product also includes the following libraries which are covered by The GNU General Public License v2.0 only, with Classpath exception:

org.jboss.spec.javax.servlet.jboss-servlet-api_3.1_spec org.jboss.spec.javax.websocket.jboss-websocket-api_1.1_spec

This product also includes the following libraries which are covered by The GNU LGPL:

nose

This product also includes the following libraries which are covered by The GNU Lesser General Public License:

org.hibernate.common.hibernate-commons-annotations org.hibernate.hibernate-core org.hibernate.hibernate-entitymanager

This product also includes the following libraries which are covered by The GNU Lesser General Public License Version 2.1, February 1999:

org.jgrapht.jgrapht-core

This product also includes the following libraries which are covered by The GNU Lesser General Public License, Version 2.1:

com.fasterxml.jackson.core.jackson-annotations com.fasterxml.jackson.core.jackson-core

com.fasterxml.jackson.core.jackson-databind com.mchange.c3p0

This product also includes the following libraries which are covered by The GNU Lesser Public License:

com.google.code.findbugs.annotations

This product also includes the following libraries which are covered by The GPLv3:

yamllint

This product also includes the following libraries which are covered by The Google Cloud Software License:

com.google.cloud.google-cloud-storage

This product also includes the following libraries which are covered by The ICU License:

icu

This product also includes the following libraries which are covered by The ISC:

browserify-sign iconify inherits lru-cache request-promise request-promise-native requests-kerberos rimraf sax semver split-ca

This product also includes the following libraries which are covered by The JGraph Ltd - 3 clause BSD license:

org.tinyjee.jgraphx.jgraphx

This product also includes the following libraries which are covered by The LGPL:

chardet com.sun.jna.jna

This product also includes the following libraries which are covered by The LGPL 2.1:

org.codehaus.jackson.jackson-jaxrs org.codehaus.jackson.jackson-xc org.javassist.javassist xmlhttprequest

This product also includes the following libraries which are covered by The LGPLv3 or later:

com.github.fge.json-schema-core com.github.fge.json-schema-validator

This product also includes the following libraries which are covered by The MIT license:

amplitude-js analytics-node args4j.args4j async

Copyright © 2020 Trifacta Inc. Page #112 avsc backbone backbone-forms basic-auth bcrypt bluebird body-parser browser-filesaver buffer-crc32 bufferedstream busboy byline bytes cachetools chai chalk cli-table clipboard codemirror colors com.github.tommyettinger.blazingchain com.microsoft.azure.azure-data-lake-store-sdk com.microsoft.sqlserver.mssql-jdbc commander common-tags compression console.table cookie cookie-parser cookie-session cronstrue crypto-browserify csrf csurf definitely eventEmitter express express-http-context express-params express-zip forever form-data fs-extra function-rate-limit future fuzzy generic-pool google-auto-auth iconv-lite imagesize int24 is-my-json-valid jade jq jquery jquery-serializeobject jquery.ba-serializeobject jquery.event.drag jquery.form jquery.ui

Copyright © 2020 Trifacta Inc. Page #113 json-stable-stringify jsonfile jsonschema jsonwebtoken jszip keygrip keysim knex less-middleware lodash lunr matic memcached method-override mini-css-extract-plugin minilog mkdirp moment moment-jdateformatparser moment-timezone mongoose morgan morgan-json mysql net.razorvine.pyrolite netifaces nock nodemailer org.checkerframework.checker-compat-qual org.checkerframework.checker-qual org.mockito.mockito-core org.slf4j.jcl-over-slf4j org.slf4j.jul-to-slf4j org.slf4j.slf4j-api org.slf4j.slf4j-log4j12 pace passport passport-azure-ad passport-http passport-http-bearer passport-ldapauth passport-local passport-saml passport-saml-metadata passport-strategy password-validator pegjs pg pg-hstore pip png-img promise-retry prop-types punycode py-cpuinfo python-crfsuite python-six pytz pyyaml query-string

Copyright © 2020 Trifacta Inc. Page #114 querystring randexp rapidjson react react-day-picker react-dom react-hot-loader react-modal react-router-dom react-select react-switch react-table react-virtualized recursive-readdir redefine requestretry require-jade require-json retry retry-as-promised rotating-file-stream safe-json-stringify sequelize setuptools simple-ldap-search simplejson singledispatch six slick.core slick.grid slick.headerbuttons slick.headerbuttons.css slick.rowselectionmodel snappy sphinx-rtd-theme split-pane sql stream-meter supertest tar-fs temp to-case tv4 ua-parser-js umzug underscore.string unicode-length universal-analytics urijs uritools url urllib3 user-agent-parser uuid validator wheel winston winston-daily-rotate-file ws yargs

zxcvbn

This product also includes the following libraries which are covered by The MIT License; BSD 3-clause License:

antlr4-cpp-runtime

This product also includes the following libraries which are covered by The MIT license:

org.codehaus.mojo.animal-sniffer-annotations

This product also includes the following libraries which are covered by The MIT/X:

vnc2flv

This product also includes the following libraries which are covered by The MIT/X11 license:

optimist

This product also includes the following libraries which are covered by The MPL 2.0:

pathspec

This product also includes the following libraries which are covered by The MPL 2.0 or EPL 1.0:

com.h2database.h2

This product also includes the following libraries which are covered by The MPL-2.0:

certifi

This product also includes the following libraries which are covered by The Microsoft Software License:

com.microsoft.sqlserver.sqljdbc4

This product also includes the following libraries which are covered by The Mozilla Public License, Version 2.0:

org.mozilla.rhino

This product also includes the following libraries which are covered by The New BSD License:

asm.asm backbone-queryparams bloomfilter com.esotericsoftware.kryo-shaded com.esotericsoftware.minlog com.googlecode.protobuf-java-format.protobuf-java-format d3 domReady double-conversion gflags glog gperftools gtest org.antlr.antlr-master org.codehaus.janino.commons-compiler org.codehaus.janino.janino org.hamcrest.hamcrest-core pcre protobuf re2

require-text sha1 termcolor topojson triflow vega websocketpp

This product also includes the following libraries which are covered by The New BSD license:

com.google.protobuf.nano.protobuf-javanano

This product also includes the following libraries which are covered by The Oracle Technology Network License:

com.oracle.ojdbc6

This product also includes the following libraries which are covered by The PSF:

futures heap typing

This product also includes the following libraries which are covered by The PSF license:

functools32 python

This product also includes the following libraries which are covered by The PSF or ZPL:

wsgiref

This product also includes the following libraries which are covered by The Proprietary:

com.trifacta.connect.trifacta-TYcassandra com.trifacta.connect.trifacta-TYdb2 com.trifacta.connect.trifacta-TYgreenplum com.trifacta.connect.trifacta-TYhive com.trifacta.connect.trifacta-TYinformix com.trifacta.connect.trifacta-TYmongodb com.trifacta.connect.trifacta-TYmysql com.trifacta.connect.trifacta-TYopenedgewp com.trifacta.connect.trifacta-TYoracle com.trifacta.connect.trifacta-TYoraclesalescloud com.trifacta.connect.trifacta-TYpostgresql com.trifacta.connect.trifacta-TYredshift com.trifacta.connect.trifacta-TYrightnow com.trifacta.connect.trifacta-TYsforce com.trifacta.connect.trifacta-TYsparksql com.trifacta.connect.trifacta-TYsqlserver com.trifacta.connect.trifacta-TYsybase jsdata

This product also includes the following libraries which are covered by The Public Domain:

aopalliance.aopalliance org.jboss.xnio.xnio-api org.jboss.xnio.xnio-nio org.tukaani.xz protobuf-to-dict pycrypto

simple-xml-writer

This product also includes the following libraries which are covered by The Public domain:

net.iharder.base64

This product also includes the following libraries which are covered by The Python Software Foundation License:

argparse backports-abc ipaddress python-setuptools regex

This product also includes the following libraries which are covered by The Tableau Binary Code License Agreement:

com.tableausoftware.common.tableaucommon com.tableausoftware.extract.tableauextract

This product also includes the following libraries which are covered by The Apache License, Version 2.0:

com.fasterxml.woodstox.woodstox-core com.google.cloud.bigtable.bigtable-client-core com.google.cloud.datastore.datastore-v1-proto-client io.opencensus.opencensus-api io.opencensus.opencensus-contrib-grpc-metrics io.opencensus.opencensus-contrib-grpc-util io.opencensus.opencensus-contrib-http-util org.spark-project.spark.unused software.amazon.ion.ion-java

This product also includes the following libraries which are covered by The BSD 3-Clause License:

org.fusesource.leveldbjni.leveldbjni-all

This product also includes the following libraries which are covered by The GNU General Public License, Version 2:

mysql.mysql-connector-java

This product also includes the following libraries which are covered by The Go license:

com.google.re2j.re2j

This product also includes the following libraries which are covered by The MIT License (MIT):

com.microsoft.azure.azure com.microsoft.azure.azure-annotations com.microsoft.azure.azure-client-authentication com.microsoft.azure.azure-client-runtime com.microsoft.azure.azure-keyvault com.microsoft.azure.azure-keyvault-core com.microsoft.azure.azure-keyvault-webkey com.microsoft.azure.azure-mgmt-appservice com.microsoft.azure.azure-mgmt-batch com.microsoft.azure.azure-mgmt-cdn com.microsoft.azure.azure-mgmt-compute com.microsoft.azure.azure-mgmt-containerinstance com.microsoft.azure.azure-mgmt-containerregistry

com.microsoft.azure.azure-mgmt-containerservice com.microsoft.azure.azure-mgmt-cosmosdb com.microsoft.azure.azure-mgmt-dns com.microsoft.azure.azure-mgmt-graph-rbac com.microsoft.azure.azure-mgmt-keyvault com.microsoft.azure.azure-mgmt-locks com.microsoft.azure.azure-mgmt-network com.microsoft.azure.azure-mgmt-redis com.microsoft.azure.azure-mgmt-resources com.microsoft.azure.azure-mgmt-search com.microsoft.azure.azure-mgmt-servicebus com.microsoft.azure.azure-mgmt-sql com.microsoft.azure.azure-mgmt-storage com.microsoft.azure.azure-mgmt-trafficmanager com.microsoft.rest.client-runtime

This product also includes the following libraries which are covered by The MIT License: http://www.opensource.org/licenses/mit-license.php:

probableparsing usaddress

This product also includes the following libraries which are covered by The New BSD License:

net.sf.py4j.py4j org.jodd.jodd-core

This product also includes the following libraries which are covered by The Unicode/ICU License:

com.ibm.icu.icu4j

This product also includes the following libraries which are covered by The Unlicense:

moment-range stream-buffers

This product also includes the following libraries which are covered by The WTFPL:

org.reflections.reflections

This product also includes the following libraries which are covered by The http://www.apache.org/licenses/LICENSE-2.0:

tornado

This product also includes the following libraries which are covered by The new BSD:

scikit-learn

This product also includes the following libraries which are covered by The new BSD License:

decorator

This product also includes the following libraries which are covered by The provided without support or warranty:

org.json.json

This product also includes the following libraries which are covered by The public domain, Python, 2-Clause BSD, GPL 3 (see COPYING.txt):

docutils

This product also includes the following libraries which are covered by The zlib License:

zlib

Copyright © 2020 - Trifacta, Inc. All rights reserved.