Talend Open Studio for Data Quality User Guide
Total Page:16
File Type:pdf, Size:1020Kb
Talend Open Studio for Data Quality User Guide 6.1.2 Talend Open Studio for Data Quality Adapted for v6.1.2. Supersedes previous releases. Publication date: September 13, 2016 Copyleft This documentation is provided under the terms of the Creative Commons Public License (CCPL). For more information about what you can and cannot do with this documentation in accordance with the CCPL, please read: http://creativecommons.org/licenses/by-nc-sa/2.0/ Notices Talend is a trademark of Talend, Inc. All brands, product names, company names, trademarks and service marks are the properties of their respective owners. License Agreement The software described in this documentation is licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.html. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. This product includes software developed at ASM, AntlR, Apache ActiveMQ, Apache Ant, Apache Axiom, Apache Axis, Apache Axis 2, Apache Chemistry, Apache Common Http Client, Apache Common Http Core, Apache Commons, Apache Commons Bcel, Apache Commons Lang, Apache Datafu, Apache Derby Database Engine and Embedded JDBC Driver, Apache Geronimo, Apache HCatalog, Apache Hadoop, Apache Hbase, Apache Hive, Apache HttpClient, Apache HttpComponents Client, Apache JAMES, Apache Log4j, Apache Neethi, Apache POI, Apache Pig, Apache Thrift, Apache Tomcat, Apache Xml-RPC, Apache Zookeeper, CSV Tools, DataNucleus, Doug Lea, Ezmorph, Google's phone number handling library, Guava: Google Core Libraries for Java, H2 Embedded Database and JDBC Driver, HighScale Lib, HsqlDB, JSON, JUnit, Jackson Java JSON- processor, Java API for RESTful Services, Java Universal Network Graph, Jaxb, Jaxen, Jetty, Joda-Time, Json Simple, MapDB, MetaStuff, Paraccel JDBC Driver, PostgreSQL JDBC Driver, Protocol Buffers - Google's data interchange format, Resty: A simple HTTP REST client for Java, SL4J: Simple Logging Facade for Java, SQLite JDBC Driver, The Castor Project, The Legion of the Bouncy Castle, Woden, Xalan-J, Xerces2, XmlBeans, XmlSchema Core, atinject. Licensed under their respective license. 5.2.3. Using the Java or the SQL Table of Contents engine ..................................... 80 5.2.4. Accessing the detailed view of Preface .................................................. v the database column analysis ............ 81 1. General information ........................... v 5.2.5. Viewing and exporting 1.1. Purpose ................................ v analyzed data ............................. 83 1.2. Audience .............................. v 5.2.6. Using regular expressions and 1.3. Typographical conventions ........... v SQL patterns in a column analysis ....... 86 2. Feedback and Support ........................ v 5.2.7. Saving the queries executed on Chapter 1. Data Profiling: Concepts indicators ................................. 92 and Principles ........................................ 1 5.2.8. Creating analyses from table or 1.1. Why profiling data .......................... 2 column names ............................ 93 1.2. About Talend Data Quality ................. 2 5.3. Creating a basic column analysis on a 1.2.1. What is Talend Data Quality ........ 2 file ................................................ 95 1.2.2. Core features ........................ 2 5.3.1. Analyzing columns in a Chapter 2. Getting started with Talend delimited file ............................. 95 5.3.2. Analyzing columns in an excel Data Quality .......................................... 5 file ....................................... 105 2.1. Working principles of data quality ......... 6 5.4. Analyzing discrete data ................... 108 2.2. Important features and configuration 5.5. Data mining types ........................ 113 options ............................................. 6 5.5.1. Nominal .......................... 114 2.2.1. Defining the maximum 5.5.2. Interval ........................... 114 memory size threshold .................... 6 5.5.3. Unstructured text ................. 114 2.2.2. Setting preferences of analysis 5.5.4. Other ............................. 114 editors and analysis results ................ 7 Chapter 6. Table analyses .................... 115 2.2.3. Displaying and hiding the help 6.1. Steps to analyze database tables ......... 116 content ..................................... 8 6.2. Analyzing tables in databases ............ 116 2.2.4. Displaying the Module view ....... 12 6.2.1. Creating a simple table 2.2.5. Displaying the error log view analysis (Column Set Analysis) ......... 116 and managing log files ................... 13 6.2.2. Creating a table analysis with 2.2.6. Opening new editors ............... 15 SQL business rules ..................... 130 2.3. Icons appended on analyses names in 6.2.3. Detecting anomalies in the DQ Repository .............................. 16 columns (Functional Dependency Chapter 3. Before you begin profiling Analysis) ................................ 155 data .................................................... 17 6.3. Analyzing tables in delimited files ....... 161 3.1. Creating connections to data sources ..... 18 6.3.1. Creating a column set analysis 3.1.1. Connecting to a database .......... 18 on a delimited file using patterns ....... 161 3.1.2. Connecting to a file ............... 26 6.3.2. Creating a column analysis 3.2. Managing connections to data from the analysis of a set of columns ... 171 sources ........................................... 29 6.4. Analyzing duplicates ...................... 172 3.2.1. Managing database 6.4.1. Creating a match analysis ........ 172 connections ............................... 29 6.4.2. Creating a match rule ............ 184 3.2.2. Managing file connections ........ 41 Chapter 7. Redundancy analyses .......... 191 Chapter 4. Profiling database 7.1. What are redundancy analyses .......... 192 content ................................................ 43 7.2. Comparing identical columns in 4.1. Analyzing databases ....................... 44 different tables ................................. 192 4.1.1. Creating a database content 7.3. Matching primary and foreign keys ..... 198 analysis ................................... 44 Chapter 8. Correlation analyses ........... 205 4.1.2. Creating a database content 8.1. What are column correlation analysis in shortcut procedure ............ 49 analyses ......................................... 206 4.1.3. Creating a catalog analysis ........ 49 8.2. Numerical correlation analyses .......... 206 4.1.4. Creating a schema analysis ........ 54 8.2.1. Creating a numerical 4.2. Previewing data in the SQL editor ........ 58 correlation analysis ..................... 207 4.3. Displaying keys and indexes of 8.2.2. Accessing the detailed view of database tables .................................. 59 the analysis results ...................... 213 4.4. Synchronizing connection and 8.3. Time correlation analyses ................ 215 database structures ............................. 61 8.3.1. Creating a time correlation 4.4.1. How to synchronize and reload analysis .................................. 215 catalog and schema lists ................. 62 8.3.2. Accessing the detailed view of 4.4.2. How to synchronize and reload the analysis results ...................... 219 table lists ................................. 63 8.4. Nominal correlation analyses ............ 221 4.4.3. How to synchronize and reload 8.4.1. Creating a nominal correlation column lists .............................. 63 analysis .................................. 221 Chapter 5. Column analyses .................. 65 8.4.2. Accessing the detailed view of 5.1. Where to start? ............................. 66 the analysis results ...................... 225 5.2. Creating a basic analysis on a Chapter 9. Extended functionality: database column ................................ 67 patterns and indicators ........................ 229 5.2.1. Defining the columns to be 9.1. Patterns .................................... 230 analyzed and setting indicators ........... 67 9.1.1. Pattern types ...................... 230 5.2.2. Finalizing and executing the 9.1.2. Managing User-Defined column analysis .......................... 76 Functions in databases .................. 230 Talend Open Studio for Data Quality User Guide Talend Open Studio for Data Quality 9.1.3. Adding regular expressions D.1. Main concept .............................. 350 and SQL patterns to column D.2. How to create a regular expression analyses ................................. 236 function on SQL Server ........................ 350 9.1.4. Managing regular expressions D.2.1. How to create a project in and SQL patterns ....................... 236 Visual Studio ............................ 350 9.2. Indicators .................................. 261 D.2.2. How to deploy the regular 9.2.1. Indicator types ................... 261 expression function to the SQL 9.2.2. Managing system indicators ...... 268 server .................................... 351 9.2.3. Managing user-defined D.2.3. How to set up the studio ......... 354 indicators ................................ 271 D.3. How to test the created function via the