Data Quality Guide Table of Contents
Total Page:16
File Type:pdf, Size:1020Kb
Spectrum™ Technology Platform Version 2019.1.0 Data Quality Guide Table of Contents Using an Express Match Key 107 Analyzing Match Results 112 1 - Getting Started Dataflow Templates for Matching 127 Introduction to Data Quality 5 5 - Deduplication 2 - Parsing Filtering Out Duplicate Records 136 Creating a Best of Breed Record 140 Introduction to Parsing 8 Defining Domain-Independent Parsing Grammars in Dataflows 9 6 - Exception Records Culture-Specific Parsing 10 Analyzing Parsing Results 36 Designing a Dataflow to Handle Exceptions 146 Parsing Personal Names 40 Designing a Dataflow for Real-Time Dataflow Templates for Parsing 41 Revalidation 147 3 - Standardization 7 - Lookup Tables Standardizing Terms 56 Introduction to Lookup Tables 151 Standardizing Personal Names 57 Data Normalization Module Tables 151 Templates for Standardization 59 Universal Name Module Tables 156 Viewing the Contents of a Lookup Table 158 Adding a Term to a Lookup Table 159 4 - Matching Removing a Term from a Lookup Table 159 Modifying the Standardized Form of a Term 159 Matching Terminology 62 Reverting Table Customizations 160 Standard Fields 63 Creating a Lookup Table 160 Techniques for Defining Match Keys 64 Importing Data 161 Match Rules 67 Matching Records from a Single Source 80 Matching Records from One Source to Another 8 - Stages Reference Source 86 Matching Records Between and Within Advanced Matching Module 165 Sources 92 Business Steward Module 226 Matching Records Against a Database 98 Data Normalization Module 271 Matching Records Using Multiple Match Rules 100 Universal Name Module 289 Creating a Universal Matching Service 104 9 - ISO Country Codes and Module Support ISO Country Codes and Module Support 324 Spectrum™ Technology Platform 2019.1.0 Data Quality Guide 3 1 - Getting Started In this section Introduction to Data Quality 5 Getting Started Introduction to Data Quality Data quality involves ensuring the accuracy, timeliness, completeness, and consistency of the data used by an organization so that the data is fit for use. Spectrum™ Technology Platform supports data quality initiatives by providing the following capabilities. Parsing Parsing is the process of analyzing a sequence of input characters in a field and breaking it up into multiple fields. For example, you might have a field called Name which contains the value "John A. Smith" and through parsing, you can break it up so that you have a FirstName field containing "John", a MiddleName field containing "A" and a LastName field containing "Smith." Standardization Standardization takes data of the same type and puts it in the same format. Some types of data that may be standardized include telephone numbers, dates, names, addresses, and identification numbers. For example, telephone numbers can be formatted to eliminate non-numeric characters such as parentheses, periods, or dashes. You should standardize your data before performing matching or deduplication activities since standardized data will be more accurately matched than data that is inconsistently formatted. Matching Matching is the process of identifying records that are related to each other in some way that is significant for your purposes. For example, if you are trying to eliminate redundant information from your customer data, you may want to identify duplicate records for the same customer; or, if you are trying to eliminate duplicate marketing pieces going to the same address, you may want to identify records of customers that live in the same household. Deduplication Deduplication identifies records that represent one entity but for one reason or another were entered into the system multiple times, sometimes with slightly different data. For example, your system may contain vendor information from different departments in your organization, with each department using a different vendor ID for the same vendor. Using Spectrum™ Technology Platform you can consolidate these records into a single record for each vendor. Review of Exception Records In some cases you may have data that cannot be confidently processed automatically and that must be reviewed by a knowledgeable data steward. Some examples of records that may require manual review include: • Address verification failures Spectrum™ Technology Platform 2019.1.0 Data Quality Guide 5 Getting Started • Geocoding failures • Low-confidence matches • Merge/consolidation decisions The Business Steward Module is a set of features that allow you to identify and resolve exception records. Spectrum™ Technology Platform 2019.1.0 Data Quality Guide 6 2 - Parsing In this section Introduction to Parsing 8 Defining Domain-Independent Parsing Grammars in Dataflows 9 Culture-Specific Parsing 10 Analyzing Parsing Results 36 Parsing Personal Names 40 Dataflow Templates for Parsing 41 Parsing Introduction to Parsing Parsing is the process of analyzing a sequence of input characters in a field and breaking it up into multiple fields. For example, you might have a field called Name which contains the value "John A. Smith" and through parsing, you can break it up so that you have a FirstName field containing "John", a MiddleName field containing "A" and a LastName field containing "Smith." To create a dataflow that parses, use the Open Parser stage. Open Parser allows you to write parsing rules called grammars. A grammar is a set of expressions that map a sequence of characters to a set of named entities called domain patterns. A domain pattern is a sequence of one or more tokens in your input data that you want to represent as a data structure, such as name, address, or account numbers. A domain pattern can consist of any number of tokens that can be parsed from your input data. A domain pattern is represented in the parsing grammar as the <root> expression. Input data often contains such tokens in hard-to-use or mixed formats. For example: • Your input data contains names in a single field that you want to separate into given name and family name. • Your input data contains addresses from several cultures and you want to extract address data for a specific culture only. • Your input data includes free-form text that contains embedded email addresses and you want to extract email addresses and match them up with personal data and store them in a database. There are two kinds of grammars: culture specific and domain independent. A culture-specific parsing grammar is associated with a culture and/or language (such as English, Canadian English, Spanish, Mexican Spanish, and so on) and a particular type of data (phone numbers, personal names, and so on). When an Open Parser stage is configured to perform culture-specific parsing, each culture's parsing grammar is applied to each record. The grammar with the best parser score (or the first one to have a score of 100) is the one whose results are returned. Alternatively, culture-specific parsing grammars can use the value in the input record's CultureCode field and process the data according to the culture settings contained in the culture's parsing grammar. Culture-specific parsing grammars can inherit properties from a parent. A domain-independent parsing grammar is not associated with either a language or a particular type of data. Domain-independent parsing grammars do not inherit properties from a parent and ignore any CultureCode information in the input data. Open Parser analyzes a sequence of characters in input fields and categorizes them into a sequence of tokens through a process called tokenization. Tokenization is the process of delimiting and classifying sections of a string of input characters into a set of tokens based on separator characters (also called tokenizing characters), such as space, hyphen, and others. The tokens are then placed into output fields you specify. The following diagram illustrates the process of creating a parsing grammar: Spectrum™ Technology Platform 2019.1.0 Data Quality Guide 8 Parsing Defining Domain-Independent Parsing Grammars in Dataflows To define domain-independent parsing grammars in a dataflow: 1. In Enterprise Designer, add an Open Parser stage to your dataflow. 2. Double-click the Open parser stage on the canvas. 3. Click Define Domain Independent Grammar on the Rules tab. 4. Use the Grammar Editor to create the grammar rules. You can type commands and variables into the text box or use the commands provided in the Commands tab. For more information, see Grammars on page 27. Spectrum™ Technology Platform 2019.1.0 Data Quality Guide 9 Parsing 5. To cut, copy, paste, and find and replace text strings in your parsing grammar, right-click in the Grammar Editor and select the appropriate command. 6. To check the parsing grammar you have created, click Validate. The validate feature lists any errors in your grammar syntax, including the line and column where the error occurs, a description of the error, and the command name or value that is causing the error. 7. Click the Preview tab to test the parsing grammar. 8. When you are finished creating your parsing grammar, click OK. Culture-Specific Parsing Defining a Culture-Specific Parsing Grammar A culture-specific parsing grammar allows you to specify different parsing rules for different languages and cultures. This allows you to parse data from different countries in a single Open Parser stage, for example phone numbers from the United States and phone numbers from the United Kingdom. By default, each input record is parsed using each culture's parsing grammar, in the order specified in the Open Parser stage. You can also add a CultureCode field to the input records if you want a specific culture's parsing grammar to be used for that record. For more information, see Assigning a Parsing Culture to a Record on page 12. Note: If you want to create a domain-independent parsing grammar, see Defining Domain-Independent Parsing Grammars in Dataflows on page 9. 1. In Enterprise Designer, go to Tools > Open Parser Domain Editor.